# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 200

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 200 total reviews
Scraping page 2
   ---> 400 total reviews
Scraping page 3
   ---> 600 total reviews
Scraping page 4
   ---> 800 total reviews
Scraping page 5
   ---> 1000 total reviews
Scraping page 6
   ---> 1200 total reviews
Scraping page 7
   ---> 1400 total reviews
Scraping page 8
   ---> 1600 total reviews
Scraping page 9
   ---> 1800 total reviews
Scraping page 10
   ---> 2000 total reviews

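To see what the `find_all` call in the loop is doing without hitting the website, the extraction step can be tried on a small inline HTML string. The sketch below is illustrative only: `sample_html` and `extract_reviews` are made-up names, and the markup is a simplified assumption about how the review pages are structured, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one page of reviews.
sample_html = """
<article>
  <div class="text_content">✅ Trip Verified | Smooth flight.</div>
  <div class="other">Ignored sidebar text</div>
  <div class="text_content">Not Verified | Long delay.</div>
</article>
"""


def extract_reviews(html):
    """Return the text of every <div class="text_content"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_="text_content")
    # strip=True trims the whitespace around each review's text.
    return [div.get_text(strip=True) for div in divs]


for review in extract_reviews(sample_html):
    print(review)
```

In the scraping loop, `response.content` would take the place of `sample_html`.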

In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | A short BA euro trip and thi...
1,Not Verified | We are flying Business class f...
2,✅ Trip Verified | I am in Australia and on Fr...
3,✅ Trip Verified | At 7.54 am on the day of tr...
4,✅ Trip Verified | Would happily fly them agai...


In [4]:
df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 2000 reviews (10 pages of 200 reviews each) by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!
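As a rough sketch of how the page URLs scale with the number of pages, the helper below builds the same URL pattern used in the scraping cell. The name `build_page_urls` and the value of 15 pages are assumptions for illustration; whether the site actually has that many pages is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_urls(base_url, pages, page_size):
    """Return one listing URL per page, sorted by most recent post date."""
    # urlencode percent-encodes the ":" in "post_date:Desc" as "%3A".
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return [f"{base_url}/page/{page}/?{query}" for page in range(1, pages + 1)]


urls = build_page_urls(BASE_URL, pages=15, page_size=200)
print(len(urls))
print(urls[0])
```

With 200 reviews per page, each extra page adds up to 200 more reviews to the dataset.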

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.

In [5]:
import pandas as pd 

In [6]:
df1 = pd.read_csv("data/BA_reviews.csv")

In [7]:
df1

Unnamed: 0.1,Unnamed: 0,reviews
0,0,✅ Trip Verified | A short BA euro trip and thi...
1,1,Not Verified | We are flying Business class f...
2,2,✅ Trip Verified | I am in Australia and on Fr...
3,3,✅ Trip Verified | At 7.54 am on the day of tr...
4,4,✅ Trip Verified | Would happily fly them agai...
...,...,...
1995,1995,✅ Verified Review | Flew Marseille to London....
1996,1996,✅ Verified Review | Los Angeles to Rome via L...
1997,1997,✅ Verified Review | My wife and I used Avios ...
1998,1998,✅ Verified Review | Flew British Airways from...


In [8]:
del df1['Unnamed: 0']
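The extra `Unnamed: 0` column appears because `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as an ordinary column. The small round trip below, using an in-memory buffer and a made-up two-row frame rather than the real file, is one way to see the effect and how `index=False` avoids it.

```python
import io

import pandas as pd

# Made-up stand-in for the scraped reviews.
frame = pd.DataFrame({"reviews": ["first review", "second review"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so no extra column comes back.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Another option is to keep the file as it is and read it with `pd.read_csv("data/BA_reviews.csv", index_col=0)`, which treats the first column as the index instead of as data.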


In [9]:
df1

Unnamed: 0,reviews
0,✅ Trip Verified | A short BA euro trip and thi...
1,Not Verified | We are flying Business class f...
2,✅ Trip Verified | I am in Australia and on Fr...
3,✅ Trip Verified | At 7.54 am on the day of tr...
4,✅ Trip Verified | Would happily fly them agai...
...,...
1995,✅ Verified Review | Flew Marseille to London....
1996,✅ Verified Review | Los Angeles to Rome via L...
1997,✅ Verified Review | My wife and I used Avios ...
1998,✅ Verified Review | Flew British Airways from...


In [10]:
df.head(20)

Unnamed: 0,reviews
0,✅ Trip Verified | A short BA euro trip and thi...
1,Not Verified | We are flying Business class f...
2,✅ Trip Verified | I am in Australia and on Fr...
3,✅ Trip Verified | At 7.54 am on the day of tr...
4,✅ Trip Verified | Would happily fly them agai...
5,"Not Verified | Flew premium, only worth the e..."
6,✅ Trip Verified | First our morning flight wa...
7,✅ Trip Verified | Although it was a bit uncom...
8,✅ Trip Verified | Boarding was decently organ...
9,✅ Trip Verified | Boarding on time and departu...


In [11]:
df1.tail(20)

Unnamed: 0,reviews
1980,British Airways standards have dropped dramati...
1981,✅ Verified Review | Flew London Heathrow to B...
1982,Refusing to pay £73 pp to reserve seats in Bus...
1983,Having flown on an Emirates and Thai A380 I ha...
1984,Toronto to Entebbe via London with British Air...
1985,✅ Verified Review | Absolute appalling servic...
1986,✅ Verified Review | Cannot believe how bad Br...
1987,Zurich to London Heathrow with British Airways...
1988,✅ Verified Review | First class flight from Lo...
1989,✅ Verified Review | \r\nFlying in Club Europe...


In [12]:
df1['reviews'][1]

"Not Verified |  We are flying Business class for most of our flight and then Premium economy for the balance. In addition to the plane tickets we paid an additional $225/pp for our seats. Now BA is changing planes, they arbitrarily put us in separate seating areas (my wife & I) when we were sitting together before and they want to charge one of us additional $$ to be re-seated next to each other.  They moved our seats away from each other and we shouldn't have to pay for their change of planes and their decision to not have us sitting together! We haven't even flown yet, this airline is horrible."

In [13]:
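# Removing unnecessary text
# Note: the fixed 17-character slice matches "✅ Trip Verified |" exactly, but
# "✅ Verified Review |" is two characters longer, so a leading " |" is left behind.
for i in range(len(df1)):
    review = df1.loc[i, 'reviews']
    if 'Trip Verified' in review or 'Verified Review' in review:
        # .loc avoids pandas' chained-assignment warning
        df1.loc[i, 'reviews'] = review[17:]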

In [14]:
df1

Unnamed: 0,reviews
0,A short BA euro trip and this is where BA exc...
1,Not Verified | We are flying Business class f...
2,"I am in Australia and on Friday night, went ..."
3,At 7.54 am on the day of travel whilst drivi...
4,Would happily fly them again. I had a person...
...,...
1995,| Flew Marseille to London. Absolutely terri...
1996,| Los Angeles to Rome via London with Britis...
1997,| My wife and I used Avios to get two return...
1998,| Flew British Airways from London Heathrow ...
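The rows near the end still begin with a stray `|` because "✅ Verified Review |" is longer than the 17 characters that were sliced off. One possible alternative, sketched here with a regular expression, strips the marker whatever its length and also reports which marker was found. The names `PREFIX` and `split_marker` are hypothetical, and the sample strings are shortened versions of rows shown in this notebook.

```python
import re

# Optional emoji, one of the three markers seen in the data, then the pipe.
PREFIX = re.compile(
    r"^\s*[✅❎]?\s*(Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)


def split_marker(review):
    """Return (marker, body); marker is None when the review has no prefix."""
    match = PREFIX.match(review)
    if match is None:
        return None, review.strip()
    return match.group(1), review[match.end():].strip()


samples = [
    "✅ Trip Verified |  A short BA euro trip",
    "✅ Verified Review |  Flew Marseille to London.",
    "❎ Not Verified | Madrid to London Heathrow",
    "Zurich to London Heathrow with British Airways",
]
for text in samples:
    print(split_marker(text))
```

Applied to the raw column, something like `df1['reviews'].apply(lambda r: split_marker(r)[1])` would return the cleaned text. Note that this version also strips "Not Verified", so the marker should be stored in its own column first if it is needed later.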


In [15]:
"""
Check for missing values. If any are found, there are two common options:
fill them (for numeric columns, e.g. with the mean: df = df.fillna(df.mean())),
or remove the rows that contain them (df = df.dropna()).
For a text column such as reviews, removing the rows is the practical choice.

"""
df1.isnull().sum()

reviews    0
dtype: int64

In [16]:
# Check for duplicate rows; if any are found, df.drop_duplicates() removes them.
df1.duplicated().sum()

0
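Both checks come back as zero on this dataset, so the clean-up calls mentioned in the comments are never exercised. The toy frame below (made-up data, not from the scraped reviews) is a small sketch of what `dropna()` and `drop_duplicates()` would do if there were something to remove.

```python
import pandas as pd

# Made-up reviews: one missing value and one exact duplicate.
toy = pd.DataFrame(
    {"reviews": ["Great crew", None, "Great crew", "Late departure"]}
)

missing = int(toy["reviews"].isnull().sum())
duplicates = int(toy.duplicated().sum())

# Drop the missing row first, then the repeated one, and renumber the index.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print(missing, duplicates)
print(cleaned)
```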

In [17]:
df1['reviews'][1995]

" |  Flew Marseille to London. Absolutely terrible service. Flight delayed four hours. Check-in staff rude and arrogant and didn't explain the delay. Plane was not that clean. The one consolation was the pilot coming out of the cockpit to apologise about the delay and offered the opportunity for passengers to view the inside of the cockpit. For the money they charge you can get better service elsewhere. I would recommend using other airlines personally."

In [18]:
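# Count reviews that do not carry the "Not Verified" marker.
# Note: this also counts reviews that have no verification marker at all.
v_count = 0
for i in range(len(df1)):
    if 'Not Verified' not in df1['reviews'][i]:
        v_count += 1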

In [19]:
print("Number of verified reviews:", v_count)

Number of verified reviews: 1824


In [20]:
# Count reviews that carry the "Not Verified" marker.
nv_count = 0
for i in range(len(df1)):
    if 'Not Verified' in df1['reviews'][i]:
        nv_count += 1

In [21]:
print("Number of not verified reviews:", nv_count)

Number of not verified reviews: 176
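The two counts above split the reviews into "contains Not Verified" and "everything else", so reviews with no marker at all end up in the verified total. A three-way tally is one way to separate them. This is a sketch with hypothetical names (`verification_status`, `sample`); it assumes it is run on the raw review text, before the "Trip Verified" and "Verified Review" prefixes are stripped.

```python
from collections import Counter


def verification_status(review):
    """Classify a raw review by the marker it carries, if any."""
    # Check "Not Verified" first so it is never mistaken for a verified marker.
    if "Not Verified" in review:
        return "not verified"
    if "Trip Verified" in review or "Verified Review" in review:
        return "verified"
    return "unmarked"


# Shortened versions of rows shown in this notebook.
sample = [
    "✅ Trip Verified | A short BA euro trip",
    "Not Verified | We are flying Business class",
    "✅ Verified Review | Flew Marseille to London.",
    "Zurich to London Heathrow with British Airways",
]

counts = Counter(verification_status(review) for review in sample)
print(counts)
```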


In [22]:
result = df1[df1['reviews'].str.contains('Not Verified')]

In [23]:
result

Unnamed: 0,reviews
1,Not Verified | We are flying Business class f...
5,"Not Verified | Flew premium, only worth the e..."
28,Not Verified | It seems that there is a race t...
29,Not Verified | As a Spanish born individual l...
33,Not Verified | I find BA incredibly tacky and...
...,...
963,Not Verified | The overall flight wasn't too ...
967,Not Verified | First time flying with British...
1172,❎ Not Verified | Madrid to London Heathrow on...
1173,❎ Not Verified | Flew British Airways from Bo...


In [51]:
import matplotlib.pyplot as plt
import seaborn as sns