# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [91]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [92]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

url = f"{base_url}/page/{1}/?sortby=post_date%3ADesc&pagesize={page_size}"

# Collect HTML data from this page
response = requests.get(url)

# Parse content
content = response.content
parsed_content = BeautifulSoup(content, 'html.parser')
    




In [93]:
def safe_find_all_stars(tag, clas):
    f_child = tag.find('td', class_ = clas)
    return len(f_child.find_next_sibling().find_all('span', class_ = 'star fill')) if f_child else pd.NA

def safe_find_next_sibling(tag, clas):
    f_child = tag.find('td', class_ = clas)
    return f_child.find_next_sibling().get_text() if f_child else pd.NA


columns = ['name', 'header', 'time', 'text_reviow', 'Aircraft', 'Traveller', 'Seat_Type', 'Route',
           'Date_Flown', 'star_reting_aircraft', 'star_reting_Food', 'star_reting_Inflight', 'star_reting_Ground_Service',
           'star_reting_Wifi', 'star_reting_Value_For_Money', 'recommended']

review_ration_catagory = ['review-rating-header aircraft', 'review-rating-header type_of_traveller', 'review-rating-header cabin_flown', 'review-rating-header route', 'review-rating-header date_flown']
review_ration_stars = ['review-rating-header cabin_staff_service', 'review-rating-header food_and_beverages', 
'review-rating-header inflight_entertainment', 'review-rating-header ground_service', 'review-rating-header wifi_and_connectivity', 'review-rating-header value_for_money']

reviews = pd.DataFrame(columns= columns)


bodys = parsed_content.find_all("div", {"class": "body"})
for i, body in enumerate(bodys):
    row = []
    row.append(body.find('span', itemprop = 'name').get_text())
    row.append(body.find('h2', class_ = 'text_header').get_text())
    row.append(body.find('time', itemprop = 'datePublished').get('datetime'))
    row.append(body.find('div', class_ = 'text_content').get_text())
    
    for j in review_ration_catagory:
        row.append(safe_find_next_sibling(body, j))

    for j in review_ration_stars:
        row.append(safe_find_all_stars(body, j))
    
    row.append(body.find('td', class_ = 'review-rating-header recommended').find_next_sibling('td').get_text())
    
    reviews.loc[len(reviews)] = row # type: ignore

In [94]:
reviews.head(1)

Unnamed: 0,name,header,time,text_reviow,Aircraft,Traveller,Seat_Type,Route,Date_Flown,star_reting_aircraft,star_reting_Food,star_reting_Inflight,star_reting_Ground_Service,star_reting_Wifi,star_reting_Value_For_Money,recommended
0,S Brydon,"""Great customer service""",2023-08-19,✅ Trip Verified | My family flew from Washing...,A380,Family Leisure,Economy Class,Washington to London,August 2023,5,3,2,1,,4,yes


### Chacking noumbers of route present in the data

In [95]:
reviews['Route'].nunique()

93

Note: found 93 out of 100
### deviding Route in to two columns "From" and "To"

In [96]:

split_route = reviews['Route'].str.split(' to ', expand=True)
reviews.insert(7, 'From', split_route[0])
reviews.insert(8, 'To', split_route[1])

In [97]:
def extract_verification(text):
    if '✅ Trip Verified | ' in text:
        return 1, text.replace('✅ Trip Verified | ', '')
    elif 'Not Verified | ' in text:
        return 0, text.replace('Not Verified | ', '')
    else:
        return None, text

reviews.insert(3, 'Verified', reviews['text_reviow'].apply(lambda x: extract_verification(x)[0]))
reviews['text_reviow'] = reviews['text_reviow'].apply(lambda x: extract_verification(x)[0])

In [98]:
reviews.head(2)

Unnamed: 0,name,header,time,Verified,text_reviow,Aircraft,Traveller,Seat_Type,From,To,Route,Date_Flown,star_reting_aircraft,star_reting_Food,star_reting_Inflight,star_reting_Ground_Service,star_reting_Wifi,star_reting_Value_For_Money,recommended
0,S Brydon,"""Great customer service""",2023-08-19,1.0,1.0,A380,Family Leisure,Economy Class,Washington,London,Washington to London,August 2023,5,3,2,1,,4,yes
1,E Smyth,"""Cabin crew were all fantastic""",2023-08-13,1.0,1.0,A380,Family Leisure,Business Class,London,Miami,London to Miami,August 2023,5,5,5,4,5.0,4,yes


In [108]:
# ---
stars_columns = ['star_reting_aircraft',	'star_reting_Food', 'star_reting_Inflight',	'star_reting_Ground_Service',	'star_reting_Wifi',	'star_reting_Value_For_Money']
mean_of_stars_verified = reviews.loc[reviews['Verified']==1, stars_columns].mean()
mean_of_stars_not_verified = reviews.loc[reviews['Verified']==0, stars_columns].mean()
mean_of_stars = reviews.loc[:, stars_columns].mean()
print('------------ Verified mean of stars ------------\n',
    verified_mean_of_stars, 
    '\n------------ Non-verified mean of stars ------------\n',
    mean_of_stars_not_verified, 
    '\n------------ Over all mean of stars ------------\n',
    mean_of_stars)


------------ Verified mean of stars ------------
 star_reting_aircraft           2.924242
star_reting_Food               2.471698
star_reting_Inflight           2.585366
star_reting_Ground_Service          2.0
star_reting_Wifi               1.964286
star_reting_Value_For_Money    1.902778
dtype: object 
------------ Non-verified mean of stars ------------
 star_reting_aircraft               2.76
star_reting_Food               1.954545
star_reting_Inflight                2.4
star_reting_Ground_Service         2.08
star_reting_Wifi               1.285714
star_reting_Value_For_Money    1.576923
dtype: object 
------------ Over all mean of stars ------------
 star_reting_aircraft           2.870968
star_reting_Food               2.311688
star_reting_Inflight           2.508772
star_reting_Ground_Service          2.0
star_reting_Wifi                1.72093
star_reting_Value_For_Money         1.8
dtype: object


Another patren is found that there batter review from the people verified then people are not, also over all the reviews are avarige 

In [None]:
# df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loops above collected 1000 reviews by iterating through the paginated pages on the website. However, if you want to collect more data, try increasing the number of pages!

 The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.