# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you will see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use Python and `BeautifulSoup` to collect the review text and the accompanying details from each of the paginated listing pages.
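Before pointing the scraper at the live site, it can help to see how `BeautifulSoup` selects elements by class and attribute. The sketch below parses a small inline HTML string; the markup is a hypothetical, simplified stand-in for the real page, which may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page structure may differ.
SAMPLE_HTML = """
<article>
  <div class="rating-10"><span itemprop="ratingValue">8</span>/10</div>
  <div class="text_content">Not Verified | Smooth flight and friendly crew.</div>
</article>
<article>
  <div class="rating-10"><span itemprop="ratingValue">2</span>/10</div>
  <div class="text_content">Not Verified | Long delay with no updates.</div>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select every review body by its CSS class
sample_reviews = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="text_content")
]

# Select every rating by its itemprop attribute
sample_ratings = [
    span.get_text(strip=True)
    for span in soup.find_all("span", itemprop="ratingValue")
]

print(sample_reviews)
print(sample_ratings)
```

The same `find_all` calls are what the scraping cells rely on, only applied to the HTML returned by `requests.get`.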

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [None]:
#create an empty list to collect review text
reviews = []

#create an empty list to collect rating stars
stars = []

#create an empty list to collect date
date = []

#create an empty list to collect country the reviewer is from
country = []

In [None]:
for i in range(1, 36):
    page = requests.get(f"https://www.airlinequality.com/airline-reviews/british-airways/page/{i}/?sortby=post_date%3ADesc&pagesize=100")
    
    soup = BeautifulSoup(page.content, "html5")
    
    for item in soup.find_all("div", class_="text_content"):
        reviews.append(item.text)
    
    for item in soup.find_all("div", class_ = "rating-10"):
        try:
            stars.append(item.span.text)
        except AttributeError:
            # item.span is None when the rating block has no <span>
            print(f"Error on page {i}")
            stars.append("None")
            
    #date
    for item in soup.find_all("time"):
        date.append(item.text)
        
    #country
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

In [14]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 38
page_size = 100

# Lists to hold the reviews and corresponding details
reviews = []
ratings = []
aircrafts = []
travel_types = []
seat_types = []
routes = []
dates_flown = []
seat_comforts = []
staff_services = []
foods_beverages = []
ground_services = []
value_for_money = []
recommendations = []
inflight_ent = []
wifi = []

# Loop over every paginated results page
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Scrape review text
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    # Scrape ratings
    for rating in parsed_content.find_all("div", {"class": "rating-10"}):
        rating_value = rating.find("span", {"itemprop": "ratingValue"})
        if rating_value:
            ratings.append(rating_value.get_text())
        else:
            ratings.append("None")  # Handle case where rating is missing
    
    # Extract additional details from the review-stats section
    for stats in parsed_content.find_all("div", {"class": "review-stats"}):
        # Aircraft
        aircraft = stats.find("td", {"class": "aircraft"})
        aircrafts.append(aircraft.find_next_sibling("td").get_text().strip() if aircraft else "None")
        
        # Type of Traveller
        traveller_type = stats.find("td", {"class": "type_of_traveller"})
        travel_types.append(traveller_type.find_next_sibling("td").get_text().strip() if traveller_type else "None")
        
        # Seat Type
        seat_type = stats.find("td", {"class": "cabin_flown"})
        seat_types.append(seat_type.find_next_sibling("td").get_text().strip() if seat_type else "None")

        # Inflight Entertainment
        inflight = stats.find("td", {"class": "inflight_entertainment"})
        inflight_ent.append(len(inflight.find_next_sibling("td").find_all("span", {"class": "star fill"})) if inflight else "None")

        # Wifi & Connectivity
        wifi_data = stats.find("td", {"class": "wifi_and_connectivity"})
        wifi.append(len(wifi_data.find_next_sibling("td").find_all("span", {"class": "star fill"})) if wifi_data else "None")

        
        # Route
        route = stats.find("td", {"class": "route"})
        routes.append(route.find_next_sibling("td").get_text().strip() if route else "None")
        
        # Date Flown
        date_flown = stats.find("td", {"class": "date_flown"})
        dates_flown.append(date_flown.find_next_sibling("td").get_text().strip() if date_flown else "None")
        
        # Seat Comfort
        seat_comfort = stats.find("td", {"class": "seat_comfort"})
        seat_comforts.append(len(seat_comfort.find_next_sibling("td").find_all("span", {"class": "star fill"})) if seat_comfort else "None")
        
        # Cabin Staff Service
        cabin_service = stats.find("td", {"class": "cabin_staff_service"})
        staff_services.append(len(cabin_service.find_next_sibling("td").find_all("span", {"class": "star fill"})) if cabin_service else "None")
        
        # Food & Beverages
        food_beverage = stats.find("td", {"class": "food_and_beverages"})
        foods_beverages.append(len(food_beverage.find_next_sibling("td").find_all("span", {"class": "star fill"})) if food_beverage else "None")
        
        # Ground Service
        ground_service = stats.find("td", {"class": "ground_service"})
        ground_services.append(len(ground_service.find_next_sibling("td").find_all("span", {"class": "star fill"})) if ground_service else "None")
        
        # Value For Money
        value_money = stats.find("td", {"class": "value_for_money"})
        value_for_money.append(len(value_money.find_next_sibling("td").find_all("span", {"class": "star fill"})) if value_money else "None")
        
        # Recommended
        recommended = stats.find("td", {"class": "recommended"})
        recommendations.append(recommended.find_next_sibling("td").get_text().strip() if recommended else "None")
    
    
    
    print(f"   ---> {len(reviews)} total reviews, {len(ratings)} total ratings")

Scraping page 1
   ---> 100 total reviews, 101 total ratings
Scraping page 2
   ---> 200 total reviews, 202 total ratings
Scraping page 3
   ---> 300 total reviews, 303 total ratings
Scraping page 4
   ---> 400 total reviews, 404 total ratings
Scraping page 5
   ---> 500 total reviews, 505 total ratings
Scraping page 6
   ---> 600 total reviews, 606 total ratings
Scraping page 7
   ---> 700 total reviews, 707 total ratings
Scraping page 8
   ---> 800 total reviews, 808 total ratings
Scraping page 9
   ---> 900 total reviews, 909 total ratings
Scraping page 10
   ---> 1000 total reviews, 1010 total ratings
Scraping page 11
   ---> 1100 total reviews, 1111 total ratings
Scraping page 12
   ---> 1200 total reviews, 1212 total ratings
Scraping page 13
   ---> 1300 total reviews, 1313 total ratings
Scraping page 14
   ---> 1400 total reviews, 1414 total ratings
Scraping page 15
   ---> 1500 total reviews, 1515 total ratings
Scraping page 16
   ---> 1600 total reviews, 1616 total ratings
...

In [9]:
ratings

['\n\t\t\t\t\t\t\t\t\t\t\t\t5',
 '1',
 '8',
 '1',
 '8',
 '1',
 '1',
 '4',
 '2',
 '8',
 '1',
 '1',
 '1',
 '9',
 '1',
 '2',
 '10',
 '1',
 '5',
 '4',
 '2',
 '1',
 '8',
 '8',
 '3',
 '1',
 '2',
 '2',
 '8',
 '10',
 '1',
 '1',
 '1',
 '10',
 '1',
 '3',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '6',
 '3',
 '5',
 '9',
 '2',
 '8',
 '7',
 '1',
 '8',
 '9',
 '1',
 '1',
 '5',
 '1',
 '9',
 '2',
 '1',
 '1',
 '2',
 '3',
 '3',
 '9',
 '3',
 '2',
 '1',
 '4',
 '2',
 '6',
 '1',
 '1',
 '1',
 '1',
 '1',
 '6',
 '1',
 '1',
 '10',
 '7',
 '3',
 '4',
 '8',
 '1',
 '8',
 '5',
 '10',
 '9',
 '3',
 '8',
 '8',
 '3',
 '3',
 '7',
 '3',
 '1',
 '7',
 '1',
 '3',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t5',
 '1',
 '1',
 '3',
 '10',
 '1',
 '3',
 '2',
 '1',
 '10',
 '3',
 '3',
 '2',
 '5',
 '1',
 '8',
 '1',
 '6',
 '9',
 '8',
 '3',
 '2',
 '1',
 '6',
 '3',
 '2',
 '1',
 '1',
 '1',
 '4',
 '9',
 '1',
 '9',
 '6',
 '1',
 '8',
 '6',
 '2',
 '5',
 '3',
 '6',
 '10',
 '1',
 '1',
 '6',
 '6',
 '4',
 '1',
 '8',
 '3',
 '4',
 '9',
 '9',
 '10',
 '1',
 '1',
 ...

In [10]:
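# Each page yields 101 ratings for 100 reviews: the first "rating-10" block
# (the whitespace-padded '5') appears to be a page-level score, so drop it.
cleaned_list = [item for item in ratings if item != '\n\t\t\t\t\t\t\t\t\t\t\t\t5']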
cleaned_list

['1',
 '8',
 '1',
 '8',
 '1',
 '1',
 '4',
 '2',
 '8',
 '1',
 '1',
 '1',
 '9',
 '1',
 '2',
 '10',
 '1',
 '5',
 '4',
 '2',
 '1',
 '8',
 '8',
 '3',
 '1',
 '2',
 '2',
 '8',
 '10',
 '1',
 '1',
 '1',
 '10',
 '1',
 '3',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '6',
 '3',
 '5',
 '9',
 '2',
 '8',
 '7',
 '1',
 '8',
 '9',
 '1',
 '1',
 '5',
 '1',
 '9',
 '2',
 '1',
 '1',
 '2',
 '3',
 '3',
 '9',
 '3',
 '2',
 '1',
 '4',
 '2',
 '6',
 '1',
 '1',
 '1',
 '1',
 '1',
 '6',
 '1',
 '1',
 '10',
 '7',
 '3',
 '4',
 '8',
 '1',
 '8',
 '5',
 '10',
 '9',
 '3',
 '8',
 '8',
 '3',
 '3',
 '7',
 '3',
 '1',
 '7',
 '1',
 '3',
 '1',
 '1',
 '3',
 '10',
 '1',
 '3',
 '2',
 '1',
 '10',
 '3',
 '3',
 '2',
 '5',
 '1',
 '8',
 '1',
 '6',
 '9',
 '8',
 '3',
 '2',
 '1',
 '6',
 '3',
 '2',
 '1',
 '1',
 '1',
 '4',
 '9',
 '1',
 '9',
 '6',
 '1',
 '8',
 '6',
 '2',
 '5',
 '3',
 '6',
 '10',
 '1',
 '1',
 '6',
 '6',
 '4',
 '1',
 '8',
 '3',
 '4',
 '9',
 '9',
 '10',
 '1',
 '1',
 '7',
 '1',
 '5',
 '1',
 '9',
 '1',
 '3',
 '8',
 '1',
 '1',
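
A sketch of a more general version of this filter, one that does not depend on the exact value of the extra entry. It assumes, as the output above suggests, that the unwanted entry is the only one padded with whitespace. `drop_padded_ratings` and the sample list are hypothetical.

```python
def drop_padded_ratings(raw_ratings):
    """Keep only rating strings that carry no surrounding whitespace."""
    return [r for r in raw_ratings if r == r.strip()]


# Hypothetical sample mimicking the scraped list: each page starts with a
# whitespace-padded entry followed by the individual review ratings.
sample = ["\n\t\t\t5", "1", "8", "\n\t\t\t5", "10"]

cleaned = drop_padded_ratings(sample)
cleaned_as_int = [int(r) for r in cleaned]  # numeric form for later analysis

print(cleaned)         # ['1', '8', '10']
print(cleaned_as_int)  # [1, 8, 10]
```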
 

In [11]:
df = pd.DataFrame()
df["reviews"] = reviews
df["ratings"] = cleaned_list
df["Aircraft"] = aircrafts
df["Type Of Traveller"] = travel_types
df["Seat Type"] = seat_types
df["Routes"] = routes
df["Date of flying"] = dates_flown
df["Seat Comfort Ratings"] = seat_comforts
df["Staff Ratings"] = staff_services
df["Food Ratings"] = foods_beverages
df["Ground Services Ratings"] = ground_services
df["Value for Money Ratings"] = value_for_money
df["Recomendation"] = recommendations
df["Inflight Entertainment"] = inflight_ent
df["Wifi & Connectivity"] = wifi
df.head()

Unnamed: 0,reviews,ratings,Aircraft,Type Of Traveller,Seat Type,Routes,Date of flying,Seat Comfort Ratings,Staff Ratings,Food Ratings,Ground Services Ratings,Value for Money Ratings,Recomendation,Inflight Entertainment,Wifi & Connectivity
0,Not Verified | My wife and I are very disappo...,1,,Family Leisure,Economy Class,Amsterdam to Pittsburgh via London,September 2024,2.0,2.0,2.0,1,1,no,2.0,2.0
1,Not Verified | We flew BA between Heathrow an...,8,A321,Couple Leisure,Economy Class,Heathrow to Berlin,July 2024,3.0,4.0,3.0,4,3,yes,,
2,Not Verified | Absolutely disgusted with BA. ...,1,,Couple Leisure,Economy Class,Manchester to Seattle via London,May 2024,,,,1,1,no,,
3,Not Verified | Took a trip to Nashville with m...,8,Boeing 777-2OOLR,Couple Leisure,Business Class,London Heathrow to Nashville,August 2024,4.0,4.0,4.0,2,3,yes,2.0,
4,Not Verified | A nightmare journey courtesy o...,1,A319 / A321NEO,Couple Leisure,Economy Class,London to Venice,September 2024,2.0,3.0,,1,1,no,,
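
Assigning lists to DataFrame columns one by one, as in the cell above, only works when every list has the same length; pandas raises a `ValueError` otherwise. A minimal sketch of a check that could be run before building the DataFrame; `find_length_mismatches` and the sample data are hypothetical.

```python
def find_length_mismatches(columns):
    """Return {name: length} for columns whose length differs from the first one."""
    lengths = {name: len(values) for name, values in columns.items()}
    expected = next(iter(lengths.values()), 0)
    return {name: n for name, n in lengths.items() if n != expected}


# Hypothetical sample: "ratings" has one entry too many
sample_columns = {
    "reviews": ["a", "b", "c"],
    "ratings": ["5", "1", "8", "1"],
    "routes": ["x", "y", "z"],
}

mismatches = find_length_mismatches(sample_columns)
print(mismatches)  # {'ratings': 4}
```

An empty result means all lists line up and can be combined safely.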


In [11]:
print(df.iloc[0])

reviews    Not Verified |  My wife and I are very disappo...
Name: 0, dtype: object


In [39]:
# df.to_csv("data/BA_reviews.csv")

In [13]:
df.to_csv("data/BA_reviews(new).csv")

Congratulations! You now have your dataset for this task. The loop above iterates through the paginated results on the website and collects up to `pages * page_size` reviews (38 pages of 100 reviews each, so up to 3,800). If you want to collect more data, try increasing the number of pages.
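
One way to make the page count easy to change is to build the page URLs in a small helper. A minimal sketch using only the standard library; `build_page_url` is a hypothetical helper, and its query parameters mirror the ones used in the scraping loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Build the URL of one paginated results page, newest reviews first."""
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


# URLs for the first three pages
urls = [build_page_url(i) for i in range(1, 4)]
for url in urls:
    print(url)
```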

The next step is to clean this data by removing any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it's not relevant to what we want to investigate.
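
A minimal sketch of that cleaning step using a regular expression. It assumes the verification label sits at the start of the review and is followed by a `|` separator, as in the "Not Verified |" rows shown earlier. `strip_verification_prefix` and the sample strings are hypothetical.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then "|"
VERIFICATION_PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip|Not)\s+Verified\s*\|\s*")


def strip_verification_prefix(text):
    """Remove a leading verification label and surrounding whitespace."""
    return VERIFICATION_PREFIX.sub("", text).strip()


# Hypothetical sample reviews
samples = [
    "✅ Trip Verified |  Smooth flight and friendly crew.",
    "Not Verified | Long delay with no updates.",
    "No prefix on this review.",
]

cleaned = [strip_verification_prefix(s) for s in samples]
for line in cleaned:
    print(line)
```

With the DataFrame built earlier, the same function could be applied to the whole column with `df["reviews"].apply(strip_verification_prefix)`.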