# British Airways Data Science Problem

## Web Scraping

This notebook details the data extraction/collection portion of the project. I used the BeautifulSoup to web scrape data from the website "https://www.airlinequality.com/airline-reviews/british-airways" and toss it into a csv file.

Here are the necessary packages for this portion of the project.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

I wanted to scrape as much data as possible from the website so I decided to scrape 35 pages of the website. This includes the date of the review, the rating of the review, and the body text of the review. 

In [8]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 2
page_size = 100

reviews = []

# for i in range(1, pages + 1):
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for item in parsed_content.find_all("article", {"itemprop": "review"}):
        try:
            review = {
            'date': item.find('time', {'itemprop': 'datePublished'}).text.strip(),
            'rating':  float(item.find('span', {'itemprop': 'ratingValue'}).text.strip()),
            'traveler_type': item.find('td', {'class': 'review-value'}, {'class': 'review-rating-header type_of_traveller'}).text.replace('Type of Traveller', '').strip(),
            'seat_type': item.find('td', {'class': 'review-rating-header cabin_flown'}, '/tr').text.replace("Seat Type", '').strip(),
            'body': item.find('div', {'class': 'text_content'}).text.strip(),
            }
            reviews.append(review)
            ##reviews.append(para.get_text())
        
        except:
            pass
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews


All of the data should have parsed and fit into the "reviews" list that was created. Let's toss it into a pandas DataFrame and see if it worked properly.

In [9]:
df = pd.DataFrame(reviews)
df

Unnamed: 0,date,rating,traveler_type,seat_type,body
0,9th January 2023,4.0,Boeing 777-200,,✅ Trip Verified | Flew ATL to LHR 8th Jan 202...
1,8th January 2023,5.0,A380,,Not Verified | Great thing about British Airw...
2,6th January 2023,1.0,Family Leisure,,Not Verified | The staff are friendly. The pla...
3,2nd January 2023,1.0,"A320, A380",,✅ Trip Verified | Probably the worst business ...
4,2nd January 2023,2.0,Business,,"✅ Trip Verified | Definitely not recommended, ..."
...,...,...,...,...,...
195,28th February 2022,9.0,A350,,"✅ Trip Verified | Outstanding! From the warm,..."
196,26th February 2022,5.0,A320,,✅ Trip Verified | British Airways has scrappe...
197,21st February 2022,1.0,A322,,✅ Trip Verified | For once more BA got it all...
198,20th February 2022,1.0,Solo Leisure,,Not Verified | Another episode of unmitigated...


Final step is to place all the data into a csv file for the next portion of the project.

In [44]:
df.to_csv("/Users/afifmazhar/Desktop/Data Science/Data Science Projects/British_Airways_Data_Science/data/BA_reviews.csv")