

## Web scraping and analysis

In this notebook, I scrape British Airways customer reviews from Skytrax using the `requests` and `BeautifulSoup` packages.

Once the data has been collected, it will be saved to a `.csv` file.

### Scraping data from Skytrax

The reviews are available at this link: <https://www.airlinequality.com/airline-reviews/british-airways>

#### Importing relevant packages

In [None]:
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Web scraping with `BeautifulSoup`

The retrieved reviews cover the period from 6th May 2014 to 9th January 2023.

---
This code was last executed on 9th January, 2023.

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 344
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

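    # Build the URL for page i of the paginated review list (newest first)
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Fetch the page's HTML; the timeout keeps a stalled request from hanging the loop
    response = requests.get(url, timeout=30)

    # Parse the HTML and collect the text of each review body
    parsed_content = BeautifulSoup(response.content, "html.parser")
    page_reviews = [
        para.get_text()
        for para in parsed_content.find_all("div", {"class": "text_content"})
    ]

    # A page with no reviews means we have run past the last page, so stop
    if not page_reviews:
        break
    reviews.extend(page_reviews)

    print(f"   ---> {len(reviews)} total reviews")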

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
...
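
The page URL in the loop above is assembled by hand, with the `:` in `post_date:Desc` already percent-encoded as `%3A`. As a minimal sketch, the same URL could be built with the standard library's `urlencode`, which does the escaping itself. `build_page_url` and `BASE_URL` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
from urllib.parse import urlencode

# Same base URL as in the scraping cell
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page, page_size=100, base_url=BASE_URL):
    """Return the URL of one page of the paginated review list, newest first."""
    # urlencode percent-encodes the ':' in the sort key as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


first_page = build_page_url(1)
```

Inside the loop, `url = build_page_url(i, page_size)` would then replace the f-string.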

In [None]:
df = pd.DataFrame({"reviews": reviews})
df.head(20)

| | reviews |
|---|---|
| 0 | Not Verified \|  Great thing about British Airw... |
| 1 | Not Verified \| The staff are friendly. The pla... |
| 2 | ✅ Trip Verified \| Probably the worst business ... |
| 3 | ✅ Trip Verified \| Definitely not recommended, ... |
| 4 | ✅ Trip Verified \|  BA shuttle service across t... |
| 5 | ✅ Trip Verified \| I must admit like many other... |
| 6 | Not Verified \|  When will BA update their Busi... |
| 7 | ✅ Trip Verified \|  Paid £200 day before flight... |
| 8 | ✅ Trip Verified \|  BA website did not work (we... |
| 9 | ✅ Trip Verified \|  Absolutely terrible experie... |
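
The recent reviews shown above start with a verification label (`✅ Trip Verified` or `Not Verified`) followed by `|`, while the oldest reviews carry no such prefix. A minimal sketch of separating the label from the review body is given below. It assumes these two labels are the only ones present, which is only what the output above shows; `split_verification` and `KNOWN_STATUSES` are hypothetical names, not part of the notebook.

```python
# Labels observed in the output above; other labels may exist on the site
KNOWN_STATUSES = {"Trip Verified", "Not Verified"}


def split_verification(review):
    """Split a raw review into (status, body); status is None if no label is found."""
    head, sep, body = review.partition("|")
    status = head.replace("✅", "").strip()
    if sep and status in KNOWN_STATUSES:
        return status, body.strip()
    # No recognised label: return the text unchanged apart from outer whitespace
    return None, review.strip()


examples = [
    split_verification("✅ Trip Verified |  Paid £200 day before flight"),
    split_verification("Not Verified | The staff are friendly."),
    split_verification("YYZ to LHR - July 2012 - I flew overnight"),
]
```

Applied to the DataFrame, `df["reviews"].map(split_verification)` would give one `(status, body)` tuple per review.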


In [None]:
df.tail(20)

| | reviews |
|---|---|
| 3433 | Chicago O'Hare to London Heathrow on 2 May. ch... |
| 3434 | Travelled to HKG on board the new A380. Boardi... |
| 3435 | BA 059 London to Cape Town April 29 2014 econo... |
| 3436 | Las Vegas-LGW 777 3 class. Business. The uniqu... |
| 3437 | An interesting contrast on recent Gatwick to T... |
| 3438 | Heathrow Marrakech. Had previously travelled o... |
| 3439 | Just got back from Bridgetown Barbados flying ... |
| 3440 | LHR-JFK-LAX-LHR. Check in was ok apart from be... |
| 3441 | HKG-LHR in New Club World on Boeing 777-300 - ... |
| 3442 | YYZ to LHR - July 2012 - I flew overnight in p... |
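
The last index shown is 3442, so the DataFrame holds 3,443 reviews. As a rough check on the `pages` setting, the sketch below computes how many pages of 100 reviews that total occupies. `pages_needed` is a hypothetical helper, and the calculation assumes every page except the last is full.

```python
import math


def pages_needed(total_reviews, page_size):
    """Smallest number of pages that covers `total_reviews` items."""
    return math.ceil(total_reviews / page_size)


total_reviews = 3442 + 1  # last index shown by df.tail() plus one
needed = pages_needed(total_reviews, 100)
```

Under that assumption 35 pages cover every review, which suggests that most of the 344 pages requested lie past the end of the listing.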


Finally, save the reviews to a `.csv` file so that the analysis can continue in another notebook.

In [None]:
df.to_csv("/content/drive/MyDrive/BA/data/BA_reviews.csv")
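
`to_csv` does not create missing directories, so the call above fails if `BA/data` does not yet exist on the mounted Drive. A minimal sketch of guarding against that with `pathlib` follows; `prepare_output_path` is a hypothetical helper, and the temporary directory only stands in for the Drive folder so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    return out


# Demonstration in a throwaway directory instead of /content/drive
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_path(Path(tmp) / "BA" / "data" / "BA_reviews.csv")
    parent_exists = target.parent.is_dir()
```

In the notebook this would read `df.to_csv(prepare_output_path("/content/drive/MyDrive/BA/data/BA_reviews.csv"))`.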