# Web Scraping and Analysis of British Airways Reviews

This script automates the scraping of customer reviews of British Airways from the Airline Quality website. Using the `requests` and `BeautifulSoup` libraries, it requests 50 pages of up to 100 reviews each, for at most 5,000 reviews.

Data extracted includes author names, review content, ratings, and publication dates. Additional review statistics are also gathered. The reviews are stored in a pandas DataFrame, cleaned for missing values, and finally saved to a CSV file. Visual and textual analysis, such as generating word clouds and sentiment analysis, can be performed on the cleaned dataset.

### Importing necessary libraries

In [30]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns
import pandas as pd
import time 
from wordcloud import WordCloud
from textblob import TextBlob

### Setting Up Scraping Parameters

This section defines the key parameters and initializes storage for web scraping.

- `base_url`: Sets the target URL for British Airways reviews on the Airline Quality website.
- `pages_to_scrape`: Determines the number of pages to scrape, set to 50.
- `page_size`: Specifies the number of reviews per page, set to 100.
- `all_reviews`: Initializes an empty list to store reviews collected across pages. This list is filled with review data as each page is processed.

In [31]:
# URL of the website to scrape
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages_to_scrape = 50
page_size = 100

# Initialize the list to store all reviews across pages
all_reviews = []
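
As a quick check of how these parameters combine, the sketch below builds the paginated URL for one page and computes the upper bound on the number of reviews. The helper name `build_page_url` is hypothetical (it is not part of the notebook); the `sortby` and `pagesize` query parameters are the ones that appear in the URL used by the scraping loop.

```python
from urllib.parse import urlencode

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages_to_scrape = 50
page_size = 100


def build_page_url(base_url, page, page_size):
    """Hypothetical helper: URL of one results page, newest reviews first."""
    # urlencode percent-encodes the colon, giving sortby=post_date%3ADesc
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


first_page_url = build_page_url(base_url, 1, page_size)
max_reviews = pages_to_scrape * page_size  # upper bound: 50 pages * 100 reviews

print(first_page_url)
print(max_reviews)
```

Building the query string with `urlencode` rather than by hand is a design choice here, not something the notebook requires; it avoids typing the percent-encoding manually.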

### Scraping British Airways Reviews

This section demonstrates the process of iterating through multiple pages of British Airways reviews, extracting key information, and storing it for analysis.

### Steps for Scraping British Airways Reviews
1. Iterate Pages: Loop through specified pages, printing the current page number.
2. Construct URL: Form the URL for each page.
3. Fetch and Parse: Request and parse HTML with BeautifulSoup.
4. Extract Data: Extract review details and statistics.
5. Store Data: Append data to lists.
6. Delay: Introduce a one-second delay.
7. Print Progress: Display the number of reviews collected per page.
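
Step 4 can be tried offline before hitting the website. The sketch below parses a small HTML fragment and extracts fields in a way that tolerates missing tags. The fragment is made up for illustration; only the tag names and the `class`/`itemprop` selectors are taken from the scraping code, and the helpers `text_or_none` and `parse_article` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure targeted by the selectors
SAMPLE_HTML = """
<article class="comp_media-review-rated">
  <span itemprop="ratingValue">9</span>
  <h2 class="text_header">flight itself was quite good</h2>
  <table class="review-ratings">
    <tr><td class="review-rating-header">Seat Type</td>
        <td class="review-value">Business Class</td></tr>
    <tr><td class="review-rating-header">Recommended</td>
        <td class="review-value">yes</td></tr>
  </table>
</article>
<article class="comp_media-review-rated">
  <h2 class="text_header">no rating and no stats table</h2>
</article>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_article(article):
    """Hypothetical helper: turn one <article> into a flat dict."""
    record = {
        "RatingValue": text_or_none(article.find("span", itemprop="ratingValue")),
        "ReviewTitle": text_or_none(article.find("h2", class_="text_header")),
    }
    stats = article.find("table", class_="review-ratings")
    rows = stats.find_all("tr") if stats is not None else []
    for row in rows:
        header = row.find("td", class_="review-rating-header")
        value = row.find("td", class_="review-value")
        if header is not None and value is not None:
            record[header.get_text(strip=True)] = value.get_text(strip=True)
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    parse_article(article)
    for article in soup.find_all("article", class_="comp_media-review-rated")
]
for record in records:
    print(record)
```

The second article has neither a rating nor a stats table, so its record contains `None` for the rating and no extra keys, rather than raising an error.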

In [32]:
for i in range(1, pages_to_scrape + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page (time out instead of hanging indefinitely)
    response = requests.get(url, timeout=30)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Extract the relevant information from the HTML code
    page_reviews = []
    for article in parsed_content.find_all('article', class_='comp_media-review-rated'):
        
        # Extracting data directly from the article
        rating_value_tag = article.find('span', itemprop='ratingValue')
        rating_value = rating_value_tag.get_text() if rating_value_tag else None

        rating_max_tag = article.find('span', itemprop='bestRating')
        rating_max = rating_max_tag.get_text() if rating_max_tag else None

        review_title_tag = article.find('h2', class_='text_header')
        review_title = review_title_tag.get_text(strip=True) if review_title_tag else None

        author_name_tag = article.find('span', itemprop='name')
        author_name = author_name_tag.get_text(strip=True) if author_name_tag else None

        date_published_tag = article.find('time', itemprop='datePublished')
        date_published = date_published_tag['datetime'] if date_published_tag else None

        review_body_tag = article.find('div', itemprop='reviewBody')
        review_body = review_body_tag.get_text(strip=True) if review_body_tag else None

        # Extracting additional information from the review-stats table
        # (skipped when an article has no such table, which would otherwise raise AttributeError)
        review_stats = article.find('table', class_='review-ratings')
        data = {}
        for item in (review_stats.find_all('tr') if review_stats else []):
            header = item.find('td', class_='review-rating-header')
            value = item.find('td', class_='review-value')
            if header and value:
                data[header.text.strip()] = value.text.strip()

        page_reviews.append({
            'AuthorName': author_name,
            'ReviewBody': review_body,
            'RatingValue': rating_value,
            'RatingMax': rating_max,
            'ReviewTitle': review_title,
            'DatePublished': date_published,
            **data  # Include additional information from the review-stats table
        })

    all_reviews.extend(page_reviews)  # Extend the list with reviews from the current page

    # Add a delay between requests to avoid overwhelming the website with requests
    time.sleep(1)

    print(f"   ---> {len(page_reviews)} reviews on page {i}")


Scraping page 1
   ---> 100 reviews on page 1
Scraping page 2
   ---> 100 reviews on page 2
Scraping page 3
   ---> 100 reviews on page 3
Scraping page 4
   ---> 100 reviews on page 4
Scraping page 5
   ---> 100 reviews on page 5
Scraping page 6
   ---> 100 reviews on page 6
Scraping page 7
   ---> 100 reviews on page 7
Scraping page 8
   ---> 100 reviews on page 8
Scraping page 9
   ---> 100 reviews on page 9
Scraping page 10
   ---> 100 reviews on page 10
Scraping page 11
   ---> 100 reviews on page 11
Scraping page 12
   ---> 100 reviews on page 12
Scraping page 13
   ---> 100 reviews on page 13
Scraping page 14
   ---> 100 reviews on page 14
Scraping page 15
   ---> 100 reviews on page 15
Scraping page 16
   ---> 100 reviews on page 16
Scraping page 17
   ---> 100 reviews on page 17
Scraping page 18
   ---> 100 reviews on page 18
Scraping page 19
   ---> 100 reviews on page 19
Scraping page 20
   ---> 100 reviews on page 20
Scraping page 21
   ---> 100 reviews on page 21
Scraping p

### Storing Scraped Data

The following code stores the collected reviews in a pandas DataFrame and displays its first five rows. This creates a structured tabular format for easier data manipulation and analysis.

In [34]:
# Store all reviews in a pandas dataframe
df = pd.DataFrame(all_reviews)

df.head()

| Unnamed: 0 | AuthorName | ReviewBody | RatingValue | RatingMax | ReviewTitle | DatePublished | Aircraft | Type Of Traveller | Seat Type | Route | Date Flown | Recommended |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jonathan Rodden | ✅Trip Verified\| Flew British Airways on BA 43... | 9 | 10 | "flight itself was quite good" | 2024-06-10 | A320 | Solo Leisure | Business Class | London to Amsterdam | May 2024 | yes |
| 1 | A Hammad | ✅Trip Verified\| BA cancelled the flight from ... | 1 | 10 | "You expect better from BA" | 2024-06-09 | | Couple Leisure | Premium Economy | Tokyo to Manchester via London | May 2024 | no |
| 2 | D Baker | ✅Trip Verified\| I strongly advise everyone to ... | 1 | 10 | “never fly British Airways" | 2024-06-06 | Boeing 747 | Business | Business Class | Heathrow to San Francisco | June 2024 | no |
| 3 | Val Rose | ✅Trip Verified\| My partner and I were on the B... | 5 | 10 | “we will rethink BA moving forward” | 2024-06-03 | | Couple Leisure | Business Class | Tampa to Gatwick | May 2024 | no |
| 4 | Jason George | Not Verified\| We had a Premium Economy return... | 1 | 10 | “extremely poor customer service” | 2024-06-01 | | Family Leisure | Premium Economy | Los Angeles to London | January 2024 | no |
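
In the sample rows, each `ReviewBody` starts with a verification label followed by a pipe character. A possible cleaning step, sketched below, separates that label from the review text. The function name `split_verification` is hypothetical, and the sketch assumes only the two labels visible in the sample (`Trip Verified` and `Not Verified`); any other prefix is left untouched.

```python
def split_verification(review_body):
    """Hypothetical helper: return (verified_flag, review_text).

    verified_flag is True / False for the two known labels, and None when
    no recognised label is present.
    """
    if not review_body:
        return None, review_body
    prefix, separator, rest = review_body.partition("|")
    # Drop the check-mark emoji and normalise case before comparing
    label = prefix.replace("✅", "").strip().lower()
    if separator and label in {"trip verified", "not verified"}:
        return label == "trip verified", rest.strip()
    return None, review_body.strip()


examples = [
    "✅Trip Verified| Flew British Airways",
    "Not Verified| We had a Premium Economy return",
    "Review text without a label",
]
for body in examples:
    print(split_verification(body))
```

With a DataFrame, the same function could be applied to the `ReviewBody` column, for example through `df["ReviewBody"].apply(split_verification)`, and the resulting pairs unpacked into two new columns.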


### Saving the Scraped Data as a CSV file

In [43]:
# Save to a CSV file (the data/ directory must already exist)
df.to_csv("data/BA_scrapped_data.csv")
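
By default, `to_csv` also writes the row index, which appears as an `Unnamed: 0` column when the file is read back, and it fails if the target directory does not exist. The sketch below shows one way to handle both. The helper `save_reviews`, the file name, and the two sample rows are illustrative assumptions, not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, path):
    """Hypothetical helper: write df to CSV, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)  # index=False: do not write the row index
    return path


# Tiny stand-in DataFrame (illustrative values, not scraped data)
sample_df = pd.DataFrame([
    {"AuthorName": "Reviewer A", "RatingValue": "9", "Recommended": "yes"},
    {"AuthorName": "Reviewer B", "RatingValue": "1", "Recommended": "no"},
])

# Write into a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = save_reviews(sample_df, Path(tmp_dir) / "data" / "reviews_sample.csv")
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
```

Whether to keep the index is a judgment call: dropping it keeps the CSV columns identical to the DataFrame columns, while keeping it preserves the original row order as an explicit column.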