# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the Airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

## Phase 1 -  Data Collection 

In this phase we will collect the customer ratings data from the airline quality website called [Skytrax](https://www.airlinequality.com/airline-reviews/british-airways). We will collect data about airline ratings, seat ratings and lounge experience ratings from this website. 

In [1]:
#imports

import requests
from bs4 import BeautifulSoup
import pandas as pd 
import numpy as np

In [2]:
# create an empty list to collect all reviews 
reviews = []

# create an empty list to collect all ratings 
ratings = []

#create an empty list to collect all date
dates = []

#create an empty list to collect all country 
countries = []


In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10 
page_size = 100


# for i in range(1, rpages + 1):
for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    #Collect HTML data from this page
    response = requests.get(url)

    #Parse Content 
    content =  response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    for review in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(review.get_text())

    for rating_element in parsed_content.find_all("div", {"class": "rating-10"}):
        try:
           rating = rating_element.span.text
        except AttributeError:
           rating = "None"
           ratings.append(rating)

    # date
    for date_element in parsed_content.find_all("time"):
        date = date_element.text.strip()
        dates.append(date)

    # country
    for country_element in parsed_content.find_all("h3"):
        country = country_element.span.next_sibling.text.strip(" ()")
        countries.append(country)


    print(f" ---> {len(reviews)} total reviews")


Scraping page 1
 ---> 100 total reviews
Scraping page 2
 ---> 200 total reviews
Scraping page 3
 ---> 300 total reviews
Scraping page 4
 ---> 400 total reviews
Scraping page 5
 ---> 500 total reviews
Scraping page 6
 ---> 600 total reviews
Scraping page 7
 ---> 700 total reviews
Scraping page 8
 ---> 800 total reviews
Scraping page 9
 ---> 900 total reviews
Scraping page 10
 ---> 1000 total reviews


In [4]:
# checking the length of the total reviews extracted 
len(reviews)

1000

In [5]:
len(countries)

1000

In [12]:
# check the length
#ratings = ratings[:1000]
len(ratings)

0

In [13]:

# Create a dataframe for these collected lists of data

df = pd.DataFrame({"reviews":reviews, "dates":dates, "countries":countries})
df.shape

   

(1000, 3)

In [14]:
df.head()


Unnamed: 0,reviews,dates,countries
0,"✅ Trip Verified | Filthy plane, cabin staff o...",28th August 2023,United Kingdom
1,✅ Trip Verified | Chaos at Terminal 5 with B...,27th August 2023,United Kingdom
2,Not Verified | BA cancelled our flight and co...,27th August 2023,United Kingdom
3,✅ Trip Verified | When on our way to Heathrow ...,27th August 2023,United Kingdom
4,"✅ Trip Verified | Nice flight, good crew, very...",26th August 2023,United States


In [15]:
df.shape

(1000, 3)

In [None]:
df.to_csv("BA_reviews.csv")