# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com] you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [36]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [37]:
# Data we want to collect

review = []     # Empty list to store reviews

date = []       # Empty list to store date of review

country = []    # Empty list to store country of a reviewer

rating = []     # Empty list to store star rating

In [38]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    
    # collect all the reviews from the page
    for para in parsed_content.find_all("div", class_ = "text_content"):
        review.append(para.get_text())
    print(f"   ---> {len(review)} total reviews")

    # collect the dates of the review
    for para in parsed_content.find_all("time"):
        date.append(para.get_text())
    print(f"   ---> {len(date)} total date")

    # collect the country name of the reviewers
    for para in parsed_content.find_all("h3"):
        country.append(para.span.next_sibling.text.strip(" ()"))
    print(f"   ---> {len(country)} total country")

    # collect the rating given by the reviewers
    for para in parsed_content.find_all("div", class_ = "rating-10"):
        rating.append(para.span.text)


Scraping page 1
   ---> 100 total reviews
   ---> 100 total date
   ---> 100 total country
Scraping page 2
   ---> 200 total reviews
   ---> 200 total date
   ---> 200 total country
Scraping page 3
   ---> 300 total reviews
   ---> 300 total date
   ---> 300 total country
Scraping page 4
   ---> 400 total reviews
   ---> 400 total date
   ---> 400 total country
Scraping page 5
   ---> 500 total reviews
   ---> 500 total date
   ---> 500 total country
Scraping page 6
   ---> 600 total reviews
   ---> 600 total date
   ---> 600 total country
Scraping page 7
   ---> 700 total reviews
   ---> 700 total date
   ---> 700 total country
Scraping page 8
   ---> 800 total reviews
   ---> 800 total date
   ---> 800 total country
Scraping page 9
   ---> 900 total reviews
   ---> 900 total date
   ---> 900 total country
Scraping page 10
   ---> 1000 total reviews
   ---> 1000 total date
   ---> 1000 total country
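
The log above reports counts for reviews, dates and countries, but not for ratings. Before building a DataFrame it can be worth checking that every list has the same number of entries, because a column built from a `pd.Series` is aligned on the index and a length mismatch may not raise an error. The following is a minimal sketch of such a check; the helper names `column_lengths` and `assert_same_length` are hypothetical and not part of the original notebook.

```python
def column_lengths(**columns):
    """Return the length of each named list as a dict."""
    return {name: len(values) for name, values in columns.items()}


def assert_same_length(**columns):
    """Raise ValueError if the named lists differ in length.

    Returns the dict of lengths when they all match.
    """
    lengths = column_lengths(**columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths
```

With the lists from the scraping loop, this could be called as `assert_same_length(review=review, date=date, country=country, rating=rating)`. If it raises, one of the selectors is probably matching extra or missing elements on the page.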


In [39]:
df = pd.DataFrame()
df["reviews"] = pd.Series(review)
df["date"] = pd.Series(date)
df["country"] = pd.Series(country)
df["rating"] = pd.Series(rating)
df.head()

| | reviews | date | country | rating |
|---|---|---|---|---|
| 0 | Not Verified \| Regarding the aircraft and seat... | 28th April 2023 | United Kingdom | `\n\t\t\t\t\t\t\t\t\t\t\t\t\t5` |
| 1 | Not Verified \| I travelled with British Airway... | 26th April 2023 | Sweden | 5 |
| 2 | Not Verified \| Food was lousy. Who ever is pl... | 24th April 2023 | United States | 1 |
| 3 | ✅ Trip Verified \| Had the worst experience. Th... | 24th April 2023 | Canada | 2 |
| 4 | ✅ Trip Verified \| The ground staff were not h... | 23rd April 2023 | Ireland | 1 |
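
The preview shows two things that may need normalizing before analysis: the rating in the first row carries leading whitespace characters, and the dates are strings with ordinal suffixes such as "28th". Below is a small sketch of how both could be converted, assuming every rating is a whole number and every date follows the "28th April 2023" pattern with English month names. The function names `clean_rating` and `parse_review_date` are hypothetical.

```python
import re
from datetime import datetime


def clean_rating(raw):
    """Strip surrounding whitespace and convert a rating string to int."""
    return int(raw.strip())


def parse_review_date(raw):
    """Parse a date such as '28th April 2023' into a datetime.date."""
    # Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number.
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())
    # %B matches the full month name; this assumes an English locale.
    return datetime.strptime(without_suffix, "%d %B %Y").date()
```

These could be applied column by column, for example `df["rating"].map(clean_rating)` and `df["date"].map(parse_review_date)`. Rows that do not fit the assumed formats will raise a `ValueError`, which is a useful signal that the scraped data needs a closer look.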


In [40]:
df.to_csv("BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 1000 reviews by iterating through the paginated results on the website. If you want to collect more data, try increasing the number of pages!

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
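
As a starting point for that cleaning step, here is a minimal sketch that removes the verification prefix from a review. It assumes the prefix is always "Trip Verified" or "Not Verified" (optionally preceded by the check mark) followed by a `|` separator, as in the rows shown earlier. The names `VERIFICATION_PREFIX` and `strip_verification_prefix` are hypothetical.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator.
VERIFICATION_PREFIX = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*"
)


def strip_verification_prefix(text):
    """Remove a leading verification label from a review, if one is present."""
    return VERIFICATION_PREFIX.sub("", text).strip()
```

To apply it to the whole dataset, something like `df["reviews"] = df["reviews"].map(strip_verification_prefix)` should work. Reviews without a recognized prefix are returned unchanged apart from trimmed whitespace. If you later want to analyze verified and unverified reviews separately, consider saving the label in its own column before removing it.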