# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use `requests` to download pages and a package called `BeautifulSoup` to extract the data from them. Once you've collected your data and saved it into a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit <https://www.airlinequality.com> you can see that there is a lot of data there. For this task, we are only interested in reviews of British Airways and in the airline itself.

If you navigate to <https://www.airlinequality.com/airline-reviews/british-airways> you will see this data. Now we can use `Python` and `BeautifulSoup` to step through the paginated review listing and collect the text, rating, date and reviewer country for each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Empty lists to collect the scraped fields
reviews = []   # review text
ratings = []   # ratings (on a scale out of 10)
date = []      # review dates
country = []   # reviewers' countries

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 100       # number of listing pages to request
page_size = 100   # reviews per listing page

# Loop through listing pages 1 to `pages`
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Build the URL for this page of the paginated review list
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Fetch the page once and parse it once (requires the html5lib package)
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, "html5lib")

    # Extract the review text
    for para in soup.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    # Extract ratings. Note: this matches every "rating-10" element on the
    # page, not only those that belong to a review.
    for item in soup.find_all("div", class_="rating-10"):
        try:
            ratings.append(item.span.text)
        except AttributeError:
            print(f"Error on page {i}")
            ratings.append("None")

    # Extract review dates
    for item in soup.find_all("time"):
        date.append(item.text)

    # Extract each reviewer's country
    for item in soup.find_all("h3"):
        country.append(item.span.next_sibling.text.strip(" ()"))

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews
Scraping page 21
   ---> 2100 total reviews
Scraping page 22
   ---> 2200 total reviews
Scraping page 23
   ---> 2300 total reviews
...
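
The loop above always requests a fixed number of pages, even though the number of pages that actually contain reviews is not known in advance. One option is to stop as soon as a page yields no reviews. The sketch below shows only that control flow: `fetch_page` and `extract_reviews` are hypothetical callables standing in for the `requests.get` call and the `BeautifulSoup` parsing, and the fake data exists purely so the example runs offline.

```python
from typing import Callable, List


def scrape_until_empty(
    fetch_page: Callable[[int], str],
    extract_reviews: Callable[[str], List[str]],
    max_pages: int = 100,
) -> List[str]:
    """Collect reviews page by page, stopping at the first empty page."""
    collected: List[str] = []
    for page_number in range(1, max_pages + 1):
        html = fetch_page(page_number)
        page_reviews = extract_reviews(html)
        if not page_reviews:
            # No reviews on this page: assume we ran past the last one.
            break
        collected.extend(page_reviews)
    return collected


# Offline stand-ins (hypothetical): three pages of data, then empty pages.
fake_site = {1: "a|b|c", 2: "d|e", 3: "f"}
requested = []  # records which page numbers were actually fetched


def fake_fetch(page_number: int) -> str:
    requested.append(page_number)
    return fake_site.get(page_number, "")


def fake_extract(html: str) -> List[str]:
    return [part for part in html.split("|") if part]


all_reviews = scrape_until_empty(fake_fetch, fake_extract, max_pages=100)
print(len(all_reviews), "reviews from", len(requested), "requests")
```

With the stand-in data, the loop stops after the fourth request instead of making all 100. In the notebook, the same check could be a `break` when a page adds nothing to `reviews`.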

In [4]:
print(len(reviews))
print(len(ratings))
print(len(date))
print(len(country))

3742
3842
3742
3742
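
The four counts above do not all match: `ratings` has 100 more entries than the other lists, which equals the number of pages requested. One way to avoid this kind of drift is to read every field from the same review container, so that each review yields exactly one record. The sketch below assumes each review sits in its own `<article>` element and runs on a small hand-written HTML sample. The sample markup and the helper name `parse_reviews` are illustrative assumptions, not the actual Skytrax page structure, so the selectors would need checking against the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample (hypothetical markup): one page-level rating that is
# not part of any review, followed by two review articles.
SAMPLE_HTML = """
<div class="rating-10"><span>5</span></div>
<article class="review">
  <div class="rating-10"><span>3</span></div>
  <h3><span>A Reader</span> (Ukraine) <time>28th January 2024</time></h3>
  <div class="text_content">First review text.</div>
</article>
<article class="review">
  <h3><span>B Reader</span> (Germany) <time>23rd January 2024</time></h3>
  <div class="text_content">Second review text, no rating given.</div>
</article>
"""


def parse_reviews(html):
    """Return one dict per review, reading all fields from the same element."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for article in soup.find_all("article", class_="review"):
        rating_div = article.find("div", class_="rating-10")
        rating_tag = rating_div.span if rating_div else None
        heading = article.find("h3")
        name_tag = heading.span if heading else None
        # The country is the text node that follows the reviewer's name.
        sibling = name_tag.next_sibling if name_tag else None
        time_tag = article.find("time")
        body_tag = article.find("div", class_="text_content")
        records.append({
            "review": body_tag.get_text(strip=True) if body_tag else None,
            "rating": rating_tag.get_text(strip=True) if rating_tag else None,
            "date": time_tag.get_text(strip=True) if time_tag else None,
            "country": str(sibling).strip(" ()") if sibling else None,
        })
    return records


records = parse_reviews(SAMPLE_HTML)
for record in records:
    print(record)
```

A list of dicts like this can be passed directly to `pd.DataFrame(records)`. A review with no rating then shows up as a missing value in its own row rather than shifting the ratings of later rows.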


In [6]:
df = pd.DataFrame({"reviews":reviews,"date":date,"country":country})

df.head()

|   | reviews | date | country |
|---|---|---|---|
| 0 | ✅ Trip Verified \| I have come to boarding and... | 28th January 2024 | Ukraine |
| 1 | ✅ Trip Verified \| Stinking nappies being chang... | 26th January 2024 | United Kingdom |
| 2 | ✅ Trip Verified \| Worst service ever. Lost bag... | 23rd January 2024 | Germany |
| 3 | ✅ Trip Verified \| BA 246 21JAN 2023 Did not a... | 21st January 2024 | United Kingdom |
| 4 | ✅ Trip Verified \| Not a great experience. I co... | 18th January 2024 | United Kingdom |


In [7]:
df.tail()

|   | reviews | date | country |
|---|---|---|---|
| 3737 | Flew LHR - VIE return operated by bmi but BA a... | 29th August 2012 | United Kingdom |
| 3738 | LHR to HAM. Purser addresses all club passenge... | 28th August 2012 | United Kingdom |
| 3739 | My son who had worked for British Airways urge... | 12th October 2011 | United Kingdom |
| 3740 | London City-New York JFK via Shannon on A318 b... | 11th October 2011 | United States |
| 3741 | SIN-LHR BA12 B747-436 First Class. Old aircraf... | 9th October 2011 | United Kingdom |


In [13]:
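# Drop the first rating and keep one rating per review row.
# Caution: `ratings` has 100 more entries than `reviews`, which suggests one
# extra rating per requested page, so this slice may misalign ratings with
# reviews after the first page.
df['ratings'] = ratings[1:3743]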

In [14]:
df['ratings']

0       3
1       2
2       1
3       6
4       3
       ..
3737    8
3738    2
3739    7
3740    1
3741    9
Name: ratings, Length: 3742, dtype: object

In [15]:
df

|   | reviews | date | country | ratings |
|---|---|---|---|---|
| 0 | ✅ Trip Verified \| I have come to boarding and... | 28th January 2024 | Ukraine | 3 |
| 1 | ✅ Trip Verified \| Stinking nappies being chang... | 26th January 2024 | United Kingdom | 2 |
| 2 | ✅ Trip Verified \| Worst service ever. Lost bag... | 23rd January 2024 | Germany | 1 |
| 3 | ✅ Trip Verified \| BA 246 21JAN 2023 Did not a... | 21st January 2024 | United Kingdom | 6 |
| 4 | ✅ Trip Verified \| Not a great experience. I co... | 18th January 2024 | United Kingdom | 3 |
| ... | ... | ... | ... | ... |
| 3737 | Flew LHR - VIE return operated by bmi but BA a... | 29th August 2012 | United Kingdom | 8 |
| 3738 | LHR to HAM. Purser addresses all club passenge... | 28th August 2012 | United Kingdom | 2 |
| 3739 | My son who had worked for British Airways urge... | 12th October 2011 | United Kingdom | 7 |
| 3740 | London City-New York JFK via Shannon on A318 b... | 11th October 2011 | United States | 1 |


In [16]:
df.to_csv("BA_reviews.csv")
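
By default, `to_csv` writes the DataFrame index as a first column with no name, which comes back as an `Unnamed: 0` column when the file is later read with a plain `pd.read_csv`. Below is a minimal sketch of two ways to avoid that, using an in-memory buffer instead of a file and a two-row stand-in DataFrame whose values are made up for illustration.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "reviews": ["Good flight.", "Lost bag."],
    "ratings": ["8", "1"],
})

# Default behaviour: the index is written as a first column with no name.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
naive = pd.read_csv(with_index)
print(list(naive.columns))

# Option 1: do not write the index at all.
no_index = io.StringIO()
sample.to_csv(no_index, index=False)
no_index.seek(0)
clean = pd.read_csv(no_index)
print(list(clean.columns))

# Option 2: keep the index in the file and restore it when reading.
with_index.seek(0)
restored = pd.read_csv(with_index, index_col=0)
print(list(restored.columns))
```

For the notebook, this would mean either `df.to_csv("BA_reviews.csv", index=False)` or reading the file back with `index_col=0`.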

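Congratulations! Now you have your dataset for this task! The loop above collected 3,742 reviews by iterating through the paginated listing on the website. Because it requested up to 100 pages of 100 reviews each and still ended at 3,742, this is likely every review that was available at the time of scraping. To collect fewer reviews, reduce the number of pages.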

Next, clean the data to remove any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
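
As a starting point for that cleaning step, the sketch below strips a leading verification marker and converts date strings of the form shown earlier (for example `28th January 2024`) into date objects. It uses only the standard library. The helper names `clean_review` and `parse_review_date` are hypothetical, and the pattern assumes the marker always precedes the first `|`. The "Not Verified" variant is also an assumption, since only "✅ Trip Verified" appears in the output above, so both should be checked against the real data.

```python
import datetime as dt
import re

# Matches an optional check mark, then "Trip Verified" or "Not Verified",
# then the "|" separator, at the start of the string.
PREFIX_PATTERN = re.compile(r"^\s*✅?\s*(?:Trip Verified|Not Verified)\s*\|\s*")

# Matches the ordinal suffix that follows a day number, e.g. the "th" in "28th".
ORDINAL_PATTERN = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")


def clean_review(text: str) -> str:
    """Remove the verification prefix, if present, and trim whitespace."""
    return PREFIX_PATTERN.sub("", text).strip()


def parse_review_date(text: str) -> dt.date:
    """Convert a string such as '28th January 2024' to a date.

    Assumes English month names (the default "C" locale).
    """
    without_suffix = ORDINAL_PATTERN.sub("", text.strip())
    return dt.datetime.strptime(without_suffix, "%d %B %Y").date()


print(clean_review("✅ Trip Verified | Worst service ever. Lost bag"))
print(clean_review("LHR to HAM. Purser addresses all club passengers"))
print(parse_review_date("23rd January 2024"))
print(parse_review_date("9th October 2011"))
```

With the DataFrame built above, these helpers could be applied column-wise, for example `df["reviews"].map(clean_review)` and `df["date"].map(parse_review_date)`.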