# Task 1

---

## Web scraping and analysis
### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you will find a large collection of airline reviews. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use `Python` with `requests` and `BeautifulSoup` to fetch each paginated listing page and collect the review text it contains.
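
The listing is paginated: the page number sits in the URL path, and the `sortby` and `pagesize` query parameters control ordering and page length. As a minimal sketch, assuming those parameter names stay as they appear in the scraping code in this notebook, the URL for any page could be assembled with the standard library. `build_page_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.airlinequality.com/airline-reviews/british-airways'


def build_page_url(page: int, page_size: int = 100) -> str:
    """Return the listing URL for one page of reviews (hypothetical helper)."""
    # urlencode percent-encodes the colon in 'post_date:Desc' as %3A
    query = urlencode({'sortby': 'post_date:Desc', 'pagesize': page_size})
    return f'{BASE_URL}/page/{page}/?{query}'


print(build_page_url(1))
```

Building the query string with `urlencode` avoids hand-writing percent-encoded values such as `%3A`.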

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [7]:
# Declare constants
BASE_URL = 'https://www.airlinequality.com/airline-reviews/british-airways'
PAGE_NUM = 34
PAGE_SIZE = 100

# Empty list to store scraped reviews
reviews = []

# Loop through each page
for i in range(1, PAGE_NUM + 1):
    print(f'Scrapping page {i}')

    # Create url to collect links from paginated data
    url = f'{BASE_URL}/page/{i}/?sortby=post_date%3ADesc&pagesize={PAGE_SIZE}'

    # Collect HTML response from the page; stop on HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse HTML with Beautiful Soup
    html_content = response.content
    parsed_content = BeautifulSoup(html_content, 'html.parser')
    
    # Loop through every div with class text_content; each one holds the text of a review
    for element in parsed_content.find_all('div', {'class': 'text_content'}):
        reviews.append(element.get_text())

    print(f"   ---> {len(reviews)} total reviews")


Scrapping page 1
   ---> 100 total reviews
Scrapping page 2
   ---> 200 total reviews
Scrapping page 3
   ---> 300 total reviews
Scrapping page 4
   ---> 400 total reviews
Scrapping page 5
   ---> 500 total reviews
Scrapping page 6
   ---> 600 total reviews
Scrapping page 7
   ---> 700 total reviews
Scrapping page 8
   ---> 800 total reviews
Scrapping page 9
   ---> 900 total reviews
Scrapping page 10
   ---> 1000 total reviews
Scrapping page 11
   ---> 1100 total reviews
Scrapping page 12
   ---> 1200 total reviews
Scrapping page 13
   ---> 1300 total reviews
Scrapping page 14
   ---> 1400 total reviews
Scrapping page 15
   ---> 1500 total reviews
Scrapping page 16
   ---> 1600 total reviews
Scrapping page 17
   ---> 1700 total reviews
Scrapping page 18
   ---> 1800 total reviews
Scrapping page 19
   ---> 1900 total reviews
Scrapping page 20
   ---> 2000 total reviews
Scrapping page 21
   ---> 2100 total reviews
Scrapping page 22
   ---> 2200 total reviews
Scrapping page 23
   ---> 23
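
The loop above always runs for a fixed `PAGE_NUM` pages. One possible alternative, sketched here under stated assumptions, is to stop as soon as a page yields no reviews. `collect_reviews` and `fetch_page` are hypothetical names: `fetch_page(i)` stands in for the request-and-parse step and is assumed to return a list of review strings for page `i`. The optional delay between requests is also an illustrative choice, not something the original code does.

```python
import time
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]],
                    max_pages: int,
                    delay_seconds: float = 0.0) -> List[str]:
    """Collect reviews page by page, stopping early at the first empty page."""
    reviews: List[str] = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            # An empty page is taken to mean we ran past the last page
            break
        reviews.extend(page_reviews)
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests
    return reviews


# Stand-in fetcher for illustration only: two pages of data, then nothing
fake_pages = {1: ['review a', 'review b'], 2: ['review c']}
print(collect_reviews(lambda i: fake_pages.get(i, []), max_pages=34))
```

Passing the fetch step in as a function keeps the paging logic testable without touching the network.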

In [16]:
# Turn the list of reviews into a DataFrame
df = pd.DataFrame({'reviews': reviews})

In [21]:
# Preview the first few rows
df.head()

| Unnamed: 0 | reviews |
|---|---|
| 0 | ✅ Trip Verified \| The incoming and outgoing f... |
| 1 | ✅ Trip Verified \| Back in December my family ... |
| 2 | ✅ Trip Verified \| As usual the flight is dela... |
| 3 | ✅ Trip Verified \| A short BA euro trip and thi... |
| 4 | Not Verified \| We are flying Business class f... |
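
Each scraped string in the sample starts with a verification marker (`✅ Trip Verified` or `Not Verified`) followed by a `|` separator. For the later text analysis it may help to separate that marker from the review body. A minimal sketch, assuming every review follows this `marker | body` layout; `split_verification` is a hypothetical helper, and the input strings below are made-up examples rather than scraped data.

```python
from typing import Tuple


def split_verification(raw: str) -> Tuple[str, str]:
    """Split 'marker | body' into (marker, body); marker is '' if no '|' is present."""
    # partition splits at the first '|' only, so a '|' inside the body is preserved
    marker, sep, body = raw.partition('|')
    if not sep:
        # No separator found: treat the whole string as the review body
        return '', raw.strip()
    # Drop the check mark and surrounding whitespace from the marker
    return marker.replace('✅', '').strip(), body.strip()


# Made-up sample strings, for illustration only
print(split_verification('✅ Trip Verified | Example review text'))
print(split_verification('Not Verified | Another example'))
```

Applied to the DataFrame, this could populate two columns, for instance through `df['reviews'].apply(split_verification)`.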


In [23]:
path_to_dataset = '../dataset/'
dataset_name = 'british_airways_review.csv'

In [24]:
# Save the DataFrame to CSV for later text analysis
df.to_csv(path_to_dataset+dataset_name, index=False)
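
`to_csv` does not create missing directories, so the save fails if `../dataset/` does not exist yet. A small sketch of one way to guard against that with `pathlib`; `prepare_output_path` is a hypothetical helper, and the temporary directory is used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def prepare_output_path(directory: str, filename: str) -> Path:
    """Create the directory if needed and return the full file path."""
    out_dir = Path(directory)
    # parents=True creates intermediate folders; exist_ok=True tolerates reruns
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# Demonstrate inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out_path = prepare_output_path(f'{tmp}/dataset', 'british_airways_review.csv')
    print(out_path.parent.is_dir())
    # In the notebook this would become: df.to_csv(out_path, index=False)
```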