# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you have collected your data and saved it to a local `.csv` file, you can begin your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com](https://www.airlinequality.com), you can see that there is a lot of data there. For this task, we are only interested in the reviews that relate to British Airways.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways](https://www.airlinequality.com/airline-reviews/british-airways), you will see this data. We can use Python with `requests` and `BeautifulSoup` to step through the paginated review listing and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20
page_size = 500

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page (the timeout stops a stalled request from hanging the loop)
    response = requests.get(url, timeout=30)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 500 total reviews
Scraping page 2
   ---> 1000 total reviews
Scraping page 3
   ---> 1500 total reviews
Scraping page 4
   ---> 2000 total reviews
Scraping page 5
   ---> 2500 total reviews
Scraping page 6
   ---> 3000 total reviews
Scraping page 7
   ---> 3500 total reviews
Scraping page 8
   ---> 3513 total reviews
Scraping page 9
   ---> 3513 total reviews
Scraping page 10
   ---> 3513 total reviews
Scraping page 11
   ---> 3513 total reviews
Scraping page 12
   ---> 3513 total reviews
Scraping page 13
   ---> 3513 total reviews
Scraping page 14
   ---> 3513 total reviews
Scraping page 15
   ---> 3513 total reviews
Scraping page 16
   ---> 3513 total reviews
Scraping page 17
   ---> 3513 total reviews
Scraping page 18
   ---> 3513 total reviews
Scraping page 19
   ---> 3513 total reviews
Scraping page 20
   ---> 3513 total reviews
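
The output above shows that the total stops growing after page 8, so pages 9 to 20 are requested without adding anything. A minimal sketch of one way to avoid that is to stop at the first page that yields no reviews. The names `collect_reviews` and `fake_fetch` are hypothetical and not part of the notebook; `fake_fetch` stands in for the real request so the example runs offline.

```python
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]], max_pages: int) -> List[str]:
    """Collect reviews page by page, stopping at the first empty page."""
    reviews: List[str] = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            # An empty page is taken to mean we have run past the last review.
            break
        reviews.extend(page_reviews)
    return reviews


# Hypothetical stand-in for the real request: two full pages and one partial page.
def fake_fetch(page: int) -> List[str]:
    sizes = {1: 3, 2: 3, 3: 1}
    return [f"review {page}-{n}" for n in range(sizes.get(page, 0))]


collected = collect_reviews(fake_fetch, max_pages=20)
print(f"{len(collected)} total reviews")
```

In the notebook, `fetch_page` would wrap the `requests.get` and `find_all` calls from the loop above and return the list of review texts for one page.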


In [3]:
# Build a single-column DataFrame from the scraped review texts
df = pd.DataFrame({"reviews": reviews})
df.head()

   reviews
0  ✅ Trip Verified | Boarding at Mumbai was chaot...
1  Not Verified | Mexico City Airport is a zoo, b...
2  ✅ Trip Verified | Very poor service, very fru...
3  Not Verified | Generally poor. Sent to gate o...
4  Not Verified | BA changed our prepaid seats a...


In [4]:
df.to_csv("data/BA_reviews.csv")


The next step is to clean this data by removing any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
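
As a starting point for that cleaning step, here is a minimal sketch that strips the verification prefix with a regular expression. The helper name `clean_review` is hypothetical, and removing the "Not Verified" prefix as well is an assumption based on the sample rows shown above; adjust the pattern if other prefixes appear in your data.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator.
# Handling "Not Verified" here is an assumption drawn from the sample rows.
PREFIX_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")


def clean_review(text: str) -> str:
    """Remove a leading verification prefix and surrounding whitespace."""
    return PREFIX_PATTERN.sub("", text).strip()


samples = [
    "✅ Trip Verified | Boarding at Mumbai was chaotic.",
    "Not Verified | Generally poor.",
    "A review with no prefix at all.",
]
cleaned = [clean_review(s) for s in samples]
for line in cleaned:
    print(line)
```

On the DataFrame built earlier, the same function could be applied with `df["reviews"] = df["reviews"].map(clean_review)`.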