In [None]:
# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you can see that there is a lot of data there. For this task, we are only interested in reviews about British Airways as an airline.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways], you will see this data. We can now use `Python` and `BeautifulSoup` to request each page of the paginated review listing and collect the review text it contains.

In [35]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 20
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop on HTTP errors or a hung connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
Scraping page 11
   ---> 1100 total reviews
Scraping page 12
   ---> 1200 total reviews
Scraping page 13
   ---> 1300 total reviews
Scraping page 14
   ---> 1400 total reviews
Scraping page 15
   ---> 1500 total reviews
Scraping page 16
   ---> 1600 total reviews
Scraping page 17
   ---> 1700 total reviews
Scraping page 18
   ---> 1800 total reviews
Scraping page 19
   ---> 1900 total reviews
Scraping page 20
   ---> 2000 total reviews


In [6]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | The worst business class ex...
1,Not Verified | Quite possibly the worst busin...
2,Not Verified | I will never be flying with BA...
3,✅ Trip Verified | On the my trip to Mexico Ci...
4,✅ Trip Verified | I upgraded at check in to C...


In [9]:
df.to_csv("C:/Users/ramva/OneDrive/Desktop/Adithya/Data Science/Projects/British-Airways-virtual-internship/BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 2000 reviews by iterating through 20 pages of 100 reviews each. If you want to collect more data, try increasing the number of pages.

The next step is to clean this data by removing any unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it exists, as it's not relevant to what we want to investigate.
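As a minimal sketch of that cleaning step, the snippet below strips a leading verification label with a single regular expression from the standard library's `re` module. The three labels are the ones visible in the scraped output; the `strip_status` helper and the sample strings are hypothetical and only serve to illustrate the idea.

```python
import re

# Optional check mark, one of the status labels seen in the scraped data,
# then the "|" separator and any surrounding whitespace.
PREFIX = re.compile(
    r"^\s*✅?\s*(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)

def strip_status(review: str) -> str:
    """Return the review text without its leading verification label."""
    return PREFIX.sub("", review).strip()

# Hypothetical sample rows, not taken from the dataset.
samples = [
    "✅ Trip Verified |  Sample review one.",
    "Not Verified |  Sample review two.",
    "Sample review with no label.",
]
cleaned = [strip_status(s) for s in samples]
print(cleaned)
```

Anchoring the pattern at the start of the string means the same words are left alone if they happen to appear later in the body of a review.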
 
### Read the dataset

In [36]:
# Load the saved CSV and keep only the review text as a Series
df = pd.read_csv('BA_reviews.csv')
reviews = df["reviews"]
reviews

0       ✅ Trip Verified |  The worst business class ex...
1       Not Verified |  Quite possibly the worst busin...
2       Not Verified |  I will never be flying with BA...
3       ✅ Trip Verified |  On the my trip to Mexico Ci...
4       ✅ Trip Verified |  I upgraded at check in to C...
                              ...                        
1995    ✅ Verified Review |  British Airways was known...
1996    ✅ Verified Review |  London to Philadelphia. W...
1997    ✅ Verified Review |  Dallas Ft Worth to Mumbai...
1998    ✅ Verified Review |  Flew from LHR to AUS. Bri...
1999    ✅ Verified Review |  London to Hong Kong in bu...
Name: reviews, Length: 2000, dtype: object

In [37]:
def remove_punc(text):
    """Replace every ASCII punctuation character in text with a space."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')
    return text

In [42]:
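# Strip the verification labels and separators (literal matches, not regex)
for label in ("Trip Verified |", "Verified Review |", "Not Verified", "✅", "|"):
    reviews = reviews.str.replace(label, ' ', regex=False)
# NOTE: without regex=True this pattern is treated as a literal string in
# pandas 2.0+ and matches nothing, so short words remain in the output;
# pass regex=True to actually drop words of 1-3 characters.
reviews = reviews.str.replace(r'\b(\w{1,3})\b', '')
reviews = reviews.apply(remove_punc)
reviews.head(10)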


0         The worst business class experience  Grou...
1         Quite possibly the worst business class I...
2         I will never be flying with BA again  Thi...
3         On the my trip to Mexico City  I had the ...
4         I upgraded at check in to Club Europe sea...
5         I bought a return trip with BA  through W...
6         Poor from start to finish  Six months aft...
7        Communication and customer service non exi...
8         That was supposed to be my flight but it ...
9         Have no fear when your BA flight is opera...
Name: reviews, dtype: object
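The short-word pattern in the cell above only has an effect when it is interpreted as a regular expression. As a small sketch using the standard library's `re` module (not the pandas call itself), this shows what the pattern removes from one of the cleaned sentences when it is applied as a regex. The variable names are illustrative.

```python
import re

# Words of one to three word-characters, delimited by word boundaries.
SHORT_WORD = re.compile(r"\b\w{1,3}\b")

sample = "I will never be flying with BA again"
without_short = SHORT_WORD.sub("", sample)

# Collapse the gaps left behind by the removed words.
tidy = " ".join(without_short.split())
print(tidy)
```

Note that a length-based filter also drops short but informative tokens such as "BA", so it may be worth checking what is lost before relying on it.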

# reviews.shape

In [33]:
freq_words = pd.Series(' '.join(reviews).lower().split()).value_counts()[:50]
freq_words

the        16546
to         11554
and         9864
a           7427
was         7061
i           6925
in          4750
of          4685
on          4107
flight      3828
for         3745
with        3092
it          2713
ba          2684
that        2675
my          2654
is          2638
we          2519
not         2482
were        2363
they        2310
at          2164
but         2091
had         2072
this        2070
have        1944
as          1824
no          1783
from        1633
service     1592
london      1444
very        1315
me          1308
you         1295
be          1292
so          1226
an          1181
are         1173
food        1170
seat        1158
there       1078
time        1077
crew        1074
british     1042
class       1029
airways     1028
our         1026
seats        961
cabin        957
t            950
Name: count, dtype: int64
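The counts above are dominated by function words such as "the" and "to". One possible refinement, sketched here with `collections.Counter` from the standard library, is to skip a stop-word list while counting. The `STOP_WORDS` set, the `top_words` helper, and the sample sentences are all hypothetical; a real analysis would likely use a fuller stop-word list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {
    "the", "to", "and", "a", "was", "i", "in", "of", "on", "for", "with", "it",
}

def top_words(texts, n=5):
    """Count lower-cased words across texts, ignoring STOP_WORDS."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOP_WORDS
    )
    return counts.most_common(n)

# Made-up sentences, not taken from the dataset.
sample_reviews = [
    "The flight was late and the crew was rude",
    "The food on the flight was cold",
    "Crew and food were fine",
]
result = top_words(sample_reviews, n=3)
print(result)
```

The same idea could be applied to the `reviews` Series by passing it to the helper in place of `sample_reviews`.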