# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to this link: [https://www.airlinequality.com/airline-reviews/british-airways] you will see this data. Now, we can use `Python` and `BeautifulSoup` to collect all the links to the reviews and then to collect the text data on each of the individual review links.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 200

reviews = []


for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on HTTP errors or a hung connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 200 total reviews
Scraping page 2
   ---> 400 total reviews
Scraping page 3
   ---> 600 total reviews
Scraping page 4
   ---> 800 total reviews
Scraping page 5
   ---> 1000 total reviews
Scraping page 6
   ---> 1200 total reviews
Scraping page 7
   ---> 1400 total reviews
Scraping page 8
   ---> 1600 total reviews
Scraping page 9
   ---> 1800 total reviews
Scraping page 10
   ---> 2000 total reviews


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | Very competent check in sta...
1,"✅ Trip Verified | Check in was so slow, no se..."
2,✅ Trip Verified | My review relates to the ap...
3,✅ Trip Verified | This was my first time flyin...
4,✅ Trip Verified | Lots of cancellations and d...


In [4]:
df.to_csv("data/BA_reviews.csv")

Congratulations! Now you have your dataset for this task! The loop above collected 2000 reviews (10 pages of 200 reviews each) by iterating through the paginated pages on the website. If you want to collect more data, try increasing the number of pages!
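
The number of reviews the loop can collect follows directly from `pages` and `page_size`. As a rough sketch (the `build_page_url` helper is hypothetical and not part of the notebook), the requested URLs and the upper bound on the review count can be worked out without touching the network:

```python
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 200

def build_page_url(page_number, size):
    """Hypothetical helper: build the URL for one page of the paginated listing."""
    return f"{base_url}/page/{page_number}/?sortby=post_date%3ADesc&pagesize={size}"

urls = [build_page_url(i, page_size) for i in range(1, pages + 1)]

# Upper bound only: the last page may hold fewer reviews than page_size
max_reviews = pages * page_size

print(urls[0])
print(f"{len(urls)} pages, at most {max_reviews} reviews")
```

Doubling `pages` doubles the upper bound, as long as the site actually has that many reviews.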

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.

## Data Cleaning

In [1]:
import pandas as pd

csv_path = "data/BA_reviews.csv"
df1 = pd.read_csv(csv_path)
df1.reset_index(drop=True, inplace=True)

print(df1['reviews'])

0       ✅ Trip Verified |  Very competent check in sta...
1       ✅ Trip Verified |  Check in was so slow, no se...
2       ✅ Trip Verified |  My review relates to the ap...
3       ✅ Trip Verified | This was my first time flyin...
4       ✅ Trip Verified |  Lots of cancellations and d...
                              ...                        
1995    ✅ Verified Review |  For those who have allude...
1996    British Airways have randomly cancelled a flig...
1997    ✅ Verified Review |  Domestic BA from London a...
1998    Las Vegas to London Heathrow return, and we di...
1999    My wife and I flew to Dublin from London Heath...
Name: reviews, Length: 2000, dtype: object


In [2]:
df1.info()
df1.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  2000 non-null   int64 
 1   reviews     2000 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


Unnamed: 0.1,Unnamed: 0
count,2000.0
mean,999.5
std,577.494589
min,0.0
25%,499.75
50%,999.5
75%,1499.25
max,1999.0


In [3]:
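df1['reviews'] = df1['reviews'].str.strip()
# Caution: str.lstrip(chars) strips any leading characters contained in `chars`,
# not a literal prefix, so it can also remove the first letters of the review itself
df1['reviews'] = df1['reviews'].str.lstrip('✅ Trip Verified |')
df1['reviews'] = df1['reviews'].str.lstrip('Not Verified |')
df1['reviews'] = df1['reviews'].str.lower()
print(df1)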

      Unnamed: 0                                            reviews
0              0  y competent check in staff, saw had a problem ...
1              1  check in was so slow, no self check in and bag...
2              2  my review relates to the appalling experiences...
3              3  his was my first time flying with ba & i was p...
4              4  lots of cancellations and delays and no one ap...
...          ...                                                ...
1995        1995  review |  for those who have alluded to there ...
1996        1996  british airways have randomly cancelled a flig...
1997        1997  review |  domestic ba from london and edinburg...
1998        1998  las vegas to london heathrow return, and we di...
1999        1999  my wife and i flew to dublin from london heath...

[2000 rows x 2 columns]
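
Row 0 now starts with "y competent" rather than "very competent", and row 3 with "his was" rather than "this was". `str.lstrip` treats its argument as a set of characters to strip, not as a literal prefix, so leading letters of the review that happen to be in that set are removed too. Rows 1995 and 1997 still carry a "review |" remnant for the same reason. Below is a minimal sketch of a prefix-safe alternative on plain Python strings. The regular expression and the `strip_verification_label` name are illustrative assumptions that cover only the labels visible in the output above.

```python
import re

# Optional check mark, one of the labels seen in the data, then the '|' separator
LABEL_PATTERN = re.compile(
    r"^\s*(?:✅\s*)?(?:Trip Verified|Verified Review|Not Verified)\s*\|\s*"
)

def strip_verification_label(text):
    """Remove a leading verification label, leaving the review text untouched."""
    return LABEL_PATTERN.sub("", text)

raw = "✅ Trip Verified |  Very competent check in staff"

# lstrip removes every leading character that appears in its argument,
# so the "Ver" of "Very" is eaten as well
char_stripped = raw.lstrip("✅ Trip Verified |")

# The anchored regex removes only the label and the separator
prefix_stripped = strip_verification_label(raw)

print(repr(char_stripped))
print(repr(prefix_stripped))
```

The same pattern could be applied to the whole column with `df1['reviews'].str.replace(LABEL_PATTERN.pattern, '', regex=True)`.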


In [4]:
# Remove punctuation (anything that is not a word character or whitespace)
df1['reviews'] = df1['reviews'].str.replace(r'[^\w\s]', '', regex=True)
print(df1['reviews'])

0       y competent check in staff saw had a problem w...
1       check in was so slow no self check in and bag ...
2       my review relates to the appalling experiences...
3       his was my first time flying with ba  i was pl...
4       lots of cancellations and delays and no one ap...
                              ...                        
1995    review   for those who have alluded to there b...
1996    british airways have randomly cancelled a flig...
1997    review   domestic ba from london and edinburgh...
1998    las vegas to london heathrow return and we did...
1999    my wife and i flew to dublin from london heath...
Name: reviews, Length: 2000, dtype: object



In [5]:
del df1['Unnamed: 0']
df1.head()

Unnamed: 0,reviews
0,y competent check in staff saw had a problem w...
1,check in was so slow no self check in and bag ...
2,my review relates to the appalling experiences...
3,his was my first time flying with ba i was pl...
4,lots of cancellations and delays and no one ap...


In [7]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df1.to_csv("data/BA_reviews_cleaning.csv", index=False)

## Text preprocessing functions
The preprocessing has to ensure the following:
1. No useless text data.
2. No uppercase letters (turn all letters to lowercase).
3. No punctuation.
4. Tokenization and stop-word handling are applied.

In [30]:
import string   # needed for punctuation removal
from stop_words import get_stop_words   # alternatively: from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize # splits a string into a list of word tokens

# preprocessing function for sentiment analysis
def sentiment_clean_text(text):
    '''
    Clean a raw review string for sentiment analysis.

    Parameters
    ----------
    text : string before preprocessing.

    Returns
    -------
    text : string after preprocessing.

    '''

    # A. Remove useless text data (if any): everything before the first '|'
    #    is the verification label, so keep only what follows it
    if '|' in text:
        text = text.split('|', 1)[1]

    # B. Turn all letters into lowercase
    text = text.lower()

    # C. Remove all punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    return text


# preprocessing function for emotion analysis
def emotion_clean_text(text):
    '''
    Tokenize a cleaned review and drop its stop words.

    Parameters
    ----------
    text : string

    Returns
    -------
    text_list : list of text words after cleaning.

    '''

    # D. Fourth step: tokenization and stop-word removal
    #    Tokenization: turning a string into a list of words.
    #    Stop words: common words that carry little meaning for sentiment analysis.

    # Tokenization (requires NLTK's tokenizer data to be downloaded)
    tokens = word_tokenize(text, "english")

    # Load the stop words into a set for fast membership tests
    stop_words = set(get_stop_words('english'))  # or: stopwords.words('english')

    # Remove stop words from the tokenized word list
    text_list = [word for word in tokens if word not in stop_words]

    # return the list of words
    return text_list
    

## Emotion text dictionary function
- Now it's time to write a function that builds the emotions dictionary

In [8]:
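def emotion_maping(file, di):
    '''
    Read an emotions file and store its word-to-emotion pairs in a dictionary.

    Parameters
    ----------
    file : open emotions file (an iterable of lines).
    di : dictionary to fill.

    Returns
    -------
    di : emotions dictionary.

    '''
    for line in file:
        clear_line = line.replace("\n", '').replace(",", '').replace("'", '').strip()
        # skip blank or malformed lines that have no ':' separator
        if ':' not in clear_line:
            continue
        word, emotion = clear_line.split(':', 1)
        # strip the whitespace left around the ':' separator
        di[word.strip()] = emotion.strip()

    return di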
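
The `replace` calls suggest that each line of the emotions file looks roughly like `'word': 'emotion',`. That layout is an assumption, since the file itself is not shown here. Under that assumption, here is a small self-contained sketch of the parsing logic, run on made-up sample lines through `io.StringIO` instead of a real file (the `parse_emotions` name is hypothetical):

```python
import io

def parse_emotions(lines):
    """Parse lines of the assumed form 'word': 'emotion', into a dictionary."""
    mapping = {}
    for line in lines:
        clear_line = line.replace("\n", "").replace(",", "").replace("'", "").strip()
        if ":" not in clear_line:
            continue  # skip blank or malformed lines
        word, emotion = clear_line.split(":", 1)
        # Strip the space that follows the colon so values compare cleanly
        mapping[word.strip()] = emotion.strip()
    return mapping

# Made-up sample content standing in for data/emotion.txt
sample_file = io.StringIO("'delighted': 'happy',\n'furious': 'angry',\n\n")

emotions = parse_emotions(sample_file)
print(emotions)
```

Without the `strip()` on the value, every emotion would keep a leading space (for example `' happy'`), which makes later counting and comparison error-prone.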

## VADER sentiment analysis function
- Now that the data is cleaned, we are ready to do sentiment analysis

In [32]:
# let's import the needed packages
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# function to carry out the sentiment analysis
def sentiment_analyze(text):
    '''Return 0 if the VADER negative score exceeds the positive score, else 1.'''

    scores = SentimentIntensityAnalyzer().polarity_scores(text) # returns a dictionary of scores

    if scores['neg'] > scores['pos']:
        return 0
    else:
        return 1
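
`polarity_scores` returns a dictionary with `neg`, `neu`, `pos` and `compound` entries. The labelling rule itself can be illustrated without NLTK. The score dictionaries below are made up for illustration and are not real VADER output, and `label_from_scores` is a hypothetical name:

```python
def label_from_scores(scores):
    """Return 0 when the negative score exceeds the positive one, otherwise 1."""
    return 0 if scores["neg"] > scores["pos"] else 1

# Made-up score dictionaries in the shape that polarity_scores returns
mostly_negative = {"neg": 0.40, "neu": 0.50, "pos": 0.10, "compound": -0.62}
mostly_positive = {"neg": 0.05, "neu": 0.55, "pos": 0.40, "compound": 0.71}
tie = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

labels = [label_from_scores(s) for s in (mostly_negative, mostly_positive, tie)]
print(labels)
```

Because the comparison is strict, a review whose negative and positive scores are equal (including a fully neutral one) is labelled 1.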

## Main Section
- apply functions to the dataset texts

In [9]:
import nltk
nltk.download('vader_lexicon')
import opendatasets as od
# define needed data structures

cleaned_text = ""
temp_emotion_list = []
score = 0
emotion_dict = {}
words_score_dict = {}
moods_list_st = []
moods_list_tp = []

# create category list for better understanding 
airline_main_categories = ['flight','service','seat','food','crew','time','good','class','cabin','seats','staff','business']
temp_category_list = []

# get the emotion dictionary ready (the file is closed automatically)
with open('data/emotion.txt', 'r', encoding='utf-8') as emotion_file:
    emotion_dict = emotion_maping(emotion_file, emotion_dict)

# ------------------------ loop for the Skytrax DataFrame ------------------------------

# Assumption: df_st is the Skytrax DataFrame, i.e. the cleaned df1 from above
df_st = df1.copy()
cleaned_lists_st = []

# loop over all reviews in the Skytrax DataFrame
for i in range(len(df_st)):

    # get the review at index i
    text = str(df_st['reviews'][i])

    # Step 1: simple clean for sentiment analysis
    cleaned_text = sentiment_clean_text(text)

    # Step 2: sentiment analysis
    score = sentiment_analyze(cleaned_text)
    moods_list_st.append(score)

    # Step 3: advanced clean for emotions (tokenize and drop stop words)
    cleaned_text_list = emotion_clean_text(cleaned_text)
    cleaned_lists_st.append(cleaned_text_list)

    # Step 4: emotion list builder
    for word in emotion_dict.keys():
        if word in cleaned_text_list:
            temp_emotion_list.append(emotion_dict[word])

    # Step 5: category list builder
    for cat in airline_main_categories:
        if cat in cleaned_text_list:
            temp_category_list.append(cat)

# replace the reviews with their token lists and add a new column for moods
df_st['reviews'] = cleaned_lists_st
df_st['mood'] = moods_list_st


# ------------------------ loop for the Trustpilot DataFrame ------------------------------

# NOTE: df_tp (the Trustpilot reviews DataFrame) is not created anywhere in this
# notebook; it must be loaded with a 'reviews' column before this loop runs
cleaned_lists_tp = []

# loop over all reviews in the Trustpilot DataFrame
for i in range(len(df_tp)):

    # get the review at index i
    text = str(df_tp['reviews'][i])

    # Step 1: simple clean for sentiment analysis
    cleaned_text = sentiment_clean_text(text)

    # Step 2: sentiment analysis
    score = sentiment_analyze(cleaned_text)
    moods_list_tp.append(score)

    # Step 3: advanced clean for emotions (tokenize and drop stop words)
    cleaned_text_list = emotion_clean_text(cleaned_text)
    cleaned_lists_tp.append(cleaned_text_list)

    # Step 4: emotion list builder
    for word in emotion_dict.keys():
        if word in cleaned_text_list:
            temp_emotion_list.append(emotion_dict[word])

    # Step 5: category list builder
    for cat in airline_main_categories:
        if cat in cleaned_text_list:
            temp_category_list.append(cat)

# replace the reviews with their token lists and add a new column for moods
df_tp['reviews'] = cleaned_lists_tp
df_tp['mood'] = moods_list_tp


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\agusa\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


NameError: name 'sentiment_clean_text' is not defined
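
The loops accumulate emotion labels and category words in the flat lists `temp_emotion_list` and `temp_category_list`. One possible way to summarise such lists is `collections.Counter`. The sketch below uses made-up stand-in lists rather than results from the reviews:

```python
from collections import Counter

# Made-up stand-ins for temp_emotion_list and temp_category_list
sample_emotions = ["angry", "happy", "angry", "sad", "angry"]
sample_categories = ["flight", "crew", "flight", "seat"]

emotion_counts = Counter(sample_emotions)
category_counts = Counter(sample_categories)

# most_common(n) returns the n most frequent items together with their counts
print(emotion_counts.most_common(2))
print(category_counts.most_common(1))
```

The resulting counts map each label to its frequency, which is a convenient shape for a bar chart or a summary table.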