# <span style="color:blue">Scraping data from airlinequality.com</span>
To understand how airline reviews have changed due to the COVID-19 pandemic, we set up a scraper using the BeautifulSoup library for Python. This Jupyter notebook is divided into three chapters:

1. The prerequisites for scraping
2. The airlinequality.com scraper
3. Saving the scraped data in a csv file

Throughout this notebook the code is explained and its purpose is stated.

# 1. The prerequisites for scraping
Before the airlinequality website can be scraped, a few things need to be done. Firstly, the necessary libraries need to be imported for the code to run. Secondly, we have to generate the seeds (the starting URLs) for the reviews of the different airlines. Lastly, we need to be able to generate all page URLs so that we can navigate through the review pages of an airline.

## 1.1 Importing necessary libraries
In the first cell of our notebook we import the libraries that are necessary to run our code. These libraries are needed for the following:

* The requests library allows us to download the pages of the airlinequality website as HTML.
* The BeautifulSoup library allows us to extract data from that HTML.
* As we want to extract quite a lot of data, we need the sleep function from the time module to pause between requests and respect retrieval limits.
* The math library is necessary for a function to count the total pages of a review for a specific airline. More about that later.
* The csv library will be used to store the scraped data in a csv file.
* The datetime library is necessary to know exactly when the data was scraped from the website as reviews will be added frequently.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
import math
import csv
from datetime import datetime

## 1.2 Seed generation
For our research it is important to examine the reviews of multiple airlines, so we need a simple method to generate the review page URLs of different airlines. We do this with a function that generates these seeds. This paragraph describes the three steps needed to achieve this.

Firstly, we define the 'base url', which is the part of the url that is the same for all airline reviews. The home page is https://www.airlinequality.com. To see all airline reviews you can go to Airline Reviews -> A-Z Airline Reviews, which gives an overview of all airlines that have reviews on the website (see image below).
<img src="../../docs/A-Z_reviews.png" />

If you click on AB Aviation and Adria Airways, you arrive at the following urls:
- https://www.airlinequality.com/airline-reviews/ab-aviation
- https://www.airlinequality.com/airline-reviews/adria-airways

Thus, we can conclude that there is a constant part in this url which we will define as our base url.

In [2]:
# Defining the base url.
base_url = 'https://www.airlinequality.com/airline-reviews/'

Secondly, a list of airlines of interest is defined. It is important that these are written exactly the same way as they are presented in the url of the airlinequality website. 

In [3]:
# A list of airlines. This can obviously be changed depending on the interest of the researcher.
airlines = ['klm-royal-dutch-airlines', 'air-china', 'american-airlines', 'air-caraibes']

Lastly, a function is defined where the input is the base url and the airline list. The function has a for loop that connects the base url with each airline name from the airline list and stores the result in a list. 

In [4]:
def generate_airline_urls(base_url, airlines):
    """
    A function to generate a list of urls for airlines on airlinequality.
  
    Two parameters:
        base_url: The part of the url that stays the same for each airline review on airlinequality.
        airlines: A list of airlines that need to be scraped. They need to be written as they are presented in the url
        of airlinequality
    
    Returns:
        A list of airline review urls
    """
    page_urls = []
    for airline in airlines:  # connect the base url with each airline name
        full_url = base_url + airline
        page_urls.append(full_url)
    return page_urls

Here we assign the results of the function to a list called airline_urls and show the result. 

In [5]:
# A list that stores all airline review urls.  
airline_urls = generate_airline_urls(base_url, airlines)
print(airline_urls)

['https://www.airlinequality.com/airline-reviews/klm-royal-dutch-airlines', 'https://www.airlinequality.com/airline-reviews/air-china', 'https://www.airlinequality.com/airline-reviews/american-airlines', 'https://www.airlinequality.com/airline-reviews/air-caraibes']


When the URL of a new airline needs to be generated, the new airline can be added to the back of the 'airlines' list. Thereafter, 'airline_urls[-1]' can be printed to quickly generate a new airline review url. We assigned airline_urls[-1] to a variable which allows us to quickly scrape the page of a different airline when testing the scraper.
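
The workflow described above can be tried out in isolation. The following is a minimal sketch, not the notebook's own cell: the helper name build_airline_urls is made up for illustration and uses a list comprehension instead of the for loop, and the airline list is shortened.

```python
# Illustrative stand-ins for the notebook's variables.
base_url = 'https://www.airlinequality.com/airline-reviews/'
airlines = ['klm-royal-dutch-airlines', 'air-china']


def build_airline_urls(base_url, airlines):
    """Join the base url with every airline name (hypothetical helper)."""
    return [base_url + airline for airline in airlines]


# Adding a new airline to the end of the list ...
airlines.append('air-caraibes')
airline_urls = build_airline_urls(base_url, airlines)

# ... makes its url available as the last element of the result.
print(airline_urls[-1])
```

Because the url list is rebuilt from the 'airlines' list, the function has to be called again after every change to that list.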

In [6]:
# A variable that gives the airline url of the airline at the end of the 'airlines' list 
Newest_page_url = airline_urls[-1]
print(Newest_page_url)

https://www.airlinequality.com/airline-reviews/air-caraibes


## 1.3 Navigating all the different review pages for an airline
To navigate the different page urls of a specific airline we have set up a function called "generate_page_urls". Information about the function is given in the docstring. We use a for loop and a counter to generate all page URLs. The counter runs over range(1, num_pages + 1): the page numbers on the website start at 1, and because range() excludes its end value we need 'num_pages' + 1 to include the last page.

In [7]:
# function to generate all page urls
def generate_page_urls(airline_url, num_pages):
    """
    A function to generate all page urls for airlines on airlinequality.
  
    Two parameters:
        airline_url: The airline URL of which you want to generate page url's 
        num_pages: the amount of pages of which you want to generate page url's
        
    Returns:
        A list of airline page url's
    """
    
    page_urls = []    
    for counter in range(1, num_pages + 1):
        full_url = airline_url + "/page/" + str(counter)
        page_urls.append(full_url)
        
    return page_urls

Here we use the "airline_urls" list of the previous paragraph to generate the first two pages of the KLM reviews on airlinequality.

In [8]:
# Page_url function for KLM
print(generate_page_urls(airline_urls[0], 2))

['https://www.airlinequality.com/airline-reviews/klm-royal-dutch-airlines/page/1', 'https://www.airlinequality.com/airline-reviews/klm-royal-dutch-airlines/page/2']
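
The counter range used in generate_page_urls can be checked on its own. This small sketch (with an arbitrary num_pages value) shows that range() stops before its second argument, which is why the function adds 1.

```python
num_pages = 3  # arbitrary value, for illustration only

# range() starts at its first argument and stops *before* its second one.
without_plus_one = list(range(1, num_pages))
with_plus_one = list(range(1, num_pages + 1))

print(without_plus_one)  # [1, 2]
print(with_plus_one)     # [1, 2, 3]
```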


### <ins>Total page function</ins>
However, to scrape all the pages of the airline we will need the total amount of pages. To avoid looking this up manually for each new airline, we have set up a function that calculates this. On the first page of the airline review URL (the 'airline_urls' list) the total reviews are given in text. We use this text (highlighted in red in the image below) to generate the total pages.
<img src="../../docs/total_pages.png" />
We modify this text in three ways to get the total pages:
1. We remove the unnecessary text with the replace function so that only the total number of reviews remains.
2. We transform the total number of reviews to a float so that we can divide it and round the result with the math package.
3. We calculate the total pages by dividing the total reviews by 10 and rounding the result upwards with the math.ceil function. We divide by 10 because by default airlinequality shows 10 reviews per page. Furthermore, the result needs to be rounded upwards as leftover reviews are shown on the last page.

Once we have the total pages it could be used in the 'generate_page_urls' function to generate all page_urls. 
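
The calculation can be tested without a network request by applying it to a sample text. The sketch below is an alternative, not the notebook's method: it assumes the text has the form '1 to 10 of N Reviews' and uses a regular expression instead of replace(). The helper name and the handling of a thousands separator are assumptions.

```python
import math
import re


def pages_from_pagination_text(text, reviews_per_page=10):
    """Derive the number of pages from a text such as '1 to 10 of 1234 Reviews'."""
    match = re.search(r'of\s+([\d,]+)\s+Reviews', text)
    if match is None:
        raise ValueError('unexpected pagination text: ' + repr(text))
    # A possible thousands separator is removed before converting to a number.
    total_reviews = int(match.group(1).replace(',', ''))
    # Leftover reviews end up on an extra page, hence the rounding upwards.
    return math.ceil(total_reviews / reviews_per_page)


print(pages_from_pagination_text('1 to 10 of 1234 Reviews'))  # 124
print(pages_from_pagination_text('1 to 10 of 40 Reviews'))    # 4
```

Raising an error for an unexpected text makes a change in the website layout visible immediately instead of producing a wrong page count.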

In [9]:
# function to calculate the total pages for an airline
def total_pages(airline_url):
    """
    A function to calculate the total pages.
  
    One parameter:
        airline_url: The airline URL of which you want to calculate the total pages
        
    Returns:
        The total pages of an airline on airlinequality.com
    """
    res = requests.get(airline_url)
    review_source_code = res.text
    soup = BeautifulSoup(review_source_code, 'html.parser')
    
    text = soup.find(class_='pagination-total').get_text()
    clean = text.replace('1 to 10 of ','') # removing unnecessary text
    total_reviews = float(clean.replace(' Reviews','')) # removing unnecessary text and transforming variable to a float
    num_pages = math.ceil(total_reviews/10) # dividing the number by ten and rounding it upwards

    return num_pages # returning the value so that it can be passed on to generate_page_urls()

In [10]:
# The total pages for KLM (123 at 14-10-21)
total_pages(airline_urls[0])

124


# 2. The airlinequality.com scraper
In this chapter we introduce our scraper function. The function takes as input the list of page urls that we were able to generate in the previous chapter. With a for loop we iterate over all the reviews on the different page urls. Each review has a similar structure, as shown in the image below. The top half of the review contains the title, the name of the writer, the country of the reviewer, the verify status, and the date the review was posted. The bottom half of the review contains a table with various information. Firstly, we discuss how we gather the data from the top half of the review. Secondly, we discuss how we gather the data from the bottom half of the review, which is a bit more complex. Thirdly, we explain why the sleep() function and datetime.now() are used to finalize the scraper function.
<img src="../../docs/review_example.png" />

## 2.1 Scraping the top half of the review
This half of the review is quite straightforward to scrape. We look for the tags in which the text is located and use the find() function of BeautifulSoup together with get_text(). For the date the review was published we instead read the attribute "datetime" of the "time" tag, because this value is easier to convert to a date variable in R if we want to conduct an analysis (see image below).
<img src="../../docs/date_review_is_published.png" />
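
The value of the "datetime" attribute has the form year-month-day (for example '2021-10-16'), which also converts directly to a date in Python. The sketch below only illustrates this. The cut-off date is an arbitrary example and not part of the analysis.

```python
from datetime import date

# A value in the format found in the "datetime" attribute.
published = date.fromisoformat('2021-10-16')

print(published.year, published.month, published.day)  # 2021 10 16

# Dates can be compared directly, e.g. against an (arbitrary) cut-off.
cut_off = date(2020, 1, 1)
print(published >= cut_off)  # True
```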


The verify status, the review rating and the country of the reviewer needed a different approach as in older reviews this data can sometimes be missing (see image below highlighted in red).
<img src="../../docs/tophalf_old.png" />
Therefore, we have given the verify status and the review rating a default value and set up an if-condition. The find() function returns None if nothing was found, so our if-condition checks that the result is not None. If the data is not there, the default value is kept. If the data is there, we use the same approach as discussed before to assign the text to a variable.
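
The default-value pattern can be illustrated on a small piece of made-up HTML. The markup below is an assumption that only mimics a review with and a review without a rating. It is not copied from the website.

```python
from bs4 import BeautifulSoup

# Two made-up review snippets: the second one has no rating.
html = """
<div class="media"><span itemprop="ratingValue">8</span></div>
<div class="media"></div>
"""
soup = BeautifulSoup(html, 'html.parser')

ratings = []
for review in soup.find_all(class_='media'):
    rating = 'n/a'  # default value, set again for every review
    tag = review.find('span', itemprop='ratingValue')
    if tag is not None:  # find() returns None when nothing matches
        rating = tag.get_text()
    ratings.append(rating)

print(ratings)  # ['8', 'n/a']
```

Setting the default inside the loop matters: otherwise a review without a rating would keep the value of the review before it.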

For the Verify status there were some complications due to differences between older and newer reviews (see images below).
<img src="../../docs/verify_status.png" />

Although .find('strong').get_text() might work in 2021 and 2017 it will generate wrong text in 2015. Therefore we have corrected this with an if-condition: 
- if review_verify == "Delta Air Lines": 
    review_verify = 'n/a'

For the country of the reviewer we also had to be creative, as this data is awkwardly positioned in between two different tags (see image below).
<img src="../../docs/country.png" />
Thus, to extract this data we use the previous sibling of an element, which is the node that comes directly before it. As the country is given right before the date, we take the previous sibling of the 'time' tag. The country is always in between brackets, so we use these brackets in an if-else condition: if brackets are found, they are removed with replace() so that only the country is returned. If the brackets are not found, the country is given a default value, similar to the verify status and review rating.
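
A minimal sketch of this approach on made-up markup is given below. The two h3 snippets are assumptions that only mimic the layout described above (a name tag, an optional country in brackets, and a time tag), and extract_country is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_country(h3):
    """Return the country that stands in brackets before the time tag, or 'n/a'."""
    sibling = h3.find('time').previous_sibling  # node directly before the time tag
    text = str(sibling) if sibling is not None else ''
    if '(' in text and ')' in text:
        return text.replace('(', '').replace(')', '').strip()
    return 'n/a'


# Made-up markup: one review with a country and one without.
new_style = ('<h3><span itemprop="name">A. Reviewer</span> (Czech Republic) '
             '<time datetime="2021-10-16">16th October 2021</time></h3>')
old_style = ('<h3><span itemprop="name">A. Reviewer</span> '
             '<time datetime="2015-05-01">1st May 2015</time></h3>')

countries = []
for html in (new_style, old_style):
    h3 = BeautifulSoup(html, 'html.parser').find('h3')
    countries.append(extract_country(h3))

print(countries)  # ['Czech Republic', 'n/a']
```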

## 2.2 Scraping the bottom half of the review
To scrape the table of the review we first save the source code of all table rows in a list using find_all(). Afterwards we iterate over the list of rows. As with some data in the top half of the review, the table can be completely empty (see image below), so default values are given.
<img src="../../docs/bottomhalf_old.png" />
We once again use if-conditions to change the default values if the specific class is found in the row. 

For the star rating we needed to adopt a different strategy because if we would use row.find(class_= "stars").get_text() we would get '12345' for all ratings (see image below). 
<img src="../../docs/star_ratings.png" />
Thus, instead we focus on the column with the star rating and with col.find_all(class_ = "star fill") we are able to store the ratings in a list. With the len() function we are then able to accurately get the star rating. 
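
The difference between the two approaches can be seen on a small, made-up table row. The class names follow the ones used in this notebook, but the exact markup is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up row with three filled stars out of five.
html = """
<table><tr>
  <td class="review-rating-header seat_comfort">Seat Comfort</td>
  <td class="review-rating-stars stars">
    <span class="star fill">1</span><span class="star fill">2</span>
    <span class="star fill">3</span><span class="star">4</span>
    <span class="star">5</span>
  </td>
</tr></table>
"""
row = BeautifulSoup(html, 'html.parser').find('tr')
col = row.find(class_='stars')

all_star_text = col.get_text(strip=True)               # text of every star
filled_stars = len(col.find_all(class_='star fill'))   # only the filled stars

print(all_star_text)  # 12345
print(filled_stars)   # 3
```

Searching for the string 'star fill' matches only elements whose class attribute is exactly that value, so the unfilled stars with class 'star' are not counted.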

In older reviews it was sometimes possible to state 'N/A' for a specific star rating (see image below).
<img src="../../docs/table_na.png" />
With the code explained above, a star rating of 0 would be assigned to all ratings in this review, as find_all() returns an empty list if nothing was found. Thus, we have corrected this with an if-condition that replaces a rating of 0 with 'n/a', for example: if seat_comfort == 0: seat_comfort = 'n/a'. This is safe as it is impossible to give a 0 star rating. This was done for all values with a star rating.
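
Since the same correction is applied to every star rating, it could also be written once as a small helper. This is only a sketch of a possible refactoring. The helper name stars_or_na is made up and the scraper in this notebook does not use it.

```python
def stars_or_na(filled_stars):
    """Return the star count, or 'n/a' when no star is filled (hypothetical helper)."""
    return filled_stars if filled_stars > 0 else 'n/a'


# Illustrative counts as they would come out of len(col.find_all(...)).
counts = {'seat_comfort': 0, 'value_for_money': 3}
ratings = {name: stars_or_na(count) for name, count in counts.items()}

print(ratings)  # {'seat_comfort': 'n/a', 'value_for_money': 3}
```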

## 2.3 Finishing the scraper
To finish the scraper we use the sleep() function to avoid breaching the retrieval limit. In addition, when the scrape function is executed it prints the current time using datetime.now(). The current time can then be used in the name of the csv file so that it is clear at what exact moment the data was scraped.
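
One way to combine a pause per page with a timestamp is sketched below. To keep the example runnable without network access, the request is replaced by a stand-in function. The helper name visit_politely and its parameters are assumptions for illustration.

```python
from datetime import datetime
from time import sleep


def visit_politely(page_urls, fetch, delay_seconds=5):
    """Call fetch(url) for every url and pause after each call (hypothetical helper)."""
    results = []
    for url in page_urls:
        results.append(fetch(url))
        sleep(delay_seconds)  # wait before requesting the next page
    # The timestamp records when the data was collected.
    return results, datetime.now()


# str.upper stands in for requests.get() so that no website is contacted.
pages, scraped_at = visit_politely(['page/1', 'page/2'], fetch=str.upper,
                                   delay_seconds=0.01)
print(pages)  # ['PAGE/1', 'PAGE/2']
print(scraped_at)
```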

In [11]:
# The scraper
def scrape(page_urls):
    """
    A function that scrapes the data of the airlinequality.com website
  
    One parameter:
        page_urls: A list of page urls of an airline review on airlinequality.com
        
    Returns:
        A list of dictionaries with all the available airline review data of a specific airline, with the current date and time 
    """
    
    review_data = [] 
    for page_url in page_urls: 
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, "html.parser")
        reviews = soup.find_all(class_= 'media')
        
        for review in reviews:
            # older reviews sometimes do not have a rating, so a default value is set for every review
            review_rating = 'n/a'
            
            # similar to the review rating, the verify status is sometimes not given, so it gets the default 'Not Verified'
            review_verify = 'Not Verified' 
            
            review_writer = review.find("span", itemprop="name").get_text() # the name of the review writer 
            review_title = review.find("h2").get_text() # review title
            review_published = review.find("time").attrs["datetime"] # date the review is published
            
            # in older reviews the country is sometimes leftout which is why we assign the value 'n/a' to it initially
            # the country is always inbetween brackets that is why we look if they are present in the string
            h3 = review.find('h3')
            review_country = h3.find('time').previous_sibling or '' # the node directly before the time tag
            if "(" not in review_country or ")" not in review_country: 
                review_country = 'n/a'
            else:
                review_country = review_country.replace(' (','').replace(') ','')
            
            # the find function returns None if nothing is found 
            # thus the following if-conditions will adjust the default values if the rating and verify statements are given  
            rating_tag = review.find("span", itemprop="ratingValue")
            if rating_tag is not None:
                review_rating = rating_tag.get_text() # review rating
            
            # if the verify status is mentioned then it will be in the strong tag
            strong_tag = review.find('strong')
            if strong_tag is not None:
                review_verify = strong_tag.get_text() # verified or not
                
                # correction for the delta air lines text (see 2.1)
                if review_verify == "Delta Air Lines":
                    review_verify = 'n/a'
            
            # table information
            # default values if the data is not present in the review
            Type_Of_Traveller = 'n/a'
            cabin_flown  = 'n/a'
            route  = 'n/a'
            aircraft = 'n/a'
            date_flown = 'n/a'
            value_for_money = 'n/a'
            ground_service = 'n/a'
            seat_comfort = 'n/a'
            cabin_staff_service = 'n/a'
            food_and_beverages = 'n/a'
            inflight_entertainment = 'n/a'
            wifi_and_connectivity = 'n/a'
            recommended = 'n/a'
            
            # html code for the table and all table rows in a list
            table = review.find('table', attrs={'class':'review-ratings'})
            all_rows = table.find_all('tr')
            
            # for every table row in the review, the data is saved and assigned to a variable
            for row in all_rows:
                if row.find(class_= "type_of_traveller") != None:
                    Type_Of_Traveller =  row.find(class_= "review-value").get_text()
                if row.find(class_= "cabin_flown") != None: 
                    cabin_flown = row.find(class_= "review-value").get_text() 
                if row.find(class_= "route") != None: 
                    route = row.find(class_= "review-value").get_text() 
                if row.find(class_= "aircraft") != None: 
                    aircraft = row.find(class_= "review-value").get_text()
                if row.find(class_= "date_flown") != None:
                    date_flown = row.find(class_= "review-value").get_text()
                if row.find(class_= "recommended") != None:
                    recommended = row.find(class_= "review-value").get_text()
                    
                # extracting the star ratings by calculating the length of the list created by find_all(class_ = "star fill")
                if row.find(class_= "value_for_money") != None:
                    col = row.find(class_= "stars")
                    value_for_money = len(col.find_all(class_ = "star fill"))
                    
                    # some older reviews do not work with star ratings but can give N/A as a value. This means that the value 
                    # above will be 0, however this score was not actually given by the reviewer. This if-condition corrects this
                    if value_for_money == 0:
                        value_for_money = 'n/a'
                        
                if row.find(class_= "ground_service") != None:
                    col = row.find(class_= "stars")
                    ground_service = len(col.find_all(class_ = "star fill"))
                    if ground_service == 0:
                        ground_service = 'n/a'     
                    
                if row.find(class_= "seat_comfort") != None:
                    col = row.find(class_= "stars")
                    seat_comfort = len(col.find_all(class_ = "star fill"))
                    if seat_comfort == 0:
                        seat_comfort = 'n/a'
                        
                if row.find(class_= "cabin_staff_service") != None:
                    col = row.find(class_= "stars")
                    cabin_staff_service = len(col.find_all(class_ = "star fill"))
                    if cabin_staff_service == 0:
                        cabin_staff_service = 'n/a'
                        
                if row.find(class_= "food_and_beverages") != None:
                    col = row.find(class_= "stars")
                    food_and_beverages = len(col.find_all(class_ = "star fill"))
                    if food_and_beverages == 0:
                        food_and_beverages = 'n/a'
                    
                if row.find(class_= "inflight_entertainment") != None:
                    col = row.find(class_= "stars")
                    inflight_entertainment = len(col.find_all(class_ = "star fill"))
                    if inflight_entertainment == 0:
                        inflight_entertainment = 'n/a'
                    
                if row.find(class_= "wifi_and_connectivity") != None:
                    col = row.find(class_= "stars")
                    wifi_and_connectivity = len(col.find_all(class_ = "star fill"))
                    if wifi_and_connectivity == 0:
                        wifi_and_connectivity = 'n/a'

            # Saving the data in a dictionary
            review_data.append({'Review Rating': review_rating,
                                'Review Writer': review_writer,
                                'Title': review_title,
                                'Date Published': review_published, 
                                'Verify Status':  review_verify,
                                'Country': review_country,
                                'Aircraft': aircraft,
                                'Route': route,
                                'Type Of Traveller': Type_Of_Traveller,
                                'Seat type': cabin_flown,
                                'Date Flown': date_flown,
                                'Value For Money (rating out of five)': value_for_money,
                                'Seat Comfort (rating out of five)': seat_comfort,
                                'Cabin Staff Service (rating out of five)': cabin_staff_service,
                                'Food & Beverages (rating out of five)': food_and_beverages,
                                'inflight_entertainment (rating out of five)': inflight_entertainment,
                                'Wifi & Connectivity (rating out of five)': wifi_and_connectivity,
                                'Ground Service (rating out of five)': ground_service,
                                'Recommended': recommended})   
        sleep(5) # pausing after every page request to avoid breaching the retrieval limit
    print(datetime.now())
    return review_data

The scraped data is saved as "KLM_data". We have included 123 pages, which is the total number of pages that we found earlier using the total_pages() function.

In [12]:
# previewing the first page of reviews
KLM_data = scrape(generate_page_urls(airline_urls[0], 123))
print(KLM_data[0:10])

2021-10-17 15:36:20.974732
[{'Review Rating': '1', 'Review Writer': 'Martin Dite', 'Title': '"time moved back without informing us"', 'Date Published': '2021-10-16', 'Verify Status': 'Trip Verified', 'Country': 'Czech Republic', 'Aircraft': 'n/a', 'Route': 'Amsterdam to Prague', 'Type Of Traveller': 'Solo Leisure', 'Seat type': 'Economy Class', 'Date Flown': 'October 2021', 'Value For Money (rating out of five)': 1, 'Seat Comfort (rating out of five)': 'n/a', 'Cabin Staff Service (rating out of five)': 'n/a', 'Food & Beverages (rating out of five)': 'n/a', 'inflight_entertainment (rating out of five)': 'n/a', 'Wifi & Connectivity (rating out of five)': 'n/a', 'Ground Service (rating out of five)': 1, 'Recommended': 'no'}, {'Review Rating': '8', 'Review Writer': 'Anders Pedersen', 'Title': '"good value for money"', 'Date Published': '2021-10-15', 'Verify Status': 'Not Verified', 'Country': 'Vietnam', 'Aircraft': 'Boeing 787-8 and A330', 'Route': 'Copenhagen to Kigali via Amsterdam', 'Ty

# 3. Saving the scraped data in a csv file
Now that the data has been scraped, we have made a function that converts the data into a csv file. First, we explain the function and then we briefly cover where the data is saved.

## 3.1 The CSV function (with windows correction)
For this function we use the writer() function of the csv package. The 'w' flag in the open() call indicates that the csv file will be overwritten each time. This is chosen as reviews are very rarely changed, so having historical data is less relevant. Normally csv.writer writes \r\n into the file directly. However, on Windows it would write \r\r\n, because in Windows text mode each \n is translated into \r\n. Therefore we disable this newline translation with the parameter newline='' (empty string). Source: https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row.

As a delimiter we use the semicolon (";"), as it is not used in the scraped data, which means that no surprise columns will be created. With writerow() the table headers are written. Lastly, we iterate over each review in the scraped data to make sure all data falls under the right header.
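
As an alternative sketch (not the function used in this notebook), csv.DictWriter can write a list of dictionaries directly, so that the headers and the values do not have to be listed twice. The example writes to an in-memory buffer instead of a file and uses made-up data with only two columns. The title deliberately contains a semicolon to show how the csv module quotes such a value.

```python
import csv
import io

# Made-up review data with a semicolon inside a value.
rows = [{'Title': '"good value; for money"', 'Recommended': 'yes'}]

# An in-memory buffer stands in for
# open(path, "w", newline='', encoding='utf-8').
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['Title', 'Recommended'], delimiter=';')
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back gives the original values again.
read_back = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline=''),
                                delimiter=';'))
print(read_back[0]['Title'])
```

Because the csv module quotes values that contain the delimiter, a semicolon inside a review title would not create an extra column when the file is read back with the same delimiter.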

In [52]:
# making a function that will convert the scraped data to a csv file
def convert(airline_data, airline):
    """
    A function that converts the scraped data of airlinequality.com to a csv file
  
    Two parameters:
        airline_data: A list of dictionaries created by the scrape() function of a specific airline
        airline: the name of the airline
        
    Returns:
        Nothing. It writes a csv file in the data folder and prints 'done!'. 
    """
    with open("../../data/" + str(airline) + "_data.csv", "w", newline='', encoding='utf-8') as csv_file: 
        writer = csv.writer(csv_file, delimiter = ";")
        writer.writerow(['Review Rating',
                         'Review Writer',
                         'Title',
                         'Date Published', 
                         'Verify Status',
                         'Country',
                         'Aircraft',
                         'Route',
                         'Type Of Traveller',
                         'Seat type',
                         'Date Flown',
                         'Value For Money (rating out of five)',
                         'Seat Comfort (rating out of five)',
                         'Cabin Staff Service (rating out of five)',
                         'Food & Beverages (rating out of five)',
                         'inflight_entertainment (rating out of five)',
                         'Wifi & Connectivity (rating out of five)',
                         'Ground Service (rating out of five)',
                         'Recommended'])
    
        for data in airline_data: 
            writer.writerow([data['Review Rating'],
                             data['Review Writer'],
                             data['Title'],
                             data['Date Published'],
                             data['Verify Status'],
                             data['Country'],
                             data['Aircraft'],
                             data['Route'],
                             data['Type Of Traveller'],
                             data['Seat type'],
                             data['Date Flown'],
                             data['Value For Money (rating out of five)'],
                             data['Seat Comfort (rating out of five)'],
                             data['Cabin Staff Service (rating out of five)'],
                             data['Food & Beverages (rating out of five)'],
                             data['inflight_entertainment (rating out of five)'],
                             data['Wifi & Connectivity (rating out of five)'],
                             data['Ground Service (rating out of five)'],
                             data['Recommended']])
    print("done!")

## 3.2 Saving the scraped data in a csv file
Now we convert the KLM data to a csv file. As seen below, when the convert() function is executed it prints 'done!' as output. This confirms that the function was executed without a problem. The csv file is automatically stored in the data folder of this directory, because "../../data/" is prepended to the file name. In addition, the airline name is automatically added to the file name because the second parameter of the function is converted to a string. We have manually added the date on which the KLM data was scraped to the csv file name. We took this date and time from the scrape function, which prints it when the function is executed (see paragraph 2.3).
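
Adding the scrape date to the file name could also be automated. The sketch below is only a suggestion: the helper name build_csv_path and the file name pattern are assumptions, and the convert() function in this notebook does not use them.

```python
from datetime import datetime
from pathlib import Path


def build_csv_path(airline, scraped_at, folder='../../data'):
    """Build a csv path that contains the airline and the scrape time (hypothetical helper)."""
    stamp = scraped_at.strftime('%Y-%m-%d_%H-%M')
    return Path(folder) / f'{airline}_data_{stamp}.csv'


# Illustrative timestamp; in practice this would be the time of the scrape.
csv_path = build_csv_path('klm-royal-dutch-airlines', datetime(2021, 10, 17, 15, 36))
print(csv_path.name)  # klm-royal-dutch-airlines_data_2021-10-17_15-36.csv
```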

In [53]:
# Making a csv file for the KLM data.
convert(KLM_data, airlines[0])

done!
