## Airline Review Extraction for Text Classification and Topic Modeling

*Install relevant packages and libraries*

In [1]:
# Install playwright library

# !pip3 install playwright   # for package

# !pip3 install asyncio      # for async calls

# !playwright install        # for browser binaries

*Import relevant packages and libraries*

In [2]:
# Import relevant libraries

import os 
import datetime 
import re

import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter 
import random

# Additional libraries

import shutil
import numpy as np
import pandas as pd
import json
from json import loads, dumps
import string
import asyncio
from playwright.async_api import async_playwright
from playwright.sync_api import sync_playwright
from pprint import pprint


import seaborn as sns
import plotly.express as px

from string import punctuation
from nltk.corpus import stopwords
from nltk.metrics import ConfusionMatrix

import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from random import randint

import warnings
warnings.filterwarnings("ignore")



[nltk_data] Downloading package wordnet to /Users/dunya/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Initial TripAdvisor Scraper

The goal of this project is to perform text classification and topic modeling on extracted airline reviews. To acquire this data, the Playwright library is used to scrape reviews from TripAdvisor, a popular travel website. An initial examination of the TripAdvisor website showed that many airlines were either (1) too small to be considered or (2) no longer operating. The initial web scraper was therefore written to extract every airline reviewed on TripAdvisor, along with the total number of reviews written for each one. The methods for this are below:

*TripAdvisor Scrape Methods*

In [3]:
# Helper to concat list of lists

def flatten_sum(matrix):
    return sum(matrix, [])

# Helper to remove punctuation and clean reviews string

def conv_int(text):
    arr = text.split(' ')
    strg = arr[0].translate(str.maketrans('', '', string.punctuation))
    return strg

# Function to scrape items on TripAdvisor page

async def scrape_page(page):
    info = list()
    airline_href = page.locator('//div[@class="airlineSummary"]/a[1]')  # get HTML elements
    airline_name = page.locator('//div[@class="airlineSummary"]')
    airline_review_count = page.locator('//div[@class="airlineSummary"]/a[@class="detailsLink"]')
    
    hrefs = await airline_href.all()                        # get <a> tags
    names = await airline_name.all_inner_texts()            # get airline names
    reviews = await airline_review_count.all_inner_texts()  # get airline review counts

    for idx in range(len(names)):
        name = names[idx].split('\n')[0]                    # extract airline name
        href = await hrefs[idx].get_attribute('href')       # extract href from link
        href = f"https://www.tripadvisor.com{href}"         # build out complete link
        el = {
            "airline_link": href,
            "airline_name": name,
            "airline_reviews": reviews[idx]
        }
        info.append(el)                                     # put info into list to return
    return info

# Function to go to next page

async def paginate(page, browser):
    try:
        next_page = page.locator('//span[@class="nav next ui_button primary"]')
        await next_page.click()
        return page
    except Exception:
        await browser.close()                               # no next page: close the browser

# Main execution function using methods defined above

async def main():
    async with async_playwright() as p:

        # Init browser
        browser = await p.chromium.launch(headless=False, slow_mo=1500)
        page = await browser.new_page()

        # Go to trip advisor
        await page.goto("https://www.tripadvisor.com/Airlines")

        # Scrape page and paginate
        airline_details = list()
        pg = 0
        while pg < 60:
            scraped = await scrape_page(page)
            airline_details.append(scraped)
            pg += 1
            await paginate(page, browser)

        # Flatten list of lists stored in airline_details
        
        # print("Length before concat: ", len(airline_details)) 
        airline_deets = flatten_sum(airline_details)
        # print("Length after concat: ", len(airline_deets))      
        return airline_deets

The use-case for each method above is as follows:

- flatten_sum: This is a helper function that takes in a list of lists and flattens it, returning a one-dimensional list

- conv_int: This is a helper function that takes the total reviews string from the website, keeps the leading count, and removes its punctuation. It returns the digits as a string, which is converted to a number later with pd.to_numeric()

- scrape_page: This function deals with the main extraction of text from the HTML elements contained on a listing page. It identifies where the airline names and review totals are, and stores them along with the links to each airline's review page

- paginate: This function is a scraping helper that moves the browser on to the next page of airline listings

- main: This is the main execution function, employing all of the previous methods to gather relevant information
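
As a quick illustration of the two helpers, the sketch below applies them to made-up sample values (the entries in `pages` are placeholders, not scraped data). It also shows `itertools.chain.from_iterable` as an alternative way to flatten; that is a suggestion for larger inputs, not what the scraper above uses.

```python
import string
from itertools import chain

def flatten_sum(matrix):
    # Same helper as above: concatenates a list of lists
    return sum(matrix, [])

def conv_int(text):
    # Same helper as above: keeps the first space-separated chunk
    # and strips punctuation such as the thousands separator
    first = text.split(' ')[0]
    return first.translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample pages, each a list of scraped entries
pages = [[{"airline_name": "A"}, {"airline_name": "B"}], [{"airline_name": "C"}]]

flat = flatten_sum(pages)
flat_chain = list(chain.from_iterable(pages))   # alternative flattening

count_str = conv_int("109,503 reviews")   # still a string at this point
count = int(count_str)                    # explicit conversion to an integer
print(len(flat), flat == flat_chain, count_str, count)
```

Because `sum(matrix, [])` builds a new list at every step, the `chain` variant may scale better when many pages are collected.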

*Initial TripAdvisor Dataframe*

In [4]:
# Execute web scraper to get review information

information = await main()

In [5]:
# Extract number of reviews per airline
 
airlines_df = pd.DataFrame(information)
airlines_df['review_count'] = airlines_df['airline_reviews'].apply(conv_int)
airlines_df['review_count'] = pd.to_numeric(airlines_df['review_count'])

# Get top airline reviews

airlines_df = airlines_df.sort_values('review_count', ascending=False)

The code above converts the information extracted from TripAdvisor into a pandas DataFrame. It applies the conv_int() function mentioned previously to strip the punctuation and trailing text from each review total, and pd.to_numeric() then converts the result to a number. Once the review counts are numeric, the values can be sorted in descending order.

*Save JSON*

In [6]:
# Save scraped information
 
# Serializing json
json_object = json.dumps(information, indent=4)
 
# Writing to json file
with open("review_counts.json", "w") as outfile:
    outfile.write(json_object)

display(airlines_df.head(10))

| index | airline_link | airline_name | airline_reviews | review_count |
|---|---|---|---|---|
| 557 | https://www.tripadvisor.com/Airline_Review-d87... | Transavia | 109,503 reviews | 109503.0 |
| 454 | https://www.tripadvisor.com/Airline_Review-d87... | Ryanair | 87,473 reviews | 87473.0 |
| 107 | https://www.tripadvisor.com/Airline_Review-d87... | American Airlines | 77,654 reviews | 77654.0 |
| 225 | https://www.tripadvisor.com/Airline_Review-d87... | Emirates | 64,158 reviews | 64158.0 |
| 209 | https://www.tripadvisor.com/Airline_Review-d87... | Delta Air Lines | 61,690 reviews | 61690.0 |
| 219 | https://www.tripadvisor.com/Airline_Review-d87... | easyJet | 61,667 reviews | 61667.0 |
| 160 | https://www.tripadvisor.com/Airline_Review-d87... | British Airways | 58,623 reviews | 58623.0 |
| 575 | https://www.tripadvisor.com/Airline_Review-d87... | United Airlines | 54,146 reviews | 54146.0 |
| 505 | https://www.tripadvisor.com/Airline_Review-d87... | Southwest Airlines | 45,351 reviews | 45351.0 |
| 350 | https://www.tripadvisor.com/Airline_Review-d10... | LATAM Airlines | 40,972 reviews | 40972.0 |


The scraped information is saved as a JSON file. The table above shows the most heavily reviewed airlines. The original reason for gathering this information was to choose airlines with many reviews for the text classification and topic modeling to come. However, several of the most reviewed airlines on TripAdvisor (for example Transavia, Ryanair, easyJet, and Southwest Airlines) are low-cost carriers, which could mean that their reviews are too similar to one another.
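
Saving the scraped list means later steps can reload it instead of re-running the browser. The sketch below shows one way this round trip might look; the record is a placeholder with the same keys the scraper produces, and a temporary directory is used so the real review_counts.json is left untouched.

```python
import json
import os
import tempfile

# Placeholder record with the same keys the scraper produces
information = [
    {"airline_link": "https://www.tripadvisor.com/Airline_Review-example",
     "airline_name": "Example Air",
     "airline_reviews": "1,234 reviews"}
]

# Write to a temporary directory so the real review_counts.json is untouched
path = os.path.join(tempfile.mkdtemp(), "review_counts.json")
with open(path, "w") as outfile:
    json.dump(information, outfile, indent=4)

# Reload without re-running the scraper
with open(path) as infile:
    reloaded = json.load(infile)

print(reloaded == information)
```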

### Skytrax Scraper

*Skytrax scraper for region information*

In [7]:
async def skytrax():
    async with async_playwright() as p:

        # Init browser
        browser = await p.chromium.launch(headless=False, slow_mo=1500)
        page = await browser.new_page()
        region_list = list()

        # Define main url and query parameters
        og_url = "https://skytraxratings.com/airlines?regions="
        url_options = [
            "africa",
            "asia",
            "australiapacific",
            "central-america-caribbean",
            "china",
            "europe",
            "middle-east",
            "north-america",
            "russia-and-cis",
            "south-america"
        ]

        # Iterate through Skytrax pages filtered by region
        for reg in url_options:
            await page.goto(f"{og_url}{reg}")

            # Click on drop down for 50 options
            await page.select_option("select#set_posts_per_page", value="50")
            names = page.locator('//div[contains(@class,"pageportal__title")]/h2')
            titles = await names.all_inner_texts()
            region_list.append({
                f"{reg}": titles
            })
        return region_list

To obtain airlines that are less similar to one another, information from Skytrax was leveraged. The website contains filters that let users browse airlines by region. Ten regions are available, and these will serve as the class labels for the text classification to come. Up to 50 airlines from each region were stored.
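
Since the regions act as class labels, it can help to turn the scraped structure into a direct airline-to-region lookup. The sketch below assumes the list-of-single-key-dicts shape returned by skytrax(); `build_region_lookup` is a hypothetical helper, and the sample data is a short excerpt rather than the full scrape.

```python
# Short excerpt in the same shape as the list returned by skytrax()
region_list = [
    {"africa": ["Air Algerie", "Air Botswana"]},
    {"asia": ["9 Air", "Air Busan"]},
]

def build_region_lookup(region_list):
    """Map each airline name to the region it was listed under."""
    lookup = {}
    for entry in region_list:
        for region, airlines in entry.items():
            for airline in airlines:
                # setdefault keeps the first region if an airline appears twice
                lookup.setdefault(airline, region)
    return lookup

airline_to_region = build_region_lookup(region_list)
print(airline_to_region["Air Busan"])
```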

In [8]:
region_airlines = await skytrax()

pprint(region_airlines)

[{'africa': ['Air Algerie',
             'Air Botswana',
             'Air Madagascar',
             'Air Mauritius',
             'Air Namibia',
             'Air Seychelles',
             'Arik Air',
             'Cabo Verde Airlines',
             'Egyptair',
             'Ethiopian Airlines',
             'fastjet',
             'FlySafair',
             'Kenya Airways',
             'Kulula',
             'LAM Mozambique Airlines',
             'Mango',
             'Nile Air',
             'Nouvelair',
             'Royal Air Maroc',
             'Rwandair',
             'South African Airways',
             'Tunisair']},
 {'asia': ['9 Air',
           'Air Busan',
           'Air China',
           'Air India',
           'Air India Express',
           'Air Macau',
           'AirAsia',
           'AirAsia India',
           'AirAsia X',
           'airblue',
           'ANA All Nippon Airways',
           'Asiana Airlines',
           'Bamboo Airways',
           'Bangkok Airw

In [9]:
# Choose region specific airlines

airlines_for_reviews = [
    "South African Airways",
    "Cathay Pacific",
    "Virgin Australia",
    "Copa Airlines",
    "Air China",
    "Lufthansa",
    "Emirates",
    "Air Canada",
    "Aeroflot",
    "LATAM Airlines"
]

# Create dataframe consisting of chosen airlines

airlines_chosen = airlines_df[airlines_df["airline_name"].isin(airlines_for_reviews)].copy()  # copy so new columns can be added safely

# Helper function for assigning region to airline (based on Skytrax)

def add_region(name):
    match name:
        case "South African Airways":
            return "Africa"
        case "Cathay Pacific":
            return "Asia"
        case "Virgin Australia":
            return "Pacific"
        case "Copa Airlines":
            return "Central America"
        case "Air China":
            return "China"
        case "Lufthansa":
            return "Europe"
        case "Emirates":
            return "Middle East"
        case "Air Canada":
            return "North America"
        case "Aeroflot":
            return "Indian Subcontinent"   # note: no matching Skytrax region above; "russia-and-cis" is left unassigned
        case "LATAM Airlines":
            return "South America"

# Add to dataframe

airlines_chosen["region"] = airlines_chosen["airline_name"].apply(add_region)

airlines_chosen

| index | airline_link | airline_name | airline_reviews | review_count | region |
|---|---|---|---|---|---|
| 225 | https://www.tripadvisor.com/Airline_Review-d87... | Emirates | 64,158 reviews | 64158.0 | Middle East |
| 350 | https://www.tripadvisor.com/Airline_Review-d10... | LATAM Airlines | 40,972 reviews | 40972.0 | South America |
| 364 | https://www.tripadvisor.com/Airline_Review-d87... | Lufthansa | 40,747 reviews | 40747.0 | Europe |
| 28 | https://www.tripadvisor.com/Airline_Review-d87... | Air Canada | 29,770 reviews | 29770.0 | North America |
| 5 | https://www.tripadvisor.com/Airline_Review-d87... | Aeroflot | 20,694 reviews | 20694.0 | Indian Subcontinent |
| 175 | https://www.tripadvisor.com/Airline_Review-d87... | Cathay Pacific | 16,645 reviews | 16645.0 | Asia |
| 198 | https://www.tripadvisor.com/Airline_Review-d87... | Copa Airlines | 13,291 reviews | 13291.0 | Central America |
| 591 | https://www.tripadvisor.com/Airline_Review-d87... | Virgin Australia | 11,964 reviews | 11964.0 | Pacific |
| 34 | https://www.tripadvisor.com/Airline_Review-d87... | Air China | 6,005 reviews | 6005.0 | China |
| 503 | https://www.tripadvisor.com/Airline_Review-d87... | South African Airways | 4,884 reviews | 4884.0 | Africa |


One airline from each of the ten regions was selected, and a separate dataframe containing only those airlines was created. Each airline's region was added as a new column. The review links were then collected into a list for the scraping step below.

*Create search query list*

In [10]:
# Get list of links for chosen airlines

search_queries = airlines_chosen['airline_link'].to_list()

print(search_queries)
print(type(search_queries))

['https://www.tripadvisor.com/Airline_Review-d8729069-Reviews-Emirates', 'https://www.tripadvisor.com/Airline_Review-d10290698-Reviews-LATAM-Airlines', 'https://www.tripadvisor.com/Airline_Review-d8729113-Reviews-Lufthansa', 'https://www.tripadvisor.com/Airline_Review-d8728998-Reviews-Air-Canada', 'https://www.tripadvisor.com/Airline_Review-d8728987-Reviews-Aeroflot', 'https://www.tripadvisor.com/Airline_Review-d8729046-Reviews-Cathay-Pacific', 'https://www.tripadvisor.com/Airline_Review-d8729055-Reviews-Copa-Airlines', 'https://www.tripadvisor.com/Airline_Review-d8728931-Reviews-Virgin-Australia', 'https://www.tripadvisor.com/Airline_Review-d8729000-Reviews-Air-China', 'https://www.tripadvisor.com/Airline_Review-d8729155-Reviews-South-African-Airways']
<class 'list'>


### Final TripAdvisor Scraper

*Scrape TripAdvisor Reviews based on extracted region information*

In [11]:
# Function to go to next page

async def paginate_reviews(page, browser):
    try:
        next_page = page.locator('//a[@class="ui_button nav next primary "]')
        await next_page.click()
        # print('Clicked next page!!!')
        return page
    except Exception:
        print('No next page!!!')

# Helper function to click "Read more" on each review - not currently used

async def click_read_more(page):
    read_more = page.locator('//div[@data-test-target="expand-review"]/span[1]')
    for button in await read_more.all():                # click every "Read more" toggle
        await button.click()
    return page

# Function to scrape reviews on page

async def review_scrape(page, browser):
    # Create lists to store scraped info
    reviews = list()
    idx = 0
    revc_title=[]
    revc_text=[]
    while idx < 20:                                 # 20 pages of reviews 
        page_info = list()
        await page.mouse.wheel(0, 15000)            # scroll down function
        title = page.locator('//div[@data-test-target="review-title"]/a/span/span')
        text = page.locator('//div[@data-test-target="review-title"]/following-sibling::div')
        rating = page.locator('//span[@class="ammfn"]')
        
        titles = await title.all_inner_texts()      # grab review titles
        texts = await text.all_inner_texts()        # grab review texts
        rating = await rating.all_inner_texts()     # grab airline rating

        # Merge two lists into dictionary with title and text keys
        for jdx in range(len(titles)):
            rev_title = titles[jdx]
            revc_title.append(rev_title)

            rev_raw_text = texts[jdx]
            rev_text = rev_raw_text.split("Read more")  # Split review by Read more button
            rev_text = rev_text[0]
            revc_text.append(rev_text)
            el = {                                      # Add information to relevant list
                "review_title": f"{rev_title}",
                "review_text": f"{rev_text}"
            }
            page_info.append(el)
        reviews.append(page_info)

        # Paginate
        idx += 1
        await paginate_reviews(page, browser)           # Go to next page of reviews
    return reviews, revc_title, revc_text, rating

The use-case for each method above is as follows:

- click_read_more: This is a helper function meant to expand the full review text by clicking the "Read more" toggle of a review. It is currently not used in the project

- paginate_reviews: This function is a scraping helper that moves the browser on to the next page of reviews. It is used in the review_scrape() function

- review_scrape: This function handles the main extraction of text contained in the HTML elements of review pages. It identifies where the review titles and texts are on a given page, along with the airline rating.
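
The pairing of titles and texts inside review_scrape() can be sketched in isolation, without a browser. The helper name `build_review_records` and the sample strings are hypothetical. Using `zip` instead of indexing is a design suggestion: it stops at the shorter list, so a page with fewer texts than titles cannot raise an IndexError.

```python
def build_review_records(titles, texts):
    """Pair each title with its text, trimming anything after 'Read more'."""
    records = []
    # zip stops at the shorter list, so a missing text cannot raise IndexError
    for title, raw_text in zip(titles, texts):
        text = raw_text.split("Read more")[0]
        records.append({"review_title": title, "review_text": text})
    return records

# Made-up sample values standing in for all_inner_texts() results
titles = ["Great trip", "Never again"]
texts = ["Friendly crew and on-time departure.\nRead more\nDate of travel: ...",
         "Lost luggage."]

records = build_review_records(titles, texts)
print(records[0]["review_text"])
```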

Below, the functions are utilized to scrape the airline reviews:

In [12]:
# Scrape reviews for each chosen airline using the functions above

async with async_playwright() as p:

    # Init browser

    browser = await p.chromium.launch(headless=False, slow_mo=1500)
    page = await browser.new_page()
    rev_list = list()

    # Go to trip advisor

    await page.goto("https://www.tripadvisor.com/Airlines")

    # Scrape page and paginate
    rtlist=[]
    rtextlist=[]
    airlineratinglist=[]
    for q in search_queries:
        # Go to url with reviews
        await page.goto(f"{q}")

        # Scrape relevant information
        reviews,rt,rtext,airlinerating = await review_scrape(page, browser)
        
        # Add to relevant list
        rtlist.append(rt)
        rtextlist.append(rtext)
        airlineratinglist.append(airlinerating)
        rev_list.append({
            "Airline": f"{q}",
            "Reviews": f"{reviews}"
        })

Extracting the review text for the 10 chosen airlines takes roughly 10-15 minutes. This process gathers 100 reviews per airline, totalling 1,000 reviews, and can be extended to gather more if needed. The scrape can be sped up by lowering the "slow_mo" parameter in the browser initialization. However, if the browser moves through the pages too quickly, it may miss some of the information along the way.
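
The totals quoted above follow from the loop bounds in review_scrape(). A rough sketch of the arithmetic is below. Here `reviews_per_page` (5) is inferred from 100 reviews over 20 pages, and `ops_per_page` is a guess at how many slowed-down Playwright operations run per page, so the time figure is only a ballpark for the delay that slow_mo adds.

```python
def estimate_scrape(airlines=10, pages_per_airline=20, reviews_per_page=5,
                    slow_mo_ms=1500, ops_per_page=2):
    """Return (total_reviews, rough_minutes) for a scraping run.

    reviews_per_page and ops_per_page are assumptions, not measured values.
    """
    total_reviews = airlines * pages_per_airline * reviews_per_page
    # slow_mo adds a fixed delay to each Playwright operation
    delay_seconds = airlines * pages_per_airline * ops_per_page * slow_mo_ms / 1000
    return total_reviews, delay_seconds / 60

total, minutes = estimate_scrape()
print(total, minutes)
```

Halving `slow_mo_ms` halves the estimated delay, which is the trade-off against missed content described above.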

In [13]:
# Append relevant lists from code block above to dataframe

airlines_df = airlines_chosen
airlines_df['Rating']=airlineratinglist
airlines_df['Review_Title']=rtlist
airlines_df['Review_Text']=rtextlist

airlines_df

| index | airline_link | airline_name | airline_reviews | review_count | region | Rating | Review_Title | Review_Text |
|---|---|---|---|---|---|---|---|---|
| 225 | https://www.tripadvisor.com/Airline_Review-d87... | Emirates | 64,158 reviews | 64158.0 | Middle East | [4.0] | [Unsatisfactory on the seat selection, Top cla... | [We have to pay extra to select the seat like ... |
| 350 | https://www.tripadvisor.com/Airline_Review-d10... | LATAM Airlines | 40,972 reviews | 40972.0 | South America | [3.5] | [LATAM is changing, for the bad..., The worst ... | [Our flight was delayed more than an hour, boa... |
| 364 | https://www.tripadvisor.com/Airline_Review-d87... | Lufthansa | 40,747 reviews | 40747.0 | Europe | [3.5] | [Skip Lufthansa for a stress free flight, don'... | [I booked Lufthansa for a better experience an... |
| 28 | https://www.tripadvisor.com/Airline_Review-d87... | Air Canada | 29,770 reviews | 29770.0 | North America | [3.0] | [delay delay delay, Terrible experience, avoid... | [24 hours prior online checkin ..... Delay 80 ... |
| 5 | https://www.tripadvisor.com/Airline_Review-d87... | Aeroflot | 20,694 reviews | 20694.0 | Indian Subcontinent | [4.0] | [Day 320: Repayment still did not work out, WO... | [To my great surprise, I received the free, wo... |
| 175 | https://www.tripadvisor.com/Airline_Review-d87... | Cathay Pacific | 16,645 reviews | 16645.0 | Asia | [4.0] | [Never again!, Shocking experience...never aga... | [Terrible experience. Ran out of options for b... |
| 198 | https://www.tripadvisor.com/Airline_Review-d87... | Copa Airlines | 13,291 reviews | 13291.0 | Central America | [3.5] | [Inconsistent, Great experience on Copa, Worst... | [Overall okay experience. Some learning curves... |
| 591 | https://www.tripadvisor.com/Airline_Review-d87... | Virgin Australia | 11,964 reviews | 11964.0 | Pacific | [3.5] | [Canceled Again, easy process, Don't recommend... | [Virgin Australia used to be a good airline bu... |
| 34 | https://www.tripadvisor.com/Airline_Review-d87... | Air China | 6,005 reviews | 6005.0 | China | [3.0] | [Crappiest offboard service so far, Great trip... | [Crappiest offboard service so far. Phone serv... |
| 503 | https://www.tripadvisor.com/Airline_Review-d87... | South African Airways | 4,884 reviews | 4884.0 | Africa | [3.5] | [Pathetic, Happy Customer’s, Avoid at all cost... | [I am not going to go into the detail..but wor... |


*Save Reviews as JSON*

In [14]:
# # Save Reviews to file

# # Reset index
# airlines_df=airlines_df.reset_index().drop(columns=['index'])

# # Convert to json
# reviews_json = airlines_df.to_dict("records")

# # Serializing json
# reviews_json = json.dumps(reviews_json, indent=4)
 
# # Writing to file
# with open("airline_reviews.json", "w") as outfile:
#     outfile.write(reviews_json)

# airlines_df

*Plot Total Review Distributions*

In [15]:
# Plotly histogram of total review counts

color = sns.color_palette()

fig = px.histogram(airlines_df, x="review_count",  width=800, height=400)
fig.update_traces(marker_color="pink",marker_line_color='orchid',
                  marker_line_width=1.5)
fig.update_layout(title_text='Airline review total distribution')
fig.show()

Based on the histogram displayed above, half of the ten chosen airlines have under 20,000 reviews. Nine of the ten (90%) have under 60,000 reviews, and only one, Emirates, has more than 60,000.
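
The proportions can be checked directly against the review counts in the dataframe shown earlier. The sketch below copies those ten counts by hand; `share_below` is a hypothetical helper, not part of the notebook.

```python
# Review counts of the ten chosen airlines, copied from the dataframe above
review_counts = [64158, 40972, 40747, 29770, 20694, 16645, 13291, 11964, 6005, 4884]

def share_below(counts, threshold):
    """Fraction of counts strictly below the threshold."""
    return sum(c < threshold for c in counts) / len(counts)

for threshold in (20_000, 60_000):
    print(threshold, share_below(review_counts, threshold))
```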

### Clean up Reviews

In [16]:
# Function to remove repetitive phrases among all reviews

def replace_in_list(sentence_list):
    new_word = ''
    new_list = [re.sub(r'\d', '', sentence).replace('This review is the subjective opinion of a Tripadvisor member and not of Tripadvisor LLC. Tripadvisor performs checks on reviews.', new_word) for sentence in sentence_list]
    new_list = [sentence.replace('\nHelpful\nShare', new_word) for sentence in new_list]
    new_list = [sentence.replace('\nRead more\nDate of travel:', new_word) for sentence in new_list]
    new_list = [sentence.replace('\n1 Helpful vote', new_word) for sentence in new_list]
    new_list = [sentence.replace('\n', new_word) for sentence in new_list]
    new_list = [sentence.replace('airline', new_word) for sentence in new_list]
    new_list = [sentence.replace('flight', new_word) for sentence in new_list]
    return new_list

# Helper function to match all months using regex

def remove_months(sentence):
    # Define a function to remove months from a sentence
    months_pattern = r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\b'
    return re.sub(months_pattern, '', sentence)

# Function to remove months from every review in a list

def replace_months_in_list(sentence_list):
    return [remove_months(sentence) for sentence in sentence_list]

In [17]:
# Apply the functions above to each list in the 'Review_Text' column

airlines_df['Review_Text'] = airlines_df['Review_Text'].apply(replace_in_list)
airlines_df['Review_Text'] = airlines_df['Review_Text'].apply(replace_months_in_list)

# Extract the numeric values from the lists and convert to float

airlines_df['Rating'] = airlines_df['Rating'].apply(lambda x: pd.to_numeric(x[0], 
                                            errors='coerce')).astype(float)

Certain boilerplate words and phrases recurred throughout the extracted reviews. The replace_in_list() function was used to remove this unnecessary text, along with the replace_months_in_list() function. Both were applied to the "Review_Text" feature in the airlines dataframe. The "Rating" feature was also converted to a float, since it was initially stored as a string inside a list.
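
To see what these cleaning steps do to a single review, the sketch below applies a subset of them to a made-up string. `clean_review` is a hypothetical single-string variant of the list-based functions above, and the final whitespace-collapsing step is an extra that the notebook does not perform.

```python
import re

MONTHS_PATTERN = (r'\b(?:January|February|March|April|May|June|July|August|'
                  r'September|October|November|December)\b')

def clean_review(text):
    text = re.sub(r'\d', '', text)                  # drop digits, as replace_in_list does
    text = text.replace('\nHelpful\nShare', '')     # drop widget residue
    text = text.replace('\n', '')                   # drop remaining newlines
    text = re.sub(MONTHS_PATTERN, '', text)         # drop month names
    # Extra step, not in the notebook: collapse leftover runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

sample = "Great crew on 12 March\nHelpful\nShare"   # made-up review text
cleaned = clean_review(sample)
print(repr(cleaned))
```

The month pattern is case-sensitive, so a lowercase "may" used as a verb is kept, while a capitalized "May" at the start of a sentence would be removed along with the month names.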

### Descriptive Statistics

In [18]:
punctuation = set(punctuation)      # punctuation
tw_punct = punctuation - {"#"}

sw = stopwords.words("english")     # stopwords and null removal
sw = sw + ['nan']

whitespace_pattern = re.compile(r"\s+")           # Two useful
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")   #   Regex

# Function for descriptive statistics

def descriptive_stats(tokens, num_common=5, verbose=True):
    """
        Given a list of tokens, print number of tokens, number of unique tokens,
        number of characters, and lexical diversity. Returns a list of these
        descriptive statistics. num_common was intended for printing the most
        common tokens, which is not implemented here.
    """

    num_tokens = len(tokens)
    num_unique_tokens = len(set(tokens))
    # Guard against an empty token list to avoid division by zero
    lexical_diversity = num_unique_tokens / num_tokens if num_tokens else 0.0
    num_characters = sum(len(token) for token in tokens)

    if verbose:
        print(f"There are {num_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")

    return [num_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters]

# Function to remove stop words

def remove_stop(tokens) :
    return[t for t in tokens if t not in sw]

# Function to remove punctuation
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

# Function to tokenize text

def tokenize(text) : 
    """ Split on whitespace and lowercase each token. A whitespace split keeps 
        tokens such as '#hashtag' or '2A' that a stricter tokenizer would drop. """
    
    return([item.lower() for item in whitespace_pattern.split(text)])
    
# Function to run a text through each step of a pipeline

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)

# Construct pipeline using methods above

full_pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]


The functions above handle a variety of data preparation steps:

- remove_stop: removes stop words in a given text

- remove_punctuation: uses the string.punctuation constant to remove unwanted punctuation that may be present in review texts

- tokenize: splits a text on runs of whitespace (spaces, tabs, newlines) and lowercases each token

- prepare: applies the data processing and cleaning steps of a given pipeline to a text, in order

The full_pipeline variable collects all of the necessary data handling steps. It is used with the apply() method below to create "title_tokens" and "summary_tokens".
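
The pipeline can be tried on a single sentence to see each step's effect. The sketch below re-declares simplified versions of the functions so that it runs on its own; the tiny stopword set is a stand-in for nltk's English list, and the sample sentence is made up.

```python
import re
from string import punctuation

# Small stand-in stopword list; the notebook uses nltk's English stopwords
sw = {"the", "was", "and", "a", "to"}
punct_set = set(punctuation) - {"#"}
whitespace_pattern = re.compile(r"\s+")

def remove_punctuation(text):
    return "".join(ch for ch in text if ch not in punct_set)

def tokenize(text):
    return [item.lower() for item in whitespace_pattern.split(text)]

def remove_stop(tokens):
    return [t for t in tokens if t not in sw]

def prepare(text, pipeline):
    tokens = str(text)
    for transform in pipeline:      # each step consumes the previous step's output
        tokens = transform(tokens)
    return tokens

pipeline = [str.lower, remove_punctuation, tokenize, remove_stop]
tokens = prepare("The crew was friendly, and the seat was comfortable!", pipeline)
print(tokens)
```

One caveat with splitting on whitespace: a text that starts or ends with whitespace produces an empty-string token, which may be worth filtering out before computing statistics.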

In [19]:
# Create tokens for review titles and summaries

airlines_df['title_tokens'] = airlines_df['Review_Title'].apply(prepare, pipeline=full_pipeline)
airlines_df['summary_tokens'] = airlines_df['Review_Text'].apply(prepare, pipeline=full_pipeline)

*Save Cleaned Reviews as JSON*

In [20]:
# Save Reviews to file

# Reset index
airlines_df = airlines_df.reset_index().drop(columns=['index'])

# Convert to json
reviews_json = airlines_df.to_dict("records")

# Serializing json
reviews_json = json.dumps(reviews_json, indent=4)
 
# Writing to file
with open("airline_reviews.json", "w") as outfile:
    outfile.write(reviews_json)

display(airlines_df)

| index | airline_link | airline_name | airline_reviews | review_count | region | Rating | Review_Title | Review_Text | title_tokens | summary_tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.tripadvisor.com/Airline_Review-d87... | Emirates | 64,158 reviews | 64158.0 | Middle East | 4.0 | [Unsatisfactory on the seat selection, Top cla... | [We have to pay extra to select the seat like ... | [unsatisfactory, seat, selection, top, class, ... | [pay, extra, select, seat, like, low, cost, ho... |
| 1 | https://www.tripadvisor.com/Airline_Review-d10... | LATAM Airlines | 40,972 reviews | 40972.0 | South America | 3.5 | [LATAM is changing, for the bad..., The worst ... | [Our was delayed more than an hour, boarding ... | [latam, changing, bad, worst, customer, servic... | [delayed, hour, boarding, total, mess, groups,... |
| 2 | https://www.tripadvisor.com/Airline_Review-d87... | Lufthansa | 40,747 reviews | 40747.0 | Europe | 3.5 | [Skip Lufthansa for a stress free flight, don'... | [I booked Lufthansa for a better experience an... | [skip, lufthansa, stress, free, flight, dont, ... | [booked, lufthansa, better, experience, smooth... |
| 3 | https://www.tripadvisor.com/Airline_Review-d87... | Air Canada | 29,770 reviews | 29770.0 | North America | 3.0 | [delay delay delay, Terrible experience, avoid... | [ hours prior online checkin ..... Delay mins... | [delay, delay, delay, terrible, experience, av... | [, hours, prior, online, checkin, delay, mins,... |
| 4 | https://www.tripadvisor.com/Airline_Review-d87... | Aeroflot | 20,694 reviews | 20694.0 | Indian Subcontinent | 4.0 | [Day 320: Repayment still did not work out, WO... | [To my great surprise, I received the free, wo... | [day, 320, repayment, still, work, worst, comp... | [great, surprise, received, free, worryfree, a... |
| 5 | https://www.tripadvisor.com/Airline_Review-d87... | Cathay Pacific | 16,645 reviews | 16645.0 | Asia | 4.0 | [Never again!, Shocking experience...never aga... | [Terrible experience. Ran out of options for b... | [never, shocking, experiencenever, excellent, ... | [terrible, experience, ran, options, breakfast... |
| 6 | https://www.tripadvisor.com/Airline_Review-d87... | Copa Airlines | 13,291 reviews | 13291.0 | Central America | 3.5 | [Inconsistent, Great experience on Copa, Worst... | [Overall okay experience. Some learning curves... | [inconsistent, great, experience, copa, worst,... | [overall, okay, experience, learning, curves, ... |
| 7 | https://www.tripadvisor.com/Airline_Review-d87... | Virgin Australia | 11,964 reviews | 11964.0 | Pacific | 3.5 | [Canceled Again, easy process, Don't recommend... | [Virgin Australia used to be a good but I wou... | [canceled, easy, process, dont, recommend, vir... | [virgin, australia, used, good, would, recomme... |
| 8 | https://www.tripadvisor.com/Airline_Review-d87... | Air China | 6,005 reviews | 6005.0 | China | 3.0 | [Crappiest offboard service so far, Great trip... | [Crappiest offboard service so far. Phone serv... | [crappiest, offboard, service, far, great, tri... | [crappiest, offboard, service, far, phone, ser... |
| 9 | https://www.tripadvisor.com/Airline_Review-d87... | South African Airways | 4,884 reviews | 4884.0 | Africa | 3.5 | [Pathetic, Happy Customer’s, Avoid at all cost... | [I am not going to go into the detail..but wor... | [pathetic, happy, customer’s, avoid, costs, ze... | [going, go, detailbut, words, cannot, describe... |


After the reviews are extracted, the Rating, Review_Title, and Review_Text lists are appended to the airline dataframe. The reviews are saved as a json file titled "airline_reviews.json".

*Descriptive Statistics for Review Titles*

In [21]:
# Call descriptive statistics on titles

for i in range(10):
    print(airlines_df["airline_name"][i])
    print(descriptive_stats(airlines_df["title_tokens"][i]))
    print("\n")


Emirates
There are 364 tokens in the data.
There are 203 unique tokens in the data.
There are 2434 characters in the data.
The lexical diversity is 0.558 in the data.
[364, 203, 0.5576923076923077, 2434]


LATAM Airlines
There are 371 tokens in the data.
There are 210 unique tokens in the data.
There are 2402 characters in the data.
The lexical diversity is 0.566 in the data.
[371, 210, 0.5660377358490566, 2402]


Lufthansa
There are 436 tokens in the data.
There are 264 unique tokens in the data.
There are 2865 characters in the data.
The lexical diversity is 0.606 in the data.
[436, 264, 0.6055045871559633, 2865]


Air Canada
There are 364 tokens in the data.
There are 192 unique tokens in the data.
There are 2268 characters in the data.
The lexical diversity is 0.527 in the data.
[364, 192, 0.5274725274725275, 2268]


Aeroflot
There are 279 tokens in the data.
There are 156 unique tokens in the data.
There are 1694 characters in the data.
The lexical diversity is 0.559 in the data.


Based on the descriptive statistics for the 10 regional airlines, the total token count, unique token count, total character count, and lexical diversity of the titles can be compared. Lufthansa has the most tokens, unique tokens, and characters in its titles (436/264/2865), while Aeroflot has the fewest (279/156/1694). The remaining airlines fall into a similar range on those three measures. The most lexically diverse titles belong to South African Airways and Air China (0.619), while the least lexically diverse titles belong to Air Canada (0.527). It is important to note that these statistics are based on only the 100 most recent reviews extracted from TripAdvisor. The comparisons are therefore not definitive, but they do help us gauge the textual characteristics of different airline review titles.
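
As a rough illustration of how the four reported values relate to each other, the sketch below recomputes them for a small token list. The function name `sketch_descriptive_stats` is hypothetical and is not the notebook's own `descriptive_stats`; it only assumes the same output order (tokens, unique tokens, lexical diversity, characters) and assumes that characters are counted as the summed length of all tokens.

```python
def sketch_descriptive_stats(tokens):
    """Hypothetical stand-in for descriptive_stats.

    Returns [token count, unique token count, lexical diversity, character count].
    """
    num_tokens = len(tokens)
    num_unique = len(set(tokens))
    # Lexical diversity is the share of tokens that are unique;
    # guard against division by zero for an empty token list
    lexical_diversity = num_unique / num_tokens if num_tokens else 0.0
    # Assumption: characters are the summed length of all tokens
    num_chars = sum(len(token) for token in tokens)
    return [num_tokens, num_unique, lexical_diversity, num_chars]


sample_tokens = ["delay", "delay", "delay", "terrible", "experience"]
stats = sketch_descriptive_stats(sample_tokens)
print(stats)
```

The same ratio reproduces the figures in the output above: for Emirates, 203 unique tokens divided by 364 total tokens gives roughly 0.558.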

*Descriptive Statistics for Review Texts (Summaries)*

In [22]:
# Call descriptive statistics on review summary

for i in range(10):
    print(airlines_df["airline_name"][i])
    print(descriptive_stats(airlines_df["summary_tokens"][i]))
    print("\n")

Emirates
There are 2330 tokens in the data.
There are 1132 unique tokens in the data.
There are 13881 characters in the data.
The lexical diversity is 0.486 in the data.
[2330, 1132, 0.4858369098712446, 13881]


LATAM Airlines
There are 2350 tokens in the data.
There are 1087 unique tokens in the data.
There are 13701 characters in the data.
The lexical diversity is 0.463 in the data.
[2350, 1087, 0.4625531914893617, 13701]


Lufthansa
There are 2267 tokens in the data.
There are 1088 unique tokens in the data.
There are 13468 characters in the data.
The lexical diversity is 0.480 in the data.
[2267, 1088, 0.4799294221438024, 13468]


Air Canada
There are 2265 tokens in the data.
There are 1052 unique tokens in the data.
There are 13239 characters in the data.
The lexical diversity is 0.464 in the data.
[2265, 1052, 0.46445916114790287, 13239]


Aeroflot
There are 1791 tokens in the data.
There are 834 unique tokens in the data.
There are 11132 characters in the data.
The lexical diver

Based on the descriptive statistics for the 10 regional airlines, the total token count, unique token count, total character count, and lexical diversity of the reviews can be compared. The airline with the highest token count is LATAM Airlines (2350), with Emirates as the runner-up (2330). Emirates has both the most unique tokens (1132) and the most characters (13881). Aeroflot once again has the fewest tokens, unique tokens, and characters (1791/834/11132). The most lexically diverse reviews belong to South African Airways (0.516), while the least lexically diverse reviews belong to Air China (0.462). Again, it is important to note that these statistics are based on only the 100 most recent reviews extracted from TripAdvisor. The comparisons are therefore not definitive, but they do help us gauge the textual characteristics of different airline reviews.
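
The highest and lowest values can also be looked up programmatically rather than read off the printed output. The sketch below is one possible way to do that; the dictionary `summary_stats` and the helper `extremes` are hypothetical names, and the numbers are copied from the cell output above for the five airlines whose output is visible, so the result covers only those five.

```python
# Summary-token statistics copied from the cell output above.
# Only the five airlines whose output is visible are included.
summary_stats = {
    "Emirates": {"tokens": 2330, "unique": 1132, "chars": 13881},
    "LATAM Airlines": {"tokens": 2350, "unique": 1087, "chars": 13701},
    "Lufthansa": {"tokens": 2267, "unique": 1088, "chars": 13468},
    "Air Canada": {"tokens": 2265, "unique": 1052, "chars": 13239},
    "Aeroflot": {"tokens": 1791, "unique": 834, "chars": 11132},
}


def extremes(stats, key):
    """Return (airline with the highest value, airline with the lowest value)."""
    highest = max(stats, key=lambda name: stats[name][key])
    lowest = min(stats, key=lambda name: stats[name][key])
    return highest, lowest


for key in ("tokens", "unique", "chars"):
    highest, lowest = extremes(summary_stats, key)
    print(f"{key}: highest = {highest}, lowest = {lowest}")
```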

### Save Tokenized JSON

In [23]:
# Save Reviews to file

# Reset index
airlines_df = airlines_df.reset_index(drop=True)

# Convert to a list of record dictionaries
tokens_json = airlines_df.to_dict("records")

# Serialize records to a JSON string
tokens_json = json.dumps(tokens_json, indent=4)
 
# Writing to file
with open("tokenized_reviews.json", "w") as outfile:
    outfile.write(tokens_json)

airlines_df

Unnamed: 0,airline_link,airline_name,airline_reviews,review_count,region,Rating,Review_Title,Review_Text,title_tokens,summary_tokens
0,https://www.tripadvisor.com/Airline_Review-d87...,Emirates,"64,158 reviews",64158.0,Middle East,4.0,"[Unsatisfactory on the seat selection, Top cla...",[We have to pay extra to select the seat like ...,"[unsatisfactory, seat, selection, top, class, ...","[pay, extra, select, seat, like, low, cost, ho..."
1,https://www.tripadvisor.com/Airline_Review-d10...,LATAM Airlines,"40,972 reviews",40972.0,South America,3.5,"[LATAM is changing, for the bad..., The worst ...","[Our was delayed more than an hour, boarding ...","[latam, changing, bad, worst, customer, servic...","[delayed, hour, boarding, total, mess, groups,..."
2,https://www.tripadvisor.com/Airline_Review-d87...,Lufthansa,"40,747 reviews",40747.0,Europe,3.5,"[Skip Lufthansa for a stress free flight, don'...",[I booked Lufthansa for a better experience an...,"[skip, lufthansa, stress, free, flight, dont, ...","[booked, lufthansa, better, experience, smooth..."
3,https://www.tripadvisor.com/Airline_Review-d87...,Air Canada,"29,770 reviews",29770.0,North America,3.0,"[delay delay delay, Terrible experience, avoid...",[ hours prior online checkin ..... Delay mins...,"[delay, delay, delay, terrible, experience, av...","[, hours, prior, online, checkin, delay, mins,..."
4,https://www.tripadvisor.com/Airline_Review-d87...,Aeroflot,"20,694 reviews",20694.0,Indian Subcontinent,4.0,"[Day 320: Repayment still did not work out, WO...","[To my great surprise, I received the free, wo...","[day, 320, repayment, still, work, worst, comp...","[great, surprise, received, free, worryfree, a..."
5,https://www.tripadvisor.com/Airline_Review-d87...,Cathay Pacific,"16,645 reviews",16645.0,Asia,4.0,"[Never again!, Shocking experience...never aga...",[Terrible experience. Ran out of options for b...,"[never, shocking, experiencenever, excellent, ...","[terrible, experience, ran, options, breakfast..."
6,https://www.tripadvisor.com/Airline_Review-d87...,Copa Airlines,"13,291 reviews",13291.0,Central America,3.5,"[Inconsistent, Great experience on Copa, Worst...",[Overall okay experience. Some learning curves...,"[inconsistent, great, experience, copa, worst,...","[overall, okay, experience, learning, curves, ..."
7,https://www.tripadvisor.com/Airline_Review-d87...,Virgin Australia,"11,964 reviews",11964.0,Pacific,3.5,"[Canceled Again, easy process, Don't recommend...",[Virgin Australia used to be a good but I wou...,"[canceled, easy, process, dont, recommend, vir...","[virgin, australia, used, good, would, recomme..."
8,https://www.tripadvisor.com/Airline_Review-d87...,Air China,"6,005 reviews",6005.0,China,3.0,"[Crappiest offboard service so far, Great trip...",[Crappiest offboard service so far. Phone serv...,"[crappiest, offboard, service, far, great, tri...","[crappiest, offboard, service, far, phone, ser..."
9,https://www.tripadvisor.com/Airline_Review-d87...,South African Airways,"4,884 reviews",4884.0,Africa,3.5,"[Pathetic, Happy Customer’s, Avoid at all cost...",[I am not going to go into the detail..but wor...,"[pathetic, happy, customer’s, avoid, costs, ze...","[going, go, detailbut, words, cannot, describe..."
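
Because the token columns hold Python lists, it can be worth confirming that they come back as lists after a save and reload. The following is a minimal sketch of such a round-trip check, assuming a small hypothetical `records` list in place of `airlines_df.to_dict("records")` and a temporary file instead of "tokenized_reviews.json".

```python
import json
import os
import tempfile

# Hypothetical records standing in for airlines_df.to_dict("records")
records = [
    {"airline_name": "Emirates", "title_tokens": ["unsatisfactory", "seat", "selection"]},
    {"airline_name": "Air Canada", "title_tokens": ["delay", "delay", "delay"]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; written to a temporary directory only
    path = os.path.join(tmp_dir, "tokenized_reviews_check.json")

    # Write the records as indented JSON
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(records, outfile, indent=4)

    # Read the records back
    with open(path, encoding="utf-8") as infile:
        reloaded = json.load(infile)

# True when the reloaded data matches what was written
print(reloaded == records)
```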


In [24]:
# Function for word lemmatization

# Create the lemmatizer once and reuse it for every token
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    # lemmatize() treats each token as a noun by default
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

# Apply lemmatization

airlines_df['title_lemma'] = airlines_df['title_tokens'].apply(word_lemmatizer)
airlines_df['summary_lemma'] = airlines_df['summary_tokens'].apply(word_lemmatizer)

airlines_df.head()

Unnamed: 0,airline_link,airline_name,airline_reviews,review_count,region,Rating,Review_Title,Review_Text,title_tokens,summary_tokens,title_lemma,summary_lemma
0,https://www.tripadvisor.com/Airline_Review-d87...,Emirates,"64,158 reviews",64158.0,Middle East,4.0,"[Unsatisfactory on the seat selection, Top cla...",[We have to pay extra to select the seat like ...,"[unsatisfactory, seat, selection, top, class, ...","[pay, extra, select, seat, like, low, cost, ho...","[unsatisfactory, seat, selection, top, class, ...","[pay, extra, select, seat, like, low, cost, ho..."
1,https://www.tripadvisor.com/Airline_Review-d10...,LATAM Airlines,"40,972 reviews",40972.0,South America,3.5,"[LATAM is changing, for the bad..., The worst ...","[Our was delayed more than an hour, boarding ...","[latam, changing, bad, worst, customer, servic...","[delayed, hour, boarding, total, mess, groups,...","[latam, changing, bad, worst, customer, servic...","[delayed, hour, boarding, total, mess, group, ..."
2,https://www.tripadvisor.com/Airline_Review-d87...,Lufthansa,"40,747 reviews",40747.0,Europe,3.5,"[Skip Lufthansa for a stress free flight, don'...",[I booked Lufthansa for a better experience an...,"[skip, lufthansa, stress, free, flight, dont, ...","[booked, lufthansa, better, experience, smooth...","[skip, lufthansa, stress, free, flight, dont, ...","[booked, lufthansa, better, experience, smooth..."
3,https://www.tripadvisor.com/Airline_Review-d87...,Air Canada,"29,770 reviews",29770.0,North America,3.0,"[delay delay delay, Terrible experience, avoid...",[ hours prior online checkin ..... Delay mins...,"[delay, delay, delay, terrible, experience, av...","[, hours, prior, online, checkin, delay, mins,...","[delay, delay, delay, terrible, experience, av...","[, hour, prior, online, checkin, delay, min, t..."
4,https://www.tripadvisor.com/Airline_Review-d87...,Aeroflot,"20,694 reviews",20694.0,Indian Subcontinent,4.0,"[Day 320: Repayment still did not work out, WO...","[To my great surprise, I received the free, wo...","[day, 320, repayment, still, work, worst, comp...","[great, surprise, received, free, worryfree, a...","[day, 320, repayment, still, work, worst, comp...","[great, surprise, received, free, worryfree, a..."


### Load in Scraped Review JSON

In [25]:
# Load in using pandas

airline = pd.read_json("airline_reviews.json")

display(airline)