# Emirates Review Project

This project was created by **Ahmad Ulfi Jihad Dzulqornain** as part of a data analyst portfolio. <br>

Our adventure begins with a visit to the beloved website AirlineQuality.com, where we'll immerse ourselves in a treasure trove of reviews from Emirates passengers. But before we set off, let's make sure we have all the necessary tools in our arsenal:


_______

First things first, we need to install the required packages:
- Selenium 🕸️
- BeautifulSoup4 🍲
- Pandas 🐼
- Numpy 🔢

Join me on this thrilling adventure through the Emirates Review Project, where we'll soar through the skies of data analysis, discovering the highs, the lows, and everything in between. Let's unravel the stories, sentiments, and trends that shape the Emirates airline experience.

```Note: This project is a showcase of my data analysis skills and is not officially affiliated with Emirates or AirlineQuality.com.✍️```

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import numpy as np
from tqdm import tqdm
import os
import pickle
from datetime import datetime

By default, the site shows only 10 reviews per page. To scrape more efficiently, I switch the listing to 100 reviews per page. I then scrape the following fields: <br>
- Name
- Country of origin
- Date
- Overall Rating (out of 10)
- Headline
- Review
- Aircraft
- Type of Traveller
- Seat Type
- Route
- Seat Comfort Rating
- Cabin Staff Service Rating
- FnB Rating
- Inflight Entertainment Rating
- Ground Service Rating
- Wifi & Connectivity Rating
- Value for Money Rating
- Recommended (Yes/No)

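The `next_page` function navigates to the next page of results. The XPath of the "next" link is not stable: on some pages it stays the same, on others it shifts to a different position in the pagination list. The function therefore tries a list of candidate XPaths in order and clicks the first one that exists.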

In [2]:
def next_page():
    xpaths = [
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[15]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[14]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[13]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[12]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[11]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[10]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[9]/a',
        '//*[@id="main"]/section[3]/div[1]/div/article/ul/li[8]/a'
    ]

    for xpath in xpaths:
        try:
            driver.find_element(By.XPATH, xpath).click()
            break  # Break the loop if element is found and clicked
        except Exception:
            continue  # This XPath did not match; try the next candidate
    # Code continues...
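
One limitation of the loop above is that it gives no signal when none of the candidates match, for example on the last page. The sketch below shows the same fallback pattern with a return value. It is a minimal illustration, not the code used in this project: `click_first_available` and `fake_click` are hypothetical names, and `fake_click` stands in for `driver.find_element(By.XPATH, xpath).click()` so the example runs without a browser.

```python
def click_first_available(click, xpaths):
    """Try each XPath in order and return the first one that could be clicked.

    `click` is any callable that raises an exception when the element is missing.
    Returns None when no candidate matched (for example, on the last page).
    """
    for xpath in xpaths:
        try:
            click(xpath)
            return xpath
        except Exception:
            continue
    return None


# Hypothetical stand-in for the Selenium click: only the 'li[9]' link exists here.
def fake_click(xpath):
    if 'li[9]' not in xpath:
        raise LookupError(f'no element at {xpath}')


# Candidates from li[15] down to li[8], mirroring the order used in next_page.
candidates = [f'//ul/li[{n}]/a' for n in range(15, 7, -1)]
matched = click_first_available(fake_click, candidates)
print(matched)
```

With a return value like this, a scraping loop could stop as soon as the helper returns `None` instead of relying on a hard-coded page count.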


The provided code uses Selenium and BeautifulSoup libraries to scrape airline reviews from the website [https://www.airlinequality.com/airline-reviews/emirates/](https://www.airlinequality.com/airline-reviews/emirates/).

**Code Flow:**
- Initializes a Chrome WebDriver session.
- Navigates to the specified URL.
- Clicks the "Show 100" button to display 100 reviews per page.
- Parses HTML responses and stores them in a dictionary.
- Loops through multiple pages, navigating to the next page and waiting for it to load.
- Closes the WebDriver session.

This code enables efficient scraping of airline reviews for further analysis.


In [5]:
from selenium.webdriver.chrome.service import Service

# Build the driver path from separate parts so no backslash is embedded in a string
path_chromedriver = os.path.join(os.getcwd(), "chromedriver_win32", "chromedriver.exe")
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(path_chromedriver))
url = 'https://www.airlinequality.com/airline-reviews/emirates/'

# Open the website
driver.get(url)
time.sleep(1)

# Switch the listing to 100 reviews per page
show_100 = '//*[@id="main"]/section[3]/div[1]/article/div[1]/div[2]/form/ul/li[4]/label'
driver.find_element(By.XPATH, show_100).click()

# Parse HTML
response_dict = {}
for page_num in range(1, 24):  # Assuming 23 pages in total
    response_dict[page_num] = BeautifulSoup(driver.page_source, 'html.parser')

    # Navigate to the next page
    next_page()
    time.sleep(2)
driver.quit()  # Call the method (with parentheses) to end the WebDriver session

In [26]:
# Save the dictionary to a file
filename = 'datascrap.pkl'
with open(filename, 'wb') as file:
    pickle.dump(response_dict, file)


In [14]:
# Reopen the file and load the dictionary
with open('datascrap.pkl', 'rb') as file:
    loaded_dict = pickle.load(file)
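
An alternative worth considering is to pickle the raw HTML strings rather than the parsed `BeautifulSoup` objects. Parsed trees tend to produce larger files and may fail to pickle on deeply nested pages. The sketch below is a minimal illustration under that assumption; `html_by_page` and the file name `datascrap_html.pkl` are hypothetical, and the short strings stand in for `driver.page_source`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for driver.page_source captured on each page.
html_by_page = {
    1: '<html><body><p>page one</p></body></html>',
    2: '<html><body><p>page two</p></body></html>',
}

# Write to a temporary directory so the example leaves the project folder untouched.
path = os.path.join(tempfile.mkdtemp(), 'datascrap_html.pkl')
with open(path, 'wb') as file:
    pickle.dump(html_by_page, file)

with open(path, 'rb') as file:
    restored = pickle.load(file)

# Each restored string can be re-parsed later with BeautifulSoup(html, 'html.parser').
print(restored == html_by_page)
```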

In [8]:
result = response_dict[1]

In [16]:
result = loaded_dict[6]

In [17]:
#Retrieve User Name
name_element = result.find_all('span',itemprop='name')
name = [name.text for name in name_element]
print(name)

['J Caminet', 'V Shandira', 'Loay Osman', 'Elias Parker', 'S Kunda', 'Roy Carpenter', 'D Barton', 'Ahmad Hafeez', 'Piotr Wegrzyn', 'Piotr Wegrzyn', 'India Stoughton', 'H Deiner', 'B Ramatos', 'Olivier Gueris', 'W Dean', 'P Kalak', 'Alena Paletar', 'R Robinson', 'M Carter', 'Anslem Wong', 'S Mellor', 'N Larten', 'V Warnuda', 'S Remzam', 'D Meares', 'Jan Birkner', 'Mandy Brawn', 'A Dergan', 'Jan Birkner', 'Ambika Sharma', 'Riccardo Micci', 'Umer Khan', 'S Danleepa', 'David Kassai', 'Z Haneed', 'Amal Bukhamseen', 'Katie Wong', 'Mossie Banks', 'Madhav Devarakonda', 'B Masten', 'Melissa Legner', 'M Heany', 'M Lafarina', 'R Hazanul', 'R Hazanul', 'G Dawes', 'David Houlihan', 'M Harmaz', 'Almuth Waechter', 'O Hemar', 'J Meares', 'James Goldie', 'Ali Shahid', 'A Samali', 'Bhavesh Devkaran', 'A Kazeen', 'Damiano Massimi', 'David Wharton', 'Neil Jeram', 'Neil Jeram', 'J Candare', 'Y Mack-Bazemore', 'N Martin', 'B Lakuwiec', 'S White', 'Rabia Sarwar', 'P Jahinda', 'Mostafa Hassan', 'C Beale', 'L 

In [18]:
#Retrieve Country Name
country_element = result.find_all('span',itemprop="author")
countries = [element.next_sibling.strip(' ()"')
             for element in country_element]
print(countries)

['Australia', 'Australia', 'United Arab Emirates', 'United States', 'United States', 'Portugal', 'United Kingdom', 'United Kingdom', 'Norway', 'Norway', 'Lebanon', 'United Kingdom', 'United States', 'Kazakhstan', 'United States', 'India', 'Canada', 'United Kingdom', 'United Kingdom', 'Malaysia', 'United Kingdom', 'United States', 'United Kingdom', 'United Kingdom', 'United Kingdom', 'Germany', 'United Kingdom', 'United Arab Emirates', 'Germany', 'Australia', 'South Africa', 'United States', 'United States', 'Switzerland', 'Pakistan', 'Saudi Arabia', 'Hong Kong', 'United Kingdom', 'Germany', 'United States', 'Austria', 'Australia', 'Switzerland', 'United Kingdom', 'United Kingdom', 'United Kingdom', 'Bahrain', 'United Kingdom', 'New Zealand', 'Egypt', 'United Kingdom', 'United Arab Emirates', 'Kuwait', 'Maldives', 'India', 'United States', 'Germany', 'United Kingdom', 'Australia', 'Australia', 'Belgium', 'United States', 'United States', 'United Kingdom', 'Australia', 'United States', '
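
The country extraction relies on `str.strip` being given a set of characters rather than a substring. The sketch below shows what that call does on two hypothetical text nodes (the real `next_sibling` values may contain different whitespace).

```python
# Hypothetical text nodes, shaped like what next_sibling returns after the author <span>.
raw_simple = ' (United Kingdom) '
raw_nested = ' (Congo (Kinshasa)) '

# str.strip(chars) removes any of the listed characters, but only from the two ends.
country_simple = raw_simple.strip(' ()"')
country_nested = raw_nested.strip(' ()"')

print(country_simple)
print(country_nested)
```

The second value comes out as `Congo (Kinshasa`, because the name's own closing parenthesis sits at the end of the string and is stripped too. If country names with parentheses occur in the data, they would need a separate clean-up step.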

In [19]:
#Retrieve Date Published
date_element = result.find_all('time',itemprop='datePublished')
dates = [element['datetime'] for element in date_element]
#Change type
dates = [datetime.strptime(string_dates,'%Y-%m-%d')
         for string_dates in dates]
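
As a quick check of the conversion above, the sketch below parses two hypothetical `datetime` attribute values with the same format string and confirms that the results compare chronologically, which is what later sorting or filtering by date depends on.

```python
from datetime import datetime

# Hypothetical values of the `datetime` attribute on <time itemprop="datePublished">.
raw_dates = ['2023-06-14', '2023-05-30']

# '%Y-%m-%d' expects a four-digit year, then month and day, separated by hyphens.
parsed = [datetime.strptime(value, '%Y-%m-%d') for value in raw_dates]

print(parsed[0].year, parsed[0].month, parsed[0].day)
print(parsed[0] > parsed[1])
```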

In [20]:
#Rating
rating_element = result.find_all('span',itemprop="ratingValue")
rating = [int(element.text) for element in rating_element][1:]  # Skip the first value: it is the overall rating that appears on every page
print(rating)

[4, 10, 3, 2, 7, 4, 2, 4, 4, 4, 6, 1, 4, 1, 1, 10, 1, 6, 2, 9, 1, 1, 1, 2, 2, 10, 9, 1, 10, 3, 8, 5, 2, 3, 3, 2, 1, 2, 10, 1, 1, 1, 10, 5, 6, 3, 4, 3, 7, 5, 5, 9, 4, 3, 7, 1, 10, 2, 6, 6, 1, 3, 4, 3, 9, 3, 4, 1, 2, 1, 6, 1, 6, 1, 1, 3, 3, 4, 10, 2, 9, 10, 1, 7, 2, 9, 1, 5, 2, 2, 5, 3, 4, 9, 4, 3, 6, 1, 5, 1]


In [21]:
#Retrieve Headline
headline_element = result.find_all(class_='text_header')
headline = [element.text.strip('"') for element in headline_element]
print(headline)

['seats low quality and grubby', 'had a fantastic flight', 'the worst flight I ever had with EK', 'it has been a nightmare', 'exceeded my expectation', 'Emirates is cost cutting', 'Emirates is vastly overrated', 'flight was unremarkable', 'choose an another airline', 'I expected something more', 'Terrible customer service', ' Such a terrible service.', 'Very bad customer service', 'display different info', 'like they cut costs severely', 'Attendants welcomed us', 'I would never fly with Emirates again', 'Really poor show', 'seems to have cut costs', 'would still recommend', 'penalizing its most loyal customers', 'never fly Emirates again', 'experienced the worst service', 'cutbacks and poor service', 'good service until lunch', 'a great flight with Emirates', 'upgraded to business', 'spend $600+ for the fee', 'We had an exceptional flight', 'avoid Dubai airport', 'food offered on this flight gets a 10/10', 'service is declining', 'been incredibly disappointing', 'I was very disappointe

In [28]:
#Retrieve Review
rev_element = result.find_all('div',class_='text_content')
reviews = [element.text for element in rev_element]
print(reviews)

['✅ Trip Verified |  Brisbane to Dubai. I always thought Emirates was supposed to be one of the best but let me tell you, it is not. The staff were pretty professional but that is common among airlines. I was initially excited to be flying in an A380 for the first time but the moment I boarded, my heart sunk. It was obvious Emirates had chosen the the cheapest possible configuration that you would expect on a budget domestic airline. The seats were super low quality and grubby. There was next to no padding which made it painful for a 14 hour flight. The air was dry, but worse is the LED lighting which has a low Hz flicker. This becomes migraine inducing as the lighting splits into green blue and red as you move your eyes about. There is also an insufficient number of toilets available. The Boeing 787 is much better from my experiences on other airlines. Overall disappointed.', "✅ Trip Verified |  Melbourne to Dubai via Singapore. I had a fantastic flight with Emirates, from check-in to

It might be useful to separate the verification note from the review text.

In [None]:
# List the reviews that have no '|' separator
for i in range(len(reviews)):
    try:
        reviews[i].split('|')[1]
    except IndexError:
        print(f'error {i}')
reviews[43]

In [None]:
reviews[43]

In [None]:
verified_status = ['Verified' if review.startswith('✅ Trip Verified') else 'Not Verified' for review in reviews]
reviews_only = [review.split('|', 1)[1].strip() for review in reviews]

print(verified_status)
print(reviews_only)

In [144]:
verified_status = ['Verified' if review.startswith('✅ Trip Verified') else 'Not Verified' for review in reviews]
# Split on the first '|' only, and only when a separator is actually present
prefixes = ('✅ Trip Verified', 'Not Verified')
reviews_only = [review.split('|', 1)[1].strip()
                if review.startswith(prefixes) and '|' in review
                else review for review in reviews]
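
The same logic can be written as a small helper that handles one review at a time, which makes the three cases (verified, not verified, no prefix at all) easy to test. This is a sketch rather than the code used in the project: `split_verification` is a hypothetical name, the sample strings are invented, and unlike the list comprehension it also strips whitespace from reviews that have no prefix.

```python
def split_verification(review):
    """Return (status, text) for one raw review string."""
    # partition splits on the first '|' only; separator is '' when there is none.
    prefix, separator, rest = review.partition('|')
    label = prefix.strip()
    if separator and label in ('✅ Trip Verified', 'Not Verified'):
        status = 'Verified' if label == '✅ Trip Verified' else 'Not Verified'
        return status, rest.strip()
    # No recognised prefix: keep the whole text and treat it as not verified.
    return 'Not Verified', review.strip()


# Invented sample strings covering the three cases.
samples = [
    '✅ Trip Verified |  Brisbane to Dubai. Seats were grubby.',
    'Not Verified | Good flight | would fly again',
    'Dubai to London. No prefix here.',
]
results = [split_verification(sample) for sample in samples]
for status, text in results:
    print(status, '->', text)
```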

In [11]:
reviewsrate = result.find_all('table', class_='review-ratings')

Aircraft = []
Traveller = []
Seat = []
Route = []
Comfort = []
Staff = []
FnB = []
Entertainment = []
Service = []
Wifi = []
Value = []
Recommended = []

for review in reviewsrate:
    # Text fields: select_one returns None when a row is missing, so .text raises AttributeError
    try:
        Aircraft.append(review.select_one('.review-rating-header.aircraft + td').text.strip())
    except AttributeError:
        Aircraft.append(np.nan)

    try:
        Traveller.append(review.select_one('.review-rating-header.type_of_traveller + td').text.strip())
    except AttributeError:
        Traveller.append(np.nan)

    try:
        Seat.append(review.select_one('.review-rating-header.cabin_flown + td').text.strip())
    except AttributeError:
        Seat.append(np.nan)

    try:
        Route.append(review.select_one('.review-rating-header.route + td').text.strip())
    except AttributeError:
        Route.append(np.nan)

    Comfort.append(len(review.select('.review-rating-header.seat_comfort + td .star.fill')))
    Staff.append(len(review.select('.review-rating-header.cabin_staff_service + td .star.fill')))
    FnB.append(len(review.select('.review-rating-header.food_and_beverages + td .star.fill')))
    Entertainment.append(len(review.select('.review-rating-header.inflight_entertainment + td .star.fill')))
    Service.append(len(review.select('.review-rating-header.ground_service + td .star.fill')))
    Wifi.append(len(review.select('.review-rating-header.wifi_and_connectivity + td .star.fill')))
    Value.append(len(review.select('.review-rating-header.value_for_money + td .star.fill')))

    try:
        Recommended.append(review.select_one('.review-rating-header.recommended + td').text.strip())
    except AttributeError:
        Recommended.append(np.nan)
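
To make the selectors above easier to follow, here is a minimal sketch that runs them against a small hand-written table. The markup is hypothetical: it is modelled on the class names used in the selectors, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors used above; the real page may differ.
html = '''
<table class="review-ratings">
  <tr>
    <td class="review-rating-header aircraft">Aircraft</td>
    <td class="review-value">A380</td>
  </tr>
  <tr>
    <td class="review-rating-header seat_comfort">Seat Comfort</td>
    <td class="review-rating-stars stars">
      <span class="star fill">1</span><span class="star fill">2</span>
      <span class="star fill">3</span><span class="star">4</span>
      <span class="star">5</span>
    </td>
  </tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table', class_='review-ratings')

# 'header + td' selects the cell immediately after the matching header cell.
aircraft = table.select_one('.review-rating-header.aircraft + td').text.strip()
# A star rating is the number of spans carrying both the 'star' and 'fill' classes.
seat_comfort = len(table.select('.review-rating-header.seat_comfort + td .star.fill'))
# Rows that are absent from the table yield an empty list or None.
wifi = len(table.select('.review-rating-header.wifi_and_connectivity + td .star.fill'))
route = table.select_one('.review-rating-header.route + td')

print(aircraft, seat_comfort, wifi, route)
```

Note the asymmetry this exposes: a missing text field becomes `NaN` through the `AttributeError` branch, while a missing star rating is recorded as `0`, which cannot be told apart from a category the passenger did not rate.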

Now that the code works on a single sample page, I will wrap it in a reusable function so that every page can be parsed in a loop.

In [207]:
def process_result(result):
    import numpy as np
    from datetime import datetime

    # Retrieve User Name
    name_element = result.find_all('span', itemprop='name')
    name = [name.text for name in name_element]

    # Retrieve Country Name
    country_element = result.find_all('span', itemprop="author")
    countries = [element.next_sibling.strip(' ()"') for element in country_element]

    # Retrieve Date Published
    date_element = result.find_all('time', itemprop='datePublished')
    dates = [element['datetime'] for element in date_element]
    # Change type
    dates = [datetime.strptime(string_dates, '%Y-%m-%d') for string_dates in dates]

    # Rating
    rating_element = result.find_all('span', itemprop="ratingValue")
    rating = [int(element.text) for element in rating_element][1:]  # Because the first value is the overall value that always appears in each page

    # Retrieve Headline
    headline_element = result.find_all(class_='text_header')
    headline = [element.text.strip('"') for element in headline_element]
    headline = headline[:-4]  # Drop the last four matches, which are presumably not review headlines

    # Retrieve Review
    rev_element = result.find_all('div', class_='text_content')
    reviews = [element.text for element in rev_element]

    verified_status = ['Verified' if review.startswith('✅ Trip Verified') else 'Not Verified' for review in reviews]
    # Split on the first '|' only, and only when a separator is actually present
    prefixes = ('✅ Trip Verified', 'Not Verified')
    reviews_only = [review.split('|', 1)[1].strip()
                    if review.startswith(prefixes) and '|' in review
                    else review for review in reviews]



    reviewsrate = result.find_all('table', class_='review-ratings')

    Aircraft = []
    Traveller = []
    Seat = []
    Route = []
    Comfort = []
    Staff = []
    FnB = []
    Entertainment = []
    Service = []
    Wifi = []
    Value = []
    Recommended = []

    for review in reviewsrate:

        try:
            Aircraft.append(review.select_one('.review-rating-header.aircraft + td').text.strip())
        except AttributeError:
            Aircraft.append(np.nan)

        try:
            Traveller.append(review.select_one('.review-rating-header.type_of_traveller + td').text.strip())
        except AttributeError:
            Traveller.append(np.nan)

        try:
            Seat.append(review.select_one('.review-rating-header.cabin_flown + td').text.strip())
        except AttributeError:
            Seat.append(np.nan)

        try:
            Route.append(review.select_one('.review-rating-header.route + td').text.strip())
        except AttributeError:
            Route.append(np.nan)

        Comfort.append(len(review.select('.review-rating-header.seat_comfort + td .star.fill')))
        Staff.append(len(review.select('.review-rating-header.cabin_staff_service + td .star.fill')))
        FnB.append(len(review.select('.review-rating-header.food_and_beverages + td .star.fill')))
        Entertainment.append(len(review.select('.review-rating-header.inflight_entertainment + td .star.fill')))
        Service.append(len(review.select('.review-rating-header.ground_service + td .star.fill')))
        Wifi.append(len(review.select('.review-rating-header.wifi_and_connectivity + td .star.fill')))
        Value.append(len(review.select('.review-rating-header.value_for_money + td .star.fill')))

        try:
            Recommended.append(review.select_one('.review-rating-header.recommended + td').text.strip())
        except AttributeError:
            Recommended.append(np.nan)

    # Drop the first table on each page, presumably the airline's overall summary rather than a review
    Aircraft = Aircraft[1:]
    Traveller = Traveller[1:]
    Seat = Seat[1:]
    Route = Route[1:]
    Comfort = Comfort[1:]
    Staff = Staff[1:]
    FnB = FnB[1:]
    Entertainment = Entertainment[1:]
    Service = Service[1:]
    Wifi = Wifi[1:]
    Value = Value[1:]
    Recommended = Recommended[1:]

    return {
        'Name': name,
        'Country': countries,
        'Date': dates,
        'Rating': rating,
        'Headline': headline,
        'Review': reviews_only,
        'Verified': verified_status,
        'Aircraft': Aircraft,
        'Type': Traveller,
        'Seat': Seat,
        'Route': Route,
        'Comfort': Comfort,
        'Staff': Staff,
        'FnB': FnB,
        'Entertainment': Entertainment,
        'Service': Service,
        'Wifi': Wifi,
        'Value': Value,
        'Recommended': Recommended
    }


In [208]:
# Create an empty list to store the processed results
dataset = []
# Loop through each parsed page (page number -> BeautifulSoup object)
for page_num, result in loaded_dict.items():
    try:
        # Process the current page and append the result to the dataset list
        dataset.append(process_result(result))
    except Exception as exc:
        print(f'Error in page {page_num}: {exc}')
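
Because `process_result` trims some lists with fixed slices (`[1:]`, `[:-4]`), the columns for a page can end up with different lengths without raising any error inside the loop. Building a `pandas.DataFrame` from a dictionary of unequal-length lists raises a `ValueError`, so it can help to check each page first. The sketch below is one hypothetical way to do that; `column_lengths` and `is_rectangular` are illustrative names, and the two small dictionaries are invented stand-ins for what `process_result` returns.

```python
def column_lengths(page):
    """Map each column name to the number of values scraped for it."""
    return {column: len(values) for column, values in page.items()}


def is_rectangular(page):
    """True when every column holds the same number of values."""
    return len(set(column_lengths(page).values())) <= 1


# Invented miniatures of what process_result returns for one page.
good_page = {'Name': ['A', 'B'], 'Rating': [4, 10], 'Headline': ['x', 'y']}
bad_page = {'Name': ['A', 'B'], 'Rating': [4], 'Headline': ['x', 'y']}

print(is_rectangular(good_page))
print(column_lengths(bad_page))
```

Printing `column_lengths` for a page that fails the check shows which field drifted out of alignment.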