<a href="https://colab.research.google.com/github/billyotieno/analytics-datasets/blob/main/Group_5_PDS_kq_customer_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kenya Airways & Industry Airline Customer Reviews Analysis Notebook**


---
 
![Kenya Airways Image](https://www.kindpng.com/picc/m/337-3373993_kenya-airways-hd-png-download.png)

> ## **Introduction**   
> Kenya Airways receives reviews on TripAdvisor from local, continental and international travellers. Its customer service team would like to extract insights from these reviews and conduct a competitor analysis against the top 10 airlines in the Skytrax ranking, to discover where the airline has a competitive edge and where it falls short. 

> However, the team needs help analyzing the reviews: going through such a large volume manually is time-consuming and resource-intensive. It is also hard to identify common trends and themes in the feedback, because customers comment on a wide range of topics, from the quality of the food to their in-flight experience.

> In this notebook we will be using text mining and sentiment analysis to process and analyze customer reviews to help Kenya Airways overcome these challenges through Data Science & Analytics. This would allow the airline to quickly and efficiently gain insights from the data and identify common issues and trends in customer feedback. The airline could then use this information to improve its products and services and provide better support to its customers.

> ## **Dataset Source**   
> To meet the objectives of the analysis, we extracted airline customer reviews from TripAdvisor for Kenya Airways and nine other leading African airlines from the Skytrax ranking. These datasets let us analyze reviews both at organization level (Kenya Airways) and against the industry (the 9 other airlines).

The datasets can be accessed in the GitHub repository via [this link](https://github.com/billyotieno/analytics-datasets/tree/main/Transport%20Services/Airlines/african-airlines-reviews-dataset).


# **Table of Contents**

>[Kenya Airways & Industry Airline Customer Reviews Analysis Notebook](#scrollTo=jNpZkgLrpWiF)

>>[Introduction](#scrollTo=jNpZkgLrpWiF)

>>[Dataset Source](#scrollTo=jNpZkgLrpWiF)

>[Table of Contents](#scrollTo=DM0RD92SwuJF)

>>[Setting up and Installing Required Libraries](#scrollTo=5yvGfaTlygLb)

>>[Sourcing Data from the GitHub Repository](#scrollTo=j4rYvvQ61Ps7)

>>[Importing Required Libraries](#scrollTo=Yuqv8THG2ZXG)

>>[Loading Data into DataFrames](#scrollTo=sII_fRkJpVcT)

>>[Initial Data Exploration](#scrollTo=L6LOVnaX0mhI)

>>>[Renaming Columns to Clear Columns](#scrollTo=RdNUoUxq9XNA)

>>>[Checking Dataset Shape](#scrollTo=aB5iN376RWDo)

>>>[Checking DataTypes](#scrollTo=8WxPPNl0RcRH)

>>>[Checking for Missing Values](#scrollTo=8oOLLEXfRfj8)

>>>[Dataset Description](#scrollTo=pYf9YG2GRjIL)

>>>[Initial Data Cleaning: Overlapped Text](#scrollTo=n6Peyes6Rowq)

>>[Data Exploration: Focused on Non-Review Columns](#scrollTo=bO-JbEUMR30e)

>>>[Total Number of Reviews by Airlines](#scrollTo=xjZz7X0gRw9q)

>>>[Flight Types or Regions Travelled by Reviewers for Each Airline](#scrollTo=kZe6gvL1SO-T)

>>>[Distribution of Ratings (1-5)](#scrollTo=dVmmWr8LSjsC)

>>>[Average Rating Across the Airlines for the Various Travel Classes](#scrollTo=aFC4ts_yTchJ)

>>>[Data Cleaning: Correcting Travel Month Column](#scrollTo=znqKrX7ZTtpb)

>>>[Exploring Review Ratings by Airline and Flight Travel Class](#scrollTo=FzY5o-D7T2cU)

>>>[Exploring Review Ratings by Airlines across Regions](#scrollTo=ET3DkYQyt0rf)

>>>[Breakdown of Airlines by Respective Travel Classes](#scrollTo=wvNVWzd8uk0Z)

>>[Data Exploration: Focused on Review Text](#scrollTo=R5Fe-nXgZ3G7)

>>>[Checking for NaNs in Extracted Review Columns](#scrollTo=3KTia-Ad8L6w)

>>>[Features distributions into Boolean, Categorical and Numerical types](#scrollTo=YvyCIsjiAseg)

>>>[Plotting the correlation matrix for the features](#scrollTo=6V63NJ4jBlXl)

>>[Data Quality Summary](#scrollTo=1ZosZNZ4mAvp)

>>[Data Preparation](#scrollTo=MqaDRPCsq64H)

>>>[Merging the two Datasets - Text Profiled & Non-Review Dataset](#scrollTo=zDXZ9KbArND8)



## **Setting up and Installing Required Libraries**

In [61]:
# Installing required libraries (-q quiet installing all libraries)
! pip install -q pandas pandera numpy matplotlib seaborn textblob dask missingno wordcloud
! pip install -q fasttext 

In [62]:
# Install pandas profiling - required for initial exploration
!pip install -q https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

  Preparing metadata (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
nlp-profiler 0.0.3 requires tqdm==4.46.0, but you have tqdm 4.64.1 which is incompatible.

In [63]:
# Install NLP Profiler for text datasets
# !pip install -U -q git+https://github.com/neomatrix369/nlp_profiler@scale-when-applied-to-larger-datasets
# print("\n Installation Completed")

!pip install -U -q git+https://github.com/neomatrix369/nlp_profiler.git@master

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
panel 0.12.1 requires tqdm>=4.48.0, but you have tqdm 4.46.0 which is incompatible.
pandas-profiling 0.0.dev0 requires tqdm<4.65,>=4.48.2, but you have tqdm 4.46.0 which is incompatible.

## **Sourcing Data from the GitHub Repository**

In [64]:
from google.colab import files

# Create an airline-datasets directory on Google Colab to host the files.
# No `cd` is needed: each `!` command runs in its own subshell, and wget's -P flag sets the target directory.
!rm -rf airline-datasets
!mkdir -p airline-datasets

# fetch all the datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/kenya_airways_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/air_mauritius.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/egypt_airways.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/ethiopian_airlines.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/fastjet_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/flysafair_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/royal_air_maroc.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/rwand_air_flights.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/seychelles_airways.csv" -P ./airline-datasets
!wget -q --show-progress "https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport Services/Airlines/african-airlines-reviews-dataset/south_african_airways.csv" -P ./airline-datasets



## **Importing Required Libraries**

In [65]:
# Import required libraries
import numpy as np
import pandas as pd
import pandera as pn
import dask 
import seaborn as sns
import spacy
import re
import nltk
import string
import fasttext
import warnings
import inflect # converting numbers in text to words
import wordcloud
import missingno as msno
from pandas_profiling import ProfileReport
from nlp_profiler.core import apply_text_profiling

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import wordpunct_tokenize
import matplotlib.pyplot as plt
from IPython.display import Image

# Pandas settings
pd.options.mode.chained_assignment = None
pd.set_option('display.max_colwidth', 20)
pd.options.display.max_rows = 4000

%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

# NLTK Download Options
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
# Visualization Fonts
!wget -O IBM_Sans.zip "https://fonts.google.com/download?family=IBM%20Plex%20Sans"
!wget -O McKinsey_Bower.zip "https://cdn.mckinsey.com/assets/fonts/web/Bower_Fonts.zip"

In [None]:
!unzip -o '*.zip'

In [None]:
!mv *.ttf /usr/share/fonts/truetype/
!mv *.otf /usr/share/fonts/truetype/

## **Loading Data into DataFrames**

In [None]:
from pathlib import Path 

path = "./airline-datasets/"
files = Path(path).glob('*.csv')

In [None]:
# Read data into dataframe with a new column identifying airline dataset
dfs = list()
for f in files:
  data = pd.read_csv(f, 
                     usecols=['Title','Image','Avatar_URL',
                              'crvsd','ui_header_link','default',
                              'phmbo','phmbo1','dmrsr','dmrsr2','dmrsr3',
                              'qwuub_URL','qwuub','tehyy','xcjrc',
                              'Rating'])
  data['source'] = f.stem
  dfs.append(data)

In [None]:
df = pd.concat(dfs, ignore_index=True)
df.head()

## **Initial Data Exploration**

### **Renaming Columns to Clear Columns**

In [None]:
# Rename columns to clear, understandable names
column_rename = {
    'Title':'review', 
    'Image':'review_image', 
    'Avatar_URL':'avatar_url', 
    'crvsd':'writing_month', 
    'ui_header_link':'reviewer_username', 
    'default':'reviewer_city',
    'phmbo':'reviewer_contribution', 
    'phmbo1':'helpful_votes', 
    'dmrsr':'flight_path', 
    'dmrsr2':'flight_type', 
    'dmrsr3':'travel_class', 
    'qwuub_URL':'review_link', 
    'qwuub':'review_headline',
    'tehyy':'travel_month', 
    'xcjrc':'disclaimer', 
    'Rating':'review_rating', 
    'source':'airline'
}

df.rename(columns=column_rename, inplace=True)

In [None]:
# Proper Naming for Airlines
df['airline'] = df.airline.astype('category')
df['airline'] = df['airline'].cat.rename_categories({
  'air_mauritius':'Air Mauritius',
  'egypt_airways':'Egypt Air',
  'ethiopian_airlines':'Ethiopian Airlines',
  'fastjet_flights':'FastJet',
  'flysafair_flights':'FlySafair',
  'kenya_airways_flights':'Kenya Airways',
  'royal_air_maroc':'Royal Air Maroc',
  'rwand_air_flights':'RwandAir',
  'seychelles_airways':'Air Seychelles',
  'south_african_airways':'South African Airways',
})

In [None]:
# Check new columns
df.columns

### **Checking Dataset Shape**

In [None]:
# Checking dataframe shape
df.shape

### **Checking DataTypes**

In [None]:
# Checking datatypes
df.dtypes

### **Checking for Missing Values**

In [None]:
# Check for Missing Values
msno.matrix(df)

### **Dataset Description**

In [None]:
# Checking dataframe description
df.describe(include='all')

In [None]:
#Getting the total number of reviews in the dataset
n_reviews = df.shape[0]
print('Number of customer reviews in the dataset: {:d}'.format(n_reviews))

### **Initial Data Cleaning: Overlapped Text**

In [None]:
# Initial dataset cleaning to support exploration
def remove_overlapped_text(df):
  """Keep only rows whose travel_class is a recognised class label.

  Rows with any other value (for example, overlapped text from a neighbouring
  field, or a missing value) are dropped, and the index is reset.
  """
  valid_classes = ["Economy", "Business Class", "First Class"]
  return df[df["travel_class"].isin(valid_classes)].reset_index(drop=True)

df = remove_overlapped_text(df)
df.shape

## **Data Exploration: Focused on Non-Review Columns**

### **Total Number of Reviews by Airlines**

In [None]:
import matplotlib.font_manager as fm
viz_color = "#102747"

# path = '/usr/share/fonts/truetype/IBMPlexSans-Regular.ttf'
path = '/usr/share/fonts/truetype/Bower-Bold.otf'
fontprop = fm.FontProperties(fname=path)


sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Number of Reviews by Airline", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Showing distribution of reviews per Airline", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

sns.countplot(data=df, y="airline", ax=ax, color=viz_color, order=df['airline'].value_counts().index)

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

# ax.set(ylabel="Airlines", xlabel="Number of Reviews")

plt.show()

### **Flight Types or Regions Travelled by Reviewers for Each Airline**

In [None]:
# What are the most common flight types experienced by reviewers across the various airlines?
airline_flight_type = df.groupby(['airline', 'flight_type']).size().reset_index().pivot(columns='flight_type', index='airline', values=0)
airline_flight_type

### **Distribution of Ratings (1-5)**

In [None]:
# What is the distribution of ratings in the review dataset?
# Clean up the review rating: keep the last two characters and convert them to a numeric rating
df.review_rating = df.review_rating.str[-2:]
df.review_rating = df.review_rating.astype(int) / 10
df.review_rating.value_counts().plot(kind="barh")
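
The two-character slice above only works if every raw rating ends in a two-digit score. A slightly more defensive sketch follows. It assumes the raw values look like `"ui_bubble_rating bubble_50"` (an assumed format, since the raw strings are not shown in this notebook) and uses a hypothetical helper, `parse_rating`.

```python
import re

# Assumed raw format: a class-like string ending in "bubble_NN",
# where NN is the rating multiplied by 10 (e.g. "bubble_50" for 5.0).
RATING_PATTERN = re.compile(r"(\d{2})\s*$")


def parse_rating(raw):
    """Return the rating as a float on the 1-5 scale, or None if it cannot be parsed."""
    if not isinstance(raw, str):
        return None
    match = RATING_PATTERN.search(raw)
    if match is None:
        return None
    return int(match.group(1)) / 10


samples = ["ui_bubble_rating bubble_50", "ui_bubble_rating bubble_10", None, "n/a"]
print([parse_rating(s) for s in samples])
```

On the DataFrame this could be applied with `df.review_rating.map(parse_rating)`, so that unparseable values come back as `None` rather than raising an error.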

### **Average Rating Across the Airlines for the Various Travel Classes**

In [None]:
# What is the average rating given by travellers on KQ and the other airlines in each travel class?
import numpy as np

# Mean rating per (airline, travel class), pivoted so that travel classes become columns
airline_class_review = df.groupby(['airline', 'travel_class'])['review_rating'].mean().unstack('travel_class')
airline_class_review

### **Data Cleaning: Correcting Travel Month Column**

In [None]:
# Clean travel month: drop the 16-character prefix that precedes the month and year
df.travel_month = df.travel_month.str[16:]
df.travel_month

In [None]:
df['travel_year'] = df.travel_month.str[-4:]
df.travel_year.head()

In [None]:
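# Assumption: years extracted as 25** are extraction errors and are to be treated as 2022
# (the conversion itself is not applied in this cell). Check missing values per column: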
df.isna().sum()
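
The assumption stated in the comment above is not applied by the cell itself. A minimal sketch of how it could be applied is shown below, using a hypothetical helper `fix_travel_year` that treats any four-digit year starting with `25` as 2022 and leaves everything else untouched.

```python
def fix_travel_year(year):
    """Map extraction-error years of the form 25** to "2022"; leave other values as-is."""
    if isinstance(year, str) and len(year) == 4 and year.isdigit() and year.startswith("25"):
        return "2022"
    return year


# Illustrative values only, not taken from the dataset
years = ["2019", "2522", "2021", None]
print([fix_travel_year(y) for y in years])
```

On the DataFrame this would be `df['travel_year'] = df['travel_year'].map(fix_travel_year)`.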

### **Exploring Review Ratings by Airline and Flight Travel Class**

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Heatmap of reviews by Airline, Travel Class", ha='left', fontproperties=fontprop, fontsize=30, x=0.125, y=1)
plt.title("Most airlines tend to have good ratings for their Business Class compared to Economy. \n", 
          loc='left', alpha=0.9, fontproperties=fontprop, fontsize=15)

sns.heatmap(airline_class_review, cmap="Blues", linewidth=1, linecolor="#F4F4F4", cbar_kws = {"location":"bottom", "use_gridspec":False})

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Flight Travel Class', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

The heatmap above shows that travellers generally rate Business Class higher than Economy Class. 
FlySafair is the exception, since it only operates flights in Economy Class.

### **Exploring Review Ratings by Airlines across Regions**

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Exploring Common Flight Type across Airlines", ha='left', fontproperties=fontprop, fontsize=30, x=0.125, y=1)
plt.title("Most of the travel done by airline customers was International, followed by Africa. \n", 
          loc='left', alpha=0.9, fontproperties=fontprop, fontsize=15)

sns.heatmap(airline_flight_type, cmap="Blues", linewidth=1, linecolor="#F4F4F4")

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Flight Type / Regions', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

The heatmap above shows that most reviewers took international flights, followed closely by flights in the Africa category.

### **Breakdown of Airlines by Respective Travel Classes**

In [None]:
airline_travel_class = df.groupby(['airline', 'travel_class']).size().reset_index().pivot(columns='travel_class', index='airline', values=0)
airline_travel_class

In [None]:
sns.set_theme(style="whitegrid")
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Travel Classes by Airline", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Graph showing airline travel class experienced by Travelers", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

airline_travel_class.plot(kind="barh", stacked=True, ax=ax)

plt.xticks(fontproperties=fontprop, fontsize=15)
plt.yticks(fontproperties=fontprop, fontsize=15)
plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Airlines', fontproperties=fontprop, fontsize=20)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

plt.suptitle("Top Flight Paths by Reviews", ha='left', fontproperties=fontprop, fontsize=40, x=0.125, y=0.98)
plt.title("Flight paths used by customers giving reviews", loc='left',alpha=0.9, fontproperties=fontprop, fontsize=20)

plt.xticks(fontproperties=fontprop, fontsize=12)
plt.yticks(fontproperties=fontprop, fontsize=12)

df.flight_path.value_counts()[:5].plot(kind='barh', color=viz_color)

plt.xlabel('Number of Reviews', fontproperties=fontprop, fontsize=20)
plt.ylabel('Flight Paths', fontproperties=fontprop, fontsize=20)

In [None]:
df.describe()

## **Data Exploration: Focused on Review Text**

At this step we drill down into the review text, extract text features and perform an exploratory analysis of the extracted features. These features will then be used downstream in the modelling stage.

In [None]:
# From the DataFrame we fetch the review column and perform text profiling.
text_nlp = pd.DataFrame(df, columns=['review'])
# Exploring a Sample Review
text_nlp["review"][0]

In [None]:
%%script echo skipping
# We'll skip this step due to the execution timeline on the notebook.
# As an alternative we have saved the files from this output into .csv so that
# we only read directly from the CSV files.
profile_data = apply_text_profiling(
    text_nlp, 'review', 
    params={'spelling_check': False,
            'grammar_check': False,
            'ease_of_reading_check':False,
            'parallelisation_method': 'default'})

# Generating a profiling report into HTML
profile_text = ProfileReport(profile_data)
profile_text.to_file("airline-review-text-profiler-form.html")

# Saving the profiled data to CSV to save on execution
profile_data.to_csv("airline-review-text-profiled-dataset.csv")

In [None]:
profile_data = pd.read_csv("https://raw.githubusercontent.com/billyotieno/analytics-datasets/main/Transport%20Services/Airlines/airline-review-text-profiled-dataset.csv")

In [None]:
# Dropping the Unnamed: 0 column created during file export
profile_data.drop(["Unnamed: 0"], axis=1, inplace=True)
profile_data.columns

In [None]:
# Check the datatypes of the newly created text features
profile_data.dtypes

In [None]:
# Comparing Common Words used by Different Airlines - we'll use df for this
df.columns

In [None]:
# Summary of key text statistics
print("Number of Emojis in Corpus - ", profile_data["emoji_count"].sum())
print("Number of Punctuations in Corpus - ", profile_data["punctuations_count"].sum())
print("Number of Stop Words in Corpus - ", profile_data["stop_words_count"].sum())
print("Number of Dates in Corpus - ", profile_data["dates_count"].sum())
print("Number of Non-English Characters in Corpus - ", profile_data["non_english_characters_count"].sum())
print("Number of Repeated Whitespaces in Corpus - ", profile_data["repeated_whitespaces_count"].sum())

### **Checking for NaNs in Extracted Review Columns**

In [None]:
# Percentage of non-null values.
filling_rates = 100.*profile_data.count().sort_values(ascending=False)/profile_data.shape[0]
print(filling_rates)
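
The filling rate is simply the share of non-null entries in each column. The small illustration below uses a made-up frame (the column names and values are invented for the example) and shows that `count() / number of rows` and `notna().mean()` give the same percentages.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the review dataset
toy = pd.DataFrame({
    "text": ["good flight", "late departure", None, "friendly crew"],
    "words_count": [2, 2, np.nan, 2],
    "emoji_count": [0, np.nan, np.nan, 1],
})

# Same formula as above: non-null count per column over the number of rows
filling_rates = 100.0 * toy.count().sort_values(ascending=False) / toy.shape[0]

# Equivalent formulation: mean of the non-null mask
alt_rates = 100.0 * toy.notna().mean()

print(filling_rates)
```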

In [None]:
values_filling_rates = filling_rates.values
text_filling_rates = filling_rates.index.to_list()
print(text_filling_rates)

In [None]:
plt.figure(figsize=(6,6),dpi=100)
sns.set(style="whitegrid")
ax = sns.barplot(x=values_filling_rates, y=text_filling_rates,color="Red")
ax.set(xlabel='Filling percentage (%)', ylabel='Feature')
plt.tight_layout()
plt.show()

### **Features distributions into Boolean, Categorical and Numerical types**

In [None]:
df_for_training = profile_data.copy()

In [None]:
cols_for_training = df_for_training.columns.to_list()

In [None]:
feats_bool = ['recommended',
              'has_layover']
feats_cat = ['airline',
             'traveller_type',
             'cabin','review_text', 'review',
             'pos_neu_neg_review_score']
feats_num = [feat for feat in cols_for_training if feat not in feats_bool and feat not in feats_cat]

In [None]:
print('Boolean features: \n{}\n'.format(feats_bool))
print('Categorical features: \n{}\n'.format(feats_cat))
print('Numerical features: \n{}\n'.format(feats_num))

### **Plotting the correlation matrix for the features**

In [None]:
# Let's plot a correlation matrix among the features
def plot_cmap(matrix_values, figsize_w, figsize_h, filename):
    """
    Plot a heatmap corresponding to the input values.
    """
    if figsize_w is not None and figsize_h is not None:
        plt.figure(figsize=(figsize_w,figsize_h))
    else:
        plt.figure()
    cmap = sns.diverging_palette(240, 10, sep=20, as_cmap=True)
    sns.heatmap(matrix_values, annot=True, fmt=".2f", cmap=cmap, vmin=-1, vmax=1)
    plt.savefig(filename)
    plt.show()
    return cmap

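# Keep numeric and boolean columns only: recent pandas versions raise on non-numeric columns in corr()
corr_values = df_for_training[feats_num].dropna(axis=0,how='any').select_dtypes(include=['number', 'bool']).corr()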
plot_cmap(matrix_values=corr_values, 
          figsize_w=17, 
          figsize_h=17, 
          filename='./Corr.png')

Note:

1. A positive correlation between the different types of review scores and subscores
2. A negative correlation between the length of the review text and the value of the different types of review scores and subscores
3. The similarity between using the number of characters and the number of words, from which we conclude that we can drop one of the two features

In [None]:
corr_matrix = profile_data.select_dtypes(include=['number', 'bool']).corr().abs()

# Select upper triangle of correlation matrix (boolean mask above the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

to_drop
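
To make the upper-triangle filter concrete, here is a small sketch on synthetic data (the feature names are invented for the example). `chars` is an exact multiple of `words`, so it is flagged for dropping, which mirrors the characters-versus-words redundancy mentioned in the notes above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
words = rng.integers(5, 200, size=100)

# Synthetic features: "chars" is perfectly correlated with "words"
toy = pd.DataFrame({
    "words": words,
    "chars": words * 5,
    "noise": rng.normal(size=100),
})

corr_matrix = toy.corr().abs()

# Keep only entries strictly above the diagonal so each pair is counted once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# A column is flagged if it correlates above 0.95 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is the one flagged, so the column order decides which of the two features survives.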

## **Data Quality Summary**

As an output of our data exploration, we identified the following data quality issues in the review text and in the other columns of the datasets. The summary below informs our Data Preparation stage:

**Data Quality Issues Identified in Non-Review Text Columns**

> - *Redundant Columns*   
> - *Duplicate Rows*  
> - *Wrong Data Types* 
> - *Missing Values*

**Data Quality Issues Identified in Review Text**

> - *Extra Whitespaces in Text*. 
> - *Digits in Review Text*.  
> - *Emojis in Review Text*.  
> - *Punctuation*.  
> - *URLs*.  

For the review text, the data will be cleaned up as part of text pre-processing.
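
The review-text issues listed above map naturally onto a small cleaning function. The sketch below is one possible approach, not the notebook's actual pre-processing: `clean_review` is a hypothetical helper, digits are simply removed rather than spelled out, and the emoji pattern covers only a few common Unicode blocks.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: a handful of common emoji/symbol blocks; not an exhaustive emoji list
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Remove URLs, emojis, digits and punctuation, then collapse extra whitespace."""
    text = URL_PATTERN.sub(" ", text)      # URLs first, before punctuation is stripped
    text = EMOJI_PATTERN.sub(" ", text)    # emojis
    text = re.sub(r"\d+", " ", text)       # digits
    text = text.translate(PUNCT_TABLE)     # ASCII punctuation
    return re.sub(r"\s+", " ", text).strip()


# Made-up review; \U0001F600 is a grinning-face emoji
sample = "Great  crew!!! \U0001F600 Flight KQ100 was 2 hours late, see https://example.com/review"
print(clean_review(sample))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the `://` and dots that the URL pattern relies on.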

## **Data Preparation**

At this stage we prepare the data for modelling and further analysis.

### **Merging the two Datasets - Text Profiled & Non-Review Dataset**

In [None]:
# profile_data
profile_data["id"] = profile_data.index
df["id"] = df.index

In [None]:
review_df = pd.merge(df, profile_data, how="left", on="id")
review_df.shape
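
Because the two frames are joined purely on row position, it is worth confirming that every row finds exactly one partner. The sketch below uses toy frames (invented data) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for df and profile_data
left = pd.DataFrame({"review": ["good", "bad", "okay"]})
right = pd.DataFrame({"words_count": [1, 1]})  # one row short on purpose

left["id"] = left.index
right["id"] = right.index

merged = pd.merge(
    left, right, how="left", on="id",
    validate="one_to_one",  # raises MergeError if ids repeat on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows without a match in the right-hand frame
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If `unmatched` is non-empty on the real data, the profiled CSV and the cleaned `df` are out of step and the positional `id` join should not be trusted.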

In [None]:
df.shape

In [None]:
profile_data.shape

In [None]:
review_df.columns

In [None]:
review_df.isna().sum()

In [None]:
df.head()

In [None]:
profile_data.head()

In [None]:
df.tail()

In [None]:
profile_data.tail()