## Data Cleaning

The data we extracted from the website is not yet clean or ready for analysis. The reviews column in particular needs to be cleaned of punctuation, spelling issues and other stray characters.

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import os
import re

In [2]:
# Create a dataframe from csv file

cwd = os.getcwd()

df = pd.read_csv(os.path.join(cwd, "BA_reviews.csv"), index_col=0)

In [3]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | Not a great experience. I co...,5.0,18th January 2024,United Kingdom
1,Not Verified | I was excited to fly BA as I'd ...,3.0,18th January 2024,United Kingdom
2,Not Verified | I just want to warn everyone o...,2.0,17th January 2024,Germany
3,Not Verified | Paid for business class travell...,1.0,16th January 2024,United Kingdom
4,✅ Trip Verified | The plane was extremely dir...,1.0,15th January 2024,Ireland


* Create a column that indicates whether the reviewer is verified or not

In [4]:
df['verified'] = df.reviews.str.contains('Trip Verified')

In [5]:
df['verified']

0        True
1       False
2       False
3       False
4        True
        ...  
3495    False
3496    False
3497    False
3498    False
3499    False
Name: verified, Length: 3500, dtype: bool
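To illustrate how this flag behaves, the sketch below applies the same `str.contains` check to three made-up review strings (the sample text is hypothetical, not taken from the dataset). "Not Verified" does not contain the substring "Trip Verified", so only the first row is flagged. The `regex=False` and `na=False` arguments are optional additions, assumed here so that the pattern is matched literally and a missing review counts as unverified.

```python
import pandas as pd

# Hypothetical sample rows that mimic the prefixes seen in the scraped data
sample = pd.DataFrame({
    "reviews": [
        "✅ Trip Verified | Smooth flight.",
        "Not Verified | Lost my luggage.",
        "Older review with no verification prefix.",
    ]
})

# Literal substring match; na=False treats a missing review as not verified
sample["verified"] = sample["reviews"].str.contains(
    "Trip Verified", regex=False, na=False
)

print(sample["verified"].tolist())  # [True, False, False]
```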

### Cleaning Reviews

We will extract the reviews column into a separate series and clean it for semantic analysis

In [6]:
pip install nltk




In [7]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user5\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
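from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

# Note: str.strip() treats its argument as a set of characters to trim from both
# ends, not as a prefix, so "Not Verified |" is left in place and the token
# "verified" ends up in the corpus for those rows
reviews_df = df.reviews.str.strip("✅ Trip Verified |")

# Create an empty list to collect the cleaned corpus
corpus = []

# Loop through each review: replace non-letters with spaces, lowercase it,
# drop stopwords, lemmatize the remaining words and add the result to the corpus
for review in reviews_df:
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    review = review.split()
    review = [lemma.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)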
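One detail worth knowing about the prefix removal: `str.strip` takes a set of characters to trim from both ends, not a literal prefix. A possible alternative, sketched below, is an anchored regular expression that removes either verification label. The pattern, the `strip_prefix` helper and the sample strings are assumptions made for illustration, not part of the original notebook.

```python
import re

# Assumed prefix format: an optional check mark, then "Trip Verified" or
# "Not Verified", then a pipe. The pattern is illustrative only.
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_prefix(text: str) -> str:
    """Remove the verification prefix if present; leave other text untouched."""
    return PREFIX.sub("", text)

samples = [
    "✅ Trip Verified | Not a great experience.",
    "Not Verified | Paid for business class.",
    "Flew from London to Doha.",
]
cleaned = [strip_prefix(s) for s in samples]
print(cleaned)

# The character-set behaviour of strip(): every letter of "Tired" is in the
# set, so the whole word is removed
emptied = "Tired".strip("✅ Trip Verified |")
print(repr(emptied))  # ''
```

On a pandas column the same pattern could be applied with `df.reviews.str.replace(PREFIX, "", regex=True)`.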

In [9]:
# Add the corpus to the original dataframe

df['corpus'] = corpus

In [10]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Not a great experience. I co...,5.0,18th January 2024,United Kingdom,True,great experience could check online two separa...
1,Not Verified | I was excited to fly BA as I'd ...,3.0,18th January 2024,United Kingdom,False,verified excited fly ba travelled long haul yr...
2,Not Verified | I just want to warn everyone o...,2.0,17th January 2024,Germany,False,verified want warn everyone worst customer ser...
3,Not Verified | Paid for business class travell...,1.0,16th January 2024,United Kingdom,False,verified paid business class travelling cairo ...
4,✅ Trip Verified | The plane was extremely dir...,1.0,15th January 2024,Ireland,True,plane extremely dirty chocolate smudged mine c...


### Cleaning dates

In [11]:
df.dtypes

reviews      object
stars       float64
date         object
country      object
verified       bool
corpus       object
dtype: object

In [13]:
# Convert the date to datetime format

df.date = pd.to_datetime(df.date, format='mixed')

In [14]:
df.date.head()

0   2024-01-18
1   2024-01-18
2   2024-01-17
3   2024-01-16
4   2024-01-15
Name: date, dtype: datetime64[ns]
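`format='mixed'` (available from pandas 2.0) lets pandas infer the format of each value individually. If a single explicit format is preferred, one option is to remove the ordinal suffix from the day first. The sketch below assumes every date looks like "18th January 2024"; the three sample values are illustrative.

```python
import pandas as pd

raw = pd.Series(["18th January 2024", "1st September 2014", "22nd March 2019"])

# Drop the ordinal suffix that directly follows the day number (st, nd, rd, th)
no_suffix = raw.str.replace(r"(?<=\d)(?:st|nd|rd|th)\b", "", regex=True)

# With the suffix gone, every value follows one explicit format
parsed = pd.to_datetime(no_suffix, format="%d %B %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

An explicit format fails loudly on a value that does not match, which can be useful for spotting malformed dates early.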

### Cleaning ratings with stars

In [29]:
df.stars

0       5.0
1       3.0
2       2.0
3       1.0
4       1.0
       ... 
3495    9.0
3496    1.0
3497    1.0
3498    7.0
3499    9.0
Name: stars, Length: 3500, dtype: float64

In [24]:
# Check for unique values
df.stars.unique()

array([ 5.,  3.,  2.,  1.,  4.,  9.,  6.,  8., 10.,  7., nan])

In [18]:
df.stars.value_counts()

stars
1.0     847
2.0     406
3.0     392
8.0     341
10.0    286
9.0     284
7.0     279
5.0     249
4.0     240
6.0     173
Name: count, dtype: int64

In [19]:
# Check whether stars has any null values
df.stars.isna().sum()

3

In [25]:
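# Attempt to drop the rows where the rating is NaN
# Note: NaN is a float, so comparing against the string "nan" matches no rows
# and this call leaves the dataframe unchanged
df.drop(df[df.stars == "nan"].index, axis=0, inplace=True)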

In [26]:
# Check the unique values again
df.stars.unique()

array([ 5.,  3.,  2.,  1.,  4.,  9.,  6.,  8., 10.,  7., nan])
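The `nan` entry is still present because a missing rating is stored as a float NaN, which never equals the string `"nan"` (and does not equal itself either). The toy series below is a hypothetical stand-in for `df.stars` that shows which tests find missing values and which do not.

```python
import numpy as np
import pandas as pd

stars = pd.Series([5.0, np.nan, 1.0, np.nan])

# Comparing floats with the string "nan" never matches, so nothing is selected
matches_string = int((stars == "nan").sum())
print(matches_string)  # 0

# NaN is not equal to itself, so comparing with np.nan finds nothing either
matches_nan = int((stars == np.nan).sum())
print(matches_nan)  # 0

# isna() is the reliable test for missing values
missing = int(stars.isna().sum())
print(missing)  # 2

# dropna() removes exactly those entries
kept = stars.dropna().tolist()
print(kept)  # [5.0, 1.0]
```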

### Check for null values

In [30]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     3495
         True   False  False    False     False        3
         False  False  True     False     False        2
Name: count, dtype: int64

In [31]:
df.country.isnull().value_counts()

country
False    3498
True        2
Name: count, dtype: int64
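`isnull().value_counts()` lists each combination of missing and present columns. When only a per-column total is needed, `isna().sum()` gives a more compact summary. The sketch below uses a small made-up frame rather than the review data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing rating and one missing country
toy = pd.DataFrame({
    "stars": [5.0, np.nan, 1.0, 3.0],
    "country": ["Germany", "Ireland", None, "Ireland"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column.to_dict())

# Number of rows with at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```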

### We have 3 missing values in stars and 2 in country. We can simply remove those five reviews from the dataframe

In [34]:
# Drop the rows where the country or stars value is null
df.dropna(subset=['country', 'stars'], inplace=True)

In [35]:
df.shape

(3495, 6)

In [36]:
# Reset the index; assign the result, since reset_index returns a new dataframe
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Not a great experience. I co...,5.0,2024-01-18,United Kingdom,True,great experience could check online two separa...
1,Not Verified | I was excited to fly BA as I'd ...,3.0,2024-01-18,United Kingdom,False,verified excited fly ba travelled long haul yr...
2,Not Verified | I just want to warn everyone o...,2.0,2024-01-17,Germany,False,verified want warn everyone worst customer ser...
3,Not Verified | Paid for business class travell...,1.0,2024-01-16,United Kingdom,False,verified paid business class travelling cairo ...
4,✅ Trip Verified | The plane was extremely dir...,1.0,2024-01-15,Ireland,True,plane extremely dirty chocolate smudged mine c...
...,...,...,...,...,...,...
3490,LGW-MCO-LGW 11-25th August Economy Class. Got ...,9.0,2014-09-01,United Kingdom,False,lgw mco lgw th august economy class got gate l...
3491,LGW-Faro return 19th/26th Aug Economy 2 adults...,1.0,2014-09-01,United Kingdom,False,lgw faro return th th aug economy adult child ...
3492,I have flown twice with BA now in business cla...,1.0,2014-09-01,United Kingdom,False,flown twice ba business class delhi say absolu...
3493,Flew from London to Doha on a newly refurbishe...,7.0,2014-09-01,United Kingdom,False,flew london doha newly refurbished smooth ride...
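By default `reset_index` returns a new dataframe rather than modifying the existing one. The result therefore has to be assigned (or `inplace=True` passed) before later cells see the renumbered index. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Index with gaps, as left behind after dropping rows
toy = pd.DataFrame({"stars": [5.0, 3.0, 1.0]}, index=[0, 2, 5])

# Without assignment the call only returns a renumbered copy
toy.reset_index(drop=True)
before = toy.index.tolist()
print(before)  # [0, 2, 5]

# Assigning the result keeps the new index
toy = toy.reset_index(drop=True)
after = toy.index.tolist()
print(after)  # [0, 1, 2]
```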


### The data is now cleaned and ready for visualization and analysis

In [37]:
# Export the cleaned data

df.to_csv(os.path.join(cwd, "cleaned-BA-reviews.csv"))
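`to_csv` writes the index as an unnamed first column by default, and datetime columns come back as plain text unless they are parsed again on load. The round trip below is a sketch under assumptions: it uses an in-memory buffer and a made-up two-row frame instead of the real file, writes with `index=False`, and restores the dates with `parse_dates`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "stars": [5.0, 1.0],
    "country": ["Germany", "Ireland"],
    "date": pd.to_datetime(["2024-01-18", "2014-09-01"]),
})

# Write without the index so no unnamed column appears on reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back from the start of the buffer, converting the date column again
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print(restored.columns.tolist())
print(restored["date"].dtype)
```

If the index is kept in the file, as in the export cell, reading it back with `index_col=0` plays the same role.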