# Data Cleaning 

The data extracted from the website is still raw and not yet ready to be analyzed.
The review text needs to be cleaned of punctuation, spelling issues, and other unwanted characters.

In [20]:
# imports

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#regex
import re 

In [21]:
# create a DataFrame from a CSV file

df = pd.read_csv("BA_reviews.csv", index_col=0)


In [22]:
df.head()

Unnamed: 0,reviews,dates,countries
0,"✅ Trip Verified | Filthy plane, cabin staff o...",28th August 2023,United Kingdom
1,✅ Trip Verified | Chaos at Terminal 5 with B...,27th August 2023,United Kingdom
2,Not Verified | BA cancelled our flight and co...,27th August 2023,United Kingdom
3,✅ Trip Verified | When on our way to Heathrow ...,27th August 2023,United Kingdom
4,"✅ Trip Verified | Nice flight, good crew, very...",26th August 2023,United States


In [23]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [24]:
df['verified']

0       True
1       True
2      False
3       True
4       True
       ...  
995     True
996     True
997     True
998     True
999     True
Name: verified, Length: 1000, dtype: bool

## Cleaning Reviews

We will extract the reviews column into a separate Series and clean it for semantic analysis.

In [25]:
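# for lemmatization of words we will use the nltk library
# (needs the nltk "stopwords" and "wordnet" data, available via nltk.download)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

# build the stopword set once instead of once per word
stop_words = set(stopwords.words("english"))

# remove the leading "✅ Trip Verified |" label; str.strip treats its argument
# as a set of characters and could also eat letters at the end of a review
reviews_data = df.reviews.str.replace(r"^\s*✅\s*Trip Verified\s*\|\s*", "", regex=True)

# create an empty list to collect the cleaned corpus
corpus = []

# loop through each review: replace non-letters with spaces, lowercase it,
# drop stopwords, lemmatize the rest, rejoin the words and add them to the corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)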
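Each review starts with a verification label ("✅ Trip Verified |" or "Not Verified |") that is not part of the review text. The following is a minimal sketch, using only the standard library, of removing either label with an anchored regular expression. It also shows why a character-set strip such as `str.strip` is a poor fit for prefix removal. The names `PREFIX` and `strip_label` and the sample strings are hypothetical and introduced only for illustration.

```python
import re

# Hypothetical pattern: an optional check mark, either label, then the "|" separator.
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_label(review: str) -> str:
    """Remove a leading verification label and leave the rest of the review untouched."""
    return PREFIX.sub("", review)

sample = "✅ Trip Verified | Flight was delayed"

# str.strip removes any of the given characters from BOTH ends,
# so the trailing "e" and "d" of "delayed" are lost as well.
char_stripped = sample.strip("✅ Trip Verified |")

# The anchored pattern only touches the start of the string.
cleaned = strip_label(sample)
unverified = strip_label("Not Verified | BA cancelled our flight")

print(repr(char_stripped))
print(repr(cleaned))
print(repr(unverified))
```

On a pandas Series, the same pattern could be passed to `Series.str.replace` with `regex=True`.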

In [26]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [27]:
df.head()

Unnamed: 0,reviews,dates,countries,verified,corpus
0,"✅ Trip Verified | Filthy plane, cabin staff o...",28th August 2023,United Kingdom,True,filthy plane cabin staff ok appalling customer...
1,✅ Trip Verified | Chaos at Terminal 5 with B...,27th August 2023,United Kingdom,True,chaos terminal ba cancellation delay staff giv...
2,Not Verified | BA cancelled our flight and co...,27th August 2023,United Kingdom,False,verified ba cancelled flight could book u onto...
3,✅ Trip Verified | When on our way to Heathrow ...,27th August 2023,United Kingdom,True,way heathrow airport merely half hour schedule...
4,"✅ Trip Verified | Nice flight, good crew, very...",26th August 2023,United States,True,nice flight good crew good seat food would exp...


## Cleaning and Formatting Dates

In [28]:
df.dtypes

reviews      object
dates        object
countries    object
verified       bool
corpus       object
dtype: object

In [31]:
# convert the dates column to datetime format
# (format='mixed' requires pandas 2.0 or later)

df['dates'] = pd.to_datetime(df['dates'], format='mixed')

In [32]:
df.dates.head()

0   2023-08-28
1   2023-08-27
2   2023-08-27
3   2023-08-27
4   2023-08-26
Name: dates, dtype: datetime64[ns]
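The raw dates carry an ordinal suffix ("28th August 2023"). As a rough illustration of what the conversion has to deal with, the sketch below parses such strings with the standard library by dropping the suffix first. This is an alternative shown for clarity, not the method used in this notebook. The names `ORDINAL` and `parse_review_date` are hypothetical, and the `%B` directive assumes English month names (the default C locale).

```python
import re
from datetime import datetime

# Matches "st", "nd", "rd" or "th" only when it directly follows a digit,
# so the "st" inside "August" is left alone.
ORDINAL = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")

def parse_review_date(text: str) -> datetime:
    """Parse a date such as '28th August 2023' into a datetime."""
    without_suffix = ORDINAL.sub("", text.strip())
    return datetime.strptime(without_suffix, "%d %B %Y")

print(parse_review_date("28th August 2023").date())
print(parse_review_date("1st March 2019").date())
```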

## Checking for Null Values

In [35]:
df.isnull().value_counts()



reviews  dates  countries  verified  corpus
False    False  False      False     False     1000
Name: count, dtype: int64

In [36]:
df.countries.isnull().value_counts()

countries
False    1000
Name: count, dtype: int64
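`isnull().value_counts()` counts how often each row-level pattern of missing and present values occurs. When a per-column count is easier to read, `isna().sum()` is a common alternative. The sketch below uses a small hypothetical frame (`toy`) in place of the reviews data, since the real data contains no missing values.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews data.
toy = pd.DataFrame({
    "reviews": ["Good flight", None, "Late departure"],
    "countries": ["Spain", "France", None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```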

In [37]:
df.shape 

(1000, 5)

In [39]:
# reset the index; reset_index returns a new DataFrame, so assign it back

df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,dates,countries,verified,corpus
0,"✅ Trip Verified | Filthy plane, cabin staff o...",2023-08-28,United Kingdom,True,filthy plane cabin staff ok appalling customer...
1,✅ Trip Verified | Chaos at Terminal 5 with B...,2023-08-27,United Kingdom,True,chaos terminal ba cancellation delay staff giv...
2,Not Verified | BA cancelled our flight and co...,2023-08-27,United Kingdom,False,verified ba cancelled flight could book u onto...
3,✅ Trip Verified | When on our way to Heathrow ...,2023-08-27,United Kingdom,True,way heathrow airport merely half hour schedule...
4,"✅ Trip Verified | Nice flight, good crew, very...",2023-08-26,United States,True,nice flight good crew good seat food would exp...
...,...,...,...,...,...
995,✅ Trip Verified | London to Philadelphia. I u...,2018-10-25,United States,True,london philadelphia upgraded coach business al...
996,✅ Trip Verified | Madrid to London. Good impro...,2018-10-24,Spain,True,madrid london good improvement ba club europe ...
997,✅ Trip Verified | London to Munich. The groun...,2018-10-23,Germany,True,london munich ground staff friendly plane clea...
998,✅ Trip Verified | London to Cape Town. Waiti...,2018-10-23,France,True,london cape town waiting gate staff announced ...


The data is now clean and ready for visualization and analysis.

In [40]:
#Export the clean data 

df.to_csv('cleaned-BA-reviews.csv')