# Data Cleaning 

The data extracted from the website is not yet clean enough to analyze.
The review text needs to be cleaned of punctuation, spelling variations, and other stray characters.

In [1]:
# imports

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#regex
import re 

In [2]:
# create a dataframe from a csv file

df = pd.read_csv("BA_reviews.csv", index_col=0)


In [3]:
df.head()

Unnamed: 0,reviews,dates,countries
0,✅ Trip Verified | British Airways has a total...,4th September 2023,United Kingdom
1,"✅ Trip Verified | London Heathrow to Keflavik,...",4th September 2023,Iceland
2,✅ Trip Verified | Mumbai to London Heathrow in...,4th September 2023,Iceland
3,✅ Trip Verified | Care and support shocking. ...,4th September 2023,United Kingdom
4,✅ Trip Verified | Flying A380 business class ...,2nd September 2023,Australia


In [4]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [5]:
df['verified']

0        True
1        True
2        True
3        True
4        True
        ...  
3638    False
3639    False
3640    False
3641    False
3642    False
Name: verified, Length: 3643, dtype: bool

## Cleaning Reviews

We will extract the reviews column into a separate Series, clean it for semantic analysis, and then add the result back to the dataframe as a new column.

In [6]:
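# NLTK provides the lemmatizer and the English stop-word list.
# Both need their data downloaded once: nltk.download("wordnet") and nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemma = WordNetLemmatizer()

# build the stop-word set once instead of once per word
stop_words = set(stopwords.words("english"))

# remove only the leading "✅ Trip Verified |" marker; str.strip() would instead
# treat its argument as a set of characters and could eat into the review text
reviews_data = df.reviews.str.replace(r"^\s*(?:✅\s*)?Trip Verified\s*\|\s*", "", regex=True)

# list that collects the cleaned reviews
corpus = []

# for each review: replace non-letters with spaces, lowercase, drop stop words,
# lemmatize the remaining words, and join them back into one string
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)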
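
One detail worth checking when removing a fixed marker such as `✅ Trip Verified |`: Python's `str.strip` (and the pandas `.str.strip` accessor) treats its argument as a set of characters to remove from both ends, not as a literal prefix. The sketch below uses a made-up review string to contrast that behaviour with removing only the leading marker through a regular expression. The helper name `remove_marker` is hypothetical and not part of the notebook.

```python
import re

MARKER = "✅ Trip Verified |"

# Hypothetical sample text, not taken from the dataset.
sample = "✅ Trip Verified | Terrible food, flight delayed"

# strip() removes any of the marker's characters from both ends,
# so it also removes letters that belong to the review itself.
stripped = sample.strip(MARKER)


def remove_marker(text: str) -> str:
    """Remove only a leading '✅ Trip Verified |' marker, if present."""
    return re.sub(r"^\s*✅\s*Trip Verified\s*\|\s*", "", text)


cleaned = remove_marker(sample)

print(repr(stripped))  # 'ble food, flight delay'
print(repr(cleaned))   # 'Terrible food, flight delayed'
```

Because the later steps drop stop words and lemmatize, lost letters at the start or end of a review can silently change which words end up in the corpus, so the prefix-only removal is the safer choice here.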

In [7]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [8]:
df.head()

Unnamed: 0,reviews,dates,countries,verified,corpus
0,✅ Trip Verified | British Airways has a total...,4th September 2023,United Kingdom,True,british airway total lack respect customer boo...
1,"✅ Trip Verified | London Heathrow to Keflavik,...",4th September 2023,Iceland,True,london heathrow keflavik iceland business clas...
2,✅ Trip Verified | Mumbai to London Heathrow in...,4th September 2023,Iceland,True,mumbai london heathrow business class ageing b...
3,✅ Trip Verified | Care and support shocking. ...,4th September 2023,United Kingdom,True,care support shocking written previously loyal...
4,✅ Trip Verified | Flying A380 business class ...,2nd September 2023,Australia,True,flying business class pleasure ba made disaste...


## Cleaning and Formatting Dates

In [9]:
df.dtypes

reviews      object
dates        object
countries    object
verified       bool
corpus       object
dtype: object

In [10]:
# convert the dates to datetime format

df.dates = pd.to_datetime(df['dates'], format='mixed')

In [11]:
df.dates.head()

0   2023-09-04
1   2023-09-04
2   2023-09-04
3   2023-09-04
4   2023-09-02
Name: dates, dtype: datetime64[ns]
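
The raw dates are strings such as `4th September 2023`, with an ordinal suffix on the day. `format='mixed'` (available in pandas 2.0 and later) lets pandas infer the format for each value. As an alternative sketch that does not rely on format inference, the ordinal suffix can be removed with a regular expression and the rest parsed with an explicit format. The helper name `parse_review_date` is hypothetical, and the month names are assumed to be in English.

```python
import re
from datetime import datetime


def parse_review_date(text: str) -> datetime:
    """Parse dates like '4th September 2023' (hypothetical helper)."""
    # drop the ordinal suffix that directly follows the day number;
    # the lookbehind keeps the "st" in "August" untouched
    without_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    return datetime.strptime(without_suffix, "%d %B %Y")


print(parse_review_date("4th September 2023"))
print(parse_review_date("21st August 2012"))
```

With pandas, a helper like this could be applied with `df['dates'].map(parse_review_date)`; an explicit format has the advantage of raising an error on unexpected values instead of guessing.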

## Checking for Null Values

In [12]:
df.isnull().value_counts()



reviews  dates  countries  verified  corpus
False    False  False      False     False     3641
                True       False     False        2
Name: count, dtype: int64

In [13]:
df.countries.isnull().value_counts()

countries
False    3641
True        2
Name: count, dtype: int64
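
`isnull().value_counts()` counts distinct row patterns of missing values, which becomes harder to read as the number of columns grows. A per-column count is often easier to scan. The sketch below uses a small made-up frame (`demo`) rather than the review data, so the names and values in it are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from BA_reviews.csv.
demo = pd.DataFrame({
    "reviews": ["good flight", "lost luggage", "late departure"],
    "countries": ["United Kingdom", None, "Iceland"],
})

# number of missing values in each column
null_counts = demo.isnull().sum()
print(null_counts)

# rows where the country is missing
missing_country = demo[demo["countries"].isnull()]
print(missing_country)
```

Selecting the affected rows this way makes it easy to inspect them before deciding whether to keep, fill, or drop them.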

In [14]:
df.shape 

(3643, 5)

In [15]:
# reset the index; reset_index returns a new dataframe, so assign it back

df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,dates,countries,verified,corpus
0,✅ Trip Verified | British Airways has a total...,2023-09-04,United Kingdom,True,british airway total lack respect customer boo...
1,"✅ Trip Verified | London Heathrow to Keflavik,...",2023-09-04,Iceland,True,london heathrow keflavik iceland business clas...
2,✅ Trip Verified | Mumbai to London Heathrow in...,2023-09-04,Iceland,True,mumbai london heathrow business class ageing b...
3,✅ Trip Verified | Care and support shocking. ...,2023-09-04,United Kingdom,True,care support shocking written previously loyal...
4,✅ Trip Verified | Flying A380 business class ...,2023-09-02,Australia,True,flying business class pleasure ba made disaste...
...,...,...,...,...,...
3638,Flew LHR - VIE return operated by bmi but BA a...,2012-08-29,United Kingdom,False,flew lhr vie return operated bmi ba aircraft a...
3639,LHR to HAM. Purser addresses all club passenge...,2012-08-28,United Kingdom,False,lhr ham purser address club passenger name boa...
3640,My son who had worked for British Airways urge...,2011-10-12,United Kingdom,False,son worked british airway urged fly british ai...
3641,London City-New York JFK via Shannon on A318 b...,2011-10-11,United States,False,london city new york jfk via shannon really ni...


The data is now clean and ready for visualization and analysis.

In [16]:
#Export the clean data 

df.to_csv('cleaned-BA-reviews.csv')