# British Airways Data Science
 
## Data Wrangling/Cleaning

The next step is to clean and wrangle the data. Because the reviews are text, we will apply several NLP cleaning techniques to get them ready for exploratory data analysis and modeling. First, let's import the needed libraries and take a look at the data.

In [72]:
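import re

import numpy as np
import pandas as pd
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer        # explicit imports instead of `from nltk import *`
from nltk.tokenize import WhitespaceTokenizer
from cleantext import clean                    # explicit import instead of `from cleantext import *`

nltk.download('stopwords')
stop = stopwords.words('english')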

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>


In [73]:
data = pd.read_csv('/Users/afifmazhar/Desktop/Data Science/Data Science Projects/British_Airways_Data_Science/data/BA_reviews.csv')
data.head()

,Unnamed: 0,date,rating,body
0,0,8th January 2023,5.0,Not Verified | Great thing about British Airw...
1,1,6th January 2023,1.0,Not Verified | The staff are friendly. The pla...
2,2,2nd January 2023,1.0,✅ Trip Verified | Probably the worst business ...
3,3,2nd January 2023,2.0,"✅ Trip Verified | Definitely not recommended, ..."
4,4,2nd January 2023,8.0,✅ Trip Verified | BA shuttle service across t...


In [74]:
data.shape

(3448, 4)

In [75]:
data.dtypes

Unnamed: 0      int64
date           object
rating        float64
body           object
dtype: object

Let's remove the unnecessary 'Unnamed: 0' column and check for null values.

In [76]:
data = data.drop('Unnamed: 0', axis = 1)
data

,date,rating,body
0,8th January 2023,5.0,Not Verified | Great thing about British Airw...
1,6th January 2023,1.0,Not Verified | The staff are friendly. The pla...
2,2nd January 2023,1.0,✅ Trip Verified | Probably the worst business ...
3,2nd January 2023,2.0,"✅ Trip Verified | Definitely not recommended, ..."
4,2nd January 2023,8.0,✅ Trip Verified | BA shuttle service across t...
...,...,...,...
3443,29th August 2012,10.0,Flew LHR - VIE return operated by bmi but BA a...
3444,28th August 2012,9.0,LHR to HAM. Purser addresses all club passenge...
3445,12th October 2011,5.0,My son who had worked for British Airways urge...
3446,11th October 2011,4.0,London City-New York JFK via Shannon on A318 b...


In [77]:
for column in data:
    print(column + ": ", sum(data[column].isnull()))

date:  0
rating:  0
body:  0


In [78]:
data.columns

Index(['date', 'rating', 'body'], dtype='object')

In [79]:
data.body[1]

'Not Verified | The staff are friendly. The plane was cold, we were shivering, they gave light blankets but they were not enough. Meals were basic. Entertainment was basic. Luggage is delayed, today is day 6 and BA staff over the phone say "call after 72 hours". Tracking system is very vague, had to extract information from staff that they arrived to Vancouver on Jan 2nd. I offered to collect baggage but very vague answer "call the Airport", asked her for BA phone number at YVR but she said "we don\'t have a phone number for you to call, call the main Airport". Their policy states you can make a claim only after 21 days.'

In [80]:
tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

In [81]:
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in tokenizer.tokenize(text)]

In [82]:
data['body'] = data['body'].apply(lemmatize_text)
data['body'] = [' '.join(map(str, l)) for l in data['body']]

The reviews contain many characters that will interfere with the analysis, so we need to strip out anything irrelevant: links, symbols, special characters, numbers, emojis, and so on. In particular, I want to remove the "Not Verified" and "Trip Verified" prefix from each review, since it won't be needed. I've created a function that deals with these issues.

In [83]:
def cleanText(text):
    text = re.sub(r'https?:\/\/\S+', '', text) # remove links
    text = re.sub(r'@[A-Za-z0-9]+', '', text) # remove @mentions
    text = re.sub(r'[!@#$%^&*()_\-+="{}\[\]|,.?<>:;\'’`~]', '', text) # remove special characters
    text = re.sub(r'[0-9]+[a-zA-Z]*', '', text) # remove numbers and any letters attached to them (e.g. '2nd')
    text = clean(text, no_emoji = True) # cleantext: normalize the text (including lowercasing) and strip emojis
    text = re.sub(r'\Anot verified|\Atrip verified','', text) # remove a leading 'not verified' or 'trip verified'
    text = re.sub(r'\bwa\b|\bba\b','',text) # remove 'wa' (what the lemmatizer makes of 'was') and 'ba'
    return text

In [84]:
data['body'] = data['body'].apply(cleanText)
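
To see what the individual regex steps do, the sketch below replays them on a shortened version of the review printed earlier. It is an illustration, not the notebook's pipeline: it skips lemmatization, writes the digit range in the mention pattern as `0-9`, and replaces the `clean(text, no_emoji = True)` call with a plain lowercase and whitespace collapse, which is an assumption about the parts of that call that matter for this sample. `regex_steps` is a hypothetical helper name.

```python
import re

def regex_steps(text):
    """Replay the regex substitutions from cleanText, without the cleantext call."""
    text = re.sub(r'https?:\/\/\S+', '', text)       # links
    text = re.sub(r'@[A-Za-z0-9]+', '', text)        # @mentions
    text = re.sub(r'[!@#$%^&*()_\-+="{}\[\]|,.?<>:;\'’`~]', '', text)  # symbols
    text = re.sub(r'[0-9]+[a-zA-Z]*', '', text)      # digits plus attached letters
    text = ' '.join(text.lower().split())            # stand-in for clean(): lowercase, collapse spaces
    text = re.sub(r'\Anot verified|\Atrip verified', '', text)  # verification prefix
    text = re.sub(r'\bwa\b|\bba\b', '', text)        # leftover 'wa' and 'ba' tokens
    return ' '.join(text.split())                    # tidy the gaps left by removals

sample = 'Not Verified | Luggage is delayed, today is day 6 and BA staff say "call after 72 hours".'
cleaned = regex_steps(sample)
print(cleaned)  # luggage is delayed today is day and staff say call after hours
```

Note that the number pattern also removes the numbers themselves from phrases such as "day 6" and "72 hours", so any analysis that depends on durations or counts would need to run before this step.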

In [85]:
data['body'] = data['body'].apply(lemmatize_text)
data['body'] = [' '.join(map(str, l)) for l in data['body']]

In [86]:
data.head()

,date,rating,body
0,8th January 2023,5.0,great thing about british airway a is the econ...
1,6th January 2023,1.0,the staff are friendly the plane cold we were ...
2,2nd January 2023,1.0,probably the worst business class experience i...
3,2nd January 2023,2.0,definitely not recommended especially for busi...
4,2nd January 2023,8.0,shuttle service across the uk is still surpris...


After stripping most of the noise from the reviews, let's also remove the stopwords by applying a lambda function to the 'body' column.

In [87]:
data['body'] = data['body'].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop))

In [88]:
data.head()

,date,rating,body
0,8th January 2023,5.0,great thing british airway economy section ups...
1,6th January 2023,1.0,staff friendly plane cold shivering gave light...
2,2nd January 2023,1.0,probably worst business class experience ive e...
3,2nd January 2023,2.0,definitely recommended especially business cla...
4,2nd January 2023,8.0,shuttle service across uk still surprisingly g...
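
One side effect is visible in the output above: NLTK's English stopword list includes negations such as "not", so row 3 changes from "definitely not recommended" to "definitely recommended". If that matters for later sentiment work, one option is to exempt negations from removal. The sketch below shows the idea with a tiny, hypothetical stopword set standing in for `stopwords.words('english')`; `remove_stopwords`, `STOP`, and `NEGATIONS` are illustrative names, and the sample sentence is a shortened stand-in for the real review.

```python
# Hypothetical, abbreviated stand-in for stopwords.words('english')
STOP = {'the', 'is', 'for', 'not', 'no', 'a', 'and', 'are'}
# Words we may want to keep even though they are stopwords
NEGATIONS = {'not', 'no', 'nor', 'never'}

def remove_stopwords(text, stop, keep=frozenset()):
    """Drop stopwords from a whitespace-separated string, except those in keep."""
    return ' '.join(w for w in text.lower().split() if w not in stop or w in keep)

review = 'definitely not recommended especially for business'
without_negation = remove_stopwords(review, STOP)
with_negation = remove_stopwords(review, STOP, keep=NEGATIONS)
print(without_negation)  # definitely recommended especially business
print(with_negation)     # definitely not recommended especially business
```

In the notebook the same effect could be had by filtering against `set(stop) - NEGATIONS` inside the lambda; whether to do so depends on how the cleaned text is used downstream.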


In [89]:
data.body[0]

'great thing british airway economy section upstairs get allows small stowage cupboard window seat despite old looked tired inside broken side stowage seat reclined uncontrollably slow react ife food supposed christmas dinner wafer thin bit dry turkey cooked sprout cubed potato poor taste quality mousse desert great though slight issue snack meal ordered child option marked regular meal arrived well exactly asked crew member told difference sticker box staff ok couple decent mostly ok overall seems tad cheap day great sit upstairs enjoyment ended seemed bit dull like average airline'

Now that the 'body' column is cleaned and ready for EDA, we have to do some feature engineering and further cleaning on the 'date' and 'rating' columns. I will convert the 'date' column to datetime and create a 'review' column that indicates whether the 'rating' given by the customer was good or bad.

In [90]:
data['date'] = pd.to_datetime(data['date'])
data.head()

,date,rating,body
0,2023-01-08,5.0,great thing british airway economy section ups...
1,2023-01-06,1.0,staff friendly plane cold shivering gave light...
2,2023-01-02,1.0,probably worst business class experience ive e...
3,2023-01-02,2.0,definitely recommended especially business cla...
4,2023-01-02,8.0,shuttle service across uk still surprisingly g...
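
`pd.to_datetime` is left to infer the format of strings like "8th January 2023" here. If an explicit format is preferred, a minimal sketch using only the standard library is to strip the ordinal suffix and parse the remainder. It assumes every date follows the pattern seen in the preview (day with suffix, full English month name, four-digit year) and an English locale; `parse_review_date` is a hypothetical helper, not part of the notebook.

```python
import re
from datetime import datetime

def parse_review_date(raw):
    """Parse strings such as '8th January 2023' into a datetime."""
    # Remove 'st', 'nd', 'rd' or 'th' only when it directly follows a digit,
    # so month names like 'August' are left untouched.
    no_suffix = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', raw.strip())
    return datetime.strptime(no_suffix, '%d %B %Y')

for raw in ['8th January 2023', '2nd January 2023', '29th August 2012']:
    print(raw, '->', parse_review_date(raw).date())
```

A string that does not match the assumed pattern raises `ValueError`, which makes malformed dates visible instead of silently guessing.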


In [91]:
data['review'] = np.where(data['rating'].isin([8.0, 9.0, 10.0]), "Good", "Bad")
data.head()

,date,rating,body,review
0,2023-01-08,5.0,great thing british airway economy section ups...,Bad
1,2023-01-06,1.0,staff friendly plane cold shivering gave light...,Bad
2,2023-01-02,1.0,probably worst business class experience ive e...,Bad
3,2023-01-02,2.0,definitely recommended especially business cla...,Bad
4,2023-01-02,8.0,shuttle service across uk still surprisingly g...,Good
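
The labelling rule amounts to a threshold: ratings of 8, 9, or 10 are "Good" and everything else is "Bad". The sketch below states it as a plain function and checks it against the five ratings shown above. It assumes ratings are whole numbers, so that `rating >= 8` is equivalent to matching 8, 9, and 10 exactly; `label_rating` and its `threshold` parameter are hypothetical names.

```python
def label_rating(rating, threshold=8.0):
    """Return 'Good' for ratings at or above the threshold, otherwise 'Bad'."""
    return 'Good' if rating >= threshold else 'Bad'

ratings = [5.0, 1.0, 1.0, 2.0, 8.0]
labels = [label_rating(r) for r in ratings]
print(labels)  # ['Bad', 'Bad', 'Bad', 'Bad', 'Good']
```

Keeping the cut-off in a single parameter makes it easy to try a different split later, for example treating 7 as "Good" as well.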


Now the data is ready for exploratory data analysis. I will save it to a new CSV file before moving on to the next part of the project.

In [93]:
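# index=False keeps the row index out of the file, so reloading it does not create another 'Unnamed: 0' column
data.to_csv("/Users/afifmazhar/Desktop/Data Science/Data Science Projects/British_Airways_Data_Science/data/BA_reviews_clean.csv", index=False)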