#### Part 3: Pre-Processing

In this notebook, I continue cleaning the data so that what goes into my models meets the required standard of data quality. I will focus on modifying the content of columns and on identifying and adding features for my model.

This notebook also introduces various NLP (Natural Language Processing) techniques to prepare our project for the machine learning workflow.

#### __Notebook Contents__

__3.1__ Binarize Column

__3.2__ Train_Test_Split

__3.3__ Vectorization

__3.4__ Saving Outputs for Modelling

__3.5__ Closing Remarks

In [1]:
# Generic Imports
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import re

import warnings
warnings.filterwarnings('ignore')

Let's proceed to load the cleaned dataframe from the loading notebook for pre-processing.

In [2]:
#load data
cleandf = joblib.load('../data/cleandf.pkl')

_____

I aim to create a final dataset which is ready for modelling. This can be achieved with a very simple step:

- Binarize target column (recommended)

#### __3.1 Binarize column__

Our first step is to encode our target column so it is suitable for modelling. Most models cannot work directly with text labels, so our aim is to convert them into a numerical format. In the case of our classification model, we are converting `yes` to 1 and `no` to 0 from our original `recommended` column.

Once that is complete, we will drop the `recommended` column, as it would only duplicate the information held in the new column.

In [3]:
cleandf['recommended_class'] = cleandf['recommended'].apply(lambda x: 1 if x == 'yes' else 0)

In [4]:
#output check for sanity
cleandf.sample(2)

| Unnamed: 0 | airline | overall | author | customer_review | cabin | seat_comfort | cabin_service | food_bev | entertainment | ground_service | value_for_money | recommended | review_year | review_month | recommended_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19557 | Air New Zealand | 9.0 | Peter Somerville | have flown with over 30 airlines and air new z... | Premium Economy | 5.0 | 5.0 | 5.0 | 5.0 | 2.69282 | 5.0 | yes | 2013 | 7 | 1 |
| 34278 | Air France | 10.0 | Robert Stork | singapore to toulouse via paris we booked our... | First Class | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | yes | 2017 | 6 | 1 |


Our new column `recommended_class` has been created. 
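One caveat with the lambda in cell 3: anything that is not exactly `yes` becomes 0, including missing values or differently cased strings. The sketch below uses a hypothetical toy frame (not the project data) to compare the lambda with an explicit mapping that leaves unexpected values as NaN so they can be inspected. The column names `lambda_class` and `mapped_class` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real `recommended` column is assumed to hold only 'yes'/'no'
toy = pd.DataFrame({"recommended": ["yes", "no", "Yes", None]})

# Lambda approach from the notebook: everything except the exact string 'yes' becomes 0
toy["lambda_class"] = toy["recommended"].apply(lambda x: 1 if x == "yes" else 0)

# Explicit mapping: values outside the dictionary become NaN instead of 0
toy["mapped_class"] = toy["recommended"].map({"yes": 1, "no": 0})

# Rows that the lambda would have silently encoded as 0
unexpected = toy[toy["mapped_class"].isna()]
print(toy)
print(f"{len(unexpected)} unexpected value(s)")
```

If the cleaned data really contains only `yes` and `no`, both approaches give the same result, and the mapping simply acts as a safety check.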

In [5]:
cleandf = cleandf.drop(['recommended'], axis=1).copy()

### __Text Data Pre-processing__

As this part of the modelling focuses on text data, the column containing it (**`customer_review`**) will be our initial feature. We will later engineer more features out of this column.

In [6]:
#selecting features and target variable
X = cleandf['customer_review']
y = cleandf['recommended_class']

#### __3.2 Train_Test_Split__

Our next step is to split the data into training and test sets. Holding out a test set lets us detect overfitting and estimate how well the model generalises to data it has not seen.

The training set will be used to train our models and the test set will be used to evaluate the model.

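We are using an 80:20 ratio: 80% of our dataset goes into the training set and 20% goes into the test set. The split is stratified on `y`, so both sets keep the same proportion of recommended and not-recommended reviews.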

In [7]:
#Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=40)
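To illustrate what `stratify=y` does in the split above, here is a small sketch on hypothetical, deliberately imbalanced labels (70% class 0, 30% class 1). The toy names `X_toy`, `y_toy`, `X_tr` and so on are made up and are not the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 reviews of class 0, 30 of class 1
y_toy = pd.Series([0] * 70 + [1] * 30)
X_toy = pd.Series([f"review {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=40
)

# With stratify, both splits keep the original 70:30 class ratio
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without `stratify`, the class ratio in a small test set can drift from the ratio in the full data, which makes evaluation metrics harder to compare.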


Now we can focus on feature extraction.

We will use the `nltk` library to remove stop words from our data. 

In [8]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

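# English stop-word list (note: this rebinds the name `stopwords` from the nltk module to a plain list)
stopwords = stopwords.words('english')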

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/faisal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### __3.3 Vectorization__

For our vectorizer, we chose the TF-IDF vectorizer.

TF-IDF combines two components:

- Term Frequency (TF): the number of occurrences of a word in a document
- Inverse Document Frequency (IDF): a weight that shrinks as the number of documents containing the word grows

In a nutshell, words that appear in many of the documents have their counts scaled down, and words that appear in only a few documents have their counts scaled up.

This is great for our use case as the words less common are more likely to be related to topics, which is exactly what we aim to identify from this model.


In [9]:
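# Bigram vectorizer: pairs of adjacent words after stop-word removal
vectorizer_bi = TfidfVectorizer(max_features=500, min_df=25, max_df=0.80, stop_words=stopwords, ngram_range=(2, 2))
# Unigram vectorizer: single words, same settings otherwise
vectorizer_uni = TfidfVectorizer(max_features=500, min_df=25, max_df=0.80, stop_words=stopwords, ngram_range=(1, 1))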



In [23]:
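# ngram_range must be a (min_n, max_n) pair; (1, 1) (unigrams only) is assumed here in place of the original (1,)
vectorizer_demo = TfidfVectorizer(max_features=500, max_df=0.80, stop_words=stopwords, ngram_range=(1, 1))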

We tuned some of the parameters of our vectorizer for our use case:

We capped the vocabulary at 500 features, mainly to keep the computational load on our machine manageable.

We kept only terms which appear in at least 25 reviews and in no more than 80% of our reviews.

Finally, we set the n-gram range to focus on bigrams. We would expect unigrams to be good predictors of positive and negative sentiment, as they are mainly adjectives tied directly to that sentiment. Bigrams allow us to pick out topics, which suits this project better as it adds context to our inference problem. A unigram vectorizer with the same settings is also built.
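To make the effect of `ngram_range` and `min_df` concrete, here is a minimal sketch on a hypothetical four-review corpus. It uses scikit-learn's built-in English stop list as a stand-in for the nltk list used in this notebook, and `min_df=2` instead of 25 so that the filter has something to do on such a small sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the customer reviews
docs = [
    "the cabin crew were friendly",
    "cabin crew ignored us",
    "the leg room was poor",
    "poor leg room and cold food",
]

uni = TfidfVectorizer(stop_words="english", ngram_range=(1, 1), min_df=2)
bi = TfidfVectorizer(stop_words="english", ngram_range=(2, 2), min_df=2)

uni.fit(docs)
bi.fit(docs)

# min_df=2 keeps only terms found in at least two reviews
print(sorted(uni.vocabulary_))
print(sorted(bi.vocabulary_))
```

The unigram vocabulary keeps isolated words such as `poor`, whereas the bigram vocabulary keeps phrases such as `cabin crew` and `leg room`, which read more like topics. Stop words are removed before the n-grams are formed, so a bigram can join two words that were not adjacent in the original sentence.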

In [10]:
#Transform X_train and X_test using TF-IDF Vectorizer

transformed_train = vectorizer_bi.fit_transform(X_train)
transformed_test = vectorizer_bi.transform(X_test)

transformed_train_uni = vectorizer_uni.fit_transform(X_train)
transformed_test_uni = vectorizer_uni.transform(X_test)
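Note that in the cell above the vectorizers are fitted on the training set only and then applied to the test set. A minimal sketch of why this matters, using hypothetical stand-ins for `X_train` and `X_test`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for X_train and X_test
train_reviews = ["great crew great food", "late flight rude crew"]
test_reviews = ["rude crew terrible wifi"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_reviews)  # learns the vocabulary and idf weights
test_matrix = vec.transform(test_reviews)        # reuses them, learns nothing new

# Both matrices share the same columns, so a model trained on one can score the other
print(train_matrix.shape, test_matrix.shape)

# 'terrible' and 'wifi' never appeared in training, so they have no column
print("wifi" in vec.vocabulary_)
```

Fitting on the test set as well would let information from the held-out reviews leak into the features, and would give the two matrices different columns.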

In [24]:
# Fit the demo vectorizer on the training set, then apply it to the test set
demo_transformed_train = vectorizer_demo.fit_transform(X_train)
demo_transformed_test = vectorizer_demo.transform(X_test)

In [25]:
joblib.dump(vectorizer_demo, "demo_vectorizer_new.pkl")

['demo_vectorizer_new.pkl']

#### __3.4 Saving Outputs for Modelling__

Again a very simple process: we convert each transformed dataset into a dataframe. This keeps the output readable for the next person and gives the train and test sets the same named columns, which should prevent shape mismatches at the modelling stage.

In [11]:
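# Training set features (bigrams) as a dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
train_df = pd.DataFrame(columns=vectorizer_bi.get_feature_names_out(), data=transformed_train.toarray())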

In [12]:
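# Training set features (unigrams) as a dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
train_df_uni = pd.DataFrame(columns=vectorizer_uni.get_feature_names_out(), data=transformed_train_uni.toarray())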

In [13]:
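# Test set features (bigrams) as a dataframe; the labels are saved separately as y_test
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
test_df = pd.DataFrame(columns=vectorizer_bi.get_feature_names_out(), data=transformed_test.toarray())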

In [14]:
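# Test set features (unigrams) as a dataframe; the labels are saved separately as y_test
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
test_df_uni = pd.DataFrame(columns=vectorizer_uni.get_feature_names_out(), data=transformed_test_uni.toarray())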

In [15]:
joblib.dump(train_df, "../data/train_df.pkl")

['../data/train_df.pkl']

In [16]:
joblib.dump(y_train, "../data/y_train.pkl")

['../data/y_train.pkl']

In [17]:
joblib.dump(test_df, "../data/test_df.pkl")

['../data/test_df.pkl']

In [18]:
joblib.dump(y_test, "../data/y_test.pkl")

['../data/y_test.pkl']

In [19]:
joblib.dump(train_df_uni, "../data/uni_trained.pkl")

['../data/uni_trained.pkl']

In [20]:
joblib.dump(test_df_uni, "../data/uni_test.pkl")

['../data/uni_test.pkl']

In [21]:
joblib.dump(vectorizer_bi, "demo_vectorizer.pkl")

['demo_vectorizer.pkl']

In [22]:
joblib.dump(vectorizer_uni, "uni_vec.pkl")

['uni_vec.pkl']
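The saved vectorizers can be reloaded later (for example in the modelling notebook) and applied to new text. A minimal round-trip sketch, using a hypothetical toy vectorizer and a temporary directory rather than the project paths:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy vectorizer standing in for vectorizer_bi / vectorizer_uni
vec = TfidfVectorizer()
vec.fit(["friendly crew", "delayed flight", "friendly service"])

# Save to a temporary directory (hypothetical path) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectorizer.pkl")
    joblib.dump(vec, path)
    loaded = joblib.load(path)

# The reloaded vectorizer produces the same features for unseen text
new_review = ["friendly crew but delayed flight"]
same = np.allclose(
    vec.transform(new_review).toarray(),
    loaded.transform(new_review).toarray(),
)
print(same)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that saved them, so it helps to record the version alongside the `.pkl` files.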

In the next notebook(s), I will be performing Modelling and Evaluation.