**Preprocessing data**
<p>Some AI techniques (e.g. neural networks) can only work with numerical data.
So, we need to convert text data into numerical data.</p>
<p>This can be achieved with preprocessing techniques.</p>

In [1]:
# importing the libraries
import pandas as pd # for reading csv
import matplotlib.pyplot as plt #for plotting graphs

In [2]:
df=pd.read_csv('Clean_Dataset.csv')
# Dropping column 'Unnamed: 0'
df=df.drop('Unnamed: 0',axis=1)
df.head()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955


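Data can have missing values.
A first step is to tell pandas which placeholder strings (here 'NA' and '?') mark a missing value when the file is read, so that they can later be ignored or filled in.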

In [3]:
df = pd.read_csv("Clean_Dataset.csv", na_values=['NA', '?'])
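To make the effect of `na_values` concrete, here is a minimal sketch on a made-up three-row sample (the rows and the variable names `raw`, `sample`, and `complete_rows` are illustrative assumptions, not taken from Clean_Dataset.csv). It shows how the placeholders become missing values and how `dropna` can then be used to ignore the affected rows.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real Clean_Dataset.csv is not reproduced here.
raw = io.StringIO(
    "airline,duration,price\n"
    "SpiceJet,2.17,5953\n"
    "AirAsia,?,5956\n"
    "NA,2.25,5955\n"
)

# 'NA' and '?' are read as missing values (NaN) instead of as text.
sample = pd.read_csv(raw, na_values=["NA", "?"])

# Count the missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Ignore incomplete rows by dropping every row with at least one missing value.
complete_rows = sample.dropna()
print(len(complete_rows))
```

Dropping rows is the simplest option, but it throws away data, which is why filling in the gaps is often preferred.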

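You can fill missing values with a typical value of the column, such as the median (the middle value once the column is sorted). Note that the median is not the most repeated value; that is the mode.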

In [4]:
# Convert all missing values in the specified column to the median
df = pd.read_csv("Clean_Dataset.csv")
def missing_median(df, name): # code from exercise #2
    med = df[name].median()
    df[name] = df[name].fillna(med)
missing_median(df,"duration") # works with numerical data
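The difference between the median and the most repeated value (the mode) can be seen on a small example. The sketch below uses a made-up `duration` column (the values and the name `toy` are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical durations with one missing entry.
toy = pd.DataFrame({"duration": [2.0, 2.0, 3.0, 10.0, None]})

median_value = toy["duration"].median()  # middle of the sorted non-missing values
mode_value = toy["duration"].mode()[0]   # most frequent non-missing value

# Fill the gap with the median, as missing_median does.
filled = toy["duration"].fillna(median_value)

print(median_value, mode_value)
print(filled.tolist())
```

Here the median is 2.5 while the mode is 2.0, so the two choices fill the gap with different values.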

One way to convert text data (e.g. airline) to numerical data is to use a label encoder, which assigns each distinct value an integer code.

In [5]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for col in df.columns:
    if df[col].dtype=='object': # no need to convert duration, days_left, and price columns
        df[col]=le.fit_transform(df[col])
df.head()

Unnamed: 0.1,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,0,4,1408,2,2,2,5,5,1,2.17,1,5953
1,1,4,1387,2,1,2,4,5,1,2.33,1,5953
2,2,0,1213,2,1,2,1,5,1,2.17,1,5956
3,3,5,1559,2,4,2,0,5,1,2.25,1,5955
4,4,5,1549,2,4,2,4,5,1,2.33,1,5955
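A label encoder assigns codes in sorted order of the distinct values, which is why AirAsia becomes 0 in the output above. The following sketch, on a made-up two-column frame, keeps one fitted encoder per column in a dictionary (`encoders` is a hypothetical helper name) so that the codes can be translated back to text later. Reusing a single encoder for every column, as in the cell above, keeps only the mapping of the last column fitted.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two text columns.
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "Vistara", "AirAsia"],
    "stops": ["zero", "one", "zero", "zero"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # codes follow the sorted class order
    encoders[col] = enc

print(toy["airline"].tolist())
print(list(encoders["airline"].classes_))

# Translate codes back into the original labels.
decoded = list(encoders["airline"].inverse_transform([0, 1, 2]))
print(decoded)
```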


Once the data is converted to numeric form, we can use it to train our model.
Before training, we split the data into a training set and a testing set. Evaluating the model on the same data it was trained on would give a biased (overly optimistic) result.

In [6]:
price = df['price']
features = df.drop('price',axis=1)
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(features,price,test_size=0.30,random_state=42)
print("Sample in test set ",xtest.shape[0])
print("Sample in train set ",xtrain.shape[0])

Sample in test set  90046
Sample in train set  210107
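The behaviour of the split can be checked on a small example. This sketch uses a made-up ten-row frame (the names `toy`, `X`, and `y` are assumptions) to show that the two sets do not overlap and that a fixed `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset.
toy = pd.DataFrame({"days_left": range(10), "price": range(100, 110)})
X = toy.drop("price", axis=1)
y = toy["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# Same arguments and the same random_state give the same split again.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(len(X_train), len(X_test))              # 7 training rows, 3 testing rows
overlap = set(X_train.index) & set(X_test.index)
print(overlap)                                # empty: no row is in both sets
print(X_test.index.equals(X_test2.index))     # True: the split is reproducible
```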


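Encoded labels still span very different ranges (in the output above, flight codes run into the thousands while class is 0 or 1), which can hurt training. Scaling the features to a common range usually gives better results.
<p><b>MinMaxScaler</b> is one way of doing this.</p>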

In [7]:
# Scaling every feature to a common range
from sklearn.preprocessing import MinMaxScaler
mmscaler=MinMaxScaler(feature_range=(0,1)) # It's good to scale the values between 0 and 1
xtrain=pd.DataFrame(mmscaler.fit_transform(xtrain)) # learn min and max from the training set
xtest=pd.DataFrame(mmscaler.transform(xtest)) # reuse the training min and max; do not refit on the test set
xtrain.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.008013,1.0,0.971795,0.4,1.0,0.0,0.8,1.0,1.0,0.256939,0.270833
1,0.919087,1.0,0.973077,0.6,0.8,0.0,0.4,0.4,0.0,0.178571,0.458333
2,0.990022,1.0,0.952564,0.2,0.2,0.0,0.4,0.8,0.0,0.21102,0.583333
3,0.042729,1.0,0.991026,0.4,0.4,0.0,1.0,0.0,1.0,0.086735,0.3125
4,0.310395,0.4,0.719231,0.0,0.8,0.0,1.0,0.4,1.0,0.239796,0.916667
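Min-max scaling maps each value to (x - min) / (max - min), where min and max are taken per column. The sketch below uses made-up one-column arrays (`train` and `test` are assumptions) and fits the scaler on the training values only, then checks the result against the formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature training and testing values.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min=1 and max=5 from the training set
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same result computed by hand from the training statistics.
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value larger than the training maximum (7 here) lands above 1 after scaling. That is expected when the test set is transformed with the training statistics.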


Another way of scaling/normalizing data is <b>StandardScaler</b>, which shifts each column to a mean of 0 and rescales it to a standard deviation of 1.

In [8]:
from sklearn.preprocessing import StandardScaler
xtrain,xtest,ytrain,ytest=train_test_split(features,price,test_size=0.30,random_state=42)
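sc = StandardScaler()
sc.fit(xtrain) # learn mean and standard deviation from the training set only
xtrain = pd.DataFrame(sc.transform(xtrain))
xtest = pd.DataFrame(sc.transform(xtest)) # reuse the training statistics on the test set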
xtrain.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-1.703407,1.031881,1.000674,-0.329965,1.474368,-0.426818,0.529421,1.381935,0.672494,0.16805,-0.886605
1,1.451913,1.031881,1.005366,0.241011,0.904019,-0.426818,-0.619031,-0.337076,-1.487001,-0.366146,-0.22329
2,1.69758,1.031881,0.930283,-0.90094,-0.80703,-0.426818,-0.619031,0.808931,-1.487001,-0.144955,0.21892
3,-1.583175,1.031881,1.071065,-0.329965,-0.23668,-0.426818,1.103647,-1.483084,0.672494,-0.992157,-0.739202
4,-0.656168,-0.604111,0.076207,-1.471916,0.904019,-0.426818,1.103647,-0.337076,0.672494,0.051195,1.398146


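The same standardization can also be computed with the <b>Z-Score</b> function from SciPy. The values below match the StandardScaler output above; only the row and column labels differ.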

In [9]:
from scipy.stats import zscore
xtrain,xtest,ytrain,ytest=train_test_split(features,price,test_size=0.30,random_state=42)
xtrain = xtrain.copy() # work on a copy to avoid a possible pandas SettingWithCopyWarning
for col in xtrain.columns:
    xtrain[col]=zscore(xtrain[col])
xtrain.head()

Unnamed: 0.1,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left
2406,-1.703407,1.031881,1.000674,-0.329965,1.474368,-0.426818,0.529421,1.381935,0.672494,0.16805,-0.886605
275865,1.451913,1.031881,1.005366,0.241011,0.904019,-0.426818,-0.619031,-0.337076,-1.487001,-0.366146,-0.22329
297156,1.69758,1.031881,0.930283,-0.90094,-0.80703,-0.426818,-0.619031,0.808931,-1.487001,-0.144955,0.21892
12826,-1.583175,1.031881,1.071065,-0.329965,-0.23668,-0.426818,1.103647,-1.483084,0.672494,-0.992157,-0.739202
93166,-0.656168,-0.604111,0.076207,-1.471916,0.904019,-0.426818,1.103647,-0.337076,0.672494,0.051195,1.398146
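The z-score of a value is (x - mean) / std. With default settings, both `scipy.stats.zscore` and `StandardScaler` use the population standard deviation, so they should agree. The sketch below checks this on a made-up column (`values` is an assumption for illustration).

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column.
values = np.array([[1.0], [2.0], [3.0], [4.0]])

via_scipy = zscore(values, axis=0)                    # (x - mean) / std for each column
via_sklearn = StandardScaler().fit_transform(values)  # same formula, learned by fit

print(np.allclose(via_scipy, via_sklearn))
print(via_scipy.mean(), via_scipy.std())
```

One practical difference: `StandardScaler` stores the training mean and standard deviation, so it can apply them to the test set later, whereas `zscore` recomputes the statistics from whatever data it is given.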
