**Improving the FF Network**
<p>Once the data has been preprocessed, we build a model from it.</p>
<p>After building the model, we test it on held-out data and measure its performance (here, the prediction error on ticket prices).</p>
<li>We will try a simple <b>feed forward network</b> with different parameters, i.e. number of layers, number of epochs, activation function, and scaled versus unscaled data</li>

In [8]:
# importing the libraries
import pandas as pd # for reading csv

df=pd.read_csv('Clean_Dataset.csv', na_values=['NA','?']) # treat 'NA' and '?' as missing values
# Dropping column 'Unnamed: 0'
df=df.drop('Unnamed: 0',axis=1)
df.head(3)

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956


In [9]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder() # converts text categories to integer codes
for col in df.columns:
    if df[col].dtype=='object': # no need to convert duration, days_left, and price columns
        df[col]=encoder.fit_transform(df[col])
df.head(3)

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,4,1408,2,2,2,5,5,1,2.17,1,5953
1,4,1387,2,1,2,4,5,1,2.33,1,5953
2,0,1213,2,1,2,1,5,1,2.17,1,5956
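<p>As a rough illustration of what <code>LabelEncoder</code> does to one column: it sorts the distinct values and replaces each value with its position in that sorted list. The sketch below reproduces that idea in plain Python on the three <code>airline</code> values shown above; the helper name <code>label_encode</code> is hypothetical. The codes differ from the notebook output (where SpiceJet is 4) because the full column contains more airlines than these three rows.</p>

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


# The three airline values from the first rows of the data frame.
airline = ["SpiceJet", "SpiceJet", "AirAsia"]
codes, classes = label_encode(airline)
print(classes)
print(codes)
```

<p>Because the codes follow alphabetical order, their numeric size carries no meaning about the category itself.</p>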


In [10]:
price = df['price'] # target
# features = df.drop('price',axis=1) #data features
features =df[['days_left','duration']] #data features

In [11]:
# from sklearn.preprocessing import StandardScaler 

# sc = StandardScaler()
# sc.fit(features)
# features= sc.transform(features)
# features[0:3]

# from scipy.stats import zscore
# for col in features.columns:
#     features[col] = zscore(features[col])
# features[0:3]

from sklearn.preprocessing import MinMaxScaler
mmscaler=MinMaxScaler(feature_range=(0,1))
features=pd.DataFrame(mmscaler.fit_transform(features))
features.head(3)

Unnamed: 0,0,1
0,0.0,0.027347
1,0.0,0.030612
2,0.0,0.027347
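<p>The scaled values can be checked by hand. With <code>feature_range=(0,1)</code>, min-max scaling maps each column with x' = (x - min) / (max - min). The sketch below applies that formula in plain Python. The column minima and maxima used here are not printed anywhere in this notebook; they are assumptions, chosen because they reproduce the output above.</p>

```python
def min_max_scale(x, col_min, col_max):
    """Scale x into [0, 1] given the column's minimum and maximum."""
    return (x - col_min) / (col_max - col_min)


# Assumed column ranges (not shown in the notebook).
DAYS_LEFT_MIN, DAYS_LEFT_MAX = 1, 49
DURATION_MIN, DURATION_MAX = 0.83, 49.83

# (days_left, duration) for the first three rows of the data frame.
rows = [(1, 2.17), (1, 2.33), (1, 2.17)]
for days_left, duration in rows:
    scaled_days = min_max_scale(days_left, DAYS_LEFT_MIN, DAYS_LEFT_MAX)
    scaled_duration = min_max_scale(duration, DURATION_MIN, DURATION_MAX)
    print(round(scaled_days, 6), round(scaled_duration, 6))
```

<p>Note that the notebook fits the scaler on all rows before splitting, so the test rows influence the minimum and maximum. Fitting the scaler on the training split only is the usual way to avoid that.</p>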


In [12]:
from sklearn.model_selection import train_test_split
# split the data; test_size=0.95 holds out 95% of the rows for testing and trains on only 5%
xtrain,xtest,ytrain,ytest=train_test_split(features,price,test_size=0.95,random_state=42)
print("Sample in test set ",xtest.shape[0])
print("Sample in train set ",xtrain.shape[0])

Sample in test set  285146
Sample in train set  15007
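<p>The two counts above follow from simple arithmetic on <code>test_size=0.95</code>. The sketch below assumes that the test share is rounded up and the remainder goes to the training set, which matches the printed numbers. The total row count is just the sum of the two printed counts.</p>

```python
import math

n_samples = 285146 + 15007  # total rows, from the two counts printed above
test_size = 0.95

n_test = math.ceil(test_size * n_samples)  # test share, rounded up (assumed rule)
n_train = n_samples - n_test               # the remainder is used for training

print("Sample in test set ", n_test)
print("Sample in train set ", n_train)
```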


<h3>Adding more layers (i.e. hidden layer)</h3>

In [15]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, input_dim=xtrain.shape[1], activation='relu')) # Hidden 1
model.add(Dense(64,activation='relu')) #Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 64)                192       
                                                                 
 dense_7 (Dense)             (None, 64)                4160      
                                                                 
 dense_8 (Dense)             (None, 1)                 65        
                                                                 
Total params: 4,417
Trainable params: 4,417
Non-trainable params: 0
_________________________________________________________________
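<p>The parameter counts in the summary can be verified by hand: a dense layer with n inputs and u units has n*u weights plus u biases. A minimal sketch in plain Python, where the helper name <code>dense_params</code> is hypothetical:</p>

```python
def dense_params(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


n_inputs = 2               # the two features: days_left and duration
layer_units = [64, 64, 1]  # hidden 1, hidden 2, output

total = 0
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params} params")
    total += params
    n_inputs = units       # this layer's output feeds the next layer

print("Total params:", total)
```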


In [16]:
model.fit(xtrain,ytrain,verbose=2,epochs=50) # also tried 600 epochs, with no significant improvement

Epoch 1/50
469/469 - 4s - loss: 955552640.0000 - 4s/epoch - 9ms/step
Epoch 2/50
469/469 - 4s - loss: 792036160.0000 - 4s/epoch - 8ms/step
Epoch 3/50
469/469 - 4s - loss: 579474688.0000 - 4s/epoch - 8ms/step
Epoch 4/50
469/469 - 4s - loss: 534664544.0000 - 4s/epoch - 8ms/step
Epoch 5/50
469/469 - 4s - loss: 528904224.0000 - 4s/epoch - 8ms/step
Epoch 6/50
469/469 - 4s - loss: 524473696.0000 - 4s/epoch - 8ms/step
Epoch 7/50
469/469 - 4s - loss: 520458112.0000 - 4s/epoch - 8ms/step
Epoch 8/50


KeyboardInterrupt: 
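<p>Two numbers in this log can be read more easily with a little arithmetic. The 469 steps per epoch are consistent with the 15007 training rows being processed in batches of 32, which is the Keras default when <code>fit()</code> is called without <code>batch_size</code>. The loss is a mean squared error, so its square root gives an error in the same units as the price. A small sketch using the values printed above:</p>

```python
import math

n_train = 15007   # rows in the training split
batch_size = 32   # Keras default, assumed because fit() does not set it
steps_per_epoch = math.ceil(n_train / batch_size)
print("steps per epoch:", steps_per_epoch)

# MSE losses of the seven completed epochs, copied from the log above.
epoch_losses = [955552640.0, 792036160.0, 579474688.0, 534664544.0,
                528904224.0, 524473696.0, 520458112.0]
for epoch, mse in enumerate(epoch_losses, start=1):
    print(f"epoch {epoch}: RMSE on training data ~ {math.sqrt(mse):.0f}")
```

<p>By epoch 7 the training RMSE is a little under 23,000 and changes only slowly from one epoch to the next.</p>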

In [17]:
import numpy as np
from sklearn import metrics
pred = model.predict(xtest)
print("Shape: {}".format(pred.shape))
print(pred[:4])
print(ytest.to_numpy()[:4])
# Measure RMSE, a common error metric for regression.
score = np.sqrt(metrics.mean_squared_error(ytest, pred))  # arguments are (y_true, y_pred)
print(f"Final score (RMSE): {score}")

Shape: (285146, 1)
[[25130.361]
 [22032.256]
 [22126.248]
 [19885.064]]
[ 7366 64831  6195 60160]
Final score (RMSE): 22550.46589603257
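<p>To make the metric concrete, the sketch below computes RMSE in plain Python for the four predictions and true prices printed above. The helper name <code>rmse</code> is hypothetical; it is meant to mirror the square root of the mean squared error, not to replace the scikit-learn call.</p>

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First four predictions and true prices from the output above.
y_pred = [25130.361, 22032.256, 22126.248, 19885.064]
y_true = [7366, 64831, 6195, 60160]

print(f"RMSE on these four samples: {rmse(y_true, y_pred):.1f}")
```

<p>On these four samples the predictions all fall between roughly 20,000 and 25,000 while the true prices range from about 6,000 to 65,000, which suggests that <code>days_left</code> and <code>duration</code> alone do not let the network separate cheap from expensive tickets.</p>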
