This Jupyter notebook contains the material for a task from the Mohirdev platform.

    Predicting the prices of airplane tickets

Author: Umidjon Sattorov, student at the Mohirdev platform

    Data loading and initial acquaintance

In [124]:
#Importing essential libraries
import pandas as pd
import numpy as np 

#Visualization 
import matplotlib.pyplot as plt
import seaborn as sns 

#Machine learning algorithms, metrics and feature engineering tools 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neural_network import MLPRegressor
import fastai 
from fastai.tabular.all import *

#Preserving machine learning model
import pickle as pkl

In [125]:
#Data uploading
df = pd.read_csv(filepath_or_buffer = './data/train_data.csv', sep = ",", index_col = 'id')
df.head()

id,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
1,Vistara,UK-810,Bangalore,Early_Morning,one,Night,Mumbai,Economy,14.25,21,7212
2,SpiceJet,SG-5094,Hyderabad,Evening,zero,Night,Kolkata,Economy,1.75,7,5292
3,Vistara,UK-846,Bangalore,Morning,one,Evening,Delhi,Business,9.58,5,60553
4,Vistara,UK-706,Kolkata,Morning,one,Evening,Hyderabad,Economy,6.75,28,5760
5,Indigo,6E-5394,Chennai,Early_Morning,zero,Morning,Mumbai,Economy,2.0,4,10712


Dataset information:

1) id - identification number of each record (used as the index)
2) airline - the airline operating the flight
3) flight - flight number
4) source_city - departure city
5) departure_time - time of day of departure
6) stops - number of stops on the route
7) arrival_time - time of day of arrival
8) destination_city - arrival city
9) class - ticket class (Economy or Business)
10) duration - flight duration in hours
11) days_left - number of days between booking and departure
12) price - ticket price (the target column)

In [126]:
#Checking the dataset for missing values
print("Missing values per column:")
df.isna().sum()

Missing values per column:


airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

Conclusion: the count of NaN values is zero for every column, so the dataset has no missing values and no imputation is needed.

In [127]:
#Checking the number of unique values for each column of the dataset
for i in df.columns :
    print(f"The column {i} contains : {df[i].nunique()} values")

The column airline contains : 6 values
The column flight contains : 1310 values
The column source_city contains : 6 values
The column departure_time contains : 6 values
The column stops contains : 3 values
The column arrival_time contains : 6 values
The column destination_city contains : 6 values
The column class contains : 2 values
The column duration contains : 404 values
The column days_left contains : 49 values
The column price contains : 4420 values


In [128]:
#Keeping only the two-character carrier code of 'flight' to reduce its cardinality
df['flight'] = df['flight'].str[:2]

In [129]:
#Checking the number of unique values for each column of the dataset
for i in df.columns :
    print(f"The column {i} contains : {df[i].nunique()} values")

The column airline contains : 6 values
The column flight contains : 6 values
The column source_city contains : 6 values
The column departure_time contains : 6 values
The column stops contains : 3 values
The column arrival_time contains : 6 values
The column destination_city contains : 6 values
The column class contains : 2 values
The column duration contains : 404 values
The column days_left contains : 49 values
The column price contains : 4420 values
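After truncation, 'flight' has 6 unique values, the same count as 'airline'. That suggests, but does not prove, that the carrier code simply repeats the airline. A minimal sketch of how this could be checked is below; the toy frame reuses the five rows shown by df.head(), and the names toy, flight_prefix and is_redundant are made up for illustration.

```python
import pandas as pd

# Toy rows copied from the df.head() output; the real check would run on df.
toy = pd.DataFrame({
    "airline": ["Vistara", "SpiceJet", "Vistara", "Vistara", "Indigo"],
    "flight": ["UK-810", "SG-5094", "UK-846", "UK-706", "6E-5394"],
})

# Keep only the two-character carrier prefix, as in the cell above.
toy["flight_prefix"] = toy["flight"].str[:2]

# Number of distinct prefixes used by each airline.
prefixes_per_airline = toy.groupby("airline")["flight_prefix"].nunique()

# One-to-one mapping: each airline has one prefix and no prefix is shared.
is_redundant = bool(
    (prefixes_per_airline == 1).all()
    and toy["flight_prefix"].nunique() == toy["airline"].nunique()
)
print(prefixes_per_airline)
print("flight prefix duplicates airline:", is_redundant)
```

If the same check returns True on the full dataset, one-hot encoding both columns adds six duplicate features, and one of the two columns could probably be dropped.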


In [130]:
df['stops'].value_counts()

stops
one            16666
zero            2440
two_or_more      894
Name: count, dtype: int64

In [131]:
df['stops'] = df['stops'].map({'zero' : 0, 'one' : 1, 'two_or_more' : 2})
df['stops'].value_counts()

stops
1    16666
0     2440
2      894
Name: count, dtype: int64

In [133]:
#Saving analysis ready dataset
df.to_csv('./data/analysis_ready_dataset.csv', index = False)
print("File has been saved successfully")

File has been saved successfully


    Data analysis and data preparation stage

In [134]:
#Checking the correlation between numerical values and our target column
corr = df[['stops', 'duration', 'days_left', 'price']].corr()
corr

,stops,duration,days_left,price
stops,1.0,0.470493,0.003238,0.121455
duration,0.470493,1.0,-0.020091,0.213158
days_left,0.003238,-0.020091,1.0,-0.102545
price,0.121455,0.213158,-0.102545,1.0


    Feature engineering 

For this problem, the categorical features are one-hot encoded and the numerical features, including the target price, are standardized with StandardScaler. The stops column keeps the ordinal integer encoding from the previous step.

In [135]:
ohe_cols = ['airline', 'source_city', 'departure_time', 'arrival_time', 'destination_city', 'class', 'flight']
std_scaler = ['duration', 'days_left', 'price']

#One Hot encoding
ohe = OneHotEncoder(sparse_output = False)
ohe_data = ohe.fit_transform(df[ohe_cols])
df[ohe.get_feature_names_out()] = ohe_data

#Standard scaler
std_scl = StandardScaler()
std_scaled = std_scl.fit_transform(df[std_scaler])
std_scaled_cols = [x + '_std' for x in std_scaler]
df[std_scaled_cols] = std_scaled

df.drop(columns = ohe_cols + std_scaler, inplace = True)
df.head()

id,stops,airline_AirAsia,airline_Air_India,airline_GO_FIRST,airline_Indigo,airline_SpiceJet,airline_Vistara,source_city_Bangalore,source_city_Chennai,source_city_Delhi,...,class_Economy,flight_6E,flight_AI,flight_G8,flight_I5,flight_SG,flight_UK,duration_std,days_left_std,price_std
1,1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.289528,-0.361418,-0.60366
2,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.456827,-1.388976,-0.687963
3,1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.36291,-1.53577,1.738437
4,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.758285,0.152361,-0.667414
5,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,-1.4219,-1.609167,-0.449982


In [136]:
df.to_csv(path_or_buf = './data/train_ready.csv', sep = ',')
print("Dataset for training machine learning algorithm is preserved successfully !")

Dataset for training machine learning algorithm is preserved successfully !


    Modelling

For modelling I try three models: an MLP (multilayer perceptron) regressor with ReLU activation, an MLP regressor with logistic activation, and the tabular learner from fastai. Model performance is measured with the mean absolute error (MAE).
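Since the target is a continuous price, regression metrics are the natural choice here. The sketch below computes by hand the three metrics imported at the top of the notebook (MAE, RMSE and R²) on made-up numbers, to show what each one measures; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up true values and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors, in the units of the target.
mae = np.mean(np.abs(errors))

# Root mean squared error: like MAE, but large errors weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

These should agree with mean_absolute_error, root_mean_squared_error and r2_score from sklearn.metrics on the same inputs.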

In [137]:
#Loading the prepared data and splitting it into train and test sets
#Note: index_col is not set here, so the saved 'id' column is loaded as an ordinary feature
model_data = pd.read_csv(filepath_or_buffer = './data/train_ready.csv', sep = ',')
x = model_data.drop(columns = 'price_std')
y = model_data['price_std']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

#Training the model with MLPRegressor
mlp_reg = MLPRegressor(random_state = 1, max_iter = 1000).fit(x_train, y_train)

In [138]:
pred_test = mlp_reg.predict(X = x_test)

print(f"Mean absolute error of MLP regression in test dataset is {mean_absolute_error(y_true = y_test, y_pred = pred_test)}")

Mean absolute error of MLP regression in test dataset is 0.4738381339638545
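The MAE above is measured on price_std, the standardized target, so it is expressed in standard deviations of price rather than in currency. A rough sketch of the conversion follows. It assumes a StandardScaler fit on price alone and uses only the five prices visible in df.head(), so the printed number is not the notebook's real figure. In the notebook itself the scaler std_scl was fit on three columns, so the price scale would be std_scl.scale_[2].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy prices taken from the first five rows shown by df.head(); not the full dataset.
prices = np.array([[7212.0], [5292.0], [60553.0], [5760.0], [10712.0]])
scaler = StandardScaler().fit(prices)

# MAE reported by the MLP cell above, measured on the standardized target.
mae_std = 0.4738381339638545

# Standardization divides by the standard deviation, so absolute errors
# convert back by multiplying with that same standard deviation.
mae_price_units = mae_std * scaler.scale_[0]
print(f"MAE in price units on this toy scale: {mae_price_units:.2f}")
```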


In [121]:
#Saving the trained model in pickle format (the with block closes the file afterwards)
with open('./models/mlp_reg_relu.pkl', 'wb') as model_file:
    pkl.dump(mlp_reg, model_file)
print("Model saved successfully!")

Model saved successfully!
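A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows one way to verify that round trip. The small synthetic dataset, the tiny network and the temporary file name mlp_demo.pkl are assumptions made for the example, not the notebook's actual model or path.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neural_network import MLPRegressor

# Small synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.0, -2.0, 0.5])

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=1).fit(x, y)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mlp_demo.pkl")
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same_predictions = bool(np.allclose(model.predict(x), restored.predict(x)))
print("round trip ok:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that wrote them.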


In [122]:
#MLP regressor with logistic (sigmoid) activation
mlp_reg_log = MLPRegressor(random_state = 42, max_iter = 1000, activation = 'logistic').fit(X = x_train, y = y_train)
pred_test = mlp_reg_log.predict(X = x_test)
print(f"Mean absolute error of MLP regression(logistic) in test dataset is {mean_absolute_error(y_true = y_test, y_pred = pred_test)}")

Mean absolute error of MLP regression(logistic) in test dataset is 0.8745774846187149


The MLP with ReLU activation has the lower test MAE (0.47 against 0.87), so it is the one worth keeping.

In [139]:
df_fastai = pd.read_csv(filepath_or_buffer = './data/analysis_ready_dataset.csv', sep = ',')
df_fastai.head()

,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,Vistara,UK,Bangalore,Early_Morning,1,Night,Mumbai,Economy,14.25,21,7212
1,SpiceJet,SG,Hyderabad,Evening,0,Night,Kolkata,Economy,1.75,7,5292
2,Vistara,UK,Bangalore,Morning,1,Evening,Delhi,Business,9.58,5,60553
3,Vistara,UK,Kolkata,Morning,1,Evening,Hyderabad,Economy,6.75,28,5760
4,Indigo,6E,Chennai,Early_Morning,0,Morning,Mumbai,Economy,2.0,4,10712


In [149]:
df_fastai.dtypes[(df_fastai.dtypes == 'int64') | (df_fastai.dtypes == 'float')].index.to_list()

['stops', 'duration', 'days_left', 'price']
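The list above contains price, which is also the target. Passing it as a continuous feature would let the model see the value it is supposed to predict. A minimal sketch of building the two column lists with the target excluded is below; the toy frame and the names target, cont_names and cat_names are illustrative.

```python
import pandas as pd

# Toy frame with the same kinds of columns as df_fastai (values from df.head()).
toy = pd.DataFrame({
    "airline": ["Vistara", "SpiceJet", "Indigo"],
    "class": ["Economy", "Economy", "Business"],
    "stops": [1, 0, 0],
    "duration": [14.25, 1.75, 2.0],
    "days_left": [21, 7, 4],
    "price": [7212, 5292, 10712],
})

target = "price"

# Numeric columns, with the target removed so it is never used as an input.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
cont_names = [c for c in numeric_cols if c != target]

# Everything that is neither numeric nor the target is treated as categorical.
cat_names = [c for c in toy.columns if c not in numeric_cols and c != target]

print("continuous:", cont_names)
print("categorical:", cat_names)
```

Lists built this way can be passed as cont_names and cat_names to TabularDataLoaders.from_df, with y_names set to the target.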

In [152]:
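#Fastai model
train_data, test_data = train_test_split(df_fastai, test_size = 0.2, random_state = 42)
#Categorical columns are the object-typed ones, continuous columns are the numeric ones
cat_names = df_fastai.dtypes[df_fastai.dtypes == 'object'].index.to_list()
#Note: this list also contains the target 'price', so the target is passed to the model as an input feature as well
cont_names = df_fastai.dtypes[(df_fastai.dtypes == 'int64') | (df_fastai.dtypes == 'float')].index.to_list()
dls = TabularDataLoaders.from_df(df = train_data, bs = 16, y_names = 'price', skipinitialspace = False, cat_names = cat_names, cont_names = cont_names, procs = [Categorify, Normalize])
learn = tabular_learner(dls = dls, metrics = mae)
learn.fit_one_cycle(4)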

epoch,train_loss,valid_loss,mae,time
0,0.082723,0.021034,0.126233,00:06
1,0.090916,0.021687,0.128831,00:05
2,0.093414,0.022729,0.101589,00:05
3,0.065214,0.026711,0.112767,00:05


In [153]:
test_dls = dls.test_dl(test_data)
preds, targs = learn.get_preds(dl = test_dls)

print(f"Mean absolute error of the model in the test dataset is : { mean_absolute_error(targs, preds)}")


Mean absolute error of the model in the test dataset is : 0.11611131578683853


In [154]:
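#Saving the fastai learner in pickle format (the with block closes the file afterwards)
with open('./models/fastai_model.pkl', 'wb') as model_file:
    pkl.dump(learn, model_file)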