# **Flight Price Prediction**


**Objective**
* Predict flight ticket prices based on date, destination, and other factors.

**Data collection:**
* The data comes from the Kaggle dataset "Flight Price Prediction", uploaded by Shubham Bathwal.

 > ## Data Cleaning


In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


df=pd.read_csv('Clean_Dataset.csv')
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        300153 non-null  int64  
 1   airline           300153 non-null  object 
 2   flight            300153 non-null  object 
 3   source_city       300153 non-null  object 
 4   departure_time    300153 non-null  object 
 5   stops             300153 non-null  object 
 6   arrival_time      300153 non-null  object 
 7   destination_city  300153 non-null  object 
 8   class             300153 non-null  object 
 9   duration          300153 non-null  float64
 10  days_left         300153 non-null  int64  
 11  price             300153 non-null  int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 27.5+ MB


,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955
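
The `Unnamed: 0` column above is what typically appears when a CSV was saved together with its row index and then read back as ordinary data. A minimal sketch of one way to avoid carrying it around, assuming the first CSV column really is just a row counter. The inline CSV text is a cut-down, two-row stand-in for `Clean_Dataset.csv`, not the real file:

```python
import io
import pandas as pd

# Hypothetical two-row sample with the same leading unnamed column as the real file.
csv_text = """,airline,flight,duration,days_left,price
0,SpiceJet,SG-8709,2.17,1,5953
1,SpiceJet,SG-8157,2.33,1,5953
"""

# Default read: the unnamed first column becomes a regular 'Unnamed: 0' column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

With `index_col=0` there is no leftover column to rename or drop later.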


In [25]:
df.isnull().sum()

Unnamed: 0          0
airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

In [26]:
df.duplicated().sum()

np.int64(0)
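
One caveat about this check: `duplicated()` compares entire rows, and `Unnamed: 0` appears to hold a different value on every row, so the result is zero whether or not the flight data itself repeats. A small sketch of leaving the counter column out of the comparison, using a made-up three-row frame rather than the real data:

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 describe the same fare.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "airline": ["Vistara", "Vistara", "AirAsia"],
    "price": [5955, 5955, 5956],
})

# Comparing whole rows: the unique counter hides the repeat.
with_counter = toy.duplicated().sum()

# Comparing everything except the counter reveals it.
without_counter = toy.drop(columns=["Unnamed: 0"]).duplicated().sum()

print(with_counter, without_counter)
```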

Assuming the source and destination cities have little effect on price in this dataset, we drop both columns.
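
Whether the route matters for price is something the data can be asked directly before the columns are dropped. A hedged sketch of one such check, comparing the mean price per source city. The prices below are invented purely for illustration and say nothing about the real dataset:

```python
import pandas as pd

# Toy frame with invented prices (not taken from Clean_Dataset.csv).
toy = pd.DataFrame({
    "source_city": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "price": [5000, 6000, 7000, 9000],
})

# Mean price per source city: similar means would support dropping the column,
# widely different means would argue for keeping it.
mean_by_city = toy.groupby("source_city")["price"].mean()

print(mean_by_city)
```

The same `groupby` call can be run with `destination_city` on the full frame.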

In [27]:
df = df.drop(columns=['source_city','destination_city'])
df.head()

,Unnamed: 0,airline,flight,departure_time,stops,arrival_time,class,duration,days_left,price
0,0,SpiceJet,SG-8709,Evening,zero,Night,Economy,2.17,1,5953
1,1,SpiceJet,SG-8157,Early_Morning,zero,Morning,Economy,2.33,1,5953
2,2,AirAsia,I5-764,Early_Morning,zero,Early_Morning,Economy,2.17,1,5956
3,3,Vistara,UK-995,Morning,zero,Afternoon,Economy,2.25,1,5955
4,4,Vistara,UK-963,Morning,zero,Morning,Economy,2.33,1,5955


In [28]:
df.tail()


,Unnamed: 0,airline,flight,departure_time,stops,arrival_time,class,duration,days_left,price
300148,300148,Vistara,UK-822,Morning,one,Evening,Business,10.08,49,69265
300149,300149,Vistara,UK-826,Afternoon,one,Night,Business,10.42,49,77105
300150,300150,Vistara,UK-832,Early_Morning,one,Night,Business,13.83,49,79099
300151,300151,Vistara,UK-828,Early_Morning,one,Evening,Business,10.0,49,81585
300152,300152,Vistara,UK-822,Morning,one,Evening,Business,10.08,49,81585


In [29]:
df.columns

Index(['Unnamed: 0', 'airline', 'flight', 'departure_time', 'stops',
       'arrival_time', 'class', 'duration', 'days_left', 'price'],
      dtype='object')

In [30]:
categorical_variables = df.dtypes[df.dtypes == "object"].index
categorical_variables

Index(['airline', 'flight', 'departure_time', 'stops', 'arrival_time',
       'class'],
      dtype='object')

In [31]:
for category in categorical_variables:
    print(f"Number of unique values in {category} = {df[category].nunique()}")

Number of unique values in airline = 6
Number of unique values in flight = 1561
Number of unique values in departure_time = 6
Number of unique values in stops = 3
Number of unique values in arrival_time = 6
Number of unique values in class = 2


The `flight` column has 1,561 unique values, too many to one-hot encode usefully, so we drop it.
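
The same decision can be made programmatically by keeping only the categorical columns whose number of distinct values stays under some cutoff. A minimal sketch on a toy frame. The cutoff `max_levels` is an arbitrary, hypothetical choice, not something this notebook specifies:

```python
import pandas as pd

# Toy frame (hypothetical rows) with one high-cardinality column, 'flight'.
toy = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "AirAsia", "SpiceJet"],
    "flight": ["UK-995", "UK-963", "I5-764", "SG-8709"],
    "class": ["Economy", "Economy", "Economy", "Business"],
})

max_levels = 3  # hypothetical cutoff for "few enough to one-hot encode"

# Count distinct values per column, then split the columns by the cutoff.
levels = toy.nunique()
keep = levels[levels <= max_levels].index.tolist()
drop = levels[levels > max_levels].index.tolist()

print(keep, drop)
```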

In [32]:
df = df.drop(columns=['flight'])

df.head()

,Unnamed: 0,airline,departure_time,stops,arrival_time,class,duration,days_left,price
0,0,SpiceJet,Evening,zero,Night,Economy,2.17,1,5953
1,1,SpiceJet,Early_Morning,zero,Morning,Economy,2.33,1,5953
2,2,AirAsia,Early_Morning,zero,Early_Morning,Economy,2.17,1,5956
3,3,Vistara,Morning,zero,Afternoon,Economy,2.25,1,5955
4,4,Vistara,Morning,zero,Morning,Economy,2.33,1,5955


In [33]:
df=df.rename(columns={'Unnamed: 0': 'index'})
df.head()

,index,airline,departure_time,stops,arrival_time,class,duration,days_left,price
0,0,SpiceJet,Evening,zero,Night,Economy,2.17,1,5953
1,1,SpiceJet,Early_Morning,zero,Morning,Economy,2.33,1,5953
2,2,AirAsia,Early_Morning,zero,Early_Morning,Economy,2.17,1,5956
3,3,Vistara,Morning,zero,Afternoon,Economy,2.25,1,5955
4,4,Vistara,Morning,zero,Morning,Economy,2.33,1,5955


In [34]:
# List of categorical variables to be one-hot encoded
categorical_variables = ['airline', 'departure_time', 'stops', 'arrival_time', 'class']

# Convert categorical variables to dummy variables
df_clean_encoded = pd.get_dummies(df, columns=categorical_variables, drop_first=True)

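print("Encoded DataFrame:")
# Cast the boolean dummy columns to 0/1 (0 == False, 1 == True).
# Note: casting the whole frame also truncates the float `duration`
# column to whole hours (e.g. 2.17 -> 2), as the output shows.
df_clean_encoded = df_clean_encoded.astype(int)
print(df_clean_encoded)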

Encoded DataFrame:
         index  duration  days_left  price  airline_Air_India  \
0            0         2          1   5953                  0   
1            1         2          1   5953                  0   
2            2         2          1   5956                  0   
3            3         2          1   5955                  0   
4            4         2          1   5955                  0   
...        ...       ...        ...    ...                ...   
300148  300148        10         49  69265                  0   
300149  300149        10         49  77105                  0   
300150  300150        13         49  79099                  0   
300151  300151        10         49  81585                  0   
300152  300152        10         49  81585                  0   

        airline_GO_FIRST  airline_Indigo  airline_SpiceJet  airline_Vistara  \
0                      0               0                 1                0   
1                      0               0  
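
In this output `duration` has lost its decimals (2.17 became 2) because `astype(int)` was applied to every column, not just the dummies. A sketch of one alternative, assuming the goal is only to get 0/1 dummy columns: `pd.get_dummies` accepts a `dtype` argument that applies to the generated columns alone. The frame is a three-row toy sample:

```python
import pandas as pd

# Toy sample with one categorical and one float column.
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "Vistara"],
    "duration": [2.17, 2.17, 2.25],
    "price": [5953, 5956, 5955],
})

# dtype=int makes the dummy columns 0/1 without touching the other columns.
encoded = pd.get_dummies(toy, columns=["airline"], drop_first=True, dtype=int)

print(encoded)
print(encoded.dtypes)
```

Here `duration` keeps its float values, which matters if flight length in fractional hours is meant to be a predictor.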

In [35]:
df_clean_encoded.isnull().sum()

index                           0
duration                        0
days_left                       0
price                           0
airline_Air_India               0
airline_GO_FIRST                0
airline_Indigo                  0
airline_SpiceJet                0
airline_Vistara                 0
departure_time_Early_Morning    0
departure_time_Evening          0
departure_time_Late_Night       0
departure_time_Morning          0
departure_time_Night            0
stops_two_or_more               0
stops_zero                      0
arrival_time_Early_Morning      0
arrival_time_Evening            0
arrival_time_Late_Night         0
arrival_time_Morning            0
arrival_time_Night              0
class_Economy                   0
dtype: int64

In [39]:
from sklearn.preprocessing import LabelEncoder

# 'airline', 'departure_time' and 'arrival_time' were replaced by dummy columns
# in df_clean_encoded, so indexing them there raises KeyError: 'airline'.
# Label-encode a copy of the un-encoded `df` instead
# (`df_label_encoded` is a new name introduced for this fix).
df_label_encoded = df.copy()

for column in ['airline', 'departure_time', 'arrival_time']:
    # Fit a fresh encoder per column so each column gets its own label mapping
    df_label_encoded[column] = LabelEncoder().fit_transform(df_label_encoded[column])

print(df_label_encoded.head())
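
Integer label encoding needs the original string columns, which no longer exist once a frame has been one-hot encoded. As a pandas-only sketch of the same idea, the example below swaps scikit-learn's `LabelEncoder` for `astype("category").cat.codes`. Both assign integer codes following the sorted order of the labels. The frame is a made-up four-row sample:

```python
import pandas as pd

# Toy frame holding the raw (string) categorical columns.
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "Vistara", "AirAsia"],
    "departure_time": ["Evening", "Early_Morning", "Morning", "Morning"],
})

label_encoded = toy.copy()
for column in ["airline", "departure_time"]:
    # Categories are sorted, then each value is replaced by its position.
    label_encoded[column] = label_encoded[column].astype("category").cat.codes

print(label_encoded)
```

Keep in mind that these integers impose an arbitrary order on the categories, which is one reason one-hot encoding is often preferred for linear models.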