## EDA and Feature Engineering: Flight Price Prediction
### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that gives the total travel time between cities, in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable, which stores the ticket price.
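
The two derived features in the list (time-of-day bins and days left) can be illustrated with a small sketch. The bin edges, the label names, and the columns `dep_hour`, `booking_date`, and `journey_date` are assumptions made for illustration; the source does not state how the cleaned dataset actually built these features.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "dep_hour": [2, 7, 13, 22],
    "booking_date": pd.to_datetime(
        ["2022-02-01", "2022-02-10", "2022-02-10", "2022-02-11"]),
    "journey_date": pd.to_datetime(
        ["2022-02-11", "2022-02-11", "2022-02-12", "2022-02-12"]),
})

# Assumed bin edges (in hours) and labels: six 4-hour windows.
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning",
          "Afternoon", "Evening", "Night"]

# right=False makes each bin [start, end), so hour 4 falls in "Early_Morning".
toy["departure_time"] = pd.cut(
    toy["dep_hour"], bins=edges, labels=labels, right=False)

# Days left = journey date minus booking date, in whole days.
toy["days_left"] = (toy["journey_date"] - toy["booking_date"]).dt.days
```

Binning turns a numeric hour into one of six labels, which matches the "6 unique time labels" described for the departure and arrival features.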

In [None]:
# Import the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# openpyxl is the engine pandas uses to read .xlsx files
!pip install openpyxl

In [None]:
df = pd.read_excel('flight_price.xlsx')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

##### Splitting the 'Date_of_Journey' column into date, month and year

In [None]:
df['Date_of_Journey'].str.split('/')

In [None]:
df["Date"] = df['Date_of_Journey'].str.split('/').str[0].astype(int)
type(df["Date"][0])

In [None]:
df["Month"] = df['Date_of_Journey'].str.split('/').str[1].astype(int)

In [None]:
df["Year"] = df['Date_of_Journey'].str.split('/').str[2].astype(int)

In [None]:
df.info()

In [None]:
df.drop('Date_of_Journey', axis=1, inplace=True)
df.head(2)
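
As an alternative to the three separate string splits above, the same columns could be derived by parsing the date once. This is a minimal sketch on toy data, assuming the values are in day/month/year order; the sample dates are made up.

```python
import pandas as pd

# Toy frame with the same column name as the notebook; values are hypothetical.
toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "09/06/2019"]})

# Parse the day/month/year strings once, then read the parts off the
# datetime accessor instead of splitting the string three times.
parsed = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")
toy["Date"] = parsed.dt.day
toy["Month"] = parsed.dt.month
toy["Year"] = parsed.dt.year
```

Parsing also fails loudly on malformed dates, whereas a plain split would quietly produce wrong parts.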

In [None]:
df

##### Extracting the hour and minute from 'Arrival_Time'

In [None]:
df['Arrival_Time'].str.split(' ').str[0]

In [None]:
df['Arrival_hour'] = df['Arrival_Time'].str.split(
    ' ').str[0].str.split(':').str[0].astype(int)


In [None]:
df['Arrival_minute'] = df['Arrival_Time'].str.split(
    ' ').str[0].str.split(':').str[1].astype(int)


In [None]:
df.drop('Arrival_Time', axis=1, inplace=True)
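
The chained splits above take the first space-separated token and then split it on `:`. A regular expression can do both steps in one pass. The sketch below uses toy values and assumes some entries carry trailing text after the time, as the split on `' '` suggests.

```python
import pandas as pd

# Toy values; the trailing "22 Mar" is an assumed example of extra text.
toy = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15"]})

# Capture the leading HH:MM; anything after it is ignored.
parts = toy["Arrival_Time"].str.extract(r"^(\d{1,2}):(\d{2})")
toy["Arrival_hour"] = parts[0].astype(int)
toy["Arrival_minute"] = parts[1].astype(int)
```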

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['Dep_hour'] = df['Dep_Time'].str.split(':').str[0].astype(int)


In [None]:
df['Dep_min'] = df['Dep_Time'].str.split(':').str[1].astype(int)


In [None]:
df.drop('Dep_Time', axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.info()

##### Dropping Route

In [None]:
df.drop('Route', axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# df['Duration_Hour'] = df['Duration'].str.split('h').str[0].astype(int)
# df['Duration'].str.split(' ').str[0].str.split('h').str[0].astype(int)
# df['Duration'].str.split('h').str[0]

In [None]:
# df['Duration_Minute'] = df['Duration'].str.split('h').str[1].str.split('m').str[0]
# df['Duration_Minute'] = df['Duration'].str.split('h').str[1].str.split('m').str[0].astype(int)

In [None]:
# df['Duration'].str.split('h').str[1].str.split('m').str[0]


##### Converting the Duration column into a list to extract hours and minutes separately

In [None]:
duration = list(df["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:  # Check if duration contains only hour or mins
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"   # Adds 0 minute
        else:
            duration[i] = "0h " + duration[i]           # Adds 0 hour

Duration_Hours = []
Duration_Mins = []
for i in range(len(duration)):
    # Extract hours from duration
    Duration_Hours.append(int(duration[i].split(sep="h")[0]))
    # Extracts only minutes from duration
    Duration_Mins.append(int(duration[i].split(sep="m")[0].split()[-1]))

df["Duration_Hours"] = Duration_Hours
df["Duration_Mins"] = Duration_Mins

# Drop the original Duration column now that it has been split
df.drop(['Duration'], axis=1, inplace=True)
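
The loop above pads missing parts so that every value has both an hour and a minute component. A vectorized variant is sketched below on toy data. It assumes durations look like `"2h 50m"`, `"19h"`, or `"5m"`, and fills a missing part with zero.

```python
import pandas as pd

# Toy durations covering the three shapes handled by the loop.
toy = pd.DataFrame({"Duration": ["2h 50m", "19h", "5m"]})

# Pull out the digits before "h" and before "m"; a missing part yields NaN.
hours = toy["Duration"].str.extract(r"(\d+)\s*h", expand=False)
mins = toy["Duration"].str.extract(r"(\d+)\s*m", expand=False)

# Fill the gaps with "0" (as a string, to keep the dtype uniform) and convert.
toy["Duration_Hours"] = hours.fillna("0").astype(int)
toy["Duration_Mins"] = mins.fillna("0").astype(int)
```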


In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['Total_Stops'].unique()

In [None]:
df.isnull().sum()

##### Replacing nulls with the most frequent value, i.e. the mode

In [None]:
df['Total_Stops'].mode()

In [None]:
# Map stop labels to integers; nulls get 1, assuming the mode above is '1 stop'
df['Total_Stops'] = df['Total_Stops'].map(
    {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4, np.nan: 1})
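
The mapping above hardcodes the fill value for nulls. A minimal sketch that reads the mode from the data instead is shown below. It uses toy values and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; "1 stop" is the most frequent label.
toy = pd.DataFrame(
    {"Total_Stops": ["non-stop", "1 stop", np.nan, "1 stop", "2 stops"]})

# Fill missing values with the most frequent label first...
most_frequent = toy["Total_Stops"].mode()[0]
toy["Total_Stops"] = toy["Total_Stops"].fillna(most_frequent)

# ...then map the labels to integer stop counts.
stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2,
            "3 stops": 3, "4 stops": 4}
toy["Total_Stops"] = toy["Total_Stops"].map(stop_map).astype(int)
```

Filling before mapping keeps the imputation correct even if the most frequent label changes in a refreshed dataset.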


In [None]:
df['Total_Stops'].isnull().sum()

In [None]:
df.info()


In [None]:
df.head()

##### Using One-Hot Encoding for categorical features

In [None]:
df.columns

In [None]:
df['Airline'].unique()

In [None]:
df['Source'].unique()

In [None]:
df['Destination'].unique()

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoder = OneHotEncoder()


In [None]:
encoder_fit = encoder.fit_transform(
    df[['Airline', 'Source', 'Destination', 'Additional_Info']]).toarray()

In [None]:
encoded_df = pd.DataFrame(
    encoder_fit, columns=encoder.get_feature_names_out())
encoded_df


In [None]:
final_df = encoded_df.join(df.drop(['Airline', 'Source', 'Destination',
                                   'Additional_Info'], axis=1))
final_df = final_df.astype(int)
final_df.head()


In [None]:
final_df.info()

In [None]:
# index=False avoids writing the row index as an extra unnamed column
final_df.to_csv('final_df.csv', index=False)