# Flight Price Prediction: Feature Engineering

In this notebook, we focus on preparing flight ticket data for price prediction. Key steps include handling date-time data, extracting new features, and cleaning the data.

- The dataset contains date and time columns, which can be complex to handle. We will parse and extract meaningful features from them (e.g., journey day, month, departure hour).

In [1]:
# Importing essential libraries for data handling and visualization
import pandas as pd  # for data manipulation
import numpy as np   # for numerical computations
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # for advanced visualizations

# Show plots inline within the notebook
%matplotlib inline


In [None]:
# 📥 Load the training dataset
train_df = pd.read_excel('Data_Train.xlsx')

In [None]:
# 📥 Load the testing dataset
test_df = pd.read_excel('Test_set.xlsx')

In [None]:
# 🔍 Display the first few records of the training dataset
train_df.head()

In [None]:
# 📏 Check the shape (rows, columns) of the datasets
print('Train shape:', train_df.shape)
print('Test shape:', test_df.shape)

In [None]:
# 📊 Summary statistics for numerical features
train_df.describe()

In [None]:
# ℹ️ Overview of dataset including data types and missing values
train_df.info()

## 🧹 Data Cleaning

Now that we've loaded the datasets, we begin preprocessing by handling missing values, correcting data types, and removing inconsistencies.

In [None]:
# Check for missing values
train_df.isnull().sum()

Two missing values were found. Since they affect a negligible share of the rows, we drop them.

In [None]:
# Drop rows that contain null values
train_df.dropna(inplace=True)

In [None]:
train_df.isnull().sum()
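Before dropping rows, it can be worth confirming how small the affected share really is. The following is a minimal sketch on made-up values (the `demo` frame and its contents are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up sample with one incomplete row
demo = pd.DataFrame({
    'Route': ['DEL → BOM', np.nan, 'BLR → DEL', 'CCU → BLR'],
    'Total_Stops': ['non-stop', np.nan, '1 stop', '2 stops'],
    'Price': [3897, 7480, 7662, 13882],
})

# Count rows with at least one missing value and their share of the data
rows_with_nulls = demo.isnull().any(axis=1).sum()
share = rows_with_nulls / len(demo)
print(f'{rows_with_nulls} row(s) with nulls ({share:.1%} of the data)')

# Drop them and rebuild a contiguous index
cleaned = demo.dropna().reset_index(drop=True)
```

Resetting the index is optional; it only matters if later steps refer to rows by position.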

In [None]:
train_df.dtypes # datatypes

The data types of `Date_of_Journey`, `Arrival_Time`, and `Dep_Time` are `object`, so we convert them to datetime so that date and time components can be extracted for prediction.

- `dt.day` extracts only the day of a date.
- `dt.month` extracts only the month of a date.
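As a minimal sketch of the `dt` accessor, assuming the dates are day/month/year strings (the sample values and the `Journey_day` / `Journey_month` column names are made up for illustration):

```python
import pandas as pd

# Hypothetical sample values in day/month/year order
sample = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '01/05/2019', '09/06/2019']})

# An explicit format avoids day/month ambiguity during parsing
parsed = pd.to_datetime(sample['Date_of_Journey'], format='%d/%m/%Y')

sample['Journey_day'] = parsed.dt.day      # day of the month
sample['Journey_month'] = parsed.dt.month  # month number
print(sample)
```

Passing `format` explicitly is a design choice here: without it, a value such as `01/05/2019` could be read as either 1 May or 5 January.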

In [None]:
def change_into_datetime(col):
    # Convert the given train_df column from string to datetime in place
    train_df[col] = pd.to_datetime(train_df[col])

In [None]:
train_df.columns

In [None]:
for i in ['Date_of_Journey','Dep_Time', 'Arrival_Time']:
    change_into_datetime(i)

In [None]:
train_df.dtypes

Find the categorical columns

In [None]:
categorical_col = [col for col in train_df.columns if train_df[col].dtype == 'object']
categorical_col

Find the remaining (non-object) columns, which at this point are the numerical and datetime columns

In [None]:
numerical_col = [col for col in train_df.columns if train_df[col].dtype != 'object']
numerical_col
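The same split can be sketched with `select_dtypes`, which selects by dtype family instead of comparing against `'object'`. The `demo` frame below is made up for illustration:

```python
import pandas as pd

# Made-up sample with two text columns and one numeric column
demo = pd.DataFrame({
    'Airline': ['IndiGo', 'Air India'],
    'Source': ['Delhi', 'Kolkata'],
    'Price': [3897, 7662],
})

# 'number' covers all integer and float dtypes
numerical_cols = demo.select_dtypes(include='number').columns.tolist()
# Everything else (text, datetime, ...) ends up here
other_cols = demo.select_dtypes(exclude='number').columns.tolist()

print('Numerical:', numerical_cols)
print('Non-numerical:', other_cols)
```

Note that datetime columns are not counted as `'number'`, so they land in the second list.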

In [None]:
# Distribution of flight prices
plt.figure(figsize=(10,6))
sns.histplot(train_df['Price'], kde=True)
plt.title('Distribution of Flight Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Count plot of airlines
plt.figure(figsize=(12,6))
sns.countplot(y='Airline', data=train_df, order=train_df['Airline'].value_counts().index)
plt.title('Number of Flights by Airline')
plt.xlabel('Count')
plt.ylabel('Airline')
plt.show()

In [None]:
# Count plot for Source and Destination
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

sns.countplot(data=train_df, x='Source', ax=ax[0])
ax[0].set_title("Flight Source Distribution")
ax[0].tick_params(axis='x', rotation=45)

sns.countplot(data=train_df, x='Destination', ax=ax[1])
ax[1].set_title("Flight Destination Distribution")
ax[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Boxplot to see variation of price with different Airlines
plt.figure(figsize=(14,6))
sns.boxplot(data=train_df, x='Airline', y='Price')
plt.xticks(rotation=45)
plt.title('Flight Price by Airline')
plt.show()

In [None]:
# Checking correlation heatmap (numerical features only)
plt.figure(figsize=(10,6))
sns.heatmap(train_df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()

Combine train_df and test_df into a single DataFrame

In [None]:
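# Assumption: re-read the raw training file, since the steps below split the original date/time strings
final_df = pd.concat([pd.read_excel('Data_Train.xlsx'), test_df], ignore_index=True)
final_df.head()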

In [None]:
final_df.tail()
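When train and test rows are stacked into one frame, it helps to be able to separate them again after feature engineering. One possible sketch, using a hypothetical `is_train` marker column and made-up rows:

```python
import pandas as pd

# Made-up stand-ins for the train and test sets (the test set has no Price)
train_demo = pd.DataFrame({'Airline': ['IndiGo', 'Air India'], 'Price': [3897, 7662]})
test_demo = pd.DataFrame({'Airline': ['Jet Airways']})

# Tag each row with its origin, then stack with a fresh 0..n-1 index
combined = pd.concat(
    [train_demo.assign(is_train=True), test_demo.assign(is_train=False)],
    ignore_index=True,
)

# Later, split back using the marker
train_part = combined[combined['is_train']].drop(columns='is_train')
test_part = combined[~combined['is_train']].drop(columns=['is_train', 'Price'])
```

With `ignore_index=True` every row gets a unique label, which keeps label-based operations such as `drop` unambiguous.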

In [None]:
# 📏 Check the shape (rows, columns) of the combined dataset
print('Combined shape:', final_df.shape)

In [None]:
# 📊 Summary statistics for numerical features of the combined dataset
final_df.describe()

In [None]:
final_df.isnull().sum()

In [None]:
# ℹ️ Overview of the combined dataset including data types and missing values
final_df.info()

## Feature Engineering

### Numerical Values

In [None]:
final_df['Date']=final_df['Date_of_Journey'].str.split('/').str[0]
final_df['Month']=final_df['Date_of_Journey'].str.split('/').str[1]
final_df['Year']=final_df['Date_of_Journey'].str.split('/').str[2]

In [None]:
final_df.head(2)

In [None]:
final_df['Date']=final_df['Date'].astype(int)
final_df['Month']=final_df['Month'].astype(int)
final_df['Year']=final_df['Year'].astype(int)
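The split and the integer conversion can also be done in one pass with `expand=True`. A minimal sketch on made-up dates, assuming every value has exactly three `/`-separated parts:

```python
import pandas as pd

demo = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '1/05/2019']})

# expand=True returns one column per part; astype(int) converts them all at once
parts = demo['Date_of_Journey'].str.split('/', expand=True).astype(int)
parts.columns = ['Date', 'Month', 'Year']

demo = pd.concat([demo, parts], axis=1)
print(demo)
```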

In [None]:
# ℹ️ Confirm that Date, Month and Year are now integer columns
final_df.info()

In [None]:
final_df.drop('Date_of_Journey', axis=1, inplace=True)

In [None]:
final_df.head(2)

In [None]:
# Keep only the text before the first space, i.e. the time part of Arrival_Time
final_df['Arrival_Time']=final_df['Arrival_Time'].apply(lambda x: x.split(' ')[0])

In [None]:
final_df['Arrival_hour']=final_df['Arrival_Time'].str.split(':').str[0]
final_df['Arrival_min']=final_df['Arrival_Time'].str.split(':').str[1]

In [None]:
# 🔍 Inspect the new Arrival_hour and Arrival_min columns
final_df.head()

In [None]:
final_df['Arrival_hour']=final_df['Arrival_hour'].astype(int)
final_df['Arrival_min']=final_df['Arrival_min'].astype(int)
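The arrival-time steps (strip anything after the time, split on `:`, convert to integers) can be sketched as a single regular-expression extraction. The sample values are made up, and the pattern assumes each entry starts with `HH:MM`:

```python
import pandas as pd

# Made-up values: some carry a trailing date, some do not
demo = pd.DataFrame({'Arrival_Time': ['01:10 22 Mar', '13:15', '04:25 10 Jun']})

# Named groups become the column names of the extracted frame
hm = demo['Arrival_Time'].str.extract(
    r'^(?P<Arrival_hour>\d{1,2}):(?P<Arrival_min>\d{2})'
).astype(int)

demo = demo.join(hm)
print(demo)
```

Rows that do not match the pattern would come back as missing values, and `astype(int)` would then fail loudly rather than pass bad data through.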

In [None]:
# ℹ️ Confirm that Arrival_hour and Arrival_min are now integer columns
final_df.info()

In [None]:
final_df.drop('Arrival_Time', axis=1, inplace=True)

In [None]:
final_df.head(1)

In [None]:
final_df['Dep_hour']=final_df['Dep_Time'].str.split(':').str[0]
final_df['Dep_min']=final_df['Dep_Time'].str.split(':').str[1]
final_df.drop('Dep_Time', axis=1, inplace=True)

In [None]:
final_df['Dep_hour']=final_df['Dep_hour'].astype(int)
final_df['Dep_min']=final_df['Dep_min'].astype(int)

In [None]:
# ℹ️ Confirm that Dep_hour and Dep_min are now integer columns
final_df.info()

In [None]:
final_df['Total_Stops'].unique()

In [None]:
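# Treat missing values as '1 stop' before mapping; a string key 'nan' never matches a real NaN
final_df['Total_Stops']=final_df['Total_Stops'].fillna('1 stop').map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})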
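When mapping stop labels to integers, note that `Series.map` matches dictionary keys against the actual values, so a string key such as `'nan'` does not catch real missing values. A small sketch with made-up labels:

```python
import numpy as np
import pandas as pd

stops = pd.Series(['non-stop', '1 stop', np.nan, '2 stops'])
mapping = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

# The extra 'nan' string key has no effect: the missing entry stays NaN
with_string_key = stops.map({**mapping, 'nan': 1})

# Filling the missing label first makes the fallback explicit
mapped_filled = stops.fillna('1 stop').map(mapping).astype(int)
print(mapped_filled.tolist())
```

Choosing `'1 stop'` as the fallback mirrors the intent of mapping missing values to 1; any other default would be an equally valid modelling decision.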

In [None]:
final_df.drop('Route',axis=1, inplace=True)

In [None]:
# 🔍 Inspect the dataset after encoding Total_Stops and dropping Route
final_df.head()

In [None]:
final_df['duration_hour']=final_df['Duration'].str.split(' ').str[0].str.split('h').str[0]

In [None]:
final_df[final_df['duration_hour']=='5m']

Drop these rows: a duration of only 5 minutes is not plausible for a flight.

In [None]:
# Drop the rows found above by mask instead of hard-coded index labels
final_df.drop(final_df[final_df['duration_hour']=='5m'].index, inplace=True)

In [None]:
final_df

In [None]:
final_df['duration_hour']=final_df['duration_hour'].astype('int')
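An alternative to splitting on `h` is to pull hours and minutes out with regular expressions, which also copes with entries that contain only one of the two parts. This is a sketch on made-up values; `duration_total_min` is a hypothetical column name:

```python
import pandas as pd

demo = pd.DataFrame({'Duration': ['2h 50m', '19h', '5m', '7h 25m']})

# Extract each part separately; a missing part is treated as zero
hours = demo['Duration'].str.extract(r'(\d+)h')[0].fillna('0').astype(int)
minutes = demo['Duration'].str.extract(r'(\d+)m')[0].fillna('0').astype(int)

demo['duration_total_min'] = hours * 60 + minutes

# Entries without an hour component can then be reviewed explicitly
suspicious = demo[hours == 0]
print(demo)
```

Parsing does not replace the judgement call about implausible durations; it only makes such rows easy to find without relying on index labels.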

In [None]:
# ℹ️ Confirm that duration_hour is now an integer column
final_df.info()

In [None]:
final_df.drop('Duration', axis=1, inplace=True)

In [None]:
# 🔍 Inspect the dataset after dropping Duration
final_df.head()

### Categorical Values

In [None]:
final_df['Airline'].unique()

In [None]:
import sys
import subprocess

subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn"])

In [None]:
# Import LabelEncoder to convert categorical string values into numeric labels
from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder()

In [None]:
# Convert categorical columns into numeric form so that they can be used in ML models
final_df['Airline']=labelencoder.fit_transform(final_df['Airline'])
final_df['Source']=labelencoder.fit_transform(final_df['Source'])
final_df['Destination']=labelencoder.fit_transform(final_df['Destination'])
final_df['Additional_Info']=labelencoder.fit_transform(final_df['Additional_Info'])
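Label encoding assigns integer codes in sorted label order, which some models may read as a ranking between categories. One-hot encoding is a common alternative; the sketch below swaps in pandas' `get_dummies` instead of the `LabelEncoder` used above, on made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'Airline': ['IndiGo', 'Air India', 'IndiGo'],
    'Price': [3897, 7662, 4107],
})

# One 0/1 indicator column per category, named '<column>_<category>'
encoded = pd.get_dummies(demo, columns=['Airline'], dtype=int)
print(encoded)
```

Whether this is preferable depends on the model: tree-based models often cope well with integer codes, while linear models usually benefit from indicator columns.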

In [None]:
# 🔍 Inspect the dataset after encoding the categorical columns
final_df.head()