# Airfare Prediction using Machine Learning

### Problem Definition
Airline ticket prices are influenced by numerous factors, including flight routes, departure and arrival times, airline carriers, and ticket classes. The variability and dynamic nature of these prices present challenges for both travelers seeking the best deals and airlines aiming to optimize revenue.

The goal of this project is to develop a machine learning model that accurately predicts the prices of airline tickets based on historical data provided in the "Flight Price Prediction Dataset." The dataset includes various features such as flight routes, departure and arrival cities, airline carriers, departure and arrival times, and ticket class. By analyzing these features, the model will aim to forecast future ticket prices, providing valuable insights for consumers and aiding airlines in refining their pricing strategies. This project will address the need for accurate airfare predictions and contribute to better decision-making in the travel industry.

### Data
The data is downloaded from Kaggle:
https://www.kaggle.com/datasets/muhammadbinimran/flight-price-prediction

### Data Dictionary:
1. `Airline` - Names of the Airlines
2. `Date_of_Journey` - Date of journey of the flight
3. `Source` - Place of departure; starting point; origin
4. `Destination` - Place of arrival
5. `Route` - The path taken from a source to a destination, which may include specific stops along the way
6. `Dep_Time` - Time of departure from a starting point (source)
7. `Arrival_Time` - Time of arrival at a destination
8. `Duration` - Duration of flight from source to destination
9. `Total_Stops` - Total number of stops between source and destination
10. `Additional_Info` - Additional information about the flight, such as meal service
11. `Price` - Airfare

# Data Understanding

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Reading the train dataset

> The path passed to `pd.read_excel()` is written as a `raw string` by adding the prefix `r`.
<br>
> **Reason:** In a normal Python string, a backslash starts an escape sequence (for example, `\n` is a newline), so Windows-style paths containing `\` can be misread. With the prefix `r`, backslashes are kept as literal characters. Forward slashes (`/`) work with or without the prefix.

In [2]:
train_data = pd.read_excel(r'data/Data_Train.xlsx')

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
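To illustrate the raw-string note above, the following sketch compares a normal and a raw string literal and shows `pathlib` as one possible alternative. The Windows-style path `data\new_folder\train.xlsx` is hypothetical and used only for illustration.

```python
from pathlib import Path

# Hypothetical Windows-style path, used only to illustrate escaping.
normal = 'data\new_folder\train.xlsx'   # '\n' and '\t' are read as newline and tab
raw = r'data\new_folder\train.xlsx'     # backslashes are kept as literal characters

# The raw string is two characters longer because each escape
# sequence in the normal string collapses into a single character.
print(len(normal), len(raw))
print('\n' in normal, '\n' in raw)

# pathlib joins path parts with the separator of the current OS,
# so no backslashes need to be typed at all.
excel_path = Path('data') / 'Data_Train.xlsx'
print(excel_path.as_posix())
```

With forward slashes, as in `'data/Data_Train.xlsx'`, the `r` prefix makes no difference, which is why the cell above works either way.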

## Exploring the dataset

In [None]:
train_data.head()

In [None]:
train_data.tail()

In [None]:
train_data.info()

# Data Preprocessing #1

## Checking for null values

In [None]:
train_data.isnull().sum()

In [None]:
train_data['Total_Stops'].isnull().head()

### Fetching only the null records from the `Total_Stops` feature

In [None]:
train_data[train_data['Total_Stops'].isnull()]

### Fetching only the null records from the `Route` feature

In [None]:
train_data[train_data['Route'].isnull()]

## Dropping null values

> NaN - Not a Number (the marker pandas uses for missing values)

Dropping all the null values because there is only 1 record with null values.

In [None]:
train_data.dropna(inplace=True)
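As a minimal sketch of what the null check and `dropna()` do, the example below uses a small toy DataFrame. The column values are hypothetical and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has missing Route and Total_Stops.
toy = pd.DataFrame({
    'Route': ['A-B', None, 'C-D-E'],
    'Total_Stops': ['non-stop', np.nan, '1 stop'],
    'Price': [100, 200, 300],
})

# Number of missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)

# dropna() removes every row that contains at least one missing value.
# Without inplace=True it returns a new DataFrame and leaves `toy` unchanged.
cleaned = toy.dropna()
print(toy.shape, cleaned.shape)
```

Dropping rows is reasonable here because only a single record is affected; with many missing values, imputation might be worth considering instead.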

In [None]:
train_data.isnull().sum()

In [None]:
train_data.shape

In [None]:
train_data.dtypes

> pandas typically stores Python strings in columns of dtype `object`, so `object` in the output above usually indicates a text column.

In [None]:
train_data.info()

### Finding the exact memory usage of the dataset

In [None]:
train_data.info(memory_usage = 'deep')
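The difference between the default and the `deep` memory estimate can be sketched with `DataFrame.memory_usage`. The toy values below are hypothetical, and the text column is forced to `object` dtype so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy data; dtype=object mimics how text columns are commonly stored.
toy = pd.DataFrame({
    'Airline': pd.Series(['Carrier A', 'Carrier B', 'Carrier C'], dtype=object),
    'Price': [100, 200, 300],
})

# The shallow estimate counts only the 8-byte references held in the object column.
shallow = toy.memory_usage(deep=False)

# The deep estimate also measures the Python string objects that are referenced.
deep = toy.memory_usage(deep=True)

print(shallow['Airline'], deep['Airline'])
print(shallow['Price'], deep['Price'])
```

Numeric columns report the same size either way, so the gap between the two estimates comes from the text columns.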

### Making a copy of the dataset to perform Exploratory Data Analysis (EDA)

In [None]:
data = train_data.copy()

In [None]:
data.columns

In [None]:
data.head(4)

In [None]:
data.dtypes

### Converting `Dep_Time`, `Arrival_Time`, and `Date_of_Journey` features into `timestamp` format

> Most machine learning algorithms cannot work with string data directly because they rely on linear algebra and calculus, which operate on numbers. <br>
> Hence, we need to convert `object` (string) features to a `numeric` or `vector` format.
<br><br>
> After conversion, the dtype may be displayed as either `datetime64[ns]` or `<M8[ns]`; both names refer to the same NumPy type. <br>
> In `<M8[ns]`, `<` denotes little-endian byte order, `M` is the type code for `datetime`, `8` is the size in bytes (`64` bits), and `[ns]` is the unit, nanoseconds. The lowercase form `<m8[ns]` is a different type, `timedelta64[ns]`. <br>
> Which name is displayed depends on the NumPy version and the byte order of the machine.

Creating a function to avoid repetition of the same Lines of Code (LOC).

In [None]:
def change_into_Datetime(col):
    data[col] = pd.to_datetime(data[col])
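The dtype names mentioned in the note on `timestamp` formats can be inspected directly in NumPy. This is a small sketch that only queries dtype attributes.

```python
import numpy as np

dt = np.dtype('datetime64[ns]')
td = np.dtype('timedelta64[ns]')

# kind is the one-letter type code: 'M' for datetime, 'm' for timedelta.
# itemsize is the size in bytes (8 bytes = 64 bits).
# str is the short array-protocol name, including the byte-order character.
print(dt.kind, dt.itemsize, dt.str)
print(td.kind, td.itemsize, td.str)

# 'M8[ns]' is simply the short spelling of 'datetime64[ns]'.
same = dt == np.dtype('M8[ns]')
print(same)
```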

> Ignoring warnings that may appear in the later LOC

In [None]:
import warnings
from warnings import filterwarnings
filterwarnings("ignore")

In [None]:
data.columns

In [None]:
for feature in ['Dep_Time', 'Arrival_Time', 'Date_of_Journey']:
    change_into_Datetime(feature)
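One caveat worth checking at this step: by default, `pd.to_datetime` reads an ambiguous date such as `01/03/2019` as month first. The sketch below assumes that `Date_of_Journey` holds day-first strings in the form `dd/mm/yyyy`; this is an assumption about the raw data, and the sample values are hypothetical. If the assumption holds, passing an explicit `format` avoids swapped days and months.

```python
import pandas as pd

# Hypothetical day-first date strings (assumed format: dd/mm/yyyy).
dates = pd.Series(['01/03/2019', '24/03/2019'])

# An explicit format removes any ambiguity about day and month order.
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())

# Without a hint, an ambiguous string is read as month/day/year.
ambiguous_default = pd.to_datetime('01/03/2019')
ambiguous_dayfirst = pd.to_datetime('01/03/2019', dayfirst=True)
print(ambiguous_default.month, ambiguous_dayfirst.month)
```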

In [None]:
data.dtypes

### Splitting the `Date_of_Journey` feature into derived attributes/features: `Journey_Day`, `Journey_Month`, and `Journey_Year`

> Utilizing the `dt` accessor from pandas in order to access the `datetime` properties of `day`, `month`, and `year`

In [None]:
data['Journey_Day'] = data['Date_of_Journey'].dt.day
data['Journey_Month'] = data['Date_of_Journey'].dt.month
data['Journey_Year'] = data['Date_of_Journey'].dt.year

In [None]:
data.head(2)

### Extracting derived attributes from `Dep_Time` and `Arrival_Time` features: `Dep_Time_Hour`, `Dep_Time_Minute`, `Arrival_Time_Hour`, and `Arrival_Time_Minute`

In [None]:
def extract_hour_min(df, col):
    df[col+'_Hour'] = df[col].dt.hour
    df[col+'_Minute'] = df[col].dt.minute
    return df.head(3)

In [None]:
data.columns

In [None]:
extract_hour_min(data, 'Dep_Time')

In [None]:
extract_hour_min(data, 'Arrival_Time')

### Dropping `Dep_Time` and `Arrival_Time` features because derived attributes have been extracted from them and hence they are of no use.

In [None]:
cols_to_drop = ['Dep_Time', 'Arrival_Time']

data.drop(cols_to_drop, axis=1, inplace=True)

> `axis=1` refers to the column of the dataframe <br>
> `axis=0` refers to the row of the dataframe

In [None]:
data.head(2)

In [None]:
data.shape

# Data Analysis & Visualization #1

## When do most of the flights take off?

In [None]:
data.columns

### Defining a function to break down `Dep_Time_Hour` into different parts of the day
Each interval includes its start time and excludes its end time.
1. **Early Morning:** 5:00 AM - 8:00 AM
2. **Morning:** 8:00 AM - 12:00 PM
3. **Afternoon:** 12:00 PM - 3:00 PM
4. **Late Afternoon:** 3:00 PM - 6:00 PM
5. **Evening:** 6:00 PM - 9:00 PM
6. **Night:** 9:00 PM - 12:00 AM
7. **Midnight:** 12:00 AM - 1:00 AM
8. **Late Night:** 1:00 AM - 5:00 AM

In [None]:
def flight_dep_time(x):
    # x is the departure hour as an integer from 0 to 23
    if 5 <= x < 8:
        return 'Early Morning'
    elif 8 <= x < 12:
        return 'Morning'
    elif 12 <= x < 15:
        return 'Afternoon'
    elif 15 <= x < 18:
        return 'Late Afternoon'
    elif 18 <= x < 21:
        return 'Evening'
    elif 21 <= x < 24:
        return 'Night'
    elif x == 0:
        return 'Midnight'
    else:
        # remaining hours: 1 to 4
        return 'Late Night'
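As an alternative to a chain of `if`/`elif` checks, the same kind of binning can be sketched with `pd.cut`. The bin edges below read the time ranges listed above as half-open hour intervals (start included, end excluded), with the hour after 4:00 AM counted as Late Night; both are assumptions about how the boundaries should be treated. The sample hours are hypothetical.

```python
import pandas as pd

# Hypothetical departure hours (integers from 0 to 23).
hours = pd.Series([0, 3, 5, 8, 12, 17, 20, 23])

# Bin edges in hours; with right=False each bin is [start, end).
bins = [0, 1, 5, 8, 12, 15, 18, 21, 24]
labels = ['Midnight', 'Late Night', 'Early Morning', 'Morning',
          'Afternoon', 'Late Afternoon', 'Evening', 'Night']

parts_of_day = pd.cut(hours, bins=bins, labels=labels, right=False)
print(parts_of_day.tolist())
```

`pd.cut` returns a categorical Series, so `value_counts()` and plotting work on it in the same way as on the result of `apply`.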

### Applying the `flight_dep_time` function to the `Dep_Time_Hour` feature

In [None]:
data['Dep_Time_Hour'].apply(flight_dep_time).head(6)

### Counting the frequencies of `Dep_Time_Hour`

In [None]:
data['Dep_Time_Hour'].apply(flight_dep_time).value_counts()

### Plotting the frequencies into a bar plot

In [None]:
data['Dep_Time_Hour'].apply(flight_dep_time).value_counts().plot(kind='bar', color='c')

### Utilizing plotly to create an interactive plot of the `Dep_Time_Hour` frequencies

In [None]:
# !pip install plotly
# !pip install chart_studio
# !pip install cufflinks

In [None]:
import plotly
import cufflinks as cf
from cufflinks.offline import go_offline
from plotly.offline import plot, iplot, init_notebook_mode, download_plotlyjs
init_notebook_mode(connected=True)
cf.go_offline()

- `go_offline` from `cufflinks.offline` switches cufflinks to offline mode so that plots are rendered inside the Jupyter notebook without a Plotly account.
- `plot` renders a figure offline by writing it to a standalone HTML file.
- `iplot` renders an interactive (JavaScript) figure inline. It is designed for use in Jupyter notebooks.
- `init_notebook_mode(connected=True)` initializes Plotly's notebook mode. With `connected=True`, the plotly.js library is loaded from a CDN, which keeps the notebook file small but requires an internet connection to display the plots. With `connected=False`, plotly.js is embedded in the notebook, so the plots display without an internet connection at the cost of a larger file.
- `cf.go_offline()` enables offline mode for cufflinks so that calling `.iplot()` on pandas objects renders the plots locally in the environment where the code is executed.

In [None]:
data['Dep_Time_Hour'].apply(flight_dep_time).value_counts().iplot(kind='bar')

# Data Preprocessing #2
> Preprocessing on the `Duration` feature

In [None]:
data.head(3)

In [None]:
data.dtypes

## Converting the `Duration` feature from `object` to `numeric` data type
- Some records contain both `hours` and `minutes`, but some contain only one. It is necessary to make these records consistent to convert the `Duration` feature to a numeric form. Hence, all the records will be changed to `hours` and `minutes`.

### Defining a function to normalize the `Duration` feature
- The goal is to normalize all the records in the form: `0h 0m`.
- Hence, records with only an hours part will have `0m` appended at the end, and records with only a minutes part will have `0h` prepended.

In [None]:
def preprocess_duration(x):
    if 'h' not in x:
        x = '0h' + ' ' + x
    elif 'm' not in x:
        x = x + ' ' + '0m'
    return x
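A quick way to see what the function does is to call it on a few sample strings. The function is repeated here so that the sketch is self-contained; the sample values are hypothetical and only illustrate the three cases.

```python
def preprocess_duration(x):
    if 'h' not in x:
        x = '0h' + ' ' + x
    elif 'm' not in x:
        x = x + ' ' + '0m'
    return x

# Hypothetical samples: both parts present, hours only, minutes only.
samples = ['2h 50m', '19h', '5m']
normalized = [preprocess_duration(s) for s in samples]
print(normalized)
```

Strings that already contain both parts pass through unchanged, so applying the function twice is harmless.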

### Applying the function to the `Duration` feature

In [None]:
data['Duration'] = data['Duration'].apply(preprocess_duration)

In [None]:
data['Duration'].head()

## Extracting derived attributes from the `Duration` feature: `Duration_Hours` and `Duration_Mins`

- Using `split()` to extract derived attributes
- Using positive indexing and negative indexing to access the numeric values of the string
> How negative indexing works: <br>
> Example: <br><br>
> Positive Indexing: <br>
> `D u r a t i o n` <br>
> `0 1 2 3 4 5 6 7` <br><br>
> Negative Indexing: <br>
> ` D  u  r  a  t  i  o  n` <br>
> `-8 -7 -6 -5 -4 -3 -2 -1`

In [None]:
data['Duration'][0]

In [None]:
data['Duration'][0].split(' ')

In [None]:
data['Duration'][0].split(' ')[0]

In [None]:
data['Duration'][0].split(' ')[1]

> Accessing only the numeric value by excluding the trailing `h` or `m` using the slice: <br>
> `[0:-1]`.<br>
> The slice starts at index `0` and stops just before index `-1`, so it returns every character except the last one.

In [None]:
data['Duration'][0].split(' ')[0][0:-1]

In [None]:
data['Duration'][0].split(' ')[1][0:-1]

> Using `type()` to check the data type

In [None]:
type(data['Duration'][0].split(' ')[0][0:-1])

> Since the data type of the extracted numeric value is `str`, the `int()` function will be used to convert it to an integer.

In [None]:
int(data['Duration'][0].split(' ')[0][0:-1])

In [None]:
int(data['Duration'][0].split(' ')[1][0:-1])
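The individual steps above (split, slice, convert) can be put together on a single sample string. The value `'2h 50m'` is hypothetical and stands in for one normalized `Duration` record.

```python
# Hypothetical normalized duration string.
sample = '2h 50m'

# split(' ') separates the hours part from the minutes part.
hours_part, mins_part = sample.split(' ')

# [0:-1] (or simply [:-1]) keeps everything except the last character.
hours = int(hours_part[0:-1])
mins = int(mins_part[:-1])
print(hours, mins)

# Negative indexes count from the end of the string.
word = 'Duration'
print(word[-1], word[-8])
```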

### Creating and applying lambda function to the `Duration` feature in order to extract the derived attributes

> The `x` in the lambda function receives the value of each row/record in the `Duration` feature, one at a time.

In [None]:
data['Duration_Hours'] = data['Duration'].apply(lambda x: int(x.split(' ')[0][0:-1]))

In [None]:
data['Duration_Mins'] = data['Duration'].apply(lambda x: int(x.split(' ')[1][0:-1]))
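As a possible alternative to the two `apply` calls, both parts can be pulled out in one step with the `str.extract` method and a regular expression. This is only a sketch: it assumes every value is already in the normalized `0h 0m` form, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical normalized durations.
durations = pd.Series(['2h 50m', '19h 0m', '0h 5m'])

# Each named group becomes a column in the resulting DataFrame.
pattern = r'(?P<Duration_Hours>\d+)h\s+(?P<Duration_Mins>\d+)m'
parts = durations.str.extract(pattern).astype(int)
print(parts)
```

A value that does not match the pattern yields missing values, so the `astype(int)` call would fail on such a row instead of silently producing a wrong number.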

In [None]:
data.head(2)

# Data Analysis & Visualization #2

## Does the duration of a flight have any impact on its price?

In [None]:
data.dtypes

### Converting the values in the `Duration` feature from hours to minutes

> The current form of values in the `Duration` feature is `0h 0m` <br><br>
> `0h 0m` can be converted to minutes as follows: <br>
> `0h*60 + 0m*1`
> <br><br>
> How it will be implemented: <br>
> `h` will be replaced with `*60`, and <br>
> `m` will be replaced with `*1`
> <br><br>
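> However, the `Duration` feature is of the datatype `object` (string). Hence, the `eval()` function can be used: it evaluates a string as a Python expression and returns the result. Because `eval()` runs whatever code the string contains, it should only be applied to trusted data.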

In [None]:
'2*60'

In [None]:
eval('2*60')

In [None]:
data['Duration'].head(6)

> Utilizing the `str` accessor and the `replace` function to replace specific characters from the string values in the `Duration` feature:<br>
> `'h'` -> `'*60'`<br>
> `' '` -> `'+'`<br>
> `'m'` -> `'*1'`
>
> Finally storing it in `Duration_In_Mins`

In [None]:
data['Duration_In_Mins'] = data['Duration'].str.replace('h', '*60').str.replace(' ', '+').str.replace('m', '*1')

In [None]:
data['Duration_In_Mins'].head(6)

> Applying the `eval()` function to perform arithmetic operations on the string values

In [None]:
data['Duration_In_Mins'] = data['Duration_In_Mins'].apply(eval)
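Since the hours and minutes have already been extracted into `Duration_Hours` and `Duration_Mins`, the total can also be computed with plain column arithmetic, without `eval()`. The sketch below uses a hypothetical toy DataFrame and checks that both routes give the same numbers.

```python
import pandas as pd

# Hypothetical toy data with the columns derived earlier in the notebook.
toy = pd.DataFrame({
    'Duration': ['2h 50m', '19h 0m', '0h 5m'],
    'Duration_Hours': [2, 19, 0],
    'Duration_Mins': [50, 0, 5],
})

# Vectorized arithmetic: total minutes = hours * 60 + minutes.
toy['Duration_In_Mins'] = toy['Duration_Hours'] * 60 + toy['Duration_Mins']

# The string-replacement route from the notebook, for comparison.
via_eval = (
    toy['Duration']
    .str.replace('h', '*60', regex=False)
    .str.replace(' ', '+', regex=False)
    .str.replace('m', '*1', regex=False)
    .apply(eval)
)

print(toy['Duration_In_Mins'].tolist())
print(via_eval.tolist())
```

The arithmetic version avoids evaluating strings as code and is usually faster on large columns.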

In [None]:
data['Duration_In_Mins'].head(6)

In [None]:
data.columns

### Creating a scatter plot to visualize the impact of flight duration on its price

In [None]:
sns.scatterplot(x='Duration_In_Mins', y='Price', data=data)