Explain all Features

**Airline**: This column represents the name of the airline operating the flight. Examples might include "Air India," "IndiGo," "SpiceJet," etc.

**Date_of_Journey**: The date on which the journey is scheduled. This is typically in the format "DD/MM/YYYY" or "YYYY-MM-DD".

**Source**: The departure location or city from where the flight is taking off. Examples could be "Delhi," "Mumbai," etc.

**Destination**: The arrival location or city where the flight is landing. Examples could be "Bangalore," "Chennai," etc.

**Route**: The flight path taken to reach the destination, including any layovers or stops. For example, "DEL → BOM → BLR" indicates a flight from Delhi to Bangalore with a stop in Mumbai.

**Dep_Time**: The departure time of the flight. This is usually given in 24-hour format (HH:MM).

**Arrival_Time**: The arrival time of the flight at the destination. This is also usually in 24-hour format (HH:MM).

**Duration**: The total time taken for the journey from departure to arrival. This is typically in the format "HHh MMm".

**Total_Stops**: The number of stops or layovers the flight makes before reaching the destination. For example, "non-stop" means no stops, "1 stop" means one stop, etc.

**Additional_Info**: Any extra information about the flight. This could include details like "No info," "In-flight meal not included," "Red-eye flight," etc.

**Price**: The target variable in your dataset, representing the price of the flight ticket.


# Introduction

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import calendar
import warnings
warnings.filterwarnings("ignore")
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode,iplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout
from sklearn.svm import SVR
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Importing the Dataset

In [None]:
df = pd.read_excel('Data.xlsx')

# Exploring and Cleaning the Data

In [None]:
df.head()

In [None]:
# Checking length of the data
print(f"Data length is  {df.shape[0]}, Number of Features is {df.shape[1]}")

In [None]:
# Summary statistics of the numeric columns
df.describe()

Prices range from 1759 to 79512 with a mean of 9087. The maximum is far above the mean, which suggests that the data contains outliers.

In [None]:
#checking data types
df.dtypes

All features are objects except Price, so we need to convert Date_of_Journey, Dep_Time and Arrival_Time to timestamps and Duration to a numeric value so that we can easily do calculations on them.

Before we continue exploring, we need to process the data so that we can explore it better:
convert Date_of_Journey, Dep_Time, Arrival_Time and Duration and extract the needed features from them.

In [None]:
#Checking missing values
df.isnull().sum()

In [None]:
#Very few rows have missing values, so I will drop them
df.dropna(inplace=True)

In [None]:
#Check for duplicates
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
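def change_into_datetime(col, **kwargs):
    # Parse a column into pandas datetime; kwargs are passed on to pd.to_datetime
    df[col] = pd.to_datetime(df[col], **kwargs)

# Date_of_Journey is day-first (DD/MM/YYYY), so state the format explicitly
change_into_datetime('Date_of_Journey', format='%d/%m/%Y')
for i in ['Dep_Time', 'Arrival_Time']:
    change_into_datetime(i)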

In [None]:
# Function to convert a Timestamp to decimal hours (e.g. 14:30 -> 14.5)
def convert_time(timestamp,base=60):
    hours = timestamp.hour
    minutes = timestamp.minute
    decimal_minutes = minutes / base
    return round(hours + decimal_minutes, 2)

# Apply the conversion
df['Dep_Time'] = df['Dep_Time'].apply(convert_time)

In [None]:
df['Arrival_Time'] = df['Arrival_Time'].apply(convert_time)

In [None]:
df['Month_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y').dt.month
df['Day_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y').dt.day
df['Date_of_Journey_timeStamp'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y')

In [None]:
def convert_month(timestamp):
    month = timestamp.month
    day = timestamp.day
    days_in_month = calendar.monthrange(timestamp.year, month)[1]
    decimal_day = day / days_in_month
    return round(month + decimal_day, 2)

df['Date_of_Journey'] = df['Date_of_Journey'].apply(convert_month)
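
To make the two encodings above concrete, here is a small standalone check on plain `datetime` objects. This is an assumption made for illustration: the notebook applies the functions to pandas Timestamps, which expose the same attributes. The sample dates are made up.

```python
import calendar
from datetime import datetime

def convert_time(timestamp, base=60):
    # hours plus the minutes expressed as a fraction of an hour
    return round(timestamp.hour + timestamp.minute / base, 2)

def convert_month(timestamp):
    # month plus the day expressed as a fraction of that month's length
    days_in_month = calendar.monthrange(timestamp.year, timestamp.month)[1]
    return round(timestamp.month + timestamp.day / days_in_month, 2)

half_past_two = convert_time(datetime(2019, 3, 24, 14, 30))   # 14.5
late_march = convert_month(datetime(2019, 3, 24))             # 3 + 24/31, rounded
end_of_march = convert_month(datetime(2019, 3, 31))           # 3 + 31/31 = 4.0
print(half_past_two, late_march, end_of_march)
```

Note that the last day of a month maps to the next whole number (31 March becomes 4.0), so the value is a position within the year rather than a strict month label.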

In [None]:
df['Duration'] = df['Duration'].str.replace('h','*60').str.replace(' ','+').str.replace('m','*1').apply(eval)
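
The `eval` call above works for strings such as `2h 50m`, but it executes whatever text is in the column. A minimal alternative sketch, assuming every value has the form `<H>h`, `<M>m` or `<H>h <M>m`, parses the parts with a regular expression instead. The helper name `duration_to_minutes` is hypothetical.

```python
import re

def duration_to_minutes(text):
    # Pull out the optional hour and minute parts, e.g. "2h 50m" -> ("2", "50")
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*', text)
    if match is None or not any(match.groups()):
        raise ValueError(f'Unrecognised duration: {text!r}')
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)

print(duration_to_minutes('2h 50m'))  # 170
print(duration_to_minutes('19h'))     # 1140
print(duration_to_minutes('5m'))      # 5
```

In the notebook this could be applied with `df['Duration'].apply(duration_to_minutes)`, and any value that does not fit the assumed pattern raises an error instead of being evaluated.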

In [None]:
df.head()

In [None]:
print(f"Our data ranges from {df['Date_of_Journey_timeStamp'].min()} to {df['Date_of_Journey_timeStamp'].max()}, which means {(df['Date_of_Journey_timeStamp'].max()-df['Date_of_Journey_timeStamp'].min()).days} days")

Our data ranges from 2019-03-01 to 2019-06-27

### Checking unique values

In [None]:
#Checking unique values for categorical features so we know how many dimensions we will end up with after encoding

def features_info(feature):
    print(df[feature].unique())
    print(f'Number of unique values for {feature} is {len(df[feature].unique())}')

In [None]:
features_info('Airline')

In [None]:
features_info('Source')

In [None]:
features_info('Destination')

In [None]:
features_info('Route')

In [None]:
features_info('Total_Stops')

In [None]:
features_info('Additional_Info')

The unique values contain a duplicate: No info and No Info differ only in capitalization, so we need to combine them.

After checking the number of unique values for the categorical data, we found that Route needs additional processing before it can be used: one-hot encoding it would add 129 extra dimensions, which would cost a lot of computation power.

### Exploring frequency of each value in categorical data

In [None]:
def counting_plot(feature):
    plt.figure(figsize=(12, 6))
    sns.countplot(x=feature, data=df)
    plt.title(f'Count Plot of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
counting_plot('Airline')

As we can see, Jet Airways is the most frequent airline, while Vistara Premium economy, Jet Airways Business, Multiple carriers Premium economy and Trujet can be neglected due to their low frequency.

In [None]:
counting_plot('Source')

In [None]:
counting_plot('Destination')

In [None]:
df.Route.value_counts()

As we can see, the most popular routes are DEL → BOM → COK, BLR → DEL and CCU → BOM → BLR.

In [None]:
counting_plot('Total_Stops')

As the number of stops increases, the number of people choosing the flight decreases. This may be due to longer travel time or higher prices; we will find out soon.

In [None]:
counting_plot('Additional_Info')

We cannot say much about No info, but it appears that travelers often choose flights without a meal. Perhaps this is to reduce the cost of travel, because the food on the plane is of low quality, or simply because of an imbalance in our data. We need more examination.

In [None]:
counting_plot('Day_of_Journey')

In [None]:
counting_plot('Month_of_Journey')

In [None]:
counting_plot('Date_of_Journey')

Mid-year appears to be much more popular for travel (possibly late-year too, but all the data we have is between months 3 and 6).

Month 4 is the least popular.

Days 6 to 9 of the month are popular, maybe because of lower cost or something else; we need more exploring.

### What are the factors affecting the price?

In [None]:
def feature_vs_price_in_scatter(feature, hue=None):
    plt.figure(figsize=(12, 8))
    if hue:
        sns.scatterplot(x=feature, y='Price', hue=hue,alpha=0.8, data=df)
    else:
        sns.scatterplot(x=feature, y='Price', data=df)
    plt.title(f'Scatter Plot of Flight Prices vs {feature}')
    plt.xlabel(feature)
    plt.ylabel('Price')
    plt.show()

def feature_vs_price_in_box(feature, hue=None):
    plt.figure(figsize=(14, 8))
    if hue:
        sns.boxplot(x=feature, y='Price', hue=hue, data=df)
    else:
        sns.boxplot(x=feature, y='Price', data=df)

    # Label the plot with the feature that was actually passed in
    plt.title(f'Box Plot of Prices by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Price')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Check the relation between Date and Price
feature_vs_price_in_scatter('Date_of_Journey',hue='Day_of_Journey')

We can see that flights earlier in the year are more expensive, and so are flights in the earlier days of the month.

In [None]:
# Check the relation between Dep_Time and Price
feature_vs_price_in_scatter('Dep_Time')

No clear relation between Dep_Time and Price is visible in the scatter plot.

In [None]:
# Check the relation between Arrival_Time and Price
feature_vs_price_in_scatter('Arrival_Time')

No clear relation between Arrival_Time and Price is visible in the scatter plot.

In [None]:
# Check the relation between Duration and Price
feature_vs_price_in_scatter('Duration',hue='Day_of_Journey')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Month_of_Journey')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Total_Stops')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Airline')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Additional_Info')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Source')

In [None]:
feature_vs_price_in_scatter('Duration',hue='Destination')

In [None]:
latitude = {
    'New Delhi': 28.6139,
    'Banglore': 12.9716,
    'Cochin': 9.9312,
    'Kolkata': 22.5726,
    'Delhi': 28.7041,
    'Hyderabad': 17.3850
}
df['D_latitude'] = df['Destination'].map(latitude)

longitude = {
    'New Delhi': 77.2090,
    'Banglore': 77.5946,
    'Cochin': 76.2673,
    'Kolkata': 88.3639,
    'Delhi': 77.1025,
    'Hyderabad': 78.4867
}
df['D_longitude'] = df['Destination'].map(longitude)

In [None]:
S_longitude = {
    'Banglore': 77.5946,
    'Kolkata': 88.3639,
    'Delhi': 77.1025,
    'Chennai': 80.2707,
    'Mumbai': 72.8777
}
df['S_longitude'] = df['Source'].map(S_longitude)

S_latitude = {
    'Banglore': 12.9716,
    'Kolkata': 22.5726,
    'Delhi': 28.7041,
    'Chennai': 13.0827,
    'Mumbai': 19.0760
}
df['S_latitude'] = df['Source'].map(S_latitude)

Factors that Affect the Price

1. **Duration**: Price increases slightly with Duration, but most flights stay under 20000 price units and under 1700 minutes.

1. **Month of the Year**: This strongly affects the Price. In month 3 people are willing to pay more and take longer flights. This might be for these reasons:

    * **Seasonality and Holidays**: March falls within peak travel seasons or holidays in many regions. For example, spring break in various countries, festivals, or school holidays can lead to increased demand for flights, which in turn can drive up prices.

    * **Weather Conditions**: Weather can significantly impact travel preferences. March might offer more favorable weather conditions in certain destinations, making it a preferred time to travel. This could lead to higher demand and subsequently higher prices.

    * **Business Travel**: March may coincide with important business events, conferences, or trade shows, leading to increased business travel. Business travelers often have more flexible budgets or are reimbursed for travel expenses, which can contribute to higher ticket prices.

    * **Supply and Demand Dynamics**: Airlines adjust their pricing based on supply and demand dynamics. If demand for flights exceeds available seats (supply), airlines may raise prices. Higher prices can also incentivize airlines to operate longer flights or use larger aircraft to meet demand.

    * **Travel Preferences**: Personal or cultural preferences may also play a role. Some travelers may have specific reasons for choosing March, such as cultural events, family gatherings, or personal milestones, which justify longer and more expensive flights.

    * **Booking Patterns**: Booking patterns can influence pricing. If travelers book flights well in advance for March, airlines may adjust prices based on anticipated demand. Last-minute bookings or peak travel periods can also affect ticket prices.
    
1. **Day of the Month**: People are willing to pay more and travel longer during the early days of the month, while during the late days of the month they are still willing to travel longer but do not pay more.

    * **Early Days of the Month**
        * **Income Cycles**:

            * **Salary Payments**: Many people receive their salaries at the beginning of the month. With fresh funds available, they might be more willing to spend on higher-priced tickets and plan longer trips.
            * **Disposable Income**: At the start of the month, disposable income is typically higher, leading to a greater willingness to spend on travel.
        * **Planning and Scheduling**:

            * **Work and Personal Schedules**: People might plan trips at the beginning of the month to align with their work schedules, avoiding the rush towards month-end deadlines.
            * **Vacation Planning**: Early month trips can be strategically planned to maximize the use of vacation days and return in time for any end-of-month responsibilities.
        * **Promotions and Offers**:

            * **Travel Deals**: Airlines and travel agencies often release promotions and discounts at the beginning of the month, incentivizing early bookings at potentially higher prices for longer trips.
    * **Late Days of the Month**
        * **Budget Constraints**:

            * **Reduced Disposable Income**: As the month progresses, disposable income tends to decrease due to the cumulative effect of expenses. This can lead to more budget-conscious travel decisions, such as looking for cheaper fares while still being willing to travel longer distances.
            * **Saving for Necessities**: People may start saving towards the end of the month for upcoming essential expenses, making them less likely to pay premium prices for travel.
        * **Flexibility and Last-Minute Travel**:

            * **Last-Minute Plans**: Late-month travelers might be those making last-minute plans. While they are willing to travel longer distances, they might prioritize finding more affordable tickets due to remaining budget constraints.
            * **Extended Stays**: People who plan to travel towards the end of the month might do so with the intention of staying through the beginning of the next month, hence looking for cost-effective options for extended trips.
        * **Airline Pricing Strategies**:

            * **Dynamic Pricing**: Airlines often adjust prices based on demand and booking patterns. Towards the end of the month, if seats are still available, airlines might lower prices to fill up the remaining seats, making travel more affordable even for longer distances.

1. **Number of Stops**: A higher number of stops combined with a shorter Duration is reflected in a higher Price. **Non-stop** flights are concentrated in the bottom-left corner with the lowest Duration and Price, although they sometimes exceed the 20000 limit. **1-stop** travelers are the most willing to pay more to travel faster, while **2-stop** and 3-stop travelers prefer to spend more time rather than pay to travel faster.
    * **Convenience and Demand**
        * **Direct Flights**: Non-stop flights are the most convenient option for travelers as they offer the shortest travel time and eliminate the hassle of layovers. The higher demand for this convenience often drives up the price.
        * **Reduced Travel Fatigue**: Passengers prefer fewer stops to avoid the additional stress and fatigue associated with multiple layovers. This preference leads to higher demand and allows airlines to charge a premium.
    * **Airline Economics**
        * **Operational Costs**: Direct flights typically have lower operational costs per passenger mile compared to flights with multiple stops, which require more fuel for takeoffs and landings, additional airport fees, and handling costs.
        * **Aircraft Utilization**: Airlines aim to maximize aircraft utilization. Direct flights often fit better into an airline's scheduling and operational efficiency, allowing for higher ticket prices.
        * **Revenue Management**: Airlines use sophisticated revenue management systems to price tickets based on demand and supply. Non-stop flights, being more in demand, are priced higher to maximize revenue.
    * **Market Segmentation**
        * **Business Travelers**: Business travelers, who are less price-sensitive, often prefer non-stop flights for the time savings and convenience. Airlines target this segment with higher prices.
        * **Premium Services**: Non-stop and fewer-stop flights may offer better services and amenities, which appeal to premium customers willing to pay more for enhanced comfort and convenience.
    * **Competitive Dynamics**
        * **Route Competition**: Non-stop flights often have less competition on popular routes, allowing airlines to charge higher prices. Flights with multiple stops might compete on price to attract more budget-conscious travelers.
        * **Hub-and-Spoke Model**: Many airlines operate on a hub-and-spoke model where non-stop flights are more frequent and strategically priced. Connecting flights through hubs may be cheaper but involve longer travel times.
    * **Supply and Demand**
        * **Seat Availability**: Non-stop and fewer-stop flights might have fewer seats available due to higher demand, leading to higher prices. Longer travel time flights may have more seat availability, leading to lower prices.
        * **Dynamic Pricing**: Airlines adjust prices dynamically based on real-time demand. High-demand direct flights see prices rise quickly, while lower-demand longer flights may see more discounts.
    * **Customer Preferences**
        * **Time vs. Cost Trade-off**: Many travelers are willing to pay more for the convenience of shorter travel times, valuing their time over cost savings. This willingness to pay drives up prices for non-stop and fewer-stop flights.
        * **Booking Patterns**: Last-minute travelers often prefer direct flights to minimize travel time, leading to higher prices due to increased demand closer to the travel date.
1. **Company**: Jet Airways can cost more when Duration decreases but the number of stops increases, and Jet Airways Business is always high, over 40000.
1. **Meal**: Flights without an in-flight meal cluster around a price of 10000 and a Duration between 250 and 1800 minutes.
1. **Business Class**: Pay more and spend less time.
    1. Convenience and Time Sensitivity
        * **Time is Money**: Business travelers prioritize time efficiency because time saved can translate into productivity and revenue for their companies. As a result, they are willing to pay more for direct flights that minimize travel time.
        * **Tight Schedules**: Business travelers often have tight schedules with back-to-back meetings, conferences, or events. Direct flights and flights with fewer stops help them adhere to these schedules more effectively.
    2. Corporate Travel Policies
        * **Expense Accounts**: Many businesses provide travel expense accounts for their employees, allowing them to book more expensive, convenient flights without personal financial burden. This increases the demand for higher-priced, time-efficient travel options.
        * **Travel Management**: Companies often use corporate travel management services that prioritize efficiency and reliability over cost savings, leading to a preference for more expensive, direct flights.
    3. Airline Pricing Strategies
        * **Revenue Management**: Airlines use sophisticated revenue management systems to maximize profits. They know that business travelers are less price-sensitive and more time-sensitive, so they price direct flights higher to capture this segment of the market.
        * **Premium Services**: Airlines often provide additional services and amenities for business travelers, such as priority boarding, extra legroom, in-flight Wi-Fi, and dedicated business class sections. These added services justify higher ticket prices.
    4. Demand and Supply Dynamics
        * **Peak Demand**: Business travel demand tends to be concentrated during weekdays and peak business hours. Airlines adjust prices higher during these times to capitalize on the increased demand.
        * **Less Price Elasticity**: Business travelers typically book flights closer to the travel date and are less sensitive to price changes, giving airlines the opportunity to charge higher prices.
    5. Airport and Route Considerations
        * **Primary Airports**: Business travelers often fly to and from major airports located in business hubs. These airports typically have higher landing fees and operational costs, contributing to higher ticket prices.
        * **Frequent Flyer Programs**: Business travelers often participate in frequent flyer programs, encouraging them to choose specific airlines that offer direct and convenient routes, even at a higher cost.
    6. Market Segmentation
        * **Business Class and Premium Economy**: Airlines offer different classes of service tailored to business travelers, such as business class and premium economy. These classes come with higher prices due to the enhanced comfort, space, and services provided.
        * **Loyalty and Rewards Programs**: Business travelers are often part of loyalty and rewards programs, which can drive them to choose specific airlines and routes that offer direct flights and premium services, contributing to higher prices.
1. **Source and Destination**: Source and Destination can affect the price due to distance.
    * **Distance**: Longer distances generally require more fuel, leading to higher operational costs. This is often reflected in ticket prices.
    * **Airport Fees**: Different airports have varying landing fees, gate fees, and other charges. Airports in major cities or business hubs typically have higher fees, contributing to higher ticket prices.
    * **Demand**: High-demand routes (e.g., between major cities or popular tourist destinations) tend to have higher prices due to the increased willingness of travelers to pay.
    * **Local Economy**: The economic conditions of the source and destination regions can influence ticket prices. Higher-income areas might see higher prices as residents can afford to pay more.
    * **Business Travel**: Cities with a significant amount of business travel typically have higher ticket prices, especially during weekdays.
    * **Seasonal Demand**: Certain routes experience seasonal demand peaks (e.g., holiday seasons, summer vacations, winter getaways) that drive up prices.
    * **Local Events**: Major events (conferences, festivals, sports events) in the source or destination can increase demand and ticket prices.
    * **Frequency**: Routes with frequent flights may have lower prices due to the higher availability of seats. Conversely, routes with limited flights may have higher prices.
    * **Flight Availability**: Limited availability of flights on certain routes, especially non-stop options, can lead to higher prices.
    * **Operational Hubs**: Flights to and from an airline's operational hubs are often cheaper due to the efficiency of scale. Non-hub routes might be more expensive.
    * **Crew and Maintenance**: The costs associated with crew salaries, aircraft maintenance, and other operational factors can vary based on the source and destination.
    

# Pre-processing for Customer Segmentation

## Add travel Distance columns

The Route column contains the airport code of every stop, so we can look up their coordinates and calculate the travel distance in kilometers.

Note: we could instead target-encode the stops against Price, which results in 6 columns, divide each by its position in the route and sum them up, but the coordinates approach gives a better explanation of the variation.

First we split each route on ' → ' and store the resulting list back in the 'Route' column.

In [None]:
df['Route'] = df['Route'].str.split(' → ')

In [None]:
city_coordinates_dict = {
    'AMD': (23.0734, 72.6268), 'ATQ': (31.7055, 74.7973), 'BBI': (20.2445, 85.8178),
    'BDQ': (22.3368, 73.2263), 'BHO': (23.2871, 77.3378), 'BLR': (12.9724, 77.5806),
    'BOM': (19.0896, 72.8656), 'CCU': (22.6540, 88.4467), 'COK': (10.1520, 76.3922),
    'DED': (30.3165, 78.0322), 'DEL': (28.6139, 77.2090), 'GAU': (26.1060, 91.5852),
    'GOI': (15.3808, 73.8314), 'GWL': (26.2937, 78.1956), 'HBX': (15.3600, 75.0849),
    'HYD': (17.2315, 78.4294), 'IDR': (22.7210, 75.8682), 'IMF': (24.7597, 93.8967),
    # ISK (Nashik), JLR (Jabalpur) and NDC (Nanded) use their airport coordinates
    'ISK': (20.1191, 73.9129), 'IXA': (23.8860, 91.2404), 'IXB': (27.2676, 88.6065),
    'IXC': (30.6737, 76.7889), 'IXR': (23.3175, 85.3213), 'IXU': (19.8627, 75.3962),
    'IXZ': (11.6415, 92.7297), 'JAI': (26.9124, 75.7873), 'JDH': (26.2517, 73.0489),
    'JLR': (23.1778, 80.0520), 'KNU': (26.4043, 80.4108), 'LKO': (26.8467, 80.9462),
    'MAA': (13.0827, 80.2707), 'NAG': (21.0914, 79.0479), 'NDC': (19.1833, 77.3167),
    'PAT': (25.5941, 85.1356), 'PNQ': (18.5822, 73.9196), 'RPR': (21.1809, 81.7383),
    'STV': (21.1140, 72.7411), 'TRV': (8.4821, 76.9204), 'UDR': (24.6173, 73.8963),
    'VGA': (16.5302, 80.7960), 'VNS': (25.3176, 82.9739), 'VTZ': (17.7215, 83.2991)
}

In [None]:
from math import radians, sin, cos, sqrt, atan2

def haversine_distance_between_cities(city_list,city_coordinates_dict=city_coordinates_dict):
    """
    Calculate the total travel distance in kilometers between a list of cities
    specified by their names, using a dictionary of city coordinates.

    city_coordinates_dict: Dictionary mapping city names to (latitude, longitude) coordinates.
                           Example: {'DEL': (28.6139, 77.2090), 'BLR': (12.9716, 77.5946), ...}

    city_list: List of city codes in the order of travel.
               Example: ['DEL', 'BLR', ...]

    Returns the total travel distance in kilometers.
    """
    total_distance = 0.0

    # Iterate through consecutive pairs of cities in the city_list
    for i in range(len(city_list) - 1):
        city1 = city_list[i]
        city2 = city_list[i + 1]

        # Retrieve coordinates from the city_coordinates_dict
        lat1, lon1 = city_coordinates_dict[city1]
        lat2, lon2 = city_coordinates_dict[city2]

        # Convert latitude and longitude from degrees to radians
        lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

        # Haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * atan2(sqrt(a), sqrt(1-a))

        # Radius of the Earth in kilometers
        R = 6371.0

        # Calculate the distance between consecutive cities
        distance = R * c

        # Add to the total distance
        total_distance += distance

    return total_distance

In [None]:
df['Travel_distance'] = df['Route'].apply(haversine_distance_between_cities)
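
As a sanity check on the formula, the sketch below computes a single leg on its own. For two points with latitudes φ₁, φ₂ and longitude difference Δλ, the haversine distance is d = 2R·atan2(√a, √(1−a)) with a = sin²(Δφ/2) + cos φ₁ · cos φ₂ · sin²(Δλ/2). The helper name `leg_km` is hypothetical; the coordinates are the DEL and BOM entries from the dictionary above.

```python
from math import radians, sin, cos, sqrt, atan2, pi

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as in the notebook

def leg_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points given in degrees
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * atan2(sqrt(a), sqrt(1 - a))

del_bom = leg_km(28.6139, 77.2090, 19.0896, 72.8656)   # DEL -> BOM
quarter_turn = leg_km(0.0, 0.0, 0.0, 90.0)             # a quarter of the equator
print(round(del_bom), round(quarter_turn))
```

The DEL to BOM leg comes out at roughly 1,150 km, and the quarter-equator case equals πR/2, which is a quick way to confirm that the degrees-to-radians conversion is in place.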

Encode Categorical data with one-hot-encoding

In [None]:
# Group the rare airline categories into 'Other' because their frequency is so low that they can be neglected
rare_airlines = ['Multiple carriers Premium economy', 'Jet Airways Business',
                 'Vistara Premium economy', 'Trujet']
df['Airline'] = df['Airline'].replace(rare_airlines, 'Other')

In [None]:
df['Airline'].value_counts()

In [None]:
Airline=pd.get_dummies(df['Airline'], drop_first=True).astype(int)
Airline.head()

In [None]:
#Dealing with Additional_Info columns
df['Additional_Info'].value_counts()

In [None]:
df['Additional_Info'] = df['Additional_Info'].replace('No Info', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('1 Short layover', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('Change airports', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('1 Long layover', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('Business class', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('Red-eye flight', 'No info')
df['Additional_Info'] = df['Additional_Info'].replace('2 Long layover', 'No info')

In [None]:
Additional_Info=pd.get_dummies(df['Additional_Info'], drop_first=True).astype(int)
Additional_Info.head()

In [None]:
df['Total_Stops'] = df['Total_Stops'].map({'non-stop':0, '2 stops':2, '1 stop':1, '3 stops':3, '4 stops':4})

In [None]:
df.columns

In [None]:
df.drop(['Airline','Source','Destination','Route','Additional_Info','Date_of_Journey_timeStamp','Day_of_Journey','Month_of_Journey'],axis=1,inplace=True)

In [None]:
pd.set_option('display.max_columns', None)
endocded_data = pd.concat([df,Additional_Info,Airline], axis=1)
endocded_data

### Remove outliers

In [None]:
def plot(df,col):
    # this function plots a histogram and a box plot
    # takes two arguments df: dataframe and col: column name
    fig, (ax1,ax2) = plt.subplots(2,1)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[col], kde=True, ax=ax1)
    sns.boxplot(x=df[col],ax=ax2)

In [None]:
plot(endocded_data,'Price')

I will use IQR method to remove outliers

In [None]:
q1 = endocded_data['Price'].quantile(0.25)
q3 = endocded_data['Price'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

#outliers
outliers = endocded_data[(endocded_data['Price'] < lower_bound) | (endocded_data['Price'] > upper_bound)]
outliers.Price.sort_values()

endocded_data = endocded_data[(endocded_data['Price'] >= lower_bound) & (endocded_data['Price'] <= upper_bound)]

In [None]:
print(f'Numbers of outliers is {len(outliers)}')
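
The IQR rule used above can be checked by hand on a tiny sample. The sketch below is dependency-free and uses made-up prices (only the minimum 1759 and maximum 79512 are taken from the summary above). `statistics.quantiles` with `method='inclusive'` is used on the assumption that it matches the linear interpolation pandas applies by default; the helper name `iqr_bounds` is hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    # Q1 and Q3 with linear interpolation between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

prices = [1759, 5000, 8000, 9000, 12000, 79512]  # illustrative values only
lower, upper = iqr_bounds(prices)
outliers = [p for p in prices if p < lower or p > upper]
kept = [p for p in prices if lower <= p <= upper]
print(lower, upper, outliers)
```

Here Q1 = 5750 and Q3 = 11250, so the bounds are -2500 and 19500 and only the 79512 fare is flagged.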

In [None]:
#check the result
plot(endocded_data,'Price')

In [None]:
# Check the correlation using heatmap
correlation_matrix = endocded_data.corr()
plt.figure(figsize=(23, 15))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.show()

In [None]:
# It's a good idea to take the log of the price
endocded_data['Log_Price'] = np.log(endocded_data.Price)

We can see that Duration, total stops, source and destination coordinates, and travel distance correlate with the price

Split data to X and Y then to train and test sets

In [None]:
X=endocded_data.drop(['Price','Log_Price'], axis=1)
y=endocded_data['Log_Price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#apply standard scaling to put all features on the same scale, which gives better results for the ANN and Linear Regression
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
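
What the scaler computes can be sketched in a few lines: each feature is shifted by the training mean and divided by the training standard deviation, z = (x − μ_train) / σ_train, and the same μ and σ are reused for the test set. The numbers below are made up and the helper names are hypothetical; `pstdev` (population standard deviation) is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    # Learn the mean and population standard deviation from the training data only
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    # Apply z = (x - mean) / std with the statistics learned from training
    return [(v - mean) / std for v in values]

train_duration = [120.0, 180.0, 240.0]   # illustrative durations in minutes
test_duration = [300.0]

mean, std = fit_scaler(train_duration)
train_scaled = transform(train_duration, mean, std)
test_scaled = transform(test_duration, mean, std)
print(train_scaled, test_scaled)
```

Fitting on the training split only, as the notebook does, keeps information about the test set out of the preprocessing.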

In [None]:
def predict(ml_model,X_train=X_train,X_test=X_test):
    # this function trains the model, prints the metrics and plots the results
    # note: the target (y_train/y_test) is Log_Price, so the errors are in log units
    model = ml_model.fit(X_train,y_train)
    print('Training score : {}'.format(model.score(X_train,y_train)))
    y_prediction = model.predict(X_test)
    print('Predictions are : {}'.format(y_prediction))
    print('\n')
    r2 = metrics.r2_score(y_test, y_prediction)
    print('r2_score : {}'.format(r2))
    print('MAE : {}'.format(metrics.mean_absolute_error(y_test, y_prediction)))
    print('MSE : {}'.format(metrics.mean_squared_error(y_test, y_prediction)))
    print('RMSE : {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_prediction))))

    fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(y_test-y_prediction, kde=True, ax=ax1)
    ax1.set_title('Distribution of Prediction Errors')
    # x axis holds the actual values, y axis the predictions
    ax2.scatter(y_test, y_prediction, color = 'blue')
    # red reference line y = x (perfect predictions)
    ax2.plot(y_test, y_test, color = 'red')
    ax2.set_xlabel('Actual')
    ax2.set_ylabel('Predicted')
    ax2.set_title('Actual vs Predicted')
    plt.show()
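
Because the target is `Log_Price`, the MAE, MSE and RMSE printed by `predict` are in log units. A minimal sketch of reporting the error in price units is to exponentiate both the actual and the predicted values first. The numbers are made up and the helper name `mae_in_price_units` is hypothetical.

```python
from math import exp, log

def mae_in_price_units(y_true_log, y_pred_log):
    # Undo the natural log on both sides, then average the absolute differences
    errors = [abs(exp(t) - exp(p)) for t, p in zip(y_true_log, y_pred_log)]
    return sum(errors) / len(errors)

actual_log = [log(5000), log(10000)]      # actual prices 5000 and 10000
predicted_log = [log(5500), log(9000)]    # predicted prices 5500 and 9000
mae_price = mae_in_price_units(actual_log, predicted_log)
print(round(mae_price, 2))
```

With the notebook's arrays the same idea would be `metrics.mean_absolute_error(np.exp(y_test), np.exp(y_prediction))`.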

### Try Random Forest Model

Note: Random Forest does not need scaled data, because tree splits are unaffected by rescaling a feature. We pass the scaled data anyway, which is harmless.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
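# Number of features to consider at every split
# ('auto' was removed for regressors in scikit-learn 1.3; 1.0 means all features, which is what 'auto' meant)
max_features = [1.0, 'sqrt']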
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

# Create the random grid

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# Random search of parameters, using 5 fold cross validation,
# search across 10 different combinations (n_iter = 10)
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(), param_distributions = random_grid,scoring='neg_mean_squared_error', n_iter = 10, cv = 5, verbose=2, random_state=42, n_jobs = 1)
rf_random.fit(X_train_scaled,y_train)
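
To put `n_iter = 10` in context, the sketch below counts how many combinations the grid above contains. The lists are re-created with plain Python so the block runs on its own; the only assumption is that they mirror the values produced by the `np.linspace` calls above.

```python
from math import prod

random_grid_sizes = {
    'n_estimators': len(range(100, 1201, 100)),   # 100, 200, ..., 1200
    'max_features': 2,
    'max_depth': len(range(5, 31, 5)),            # 5, 10, ..., 30
    'min_samples_split': len([2, 5, 10, 15, 100]),
    'min_samples_leaf': len([1, 2, 5, 10]),
}

total_combinations = prod(random_grid_sizes.values())
n_iter = 10
share_sampled = n_iter / total_combinations
print(total_combinations, f'{share_sampled:.2%}')
```

The search therefore tries 10 of 2,880 candidate settings; raising `n_iter` trades training time for coverage of the grid.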

In [None]:
rf_random.best_params_

In [None]:
predict(RandomForestRegressor(**rf_random.best_params_,random_state=42),X_test=X_test_scaled,X_train=X_train_scaled)

### Try Linear Regression Model

In [None]:
predict(LinearRegression(),X_test=X_test_scaled,X_train=X_train_scaled)

### Try Neural Network

In [None]:
X_train.shape

We have 20 features, so we give the first layer 20 units
our ac

In [None]:

ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=20, activation='sigmoid'))

ann.add(tf.keras.layers.Dense(units=80, activation='tanh'))

ann.add(tf.keras.layers.Dense(units=160, activation='sigmoid'))

ann.add(tf.keras.layers.Dense(units=80, activation='tanh'))



ann.add(tf.keras.layers.Dense(units=1))


early_stop = EarlyStopping(monitor='val_loss',mode='min',verbose=1, patience=25)
ann.compile(optimizer = 'adam', loss = 'mean_squared_error')
ann.fit(X_train_scaled, y_train, batch_size = 32, epochs = 5000,validation_data=(X_test_scaled, y_test),callbacks=[early_stop])



In [None]:

losses = pd.DataFrame(ann.history.history)[100:]
losses.plot()

In [None]:
y_pred = ann.predict(X_test_scaled)

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(y_test-y_pred.reshape(-1), kde=True, ax=ax1)
ax1.set_title('Distribution of Prediction Errors')
# x axis holds the actual values, y axis the predictions
ax2.scatter(y_test, y_pred.reshape(-1), color = 'blue')
# red reference line y = x (perfect predictions)
ax2.plot(y_test, y_test, color = 'red')
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('Actual vs Predicted')
plt.show()

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R2): {r2:.2f}')

Conclusion: Random Forest performs best, with a score of about 95%.