<a href="https://colab.research.google.com/github/abhinavnautiyalDS/flight_price_prediction/blob/main/flight_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score



## **DATA PREPROCESSING**

In [None]:
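df=pd.read_csv("/content/final flight dataset.csv")
# Draw two independent random samples of 1,000 rows each and stack them;
# the samples are drawn separately, so the same row can appear twice
df1=df.sample(1000)
df2=df.sample(1000)

df = pd.concat([df1, df2], axis=0)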



In [None]:
df.shape

**The columns ARRIVAL_TIME, Duration, DATE_OF_JOURNEY_2, and Dep_Time can be dropped because they are redundant: Dep_Time is already represented in DEPARTURE_TIME, Duration is covered by DURATIONMIN, and the remaining date and arrival information overlaps with DEPARTURE_TIME.**

In [None]:


df.drop(['ARRIVAL_TIME','Duration','DATE_OF_JOURNEY_2','Dep_Time'],inplace=True,axis=1)


**Now, I will extract the day of travel, the month of travel, and the day of the week from DEPARTURE_TIME.**

In [None]:

# Convert DEPARTURE_TIME to datetime
df['DEPARTURE_TIME'] = pd.to_datetime(df['DEPARTURE_TIME'])

# Extract day and month
df['Day'] = df['DEPARTURE_TIME'].dt.day
df['Month'] = df['DEPARTURE_TIME'].dt.month
df['DayOfweek']=df['DEPARTURE_TIME'].dt.day_name()

In [None]:
df

**Now, I will create a new feature called DayTime by categorizing the departure hour into time slots: Morning, Afternoon, Evening, Night, and Midnight.**

In [None]:
# Create the DayTime feature: Morning, Afternoon, Evening, Night, Midnight
# Extract hour
df['Hour'] = df['DEPARTURE_TIME'].dt.hour
# Define function to categorize time of day
def get_daytime(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    elif 21 <= hour <= 23:
        return 'Night'
    else:  # 0 <= hour < 5
        return 'Midnight'

# Apply function to create DayTime column
df['DayTime'] = df['Hour'].apply(get_daytime)
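
The same hour-to-slot mapping can also be expressed without a Python-level function. The sketch below is one possible vectorized alternative using `pd.cut`; the `hours` series is made-up sample data, and the bin edges are chosen so that they reproduce the boundaries of `get_daytime` above.

```python
import pandas as pd

# Hypothetical sample of departure hours (0-23), for illustration only
hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])

# Bins are right-inclusive: (-1, 4], (4, 11], (11, 16], (16, 20], (20, 23]
daytime = pd.cut(
    hours,
    bins=[-1, 4, 11, 16, 20, 23],
    labels=['Midnight', 'Morning', 'Afternoon', 'Evening', 'Night'],
)

daytime_list = daytime.astype(str).tolist()
print(daytime_list)
```

For a few thousand rows the `apply` version is fast enough; `pd.cut` mainly helps on larger frames and keeps the slot boundaries in one place.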


In [None]:
df

In [None]:
#now remove DEPARTURE_TIME
df.drop('DEPARTURE_TIME',inplace=True,axis=1)

In [None]:
df.drop('Hour',inplace=True,axis=1)

**Converting DURATIONMIN to hours**

In [None]:
# Convert DURATIONMIN from minutes to hours
df['DURATIONHour']=df['DURATIONMIN']/60

In [None]:
# Casting to int truncates the fractional part (e.g. 1.5 hours becomes 1)
df['DURATIONHour']=df['DURATIONHour'].astype('int')

In [None]:
df.drop('DURATIONMIN',inplace=True,axis=1)
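
Because the conversion ends with `astype('int')`, every duration is rounded down to a whole hour. The small sketch below, on made-up durations, shows how much information that discards compared with rounding to the nearest hour; whether the loss matters for this dataset is not something the notebook measures.

```python
import pandas as pd

# Hypothetical flight durations in minutes, for illustration only
durations_min = pd.Series([45, 90, 119, 120, 175])

hours_float = durations_min / 60

# Same operation as the notebook: the cast drops the fractional part
truncated = hours_float.astype('int')

# Alternative: round to the nearest whole hour before casting
rounded = hours_float.round().astype('int')

print(pd.DataFrame({'minutes': durations_min,
                    'truncated': truncated,
                    'rounded': rounded}))
```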

### **Data Encoding**


For this, I have used one-hot encoding.

In [None]:
# Data preprocessing: one-hot encode the categorical columns

cat_cols = ['Airline', 'Source', 'Destination','Total_Stops','DayOfweek','DayTime']

# One-hot encode in binary format (0/1); drop_first=True avoids the dummy variable trap.
# The final astype('int') casts every column, so all remaining columns must be numeric.
df_label_encoded= pd.get_dummies(df, columns=cat_cols, drop_first=True).astype('int')
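
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The airline names and prices are invented for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the flight dataset
toy = pd.DataFrame({
    'Airline': ['IndiGo', 'Air India', 'IndiGo', 'SpiceJet'],
    'Price': [3500, 5200, 4100, 3900],
})

# Categories are sorted alphabetically, so 'Air India' is the dropped baseline
encoded = pd.get_dummies(toy, columns=['Airline'], drop_first=True).astype('int')

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every `Airline_*` column therefore represents the baseline category, which is why one column per feature can be dropped without losing information.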



In [None]:
df_label_encoded

In [None]:

# Reorder columns to move 'Price' to the last position
cols = [col for col in df_label_encoded.columns if col != 'Price'] + ['Price']
df_label_encoded = df_label_encoded[cols]

# Display result
df_label_encoded


### **Data Scaling**

For this, I have used min-max scaling.

In [None]:
#scaling

from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
scaler = MinMaxScaler()

# List the numeric columns to scale (add other numerical column names if needed).
# Note: Price (the target) is scaled too, so the error metrics below are in scaled units.
num_cols = ['DURATIONHour', 'Price','Month','Day']

# Apply min-max scaling and replace in DataFrame
df_label_encoded[num_cols] = scaler.fit_transform(df_label_encoded[num_cols])

# Display the scaled DataFrame
df_label_encoded.head()
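
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The cell above fits the scaler on the full dataset before splitting, which lets test-set minima and maxima influence the training data. A common alternative, sketched below on invented data, is to fit the scaler on the training rows only and reuse it for the test rows; `inverse_transform` then recovers the original units. The column names are borrowed from the notebook, but the values are random.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features, generated for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'DURATIONHour': rng.integers(1, 30, size=50),
    'Day': rng.integers(1, 31, size=50),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)   # learn min/max from training rows only
test_scaled = toy_scaler.transform(test)         # reuse the training min/max

# inverse_transform maps scaled values back to the original units
restored = toy_scaler.inverse_transform(train_scaled)
print(np.allclose(restored, train.to_numpy()))
```

Test values scaled this way can fall slightly outside [0, 1] when the test set contains a value beyond the training range; that is expected.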

### **Train-Test Split**

In [None]:
# Features and target
X = df_label_encoded.drop('Price', axis=1)
y = df_label_encoded['Price']

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def Evaluation_Metrics_Regression(model_object, X_test, Y_test):
    """
    Evaluate a regression model with key metrics.
    """
    Y_pred = model_object.predict(X_test)

    metrics = {
        'R2 Score': r2_score(Y_test, Y_pred),
        'MAE': mean_absolute_error(Y_test, Y_pred),
        'MSE': mean_squared_error(Y_test, Y_pred),
        'RMSE': mean_squared_error(Y_test, Y_pred)**(0.5)
    }

    return pd.DataFrame.from_dict(metrics, orient='index', columns=['Score'])
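
As a sanity check on what these four metrics measure, the sketch below computes them by hand with NumPy on a tiny made-up example and compares the results with scikit-learn. The arrays `y_true` and `y_pred` are illustrative values, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))            # average absolute error
mse = np.mean(errors ** 2)               # average squared error
rmse = np.sqrt(mse)                      # same units as the target
ss_res = np.sum(errors ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                 # share of variance explained

print(mae, mse, rmse, r2)
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```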


### **Feature Selection**

In [None]:
fig,ax=plt.subplots(figsize=(25,15))
sns.heatmap(df_label_encoded.corr(),annot=True)

1. **Lasso Regression**

In [None]:

# Using Lasso regression


from sklearn.linear_model import Lasso

# Separate features and target
X = df_label_encoded.drop('Price', axis=1)
y = df_label_encoded['Price']



# Try multiple alpha values
alphas = [0.0001, 0.001, 0.01,0.1]

# Dictionary to store coefficients
coef_dict = {}

for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    coef_dict[f'alpha_{alpha}'] = lasso.coef_

# Create DataFrame with features as index and alphas as columns
coef_variation_df = pd.DataFrame(coef_dict, index=X.columns)

# Optionally, round for neatness
coef_variation_df = coef_variation_df.round(4)

# Show how coefficients vary with alpha
coef_variation_df
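
The idea behind using Lasso for feature selection is that the L1 penalty pushes the coefficients of weak features to exactly zero as `alpha` grows. The sketch below illustrates this on synthetic data in which, by construction, only the first feature drives the target; the data and the alpha values are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

coefs = {}
for alpha in [0.001, 0.1, 1.0]:
    coefs[alpha] = Lasso(alpha=alpha).fit(X_toy, y_toy).coef_
    print(f"alpha={alpha}: {np.round(coefs[alpha], 4)}")
```

Features whose coefficients are zero across the larger alpha values are the natural candidates for removal.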


In [None]:
nonimportant_features1=set(['Airline_GoAir','Airline_Multiple carriers Premium economy','Airline_Vistara Premium economy','Total_Stops_3 stops'])

2. **Wrapper Method**

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize model
model = LinearRegression()

# Sequential Feature Selector (Forward selection)
sfs = SequentialFeatureSelector(estimator=model,
                                n_features_to_select=5,
                                direction='forward',
                                scoring='r2',
                                cv=5)

sfs.fit(X_train, Y_train)

# Get selected feature names
selected_features = X_train.columns[sfs.get_support()]

nonimportant_features=set([i for i in X_train.columns if i not in selected_features])
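
Forward selection starts with no features and repeatedly adds the one that most improves the cross-validated score until `n_features_to_select` is reached. The sketch below runs the same selector on synthetic data from `make_regression`; the dataset and the choice of two features are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data with 6 features, of which 2 are informative
X_toy, y_toy = make_regression(n_samples=200, n_features=6, n_informative=2,
                               noise=1.0, random_state=0)

sfs_toy = SequentialFeatureSelector(estimator=LinearRegression(),
                                    n_features_to_select=2,
                                    direction='forward',
                                    scoring='r2',
                                    cv=5)
sfs_toy.fit(X_toy, y_toy)

# Boolean mask with one entry per input feature; True marks a selected feature
mask = sfs_toy.get_support()
print(mask)
```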


In [None]:
nonimportant_features1.intersection(nonimportant_features)

In [None]:
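# Drop the columns flagged by both methods, then re-split so the models below use the reduced feature set
df_label_encoded.drop(columns=list(nonimportant_features1.intersection(nonimportant_features)), inplace=True)
X, y = df_label_encoded.drop('Price', axis=1), df_label_encoded['Price']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)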

In [None]:
df_label_encoded

### **Evaluation Metrics and Hyperparameter Tuning**

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def Evaluation_Metrics_Regression(model_object, X_test, Y_test):
    """
    Evaluate a regression model with key metrics.
    """
    Y_pred = model_object.predict(X_test)

    metrics = {
        'R2 Score': r2_score(Y_test, Y_pred),
        'MAE': mean_absolute_error(Y_test, Y_pred),
        'MSE': mean_squared_error(Y_test, Y_pred),
        'RMSE': mean_squared_error(Y_test, Y_pred)**(0.5)
    }
    sns.scatterplot(x=Y_test,y=Y_pred,color='red')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.show()

    return pd.DataFrame.from_dict(metrics, orient='index', columns=['Score'])


In [None]:


def HyperparameterTuning(X, y, model_name, search_type='grid', n_iter_random=10):
    # Select the model and hyperparameter grid from the type of estimator passed in
    if isinstance(model_name, LinearRegression):
        model = LinearRegression()
        param_grid = {
            'fit_intercept': [True, False],
        }

    elif isinstance(model_name, DecisionTreeRegressor):
        model = DecisionTreeRegressor(random_state=42)
        param_grid = {
            'max_depth': [3, 5, 10, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

    elif isinstance(model_name, RandomForestRegressor):
        model = RandomForestRegressor(random_state=42)
        param_grid = {
            'n_estimators': [50, 100, 150],
            'max_depth': [5, 10, 20, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

    else:
        raise ValueError("Invalid model. Pass a LinearRegression, DecisionTreeRegressor, or RandomForestRegressor instance.")

    # Hyperparameter tuning
    if search_type == 'grid':
        search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='r2', n_jobs=-1)
    elif search_type == 'random':
        search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=n_iter_random, cv=5, scoring='r2', random_state=42, n_jobs=-1)
    else:
        raise ValueError("Invalid search_type. Choose 'grid' or 'random'.")

    # Fit and get best model/params
    search.fit(X, y)
    best_model = search.best_estimator_
    best_params = search.best_params_

    return best_model, best_params
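
Grid search tries every combination in the parameter grid and cross-validates each one, so its cost grows multiplicatively with the number of values per parameter. The sketch below counts the combinations for the decision tree and random forest grids defined above using `ParameterGrid`; the fit totals assume `cv=5` as in the function and ignore the final refit on the full training data.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in HyperparameterTuning above
dt_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
rf_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

cv_folds = 5
n_dt = len(ParameterGrid(dt_grid))   # 4 * 3 * 3 combinations
n_rf = len(ParameterGrid(rf_grid))   # 3 * 4 * 3 * 3 combinations

dt_fits = n_dt * cv_folds
rf_fits = n_rf * cv_folds
print(f"Decision tree: {n_dt} combinations, {dt_fits} fits")
print(f"Random forest: {n_rf} combinations, {rf_fits} fits")
```

With `search_type='random'` and `n_iter_random=10`, only 10 sampled combinations are evaluated (50 fits at `cv=5`), which is the usual reason to prefer randomized search on larger grids.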


### **LINEAR REGRESSION**

In [None]:
LR=LinearRegression()
LR.fit(X_train,Y_train)

In [None]:
print(Evaluation_Metrics_Regression(LR, X_train, Y_train))

In [None]:
print(Evaluation_Metrics_Regression(LR, X_test, Y_test))

**Hyperparameter Tuning**

In [None]:
HyperparameterTuning(X_train, Y_train, LR, search_type='grid', n_iter_random=10)

**Retrain my model on the best parameters**

In [None]:
LR1=LinearRegression(fit_intercept=False)
LR1.fit(X_train,Y_train)

In [None]:
print(Evaluation_Metrics_Regression(LR1, X_train, Y_train))


### **DecisionTreeRegressor**

In [None]:
DT=DecisionTreeRegressor()
DT.fit(X_train,Y_train)

In [None]:
print(Evaluation_Metrics_Regression(DT, X_train, Y_train))

In [None]:
print(Evaluation_Metrics_Regression(DT, X_test, Y_test))

**Hyperparameter Tuning**

In [None]:
HyperparameterTuning(X_train, Y_train, DT, search_type='grid', n_iter_random=10)

**Retrain my model on the best parameters**

In [None]:
DT1=DecisionTreeRegressor(max_depth= None, min_samples_leaf= 1, min_samples_split= 2)
DT1.fit(X_train,Y_train)

In [None]:
print("After Tuning",Evaluation_Metrics_Regression(DT1, X_test, Y_test))

In [None]:
print("Before Tuning",Evaluation_Metrics_Regression(DT, X_test, Y_test))

### **RandomForestRegressor**

In [None]:
RF=RandomForestRegressor()
RF.fit(X_train,Y_train)

In [None]:
print(Evaluation_Metrics_Regression(RF, X_train, Y_train))

In [None]:
print(Evaluation_Metrics_Regression(RF, X_test, Y_test))

**Hyperparameter Tuning**

In [None]:
HyperparameterTuning(X_train, Y_train, RF, search_type='grid', n_iter_random=10)

**Retrain my model after hyperparameter tuning**

In [None]:
RF1=RandomForestRegressor(n_estimators=150,max_depth=None,
  min_samples_leaf=1,
  min_samples_split= 2,
  )
RF1.fit(X_train,Y_train)

In [None]:
print("After tuning",Evaluation_Metrics_Regression(RF1, X_test, Y_test))

In [None]:
print("Before tuning",Evaluation_Metrics_Regression(RF, X_test, Y_test))

Of the three models I trained, Linear Regression had the lowest R² score. The Decision Tree achieved the highest score, but it is overfitting: there is a significant gap between its training and test performance. Random Forest is therefore the best-performing and most balanced model, and I will select it for final deployment.
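
The overfitting argument rests on comparing training and test scores for each model. The sketch below shows one way to tabulate that gap; it uses synthetic data from `make_regression` rather than the flight dataset, so the numbers it prints say nothing about the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression problem, for illustration only
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

candidates = [
    ('DecisionTree', DecisionTreeRegressor(random_state=42)),
    ('RandomForest', RandomForestRegressor(n_estimators=50, random_state=42)),
]

gaps = {}
for name, est in candidates:
    est.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, est.predict(X_tr))
    test_r2 = r2_score(y_te, est.predict(X_te))
    gaps[name] = (train_r2, test_r2, train_r2 - test_r2)
    print(f"{name}: train R2={train_r2:.3f}, test R2={test_r2:.3f}, gap={train_r2 - test_r2:.3f}")
```

An unpruned decision tree typically memorizes the training set (training R² near 1), so a large gap to the test score is the signal to look for.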

### **Saving pickle file**

In [None]:
import pickle
with open('FarePrediction.pkl', 'wb') as file:
    pickle.dump(RF1, file)
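
A saved model is only useful if it can be loaded and produces the same predictions. The sketch below demonstrates the round trip with a small random forest trained on synthetic data and written to a temporary file; the file name `toy_model.pkl` and the data are placeholders, not the notebook's `FarePrediction.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data
X_toy, y_toy = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_model.pkl'   # hypothetical file name
    with open(path, 'wb') as file:
        pickle.dump(model, file)
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

# The loaded model should reproduce the original predictions
preds_before = model.predict(X_toy[:5])
preds_after = loaded.predict(X_toy[:5])
print(np.allclose(preds_before, preds_after))
```

At prediction time the input must have the same columns, in the same order and with the same scaling, as the training data, so it is worth saving the fitted scaler and the column list alongside the model. Pickle files should only be loaded from trusted sources.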


In [None]:
df_label_encoded.columns