## Problem Statement
Flight ticket prices are hard to guess: the price we see today for a flight may be different when we check the same flight tomorrow. Travelers often say that flight ticket prices are unpredictable. We will therefore try to use machine learning to predict fares, which can also help airlines decide what prices they can maintain.

## Introduction
We are given a dataset that contains details of airline flights together with the fare charged for each flight.

## 1. 📥 Data Accessing 

1. Import the pandas library.
2. Since the data is given in Excel format, use the pandas `read_excel` function to load the Excel file into a DataFrame.

In [769]:
import pandas as pd
import numpy as np
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings("ignore")  # Ignore all warnings

In [770]:
flight = pd.read_excel("Flight_Fare.xlsx")

In [771]:
flight

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648


#### Basic Data Information 

There are __10683 rows__ and __11 columns__

##### Column Description 

__1. Airline :-__ Name of the airline, such as IndiGo, Air India, etc.<br>
__2. Date_of_Journey :-__ The date on which the passenger starts the journey, given in dd/mm/yyyy format.<br>
__3. Source :-__ Name of the place where the passenger’s journey will start.<br>
__4. Destination :-__ Name of the place to which the passenger wants to travel.<br>
__5. Route :-__ The route the passenger has opted to travel from the source to the destination.<br>
__6. Dep_Time :-__ The time at which the flight takes off and the passengers start their journey.<br>
__7. Arrival_Time :-__ The time at which the passenger reaches the destination.<br>
__8. Duration :-__ The total time the flight takes to complete its journey from source to destination.<br>
__9. Total_Stops :-__ The number of places at which the flight stops during the whole journey.<br>
__10. Additional_Info :-__ Information about food, kind of food, and other amenities.<br>
__11. Price :-__ Price of the flight for the complete journey, including all expenses before boarding.<br>

## 2. 🧹 Data Cleaning

First we will analyze the data and sort every issue we find into two categories: __Quality Issues__ and __Tidiness Issues__.<br>
Quality issues include duplicate data, missing data, corrupt data, and inaccurate data.<br>
Tidiness issues concern the structural part of the data.
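The quality checks described above can be sketched as one small helper. This is only an illustration on hypothetical toy data; the function name `quality_report` and the toy rows are assumptions, not part of the original notebook.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Collect a few basic quality indicators for a DataFrame."""
    return {
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of missing values in each column
        "missing_per_column": df.isnull().sum().to_dict(),
        # Columns that are not numeric and may need parsing or encoding
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }

# Hypothetical toy data, not taken from the real file
toy = pd.DataFrame({
    "Airline": ["IndiGo", "IndiGo", "Air India"],
    "Total_Stops": ["non-stop", "non-stop", None],
    "Price": [3897, 3897, 7480],
})
report = quality_report(toy)
print(report)
```

Running the same helper on `flight` should give a quick overview before each issue is inspected in detail.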

#### Data Analysis

In [777]:
flight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


❌ Here we can see that __Date_of_Journey__ has the data type object, which is wrong for a date column. This is a __validity issue__.

In [779]:
flight["Price"].describe()

count    10683.000000
mean      9087.064121
std       4611.359167
min       1759.000000
25%       5277.000000
50%       8372.000000
75%      12373.000000
max      79512.000000
Name: Price, dtype: float64

In [780]:
flight[flight["Price"] == flight["Price"].max()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
2924,Jet Airways Business,01/03/2019,Banglore,New Delhi,BLR → BOM → DEL,05:45,11:25,5h 40m,1 stop,Business class,79512


In [781]:
flight.duplicated().sum()

220

In [782]:
flight[flight.duplicated()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
683,Jet Airways,1/06/2019,Delhi,Cochin,DEL → NAG → BOM → COK,14:35,04:25 02 Jun,13h 50m,2 stops,No info,13376
1061,Air India,21/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,22:00,19:15 22 May,21h 15m,2 stops,No info,10231
1348,Air India,18/05/2019,Delhi,Cochin,DEL → HYD → BOM → COK,17:15,19:15 19 May,26h,2 stops,No info,12392
1418,Jet Airways,6/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,05:30,04:25 07 Jun,22h 55m,2 stops,In-flight meal not included,10368
1674,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,18:25,21:20,2h 55m,non-stop,No info,7303
...,...,...,...,...,...,...,...,...,...,...,...
10594,Jet Airways,27/06/2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,12:35 28 Jun,13h 30m,2 stops,No info,12819
10616,Jet Airways,1/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,09:40,12:35 02 Jun,26h 55m,2 stops,No info,13014
10634,Jet Airways,6/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,09:40,12:35 07 Jun,26h 55m,2 stops,In-flight meal not included,11733
10672,Jet Airways,27/06/2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,19:00 28 Jun,19h 55m,2 stops,In-flight meal not included,11150


❌ Here we can see that __220 rows__ are duplicated, which is a __validity issue__.
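By default `duplicated()` marks only the later copies of a repeated row. To see every copy side by side, `keep=False` can be used. The sketch below shows the difference on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Vistara"],
    "Price": [7303, 10231, 7303, 12648],
})

# Default keep='first': only the second occurrence is flagged
later_copies = toy[toy.duplicated()]

# keep=False: every member of a duplicate group is flagged
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```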

In [784]:
flight.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [785]:
flight[flight.isnull().any(axis=1)]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
9039,Air India,6/05/2019,Delhi,Cochin,,09:45,09:25 07 May,23h 40m,,No info,7480


❌ Here we can see that row number 9039 has missing values in __Route__ and __Total_Stops__, which is a `completeness` issue.

In [787]:
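# Names of all columns stored with the object dtype
l = [i for i in flight.columns if flight[i].dtype == 'O']

def unique_columns(columns: list) -> None:
    """Print the unique values and value counts of each given column."""
    for i in columns:
        print(f'{pd.Series(flight[i].dropna().unique())}')
        print("")
        print(f'{flight[i].value_counts()}')
        print("")
        print("="*50)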

In [788]:
unique_columns(l)

0                                IndiGo
1                             Air India
2                           Jet Airways
3                              SpiceJet
4                     Multiple carriers
5                                 GoAir
6                               Vistara
7                              Air Asia
8               Vistara Premium economy
9                  Jet Airways Business
10    Multiple carriers Premium economy
11                               Trujet
dtype: object

Airline
Jet Airways                          3849
IndiGo                               2053
Air India                            1752
Multiple carriers                    1196
SpiceJet                              818
Vistara                               479
Air Asia                              319
GoAir                                 194
Multiple carriers Premium economy      13
Jet Airways Business                    6
Vistara Premium economy                 3
Trujet                             

#### ❌ Issues
1. There is a __tidiness issue__ in __Date_of_Journey__.<br>
2. There is a __tidiness issue__ in __Dep_Time__ and __Arrival_Time__. In __Arrival_Time__ the day and month are sometimes also given, which is a __completeness issue__.<br>
3. In __Duration__ there are __tidiness and validity issues__: the letters `h` and `m` are present.<br>
4. In __Additional_Info__ there is a __consistency issue__: `No info -> No Info`.<br>


### 📋 Listing The Issues 
#### ⚠️ Quality Issues
1. __Date_of_Journey__ -> wrong data type given (object) `validity issue`<br>
2. __220 rows__ are duplicated `validity issue`<br>
3. __Route__, __Total_Stops__ -> null value in row number __9039__ `completeness issue`<br>
4. __Arrival_Time__ -> day and month are sometimes given and sometimes not `completeness issue`<br>
5. __Duration__ -> letters h and m present between hours and minutes `validity issue`<br>
6. __Additional_Info__ -> No info and No Info are the same but written differently `consistency issue`<br>

#### ⚠️ Tidiness Issues
1. __Date_of_Journey__ -> day, month and year must be in separate columns<br>
2. __Dep_Time__ and __Arrival_Time__ -> hours and minutes must be in separate columns<br>
3. __Duration__ -> hours and minutes must be in separate columns<br>

### ✅ Rectifying Issues

First, create a copy of the original DataFrame.

In [793]:
df_flight = flight.copy()

First we will solve the validity issues.
1. __Date_of_Journey__ has the data type object, so we have to change it from object to __datetime__ using __pd.to_datetime__.

In [795]:
# The dates are written as dd/mm/yyyy, so state explicitly that the day comes first
df_flight["Date_of_Journey"] = pd.to_datetime(df_flight["Date_of_Journey"], dayfirst=True)
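Because the dates are written as dd/mm/yyyy, a string such as `01/03/2019` is ambiguous: it could be read as 1 March or as 3 January. A minimal sketch of why an explicit format helps, using a few date strings in the same style as the dataset (the variable names are illustrative only):

```python
import pandas as pd

# Date strings in the dd/mm/yyyy style used by Date_of_Journey
dates = pd.Series(["24/03/2019", "1/05/2019", "01/03/2019"])

# Explicit day-first format: no guessing involved
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# The same ambiguous string read month-first, for comparison only
month_first = pd.to_datetime("01/03/2019", format="%m/%d/%Y")

print(parsed)
print(month_first)
```

With the day-first format, `01/03/2019` becomes 1 March 2019, while the month-first reading would turn it into 3 January 2019.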

In [796]:
df_flight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Airline          10683 non-null  object        
 1   Date_of_Journey  10683 non-null  datetime64[ns]
 2   Source           10683 non-null  object        
 3   Destination      10683 non-null  object        
 4   Route            10682 non-null  object        
 5   Dep_Time         10683 non-null  object        
 6   Arrival_Time     10683 non-null  object        
 7   Duration         10683 non-null  object        
 8   Total_Stops      10682 non-null  object        
 9   Additional_Info  10683 non-null  object        
 10  Price            10683 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 918.2+ KB


2. __220 rows__ are duplicated. We will remove all duplicate rows using __drop_duplicates(keep = 'first', ignore_index = True)__.

In [798]:
df_flight.drop_duplicates(keep = 'first',ignore_index = True,inplace = True)

In [799]:
df_flight.shape[0]

10463
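The `ignore_index` argument decides whether the surviving rows keep their old labels or are renumbered from 0. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0
toy = pd.DataFrame({
    "Airline": ["IndiGo", "IndiGo", "GoAir"],
    "Price": [7303, 7303, 3898],
})

# Old labels are kept, so the index has a gap (0, 2)
kept = toy.drop_duplicates(keep="first")

# Rows are renumbered 0..n-1
renumbered = toy.drop_duplicates(keep="first", ignore_index=True)

print(kept.index.tolist(), renumbered.index.tolist())
```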

<span style="color: green;">3. Only one row contains null values, so we can remove that row using __dropna__.</span>

In [801]:
df_flight.dropna(inplace = True)

In [802]:
df_flight.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64

In [803]:
df_flight.shape[0]

10462

<span style="color: Green;"> 4. We will remove the day and month part from __Arrival_Time__. </span>

In [805]:
df_flight['Arrival_Time'] = df_flight['Arrival_Time'].str.strip().apply(lambda x : x.split()[0])
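The same step can be written with vectorized string methods. The sketch below also keeps a flag recording whether a date suffix was present, in case that information is useful later. The flag column is an assumption for illustration and is not used in the original notebook.

```python
import pandas as pd

# Sample values in the style of Arrival_Time
arrival = pd.Series(["01:10 22 Mar", "13:15", "04:25 10 Jun"])

# Split on whitespace once and reuse the pieces
parts = arrival.str.strip().str.split()

# First piece is always the clock time
clock_time = parts.str[0]

# More than one piece means a day and month were appended
has_date_suffix = parts.str.len() > 1

print(clock_time.tolist())
print(has_date_suffix.tolist())
```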

In [806]:
df_flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897
1,Air India,2019-05-01,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [807]:
flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


5. Now we will remove __h and m__ from __Duration__.

In [809]:
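def rep1(x):
    """Convert a duration such as '2h 50m' or '19h' to 'H:M' text."""
    if ('h' in x) and ('m' in x):
        # Both parts present: '2h 50m' -> '2:50'
        return x.replace('h','').replace('m','').replace(' ',':')
    elif 'm' not in x:
        # Hours only: '19h' -> '19:0'
        j = x.replace('h','')
        return j+":0"
    else:
        # Minutes only: prefix a zero hour
        j = x.replace('m','')
        return '0:'+ j

df_flight['Duration'] = df_flight['Duration'].str.strip().apply(rep1)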
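As an alternative to string replacement, the hours and minutes can be pulled out directly with regular expressions. This is a sketch under the assumption that every value looks like `"2h 50m"`, `"19h"` or a minutes-only string; the value `"5m"` below is a hypothetical example of the last case.

```python
import pandas as pd

# Sample durations; "5m" is a hypothetical minutes-only value
duration = pd.Series(["2h 50m", "19h", "5m"])

# Capture the digits before 'h' and before 'm'; a missing part becomes 0
hours = duration.str.extract(r"(\d+)\s*h", expand=False).fillna("0").astype(int)
minutes = duration.str.extract(r"(\d+)\s*m", expand=False).fillna("0").astype(int)

# A single numeric feature can be handy for modelling
total_minutes = hours * 60 + minutes

print(hours.tolist(), minutes.tolist(), total_minutes.tolist())
```

This produces integer columns straight away, so no later type conversion is needed.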


In [810]:
df_flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR → DEL,22:20,01:10,2:50,non-stop,No info,3897
1,Air India,2019-05-01,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7:25,2 stops,No info,7662
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19:0,2 stops,No info,13882
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5:25,1 stop,No info,6218
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4:45,1 stop,No info,13302


6. We have to replace __No info__ with __No Info__ in __Additional_Info__ using __str.replace()__.

In [812]:
df_flight['Additional_Info'] = df_flight['Additional_Info'].str.replace('No info', 'No Info')
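`str.replace` substitutes the text wherever it occurs inside a value. If only whole values should be changed, `Series.replace` with a mapping is a possible alternative. A small sketch on sample values:

```python
import pandas as pd

# Sample values in the style of Additional_Info
info = pd.Series(["No info", "No Info", "In-flight meal not included"])

# Replace only values that are exactly "No info"
normalized = info.replace({"No info": "No Info"})

print(normalized.value_counts())
```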

In [813]:
list(df_flight['Additional_Info'].unique())

['No Info',
 'In-flight meal not included',
 'No check-in baggage included',
 '1 Short layover',
 '1 Long layover',
 'Change airports',
 'Business class',
 'Red-eye flight',
 '2 Long layover']

In [814]:
df_flight.sample(50)

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
2414,SpiceJet,2019-03-18,Mumbai,Hyderabad,BOM → HYD,22:45,00:10,1:25,non-stop,No check-in baggage included,1965
2707,SpiceJet,2019-05-21,Mumbai,Hyderabad,BOM → HYD,22:45,00:15,1:30,non-stop,No check-in baggage included,1965
3842,IndiGo,2019-05-03,Kolkata,Banglore,CCU → BLR,22:15,00:50,2:35,non-stop,No Info,4804
1652,Air India,2019-03-21,Banglore,New Delhi,BLR → BOM → DEL,08:50,23:15,14:25,1 stop,No Info,6824
1048,Vistara,2019-05-24,Kolkata,Banglore,CCU → DEL → BLR,07:10,18:50,11:40,1 stop,No Info,9555
8372,Multiple carriers,2019-05-15,Delhi,Cochin,DEL → BOM → COK,07:30,19:00,11:30,1 stop,No Info,9315
9443,Jet Airways,2019-05-01,Kolkata,Banglore,CCU → BOM → BLR,14:05,09:20,19:15,1 stop,In-flight meal not included,9663
6595,IndiGo,2019-03-09,Banglore,New Delhi,BLR → DEL,07:10,10:05,2:55,non-stop,No Info,7303
8850,IndiGo,2019-06-03,Kolkata,Banglore,CCU → BLR,11:30,14:05,2:35,non-stop,No Info,4804
7429,Multiple carriers,2019-05-18,Delhi,Cochin,DEL → BOM → COK,09:15,19:15,10:0,1 stop,No Info,8309


Now we will solve the structural problems, which are tidiness issues.

1. Splitting the __Date_of_Journey__ into __Journey_Day__ and __Journey_Month__

In [817]:
df_flight["Journey_Day"] = df_flight["Date_of_Journey"].dt.day
df_flight["Journey_Month"] = df_flight["Date_of_Journey"].dt.month

In [818]:
df_flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_Day,Journey_Month
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR → DEL,22:20,01:10,2:50,non-stop,No Info,3897,24,3
1,Air India,2019-05-01,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7:25,2 stops,No Info,7662,1,5
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19:0,2 stops,No Info,13882,9,6
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5:25,1 stop,No Info,6218,12,5
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4:45,1 stop,No Info,13302,1,3


2. Splitting __Dep_Time__ and __Arrival_Time__ into __Dep_Hours__, __Dep_Mins__, __Arrival_Hours__ and __Arrival_Mins__. First we have to convert both columns from object to __datetime__ so that the hour and minute can be read.

In [820]:
# Parse each time column once, with an explicit HH:MM format
dep_time = pd.to_datetime(df_flight["Dep_Time"], format="%H:%M")
arrival_time = pd.to_datetime(df_flight["Arrival_Time"], format="%H:%M")
df_flight['Dep_Hours'] = dep_time.dt.hour
df_flight['Dep_Mins'] = dep_time.dt.minute
df_flight['Arrival_Hours'] = arrival_time.dt.hour
df_flight['Arrival_Mins'] = arrival_time.dt.minute


In [821]:
df_flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_Day,Journey_Month,Dep_Hours,Dep_Mins,Arrival_Hours,Arrival_Mins
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR → DEL,22:20,01:10,2:50,non-stop,No Info,3897,24,3,22,20,1,10
1,Air India,2019-05-01,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7:25,2 stops,No Info,7662,1,5,5,50,13,15
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19:0,2 stops,No Info,13882,9,6,9,25,4,25
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5:25,1 stop,No Info,6218,12,5,18,5,23,30
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4:45,1 stop,No Info,13302,1,3,16,50,21,35


3. Splitting the __Duration__ into __Duration_Hours__, __Duration_Mins__

In [823]:
df_flight['Duration_Hours'] = df_flight['Duration'].str.split(':').apply(lambda x : x[0])
df_flight['Duration_Mins'] = df_flight['Duration'].str.split(':').apply(lambda x : x[1])
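The split can also be done in one pass with `expand=True`, which returns both parts as columns and allows the integer conversion to happen at the same time. A sketch on sample `H:M` strings (the values are illustrative):

```python
import pandas as pd

# Sample values in the 'H:M' form produced by the previous step
duration = pd.Series(["2:50", "19:0", "0:5"])

# expand=True returns a DataFrame with one column per piece
parts = duration.str.split(":", expand=True).astype(int)
parts.columns = ["Duration_Hours", "Duration_Mins"]

print(parts)
```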

In [824]:
df_flight.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_Day,Journey_Month,Dep_Hours,Dep_Mins,Arrival_Hours,Arrival_Mins,Duration_Hours,Duration_Mins
0,IndiGo,2019-03-24,Banglore,New Delhi,BLR → DEL,22:20,01:10,2:50,non-stop,No Info,3897,24,3,22,20,1,10,2,50
1,Air India,2019-05-01,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7:25,2 stops,No Info,7662,1,5,5,50,13,15,7,25
2,Jet Airways,2019-06-09,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19:0,2 stops,No Info,13882,9,6,9,25,4,25,19,0
3,IndiGo,2019-05-12,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5:25,1 stop,No Info,6218,12,5,18,5,23,30,5,25
4,IndiGo,2019-03-01,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4:45,1 stop,No Info,13302,1,3,16,50,21,35,4,45


Finally, we will delete the columns __Date_of_Journey__, __Dep_Time__, __Arrival_Time__ and __Duration__ because we have already split them into separate columns.

In [826]:
df_flight.drop(columns = ['Date_of_Journey','Dep_Time','Arrival_Time','Duration'], inplace = True)

In [827]:
df_flight.sample(20)

Unnamed: 0,Airline,Source,Destination,Route,Total_Stops,Additional_Info,Price,Journey_Day,Journey_Month,Dep_Hours,Dep_Mins,Arrival_Hours,Arrival_Mins,Duration_Hours,Duration_Mins
6439,Jet Airways,Kolkata,Banglore,CCU → BOM → BLR,1 stop,No Info,14571,9,6,6,30,18,15,11,45
4531,Air India,Delhi,Cochin,DEL → MAA → COK,1 stop,No Info,6482,21,3,17,20,9,25,16,5
1506,Jet Airways,Banglore,New Delhi,BLR → MAA → DEL,1 stop,In-flight meal not included,10562,9,3,9,45,0,30,14,45
4236,Jet Airways,Kolkata,Banglore,CCU → BOM → BLR,1 stop,In-flight meal not included,9663,1,5,8,25,9,20,24,55
10368,Jet Airways,Kolkata,Banglore,CCU → BOM → BLR,1 stop,In-flight meal not included,9663,6,5,14,5,20,45,6,40
8647,Jet Airways,Kolkata,Banglore,CCU → BOM → BLR,1 stop,In-flight meal not included,10844,15,5,16,30,18,15,25,45
6152,Air Asia,Banglore,Delhi,BLR → DEL,non-stop,No Info,4282,24,6,4,55,7,45,2,50
6206,Air India,Kolkata,Banglore,CCU → BBI → HYD → BLR,2 stops,No Info,6117,15,5,9,10,4,40,19,30
9434,IndiGo,Delhi,Cochin,DEL → BOM → COK,1 stop,No Info,6069,6,6,16,0,21,0,5,0
6402,GoAir,Banglore,Delhi,BLR → DEL,non-stop,No Info,3898,9,6,7,45,10,40,2,55


In [883]:
df_flight.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10462 entries, 0 to 10462
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10462 non-null  object
 1   Source           10462 non-null  object
 2   Destination      10462 non-null  object
 3   Route            10462 non-null  object
 4   Total_Stops      10462 non-null  object
 5   Additional_Info  10462 non-null  object
 6   Price            10462 non-null  int64 
 7   Journey_Day      10462 non-null  int32 
 8   Journey_Month    10462 non-null  int32 
 9   Dep_Hours        10462 non-null  int32 
 10  Dep_Mins         10462 non-null  int32 
 11  Arrival_Hours    10462 non-null  int32 
 12  Arrival_Mins     10462 non-null  int32 
 13  Duration_Hours   10462 non-null  object
 14  Duration_Mins    10462 non-null  object
dtypes: int32(6), int64(1), object(8)
memory usage: 1.0+ MB


❌ Here we can see that __Duration_Hours__ and __Duration_Mins__ have the data type object, so we have to convert them to integer.

In [894]:
df_flight[['Duration_Hours','Duration_Mins']]= df_flight[['Duration_Hours','Duration_Mins']].astype('int32')


In [896]:
df_flight.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10462 entries, 0 to 10462
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10462 non-null  object
 1   Source           10462 non-null  object
 2   Destination      10462 non-null  object
 3   Route            10462 non-null  object
 4   Total_Stops      10462 non-null  object
 5   Additional_Info  10462 non-null  object
 6   Price            10462 non-null  int64 
 7   Journey_Day      10462 non-null  int32 
 8   Journey_Month    10462 non-null  int32 
 9   Dep_Hours        10462 non-null  int32 
 10  Dep_Mins         10462 non-null  int32 
 11  Arrival_Hours    10462 non-null  int32 
 12  Arrival_Mins     10462 non-null  int32 
 13  Duration_Hours   10462 non-null  int32 
 14  Duration_Mins    10462 non-null  int32 
dtypes: int32(8), int64(1), object(6)
memory usage: 980.8+ KB


### The dataset is now clean, but during EDA and model creation we may still need to clean it again

## 3. 🗄️ We will save this dataset to an Excel file and pass it on for EDA + Feature Engineering

In [898]:
df_flight.to_excel('clean_dataset_flight.xlsx',index = False)
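Before handing the file over, a few sanity checks can confirm that the cleaning steps held. The helper below is a hypothetical sketch: the name `check_clean` and the chosen checks are assumptions. Dropping and splitting columns could also make some rows identical again, so a duplicate check at this stage can be informative.

```python
import pandas as pd

def check_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isnull().any().any():
        problems.append("missing values present")
    if df.duplicated().any():
        problems.append("duplicate rows present")
    # Hour and minute columns should stay inside their natural ranges
    limits = [("Dep_Hours", 23), ("Dep_Mins", 59),
              ("Arrival_Hours", 23), ("Arrival_Mins", 59),
              ("Duration_Mins", 59)]
    for col, upper in limits:
        if col in df.columns and not df[col].between(0, upper).all():
            problems.append(f"{col} outside 0..{upper}")
    return problems

# Hypothetical toy frames to show the behaviour
good = pd.DataFrame({"Dep_Hours": [22, 5], "Dep_Mins": [20, 50], "Price": [3897, 7662]})
bad = good.assign(Dep_Hours=[25, 5])

print(check_clean(good))
print(check_clean(bad))
```

Calling `check_clean(df_flight)` just before `to_excel` would list anything that still needs attention.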