# Predict The Flight Ticket Price Hackathon

**Features**:

    Airline: The name of the airline.
    Date_of_Journey: The date of the journey.
    Source: The source from which the service begins.
    Destination: The destination where the service ends.
    Route: The route taken by the flight to reach the destination.
    Dep_Time: The time when the journey starts from the source.
    Arrival_Time: Time of arrival at the destination.
    Duration: Total duration of the flight.
    Total_Stops: Total stops between the source and destination.
    Additional_Info: Additional information about the flight.
    Price: The price of the ticket.

### Import data and libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [3]:
train = pd.read_excel('Data_Train.xlsx')
test = pd.read_excel('Test_set.xlsx')

In [4]:
train.shape, test.shape

((10683, 11), (2671, 10))

In [5]:
train.info()
#test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
Airline            10683 non-null object
Date_of_Journey    10683 non-null object
Source             10683 non-null object
Destination        10683 non-null object
Route              10682 non-null object
Dep_Time           10683 non-null object
Arrival_Time       10683 non-null object
Duration           10683 non-null object
Total_Stops        10682 non-null object
Additional_Info    10683 non-null object
Price              10683 non-null int64
dtypes: int64(1), object(10)
memory usage: 918.1+ KB


In [6]:
train.head(5)

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


In [7]:
for i in train.columns:
    print("Unique values in", i, train[i].nunique())

Unique values in Airline 12
Unique values in Date_of_Journey 44
Unique values in Source 5
Unique values in Destination 6
Unique values in Route 128
Unique values in Dep_Time 222
Unique values in Arrival_Time 1343
Unique values in Duration 368
Unique values in Total_Stops 5
Unique values in Additional_Info 10
Unique values in Price 1870


In [8]:
for i in test.columns:
    print("Unique values in", i, test[i].nunique())

Unique values in Airline 11
Unique values in Date_of_Journey 44
Unique values in Source 5
Unique values in Destination 6
Unique values in Route 100
Unique values in Dep_Time 199
Unique values in Arrival_Time 704
Unique values in Duration 320
Unique values in Total_Stops 5
Unique values in Additional_Info 6


### Data pre-processing

In [9]:
# Take explicit copies so that later column assignments modify these frames, not views of train/test
train_df = train[['Airline', 'Source', 'Destination', 'Total_Stops', 'Additional_Info', 'Date_of_Journey', 'Dep_Time', 
                  'Route', 'Arrival_Time', 'Price']].copy()
test_df = test[['Airline', 'Source', 'Destination', 'Total_Stops', 'Additional_Info', 'Date_of_Journey', 'Dep_Time', 
                'Route', 'Arrival_Time']].copy()

The new feature __Booking_Class__ identifies the booking class, i.e. **Economy, Premium Economy or Business**. The __Premium Economy__ and __Business__ classes are already indicated in the airline name. All other airlines are assumed to be __Economy__.

In [10]:
Class = {'IndiGo': 'Economy',
         'GoAir': 'Economy',
         'Vistara': 'Economy',
         'Vistara Premium economy': 'Premium Economy',
         'Air Asia': 'Economy',
         'Trujet': 'Economy',
         'Jet Airways': 'Economy',
         'SpiceJet': 'Economy',
         'Jet Airways Business': 'Business',
         'Air India': 'Economy',
         'Multiple carriers': 'Economy',
         'Multiple carriers Premium economy': 'Premium Economy'}
train_df['Booking_Class'] = train_df['Airline'].map(Class)
test_df['Booking_Class'] = test_df['Airline'].map(Class)
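
The "everything else is Economy" assumption can also be written as a default instead of listing every airline. The following is a minimal sketch on toy data, assuming a hypothetical, shortened dictionary `premium_classes` (not part of the notebook): `Series.map` returns `NaN` for keys that are missing from the dictionary, and `fillna` then supplies the Economy default.

```python
import pandas as pd

# Hypothetical, shortened mapping: only the non-Economy labels are listed.
premium_classes = {
    'Vistara Premium economy': 'Premium Economy',
    'Jet Airways Business': 'Business',
}

airlines = pd.Series(['IndiGo', 'Jet Airways Business', 'Vistara Premium economy', 'SpiceJet'])

# map() yields NaN for airlines absent from the dictionary; fillna() applies the default.
booking_class = airlines.map(premium_classes).fillna('Economy')
print(booking_class.tolist())
```

A side effect of this variant is that an airline name never seen before would also be labelled Economy rather than left as `NaN`, which may or may not be desirable.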

The new feature __Market_Share__ gives the market share (in %) of each airline. The figures are taken mostly from Wikipedia. For _Multiple carriers_ and _Multiple carriers Premium economy_ I have assumed 1%, and for _Trujet_, a new entrant in the airline market, I have assumed 0.1%.

In [11]:
market = {'IndiGo': 41.3,
         'GoAir': 8.4,
         'Vistara': 3.3,
         'Vistara Premium economy': 3.3,
         'Air Asia': 3.3,
         'Trujet': 0.1,
         'Jet Airways': 17.8,
         'SpiceJet': 13.3,
         'Jet Airways Business': 17.8,
         'Air India': 13.5,
         'Multiple carriers': 1,
         'Multiple carriers Premium economy': 1}
train_df['Market_Share'] = train_df['Airline'].map(market)
test_df['Market_Share'] = test_df['Airline'].map(market)

One of the most important factors influencing a flight ticket price is how far in advance the ticket is booked. Since this information is not provided in the dataset, I have assumed 01-Mar-2019 as the booking date and created a new feature, __Days_to_Departure__.

In [12]:
df1 = train_df.copy() 
df1['Day_of_Booking'] = '1/3/2019'
df1['Day_of_Booking'] = pd.to_datetime(df1['Day_of_Booking'],format='%d/%m/%Y')
df1['Date_of_Journey'] = pd.to_datetime(df1['Date_of_Journey'],format='%d/%m/%Y')
df1['Days_to_Departure'] = (df1['Date_of_Journey'] - df1['Day_of_Booking']).dt.days
train_df['Days_to_Departure'] = df1['Days_to_Departure']

df2 = test_df.copy() 
df2['Day_of_Booking'] = '1/3/2019'
df2['Day_of_Booking'] = pd.to_datetime(df2['Day_of_Booking'],format='%d/%m/%Y')
df2['Date_of_Journey'] = pd.to_datetime(df2['Date_of_Journey'],format='%d/%m/%Y')
df2['Days_to_Departure'] = (df2['Date_of_Journey'] - df2['Day_of_Booking']).dt.days
test_df['Days_to_Departure'] = df2['Days_to_Departure']

del df1, df2
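
The same calculation can be done without temporary copies. This is a minimal sketch, using two journey dates from the head of the training data and the notebook's assumed booking date of 1 March 2019; the variable names are illustrative only.

```python
import pandas as pd

# Two journey dates taken from the first rows of the training data.
journeys = pd.Series(['24/03/2019', '1/05/2019'])

# Assumed booking date (the notebook's assumption): 1 March 2019.
booking_date = pd.Timestamp(2019, 3, 1)

# Subtracting a Timestamp from a datetime Series gives timedeltas; .dt.days extracts whole days.
days_to_departure = (pd.to_datetime(journeys, format='%d/%m/%Y') - booking_date).dt.days
print(days_to_departure.tolist())
```

The two results agree with the `Days_to_Departure` values shown for rows 0 and 1 of `train_df`.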

In [13]:
train_df.head(2)

| | Airline | Source | Destination | Total_Stops | Additional_Info | Date_of_Journey | Dep_Time | Route | Arrival_Time | Price | Booking_Class | Market_Share | Days_to_Departure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | Banglore | New Delhi | non-stop | No info | 24/03/2019 | 22:20 | BLR → DEL | 01:10 22 Mar | 3897 | Economy | 41.3 | 23 |
| 1 | Air India | Kolkata | Banglore | 2 stops | No info | 1/05/2019 | 05:50 | CCU → IXR → BBI → BLR | 13:15 | 7662 | Economy | 13.5 | 61 |


In [14]:
# Keep only the arrival time (without the date)
train_df['Arrival_Time'] = train['Arrival_Time'].str.split(' ').str[0]
test_df['Arrival_Time'] = test['Arrival_Time'].str.split(' ').str[0]

Another important factor influencing the flight price is the departure time of day, i.e. Morning, Noon, Evening or Night. The new feature __Dep_timeofday__ captures this. The same idea is applied to the arrival time to create another feature, __Arr_timeofday__.

In [15]:
def get_departure(dep):
    dep = dep.split(':')
    dep = int(dep[0])
    if (dep >= 6 and dep < 12):
        return 'Morning'
    elif (dep >= 12 and dep < 17):
        return 'Noon'
    elif (dep >= 17 and dep < 20):
        return 'Evening'
    else:
        return 'Night'
    
train_df['Dep_timeofday'] = train['Dep_Time'].apply(get_departure)   
test_df['Dep_timeofday'] = test['Dep_Time'].apply(get_departure) 

train_df['Arr_timeofday'] = train['Arrival_Time'].apply(get_departure)   
test_df['Arr_timeofday'] = test['Arrival_Time'].apply(get_departure) 
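
The same hour boundaries can be expressed as bins. This is a minimal sketch on a few departure times from the training data, assuming a pandas version that supports the `ordered` argument of `pd.cut` (1.1 or later). It is an alternative formulation, not the notebook's method.

```python
import pandas as pd

dep_times = pd.Series(['22:20', '05:50', '09:25', '18:05', '16:50'])
hours = dep_times.str.split(':').str[0].astype(int)

# Left-closed bins [0, 6), [6, 12), [12, 17), [17, 20), [20, 24).
# 'Night' appears twice, so the resulting categories must be unordered.
time_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 17, 20, 24],
    labels=['Night', 'Morning', 'Noon', 'Evening', 'Night'],
    right=False,
    ordered=False,
)
print(time_of_day.astype(str).tolist())
```

Writing the boundaries as one list makes it easy to see that every hour from 0 to 23 falls into exactly one bucket.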

Convert the categorical __Total_Stops__ column to numeric.

In [16]:
train_df['Total_Stops'] = train_df['Total_Stops'].str.replace('non-stop','0')
train_df['Total_Stops'] = train_df['Total_Stops'].str.replace('stops','')
train_df['Total_Stops'] = train_df['Total_Stops'].str.replace('stop','')
train_df['Total_Stops'] = train_df['Total_Stops'].fillna(0)  # assign back instead of in-place fill on a column
train_df['Total_Stops'] = train_df['Total_Stops'].astype(float)

test_df['Total_Stops'] = test_df['Total_Stops'].str.replace('non-stop','0')
test_df['Total_Stops'] = test_df['Total_Stops'].str.replace('stops','')
test_df['Total_Stops'] = test_df['Total_Stops'].str.replace('stop','')
#test_df['Total_Stops'].fillna(0, inplace=True)
test_df['Total_Stops'] = test_df['Total_Stops'].astype(float)
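
As an alternative to chained string replacements, the digits can be extracted directly. This is a minimal sketch on toy values, assuming the labels follow the `non-stop` / `N stop(s)` pattern seen in the head of the data.

```python
import numpy as np
import pandas as pd

stops = pd.Series(['non-stop', '1 stop', '2 stops', np.nan])

# Pull out the digits; 'non-stop' and missing values contain none and become NaN,
# which is then filled with 0, mirroring the fillna(0) used for the training data.
total_stops = stops.str.extract(r'(\d+)', expand=False).astype(float).fillna(0)
print(total_stops.tolist())
```

Note that this treats a missing value the same way as `non-stop`, which is the same choice the training-set code makes.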

Convert the __Duration__ column to minutes.

In [17]:
# First token holds the hours (e.g. '2h'); a missing part is treated as 0
train_df['Hours'] = train['Duration'].str.split(' ').str[0]
train_df['Hours'] = train_df['Hours'].str.replace('h','').astype(float)
train_df['Hours'] = train_df['Hours'].fillna(0)

# Second token, if present, holds the minutes (e.g. '50m')
train_df['Minutes'] = train['Duration'].str.split(' ').str[1]
train_df['Minutes'] = train_df['Minutes'].str.replace('m','').astype(float)
train_df['Minutes'] = train_df['Minutes'].fillna(0)

test_df['Hours'] = test['Duration'].str.split(' ').str[0]
test_df['Hours'] = test_df['Hours'].str.replace('h','').astype(float)
test_df['Hours'] = test_df['Hours'].fillna(0)

test_df['Minutes'] = test['Duration'].str.split(' ').str[1]
test_df['Minutes'] = test_df['Minutes'].str.replace('m','').astype(float)
test_df['Minutes'] = test_df['Minutes'].fillna(0)

train_df['Hours'] = train_df['Hours'] * 60
train_df['Duration'] = train_df['Hours'] + train_df['Minutes']

test_df['Hours'] = test_df['Hours'] * 60
test_df['Duration'] = test_df['Hours'] + test_df['Minutes']

train_df.drop(['Hours', 'Minutes'], axis=1, inplace=True)
test_df.drop(['Hours', 'Minutes'], axis=1, inplace=True)
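
The duration parsing can also be sketched with one regular expression per component. The first three values below come from the head of the training data; `'5m'` is a made-up edge case added for illustration, and the variable names are hypothetical.

```python
import pandas as pd

durations = pd.Series(['2h 50m', '7h 25m', '19h', '5m'])

# Extract each component separately so a missing 'h' or 'm' part simply counts as 0.
hours = durations.str.extract(r'(\d+)\s*h', expand=False).astype(float).fillna(0)
minutes = durations.str.extract(r'(\d+)\s*m', expand=False).astype(float).fillna(0)

total_minutes = hours * 60 + minutes
print(total_minutes.tolist())
```

Because hours and minutes are matched by their suffix rather than by position, a minutes-only duration does not end up in the hours column.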

In [18]:
train_df.head(2)

| | Airline | Source | Destination | Total_Stops | Additional_Info | Date_of_Journey | Dep_Time | Route | Arrival_Time | Price | Booking_Class | Market_Share | Days_to_Departure | Dep_timeofday | Arr_timeofday | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | Banglore | New Delhi | 0.0 | No info | 24/03/2019 | 22:20 | BLR → DEL | 01:10 | 3897 | Economy | 41.3 | 23 | Night | Night | 170.0 |
| 1 | Air India | Kolkata | Banglore | 2.0 | No info | 1/05/2019 | 05:50 | CCU → IXR → BBI → BLR | 13:15 | 7662 | Economy | 13.5 | 61 | Night | Noon | 445.0 |


Let's take logarithmic values of __Price__ and __Duration__

In [19]:
train_df['Price'] = np.log1p(train_df['Price'])

train_df['Duration'] = np.log1p(train_df['Duration'])
test_df['Duration'] = np.log1p(test_df['Duration'])
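
A short check of how the transform is undone may be useful here. `np.log1p(x)` computes log(1 + x), and its exact inverse is `np.expm1`; plain `np.exp` returns the original value plus 1. The sketch below uses three prices from the head of the training data.

```python
import numpy as np

prices = np.array([3897.0, 7662.0, 13882.0])  # prices from the first rows of the training data

log_prices = np.log1p(prices)       # log(1 + price)
recovered = np.expm1(log_prices)    # exact inverse: exp(x) - 1
shifted = np.exp(log_prices)        # exp alone returns price + 1

print(np.allclose(recovered, prices))
print(np.allclose(shifted, prices + 1))
```

For prices in the thousands an offset of 1 is small, but it is worth keeping in mind when converting predictions back to the original scale.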

In [20]:
train_df['Additional_Info'] = train_df['Additional_Info'].str.replace('No info', 'No Info')
test_df['Additional_Info'] = test_df['Additional_Info'].str.replace('No info', 'No Info')

There are a lot of categorical variables. I used pandas __get_dummies__ to one-hot encode all of them.

In [21]:
train_df = pd.get_dummies(train_df, columns=['Airline', 'Source', 'Destination', 'Additional_Info', 'Date_of_Journey',
                                             'Dep_Time', 'Arrival_Time', 'Dep_timeofday', 'Booking_Class', 'Arr_timeofday'],
                          drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Airline', 'Source', 'Destination', 'Additional_Info', 'Date_of_Journey',
                                           'Dep_Time', 'Arrival_Time', 'Dep_timeofday', 'Booking_Class', 'Arr_timeofday'],
                         drop_first=True)
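
Encoding the two frames separately means each one only gets columns for the categories it actually contains. The following toy sketch illustrates this, under the assumption (for illustration only) that one airline appears in the training frame but not in the test frame.

```python
import pandas as pd

# Toy frames: 'Trujet' is assumed to occur only in the training frame.
toy_train = pd.DataFrame({'Airline': ['IndiGo', 'Trujet', 'Air India']})
toy_test = pd.DataFrame({'Airline': ['IndiGo', 'Air India']})

train_enc = pd.get_dummies(toy_train, columns=['Airline'], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=['Airline'], drop_first=True)

# drop_first removes the alphabetically first level ('Air India') in both frames,
# but 'Airline_Trujet' exists only where that airline was seen.
print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

This is why the column sets of the encoded training and test frames have to be reconciled before a model trained on one can predict on the other.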

For the __Route__ column, I applied TF-IDF feature extraction to create one column per location code. There are 43 unique locations, so 43 new features are created from the __Route__ column. The results are stored in a dataframe.

In [22]:
def clean_route(route):
    route = str(route)
    route = route.split(' → ')
    return ' '.join(route)

train_df['Route'] = train_df['Route'].apply(clean_route)
test_df['Route'] = test_df['Route'].apply(clean_route)

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(ngram_range=(1, 1), lowercase=False)
train_route = tf.fit_transform(train_df['Route'])
test_route = tf.transform(test_df['Route'])

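# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
route_cols = tf.get_feature_names_out()
train_route = pd.DataFrame(data=train_route.toarray(), columns=route_cols)
test_route = pd.DataFrame(data=test_route.toarray(), columns=route_cols)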

In [23]:
train_route.head(5)
#test_route.head(5)

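| | AMD | ATQ | BBI | BDQ | BHO | BLR | BOM | CCU | COK | DED | ... | PAT | PNQ | RPR | STV | TRV | UDR | VGA | VNS | VTZ | nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.783012 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.619874 | 0.0 | 0.0 | 0.194833 | 0.0 | 0.248235 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.241494 | 0.0 | 0.271844 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.274089 | 0.0 | 0.349214 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.284914 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |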


Now, let's concatenate the __train_route__ and __test_route__ dataframes with the corresponding __train_df__ and __test_df__ dataframes, which will be used for modelling.

In [24]:
train_df = pd.concat([train_df, train_route], axis=1) 
train_df.drop('Route', axis=1, inplace=True)

test_df = pd.concat([test_df, test_route], axis=1) 
test_df.drop('Route', axis=1, inplace=True)

In [25]:
train_df.head()
#test_df.head()

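| | Total_Stops | Price | Market_Share | Days_to_Departure | Duration | Airline_Air India | Airline_GoAir | Airline_IndiGo | Airline_Jet Airways | Airline_Jet Airways Business | ... | PAT | PNQ | RPR | STV | TRV | UDR | VGA | VNS | VTZ | nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 8.268219 | 41.3 | 23 | 5.141664 | 0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2.0 | 8.944159 | 13.5 | 61 | 6.100319 | 1 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2.0 | 9.53842 | 17.8 | 100 | 7.03966 | 0 | 0 | 0 | 1 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 8.735364 | 41.3 | 72 | 5.786897 | 0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 9.495745 | 41.3 | 0 | 5.655992 | 0 | 0 | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |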


### Train test split

In [187]:
X = train_df.drop(labels=['Price'], axis=1)
y = train_df['Price'].values

from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25, random_state=1)

In [188]:
X_train.shape, y_train.shape, X_cv.shape, y_cv.shape

((8012, 569), (8012,), (2671, 569), (2671,))
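
The split sizes can be reproduced from the row count alone. This is a minimal sketch with placeholder arrays (the names `X_toy` and `y_toy` are hypothetical), showing that `test_size=0.25` on 10683 rows gives a hold-out set of 2671 rows and leaves 8012 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 10683                   # number of training rows reported by train.shape
X_toy = np.zeros((n_rows, 1))    # placeholder features; only the row count matters here
y_toy = np.zeros(n_rows)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1)
print(X_tr.shape[0], X_val.shape[0])
```

The hold-out size is rounded up from 2670.75, which is why it happens to equal the number of rows in the real test set.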

### Build the model

In [189]:
from math import sqrt 
from sklearn.metrics import mean_squared_log_error

In [190]:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_cv, label=y_cv)

param = {'objective': 'regression',
         'boosting': 'gbdt',
         'num_iterations': 3000,   
         'learning_rate': 0.06,  
         'num_leaves': 40,  
         'max_depth': 24,   
         'min_data_in_leaf':11,  
         'max_bin': 4, 
         'metric': 'l2_root'
         }

lgbm = lgb.train(params=param,
                 verbose_eval=1000,
                 train_set=train_data,
                 valid_sets=[test_data])

y_pred2 = lgbm.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred2))))

[1000]	valid_0's rmse: 0.107172
[2000]	valid_0's rmse: 0.105563
[3000]	valid_0's rmse: 0.105689
RMSLE: 0.10567509598337499
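
For reference, the metric itself is easy to write out by hand. The sketch below defines a hypothetical helper `rmsle` and checks it against scikit-learn's `mean_squared_log_error` on made-up predictions; it assumes both arrays are on the original price scale.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for values on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


true_prices = np.array([3897.0, 7662.0, 13882.0])   # prices from the training data head
pred_prices = np.array([4100.0, 7500.0, 13000.0])   # made-up predictions

reference = sqrt(mean_squared_log_error(true_prices, pred_prices))
print(np.isclose(rmsle(true_prices, pred_prices), reference))
```

Since the target here is already stored as `log1p(Price)`, the RMSE that LightGBM reports on the transformed target measures essentially the same quantity, which is why the two printed numbers are so close.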


In [191]:
from xgboost import XGBRegressor
xgb = XGBRegressor(max_depth=9, 
                   learning_rate=0.5, 
                   n_estimators=112, 
                   silent=False, 
                   objective='reg:linear', 
                   booster='gbtree', 
                   n_jobs=1, 
                   nthread=None, 
                   gamma=0, 
                   min_child_weight=1, 
                   max_delta_step=0, 
                   subsample=1, 
                   colsample_bytree=1, 
                   colsample_bylevel=1, 
                   reg_alpha=1, 
                   reg_lambda=1, 
                   scale_pos_weight=1, 
                   base_score=0.5, 
                   random_state=0, 
                   seed=None)
xgb.fit(X_train, y_train)
y_pred1 = xgb.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred1))))
#RMSLE:0.11124007381703037

RMSLE: 0.11125264058400451


In [192]:
from sklearn.ensemble import BaggingRegressor
br = BaggingRegressor(base_estimator=None, 
                      n_estimators=50, 
                      max_samples=1.0, 
                      max_features=1.0, 
                      bootstrap=True, 
                      bootstrap_features=False, 
                      oob_score=False, 
                      warm_start=False, 
                      n_jobs=1, 
                      random_state=1, 
                      verbose=0)
br.fit(X_train, y_train)
y_pred3 = br.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred3))))
#RMSLE:0.11265336177662688

RMSLE: 0.11278238526307509


In [193]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(loss='ls', 
                               learning_rate=0.3, 
                               n_estimators=380, 
                               subsample=1.0, 
                               criterion='friedman_mse', 
                               min_samples_split=30, 
                               min_samples_leaf=1, 
                               min_weight_fraction_leaf=0.0, 
                               max_depth=7, 
                               min_impurity_decrease=0.0, 
                               min_impurity_split=None, 
                               init=None, 
                               random_state=0, 
                               max_features=None, 
                               alpha=0.9, 
                               verbose=100, 
                               max_leaf_nodes=None, 
                               warm_start=False, 
                               presort='auto')
gb.fit(X_train, y_train)
y_pred4 = gb.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred4))))
#RMSLE:0.11014737617371691

      Iter       Train Loss   Remaining Time 
         1           0.1420            4.18m
         2           0.0829            3.82m
         3           0.0515            4.03m
         4           0.0356            3.96m
         5           0.0270            3.91m
         6           0.0225            4.00m
         7           0.0195            3.92m
         8           0.0177            3.85m
         9           0.0163            3.81m
        10           0.0155            3.73m
        11           0.0146            3.63m
        12           0.0141            3.49m
        13           0.0137            3.37m
        14           0.0133            3.28m
        15           0.0129            3.18m
        16           0.0127            3.08m
        17           0.0125            2.98m
        18           0.0124            2.87m
        19           0.0122            2.77m
        20           0.0117            2.81m
        21           0.0116            2.76m
       ...

       183           0.0052           58.23s
       184           0.0051           57.90s
       185           0.0051           57.57s
       186           0.0051           57.23s
       187           0.0051           56.85s
       188           0.0051           56.47s
       189           0.0051           56.10s
       190           0.0051           55.77s
       191           0.0051           55.53s
       192           0.0050           55.20s
       193           0.0050           55.06s
       194           0.0050           54.89s
       195           0.0050           54.57s
       196           0.0050           54.18s
       197           0.0050           53.80s
       198           0.0049           53.53s
       199           0.0049           53.21s
       200           0.0049           52.91s
       201           0.0049           52.66s
       202           0.0049           52.58s
       203           0.0049           52.21s
       204           0.0049           51.83s
       ...

       366           0.0032            4.24s
       367           0.0032            3.94s
       368           0.0031            3.64s
       369           0.0031            3.34s
       370           0.0031            3.03s
       371           0.0031            2.73s
       372           0.0031            2.43s
       373           0.0031            2.12s
       374           0.0031            1.82s
       375           0.0031            1.52s
       376           0.0031            1.21s
       377           0.0031            0.91s
       378           0.0031            0.61s
       379           0.0031            0.30s
       380           0.0031            0.00s
RMSLE: 0.10977714681935279


In [194]:
y_pred = y_pred1*0.10 + y_pred2*0.50 + y_pred3*0.20 + y_pred4*0.20
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred))))

RMSLE: 0.1013472363709534
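
The blend is a weighted average of the four models' log-price predictions. This is a minimal sketch with made-up predictions for three flights; only the weights are taken from the cell above.

```python
import numpy as np

# Made-up log-price predictions from four hypothetical models for three flights.
preds = np.array([
    [8.30, 8.95, 9.50],   # model 1
    [8.26, 8.94, 9.54],   # model 2
    [8.28, 8.90, 9.52],   # model 3
    [8.25, 8.96, 9.55],   # model 4
])
weights = np.array([0.10, 0.50, 0.20, 0.20])   # the blend weights used above

# Weights that sum to 1 keep the blend on the same scale as the inputs.
print(np.isclose(weights.sum(), 1.0))

blend = weights @ preds    # weighted average per flight
print(blend)
```

Because the weights are non-negative and sum to 1, each blended value lies between the smallest and largest individual prediction for that flight.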


### Predict on test set

In [195]:
# This departure time has no dummy column in train, so add it as an all-zero column
train_df['Dep_Time_22:30'] = 0

In [196]:
missing_cols_test = []
for col in train_df.columns:
    if col not in test_df.columns:
        missing_cols_test.append(col)
        
for i in missing_cols_test:
    test_df[i] = 0

test_df.drop('Price', axis=1, inplace=True)

In [197]:
train_df = train_df.reindex(sorted(train_df.columns), axis=1)
test_df = test_df.reindex(sorted(test_df.columns), axis=1)
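
The column alignment in the cells above can also be sketched with a single `reindex` call. The toy frames below are hypothetical. Note one difference from the notebook: this variant drops columns that exist only in the test frame instead of adding them to the training frame.

```python
import pandas as pd

toy_train = pd.DataFrame({'Price': [8.2, 8.9], 'Airline_IndiGo': [1, 0], 'Airline_Trujet': [0, 1]})
toy_test = pd.DataFrame({'Airline_IndiGo': [1, 0], 'Dep_Time_22:30': [0, 1]})

# Feature columns are every training column except the target, in sorted order.
feature_cols = sorted(toy_train.columns.drop('Price'))

# reindex adds the missing columns filled with 0 and drops columns the model never saw.
toy_test_aligned = toy_test.reindex(columns=feature_cols, fill_value=0)
print(toy_test_aligned.columns.tolist())
```

Either approach ends with both frames having the same columns in the same order, which is what the fitted models require at prediction time.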

In [198]:
train_df.shape, test_df.shape

((10683, 571), (2671, 570))

In [199]:
X_train = train_df.drop(labels='Price', axis=1)
y_train = train_df['Price'].values

X_test = test_df

In [200]:
X_train.shape, X_test.shape

((10683, 570), (2671, 570))

In [201]:
from xgboost import XGBRegressor
xgb = XGBRegressor(max_depth=9, 
                   learning_rate=0.5, 
                   n_estimators=112, 
                   silent=False, 
                   objective='reg:linear', 
                   booster='gbtree', 
                   n_jobs=1, 
                   nthread=None, 
                   gamma=0, 
                   min_child_weight=1, 
                   max_delta_step=0, 
                   subsample=1, 
                   colsample_bytree=1, 
                   colsample_bylevel=1, 
                   reg_alpha=0.89, 
                   reg_lambda=1, 
                   scale_pos_weight=1, 
                   base_score=0.5, 
                   random_state=0, 
                   seed=None)
xgb.fit(X_train, y_train)
y_pred1 = xgb.predict(X_test)

In [202]:
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)

param = {'objective': 'regression',
         'boosting': 'gbdt',
         'num_iterations': 3000,   
         'learning_rate': 0.06,  
         'num_leaves': 40,  
         'max_depth': 24,   
         'min_data_in_leaf':11,  
         'max_bin': 4, 
         'metric': 'l2_root'
         }

lgbm = lgb.train(params=param,
                 train_set=train_data)

y_pred2 = lgbm.predict(X_test)

In [203]:
from sklearn.ensemble import BaggingRegressor
br = BaggingRegressor(base_estimator=None, 
                      n_estimators=50, 
                      max_samples=1.0, 
                      max_features=1.0, 
                      bootstrap=True, 
                      bootstrap_features=False, 
                      oob_score=False, 
                      warm_start=False, 
                      n_jobs=1, 
                      random_state=1, 
                      verbose=0)
br.fit(X_train, y_train)
y_pred3 = br.predict(X_test)

In [204]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(loss='ls', 
                               learning_rate=0.3, 
                               n_estimators=380, 
                               subsample=1.0, 
                               criterion='friedman_mse', 
                               min_samples_split=30, 
                               min_samples_leaf=1, 
                               min_weight_fraction_leaf=0.0, 
                               max_depth=7, 
                               min_impurity_decrease=0.0, 
                               min_impurity_split=None, 
                               init=None, 
                               random_state=0, 
                               max_features=None, 
                               alpha=0.9, 
                               verbose=100, 
                               max_leaf_nodes=None, 
                               warm_start=False, 
                               presort='auto')
gb.fit(X_train, y_train)
y_pred4 = gb.predict(X_test)

      Iter       Train Loss   Remaining Time 
         1           0.1429            4.23m
         2           0.0829            4.17m
         3           0.0526            4.09m
         4           0.0367            4.03m
         5           0.0281            3.99m
         6           0.0228            3.95m
         7           0.0199            3.97m
         8           0.0178            3.89m
         9           0.0167            3.92m
        10           0.0159            3.86m
        11           0.0153            3.79m
        12           0.0145            3.74m
        13           0.0140            3.69m
        14           0.0137            3.60m
        15           0.0133            3.52m
        16           0.0131            3.44m
        17           0.0128            3.43m
        18           0.0127            3.36m
        19           0.0126            3.31m
        20           0.0124            3.33m
        21           0.0122            3.34m
       ...

       183           0.0054            1.37m
       184           0.0054            1.37m
       185           0.0054            1.36m
       186           0.0053            1.36m
       187           0.0053            1.35m
       188           0.0052            1.34m
       189           0.0052            1.34m
       190           0.0052            1.33m
       191           0.0051            1.32m
       192           0.0051            1.31m
       193           0.0051            1.31m
       194           0.0051            1.30m
       195           0.0051            1.29m
       196           0.0051            1.29m
       197           0.0051            1.28m
       198           0.0050            1.27m
       199           0.0050            1.26m
       200           0.0050            1.25m
       201           0.0050            1.25m
       202           0.0050            1.24m
       203           0.0050            1.23m
       204           0.0050            1.22m
       ...

       366           0.0034            6.04s
       367           0.0034            5.60s
       368           0.0034            5.18s
       369           0.0034            4.74s
       370           0.0034            4.31s
       371           0.0034            3.88s
       372           0.0033            3.44s
       373           0.0033            3.01s
       374           0.0033            2.59s
       375           0.0033            2.16s
       376           0.0033            1.73s
       377           0.0033            1.30s
       378           0.0033            0.86s
       379           0.0033            0.43s
       380           0.0033            0.00s


In [205]:
y_pred = y_pred1*0.15 + y_pred2*0.50 + y_pred3*0.15 + y_pred4*0.20

In [206]:
y_pred = np.exp(y_pred)

In [207]:
y_pred[:5]

array([14406.06843791,  4209.53341524, 12895.31525683, 11535.5327643 ,
        3943.37074688])

In [208]:
df_sub = pd.DataFrame(data=y_pred, columns=['Price'])
# The context manager saves and closes the file; ExcelWriter.save() was removed in pandas 2.0
with pd.ExcelWriter('Output.xlsx', engine='xlsxwriter') as writer:
    df_sub.to_excel(writer, sheet_name='Sheet1', index=False)