#ASHRAE - Great Energy Predictor III

##Steps 
  1. Downloading datasets, Installing Packages & Loading data into dataframes
  2. Exploratory Data Analysis
  3. Data Cleaning
  4. Feature Engineering
  5. Building & Evaluating different Machine Learning models

##1. Downloading datasets, Installing Packages & Loading data into dataframes

####Installing packages & Downloading datasets

- Install opendatasets to download the required datasets from Kaggle.
- Add the Kaggle API file containing your username and API key to the files folder, or enter the credentials when prompted.

In [1]:
!pip install opendatasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [2]:
import opendatasets as od
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import seaborn as sns

In [3]:
download_url = 'https://www.kaggle.com/competitions/ashrae-energy-prediction/'

In [4]:
od.download(download_url)

Downloading ashrae-energy-prediction.zip to ./ashrae-energy-prediction


100%|██████████| 379M/379M [00:02<00:00, 197MB/s]



Extracting archive ./ashrae-energy-prediction/ashrae-energy-prediction.zip to ./ashrae-energy-prediction


In [5]:
data_dir = 'ashrae-energy-prediction/'

####Loading different CSV files into Dataframes

In [6]:
sample_fraction = 0.01

random.seed(42)
def skip_row(row_idx):
  if row_idx == 0:
    return False
  else:
    return random.random() > sample_fraction
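
The sampling idea above can be tried on a small in-memory CSV before running it on the full training file. The following is a minimal sketch; the toy CSV text and the 10% fraction are assumptions made for illustration, not values from this notebook.

```python
import io
import random

import pandas as pd

# Hypothetical in-memory CSV standing in for train.csv (1,000 data rows).
csv_text = "building_id,meter_reading\n" + "\n".join(
    f"{i},{i * 1.5}" for i in range(1000)
)

sample_fraction = 0.1  # keep roughly 10% of the rows (assumed for the demo)
random.seed(42)        # fixed seed so the same rows are kept on every run


def skip_row(row_idx):
    # Row 0 is the header line and must never be skipped.
    if row_idx == 0:
        return False
    # Skip each data row with probability 1 - sample_fraction.
    return random.random() > sample_fraction


sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(len(sample), "of 1000 rows kept")
```

Because rows are dropped while the file is parsed, the full file never has to fit in memory, which is presumably why this approach is used here instead of loading everything and calling `sample()`.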

In [7]:
#load data
train = pd.read_csv(data_dir+'train.csv', parse_dates=['timestamp'], skiprows=skip_row)
weather_train = pd.read_csv(data_dir+'weather_train.csv', parse_dates=['timestamp'])
build_meta = pd.read_csv(data_dir+'building_metadata.csv')

##2. Exploratory Data Analysis

In [8]:
train.meter.value_counts()

0    119909
1     41395
2     26986
3     12507
Name: meter, dtype: int64

In [9]:
len(train)

200797

In [10]:
buildings = train.building_id.unique()
num_builds = []
for x in range(16):
  df = build_meta[build_meta.site_id==x]
  num_builds.append(df.building_id.unique())

In [11]:
total_sum = 0
for x in num_builds:
  total_sum += len(x)
  print(len(x))
total_sum

105
51
135
274
91
89
44
15
70
124
30
5
36
154
102
124


1449
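
The per-site building counts above can also be obtained without an explicit loop. This is a sketch on a made-up stand-in for `build_meta`; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for building_metadata.csv (values are made up).
toy_meta = pd.DataFrame({
    "site_id": [0, 0, 1, 1, 1, 2],
    "building_id": [0, 1, 2, 3, 4, 5],
})

# Number of distinct buildings per site, then the overall total.
per_site = toy_meta.groupby("site_id")["building_id"].nunique()
total = int(per_site.sum())

print(per_site)
print(total)
```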

##3. Data Cleaning

####Handling Missing Values

A utility function that summarizes the missing values in a dataset

In [12]:
def get_missing_info(df):
  num_entries = df.shape[0]*df.shape[1]
  num_nulls = df.isna().sum().sum()
  percent_null = num_nulls/num_entries*100
  num_missing = df.isna().sum()
  percent_missing = num_missing/len(df)*100
  col_modes = df.mode().loc[0] #first row of the DataFrame of modes for all columns (a column can have several modes because more than one value can be the most frequent)
  percent_mode = [df[x].isin([df[x].mode()[0]]).sum()/len(df)*100 for x in df]
  missing_value_df = pd.DataFrame({'num_missing':num_missing,
                                   'percent_missing':percent_missing,
                                   'mode':col_modes,
                                   'percent_mode':percent_mode})
  print('total empty percent:', percent_null, '%')
  print('columns that are more than 97% mode:', missing_value_df.loc[missing_value_df['percent_mode']>97].index.values)
  return missing_value_df


In [13]:
get_missing_info(train)

total empty percent: 0.0 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
building_id,0,0.0,1249,0.182772
meter,0,0.0,0.0,59.71653
timestamp,0,0.0,2016-01-30 01:00:00,0.021415
meter_reading,0,0.0,0.0,9.280517


In [14]:
get_missing_info(build_meta)

total empty percent: 21.486082355647575 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
site_id,0,0.0,3.0,18.909593
building_id,0,0.0,0,0.069013
primary_use,0,0.0,Education,37.888199
square_feet,0,0.0,387638.0,0.483092
year_built,774,53.416149,1976.0,3.795721
floor_count,1094,75.500345,1.0,7.522429


In [15]:
#Filling in missing values in the building meta-data column
#Make a copy so we are not changing the initial data
build_meta_f = build_meta.copy()
#fill all the missing floor counts by the mode (1) and the missing year built by the mean. Nothing else is missing
build_meta_f.fillna({'floor_count':1, 'year_built':int(build_meta['year_built'].mean())}, inplace=True) 
#this is the only categorical column. Convert it to the category dtype (it is one-hot encoded later, during feature engineering)
build_meta_f['primary_use'] = build_meta_f['primary_use'].astype('category') 

In [16]:
get_missing_info(weather_train)
# get_missing_info(weather_test)

total empty percent: 10.876365408356566 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
site_id,0,0.0,0.0,6.284476
timestamp,0,0.0,2016-01-01 01:00:00,0.011447
air_temperature,55,0.03935,15.0,1.947443
cloud_coverage,69173,49.489529,0.0,24.232863
dew_temperature,113,0.080845,10.0,1.973915
precip_depth_1_hr,50289,35.979052,0.0,55.740379
sea_level_pressure,10618,7.596603,1015.2,0.608844
wind_direction,6268,4.484414,0.0,9.410974
wind_speed,304,0.217496,2.1,10.288825


In [17]:
#Forward and backward filling missing data in the weather dataset by up to 24 hours
#Train weather
weather_train_f = weather_train.copy() #make a copy so we aren't changing our original data
weather_train_f['timestamp'] = pd.to_datetime(weather_train_f['timestamp']) #turn the timestamp column into a datetime object
weather_train_f = weather_train_f.sort_values(by=['site_id', 'timestamp']) #sort values by site id then timestamp

#note: the fills run over the whole sorted frame, not per site, so a value can carry across a site boundary
weather_train_f.fillna(method = 'ffill', inplace=True, limit = 24) #forward fill the missing data up to 24 rows (hours)
weather_train_f.fillna(method = 'bfill', inplace=True, limit = 24) #backfill up to 24 rows (hours)
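
One possible variant, not what this notebook does, is to apply the fills within each site so that a value from the end of one site can never fill a gap at the start of the next. A minimal sketch on toy data, assuming the same `site_id` and `timestamp` column names:

```python
import numpy as np
import pandas as pd

# Toy weather frame: two sites, three hourly rows each (values are made up).
toy = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"] * 2
    ),
    "air_temperature": [10.0, np.nan, np.nan, np.nan, 20.0, np.nan],
})
toy = toy.sort_values(["site_id", "timestamp"])

cols = ["air_temperature"]
filled = toy.copy()
# Forward fill, then backward fill, each restricted to rows of the same site.
filled[cols] = toy.groupby("site_id")[cols].ffill(limit=24)
filled[cols] = filled.groupby("site_id")[cols].bfill(limit=24)

print(filled)
```

In this toy frame, a plain forward fill over the sorted rows would copy site 0's temperature into the first row of site 1; the grouped version fills that row from site 1's own data instead.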


In [18]:
get_missing_info(weather_train_f)

total empty percent: 5.124658474017792 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
site_id,0,0.0,0.0,6.284476
timestamp,0,0.0,2016-01-01 01:00:00,0.011447
air_temperature,0,0.0,15.0,1.947443
cloud_coverage,20378,14.579354,0.0,36.310303
dew_temperature,0,0.0,10.0,1.975346
precip_depth_1_hr,35309,25.261674,0.0,55.996509
sea_level_pressure,8779,6.280898,1015.2,0.61743
wind_direction,0,0.0,0.0,10.343199
wind_speed,0,0.0,2.1,10.333183


In [19]:
#Train data
missing_cols = [col for col in weather_train_f.columns if weather_train_f[col].isna().any()] 
fill_lib = weather_train_f.groupby('site_id')[missing_cols].transform('mean')#stores the mean of each feature for each site id
weather_train_f.fillna(fill_lib, inplace=True) #for each feature with missing values, fill the missing entry with the mean for that site


In [20]:
get_missing_info(weather_train_f)

total empty percent: 0.0 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
site_id,0,0.0,0.0,6.284476
timestamp,0,0.0,2016-01-01 01:00:00,0.011447
air_temperature,0,0.0,15.0,1.947443
cloud_coverage,0,0.0,0.0,36.310303
dew_temperature,0,0.0,10.0,1.975346
precip_depth_1_hr,0,0.0,0.0,68.46172
sea_level_pressure,0,0.0,1016.7,6.229386
wind_direction,0,0.0,0.0,10.343199
wind_speed,0,0.0,2.1,10.333183


####Converting GMT Time of Weather data to Local time

In [21]:
import datetime
timediff = {0:4,1:0,2:7,3:4,4:7,5:0,6:4,7:4,8:4,9:5,10:7,11:4,12:0,13:5,14:4,15:4}
weather_train_f['time_diff']= weather_train_f['site_id'].map(timediff)

weather_train_f['time_diff'] = weather_train_f['time_diff'].apply(lambda x: datetime.timedelta(hours=x))

weather_train_f['timestamp'] = pd.to_datetime(weather_train_f['timestamp'])
weather_train_f['timestamp'] = weather_train_f['timestamp'] - weather_train_f['time_diff']
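
The same shift can be written without a row-wise `apply`. The sketch below uses a three-site subset of the offset mapping and a made-up timestamp; `local_timestamp` is a hypothetical column name used only here.

```python
import pandas as pd

# Subset of the site -> offset-in-hours mapping, for illustration only.
toy_offsets = {0: 4, 1: 0, 2: 7}

toy_weather = pd.DataFrame({
    "site_id": [0, 1, 2],
    "timestamp": pd.to_datetime(["2016-01-01 12:00"] * 3),
})

# Convert the integer hour offsets to timedeltas in one vectorized call.
offset = pd.to_timedelta(toy_weather["site_id"].map(toy_offsets), unit="h")
toy_weather["local_timestamp"] = toy_weather["timestamp"] - offset

print(toy_weather)
```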

In [22]:
#merge the building meta data and weather data into the train data
train_m = train.merge(build_meta_f, how='left', on = ['building_id'], validate='many_to_one') #merge the building meta data into the train data
train_m = train_m.merge(weather_train_f, how='left', on = ['site_id', 'timestamp'], validate='many_to_one')#add weather data to each time entry for each site ID
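
A left merge keeps every meter reading, so any reading without a matching weather row ends up with missing weather values. One way to see which rows those are is the `indicator` option of `merge`, sketched here on toy frames (all values are made up):

```python
import pandas as pd

readings = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
    ),
    "meter_reading": [10.0, 12.0, 7.0],
})
weather = pd.DataFrame({
    "site_id": [0, 1],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:00"]),
    "air_temperature": [15.0, 3.0],
})

# validate raises if a (site_id, timestamp) pair appears twice in weather;
# indicator adds a _merge column telling where each row found a match.
merged = readings.merge(
    weather,
    how="left",
    on=["site_id", "timestamp"],
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]

print(unmatched)
```

A plausible source of unmatched rows in this notebook is the timestamp shift applied to the weather data, which can leave some hours at the edges of the year without a weather record; that is an assumption, not something verified here.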

In [23]:
get_missing_info(train_m)
#get_missing_info(test_m)

total empty percent: 0.2299073486274842 %
columns that are more than 97% mode: []


Unnamed: 0,num_missing,percent_missing,mode,percent_mode
building_id,0,0.0,1249,0.182772
meter,0,0.0,0.0,59.71653
timestamp,0,0.0,2016-01-30 01:00:00,0.021415
meter_reading,0,0.0,0.0,9.280517
site_id,0,0.0,13.0,13.425997
primary_use,0,0.0,Education,40.222712
square_feet,0,0.0,387638.0,0.301299
year_built,0,0.0,1967.0,60.731485
floor_count,0,0.0,1.0,87.220427
air_temperature,981,0.488553,24.4,1.994552


In [24]:
#train_data
train_m = train_m.sort_values(by=['building_id', 'timestamp'])
train_m.fillna(method = 'ffill', inplace=True)


Converting columns with unnecessarily wide data types to narrower data types that use less memory

In [25]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

train_m = reduce_mem_usage(train_m)

Mem. usage decreased to 10.53 Mb (59.9% reduction)
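
The memory saving comes at a cost in precision: `float16` keeps only about three significant decimal digits, which is why downcast weather values appear rounded. A small sketch using the sea level pressure mode reported earlier (1015.2):

```python
import numpy as np

pressure = 1015.2  # sea_level_pressure mode from the missing-value summary

# Round-trip the value through each narrower float type.
as_f16 = float(np.float16(pressure))
as_f32 = float(np.float32(pressure))

print("float16:", as_f16)
print("float32:", as_f32)
```

If that rounding matters for a feature, one option is to stop the downcast at `float32` for that column; whether it affects model quality here has not been checked.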


####Outlier Detection & Fixing

In [26]:
for m in range(4):
    idxm = train_m[train_m['meter']==m].groupby('timestamp')['meter_reading'].idxmax() #index of max meter reading for the timestamp
    print('meter {}'.format(m))
    #print the number of hours the building was the highest consumer for the top 5 buildings
    print(train_m.loc[idxm, 'building_id'].value_counts().iloc[:5]) 

meter 0
794     104
801      90
795      86
1159     83
798      82
Name: building_id, dtype: int64
meter 1
1289    90
1284    86
1298    79
1258    78
209     75
Name: building_id, dtype: int64
meter 2
1148    98
1156    90
1197    87
1214    86
1284    83
Name: building_id, dtype: int64
meter 3
1272    94
1018    88
1317    86
1267    86
1273    85
Name: building_id, dtype: int64
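
To make the counting logic above concrete, here is the same `idxmax` pattern on a toy frame with two buildings and three hours (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00"] * 2
        + ["2016-01-01 01:00"] * 2
        + ["2016-01-01 02:00"] * 2
    ),
    "building_id": [1, 2, 1, 2, 1, 2],
    "meter_reading": [5.0, 9.0, 7.0, 3.0, 1.0, 8.0],
})

# Row label of the largest reading within each hour.
top_idx = toy.groupby("timestamp")["meter_reading"].idxmax()
# How many hours each building was the top consumer.
top_counts = toy.loc[top_idx, "building_id"].value_counts()

print(top_counts)
```

Building 2 has the highest reading in two of the three hours and building 1 in one, so a building that dominates the counts is a candidate outlier.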


In [27]:
print('Mean meter 0 reading: outlier building #803')
print(train_m[(train_m['building_id']==803) & (train_m['meter']==0)]['meter_reading'].mean())
print('Mean meter 0  reading: overall')
print(train_m[(train_m['meter']==0)]['meter_reading'].mean())

Mean meter 0 reading: outlier building #803
3934.5771
Mean meter 0  reading: overall
171.37746


Scaling down the meter readings for each building + meter type group

In [28]:
#We would like to rescale the meter reading column for each building and meter type to prevent outliers from skewing the results.
#This is a class to achieve that for any chosen groups. It is a modified version of code by Szymon Maszke: 
#https://stackoverflow.com/questions/55601928/apply-multiple-standardscalers-to-individual-groups
from sklearn.base import clone
class GroupTargetTransform:
    def __init__(self, transformation):
        self.transformation = transformation
        self._group_transforms = {} #this dictionary will hold the fitted transform for each group

    def _call_with_function(self, X, y, function: str):
        yhat = pd.Series(dtype = 'float32')#this will hold the rescaled target data
        X['target'] = pd.Series(y, index=X.index)
        for gr in X.groupby(self.features):
            n = gr[0] #this is a tuple id for the group
            g_X = gr[1] #this is the group dataframe
            g_yhat = getattr(self._group_transforms[n], function)(g_X['target'].values.reshape(-1,1))#scale the target variable
            g_yhat = pd.Series(g_yhat.flatten(), index = g_X.index)
            yhat = pd.concat([yhat, g_yhat]) #Series.append was removed in pandas 2.0
        X.drop('target', axis=1, inplace = True)
        return yhat.sort_index()
    
    def fit(self, X, y, features):
        self.features = features
        X['target'] = pd.Series(y, index=X.index) 
        for gr in X.groupby(self.features):
            n = gr[0] #this is a tuple id for the group
            g_X = gr[1] #this is the group dataframe
            sc = clone(self.transformation) #create a new instance of the transform
            self._group_transforms[n] = sc.fit(g_X['target'].values.reshape(-1,1))
        X.drop('target', axis=1, inplace=True)
        return self

    def transform(self, X, y):
        return self._call_with_function(X, y, "transform")

    def fit_transform(self, X, y, features):
        self.fit(X, y, features)
        return self.transform(X, y)

    def inverse_transform(self, X, y):
        return self._call_with_function(X, y, "inverse_transform")
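
For intuition, the per-group min-max scaling that the class performs can be sketched with `groupby(...).transform`. This is an illustrative equivalent on toy data, not a replacement for the class; `upper`, `span`, `scaled` and `restored` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "building_id": [1, 1, 1, 2, 2],
    "meter": [0, 0, 0, 0, 0],
    "meter_reading": [0.0, 50.0, 100.0, 5.0, 5.0],
})

upper = 2100.0  # top of the target range, as in feature_range=(0, 2100)
grouped = toy.groupby(["building_id", "meter"])["meter_reading"]
lo = grouped.transform("min")
hi = grouped.transform("max")
# A constant group has zero range; use 1.0 there to avoid dividing by zero.
span = (hi - lo).replace(0.0, 1.0)

# Forward transform: map each group's readings onto [0, upper].
scaled = (toy["meter_reading"] - lo) / span * upper
# Inverse transform: recover the original readings.
restored = scaled / upper * span + lo

print(scaled.tolist())
```

Either way, the scaling statistics exist only for groups seen during fitting, so a building and meter pair that was never fitted cannot be inverse-transformed.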

In [29]:
train_m.meter_reading.mean()

2290.095

In [30]:
#rescale the target variable for each building and meter type.
from sklearn.preprocessing import MinMaxScaler

scaler = GroupTargetTransform(MinMaxScaler(feature_range = (0,2100))) #2100 is on the order of the average meter reading for all the train data (about 2290)
train_m['meter_reading_rescaled'] = scaler.fit_transform(train_m, train_m['meter_reading'], ['building_id', 'meter'])
#convert to log(y+1) so the RMSE evaluation metric is actually giving the RMSLE (the evaluation metric for the competition)
train_m['meter_reading_rescaled'] = np.log1p(train_m['meter_reading_rescaled']) 
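
The claim in the comment above can be checked numerically: the RMSE computed on `log1p`-transformed values equals the RMSLE of the untransformed values. A small sketch with made-up numbers, where `rmse` and `rmsle` are helper functions defined only for this example:

```python
import numpy as np

y_true = np.array([0.0, 10.0, 100.0])
y_pred = np.array([1.0, 12.0, 80.0])


def rmse(a, b):
    # Root mean squared error.
    return float(np.sqrt(np.mean((a - b) ** 2)))


def rmsle(a, b):
    # Root mean squared logarithmic error, written out from its definition.
    return float(np.sqrt(np.mean((np.log(1.0 + a) - np.log(1.0 + b)) ** 2)))


rmse_on_log = rmse(np.log1p(y_true), np.log1p(y_pred))
# expm1 undoes log1p, which is how predictions are mapped back later.
roundtrip = np.expm1(np.log1p(y_true))

print(rmse_on_log, rmsle(y_true, y_pred))
```

Strictly speaking, the log is applied here to the rescaled readings, so an RMSE in that space corresponds to the RMSLE of the rescaled values rather than of the raw meter readings.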

In [31]:
train_m.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed,time_diff,meter_reading_rescaled
4232,0,0,2016-01-08 16:00:00,0.0,0,Education,7432,2008.0,1.0,20.59375,8.0,17.796875,0.0,1012.0,0.0,0.0,0 days 04:00:00,0.0
9437,0,0,2016-01-18 04:00:00,0.0,0,Education,7432,2008.0,1.0,8.898438,0.0,3.900391,0.0,1019.0,330.0,3.599609,0 days 04:00:00,0.0
11516,0,0,2016-01-22 01:00:00,0.0,0,Education,7432,2008.0,1.0,15.601562,4.0,9.398438,0.0,1017.5,140.0,4.601562,0 days 04:00:00,0.0
18536,0,0,2016-02-03 19:00:00,0.0,0,Education,7432,2008.0,1.0,25.59375,6.0,18.90625,0.0,1017.0,160.0,3.599609,0 days 04:00:00,0.0
18743,0,0,2016-02-04 04:00:00,0.0,0,Education,7432,2008.0,1.0,20.59375,6.0,18.90625,0.0,1017.5,160.0,3.599609,0 days 04:00:00,0.0


In [32]:
train_m.drop(['time_diff'],axis=1,inplace=True)

##4. Feature Engineering

####Adding time based features

In [33]:
import holidays
holidays_list = []
for item in holidays.USA(years=2016).items():
  holidays_list.append(item[0])
holidays_list

[datetime.date(2016, 1, 1),
 datetime.date(2016, 1, 18),
 datetime.date(2016, 2, 15),
 datetime.date(2016, 5, 30),
 datetime.date(2016, 7, 4),
 datetime.date(2016, 9, 5),
 datetime.date(2016, 10, 10),
 datetime.date(2016, 11, 11),
 datetime.date(2016, 11, 24),
 datetime.date(2016, 12, 25),
 datetime.date(2016, 12, 26)]

In [34]:
train_m['is_weekend'] = train_m['timestamp'].dt.weekday.isin([5,6]).astype(int)
train_m['is_holiday'] = train_m['timestamp'].dt.date.isin(holidays_list)
train_m['age'] = 2016 - train_m['year_built'] 
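
The flag logic can be checked on a handful of dates. The sketch below hard-codes two dates from the holiday list above instead of importing `holidays`, and casts both flags to integers so that they share a dtype; that cast is a choice made for this example only.

```python
import datetime

import pandas as pd

# Two entries copied from the 2016 holiday list printed above.
holiday_dates = [datetime.date(2016, 1, 1), datetime.date(2016, 1, 18)]

toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-01-01 09:00",  # Friday, holiday
        "2016-01-02 09:00",  # Saturday
        "2016-01-18 04:00",  # Monday, holiday
        "2016-01-19 04:00",  # Tuesday
    ])
})

# weekday: Monday is 0, so 5 and 6 are Saturday and Sunday.
toy["is_weekend"] = toy["timestamp"].dt.weekday.isin([5, 6]).astype(int)
toy["is_holiday"] = toy["timestamp"].dt.date.isin(holiday_dates).astype(int)

print(toy)
```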

In [35]:
train_m.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed,meter_reading_rescaled,is_weekend,is_holiday,age
4232,0,0,2016-01-08 16:00:00,0.0,0,Education,7432,2008.0,1.0,20.59375,8.0,17.796875,0.0,1012.0,0.0,0.0,0.0,0,False,8.0
9437,0,0,2016-01-18 04:00:00,0.0,0,Education,7432,2008.0,1.0,8.898438,0.0,3.900391,0.0,1019.0,330.0,3.599609,0.0,0,True,8.0
11516,0,0,2016-01-22 01:00:00,0.0,0,Education,7432,2008.0,1.0,15.601562,4.0,9.398438,0.0,1017.5,140.0,4.601562,0.0,0,False,8.0
18536,0,0,2016-02-03 19:00:00,0.0,0,Education,7432,2008.0,1.0,25.59375,6.0,18.90625,0.0,1017.0,160.0,3.599609,0.0,0,False,8.0
18743,0,0,2016-02-04 04:00:00,0.0,0,Education,7432,2008.0,1.0,20.59375,6.0,18.90625,0.0,1017.5,160.0,3.599609,0.0,0,False,8.0


####One hot encoding categorical columns

In [36]:
train_m.primary_use.value_counts()

Education                        80766
Office                           43971
Entertainment/public assembly    22419
Lodging/residential              21456
Public services                  16470
Healthcare                        3975
Other                             2389
Parking                           2125
Manufacturing/industrial          1209
Food sales and service            1156
Warehouse/storage                 1103
Retail                            1094
Services                          1032
Technology/science                 788
Utility                            542
Religious worship                  302
Name: primary_use, dtype: int64

In [37]:
categories = train_m.primary_use.unique()

In [38]:
for category in categories:
  train_m['primary_use_is_'+category] = train_m.primary_use==category

In [39]:
train_m = train_m.drop(['primary_use'], axis=1)
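
The loop-and-drop steps above can also be expressed with `pd.get_dummies`, which creates the indicator columns and removes the source column in one call. A sketch on toy data; the `primary_use_is` prefix is chosen to reproduce the column names used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "building_id": [0, 1, 2],
    "primary_use": ["Education", "Office", "Education"],
})

# One indicator column per category; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["primary_use"], prefix="primary_use_is")

print(encoded.columns.tolist())
```

The dtype of the indicator columns depends on the pandas version, so it may be worth casting them explicitly if a model expects numeric input.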

##5. Training & Evaluating different Models

####Splitting the dataset

In [40]:
from sklearn.model_selection import train_test_split

X = train_m.dropna(subset=['meter_reading']) #drop all rows where the meter reading is not included
X = X.sort_values(by=['timestamp'], axis=0) #ensure X is sorted by timestamp for later timeseries cross-validation

builds = X['building_id'].unique() #array of building ids in the dataset
build_train, build_val = train_test_split(builds, test_size = 0.1, random_state=0) #hold out 10% of the buildings for validation

train = X.loc[(X['timestamp']<'2016-10-15') 
          & (X['building_id'].isin(build_train))] #we will train on only roughly the first 80% of the year and 90% of the buildings
val_t = X.loc[(X['timestamp']>='2016-10-15') & (X['building_id'].isin(build_train))] #rest of the year and same buildings as above
val_b = X.loc[(X['building_id'].isin(build_val))] #full year and the rest of the buildings
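
The split produces one training set and two validation sets: later timestamps for known buildings, and all timestamps for unseen buildings. The sketch below reproduces that layout on a toy frame and uses NumPy to pick the held-out building, which is a substitution for `train_test_split` made only to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

# Toy data: 10 buildings, one row per month start in 2016.
stamps = pd.date_range("2016-01-01", "2016-12-31", freq="MS")
toy = pd.DataFrame(
    [(b, t) for b in range(10) for t in stamps],
    columns=["building_id", "timestamp"],
)

rng = np.random.default_rng(0)
builds = toy["building_id"].unique()
build_val = rng.choice(builds, size=1, replace=False)  # hold out 10% of buildings
in_val_builds = toy["building_id"].isin(build_val)
cutoff = pd.Timestamp("2016-10-15")

train_part = toy[(toy["timestamp"] < cutoff) & ~in_val_builds]
val_time = toy[(toy["timestamp"] >= cutoff) & ~in_val_builds]
val_build = toy[in_val_builds]

print(len(train_part), len(val_time), len(val_build))
```

The three parts are disjoint and together cover every row, so no reading is used for both training and validation.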

####Utility Functions for Evaluation

In [41]:
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, mean_squared_log_error

#defining a couple of functions for later use
def clip(x):
    return np.clip(x, a_min=0, a_max=None)
def rmse(y, y_pred):
    out = np.sqrt(mean_squared_error(clip(y), clip(y_pred)))
    return out

In [42]:
def evaluate(model, X_val_t, y_val_t, X_val_b, y_val_b):
  print('Time predictions...')
  preds = clip(model.predict(X_val_t)) #make time predictions
  preds_inv = scaler.inverse_transform(X_val_t, np.expm1(preds)) #convert back to original scale, remembering to invert the log transform
  y_val_t = y_val_t.sort_index()
  score = mean_absolute_error(preds_inv, y_val_t)
  print('Mean absolute error - time prediction:', score)
  RMSLE = np.sqrt(mean_squared_log_error(preds_inv, y_val_t))
  print('RMSLE - time prediction:', RMSLE)

  print('Building predictions...')
  preds = clip(model.predict(X_val_b))
  preds_inv = scaler.inverse_transform(X_val_b, np.expm1(preds))
  y_val_b = y_val_b.sort_index()
  score = mean_absolute_error(preds_inv, y_val_b)
  print('Mean absolute error - new buildings:', score)
  RMSLE = np.sqrt(mean_squared_log_error(preds_inv, y_val_b))
  print('RMSLE - new buildings:', RMSLE)

####Training

In [43]:
y_train, y_val_t, y_val_b = train['meter_reading_rescaled'], val_t['meter_reading'], val_b['meter_reading'] #extracting the meter reading as our target variable
X_train, X_val_t, X_val_b = train.drop(['meter_reading', 'meter_reading_rescaled', 'timestamp','age'], axis=1), val_t.drop(['meter_reading', 'meter_reading_rescaled','timestamp','age'], axis=1), val_b.drop(['meter_reading','meter_reading_rescaled','timestamp','age'], axis=1)

In [44]:
X_val_t

Unnamed: 0,building_id,meter,site_id,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,...,primary_use_is_Parking,primary_use_is_Public services,primary_use_is_Warehouse/storage,primary_use_is_Food sales and service,primary_use_is_Religious worship,primary_use_is_Healthcare,primary_use_is_Utility,primary_use_is_Technology/science,primary_use_is_Manufacturing/industrial,primary_use_is_Services
157403,1288,0,14,164206,1967.0,1.0,6.101562,0.0,4.398438,0.0,...,False,False,False,False,False,False,False,False,False,False
157392,899,2,9,225014,1967.0,1.0,24.406250,4.0,20.593750,0.0,...,False,False,False,False,False,False,False,False,False,False
157399,1211,0,13,94988,1967.0,1.0,17.203125,6.0,12.203125,0.0,...,False,False,False,False,False,False,False,False,False,False
157402,1283,2,14,76537,1967.0,1.0,6.101562,0.0,4.398438,0.0,...,False,True,False,False,False,False,False,False,False,False
157388,808,0,8,9357,1967.0,1.0,23.906250,4.0,20.593750,0.0,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200785,418,0,3,86120,1926.0,1.0,5.601562,4.0,-5.000000,0.0,...,False,False,False,False,False,False,False,False,False,False
200787,559,0,3,48994,2015.0,1.0,5.601562,6.0,1.700195,0.0,...,False,False,False,False,False,False,False,False,False,False
200789,780,0,6,120836,1967.0,1.0,5.601562,4.0,-9.398438,0.0,...,False,False,False,False,False,False,False,False,False,False
200794,1254,2,14,24979,1967.0,1.0,0.600098,0.0,-5.000000,0.0,...,False,False,False,False,False,False,False,False,False,False


####1. Ridge Regression Model

In [45]:
from sklearn.linear_model import Ridge

In [46]:
%%time
model_rr = Ridge(random_state=42,alpha=0.9)
model_rr = model_rr.fit(X_train,y_train)

CPU times: user 334 ms, sys: 70.7 ms, total: 405 ms
Wall time: 401 ms


In [47]:
evaluate(model_rr, X_val_t, y_val_t, X_val_b, y_val_b)

Time predictions...
Mean absolute error - time prediction: 722.5108211530039
RMSLE - time prediction: 1.5440729728941012
Building predictions...
Mean absolute error - new buildings: 264.2590635233504
RMSLE - new buildings: 1.6007067899496794


####2. Random Forest Model

In [48]:
from sklearn.ensemble import RandomForestRegressor

In [49]:
%%time
model_rf = RandomForestRegressor(n_estimators=100,max_depth=10,random_state=42)
model_rf = model_rf.fit(X_train,y_train)

CPU times: user 59.8 s, sys: 65.1 ms, total: 59.8 s
Wall time: 59.4 s


In [50]:
evaluate(model_rf, X_val_t, y_val_t, X_val_b, y_val_b)

Time predictions...
Mean absolute error - time prediction: 1981.657130653935
RMSLE - time prediction: 1.5746471590092028
Building predictions...
Mean absolute error - new buildings: 216.59741385851024
RMSLE - new buildings: 1.4005996446229543


####3. XGBoost Model

In [51]:
from xgboost import XGBRegressor

In [52]:
%%time
model_xg = XGBRegressor(max_depth=5,random_state=42)
model_xg = model_xg.fit(X_train,y_train)

CPU times: user 21.3 s, sys: 100 ms, total: 21.4 s
Wall time: 22.6 s


In [53]:
evaluate(model_xg, X_val_t, y_val_t, X_val_b, y_val_b)

Time predictions...
Mean absolute error - time prediction: 1027.2732
RMSLE - time prediction: 1.5704128
Building predictions...
Mean absolute error - new buildings: 222.94385
RMSLE - new buildings: 1.4202152
