# Assessment 2 ML: Hackathon
### Supplement Sales Prediction
Your Client WOMart is a leading nutrition and supplement retail chain that offers a comprehensive range of products for all your wellness and fitness needs. 

WOMart follows a multi-channel distribution strategy with 350+ retail stores spread across 100+ cities. 

Effective forecasting of store sales gives essential insight into upcoming cash flow, so WOMart can plan cash flow more accurately at the store level.

Sales data for 18 months from 365 stores of WOMart is available, along with the Store Type, Location Type and Region Code of each store, whether the store offered a discount on each day, the number of orders each day, and so on.

Our task is to predict the store sales for each store in the test set for the next two months.

In [37]:
# Importing the required libraries
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.utils import check_array
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
%matplotlib inline

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
import os

### Import the train and test csv files

In [53]:
train = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\TRAIN1.csv',index_col= [0])
test = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\TEST_FINAL.csv',index_col= [0])
print(train.shape)
print(test.shape)

(188340, 9)
(22265, 7)


# Data Preprocessing Steps
### Data Sanity check

In [54]:
print(train.isnull().sum())
print("*"*20)
print(test.isnull().sum())


Store_id         0
Store_Type       0
Location_Type    0
Region_Code      0
Date             0
Holiday          0
Discount         0
#Order           0
Sales            0
dtype: int64
********************
Store_id         0
Store_Type       0
Location_Type    0
Region_Code      0
Date             0
Holiday          0
Discount         0
dtype: int64


In [4]:
print(train.head())
print("*"*100)
print(test.head())


          Store_id Store_Type Location_Type Region_Code        Date  Holiday  \
ID                                                                             
T1000001         1         S1            L3          R1  2018-01-01        1   
T1000002       253         S4            L2          R1  2018-01-01        1   
T1000003       252         S3            L2          R1  2018-01-01        1   
T1000004       251         S2            L3          R1  2018-01-01        1   
T1000005       250         S2            L3          R4  2018-01-01        1   

         Discount  #Order     Sales  
ID                                   
T1000001      Yes       9   7011.84  
T1000002      Yes      60  51789.12  
T1000003      Yes      42  36868.20  
T1000004      Yes      23  19715.16  
T1000005      Yes      62  45614.52  
****************************************************************************************************
          Store_id Store_Type Location_Type Region_Code        Date  Hol

### Feature Engineering
Define a function that returns 1 if a given date falls on a weekend and 0 otherwise.

In [5]:
# pandas is already imported as pd in the first cell

# Return 1 if the date falls on a weekend, 0 if it is a business day
def check_weekday(date):
    # bdate_range yields business days only, so it is empty for a Saturday or Sunday
    res = len(pd.bdate_range(date,date))
    
    if res == 0:
        out = 1   # weekend
    else:
        out = 0   # working day
    return(out)

# Quick check: 2020-08-17 is a Monday, 2020-08-16 is a Sunday
date = "2020-08-17"
check_weekday(date)

date = "2020-08-16"
check_weekday(date)


1

### Inserting the weekend column in train and test data.

In [56]:
train['weekend']= train['Date'].apply(check_weekday)
test['weekend']= test['Date'].apply(check_weekday)
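
Applying `check_weekday` row by row builds one `bdate_range` per record. A vectorised alternative is sketched below, assuming the flag is only meant to mark Saturdays and Sundays; the `demo` frame and its dates are made up for illustration. Like `pd.bdate_range` with its default settings, it does not treat public holidays specially.

```python
import pandas as pd

# Toy frame standing in for train/test; only the 'Date' column matters here.
demo = pd.DataFrame({"Date": ["2020-08-14", "2020-08-15", "2020-08-16", "2020-08-17"]})

# dayofweek is 0 for Monday ... 6 for Sunday, so values >= 5 are Saturday/Sunday.
demo["weekend"] = (pd.to_datetime(demo["Date"]).dt.dayofweek >= 5).astype(int)
print(demo)
```

On the real data the same expression could be assigned to `train['weekend']` and `test['weekend']` in place of `.apply(check_weekday)`.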


In [57]:
#See the distribution of weekend variable in train data
train.weekend.value_counts()

0    135050
1     53290
Name: weekend, dtype: int64

In [58]:
#See the distribution of weekend variable in test data
test.weekend.value_counts()

0    15695
1     6570
Name: weekend, dtype: int64

## Train dataset
### Introducing dummy variables for all categorical variables by dropping the first dummy variable

In [59]:
Region_Code = pd.get_dummies(train.Region_Code, prefix='Region_Code', drop_first=True)
Store_id = pd.get_dummies(train.Store_id, prefix='Store_id', drop_first=True)
Store_Type = pd.get_dummies(train.Store_Type, prefix='Store_Type', drop_first=True)
Location_Type = pd.get_dummies(train.Location_Type, prefix='Location_Type', drop_first=True)
Holiday = pd.get_dummies(train.Holiday, prefix='Holiday', drop_first=True)
Discount = pd.get_dummies(train.Discount, prefix='Discount', drop_first=True)
weekend = pd.get_dummies(train.weekend, prefix='weekend', drop_first=True)

### Original categorical variables used to create dummy variables are dropped from the original dataset

In [60]:
train.drop(["Region_Code","Store_id","Store_Type","Location_Type","Holiday","Discount","weekend"], axis = 1,inplace=True)

### The dummy variables are concatenated to the original dataset

In [61]:
train =  pd.concat([train, Region_Code,Store_id,Store_Type,Location_Type,Region_Code,Holiday,Discount,weekend], axis=1)

In [62]:
train.head()

ID,Date,#Order,Sales,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2,Region_Code_R3,Region_Code_R4,Holiday_1,Discount_Yes,weekend_1
T1000001,2018-01-01,9,7011.84,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0
T1000002,2018-01-01,60,51789.12,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,1,0
T1000003,2018-01-01,42,36868.2,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,1,0
T1000004,2018-01-01,23,19715.16,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0
T1000005,2018-01-01,62,45614.52,0,0,1,0,0,0,0,...,0,1,0,0,0,0,1,1,1,0


## Test dataset
### Introducing dummy variables for all categorical variables by dropping the first dummy variable

In [63]:
Region_Code = pd.get_dummies(test.Region_Code, prefix='Region_Code', drop_first=True)
Store_id = pd.get_dummies(test.Store_id, prefix='Store_id', drop_first=True)
Store_Type = pd.get_dummies(test.Store_Type, prefix='Store_Type', drop_first=True)
Location_Type = pd.get_dummies(test.Location_Type, prefix='Location_Type', drop_first=True)
Holiday = pd.get_dummies(test.Holiday, prefix='Holiday', drop_first=True)
Discount = pd.get_dummies(test.Discount, prefix='Discount', drop_first=True)
weekend = pd.get_dummies(test.weekend, prefix='weekend', drop_first=True)

### Original categorical variables used to create dummy variables are dropped from the original dataset

In [64]:
test.drop(["Region_Code","Store_id","Store_Type","Location_Type","Holiday","Discount","weekend"], axis = 1,inplace=True)

### The dummy variables are concatenated to the original dataset

In [65]:
test =  pd.concat([test, Region_Code,Store_id,Store_Type,Location_Type,Region_Code,Holiday,Discount,weekend], axis=1)
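
Encoding train and test separately only works if both frames contain the same categories. A more compact variant is sketched below on made-up frames: `pd.get_dummies` is given the categorical columns directly, and the test frame is then reindexed to the training feature columns. The names `toy_train`, `toy_test`, `cat_cols` and `feature_cols` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

cat_cols = ["Store_Type", "Region_Code"]  # illustrative subset of the categorical columns

toy_train = pd.DataFrame({
    "Store_Type": ["S1", "S2", "S3"],
    "Region_Code": ["R1", "R2", "R1"],
    "Sales": [10.0, 20.0, 30.0],
})
toy_test = pd.DataFrame({"Store_Type": ["S1", "S2"], "Region_Code": ["R2", "R1"]})

# One call per frame; each categorical column is encoded exactly once,
# so no dummy column name appears twice.
train_enc = pd.get_dummies(toy_train, columns=cat_cols, drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=cat_cols, drop_first=True)

# Give the test frame exactly the training feature columns; categories that
# never occur in the test rows become all-zero columns.
feature_cols = [c for c in train_enc.columns if c != "Sales"]
test_enc = test_enc.reindex(columns=feature_cols, fill_value=0)
print(test_enc)
```

Reindexing keeps the column order identical between the two frames, which matters because the fitted model is applied to the test features by position.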

In [66]:
test.head()

ID,Date,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2,Region_Code_R3,Region_Code_R4,Holiday_1,Discount_Yes,weekend_1
T1188341,2019-06-01,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,1
T1188342,2019-06-01,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
T1188343,2019-06-01,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
T1188344,2019-06-01,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
T1188345,2019-06-01,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1


### Divide the data into train and test set

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
#Remove the '#Order' column from the train data as it is not present in the test data
train1 = train.copy()
del train1['#Order']

In [69]:
# Put the feature variables into X
X = train1.drop(['Sales'], axis=1)
# Put the response variable into y
y = train1.loc[:,['Sales']]
# Split the data with 30% held out for evaluation and random_state set to 101
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state = 101)
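
Because the goal is to forecast the two months that follow the training period, one alternative to a random split is to hold out the most recent dates instead. The sketch below shows the idea on a made-up frame; `toy`, `cutoff` and the chosen date are illustrative assumptions, not part of the original workflow.

```python
import pandas as pd

# Ten consecutive days of made-up sales.
toy = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "Sales": range(10),
})

cutoff = pd.Timestamp("2018-01-08")  # hypothetical cut-off date

# Everything before the cut-off is used for fitting, the rest for evaluation.
fit_part = toy[toy["Date"] < cutoff]
holdout_part = toy[toy["Date"] >= cutoff]
print(len(fit_part), len(holdout_part))
```

With such a split, every hold-out row is later than every fitting row, which mirrors how the model is used on the real test file.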

In [70]:
X_train.head()

ID,Date,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2,Region_Code_R3,Region_Code_R4,Holiday_1,Discount_Yes,weekend_1
T1153063,2019-02-24,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
T1143888,2019-01-30,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
T1120875,2018-11-28,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
T1055894,2018-06-03,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
T1073479,2018-07-21,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,1


In [71]:
X_test.head()

ID,Date,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2,Region_Code_R3,Region_Code_R4,Holiday_1,Discount_Yes,weekend_1
T1114697,2018-11-11,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,1
T1185946,2019-05-25,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,1
T1150535,2019-02-17,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,1
T1131591,2018-12-27,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
T1108082,2018-10-24,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0


In [72]:
y_train.head()

ID,Sales
T1153063,39897.0
T1143888,29214.0
T1120875,39690.0
T1055894,43539.0
T1073479,63525.0


In [73]:
y_test.head()

ID,Sales
T1114697,57006.0
T1185946,33468.0
T1150535,36093.0
T1131591,33882.0
T1108082,24013.44


In [74]:
# Remove the "Date" column from the features of train and test data ("ID" is the index, not a feature column).
X_train = X_train.drop(['Date'], axis=1)
X_test = X_test.drop(['Date'], axis=1)


In [75]:
X_train.head()

ID,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,Store_id_8,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2,Region_Code_R3,Region_Code_R4,Holiday_1,Discount_Yes,weekend_1
T1153063,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
T1143888,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
T1120875,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
T1055894,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
T1073479,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,1


### Save the preprocessed and split data for reuse

In [76]:
X_train.to_csv("X_train.csv")
X_test.to_csv("X_test.csv")
y_train.to_csv("y_train.csv")
y_test.to_csv("y_test.csv")
test.to_csv("test.csv")

### Read the preprocessed train and test data

In [77]:
X_train = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\X_train.csv',index_col= [0])
X_test = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\X_test.csv',index_col= [0])
y_train = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\y_train.csv',index_col= [0])
y_test = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\y_test.csv',index_col= [0])
test = pd.read_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\test.csv',index_col= [0])


In [78]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(test.shape)

(131838, 380)
(131838, 1)
(56502, 380)
(56502, 1)
(22265, 381)


In [79]:
X_train.head()

ID,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,Store_id_8,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2.1,Region_Code_R3.1,Region_Code_R4.1,Holiday_1,Discount_Yes,weekend_1
T1153063,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
T1143888,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
T1120875,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
T1055894,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
T1073479,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,1
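
The `.1` suffixes on the second set of `Region_Code` columns come from the CSV round trip: the `Region_Code` dummies are concatenated twice, and `pd.read_csv` renames repeated column names when it reads the file back. The sketch below reproduces this on a made-up one-column frame and shows one possible way to drop the repeats; the names `a`, `dup` and `deduped` are illustrative only.

```python
import io
import pandas as pd

a = pd.DataFrame({"Region_Code_R2": [0, 1]})
dup = pd.concat([a, a], axis=1)  # same frame twice -> duplicate column names
print(list(dup.columns))

# Write to an in-memory CSV and read it back, as the notebook does with files.
buf = io.StringIO()
dup.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf, index_col=[0])
print(list(round_trip.columns))

# Keep only the first occurrence of each column name.
deduped = dup.loc[:, ~dup.columns.duplicated()]
print(list(deduped.columns))
```

The repeated columns carry no extra information, so dropping them before modelling would reduce the feature count without changing what the model can learn.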


In [80]:
y_train.head()

ID,Sales
T1153063,39897.0
T1143888,29214.0
T1120875,39690.0
T1055894,43539.0
T1073479,63525.0


In [82]:
y_test.head()

ID,Sales
T1114697,57006.0
T1185946,33468.0
T1150535,36093.0
T1131591,33882.0
T1108082,24013.44


In [83]:
test.head()

ID,Date,Region_Code_R2,Region_Code_R3,Region_Code_R4,Store_id_2,Store_id_3,Store_id_4,Store_id_5,Store_id_6,Store_id_7,...,Location_Type_L2,Location_Type_L3,Location_Type_L4,Location_Type_L5,Region_Code_R2.1,Region_Code_R3.1,Region_Code_R4.1,Holiday_1,Discount_Yes,weekend_1
T1188341,2019-06-01,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,1
T1188342,2019-06-01,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
T1188343,2019-06-01,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
T1188344,2019-06-01,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
T1188345,2019-06-01,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1


# Model building 
### Random Forest



In [33]:
## Pipeline estimator: standardise the features, then fit a random forest
standardscaler = StandardScaler()
randomforestregressor = RandomForestRegressor(random_state=1,n_jobs = -1,verbose = 1)
pipeline = make_pipeline(standardscaler,randomforestregressor)
randomforestregressor

RandomForestRegressor(n_jobs=-1, random_state=1, verbose=1)

In [34]:
# fit model on training data
pipeline.fit(X_train,y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  4.4min finished


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(n_jobs=-1, random_state=1, verbose=1))])

In [35]:
# Predict the sales of the test data
y_test['rf_without_tune_predicted_sales'] = pipeline.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.1s finished


### Mean Squared Error and Root Mean Square Error are used as the performance metrics

In [38]:
# Calculate the MSE and RMSE on the held-out 30% of the training data
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test['Sales'],y_test['rf_without_tune_predicted_sales'])
print("MSE is ",mse)
print("RMSE is ",mse**(0.5))

MSE is  108922535.73891221
RMSE is  10436.59598427151
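
For reference, the two metrics are the mean of the squared errors and its square root. A minimal sketch with made-up numbers (`y_true` and `y_pred` are illustrative, not taken from the data):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Squared errors are 100, 100 and 900, so the MSE is 1100 / 3.
mse = np.mean((y_true - y_pred) ** 2)
# RMSE is the square root of the MSE and is in the same units as Sales.
rmse = np.sqrt(mse)
print(mse, rmse)
```

Because RMSE is expressed in sales units, it is the easier of the two to compare against typical daily sales values.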


In [84]:
# Drop the Date column from the data for which we are going to predict the sales (ID is the index).
test_d = test.drop([ 'Date'], axis=1)


In [86]:
# Create a dataframe holding the record IDs; the predicted Sales are added to it below
dictt = {'ID':test_d.index}
result = pd.DataFrame(dictt)

# predict the sales for the two months data
result['Sales'] = pipeline.predict(test_d)

# Write the data to our local folder.
result.to_csv(r"C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\submission_rf_without_tune.csv",index = False)


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.2s finished


## HYPERPARAMETER TUNING

### Tuning max_depth


Let's try to find the optimum value of ```max_depth``` and understand how it affects the overall performance of the ensemble.


In [None]:

# number of folds for k-fold CV
n_folds = 5

# values of max_depth to try: 2, 7, 12, 17
parameters = range(2, 20, 5)

# instantiate the model
rf = RandomForestRegressor()


# Instantiate GridSearchCV with rf, the parameter grid, cv and return_train_score=True.
# No scoring is passed, so the regressor's default score (R^2) is used.
rfrg =GridSearchCV(cv=n_folds, estimator=rf,param_grid={'max_depth':parameters}, return_train_score=True)


In [None]:
rfrg.fit(X_train, y_train)


In [None]:
pd.DataFrame(rfrg.cv_results_)
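
The full `cv_results_` table is wide. A sketch of how the relevant columns could be pulled out is shown below on a small synthetic problem, so that it runs in seconds; `X_toy`, `y_toy`, `search` and the two depth values are assumptions made for illustration. A large gap between `mean_train_score` and `mean_test_score` at a given depth is the usual sign of overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: the target depends mostly on the first feature.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 7]},
    cv=3,
    return_train_score=True,
)
search.fit(X_toy, y_toy)

# Keep only the columns needed to compare train and validation scores.
summary = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_train_score", "mean_test_score", "rank_test_score"]
]
print(summary)
```

The same column selection applies to the `rfrg.cv_results_` frames produced in the tuning cells, with `param_max_depth` replaced by the parameter being tuned.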

### Tuning n_estimators
Let's try to find the optimum value of n_estimators and understand how it affects the overall performance. Notice that we specify an appropriately low value of max_depth so that the trees do not overfit.
<br>


In [None]:

# values of n_estimators to try: 20, 30, ..., 90
parameters = {'n_estimators': range(20, 100, 10)}

# instantiate the model (note we are specifying max_depth as 4)
rf = RandomForestRegressor(max_depth=4)
# Instantiate GridSearchCV with rf, the parameter grid, cv and return_train_score=True
# (no scoring is passed, so the regressor's default R^2 score is used)
rfrg =GridSearchCV(cv=5, estimator=rf,
             param_grid=parameters,
             return_train_score=True)
rfrg

In [None]:
rfrg.fit(X_train, y_train)


In [None]:
pd.DataFrame(rfrg.cv_results_)

### Tuning max_features

Let's see how the model performance varies with ```max_features```, which is the maximum number of features considered for splitting at a node.

In [None]:

# parameters to build the model on max_features with [4, 8, 14, 20, 24]
parameters = {'max_features': [4, 8, 14, 20, 24]}

# instantiate the model (note we are specifying a max_depth as 4)
rf = RandomForestRegressor(max_depth=4)



# Instantiate GridSearchCV with rf, the parameter grid, cv and return_train_score=True
# (no scoring is passed, so the regressor's default R^2 score is used)
rfrg = GridSearchCV(cv=5, estimator=rf,
             param_grid=parameters,return_train_score=True)

# display the configured search (it is fitted in the next cell)
rfrg

In [None]:
rfrg.fit(X_train, y_train)


In [None]:
pd.DataFrame(rfrg.cv_results_)

### Tuning min_samples_leaf

In [None]:

# values of min_samples_leaf to try: 200, 230
parameters = {'min_samples_leaf': range(200, 260,30 )}

# instantiate the model (max_depth is left at its default here)
rf = RandomForestRegressor()
# Instantiate GridSearchCV with rf, the parameter grid, cv and return_train_score=True
# (no scoring is passed, so the regressor's default R^2 score is used)
rfrg = GridSearchCV(cv=5, estimator=rf,
             param_grid=parameters,return_train_score=True)
# display the configured search (it is fitted in the next cell)
rfrg

In [None]:
rfrg.fit(X_train, y_train)


In [None]:
min_samples_leaf = pd.DataFrame(rfrg.cv_results_)

In [None]:
min_samples_leaf

In [None]:
# Save the output for analysis purpose
min_samples_leaf.to_csv(r'C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\min_samples_leaf_result_jupyter.csv')

### Tuning min_samples_split

In [None]:

# values of min_samples_split to try: 60, 80, 100, 120
parameters = {'min_samples_split': range(60, 140, 20)}

# instantiate the model
rf = RandomForestRegressor()

# Instantiate GridSearchCV with rf, the parameter grid, cv and return_train_score=True
# (no scoring is passed, so the regressor's default R^2 score is used)
rfrg = GridSearchCV(cv=5, estimator=rf,param_grid=parameters,return_train_score=True)

# display the configured search (it is fitted in the next cell)
rfrg


In [None]:
rfrg.fit(X_train, y_train)


In [None]:
pd.DataFrame(rfrg.cv_results_)

#### Random Search to Find Optimal Hyperparameters

We can now search over all of these hyperparameters together using RandomizedSearchCV, with the following parameter values:

    'max_depth': [30,35,40,45],
    'min_samples_leaf': range(100, 140, 20),
    'min_samples_split': range(100, 140, 20),
    'n_estimators': [20,24,26],
    'max_features': [50,60,70,80]

Note: we avoid an exhaustive grid search here because it would fit every one of the 4 × 2 × 2 × 3 × 4 = 192 combinations on each of the 5 folds (960 fits), which may be more than a local machine can handle. RandomizedSearchCV samples only a fixed number of combinations (10 by default), so it is much faster.
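
The size of this search space can be checked with scikit-learn's `ParameterGrid`. The sketch below counts the combinations and compares the number of model fits for an exhaustive search with the number for a random search at its default `n_iter=10`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as listed above, with the ranges written out as lists.
search_space = {
    "max_depth": [30, 35, 40, 45],
    "min_samples_leaf": list(range(100, 140, 20)),
    "min_samples_split": list(range(100, 140, 20)),
    "n_estimators": [20, 24, 26],
    "max_features": [50, 60, 70, 80],
}

n_combinations = len(ParameterGrid(search_space))
cv_folds = 5
n_iter = 10  # default number of sampled combinations in RandomizedSearchCV

grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds
print(n_combinations, grid_fits, random_fits)
```

Raising `n_iter` trades run time for a better chance of sampling a strong combination.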

In [None]:
# Parameter values for the random search to sample from
param_grid = {  'max_depth': [30,35,40,45],
                'min_samples_leaf': range(100, 140, 20),
                'min_samples_split': range(100, 140, 20),
                'n_estimators': [20,24,26], 
                'max_features': [50,60,70,80]  }

# Create a base model (i.e. RF)
rf =  RandomForestRegressor()

# Instantiate the random search with the estimator, param_grid, and random_state as 0
rs = RandomizedSearchCV(cv=5, estimator=rf,param_distributions=param_grid,random_state=0)
rs

In [None]:
rs.fit(X_train, y_train)

In [None]:
print("The best parameters are : ",rs.best_params_)

**Fitting the final model with the best parameters obtained from the random search.**

In [None]:
# Instantiate RF model with the best hyperparameters

rfr = RandomForestRegressor(max_depth=35,
                            max_features=70,
                            min_samples_leaf=100,
                            min_samples_split=100,
                            n_estimators=26,
                            verbose=1)
rfr
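
Instead of copying the winning values by hand, the fitted search object can supply them directly. The sketch below illustrates this on synthetic data with a deliberately tiny search space; `X_toy`, `y_toy`, `toy_search` and the parameter values are assumptions for illustration, and the same pattern would apply to `rs` once it has been fitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data: the target depends mostly on the second feature.
rng = np.random.RandomState(0)
X_toy = rng.rand(120, 4)
y_toy = 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=120)

toy_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"max_depth": [2, 4, 6], "n_estimators": [5, 10]},
    n_iter=3,
    cv=3,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# best_estimator_ has already been refit on all the data passed to fit
# (refit=True is the default).
final_model = toy_search.best_estimator_

# Equivalent configuration built from the winning parameters; it still needs fitting.
rebuilt = RandomForestRegressor(random_state=0, **toy_search.best_params_)
print(toy_search.best_params_)
```

Using `best_params_` or `best_estimator_` keeps the final model in step with the search if the search space is changed later.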

In [None]:
# fit rfr on training data

rfr.fit(X_train,y_train)


In [None]:
# predict on the test set
pred = rfr.predict(X_test)

In [None]:
y_test['predicted_sales_rf_tuned'] = pred

In [None]:
print("MSE is ",sklearn.metrics.mean_squared_error(y_test['Sales'],y_test['predicted_sales_rf_tuned'] ))
print("RMSE is ",(sklearn.metrics.mean_squared_error(y_test['Sales'],y_test['predicted_sales_rf_tuned'] ))**(0.5))

### Predicting sales and saving the predictions 

In [None]:
# Create a dataframe holding the record IDs; the predicted Sales are added to it below.
# ID is the index of test (it was read with index_col=[0]), not a column.
dictt = {'ID':test_d.index}
result = pd.DataFrame(dictt)

# predict the sales for the two months data
result['Sales'] = rfr.predict(test_d)

# Write the data to our local folder.
result.to_csv(r"C:\\Users\\arjun\\CONTESTS\\AV_JOB_SEPT2021\\submission_rf_with_tune.csv",index = False)
