# Data Scientist Hiring Challenge

## Background
An e-commerce website has hundreds of thousands of visitors every day. The visitors come from several marketing channels such as digital campaigns on social media, referrals from publishers, organic search and CRM.


The main business here is collecting traffic from different sources/channels and converting those visitors to leads. A marketing lead is **a person who shows interest in a brand's products or services**, which makes a visitor a potential customer for the seller/service provider. The primary goal of any company is to generate as many leads as possible to ultimately increase conversion rates in the sales funnel.


The main aim of the challenge is to develop a model that can forecast the next day's number of leads.

 

## Datasets and Features
There is one main dataset and three others as auxiliary:

1. Transactions dataset is the main dataset that stores the number of leads for each minute broken down according to products and channels. The data goes back to 29.09.2020.
    
    Features:

    * pk: primary key and unique ID in the database table
    * ga_transactionid: the ID of the transaction from Google Analytics
    * ga_datehour: the time of the transaction in yyyymmddHH format
    * ga_products: name of the products (Product A, Product B, Product C, Product D, Product E, Product F)
    * ga_channels: the channel a visitor comes from (Facebook, Google Ads, Organic search, Direct, CRM)
    * ga_itemquantity: number of leads

2. Economic calendar dataset keeps a record of all events that may affect economic variables such as currency exchange rates, interest rates, and stocks in the market.
    
    Features:

    * pk: primary key and unique ID in the database table  
    * date: starts from 28.04.2021
    * time: when the event takes place
    * country: where the event happened
    * indicator: the name of the event
    * priority: there are three levels (1, 2, 3) where 3 is the highest priority
    * exception: anticipated market impact  
    * previous: represents the previous market impact either positive or negative
 

3. Economic variables dataset observes and keeps track of the changes in terms of important variables such as USDTRY or BIST100. The dataset stores the variables daily at three different hours (09, 12, 15) hrs.
    
    Features:

    * pk: primary key and unique ID in the database table
    * date: starts from 28.04.2021
    * hour: (09, 12, 15) hrs
    * bist100: Borsa Istanbul stock exchange
    * usdtry: usd and try exchange rate
    * eurtry: eur and try exchange rate
    * eurusd: eur and usd exchange rate
    * faiz: interest rate in Turkey
    * xau: gold price in ounce
    * brent: Atlantic basin crude oils price

4. Live digital campaigns dataset stores the number of live digital campaigns for every day since 29.09.2020.
    
    Features:

    * date: since 29.09.2020
    * live_campaigns: number of live campaigns
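
The `ga_datehour` format listed above (yyyymmddHH) maps directly onto a `strftime` pattern. A minimal sketch with made-up timestamps, assuming the column may be loaded as integers from the CSV file:

```python
import pandas as pd

# Hypothetical raw values in yyyymmddHH form (read_csv may load them as integers).
raw = pd.Series([2020082900, 2020082913, 2021101723])

# Cast to string first so the format is matched character by character.
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H")

# Split the timestamp into its calendar day and its hour of the day.
print(parsed.dt.strftime("%Y-%m-%d").tolist())
print(parsed.dt.hour.tolist())
```

Once the column is a real datetime, daily aggregation becomes a matter of resampling or grouping by day.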


## Tasks
1. Analyse the relationship between the economic events and variables and their impact on the daily number of visitors.
2. Using Transaction dataset, forecast the next day leads for each channel (Facebook, CRM and so on).
3. Given the number of live digital campaigns and other auxiliary datasets, try to optimise the performance of your forecasting model (or even develop a new model).
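
One way to frame the "next day" forecast in Task 2 as a supervised problem is to pair what is known on day t with the lead count of day t+1. The sketch below is only an illustration under that assumption; the column names `leads_today`, `leads_yesterday` and `leads_tomorrow` are hypothetical, and the five Facebook counts are the first rows of the daily table built later in this notebook.

```python
import pandas as pd

# First five daily Facebook lead counts from the transactions data.
daily = pd.DataFrame(
    {"Facebook": [2123, 2812, 2921, 2547, 3065]},
    index=pd.date_range("2020-08-29", periods=5, freq="D"),
)

# Features known on day t: the same-day count and the count one day earlier.
features = pd.DataFrame(
    {
        "leads_today": daily["Facebook"],
        "leads_yesterday": daily["Facebook"].shift(1),
    }
)

# Target: the count on day t+1, obtained by shifting the series backwards.
target = daily["Facebook"].shift(-1).rename("leads_tomorrow")

# Drop rows where a lag or the target is undefined (first and last day).
supervised = features.join(target).dropna()
print(supervised)
```

Auxiliary columns such as the campaign count could be joined to `features` in the same way, as long as only values available on day t are used.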

## Deliverables


Write your solution in Jupyter notebooks, one for each task (analysis and model development), and clearly explain what you are doing at each step.


Your Jupyter notebooks should be named in the following format: Task1.ipynb, Task2.ipynb and Task3.ipynb


Make sure that your code is replicable and you document your approach and code in a clear way.

## Task 3

Given the number of live digital campaigns and other auxiliary datasets, try to optimise the performance of your forecasting model (or even develop a new model).

In [1]:
#Let's import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")

In [2]:
df_campaigns = pd.read_csv("live_digital_campaigns.csv")
df_campaigns

Unnamed: 0,date,live_campaign
0,2020-08-29,126
1,2020-08-30,121
2,2020-08-31,130
3,2020-09-01,134
4,2020-09-02,133
...,...,...
411,2021-10-14,134
412,2021-10-15,135
413,2021-10-16,121
414,2021-10-17,132


In [3]:
#Let's prepare our Economic Variable dataset.
df_economic_variables = pd.read_csv("economic_variables.csv")
df_ev = df_economic_variables.groupby(by="date").mean()
df_ev.drop(labels=["pk","hour"],axis=1,inplace=True)

In [4]:
print(df_ev.shape)
print(df_ev.isnull().sum())
df_ev.isnull().sum().iloc[-1]/df_ev.shape[0]
#We have 161 rows and 108 of the bky values are missing, which is about 67%, so we can just drop the bky column too.
#Note also that our dataset has only 161 rows,
#which is very short for a machine learning algorithm to predict values correctly.

(161, 8)
bist100      0
usdtry       0
eurtry       0
eurusd       0
faiz         0
xau          0
brent        0
bky        108
dtype: int64


0.6708074534161491
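
The same check can be written for every column at once, since `isna().mean()` gives the missing share per column. A small sketch on a made-up frame; the 0.5 cutoff is an assumption chosen for illustration, not a value used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column and one mostly-missing column.
frame = pd.DataFrame(
    {
        "usdtry": [8.2, 8.3, 8.4, 8.5],
        "bky": [np.nan, 1.0, np.nan, np.nan],
    }
)

# Share of missing values per column (mean of a boolean mask).
missing_share = frame.isna().mean()

# Assumed cutoff: drop every column with more than half of its values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = frame.drop(columns=to_drop)

print(missing_share)
print(cleaned.columns.tolist())
```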

In [5]:
df_ev.drop(labels=["bky"],axis=1,inplace=True)
df_ev.head()
#Our economic variable dataset is ready for ML algorithm. 

date,bist100,usdtry,eurtry,eurusd,faiz,xau,brent
2021-04-28,1388.59,8.2159,9.9379,1.2096,18.08,1773.86,67.33
2021-04-29,1398.85,8.171,9.9166,1.212,18.08,1780.19,67.54
2021-04-30,1401.52,8.2281,9.9741,1.2117,18.08,1769.11,68.08
2021-05-03,1421.15,8.2649,9.9694,1.2056,17.94,1793.31,67.45
2021-05-04,1420.996667,8.3091,9.995333,1.2025,17.94,1787.523333,68.113333


In [6]:
#Let's prepare our economic calendar dataset
df_economic_calender = pd.read_csv("economic_calendar.csv")
#Let's keep only the events from Turkey and the USA
#(the reasoning behind this choice is explained in Task 1)
df_ec_filtered = df_economic_calender[df_economic_calender["country"].isin(["ABD","Türkiye"])]

In [7]:
#Let's filter the indicators too. We keep the ones that mention interest rates, inflation and exchange rates
df_ec_filtered = df_ec_filtered[df_ec_filtered.indicator.str.contains("Faiz|Enflasyon|Dolar|Döviz|Tüfe",case=False)]

In [8]:
#As a final filter, let's keep only the events with priority 2 or 3
df_ec_filtered = df_ec_filtered[df_ec_filtered.priority>1]

In [9]:
df_ec_filtered.indicator.value_counts()
#Let's look in more detail at how often each indicator recurs.
#The TÜFE (consumer price index) indicators are clearly the most frequent ones in our filtered table.

Özel Kapsamlı TÜFE B Endeksi(Aylık)              7
Özel Kapsamlı TÜFE F Endeksi(Aylık)              7
Özel Kapsamlı TÜFE D Endeksi(Yıllık)             7
Özel Kapsamlı TÜFE E Endeksi(Aylık)              7
Özel Kapsamlı TÜFE D Endeksi(Aylık)              7
Özel Kapsamlı TÜFE A Endeksi(Yıllık)             7
Özel Kapsamlı TÜFE C Endeksi(Aylık)              7
Özel Kapsamlı TÜFE B Endeksi(Yıllık)             7
Özel Kapsamlı TÜFE E Endeksi(Yıllık)             7
Özel Kapsamlı TÜFE C Endeksi(Yıllık)             7
Özel Kapsamlı TÜFE A Endeksi(Aylık)              7
Özel Kapsamlı TÜFE F Endeksi(Yıllık)             7
Gıda ve Enerji Hariç TÜFE(Aylık)                 6
Michigan 5 Yıllık Enflasyon Tahmini(Final)       6
Gıda ve Enerji Hariç TÜFE(Yıllık)                6
Merkezi Yönetim Faiz Giderleri(TL)               6
Hazine Faiz Dışı Dengesi(TL)                     6
Michigan 12 Aylık Enflasyon Tahmini(Final)       6
Merkezi Yönetim Faiz Dışı Dengesi(TL)            6
Merkezi Yönetim Faiz Hariç Gide

In [10]:
#Let's check how many unique dates this indicator has, since we want to predict the next day's leads
#and we will only give our model the events of a single day.
df_ec_filtered[df_ec_filtered.indicator.str.contains("Özel Kapsamlı Tüfe",case=False)].date.nunique()
#We only have 6 unique days for the TÜFE index.

6

In [11]:
#Let's calculate how many unique dates we have in total.
df_ec_filtered.date.nunique()
#We only have 46 unique dates. Since that is again too few, I will drop this table entirely.

46

In [12]:
#And now finally let's take a look at our transaction dataset and preprocess it
df_transaction = pd.read_csv("transactions.csv")
df_transaction.head()

Unnamed: 0,pk,ga_transactionid,ga_datehour,ga_products,ga_channels,ga_itemquantity
0,146288072,2_50414543,2020082900,Product D,Facebook,1
1,146288071,2_50414542,2020082900,Product D,Organic search,1
2,146287503,2_50413935,2020082900,Product D,Organic search,1
3,146287504,2_50413936,2020082900,Product D,Organic search,1
4,146296436,3_65496155,2020082900,Product E,Google Ads,1


In [13]:
#We only need the date-hour, the channel and the item quantity to obtain the total daily leads per channel,
#so we keep these 3 columns in df. The copy avoids modifying a slice of df_transaction in the next cell.
df = df_transaction[["ga_datehour","ga_channels","ga_itemquantity"]].copy()

In [14]:
#Let's convert the date-hour strings to datetime so that we can obtain daily data with ease.
df["ga_datehour"] = pd.to_datetime(df.ga_datehour,format="%Y%m%d%H")

In [15]:
#Let's put the channels in the columns and resample daily to obtain the daily leads for every channel, stored in df_daily
df_daily = pd.crosstab(df.ga_datehour.fillna("NA"),df.ga_channels.fillna("NA"),values=df.ga_itemquantity,aggfunc="sum",dropna=False).resample("D").sum()
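
The crosstab above can also be written as a group-by on day and channel. The sketch below shows that equivalent on four made-up transactions; it is an alternative formulation for illustration, not the code that produced the output in this notebook.

```python
import pandas as pd

# Hypothetical transactions, including one row with a missing channel.
tx = pd.DataFrame(
    {
        "ga_datehour": ["2020082900", "2020082913", "2020083001", "2020083001"],
        "ga_channels": ["Facebook", "Facebook", "CRM", None],
        "ga_itemquantity": [1, 2, 3, 4],
    }
)
tx["ga_datehour"] = pd.to_datetime(tx["ga_datehour"], format="%Y%m%d%H")
tx["ga_channels"] = tx["ga_channels"].fillna("NA")

# Sum the leads per calendar day and channel, then move channels to columns.
daily = (
    tx.groupby([pd.Grouper(key="ga_datehour", freq="D"), "ga_channels"])["ga_itemquantity"]
    .sum()
    .unstack(fill_value=0)
)

# Make sure days without any transaction appear as rows of zeros.
daily = daily.asfreq("D", fill_value=0)
print(daily)
```

Filling only the channel column keeps the timestamp column a pure datetime, which is what the daily grouping needs.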

In [16]:
df_daily

ga_datehour,CRM,Direct,Facebook,Google Ads,NA,Organic search,Referral
2020-08-29,2420.0,337.0,2123.0,21985.0,1227.0,4938.0,389.0
2020-08-30,2611.0,376.0,2812.0,24524.0,1292.0,5878.0,505.0
2020-08-31,3282.0,500.0,2921.0,33484.0,1343.0,7915.0,580.0
2020-09-01,20859.0,501.0,2547.0,32932.0,1420.0,7588.0,614.0
2020-09-02,11520.0,550.0,3065.0,30143.0,1314.0,6998.0,616.0
...,...,...,...,...,...,...,...
2021-10-14,2848.0,747.0,4930.0,34627.0,3254.0,4393.0,105.0
2021-10-15,3953.0,722.0,4327.0,32897.0,3120.0,4175.0,66.0
2021-10-16,1318.0,493.0,4291.0,21555.0,5130.0,2598.0,62.0
2021-10-17,1953.0,599.0,6520.0,29707.0,8293.0,3384.0,88.0


In [17]:
#Let's change the index and column names
df_daily.index.names = ["Date"]
df_daily.columns.names = ["Channels"]

In [18]:
df_daily

Date,CRM,Direct,Facebook,Google Ads,NA,Organic search,Referral
2020-08-29,2420.0,337.0,2123.0,21985.0,1227.0,4938.0,389.0
2020-08-30,2611.0,376.0,2812.0,24524.0,1292.0,5878.0,505.0
2020-08-31,3282.0,500.0,2921.0,33484.0,1343.0,7915.0,580.0
2020-09-01,20859.0,501.0,2547.0,32932.0,1420.0,7588.0,614.0
2020-09-02,11520.0,550.0,3065.0,30143.0,1314.0,6998.0,616.0
...,...,...,...,...,...,...,...
2021-10-14,2848.0,747.0,4930.0,34627.0,3254.0,4393.0,105.0
2021-10-15,3953.0,722.0,4327.0,32897.0,3120.0,4175.0,66.0
2021-10-16,1318.0,493.0,4291.0,21555.0,5130.0,2598.0,62.0
2021-10-17,1953.0,599.0,6520.0,29707.0,8293.0,3384.0,88.0


In [19]:
#So far we have preprocessed all the dataframes. Now let's merge them all together.
#We will use the df_daily, df_ev and df_campaigns datasets.
#Let's check the info of all those datasets
print(df_campaigns.info())
print(df_ev.info())
print(df_daily.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416 entries, 0 to 415
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   date           416 non-null    object
 1   live_campaign  416 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 6.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 161 entries, 2021-04-28 to 2021-10-18
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   bist100  161 non-null    float64
 1   usdtry   161 non-null    float64
 2   eurtry   161 non-null    float64
 3   eurusd   161 non-null    float64
 4   faiz     161 non-null    float64
 5   xau      161 non-null    float64
 6   brent    161 non-null    float64
dtypes: float64(7)
memory usage: 10.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 416 entries, 2020-08-29 to 2021-10-18
Freq: D
Data columns (total 7 columns):
 #   Column          Non-Null Coun

As we can see above, df_campaigns and df_daily have 416 rows each, but df_ev has only 161. First, let's merge all three on their common dates and build a model. After that, I will leave df_ev out, build a model on the other two datasets only, and compare the two results.
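
The effect of merging on common dates can be seen on a toy example: an inner join keeps only the dates present in both frames, so the shorter frame sets the number of rows. The campaign values below are made up; the three `usdtry` values are the first rows of the economic variables table.

```python
import pandas as pd

# Hypothetical campaign counts for seven consecutive days.
campaigns = pd.DataFrame(
    {"live_campaign": range(110, 117)},
    index=pd.date_range("2021-04-26", periods=7, freq="D"),
)

# Economic variables cover only three of those days.
econ = pd.DataFrame(
    {"usdtry": [8.2159, 8.171, 8.2281]},
    index=pd.to_datetime(["2021-04-28", "2021-04-29", "2021-04-30"]),
)

# Inner merge on the index: only dates found in both frames survive.
merged = pd.merge(econ, campaigns, left_index=True, right_index=True, how="inner")
print(len(campaigns), len(econ), len(merged))
```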

In [20]:
#Let's first convert all dates to datetime so that the indexes match when merging
df_campaigns.date = pd.to_datetime(df_campaigns.date)
df_ev.index = pd.to_datetime(df_ev.index)

In [21]:
df_campaigns.set_index(df_campaigns.date,inplace=True)

In [22]:
df_campaigns.drop(labels=["date"],axis=1,inplace=True)

In [23]:
#Let's merge it all and store it in df_all dataframe
df_all = pd.merge(df_ev,pd.merge(df_campaigns,df_daily,left_index=True,right_index=True, how="inner"),left_index=True,right_index=True, how="inner")

In [24]:
df_all.head()

date,bist100,usdtry,eurtry,eurusd,faiz,xau,brent,live_campaign,CRM,Direct,Facebook,Google Ads,NA,Organic search,Referral
2021-04-28,1388.59,8.2159,9.9379,1.2096,18.08,1773.86,67.33,112,11852.0,578.0,5356.0,24483.0,5099.0,2313.0,211.0
2021-04-29,1398.85,8.171,9.9166,1.212,18.08,1780.19,67.54,110,3271.0,433.0,4529.0,21608.0,5028.0,1829.0,152.0
2021-04-30,1401.52,8.2281,9.9741,1.2117,18.08,1769.11,68.08,109,2913.0,604.0,6764.0,25982.0,8424.0,1896.0,199.0
2021-05-03,1421.15,8.2649,9.9694,1.2056,17.94,1793.31,67.45,114,7218.0,819.0,9315.0,33252.0,9564.0,2828.0,199.0
2021-05-04,1420.996667,8.3091,9.995333,1.2025,17.94,1787.523333,68.113333,114,15261.0,683.0,7947.0,31392.0,7217.0,2602.0,197.0


In [25]:
#Let's separate features and targets
X = df_all.loc[:,"bist100":"live_campaign"]
y = df_all.loc[:,"CRM":]

In [26]:
#Let's import train test split and gridsearch cv
from sklearn.model_selection import train_test_split,GridSearchCV

In [35]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [27]:
#Let's do train test split
X_train,X_test,y_train,y_test =train_test_split(X,y,test_size=0.2,random_state=42)
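
`train_test_split` shuffles the rows by default, so test days can fall between training days. For a forecasting task, a chronological hold-out is a common alternative. The sketch below shows it on made-up data; it is not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily feature and target, ten consecutive days.
idx = pd.date_range("2021-04-28", periods=10, freq="D")
X_demo = pd.DataFrame({"live_campaign": np.arange(110, 120)}, index=idx)
y_demo = pd.Series(np.arange(10.0), index=idx, name="leads")

# shuffle=False keeps the row order, so the last 20% of days form the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.index.max(), X_te.index.min())
```

With this split every test day lies after every training day, which mirrors how a next-day forecast would be used in practice.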

In [34]:
#Since bist100, usdtry and other columns have different scales we should scale it first.
scaler = StandardScaler()
scaler.fit(X_train)
X_trainStandard = scaler.transform(X_train)
X_testStandard = scaler.transform(X_test)

In [45]:
#Instead of fitting the models one by one, let's run a grid search over all of them for every target column
pipe = Pipeline([("scaler",StandardScaler()),("regressor",GradientBoostingRegressor())])

#Note: np.arange(5) starts at 0, and n_neighbors=0 is not a valid value, so those candidates fail to fit.
#max_features also runs up to 10 although X has only 8 columns.
#The grid is kept exactly as it was run so that it matches the output below.
models = [{"regressor" : [GradientBoostingRegressor(random_state= 42)],
          "regressor__learning_rate": [1, 0.1, 0.01],
          "regressor__max_depth": [1,2,3],
          "regressor__n_estimators": [100,1000,5000,10000]},
         {"regressor":[RandomForestRegressor(random_state=42)],
         "regressor__n_estimators": np.arange(50,600,50),
         "regressor__max_features": np.arange(1,11)},
         {"regressor":[KNeighborsRegressor()],
         "regressor__n_neighbors": np.arange(5),
         "regressor__weights":["uniform","distance"],
         "regressor__algorithm":['auto', 'ball_tree', 'kd_tree', 'brute']}]

Grid_Search = GridSearchCV(estimator=pipe,param_grid=models,cv=5,verbose=2,n_jobs=-1,scoring="neg_mean_squared_error")
for col in y_train.columns:
    Grid_Search.fit(X_train,y_train[col])
    print(col)
    print(Grid_Search.best_params_)

Fitting 5 folds for each of 186 candidates, totalling 930 fits
CRM
{'regressor': KNeighborsRegressor(algorithm='brute', n_neighbors=3), 'regressor__algorithm': 'brute', 'regressor__n_neighbors': 3, 'regressor__weights': 'uniform'}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Direct
{'regressor': GradientBoostingRegressor(learning_rate=0.01, max_depth=1, n_estimators=10000,
                          random_state=42), 'regressor__learning_rate': 0.01, 'regressor__max_depth': 1, 'regressor__n_estimators': 10000}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Facebook
{'regressor': RandomForestRegressor(max_features=6, n_estimators=150, random_state=42), 'regressor__max_features': 6, 'regressor__n_estimators': 150}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Google Ads
{'regressor': GradientBoostingRegressor(max_depth=1, n_estimators=5000, random_state=42), 'regressor__learning_rate': 0.1, 'regressor__max_depth': 1, 'regressor__n_estimat
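
Instead of copying the printed parameters into a separate cell per channel, the fitted search can hand back its refitted best pipeline directly. The sketch below illustrates the idea on synthetic data with a deliberately small grid; the names `X_demo`, `y_demo` and `best_models` are hypothetical and the grid is not the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and two synthetic target columns.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(60, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.DataFrame(
    {
        "CRM": X_demo["f1"] * 2 + rng.normal(scale=0.1, size=60),
        "Direct": X_demo["f2"] - X_demo["f3"] + rng.normal(scale=0.1, size=60),
    }
)

pipe = Pipeline([("scaler", StandardScaler()), ("regressor", KNeighborsRegressor())])
grid = {
    "regressor__n_neighbors": [1, 3, 5],
    "regressor__weights": ["uniform", "distance"],
}

# One search per target column; keep the refitted best pipeline for each.
best_models = {}
for col in y_demo.columns:
    search = GridSearchCV(pipe, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_demo, y_demo[col])
    best_models[col] = search.best_estimator_
    print(col, search.best_params_, -search.best_score_)
```

Because `best_estimator_` is the whole pipeline, the scaler fitted during the search is reused at prediction time, so no separately scaled copy of the features is needed.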

In [46]:
knn_crm=KNeighborsRegressor(algorithm="brute",n_neighbors=3,weights="uniform")
knn_crm.fit(X_trainStandard,y_train["CRM"])
y_pred_crm =knn_crm.predict(X_testStandard)
print("Test Score: ",mean_squared_error(y_pred_crm,y_test["CRM"]))
results=pd.DataFrame([],columns=["MSE"])
results.loc["CRM"]=mean_squared_error(y_pred_crm,y_test["CRM"])

Test Score:  16372423.407407403


In [47]:
gbr_direct = GradientBoostingRegressor(learning_rate=0.01, max_depth=1, n_estimators=10000,random_state=42)
gbr_direct.fit(X_trainStandard,y_train["Direct"])
y_pred_direct = gbr_direct.predict(X_testStandard)
print("Test Score: ", mean_squared_error(y_pred_direct,y_test["Direct"]))
results.loc["Direct"]=mean_squared_error(y_pred_direct,y_test["Direct"])

Test Score:  16312.922499306465


In [49]:
rfr_facebook= RandomForestRegressor(max_features=6, n_estimators=150,random_state=42)
rfr_facebook.fit(X_trainStandard,y_train["Facebook"])
y_pred_facebook=rfr_facebook.predict(X_testStandard)
print("Test Score: ", mean_squared_error(y_pred_facebook,y_test["Facebook"]))
results.loc["Facebook"]=mean_squared_error(y_pred_facebook,y_test["Facebook"])

Test Score:  1629370.3690781144


In [50]:
gbr_google = GradientBoostingRegressor(learning_rate=0.1, max_depth=1, n_estimators=5000,random_state=42)
gbr_google.fit(X_trainStandard,y_train["Google Ads"])
y_pred_google = gbr_google.predict(X_testStandard)
print("Test Score: ", mean_squared_error(y_pred_google,y_test["Google Ads"]))
results.loc["Google Ads"]=mean_squared_error(y_pred_google,y_test["Google Ads"])

Test Score:  17951686.956024468


In [52]:
rfr_na= RandomForestRegressor(max_features=2, n_estimators=150,random_state=42)
rfr_na.fit(X_trainStandard,y_train["NA"])
y_pred_na=rfr_na.predict(X_testStandard)
print("Test Score: ", mean_squared_error(y_pred_na,y_test["NA"]))
results.loc["NA"]=mean_squared_error(y_pred_na,y_test["NA"])

Test Score:  14303197.560084587


In [54]:
gbr_organic = GradientBoostingRegressor(learning_rate=0.01, max_depth=2, n_estimators=5000,random_state=42)
gbr_organic.fit(X_trainStandard,y_train["Organic search"])
y_pred_organic = gbr_organic.predict(X_testStandard)
print("Test Score: ", mean_squared_error(y_pred_organic,y_test["Organic search"]))
results.loc["Organic search"]=mean_squared_error(y_pred_organic,y_test["Organic search"])

Test Score:  231463.44726681607


In [56]:
knn_referral=KNeighborsRegressor(algorithm="auto",n_neighbors=4,weights="uniform")
knn_referral.fit(X_trainStandard,y_train["Referral"])
y_pred_referral =knn_referral.predict(X_testStandard)
print("Test Score: ",mean_squared_error(y_pred_referral,y_test["Referral"]))
results.loc["Referral"]=mean_squared_error(y_pred_referral,y_test["Referral"])

Test Score:  45507.429924242424


In [57]:
results

Unnamed: 0,MSE
CRM,16372420.0
Direct,16312.92
Facebook,1629370.0
Google Ads,17951690.0
NA,14303200.0
Organic search,231463.4
Referral,45507.43
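
MSE values are in squared leads, which makes them hard to read next to daily counts. Taking the square root gives the RMSE in leads per day; for CRM this is roughly 4,046. A short sketch using the test scores printed above:

```python
import numpy as np
import pandas as pd

# Test MSE per channel, copied from the cell outputs above.
mse = pd.Series(
    {
        "CRM": 16372423.407407403,
        "Direct": 16312.922499306465,
        "Facebook": 1629370.3690781144,
        "Google Ads": 17951686.956024468,
        "NA": 14303197.560084587,
        "Organic search": 231463.44726681607,
        "Referral": 45507.429924242424,
    }
)

# RMSE is the square root of MSE and has the same unit as the target (leads).
rmse = np.sqrt(mse).round(1)
print(rmse)
```

Comparing each RMSE with the typical daily volume of its channel gives a better sense of the error size than the raw MSE.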


In [60]:
#Let's merge just df_campaigns and df_daily and forecast again
df_all_new = pd.merge(df_campaigns,df_daily,left_index=True,right_index=True, how="inner")
df_all_new.head()

In [64]:
#Let's separate features and targets
X_new = df_all_new.loc[:,"live_campaign"]
y_new = df_all_new.loc[:,"CRM":]

In [65]:
#Let's do train test split
X_train_new ,X_test_new,y_train_new,y_test_new =train_test_split(X_new,y_new,test_size=0.2,random_state=42)

In [73]:
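#Since there is only one feature, there is no need to scale it.
#Note: the grid is reused unchanged, so n_neighbors=0 is still not a valid value
#and max_features runs up to 10 although there is only one feature.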
pipe = Pipeline([("regressor",GradientBoostingRegressor())])

models = [{"regressor" : [GradientBoostingRegressor(random_state= 42)],
          "regressor__learning_rate": [1, 0.1, 0.01],
          "regressor__max_depth": [1,2,3],
          "regressor__n_estimators": [100,1000,5000,10000]},
         {"regressor":[RandomForestRegressor(random_state=42)],
         "regressor__n_estimators": np.arange(50,600,50),
         "regressor__max_features": np.arange(1,11)},
         {"regressor":[KNeighborsRegressor()],
         "regressor__n_neighbors": np.arange(5),
         "regressor__weights":["uniform","distance"],
         "regressor__algorithm":['auto', 'ball_tree', 'kd_tree', 'brute']}]

Grid_Search = GridSearchCV(estimator=pipe,param_grid=models,cv=5,verbose=2,n_jobs=-1,scoring="neg_mean_squared_error")
for col in y_train_new.columns:
    Grid_Search.fit(X_train_new.values.reshape(-1,1),y_train_new[col])
    print(col)
    print(Grid_Search.best_params_)

Fitting 5 folds for each of 186 candidates, totalling 930 fits
CRM
{'regressor': GradientBoostingRegressor(learning_rate=0.01, max_depth=1, random_state=42), 'regressor__learning_rate': 0.01, 'regressor__max_depth': 1, 'regressor__n_estimators': 100}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Direct
{'regressor': GradientBoostingRegressor(learning_rate=0.01, max_depth=1, n_estimators=1000,
                          random_state=42), 'regressor__learning_rate': 0.01, 'regressor__max_depth': 1, 'regressor__n_estimators': 1000}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Facebook
{'regressor': GradientBoostingRegressor(max_depth=1, random_state=42), 'regressor__learning_rate': 0.1, 'regressor__max_depth': 1, 'regressor__n_estimators': 100}
Fitting 5 folds for each of 186 candidates, totalling 930 fits
Google Ads
{'regressor': GradientBoostingRegressor(max_depth=1, random_state=42), 'regressor__learning_rate': 0.1, 'regressor__max_depth': 1, 'regresso

In [77]:
gbr_crm_new=GradientBoostingRegressor(learning_rate=0.01,max_depth=1,n_estimators=100,random_state=42)
gbr_crm_new.fit(X_train_new.values.reshape(-1,1),y_train_new["CRM"])
y_pred_crm_new =gbr_crm_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_crm_new,y_test_new["CRM"]))
results.loc["CRM","MSE_New"]=mean_squared_error(y_pred_crm_new,y_test_new["CRM"])

Test Score:  17025765.336315833


In [79]:
gbr_direct_new = GradientBoostingRegressor(learning_rate=0.01,max_depth=1,n_estimators=1000,random_state=42)
gbr_direct_new.fit(X_train_new.values.reshape(-1,1),y_train_new["Direct"])
y_pred_direct_new =gbr_direct_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_direct_new,y_test_new["Direct"]))
results.loc["Direct","MSE_New"]=mean_squared_error(y_pred_direct_new,y_test_new["Direct"])

Test Score:  18204.95419060235


In [80]:
gbr_facebook_new = GradientBoostingRegressor(learning_rate=0.1,max_depth=1,n_estimators=100,random_state=42)
gbr_facebook_new.fit(X_train_new.values.reshape(-1,1),y_train_new["Facebook"])
y_pred_facebook_new =gbr_facebook_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_facebook_new,y_test_new["Facebook"]))
results.loc["Facebook","MSE_New"]=mean_squared_error(y_pred_facebook_new,y_test_new["Facebook"])

Test Score:  6407252.171251355


In [81]:
gbr_google_new = GradientBoostingRegressor(learning_rate=0.1,max_depth=1,n_estimators=100,random_state=42)
gbr_google_new.fit(X_train_new.values.reshape(-1,1),y_train_new["Google Ads"])
y_pred_google_new =gbr_google_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_google_new,y_test_new["Google Ads"]))
results.loc["Google Ads","MSE_New"]=mean_squared_error(y_pred_google_new,y_test_new["Google Ads"])

Test Score:  20145123.41494296


In [83]:
gbr_na_new = GradientBoostingRegressor(learning_rate=0.01,max_depth=1,n_estimators=100,random_state=42)
gbr_na_new.fit(X_train_new.values.reshape(-1,1),y_train_new["NA"])
y_pred_na_new =gbr_na_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_na_new,y_test_new["NA"]))
results.loc["NA","MSE_New"]=mean_squared_error(y_pred_na_new,y_test_new["NA"])

Test Score:  13942714.869032195


In [85]:
gbr_organic_new = GradientBoostingRegressor(learning_rate=0.01,max_depth=1,n_estimators=5000,random_state=42)
gbr_organic_new.fit(X_train_new.values.reshape(-1,1),y_train_new["Organic search"])
y_pred_organic_new =gbr_organic_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_organic_new,y_test_new["Organic search"]))
results.loc["Organic search","MSE_New"]=mean_squared_error(y_pred_organic_new,y_test_new["Organic search"])

Test Score:  3427184.1346507412


In [87]:
gbr_referral_new = GradientBoostingRegressor(learning_rate=0.01,max_depth=1,n_estimators=5000,random_state=42)
gbr_referral_new.fit(X_train_new.values.reshape(-1,1),y_train_new["Referral"])
y_pred_referral_new =gbr_referral_new.predict(X_test_new.values.reshape(-1,1))
print("Test Score: ",mean_squared_error(y_pred_referral_new,y_test_new["Referral"]))
results.loc["Referral","MSE_New"]=mean_squared_error(y_pred_referral_new,y_test_new["Referral"])

Test Score:  39627.65873813055


In [88]:
results

Unnamed: 0,MSE,MSE_New
CRM,16372420.0,17025770.0
Direct,16312.92,18204.95
Facebook,1629370.0,6407252.0
Google Ads,17951690.0,20145120.0
NA,14303200.0,13942710.0
Organic search,231463.4,3427184.0
Referral,45507.43,39627.66


## Summary
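The table shows that the models using the economic variables reach a lower test MSE for five of the seven channels (CRM, Direct, Facebook, Google Ads and Organic search), while NA and Referral are slightly better with the campaign count alone. The gap is small for most channels but large for Facebook and Organic search. Note that the two setups are evaluated on different test sets (dates from 28.04.2021 onward versus the full period), so the comparison is only indicative.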
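
The comparison can be made more concrete by expressing the difference between the two columns as a relative change per channel. A small sketch using the rounded values from the results table; the column name `change_pct` is introduced here for illustration:

```python
import pandas as pd

# Rounded test MSE values from the results table above.
results_demo = pd.DataFrame(
    {
        "MSE": [16372420.0, 16312.92, 1629370.0, 17951690.0, 14303200.0, 231463.4, 45507.43],
        "MSE_New": [17025770.0, 18204.95, 6407252.0, 20145120.0, 13942710.0, 3427184.0, 39627.66],
    },
    index=["CRM", "Direct", "Facebook", "Google Ads", "NA", "Organic search", "Referral"],
)

# Positive values mean the campaign-only model has a higher (worse) MSE.
results_demo["change_pct"] = (results_demo["MSE_New"] / results_demo["MSE"] - 1) * 100
print(results_demo.round(1))
```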