# CS4320 Introduction to Machine Learning

## Team Undefined

### Group Members: 

- Luke Shumway A02268065

- Ryan Andersen A02288683

- Ian Adams A02252812

Project: [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

In [1]:
GroupName = "Undefined"
assert GroupName != "", 'Please enter your name in the above quotation marks, thanks!'

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.rcParams["font.size"] = 16

from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from xgboost import XGBRegressor
from lightgbm.sklearn import LGBMRegressor
from catboost import CatBoostRegressor



## Table of contents
1. [Understanding the problem](#1)
2. [Exploratory Data Analysis](#2)
3. [Data aggregation, Splitting, and Feature Engineering](#3)
4. [Preprocessing and transformations](#4)
5. [Baseline model](#5) 
6. [Linear models](#6)
7. [Different models](#7)
8. [Hyperparameter optimization](#8)
9. [Using 2017 Data](#9)
10. [Results on the test set](#10)
11. [Submit the predictions to Kaggle](#11) 
12. [Our takeaway from the course](#12)

<!-- BEGIN QUESTION -->

## 1. The prediction problem <a name="1"></a>


This problem involves predicting the sales of each product family at a given store in Ecuador on a given day, which makes it a time series regression task.
After merging the provided tables, the dataset includes the following features: store number, onpromotion, oil price, cluster, transactions, family, holiday type, locale, locale name, city, state, store type, transferred, id, and date.
To clarify some of the less obvious features: onpromotion is the number of items in a product family that were on promotion; cluster is a grouping of similar stores; family identifies the type of product sold; and transferred marks a holiday that officially falls on that date but was moved to another date, so the original date behaves more like a normal day.


In [3]:
# This file is a template for all submissions
sample_df = pd.read_csv('file:sample_submission.csv')
sample_df

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0
...,...,...
28507,3029395,0.0
28508,3029396,0.0
28509,3029397,0.0
28510,3029398,0.0


<!-- BEGIN QUESTION -->

## 2. Exploratory Data Analysis <a name="2"></a>
<hr>

During this stage, we looked at all of the data we had and decided how best to use it (see the next section for how we processed it into a usable state). We also looked at the distributions of the data in each table.

We concluded that the tables share key columns (such as date and store_nbr) that we could use to combine them into richer training and testing sets.

In [4]:
# We don't display this one, because we don't want to look at the testing data
test_df = pd.read_csv('file:test.csv')

In [5]:
train_df = pd.read_csv('file:train.csv')
display(train_df)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000,0
1,1,2013-01-01,1,BABY CARE,0.000,0
2,2,2013-01-01,1,BEAUTY,0.000,0
3,3,2013-01-01,1,BEVERAGES,0.000,0
4,4,2013-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
3000883,3000883,2017-08-15,9,POULTRY,438.133,0
3000884,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
3000885,3000885,2017-08-15,9,PRODUCE,2419.729,148
3000886,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [6]:
oil_df = pd.read_csv('file:oil.csv')
display(oil_df)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96
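
The oil preview already hints at two kinds of gaps: a missing price on the first row, and a jump from 2013-01-04 to 2013-01-07. A minimal sketch of how such gaps could be counted is shown below. It uses only the five rows visible in the preview, and the names `oil_sample`, `missing_prices`, and `missing_days` are ours for illustration.

```python
import pandas as pd

# Toy copy of the first rows shown in the oil preview (not the full file)
oil_sample = pd.DataFrame(
    {
        "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"],
        "dcoilwtico": [None, 93.14, 92.97, 93.12, 93.20],
    }
)
oil_sample["date"] = pd.to_datetime(oil_sample["date"], format="%Y-%m-%d")

# Count rows whose price is missing
missing_prices = int(oil_sample["dcoilwtico"].isna().sum())

# Find calendar days between the first and last date that have no row at all
all_days = pd.date_range(oil_sample["date"].min(), oil_sample["date"].max(), freq="D")
missing_days = all_days.difference(pd.DatetimeIndex(oil_sample["date"]))

print(missing_prices)
print(list(missing_days.strftime("%Y-%m-%d")))
```

Both kinds of gap turn into NaN values after a left merge on date, which is one reason an imputation step is needed later.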


In [7]:
holiday_df = pd.read_csv('file:holidays_events.csv')
display(holiday_df)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
...,...,...,...,...,...,...
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
346,2017-12-23,Additional,National,Ecuador,Navidad-2,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
348,2017-12-25,Holiday,National,Ecuador,Navidad,False


In [8]:
stores_df = pd.read_csv('file:stores.csv')
display(stores_df)

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,6,Quito,Pichincha,D,13
6,7,Quito,Pichincha,D,8
7,8,Quito,Pichincha,D,8
8,9,Quito,Pichincha,B,6
9,10,Quito,Pichincha,C,15


In [9]:
transactions_df = pd.read_csv('file:transactions.csv')
display(transactions_df)

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [10]:
holiday_df['locale_name'].value_counts()

Ecuador                           174
Quito                              13
Riobamba                           12
Guaranda                           12
Latacunga                          12
Ambato                             12
Guayaquil                          11
Cuenca                              7
Ibarra                              7
Salinas                             6
Loja                                6
Santa Elena                         6
Santo Domingo de los Tsachilas      6
Quevedo                             6
Manta                               6
Esmeraldas                          6
Cotopaxi                            6
El Carmen                           6
Santo Domingo                       6
Machala                             6
Imbabura                            6
Puyo                                6
Libertad                            6
Cayambe                             6
Name: locale_name, dtype: int64

In [11]:
stores_df['state'].value_counts()

Pichincha                         19
Guayas                            11
Santo Domingo de los Tsachilas     3
Azuay                              3
Manabi                             3
Cotopaxi                           2
Tungurahua                         2
Los Rios                           2
El Oro                             2
Chimborazo                         1
Imbabura                           1
Bolivar                            1
Pastaza                            1
Santa Elena                        1
Loja                               1
Esmeraldas                         1
Name: state, dtype: int64

<!-- BEGIN QUESTION -->

## 3. Data aggregation, splitting, and feature engineering <a name="3"></a>
<hr>

Since this task is so time-dependent, we created features for the month, day of the month, day of the year, day of the week, and year. Interestingly, these features improved performance in some models but decreased it in others. The linear regressor did better without the new features, but our best-performing model, CatBoost, did better with them, so we kept them.

In [12]:
# Engineers more features based off the date
def create_date_features(df):
    # Parse the date column once and reuse it for every feature
    dates = pd.to_datetime(df.date, format='%Y-%m-%d')
    df['month'] = dates.dt.month
    df['day_of_month'] = dates.dt.day
    df['day_of_year'] = dates.dt.dayofyear
    df['day_of_week'] = dates.dt.dayofweek
    df['year'] = dates.dt.year
    return df
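
As a quick sanity check of what these features look like, here is a small sketch on a two-row toy frame. The helper `add_date_parts` is a hypothetical compact stand-in for the function above, written so the example runs on its own; the two dates are the first and last dates in the training preview.

```python
import pandas as pd

def add_date_parts(df):
    # Hypothetical helper for illustration: parse once, derive calendar features
    dates = pd.to_datetime(df["date"], format="%Y-%m-%d")
    return df.assign(
        month=dates.dt.month,
        day_of_month=dates.dt.day,
        day_of_year=dates.dt.dayofyear,
        day_of_week=dates.dt.dayofweek,  # pandas convention: Monday=0 ... Sunday=6
        year=dates.dt.year,
    )

toy = pd.DataFrame({"date": ["2013-01-01", "2017-08-15"]})
features = add_date_parts(toy)
print(features)
```

Note that `day_of_week` is an integer code, so a model that treats it as numeric sees Sunday (6) as far from Monday (0), even though they are adjacent days.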

In [13]:
# Takes in the testing or training set and merges data from the other tables with it
# Also integrates feature engineering
# Note: Data was already split into initial training and testing sets
def merge_data(in_df):
    t_oil_df = pd.merge(in_df, oil_df, how="left", on="date")
    t_holiday_df = pd.merge(t_oil_df, holiday_df, how="left", on="date")
    t_transact_df = pd.merge(t_holiday_df, transactions_df, how="left", on=["date", "store_nbr"])
    full_df = pd.merge(t_transact_df, stores_df, how="left", on="store_nbr")
    full_df.rename(columns={"type_x":"holiday_type", "type_y":"store_type", "dcoilwtico":"oil_price"}, inplace = True)
    return create_date_features(full_df)
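
One thing worth checking with a left merge on date alone: if the right-hand table has more than one row for a date, pandas repeats the matching left rows. The sketch below illustrates this on toy frames. The holiday rows here are made up for the example, so whether this happens in the real holiday table is an assumption to verify, for instance by comparing row counts before and after the merge.

```python
import pandas as pd

sales = pd.DataFrame(
    {"date": ["2013-01-01", "2013-01-02"], "store_nbr": [1, 1], "sales": [0.0, 5.0]}
)
# Hypothetical holiday table in which one date carries two entries
holidays = pd.DataFrame(
    {"date": ["2013-01-01", "2013-01-01"], "type": ["Holiday", "Additional"]}
)

merged = pd.merge(sales, holidays, how="left", on="date")
print(len(sales), len(merged))  # row counts before and after the merge

# validate= asks pandas to raise instead of silently repeating rows
try:
    pd.merge(sales, holidays, how="left", on="date", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Repeated rows matter for this project in two ways: they weight some training days more heavily, and on the test side they would produce more prediction rows than the submission file expects.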

In [14]:
full_train_df = merge_data(train_df)
full_test_df = merge_data(test_df)

X_train = full_train_df.drop(columns=["sales"])
y_train = full_train_df["sales"]

X_test = full_test_df

X_train.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,oil_price,holiday_type,locale,locale_name,description,...,transactions,city,state,store_type,cluster,month,day_of_month,day_of_year,day_of_week,year
0,0,2013-01-01,1,AUTOMOTIVE,0,,Holiday,National,Ecuador,Primer dia del ano,...,,Quito,Pichincha,D,13,1,1,1,1,2013
1,1,2013-01-01,1,BABY CARE,0,,Holiday,National,Ecuador,Primer dia del ano,...,,Quito,Pichincha,D,13,1,1,1,1,2013
2,2,2013-01-01,1,BEAUTY,0,,Holiday,National,Ecuador,Primer dia del ano,...,,Quito,Pichincha,D,13,1,1,1,1,2013
3,3,2013-01-01,1,BEVERAGES,0,,Holiday,National,Ecuador,Primer dia del ano,...,,Quito,Pichincha,D,13,1,1,1,1,2013
4,4,2013-01-01,1,BOOKS,0,,Holiday,National,Ecuador,Primer dia del ano,...,,Quito,Pichincha,D,13,1,1,1,1,2013


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Preprocessing and transformations <a name="4"></a>
<hr>

For preprocessing we made a column transformer, so that we could apply different preprocessing to different attributes.

For numerical data:

- Imputation using SimpleImputer with the median strategy

- Scaling using StandardScaler

For categorical data:

- Imputation using SimpleImputer with the most_frequent strategy

- One hot encoding using OneHotEncoder

We also decided to drop some features that were not relevant to the learning task.

In [15]:
numeric_features = ["store_nbr", "onpromotion", "cluster", "transactions", "month", "day_of_month", "day_of_year", "day_of_week", "year"]
categorical_features = ["family", "holiday_type", "locale", "locale_name", "city", "state", "transferred", "store_type", "date"]
# These are not passed to the transformer: make_column_transformer defaults to remainder="drop",
# so every column not listed above (including these) is left out of modeling
drop_features = ["id", "oil_price"]

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), numeric_features),
    (make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown="ignore")), categorical_features))
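
To see what this kind of transformer does to a frame, here is a small sketch on made-up data. The column values and the name `toy_preprocessor` are ours for illustration; only the structure (median imputation plus scaling for numeric columns, most-frequent imputation plus one-hot encoding for categorical ones) mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],  # not listed in any transformer, so it is dropped
        "transactions": [100.0, np.nan, 300.0, 200.0],
        "store_type": ["A", "B", np.nan, "A"],
    }
)

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["transactions"]),
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        ["store_type"],
    ),
)

transformed = toy_preprocessor.fit_transform(toy)
# The result may be a SciPy sparse matrix depending on its density
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)
print(transformed)
```

The output has one scaled numeric column plus one column per category, and the `id` column is gone because it was not assigned to either pipeline.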

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Baseline model <a name="5"></a>
<hr>

Here we fit sklearn's baseline model for regression tasks, DummyRegressor, on the data. This sets a baseline against which we can judge the other models' performance.

We also define the mean_std_cross_val_scores function provided by the professor, which we used to get cross-validation scores for our models during this project.

In [16]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    # Use positional access (.iloc) because the Series are indexed by metric name
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)
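
For context on what the helper summarizes, the sketch below calls `cross_validate` directly on random toy data (not the project data) and shows the raw per-fold values it returns. The names `X_toy`, `y_toy`, and `score_table` are ours for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))  # random features, for illustration only
y_toy = rng.normal(size=30)       # random target, for illustration only

scores = cross_validate(
    DummyRegressor(strategy="median"),
    X_toy,
    y_toy,
    cv=3,
    scoring="neg_root_mean_squared_error",
)

# One row per fold, one column per metric
score_table = pd.DataFrame(scores)
print(score_table)
print(score_table.mean())
```

The three columns (`fit_time`, `score_time`, `test_score`) are the rows that appear in the results table later on. The scores are negative because scikit-learn scorers follow a "higher is better" convention, so RMSE is reported with its sign flipped.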

In [17]:
results_dict = {}

In [18]:
dummyPipe = make_pipeline(preprocessor, DummyRegressor(strategy="median"))
results_dict['DummyRegressor'] = mean_std_cross_val_scores(dummyPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')
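
To make the baseline score concrete, the sketch below computes by hand what a median-strategy baseline amounts to on a tiny made-up target. It is a simplification: it predicts and scores on the same four values, whereas cross-validation fits the median on the training folds and scores on the held-out fold.

```python
import numpy as np

y_true = np.array([0.0, 0.0, 0.0, 10.0])  # toy sales values, not project data

# A median-strategy baseline predicts this single value for every row
median_prediction = np.median(y_true)

# Root mean squared error of that constant prediction
rmse = np.sqrt(np.mean((y_true - median_prediction) ** 2))

# The 'neg_root_mean_squared_error' scorer reports the negated value
print(median_prediction, rmse, -rmse)
```

With many zero-sales rows, as in the training preview, the median can sit at or near zero, which leaves large errors on the high-sales rows and helps explain a weak baseline score.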

<!-- BEGIN QUESTION -->

## 6. Linear model <a name="6"></a>
<hr>

In [19]:
lrPipe = make_pipeline(preprocessor, LinearRegression(n_jobs=-1))
results_dict['LinearRegressor'] = mean_std_cross_val_scores(lrPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

<!-- BEGIN QUESTION -->

## 7. Different models <a name="7"></a>
<hr>

### DecisionTreeRegressor

In [20]:
dtPipe = make_pipeline(preprocessor, DecisionTreeRegressor(max_depth=10))
results_dict['DecisionTreeRegressor'] = mean_std_cross_val_scores(dtPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### LogisticRegressor

In [21]:
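# LogisticRegression is a classifier, so it does not fit a continuous target such as sales
lgrPipe = make_pipeline(preprocessor, LogisticRegression(n_jobs=-1))
# Caution: dtPipe (not lgrPipe) is scored here, so this entry repeats the decision tree evaluation
results_dict['LogisticRegressor'] = mean_std_cross_val_scores(dtPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')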
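
Despite its name, scikit-learn's LogisticRegression is a classifier. The sketch below, on made-up data, shows what we would expect to happen if it were fit on a real-valued target: in the scikit-learn versions we are aware of, `fit` rejects the target with a ValueError.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_continuous = np.array([0.5, 1.7, 2.2, 3.9])  # real-valued target, like sales

try:
    LogisticRegression().fit(X_toy, y_continuous)
    error_message = None
except ValueError as exc:
    # scikit-learn checks that the target looks like class labels
    error_message = str(exc)

print(error_message)
```

This is why a pipeline built on LogisticRegression would not be a drop-in candidate for this regression task.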

### LightGBM

In [22]:
lgbmPipe = make_pipeline(preprocessor, LGBMRegressor())
results_dict['LightGBM'] = mean_std_cross_val_scores(lgbmPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### XGBoost

In [23]:
xgbPipe = make_pipeline(preprocessor, XGBRegressor(objective='reg:squaredlogerror'))
results_dict['XGBoost'] = mean_std_cross_val_scores(xgbPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### CatBoost

In [24]:
catPipe = make_pipeline(preprocessor, CatBoostRegressor(loss_function="RMSE", task_type="GPU"))
results_dict['CatBoost'] = mean_std_cross_val_scores(catPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')


Learning rate set to 0.109589
0:	learn: 1128.0062227	total: 22.9ms	remaining: 22.9s
1:	learn: 1040.9475690	total: 41.6ms	remaining: 20.8s
2:	learn: 966.6530661	total: 57.8ms	remaining: 19.2s
...
986:	learn: 228.6688524	total: 15.3s	remaining: 201ms
...
[Log truncated. The same per-iteration output repeats for the later cross-validation fits; the last complete lines logged for them are shown below.]
957:	learn: 214.9170357	total: 14.3s	remaining: 626ms
919:	learn: 212.2341235	total: 13.8s	remaining: 1.2s

### Results

In [25]:
pd.DataFrame(results_dict)

| | DummyRegressor | LinearRegressor | DecisionTreeRegressor | LogisticRegressor | LightGBM | XGBoost | CatBoost |
|---|---|---|---|---|---|---|---|
| fit_time | 6.307 (+/- 0.145) | 24.658 (+/- 6.368) | 33.914 (+/- 0.983) | 34.935 (+/- 1.256) | 9.719 (+/- 0.318) | 38.966 (+/- 0.824) | 25.480 (+/- 0.885) |
| score_time | 2.170 (+/- 0.019) | 2.176 (+/- 0.048) | 2.201 (+/- 0.062) | 2.208 (+/- 0.003) | 2.648 (+/- 0.098) | 2.439 (+/- 0.019) | 3.056 (+/- 0.012) |
| test_score | -1136.668 (+/- 289.512) | -740.047 (+/- 110.780) | -472.337 (+/- 4.391) | -472.487 (+/- 4.319) | -423.426 (+/- 20.338) | -1050.390 (+/- 216.737) | -412.630 (+/- 3.269) |
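
The `test_score` row appears to hold negated RMSE values (assuming the table used the same `neg_root_mean_squared_error` scoring as the optimized XGBoost cell in section 8), so values closer to zero are better. A minimal sketch that flips the sign and ranks the models, with the means copied by hand from the table above:

```python
# Mean test scores copied from the results table (assumed to be negated RMSE).
mean_test_scores = {
    "DummyRegressor": -1136.668,
    "LinearRegressor": -740.047,
    "DecisionTreeRegressor": -472.337,
    "LogisticRegressor": -472.487,
    "LightGBM": -423.426,
    "XGBoost": -1050.390,
    "CatBoost": -412.630,
}

# Flip the sign to get RMSE, then sort from best (lowest) to worst.
rmse_by_model = {name: -score for name, score in mean_test_scores.items()}
ranking = sorted(rmse_by_model, key=rmse_by_model.get)

# Report each model's RMSE and its reduction relative to the dummy baseline.
baseline = rmse_by_model["DummyRegressor"]
for name in ranking:
    reduction = 100 * (1 - rmse_by_model[name] / baseline)
    print(f"{name:<22} RMSE {rmse_by_model[name]:9.3f}  ({reduction:5.1f}% below dummy)")
```

Under that reading, CatBoost and LightGBM rank first and second.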


<!-- BEGIN QUESTION -->

## 8. Hyperparameter optimization <a name="8"></a>
<hr>

### XGBoost

In [26]:
# Hyperparameter optimization is commented out because it takes a long time to run.
# The parameters it found are stored in the xgb_best_params dictionary below.
#from sklearn.model_selection import RandomizedSearchCV
#
#xgb_param_dict = {
#    "xgbregressor__booster": ['gbtree', 'gblinear'],
#    "xgbregressor__n_estimators": [50, 100, 150, 200, 250, 300, 350],
#    "xgbregressor__max_depth": [3, 4, 5, 6, 7, 8],
#    "xgbregressor__max_delta_step": [2, 3, 4, 5, 6, 7],
#    "xgbregressor__gamma": [.01, .1],
#    "xgbregressor__learning_rate": [.01, .05, .1, .15, .2, .25, .3],
#    "xgbregressor__grow_policy": ['depthwise', 'lossguide'],
#    "xgbregressor__tree_method": ['exact', 'hist', 'approx'],
#}
#
#xgb_op_pipe = make_pipeline(preprocessor, XGBRegressor(objective='reg:squaredlogerror'))
#
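# Pass xgb_param_dict (defined above) and score with negated RMSE, since "f1" is a classification metric
#xgb_r_search = RandomizedSearchCV(xgb_op_pipe, xgb_param_dict, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error", random_state=123, return_train_score=True)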
#xgb_r_search.fit(X_train, y_train)
#
#print(xgb_r_search.best_params_)
xgb_best_params = {
    'xgbregressor__tree_method': 'approx',
    'xgbregressor__n_estimators': 300,
    'xgbregressor__max_depth': 7,
    'xgbregressor__max_delta_step': 6,
    'xgbregressor__learning_rate': 0.3,
    'xgbregressor__grow_policy': 'depthwise',
    'xgbregressor__gamma': 0.1,
    'xgbregressor__booster': 'gbtree'
}

In [27]:
#xgb_pipe_best = make_pipeline(
#    preprocessor,
#    XGBRegressor(
#        n_estimators=xgb_best_params['xgbregressor__n_estimators'],
#        max_depth=xgb_best_params['xgbregressor__max_depth'],
#        objective='reg:squaredlogerror',
#        booster=xgb_best_params['xgbregressor__booster'],
#        max_delta_step=xgb_best_params['xgbregressor__max_delta_step'],
#        gamma=xgb_best_params['xgbregressor__gamma'],
#        learning_rate=xgb_best_params['xgbregressor__learning_rate'],
#        grow_policy=xgb_best_params['xgbregressor__grow_policy'],
#        tree_method=xgb_best_params['xgbregressor__tree_method'],
#    )
#)
#xgb_pipe_best.fit(X_train, y_train)
#results_dict['Optimized XGBoost'] = mean_std_cross_val_scores(xgb_pipe_best, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')
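
Because the keys in `xgb_best_params` already use the `step__parameter` naming that scikit-learn pipelines expect, the dictionary could probably be applied in one call with `set_params` instead of unpacking each entry by hand. A minimal sketch of that mechanism, using `StandardScaler` and `Ridge` as stand-ins (both are assumptions, chosen so the example runs without XGBoost or the notebook's `preprocessor`):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: make_pipeline names each step after its lowercased class,
# which is why the XGBoost keys above start with "xgbregressor__".
demo_pipe = make_pipeline(StandardScaler(), Ridge())

# Hypothetical best parameters in the same "<step>__<param>" format.
demo_best_params = {"ridge__alpha": 0.5, "ridge__fit_intercept": True}

# One call routes every entry to the matching pipeline step.
demo_pipe.set_params(**demo_best_params)

print(demo_pipe.named_steps["ridge"].alpha)
```

Under the same assumption, the XGBoost cell would reduce to `make_pipeline(preprocessor, XGBRegressor(objective='reg:squaredlogerror')).set_params(**xgb_best_params)`.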

### CatBoost

In [28]:
# Commented out, as it is very time consuming to run!
# The best values were found to be depth=10, learning_rate=0.1, l2_leaf_reg=0.5

# import catboost as cb
# from sklearn.metrics import r2_score

# # Use separate names so the original X_train / X_test are not overwritten
# X_train_cb = preprocessor.fit_transform(X_train)

# X_tune, X_val, y_tune, y_val = train_test_split(X_train_cb, y_train, test_size=0.2, random_state=123)

# train_dataset = cb.Pool(X_tune, y_tune)
# val_dataset = cb.Pool(X_val, y_val)
# model = CatBoostRegressor(loss_function="RMSE", task_type="GPU")
# grid = {'learning_rate': [0.1],
#         'depth': [10, 15, 20],
#         'l2_leaf_reg': [0.1, 0.5, 1]}
# grid_search_result = model.grid_search(grid, train_dataset, plot=True)
# pred = model.predict(X_val)
# # rmse = (np.sqrt(mean_squared_log_error(y_val, pred)))
# r2 = r2_score(y_val, pred)
# print('Validation performance')
# # print('RMSLE: {:.2f}'.format(rmse))
# print('R2: {:.2f}'.format(r2))
# print(f'Grid search: {grid_search_result}')

<!-- BEGIN QUESTION -->

## 9. Using 2017 Data <a name="9"></a>
<hr>
In our testing we tried training on only the 2017 data and found a marked improvement in both our linear and ensemble models. We also noticed that most of the oil prices for the date range covered by the test set were missing, and our imputation was hurting our scores, so we removed that as well (see Preprocessing and transformations).

In [29]:
# Keep only the rows from 2017 and drop everything else from the training data
full_train_df.sort_values(by="year", axis=0, ascending=True, inplace=True)
display(full_train_df['year'].value_counts())

# A boolean mask selects rows by value, so it does not depend on index labels
# lining up with row positions after the sort
train_2017_df = full_train_df[full_train_df['year'] == 2017].copy()

X_train_2017 = train_2017_df.drop(columns=["sales"])
y_train_2017 = train_2017_df["sales"]

train_2017_df.head()

2016    670032
2014    659340
2013    657558
2015    655776
2017    411642
Name: year, dtype: int64

| | id | date | store_nbr | family | sales | onpromotion | oil_price | holiday_type | locale | locale_name | ... | transactions | city | state | store_type | cluster | month | day_of_month | day_of_year | day_of_week | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2917132 | 2869018 | 2017-06-02 | 9 | SCHOOL AND OFFICE SUPPLIES | 6.0 | 0 | 47.68 | | | | ... | 1995.0 | Quito | Pichincha | B | 6 | 6 | 2 | 153 | 4 | 2017 |
| 2917133 | 2869019 | 2017-06-02 | 9 | SEAFOOD | 11.0 | 0 | 47.68 | | | | ... | 1995.0 | Quito | Pichincha | B | 6 | 6 | 2 | 153 | 4 | 2017 |
| 2917161 | 2869047 | 2017-06-03 | 1 | PLAYERS AND ELECTRONICS | 7.0 | 0 | | | | | ... | 1420.0 | Quito | Pichincha | D | 13 | 6 | 3 | 154 | 5 | 2017 |
| 2917135 | 2869021 | 2017-06-03 | 1 | BABY CARE | 0.0 | 0 | | | | | ... | 1420.0 | Quito | Pichincha | D | 13 | 6 | 3 | 154 | 5 | 2017 |
| 2917136 | 2869022 | 2017-06-03 | 1 | BEAUTY | 1.0 | 0 | | | | | ... | 1420.0 | Quito | Pichincha | D | 13 | 6 | 3 | 154 | 5 | 2017 |


<!-- BEGIN QUESTION -->

## 10. Training the best model <a name="10"></a>
<hr>

In [30]:
bestModel = CatBoostRegressor(loss_function="RMSE", task_type="GPU", depth=10, learning_rate=0.1, l2_leaf_reg=0.5)
bestModelPipe = make_pipeline(preprocessor, bestModel)
bestModelPipe.fit(X_train_2017, y_train_2017)
result = bestModelPipe.predict(X_test)

sample_df['sales'] = result
sample_df.head()

0:	learn: 1240.2797203	total: 10.8ms	remaining: 10.8s
1:	learn: 1134.7590721	total: 20.7ms	remaining: 10.3s
2:	learn: 1041.2830790	total: 30.3ms	remaining: 10.1s
3:	learn: 957.5308164	total: 39.7ms	remaining: 9.89s
4:	learn: 883.1023920	total: 50ms	remaining: 9.96s
5:	learn: 817.1949245	total: 60ms	remaining: 9.94s
6:	learn: 757.9931453	total: 69.7ms	remaining: 9.88s
7:	learn: 706.0765556	total: 79.6ms	remaining: 9.87s
8:	learn: 659.6827028	total: 89.1ms	remaining: 9.81s
9:	learn: 617.1482636	total: 99.1ms	remaining: 9.81s
10:	learn: 581.2108037	total: 109ms	remaining: 9.81s
11:	learn: 549.4509447	total: 119ms	remaining: 9.78s
12:	learn: 521.1617320	total: 128ms	remaining: 9.75s
13:	learn: 497.1115773	total: 138ms	remaining: 9.73s
14:	learn: 475.3757878	total: 148ms	remaining: 9.71s
15:	learn: 457.0827684	total: 158ms	remaining: 9.69s
16:	learn: 438.8877412	total: 168ms	remaining: 9.69s
17:	learn: 422.7848663	total: 177ms	remaining: 9.67s
18:	learn: 409.2150148	total: 187ms	remaining: 

170:	learn: 200.3941098	total: 1.72s	remaining: 8.34s
171:	learn: 200.1181284	total: 1.73s	remaining: 8.33s
172:	learn: 199.8240268	total: 1.74s	remaining: 8.32s
173:	learn: 199.4439663	total: 1.75s	remaining: 8.31s
174:	learn: 198.9938561	total: 1.76s	remaining: 8.3s
175:	learn: 198.8702653	total: 1.77s	remaining: 8.29s
176:	learn: 198.5189498	total: 1.78s	remaining: 8.28s
177:	learn: 198.2098943	total: 1.79s	remaining: 8.26s
178:	learn: 198.0040248	total: 1.8s	remaining: 8.25s
179:	learn: 197.6937302	total: 1.81s	remaining: 8.24s
180:	learn: 197.3026509	total: 1.82s	remaining: 8.23s
181:	learn: 197.0694473	total: 1.83s	remaining: 8.21s
182:	learn: 196.7311648	total: 1.84s	remaining: 8.2s
183:	learn: 196.2440838	total: 1.85s	remaining: 8.19s
184:	learn: 196.0972022	total: 1.86s	remaining: 8.18s
185:	learn: 195.6705294	total: 1.87s	remaining: 8.17s
186:	learn: 195.4736045	total: 1.88s	remaining: 8.16s
187:	learn: 195.1314463	total: 1.89s	remaining: 8.15s
188:	learn: 194.9208898	total: 

340:	learn: 163.7158232	total: 3.42s	remaining: 6.62s
341:	learn: 163.6356976	total: 3.43s	remaining: 6.61s
342:	learn: 163.4851280	total: 3.44s	remaining: 6.59s
343:	learn: 163.3503571	total: 3.45s	remaining: 6.58s
344:	learn: 163.1147453	total: 3.46s	remaining: 6.58s
345:	learn: 162.9971894	total: 3.47s	remaining: 6.57s
346:	learn: 162.9105721	total: 3.48s	remaining: 6.55s
347:	learn: 162.7197874	total: 3.49s	remaining: 6.55s
348:	learn: 162.5369058	total: 3.5s	remaining: 6.54s
349:	learn: 162.4116412	total: 3.52s	remaining: 6.53s
350:	learn: 162.2842718	total: 3.53s	remaining: 6.52s
351:	learn: 162.1537418	total: 3.54s	remaining: 6.51s
352:	learn: 161.9928808	total: 3.55s	remaining: 6.5s
353:	learn: 161.8082322	total: 3.56s	remaining: 6.5s
354:	learn: 161.7097026	total: 3.57s	remaining: 6.49s
355:	learn: 161.5483918	total: 3.58s	remaining: 6.47s
356:	learn: 161.4093530	total: 3.59s	remaining: 6.46s
357:	learn: 161.2002498	total: 3.6s	remaining: 6.45s
358:	learn: 161.0902690	total: 3

506:	learn: 145.3908513	total: 5.17s	remaining: 5.02s
507:	learn: 145.2276013	total: 5.17s	remaining: 5.01s
508:	learn: 145.0995229	total: 5.18s	remaining: 5s
509:	learn: 145.0205358	total: 5.19s	remaining: 4.99s
510:	learn: 144.9678823	total: 5.2s	remaining: 4.98s
511:	learn: 144.8939655	total: 5.21s	remaining: 4.97s
512:	learn: 144.7484073	total: 5.23s	remaining: 4.96s
513:	learn: 144.6683773	total: 5.24s	remaining: 4.95s
514:	learn: 144.5989867	total: 5.25s	remaining: 4.94s
515:	learn: 144.5178928	total: 5.26s	remaining: 4.93s
516:	learn: 144.4494976	total: 5.26s	remaining: 4.92s
517:	learn: 144.3419454	total: 5.28s	remaining: 4.91s
518:	learn: 144.2457344	total: 5.28s	remaining: 4.9s
519:	learn: 144.1260738	total: 5.29s	remaining: 4.89s
520:	learn: 144.0345286	total: 5.3s	remaining: 4.88s
521:	learn: 143.9430374	total: 5.31s	remaining: 4.87s
522:	learn: 143.8355864	total: 5.32s	remaining: 4.86s
523:	learn: 143.7526040	total: 5.33s	remaining: 4.84s
524:	learn: 143.6752051	total: 5.3

661:	learn: 134.0055479	total: 6.68s	remaining: 3.41s
662:	learn: 133.9534442	total: 6.69s	remaining: 3.4s
663:	learn: 133.9038236	total: 6.7s	remaining: 3.39s
664:	learn: 133.8760938	total: 6.71s	remaining: 3.38s
665:	learn: 133.8060889	total: 6.72s	remaining: 3.37s
666:	learn: 133.7576132	total: 6.73s	remaining: 3.36s
667:	learn: 133.6803867	total: 6.74s	remaining: 3.35s
668:	learn: 133.6343180	total: 6.75s	remaining: 3.34s
669:	learn: 133.5461750	total: 6.76s	remaining: 3.33s
670:	learn: 133.4884646	total: 6.76s	remaining: 3.32s
671:	learn: 133.4348260	total: 6.78s	remaining: 3.31s
672:	learn: 133.3864206	total: 6.79s	remaining: 3.3s
673:	learn: 133.3290888	total: 6.79s	remaining: 3.29s
674:	learn: 133.2469095	total: 6.8s	remaining: 3.28s
675:	learn: 133.1946678	total: 6.81s	remaining: 3.27s
676:	learn: 133.1507803	total: 6.82s	remaining: 3.25s
677:	learn: 133.0835715	total: 6.83s	remaining: 3.25s
678:	learn: 133.0257960	total: 6.84s	remaining: 3.23s
679:	learn: 132.9641181	total: 6

831:	learn: 124.5335663	total: 8.36s	remaining: 1.69s
832:	learn: 124.4913463	total: 8.37s	remaining: 1.68s
833:	learn: 124.4316154	total: 8.38s	remaining: 1.67s
834:	learn: 124.3981099	total: 8.39s	remaining: 1.66s
835:	learn: 124.3576792	total: 8.4s	remaining: 1.65s
836:	learn: 124.3060142	total: 8.41s	remaining: 1.64s
837:	learn: 124.2615749	total: 8.42s	remaining: 1.63s
838:	learn: 124.2291798	total: 8.43s	remaining: 1.62s
839:	learn: 124.1712411	total: 8.44s	remaining: 1.61s
840:	learn: 124.1273698	total: 8.45s	remaining: 1.6s
841:	learn: 124.1000964	total: 8.46s	remaining: 1.59s
842:	learn: 124.0296730	total: 8.47s	remaining: 1.58s
843:	learn: 123.9920163	total: 8.48s	remaining: 1.57s
844:	learn: 123.9469175	total: 8.49s	remaining: 1.56s
845:	learn: 123.8904732	total: 8.49s	remaining: 1.55s
846:	learn: 123.8531055	total: 8.51s	remaining: 1.54s
847:	learn: 123.7607852	total: 8.52s	remaining: 1.53s
848:	learn: 123.7385777	total: 8.53s	remaining: 1.52s
849:	learn: 123.6919385	total:

986:	learn: 117.6396554	total: 9.89s	remaining: 130ms
987:	learn: 117.6059863	total: 9.9s	remaining: 120ms
988:	learn: 117.5552000	total: 9.91s	remaining: 110ms
989:	learn: 117.5168075	total: 9.92s	remaining: 100ms
990:	learn: 117.4880314	total: 9.93s	remaining: 90.2ms
991:	learn: 117.4365056	total: 9.94s	remaining: 80.2ms
992:	learn: 117.3819586	total: 9.95s	remaining: 70.2ms
993:	learn: 117.3326019	total: 9.96s	remaining: 60.1ms
994:	learn: 117.3033140	total: 9.97s	remaining: 50.1ms
995:	learn: 117.2705771	total: 9.98s	remaining: 40.1ms
996:	learn: 117.2378948	total: 9.99s	remaining: 30.1ms
997:	learn: 117.1837754	total: 10s	remaining: 20ms
998:	learn: 117.1253993	total: 10s	remaining: 10ms
999:	learn: 117.0581115	total: 10s	remaining: 0us


| | id | sales |
|---|---|---|
| 0 | 3000888 | -2.284628 |
| 1 | 3000889 | -5.367679 |
| 2 | 3000890 | 3.861734 |
| 3 | 3000891 | 1404.886994 |
| 4 | 3000892 | -7.593072 |


Consult the next section for scoring.

<!-- BEGIN QUESTION -->

## 11. Submit the predictions to Kaggle <a name="11"></a>
<hr>

We ran the CatBoost model from above on the test set and, when submitting to Kaggle, received a Root Mean Squared Logarithmic Error (RMSLE) of 1.25. In comparison, the dummy submission scored 2.73, and the Kaggle baseline scored 4.32. Consult the included screenshot for reference.
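
For reference, the competition metric is `sqrt(mean((log(1 + predicted) - log(1 + actual))^2))`. The logarithm is undefined for predictions at or below -1, and the `sample_df.head()` output in the previous section shows some negative predicted sales. A minimal sketch of the metric that clips predictions at zero first; the clipping step and the function name `rmsle` are assumptions for illustration, not something shown in this notebook:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error with predictions clipped at zero."""
    y_true = np.asarray(y_true, dtype=float)
    # Sales cannot be negative, so clip before taking logs (assumed step).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values only; the negative prediction is treated as zero.
print(rmsle([0.0, 10.0, 1000.0], [-2.3, 12.0, 900.0]))
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences, so a miss on a low-selling product family can count as much as a miss on a high-selling one.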

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Our takeaway <a name="12"></a>
<hr> 

Over the course of this semester we have all learned a lot more about supervised machine learning. For this project specifically, we had the chance to try out many different machine learning models, and we learned the challenges of working with real world data.

Our biggest takeaway was that certain models are built to work with certain kinds of data, and exploration is paramount in finding approaches that work for your data set. In early runs of this project, we tried a k-nearest neighbors approach (KNeighborsRegressor) and a random forest approach (RandomForestRegressor). Neither model worked well with our data. We then tried LinearRegression, which worked better.

We also learned not to give up on a particular algorithm, because variants of it may yield better results. For example, even though RandomForestRegressor didn't work well, we found that other tree ensembles, specifically gradient boosted algorithms like LightGBM and XGBoost, worked very well on this data.

One idea that we had but did not try was using time-series-specific models and preprocessing. This task is ultimately a time series forecasting task, so using models from a library like sktime could lead to better results.
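
As one illustration of time-series-specific preprocessing, lagged sales could be added per store and product family so that a model sees recent history. A minimal sketch on a toy frame; the column names follow the dataset, but the 7-day lag and the helper name `add_lag_feature` are assumptions, it assumes one row per day in each series, and it is not something we tried in this notebook:

```python
import pandas as pd

def add_lag_feature(df, lag_days=7):
    """Add a per-series lagged sales column (hypothetical helper)."""
    out = df.sort_values(["store_nbr", "family", "date"]).copy()
    # Shift within each (store, family) series so lags never cross series.
    # shift() moves by rows, so this equals a day lag only with one row per day.
    out[f"sales_lag_{lag_days}"] = (
        out.groupby(["store_nbr", "family"])["sales"].shift(lag_days)
    )
    return out

# Toy data: one store, one family, ten consecutive days.
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "store_nbr": 1,
    "family": "BEAUTY",
    "sales": [float(i) for i in range(10)],
})
lagged = add_lag_feature(toy, lag_days=7)
print(lagged[["date", "sales", "sales_lag_7"]].tail(3))
```

For the test period, the lag would need to be at least as long as the forecast horizon, otherwise the lagged value would itself be unknown at prediction time.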

In the end, this has been a very exciting semester and a great experience. Through taking this class, we have laid a strong foundation for future learning in the machine learning field.

<!-- END QUESTION -->

<br><br>