# CS4320 Introduction to Machine Learning

## Team Undefined

### Group Members: 

- Luke Shumway A02268065

- Ryan Andersen A02288683

- Ian Adams A02252812

Project: [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview/description)

In [None]:
GroupName = "Undefined"
assert GroupName != "", 'Please enter your name in the above quotation marks, thanks!'

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.rcParams["font.size"] = 16

from sklearn.model_selection import RandomizedSearchCV, cross_val_score, cross_validate, train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from xgboost import XGBRegressor

## Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Exploratory Data Analysis](#2)
3. [Data aggregation, Splitting, and Feature Engineering](#3)
4. [Preprocessing and transformations](#4)
5. [Baseline model](#5) 
6. [Linear models](#6)
7. [Different models](#7)
8. [Feature selection](#8)
9. [Hyperparameter optimization](#9)
10. [Interpretation and feature importances](#10)
11. [Results on the test set](#11) 
12. [Submit the predictions to Kaggle](#12)
13. [Your takeaway from the course](#13)

## Submission instructions <a name="si"></a>
<hr>

- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- Upload the .ipynb file to Canvas.
- **Submit the screenshot of your Kaggle submission ranking and score** 
- Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
- Notebooks with cell execution numbers out of order will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
- Make sure that the plots and output are rendered properly in your submitted file. 
- Please keep your notebook clean and delete any throwaway code.

## Introduction <a name="in"></a>

A few notes and tips when you work on this project: 

#### Tips
1. The project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. Write a few sentences on your initial thoughts on the problem and the dataset. 

_Type your answer here, replacing this text._

In [None]:
# This file is a template for all submissions
sample_df = pd.read_csv('file:sample_submission.csv')
sample_df

<!-- BEGIN QUESTION -->

## 2. Exploratory Data Analysis <a name="2"></a>
<hr>

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
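
As a rough sketch of the kind of summary statistics task 2 asks for, the snippet below computes per-family and per-day aggregates on a tiny made-up frame. The column names `date`, `family`, and `sales` are assumed to match `train.csv`; the values are invented purely for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for train.csv; the real data has many more rows.
toy_df = pd.DataFrame(
    {
        "date": ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"],
        "family": ["GROCERY I", "BEVERAGES", "GROCERY I", "BEVERAGES"],
        "sales": [100.0, 40.0, 120.0, 60.0],
    }
)

# Summary statistic 1: mean and median sales per product family.
family_stats = toy_df.groupby("family")["sales"].agg(["mean", "median"])

# Summary statistic 2: total sales per day, which could feed a line plot.
daily_total = toy_df.groupby("date")["sales"].sum()

print(family_stats)
print(daily_total)
```

On the real data, the same two aggregates could be passed to `.plot()` to produce the required visualizations.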

In [None]:
# We don't display this one, because we don't want to look at the testing data
test_df = pd.read_csv('file:test.csv')

In [None]:
train_df = pd.read_csv('file:train.csv')
display(train_df)

In [None]:
oil_df = pd.read_csv('file:oil.csv')
display(oil_df)

In [None]:
holiday_df = pd.read_csv('file:holidays_events.csv')
display(holiday_df)

In [None]:
stores_df = pd.read_csv('file:stores.csv')
display(stores_df)

In [None]:
transactions_df = pd.read_csv('file:transactions.csv')
display(transactions_df)

In [None]:
holiday_df['locale_name'].value_counts()

In [None]:
stores_df['state'].value_counts()

<!-- BEGIN QUESTION -->

## 3. Data aggregation, splitting, and feature engineering <a name="3"></a>
<hr>

**Your tasks:**

1. Split the data into train and test portions.

In [None]:
# Engineers more features based off the date
def create_date_features(df):
    # Parse the date column once instead of once per feature
    dates = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df['month'] = dates.dt.month
    df['day_of_month'] = dates.dt.day
    df['day_of_year'] = dates.dt.dayofyear
    df['day_of_week'] = dates.dt.dayofweek
    df['year'] = dates.dt.year
    return df

In [None]:
# Takes in the testing or training set and merges data from the other tables with it
# Also integrates feature engineering
# Note: Data was already split into initial training and testing sets
def merge_data(in_df):
    t_oil_df = pd.merge(in_df, oil_df, how="left", on="date")
    t_holiday_df = pd.merge(t_oil_df, holiday_df, how="left", on="date")
    t_transact_df = pd.merge(t_holiday_df, transactions_df, how="left", on=["date", "store_nbr"])
    full_df = pd.merge(t_transact_df, stores_df, how="left", on="store_nbr")
    full_df.rename(columns={"type_x":"holiday_type", "type_y":"store_type", "dcoilwtico":"oil_price"}, inplace = True)
    return create_date_features(full_df)

In [None]:
full_train_df = merge_data(train_df)
full_test_df = merge_data(test_df)

X_train = full_train_df.drop(columns=["sales"])
y_train = full_train_df["sales"]

# The Kaggle test set has no "sales" column, so there is no y_test
X_test = full_test_df

X_train.head()

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Preprocessing and transformations <a name="4"></a>
<hr>

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

In [None]:
numeric_features = ["store_nbr", "onpromotion", "oil_price", "cluster", "transactions", "month", "day_of_month", "day_of_year", "day_of_week", "year"] 
categorical_features = ["family", "holiday_type", "locale", "locale_name", "city", "state", "store_type"]
drop_features = ["transferred", "id"]  # do not include these features in modeling

preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), numeric_features),
    (make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown="ignore")), categorical_features),
    ('drop', drop_features),
)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Baseline model <a name="5"></a>
<hr>

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

In [None]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    # Use positional indexing (.iloc) because the Series are indexed by score name
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)

In [None]:
results_dict = {}

In [None]:
dummyPipe = make_pipeline(preprocessor, DummyRegressor(strategy="median"))
results_dict['DummyRegressor'] = mean_std_cross_val_scores(dummyPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

<!-- BEGIN QUESTION -->

## 6. Linear models <a name="6"></a>
<hr>

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.
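
One way to approach tasks 2 and 3 is to sweep the regularization strength of a linear model and tabulate train and cross-validation error for each value. The sketch below does this with `Ridge` and its `alpha` hyperparameter on synthetic data; the data, the `alpha` grid, and the variable names are assumptions for illustration, not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

rows = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_validate(
        Ridge(alpha=alpha),
        X_toy,
        y_toy,
        cv=3,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    rows.append(
        {
            "alpha": alpha,
            # Scores are negated RMSE, so flip the sign to report RMSE.
            "mean_train_rmse": -scores["train_score"].mean(),
            "mean_cv_rmse": -scores["test_score"].mean(),
            "std_cv_rmse": scores["test_score"].std(),
        }
    )

alpha_results = pd.DataFrame(rows).set_index("alpha")
print(alpha_results)
```

In the real notebook the `Ridge` estimator would sit behind `preprocessor` in a pipeline, and a growing gap between train and cross-validation RMSE would point to overfitting.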

In [None]:
lrPipe = make_pipeline(preprocessor, LinearRegression(n_jobs=-1))
results_dict['LinearRegressor'] = mean_std_cross_val_scores(lrPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

<!-- BEGIN QUESTION -->

## 7. Different models <a name="7"></a>
<hr>

**Your tasks:**
1. Try other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

### DecisionTreeRegressor

In [None]:
dtPipe = make_pipeline(preprocessor, DecisionTreeRegressor(max_depth=10))
results_dict['DecisionTreeRegressor'] = mean_std_cross_val_scores(dtPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

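### Ridge

In [None]:
# LogisticRegression is a classifier and cannot fit a continuous target such as sales,
# so a regularized linear regressor (Ridge) is swapped in here instead.
from sklearn.linear_model import Ridge

ridgePipe = make_pipeline(preprocessor, Ridge())
results_dict['Ridge'] = mean_std_cross_val_scores(ridgePipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')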

### LightGBM

In [None]:
from lightgbm import LGBMRegressor  # assumes the lightgbm package is installed

lgbmPipe = make_pipeline(preprocessor, LGBMRegressor())
results_dict['LightGBM'] = mean_std_cross_val_scores(lgbmPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### XGBoost

In [None]:
xgbPipe = make_pipeline(preprocessor, XGBRegressor(objective='reg:squaredlogerror'))
results_dict['XGBoost'] = mean_std_cross_val_scores(xgbPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### CatBoost

In [None]:
from catboost import CatBoostRegressor  # assumes the catboost package is installed

catPipe = make_pipeline(preprocessor, CatBoostRegressor(verbose=0))
results_dict['CatBoost'] = mean_std_cross_val_scores(catPipe, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

<!-- BEGIN QUESTION -->

## 8. Feature selection <a name="8"></a>
<hr>

**Your tasks:**

Make some attempts to select relevant features. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it. 
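
A minimal sketch of one possible approach, assuming L1-based selection with `SelectFromModel` wrapped around `Lasso`: the pipeline with and without the selection step is scored by cross-validation so the two can be compared. The synthetic data, the `alpha=0.1` setting, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 10))
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

plain_pipe = make_pipeline(LinearRegression())
select_pipe = make_pipeline(
    # Keep only features whose Lasso coefficient is not shrunk to (near) zero.
    SelectFromModel(Lasso(alpha=0.1)),
    LinearRegression(),
)

for name, pipe in [("all features", plain_pipe), ("L1-selected", select_pipe)]:
    rmse = -cross_val_score(pipe, X_toy, y_toy, cv=3, scoring="neg_root_mean_squared_error")
    print(f"{name}: {rmse.mean():.3f} (+/- {rmse.std():.3f})")

# Inspect which features survived the selection step.
select_pipe.fit(X_toy, y_toy)
kept_mask = select_pipe.named_steps["selectfrommodel"].get_support()
print("kept feature indices:", np.flatnonzero(kept_mask))
```

In the project pipeline the selection step would go between `preprocessor` and the final regressor, and it would stay only if the cross-validation score improves.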

In [None]:
# Your code here

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 9. Hyperparameter optimization <a name="9"></a>
<hr>

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

### LightGBM

In [None]:
# Your code here

### XGBoost

In [None]:
xgb_param_dict = {
    "xgbregressor__booster": ['gbtree', 'gblinear'],
    "xgbregressor__n_estimators": [50, 100, 150, 200, 250, 300, 350],
    "xgbregressor__max_depth": [3, 4, 5, 6, 7, 8],
    "xgbregressor__max_delta_step": [2, 3, 4, 5, 6, 7],
    "xgbregressor__gamma": [.01, .1],
    "xgbregressor__learning_rate": [.01, .05, .1, .15, .2, .25, .3],
    "xgbregressor__grow_policy": ['depthwise', 'lossguide'],
    "xgbregressor__tree_method": ['exact', 'hist', 'approx'],
}

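from sklearn.model_selection import RandomizedSearchCV

xgb_op_pipe = make_pipeline(preprocessor, XGBRegressor(objective='reg:squaredlogerror'))

# Search the grid defined above; score with RMSE because "f1" is a classification metric
xgb_r_search = RandomizedSearchCV(xgb_op_pipe, xgb_param_dict, cv=5, n_jobs=-1, scoring="neg_root_mean_squared_error", random_state=123, return_train_score=True)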
xgb_r_search.fit(X_train, y_train)

print(xgb_r_search.best_params_)

In [None]:
xgb_pipe_best = make_pipeline(
    preprocessor,
    XGBRegressor(
        n_estimators=xgb_r_search.best_params_['xgbregressor__n_estimators'],
        max_depth=xgb_r_search.best_params_['xgbregressor__max_depth'],
        objective='reg:squaredlogerror',
        booster=xgb_r_search.best_params_['xgbregressor__booster'],
        max_delta_step=xgb_r_search.best_params_['xgbregressor__max_delta_step'],
        gamma=xgb_r_search.best_params_['xgbregressor__gamma'],
        learning_rate=xgb_r_search.best_params_['xgbregressor__learning_rate'],
        grow_policy=xgb_r_search.best_params_['xgbregressor__grow_policy'],
        tree_method=xgb_r_search.best_params_['xgbregressor__tree_method'],
    )
)
xgb_pipe_best.fit(X_train, y_train)
results_dict['Optimized XGBoost'] = mean_std_cross_val_scores(xgb_pipe_best, X_train, y_train, cv=3, scoring='neg_root_mean_squared_error')

### CatBoost

In [None]:
# Your code here

<!-- BEGIN QUESTION -->

## 10. Interpretation and feature importances <a name="10"></a>
<hr>

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
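
If `eli5` or `shap` is not available, permutation importance from `sklearn.inspection` is one alternative that fits the "any other methods" option. The sketch below applies it to a random forest on synthetic data; the feature names and the relationship between features and target are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with one strong, one weak, and two irrelevant features.
rng = np.random.default_rng(0)
feature_names = ["strong", "irrelevant", "weak", "noise"]
X_toy = rng.normal(size=(400, 4))
y_toy = 5.0 * X_toy[:, 0] + 1.0 * X_toy[:, 2] + rng.normal(scale=0.3, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the validation split and measure how much RMSE degrades.
result = permutation_importance(
    forest,
    X_val,
    y_val,
    scoring="neg_root_mean_squared_error",
    n_repeats=5,
    random_state=0,
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

A larger value means that shuffling the feature hurts the score more, so the model relies on it more heavily.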

In [None]:
# Your code here

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Results on the test set <a name="11"></a>
<hr>

**Your tasks:**

1. Try your best performing model on the test data (from train test split) and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  
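
For tasks 1 and 2, a simple check is to put the cross-validation RMSE next to the RMSE on a held-out split. The sketch below assumes a random `train_test_split` on synthetic data and a `Ridge` model as a placeholder for the best pipeline; for time-ordered data such as daily sales, a chronological split may be more appropriate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the project's features and target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

model = Ridge(alpha=1.0)  # placeholder for the best model found earlier

# Validation estimate from cross-validation on the training portion only.
cv_rmse = -cross_val_score(
    model, X_tr, y_tr, cv=3, scoring="neg_root_mean_squared_error"
).mean()

# Final estimate on data the model never saw during model selection.
model.fit(X_tr, y_tr)
test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print(f"CV RMSE:   {cv_rmse:.3f}")
print(f"Test RMSE: {test_rmse:.3f}")
```

A test RMSE much worse than the cross-validation RMSE would be a hint of optimization bias.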

In [None]:
# Your code here

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Submit the predictions to Kaggle <a name="12"></a>
<hr>

**Your tasks:**

Retrain the best model on the whole training dataset and upload the predicted output on the test set to Kaggle. Report your final test score.
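
The submission file can be assembled roughly as sketched below, assuming it needs the `id` and `sales` columns of `sample_submission.csv`. The ids and predictions here are invented placeholders; in the notebook they would come from `test_df` and from the retrained model.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and raw predictions; real values come from test_df and the model.
test_ids = pd.Series([0, 1, 2], name="id")
raw_preds = np.array([12.5, -0.7, 430.2])

submission = pd.DataFrame(
    {
        "id": test_ids,
        # Sales cannot be negative, so clip predictions at zero.
        "sales": np.clip(raw_preds, 0, None),
    }
)

# Write to an in-memory buffer here; in practice pass a file path instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the pandas index out of the file so that only the two expected columns are written.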

In [None]:
# Your code here

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Your takeaway <a name="13"></a>
<hr>

**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers. Discuss other ideas that you did not try but that could potentially improve performance or interpretability.

<!-- END QUESTION -->

<br><br>