# Music Recommendation Project

This is the second section of the Capstone Project for Udacity's Machine Learning Engineer Nanodegree.

This notebook imports the cleaned data from the first notebook, fits a baseline model, fits a more complex XGBoost model, tunes its hyper-parameters, and saves the trained models.

Author: Ben Walsh \
February 19, 2021

## Contents

1. [Feature Data Import](#feature-data-import)
2. Baseline Model
3. [XGBoost Model](#xgb-model)
4. [Save Model](#save-model)

## <a class="anchor" id="feature-data-import"></a>1. Feature Data Import

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
import pickle

import os
import json
import datetime

First, set the paths to the cleaned training data: the feature matrix X and the target y.

In [2]:
X_train_file = './data-input-clean/X_train.csv'
y_train_file = './data-input-clean/y_train.csv'

### Import training data

In [4]:
# Fail early if either file is missing; otherwise later cells hit a NameError
for data_file in (X_train_file, y_train_file):
    if not os.path.exists(data_file):
        raise FileNotFoundError('Training data file {} not found!'.format(data_file))

X_train = pd.read_csv(X_train_file)
y_train = pd.read_csv(y_train_file)

## 2. Baseline Model

In [22]:
from sklearn.linear_model import LogisticRegression
baseline_model = LogisticRegression(random_state=0).fit(X=X_train.values, y=y_train.values.reshape(-1))
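A baseline is only useful once it has a score to beat. The sketch below shows one way to get a cross-validated ROC AUC for a logistic regression, using the same `roc_auc` scoring that the grid search later in this notebook uses. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, and `demo_model` are illustrative names, not part of the project, and the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: binary target)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)

demo_model = LogisticRegression(random_state=0, max_iter=1000)

# 5-fold cross-validation, scored with area under the ROC curve
auc_scores = cross_val_score(demo_model, X_demo, y_demo,
                             scoring='roc_auc', cv=5)

print('ROC AUC per fold:', np.round(auc_scores, 3))
print('Mean ROC AUC: {:.3f}'.format(auc_scores.mean()))
```

With the real data, `X_train.values` and `y_train.values.reshape(-1)` would take the place of the synthetic arrays.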

Save the baseline model with pickle.

In [23]:
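model_folder = './saved_models'
# Create the output folder if it does not already exist
os.makedirs(model_folder, exist_ok=True)

# Use a context manager so the file handle is closed after writing
with open('{}/baseline-model'.format(model_folder), 'wb') as f:
    pickle.dump(baseline_model, f)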

## <a class="anchor" id="xgb-model"></a>3. XGBoost Model

In [4]:
xgb_hparams = {'objective': 'binary:logistic',
               'colsample_bytree': 0.3,
               'learning_rate': 0.1,
               'max_depth': 12,
               'min_child_weight': 1,
               'alpha': 2,  # L1 regularization weight - the higher, the more conservative
               'n_estimators': 50}

In [5]:
# XGBRegressor with the binary:logistic objective: predict() returns
# probabilities in [0, 1] rather than hard class labels
xgb_model = xgb.XGBRegressor(objective=xgb_hparams['objective'],
                             colsample_bytree=xgb_hparams['colsample_bytree'],
                             learning_rate=xgb_hparams['learning_rate'],
                             max_depth=xgb_hparams['max_depth'],
                             min_child_weight=xgb_hparams['min_child_weight'],
                             alpha=xgb_hparams['alpha'],
                             n_estimators=xgb_hparams['n_estimators'])

In [6]:
xgb_model.fit(X_train, y_train)



XGBRegressor(alpha=2, colsample_bytree=0.3, max_depth=12, n_estimators=50,
             objective='binary:logistic')

### Hyper-parameter Optimization

The search will probably need to be rerun whenever the input feature data changes. First tune the overall tree structure with max_depth and min_child_weight: a lower max_depth and a higher min_child_weight favor simpler trees that are less prone to overfitting.

In [7]:
# Tuning is switched off for now; revisit this later
hyper_param_tune = False
if hyper_param_tune:
    param_test1 = {
        'max_depth': range(10, 13, 2),
        'min_child_weight': range(1, 4, 2)
    }
    gsearch1 = GridSearchCV(
        # n_jobs and random_state replace the deprecated nthread and seed arguments
        estimator=XGBClassifier(learning_rate=xgb_hparams['learning_rate'],
                                n_estimators=xgb_hparams['n_estimators'],
                                max_depth=xgb_hparams['max_depth'],
                                min_child_weight=xgb_hparams['min_child_weight'],
                                alpha=xgb_hparams['alpha'],
                                objective=xgb_hparams['objective'],
                                n_jobs=2,
                                random_state=62),
        # The iid argument was removed from GridSearchCV in scikit-learn 0.24
        param_grid=param_test1, scoring='roc_auc', n_jobs=2, cv=5)
    gsearch1.fit(X_train, y_train.values.reshape(-1))
    # grid_scores_ was replaced by cv_results_ in scikit-learn 0.20
    print(gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_)
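
To get a feel for the cost of the search before switching it on, it can help to list the candidates the grid above expands to. The sketch below uses only the standard library and the same two ranges; `candidates` and `n_fits` are illustrative names, and the count assumes the 5-fold cross-validation set in the cell above.

```python
from itertools import product

# Same candidate ranges as param_test1 in the grid search cell
param_grid = {
    'max_depth': range(10, 13, 2),        # expands to 10, 12
    'min_child_weight': range(1, 4, 2),   # expands to 1, 3
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

# Each candidate is fitted once per cross-validation fold
n_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print('Cross-validation fits:', n_fits)
```

Note that the range steps of 2 mean only two values per parameter are tried. GridSearchCV also refits the best candidate on the full training set by default, which adds one more fit.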

## <a class="anchor" id="save-model"></a>4. Save Model

Get timestamp for history and to ensure a unique model name. 

In [8]:
timestamp = datetime.datetime.now()
# Zero-pad every field so model names have a fixed width and sort chronologically
timestamp_str = timestamp.strftime('%Y-%m-%d-%H-%M-%S-%f')
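
If saved models are later listed by file name, a fixed-width timestamp with every field zero-padded keeps them in chronological order, whereas unpadded fields can misorder them (for example, minute 30 sorts before minute 4 as text). A small sketch of this, where `padded_timestamp` is a hypothetical helper and the two datetimes are made up for illustration:

```python
import datetime

def padded_timestamp(ts):
    """Format a datetime with every field zero-padded to a fixed width."""
    return ts.strftime('%Y-%m-%d-%H-%M-%S-%f')

# Two made-up save times on the same day
earlier = datetime.datetime(2021, 2, 19, 9, 5, 3, 42)
later = datetime.datetime(2021, 2, 19, 10, 4, 0, 7)

# Deliberately listed out of order
names = ['model-' + padded_timestamp(later),
         'model-' + padded_timestamp(earlier)]

# Plain string sorting now matches chronological order
for name in sorted(names):
    print(name)
```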


Save the XGBoost model with pickle.

In [10]:
# Use a context manager so the file handle is closed after writing
with open('{}/model-{}'.format(model_folder, timestamp_str), 'wb') as f:
    pickle.dump(xgb_model, f)
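
A saved model can be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted model and a temporary folder standing in for the model folder, so that it runs on its own; `stand_in_model`, `demo_folder`, and the `model-demo` file name are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object round-trips the same way
stand_in_model = {'name': 'demo-model', 'max_depth': 12}

with tempfile.TemporaryDirectory() as demo_folder:
    model_path = os.path.join(demo_folder, 'model-demo')  # hypothetical file name

    # Write the object to disk
    with open(model_path, 'wb') as f:
        pickle.dump(stand_in_model, f)

    # Read it back into a new object
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code. A pickled model may also fail to load under a different library version than the one that saved it.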