# Music Recommendation Project

This notebook covers algorithm development: feature import, a baseline algorithm, a more complex algorithm, and hyper-parameter optimization.

Author: Ben Walsh \
February 7, 2021

## Contents

1. [Feature Data Import](#feature-data-import)
2. Baseline Model
3. XGBoost Model
4. [Evaluate Model](#evaluate-model)
5. [Save Model and Results](#save-results)

## <a class="anchor" id="feature-data-import"></a>1. Feature Data Import

### Import libraries

In [48]:
import pandas as pd
import numpy as np
import xgboost as xgb
import pickle

import os
import json
import datetime

First, define the file paths for the cleaned data: X (features) and y (target) for training and testing.

In [5]:
X_train_file = './data-input-clean/X_train.csv'
X_test_file = './data-input-clean/X_test.csv'
y_train_file = './data-input-clean/y_train.csv'
y_test_file = './data-input-clean/y_test.csv'

### Import training/test data

In [6]:
if os.path.exists(X_train_file):
    X_train = pd.read_csv(X_train_file)
else:
    print('Training data file {} not found!'.format(X_train_file))

if os.path.exists(X_test_file):
    X_test = pd.read_csv(X_test_file)
else:
    print('Test data file {} not found!'.format(X_test_file))

if os.path.exists(y_train_file):
    y_train = pd.read_csv(y_train_file)
else:
    print('Training data file {} not found!'.format(y_train_file))

if os.path.exists(y_test_file):
    y_test = pd.read_csv(y_test_file)
else:
    print('Test data file {} not found!'.format(y_test_file))
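
The four checks above follow the same pattern, and a missing file only prints a message, so the error surfaces later as an undefined variable. A minimal sketch of a more compact variant is shown here. The helper names `load_csv_or_fail` and `load_splits` are hypothetical, and raising `FileNotFoundError` instead of printing is an assumption about the desired behavior, not something the notebook does.

```python
import os

import pandas as pd


def load_csv_or_fail(path):
    """Read a CSV into a DataFrame, raising immediately if the file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError('Data file {} not found!'.format(path))
    return pd.read_csv(path)


def load_splits(folder):
    """Load the four splits, keyed by file stem (X_train, X_test, y_train, y_test)."""
    names = ['X_train', 'X_test', 'y_train', 'y_test']
    return {name: load_csv_or_fail(os.path.join(folder, name + '.csv'))
            for name in names}
```

With the notebook's folder layout this would be called as `splits = load_splits('./data-input-clean')`, after which `splits['X_train']` takes the place of `X_train`.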

## 2. Baseline Model
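
This section is empty in the notebook. As a point of reference only, one common baseline is to predict the most frequent training label for every row. The sketch below assumes integer class labels; the function name `majority_class_baseline` and the toy arrays are made up for illustration and are not the author's baseline or the project's data.

```python
import numpy as np


def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row."""
    labels, counts = np.unique(np.ravel(y_train), return_counts=True)
    majority_label = labels[np.argmax(counts)]
    accuracy = float(np.mean(np.ravel(y_test) == majority_label))
    return majority_label, accuracy


# Toy labels, made up for illustration (not the project's data)
toy_y_train = np.array([1, 1, 0, 1, 0])
toy_y_test = np.array([1, 0, 1, 1])

label, acc = majority_class_baseline(toy_y_train, toy_y_test)
print('Majority label = {}, baseline accuracy = {:.2f}%'.format(label, 100 * acc))
```

Any model accuracy is easier to interpret next to a number like this, since it shows how much is gained over ignoring the features entirely.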

## 3. XGBoost Model

In [41]:
xgb_hyper_params = {'objective': 'reg:linear',
                   'colsample_bytree': 0.3,
                   'learning_rate': 0.1,
                   'max_depth': 5,
                   'alpha': 10,
                   'n_estimators': 10}

In [43]:
xgb_model = xgb.XGBRegressor(objective = xgb_hyper_params['objective'],  # 'reg:linear' is deprecated in newer XGBoost releases in favor of 'reg:squarederror'
                             colsample_bytree = xgb_hyper_params['colsample_bytree'], 
                             learning_rate = xgb_hyper_params['learning_rate'],
                             max_depth = xgb_hyper_params['max_depth'], 
                             alpha = xgb_hyper_params['alpha'], 
                             n_estimators = xgb_hyper_params['n_estimators'])

In [8]:
xgb_model.fit(X_train, y_train)



XGBRegressor(alpha=10, colsample_bytree=0.3, max_depth=5, n_estimators=10)

## <a class="anchor" id="evaluate-model"></a>4. Evaluate Model

Compare training and testing accuracy

In [9]:
y_predict_train = xgb_model.predict(X_train)
y_predict_test = xgb_model.predict(X_test)

In [10]:
# Round outputs to compare
y_predict_train = y_predict_train.round().reshape(len(y_predict_train),1)
y_predict_test = y_predict_test.round().reshape(len(y_predict_test),1)

In [37]:
train_acc = (y_predict_train == y_train.values).sum() / len(y_train)
print('Accuracy on training set = {:.2f}%'.format(100*train_acc))

Accuracy on training set = 61.67%


In [39]:
test_acc = (y_predict_test == y_test.values).sum() / len(y_test)
print('Accuracy on testing set = {:.2f}%'.format(100*test_acc))

Accuracy on testing set = 61.69%
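
The accuracy cells above compare rounded regression outputs against the labels element by element. The reshape to a column is presumably there because `predict` returns a 1-D array while a one-column DataFrame's `.values` is 2-D; comparing the two directly would broadcast to an n-by-n matrix. A minimal sketch of the same calculation that flattens both sides instead is shown here; the function name `rounded_accuracy` and the toy arrays are hypothetical.

```python
import numpy as np


def rounded_accuracy(y_pred, y_true):
    """Fraction of rounded predictions that equal the true labels."""
    y_pred = np.ravel(y_pred).round()
    y_true = np.ravel(y_true)
    if y_pred.shape != y_true.shape:
        raise ValueError('Prediction and label counts differ')
    return float(np.mean(y_pred == y_true))


# Toy values, made up for illustration
toy_pred = np.array([0.2, 0.7, 0.51, 0.4])   # 1-D, like the output of predict()
toy_true = np.array([[0], [1], [0], [0]])    # column, like a one-column DataFrame's .values

# Without flattening or reshaping, the comparison broadcasts to 4 x 4
broadcast_shape = (toy_pred.round() == toy_true).shape

toy_acc = rounded_accuracy(toy_pred, toy_true)
print(broadcast_shape, toy_acc)
```

Here three of the four rounded predictions match, so `toy_acc` is 0.75.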


### Observations

With the initial hyper-parameters, and with no song features or newly engineered features in the training data, the XGBoost model reaches a test accuracy of 61.69%. This is nearly identical to the training accuracy (61.67%), which indicates the model is not overfitting.

## <a class="anchor" id="save-results"></a>5. Save Model and Results

Get a timestamp for the model history and to ensure a unique model name.

In [29]:
timestamp = datetime.datetime.now()
# Zero-pad every field so that timestamps have a fixed width and sort correctly
timestamp_str = timestamp.strftime('%Y-%m-%d-%H-%M-%S-%f')


Save model with pickle

In [33]:
model_folder = './saved_models'
# Create the folder if it does not exist yet
os.makedirs(model_folder, exist_ok=True)

In [34]:
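# Use a context manager so the file is closed once the model is written
with open('{}/model-{}'.format(model_folder, timestamp_str), 'wb') as model_file:
    pickle.dump(xgb_model, model_file)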
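
A saved model can later be restored with `pickle.load`. The round trip is sketched below with a plain dictionary standing in for `xgb_model` and a temporary folder standing in for `./saved_models`; both stand-ins are assumptions made so the example runs without the trained model.

```python
import os
import pickle
import tempfile

# Stand-in object for illustration; in the notebook this would be xgb_model
stand_in_model = {'name': 'stand-in', 'params': {'max_depth': 5}}

with tempfile.TemporaryDirectory() as tmp_folder:
    model_path = os.path.join(tmp_folder, 'model-example')

    # Save: open in binary write mode and close the file when done
    with open(model_path, 'wb') as model_file:
        pickle.dump(stand_in_model, model_file)

    # Restore: open in binary read mode
    with open(model_path, 'rb') as model_file:
        restored_model = pickle.load(model_file)

print(restored_model == stand_in_model)
```

Two caveats apply to real models: unpickling needs the same library (here XGBoost) to be importable, and pickle files should only be loaded from trusted sources.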

Read model registry and retrieve latest model index

In [142]:
with open('{}/model-history.json'.format(model_folder), 'r') as openfile:
    # Read the registry from the JSON file
    model_registry = json.load(openfile)

print(model_registry)

{'0': {'time': '2021-02-06-17-30-53-334642', 'data-features': ['discover', 'explore', 'listen with', 'my library', 'notification', 'radio', 'search', 'settings', 'city', 'bd', 'registered_via', 'registration_init_time', 'expiration_date'], 'model-params': {'objective': 'reg:linear', 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10, 'n_estimators': 10}, 'train-acc': 0.6166558284115004, 'test-acc': 0.6168999460516007}, '1': {'time': '2021-02-06-17-30-53-334642', 'data-features': ['discover', 'explore', 'listen with', 'my library', 'notification', 'radio', 'search', 'settings', 'city', 'bd', 'registered_via', 'registration_init_time', 'expiration_date'], 'model-params': {'objective': 'reg:linear', 'colsample_bytree': 0.3, 'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10, 'n_estimators': 10}, 'train-acc': 0.6166558284115004, 'test-acc': 0.6168999460516007}}


In [138]:
new_model_index = np.array(list(model_registry.keys())).astype(int).max() + 1
new_model_registry_info = {
    str(new_model_index): {
        'time': timestamp_str,
        'data-features': list(X_train.columns.values),
        'model-params': xgb_hyper_params,
        'train-acc': train_acc,
        'test-acc': test_acc
    }
}

Append latest model results and save back to model registry

In [139]:
model_registry.update(new_model_registry_info)

In [141]:
with open('{}/model-history.json'.format(model_folder), 'w') as json_file:
    json.dump(model_registry, json_file)
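
The read, append, and write steps above assume that `model-history.json` already exists and holds at least one entry. A minimal sketch that also handles the first run is shown here. The helper names `load_registry`, `append_entry`, and `save_registry` are hypothetical, starting from an empty registry is an assumption about the desired first-run behavior, and the entries are toy values.

```python
import json
import os
import tempfile


def load_registry(path):
    """Return the registry dict, or an empty dict if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, 'r') as registry_file:
        return json.load(registry_file)


def append_entry(registry, entry):
    """Store entry under the next integer index (as a string key) and return that index."""
    next_index = max((int(key) for key in registry), default=-1) + 1
    registry[str(next_index)] = entry
    return next_index


def save_registry(registry, path):
    """Write the registry back to disk as JSON."""
    with open(path, 'w') as registry_file:
        json.dump(registry, registry_file)


# Demonstration in a temporary folder with toy entries (not real results)
with tempfile.TemporaryDirectory() as tmp_folder:
    registry_path = os.path.join(tmp_folder, 'model-history.json')

    registry = load_registry(registry_path)            # empty on the first run
    first_index = append_entry(registry, {'test-acc': 0.5})
    save_registry(registry, registry_path)

    registry = load_registry(registry_path)            # now read back from disk
    second_index = append_entry(registry, {'test-acc': 0.6})

print(first_index, second_index)
```

JSON object keys are always strings, which is why the index is converted with `str` on write and `int` on read.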

Print out highest score yet

In [149]:
test_accs = [model_registry[key]['test-acc'] for key in model_registry.keys()]

In [155]:
print('Highest test accuracy = {:.3f} from index = {}'.format(np.max(test_accs), test_accs.index(np.max(test_accs))))

Highest test accuracy = 0.617 from index = 0
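
The index printed above is a position in the `test_accs` list, which matches the registry key only while the keys run 0, 1, 2, ... without gaps. A minimal sketch that looks up the best entry by key instead is shown here; the function name `best_entry` and the toy registry are made up for illustration.

```python
def best_entry(registry, metric='test-acc'):
    """Return the (key, score) pair of the registry entry with the highest metric."""
    best_key = max(registry, key=lambda key: registry[key][metric])
    return best_key, registry[best_key][metric]


# Toy registry with gaps in the keys (made-up scores, not real results)
toy_registry = {
    '0': {'test-acc': 0.61},
    '2': {'test-acc': 0.64},
    '5': {'test-acc': 0.62},
}

best_key, best_score = best_entry(toy_registry)
print('Highest test accuracy = {:.3f} from index = {}'.format(best_score, best_key))
```

For this toy registry the list-position approach would report 1, while the key of the best entry is `'2'`.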
