<h1 align="center" style="background-color:#616161;color:white">Next play prediction using Logistic Regression</h1>

<h3 style="background-color:#616161;color:white">0. Setup</h3>

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Input Parameters</div>

In [1]:
PeriodGranularity = 30 # Length of one period in minutes, e.g. 15, 30, 60
# Train / Test split
newUsers = 10   # Number of randomly selected users to set aside for eval 2
rndPeriods = 3 # Number of random periods to select from each user
rndPeriodsLength = int(60/PeriodGranularity) * 24 * 7 * 4     # Length of each random test period, in periods (4 weeks)

# Root path
root = "C:/DS/Github/MusicRecommendation"  # BA, Windows
#root = "/home/badrul/Documents/github/MusicRecommendation" # BA, Linux
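
As a quick check of the period arithmetic above, the sketch below evaluates the same formula for a few granularities. The helper name `periods_in_four_weeks` is made up for illustration and is not part of the notebook's code module.

```python
def periods_in_four_weeks(period_granularity):
    """Number of periods in 4 weeks, mirroring the rndPeriodsLength formula."""
    periods_per_hour = int(60 / period_granularity)  # truncates toward zero
    return periods_per_hour * 24 * 7 * 4

for granularity in (15, 30, 60):
    print(granularity, periods_in_four_weeks(granularity))
# 15 2688
# 30 1344
# 60 672
```

Because of the `int(...)` truncation, a granularity that does not divide 60 evenly (45, say) is counted as one period per hour, so the formula is only exact for divisors of 60.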

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Import Libraries</div>

<code>Standard code used in every page. Not all of these libraries are used here.</code>

In [5]:
# Core
import numpy as np
import pandas as pd
from IPython.core.debugger import Tracer    # Used for debugging
import logging

# File and database management
import csv
import os
import sys
import json
import sqlite3
from pathlib import Path

# Date/Time
import datetime
import time
#from datetime import timedelta # Deprecated

# Visualization
import matplotlib.pyplot as plt             # Quick
%matplotlib inline

# Data science (comment out if not needed)
#from sklearn.manifold import TSNE
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Misc
import random

#-------------- Custom Libs -----------------#
os.chdir(root)

# Import the codebase module
fPath = root + "/1_codemodule"
if fPath not in sys.path: sys.path.append(fPath)

# Custom Libs
import coreCode as cc
import lastfmCode as fm

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Declare Functions</div>

<code>None</code>

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Load settings</div>

In [6]:
settingsDict =  cc.loadSettings()
dbPath = root + settingsDict['mainDbPath']
fmSimilarDbPath = root + settingsDict['fmSimilarDbPath']
fmTagsDbPath = root + settingsDict['fmTagsDbPath']
trackMetaDbPath = root + settingsDict['trackmetadata']

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Other setup</div>


<code>None</code>

<h3 style="background-color:#616161;color:white">1. Get train & test data</h3>

In this section we go through the users one at a time and, for each user, randomly select two sections of their data to use as test data. The code could be generalized to any n cut-off points.

More importantly, each test section is meant to cover an entire month's worth of data, which reduces leakage between training and test data. (We first pick a cut-off point, then move forward by a month to get the range of the test data.)
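
The generalization to n cut-off points mentioned above could look roughly like the sketch below. It is a variation rather than the notebook's method: it works on row positions only, rejects overlapping blocks, and always yields full-length blocks. The function name `pick_test_positions` and its parameters are hypothetical.

```python
import random

def pick_test_positions(n_rows, block_len, n_blocks, rng=random, max_tries=1000):
    """Return (train_positions, test_positions) as sorted lists of row positions.

    Each test block covers block_len consecutive rows, blocks never overlap,
    and no block starts within the first block_len rows.
    """
    if n_rows < 2 * block_len:
        raise ValueError("need at least two blocks' worth of rows")
    test = set()
    placed = 0
    for _ in range(max_tries):
        if placed == n_blocks:
            break
        # randint is inclusive at both ends, so the last block still fits
        start = rng.randint(block_len, n_rows - block_len)
        block = range(start, start + block_len)
        if test.isdisjoint(block):
            test.update(block)
            placed += 1
    if placed < n_blocks:
        raise ValueError("could not place {} non-overlapping blocks".format(n_blocks))
    train = [i for i in range(n_rows) if i not in test]
    return train, sorted(test)

# Small synthetic example: 100 rows, three test blocks of 10 rows each
rng = random.Random(0)
train_pos, test_pos = pick_test_positions(n_rows=100, block_len=10, n_blocks=3, rng=rng)
print(len(train_pos), len(test_pos))
```

With a per-user DataFrame `df`, the two lists could then be passed to `df.iloc[train_pos]` and `df.iloc[test_pos]`.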

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Get train and test data</div>


In [5]:
def getTrainAndTestData():
    con = sqlite3.connect(dbPath)

    # Get list of UserIDs (training users only)
    trainUsers = pd.read_sql_query("Select UserID from tblUsers Where tblUsers.TestUser = 0",con)

    fieldList="t, UserID, HrsFrom6pm, isSun,isMon,isTue,isWed,isThu,isFri,isSat,t1,t2,t3,t4,t5,t10,t12hrs,t24hrs,t1wk,t2wks,t3wks,t4wks"
    trainParts=[]  # Per-user training rows
    testParts=[]   # Per-user test rows
    periodsInAMonth=int(60/PeriodGranularity)*24*7*4

    totalRows=0
    
    for user in trainUsers.itertuples():
        # Get all time-series rows for this user
        SqlStr="SELECT {} from tblTimeSeriesData where UserID = {}".format(fieldList,user.userID)
        df = pd.read_sql_query(SqlStr, con)
        totalRows += len(df)
    
        # Cut-off 1: the test block starts at k and runs for up to one month
        # (it is shorter, or empty, when k falls within a month of the end)
        k = random.randint(periodsInAMonth, len(df))
        #Tracer()()  -- for debugging purposes
        testParts.append(df.iloc[k:k+periodsInAMonth])

        tmp = df.drop(df.index[k:k+periodsInAMonth])

        # Cut-off 2: taken from the rows that remain after removing the first block
        k = random.randint(periodsInAMonth, len(tmp))
        testParts.append(tmp.iloc[k:k+periodsInAMonth])
        trainParts.append(tmp.drop(tmp.index[k:k+periodsInAMonth]))

    con.close()

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    trainDf = pd.concat(trainParts)
    testDf = pd.concat(testParts)

    # Sanity check: every row must end up in exactly one of the two sets
    if len(trainDf)+len(testDf) == totalRows:
        print('Ok')
    else:
        print("Incorrect. Total Rows = {}. TestDf+TrainDf rows = {}+{}={}".format(totalRows,len(testDf),len(trainDf),len(testDf)+len(trainDf)))
        
    return trainDf, testDf

trainDf,testDf = getTrainAndTestData()

Ok


<h3 style="background-color:#616161;color:white">2. Logistic Regression Model</h3>

In [54]:
X = trainDf.drop(['t','UserID'], axis=1).values  # features: everything except the target and the user ID
Y = trainDf['t'].values.astype(int) 

# fit a logistic regression model to the data
model = LogisticRegression()
model.fit(X, Y)
print(model)

# make predictions
X = testDf.drop(['t','UserID'], axis=1).values
Y = testDf['t'].values.astype(int)

predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(Y,predicted))
print(metrics.confusion_matrix(Y,predicted))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

          0       0.98      0.98      0.98     51395
          1       0.73      0.68      0.71      3807

avg / total       0.96      0.96      0.96     55202

[[50453   942]
 [ 1219  2588]]
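
To make the report easier to read, the class 1 figures can be recomputed by hand from the confusion matrix printed above. This is only a sanity check on the printed numbers; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
tn, fp = 50453, 942
fn, tp = 1219, 2588
total = tn + fp + fn + tp

precision = tp / (tp + fp)            # of the predicted plays, how many were real
recall = tp / (tp + fn)               # of the real plays, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
print("accuracy={:.3f} baseline={:.3f}".format(accuracy, majority_baseline))
# precision=0.73 recall=0.68 f1=0.71
# accuracy=0.961 baseline=0.931
```

Since class 0 makes up about 93% of the test rows, the class 1 precision and recall say more about the model than the overall accuracy does.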


In [104]:
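# fieldList is local to getTrainAndTestData(), so redeclare it here before use
fieldList="t, UserID, HrsFrom6pm, isSun,isMon,isTue,isWed,isThu,isFri,isSat,t1,t2,t3,t4,t5,t10,t12hrs,t24hrs,t1wk,t2wks,t3wks,t4wks"
# Feature names in the same order as the columns of X (and of model.coef_)
#np.reshape(model.coef_,(20,1))
np.reshape(fieldList.split(',')[2:],(20,1))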

array([[' HrsFrom6pm'],
       [' isSun'],
       ['isMon'],
       ['isTue'],
       ['isWed'],
       ['isThu'],
       ['isFri'],
       ['isSat'],
       ['t1'],
       ['t2'],
       ['t3'],
       ['t4'],
       ['t5'],
       ['t10'],
       ['t12hrs'],
       ['t24hrs'],
       ['t1wk'],
       ['t2wks'],
       ['t3wks'],
       ['t4wks']], 
      dtype='<U11')
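
The feature names listed above line up with the columns of `model.coef_`, so one possible next step is to pair each name with its coefficient. The sketch below assumes that intent; `rank_features` is a hypothetical helper, and the demo inputs are placeholders, not the fitted coefficients.

```python
def rank_features(field_list, coefs):
    """Pair feature names with coefficients, largest absolute value first."""
    # Skip the first two fields (t and UserID) and strip stray spaces from the names
    names = [name.strip() for name in field_list.split(',')[2:]]
    if len(names) != len(coefs):
        raise ValueError("expected one coefficient per feature")
    return sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

# Placeholder inputs for illustration only; these are NOT the fitted coefficients
demo_fields = "t, UserID, HrsFrom6pm, isSun,t1"
demo_coefs = [0.1, -0.5, 2.0]
ranked = rank_features(demo_fields, demo_coefs)
for name, coef in ranked:
    print("{:<12}{:+.2f}".format(name, coef))
```

In the notebook this would be called as `rank_features(fieldList, model.coef_[0])`, since a binary `LogisticRegression` stores its coefficients with shape `(1, n_features)`. Magnitudes are only directly comparable between features that are on the same scale.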