<h1 align="center" style="background-color:#616161;color:white">Linear Regression with SVM</h1>

Adapted from: https://github.com/nfmcclure/tensorflow_cookbook/tree/master/04_Support_Vector_Machines/03_Reduction_to_Linear_Regression


<h3 style="background-color:#616161;color:white">0. Setup</h3>

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Input Parameters</div>

In [1]:
PeriodGranularity = 30 # E.g. 15, 30, 60
# Train / Test split
newUsers = 10   # Number of randomly selected users to hold out for eval 2
rndPeriods = 3 # Number of random periods to select from each user
rndPeriodsLength = int(60/PeriodGranularity) * 24 * 7 * 4     # Length of each random test period, in periods (4 weeks)

# Root path
#root = "C:/DS/Github/MusicRecommendation"  # BA, Windows
root = "/home/badrul/Documents/git/MusicRecommendation" # BA, Linux
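
The length of a test window depends on `PeriodGranularity`. A minimal sketch of that arithmetic follows; the helper name `periods_in_four_weeks` is hypothetical and not part of the notebook's code module.

```python
def periods_in_four_weeks(period_granularity_minutes):
    """Number of time-series periods in a 4-week window."""
    # Same arithmetic as rndPeriodsLength: periods per hour * 24 h * 7 days * 4 weeks.
    periods_per_hour = 60 // period_granularity_minutes
    return periods_per_hour * 24 * 7 * 4


for granularity in (15, 30, 60):
    print(granularity, "min ->", periods_in_four_weeks(granularity), "periods")
```

With the 30-minute granularity used here, each 4-week window covers 2 * 24 * 7 * 4 = 1344 rows per user.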

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Common Libraries</div>

In [2]:
# Core
import numpy as np
import pandas as pd
from IPython.core.debugger import Tracer    # Used for debugging
import logging

# File and database management
import csv
import os
import sys
import json
import sqlite3
from pathlib import Path

# Date/Time
import datetime
import time
#from datetime import timedelta # Deprecated

# Visualization
import matplotlib.pyplot as plt             # Quick
%matplotlib inline

# Misc
import random

#-------------- Custom Libs -----------------#
os.chdir(root)

# Import the codebase module
fPath = root + "/1_codemodule"
if fPath not in sys.path: sys.path.append(fPath)

# Custom Libs
import coreCode as cc
import lastfmCode as fm

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Page Specific Libraries</div>

In [3]:
# Data science (comment out if not needed)
#from sklearn.manifold import TSNE
import tensorflow as tf
from tensorflow.python.framework import ops
ops.reset_default_graph()
from sklearn import metrics
from sklearn import preprocessing

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Declare Functions</div>

<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Load settings</div>

In [4]:
settingsDict =  cc.loadSettings()
dbPath = root + settingsDict['mainDbPath']
fmSimilarDbPath = root + settingsDict['fmSimilarDbPath']
fmTagsDbPath = root + settingsDict['fmTagsDbPath']
trackMetaDbPath = root + settingsDict['trackmetadata']

<h3 style="background-color:#616161;color:white">1. Load data</h3>

In [5]:
def getTrainAndTestData():
    con = sqlite3.connect(dbPath)
    c = con.cursor()

    # Get list of UserIDs 
    trainUsers = pd.read_sql_query("Select UserID from tblUsers Where tblUsers.TestUser = 0",con)

    fieldList="t, UserID, HrsFrom6pm, isSun,isMon,isTue,isWed,isThu,isFri,isSat,t1,t2,t3,t4,t5,t10,t12hrs,t24hrs,t1wk,t2wks,t3wks,t4wks"
    fieldNames = [f.strip() for f in fieldList.split(",")]  # One column name per field
    trainDf=pd.DataFrame(columns=fieldNames)  # Create an empty df
    testDf=pd.DataFrame(columns=fieldNames)  # Create an empty df
    periodsInAMonth=int(60/PeriodGranularity)*24*7*4

    totalRows=0
    
    for user in trainUsers.itertuples():
        # Get training dataset
        SqlStr="SELECT {} from tblTimeSeriesData where UserID = {}".format(fieldList,user.userID)
        df = pd.read_sql_query(SqlStr, con)
        totalRows += len(df)
    
        # Cut-off 1
        k = random.randint(periodsInAMonth, len(df))
        #Tracer()()  -- for debugging purposes
        testDf = testDf.append(df.iloc[k:k+periodsInAMonth])[df.columns.tolist()]

        tmp = df.drop(df.index[k:k+periodsInAMonth])

        # Cut-off 2
        k = random.randint(periodsInAMonth, len(tmp))
        testDf = testDf.append(tmp.iloc[k:k+periodsInAMonth])[df.columns.tolist()]
        trainDf = trainDf.append(tmp.drop(tmp.index[k:k+periodsInAMonth]))[df.columns.tolist()]

    if len(trainDf)+len(testDf) == totalRows:
        print('Ok')
    else:
        print("Incorrect. Total Rows = {}. TestDf+TrainDf rows = {}+{}={}".format(totalRows,len(testDf),len(trainDf),len(testDf)+len(trainDf)))
        
    return trainDf, testDf

trainDf,testDf = getTrainAndTestData()

#trainDf = trainDf.iloc[0:2000]
#testDf = testDf.iloc[0:2000]

trainDf['t'].replace(to_replace='0', value='-1', inplace=True)
testDf['t'].replace(to_replace='0', value='-1', inplace=True)
x_vals = trainDf.drop(['t','UserID'], axis=1).values
y_vals = trainDf['t'].values.astype(int)

# Change the 0's to -1
y_vals = np.array([1 if y==1 else -1 for y in y_vals])
y_vals =y_vals.reshape(len(y_vals),1)

# One-Hot version
y_vals_onehot = pd.get_dummies(trainDf['t']).values.astype(float)

# Test data
x_vals_test= testDf.drop(['t','UserID'], axis=1).values
y_vals_test = testDf['t'].values.astype(int)
y_vals_test = np.array([1 if y==1 else -1 for y in y_vals_test])
y_vals_test=y_vals_test.reshape(len(y_vals_test),1)

# One-Hot version
y_vals_test_onehot = pd.get_dummies(testDf['t']).values.astype(float)

Ok
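
The split above holds out random contiguous blocks of each user's time series. A minimal sketch of holding out a single block is shown below. It is not the notebook's exact logic: the cell above draws the start with `random.randint(periodsInAMonth, len(df))`, which is inclusive at both ends, so a start near the end gives a held-out block shorter than a month. The sketch instead assumes the whole block should fit inside the frame. The function name `hold_out_block` and the demo data are hypothetical, and `DataFrame.drop` is used in place of `DataFrame.append`, which was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd


def hold_out_block(df, block_len, rng):
    """Split df into (train, test), where test is one random contiguous block."""
    if len(df) <= block_len:
        raise ValueError("need more rows than block_len")
    # Choose a start so that the whole block fits inside the frame.
    start = int(rng.integers(0, len(df) - block_len + 1))
    test = df.iloc[start:start + block_len]
    train = df.drop(df.index[start:start + block_len])
    return train, test


rng = np.random.default_rng(0)
demo = pd.DataFrame({"t": np.arange(20) % 2, "x": np.arange(20)})
train, test = hold_out_block(demo, block_len=5, rng=rng)

# Same sanity check as the notebook: no rows lost or duplicated.
print(len(train) + len(test) == len(demo))
```

Calling the function twice, the second time on the returned `train` frame, reproduces the two cut-offs per user.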


<div style="background-color:white; color:#008000; font-family: 'Courier New', Monospace; font-weight: bold">Confirm dimensions</div>

In [6]:
numOfFeatures = np.shape(x_vals)[1]
np.shape(x_vals),np.shape(y_vals),np.shape(y_vals_onehot)

((937173, 20), (937173, 1), (937173, 2))

In [7]:
np.shape(x_vals_test), np.shape( y_vals_test),np.shape(y_vals_test_onehot)

((55650, 20), (55650, 1), (55650, 2))

<h3 style="background-color:#616161;color:white">2. Model One: Standard Logistic Regression</h3>

Adapted from: https://blog.altoros.com/using-logistic-and-softmax-regression-in-tensorflow.html
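
The cost function in this section is computed from the raw logits with `tf.nn.softmax_cross_entropy_with_logits` rather than by taking `tf.log` of the softmax output. A minimal NumPy sketch of why this matters is given below. The two function names are hypothetical, and the stable version uses the standard log-sum-exp shift; it illustrates the idea and is not a copy of TensorFlow's internals.

```python
import numpy as np


def naive_cross_entropy(logits, onehot):
    """Cross entropy via log(softmax(logits)); overflows for large logits."""
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return float(np.mean(-np.sum(onehot * np.log(probs), axis=1)))


def stable_cross_entropy(logits, onehot):
    """Cross entropy computed from logits with the log-sum-exp shift."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(np.mean(-np.sum(onehot * log_probs, axis=1)))


moderate = np.array([[2.0, -1.0], [0.5, 0.3]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
print(naive_cross_entropy(moderate, labels), stable_cross_entropy(moderate, labels))

extreme = np.array([[1000.0, 0.0]])
extreme_label = np.array([[0.0, 1.0]])
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_extreme = naive_cross_entropy(extreme, extreme_label)
stable_extreme = stable_cross_entropy(extreme, extreme_label)
print(naive_extreme, stable_extreme)
```

For moderate logits both versions agree. For the extreme example the naive version returns `nan` because `exp(1000)` overflows, while the shifted version returns a finite loss.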

In [8]:
mnistMode = False
   
# Set parameters
learning_rate = 0.01
training_iteration = 30
display_step = 2

if mnistMode:
    # Import MNIST data
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    batch_size = 100
    numOfFeatures=784 # 784 for MNIST
    numOfClasses=10
else:
    batch_size = max(int(np.size(y_vals)/100),50)
    numOfFeatures=20 # 784 for MNIST
    numOfClasses=2
    
# TF graph input
x = tf.placeholder("float", [None, numOfFeatures]) # Input features (784 = 28*28 pixels in MNIST mode)
y = tf.placeholder("float", [None, numOfClasses]) # One-hot labels (10 digit classes in MNIST mode)

# Create a model

# Set model weights
W = tf.Variable(tf.zeros([numOfFeatures, numOfClasses]))
b = tf.Variable(tf.zeros([numOfClasses]))

# Construct a linear model
model = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax
m=tf.matmul(x, W) + b

### Minimize error using cross entropy cost function ##

# Several versions of this cost function found online are incorrect or unstable.
# This one is wrong, never use it: https://stackoverflow.com/questions/33712178/tensorflow-nan-bug
# cost_function = -tf.reduce_sum(y*tf.log(model)*1)

# This works but is numerically unstable: https://www.tensorflow.org/get_started/mnist/beginners#training
# cost_function = tf.reduce_mean(-tf.reduce_sum(y * tf.log(model), reduction_indices=[1]))

# This is the correct method: https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/mnist/mnist_softmax.py
cost_function = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=m))

# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for iteration in range(training_iteration):
        avg_cost = 0.
        if mnistMode:
            total_batch = int(mnist.train.num_examples/batch_size)
        else:
            total_batch = int(len(x_vals)/batch_size)
        
        # Loop over all batches
        for i in range(total_batch):
            if mnistMode:
                batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            else:
                batch_xs = x_vals[i*batch_size:(i*batch_size)+batch_size]
                batch_ys = y_vals_onehot[i*batch_size:(i*batch_size)+batch_size]                
            
            # Fit training using batch data
            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
            # Compute average loss
            avg_cost += sess.run(cost_function, feed_dict={x: batch_xs, y: batch_ys})/total_batch
        # Display logs every display_step iterations
        if iteration % display_step == 0:
            print ("Iteration:", '%04d' % (iteration + 1), "cost=", "{:.9f}".format(avg_cost))

    print ("Tuning completed!")

    # Evaluation function
    
    preds=tf.argmax(model, 1)
    correct_prediction = tf.equal(preds, tf.argmax(y, 1))   
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    
    # Test the model
    if mnistMode:
        print ("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
    else:
        print ("Accuracy:", accuracy.eval({x: x_vals_test, y: y_vals_test_onehot}))   
        # Get predictions
        predictions= sess.run(tf.argmax(model, 1),feed_dict={x: x_vals_test})

        print(metrics.classification_report(np.argmax(y_vals_test_onehot,1),predictions))
        print(metrics.confusion_matrix(np.argmax(y_vals_test_onehot,1),predictions))
        print("* Precision = labelled as x / how many were actually x in the ones that were labelled")
        print("* Recall = labelled as x / how many were actually x in the dataset")

Iteration: 0001 cost= 0.352965117
Iteration: 0003 cost= 0.308893840
Iteration: 0005 cost= 0.278644631
Iteration: 0007 cost= 0.255437842
Iteration: 0009 cost= 0.237516905
Iteration: 0011 cost= 0.223548399
Iteration: 0013 cost= 0.212534204
Iteration: 0015 cost= 0.203736897
Iteration: 0017 cost= 0.196615376
Iteration: 0019 cost= 0.190773271
Iteration: 0021 cost= 0.185919295
Iteration: 0023 cost= 0.181838122
Iteration: 0025 cost= 0.178368805
Iteration: 0027 cost= 0.175389979
Iteration: 0029 cost= 0.172808978
Tuning completed!
Accuracy: 0.939299
             precision    recall  f1-score   support

          0       0.95      0.98      0.97     50472
          1       0.76      0.51      0.61      5178

avg / total       0.93      0.94      0.93     55650

[[49647   825]
 [ 2553  2625]]
* Precision = labelled as x / how many were actually x in the ones that were labelled
* Recall = labelled as x / how many were actually x in the dataset
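
The precision and recall figures in the report can be recovered from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A short sketch using the numbers printed above follows; the helper name `precision_recall` is hypothetical.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
confusion = [[49647, 825],
             [2553, 2625]]


def precision_recall(confusion, cls):
    """Return (precision, recall) for one class of a square confusion matrix."""
    true_pos = confusion[cls][cls]
    predicted_as_cls = sum(row[cls] for row in confusion)  # column total
    actually_cls = sum(confusion[cls])                      # row total
    return true_pos / predicted_as_cls, true_pos / actually_cls


for cls in (0, 1):
    p, r = precision_recall(confusion, cls)
    print("class {}: precision={:.2f} recall={:.2f}".format(cls, p, r))

accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
print("accuracy={:.4f}".format(accuracy))
```

Precision for class x is the share of rows predicted as x that really are x. Recall is the share of rows that really are x that were predicted as x. For class 1 this gives 2625 / 3450 and 2625 / 5178 respectively, which is why the high overall accuracy hides a recall of only about one half on the minority class.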


<h3 style="background-color:#616161;color:white">3. Model Two: SVM Regression Model</h3>

In [9]:
# SVM Regression
#----------------------------------
#
# This cell shows how to use TensorFlow to
# solve support vector regression. We are going
# to find the line whose margin (width 2*epsilon)
# INCLUDES as many points as possible
#
from tensorflow.python.framework import ops
ops.reset_default_graph()

# Create graph
sess = tf.Session()

# Declare batch size
batch_size = 500

# Initialize placeholders
x_data = tf.placeholder(shape=[None, numOfFeatures], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Create variables for linear regression
W = tf.Variable(tf.random_normal(shape=[numOfFeatures,1]))  # Weight vector
b = tf.Variable(tf.random_normal(shape=[1,1]))              # Constant

# Declare model operations
model_output = tf.add(tf.matmul(x_data, W), b)
prediction = tf.sign(model_output)

# Declare loss function
# = max(0, abs(target - predicted) - epsilon)
# epsilon = half the margin width
epsilon = tf.constant([0.1])

# Margin term in loss: only errors greater than epsilon count towards the loss. See http://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf
loss = tf.reduce_mean(tf.maximum(0., tf.subtract(tf.abs(tf.subtract(model_output, y_target)), epsilon)))

# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.075)
train_step = my_opt.minimize(loss)

# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

# Training loop
train_loss = []
test_loss = []

# Train
for i in range(500):
    # Select a batch of train data and train
    rand_index = np.random.choice(len(x_vals), size=batch_size)  
    rand_x = x_vals[rand_index]
    rand_y = y_vals[rand_index]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    
    # Monitor the loss on the full training and test sets
    temp_train_loss = sess.run(loss, feed_dict={x_data: x_vals, y_target: y_vals})
    train_loss.append(temp_train_loss)
    
    temp_test_loss = sess.run(loss, feed_dict={x_data: x_vals_test, y_target: y_vals_test})
    test_loss.append(temp_test_loss)
    if (i+1)%50==0:
        print('-----------')
        print('Generation: ' + str(i+1))
        #print('A = ' + str(sess.run(W)) + ' b = ' + str(sess.run(b)))
        print('Train Loss = ' + str(temp_train_loss))
        print('Test Loss = ' + str(temp_test_loss))

# Evaluate
output=sess.run(model_output, feed_dict={x_data: x_vals_test})
test_predictions = sess.run(prediction, feed_dict={x_data: x_vals_test})

print(metrics.classification_report(y_vals_test,test_predictions))
print(metrics.confusion_matrix(y_vals_test,test_predictions))
print("* Precision = labelled as x / how many were actually x in the ones that were labelled")
print("* Recall = labelled as x / how many were actually x in the dataset")

-----------
Generation: 50
Train Loss = 1.90308
Test Loss = 1.93949
-----------
Generation: 100
Train Loss = 2.05598
Test Loss = 2.08809
-----------
Generation: 150
Train Loss = 1.86781
Test Loss = 1.89594
-----------
Generation: 200
Train Loss = 1.54174
Test Loss = 1.56739
-----------
Generation: 250
Train Loss = 1.55054
Test Loss = 1.57303
-----------
Generation: 300
Train Loss = 1.48287
Test Loss = 1.50295
-----------
Generation: 350
Train Loss = 1.76014
Test Loss = 1.77682
-----------
Generation: 400
Train Loss = 1.60384
Test Loss = 1.61963
-----------
Generation: 450
Train Loss = 1.01567
Test Loss = 1.03249
-----------
Generation: 500
Train Loss = 1.87646
Test Loss = 1.88987
             precision    recall  f1-score   support

         -1       0.91      0.99      0.95     50472
          1       0.37      0.04      0.07      5178

avg / total       0.86      0.90      0.87     55650

[[50113   359]
 [ 4963   215]]
* Precision = labelled as x / how many were actually x in the ones that were labelled
* Recall = labelled as x / how many were actually x in the dataset
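
The loss minimised by this model is the epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are charged only for the amount by which they exceed epsilon. A minimal NumPy sketch with made-up predictions follows; the function name and the toy arrays are hypothetical.

```python
import numpy as np


def epsilon_insensitive_loss(predicted, target, epsilon=0.1):
    """Mean of max(0, |predicted - target| - epsilon)."""
    return float(np.mean(np.maximum(0.0, np.abs(predicted - target) - epsilon)))


target = np.array([1.0, -1.0, 1.0, -1.0])
predicted = np.array([1.05, -0.5, -0.2, -1.0])

# Per-sample contributions: 0 (inside the margin), 0.4, 1.1, 0.
loss = epsilon_insensitive_loss(predicted, target)
print(loss)

# Class labels are read off the sign of the regression output,
# as the notebook does with tf.sign(model_output).
print(np.sign(predicted))
```

Because the targets are -1 and 1, the regression output is turned into a class label by taking its sign, so the third sample above is both heavily penalised and misclassified.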

<h3 style="background-color:#616161;color:white">4. Model Three: Logistic Regression with RBF Kernel</h3>

Good resources:
* http://mccormickml.com/2014/02/26/kernel-regression/
* http://www.cc.gatech.edu/~isbell/tutorials/rbf-intro.pdf
* http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf

Notes:
$$P(y_t = 1) = \text{const} + \sum_d w_d \int \mathrm{RBF}(t';\, t - t_d,\, \sigma_d)\, dt'$$

where $w_d$ are the parameters of the linear regression and $t_d$, $\sigma_d$ are the parameters of the kernel (which can be optimised jointly or via cross-validation). For example, with $t_d$ = [1 hour, 1 day, 1 week] and $\sigma_d$ = [10 min, 1 hour, 12 hours], the model gives a real-valued score to all tracks played around 1 hour ± 10 min ago, around 1 day ± 1 hour ago, and so on.
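
One way to read the formula is as one feature per lag $t_d$. The sketch below is a rough illustration under stated assumptions: the integral is replaced by a sum over discrete play events, the kernel is an unnormalised Gaussian, and all times are in minutes. The function name `rbf_lag_features` and the example play times are hypothetical.

```python
import numpy as np


def rbf_lag_features(play_times, now, lags, widths):
    """One feature per lag: summed Gaussian weight of plays near (now - lag)."""
    play_times = np.asarray(play_times, dtype=float)[:, None]  # shape (plays, 1)
    centres = now - np.asarray(lags, dtype=float)[None, :]      # shape (1, lags)
    widths = np.asarray(widths, dtype=float)[None, :]
    kernel = np.exp(-0.5 * ((play_times - centres) / widths) ** 2)
    return kernel.sum(axis=0)


now = 20000.0                       # current time, in minutes
lags = [60.0, 1440.0, 10080.0]      # 1 hour, 1 day, 1 week
widths = [10.0, 60.0, 720.0]        # 10 min, 1 hour, 12 hours

# Plays exactly 1 hour ago, 25 hours ago, and 5000 minutes ago.
play_times = [now - 60.0, now - 1500.0, now - 5000.0]
features = rbf_lag_features(play_times, now, lags, widths)
print(features)
```

The resulting vector could be used as the input row of a logistic regression such as Model One, with the weights playing the role of $w_d$. In the example, the play exactly one hour ago gives the first feature a weight of 1, the play 25 hours ago sits one width away from the 1-day centre, and nothing falls near the 1-week centre.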
