Benchmarking on One Machine 
=========


**Typical User Experience**

Laptop Specs:
    
    Intel Core i7
    16gb RAM
    NVIDIA GeForce GTX 965M 2GB GDDR5 memory 
    Microsoft Windows 10
    Running Jupyter Notebooks and multiple programs in the background

In [1]:
from IPython.display import HTML

In [2]:
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

**Import Packages**

In [1]:
import warnings
warnings.simplefilter(action='ignore')

import pandas as pd
import numpy as np
import re
import os

import time #cpu time
import psutil #memory usage
#tensorflow
import tensorflow as tf

#Scikitlearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_svmlight_file
from scipy.sparse import coo_matrix,csr_matrix,lil_matrix
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# For Visualization
import matplotlib.pyplot as plt
#displays better in jupyter notebooks
%matplotlib inline

In [2]:
#!wget https://repo.anaconda.com/archive/Anaconda2-5.3.0-Linux-x86_64.sh
#export PATH="/home/ubuntu/anaconda2/bin:$PATH"

In [3]:
#conda install numpy 
#conda install -c intel mkl
#!pip install matplotlib --user

For benchmarking to start with we are using LIBSVM's **Avazu-App data** 

**Source(s)**: 

LIBSVM Data Classification: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a

Avazu's Click-through Prediction https://www.kaggle.com/c/avazu-ctr-prediction/data

**Preprocessing**: This data is used in a competition on click-through rate prediction jointly hosted by Avazu and Kaggle in 2014. The participents were asked to learn a model from the first 10 days of advertising log, and predict the click probability for the impressions on the 11th day. The data sets here are generated by applying our winning solution without some complicated components. To reproduce this data, you can execute our code and see the results in the directory "base." For better test scores, we divide the data to two disjoint groups "app" and "site," and conduct training and prediction tasks on the two groups independetly. Specifically, each instance has either "site_id=85f751fd" or "app_id=ecad2386," and these two feature values never co-occur. Thus we can split the data set according to them. The organizers do not disclose the test labels, so the labels in the test sets are not meaningful. To obtain a test score, please use the code provided below to generate and submit a file to the competition site. Because data are timely dependent, cross validation is not suitable for parameter selection. We provide a training-validation split (e.g., "avazu-app.tr" and "avazu-app.val") by consider the last 4,218,938 training instances for validation. [YJ16a]

- Number of classes: 2

- Number of data: 40,428,967 / 4,577,464 (testing) / 14,596,137 (avazu-app) / 1,719,304 (avazu-app.t) / 12,642,186 (avazu-app.tr) / 1,953,951 (avazu-app.val) / 25,832,830 (avazu-site) / 2,858,160 (avazu-site.t) / 23,567,843 (avazu-site.tr) / 2,264,987 (avazu-site.val)

    Number of features: 1,000,000
Files:

- avazu-app.bz2 (app) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2
- avazu-app.t.bz2 (app's testing) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2
- avazu-app.tr.bz2 (app's tr) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2
- avazu-app.val.bz2 (app's val) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.val.bz2

In [11]:
#Download data 
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a.t # <--data just to make sure things work
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2 #<---full app data
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2 #<---benchmark training
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2 #<---benchmark testing
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.val.bz2 #<---benchmark validation

Extracting Files:
    
    For Scikitlearn extracting the files is unnecessary since the load_svm_light_file with automatically extract .bz2 files. However for Tensorflow will need them already extracted to have comparable results. 

In [12]:
#!bzip2 -dk avazu-app.tr.bz2 #Extract Training
#!bzip2 -dk avazu-app.t.bz2 #Extract Testing

***Classes and Functions***

In [5]:
#Data importing function
def get_data(file):
    data = load_svmlight_file(file)#avazu-app.tr.bz2
    return data[0], data[1]

**For Importing Raw Individual File**

*If the data is in a single file that hasn't already been seperated in training and testing datasets.*

In [5]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n',mem_baseline)

('Here is the memory baseline prior to importing data:\n', svmem(total=8369991680, available=6987788288, percent=16.5, used=1092698112, free=4756865024, active=2981306368, inactive=455618560, buffers=113709056, cached=2406719488, shared=21860352, slab=108249088))


In [14]:
file_location=os.getcwd()+'/avazu-app.tr'

In [15]:
X, y =get_data(file_location) #import raw data
mem_InData=psutil.virtual_memory()

In [None]:
print('Here is the memory usage after importing data:\n',mem_InData) #  physical memory usage
print('\nThe time taken to import the raw data:')
exec_time1 = %%timeit -o X, y =get_data(file_location) #import raw data

('Here is the memory usage after importing data:\n', svmem(total=8369991680, available=4503855104, percent=46.2, used=3531640832, free=725569536, active=5411880960, inactive=1915072512, buffers=144220160, cached=3968561152, shared=21860352, slab=246988800))

The time taken to import the raw data:


In [None]:
print('Here is the type of sparse matrix format for the data:')
X

In [393]:
#May not be appropriate for time dependant data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

** For Importing Previously Seperated Training and Testing File**


If the data is already seperated into a training and testing file, you should begin your testing here.              

In [2]:
#Designate file location
train_file_location=os.getcwd()+'/avazu-app.tr'
test_file_location=os.getcwd()+'/avazu-app.t'

In [3]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n\n',mem_baseline)

('Here is the memory baseline prior to importing data:\n\n', svmem(total=8369991680, available=7271579648, percent=13.1, used=831164416, free=7229689856, active=813244416, inactive=201641984, buffers=28139520, cached=280997888, shared=21852160, slab=62763008))


In [6]:
X_train, y_train=get_data(train_file_location)
X_test, y_test=get_data(test_file_location)

In [7]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory usage afer importing the data:\n\n',mem_baseline)

('Here is the memory usage afer importing the data:\n\n', svmem(total=8369991680, available=4499521536, percent=46.2, used=3603218432, free=1644941312, active=3578728448, inactive=3014131712, buffers=28303360, cached=3093528576, shared=21852160, slab=62894080))


Execute Logistic Regression in SciKit Learn
-------

The parameters for the logistic regression (Gradient Descent) are:
    
    Regularization: L2
    
    Regularization Threshold (C): 1.0
    
    Tolerance: 0.001
    
    Fit Intercpt: Yes (True)
    
    Processors (n_jobs): 1
    
    Max Number of Iterations: 100

In [None]:
# Instantiate a logistic regression model, and fit with X and y

model = linear_model.LogisticRegression(penalty='l2',\
                                        C=1.0,\
                                        tol=0.001,\
                                        fit_intercept=True,\
                                        n_jobs=1,\
                                        max_iter=100)

model = model.fit(X_train, y_train.ravel())#

In [395]:
exec_time2 = %%timeit -o model.fit(X_train, y_train.ravel())
print('\nThe average time taken to execute the logistic regression:',exec_time2,'seconds')

391 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The average time taken to execute the logistic regression: 391 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) seconds


In [396]:
#Setup variables for evaluation metrics
y_pred = model.predict(X_test)
y_obs = y_test
y_score = y_pred

**Model Metrics**

In [340]:
r2=metrics.r2_score(y_obs, y_pred)
accuracy=model.score(X_train, y_train)
prec=metrics.precision_score(y_obs, y_pred, labels=None, pos_label=1)
recall=metrics.recall_score(y_obs, y_pred, labels=None, pos_label=1)
f1 = metrics.f1_score(y_obs,y_pred)
ROC_AUC = metrics.roc_auc_score(y_obs, y_score)
print('The correlation coefficient:',r2,\
      '\nThe accuracy of the model:',accuracy,\
      '\nThe precision (tp / (tp + fp)):',prec,\
      '\nThe recall (tp / (tp + fn)):',recall,\
      '\nThe f1 score is:',f1,\
      '\nThe Area Under the Curve score is:',ROC_AUC)

The correlation coefficient: 0.1846532267117451 
The accuracy of the model: 0.8495910927184944 
The precision (tp / (tp + fp)): 0.7361835245046924 
The recall (tp / (tp + fn)): 0.5912897822445561 
The f1 score is: 0.655829075708314 
The Area Under the Curve score is: 0.76239916181873


Execute Logistic Regression in Tensorflow
-------------

Here the GPU is used for the task but restricted to using only 75% of GPU resources in order to prevent crashing.

In [362]:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)

In [363]:
learning_rate =  0.001
max_iter = 100
batch_size = 30

train_file = os.getcwd()+'\\tensorflow-models\\data\\libsvm_data\\a1a'
test_file = os.getcwd()+'\\tensorflow-models\\data\\libsvm_data\\a1a.t'

***Classes and Functions***

In [364]:
class DataSet(object):
    def __init__(self):
        self.iter = 0
        self.epoch_pass = 0

    def load(self, file):
        '''
        '''
        self.ins_num = 0 #<set at zero
        f = open(file, "r")
        self.y = []
        self.feature_ids = []
        self.feature_values = []
        self.ins_feature_interval = []
        self.max_ins_feature_interval = []
        self.ins_feature_interval.append(0)#makes zero the starting value in ins_feature_interval
        self.max_token=[]
        for line in f.readlines():#iterating through open file
            regexp = re.compile(r':')#<---If feature has a colon then do
            tokens = line.split(" ")#split lines in the file
            #tokens.remove('\n')
            #print(tokens[0])
            self.y.append(float(tokens[0]))#append to y the first value in tokens (+1,-1,1)
            try:
                tokens[-1] = tokens[-1].strip()#<----remove '\n'
                tokens.remove('') #<---remove '' empty in list
            except:
                pass
            
            #                         last value in list is that value + (line splits -1)<--maybe adjusting for return (\n) or y value 
            #print(self.ins_feature_interval[-1]+ len(tokens)-1)#<----stacks the batch sizes
            self.ins_feature_interval.append(self.ins_feature_interval[-1]+ len(tokens)-1)
            #print(len(self.ins_feature_interval))
            for feature in tokens:#(len(tokens)~16
              #  print(feature)
                if regexp.search(feature):#if there is a colon in feature
                    self.max_token.append(feature)#check on size
                    feature_id, feature_value = feature.split(":") #split on colon
                    if feature_id:
                        self.feature_ids.append(int(feature_id))#append to feature ids
                        self.feature_values.append(float(feature_value)) #append feature values
            self.ins_num += 1 #set ins_num to 1
        self.feature_num=max(self.feature_ids)#modify feature_num to max of ids (maximum # of features)
        #self.max_ins_feature_interval=max(self.ins_feature_interval)
        print('the max number of features:',self.feature_num)
    

    def mini_batch(self, batch_size):
        begin = self.iter #begins as 0 as defined above
        end = self.iter #starts with 0 as defined above
        if self.iter + batch_size > self.ins_num: #if 0 + batchsize(10) > ins_num(1) defined in def load
            end = self.ins_num #set end to be ins_num(1) 
            self.iter = 0 #set iter to 0
            self.epoch_pass += 1 #add +1 to epoch_pass 
        else:
            end += batch_size#add batch size to end, which should be equal to batch size
            #print('end:',end)
            #print('batch_size:',batch_size)
            self.iter = end#set self.iter to batch size
            #(begin, end)setting bounds moving across batch length in data
            #print((begin, end))#slicing action of data (0, 15) (15, 30) (30, 45) (60,...
        return self.slice(begin, end)
#Error
    def slice(self, begin, end):
            sparse_index = []
            sparse_ids = []
            sparse_values = []
            sparse_shape = []
            max_feature_num = 0
            for i in range(begin, end):#within range begin, end
            #              15,461,906,1351                  0,446,891,1338 =~15 supposed to be length of token
            #          (token length + range number +1) - (token length + range number)
                feature_num = self.ins_feature_interval[i + 1] - self.ins_feature_interval[i]
                if feature_num > max_feature_num:
                    max_feature_num = feature_num
                                    #       0,446,891,1338                  15,461,906,1351
                #(token length + range number) ,       (token length + range number +1)
                #self.max_ins_feature_interval-len(self.feature_ids)<-------------------
                #print(self.ins_feature_interval[i],self.ins_feature_interval[i + 1])
                #print(self.feature_ids[i])
                for j in range(self.ins_feature_interval[i], self.ins_feature_interval[i + 1]):
                #print(j,(len(self.feature_ids)))
                #print(self.ins_feature_interval[i + 1])
                #print(range(self.ins_feature_interval[i], self.ins_feature_interval[i + 1]))
                #15 vals:                  0  , 0-14,446-460,891-905
                    sparse_index.append([i - begin, j - self.ins_feature_interval[i]]) # index must be accent
                    #[0, 0]-[0, 14][0, 0]-[0, 12]
                    #print([i - begin, j - self.ins_feature_interval[i]])
                    sparse_ids.append(self.feature_ids[j])
                    sparse_values.append(self.feature_values[j])
            sparse_shape.append(end - begin)
            #print(end - begin)#<-----30
            sparse_shape.append(max_feature_num)
            #print(max_feature_num)#<-----15
            #       Creates array shape of 30,1 of y values  (30, 1)
            y = np.array(self.y[begin:end]).reshape((end - begin, 1))
            #begin:0,30,60,90,120,150,180,210 intervals of 30
            #end: 30,60,90,120,150,180,210,240
            #            0            0                       0                 30            30
            #print(len(sparse_index), len(sparse_ids), len(sparse_values), len(sparse_shape), len(y))
            return (sparse_index, sparse_ids, sparse_values, sparse_shape, y)

In [365]:
class BinaryLogisticRegression(object):
    def __init__(self, feature_num):
        self.feature_num = feature_num
        self.sparse_index = tf.placeholder(tf.int64)
        self.sparse_ids = tf.placeholder(tf.int64)
        self.sparse_values = tf.placeholder(tf.float32)
        self.sparse_shape = tf.placeholder(tf.int64)
        self.w = tf.Variable(tf.random_normal([self.feature_num, 1], stddev=0.1))
        self.y = tf.placeholder("float", [None, 1])

    def forward(self):
        return tf.nn.embedding_lookup_sparse(self.w,
                                             tf.SparseTensor(self.sparse_index, self.sparse_ids, self.sparse_shape),
                                             tf.SparseTensor(self.sparse_index, self.sparse_values, self.sparse_shape),
                                             combiner="sum")

In [366]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n\n',mem_baseline)

Here is the memory baseline prior to importing data:

 svmem(total=17101512704, available=10041020416, percent=41.3, used=7060492288, free=10041020416)


In [400]:
start = time.time()

train_set = DataSet()
train_set.load(train_file)
test_set = DataSet()
test_set.load(test_file)
feature_num=test_set.feature_num

end = time.time()
exec_time=(end - start)
print('\nThe time taken to import and prepare data for Tensorflow is:',exec_time)

the max number of features: 119
the max number of features: 123

The time taken to import and prepare data for Tensorflow is: 1.808777093887329


In [368]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory usage after importing data:\n',mem_baseline)

Here is the memory usage after importing data:
 svmem(total=17101512704, available=9906634752, percent=42.1, used=7194877952, free=9906634752)


In [369]:
model = BinaryLogisticRegression(feature_num)

In [370]:
y = model.forward()

In [371]:
loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=model.y))

In [372]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

In [373]:
probability_output = tf.nn.sigmoid(y)

In [374]:
session = tf.Session()
init_all_variable = tf.global_variables_initializer()
init_local_variable = tf.local_variables_initializer()
session.run([init_all_variable, init_local_variable])

[None, None]

Set number of passes in for loop. 

*This is done primarily for timing purposes to get an average calculation time.*

In [375]:
num_passes=5 #number of passes in for loop

In [379]:
start = time.time()
end_list=[]
for i in range(0,num_passes):
    while train_set.epoch_pass < max_iter:
        sparse_index, sparse_ids, sparse_values, sparse_shape, mb_y = train_set.mini_batch(batch_size)
        _, loss_, prob_out = session.run([optimizer, loss, probability_output],
                                         feed_dict={model.sparse_index: sparse_index,
                                                    model.sparse_ids: sparse_ids,
                                                    model.sparse_values: sparse_values,
                                                    model.sparse_shape: sparse_shape,
                                                    model.y: mb_y})
        
    end = time.time()
    exec_time=(end - start)
    end_list.append(exec_time) 

    try:
        auc = roc_auc_score(mb_y, prob_out)
        print("epoch: ", train_set.epoch_pass, " ROC AUC score is: ", auc)

    except:
        print('\nValueError: Only one class present in y_true. ROC AUC score is not defined in that case.\n')
        print(mb_y.T)
        print(prob_out.T,'\n')

print('\nThe average time taken to execute logistic regression for '+str(num_passes)+' full passes of',max_iter,'iterations took',np.array(end_list).mean(),'seconds with a standard deviation of +- '+str(np.array(end_list).std()))

epoch:  100  ROC AUC score is:  0.5
epoch:  100  ROC AUC score is:  0.5
epoch:  100  ROC AUC score is:  0.5
epoch:  100  ROC AUC score is:  0.5
epoch:  100  ROC AUC score is:  0.5

The average time taken to execute logistic regression for 5 full passes of 100 iterations took 0.00516057014465332 seconds with a standard deviation of +- 0.0028760574334623728


_____________________________________

Benchmarks
======

Average time taken to import data:

    SciKitLearn:
    
    Tensorflow:
    
    RocketML:

Memory Usage:
    
    SciKitLearn Before importing: svmem(total=8369991680, available=6985629696, percent=16.5, used=1094762496, free=4751126528, active=2986393600, inactive=456204288, buffers=114388992, cached=2409713664, shared=21860352, slab=108478464)
    
    SciKitLearn After importing:svmem(total=8369991680, available=4224389120, percent=49.5, used=3855998976, free=1019949056, active=5932777472, inactive=1237405696, buffers=114442240, cached=3379601408, shared=21860352, slab=108474368))
    
Average Calculation Time:
    
    SciKitLearn:
        
    Tensorflow:
    
    RocketML:
        