Benchmarking: One Machine 
=========


**Typical User Experience**

Laptop Specs:
    
    Intel Core i7
    16gb RAM
    NVIDIA GeForce GTX 965M 2GB GDDR5 memory 
    Microsoft Windows 10
    Running Jupyter Notebooks and multiple programs in the background

In [None]:
from IPython.display import HTML

In [None]:
HTML('''<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

**Import Packages**

In [30]:
import warnings
warnings.simplefilter(action='ignore')

import pandas as pd
import numpy as np
import re
import os

import time #cpu time
import psutil #memory usage

#tensorflow
import tensorflow as tf

#Scikit-learn
import sklearn as sk
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_svmlight_file
from sklearn.datasets import dump_svmlight_file


from scipy.sparse import coo_matrix,csr_matrix,lil_matrix
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score


# For Visualization
import matplotlib.pyplot as plt
#displays better in jupyter notebooks
%matplotlib inline

print('TensorFlow version: {0}'.format(tf.__version__))
print('SciKit-Learn version: {0}'.format(sk.__version__))

TensorFlow version: 1.9.0
SciKit-Learn version: 0.20.0


For benchmarking to start with we are using LIBSVM's **Avazu-App data** 

**Source(s)**: 

LIBSVM Data Classification: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a

Avazu's Click-through Prediction https://www.kaggle.com/c/avazu-ctr-prediction/data

**Preprocessing**: This data is used in a competition on click-through rate prediction jointly hosted by Avazu and Kaggle in 2014. The participents were asked to learn a model from the first 10 days of advertising log, and predict the click probability for the impressions on the 11th day. The data sets here are generated by applying our winning solution without some complicated components. To reproduce this data, you can execute our code and see the results in the directory "base." For better test scores, we divide the data to two disjoint groups "app" and "site," and conduct training and prediction tasks on the two groups independetly. Specifically, each instance has either "site_id=85f751fd" or "app_id=ecad2386," and these two feature values never co-occur. Thus we can split the data set according to them. The organizers do not disclose the test labels, so the labels in the test sets are not meaningful. To obtain a test score, please use the code provided below to generate and submit a file to the competition site. Because data are timely dependent, cross validation is not suitable for parameter selection. We provide a training-validation split (e.g., "avazu-app.tr" and "avazu-app.val") by consider the last 4,218,938 training instances for validation. [YJ16a]

- Number of classes: 2

- Number of data: 40,428,967 / 4,577,464 (testing) / 14,596,137 (avazu-app) / 1,719,304 (avazu-app.t) / 12,642,186 (avazu-app.tr) / 1,953,951 (avazu-app.val) / 25,832,830 (avazu-site) / 2,858,160 (avazu-site.t) / 23,567,843 (avazu-site.tr) / 2,264,987 (avazu-site.val)

    Number of features: 1,000,000
Files:

- avazu-app.bz2 (app) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2
- avazu-app.t.bz2 (app's testing) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2
- avazu-app.tr.bz2 (app's tr) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2
- avazu-app.val.bz2 (app's val) https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.val.bz2

*Download data*

In [1]:
#Download data 

#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2 #<---full app data
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2 #<---benchmark training
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2 #<---benchmark testing
#!wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.val.bz2 #<---benchmark validation

Extracting Files:
    
    For Scikitlearn extracting the files is unnecessary since the load_svm_light_file with automatically extract .bz2 files. However for Tensorflow will need them already extracted to have comparable results. 

In [None]:
#!bzip2 -dk avazu-app.bz2 #Extract full data 
#!bzip2 -dk avazu-app.tr.bz2 #Extract Training
#!bzip2 -dk avazu-app.t.bz2 #Extract Testing

***Classes and Functions***

In [4]:
#Data importing function
#scipy.sparse matrix of shape (n_samples, n_features)

def get_data(file):
    data = load_svmlight_file(file)#avazu-app.tr.bz2
    return data[0], data[1]

**For Importing Raw Individual File**

*If the data is in a single file that hasn't already been seperated in training and testing datasets.*

In [None]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n',mem_baseline)

In [None]:
file_location=os.getcwd()+'/avazu-app'

In [None]:
X, y=get_data(file_location) #import raw data
mem_InData=psutil.virtual_memory()

In [None]:
print('Here is the memory usage after importing data:\n',mem_InData) #  physical memory usage
print('\nThe time taken to import the raw data:')
exec_time1 = %%timeit -o X, y =get_data(file_location) #import raw data

In [None]:
print('Here is the type of sparse matrix format for the data:')
X

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print('The shape of the X training set:\n (samples,features)',X_train.shape)
print('The shape of the X testing set:\n (samples,features)',X_test.shape)
print('The shape of the Y training set:\n (samples,features)',y_train.shape)
print('The shape of the Y testing set:\n (samples,features)',y_test.shape)

** For Importing Previously Seperated Training and Testing File**


If the data is already seperated into a training and testing file, you should begin your testing here.              

In [None]:
#Designate file location
train_file=os.getcwd()+'\\avazu-app.tr'
test_file=os.getcwd()+'\\avazu-app.t'

In [None]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n\n',mem_baseline)

In [None]:
start = time.time()

train_set = DataSet()
train_set.load(train_file)

end = time.time()
exec_time1=(end - start)
print('\nThe time taken to import and prepare training data for Tensorflow is:',exec_time1)

In [None]:
start = time.time()
X_train, y_train=get_data(train_file)
X_test, y_test=get_data(test_file)

end = time.time()
exec_time=(end - start)
print('Time to import the data into Scikit-Learn:',exec_time,' seconds')

In [None]:
mem_baseline=psutil.virtual_memory() #  physical memory usage
print('Here is the memory usage afer importing the data:\n\n',mem_baseline)

Execute Logistic Regression in SciKit Learn
-------

The parameters for the logistic regression (Gradient Descent) are:
    
    Regularization: L2
    
    Regularization Threshold (C): 1.0
    
    Tolerance: 0.001
    
    Fit Intercpt: Yes (True)
    
    Processors (n_jobs): 1
    
    Max Number of Iterations: 100

In [None]:
# Instantiate a logistic regression model, and fit with X and y

model = linear_model.LogisticRegression(penalty='l2',\
                                        C=1.0,\
                                        tol=0.001,\
                                        fit_intercept=True,\
                                        n_jobs=1,\
                                        max_iter=100)

model = model.fit(X_train, y_train.ravel())#

In [None]:
exec_time2 = %%timeit -o model.fit(X_train, y_train.ravel())
print('\nThe average time taken to execute the logistic regression:',exec_time2,'seconds')

In [None]:
#Setup variables for evaluation metrics
y_pred = model.predict(X_train)
y_obs = y_train
y_score = y_pred

**Model Metrics**

In [None]:
r2=metrics.r2_score(y_obs, y_pred)
accuracy=model.score(X_train, y_train)
prec=metrics.precision_score(y_obs, y_pred, labels=None, pos_label=1)
recall=metrics.recall_score(y_obs, y_pred, labels=None, pos_label=1)
f1 = metrics.f1_score(y_obs,y_pred)
ROC_AUC = metrics.roc_auc_score(y_obs, y_score)
print('The correlation coefficient:',r2,\
      '\nThe accuracy of the model:',accuracy,\
      '\nThe precision (tp / (tp + fp)):',prec,\
      '\nThe recall (tp / (tp + fn)):',recall,\
      '\nThe f1 score is:',f1,\
      '\nThe Area Under the Curve score is:',ROC_AUC)

Execute Logistic Regression in Tensorflow
-------------

*For splitting data file*

In [190]:
input_file=os.getcwd()+'\\avazu-app.t'

X,y=get_data(input_file)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dump_svmlight_file(X_train, y_train,'train_file')#%80
dump_svmlight_file(X_test, y_test,'test_file')#%20
print('Data file has been train/test split.')

Data file has been train/test split.


Here the GPU is used for the task but restricted to using only 75% of GPU resources in order to prevent crashing.

In [48]:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)

***Classes and Functions***

In [49]:
class DataSet(object):
    def __init__(self):
        self.iter = 0
        self.epoch_pass = 0

    def load(self, file, features, length):
        '''
         1. Each line is read and split by white space, then appends the first value of matrix as y's. 
         2. Then counts the number of pairs (int:int) represented by each y and stores that. 
         3. Then splits pairs (int:int) where the first value represents the sparse ids,
                  and the second value is the sparse value
        '''
        X, y=load_svmlight_file(file,n_features=features,zero_based=True,length=length)
        self.feature_num=X.shape[1] #The number of cols in X set
        self.ins_num =X.shape[0] #The number of rows in X set
        self.y = list(y)
        self.feature_ids = list(X.indices) #column index
        self.feature_value = list(X.data) #values
        self.ins_feature_interval =list(X.indptr) #row starts 
        self.ins_feature_interval_diff = [(j-i) for i, j in zip(X.indptr[:-1], X.indptr[1:])] #difference between row start records
    

    def mini_batch(self, batch_size):
        '''
        0. Ultimately this function creates slice boundaries of (batch size).
        1. Initially sets beginning and end bounds equivalent to iteration count starting at zero
        2.       if the iteration count plus batch size is greater than the number of records
        3. set the end bound equal to record size, then reset iteration to 0 and record that 1 epoch has been achieved
            (i.e keep adding the end bounds together until end of dataset then record that 1 pass is complete)
        4. Otherwise-
            keep adding batch size to end bound then set iteration to end bound 
            thus a batch size of 100 will set end bound (0,100,200,300,400)
                note: begin is set to iteration so when it goes through the loop--
                begin will be 1 batch size smaller than end bound until records limit is reached
        '''
        begin = self.iter #begins as 0 as defined above
        end = self.iter #starts with 0 as defined above
        if self.iter + batch_size > self.ins_num: #if 0 + batchsize(10) > ins_num(1) defined in def load
            end = self.ins_num #set end to be ins_num(1) 
            self.iter = 0 #set iter to 0
            self.epoch_pass += 1 #add +1 to epoch_pass 
        else:
            end += batch_size#add batch size to end, which should be equal to batch size
            self.iter = end#set self.iter to batch size
        return self.slice(begin, end)

    def slice(self, begin, end):
        '''
        This function does the actual slicing of batch sizes and creates objects used to pass into SparseTensor. 
        The format should look like this:
        SparseTensor(indices=[[0, 0], [0, 1]...], values=[1,1,1...], dense_shape=[1000, 15])
        '''
        sparse_index = []
        sparse_ids = list(train_set.feature_ids[train_set.ins_feature_interval[begin]:train_set.ins_feature_interval[end]])
        sparse_values = list(self.feature_value[self.ins_feature_interval[begin]:self.ins_feature_interval[end]])
        sparse_shape = [end - begin,max(self.ins_feature_interval_diff)]
        y = np.array(self.y[begin:end]).reshape((end - begin, 1))
        for i in range(begin, end):
            for j in range(self.ins_feature_interval[i], self.ins_feature_interval[i + 1]):
                sparse_index.append([i - begin, j - self.ins_feature_interval[i]]) 
        return (sparse_index, sparse_ids, sparse_values, sparse_shape, y)

In [50]:
class BinaryLogisticRegression(object):
    def __init__(self, feature_num):
        self.feature_num = feature_num
        self.sparse_index = tf.placeholder(tf.int64)
        self.sparse_ids = tf.placeholder(tf.int64)
        self.sparse_values = tf.placeholder(tf.float32)
        self.sparse_shape = tf.placeholder(tf.int64)
        self.w = tf.Variable(tf.random_normal([self.feature_num, 1], stddev=0.1))
        self.y = tf.placeholder("float", [None, 1])

    def forward(self):
        return tf.nn.embedding_lookup_sparse(self.w,
                                             tf.SparseTensor(self.sparse_index, self.sparse_ids, self.sparse_shape),
                                             tf.SparseTensor(self.sparse_index, self.sparse_values, self.sparse_shape),
                                             combiner="sum")

In [51]:
mem_baseline1=psutil.virtual_memory() #  physical memory usage
print('Here is the memory baseline prior to importing data:\n\n',mem_baseline1)

Here is the memory baseline prior to importing data:

 svmem(total=17101512704, available=7341273088, percent=57.1, used=9760239616, free=7341273088)


In [52]:
learning_rate =  0.001
max_iter = 10
batch_size = 1000
feature_num=1000000

train_file=os.getcwd()+'\\avazu-app.tr'
test_file=os.getcwd()+'\\avazu-app.t'

In [53]:
start = time.time()

train_set = DataSet()
train_set.load(train_file,feature_num,40428967)

end = time.time()
exec_time1=(end - start)
print('\nThe time taken to import and prepare training data for Tensorflow is:',exec_time1)


The time taken to import and prepare training data for Tensorflow is: 6.182252645492554


In [54]:
start = time.time()

test_set = DataSet()
test_set.load(test_file,feature_num,4577464)

end = time.time()
exec_time2=(end - start)
print('\nThe time taken to import and prepare testing data for Tensorflow is:',exec_time2)


The time taken to import and prepare testing data for Tensorflow is: 0.7349328994750977


In [55]:
mem_baseline2=psutil.virtual_memory() #  physical memory usage
print('Here is the memory usage after importing data:\n',mem_baseline2)

Here is the memory usage after importing data:
 svmem(total=17101512704, available=7266021376, percent=57.5, used=9835491328, free=7266021376)


In [56]:
model = BinaryLogisticRegression(feature_num)

In [57]:
y = model.forward()

In [58]:
loss = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=model.y))

In [59]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

In [60]:
probability_output = tf.nn.sigmoid(y)

In [61]:
session = tf.Session()
init_all_variable = tf.global_variables_initializer()
init_local_variable = tf.local_variables_initializer()
session.run([init_all_variable, init_local_variable])

[None, None]

Set number of passes in for loop. 

*This is done primarily for timing purposes to get an average calculation time.*

In [62]:
num_passes=1 #number of passes in for loop

In [63]:
start = time.time()
end_list=[]
for i in range(0,num_passes):
    while train_set.epoch_pass < max_iter:
        #fills values in batch units while for each epoch pass in less than max_iterations
        sparse_index, sparse_ids, sparse_values, sparse_shape, mb_y = train_set.mini_batch(batch_size)
        
        _, loss_, prob_out = session.run([optimizer, loss, probability_output],
                                         feed_dict={model.sparse_index: sparse_index,
                                                    model.sparse_ids: sparse_ids,
                                                    model.sparse_values: sparse_values,
                                                    model.sparse_shape: sparse_shape,
                                                    model.y: mb_y})
        
    end = time.time()
    exec_time=(end - start)
    end_list.append(exec_time) 
    #save endlist not exec_time
    try:
        auc = roc_auc_score(mb_y, prob_out)
        print("epoch: ", train_set.epoch_pass, " Receiver Operating Curve, Area Under the Curve score is: ", auc)

    except:
        print('\nValueError: Only one class present in y_true. ROC AUC score is not defined in that case.\n')
        print(mb_y.T)
        print(prob_out.T,'\n')

print('\nThe average time taken to execute logistic regression for '+str(num_passes)+' full passes of',max_iter,'iterations took',np.array(end_list).mean(),'seconds with a standard deviation of +- '+str(np.array(end_list).std()))

epoch:  10  Receiver Operating Curve, Area Under the Curve score is:  0.5047846889952153

The average time taken to execute logistic regression for 1 full passes of 10 iterations took 183.83894801139832 seconds with a standard deviation of +- 0.0


In [64]:
bench_list=[str(mem_baseline1),str(exec_time1),str(exec_time2),str(mem_baseline2),str(np.array(end_list).mean())]
with open('bench_times.txt', 'w') as f:
    for item in bench_list:
        f.write("%s\n" % item)

_____________________________________

Benchmarks
======

Scikit Learn AWS c5.2xlarge
-------

**SciKitLearn Before importing:**  svmem(total=16222670848, available=15447273472, percent=4.8, used=425279488, free=11094663168, active=2583482368, inactive=2261311488, buffers=263196672, cached=4439531520, shared=21082112, slab=169435136)
    
**SciKitLearn After importing:** svmem(total=16222670848, available=12618625024, percent=22.2, used=3253944320, free=6678728704, active=8217722880, inactive=1035603968, buffers=263544832, cached=6026452992, shared=21082112, slab=169373696)

**SciKitLearn time to import**: 132.54389119148254 seconds




***Average Calculation Time:***
    
**SciKitLearn:**
1 loop, best of 3: 9min 55s per loop
('\nThe average time taken to execute the logistic regression:', <IPython.core.magics.execution.TimeitResult object at 0x7fdb87730fd0>, 'seconds')

Tensorflow AWS c5.2xlarge
-------------

**Tensorflow Before importing:**  svmem(total=16222670848, available=15502614528, percent=4.4, used=352391168, free=1381421056, active=5944606720, inactive=8003710976, buffers=438689792, cached=14050168832, shared=21106688, slab=780996608)
    
**Tensorflow After importing:** svmem(total=16222670848, available=15418843136, percent=5.0, used=436162560, free=1297649664, active=6027079680, inactive=8003710976, buffers=438689792, cached=14050168832, shared=21106688, slab=780820480)

**Tensorflow time to import**: 
The time taken to import and prepare training data for Tensorflow is: 0.6254868507385254
The time taken to import and prepare testing data for Tensorflow is: 0.08609414100646973


***Average Calculation Time:***
    
**Tensorflow:**
The average time taken to execute logistic regression for 5 full passes of 100 iterations took 411.0613938808441 seconds with a standard deviation of +- 0.0011656490621771976