## <font color='green'> Regression (Sparse)</font>

## Dataset - E2006-tfidf

### <font color='green'> 1. Description</font>
The dataset consists of Form 10-K financial reports from thousands of publicly traded U.S. companies, published between 1996 and 2006, together with stock return volatility measurements for the twelve-month period before and the twelve-month period after each report.

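Here the target variable (y) is the log volatility of a company's stock returns in the period after its report, which serves as a measure of the risk associated with buying that stock. The features (X) are derived from the report text as TF-IDF term weights, which is why the feature matrix is very wide (150,360 columns) and sparse.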

Download links: <br>
Train Data: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2 <br>
Test Data: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
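
The later cells expect uncompressed files at `datasets/E2006.train` and `datasets/E2006.test`. A minimal sketch of one way to fetch and unpack the archives with the standard library follows; the helper names `decompress_bz2` and `fetch_e2006` are hypothetical and not part of the original notebook.

```python
import bz2
import os
import shutil
import urllib.request

# Base URL taken from the download links above.
BASE_URL = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/"


def decompress_bz2(src_path, dst_path):
    """Decompress a .bz2 archive to dst_path, streaming to keep memory use low."""
    dst_dir = os.path.dirname(dst_path)
    if dst_dir:
        os.makedirs(dst_dir, exist_ok=True)
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def fetch_e2006(name, target_dir="datasets"):
    """Download E2006.<name>.bz2 (name is 'train' or 'test') and unpack it.

    Hypothetical helper: the archive is downloaded only if it is not
    already present in target_dir.
    """
    os.makedirs(target_dir, exist_ok=True)
    archive = os.path.join(target_dir, "E2006.{}.bz2".format(name))
    if not os.path.exists(archive):
        urllib.request.urlretrieve(BASE_URL + "E2006.{}.bz2".format(name), archive)
    return decompress_bz2(archive, os.path.join(target_dir, "E2006.{}".format(name)))
```

Calling `fetch_e2006("train")` and `fetch_e2006("test")` once should leave both files where the data preparation cell looks for them.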


Data source:
1. http://www.cs.cmu.edu/~ark/10K/
2. http://www.cs.cmu.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf
3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

### <font color='green'> 2. Data Preprocessing</font>

In [1]:
import os
import time
import numpy as np
import pandas as pd
from collections import OrderedDict
from sklearn import datasets

In [2]:
def preprocess_data(filename_train, filename_test, feature_dim):
    '''
    Load the E2006-tfidf train and test sets from svmlight/LIBSVM files.
    Passing the same feature_dim for both files keeps the two sparse
    feature matrices at the same width.
    '''
    x_train, y_train = datasets.load_svmlight_file(
        filename_train,
        n_features=feature_dim,
        dtype=np.float32)
    x_test, y_test = datasets.load_svmlight_file(
        filename_test,
        n_features=feature_dim,
        dtype=np.float32)
    return x_train, y_train, x_test, y_test
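
`load_svmlight_file` reads the svmlight/LIBSVM text format, in which each line holds a target value followed by `index:value` pairs for the non-zero features only, and returns a SciPy sparse matrix plus a NumPy target array. To make the format concrete, here is a small pure-Python sketch of parsing one such line; `parse_svmlight_line` is a hypothetical helper and the sample values are made up, not taken from the dataset.

```python
def parse_svmlight_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    # Anything after '#' is a comment in the svmlight format.
    line = line.split("#", 1)[0].strip()
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Illustrative line only: three non-zero features out of 150360.
sample = "-3.5 1:0.25 7:0.5 150360:0.125"
label, features = parse_svmlight_line(sample)
print(label, features)
```

Only the listed indices are stored, which is what makes a 150,360-column matrix practical to hold in memory.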

In [3]:
#---- Data Preparation ----
FILENAME_TRAIN = "datasets/E2006.train"
FILENAME_TEST = "datasets/E2006.test"
FEATURE_DIM = 150360
x_train, y_train, x_test, y_test = preprocess_data(FILENAME_TRAIN, FILENAME_TEST, FEATURE_DIM)
print("shape of train data: {}".format(x_train.shape))
print("shape of test data: {}".format(x_test.shape))

shape of train data: (16087, 150360)
shape of test data: (3308, 150360)


### <font color='green'> 3. Algorithm Evaluation</font>

In [4]:
train_time = []
test_time = []
train_score = []
test_score = []
estimator_name = []

In [5]:
def evaluate(estimator, estimator_nm,
             x_train, y_train,
             x_test, y_test):
    '''
    Fit the estimator and record its training time, scoring time and
    train/test scores. Works for both frovedis and sklearn estimators.
    '''
    estimator_name.append(estimator_nm)

    # Time the training step
    start_time = time.time()
    estimator.fit(x_train, y_train)
    train_time.append(round(time.time() - start_time, 4))

    # Time the scoring step; note that "test time" covers scoring
    # on both the train set and the test set
    start_time = time.time()
    train_score.append(estimator.score(x_train, y_train))
    test_score.append(estimator.score(x_test, y_test))
    test_time.append(round(time.time() - start_time, 4))
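
For scikit-learn regressors, `score` returns the coefficient of determination R², and the sketch below assumes the frovedis estimator follows the same convention. It shows how that value is computed from true and predicted targets; `r2_score_manual` is a hypothetical helper and the numbers are illustrative only.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and predictions, not taken from the dataset
r2 = r2_score_manual([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
print(r2)
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean target, and negative values are possible for models that do worse than that baseline.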

#### 3.1 LassoRegressor

In [6]:
TARGET = "lasso_regressor"

# Frovedis: start the server with 8 MPI processes, then fit and score
import frovedis
from frovedis.exrpc.server import FrovedisServer
FrovedisServer.initialize("mpirun -np 8 " + os.environ["FROVEDIS_SERVER"])
from frovedis.mllib.linear_model import Lasso as fLR
f_est = fLR(lr_rate=4.89E-05)
E_NM = TARGET + "_frovedis_" + frovedis.__version__
evaluate(f_est, E_NM, x_train, y_train, x_test, y_test)
f_est.release()  # release the fitted model before shutting the server down
FrovedisServer.shut_down()

# scikit-learn: fit and score on the same data
import sklearn
from sklearn.linear_model import Lasso as sLR
s_est = sLR(alpha=0.01)
E_NM = TARGET + "_sklearn_" + sklearn.__version__
evaluate(s_est, E_NM, x_train, y_train, x_test, y_test)

### <font color='green'> 4. Performance Summary</font>

In [7]:
summary = pd.DataFrame(OrderedDict({ "estimator": estimator_name,
                                     "train time": train_time,
                                     "test time": test_time,
                                     "train-score": train_score,
                                     "test-score": test_score
                                  }))
print(summary)

                         estimator  train time  test time  train-score  test-score
0  lasso_regressor_frovedis_0.9.10      3.4577     0.1509     0.528365    0.315151
1   lasso_regressor_sklearn_0.24.1      7.2052     0.0583     0.653095    0.516401
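
To read the summary at a glance, the sketch below derives the training-time ratio and the test-score gap from the values printed above. The `results` dictionary is a hand-copied, hypothetical structure, not something the notebook produces. Note that the two estimators were configured differently (`lr_rate=4.89E-05` for frovedis, `alpha=0.01` for sklearn), so the comparison may not be like-for-like.

```python
# Values copied by hand from the printed summary table
results = {
    "frovedis": {"train_time": 3.4577, "test_score": 0.315151},
    "sklearn": {"train_time": 7.2052, "test_score": 0.516401},
}

# How many times longer sklearn took to train than frovedis
speedup = results["sklearn"]["train_time"] / results["frovedis"]["train_time"]

# How much higher the sklearn test score is
score_gap = results["sklearn"]["test_score"] - results["frovedis"]["test_score"]

print("training time ratio (sklearn / frovedis): {:.2f}x".format(speedup))
print("test-score gap (sklearn - frovedis): {:.3f}".format(score_gap))
```

In this run frovedis trains roughly twice as fast, while sklearn reaches the higher score on both the train and test sets.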
