## This notebook demonstrates regression models implemented with TensorFlow and the dataset library
- The models are implemented in the class `RegressionModel(TFModel)` in regression_model.py
- The class `MyBatch(Batch)` contains methods for preprocessing data
- Functions for data generation are implemented in generate_data.py

In [1]:
import sys
import seaborn


import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

sys.path.append("../..")
import generate_data as gr
from dataset import Dataset, F, V, B
from my_batch import MyBatch
from regression_model import RegressionModel
import my_batch as mb

In [2]:
%env CUDA_VISIBLE_DEVICES=1

env: CUDA_VISIBLE_DEVICES=1


In [3]:
BATCH_SIZE = 100
TRAINING_EPOCHS = 500
DATA_SIZE = 500
NUM_DIM = 30

# Linear Regression

![](http://66.147.244.197/~globerov/introspectivemode/wp-content/uploads/2012/08/regression-265x300.jpeg)

Let's consider a linear regression problem where $y \in \mathbb{R}$ and the relationship can be modeled as
$$y(x) = \langle w, x\rangle,$$ where $x \in \mathbb{R}^{d+1}$ is a vector of $d$ independent variables with a constant 1 appended for the intercept. To find the solution $w \in \mathbb{R}^{d+1}$ we minimize the mean of the squared residuals. With $l_2$ regularization the minimized functional is:

$$ \frac{1}{N}\sum_{i=1}^N (\langle w, x_i \rangle - y_i) ^ 2 + \dfrac{C}{2}\lVert w \rVert^2  \to \min_w$$
We find the solution using stochastic gradient descent.
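
To make the objective concrete, here is a minimal NumPy sketch of minibatch stochastic gradient descent on the regularized squared-error functional. It is not the code inside `RegressionModel`; the function name `sgd_ridge` and all of its parameters are hypothetical, and the toy data exist only for illustration. As in the formula, the penalty is applied to the whole vector $w$, including the intercept component.

```python
import numpy as np

def sgd_ridge(X, y, C=0.0, lr=0.01, n_epochs=200, batch_size=100, seed=0):
    """Minimise mean squared residuals + (C/2)*||w||^2 with minibatch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])      # append a constant 1 for the intercept
    w = np.zeros(d + 1)
    for _ in range(n_epochs):
        order = rng.permutation(n)            # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = Xb[idx] @ w - y[idx]
            # gradient of the batch mean of squared residuals plus the l2 term
            grad = 2.0 * Xb[idx].T @ residual / len(idx) + C * w
            w -= lr * grad
    return w

# toy data (hypothetical): three features, intercept 4.0, small Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 4.0 + 0.1 * rng.normal(size=500)

w_hat = sgd_ridge(X, y, C=0.0, lr=0.05, n_epochs=200)
```

With `C=0.0` the estimate should land close to `true_w` and the intercept; increasing `C` shrinks all components of `w_hat` towards zero.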

In [4]:
# gr.load_linear_data returns np.array of features and a linearly dependent target with normally distributed noise
data = gr.load_linear_data(NUM_DIM, DATA_SIZE)

# create dataset object
my_dataset = Dataset(index=np.arange(data[0].shape[0]), batch_class=MyBatch)

# create train and test indices
my_dataset.cv_split(0.8)
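
Since generate_data.py is not shown in this notebook, the following is only a hypothetical stand-in for the two steps above: generating linearly dependent targets with Gaussian noise and splitting the index 80/20. The names `make_linear_data` and `split_indices` are invented for illustration, and shuffling before the split is an assumption rather than a statement about how `cv_split` works.

```python
import numpy as np

def make_linear_data(num_dim, size, noise=0.1, seed=0):
    """Hypothetical generator: features, and targets = <w, x> + Gaussian noise."""
    rng = np.random.default_rng(seed)
    features = rng.uniform(-1.0, 1.0, size=(size, num_dim))
    weights = rng.normal(size=num_dim)
    targets = features @ weights + noise * rng.normal(size=size)
    return features, targets

def split_indices(n_items, train_share=0.8, seed=0):
    """Shuffle positions 0..n_items-1 and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_share * n_items))
    return order[:n_train], order[n_train:]

features, targets = make_linear_data(num_dim=30, size=500)
train_idx, test_idx = split_indices(len(features), train_share=0.8)
```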

In [5]:
placeholders_config = {'features': {'shape': (NUM_DIM)},
                       'labels': {'shape' : 1,
                                  'name': 'targets'}}

In [6]:
config={'inputs': placeholders_config,
        'loss':'mse',
        'optimizer': {'name':'GradientDescentOptimizer', 'use_locking': True,  'learning_rate': 0.01},
        'input_block/inputs': 'features',
        'body/dimension': V('dim_shape'),
        'body/model_type': 'linear',
}

In [7]:
ppl = (my_dataset.train.p
                     .load(data)
                     .preprocess_linear_data()
                     .init_variable('dim_shape', (0))
                     .update_variable('dim_shape', F(lambda batch: batch.features.shape[-1]))
                     .init_model('dynamic', RegressionModel, 'LinearRegressionModel',
                                 config=config)
                     .train_model('LinearRegressionModel',  feed_dict={'features': B('features'),
                                                                       'labels': B('labels')})
                     .run(BATCH_SIZE, n_epochs=TRAINING_EPOCHS, shuffle=True))



model_type = linear


In [8]:
ppl_test = (my_dataset.test.p
                     .load(data)
                     .preprocess_linear_data()
                     .init_variable('y_true', init_on_each_run=list)
                     .init_variable('y_pred', init_on_each_run=list)
                     .init_variable('mse', init_on_each_run=list)
                     .init_variable('x_features', init_on_each_run=list)
                     .import_model('LinearRegressionModel', ppl)
                     .predict_model('LinearRegressionModel',  fetches=['targets', 'RegressionModel/predictions', 
                                                                       'RegressionModel/mse', 'features'],
                                                              feed_dict={'features': B('features'),
                                                                         'labels': B('labels')},
                                                              save_to=([V('y_true'), V('y_pred'), V('mse'),
                                                                        V('x_features')]), mode='a')
                     .run(BATCH_SIZE, n_epochs=1))

In [9]:
# load labels, features, predictions and error from pipeline
y_true = ppl_test.get_variable('y_true')
y_pred = ppl_test.get_variable('y_pred')
mse = ppl_test.get_variable('mse') 
x_features = ppl_test.get_variable('x_features')

In [10]:
variance = np.var(y_pred, ddof=1) / np.var(y_true, ddof=1)
print('Variance ratio: %.2f' % variance)

Variance ratio: 0.93


In [11]:
# this is the absolute error (not the MSE): its mean and three standard deviations
abs_error = np.abs(np.array(y_pred) - np.array(y_true))
mean = np.mean(abs_error)
interval = 3 * np.std(abs_error)
print('Absolute error: {0:.2f} ± {1:.2f} (mean ± 3 std)'.format(mean, interval))

Absolute error: 0.17 ± 0.35 (mean ± 3 std)


In [12]:
# mean of the per-sample absolute errors divided by the corresponding targets, in percent
absolute_error_ratio = np.mean(np.abs(np.array(y_pred) - np.array(y_true)) / np.array(y_true)) * 100
print('Mean absolute error relative to the targets: {0:.2f}%'.format(absolute_error_ratio))

Mean absolute error relative to the targets: 2.60%
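
The quantities reported in the cells above are easy to confuse, so here is a small sketch that writes them as separate functions. The function names and the toy arrays are hypothetical. The last two functions show that dividing each error by its own target is not the same as dividing the mean absolute error by the mean target.

```python
import numpy as np

def variance_ratio(y_true, y_pred):
    """Sample variance of the predictions relative to that of the targets."""
    return np.var(y_pred, ddof=1) / np.var(y_true, ddof=1)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of per-sample relative errors, in percent; y_true must have no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def mae_relative_to_mean(y_true, y_pred):
    """Mean absolute error divided by the mean target, in percent."""
    return 100.0 * mean_absolute_error(y_true, y_pred) / np.mean(y_true)

# toy values (hypothetical)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([11.0, 19.0, 33.0, 38.0])

print(variance_ratio(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(mean_absolute_percentage_error(y_true, y_pred))
print(mae_relative_to_mean(y_true, y_pred))
```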


# Logistic Regression

Logistic regression is used for binary classification problems. For $y \in \{-1, 1\}$ the model is $y = \operatorname{sign} \langle w, x\rangle$ and the minimized functional is:
$$ \dfrac{1}{N}\sum_{i=1}^N \log(1 + \exp(-\langle w, x_i \rangle y_i)) + \dfrac{C}{2}\lVert w \rVert^2  \to \min_w$$
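
As a rough illustration of this functional, the sketch below evaluates the logistic loss for labels in $\{-1, 1\}$, computes its gradient, and runs plain gradient descent on toy data. It is not the `RegressionModel` implementation; all names (`logistic_loss`, `logistic_grad`, `predict`) and the data are hypothetical.

```python
import numpy as np

def logistic_loss(w, X, y, C=0.0):
    """Mean of log(1 + exp(-y * <w, x>)) plus (C/2)*||w||^2, labels in {-1, 1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) equals log(1 + exp(-m)) without overflow for large |m|
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * C * np.dot(w, w)

def logistic_grad(w, X, y, C=0.0):
    margins = y * (X @ w)
    # 1 / (1 + exp(m)) written with tanh to stay numerically stable
    sigma_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return -(X.T @ (y * sigma_neg)) / len(y) + C * w

def predict(w, X):
    """Class labels given by the sign of <w, x>."""
    return np.where(X @ w >= 0, 1, -1)

# toy data (hypothetical): two features plus a constant column for the intercept
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(3)
for _ in range(500):
    w -= 0.5 * logistic_grad(w, X, y)    # full-batch gradient descent step

accuracy = np.mean(predict(w, X) == y)
```

At `w = 0` every margin is zero, so the loss starts at $\log 2$ and should decrease from there as the weights are fitted.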


In [22]:
# create random data for classification
data = gr.load_random_data(NUM_DIM, DATA_SIZE, blobs=False)

# create dataset object
my_dataset = Dataset(index=np.arange(data[0].shape[0]), batch_class=MyBatch)

# create train and test indices
my_dataset.cv_split()

In [23]:
placeholders_config = {'features': {'shape': (NUM_DIM)},
                       'labels': {'shape' : 1,
                                  'name': 'targets'}}

In [24]:
config={'inputs': placeholders_config,
        'optimizer': {'name':'GradientDescentOptimizer', 'use_locking': True,  'learning_rate': 0.01},
        'input_block/inputs': 'features',
        'body/dimension': V('dim_shape'),
        'body/model_type': 'logistic',
}

In [25]:
ppl = (my_dataset.train.p
                     .load(data)
                     .preprocess_binary_data()
                     .init_variable('dim_shape', (0))
                     .update_variable('dim_shape', F(lambda batch: batch.features.shape[-1]))
                     .init_model('dynamic', RegressionModel, 'LogisticRegressionModel',
                      config=config)
                     .train_model('LogisticRegressionModel',  feed_dict={'features': B('features'),
                                                                         'labels': B('labels')})
                     .run(BATCH_SIZE, n_epochs=TRAINING_EPOCHS, shuffle=True))



model_type = logistic


In [26]:
ppl_test = (my_dataset.test.p
                     .load(data)
                     .preprocess_binary_data()
                     .init_variable('test_accuracy', init_on_each_run=0)
                     .import_model('LogisticRegressionModel', ppl)
                     .predict_model('LogisticRegressionModel',  fetches='RegressionModel/accuracy', 
                                                                feed_dict={'features': B('features'),
                                                                           'labels': B('labels')},
                                                                save_to=V('test_accuracy'))
                     .run(BATCH_SIZE, n_epochs=1))

In [27]:
print("ACCURACY: %.0f%%" % (100.0 * ppl_test.get_variable('test_accuracy')))

ACCURACY: 97%


# Poisson Regression

Poisson regression is used to model count data. It assumes the target variable $Y$ has a Poisson distribution whose log-mean is linear in the features: $$\log \operatorname {E} (\mathrm{Y}\mid x )=\langle w, x \rangle, $$ 
and the minimized functional is the regularized negative log-likelihood (up to a term that does not depend on $w$):
$$ \dfrac{1}{N} \sum_{i=1}^N \left( \exp{\langle w, x_i \rangle} - y_i \langle w, x_i \rangle \right) + \dfrac{C}{2}\lVert w \rVert^2 \to \min_w$$
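
The sketch below writes the Poisson negative log-likelihood, without the $\log y_i!$ term that does not depend on $w$, as $\exp\langle w, x_i\rangle - y_i\langle w, x_i\rangle$ averaged over the sample, and fits it with gradient descent on toy data. The functions `poisson_nll` and `poisson_grad` and the data are hypothetical and are not taken from `RegressionModel`.

```python
import numpy as np

def poisson_nll(w, X, y, C=0.0):
    """Mean of exp(<w, x>) - y * <w, x> plus (C/2)*||w||^2."""
    log_rate = X @ w
    return np.mean(np.exp(log_rate) - y * log_rate) + 0.5 * C * np.dot(w, w)

def poisson_grad(w, X, y, C=0.0):
    rate = np.exp(X @ w)                  # predicted mean count E[Y | x]
    return X.T @ (rate - y) / len(y) + C * w

# toy data (hypothetical): two features plus a constant column for the intercept
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-1.0, 1.0, size=(2000, 2)), np.ones((2000, 1))])
true_w = np.array([0.8, -0.5, 1.0])
y = rng.poisson(np.exp(X @ true_w))       # counts drawn with log-mean <true_w, x>

w = np.zeros(3)
for _ in range(2000):
    w -= 0.05 * poisson_grad(w, X, y)     # full-batch gradient descent step
```

The gradient has the familiar form "predicted mean minus observed count", which is why the fitted `w` should approach `true_w` as the sample grows.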

In [39]:
BATCH_SIZE = 100
TRAINING_EPOCHS = 500
DATA_SIZE = 500
NUM_DIM = 10

In [40]:
# gr.load_poisson_data generates a sample from a Poisson distribution and returns a tuple of weights, features and labels
data = gr.load_poisson_data(NUM_DIM, DATA_SIZE)

# create dataset object
my_dataset = Dataset(index=np.arange(data[1].shape[0]), batch_class=MyBatch)

# create train and test indices
my_dataset.cv_split()

In [41]:
placeholders_config = {'features': {'shape': (NUM_DIM)},
                       'labels': {'shape' : 1,
                                  'name': 'targets'}}

In [42]:
def poisson_loss(targets, predictions):
    # predictions are passed as log_input, i.e. the log of the expected count
    return tf.nn.log_poisson_loss(targets, predictions, compute_full_loss=False)

In [43]:
config={'inputs': placeholders_config,
        'loss': poisson_loss,
        'optimizer': {'name':'Adam', 'use_locking': True,  'learning_rate': 0.01},
        'input_block/inputs': 'features',
        'body/dimension': V('dim_shape'),
        'body/model_type': 'poisson',
}

In [44]:
ppl = (my_dataset.train.p
                     .load(data[1:])
                     .preprocess_linear_data()
                     .init_variable('dim_shape', (0))
                     .update_variable('dim_shape', F(lambda batch: batch.features.shape[-1]))
                     .init_model('dynamic', RegressionModel, 'PoissonRegressionModel',
                                  config=config)
                     .train_model('PoissonRegressionModel',  feed_dict={'features': B('features'),
                                                                        'labels': B('labels')})
                     .run(BATCH_SIZE, n_epochs=TRAINING_EPOCHS, shuffle=True))



model_type = poisson


In [45]:
ppl_test = (my_dataset.test.p
                     .load(data[1:])
                     .init_variable('all_predictions', init_on_each_run=list)
                     .init_variable('y_true', init_on_each_run=list)
                     .init_variable('saved_weights', init_on_each_run=list)
                     .import_model('PoissonRegressionModel', ppl)
                     .predict_model('PoissonRegressionModel',  fetches=['RegressionModel/predictions', 
                                                                        'RegressionModel/inputs/targets:0', 
                                                                        'RegressionModel/weights'],
                                                               feed_dict={'features': B('features'),
                                                                          'labels': B('labels')},
                                                               save_to=([V('all_predictions'), V('y_true'), 
                                                                         V('saved_weights')]),
                                                               mode='a')
                     .run(BATCH_SIZE, n_epochs=1))

In [46]:
all_predictions = ppl_test.get_variable('all_predictions')[0]
y_true = ppl_test.get_variable('y_true')[0]
weights = ppl_test.get_variable('saved_weights')[0]

In [47]:
all_labels = np.exp(all_predictions).astype(np.int64)
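
Because the model predicts $\log \operatorname{E}(Y \mid x)$, exponentiating gives the expected count, and `astype(np.int64)` then drops the fractional part. Whether truncation is intended here is not stated, so the sketch below only contrasts it with rounding to the nearest integer on a few hypothetical log-rates.

```python
import numpy as np

# hypothetical log-rate predictions, standing in for the model output
log_rates = np.array([-0.5, 0.4, 1.0, 2.3])

rates = np.exp(log_rates)                     # expected counts E[Y | x]
truncated = rates.astype(np.int64)            # drops the fraction, as in the cell above
rounded = np.rint(rates).astype(np.int64)     # nearest integer instead

print(rates)
print(truncated)
print(rounded)
```

Truncation never exceeds the predicted rate, so integer predictions built this way are biased downwards compared with the rates themselves. That may matter when they are compared with the true counts.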

In [48]:
variance_ratio = np.var(all_labels, ddof=1) / np.var(y_true)
print('Variance ratio: %.2f' % (variance_ratio))

Variance ratio: 0.21


In [49]:
abs_error_ratio = np.mean(np.abs(all_labels - y_true)) / np.mean(y_true) * 100 
print('MAE of the model with respect to data\'s mean is: {0:.2f}%'\
        .format(abs_error_ratio))

MAE of the model with respect to data's mean is: 76.89%
