```
     _                                     ____             _       _   ____                                    
    / \   _ __ ___   __ _ _______  _ __   / ___|  ___   ___(_) __ _| | |  _ \ _ __ ___   __ _ _ __ ___  ___ ___ 
   / _ \ | '_ ` _ \ / _` |_  / _ \| '_ \  \___ \ / _ \ / __| |/ _` | | | |_) | '__/ _ \ / _` | '__/ _ \/ __/ __|
  / ___ \| | | | | | (_| |/ / (_) | | | |  ___) | (_) | (__| | (_| | | |  __/| | | (_) | (_| | | |  __/\__ \__ \
 /_/   \_\_| |_| |_|\__,_/___\___/|_| |_| |____/ \___/ \___|_|\__,_|_| |_|   |_|  \___/ \__, |_|  \___||___/___/
                                                                                        |___/                   
```

### Module
__SGDRegressor__: a linear model fitted by minimizing a regularized empirical loss with stochastic gradient descent (SGD).
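
To make the idea concrete, here is a minimal NumPy sketch of stochastic gradient descent on a squared loss with an L2 penalty. It is a simplified illustration, not scikit-learn's implementation: the constant learning rate `eta`, the function name `sgd_squared_loss`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def sgd_squared_loss(X, y, alpha=0.0001, eta=0.01, epochs=50, seed=0):
    """SGD on 0.5 * (w.x + b - y)^2 + 0.5 * alpha * ||w||^2, one sample per update."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch.
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]
            # Gradient of the loss term plus gradient of the L2 penalty.
            w -= eta * (error * X[i] + alpha * w)
            b -= eta * error
    return w, b

# Synthetic data (hypothetical): three features with known weights and intercept.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

w_hat, b_hat = sgd_squared_loss(X_demo, y_demo)
print("Estimated weights:", w_hat)
print("Estimated intercept:", b_hat)
```

The estimates should land close to `true_w` and the intercept of 3.0. `SGDRegressor` adds a decaying learning-rate schedule, other losses and penalties, and stopping criteria on top of this basic loop.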

### Goal
Investigate the relationship between the independent variables (features) and a dependent variable (outcome).

### Tools
1. Pandas
2. scikit-learn
3. SGDRegressor

### Requirements
1. File Definition
2. Data Preparation
3. `hotspot_spi.csv` has been generated
 
### Data Source
__${WORKDIR}__/data/ouptut/hotspot_spi.csv

In [1]:
import os
import sys

supervised_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
sys.path.append(supervised_dir)
sys.path

['/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised/regression',
 '/opt/anaconda3/lib/python39.zip',
 '/opt/anaconda3/lib/python3.9',
 '/opt/anaconda3/lib/python3.9/lib-dynload',
 '',
 '/opt/anaconda3/lib/python3.9/site-packages',
 '/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised']

In [2]:
import pandas as pd
import numpy as np

import functions_regression as freg
from load_dataset import LoadDataset, SpiType

from sklearn.linear_model import SGDRegressor

from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

## Get the data

In [3]:
load_dataset = LoadDataset()
X, y = load_dataset.return_X_y_regr(spi_type = SpiType.INDICATORS)

columns_names = X.columns

X = scale(X)

In [4]:
print("X.shape:", X.shape, "y.shape:", y.shape)

X.shape: (2313, 49) y.shape: (2313,)


### Split dataset into train and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape, "y_test.shape:", y_test.shape)

X_train.shape: (1619, 49) y_train.shape: (1619,)
X_test.shape: (694, 49) y_test.shape: (694,)
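
Two details of the cells above affect how the results should be read. `scale(X)` is applied before the split, so the test rows contribute to the scaling statistics, and `train_test_split` is called without `random_state`, so each run produces a different split and different scores. A possible alternative, sketched here on synthetic data with an arbitrary seed of 42 (both are assumptions for illustration), fits the scaler on the training rows only by placing it in a pipeline:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (hypothetical data, not the SPI dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=300)

# A fixed random_state makes the split, and hence the scores, reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler learns its mean and variance from the training rows only;
# the same transformation is then reused when scoring the test rows.
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
model.fit(X_tr, y_tr)
print("Test R^2:", model.score(X_te, y_te))
```

With this layout the unscaled `X_train` and `X_test` are passed to the pipeline, so the separate `scale(X)` call is not needed.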


## Modeling

### Build, train, and predict with the model

In [16]:
params = {
    "alpha": 0.01, 
    "loss": "squared_error", 
    "max_iter": 3000, 
    "penalty": "elasticnet", 
    "tol": 0.001,
    "early_stopping": True
}

regressor = SGDRegressor(**params)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

### Hyperparameter optimization

*Note: The execution of the code below may take a few minutes or hours.*

*Uncomment and run it when you need to optimize hyperparameters.*

In [7]:
# from sklearn.model_selection import (GridSearchCV)
# import warnings

# warnings.filterwarnings("ignore")

# parameters = {
#     "loss": ("squared_error", "huber", "epsilon_insensitive", "squared_epsilon_insensitive"),
#     "penalty": ("l2", "l1", "elasticnet"),
#     "max_iter":[1000, 2000, 3000, 4000, 5000],
#     "tol": [1e-2, 1e-3, 1e-4, 1e-5],
#     "alpha": [0.0001, 0.001, 0.01, 0.1]
# }

# gridsearch = GridSearchCV(SGDRegressor(), parameters)
# gridsearch.fit(X_train, y_train)

# print("Tuned Hyperparameters :", gridsearch.best_params_)
# print("Best Score:", gridsearch.best_score_)
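
The long run time follows from the size of the grid. The sketch below counts the candidates in the grid above with `ParameterGrid`, then runs a much smaller search on synthetic data. The reduced grid, `cv=3`, and the data are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the commented-out cell.
full_grid = {
    "loss": ["squared_error", "huber", "epsilon_insensitive",
             "squared_epsilon_insensitive"],
    "penalty": ["l2", "l1", "elasticnet"],
    "max_iter": [1000, 2000, 3000, 4000, 5000],
    "tol": [1e-2, 1e-3, 1e-4, 1e-5],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
}
n_candidates = len(ParameterGrid(full_grid))
# GridSearchCV defaults to 5-fold cross-validation: one fit per candidate per fold.
print("Candidates:", n_candidates, "-> fits with 5-fold CV:", n_candidates * 5)

# Reduced grid on synthetic data (hypothetical), just to show the workflow.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

small_grid = {"penalty": ["l2", "elasticnet"], "alpha": [0.0001, 0.01]}
search = GridSearchCV(
    SGDRegressor(max_iter=2000, random_state=0), small_grid, cv=3, scoring="r2"
)
search.fit(X_demo, y_demo)
print("Tuned hyperparameters:", search.best_params_)
print("Best CV score:", search.best_score_)
```

Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel, which can shorten the full search considerably on a multi-core machine.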

### Model Evaluation

In [17]:
freg.evaluate_model(regressor, X, y, X_train, y_train, X_test, y_test, y_pred)


Model-evaluation
----------------------------------------------------------------------
Score: 0.4020
CV train mean score:0.3483
K-fold CV average score: 0.34
R²: 0.4704
Max Error: 1330.0156
Explained Variance: 0.4704
MSE: 240107.8873
RMSE: 120053.9436
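
The figures above come from `freg.evaluate_model`, whose source is not shown in this notebook. As a cross-check, the same kinds of metrics can be computed directly with `sklearn.metrics`. The sketch below uses small made-up arrays, not the notebook's predictions. By the usual definition RMSE is the square root of MSE, which for the MSE printed above would be about 490. The printed RMSE equals half the MSE, so the helper may be computing it with a different formula.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    max_error,
    mean_squared_error,
    r2_score,
)

# Made-up targets and predictions (hypothetical values for illustration).
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE, in the target's units
worst = max_error(y_true_demo, y_pred_demo)

print("R2:", r2_score(y_true_demo, y_pred_demo))
print("Max Error:", worst)
print("Explained Variance:", explained_variance_score(y_true_demo, y_pred_demo))
print("MSE:", mse)
print("RMSE:", rmse)
```

In the notebook, the same calls would take `y_test` and `y_pred` as arguments.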


In [9]:
print("Number of features seen during fit:", regressor.n_features_in_)
print("Constants in decision function:", regressor.intercept_)
print("Weights assigned to the features:\n", regressor.coef_)

Number of features seen during fit: 49
Constants in decision function: [1138.33800766]
Weights assigned to the features:
 [  20.72650033   20.9281501    -7.19220283   21.85116689   -2.79761715
    6.28189281   -9.16724598  -40.86420657   24.88061569  -29.38307665
 -241.03436139  187.94230345  -84.40351831  121.54354646   73.75780465
   31.70302327   33.00292921   62.96098446 -126.84632092   -3.97658602
   72.44943757   31.40544902   99.74609479 -142.88304962   15.39788485
  -47.60511786  -17.62519722  -47.28084409   30.70179614  -52.1956165
  -51.93853358   -0.3070642   -99.89842217   12.40830694  171.65868782
  103.61017615  -63.58258779   -7.31997584   35.4435821    44.57779851
  -71.91150885  -10.45771126   10.20793087   33.21536633    3.06072268
   -2.95492368  -21.26984447   -7.59776245   43.1900036 ]
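
The raw coefficient array is hard to read without the feature names. A minimal sketch of pairing weights with names and ranking them by absolute size follows. The names and values here are hypothetical placeholders. In the notebook, `regressor.coef_` and the `columns_names` saved before scaling would take their place.

```python
import numpy as np
import pandas as pd

# Hypothetical names and weights standing in for columns_names and regressor.coef_.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
coef_demo = np.array([20.7, -241.0, 187.9, -0.3])

weights = pd.Series(coef_demo, index=feature_names)

# Order the features by absolute weight while keeping the sign of each weight.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```

Because the features were standardized before fitting, the magnitudes are on a comparable scale, so larger absolute weights indicate a stronger linear association with the target under this model.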
