# How to build a baseline regression model

`DummyRegressor` helps create a baseline for regression.

A `baseline` is the result of a very basic model/solution. You generally create a baseline first and then try more complex solutions in order to get a better result. A more complex model is worth keeping only if it achieves a better score than the baseline.

## About DummyRegressor:

It makes predictions based on the specified strategy.

The strategy is based on a statistical property of the training set (for example its mean or median) or on a user-specified constant value.

In [2]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy = 'mean')
dummy_regr.fit(X_train, y_train)
dummy_regr.predict(X_test)
dummy_regr.score(X_test, y_test)

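
The cell above assumes `X_train`, `X_test`, `y_train` and `y_test` already exist. A minimal self-contained sketch, assuming hypothetical synthetic data (not from the original notebook), shows how the baseline score compares with a fitted model:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predicts the mean of y_train, whatever the input
dummy_regr = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_r2 = dummy_regr.score(X_test, y_test)

# A useful model should beat the baseline
model_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
print(f"baseline R2: {baseline_r2:.3f}, linear regression R2: {model_r2:.3f}")
```

The baseline R2 is expected to be close to zero (or slightly negative), because a constant prediction explains none of the variance in the test labels.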

## How is a linear regression model trained?

In [3]:
# Step 1: Instantiate a suitable linear regression estimator using one of the methods below:

## 1. Normal equation

from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()


## 2. Iterative optimization
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor()

# Step 2: Call the fit method on the linear regression object with the training feature matrix and label vector as arguments
linear_regressor.fit(X_train, y_train)
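
As a rough illustration of the two training routes, the sketch below fits both estimators on hypothetical synthetic data (`true_w` and the noise level are assumptions made for the example) and prints the learned parameters side by side:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Hypothetical data generated from known weights and intercept
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_train = X_train @ true_w + 4.0 + rng.normal(0, 0.1, size=1000)

closed_form = LinearRegression().fit(X_train, y_train)          # direct least-squares solve
iterative = SGDRegressor(random_state=42).fit(X_train, y_train)  # stochastic gradient descent

print("LinearRegression:", closed_form.coef_, closed_form.intercept_)
print("SGDRegressor:    ", iterative.coef_, iterative.intercept_)
```

Both estimators should recover weights close to `true_w`; the SGD estimate is usually slightly noisier because it is an iterative approximation.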

## SGDRegressor estimator:

1. Use it for large training sets (> 10K examples)
2. It exposes many hyperparameters, which gives greater control over optimization

`loss = 'squared_error'`

`loss = 'huber'`: makes linear regression robust to outliers


`penalty= 'l1'`
`penalty = 'l2'`
`penalty = 'elasticnet'`


`learning_rate = 'constant'`
`learning_rate = 'optimal'`
`learning_rate = 'invscaling'`
`learning_rate = 'adaptive'`


`early_stopping = True`
`early_stopping = False`
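
The options listed above are all constructor arguments, so they can be combined in a single call. This is only a sketch on hypothetical synthetic data; the particular combination is chosen for illustration and is not a recommendation:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.1, 500)

sgd = SGDRegressor(
    loss='huber',              # robust to outliers
    penalty='elasticnet',      # mix of l1 and l2 regularization
    learning_rate='adaptive',  # one of 'constant', 'optimal', 'invscaling', 'adaptive'
    eta0=0.01,                 # initial learning rate
    early_stopping=True,       # a boolean, not the string 'True'
    random_state=42,
)
sgd.fit(X, y)
print(sgd.coef_, sgd.n_iter_)
```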

In [None]:
## Set random_state (the seed) when instantiating SGDRegressor so that results are reproducible

from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(random_state = 42)

## How to perform feature scaling for SGDRegressor

# SGD is sensitive to feature scaling
# Highly recommended to scale the input feature matrix

In [None]:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sgd = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('sgd_regressor', SGDRegressor())
    ])

sgd.fit(X_train, y_train)

In [None]:
# Feature scaling is not needed for word frequencies and indicator features, as
# they have an intrinsic scale
# Features extracted by PCA should be scaled by some constant c such that the
# average L2 norm of the training data equals one
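
A self-contained sketch of the scaling pipeline, assuming hypothetical data with two features on very different scales:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature with unit scale, one with scale ~1000
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])
y_train = 2.0 * X_train[:, 0] + 0.003 * X_train[:, 1] + rng.normal(0, 0.1, 500)

sgd = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('sgd_regressor', SGDRegressor(random_state=42)),
])
sgd.fit(X_train, y_train)
train_r2 = sgd.score(X_train, y_train)
print(train_r2)

# The scaler statistics learned during fit are reused automatically in predict and score
print(sgd.named_steps['feature_scaling'].mean_)
```

Putting the scaler inside the pipeline means the same transformation is applied at training and prediction time without extra bookkeeping.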

## How to shuffle training data after each epoch in SGDRegressor

In [None]:
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(shuffle = True)

## How to set the learning rate in SGDRegressor:

learning_rate = 'constant'

learning_rate = 'invscaling'

learning_rate = 'adaptive'


What is default setting?

`learning_rate = 'invscaling'`, `eta0 = 0.01` and `power_t = 0.25`

The learning rate decreases after every iteration.

In the case of inverse scaling:

eta = eta0 / pow(t, power_t)

These hyperparameters can be changed to speed up or slow down learning. If the loss is decreasing slowly, increase the learning rate. If the loss oscillates, decrease the learning rate.
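
To get a feel for how quickly the inverse-scaling schedule decays, the formula can be evaluated directly. This sketch uses the default values quoted above; the chosen values of `t` are arbitrary:

```python
eta0, power_t = 0.01, 0.25  # default values quoted above

steps = [1, 10, 100, 1000, 10000]
etas = [eta0 / (t ** power_t) for t in steps]

for t, eta in zip(steps, etas):
    print(f"t={t:>5d}  eta={eta:.6f}")
```

With `power_t = 0.25` the rate falls by a factor of 10 only after 10,000 updates, so the decay is gentle.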

In [5]:
## How to set constant learning rate

from sklearn.linear_model import SGDRegressor
linear_regression = SGDRegressor(learning_rate = 'constant', eta0 = 1e-2)

In [7]:
# how to set adaptive learning rate:
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(learning_rate = 'adaptive', eta0 = 1e-2)

# The learning rate is kept at its initial value as long as the training loss decreases.
# When the stopping criterion is reached, the learning rate is divided by 5 and the training loop continues.


In [8]:
# How to set the number of epochs in SGDRegressor

from sklearn.linear_model import SGDRegressor
linear_model = SGDRegressor(max_iter = 100)


In [9]:
# SGDRegressor converges after observing approximately 10^6 training samples. Thus a reasonable first guess for the number of iterations for a training set of n samples is:
import numpy as np
max_iter = int(np.ceil(10**6 / n))  # n: number of training samples; use ** for powers, since ^ is bitwise XOR in Python
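
A worked instance of this rule of thumb, assuming a hypothetical training set of 20,000 samples:

```python
import numpy as np

n = 20_000  # hypothetical number of training samples
max_iter = int(np.ceil(10**6 / n))
print(max_iter)  # 50

# Pitfall: ^ is bitwise XOR in Python, not exponentiation
print(10 ^ 6)    # 12
```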

In [None]:
# How to set stopping criteria in SGDRegressor

# Option 1:
from sklearn.linear_model import SGDRegressor
linear_regression = SGDRegressor(loss = 'squared_error',
                                 max_iter = 500,
                                 tol = 1e-3,
                                 n_iter_no_change = 5)

# The SGDRegressor stops when the training loss does not improve (loss > best_loss - tol) for n_iter_no_change consecutive epochs,
# or else after the maximum number of iterations max_iter.


# Option 2:

from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss = 'squared_error',
                                early_stopping = True,
                                max_iter = 500,
                                tol = 1e-3,
                                validation_fraction = 0.2,
                                n_iter_no_change = 5)

## Sets aside a validation_fraction share of the training records as a validation set.
# Uses the score method to obtain the validation score

## The regressor stops when
## the validation score does not improve by at least tol for n_iter_no_change consecutive epochs,
## or else after the maximum number of iterations max_iter




## How to use different loss functions in SGDRegressor?

In [None]:
# Set the loss parameter: 'squared_error' or 'huber'

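## How to use averaged SGD

## Option 1: Averaging across all the updates: average = True

## Option 2: Set average to an int n: averaging begins once the total number of samples seen reaches n

In [10]:
from sklearn.linear_model import SGDRegressor
linear_regression = SGDRegressor(average = 10)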
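
A small sketch comparing the averaging options on hypothetical synthetic data (`true_w` is an assumption made for the example):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical data generated from known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 5000)

plain = SGDRegressor(average=False, random_state=42).fit(X, y)    # no averaging
averaged = SGDRegressor(average=True, random_state=42).fit(X, y)  # average over all updates
delayed = SGDRegressor(average=10, random_state=42).fit(X, y)     # start averaging after 10 samples

for name, model in [('plain', plain), ('averaged', averaged), ('delayed', delayed)]:
    print(name, model.coef_)
```

With averaging enabled, `coef_` holds the averaged weights rather than the last iterate.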

In [11]:
## How do we initialize SGD with weight vector of the previous run?

from sklearn.linear_model import SGDRegressor
linear_regression = SGDRegressor(warm_start = True)

## How to monitor SGD loss iteration after iteration

In [None]:
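from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# tol = None disables the tolerance-based stopping check, so each fit call runs exactly one epoch
sgd_reg = SGDRegressor(max_iter = 1, tol = None,
                       warm_start = True, penalty = None,
                       learning_rate = 'constant', eta0 = 0.0005)

val_errors = []
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train) # continues where it left off
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)
    val_errors.append(val_error) # validation loss after this epoch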
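
A runnable version of this monitoring loop, sketched on hypothetical synthetic data with a shorter run of 50 epochs. It passes `tol=None`, which switches off the tolerance-based stopping check so that each `fit` call performs a single epoch:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(0, 0.1, 1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate='constant', eta0=0.0005, random_state=42)

val_errors = []
for epoch in range(50):
    sgd_reg.fit(X_train, y_train)  # warm_start=True: continues from the previous weights
    val_errors.append(mean_squared_error(y_val, sgd_reg.predict(X_val)))

best_epoch = int(np.argmin(val_errors))
print(f"first: {val_errors[0]:.4f}, last: {val_errors[-1]:.4f}, best epoch: {best_epoch}")
```

Plotting `val_errors` against the epoch index gives the loss curve; the epoch with the lowest validation error is a natural stopping point.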

## Model inspection

In [12]:
# How to access the weights of a trained model:

# The weight vector is stored in the coef_ attribute

linear_regressor.coef_

# The intercept is stored in the intercept_ attribute
linear_regressor.intercept_
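
These attributes exist only after `fit` has been called. A tiny sketch on a hypothetical noise-free line (slope 2, intercept 1) makes the learned values easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical exact line: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

linear_regressor = LinearRegression().fit(X, y)
print(linear_regressor.coef_)       # approximately [2.]
print(linear_regressor.intercept_)  # approximately 1.0
```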

## Model Inference:

In [None]:
# Step 1: Arrange the data for prediction in a feature matrix of shape (n_samples, n_features) or in sparse matrix form
# Step 2: Call the predict method on the linear_regression object with the feature matrix as an argument
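
The two inference steps can be sketched as follows, assuming a hypothetical model trained on a noise-free line:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * X_train[:, 0] + 1.0
linear_regression = LinearRegression().fit(X_train, y_train)

# Step 1: new samples as a 2-D feature matrix of shape (n_samples, n_features)
X_new = np.array([[4.0], [5.0]])

# Step 2: predict returns one value per row of X_new
y_new = linear_regression.predict(X_new)
print(y_new)  # approximately [ 9. 11.]
```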

# Model Evaluation:


In [1]:
## Step 1: Split data into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 42)

## Step 2: Fit linear regression estimator on training set

## Step 3: Calculate training error. aka empirical error

## Step 4: Calculate test error. aka generalization error
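
The four steps can be sketched end to end as follows. The data is hypothetical, and mean squared error is used as the error measure purely as an example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data with noise of standard deviation 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.5, 300)

# Step 1: split (the default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 2: fit on the training set only
lin_reg = LinearRegression().fit(X_train, y_train)

# Step 3: training (empirical) error
train_error = mean_squared_error(y_train, lin_reg.predict(X_train))

# Step 4: test (generalization) error
test_error = mean_squared_error(y_test, lin_reg.predict(X_test))
print(f"train MSE: {train_error:.3f}, test MSE: {test_error:.3f}")
```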

## How to evaluate trained linear regression model:

In [2]:
# Evaluation on eval set with
# 1. Feature Matrix
# 2. Label vector or matrix (single/multi-output)
linear_regression.score(X_test, y_test)

## The score method returns R2, the coefficient of determination

## R2 = (1 - u/v)
## u: residual sum of squares (SSE) between actual and predicted labels
## u = (Xw - y)^T (Xw - y)
## v: total sum of squares
## v = (y - ymean)^T (y - ymean)


## The best possible score is 1.0
## A constant model that always predicts the expected value of y would get a score of 0.0
## The score can be negative (because the model can be arbitrarily worse)
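
The definition of R2 can be checked by computing `u` and `v` by hand and comparing the result with `score`. A sketch on hypothetical synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

u = np.sum((y - y_pred) ** 2)    # residual sum of squares
v = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1 - u / v
print(r2_manual, model.score(X, y))

# A constant model that predicts the mean of y scores 0
r2_constant = 1 - np.sum((y - y.mean()) ** 2) / v
print(r2_constant)
```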

## Evaluation Metrics to evaluate performance

In [3]:
## mean_absolute_error

from sklearn.metrics import mean_absolute_error
eval_score = mean_absolute_error(y_test, y_predicted)


## mean_squared_error
from sklearn.metrics import mean_squared_error
eval_score = mean_squared_error(y_test, y_predicted)

## r2_score: same output as score
from sklearn.metrics import r2_score
eval_score = r2_score(y_test, y_predicted)

## mean_squared_log_error
from sklearn.metrics import mean_squared_log_error
eval_score = mean_squared_log_error(y_test, y_predicted)
## We use this for targets with exponential growth, like population and sales growth.
## Penalizes under-estimation more heavily than over-estimation


## mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_percentage_error
eval_score = mean_absolute_percentage_error(y_test, y_predicted)
## This is sensitive to outliers


## median_absolute_error
from sklearn.metrics import median_absolute_error
eval_score = median_absolute_error(y_test, y_predicted)
## Robust to outliers


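
The metrics above can be compared on a small hand-made example. The arrays `y_test` and `y_predicted` below are hypothetical, with absolute errors of 0.5, 0, 1.5 and 1.0:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error,
                             median_absolute_error, max_error, r2_score)

# Hypothetical labels and predictions
y_test = np.array([3.0, 5.0, 2.5, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_test, y_predicted)        # (0.5 + 0 + 1.5 + 1.0) / 4 = 0.75
mse = mean_squared_error(y_test, y_predicted)         # (0.25 + 0 + 2.25 + 1.0) / 4 = 0.875
medae = median_absolute_error(y_test, y_predicted)    # median of [0, 0.5, 1.0, 1.5] = 0.75
worst = max_error(y_test, y_predicted)                # largest absolute error = 1.5
mape = mean_absolute_percentage_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

print(mae, mse, medae, worst, mape, r2)
```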

In [4]:
## How to evaluate regression model on worst case error?

from sklearn.metrics import max_error
train_error = max_error(y_train, y_train_predicted)  # y_train_predicted: predictions on the training set

# The test set can be evaluated in the same way

## Does not support multi-output regression

In [5]:
## Scores and errors

## A score is a metric for which a higher value is better
## An error is a metric for which a lower value is better


## Convert error metrics to score metrics by adding the `neg_` prefix

| Function | Scoring |
|----------|---------|
| metrics.mean_absolute_error | neg_mean_absolute_error |
| metrics.mean_squared_error | neg_mean_squared_error |
| metrics.mean_squared_log_error | neg_mean_squared_log_error |
| metrics.median_absolute_error | neg_median_absolute_error |
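
A brief sketch of how a `neg_` scoring string behaves in practice, assuming hypothetical synthetic data: the returned scores are non-positive, and flipping the sign recovers the error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_absolute_error')
mae_per_fold = -scores  # flip the sign to get the error back

print(scores)        # all values are <= 0 (higher is better)
print(mae_per_fold)  # all values are >= 0 (lower is better)
```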

In [None]:
## If we get comparable performance on train and test with this split,
## is this performance guaranteed on other splits too?

### Is the test set sufficiently large?
##### If it is small, the test error obtained may be unstable and may not reflect the true test error on a large test set.

### What is the chance that the easiest examples were set aside as the test set?
##### If this happens, it leads to an optimistic estimate of the true test error.


### We use cross-validation for a robust performance estimate
#### by repeated splitting and
#### providing many training and test errors

### This enables us to estimate the variability in the generalization performance of the model


## sklearn implements the following cross-validation iterators

KFold
RepeatedKFold
LeaveOneOut
ShuffleSplit


# How to obtain cross-validation performance measures using KFold

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
score = cross_val_score(lin_reg, X,y, cv=5)


## uses KFold cross validation iterator that divides training data into 5 folds
## In each run, it uses 4 folds for training and 1 for evaluation

## Alternate ways of writing the same thing:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
kfold_cv = KFold(n_splits=5)  # random_state only takes effect with shuffle=True, so it is omitted here
score = cross_val_score(lin_reg, X,y, cv= kfold_cv)

## LeaveOneOut: How to use this iterator

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
loocv = LeaveOneOut()
score = cross_val_score(lin_reg, X,y, cv=loocv)

## which is same as :
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
n = X.shape[0]
kfold_cv = KFold(n_splits = n)
score = cross_val_score(lin_reg, X,y, cv=kfold_cv)



# shuffle split

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits = 5, test_size = 0.2, random_state = 42)
score = cross_val_score(lin_reg, X,y, cv= shuffle_split)

## It is also called a random-permutation-based cross-validation strategy
## It generates a user-defined number of train/test splits
## It is robust to class distribution

## In each iteration it shuffles the order of the data samples and then splits them into training and test sets.


## Specify the performance measure in cross_val_score

score = cross_val_score(lin_reg, X, y, cv= shuffle_split, scoring = 'neg_mean_absolute_error')
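
To see how the iterators differ, it helps to print how many splits each one produces and how large the train and test parts are. This sketch uses a hypothetical dataset of 10 samples; `n_repeats=2` for `RepeatedKFold` is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold, ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features

iterators = {
    'KFold': KFold(n_splits=5),
    'RepeatedKFold': RepeatedKFold(n_splits=5, n_repeats=2, random_state=42),
    'LeaveOneOut': LeaveOneOut(),
    'ShuffleSplit': ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
}

summary = {}
for name, cv in iterators.items():
    train_idx, test_idx = next(cv.split(X))  # sizes of the first split
    summary[name] = (cv.get_n_splits(X), len(train_idx), len(test_idx))
    print(name, summary[name])
```

Each tuple holds (number of splits, training size, test size). `LeaveOneOut` produces as many splits as there are samples, which is why it matches `KFold(n_splits=n)`.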

In [7]:
## How to obtain test scores from different folds?

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits= 40, test_size = 0.3, random_state = 0)
cv_results = cross_validate(regressor, data, target, cv=cv, scoring = 'neg_mean_absolute_error')

## The results are stored in a Python dictionary with the following keys:

## fit_time
## score_time
## test_score
## estimator (only when return_estimator = True)
## train_score (only when return_train_score = True)

In [8]:
cv = ShuffleSplit(n_splits= 40, test_size = 0.3, random_state = 0)
cv_results = cross_validate(regressor, data, target, 
                            cv=cv, scoring = 'neg_mean_absolute_error',
                            return_train_score = True,
                            return_estimator = True)
# The fitted estimators can be accessed through the 'estimator' key

In [9]:
# multiple metrics:

cv_results = cross_validate(regressor, data, target, 
                            cv=cv, scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error'],
                            return_train_score = True,
                            return_estimator = True)

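
The cells above leave `regressor`, `data` and `target` undefined. A self-contained sketch, assuming hypothetical synthetic data and a `LinearRegression` estimator, shows the keys returned when several metrics are requested:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Hypothetical data for illustration
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
target = data @ np.array([1.0, -1.0, 2.0]) + rng.normal(0, 0.5, 200)

regressor = LinearRegression()
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_results = cross_validate(regressor, data, target, cv=cv,
                            scoring=['neg_mean_absolute_error', 'neg_mean_squared_error'],
                            return_train_score=True,
                            return_estimator=True)

# With multiple metrics, the score keys are suffixed with the metric name
print(sorted(cv_results.keys()))

test_mae = -cv_results['test_neg_mean_absolute_error']
print(f"test MAE: {test_mae.mean():.3f} +/- {test_mae.std():.3f}")
```

The spread of `test_mae` across splits gives an estimate of the variability in generalization performance.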

In [10]:
# How to study the effect of the number of samples on training and test errors?

## Step 1: Call the learning_curve function with the estimator, training data, training-set sizes, cv strategy and scoring scheme as arguments.

from sklearn.model_selection import learning_curve

results = learning_curve(lin_reg, X_train, y_train, train_sizes = train_sizes,
                         cv=cv, scoring = 'neg_mean_absolute_error')
train_size, train_scores, test_scores = results[:3]

# Convert the scores into errors
train_errors , test_errors = -train_scores, -test_scores

## Step 2: Plot the training and test errors as a function of the training-set size.
## Assess the model fit: underfitting or overfitting
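
A self-contained sketch of the learning-curve computation, assuming hypothetical synthetic data, five training-set sizes and a `ShuffleSplit` strategy. It prints the mean errors rather than plotting them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([1.0, -1.0, 2.0]) + rng.normal(0, 0.5, 200)

lin_reg = LinearRegression()
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)  # fractions of the available training part

results = learning_curve(lin_reg, X_train, y_train, train_sizes=train_sizes,
                         cv=cv, scoring='neg_mean_absolute_error')
train_size, train_scores, test_scores = results[:3]

# Scores have shape (n_sizes, n_cv_splits); negate them to get errors
train_errors, test_errors = -train_scores, -test_scores
for size, tr, te in zip(train_size, train_errors.mean(axis=1), test_errors.mean(axis=1)):
    print(f"n={size:>3d}  train MAE={tr:.3f}  test MAE={te:.3f}")
```

A large, persistent gap between the two curves suggests overfitting, while two high curves that have converged suggest underfitting.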



In [None]:
# Check for underfitting or overfitting:

## Step 1: Fit linear model with different number of features

## step 2: For each model obtain training and test errors

## Step 3: Plot error against the number of features, one curve each for training and test errors

## step 4: examine the graphs to detect under/overfitting
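
The steps above can be sketched as follows. The data is hypothetical: only the first three of 30 features carry signal, and the training set is deliberately small so that overfitting becomes visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 features, only the first 3 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1.0, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

feature_counts = [1, 3, 10, 25]
train_errors, test_errors = [], []
for k in feature_counts:
    # Steps 1 and 2: fit on the first k features and record both errors
    model = LinearRegression().fit(X_train[:, :k], y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train[:, :k])))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test[:, :k])))
    print(f"k={k:>2d}  train MSE={train_errors[-1]:.3f}  test MSE={test_errors[-1]:.3f}")
```

The training error can only go down as features are added. A test error that starts rising while the training error keeps falling is the usual sign of overfitting, and both errors being high with few features points to underfitting.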
