# Programming for Data Science and Artificial Intelligence

## Further Study - Other Regressions

### Readings: 
- https://scikit-learn.org/stable/

## Other types of regression

Linear regression is the most basic regression algorithm, but there are many more.  Today, we shall briefly talk about other types for the sake of completeness.   Then we shall revisit the diabetes dataset and compare different regression algorithms.

### Polynomial Regression

Simple linear regression reaches its limit when the data is non-linear.  In that case, we can use polynomial regression.  For example, a degree-1 polynomial fits a straight line to the data like this: 

$$y = ax + b$$  

A degree-3 polynomial fits a cubic curve to the data 

$$y = ax^3 + bx^2 + cx + d$$

In scikit-learn, we can implement this using a polynomial preprocessor, <code>PolynomialFeatures</code>, which expands each feature into its polynomial terms.

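For example, if our x is 

<code>x = np.array([1, 2, 3, 4, 5])</code>

and we perform a polynomial transformation like this (the input must be 2D, and <code>include_bias = False</code> drops the constant column of ones that is added by default):

<code>poly_X = PolynomialFeatures(degree = 3, include_bias = False).fit_transform(x.reshape(-1, 1))</code>
    
then <code>poly_X</code> will look like this:

<code>[[ 1, 1, 1]
 [ 2, 4, 8]
 [ 3, 9, 27]
 [ 4, 16, 64]
 [ 5, 25, 125]]</code>
 
Now our feature-engineered X has a first column representing $x$, a second column representing $x^2$, and a third column representing $x^3$.  A linear regression on these columns, with its intercept $d$, now fits 
 
 $$ y = ax^3 + bx^2 + cx + d $$ 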
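The transformation above can be tried directly. The following is a minimal sketch: the cubic used for `y_demo` is made up for illustration, and the names `x_demo`, `y_demo`, `X_demo`, and `reg` are assumptions that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])

# PolynomialFeatures expects a 2D array of shape (n_samples, n_features)
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_X = poly.fit_transform(x.reshape(-1, 1))
print(poly_X)  # columns are x, x^2, x^3

# Illustrative, made-up cubic without noise: y = 2x^3 - x^2 + 0.5x + 3
x_demo = np.linspace(-3, 3, 20)
y_demo = 2 * x_demo**3 - x_demo**2 + 0.5 * x_demo + 3

# A plain linear regression on the expanded columns fits the cubic
X_demo = poly.fit_transform(x_demo.reshape(-1, 1))
reg = LinearRegression().fit(X_demo, y_demo)
print(reg.coef_)       # weights of x, x^2, x^3
print(reg.intercept_)  # the constant term d
```

Because the model is still linear in its coefficients, ordinary least squares is enough; only the features are non-linear in $x$.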
 
Now let's look at an example:

In [1]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#let's try to fit the model using 1, 3, 5 degrees....
for ix, deg in enumerate([1, 3, 5]):
    print("===Performing polynomial regression with deg: ", deg, "======")
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    
    #print("Coeff: ", model.named_steps['linearregression'].coef_)
    
    #MSE = (1/n) * SSE, where SSE = sigma(y - f(x))^2
    print(f"MSE = {mean_squared_error(y_test, y_hat):.2f}")
    
    #measures goodness of fit
    #r^2 = 1 - SSE/TSS, where TSS = sigma(y - ymean)^2
    #r^2 can be negative whenever the model predicts worse than the mean of y,
    #e.g., an overfitted model on unseen test data, or a fit without an intercept
    #We ALMOST never fit without the intercept unless
    #you are sure your data goes through the origin (0, 0), e.g., height, width, but NOT house value!
    #r^2 upper bound is 1, lower bound can be anything
    print(f"r^2 = {r2_score(y_test, y_hat):.3f}")
    
    #calculate adjusted r^2
    #takes the number of predictors (IVs) into account, to balance out possible overfitting
    #increases only if a new predictor (x) enhances the model
    #note: n and p are taken from the full, untransformed X here; strictly speaking,
    #n should be the number of evaluated samples and p the number of polynomial features
    n, p = X.shape[0], X.shape[1]
    adjusted_rsqrt = 1-(1-r2_score(y_test, y_hat))*(n-1)/(n-p-1)
    print(f"adjusted $r^2$ = {adjusted_rsqrt:.3f}")

MSE = 2723.98
r^2 = 0.488
adjusted $r^2$ = 0.476
MSE = 1447967.71
r^2 = -271.051
adjusted $r^2$ = -277.363
MSE = 1010991.62
r^2 = -188.950
adjusted $r^2$ = -193.357
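The three metrics printed above can be written out by hand. This is a small sketch on made-up numbers (`y_true`, `y_pred`, and `bad_pred` are illustrative only, and `adjusted_r2` is a hypothetical helper, not a scikit-learn function); it checks the hand-written formulas against `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p predictors (hypothetical helper)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
mse = sse / len(y_true)
r2 = 1 - sse / tss

print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))

# Predicting worse than the mean of y gives a negative R^2
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])
print(r2_score(y_true, bad_pred))

# Same n and p as the cell above (442 samples, 10 features)
print(round(adjusted_r2(0.488, 442, 10), 3))
```

The last line reproduces the adjusted value reported for the degree-1 model, which shows that the penalty for 10 predictors on 442 samples is small.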


### Polynomial Regression + Grid Search

Let's apply grid search with polynomial regression.

Note: once again, we skip nested cross-validation to save time, but it is recommended.  It helps you check whether the model and the search space really give stable and generalizable results.

In [2]:
from sklearn.model_selection import GridSearchCV

param_grid = {'polynomialfeatures__degree': np.arange(1, 10),
              'linearregression__fit_intercept': [True, False]}

model = make_pipeline(PolynomialFeatures(), LinearRegression())

grid = GridSearchCV(model, param_grid, cv=3)

grid.fit(X_train, y_train)

print("Best params: ", grid.best_params_)

y_hat = grid.predict(X_test)

Best params:  {'linearregression__fit_intercept': False, 'polynomialfeatures__degree': 1}


In [3]:
print(f"MSE = {mean_squared_error(y_test, y_hat):.2f}")
print(f"r^2 = {r2_score(y_test, y_hat):.3f}")
n, p = X.shape[0], X.shape[1]
adjusted_rsqrt = 1-(1-r2_score(y_test, y_hat))*(n-1)/(n-p-1)
print(f"adjusted $r^2$ = {adjusted_rsqrt:.3f}")

MSE = 2723.98
r^2 = 0.488
adjusted $r^2$ = 0.476
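For reference, the nested cross-validation mentioned in the note above could be sketched as follows. This is one possible setup, not the one used in this notebook: the small degree grid, the 3-fold splits, and the names `pipe`, `inner`, and `outer_scores` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on every training fold
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())

# Inner loop: picks the polynomial degree on each outer training fold
inner = GridSearchCV(pipe, {'polynomialfeatures__degree': [1, 2]}, cv=3)

# Outer loop: scores the whole "search + refit" procedure on held-out folds
outer_scores = cross_val_score(inner, X_all, y_all, cv=3)
print(outer_scores)
print(outer_scores.mean())
```

If the outer scores vary widely between folds, the selected hyperparameters are probably not stable.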


### Regularization

Regularization is a technique that alleviates overfitting by adding a penalty on the model coefficients to the loss function.

### Ridge regression ($L_2$ regularization)

Perhaps the most common form of regularization is known as *ridge regression* or $L_2$ *regularization*, sometimes also called *Tikhonov regularization*. This proceeds by penalizing the sum of squares (2-norms) of the model coefficients; in this case, the penalized loss is 

$$ J(\theta) =  \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n \theta_j^2$$

where $\lambda$ is a free parameter that controls the strength of the penalty (scikit-learn calls it ``alpha``).
This type of penalized model is built into Scikit-Learn with the ``Ridge`` estimator:

In [4]:
from sklearn.linear_model import Ridge
params_Ridge = {'polynomialfeatures__degree': np.arange(1, 10),
                'ridge__alpha': np.logspace(-1, -4, 10)}

##older scikit-learn versions accepted Ridge(normalize=True); that parameter was removed in version 1.2,
##so the polynomial features (which have a wide range) are rescaled with a StandardScaler step instead
model = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
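To make the effect of the $L_2$ penalty concrete, here is a minimal sketch on synthetic data (the data and the names `X_demo`, `y_demo`, `w_closed`, and `norms` are made up for illustration). It assumes no intercept, in which case the ridge solution has the closed form $(X^TX + \alpha I)^{-1}X^Ty$. Note that scikit-learn's `Ridge` minimizes $\|y - Xw\|^2 + \alpha\|w\|^2$ without the factor $\frac{1}{2}$, so its `alpha` corresponds to $2\lambda$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form ridge solution without intercept: (X^T X + alpha * I)^-1 X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
print(w_closed)
print(w_sklearn)

# A stronger penalty shrinks the coefficient vector toward zero
norms = [
    np.linalg.norm(Ridge(alpha=a, fit_intercept=False).fit(X_demo, y_demo).coef_)
    for a in [0.01, 1.0, 100.0]
]
print(norms)
```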

### Lasso regression ($L_1$ regularization)

Another very common type of regularization is known as lasso, and involves penalizing the sum of absolute values (1-norms) of regression coefficients:

$$ J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2+ \lambda\sum_{j=1}^n |\theta_j|$$

Though this is conceptually very similar to ridge regression, the results can differ surprisingly: for example, due to geometric reasons lasso regression tends to favor *sparse models* where possible: that is, it preferentially sets model coefficients to exactly zero.

As with ridge, this type of penalized model is built into Scikit-Learn, here as the ``Lasso`` estimator:

In [5]:
from sklearn.linear_model import Lasso
params_Lasso = {'polynomialfeatures__degree': np.arange(1, 10),
                'lasso__alpha': np.logspace(-1, -4, 10)}

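#loosen tol (default 1e-4) so that coordinate descent converges sooner
#older scikit-learn versions accepted Lasso(normalize=True); that parameter was removed in version 1.2,
#so the polynomial features are rescaled with a StandardScaler step instead
model = make_pipeline(PolynomialFeatures(), 
                      StandardScaler(),
                      Lasso(tol=0.01))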
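The sparsity that lasso favors can be seen on a small synthetic problem. In this sketch (made-up data; `true_w`, `X_demo`, and `y_demo` are illustrative names), only the first three of ten features influence the target. The penalty strength `alpha=0.5` is an arbitrary choice for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

# Only the first three features matter; the other seven are pure noise
true_w = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

print(np.round(lasso.coef_, 3))  # irrelevant features get exactly 0
print(np.round(ridge.coef_, 3))  # irrelevant features get small but non-zero weights
print(np.count_nonzero(lasso.coef_), np.count_nonzero(ridge.coef_))
```

Ridge shrinks every coefficient a little, while lasso removes the irrelevant ones entirely and also biases the remaining ones toward zero.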

### Elastic net 

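Elastic net is linear regression with a combined $L_1$ and $L_2$ regularizer.  In the form below, $\alpha \in [0, 1]$ sets the balance between the two penalties:

$$
J(\theta) = \frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \alpha \sum_{j=1}^n |\theta_j| + (1 - \alpha) \sum_{j=1}^n \theta_j^2
$$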
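In scikit-learn the mixing weight is called `l1_ratio`, while `alpha` scales the whole penalty. The sketch below (synthetic data; all variable names are made up for illustration) checks one consequence of that parameterization: with `l1_ratio=1.0`, elastic net reduces to lasso.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# scikit-learn's ElasticNet objective, as given in its documentation:
#   1 / (2 * n_samples) * ||y - Xw||^2_2
#   + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # pure L1 penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)  # half L1, half L2

print(np.round(lasso.coef_, 3))
print(np.round(enet_l1.coef_, 3))
print(np.round(enet_mix.coef_, 3))
```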



In [6]:
from sklearn.linear_model import ElasticNet
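#if the search is too slow on your machine, consider loosening tol, e.g., ElasticNet(tol=0.01)
#older scikit-learn versions accepted ElasticNet(normalize=True); that parameter was removed in version 1.2,
#so the polynomial features are rescaled with a StandardScaler step instead
model = make_pipeline(PolynomialFeatures(), 
                      StandardScaler(),
                      ElasticNet())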

#note that sklearn has two parameters, alpha and l1_ratio, for the complete equation, refer to the doc
params_Elasticnet = {'polynomialfeatures__degree': np.arange(1, 10),
                'elasticnet__alpha': np.logspace(-1, -4, 10),
                "elasticnet__l1_ratio": np.linspace(0, 1, 5)}

### Ridge or Lasso or Elastic net??

Regularization should ALMOST ALWAYS be used, since these techniques reduce overfitting.

Choosing among them is a little difficult. It helps to understand the assumption behind each.
1.  Ridge assumes that coefficients are normally distributed.   **Thus, if you don't want any feature to dominate too much, use Ridge.**
2. Lasso assumes that coefficients are Laplace distributed (in layman's terms, some predictors are very useful while others are completely irrelevant).   Lasso can shrink coefficients to exactly zero and thus eliminate predictors that are not useful for the output, which amounts to automatic feature selection.  **In simple words, if you have only a few predictors with medium/large effect, use Lasso.**
3.  Elastic net is basically a compromise between the two, and it takes more computation time because the balance between the two penalties is one more hyperparameter to tune.  **If you have the resources to spare, you can use Elastic net.**
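In practice, one way to decide is simply to cross-validate all three. The following is a rough sketch on the diabetes data, assuming plain degree-1 features and a small, arbitrary grid (`alphas`, `candidates`, and `results` are illustrative names); it is not a tuned comparison.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 1, 5)

# One small grid per estimator; elastic net has an extra axis to search
candidates = {
    'ridge': (Ridge(), {'ridge__alpha': alphas}),
    'lasso': (Lasso(max_iter=10000), {'lasso__alpha': alphas}),
    'elasticnet': (ElasticNet(max_iter=10000),
                   {'elasticnet__alpha': alphas,
                    'elasticnet__l1_ratio': [0.2, 0.5, 0.8]}),
}

results = {}
for name, (estimator, grid_params) in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid_params, cv=5)  # default scoring is R^2
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)
```

The elastic net grid here has three times as many candidates as the other two, which is where its extra cost comes from.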


### ElasticNet + Stochastic Gradient Descent

Sklearn also provides <code>SGDRegressor()</code>, which fits a linear model with stochastic gradient descent and supports the L2, L1, and elastic net penalties.

In [7]:
from sklearn.linear_model import SGDRegressor

model = make_pipeline(PolynomialFeatures(), 
                      SGDRegressor())

params_SGD = {'polynomialfeatures__degree': np.arange(1, 10),
              'sgdregressor__alpha': np.logspace(-1, -4, 10),
              'sgdregressor__penalty': ['l2', 'l1', 'elasticnet'],
              'sgdregressor__l1_ratio': np.linspace(0, 1, 5),
              'sgdregressor__learning_rate': ['constant', 'optimal',
                                              'invscaling', 'adaptive']}
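Before running the full grid, a single configuration can be fitted to check that everything works. This is a minimal sketch with hand-picked, untuned settings; the hyperparameter values, the fixed `random_state`, and the names `sgd`, `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# SGD is sensitive to feature scale, so standardize inside the pipeline
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty='elasticnet', alpha=1e-3, l1_ratio=0.5,
                 max_iter=2000, random_state=0),
)
sgd.fit(X_tr, y_tr)

test_r2 = sgd.score(X_te, y_te)  # R^2 on the held-out split
print(test_r2)
print(sgd.named_steps['sgdregressor'].coef_)
```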

### Many more....

There are just too many to mention.  It may be nice to read here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model.   The Sklearn documentation usually gives good guidance on when to use which algorithm.  