# Regression
Regression is the supervised ML technique for predicting a continuous target variable.
1. Ordinary Least Squares (OLS): sklearn.linear_model.LinearRegression(normalize=True). Note that the normalize parameter exists only in older scikit-learn releases (it was deprecated in 1.0 and removed in 1.2).
2. LASSO$^{1}$ + LARS$^{2}$: Performs both feature selection and noise reduction (through regularization) to avoid overfitting and to improve prediction performance and interpretability. Y should be normally distributed. LassoLars(alpha=1). With alpha = 0 there is no penalty, so the fit is equivalent to OLS, similar to running LinearRegression. A higher alpha applies a stronger penalty and will be more robust to collinearity between features.
3. Polynomial Regression: just like an ordinary linear model, but where the features are polynomial. Create the features with PolynomialFeatures(degree=d) and fit a model using LinearRegression.
4. Generalized Linear Model: allows for different distributions beyond the Normal distribution assumed by OLS (and by other models based on OLS, like LASSO). TweedieRegressor(power=n, alpha=1)

 - For a normally distributed y and a linear relationship, the first two are the best options
 - For a polynomial relationship, polynomial regression is the best
 - For normal, Poisson, gamma or inverse Gaussian distributions, use the generalized linear model
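
To make the feature-selection claim for LASSO concrete, here is a minimal sketch on synthetic data (the data, the seed, and the variable names are assumptions for illustration, not part of the grades dataset). The target depends only on the first feature, so the L1 penalty should shrink the second coefficient to exactly zero while OLS keeps a small nonzero value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoLars

# Synthetic data (hypothetical): y depends on the first feature only;
# the second feature is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = LassoLars(alpha=0.1).fit(X, y)

# OLS estimates both coefficients; LassoLars drops the irrelevant one.
print("OLS coefficients:      ", ols.coef_)
print("LassoLars coefficients:", lasso.coef_)
```

The LassoLars coefficient on the informative feature is also pulled slightly toward zero, which is the price paid for the regularization.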

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFE
import wrangle

In [2]:
df = wrangle.wrangle_grades()

In [3]:
df.head()

Unnamed: 0,exam1,exam2,exam3,final_grade
0,100,90,95,96
1,98,93,96,95
2,85,83,87,87
3,83,80,86,85
4,93,90,96,97


In [4]:
train_and_validate, test = train_test_split(df, random_state=123)
train, validate = train_test_split(train_and_validate, random_state=123)
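
Because neither call passes a size, train_test_split falls back to its default test_size of 0.25, so the two splits leave roughly 56% of the rows for train, 19% for validate and 25% for test. The sketch below reproduces that arithmetic in plain Python. The total of 102 rows is an assumption (it is not printed in the notebook, but it is consistent with the 57 training rows reported further down), and split_sizes is a hypothetical helper that mirrors my understanding of how scikit-learn rounds the test share up.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_rows = 102  # assumption: total number of rows in df
n_train_validate, n_test = split_sizes(n_rows)
n_train, n_validate = split_sizes(n_train_validate)

print("train:", n_train, "validate:", n_validate, "test:", n_test)
```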


In [5]:
train.head()

Unnamed: 0,exam1,exam2,exam3,final_grade
10,58,65,70,68
15,85,83,87,87
42,83,80,86,85
51,70,75,78,72
46,73,70,75,76


In [6]:
# Split into X and y
X_train = train.drop(columns='final_grade')
y_train = train[['final_grade']]

# Validate split
X_validate = validate.drop(columns="final_grade")
y_validate = validate[["final_grade"]]

# Test split
X_test = test.drop(columns='final_grade')
y_test = test[['final_grade']]

# Exercise
1. Set baseline predictions (mean, median)

2. Evaluate the baseline: compare y (the actual values) to the predicted values, which are all the same value (e.g. the mean of y).

- y: 19, 18, 12, 8, 5

- y_pred: 12.4, 12.4, 12.4, 12.4, 12.4

3. LinearRegression()

4. LassoLars()

5. PolynomialFeatures(degree=2) ... then LinearRegression()

 - for each one, evaluate with training predictions, and then with validate predictions.
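
As a worked example of steps 1 and 2, the sketch below evaluates a mean baseline on the five illustrative y values listed above, using only the standard library. It also shows a handy check: when the constant prediction is the mean of y, the RMSE equals the population standard deviation of y.

```python
import math
import statistics

y = [19, 18, 12, 8, 5]          # actual values from the example above
baseline = statistics.mean(y)   # constant prediction: the mean of y
y_pred = [baseline] * len(y)    # same prediction for every observation

# Root mean squared error between actual and predicted values.
mse = sum((actual - pred) ** 2 for actual, pred in zip(y, y_pred)) / len(y)
rmse = math.sqrt(mse)

print("baseline prediction:", baseline)
print("baseline RMSE:", rmse)
print("population std dev of y:", statistics.pstdev(y))
```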

In [7]:
print(np.mean(y_train))
print(np.median(y_train))

final_grade    81.631579
dtype: float64
81.0


In [8]:
len(y_train)

57

In [9]:
# Establish the baseline: predict the mean of y_train for every observation.
# np.full builds an array filled with that constant, with the same length
# as the target variable.
baseline_rmse = mean_squared_error(y_train, np.full(len(y_train), y_train.final_grade.mean()))**(1/2)
baseline_rmse

10.435485309619104

### Note: I ran the models on unscaled data. Typically you want to fit on scaled data, but all of my features are grades, so I did not scale.
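
A minimal sketch of why skipping the scaler is harmless for plain OLS: with an intercept, standardizing the features changes the coefficients but not the predictions. The data here is synthetic and only loosely exam-like (an assumption for illustration). The same is not true of penalized models such as LassoLars, where the penalty depends on the scale of each feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic exam-like data (hypothetical values, not the grades dataset).
rng = np.random.default_rng(42)
X = rng.uniform(50, 100, size=(60, 3))
y = X @ np.array([0.2, 0.3, 0.5]) + rng.normal(scale=1.0, size=60)

# OLS on the raw features.
pred_raw = LinearRegression().fit(X, y).predict(X)

# OLS on standardized features (zero mean, unit variance per column).
X_scaled = StandardScaler().fit_transform(X)
pred_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

print("Same predictions:", np.allclose(pred_raw, pred_scaled))
```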

## LinearRegression

In [10]:
# Fit the model
# (normalize=True only works in older scikit-learn; the parameter was removed in 1.2)
lm = LinearRegression(normalize=True)
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [11]:
# predict on our training observations
lm_pred = lm.predict(X_train)

In [12]:
# compute root mean squared error
lm_rmse = mean_squared_error(y_train, lm_pred)**(1/2)
lm_rmse

1.65367080502721

Observation: the linear model performed well, with a training RMSE far below the baseline. I will probably carry it into the validate phase, depending on how the other models perform.

### Validate

In [22]:
# Prediction of our validation set
lm_pred_v = lm.predict(X_validate)

In [23]:
lm_rmse_v = mean_squared_error(y_validate, lm_pred_v)**(1/2)
lm_rmse_v

1.8457838900875094

## LassoLars()

In [13]:
# Fit the model
lars = LassoLars(alpha=0.1)
lars.fit(X_train, y_train)

LassoLars(alpha=0.1, copy_X=True, eps=2.220446049250313e-16, fit_intercept=True,
          fit_path=True, max_iter=500, normalize=True, positive=False,
          precompute='auto', verbose=False)

In [14]:
# predict on our training observations
lars_pred = lars.predict(X_train)

In [15]:
# compute root mean squared error
lars_rmse = mean_squared_error(y_train, lars_pred)**(1/2)
lars_rmse

1.8229808394604425

Observation: the LassoLars model also performed well, though its training RMSE is slightly higher than the linear model's. I will probably carry it into the validate phase, depending on how the other models perform.

### Validate

In [24]:
# Prediction of our validation set
lars_pred_v = lars.predict(X_validate)

In [26]:
# compute root mean squared error
lars_rmse_v = mean_squared_error(y_validate, lars_pred_v)**(1/2)
lars_rmse_v

2.145309155204541

## PolynomialFeatures + LinearRegression

In [16]:
# create the polynomial feature transformer
pf = PolynomialFeatures(degree=2)

# fit and transform X_train to get a new set of features:
# a bias column, the original features, their squares and their pairwise products
X_train_squared = pf.fit_transform(X_train)
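
To see what the degree-2 expansion produces, here is a small sketch on a single made-up row with three features (the values 2, 3 and 5 are arbitrary). Three input columns become ten: a constant, the three originals, the three squares and the three pairwise products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with three features a=2, b=3, c=5.
row = np.array([[2, 3, 5]])
expanded = PolynomialFeatures(degree=2).fit_transform(row)

# Column order: 1, a, b, c, a^2, a*b, a*c, b^2, b*c, c^2
print(expanded.shape)
print(expanded[0].tolist())
```

With only 57 training rows, going from 3 to 10 columns gives the model noticeably more freedom, which is one reason to check the validate score before trusting a low training RMSE.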

In [17]:
# Fit the model
lm_squared = LinearRegression()
lm_squared.fit(X_train_squared, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [19]:
# predict on our training observations
lm_squared_pred = lm_squared.predict(X_train_squared)

In [20]:
# compute root mean squared error. Evaluate
lm_squared_rmse = mean_squared_error(y_train, lm_squared_pred)**(1/2)
lm_squared_rmse

0.7994269197336189

This model performed really well on the training data, with the lowest RMSE so far. Going to validate all three models.

In [21]:
print("Baseline, Mean: ", baseline_rmse)
print("Linear Model: ", lm_rmse)
print("LassoLars: ", lars_rmse)
print("Polynomial, squared: ", lm_squared_rmse)

Baseline, Mean:  10.435485309619104
Linear Model:  1.65367080502721
LassoLars:  1.8229808394604425
Polynomial, squared:  0.7994269197336189


### Validate

In [27]:
X_validate_squared = pf.transform(X_validate)

In [30]:
# Prediction of our validation set
lm_squared_pred_v = lm_squared.predict(X_validate_squared)

In [31]:
lm_squared_rmse_v = mean_squared_error(y_validate, lm_squared_pred_v)**(1/2)
lm_squared_rmse_v

0.829146504341694
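
With all three validate scores in hand, a side-by-side view makes the model choice easier. The sketch below simply tabulates the RMSE values reported in the cells above (rounded to four decimals) and picks the model with the lowest validate RMSE; the dictionary layout is just one convenient way to hold the numbers.

```python
# RMSE values reported in the cells above, rounded to four decimals.
results = {
    "Linear Model": {"train": 1.6537, "validate": 1.8458},
    "LassoLars": {"train": 1.8230, "validate": 2.1453},
    "Polynomial, squared": {"train": 0.7994, "validate": 0.8291},
}

# A large positive gap between validate and train suggests overfitting.
for name, rmse in results.items():
    gap = rmse["validate"] - rmse["train"]
    print(f"{name:<20} train={rmse['train']:.4f} validate={rmse['validate']:.4f} gap={gap:+.4f}")

best = min(results, key=lambda name: results[name]["validate"])
print("Lowest validate RMSE:", best)
```

All three models score slightly worse on validate than on train, and the polynomial model has both the lowest validate RMSE and the smallest gap, which supports taking it forward to the test set.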

## TweedieRegressor

In [39]:
#tw = TweedieRegressor(power=0, alpha=.1)
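
The cell above is left commented out, and TweedieRegressor is not in the notebook's import list (it lives in sklearn.linear_model). As a rough sketch of how the fit and evaluation could look, the example below uses synthetic data rather than the grades (the data and names are assumptions for illustration). With power=0 the model assumes a Normal distribution, and alpha controls the strength of the L2 penalty.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): two standardized features and a
# normally distributed target.
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 2))
y = 80 + 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=1.0, size=100)

# power=0 selects the Normal distribution; alpha is the L2 penalty strength.
tw = TweedieRegressor(power=0, alpha=0.1)
tw.fit(X, y)

# Evaluate with RMSE and compare against a mean baseline.
tw_rmse = mean_squared_error(y, tw.predict(X)) ** 0.5
baseline_rmse = mean_squared_error(y, np.full(len(y), y.mean())) ** 0.5
print("Tweedie RMSE: ", tw_rmse)
print("Baseline RMSE:", baseline_rmse)
```

On the grades data the same pattern would apply with X_train and y_train.final_grade in place of the synthetic arrays.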

## Test

In [None]:
X_test_squared = pf.transform(X_test)

# Prediction on our test set
lm_squared_pred_t = lm_squared.predict(X_test_squared)

In [35]:
lm_squared_rmse_t = mean_squared_error(y_test, lm_squared_pred_t)**(1/2)
lm_squared_rmse_t

0.61035625623354

In [41]:
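# set predictions to be the mean of all final grades
# (note: df also contains the validate and test rows;
# y_train.final_grade.mean() would use the training data only)
y_train['yhat_baseline'] = df['final_grade'].mean()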

# compute the RMSE
RMSE_bl = np.sqrt(mean_squared_error(y_train.final_grade, y_train.yhat_baseline))
print("Baseline (ŷ = ȳ)\n  Root mean squared error: {:.3}".format(RMSE_bl)) 

# no need to compute the explained variance because it will be 0 for a constant prediction, but we will demonstrate here:
evs = explained_variance_score(y_train.final_grade, y_train.yhat_baseline)
print("  {:.2%} of the variance in the student's final grade can be explained by the baseline prediction.".format(evs))

Baseline (ŷ = ȳ)
  Root mean squared error: 10.4
  0.00% of the variance in the student's final grade can be explained by the baseline prediction.


In [42]:
y_train.head()

Unnamed: 0,final_grade,yhat_baseline
10,68,81.970588
15,87,81.970588
42,85,81.970588
51,72,81.970588
46,76,81.970588


In [None]:
# add the model predictions computed earlier as columns so they can be plotted
y_train['yhat_lm'] = lm_pred.ravel()
y_train['yhat_poly'] = lm_squared_pred.ravel()

plt.figure(figsize=(9, 9))

plt.scatter(y_train.final_grade, y_train.yhat_lm, label='OLS (final_grade ~ all exam scores)', marker='o')
plt.scatter(y_train.final_grade, y_train.yhat_poly, label='Model with polynomial features', marker='o')
plt.scatter(y_train.final_grade, y_train.yhat_baseline, label=r'Baseline ($\hat{y} = \bar{y}$)', marker='o')
plt.plot([60, 100], [60, 100], label='Perfect predictions', ls=':', c='grey')

plt.legend(title='Model')
plt.ylabel('Predicted Final Grade')
plt.xlabel('Actual Final Grade')
plt.title('Predicted vs Actual Final Grade')
plt.show()