# Regression

In this notebook, we look at three regression techniques (ordinary least squares, Ridge, and Lasso) using the Hitters dataset: we attempt to predict the salary of a baseball player from their performance on the field!

In [3]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
# Load the data, drop rows with NaNs, drop the first (player name) column, then one-hot encode the categorical variables.
df = pd.read_csv('Hitters.csv')
df = df.dropna().drop(df.columns[0], axis=1)
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df.Salary
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)
X.head()

In [None]:
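# Split first, then fit the scaler on the training data only, so no test-set statistics leak into training.
# Scaling matters because the Ridge and Lasso penalties depend on the scale of each feature.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)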

## Ordinary least squares regression (OLS)

In [None]:
ols = LinearRegression()
ols.fit(X_train, y_train)

print("r^2 on train data is {}".format(ols.score(X_train, y_train)))
print("r^2 on test data is {}".format(ols.score(X_test, y_test)))

The model is not fantastic: the r^2 on the test data is modest, and it is lower than the r^2 on the train data, so there's some overfitting going on.
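A single train/test split gives a fairly noisy picture of how well the model generalizes. One way to get a more stable estimate is k-fold cross-validation. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` rather than the Hitters data, and the names `X_demo`, `y_demo` and `ols_pipeline` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Hitters dataset).
X_demo, y_demo = make_regression(n_samples=263, n_features=19,
                                 noise=50.0, random_state=1)

# Putting the scaler inside the pipeline means it is re-fit on each training fold.
ols_pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; the default score for a regressor is r^2.
cv_r2 = cross_val_score(ols_pipeline, X_demo, y_demo, cv=5)
print("r^2 per fold:", np.round(cv_r2, 3))
print("mean r^2: {:.3f} (std {:.3f})".format(cv_r2.mean(), cv_r2.std()))
```

If the per-fold scores vary a lot, the score from any single split should be read with caution.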

In [None]:
print("Intercept: {}".format(ols.intercept_))

n_features = len(ols.coef_)
plt.figure(dpi = 800)
plt.barh(range(n_features), ols.coef_, align='center')
plt.yticks(np.arange(n_features), X.columns)
plt.xlabel("Coefficient (standardized features)")
plt.ylabel("Feature")

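Because the features are standardized, the coefficients are on a comparable scale, so this figure lets us read off which features have the largest influence (positive or negative) on the predicted salary. Keep in mind that strongly correlated features can make individual OLS coefficients unstable.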
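Instead of reading the ranking off the bar chart, the coefficients can also be sorted by absolute value. This is a minimal sketch on synthetic data; the feature names `f_a` to `f_d` and the true coefficients are invented for the illustration and have nothing to do with the Hitters columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["f_a", "f_b", "f_c", "f_d"]  # hypothetical feature names

# Features are drawn on the same scale; f_b has no effect on the target.
X_demo = rng.standard_normal((200, 4))
y_demo = (3.0 * X_demo[:, 0] - 5.0 * X_demo[:, 2] + 0.5 * X_demo[:, 3]
          + rng.normal(scale=0.1, size=200))

model = LinearRegression().fit(X_demo, y_demo)

# Label the coefficients, then order them by magnitude (sign is kept).
coefs = pd.Series(model.coef_, index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

In the notebook itself, the same idea would apply to `ols.coef_` with `X.columns` as the index.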

## Ridge regression

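Note that alpha is tuned on the test data here, so the test scores reported below are somewhat optimistic. We search a grid of alpha values and keep the one with the lowest mean squared error on the test data: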

In [None]:
alphas = 10**np.linspace(10,-2,100)*0.5
best_ridge, best_ridge_alpha = None, None
best_ridge_mse = float("inf")

for alpha in alphas:
    ridge_model = Ridge(alpha=alpha, max_iter=100000)
    ridge_model.fit(X_train, y_train)  
    mse = mean_squared_error(y_test, ridge_model.predict(X_test))
    
    if mse < best_ridge_mse:
        best_ridge_mse = mse
        best_ridge = ridge_model
        best_ridge_alpha = alpha

print(best_ridge_alpha)

In [None]:
ridge = Ridge(alpha = best_ridge_alpha)
ridge.fit(X_train, y_train)
print("r^2 on train data is {}".format(ridge.score(X_train, y_train)))
print("r^2 on test data is {}".format(ridge.score(X_test, y_test)))

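It's not a huge improvement over OLS, but it is indeed an improvement: the test r^2 is a bit higher, and the gap between train and test r^2 is a little smaller, so there's less overfitting. Increasing alpha further will decrease both r^2 values a bit, but bring the two closer to each other.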
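The search above picks alpha by looking at the test data. A common alternative is to choose alpha by cross-validation on the training data only and touch the test data once at the end. The following sketch uses scikit-learn's `RidgeCV` for that. It runs on synthetic data (an assumption, not the Hitters data), and the alpha grid is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Hitters dataset).
X_demo, y_demo = make_regression(n_samples=263, n_features=19,
                                 noise=50.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Candidate penalties on a log-spaced grid (illustrative range).
alphas = 10 ** np.linspace(4, -2, 50)

# RidgeCV selects alpha with 5-fold cross-validation on the training data.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge_cv.fit(X_tr, y_tr)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
test_r2 = ridge_cv.score(X_te, y_te)  # the test data is used only here
print("alpha chosen by cross-validation:", best_alpha)
print("r^2 on test data:", test_r2)
```

`LassoCV` offers the same pattern for Lasso regression.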

In [None]:
print("Intercept: {}".format(ridge.intercept_))

n_features = len(ridge.coef_)
plt.figure(dpi = 800)
plt.barh(range(n_features), ridge.coef_, align='center')
plt.yticks(np.arange(n_features), X.columns)
plt.xlabel("Coefficient (standardized features)")
plt.ylabel("Feature")

Notice that the coefficients are much smaller in magnitude than for OLS: the Ridge penalty shrinks them towards zero, although it typically does not set any of them exactly to zero.
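The shrinkage follows from the objective that Ridge minimizes. As documented for scikit-learn's `Ridge`, with coefficient vector $w$, feature matrix $X$ and target $y$ (the intercept is not penalized):

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

For $\alpha = 0$ this reduces to ordinary least squares. As $\alpha$ grows, large coefficients become more expensive, so the solution trades some fit on the training data for a smaller $\lVert w \rVert_2$.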

## Lasso regression

In [None]:
alphas = 10**np.linspace(10,-2,100)*0.5
best_lasso, best_lasso_alpha = None, None
best_lasso_mse = float("inf")

for alpha in alphas:
    lasso_model = Lasso(alpha=alpha, max_iter=100000)
    lasso_model.fit(X_train, y_train)  
    mse = mean_squared_error(y_test, lasso_model.predict(X_test))
    
    if mse < best_lasso_mse:
        best_lasso_mse = mse
        best_lasso = lasso_model
        best_lasso_alpha = alpha

print(best_lasso_alpha)

In [None]:
lasso = Lasso(alpha = best_lasso_alpha, max_iter=100000)  # same max_iter as in the search above
lasso.fit(X_train, y_train)
print("r^2 on train data is {}".format(lasso.score(X_train, y_train)))
print("r^2 on test data is {}".format(lasso.score(X_test, y_test)))

The r^2 on the test data is similar to Ridge, but the train and test values are a bit closer to each other - slightly less overfitting!

In [None]:
print("Intercept: {}".format(lasso.intercept_))

n_features = len(lasso.coef_)
plt.figure(dpi = 800)
plt.barh(range(n_features), lasso.coef_, align='center')
plt.yticks(np.arange(n_features), X.columns)
plt.xlabel("Coefficient (standardized features)")
plt.ylabel("Feature")

Behold the power of Lasso regression: several coefficients are exactly zero, so Lasso effectively performs feature selection!
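To see which features Lasso drops, the fitted coefficients can be compared to zero directly. This is a small sketch on synthetic data where only 3 of 10 features carry signal. The data, the feature names `x0` to `x9` and the value `alpha=5.0` are assumptions chosen for the illustration, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
feature_names = np.array(["x{}".format(i) for i in range(10)])  # hypothetical names

lasso_demo = Lasso(alpha=5.0, max_iter=100000).fit(X_demo, y_demo)

# Coordinate descent sets dropped coefficients to exactly 0.0,
# so an equality comparison is appropriate here.
zero_mask = lasso_demo.coef_ == 0
print("dropped features:", list(feature_names[zero_mask]))
print("kept features:   ", list(feature_names[~zero_mask]))
```

In the notebook, the same mask on `lasso.coef_` together with `X.columns` would list the Hitters features that Lasso removed.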

In [None]:
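plt.figure(dpi=800)
plt.plot(y_test, lasso.predict(X_test), '.')
plt.plot([0, 1200], [0, 1200], '-')  # reference line: perfect predictions lie on y = x
plt.xlabel("Actual salary")
plt.ylabel("Predicted salary")
plt.show()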