# Regression with `sklearn`

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

## Data Setup

In [None]:
wine = pd.read_csv('../data/wine.csv')

wine.head()

In [None]:
wine_preds = wine.drop('quality', axis=1)
wine_target = wine['quality']

## Scale the Data

In [None]:
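# Standardize by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0, which is what StandardScaler uses).
wine_preds_scaled = (wine_preds - wine_preds.mean()) / wine_preds.std(ddof=0)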

In [None]:
# Let's create a StandardScaler object to scale our data for us.
ss = StandardScaler()

In [None]:
# Now we'll apply it to our data by using the .fit() and .transform() methods.
ss.fit(wine_preds)

In [None]:
wine_preds_st_scaled = ss.transform(wine_preds)

wine_preds_st_scaled

In [None]:
# Check that the scaling worked about the same as when we did it by hand
np.allclose(wine_preds_st_scaled, wine_preds_scaled)

In [None]:
wine_preds_scaled.head()

In [None]:
wine_preds_st_scaled[:5, :]
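One detail behind the by-hand comparison is worth isolating: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`). Here is a minimal sketch on a made-up two-column frame. The `toy` data and all variable names are illustrative assumptions, not part of the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the wine dataset.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 40.0, 50.0]})

by_sklearn = StandardScaler().fit_transform(toy)

# Population standard deviation (ddof=0) reproduces StandardScaler.
by_hand_pop = (toy - toy.mean()) / toy.std(ddof=0)

# pandas' default is the sample standard deviation (ddof=1), which does not.
by_hand_sample = (toy - toy.mean()) / toy.std()

matches_pop = np.allclose(by_sklearn, by_hand_pop)
matches_sample = np.allclose(by_sklearn, by_hand_sample)
print(matches_pop, matches_sample)
```

On a dataset with many rows the two versions are nearly identical, but on small samples the gap is large enough for `np.allclose` to fail.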

## Fit the Model

Now we can fit a `LinearRegression` object to our scaled data!

In [None]:
# Instantiate the model, then fit it on the scaled predictors and the target.

lr = LinearRegression()
lr.fit(wine_preds_st_scaled, wine_target)

In [None]:
# The .coef_ attribute holds the fitted coefficients,
# one per (scaled) predictor column.

lr.coef_

In [None]:
lr.intercept_
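With standardized predictors every column has mean zero, so the intercept of an ordinary least squares fit is simply the mean of the target, and each coefficient is the change in the target per one standard deviation of that predictor. The sketch below checks the intercept claim on synthetic data. The random data, the "true" coefficients, and the name `toy_lr` are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the wine data).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=50)

# Standardize the predictors, then fit.
X_scaled = StandardScaler().fit_transform(X)
toy_lr = LinearRegression().fit(X_scaled, y)

# Because the scaled columns are centered, the intercept equals mean(y).
intercept_is_mean = np.isclose(toy_lr.intercept_, y.mean())
print(toy_lr.intercept_, y.mean(), intercept_is_mean)
```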

In [None]:
lr.score(wine_preds_st_scaled, wine_target)

In [None]:
y_hat = lr.predict(wine_preds_st_scaled)
y_hat

All that's left is to evaluate our model to see how well it did!

## Evaluate Performance

### Observing Residuals

We can check the residuals like we would for a simple linear regression model.

In [None]:
y_hat = lr.predict(wine_preds_st_scaled)
resid = (wine_target - y_hat)

In [None]:
fig, ax = plt.subplots()
# Plot residuals in row order; look for drift or changing spread.
ax.scatter(x=range(y_hat.shape[0]), y=resid, alpha=0.1);
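A quick numeric check can complement the plot. For a least squares fit that includes an intercept, the training residuals average to zero and are uncorrelated with the fitted values. The sketch below shows this on synthetic data; the data and the names `toy_lr` and `toy_resid` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, not the wine data).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 + X @ np.array([1.5, -0.7]) + rng.normal(scale=1.0, size=100)

toy_lr = LinearRegression().fit(X, y)
toy_fitted = toy_lr.predict(X)
toy_resid = y - toy_fitted

# Both quantities are zero up to floating point error.
mean_resid = toy_resid.mean()
corr_with_fit = np.corrcoef(toy_fitted, toy_resid)[0, 1]
print(mean_resid, corr_with_fit)
```

These two properties hold by construction on the training data, so they say nothing about model quality. Patterns in the residual plot, such as curvature or a funnel shape, are what to look for.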

### Sklearn Metrics

The `metrics` module in sklearn offers several ways to measure how well a regression fits, including the $R^2$ score, the mean absolute error, and the mean squared error. Note that the default `.score()` on our model object is the $R^2$ score. Let's go back to our wine dataset:

In [None]:
metrics.r2_score(wine_target, lr.predict(wine_preds_st_scaled))

Let's make sure this metric is properly calibrated. If we simply use $\bar{y}$ as our prediction for every row, then we should get an $R^2$ score of *0*. And if we predict, say, $\bar{y} + 1$, then we should get a *negative* $R^2$ score.

In [None]:
avg_quality = np.mean(wine_target)
num = len(wine_target)

In [None]:
metrics.r2_score(wine_target, avg_quality * np.ones(num))

In [None]:
metrics.r2_score(wine_target, (avg_quality + 1) * np.ones(num))
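The calibration check follows directly from the definition $R^2 = 1 - SS_{res} / SS_{tot}$. Predicting $\bar{y}$ makes $SS_{res} = SS_{tot}$, and shifting that prediction by a constant $c$ adds $n c^2$ to $SS_{res}$. The sketch below computes $R^2$ by hand on a made-up target; the helper `r2_by_hand` and the array `y_toy` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn import metrics

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy target values.
y_toy = np.array([3.0, 5.0, 6.0, 6.0, 7.0])
mean_pred = np.full_like(y_toy, y_toy.mean())

r2_mean = r2_by_hand(y_toy, mean_pred)         # predicting the mean
r2_shifted = r2_by_hand(y_toy, mean_pred + 1)  # predicting mean + 1
print(r2_mean, r2_shifted)
print(metrics.r2_score(y_toy, mean_pred + 1))  # sklearn agrees
```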

In [None]:
metrics.mean_absolute_error(wine_target, lr.predict(wine_preds_st_scaled))

In [None]:
metrics.mean_squared_error(wine_target, lr.predict(wine_preds_st_scaled))
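Both error metrics are plain averages over the residuals, which is easy to confirm by hand. The sketch below uses four made-up target and prediction values (not taken from the wine data) and also takes the square root of the MSE, which puts the error back in the units of the target.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions.
y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.5, 5.0, 5.0, 6.0])

errors = y_true - y_pred           # [-0.5, 1.0, 0.0, 1.0]
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error

print(mae, mse, rmse)
print(metrics.mean_absolute_error(y_true, y_pred),
      metrics.mean_squared_error(y_true, y_pred))
```

Because the errors are squared before averaging, the MSE penalizes a few large misses more heavily than the MAE does.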

# Level Up: Deeper Evaluation of Wine Data Predictions

One thing we could have investigated from our [model on the Wine Data](#Multiple-Regression-in-Scikit-Learn) is how our predictions $\hat{y}$ match with the actual target values.

In [None]:
sns.histplot(y_hat, kde=True, fill=False, stat='density', color='red')
sns.histplot(wine_target, discrete=True, stat='density')

So there's a slight issue with our model: linear regression treats the target as continuous, but the quality scores in our data are integers.

An easy fix is to round the predictions.

In [None]:
y_hat_rounded = np.round(y_hat)
np.unique(y_hat_rounded, return_counts=True)
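One detail to keep in mind is that `np.round` rounds exact halves to the nearest even integer rather than always rounding up. The sketch below shows this on a few made-up predictions (hypothetical values, not model output), together with the `return_counts` form of `np.unique`.

```python
import numpy as np

# Hypothetical continuous predictions.
toy_preds = np.array([4.5, 5.5, 5.2, 5.4, 6.7])

# Halves go to the nearest even integer: 4.5 -> 4.0 and 5.5 -> 6.0.
toy_rounded = np.round(toy_preds)

# Distinct rounded values and how often each one occurs.
values, counts = np.unique(toy_rounded, return_counts=True)
print(toy_rounded)
print(values, counts)
```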

In [None]:
metrics.mean_squared_error(wine_target, y_hat_rounded)

Plotting the distribution is a lot more meaningful once the predictions are rounded to integers.

In [None]:
sns.histplot(np.round(y_hat), fill=False, discrete=True, stat='density', color='red')
sns.histplot(wine_target, discrete=True, alpha=0.3, stat='density')

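Note that our $R^2$ metric gets worse here. Least squares chose its coefficients to minimize squared error for continuous predictions, and rounding afterwards shifts each prediction by up to 0.5 without refitting anything.

Nothing in the fit was optimized for _integer_ predictions, so there is no reason to expect rounding to help on this metric.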

In [None]:
metrics.r2_score(wine_target, y_hat_rounded)

You'll have to decide for yourself whether this is worth doing or whether a different model makes more sense (we'll see more models in future lectures).

# Level Up: Regression with Categorical Features on the Comma Dataset

In [None]:
commas = pd.read_csv('../data/comma-survey.csv')

In [None]:
commas.head()

In [None]:
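# Fit the encoder on the survey answers; the DataFrame loaded above is `commas`.
ohe = OneHotEncoder(drop='first').fit(commas.drop('RespondentID', axis=1))

In [None]:
# Note: get_feature_names() was removed in scikit-learn 1.2 in favor of
# get_feature_names_out(), which builds names from the original column
# names rather than the x0_, x1_, ... prefixes used below.
comma_df = pd.DataFrame(ohe.transform(commas.drop('RespondentID', axis=1)).toarray(),
                        columns=ohe.get_feature_names())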

In [None]:
comma_df.columns
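To see what `drop='first'` does, it can help to encode a tiny frame by hand. The sketch below uses two made-up categorical columns (the `toy` frame and its values are illustrative assumptions, not survey data). It builds the column labels from the encoder's `categories_` attribute so that it does not depend on which feature-name method the installed scikit-learn version provides.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy survey with two categorical columns.
toy = pd.DataFrame({
    "age": ["18-29", "30-44", "18-29", "45-60"],
    "uses_comma": ["yes", "no", "yes", "yes"],
})

enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(toy).toarray()  # sparse by default, so densify

# categories_ holds the sorted categories of each column; with drop='first'
# the first category of every column becomes the all-zeros baseline.
kept = [f"{col}_{cat}"
        for col, cats in zip(toy.columns, enc.categories_)
        for cat in cats[1:]]

toy_encoded = pd.DataFrame(encoded, columns=kept)
print(toy_encoded)
```

A row with `age == "18-29"` and `uses_comma == "no"` would be encoded as all zeros, which is why dropping one level per feature loses no information.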

In [None]:
# We'll try to predict the first column of comma_df: a 0/1 indicator of
# whether the respondent chose the sentence written with the Oxford comma
# ("honest, kind, and loyal") as more grammatically correct.

comma_target = comma_df["x0_It's important for a person to be honest, kind, and loyal."]

comma_predictors = comma_df[['x8_30-44',
       'x8_45-60', 'x8_> 60', 'x9_$100,000 - $149,999',
       'x9_$150,000+', 'x9_$25,000 - $49,999', 'x9_$50,000 - $99,999']]

comma_lr = LinearRegression()

comma_lr.fit(comma_predictors, comma_target)

comma_lr.score(comma_predictors, comma_target)

In [None]:
comma_lr.coef_
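Coefficients on dummy variables are read relative to the dropped baseline category. In the simplest case of a single categorical predictor, the intercept is the baseline group's mean of the target and each coefficient is the difference between that group's mean and the baseline's. The sketch below checks this on made-up data; the groups `a`, `b`, `c` and the 0/1 target are hypothetical. With several categorical features, as in the model above, the coefficients are additive adjustments rather than exact group-mean differences.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical category labels and a 0/1 target.
group = np.array(["a"] * 2 + ["b"] * 4 + ["c"] * 4)
y = np.array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1], dtype=float)

# Dummy-code the category with "a" dropped as the baseline.
X = np.column_stack([group == "b", group == "c"]).astype(float)

toy_lr = LinearRegression().fit(X, y)

# Group means of the target, for comparison with the fitted parameters.
mean_a = y[group == "a"].mean()
mean_b = y[group == "b"].mean()
mean_c = y[group == "c"].mean()

print(toy_lr.intercept_, toy_lr.coef_)
print(mean_a, mean_b - mean_a, mean_c - mean_a)
```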

In [None]:
comma_df.corr()['x0_It\'s important for a person to be honest, kind, and loyal.']