# Linear Regressions

Here, we'll see examples of how to use the scikit-learn `LinearRegression` class, as well as the statsmodels `OLS` function, whose interface and output are much closer to R's `lm` function.

[http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns

Let's make a random dataset where X is uniformly distributed between 0 and 1, and y is a cosine function plus Gaussian noise:

In [None]:
np.random.seed(0)

n_samples = 30

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

X = np.sort(np.random.rand(n_samples))
noise_size = 0.1
y = true_fun(X) + np.random.randn(n_samples) * noise_size

In [None]:
X.shape

In [None]:
plt.scatter(X, y)

The scikit-learn linear regression class has the same programming interface we saw with k-NN:

In [None]:
linear_regression = LinearRegression()
# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
linear_regression.fit(X.reshape(-1, 1), y)

We can get the parameters of the fit:

In [None]:
print(linear_regression.intercept_)
print(linear_regression.coef_)
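
To make the meaning of `intercept_` and `coef_` concrete, here is a minimal sketch that solves the same least-squares problem with plain NumPy. It regenerates the dataset from the cells above and assumes a design matrix with a leading column of ones; the names `A`, `beta`, `intercept` and `slope` are illustrative only, and the values should agree with scikit-learn's up to floating-point error.

```python
import numpy as np

# Recreate the dataset from the cells above so the example is self-contained.
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + np.random.randn(n_samples) * 0.1

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack((np.ones(n_samples), X))

# Solve min ||A @ beta - y||^2; beta[0] is the intercept, beta[1] the slope.
beta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta
print(intercept, slope)
```

The column of ones plays the role that `fit_intercept=True` (the default) plays in `LinearRegression`.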

And we can plot the predictions as a line:

In [None]:
# equally spaced array of 100 values between 0 and 1, like the seq function in R
X_to_pred = np.linspace(0, 1, 100).reshape(100, 1)

preds = linear_regression.predict(X_to_pred)

plt.scatter(X, y)
plt.plot(X_to_pred, preds)
plt.show()

Let's fit a model of the form $y \sim x + x^2$.

In [None]:
X**2

In [None]:
X2 = np.column_stack((X, X**2))
X2

In [None]:
linear_regression.fit(X2, y)

In [None]:
print(linear_regression.intercept_)
print(linear_regression.coef_)

In [None]:
# equally spaced array of 100 values between 0 and 1, like the seq function in R
X_p = np.linspace(0, 1, 100).reshape(100, 1)
X_to_pred = np.column_stack((X_p, X_p**2))

preds = linear_regression.predict(X_to_pred)

plt.scatter(X, y)
plt.plot(X_p, preds)
plt.show()
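
Building the squared column by hand works, but scikit-learn can also generate polynomial terms for us. The following is a sketch of one alternative, not the approach used above: it assumes `PolynomialFeatures` and `make_pipeline` from scikit-learn, and the variable names `quadratic` and `fitted` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the dataset from the cells above so the example is self-contained.
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + np.random.randn(n_samples) * 0.1

# include_bias=False because LinearRegression already fits its own intercept.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quadratic.fit(X.reshape(-1, 1), y)

# The pipeline expands x into (x, x^2) at prediction time as well.
X_p = np.linspace(0, 1, 100).reshape(-1, 1)
preds = quadratic.predict(X_p)

# make_pipeline names each step after its lowercased class name.
fitted = quadratic.named_steps["linearregression"]
print(fitted.intercept_, fitted.coef_)
```

Because the pipeline applies the same expansion in `fit` and `predict`, there is no need to remember to square the prediction grid separately.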

## Statsmodels

The `statsmodels` package provides OLS functionality that closely resembles R's, including formula-based model specification and a detailed summary table.

[http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html](http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html)

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

np.random.seed(9876789)

### Using A Formula to Fit to a Pandas Dataframe

[http://statsmodels.sourceforge.net/0.6.0/examples/notebooks/generated/formulas.html](http://statsmodels.sourceforge.net/0.6.0/examples/notebooks/generated/formulas.html)

In [None]:
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

In [None]:
original_df = dta.data
original_df.head()
subsetted_df = original_df[['Lottery', 'Literacy', 'Wealth', 'Region']]
subsetted_df.head(100)

In [None]:
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

In [None]:
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region + Literacy:Wealth', data=df)
res = mod.fit()
print(res.summary())

In [None]:
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + I(Wealth ** 2.0) + I(Wealth ** 3.0) + Region + Literacy:Wealth', data=df)
res = mod.fit()
print(res.summary())

`Region` is stored as a string, so the formula interface treats it as categorical automatically. If it were an integer code instead, we could explicitly mark it as categorical with `C()`:

In [None]:
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)

### Using numpy matrices directly

Let's construct a dataset which is $y \sim 1+0.1x+10x^2+N(0,1)$:

In [None]:
nsample = 500

x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))

beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)

X = sm.add_constant(X)
y = np.dot(X, beta) + e

In [None]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

We can access the fit parameters like this:

In [None]:
print('Parameters: ', results.params)
print('Standard errors: ', results.bse)
print('Predicted values: ', results.predict())
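
As a rough cross-check of what the parameters and standard errors represent, the sketch below recomputes them with NumPy using the textbook OLS formulas, $\hat\beta = \arg\min \|X\beta - y\|^2$ and $\mathrm{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2 \, [(X^T X)^{-1}]_{jj}}$. It assumes the default (non-robust) covariance; the names `beta_hat`, `sigma2` and `std_err` are illustrative.

```python
import numpy as np

# Recreate the quadratic dataset from the cells above.
np.random.seed(9876789)
nsample = 500
x = np.linspace(0, 10, nsample)
X = np.column_stack((np.ones(nsample), x, x**2))  # constant, x, x^2
beta = np.array([1, 0.1, 10])
y = X @ beta + np.random.normal(size=nsample)

# Least-squares estimate of the coefficients.
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance, using n - p degrees of freedom.
resid = y - X @ beta_hat
dof = nsample - X.shape[1]
sigma2 = resid @ resid / dof

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1.
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print('Parameters: ', beta_hat)
print('Standard errors: ', std_err)
```

The estimates should land close to the true `beta` of `[1, 0.1, 10]`, with the tightest standard error on the $x^2$ term.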

Now let's see an example with a categorical variable that has several levels, and how to expand it into dummy variables, as R's `lm` function does automatically:

In [None]:
nsample = 50

# make an array that is all zeroes
groups = np.zeros(nsample, int)
# make some of the values 1's
groups[20:40] = 1
# and make some of them 2's
groups[40:] = 2

groups

In [None]:
# have statsmodels expand the categorical variable into dummies
dummy = sm.categorical(groups, drop=True)
dummy
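
The dummy expansion itself is simple enough to sketch in NumPy, which may help if `sm.categorical` is not available in your statsmodels version (I believe it was deprecated in later releases, but treat that as an assumption). Each level gets one 0/1 indicator column; `levels` and `dummy_no_ref` are illustrative names.

```python
import numpy as np

# Recreate the group labels from the cell above.
nsample = 50
groups = np.zeros(nsample, int)
groups[20:40] = 1
groups[40:] = 2

# One indicator column per distinct level: compare each label
# against every level via broadcasting.
levels = np.unique(groups)
dummy = (groups[:, None] == levels).astype(float)

# Drop the first column so that level 0 becomes the reference category.
dummy_no_ref = dummy[:, 1:]
print(dummy.shape, dummy_no_ref.shape)
```

Dropping one column matters once a constant is added: with all three indicators plus an intercept, the design matrix columns would be linearly dependent.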

Let's construct a dataset which is $y \sim 1+3x-3\,\mathrm{group}_1+10\,\mathrm{group}_2+N(0,1)$:

In [None]:
x = np.linspace(0, 20, nsample)
# drop reference category
X = np.column_stack((x, dummy[:,1:]))
X = sm.add_constant(X)

beta = [1., 3, -3, 10]
y_true = np.dot(X, beta)
e = np.random.normal(size=nsample)
y = y_true + e

In [None]:
res2 = sm.OLS(y, X).fit()
print(res2.summary())