# Introduction to Linear Regression

We will use [statsmodels](http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html) to learn about linear regression. It reports more detail about a fitted model than scikit-learn does, which helps while we are learning and want more insight into the model parameters. For the rest of the course we will mainly use scikit-learn.

In [None]:
# Import the libraries required
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline

In [None]:
# Read in data
house_data = pd.read_csv("chicagohouseprices2.csv")

In [None]:
# Summarise the data
house_data.describe()

In [None]:
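# Remove the first column, which is just a saved row index
# (the positional axis argument, drop('Unnamed: 0', 1), no longer works in recent pandas)
house_data = house_data.drop(columns='Unnamed: 0')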

In [None]:
# Look for any linear correlations in the data
house_data.corr()
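
By default, `corr()` reports the Pearson correlation coefficient for each pair of columns. As a rough sketch of what that number measures, the coefficient can be computed by hand for two variables. The function name `pearson_r` and the numbers below are made up for illustration and are not taken from the housing data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (hypothetical sizes and prices)
size = [1000, 1500, 2000, 2500]
price = [200, 290, 410, 500]
r = pearson_r(size, price)
print(r)
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship (though a non-linear one may still exist).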

In [None]:
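# Plot every pair of variables against each other
# (scatter_matrix lives in pd.plotting in current pandas versions)
pd.plotting.scatter_matrix(house_data, figsize=(15, 15))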

### Questions?

- Can you describe the data set? Give a summary of what's happening.
- Which variables appear to affect house prices, based on our initial inspection?
- What type of relationship do those variables have with price?

In [None]:
# create a fitted model in one line
lm = smf.ols(formula='Price ~ Bath + HouseSizeSqft', data=house_data).fit()

# print the coefficients
lm.params

In [None]:
# What would you expect a house price to be for a house with 3 bathrooms and 350 sqft?
# Calculate it.
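
As a hint for the calculation, a fitted linear model predicts by adding the intercept to each coefficient multiplied by its feature value. The sketch below uses hypothetical coefficients in place of the real values from `lm.params`, so the resulting number is illustrative only.

```python
# Hypothetical coefficients standing in for the values in lm.params;
# substitute the numbers your own fitted model reports.
params = {"Intercept": 50000.0, "Bath": 20000.0, "HouseSizeSqft": 100.0}

# The house we want a price for
house = {"Bath": 3, "HouseSizeSqft": 350}

# prediction = intercept + sum(coefficient * feature value)
predicted_price = params["Intercept"] + sum(
    params[name] * value for name, value in house.items()
)
print(predicted_price)
```

With a real fitted model, passing a one-row DataFrame with `Bath` and `HouseSizeSqft` columns to `lm.predict` should give the same result as doing this arithmetic with the values in `lm.params`.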

In [None]:
lm.summary()

In [None]:
# Now try using only the estimated price as the predictor
# create a fitted model in one line
lm = smf.ols(formula='Price ~ EstimatedPrice', data=house_data).fit()

# print the coefficients
lm.params
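
With a single predictor, the two least squares coefficients have a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below illustrates that formula on made-up numbers. The helper name `fit_simple_ols` is hypothetical, and this is not how statsmodels is implemented internally.

```python
def fit_simple_ols(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data constructed so that price = 30 + 1.0 * estimated
estimated = [100.0, 200.0, 300.0, 400.0]
price = [130.0, 230.0, 330.0, 430.0]
intercept, slope = fit_simple_ols(estimated, price)
print(intercept, slope)
```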

In [None]:
lm.summary()

In [None]:
# create a DataFrame with the minimum and maximum values of EstimatedPrice
X_new = pd.DataFrame({'EstimatedPrice': [house_data.EstimatedPrice.min(), house_data.EstimatedPrice.max()]})
X_new.head()

In [None]:
preds = lm.predict(X_new)
preds

In [None]:
# first, plot the observed data
house_data.plot(kind='scatter', x='EstimatedPrice', y='Price')

# then, plot the least squares line
plt.plot(X_new['EstimatedPrice'], preds, c='red', linewidth=2)

In [None]:
# Try selecting different variables or combinations of variables. Can you get a better fit?
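
One common way to compare fits is R-squared, which `lm.summary()` reports and which is also available as `lm.rsquared`. As a minimal sketch of what it measures, it is one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data below are made up for illustration.

```python
def r_squared(observed, predicted):
    """Share of the variance in observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    # Squared error left over after using the model's predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Squared error from always predicting the mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Illustrative values only
observed = [10.0, 20.0, 30.0, 40.0]
good_fit = [11.0, 19.0, 31.0, 39.0]
poor_fit = [25.0, 25.0, 25.0, 25.0]  # always predicts the mean

print(r_squared(observed, good_fit))
print(r_squared(observed, poor_fit))
```

Adding predictors never lowers plain R-squared on the training data, so when comparing models with different numbers of variables the adjusted R-squared in the summary is usually the fairer figure.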