<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice MLR Using the Sacramento Real Estate Data

_Authors: Matt Brems (DC), Marc Harper (LA), Sam Stack (DC)_

---

We return to the Sacramento real estate data, but this time around we will be constructing multiple linear regression models. 

You will review the assumptions of multiple linear regression and practice building a model using the statsmodels package.

### 1) Load the data. 

In [1]:
import pandas as pd

shd_csv = './datasets/sacramento_real_estate_transactions_Clean.csv'
shd = pd.read_csv(shd_csv)  # "shd" is a placeholder name for the loaded DataFrame
shd.head()


## Dummy Variables

---

When building a regression, it's important to be cautious with categorical variables, which represent distinct groups or categories. If they're put in a regression "as is," categorical variables represented as integers will be treated like *continuous* variables.

For example, suppose the target variable is salary, occupation category "1" represents "analyst," and occupation category "3" represents "barista." If we leave this as a column of integers, the model fits a single coefficient `beta` for the whole column, so "barista" always contributes `beta*3` to the prediction while "analyst" contributes `beta*1`.

In simpler terms, instead of letting category "3" have its own effect on the estimate, the model forces that effect to be exactly three times the effect of category "1."

This will almost certainly produce an incorrect beta coefficient. Instead, we can re-represent the categories as multiple "dummy-coded" columns.
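
As a minimal sketch of what dummy coding produces, the toy example below builds a small occupation column and expands it with `pd.get_dummies`. The data, the third category ("teacher" for code 2), and the variable names are all hypothetical and are not part of the Sacramento data.

```python
import pandas as pd

# Hypothetical toy data: the occupation codes and salaries are made up.
toy = pd.DataFrame({
    "occupation": [1, 2, 3, 1, 3],
    "salary": [70000, 55000, 30000, 72000, 31000],
})

# Map the integer codes to labels so the dummy columns are readable.
# The label for code 2 is an assumption added for illustration.
labels = {1: "analyst", 2: "teacher", 3: "barista"}
toy["occupation"] = toy["occupation"].map(labels)

# One 0/1 column per category; each row has exactly one 1.
occupation_dummies = pd.get_dummies(toy["occupation"], dtype=int)
print(occupation_dummies)
```

Each category now gets its own column, so a regression can estimate a separate effect for each one instead of a single slope shared across arbitrary integer codes.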

### 2) Use the `pd.get_dummies` function to convert the `type` column into dummy-coded variables.

Print out the header of the dummy-coded variable output.

In [2]:
# A:

---

### A Word of Caution When Dummy Coding

Let's touch on precautions we should take when dummy coding.

**If you convert qualitative variables to dummy variables, you want to turn a variable with N categories into N-1 variables.**

> Scenario 1: Suppose we're working with the variable "sex" or "gender" with values "M" and "F." 

Your model should only include one variable for "sex = F," which becomes 1 if sex is female and 0 if sex is not female. Rather than saying "a one-unit change in X," the coefficient associated with "sex = F" is interpreted as the average change in Y when sex = F, relative to when sex = M.

> Scenario 2: Suppose we're modeling revenue at a bar for each day of the week. We have a column with strings identifying the day of the week the observation occurred.

We might include six of the days as their own variables ("Monday," "Tuesday," "Wednesday," "Thursday," "Friday," and "Saturday") **but not all seven days.**

The coefficient for Monday is then interpreted as the average change in revenue when "day = Monday," relative to "day = Sunday." The coefficient for Tuesday is interpreted as the average change in revenue when "day = Tuesday," relative to "day = Sunday," and so on.

The category you leave out, which the other columns are *relative to,* is often referred to as the **reference category**.
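
The day-of-week scenario can be sketched as follows, assuming a hypothetical DataFrame named `bar` with a `day` column. Dropping the reference category by name keeps it explicit which category the remaining coefficients are measured against.

```python
import pandas as pd

# Hypothetical observations; only the day labels matter for this sketch.
bar = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday", "Monday"],
})

# Seven categories produce seven 0/1 columns.
day_dummies = pd.get_dummies(bar["day"], dtype=int)

# Drop the chosen reference category, leaving N-1 = 6 columns.
day_dummies = day_dummies.drop(columns="Sunday")
print(day_dummies)
```

A Sunday row is now all zeros, which is how the model identifies the reference category. Keeping all seven columns alongside an intercept would make the columns perfectly collinear, because they always sum to one. `pd.get_dummies` also accepts `drop_first=True`, but that drops whichever category comes first in sorted order rather than one you choose.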

### 3) Remove "Unknown" from your dummy-coded variable DataFrame and append the rest to the original data.

In [3]:
# A:

### 4) Build what you think may be the best MLR model predicting `price`. 

The independent variables are your choice, but *include at least three.* At least one of your variables should be dummy coded (either one we created previously or a new one).

When constructing your model, don't forget to load in the statsmodels API:

```python
import statsmodels.api as sm

model = sm.OLS(y, X).fit()
```

In [4]:
# A:

### 5) Plot the residuals (y-true vs. y-pred) to evaluate your MLR visually.

> **Tip:** With Seaborn's `sns.lmplot`, you can set `x`, `y`, and even a `hue` (which will plot regression lines by category in different colors) to easily plot a regression line.
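
As one possible approach using plain Matplotlib rather than Seaborn, the sketch below plots true values against predicted values and computes the residuals as `y_true - y_pred`. The arrays are randomly generated stand-ins for a fitted model's output, and all variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical true and predicted values standing in for a fitted model.
rng = np.random.default_rng(1)
y_true = rng.uniform(100, 500, size=40)
y_pred = y_true + rng.normal(scale=25, size=40)

# A residual is the gap between the observed and the predicted value.
residuals = y_true - y_pred

fig, ax = plt.subplots()
ax.scatter(y_pred, y_true, alpha=0.7)

# Points on this 45-degree line are predicted perfectly.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray")
ax.set_xlabel("predicted")
ax.set_ylabel("true")
```

The vertical distance from each point to the dashed line is that observation's residual, so a fan shape or a curve around the line hints at a violated assumption.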

In [5]:
# A:

### 6) List the five assumptions for an MLR model. 

Indicate which ones are the same as the assumptions for an SLR model. 

In [6]:
# A:


### 7) Pick at least two assumptions and articulate whether or not you believe they were met for your model and why.

In [7]:
# A:

### 8) Explain what the intercept in your model means in the context of your predictor variables.

In [8]:
# A:

### 9) Generate a table showing the point estimates, standard errors, t-scores, p values, and 95 percent confidence intervals for the model you built. 

**Write a few sentences interpreting some of the output.**

> **Hint:** Scikit-learn does not have this functionality built in, but you can find it in the `summary` method of a fitted statsmodels model.

In [9]:
# A:

### 10) [Bonus] Summarize your findings.

Picture this: You work for a real estate agency. You're asked to prepare an executive summary for your very busy boss highlighting the most important findings from your MLR model. Convey these findings in no more than two paragraphs. Be sure to briefly address any potential shortcomings of your model.


In [10]:
# A: 