<a href="https://colab.research.google.com/github/barisgg/blogz/blob/master/IDS_Houses_Regression_Instructor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Linear Regression, or, How to Buy a Home 🏡

Imagine this...

You've been working for a year as a data expert and have finally saved enough money to buy a house of your own. Being a thrifty data expert, you want to get the best bang for your buck.

Imagine that you also have some data on housing prices and on the features that describe a bunch of different houses. You realize that you can use that data to make sure you get a good deal on a new home. In particular, you can estimate how much you should pay for a specific type of house. This can be especially helpful if you run into a tricky realtor!

But the question is: how can you use the data to figure out how much you should pay?

You can use Linear Regression!



**In this notebook, we'll:**
- Fetch and explore a data set of houses
- Visualize our dataset with graphs and heatmaps
- Use linear regression to make predictions
- Use multiple linear regression to make better predictions
- Try some other regression models!

Now, it's time to have fun!

## 📑 Getting the Data

In [None]:
#@title Run this to import libraries and your data! { display-mode: "form" }
# importing in different libraries that we can use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
import gdown
warnings.filterwarnings('ignore')
%matplotlib inline

gdown.download('https://drive.google.com/uc?id=1hedNcBjXyfZ_2rDuyh9d7uMOElHF9J-H', 'housing_data.csv', True)


We will use a very common data science library called Pandas to load the dataset into this notebook. Using pandas we can read our datafile (`housing_data.csv`) with the line below. Our data will then be assigned and stored under the variable `housing_data`.  


In [None]:
# read our data in using 'pd.read_csv('file')'
data_path  = 'housing_data.csv'
housing_data = pd.read_csv(data_path)

## 🖊 Exploring the Data
Now that we have the data loaded into the `housing_data` variable, let's take a look at what features we have available to us and the size of our data set.

Running the cell below will output all of the column names in our data set, and the shape of the data set.

In [None]:
# print the column names and the shape (rows, columns) of the data set
print(housing_data.columns)
print(housing_data.shape)

Wow, that's a lot of features!

In [None]:
#@title What does the data shape represent? { display-mode: "form" }

#@markdown What does the bold number (**2920**, 36) represent?
Dimension_0  = "fill in" #@param ["number of features", "price of houses","number of houses","fill in"]

#@markdown What does the bold number (2920, **36**) represent?
Dimension_1  = "fill in" #@param ["number of features", "price of houses","number of houses","fill in"]

if Dimension_0 == 'number of houses':
  print("Yes! Dimension_0 is the number of houses in our data set.")
else:
  print("Try again for Dimension_0!")

if Dimension_1 == "number of features":
  print("Yes! Dimension_1 is the number of features in our data set.")
else:
  print("Try again for Dimension_1!")



Now that we know what our features are and how big our data set is, we can look at some data! Running the cell below will output the first five rows in the data set. Each row corresponds to a specific house for sale and each column details information about that house. See if you can already spot any pieces of information that might help you find your perfect home.

In [None]:
# let's look at our 'dataframe'. Dataframes are just like Google Sheets or Excel spreadsheets.
# use the 'head' method to show the first five rows of the table along with the column names.
housing_data.head()

Here is a visual representation of the data set above. ![housecharts.png](https://i.postimg.cc/Hkdk3r9B/Screen-Shot-2021-09-29-at-6-56-18-PM.png)

There's a TON going on here. So how can we narrow down our problem?

## 💻 Let's take a look at what features are available to us!

In order to understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. This process is time-consuming, but it will give us a feel for our dataset.



In order to have some discipline in our analysis, we can create a spreadsheet with the following columns:
* <b>Variable</b> - Variable name.
* <b>Type</b> - Identification of the variables' type. There are two possible values for this field: 'numerical' or 'categorical'. By 'numerical' we mean variables for which the values are numbers, and by 'categorical' we mean variables for which the values are categories.
* <b>Expectation</b> - Our expectation about the variable's influence on 'SalePrice'. We can use a categorical scale with 'High', 'Medium' and 'Low' as possible values. We'll fill this out before doing any plotting.
* <b>Conclusion</b> - Our conclusions about the importance of the variable, after we take a quick look at the data. We can keep the same categorical scale as in 'Expectation'. We'll fill this out after taking a closer look at our data.
* <b>Comments</b> - Any general comments that occurred to us.
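
As a rough first pass at the 'Type' column, pandas can separate the columns stored as numbers from everything else. The sketch below uses a tiny made-up table; `toy_houses` and its values are illustrative and are not rows from `housing_data`.

```python
import pandas as pd

# Hypothetical mini-table; the values are made up for illustration.
toy_houses = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786],
    'CentralAir': ['Y', 'Y', 'N'],
    'OverallQual': [7, 6, 7],
})

# Columns stored as numbers are candidates for the 'numerical' type.
numerical_cols = toy_houses.select_dtypes(include='number').columns.tolist()
# Everything else is a candidate for the 'categorical' type.
categorical_cols = toy_houses.select_dtypes(exclude='number').columns.tolist()

print('numerical:', numerical_cols)
print('categorical:', categorical_cols)
```

Treat the result as a starting point only: a column stored as numbers (such as a 1-10 rating) can still behave like a category.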

To fill the 'Expectation' column, we should read the description of all the variables and, one by one, ask ourselves:

* Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about that feature?).
* If so, how important would this variable be? (e.g. What is the impact of having `Excellent` material on the exterior instead of `Poor`? And of having `Excellent` instead of `Good`?).
* Is this information already described in any other variable? (e.g. If `LandContour` gives the flatness of the property, do we really need to know the `LandSlope`?).

After this exercise, we can filter the spreadsheet and look carefully at the variables with **High** 'Expectation'. Then, we can move on to some scatter plots between those variables and `SalePrice`, filling in the 'Conclusion' column, which is just the correction of our expectations.

## [The instructor will make a copy of this spreadsheet and we'll fill it out together!](https://docs.google.com/spreadsheets/d/1LVLKdtMg9GyybMGhcJsv49UzCx1Uy67bNIm5dVocYQQ/edit?usp=sharing)



You might notice that `SalePrice` isn't listed on the sheet. That's because `SalePrice` isn't a feature, it's the value we're trying to estimate!

## First things first: analyzing `SalePrice`

The `SalePrice` column is the reason for our quest, so it's really important that we understand what's going on with it!

In [None]:
# descriptive statistics summary
housing_data['SalePrice'].describe()

It seems that the minimum housing price is larger than zero. Excellent! This is important for us to know (since a negative price could potentially mess up our model!).

Now we can run the cell below to take a look at the distribution of house prices.

In [None]:
# Using the seaborn library to plot a histogram with a smoothed density curve
# (histplot replaces the deprecated distplot)
sns.histplot(housing_data['SalePrice'], kde=True);

### Question 💡
Data science is about understanding stories by looking at data visualizations. What information can you get from the histogram above?


In [None]:
#@title What does the histogram tell you? { display-mode: "form" }

histogram_insights  = "" #@param {type:"string"}




## Feature Plotting and Visualization

### Visualizing Numerical Data

Now, we can take a closer look at some of the other features in the data set. One way to look at the data is to use a scatter plot. A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables - one plotted along the x-axis and the other plotted along the y-axis.  

A scatter plot is used to understand the relationship between two continuous (numerical) variables.

As an example, we can look at `SalePrice` vs. `TotalBsmtSF`.

In [None]:
# Each dot is a single example (row) from the dataframe, with its
# x-value as `TotalBsmtSF` and its y-value as `SalePrice`
# To use the  `scatterplot` tool from the `seaborn` plotting package... do the following:
# sns.scatterplot(x = 'feature_column', y = 'target_column', data = source_data_frame)

sns.scatterplot(x = 'TotalBsmtSF', y = 'SalePrice', data = housing_data)

#### Question 💡

In [None]:
#@title What is going on in the scatterplot above? { display-mode: "form" }

scatterplot_insights  = "" #@param {type:"string"}




### Visualizing Categorical Data

`CentralAir` is another one of our variables. It can be either `N` or `Y`. This is different from what we saw with `TotalBsmtSF` in that `CentralAir` is NOT a number.

We call variables like `CentralAir` categorical variables.

There's a specific type of plot for visualizing categorical variables. This is `catplot`. Let's try it out!

In [None]:
# you can do the same thing with categorical variables!! but you will use catplot instead
sns.catplot(x = 'CentralAir', y = 'SalePrice', data = housing_data)

#### Question 💡


In [None]:
#@title What do you take away from the plot above? { display-mode: "form" }

catplot_insights  = "" #@param {type:"string"}

Variables like `OverallQual` can be categorized either way (as numerical or categorical)!

#### Question 💡


In [None]:
#@title Justify that statement to yourself. Why does it make sense? { display-mode: "form" }

OverallQual_variable_type  = "" #@param {type:"string"}

We can take a closer look at `OverallQual` with a special type of plot called a `boxplot`. In this plot, instead of plotting every individual point, it will show you where the data is concentrated.

In [None]:
sns.boxplot(x = 'OverallQual', y = 'SalePrice', data = housing_data)

### Your turn!

Use the examples above to plot `SalePrice` vs. `GrLivArea`. What kind of relationship do these two variables have?

In [None]:
# YOUR CODE HERE


In [None]:
#@title Instructor Solution
sns.scatterplot(x = 'GrLivArea', y = 'SalePrice', data=housing_data);

#### Question 💡

In [None]:
#@title What kind of relationship do `SalePrice` and `GrLivArea` have? { display-mode: "form" }

gla_relationship  = "" #@param {type:"string"}

Use the example above to make a `boxplot` for plotting `SalePrice` VS. `YearBuilt`. What kind of relationship do these variables have?

In [None]:
f, ax = plt.subplots(figsize=(20, 8))

# YOUR CODE HERE: create the boxplot

# END

plt.xticks(rotation=90);

In [None]:
#@title Instructor Solution
f, ax = plt.subplots(figsize=(20, 8))
sns.boxplot(x = 'YearBuilt', y = 'SalePrice', data = housing_data)
plt.xticks(rotation=90);

#### Question 💡

In [None]:
#@title What kind of relationship do `SalePrice` and `YearBuilt` have? { display-mode: "form" }

yb_relationship  = "" #@param {type:"string"}

## 🌶 Variable Correlation

Until now we just followed our intuition and looked at some variables we thought might be important. Let's do a more objective analysis into our variables to find out which ones are most correlated to the `SalePrice`.

### Correlation Matrices

To explore our variables in more depth, we will start by looking at a few things:
* Overall correlation matrix (heatmap style).
* 'SalePrice' correlation matrix (zoomed heatmap style).
* Scatter plots between the most correlated variables.

#### Correlation matrix
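A correlation matrix shows how strongly each pair of numerical columns in the data set moves together: values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship. We can get the correlation matrix for a data set by running `data.corr()`, then use a seaborn `heatmap` to plot it.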
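
To see what a single entry of that matrix measures, here is a minimal sketch that computes the Pearson correlation (the pandas default for `corr()`) by hand and checks it against NumPy. The arrays `area` and `price` are made-up values, not columns from our data set.

```python
import numpy as np

# Made-up values for two columns (illustrative only).
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([110.0, 160.0, 190.0, 260.0, 300.0])

def pearson(a, b):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson(area, price)
r_numpy = np.corrcoef(area, price)[0, 1]  # same quantity, computed by NumPy
print(r_manual, r_numpy)
```

Flipping the sign of one column flips the sign of the correlation, which is why strongly anti-correlated pairs sit near -1.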

In [None]:
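# correlation matrix (computed on the numerical columns only,
# since text columns cannot be correlated)
corr_matrix = housing_data.select_dtypes(include='number').corr()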
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr_matrix, vmax=.8, square=True);

#### Questions 💡

In [None]:
#@title Which variables seem to be the most correlated with each other? { display-mode: "form" }

correlated_variables  = "" #@param {type:"string"}

In [None]:
#@title Which variables are the most anti-correlated with each other? { display-mode: "form" }

anti_correlated_variables  = "" #@param {type:"string"}

In [None]:
#@title Why are the heatmap colors reflected over the diagonal axis? { display-mode: "form" }

reflected_colors  = "" #@param {type:"string"}

In [None]:
#@title How can we tell which variables are most correlated with `SalePrice`? { display-mode: "form" }

variables  = "" #@param {type:"string"}

Even though this gives us a lot of information, there's still a lot here that we don't necessarily care about. Let's zoom in on the correlations and only look at the heatmap between `SalePrice` and the 10 most-correlated variables.

#### 'SalePrice' correlation matrix (zoomed heatmap style)
Run the cell below to see the heatmap between the top variables most correlated to `SalePrice`. If you'd like to see more or fewer variables, change the `k` variable initialized at the top.

In [None]:
k = 10 # number of variables for heatmap
cols = corr_matrix.nlargest(k, 'SalePrice')['SalePrice'].index # gets the k largest ones
new_corr_matrix = housing_data[cols].corr() # remaking corr matrix with columns we care about
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(new_corr_matrix, vmax=.8, square=True, annot=True, fmt='.2f', ax=ax);

These are the variables most correlated with `SalePrice`. Go ahead and use this information (and the heatmap from above) to <font color="#de3023">update the conclusion column on the Google Sheet!
</font>

Once you're done updating your sheet, answer the questions below.

#### Questions 💡

In [None]:
#@title Are there any variables you're surprised to see here? { display-mode: "form" }

surprise_variables  = "" #@param {type:"string"}

In [None]:
#@title Are there any variables here that are similar to each other? { display-mode: "form" }

similar_variables  = "" #@param {type:"string"}

#### Scatter plots between `SalePrice` and correlated variables

Now that we've found the variables that correlate, we can plot them all and take a look at their relationship with the `SalePrice`.

Before we do that, we want to make sure we're not repeating any information! There were three pairs of variables above that all had really high correlations with each other, and we only need to keep one variable from each pair. We should keep the variable from the pair that has the strongest correlation with `SalePrice`.
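
The rule for choosing between two "twin" variables can be sketched as below. The column names `garage_a`, `garage_b`, and `price` and all of the values are made up for illustration, and `pick_from_pair` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up data: 'garage_a' and 'garage_b' stand in for a highly correlated pair.
toy = pd.DataFrame({
    'garage_a': [1, 2, 2, 3, 3, 4],
    'garage_b': [1, 2, 3, 3, 4, 4],
    'price':    [100, 150, 155, 210, 205, 260],
})

def pick_from_pair(df, col_a, col_b, target):
    """Return whichever column has the larger absolute correlation with target."""
    corr_a = abs(df[col_a].corr(df[target]))
    corr_b = abs(df[col_b].corr(df[target]))
    return col_a if corr_a >= corr_b else col_b

kept = pick_from_pair(toy, 'garage_a', 'garage_b', 'price')
print('keep:', kept)
```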

Choose the variables to keep, then run the cells to check if you're right. If you're confused about the answer, ask your instructor!

In [None]:
keep1 = "fill in" #@param ["fill in", "GarageCars", "GarageArea"]

if keep1=="GarageCars":
  print("Yes! We want to keep GarageCars.")
else:
  print("Try again!")

In [None]:
keep2 = "fill in" #@param ["fill in", "TotalBsmtSF", "1stFloor"]

if keep2 in ("TotalBsmtSF", "1stFloor"):
  print("Yes! We can keep " + keep2 + ". We can actually choose either value here!")
else:
  print("Try again!")

In [None]:
keep3 = "fill in" #@param ["fill in", "TotRmsAbvGrd", "GrLivArea"]

if keep3=="GrLivArea":
  print("Yes! We want to keep GrLivArea.")
else:
  print("Try again!")

Once you've figured out which columns to keep out of the "twins", add the 6 columns to the list below. Run the cell to see a bunch of plots!

In [None]:
# scatterplot
sns.set()

# YOUR CODE HERE: add in the 6 other columns (keep SalePrice)
cols = ['SalePrice']

sns.pairplot(housing_data[cols], height = 2.5) # 'size' was renamed to 'height' in newer seaborn
plt.show();

In [None]:
#@title Instructor Solution
sns.set()
# YOUR CODE HERE: add in the 6 other columns (keep SalePrice)
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(housing_data[cols], height = 2.5) # 'size' was renamed to 'height' in newer seaborn
plt.show();

#### Questions 💡

In [None]:
#@title What plots here are interesting? Why? { display-mode: "form" }

interesting_plots  = "" #@param {type:"string"}

## Missing Data

Before we jump into regression, we have to take a look at what data is missing.

Important questions when thinking about missing data:

* How prevalent is the missing data?
* Is missing data random or does it have a pattern?

The answers to these questions are important for practical reasons because missing data can imply a reduction of the sample size. This can prevent us from proceeding with the analysis.

In [None]:
# missing data
total = housing_data.isnull().sum().sort_values(ascending=False)
percent = (housing_data.isnull().sum()/housing_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data = missing_data.drop(['SalePrice'])
missing_data.head(20)

Let's analyse this to understand how to handle the missing data.

As a rule of thumb, when more than 15% of a variable's values are missing, we'll delete that variable and pretend it never existed. This means that we will not try any trick to fill the missing data in these cases. According to this rule, there are a few variables that we should delete.
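
One way to apply a cutoff like this is sketched below on a made-up table. The column names and values in `toy` are hypothetical, so the result says nothing about which columns to drop from `housing_data`.

```python
import numpy as np
import pandas as pd

# Made-up table with deliberate gaps (illustrative only).
toy = pd.DataFrame({
    # 8 of 10 values missing
    'pool_quality': ['Gd', np.nan, np.nan, np.nan, np.nan,
                     'Ex', np.nan, np.nan, np.nan, np.nan],
    # 1 of 10 values missing
    'garage_cars': [2, 1, np.nan, 2, 3, 2, 1, 2, 2, 3],
    # nothing missing
    'lot_area': [8450, 9600, 11250, 9550, 14260,
                 14115, 10084, 10382, 6120, 7420],
})

threshold = 0.15  # the 15% cutoff

# isnull().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
print(missing_fraction)
print('drop:', cols_to_drop)
```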


In [None]:
# YOUR CODE HERE: put in the names of the columns we should be dropping
housing_data = housing_data.drop(columns=[ ])

# END YOUR CODE

housing_data = housing_data.replace([np.inf, -np.inf], np.nan)
housing_data = housing_data.dropna(subset=['SalePrice']) # drop rows that have no sale price
housing_data = housing_data.fillna(0) # fill the remaining null values with 0

In [None]:
#@title Instructor Solution
# YOUR CODE HERE: put in the names of the columns we should be dropping
housing_data = housing_data.drop(columns=['PoolQC', 'MiscFeature'])

housing_data = housing_data.replace([np.inf, -np.inf], np.nan)
housing_data = housing_data.dropna(subset=['SalePrice']) # drop rows that have no sale price
housing_data = housing_data.fillna(0) # fill the remaining null values with 0

## Linear Regression

Brief review: linear regression is a statistical approach for finding a relationship between an independent variable `x` and a dependent variable `y`. In the equation below, linear regression helps us find the `m` and `b` that best relate our variables. In other words, we are trying to find a "line of best fit" for our variables `x` and `y`.

$y= mx + b$

[![Line of Best Fit](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/400px-Linear_regression.svg.png)](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/400px-Linear_regression.svg.png)

Another way to say this is: we create a line that 'summarizes' the story that the data tells us.
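
For a single input, "best fit" in ordinary least squares has a closed-form answer. The sketch below computes `m` and `b` by hand on made-up points and checks them against `np.polyfit`; the arrays `x` and `y` are illustrative only.

```python
import numpy as np

# Made-up points that lie roughly on a line (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Ordinary least squares for one input:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem.
m_check, b_check = np.polyfit(x, y, 1)
print('m =', m, 'b =', b)
```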

We'll use sklearn to run our linear regression below. For now, we're just going to be looking at the relationship between `SalePrice` and `OverallQual`.

We'll split up our data set into groups called 'train' and 'test'. We teach our 'model' the patterns using the train data, but the whole point of machine learning is that our prediction should work on 'unseen' data or 'test' data.

The function below does this for you.

In [None]:
import sklearn
#sklearn.linear_model.LinearRegression.fit
from sklearn.model_selection import train_test_split
# let's pull our handy linear fitter from our 'prediction' toolbox: sklearn!
from sklearn import linear_model
import numpy as np    # Great for lists (arrays) of numbers

# Split the rows into a training set (80%) and a test set (20%).
train_data, test_data = train_test_split(housing_data[['OverallQual', 'SalePrice']], test_size = 0.2, random_state = 1)

Let's take a look and see what `train_data` and `test_data` look like!

In [None]:
print(train_data)

In [None]:
print(test_data)

In [None]:
print('Number of rows in training dataframe:', train_data.shape[0])
print('Number of rows in test dataframe:', test_data.shape[0])

Next, let's set up our model variable and train the model.

In [None]:
# set up our model
linear = linear_model.LinearRegression()
# train the model

X_train = train_data[['OverallQual']]
y_train = train_data[['SalePrice']]

linear.fit(X_train, y_train)

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

y_pred = linear.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

plt.scatter(housing_data[['OverallQual']], housing_data[['SalePrice']])
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.show()

## Interpreting our Model
Remember! We were trying to find the best `b` and `m` to capture our data's story. We can grab this from the trained model.

In [None]:
print('Our m:', linear.coef_)

`m` says: for every one-point increase in a house's quality rating, the predicted sale price goes up by `m` dollars.
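
Here is a small sketch of how the slope and intercept combine into a prediction. The numbers for `m` and `b` are assumed for illustration and are not the values from the trained model; `predict_price` is a hypothetical helper.

```python
# Hypothetical slope and intercept, NOT the values from the trained model.
m = 45000.0    # dollars per quality point (assumed)
b = -95000.0   # intercept in dollars (assumed)

def predict_price(overall_qual):
    """Apply the line y = m*x + b to one quality rating."""
    return m * overall_qual + b

price_6 = predict_price(6)
price_7 = predict_price(7)

# Moving up one quality point changes the prediction by exactly m.
print(price_6, price_7, price_7 - price_6)
```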

In [None]:
print('Our intercept b: ', linear.intercept_)

### Exercise ✍️

You go to a realtor to buy a nice home that was rated a 9. The realtor is offering to sell it for $400,000. Would you take it? If not, how much would you take it for?

In [None]:
#@title Would you take that price? { display-mode: "form" }
buy_home = '' #@param {type:"string"}

In [None]:
#@title If not, how much would you pay? { display-mode: "form" }
revised_price = '' #@param {type:"string"}

### (Optional) Exercise ✍️

Let's say you were deciding between a home that was rated with an overall quality of 6 and a different one rated with an overall quality of 8. How much cheaper should the lower quality home be by our model?

How should we interpret it? Using our `m` and `b` values, how do we find the price of the houses?

In [None]:
#@title Fill in the price difference!
price_difference = 0 #@param {type:"number"}

In [None]:
#@title How did you figure out the price difference?
strategy = '' #@param {type:"string"}

## How well did our model do?
We can measure the success of our model based on the performance of the training and testing data. If we want to evaluate the predictions on the training data, we can do it by using the `model.score(x, y)` function. Run the cell below to see how well our model does on our training data.
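
For regression models in sklearn, `score` returns R^2. A minimal sketch of that calculation on made-up numbers follows; `y_true`, `y_hat`, and `r_squared` are illustrative names, not part of the notebook's data or of sklearn.

```python
import numpy as np

# Made-up true prices and model predictions (illustrative only).
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_hat = np.array([210.0, 140.0, 280.0, 260.0])

def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    ss_res = ((actual - predicted) ** 2).sum()
    ss_tot = ((actual - actual.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

score = r_squared(y_true, y_hat)
print(score)
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always guessing the average price.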



In [None]:
training_score = linear.score(X_train, y_train)
print('Our training data on our single linear model had an R^2 of:',training_score)

Now, fill in the score below to see how it does on the testing data. Before you calculate the test score, you need to split `test_data` into `X_test` and `y_test`. If you need to look at how to do that, scroll up to the cell where we trained our model.

In [None]:
# FILL IN YOUR CODE
X_test = None
y_test = None

testing_score = None
print('Our testing data on our single linear model had an R^2 of:',testing_score)

In [None]:
#@title Instructor Solution
X_test = test_data[['OverallQual']]
y_test = test_data[['SalePrice']]

testing_score = linear.score(X_test, y_test)
print('Our testing data on our single linear model had an R^2 of: ' + str(testing_score))

## Linear Regression on Categorical Inputs

In order to try different inputs in linear regression, we will first need to learn how to convert categorical data into numerical data. Let's first learn how to do that with the `ExterQual` variable. We can easily transform `ExterQual` to a numeric variable by replacing every possible value with a number instead!

* Po -> 0
* Fa -> 1
* TA -> 2
* Gd -> 3
* Ex -> 4

By doing this, we're creating a new feature in our data set called `ExterQualNumber`.

In [None]:
housing_data['ExterQualNumber'] = housing_data['ExterQual'].replace({'Po':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4})

Now that we have `ExterQual` converted, we can now check what our dataset looks like!

In [None]:
housing_data.head()

In the very last column, there's a brand new feature!

Now we're ready to try different values for `X_train` instead of just the `OverallQual` column! If we want to try a categorical variable like `ExterQual`, remember to convert it to numerical first, as we did with `ExterQualNumber`!

In [None]:
#Initializing our X, y variables

X_column = 'ExterQualNumber' # Feel free to try different inputs!
X = housing_data[[X_column]]
y = housing_data[['SalePrice']]

train_data, test_data = train_test_split(housing_data[[X_column, 'SalePrice']], test_size = 0.2, random_state = 1)
X_train = train_data[[X_column]]
y_train = train_data[['SalePrice']]

# Setting up model
cat_linear = linear_model.LinearRegression()

# Training
cat_linear.fit(X_train,y_train)

# Score
print("Score:", cat_linear.score(test_data[[X_column]], test_data[['SalePrice']]))

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

y_pred = cat_linear.predict(X_train)
plt.plot(X_train, y_pred, color='red')

plt.scatter(X, y)
plt.xlabel(X_column) # Automated setup of axis labels
plt.ylabel('SalePrice')
plt.show()

## Multiple Linear Regression: Using multiple inputs

Now that we've tried single linear regression with different inputs, we'll now try to make our model better by using multiple input variables, like `GarageCars` and `TotalBsmtSF`.


In [None]:
# YOUR CODE HERE: add in more features!
X_cols = ['ExterQualNumber', 'OverallQual']

X = housing_data[X_cols]
y = housing_data[['SalePrice']]

train_data, test_data = train_test_split(housing_data[X_cols + ['SalePrice']], test_size = 0.2, random_state = 1)
X_train = train_data[X_cols]
y_train = train_data[['SalePrice']]

# set up our model
multiple = linear_model.LinearRegression(fit_intercept = True) # 'normalize' was removed from newer sklearn versions

# train the model
multiple.fit(X_train,y_train)

# Score
print("Score:", multiple.score(test_data[X_cols], test_data[['SalePrice']]))

In real life, you wouldn't buy a house based on a single variable, even if that variable was assessing the overall quality of the home. You would take into account a lot of different variables like our multiple linear model did!

Now that you've seen a basic multiple linear regression, **add some more features**! Take a look at your heatmap from earlier and decide which features would be the best to add.

Try adding another categorical feature too!

### Question 💡

Which set of features got you the best performance? What score did that model get?

In [None]:
#@title Which set of features got the best performance?
best_features = '' #@param {type:"string"}

In [None]:
#@title What score did the model get?
multiple_model_score = 0 #@param {type:"number"}

## 🔥 (Optional) Non-Linear Models
In class, we *also* talked about nonlinear models. Let's try using a polynomial regression model!

We've implemented the code below for a polynomial fit and set it to be degree 1. Run the cell and the visualization, see what it looks like, then change the degree and run it again! Try to find the polynomial degree with the highest R^2 score.
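
To see what changing the degree does, the sketch below shows the extra columns that `PolynomialFeatures` generates for two made-up ratings. The `ratings` array is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up quality ratings, one per row.
ratings = np.array([[2.0], [3.0]])

# degree=2 turns each x into the columns [1, x, x^2].
expanded = PolynomialFeatures(degree=2).fit_transform(ratings)
print(expanded)
```

The linear step of the pipeline then fits one coefficient per generated column, so the model is still linear in its coefficients even though the fitted curve is not a straight line.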

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# YOUR CODE HERE: change the degree to test out different fits
X_cols = ['OverallQual']
train_data, test_data = train_test_split(housing_data[X_cols + ['SalePrice']], test_size = 0.2, random_state = 1)
X_train = train_data[X_cols]
y_train = train_data[['SalePrice']]

poly_lin_model = Pipeline([('poly', PolynomialFeatures(degree=1)),
                  ('linear', LinearRegression())])

poly_lin_model.fit(X_train, y_train)

# initialize our X_test
X_test = test_data[X_cols]
y_test = test_data[['SalePrice']]
testing_score = poly_lin_model.score(X_test, y_test)
print('Our testing data on our polynomial linear model had an R^2 of:', testing_score)

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

df = pd.DataFrame({'OverallQual': np.arange(0, 10, 0.1)}) # same column name the model was trained on
y_pred = poly_lin_model.predict(df)

plt.plot(np.arange(0, 10, 0.1), y_pred, color='red')

plt.scatter(housing_data[['OverallQual']], housing_data[['SalePrice']])
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.show()

### Question 💡

Which polynomial degree got the best performance? What score did it get?

In [None]:
#@title Which degree worked best?
best_degree = 0 #@param {type:"number"}

In [None]:
#@title What score did it get?
pmodel_score = 0 #@param {type:"number"}

## 🔥 (Optional) More Robust Models
Let's try to use some more robust models to see if we can get a higher score!

### Huber
Fill in the code below to try using a model with Huber loss.

In [None]:
from sklearn.linear_model import HuberRegressor
huber_model = HuberRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!

# YOUR CODE HERE: write code to print out the huber model score using X_test and y_test!



In [None]:
#@title Instructor Solution
from sklearn.linear_model import HuberRegressor
huber_model = HuberRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!
huber_model.fit(X_train, y_train)
# YOUR CODE HERE: write code to print out the huber model score using X_test and y_test!

testing_score = huber_model.score(X_test, y_test)
print('Our testing data on our huber model had an R^2 of:',testing_score)

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

y_pred = huber_model.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

plt.scatter(housing_data[['OverallQual']], housing_data[['SalePrice']])
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.show()

When you create a `HuberRegressor` without specifying this value (remember, you can choose a value that decides how lenient the Huber loss calculation will be), it defaults to 1.35. scikit-learn calls this parameter `epsilon`; we'll refer to it as delta below. Let's try a couple of different values and see how our score changes.
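
To make "lenient" concrete, here is a minimal sketch of the textbook Huber loss. It is meant to show the idea only: scikit-learn's `HuberRegressor` rescales the residuals before applying its threshold, so this is not its exact objective. The function name `huber_loss` and the sample residuals are made up for illustration.

```python
import numpy as np

def huber_loss(residual, delta=1.35):
    """Textbook Huber loss: quadratic for small residuals, linear for large ones."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 5.0])
huber_values = huber_loss(residuals)
squared_values = 0.5 * residuals ** 2  # plain squared loss, for comparison

print(huber_values)
print(squared_values)
```

The large residual is penalized far less under the Huber loss than under the squared loss, which is what makes the fit less sensitive to outliers. A larger delta treats more points quadratically, moving the behavior closer to ordinary least squares.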

You can create a `HuberRegressor` model by passing in the value as an argument. For example, if you wanted to use a value of 1.5, you would do it like:


```
huber_model = HuberRegressor(epsilon=1.5)
```
In the cell below, write some code that loops through the delta values and for each one:
* creates the model
* fits the model to the training data
* prints out the model score on the testing data along with the delta value


In [None]:
delta_values = [1, 1.1, 1.2, 1.3, 1.4]
# YOUR CODE HERE


In [None]:
#@title Instructor Solution
delta_values = [1, 1.1, 1.2, 1.3, 1.4]
# YOUR CODE HERE:
for d in delta_values:
  huber_model = HuberRegressor(epsilon=d)
  huber_model.fit(X_train, y_train)
  testing_score = huber_model.score(X_test, y_test)
  print('Delta:', d, ', Score:', testing_score)


### RANSAC
Fill in the code below to try using a RANSAC model.

In [None]:
from sklearn.linear_model import RANSACRegressor
ransac_model = RANSACRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!

# YOUR CODE HERE: write code to print out the RANSAC model score using X_test and y_test!



In [None]:
#@title Instructor Solution
from sklearn.linear_model import RANSACRegressor
ransac_model = RANSACRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!
ransac_model.fit(X_train, y_train)
# YOUR CODE HERE: write code to print out the RANSAC model score using X_test and y_test!

testing_score = ransac_model.score(X_test, y_test)
print('Our testing data on our RANSAC model had an R^2 of:',testing_score)

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

y_pred = ransac_model.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

plt.scatter(housing_data[['OverallQual']], housing_data[['SalePrice']])
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.show()

Something interesting we can do with RANSAC is see which points it marked as inliers and which were outliers. Let's change the graph to show that information.

In [None]:
#@title Visualize the fit with outliers!
import matplotlib.pyplot as plt

y_pred = ransac_model.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

inlier_mask = ransac_model.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

plt.scatter(X_train[inlier_mask], y_train[inlier_mask], color='yellowgreen', marker='.',
            label='Inliers')
plt.scatter(X_train[outlier_mask], y_train[outlier_mask], color='gold', marker='.',
            label='Outliers')
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.legend() # show which color marks inliers and which marks outliers
plt.show()

Wow! It seems like RANSAC marked a lot of our data as outliers. Let's try running this again, but passing in an optional argument that sets how large a residual has to be before a point is considered an outlier.
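
The sketch below illustrates the thresholding idea on made-up numbers: a point counts as an inlier when its absolute residual from a candidate line is within the threshold. The arrays and the threshold of 20 are assumed for illustration. When no threshold is passed, scikit-learn's documentation says it falls back to the median absolute deviation (MAD) of the target values, which can be small compared with the spread of house prices.

```python
import numpy as np

# Made-up targets and predictions from one candidate line (illustrative only).
y_true = np.array([100.0, 120.0, 140.0, 160.0, 400.0])
y_line = np.array([105.0, 118.0, 139.0, 163.0, 180.0])

# Points whose absolute residual is within the threshold count as inliers.
residual_threshold = 20.0  # assumed value, in the same units as y
abs_residuals = np.abs(y_true - y_line)
inlier_mask = abs_residuals <= residual_threshold

# Median absolute deviation of the targets: the documented default threshold.
mad = np.median(np.abs(y_true - np.median(y_true)))

print(inlier_mask)
print('MAD of y:', mad)
```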

In [None]:
from sklearn.linear_model import RANSACRegressor

# YOUR CODE HERE: set a threshold value
threshold = None

ransac_model = RANSACRegressor(residual_threshold=threshold)

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!

# YOUR CODE HERE: write code to print out the RANSAC model score using X_test and y_test!



In [None]:
#@title Instructor Solution
from sklearn.linear_model import RANSACRegressor

threshold = 60000
ransac_model = RANSACRegressor(residual_threshold=threshold)

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!
ransac_model.fit(X_train, y_train)
# YOUR CODE HERE: write code to print out the RANSAC model score using X_test and y_test!

testing_score = ransac_model.score(X_test, y_test)
print('Our testing data on our RANSAC model had an R^2 of:',testing_score)

In [None]:
#@title Re-visualize the fit with the outlier threshold!
import matplotlib.pyplot as plt

y_pred = ransac_model.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

inlier_mask = ransac_model.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

plt.scatter(X_train[inlier_mask], y_train[inlier_mask], color='yellowgreen', marker='.',
            label='Inliers')
plt.scatter(X_train[outlier_mask], y_train[outlier_mask], color='gold', marker='.',
            label='Outliers')
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.legend() # show which color marks inliers and which marks outliers
plt.show()

### Theil-Sen
Fill in the code below to try using a Theil-Sen model.

In [None]:
from sklearn.linear_model import TheilSenRegressor
theilsen_model = TheilSenRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!

# YOUR CODE HERE: write code to print out the Theil-Sen model score using X_test and y_test!



In [None]:
#@title Instructor Solution
from sklearn.linear_model import TheilSenRegressor
theilsen_model = TheilSenRegressor()

# YOUR CODE HERE: use .fit to fit your model using X_train and y_train!
theilsen_model.fit(X_train, y_train)
# YOUR CODE HERE: write code to print out the Theil-Sen model score using X_test and y_test!

testing_score = theilsen_model.score(X_test, y_test)
print('Our testing data on our Theil-Sen model had an R^2 of:',testing_score)

In [None]:
#@title Visualize the fit with this cell!
import matplotlib.pyplot as plt

y_pred = theilsen_model.predict(housing_data[['OverallQual']])
plt.plot(housing_data[['OverallQual']], y_pred, color='red')

plt.scatter(housing_data[['OverallQual']], housing_data[['SalePrice']])
plt.xlabel('OverallQual') # set the labels of the x and y axes
plt.ylabel('SalePrice')
plt.show()

### Question 💡

Which of these models worked the best? Were you able to tell if there was a big difference?

In [None]:
#@title Which model worked best?
best_model = '' #@param {type:"string"}

In [None]:
#@title Was there a large difference between models?
strategy = '' #@param {type:"string"}

# That's it for this notebook!
If you didn't get all the way through it in class, no worries! Solutions will be posted on Teachable for you to check out, and you're welcome to keep working on the notebook.