In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("ProjPart2.ipynb")

# Project - Part 2: Predicting Housing Prices in Cook County

## Due Date: Thursday, April 24th, 11:59 PM MT on Gradescope

## NO LATE SUBMISSIONS will be accepted - you must plan accordingly.

## Collaboration Policy

Data science is a collaborative activity.  However, a key step in learning and retention is **creating solutions on your own.**  

Below are examples of acceptable vs unacceptable use of resources and collaboration when doing the Project assignments in CSCI 3022.


The following would be some **examples of cheating** when working on the Project in CSCI 3022.  Any of these constitute a **violation of the course's collaboration policy and will result in an F in the course and a trip to the honor council**.   


 - Consulting web pages that may have a solution to a given homework problem or one similar is cheating.  However, consulting the class notes, and web pages that explain the material taught in class but do NOT show a solution to the homework problem in question are permissible to view.  Clearly, there's a fuzzy line here between a valid use of resources and cheating. To avoid this line, one should merely consult the course notes, the course textbook, and references that contain syntax and/or formulas.
 - Copying a segment of code or math solution of three lines or more from another student from a printout, handwritten copy, or by looking at their computer screen 
 - Allowing another student to copy a segment of your code or math solution of three lines or more
 - Taking a copy of another student's work (or a solution found online) and then editing that copy
 - Reading someone else’s solution to a problem on the Project before writing your own.
 - Asking someone to write all or part of a program or solution for you.
 - Asking someone else for the code necessary to fix the error for you, other than for simple syntactical errors
 


On the other hand, the following are some **examples of things which would NOT usually be considered to be cheating**:
 - Working on a Project problem on your own first and then discussing with a classmate a particular part in the problem solution where you are stuck.  After clarifying any questions you should then continue to write your solution independently.
 - Asking someone (or searching online) how a particular construct in the language works.
 - Asking someone (or searching online) how to formulate a particular construct in the language.
 - Asking someone for help in finding an error in your program.  
 - Asking someone why a particular construct does not work as you expected in a given program.
   

To test whether you are truly doing your own work and retaining what you've learned you should be able to easily reproduce from scratch and explain a Project solution that was your own when asked in office hours by a TA/Instructor or on a quiz/exam.   


If you have difficulty in formulating the general solution to a problem on your own, or
you have difficulty in translating that general solution into a program, it is advisable to see
your instructor or teaching assistant rather than another student as this situation can easily
lead to a, possibly inadvertent, cheating situation.

We are here to help!  Visit HW Hours and/or post questions on Piazza!


## Introduction

In Part 1 of this project, you performed some basic exploratory data analysis (EDA), laying out the thought process that leads to certain modeling decisions. Then, you added a few new features to the dataset, cleaning the data as well in the process.

In Part 2 of the project, you will specify and fit a linear model to a few features of the housing data to predict housing prices. Next, we will analyze the error of the model and brainstorm ways to improve the model's performance. Finally, we'll delve deeper into the implications of predictive modeling within the Cook County Assessor's Office (CCAO) case study, especially because statistical modeling is how the CCAO valuates properties. Given the history of racial discrimination in housing policy and property taxation in Cook County, consider the impacts of your modeling results as you work through this assignment - and think about what fairness might mean to property owners in Cook County.

After this part of the project, you should be comfortable with:
- Implementing a data processing pipeline using `pandas`
- Using `scikit-learn` to build and fit linear models

## Score Breakdown

Question | Manual | Points
----|----|----
1abd | Yes | 5
1c | No | 1
2acd | No| 4
2b | Yes | 3
3 | No | 2
4 | No | 10
5 | No | 7
6 | Combo | 14
7 | Yes | 4
Total | | 50
Extra Credit| Yes| Up to +20

In [None]:
import hashlib

import numpy as np

import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model as lm
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from ds100_utils import run_linear_regression_test

def get_hash(num):
    return hashlib.md5(str(num).encode()).hexdigest()

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

Let's load the training and test data.

In [None]:
# RUN THIS  - DON'T COMMENT THIS OUT - it is needed for the autograder.

with zipfile.ZipFile('data/cook_county_data.zip') as item:
    item.extractall(path='data')

This dataset is split into a training/validation set, and a test set. Importantly, the test set does not contain values for our target variable, Sale Price. In this project, you will train a model on the training and validation sets and then use this model to predict the Sale Prices of the test set. In the cell below, we load the training and validation sets into the DataFrame `tr_val_data` and the test set into the DataFrame `test_data`.

In [None]:
tr_val_data = pd.read_csv("data/cook_county_train_val.csv", index_col='Unnamed: 0')
test_data = pd.read_csv("data/cook_county_contest_test.csv", index_col='Unnamed: 0')

In [None]:
len(test_data)

As a good sanity check, we should at least verify that the data shape matches the description.

In [None]:
# 204792 observations and 62 features in training data
assert tr_val_data.shape == (204792, 62)
# 55311 observations and 61 features in test data
assert test_data.shape == (55311, 61)
# Sale Price is provided in the training/validation data
assert 'Sale Price' in tr_val_data.columns.values
# Sale Price is hidden in the test data
assert 'Sale Price' not in test_data.columns.values

Let's remind ourselves of the data available to us in the Cook County dataset. Remember, a more detailed description of each variable is included in `data/codebook.txt` (which is in the same directory as this notebook). **If you did not attempt Project Part 1,** you should take some time to familiarize yourself with the codebook before moving forward.

In [None]:
tr_val_data.columns.values

<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 1: Human Context and Ethics

In this part of the project, we will explore the human context of our housing dataset.

**You should read the [Project_CaseStudy.pdf](https://canvas.colorado.edu/courses/117881/files/77796404?module_item_id=6056298) on Canvas explaining the context and history surrounding this dataset before attempting this section.**

<br>

--- 

<!-- BEGIN QUESTION -->

### Question 1a
In this project we are essentially trying to answer the question: "How much is a house worth?" 
 - Who might be interested in an answer to this question? **Please list at least three different parties (people or organizations) and then describe whether each one has an interest in seeing the housing price to be either high or low.**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1b

 - 1bi).  Which of the following scenarios strike you as unfair and why? You can choose more than one. There is no single right answer, but you must explain your reasoning.
 - 1bii). Would you consider some of these scenarios more (or less) fair than others? Why?

Scenario A: A homeowner whose home is assessed at a higher price than it would sell for.  
Scenario B: A homeowner whose home is assessed at a lower price than it would sell for.  
Scenario C: An assessment process that systematically overvalues inexpensive properties and undervalues expensive properties.  
Scenario D: An assessment process that systematically undervalues inexpensive properties and overvalues expensive properties.


Write your full answers to both parts in the cell below:

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Question 1c

Consider a model that is fit to $n = 50$ training observations. We denote the response as $y$ (Log Sale Price), the prediction as $\hat{y}$, and the corresponding residual to be $y - \hat{y}$.   

Typically when we make a residual plot, we plot the residuals vs the predictions $\hat{y}$.  

However, in the plots below, we plot the residuals vs the actual price of homes in the data $y$ (Log Sale Price), so that we can see for which home prices the model is generally overvaluing vs undervaluing. 



Which plot below corresponds to a model that might make property assessments that result in regressive taxation? (Refer to the [Project_CaseStudy.pdf](https://canvas.colorado.edu/courses/117881/files/77796404?module_item_id=6056298) for a reminder of the definition of regressive taxation).  Assume that all three plots use the same vertical scale and that the horizontal line marks $y - \hat{y} = 0$. Assign `q1c` to the string letter corresponding to your plot choice.

**Hint:** When a model overvalues a property (predicts a `Sale Price` greater than the actual `Sale Price`), what are the relative sizes of $y$ and $\hat{y}$? What about when a model undervalues a property?

**Graded Via Hidden Test in Gradescope** Since this is a multiple choice question, the in-notebook check for this problem only checks that you entered a valid string; it does **NOT** check the correctness of your answer. Correctness will be graded when you submit to Gradescope.  

<img src='img/res_plots.png' width="900px" />


In [None]:
q1c = ...

In [None]:
grader.check("q1c")

## The CCAO Dataset

You'll work with the dataset from the Cook County Assessor's Office (CCAO) in Illinois. This government institution determines property taxes across most of Chicago's metropolitan areas and nearby suburbs. In the United States, all property owners must pay property taxes, which are then used to fund public services, including education, road maintenance, and sanitation. These property tax assessments are based on property values estimated using statistical models considering multiple factors, such as real estate value and construction cost.

This system, however, is not without flaws. In late 2017, a lawsuit was filed against the office of Cook County Assessor Joseph Berrios for producing "[racially discriminatory assessments and taxes](https://www.chicagotribune.com/politics/ct-cook-county-board-assessor-berrios-met-20170718-story.html)." The lawsuit included claims that the assessor's office undervalued high-priced homes and overvalued low-priced homes, creating a visible divide along racial lines: Wealthy homeowners, who were typically white, paid less in property taxes, whereas [working-class, non-white homeowners paid more](https://www.chicagotribune.com/news/breaking/ct-cook-county-assessor-berrios-sued-met-20171214-story.html).

The Chicago Tribune's four-part series, "[The Tax Divide](https://www.chicagotribune.com/investigations/ct-tax-divide-investigation-20180425-storygallery.html)", delves into how this was uncovered: After "compiling and analyzing more than 100 million property tax records from the years 2003 through 2015, along with thousands of pages of documents, then vetting the findings with top experts in the field," they discovered that "residential assessments had been so far off the mark for so many years." You can read more about their investigation  [in this news article](https://apps.chicagotribune.com/news/watchdog/cook-county-property-tax-divide/assessments.html).

**You should read the [Project Case Study.pdf](https://canvas.colorado.edu/courses/117881/files/77796404?module_item_id=6056298)  explaining the history about this dataset before answering the following question.**

<!-- BEGIN QUESTION -->

### Question 1d

 - 1di).  What were the central problems with the earlier property tax system in Cook County as reported by the Chicago Tribune?
 - 1dii). What were the primary causes of these problems? (Note: in addition to reading the paragraphs above, you will need to **read the [Project Case Study.pdf](https://canvas.colorado.edu/courses/117881/files/77796404?module_item_id=6056298) explaining the context and history of this dataset before answering this question**.)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 2: More EDA

<br>

The good news is that you have already done a lot of EDA with this dataset in Project Part 1. 

Before fitting any model, we should check for any missing data, duplicate data and/or unusual outliers.

We know from Project Part 1 that the granularity of this dataset is one row per sale: each row represents data from the sale of a specific property in Cook County between 2013 and 2019.

### Question 2a: More EDA


We'll start by checking to make sure that there aren't any duplicate rows (i.e. rows in which every entry is exactly the same).   We'll consider any duplicate rows to be a data entry error, as each row should represent a unique sale. 


As an example, let's say one sale is duplicated three times (i.e. 3 duplicate rows for that specific sale) and a different sale is duplicated 5 times.  In that scenario we would say there are 2 unique property sales that have duplicates, and we would need to remove a total of (5+3-2=6) extra rows of duplicate data.  


How many unique property sales in the `tr_val_data` have exact duplicates and what is the total number of duplicate rows we should remove? 

Assign your answers to `count_duplicate_properties` and `count_duplicate_rows_to_remove` below.

(Again, so in the toy example above, count_duplicate_properties=2 and count_duplicate_rows_to_remove=6).  

HINT:  The pandas `DataFrame.duplicated()` method may be useful here:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
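
To see how `duplicated()` behaves before applying it to the real data, here is a minimal sketch on a made-up DataFrame. The column names `pin` and `price` and all values are hypothetical and are not taken from the Cook County data. It only illustrates what the `keep` parameter changes; turning these masks into the two counts is left to you.

```python
import pandas as pd

# Hypothetical toy data: the first sale appears 3 times, the last sale appears twice.
toy = pd.DataFrame({
    "pin":   [101, 101, 101, 202, 303, 303],
    "price": [250, 250, 250, 400, 175, 175],
})

# Default keep="first": marks every repeat AFTER the first occurrence of a row.
extra_copies = toy.duplicated()

# keep=False: marks EVERY row that has at least one exact copy, including the first.
all_copies = toy.duplicated(keep=False)

print(extra_copies.tolist())  # [False, True, True, False, False, True]
print(all_copies.tolist())    # [True, True, True, False, True, True]
```

Note that both calls compare entire rows by default; the `subset` argument restricts the comparison to particular columns, which is not what this question asks for.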



In [None]:

count_duplicate_properties = ...

count_duplicate_rows_to_remove = ...

print("There are", count_duplicate_properties, "unique property sales with exact duplicates.")
print("There are a total of", count_duplicate_rows_to_remove, "duplicate rows that we'll need to remove when we write our cleaning function below.")

In [None]:
grader.check("q2a")

We will create a function in part 2d to clean the data (and remove these duplicates), but first we will look for any other unusual outliers in the data that we will want to remove as well.

<!-- BEGIN QUESTION -->


### Question 2b: 

Since we're trying to predict `Sale Price`, next we'll look for missing or unusual outliers in that field.

Examine the `Sale Price` column in the `tr_val_data` DataFrame and answer the following questions:


 - 2bi).  Does the `Sale Price` data have any missing, N/A, negative or 0 values for the data?  If so, propose a way to handle this.

 - 2bii).  Does the `Sale Price` data have any unusually large outlier values?  If so, propose a cutoff to use for throwing out large outliers, and justify your reasoning.  

 - 2biii).  Does the `Sale Price` data have any unusually small outlier values?  If so, propose a cutoff to use for throwing out small outliers, and justify your reasoning.  
 
 
Below are three cells.  The first is a Markdown cell for you to write up your responses to all 3 parts above.
The other two are code cells that are available for you to write code to explore the outliers and/or visualize the Sale Price data.

### Question 2b i, ii, iii answer cell:   *Type your responses to all three parts in this cell...*

In [None]:
...
# your code exploring Sale Price above this line

In [None]:
...
# optional extra cell for exploring code

<!-- END QUESTION -->

**Pure Market Filter**

As you (hopefully) noticed, there are quite a few small values for the Sale Price of a home that don't make sense.  This can happen when someone sells a house to a relative for $\$1$ or some other price that is not reflective of the true market value.  There are also several extremely large outliers (houses that sold for more than \$10 million) that don't accurately capture the true market value of a home.

It turns out, there's actually an indicator feature already available in the dataset to help filter out any sale transactions that aren't considered "Pure Market Transactions"  (for example, when someone sells a house to a relative for $\$1$, we don't consider that a transaction driven by the true market value of the house).



### Question 2c

To understand the cutoffs used by this filter, determine the max and min Sale Price values for the subset of the training/validation data (`tr_val_data`) with the indicator `Pure Market Filter` = 1.



In [None]:
max_Sale_Price_filtered = ...

min_Sale_Price_filtered = ...

print("When considering only pure market sales, the max Sale Price of properties in the data is $", max_Sale_Price_filtered)
print("and the min Sale Price is $", min_Sale_Price_filtered)

In [None]:
grader.check("q2c")

### Question 2d

Create a function `clean_data` that takes in a dataframe of property sales `data` and cleans the data as follows:

 - Removes duplicate rows (for example, instead of 3 rows of duplicate data for a unique property sale, we would only keep 1 row with that information),
 - Filters out outliers in Sale Price by only keeping rows with "Pure Market Filter" = 1


In [None]:

def clean_data(data):

    '''
    Cleans the data DataFrame by removing duplicate rows and keeping only rows with 'Pure Market Filter' = 1

    Args:
        data (DataFrame):  DataFrame to clean
        
    Return:
        Cleaned DataFrame
    '''
    
    da = data.copy()

    ...
    # Do NOT reset the index of the cleaned data.
    
    return da




In [None]:
grader.check("q2d")

<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 3: Cross Validation

In this project we are going to create and compare models to predict the Sale Price of properties in Cook County.  

If we used all the available data to fit and compare our models, we would not have a way to estimate model performance on **unseen data** such as the test set in `cook_county_contest_test.csv`.

We'll start by using Simple Cross Validation to fit and evaluate our models.  This involves taking the `tr_val_clean` data and actually splitting it into a training and validation set.  

We will use the training set to fit each model's parameters and the validation set to evaluate how well each model will likely perform on unseen data drawn from the same distribution. 


In the cell below, complete the function `train_val_split` that splits an input DataFrame `data` into two smaller DataFrames named `train` and `validation`, where `validation` contains the **first** 20% of the rows in the input DataFrame and `train` contains the remaining 80% of the data.  Do not shuffle the input DataFrame inside the function. You should not be importing any additional libraries for this question.

(If the cutoff for the first 20% of the data is not an exact integer, round down to the nearest integer).
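
As a reminder of the mechanics involved, the sketch below shows positional slicing with `.iloc` on a made-up 7-row DataFrame whose index labels are deliberately out of order (as they would be after shuffling). The frame, the column name `value`, and the variable names are all hypothetical; this is an illustration of rounding down and slicing by position, not the required implementation.

```python
import pandas as pd

# Hypothetical shuffled toy data: index labels are not 0..n-1 in order.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]},
                   index=[5, 2, 9, 0, 7, 3, 1])

# 20% of 7 rows is 1.4, which int() truncates to 1.
cutoff = int(0.2 * len(toy))

# .iloc slices by position, so the original index labels are carried along untouched.
head_part = toy.iloc[:cutoff]
tail_part = toy.iloc[cutoff:]

print(cutoff)                     # 1
print(head_part.index.tolist())   # [5]
print(tail_part.index.tolist())   # [2, 9, 0, 7, 3, 1]
```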




In [None]:
def train_val_split(data):
    """ 
    Takes in a DataFrame `data` and splits it into two smaller DataFrames 
    named `validation` and `train` where validation is the first 20% of the rows and train 
    is the last 80% of the rows, respectively. 
    If the cutoff for the first 20% of the data is not an exact integer, round down to the nearest integer.
    Do not shuffle or re-index the data DataFrame.  
    """
    da = data.copy()
    
    
    ...
    
    validation = ...
    train = ...
   
    
    return train, validation



# To randomize the validation and training sets, we will shuffle the data once before 
# running it through the train_val split function
# Do not change the random_state seed in this code - it will ensure reproducibility so 
# you can pass the in-notebook test cases
tr_val_data_shuffled = tr_val_data.sample(frac=1, random_state=18)



# Clean the shuffled data
tr_val_clean = clean_data(tr_val_data_shuffled) 

# Create the train/val split on the cleaned, shuffled data:
tr, val = train_val_split(tr_val_clean)



In [None]:
grader.check("q3")

<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 4: Fitting a Simple Linear Regression Model

In Part 1 of the project, you plotted the log-transformed Sale Price vs the log-transformed total area covered by the building (in square feet)  and saw there was a positive linear association.  Let's start the modeling process by fitting a simple linear regression model using this predictor.  

Our first model will take the form:

$$
\text{Log Sale Price} = \theta_0 + \theta_1 \cdot (\text{Log Building Square Feet})
$$






<br>

--- 
## Modeling Step 1:  Feature Transformation

<br>



## Create a pipeline to process the data

It is time to prepare the training and validation data for the model we proposed above. 

In Project Part 1, you wrote a few functions that added features to the dataset. Instead of calling them manually one by one each time, it is best practice to encapsulate all of this feature engineering into one "pipeline" function. Defining and using a pipeline reduces all the feature engineering to just one function call and ensures that the same transformations are applied to all data.  
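
To make the idea of a pipeline function concrete, here is a minimal sketch that chains two small helper functions with `DataFrame.pipe`. Everything in it is hypothetical: the helpers `add_log_column` and `select_columns`, the function `toy_pipeline`, and the columns `Lot Size`, `Rooms`, and `Notes` are made up for illustration and are not part of the project code. The point is only that one function call applies the identical sequence of transformations to any DataFrame you pass in.

```python
import numpy as np
import pandas as pd

def add_log_column(df, column):
    """Return a copy of df with a new column 'Log <column>'."""
    out = df.copy()
    out["Log " + column] = np.log(out[column])
    return out

def select_columns(df, *columns):
    """Return only the requested columns, in the given order."""
    return df.loc[:, list(columns)]

def toy_pipeline(df):
    """Apply every feature-engineering step in one call."""
    return (
        df.pipe(add_log_column, "Lot Size")
          .pipe(select_columns, "Log Lot Size", "Rooms")
    )

# Hypothetical toy training and validation data.
toy_train = pd.DataFrame({"Lot Size": [1000.0, 2500.0], "Rooms": [3, 5], "Notes": ["a", "b"]})
toy_valid = pd.DataFrame({"Lot Size": [1800.0], "Rooms": [4], "Notes": ["c"]})

# The same transformations are applied to both sets.
processed_train = toy_pipeline(toy_train)
processed_valid = toy_pipeline(toy_valid)
print(processed_train.columns.tolist())  # ['Log Lot Size', 'Rooms']
```

Because each helper works on a copy, the input DataFrames are left unchanged, which makes it safe to run the pipeline more than once on the same data.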


### Question 4a:


For an example of how to work with pipelines, you will complete the missing code in the `process_data_m1` function in the cell below. 


In particular, the cell below completes the following steps:

  1. Creates a function `process_data_m1` to perform the following feature engineering:  
     - Applies log transformations to the `Sale Price` and the `Building Square Feet` columns to create two new columns, `Log Sale Price` and `Log Building Square Feet`.
     - Outputs a DataFrame with only the columns used in model 1 (that is `Log Sale Price` , `Log Building Square Feet`)
 
 2. The code in the cell then runs `process_data_m1` separately on the training data and then the validation data.  It then creates the design matrix $\mathbb{X}$ and the observed vector $\mathbb{Y}$ for both the training data and the validation data (and saves them in the variable names `X_train_m1`, `y_train_m1`, `X_valid_m1`, `y_valid_m1`). Note that $\mathbb{Y}$ refers to the transformed `Log Sale Price`, not the original `Sale Price`. **X  should be a `pandas` DataFrame and the observed Y vector should be a `pandas` Series.**


Fill in the missing code in the cell below:


In [None]:


def process_data_m1(df):
    """ 
    Takes in a DataFrame of cleaned data and performs feature engineering to use for Model 1.

    Outputs a DataFrame with only the features and response/output used in model 1 (that is `Log Sale Price` , `Log Building Square Feet`)
 
    """
    
    data=df.copy()
    
    # Add a column "Log Sale Price" to the `data` DataFrame:

    ...
    
    # Add a column "Log Building Square Feet" to the `data` DataFrame:

    ...
    
    # Select the feature and the output/response used in model 1:
    
    data = data[['Log Building Square Feet', 'Log Sale Price']]
    
    return data



# Process both the training and validation data: 

processed_train_m1 = process_data_m1(tr)

processed_val_m1 = process_data_m1(val)


# Create X (dataframe) and Y (series) to use to train the model:
X_train_m1 = processed_train_m1.drop(columns = "Log Sale Price")
y_train_m1 = processed_train_m1["Log Sale Price"]


# Create X (dataframe) and Y (series) to use to validate the model:
X_valid_m1 = processed_val_m1.drop(columns = "Log Sale Price")
y_valid_m1 = processed_val_m1["Log Sale Price"]

# Take a look at the results
print("Training Data: X")
display(X_train_m1.head())
print("Training Data: y")
display(y_train_m1.head())


print("Validation Data: X")
display(X_valid_m1.head())
print("Validation Data: y")
display(y_valid_m1.head())


In [None]:
grader.check("q4a")

## Modeling Step 2:  Create a linear model

Next we'll use `scikit-learn` to train the model.



### Question 4b


We first initialize a [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object for our model. 


Fill in the missing code below to fit the model using the training set.  Then output the model's predictions for both the training and validation set.  

In [None]:
linear_model_m1 = lm.LinearRegression()

# Fit the model using the processed training data:

...


# Compute the predicted y values from linear model 1 (in units log sale price) 
# using the training data as input:

y_predict_train_m1 = ...

# Compute the predicted y values from linear model 1 (in units log(sale price))
# using the validation data as input:

y_predict_valid_m1 = ...

In [None]:
grader.check("q4b")

<br>


## Modeling Step 3:  Model Evaluation Using RMSE


We'll compare the performance of our models using the Root Mean Squared Error (RMSE) function.

$$RMSE = \sqrt{\dfrac{\sum_{\text{houses in the set}}(\text{actual price for house} - \text{predicted price for house})^2}{\text{number of houses}}}$$
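
As a small worked example with made-up numbers (not taken from the dataset), suppose a set contains three houses whose actual prices are 200, 300, and 500 (in thousands of dollars) and whose predicted prices are 203, 296, and 500. The errors (actual minus predicted) are then $-3$, $4$, and $0$, so

$$RMSE = \sqrt{\dfrac{(-3)^2 + 4^2 + 0^2}{3}} = \sqrt{\dfrac{25}{3}} \approx 2.89$$

The result is in the same units as the prices, so here the RMSE is roughly 2.89 thousand dollars.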


### QUESTION 4c:

Complete the code below for the function `rmse`:

In [None]:
def rmse(predicted, actual):
    """
    Calculates RMSE from actual and predicted values
    Input:
      predicted (1D array): vector of predicted/fitted values
      actual (1D array): vector of actual values
    Output:
      a float, the root-mean square error
    """
    
    ...

In [None]:
grader.check("q4c")

### Keeping track of all the models.

In this notebook (and in life) we will want to keep track of all our models. 
For this part of the project you will be creating 3 different versions of the model.

In [None]:
# Just run this cell to create arrays to store the RMSE information from the models

model_names=["M1: log(bsqft)", "M2", "M3"]

# Create arrays where we can keep track of training and validation RMSE for each model

training_error_log = np.zeros(4)
validation_error_log = np.zeros(4)

training_error = np.zeros(4)
validation_error = np.zeros(4)

# Array to track the average cross-validation RMSE for each model

cv_error = np.zeros(4)


### QUESTION 4d:



In the cell below use your `rmse` function to calculate the training error and validation error for model 1.

Assign the RMSE of the predicted log sale prices and the actual log sale prices to the following variables: 

 `training_error_log[0]`  and    `validation_error_log[0]`


Since the target variable we are working with is log-transformed, it can also be beneficial to transform it back to its original form so we will have more context on how our model is performing when compared to actual housing prices.  In other words we want the RMSE **with regard to `Sale Price`**. Remember to exponentiate your predictions and response vectors before computing the RMSE using the `rmse` function and assign it to the following:

`training_error[0]` and    `validation_error[0]`
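
To see why the two versions of the error can tell different stories, here is a rough sketch with made-up prices (all names and values are hypothetical). Every toy prediction is off by exactly 0.1 in log units, yet the dollar errors grow with the price of the home because exponentiating turns an additive log error into a multiplicative one.

```python
import numpy as np

# Hypothetical actual sale prices, and toy predictions that miss by 0.1 in log units.
toy_log_actual = np.log(np.array([100_000.0, 250_000.0, 1_000_000.0]))
toy_log_predicted = toy_log_actual + np.array([0.1, -0.1, 0.1])

# Error measured in log units.
log_rmse = np.sqrt(np.mean((toy_log_predicted - toy_log_actual) ** 2))

# Error measured in dollars: exponentiate both vectors first, then compare.
dollar_errors = np.exp(toy_log_predicted) - np.exp(toy_log_actual)
dollar_rmse = np.sqrt(np.mean(dollar_errors ** 2))

print(log_rmse)       # approximately 0.1
print(dollar_errors)  # the largest error belongs to the most expensive home
print(dollar_rmse)
```

This is one reason a dollar-scale RMSE can be dominated by a handful of expensive properties even when the log-scale fit looks uniform.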



In [None]:
# Training and validation RMSE for the model (in units log sale price)

training_error_log[0] = ...
validation_error_log[0]= ...


# Training and validation RMSE for the model (in its original dollar values before the log transform)

training_error[0] = ...
validation_error[0] = ...

print("1st Model \nTraining RMSE: $ {}\nValidation RMSE: $ {}\n".format(training_error[0], validation_error[0]))


In [None]:
grader.check("q4d")

## Modeling Step 4: Cross Validation

To check that the validation RMSE is representative of the dataset we'll also perform a 5-fold cross validation on the model.

Scikit-learn has built-in support for cross-validation. 

Run the cell below to see how sklearn's `KFold` object breaks up the data into 5 folds. For each fold it provides the positional indices (purely integer-location based indices, i.e. what you would use with `.iloc`) of the training and validation rows.


In [None]:
# Run this cell and read through the output to understand how kf.split returns the positional indices for each split:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5) 

i = 1

a = []

for train_idx, valid_idx in kf.split(tr_val_clean):
    print ("positional (iloc) indices for training data for fold", i)
    print (train_idx)
    print ("positional (iloc) indices for validation data for fold", i)
    print (valid_idx)
    i = i+1
   




### Question 4e:

To better understand how cross-validation works, complete the following function which cross-validates a given model.

Use sklearn's `KFold.split` function ([documentation](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.KFold.html)) to get 5 splits of the data passed in. Note that `split` returns the positional indices of the data for each split.

For each split:
 - Select the training and validation rows and columns based on the split indices and features.
 - Fit the model on the training split.
 - Compute the RMSE on the validation split (in units Sale Price, NOT log Sale Price).

After looping over all splits, return the average RMSE across all cross-validation splits.
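
Because the data was shuffled, its index labels no longer match row positions, which is why the positional indices from `split` must be used with `.iloc` rather than `.loc`. The sketch below illustrates this on a made-up 4-row DataFrame with 2 folds; the frame, its index labels, and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical shuffled toy data: index labels do not match row positions.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, index=[42, 7, 19, 3])

kf = KFold(n_splits=2)  # no shuffling, so folds are contiguous blocks of positions

folds = []
for train_pos, valid_pos in kf.split(toy):
    # train_pos and valid_pos are positions (0, 1, 2, ...), so select with .iloc.
    folds.append((toy.iloc[train_pos], toy.iloc[valid_pos]))
    print(valid_pos.tolist(), toy.iloc[valid_pos].index.tolist())

# Printed output:
# [0, 1] [42, 7]
# [2, 3] [19, 3]
```

Using `toy.loc[valid_pos]` here would look up the labels 0 and 1, which do not exist in this index, and would raise a `KeyError`.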

Fill in the missing code below:

In [None]:
from sklearn.model_selection import KFold
from sklearn.base import clone

def cross_validate_rmse(model, X, y):
    '''
    Split the X and y data into 5 subsets.
    For each subset, 
        - Fit a model holding out that subset.
        - Compute the RMSE (in units dollars, not log(dollars)) on that subset (the validation set).
    You should be fitting 5 models in total.
    Return the average RMSE of these 5 folds.

    Args:
        model: An sklearn model with fit and predict functions. 
        X (DataFrame):  DataFrame of training/val data, whose columns are the features to use in model (i.e. that have already been processed through the model pipeline) 
        y (Series): Series of training/val data whose values are the response/output variable that has been processed through the model pipeline
    
    Return:
        The average validation RMSE for the 5 splits.
    '''
    # Make a copy of the model to use in this function
    model = clone(model)

    # Initialize sklearn's KFold object 
    kf = KFold(n_splits=5)  

    # Create a list to store the validation_rmse for each fold
    validation_rmse = []
    
    for train_idx, valid_idx in kf.split(X):
       
        # Use the provided train_idx and valid_idx to split the data for each fold:
        # Recall, train_idx and valid_idx are purely integer-location based indexing for selection by position.
        
        split_X_train, split_X_valid = ...
        split_Y_train, split_Y_valid = ...

        # Fit the model on the training split:
        ...
        
        # Compute the RMSE (in units dollars, not log(dollars)) on the validation split:
        
        error = ...
    

        validation_rmse.append(error)
        

    # Compute the average validation RMSE across all cross-validation splits:

    cv_error = ...
              
        
    return cv_error
       
    
# Create a new model to use for cross validation of m1 
linear_model_m1_cv = lm.LinearRegression()


# Process the `tr_val_clean` DataFrame using the function `process_data_m1`
processed_full_m1 = ...

# Split the processed_full_m1 DataFrame into a DataFrame X and a Series y to use in the cross_validate_rmse function.
X_full_m1 = ...
y_full_m1 = ...

# Call the `cross_validate_rmse` function you wrote above to calculate the cross_validation RMSE for model 1:
cv_error_m1  = ...

# Save the cross validation error for model 1 to compare with other models.
cv_error[0] = cv_error_m1

print("1st Model Cross Validation RMSE: {}".format(cv_error[0]))

In [None]:
grader.check("q4e")

## Modeling Step 5: Visualizations

## Visualizing RMSE

In [None]:
# Just run this cell.  It creates a visualization of the RMSE for Model 1

import plotly.graph_objects as go

fig = go.Figure([
go.Bar(x = model_names, y = training_error, name="Training RMSE"),
go.Bar(x = model_names, y = validation_error, name="Validation RMSE"),
go.Bar(x = model_names, y = cv_error, name="Cross Val RMSE")
])

fig.update_yaxes(range=[180000,260000], title="RMSE")

fig

Notice that our cross-validation RMSE is pretty high given that it's in the units of dollars and measures our error when predicting sale prices of a house.  We will want to improve this model!
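To see why the units matter, here is a minimal sketch on made-up numbers. The prices and the helper name `rmse_sketch` are illustrative assumptions, not values or functions from this project. It computes the RMSE once on the log scale and once after exponentiating both the predictions and the actual values back to dollars.

```python
import math

def rmse_sketch(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices in dollars and hypothetical log-scale predictions
actual_prices = [150_000.0, 300_000.0, 900_000.0]
actual_log = [math.log(p) for p in actual_prices]
predicted_log = [math.log(180_000.0), math.log(270_000.0), math.log(600_000.0)]

# RMSE in log(dollars): hard to interpret as a dollar amount
rmse_log = rmse_sketch(predicted_log, actual_log)

# RMSE in dollars: exponentiate both sides first, then compute the error
predicted_prices = [math.exp(v) for v in predicted_log]
rmse_dollars = rmse_sketch(predicted_prices, actual_prices)

print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"RMSE (dollars):   {rmse_dollars:,.0f}")
```

In this toy example the single large miss on the most expensive house dominates the dollar-scale RMSE, which is one reason the dollar figure can look large even when the log-scale error looks modest.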

<br>

--- 

## Visualizing Residual Plots

Another way of understanding a model's performance (and appropriateness) is through a plot of the residuals versus the observations.  We will use the validation data to create these plots.

In the cells below, use [`plt.scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) to plot
2 side-by-side residual plots:

 - The first plot should be of the residuals from predicting `Log Sale Price` using the model versus  the **predicted** `Log Sale Price` for the **validation data**. 
 - The second plot should be the residuals from predicting `Log Sale Price` using the model versus the **actual** `Log Sale Price` for the **validation data**. 

We will keep the residuals in terms of units of log to make it easier to spot trends.

With such a large dataset, it is difficult to avoid overplotting entirely. We set the dot size and opacity in the scatter plot to reduce the impact of overplotting as much as possible.
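As a small illustration of the sign convention used on the y-axis of both plots (residual = actual minus predicted), the sketch below uses made-up log prices. The values and the helper `describe` are hypothetical and are only meant to show how the two sets of plotted points are paired.

```python
# Hypothetical log sale prices and model predictions (illustrative values only)
actual_log = [11.5, 12.0, 12.5, 13.0]
predicted_log = [11.3, 12.1, 12.2, 13.4]

# Residual = actual - predicted, matching the y-axis label used in the plots
residuals = [a - p for a, p in zip(actual_log, predicted_log)]

def describe(residual):
    """Say whether the model under- or overestimated, given one residual."""
    if residual > 0:
        return "underestimated"
    if residual < 0:
        return "overestimated"
    return "exact"

# Both plots share the same y-values; only the x-values differ
points_vs_predicted = list(zip(predicted_log, residuals))
points_vs_actual = list(zip(actual_log, residuals))

for a, p, r in zip(actual_log, predicted_log, residuals):
    print(f"actual={a:.1f} predicted={p:.1f} residual={r:+.2f} ({describe(r)})")
```

A positive residual means the actual log price was higher than the prediction, and a negative residual means it was lower.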

## QUESTION 4f:  Complete the code below to plot the residual plots

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15, 5))


x_plt1 = ...
y_plt1 = ...

x_plt2 = ...
y_plt2 = ...



ax[0].scatter(x_plt1, y_plt1, alpha=.25)
ax[0].axhline(0, c='black', linewidth=1)
ax[0].set_xlabel(r'Predicted Log(Sale Price)')
ax[0].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[0].set_title("Model 1 Val Data: Residuals vs. Predicted Log(Sale Price)")

ax[1].scatter(x_plt2, y_plt2, alpha=.25)
ax[1].axhline(0, c='black', linewidth=1)
ax[1].set_xlabel(r'Log(Sale Price)')
ax[1].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[1].set_title("Model 1 Val Data: Residuals vs. Log(Sale Price)")

In [None]:
grader.check("q4f")

**NOTE:** In the first plot, the lower part of the point cloud appears to be cut off along an angled line. This is an artifact of filtering the data to rows where "Pure Market Filter" = 1; it is not a "pattern" in the residuals that we should try to address.

<br>


--- 

### Question 4g

Based on the structure you see in your residual plots, does this model seem like it will correspond to _regressive_, _fair_, or _progressive_ taxation?

Assign the string "regressive", "fair" or "progressive" to `q4g` in the cell below accordingly.

**Hidden test in Gradescope**:  Since this is a question with only 3 possible answers, the in-notebook test will only check if you have the correct format for your answer, it won't check if your actual answer is correct. 

In [None]:
q4g = ...

In [None]:
grader.check("q4g")

<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 5:  Adding a New Feature


While our simple model explains some of the variability in price, there is certainly still a lot of room for improvement to be made -- one reason is we have been only utilizing 1 feature (out of a total of 60+) so far! 

### Choosing Candidate Predictors to Add to Model



To see if additional variables might be helpful, we can plot the residuals from the fitted model against a variable that is not in the model. If we see patterns, that indicates we might want to include this additional feature or a transformation of it. 
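The idea can be sketched without any plotting: group the residuals by the candidate variable and compare a summary statistic, such as the median, across groups. The category labels and residual values below are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (category, residual) pairs; labels and values are made up
rows = [
    ("A", -0.30), ("A", -0.20), ("A", -0.25),
    ("B", 0.02), ("B", -0.01), ("B", 0.00),
    ("C", 0.40), ("C", 0.35),
]

residuals_by_group = defaultdict(list)
for category, residual in rows:
    residuals_by_group[category].append(residual)

# A median far from 0 suggests the model is systematically off for that group
group_medians = {g: median(v) for g, v in residuals_by_group.items()}
for group, med in sorted(group_medians.items()):
    n = len(residuals_by_group[group])
    print(f"{group}: median residual = {med:+.3f} (n={n})")
```

Printing the group size next to each median is a deliberate choice: a large shift in a very small group may matter less than it first appears.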

In Project Part 1, you conducted feature transformation to create several other features related to the Sale Price including `Bedrooms` and `Roof Material`.
Let's examine plots of the residuals from Model 1 vs each of these features.

We have automatically imported staff implementations of the functions you wrote in Project 1 (these are stored in `feature_func.py`).  You are welcome to copy over your own implementations from Project 1 if you'd prefer. 

These functions are:

 - `add_total_bedrooms`
 - `find_expensive_neighborhoods`
 - `add_in_expensive_neighborhood`
 - `ohe_roof_material`
 - `remove_outliers`


In [None]:
# Just run this cell - it creates the columns of the 2 additional features 
# we're interested in considering to add to the model 
# and appends the residual data from Model 1, so we can easily visualize

from feature_func import *


def process_data_candidates(df):
    
    data = df.copy()
    
    data["Log Sale Price"] = np.log(data["Sale Price"])
    
    # Create Log Building Square Feet column
    data["Log Building Square Feet"] = np.log(data["Building Square Feet"])
    
    
    # Create Bedrooms
    data = add_total_bedrooms(data)
     
   
    # Update Roof Material feature with names
    data = substitute_roof_material(data)

    
    # Select columns for comparing residuals
    data = data[['Log Building Square Feet',  'Roof Material', 'Bedrooms', 'Log Sale Price']]

    return data


#Since our residuals are using the validation data, we will just examine these new features on the validation dataset
    
valid_comp = process_data_candidates(val)
    
valid_comp = valid_comp.assign(M1residuals_log=y_valid_m1 - y_predict_valid_m1)


In [None]:
# Run this cell to compare residuals with Bedrooms

import plotly.express as px

px.box(valid_comp, x='Bedrooms', y='M1residuals_log')


Notice, with the exception of the outlier (the properties with 10 bedrooms), the medians of each boxplot align pretty close to 0 on the y-axis (meaning there is no major trend in prediction errors by Number of Bedrooms).

This means we do NOT expect that adding the feature `Bedrooms` will improve our original model. 

What about Roof Material?


In [None]:
# Run this cell to compare residuals vs Roof Material

px.box(valid_comp, x='Roof Material', y='M1residuals_log')

The plot above shows us that the distribution of errors appears to change slightly based on Roof Material. Ideally, the median of each  box plot lines up with 0 on the y-axis (meaning there was no difference in prediction by Roof Material type). Instead, we see some variation from 0 for all except Shingle/Asphalt.   These patterns suggest that we may want to try including Roof Material in the model.


## Question 5a:  Model 2

Let's add `Roof Material` as a predictor in our model.  We will transform the column to use the Roof Material names instead of the number codes (like you did in Project Part 1).   In other words, let's consider a model of the form:

Model 2: 
$$\text{Log Sale Price} =  \theta_1(\text{Log Building Square Feet})  +\theta_2 (\text{Shingle/Asphalt}) $$

$$+ \theta_3 (\text{Tar\&Gravel}) + \theta_4  (\text{Tile})+ \theta_5 (\text{Shake})+  \theta_6(\text{Other})+\theta_7(\text{Slate})$$




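**Note:** This will require one-hot encoding Roof Material.  Because every row has exactly one roof material, the one-hot-encoded columns sum to 1 in each row and together play the role of an intercept, so we don't need to include an extra intercept term in the model. 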
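The point about the intercept can be checked directly on a toy example. In the sketch below, the category names come from the Model 2 equation, while the helper `one_hot_sketch` and the sample values are hypothetical and are not the staff implementation in `feature_func.py`.

```python
# Roof material names taken from the Model 2 equation
categories = ["Shingle/Asphalt", "Tar&Gravel", "Tile", "Shake", "Other", "Slate"]

def one_hot_sketch(value, categories):
    """Return a list of 0/1 indicators, one per category (illustrative helper)."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical sample of roof materials
sample = ["Tile", "Shingle/Asphalt", "Slate", "Shingle/Asphalt"]
encoded = [one_hot_sketch(v, categories) for v in sample]

# Every row sums to exactly 1, so the indicator columns added together
# reproduce the all-ones column that an intercept term would use
row_sums = [sum(row) for row in encoded]
print(row_sums)
```

Since the indicator columns already add up to a column of ones, a separate intercept column would be redundant with them.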

In the cells below fill in the code to create a function `process_data_m2` to apply feature transformations to the features we'll use in Model 2

#### Modeling Step 1:  Process the Data


In [None]:
# Modeling Step 1:  Process the Data

# Hint: You can either use your implementation of the 
# One Hot Encoding Function from Project Part 1, or use the staff's implementation 
# imported from the feature_func.py file:

from feature_func import *

...
# Optional:  Define any helper functions you need for one-hot encoding above this line


def process_data_m2(df):

    """ 
    Takes in a DataFrame of cleaned data and performs feature engineering to use for Model 2.
    Includes creating a Log Sale Price column, a Log Building Square Feet Column, 
    and one-hot encoding roof materials to use in the model.
    Once you have one-hot encoded the roof materials, you should drop the 
    original (not encoded) column `Roof Material` as it will not be used in the model.

    Outputs a DataFrame with only the features and response/output used in model 2.  
 
    """

    data = df.copy()

    ...
    
    return data

    

# Use the same `tr` and `val` datasets from Question 3 (otherwise the validation errors aren't comparable), 
# Don't resplit the data.  

# Process the data for Model 2
processed_train_m2 = ...

processed_val_m2 = ...


# Create X (dataframe) and Y (series) to use in the model
X_train_m2 = ...
y_train_m2 = ...

X_valid_m2 = ...
y_valid_m2 = ...


# Take a look at the result
display(X_train_m2.head())
display(y_train_m2.head())

display(X_valid_m2.head())
display(y_valid_m2.head())


In [None]:
grader.check("q5ai")

#### Modeling STEP 2:  Create a Multiple Linear Regression Model

In [None]:
# Modeling STEP 2:  Create and Fit a Multiple Linear Regression Model

...
# your code above this line to create and fit regression model for Model 2

y_predict_train_m2 = ...

y_predict_valid_m2 = ...




In [None]:
grader.check("q5aii")

#### MODELING STEP 3:  Evaluate the RMSE for your model

In [None]:
# MODELING STEP 3:  Evaluate the RMSE for your model


# Training and validation errors for the model (in units of dollars, not log(dollars))
training_error[1] = ...
validation_error[1] = ...



print("2nd Model \nTraining RMSE: $ {}\nValidation RMSE: $ {}\n".format(training_error[1], validation_error[1]))


In [None]:
grader.check("q5aiii")

#### MODELING STEP 4:  Conduct 5-fold cross validation for model and output CV RMSE

In [None]:
# MODELING STEP 4:  Conduct 5-fold cross validation for model and output CV RMSE

# Create a new model to use for cross validation of m2 
linear_model_m2_cv = lm.LinearRegression()


# Process the entire cleaned training_val dataset using the m2 pipeline
processed_full_m2 = ...

# Split the processed_full_m2 Dataset into X and Y to use in models.
X_full_m2 = ...
y_full_m2 = ...


# Run cross_validate_rmse function:
cv_error_m2  = ...

# Save the cross validation error for model 2 in our list to compare different models:

cv_error[1] = cv_error_m2

print("2nd Model Cross Validation RMSE: {}".format(cv_error[1]))







In [None]:
grader.check("q5aiv")

#### MODELING STEP 5:  Just run this cell to Plot bar graph comparing RMSEs of Model 2 and Model 1 and side-by-side residuals


In [None]:
# MODELING STEP 5:  Just run this cell to Plot bar graph comparing RMSEs of Model 2 and Model 1 and side-by-side residuals

model_names[1] = "M2: log(bsqft)+Roof"

fig = go.Figure([
go.Bar(x = model_names, y = training_error, name="Training RMSE"),
go.Bar(x = model_names, y = validation_error, name="Validation RMSE"),
go.Bar(x = model_names, y = cv_error, name="Cross Val RMSE")
])

fig.update_yaxes(range=[180000,260000], title="RMSE")

fig


In [None]:
# MODELING STEP 5 cont'd:  Plot 2 side-by-side residual plots (similar to Question 4f, for validation data)

fig, ax = plt.subplots(1,2, figsize=(15, 5))


x_plt1 = ...
y_plt1 = ...

x_plt2 = ...
y_plt2 = ...


ax[0].scatter(x_plt1, y_plt1, alpha=.25)
ax[0].axhline(0, c='black', linewidth=1)
ax[0].set_xlabel(r'Predicted Log(Sale Price)')
ax[0].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[0].set_title("Model 2 Val Data: Residuals vs. Predicted Log(Sale Price)")

ax[1].scatter(x_plt2, y_plt2, alpha=.25)
ax[1].axhline(0, c='black', linewidth=1)
ax[1].set_xlabel(r'Log(Sale Price)')
ax[1].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[1].set_title("Model 2 Val Data: Residuals vs. Log(Sale Price)")


In [None]:
grader.check("q5av")

### Question 5b


We only see a slight decrease in the RMSE with this 2nd model, and our residuals look nearly the same as Model 1, even though the boxplots of Roof Material vs the residuals of Model 1 had indicated it might be a useful feature to add to the model.  

What went wrong?
  
Although there was variation in the boxplots, we didn't check the number of data points in each Roof Material category, which affects how useful the feature will be in reducing the RMSE.  

To see this, group the `valid_comp` data by Roof Material Type and calculate the proportion of data in each category.  

Set the variable `val_data_prop_roof_type` equal to a pandas `Series` whose index is the Roof Material name and whose values are the proportion of validation data with that roof type.

(for example `val_data_prop_roof_type["Shingle/Asphalt"]` should return a float that is the proportion of data points with that type of roof)
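To see why category sizes matter, note that the overall mean squared error is the proportion-weighted average of the per-category mean squared errors. The sketch below demonstrates this on made-up data. The category labels and error values are hypothetical, and it uses plain Python rather than the pandas operations you will need for `val_data_prop_roof_type`.

```python
from collections import Counter

# Hypothetical (roof type, squared error) pairs; values are illustrative only
rows = [("Common", 0.04)] * 18 + [("Rare", 0.50)] * 2

counts = Counter(roof for roof, _ in rows)
total = sum(counts.values())
proportions = {roof: n / total for roof, n in counts.items()}

# Mean squared error within each category
group_mse = {
    roof: sum(err for r, err in rows if r == roof) / counts[roof]
    for roof in counts
}

# Overall MSE equals the proportion-weighted average of the per-category MSEs,
# so a category holding 10% of the rows enters the total with weight 0.10
overall_mse = sum(err for _, err in rows) / total
weighted_mse = sum(proportions[r] * group_mse[r] for r in counts)

print(proportions)
print(overall_mse, weighted_mse)
```

Even a large improvement within a rare category therefore moves the overall error only a little.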

In [None]:
val_data_prop_roof_type = ...

val_data_prop_roof_type

In [None]:
grader.check("q5b")


<br/>
<hr style="border: 1px solid #fdb515;" />

## Question 6:  Improving the Model



<!-- BEGIN QUESTION -->

### Question 6a:  Choose an additional feature

It's your turn to choose another feature to add to the model.  Choose one new **quantitative** (not qualitative) feature and create Model 3 incorporating this feature (along with the features we've already chosen in Model 2).    Try to choose a feature that will have a large impact on reducing the RMSE and/or will improve your residual plots.  This can be a raw feature available in the dataset, or a transformation of one of the features in the dataset, or a new feature that you create from the dataset (see Project 1 for ideas).    

Note:  There is not one single right answer as to which feature to add, however **to receive credit on this question you should make sure the feature decreases the Cross Validation RMSE compared to Model 2 (i.e. we want to improve the model, not make it worse!)** 


In the cell below, explain what additional feature you have chosen and why.  Justify your reasoning.  There are optional code cells provided below for you to use when exploring the dataset to determine which feature to add. 
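One possible way to screen quantitative candidates, offered only as a sketch, is to check how strongly each one is linearly related to the residuals of the current model. The helper `pearson_r`, the candidate names, and all values below are made up for illustration. A strong correlation does not guarantee a lower cross-validation RMSE, so you should still confirm by fitting the model.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical residuals from the current model and two made-up candidates
residuals = [-0.4, -0.1, 0.0, 0.2, 0.5]
candidates = {
    "candidate_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # moves with the residuals
    "candidate_b": [3.0, 1.0, 4.0, 1.0, 3.0],  # little linear relationship
}

scores = {name: pearson_r(values, residuals) for name, values in candidates.items()}
for name, r in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.2f}")
```

Correlation only captures linear relationships, so a scatter plot of the residuals against each candidate remains a useful complement.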

This problem will be graded based on your reasoning and explanation of the feature you choose, and then on your implementation of incorporating the feature.   

**NOTE** Please don't add additional coding cells below or the Autograder will have issues.  You do not need to use all the coding cells provided.   

### Question 6a Answer Cell:   
In this cell, explain what **QUANTITATIVE** feature  you chose and why (for this problem you will **NOT** receive credit if you choose a feature whose conceptual variable type is Qualitative, so double check before continuing). 

In [None]:
...

# Show work in this cell exploring data to determine which feature to add

In [None]:
...

# Optional code cell for additional work exploring data/ explaining which feature you chose.

In [None]:
...

# Optional code cell for additional work exploring data/ explaining which feature you chose.

In [None]:
...

# Optional code cell for additional work exploring data/ explaining which feature you chose.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6b:  Create Model 3

In the cells below fill in the code to create and analyze Model 3 (follow the Modeling steps outlined above).

PLEASE DO NOT ADD ANY ADDITIONAL CELLS IN THIS PROBLEM OR IT MIGHT MAKE THE AUTOGRADER FAIL

In [None]:
# Modeling Step 1:  Process the Data

# Hint: You can either use your implementation of the One Hot Encoding Function 
#from Project Part 1, or use the staff's implementation

from feature_func import *

...
# Optional:  Define any helper functions you need for one-hot encoding above this line


def process_data_m3(df):
    
    data = df.copy()
        
    ...
    
    return data

    

# Process the data for Model 3 (using the same tr and val datasets we created in Question 3) 
processed_train_m3 = process_data_m3(tr) 

processed_val_m3 = process_data_m3(val) 

# Create X (Dataframe) and y (series) to use to train the model
X_train_m3 = ...
y_train_m3 = ...

X_valid_m3 = ...
y_valid_m3 = ...


# Take a look at the result
display(X_train_m3.head())
display(y_train_m3.head())

display(X_valid_m3.head())
display(y_valid_m3.head())


In [None]:
# Modeling STEP 2:  Create and Fit a Multiple Linear Regression Model



...
# your code above this line to create and fit regression model for Model 3

y_predict_train_m3 = ...

y_predict_valid_m3 = ...




In [None]:
# MODELING STEP 3:  Evaluate the RMSE for your model


# Training and validation errors for the model (in units dollars, not log(dollars))

training_error[2] = ...
validation_error[2] = ...


(print("3rd Model \nTraining RMSE: $ {}\nValidation RMSE: $ {}\n"
       .format(training_error[2], validation_error[2]))
)


In [None]:

# MODELING STEP 4:  Conduct 5-fold cross validation for model and output RMSE

linear_model_m3_cv = lm.LinearRegression()


# Process the entire cleaned training_val dataset using the m3 pipeline
processed_full_m3 = ...

# Split the processed_full_m3 Dataset into X and y to use in models.
X_full_m3 = ...
y_full_m3 = ...


# Run cross_validate_rmse function:
cv_error_m3  = ...

# Save the cross validation error for model 3 in our list to compare different models:

cv_error[2] = cv_error_m3

print("3rd Model Cross Validation RMSE: {}".format(cv_error[2]))




In [None]:
# MODELING STEP 5:  Add a name for your 3rd model describing the features 
# and run this cell to plot a bar graph of all 3 models

model_names[2] = ...


fig = go.Figure([
go.Bar(x = model_names, y = training_error, name="Training RMSE"),
go.Bar(x = model_names, y = validation_error, name="Validation RMSE"),
go.Bar(x = model_names, y = cv_error, name="Cross Val RMSE")
])

fig.update_yaxes(range=[180000,260000], title="RMSE")

fig


In [None]:
# MODELING STEP 5 cont'd:  Plot 2 side-by-side residual plots 
# (similar to Question 4f, for validation data)

fig, ax = plt.subplots(1,2, figsize=(15, 5))


x_plt1 = ...
y_plt1 = ...

x_plt2 = ...
y_plt2 = ...


ax[0].scatter(x_plt1, y_plt1, alpha=.25)
ax[0].axhline(0, c='black', linewidth=1)
ax[0].set_xlabel(r'Predicted Log(Sale Price)')
ax[0].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[0].set_title("Model 3 Val Data: Residuals vs. Predicted Log(Sale Price)")

ax[1].scatter(x_plt2, y_plt2, alpha=.25)
ax[1].axhline(0, c='black', linewidth=1)
ax[1].set_xlabel(r'Log(Sale Price)')
ax[1].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[1].set_title("Model 3 Val Data: Residuals vs. Log(Sale Price)")


In [None]:
grader.check("q6b")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6c

 - 6ci).  Comment on your RMSE and residual plots from Model 3 compared to the first 2 models.  

 - 6cii).  Are the residuals of your model still showing a trend that overestimates lower priced houses and underestimates higher priced houses?   If so, how could you try to address this in the next round of modeling?

 - 6ciii).  If you had more time to improve your model, what would your next steps be?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr style="border: 1px solid #fdb515;" />

## Question 7: Evaluating the Model in Context

<br>

<!-- BEGIN QUESTION -->

---
## Question 7a

When evaluating your model, we used RMSE. In the context of estimating the value of houses, what does the residual mean for an individual homeowner? How does it affect them in terms of property taxes? Discuss the case where the residual is positive and the case where it is negative separately.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

In the case of the Cook County Assessor’s Office, Chief Data Officer Rob Ross states that fair property tax rates are contingent on whether property values are assessed accurately - that they’re valued at what they’re worth, relative to properties with similar characteristics. This implies that having a more accurate model results in fairer assessments. The goal of the property assessment process for the CCAO, then, is to be as accurate as possible. 

When the use of algorithms and statistical modeling has real-world consequences, we often refer to the idea of fairness as a measurement of how socially responsible our work is. Fairness is incredibly multifaceted: Is a fair model one that minimizes loss - one that generates accurate results? Is it one that utilizes "unbiased" data? Or is fairness a broader goal that takes historical contexts into account?

These approaches to fairness are not mutually exclusive. If we look beyond error functions and technical measures of accuracy, we'd not only consider _individual_ cases of fairness, but also what fairness - and justice - means to marginalized communities on a broader scale. We'd ask: What does it mean when homes in predominantly Black and Hispanic communities in Cook County are consistently overvalued, resulting in proportionally higher property taxes? When the white neighborhoods in Cook County are consistently undervalued, resulting in proportionally lower property taxes? 

Having "accurate" predictions doesn't necessarily address larger historical trends and inequities, and fairness in property assessments and taxes extends beyond the CCAO's valuation model. Disassociating accurate predictions from a fair system is vital to approaching justice at multiple levels. Take Evanston, IL - a suburb in Cook County - as an example of housing equity beyond just improving a property valuation model: Their City Council members [recently approved reparations for African American residents](https://www.usnews.com/news/health-news/articles/2021-03-23/chicago-suburb-approves-government-reparations-for-black-residents).

<!-- BEGIN QUESTION -->

<br>

---

## Question 7b

Reflecting back on your exploration in Questions 5 and 6a, in your own words, what makes a model's predictions of property values for tax assessment purposes "fair"? 

This question is open-ended and part of your answer may depend upon your specific model; we are looking for thoughtfulness and engagement with the material, not correctness. 

**Hint:** Some guiding questions to reflect on as you answer the question above: What is the relationship between RMSE, accuracy, and fairness as you have defined it? Is a model with a low RMSE necessarily accurate? Is a model with a low RMSE necessarily "fair"? Is there any difference between your answers to the previous two questions? And if so, why?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr style="border: 1px solid #fdb515;" />

## Extra Credit:  How Low Can You Go?   Create Your Own Model and Check RMSE on the Test Data

<br>

---



For **extra credit**, you can create your own model to try to improve the RMSE and residual plots even further. 

The tables below provide scoring guidelines for the extra credit opportunity in this problem. 
If your RMSE lies in a particular range, you will receive the number of points associated with that range.



### Extra Credit Grading Scheme

**Important**: To avoid memory issues we will only be using simple cross validation, not 5-fold cross validation in this extra credit part.  While your Validation RMSE can be checked at any time in this notebook, your Test RMSE can only be checked once by submitting your model’s predictions to Gradescope. The thresholds are as follows:

Extra Credit Points | +10 | +8 | +6 | +4 | +2
--- | --- | --- | --- | --- | ---
Validation RMSE | Less than 200k | [200k, 210k) | [210k, 220k) | [220k, 230k)  | [230k, 235k)

Extra Credit Points | +10 | +8 | +6 | +4 | +2
--- | --- | --- | --- | --- | ---
Test RMSE | Less than 200k | [200k, 210k) | [210k, 220k) | [220k, 230k)| [230k, 235k)
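
The thresholds in the tables above are half-open intervals, so an RMSE of exactly 200k falls in the +8 column rather than the +10 column. A minimal sketch of that lookup follows. The function name `extra_credit_points` is hypothetical, and returning 0 for an RMSE of 235k or more is an assumption, since the tables do not list that case.

```python
def extra_credit_points(rmse_dollars):
    """Map an RMSE in dollars to extra credit points, per the tables above."""
    # Each entry is (exclusive upper bound in dollars, points awarded)
    thresholds = [
        (200_000, 10),
        (210_000, 8),
        (220_000, 6),
        (230_000, 4),
        (235_000, 2),
    ]
    for upper_bound, points in thresholds:
        if rmse_dollars < upper_bound:
            return points
    return 0  # assumption: the tables list no points at 235k or above

print(extra_credit_points(205_000))
```

The same lookup applies separately to the Validation RMSE and the Test RMSE, since both tables use identical ranges.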

<br><br>

To receive these points, you need to show your work in the cells below AND complete the EXPLANATION STEP at the end (explaining what you did to create your model).  

You ALSO MUST UPLOAD your test prediction .csv to the **"Project 2 Extra Credit Test Predictions"** assignment in Gradescope to receive extra credit for your test predictions.

---

## Some notes before you start

- **If you are running into memory issues, restart the kernel and only run the cells you need to.**   If needed, you can use the commented cell below (question cell), which contains most or all of the imports necessary to complete this portion of the project. This lets you complete it independently, code-wise, from the remainder of the project, without rerunning the cell at the top of this notebook. The autograder will have more than 4GB of memory, so you will not lose credit as long as your solution to this question is within the total memory (4GB) limits of DataHub. By default, we reset the memory and clear all variables using `%reset -f`. If you want to delete specific variables, you may also use `del` in place of `%reset -f`. For example, the following code will free up memory from data used for older models: `del tr_val_data, test_data, tr, val, X_train_m1, X_valid_m1, X_train_m2, X_valid_m2`. Our staff solution can be run independently from all other questions, so we encourage you to do the same to make debugging easier.
- Tip: Feel free to try using [regularization](https://learningds.org/ch/16/ms_regularization.html) for model selection. 
- To avoid memory issues, you do not need to include cross validation for this step.  Your score will be based on the Validation Data set RMSE and the Test dataset RMSE.
- **Note: If you need the data again after deleting the variables or resetting, you must reload it.**
- You will be predicting `Log Sale Price` on the data stored in `cook_county_contest_test.csv`. We will delog/exponentiate your prediction on Gradescope to compute RMSE and use this to score your model. Before submitting to Gradescope, make sure that your predicted values can all be delogged (i.e., if one of your `Log Sale Price` predictions is 60, it is too large; $e^{60}$ is too big!)
- You MUST remove any additional new cells you add before submitting to Gradescope to avoid any autograder errors. 


**PLEASE READ THE ABOVE MESSAGE CAREFULLY!**

**Hints:** 
- Some features may have missing values in the test set but not in the training set (especially if you're one-hot-encoding). Make sure `process_data_ec` handles missing values appropriately for each feature!
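
One way to guard against the hint above, sketched here in plain Python, is to fix the list of categories from the training data and encode every other dataset against that fixed list. The helper names and the sample values are hypothetical. Turning missing or unseen values into an all-zero row is only one possible choice, and mapping them to an existing category such as "Other" is another.

```python
def fit_categories(train_values):
    """Collect the categories seen in training, in a stable sorted order."""
    return sorted({v for v in train_values if v is not None})

def encode_with_fixed_categories(values, categories):
    """One-hot encode using only the training categories.

    Missing (None) or unseen values become an all-zero row, so the output
    always has the same number of columns and no rows are dropped.
    """
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical data: the test set has a missing value and an unseen category
train_roof = ["Tile", "Slate", "Tile", "Shake"]
test_roof = ["Slate", None, "Thatch"]

categories = fit_categories(train_roof)
test_encoded = encode_with_fixed_categories(test_roof, categories)

print(categories)
print(test_encoded)
```

Because the column list is fixed by the training data, the encoded test set has the same columns, in the same order, as the features the model was fit on.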



In [None]:

# Optional code cell for additional work exploring data/ explaining which feature you chose.
# You can add additional code cells directly below this if needed.

In [None]:

# Optional code cell for additional work exploring data/ explaining which feature you chose.

In [None]:
#Optional cell to try if you're having memory issues (i.e. if kernel keeps dying)


# If you're having memory issues, uncomment the lines below to clean up 
#memory from previous questions and reinitialize Otter!



# MAKE SURE TO RECOMMENT THE NEXT 3 LINES OUT BEFORE SUBMITTING!

#%reset -f
#import otter
#grader = otter.Notebook("ProjPart2.ipynb")


#import numpy as np
#import pandas as pd
#from pandas.api.types import CategoricalDtype

#%matplotlib inline
#import matplotlib.pyplot as plt
#import seaborn as sns
#from sklearn import linear_model as lm

#import warnings
#warnings.filterwarnings("ignore")

#import zipfile
#import os

#import plotly.graph_objects as go

#from ds100_utils import *
#from feature_func import *




#tr_val_data = pd.read_csv("data/cook_county_train_val.csv", index_col='Unnamed: 0')
#test_data = pd.read_csv("data/cook_county_contest_test.csv", index_col='Unnamed: 0')

# COPY THESE FUNCTIONS FROM ABOVE

#def rmse(predicted, actual):


#def clean_data(data):
    
    

#tr_val_clean = clean_data(tr_val_data) 

    


#def train_val_split(data):



    
    


## To ensure reproducibility, we will shuffle the data once before running it through the train_val split function
#tr_val_data_shuffled = tr_val_data.sample(frac=1, random_state=18)

## Clean the data
#tr_val_clean = clean_data(tr_val_data_shuffled) 

## Create the train/val split on the cleaned, shuffled data:
#tr, val = train_val_split(tr_val_clean)





<!-- BEGIN QUESTION -->

## Extra Credit Step 1: Creating Your Model
Complete the modeling steps (you can skip the cross validation step to save memory) in the cells below.

DO NOT ADD ANY EXTRA CELLS BELOW (for this part of the problem)

Please include all of your feature engineering processes for the training, validation, and test sets in `process_data_ec`. The `dataset_type` argument is a flag to use as follows:

 - `dataset_type=1` implies the data is the training data
 - `dataset_type=2` implies the data is the validation data
 - `dataset_type=3` implies the data is the test data

**Important instructions:**

 - When processing the training data, you CAN drop any rows/data that you deem to be outliers.
 - When processing the validation data, you CANNOT drop any rows.
 - When processing the test data, you CANNOT drop any rows, and you CANNOT reference "Sale Price", as it is not in the test data.

In [None]:
# Modeling Step 1:  Process the Data


# Hint: You can either use your implementation of the One Hot Encoding Function from 
#Project Part 1, or use the staff's implementation
from feature_func import *


...

# Optional:  Define any helper functions you need (for example, for one-hot 
# encoding, etc) above this line


def process_data_ec(df, dataset_type):


    

    data = df.copy()
    
    ...
    
    return data


# Use the same original train and valid datasets from 3a (otherwise the 
# validation errors aren't comparable).  Don't resplit the data.  
    
# Process the data 
processed_train_ec = process_data_ec(tr, dataset_type=1)

processed_val_ec = process_data_ec(val, dataset_type=2)


X_train_ec = ...
y_train_ec = ...

X_valid_ec = ...
y_valid_ec = ...



# Take a look at the result
display(X_train_ec.head())
display(y_train_ec.head())

display(X_valid_ec.head())
display(y_valid_ec.head())

In [None]:
# Run this code to make sure you haven't dropped any of the rows in the validation set

assert X_valid_ec.shape[0] == 33475

In [None]:
# Modeling STEP 2:  Create a Multiple Linear Regression Model


...

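# Your code above this line should create and fit the regression model.
# Name the model `linear_model_ec`, since the test-prediction cell at the end
# of the notebook calls `linear_model_ec.predict`.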

y_predict_train_ec = ...

y_predict_valid_ec = ...




In [None]:
# MODELING STEP 3:  Evaluate the RMSE for your model


# Training and validation errors for the model
# (in units of dollars, i.e. the original values before the log transform)

training_error_ec = ...
validation_error_ec = ...


(print("Extra Credit \nTraining RMSE:$ {}\nValidation RMSE:$ {}\n"
       .format(training_error_ec, validation_error_ec))
)



In [None]:
# Optional: Run this cell to visualize

#import plotly.graph_objects as go

fig = go.Figure([
go.Bar(x = ["Extra Credit Model"], y = [training_error_ec], name="Training RMSE"),
go.Bar(x = ["Extra Credit Model"], y = [validation_error_ec], name="Validation RMSE"),

])


fig.update_yaxes(range=[140000,260000], title="RMSE")
# Feel free to update the range as needed

fig

In [None]:
# MODELING STEP 5: Plot 2 side-by-side residual plots for validation data

fig, ax = plt.subplots(1,2, figsize=(15, 5))


x_plt1 = ...
y_plt1 = ...

x_plt2 = ...
y_plt2 = ...


ax[0].scatter(x_plt1, y_plt1, alpha=.25)
ax[0].axhline(0, c='black', linewidth=1)
ax[0].set_xlabel(r'Predicted Log(Sale Price)')
ax[0].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[0].set_title("EC Val Data: Residuals vs. Predicted Log(Sale Price)")

ax[1].scatter(x_plt2, y_plt2, alpha=.25)
ax[1].axhline(0, c='black', linewidth=1)
ax[1].set_xlabel(r'Log(Sale Price)')
ax[1].set_ylabel(r'Residuals: Log(Sale Price) - Predicted Log(Sale Price)');
ax[1].set_title("EC Val Data: Residuals vs. Log(Sale Price)")


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Extra Credit Step 2:  Explanation (Required for points on model above):

 - Explain what you did to create your model.  What versions did you try?  What worked and what didn't? 

 - Comment on the RMSE and residual plots from your model. Do the residuals of your model still show a trend that overestimates lower-priced houses and underestimates higher-priced houses?
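
The two diagnostics mentioned above can be put into numbers. The following is a minimal sketch on synthetic data, not the project's own helper functions: `rmse_in_dollars` and `residual_trend_slope` are hypothetical names, and the sketch assumes that Log Sale Price is the natural log of the sale price. With residuals defined as actual minus predicted, a positive slope of residuals against the actual log price means cheaper houses tend to be overestimated and pricier houses underestimated.

```python
import numpy as np

def rmse_in_dollars(y_true_log, y_pred_log):
    """RMSE after undoing the log transform (assumes natural log)."""
    y_true = np.exp(np.asarray(y_true_log, dtype=float))
    y_pred = np.exp(np.asarray(y_pred_log, dtype=float))
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def residual_trend_slope(y_true_log, y_pred_log):
    """Slope of the least-squares line through (actual log price, residual)."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    residuals = y_true_log - y_pred_log  # actual minus predicted
    slope, _intercept = np.polyfit(y_true_log, residuals, deg=1)
    return float(slope)

# Synthetic illustration only: a "model" that shrinks every prediction
# halfway toward a log price of 12.
rng = np.random.default_rng(0)
demo_true = rng.normal(12.0, 0.8, size=500)
demo_pred = 12.0 + 0.5 * (demo_true - 12.0)

demo_slope = residual_trend_slope(demo_true, demo_pred)
demo_rmse = rmse_in_dollars(demo_true, demo_pred)
print(f"residual slope: {demo_slope:.3f}, RMSE in dollars: {demo_rmse:,.0f}")
```

For an ordinary least-squares fit, some positive slope against the actual values is expected even on the training data, so the slope is most useful for comparing one version of a model against another rather than as a pass/fail test.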

**Write your answers in the text cell below**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Extra Credit Step 3: Create and Submit Test Set Predictions to Gradescope

Now it's time to test your model on the actual test set.  You are only allowed to submit to Gradescope once, so wait until you have the best version of your model.    

When you are happy with your model and ready to submit, set the variable `did_ec = True` in the cell below your final prediction code.

The test data is in the dataframe `test_data`. 

Process the test data and run it through your model. Store your predictions for `test_data` in the variable `y_test_pred`. These should be in units of Log Sale Price (you do not need to exponentiate them).
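
A quick way to confirm that predictions are still in log units is a range check. The sketch below is illustrative only: it assumes Log Sale Price is the natural log, `looks_like_log_prices` is a hypothetical helper, and its `low` and `high` bounds are arbitrary choices rather than anything the grader uses.

```python
import numpy as np

# Hypothetical predictions in log units (stand-ins for y_test_pred).
demo_log_pred = np.array([11.5, 12.0, 12.5])

# Assumption: natural log, so np.exp recovers dollar amounts.
# This conversion is only for eyeballing; the submission stays in log units.
demo_dollars = np.exp(demo_log_pred)

def looks_like_log_prices(pred, low=5.0, high=20.0):
    """Heuristic check: exp(5) is about $148 and exp(20) is about $485 million."""
    pred = np.asarray(pred, dtype=float)
    return bool(np.all(np.isfinite(pred)) and np.all((pred > low) & (pred < high)))

print(looks_like_log_prices(demo_log_pred))  # values in log units
print(looks_like_log_prices(demo_dollars))   # values accidentally exponentiated
```

If the check fails on real predictions, the usual suspects are an extra `np.exp` somewhere in the pipeline or missing values produced during processing.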

Then run the cell provided below to create a .csv file to store your predictions on the test set and submit this .csv to the Gradescope Assignment: **"Project 2 Extra Credit Test Predictions"**. 
Note that **you will not receive credit for the test set predictions (i.e. up to 10 points) unless you submit your .csv file to the Gradescope assignment**!
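
Because only one Gradescope submission is allowed, it may be worth checking the file before uploading it. The sketch below is not part of the official grading: `validate_submission` is a hypothetical helper, and the expected row count is a placeholder you would replace with the number of rows in `test_data`. It checks the `Id` and `Value` columns that the provided export cell writes, the row count, and that every prediction is a finite number.

```python
import io

import numpy as np
import pandas as pd

def validate_submission(df, expected_rows):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if list(df.columns) != ["Id", "Value"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "Value" in df.columns:
        # Non-numeric entries become NaN and are then caught by isfinite.
        values = pd.to_numeric(df["Value"], errors="coerce")
        if not np.isfinite(values).all():
            problems.append("Value contains missing or non-finite entries")
    return problems

# Synthetic round trip through an in-memory CSV (stand-in for submission.csv).
demo_df = pd.DataFrame({"Id": [0, 1, 2], "Value": [11.5, 12.0, 12.5]})
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

demo_problems = validate_submission(reloaded, expected_rows=3)
print(demo_problems)
```

On the real file, the same function could be called as `validate_submission(pd.read_csv("submission.csv"), expected_rows=len(test_data))` after the export cell has run.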


In [None]:
# Set this to True if you are attempting the extra credit and submitting predictions to Gradescope
did_ec = False

In [None]:
# Cells to process the test_data and run the model on it.  
# You CAN add any additional cells below
# Note: Make sure you don't remove any of the rows from the test data set.

processed_test_ec = process_data_ec(test_data, dataset_type = 3) 


y_test_pred = linear_model_ec.predict(processed_test_ec)

In [None]:
y_test_pred

In [None]:
grader.check("8c")

In [None]:
# Run this cell to create the .csv of your predictions for the test set.
# Upload it to the Gradescope assignment labeled "Project 2 Extra Credit Test Predictions" to have it checked.

# Your predictions for the test set must be stored in y_test_pred
# (these should be in units of Log Sale Price)


# Construct and save the submission:
submission_df = pd.DataFrame({
    "Id": pd.read_csv('data/cook_county_contest_test.csv')['Unnamed: 0'], 
    "Value": y_test_pred,
}, columns=['Id', 'Value'])
submission_df.to_csv("submission.csv", index=False)

print('Created a CSV file: submission.csv')
print('You MUST now upload this CSV file to the Gradescope assignment "Project 2 Extra Credit Test Predictions" for scoring.')

## Congratulations! You have finished the Project - Part 2


If you discussed this assignment with any other students in the class (in a manner that is acceptable as described by the Collaboration policy above) please **include their names** here:

**Collaborators**: *list collaborators here*

### Submission Instructions

Before proceeding any further, **save this notebook.**

After running the `grader.export()` cell provided below, **2 files will be created**: a zip file and a PDF file. You can download them using the links provided below OR by finding them in the same folder where this Jupyter notebook resides in your JupyterHub.

To receive credit on this assignment, **you must submit BOTH of these files to their respective Gradescope portals:**

* **Project Part 2 Autograded**: Submit the zip file that is output by the `grader.export()` cell below to the Autograded assignment in Gradescope.

* **Project Part 2 Manually Graded**: Submit your ProjPart2.pdf to the Manually Graded assignment in Gradescope. **YOU MUST SELECT THE PAGES CORRESPONDING TO EACH QUESTION WHEN YOU UPLOAD TO GRADESCOPE. IF YOU DO NOT, YOU WILL LOSE POINTS.** Also, **check** that all of your plots **and** all lines of your code are showing up in your PDF before submitting. If they are not, you will not receive credit for your plots/code.

* **Extra Credit Submission**: If you completed the extra credit, you must submit your test set predictions file, `submission.csv` (generated in the last cell of the extra credit section), to the Gradescope assignment titled "Project 2 Extra Credit Test Predictions" to receive credit for the test set predictions.

**You are responsible for ensuring your submission follows our requirements. We will not grant regrade requests or extensions for submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline.

In [None]:
import simple_latex_checker as slc

nb = slc.Nb_checker()
nb.run_check("ProjPart2.ipynb")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

AFTER running the cell below, click on <a href='ProjPart2.pdf' download>this link to download the PDF</a> to upload to Gradescope. A separate link to download the zip file for Gradescope will appear after the cell below finishes running.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)