In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('hw6.ok')

# Homework 6: Exploring fairness through Cook County’s property assessments

## Due Date: 11:59pm Monday, April 6

### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** in the collaborators cell below.

**Collaborators:** *write names here*

## Introduction

This assignment will continue from where we left off in Homework 5. Recall that the linear model that you created failed to produce accurate estimates of the observed housing prices because the model was too simple. The goal of this homework is to guide you through the iterative process of specifying, fitting, and analyzing the performance of more complex linear models used to predict prices of houses in Cook County, Illinois. Additionally, you will have the opportunity to choose your own features and create your own regression model!

By the end of this homework, you should feel comfortable:

1. Identifying informative variables through EDA
2. Feature engineering categorical variables
3. Using sklearn to build more complex linear models

Additionally, as a continuation of the last homework, we’ll explore the dynamics of the CCAO’s appraisal system with more depth as you continue developing your housing prediction model. Alongside our discussion on implicit bias, however, we’ll tackle another central facet of the CCAO’s work: transparency. 

As you work through this assignment, consider the balance of power between the CCAO and its constituents - how might transparency redistribute this balance? And what are the limits of transparency as a solution for systemic inequity?

## HCE Learning Outcomes

Through the completion of this homework, students will be able to:
* Understand the relationship between bias and fairness.
* Analyze the technical and performative functions of transparency initiatives.
* Recognize the social aspects of transparency in regard to the redistribution of power between different stakeholders.
* Weigh the effectiveness and limitations of transparency as a means of arriving at fair algorithmic systems in order to reimagine equitable practices in data science.


## Score Breakdown

*To be determined by course staff*

In [None]:
import numpy as np

import pandas as pd
from pandas.api.types import CategoricalDtype

from sklearn.feature_extraction import DictVectorizer

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# The Data

As a reminder, the Cook County dataset contains residential sales data that the CCAO uses to assess property values. A more detailed description of each variable is available on [their website](https://datacatalog.cookcountyil.gov/Property-Taxation/Cook-County-Assessor-s-Residential-Sales-Data/5pge-nu6u). There, the list of column values notes what each entry represents and sometimes cautions about the quality of a given variable. There are 83 features in total.

The raw data are split into training and test sets with 2000 and 930 observations, respectively. To save some time, we've used a slightly modified data cleaning pipeline from last week's assignment to prepare the training data. This data is stored in `ccao_train_cleaned.csv`. It consists of 1998 observations and 83 features (we added TotalBathrooms from Homework 5). 

In [None]:
training_data = pd.read_csv("ccao_train_cleaned.csv")

### Bias and Fairness

When the use of algorithms and statistical modeling has real-world consequences, we often refer to the idea of fairness as a measurement of how socially-responsible our work is. Does our algorithm make similar predictions across race and gender identities? Does our model adequately represent reality?

In the case of the Cook County Assessor’s Office, fair property tax rates are contingent on whether property values are assessed accurately - that they’re valued at what they’re worth, relative to properties with similar characteristics. This implies that having a more accurate model results in fairer assessments. The goal of creating a property assessment model, then, is to be as accurate as possible, which is typically approached by attempting to minimize human-related biases in the data collection and modeling process. 

In our previous examination of bias through Homework 1, however, we established that, because of human involvement and historical/institutional contexts, bias is impossible to eradicate. This adds a new dimension to our understanding of fairness in home assessments: Having “accurate” (and therefore fair) assessments requires us to constantly reflect on the decisions we make throughout the data lifecycle, as well as the contexts in which we’re working. Keep this relationship between bias and fairness in mind as we continue building our model!

## HCE: Question -1

Based on your work in the previous homework, what would you define as a fair property assessment? There isn’t one right answer! Share your thoughts in 1-2 sentences. 

`TODO`: *Write your answer here*

## HCE: Question 0

Does removing human judgment from the assessment process make assessments fairer? In what ways? How might it be (still) unfair?

`TODO`: *Write your answer here*

# Part 4: More Feature Selection and Engineering

Before we make changes to our basic linear model, let’s go back to the data. In this section, we’ll identify two more features of the dataset that will increase our linear regression model's accuracy. Additionally, we will implement one-hot encoding so we can include binary and categorical variables in our improved model.

We’ll start by first diving into the hallmark of the CCAO’s existing model: mass appraisal.

### Mass Appraisal

A unique technique employed by the current assessor’s office is the idea of mass appraisal. Rather than assessing homes one by one, mass appraisal evaluates value by looking to the real estate market for local trends based on location and property characteristics. This differs from the classic system, where human evaluation was a more significant factor in evaluating housing prices. The CCAO’s website states that [“mass appraisal is a way to put fairness into the assessment system.”](https://www.cookcountyassessor.com/index.php/about-cook-county-assessors-office) 

The dataset we’re currently working with is the same one used for mass appraisal. Let’s examine the column `Neighborhood Code` and see how the location of homes might influence the fairness of mass appraisal.
 
## HCE: Question 2

How does mass appraisal appeal to fairness? Consider how it might both benefit and hurt homeowners.


`TODO:` *Write your answer here*

## Question 1: Neighborhood vs Sale Price

First, let's take a look at the relationship between neighborhood and sale prices of the houses in our data set.

In [None]:
fig, axs = plt.subplots(nrows=2)

sns.boxplot(
    x='Neighborhood',
    y='SalePrice',
    data=training_data.sort_values('Neighborhood'),
    ax=axs[0]
)

sns.countplot(
    x='Neighborhood',
    data=training_data.sort_values('Neighborhood'),
    ax=axs[1]
)

# Draw median price
axs[0].axhline(
    y=training_data['SalePrice'].median(), 
    color='red',
    linestyle='dotted'
)

# Label the bars with counts
for patch in axs[1].patches:
    x = patch.get_bbox().get_points()[:, 0]
    y = patch.get_bbox().get_points()[1, 1]
    axs[1].annotate(f'{int(y)}', (x.mean(), y), ha='center', va='bottom')
    
# Format x-axes
axs[1].set_xticklabels(axs[1].xaxis.get_majorticklabels(), rotation=90)
axs[0].xaxis.set_visible(False)

# Narrow the gap between the plots
plt.subplots_adjust(hspace=0.01)

### Question 1a <a name="q1a"></a> 

Based on the plot above, what can be said about the relationship between the houses' sale prices and their neighborhoods?

<!--
BEGIN QUESTION
name: q1a
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

### Question 1b <a name="q1b"></a> 

One way we can deal with the lack of data from some neighborhoods is to create a new feature that bins neighborhoods together.  Let's categorize our neighborhoods in a crude way: we'll take the top 3 neighborhoods measured by median `SalePrice` and identify them as "rich neighborhoods"; the other neighborhoods are not marked.

Write a function that returns a list of the top n priciest neighborhoods as measured by our choice of aggregating function.  For example, in the setup above, we would want to call `find_rich_neighborhoods(training_data, 3, np.median)` to find the top 3 neighborhoods measured by median `SalePrice`.

*The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.*

<!--
BEGIN QUESTION
name: q1b
points: 1
-->

In [None]:
def find_rich_neighborhoods(data, n=3, metric=np.median):
    """
    Input:
      data (data frame): should contain at least a string-valued Neighborhood
        and a numeric SalePrice column
      n (int): the number of top values desired
      metric (function): function used for aggregating the data in each neighborhood.
        for example, np.median for median prices
    
    Output:
      a list of the top n richest neighborhoods as measured by the metric function
    """
    neighborhoods = ...
    return neighborhoods

rich_neighborhoods = find_rich_neighborhoods(training_data, 3, np.median)
rich_neighborhoods

In [None]:
ok.grade("q1b");

### Question 1c <a name="q1c"></a> 

We now have a list of neighborhoods we've deemed as richer than others.  Let's use that information to make a new variable `in_rich_neighborhood`.  Write a function `add_in_rich_neighborhood` that adds an indicator variable which takes on the value 1 if the house is part of `rich_neighborhoods` and the value 0 otherwise.

**Hint:** [`pd.Series.astype`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.astype.html) may be useful for converting True/False values to integers.
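
As a small illustration of the hint, the sketch below turns a boolean Series into 0/1 integers. The `prices` values and the 200000 threshold are made up for illustration and do not come from the assignment data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the CCAO dataset
prices = pd.Series([150000, 320000, 95000, 410000])

# A comparison yields a boolean Series ...
is_expensive = prices > 200000

# ... and astype(int) maps True -> 1 and False -> 0
indicator = is_expensive.astype(int)
print(indicator.tolist())
```

Any boolean Series can be converted this way, regardless of how it was produced.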

*The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.*

<!--
BEGIN QUESTION
name: q1c
points: 1
-->

In [None]:
def add_in_rich_neighborhood(data, neighborhoods):
    """
    Input:
      data (data frame): a data frame containing a 'Neighborhood' column with values
        found in the codebook
      neighborhoods (list of strings): strings should be the names of neighborhoods
        pre-identified as rich
    Output:
      data frame identical to the input with the addition of a binary
      in_rich_neighborhood column
    """
    data['in_rich_neighborhood'] = ...
    return data

rich_neighborhoods = find_rich_neighborhoods(training_data, 3, np.median)
training_data = add_in_rich_neighborhood(training_data, rich_neighborhoods)

In [None]:
ok.grade("q1c");

Having identified rich neighborhoods, let’s now look toward the other end of the spectrum: lower-valued properties. According to the CCAO, their assessment system struggles with accurately predicting the values of properties that are worth less than 150k because they lack data for those properties. This ultimately diminishes the usefulness of mass appraisal for these properties. 

## HCE: Question 2.5

In what situation does mass appraisal fail? How might mass appraisal work unfairly in this way?


`TODO:` *Write your answer here*

## Question 2: Floodplain

In 2019, the Cook County Assessor’s Office added the Federal Emergency Management Agency’s [floodplain data](https://msc.fema.gov/portal/home) to its assessment models. As described in their [Medium article](https://medium.com/@AssessorCook/why-and-how-floodplain-data-is-used-in-cook-county-property-assessments-6269d75189d7), “a floodplain is an area near a body of water that has a high risk of flooding.” A value of 0 indicates that a property is not on a floodplain, while a value of 1 indicates that a property is on a floodplain.

### Question 2a <a name="q2a"></a>

Let's see if our data set has any missing values.  Create a Series object containing the counts of missing values in each of the columns of our data set, sorted from greatest to least.  The Series should be indexed by the variable names.  For example, `missing_counts['Floodplain']` should return 975.

**Hint:** [`pandas.DataFrame.isnull`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) may help here.

*The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.*

<!--
BEGIN QUESTION
name: q2a
points: 1
-->

In [None]:
missing_counts = ...
missing_counts

In [None]:
ok.grade("q2a");

The following values (or lack thereof) thus have the following meanings:
```
Floodplain (Ordinal): Whether the property is on a floodplain

       0	Not on floodplain
       1	On floodplain
       Null	No data
```


### Question 2b <a name="q2b"></a>

A missing value here means that there is no data for the property.  Let's fix this in our data set.  Write a function that replaces the missing values in `Floodplain` with `'No data'`.  In addition, it should replace each abbreviated condition with its full word.  For example, `'1'` should be changed to `'On floodplain'`.  Hint: the [DataFrame.replace](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) method may be useful here.

*The provided tests check that part of your answer is correct, but they are not fully comprehensive.*

<!--
BEGIN QUESTION
name: q2b
points: 2
-->

In [None]:
def fix_floodplain(data):
    """
    Input:
      data (data frame): a data frame containing a Floodplain column.  Its values
                         should be limited to those found in the codebook
    Output:
      data frame identical to the input except with a refactored Floodplain column
    """
    ...
    return data
    
training_data = fix_floodplain(training_data)

In [None]:
ok.grade("q2b");

In [None]:
training_data['Floodplain']

### An Important Note on One Hot Encoding <a name="important_note"></a>

Unfortunately, simply fixing these missing values isn't sufficient for using `Floodplain` in our model.  Since `Floodplain` is a categorical variable, we will have to one-hot-encode the data using `DictVectorizer` from Lab 6. Note that we dropped the first one-hot-encoded column. For more information on categorical data in pandas, refer to this [link](https://pandas-docs.github.io/pandas-docs-travis/categorical.html).

In [None]:
def ohe_floodplain(data):
    """
    One-hot-encodes floodplain.  New columns are of the form Floodplain=STATUS
    """
    vec_enc = DictVectorizer()
    vec_enc.fit(data[['Floodplain']].to_dict(orient='records'))
    floodplain_data = vec_enc.transform(data[['Floodplain']].to_dict(orient='records')).toarray()
    floodplain_cats = vec_enc.get_feature_names()
    floodplain = pd.DataFrame(floodplain_data, columns=floodplain_cats, index=data.index)
    data = pd.concat([data, floodplain], axis=1)
    data = data.drop(columns=floodplain_cats[0])
    return data

In [None]:
training_data = ohe_floodplain(training_data)
training_data.filter(regex='Floodplain').head(10)

# Part 5: Improved Linear Models

In this section, we will create linear models that produce more accurate estimates of the housing prices in Cook County than the model created in Homework 5, but at the expense of increased complexity.

## Question 3: Adding Covariates to our Model

It's finally time to fit our updated linear regression model using the ordinary least squares estimator! Our new model consists of the linear model from Homework 5, with the addition of our newly created `in_rich_neighborhood` variable and our one-hot-encoded `Floodplain` variables:

$$\begin{align}
\text{SalePrice} & = \theta_0 + \theta_1 \cdot \text{Gr_Liv_Area} + \theta_2 \cdot \text{Garage_Area} + 
\theta_3 \cdot \text{TotalBathrooms} + \theta_4 \cdot \text{in_rich_neighborhood} + \\
& \quad \: \theta_5 \cdot \text{Floodplain=On floodplain} + \theta_6 \cdot \text{Floodplain=No data}
\end{align}$$

### Question 3a <a name="q3a"></a>

Although the floodplain variable that we explored in Question 2 has three categories, only two of these categories' indicator variables are included in our model. Is this a mistake, or is it done intentionally? Why?

<!--
BEGIN QUESTION
name: q3a
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

### Question 3b <a name="q3b"></a>

We still have a little bit of work to do prior to estimating our linear regression model's coefficients. Instead of having you go through the process of selecting the pertinent covariates and creating a [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object for our linear model again, we will provide the necessary code from Homework 5. However, we will now use cross validation to help validate our model instead of explicitly splitting the data into a training and testing set.
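
As a rough sketch of how cross validation partitions the rows, the example below runs `KFold` on a hypothetical 10-row array. `X_toy` is made up for illustration, and 5 folds is an assumption chosen for the toy case.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy design matrix with 10 rows (illustration only)
X_toy = np.arange(20).reshape(10, 2)

five_fold = KFold(n_splits=5)  # no shuffling by default
folds = list(five_fold.split(X_toy))

# Each fold holds out a different block of rows for validation
for fold, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx}, validate={val_idx}")
```

Every row is held out exactly once, so each observation contributes to one validation error.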

First, we will re-import the data.

In [None]:
training_data = pd.read_csv("ccao_train_cleaned.csv")

Next, we will implement a reusable pipeline that selects the required variables in our data and splits our covariates and response variable into a matrix and a vector, respectively.

In [None]:
def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    return data.loc[:, columns]

def process_data_gm(data):
    """Process the data for a guided model."""
    # One-hot-encode the Floodplain feature
    data = fix_floodplain(data)
    data = ohe_floodplain(data)
    
    # Use rich_neighborhoods computed earlier to add in_rich_neighborhood feature
    data = add_in_rich_neighborhood(data, rich_neighborhoods)
    
    # Transform Data, Select Features
    data = select_columns(data, 
                          'SalePrice', 
                          'Gr_Liv_Area', 
                          'Garage_Area',
                          'TotalBathrooms',
                          'in_rich_neighborhood',
                          'Floodplain=On floodplain',
                          'Floodplain=No data'
                         )
    
    # Return predictors and response variables separately
    X = data.drop(['SalePrice'], axis = 1)
    y = data.loc[:, 'SalePrice']
    
    return X, y

We then run the training data through this pipeline to obtain the design matrix `X_train` and the response vector `y_train`.

In [None]:
# Pre-process the training data
# Our functions make this very easy!
X_train, y_train = process_data_gm(training_data)
X_train.head()

Finally, we initialize a [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object as our linear model. We set `fit_intercept=True` so that the model estimates an intercept term ($\theta_0$) instead of forcing the fit through the origin.
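
To see what `fit_intercept` changes, here is a minimal sketch on hypothetical noise-free data generated from $y = 1 + 2x$. The arrays `x_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn import linear_model as lm

# Hypothetical toy data generated from y = 1 + 2x (no noise)
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 1 + 2 * x_toy[:, 0]

with_intercept = lm.LinearRegression(fit_intercept=True).fit(x_toy, y_toy)
without_intercept = lm.LinearRegression(fit_intercept=False).fit(x_toy, y_toy)

# With an intercept, the fit recovers roughly intercept 1 and slope 2
print(with_intercept.intercept_, with_intercept.coef_)
# Without one, the intercept is fixed at 0 and the slope absorbs the offset
print(without_intercept.intercept_, without_intercept.coef_)
```

In this toy case the model without an intercept settles on a steeper slope, because the line must pass through the origin.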

In [None]:
from sklearn import linear_model as lm

linear_model = lm.LinearRegression(fit_intercept=True)

After a little bit of work, it's finally time to fit our updated linear regression model. Use the cell below to estimate the model, and then use it to compute the fitted value of `SalePrice` over the training data.

*The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.*

<!--
BEGIN QUESTION
name: q3b
points: 2
-->

In [None]:
# Fit the model below

# Compute the fitted and predicted values of SalePrice
y_fitted = ...

In [None]:
ok.grade("q3b");

You can see that, as we consider more features in our model, its computational complexity grows. It isn’t just computational, however - increased complexity requires greater expertise and understanding of data science. 

### Question 3c <a name="q3c"></a>

Let's assess the performance of our new linear regression model using the Root Mean Squared Error function that we created in Homework 5.

$$RMSE = \sqrt{\dfrac{\sum_{\text{houses}}(\text{actual price for house} - \text{predicted price for house})^2}{\text{# of houses}}}$$
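
As a quick numeric check of the formula, the sketch below evaluates it step by step for three hypothetical houses. The prices are made up for illustration.

```python
import numpy as np

# Hypothetical prices for three houses (illustration only)
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 330000.0])

# Squared error per house: 10k, 10k, and 30k off, each squared
squared_errors = (actual - predicted) ** 2

# Average the squared errors, then take the square root
toy_rmse = np.sqrt(squared_errors.sum() / len(actual))
print(toy_rmse)
```

The result is roughly 19,100, which is larger than the average absolute miss of about 16,700 because squaring weights the 30k miss more heavily.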

The function is provided below.

In [None]:
def rmse(predicted, actual):
    """
    Calculates RMSE from actual and predicted values
    Input:
      predicted (1D array): vector of predicted/fitted values
      actual (1D array): vector of actual values
    Output:
      a float, the root-mean square error
    """
    return np.sqrt(np.mean((actual - predicted)**2))

Please compute the training error using the `rmse` function above.

*The provided tests for this question do not confirm that you have answered correctly; only that you have assigned each variable to a non-negative number.*

<!--
BEGIN QUESTION
name: q3c
points: 1
-->

In [None]:
training_error = ...
print("Training RMSE: {}".format(training_error))

In [None]:
ok.grade("q3c");

A slightly modified version of the `cross_validate_rmse` function from Lecture 18 is also provided below.

In [None]:
from sklearn.model_selection import KFold
from sklearn.base import clone

def cross_validate_rmse(model, X, y):
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(X):
        model.fit(X.iloc[tr_ind,:], y.iloc[tr_ind])
        rmse_values.append(rmse(model.predict(X.iloc[va_ind,:]), y.iloc[va_ind]))
    return np.mean(rmse_values)

Now use the `cross_validate_rmse` function to calculate the cross validation error in the cell below.

*The provided tests for this question do not confirm that you have answered correctly; only that you have assigned each variable to a non-negative number.*

<!--
BEGIN QUESTION
name: q3d
points: 1
-->

In [None]:
cv_error = ...
print("Cross Validation RMSE: {}".format(cv_error))

In [None]:
ok.grade("q3d");

# Part 6: Open-Response

The following part is purposefully left nearly open-ended.  The Cook County data in your possession comes from a larger data set.  Your goal is to provide a linear regression model that accurately predicts the prices of the held-out homes, measured by root mean square error. 

$$RMSE = \sqrt{\dfrac{\sum_{\text{houses in public test set}}(\text{actual price for house} - \text{predicted price for house})^2}{\text{# of houses}}}$$

Perfect prediction of house prices would have a score of 0, so you want your score to be as low as possible!

### Grading Scheme

Your grade for Question 4 will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 3 | 2 | 1 | 0
--- | --- | --- | --- | ---
Training RMSE | Less than 36k | 36k - 38k | 38k - 40k | More than 40k

Points | 3 | 2 | 1 | 0
--- | --- | --- | --- | ---
Test RMSE | Less than 37k | 37k - 40k | 40k - 43k | More than 43k


### One Hot Encoding

If you choose to include more categorical features in your model, you'll need to one-hot-encode each one. Remember that if a categorical variable has a unique value that is present in the training set but not in the test set, one-hot-encoding this variable will result in different outputs for the training and test sets (different numbers of one-hot columns). Watch out for this! Feel free to look back at how we [one-hot-encoded `Floodplain`](#important_note).

To generate all possible categories for a categorical variable, we suggest reading through a more detailed description of each variable [here](https://datacatalog.cookcountyil.gov/Property-Taxation/Cook-County-Assessor-s-Residential-Sales-Data/5pge-nu6u) or finding the values programmatically across both the training and test datasets.
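
One possible way to keep the one-hot columns identical across datasets is to declare the full category list up front with `CategoricalDtype`. The sketch below is only an illustration under that assumption: the `Roof` column, its labels, and the `ohe_roof` helper are hypothetical and not part of the assignment.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy column; the category labels are illustrative
train = pd.DataFrame({'Roof': ['Shingle', 'Tile', 'Slate']})
test = pd.DataFrame({'Roof': ['Shingle', 'Shingle']})  # 'Tile' and 'Slate' never appear

# Declare the full set of categories once, then reuse it for both sets
roof_type = CategoricalDtype(categories=['Shingle', 'Slate', 'Tile'])

def ohe_roof(data):
    """One-hot-encode Roof with a fixed category list (toy helper)."""
    dummies = pd.get_dummies(data['Roof'].astype(roof_type), prefix='Roof')
    return pd.concat([data.drop(columns='Roof'), dummies], axis=1)

train_ohe = ohe_roof(train)
test_ohe = ohe_roof(test)
print(list(train_ohe.columns) == list(test_ohe.columns))
```

Because the categories are fixed in advance, the test frame still gets `Roof_Slate` and `Roof_Tile` columns (all zeros), so both design matrices have the same shape.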

## Question 4: Your Own Linear Model <a name="q4"></a>

Just as in the guided model above, you should encapsulate as much of your workflow into functions as possible. Below, we have initialized `final_model` for you. Your job is to select better features and define your own feature engineering pipeline in `process_data_fm`. We recommend using cross validation to help inform your feature selection process.

To evaluate your model, we will process training data using your `process_data_fm`, fit `final_model` with this training data, and compute the training RMSE. Then, we will process the test data with your `process_data_fm`, use `final_model` to predict sale prices for the test data, and compute the test RMSE. See below for an example of the code we will run to grade your model:

```
training_data = pd.read_csv('ccao_train_cleaned.csv')
test_data = pd.read_csv('ccao_test_cleaned.csv')

X_train, y_train = process_data_fm(training_data)
X_test, y_test = process_data_fm(test_data)

final_model.fit(X_train, y_train)
y_predicted_train = final_model.predict(X_train)
y_predicted_test = final_model.predict(X_test)

training_rmse = rmse(y_predicted_train, y_train)
test_rmse = rmse(y_predicted_test, y_test)
```

**Note:** It is your duty to make sure that all of your feature engineering and selection happens in `process_data_fm`, and that the function performs as expected without errors. We will **NOT** accept regrade requests that require us to go back and run code that requires typo/bug fixes.

**Hint:** Some features may have missing values in the test set but not in the training set. Make sure `process_data_fm` handles missing values appropriately for each feature!
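
A minimal sketch of one way to handle this, assuming a numeric feature and a fill value computed once from the training data (much as `rich_neighborhoods` is computed once in the guided model). The `Lot_Size` column and the `fill_lot_size` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; 'Lot_Size' is an illustrative column name
train = pd.DataFrame({'Lot_Size': [4000.0, 5000.0, 6000.0]})
test = pd.DataFrame({'Lot_Size': [4500.0, np.nan]})

def fill_lot_size(data, fill_value):
    """Return a copy with missing Lot_Size replaced by fill_value."""
    data = data.copy()
    data['Lot_Size'] = data['Lot_Size'].fillna(fill_value)
    return data

# Compute the fill value from the training data only
train_median = train['Lot_Size'].median()
test_filled = fill_lot_size(test, train_median)
print(test_filled['Lot_Size'].tolist())
```

Whether a median, a constant, or a separate "missing" indicator is appropriate depends on the feature, so treat this as one option rather than a recommendation.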
<!--
BEGIN QUESTION
name: q4
points: 6
-->

In [None]:
final_model = lm.LinearRegression(fit_intercept=True) # No need to change this!

def process_data_fm(data):
    ...
    # Return predictors and response variables separately
    X = data.drop(['SalePrice'], axis = 1)
    y = data.loc[:, 'SalePrice']
    return X, y

In [None]:
ok.grade("q4");

## Question 5: EDA for Feature Selection

In the following question, explain a choice you made in designing your custom linear model in Question 4. First, make a plot to show something interesting about the data. Then explain your findings from the plot, and describe how these findings motivated a change to your model.

### Question 5a <a name="q5a"></a>

In the cell below, create a visualization that shows something interesting about the dataset.

<!--
BEGIN QUESTION
name: q5a
points: 2
manual: True
-->
<!-- EXPORT TO PDF -->

In [None]:
# Code for visualization goes here
...

### Question 5b <a name="q5b"></a>

Explain any conclusions you draw from the plot above, and describe how these conclusions affected the design of your model. After creating the plot, did you add/remove certain features from your model, or did you perform some other type of feature engineering? How significantly did these changes affect your RMSE?

<!--
BEGIN QUESTION
name: q5b
points: 2
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

### Bias in Modeling 

As the previous question highlighted, a series of decisions are built into your model: When the goal is to minimize RMSE (and receive full homework points!), it makes sense to use features that carry a lot of weight. These choices, however, create a representation of reality - intentional or not - and dictate what a model considers valuable when generating any type of prediction. 

Having worked through a structured data analysis/modeling process and built your own model, we’ll now take a step back and look at how your work fits into the real world. Because civic data initiatives have multiple stakeholders, it’s imperative to understand how different groups interact with particular aspects of the data. For a student, it’s important to minimize a model’s error for the sake of completing this assignment and learning basic data science principles. For the CCAO, it’s important to project a vision of fairness for housing assessments in order to maintain the trust of Cook County’s constituents. An assessment model has the potential to reach stakeholders beyond the assessor’s office, so let’s explore this further!
 
 
## HCE: Question 3
Above, we included `Floodplain` in our model to predict property values. This column could be useful to organizations outside of the CCAO - insurance companies, for example, calculate risk and insure houses based on their characteristics. If an insurance company wanted to assign an insurance rate to a house, why might they use `Floodplain` from the county assessor as part of their calculation?

`TODO:` *Write your answer here*

## HCE: Question 4
This dataset, and many like it around the country, are available free or for sale to any business. What kinds of businesses and industries would be interested in the feature(s) you identified? How could a business or industry use the feature(s) to make a decision that would help them? 


`TODO:` *Write your answer here*

## HCE: Question 5 
In the cell below, generate a visualization of the feature(s) you selected which could help the business or industry use this data to make a decision.


`TODO:` *Write your answer here*

## HCE: Question 6
While you have argued that this feature could help a given business make an informed decision, is there a potential for the feature(s) to mislead them? How so? 

*Hint:* If you’re stuck, consider the limitations of bias in data collection. 

`TODO:` *Write your answer here*

As your work through these questions demonstrates, your features and modeling process have relevance to a diversity of fields. You may have noticed, however, that the quality of this information is contingent on its explainability, i.e. what a particular variable would mean or represent to different industries. This need for context and clarity not only reveals the underlying biases of our work but also characterizes another common approach to fairness: transparency.
 
## HCE: Question 7
What would you consider a transparent and fair process in regard to home assessment, and how might you implement these ideas in your modeling process?


`TODO:` *Write your answer here*

# Part 7: Approaching Fairness through Transparency


In this homework, the validity of your model’s assessments is determined by how closely your test set’s results align with the CCAO’s residential sales dataset. The CCAO, however, does not have an autograder to check its work; its predictions, after all, draw on this dataset to re-establish the standard for fair and accurate property values throughout the triennial assessment period. Instead, the Office champions transparency as a guiding principle in regard to fairness. 


### Transparency and the CCAO

After a lawsuit was filed against the CCAO for producing [“racially discriminatory assessments and taxes,"](https://harris.uchicago.edu/news-events/news/prof-chris-berry-testifies-institutional-racism-cook-county-property-taxes) the Office decided to tackle these inequities by committing to transparency initiatives. The hope was that, by exposing the inner workings of the CCAO’s property valuation process, their assessment results could be publicly verified as accurate and therefore trusted to be fair. 

These transparency initiatives include publishing all of the CCAO’s work on [GitLab](https://gitlab.com/ccao-data-science---modeling). By allowing the public to access any updates to the system in real-time, the Office argues that they increase accessibility to a process that had previously been blackboxed - obscured and hidden - from the public. Empowered by transparency, the citizens of Cook County would ideally hold the Assessor’s Office accountable for their work, redistributing the balance of power between the CCAO and its constituents. And in this scenario, this form of transparency would thus contribute to the legitimacy and fairness of the Office’s property assessments.

Additionally, these measures were, in part, developed to push back against the inequities of the tax lawyer industry. Because hiring a tax lawyer to negotiate for lower valuations (and therefore taxes) is limited to the wealthy, property owners with a lower socioeconomic status paid a disproportionate amount of tax. However, because the CCAO’s assessment process is now public, tax lawyers can only contest the CCAO’s work through technical means - in other words, tax lawyers must abide by the rules set by the CCAO. In this way, the transparency initiatives aim to shift the balance of power in property assessment and taxes away from the tax lawyer industry. 

## HCE: Question 8

How do the CCAO’s transparency initiatives aim to redistribute power between the tax lawyer industry, the CCAO, and the constituents of Cook County?


`TODO:` *Write your answer here*

With these balances of power in mind, the next step is to critically examine the effectiveness and reach of the CCAO’s transparency initiatives. We can assess them in three ways: 

* Accessibility - Are there barriers or limits to participating in the transparency initiatives implemented by the CCAO? More specifically, who can and cannot interact with and understand the CCAO’s published code on GitLab?
* Explainability - What efforts has the CCAO made to effectively communicate information about their assessment process? To what extent do they elaborate on the documentation of their GitLab repository?
* Accountability - In what ways can the CCAO be held accountable through their transparency initiatives? Are there barriers to who can hold them accountable?

As you may have noticed, these terms are closely linked to one another. We’ll now examine these standards for transparency by diving deeper into the CCAO’s assessment process. 

According to the CCAO’s [Progress Report on Implementation of 100 Day Objectives](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev/-/blob/master/documentation/Progress%20Report%20on%20Implementation%20of%20100%20Day%20Objectives.pdf), their assessment system operates beyond a single model or series of models: It allows “any number of models to be specified and runs all of them (...) Each model is subjected to a battery of tests as outlined by the International Association of Assessing Officers. The algorithm then recommends a model based on its performance on these tests.” This complex modeling, coupled with real estate-based data collection practices, enables the CCAO to leverage their data scientists’ expertise in their final assessments. It’s important to acknowledge this power dynamic because it directly interacts with the notion of transparency in the assessment system. 
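To make the "run every model, test each one, recommend the best" idea concrete, here is a minimal sketch. It is not the CCAO's actual pipeline. It assumes two ratio-study statistics commonly associated with IAAO standards, the coefficient of dispersion (COD) and the price-related differential (PRD), and it assumes the recommendation is simply the candidate with the lowest COD. The function names, candidate models, and toy data are all hypothetical.

```python
from statistics import mean, median


def assessment_ratios(predicted, sale_prices):
    """Ratio of each predicted value to the observed sale price."""
    return [p / s for p, s in zip(predicted, sale_prices)]


def cod(predicted, sale_prices):
    """Coefficient of dispersion: mean absolute deviation of the ratios
    from their median, expressed as a percentage of the median."""
    ratios = assessment_ratios(predicted, sale_prices)
    med = median(ratios)
    return 100 * mean(abs(r - med) for r in ratios) / med


def prd(predicted, sale_prices):
    """Price-related differential: mean ratio divided by the
    sale-weighted mean ratio. Values far from 1 suggest that high- and
    low-priced homes are being assessed at different levels."""
    ratios = assessment_ratios(predicted, sale_prices)
    return mean(ratios) / (sum(predicted) / sum(sale_prices))


def recommend_model(candidates, features, sale_prices):
    """Run every candidate, score each one, and recommend the one with
    the lowest COD (an assumed selection rule, for illustration only)."""
    scores = {}
    for name, model in candidates.items():
        predicted = [model(x) for x in features]
        scores[name] = {
            "COD": cod(predicted, sale_prices),
            "PRD": prd(predicted, sale_prices),
        }
    best = min(scores, key=lambda name: scores[name]["COD"])
    return best, scores


# Toy data (hypothetical): square footage and observed sale prices.
sqft = [1000, 1500, 2000, 2500]
sale_prices = [210_000, 290_000, 405_000, 495_000]

# Two hypothetical candidate "models", written as plain functions.
candidates = {
    "flat_average": lambda x: 350_000,    # same value for every home
    "price_per_sqft": lambda x: 200 * x,  # value scales with size
}

best, scores = recommend_model(candidates, sqft, sale_prices)
print("Recommended model:", best)
for name, s in scores.items():
    print(f"{name}: COD={s['COD']:.1f}, PRD={s['PRD']:.3f}")
```

Even in this toy form, deciding which tests to run and how to weigh them is a judgment call made by whoever writes the selection rule, which is part of the expertise the CCAO leverages.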

Let’s start with accessibility. To be frank, the CCAO’s algorithmic modeling system is almost completely inaccessible to the everyday person, despite its viewability on GitLab. Understanding code - much less algorithms and statistical models written in code - is a high barrier to entry, severely restricting accessibility on the basis of expertise. This is where the next metric for transparency, explainability, comes in. 

## HCE: Question 9 

Take a look at the Residential Automated Valuation Model files under the Models subgroup in the CCAO’s [GitLab](https://gitlab.com/ccao-data-science---modeling). Without directly looking at any code, does the documentation sufficiently explain how the residential valuation model works? 

`TODO:` *Write your answer here*

With the barrier of expertise, explanations are perhaps the only way that the CCAO can bridge the technical gaps throughout the assessment process. However, as you might’ve noticed, these explanations are also lacking. That leaves the final piece of the puzzle - the ideal product of transparency, accountability.

Without any measures to expand the accessibility and explainability of their algorithmic modeling, the prospects of holding the CCAO accountable for their assessments are dim. The level of expertise needed to check the CCAO’s systems excludes participatory voices from the community. Even community members who have valuable insights into measures of inequity - itself another form of expertise - cannot engage with technical expertise in data and modeling.

With all that said, it’s imperative to acknowledge that the CCAO is a government institution. Although their work relies on technical expertise, anyone who wishes to critique - or even just interact with - the assessment pipeline should have the means to; all of Cook County’s constituents are affected by property assessments, and they should not be required to have a technical background in order to participate in a technical space. 

## HCE: Question 10 

In what ways does the CCAO’s implementation of transparency fail? What aspects of it are inaccessible and to whom? Consider the concepts of expertise and power.


`TODO:` *Write your answer here*

So what role does transparency really play in relation to fairness? Given its limitations in regard to accessibility, the CCAO’s transparency initiatives can come across as merely performative. That said, they certainly still demonstrate the Office’s commitment to fairness and equity, performing the social function of instilling faith in its work. And while there is validity in fostering trust with the local community as a governing institution, it’s nonetheless vital to understand what this faith is built on: The CCAO operates its assessment model by leveraging its expertise, and transparency is then used to reinforce the legitimacy of the CCAO’s work by exemplifying a gesture of good will. Through this process, the CCAO’s power to determine property assessments and taxes is thus maintained. 

## HCE: Question 11

How does the CCAO maintain its power in housing assessments and its open-data initiative? 


`TODO:` *Write your answer here*

Fairness, in the end, is as difficult to define as it is to implement. What you make of this process - from the ingrained bias throughout the data lifecycle to the role of transparency in the CCAO’s work - is for your consideration. Regardless, it raises several challenging but pressing questions: How can we envision fair and equitable data science beyond transparency? How can we go one step further and incorporate justice in data science? 

The Cook County Assessor’s Office is just one case study. Although it is unique in being the first Assessor’s Office to publish its assessment model and data, it is not exempt from the same scrutiny and critique that other public and private institutions receive. Consider how expertise serves as the baseline and standard for fairness - and consider how data science becomes authoritative despite (or, perhaps, because of) its many exceptions, parameters, and metrics. Continue to reflect on these concepts in your data science work as you learn more and more.


## Before You Submit

Make sure that if you run Kernel > Restart & Run All, your notebook produces the expected outputs for each cell. Congratulations on finishing the assignment!

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

<!-- EXPECT 4 EXPORTED QUESTIONS -->

In [None]:
# Save your notebook first, then run this cell to submit.
import jassign.to_pdf
jassign.to_pdf.generate_pdf('hw6.ipynb', 'hw6.pdf')
ok.submit()