# **Welcome to the Fairness Project!**

This project consists of 2 notebooks + 2 bonus notebooks where you will be exploring the effects of bias on the job application process and evaluating the overall fairness of a variety of models.

Notebook 1:
* Fairness + Hiring Background
* Data Exploration
* (Optional) Necessity + Sufficiency Scores
* (Optional) Skill-Based Classifier

Notebook 2:
* Logistic Regression Models for Studying Protected Characteristics
* Statistical Parity Difference
* "We Are All Equal" vs. "What You See Is What You Get" Metrics
* Decision Tree Model
* (Optional) Random Forest Model
* (Optional) Neural Networks

(Optional)
Notebook 3: Advanced Techniques

(Optional)
Notebook 4: Open-ended Application

# Fairness and Hiring

Here are some statistics related to fairness and hiring:
* A [Yale University study](https://www.forbes.com/sites/pragyaagarwaleurope/2018/12/03/unconscious-bias-how-it-affects-us-more-than-we-know/?sh=424d33276e13) found that male and female scientists, both trained to be objective, were more likely to hire men, consider them more competent than women, and pay them $4,000 more per year than women. Other research has shown that science faculty rated male applicants for a laboratory manager position as significantly more competent and hireable than female applicants.
* [A 2003 study by UChicago and MIT](https://uh.edu/~adkugler/Bertrand&Mullainathan.pdf) titled *Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination* tested the difference a name made to job interview opportunities. The researchers submitted 5,000 identical resumes to jobs in the Chicago and Boston areas. They used random names that were stereotypically white or African American. The applicants with the white-sounding names received an astounding 50% more job interview requests.

**Questions:** Is this fair? Why is fairness important?

# Rayo Tech

You work at Rayo Tech, a fast-growing tech company and close rival to Google. Due to the rapid growth of the company, Rayo Tech has been receiving thousands upon thousands of job applications - completely overwhelming the recruitment team.

As a machine learning engineer you have been tasked with helping automate some of the recruiting process. You are to develop a system that automatically determines whether or not a candidate should receive an interview based on their resume.

**Note:** Some form of automation is used in practically every large company--there are even third party services that provide tools! Read more about automated hiring [here](https://www.hirevue.com/blog/hiring/automated-hiring-processes).

**Aside:**  What's in a name? https://en.wikipedia.org/wiki/Rayo%27s_number

## Setting up the problem

**Questions:** What kind of data do you need to develop your system? What tools are applicable?

In [None]:
#@title Run this to download data and prepare the environment.
import pandas as pd
import plotly.express as px
import plotly.io as pio
import random
import plotly.figure_factory as ff
import sklearn.model_selection
import sklearn.metrics
import numpy as np

pio.templates.default = "plotly_white"

SKILLS = [
    "Java",
    "Python",
    "Recruiting",
    "Web_Development",
    "Databases",
    "Machine_Learning",
    "Materials",
    "AutoCAD",
    "Data_Science",
    "Art",
    "Design",
    "Marketing",
    "Finance",
    "Accounting",
    "Writing",
    "Cloud_Computing",
    "Unix",
    "Windows",
    "Teamwork",
    "Organization",
]

HOBBIES = [
    "Basketball",
    "Tennis",
    "Swimming",
    "Running",
    "Chess",
    "Painting",
    "Hand_Stand",
]

PROTECTED = [
    "URM",
    "Female",
    "Disability",
]

OTHER = [
    "Years_Experience",
    "GPA",
    "Prestigious_University",
]

COLUMNS = ["Interview"] + PROTECTED + OTHER + SKILLS + HOBBIES

SKILLS_AND_HOBBIES = SKILLS + HOBBIES
FEATURES = SKILLS_AND_HOBBIES + OTHER + PROTECTED
FEATURES_WITHOUT_PROTECTED = SKILLS_AND_HOBBIES + OTHER

def print_applicant(data, i):
  row = data.iloc[i].to_dict()
  print(f"Applicant {i}")
  print(f"\tGPA: {row['GPA']:.2f}")
  print(f"\tYears of Experience: {row['Years_Experience']}")

  skills = ", ".join([k for k, v in row.items() if v == 1 and k in SKILLS])
  hobbies = ", ".join([k for k, v in row.items() if v == 1 and k in HOBBIES])
  protected = ", ".join([k for k, v in row.items() if v == 1 and k in PROTECTED])
  print(f"\tSkills: {skills}")
  print(f"\tHobbies: {hobbies}")
  print(f"\tProtected Attributes: {protected}")
  print()

  # ! pip install aif360 &> /dev/null
! wget -O data.csv 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Fairness/data.csv' &> /dev/null

# Part 1: Exploring the data

**Note:** For privacy reasons, the dataset we are using is synthetically
generated - applicants are randomly sampled to have certain characteristics. The applicants are then either selected or rejected for the interview based on a randomized hiring algorithm. Read more after class about how this dataset was generated [here](https://docs.google.com/document/d/1Kn6DueFnSrEs2hvniEqoHxnCe4jp43JZDvBj4LgaH2Y/edit?usp=sharing).

In [None]:
data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,Female,URM,Disability,Years_Experience,GPA,Prestigious_University,Java,Python,Recruiting,Web_Development,Databases,Machine_Learning,Materials,AutoCAD,Data_Science,Art,Design,Marketing,Finance,Accounting,Writing,Cloud_Computing,Unix,Windows,Teamwork,Organization,Basketball,Tennis,Swimming,Running,Chess,Painting,Hand_Stand,Interview
0,0,1,0,2,3.249054,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0
1,1,0,1,3,2.836575,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,0,1,0,0
2,1,0,0,1,3.170914,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,0,1,0,0,0,0
3,1,0,0,1,3.35647,0,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,1
4,1,0,0,1,2.678728,1,0,1,1,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,0,1,1,0


**Question:** Any skills you would like clarification on?

Each row represents an applicant.

There are 3 protected features:
* Female: 1 if the applicant is female, 0 otherwise
* URM: 1 if the applicant is an underrepresented minority, 0 otherwise
* Disability: 1 if the applicant has a disability, 0 otherwise

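The remaining columns record skills and hobbies (marked 1 if the applicant has them and 0 otherwise), along with `Years_Experience`, `GPA`, `Prestigious_University`, and the outcome column `Interview` (1 if the applicant was interviewed, 0 otherwise).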

**Exercise:** Print out the details of 4 random candidates and store a list of their index numbers in a variable named `applicants`.

Hint: Use `random.randrange(len(data))` to get a random row of the dataset.

Hint: Use the `print_applicant(data, i)` function to print out the details of the applicant corresponding to the $i^{th}$ row of the dataset.
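
One caveat with the first hint: calling `random.randrange` several times can occasionally pick the same row twice. Here is a minimal sketch of one alternative, assuming you want four distinct applicants. The value `n_rows` is a stand-in for `len(data)` and is only an assumption for this example.

```python
import random

n_rows = 1000  # stand-in for len(data); an assumption for this sketch

# random.sample draws without replacement, so the four indices are all different
sampled_indices = random.sample(range(n_rows), 4)
print(sampled_indices)
```

Either approach works for this exercise; `random.sample` just guarantees that you look at four different people.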

In [None]:
### BEGIN YOUR CODE HERE ####

applicants = [random.randrange(len(data)) for _ in range(4)]

for i in applicants:
  print_applicant(data, i)


### END YOUR CODE HERE ####

Applicant 662
	GPA: 3.57
	Years of Experience: 3.0
	Skills: Java, Python, Data_Science, Design, Unix, Windows, Teamwork, Organization
	Hobbies: Basketball
	Protected Attributes: 

Applicant 259
	GPA: 3.94
	Years of Experience: 1.0
	Skills: Python, Web_Development, Databases, AutoCAD, Design, Writing, Teamwork, Organization
	Hobbies: Tennis, Running, Painting
	Protected Attributes: Female

Applicant 13
	GPA: 3.19
	Years of Experience: 3.0
	Skills: AutoCAD, Design, Teamwork
	Hobbies: Chess
	Protected Attributes: 

Applicant 317
	GPA: 3.24
	Years of Experience: 2.0
	Skills: Finance, Accounting, Writing, Windows, Teamwork, Organization
	Hobbies: Swimming
	Protected Attributes: Female, URM



**Discuss:** Let's say you're trying to hire a new software engineer. Which features would you pay the most attention to? Who would you hire?

Run the next cell to see the results of the applicants above.

In [None]:
for i in applicants:
  interview = data["Interview"][i]
  if interview == 0:
    print("Applicant", i, "was not interviewed")
  else:
    print("Applicant", i, "was interviewed")

Applicant 662 was interviewed
Applicant 259 was not interviewed
Applicant 13 was not interviewed
Applicant 317 was not interviewed


**Question:** What are some interesting questions we can ask about the dataset?

## Interview Rate
**Exercise:** What is the percentage of applicants who get interviewed?

Hint: Use `data["Interview"]` to access the interview column of `data`.

Looking at the format of the data, how can you compute the percentage of applicants who get interviewed?

In [None]:
def interview_rate(data):
  ### BEGIN YOUR CODE HERE ####
  return data["Interview"].mean()
  ### END YOUR CODE HERE ####

print("Percentage of people who get interviews:")
print(str(interview_rate(data)*100)+"%")

Percentage of people who get interviews:
29.7%


In [None]:
#@title Run to check your work!
assert np.isclose(interview_rate(data), 0.297), "Oops. There's something wrong with your interview rate implementation"
print("Well done")

Well done


## Skill Availability

Let's take a look at the skills possessed by our applicant pool and how available they are!

In [None]:
def plot_skill_availability(data):
  total = data[SKILLS_AND_HOBBIES].sum().to_frame("Count")
  total = total.sort_values(by="Count")
  return px.bar(total, x=total.index, y='Count')

plot_skill_availability(data)

**Questions:** What skills are you looking for in an applicant? Which skills and hobbies are rare?

## Does GPA matter?
Aside from skills, which are binary (categorical) variables, we also have some numerical information about each candidate. In particular, we know each candidate's GPA and their years of experience.

In [None]:
px.scatter(data, x='GPA', y="Interview")

**Questions:** Is a high GPA enough to get you a job? Can you get a job even if you have a low GPA?

Unfortunately, the plot is a little cluttered. Instead let's ask a more concrete question: What is the interview rate for candidates who have a GPA of $x$? Because GPA is a continuous variable, this question is pretty hard to answer since it's very unlikely that two people have the exact same GPA!

To get around this, we ask the related question: What is the interview rate for candidates within a certain GPA interval?
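
To see how this binning works before applying it to the real data, here is a small sketch on made-up GPAs (the values and bin edges are assumptions chosen for illustration). `pd.cut` assigns each value to an interval that is open on the left and closed on the right by default.

```python
import numpy as np
import pandas as pd

# Made-up GPAs, not taken from the project dataset
toy_gpas = pd.Series([2.9, 3.0, 3.1, 3.4, 3.45, 3.9], name="GPA")

# Bin edges 2.5, 3.0, 3.5, 4.0 (np.arange excludes the stop value, hence 4.01)
toy_bins = np.arange(2.5, 4.01, 0.5)

# Each GPA is replaced by the interval it falls into, e.g. 3.0 -> (2.5, 3.0]
toy_intervals = pd.cut(toy_gpas, toy_bins)
print(toy_intervals)
```

Any value that falls outside all of the edges is assigned `NaN` and is left out of the groups, so it is worth checking that your edges cover the full range of GPAs.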

In [None]:
MIN_GPA = 2.25
MAX_GPA = 4.2
INTERVAL_WIDTH = 0.1

interview_rate_by_gpa = (data.groupby(pd.cut(data["GPA"], np.arange(MIN_GPA, MAX_GPA, INTERVAL_WIDTH))) # Group applicants based on which interval their GPA is in
                            .apply(interview_rate) # Apply our interview_rate function to each group
                            .to_frame("Interview_Rate") # Convert from Series to DataFrame (used for plotting)
                            .reset_index()
                            .astype({"GPA":str})) # Need to change the GPA column to a string so we can plot it
interview_rate_by_gpa

Unnamed: 0,GPA,Interview_Rate
0,"(2.25, 2.35]",0.0
1,"(2.35, 2.45]",0.0
2,"(2.45, 2.55]",0.4
3,"(2.55, 2.65]",0.25
4,"(2.65, 2.75]",0.333333
5,"(2.75, 2.85]",0.088235
6,"(2.85, 2.95]",0.313433
7,"(2.95, 3.05]",0.234783
8,"(3.05, 3.15]",0.290541
9,"(3.15, 3.25]",0.330709


In [None]:
px.bar(interview_rate_by_gpa, x='GPA', y="Interview_Rate")

**Exercise:** Try changing the interval width to get more fine-grained data!

## Optional: Does the number of years of experience matter?

**Exercise:** Mimic the code above to generate a scatterplot for years of experience and answer the following questions:

**Questions:** Is a large number of years of experience good enough to get you a job? Can you get a job without any experience?

In [None]:
### BEGIN YOUR CODE HERE ####
interview_rate_by_years_experience = (data.groupby("Years_Experience") # Group applicants by their years of experience
                                          .apply(interview_rate) # Apply our interview_rate function to each group
                                          .to_frame("Interview_Rate") # Convert from Series to DataFrame (used for plotting)
                                          .reset_index()) # Turn Years_Experience back into a regular column for plotting

px.bar(interview_rate_by_years_experience, x='Years_Experience', y="Interview_Rate")
### END YOUR CODE HERE ####

**Do years of experience matter?**

**Exercise:** Follow the code that plots interview rate for GPA intervals to plot the interview rate for candidates with different years of experience.

Hint: Since years of experience is an integer, we don't need to use intervals. This means in the call to `groupby`, we can just use `groupby("Years_Experience")`.


In [None]:
### BEGIN YOUR CODE HERE ####




### END YOUR CODE HERE ####

# Protected Characteristics

In this section, we want to answer some questions related to protected characteristics. Are there significant differences in the applicants' skills across different protected groups? Are there significant differences in the outcomes?

In [None]:
#@title Run to compare the prevalence of different skills across different groups of applicants
skills_by_female = data.groupby("Female")[SKILLS_AND_HOBBIES].mean().T.reset_index().melt(id_vars="index").rename(columns={"index":"Skills", "value":"Prevalence"})
px.bar(skills_by_female, x='Skills', color='Female', y='Prevalence', barmode='group', title="Skill Prevalence by Female").show()

skills_by_urm = data.groupby("URM")[SKILLS_AND_HOBBIES].mean().T.reset_index().melt(id_vars="index").rename(columns={"index":"Skills", "value":"Prevalence"})
px.bar(skills_by_urm, x='Skills', color='URM', y='Prevalence', barmode='group', title="Skill Prevalence by URM").show()

skills_by_disability = data.groupby("Disability")[SKILLS_AND_HOBBIES].mean().T.reset_index().melt(id_vars="index").rename(columns={"index":"Skills", "value":"Prevalence"})
px.bar(skills_by_disability, x='Skills', color='Disability', y='Prevalence', barmode='group', title="Skill Prevalence by Disability").show()

**Question:** What are some noticeable differences you see across the protected groups?


## Measuring Bias

**Question:** What are some ways we can measure bias?



Here is one way to think about it - there is bias if being in a protected group decreases your chances of getting an interview.

**Exercise:** Use the `interview_rate` function to compare the interview rate for female applicants vs. non-female applicants. Repeat for other protected characteristics.

**Extra credit:** See if you can use the `groupby` function with `apply` (like we did when we computed the interview rates for different years of experience).
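
As a rough illustration of the idea, here is a sketch on a tiny made-up table (not the project dataset). It computes the interview rate for each group and the gap between them. Selecting the `Interview` column and taking its mean is one option; `.apply(interview_rate)` should give the same numbers.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Female":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Interview": [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column within each group is that group's interview rate
toy_rates = toy.groupby("Female")["Interview"].mean()
print(toy_rates)

# Gap between the groups: non-female rate minus female rate
toy_gap = toy_rates.loc[0] - toy_rates.loc[1]
print(toy_gap)
```

A positive gap here means the non-female group is interviewed more often in the toy data. Whether a gap of a given size counts as bias is exactly the kind of question to discuss.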

In [None]:
### BEGIN YOUR CODE HERE ####

print("Non Female Applicants: " + str(interview_rate(data[data["Female"]==0])))
print("Female Applicants: " + str(interview_rate(data[data['Female']==1])))


### END YOUR CODE HERE ####

Non Female Applicants: 0.3333333333333333
Female Applicants: 0.2587268993839836


In [None]:
#@title Check your work and visualize the results here

df = pd.DataFrame([data.groupby('Female')["Interview"].mean(),data.groupby('URM')["Interview"].mean(), data.groupby('Disability')["Interview"].mean()], index=["Female", "URM", "Disability"]).reset_index().melt(id_vars='index').rename(columns={"index":"Protected", 'value':'Percentage'})
px.bar(df, x='Protected', color = 'variable', y='Percentage', barmode='group')

## Intersectional Identities

An applicant may be part of more than one protected group, and the interview process may be more biased against some intersectional identities than others. Let's check the biases for intersectional identities.

**Exercise:** Use the `groupby` and `interview_rate` functions to compare the interview rate of intersectional identity groups. See the code for interview rate by years of experience for help on how to use the `groupby` function.
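
When you group by several columns at once, some groups can become very small, and a rate computed from a handful of applicants is noisy. Here is a sketch on made-up data (the column values are assumptions for illustration) that reports each group's size next to its interview rate.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Female":    [0, 0, 0, 1, 1, 1, 1, 1],
    "URM":       [0, 0, 1, 0, 0, 1, 1, 1],
    "Interview": [1, 0, 0, 1, 0, 0, 0, 1],
})

# Group by both columns; report the interview rate and the number of applicants
toy_summary = toy.groupby(["Female", "URM"]).agg(
    Interview_Rate=("Interview", "mean"),
    Count=("Interview", "size"),
)
print(toy_summary)
```

In the toy table, the `(0, 1)` group contains a single applicant, so its rate of 0 says very little. Keep the group sizes in mind when comparing intersectional rates on the real data.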

In [None]:
### BEGIN YOUR CODE HERE ####

data.groupby(['Female','URM', 'Disability']).apply(interview_rate)

### END YOUR CODE HERE ####

Female  URM  Disability
0       0    0             0.367847
             1             0.358974
        1    0             0.210526
             1             0.166667
1       0    0             0.268222
             1             0.257143
        1    0             0.221053
             1             0.285714
dtype: float64

In [None]:
#@title Run to visualize the results
df = data.groupby(["Female", "URM", "Disability"]).apply(interview_rate).reset_index().rename(columns={0: "Interview_Rate"})
df["Intersection"] = list(zip(df["Female"], df["URM"], df["Disability"]))
df["Intersection"] = df['Intersection'].astype(str)
px.bar(df, y='Interview_Rate', x="Intersection")

# Desirable Skills

**Question:** How do you define a skill to be 'desirable'?

There are a lot of good answers to this!

**Option 1**: A skill is desirable if it is rare.

**Question:** Is Option 1 a good definition? Why or why not? Check the skill availability graph from before to help you answer this.

Let's take a look below at the `Python` skill.



In [None]:
ff.create_annotated_heatmap(
    sklearn.metrics.confusion_matrix(data["Python"], data["Interview"]),
    x=['Not Interviewed', 'Interviewed'],
    y=['Does not know Python', 'Knows Python']
)

The figure you just printed is called a **confusion matrix.**

To make sense of this matrix, try answering the following questions:

* What percentage of people know Python?

* What percentage of people who know Python were interviewed?

* What percentage of people who do *not* know Python were interviewed?

**Discuss:** Do you think that Python is a desirable skill based on your calculations from the confusion matrix?
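
If you would rather compute these percentages than read them off the figure, here is one possible sketch using `pd.crosstab` on a tiny made-up table (not the project dataset). Rows correspond to the skill and columns to the interview outcome, matching the layout of the heatmap.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Python":    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Interview": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Raw counts: rows = knows Python (0/1), columns = interviewed (0/1)
toy_counts = pd.crosstab(toy["Python"], toy["Interview"])
print(toy_counts)

# normalize="index" divides each row by its row total,
# giving the interview rate within each row
toy_row_rates = pd.crosstab(toy["Python"], toy["Interview"], normalize="index")
print(toy_row_rates)

# Share of all applicants who know Python
toy_python_share = toy["Python"].mean()
print(toy_python_share)
```

In the toy table, 40% of applicants know Python, and 75% of those were interviewed compared with about 17% of the rest.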


# Congratulations, you made it through notebook 1!

You've explored how bias can manifest in different ways in the interview process and the impact of various factors on the interview rate. Below are optional exercises to keep your exploration going.

In our next notebook, we'll be formalizing metrics for fairness and will train models with different inputs to examine how the application process weighs demographic characteristics and skills.

# (Optional) Necessity and Sufficiency Scores

Here are two other ways to define desirable. Let's assign each skill two scores, using Python as an example:

1. **S1**: # of people who know Python and get an interview / # of people who get an interview
2. **S2**: # of people who know Python and get an interview / # of people who know Python

**Question:**  How do we interpret these two scores?

The two scores capture the idea of necessity and sufficiency.

**Question:** Which one is which?

**Read this only after answering the previous question.**

If a skill has an S1 score of 1, it means that every person who got an interview had this skill, meaning that this skill is in some sense *necessary* for getting an interview.

On the other hand, if a skill has an S2 score of 1, it means that every person who has this skill got an interview, making the skill *sufficient* for getting an interview.

We'll refer to these scores as necessity and sufficiency from now on. To recap, for any skill $s$,

$$\text{necessity}(s) = \frac{\text{Number of people who have skill } s \text{ and got an interview}}{\text{Number of people who get an interview}}$$

and

$$\text{sufficiency}(s) = \frac{\text{Number of people who have skill } s \text{ and got an interview}}{\text{Number of people who have skill } s}$$


In other words:

**necessity = you won't get an interview without the skill**

**sufficiency = the skill is enough to get you an interview on its own**
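
To make the two formulas concrete, here is a worked sketch that counts by hand on eight made-up applicants (the pairs below are assumptions for illustration, not rows from the dataset).

```python
# Toy applicants as (has_skill, got_interview) pairs; made up for illustration
toy_applicants = [(1, 1), (1, 1), (1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]

# Numerator shared by both scores: has the skill AND got an interview
both = sum(1 for skill, interview in toy_applicants if skill and interview)

# Denominators: everyone interviewed, and everyone with the skill
interviewed = sum(interview for _, interview in toy_applicants)
skilled = sum(skill for skill, _ in toy_applicants)

necessity_score = both / interviewed   # 2 of the 3 interviewed applicants have the skill
sufficiency_score = both / skilled     # 2 of the 4 applicants with the skill were interviewed
print(necessity_score, sufficiency_score)
```

The two scores share a numerator and differ only in what they divide by, which is why a skill can score high on one and low on the other.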

**Question:** Using the confusion matrix above, compute the necessity and sufficiency scores for `Python`:

In [None]:
print("Necessity score of Python is: ")

### BEGIN YOUR CODE HERE ####
print(253/(253+44))
### END YOUR CODE HERE ####

print("Sufficiency score of Python is: ")

### BEGIN YOUR CODE HERE ####
print(253/(253+287))
### END YOUR CODE HERE ####

Necessity score of Python is: 
0.8518518518518519
Sufficiency score of Python is: 
0.4685185185185185


Below we implement the necessity score.


In [None]:
def necessity(skill, data):
  return data[data["Interview"] == 1][skill].mean()

We can check here to see if the necessity score for `Python` is the same as we computed previously.

In [None]:
necessity("Python", data)

0.8518518518518519

**Exercise:** Implement the sufficiency score.


Hint: How do you select all the applicants with a certain skill, or all the applicants that were interviewed? Use the necessity score as an example to help you.

In [None]:
def sufficiency(skill, data):
  ### BEGIN YOUR CODE HERE ####
  return data[data[skill]==1]["Interview"].mean()
  ### END YOUR CODE HERE ####

Check your work:

In [None]:
sufficiency("Python", data)

0.4685185185185185

The code below uses your `necessity` and `sufficiency` functions to calculate the scores and formats them into a concise table. Ask your instructor if you have any questions about how this works!

In [None]:
def get_scores(data):
  scores_df = pd.DataFrame()
  scores_df['Skill'] = SKILLS_AND_HOBBIES
  scores_df['Necessity'] = scores_df["Skill"].transform(lambda x: necessity(x, data))
  scores_df['Sufficiency'] = scores_df["Skill"].transform(lambda x: sufficiency(x, data))
  return scores_df


scores_df = get_scores(data)

In [None]:
scores_df

Unnamed: 0,Skill,Necessity,Sufficiency
0,Java,0.286195,0.477528
1,Python,0.851852,0.468519
2,Recruiting,0.127946,0.193878
3,Web_Development,0.080808,0.269663
4,Databases,0.501684,0.448795
5,Machine_Learning,0.407407,0.484
6,Materials,0.074074,0.431373
7,AutoCAD,0.245791,0.388298
8,Data_Science,0.535354,0.477477
9,Art,0.178451,0.187279


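**Exercise:** Use the `df.sort_values(by=column_name)` function to see the skills with the highest S1 (necessity) and S2 (sufficiency) scores! Note that `sort_values` sorts in ascending order by default, so the highest scores appear at the bottom; pass `ascending=False` to list them first.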

In [None]:
#SORT BY S1

### BEGIN YOUR CODE HERE ####

scores_df.sort_values(by='Necessity')


### END YOUR CODE HERE ####

Unnamed: 0,Skill,Necessity,Sufficiency
26,Hand_Stand,0.020202,0.206897
11,Marketing,0.03367,0.227273
6,Materials,0.074074,0.431373
24,Chess,0.080808,0.176471
3,Web_Development,0.080808,0.269663
12,Finance,0.117845,0.154867
13,Accounting,0.124579,0.137546
2,Recruiting,0.127946,0.193878
22,Swimming,0.13468,0.273973
25,Painting,0.164983,0.281609


In [None]:
#SORT BY S2

### BEGIN YOUR CODE HERE ####

scores_df.sort_values(by='Sufficiency')

### END YOUR CODE HERE ####

Unnamed: 0,Skill,Necessity,Sufficiency
13,Accounting,0.124579,0.137546
12,Finance,0.117845,0.154867
24,Chess,0.080808,0.176471
9,Art,0.178451,0.187279
14,Writing,0.218855,0.19174
2,Recruiting,0.127946,0.193878
26,Hand_Stand,0.020202,0.206897
11,Marketing,0.03367,0.227273
3,Web_Development,0.080808,0.269663
22,Swimming,0.13468,0.273973


Run the cell below for a visualization:

In [None]:
fig = px.scatter(scores_df, x='Necessity', y='Sufficiency', text='Skill', title="Necessity vs. Sufficiency")
fig.update_traces(textposition='top center')

**Exercise:** Identify some skills that are:
* Necessary and sufficient
* Not necessary but sufficient
* Not sufficient but necessary
* Not sufficient and not necessary

**Question:** What are some drawbacks of this scoring system?

# (Optional) Taking a look at a score-based classifier

**Question: How could you use the necessity and sufficiency scores, GPA, and years of experience to make a very simple classifier?**

Remember:

**necessity = you won't get an interview without the skill**

**sufficiency = the skill is enough to get you an interview on its own**

**Challenge**: Work together to build a classifier using these datapoints on the training data to get the highest possible testing accuracy! (Next time we'll use machine learning to tackle this same problem)

The code below defines the training and testing set.

In [None]:
data_train, data_test = sklearn.model_selection.train_test_split(data, test_size=0.2, random_state=1)
x_train = data_train[FEATURES]
x_test = data_test[FEATURES]
y_train = data_train["Interview"]
y_test = data_test["Interview"]

Here is a baseline classifier to demonstrate how to use the helper functions. It takes the row which corresponds to an applicant as the input.

* It starts by defining a variable `score` and adds the candidate's GPA and years of experience to it.
* It then loops through all skills and if the applicant has that skill, it adds the S1 and S2 scores of those skills to `score`.
* If the score exceeds a certain threshold (in this case 15) and the applicant has the `Teamwork` skill it will predict yes, otherwise it will predict no.

Let's first test this model out.

In [None]:
def predict(applicant):
  # return 0
  score = applicant["GPA"] + applicant["Years_Experience"]
  for s in SKILLS_AND_HOBBIES:
    if applicant[s]:
      score += necessity(s, data_train)
      score += sufficiency(s, data_train)
  if score > 15 and applicant['Teamwork']:
      return 1
  else:
      return 0

In [None]:
preds_test = x_test.apply(lambda x: predict(x), axis=1)

In [None]:
print("Test Accuracy: ")
print(sklearn.metrics.accuracy_score(preds_test, y_test))
ff.create_annotated_heatmap(
    sklearn.metrics.confusion_matrix(preds_test, y_test),
    x=['Not Interviewed', 'Interviewed'],
    y=['Predicted No Interview', 'Predicted Interview']
)

Test Accuracy: 
0.755


**Questions:**
* What are some problems with this model?

* Do we care more about minimizing false positives or false negatives?
  
* What if we're focused on saving time?
* What if we're focused on getting the best applicants?

* What accuracy is achieved by the classifier that always predicts no interview? (Test this out by uncommenting the `return 0` line at the top of the `predict` function.)
* Is this a useful model?


**Exercise:** Spend the rest of the time in this session working on improving this model by editing the `predict` function! Use the confusion matrix to help diagnose the problem and make progress!
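
When diagnosing a model, it can help to count the four outcomes separately and compare your accuracy with the always-no baseline. Here is a sketch on made-up labels and predictions (both lists are assumptions for illustration).

```python
# Toy labels and predictions, made up: 1 = interview, 0 = no interview
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted interview
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted interview, but none given
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed an actual interview
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted no interview

accuracy = (tp + tn) / len(pairs)

# Accuracy of a classifier that always predicts 0
baseline_accuracy = y_true.count(0) / len(y_true)
print(tp, fp, fn, tn, accuracy, baseline_accuracy)
```

In this toy case, the model and the always-no baseline reach the same accuracy, even though the model makes both kinds of mistakes. That is a hint that accuracy alone may not tell you whether a model is useful.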



# Acknowledgements
* Data and notebook by Harry Sha. Email harryshahai@gmail.com for bugs/questions!