# Your Name: Ashley Tsoi 

# Your NetID: ast418

***

# Homework 3 - Part A

### <div style="color: red">Read Carefully Before Proceeding</div>

If you are having issues with running this code because of missing libraries, check the material that we've done in class for installation instructions. This code uses what we have already seen, so if you've been able to execute the code of the Notebooks we've seen in class, you will be fine here as well.


You need to answer all questions. Make sure that you answer both **technical** (code-related) and **non-technical** (conceptual) parts of this homework. A lot of code is already available for you, and you can build on that. You are free to use code from our notebooks in class.  All visualizations must be generated by your code, programmatically, unless explicitly mentioned otherwise by a question.


Once you're done, download the notebook via `File` -> `Download as` -> `Notebook`, which will fetch a file with an ".ipynb" extension. Include this file in your submission, as a separate document -- **not** in the word / pdf submission itself. In case you use additional code stored in another directory, make sure to submit that as well.

***

In [None]:
# a bunch of packages

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from scipy.spatial import distance

import math
import pyproj as proj

%matplotlib inline
import matplotlib.pylab as plt
plt.rcParams['figure.figsize'] = 10, 8



## Part 1 - Basic Evaluation

### 1.1 - Reconstructing a Confusion Matrix

Using the following values on some evaluation measures:

* \# Instances = 1000
* Accuracy = 70%
* Precision = 62.5%
* Recall = 71.43%

**Note:** There are several ways to present the confusion matrix. Either as a 2x2 square or with the individual values. Feel free to follow an approach that you prefer, but make sure that the information is legible and clearly represents what you computed. You may upload an image with the confusion matrix shown as part of your submission if you prefer.

**A) Reconstruct the corresponding confusion matrix.**

> **accuracy** = $\frac{TP+TN}{N}$ <br>
> **precision** = $\frac{TP}{TP+FP}$ <br>
> **recall** = $\frac{TP}{TP+FN}$

> $ \frac{1}{precision} + \frac{1}{recall} = \frac{1}{62.5\%} + \frac{1}{71.43\%} \approx 3 $
<br><br>
> $ \frac{1}{precision} + \frac{1}{recall} = \frac{TP+FP}{TP} + \frac{TP+FN}{TP} = \frac{2TP + FP + FN}{TP} = 2 + \frac{FP + FN}{TP} \approx 3 $
<br><br>
> $ \frac{FP + FN}{TP} \approx 1 $
<br><br>
> $ TP \approx FP + FN = 1000 - 70\%\times1000 = 300 $ <br>
--> $TN = 70\%\times1000 - TP = 400$
<br><br>
> $ 62.5\% = \frac{TP}{TP+FP} = \frac{300}{300+FP}$ <br>
--> $FP = 180$ --> $FN = 300 - FP = 120$
<br>

> |                 | Predicted: YES | Predicted: NO | TOTAL       |
> | --------------- | -------------- | ------------- | ----------- |
> | **Actual: YES** | `TP = 300`     | `FN = 120`    | TP+FN = 420 |
> | **Actual: NO**  | `FP = 180`     | `TN = 400`    | FP+TN = 580 |
> | **TOTAL**       | TP+FP = 480    | FN+TN = 520   | N = 1000    |

<br>
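
The algebra above can be double-checked numerically. The following is a minimal sketch, assuming only the four reported values; the function name `reconstruct_confusion_matrix` is hypothetical and does not come from the class notebooks. Instead of the $\approx 3$ shortcut, it uses $FP = TP(\frac{1}{precision} - 1)$, $FN = TP(\frac{1}{recall} - 1)$ and $FP + FN = N(1 - accuracy)$ to solve for $TP$ directly, then rounds to whole instances.

```python
def reconstruct_confusion_matrix(n, accuracy, precision, recall):
    """Recover (TP, FP, FN, TN) from N, accuracy, precision and recall.

    Hypothetical helper for illustration only.
    """
    errors = n * (1.0 - accuracy)  # FP + FN
    # FP + FN = TP * (1/precision + 1/recall - 2), so solve for TP.
    tp = errors / (1.0 / precision + 1.0 / recall - 2.0)
    fp = tp * (1.0 / precision - 1.0)
    fn = tp * (1.0 / recall - 1.0)
    # The reported percentages are rounded, so round to whole instances.
    tp, fp, fn = round(tp), round(fp), round(fn)
    tn = n - tp - fp - fn
    return tp, fp, fn, tn


tp, fp, fn, tn = reconstruct_confusion_matrix(1000, 0.70, 0.625, 0.7143)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=300 FP=180 FN=120 TN=400
```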

**B) Report the TPR value of that confusion matrix.** <br><br>
> $ TPR = \frac{TP}{TP+FN} = \frac{300}{420} \approx 71.43\% $<br>


**C) Report the FPR value of that confusion matrix.** <br><br>
> $ FPR = \frac{FP}{FP+TN} = \frac{180}{580} \approx 31.03\% $<br>
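
As a sanity check on parts A to C, the sketch below recomputes every measure from the four reconstructed counts. The helper name `rates_from_counts` is an assumption made for this example; the counts are the ones from the table above.

```python
def rates_from_counts(tp, fp, fn, tn):
    """Compute the evaluation measures used in this part from raw counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # recall is the same quantity as TPR
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


measures = rates_from_counts(tp=300, fp=180, fn=120, tn=400)
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
```

If the reconstruction is right, accuracy, precision and recall should match the values given in the question, and TPR and FPR should match the answers to B and C.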


### 1.2 - "Reading" a Cumulative Response Curve

As part of a recent project that you did for your business, you wanted to present the **savings that you'd get by targeting a subset of the available population**. The results are presented to the business side. After training and evaluating your classifier, you presented your results in the form of a **Cumulative Response Curve (CRC plot)**. The CRC that you got from your classifier is shown below.

<img src="imgs/crc.png" height="40%" width="40%" />

With that CRC in mind and your general knowledge of evaluation curves, answer the following questions.

#### Q1: Why pick a CRC ?

**In 2-3 sentences (max), discuss some reasons why a CRC is a better option over other evaluation curves (e.g., ROC) and measures (e.g., AUC, precision, etc), considering the original goal (see above).**

> Curves like ROC are common for visualizing **classification performance**, but they are not as **intuitive** for business stakeholders who want to understand the likely outcome of deploying the model. While measures such as AUC and precision are good single-number summaries of performance, a CRC makes it easier for the business side to connect the graph to the actual implications for their problem (in this case, how much they could save by targeting only a subset of the population).

> In short, a CRC is more straightforward and intuitive (especially for outsiders), while some of the other curves are more comprehensive for understanding model performance (for geeks/model-makers).
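
To make the "targeting a subset" reading concrete, here is a minimal sketch of how the points of a CRC can be computed. The scores and labels are made-up values for illustration, not the data behind the plot above, and `cumulative_response` is a hypothetical helper name.

```python
def cumulative_response(scores, labels):
    """Return (fraction targeted, fraction of positives captured) points.

    Instances are ranked from highest to lowest classifier score.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points = [(0.0, 0.0)]
    captured = 0
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((i / len(ranked), captured / total_positives))
    return points


# Hypothetical scores and true labels (1 = positive) for ten instances.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

curve = cumulative_response(scores, labels)
for targeted, captured in curve:
    print(f"target top {targeted:.0%} -> capture {captured:.0%} of positives")
```

Each printed line reads the same way as a point on the plot: the x value is the share of the population targeted, and the y value is the share of all positives reached by doing so.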

***

#### Q2: "Reading" the CRC

Imagine that you are presenting your findings and you want to illustrate the savings that you'd achieve by using your classifier. Using the previous CRC, give **3 examples** that will help your audience understand the trade-off between the savings and the performance that you'd (roughly) get.

> 1. **Customer churn** --
  If there are 100 potential customers that we can target, we can 

> 2. **Advertisement targeting** --
  

> 3. **Recruiting** -- 
  If


***

## Part 2 - Data Driven Recruiting

### Task 2.1 - Defining the problem

The recruiting paradigm has changed over the years, with several companies priding themselves in following data-driven hiring practices and processes. That is, they rely heavily on _data_ to decide whether someone should be hired or not, among other things.

You're helping build such a system and you're treating it as a _classification_ problem. An _instance_ is a candidate considered for a particular role. 

* What is the **target variable** of your problem?

Think carefully before you answer.  _Hiring_ is an action that we take for _something else_ that we are trying to capture / predict.  Consider the analogy with the marketing scenario that we've seen in class: we invite people (action) because we predicted they will _donate money_ (target variable). Similarly, we _hire_ people (action) because .... . The target variable has to be something **measurable / quantifiable**.


_Your answer here_

***


### Task 2.2 - Recruiting False Positives & False Negatives

We've seen that a confusion matrix has four corners, for a binary classification problem: $TP$, $FP$, $FN$ and $TN$. For some companies, the following motto has been heard in relation to their hiring practices:

"_A False Positive (FP) is worse / costlier than a False Negative (FN)_"

Very briefly answer the following:

* What is a False Positive in this case?
* What is a False Negative in this case?
* What do you think the above expression means?
* What _performance measure_ (i.e., accuracy, precision, recall, F1 measure, etc) do they try to optimize / improve following the logic of that expression? Explain your thinking.


Be careful: the question is **not** asking for a general definition of FP and FN. It is asking for what they mean in this context. As a hint, your answers should be something along the lines of "A FP refers to a candidate ....".

Your answer must also be aligned with your previous definition of a target variable.
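
For reference, the measures named in the last question can all be computed from the confusion matrix counts. The sketch below is only an illustration: the two sets of counts are invented, the labels `strict` and `lenient` are hypothetical, and nothing here is taken from a real hiring system.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts for two hypothetical screening models.
models = {
    "strict": {"tp": 30, "fp": 5, "fn": 40},
    "lenient": {"tp": 60, "fp": 40, "fn": 10},
}

results = {name: precision_recall_f1(**counts) for name, counts in models.items()}
for name, (precision, recall, f1) in results.items():
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

The point of the comparison is that a model making fewer false positives and a model making fewer false negatives can rank differently depending on which measure is used.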

_Your answer here_

***


## Part 3 - Dispatching Emergency Vehicles

Data Mining / Data Science is so pervasive that there is the consideration (in fact, implementation) of using such techniques to decide whether an emergency vehicle (police car, ambulance, etc) should be dispatched with respect to a 911 call.

An overly simplistic way to approach this problem is via **a classification framework that results in a "dispatch / don't dispatch car" action**.

**What is the proper way of evaluating such an approach**, i.e. which evaluation measure would you use? Accuracy? AUC? An evaluation curve? Something else? Explain your thinking.
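
When weighing the candidate measures, it can help to see how they behave on a concrete case. The sketch below assumes a made-up set of 1000 calls, 900 of which truly need a vehicle, and a trivial "always dispatch" rule; both the data and the helper name `evaluate_dispatch` are hypothetical.

```python
def evaluate_dispatch(actual, predicted):
    """Compute accuracy, TPR and FPR for binary labels (1 = dispatch needed)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Invented data: 900 calls that need a vehicle, 100 that do not.
actual = [1] * 900 + [0] * 100
always_dispatch = [1] * 1000

result = evaluate_dispatch(actual, always_dispatch)
print(result)
```

In this invented setting the trivial rule reaches 90% accuracy while sending a vehicle to every call that did not need one, which is one reason to look at more than a single number.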

_Your answer here_

***
