# Big Idea

- We will create a target ($y$-values) dataframe that can be balanced or imbalanced.  
- The dataset will have $n$ records.
- *Balanced* would mean that half of the elements of $y$ are $True$, half $False$.
- *Imbalanced* with parameter $imb$ would mean that $n \times imb$ elements would be $True$, and $n \times (1 - imb)$ would be $False$.  If $imb = 0.5$, then the set is balanced.  In our crash dataset, if we're looking to predict fatal crashes, $imb \approx 0.004$.
- Our *confusion matrix* for $y$ on itself is:

<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) <td> 0 
    <tr> <th scope="row"> P <td> 0 <td> n * imb 
</table>
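
As a rough sketch of the target described above, the following builds a boolean `y` with `round(n * imb)` `True` values and checks that the confusion matrix of `y` against itself is diagonal. The helper name `make_target`, the fixed seed, and the use of a NumPy array rather than a dataframe are assumptions made for illustration, not the notebook's own code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def make_target(n, imb, seed=0):
    """Hypothetical helper: shuffled boolean array with round(n * imb) True values."""
    rng = np.random.default_rng(seed)
    n_true = int(round(n * imb))
    y = np.zeros(n, dtype=bool)
    y[:n_true] = True
    rng.shuffle(y)  # shuffle in place so the True values are scattered
    return y

y_demo = make_target(n=1000, imb=0.004)

# Rows are actual (N, P); columns are predicted (N, P)
cm_demo = confusion_matrix(y_demo, y_demo, labels=[False, True])
print(cm_demo)
```

With $n = 1000$ and $imb = 0.004$, the diagonal holds $996$ negatives and $4$ positives, matching $n(1 - imb)$ and $n \cdot imb$.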
        
- We will create a dataframe of features ($x$-values), each a copy of $y$ in which some of the true negatives, some of the true positives, or both are flipped, so that we control its Accuracy, Precision, and Recall against $y$.
- For feature $x^{(i)}$, choose a parameter $p$ for modifying the $False$ values of $y$.  Start with $x^{(i)} = y$.  For each record $j$, choose a random $r \in [0,1)$.  If $r>p$ and $y_j == False$, then $x^{(i)}_j = True$, so each $False$ value is kept with probability $p$.  The expected confusion matrix for $x^{(i)}$ on $y$ is now:
        
        
<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) * p <td> n * (1 - imb) * (1-p) 
    <tr> <th scope="row"> P <td> 0 <td> n * imb 
</table>
        
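- Similarly, use parameter $q$ to flip some $True$ values: if $r>q$ and $y_j == True$, then $x^{(i)}_j = False$, so each $True$ value is kept with probability $q$.  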
        
<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) * p <td> n * (1 - imb) * (1-p) 
    <tr> <th scope="row"> P <td> n * imb * (1-q) <td> n * imb * q
</table>
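
The two perturbations can be sketched together as follows. This is a minimal illustration, assuming one independent uniform draw in $[0,1)$ per record, with negatives flipped when the draw exceeds `p` and positives flipped when it exceeds `q`. The function name `make_feature`, the seed, and the demo values are hypothetical, and the observed counts only approximate the expected values in the table.

```python
import numpy as np

def make_feature(y, p, q, seed=0):
    """Hypothetical helper: copy y, keeping each False w.p. p and each True w.p. q."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=bool)
    r = rng.random(y.size)      # one uniform draw in [0, 1) per record
    x = y.copy()
    x[~y & (r > p)] = True      # actual negatives flipped to True (false positives)
    x[y & (r > q)] = False      # actual positives flipped to False (false negatives)
    return x

# Demo target (assumed values): n = 100,000 records with imb = 0.1
y_demo = np.zeros(100_000, dtype=bool)
y_demo[:10_000] = True
x_demo = make_feature(y_demo, p=0.9, q=0.8)

tn = int(np.sum(~y_demo & ~x_demo))
fp = int(np.sum(~y_demo & x_demo))
fn = int(np.sum(y_demo & ~x_demo))
tp = int(np.sum(y_demo & x_demo))
print([[tn, fp], [fn, tp]])  # should land near the expected [[81000, 9000], [2000, 8000]]
```

Because the flips are random, the counts vary from seed to seed; only the row totals ($n(1 - imb)$ and $n \cdot imb$) are fixed.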
        
- This confusion matrix gives us these values for accuracy, precision, and recall.
        
\begin{align}
    Accuracy &= \frac{TN + TP}{TN + FP + FN + TP} \cr
        &= \frac{n(1-imb)p + n(imb)(q)}{n} \cr
        &= (1-imb)p + imb \cdot q \cr
\end{align}
        

\begin{align}
    Precision &= \frac{TP}{TP + FP} \cr
        &= \frac{n \cdot imb \cdot q}{n \cdot imb \cdot q + n \cdot (1-imb) \cdot (1-p)} \cr
        &= \frac{imb \cdot q}{imb \cdot q + (1-imb) \cdot (1-p)} \cr
\end{align}

\begin{align}
    Recall &= \frac{TP}{TP + FN} \cr
        &= \frac{n \cdot imb \cdot q}{n \cdot imb \cdot q + n \cdot imb (1-q)} \cr
        &= \frac{imb \cdot q}{imb \cdot q + imb (1-q)} \cr
        &= \frac{q}{q + (1-q)} \cr
        &= q
\end{align}
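
To get a feel for these expressions, the sketch below plugs in the crash-data imbalance $imb = 0.004$ together with illustrative, assumed values $p = 0.9$ and $q = 0.8$. The function name `expected_metrics` is hypothetical.

```python
def expected_metrics(imb, p, q):
    """Expected accuracy, precision, and recall from the closed-form expressions."""
    accuracy = (1 - imb) * p + imb * q
    precision = (imb * q) / (imb * q + (1 - imb) * (1 - p))
    recall = q
    return accuracy, precision, recall

acc, prec, rec = expected_metrics(imb=0.004, p=0.9, q=0.8)
print(f"accuracy={acc:.4f}  precision={prec:.4f}  recall={rec:.4f}")
```

For these inputs the accuracy is about $0.90$ and the recall is $0.8$, yet the precision is only about $0.03$: with so few positives, even a 10% false-positive rate among the negatives swamps the true positives.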


        




# Math Notes

- Confusion Matrix


<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> TN <td> FP 
    <tr> <th scope="row"> P <td> FN <td> TP 
</table>

- Accuracy is the proportion of predictions that are correct.

$$\frac{TP + TN}{TN + TP + FN + FP}$$

- Precision is the proportion of the things we predicted as positive that are actually positive.

$$\frac{TP}{TP + FP}$$

- Recall is the proportion of the things that are actually positive that we predicted as positive.

$$\frac{TP}{TP + FN}$$
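
The three definitions translate directly into code. The following is a small sketch working from the four cell counts; the function name `metrics_from_counts` and the choice to return `NaN` when a denominator is zero are assumptions, not something the notes above specify.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, precision, and recall from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Assumption: report NaN when nothing was predicted positive / nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall

# Illustrative counts (made up for the example): 100 records, 10 actual positives
acc, prec, rec = metrics_from_counts(tn=81, fp=9, fn=2, tp=8)
print(acc, prec, rec)
```

Here accuracy is $89/100$, precision is $8/17$, and recall is $8/10$.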

# Create the Target

In [None]:
y = []

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


import random


from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

# Create y-values