# Module 26 Topic Review

## Visualizing Confusion Matrices 
- Create a confusion matrix from scratch 
- Create a confusion matrix using scikit-learn 
- Visualize confusion matrices 


##### **Create a confusion matrix from scratch**
```Python
def conf_matrix(y_true, y_pred):
    conf_dict = {'TP':0,'TN':0,'FP':0,'FN':0}
    y_true = list(y_true)
    y_pred = list(y_pred)

    for i in range(len(y_true)):
        if y_true[i] == y_pred[i]: # True 
            if y_pred[i] == 1:
                conf_dict['TP'] += 1 #Positive
            else:
                conf_dict['TN'] += 1 #Negative

        else: #False
            if y_pred[i] == 1: #Positive
                conf_dict['FP'] += 1
            else: #Negative
                conf_dict['FN'] += 1

    return conf_dict

# Example output: {'TP': 38, 'TN': 26, 'FP': 7, 'FN': 5}
```

##### **Create a confusion matrix using scikit-learn**

```Python
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print confusion matrix
cnf_matrix = confusion_matrix(y_test,y_predicted)
print('Confusion Matrix:\n', cnf_matrix)

# Example output:
# Confusion Matrix:
#  [[26  7]
#  [ 5 38]]
```
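For binary labels 0 and 1, scikit-learn orders rows by true label and columns by predicted label, so the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below arranges the counts from the from-scratch dictionary into that layout. `to_sklearn_layout` is a hypothetical helper name used only for illustration, and the counts are the example values shown above.

```python
def to_sklearn_layout(conf_dict):
    """Arrange TP/TN/FP/FN counts as [[TN, FP], [FN, TP]].

    Rows are true labels (0, then 1); columns are predicted labels (0, then 1).
    """
    return [
        [conf_dict['TN'], conf_dict['FP']],  # true label 0
        [conf_dict['FN'], conf_dict['TP']],  # true label 1
    ]

# Example counts from the from-scratch function above
counts = {'TP': 38, 'TN': 26, 'FP': 7, 'FN': 5}
layout = to_sklearn_layout(counts)
print(layout)  # [[26, 7], [5, 38]]
```

Reading the matrix this way makes it easier to check a scikit-learn result against hand-computed counts.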

##### **Visualize confusion matrices**

```Python
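# Import ConfusionMatrixDisplay (plot_confusion_matrix was removed in scikit-learn 1.2)
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Visualize the confusion matrix for a fitted classifier
ConfusionMatrixDisplay.from_estimator(model_log, X_test, y_test)
plt.show()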
```
<img src='images/output.png' width=350>

## Evaluating Logistic Regression models
- Implement evaluation metrics from scratch using Python

**Precision** measures how often a positive prediction is correct: of all the observations the model labeled positive, what fraction actually are positive?  
 
$\Large \text{Precision} = \frac{\text{Number of True Positives}}{\text{Number of Predicted Positives}} $  


```Python
def precision(y, y_hat):
    y_y_hat = list(zip(y,y_hat))
    true_pos = sum([1 for i in y_y_hat if i[0]==1 and i[1]==1])
    false_pos = sum([1 for i in y_y_hat if i[0]==0 and i[1]==1])
    return true_pos/float(true_pos + false_pos)
```

**Recall** measures what proportion of the actual positive observations the model correctly identifies.

$\Large \text{Recall} = \frac{\text{Number of True Positives}}{\text{Number of Actual Total Positives}} $  

```Python
def recall(y, y_hat):
    y_y_hat = list(zip(y,y_hat))
    true_pos = sum([1 for i in y_y_hat if i[0]==1 and i[1]==1])
    false_neg = sum([1 for i in y_y_hat if i[0] == 1 and i[1] == 0])
    return true_pos/float(true_pos + false_neg)
```


**Accuracy** is the proportion of all observations that are predicted correctly (both positive and negative).  

$\Large \text{Accuracy} = \frac{\text{Number of True Positives + True Negatives}}{\text{Total Observations}} $  

```Python
def accuracy(y, y_hat):
    y_y_hat = list(zip(y,y_hat))
    true_pos = sum([1 for i in y_y_hat if i[0]==1 and i[1]==1])
    true_neg = sum([1 for i in y_y_hat if i[0] == 0 and i[1] == 0])
    return (true_pos + true_neg)/float(len(y_hat))
```

The **F1 score** is the *harmonic mean of precision and recall*, so it is high only when both precision and recall are high.  

$\Large \text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $  

```Python
def f1_score(y, y_hat):
    Precision = precision(y,y_hat)
    Recall = recall(y,y_hat)
    numer = Precision * Recall
    denom = Precision + Recall
    return 2*(numer/denom)
```
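As a worked example, the four metrics can also be computed directly from confusion-matrix counts. The sketch below uses the example counts from the confusion matrix section (TP=38, TN=26, FP=7, FN=5). `metrics_from_counts` and `safe_div` are hypothetical helper names, and returning 0.0 when a denominator is zero is an assumption made here, not something the functions above do.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    def safe_div(numer, denom):
        # Assumption: report 0.0 instead of raising when the denominator is zero
        return numer / denom if denom else 0.0

    precision = safe_div(tp, tp + fp)          # TP / predicted positives
    recall = safe_div(tp, tp + fn)             # TP / actual positives
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {'precision': precision, 'recall': recall,
            'accuracy': accuracy, 'f1': f1}

scores = metrics_from_counts(tp=38, tn=26, fp=7, fn=5)
for name, value in scores.items():
    print(f'{name}: {value:.3f}')
```

With these counts, precision is 38/45 (about 0.844), recall is 38/43 (about 0.884), accuracy is 64/76 (about 0.842), and F1 is about 0.864.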

#### ROC Curves and AUC
- Create a visualization of ROC curves and use it to assess a model 
- Evaluate classification models using the evaluation metrics appropriate for a specific problem 

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) of a classifier as the decision threshold varies.  

$$ \text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}} $$     

$$ \text{FPR} = \frac{\text{FP}}{\text{FP}+\text{TN}}$$
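To make the two formulas concrete, here is a minimal from-scratch sketch that sweeps a decision threshold over a handful of scores, records one (FPR, TPR) point per threshold, and estimates the area under the curve with the trapezoid rule. The function names `roc_points` and `trapezoid_auc` and the toy labels and scores are illustrative assumptions. The sketch also assumes both classes are present; it is not a replacement for scikit-learn's `roc_curve`.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    # Start above every score so the first point is (0, 0)
    thresholds = [float('inf')] + sorted(set(y_score), reverse=True)
    fpr, tpr = [], []
    for t in thresholds:
        # Predict positive whenever the score is at or above the threshold
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        tpr.append(tp / positives)  # TP / (TP + FN)
        fpr.append(fp / negatives)  # FP / (FP + TN)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Approximate the area under the ROC curve with the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy data (assumed for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, y_score)
roc_auc = trapezoid_auc(fpr, tpr)
print('FPR:', fpr)
print('TPR:', tpr)
print('AUC:', roc_auc)  # 0.75
```

Lowering the threshold moves the curve up and to the right: more positives are captured, at the cost of more false positives.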

<center><img src='images/decision_boundary.png' width=400></center>


```Python
# First calculate true positive rate(TPR) and false positive rate(FPR)
from sklearn.metrics import roc_curve, auc

y_score = logreg.fit(X_train, y_train).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Calculate area under the curve (AUC) and plot the ROC curve
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Seaborn's beautiful styling
sns.set_style('darkgrid', {'axes.facecolor': '0.9'})

print('AUC: {}'.format(auc(fpr, tpr)))
plt.figure(figsize=(10, 8))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

```

<img src='images/seaborn_roc.png' width=550>


#### Logistic Regression Model Comparisons
- Compare the different inputs with logistic regression models and determine the optimal model 

Compute ROC curves (and AUC) for several candidate models by adjusting hyperparameters such as:   
- the inverse regularization strength `C`
- the penalty type (e.g. L1, L2)
- the class weights  
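One way such a comparison might look for the `C` hyperparameter is sketched below. The synthetic data from `make_classification`, the list of `C` values, and the `liblinear` solver are all assumptions chosen for illustration; substitute your own training and test sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own X and y)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

auc_by_c = {}
for c in [0.001, 0.01, 0.1, 1, 10, 100]:
    # Smaller C means stronger regularization
    model = LogisticRegression(C=c, solver='liblinear')
    model.fit(X_train, y_train)
    y_score = model.decision_function(X_test)
    auc_by_c[c] = roc_auc_score(y_test, y_score)
    print(f'C={c}: AUC={auc_by_c[c]:.3f}')

# Pick the value of C with the highest AUC
best_c = max(auc_by_c, key=auc_by_c.get)
print('Best C by AUC:', best_c)
```

Selecting on the test set, as done here for brevity, gives an optimistic estimate; in practice a separate validation set or cross-validation is usually a better basis for the choice.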

<img src='images/roc_comparison.png' width=550>

#### Class Imbalance Problems
- Use sampling techniques to address a class imbalance problem within a dataset 
- Create a visualization of ROC curves and use it to assess a model

Class imbalance occurs when an overwhelming majority of observations belong to one class and only a small minority belong to the other. 

```Python
print('Raw counts: \n')
print(df['Class'].value_counts())
print('-----------------------------------')
print('Normalized counts: \n')
print(df['Class'].value_counts(normalize=True))

# Example output:
# Raw counts:
#
# 0    284315
# 1       492
# Name: Class, dtype: int64
# -----------------------------------
# Normalized counts:
#
# 0    0.998273
# 1    0.001727
# Name: Class, dtype: float64
```
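With counts like these, accuracy alone can be misleading. The sketch below takes the class counts from the example output above and scores a hypothetical baseline that predicts class 0 for every observation. The baseline is an assumption introduced for illustration, not a model from this notebook.

```python
# Class counts taken from the example output above
n_negative = 284315
n_positive = 492
total = n_negative + n_positive

# A trivial baseline that predicts class 0 for every observation
true_neg = n_negative   # every actual 0 is predicted 0
false_neg = n_positive  # every actual 1 is missed
true_pos = 0            # no positive predictions are ever made

accuracy = (true_pos + true_neg) / total
recall = true_pos / (true_pos + false_neg)

print(f'Accuracy: {accuracy:.6f}')  # 0.998273
print(f'Recall:   {recall:.6f}')    # 0.000000
```

The baseline is about 99.8% accurate yet never identifies a single minority-class observation, which is why recall, precision, and ROC/AUC are more informative on imbalanced data.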

The [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) class from the imbalanced-learn (`imblearn`) library can be used to improve a model's performance on the minority class. SMOTE stands for **Synthetic Minority Oversampling Technique**; it generates synthetic minority-class samples rather than duplicating existing ones.

```Python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Original class distribution
print(y_train.value_counts())

# Fit SMOTE to the training data only
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

# Preview the resampled class distribution
print('\n')
print(pd.Series(y_train_resampled).value_counts())

# Example output:
# 0    213233
# 1       372
# Name: Class, dtype: int64
#
# 1    213233
# 0    213233
# Name: Class, dtype: int64
```
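The core idea behind SMOTE is interpolation: a synthetic point is placed somewhere on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The sketch below is a simplified pure-Python illustration of that idea, not the imblearn implementation. `smote_like_sample`, the toy points, and the choice of `k=2` are assumptions made for this example.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a nearby minority point.

    A simplified illustration of the SMOTE idea, not the imblearn implementation.
    """
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    # Choose one of the k nearest neighbors at random
    neighbor = rng.choice(others[:k])
    # Pick a random position on the segment between base and neighbor
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]

# Toy minority-class points (assumed for illustration)
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Because each synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies instead of simply repeating existing rows.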

##### Thank you for exploring this notebook and viewing my content! Click the links below to explore further!

#### ---> [LIKE AND SUBSCRIBE TO THE YOUTUBE CHANNEL](https://www.youtube.com/channel/UCCrhddH1eLvb0iolx28r_ig)

#### ---> [EXPLORE MY GITHUB](https://github.com/Zeth-Abney/Flatiron-school)

#### ---> [CONNECT WITH ME ON LINKEDIN](https://www.linkedin.com/in/zeth-abney/)

#### ---> [DISCOVER WHAT FLATIRON SCHOOL HAS TO OFFER](https://flatironschool.com/courses/data-science-bootcamp/)