### Given the following confusion matrix, evaluate (by hand) the model's performance.


|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |


- In the context of this problem, what is a false positive?
- In the context of this problem, what is a false negative?
- How would you describe this model?

In [1]:
import pandas as pd

In [2]:
# In the context of this problem, what is a false positive?
# Treating "cat" as the positive class, a false positive (false alarm) is a
# prediction of cat when the picture was actually a dog. Count: 7

In [3]:
# In the context of this problem, what is a false negative?
# Treating "cat" as the positive class, a false negative (miss) is a
# prediction of dog when the picture was actually a cat. Count: 13

In [4]:
# How would you describe this model?
TP = 34
TN = 46
FP = 7
FN = 13
accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
print(f"The model is {accuracy}% accurate.")

The model is 80.0% accurate.
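
Accuracy alone gives a thin description of the model. As a rough sketch, assuming "cat" is the positive class (the same assumption used for the false positive and false negative answers above), the other common metrics can be computed from the same four counts:

```python
# Counts read from the confusion matrix above, with "cat" as the positive class.
TP, FP = 34, 7    # predicted cat: actually cat / actually dog
FN, TN = 13, 46   # predicted dog: actually cat / actually dog

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)      # of the predicted cats, how many are cats
recall = TP / (TP + FN)         # of the actual cats, how many were found
specificity = TN / (TN + FP)    # of the actual dogs, how many were called dogs

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity)]:
    print(f"{name}: {value:.2%}")
```

Recall comes out lower than precision here, which suggests the model misses cats more often than it mislabels dogs as cats.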


### You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?
- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [5]:
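# Which evaluation metric would be appropriate here?
# The team wants to catch as many of the defective ducks as possible, so the metric
# should penalize missed defects. That is recall for the "Defect" class (equivalently,
# specificity when "No Defect" is treated as the positive class).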

In [6]:
r_duck = pd.read_csv('untidy_data/c3.csv')
r_duck.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


In [7]:
# Value counts to see how many defects and non defects there are
r_duck.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [8]:
# Baseline: always predict the most common class ("No Defect"), which is also treated as the positive class
r_duck["baseline"] = r_duck.actual.value_counts().index[0]
r_duck.head()

Unnamed: 0,actual,model1,model2,model3,baseline
0,No Defect,No Defect,Defect,No Defect,No Defect
1,No Defect,No Defect,Defect,Defect,No Defect
2,No Defect,No Defect,Defect,No Defect,No Defect
3,No Defect,Defect,Defect,Defect,No Defect
4,No Defect,No Defect,Defect,No Defect,No Defect


In [9]:
# The baseline always predicts "No Defect", so "No Defect" is treated as the positive class.
# Defective ducks are then the actual negatives, and specificity, TN / (TN + FP), is the
# share of actual defects that the model correctly flags as "Defect".
actual_v_m1 = pd.crosstab(r_duck.model1, r_duck.actual)
actual_v_m1

actual     Defect  No Defect
model1
Defect          8          2
No Defect       8        182


In [10]:
# Model_1 accuracy and specificity
TP = 182
TN = 8
FP = 8
FN = 2
accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
accuracy_2 = (r_duck.actual == r_duck.model1).mean() * 100
print(f"The model is {accuracy}% accurate.")
print(f"The model is {round(accuracy_2,2)}% accurate with the DF calculation.")
# % of predicting TN out of all actual Negatives
specificity = TN/(TN + FP) * 100
print(f"The models specificity is {round(specificity,2)}%.")

The model is 95.0% accurate.
The model is 95.0% accurate with the DF calculation.
The models specificity is 50.0%.


In [11]:
actual_v_m2 = pd.crosstab(r_duck.model2, r_duck.actual)
actual_v_m2

actual     Defect  No Defect
model2
Defect          9         81
No Defect       7        103


In [12]:
# Model_2 accuracy and specificity
TP = 103
TN = 9
FP = 7
FN = 81
accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy_2 = (r_duck.actual == r_duck.model2).mean() * 100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {round(accuracy_2,2)}% accurate with the DF calculation.")
# % of predicting TN out of all actual Negatives
specificity = TN/(TN + FP) * 100
print(f"The models specificity is {round(specificity,2)}%.")

The model is 56.0% accurate.
The model is 56.0% accurate with the DF calculation.
The models specificity is 56.25%.


In [13]:
actual_v_m3 = pd.crosstab(r_duck.model3, r_duck.actual)
actual_v_m3

actual     Defect  No Defect
model3
Defect         13         86
No Defect       3         98


In [14]:
# Model_3 accuracy and specificity
TP = 98
TN = 13
FP = 3
FN = 86
accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
accuracy_2 = (r_duck.actual == r_duck.model3).mean() * 100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {round(accuracy_2,2)}% accurate with the DF calculation.")
# % of predicting TN out of all actual Negatives
specificity = TN/(TN + FP) * 100
print(f"The models specificity is {round(specificity,2)}%.")

The model is 55.5% accurate.
The model is 55.5% accurate with the DF calculation.
The models specificity is 81.25%.
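
The three cells above repeat the same arithmetic with hand-copied counts. A minimal sketch of the same comparison in one pass is shown below; the `crosstabs` dictionary and the `defect_recall` helper are illustrative names (not part of the original notebook), and the counts are copied from the crosstab outputs above.

```python
# Cell counts copied from the three crosstabs above,
# keyed as (predicted, actual) -> number of ducks.
crosstabs = {
    "model1": {("Defect", "Defect"): 8, ("Defect", "No Defect"): 2,
               ("No Defect", "Defect"): 8, ("No Defect", "No Defect"): 182},
    "model2": {("Defect", "Defect"): 9, ("Defect", "No Defect"): 81,
               ("No Defect", "Defect"): 7, ("No Defect", "No Defect"): 103},
    "model3": {("Defect", "Defect"): 13, ("Defect", "No Defect"): 86,
               ("No Defect", "Defect"): 3, ("No Defect", "No Defect"): 98},
}

def defect_recall(counts):
    """Share of actually defective ducks that the model flagged as Defect."""
    caught = counts[("Defect", "Defect")]
    missed = counts[("No Defect", "Defect")]
    return caught / (caught + missed)

scores = {name: defect_recall(counts) for name, counts in crosstabs.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2%} of actual defects caught")
print(f"Best model for finding defects: {best}")
```

This is the same quantity the notebook calls specificity (with "No Defect" as the positive class), expressed as recall for the "Defect" class.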


In [15]:
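from sklearn.metrics import accuracy_score, recall_score, precision_score, classification_report
# Note: classification_report expects (y_true, y_pred). The predictions are passed first
# here, so in these reports the precision and recall columns are swapped relative to the
# usual definitions, and support counts predictions per class rather than actual values.
print(classification_report(r_duck.model1, r_duck.actual))
print(classification_report(r_duck.model2, r_duck.actual))
print(classification_report(r_duck.model3, r_duck.actual))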

              precision    recall  f1-score   support

      Defect       0.50      0.80      0.62        10
   No Defect       0.99      0.96      0.97       190

    accuracy                           0.95       200
   macro avg       0.74      0.88      0.79       200
weighted avg       0.96      0.95      0.96       200

              precision    recall  f1-score   support

      Defect       0.56      0.10      0.17        90
   No Defect       0.56      0.94      0.70       110

    accuracy                           0.56       200
   macro avg       0.56      0.52      0.44       200
weighted avg       0.56      0.56      0.46       200

              precision    recall  f1-score   support

      Defect       0.81      0.13      0.23        99
   No Defect       0.53      0.97      0.69       101

    accuracy                           0.56       200
   macro avg       0.67      0.55      0.46       200
weighted avg       0.67      0.56      0.46       200
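
scikit-learn's `classification_report` takes the true labels first and the predictions second. The sketch below uses plain Python rather than scikit-learn, rebuilding model1's 200 rows from its crosstab counts, to show how the argument order affects the "Defect" row. The `precision_recall` helper is a hypothetical name introduced only for this illustration.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp / (tp + fp), tp / (tp + fn)

# Rebuild model1's rows from the crosstab counts:
# 8 defects caught, 2 false alarms, 8 defects missed, 182 correct "No Defect".
actual = ["Defect"] * 8 + ["No Defect"] * 2 + ["Defect"] * 8 + ["No Defect"] * 182
model1 = ["Defect"] * 8 + ["Defect"] * 2 + ["No Defect"] * 8 + ["No Defect"] * 182

correct_order = precision_recall(actual, model1, "Defect")
swapped_order = precision_recall(model1, actual, "Defect")

print("true labels first:", correct_order)
print("predictions first:", swapped_order)
```

With the predictions passed first, the two values trade places, which matches the 0.50 precision and 0.80 recall printed for "Defect" in the first report above.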



### Model 3 has the highest specificity at 81.25%: with "No Defect" as the positive class, it correctly flags the largest share of the actual defects, so it is the best fit for this use case.

Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

In [16]:
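# They want us to predict defects but don't want to accidentally give out a vacation
# when there is no defect. Keeping "No Defect" as the positive class, that mistake is a
# false negative (actual No Defect, predicted Defect), so we use recall (the true
# positive rate) for "No Defect" and want it as high as possible.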
# model_1
TP = 182
TN = 8
FP = 8
FN = 2

recall = TP/(TP + FN) * 100
print(f"The models recall is {round(recall,2)}%.")

The models recall is 98.91%.


In [17]:
# The boolean mask here is actual == positive. Using DF to calculate recall
positive = "No Defect"
subset = r_duck[r_duck.actual == positive]
model1_recall = (subset.model1 == subset.actual).mean() * 100
#subset = r_duck[r_duck.baseline == positive]
baseline_recall = (subset.baseline == subset.actual).mean() * 100
print(f"The models recall is {round(model1_recall,2)}% using the DF calculation.")
print(f"The baseline recall is {baseline_recall}% using the DF calculation.")

The models recall is 98.91% using the DF calculation.
The baseline recall is 100.0% using the DF calculation.


In [18]:
# model_2
TP = 103
TN = 9
FP = 7
FN = 81

recall = TP/(TP + FN) * 100
print(f"The models recall is {round(recall,2)}%.")

The models recall is 55.98%.


In [19]:
# Reuses the subset where actual == positive. Using DF to calculate recall
model2_recall = (subset.model2 == subset.actual).mean() * 100
print(f"The models recall is {round(model2_recall,2)}% using the DF calculation.")

The models recall is 55.98% using the DF calculation.


In [20]:
# model_3
TP = 98
TN = 13
FP = 3
FN = 86
recall = TP/(TP + FN) * 100
print(f"The models recall is {round(recall,2)}%.")

The models recall is 53.26%.


In [21]:
# Reuses the subset where actual == positive. Using DF to calculate recall
model3_recall = (subset.model3 == subset.actual).mean() * 100
print(f"The models recall is {round(model3_recall,2)}% using the DF calculation.")

The models recall is 53.26% using the DF calculation.


### Of the 3 models, Model_1 has the highest recall (true positive rate, AKA sensitivity) at 98.91%.
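
The same question can also be checked from the other direction. Treating "Defect" as the positive class instead (an assumption that differs from the cells above), an unearned vacation is a false positive, and the matching metric is precision for "Defect". A small sketch with counts copied from the crosstabs, using an illustrative `predicted_defect` dictionary:

```python
# For each model: (true positives, false positives) with "Defect" as positive,
# i.e. the "Defect" prediction row of each crosstab above.
predicted_defect = {
    "model1": (8, 2),
    "model2": (9, 81),
    "model3": (13, 86),
}

defect_precision = {
    name: tp / (tp + fp) for name, (tp, fp) in predicted_defect.items()
}
best = max(defect_precision, key=defect_precision.get)

for name, value in defect_precision.items():
    print(f"{name}: {value:.2%} of predicted defects are real defects")
print(f"Fewest unearned vacations per prediction: {best}")
```

Both framings point to the same model, since model 1 raises far fewer false alarms than the other two.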

### 3. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

- At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

- Several models have already been developed with the data, and you can find their results here.

- Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [22]:
paws = pd.read_csv('untidy_data/gives_you_paws.csv')
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [23]:
# There are more dogs than cats, so the baseline model always predicts dog
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [24]:
baseline = 3254/(3254+1746) * 100
baseline

65.08

In [25]:
# Based on the value counts, predicting dog is the baseline
paws['baseline'] = paws.actual.value_counts().index[0]
paws.head()

Unnamed: 0,actual,model1,model2,model3,model4,baseline
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


In [26]:
baseline_accuracy = (paws.baseline == paws.actual).mean() *100
baseline_accuracy

65.08

### A. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [27]:
# Model_1
a_v_m1 = pd.crosstab(paws.model1, paws.actual)
a_v_m1

actual   cat   dog
model1
cat     1423   640
dog      323  2614


In [28]:
# In terms of accuracy, how do the various models compare to the baseline model? Are any
# of the models better than the baseline?
# Model1
TP = 2614
TN = 1423
FP = 323
FN = 640

accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
model1_accuracy = (paws.model1 == paws.actual).mean()*100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {model1_accuracy}% accurate using the DF calculation.")

The model is 80.74% accurate.
The model is 80.74% accurate using the DF calculation.


In [29]:
# Model2
a_v_m2 = pd.crosstab(paws.model2, paws.actual)
a_v_m2

actual   cat   dog
model2
cat     1555  1657
dog      191  1597


In [30]:
# Model2
TP = 1597
TN = 1555
FP = 191
FN = 1657

accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
model2_accuracy = (paws.model2 == paws.actual).mean() *100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {round(model2_accuracy,2)}% accurate using the DF calculation.")

The model is 63.04% accurate.
The model is 63.04% accurate using the DF calculation.


In [31]:
#Model3
a_v_m3 = pd.crosstab(paws.model3, paws.actual)
a_v_m3

actual  cat   dog
model3
cat     893  1599
dog     853  1655


In [32]:
# Model3
TP = 1655
TN = 893
FP = 853
FN = 1599

accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
model3_accuracy = (paws.model3 == paws.actual).mean() *100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {round(model3_accuracy,2)}% accurate using the DF calculation.")

The model is 50.96% accurate.
The model is 50.96% accurate using the DF calculation.


In [33]:
# Model4
a_v_m4 = pd.crosstab(paws.model4, paws.actual)
a_v_m4

actual   cat   dog
model4
cat      603   144
dog     1143  3110


In [34]:
# Model4
TP = 3110
TN = 603
FP = 1143
FN = 144

accuracy = (TP + TN)/(TP + TN + FP + FN) * 100
accuracy
model4_accuracy = (paws.model4 == paws.actual).mean() * 100
print(f"The model is {round(accuracy,2)}% accurate.")
print(f"The model is {round(model4_accuracy,2)}% accurate using the DF calculation.")

The model is 74.26% accurate.
The model is 74.26% accurate using the DF calculation.


In [35]:
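# Based on the accuracy calculation, model 1 is the most accurate at 80.74%.
# Models 1 (80.74%) and 4 (74.26%) beat the 65.08% baseline; models 2 (63.04%) and 3 (50.96%) do not.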
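
To make the baseline comparison explicit, here is a short sketch that recomputes each accuracy from the crosstab diagonals and checks it against the baseline. The `correct` dictionary is an illustrative name; its counts are copied from the outputs above.

```python
# Baseline: always predict "dog", the most common class (3254 dogs, 1746 cats).
total = 3254 + 1746
baseline_accuracy = 3254 / total

# Correct predictions per model = cats called cat + dogs called dog.
correct = {
    "model1": 1423 + 2614,
    "model2": 1555 + 1597,
    "model3": 893 + 1655,
    "model4": 603 + 3110,
}

accuracies = {name: hits / total for name, hits in correct.items()}
beats_baseline = [name for name, acc in accuracies.items()
                  if acc > baseline_accuracy]

print(f"baseline: {baseline_accuracy:.2%}")
for name, acc in accuracies.items():
    verdict = "better" if acc > baseline_accuracy else "not better"
    print(f"{name}: {acc:.2%} ({verdict} than baseline)")
```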

### Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

In [64]:
# Want to use recall for phase 1. We're optimizing for true positive cases / (all positive cases)
# Phase 2 uses precision, the % of positive predictions that are correct. We're trying to minimize
# false positives (the model saying dog, but we have a cat)

In [86]:
#paws = pd.read_csv('gives_you_paws.csv')
act_vs_model1 = pd.crosstab(paws.model1, paws.actual)
print(act_vs_model1)
# Model1
TP = 2614
TN = 1423
FP = 323
FN = 640

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "dog"]
model1_recall = (subset.model1 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model1_recall,2)}% using the DF calculation.")

actual   cat   dog
model1            
cat     1423   640
dog      323  2614
The model has a recall rate of 80.33%.
The model has a recall rate of 80.33% using the DF calculation.


In [90]:
# Model2
TP = 1597
TN = 1555
FP = 191
FN = 1657

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "dog"]
model2_recall = (subset.model2 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model2_recall,2)}% using the DF calculation.")

The model has a recall rate of 49.08%.
The model has a recall rate of 49.08% using the DF calculation.


In [91]:
# Model3
TP = 1655
TN = 893
FP = 853
FN = 1599

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "dog"]
model3_recall = (subset.model3 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model3_recall,2)}% using the DF calculation.")

The model has a recall rate of 50.86%.
The model has a recall rate of 50.86% using the DF calculation.


In [92]:
# Model4
TP = 3110
TN = 603
FP = 1143
FN = 144

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "dog"]
model4_recall = (subset.model4 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model4_recall,2)}% using the DF calculation.")

The model has a recall rate of 95.57%.
The model has a recall rate of 95.57% using the DF calculation.


In [None]:
# Model 4 has the best recall for dogs at 95.57%: it finds the largest share of the actual dogs.
# Phase 2 uses precision, the % of positive (dog) predictions that are correct. We want to
# reduce false positives (cats predicted as dogs).

In [56]:
# Model1
TP = 2614
TN = 1423
FP = 323
FN = 640

precision = TP/(TP + FP) * 100
subset = paws[paws.model1 == "dog"]
model1_precision = (subset.model1 == subset.actual).mean() * 100
print(f"The model has a precision rate of {round(precision,2)}%.")
print(f"The model has a precision rate of {round(model1_precision,2)}% using the DF calculation.")

The model has a precision rate of 89.0%.
The model has a precision rate of 89.0% using the DF calculation.


In [58]:
# Model2
TP = 1597
TN = 1555
FP = 191
FN = 1657

precision = TP/(TP + FP) * 100
subset = paws[paws.model2 == "dog"]
model2_precision = (subset.model2 == subset.actual).mean() * 100
print(f"The model has a precision rate of {round(precision,2)}%.")
print(f"The model has a precision rate of {round(model2_precision,2)}% using the DF calculation.")

The model has a precision rate of 89.32%.
The model has a precision rate of 89.32% using the DF calculation.


In [59]:
# Model3
TP = 1655
TN = 893
FP = 853
FN = 1599

precision = TP/(TP + FP) * 100
subset = paws[paws.model3 == "dog"]
model3_precision = (subset.model3 == subset.actual).mean() * 100
print(f"The model has a precision rate of {round(precision,2)}%.")
print(f"The model has a precision rate of {round(model3_precision,2)}% using the DF calculation.")

The model has a precision rate of 65.99%.
The model has a precision rate of 65.99% using the DF calculation.


In [60]:
# Model4
TP = 3110
TN = 603
FP = 1143
FN = 144

precision = TP/(TP + FP) * 100
subset = paws[paws.model4 == "dog"]
model4_precision = (subset.model4 == subset.actual).mean() * 100
print(f"The model has a precision rate of {round(precision,2)}%.")
print(f"The model has a precision rate of {round(model4_precision,2)}% using the DF calculation.")

The model has a precision rate of 73.12%.
The model has a precision rate of 73.12% using the DF calculation.


In [None]:
# Model 2 shows the best precision for dogs at 89.32%, so we would recommend it for Phase II.
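
The recall and precision cells above repeat the same DataFrame subsetting once per model. One possible way to fold that into a loop is sketched below. The `toy` frame is hypothetical data with the same column layout as the predictions file (it is not the real dataset), and `recall_and_precision` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical toy data with the same column layout as the predictions file.
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat", "dog"],
    "model1": ["dog", "dog", "cat", "cat", "dog", "dog"],
    "model2": ["dog", "cat", "cat", "cat", "cat", "dog"],
})

def recall_and_precision(df, model, positive):
    """Recall and precision of one model column for the given positive class."""
    actual_pos = df[df.actual == positive]        # rows that really are positive
    predicted_pos = df[df[model] == positive]     # rows the model called positive
    recall = (actual_pos[model] == positive).mean()
    precision = (predicted_pos.actual == positive).mean()
    return recall, precision

results = {m: recall_and_precision(toy, m, "dog") for m in ["model1", "model2"]}

for model, (recall, precision) in results.items():
    print(f"{model}: recall {recall:.2%}, precision {precision:.2%}")
```

Applied to the real `paws` frame, the same helper could be called with `positive="cat"` for the cat team without re-deriving TP, FP, and FN by hand.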

### Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

In [None]:
# Phase I should be recall to optimize for # of true positive cases out of all actual positive cases

In [None]:
# Phase II should be precision to minimize False Positives

In [83]:
act_vs_model1 = pd.crosstab(paws.model1, paws.actual)
act_vs_model1

actual   cat   dog
model1
cat     1423   640
dog      323  2614


In [94]:
# Model1
# Have to adjust our confusion matrix to accommodate cats as the positive class
TP = 1423
TN = 2614
FP = 640
FN = 323

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "cat"]
model1_recall = (subset.model1 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model1_recall,2)}% using the DF calculation.")

The model has a recall rate of 81.5%.
The model has a recall rate of 81.5% using the DF calculation.


In [72]:
a_v_m2 = pd.crosstab(paws.model2, paws.actual)
a_v_m2

actual   cat   dog
model2
cat     1555  1657
dog      191  1597


In [95]:
# Model2
TP = 1555
TN = 1597
FP = 1657
FN = 191

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "cat"]
model2_recall = (subset.model2 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model2_recall,2)}% using the DF calculation.")

The model has a recall rate of 89.06%.
The model has a recall rate of 89.06% using the DF calculation.


In [74]:
a_v_m3 = pd.crosstab(paws.model3, paws.actual)
a_v_m3

actual  cat   dog
model3
cat     893  1599
dog     853  1655


In [97]:
# Model3
TP = 893
TN = 1655
FP = 1599
FN = 853

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "cat"]
model3_recall = (subset.model3 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
print(f"The model has a recall rate of {round(model3_recall,2)}% using the DF calculation.")

The model has a recall rate of 51.15%.
The model has a recall rate of 51.15% using the DF calculation.


In [77]:
a_v_m4 = pd.crosstab(paws.model4, paws.actual)
a_v_m4

actual   cat   dog
model4
cat      603   144
dog     1143  3110


In [78]:
# Model4
TP = 603
TN = 3110
FP = 144
FN = 1143

recall = (TP)/(TP + FN) * 100
subset = paws[paws.actual == "cat"]
model4_recall = (subset.model4 == subset.actual).mean() * 100
print(f"The model has a recall rate of {round(recall,2)}%.")
#print(f"The model has a recall rate of {round(model4_recall,2)}% using the DF calculation.")

The model has a recall rate of 34.54%.


In [None]:
# For Phase I, we would use model 2, which has the best recall for cats at 89.06%.
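
The cells above cover Phase I (recall) for the cat team, but no precision figures for Phase II appear. A minimal sketch of that missing step, using counts copied from the four crosstabs with "cat" as the positive class (the `cat_predictions` dictionary is an illustrative name):

```python
# For each model: (true positives, false positives) with "cat" as positive,
# i.e. the "cat" prediction row of each crosstab above.
cat_predictions = {
    "model1": (1423, 640),
    "model2": (1555, 1657),
    "model3": (893, 1599),
    "model4": (603, 144),
}

cat_precision = {
    name: tp / (tp + fp) for name, (tp, fp) in cat_predictions.items()
}
best = max(cat_precision, key=cat_precision.get)

for name, value in cat_precision.items():
    print(f"{name}: {value:.2%} of predicted cats are actually cats")
print(f"Highest cat precision: {best}")
```

Under these counts, the model with the best cat precision is not the one with the best cat recall, so the Phase I and Phase II recommendations for the cat team would differ.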

