**Model Evaluation Exercises**

In [1]:
import pandas as pd
import sklearn.metrics as metrics

# 2

Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

We'll take a positive case to be a dog and negative case to be a cat.

In [2]:
TP = 46
TN = 34
FP = 13
FN = 7

# accuracy is total correct predictions over total observations.
accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy

0.8

In [3]:
# precision is true positives over total positive predictions.
precision = TP / (TP + FP)
precision

0.7796610169491526

In [4]:
# recall is true positives over total actual positives.
recall = TP / (TP + FN)
recall

0.8679245283018868
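As a cross-check on the hand calculations, the three formulas can be wrapped in one small helper. This is a minimal sketch: the function name `metrics_from_counts` is an assumption made for illustration, and the zero-division guards are an added convenience rather than part of the exercise.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # accuracy: correct predictions over all observations
    accuracy = (tp + tn) / total
    # precision: true positives over everything predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: true positives over everything actually positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Counts from the dog/cat confusion matrix, with dog as the positive class.
accuracy, precision, recall = metrics_from_counts(tp=46, tn=34, fp=13, fn=7)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# prints: 0.8 0.7797 0.8679
```

Precision divides by the predicted positives (TP + FP) and recall by the actual positives (TP + FN), which is the only difference between the two formulas.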

For our baseline we will predict dog every time.

|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         53 |         0  |
| actual cat    |         47 |         0  |

In [5]:
TP = 53
TN = 0
FP = 47
FN = 0

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy

0.53
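The baseline accuracy is simply the share of the most common class. A minimal sketch of that idea, assuming a hypothetical helper named `majority_baseline` and rebuilding the labels from the row totals of the confusion matrix (53 dogs, 47 cats):

```python
from collections import Counter


def majority_baseline(actual):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(actual)
    label, count = counts.most_common(1)[0]
    return label, count / len(actual)


# Rebuild the actual labels from the confusion-matrix row totals.
actual = ["dog"] * 53 + ["cat"] * 47
label, baseline_accuracy = majority_baseline(actual)
print(label, baseline_accuracy)
# prints: dog 0.53
```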

- In the context of this problem, what is a false positive?

A false positive would be predicting dog when really it was a cat.

- In the context of this problem, what is a false negative?

A false negative would be predicting cat when really it was a dog.

- How would you describe this model?

Neither a false positive nor a false negative carries a particular cost for this model, so we assess performance based on accuracy. The model makes accurate predictions 80% of the time versus 53% of the time for the baseline, so it performs better than the baseline overall.

---

# 3

You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

In [6]:
c3 = pd.read_csv('c3.csv')
c3.head()

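|    | actual    | model1    | model2 | model3    |
|---:|:----------|:----------|:-------|:----------|
|  0 | No Defect | No Defect | Defect | No Defect |
|  1 | No Defect | No Defect | Defect | Defect    |
|  2 | No Defect | No Defect | Defect | No Defect |
|  3 | No Defect | Defect    | Defect | Defect    |
|  4 | No Defect | No Defect | Defect | No Defect |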


Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

Given that defect is positive, we should optimize for recall since we do not want to miss positive cases.

In [7]:
recalls = []
columns = ['model1', 'model2', 'model3']

for column in columns:
    positive = c3[column] == 'Defect'
    correct = c3.actual == c3[column]
    tp = c3[positive & correct].shape[0]

    negative = c3[column] == 'No Defect'
    wrong = c3.actual != c3[column]
    fn = c3[negative & wrong].shape[0]
    
    recalls.append(tp / (tp + fn))

recalls

[0.5, 0.5625, 0.8125]

Model 3 has the best recall (0.8125), so it would be the best model for this use case.
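Recall can also be read as "of the rows that really are defects, what share did the model flag?", which pandas expresses with a filter followed by a mean. The sketch below uses a small made-up frame named `toy` in place of `c3.csv`; its values are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical toy data standing in for c3.csv (not the real dataset).
toy = pd.DataFrame({
    "actual": ["Defect", "Defect", "Defect", "Defect", "No Defect", "No Defect"],
    "model1": ["Defect", "No Defect", "Defect", "Defect", "No Defect", "Defect"],
    "model2": ["Defect", "No Defect", "No Defect", "No Defect", "No Defect", "No Defect"],
})

# Keep only the rows that really are defects, then take the share each model flagged.
is_defect = toy.actual == "Defect"
recalls = {
    column: (toy.loc[is_defect, column] == "Defect").mean()
    for column in ["model1", "model2"]
}
print(recalls)
```

The mean of a boolean Series is the fraction of `True` values, so no explicit true-positive or false-negative counts are needed.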

---

- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

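In this case precision is the appropriate metric, since the costly mistake is a false positive: predicting a defect (and giving out a vacation) when the duck has no defect.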

In [8]:
precisions = []
columns = ['model1', 'model2', 'model3']

for column in columns:
    # true positives: predicted Defect and the prediction was correct
    positive = c3[column] == 'Defect'
    correct = c3.actual == c3[column]
    tp = c3[positive & correct].shape[0]

    # false positives: predicted Defect but the prediction was wrong
    wrong = c3.actual != c3[column]
    fp = c3[positive & wrong].shape[0]
    
    precisions.append(tp / (tp + fp))

precisions

[0.8, 0.1, 0.13131313131313133]

Model 1 has the best precision (0.8), so it is the best fit for this use case.
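Another way to get the counts behind precision is to build the confusion matrix with `pd.crosstab` and read the cells off it. This is a sketch on a made-up frame named `toy`, not on the real `c3.csv` data.

```python
import pandas as pd

# Hypothetical toy data standing in for c3.csv (not the real dataset).
toy = pd.DataFrame({
    "actual": ["Defect", "Defect", "No Defect", "No Defect", "No Defect", "Defect"],
    "model1": ["Defect", "No Defect", "Defect", "No Defect", "No Defect", "Defect"],
})

# Rows are actual labels, columns are predicted labels.
matrix = pd.crosstab(toy.actual, toy.model1)

tp = matrix.loc["Defect", "Defect"]     # predicted Defect, really a defect
fp = matrix.loc["No Defect", "Defect"]  # predicted Defect, really no defect
precision = tp / (tp + fp)

print(matrix)
print(precision)
```

Reading the cells by label rather than by position avoids mixing up rows and columns, which is an easy mistake when precision and recall are computed side by side.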

---

# 4

You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

In [9]:
paws = pd.read_csv('gives_you_paws.csv')
paws.head()

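|    | actual | model1 | model2 | model3 | model4 |
|---:|:-------|:-------|:-------|:-------|:-------|
|  0 | cat    | cat    | dog    | cat    | dog    |
|  1 | dog    | dog    | cat    | cat    | dog    |
|  2 | dog    | cat    | cat    | cat    | dog    |
|  3 | dog    | dog    | dog    | cat    | dog    |
|  4 | cat    | cat    | cat    | dog    | dog    |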


Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

- In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [10]:
paws.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [11]:
paws['baseline'] = 'dog'
paws.baseline.value_counts()

dog    5000
Name: baseline, dtype: int64

In [12]:
accuracies = []

total = paws.shape[0]
columns = paws.drop(columns = 'actual').columns

for column in columns:
    positive = paws[column] == 'dog'
    correct = paws.actual == paws[column]
    tp = paws[positive & correct].shape[0]
    
    negative = paws[column] == 'cat'
    correct = paws.actual == paws[column]
    tn = paws[negative & correct].shape[0]
    
    accuracies.append((tp + tn) / total)

accuracies

[0.8074, 0.6304, 0.5096, 0.7426, 0.6508]

Models 1 (0.8074) and 4 (0.7426) are better than the baseline (0.6508). Model 2 (0.6304) is slightly worse than the baseline, and model 3 (0.5096) is considerably worse.
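Since accuracy is just the share of rows where the prediction matches the actual label, the same comparison can be sketched without counting true positives and true negatives separately. The frame `toy` below is hypothetical and stands in for `gives_you_paws.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for gives_you_paws.csv (not the real dataset).
toy = pd.DataFrame({
    "actual": ["dog", "dog", "dog", "cat", "cat"],
    "model1": ["dog", "dog", "cat", "cat", "cat"],
    "model2": ["cat", "dog", "cat", "dog", "cat"],
})

# The baseline always predicts the most common actual class.
toy["baseline"] = toy.actual.value_counts().idxmax()

# Compare every prediction column to the actual labels row by row;
# the mean of the resulting booleans is the share of correct predictions.
predictions = toy.drop(columns="actual")
accuracies = predictions.eq(toy.actual, axis=0).mean()
print(accuracies)
```

Picking the baseline label with `value_counts().idxmax()` rather than hard-coding it keeps the baseline correct if the class balance changes.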

---

- Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

Here we determine performance based on recall.

In [13]:
recalls = []

for column in columns:
    positive = paws[column] == 'dog'
    correct = paws.actual == paws[column]
    tp = paws[positive & correct].shape[0]
    
    negative = paws[column] == 'cat'
    wrong = paws.actual != paws[column]
    fn = paws[negative & wrong].shape[0]
    
    recalls.append(tp / (tp + fn))

recalls

[0.803318992009834,
 0.49078057775046097,
 0.5086047940995697,
 0.9557467732022127,
 1.0]

Setting aside the baseline, which reaches a recall of 1.0 only because it always predicts dog, model 4 has the best recall (0.956), so it would be best for Phase I. For Phase II we want better accuracy, so model 1 is best.

---

- Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

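Here we determine performance based on precision, still treating dog as the positive class. A high precision means that few cat pictures are mislabeled as dogs.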

In [14]:
precisions = []

for column in columns:
    positive = paws[column] == 'dog'
    correct = paws.actual == paws[column]
    tp = paws[positive & correct].shape[0]
    
    wrong = paws.actual != paws[column]
    fp = paws[positive & wrong].shape[0]
    
    precisions.append(tp / (tp + fp))

precisions

[0.8900238338440586,
 0.8931767337807607,
 0.6598883572567783,
 0.7312485304490948,
 0.6508]

Model 2 has the best precision (0.893), so it is the best model to use for Phase I. For Phase II we want better accuracy, so model 1 is best.
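Precision and recall both depend on which class is treated as positive, which matters when the question switches from dogs to cats. The sketch below makes the positive label an explicit argument. The function name `precision_recall` and the labels are hypothetical and are not taken from `gives_you_paws.csv`.

```python
def precision_recall(actual, predicted, positive):
    """Return (precision, recall) treating `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for illustration only.
actual = ["dog", "dog", "dog", "dog", "cat", "cat"]
predicted = ["dog", "dog", "cat", "cat", "cat", "dog"]

dog_precision, dog_recall = precision_recall(actual, predicted, positive="dog")
cat_precision, cat_recall = precision_recall(actual, predicted, positive="cat")
print(dog_precision, dog_recall)
print(cat_precision, cat_recall)
```

The same predictions give different scores for each class, so it is worth stating the positive class alongside any precision or recall figure.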

---

# 5

Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

- sklearn.metrics.accuracy_score
- sklearn.metrics.precision_score
- sklearn.metrics.recall_score
- sklearn.metrics.classification_report

In [17]:
for column in columns:
    print(metrics.accuracy_score(paws.actual, paws[column]))

0.8074
0.6304
0.5096
0.7426
0.6508

In [23]:
for column in columns:
    print(metrics.precision_score(paws.actual, paws[column], pos_label = 'dog'))

0.8900238338440586
0.8931767337807607
0.6598883572567783
0.7312485304490948
0.6508

In [24]:
for column in columns:
    print(metrics.recall_score(paws.actual, paws[column], pos_label = 'dog'))

0.803318992009834
0.49078057775046097
0.5086047940995697
0.9557467732022127
1.0

In [28]:
for column in columns:
    print(column)
    print(metrics.classification_report(paws.actual, paws[column], zero_division = 0))

model1
              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000

model2
              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89      0.49      0.63      3254

    accuracy                           0.63      5000
   macro avg       0.69      0.69      0.63      5000
weighted avg       0.75      0.63      0.63      5000

model3
              precision    recall  f1-score   support

         cat       0.36      0.51      0.42      1746
         dog       0.66      0.51      0.57      3254

    accuracy                           0.51      5000
   macro avg       0.51      0.51      0.50      5000
weighted avg       0.55      0.51      0.52      5000