# Evaluating Model Performance Exercises

In [27]:
import numpy as np
import seaborn as sns
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import itertools
# import env
from math import sqrt
# to turn off pink warning boxes basically for display purposes in class
import warnings
warnings.filterwarnings('ignore')

# import splitting and imputing functions
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# confusion matrix
import sklearn.metrics

# to see local file system
import os

# import our own acquire module
import acquire

#### 2. Given the following confusion matrix, evaluate (by hand) the model's performance.



In [None]:
|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |


- In the context of this problem, what is a false positive?

First, I would need to choose what constitutes a positive result. In this case I choose 'dog' as positive and 'cat' as negative. Then a false positive would be where the model predicted a positive(dog), but the actual value was a negative(cat)


- In the context of this problem, what is a false negative?

A false negative would be where the model predicted a negative(cat) and the actual value was a positive(dog)


- How would you describe this model?

The model is 80% accurate compared to 53% for the baseline model. It is a relatively accurate model


#### 3. You are working as a data scientist working for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant. Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here. Use the predictions dataset and pandas to help answer the following questions:


- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? 

In this case, it is likely that the percentage of ducks with defects is a much smaller class than those without defects. Also, it is likely more important to this company to catch all of the defects, even if that results in more false positives(detecting a defect when there is none. So, for this project, I will assign positive to 'has a defect' and negative to 'no defects'. I want to maximize the amount of True Positive detections so the metric I will use is Recall which measures the number of True Positives(correctly predicted defects) against the total actual positives(TP + FN).

- Which model would be the best fit for this use case?

model2 and model3 tie for best with a recall of .5625

In [5]:
duck_df = pd.read_csv('c3.csv')
duck_df

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect
...,...,...,...,...
195,No Defect,No Defect,Defect,Defect
196,Defect,Defect,No Defect,No Defect
197,No Defect,No Defect,No Defect,No Defect
198,No Defect,No Defect,Defect,Defect


In [13]:
# Recall metric needs True Positives in numerator so subset is the True Positives:

duck_subset = duck_df[duck_df.actual == 'Defect']

tp_1_recall = (duck_subset.model1 == duck_subset.actual).mean()
tp_1_recall

tp_2_recall = (duck_subset.model2 == duck_subset.actual).mean()

tp_3_recall = (duck_subset.model2 == duck_subset.actual).mean()

tp_1_recall, tp_2_recall, tp_3_recall


(0.5, 0.5625, 0.5625)

#### 4. You are working as a data scientist for Gives You Paws ™, a subscription based service that shows you cute pictures of dogs or cats (or both for an additional fee). At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two step process. First an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II). Several models have already been developed with the data, and you can find their results here. Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:



##### a. In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline? 

Model1 is the most accurate at .8074. Models 1 and 4 both beat baseline.


In [15]:
paws_df = pd.read_csv('gives_you_paws.csv')
paws_df

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog
...,...,...,...,...,...
4995,dog,dog,dog,dog,dog
4996,dog,dog,cat,cat,dog
4997,dog,cat,cat,dog,dog
4998,cat,cat,cat,cat,dog


In [17]:
paws_df.describe()

Unnamed: 0,actual,model1,model2,model3,model4
count,5000,5000,5000,5000,5000
unique,2,2,2,2,2
top,dog,dog,cat,dog,dog
freq,3254,2937,3212,2508,4253


In [19]:
paws_df['baseline'] = 'dog'

In [21]:

paws_m1_acc = (paws_df.model1 == paws_df.actual).mean()
paws_m2_acc = (paws_df.model2 == paws_df.actual).mean()
paws_m3_acc = (paws_df.model3 == paws_df.actual).mean()
paws_m4_acc = (paws_df.model4 == paws_df.actual).mean()
paws_base_acc = (paws_df.baseline == paws_df.actual).mean()
paws_m1_acc, paws_m2_acc, paws_m3_acc, paws_m4_acc, paws_base_acc

(0.8074, 0.6304, 0.5096, 0.7426, 0.6508)

##### b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend?

Accuracy is probably a pretty good indicator as the classes are not terribly balanced at 65% dogs %35 cats. Maybe for our dog team though, we would want to know the precision of specifically identifying dogs.

I would recommend model2 as it is the most precise at 0.8931, but model1 may be better overall with a 0.8900 precision, but a much higher accuracy.

In [24]:
paws_sub_pre_1 = paws_df[paws_df.model1 == 'dog']
paws_sub_pre_2 = paws_df[paws_df.model2 == 'dog']
paws_sub_pre_3 = paws_df[paws_df.model3 == 'dog']
paws_sub_pre_4 = paws_df[paws_df.model4 == 'dog']
paws_sub_pre_b = paws_df[paws_df.baseline == 'dog']

paws_pre_1 = (paws_sub_pre_1.model1 == paws_sub_pre_1.actual).mean()
paws_pre_2 = (paws_sub_pre_2.model2 == paws_sub_pre_2.actual).mean()
paws_pre_3 = (paws_sub_pre_3.model3 == paws_sub_pre_3.actual).mean()
paws_pre_4 = (paws_sub_pre_4.model4 == paws_sub_pre_4.actual).mean()
paws_pre_b = (paws_sub_pre_b.baseline == paws_sub_pre_b.actual).mean()

paws_pre_1, paws_pre_2, paws_pre_3, paws_pre_4, paws_pre_b

(0.8900238338440586,
 0.8931767337807607,
 0.6598883572567783,
 0.7312485304490948,
 0.6508)

##### c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend?

We would want to know the precision of specifically identifying cats.

I would recommend model4 as it is the most precise at 0.8072, but model1 may be better overall with a 0.6898 precision, but a somewhat higher accuracy.


In [26]:
# success is now cat, so we will reverse the precision calcs.

paws_sub_pre_1c = paws_df[paws_df.model1 == 'cat']
paws_sub_pre_2c = paws_df[paws_df.model2 == 'cat']
paws_sub_pre_3c = paws_df[paws_df.model3 == 'cat']
paws_sub_pre_4c = paws_df[paws_df.model4 == 'cat']

paws_pre_1c = (paws_sub_pre_1c.model1 == paws_sub_pre_1c.actual).mean()
paws_pre_2c = (paws_sub_pre_2c.model2 == paws_sub_pre_2c.actual).mean()
paws_pre_3c = (paws_sub_pre_3c.model3 == paws_sub_pre_3c.actual).mean()
paws_pre_4c = (paws_sub_pre_4c.model4 == paws_sub_pre_4c.actual).mean()

paws_pre_1c, paws_pre_2c, paws_pre_3c, paws_pre_4c

(0.6897721764420747,
 0.4841220423412204,
 0.358346709470305,
 0.8072289156626506,
 nan)

##### 5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.



In [34]:
# sklearn.metrics.accuracy_score
paw_acc_1 = sklearn.metrics.accuracy_score(paws_df.actual, paws_df.model1, normalize=True)
paw_acc_2 = sklearn.metrics.accuracy_score(paws_df.actual, paws_df.model2, normalize=True)
paw_acc_3 = sklearn.metrics.accuracy_score(paws_df.actual, paws_df.model3, normalize=True)
paw_acc_4 = sklearn.metrics.accuracy_score(paws_df.actual, paws_df.model4, normalize=True)
paw_acc_b = sklearn.metrics.accuracy_score(paws_df.actual, paws_df.baseline, normalize=True)

paw_acc_1, paw_acc_2, paw_acc_3, paw_acc_4, paw_acc_b

# this function returns same vals as the manual method

(0.8074, 0.6304, 0.5096, 0.7426, 0.6508)

In [48]:
# sklearn.metrics.precision_score
paw_cols = paws_df.columns.tolist()[1:]
# def get_precision_score(columns):
#     paw_precisions = []
#     for col in columns:
#         paw_precisions.append(sklearn.metrics.precision_score(paws_df.actual, paws_df['col']))
#     return paw_precisions
# paw_precisions = get_precision_score(paw_cols)
paw_cols

['model1', 'model2', 'model3', 'model4', 'baseline']