# Evaluation Exercises

## QUESTION ONE
Given the following confusion matrix, evaluate (by hand) the model's performance.

|               | actual cat | actual dog |
|:------------  |-----------:|-----------:|
| predicted cat |         34 |          7 |
| predicted dog |         13 |         46 |

In the context of this problem, what is a false positive?

In [None]:
# We will define success as accurately identifying a cat
# A false positive is when cat is predicted, but the actual was a dog (7)

In the context of this problem, what is a false negative?

In [None]:
# A false negative is when dog is predicted, but the actual is a cat (13)

How would you describe this model?

In [13]:
# Overall success rate is (34+46) / (34+7+13+46)
print(f'Model accuracy is {(34+46)/(34+7+13+46) * 100}%') # Given a prediction, how much do we trust that prediction?
print(f'Model PPV is {round(34/(34+7)*100, 2)}%') # Given a positive prediction, how much do we trust that prediction?
print(f'Model NPV is {round(46/(13+46)*100,2)}%') # Given a negative prediction, how much do we trust that negative prediction?
print(f'Model Recall(Sensitivity) is {round(34/(34+13)*100, 2)}%') # How well is this model minimizing false negatives?
print(f'Model Specificity is {round(46/(7+46)*100,2)}%') # How well is this model minimizing false positives? 

Model accuracy is 80.0%
Model PPV is 82.93%
Model NPV is 77.97%
Model Recall(Sensitivity) is 72.34%
Model Specificity is 86.79%
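
The arithmetic above can be wrapped in a small helper. This is a minimal sketch, not part of the original exercise: the function name `confusion_metrics` and its argument names are hypothetical, and cat is treated as the positive class, as defined earlier.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return common metrics for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # PPV: trust in a positive prediction
        "npv": tn / (tn + fn),          # trust in a negative prediction
        "recall": tp / (tp + fn),       # sensitivity: penalizes false negatives
        "specificity": tn / (tn + fp),  # penalizes false positives
    }

# Counts read from the confusion matrix (positive class = cat)
metrics = confusion_metrics(tp=34, fp=7, fn=13, tn=46)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Keeping the four counts as named arguments makes it harder to mix up false positives and false negatives when the matrix is transposed.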


## QUESTION TWO
You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found here.

Use the predictions dataset and pandas to help answer the following questions:

In [14]:
import pandas as pd

In [15]:
df = pd.read_csv('c3.csv')
df.head()

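|    | actual    | model1    | model2 | model3    |
|---:|:----------|:----------|:-------|:----------|
|  0 | No Defect | No Defect | Defect | No Defect |
|  1 | No Defect | No Defect | Defect | Defect    |
|  2 | No Defect | No Defect | Defect | No Defect |
|  3 | No Defect | Defect    | Defect | Defect    |
|  4 | No Defect | No Defect | Defect | No Defect |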


In [17]:
df.shape

(200, 4)

In [21]:
df.actual.value_counts()

No Defect    184
Defect        16
Name: actual, dtype: int64

In [18]:
# Model 1 Accuracy
(df.actual == df.model1).mean()

0.95

In [19]:
# Model 2 Accuracy
(df.actual == df.model2).mean()

0.56

In [20]:
# Model 3 Accuracy
(df.actual == df.model3).mean()

0.555

#### An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The team is less concerned about accuracy, and more concerned about catching as many defects as possible.
We first need to define what our positive and negative conditions are.
Because the goal of the model is to identify defective ducks, we will define a positive result as a defect.

With this in mind, we need to minimize false negatives (a defective duck slips through).

In [24]:
pd.crosstab(df.model1, df.actual)

|                  | actual Defect | actual No Defect |
|:-----------------|--------------:|-----------------:|
| model1 Defect    |             8 |                2 |
| model1 No Defect |             8 |              182 |


In [25]:
pd.crosstab(df.model2, df.actual)

|                  | actual Defect | actual No Defect |
|:-----------------|--------------:|-----------------:|
| model2 Defect    |             9 |               81 |
| model2 No Defect |             7 |              103 |


In [26]:
pd.crosstab(df.model3, df.actual)

|                  | actual Defect | actual No Defect |
|:-----------------|--------------:|-----------------:|
| model3 Defect    |            13 |               86 |
| model3 No Defect |             3 |               98 |


Although model3 has the worst overall accuracy, it has the best recall (sensitivity): it catches 13 of the 16 defective ducks (81.25%), compared with 8 for model1 and 9 for model2. Fewer defective ducks slip through this model than through the others. The company will investigate a lot of good ducks, but it will have the best chance of finding the highest number of defects.
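
The recall comparison can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the counts are copied by hand from the "actual Defect" column of the three crosstabs above, and the names `defect_counts`, `recall`, and `best_recall_model` are hypothetical.

```python
# (true positives, false negatives) for each model, with "Defect" as the positive class
defect_counts = {
    "model1": (8, 8),
    "model2": (9, 7),
    "model3": (13, 3),
}

# Recall = TP / (TP + FN): the share of defective ducks the model catches
recall = {name: tp / (tp + fn) for name, (tp, fn) in defect_counts.items()}
best_recall_model = max(recall, key=recall.get)

for name, value in recall.items():
    print(f"{name} recall: {value:.2%}")
print(f"Highest recall: {best_recall_model}")
```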

#### Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. Which evaluation metric would be appropriate here? Which model would be the best fit for this use case?

The PR team wants to minimize false positives (where the prediction is that the duck was defective, but the duck was actually good), so precision is the metric to optimize: of the ducks predicted defective, how many really are?

In [27]:
pd.crosstab(df.model1, df.actual)

|                  | actual Defect | actual No Defect |
|:-----------------|--------------:|-----------------:|
| model1 Defect    |             8 |                2 |
| model1 No Defect |             8 |              182 |


Model 1 produces the fewest false positives, which gives it the highest precision (8/10 = 80%) and a high specificity (182/184 ≈ 98.9%). Only 10 customers will receive vacation packages, and only 2 of those 10 did not actually have a defective duck.
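
To compare all three models on false positives, precision and specificity can be computed side by side. This is a rough sketch, not the notebook's original code: the counts are copied from the crosstabs in the previous question, and the names `pr_counts`, `precision`, `specificity`, and `best_precision_model` are hypothetical.

```python
# (true positives, false positives, true negatives), with "Defect" as the positive class
pr_counts = {
    "model1": (8, 2, 182),
    "model2": (9, 81, 103),
    "model3": (13, 86, 98),
}

# Precision: of the ducks predicted defective, how many really are?
precision = {m: tp / (tp + fp) for m, (tp, fp, tn) in pr_counts.items()}
# Specificity: of the good ducks, how many are correctly left alone?
specificity = {m: tn / (tn + fp) for m, (tp, fp, tn) in pr_counts.items()}
best_precision_model = max(precision, key=precision.get)

for m in pr_counts:
    print(f"{m}: precision {precision[m]:.2%}, specificity {specificity[m]:.2%}")
```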

## QUESTION THREE

You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. First, an automated algorithm tags pictures as either a cat or a dog (Phase I). Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results here.

In [28]:
df = pd.read_csv('gives_you_paws.csv')
df.head()

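|    | actual | model1 | model2 | model3 | model4 |
|---:|:-------|:-------|:-------|:-------|:-------|
|  0 | cat    | cat    | dog    | cat    | dog    |
|  1 | dog    | dog    | cat    | cat    | dog    |
|  2 | dog    | cat    | cat    | cat    | dog    |
|  3 | dog    | dog    | dog    | cat    | dog    |
|  4 | cat    | cat    | cat    | dog    | dog    |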


In [58]:
df.shape

(5000, 5)

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

In [29]:
df.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [36]:
baseline_accuracy = (df.actual == 'dog').mean()
print(f'The baseline accuracy (always predicting dog) is {baseline_accuracy * 100}%')

The baseline accuracy (always predicting dog) is 65.08%


#### In terms of accuracy, how do the various models compare to the baseline model? Are any of the models better than the baseline?

In [41]:
(df.model1 == df.actual).mean()

0.8074

In [42]:
(df.model2 == df.actual).mean()

0.6304

In [43]:
(df.model3 == df.actual).mean()

0.5096

In [44]:
(df.model4 == df.actual).mean()

0.7426

In [45]:
# Model 1 has the highest accuracy at 80.74%
# Model 4 has the next highest accuracy at 74.26%
# Model 2 & 3 performed worse than the baseline
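
The comparison against the baseline can also be made explicit. The snippet below is a small sketch that hard-codes the accuracies printed in the cells above; the names `baseline`, `model_accuracy`, and `beats_baseline` are hypothetical.

```python
# Baseline: always predict the most common class ("dog")
baseline = 3254 / 5000

# Accuracies copied from the outputs above
model_accuracy = {
    "model1": 0.8074,
    "model2": 0.6304,
    "model3": 0.5096,
    "model4": 0.7426,
}

# Keep only the models that do better than always guessing "dog"
beats_baseline = sorted(m for m, acc in model_accuracy.items() if acc > baseline)
print(f"Baseline: {baseline:.2%}; better than baseline: {beats_baseline}")
```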

#### Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? For Phase II?

In [48]:
pd.crosstab(df.model1, df.actual)

|            | actual cat | actual dog |
|:-----------|-----------:|-----------:|
| model1 cat |       1423 |        640 |
| model1 dog |        323 |       2614 |


In [67]:
print('MODEL ONE')
print(f'Each actual DOG has a {2614/(2614+640)} chance of being tagged as dog.')
print(f'Each actual CAT has a {323/(323+1423)} chance of being tagged as dog.')

MODEL ONE
Each actual DOG has a 0.803318992009834 chance of being tagged as dog.
Each actual CAT has a 0.1849942726231386 chance of being tagged as dog.


In [49]:
pd.crosstab(df.model2, df.actual)

|            | actual cat | actual dog |
|:-----------|-----------:|-----------:|
| model2 cat |       1555 |       1657 |
| model2 dog |        191 |       1597 |


In [68]:
print('MODEL TWO')
print(f'Each actual DOG has a {1597/(1597+1657)} chance of being tagged as dog.')
print(f'Each actual CAT has a {191/(191+1555)} chance of being tagged as dog.')

MODEL TWO
Each actual DOG has a 0.49078057775046097 chance of being tagged as dog.
Each actual CAT has a 0.10939289805269187 chance of being tagged as dog.


In [50]:
pd.crosstab(df.model3, df.actual)

|            | actual cat | actual dog |
|:-----------|-----------:|-----------:|
| model3 cat |        893 |       1599 |
| model3 dog |        853 |       1655 |


In [69]:
print('MODEL THREE')
print(f'Each actual DOG has a {1655/(1655+1599)} chance of being tagged as dog.')
print(f'Each actual CAT has a {853/(853+893)} chance of being tagged as dog.')

MODEL THREE
Each actual DOG has a 0.5086047940995697 chance of being tagged as dog.
Each actual CAT has a 0.488545246277205 chance of being tagged as dog.


In [51]:
pd.crosstab(df.model4, df.actual)

|            | actual cat | actual dog |
|:-----------|-----------:|-----------:|
| model4 cat |        603 |        144 |
| model4 dog |       1143 |       3110 |


In [70]:
print('MODEL FOUR')
print(f'Each actual DOG has a {3110/(3110+144)} chance of being tagged as dog.')
print(f'Each actual CAT has a {1143/(1143+603)} chance of being tagged as dog.')

MODEL FOUR
Each actual DOG has a 0.9557467732022127 chance of being tagged as dog.
Each actual CAT has a 0.654639175257732 chance of being tagged as dog.


In [55]:
# If I wanted to do the least amount of work, I would want the model that predicted the fewest number of dogs, regardless of any other metric
print(df.model1.value_counts())
print(df.model2.value_counts())
print(df.model3.value_counts())
print(df.model4.value_counts())

# Model 2 would send the least amount of work to my team (1788 pictures tagged as dog), so let's start with that one...

dog    2937
cat    2063
Name: model1, dtype: int64
cat    3212
dog    1788
Name: model2, dtype: int64
dog    2508
cat    2492
Name: model3, dtype: int64
dog    4253
cat     747
Name: model4, dtype: int64


In [56]:
# Just kidding... let's look at the possible combinations of the four models we have and what their expected combined performance is

### Model 1: Model 2
If we use model 1, what do we pass to phase II?

    2614 dogs, 323 cats
    
If we use model 2, what do we pass to the users?

    Each actual dog has a 0.49078057775046097 chance of being passed to user
    Each actual cat has a 0.10939289805269187 chance of being passed to user
   
We pass on 1283 dogs and 35 cats

#### Accuracy = 97.34%

### Model 2: Model 1
If we use model 2, what do we pass to phase II?

    1597 dogs, 191 cats
    
If we use model 1, what do we pass to the users?

    Each actual DOG has a 0.803318992009834 chance of being tagged as dog.
    Each actual CAT has a 0.1849942726231386 chance of being tagged as dog.
    
We pass on 1283 dogs, 35 cats

#### Accuracy: 97.34%

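#### Note: It doesn't matter which model is run first; the end result is the same, because the expected counts are the product of both models' pass rates (this treats the two models' errors as independent).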
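
The order-independence noted above follows from simple multiplication. The sketch below illustrates it for models 1 and 2 only; it assumes the two models make their errors independently, and the names `pass_rates` and `two_phase` are hypothetical.

```python
ACTUAL_DOGS, ACTUAL_CATS = 3254, 1746

# Per-model rates from the crosstabs:
# (P(tagged dog | actual dog), P(tagged dog | actual cat))
pass_rates = {
    1: (2614 / 3254, 323 / 1746),
    2: (1597 / 3254, 191 / 1746),
}

def two_phase(first, second):
    """Expected (dogs, cats) shown to users when both models must say "dog"."""
    dog_first, cat_first = pass_rates[first]
    dog_second, cat_second = pass_rates[second]
    dogs = ACTUAL_DOGS * dog_first * dog_second
    cats = ACTUAL_CATS * cat_first * cat_second
    return round(dogs), round(cats)

print(two_phase(1, 2))  # model 1 in Phase I, model 2 in Phase II
print(two_phase(2, 1))  # the reverse order
```

Both calls return the same pair of counts because the two rates are simply multiplied together.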

### Model 1: Model 3
If we use model 1, what do we pass to phase II?

    2614 dogs, 323 cats
    
If we use model 3, what do we pass to the users?

    Each actual DOG has a 0.5086047940995697 chance of being tagged as dog.
    Each actual CAT has a 0.488545246277205 chance of being tagged as dog.

We pass on 1329 dogs (2614 × 0.5086) and 158 cats (323 × 0.4885)

#### Accuracy = 89.37%

### Model 1: Model 4
If we use model 1, what do we pass to phase II?

    2614 dogs, 323 cats
    
If we use model 4, what do we pass to the users?
    
    Each actual DOG has a 0.9557467732022127 chance of being tagged as dog.
    Each actual CAT has a 0.654639175257732 chance of being tagged as dog.

We pass on 2498 dogs and 211 cats

#### Accuracy = 92.21%

### Model 2: Model 3
If we use model 2, what do we pass to phase II?

    1597 dogs, 191 cats
    
If we use model 3, what do we pass to the users?
        
    Each actual DOG has a 0.5086047940995697 chance of being tagged as dog.
    Each actual CAT has a 0.488545246277205 chance of being tagged as dog.

We pass on 812 dogs, 93 cats

#### Accuracy: 89.72%

### Model 2: Model 4
If we use model 2, what do we pass to phase II?

    1597 dogs, 191 cats
    
If we use model 4, what do we pass to the users?

    Each actual DOG has a 0.9557467732022127 chance of being tagged as dog.
    Each actual CAT has a 0.654639175257732 chance of being tagged as dog.
    
We pass on 1526 dogs, 125 cats

#### Accuracy: 92.43%

### Model 3: Model 4
If we use model 3, what do we pass to phase II?

    1655 dogs, 853 cats
    
If we use model 4, what do we pass to the users?
    
    Each actual DOG has a 0.9557467732022127 chance of being tagged as dog.
    Each actual CAT has a 0.654639175257732 chance of being tagged as dog.
    
We pass on 1582 dogs, 558 cats

#### Accuracy: 73.93%

In [101]:
model_combinations = {
    'Model Combination': ['Model 1:2', 'Model 1:3', 'Model 1:4', 'Model 2:3', 'Model 2:4', 'Model 3:4'],
    'Dogs Passed to User': [1283, 1329, 2498, 812, 1526, 1582],
    'Cats Passed to User': [35, 158, 211, 93, 125, 558],
    'Accuracy': [.9734, .8937, .9221, .8972, .9243, .7393],
}

model_df = pd.DataFrame(model_combinations)
model_df

|    | Model Combination | Dogs Passed to User | Cats Passed to User | Accuracy |
|---:|:------------------|--------------------:|--------------------:|---------:|
|  0 | Model 1:2         |                1283 |                  35 |   0.9734 |
|  1 | Model 1:3         |                1329 |                 158 |   0.8937 |
|  2 | Model 1:4         |                2498 |                 211 |   0.9221 |
|  3 | Model 2:3         |                 812 |                  93 |   0.8972 |
|  4 | Model 2:4         |                1526 |                 125 |   0.9243 |
|  5 | Model 3:4         |                1582 |                 558 |   0.7393 |
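
The six pairings can also be enumerated programmatically instead of by hand. The sketch below is not the notebook's original method: it hard-codes the "tagged as dog" counts from the crosstabs above, uses hypothetical names (`tagged_dog`, `combine`, `results`), and assumes the two models' errors are independent, so the expected count is the product of the two pass rates.

```python
from itertools import combinations

ACTUAL = {"dog": 3254, "cat": 1746}

# Pictures each model tags as "dog", split by actual class (from the crosstabs)
tagged_dog = {
    1: {"dog": 2614, "cat": 323},
    2: {"dog": 1597, "cat": 191},
    3: {"dog": 1655, "cat": 853},
    4: {"dog": 3110, "cat": 1143},
}

def combine(a, b):
    """Expected pictures reaching users if models a and b must both say "dog"."""
    passed = {
        cls: round(tagged_dog[a][cls] * tagged_dog[b][cls] / ACTUAL[cls])
        for cls in ACTUAL
    }
    # Share of passed pictures that really are dogs
    share_dogs = passed["dog"] / (passed["dog"] + passed["cat"])
    return passed["dog"], passed["cat"], share_dogs

results = {pair: combine(*pair) for pair in combinations(tagged_dog, 2)}
for (a, b), (dogs, cats, share) in results.items():
    print(f"Model {a}:{b} -> {dogs} dogs, {cats} cats, {share:.2%} dogs")
```

For Model 1:2 this gives 1283 dogs and 35 cats. The last value in each tuple is the share of passed pictures that are dogs, which is what the summary calls accuracy; strictly speaking it is the precision of the combined pipeline.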


It's clear from our results that the combination of models 1 and 2 gives the highest accuracy of the six pairings: 97.34% of the pictures passed to users are dogs.

However, a case can be made for models 1 and 4. While accuracy drops from 97% to 92%, the number of pictures passed to users more than doubles (2,709 versus 1,318). The user is getting many more dog pictures for the same subscription price.

#### Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? For Phase II?

Rather than recalculating every possible model combination for cats, let's consider what characteristics made the combination of models 1 and 2 the most accurate overall.
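
One way to begin that comparison is to compute, for each model, how often an actual cat is tagged as cat and how often an actual dog is mistakenly tagged as cat. The sketch below is only a starting point under stated assumptions: the counts are copied from the "cat" row of each crosstab above, and the names `tagged_cat` and `cat_rates` are hypothetical.

```python
ACTUAL_CATS, ACTUAL_DOGS = 1746, 3254

# Pictures each model tags as "cat", split by actual class (from the crosstabs)
tagged_cat = {
    "model1": {"cat": 1423, "dog": 640},
    "model2": {"cat": 1555, "dog": 1657},
    "model3": {"cat": 893, "dog": 1599},
    "model4": {"cat": 603, "dog": 144},
}

cat_rates = {
    name: {
        "cat_kept": counts["cat"] / ACTUAL_CATS,    # P(tagged cat | actual cat)
        "dog_leaked": counts["dog"] / ACTUAL_DOGS,  # P(tagged cat | actual dog)
    }
    for name, counts in tagged_cat.items()
}

for name, rates in cat_rates.items():
    print(f"{name}: cats kept {rates['cat_kept']:.2%}, dogs leaked {rates['dog_leaked']:.2%}")
```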