### 1. Create a new file named model_evaluation.py or model_evaluation.ipynb for these exercises.

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

### 2. Given the following confusion matrix, evaluate (by hand) the model's performance.


|               | pred dog   | pred cat   |
|:------------  |-----------:|-----------:|
| actual dog    |         46 |         7  |
| actual cat    |         13 |         34 |

We can say that a dog is 0 (negative) and a cat is 1 (positive):

|                | pred dog (0)| pred cat (1)|
|:-------------  |------------:|------------:|
| actual dog (0) |          46 |          7  |
| actual cat (1) |          13 |          34 |
 
- In the context of this problem, what is a false positive?
    - FP = actual dog and model predicted cat: 7
- In the context of this problem, what is a false negative?
    - FN = actual cat and model predicted dog: 13
- How would you describe this model?
    - True Negative is actual dog and model predicted dog: 46
    - False Positive is actual dog and model predicted cat: 7
    - False Negative is actual cat and model predicted dog: 13
    - True Positive is actual cat and model predicted cat: 34
    - 80% accuracy
    - 72% recall
    - 83% precision
    - 77% f-1 score

In [103]:
tn = 46
fp = 7
fn = 13
tp = 34

print("True Positives", tp)
print("False Positives", fp)
print("False Negatives", fn)
print("True Negatives", tn)

print("-------------")

accuracy = (tp+tn) / (tp+tn+fp+fn)
recall = tp / (tp+fn)
precision = tp / (tp+fp)
f1_score = (2* precision * recall) / (precision + recall)

print(f"Accuracy is, {accuracy} or{accuracy: .2%}")
print(f"Recall is, {recall: .2} or{recall: .2%}")
print(f"Precision is, {precision: .2} or{precision: .2%}")
print(f"F-1 score is, {f1_score: .2} or{f1_score: .2%}")

True Positives 34
False Positives 7
False Negatives 13
True Negatives 46
-------------
Accuracy is, 0.8 or 80.00%
Recall is,  0.72 or 72.34%
Precision is,  0.83 or 82.93%
F-1 score is,  0.77 or 77.27%
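
As a cross-check on the hand calculation, the four counts can be expanded back into label lists and passed to scikit-learn. This is a minimal sketch, assuming the same encoding as above (dog = 0, cat = 1); the names `y_true`, `y_pred`, and the `sk_` variables are illustrative and not part of the exercise.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts from the confusion matrix above (dog = 0 / negative, cat = 1 / positive).
tn, fp, fn, tp = 46, 7, 13, 34

# Rebuild one (actual, predicted) pair per observation.
y_true = [0] * (tn + fp) + [1] * (fn + tp)
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp

# With 0/1 labels, the binary defaults (pos_label=1) treat "cat" as positive.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)

print(f"accuracy {sk_accuracy:.2%}, recall {sk_recall:.2%}, "
      f"precision {sk_precision:.2%}, f1 {sk_f1:.2%}")
```

The four values should agree with the manual formulas, since both use the same counts.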


### 3. You are working as a data scientist for Codeup Cody Creator (C3 for short), a rubber-duck manufacturing plant.

Unfortunately, some of the rubber ducks that are produced will have defects. Your team has built several models that try to predict those defects, and the data from their predictions can be found at https://ds.codeup.com/data/c3.csv.

Use the predictions dataset and pandas to help answer the following questions:

- An internal team wants to investigate the cause of the manufacturing defects. They tell you that they want to identify as many of the ducks that have a defect as possible. 
    - Which evaluation metric would be appropriate here?
        - Recall in order to avoid false negatives
    - Which model would be the best fit for this use case?
        - Model 3 has highest recall
    

In [56]:
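#read the predictions directly from the exercise URL so the cell is reproducible
c3df = pd.read_csv('https://ds.codeup.com/data/c3.csv')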
c3df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


In [57]:
#look at classes
c3df.actual.value_counts()


No Defect    184
Defect        16
Name: actual, dtype: int64

In [59]:
#visualize actual vs prediction(model1)
pd.crosstab(c3df.actual, c3df.model1)

model1,Defect,No Defect
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
Defect,8,8
No Defect,2,182


In [60]:
#checking value counts for model 1
c3df.model1.value_counts()

No Defect    190
Defect        10
Name: model1, dtype: int64

In [61]:
#double-check that the confusion matrix matches the crosstab above
model1 = confusion_matrix(c3df.actual, 
                          c3df.model1, 
                          labels=['Defect', 'No Defect'])

model1

array([[  8,   8],
       [  2, 182]])

In [43]:
#flatten confusion matrix and assign labels appropriately 
#match labels w/ crosstab and array above
tp1, fn1, fp1, tn1 = model1.ravel()

tp1, fn1, fp1, tn1

(8, 8, 2, 182)
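
The order of the values returned by `ravel()` depends on the order given in `labels`, which is easy to get wrong. The sketch below uses a small made-up set of labels (not the C3 data) to show both orderings; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, chosen so that every cell has a different meaning.
actual = ['Defect', 'Defect', 'Defect',
          'No Defect', 'No Defect', 'No Defect', 'No Defect']
predicted = ['Defect', 'Defect', 'No Defect',
             'Defect', 'No Defect', 'No Defect', 'No Defect']

# Positive class listed first: rows/columns are [Defect, No Defect],
# so the flattened order is tp, fn, fp, tn.
tp_a, fn_a, fp_a, tn_a = confusion_matrix(
    actual, predicted, labels=['Defect', 'No Defect']).ravel().tolist()

# Negative class listed first: rows/columns are [No Defect, Defect],
# so the flattened order is tn, fp, fn, tp.
tn_b, fp_b, fn_b, tp_b = confusion_matrix(
    actual, predicted, labels=['No Defect', 'Defect']).ravel().tolist()

print(tp_a, fn_a, fp_a, tn_a)
print(tn_b, fp_b, fn_b, tp_b)
```

Either ordering works as long as the unpacking matches the `labels` argument.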

In [62]:
#do same step for model 2
model2 = confusion_matrix(c3df.actual, 
                          c3df.model2, 
                          labels=['Defect', 'No Defect'])

model2

array([[  9,   7],
       [ 81, 103]])

In [45]:
tp2, fn2, fp2, tn2 = model2.ravel()

tp2, fn2, fp2, tn2

(9, 7, 81, 103)

In [63]:
#same for model 3
model3 = confusion_matrix(c3df.actual, 
                          c3df.model3, 
                          labels=['Defect', 'No Defect'])

model3

array([[13,  3],
       [86, 98]])

In [47]:
tp3, fn3, fp3, tn3 = model3.ravel()

tp3, fn3, fp3, tn3

(13, 3, 86, 98)

In [48]:
#goal: catch as many defective ducks as possible (maximize true positives)
#that means reducing false negatives (duck is defective but the model predicts no defect)
#recall measures this: tp / (tp + fn)

recall_model1 = tp1 / (tp1 + fn1)
recall_model1

0.5

In [49]:
recall_model2 = tp2 / (tp2 + fn2)
recall_model2

0.5625

In [50]:
recall_model3 = tp3 / (tp3 + fn3)
recall_model3

0.8125

- Recently several stories in the local news have come out highlighting customers who received a rubber duck with a defect, and portraying C3 in a bad light. The PR team has decided to launch a program that gives customers with a defective duck a vacation to Hawaii. They need you to predict which ducks will have defects, but tell you they really don't want to accidentally give out a vacation package when the duck really doesn't have a defect. 
    - Which evaluation metric would be appropriate here? 
        - Precision to minimize false positives
    - Which model would be the best fit for this use case?
        - Model 1 has highest precision

In [52]:
#need to predict ducks w/ defects, 
#don't want to accidentally predict defect when duck really doesn't have defect
#aka don't want false positives
#Precision optimizes for this (tp / (tp + fp))

precision_model1 = tp1 / (tp1 + fp1)
precision_model1

0.8

In [54]:
precision_model2 = tp2 / (tp2 + fp2)
precision_model2

0.1

In [53]:
precision_model3 = tp3 / (tp3 + fp3)
precision_model3

0.13131313131313133
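
The recall and precision comparisons above can also be collected in one place. This is a small sketch in plain Python that reuses the `(tp, fn, fp, tn)` counts printed by the `ravel()` cells; the `counts` and `summary` names are illustrative.

```python
# (tp, fn, fp, tn) per model, copied from the ravel() outputs above.
counts = {
    'model1': (8, 8, 2, 182),
    'model2': (9, 7, 81, 103),
    'model3': (13, 3, 86, 98),
}

summary = {}
for name, (tp, fn, fp, tn) in counts.items():
    summary[name] = {
        'recall': tp / (tp + fn),      # share of actual defects that were caught
        'precision': tp / (tp + fp),   # share of predicted defects that were real
    }

# Pick the model with the highest value of each metric.
best_recall = max(summary, key=lambda m: summary[m]['recall'])
best_precision = max(summary, key=lambda m: summary[m]['precision'])

for name, scores in summary.items():
    print(f"{name}: recall {scores['recall']:.2%}, precision {scores['precision']:.2%}")
print("highest recall:", best_recall)
print("highest precision:", best_precision)
```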

In [None]:
#alternate solution for exercise 3 part 1

In [104]:
df = pd.read_csv('https://ds.codeup.com/data/c3.csv')
df.head()

Unnamed: 0,actual,model1,model2,model3
0,No Defect,No Defect,Defect,No Defect
1,No Defect,No Defect,Defect,Defect
2,No Defect,No Defect,Defect,No Defect
3,No Defect,Defect,Defect,Defect
4,No Defect,No Defect,Defect,No Defect


In [105]:
#confusion matrix for model 1
confusion_matrix(df.actual, df.model1,
                 labels = ['No Defect', 'Defect'])

array([[182,   2],
       [  8,   8]])

In [106]:
#confusion matrix for model 2
confusion_matrix(df.actual, df.model2,
                 labels = ['No Defect', 'Defect'])

array([[103,  81],
       [  7,   9]])

In [107]:
#confusion matrix for model 3 
confusion_matrix(df.actual, df.model3,
                 labels = ['No Defect', 'Defect'])

array([[98, 86],
       [ 3, 13]])

In [108]:
#keep only the ducks that actually have a defect (the denominator for recall)

subset = df[df.actual == 'Defect']
subset

Unnamed: 0,actual,model1,model2,model3
13,Defect,No Defect,Defect,Defect
30,Defect,Defect,No Defect,Defect
65,Defect,Defect,Defect,Defect
70,Defect,Defect,Defect,Defect
74,Defect,No Defect,No Defect,Defect
87,Defect,No Defect,Defect,Defect
118,Defect,No Defect,Defect,No Defect
135,Defect,Defect,No Defect,Defect
140,Defect,No Defect,Defect,Defect
147,Defect,Defect,No Defect,Defect


In [109]:
#Model 1 recall

model_recall = (subset.actual == subset.model1).mean()
print("Model 1")
print(f"Model recall: {model_recall:.2%}")

Model 1
Model recall: 50.00%


In [110]:
# Model 2 recall

model_recall = (subset.actual == subset.model2).mean()
print("Model 2")
print(f"Model recall: {model_recall:.2%}")

Model 2
Model recall: 56.25%


In [111]:
# Model 3 recall

model_recall = (subset.actual == subset.model3).mean()
print("Model 3")
print(f"Model recall: {model_recall:.2%}")

Model 3
Model recall: 81.25%


In [None]:
#alternate solution for exercise 3 part 2

In [112]:
subset = df[df.model1 == 'Defect']

model_precision = (subset.actual == subset.model1).mean()

print("Model 1")
print(f"Model precision: {model_precision:.2%}")

Model 1
Model precision: 80.00%


In [113]:
subset = df[df.model2 == 'Defect']

model_precision = (subset.actual == subset.model2).mean()

print("Model 2")
print(f"Model precision: {model_precision:.2%}")

Model 2
Model precision: 10.00%


In [114]:
subset = df[df.model3 == 'Defect']

model_precision = (subset.actual == subset.model3).mean()

print("Model 3")
print(f"Model precision: {model_precision:.2%}")

Model 3
Model precision: 13.13%
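
The subset-and-mean approach should give the same numbers as scikit-learn when the positive class is named explicitly through `pos_label`. The sketch below checks this on a small hypothetical DataFrame (`toy`, not the C3 data), since the string labels mean the default `pos_label=1` does not apply.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy predictions: 3 TP, 1 FN, 2 FP, 4 TN.
toy = pd.DataFrame({
    'actual': ['Defect'] * 4 + ['No Defect'] * 6,
    'model': ['Defect', 'Defect', 'Defect', 'No Defect',
              'Defect', 'Defect', 'No Defect', 'No Defect', 'No Defect', 'No Defect'],
})

# Recall: among actual defects, how often was the model right?
actual_defects = toy[toy.actual == 'Defect']
manual_recall = (actual_defects.actual == actual_defects.model).mean()

# Precision: among predicted defects, how often was the model right?
predicted_defects = toy[toy.model == 'Defect']
manual_precision = (predicted_defects.actual == predicted_defects.model).mean()

# Same metrics from scikit-learn, with 'Defect' as the positive class.
sk_recall = recall_score(toy.actual, toy.model, pos_label='Defect')
sk_precision = precision_score(toy.actual, toy.model, pos_label='Defect')

print(manual_recall, sk_recall)
print(manual_precision, sk_precision)
```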


### 4. You are working as a data scientist for Gives You Paws ™, a subscription-based service that shows you cute pictures of dogs or cats (or both for an additional fee).

At Gives You Paws, anyone can upload pictures of their cats or dogs. The photos are then put through a two-step process. 

- First, an automated algorithm tags pictures as either a cat or a dog (Phase I). 
- Next, the photos that have been initially identified are put through another round of review, possibly with some human oversight, before being presented to the users (Phase II).

Several models have already been developed with the data, and you can find their results at https://ds.codeup.com/data/gives_you_paws.csv.

Given this dataset, use pandas to create a baseline model (i.e. a model that just predicts the most common class) and answer the following questions:

a. In terms of accuracy, how do the various models compare to the baseline model? 

Are any of the models better than the baseline?
- Model 1 and model 4 are more accurate than the baseline model.
- Model 2 and model 3 are less accurate than the baseline model.

In [115]:
pawsdf = pd.read_csv('gives_you_paws.csv')
pawsdf.head()

Unnamed: 0,actual,model1,model2,model3,model4
0,cat,cat,dog,cat,dog
1,dog,dog,cat,cat,dog
2,dog,cat,cat,cat,dog
3,dog,dog,dog,cat,dog
4,cat,cat,cat,dog,dog


In [65]:
#dog is most common
pawsdf.actual.value_counts()

dog    3254
cat    1746
Name: actual, dtype: int64

In [69]:
#3254 dogs
#1746 cats 
#so our baseline model would be to predict dog every single time.
pawsdf['baseline_prediction'] = 'dog'
pawsdf.head()

#or: pawsdf['baseline_prediction'] = pawsdf.actual.value_counts().idxmax()

Unnamed: 0,actual,model1,model2,model3,model4,baseline_prediction
0,cat,cat,dog,cat,dog,dog
1,dog,dog,cat,cat,dog,dog
2,dog,cat,cat,cat,dog,dog
3,dog,dog,dog,cat,dog,dog
4,cat,cat,cat,dog,dog,dog


In [80]:
#In terms of accuracy, how do the various models compare to the baseline model?
#Are any of the models better than the baseline? Yes (model 1 and 4)

model1_accuracy = (pawsdf.actual == pawsdf.model1).mean()
model2_accuracy = (pawsdf.actual == pawsdf.model2).mean()
model3_accuracy = (pawsdf.actual == pawsdf.model3).mean()
model4_accuracy = (pawsdf.actual == pawsdf.model4).mean()

baseline_accuracy = (pawsdf.actual == pawsdf.baseline_prediction).mean()

print(f'model 1 accuracy: {model1_accuracy:.2%}')
print(f'model 2 accuracy: {model2_accuracy:.2%}')
print(f'model 3 accuracy: {model3_accuracy:.2%}')
print(f'model 4 accuracy: {model4_accuracy:.2%}')

print(f'baseline accuracy: {baseline_accuracy:.2%}')

model 1 accuracy: 80.74%
model 2 accuracy: 63.04%
model 3 accuracy: 50.96%
model 4 accuracy: 74.26%
baseline accuracy: 65.08%


In [82]:
#or make output dictionary

#step 1: get col names into a list (skip 'actual')
models = list(pawsdf.columns)
models = models[1:]

#step 2: get accuracy in dictionary form for each model
output = {}
for model in models:
    accuracy = (pawsdf.actual == pawsdf[model]).mean()
    output.update({model:accuracy})
    
output

{'model1': 0.8074,
 'model2': 0.6304,
 'model3': 0.5096,
 'model4': 0.7426,
 'baseline_prediction': 0.6508}

In [83]:
#make into a dataframe
pd.DataFrame(list(output.items()), columns = ['model', 'accuracy'])

Unnamed: 0,model,accuracy
0,model1,0.8074
1,model2,0.6304
2,model3,0.5096
3,model4,0.7426
4,baseline_prediction,0.6508
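
A baseline that always predicts the most common class has an accuracy equal to that class's share of the data. The sketch below rebuilds the `actual` column from the value counts shown earlier (3254 dogs, 1746 cats) rather than loading the CSV; the `actual` Series and the other names here are illustrative.

```python
import pandas as pd

# Rebuild the class distribution from the value_counts() output above.
actual = pd.Series(['dog'] * 3254 + ['cat'] * 1746)

# The baseline predicts the most frequent class for every row.
most_common = actual.value_counts().idxmax()
baseline_accuracy = (actual == most_common).mean()

# Equivalent shortcut: the largest normalized class frequency.
majority_share = actual.value_counts(normalize=True).max()

print(most_common, f"{baseline_accuracy:.2%}", f"{majority_share:.2%}")
```

Any model worth keeping on accuracy alone should beat this number.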


b. Suppose you are working on a team that solely deals with dog pictures. Which of these models would you recommend for Phase I? 
- Recommend model 4

For Phase II?
- Recommend model 2

In [84]:
#needing dog pictures (dog = positive, cat = negative)
#phase I is automated algorithm tags pics as either cat or dog

#need recall to minimize false negatives

subset = pawsdf[pawsdf.actual == 'dog']

model1_recall = (subset.actual == subset.model1).mean()
model2_recall = (subset.actual == subset.model2).mean()
model3_recall = (subset.actual == subset.model3).mean()
model4_recall = (subset.actual == subset.model4).mean()

print(f'model 1 recall: {model1_recall:.2%}')
print(f'model 2 recall: {model2_recall:.2%}')
print(f'model 3 recall: {model3_recall:.2%}')
print(f'model 4 recall: {model4_recall:.2%}')

#model 4 is best

model 1 recall: 80.33%
model 2 recall: 49.08%
model 3 recall: 50.86%
model 4 recall: 95.57%


In [85]:
#phase II is photos that have been initially identified are put through another round of review
#all features from first model are going to be fed into 2nd model
#output photos will be presented to customers, so we don't want false positives
#need precision to minimize false positives
#filter out for each model all the observations where prediction is dog

subset1 = pawsdf[pawsdf.model1 == 'dog']
subset2 = pawsdf[pawsdf.model2 == 'dog']
subset3 = pawsdf[pawsdf.model3 == 'dog']
subset4 = pawsdf[pawsdf.model4 == 'dog']

model1_precision = (subset1.actual == subset1.model1).mean()
model2_precision = (subset2.actual == subset2.model2).mean()
model3_precision = (subset3.actual == subset3.model3).mean()
model4_precision = (subset4.actual == subset4.model4).mean()

print(f'model 1 precision: {model1_precision:.2%}')
print(f'model 2 precision: {model2_precision:.2%}')
print(f'model 3 precision: {model3_precision:.2%}')
print(f'model 4 precision: {model4_precision:.2%}')

#model 2 is best

model 1 precision: 89.00%
model 2 precision: 89.32%
model 3 precision: 65.99%
model 4 precision: 73.12%


c. Suppose you are working on a team that solely deals with cat pictures. Which of these models would you recommend for Phase I? 
- Recommend model 2

For Phase II?
- Recommend model 4

In [86]:
#needing cat pictures (cat = positive, dog = negative)
#phase I is automated algorithm tags pics as either cat or dog

#need recall to minimize false negatives

subset = pawsdf[pawsdf.actual == 'cat']

model1_recall = (subset.actual == subset.model1).mean()
model2_recall = (subset.actual == subset.model2).mean()
model3_recall = (subset.actual == subset.model3).mean()
model4_recall = (subset.actual == subset.model4).mean()

print(f'model 1 recall: {model1_recall:.2%}')
print(f'model 2 recall: {model2_recall:.2%}')
print(f'model 3 recall: {model3_recall:.2%}')
print(f'model 4 recall: {model4_recall:.2%}')

#model 2 is best

model 1 recall: 81.50%
model 2 recall: 89.06%
model 3 recall: 51.15%
model 4 recall: 34.54%


In [87]:
#phase II is photos that have been initially identified are put through another round of review
#all features from first model are going to be fed into 2nd model
#output photos will be presented to customers, so we don't want false positives
#need precision to minimize false positives
#filter out for each model all the observations where prediction is cat

subset1 = pawsdf[pawsdf.model1 == 'cat']
subset2 = pawsdf[pawsdf.model2 == 'cat']
subset3 = pawsdf[pawsdf.model3 == 'cat']
subset4 = pawsdf[pawsdf.model4 == 'cat']

model1_precision = (subset1.actual == subset1.model1).mean()
model2_precision = (subset2.actual == subset2.model2).mean()
model3_precision = (subset3.actual == subset3.model3).mean()
model4_precision = (subset4.actual == subset4.model4).mean()

print(f'model 1 precision: {model1_precision:.2%}')
print(f'model 2 precision: {model2_precision:.2%}')
print(f'model 3 precision: {model3_precision:.2%}')
print(f'model 4 precision: {model4_precision:.2%}')

#model 4 is best

model 1 precision: 68.98%
model 2 precision: 48.41%
model 3 precision: 35.83%
model 4 precision: 80.72%
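
Parts b and c repeat the same recall and precision steps with a different positive class, so they could be wrapped in one helper. This is a sketch under assumptions: `recall_and_precision` is a hypothetical function, and `toy` is made-up data standing in for `pawsdf`.

```python
import pandas as pd

def recall_and_precision(df, positive, actual_col='actual'):
    """Recall and precision of every prediction column for one positive class."""
    rows = {}
    is_actual = df[actual_col] == positive
    for col in df.columns.drop(actual_col):
        is_pred = df[col] == positive
        tp = (is_actual & is_pred).sum()
        rows[col] = {
            'recall': tp / is_actual.sum(),    # tp / (tp + fn)
            # Note: undefined if a model never predicts the positive class.
            'precision': tp / is_pred.sum(),   # tp / (tp + fp)
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Hypothetical toy predictions from two made-up models.
toy = pd.DataFrame({
    'actual':  ['dog', 'dog', 'dog', 'cat', 'cat', 'dog'],
    'model_a': ['dog', 'dog', 'cat', 'cat', 'dog', 'dog'],
    'model_b': ['dog', 'cat', 'cat', 'cat', 'cat', 'dog'],
})

dog_scores = recall_and_precision(toy, positive='dog')
cat_scores = recall_and_precision(toy, positive='cat')
print(dog_scores)
print(cat_scores)
```

With the real data, the call would look like `recall_and_precision(pawsdf.drop(columns='baseline_prediction'), 'dog')`, assuming the baseline column added earlier is not of interest here.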


### 5. Follow the links below to read the documentation about each function, then apply those functions to the data from the previous problem.

sklearn.metrics.accuracy_score https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

sklearn.metrics.precision_score https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

sklearn.metrics.recall_score https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

sklearn.metrics.classification_report https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [88]:
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

In [90]:
#accuracy from sklearn
model1_accuracy = accuracy_score(pawsdf.actual, pawsdf.model1)
model2_accuracy = accuracy_score(pawsdf.actual, pawsdf.model2)
model3_accuracy = accuracy_score(pawsdf.actual, pawsdf.model3)
model4_accuracy = accuracy_score(pawsdf.actual, pawsdf.model4)

print(f'model 1 accuracy: {model1_accuracy:.2%}')
print(f'model 2 accuracy: {model2_accuracy:.2%}')
print(f'model 3 accuracy: {model3_accuracy:.2%}')
print(f'model 4 accuracy: {model4_accuracy:.2%}')

model 1 accuracy: 80.74%
model 2 accuracy: 63.04%
model 3 accuracy: 50.96%
model 4 accuracy: 74.26%


In [120]:
#precision from sklearn

#the labels are strings ('cat'/'dog'), so the default average='binary' (which looks for pos_label=1) cannot be used as-is
#average: one of 'micro', 'macro', 'samples', 'weighted', 'binary' (default 'binary'), or None
#  - required for multiclass/multilabel targets
#  - None returns the score for each class
#  - 'macro' computes the metric for each label and takes the unweighted mean,
#    so it does not take label imbalance into account
#note: macro-averaged scores differ from the dog-only and cat-only scores in exercise 4
    
model1_precision = precision_score(pawsdf.actual, pawsdf.model1, average='macro')
model2_precision = precision_score(pawsdf.actual, pawsdf.model2, average='macro')
model3_precision = precision_score(pawsdf.actual, pawsdf.model3, average='macro')
model4_precision = precision_score(pawsdf.actual, pawsdf.model4, average='macro')

print(f'model 1 precision: {model1_precision:.2%}')
print(f'model 2 precision: {model2_precision:.2%}')
print(f'model 3 precision: {model3_precision:.2%}')
print(f'model 4 precision: {model4_precision:.2%}')

model 1 precision: 78.99%
model 2 precision: 68.86%
model 3 precision: 50.91%
model 4 precision: 76.92%


In [139]:
#recall from sklearn

model1_recall = recall_score(pawsdf.actual, pawsdf.model1, average='macro')
model2_recall = recall_score(pawsdf.actual, pawsdf.model2, average='macro')
model3_recall = recall_score(pawsdf.actual, pawsdf.model3, average='macro')
model4_recall = recall_score(pawsdf.actual, pawsdf.model4, average='macro')

print(f'model 1 recall: {model1_recall:.2%}')
print(f'model 2 recall: {model2_recall:.2%}')
print(f'model 3 recall: {model3_recall:.2%}')
print(f'model 4 recall: {model4_recall:.2%}')

model 1 recall: 80.92%
model 2 recall: 69.07%
model 3 recall: 51.00%
model 4 recall: 65.06%
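
The macro-averaged scores above are not the same quantity as the per-class precision and recall computed by hand in exercise 4. The sketch below, on hypothetical toy labels rather than the gives_you_paws data, shows how `pos_label` and `average=None` recover the per-class values and how `average='macro'` relates to them.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels, not the gives_you_paws data.
y_true = ['dog', 'dog', 'dog', 'cat', 'cat', 'dog']
y_pred = ['dog', 'dog', 'cat', 'cat', 'dog', 'dog']

# Per-class precision: name the positive class explicitly.
dog_precision = precision_score(y_true, y_pred, pos_label='dog')
cat_precision = precision_score(y_true, y_pred, pos_label='cat')

# average=None returns one score per label, in the order given by `labels`.
per_class = precision_score(y_true, y_pred, average=None, labels=['cat', 'dog'])

# average='macro' is the unweighted mean of the per-class scores.
macro_precision = precision_score(y_true, y_pred, average='macro')

print(dog_precision, cat_precision, per_class, macro_precision)
```

The same pattern applies to `recall_score`. For the dog-team and cat-team questions, `pos_label='dog'` or `pos_label='cat'` is likely the closer match to the manual calculations.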


In [137]:
#classification report from sklearn

model1_classreport = classification_report(pawsdf.actual, pawsdf.model1)
model2_classreport = classification_report(pawsdf.actual, pawsdf.model2)
model3_classreport = classification_report(pawsdf.actual, pawsdf.model3)
model4_classreport = classification_report(pawsdf.actual, pawsdf.model4)

print("model 1 classification report")
print("-----------------------------")
print(model1_classreport)
print("-----------------------------------------------------")
print(" ")
print(" ")
print(" ")

print("model 2 classification report")
print("-----------------------------")
print(model2_classreport)
print("-----------------------------------------------------")
print(" ")
print(" ")
print(" ")

print("model 3 classification report")
print("-----------------------------")
print(model3_classreport)
print("-----------------------------------------------------")
print(" ")
print(" ")
print(" ")

print("model 4 classification report")
print("-----------------------------")
print(model4_classreport)

model 1 classification report
-----------------------------
              precision    recall  f1-score   support

         cat       0.69      0.82      0.75      1746
         dog       0.89      0.80      0.84      3254

    accuracy                           0.81      5000
   macro avg       0.79      0.81      0.80      5000
weighted avg       0.82      0.81      0.81      5000

-----------------------------------------------------
 
 
 
model 2 classification report
-----------------------------
              precision    recall  f1-score   support

         cat       0.48      0.89      0.63      1746
         dog       0.89      0.49      0.63      3254

    accuracy                           0.63      5000
   macro avg       0.69      0.69      0.63      5000
weighted avg       0.75      0.63      0.63      5000

-----------------------------------------------------
 
 
 
model 3 classification report
-----------------------------
              precision    recall  f1-score   