## Can AI Detect Cancer?

Let's look at a standard breast cancer dataset and practice evaluating a classification model.

In [None]:
import sklearn
from sklearn import datasets

cancer = datasets.load_breast_cancer() # Load the cancer dataset from scikit-learn

print("Feature Names")
print(cancer.feature_names)

# These are all the different features in the dataset. Would you consider dropping one of these features?
print('')

print("Names of Classes")
print(cancer.target_names)

# We find that either a tumor is 'malignant' or it is 'benign'.
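The integers stored in `cancer.target` are positions in `cancer.target_names`. Here is a minimal sketch of that mapping, assuming the names come back in the order `['malignant', 'benign']` (the order scikit-learn uses for this dataset). The `label_of` name is only for illustration.

```python
# Assumption: this list mirrors cancer.target_names for the breast cancer dataset.
target_names = ["malignant", "benign"]

# Each integer label in cancer.target is an index into target_names.
label_of = dict(enumerate(target_names))

for label, name in label_of.items():
    print(label, "=", name)
```

Under that assumption, a label of 0 means 'malignant' and a label of 1 means 'benign'. It is worth confirming this on the real `cancer.target_names` before counting positives and negatives.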

In [None]:
# Now, let's plot two features against each other, colored by class, and look for a pattern
%matplotlib inline
import matplotlib.pyplot as plt

X = cancer.data
y = cancer.target
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('Mean radius')
plt.ylabel('Mean texture')
plt.show()

Do you see any clear trends in the data? Plotting helps machine learning researchers get a feel for what the data looks like.

### Evaluating the Model

Now, let's build a model that, given an arbitrary tumor, identifies whether it is malignant or benign. We will use a classifier called a decision tree.

Once it is trained, we want to figure out how well it is doing.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Create the training and testing datasets
X = cancer.data # X, what we're using to predict, is the features
y = cancer.target # y, what we're predicting, is the classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) # We split the data into 'train' and 'test'

# Train a classifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train) # Put these into the classifier

# Use the trained classifier to predict whether a new piece of data is benign or malignant
prediction = classifier.predict(X_test)

# Note that these predictions aren't necessarily correct
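To make the `test_size=0.20` argument concrete, here is a rough sketch of what a hold-out split does: shuffle the rows, then set a fraction aside for testing. `holdout_split` is a hypothetical helper written for illustration. It is not scikit-learn's implementation, and it fixes a seed only so the result is repeatable.

```python
import random

def holdout_split(rows, test_size=0.20, seed=0):
    """Shuffle rows and hold out a fraction of them for testing (illustrative helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # seeded shuffle, so reruns give the same split
    n_test = round(len(shuffled) * test_size)  # number of rows held out for testing
    return shuffled[n_test:], shuffled[:n_test]

# Ten stand-in rows: 80% go to training, 20% to testing.
train_rows, test_rows = holdout_split(range(10), test_size=0.20)
print(len(train_rows), len(test_rows))
```

The model only ever sees the training rows, so its score on the test rows estimates how it would do on tumors it has never encountered.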

### Accuracy

Accuracy, as we normally think of it, is the number of correct predictions divided by the total number of predictions.
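As a small worked example of that ratio, here is a sketch on five made-up tumors. The labels are hypothetical and are not taken from the dataset.

```python
# Hypothetical labels for five tumors (not taken from the dataset).
actual    = ["malignant", "benign", "benign",    "malignant", "benign"]
predicted = ["malignant", "benign", "malignant", "malignant", "benign"]

# Count the positions where the prediction agrees with the actual label.
matches = sum(1 for a, p in zip(actual, predicted) if a == p)

toy_accuracy = matches / len(actual)
print(toy_accuracy)  # 4 correct out of 5 -> 0.8
```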

#### Calculating the Correct and Total Predictions

In [None]:
# From scratch way

correct = [] # This list stores every prediction that was correct
for i in range(0, len(y_test)): # Iterate through all of our predictions
    if y_test[i] == prediction[i]: # If our prediction matches the actual value,
        correct.append(y_test[i]) # then put it in the 'correct' list

num_correct = float(len(correct))
num_pred = float(len(y_test))

print(num_correct)
print(num_pred)

#### Practice

Using the accuracy formula, define accuracy in terms of num_correct and num_pred.

In [None]:
# TODO:
accuracy = 0

# Our accuracy is defined by the number of correct predictions divided by the total number of predictions
print("Accuracy:")
print(accuracy)

In [None]:
# A faster way is to do it with a built-in function, accuracy_score
from sklearn.metrics import accuracy_score
print("Accuracy:")
print(accuracy_score(y_test, prediction))

### Precision

The formula for precision is (true positive)/(true positive + false positive).
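Here is a sketch of the formula on six made-up tumors, treating 'malignant' as the positive class. The labels are hypothetical, and string labels are used so the example does not depend on how the dataset encodes its classes.

```python
# Hypothetical labels for six tumors; 'malignant' is the positive class.
actual    = ["malignant", "benign",    "malignant", "benign", "benign",    "malignant"]
predicted = ["malignant", "malignant", "malignant", "benign", "malignant", "benign"]

pairs = list(zip(actual, predicted))

# True positive: predicted malignant and actually malignant.
tp = sum(1 for a, p in pairs if p == "malignant" and a == "malignant")
# False positive: predicted malignant but actually benign.
fp = sum(1 for a, p in pairs if p == "malignant" and a == "benign")

toy_precision = tp / (tp + fp)
print(tp, fp, toy_precision)  # 2 2 0.5
```

Of the four tumors flagged as malignant, only two really were, so half of the alarms were correct.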

#### Calculating the Number of True Positives


A true positive is a prediction of 'malignant' that matches the correct value (in y_test).

Note: scikit-learn encodes this dataset as 0 = 'malignant' and 1 = 'benign', following the order of cancer.target_names.

To calculate the number of true positives, let's iterate through and find predictions that are both 'malignant' and correct.

In [None]:
true_positive = []
for i in range(0, len(y_test)): # Iterate through all of our predictions
    if prediction[i] == 0 and y_test[i] == prediction[i]: # If we predict malignant (label 0) and our prediction is correct
        true_positive.append(y_test[i]) # then this prediction is a true positive

#### Calculating the Number of False Positives

A false positive is a tumor prediction that says 'malignant' and does NOT match with the correct value (in y_test). To calculate the number of false positives, let's go through all of our predictions and see which ones are 'malignant' and don't match their correct values.

In [None]:
# In this dataset, label 0 = 'malignant' and label 1 = 'benign'
false_positive = []
for i in range(0, len(y_test)): # Iterate through all of our predictions
    if prediction[i] == 0 and y_test[i] != prediction[i]: # If we predict malignant (label 0) and our prediction is incorrect
        false_positive.append(y_test[i]) # then this prediction is a false positive

Now we define num_true_pos as the length of the true positives list, and num_false_pos similarly.

In [None]:
num_true_pos = len(true_positive)
num_false_pos = len(false_positive)

#### Practice

Your task now is to use num_true_pos and num_false_pos to calculate the precision, with our formula: precision = (true positive)/(true positive + false positive).

In [None]:
# TODO: Define variable 'precision' in terms of num_true_pos and num_false_pos, using our formula.
precision = 0

print("Precision: ")
print(precision)

### Recall

The formula for recall is (true positive)/(true positive + false negative).  
We can reuse the true_positive list from earlier, so all we have left to calculate is the number of false negatives: tumors predicted as 'benign' that are actually malignant.
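Here is a sketch of the recall formula on six made-up tumors, again treating 'malignant' as the positive class. The labels are hypothetical and are not taken from the dataset.

```python
# Hypothetical labels for six tumors; 'malignant' is the positive class.
actual    = ["malignant", "malignant", "malignant", "benign", "benign",    "malignant"]
predicted = ["malignant", "benign",    "malignant", "benign", "malignant", "malignant"]

pairs = list(zip(actual, predicted))

# True positive: predicted malignant and actually malignant.
tp = sum(1 for a, p in pairs if p == "malignant" and a == "malignant")
# False negative: predicted benign but actually malignant (a missed cancer).
fn = sum(1 for a, p in pairs if p == "benign" and a == "malignant")

toy_recall = tp / (tp + fn)
print(tp, fn, toy_recall)  # 3 1 0.75
```

Four tumors were truly malignant and the predictions caught three of them, so one in four was missed.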

#### Calculating False Negatives

In [None]:
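# In this dataset, label 0 = 'malignant' and label 1 = 'benign'
false_negative = []
for i in range(0, len(y_test)): # Iterate through all of our predictions
    if prediction[i] == 1 and y_test[i] != prediction[i]: # If we predict benign (label 1) but the tumor is actually malignant
        false_negative.append(y_test[i]) # then this prediction is a false negative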

num_false_neg = len(false_negative)
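The separate loops for true positives, false positives, and false negatives can also be folded into a single pass. Here is one possible sketch on made-up string labels. `confusion_counts` is a hypothetical helper, not a scikit-learn function, and it takes the positive class as a parameter so that the encoding is explicit.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="malignant"):
    """Tally true/false positives and negatives in one pass (illustrative helper)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Hypothetical labels for six tumors (not taken from the dataset).
actual    = ["malignant", "benign",    "malignant", "benign", "benign", "malignant"]
predicted = ["malignant", "malignant", "benign",    "benign", "benign", "malignant"]

counts = confusion_counts(actual, predicted)
print(counts["tp"], counts["fp"], counts["fn"], counts["tn"])  # 2 1 1 2
```

Naming the positive class in one place makes it harder to mix up which label counts as 'malignant'.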

#### Practice

Your task is to define the 'recall' variable in terms of num_true_pos and num_false_neg. Remember that recall = (true positive)/(true positive + false negative).

In [None]:
# TODO: Define variable 'recall' in terms of num_true_pos and num_false_neg.
recall = 0
print("Recall:")
print(recall)

## AI Ethics Reflections

1. Given that we're doing a cancer detection task, which error is worse: a false positive (reporting that a tumor is malignant when it actually isn't) or a false negative (reporting that a tumor is benign when it's not)?  


2. Precision matters most when false positives are especially costly, and recall matters most when false negatives are especially costly. Given your answer from above and our precision or recall score, how do you think the model is doing? (5 = amazingly, 1 = very poor)