# Precision and Recall

Earlier, we mentioned that metrics are an important concept when evaluating machine learning models. For classification problems, accuracy is only one such metric; there are other important ones.

In this exercise, we will evaluate our model with two new metrics: precision and recall.

## Please answer the questions

To help you understand precision and recall, please research the questions below and enter your answers.

* **Question 1: What is your understanding of these terms: true positive, false positive, true negative, false negative?**

    * true positive - a true positive occurs when an outcome/observation is detected and that outcome/observation is present

    * true negative - a true negative occurs when an outcome/observation is not detected and that outcome/observation is not present

    * false positive - a false positive occurs when an outcome/observation is detected and that outcome/observation is not actually present

    * false negative - a false negative occurs when an outcome/observation is not detected and that outcome/observation is actually present

* **Question 2: What are the relationships between those terms and precision or recall?**

* Please write down your answer as two simple mathematical equations

    * precision - precision, also known as positive predictive value, is the proportion of positive predictions that are actually positive. This can be represented by the equation:
        * true positives / (true positives + false positives)
    * recall - recall, also known as sensitivity or true positive rate, is the proportion of actual positive outcomes/observations that are correctly identified. This can be represented by the equation:
        * true positives / (true positives + false negatives)
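
The definitions and the two equations above can be checked on a tiny example. This is a minimal sketch; the labels `y_true_toy` and `y_pred_toy` are made up for illustration and are not part of the exercise data.

```python
# Hypothetical labels for illustration only: 1 = positive, 0 = negative.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))

# Count each of the four outcomes from Question 1.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # detected and present
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not detected, not present
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # detected but not present
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # not detected but present

# Apply the two equations from Question 2.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)
print(precision, recall)
```

Precision only looks at the samples the model flagged as positive, while recall only looks at the samples that really are positive, so the two can differ for the same set of predictions.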


**Answer:** Please double-click the cell and enter your answer here.

In [6]:
#import key modules from scikit-learn and import numpy library

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
import numpy as np

## Below is an example of how to get the precision of your model

**Attention**: You need to complete one line of code to make the whole example work.

In [23]:
#Let's load iris data again

#import iris dataset, set variables X, y to the independent and dependent variables
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [26]:
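#the data attribute of the iris dataset includes 150 samples, each with four features
#the 804 columns shown below appear because this cell (In [26]) was run after the cell that appends 800 noise features (In [25])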
X.shape

(150, 804)

In [25]:
# Let's split the data into training and test sets.
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape

#the numpy object 'c_' concatenates arrays column-wise (along the second axis)
#the next line takes the original data array held by the variable X and appends
#an array of random numbers that has the shape (# of samples, 200 * # of features).
#In this case we are appending a 150 by 800 array of random noise, giving 804 features in total
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
                                                    test_size=.5,
                                                    random_state=random_state)
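
To see what the column-wise concatenation above does to the shape, here is a minimal sketch on a much smaller array. The names `X_small`, `noise`, and `X_wide` are hypothetical and only used for this illustration; the noise is 2 times the feature count here rather than 200 times.

```python
import numpy as np

rng = np.random.RandomState(0)

# A small stand-in for the iris data: 3 samples with 4 features each.
X_small = np.arange(12, dtype=float).reshape(3, 4)

# Random noise with the same number of rows and 2 * 4 = 8 extra columns.
noise = rng.randn(3, 2 * 4)

# np.c_ stacks the two arrays side by side (along the second axis).
X_wide = np.c_[X_small, noise]

print(X_small.shape, noise.shape, X_wide.shape)
```

The original columns stay in place on the left and the noise columns are appended on the right. In the notebook the same step turns 4 features into 4 + 800 = 804, which makes the classification problem deliberately harder.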


In [27]:
# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)

In [28]:
# How could we fit the model? Please find the solution in our earlier example, and write
# the code to fit the SVM model on the training data.

#fit using X_train and y_train
classifier.fit(X_train, y_train)


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2',
     random_state=<mtrand.RandomState object at 0x7f19f7041678>,
     tol=0.0001, verbose=0)

In [35]:
classifier.coef_

array([[ -1.07331978e-02,  -4.24068527e-02,   6.73259276e-02,
          2.89269882e-02,  -2.19487314e-02,   7.49967365e-03,
          5.78062384e-03,   5.43868406e-03,   3.68080630e-03,
         -3.37823169e-03,  -1.50985409e-02,  -1.59896708e-03,
          6.38862284e-03,   6.36462127e-03,  -1.46830884e-03,
          7.00417712e-03,   2.11050342e-03,   3.50091779e-03,
         -2.46561543e-03,  -5.54884717e-03,   2.24128553e-03,
         -1.15914309e-02,   4.84175037e-03,   6.98670700e-04,
         -7.21057992e-03,  -7.89538117e-03,   8.66282729e-03,
          4.35182240e-03,  -1.09848513e-02,  -1.24304196e-03,
          1.32617521e-02,  -9.07481421e-03,  -6.95035728e-03,
         -2.42384625e-03,   5.30471738e-03,   2.45606697e-02,
          3.14166637e-03,   1.68367984e-04,  -9.06871325e-03,
         -9.26279236e-03,   6.42655547e-03,  -7.91715884e-03,
         -1.00650418e-03,   3.95504772e-03,   2.52583011e-03,
          4.63991301e-03,  -6.59422808e-03,  -2.12015152e-02,
        

In [37]:
# After you have fit the model, we can make predictions.
# decision_function returns a signed score for each test example that is proportional to its
# distance from the hyperplane separating the two classes. the sign of the value indicates
# which side of the hyperplane the sample is on

y_score = classifier.decision_function(X_test)

In [38]:
#y_score is a 1-D array with one score per test sample. limiting the dataset to 2 of the
#three classes keeps 100 of the 150 samples, and a test size of .5 leaves 50 of them for testing,
#so y_score has shape (50,)
y_score.shape

(50,)

In [39]:
# each number is the signed score of one test example, proportional to its distance from the
# hyperplane. the bigger the magnitude, the further away it is
# the sign indicates which side of the hyperplane it's on
y_score

array([-0.20078869,  0.30423874,  0.20105976,  0.27523711,  0.42593404,
       -0.15043726, -0.08794601, -0.12733462,  0.22931596, -0.23913518,
       -0.06386267, -0.14958466, -0.04914839,  0.09898417,  0.0515638 ,
       -0.1142941 ,  0.18899737,  0.04871897, -0.08258102, -0.26105668,
        0.24693291, -0.18318328, -0.38384994,  0.26336904,  0.12585371,
       -0.03991278,  0.39424539,  0.42411536, -0.4790443 , -0.30529061,
       -0.09281931,  0.01213433, -0.20204098,  0.40148935, -0.04536122,
        0.12179099,  0.06493837, -0.07007139,  0.0032915 , -0.39635676,
        0.02619439,  0.20018683,  0.065023  ,  0.49589616, -0.28221895,
        0.31364573,  0.1906223 ,  0.11549516,  0.03145977,  0.22408591])

In [43]:
# returns a 1 or a 0 to indicate which class the classifier believes each test example belongs to

y_predict = classifier.predict(X_test)

In [41]:
y_predict.shape

(50,)

In [42]:
# class predictions. the 0's have the same index as the negative numbers in the y_score variable
y_predict

array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1])
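
The link between the sign of the score and the predicted class can be checked directly. This is a minimal sketch that copies the first five values of `y_score` and `y_predict` from the outputs above; the names `scores_head`, `predictions_head`, and `from_sign` are hypothetical.

```python
# First five decision scores and class predictions, copied from the outputs above.
scores_head = [-0.20078869, 0.30423874, 0.20105976, 0.27523711, 0.42593404]
predictions_head = [0, 1, 1, 1, 1]

# For a two-class linear model, a positive score maps to class 1 and a
# non-positive score maps to class 0.
from_sign = [1 if score > 0 else 0 for score in scores_head]

print(from_sign)
print(from_sign == predictions_head)
```

In other words, `predict` amounts to thresholding the decision score at zero for this two-class problem. Precision-recall analysis works on the raw scores instead, because it looks at what happens when that threshold is moved.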

In [46]:
y_test

array([1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0])

In [49]:
y_predict - y_test

array([-1,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0, -1,  0,  1,  0,  0,  1,
        1, -1,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  1,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  1])
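
In the difference above, a `1` marks a false positive (predicted 1, actual 0) and a `-1` marks a false negative (predicted 0, actual 1). The sketch below counts them and applies the precision and recall equations from the questions section. The two lists are copied from the `y_predict` and `y_test` outputs above; the `_list` suffix names are hypothetical.

```python
# Copied from the y_predict and y_test outputs above.
y_predict_list = [0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
                  1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
                  1, 1, 1, 1]
y_test_list = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
               1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
               1, 0, 1, 0]

diff = [p - t for p, t in zip(y_predict_list, y_test_list)]

fp = diff.count(1)    # predicted 1, actual 0
fn = diff.count(-1)   # predicted 0, actual 1
tp = sum(1 for p, t in zip(y_predict_list, y_test_list) if p == 1 and t == 1)
tn = sum(1 for p, t in zip(y_predict_list, y_test_list) if p == 0 and t == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)
print(round(precision, 2), round(recall, 2))
```

For the arrays shown, this gives 7 false positives and 5 false negatives out of 50 test samples. These values describe the predictions at the default threshold only.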

## Get the average precision score by running the cell below

In [44]:
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, y_score)

In [47]:
# average_precision was computed above from the true y_test labels and the y_score values;
# it summarizes the precision-recall curve as a single number (the average precision)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

Average precision-recall score: 0.88
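
To make the number above less of a black box, here is a minimal sketch of the summation that the scikit-learn documentation gives for average precision: the precision at each threshold, weighted by the increase in recall from the previous threshold. The function `average_precision_sketch` and the toy data are hypothetical, and the sketch assumes there are no tied scores; it is meant as an illustration, not a replacement for `average_precision_score`.

```python
def average_precision_sketch(y_true, y_score):
    """AP = sum over ranks of (R_n - R_{n-1}) * P_n, assuming no tied scores."""
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    total_positives = sum(y_true)

    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            true_positives += 1
        precision = true_positives / rank            # precision among the top `rank` samples
        recall = true_positives / total_positives    # recall among the top `rank` samples
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical toy data for illustration only.
y_true_toy = [1, 0, 1, 1, 0]
y_score_toy = [0.9, 0.8, 0.7, 0.4, 0.2]

ap_toy = average_precision_sketch(y_true_toy, y_score_toy)
print(round(ap_toy, 4))
```

Recall only increases when a positive sample is reached, so only those ranks contribute. This is why the score rewards a classifier that ranks the positive samples ahead of the negative ones, regardless of where the zero threshold falls.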
