---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, which is the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [1]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 
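
As a minimal sketch of the idea behind this question: when the target is coded as 0/1, the share of positive instances is simply the mean of the label vector. The labels below are made up for illustration and are not taken from `fraud_data.csv`.

```python
import numpy as np

# Hypothetical labels: 1 = fraud, 0 = not fraud (illustrative only)
labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# The mean of a 0/1 vector equals the fraction of ones
fraud_fraction = float(labels.mean())
print(fraud_fraction)
```

The same reasoning applies to a pandas column, since the mean of a 0/1 column does not depend on the order in which the classes appear.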

In [2]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [3]:
def answer_one():
    
    # read data
    data_fraud_raw = pd.read_csv('fraud_data.csv')
    
    # fraction of fraud instances (Class == 1); the mean of a 0/1 column is the share of ones
    result_answer_one = data_fraud_raw['Class'].mean()
    
    return result_answer_one

# answer_one()
# type(answer_one())

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*
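
To see why a majority-class baseline can score high accuracy yet zero recall, here is a hand-rolled sketch on toy labels. It is a pure-Python stand-in for the idea, not scikit-learn's `DummyClassifier`, and the names `y_train_toy` and `y_test_toy` are hypothetical.

```python
from collections import Counter

# Hypothetical, imbalanced toy labels (1 = fraud, 0 = not fraud)
y_train_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_test_toy = [0, 0, 1, 0, 1]

# "Training" only records the most frequent class in the training labels
majority = Counter(y_train_toy).most_common(1)[0][0]

# Every test instance receives that same prediction
preds = [majority] * len(y_test_toy)

# Accuracy: share of predictions that match the true label
accuracy = sum(p == t for p, t in zip(preds, y_test_toy)) / len(y_test_toy)

# Recall: share of actual positives that were predicted positive
true_pos = sum(p == 1 and t == 1 for p, t in zip(preds, y_test_toy))
actual_pos = sum(t == 1 for t in y_test_toy)
recall = true_pos / actual_pos

print(majority, accuracy, recall)
```

Because the baseline never predicts the positive class, it finds none of the fraud cases, which is why accuracy alone is a poor yardstick on imbalanced data.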

In [4]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score, accuracy_score
    
    # train a dummy classifier (that classifies everything as the majority class == 0; not fraudulent)
    dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
    
    # predictions
    y_dummy_predictions = dummy_majority.predict(X_test)
    
    # calculate accuracy and recall
    accuracy_score_answer_two = accuracy_score(y_test, y_dummy_predictions)
    recall_score_answer_two = recall_score(y_test, y_dummy_predictions) 
    result_answer_two = (accuracy_score_answer_two, recall_score_answer_two)
    
    return result_answer_two

# answer_two()

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*
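
The three metrics requested here can all be derived from the four confusion-matrix counts. The sketch below uses a hypothetical helper, `classification_metrics`, with made-up counts; it only illustrates the definitions and is not a replacement for the `sklearn.metrics` functions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never present or predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Made-up counts: 50 actual positives, 40 predicted positives, 1000 instances
acc, rec, prec = classification_metrics(tp=30, fp=10, fn=20, tn=940)
print(acc, rec, prec)
```

With these counts the classifier is right on 97% of instances, yet it finds only 60% of the positives, and 75% of its positive predictions are correct.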

In [5]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score, accuracy_score
    from sklearn.svm import SVC

    # train a SVC classifier with default parameters
    svc_answer_three = SVC(kernel = 'rbf', C = 1).fit(X_train, y_train)
    svc_predicted_answer_three = svc_answer_three.predict(X_test)
    
    # calculate accuracy, recall and precision    
    accuracy_score_answer_three = accuracy_score(y_test, svc_predicted_answer_three)
    recall_score_answer_three = recall_score(y_test, svc_predicted_answer_three)
    precision_score_answer_three = precision_score(y_test, svc_predicted_answer_three)
    result_answer_three = (accuracy_score_answer_three,
                           recall_score_answer_three,
                           precision_score_answer_three)
    
    return result_answer_three

# answer_three()
# type(answer_three())

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*
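
Applying a custom threshold means turning raw decision scores into 0/1 predictions yourself, then counting outcomes. A rough sketch with invented scores and labels follows; the matrix is laid out with rows as true classes and columns as predicted classes, which is the convention `sklearn.metrics.confusion_matrix` uses.

```python
import numpy as np

# Invented decision scores and true labels (illustrative only)
scores = np.array([-500.0, -250.0, -219.0, -100.0, 3.0, -400.0])
y_true = np.array([0, 0, 0, 1, 1, 1])
threshold = -220

# Scores strictly above the threshold are predicted as the positive class
y_pred = (scores > threshold).astype(int)

# Build the 2x2 confusion matrix: rows = true class, columns = predicted class
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
```

Lowering the threshold moves more instances into the predicted-positive column, which tends to raise recall at the cost of more false positives.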

In [6]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    # fit the model using SVC classifier
    svc_answer_four = SVC(C = 1e9, gamma = 1e-07).fit(X_train, y_train)
    svc_predicted_answer_four = svc_answer_four.decision_function(X_test)
    
    # apply the threshold to the decision function scores
    threshold_answer_four = -220
    svc_predicted_answer_four = (svc_predicted_answer_four > threshold_answer_four).astype(int)

    # calculate the confusion matrix
    confusion_matrix_answer_four = confusion_matrix(y_test, svc_predicted_answer_four)
    
    return confusion_matrix_answer_four

# answer_four()
# type(answer_four())

### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and an ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*
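
Reading a value off a curve programmatically amounts to finding the curve point nearest the target on one axis and returning the matching value on the other. The arrays below are invented and merely mimic the shape of what `precision_recall_curve` returns; matching on the nearest point avoids relying on exact floating-point equality.

```python
import numpy as np

# Invented curve points (illustrative only)
precision_toy = np.array([0.50, 0.60, 0.75, 0.90, 1.00])
recall_toy = np.array([1.00, 0.90, 0.80, 0.40, 0.00])

target_precision = 0.75

# Index of the first point whose precision is closest to the target
idx = int(np.argmin(np.abs(precision_toy - target_precision)))
recall_at_target = float(recall_toy[idx])

print(idx, recall_at_target)
```

The same lookup works for an ROC curve by swapping in the false positive rate as the search axis and the true positive rate as the value to read.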

In [49]:
def answer_five():
        
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve, auc
#     import matplotlib.pyplot as plt

    # fit the model using logistic regression classifier
    logreg_answer_five = LogisticRegression().fit(X_train, y_train)
    logreg_predicted_answer_five = logreg_answer_five.decision_function(X_test)

    # create precision-recall curve
    precision, recall, thresholds = precision_recall_curve(y_test, logreg_predicted_answer_five)
    closest_zero = np.argmin(np.abs(thresholds))
    closest_zero_p = precision[closest_zero]
    closest_zero_r = recall[closest_zero]
    # exact float equality is brittle, so take the first point whose precision is closest to 0.75
    recall_at_defined_precision = recall[np.argmin(np.abs(precision - 0.75))]
#     print('Recall at precision set to 0.75: ', recall_at_defined_precision)

#     plt.figure()
#     plt.xlim([0.0, 1.01])
#     plt.ylim([0.0, 1.01])
#     plt.plot(precision, recall, label='Precision-Recall Curve')
#     plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
#     plt.xlabel('Precision', fontsize=16)
#     plt.ylabel('Recall', fontsize=16)
#     plt.axes().set_aspect('equal')
#     plt.show()

    # create ROC curve
    logreg_prob_answer_five = logreg_answer_five.predict_proba(X_test)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, logreg_prob_answer_five[:, 1])
    roc_auc_lr = auc(fpr_lr, tpr_lr)
    tpr_at_defined_fpr = tpr_lr[np.where(np.round(fpr_lr, 2) == 0.16)[0][0]]
#     print('True positive rate at false positive rate set to 0.16: ', tpr_at_defined_fpr)

#     plt.figure()
#     plt.xlim([-0.01, 1.00])
#     plt.ylim([-0.01, 1.01])
#     plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
#     plt.xlabel('False Positive Rate', fontsize=16)
#     plt.ylabel('True Positive Rate', fontsize=16)
#     plt.title('ROC curve (Fraud data)', fontsize=16)
#     plt.legend(loc='lower right', fontsize=13)
#     plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
#     plt.axes().set_aspect('equal')
#     plt.show()

    # results collected in a tuple
    result_answer_five = (recall_at_defined_precision, tpr_at_defined_fpr)
    
    return result_answer_five

# answer_five()
# type(answer_five())

(0.82499999999999996, 0.9375)

### Question 6

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and 3-fold cross validation (the default in older scikit-learn releases; newer releases default to 5-fold, so pass `cv=3` explicitly).

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*
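
Whether a plain reshape yields the table above depends on the order in which the grid search enumerates parameter combinations. The sketch below assumes that scikit-learn iterates over the sorted parameter names with the last name varying fastest, and mimics that with `itertools.product`; the scores are placeholders, not real cross-validation results.

```python
from itertools import product

import numpy as np

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

# Assumed enumeration order: sorted keys, last key varying fastest
keys = sorted(param_grid)  # uppercase 'C' sorts before lowercase 'penalty'
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]

# Placeholder scores 0..9 stand in for cv_results_['mean_test_score']
fake_scores = np.arange(len(combos), dtype=float)

# Rows follow C, columns follow penalty, matching the table layout
table = fake_scores.reshape(5, 2)

print(combos[:3])
print(table)
```

Under that assumption each consecutive pair of scores belongs to one value of `C`, so reshaping to 5 rows by 2 columns puts `l1` in the first column and `l2` in the second.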

In [44]:
def answer_six():    
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

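    # create the estimator; GridSearchCV fits it itself, and the 'l1' penalty needs a solver
    # that supports it (liblinear was the default solver in older scikit-learn releases)
    logreg_answer_six = LogisticRegression(solver = 'liblinear')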

    # define parameters
    params_grid = {'C': [0.01, 0.1, 1, 10, 100],
                   'penalty': ['l1', 'l2']
                  }

    # create GridSearchCV object
    grid_answer_six = GridSearchCV(logreg_answer_six, params_grid, scoring = 'recall', cv = 3).fit(X_train, y_train)

    # mean test scores from cv_results_
    result_answer_six = grid_answer_six.cv_results_['mean_test_score'].reshape((5,2))
    
    return result_answer_six

# answer_six()
# type(answer_six())

numpy.ndarray

In [46]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0);

# GridSearch_Heatmap(answer_six())