---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [1]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 
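
As a quick illustration of the "fraction of positives" idea, here is a minimal sketch on made-up labels (the `labels` Series is hypothetical and only stands in for the label column of the real data). For 0/1 labels, the mean and the normalized value count of label 1 should agree.

```python
import pandas as pd

# Hypothetical toy labels (not from fraud_data.csv): 2 fraud rows out of 100
labels = pd.Series([1] * 2 + [0] * 98)

# For 0/1 labels, the mean is the fraction of rows equal to 1
fraud_fraction = labels.mean()

# value_counts(normalize=True) returns per-label fractions; .loc[1] selects label 1
fraud_fraction_vc = labels.value_counts(normalize=True).loc[1]

print(fraud_fraction, fraud_fraction_vc)
```

Either form returns a float between 0 and 1, which is the format the question asks for.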

In [2]:
def answer_one():
    # Load the data and compute the fraction of rows labeled as fraud (Class == 1)
    df = pd.read_csv("fraud_data.csv")
    fraud_fraction = df['Class'].value_counts(normalize=True)[1]

    return fraud_fraction


In [3]:
answer_one()

0.016410823768035772

In [4]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

#df = pd.read_csv('readonly/fraud_data.csv')
df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*
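
To see why a majority-class baseline can look accurate while being useless for finding fraud, here is a small sketch on synthetic labels. The arrays `X_toy` and `y_toy` are invented for illustration, and the model is scored on the same data it was fit on purely to keep the example short.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
# The most_frequent strategy ignores the features, so zeros are enough here
X_toy = np.zeros((100, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, pred)  # share of correct predictions
toy_recall = recall_score(y_toy, pred)      # share of true positives that were found

print(toy_accuracy, toy_recall)
```

The baseline never predicts the positive class, so its recall is 0 even though its accuracy equals the share of the majority class.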

In [5]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score

    # Baseline that always predicts the most frequent class in the training labels
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    accuracy = dummy_majority.score(X_test, y_test)

    # Use separate names for the results so the imported recall_score is not shadowed
    pred = dummy_majority.predict(X_test)
    recall = recall_score(y_test, pred)

    return accuracy, recall

In [6]:
answer_two()

(0.9852507374631269, 0.0)

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*
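
The three metrics can be written directly in terms of true/false positives and negatives. The sketch below uses hypothetical labels and predictions (not the assignment data) and checks the hand-computed values against scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels and predictions (not taken from fraud_data.csv)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # fraud correctly flagged
fn = np.sum((y_true == 1) & (y_pred == 0))  # fraud missed
fp = np.sum((y_true == 0) & (y_pred == 1))  # legitimate rows flagged as fraud
tn = np.sum((y_true == 0) & (y_pred == 0))  # legitimate rows left alone

accuracy = (tp + tn) / len(y_true)  # all correct predictions over all rows
recall = tp / (tp + fn)             # found fraud over all actual fraud
precision = tp / (tp + fp)          # found fraud over all rows flagged as fraud

# The library functions should agree with the definitions above
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
```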

In [7]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC

    # SVC with default parameters
    svm = SVC().fit(X_train, y_train)
    pred = svm.predict(X_test)

    # Use separate names for the results so the imported metric functions are not shadowed
    accuracy = svm.score(X_test, y_test)
    recall = recall_score(y_test, pred)
    precision = precision_score(y_test, pred)

    return accuracy, recall, precision

In [8]:
answer_three()



(0.9907817109144543, 0.375, 1.0)

In [9]:
#from sklearn.svm import SVC
#from sklearn.metrics import confusion_matrix
#svm = SVC().fit(X_train, y_train)
#pred = svm.predict(X_test)

#confusion_matrix(y_test, pred)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*
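
Applying a custom threshold means comparing the decision scores to the cutoff yourself instead of calling `predict`. The sketch below uses invented scores and labels to show how lowering the cutoff from 0 to -220 moves rows into the predicted-positive column; treating 0 as the default cutoff is a simplification of what a binary SVC does.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical decision-function scores and true labels (not from fraud_data.csv)
scores = np.array([-300.0, -250.0, -221.0, -220.0, -100.0, 5.0])
y_true = np.array([0, 0, 1, 0, 1, 1])

# Cutoff at 0: only clearly positive scores are labeled 1
default_pred = (scores >= 0).astype(int)
# Cutoff at -220: more rows are labeled 1, trading false positives for fewer misses
low_pred = (scores >= -220).astype(int)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm_default = confusion_matrix(y_true, default_pred)
cm_low = confusion_matrix(y_true, low_pred)

print(cm_default)
print(cm_low)
```

In this toy case the lower cutoff recovers one more true positive at the cost of one false positive.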

In [10]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    svm = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # Score the test set with the already fitted model (no need to refit)
    y_scores_svm = svm.decision_function(X_test)

    # Predict fraud (1) whenever the decision score is at least -220,
    # instead of relying on the default cutoff used by predict()
    pred = (y_scores_svm >= -220).astype(int)

    conf_mat = confusion_matrix(y_test, pred)

    return conf_mat

In [11]:
answer_four()

array([[5320,   24],
       [  14,   66]], dtype=int64)
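
As a worked check on the matrix above, precision and recall can be read off its four cells. This sketch only re-uses the numbers printed by `answer_four()`; scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true classes, columns are predictions
cm = np.array([[5320, 24],
               [14, 66]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # 66 / 90
recall = tp / (tp + fn)          # 66 / 80
accuracy = (tp + tn) / cm.sum()  # 5386 / 5424

print(precision, recall, accuracy)
```

With this threshold the classifier finds 66 of the 80 fraud cases in the test set while raising 24 false alarms.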


### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*
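
Both curves come back as arrays of points, so "recall at precision 0.75" amounts to looking up the point whose precision is closest to 0.75. The sketch below shows one way to do that lookup; `value_at` is a hypothetical helper, and the labels and scores are made up. Because a curve can contain several points with the same x value, the helper returns the first match, which is a design choice rather than the only reasonable answer.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve


def value_at(x, y, target):
    """Return y at the first curve point whose x is closest to target."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return y[np.abs(x - target).argmin()]


# Hypothetical labels and predicted fraud probabilities (not from fraud_data.csv)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.05, 0.2, 0.3, 0.45, 0.5, 0.7, 0.1, 0.9])

precision, recall, _ = precision_recall_curve(y_true, scores)
fpr, tpr, _ = roc_curve(y_true, scores)

recall_at_p75 = value_at(precision, recall, 0.75)
tpr_at_fpr25 = value_at(fpr, tpr, 0.25)

print(recall_at_p75, tpr_at_fpr25)
```

Plotting the curves and reading the value by eye is an equally valid way to answer the question.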

In [33]:
## Your code here
#from sklearn.linear_model import LogisticRegression
#from sklearn.metrics import precision_recall_curve
#from sklearn.metrics import roc_curve

    
#lr = LogisticRegression().fit(X_train, y_train)
#lr_predicted = lr.predict_proba(X_test)[:,1]

#precision, recall, thresholds = precision_recall_curve(y_test, lr_predicted)
#precision, recall, thresholds

#prec_value = precision[(np.abs(precision-0.75)).argmin()]
#prec_value

#recall_value = [(recall) for precision, recall in zip(precision, recall) if precision == prec_value]
#recall_value

#roc_curve(y_test, lr_predicted)


In [12]:
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import roc_curve

    lr = LogisticRegression().fit(X_train, y_train)
    # Probability of the positive class (fraud) for each test row
    lr_predicted = lr.predict_proba(X_test)[:, 1]

    # Recall at the curve point whose precision is closest to 0.75
    precision, recall, thresholds = precision_recall_curve(y_test, lr_predicted)
    recall_value = recall[np.abs(precision - 0.75).argmin()]

    # TPR at the curve point whose FPR is closest to 0.16
    fpr_lr, tpr_lr, _ = roc_curve(y_test, lr_predicted)
    tpr_lr_value = tpr_lr[np.abs(fpr_lr - 0.16).argmin()]

    return recall_value, tpr_lr_value

In [13]:
answer_five()



(0.825, 0.9375)


### Question 6

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*
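
Whether a plain `reshape(5, 2)` gives rows for `C` and columns for `penalty` depends on the order in which the parameter combinations are enumerated. The sketch below inspects that order with `ParameterGrid`, assuming the grid search walks the combinations in the same order; `fake_scores` is a stand-in for `mean_test_score`, not real results.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

# ParameterGrid sorts the parameter names, so 'C' is the outer loop and 'penalty' the inner one
combos = list(ParameterGrid(grid_values))
for i, params in enumerate(combos[:4]):
    print(i, params)

# Stand-in for cv_results_['mean_test_score']: each fake score is just its position in the list
fake_scores = np.arange(len(combos), dtype=float)

# With 'penalty' varying fastest, each row holds one C value and the columns are l1, l2
table = fake_scores.reshape(5, 2)
print(table)
```

If the parameter names sorted the other way around, the reshape would need a transpose to match the table in the question.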

In [14]:
def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # GridSearchCV fits the estimator itself, so it can be passed in unfitted
    lr = LogisticRegression()

    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
    grid_lr = GridSearchCV(lr, param_grid=grid_values, scoring='recall')

    grid_lr.fit(X_train, y_train)

    # Scores are ordered with 'penalty' varying fastest, so each row is one value of C
    grid_res = grid_lr.cv_results_['mean_test_score'].reshape(5, 2)

    return grid_res

In [15]:
answer_six()



array([[0.66666667, 0.76086957],
       [0.80072464, 0.80434783],
       [0.8115942 , 0.8115942 ],
       [0.80797101, 0.8115942 ],
       [0.80797101, 0.8115942 ]])

In [None]:
# Use the following function to help visualize results from the grid search
#def GridSearch_Heatmap(scores):
#    %matplotlib notebook
#    import seaborn as sns
#    import matplotlib.pyplot as plt
#    plt.figure()
#    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
#    plt.yticks(rotation=0);

#GridSearch_Heatmap(answer_six())