# Logistic Regression for Credit Card Fraud Detection (10 pts)

Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, which is the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

### Loading the data (1 pt)
Load the data from `fraud_data.csv`.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('fraud_data.csv')

## Print the percentage of fraud observations
print("Percentage of fraud observations {:.2%}".format(data['Class'].sum() / data.shape[0]))

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # Your code here

Percentage of fraud observations 1.64%


**Question:** What percentage of the observations in the dataset are instances of fraud?

***1.64% of the observations from the dataset are fraud***
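
With a positive class this rare, a plain random split can leave the training and test sets with slightly different fraud rates. The sketch below is not part of the assignment's required code; it uses made-up data (`X_demo`, `y_demo`, 2% positives) to show how the `stratify` argument of `train_test_split` keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 2% positives (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("Train positive rate: {:.2%}".format(train_rate))
print("Test positive rate:  {:.2%}".format(test_rate))
```

The split used in this notebook does not stratify, so its numbers may differ slightly from what a stratified split would give.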

### Predictions using the majority class label (4pts)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? (Here accuracy is the ratio of the number of correctly classified transactions to the total number of transactions)

In [2]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
    
## Instantiate and fit a dummy classifier that always predicts the majority class of the training data
## Use DummyClassifier in sklearn with strategy 'most_frequent'
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

dummy_test_pred = dummy.predict(X_test)

## Measure test accuracy of your dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)

print('Dummy classifier accuracy:', dummy_test_acc)

Dummy classifier accuracy: 0.9852507374631269


**Question:** *How does the accuracy of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

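***The dummy classifier's accuracy is very high (98.5%). However, this is misleading because the classes are very unbalanced: only about 1.6% of the transactions are fraud, so always predicting 'not fraud' is correct for almost every transaction without detecting any fraud.***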
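
The arithmetic behind this answer can be sketched in plain Python. The label counts below (985 legitimate, 15 fraudulent) are invented for illustration and are not taken from `fraud_data.csv`.

```python
# Hypothetical labels: 985 legitimate (0) and 15 fraudulent (1) transactions.
y_true = [0] * 985 + [1] * 15

# A majority-class predictor labels every transaction as 0.
y_pred = [0] * len(y_true)

# Accuracy = correct predictions / all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print("Accuracy:", accuracy)
```

The accuracy simply equals the share of the majority class, so it says nothing about how well fraud is detected.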

**Question:** *What fraction of the fraudulent transactions is correctly classified? (This is the **recall** score/measure)*

***None of the fraudulent transactions are correctly classified, because the dummy classifier classified all samples as 'non-fraudulent'***

In [3]:
from sklearn.metrics import recall_score

## Measure test recall score of your dummy classifier
dummy_test_recall = recall_score(y_test, dummy_test_pred)

print('Dummy classifier recall:', dummy_test_recall)

Dummy classifier recall: 0.0


**Question:** *How does the recall of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

***The recall of the dummy classifier is very low: 0. Since the dummy classifier always returns the majority class, it never predicts the minority class, which here is the positive (fraud) class, so no fraudulent transaction is detected.***
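
As a minimal sketch of the definition, recall is TP / (TP + FN): the share of actual positives that the classifier flags. The helper `recall` and the toy label lists below are illustrative only; the notebook itself uses `sklearn.metrics.recall_score`.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN); returns 0.0 when there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels with three positives (hypothetical data).
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]

# Predicting the majority class everywhere yields no true positives.
all_negative = [0] * len(y_true)

# A classifier that catches two of the three positives (plus one false alarm).
some_caught = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

print(recall(y_true, all_negative))
print(recall(y_true, some_caught))
```

False alarms do not enter the formula, which is why recall is usually read alongside precision or accuracy.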

### Training a logistic regression model (3pts)

Train a logistic regression classifier using `X_train` and `y_train`. The code below keeps the default parameters except for `max_iter`, which is set to 1000.

In [4]:
from sklearn.linear_model import LogisticRegression
    
## Instantiate a logistic regression model and fit to the training data
logR = LogisticRegression(max_iter = 1000)
logR.fit(X_train, y_train)

logR_test_pred = logR.predict(X_test)

## Measure test accuracy 
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9964970501474927
Logistic classifier recall: 0.7875


**Question:** *Compare the results of logistic regression with those of the above dummy classifier*

***Logistic regression works considerably better than the dummy classifier. Test accuracy rises from about 98.5% to about 99.6%, and because the model actually predicts positives, recall rises from 0 to 0.7875.***
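
One way to see where the two classifiers differ is to compare their confusion matrices. The sketch below runs on synthetic imbalanced data from `make_classification` (roughly 5% positives), not on the fraud dataset, so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),
}
matrices = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
    matrices[name] = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])
    print(name)
    print(matrices[name])
```

The dummy classifier's second column (predicted positives) is all zeros, which is exactly what a recall of 0 means.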

### Grid search for selecting hyperparameters for Logistic Regression (2pts)

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and 3-fold cross-validation (`cv=3`).

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

In [5]:
from sklearn.model_selection import GridSearchCV

## Define the grid of logistic regression parameters
parameters = {'penalty': ['l1', 'l2'],
              'C':[0.01, 0.1, 1, 10, 100]}
## Note: the default 'lbfgs' solver supports only the 'l2' penalty, so the 'l1'
## candidates need a solver such as 'liblinear' or 'saga' to be fitted
model = LogisticRegression(max_iter = 1000)
    
## Perform grid search CV to find best model parameter setting
cmodel = GridSearchCV(model, param_grid=parameters, cv=3, scoring='recall', n_jobs=-2, 
                    return_train_score=True)
cmodel.fit(X_train, y_train)

## Take the best estimator; with the default refit=True it has already been
## refit on the entire training data, so the extra fit below is redundant but harmless
model = cmodel.best_estimator_
model.fit(X_train, y_train)
    
logR_test_pred = model.predict(X_test)

## Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9964970501474927
Logistic classifier recall: 0.7875


**Question:** *Compare the results with that of logistic regression with default parameters*

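***The grid-searched model gives the same test accuracy (99.65%) and recall (0.7875) as logistic regression with default parameters, so searching over `penalty` and `C` on this grid did not improve the test results.***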
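
One point worth checking when running a grid like this: scikit-learn's default `lbfgs` solver supports only the `l2` penalty, so the `l1` candidates can only be fitted with a solver such as `liblinear` or `saga`. The sketch below assumes `solver="liblinear"` and runs on synthetic data from `make_classification`, so its best parameters and scores say nothing about the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives); not the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9], random_state=0
)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}

# liblinear supports both 'l1' and 'l2' penalties, so all ten candidates can fit.
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid=param_grid,
    cv=3,
    scoring="recall",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV recall:", search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is already refit on all the data passed to `fit`, and `search.cv_results_` holds the per-candidate scores for checking whether any combination failed.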