# Logistic Regression for Credit Card Fraud Detection (10 pts)

Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

### Loading the data (1 pts)
Load the data from `fraud_data.csv`.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('fraud_data.csv', sep=',')

# Features: every column except the last; target: the last column
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Count the fraud and non-fraud observations and print the fraud share

fraud_obs = len(data[data['Class'] == 1])
non_fraud_obs = len(data[data['Class'] == 0])
print("Fraudulent Observations: ", fraud_obs)
print("Non-Fraudulent Observations: ", non_fraud_obs)
print("% Fraudulent: ", np.round(fraud_obs/(fraud_obs+non_fraud_obs),4))

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

Fraudulent Observations:  356
Non-Fraudulent Observations:  21337
% Fraudulent:  0.0164


**Question:** What percentage of the observations in the dataset are instances of fraud?

**Answer:** 1.6%
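
The same percentage can be read directly from the label column with `value_counts(normalize=True)`. This is a minimal sketch that assumes a stand-in Series built from the counts printed above instead of the real CSV; the names `labels` and `shares` are illustrative.

```python
import pandas as pd

# Stand-in for data['Class'], built from the counts printed above
# (356 fraud, 21337 non-fraud); the real column would come from fraud_data.csv.
labels = pd.Series([1] * 356 + [0] * 21337, name="Class")

# normalize=True returns each class's share of all rows instead of raw counts
shares = labels.value_counts(normalize=True)
fraud_pct = round(shares.loc[1] * 100, 2)

print(f"Fraud share: {fraud_pct}%")
```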

### Predictions using the majority class label (4pts)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? (Here accuracy is the ratio of the number of correctly classified transactions to the total number of transactions)

In [2]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
    
## Instantiate and fit a dummy classifier that always predicts the majority class of the training data
## Use DummyClassifier in sklearn with strategy 'most_frequent'
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
dummy_test_pred = dummy.predict(X_test)

## Measure test accuracy of your dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)

print('Dummy classifier accuracy:', dummy_test_acc)

Dummy classifier accuracy: 0.9849416103257529


**Question:** *How does the accuracy of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

**Answer:** The accuracy of the dummy classifier is very high (98.5%). Fraud cases are rare compared to non-fraud cases, so a classifier that labels every transaction as non-fraudulent is wrong only on the small share of transactions (about 1.5% of the test set) that are actually fraud.
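
The link between class imbalance and baseline accuracy can be checked on a toy example. The sketch below uses made-up labels (985 negatives, 15 positives), not the fraud data, and the names `y_toy`, `X_toy`, and `baseline` are illustrative. It shows that the accuracy of a `most_frequent` dummy classifier equals the share of the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (illustrative, not the fraud data): 985 negatives, 15 positives
y_toy = np.array([0] * 985 + [1] * 15)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

# Always predict the most frequent label seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
toy_pred = baseline.predict(X_toy)

toy_acc = accuracy_score(y_toy, toy_pred)
majority_share = np.mean(y_toy == 0)

print("Accuracy:", toy_acc)
print("Majority class share:", majority_share)
```

Because every prediction is the majority label, the only errors are the minority-class rows, so the two printed numbers should agree.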

**Question:** *How many fraudulent transactions are correctly classified? (This is the **recall** score/measure)*

**Answer:** 0 fraudulent transactions were correctly classified, so the recall score is 0%.

In [3]:
from sklearn.metrics import recall_score

## Measure test recall score of your dummy classifier
dummy_test_recall = recall_score(y_test, dummy_test_pred)

print('Dummy classifier recall:', dummy_test_recall)

Dummy classifier recall: 0.0
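
Recall is the number of true positives divided by the number of actual positives, TP / (TP + FN). The sketch below uses made-up labels (95 legitimate, 5 fraudulent) to show why an always-negative classifier scores a recall of 0: it produces no true positives at all.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy test labels (illustrative, not the fraud data): 95 legitimate, 5 fraudulent
y_true = np.array([0] * 95 + [1] * 5)
# A majority-class classifier predicts 0 for every transaction
y_pred = np.zeros_like(y_true)

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
manual_recall = tp / (tp + fn)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("Manual recall:", manual_recall)
print("sklearn recall:", recall_score(y_true, y_pred))
```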


**Question:** *How does the recall of the dummy classifier look (very low, low, high, very high)? Give an explanation.*

### Training a logistic regression model (3pts)

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

In [4]:
from sklearn.linear_model import LogisticRegression
    
## Instantiate a logistic regression model and fit to the training data
logR = LogisticRegression(random_state=0, solver='liblinear')
logR.fit(X_train, y_train)
logR_test_pred = logR.predict(X_test)

## Measure test accuracy 
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9964658881376767
Logistic classifier recall: 0.7959183673469388


**Question:** *Compare the results of logistic regression with those of the above dummy classifier*

**Answer:** The logistic regression model has higher accuracy than the dummy classifier (99.6% compared to 98.5%). More importantly, it has a recall score of 79.6% compared to the dummy classifier's 0%. This means the logistic regression model correctly labels 79.6% of the fraud cases in the test set, rather than none of them.
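
A comparison like this is easier to read when both metrics sit side by side. The sketch below builds a small summary table from hand-made labels and predictions (illustrative only, not the fraud data); `pred_baseline` and `pred_model` are hypothetical stand-ins for the dummy and logistic regression predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hand-made labels (illustrative only): 16 legitimate, 4 fraudulent
y_true = np.array([0] * 16 + [1] * 4)

pred_baseline = np.zeros_like(y_true)  # always predicts "not fraud"
pred_model = y_true.copy()
pred_model[-1] = 0                     # misses one of the four fraud cases

# One row per classifier, one column per metric
summary = pd.DataFrame(
    {
        "accuracy": [accuracy_score(y_true, p) for p in (pred_baseline, pred_model)],
        "recall": [recall_score(y_true, p) for p in (pred_baseline, pred_model)],
    },
    index=["baseline", "model"],
)
print(summary)
```

In this toy case the accuracy gap is modest while the recall gap is large, which is the same pattern as in the results above.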

### Grid search for selecting hyperparameters for Logistic Regression (2pts)

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross-validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

In [11]:
from sklearn.model_selection import GridSearchCV

## Define the grid of logistic regression parameters
parameters = {'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10, 100]}
model = LogisticRegression(random_state=0, solver='liblinear')
    
## Perform grid search CV to find best model parameter setting
cmodel = GridSearchCV(model, param_grid=parameters, cv=3, scoring='recall', n_jobs=-2, 
                      return_train_score=True)
cmodel.fit(X_train, y_train)

cmodel.best_params_

{'C': 10, 'penalty': 'l2'}
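
Beyond `best_params_`, the fitted search object keeps the cross-validated score of every parameter combination in `cv_results_`. The sketch below runs the same grid on synthetic imbalanced data from `make_classification` (an assumption standing in for `X_train` and `y_train`) and lists the combinations ranked by mean recall.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (an assumption standing in for X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(random_state=0, solver="liblinear"),
    param_grid=param_grid,
    cv=3,
    scoring="recall",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, sorted by rank of mean cross-validated recall
results = pd.DataFrame(search.cv_results_)[
    ["param_penalty", "param_C", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.to_string(index=False))
print("Best:", search.best_params_)
```

Looking at the full table shows whether the winning setting is clearly ahead or only marginally better than its neighbours.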

In [12]:

## Fit logistic regression with best parameters to the entire training data
model = LogisticRegression(penalty=cmodel.best_params_['penalty'], C=cmodel.best_params_['C'], 
                           random_state=0, solver='liblinear')
model.fit(X_train, y_train)
logR_test_pred = model.predict(X_test)

## Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)
print('Logistic classifier accuracy:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)
print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9967732022126613
Logistic classifier recall: 0.8163265306122449
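
With its default `refit=True`, `GridSearchCV` refits the best parameter setting on all the data passed to `fit` and exposes the result as `best_estimator_`, so the manual refit above should give the same model. The sketch below checks this on synthetic data (an assumption standing in for the fraud dataset); `manual` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (an assumption standing in for the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

search = GridSearchCV(
    LogisticRegression(random_state=0, solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]},
    cv=3,
    scoring="recall",
)  # refit=True is the default
search.fit(X_tr, y_tr)

# Manual refit with the selected parameters, mirroring the cell above
manual = LogisticRegression(random_state=0, solver="liblinear", **search.best_params_)
manual.fit(X_tr, y_tr)

# Compare test predictions of the automatically and manually refit models
same = np.array_equal(search.best_estimator_.predict(X_te), manual.predict(X_te))
print("Predictions identical:", same)
```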


**Question:** *Compare the results with those of logistic regression with default parameters*

**Answer:** The accuracy is essentially unchanged between the default parameters and the best parameters chosen by the grid search (0.9965 vs. 0.9968). The recall increased to 0.8163 with the best parameters (up from 0.7959 for the default model), which means more of the fraud cases in the test set were correctly labeled.