# Fraud Analytics using Python

*This exercise notebook helps you practice the steps involved in solving a credit card fraud problem. You will learn how to use several machine learning algorithms to tackle a financial crime problem. Later, we'll optimise the solution where needed using pre-built methods.*

The OLT covers the following topics:

* Introduction to fraud detection
    * Reading the data labels
    * Data resampling and plotting
    * Applying SMOTE
    * Logistic regression with SMOTE 
* Using ML classification to catch fraud
    * Random forest classifier
    * Performance of RF classifier
    * Plotting precision recall curve  
* Performing model adjustments and regression analysis
    * GridSearchCV to find optimal parameters
    * Logistic regression
    * Voting classifier

In [None]:
# Import pandas and read the CSV file
import pandas as pd
df = pd.read_csv("../data/creditcard.csv")

# Explore the features available in your dataframe
print(df._____)

# Count the occurrences of fraud and no fraud and print them
occ = df['____'].value_counts()
print(occ)

# Print the ratio of fraud cases
print(occ / ____)

## Visualizing the data 

In this exercise, you'll look at the data and visualize the fraud to non-fraud ratio. Looking at your data first, before you make any changes to it, is always a good starting point in a fraud analysis.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

count_classes = df['Class'].value_counts().sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")


In [None]:
def prep_data(df):
    """
    Prepare the data to train the model
    args: df - dataframe
    returns: X - array of columns
    y - class array to be predicted
    """
    # accumulate feature variables in X and target variable in y using slicing
    # <----code goes here---->
    
    return X,y

### For the cell below:

* Define the plot_data(X, y) function, that will nicely plot the given feature set X with labels y in a scatter plot. This has been done for you.

* Use the function prep_data() on your dataset df to create feature set X and labels y.

* Run the function plot_data() on your newly obtained X and y to visualize your results.

In [None]:
# Define a function to create a scatter plot of our data and labels
def plot_data(X, y):
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    return plt.show()

# Create X and y from the prep_data function 
X, y = prep_data(____)

# Plot our data by running our plot data function on X and y
____(X, y)

In [None]:
df.iloc[0].count()

In [None]:
df.shape

## How to manage imbalanced data?

Let's look at different methods that can help us deal with imbalanced data.
There are various resampling techniques to handle imbalanced data:

* `Random Under Sampling (RUS)`: Randomly drops observations from the majority class until the data is balanced, at the cost of discarding information.
* `Random Over Sampling (ROS)`: Duplicates observations from the minority class. The duplicates add no new information, so the model may overfit to them.
* `Synthetic Minority Oversampling Technique (SMOTE)`: Generates synthetic but realistic minority-class observations by interpolating between existing minority samples and their nearest neighbours.

We are going to use SMOTE here, since it balances the data without discarding or merely duplicating observations.
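
To make the idea behind SMOTE concrete, the sketch below creates one synthetic minority point by interpolating between a minority sample and a neighbouring minority sample. It is a simplified illustration in plain Python, not the imblearn implementation; the function name `synthetic_point` and the toy coordinates are invented for this example.

```python
import random

def synthetic_point(sample, neighbour, rng):
    """Return a point on the line segment between sample and neighbour."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

# Two hypothetical minority-class (fraud) observations with two features each
fraud_a = [1.0, 2.0]
fraud_b = [3.0, 6.0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
new_point = synthetic_point(fraud_a, fraud_b, rng)
print(new_point)
```

The new point always lies between the two existing fraud cases, which is why the synthetic data looks realistic rather than being an exact duplicate.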

In [None]:
from imblearn.over_sampling import SMOTE

# Run the prep_data function
X, y = ____(df)

# Define the resampling method (recent imblearn versions of SMOTE have no `kind` argument)
method = ____()

# Create the resampled feature set
X_resampled, y_resampled = method.____(____, ____)

# Plot the resampled data
plot_data(____, ____)

In [None]:
def compare_plot(X,y,X_resampled,y_resampled, method):
    # Start a plot figure
    f, (ax1, ax2) = plt.subplots(1, 2)
    
    # sub-plot number 1, this is our normal data
    c0 = ax1.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0",alpha=0.5)
    c1 = ax1.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1",alpha=0.5, c='r')
    ax1.set_title('Original set')
    
    # sub-plot number 2, this is our oversampled data
    ax2.scatter(X_resampled[y_resampled == 0, 0], X_resampled[y_resampled == 0, 1], label="Class #0", alpha=.5)
    ax2.scatter(X_resampled[y_resampled == 1, 0], X_resampled[y_resampled == 1, 1], label="Class #1", alpha=.5,c='r')
    ax2.set_title(method)
    
    # some settings and ready to go
    plt.figlegend((c0, c1), ('Class #0', 'Class #1'), loc='lower center',
                  ncol=2, labelspacing=0.)
    plt.tight_layout(pad=3)
    return plt.show()

### For the cell below:

* Print the value counts of our original labels, y. Be mindful that y is currently a NumPy array, so in order to use value counts, we'll convert y back into a pandas Series object.
* Repeat the step and print the value counts on y_resampled. This shows you how the balance between the two classes has changed with SMOTE.
* Use the compare_plot() function called on our original data as well as our resampled data to see the scatterplots side by side.


In [None]:
# Print the value_counts on the original labels y
print(pd.Series(____).value_counts())

# Print the value_counts on the resampled labels
print(____(____).____())

# Run compare_plot
compare_plot(____, ____, ____, ____, method='SMOTE')

## Rule-based method to detect fraudsters exercise

In this exercise you're going to try finding fraud cases in our credit card dataset the "old way". First you'll define threshold values using common statistics, to split fraud and non-fraud. Then, use those thresholds on your features to detect fraud. This is common practice within fraud analytics teams.

Statistical thresholds are often determined by looking at the mean values of observations. Let's start this exercise by checking whether feature means differ between fraud and non-fraud cases. Then, you'll use that information to create common sense thresholds. Finally, you'll check how well this performs in fraud detection.
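
As a rough illustration of the rule-based idea, the sketch below flags transactions with a hand-written threshold rule and tallies flagged versus actual fraud, much like a 2x2 crosstab. The features (`amount`, `hour`), the thresholds and the data are all invented for this example and are not taken from the credit card dataset.

```python
# Toy transactions: (amount, hour_of_day, is_fraud) -- all values are invented
transactions = [
    (12.5, 14, 0),
    (980.0, 3, 1),
    (45.0, 2, 0),
    (1500.0, 4, 1),
    (700.0, 5, 0),
    (20.0, 23, 1),
]

def flag(amount, hour):
    """Hypothetical rule: flag large payments made at night."""
    return 1 if amount > 500 and hour < 6 else 0

# Count (actual, flagged) pairs, like a 2x2 crosstab
crosstab = {(actual, flagged): 0 for actual in (0, 1) for flagged in (0, 1)}
for amount, hour, actual in transactions:
    crosstab[(actual, flag(amount, hour))] += 1

print(crosstab)
```

Here the rule catches two of the three fraud cases, misses one, and raises one false alarm, which is the kind of trade-off the crosstab in the exercise makes visible.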

In [None]:
# Get the mean for each group
____.____(____).mean()

# Implement a rule for stating which cases are flagged as fraud
df['flag_as_fraud'] = np.where(np.logical_and(______), 1, 0)

# Create a crosstab of flagged fraud cases versus the actual fraud cases
print(____(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))



## Now, using ML classification to catch fraudsters

In this exercise you'll see what happens when you use a simple machine learning model on our credit card data instead.

Do you think you can beat those results? Remember, the rules caught 170 of the 492 fraud cases and produced 1226 false positives. That's less than half of the fraud cases caught, and the false positives numbered roughly 2.5 times the actual fraud cases.

So with that in mind, let's implement a Logistic Regression model. If you are not yet familiar with logistic regression, you might want to refresh that at this point. But don't worry, you'll be guided through the structure of the machine learning model.
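
For reference, the rule-based figures quoted above can be turned into recall and precision with a few lines of arithmetic. This is only a worked calculation on the numbers given in the text; the variable names are chosen for this example.

```python
true_positives = 170    # fraud cases the rules caught
total_fraud = 492       # all fraud cases in the data
false_positives = 1226  # legitimate transactions flagged as fraud

recall = true_positives / total_fraud
precision = true_positives / (true_positives + false_positives)
false_negatives = total_fraud - true_positives

print(f"Recall:    {recall:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Missed fraud cases: {false_negatives}")
```

These are the baseline values a machine learning model needs to beat.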

In [None]:
# Import the train/test splitter, the classifier and the evaluation metrics

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


### Exercise below:

* Split X and y into training and test data, keeping 30% of the data for testing.
* Fit your model to your training data.
* Obtain the model predicted labels by running model.predict on X_test.
* Obtain a classification report comparing y_test with predicted, and use the given confusion matrix to check your results.


In [None]:
# Create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(____, ____, test_size=____, random_state=0)

# Fit a logistic regression model to our data
model = LogisticRegression()
model.fit(____, ____)

# Obtain model predictions
predicted = model.predict(____)

# Print the classification report and confusion matrix
print('Classification report:\n', classification_report(____, ____))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

## Logistic regression combined with SMOTE

In this exercise, we're going to take the Logistic Regression model from the previous exercise, and combine that with a SMOTE resampling method. We'll see how to do that efficiently by using a pipeline that combines the resampling method with the model in one go. First, we need to define the pipeline that we're going to use.

### Exercise below:

* Import the Pipeline class from imblearn; this has been done for you.
* Then define what you want to put into the pipeline: assign the BorderlineSMOTE method with kind='borderline-2' to resampling, and assign LogisticRegression() to the model.
* The Pipeline() takes a list of named steps. You need to state that you want to combine resampling with the model, in that order; we show you how to do this.

In [None]:
# This is the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline 
from imblearn.over_sampling import BorderlineSMOTE

# Define which resampling method and which ML model to use in the pipeline
resampling = ____
model = ____

# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

Now that the pipeline is defined as a combination of logistic regression with a SMOTE method, let's run it on the data. You can treat the pipeline as if it were a single machine learning model. Our data X and y are already defined, and the pipeline was defined in the previous exercise. Are you curious to find out what the model results are? Let's give it a try!

### Exercise on using a pipeline below:

* Split the data 'X' and 'y' into a training and a test set. Set aside 30% of the data for the test set, and set the random_state to zero.
* Fit your pipeline onto your training data and obtain the predictions by running the pipeline.predict() function on our X_test dataset.

In [None]:
# Split your data X and y, into a training and a test set and fit the pipeline onto the training data
X_train, X_test, y_train, y_test = ____

# Fit your pipeline onto your training set and obtain predictions for the test data
pipeline.fit(____, ____) 
predicted = pipeline.____(____)

# Obtain the results from the classification report and confusion matrix 
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

Whoops! As you can see, SMOTE hasn't helped here; instead, it has added more false positives to our results. Remember, resampling does not necessarily lead to better results in every case. When the fraud cases are widely scattered across the data, SMOTE can introduce some bias: the nearest neighbours of a fraud case aren't necessarily fraud cases themselves, so the synthetic samples can 'confuse' the model.

## Fraud detection using labelled data

Now that you're familiar with the main challenges of fraud detection, you're about to learn how to flag fraudulent transactions with supervised learning. You will use classifiers, adjust them and compare them to find the most efficient fraud detection model.

### Natural hit rate

* Count the total number of observations by taking the length of your labels `y`.
* Count the non-fraud cases in our data by using list comprehension on `y`; remember `y` is a NumPy array so `.value_counts()` cannot be used in this case.
* Calculate the natural accuracy by dividing the number of non-fraud cases by the total number of observations.
* Print the percentage.

In [None]:
# Count the total number of observations from the length of y
total_obs = ____

# Count the total number of non-fraudulent observations 
non_fraud = [i for ____ ____ ____ if i == 0]
count_non_fraud = non_fraud.count(0)

# Calculate the percentage of non fraud observations in the dataset
percentage = (float(____)/float(____)) * 100

# Print the percentage: this is our "natural accuracy" by doing nothing
____(____)

In [None]:

np.bincount(y.astype('int'))

## Part 1: Random forest classifier 

* Import the random forest classifier from `sklearn`.
* Split your features `X` and labels `y` into a training and test set. Set aside a test set of 30%.
* Assign the random forest classifier to `model` and keep `random_state` at 5. We need to set a random state here in order to be able to compare results across different models.

In [None]:
# Import the random forest model from sklearn
from sklearn.ensemble import ____

# Split your data into training and test set
X_train, X_test, y_train, y_test = ____(____, ____, test_size=____, random_state=0)

# Define the model as the random forest
model = ____(random_state=5)

In [None]:
X_test.shape, y_test.shape

## Part 2: Random forest classifier 

In [None]:
# Fit the model to our training set
model.fit(X_train, y_train)

# Obtain predictions from the test data 
predicted = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

# Print the accuracy performance metric
print(accuracy_score(y_test, predicted))

## Performance metrics for the RF model

With highly imbalanced fraud data, accuracy is misleading, so the area under the ROC curve (AUROC) is a more reliable performance metric for comparing different classifiers. Moreover, the classification report tells you about the precision and recall of your model, whilst the confusion matrix shows how many fraud cases you actually predict correctly.
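
To see why plain accuracy says little on imbalanced data, consider a "model" that never flags fraud. The sketch below uses invented labels (995 legitimate, 5 fraudulent) and computes accuracy and recall by hand.

```python
# Hypothetical imbalanced labels: 995 legitimate (0) and 5 fraudulent (1)
y_true = [0] * 995 + [1] * 5

# A "model" that never flags fraud
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that were flagged
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(accuracy, recall)
```

The accuracy is very high even though not a single fraud case is caught, which is why precision, recall and AUROC are the metrics to watch.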

### Exercise below:

* Import the classification report, confusion matrix and ROC AUC score from sklearn.metrics.
* Get the binary predictions from your trained random forest model.
* Get the predicted probabilities by running the predict_proba() function.
* Obtain classification report and confusion matrix by comparing y_test with predicted.

In [None]:
# Import the packages to get the different performance metrics
from sklearn.metrics import ____, ____, ____

# Obtain the predictions from our random forest model 
predicted = model.____(X_test)

# Predict probabilities
probs = ____.____(X_test)

# Print the ROC AUC score, classification report and confusion matrix
print(____(y_test, probs[:,1]))
print(____(____, predicted))
print(____(____, ____))


In [None]:
probs

## Plotting the Precision Recall Curve
Precision and recall typically trade off against each other: as the decision threshold is raised, precision tends to increase while recall falls, and vice versa.
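
The trade-off can be seen by sweeping a decision threshold over predicted fraud probabilities. The sketch below does this by hand on eight invented scores and labels; the helper `precision_recall` is written for this example only.

```python
# Hypothetical fraud probabilities and true labels for eight transactions
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

def precision_recall(threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_pos = sum(flagged)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / sum(labels)
    return precision, recall

results = {t: precision_recall(t) for t in (0.9, 0.5, 0.15)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold from 0.9 to 0.15 raises recall from 0.25 to 1.0 while precision drops from 1.0 to about 0.57.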

In [None]:
def plot_pr_curve(recall, precision, average_precision):
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
    plt.show()

In [None]:
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_curve

# Calculate average precision from the predicted fraud probabilities
average_precision = average_precision_score(y_test, probs[:, 1])

# Obtain precision and recall across all probability thresholds
precision, recall, _ = precision_recall_curve(y_test, probs[:, 1])

# Plot the recall precision tradeoff
plot_pr_curve(recall, precision, average_precision)

The ROC curve plots the true positive rate against the false positive rate of a classifier as its discrimination threshold is varied. Since a random classifier traces the diagonal from (0, 0) to (1, 1), it has an AUC of 0.5. At a minimum, classifiers should perform better than this, and the larger the area under a classifier's ROC curve, the better its expected performance.
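
As a minimal sketch of how the ROC curve and its area come about, the code below builds the (false positive rate, true positive rate) points by lowering the threshold one score at a time and integrates them with the trapezoidal rule. The scores and labels are invented, the helpers `roc_points` and `auc` are written for this example, and tied scores are not handled.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

area = auc(roc_points(scores, labels))
diagonal_area = auc([(0.0, 0.0), (1.0, 1.0)])  # the "random guess" diagonal
print(area, diagonal_area)
```

In practice `roc_curve` and `roc_auc_score` from sklearn.metrics do this work, including the handling of ties.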

In [None]:
# Create true and false positive rates from the predicted fraud probabilities
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, probs[:, 1])

In [None]:
false_positive_rate


In [None]:
true_positive_rate

In [None]:
threshold

In [None]:
# Plot ROC curve
plt.title("Receiver Operating Characteristic")
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

### Model adjustment exercise below:

* Set the class_weight argument of your classifier to balanced_subsample.
* Fit your model to your training set.
* Obtain predictions and probabilities from X_test.
* Obtain the roc_auc_score, the classification report and confusion matrix.
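
For intuition about what balanced class weights look like, the sketch below applies the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`, to invented labels. With `balanced_subsample`, the same calculation is made on the bootstrap sample of each tree rather than on the full training set.

```python
from collections import Counter

# Hypothetical labels: 990 non-fraud (0) and 10 fraud (1)
y_toy = [0] * 990 + [1] * 10

counts = Counter(y_toy)
n_samples, n_classes = len(y_toy), len(counts)

# Weight of each class is inversely proportional to its frequency
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(weights)
```

The rare fraud class receives a weight roughly a hundred times larger than the non-fraud class, so misclassifying a fraud case costs the model far more during training.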

In [None]:
# Define the model with balanced subsample
model = RandomForestClassifier(class_weight='____', random_state=5)

# Fit your training model to your training set
model.fit(____, ____)

# Obtain the predicted values and probabilities from the model 
predicted = ____.____(____)
probs = ____.____(____)

# Print the roc_auc_score, the classification report and confusion matrix
print(____(____, ____))
print(____(____, ____))
print(____(____, ____))

We can see that the model results don't improve drastically. If we mostly care about catching fraud, and not so much about the false positives, this does not actually improve our model at all, although it is a simple option to try.

## Adjusting the RF for fraud detection

In [None]:
def get_model_results(X_train, y_train, X_test, y_test, model):
    """Fit the model, predict on the test set and print the evaluation metrics."""
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print(classification_report(y_test, predicted))
    print(confusion_matrix(y_test, predicted))

### For the exercise below:

* Change the weight option to set the ratio to 1 to 12 for the non-fraud and fraud cases, and set the split criterion to 'entropy'.
* Set the maximum depth to 10.
* Set the minimal samples in leaf nodes to 10.
* Set the number of trees to use in the model to 20.


In [None]:
# Change the model options
model = RandomForestClassifier(bootstrap=True, class_weight={0:____, 1:____}, criterion='____',

        # Change depth of model
        max_depth=____,

        # Change the number of samples in leaf nodes
        min_samples_leaf=____, 

        # Change the number of trees to use
        n_estimators=____, n_jobs=-1, random_state=5)

# Run the function get_model_results
get_model_results(X_train, y_train, X_test, y_test, model)

You can see that by smartly defining more options in the model, you can obtain better predictions. You have effectively reduced the number of false negatives, i.e. you are catching more cases of fraud, whilst keeping the number of false positives low. In this exercise you've manually changed the options of the model.

## GridSearchCV to find optimal parameters

In [None]:
from sklearn.model_selection import GridSearchCV

### For the exercise below:

* Define in the parameter grid that you want to try 1 and 30 trees, and that you want to try the gini and entropy split criteria.
* Define the model to be a simple RandomForestClassifier; keep the random_state at 5 to be able to compare models.
* Set the scoring option such that it optimizes for recall.
* Fit the model to the training data X_train and y_train and obtain the best parameters for the model.
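
To get a feel for how much work a grid search does, the sketch below enumerates every combination in a grid of the same shape as the one in this exercise: two values each for the number of trees, `max_features`, `max_depth` and the split criterion. The concrete `max_features` and `max_depth` values are illustrative, and the enumeration with `itertools.product` only mimics the bookkeeping, not the model fitting.

```python
from itertools import product

# Illustrative grid with two options per parameter
param_grid = {
    'n_estimators': [1, 30],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 8],
    'criterion': ['gini', 'entropy'],
}

# Every combination of one value per parameter is a candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5
print(len(candidates), "candidate settings")
print(len(candidates) * cv_folds, "model fits during the search")
```

With 5-fold cross-validation each candidate is trained five times, and by default GridSearchCV then refits the best candidate once more on the full training set, so the cost grows quickly as options are added.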


In [None]:
# Define the parameter sets to test
param_grid = {'n_estimators': [____, ____], 'max_features': ['sqrt', 'log2'],  'max_depth': [4, 8], 'criterion': ['____', '____']
}

# Define the model to use
model = ____(random_state=5)

# Combine the parameter sets with the defined model
CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='____', n_jobs=-1)

# Fit the model to our training data and obtain best parameters
CV_model.fit(____, ____)
CV_model.____

## Model results using GridSearchCV

In [None]:
# Input the optimal parameters in the model
model = RandomForestClassifier(class_weight={0:1, 1:12},
                               criterion='gini',
                               n_estimators=30, 
                               max_features='log2',  
                               min_samples_leaf=10, 
                               max_depth=8,
                               n_jobs=-1, random_state=5)

# Get results from your model
get_model_results(X_train, y_train, X_test, y_test, model)

You've managed to improve the model even further. The number of false negatives has been reduced slightly further, which means we are catching more cases of fraud. However, you see that the number of false positives has not improved. That is the Precision-Recall trade-off in action. To decide which final model is best, you need to take into account how bad it is not to catch fraudsters, versus how many false positives the fraud analytics team can deal with.
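
One way to make that decision explicit is to attach a cost to each kind of error and compare models on total cost. The sketch below is purely illustrative: the cost figures and the two confusion-matrix outcomes are invented, and `total_cost` is a helper written for this example.

```python
# All figures below are invented for illustration
COST_MISSED_FRAUD = 500.0   # assumed average loss per undetected fraud case
COST_FALSE_ALARM = 10.0     # assumed analyst cost per false positive

def total_cost(false_negatives, false_positives):
    """Combined cost of missed fraud and of reviewing false alarms."""
    return false_negatives * COST_MISSED_FRAUD + false_positives * COST_FALSE_ALARM

# Two hypothetical models: A raises few alarms, B catches more fraud
model_a = total_cost(false_negatives=30, false_positives=20)
model_b = total_cost(false_negatives=20, false_positives=300)
print(model_a, model_b)
```

Under these assumed costs the model with many more false positives is still cheaper overall; with a more expensive review process the ranking could flip.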

## Logistic Regression

### Exercise below:

* Define a LogisticRegression model with class weights of 1 for the non-fraud cases and 15 for the fraud cases.
* Fit the model to the training set, and obtain the model predictions.
* Print the classification report and confusion matrix.

In [None]:
# Define the Logistic Regression model with weights
model = ____(____={____, ____}, random_state=5)

# Get the model results
get_model_results(X_train, y_train, X_test, y_test, model)


As you can see, the Logistic Regression has quite different performance from the Random Forest: more false positives, but also a better recall.

## Voting Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

Now:

* Import the Voting Classifier package.
* Define the three models; use the Logistic Regression from before, the Random Forest from previous exercises and a Decision tree with balanced class weights.
* Define the ensemble model by inputting the three classifiers with their respective labels.

In [None]:
# Import the package
from sklearn.ensemble import ____

# Define the three classifiers to use in the ensemble
clf1 = LogisticRegression(class_weight={0:1, 1:15}, random_state=5)
clf2 = ____(class_weight={0:1, 1:12}, criterion='gini', max_depth=8, max_features='log2',
            min_samples_leaf=10, n_estimators=30, n_jobs=-1, random_state=5)
clf3 = DecisionTreeClassifier(random_state=5, class_weight="____")

# Combine the classifiers in the ensemble model
ensemble_model = ____(estimators=[('lr', ____), ('rf', ____), ('dt', ____)], voting='hard')

# Get the results 
get_model_results(X_train, y_train, X_test, y_test, ensemble_model)

## Adjust weights within the Voting Classifier

* Define an ensemble method where you weight the second classifier (clf2) 4 to 1 relative to the other classifiers.
* Fit the model to the training set, and obtain the predictions from the ensemble model on the test set.
* Print the performance metrics; this is ready for you to run.
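
The difference between hard voting and weighted soft voting can be sketched by hand for a single transaction. The fraud probabilities below are invented, and `hard_vote` and `soft_vote` are simplified helpers for the binary case, not the scikit-learn implementation.

```python
def hard_vote(predictions):
    """Majority vote over class labels (0 or 1)."""
    return 1 if sum(predictions) > len(predictions) / 2 else 0

def soft_vote(fraud_probs, weights):
    """Weighted average of fraud probabilities, thresholded at 0.5."""
    avg = sum(p * w for p, w in zip(fraud_probs, weights)) / sum(weights)
    return (1 if avg > 0.5 else 0), avg

# Hypothetical fraud probabilities from lr, rf and dt for one transaction
fraud_probs = [0.30, 0.90, 0.20]
labels = [1 if p > 0.5 else 0 for p in fraud_probs]

hard = hard_vote(labels)
soft, avg = soft_vote(fraud_probs, weights=[1, 4, 1])
print(hard, soft, round(avg, 3))
```

With hard voting the two "no fraud" votes win, whereas weighting the random forest 4 to 1 in a soft vote lets its confident prediction carry the decision.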

In [None]:
# Define the ensemble model
ensemble_model = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)],
    voting='soft', 
    weights=[1, 4, 1], 
    flatten_transform=True)

# Get results 
get_model_results(X_train, y_train, X_test, y_test, ensemble_model)

Print the performance metrics and we are ready to go!

In [None]:
ensemble_model.estimators_