# Package installation

In [1]:
!pip install numpy
!pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (301 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.0-py3-none-any.whl (8.2 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.43.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
Collecting kiwisolver>=1.0.1 (from matplotlib)
  Downloading kiwisolver-1.4.5-

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from pathlib import Path

# Goal

In this exercise, you will design and build a model to detect fraudulent activity using data from one of NCR's customers. The data contains the transaction logs of a tenant in NCR Analytics. It has already been processed into a convenient format and is ready to be analyzed.

# Data

In [3]:
data = pd.read_csv(Path().parent.resolve().joinpath("besties_processed_data.csv"))

In [4]:
data.head()

Unnamed: 0,TransactionId,Hour,VoidsAmount,IsVoidFollowedByDrawerOpen,IsVoided,IsReturn,IsReceiptReprinted,NonScannedItems,ManualOverrideAmount,ReturnsAmount,PriceLookupTrxTotalAmount,VoidedItemsAmount,DiscountAmount,IsTrainingMode,TotalAmount,ItemCount,StoreId,BusinessDate,MaxWeightToPriceRatioItemId,IsFraud
0,tr1,12,0.0,False,False,False,False,1,0.0,0.0,0.0,0.0,0.0,False,13.7,6,32,09/05/2023,204000000000.0,False
1,tr2,15,0.0,False,False,False,False,1,0.0,0.0,0.0,0.0,0.0,False,11.56,6,15,09/07/2023,207000000000.0,False
2,tr3,15,0.0,False,False,True,False,1,1.0,4.49,0.0,0.0,0.0,False,581.41,6,7,09/04/2023,,False
3,tr4,13,0.0,False,False,False,False,1,0.0,0.0,0.0,0.0,0.0,False,16.31,8,10,09/04/2023,203000000000.0,False
4,tr5,12,0.0,False,False,False,False,1,0.0,0.0,0.0,0.0,0.0,False,16.24,8,16,09/06/2023,203000000000.0,False


# Tasks

## Task 1
Explore the label column “IsFraud”. 

How many fraudulent transactions are there in the data? What is the percentage of fraudulent transactions?
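
As a rough sketch of the counting step, the snippet below uses a small made-up frame named `toy` in place of the real `data` table (the values are illustrative and not taken from the dataset). The same two expressions should apply to `data["IsFraud"]`, assuming that column is boolean as in the preview above.

```python
import pandas as pd

# Hypothetical stand-in for the real table; only the label column matters here.
toy = pd.DataFrame(
    {"IsFraud": [False, False, True, False, True, False, False, False]}
)

# True counts as 1, so the sum is the number of fraudulent rows
# and the mean is their share of all rows.
fraud_count = int(toy["IsFraud"].sum())
fraud_pct = 100 * toy["IsFraud"].mean()

print(f"{fraud_count} fraudulent transactions ({fraud_pct:.1f}%)")
```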

## Task 2
Analyze each of the data columns with respect to the label: do higher values indicate fraudulent activity? Try sorting the table by each column independently and plot each column against the label. Note that some columns might correlate with the label only within a certain value range, so filter the data by range to check for that.
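
One possible way to compare a column with the label, sketched on made-up values (the frame `toy` and the bin edges are assumptions chosen for illustration): compare the column's average for fraudulent and non-fraudulent rows, then look at the fraud rate inside value ranges.

```python
import pandas as pd

# Hypothetical sample; the real table has many more rows and columns.
toy = pd.DataFrame({
    "VoidsAmount": [0.0, 5.0, 0.0, 320.0, 12.0, 410.0],
    "IsFraud": [False, False, False, True, False, True],
})

# Average of the column for fraudulent and for non-fraudulent rows.
fraud_mean = toy.loc[toy["IsFraud"], "VoidsAmount"].mean()
legit_mean = toy.loc[~toy["IsFraud"], "VoidsAmount"].mean()

# Fraud rate inside value ranges, to spot range-specific patterns.
# The bin edges are arbitrary and only serve the illustration.
ranges = pd.cut(toy["VoidsAmount"], bins=[-1, 10, 100, 1000])
fraud_rate_by_range = toy.groupby(ranges, observed=False)["IsFraud"].mean()

print(fraud_mean, legit_mean)
print(fraud_rate_by_range)
```

Sorting with `toy.sort_values("VoidsAmount")` gives the same picture in table form.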

## Task 3
In the given data, the loss prevention teams in the stores marked each transaction as fraudulent or not after a long investigation. However, in most cases the label is unknown. We aim to help the loss prevention team by devising a model that predicts if the transaction is fraudulent according to the given metrics.

Use the data to devise a one-feature model: pick a column and define a rule that classifies a transaction as fraudulent or not when the label is not given.
For example, “voids amount” ≥ $300 → IsFraud=True.
We’ll refer to the predicted label as the prediction and to the original label column as the observed label.
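
The example rule can be sketched as a single comparison. The frame `toy`, the column name `Prediction`, and the constant `THRESHOLD` are assumptions introduced for this illustration.

```python
import pandas as pd

# Hypothetical sample values for one feature and the observed label.
toy = pd.DataFrame({
    "VoidsAmount": [0.0, 5.0, 320.0, 12.0, 410.0, 300.0],
    "IsFraud": [False, False, True, False, True, False],
})

THRESHOLD = 300  # cut-off taken from the example rule above

# One-feature model: flag a transaction when the feature reaches the threshold.
toy["Prediction"] = toy["VoidsAmount"] >= THRESHOLD

print(toy)
```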


## Task 4
Design the accuracy metric and measure the performance of the one-feature model that you developed: how many of the predictions were accurate? Compute the percentage of predicted labels that match the observed labels.
The accuracy metric measures the percentage of exact “hits” that your model achieved. What could be the problem with the accuracy metric for this data? For which kinds of data or labels does it work well? (If you could not answer this question, see answer* at the end of this file.)
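
A minimal sketch of the accuracy computation, assuming the observed and predicted labels are available as boolean arrays (the values below are made up):

```python
import numpy as np

# Hypothetical observed labels and predictions for six transactions.
observed = np.array([False, False, True, False, True, False])
predicted = np.array([False, False, True, False, False, True])

# Element-wise comparison gives True for every match;
# the mean of that array is the share of matching labels.
accuracy_pct = 100 * np.mean(predicted == observed)

print(f"accuracy: {accuracy_pct:.1f}%")
```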


## Task 5
Compute the precision and recall metrics:
* Precision: out of the samples that your model predicted to be True, how many were indeed positive (the observed value)?
* Recall: out of the positive labels (the observed value), how many did your model predict to be True?

You can directly use the `precision_score` and `recall_score` functions that have been imported.
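
The two definitions can be checked against the imported functions on made-up labels. The sketch below computes both metrics by counting and then with `precision_score` and `recall_score`; the label values are assumptions for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical observed labels and predictions.
observed = [False, False, True, False, True, False, True, True]
predicted = [False, True, True, False, False, False, True, False]

pairs = list(zip(observed, predicted))
tp = sum(o and p for o, p in pairs)        # predicted True, observed True
fp = sum((not o) and p for o, p in pairs)  # predicted True, observed False
fn = sum(o and (not p) for o, p in pairs)  # predicted False, observed True

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

# scikit-learn expects the observed labels first, then the predictions.
precision = precision_score(observed, predicted)
recall = recall_score(observed, predicted)

print(manual_precision, precision)
print(manual_recall, recall)
```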

## Task 6

Now that you have the accuracy, precision, and recall metrics, repeat Tasks 3-5 for every other column: develop a single-feature model and compute the performance metrics for each model. Which single feature best predicts the label? Take into account what each performance metric means.
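
One way to organize the repetition is a loop that collects the metrics per feature into a table. In this sketch the frame `toy` and the `thresholds` dictionary are hypothetical; choosing a sensible threshold for each real column is part of the task.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical sample with two candidate features.
toy = pd.DataFrame({
    "VoidsAmount": [0.0, 5.0, 320.0, 12.0, 410.0, 300.0],
    "ManualOverrideAmount": [0.0, 1.0, 0.0, 0.0, 50.0, 2.0],
    "IsFraud": [False, False, True, False, True, False],
})

# Illustrative cut-offs, one per feature.
thresholds = {"VoidsAmount": 300, "ManualOverrideAmount": 10}

rows = []
for column, threshold in thresholds.items():
    predicted = toy[column] >= threshold
    rows.append({
        "feature": column,
        "accuracy": (predicted == toy["IsFraud"]).mean(),
        # zero_division=0 avoids a warning when a rule flags nothing.
        "precision": precision_score(toy["IsFraud"], predicted, zero_division=0),
        "recall": recall_score(toy["IsFraud"], predicted, zero_division=0),
    })

results = pd.DataFrame(rows).set_index("feature")
print(results)
```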

## Task 7
Which of the columns do you think should be combined to detect fraudulent activity (e.g., a high voids amount together with a high manual override amount)?
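
A combined rule can be sketched by joining two single-feature conditions with `&`. The thresholds and values below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample values for two features.
toy = pd.DataFrame({
    "VoidsAmount": [0.0, 5.0, 320.0, 12.0, 410.0, 300.0],
    "ManualOverrideAmount": [0.0, 1.0, 0.0, 0.0, 50.0, 2.0],
})

# Flag a transaction only when both conditions hold.
high_voids = toy["VoidsAmount"] >= 300
high_override = toy["ManualOverrideAmount"] >= 1
toy["Prediction"] = high_voids & high_override

print(toy)
```

Using `|` instead of `&` flags a transaction when either condition holds, which generally trades precision for recall.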

## Task 8
Now, let's delve into machine learning modeling!
<ol>
  <li>Split the data into two parts: features and labels.
    <p>Features are the data used for prediction: either all columns except the "IsFraud" column or a selected subset.</p>
    <p>Labels are what you aim to predict, in this case the "IsFraud" column.</p>
  </li>
  <li>Apply machine learning to the dataset.
    <p>Train a machine learning algorithm, for example a Random Forest Classifier (see the link below).</p>
    <p>Use your trained model to predict the "IsFraud" column.</p>
  </li>
  <li>Calculate various metrics.
    <p>Compute the Confusion Matrix (see the links below).</p>
    <p>Calculate the metrics you have already used, such as precision and recall.</p>
  </li>
</ol>
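
The steps above can be sketched as follows. This is an illustration on synthetic data, not the intended solution: the frame `toy`, its synthetic label rule, the hold-out split via `train_test_split`, and the hyperparameters are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table (values are random, not real data).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "VoidsAmount": rng.exponential(50, n),
    "ItemCount": rng.integers(1, 20, n),
})
toy["IsFraud"] = toy["VoidsAmount"] > 150  # synthetic label rule

# 1. Split into features and labels.
X = toy.drop(columns=["IsFraud"])
y = toy["IsFraud"]

# Hold out a test set so the metrics are measured on unseen rows;
# stratify keeps the fraud share similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Train the model and predict the label.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 3. Compute the metrics. Rows of the matrix are observed labels,
# columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)

print(cm)
print(precision, recall)
```

On the real table, text columns such as `TransactionId` and `BusinessDate`, and the missing values visible in `MaxWeightToPriceRatioItemId`, would likely need to be dropped, encoded, or filled before fitting.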

Feel free to repeat these procedures with a subset of the data and examine the outcomes. Is it imperative to use all the data to predict the label?

Some useful links:

* [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [Understanding the Confusion Matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62)
* [Confusion Matrix Implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)