![Fraud detection image](cover_image.jpg)

🏦 Banks are battling fraud with machine learning models, but changing data patterns can weaken these defenses. London's Poundbank needs your help to figure out why their fraud detection models aren't as accurate anymore.

Poundbank's tool of choice for monitoring machine learning models is the `nannyml` library, and they recommend that you use it too.

## The data

They have provided you with a reference set (test data) and an analysis set (production data). A summary and preview are provided below.

## reference.csv and analysis.csv

| Column     | Description              |
|------------|--------------------------|
| `'timestamp'` | Date and time of the transaction. |
| `'time_since_login_min'` | Time in minutes since the user logged in to the app. |
| `'transaction_amount'` | The amount in pounds (£) that the user sent to another account. |
| `'transaction_type'` | Transaction type: <ul><li>`CASH-OUT` - Withdrawing money from an account.</li><li>`PAYMENT` - Transaction where a payment is made to a third party.</li><li>`CASH-IN` - This is the opposite of a cash-out. It involves depositing money into an account.</li><li>`TRANSFER` - Transaction which involves moving funds from one account to another.</li></ul> |
| `'is_first_transaction'` | A binary indicator denoting if the transaction is the user's first (1 for the first transaction, 0 otherwise). |
| `'user_tenure_months'` | The duration in months since the user's account was created or since they became a member. |
| `'is_fraud'` | A binary label indicating whether the transaction is fraudulent (1 for fraud, 0 otherwise). |
| `'predicted_fraud_proba'` | The probability, assigned by the detection model, that the transaction is fraudulent. |
| `'predicted_fraud'` | The predicted classification label, derived by the detection model from the predicted fraud probability (1 for predicted fraud, 0 otherwise). |

In [2]:
# installing nannyml
!pip install nannyml


In [6]:
# Re-run this cell
# Import required libraries
import pandas as pd
import nannyml as nml
nml.disable_usage_logging()

reference = pd.read_csv("reference.csv")
analysis = pd.read_csv("analysis.csv")
reference.head()

Unnamed: 0,timestamp,time_since_login_min,transaction_amount,transaction_type,is_first_transaction,user_tenure_months,is_fraud,predicted_fraud_proba,predicted_fraud
0,2018-01-01 00:00:00.000,1.56175,3981.1,PAYMENT,False,0.31898,1.0,0.99,1
1,2018-01-01 00:08:43.152,1.658074,1267.9,PAYMENT,False,7.391323,0.0,0.07,0
2,2018-01-01 00:17:26.304,2.454287,1984.7,CASH-IN,False,0.781225,1.0,1.0,1
3,2018-01-01 00:26:09.456,2.392085,2265.2,CASH-OUT,False,0.680473,1.0,0.98,1
4,2018-01-01 00:34:52.608,2.189806,2126.8,CASH-IN,False,8.542895,1.0,0.99,1


In [12]:
# Check the data type of user_tenure_months
# (reference is already a DataFrame, so no extra conversion is needed)
print(reference["user_tenure_months"].dtype)

float64


In [4]:
# Preview the first rows of the analysis (production) data
analysis.head(10)

Unnamed: 0,timestamp,time_since_login_min,transaction_amount,transaction_type,is_first_transaction,user_tenure_months,predicted_fraud_proba,predicted_fraud,is_fraud
0,2018-11-01 00:04:52.464,2.174243,2832.3,CASH-OUT,False,1.013445,0.97,1,1
1,2018-11-01 00:13:35.616,2.493543,1426.9,CASH-OUT,False,6.700041,0.09,0,0
2,2018-11-01 00:22:18.768,1.807432,1302.0,PAYMENT,False,6.291723,0.01,0,0
3,2018-11-01 00:31:01.920,2.133415,1432.1,PAYMENT,True,8.165503,0.0,0,0
4,2018-11-01 00:39:45.072,1.987827,1870.3,CASH-OUT,False,8.205203,0.03,0,0
5,2018-11-01 00:48:28.224,2.978838,1512.5,PAYMENT,False,9.49067,0.02,0,0
6,2018-11-01 00:57:11.376,1.867206,1639.4,CASH-IN,False,6.979411,0.0,0,0
7,2018-11-01 01:05:54.528,2.547514,2082.8,CASH-OUT,False,2.754975,1.0,1,1
8,2018-11-01 01:14:37.680,1.94,1690.7,PAYMENT,False,9.833902,0.02,0,0
9,2018-11-01 01:23:20.832,2.052716,1240.9,CASH-IN,False,6.421194,0.05,0,0


In [28]:
# Estimate the model's performance with CBPE
cbpe = nml.CBPE(
    timestamp_column_name="timestamp",
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    problem_type="classification_binary",
    metrics=["accuracy"],
    chunk_period="m",
)

cbpe = cbpe.fit(reference)
est_results = cbpe.estimate(analysis)
s = est_results.plot()
s.show()
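
To make the estimation step less of a black box, here is a minimal sketch of the idea behind confidence-based performance estimation. It assumes the predicted probabilities are well calibrated, and it is not NannyML's implementation (which, among other things, calibrates the scores on the reference set first). The helper name `expected_accuracy` is made up for illustration; the inputs are the ten preview rows of `analysis` shown earlier.

```python
# Sketch of the idea behind CBPE, assuming calibrated probabilities.
# Not NannyML's code; expected_accuracy is a hypothetical helper.

def expected_accuracy(pred_labels, pred_probas):
    """Estimate accuracy without looking at the true labels.

    With calibrated scores, a prediction of 1 is correct with probability p
    and a prediction of 0 is correct with probability 1 - p.
    """
    per_row = [p if label == 1 else 1.0 - p
               for label, p in zip(pred_labels, pred_probas)]
    return sum(per_row) / len(per_row)


# Values copied from the analysis.head(10) preview.
predicted_fraud = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
predicted_fraud_proba = [0.97, 0.09, 0.01, 0.0, 0.03, 0.02, 0.0, 1.0, 0.02, 0.05]

estimate = expected_accuracy(predicted_fraud, predicted_fraud_proba)
print(round(estimate, 3))  # 0.975
```

Ten rows are far too few for a meaningful estimate; the point is only that the estimate needs no `is_fraud` column.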

In [26]:
# Calculate the realized performance
calculator = nml.PerformanceCalculator(
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    timestamp_column_name="timestamp",
    metrics=["accuracy"],
    chunk_period="m",
    problem_type="classification_binary",
)
calculator = calculator.fit(reference)
calc_results = calculator.calculate(analysis)
b = calc_results.plot()
b.show()
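
For intuition, the realized accuracy per monthly chunk can be sketched in plain Python as below. This only illustrates what "accuracy per month" means; it is not how `PerformanceCalculator` works internally. The helper `monthly_accuracy` is hypothetical, and the December rows are invented for the example.

```python
from collections import defaultdict
from datetime import datetime


def monthly_accuracy(rows):
    """rows: iterable of (timestamp string, is_fraud, predicted_fraud)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ts, y_true, y_pred in rows:
        # Group each transaction into its calendar month, e.g. "2018-11".
        month = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f").strftime("%Y-%m")
        totals[month] += 1
        hits[month] += int(y_true == y_pred)
    return {month: hits[month] / totals[month] for month in totals}


rows = [
    ("2018-11-01 00:04:52.464", 1, 1),  # from the analysis preview
    ("2018-11-01 00:13:35.616", 0, 0),  # from the analysis preview
    ("2018-12-01 00:00:00.000", 1, 0),  # hypothetical: missed fraud
    ("2018-12-01 00:08:00.000", 0, 0),  # hypothetical
]

accuracy_by_month = monthly_accuracy(rows)
print(accuracy_by_month)  # {'2018-11': 1.0, '2018-12': 0.5}
```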

In [30]:
# Compare the results and find the months with alerts
est_results.compare(calc_results).plot().show()
months_with_performance_alerts = ["april_2019", "may_2019", "june_2019"]
print(months_with_performance_alerts)

['april_2019', 'may_2019', 'june_2019']


In [33]:
# Determining the feature with the highest correlation to performance

features = ["time_since_login_min", "transaction_amount",
            "transaction_type", "is_first_transaction", 
            "user_tenure_months"]

# Calculate the univariate drift results
udc = nml.UnivariateDriftCalculator(
    timestamp_column_name="timestamp",
    column_names=features,
    chunk_period="m",
    continuous_methods=["kolmogorov_smirnov"],
    categorical_methods=["chi2"]
)

udc.fit(reference)
udc_results = udc.calculate(analysis)
udc_results.plot().show()
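
The Kolmogorov-Smirnov method used for the continuous features compares the empirical distribution of a feature in each chunk against the reference data. A minimal sketch of the two-sample KS statistic follows; the function name `ks_statistic` and the sample values are hypothetical, and a library routine would be used in practice.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Hypothetical time_since_login_min values: the analysis chunk is shifted upward.
reference_chunk = [1.0, 1.5, 2.0, 2.5]
analysis_chunk = [2.0, 2.5, 3.0, 3.5]

drift = ks_statistic(reference_chunk, analysis_chunk)
print(drift)  # 0.5
```

A value of 0 means the two samples have identical empirical distributions, and values closer to 1 mean they barely overlap.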

In [34]:
# the correlation ranker
ranker = nml.CorrelationRanker()
ranker.fit(
    calc_results.filter(period="reference"))

correlation_ranked_features = ranker.rank(udc_results, calc_results)

# the highest correlating feature
display(correlation_ranked_features)
highest_correlation_feature = "time_since_login_min"
print(highest_correlation_feature)

Unnamed: 0,column_name,pearsonr_correlation,pearsonr_pvalue,has_drifted,rank
0,time_since_login_min,0.952925,1.045775e-09,True,1
1,transaction_amount,0.626235,0.005427712,True,2
2,is_first_transaction,0.054255,0.8306916,True,3
3,user_tenure_months,-0.100547,0.6913911,True,4
4,transaction_type,-0.186569,0.4585328,True,5


time_since_login_min
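
As far as I understand it, the ranker correlates each feature's per-chunk drift values with the absolute change in the performance metric relative to the reference period. The sketch below shows that calculation for a single feature with a hand-written Pearson correlation. It is a simplification rather than NannyML's implementation, and every number in it is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-month values for one feature.
reference_mean_accuracy = 0.95
monthly_accuracy = [0.95, 0.94, 0.96, 0.85, 0.80]
monthly_drift = [0.02, 0.03, 0.02, 0.30, 0.40]

# Absolute performance change relative to the reference period.
abs_change = [abs(acc - reference_mean_accuracy) for acc in monthly_accuracy]

correlation = pearson(monthly_drift, abs_change)
print(correlation > 0.9)  # True: drift and performance change move together
```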


In [35]:
# Calculating average monthly transactions
calc = nml.SummaryStatsAvgCalculator(
    column_names=["transaction_amount"],
    chunk_period="m",
    timestamp_column_name="timestamp",
)

calc.fit(reference)
stats_avg_results = calc.calculate(analysis)

# Find the month
stats_avg_results.plot().show()
alert_avg_transaction_amount = 3069.8184
print(alert_avg_transaction_amount)


3069.8184
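
One way to picture how a monthly average ends up flagged is a band of three standard deviations around the mean of the reference chunks. The sketch below assumes that rule, which I believe resembles NannyML's default standard-deviation threshold but have not verified against its source. The reference values and the first analysis value are hypothetical; `3069.8184` is the alerting value found above.

```python
from statistics import mean, pstdev


def alert_band(reference_values, multiplier=3.0):
    """Lower and upper limits: mean -/+ multiplier * standard deviation."""
    center = mean(reference_values)
    spread = pstdev(reference_values)
    return center - multiplier * spread, center + multiplier * spread


def flag_alerts(values, lower, upper):
    """True for every value that falls outside the band."""
    return [not (lower <= value <= upper) for value in values]


reference_monthly_avg = [2000.0, 2050.0, 1980.0, 2020.0, 2010.0]  # hypothetical
analysis_monthly_avg = [2030.0, 3069.8184]  # first value hypothetical

lower, upper = alert_band(reference_monthly_avg)
alerts = flag_alerts(analysis_monthly_avg, lower, upper)
print(alerts)  # [False, True]
```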


# Import required libraries
import pandas as pd
import nannyml as nml
nml.disable_usage_logging()

reference = pd.read_csv("reference.csv")
analysis = pd.read_csv("analysis.csv")

## Identify the months in which both the estimated and realized accuracy of the model have alerts. Store the names of these months as lowercase strings in a list named months_with_performance_alerts.

# Get the estimated performance using CBPE algorithm
cbpe = nml.CBPE(
    timestamp_column_name="timestamp",
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    problem_type="classification_binary",
    metrics=["accuracy"],
    chunk_period="m"
)

cbpe.fit(reference)
est_results = cbpe.estimate(analysis)

# Calculate the realized performance
calculator = nml.PerformanceCalculator(
    y_true="is_fraud",
    y_pred="predicted_fraud",
    y_pred_proba="predicted_fraud_proba",
    timestamp_column_name="timestamp",
    metrics=["accuracy"],
    chunk_period="m",
    problem_type="classification_binary",
)
calculator = calculator.fit(reference)
calc_results = calculator.calculate(analysis)

# Compare the results and find the months with alerts
est_results.compare(calc_results).plot().show()
months_with_performance_alerts = ["april_2019", "may_2019", "june_2019"]
print(months_with_performance_alerts)

## Determining which alerting feature has the strongest correlation with the model’s realized performance. Store the name of this feature in a variable named highest_correlation_feature. 

features = ["time_since_login_min", "transaction_amount",
            "transaction_type", "is_first_transaction", 
            "user_tenure_months"]

# Calculate the univariate drift results
udc = nml.UnivariateDriftCalculator(
    timestamp_column_name="timestamp",
    column_names=features,
    chunk_period="m",
    continuous_methods=["kolmogorov_smirnov"],
    categorical_methods=["chi2"]
)

udc.fit(reference)
udc_results = udc.calculate(analysis)

# Use the correlation ranker
ranker = nml.CorrelationRanker()
ranker.fit(
    calc_results.filter(period="reference"))

correlation_ranked_features = ranker.rank(udc_results, calc_results)

# Find the highest correlating feature
display(correlation_ranked_features)
highest_correlation_feature = "time_since_login_min"
print(highest_correlation_feature)

## Use the summary average statistics calculator to find the monthly average transaction amounts and check whether any month raises an alert. Record the alerting value in a variable called alert_avg_transaction_amount.

# Calculate average monthly transactions
calc = nml.SummaryStatsAvgCalculator(
    column_names=["transaction_amount"],
    chunk_period="m",
    timestamp_column_name="timestamp",
)

calc.fit(reference)
stats_avg_results = calc.calculate(analysis)

# Find the month
stats_avg_results.plot().show()
alert_avg_transaction_amount = 3069.8184
print(alert_avg_transaction_amount)

## Answer to the bonus question
"""
First, I recommend looking at the distribution plots for all features and analyzing them using this command: 
- `udc_results.filter(column_names=features).plot(kind="distribution")`

Observations:

- time_since_login_min - From April to June, transactions made within one minute of logging in vanished completely.
- transaction_amount - In May and June, a larger number of transactions appeared. Additionally, as you discovered in the third question, the average transaction value has increased and raised an alert.

Possible explanation: 

Fraudsters may have noticed that transactions made right after logging in often led to account blocking. As a result, they began waiting a bit longer before transferring money to their accounts to avoid detection. Furthermore, they appear to make a single larger transfer instead of many smaller ones, which would explain the increase in the average transaction value.
"""