# Credit card fraud detection

Modeling for credit card fraud prediction, using a **modified dataset** derived from [this original Kaggle dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).

All columns are encoded, except for these two:

- Time: Number of seconds elapsed between this transaction and the first transaction in the dataset
- Amount: The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning

ABOUT THE ORIGINAL DATASET (quoted from Kaggle):
```
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
```
Benchmark: AUC in [0.85, 0.95]
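
To see why plain accuracy is not meaningful on this data, consider a hypothetical model that never flags a fraud. The sketch below uses only the counts quoted above (492 frauds out of 284,807 transactions); the variable names are illustrative.

```python
# Counts quoted from the Kaggle description above
n_transactions = 284_807
n_frauds = 492

# Hypothetical "model" that labels every transaction as legitimate
true_negatives = n_transactions - n_frauds
true_positives = 0

accuracy = (true_negatives + true_positives) / n_transactions
recall = true_positives / n_frauds

print(f"Accuracy: {accuracy:.4%}")
print(f"Recall on frauds: {recall:.1%}")
```

The accuracy is above 99.8% even though no fraud is caught, which is why ranking metrics such as AUPRC are preferred here.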


In [None]:
import pandas as pd

df = pd.read_csv("../data/creditcard.csv")

In [None]:
df['Class'].value_counts()

# EDA

EDA should be done carefully. We will go through it quickly here because class time is limited.

In [None]:
import sweetviz as sv

# Be warned: automated EDA tends to be slow!
# my_report = sv.analyze(df, target_feat='Class')
# my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

# Baseline

Let us build a business baseline: a business criterion that the client can already achieve today. In general, the commercial discussion is expected to elicit this baseline, along with the success criterion for the modeling, from the client.

In [None]:
import plotly.express as px

fig = px.histogram(
    df, x="Amount", nbins=50, log_y=True, title="Log Histogram")
fig.show()

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
df['AMOUNT_QUANTILE'] = kbins.fit_transform(df[['Amount']]).flatten()
df.groupby('AMOUNT_QUANTILE').agg({'Class': ['mean']}).plot(kind='bar', figsize=(15, 5), title='Class by Amount Quantile')

In [None]:
# Benchmark: filter the population to the top Amount decile (bin index 9)
df[df['AMOUNT_QUANTILE'].isin([9.0])]

In [None]:
df_high = df[df['AMOUNT_QUANTILE'].isin([9.0])].copy()
print(f"Average fraud rate of the dataset: {df['Class'].mean()}")
print(f"Fraud rate in the benchmark group: {df_high['Class'].mean()}")

print("Lift:", df_high['Class'].mean() / df['Class'].mean())
print("Support:", len(df_high) / len(df))
print("Total value of fraudulent transactions in the benchmark group:", df_high[df_high['Class'] == 1]['Amount'].sum())
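
The two baseline quantities printed above can be written as one small helper. This is a minimal sketch in plain Python; `lift_and_support` and the toy lists are hypothetical names and data, not part of the notebook. Lift is the fraud rate inside the selected segment divided by the overall fraud rate, and support is the fraction of the population that the segment covers.

```python
def lift_and_support(labels, in_segment):
    """Return (lift, support) for a boolean segment over binary labels."""
    n_total = len(labels)
    segment_labels = [y for y, keep in zip(labels, in_segment) if keep]
    base_rate = sum(labels) / n_total
    segment_rate = sum(segment_labels) / len(segment_labels)
    return segment_rate / base_rate, len(segment_labels) / n_total


# Toy data: 10 transactions, 2 frauds, segment = the last two rows
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
toy_segment = [False] * 8 + [True, True]

toy_lift, toy_support = lift_and_support(toy_labels, toy_segment)
print("Lift:", toy_lift)
print("Support:", toy_support)
```

A lift above 1 means the segment concentrates fraud more than a random sample of the same size would.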

## Data preprocessing

In [None]:
import numpy as np

y = df['Class'].copy()
X = df.drop(columns=['Class', 'Time']).copy()

In [None]:
# Imputers
from sklearn.impute import SimpleImputer

num_cols = X.select_dtypes(include=np.number).columns.values
cat_cols = X.select_dtypes(exclude=np.number).columns.values

num_imp = SimpleImputer(strategy='mean')
X[num_cols] = num_imp.fit_transform(X[num_cols])

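# Skip categorical imputation when there are no categorical columns:
# SimpleImputer raises an error on an empty column selection.
if len(cat_cols) > 0:
    cat_imp = SimpleImputer(strategy='most_frequent')
    X[cat_cols] = cat_imp.fit_transform(X[cat_cols])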

In [None]:
# Outliers
# X[num_cols].describe()
Q75 = X[num_cols].quantile(0.75)
Q25 = X[num_cols].quantile(0.25)
IQR = Q75 - Q25
lower_lim = Q25 - 1.5 * IQR
upper_lim = Q75 + 1.5 * IQR

In [None]:
upper_lim

In [None]:
X[num_cols]

In [None]:
# Are there categories with very low frequency (outliers)?
for col in cat_cols:
    print(f"{col}: {X[col].value_counts().min() / len(X):.2%}")

In [None]:
# Cap values above upper_lim at upper_lim and values below lower_lim at lower_lim
X[num_cols] = X[num_cols].clip(lower=lower_lim, upper=upper_lim, axis=1)
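
The limits above are computed on the full `X`, before any train/test split. A leak-free variant would derive the IQR limits from the training rows only and reuse them on the test rows. The sketch below illustrates that idea on toy frames; `train`, `test`, and the `amount` column are hypothetical and stand in for the real split.

```python
import pandas as pd

# Toy frames standing in for the real train/test split
train = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"amount": [0.5, 50.0]})

# IQR limits estimated on the training rows only
q25 = train.quantile(0.25)
q75 = train.quantile(0.75)
iqr = q75 - q25
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# The same limits cap both frames, column by column
train_capped = train.clip(lower=lower, upper=upper, axis=1)
test_capped = test.clip(lower=lower, upper=upper, axis=1)

print(train_capped["amount"].tolist())
print(test_capped["amount"].tolist())
```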

In [None]:
# Encoding
# Be careful using TargetEncoder at this point, before splitting off a validation set!
from category_encoders.binary import BinaryEncoder

# 'OTHERS' is not a documented handle_unknown option; 'value' is the library default
enc = BinaryEncoder(handle_unknown='value')
X_cat = enc.fit_transform(X[cat_cols])
X_cat

In [None]:
X.drop(columns=cat_cols, inplace=True)
X = pd.concat([X, X_cat], axis=1)

# Modeling v1

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the data into training and testing sets
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X, y, test_size=0.3,
    random_state=42, stratify=y)

In [None]:
print(y_tr.mean())
print(y_ts.mean())

In [None]:
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_tr, y_tr)

What now?

# Evaluation

In [None]:
# Let us compute the roc_auc:
from sklearn.metrics import roc_auc_score

y_pred = clf.predict_proba(X_ts)[:, 1]
roc_auc = roc_auc_score(y_ts, y_pred)
roc_auc
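
The dataset description quoted earlier recommends the area under the precision-recall curve (AUPRC), so it may be worth reporting it next to the ROC AUC. The sketch below compares the two on a small hand-made example; `toy_true` and `toy_scores` are hypothetical data, and `average_precision_score` is used as scikit-learn's usual summary of the precision-recall curve.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy data: 8 legitimate transactions and 2 frauds, with one
# legitimate transaction (score 0.9) ranked above a fraud (score 0.8)
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 0.95]

toy_roc_auc = roc_auc_score(toy_true, toy_scores)
toy_auprc = average_precision_score(toy_true, toy_scores)

print("ROC AUC:", toy_roc_auc)
print("AUPRC (average precision):", toy_auprc)
```

A single highly ranked false positive costs the AUPRC more than the ROC AUC here, which reflects why AUPRC is the stricter metric when positives are rare.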

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_pred=clf.predict(X_ts), y_true=y_ts)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot()

The confusion matrix does not need to (should not?) be built with a threshold of 0.5 on the score! But which threshold should we use?
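
One possible answer is to sweep the candidate thresholds and keep the one that maximizes a metric of interest. The sketch below does this with F1 in plain Python; choosing F1 is an assumption made for illustration (a business cost function would work the same way), and `f1_at`, `toy_true`, and `toy_scores` are hypothetical names and data.

```python
def f1_at(y_true, scores, threshold):
    """F1 score when every score >= threshold is flagged as fraud."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy validation labels and model scores
toy_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.01, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]

# Every distinct score is a candidate threshold
candidates = sorted(set(toy_scores))
best_threshold = max(candidates, key=lambda t: f1_at(toy_true, toy_scores, t))

print("Best threshold:", best_threshold)
print("F1 at best threshold:", f1_at(toy_true, toy_scores, best_threshold))
```

The threshold should be chosen on validation data, not on the test set used for the final report.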

## Business evaluation against the baseline!

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

X_baseline = X_ts[X_ts['AMOUNT_QUANTILE'] == 9.0].copy()
X_baseline['SCORE'] = clf.predict_proba(X_baseline)[:, 1]
X_baseline['Class'] = y_ts

kbins2 = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
X_baseline['SCORE_RANGE'] = kbins2.fit_transform(X_baseline[['SCORE']]).flatten()
X_baseline.groupby('SCORE_RANGE').agg({'Class': ['mean']}).plot(kind='bar', figsize=(15, 5), title='Class by Score Range')

In [None]:
X_high = X_baseline.sort_values(by='SCORE', ascending=False).iloc[:int(0.1 * len(X_baseline)), :]

print(f"Average fraud rate of the dataset: {y_ts.mean()}")
print(f"Fraud rate in the scored group: {X_high['Class'].mean()}")

print("Lift:", X_high['Class'].mean() / y_ts.mean())
print("Support:", len(X_high) / len(y_ts))
# X['Amount'] was capped during outlier treatment, so read the raw amounts from df
fraud_idx = X_high.index[X_high['Class'] == 1]
print("Total value of fraudulent transactions in the scored group:", df.loc[fraud_idx, 'Amount'].sum())

Compare with our baseline:
```
Average fraud rate of the dataset: 0.001727485630620034
Fraud rate in the benchmark group: 0.0029842362110732716
Lift: 1.727502769445417
Support: 0.10000807564420819
Total value of fraudulent transactions in the benchmark group: 46965.89000000001
```
Did it work?

# Modeling v2

- Validation
- Hyperparameter optimization
- Use of Pipelines

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the data into training and testing sets
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X, y, test_size=0.3,
    random_state=42, stratify=y)

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# https://scikit-learn.org/stable/modules/cross_validation.html
def custom_cv_kfolds(X: pd.DataFrame, y: pd.Series, n_splits: int = 2):
    """Custom split function.

    This function is a generator that yields the training and validation
    indices for the given number of splits. It uses a KFold split over the
    majority class, except that the minority class of the training set is
    always included in full.

    Note: the fraud rows appear in both the training and the validation
    indices, so validation scores from this split are optimistic.
    """
    y_arr = y.to_numpy()
    idx_fraud = np.where(y_arr == 1.0)[0]
    idx_non_fraud = np.where(y_arr == 0.0)[0]
    kf = KFold(n_splits=n_splits)
    # Fold only the non-fraud rows, then append every fraud row to both sides
    for train_index, test_index in kf.split(X.iloc[idx_non_fraud, :]):
        idx_tr = np.hstack([
            idx_non_fraud[train_index],
            idx_fraud
        ])
        idx_val = np.hstack([
            idx_non_fraud[test_index],
            idx_fraud
        ])
        yield idx_tr, idx_val

In [None]:
clf = RandomForestClassifier(n_estimators=50)
custom_cv = custom_cv_kfolds(X_tr, y_tr, n_splits=5)
# Scoring options: https://scikit-learn.org/stable/modules/model_evaluation.html
res = cross_validate(
    clf, X_tr, y_tr, return_estimator=True,
    cv=custom_cv, scoring='roc_auc', verbose=1)
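
Because `custom_cv_kfolds` appends every fraud row to both the training and the validation indices, the model is validated on frauds it has already seen. A possible alternative, shown here only as a sketch on toy data, is scikit-learn's `StratifiedKFold`, which keeps the class ratio in each fold without reusing rows. `X_toy` and `y_toy` are hypothetical stand-ins for `X_tr` and `y_tr`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 16 legitimate rows and 4 frauds
y_toy = np.array([0] * 16 + [1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = list(skf.split(X_toy, y_toy))

# Each validation fold receives its share of the minority class
positives_per_fold = [int(y_toy[val_idx].sum()) for _, val_idx in folds]
# No row is shared between the training and validation side of a fold
overlaps = [len(set(tr_idx) & set(val_idx)) for tr_idx, val_idx in folds]

print("Frauds per validation fold:", positives_per_fold)
print("Shared rows per fold:", overlaps)
```

An instance such as `skf` can be passed directly as the `cv` argument of `cross_validate`.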

In [None]:
clf_ = res['estimator'][0]

In [None]:
# Let us compute the roc_auc:
from sklearn.metrics import roc_auc_score

y_pred = clf_.predict_proba(X_ts)[:, 1]
roc_auc = roc_auc_score(y_ts, y_pred)
roc_auc

Let us try a hyperparameter optimization library.

In [None]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Accuracy (the default scoring) is uninformative for such imbalanced classes
opt = BayesSearchCV(
    RandomForestClassifier(),
    {
        'max_depth': Integer(1, 9),
        'max_features': [0.6, 0.7, 0.8, 0.9, 1.0],
        'bootstrap': [True, False],
        'n_estimators': [5, 10, 30, 50, 80, 100, 200, 300, 500]
    },
    n_iter=3,
    cv=3,
    scoring='roc_auc'
)

opt.fit(X_tr, y_tr)

## Using the Pipeline class:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from category_encoders.binary import BinaryEncoder

numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # 'OTHERS' is not a documented handle_unknown option; 'value' is the library default
    ('encoder', BinaryEncoder(handle_unknown='value'))
])

preprocessor = ColumnTransformer([
    ('numeric', numeric_pipe, num_cols),
    ('categorical', categorical_pipe, cat_cols)
])

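# Accuracy (the default scoring) is uninformative for such imbalanced classes
opt = BayesSearchCV(
    RandomForestClassifier(),
    {
        'max_depth': Integer(1, 9),
        'max_features': [0.6, 0.7, 0.8, 0.9, 1.0],
        'bootstrap': [True, False],
        'n_estimators': [5, 10, 30, 50, 80, 100, 200, 300, 500]
    },
    n_iter=15,
    cv=3,
    scoring='roc_auc'
)

# With the search as the final step, the preprocessor is fitted once on all
# rows passed to model.fit, before the search creates its CV folds.
# num_cols and cat_cols refer to the raw columns, so fit on un-encoded features.
model = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', opt)
])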
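
In the cell above the search is the last step of the pipeline, so only the classifier is tuned and the preprocessing is fitted outside the cross-validation folds. A possible alternative is to wrap the whole pipeline in the search, so that each fold refits the preprocessing on its own training rows. The sketch below illustrates this with scikit-learn's `RandomizedSearchCV` in place of `BayesSearchCV` and synthetic data from `make_classification`; all `*_toy` names, `pipe`, and the small parameter grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0)
X_tr_toy, X_ts_toy, y_tr_toy, y_ts_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

# Preprocessing and classifier live in one pipeline
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The "clf__" prefix routes each parameter to the pipeline step named "clf"
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "clf__max_depth": [2, 4, 6],
        "clf__n_estimators": [10, 30],
    },
    n_iter=3, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X_tr_toy, y_tr_toy)

# The fitted search predicts with the best pipeline, refit on all training rows
test_scores = search.predict_proba(X_ts_toy)[:, 1]
print("Best parameters:", search.best_params_)
print("Best CV ROC AUC:", search.best_score_)
```

The same `clf__` naming convention applies if the pipeline is passed to `BayesSearchCV` instead.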