<a href="https://colab.research.google.com/github/alexnunesfroes/Projeto_Embraer_ICD/blob/main/Projeto_EMBRAER_ALEXFROES_PS4_CD_AMI_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying fraud with *PyCaret*

Credit card companies need to identify fraudulent transactions to protect customers from improper charges. In Brazil, which ranks second in Latin America for fraud (behind Mexico), cases rose significantly during the pandemic, driven by the growing use of digital services and e-commerce.

## About the project

This project applies anomaly detection models to identify values that deviate from the rest of the data. Because the data are already labeled, the detected anomalies can be compared with the real fraud cases. Although anomaly detection is an unsupervised process, the project also uses evaluation metrics from classification to measure how well the algorithms perform.

## Installing the *PyCaret* library

In [1]:
!pip install pycaret



In [2]:
!pip install openml

Collecting openml
  Downloading openml-0.15.1-py3-none-any.whl.metadata (10 kB)
Collecting liac-arff>=2.4.0 (from openml)
  Downloading liac-arff-2.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting xmltodict (from openml)
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting minio (from openml)
  Downloading minio-7.2.16-py3-none-any.whl.metadata (6.5 kB)
Collecting pycryptodome (from minio->openml)
  Downloading pycryptodome-3.23.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Downloading openml-0.15.1-py3-none-any.whl (160 kB)
Downloading minio-7.2.16-py3-none-any.whl (95 kB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Downloading pycr

In [3]:
import openml
import pandas as pd

In [5]:
# Download the 'Credit_Card_Fraud_Classification' dataset (OpenML ID 46455)
dataset = openml.datasets.get_dataset(46455)
df, *_ = dataset.get_data()

In [6]:
# Save as CSV
df.to_csv('Credit_Card_Fraud_Classification_dataset.csv', index=False)
# Show the first rows
df.head()

Unnamed: 0,time,v1,v2,v3,v4,v5,v6,v7,v8,v9,...,v21,v22,v23,v24,v25,v26,v27,v28,amount,class
0,123113.0,-4.168525,-4.164323,1.91185,1.130443,4.152041,-2.125948,-1.803619,0.675859,0.308972,...,-0.058678,-1.673241,0.937707,-0.616568,0.780497,-1.055841,-0.154194,0.146745,157.37,otherwise
1,67116.0,-0.241374,-0.043836,1.545847,-0.950404,-0.819948,0.847419,-0.786322,-1.420254,1.645278,...,1.222,-1.007936,-0.415337,-0.336823,1.033332,0.848539,0.117121,0.092623,96.35,otherwise
2,125495.0,-2.134432,-2.21931,0.969065,-2.85848,0.693123,-1.315593,0.284006,0.149392,1.18268,...,0.579502,0.74396,0.519019,-0.354719,0.373946,-0.319379,-0.056289,0.155978,276.73,otherwise
3,67705.0,-0.862259,-0.224703,2.30834,-1.941343,-0.32121,1.954794,-0.942382,0.729052,0.090916,...,0.121589,0.683341,-0.590164,-1.645139,0.665159,-0.005705,0.219394,0.098477,2.0,otherwise
4,64782.0,1.24161,-0.051895,0.579918,-0.115431,-0.579488,-0.548451,-0.269573,-0.041116,0.35321,...,-0.103847,-0.237586,0.124342,0.14365,0.053582,0.933286,-0.052477,0.006476,1.54,otherwise


## Importing the libraries


In [7]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from pycaret.anomaly import *

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve

## Loading the dataset


The data were obtained [here](https://www.openml.org/search?type=data&status=active&id=46455) and the dataset has 31 *features*.

According to the information at the *link*:

It is essential for credit card companies to identify fraudulent transactions so that customers do not suffer losses. The dataset contains credit card transactions made by European cardholders in September 2013, covering two days and totaling 284,807 transactions, of which only 492 are frauds (0.172%), which makes it highly imbalanced.

The features were transformed with PCA; only 'Time' (seconds elapsed since the first transaction) and 'Amount' (the transaction value) were left untransformed. The target variable is 'Class', where 1 indicates fraud and 0 otherwise. In the OpenML copy loaded above, the column names are lowercase (`time`, `amount`, `class`) and `class` holds the strings 'fraud' and 'otherwise'.
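To make the imbalance concrete, the short sketch below recomputes the fraud share from the counts quoted above, and from the counts that the outputs later in this notebook report for the copy actually loaded (28,480 rows, 49 of them fraud). It is plain Python and only illustrates the arithmetic; the helper name `fraud_share` is made up for this example.

```python
def fraud_share(n_fraud: int, n_total: int) -> float:
    """Return the percentage of transactions labelled as fraud."""
    return 100.0 * n_fraud / n_total

# Counts quoted in the dataset description (original two-day dataset).
full_share = fraud_share(492, 284_807)

# Counts reported by the outputs of this notebook: rows seen by setup and
# the support of class 1 in the classification reports.
sample_share = fraud_share(49, 28_480)

print(f"Full dataset:  {full_share:.3f}% fraud")
print(f"Loaded sample: {sample_share:.3f}% fraud")
```

Both shares are well under 1%, so any metric dominated by the majority class has to be read with care.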

Loading the dataset.

In [8]:
fraude = pd.read_csv('Credit_Card_Fraud_Classification_dataset.csv')

Viewing the dataset.

In [9]:
fraude

The idea of this project is to apply the anomaly detection module of the *PyCaret* library to some of the dataset's *features* and identify possible anomalies, then compare the *class* column, which holds the real results, with the results found by the models, that is, check which of the detected anomalies are frauds.

First I will drop the *class* column (which the models will not 'see') and the *time* column, which in my view carries no relevant information. The values of *class* will be saved in a variable.

In [24]:
fraude1 = df.drop(['time', 'class'], axis=1)
classe = df['class'].map({'otherwise': 0, 'fraud': 1})  # encode the string labels shown by df.head() as 0/1
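`Series.map` with a dictionary turns any label that is missing from the dictionary into `NaN`, which is presumably why some later cells call `dropna()` on `classe`. A minimal plain-Python sketch of the same encoding that fails loudly on an unexpected label instead; `encode_labels` is a hypothetical helper, not part of pandas or PyCaret.

```python
def encode_labels(labels, mapping):
    """Map string labels to integers, rejecting labels the mapping lacks."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [mapping[label] for label in labels]

mapping = {"otherwise": 0, "fraud": 1}
encoded = encode_labels(["otherwise", "fraud", "otherwise"], mapping)
print(encoded)

try:
    # Wrong capitalisation: the label is not a key of the mapping.
    encode_labels(["otherwise", "Fraud"], mapping)
except ValueError as err:
    print(err)
```

With a check like this, a mismatch between the label strings and the mapping shows up at encoding time rather than as missing values further down.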

## *Setup*

Here all of the data preprocessing is carried out automatically.

In [32]:
exp_ano101 = setup(fraude1, normalize = False,session_id = 123)

Unnamed: 0,Description,Value
0,Session id,123
1,Original data shape,"(28480, 29)"
2,Transformed data shape,"(28480, 29)"
3,Numeric features,29
4,Preprocess,True
5,Imputation type,simple
6,Numeric imputation,mean
7,Categorical imputation,mode
8,CPU Jobs,-1
9,Use GPU,False


## Models

In total, the anomaly detection module of the *PyCaret* library offers twelve (12) models; I will use only the three listed below.

I will then evaluate the results of each model with metrics that are also used for classification models.

1) *iforest - Isolation Forest*;

2) *histogram - Histogram-based Outlier Detection*;

3) *pca - Principal Component Analysis*;


## Creating the models

In [33]:
iforest = create_model('iforest')
print(iforest)

histogram = create_model('histogram')
print(histogram)

pca = create_model('pca')
print(pca)


IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)



HBOS(alpha=0.1, contamination=0.05, n_bins=10, tol=0.5)



PCA(contamination=0.05, copy=True, iterated_power='auto', n_components=None,
  n_selected_components=None, random_state=123, standardization=True,
  svd_solver='auto', tol=0.0, weighted=True, whiten=False)
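All three printouts show `contamination=0.05`. As far as I understand the underlying PyOD detectors, this is the share of training rows the model labels as anomalies, so it fixes in advance roughly how many transactions get flagged. The sketch below works through that arithmetic with the row count from the setup table and the 49 frauds reported in the classification reports further down; the variable names are illustrative.

```python
n_rows = 28_480          # rows passed to setup (see the setup table)
n_fraud = 49             # real frauds (support of class 1 in the reports)
contamination = 0.05     # value shown in the model printouts

# Number of rows the detector is expected to label as anomalies.
n_flagged = round(contamination * n_rows)

# Even if every fraud were flagged, most flagged rows could not be frauds.
best_precision = n_fraud / n_flagged

print(f"Expected anomalies: {n_flagged}")
print(f"Upper bound on precision for class 1: {best_precision:.3f}")
```

This agrees with the confusion matrices later in the notebook, where Iforest and PCA each flag 1,424 rows, and it explains why the precision for class 1 stays near 3% whichever model is used.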


## Results of each model

#### Results of the *IForest* model

In [34]:
iforest_results = assign_model(iforest)
iforest_results.head()

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v22,v23,v24,v25,v26,v27,v28,amount,Anomaly,Anomaly_Score
0,-4.168525,-4.164322,1.911849,1.130443,4.152041,-2.125948,-1.803619,0.675859,0.308972,-0.726281,...,-1.673241,0.937707,-0.616568,0.780497,-1.055841,-0.154194,0.146745,157.369995,1,0.00869
1,-0.241374,-0.043836,1.545847,-0.950404,-0.819948,0.847419,-0.786322,-1.420254,1.645278,-1.286672,...,-1.007936,-0.415337,-0.336823,1.033332,0.848539,0.117121,0.092623,96.349998,0,-0.02173
2,-2.134433,-2.21931,0.969065,-2.85848,0.693123,-1.315593,0.284006,0.149392,1.18268,-2.028474,...,0.74396,0.519019,-0.354719,0.373946,-0.319379,-0.056289,0.155978,276.730011,0,-0.06377
3,-0.862259,-0.224703,2.30834,-1.941343,-0.32121,1.954794,-0.942382,0.729052,0.090916,-0.085781,...,0.683341,-0.590164,-1.645139,0.665159,-0.005705,0.219394,0.098477,2.0,0,-0.056046
4,1.24161,-0.051895,0.579918,-0.115431,-0.579488,-0.548451,-0.269573,-0.041116,0.35321,-0.264004,...,-0.237586,0.124342,0.14365,0.053582,0.933286,-0.052477,0.006476,1.54,0,-0.129563


#### Results of the *Histogram* model

In [35]:
histogram_results = assign_model(histogram)
histogram_results.head()

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v22,v23,v24,v25,v26,v27,v28,amount,Anomaly,Anomaly_Score
0,-4.168525,-4.164322,1.911849,1.130443,4.152041,-2.125948,-1.803619,0.675859,0.308972,-0.726281,...,-1.673241,0.937707,-0.616568,0.780497,-1.055841,-0.154194,0.146745,157.369995,0,51.730862
1,-0.241374,-0.043836,1.545847,-0.950404,-0.819948,0.847419,-0.786322,-1.420254,1.645278,-1.286672,...,-1.007936,-0.415337,-0.336823,1.033332,0.848539,0.117121,0.092623,96.349998,1,55.159151
2,-2.134433,-2.21931,0.969065,-2.85848,0.693123,-1.315593,0.284006,0.149392,1.18268,-2.028474,...,0.74396,0.519019,-0.354719,0.373946,-0.319379,-0.056289,0.155978,276.730011,0,47.389536
3,-0.862259,-0.224703,2.30834,-1.941343,-0.32121,1.954794,-0.942382,0.729052,0.090916,-0.085781,...,0.683341,-0.590164,-1.645139,0.665159,-0.005705,0.219394,0.098477,2.0,0,52.937107
4,1.24161,-0.051895,0.579918,-0.115431,-0.579488,-0.548451,-0.269573,-0.041116,0.35321,-0.264004,...,-0.237586,0.124342,0.14365,0.053582,0.933286,-0.052477,0.006476,1.54,0,41.854803


#### Results of the *PCA* model

In [36]:
pca_results = assign_model(pca)
pca_results.head()

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v22,v23,v24,v25,v26,v27,v28,amount,Anomaly,Anomaly_Score
0,-4.168525,-4.164322,1.911849,1.130443,4.152041,-2.125948,-1.803619,0.675859,0.308972,-0.726281,...,-1.673241,0.937707,-0.616568,0.780497,-1.055841,-0.154194,0.146745,157.369995,0,12915.560099
1,-0.241374,-0.043836,1.545847,-0.950404,-0.819948,0.847419,-0.786322,-1.420254,1.645278,-1.286672,...,-1.007936,-0.415337,-0.336823,1.033332,0.848539,0.117121,0.092623,96.349998,0,11341.315563
2,-2.134433,-2.21931,0.969065,-2.85848,0.693123,-1.315593,0.284006,0.149392,1.18268,-2.028474,...,0.74396,0.519019,-0.354719,0.373946,-0.319379,-0.056289,0.155978,276.730011,0,8761.654429
3,-0.862259,-0.224703,2.30834,-1.941343,-0.32121,1.954794,-0.942382,0.729052,0.090916,-0.085781,...,0.683341,-0.590164,-1.645139,0.665159,-0.005705,0.219394,0.098477,2.0,0,9886.907683
4,1.24161,-0.051895,0.579918,-0.115431,-0.579488,-0.548451,-0.269573,-0.041116,0.35321,-0.264004,...,-0.237586,0.124342,0.14365,0.053582,0.933286,-0.052477,0.006476,1.54,0,5551.840991
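The `Anomaly_Score` columns are on very different scales (around 0 for Iforest, tens for Histogram, thousands for PCA), so only the ranking within one model is meaningful. Below is a minimal sketch of how a 0/1 flag can be derived from scores by flagging the top fraction. It mirrors, in simplified form, what I understand PyOD to do with `contamination`; `flag_top_fraction` is a hypothetical helper and the scores are toy values.

```python
def flag_top_fraction(scores, fraction):
    """Flag the highest-scoring `fraction` of points as anomalies (1)."""
    n_flag = max(1, round(fraction * len(scores)))
    # Score of the n_flag-th highest point; anything at or above it is flagged.
    threshold = sorted(scores, reverse=True)[n_flag - 1]
    return [1 if s >= threshold else 0 for s in scores]

# Toy scores on two different scales that share the same ranking.
small_scale = [-0.13, -0.06, 0.01, -0.02, -0.09]
large_scale = [5500.0, 8700.0, 12900.0, 11300.0, 7000.0]

flags_small = flag_top_fraction(small_scale, 0.2)
flags_large = flag_top_fraction(large_scale, 0.2)
print(flags_small)
print(flags_large)
```

Because the flag depends only on rank, the two toy lists produce the same flags even though their raw scores differ by several orders of magnitude.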


Note: I could plot *t-SNE* and *UMAP* charts to visualize the results (and I tried), but the process is very slow because of the amount of data and *Colab* often stopped working, so I will skip this step and just leave the commands below.

In [37]:
#plot_model(iforest)
#plot_model(histogram)
#plot_model(pca)

#plot_model(iforest, plot = 'umap')
#plot_model(histogram, plot = 'umap')
#plot_model(pca, plot = 'umap')

Next, I add to each model's results a new column with the real values.

In [25]:
# Drop rows with NaN in 'Label2' before assigning to iforest_results
iforest_results = assign_model(iforest)
valid_indices = classe.dropna().index
iforest_results = iforest_results.loc[valid_indices]
iforest_results['Label2'] = classe.loc[valid_indices]

Real and predicted results for *Iforest*.

In [41]:
# Ensure 'Label2' is in iforest_results before attempting to access it
if 'Label2' not in iforest_results.columns:
    valid_indices = classe.dropna().index
    iforest_results = iforest_results.loc[valid_indices]
    iforest_results['Label2'] = classe.loc[valid_indices]

iforest_results[['Anomaly','Label2']]

Unnamed: 0,Anomaly,Label2
0,1,0
1,0,0
2,0,0
3,0,0
4,0,0
...,...,...
28475,0,0
28476,1,0
28477,0,0
28478,0,0


Real and predicted results for *Histogram*.

In [27]:
# Drop rows with NaN in 'Label2' before assigning to histogram_results
histogram_results = assign_model(histogram)
valid_indices = classe.dropna().index
histogram_results = histogram_results.loc[valid_indices]
histogram_results['Label2'] = classe.loc[valid_indices]

valor_classe=[0,1]
print(classification_report(histogram_results['Label2'],histogram_results['Anomaly'],labels=valor_classe))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98     28431
           1       0.03      0.86      0.06        49

    accuracy                           0.95     28480
   macro avg       0.51      0.90      0.52     28480
weighted avg       1.00      0.95      0.97     28480



Real and predicted results for *PCA*.

In [28]:
# Drop rows with NaN in 'Label2' before assigning to pca_results
pca_results = assign_model(pca)
valid_indices = classe.dropna().index
pca_results = pca_results.loc[valid_indices]
pca_results['Label2'] = classe.loc[valid_indices]

valor_classe=[0,1]
print(classification_report(pca_results['Label2'],pca_results['Anomaly'],labels=valor_classe))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98     28431
           1       0.03      0.90      0.06        49

    accuracy                           0.95     28480
   macro avg       0.52      0.92      0.52     28480
weighted avg       1.00      0.95      0.97     28480



## *Classification Report* of each model

In all the reports below, the precision for class 1 (anomalies/frauds) is very low, about 3%, but the *recall* is quite high.

*Recall* is the number of transactions correctly classified as fraud divided by the total number of frauds. I focus on this metric in this project because, as stated earlier, I want to know which of the anomalies are frauds.
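For reference, the sketch below computes recall, precision and F1 for class 1 directly from counts. The counts are read off the confusion matrices printed later in this notebook (true positives, false negatives, false positives); the function names are illustrative and are not scikit-learn's.

```python
def recall(tp, fn):
    """Share of real frauds that were flagged."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of flagged transactions that are real frauds."""
    return tp / (tp + fp)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (true positives, false negatives, false positives) per model.
counts = {
    "Iforest":   (43, 6, 1381),
    "Histogram": (42, 7, 1379),
    "PCA":       (44, 5, 1380),
}

metrics = {}
for name, (tp, fn, fp) in counts.items():
    p, r = precision(tp, fp), recall(tp, fn)
    metrics[name] = (p, r, f1(p, r))
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

Rounded to two decimals, these values reproduce the class 1 rows of the three reports.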

#### *Classification report* of the *Iforest* model

The *recall* was 88%.

In [26]:
from sklearn.metrics import classification_report

valor_classe = [0, 1]
print(classification_report(iforest_results['Label2'], iforest_results['Anomaly'], labels=valor_classe))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98     28431
           1       0.03      0.88      0.06        49

    accuracy                           0.95     28480
   macro avg       0.51      0.91      0.52     28480
weighted avg       1.00      0.95      0.97     28480



#### *Classification report* of the *Histogram* model

The *recall* was 86%, slightly below that of the previous model; precision and F1 for class 1 are the same to two decimal places.

In [44]:
from sklearn.metrics import classification_report

# Ensure 'Label2' is in histogram_results before attempting to access it
if 'Label2' not in histogram_results.columns:
    valid_indices = classe.dropna().index
    histogram_results = histogram_results.loc[valid_indices]
    histogram_results['Label2'] = classe.loc[valid_indices]

valor_classe=[0,1]
print(classification_report(histogram_results['Label2'],histogram_results['Anomaly'],labels=valor_classe))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98     28431
           1       0.03      0.86      0.06        49

    accuracy                           0.95     28480
   macro avg       0.51      0.90      0.52     28480
weighted avg       1.00      0.95      0.97     28480



#### *Classification report* of the *PCA* model

Here the *recall* is slightly higher than in the previous models: 90%.

In [46]:
from sklearn.metrics import classification_report

# Ensure 'Label2' is in pca_results before attempting to access it
if 'Label2' not in pca_results.columns:
    valid_indices = classe.dropna().index
    pca_results = pca_results.loc[valid_indices]
    pca_results['Label2'] = classe.loc[valid_indices]

valor_classe=[0,1]
print(classification_report(pca_results['Label2'],pca_results['Anomaly'],labels=valor_classe))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98     28431
           1       0.03      0.90      0.06        49

    accuracy                           0.95     28480
   macro avg       0.52      0.92      0.52     28480
weighted avg       1.00      0.95      0.97     28480



## Confusion matrices

A good way to visualize these results is the confusion matrix.

1) The confusion matrix of the *Iforest* model shows that 43 of the 49 frauds were flagged as anomalies, which is a good result.

2) The confusion matrix of the *Histogram* model shows a slightly lower count, with 42 frauds flagged as anomalies.

3) Finally, the *PCA* model gave the best result, with 44 frauds flagged as anomalies.

In [47]:
# Ensure consistent indices before calculating confusion matrix
valid_indices_iforest = iforest_results['Label2'].dropna().index
print('Matriz de confusão da Iforest')
print(confusion_matrix(iforest_results['Label2'].loc[valid_indices_iforest], iforest_results['Anomaly'].loc[valid_indices_iforest]))
print(''*127)

valid_indices_histogram = histogram_results['Label2'].dropna().index
print('Matriz de confusão da Histogram')
print(confusion_matrix(histogram_results['Label2'].loc[valid_indices_histogram], histogram_results['Anomaly'].loc[valid_indices_histogram]))
print(''*127)

valid_indices_pca = pca_results['Label2'].dropna().index
print('Matriz de confusão da PCA')
print(confusion_matrix(pca_results['Label2'].loc[valid_indices_pca], pca_results['Anomaly'].loc[valid_indices_pca]))
print(''*127)

Matriz de confusão da Iforest
[[27050  1381]
 [    6    43]]

Matriz de confusão da Histogram
[[27052  1379]
 [    7    42]]

Matriz de confusão da PCA
[[27051  1380]
 [    5    44]]



So even with imbalanced classes, and even though these are not classification models, the results are satisfactory in terms of recall: each model flagged between 42 and 44 of the 49 frauds, at the cost of roughly 1,380 false alarms. Finally, other metrics can be used to evaluate the results.

## Evaluation metrics

Below are the accuracy and the AUC for the *PCA* model.

The classification reports above show an accuracy of about 95% for all three models, and the AUC for *PCA* is about 0.95.
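Accuracy deserves some care on data this imbalanced. The sketch below recomputes it from the PCA confusion matrix shown earlier and compares it with a hypothetical baseline that never flags anything; `accuracy_from_matrix` is an illustrative helper.

```python
def accuracy_from_matrix(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)

# PCA confusion matrix from this notebook:
# rows = real class, columns = predicted class.
pca_matrix = [[27051, 1380], [5, 44]]

# Hypothetical baseline that labels every transaction as normal.
never_flag_matrix = [[28431, 0], [49, 0]]

pca_accuracy = accuracy_from_matrix(pca_matrix)
baseline_accuracy = accuracy_from_matrix(never_flag_matrix)

print(f"PCA accuracy:        {pca_accuracy:.4f}")
print(f"Never-flag baseline: {baseline_accuracy:.4f}")
```

The baseline scores higher on accuracy while catching no fraud at all, which is why recall and AUC are the more informative numbers here.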

In [48]:
# Ensure consistent indices before calculating metrics
valid_indices_pca = pca_results['Label2'].dropna().index
print('Métricas de avaliação do modelo PCA')
print(''*127)
print('Acurácia do modelo PCA :',accuracy_score(pca_results['Label2'].loc[valid_indices_pca],pca_results['Anomaly'].loc[valid_indices_pca]))
print('AUC do modelo PCA :',roc_auc_score(pca_results['Label2'].loc[valid_indices_pca],pca_results['Anomaly_Score'].loc[valid_indices_pca]))

Métricas de avaliação do modelo PCA

Acurácia do modelo PCA : 0.9513693820224719
AUC do modelo PCA : 0.9459256531566937
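AUC is computed from `Anomaly_Score` rather than from the 0/1 `Anomaly` column because it measures ranking: the probability that a randomly chosen fraud receives a higher score than a randomly chosen normal transaction. Below is a small plain-Python sketch of that definition on toy data (not the notebook's data); `pairwise_auc` is a hypothetical helper and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Since only the ordering of the scores matters, the different score scales of the three models do not affect this metric.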



## Saving the models

Saving the *Iforest* model.

In [50]:
save_model(iforest, 'Modelo Iforest Final 07Dez2020')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['v1', 'v2', 'v3', 'v4', 'v5', 'v6',
                                              'v7', 'v8', 'v9', 'v10', 'v11',
                                              'v12', 'v13', 'v14', 'v15', 'v16',
                                              'v17', 'v18', 'v19', 'v20', 'v21',
                                              'v22', 'v23', 'v24', 'v25', 'v26',
                                              'v27', 'v28', 'amount'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('trained_model',
                  IForest(behaviour='new', bootstrap=False, contamination=0.05,
     max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
     rando

Saving the *Histogram* model.

In [52]:
save_model(histogram,'Modelo Histogram Final 07Dez2020')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['v1', 'v2', 'v3', 'v4', 'v5', 'v6',
                                              'v7', 'v8', 'v9', 'v10', 'v11',
                                              'v12', 'v13', 'v14', 'v15', 'v16',
                                              'v17', 'v18', 'v19', 'v20', 'v21',
                                              'v22', 'v23', 'v24', 'v25', 'v26',
                                              'v27', 'v28', 'amount'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('trained_model',
                  HBOS(alpha=0.1, contamination=0.05, n_bins=10, tol=0.5))]),
 'Modelo Histogram Final 07Dez2020.pkl')

Saving the *PCA* model.

In [53]:
save_model(pca, 'Modelo PCA final 07Dez2020')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['v1', 'v2', 'v3', 'v4', 'v5', 'v6',
                                              'v7', 'v8', 'v9', 'v10', 'v11',
                                              'v12', 'v13', 'v14', 'v15', 'v16',
                                              'v17', 'v18', 'v19', 'v20', 'v21',
                                              'v22', 'v23', 'v24', 'v25', 'v26',
                                              'v27', 'v28', 'amount'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=[],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('trained_model',
                  PCA(contamination=0.05, copy=True, iterated_power='auto', n_components=None,
   n_selected_components=None, random_state=123, standardization=Tr

## Conclusion

The anomaly detection module proved effective at identifying anomalies in fraud data: compared with the real labels, the three models flagged 86% to 90% of the frauds while labeling about 5% of all transactions as anomalies.

In [57]:
# Code to generate a report with the conclusions automatically

!pip install fpdf

import openml
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from pycaret.anomaly import *
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
from sklearn.metrics import precision_recall_curve
import seaborn as sns
from fpdf import FPDF



In [58]:
# Create instance of FPDF class
# 'P' = portrait orientation, 'mm' = units, 'A4' = paper size
pdf = FPDF('P', 'mm', 'A4')

# Add a page
pdf.add_page()

# Set font
pdf.set_font('Arial', 'B', 16)

# Add title
pdf.cell(200, 10, 'Credit Card Fraud Detection Anomaly Report', 0, 1, 'C')

pdf.ln(10) # Add a line break

# Add project summary
pdf.set_font('Arial', '', 12)
summary_text = """
It is important for credit card companies to be able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase. These frauds can occur due to lack of attention from the card operators' customers when cards or card information were provided.

Brazil is the second country with the most credit card fraud in all of Latin America, behind Mexico. Especially during the coronavirus pandemic, probably due to the high demand for digital services and e-commerce, the number of credit and debit card fraud cases skyrocketed and more than doubled.

This project applies anomaly detection models to identify discrepant values and compares them with the real fraud results.
"""
pdf.multi_cell(0, 10, summary_text)

pdf.ln(10) # Add a line break

# Add results for Iforest model
pdf.set_font('Arial', 'B', 14)
pdf.cell(0, 10, 'Iforest Model Results', 0, 1, 'L')
pdf.set_font('Arial', '', 12)

# Add Classification Report (convert to string)
if 'iforest_results' in locals() and not iforest_results['Label2'].isnull().all():
    report_iforest = classification_report(iforest_results['Label2'], iforest_results['Anomaly'])
    pdf.multi_cell(0, 10, "Classification Report:\n" + report_iforest)
else:
    pdf.multi_cell(0, 10, "Classification Report: Data not available or contains NaNs.")


# Add Accuracy and AUC (assuming these variables are available from previous cells)
# Added checks to see if variables exist and are not None before printing
if 'accuracy_score' in locals() and 'iforest_results' in locals() and not iforest_results['Label2'].isnull().all():
    try:
        iforest_accuracy = accuracy_score(iforest_results['Label2'], iforest_results['Anomaly'])
        pdf.multi_cell(0, 10, f"Accuracy: {iforest_accuracy:.4f}")
    except Exception as e:
         pdf.multi_cell(0, 10, f"Accuracy: Could not calculate ({e})")
else:
     pdf.multi_cell(0, 10, "Accuracy: Data not available or contains NaNs.")


if 'roc_auc_score' in locals() and 'iforest_results' in locals() and not iforest_results['Label2'].isnull().all():
    try:
        iforest_auc = roc_auc_score(iforest_results['Label2'], iforest_results['Anomaly_Score'])
        pdf.multi_cell(0, 10, f"AUC: {iforest_auc:.4f}")
    except Exception as e:
         pdf.multi_cell(0, 10, f"AUC: Could not calculate ({e})")
else:
    pdf.multi_cell(0, 10, "AUC: Data not available or contains NaNs.")


pdf.ln(10) # Add a line break

# Add results for Histogram model
pdf.set_font('Arial', 'B', 14)
pdf.cell(0, 10, 'Histogram Model Results', 0, 1, 'L')
pdf.set_font('Arial', '', 12)

# Add Classification Report (convert to string)
if 'histogram_results' in locals() and not histogram_results['Label2'].isnull().all():
    report_histogram = classification_report(histogram_results['Label2'], histogram_results['Anomaly'])
    pdf.multi_cell(0, 10, "Classification Report:\n" + report_histogram)
else:
    pdf.multi_cell(0, 10, "Classification Report: Data not available or contains NaNs.")

# Add Accuracy and AUC
if 'accuracy_score' in locals() and 'histogram_results' in locals() and not histogram_results['Label2'].isnull().all():
    try:
        histogram_accuracy = accuracy_score(histogram_results['Label2'], histogram_results['Anomaly'])
        pdf.multi_cell(0, 10, f"Accuracy: {histogram_accuracy:.4f}")
    except Exception as e:
        pdf.multi_cell(0, 10, f"Accuracy: Could not calculate ({e})")
else:
    pdf.multi_cell(0, 10, "Accuracy: Data not available or contains NaNs.")

if 'roc_auc_score' in locals() and 'histogram_results' in locals() and not histogram_results['Label2'].isnull().all():
    try:
        histogram_auc = roc_auc_score(histogram_results['Label2'], histogram_results['Anomaly_Score'])
        pdf.multi_cell(0, 10, f"AUC: {histogram_auc:.4f}")
    except Exception as e:
        pdf.multi_cell(0, 10, f"AUC: Could not calculate ({e})")
else:
    pdf.multi_cell(0, 10, "AUC: Data not available or contains NaNs.")

pdf.ln(10) # Add a line break

# Add results for PCA model
pdf.set_font('Arial', 'B', 14)
pdf.cell(0, 10, 'PCA Model Results', 0, 1, 'L')
pdf.set_font('Arial', '', 12)

# Add Classification Report (convert to string)
if 'pca_results' in locals() and not pca_results['Label2'].isnull().all():
    report_pca = classification_report(pca_results['Label2'], pca_results['Anomaly'])
    pdf.multi_cell(0, 10, "Classification Report:\n" + report_pca)
else:
    pdf.multi_cell(0, 10, "Classification Report: Data not available or contains NaNs.")

# Add Accuracy and AUC
if 'accuracy_score' in locals() and 'pca_results' in locals() and not pca_results['Label2'].isnull().all():
    try:
        pca_accuracy = accuracy_score(pca_results['Label2'], pca_results['Anomaly'])
        pdf.multi_cell(0, 10, f"Accuracy: {pca_accuracy:.4f}")
    except Exception as e:
        pdf.multi_cell(0, 10, f"Accuracy: Could not calculate ({e})")
else:
    pdf.multi_cell(0, 10, "Accuracy: Data not available or contains NaNs.")

if 'roc_auc_score' in locals() and 'pca_results' in locals() and not pca_results['Label2'].isnull().all():
    try:
        pca_auc = roc_auc_score(pca_results['Label2'], pca_results['Anomaly_Score'])
        pdf.multi_cell(0, 10, f"AUC: {pca_auc:.4f}")
    except Exception as e:
        pdf.multi_cell(0, 10, f"AUC: Could not calculate ({e})")
else:
    pdf.multi_cell(0, 10, "AUC: Data not available or contains NaNs.")


# Save the PDF
pdf.output('anomaly_detection_report.pdf')

''