<a href="https://colab.research.google.com/github/betobye/1TIAPR/blob/main/Aula1_Pipeline_DS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Introduction**

- Define the problem to be solved
- Relevance of the problem
- How it will be solved


Detecting fraud in credit card transactions is a critical challenge for financial institutions and businesses, given its significant financial impact and the threat it poses to consumer security. As the volume of electronic transactions keeps growing, traditional analysis and monitoring methods become insufficient to identify suspicious behavior quickly.

In this context, machine learning models emerge as essential tools because they can analyze large volumes of data in real time, identify complex patterns, and adapt to new fraud strategies. Applying these models helps reduce financial losses, improves the customer experience by minimizing false positives, and strengthens the security of the payment system as a whole.

# **2. Data Collection**

- Data source: https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud

- **Description of the columns in the dataset:**

    * distance_from_home - distance between the cardholder's home and the place where the transaction occurred.

    * distance_from_last_transaction - distance from the location of the previous transaction.

    * ratio_to_median_purchase_price - ratio of the transaction's purchase price to the median purchase price.

    * repeat_retailer - whether the transaction occurred at the same retailer.

    * used_chip - whether the transaction was made through the chip (credit card).

    * used_pin_number - whether the transaction was made using a PIN.

    * online_order - whether the transaction is an online order.

    * fraud - whether the transaction is fraudulent (the target).

In [None]:
# Install the required libraries (uncomment if they are not already available)
# pip install pandas
# pip install numpy
# pip install matplotlib
# pip install seaborn

In [None]:
import pandas as pd  # dataset loading and processing
import numpy as np  # numerical and array operations
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # data visualization

In [None]:
# Load the dataset and store it in the DataFrame df
df = pd.read_csv("/content/card_transdata.csv")

In [None]:
# Show the first 6 rows of the dataset
df.head(6)

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 5 | 5.586408 | 13.261073 | 0.064768 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [None]:
# Show the last 7 rows of the dataset
df.tail(7)

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 999993 | 4.846253 | 2.84445 | 0.86774 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 999994 | 3.295884 | 0.085712 | 0.831991 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 999995 | 2.207101 | 0.112651 | 1.626798 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 999996 | 19.872726 | 2.683904 | 2.778303 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 999997 | 2.914857 | 1.472687 | 0.218075 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 999998 | 4.258729 | 0.242023 | 0.475822 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 999999 | 58.108125 | 0.31811 | 0.38692 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


All eight columns are stored as floating-point numbers (float64), and none of them contains null values.
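Because `df.info()` reports every column as float64, the yes/no columns could be stored more compactly. The sketch below is only an illustration under the assumption that those columns hold nothing but 0.0 and 1.0 (the code checks this before converting). The small `sample` DataFrame stands in for `df`, and `downcast_flags` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for df, using a few of the dataset's column names.
sample = pd.DataFrame({
    "distance_from_home": [57.877857, 10.829943, 5.091079],
    "repeat_retailer": [1.0, 1.0, 1.0],
    "used_chip": [1.0, 0.0, 0.0],
    "fraud": [0.0, 0.0, 0.0],
})

def downcast_flags(frame, flag_columns):
    """Return a copy in which the given 0.0/1.0 columns are stored as int8."""
    result = frame.copy()
    for col in flag_columns:
        # Refuse to convert a column that holds anything other than 0 or 1.
        if not result[col].isin([0, 1]).all():
            raise ValueError(f"{col} is not strictly binary")
        result[col] = result[col].astype("int8")
    return result

compact = downcast_flags(sample, ["repeat_retailer", "used_chip", "fraud"])
print(compact.dtypes)
```

On the full dataset the same call would take the five flag columns (`repeat_retailer`, `used_chip`, `used_pin_number`, `online_order`, `fraud`); comparing `df.memory_usage()` before and after shows whether the conversion is worthwhile.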

# **3. Data Analysis**

- Which questions do we want to answer?
- Are there missing values?
- Are there duplicate rows?
- How does the target relate to the features?

In [None]:
# Descriptive statistics for each column: mean, standard deviation, median (50%), and so on
df.describe()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 26.628792 | 5.036519 | 1.824182 | 0.881536 | 0.350399 | 0.100608 | 0.650552 | 0.087403 |
| std | 65.390784 | 25.843093 | 2.799589 | 0.323157 | 0.477095 | 0.300809 | 0.476796 | 0.282425 |
| min | 0.004874 | 0.000118 | 0.004399 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.878008 | 0.296671 | 0.475673 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 9.96776 | 0.99865 | 0.997717 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 25.743985 | 3.355748 | 2.09637 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| max | 10632.723672 | 11851.104565 | 267.802942 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
df.isnull().sum()  # count the missing values in each column

| Column | Missing values |
|---|---|
| distance_from_home | 0 |
| distance_from_last_transaction | 0 |
| ratio_to_median_purchase_price | 0 |
| repeat_retailer | 0 |
| used_chip | 0 |
| used_pin_number | 0 |
| online_order | 0 |
| fraud | 0 |


In [None]:
df.duplicated().sum()  # count the fully duplicated rows

np.int64(0)
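The `fraud` mean of 0.087403 in `df.describe()` means that roughly 8.7% of the transactions are labeled fraudulent, so the classes are imbalanced. One way to start relating the target to the features is to compare fraud rates across the levels of a binary feature and to compare a continuous feature between the two classes. The sketch below does this on a small made-up `sample` DataFrame (the values are illustrative, not taken from the dataset); the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical sample that reuses the dataset's column names; values are made up.
sample = pd.DataFrame({
    "online_order": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "used_chip": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "ratio_to_median_purchase_price": [0.4, 1.2, 5.0, 0.9, 6.5, 0.3],
    "fraud": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Class balance: share of legitimate (0.0) and fraudulent (1.0) transactions.
class_share = sample["fraud"].value_counts(normalize=True)

# Fraud rate within each level of a binary feature (the mean of a 0/1 target).
rate_by_online = sample.groupby("online_order")["fraud"].mean()

# Median of a continuous feature within each class of the target.
median_ratio = sample.groupby("fraud")["ratio_to_median_purchase_price"].median()

print(class_share, rate_by_online, median_ratio, sep="\n\n")
```

Medians are used for the continuous feature because the `max` values in the summary statistics are far above the 75th percentile, which suggests heavy right tails that would distort a mean.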

# **4. Pre-Processing**

- Prepare the data for machine learning models
- Convert categorical data to numeric
- Handle missing values
- Handle numeric data
- Validation
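A possible sketch of the validation and numeric-treatment steps: hold out a test set with a stratified split so both parts keep the same fraud proportion, then standardize the continuous columns using statistics learned from the training part only. The notebook does not prescribe these choices; scikit-learn, the 25% test size, and the synthetic `sample` DataFrame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df: 200 rows, exactly 20 of them labeled as fraud.
rng = np.random.default_rng(0)
n = 200
fraud = np.zeros(n)
fraud[:20] = 1.0
sample = pd.DataFrame({
    "distance_from_home": rng.exponential(25.0, n),
    "ratio_to_median_purchase_price": rng.exponential(1.8, n),
    "online_order": rng.integers(0, 2, n).astype(float),
    "fraud": fraud,
})

X = sample.drop(columns="fraud")
y = sample["fraud"]

# Stratify on the target so train and test keep the same fraud proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts,
# so no information from the test set leaks into the preprocessing.
continuous = ["distance_from_home", "ratio_to_median_purchase_price"]
scaler = StandardScaler().fit(X_train[continuous])
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[continuous] = scaler.transform(X_train[continuous])
X_test_scaled[continuous] = scaler.transform(X_test[continuous])

print(y_train.mean(), y_test.mean())
```

The binary columns are left unscaled here because they are already on a 0/1 scale; whether scaling is needed at all depends on the algorithm chosen later.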

# **5. Modeling**

- What type of problem is being solved: classification or regression?
- Which algorithm can be used for it?
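Since the target `fraud` takes only the values 0 and 1, this is a binary classification problem. The notebook does not name an algorithm, so the sketch below uses logistic regression from scikit-learn purely as an assumed baseline. The data is synthetic with a made-up labeling rule, and `class_weight="balanced"` is one common way to compensate for a rare positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, deterministic data with two features named after the dataset's
# columns. The labeling rule below is invented for illustration only.
n = 400
ratio_to_median = np.linspace(0.1, 8.0, n)
online_order = np.tile([0.0, 1.0], n // 2)
fraud = ((ratio_to_median > 4.0) & (online_order == 1.0)).astype(int)

X = np.column_stack([ratio_to_median, online_order])

# class_weight="balanced" reweights samples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, fraud)

pred = model.predict(X)               # hard 0/1 predictions
proba = model.predict_proba(X)[:, 1]  # estimated probability of fraud

# Compare a high-ratio online purchase with a low-ratio in-person one.
high_risk = model.predict_proba([[7.0, 1.0]])[0, 1]
low_risk = model.predict_proba([[0.5, 0.0]])[0, 1]
print(high_risk, low_risk)
```

In practice the model would be fitted on a training split and scored on held-out data; fitting and predicting on the same rows, as done here for brevity, says nothing about generalization.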

# **6. Evaluation**

- Which metrics can we use?
- What conclusions can we draw from the metrics, from both a technical and a business perspective?
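With about 8.7% of the transactions labeled as fraud (the `fraud` mean in `df.describe()`), a model that always predicts "not fraud" would already reach roughly 91.3% accuracy, so accuracy alone can be misleading. The sketch below computes precision, recall, and F1 from the confusion-matrix counts using NumPy only. `binary_metrics` is a hypothetical helper and the label vectors are made up.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = fraud)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # fraud correctly flagged
    fp = int(np.sum(~y_true & y_pred))   # legitimate flagged as fraud
    fn = int(np.sum(y_true & ~y_pred))   # fraud that was missed
    tn = int(np.sum(~y_true & ~y_pred))  # legitimate correctly passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 10 transactions, 2 of them fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_legit = [0] * 10                       # never flags anything
some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # one hit, one miss, one false alarm

baseline = binary_metrics(y_true, always_legit)
candidate = binary_metrics(y_true, some_model)
print(baseline)
print(candidate)
```

Both prediction vectors have the same accuracy, yet only the second one catches any fraud. From a business perspective, recall relates to the fraud losses that are avoided, while precision relates to how many legitimate customers are inconvenienced by false alarms.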

# **7. Conclusion**

- What are your main results?