# Preparação dos dados
_Feature Engineering_

---

## Sumário

1. **Importação de bibliotecas**
2. **Carregamento das bases**
3. **Feature Engineering**
    - 3.1. Ajustando a tipagem das variáveis
    - 3.2. Criando novas features
        - 3.2.1. Criando variáveis temporais
        - 3.2.2. Criando variável de diferença de tempo entre transações consecutivas para o mesmo cliente
        - 3.2.3. Criando variável com a proporção de transações em período de alto risco por cliente 
        - 3.2.4. Criar flag indicando se valor da transação está acima da média do cliente
        - 3.2.5. Criando variável de diferença de tempo entre transações consecutivas por cliente e cartão
        - 3.2.6. Criando variável de transação de valor alto para faixa etária
        - 3.2.7. Criando variável de tempo de conta aberta em anos
        - 3.2.8. Criando variáveis financeiras
        - 3.2.9. Criando variáveis de risco com base no cartão
        - 3.2.10. Criando variável de distância da transação em relação à média de localização do cliente
        - 3.2.11. Criando variáveis de geolocalização e localidade
        - 3.2.12. Criando variáveis de diversidade e comportamento
        - 3.2.13. Criando variáveis de padrões de Comerciante/Cliente
        - 3.2.14. Criando variáveis relacionadas categorias de comerciante (MCC)
        - 3.2.15. Criando variáveis de transações em janelas móveis (rolling windows)
4. **Salvando os DataFrames em formato parquet**

<br>

---

<br>

## 1. Importação de bibliotecas

In [None]:
# Importação de pacotes e definição de parâmetros globais

import pandas as pd
import numpy as np
import seaborn as sns
import warnings
import gc

from pathlib import Path

In [2]:
# Configurações para exibição de dados no Jupyter Notebook

# Configurar para exibir todas as colunas do Dataframe
pd.set_option('display.max_columns', None)

# Configurar para exibir o conteúdo completo das colunas
pd.set_option('display.max_colwidth', None)

# Configurar a supressão de mensagens de aviso durante a execução
warnings.filterwarnings('ignore')

# Configurar estilo dos gráficos do seaborn
sns.set_style('whitegrid')

## 2. Carregamento das bases

In [3]:
# Efetuando a limpeza da memória antes do carregamento dos dados
print(f'\nQuantidade de objetos removidos da memória {gc.collect()}')


Quantidade de objetos removidos da memória 0


In [4]:
# Caminho base dos arquivos
caminho_base = Path('dados/dados_parquet')

# Nomes dos arquivos a serem carregados
arquivos = ['df_train', 'df_val', 'df_test']

# Carregamento dos arquivos Parquet em um dicionário
dfs = {}
for nome in arquivos:
    caminho_arquivo = caminho_base / f'{nome}.parquet'
    try:
        dfs[nome] = pd.read_parquet(caminho_arquivo)
    except Exception as e:
        print(f'Erro ao carregar {caminho_arquivo}: {e}')

# Exibição da volumetria
print('\nVOLUMETRIA')
for nome, df in dfs.items():
    print(f'\n{nome}')
    print('-' * 45)
    print(f'Quantidade de linhas (registros):  {df.shape[0]:,}')
    print(f'Quantidade de colunas (variáveis): {df.shape[1]:,}')



VOLUMETRIA

df_train
---------------------------------------------
Quantidade de linhas (registros):  3,566,068
Quantidade de colunas (variáveis): 43

df_val
---------------------------------------------
Quantidade de linhas (registros):  891,518
Quantidade de colunas (variáveis): 43

df_test
---------------------------------------------
Quantidade de linhas (registros):  2,194,300
Quantidade de colunas (variáveis): 43


In [5]:
# Criação dos dataframes

df_train = pd.read_parquet(caminho_base / f'df_train.parquet')
df_val = pd.read_parquet(caminho_base / f'df_val.parquet')
df_test = pd.read_parquet(caminho_base / f'df_test.parquet')

In [6]:
df_train.head(5)

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors,id_card,client_id_card,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web,id_client,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,code,description,transaction_id,is_fraud
0,8390375,2010-08-16 09:13:00,360,2611,$55.94,Swipe Transaction,81833,Leesburg,NJ,8327.0,5912,,2611,360,Mastercard,Debit,5641208271147811,12/2016,668,YES,2,$28003,04/2009,2009,No,360,43,67,1976,4,Male,881 Plum Street,39.22,-74.8,$22697,$46278,$51243,791,4,5912,Drug Stores and Pharmacies,8390375.0,No
1,12144433,2012-12-31 10:05:00,1385,3807,$203.59,Swipe Transaction,3558,Burnsville,MN,55337.0,3640,,3807,1385,Mastercard,Debit,5045568837955027,04/2024,43,YES,2,$39632,07/2007,2013,No,1385,51,68,1968,7,Female,5537 Eighth Street,44.96,-93.26,$40364,$82298,$182301,789,6,3640,"Lighting, Fixtures, Electrical Supplies",12144433.0,No
2,17770001,2016-05-09 02:15:00,328,3150,$13.29,Chip Transaction,34702,Charlotte,NC,28227.0,5310,,3150,328,Mastercard,Debit,5900781818896314,08/2024,694,YES,2,$15125,05/2014,2014,No,328,45,67,1974,3,Male,4391 Lexington Lane,35.19,-80.83,$17817,$36323,$65525,700,3,5310,Discount Stores,17770001.0,No
3,17093797,2015-12-15 12:01:00,1376,2182,$1.68,Chip Transaction,14528,Cedar Park,TX,78613.0,5499,,2182,1376,Mastercard,Credit,5822242274317975,10/2023,304,YES,2,$13100,03/2008,2015,No,1376,49,68,1971,1,Female,97536 Summit Street,30.51,-97.83,$30418,$62019,$85666,543,4,5499,Miscellaneous Food Stores,17093797.0,No
4,11850127,2012-10-25 17:27:00,1629,4290,$20.25,Swipe Transaction,54709,Downingtown,PA,19335.0,5813,,4290,1629,Mastercard,Debit,5366247073382596,04/2021,515,YES,2,$18776,08/2009,2012,No,1629,42,63,1977,6,Female,53789 Bayview Street,26.14,-80.13,$14430,$29422,$58679,598,1,5813,Drinking Places (Alcoholic Beverages),11850127.0,No


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3566068 entries, 0 to 3566067
Data columns (total 43 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   id                     int32         
 1   date                   datetime64[ns]
 2   client_id              int32         
 3   card_id                int32         
 4   amount                 object        
 5   use_chip               object        
 6   merchant_id            int32         
 7   merchant_city          object        
 8   merchant_state         object        
 9   zip                    float64       
 10  mcc                    int32         
 11  errors                 object        
 12  id_card                int32         
 13  client_id_card         int32         
 14  card_brand             object        
 15  card_type              object        
 16  card_number            int64         
 17  expires                object        
 18  cvv                   

## 3. Feature Engineering

### 3.1. Ajustando a tipagem das variáveis

In [8]:
# Lista de dataframes para iteração
dfs = [df_train, df_val, df_test]
        
# Lista de variáveis com valores monetários a serem limpos e convertidos
vars_numeric = ['amount', 'credit_limit', 'per_capita_income', 'yearly_income', 'total_debt']

# Limpeza e conversão para tipo numérico
for df in dfs:
    for var in vars_numeric:
        df[var] = (
            df[var]
            .replace(r'[\$,]', '', regex=True)  # Remove símbolos de dólar
            .apply(pd.to_numeric, errors='coerce')  # Converte para float, tratando erros como NaN
        )
        
# Lista de variáveis categóricas a serem convertidas para tipo 'datetime'
vars_date = ['expires', 'acct_open_date']
        
# Conversão para datetime (formato 'MM/YYYY')
for df in dfs:
    for var in vars_date:
        df[var] = pd.to_datetime(df[var], format='%m/%Y', errors='coerce')

In [9]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 43), (891518, 43), (2194300, 43))

In [10]:
df_train.head(5)

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors,id_card,client_id_card,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web,id_client,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,code,description,transaction_id,is_fraud
0,8390375,2010-08-16 09:13:00,360,2611,55.94,Swipe Transaction,81833,Leesburg,NJ,8327.0,5912,,2611,360,Mastercard,Debit,5641208271147811,2016-12-01,668,YES,2,28003,2009-04-01,2009,No,360,43,67,1976,4,Male,881 Plum Street,39.22,-74.8,22697,46278,51243,791,4,5912,Drug Stores and Pharmacies,8390375.0,No
1,12144433,2012-12-31 10:05:00,1385,3807,203.59,Swipe Transaction,3558,Burnsville,MN,55337.0,3640,,3807,1385,Mastercard,Debit,5045568837955027,2024-04-01,43,YES,2,39632,2007-07-01,2013,No,1385,51,68,1968,7,Female,5537 Eighth Street,44.96,-93.26,40364,82298,182301,789,6,3640,"Lighting, Fixtures, Electrical Supplies",12144433.0,No
2,17770001,2016-05-09 02:15:00,328,3150,13.29,Chip Transaction,34702,Charlotte,NC,28227.0,5310,,3150,328,Mastercard,Debit,5900781818896314,2024-08-01,694,YES,2,15125,2014-05-01,2014,No,328,45,67,1974,3,Male,4391 Lexington Lane,35.19,-80.83,17817,36323,65525,700,3,5310,Discount Stores,17770001.0,No
3,17093797,2015-12-15 12:01:00,1376,2182,1.68,Chip Transaction,14528,Cedar Park,TX,78613.0,5499,,2182,1376,Mastercard,Credit,5822242274317975,2023-10-01,304,YES,2,13100,2008-03-01,2015,No,1376,49,68,1971,1,Female,97536 Summit Street,30.51,-97.83,30418,62019,85666,543,4,5499,Miscellaneous Food Stores,17093797.0,No
4,11850127,2012-10-25 17:27:00,1629,4290,20.25,Swipe Transaction,54709,Downingtown,PA,19335.0,5813,,4290,1629,Mastercard,Debit,5366247073382596,2021-04-01,515,YES,2,18776,2009-08-01,2012,No,1629,42,63,1977,6,Female,53789 Bayview Street,26.14,-80.13,14430,29422,58679,598,1,5813,Drinking Places (Alcoholic Beverages),11850127.0,No


### 3.2. Criando novas features

#### 3.2.1. Criando variáveis temporais

In [11]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:
    
    df['hour'] = df['date'].dt.hour     # Hora do dia
    
    # Definindo as condições para categorizar os horários
    conditions = [
        (df['hour'] >= 0) & (df['hour'] < 6),
        (df['hour'] >= 6) & (df['hour'] < 12),
        (df['hour'] >= 12) & (df['hour'] < 18),
        (df['hour'] >= 18) & (df['hour'] <= 23)
    ]
    
    # Definindo os rótulos para cada faixa de horário (madrugada, manhã, tarde e noite)
    choices = ['dawn', 'morning', 'afternoon', 'evening']
    
    df['day'] = df['date'].dt.day                   # Dia do mês
    df['month'] = df['date'].dt.month               # Mês do ano
    df['year'] = df['date'].dt.year                 # Ano
    df['quarter'] = df['date'].dt.quarter           # Trimestre do ano
    df['day_of_year'] = df['date'].dt.dayofyear     # Dia do ano
    df['weekday'] = df['date'].dt.dayofweek         # Dia da semana (0=segunda, 6=domingo)
    # Final de semana (0=semana, 1=final de semana)
    df['weekend'] = (df['date'].dt.dayofweek >= 5).astype(int)   
    # Período do dia (madrugada, manhã, tarde, noite)               
    df['period_of_day'] = np.select(conditions, choices, default='indefinido')      

In [12]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 52), (891518, 52), (2194300, 52))

#### 3.2.2. Criando variável de diferença de tempo entre transações consecutivas para o mesmo cliente

In [13]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:
    
    # Ordena pelas colunas client_id e data da transação
    df.sort_values(by=['client_id', 'date'], inplace=True)

    # Calcula a diferença de tempo entre transações consecutivas para o mesmo cliente
    df['time_diff_transaction'] = df.groupby('client_id')['date'].diff()

    # Substitui valores NaT por 0
    df['time_diff_transaction'] = df['time_diff_transaction'].fillna(pd.Timedelta(0))

    # Converte a diferença de tempo para minutos
    df['time_diff_transaction'] = df['time_diff_transaction'].dt.total_seconds() / 60

In [14]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 53), (891518, 53), (2194300, 53))

#### 3.2.3. Criando variável com a proporção de transações em período de alto risco por cliente

In [15]:
# Lista os períodos de alto risco
high_risk_periods = ['dawn', 'evening']

# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i, df in enumerate(dfs):
    
    # Cria a variável de alto risco (1 se o período for de alto risco, 0 caso contrário)
    df['high_risk_period'] = df['period_of_day'].isin(high_risk_periods).astype(int)
    
    # Calcular proporção de transações de alto risco por cliente
    proportion_by_client = (
        df.groupby('client_id')['high_risk_period']
        .mean()
        .reset_index()
        .rename(columns={'high_risk_period': 'proportion_high_risk_period'})
    )
    
    # Mesclar ao dataframe original
    dfs[i] = df.merge(proportion_by_client, on='client_id', how='left')
    
# Atribui os DataFrames atualizados de volta aos nomes originais
df_train, df_val, df_test = dfs

In [16]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 55), (891518, 55), (2194300, 55))

#### 3.2.4. Criar flag indicando se valor da transação está acima da média do cliente

In [17]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i in range(len(dfs)):
    df = dfs[i]

    # Calcula a média do valor das transações por cliente
    mean_amount_by_client = (
        df.groupby('client_id')['amount']
        .mean()
        .rename('mean_amount_by_client')
        .reset_index()
    )

    # Junta ao DataFrame original
    df = df.merge(mean_amount_by_client, on='client_id', how='left')

    # Cria a flag
    df['flag_above_mean_amount'] = (df['amount'] > df['mean_amount_by_client']).astype(int)

    # Atualiza a lista
    dfs[i] = df

# Atribui os DataFrames atualizados de volta aos nomes originais
df_train, df_val, df_test = dfs

In [18]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 57), (891518, 57), (2194300, 57))

#### 3.2.5. Criando variável de diferença de tempo entre transações consecutivas por cliente e cartão

In [19]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i in range(len(dfs)):
    df = dfs[i]

    # Ordena o DataFrame por client_id, card_id e date
    df = df.sort_values(by=['client_id', 'card_id', 'date'])

    # Calcula a diferença de tempo entre transações consecutivas por cliente e cartão
    df['time_diff_client_card'] = df.groupby(['client_id', 'card_id'])['date'].diff()

    # Substitui valores NaT por 0 minutos
    df['time_diff_client_card'] = df['time_diff_client_card'].fillna(pd.Timedelta(0))
    df['time_diff_client_card'] = df['time_diff_client_card'].dt.total_seconds() / 60  # converte para minutos

    # Atualiza o DataFrame na lista
    dfs[i] = df

# Atualiza os nomes dos DataFrames
df_train, df_val, df_test = dfs

In [20]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 58), (891518, 58), (2194300, 58))

#### 3.2.6. Criando variável de transação de valor alto para faixa etária

In [21]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:
    
    # Criar as faixas etárias
    df['age_group'] = pd.cut(df['current_age'], bins=[0, 25, 40, 60, 120], 
                             labels=['young', 'adult', 'mature', 'elderly'])

    # Calcular média e desvio por faixa etária
    stats_by_age = df.groupby('age_group')['amount'].agg(['mean', 'std']).reset_index()

    # Mesclar e criar flag
    df = df.merge(stats_by_age, on='age_group', how='left')
    df['high_amount_for_age_group'] = (
        df['amount'] > (df['mean'] + 2 * df['std'])
    ).astype(int)


In [22]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 59), (891518, 59), (2194300, 59))

#### 3.2.7. Criando variável de tempo de conta aberta em anos

In [23]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:
    
    # Calcular o tempo de conta aberta em anos
    df['years_acct_open'] = df['date'].dt.year - df['acct_open_date'].dt.year

In [24]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 60), (891518, 60), (2194300, 60))

#### 3.2.8. Criando variáveis financeiras

In [25]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:

    # Relação dívida/renda (renda comprometida com dívidas)
    df['debt_to_income'] = df['total_debt'] / df['yearly_income']

    # Renda mensal estimada
    df['monthly_income'] = df['yearly_income'] / 12

    # Valor transacionado em relação à renda anual
    df['amount_income_ratio'] = df['amount'] / df['yearly_income']

    # Score de crédito categorizado
    df['credit_score_category'] = pd.cut(
        df['credit_score'],
        bins=[0, 580, 670, 740, 800, 850],
        labels=['bad', 'average', 'good', 'very good', 'excellent']
    )

In [26]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 64), (891518, 64), (2194300, 64))

#### 3.2.9. Criando variáveis de risco com base no cartão

In [27]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:

    # Criar a flag, 1 se cartão comprometido e transação com chip
    df['flag_risky_chip_use'] = ((df['card_on_dark_web'] == 1
                                  ) & (df['use_chip'] == 1)).astype(int)

    # Criar a flag, 1 se cartão comprometido sem chip (mais vulnerável)
    df['flag_no_chip_darkweb'] = ((df['card_on_dark_web'] == 1
                                         ) & (df['has_chip'] == 0)).astype(int)

In [28]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 66), (891518, 66), (2194300, 66))

#### 3.2.10. Criando variável de distância da transação em relação à média de localização do cliente

In [29]:
def haversine(lat1, lon1, lat2, lon2): 
    '''
    Calcula a distância geográfica entre dois pontos na superfície da Terra 
    utilizando a fórmula de Haversine.

    :param lat1: float
        Latitude do primeiro ponto (em graus decimais).
    :param lon1: float
        Longitude do primeiro ponto (em graus decimais).
    :param lat2: float
        Latitude do segundo ponto (em graus decimais).
    :param lon2: float
        Longitude do segundo ponto (em graus decimais).
    :return: float
        Distância entre os dois pontos, em quilômetros.
    '''       
    R = 6371  # raio da Terra em km
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)

    a = np.sin(delta_phi / 2.0) ** 2 + \
        np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return R * c

In [30]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i in range(len(dfs)):
    df = dfs[i]

    # Calcular a média de latitude e longitude por cliente
    client_center = (
        df.groupby('client_id')[['latitude', 'longitude']]
        .mean()
        .reset_index()
        .rename(columns={'latitude': 'center_lat', 'longitude': 'center_lon'})
    )

    # Mesclar com o DataFrame original
    df = df.merge(client_center, on='client_id', how='left')

    # Calcular a distância geográfica entre ponto atual e o centro usando a fórmula de Haversine
    df['distance_from_center'] = haversine(
        df['latitude'], df['longitude'],
        df['center_lat'], df['center_lon']
    )

    # Atualiza a lista
    dfs[i] = df

# Atualiza os nomes dos DataFrames
df_train, df_val, df_test = dfs

In [31]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 69), (891518, 69), (2194300, 69))

#### 3.2.11. Criando variáveis de geolocalização e localidade

In [32]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:

    # Transações online (estado nulo)
    df['is_online_transaction'] = df['merchant_state'].isna().astype(int)

    # Cidade incomum para o cliente
    client_cities = df_train.groupby('client_id')['merchant_city'].agg(
        lambda x: x.mode().values[0])
    df = df.merge(client_cities.rename('usual_city'), on='client_id')
    df['unusual_city_for_client'] = (df['merchant_city'] != df['usual_city']).astype(int)
    df.drop(columns='usual_city', inplace=True)

    # Estado incomum para o cliente
    client_states = df.groupby('client_id')['merchant_state'].agg(
        lambda x: x.mode().dropna().values[0] if not x.dropna().empty else np.nan)
    df = df.merge(client_states.rename('usual_state'), on='client_id')
    df['unusual_state_for_client'] = (df['merchant_state'] != df['usual_state']).astype(int)
    df.drop(columns='usual_state', inplace=True)


In [33]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 70), (891518, 70), (2194300, 70))

#### 3.2.12. Criando variáveis de diversidade e comportamento

In [34]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:

    # Diversidade de categorias (MCC) por cliente
    df['merchant_category_diversity'] = df.groupby(
        'client_id')['code'].transform('nunique')

    # Diversidade de estados por cliente
    df['state_diversity_per_client'] = df.groupby(
        'client_id')['merchant_state'].transform(lambda x: x.nunique(dropna=True))

    # Proporção de transações online por cliente
    df['online_transaction_ratio'] = df.groupby(
        'client_id')['is_online_transaction'].transform('mean')


In [35]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 73), (891518, 73), (2194300, 73))

#### 3.2.13. Criando variáveis de padrões de Comerciante/Cliente

In [36]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:

    # Quantidade de transações cliente-comerciante
    df['merchant_transaction_count'] = df.groupby(
        ['client_id', 'merchant_id'])['id'].transform('count')

    # Comerciante novo para o cliente (sem alterar a ordem dos índices)
    first_seen_merchant = (
        df.groupby(['client_id', 'merchant_id'])['date']
        .transform('min')
    )
    df['merchant_id_is_new'] = (df['date'] == first_seen_merchant).astype(int)

    # Proporção de transações do cliente com o comerciante
    client_total_tx = df.groupby(
        'client_id')['id'].transform('count')
    df['merchant_transaction_ratio'] = df['merchant_transaction_count'] / client_total_tx

    # Quantidade de comerciantes únicos por cliente
    df['client_unique_merchants_count'] = df.groupby(
        'client_id')['merchant_id'].transform('nunique')

In [37]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 77), (891518, 77), (2194300, 77))

#### 3.2.14. Criando variáveis relacionadas categorias de comerciante (MCC)

In [38]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for df in dfs:
    
    # Quantidade de transações por categoria de comerciante (MCC)
    most_common_mcc_per_client = (
        df.groupby('client_id')['code']
        .agg(lambda x: x.mode().values[0])
        .rename('most_common_mcc')
    )
    df['most_common_mcc'] = df['client_id'].map(most_common_mcc_per_client)

    # Proporção de transações por categoria de comerciante (MCC)
    client_total_tx = df.groupby('client_id')['id'].transform('count')
    mcc_client_count = df.groupby(['client_id', 'code'])['id'].transform('count')
    df['mcc_transaction_ratio'] = mcc_client_count / client_total_tx

In [39]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 79), (891518, 79), (2194300, 79))

#### 3.2.15. Criando variáveis de transações em janelas móveis (rolling windows)

In [40]:
def create_transaction_features_light(df, window, group_col, prefix):
    '''
    Cria features agregadas (soma, média e desvio padrão) sobre o valor das transações 
    financeiras utilizando janelas móveis (rolling windows).

    :param df: DataFrame
        DataFrame contendo os dados de transações financeiras.
    :param window: str
        Janela temporal no formato aceito pelo pandas (ex: '1H', '7D').
    :param group_col: str
        Coluna de agrupamento, geralmente o identificador do cliente.
    :param prefix: str
        Prefixo que será adicionado ao nome das novas colunas geradas.

    :return: pd.DataFrame
        DataFrame original enriquecido com as novas features agregadas.
    '''

    window_label = window.lower()

    # Ordena o DataFrame por cliente e data
    df = df.sort_values([group_col, 'date'])

    # Função auxiliar para aplicar rolling nas transações de cada grupo (cliente)
    def apply_rolling(group):
        # Aplica rolling sobre a coluna 'amount' com a janela definida
        rolled = group.set_index('date')['amount'].rolling(window=window, min_periods=1).agg(['sum', 'mean', 'std'])

        # Renomeia as colunas resultantes com base no prefixo e janela
        rolled.columns = [f"{prefix}_{col}_last_{window_label}" for col in rolled.columns]

        # Concatena o resultado ao grupo original
        return pd.concat([group.reset_index(drop=True), rolled.reset_index(drop=True)], axis=1)

    # Aplica a função em cada grupo
    df = df.groupby(group_col, group_keys=False).apply(apply_rolling).reset_index(drop=True)

    return df

In [41]:
def process_in_batches(df, window, group_col, prefix, batch_size=300):
    '''
    Processa o DataFrame em lotes (batches) de grupos distintos para evitar estouro de memória,
    aplicando a função de criação de features por janela de tempo.

    :param df: DataFrame
        DataFrame original com os dados de transações.
    :param window: str
        Janela temporal no formato aceito pelo pandas (ex: '1H', '7D').
    :param group_col: str
        Coluna usada para agrupar os dados (ex: 'client_id').
    :param prefix: str
        Prefixo para os nomes das novas features agregadas.
    :param batch_size: int
        Número máximo de grupos (clientes) a serem processados em cada lote.

    :return: pd.DataFrame
        DataFrame completo com as features agregadas para todos os grupos.
    '''

    # Obtém os valores únicos de identificadores (ex: clientes)
    unique_ids = df[group_col].unique()
    results = []

    # Itera sobre os lotes de identificadores
    for i in range(0, len(unique_ids), batch_size):
        batch_ids = unique_ids[i:i+batch_size]

        # Filtra o DataFrame para o lote atual
        batch_df = df[df[group_col].isin(batch_ids)].copy()

        # Cria as features agregadas para o lote atual
        batch_df = create_transaction_features_light(batch_df, window, group_col, prefix)

        # Adiciona o resultado à lista
        results.append(batch_df)

        # Libera memória
        gc.collect()

    # Retorna o DataFrame concatenado com todos os lotes
    return pd.concat(results)

In [42]:
# Define as janelas de tempo que serão usadas para gerar as features agregadas
time_windows = ['1H', '2H', '4H', '8H', '12H', '24H', '48H', '72H', '7D', '14D', '21D', '30D', '45D']

In [43]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i in range(len(dfs)):
    df = dfs[i]

    # Para cada janela de tempo, aplica o processamento em lotes para gerar as features
    for window in time_windows: 
        df = process_in_batches(df, window, 'client_id', 'client', batch_size=300)
    
    # Atualiza o DataFrame processado na lista
    dfs[i] = df

# Atualiza os nomes dos DataFrames individuais com os dados processados
df_train, df_val, df_test = dfs

In [44]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 118), (891518, 118), (2194300, 118))

In [45]:
# Itera sobre os DataFrames de treino, validação e teste (armazenados em 'dfs')
for i in range(len(dfs)):
    df = dfs[i]

    # Para cada janela de tempo, aplica o processamento em lotes para gerar as features
    for window in time_windows: 
        df = process_in_batches(df, window, 'card_id', 'card', batch_size=300)
    
    # Atualiza o DataFrame processado na lista
    dfs[i] = df

# Atualiza os nomes dos DataFrames individuais com os dados processados
df_train, df_val, df_test = dfs

In [46]:
df_train.shape, df_val.shape, df_test.shape

((3566068, 157), (891518, 157), (2194300, 157))

In [47]:
df_train.head(10)

Unnamed: 0,id,date,client_id,card_id,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors,id_card,client_id_card,card_brand,card_type,card_number,expires,cvv,has_chip,num_cards_issued,credit_limit,acct_open_date,year_pin_last_changed,card_on_dark_web,id_client,current_age,retirement_age,birth_year,birth_month,gender,address,latitude,longitude,per_capita_income,yearly_income,total_debt,credit_score,num_credit_cards,code,description,transaction_id,is_fraud,hour,day,month,year,quarter,day_of_year,weekday,weekend,period_of_day,time_diff_transaction,high_risk_period,proportion_high_risk_period,mean_amount_by_client,flag_above_mean_amount,time_diff_client_card,age_group,years_acct_open,debt_to_income,monthly_income,amount_income_ratio,credit_score_category,flag_risky_chip_use,flag_no_chip_darkweb,center_lat,center_lon,distance_from_center,is_online_transaction,merchant_category_diversity,state_diversity_per_client,online_transaction_ratio,merchant_transaction_count,merchant_id_is_new,merchant_transaction_ratio,client_unique_merchants_count,most_common_mcc,mcc_transaction_ratio,client_sum_last_1h,client_mean_last_1h,client_std_last_1h,client_sum_last_2h,client_mean_last_2h,client_std_last_2h,client_sum_last_4h,client_mean_last_4h,client_std_last_4h,client_sum_last_8h,client_mean_last_8h,client_std_last_8h,client_sum_last_12h,client_mean_last_12h,client_std_last_12h,client_sum_last_24h,client_mean_last_24h,client_std_last_24h,client_sum_last_48h,client_mean_last_48h,client_std_last_48h,client_sum_last_72h,client_mean_last_72h,client_std_last_72h,client_sum_last_7d,client_mean_last_7d,client_std_last_7d,client_sum_last_14d,client_mean_last_14d,client_std_last_14d,client_sum_last_21d,client_mean_last_21d,client_std_last_21d,client_sum_last_30d,client_mean_last_30d,client_std_last_30d,client_sum_last_45d,client_mean_last_45d,client_std_last_45d,card_sum_last_1h,card_mean_last_1h,card_std_last_1h,card_sum_last_2h,card_mean_last_2h,card_std_last_2h,card_sum_last_4h,card_mean_last_4h,card_std_last_4h,card_sum_last_8h,card_mean_last_8h,card_std_last_8h,card_sum_last_12h,card_mean_last_12h,card_std_last_12h,card_sum_last_24h,card_mean_last_24h,card_std_last_24h,card_sum_last_48h,card_mean_last_48h,card_std_last_48h,card_sum_last_72h,card_mean_last_72h,card_std_last_72h,card_sum_last_7d,card_mean_last_7d,card_std_last_7d,card_sum_last_14d,card_mean_last_14d,card_std_last_14d,card_sum_last_21d,card_mean_last_21d,card_std_last_21d,card_sum_last_30d,card_mean_last_30d,card_std_last_30d,card_sum_last_45d,card_mean_last_45d,card_std_last_45d
0,7516687,2010-01-11 15:56:00,115,19,-85.0,Swipe Transaction,61195,Mamaroneck,NY,10543.0,5541,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5541,Service Stations,7516687.0,No,15,11,1,2010,1,11,0,0,afternoon,1369.0,0,0.306585,83.400384,0,0.0,elderly,9,0.773278,8418.166667,-0.000841,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,185,1,0.08068,236,5812,0.091583,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,75.0,37.5,173.241161,75.0,37.5,173.241161,75.0,37.5,173.241161,192.7,48.175,108.818625,966.91,161.151667,304.175943,966.91,161.151667,304.175943,966.91,161.151667,304.175943,966.91,161.151667,304.175943,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,,-85.0,-85.0,
1,7516690,2010-01-11 15:57:00,115,19,130.15,Swipe Transaction,61195,Mamaroneck,NY,10543.0,5541,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5541,Service Stations,7516690.0,No,15,11,1,2010,1,11,0,0,afternoon,1.0,0,0.306585,83.400384,1,1.0,elderly,9,0.773278,8418.166667,0.001288,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,185,0,0.08068,236,5812,0.091583,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,205.15,68.383333,133.669708,205.15,68.383333,133.669708,205.15,68.383333,133.669708,322.85,64.57,101.119236,1097.06,156.722857,277.920499,1097.06,156.722857,277.920499,1097.06,156.722857,277.920499,1097.06,156.722857,277.920499,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024,45.15,22.575,152.134024
2,7608574,2010-02-04 07:31:00,115,19,152.15,Swipe Transaction,1911,Mamaroneck,NY,10543.0,4900,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,4900,"Utilities - Electric, Gas, Water, Sanitary",7608574.0,No,7,4,2,2010,1,35,3,0,morning,7846.0,0,0.306585,83.400384,1,34054.0,elderly,9,0.773278,8418.166667,0.001506,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,36,0,0.0157,236,5812,0.0314,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,340.33,113.443333,34.158739,340.33,68.066,81.784636,645.76,92.251429,78.553412,968.61,80.7175,85.365077,1742.82,124.487143,199.038193,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,152.15,152.15,,197.3,65.766667,131.030305,197.3,65.766667,131.030305
3,7613098,2010-02-05 10:17:00,115,19,19.74,Swipe Transaction,59397,Mount Vernon,NY,10552.0,5812,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5812,Eating Places and Restaurants,7613098.0,No,10,5,2,2010,1,36,4,0,morning,1606.0,0,0.306585,83.400384,0,1606.0,elderly,9,0.773278,8418.166667,0.000195,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,121,1,0.052769,236,5812,0.150458,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,171.89,85.945,93.628009,171.89,85.945,93.628009,259.41,86.47,66.211245,360.07,60.011667,75.764207,665.5,83.1875,77.112626,988.35,76.026923,83.462265,1762.56,117.504,193.695477,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,19.74,19.74,,171.89,85.945,93.628009,171.89,85.945,93.628009,171.89,85.945,93.628009,171.89,85.945,93.628009,171.89,85.945,93.628009,217.04,54.26,109.432966,217.04,54.26,109.432966
4,7618627,2010-02-06 17:12:00,115,19,86.52,Swipe Transaction,32858,Yorktown Heights,NY,10598.0,5311,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5311,Department Stores,7618627.0,No,17,6,2,2010,1,37,5,1,afternoon,1855.0,0,0.306585,83.400384,1,1855.0,elderly,9,0.773278,8418.166667,0.000856,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,12,1,0.005233,236,5812,0.029219,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,106.26,53.13,47.220591,258.41,86.136667,66.205832,258.41,86.136667,66.205832,446.59,63.798571,69.884884,752.02,83.557778,72.14081,957.17,79.764167,84.146866,1849.08,115.5675,187.28786,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,86.52,86.52,,106.26,53.13,47.220591,258.41,86.136667,66.205832,258.41,86.136667,66.205832,258.41,86.136667,66.205832,258.41,86.136667,66.205832,303.56,60.712,95.86356,303.56,60.712,95.86356
5,7620784,2010-02-07 10:13:00,115,19,19.67,Swipe Transaction,50783,Mount Vernon,NY,10550.0,5411,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5411,"Grocery Stores, Supermarkets",7620784.0,No,10,7,2,2010,1,38,6,1,morning,1021.0,0,0.306585,83.400384,0,1021.0,elderly,9,0.773278,8418.166667,0.000195,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,134,1,0.058439,236,5812,0.14348,19.67,19.67,,19.67,19.67,,19.67,19.67,,19.67,19.67,,19.67,19.67,,106.19,53.095,47.270088,125.93,41.976667,38.575674,125.93,41.976667,38.575674,278.08,69.52,63.455464,466.26,58.2825,66.555344,614.94,68.326667,69.165596,976.84,75.141538,82.27047,1868.75,109.926471,182.826166,19.67,19.67,,19.67,19.67,,19.67,19.67,,19.67,19.67,,19.67,19.67,,106.19,53.095,47.270088,125.93,41.976667,38.575674,125.93,41.976667,38.575674,278.08,69.52,63.455464,278.08,69.52,63.455464,278.08,69.52,63.455464,323.23,53.871667,87.364745,323.23,53.871667,87.364745
6,7656885,2010-02-16 09:45:00,115,19,131.0,Swipe Transaction,81833,Mamaroneck,NY,10543.0,5912,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5912,Drug Stores and Pharmacies,7656885.0,No,9,16,2,2010,1,47,1,0,morning,827.0,0,0.306585,83.400384,1,12932.0,elderly,9,0.773278,8418.166667,0.001297,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,37,1,0.016136,236,5812,0.037942,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,374.29,187.145,79.40102,374.29,187.145,79.40102,374.29,187.145,79.40102,393.26,131.086667,112.160025,671.34,95.905714,85.37901,859.52,95.502222,74.017659,1008.2,84.016667,81.0685,1487.8,82.655556,84.136155,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,131.0,131.0,,409.08,81.816,61.448393,409.08,81.816,61.448393,409.08,81.816,61.448393,454.23,64.89,84.91363
7,7664105,2010-02-17 23:48:00,115,19,341.14,Swipe Transaction,54850,Mount Vernon,NY,10550.0,4814,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,4814,Telecommunication Services,7664105.0,No,23,17,2,2010,1,48,2,0,evening,2263.0,1,0.306585,83.400384,1,2283.0,elderly,9,0.773278,8418.166667,0.003377,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,36,0,0.0157,236,5812,0.028783,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,494.88,164.96,161.893796,738.17,184.5425,137.865749,738.17,184.5425,137.865749,1035.22,115.024444,115.076494,1223.4,111.218182,103.317116,1372.08,98.005714,103.55996,1851.68,92.584,99.676787,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,341.14,341.14,,472.14,236.07,148.591419,472.14,236.07,148.591419,472.14,236.07,148.591419,750.22,125.036667,119.28487,750.22,125.036667,119.28487,750.22,125.036667,119.28487,795.37,99.42125,125.377574
8,7695693,2010-02-25 20:48:00,115,19,264.59,Swipe Transaction,65898,Mamaroneck,NY,10543.0,5912,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5912,Drug Stores and Pharmacies,7695693.0,No,20,25,2,2010,1,56,3,0,evening,2008.0,1,0.306585,83.400384,1,11340.0,elderly,9,0.773278,8418.166667,0.002619,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,49,0,0.021369,236,5812,0.037942,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,289.31,144.655,169.613704,384.17,128.056667,123.332551,795.98,113.711429,99.590365,1534.15,139.468182,113.710592,1679.05,111.936667,108.213079,2019.38,112.187778,98.89992,2324.81,105.673182,98.274225,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,264.59,264.59,,736.73,245.576667,106.35241,862.66,143.776667,132.497,1014.81,144.972857,120.994058,1014.81,144.972857,120.994058
9,7699749,2010-02-26 23:38:00,115,19,34.46,Swipe Transaction,94123,Mount Vernon,NY,10550.0,5310,,19,115,Mastercard,Debit,5610743457688598,2023-01-01,310,YES,2,46184,2001-01-01,2014,No,115,61,69,1958,7,Male,386 11th Lane,40.93,-73.72,49546,101018,78115,748,6,5310,Discount Stores,7699749.0,No,23,26,2,2010,1,57,4,0,evening,1610.0,1,0.306585,83.400384,0,1610.0,elderly,9,0.773278,8418.166667,0.000341,very good,0,0,40.93,-73.72,0.0,0,75,35,0.083733,15,1,0.006542,236,5812,0.019189,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,299.05,149.525,162.726484,323.77,107.923333,135.764687,830.44,103.805,96.366218,1568.61,130.7175,112.576767,1693.77,112.918,107.380814,2053.84,108.096842,97.753638,2359.27,102.576957,97.156181,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,34.46,34.46,,299.05,149.525,162.726484,299.05,149.525,162.726484,299.05,149.525,162.726484,771.19,192.7975,136.686204,877.38,146.23,129.850838,1049.27,131.15875,118.637399,1049.27,131.15875,118.637399


## 4. Salvando os DataFrames em formato parquet

In [48]:
# Efetuando a limpeza da memória após o processamento
print(f'\nQuantidade de objetos removidos da memória {gc.collect()}')


Quantidade de objetos removidos da memória 0


In [49]:
# Caminho de saída para os arquivos atualizados
caminho_saida = Path('dados/dados_transformados_parquet')

# Cria o diretório se não existir
caminho_saida.mkdir(parents=True, exist_ok=True)

# Dicionário com os DataFrames atualizados
dfs_atualizados = {
    'df_train': df_train,
    'df_val': df_val,
    'df_test': df_test
}

# Salvamento dos arquivos Parquet atualizados
for nome, df in dfs_atualizados.items():
    caminho_arquivo = caminho_saida / f'{nome}.parquet'
    try:
        df.to_parquet(caminho_arquivo, index=False)
        print(f'{nome} salvo com sucesso em: {caminho_arquivo}')
    except Exception as e:
        print(f'Erro ao salvar {nome}: {e}')


df_train salvo com sucesso em: dados\dados_transformados_parquet\df_train.parquet
df_val salvo com sucesso em: dados\dados_transformados_parquet\df_val.parquet
df_test salvo com sucesso em: dados\dados_transformados_parquet\df_test.parquet


In [50]:
# Exibição da volumetria
print('\nVOLUMETRIA')
for nome, df in dfs_atualizados.items():
    print(f'\n{nome}')
    print('-' * 45)
    print(f'Quantidade de linhas (registros):  {df.shape[0]:,}')
    print(f'Quantidade de colunas (variáveis): {df.shape[1]:,}')


VOLUMETRIA

df_train
---------------------------------------------
Quantidade de linhas (registros):  3,566,068
Quantidade de colunas (variáveis): 157

df_val
---------------------------------------------
Quantidade de linhas (registros):  891,518
Quantidade de colunas (variáveis): 157

df_test
---------------------------------------------
Quantidade de linhas (registros):  2,194,300
Quantidade de colunas (variáveis): 157
