Data preprocessing

**Objective**: Transform the raw data into a format ready for predictive modeling.

**Steps performed**:
1. Load the dataset (10,000 records, 18 columns)
2. Drop irrelevant columns: RowNumber, CustomerId, Surname
3. One-hot encode with `drop_first=True`: Geography (2 dummies), Gender (1), Card Type (3)
4. Split into X (17 features) and y (Exited)
5. Save: customer_churn_processed.csv, X_features.csv, y_target.csv

**Result**: All columns are numeric or boolean (the dummy columns are stored as `bool`), with no missing values, ready for modeling.
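The steps above can be sketched as a single reusable function. This is a minimal sketch, not the notebook's own code: the name `preprocess`, the constants, and the toy rows are assumptions made for illustration, and the toy frame carries only a subset of the real columns.

```python
from typing import Tuple

import pandas as pd

# Column roles taken from the step list; the constant names are hypothetical.
ID_COLUMNS = ["RowNumber", "CustomerId", "Surname"]
CATEGORICAL_COLUMNS = ["Geography", "Gender", "Card Type"]
TARGET = "Exited"


def preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Drop identifier columns, one-hot encode categoricals, split X and y."""
    df = df.drop(columns=ID_COLUMNS)
    df = pd.get_dummies(df, columns=CATEGORICAL_COLUMNS, drop_first=True)
    return df.drop(columns=TARGET), df[TARGET]


# Toy rows shaped like the real file (illustrative values, reduced columns).
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Card Type": ["DIAMOND", "GOLD", "SILVER", "PLATINUM"],
    "Exited": [1, 0, 1, 0],
})

X, y = preprocess(toy)
print(X.columns.tolist())
print(X.shape, y.shape)
```

Wrapping the steps in one function makes it easier to apply the same transformation to new data later, provided the new data contains the same category values.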

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('Customer-Churn-Records.csv')

In [3]:
df.head()

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


In [4]:
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

In [5]:
df.isna().sum()

CreditScore           0
Geography             0
Gender                0
Age                   0
Tenure                0
Balance               0
NumOfProducts         0
HasCrCard             0
IsActiveMember        0
EstimatedSalary       0
Exited                0
Complain              0
Satisfaction Score    0
Card Type             0
Point Earned          0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   CreditScore         10000 non-null  int64  
 1   Geography           10000 non-null  object 
 2   Gender              10000 non-null  object 
 3   Age                 10000 non-null  int64  
 4   Tenure              10000 non-null  int64  
 5   Balance             10000 non-null  float64
 6   NumOfProducts       10000 non-null  int64  
 7   HasCrCard           10000 non-null  int64  
 8   IsActiveMember      10000 non-null  int64  
 9   EstimatedSalary     10000 non-null  float64
 10  Exited              10000 non-null  int64  
 11  Complain            10000 non-null  int64  
 12  Satisfaction Score  10000 non-null  int64  
 13  Card Type           10000 non-null  object 
 14  Point Earned        10000 non-null  int64  
dtypes: float64(2), int64(10), object(3)
memory usage: 1.1+ MB

In [7]:
print("Geography:", df['Geography'].unique())
print("Gender:", df['Gender'].value_counts())
print("Card Type:", df['Card Type'].value_counts())


Geography: ['France' 'Spain' 'Germany']
Gender: Gender
Male      5457
Female    4543
Name: count, dtype: int64
Card Type: Card Type
DIAMOND     2507
GOLD        2502
SILVER      2496
PLATINUM    2495
Name: count, dtype: int64


In [8]:
df_processed = pd.get_dummies(df, 
                             columns=['Geography', 'Gender', 'Card Type'], 
                             drop_first=True)

In [9]:
print("Shape after encoding:", df_processed.shape)
print("\nDtypes after encoding:")
print(df_processed.dtypes)

Shape after encoding: (10000, 18)

Dtypes after encoding:
CreditScore             int64
Age                     int64
Tenure                  int64
Balance               float64
NumOfProducts           int64
HasCrCard               int64
IsActiveMember          int64
EstimatedSalary       float64
Exited                  int64
Complain                int64
Satisfaction Score      int64
Point Earned            int64
Geography_Germany        bool
Geography_Spain          bool
Gender_Male              bool
Card Type_GOLD           bool
Card Type_PLATINUM       bool
Card Type_SILVER         bool
dtype: object
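The dummy counts and the `bool` dtype shown above both follow from how `pd.get_dummies` works. The sketch below uses a toy frame (an assumption, not the real data) to show that `drop_first=True` keeps k - 1 columns for k categories, and that passing `dtype=int` is one way to get 0/1 integers if a downstream tool does not accept booleans.

```python
import pandas as pd

# Toy frame with the four card types seen in the dataset.
cards = pd.DataFrame({"Card Type": ["DIAMOND", "GOLD", "SILVER", "PLATINUM"]})

# drop_first=True keeps k - 1 columns for k categories; the dropped
# category (first in sorted order, here DIAMOND) becomes the all-zeros row.
as_default = pd.get_dummies(cards, columns=["Card Type"], drop_first=True)

# dtype=int stores the dummies as 0/1 integers instead of the default dtype.
as_int = pd.get_dummies(cards, columns=["Card Type"], drop_first=True, dtype=int)

print(as_default.dtypes)
print(as_int)
```

Whether to keep booleans or convert to integers is a design choice; most scikit-learn estimators accept either.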


In [10]:
X = df_processed.drop('Exited', axis=1)
y = df_processed['Exited']

In [11]:
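# Save the full processed table, plus features and target separately
df_processed.to_csv('customer_churn_processed.csv', index=False)
X.to_csv('X_features.csv', index=False)
y.to_csv('y_target.csv', index=False)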
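Before modeling, it can help to confirm that the saved features survive a CSV round trip with no missing values and no text columns. The following is a minimal sketch under stated assumptions: `check_model_ready` is a hypothetical helper, the data is a toy frame, and an in-memory buffer stands in for the files on disk.

```python
import io

import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def check_model_ready(X: pd.DataFrame, y: pd.Series) -> None:
    """Raise AssertionError if X or y is not ready for modeling."""
    assert len(X) == len(y), "X and y have different lengths"
    assert not X.isna().any().any(), "X contains missing values"
    assert not y.isna().any(), "y contains missing values"
    # Accept numeric and boolean columns; anything else (e.g. text) is flagged.
    bad = [c for c in X.columns
           if not (is_numeric_dtype(X[c]) or is_bool_dtype(X[c]))]
    assert not bad, f"non-numeric columns: {bad}"


# Toy stand-ins for the processed features and target.
X = pd.DataFrame({"CreditScore": [619, 608, 502],
                  "Gender_Male": [False, True, False]})
y = pd.Series([1, 0, 1], name="Exited")

# Round trip through CSV text, as the save step does with files on disk.
buffer = io.StringIO()
X.to_csv(buffer, index=False)
buffer.seek(0)
X_back = pd.read_csv(buffer)

check_model_ready(X_back, y)
print(X_back.dtypes)
```

Note that `y_target.csv` is written with a header row, so reading it back with `pd.read_csv` yields a one-column DataFrame and not a Series.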