## Data Preprocessing
### Checking for Missing Values
- Inspect every column of the dataset for missing (`NaN`) values.
- Dataset: `bank-additional-full.csv` (semicolon-delimited).

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('../data/bank-additional-full.csv', sep=';')

In [2]:
# Check for missing values
print("Missing values per column:")
print(data.isnull().sum())

Missing values per column:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
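
A zero `NaN` count does not by itself mean the data is complete. In the UCI Bank Marketing data, missing categorical values are recorded as the string `unknown`, which `isnull()` does not flag. The following is a minimal sketch of counting such placeholders. It uses a small made-up frame in place of the real file, and the `count_placeholder` helper and sample rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_placeholder(df, placeholder="unknown"):
    """Count cells equal to `placeholder` in each column (hypothetical helper)."""
    return df.eq(placeholder).sum()

# Toy rows standing in for bank-additional-full.csv (values are made up)
sample = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "unknown", "services"],
    "default": ["no", "unknown", "unknown"],
})

unknown_counts = count_placeholder(sample)
print(unknown_counts)
```

On the real data, the same call would be `count_placeholder(data)`, run before the encoding step, since encoding replaces the strings with integers.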


### Converting Categorical Features to Numeric
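- Use `LabelEncoder` to convert each categorical (`object` dtype) column to integer codes, keeping one fitted encoder per column so the codes can be mapped back to labels.
- Categorical columns: `job`, `marital`, `education`, `default`, `housing`, `loan`, `contact`, `month`, `day_of_week`, `poutcome`, and the target `y`.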

In [3]:
# Identify categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols)

Categorical columns: Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'poutcome', 'y'],
      dtype='object')


In [4]:
from sklearn.preprocessing import LabelEncoder

# Initialize dictionary to store encoders
label_encoders = {}

# Encode each categorical column
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])

# Verify encoding
print("First 5 rows after encoding:")
print(data.head())

First 5 rows after encoding:
   age  job  marital  education  default  housing  loan  contact  month  \
0   56    3        1          0        0        0     0        1      6   
1   57    7        1          3        1        0     0        1      6   
2   37    7        1          3        0        2     0        1      6   
3   40    0        1          1        0        0     0        1      6   
4   56    7        1          3        0        0     2        1      6   

   day_of_week  ...  campaign  pdays  previous  poutcome  emp.var.rate  \
0            1  ...         1    999         0         1           1.1   
1            1  ...         1    999         0         1           1.1   
2            1  ...         1    999         0         1           1.1   
3            1  ...         1    999         0         1           1.1   
4            1  ...         1    999         0         1           1.1   

   cons.price.idx  cons.conf.idx  euribor3m  nr.employed  y  
0          93
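
The integer codes are easier to interpret with the label-to-code mapping at hand. `LabelEncoder` stores the sorted labels in `classes_`, and each label's position in that array is its code. This sketch uses a made-up list in place of `data['marital']`. On the real data, the fitted encoders in `label_encoders` can be inspected the same way.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for data['marital'] (values are illustrative)
marital = ["married", "single", "divorced", "married", "unknown"]

encoder = LabelEncoder()
codes = encoder.fit_transform(marital)

# classes_ is sorted; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```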

In [5]:
# Check data types after encoding
print("Data types after encoding:")
print(data.dtypes)

Data types after encoding:
age                 int64
job                 int64
marital             int64
education           int64
default             int64
housing             int64
loan                int64
contact             int64
month               int64
day_of_week         int64
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome            int64
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                   int64
dtype: object
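
Integer codes impose an order (for example `divorced` < `married` < `single`) that nominal categories do not have. Tree-based models are usually unaffected by this, but linear or distance-based models may be. One common alternative, shown here only as a sketch and not as the method this notebook uses, is one-hot encoding with `pd.get_dummies`. The sample frame is made up.

```python
import pandas as pd

# Toy frame standing in for a slice of the dataset (values are made up)
sample = pd.DataFrame({
    "age": [56, 37, 40],
    "contact": ["telephone", "cellular", "telephone"],
})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(sample, columns=["contact"], dtype=int)
print(encoded)
```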


### Separating Features and Target
- Features (`X`): all columns except `y`.
- Target (`y`): the encoded `y` column (0 = no, 1 = yes).

In [6]:
# Separate features and target
X = data.drop('y', axis=1)  # Features
y = data['y']               # Target

# Verify shapes
print("Features shape:", X.shape)
print("Target shape:", y.shape)

Features shape: (41188, 20)
Target shape: (41188,)
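
The shapes confirm the row and column counts but say nothing about how the two classes are distributed. A quick check of class proportions, sketched here on a made-up target series, helps decide whether the split and the evaluation metric need to account for imbalance.

```python
import pandas as pd

# Toy target standing in for data['y'] after encoding (values are made up)
y_sample = pd.Series([0, 0, 0, 0, 1], name="y")

# normalize=True returns fractions instead of raw counts
class_share = y_sample.value_counts(normalize=True)
print(class_share)
```

On the real data, the equivalent call would be `y.value_counts(normalize=True)`.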


### Splitting Data into Training and Test Sets
- Split: 80% training, 20% test (`test_size=0.2`).
- Use `train_test_split` from scikit-learn, with `random_state=42` so the split is reproducible.

In [8]:
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify shapes
print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Test target shape:", y_test.shape)

Training features shape: (32950, 20)
Test features shape: (8238, 20)
Training target shape: (32950,)
Test target shape: (8238,)
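
The split above is purely random, so the share of positive cases can differ between the training and test sets. If the target is imbalanced, passing `stratify` keeps the class proportions equal in both parts. This sketch uses synthetic arrays (100 rows, 10% positives, all made up) and is a variation on the split above, not the one this notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positives (made up for illustration)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share (train):", y_tr.mean())
print("Positive share (test):", y_te.mean())
```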


In [9]:
# Verify all features are numeric
print("X_train data types:")
print(X_train.dtypes)

X_train data types:
age                 int64
job                 int64
marital             int64
education           int64
default             int64
housing             int64
loan                int64
contact             int64
month               int64
day_of_week         int64
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome            int64
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
dtype: object
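
Reading the dtype listing by eye works for 20 columns, but the same check can be done programmatically. This sketch uses `pandas.api.types.is_numeric_dtype` on a made-up frame that deliberately includes one text column, and lists any column that is not numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one text column left in on purpose (values are made up)
X_demo = pd.DataFrame({
    "age": [56, 57],
    "euribor3m": [4.857, 4.857],
    "job": ["admin.", "services"],
})

# Collect the names of columns that are not numeric
non_numeric = [col for col in X_demo.columns if not is_numeric_dtype(X_demo[col])]
print(non_numeric)
```

Applied to `X_train`, an empty list would mean every feature is numeric.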


In [10]:
# Check unique values in y_train
print("Unique values in y_train:", y_train.unique())

Unique values in y_train: [0 1]
