## Training a Decision Tree Classifier
### Overview
- Train a `DecisionTreeClassifier` on the label-encoded bank marketing data (`bank-additional-full.csv`).
- Fit on `X_train` and `y_train` (80% of the rows), then predict on `X_test` (the remaining 20%).

In [8]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('../data/bank-additional-full.csv', sep=';')

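# Encode categorical columns as integers.
# LabelEncoder assigns codes in sorted label order; this also encodes the target 'y'.
categorical_cols = data.select_dtypes(include=['object']).columns
label_encoders = {}  # keep each fitted encoder so codes can be decoded later
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])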

# Separate features and target
X = data.drop('y', axis=1)
y = data['y']

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify shapes
print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Test target shape:", y_test.shape)

Training features shape: (32950, 20)
Test features shape: (8238, 20)
Training target shape: (32950,)
Test target shape: (8238,)
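
The encoding step above turns every string column, including the target, into integer codes. As a minimal sketch of how `LabelEncoder` picks those codes, the example below uses a small hypothetical list (`sample_y`) in place of the real `data['y']` column. It shows that labels are numbered in sorted order, which is why `no` becomes 0 and `yes` becomes 1 later in this section.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the target column; the real values come from data['y'].
sample_y = ["no", "yes", "no", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_y)

# classes_ is sorted, so the index of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print("Mapping:", mapping)
print("Encoded:", encoded)

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(encoded)
print("Decoded:", list(decoded))
```

In the notebook itself, the same lookup should work with the fitted encoders, for example `label_encoders['y'].classes_`.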


### Initializing the Decision Tree Classifier
- Import `DecisionTreeClassifier` from scikit-learn.
- Initialize with `random_state=42` for reproducibility.

In [9]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Log a confirmation message (static text; it does not inspect the estimator)
print("Decision Tree Classifier initialized with random_state=42")

Decision Tree Classifier initialized with random_state=42
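
Because only `random_state` is set, every other hyperparameter keeps its default. The sketch below lists a few of them with `get_params()`; the selection of parameter names is an illustrative choice, and defaults may differ between scikit-learn versions. Note that `max_depth=None` means the tree keeps splitting until its leaves are pure or too small to split.

```python
from sklearn.tree import DecisionTreeClassifier

# Same construction as in the notebook cell above.
clf = DecisionTreeClassifier(random_state=42)

# get_params() returns the constructor arguments as a dict.
params = clf.get_params()
for name in ("criterion", "max_depth", "min_samples_split",
             "min_samples_leaf", "random_state"):
    print(f"{name}: {params[name]}")
```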


### Training the Model
- Fit the classifier on `X_train` and `y_train`.

In [10]:
# Train the model
dt_classifier.fit(X_train, y_train)

# fit() raises on failure, so reaching this line means training completed
print("Model trained successfully")

Model trained successfully
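
A fitted tree can be inspected to see how large it grew. The sketch below is not run on the bank data: it assumes a synthetic dataset from `make_classification` (with 20 features, to mirror the shapes above) as a stand-in for `X_train` and `y_train`. It illustrates that a tree with no depth limit usually fits its training data almost perfectly, which is a reason to judge it on held-out data rather than on training accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (20 features, like the notebook's data).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Size of the fitted tree and its accuracy on the data it was trained on.
depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
train_acc = tree.score(X_demo, y_demo)
print("Depth:", depth)
print("Leaves:", n_leaves)
print("Training accuracy:", train_acc)
```

The same three calls should apply to `dt_classifier` once it has been fitted.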


### Predicting on Test Data
- Use the trained model to predict `y_pred` for `X_test`.

In [11]:
# Predict on test data
y_pred = dt_classifier.predict(X_test)

# Verify predictions
print("First 10 predictions:", y_pred[:10])
print("Shape of predictions:", y_pred.shape)

First 10 predictions: [0 0 0 0 0 0 0 0 0 0]
Shape of predictions: (8238,)


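### Inspecting the Fitted Model
- Check `n_features_in_` and `classes_` to confirm the model saw 20 features and two classes.

In [12]: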
# Check model attributes
print("Number of features:", dt_classifier.n_features_in_)
print("Classes:", dt_classifier.classes_)

Number of features: 20
Classes: [0 1]


### Checking the Prediction Distribution
- Count how many test rows were predicted as each class.

In [13]:
# Check distribution of predictions
import numpy as np
print("Prediction counts (0 = no, 1 = yes):")
print(np.bincount(y_pred))

Prediction counts (0 = no, 1 = yes):
[7300  938]
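
One caveat with `np.bincount` is that its result is only as long as the largest value it sees. The sketch below uses a hypothetical array (`y_pred_demo`) for a model that never predicts class 1, and shows how `minlength=2` keeps one entry per class. It also converts the counts to shares, which are easier to compare across datasets of different sizes.

```python
import numpy as np

# Hypothetical stand-in for y_pred: a model that never predicts class 1.
y_pred_demo = np.array([0, 0, 0, 0])

# Without minlength, the result has a single entry, so the class-1 count is missing.
counts_short = np.bincount(y_pred_demo)

# minlength=2 guarantees one entry per class (0 = no, 1 = yes).
counts = np.bincount(y_pred_demo, minlength=2)

# Convert counts to shares of all predictions.
shares = counts / counts.sum()

print("Without minlength:", counts_short)
print("With minlength=2:", counts)
print("Shares:", shares)
```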
