# c50py Tour

This notebook demonstrates the features of `c50py`, a Python implementation of the C5.0 decision tree algorithm.

In [None]:
!pip install c50py

## 1. Classification (Titanic)
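We'll start by training a classifier that predicts `Survived` from seven passenger features in the Titanic dataset. `Pclass`, `Sex`, and `Embarked` are declared as categorical; the remaining features are treated as numeric.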

In [None]:
import pandas as pd
import numpy as np
from c50py import C5Classifier

# Load the Titanic passenger data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# Drop the rows that have no recorded port of embarkation
df = df.dropna(subset=["Embarked"])

# Select features and target; mixing string and numeric columns
# makes .values return an object-dtype array
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = df[features].values
y = df["Survived"].values

# Initialize and fit
clf = C5Classifier(
    min_samples_split=20,
    categorical_features=["Pclass", "Sex", "Embarked"]
)
clf.fit(X, y, feature_names=features)

# Print the tree
clf.print_tree()
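
To give a feel for how a tree like the one printed above chooses its splits, here is a minimal sketch of the gain ratio criterion that C4.5/C5.0-style learners are usually described as using. It is an illustration under that assumption, not `c50py`'s internal code; the helper names `entropy` and `gain_ratio` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` on the categorical feature `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain: parent entropy minus the size-weighted child entropy.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes features that shatter the data into many branches.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Toy data (hypothetical, not taken from the Titanic file).
sex = ["male", "male", "female", "female"]
survived = [0, 0, 1, 1]
print(gain_ratio(sex, survived))  # a feature that separates the classes perfectly
```

A feature that carries no information about the label, such as `["a", "b", "a", "b"]` against the same `survived` list, scores 0 under this sketch.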

## 2. Regression (Diabetes)
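Next, fit a regression tree to scikit-learn's diabetes dataset. The score printed at the end is computed on the same data the model was trained on, so it measures fit rather than generalization.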

In [None]:
from sklearn.datasets import load_diabetes
from c50py import C5Regressor

data = load_diabetes()
X_reg, y_reg = data.data, data.target
features_reg = data.feature_names

reg = C5Regressor(min_samples_split=10, pruning=True)
reg.fit(X_reg, y_reg, feature_names=features_reg)

print(f"R^2 Score: {reg.score(X_reg, y_reg):.4f}")
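
For reference, the coefficient of determination can be computed by hand as below. This assumes `reg.score` follows the scikit-learn regressor convention of returning R^2, which the print label above suggests but the notebook does not confirm; `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy targets (hypothetical values).
y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))      # perfect predictions
print(r_squared(y_true, [2.5] * 4))   # always predicting the mean
```

Perfect predictions score 1 and a constant mean prediction scores 0, which is why a high value on the training data alone says little about performance on unseen data.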

## 3. Advanced Features
### Sample Weights
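Per-sample weights can be passed to `fit` through `sample_weight`. Here every survivor (`y == 1`) gets weight 5.0 and every other passenger keeps weight 1.0, so the positive class counts five times as heavily during training.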

In [None]:
# Start from uniform weights, then upweight the positive class
weights = np.ones(len(y))
weights[y == 1] = 5.0

clf_weighted = C5Classifier(
    min_samples_split=20,
    categorical_features=["Pclass", "Sex", "Embarked"]
)
clf_weighted.fit(X, y, sample_weight=weights, feature_names=features)
clf_weighted.print_tree()
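
The effect of such weights can be seen on a single leaf. The sketch below assumes that a weighted tree replaces case counts with sums of weights, which is the usual approach but is not confirmed here for `c50py`; the helper names and the four toy labels are hypothetical.

```python
def weighted_class_totals(labels, weights):
    """Sum the weight carried by each class."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return totals

def weighted_majority(labels, weights):
    """Class with the largest total weight, i.e. what a leaf would predict."""
    totals = weighted_class_totals(labels, weights)
    return max(totals, key=totals.get)

# Toy leaf (hypothetical): three negatives and one positive.
labels = [0, 0, 0, 1]
uniform = [1.0] * len(labels)
upweighted = [5.0 if label == 1 else 1.0 for label in labels]

print(weighted_majority(labels, uniform))     # 0: negatives win 3.0 to 1.0
print(weighted_majority(labels, upweighted))  # 1: the positive wins 5.0 to 3.0
```

With a 5x weight, one positive case outweighs three negatives, so the weighted tree can be expected to predict the positive class more readily than the unweighted one.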

### Missing Values
The fitted classifier can also predict for an instance with a missing value, encoded here as `np.nan`.

In [None]:
# Create a test instance with missing Age (index 2)
# Pclass=3, Sex=male, Age=NaN, SibSp=0, Parch=0, Fare=7.25, Embarked=S
x_miss = [[3, "male", np.nan, 0, 0, 7.25, "S"]]
pred = clf.predict(x_miss)
print(f"Prediction for missing age: {pred[0]}")
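
One common way C4.5/C5.0-style trees handle a missing split value is to send the case down every branch and blend the results, weighting each branch by its share of the training cases. Whether `c50py` does exactly this is an assumption; the sketch below uses a hypothetical one-split tree with made-up shares and leaf probabilities, and marks a missing value with `None` rather than `np.nan`.

```python
def class_distribution(node, x):
    """Return {class: probability} for instance `x`, a dict of feature values."""
    if "leaf" in node:
        return node["leaf"]
    value = x.get(node["feature"])
    if value is None:
        # Missing value: blend all branches, weighted by their training share.
        combined = {}
        for share, child in node["branches"].values():
            for cls, prob in class_distribution(child, x).items():
                combined[cls] = combined.get(cls, 0.0) + share * prob
        return combined
    _, child = node["branches"][value]
    return class_distribution(child, x)

# Hypothetical tree: each branch stores (share of training cases, subtree).
tree = {
    "feature": "age_group",
    "branches": {
        "child": (0.25, {"leaf": {0: 0.4, 1: 0.6}}),
        "adult": (0.75, {"leaf": {0: 0.8, 1: 0.2}}),
    },
}

dist = class_distribution(tree, {"age_group": None})
print(dist)                     # blended distribution, roughly {0: 0.7, 1: 0.3}
print(max(dist, key=dist.get))  # predicted class
```

When the value is known, the same function follows a single branch, so `{"age_group": "child"}` returns that leaf's distribution unchanged.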

### Boosting
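Passing `trials=10` trains a boosted ensemble of trees instead of a single tree. The accuracy printed below is measured on the training data.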

In [None]:
boosted_clf = C5Classifier(trials=10, categorical_features=["Pclass", "Sex", "Embarked"])
boosted_clf.fit(X, y, feature_names=features)
print(f"Boosted Accuracy: {boosted_clf.score(X, y):.4f}")
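
Boosting in C5.0 is generally described as similar to AdaBoost: after each trial, misclassified cases gain weight so the next tree concentrates on them. The sketch below shows one AdaBoost-style reweighting round as an illustration of that idea. It is not `c50py`'s actual update rule, and `reweight` is a hypothetical helper.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: return (normalized new weights, tree vote alpha)."""
    total = sum(weights)
    # Weighted error rate of the tree that was just trained.
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if not 0 < err < 0.5:
        raise ValueError("this sketch needs a weighted error strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink the weight of correctly classified cases, grow the rest.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha

# Toy round (hypothetical): four equally weighted cases, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, alpha = reweight(weights, correct)
print(new_weights)  # the misclassified case now carries half of the total weight
```

After normalization the misclassified cases always hold half of the total weight under this rule, which forces the next tree to do better than chance on exactly the cases the previous one got wrong.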