# Naive Decision Tree

This is a simple implementation of a decision tree using the [scikit-learn](https://scikit-learn.org/stable/index.html) library. The goal is to show how to train a decision tree on minimally processed data. In this case, only the "Sex" and "Embarked" columns are converted to numerical values.


## Imports

In [None]:
import datetime

import joblib
import polars as pl

from matplotlib.text import Annotation
from numpy import ndarray
from polars import DataFrame
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## Load Data


In [None]:
train_data: DataFrame = pl.read_csv("../data/train.csv")
test_data: DataFrame = pl.read_csv("../data/test.csv")

## Encode Text Data


In [None]:
train_data: DataFrame = train_data.with_columns(
    [
        pl.col("Sex").rank("dense").cast(pl.UInt8).alias("Sex"),
        pl.col("Embarked").rank("dense").cast(pl.UInt8).alias("Embarked"),
    ]
)
test_data: DataFrame = test_data.with_columns(
    [
        pl.col("Sex").rank("dense").cast(pl.UInt8).alias("Sex"),
        pl.col("Embarked").rank("dense").cast(pl.UInt8).alias("Embarked"),
    ]
)
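
The cells above encode each frame on its own, so the integer assigned to a category depends on which distinct values that frame contains. The plain-Python sketch below approximates dense ranking to illustrate the point. It is not Polars' implementation, the sample values are hypothetical, and it assumes nulls are left unranked.

```python
def dense_rank(values):
    """Map each non-null value to its 1-based rank among the sorted distinct values."""
    distinct = sorted({v for v in values if v is not None})
    codes = {v: i for i, v in enumerate(distinct, start=1)}
    # Assumption: missing values stay missing instead of receiving a rank.
    return [codes[v] if v is not None else None for v in values]


# Hypothetical sample values, not taken from the dataset files.
train_embarked = ["S", "C", "Q", "S"]
test_embarked = ["S", "Q", "S"]  # no "C" in this hypothetical test sample

print(dense_rank(train_embarked))  # [3, 1, 2, 3]
print(dense_rank(test_embarked))   # [2, 1, 2]
```

Here "S" becomes 3 in one list and 2 in the other. If both frames contain the same set of categories the codes agree; otherwise a shared, explicit mapping would be the safer choice.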

## Display Encoded Data


In [None]:
train_data.sample(5)

In [None]:
test_data.sample(5)

## Populate Missing Data


In [None]:
train_data: DataFrame = train_data.fill_null(strategy="forward")
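
The forward strategy copies the most recent non-null value downward within each column. A minimal plain-Python sketch of that idea, using hypothetical values, looks like this:

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen before it."""
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


ages = [None, 22.0, None, None, 35.0, None]  # hypothetical values
print(forward_fill(ages))  # [None, 22.0, 22.0, 22.0, 35.0, 35.0]
```

Two consequences are worth keeping in mind: a null in the first row has nothing to copy from and stays null, and the filled value depends only on row order, not on any similarity between rows.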

## Extract Features


In [None]:
features: list[str] = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X: DataFrame = train_data.select(features)
y: DataFrame = train_data.select("Survived")

## Display Sample Features


In [None]:
X.sample(5)

In [None]:
y.sample(5)

## Split Data into Training and Testing Sets


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
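
With `test_size=0.2`, roughly one fifth of the rows are held out for evaluation, and `random_state=42` makes the shuffle repeatable. The sketch below mimics that idea with the standard library only. The helper `split_indices` is hypothetical, its shuffle order will not match scikit-learn's, and rounding the test share up reflects my understanding of scikit-learn's behaviour rather than a guarantee.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the result is repeatable
    n_test = math.ceil(test_size * n_samples)  # assumption: test share rounds up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```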

## Train Decision Tree

In [None]:
dtclf = DecisionTreeClassifier()
ndtclf: DecisionTreeClassifier = dtclf.fit(X_train, y_train)
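
`DecisionTreeClassifier()` is created here with default settings, which means splits are scored with the Gini criterion. As a rough illustration of what that score measures (a sketch, not scikit-learn's internal code), the Gini impurity of a set of labels can be computed as follows:

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity([0, 0, 0, 0]))  # 0.0
print(gini_impurity([0, 1, 0, 1]))  # 0.5
```

A node containing a single class scores 0, and an evenly mixed two-class node scores 0.5; the tree prefers splits that lower this value in the child nodes.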

## Save the Model

In [None]:
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
joblib.dump(ndtclf, f"../res/naive-decision-tree.{now}.model")

## Evaluate Model


In [None]:
y_pred: ndarray = ndtclf.predict(X_test)
accuracy: float = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
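
The accuracy reported above is the fraction of test rows whose predicted label equals the true label. A small plain-Python sketch with hypothetical labels makes the arithmetic explicit:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction matches the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels, not results from this notebook.
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1]
print(accuracy(y_true_demo, y_pred_demo))  # 0.6
```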

## Display the Decision Tree


In [None]:
annotations: list[Annotation] = tree.plot_tree(
    ndtclf, filled=True, rounded=True, proportion=True, feature_names=features
)

## (Optional) Export the Decision Tree to a Graphviz File



In [None]:
# With out_file set, export_graphviz writes the file and returns None.
tree.export_graphviz(
    ndtclf,
    out_file="../res/naive-decision-tree.dot",
    feature_names=features,
    filled=True,
    rounded=True,
    special_characters=True,
)

----
Go back to [index](_index.ipynb).