# Decision Tree

This example shows how to use [scikit-learn](https://scikit-learn.org/stable/) to train a Decision Tree model on the Titanic dataset. The data is preprocessed first to improve the model's accuracy. For a more detailed explanation of what a Decision Tree is, see [Decision Tree](../document/decision_tree.md).

## Imports

In [None]:
import polars as pl
from polars import LazyFrame, Expr

## Process Data
Apply the same processing to the training and testing data.

In [None]:
train_data: LazyFrame = pl.scan_csv("data/train.csv", has_header=True)
test_data: LazyFrame = pl.scan_csv("data/test.csv", has_header=True)

In [None]:
def encode_title(title: str) -> int:
    """Map a passenger title to an integer code; unlisted titles map to 8."""
    title_map: dict[str, int] = {
        "Capt": 0,
        "Col": 0,
        #"Countess": 1,
        "Don": 1,
        "Dona": 1,
        "Dr": 2,
        "Jonkheer": 1,
        "Lady": 1,
        "Major": 0,
        "Master": 3,
        "Miss": 4,
        "Mme": 5,
        "Mr": 6,
        "Mrs": 5,
        "Ms": 4,
        "Rev": 7,
        "Sir": 1,
    }
    # Fall back to 8 for any title that is not in the mapping.
    return title_map.get(title, 8)


### Test Encode Title

In [None]:
encoded_titles = [encode_title(name) for name in ["Mr", "Mrs", "Miss", "Master", "Dr", "Rev", "Col", "Sir", "Lady", "Countess", "Jonkheer", "Dona", "Mme", "Capt", "Major", "Don"]]
print(encoded_titles)

In [None]:
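# Names look like "Surname, Title. Given names"; keep the text between ", " and the first ".".
title: Expr = pl.col("Name").str.split(", ").list.get(1).str.split(".").list.get(0)
# The split above already drops the trailing period, so match word characters only.
x = title.str.extract(r"^(\w+)")
x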
    
# Goal: the training and test data both contain names with titles. The titles are mostly,
# but not exactly, the same across the two datasets, so the title expression has to be
# mapped to an integer code that also handles titles missing from the mapping.

In [None]:
train_features: LazyFrame = train_data.select(
    sku=pl.col("Pclass").rank(method="dense"),
    family_size=pl.col("SibSp") + pl.col("Parch") + 1,
    embarked=pl.col("Embarked").rank(method="dense"),
    # Apply encode_title to each title string produced by the `title` expression.
    title=title.map_elements(encode_title, return_dtype=pl.Int64),
)

In [None]:
train_features.collect().sample(5)
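
Once the features are collected, they can be handed to scikit-learn to train the Decision Tree mentioned in the introduction. The sketch below only illustrates the shape of that step: the rows and labels are invented, and `max_depth=3` and `random_state=0` are arbitrary choices rather than tuned values.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up rows in the column order [sku, family_size, embarked, title].
X = [
    [3, 2, 3, 6],
    [1, 2, 1, 5],
    [3, 1, 3, 4],
    [1, 2, 3, 5],
    [3, 1, 3, 6],
    [2, 1, 2, 6],
]
# Invented labels: 1 = survived, 0 = did not survive.
y = [0, 1, 1, 1, 0, 0]

# A shallow tree with a fixed seed so repeated runs build the same tree.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

With the real data, `X` would presumably come from `train_features.collect().to_numpy()` and `y` from the label column of `train_data`, assuming it is named `Survived` as in the Kaggle Titanic files. Rows with missing values may need to be filled or dropped first.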

----
Go back to [index](_index.ipynb).