# Visualize Data

This notebook uses [Polars](https://polars.rs) to load, clean, and analyze the Kaggle [Titanic](https://www.kaggle.com/c/titanic/data) dataset, and [seaborn](https://seaborn.pydata.org/) to visualize it.

## Imports

In [None]:
import seaborn as sns
import polars as pl
from polars import DataFrame, Expr, Series

## Exploratory Data Analysis

In [None]:
train_data: DataFrame = pl.read_csv("data/train.csv")

In [None]:
train_data.sample(10)

In [None]:
train_data.describe()

## Impute Null `Age` Values

### Compute Mean `Age` per Title

In [None]:
# literal=True matches the title text exactly; as a regex, "Mr." would also match "Mrs".
mean_miss_age: float = train_data.filter(pl.col("Name").str.contains("Miss.", literal=True)).get_column("Age").mean()
mean_master_age: float = train_data.filter(pl.col("Name").str.contains("Master.", literal=True)).get_column("Age").mean()
mean_mrs_age: float = train_data.filter(pl.col("Name").str.contains("Mrs.", literal=True)).get_column("Age").mean()
mean_mr_age: float = train_data.filter(pl.col("Name").str.contains("Mr.", literal=True)).get_column("Age").mean()

print(f"{mean_miss_age=}")
print(f"{mean_master_age=}")
print(f"{mean_mrs_age=}")
print(f"{mean_mr_age=}")

### Define Null `Age` Filters

In [None]:
# literal=True keeps "Mr." from also matching "Mrs." rows.
null_age_miss_filter: Expr = pl.col("Name").str.contains("Miss.", literal=True) & pl.col("Age").is_null()
null_age_master_filter: Expr = pl.col("Name").str.contains("Master.", literal=True) & pl.col("Age").is_null()
null_age_mrs_filter: Expr = pl.col("Name").str.contains("Mrs.", literal=True) & pl.col("Age").is_null()
null_age_mr_filter: Expr = pl.col("Name").str.contains("Mr.", literal=True) & pl.col("Age").is_null()

### Update Null `Age` Values

In [None]:
updated_null_age_miss_train_data: DataFrame = train_data.with_columns(
    pl.when(null_age_miss_filter).then(mean_miss_age).otherwise(pl.col("Age")).alias("Age")
)
updated_null_age_master_train_data: DataFrame = updated_null_age_miss_train_data.with_columns(
    pl.when(null_age_master_filter).then(mean_master_age).otherwise(pl.col("Age")).alias("Age")
)
updated_null_age_mrs_train_data: DataFrame = updated_null_age_master_train_data.with_columns(
    pl.when(null_age_mrs_filter).then(mean_mrs_age).otherwise(pl.col("Age")).alias("Age")
)
updated_null_age_mr_train_data: DataFrame = updated_null_age_mrs_train_data.with_columns(
    pl.when(null_age_mr_filter).then(mean_mr_age).otherwise(pl.col("Age")).alias("Age")
)
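# Passengers titled "Dr." with a missing age get a fixed fallback age of 40.0.
null_age_dr_filter: Expr = pl.col("Name").str.contains("Dr.", literal=True) & pl.col("Age").is_null()
updated_null_age_dr_train_data: DataFrame = updated_null_age_mr_train_data.with_columns(
    pl.when(null_age_dr_filter).then(40.0).otherwise(pl.col("Age")).alias("Age")
)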

updated_null_age_dr_train_data.describe()

### Create Training `DataFrame`

In [None]:
passenger_class: Series = train_data["Pclass"].rank("dense").cast(pl.Int64) - 1
# Start from the title-imputed ages; any nulls that remain fall back to the overall mean.
passenger_age: Series = updated_null_age_dr_train_data["Age"].fill_null(strategy="mean").round(1)
passenger_adult: Series = (passenger_age > 16).cast(pl.Int64)
passenger_gender: Series = train_data["Sex"].rank("dense").cast(pl.Int64) - 1
passenger_family: Series = train_data["SibSp"] + train_data["Parch"]
# passenger_fare: Series = train_data['Fare'].fill_null(strategy='mean')
passenger_port: Series = train_data["Embarked"].fill_null(strategy="forward")
passenger_embarked: Series = passenger_port.rank("dense").cast(pl.Int64) - 1
passenger_is_child: Series = (passenger_age < 16).cast(pl.Int64)
passenger_is_alone: Series = (passenger_family == 0).cast(pl.Int64)
passenger_survived: Series = train_data["Survived"]

processed_data: DataFrame = DataFrame(
    {
        "class": passenger_class,
        "age": passenger_age,
        "adult": passenger_adult,
        "gender": passenger_gender,
        "family": passenger_family,
        # 'fare': passenger_fare,
        "embarked": passenger_embarked,
        "is_child": passenger_is_child,
        "is_alone": passenger_is_alone,
        "survived": passenger_survived,
    }
)

processed_data.describe()

In [None]:
processed_data.sample(10)

### Save Processed Data

In [None]:
processed_data.write_csv("data/processed_train.csv")
processed_data.write_parquet("data/processed_train.parquet")

## Naive Visualization of the Training Data

In [None]:
sns.countplot(data=processed_data, x="gender", hue="survived", palette="BuPu_d")

In [None]:
sns.countplot(data=processed_data, x="is_child", hue="survived", palette="BuPu_d")

In [None]:
sns.countplot(data=processed_data, x="class", hue="survived", palette="BuPu_d")

In [None]:
sns.kdeplot(data=processed_data, x="age", palette="BuPu_d", hue="survived")

In [None]:
sns.kdeplot(data=processed_data, x="age", hue="survived", fill=True, palette="BuPu_d")

In [None]:
fg = sns.FacetGrid(data=processed_data, hue="gender", aspect=3, palette="BuPu_d")
fg.map(sns.kdeplot, "age", fill=True)
fg.add_legend()
fg.set(xlim=(0, processed_data["age"].max()))

In [None]:
fg = sns.FacetGrid(data=processed_data, hue="class", aspect=3, palette="BuPu_d")
fg.map(sns.kdeplot, "age", fill=True)
fg.add_legend()
fg.set(xlim=(0, processed_data["age"].max()))

In [None]:
sns.countplot(data=processed_data, x="family", hue="survived", palette="BuPu_d")

In [None]:
sns.countplot(data=processed_data, x="is_alone", hue="survived", palette="BuPu_d")

In [None]:
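# Polars' corr() result carries no row labels, so pass the column names to label both axes.
sns.heatmap(data=processed_data.corr().to_numpy(), annot=True, cmap="BuPu", xticklabels=processed_data.columns, yticklabels=processed_data.columns)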

## Next Steps

+ Implement a [Naive Decision Tree](naive-decision-tree.ipynb) classifier.
+ Implement a [Decision Tree](decision-tree.ipynb) classifier.
+ Implement a [Random Forest](random-forest.ipynb) classifier.
+ Implement a [Gradient Boosting](gradient-boosting.ipynb) classifier.
+ Implement a [Neural Network](neural-network.ipynb) classifier.

----
Go back to [index](_index.ipynb).