# CatBoost Modeling
In this notebook I'll be using [CatBoost](https://catboost.ai/en/docs/). CatBoost handles much of the preprocessing internally (it encodes categorical features itself and, being tree-based, needs no feature scaling), so I'm going to import the raw data again and then create a model using the training data.

## Preprocessing
This preprocessing will be the same as in the Sklearn notebooks, but we won't be scaling or one-hot encoding. CatBoost encodes categorical features internally, and its tree-based models don't need scaled inputs.

In [100]:
# Import statements
from catboost import CatBoostClassifier, metrics, Pool, cv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [73]:
# Load the training and test features into dataframes
X_train = pd.read_csv('../Data/Training_Features.csv')
X_train.drop(columns=['date_recorded', 'permit', 'public_meeting'], inplace=True)
X_test = pd.read_csv('../Data/Test_Features.csv')

y_train = pd.read_csv('../Data/Training_Labels.csv')
y_train = y_train['status_group']

In [74]:
# Encode the target classes as integers (the codes are arbitrary class labels, not an ordering).
y_train = y_train.map({'functional': 1, 'non functional': 0, 'functional needs repair': 2})
y_train.value_counts()

1    32259
0    22824
2     4317
Name: status_group, dtype: int64
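
Since the model will predict these integer codes, it helps to keep the mapping around so predictions can be turned back into the original label strings. The sketch below shows the round trip on a toy series; the names `label_to_code`, `code_to_label`, and `toy_labels` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical helper names; the codes match the mapping used in this notebook.
label_to_code = {"functional": 1, "non functional": 0, "functional needs repair": 2}
# Invert the dictionary so integer codes can be decoded back into strings.
code_to_label = {code: label for label, code in label_to_code.items()}

# Toy labels standing in for the real status_group column.
toy_labels = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"]
)

encoded = toy_labels.map(label_to_code)  # strings -> integer codes
decoded = encoded.map(code_to_label)     # integer codes -> strings
```

One thing to watch with `map`: any label missing from the dictionary becomes `NaN`, so a quick `encoded.isna().sum()` check is a cheap safeguard.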

In [92]:
# Drop columns with NaN values
X_train.dropna(axis=1, inplace=True)
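
Dropping columns from the training frame only means the test frame still carries columns the model never saw. One possible way to keep the two aligned is sketched below on toy frames: record which columns contain NaN values, drop them from the training data, then select the same columns from the test data. The frames and column names here are invented for illustration.

```python
import pandas as pd

# Toy frames; the column names are made up and do not come from the dataset.
toy_train = pd.DataFrame(
    {"id": [1, 2, 3], "region": ["a", None, "c"], "height": [10, 20, 30]}
)
toy_test = pd.DataFrame(
    {"id": [4, 5], "region": ["d", "e"], "height": [40, 50]}
)

# Columns of the training frame that contain at least one NaN.
nan_cols = toy_train.columns[toy_train.isna().any()].tolist()

# Drop them from the training frame.
toy_train = toy_train.drop(columns=nan_cols)

# Keep exactly the training columns in the test frame, in the same order.
toy_test = toy_test[toy_train.columns]
```

Listing `nan_cols` before dropping also makes it easy to see how much information is being discarded.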

In [93]:
# make the lists of columns
# num = any columns with numerical value
num = []
# obj = any columns with object value
obj = []
for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64']:
        num.append(c)
    else:
        obj.append(c)
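
As a possible alternative to the loop, pandas can split columns by dtype with `select_dtypes`. This is a sketch on a toy frame with invented column names, and it is not exactly equivalent: `include="number"` matches every numeric dtype, not only `float64` and `int64`.

```python
import pandas as pd

# Toy frame; the column names are made up for illustration.
toy = pd.DataFrame(
    {"height": [1.5, 2.0], "population": [10, 20], "region": ["north", "south"]}
)

# Numeric columns (any integer or float dtype).
num = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical.
obj = toy.select_dtypes(exclude="number").columns.tolist()
```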

## Modeling
Let's start with a baseline model, and see how it does.

In [96]:
# Wrap the training data in a CatBoost Pool, marking the object columns as categorical features.
X_train_pool = Pool(data=X_train, label=y_train, cat_features=obj)

In [98]:
# Setting up the model
base = CatBoostClassifier(
    # Adding Accuracy as a metric
    custom_loss=[metrics.Accuracy()],
    random_seed=15,
    logging_level='Silent'
)

base.fit(
    X_train_pool,
    # No eval set is passed, so every reported score is computed on the training data
    # plot=True draws the learning curves in the notebook
    plot=True
);


In [103]:
base.get_best_score()

{'learn': {'Accuracy': 0.8188720538720539, 'MultiClass': 0.44880357230715595}}