# CatBoost Modeling
In this notebook I'll be using [CatBoost](https://catboost.ai/en/docs/). CatBoost handles much of the data cleaning internally, so I'm going to import the raw data again and then create a model using the training data.

## Preprocessing
This preprocessing will be the same as in the Sklearn notebooks, except that we won't scale the features or one-hot encode them. CatBoost handles these things internally.

In [56]:
# Import statements
from catboost import CatBoostClassifier, metrics, Pool
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [57]:
# Load the training and test features into dataframes
X_train = pd.read_csv('../Data/Training_Features.csv')
X_train.drop(columns=['date_recorded', 'permit', 'public_meeting'], inplace=True)
X_test = pd.read_csv('../Data/Test_Features.csv')

y_train = pd.read_csv('../Data/Training_Labels.csv')
y_train = y_train['status_group']

In [58]:
# Encode the target classes as integers.
y_train = y_train.map({'functional': 1, 'non functional': 0, 'functional needs repair': 2})
y_train.value_counts()

1    32259
0    22824
2     4317
Name: status_group, dtype: int64
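Since the model will predict integer codes, it can help to keep the mapping in one place so predictions can be turned back into readable labels. The following is a minimal sketch on a small made-up Series; the names `status_to_code` and `code_to_status` are assumptions for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Same label-to-integer mapping as used for y_train above
status_to_code = {"non functional": 0, "functional": 1, "functional needs repair": 2}
# Reverse lookup, useful for decoding model predictions later
code_to_status = {code: status for status, code in status_to_code.items()}

# Small illustrative sample, not the real training labels
labels = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

encoded = labels.map(status_to_code)   # strings -> integers
decoded = encoded.map(code_to_status)  # integers -> strings
```

Because `map` returns `NaN` for any value missing from the dictionary, an unexpected label would show up as a missing value instead of passing through unchanged.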

In [59]:
# make the lists of columns
# num = any columns with numerical value
num = []
# obj = any columns with object value
obj = []
for c in X_train.columns:
    if X_train[c].dtype in ['float64', 'int64']:
        num.append(c)
    else:
        obj.append(c)
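The same split can also be sketched with pandas' `select_dtypes`, which avoids the explicit loop. This is an alternative under the assumption that every numeric dtype should count as numeric (the loop above only matches `float64` and `int64`); the toy dataframe and the names `num_cols` and `obj_cols` are made up for illustration.

```python
import pandas as pd

# Toy dataframe with a few columns borrowed from the training features
toy = pd.DataFrame({
    "amount_tsh": [6000.0, 0.0, 25.0],
    "funder": ["Roman", "Grumeti", "Lottery Club"],
    "gps_height": [1390, 1399, 686],
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani"],
})

# Numeric columns, in their original order
num_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as an object/categorical column
obj_cols = toy.select_dtypes(exclude="number").columns.tolist()
```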

In [67]:
num

['id',
 'amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'region_code',
 'district_code',
 'population',
 'construction_year']

In [71]:
X_train

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,10.0,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,Pangani,...,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
59396,27263,4700.0,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,Rufiji,...,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
59397,37057,0.0,,0,,34.017087,-8.750434,Mashine,0,Rufiji,...,monthly,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
59398,31282,0.0,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,Rufiji,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump


In [60]:
# First, the numeric columns.
num_transformer = Pipeline(steps=[
    # Fill missing values with the median value for the column
    ('num_imputer', SimpleImputer(strategy='median')),
    ])

In [61]:
# CatBoost does not require scaling; the NaN values just need to be imputed.
obj_transformer = Pipeline(steps=[
    # For each unknown value, fill in "Unknown".
    ('obj_imputer', SimpleImputer(strategy='constant', fill_value='Unknown'))
])

In [62]:
# Now that the transformers have been set up, package them together into a column transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', num_transformer, num),
        ('obj', obj_transformer, obj),
    ])

In [63]:
# Fit and transform the data using the preprocessor
X_train_transformed = preprocessor.fit_transform(X_train, y_train)

In [64]:
# Restore column names by converting the array back into a dataframe.
# ColumnTransformer outputs the numeric columns first, then the object columns,
# so the names must follow that order rather than X_train.columns.
X_train_transformed = pd.DataFrame(X_train_transformed, columns=num + obj)

In [70]:
X_train_transformed

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572.0,6000.0,1390.0,34.938093,-9.856322,0.0,11.0,5.0,109.0,1999.0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776.0,0.0,1399.0,34.698766,-2.147466,0.0,20.0,2.0,280.0,2010.0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310.0,25.0,686.0,37.460664,-3.821329,0.0,21.0,4.0,250.0,2009.0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743.0,0.0,263.0,38.486161,-11.155298,0.0,90.0,63.0,58.0,1986.0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728.0,0.0,0.0,31.130847,-1.825359,0.0,18.0,1.0,0.0,0.0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739.0,10.0,1210.0,37.169807,-3.253847,0.0,3.0,5.0,125.0,1999.0,...,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
59396,27263.0,4700.0,1212.0,35.249991,-9.070629,0.0,11.0,4.0,56.0,1996.0,...,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
59397,37057.0,0.0,0.0,34.017087,-8.750434,0.0,12.0,7.0,0.0,0.0,...,monthly,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
59398,31282.0,0.0,0.0,35.861315,-6.378573,0.0,1.0,4.0,0.0,0.0,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
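A `ColumnTransformer` returns its output in transformer order (here numeric first, then object), which is why the numeric values occupy the leading columns of the table above. The sketch below reproduces this on a toy dataframe and also restores numeric dtypes, since the combined array comes back with `object` dtype. The toy data and the names `num_cols`, `obj_cols` and `restored` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame whose numeric and object columns are interleaved
toy = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef"],
    "amount_tsh": [6000.0, np.nan, 25.0],
    "basin": ["Pangani", "Rufiji", np.nan],
    "gps_height": [1390, 1399, 686],
})
num_cols = ["amount_tsh", "gps_height"]
obj_cols = ["funder", "basin"]

toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric", SimpleImputer(strategy="median"), num_cols),
    ("obj", SimpleImputer(strategy="constant", fill_value="Unknown"), obj_cols),
])
transformed = toy_preprocessor.fit_transform(toy)

# Output columns follow transformer order, so label them with num_cols + obj_cols
restored = pd.DataFrame(transformed, columns=num_cols + obj_cols)
# The mixed array is object-typed; cast the numeric columns back to float
restored = restored.astype({c: float for c in num_cols})
```

Labelling the array with the original column order instead would silently attach numeric values to names such as `funder`.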


## Modeling
Let's start with a baseline model, and see how it does.

In [66]:
# Setting up the model
base = CatBoostClassifier(
    # Adding Accuracy as a metric
    custom_loss=[metrics.Accuracy()],
    random_seed=15,
    logging_level='Silent'
)

base.fit(
    X_train, y_train,
    # No eval set is passed here; plot=True draws the learning curves during training
    plot=True
);

CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=2]="Roman": Cannot convert 'b'Roman'' to float