# Income classification using CatBoost

Dataset description:

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


In [1]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline

In [3]:
df = pd.read_csv("dataset/income_evaluation.csv")
df.head(4)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |


In [4]:
df.columns = [col.strip() for col in df.columns]
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
nominal_features = [
    "workclass",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]

ordinal_features = ["education-num"]
redundant_features = ["education"]
numerical_features = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]

target_column = "income"

In [7]:
# education and education-num carry the same information in different forms:
# education-num is an ordinal encoding of education.
df.loc[:, "education-num"].nunique(), df.loc[:, "education"].nunique()

(16, 16)

In [8]:
df.groupby(by=["education-num", "education"])["age"].count().sort_index()

education-num  education    
1               Preschool          51
2               1st-4th           168
3               5th-6th           333
4               7th-8th           646
5               9th               514
6               10th              933
7               11th             1175
8               12th              433
9               HS-grad         10501
10              Some-college     7291
11              Assoc-voc        1382
12              Assoc-acdm       1067
13              Bachelors        5355
14              Masters          1723
15              Prof-school       576
16              Doctorate         413
Name: age, dtype: int64
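The grouped counts above show each `education-num` paired with exactly one `education` label. A minimal sketch of how that one-to-one relationship could be asserted directly, using a small hand-made frame (`demo` and its rows are illustrative assumptions, not the income data):

```python
import pandas as pd

# Tiny illustrative frame; the real check would run on df before dropping "education".
demo = pd.DataFrame(
    {
        "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
        "education-num": [13, 9, 13, 14, 9],
    }
)

# Each label should map to exactly one code, and each code to exactly one label.
codes_per_label = demo.groupby("education")["education-num"].nunique()
labels_per_code = demo.groupby("education-num")["education"].nunique()

is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the check holds on the full data, dropping one of the two columns loses no information.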

In [9]:
df.drop(columns=redundant_features, inplace=True)

In [10]:
# Assign whole columns (df[col] = ...) so the new dtype replaces the old one;
# in recent pandas versions df.loc[:, col] = ... writes into the existing column in place.
for col in nominal_features + ordinal_features:
    df[col] = df[col].astype("category")

df[target_column] = df[target_column].astype("category")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             32561 non-null  int64   
 1   workclass       32561 non-null  category
 2   fnlwgt          32561 non-null  int64   
 3   education-num   32561 non-null  category
 4   marital-status  32561 non-null  category
 5   occupation      32561 non-null  category
 6   relationship    32561 non-null  category
 7   race            32561 non-null  category
 8   sex             32561 non-null  category
 9   capital-gain    32561 non-null  int64   
 10  capital-loss    32561 non-null  int64   
 11  hours-per-week  32561 non-null  int64   
 12  native-country  32561 non-null  category
 13  income          32561 non-null  category
dtypes: category(9), int64(5)
memory usage: 1.5 MB


In [11]:
# The raw string values carry surrounding whitespace; strip it so that
# later comparisons such as == "?" match.
for col in nominal_features:
    df[col] = df[col].str.strip()

In [12]:
# Split into training and test sets, stratified on the target
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

X_train.shape, X_test.shape

((29304, 13), (3257, 13))
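Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly the same in both splits. A minimal sketch on synthetic labels (the `*_demo` names and the 76/24 class mix are illustrative assumptions, not the income data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the imbalance seen in this dataset.
y_demo = np.array([0] * 760 + [1] * 240)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

# Share of the positive class in each split.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```

Without stratification, a small test set can end up with a noticeably different positive rate, which makes test metrics harder to compare with cross-validation results.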

In [13]:
from sklearn.preprocessing import LabelEncoder

target_encoder = LabelEncoder()
y_train = target_encoder.fit_transform(y_train)

In [14]:
target_encoder.classes_

array([' <=50K', ' >50K'], dtype=object)

In [15]:
df_train = X_train.copy()
df_train.describe()

| | age | fnlwgt | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| count | 29304.0 | 29304.0 | 29304.0 | 29304.0 | 29304.0 |
| mean | 38.573539 | 189792.9 | 1074.151276 | 87.560947 | 40.451167 |
| std | 13.631085 | 105449.8 | 7362.461438 | 403.761129 | 12.380112 |
| min | 17.0 | 12285.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117924.5 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178354.5 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 236952.2 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 99999.0 | 4356.0 | 99.0 |


In [16]:
classes, counts = np.unique(y_train, return_counts=True)
pd.Series(counts, index=classes) / y_train.shape[0]

0    0.75918
1    0.24082
dtype: float64

In [17]:
df_train.replace(to_replace="?", value="other", inplace=True)

In [18]:
df_train.columns

Index(['age', 'workclass', 'fnlwgt', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [20]:
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
roc_auc_vals = []
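# Out-of-fold probabilities. Allocate floats explicitly: np.full_like(y_train, ...)
# would inherit the integer dtype of the label-encoded y_train.
cv_oob_prob = pd.Series(np.zeros(len(y_train), dtype=float))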

cat_feat = nominal_features + ordinal_features

# iterations, n_estimators, num_boost_round and num_trees are aliases:
# set only one of them.
for train_idx, valid_idx in skf.split(df_train, y_train):
    X_train_cv, y_train_cv = df_train.iloc[train_idx], y_train[train_idx]
    X_valid_cv, y_valid_cv = df_train.iloc[valid_idx], y_train[valid_idx]
    model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="AUC",
        task_type="CPU",
        iterations=1000,
        learning_rate=0.05,
        random_state=42,
        depth=5,
        early_stopping_rounds=100,
        cat_features=cat_feat,
    )

    _train = Pool(X_train_cv, label=y_train_cv, cat_features=cat_feat)
    _valid = Pool(X_valid_cv, label=y_valid_cv, cat_features=cat_feat)
    fit_model = model.fit(_train, eval_set=_valid, use_best_model=True, verbose=200)
    p_valid = fit_model.predict_proba(X_valid_cv)[:, -1]
    cv_oob_prob[valid_idx] = p_valid
    roc_auc_vals.append(roc_auc_score(y_valid_cv, p_valid))

0:	test: 0.8896206	best: 0.8896206 (0)	total: 133ms	remaining: 2m 13s
100:	test: 0.9216147	best: 0.9216147 (100)	total: 4.64s	remaining: 41.3s
200:	test: 0.9258446	best: 0.9258682 (198)	total: 9.2s	remaining: 36.6s
300:	test: 0.9298765	best: 0.9298765 (300)	total: 14.3s	remaining: 33.1s
400:	test: 0.9306598	best: 0.9306598 (400)	total: 18.7s	remaining: 28s
500:	test: 0.9309947	best: 0.9310221 (498)	total: 22.4s	remaining: 22.3s
600:	test: 0.9311787	best: 0.9312525 (576)	total: 27.3s	remaining: 18.1s
700:	test: 0.9314760	best: 0.9314963 (698)	total: 31.7s	remaining: 13.5s
800:	test: 0.9315383	best: 0.9316275 (782)	total: 36.5s	remaining: 9.06s
900:	test: 0.9319407	best: 0.9319604 (894)	total: 41s	remaining: 4.51s
999:	test: 0.9320616	best: 0.9321062 (993)	total: 45.6s	remaining: 0us

bestTest = 0.9321061845
bestIteration = 993

Shrink model to first 994 iterations.
0:	test: 0.8800850	best: 0.8800850 (0)	total: 64.7ms	remaining: 1m 4s
100:	test: 0.9124846	best: 0.9124846 (100)	total: 4.2

Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.9294501014
bestIteration = 873

Shrink model to first 874 iterations.


In [21]:
overall_roc_auc = roc_auc_score(y_train, cv_oob_prob)
overall_roc_auc

0.9288088156086013

In [22]:
np.mean(roc_auc_vals), np.std(roc_auc_vals)

(0.9289662493601032, 0.004476139818443479)
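The pooled out-of-fold AUC (about 0.92881) and the mean of the per-fold AUCs (about 0.92897) are close but not identical: the pooled score ranks predictions from different fold models against each other, while the mean only ranks within each fold. A minimal sketch of the two calculations on synthetic data, with `LogisticRegression` swapped in for CatBoost to keep it light (the data, the model and all `*_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data (not the income dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.76, 0.24], random_state=0
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_prob = np.zeros(len(y_demo), dtype=float)  # one out-of-fold score per row
fold_aucs = []

for train_idx, valid_idx in skf_demo.split(X_demo, y_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    p_valid = clf.predict_proba(X_demo[valid_idx])[:, 1]
    oof_prob[valid_idx] = p_valid
    fold_aucs.append(roc_auc_score(y_demo[valid_idx], p_valid))

pooled_auc = roc_auc_score(y_demo, oof_prob)  # ranks across all folds at once
mean_fold_auc = float(np.mean(fold_aucs))     # average of within-fold rankings
print(pooled_auc, mean_fold_auc)
```

A large gap between the two numbers would suggest that the fold models produce probabilities on different scales.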

In [23]:
model.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'AUC',
 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1',
  'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'],
 'iterations': 1000,
 'sampling_frequency': 'PerTree',
 'fold_permutation_block': 0,
 'leaf_estimation_method': 'Newton',
 'od_pval': 0,
 'counter_calc_method': 'SkipTest',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'ctr_leaf_count_limit': 18446744073709551615,
 'bayesian_matrix_reg': 0.10000000149011612,
 'one_hot_max_size': 2,
 'l2_leaf_reg': 3,
 'random_strength': 1,
 'od_type': 'Iter',
 'rsm': 1,
 'boost_from_average': False,
 'max_ctr_complexity': 4,
 'model_size_reg': 0.5,
 'simple_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:

In [25]:
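# Refit on the full training set. No eval_set is given here, so early stopping
# does not apply and all 1000 iterations run.
model.fit(df_train, y_train, cat_features=cat_feat, verbose=200)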

0:	total: 36.1ms	remaining: 36s
200:	total: 9.18s	remaining: 36.5s
400:	total: 18.4s	remaining: 27.4s
600:	total: 27.7s	remaining: 18.4s
800:	total: 37s	remaining: 9.18s
999:	total: 46.6s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f72818c3370>

In [None]:
y_test = target_encoder.transform(y_test)

In [27]:
df_test = X_test.copy()
df_test.replace(to_replace="?", value="other", inplace=True)

In [29]:
y_test_prob = model.predict_proba(df_test)[:, 1]
roc_auc_score(y_test, y_test_prob)

0.935737598719229

In [30]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_hat = model.predict(df_test)

accuracy_score(y_test, y_hat)
confusion_matrix(y_test, y_hat)
precision_score(y_test, y_hat)
recall_score(y_test, y_hat)

0.8805649370586429

array([[2358,  115],
       [ 274,  510]])

0.816

0.6505102040816326
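The scalar metrics above can be recomputed by hand from the confusion matrix, which helps when reading it: rows are true classes, columns are predicted classes, and class 1 (`>50K`) is treated as positive. A short worked sketch using the four counts printed above (the variable names are just for illustration):

```python
# Counts copied from the confusion matrix output:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 2358, 115
fn, tp = 274, 510

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted >50K that are truly >50K
recall = tp / (tp + fn)                     # share of true >50K that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The model misses 274 of the 784 high-income rows, which is why recall for class 1 is well below its precision.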

In [33]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_hat))

              precision    recall  f1-score   support

           0       0.90      0.95      0.92      2473
           1       0.82      0.65      0.72       784

    accuracy                           0.88      3257
   macro avg       0.86      0.80      0.82      3257
weighted avg       0.88      0.88      0.88      3257



## References

[Catboost using GPU](https://www.kaggle.com/baomengjiao/5kflod-catboost-using-gpu)