# Problem 1

The default value `loss_function = Logloss` can confuse new users who run `CatBoostClassifier` on a multiclass classification task.

In [1]:
import sklearn
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Preparing Iris data

In [2]:
np.random.seed(0)

In [3]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

features = df.columns[:4]

df['species'] = pd.factorize(df['species'])[0] # encode labels as integers 0, 1, 2
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

In [4]:
df['species'].unique()

array([0, 1, 2])

## This is a multiclass classification task!

In [5]:
X_train, X_test = df[df['is_train']], df[~df['is_train']]

y_train = X_train['species']
y_test = X_test['species']

X_train = X_train[features]
X_test = X_test[features]

In [6]:
X_train.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


### Let's train RandomForestClassifier and CatBoostClassifier with default parameters to compare them

# Random Forest

In [7]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [8]:
from sklearn.metrics import accuracy_score

In [9]:
print("accuracy train:", accuracy_score(y_train, rf_clf.predict(X_train)))
print("accuracy test:", accuracy_score(y_test, rf_clf.predict(X_test)))

accuracy train: 0.9915254237288136
accuracy test: 0.96875


OK, that is a good baseline.

## Now let's train CatBoostClassifier

In [10]:
from catboost import CatBoostClassifier

In [11]:
cb_clf = CatBoostClassifier()
cb_clf.fit(X_train, y_train, silent=True)

<catboost.core.CatBoostClassifier at 0x7f568f8ce358>

In [12]:
print(accuracy_score(y_train, cb_clf.predict(X_train)))
print(accuracy_score(y_test, cb_clf.predict(X_test)))

0.6779661016949152
0.625


### Hmmmm, not so good. Let's find out why

In [13]:
cb_clf.predict(X_train)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

The predictions contain only the labels 0 and 1; class 2 is never predicted. The cause is that `CatBoostClassifier` has the default parameter `loss_function = Logloss`, which is intended for binary classification.
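A quick way to confirm this kind of failure is to compare the set of true labels with the set of predicted labels. The helper below is a minimal sketch; the name `never_predicted` is hypothetical, and it assumes the labels are integer codes (as produced by `pd.factorize` above) while the predictions may come back as floats.

```python
def never_predicted(y_true, y_pred):
    """Return the sorted class labels that occur in y_true but never in y_pred."""
    # Cast to int so that float predictions such as 1.0 match the integer label 1.
    true_labels = {int(v) for v in y_true}
    predicted_labels = {int(v) for v in y_pred}
    return sorted(true_labels - predicted_labels)


# Toy data mirroring the situation above: three classes, only 0 and 1 predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(never_predicted(y_true, y_pred))  # [2]
```

In the notebook, the equivalent call would be `never_predicted(y_train, cb_clf.predict(X_train))`.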

This is bad for several reasons:
- Not all new users will read the `CatBoostClassifier` documentation and learn that they must set `loss_function = MultiClass` for a multiclass task, especially if they have previously used `RandomForestClassifier` or `XGBClassifier` with default parameters. Those two classifiers choose between binary and multiclass classification automatically, based on the number of unique values in the `target` vector.

- The second problem, in my opinion, matters most during a user's first experience with CatBoost. The fix is easy (go to the CatBoost documentation and set `loss_function = MultiClass`), so users have no reason to open an issue about it. But my experiments with other programmers showed that some new users are genuinely confused by this behavior, which can leave a negative first impression of CatBoost. We want to avoid that.
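For reference, the manual workaround mentioned above amounts to passing the loss function explicitly. The sketch below assumes `catboost` is installed; `MULTICLASS_PARAMS` and `fit_multiclass` are hypothetical names introduced only for illustration, while `loss_function` and `verbose` are existing `CatBoostClassifier` parameters.

```python
# Explicit settings for a multiclass target.
MULTICLASS_PARAMS = {"loss_function": "MultiClass", "verbose": False}


def fit_multiclass(X, y, **extra_params):
    """Train a CatBoostClassifier with the multiclass loss set explicitly."""
    # Imported here so the snippet can be loaded even without catboost installed.
    from catboost import CatBoostClassifier

    # Caller-supplied parameters override the defaults above.
    params = {**MULTICLASS_PARAMS, **extra_params}
    model = CatBoostClassifier(**params)
    model.fit(X, y)
    return model
```

In the notebook this would be called as `cb_clf = fit_multiclass(X_train, y_train)`, after which the accuracy check from cell 12 can be repeated.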

**Possible solution:** count the unique values in the target. If there are three or more, automatically switch `loss_function` to `MultiClass` or, as an alternative, print a warning to the user such as "it's probably better to switch `loss_function` to `MultiClass`".
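The proposed check could look roughly like the sketch below. It is not CatBoost code: the function `choose_loss_function` and its `requested` argument are hypothetical, and the example combines both suggested behaviors (switch automatically and warn) purely for illustration.

```python
import warnings


def choose_loss_function(y, requested="Logloss"):
    """Pick a loss function name from the number of distinct target values."""
    n_classes = len(set(y))  # works for lists, NumPy arrays and pandas Series
    if n_classes >= 3 and requested == "Logloss":
        warnings.warn(
            f"Target has {n_classes} unique values; "
            "switching loss_function from Logloss to MultiClass.",
            UserWarning,
            stacklevel=2,
        )
        return "MultiClass"
    # Binary target, or the user already asked for something else: keep it.
    return requested


print(choose_loss_function([0, 1, 0, 1]))     # Logloss
print(choose_loss_function([0, 1, 2, 1, 0]))  # MultiClass (and emits a warning)
```

Warning only, without switching, would be the more conservative variant, since it never changes what an explicit `loss_function` setting does.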