# Comparison of the Java and Python versions of xgboost
Why do the performance metrics differ between the two?

The core xgboost library (i.e. the native C++ core), which performs the main part of the computation, should be the same in Spark and Python.

Note: I am using xgboost from the master branch (`5d74578095e1414cfcb62f9732165842f25b81ca`).
The other libraries are the current versions from conda or pip, respectively.

### Step 1
Generate some random data.

In [1]:
from sklearn import datasets
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import classification_report
import pandas as pd
import xgboost as xgb
import evaluation
from sklearn.datasets import load_breast_cancer

mySeed = 45

In [2]:
(X, y) = datasets.make_classification(n_samples=1000, n_features=100, n_informative=20, random_state=mySeed)
#(X,y) = load_breast_cancer(return_X_y=True)
df = pd.DataFrame(X)
df['target'] = pd.Series(y)

X = pd.DataFrame(X)
y = pd.Series(y)

### Step 2
Fit xgboost in Python / sklearn on the data.

All parameters are specified explicitly for easier comparison with the Spark variant.

In [3]:
clf = xgb.XGBClassifier(max_depth=2, learning_rate=0.01, max_delta_step=2,
                 n_estimators=2, silent=True,
                 objective='binary:logistic', nthread=-1,
                 gamma=0, subsample=0.7, colsample_bytree=0.7,
                 colsample_bylevel=0.6, reg_alpha=0, reg_lambda=2, scale_pos_weight=1,
                 base_score=0.5, missing=None, seed= mySeed)

split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=mySeed)
results_for_model = []
fold_counter = 0

for train_index, test_index in split.split(X, y):
    fold_counter += 1
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    
    X_train['target'] = y_train
    X_test['target'] = y_test
    
    X_train.to_csv('clean_train_' + str(fold_counter) + '_.csv', index=False, sep=';')
    X_test.to_csv('clean_test_' + str(fold_counter) + '_.csv', index=False, sep=';')

    
    fit_params = {
                'early_stopping_rounds': 20,
                'eval_metric': ['error'],
                'eval_set': [(X_train, y_train)],
    }
    clf.fit(X_train, y_train,
                      eval_set=fit_params['eval_set'], eval_metric=fit_params['eval_metric'],
                      early_stopping_rounds=fit_params['early_stopping_rounds'], verbose=False)
    
    y_pred = clf.predict(X_test)
    # print(classification_report(y_true=y_test, y_pred=y_pred))
    results_for_model.append(
            evaluation.evalSingleModel(X_test, y_test, clf, 'myXgboostModel' + '_' + str(fold_counter), 'training'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

SCORING number of target: 99
real number of target==1: 99


SCORING number of target: 98
real number of target==1: 98


SCORING number of target: 94
real number of target==1: 94


SCORING number of target: 101
real number of target==1: 101


SCORING number of target: 98
real number of target==1: 98
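
The pandas warning above comes from assigning a `target` column to the slices returned by `X.iloc[...]`. In the cell above, that column is also still part of `X_train` when `clf.fit(X_train, y_train)` is called, so the label appears to be among the features; whether that explains the scores is not verified here. A minimal sketch of one way to build the CSV export without touching the feature frame, using small made-up data and a fixed index array as a stand-in for one `ShuffleSplit` fold (`train_export` is a name introduced only for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in data set (assumption: not the data used in the post).
rng = np.random.default_rng(45)
X = pd.DataFrame(rng.normal(size=(10, 3)))
y = pd.Series(rng.integers(0, 2, size=10))

# Stand-in for the train indices of one ShuffleSplit fold.
train_index = np.arange(8)
X_train = X.iloc[train_index]
y_train = y.iloc[train_index]

# assign() returns a new DataFrame, so X_train itself is not modified:
# no SettingWithCopyWarning, and no 'target' column among the features.
train_export = X_train.assign(target=y_train)

# train_export is what would be written with to_csv(...);
# X_train (features only) is what would be passed to clf.fit(...).
print(list(X_train.columns))
print(list(train_export.columns))
```

Note that fitting on the feature-only frame changes what the model sees compared to the cell above, so the metrics need not match the ones reported below.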


The results for Python are fairly good (for this parameter setting).

In [4]:
validation_scoring, train_scoring = evaluation.niceDisplayOfResults(results_for_model)
train_scoring

| modelName | kappa_mean | kappa_std | Error_mean | Error_std |
|---|---|---|---|---|
| myXgboostModel | 1.0 | 0.0 | 0.0 | 0.0 |


As you can see, the resulting metric (kappa) is 1.0 on every fold.
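
For reference, a minimal sketch of Cohen's kappa, assuming that this is what the `evaluation` helper (not shown in this post) reports as `kappa`. The function name `cohen_kappa` is introduced here only for illustration:

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[label] * pred_counts[label]
                   for label in true_counts) / (n * n)

    if expected == 1.0:
        # Both labelings are the same single class: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # one of four labels wrong
```

A kappa of 1.0 therefore means every test label was predicted correctly, while 0 means agreement no better than chance.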

### Step 3 - in spark
Just run `sbt run` once xgboost4j is compiled and installed into your `~/.m2` folder.

You should see something similar to the following in Spark:
```
MeasureUnit(kappa,0.4086460032626426)
MeasureUnit(f1_R,0.7563025210084033)
MeasureUnit(AUC_R,0.7011240465676435)
```
Here, kappa ranges from about 0.3 to 0.8, whereas in Python it is strictly 1 (over-fit).

As you can see, there is quite some difference between the results of xgboost in Python and in Spark. Depending on the specific values, the metrics reported by Python and by xgboost in Spark on my real data set differ by up to $|metric_{python} - metric_{spark}| = 0.3$. What is wrong here?

**Looking forward to any hints.**

The settings for both classifiers, as well as the seed, should be the same.

Python:
```    
mySeed = 45
xgb.XGBClassifier(max_depth=2, learning_rate=0.01, max_delta_step=2,
                 n_estimators=2, silent=True,
                 objective='binary:logistic', nthread=-1,
                 gamma=0, subsample=0.7, colsample_bytree=0.7,
                 colsample_bylevel=0.6, reg_alpha=0, reg_lambda=2, scale_pos_weight=1,
                 base_score=0.5, missing=None, seed= mySeed)
```

Spark:

```
val mySeed = 45
val xgbBaseParams = Map(
    "max_depth" -> 2,
    "num_rounds" -> 2,
    "eta" -> 0.01,
    "gamma" -> 0.0,
    "subsample" -> 0.7,
    "colsample_bytree" -> 0.7,
    "colsample_bylevel" -> 0.6,
    "min_child_weight" -> 1,
    "max_delta_step" -> 0,
    "seed" -> mySeed,
    "eval_metric" -> "error",
    "scale_pos_weight" -> 1,
    "silent" -> 1,
    "lambda" -> 2.0,
    "alpha" -> 0.0,
    "boosterType" -> "gbtree",
    "useExternalMemory" -> false,
    "objective" -> "binary:logistic",
    "tracker_conf" -> TrackerConf(1 minute, "scala")
  )
```
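
Since the two configurations use different parameter names, a mechanical comparison can help confirm that they really match. The sketch below copies the model-related values from the two snippets above (logging, threading, and tracker options are left out, and `eval_metric` is omitted because both sides use `error`). The name mapping in `SKLEARN_TO_NATIVE`, in particular treating `num_rounds` as the counterpart of `n_estimators`, is an assumption, and `diff_params` is a hypothetical helper:

```python
# Values copied from the Python snippet above.
python_params = {
    "max_depth": 2, "learning_rate": 0.01, "max_delta_step": 2,
    "n_estimators": 2, "objective": "binary:logistic",
    "gamma": 0, "subsample": 0.7, "colsample_bytree": 0.7,
    "colsample_bylevel": 0.6, "reg_alpha": 0, "reg_lambda": 2,
    "scale_pos_weight": 1, "base_score": 0.5, "seed": 45,
}

# Values copied from the Spark snippet above.
spark_params = {
    "max_depth": 2, "num_rounds": 2, "eta": 0.01, "gamma": 0.0,
    "subsample": 0.7, "colsample_bytree": 0.7, "colsample_bylevel": 0.6,
    "min_child_weight": 1, "max_delta_step": 0, "seed": 45,
    "scale_pos_weight": 1, "lambda": 2.0, "alpha": 0.0,
    "objective": "binary:logistic",
}

# Assumed mapping from sklearn-wrapper names to the names used on the Spark side.
SKLEARN_TO_NATIVE = {
    "learning_rate": "eta",
    "n_estimators": "num_rounds",
    "reg_alpha": "alpha",
    "reg_lambda": "lambda",
}


def diff_params(py, spark):
    """Return (mismatched values, keys only in Python, keys only in Spark)."""
    renamed = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in py.items()}
    shared = renamed.keys() & spark.keys()
    mismatched = {k: (renamed[k], spark[k])
                  for k in shared if renamed[k] != spark[k]}
    only_python = sorted(renamed.keys() - spark.keys())
    only_spark = sorted(spark.keys() - renamed.keys())
    return mismatched, only_python, only_spark


mismatched, only_python, only_spark = diff_params(python_params, spark_params)
print("different values (python, spark):", mismatched)
print("set only in Python:", only_python)
print("set only in Spark:", only_spark)
```

With the values above, `max_delta_step` is the only shared parameter whose values differ (2 in Python, 0 in Spark); `base_score` is set only on the Python side and `min_child_weight` only on the Spark side. Whether any of this accounts for the gap in the metrics is not established here.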