# Multi-Class Predictive Model | BAIS:6100

**Instructor: Qihang Lin**

In this notebook, we will study how to apply a sparse logistic regression (SLR) model and an XGBoost model to classify documents into one of three or more classes.

Similar to the binary case, multi-class classification also consists of three steps: 1. Dataset Preparation; 2. Feature Engineering; 3. Model Training. Since steps 1 and 2 are the same as the binary case, this tutorial will focus on step 3.

The CSV file "classdata/foursciences.csv" contains posts from a forum. Each post is about one of four scientific topics: cryptography, medicine, space, and electronics. Your task is to build a predictive model that predicts the topic of a post.

- Column "topic" contains the class labels ("crypt", "med", "space", or "electronics"). 
- Column "text" contains the texts of posts. 

The following code reads the data and shows the frequencies of the class labels.

In [1]:
import pandas as pd
df = pd.read_csv("classdata/foursciences.csv")
df.head()

Unnamed: 0,text,topic
0,"Archive-name: ripem/faq\nLast-update: Sun, 7 M...",crypt
1,Archive-name: ripem/faq\nLast-update: 31 Mar 9...,crypt
2,Archive-name: ripem/attacks\nLast-update: 31 M...,crypt
3,">>If you have access to FTP, try FTPing to rsa...",crypt
4,Some sick part of me really liked that phra...,crypt


In [2]:
df["topic"].value_counts()

crypt          991
med            988
space          985
electronics    984
Name: topic, dtype: int64

## Multi-Class Sparse Logistic Regression: One VS Rest

We call the $K$ classes Class $0$, Class $1$, ..., Class $K-1$. The idea is to build one binary SLR model per class: model $0$ predicts whether a document is in Class $0$, model $1$ predicts whether a document is in Class $1$, and so on. This is called the **One VS Rest** method.

* Vectorization of a document: $\mathbf{x}=(x_1,x_2,\dots,x_n)$ 
    * $\mathbf{x}$ is a row in DTM and also called a feature vector.
    
* Each document has a class label, denoted by $y$.
    * For example, $y=0,1,\dots,K-1$ in $K$-class classification.

* A linear score is defined for class $0,1,2,\dots,K-1$ separately: 
    * Class $0$: $\alpha^0+\beta_1^0x_1+\beta_2^0x_2+\cdots+\beta_n^0x_n$
    * Class $1$: $\alpha^1+\beta_1^1x_1+\beta_2^1x_2+\cdots+\beta_n^1x_n$
    * ......
    * Class $K-1$: $\alpha^{K-1}+\beta_1^{K-1}x_1+\beta_2^{K-1}x_2+\cdots+\beta_n^{K-1}x_n$
* Coefficients for each class $k=0,1,\dots,K-1$: 
    * Intercept: $\alpha^k$ 
    * Slopes: $\boldsymbol{\beta}^k=(\beta_1^k,\beta_2^k,\dots,\beta_n^k)$
* Logistic regression makes its prediction based on the linear scores:
    * Predict $y$ as Class $k$ if the $k$th linear score above is larger than the other $K-1$ scores. 
* Impact of terms: 
    * $\beta_i^k>0$: A document with a high frequency of term $i$ is more likely to be in Class $k$. 
    * $\beta_i^k<0$: A document with a high frequency of term $i$ is less likely to be in Class $k$. 
    * $\beta_i^k=0$: Term $i$ has no impact on the class label $k$ (but may still have impact on other classes).
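
The prediction rule above can be sketched in a few lines of NumPy. The intercepts and slopes below are made-up toy numbers (three classes, four terms), not values from the fitted model; the sketch only illustrates that each class gets its own linear score and the class with the largest score is predicted.

```python
import numpy as np

# Toy (hypothetical) coefficients: K = 3 classes, n = 4 terms.
alpha = np.array([-1.0, -0.5, -0.8])          # one intercept per class
beta = np.array([[0.9, 0.0, 0.0, -0.2],       # slopes for Class 0
                 [0.0, 0.7, 0.0,  0.0],       # slopes for Class 1
                 [-0.3, 0.0, 0.8, 0.0]])      # slopes for Class 2

x = np.array([0.0, 1.0, 3.0, 0.0])            # feature vector of one document

scores = alpha + beta @ x                     # the K linear scores
predicted_class = int(np.argmax(scores))      # class with the largest score
print(scores, predicted_class)
```

Here the document uses term 3 heavily, and only Class 2 has a positive slope for that term, so Class 2 gets the largest score.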

## Data Preparation and Feature Engineering

This part is the same as binary classification.

In [3]:
#Split the data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.30, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

In [4]:
#Feature Engineering
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk 

stemmer = nltk.stem.SnowballStemmer("english")
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

nltk_stopwords = nltk.corpus.stopwords.words("english") 

vectorizer=StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)

#Create the training and testing DTMs and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["topic"]
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["topic"]

## Fit a Multi-Class SLR Model

The main difference from the binary case in this step is setting the argument **multi_class='ovr'**. Here **ovr** means "one vs rest".

In [5]:
#Model training
from sklearn.linear_model import LogisticRegression
sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              multi_class='ovr', ##Remember to set this for multi-class case
                              random_state=2021,
                              tol=0.01,
                              max_iter=1000, 
                              C=1)
sparselr.fit(train_x,train_y)

LogisticRegression(C=1, max_iter=1000, multi_class='ovr', penalty='l1',
                   random_state=2021, solver='liblinear', tol=0.01)

The **sklearn** library converts the original text class labels into Class $0$, Class $1$, ..., Class $K-1$ in alphabetical order. In this example, 
   * 'crypt' is Class $0$
   * 'electronics' is Class $1$
   * 'med' is Class $2$
   * 'space' is Class $3$
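
The alphabetical rule amounts to sorting the distinct labels and numbering them from 0. Below is a minimal sketch in plain Python with a short made-up label list (not the real data). For the fitted model itself, the same ordering is stored in the `classes_` attribute of `LogisticRegression`.

```python
# A few example labels in arbitrary order (illustrative only).
labels = ["space", "med", "crypt", "electronics", "med", "crypt"]

# Sort the distinct labels alphabetically and number them from 0.
classes = sorted(set(labels))
class_index = {label: k for k, label in enumerate(classes)}

print(class_index)
```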

There are four slope coefficients $\beta_i^k$ for each term, one for its impact on each of the four classes. Hence, the slope coefficients $\beta_i^k$ form a 2D table. See its shape below. 

In [6]:
#Shape of slope coefficient matrix. Each row is a class and each column is a term.
sparselr.coef_.shape

(4, 29655)

In [7]:
#Slope betas for class 0, 1, 2 and 3.
print(sparselr.coef_[0])  #Slope of each term for class 0
print(sparselr.coef_[1])  #Slope of each term for class 1  
print(sparselr.coef_[2])  #Slope of each term for class 2
print(sparselr.coef_[3])  #Slope of each term for class 3

[0.         0.00828141 0.         ... 0.         0.         0.        ]
[0. 0. 0. ... 0. 0. 0.]
[ 0.         -0.03782684  0.         ...  0.          0.
  0.        ]
[0.         0.03240413 0.         ... 0.         0.         0.        ]


In [8]:
#Intercept alpha for class 0, 1, 2 and 3.
print(sparselr.intercept_[0])  #Intercept for class 0
print(sparselr.intercept_[1])  #Intercept for class 1
print(sparselr.intercept_[2])  #Intercept for class 2
print(sparselr.intercept_[3])  #Intercept for class 3

-1.5382228430655507
-1.018189090055603
-1.129507609889858
-1.2260173351099795


In [9]:
#How many non-zero betas for each class (sparsity). There are about 30,000 terms in total.
print(sum(sparselr.coef_[0]!=0))
print(sum(sparselr.coef_[1]!=0))
print(sum(sparselr.coef_[2]!=0))
print(sum(sparselr.coef_[3]!=0))

1394
1641
1554
1501
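
The four counts can also be obtained in one call by counting along the rows of the coefficient matrix. The sketch below uses a small made-up matrix as a stand-in for `sparselr.coef_`, so the numbers are illustrative only.

```python
import numpy as np

# Toy stand-in for sparselr.coef_: 3 classes (rows) by 5 terms (columns).
coef = np.array([[0.0, 0.4,  0.0, 0.0, -0.1],
                 [0.0, 0.0,  0.0, 0.2,  0.0],
                 [0.3, 0.0, -0.5, 0.0,  0.6]])

nonzero_per_class = np.count_nonzero(coef, axis=1)   # non-zero slopes in each row
share_nonzero = nonzero_per_class / coef.shape[1]    # fraction of terms kept per class
print(nonzero_per_class, share_nonzero)
```

Applied to the real model, the same ratio for Class 0 would be 1394 / 29655, or roughly 4.7% of the terms.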


## Descriptive Analytics

We can identify the terms that have the largest impact on class $k$ by sorting $\beta_1^k$, $\beta_2^k$, ..., $\beta_n^k$.

In [10]:
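#Create a table with term and its four sequences of beta (one sequence for each class)
#get_feature_names_out() needs scikit-learn 1.0 or later; older versions use get_feature_names()
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),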
                       'Beta0': sparselr.coef_[0],
                       'Beta1': sparselr.coef_[1],
                       'Beta2': sparselr.coef_[2],
                       'Beta3': sparselr.coef_[3]
                     })

In [11]:
#Show the terms that have the largest impact on Class 0 (in descending order).
dfbeta.sort_values(by="Beta0",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

Unnamed: 0,Term,Beta0,Beta1,Beta2,Beta3
0,mathew,0.478685,0.0,0.0,-0.109287
1,crypt,0.454312,0.0,-0.068029,-0.014402
2,omiss,0.452885,0.0,0.0,0.0
3,forbidden,0.436091,-0.124673,-0.156081,0.0
4,decript,0.397477,0.0,0.0,0.0
5,kelsey,0.367561,0.0,0.0,0.0
6,rdl,0.35195,-0.147157,0.0,0.0
7,occam,0.33031,0.0,-0.110946,0.0
8,substitut,0.321596,0.0,0.0,0.0
9,hermann,0.316392,0.0,0.0,0.0


In [12]:
#Show the terms that have the largest impact to Class 1 (in a descending order).
dfbeta.sort_values(by="Beta1",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

Unnamed: 0,Term,Beta0,Beta1,Beta2,Beta3
0,yxy4145,-0.001464,0.505077,-0.069072,-0.125707
1,solvent,-0.127805,0.434965,-0.172994,-0.226376
2,nikolai,0.0,0.410533,0.0,-0.004542
3,motorola,-0.141929,0.392147,-0.204707,-0.222962
4,aftershav,-0.04761,0.380306,-0.043176,0.0
5,dayton,0.0,0.376375,-0.052481,0.0
6,hd,0.0,0.369702,-0.030905,-0.071864
7,gandler,0.0,0.356764,-0.077681,0.0
8,schemat,-0.032308,0.333382,-0.038664,-0.027997
9,cci,-0.021818,0.330066,-0.069711,-0.069345


In [13]:
#Show the terms that have the largest impact to Class 2 (in a descending order).
dfbeta.sort_values(by="Beta2",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

Unnamed: 0,Term,Beta0,Beta1,Beta2,Beta3
0,vs,-0.075435,-0.049148,0.434462,-0.056806
1,ecg,0.0,-0.082585,0.390758,0.0
2,morphin,0.0,-0.006892,0.387095,0.0
3,med,0.0,-0.196239,0.371658,-0.129886
4,osteopathi,-0.051365,-0.100121,0.370791,-0.087062
5,diagnos,-0.051815,-0.102983,0.342373,-0.026415
6,girli,0.0,0.0,0.327651,0.0
7,eyelid,0.0,0.0,0.315495,0.0
8,pathophysiolog,0.0,-0.127093,0.30943,0.0
9,diverticular,-0.056903,-0.035239,0.306203,-0.034024


In [14]:
#Show the terms that have the largest impact to Class 3 (in a descending order).
dfbeta.sort_values(by="Beta3",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

Unnamed: 0,Term,Beta0,Beta1,Beta2,Beta3
0,jennis,0.0,-0.148584,-0.067012,0.691701
1,wallop,0.0,0.0,-0.01872,0.402537
2,freebairn,0.0,-0.12364,0.0,0.391155
3,troop,0.0,-0.10455,-0.071294,0.38597
4,nicol,0.0,0.0,0.0,0.333689
5,pat,-0.192517,-0.247104,-0.265129,0.332954
6,missil,-0.012079,-0.058853,-0.022674,0.321594
7,tdrs,0.0,0.0,0.0,0.320883
8,astro,-0.031615,-0.098924,-0.160845,0.310421
9,adam,0.0,-0.084369,-0.135387,0.292048


## Predictive Analytics

This part is also similar to the binary case except that there are four predicted probabilities instead of two.

In [15]:
#Apply the model to the testing set and predict the class labels
sparselr.predict(test_x)[0:10]

array(['med', 'electronics', 'space', 'electronics', 'crypt',
       'electronics', 'electronics', 'electronics', 'crypt', 'med'],
      dtype=object)

In [16]:
#Predict the probability of each doc being in each class 
#The columns correspond to 'crypt', 'electronics', 'med' and 'space' (in alphabetical order)
sparselr.predict_proba(test_x)[0:10]  #Here, 0:10 means showing the results for only 10 documents

array([[1.69325153e-03, 4.44243027e-02, 9.38697803e-01, 1.51846429e-02],
       [6.16509219e-04, 9.95105465e-01, 8.40622155e-04, 3.43740329e-03],
       [2.72096500e-04, 9.20785674e-06, 7.18886624e-03, 9.92529829e-01],
       [5.50055022e-04, 9.97643472e-01, 1.40164243e-03, 4.04831000e-04],
       [9.55059579e-01, 2.98753415e-02, 9.63979128e-03, 5.42528805e-03],
       [1.17472274e-01, 5.28819865e-01, 2.56899361e-01, 9.68084996e-02],
       [1.08514438e-03, 9.80360385e-01, 1.32242090e-02, 5.33026128e-03],
       [5.38177645e-03, 8.75056110e-01, 1.07040148e-01, 1.25219657e-02],
       [9.98008426e-01, 1.33859357e-03, 2.30822650e-04, 4.22158253e-04],
       [9.85681588e-04, 3.36680505e-03, 9.42891313e-01, 5.27562001e-02]])
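
The link between the two outputs above can be checked by hand: each row of probabilities sums to 1 (up to rounding), and the predicted label is the class with the largest probability. The sketch below copies the first three rows of the printed probabilities; the `classes` array is written out manually in alphabetical order.

```python
import numpy as np

classes = np.array(["crypt", "electronics", "med", "space"])  # alphabetical order

# First three rows of the predict_proba output shown above.
proba = np.array([
    [1.69325153e-03, 4.44243027e-02, 9.38697803e-01, 1.51846429e-02],
    [6.16509219e-04, 9.95105465e-01, 8.40622155e-04, 3.43740329e-03],
    [2.72096500e-04, 9.20785674e-06, 7.18886624e-03, 9.92529829e-01],
])

row_sums = proba.sum(axis=1)                 # each row should be close to 1
labels = classes[proba.argmax(axis=1)]       # most probable class per document
print(row_sums, labels)
```

The resulting labels agree with the first three entries returned by `sparselr.predict(test_x)`.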

## Performance Metric

**Confusion table**: In the example below, 10, 2, 2 and 2 instances from the true Class0 are predicted as Class0, Class1, Class2 and Class3, respectively. The other rows are read in the same way.

| | Pred Class0 | Pred Class1 | Pred Class2 | Pred Class3 |
| --- | --- | --- | --- | --- |
| True Class0 | 10 | 2 | 2 | 2 |
| True Class1 | 2 | 10 | 2 | 2 |
| True Class2 | 2 | 2 | 10 | 2 |
| True Class3 | 2 | 2 | 2 | 10 |

**Accuracy**: The percentage of correct predictions made by a model. 

- With the confusion table above, accuracy=$\frac{10+10+10+10}{10+10+10+10+2\times12}=\frac{40}{64}=62.5\%$.

In [17]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
print("Train Confusion Matrix:")
print(confusion_matrix(train_y, sparselr.predict(train_x)))
print("Test Confusion Matrix:")
print(confusion_matrix(test_y, sparselr.predict(test_x)))

Train Confusion Matrix:
[[692   0   0   0]
 [  0 669   0   0]
 [  0   0 701   0]
 [  0   0   1 700]]
Test Confusion Matrix:
[[278  12   1   8]
 [  4 299   5   7]
 [  3  10 273   1]
 [  1   8   6 269]]
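
Accuracy can be read directly off a confusion matrix: the diagonal holds the correct predictions, and the sum of all entries is the number of documents. As a small worked check, the sketch below re-enters the test confusion matrix printed above by hand.

```python
import numpy as np

# Test confusion matrix printed above (rows = true class, columns = predicted class).
cm = np.array([[278,  12,   1,   8],
               [  4, 299,   5,   7],
               [  3,  10, 273,   1],
               [  1,   8,   6, 269]])

correct = np.trace(cm)        # diagonal entries are the correct predictions
total = cm.sum()              # all test documents
accuracy = correct / total
print(correct, total, accuracy)
```

This gives 1119 correct predictions out of 1185 documents, which is the same value that `accuracy_score` reports for the testing set.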


In [18]:
#Performance evaluation in terms of accuracy
from sklearn.metrics import accuracy_score
print("Train Accuracy:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

Train Accuracy:
0.9996380745566413
Test Accuracy:
0.9443037974683545


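For multi-class classification, the argument **multi_class="ovr"** should be passed to **roc_auc_score** in order to calculate AUC scores. Each class is then scored against the rest and the per-class AUCs are averaged.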

In [19]:
#Performance evaluation in terms of AUC
from sklearn.metrics import roc_auc_score
print("Train AUC:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x), multi_class="ovr"))
print("Test AUC:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x), multi_class="ovr"))

Train AUC:
0.9999996540898342
Test AUC:
0.9919739991643397
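
To see what the one-vs-rest AUC measures, the sketch below recomputes it by hand on made-up toy data. For each class, AUC is taken as the chance that a randomly chosen document of that class gets a higher predicted probability for the class than a randomly chosen document of another class. The per-class values are then averaged with equal weights, which is our understanding of the default behaviour of `roc_auc_score` with `multi_class="ovr"`. The helper names `binary_auc` and `ovr_macro_auc` are hypothetical and not part of sklearn.

```python
import numpy as np

def binary_auc(is_positive, score):
    """Chance that a random positive outranks a random negative (ties count 1/2)."""
    pos = score[is_positive]
    neg = score[~is_positive]
    diff = pos[:, None] - neg[None, :]            # all positive-negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def ovr_macro_auc(y, proba):
    """Unweighted average of the one-vs-rest AUC over classes 0, 1, ..., K-1."""
    return float(np.mean([binary_auc(y == k, proba[:, k])
                          for k in range(proba.shape[1])]))

# Toy data: 5 documents, 3 classes (made-up probabilities).
y = np.array([0, 0, 1, 2, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.3, 0.6, 0.1],
                  [0.5, 0.1, 0.4],
                  [0.1, 0.2, 0.7]])

print(ovr_macro_auc(y, proba))
```

In this toy example Class 0 has an AUC of 5/6 (one of its six positive-negative pairs is ranked the wrong way), while Classes 1 and 2 are ranked perfectly, so the average is 17/18.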


## Cross Validation for Multi-Class SLR

Searching for the optimal $C$ can be done with cross validation. The code is exactly the same as in the binary case except that **multi_class='ovr'** is set. 

In [20]:
#Generate the grid of parameters that increase proportionally.
import numpy as np
from sklearn.svm import l1_min_c
param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
param_grid

array([4.60364177e-04, 8.43838656e-04, 1.54673998e-03, 2.83514455e-03,
       5.19676527e-03, 9.52557049e-03, 1.74601870e-02, 3.20041859e-02,
       5.86630555e-02, 1.07528249e-01, 1.97097206e-01, 3.61275378e-01,
       6.62210798e-01, 1.21381962e+00, 2.22490795e+00, 4.07821336e+00,
       7.47528641e+00, 1.37020558e+01, 2.51156040e+01, 4.60364177e+01])
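
The grid above is a geometric sequence: it starts at the value returned by `l1_min_c` (the smallest $C$ for which the model is not guaranteed to be empty) and ends at $10^5$ times that value. The sketch below rebuilds it from the first printed value, which is copied by hand here, so no data is needed.

```python
import numpy as np

c_min = 4.60364177e-04                               # first grid value printed above
multipliers = np.logspace(start=0, stop=5, num=20)   # 10**0 to 10**5, evenly spaced on a log scale
grid = c_min * multipliers

ratio = grid[1:] / grid[:-1]                         # ratio between neighbouring grid values
print(grid[0], grid[-1], ratio[0])
```

Each candidate is about 1.83 times the previous one (that is, $10^{5/19}$), so small and large values of $C$ are explored equally densely on a log scale.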

In [21]:
from sklearn.linear_model import LogisticRegressionCV
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                multi_class='ovr', 
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.01,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

LogisticRegressionCV(Cs=array([4.60364177e-04, 8.43838656e-04, 1.54673998e-03, 2.83514455e-03,
       5.19676527e-03, 9.52557049e-03, 1.74601870e-02, 3.20041859e-02,
       5.86630555e-02, 1.07528249e-01, 1.97097206e-01, 3.61275378e-01,
       6.62210798e-01, 1.21381962e+00, 2.22490795e+00, 4.07821336e+00,
       7.47528641e+00, 1.37020558e+01, 2.51156040e+01, 4.60364177e+01]),
                     cv=5, max_iter=1000, multi_class='ovr', penalty='l1',
                     random_state=2021, scoring='accuracy', solver='liblinear',
                     tol=0.01)

To use AUC as the performance metric, set **scoring='roc_auc'** above.

In [22]:
#Performance evaluation in terms of accuracy
from sklearn.metrics import accuracy_score
print("Train Accuracy:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

Train Accuracy:
0.9996380745566413
Test Accuracy:
0.950210970464135


In [23]:
#Performance evaluation in terms of AUC
from sklearn.metrics import roc_auc_score
print("Train AUC:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x), multi_class="ovr"))
print("Test AUC:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x), multi_class="ovr"))

Train AUC:
0.9999996540898344
Test AUC:
0.9931027019401462


## Multi-Class XGBoost 

The code for building a multi-class XGBoost model is **exactly the same** as in the binary case, except that the text class labels must first be encoded into 0, 1, 2, 3 in alphabetical order. See **preprocessing.LabelEncoder()** below.

In [24]:
#!pip3 install xgboost
from xgboost import XGBClassifier

In [25]:
#Initialize an XGBoost model
xgb=XGBClassifier(n_estimators=200,    #How many trees in total
                  max_depth=5,         #The depth of each tree
                  nthread=4,           #Multi-thread speed up
                  use_label_encoder=False,  #To avoid an unimportant warning message 
                  verbosity = 0,       #Hide other messages during training
                  random_state=2021)   #Fix the random seed so that the results are reproducible

**XGBClassifier** requires numeric class labels, so we need to encode the labels first. By default, they are encoded in alphabetical order. In this case, 0=crypt, 1=electronics, 2=med, 3=space.

In [26]:
#Encode the text labels into 0, 1, 2,...
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
train_y=le.fit_transform(train_y)
test_y=le.transform(test_y)
train_y

array([3, 2, 1, ..., 2, 3, 1])

In [27]:
xgb.fit(train_x, train_y)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=200,
              n_jobs=4, nthread=4, num_parallel_tree=1,
              objective='multi:softprob', predictor='auto', random_state=2021, ...)

## Descriptive Analytics (XGBoost)

Gradient boosting calculates an importance score for each term that indicates how useful that term was in the construction of the decision trees within the model.

The more useful a term is in making a prediction, the higher its importance is.

In [28]:
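#get_feature_names_out() needs scikit-learn 1.0 or later; older versions use get_feature_names()
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),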
                       'Importance': xgb.feature_importances_
                     })
dfbeta.sort_values(by="Importance",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

Unnamed: 0,Term,Importance
0,medic,0.023729
1,amp,0.020726
2,encrypt,0.018355
3,clipper,0.018019
4,circuit,0.016998
5,gordon,0.01549
6,space,0.012574
7,key,0.011997
8,doctor,0.011246
9,treatment,0.010725


## Predictive Analytics  (XGBoost)

Predictions by XGBoost are implemented in a way similar to SLR.

In [29]:
#Apply the model to the testing set and predict the classes (encoded as 0, 1, 2, 3)
xgb.predict(test_x)[0:10]

array([2, 1, 3, 1, 0, 2, 1, 1, 0, 2])
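
Because the predictions are integer codes, they usually need to be mapped back to topic names before they are reported. The sketch below does this by indexing a hand-written array of class names with the codes printed above. In the notebook itself, `le.inverse_transform(...)` from the fitted `LabelEncoder` should give the same mapping.

```python
import numpy as np

classes = np.array(["crypt", "electronics", "med", "space"])  # codes 0, 1, 2, 3

# Encoded predictions printed above.
pred_codes = np.array([2, 1, 3, 1, 0, 2, 1, 1, 0, 2])

pred_labels = classes[pred_codes]            # integer code -> text label
print(pred_labels)
```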

In [30]:
#Predict the probability in each class (in alphabetical order of the classes)
xgb.predict_proba(test_x)[0:10]

array([[1.8628282e-03, 3.6727185e-03, 9.8960155e-01, 4.8629339e-03],
       [2.5772478e-03, 9.7599483e-01, 1.6008351e-02, 5.4195407e-03],
       [1.1951150e-03, 3.1492312e-04, 2.4162915e-03, 9.9607372e-01],
       [2.0617708e-05, 9.9991596e-01, 1.7888802e-05, 4.5530174e-05],
       [9.9874842e-01, 9.3018153e-04, 2.2312807e-04, 9.8274104e-05],
       [6.5765195e-02, 4.0896973e-01, 4.4393978e-01, 8.1325270e-02],
       [1.5990302e-03, 9.8985171e-01, 4.8631728e-03, 3.6861652e-03],
       [5.8021598e-02, 7.4757516e-01, 1.4469145e-01, 4.9711838e-02],
       [9.9981886e-01, 4.0889270e-05, 1.1698996e-05, 1.2859701e-04],
       [1.0361956e-05, 2.0908603e-05, 9.9993110e-01, 3.7624468e-05]],
      dtype=float32)

In [31]:
print("Train Accuracy:")
print(accuracy_score(train_y,xgb.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,xgb.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,xgb.predict_proba(train_x), multi_class="ovr"))
print("Test AUC:")
print(roc_auc_score(test_y,xgb.predict_proba(test_x), multi_class="ovr"))

Train Accuracy:
0.9996380745566413
Test Accuracy:
0.9308016877637131
Train AUC:
0.9999996540898343
Test AUC:
0.9914291374755656


## Cross Validation (XGBoost)

The cross validation for XGBoost can be done in the same way as in the binary case. The code may take about 30 seconds. Please be patient.

In [32]:
from sklearn.model_selection import GridSearchCV   
param_list = {  
 'max_depth':[2, 5],       #Candidate for max_depth
 'n_estimators':[10, 100]  #Candidate for n_estimators
}
xgb=XGBClassifier(nthread=4,
                  use_label_encoder=False,
                  verbosity = 0,
                  random_state=2021
                 )
xgb = GridSearchCV(estimator = xgb, 
                   param_grid = param_list,
                   scoring = 'accuracy',  #The performance metric to select the best parameters.
                   cv=5                   #Number of folds, i.e., K
                  )  
xgb.fit(train_x, train_y)

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_cat_to_onehot=None,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
    

In [33]:
#This is the best combination of parameters.
xgb.best_params_

{'max_depth': 5, 'n_estimators': 100}
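
The running time of the grid search follows from simple counting: every combination of candidate values is trained once per fold. The sketch below enumerates the combinations in plain Python, re-typing the same candidate lists as in the grid above.

```python
from itertools import product

param_list = {
    "max_depth": [2, 5],
    "n_estimators": [10, 100],
}
cv = 5  # number of folds

# Every combination of the candidate values, in the order of the dictionary keys.
combos = [dict(zip(param_list, values)) for values in product(*param_list.values())]
n_fits = len(combos) * cv     # one model per combination per fold

print(combos)
print(n_fits)
```

With 4 combinations and 5 folds, 20 models are trained during the search. With its default settings, `GridSearchCV` then refits one more model on the whole training set using the best combination.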

Evaluate the performance using the best parameters.

In [34]:
#Confusion matrix
from sklearn.metrics import confusion_matrix
print("Train Confusion Matrix:")
print(confusion_matrix(train_y, xgb.predict(train_x)))
print("Test Confusion Matrix:")
print(confusion_matrix(test_y, xgb.predict(test_x)))

Train Confusion Matrix:
[[691   0   1   0]
 [  0 667   2   0]
 [  0   1 700   0]
 [  0   0   2 699]]
Test Confusion Matrix:
[[279  14   4   2]
 [  4 288  15   8]
 [  1   9 271   6]
 [  2   7  11 264]]
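
A confusion matrix also shows which classes are harder than others. As an optional extra beyond accuracy, the sketch below computes, for each class, the share of its true documents that are recovered (recall) and the share of the predictions for that class that are correct (precision). The test confusion matrix printed above is re-entered by hand.

```python
import numpy as np

classes = ["crypt", "electronics", "med", "space"]

# XGBoost test confusion matrix printed above (rows = true, columns = predicted).
cm = np.array([[279,  14,   4,   2],
               [  4, 288,  15,   8],
               [  1,   9, 271,   6],
               [  2,   7,  11, 264]])

recall = np.diag(cm) / cm.sum(axis=1)      # share of each true class that is recovered
precision = np.diag(cm) / cm.sum(axis=0)   # share of each predicted class that is correct

for name, r, p in zip(classes, recall, precision):
    print(f"{name:12s} recall={r:.3f} precision={p:.3f}")
```

In this matrix, 'electronics' has the lowest recall (288 of 315), mostly because 15 of its posts are predicted as 'med'.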


In [35]:
print("Train Accuracy:")
print(accuracy_score(train_y,xgb.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,xgb.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,xgb.predict_proba(train_x), multi_class="ovr"))
print("Test AUC:")
print(roc_auc_score(test_y,xgb.predict_proba(test_x), multi_class="ovr"))

Train Accuracy:
0.997828447339848
Test Accuracy:
0.929957805907173
Train AUC:
0.9999924829076889
Test AUC:
0.9914886153602305
