# Experimento XGBoost

Fontes primárias:
- Tutorial Using XGBoost in Python: https://www.datacamp.com/community/tutorials/xgboost-in-python
- Documentação XGBoost Python API: https://xgboost.readthedocs.io/en/latest/python/python_api.html
- XGBoost Python Package Introduction: https://xgboost-clone.readthedocs.io/en/latest/python/python_intro.html
- Hyperparameter tuning in XGBoost: https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f

### Import de ambos os arquivos csv:
 - train.csv, do trabalho de Pedro

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [2]:
path_to_text = '/home/mstauffer/Documentos/UnB/9º Semestre/KnEDle/sprints/5_27_maio-03_junho/luz_de_araujo_etal_propor2020/data/clean/train.csv'
data_peter = pd.read_csv(path_to_text, encoding='utf8')#[['v1', 'v2']]
# Creating the feature set and label set
textPT = data_peter['text']
labelPT = data_peter['label']
print(data_peter[10:14])

                                                label  \
10  SECRETARIA DE ESTADO DE DESENVOLVIMENTO URBANO...   
11  SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO,...   
12                      SECRETARIA DE ESTADO DE SAÚDE   
13          SECRETARIA DE ESTADO DE SEGURANÇA PÚBLICA   

                                                 text  is_valid  
10  O Termo de Recebimento Definitivo declarará fo...     False  
11  O DISTRITO FEDERAL, por intermédio da Diretori...     False  
12  O SECRETÁRIO DE ESTADO DE SAÚDE DO DISTRITO FE...     False  
13  O DIRETOR-GERAL DO DEPARTAMENTO DE TRÂNSITO DO...     False  


### Pré-processamento para X e y
Precisamos processar os valores dos dataframes X e y para inserção na estrutura de dados do XGBoost, DMatrix.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = list(textPT)
X_tfidf = TfidfVectorizer(max_features = 6000) 
X_tfidf.fit(corpus)
X_tfidf_features = X_tfidf.transform(corpus)
print(X_tfidf_features.shape)

(717, 6000)


In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support
# Converting the labels from strings to binary
le = LabelEncoder()
le.fit(labelPT)
y_labelPT = le.transform(labelPT)
print(y_labelPT)
print(y_labelPT.shape)

[15  6 15 11  0 13 10  2 14 18  8 11 14 15  4 18 15 15 10 10 10 12 11 11
 15 15 11  6 14 15 11  0 16 11  1 18 15  8 13 10  6 15  6 15 16 10  8 12
 11 15 15 13 16 15  3  5  3  6 11 15  0  2 15 15 11 15 16 14  5 15 14  2
 18 14 11 14  3 12 16  6 16 15 13 12  2 15  6 13  0 13  0 12 18 14  9 16
 14 14 15 11 15 15 13 15 13 15 14 13 18 11 14  7 15 18 16  6 10 16  0 11
  6 16 10 17  0  8 15 15  6 11 10  8 16 11 10 15 10 14 16  0 15 15 14  3
 11 14 11  0  3 15  7  7 14 12 18  2 13 13 18 18 10  0  0 15 14 16 18  5
 14 11 16  8 11 15 15 16 18 11 14 11 16  3  6 10 15  0  0 14 10  6 15 15
 10 15  0  0 14 13 14 15 15  2 16 13 15 10 12 16 11 11 13 11 14 15 15 10
  4 14 14 14 15 14 13 14  0  1  0  9 10  2 10 16 13  3 11 15  6  0 15  3
 13  0 10 11  0 15 15 11 15  3 15 15 15  3 18 15 16  6  1 16 16 14 15 14
 13 14 18  6 15  3  3 14 13  0 16  0 15  5 13 15 15 16 14 18 15  0 13 14
 15 13 15  0  0 11 14 15 13 15  0  0 15 15  8 15 15 11 14  6 15 14 14  0
 16  2  6  2 17 13  7  2 15 16  3 14 13 16 15  2  0

In [5]:
data_dmatrix = xgb.DMatrix(data=X_tfidf_features, label=y_labelPT)

### Split em treino e teste

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_tfidf_features, y_labelPT, test_size=0.2, random_state=123)
print(X_train)
print(X_test)
print(y_train)
print(y_test)



  (0, 5865)	0.12310278357902607
  (0, 5844)	0.07214349812284936
  (0, 5835)	0.0868076648446096
  (0, 5754)	0.041587122710311904
  (0, 5743)	0.10162001575332077
  (0, 5668)	0.028382729835351277
  (0, 5420)	0.2645848762644777
  (0, 5401)	0.05307699402489731
  (0, 5400)	0.04830196249802336
  (0, 5385)	0.07645466551500592
  (0, 5356)	0.09638715650117681
  (0, 5349)	0.05393150042329729
  (0, 5331)	0.0282035262524768
  (0, 5270)	0.2223558217390038
  (0, 5251)	0.07931376976638826
  (0, 5130)	0.05583318328674835
  (0, 5018)	0.09662967176577104
  (0, 4963)	0.052275930237077724
  (0, 4957)	0.09193180845870408
  (0, 4956)	0.06077461776690305
  (0, 4929)	0.2379413092991648
  (0, 4863)	0.1494332956490792
  (0, 4832)	0.05635622926945937
  (0, 4831)	0.05128648248887804
  (0, 4790)	0.06575128557963889
  :	:
  (572, 855)	0.10020939169551783
  (572, 754)	0.09235404367340458
  (572, 615)	0.09712011448381817
  (572, 575)	0.13776221649461948
  (572, 537)	0.08245792432452834
  (572, 517)	0.07229381086033336

### Instanciação de objeto XGBClassifier
Instanciação e fit do objeto com X e y de treino.

Refs: 
- https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier
- https://info.cambridgespark.com/latest/getting-started-with-xgboost

In [7]:
xg_cls = xgb.XGBClassifier(base_score=0.5, colsample_bylevel=1, objective='multi:softprob', colsample_bytree = 1, gamma=0, learning_rate = 0.1, max_delta_step=0,
                max_depth = 3, min_child_weight=1, missing=None, alpha = 10, n_estimators = 100, nthread=-1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, subsample=1)

In [8]:
xg_cls.fit(X_train, y_train)

Parameters: { scale_pos_weight } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, nthread=-1, num_parallel_tree=1,
              objective='multi:softprob', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=42, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

### Predições e métrica F1-score

In [12]:
preds = xg_cls.predict(X_test)

In [13]:
print(preds)

[14 14 14  0  6 16  6 14  0 15 10 15  0  0 16 15 10  6 15  8 15 13 14  0
 15 11 15 12 18 15 15 11  0 10 11  3 15 16 15 12 15 13 11  6 13  6 16 15
  3 15 15 14  2  2 14 14 13 13 10 14 16 11  6  6 13  9 15 14 16  6  3 10
 16 11 12 11 13 13 15 18 10 18  6  3  6 15 14 11 11 14  2 15 16 11 18  5
  5 14 15 10 13 11 13 14 18  1  0 13 11 13 16 13  0 10 13 14 13 11  3  7
 15 18 18  0 10  8 15 16 14 11 15 15 15 11  0 11 11  0 15 15 15  3 14 16]


In [14]:
from sklearn.metrics import f1_score

print('F1-score - Average Micro')
print(f1_score(y_test, preds, average='micro'))
print('F1-score - Average Weighted')
print(f1_score(y_test, preds, average='weighted'))

F1-score - Average Micro
0.9305555555555556
F1-score - Average Weighted
0.9265797062136476


In [None]:
resultsPT = pd.DataFrame(index = ['TF-IDF - XGBoost'], 
          columns = ['Precision', 'Recall', 'F1 score', 'support']
          )
resultsPT.loc['TF-IDF'] = precision_recall_fscore_support(
          labelPT[train_cutoff + 1 : len(text)], 
          tfidf_predictionPT, 
          average = 'weighted'
          )

### Tabelas de resultados

- Para fins de comparação, resultados dados Pedro TFIDF + SVM (Em português)

In [20]:
resultsPT.head()

Unnamed: 0,Precision,Recall,F1 score,support
Word Embedding,0.835766,0.833333,0.828399,
TF-IDF,0.904991,0.897059,0.894488,
