# Building a Text Classifier
One of the most common applications of text mining is the classification of documents into predefined categories, whether by author, theme, period, or some other criterion.

In this chapter we explore the steps needed to develop a document classifier, building on the analyses of the DHBB corpus carried out in the previous chapters.

For this task we use classical machine learning models from the [Scikit-Learn](https://scikit-learn.org/) library. We start by importing some functionality from Scikit-Learn. The remaining imports have already been used in earlier chapters.

In [16]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec, word2vec
import spacy
from spacy import displacy
from string import punctuation
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

## Preparing the corpus
We will use the DHBB corpus as previously stored in the SQLite database. Below we develop an iterator over the corpus that applies basic preprocessing to each document.

In [23]:
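from sqlalchemy import text

eng = create_engine("sqlite:///minha_tabela.sqlite")
nlp = spacy.load("pt_core_news_sm")
class DHBBCorpus:
    """Iterates over the DHBB entries, yielding each one as a list of tokens."""
    def __init__(self, ndocs=10000):
        # the corpus has 7687 entries, so ndocs is capped at that number
        self.ndocs = min(7687,ndocs)
        self.counter = 1
    def __iter__(self):
        self.counter = 1  # restart the progress counter on every pass
        with eng.connect() as con:
            # text() wraps the raw SQL string, as required by SQLAlchemy 2.x
            res = con.execute(text(f'select corpo from resultados limit {self.ndocs};'))
            for doc in res:
                d = self.pre_process(doc[0])
                if self.counter%10 == 0:
                    # progress message ("Verbete" = entry), overwritten in place
                    print (f"Verbete {self.counter} de {self.ndocs}\r", end='')
                
                yield d
                self.counter += 1
    def pre_process(self, doc):
        # only the tokens and their stop-word flag are used, so the listed components are skipped
        n = nlp(doc, disable=['tagger', 'ner','entity-linker', 'textcat','entity-ruler','merge-noun-chunks','merge-entities','merge-subtokens'])
        # drop stop words; strip whitespace and surrounding punctuation from the remaining tokens
        results = [token.text.strip().strip(punctuation) for token in n if not token.is_stop]
        return results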
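
The iterator pattern above can be illustrated without spaCy or the real database. The sketch below is a simplified stand-in, not the chapter's pipeline: it assumes an in-memory SQLite table that reuses the `resultados` and `corpo` names, a hypothetical five-word stop list (`STOP_WORDS`), and whitespace tokenization in place of spaCy's tokenizer. Unlike the class above, it also drops tokens that end up empty once punctuation is stripped.

```python
import sqlite3
from string import punctuation

# Hypothetical stop list for illustration; spaCy ships a much larger one.
STOP_WORDS = {"de", "a", "o", "em", "e"}

def pre_process(doc):
    """Whitespace-tokenize, strip punctuation, drop stop words and empty tokens."""
    tokens = (tok.strip(punctuation) for tok in doc.split())
    return [tok for tok in tokens if tok and tok.lower() not in STOP_WORDS]

class ToyCorpus:
    """Re-iterable corpus: every pass runs a fresh query and streams the rows."""
    def __init__(self, con, ndocs):
        self.con = con
        self.ndocs = ndocs

    def __iter__(self):
        cur = self.con.execute("select corpo from resultados limit ?", (self.ndocs,))
        for (corpo,) in cur:
            yield pre_process(corpo)

# Made-up documents in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("create table resultados (corpo text)")
con.executemany(
    "insert into resultados values (?)",
    [("Nasceu em Salvador, em 1900.",), ("Partido político fundado em 1945!",)],
)

corpus = ToyCorpus(con, ndocs=10)
docs = list(corpus)
print(docs)
```

Because `__iter__` issues a new query each time, the corpus can be traversed more than once, which is what distinguishes it from a plain generator.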

## Loading the Word2vec Model
We will use the vector representation of the corpus built earlier as the basis for training the classifier.

In [3]:
model = Word2Vec.load('dhbb.w2v')

In [24]:
model.wv.vectors.shape

(38762, 100)

Since Word2vec is a vector representation of the corpus vocabulary, and we want to train a model that classifies documents, we first need to represent the documents themselves in the same vector space generated by Word2vec.

In the function below, we build a document vector that is the mean of the vectors of the unique words the document contains. The words are deduplicated with `set` before the function is called.

In [14]:
def build_document_vector(text):
    """
    Build a vector for the document: the mean of the vectors of the words present in it.
    Uses the global word2vec `model`; words missing from its vocabulary are skipped.
    :param text: document to be vectorized (iterable of tokens)
    :return: array of shape (1, feature_count); all zeros if no word is in the vocabulary
    """
    feature_count = model.wv.vectors.shape[1]
    vec = np.zeros(feature_count).reshape((1, feature_count))
    count = 0.

    for word in text:
        try:
            vec += model.wv[word].reshape((1, feature_count))
            count += 1.
        except KeyError:
            # word not in the word2vec vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

In [29]:

def gera_docv(n):
    corpus = DHBBCorpus(n)
    for doc in corpus:
        v = build_document_vector(set(doc))
        yield v
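
To make the averaging concrete, here is a small sketch with a hypothetical three-word embedding table (`toy_vectors`) standing in for `model.wv`. The helper `mean_vector` is also made up for this illustration. It shows that out-of-vocabulary words are skipped and that deduplicating with `set` changes the mean when a word repeats.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for model.wv
toy_vectors = {
    "ministro": np.array([1.0, 0.0]),
    "partido": np.array([0.0, 1.0]),
    "eleito": np.array([1.0, 1.0]),
}

def mean_vector(tokens, vectors, dim=2):
    """Mean of the vectors of the known tokens; zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros((1, dim))
    return np.mean(known, axis=0).reshape(1, dim)

doc = ["ministro", "ministro", "partido", "desconhecida"]
with_repeats = mean_vector(doc, toy_vectors)      # "ministro" is counted twice
unique_only = mean_vector(set(doc), toy_vectors)  # each word is counted once

print(with_repeats)
print(unique_only)
```

Averaging over unique words keeps very frequent words from dominating the document vector, at the cost of discarding frequency information.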


## Preparing the training data for the classifier

In [37]:
gerador = gera_docv(10000)
data = pd.DataFrame(data=np.vstack([a for a in gerador]), columns=range(100))
data

  


Verbete 7680 de 7687

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.296355 | -0.369351 | -0.219112 | -0.011535 | -0.356944 | 0.057425 | 0.636584 | -0.031091 | -0.371255 | 0.106736 | ... | -0.075344 | -0.159404 | -0.233228 | 0.010510 | 0.757997 | 0.168332 | 0.209847 | -0.646779 | -0.494476 | 0.050851 |
| 1 | -0.047746 | -0.046702 | -0.190949 | 0.577955 | 0.155968 | 0.225539 | 0.211839 | -0.107882 | -0.071698 | 0.108209 | ... | 0.343698 | 0.062024 | -0.120656 | 0.116875 | 0.259419 | 0.288715 | 0.168819 | -0.298981 | -0.517028 | 0.308318 |
| 2 | 0.191656 | -0.186713 | -0.269119 | 0.111749 | -0.207819 | 0.006255 | 0.069423 | -0.323620 | -0.119195 | -0.002357 | ... | -0.190077 | 0.420614 | -0.118308 | 0.006932 | 0.213823 | 0.026442 | -0.033372 | -0.273386 | 0.125667 | 0.066307 |
| 3 | 0.003558 | -0.059873 | -0.280428 | 0.378681 | 0.008146 | 0.141234 | 0.169813 | -0.022787 | -0.217140 | 0.172066 | ... | 0.196561 | 0.124252 | 0.025444 | 0.031496 | -0.001780 | 0.221582 | 0.003191 | -0.161416 | -0.288433 | 0.343550 |
| 4 | 0.033655 | -0.257470 | -0.435646 | 0.096290 | -0.507049 | -0.220573 | 0.253772 | 0.591106 | -0.738819 | 0.498734 | ... | 0.052100 | -0.008746 | -0.341211 | -0.454881 | 1.328264 | 0.269425 | 0.735788 | -0.424710 | -0.949855 | 0.047187 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7682 | -0.097823 | -0.078286 | -0.375540 | 0.583358 | -0.058045 | 0.269644 | 0.320550 | -0.223646 | -0.040692 | -0.032362 | ... | 0.273379 | 0.021381 | -0.304205 | 0.141669 | 0.163567 | 0.197348 | 0.046773 | -0.304688 | -0.608374 | 0.099034 |
| 7683 | 0.179406 | -0.121504 | -0.562740 | 0.047880 | -0.832265 | -0.569388 | 0.306019 | 0.671887 | -0.339240 | 0.517512 | ... | -0.486355 | -0.092336 | -0.464590 | -0.744464 | 1.226775 | 0.115940 | 0.150833 | -0.272562 | -0.524888 | -0.228353 |
| 7684 | -0.303439 | -0.413298 | -0.492052 | 0.071817 | -0.198558 | 0.158098 | 0.492865 | 0.041102 | -0.143934 | 0.048369 | ... | -0.009319 | -0.001092 | -0.335632 | -0.023029 | 0.161608 | 0.404850 | 0.005394 | -0.387414 | -0.421911 | 0.199390 |
| 7685 | 0.115561 | -0.189773 | -0.054918 | -0.000709 | -0.266947 | 0.112972 | 0.494245 | -0.106279 | -0.168027 | 0.263188 | ... | -0.033749 | -0.103522 | -0.494310 | -0.084172 | 0.415913 | 0.078416 | 0.180037 | -0.411079 | -0.181538 | 0.041477 |


### Defining the category of each document for training

In [42]:
def gera_alvo():
    # binary target: True for biographical entries, False for thematic ones
    df = pd.read_sql_query('select * from resultados', con=eng)
    alvo = df.natureza.values=='biográfico'
    return alvo
Y = gera_alvo()
len(Y)


7687
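
Since the target is a boolean array, it is easy to check how the two classes are distributed before training. The sketch below uses made-up labels in place of the `natureza` column (the value `'temático'` is an assumption for illustration) and shows how the comparison produces the target and how the class counts follow from it.

```python
import numpy as np

# Made-up labels standing in for df.natureza.values
natureza = np.array(['biográfico', 'temático', 'biográfico', 'biográfico'])

Y_toy = natureza == 'biográfico'   # element-wise comparison gives a boolean array
n_bio = int(Y_toy.sum())           # True counts as 1
n_other = int((~Y_toy).sum())

print(Y_toy)
print(f"biographical: {n_bio}, other: {n_other}, share: {Y_toy.mean():.2f}")
```

When one class is much larger than the other, a stratified split such as `StratifiedKFold` keeps the class proportions similar in every fold.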

In [53]:
def print_class_report(Xtest, Ytest, clf, clf_name):
    """
    Prints Classification report
    :param Xtest:
    :param Ytest:
    :param clf: trained classifier
    :param clf_name: Name for the classifier
    """
    y_predict = clf.predict(Xtest)
    print('\nClassification Report for {}:\n'.format(clf_name))
    print(classification_report(Ytest, y_predict, target_names=['Temático', 'Biográfico']))
    

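def plot_roc(probas):
    """
    Computes and prints the ROC AUC of every fold for each classifier
    :param probas: dict mapping a classifier name to a list with one
        (predicted probabilities, true labels) pair per fold
    """
    tprs = []
    fprs = []

    for k, v in probas.items():
        roc_aucs = []
        for fold in v:
            try:
                # fold[0][:, 1] is the predicted probability of the positive class
                fpr, tpr, thresholds = roc_curve(fold[1], fold[0][:, 1])
            except IndexError:
                # the entry is not a (probabilities, labels) pair: show it and skip it
                print(fold[0], fold[0].shape)
                continue
            roc_aucs.append(auc(fpr, tpr))
            tprs.append([float(t) for t in tpr])
            fprs.append([float(f) for f in fpr])

        print('{}: AUCs: {}'.format(k, str(roc_aucs)))
    # the curves are not drawn here; fprs and tprs hold the points of each curve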
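
The AUC values printed by `plot_roc` have a direct interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by brute force on four made-up scores. It illustrates the definition with a hypothetical helper (`pairwise_auc`); it is not how `roc_curve` and `auc` arrive at the number internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

result = pairwise_auc(y_true, scores)
print(result)  # 3 of the 4 (positive, negative) pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.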
    

## Defining the models

In [54]:
rfclf = RandomForestClassifier(n_estimators=400, criterion='entropy', n_jobs=-1,
                               min_samples_leaf=3, warm_start=True, verbose=0)
etclf = ExtraTreesClassifier(n_estimators=400, n_jobs=-1, min_samples_leaf=3,
                             warm_start=True, verbose=0)
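
Both forests above are created with `warm_start=True`. With that setting, calling `fit` again without raising `n_estimators` keeps the trees that already exist instead of growing new ones, which matters when the same estimator object is fitted once per cross-validation fold. The sketch below shows this on synthetic data (`make_classification`, not the DHBB vectors) with a deliberately small forest; `sklearn.base.clone` is one way to obtain a fresh, unfitted copy.

```python
import warnings

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, split into two halves that play the role of two folds
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_a, y_a, X_b, y_b = X_toy[:100], y_toy[:100], X_toy[100:], y_toy[100:]

rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
rf.fit(X_a, y_a)
first_tree = rf.estimators_[0]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silences the "Warm-start fitting ..." warning
    rf.fit(X_b, y_b)                 # same n_estimators: no new trees are grown

reused = rf.estimators_[0] is first_tree
fresh = clone(rf)                    # same parameters, no fitted state

print("trees reused after second fit:", reused)
print("clone is fitted:", hasattr(fresh, "estimators_"))
```

If a model carried over from one fold is scored on data it was trained on in an earlier fold, the reported metrics will look better than they should, so perfect scores in a later fold deserve a second look.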

In [55]:
vcclf = VotingClassifier(estimators=[('rf', rfclf), ('et', etclf)], voting='soft', weights=[2,1])
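
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members, weighted by `weights`, and picks the class with the highest average. The arithmetic sketch below uses made-up probabilities for a single document and a hypothetical helper (`soft_vote`) to show how the 2:1 weighting can change the decision.

```python
def soft_vote(probas, weights):
    """Weighted average of per-class probabilities from several classifiers."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]

p_rf = [0.7, 0.3]  # made-up Random Forest probabilities for classes 0 and 1
p_et = [0.2, 0.8]  # made-up Extra Trees probabilities for classes 0 and 1

weighted = soft_vote([p_rf, p_et], weights=[2, 1])
unweighted = soft_vote([p_rf, p_et], weights=[1, 1])

print(weighted, weighted.index(max(weighted)))        # class 0 wins with weights 2:1
print(unweighted, unweighted.index(max(unweighted)))  # class 1 wins with equal weights
```

Here the Random Forest's opinion counts twice, so its preference for class 0 prevails even though the Extra Trees model is more confident about class 1.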

## Training and validating the classifier

In [56]:
from collections import defaultdict
scaler = StandardScaler()

acc_hist = defaultdict(list)

# standardize the features once, before splitting into folds
X = scaler.fit_transform(data.to_numpy())
probas = defaultdict(list)
skf = StratifiedKFold(2, shuffle=True)

# rfclf and etclf are the same objects in every fold; vcclf fits its own clones of them
for train_index, test_index in skf.split(X, Y):
    print("==> Fitting:")
    print("==> Extra Trees")
    etclf.fit(X[train_index],Y[train_index])
    print("Random Forest")
    rfclf.fit(X[train_index], Y[train_index])
    probas['RF'].append((rfclf.predict_proba(X[test_index]), Y[test_index]))
    print("Voting")
    vcclf.fit(X[train_index], Y[train_index])
    # store (probabilities, labels) pairs, the format plot_roc expects
    probas['Voting'].append((vcclf.predict_proba(X[test_index]), Y[test_index]))
    print("==> Scoring:")
    # cross_val_score refits clones of the estimator on splits of the test fold
    acc_hist['ET'].append(cross_val_score(etclf, X[test_index], Y[test_index], cv=2, n_jobs=-1).mean())

    acc_hist['RF'].append(cross_val_score(rfclf, X[test_index], Y[test_index], cv=2, n_jobs=-1).mean())
    acc_hist['Voting'].append(vcclf.score(X[test_index], Y[test_index]))
    print_class_report(X[test_index], Y[test_index], etclf, 'ET')
    print_class_report(X[test_index], Y[test_index], rfclf, 'RF')
    print_class_report(X[test_index], Y[test_index], vcclf, 'Voting')

plot_roc(probas)

df_acc = pd.DataFrame(acc_hist)

  


==> Fitting:
==> Extra Trees
Random Forest
Voting
==> Scoring:

Classification Report for ET:

              precision    recall  f1-score   support

    Temático       0.97      0.95      0.96       482
  Biográfico       0.99      1.00      0.99      3362

    accuracy                           0.99      3844
   macro avg       0.98      0.97      0.98      3844
weighted avg       0.99      0.99      0.99      3844


Classification Report for RF:

              precision    recall  f1-score   support

    Temático       0.96      0.96      0.96       482
  Biográfico       0.99      0.99      0.99      3362

    accuracy                           0.99      3844
   macro avg       0.98      0.98      0.98      3844
weighted avg       0.99      0.99      0.99      3844


Classification Report for Voting:

              precision    recall  f1-score   support

    Temático       0.97      0.95      0.96       482
  Biográfico       0.99      1.00      0.99      3362

    accuracy       

  warn("Warm-start fitting without increasing n_estimators does not "
  warn("Warm-start fitting without increasing n_estimators does not "


==> Scoring:

Classification Report for ET:

              precision    recall  f1-score   support

    Temático       1.00      1.00      1.00       481
  Biográfico       1.00      1.00      1.00      3362

    accuracy                           1.00      3843
   macro avg       1.00      1.00      1.00      3843
weighted avg       1.00      1.00      1.00      3843


Classification Report for RF:

              precision    recall  f1-score   support

    Temático       1.00      1.00      1.00       481
  Biográfico       1.00      1.00      1.00      3362

    accuracy                           1.00      3843
   macro avg       1.00      1.00      1.00      3843
weighted avg       1.00      1.00      1.00      3843


Classification Report for Voting:

              precision    recall  f1-score   support

    Temático       0.96      0.89      0.92       481
  Biográfico       0.98      0.99      0.99      3362

    accuracy                           0.98      3843
   macro avg   