In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Project: Classification of Short News in Portuguese using *Machine Learning* (a subproject of the Luppar News-Rec Project)
- 1 - Problem Definition
- 2 - Data Preparation and *Embeddings*
- 3 - Building the Models (*Pipelines*)
- 4 - *Deployment* to Production



## 1. Problem Definition
Classification of short news in Portuguese (*part of the Luppar Recommender project*; more information at [Luppar News-Rec](https://pessoalex.wordpress.com/2019/11/24/luppar-news-rec-recomendador-inteligente-de-noticias/))

## 2. Data Preparation and *Embeddings*

Importing the required libraries

In [2]:
from time import time
from tabulate import tabulate
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import gensim
import pickle
from gensim.models.word2vec import Word2Vec
from gensim.models import FastText
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn import metrics
from sklearn.preprocessing import label_binarize
#from sklearn.utils.fixes import signature
from sklearn.ensemble import RandomForestClassifier



### Creating the Custom Classes for *Embedding*-Based Document Representations

Mean *Embedding* class
- Computes, for each document, the mean of the vectors of the words in that document
  - Each document (here, a news item) is represented by the mean of the vectors of the terms it contains.

In [3]:
class E2V_AVG(object):
    def __init__(self, word2vec):
        self.w2v = word2vec
        self.dimensao = 300
    
    def fit(self, X, y):
        return self 

    def transform(self, X):
        return np.array([
            np.mean([self.w2v[word] for word in words if word in self.w2v] or [np.zeros(self.dimensao)], axis=0)
            for words in X
        ])
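
A minimal sketch of what `transform` computes, using a hypothetical three-word vocabulary with 2-dimensional vectors instead of the 300 dimensions used in this notebook. The names `toy_w2v`, `mean_embedding` and `doc` are illustrative and not part of the project. Words missing from the vocabulary are skipped, and a document with no known word maps to the zero vector.

```python
import numpy as np

# Hypothetical toy embeddings (2 dimensions, for readability).
toy_w2v = {
    "gol": np.array([1.0, 0.0]),
    "time": np.array([0.0, 1.0]),
    "voto": np.array([-1.0, 1.0]),
}

def mean_embedding(words, w2v, dim):
    """Average the vectors of the known words; fall back to a zero vector."""
    vectors = [w2v[w] for w in words if w in w2v]
    return np.mean(vectors or [np.zeros(dim)], axis=0)

doc = ["gol", "time", "desconhecida"]  # the last token is out of vocabulary
doc_vector = mean_embedding(doc, toy_w2v, dim=2)
empty_vector = mean_embedding(["xyz"], toy_w2v, dim=2)

print(doc_vector)    # mean of [1, 0] and [0, 1]
print(empty_vector)  # zero vector, because no token is known
```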

Class for the Proposed Approach - **E2V-IDF**

`This approach represents a document by the mean of the vectors of its terms, weighting each term vector by the term's IDF (Inverse Document Frequency). The intuition is that a term's discriminative power depends on the number of documents in which it appears: the weight of terms that occur in many documents of the collection tends to decrease, and the weight of terms that occur in few documents tends to increase (SOUZA, 2019).`
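
In symbols, and only as a sketch of what the `E2V_IDF` class in this notebook computes (the notation $T_d$, $\mathbf{v}_t$, $N$ and $\mathrm{df}(t)$ is introduced here for illustration), the vector of a document $d$ is:

```latex
\mathbf{d} \;=\; \frac{1}{|T_d|} \sum_{t \in T_d} \mathrm{idf}(t)\,\mathbf{v}_t ,
\qquad
\mathrm{idf}(t) \;=\; \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

Here $T_d$ is the list of tokens of $d$ that have an embedding, $\mathbf{v}_t$ is the embedding of term $t$, $N$ is the number of documents seen during `fit`, and $\mathrm{df}(t)$ is the number of those documents that contain $t$. The IDF expression is scikit-learn's default smoothed IDF, which `TfidfVectorizer` uses when `smooth_idf=True`. The sum is divided by the number of tokens, not by the sum of the weights, so documents made of rare terms get vectors with a larger norm.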

In [4]:
# Reference (SOUZA, 2019)
class E2V_IDF(object):
    def __init__(self, word2vec):
        self.w2v = word2vec
        self.wIDF = None # IDF of each word in the collection
        self.dimensao = 300
        
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        maximo_idf = max(tfidf.idf_) # A word never seen during fit is treated as rare, so its default IDF is the largest known IDF (e.g. 9.2525763918954524)
        self.wIDF = defaultdict(
            lambda: maximo_idf, 
            [(word, tfidf.idf_[i]) for word, i in tfidf.vocabulary_.items()])
        return self
    
    # Builds one 300-dimensional vector per document: the mean of the term vectors (embeddings), each multiplied by the term's IDF.
    def transform(self, X):
        return np.array([
                np.mean([self.w2v[word] * self.wIDF[word] for word in words if word in self.w2v] or [np.zeros(self.dimensao)], axis=0)
                for words in X
            ])
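
To make the IDF weights concrete, here is a small sketch on a hypothetical pre-tokenized collection (`toy_docs` is an assumption for illustration, not project data). It repeats the steps of `E2V_IDF.fit`: fit a `TfidfVectorizer` with an identity analyzer, read `idf_`, and use the largest known IDF as the default for unseen words.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical collection: "de" occurs in every document, "gol" in only one.
toy_docs = [
    ["gol", "de", "placa"],
    ["voto", "de", "confianca"],
    ["preco", "de", "custo"],
]

# The identity analyzer tells scikit-learn the documents are already tokenized.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf.fit(toy_docs)

max_idf = max(tfidf.idf_)
# Words never seen during fit fall back to the largest known IDF.
idf = defaultdict(
    lambda: max_idf,
    [(word, tfidf.idf_[i]) for word, i in tfidf.vocabulary_.items()],
)

print(idf["de"])   # term present in all documents: smallest weight
print(idf["gol"])  # term present in one document: larger weight
print(idf["nunca_vista"] == max_idf)  # unseen word gets the maximum
```

With scikit-learn's default smoothing, a term that appears in all three documents gets an IDF of `ln(4/4) + 1 = 1`, so frequent terms are never zeroed out, only down-weighted relative to rare ones.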

### Loading the Data Source (z6News)
Short news items collected from the G1 Notícias site

Topics
- esporteNews
- politicaNews
- tecnologiaNews
- financaPessoal
- educacaonews
- ciencianaturezasaudenews


In [5]:
# File with short news items in Portuguese from the G1 site
X = pickle.load(open('/content/drive/My Drive/0. Business/2. Consultoria em Dados/2. IA, ML/NLP/data/z6News_X.ipy', 'rb'))
# File with the labels of the news items
y = pickle.load(open('/content/drive/My Drive/0. Business/2. Consultoria em Dados/2. IA, ML/NLP/data/z6News_y.ipy', 'rb'))

# This data source is the author's own and is available in this GitHub repository, in the folder: data
# - It may be used freely, provided the author is cited: SOUZA, 2019 (see the References section)

In [6]:
# Converting to NumPy arrays
X, y = np.array(X), np.array(y)

In [7]:
print ("Total news items - G1: %s" % len(y))

Total news items - G1: 34327


### Training the *Embeddings* on the Collection (Data Source)

Word2Vec - [GENSIM](https://radimrehurek.com/gensim/models/word2vec.html)

Parameters
- sg=1 -- Skip-gram


In [8]:
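# Skip-gram Word2Vec with 300-dimensional vectors.
# This is the gensim 3.x API; gensim 4 renamed `size` to `vector_size` and `index2word` to `index_to_key`.
model = Word2Vec(X, size=300, window=5, sg=1, workers=4)
# Map each vocabulary word to its embedding vector
w2v = {w: vec for w, vec in zip(model.wv.index2word, model.wv.vectors)}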

In [9]:
# Checking the vocabulary size of the Word2Vec (W2V) model
# 7398 terms (words)
len(w2v)

7398

In [11]:
# Looking up the embedding vector of one of the words
# The result holds the word's 300 learned coordinates in the embedding space
w2v['internaco']

array([-0.04045309, -0.01430207,  0.11801744,  0.02612158,  0.12245786,
        0.00649389,  0.11605934,  0.07124674, -0.06232863, -0.10530096,
        0.00637942, -0.14090425, -0.05615794,  0.08213024, -0.16268834,
       -0.06161891, -0.04556866, -0.11122014,  0.18279219, -0.1467552 ,
       -0.17773533, -0.07251669,  0.02747214, -0.02131107, -0.03223983,
       -0.03284115,  0.07178151, -0.10853031, -0.06393187, -0.16217506,
       -0.05219315,  0.16122033,  0.09927419,  0.03981264, -0.06516748,
       -0.01737884,  0.04568584,  0.12629637,  0.18297443,  0.01089129,
        0.08341968, -0.02700089, -0.11271318,  0.10150865,  0.00551293,
        0.0555806 ,  0.02811121, -0.15438072,  0.02054688,  0.04436369,
        0.02697118,  0.11070438,  0.08668947,  0.0605762 ,  0.13689233,
       -0.00766471, -0.04319409, -0.00111114, -0.08611667, -0.06722403,
        0.01139539,  0.07120934,  0.04332718,  0.04551566, -0.01937538,
        0.02243992,  0.00897486,  0.1368795 , -0.02056652,  0.04

FastText - [GENSIM](https://radimrehurek.com/gensim/models/fasttext.html)

Parameters
- sg=1 -- Skip-gram

In [10]:
model_ft = FastText(X, size=300, window=5, sg=1, workers=4)
ft  = {w: vec for w, vec in zip(model_ft.wv.index2word, model_ft.wv.vectors)}

In [11]:
# Checking the vocabulary size of the FastText (FT) model
# 7398 terms (words)
len(ft)

7398

In [12]:
# Looking up the embedding vector of the same word, now using FastText (FT)
ft['internaco']

array([ 0.05413393,  0.05002799, -0.04061728,  0.17854394,  0.28756377,
        0.03100248, -0.08970266,  0.02063751,  0.00186416, -0.1186173 ,
        0.02227419, -0.14152965,  0.09124412, -0.03458006, -0.25723612,
        0.02480055,  0.23841013, -0.01460795,  0.08602637,  0.04161023,
       -0.03365967,  0.10400467,  0.1993656 , -0.03280523, -0.11122724,
        0.05185151,  0.03256325,  0.23878828, -0.1291739 , -0.13890347,
        0.14867438, -0.14454703,  0.07656992,  0.01087445, -0.08687098,
        0.0028606 , -0.08720215, -0.27818274, -0.05454101, -0.03843651,
        0.24320164,  0.21770497, -0.00951782,  0.0221708 , -0.14401294,
       -0.14727247,  0.17364445,  0.01237763, -0.14003523,  0.15786518,
        0.07097363, -0.03634983,  0.12774265, -0.19544782,  0.13492921,
        0.15873346, -0.05498461, -0.0617607 , -0.06662105, -0.07308092,
        0.05825625,  0.07395235,  0.07041017, -0.12768014, -0.01776723,
       -0.03116737, -0.1295951 , -0.11831437, -0.01275796, -0.08

## 3. Building the Models (*Pipelines*)

#### Classifiers
- **SVM** + RBF (Support Vector Machine + Radial Basis Function)
- **KNN** - K-Nearest Neighbors
- **Decision Tree**
- **Random Forest**

#### Traditional Document Representations x Classifiers
- **BoW** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))

In [13]:
svm_rbf_bow   = Pipeline([("count_vectorizer", CountVectorizer(analyzer=lambda x: x)), ("svm rbf bow"  , OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0)))])

In [14]:
knn_bow   = Pipeline([("count_vectorizer", CountVectorizer(analyzer=lambda x: x)), ("knn bow"  , OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [15]:
dt_bow   = Pipeline([("count_vectorizer", CountVectorizer(analyzer=lambda x: x)), ("dt bow"  , OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [16]:
rf_bow   = Pipeline([("count_vectorizer", CountVectorizer(analyzer=lambda x: x)), ("rf bow"  , OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])

- **TF-IDF** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))

In [17]:
svm_rbf_tfidf = Pipeline([("tfidf_vectorizer", TfidfVectorizer(analyzer=lambda x: x)), ("svm rbf tfidf", OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0)))])

In [18]:
knn_tfidf = Pipeline([("tfidf_vectorizer", TfidfVectorizer(analyzer=lambda x: x)), ("knn tfidf", OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [19]:
dt_tfidf = Pipeline([("tfidf_vectorizer", TfidfVectorizer(analyzer=lambda x: x)), ("dt tfidf", OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [20]:
rf_tfidf = Pipeline([("tfidf_vectorizer", TfidfVectorizer(analyzer=lambda x: x)), ("rf tfidf", OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])
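
The BoW and TF-IDF pipelines pass `analyzer=lambda x: x` because each news item is already a list of tokens, so scikit-learn must not tokenize it again. The sketch below shows the difference between the two representations on two hypothetical documents (`toy_docs` is illustrative only).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical pre-tokenized documents (lists of tokens, not raw strings).
toy_docs = [["gol", "gol", "time"], ["voto", "time"]]

# Bag of words: raw term counts per document.
bow = CountVectorizer(analyzer=lambda tokens: tokens)
counts = bow.fit_transform(toy_docs).toarray()
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)   # column order of the matrices
print(counts)

# TF-IDF: counts re-weighted by IDF, then each row is L2-normalized (default).
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
weights = tfidf.fit_transform(toy_docs).toarray()
print(weights.round(3))
```

In the TF-IDF matrix, `time` (present in both documents) weighs less than `voto` within the second document, whereas BoW gives both a count of 1.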

#### *Embedding*-Based Document Representations
- **Word2Vec (w2v)** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))
 - Mean vector (default)

In [21]:
svm_rbf_w2v  = Pipeline([("w2v", E2V_AVG(w2v))    , ("svm rbf w2v",     OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0), n_jobs=-1))])

In [22]:
knn_w2v      = Pipeline([("w2v", E2V_AVG(w2v))    , ("knn w2v",     OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [23]:
dt_w2v       = Pipeline([("w2v", E2V_AVG(w2v))    , ("dt w2v",     OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [24]:
rf_w2v       = Pipeline([("w2v", E2V_AVG(w2v))    , ("rf w2v",     OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])
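
As a sketch of how an embedding transformer and a classifier fit together in a `Pipeline`, the example below trains a 1-nearest-neighbour model on a hypothetical four-document collection. `MeanEmbedding` is an illustrative stand-in for `E2V_AVG`; subclassing `BaseEstimator` and `TransformerMixin` is a choice made here for compatibility with recent scikit-learn versions, not something the notebook's classes do.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

class MeanEmbedding(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for E2V_AVG (hypothetical name)."""

    def __init__(self, w2v, dim):
        self.w2v = w2v
        self.dim = dim

    def fit(self, X, y=None):
        return self  # nothing to learn: the embeddings are already trained

    def transform(self, X):
        return np.array([
            np.mean([self.w2v[w] for w in words if w in self.w2v]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# Hypothetical 2-dimensional embeddings and a tiny labelled collection.
toy_w2v = {"gol": np.array([1.0, 0.0]), "time": np.array([0.9, 0.1]),
           "voto": np.array([0.0, 1.0]), "senado": np.array([0.1, 0.9])}
X_toy = [["gol", "time"], ["time"], ["voto", "senado"], ["senado"]]
y_toy = np.array(["esporte", "esporte", "politica", "politica"])

pipe = Pipeline([("emb", MeanEmbedding(toy_w2v, dim=2)),
                 ("knn", KNeighborsClassifier(n_neighbors=1))])
pipe.fit(X_toy, y_toy)
pred = pipe.predict([["gol"], ["voto"]])
print(pred)
```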

- **Word2Vec (w2v_idf)** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))
 - Proposed approach **E2V-IDF**

In [25]:
svm_rbf_w2v_idf = Pipeline([("w2v-idf", E2V_IDF(w2v)), ("svm rbf w2v-idf", OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0), n_jobs=-1))])

In [26]:
knn_w2v_idf     = Pipeline([("w2v-idf", E2V_IDF(w2v)), ("knn w2v-idf", OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [27]:
dt_w2v_idf   = Pipeline([("w2v-idf", E2V_IDF(w2v)), ("dt w2v-idf", OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [28]:
rf_w2v_idf   = Pipeline([("w2v-idf", E2V_IDF(w2v)), ("rf w2v-idf", OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])

- **FastText (FT)** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))
 - Mean vector (default)

In [29]:
svm_rbf_ft  = Pipeline([("ft", E2V_AVG(ft))    , ("svm rbf ft",     OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0), n_jobs=-1))])

In [30]:
knn_ft      = Pipeline([("ft", E2V_AVG(ft))    , ("knn ft",     OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [31]:
dt_ft       = Pipeline([("ft", E2V_AVG(ft))    , ("dt ft",     OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [32]:
rf_ft       = Pipeline([("ft", E2V_AVG(ft))    , ("rf ft",     OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])

- **FastText (FT_IDF)** combined with the classifiers (SVM, KNN, Decision Tree (DT), and Random Forest (RF))
 - Proposed approach **E2V-IDF**

In [33]:
svm_rbf_ft_idf = Pipeline([("ft-idf", E2V_IDF(ft)), ("svm rbf ft-idf", OneVsRestClassifier(SVC(kernel="rbf", gamma=0.01, C=1.0), n_jobs=-1))])

In [34]:
knn_ft_idf     = Pipeline([("ft-idf", E2V_IDF(ft)), ("knn ft-idf", OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5, p=2)))])

In [35]:
dt_ft_idf   = Pipeline([("ft-idf", E2V_IDF(ft)), ("dt ft-idf", OneVsRestClassifier(tree.DecisionTreeClassifier(min_samples_split=40), n_jobs=-1))])

In [36]:
rf_ft_idf   = Pipeline([("ft-idf", E2V_IDF(ft)), ("rf ft-idf", OneVsRestClassifier(RandomForestClassifier(min_samples_split=40, n_estimators=10, n_jobs=-1), n_jobs=-1))])

#### Grouping the Pipelines by Classifier
- SVM


In [37]:
# (display name, pipeline) pairs
all_models_svm = [
    ("SVM(RBF)+BoW", svm_rbf_bow),
    ("SVM(RBF)+TFIDF", svm_rbf_tfidf),
    ("SVM(RBF)+W2V", svm_rbf_w2v),
    ("SVM(RBF)+W2V-IDF", svm_rbf_w2v_idf),
    ("SVM(RBF)+FT", svm_rbf_ft),
    ("SVM(RBF)+FT-IDF", svm_rbf_ft_idf)
]

- KNN

In [38]:
all_models_knn = [
    ("KNN+BoW", knn_bow),
    ("KNN+TFIDF", knn_tfidf),
    ("KNN+W2V", knn_w2v),
    ("KNN+W2V-IDF", knn_w2v_idf),
    ("KNN+FT", knn_ft),
    ("KNN+FT-IDF", knn_ft_idf)
]

- *Decision Tree* (DT)

In [39]:
all_models_dt = [
    ("DT+BoW", dt_bow),
    ("DT+TFIDF", dt_tfidf),
    ("DT+W2V", dt_w2v),
    ("DT+W2V-IDF", dt_w2v_idf),
    ("DT+FT", dt_ft),
    ("DT+FT-IDF", dt_ft_idf)
]

- *Random Forest* (RF)

In [40]:
all_models_rf = [
    ("RF+BoW", rf_bow),
    ("RF+TFIDF", rf_tfidf),
    ("RF+W2V", rf_w2v),
    ("RF+W2V-IDF", rf_w2v_idf),
    ("RF+FT", rf_ft),
    ("RF+FT-IDF", rf_ft_idf)
]

#### Training the Models from the Pipelines
- Per classifier
- Using the *F1-Score* metric (micro-averaged)
- 10-fold *cross-validation*

In [44]:
# Defining the benchmark function for the F1-Score metric

# The function receives one of the pipelines created above plus X and y
# (X holds the news items and y the labels)

from sklearn.model_selection import KFold
def benchmark_new_f1(model, X, y):
    scores = []
    kf = KFold(n_splits=10, random_state=66, shuffle=True)
    for train, test in kf.split(X, y):
        X_train, X_test = X[train], X[test]
        y_train, y_test = y[train], y[test]
        y_pred = model.fit(X_train, y_train).predict(X_test)
        # f1_score expects (y_true, y_pred)
        scores.append(f1_score(y_test, y_pred, average='micro'))
        print(pd.DataFrame(scores))  # show the scores of the folds completed so far
    return np.mean(scores)

# Runs the 10-fold cross-validation:
# - Splits the data source into training and test sets
# - For the given pipeline, in each fold:
# -- Trains
# -- Predicts
# -- Scores the predictions
# - Returns the F1-Score (mean over the 10 folds)
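
In this benchmark every news item has exactly one label, and in that setting the micro-averaged F1-score equals plain accuracy. The sketch below checks this on hypothetical labels (`y_true` and `y_pred` are made up for illustration) and contrasts it with the macro average, which weighs every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels for a single-label, 3-class problem.
y_true = ["esporte", "politica", "tecnologia", "esporte", "politica", "esporte"]
y_pred = ["esporte", "politica", "esporte", "esporte", "tecnologia", "esporte"]

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

print(micro_f1, accuracy)  # identical for single-label problems
print(macro_f1)            # mean of the per-class F1-scores
```

If per-class behaviour matters, for instance when a small class is much harder than the others, reporting the macro average alongside the micro average makes that visible.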

###### *Running...*
We run the function above for each group of combinations
- all_models_svm
- all_models_knn
- all_models_dt
- all_models_rf

In [None]:
# Run beforehand, because it takes a long time to process (about 7 h)
# SVM
table = []
t0 = time()
for name, model in all_models_svm:
	 print(name)
	 table.append({'model': name, 
				   'f1-score': benchmark_new_f1(model, X, y)})
	 print(table)

df_result_f1 = pd.DataFrame(table)
print(df_result_f1)
print("Results (SVM) - F1-Score - DONE in %0.3fs." % (time() - t0))

SVM(RBF)+BoW
          0
0  0.809496
1  0.814157
2  0.800757
3  0.820856
4  0.804835
5  0.801923
6  0.805709
7  0.798368
8  0.803904
9  0.807401
[{'model': 'SVM(RBF)+BoW', 'f1-score': 0.8067407420232937}]
SVM(RBF)+TFIDF
          0
0  0.779202
1  0.772211
2  0.769589
...
SVM(RBF)+W2V
          0
0  0.739878
1  0.753568
2  0.739295
3  0.747742
4  0.739004
5  0.731430
6  0.736382
7  0.734266
8  0.742424
9  0.741259
[{'model': 'SVM(RBF)+BoW', 'f1-score': 0.8067407420232937}, {'model': 'SVM(RBF)+TFIDF', 'f1-score': 0.7731518845267752}, {'model': 'SVM(RBF)+W2V', 'f1-score': 0.7405248455787344}]
SVM(RBF)+W2V-IDF
          0
0  0.777454
1  0.795514
2  0.778328
3  0.789106
4  0.777163
5  0.780076
6  0.783571
7  0.779138
8  0.780012
...
SVM(RBF)+FT
          0
0  0.735508
1  0.744247
2  0.734052
3  0.748616
4  0.736673
5  0.729682
6  0.740169
7  0.733100
8  0.738928
9  0.740676
[{'model': 'SVM(RBF)+BoW', 'f1-score': 0.8067407420232937}, {'model': 'SVM(RBF)+TFIDF', 'f1-score': 0.7731518845267752}, {'model': 'SVM(RBF)+W2V', 'f1-score': 0.7405248455787344}, {'model': 'SVM(RBF)+W2V-IDF', 'f1-score': 0.7818915730836792}, {'model': 'SVM(RBF)+FT', 'f1-score': 0.7381652404300234}]
SVM(RBF)+FT-IDF
          0
0  0.774541
1  0.780658
2  0.774833
3  0.786193
4  0.765803
5  0.769007
...

In [None]:
# Run beforehand, because it takes a long time to process
# KNN
table = []
t0 = time()
for name, model in all_models_knn:
	 print(name)
	 table.append({'model': name, 
				   'f1-score': benchmark_new_f1(model, X, y)})
	 print(table)

df_result_f1 = pd.DataFrame(table)
print(df_result_f1)
print("Results (KNN) - F1-Score - DONE in %0.3fs." % (time() - t0))

KNN+BoW
          0
0  0.654530
1  0.651908
2  0.647247
3  0.651908
4  0.660647
5  0.648412
6  0.652199
7  0.652972
8  0.643648
9  0.662587
[{'model': 'KNN+BoW', 'f1-score': 0.6526058609804605}]
KNN+TFIDF
          0
0  0.769298
1  0.775706
2  0.757355
...

In [None]:
# Run beforehand, because it takes a long time to process
# Decision Tree
table = []
t0 = time()
for name, model in all_models_dt:
	 print(name)
	 table.append({'model': name, 
				   'f1-score': benchmark_new_f1(model, X, y)})
	 print(table)

df_result_f1 = pd.DataFrame(table)
print(df_result_f1)
print("Results (Decision Tree) - F1-Score - DONE in %0.3fs." % (time() - t0))

DT+BoW
          0
0  0.680163
1  0.692689
2  0.681328
3  0.690941
4  0.681037
5  0.672298
6  0.671716
7  0.675408
8  0.673951
9  0.673660
[{'model': 'DT+BoW', 'f1-score': 0.6793190509364411}]
DT+TFIDF
          0
0  0.664142
1  0.664725
2  0.659773
...

In [45]:
# Run beforehand, because it takes a while to process (about 22 min)
# Random Forest
table = []
t0 = time()
for name, model in all_models_rf:
	 print(name)
	 table.append({'model': name, 
				   'f1-score': benchmark_new_f1(model, X, y)})
	 print(table)

df_result_f1 = pd.DataFrame(table)
print(df_result_f1)
print("Results (Random Forest) - F1-Score - DONE in %0.3fs." % (time() - t0))

RF+BoW
          0
0  0.775706
1  0.766968
2  0.769589
3  0.779784
4  0.771920
5  0.764637
6  0.764055
7  0.764860
8  0.766317
9  0.777389
[{'model': 'RF+BoW', 'f1-score': 0.7701225915069492}]
RF+TFIDF
          0
0  0.758812
1  0.763763
2  0.760559
...

## 3.1 Testing the Models on Short News in Portuguese

Compiled results (mean F1-score over the 10 folds), best first:

| Model | F1-score |
| --- | --- |
| **SVM(RBF)+BoW** | **0.806741** |
| **SVM(RBF)+W2V-IDF** | **0.781892** |
| SVM(RBF)+FT-IDF | 0.774696 |
| SVM(RBF)+TFIDF | 0.773152 |
| RF+BoW | 0.768957 |
| RF+TFIDF | 0.759868 |
| KNN+TFIDF | 0.759518 |
| KNN+W2V-IDF | 0.752294 |
| KNN+W2V | 0.746992 |
| KNN+FT-IDF | 0.742418 |
| SVM(RBF)+W2V | 0.740525 |
| KNN+FT | 0.740292 |
| SVM(RBF)+FT | 0.738165 |
| RF+W2V-IDF | 0.732630 |
| RF+W2V | 0.730999 |
| RF+FT-IDF | 0.721182 |
| RF+FT | 0.719608 |
| DT+BoW | 0.679319 |
| DT+TFIDF | 0.657645 |
| KNN+BoW | 0.652606 |
| DT+W2V-IDF | 0.640516 |
| DT+W2V | 0.636350 |
| DT+FT-IDF | 0.624523 |
| DT+FT | 0.620765 |
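
One way to read the compiled results is to compare each mean-embedding pipeline with its IDF-weighted counterpart. The sketch below does that subtraction using the scores copied from the results above; the Random Forest FastText scores are the ones produced by the `rf_ft` and `rf_ft_idf` pipelines. The dictionary layout and the names `f1` and `gains` are illustrative.

```python
# Mean F1-scores copied from the compiled results (classifier, representation).
f1 = {
    ("SVM", "W2V"): 0.740525, ("SVM", "W2V-IDF"): 0.781892,
    ("SVM", "FT"): 0.738165,  ("SVM", "FT-IDF"): 0.774696,
    ("KNN", "W2V"): 0.746992, ("KNN", "W2V-IDF"): 0.752294,
    ("KNN", "FT"): 0.740292,  ("KNN", "FT-IDF"): 0.742418,
    ("RF", "W2V"): 0.730999,  ("RF", "W2V-IDF"): 0.732630,
    ("RF", "FT"): 0.719608,   ("RF", "FT-IDF"): 0.721182,
    ("DT", "W2V"): 0.636350,  ("DT", "W2V-IDF"): 0.640516,
    ("DT", "FT"): 0.620765,   ("DT", "FT-IDF"): 0.624523,
}

# Gain of the IDF-weighted representation over the plain mean embedding.
gains = {}
for clf in ("SVM", "KNN", "RF", "DT"):
    for emb in ("W2V", "FT"):
        gains[(clf, emb)] = round(f1[(clf, emb + "-IDF")] - f1[(clf, emb)], 6)
        print(f"{clf}+{emb}: {gains[(clf, emb)]:+.6f}")
```

In these results the IDF weighting helps every classifier. The gain is largest for the SVM (about 0.04 for both Word2Vec and FastText) and below 0.006 for KNN, Random Forest and Decision Tree.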

### 3.1.1 Validating one of the best models (to inspect the performance per label)
- **SVM(RBF)+W2V-IDF**

In [41]:
# Binarizing the classes
from sklearn.preprocessing import label_binarize

name_labels = ['esporteNews', 'politicaNews', 'tecnologiaNews', 'financaPessoal', 'educacaonews', 'ciencianaturezasaudenews']
Y = label_binarize(y, classes=['esporteNews', 'politicaNews', 'tecnologiaNews', 'financaPessoal', 'educacaonews', 'ciencianaturezasaudenews'])

In [42]:
n_classes = Y.shape[1]

In [43]:
# Checking the number of classes (labels)
n_classes

6
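
A small sketch of what `label_binarize` returns, using a hypothetical subset of three of the six labels (`classes` and `y_toy` are illustrative). Each row is a news item and each column is a label, in the order given by `classes`, which is why label index 0 corresponds to `esporteNews` in the later report.

```python
from sklearn.preprocessing import label_binarize

classes = ["esporteNews", "politicaNews", "tecnologiaNews"]  # subset, for brevity
y_toy = ["politicaNews", "esporteNews", "tecnologiaNews", "esporteNews"]

Y_toy = label_binarize(y_toy, classes=classes)
print(Y_toy)        # one indicator column per class
print(Y_toy.shape)  # (number of items, number of classes)
```

When `OneVsRestClassifier` is trained on an indicator matrix like this one, it treats the task as multilabel: each column gets its own binary classifier, so a news item may receive several labels or none at all.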

In [49]:
# Creating the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=66)

In [59]:
# Training (about 10 min)
svm_rbf_w2v_idf.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('w2v-idf', <__main__.E2V_IDF object at 0x7f133f245e10>),
                ('svm rbf w2v-idf',
                 OneVsRestClassifier(estimator=SVC(C=1.0, break_ties=False,
                                                   cache_size=200,
                                                   class_weight=None, coef0=0.0,
                                                   decision_function_shape='ovr',
                                                   degree=3, gamma=0.01,
                                                   kernel='rbf', max_iter=-1,
                                                   probability=False,
                                                   random_state=None,
                                                   shrinking=True, tol=0.001,
                                                   verbose=False),
                                     n_jobs=-1))],
         verbose=False)

In [None]:
# Prediction E2VIDF
pred_E2VIDF = svm_rbf_w2v_idf.predict(X_test)

In [None]:
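# Reports
print ("Precision: %s" %precision_score(Y_test, pred_E2VIDF, average="micro"))
print ("Recall...: %s" %recall_score(Y_test, pred_E2VIDF, average="micro"))
print ("F1-Score.: %s" %f1_score(Y_test, pred_E2VIDF, average="micro"))
print ("Accuracy.: %s" %accuracy_score(Y_test, pred_E2VIDF))

# Note: classification_report expects (y_true, y_pred). This call passes the predictions first,
# so in the report printed below the precision and recall columns are exchanged
# and `support` counts predicted labels instead of true labels.
print (classification_report(pred_E2VIDF,Y_test))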

Precision: 0.8612184796613289
Recall...: 0.6814739295077192
F1-Score.: 0.7608748678754371
Accuracy.: 0.6746286047189047
              precision    recall  f1-score   support

           0       0.90      0.96      0.93      1008
           1       0.80      0.83      0.82      1319
           2       0.58      0.78      0.66       735
           3       0.33      0.85      0.48       408
           4       0.72      0.89      0.79       925
           5       0.69      0.85      0.76      1038

   micro avg       0.68      0.86      0.76      5433
   macro avg       0.67      0.86      0.74      5433
weighted avg       0.72      0.86      0.78      5433
 samples avg       0.68      0.68      0.68      5433





**Labels**: `esporteNews(0)`, `politicaNews(1)`, `tecnologiaNews(2)`, `financaPessoal(3)`, `educacaonews(4)`, `ciencianaturezasaudenews(5)` 

The `tecnologia` (technology) and `finanças` (finance) news items did not obtain good results. These topics concentrate more items that could belong to more than one label; finance news, for example, is closely tied to politics.
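
Two details help when reading these numbers. The micro precision (0.86) is much higher than the micro recall (0.68), which is what happens when a one-vs-rest model trained on indicator columns assigns no label to some items. In addition, the report was produced with `classification_report(pred_E2VIDF, Y_test)`, and swapping the two arguments exchanges precision and recall. The sketch below reproduces both effects on hypothetical indicator matrices (`Y_true` and `Y_pred` are made up for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical indicator matrices (rows = news items, columns = labels).
Y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0]])  # two items get no label

precision = precision_score(Y_true, Y_pred, average="micro")
recall = recall_score(Y_true, Y_pred, average="micro")
swapped_precision = precision_score(Y_pred, Y_true, average="micro")
subset_accuracy = accuracy_score(Y_true, Y_pred)

print(precision, recall)   # every assigned label is right, but half are missing
print(swapped_precision)   # equals the recall once the arguments are swapped
print(subset_accuracy)     # fraction of rows predicted exactly
```

For indicator matrices `accuracy_score` is the subset accuracy, so an item without any predicted label always counts as an error. That is consistent with the reported accuracy (0.67) being close to the micro recall.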

## 4. *Deployment* to Production (Full Project)
Application in production: **Luppar Recommender**

[Luppar News-Rec](http://luppar.com/recommender)




## Versioning
- **v1.0** 
 - Added one more topic (health) to the Z6News collection;
 - Adapted to a more didactic notebook version;
- **v2.0** (*in development*)
 - Parameter improvements;
 - Multi-label (news items with more than one label);
 - Testing with news from other news sources (a feature of the *full* version);
 - New embedding methods (BERT);
 - Feature improvements.

## References
- (SOUZA, 2019) SOUZA, Antonio Alex de. Luppar News-Rec: um recomendador inteligente de notícias. 2019. 95 p. Master's dissertation (Academic Master's in Computer Science), Universidade Estadual do Ceará, 2019. Available at: <http://siduece.uece.br/siduece/trabalhoAcademicoPublico.jsf?id=93501>. Accessed: 27 February 2020.

- Alex Souza ([Blog](https://blogdozouza.wordpress.com/))
