### Twitter Classifier Project
#### SUMMARY

<p align='justify'> This project evaluates how tweet preprocessing affects classification. Stop-word removal, tokenization, and spelling correction done manually are compared against automatic preprocessing and tokenization with the spaCy library (https://spacy.io/). 
    
    Word2Vec is a technique that turns each word (token) in our set of sentences (corpus) into a numeric vector that represents it semantically, and the resulting representations have some interesting properties. For example, in a suitable context, the vector for “Madrid” minus the vector for “Spain” plus the vector for “France” lies very close to the vector obtained for “Paris”. As an equation:
vec(“Madrid”) - vec(“Spain”) + vec(“France”) ≈ vec(“Paris”)
The model can also capture relations between words, for example: “man” is to “woman” as “king” is to “queen”.
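The analogy arithmetic above can be sketched with a few hand-made vectors. The three-dimensional values below are invented purely for illustration (real Word2Vec vectors are learned and have hundreds of dimensions), and `analogy` and `cosine` are hypothetical helper names, not part of any library.

```python
import math

# Toy vectors, invented for illustration only.
# Hypothetical axes: [is_capital, relates_to_Spain, relates_to_France]
toy_vectors = {
    "Madrid": [1.0, 1.0, 0.0],
    "Spain":  [0.0, 1.0, 0.0],
    "France": [0.0, 0.0, 1.0],
    "Paris":  [1.0, 0.0, 1.0],
    "Berlin": [1.0, 0.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("Madrid", "Spain", "France", toy_vectors)
print(result)
```

With these toy values the target vector coincides with the one assigned to "Paris", so it is returned as the nearest neighbour; with learned vectors the match is only approximate.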
    
    Word2Vec belongs to a class of models known in the literature as neural language models, because it uses neural networks to learn its representations. Two algorithms are used to train it: Continuous Bag of Words (CBOW) and Skip-gram. 
    
    - CBOW: The Continuous Bag of Words algorithm predicts the word we are looking for from a given context. 
    - Skip-gram: The Skip-gram approach is the inverse: starting from a given word, the goal is to predict the context that word came from. Source: https://medium.com/luizalabs/similaridade-entre-t%C3%ADtulos-de-produtos-com-word2vec-5e26199862f0#:~:text=CBOW%3A%20A%20ideia%20do%20algoritmo,partir%20de%20um%20determinado%20contexto.&text=Skip%2Dgram%3A%20J%C3%A1%20a%20abordagem,do%20qual%20esta%20palavra%20veio.
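The difference between the two algorithms is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplification: it uses a fixed window and hypothetical helper names (`cbow_examples`, `skipgram_examples`), whereas real implementations add details such as subsampling and negative sampling.

```python
def cbow_examples(tokens, window):
    """Build (context words, target word) pairs: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """Build (center word, context word) pairs: the center predicts each neighbour."""
    return [
        (target, neighbour)
        for context, target in cbow_examples(tokens, window)
        for neighbour in context
    ]

sentence = ["the", "cat", "sat", "down"]
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
print(cbow)
print(skip)
```

CBOW yields one example per word, with the whole context as input, while Skip-gram yields one example per (word, neighbour) pair.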
    
    The comparison metric is the result obtained with a logistic regression model, applied to the following configurations:
    - CBOW with automatic preprocessing
    - CBOW with manual preprocessing
    - Skip-gram with automatic preprocessing
    - Skip-gram with manual preprocessing
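A logistic regression needs one fixed-length feature vector per tweet, while Word2Vec produces one vector per word. The summary does not say how the two are combined, so the sketch below assumes a common choice, averaging the vectors of the tweet's known words. The helper name `tweet_vector` and the toy vectors are hypothetical.

```python
def tweet_vector(tokens, word_vectors, size):
    """Average the vectors of the tokens found in word_vectors.

    Assumption: mean pooling; tokens missing from the vocabulary are skipped,
    and a tweet with no known token maps to the zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional vectors, invented for illustration.
toy_word_vectors = {"good": [1.0, 0.0], "day": [0.0, 1.0]}

features = tweet_vector(["good", "day", "zzz"], toy_word_vectors, size=2)
empty = tweet_vector(["zzz"], toy_word_vectors, size=2)
print(features, empty)
```

Each of the four configurations would then supply its own `word_vectors` and token lists, leaving the classifier unchanged so that only the preprocessing and training algorithm differ.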

In [310]:
from gensim.models import KeyedVectors
import pandas as pd
import numpy as np
import spacy
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
import wordninja
import textblob
from nltk.tokenize import TweetTokenizer
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn

[nltk_data] Downloading package stopwords to /Users/eric/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/eric/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [36]:
##Dataset
dados = pd.read_csv('/Users/eric/Downloads/labeled_data.csv')
dados.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [37]:
## Dictionary mapping each class id to its label (0: hate speech, 1: offensive language, 2: neutral)
Classe = {0: ['Discurso de ódio'], 1: ['Linguagem ofensiva'], 2: ['Neutro']}
dados['classe']= dados['class'].map(Classe)
dados.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro]
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva]
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,[Linguagem ofensiva]
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,[Linguagem ofensiva]
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,[Linguagem ofensiva]


In [38]:
# Removing digits, judged irrelevant for classification.
dados['tweet_novo'] = dados['tweet'].str.replace('[0-9]+', '', regex=True).copy()
# Removing punctuation, judged irrelevant for classification.
dados['tweet_novo'] = dados['tweet_novo'].str.replace('[,.:;!?]+', ' ', regex=True).copy()
# Removing @mentions and #hashtags, judged irrelevant for classification.
# The pattern has no underscore, so a handle such as @C_G_Anderson is only cut up to the first "_".
dados['tweet_novo'] = dados['tweet_novo'].str.replace("(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", regex=True).copy()

# Lowercasing all characters.
dados['tweet_novo'] = dados['tweet_novo'].str.lower().copy()

In [39]:
dados.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,[Linguagem ofensiva],rt dawg rt you ever fuck a bitch and s...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,[Linguagem ofensiva],rt _g_anderson _based she look like a tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,[Linguagem ofensiva],rt the shit you hear about me might be tr...
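Row 3 of the output above keeps the fragments `_g_anderson` and `_based`, which suggests the mention pattern stops at the first underscore. The sketch below reproduces the three substitutions with the standard `re` module on a made-up sample string, then shows one possible alternative (an assumption, not the notebook's method): matching handles with `\w`, which includes underscores and digits, before digits are stripped.

```python
import re

def clean_as_in_notebook(text):
    """Same three substitutions, in the same order, as the cell above."""
    text = re.sub(r"[0-9]+", "", text)
    text = re.sub(r"[,.:;!?]+", " ", text)
    text = re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", text)
    return text.lower()

def clean_mentions_first(text):
    """Alternative sketch: remove whole handles and hashtags first, using \\w."""
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[0-9]+", "", text)
    text = re.sub(r"[,.:;!?]+", " ", text)
    return text.lower()

# Made-up sample that reuses two handles visible in the dataset preview.
sample = "RT @C_G_Anderson: @viva_based hello 123"

original_tokens = clean_as_in_notebook(sample).split()
alternative_tokens = clean_mentions_first(sample).split()
print(original_tokens)
print(alternative_tokens)
```

Switching patterns would change every downstream column, so the results reported later in the notebook would need to be regenerated.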


In [40]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python3 -m spacy download en_core_web_sm

In [41]:
# Function to remove stop words
punct = list(string.punctuation)
stop_words = stopwords.words('english')
additional_stop_words = ['RT', 'rt', 'via', '...', 'http', 'twitpic',
'tinyurl' ,'www']
stopword_list = punct + stop_words + additional_stop_words
def tokenize_df(tokenized_words):
  tokenized_words = word_tokenize(tokenized_words)
  stop = [word for word in tokenized_words if word not in stopword_list]
  text = TreebankWordDetokenizer().detokenize(stop)
  return text
# Removing the stop words
dados['tweet_organizado'] = dados['tweet_novo'].apply(tokenize_df).copy()
dados.head(2)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...
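The first row above contains `womann't`. This appears to come from the tokenizer splitting "shouldn't" into "should" and "n't": the stop-word filter drops "should" but keeps "n't", and the detokenizer then attaches it to the previous word. The sketch below illustrates the effect with a hard-coded token list and a tiny hypothetical stop list (the notebook uses NLTK's English list). The `drop_clitics` option is an assumption about one way to avoid the artifact.

```python
# Hypothetical mini stop list, for illustration only.
STOP_WORDS = {"as", "a", "you", "should"}
# Contraction fragments that a Treebank-style tokenizer emits as separate tokens.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

def filter_tokens(tokens, drop_clitics=False):
    """Drop stop words and, optionally, orphaned contraction fragments."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    if drop_clitics:
        kept = [t for t in kept if t not in CLITICS]
    return kept

# Token sequence assumed for "as a woman you shouldn't complain".
tokens = ["as", "a", "woman", "you", "should", "n't", "complain"]

with_clitic = filter_tokens(tokens)
without_clitic = filter_tokens(tokens, drop_clitics=True)
print(with_clitic)
print(without_clitic)
```

Dropping "n't" also discards the negation, so whether this helps the classifier would have to be checked empirically.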


In [42]:
# Splitting words that were run together
dados['text_split'] = dados['tweet_organizado'].apply(wordninja.split)
dados['text_novo2'] = dados['text_split'].apply(TreebankWordDetokenizer().detokenize)
dados.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"[woman, n, ', t, complain, cleaning, house, am...",woman n' t complain cleaning house amp man alw...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"[boy, dat, s, cold, tyga, dw, n, bad, cuff, in...",boy dat s cold tyga dw n bad cuff in dat hoe s...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,[Linguagem ofensiva],rt dawg rt you ever fuck a bitch and s...,dawg ever fuck bitch start cry confused shit,"[daw, g, ever, fuck, bitch, start, cry, confus...",daw g ever fuck bitch start cry confused shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,[Linguagem ofensiva],rt _g_anderson _based she look like a tranny,_g_anderson _based look like tranny,"[g, anderson, based, look, like, tr, an, ny]",g anderson based look like tr an ny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,[Linguagem ofensiva],rt the shit you hear about me might be tr...,shit hear might true might faker bitch told ya,"[shit, hear, might, true, might, faker, bitch,...",shit hear might true might faker bitch told ya


In [43]:
# Correcting misspelled words
from time import time
t0=time()
dados['tweet_definitivo'] = dados['text_novo2'].apply(textblob.TextBlob).apply(textblob.TextBlob.correct).apply(str)
dados.head(2)
tf=(time()-t0)/60
print(f'{tf} minutos')

22.167536314328512 minutos


In [5]:
# Load the small English spaCy pipeline; `nlp` is required by the nlp.pipe call further down.
nlp = spacy.load('en_core_web_sm')

In [8]:
#dados.to_csv('/content/drive/MyDrive/Projeto Daniel/procesado_labeled_data.csv')
##Dataset
dados = pd.read_csv('/content/drive/MyDrive/Projeto Daniel/procesado_labeled_data.csv')
dados.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2,tweet_definitivo,eric_tweet_definito
0,0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,['Neutro'],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"['woman', 'n', ""'"", 't', 'complain', 'cleaning...",woman n' t complain cleaning house amp man alw...,woman n' t complain cleaning house amp man alw...,rt woman complain cleaning house amp man trash
1,1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,['Linguagem ofensiva'],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"['boy', 'dat', 's', 'cold', 'tyga', 'dw', 'n',...",boy dat s cold tyga dw n bad cuff in dat hoe s...,boy dat s cold tea do n bad cuff in dat he st ...,rt boy dats cold tyga dwn bad cuffin dat hoe p...
2,2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,['Linguagem ofensiva'],rt dawg rt you ever fuck a bitch and s...,dawg ever fuck bitch start cry confused shit,"['daw', 'g', 'ever', 'fuck', 'bitch', 'start',...",daw g ever fuck bitch start cry confused shit,day g ever fuck bitch start cry confused shit,rt dawg rt fuck bitch start cry confused shit
3,3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,['Linguagem ofensiva'],rt _g_anderson _based she look like a tranny,_g_anderson _based look like tranny,"['g', 'anderson', 'based', 'look', 'like', 'tr...",g anderson based look like tr an ny,g anderson based look like tr an ny,rt look like tranny
4,4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,['Linguagem ofensiva'],rt the shit you hear about me might be tr...,shit hear might true might faker bitch told ya,"['shit', 'hear', 'might', 'true', 'might', 'fa...",shit hear might true might faker bitch told ya,shit hear might true might baker bitch told a,rt shit hear true faker bitch told ya


In [44]:
texto_para_tratamento_eric = (titulo.lower() for titulo in dados['tweet'])
texto_para_tratamento_eric

textos_tratados_helano = (titulo for titulo in dados['tweet_definitivo'])
textos_tratados_helano

<generator object <genexpr> at 0x7fab30029950>

In [45]:
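def trata_texto(doc):
  # Keep only alphabetic tokens that are not stop words.
  tokens_validos=[]
  for token in doc:
    e_valido= not token.is_stop and token.is_alpha
    if e_valido:
      tokens_validos.append(token.text)
  # Tweets with two or fewer valid tokens return None and are filtered out later.
  if len(tokens_validos)>2:
    return ' '.join(tokens_validos)
  return None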
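The function above returns `None` for tweets with two or fewer valid tokens, which is why some rows have to be dropped before training. The sketch below exercises the same logic without loading spaCy, using a hypothetical `FakeToken` stand-in that exposes only the three attributes the function reads.

```python
from collections import namedtuple

# Stand-in for a spaCy token; only the attributes used by trata_texto are modelled.
FakeToken = namedtuple("FakeToken", ["text", "is_stop", "is_alpha"])

def trata_texto(doc):
    """Join alphabetic, non-stop-word tokens; return None if two or fewer remain."""
    tokens_validos = [t.text for t in doc if not t.is_stop and t.is_alpha]
    if len(tokens_validos) > 2:
        return " ".join(tokens_validos)
    return None

long_doc = [
    FakeToken("rt", False, True),
    FakeToken("the", True, True),    # stop word, dropped
    FakeToken("woman", False, True),
    FakeToken("!!!", False, False),  # not alphabetic, dropped
    FakeToken("complain", False, True),
]
short_doc = [FakeToken("rt", False, True), FakeToken("hello", False, True)]

kept = trata_texto(long_doc)
dropped = trata_texto(short_doc)
print(kept, dropped)
```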

In [47]:
from time import time

t0 = time()
textos_tratados_eric = [trata_texto(doc) for doc in nlp.pipe(texto_para_tratamento_eric,batch_size=1000,n_process=-1)]

tf=(time()-t0)/60
print(f'{tf} minutos')

0.267153267065684 minutos


In [46]:
dados.shape

(24783, 13)

In [48]:
dados['eric_tweet_definito']=textos_tratados_eric
dados.head()

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2,tweet_definitivo,eric_tweet_definito
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"[woman, n, ', t, complain, cleaning, house, am...",woman n' t complain cleaning house amp man alw...,woman n' t complain cleaning house amp man alw...,rt woman complain cleaning house amp man trash
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"[boy, dat, s, cold, tyga, dw, n, bad, cuff, in...",boy dat s cold tyga dw n bad cuff in dat hoe s...,boy dat s cold tea do n bad cuff in dat he st ...,rt boy dats cold tyga dwn bad cuffin dat hoe p...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,[Linguagem ofensiva],rt dawg rt you ever fuck a bitch and s...,dawg ever fuck bitch start cry confused shit,"[daw, g, ever, fuck, bitch, start, cry, confus...",daw g ever fuck bitch start cry confused shit,day g ever fuck bitch start cry confused shit,rt dawg rt fuck bitch start cry confused shit
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,[Linguagem ofensiva],rt _g_anderson _based she look like a tranny,_g_anderson _based look like tranny,"[g, anderson, based, look, like, tr, an, ny]",g anderson based look like tr an ny,g anderson based look like tr an ny,rt look like tranny
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,[Linguagem ofensiva],rt the shit you hear about me might be tr...,shit hear might true might faker bitch told ya,"[shit, hear, might, true, might, faker, bitch,...",shit hear might true might faker bitch told ya,shit hear might true might baker bitch told a,rt shit hear true faker bitch told ya


In [55]:
#dados.to_csv('/Users/eric/Downloads/projeto_daniel/29042021procesado_labeled_data_new.csv')

In [61]:
dados_helano = dados.copy()
display(dados_helano.shape)



dados_helano = dados_helano.drop(columns='eric_tweet_definito')
#display(print(len(dados_helano)))
#dados_helano = dados_helano.dropna().drop_duplicates()

#print(len(dados_helano))

dados_helano.shape

(24783, 14)

(24783, 13)

In [76]:
## Removing the None values so that the tokens can be generated

dados_eric = dados.copy()
display(dados_eric.shape)



dados_eric = dados_eric.drop(columns='tweet_definitivo')
display(print(len(dados_eric)))

dados_eric = dados_eric[dados_eric['eric_tweet_definito'].notna()]

print(len(dados_eric))

dados_eric.shape

(24783, 14)

24783


None

22831


(22831, 13)

In [77]:
print(dados_eric.count())

Unnamed: 0             22831
count                  22831
hate_speech            22831
offensive_language     22831
neither                22831
class                  22831
tweet                  22831
classe                 22831
tweet_novo             22831
tweet_organizado       22831
text_split             22831
text_novo2             22831
eric_tweet_definito    22831
dtype: int64


In [78]:
## Getting the tokens of each tweet
lista_lista_tokens = [titulo.split(" ") for titulo in dados_eric['eric_tweet_definito'] ]

In [62]:
## Getting the tokens of each tweet (Helano's version)
lista_lista_tokens_helano = [titulo.split(" ") for titulo in dados_helano['tweet_definitivo']]
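Both cells tokenize with `split(" ")`. Python treats this differently from `split()` with no argument: consecutive or trailing spaces produce empty-string tokens. Whether the cleaned tweets actually contain repeated spaces is not shown here, so the string below is a made-up example of the behaviour.

```python
text = "boy dat  s cold "  # made-up example with a double and a trailing space

tokens_single_space = text.split(" ")  # splits on every single space
tokens_whitespace = text.split()       # splits on runs of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```

If empty tokens do occur, they would be counted as a word type when the vocabulary is built.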

In [81]:
## Building the vocabulary, with log messages to follow the progress
import logging

logging.basicConfig(format="%(asctime)s - %(message)s",level=logging.INFO)

## `size` was renamed to `vector_size` in gensim 4.
## Note: min_alpha (0.07) is larger than alpha (0.03), so the learning rate would rise during training rather than decay.
w2v_modelo_cbow_helano = Word2Vec(sg = 0,window = 2,vector_size = 300,min_count = 5,alpha = 0.03,min_alpha = 0.07)

## progress_per sets how many sentences are processed between progress log messages.
w2v_modelo_cbow_helano.build_vocab(lista_lista_tokens_helano,progress_per=5000)

2021-04-29 19:24:37,643 - Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.03)', 'datetime': '2021-04-29T19:24:37.643404', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'created'}
2021-04-29 19:24:37,653 - collecting all words and their counts
2021-04-29 19:24:37,654 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-29 19:24:37,669 - PROGRESS: at sentence #5000, processed 46335 words, keeping 6129 word types
2021-04-29 19:24:37,682 - PROGRESS: at sentence #10000, processed 90264 words, keeping 8522 word types
2021-04-29 19:24:37,703 - PROGRESS: at sentence #15000, processed 140142 words, keeping 10364 word types
2021-04-29 19:24:37,729 - PROGRESS: at sentence #20000, processed 192416 words, keeping 11902 word types
2021-04-29 19:24:37,759 - collected 12923 word types from a corpus of 236957 raw words and 24783
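The log above reports 12923 word types collected, while the later training log for the same corpus (236957 raw words) reports a vocabulary of 4226. The difference is consistent with `min_count = 5`, which discards words seen fewer than five times. A minimal sketch of that filtering on a toy corpus, with a hypothetical helper name:

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count every token and keep those appearing at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_corpus = [
    ["good", "day"],
    ["good", "night"],
    ["good", "day", "today"],
]

vocab = build_vocab(toy_corpus, min_count=2)
print(vocab)
```

Rare tokens, including misspellings and fragments left by preprocessing, therefore never receive a vector.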

In [82]:
## Building the vocabulary, with log messages to follow the progress
import logging

logging.basicConfig(format="%(asctime)s - %(message)s",level=logging.INFO)

## Note: min_alpha (0.07) is larger than alpha (0.03), so the learning rate would rise during training rather than decay.
w2v_modelo_cbow_eric = Word2Vec(sg = 0,
                      window = 2,
                      vector_size = 300, ## was `size` before gensim 4
                      min_count = 5,
                      alpha = 0.03,
                      min_alpha = 0.07)

## progress_per sets how many sentences are processed between progress log messages.
w2v_modelo_cbow_eric.build_vocab(lista_lista_tokens,progress_per=5000)

2021-04-29 19:25:19,714 - Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.03)', 'datetime': '2021-04-29T19:25:19.714557', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'created'}
2021-04-29 19:25:19,721 - collecting all words and their counts
2021-04-29 19:25:19,722 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-29 19:25:19,731 - PROGRESS: at sentence #5000, processed 34801 words, keeping 7955 word types
2021-04-29 19:25:19,742 - PROGRESS: at sentence #10000, processed 70543 words, keeping 11847 word types
2021-04-29 19:25:19,754 - PROGRESS: at sentence #15000, processed 109914 words, keeping 15191 word types
2021-04-29 19:25:19,773 - PROGRESS: at sentence #20000, processed 148523 words, keeping 17747 word types
2021-04-29 19:25:19,780 - collected 18996 word types from a corpus of 168936 raw words and 2283
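Both models are created with `alpha = 0.03` and `min_alpha = 0.07`. Assuming the learning rate moves linearly from `alpha` at the start of training to `min_alpha` at the end, as the gensim documentation describes, these values make the rate increase over time. The helper below is a hypothetical sketch of that schedule, not gensim's code.

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.07):
    """Linear schedule: alpha at progress 0.0, min_alpha at progress 1.0."""
    return alpha - (alpha - min_alpha) * progress

# Sample the schedule at 0%, 25%, 50%, 75% and 100% of training.
schedule = [learning_rate(step / 4) for step in range(5)]
print(schedule)
```

A decaying schedule needs `min_alpha` smaller than `alpha`; which value was intended here cannot be recovered from the notebook.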

In [83]:
w2v_modelo_cbow_helano.corpus_count

24783

In [84]:
w2v_modelo_cbow_eric.corpus_count

22831

2021-04-29 19:29:30,518 - Word2Vec lifecycle event {'msg': 'training model with 3 workers on 3945 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2', 'datetime': '2021-04-29T19:29:30.518204', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}
2021-04-29 19:29:30,726 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:29:30,731 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:29:30,734 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:29:30,735 - EPOCH - 1 : training on 168936 raw words (115978 effective words) took 0.2s, 623507 effective words/s
2021-04-29 19:29:30,837 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:29:30,842 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:29:30,844 - worker thread finished; await

(3480431, 5068080)

2021-04-29 19:29:40,237 - Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4226 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2', 'datetime': '2021-04-29T19:29:40.237421', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}
2021-04-29 19:29:40,425 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:29:40,431 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:29:40,433 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:29:40,433 - EPOCH - 1 : training on 236957 raw words (180362 effective words) took 0.2s, 977914 effective words/s
2021-04-29 19:29:40,575 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:29:40,582 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:29:40,583 - worker thread finished; await

(5409161, 7108710)

In [87]:
# Callback that prints the training loss of each epoch
class callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
        self.loss_previous_step = 0.0

    def on_epoch_end(self, model):
        # The reported loss is treated as cumulative, so subtract the
        # previous reading to obtain the loss of this epoch alone.
        loss = model.get_latest_training_loss()
        if self.epoch == 0:
            print('Loss após a época {}: {}'.format(self.epoch, loss))
        else:
            print('Loss após a época {}: {}'.format(self.epoch, loss- self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss
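The callback assumes the loss it reads is a running total, so it prints the difference from the previous reading. The same bookkeeping can be checked in isolation with made-up numbers; `per_epoch_losses` is a hypothetical helper.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running totals into the loss contributed by each epoch."""
    previous = 0.0
    deltas = []
    for total in cumulative_losses:
        deltas.append(total - previous)
        previous = total
    return deltas

# Made-up running totals after epochs 0, 1 and 2.
deltas = per_epoch_losses([10.0, 18.0, 24.0])
print(deltas)
```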

In [88]:
w2v_modelo_cbow_eric.train(lista_lista_tokens,
                total_examples=w2v_modelo_cbow_eric.corpus_count,
                epochs = 30,
                compute_loss = True,
                callbacks=[callback()])

2021-04-29 19:30:19,165 - Word2Vec lifecycle event {'msg': 'training model with 3 workers on 3945 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2', 'datetime': '2021-04-29T19:30:19.165700', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}
2021-04-29 19:30:19,295 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,301 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,305 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,306 - EPOCH - 1 : training on 168936 raw words (116063 effective words) took 0.1s, 939285 effective words/s
2021-04-29 19:30:19,404 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,409 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,411 - worker thread finished; await

Loss após a época 0: 53907.8359375
Loss após a época 1: 53221.765625


2021-04-29 19:30:19,509 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,511 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,512 - EPOCH - 3 : training on 168936 raw words (115851 effective words) took 0.1s, 1341685 effective words/s
2021-04-29 19:30:19,597 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,601 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,604 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,604 - EPOCH - 4 : training on 168936 raw words (116009 effective words) took 0.1s, 1540089 effective words/s
2021-04-29 19:30:19,688 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,692 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,694 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,695 - EPOCH - 5 : training on 168936 raw words (11

Loss após a época 2: 52756.6796875
Loss após a época 3: 53156.53125
Loss após a época 4: 52739.15625


2021-04-29 19:30:19,788 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,792 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,795 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,796 - EPOCH - 6 : training on 168936 raw words (116137 effective words) took 0.1s, 1340177 effective words/s
2021-04-29 19:30:19,879 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,884 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,887 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:19,887 - EPOCH - 7 : training on 168936 raw words (116126 effective words) took 0.1s, 1598444 effective words/s
2021-04-29 19:30:19,982 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:19,987 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:19,991 - worker thread finished; awaiting finish of 0

Loss após a época 5: 52475.84375
Loss após a época 6: 52072.78125
Loss após a época 7: 52260.0


2021-04-29 19:30:20,090 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,094 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,100 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,100 - EPOCH - 9 : training on 168936 raw words (115944 effective words) took 0.1s, 1246426 effective words/s
2021-04-29 19:30:20,202 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,209 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,211 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,212 - EPOCH - 10 : training on 168936 raw words (116110 effective words) took 0.1s, 1225626 effective words/s


Loss após a época 8: 51722.8125
Loss após a época 9: 52697.40625


2021-04-29 19:30:20,316 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,323 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,325 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,325 - EPOCH - 11 : training on 168936 raw words (115896 effective words) took 0.1s, 1125944 effective words/s
2021-04-29 19:30:20,425 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,430 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,435 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,436 - EPOCH - 12 : training on 168936 raw words (115979 effective words) took 0.1s, 1223643 effective words/s


Loss após a época 10: 51328.0
Loss após a época 11: 51299.9375


2021-04-29 19:30:20,535 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,538 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,543 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,543 - EPOCH - 13 : training on 168936 raw words (116023 effective words) took 0.1s, 1255787 effective words/s
2021-04-29 19:30:20,650 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,653 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,657 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,658 - EPOCH - 14 : training on 168936 raw words (116126 effective words) took 0.1s, 1114310 effective words/s


Loss após a época 12: 50799.75
Loss após a época 13: 51374.1875


2021-04-29 19:30:20,751 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,755 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,757 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,757 - EPOCH - 15 : training on 168936 raw words (116182 effective words) took 0.1s, 1349381 effective words/s
2021-04-29 19:30:20,846 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,850 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,853 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:20,854 - EPOCH - 16 : training on 168936 raw words (116000 effective words) took 0.1s, 1356600 effective words/s
2021-04-29 19:30:20,941 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:20,945 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:20,948 - worker thread finished; awaiting finish of

Loss após a época 14: 50023.875
Loss após a época 15: 49734.625
Loss após a época 16: 49821.375


2021-04-29 19:30:21,037 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,044 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,046 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,046 - EPOCH - 18 : training on 168936 raw words (116080 effective words) took 0.1s, 1387092 effective words/s
2021-04-29 19:30:21,139 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,143 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,145 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,145 - EPOCH - 19 : training on 168936 raw words (116138 effective words) took 0.1s, 1297973 effective words/s
2021-04-29 19:30:21,233 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,238 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,240 - worker thread finished; awaiting finish of

Loss após a época 17: 49045.0
Loss após a época 18: 48195.125
Loss após a época 19: 47987.25


2021-04-29 19:30:21,333 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,337 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,340 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,341 - EPOCH - 21 : training on 168936 raw words (115946 effective words) took 0.1s, 1237486 effective words/s
2021-04-29 19:30:21,429 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,433 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,436 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,437 - EPOCH - 22 : training on 168936 raw words (115990 effective words) took 0.1s, 1437781 effective words/s
2021-04-29 19:30:21,526 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,530 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,533 - worker thread finished; awaiting finish of

Loss após a época 20: 47992.3125
Loss após a época 21: 46595.125
Loss após a época 22: 46131.75


2021-04-29 19:30:21,618 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,622 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,625 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,626 - EPOCH - 24 : training on 168936 raw words (115931 effective words) took 0.1s, 1517022 effective words/s
2021-04-29 19:30:21,713 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,717 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,720 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,721 - EPOCH - 25 : training on 168936 raw words (115925 effective words) took 0.1s, 1459094 effective words/s
2021-04-29 19:30:21,802 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,806 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,809 - worker thread finished; awaiting finish of

Loss após a época 23: 44862.625
Loss após a época 24: 42989.75
Loss após a época 25: 42904.875


2021-04-29 19:30:21,891 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,896 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,898 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,898 - EPOCH - 27 : training on 168936 raw words (115958 effective words) took 0.1s, 1418122 effective words/s
2021-04-29 19:30:21,976 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:21,980 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:21,982 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:21,983 - EPOCH - 28 : training on 168936 raw words (116036 effective words) took 0.1s, 1663416 effective words/s
2021-04-29 19:30:22,067 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:22,070 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:22,072 - worker thread finished; awaiting finish of

Loss após a época 26: 41476.625
Loss após a época 27: 39736.125
Loss após a época 28: 39428.5


2021-04-29 19:30:22,152 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:30:22,156 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:30:22,158 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:30:22,158 - EPOCH - 30 : training on 168936 raw words (115966 effective words) took 0.1s, 1554033 effective words/s
2021-04-29 19:30:22,159 - Word2Vec lifecycle event {'msg': 'training on 5068080 raw words (3480945 effective words) took 3.0s, 1163385 effective words/s', 'datetime': '2021-04-29T19:30:22.159598', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}


Loss após a época 29: 37272.125


(3480945, 5068080)
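
The tuple above is the value returned by `train()`: the number of effective words (those actually used after vocabulary filtering and downsampling) followed by the number of raw words seen. As a quick sanity check, the sketch below redoes the arithmetic in plain Python, using only figures copied from the log above.

```python
# Figures copied from the training log above (168936 raw words per epoch, 30 epochs).
raw_words_per_epoch = 168936
epochs = 30
effective_total, raw_total = (3480945, 5068080)  # value returned by train()

# The raw total is the corpus size multiplied by the number of epochs.
assert raw_words_per_epoch * epochs == raw_total

# Share of raw words that actually contributed to training.
effective_share = effective_total / raw_total
print(f"effective share: {effective_share:.1%}")
```

The remaining words were dropped either because they fall below `min_count` or because of frequent-word downsampling (`sample=0.001` in the log).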

[('prize', 0.3201570510864258),
 ('jets', 0.31379517912864685),
 ('holds', 0.30752310156822205),
 ('horsey', 0.30421167612075806),
 ('season', 0.3026547431945801),
 ('treading', 0.29793864488601685),
 ('points', 0.29273083806037903),
 ('patted', 0.2840335965156555),
 ('picking', 0.27932408452033997),
 ('woof', 0.2775574326515198)]
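
The list above has the shape of a `most_similar` result: pairs of (word, score), where the score gensim reports is the cosine similarity between word vectors. A minimal pure-Python sketch of that measure follows. The three-dimensional vectors and the labels `a`, `b`, `c` are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (the real model uses 300 dimensions).
query = [1.0, 0.0, 1.0]
candidates = {"a": [1.0, 0.0, 1.0], "b": [1.0, 1.0, 0.0], "c": [0.0, 1.0, 0.0]}

# Rank candidates from most to least similar, as most_similar does.
ranked = sorted(
    ((word, cosine_similarity(query, vec)) for word, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Cosine similarity ranges from -1 to 1, so scores around 0.3, like those in the list above, indicate fairly weak similarity.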

In [89]:
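# Train the CBOW model on lista_lista_tokens_helano for 30 epochs,
# reporting the loss after each epoch through the callback.
w2v_modelo_cbow_helano.train(lista_lista_tokens_helano,
                             total_examples=w2v_modelo_cbow_helano.corpus_count,
                             epochs=30,
                             compute_loss=True,
                             callbacks=[callback()])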

2021-04-29 19:31:24,326 - Word2Vec lifecycle event {'msg': 'training model with 3 workers on 4226 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2', 'datetime': '2021-04-29T19:31:24.325976', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}
2021-04-29 19:31:24,481 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:24,484 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:24,485 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:24,486 - EPOCH - 1 : training on 236957 raw words (180343 effective words) took 0.1s, 1205724 effective words/s
2021-04-29 19:31:24,623 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:24,626 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:24,627 - worker thread finished; awai

Loss após a época 0: 97542.6015625
Loss após a época 1: 97456.5703125


2021-04-29 19:31:24,765 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:24,773 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:24,774 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:24,775 - EPOCH - 3 : training on 236957 raw words (180600 effective words) took 0.1s, 1316764 effective words/s
2021-04-29 19:31:24,904 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:24,907 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:24,908 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:24,909 - EPOCH - 4 : training on 236957 raw words (180247 effective words) took 0.1s, 1460645 effective words/s


Loss após a época 2: 97203.828125
Loss após a época 3: 97340.71875


2021-04-29 19:31:25,038 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,042 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,043 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,044 - EPOCH - 5 : training on 236957 raw words (180226 effective words) took 0.1s, 1480685 effective words/s
2021-04-29 19:31:25,169 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,172 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,173 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,174 - EPOCH - 6 : training on 236957 raw words (180357 effective words) took 0.1s, 1473106 effective words/s


Loss após a época 4: 92699.125
Loss após a época 5: 92530.03125


2021-04-29 19:31:25,303 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,309 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,311 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,311 - EPOCH - 7 : training on 236957 raw words (180288 effective words) took 0.1s, 1379953 effective words/s
2021-04-29 19:31:25,432 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,436 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,437 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,437 - EPOCH - 8 : training on 236957 raw words (180316 effective words) took 0.1s, 1561937 effective words/s


Loss após a época 6: 91773.875
Loss após a época 7: 90689.5


2021-04-29 19:31:25,567 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,573 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,574 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,575 - EPOCH - 9 : training on 236957 raw words (180325 effective words) took 0.1s, 1412284 effective words/s
2021-04-29 19:31:25,699 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,705 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,707 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,707 - EPOCH - 10 : training on 236957 raw words (180350 effective words) took 0.1s, 1520081 effective words/s


Loss após a época 8: 90735.625
Loss após a época 9: 88315.5


2021-04-29 19:31:25,846 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:25,852 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:25,854 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:25,855 - EPOCH - 11 : training on 236957 raw words (180109 effective words) took 0.1s, 1318307 effective words/s
2021-04-29 19:31:26,001 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,004 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,005 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,006 - EPOCH - 12 : training on 236957 raw words (180404 effective words) took 0.1s, 1273176 effective words/s


Loss após a época 10: 86214.8125
Loss após a época 11: 88534.1875


2021-04-29 19:31:26,147 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,151 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,154 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,155 - EPOCH - 13 : training on 236957 raw words (180484 effective words) took 0.1s, 1292800 effective words/s
2021-04-29 19:31:26,292 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,295 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,298 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,299 - EPOCH - 14 : training on 236957 raw words (180323 effective words) took 0.1s, 1412663 effective words/s


Loss após a época 12: 80991.125
Loss após a época 13: 78137.25


2021-04-29 19:31:26,429 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,436 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,437 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,437 - EPOCH - 15 : training on 236957 raw words (180352 effective words) took 0.1s, 1439771 effective words/s
2021-04-29 19:31:26,571 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,575 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,576 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,576 - EPOCH - 16 : training on 236957 raw words (180125 effective words) took 0.1s, 1395738 effective words/s


Loss após a época 14: 74862.125
Loss após a época 15: 69773.25


2021-04-29 19:31:26,709 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,716 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,719 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,720 - EPOCH - 17 : training on 236957 raw words (180665 effective words) took 0.1s, 1405371 effective words/s
2021-04-29 19:31:26,854 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:26,860 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:26,860 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:26,861 - EPOCH - 18 : training on 236957 raw words (180229 effective words) took 0.1s, 1358403 effective words/s


Loss após a época 16: 66346.375
Loss após a época 17: 64560.25


2021-04-29 19:31:26,997 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,001 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,002 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,003 - EPOCH - 19 : training on 236957 raw words (180372 effective words) took 0.1s, 1380533 effective words/s
2021-04-29 19:31:27,127 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,130 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,132 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,133 - EPOCH - 20 : training on 236957 raw words (180203 effective words) took 0.1s, 1483294 effective words/s


Loss após a época 18: 57569.75
Loss após a época 19: 53944.625


2021-04-29 19:31:27,252 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,255 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,257 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,258 - EPOCH - 21 : training on 236957 raw words (180602 effective words) took 0.1s, 1566062 effective words/s
2021-04-29 19:31:27,386 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,391 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,392 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,393 - EPOCH - 22 : training on 236957 raw words (180373 effective words) took 0.1s, 1413169 effective words/s


Loss após a época 20: 51408.375
Loss após a época 21: 48130.875


2021-04-29 19:31:27,528 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,533 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,533 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,534 - EPOCH - 23 : training on 236957 raw words (180134 effective words) took 0.1s, 1375104 effective words/s
2021-04-29 19:31:27,663 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,669 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,670 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,671 - EPOCH - 24 : training on 236957 raw words (180051 effective words) took 0.1s, 1415368 effective words/s


Loss após a época 22: 45319.75
Loss após a época 23: 43828.875


2021-04-29 19:31:27,800 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,806 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,807 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,808 - EPOCH - 25 : training on 236957 raw words (180344 effective words) took 0.1s, 1463371 effective words/s
2021-04-29 19:31:27,931 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:27,935 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:27,936 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:27,937 - EPOCH - 26 : training on 236957 raw words (180372 effective words) took 0.1s, 1583098 effective words/s


Loss após a época 24: 39944.5
Loss após a época 25: 38037.25


2021-04-29 19:31:28,068 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:28,071 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:28,073 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:28,074 - EPOCH - 27 : training on 236957 raw words (180381 effective words) took 0.1s, 1412527 effective words/s
2021-04-29 19:31:28,201 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:28,204 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:28,205 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:28,206 - EPOCH - 28 : training on 236957 raw words (180119 effective words) took 0.1s, 1448551 effective words/s


Loss após a época 26: 35997.625
Loss após a época 27: 34853.625


2021-04-29 19:31:28,322 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:28,326 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:28,327 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:28,328 - EPOCH - 29 : training on 236957 raw words (180231 effective words) took 0.1s, 1604143 effective words/s
2021-04-29 19:31:28,458 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:31:28,460 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:31:28,463 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:31:28,465 - EPOCH - 30 : training on 236957 raw words (180403 effective words) took 0.1s, 1434133 effective words/s
2021-04-29 19:31:28,466 - Word2Vec lifecycle event {'msg': 'training on 7108710 raw words (5409806 effective words) took 4.1s, 1307030 effective words/s', 'datetime': '2021-04-29T19:31:28.465999', 'gensim': '4.0.1', 'python': '3.7.4 (de

Loss após a época 28: 32868.25
Loss após a época 29: 31902.75


(5409806, 7108710)
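
The `callback()` class used in these cells is defined earlier in the notebook and is not shown here. gensim's `get_latest_training_loss()` returns a running total accumulated over the whole `train()` call, so a per-epoch figure like the ones printed above is usually obtained by subtracting the previous reading. The sketch below illustrates that differencing with a hypothetical `EpochLossTracker` class and made-up cumulative readings. It is an assumption about how such a callback could work, not the notebook's actual code.

```python
class EpochLossTracker:
    """Turns cumulative loss readings into per-epoch losses (hypothetical helper)."""

    def __init__(self):
        self.epoch = 0
        self.previous_total = 0.0
        self.per_epoch = []

    def on_epoch_end(self, cumulative_loss):
        # In a gensim callback this value would come from
        # model.get_latest_training_loss().
        loss = cumulative_loss - self.previous_total
        self.previous_total = cumulative_loss
        self.per_epoch.append(loss)
        print(f"Loss after epoch {self.epoch}: {loss}")
        self.epoch += 1


tracker = EpochLossTracker()
# Made-up cumulative readings, for illustration only.
for total in [100.0, 190.0, 270.0]:
    tracker.on_epoch_end(total)
```

In a real gensim callback the same logic would sit in the `on_epoch_end(self, model)` method of a `CallbackAny2Vec` subclass.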

In [91]:
## TRAINING THE SKIP-GRAM MODEL (HELANO)

# NOTE: min_alpha (0.07) is larger than alpha (0.03), so the learning rate
# cannot decay during training; min_alpha is normally much smaller than alpha.
w2v_modelo_sg_helano = Word2Vec(sg=1,  # 1 = skip-gram, 0 = CBOW
                                window=5,
                                vector_size=300,
                                min_count=5,
                                alpha=0.03,
                                min_alpha=0.07)

## progress_per controls how often the vocabulary scan logs its progress
## (every 5000 sentences in the log below).
w2v_modelo_sg_helano.build_vocab(lista_lista_tokens_helano, progress_per=5000)


### TRAINING

w2v_modelo_sg_helano.train(lista_lista_tokens_helano,
                           total_examples=w2v_modelo_sg_helano.corpus_count,
                           epochs=30,
                           compute_loss=True,
                           callbacks=[callback()])

2021-04-29 19:32:49,689 - Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.03)', 'datetime': '2021-04-29T19:32:49.689217', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'created'}
2021-04-29 19:32:49,690 - collecting all words and their counts
2021-04-29 19:32:49,691 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-29 19:32:49,701 - PROGRESS: at sentence #5000, processed 46335 words, keeping 6129 word types
2021-04-29 19:32:49,711 - PROGRESS: at sentence #10000, processed 90264 words, keeping 8522 word types
2021-04-29 19:32:49,721 - PROGRESS: at sentence #15000, processed 140142 words, keeping 10364 word types
2021-04-29 19:32:49,733 - PROGRESS: at sentence #20000, processed 192416 words, keeping 11902 word types
2021-04-29 19:32:49,751 - collected 12923 word types from a corpus of 236957 raw words and 24783

Loss após a época 0: 688765.6875


2021-04-29 19:32:50,683 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:50,684 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:50,691 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:50,691 - EPOCH - 2 : training on 236957 raw words (180346 effective words) took 0.4s, 450992 effective words/s


Loss após a época 1: 640172.8125


2021-04-29 19:32:51,096 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:51,097 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:51,107 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:51,107 - EPOCH - 3 : training on 236957 raw words (180275 effective words) took 0.4s, 443734 effective words/s


Loss após a época 2: 638995.875


2021-04-29 19:32:51,519 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:51,523 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:51,532 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:51,532 - EPOCH - 4 : training on 236957 raw words (180173 effective words) took 0.4s, 433470 effective words/s


Loss após a época 3: 609139.625


2021-04-29 19:32:51,956 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:51,964 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:51,966 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:51,967 - EPOCH - 5 : training on 236957 raw words (180272 effective words) took 0.4s, 426035 effective words/s


Loss após a época 4: 592156.5


2021-04-29 19:32:52,385 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:52,386 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:52,387 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:52,388 - EPOCH - 6 : training on 236957 raw words (180243 effective words) took 0.4s, 436558 effective words/s


Loss após a época 5: 584795.75


2021-04-29 19:32:52,797 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:52,799 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:52,807 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:52,808 - EPOCH - 7 : training on 236957 raw words (180382 effective words) took 0.4s, 440143 effective words/s


Loss após a época 6: 570021.75


2021-04-29 19:32:53,228 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:53,236 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:53,241 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:53,242 - EPOCH - 8 : training on 236957 raw words (180304 effective words) took 0.4s, 424845 effective words/s


Loss após a época 7: 529154.5


2021-04-29 19:32:53,671 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:53,672 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:53,681 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:53,682 - EPOCH - 9 : training on 236957 raw words (180332 effective words) took 0.4s, 418283 effective words/s


Loss após a época 8: 536406.5


2021-04-29 19:32:54,172 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:54,176 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:54,178 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:54,178 - EPOCH - 10 : training on 236957 raw words (180289 effective words) took 0.5s, 370020 effective words/s


Loss após a época 9: 534772.5


2021-04-29 19:32:54,627 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:54,628 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:54,634 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:54,635 - EPOCH - 11 : training on 236957 raw words (180213 effective words) took 0.4s, 400697 effective words/s


Loss após a época 10: 536736.0


2021-04-29 19:32:55,060 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:55,062 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:55,064 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:55,065 - EPOCH - 12 : training on 236957 raw words (180098 effective words) took 0.4s, 432913 effective words/s


Loss após a época 11: 533714.5


2021-04-29 19:32:55,472 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:55,475 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:55,482 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:55,483 - EPOCH - 13 : training on 236957 raw words (180203 effective words) took 0.4s, 445283 effective words/s


Loss após a época 12: 538548.5


2021-04-29 19:32:55,875 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:55,880 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:55,888 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:55,889 - EPOCH - 14 : training on 236957 raw words (180121 effective words) took 0.4s, 456454 effective words/s


Loss após a época 13: 531148.0


2021-04-29 19:32:56,292 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:56,293 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:56,300 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:56,301 - EPOCH - 15 : training on 236957 raw words (180321 effective words) took 0.4s, 448249 effective words/s


Loss após a época 14: 515766.5


2021-04-29 19:32:56,728 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:56,731 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:56,737 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:56,738 - EPOCH - 16 : training on 236957 raw words (180158 effective words) took 0.4s, 422179 effective words/s


Loss após a época 15: 497035.0


2021-04-29 19:32:57,143 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:57,147 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:57,149 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:57,150 - EPOCH - 17 : training on 236957 raw words (180361 effective words) took 0.4s, 452919 effective words/s


Loss após a época 16: 502221.0


2021-04-29 19:32:57,534 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:57,538 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:57,544 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:57,544 - EPOCH - 18 : training on 236957 raw words (180445 effective words) took 0.4s, 468870 effective words/s


Loss após a época 17: 497320.0


2021-04-29 19:32:57,925 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:57,928 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:57,930 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:57,930 - EPOCH - 19 : training on 236957 raw words (180295 effective words) took 0.4s, 483548 effective words/s


Loss após a época 18: 494817.0


2021-04-29 19:32:58,319 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:58,320 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:58,328 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:58,329 - EPOCH - 20 : training on 236957 raw words (180358 effective words) took 0.4s, 463816 effective words/s


Loss após a época 19: 496196.0


2021-04-29 19:32:58,739 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:58,744 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:58,749 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:58,750 - EPOCH - 21 : training on 236957 raw words (180233 effective words) took 0.4s, 438157 effective words/s


Loss após a época 20: 501922.0


2021-04-29 19:32:59,175 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:59,180 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:59,184 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:59,185 - EPOCH - 22 : training on 236957 raw words (180345 effective words) took 0.4s, 425738 effective words/s


Loss após a época 21: 497442.0


2021-04-29 19:32:59,583 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:59,588 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:59,591 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:59,592 - EPOCH - 23 : training on 236957 raw words (180421 effective words) took 0.4s, 453493 effective words/s


Loss após a época 22: 502739.0


2021-04-29 19:32:59,978 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:32:59,979 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:32:59,984 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:32:59,984 - EPOCH - 24 : training on 236957 raw words (180160 effective words) took 0.4s, 471143 effective words/s


Loss após a época 23: 497662.0


2021-04-29 19:33:00,368 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:00,372 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:00,375 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:00,376 - EPOCH - 25 : training on 236957 raw words (180561 effective words) took 0.4s, 473650 effective words/s


Loss após a época 24: 495552.0


2021-04-29 19:33:00,761 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:00,762 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:00,771 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:00,772 - EPOCH - 26 : training on 236957 raw words (180442 effective words) took 0.4s, 468422 effective words/s


Loss após a época 25: 507658.0


2021-04-29 19:33:01,192 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:01,194 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:01,205 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:01,206 - EPOCH - 27 : training on 236957 raw words (180436 effective words) took 0.4s, 426324 effective words/s


Loss após a época 26: 499497.0


2021-04-29 19:33:01,615 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:01,617 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:01,625 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:01,626 - EPOCH - 28 : training on 236957 raw words (180358 effective words) took 0.4s, 440909 effective words/s


Loss após a época 27: 502264.0


2021-04-29 19:33:02,039 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:02,040 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:02,046 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:02,047 - EPOCH - 29 : training on 236957 raw words (180317 effective words) took 0.4s, 438624 effective words/s


Loss após a época 28: 499747.0


2021-04-29 19:33:02,471 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:33:02,474 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:33:02,484 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:33:02,485 - EPOCH - 30 : training on 236957 raw words (180328 effective words) took 0.4s, 419654 effective words/s
2021-04-29 19:33:02,486 - Word2Vec lifecycle event {'msg': 'training on 7108710 raw words (5409207 effective words) took 12.6s, 429558 effective words/s', 'datetime': '2021-04-29T19:33:02.486121', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}


Loss após a época 29: 499961.0


(5409207, 7108710)
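
The skip-gram cells in this section set `alpha = 0.03` and `min_alpha = 0.07`. gensim moves the learning rate from `alpha` toward `min_alpha` as training progresses, and `min_alpha` is normally the smaller of the two. The sketch below is a simplified linear schedule, not gensim's internal code, and the value `0.0007` is only an illustrative assumption for a more conventional `min_alpha`. It shows that with the values used here the rate cannot decay.

```python
def linear_schedule(start, end, epochs):
    """Learning rate at the start of each epoch, interpolated linearly from start to end."""
    return [start + (end - start) * epoch / epochs for epoch in range(epochs)]


# Values used in the notebook: the rate grows instead of shrinking.
notebook_rates = linear_schedule(0.03, 0.07, 30)

# Illustrative alternative with min_alpha below alpha (0.0007 is an assumed value).
decaying_rates = linear_schedule(0.03, 0.0007, 30)

print(notebook_rates[0], notebook_rates[-1])
print(decaying_rates[0], decaying_rates[-1])
```

This may help explain why the skip-gram loss above flattens near 500000 instead of continuing to fall, although the notebook does not test that hypothesis.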

In [93]:
## TRAINING THE SKIP-GRAM MODEL

# NOTE: min_alpha (0.07) is larger than alpha (0.03), so the learning rate
# cannot decay during training; min_alpha is normally much smaller than alpha.
w2v_modelo_sg_eric = Word2Vec(sg=1,  # 1 = skip-gram, 0 = CBOW
                              window=5,
                              vector_size=300,
                              min_count=5,
                              alpha=0.03,
                              min_alpha=0.07)

## progress_per controls how often the vocabulary scan logs its progress
## (every 5000 sentences in the log below).
w2v_modelo_sg_eric.build_vocab(lista_lista_tokens, progress_per=5000)


### TRAINING

w2v_modelo_sg_eric.train(lista_lista_tokens,
                         total_examples=w2v_modelo_sg_eric.corpus_count,
                         epochs=30,
                         compute_loss=True,
                         callbacks=[callback()])

2021-04-29 19:34:54,397 - Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.03)', 'datetime': '2021-04-29T19:34:54.397531', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'created'}
2021-04-29 19:34:54,399 - collecting all words and their counts
2021-04-29 19:34:54,400 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-04-29 19:34:54,408 - PROGRESS: at sentence #5000, processed 34801 words, keeping 7955 word types
2021-04-29 19:34:54,416 - PROGRESS: at sentence #10000, processed 70543 words, keeping 11847 word types
2021-04-29 19:34:54,425 - PROGRESS: at sentence #15000, processed 109914 words, keeping 15191 word types
2021-04-29 19:34:54,433 - PROGRESS: at sentence #20000, processed 148523 words, keeping 17747 word types
2021-04-29 19:34:54,438 - collected 18996 word types from a corpus of 168936 raw words and 2283

Loss após a época 0: 399550.46875


2021-04-29 19:34:54,996 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:55,014 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:55,021 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:55,022 - EPOCH - 2 : training on 168936 raw words (116178 effective words) took 0.2s, 529479 effective words/s


Loss após a época 1: 362717.40625


2021-04-29 19:34:55,241 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:55,267 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:55,270 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:55,271 - EPOCH - 3 : training on 168936 raw words (116092 effective words) took 0.2s, 488628 effective words/s


Loss após a época 2: 354526.875


2021-04-29 19:34:55,489 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:55,511 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:55,514 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:55,515 - EPOCH - 4 : training on 168936 raw words (116059 effective words) took 0.2s, 508413 effective words/s


Loss após a época 3: 342561.0


2021-04-29 19:34:55,739 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:55,760 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:55,769 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:55,770 - EPOCH - 5 : training on 168936 raw words (116043 effective words) took 0.2s, 481252 effective words/s


Loss após a época 4: 330368.25


2021-04-29 19:34:56,027 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:56,046 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:56,053 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:56,054 - EPOCH - 6 : training on 168936 raw words (116054 effective words) took 0.3s, 419246 effective words/s


Loss após a época 5: 316429.25


2021-04-29 19:34:56,268 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:56,290 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:56,295 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:56,296 - EPOCH - 7 : training on 168936 raw words (116060 effective words) took 0.2s, 514653 effective words/s


Loss após a época 6: 298778.25


2021-04-29 19:34:56,506 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:56,530 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:56,533 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:56,534 - EPOCH - 8 : training on 168936 raw words (116024 effective words) took 0.2s, 535253 effective words/s


Loss após a época 7: 297413.5


2021-04-29 19:34:56,761 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:56,792 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:56,793 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:56,794 - EPOCH - 9 : training on 168936 raw words (116059 effective words) took 0.2s, 473143 effective words/s


Loss após a época 8: 289776.5


2021-04-29 19:34:57,035 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:57,063 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:57,064 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:57,065 - EPOCH - 10 : training on 168936 raw words (115922 effective words) took 0.3s, 440724 effective words/s


Loss após a época 9: 286013.5


2021-04-29 19:34:57,299 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:57,325 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:57,331 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:57,331 - EPOCH - 11 : training on 168936 raw words (116068 effective words) took 0.3s, 448509 effective words/s


Loss após a época 10: 276515.5


2021-04-29 19:34:57,552 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:57,569 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:57,578 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:57,578 - EPOCH - 12 : training on 168936 raw words (116016 effective words) took 0.2s, 485810 effective words/s


Loss após a época 11: 275467.0


2021-04-29 19:34:57,785 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:57,806 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:57,810 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:57,811 - EPOCH - 13 : training on 168936 raw words (115979 effective words) took 0.2s, 524676 effective words/s


Loss após a época 12: 273383.25


2021-04-29 19:34:58,015 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:58,034 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:58,041 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:58,041 - EPOCH - 14 : training on 168936 raw words (115932 effective words) took 0.2s, 543712 effective words/s
2021-04-29 19:34:58,242 - worker thread finished; awaiting finish of 2 more threads


Loss após a época 13: 263632.25


2021-04-29 19:34:58,262 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:58,268 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:58,269 - EPOCH - 15 : training on 168936 raw words (116078 effective words) took 0.2s, 536253 effective words/s


Loss após a época 14: 258333.5


2021-04-29 19:34:58,500 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:58,515 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:58,531 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:58,531 - EPOCH - 16 : training on 168936 raw words (116035 effective words) took 0.3s, 458531 effective words/s


Loss após a época 15: 257300.0


2021-04-29 19:34:58,759 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:58,772 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:58,787 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:58,788 - EPOCH - 17 : training on 168936 raw words (116041 effective words) took 0.2s, 465812 effective words/s


Loss após a época 16: 259162.0


2021-04-29 19:34:59,009 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:59,029 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:59,031 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:59,032 - EPOCH - 18 : training on 168936 raw words (116113 effective words) took 0.2s, 493478 effective words/s


Loss após a época 17: 259335.0


2021-04-29 19:34:59,244 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:59,270 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:59,271 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:59,272 - EPOCH - 19 : training on 168936 raw words (116103 effective words) took 0.2s, 522337 effective words/s


Loss após a época 18: 265751.0


2021-04-29 19:34:59,474 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:59,497 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:59,500 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:59,500 - EPOCH - 20 : training on 168936 raw words (116210 effective words) took 0.2s, 536038 effective words/s


Loss após a época 19: 260739.0


2021-04-29 19:34:59,712 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:34:59,737 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:34:59,741 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:34:59,742 - EPOCH - 21 : training on 168936 raw words (116056 effective words) took 0.2s, 501817 effective words/s


Loss após a época 20: 260616.0


2021-04-29 19:34:59,978 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:00,001 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:00,007 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:00,008 - EPOCH - 22 : training on 168936 raw words (115988 effective words) took 0.3s, 452665 effective words/s


Loss após a época 21: 259770.5


2021-04-29 19:35:00,246 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:00,266 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:00,273 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:00,274 - EPOCH - 23 : training on 168936 raw words (116008 effective words) took 0.2s, 464837 effective words/s


Loss após a época 22: 260186.0


2021-04-29 19:35:00,502 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:00,526 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:00,527 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:00,528 - EPOCH - 24 : training on 168936 raw words (116018 effective words) took 0.2s, 477899 effective words/s


Loss após a época 23: 259840.5


2021-04-29 19:35:00,769 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:00,796 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:00,797 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:00,797 - EPOCH - 25 : training on 168936 raw words (115980 effective words) took 0.3s, 448047 effective words/s


Loss após a época 24: 268162.0


2021-04-29 19:35:01,032 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:01,056 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:01,062 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:01,063 - EPOCH - 26 : training on 168936 raw words (116043 effective words) took 0.3s, 454444 effective words/s


Loss após a época 25: 261247.5


2021-04-29 19:35:01,287 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:01,301 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:01,310 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:01,311 - EPOCH - 27 : training on 168936 raw words (116146 effective words) took 0.2s, 489463 effective words/s


Loss após a época 26: 261351.5


2021-04-29 19:35:01,526 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:01,547 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:01,552 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:01,553 - EPOCH - 28 : training on 168936 raw words (116077 effective words) took 0.2s, 496044 effective words/s


Loss após a época 27: 263345.5


2021-04-29 19:35:01,794 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:01,823 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:01,824 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:01,825 - EPOCH - 29 : training on 168936 raw words (116143 effective words) took 0.3s, 447645 effective words/s


Loss após a época 28: 262661.5


2021-04-29 19:35:02,050 - worker thread finished; awaiting finish of 2 more threads
2021-04-29 19:35:02,077 - worker thread finished; awaiting finish of 1 more threads
2021-04-29 19:35:02,079 - worker thread finished; awaiting finish of 0 more threads
2021-04-29 19:35:02,080 - EPOCH - 30 : training on 168936 raw words (116282 effective words) took 0.2s, 468151 effective words/s
2021-04-29 19:35:02,081 - Word2Vec lifecycle event {'msg': 'training on 5068080 raw words (3481731 effective words) took 7.5s, 462781 effective words/s', 'datetime': '2021-04-29T19:35:02.081229', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'train'}


Loss após a época 29: 258017.5


(3481731, 5068080)
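
The tuple returned by training can be checked against the log above. This is a rough sanity check, assuming (as the gensim documentation describes) that the first element counts effective words, meaning those left after vocabulary filtering and downsampling, and the second counts raw words. The variable names are illustrative.

```python
# Values copied from the training log above.
raw_words_per_epoch = 168936
epochs = 30
effective_words, raw_words = (3481731, 5068080)

# The raw total is the corpus size times the number of epochs.
assert raw_words_per_epoch * epochs == raw_words

# Share of raw words that actually took part in training.
effective_share = effective_words / raw_words
print(f"{effective_share:.3f}")  # 0.687
```

Roughly two thirds of the raw tokens were used in the weight updates, which matches the per-epoch lines reporting about 116000 effective words out of 168936.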

In [94]:
w2v_modelo_cbow_helano.wv.save_word2vec_format('/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow_helano.txt',binary=False)
w2v_modelo_sg_helano.wv.save_word2vec_format('/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram_helano.txt',binary=False)

2021-04-29 19:36:14,065 - storing 4226x300 projection weights into /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow_helano.txt
2021-04-29 19:36:14,873 - storing 4226x300 projection weights into /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram_helano.txt


In [96]:
w2v_modelo_cbow_eric.wv.save_word2vec_format('/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow.txt',binary=False)
w2v_modelo_sg_eric.wv.save_word2vec_format('/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram.txt',binary=False)

2021-04-29 19:37:02,696 - storing 3945x300 projection weights into /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow.txt
2021-04-29 19:37:03,414 - storing 3945x300 projection weights into /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram.txt
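
With `binary=False`, the vectors are written as plain text. The sketch below illustrates that layout, assuming the usual word2vec text convention: a header line with the vocabulary size and the dimension, then one word and its components per line. `ler_word2vec_txt` is a hypothetical helper (not part of gensim), and the two toy words with 3-dimensional vectors are invented.

```python
import io

def ler_word2vec_txt(arquivo):
    """Parse word2vec text format into (n_words, dim, {word: [floats]})."""
    n_palavras, dim = map(int, arquivo.readline().split())
    vetores = {}
    for linha in arquivo:
        partes = linha.rstrip().split(" ")
        if len(partes) != dim + 1:
            continue  # skip blank or malformed lines
        vetores[partes[0]] = [float(v) for v in partes[1:]]
    return n_palavras, dim, vetores

# Toy file: header "2 3" means 2 words with 3 components each.
exemplo = io.StringIO("2 3\nwoman 0.1 -0.2 0.3\nhouse 0.0 0.5 -0.5\n")
n_palavras, dim, vetores = ler_word2vec_txt(exemplo)
print(n_palavras, dim, vetores["house"])  # 2 3 [0.0, 0.5, -0.5]
```

Under the same assumption, the files stored above would be expected to start with the headers `4226 300` and `3945 300`, matching the shapes in the log lines.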


### NOW RUN THE LOGISTIC REGRESSION - JUST IMPORT THE TXT FILES AND THE CSV WITH THE ADDED FIELDS



In [5]:
##Dataset
#dados = pd.read_csv('/content/drive/MyDrive/Projeto Daniel/procesado_labeled_data_new.csv')
#dados = pd.read_csv('/Users/eric/Downloads/projeto_daniel/procesado_labeled_data_new.csv')
#dados.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2,tweet_definitivo,eric_tweet_definito,helano_tweet_definito,eric_tweet_definito_final
0,0,0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,['Neutro'],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"['woman', 'n', ""'"", 't', 'complain', 'cleaning...",woman n' t complain cleaning house amp man alw...,woman n' t complain cleaning house amp man alw...,rt woman complain cleaning house amp man trash,woman n' t complain cleaning house amp man alw...,rt woman complain cleaning house amp man trash
1,1,1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,['Linguagem ofensiva'],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"['boy', 'dat', 's', 'cold', 'tyga', 'dw', 'n',...",boy dat s cold tyga dw n bad cuff in dat hoe s...,boy dat s cold tea do n bad cuff in dat he st ...,rt boy dats cold tyga dwn bad cuffin dat hoe p...,boy dat s cold tea do n bad cuff in dat he st ...,rt boy dats cold tyga dwn bad cuffin dat hoe p...
2,2,2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,['Linguagem ofensiva'],rt dawg rt you ever fuck a bitch and s...,dawg ever fuck bitch start cry confused shit,"['daw', 'g', 'ever', 'fuck', 'bitch', 'start',...",daw g ever fuck bitch start cry confused shit,day g ever fuck bitch start cry confused shit,rt dawg rt fuck bitch start cry confused shit,day g ever fuck bitch start cry confused shit,rt dawg rt fuck bitch start cry confused shit
3,3,3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,['Linguagem ofensiva'],rt _g_anderson _based she look like a tranny,_g_anderson _based look like tranny,"['g', 'anderson', 'based', 'look', 'like', 'tr...",g anderson based look like tr an ny,g anderson based look like tr an ny,rt look like tranny,g anderson based look like tr an ny,rt look like tranny
4,4,4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,['Linguagem ofensiva'],rt the shit you hear about me might be tr...,shit hear might true might faker bitch told ya,"['shit', 'hear', 'might', 'true', 'might', 'fa...",shit hear might true might faker bitch told ya,shit hear might true might baker bitch told a,rt shit hear true faker bitch told ya,shit hear might true might baker bitch told a,rt shit hear true faker bitch told ya


In [5]:
#w2v_modelo_cbow_eric = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/Projeto Daniel/w2v_modelo_cbow.txt")
#w2v_modelo_eric_sg = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/Projeto Daniel/w2v_modelo_skipgram.txt")
#w2v_modelo_cbow_helano = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/Projeto Daniel/w2v_modelo_cbow_helano.txt")
#w2v_modelo_sg_helano = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/Projeto Daniel/w2v_modelo_skipgram_helano.txt")


In [217]:
w2v_modelo_cbow_eric = KeyedVectors.load_word2vec_format("/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow.txt")
w2v_modelo_sg_eric = KeyedVectors.load_word2vec_format("/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram.txt")
w2v_modelo_cbow_helano = KeyedVectors.load_word2vec_format("/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow_helano.txt")
w2v_modelo_sg_helano = KeyedVectors.load_word2vec_format("/Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram_helano.txt")



2021-04-30 09:29:56,126 - loading projection weights from /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow.txt
2021-04-30 09:29:57,240 - KeyedVectors lifecycle event {'msg': 'loaded (3945, 300) matrix of type float32 from /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_cbow.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2021-04-30T09:29:57.236006', 'gensim': '4.0.1', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-20.3.0-x86_64-i386-64bit', 'event': 'load_word2vec_format'}
2021-04-30 09:29:57,242 - loading projection weights from /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram.txt
2021-04-30 09:29:58,326 - KeyedVectors lifecycle event {'msg': 'loaded (3945, 300) matrix of type float32 from /Users/eric/Downloads/projeto_daniel/29042021w2v_modelo_skipgram.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2021-04-30T09:29:58.326270', 'gensim': '4.0.1', 'python': '3.7.4 (def

In [218]:
# Only lexical attributes (is_stop, is_alpha) are needed, so the heavier components are disabled.
nlp = spacy.load('en_core_web_sm',disable=['parser','ner','tagger','textcat', "lemmatizer"])
#nlp = en_core_sci_lg.load(disable=["tagger", "ner", "lemmatizer"])

def tokenizador(texto):
  # Keep only alphabetic tokens that are not stop words, in lowercase.
  tokens_validos=[]
  doc=nlp(str(texto))
  for token in doc:
    e_valido= not token.is_stop and token.is_alpha
    if e_valido:
      tokens_validos.append(token.text.lower())

  return tokens_validos
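
The filtering rule in `tokenizador` can be illustrated without loading spaCy. The sketch below is a stand-in and not spaCy's behaviour: it splits on whitespace and uses a tiny hypothetical stop-word set (`STOPWORDS_EXEMPLO`), whereas spaCy applies its own tokenizer and stop list, so real results will differ.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOPWORDS_EXEMPLO = {"a", "as", "you", "the", "about"}

def tokenizador_simples(texto):
    """Keep lowercase tokens that are alphabetic and not stop words."""
    tokens_validos = []
    for token in str(texto).split():
        minusculo = token.lower()
        if minusculo.isalpha() and minusculo not in STOPWORDS_EXEMPLO:
            tokens_validos.append(minusculo)
    return tokens_validos

print(tokenizador_simples("As a woman you shouldn't complain"))
# ['woman', 'complain']
```

Tokens containing an apostrophe or digits fail the alphabetic test, which is why `shouldn't` disappears in this toy version.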


In [219]:
def combinacao_vetores_por_soma(palavras,modelo):
  # Sum the 300-dimensional vectors of all tokens; out-of-vocabulary tokens are skipped.
  vetor_resultante = np.zeros((1,300))
  for pn in palavras:
    try:
      vetor_resultante += modelo.get_vector(pn)
    except KeyError:
      pass
  return vetor_resultante
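
Because the token vectors are summed, longer tweets tend to produce vectors with larger norms, and a tweet with no known token stays all zeros. The sketch below puts the sum next to a mean-pooled variant, which is a different technique from the one used in this notebook and is shown only for comparison. Plain dictionaries and lists stand in for the gensim model and NumPy arrays, and the 2-dimensional toy vectors are invented.

```python
# Toy "model": word -> vector. Invented values, 2 dimensions instead of 300.
modelo_exemplo = {"woman": [1.0, 0.0], "house": [0.0, 2.0], "trash": [1.0, 2.0]}

def soma_vetores(palavras, modelo, dim=2):
    """Sum the vectors of known words; unknown words are skipped."""
    total = [0.0] * dim
    encontrados = 0
    for palavra in palavras:
        vetor = modelo.get(palavra)
        if vetor is None:
            continue  # out-of-vocabulary token
        total = [t + v for t, v in zip(total, vetor)]
        encontrados += 1
    return total, encontrados

def media_vetores(palavras, modelo, dim=2):
    """Mean-pooled variant: divide the sum by the number of known words."""
    total, encontrados = soma_vetores(palavras, modelo, dim)
    if encontrados == 0:
        return total  # all-zero vector when no token is known
    return [t / encontrados for t in total]

tokens = ["woman", "house", "trash", "xyz"]
print(soma_vetores(tokens, modelo_exemplo)[0])  # [2.0, 4.0]
print(media_vetores(tokens, modelo_exemplo))
```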


In [220]:
def matriz_vetores(textos,modelo):
  # Build a (number of texts, 300) matrix with one summed vector per text.
  x=len(textos)
  y=300
  matriz = np.zeros((x,y))

  for i in range(x):
    palavras = tokenizador(textos.iloc[i])
    matriz[i] = combinacao_vetores_por_soma(palavras,modelo)

  return matriz



In [221]:
display(dados_eric.head(2))
dados_helano.head(2)

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2,eric_tweet_definito
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"[woman, n, ', t, complain, cleaning, house, am...",woman n' t complain cleaning house amp man alw...,rt woman complain cleaning house amp man trash
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"[boy, dat, s, cold, tyga, dw, n, bad, cuff, in...",boy dat s cold tyga dw n bad cuff in dat hoe s...,rt boy dats cold tyga dwn bad cuffin dat hoe p...


Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,classe,tweet_novo,tweet_organizado,text_split,text_novo2,tweet_definitivo
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,[Neutro],rt as a woman you shouldn't complain abou...,womann't complain cleaning house amp man alway...,"[woman, n, ', t, complain, cleaning, house, am...",woman n' t complain cleaning house amp man alw...,woman n' t complain cleaning house amp man alw...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,[Linguagem ofensiva],rt boy dats cold tyga dwn bad for cuffin ...,boy dats cold tyga dwn bad cuffin dat hoe st p...,"[boy, dat, s, cold, tyga, dw, n, bad, cuff, in...",boy dat s cold tyga dw n bad cuff in dat hoe s...,boy dat s cold tea do n bad cuff in dat he st ...


In [222]:
### DEFINING THE TRAIN AND TEST DATA
X_helano = dados_helano['tweet_definitivo']

X_eric = dados_eric['eric_tweet_definito']
y_helano = dados_helano['class']
y_eric = dados_eric['class']
### TEST
#y_eric=MultiLabelBinarizer().fit_transform(y_eric)

In [223]:
y_eric

0        2
1        1
2        1
3        1
4        1
        ..
24778    1
24779    2
24780    1
24781    1
24782    2
Name: class, Length: 22831, dtype: int64

In [224]:
## Split with a 20% test set, the same percentage Helano used.
X_train_helano, X_test_helano, y_train_helano, y_test_helano = train_test_split(X_helano, y_helano, random_state=42, test_size=0.2)
#X_train_eric, X_test_eric, y_train_eric, y_test_eric = train_test_split(X_eric, y, random_state=42, test_size=0.2)

In [225]:
X_train_helano[0]

"woman n' t complain cleaning house amp man always take trash"

In [226]:
X_train_eric, X_test_eric, y_train_eric, y_test_eric = train_test_split(X_eric, y_eric, random_state=42, test_size=0.2)

In [227]:
y_train_eric

18878    1
19780    2
5780     1
3781     2
12790    1
        ..
13424    0
23408    1
6296     1
907      1
17326    1
Name: class, Length: 18264, dtype: int64
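
The train and test lengths follow directly from `test_size=0.2`. The sketch below assumes that scikit-learn rounds the test count up and gives the remainder to training, which is consistent with the lengths and shapes printed in this notebook. `tamanhos_split` is a hypothetical helper written for this illustration.

```python
import math

def tamanhos_split(n_amostras, test_size=0.2):
    """Return (n_train, n_test), rounding the test share up."""
    n_teste = math.ceil(test_size * n_amostras)
    return n_amostras - n_teste, n_teste

print(tamanhos_split(22831))  # (18264, 4567)
print(tamanhos_split(24783))  # (19826, 4957)
```

The first call uses the 22831 rows of `y_eric`; the second uses 24783, the sum of the two row counts printed for the Helano matrices.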

## CBOW (ERIC AND HELANO)

In [228]:
## 
matriz_vetores_treino_cbow_eric = matriz_vetores(X_train_eric,w2v_modelo_cbow_eric)
matriz_vetores_teste_cbow_eric = matriz_vetores(X_test_eric, w2v_modelo_cbow_eric)
print(matriz_vetores_treino_cbow_eric.shape)
print(matriz_vetores_teste_cbow_eric.shape)

(18264, 300)
(4567, 300)


In [229]:
matriz_vetores_treino_cbow_helano = matriz_vetores(X_train_helano,w2v_modelo_cbow_helano)
matriz_vetores_teste_cbow_helano = matriz_vetores(X_test_helano, w2v_modelo_cbow_helano)
print(matriz_vetores_treino_cbow_helano.shape)
print(matriz_vetores_teste_cbow_helano.shape)

(19826, 300)
(4957, 300)


## SKIP-GRAM (ERIC AND HELANO)

In [230]:
matriz_vetores_treino_skip_eric = matriz_vetores(X_train_eric,w2v_modelo_sg_eric)
matriz_vetores_teste_skip_eric = matriz_vetores(X_test_eric, w2v_modelo_sg_eric)
print(matriz_vetores_treino_skip_eric.shape)
print(matriz_vetores_teste_skip_eric.shape)

(18264, 300)
(4567, 300)


In [231]:
matriz_vetores_treino_skip_helano = matriz_vetores(X_train_helano,w2v_modelo_sg_helano)
matriz_vetores_teste_skip_helano = matriz_vetores(X_test_helano, w2v_modelo_sg_helano)
print(matriz_vetores_treino_skip_helano.shape)
print(matriz_vetores_teste_skip_helano.shape)

(19826, 300)
(4957, 300)


### **LOGISTIC REGRESSION**

In [319]:

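def classificador(modelo,x_treino,y_treino,x_teste,y_teste):
    # `modelo` is not used inside the function; it is kept so the existing calls still work.
    LR = LogisticRegression(max_iter=800,solver='lbfgs', multi_class='auto',random_state=42)

    LR.fit(x_treino,y_treino)

    # Predict the classes of the test set (y_pred).
    previsao = LR.predict(x_teste)
    # Display names for classes 0, 1, 2: hate speech, offensive language, neither.
    target_names = ['Discurso de ódio', 'Linguagem ofensiva', 'Neutro']

    resultado = classification_report(y_teste,previsao,target_names=target_names)

    print(resultado)

    # Returns a tuple: the fitted model and its predictions on the test set.
    return LR,previsao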

In [324]:
RL__CBOW_eric = classificador(w2v_modelo_cbow_eric,matriz_vetores_treino_cbow_eric,
              y_train_eric,
              matriz_vetores_teste_cbow_eric,
              y_test_eric)

                    precision    recall  f1-score   support

  Discurso de ódio       0.47      0.15      0.22       280
Linguagem ofensiva       0.89      0.95      0.92      3514
            Neutro       0.76      0.72      0.74       773

          accuracy                           0.86      4567
         macro avg       0.71      0.60      0.63      4567
      weighted avg       0.84      0.86      0.85      4567
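
The two average rows of the report can be reproduced from the per-class rows. This is a small sketch using the precision and support values printed above. Since those inputs are already rounded to two decimals, other columns may differ from the report by 0.01.

```python
# Per-class precision and support copied from the report above.
precisao = [0.47, 0.89, 0.76]
suporte = [280, 3514, 773]

# Macro average: every class counts the same.
macro = sum(precisao) / len(precisao)

# Weighted average: each class counts in proportion to its support.
ponderada = sum(p * s for p, s in zip(precisao, suporte)) / sum(suporte)

print(round(macro, 2), round(ponderada, 2))  # 0.71 0.84
```

The gap between the two averages reflects the class imbalance: the weighted figure is dominated by the 3514 offensive-language tweets.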



In [323]:
RL__CBOW_helano = classificador(w2v_modelo_cbow_helano,matriz_vetores_treino_cbow_helano,
              y_train_helano,
              matriz_vetores_teste_cbow_helano,
              y_test_helano)

                    precision    recall  f1-score   support

  Discurso de ódio       0.46      0.16      0.23       290
Linguagem ofensiva       0.89      0.95      0.92      3832
            Neutro       0.75      0.70      0.72       835

          accuracy                           0.86      4957
         macro avg       0.70      0.60      0.62      4957
      weighted avg       0.84      0.86      0.85      4957



In [321]:
RL__SKIP_eric = classificador(w2v_modelo_sg_eric,matriz_vetores_treino_skip_eric,
              y_train_eric,
              matriz_vetores_teste_skip_eric,
              y_test_eric)

                    precision    recall  f1-score   support

  Discurso de ódio       0.50      0.15      0.23       280
Linguagem ofensiva       0.90      0.95      0.92      3514
            Neutro       0.76      0.74      0.75       773

          accuracy                           0.87      4567
         macro avg       0.72      0.61      0.64      4567
      weighted avg       0.85      0.87      0.85      4567
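
The f1-score column is the harmonic mean of precision and recall. The sketch below checks this against the 'Discurso de ódio' row of the report above. `f1` is a helper written for this illustration, not a scikit-learn function.

```python
def f1(precisao, recall):
    """Harmonic mean of precision and recall."""
    if precisao + recall == 0:
        return 0.0
    return 2 * precisao * recall / (precisao + recall)

# 'Discurso de ódio' row of the skip-gram report above.
print(round(f1(0.50, 0.15), 2))  # 0.23
```

The harmonic mean stays close to the smaller of the two values, so the low recall of 0.15 keeps the f1-score low even with a precision of 0.50.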



In [322]:
RL__SKIP_helano = classificador(w2v_modelo_sg_helano,matriz_vetores_treino_skip_helano,
              y_train_helano,
              matriz_vetores_teste_skip_helano,
              y_test_helano)

                    precision    recall  f1-score   support

  Discurso de ódio       0.42      0.10      0.17       290
Linguagem ofensiva       0.88      0.95      0.91      3832
            Neutro       0.73      0.66      0.69       835

          accuracy                           0.85      4957
         macro avg       0.68      0.57      0.59      4957
      weighted avg       0.83      0.85      0.83      4957
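
To put the accuracies of 0.85 to 0.87 in context, it helps to compare them with a classifier that always predicts the most frequent class. The sketch below uses only the support values printed in the reports above; `acuracia_classe_majoritaria` is a hypothetical helper.

```python
# Test-set support per class, copied from the reports above.
suporte_eric = {"Discurso de ódio": 280, "Linguagem ofensiva": 3514, "Neutro": 773}
suporte_helano = {"Discurso de ódio": 290, "Linguagem ofensiva": 3832, "Neutro": 835}

def acuracia_classe_majoritaria(suporte):
    """Accuracy of always predicting the most frequent class."""
    return max(suporte.values()) / sum(suporte.values())

print(round(acuracia_classe_majoritaria(suporte_eric), 3))    # 0.769
print(round(acuracia_classe_majoritaria(suporte_helano), 3))  # 0.773
```

Both preprocessing pipelines therefore improve on this baseline by roughly 8 to 10 percentage points, while recall for 'Discurso de ódio' remains between 0.10 and 0.16 in all four reports.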

