# Word Embeddings for 2018 News

The data consist of news articles published throughout 2018 by `Carta Capital`, `O Antagonista`, `O Globo`, and `Veja`. A detailed analysis of the data is available [here](https://pages.github.com/).

The goal of this notebook is to use the word2vec model to generate embeddings from the text of these news articles. The model is trained with the skip-gram architecture.
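
To make the skip-gram setup concrete, the sketch below lists the (center, context) pairs that a skip-gram model would be asked to predict for one tokenized sentence. It is only an illustration: `skipgram_pairs` and the sample sentence are hypothetical, and gensim's actual training also samples a reduced window per word and applies frequency subsampling and negative sampling, none of which is shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first context position
        hi = min(len(tokens), i + window + 1)   # one past the last context position
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["governo", "anuncia", "nova", "medida"]  # hypothetical tokenized sentence
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1` every word is paired only with its immediate neighbours; the notebook trains with `window=5`, so each word can be paired with up to ten surrounding words.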

In [1]:
# importing modules and setting log format
import re
import nltk
import gensim, logging
import pandas as pd
from nltk.corpus import stopwords
nltk.download('stopwords')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
PUNCTUATION = u'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]' # regex matching every character that is NOT a listed letter, a digit, or '%'

[nltk_data] Downloading package stopwords to /home/diogo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Defining Lexicons and Functions

In [2]:
# Map multi-word expressions to single underscore-joined tokens
map_lexicons = {'a ponto ':'a_ponto ','ao menos ':'ao_menos ','ate mesmo ':'ate_mesmo ',
                'nao mais que ':'nao_mais_que ','nem mesmo ':'nem_mesmo ','no minimo ':'no_minimo ',
                'o unico ':'o_unico ','a unica ':'a_unica ','pelo menos ':'pelo_menos ',
                'quando menos ':'quando_menos ','quando muito ':'quando_muito ','a par disso ':'a_par_disso ',
                'e nao ':'e_nao ','em suma ':'em_suma ','mas tambem ': 'mas_tambem ','muito menos ':'muito_menos ',
                'nao so ':'nao_so ','ou mesmo ':'ou_mesmo ','por sinal ':'por_sinal ','com isso ':'com_isso ',
                'como consequencia ':'como_consequencia ','de modo que ':'de_modo_que ','deste modo ':'deste_modo ',
                'em decorrencia ':'em_decorrencia ','nesse sentido ':'nesse_sentido ','por causa ':'por_causa ',
                'por conseguinte ':'por_conseguinte ','por essa razao ':'por_essa_razao ','por isso ':'por_isso ',
                'sendo assim ':'sendo_assim ','ou entao ':'ou_entao ','como se ':'como_se ',
                'de um lado ':'de_um_lado ','por outro lado ':'por_outro_lado ','mais que ':'mais_que ',
                'menos que ':'menos_que ','desde que ':'desde_que ','do contrario ':'do_contrario ',
                'em lugar ':'em_lugar ','em vez ':'em_vez ','no caso ':'no_caso ','se acaso ':'se_acaso ',
                'de certa forma ':'de_certa_forma ','desse modo ':'desse_modo ','em funcao ':'em_funcao ',
                'isso e ':'isso_e ','ja que ':'ja_que ','na medida que ':'na_medida_que ','nessa direcao ':'nessa_direcao ',
                'no intuito ':'no_intuito ','no mesmo sentido ':'no_mesmo_sentido ','ou seja ':'ou_seja ',
                'uma vez que ':'uma_vez_que ','tanto que ':'tanto_que ','visto que ':'visto_que ','ainda que ':'ainda_que ',
                'ao contrario ':'ao_contrario ','apesar de ':'apesar_de ','fora isso ':'fora_isso ','mesmo que ':'mesmo_que ',
                'nao obstante ':'nao_obstante ','nao fosse isso ':'nao_fosse_isso ','no entanto ':'no_entanto ',
                'para tanto ':'para_tanto ','pelo contrario ':'pelo_contrario ','por sua vez ':'por_sua_vez ','posto que ':'posto_que '
               }

In [3]:
# Replace each multi-word expression in the text with its single-token lexicon form
def word2lexicon(text):
    for k, v in map_lexicons.items():
        text = str(text).replace(k,v)
    return text
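
A short usage sketch of `word2lexicon` follows. It repeats a three-entry subset of the mapping so that it runs on its own, and the sample sentences are made up. It also shows two properties of the plain `str.replace` approach: the keys are unaccented, so accented spellings in the text are not matched, and matching is case-sensitive because lowercasing only happens later in `processSentences`.

```python
# Small subset of the notebook's mapping, repeated so the example is self-contained
map_lexicons = {'por isso ': 'por_isso ', 'nao so ': 'nao_so ', 'no entanto ': 'no_entanto '}


def word2lexicon(text):
    for k, v in map_lexicons.items():
        text = str(text).replace(k, v)
    return text


joined = word2lexicon('no entanto ele saiu, por isso voltou')
accented = word2lexicon('não só ele')        # key is 'nao so ', so the accented form is left alone
capitalized = word2lexicon('Por isso ele saiu')  # 'Por isso ' differs from 'por isso '

print(joined)
print(accented)
print(capitalized)
```

If accented or capitalized occurrences should be captured as well, one option would be to lowercase and strip accents before calling `word2lexicon`; whether that is desirable depends on how the embeddings will be used.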

In [4]:
# function for processing sentences
STOP_WORDS = set(stopwords.words('portuguese')) # load stop words once, as a set for fast lookup

def processSentences(text):
    text = re.sub(PUNCTUATION, ' ', str(text)) # replace every character outside the allowed set with a space
    tokens = text.lower().split() # lowercase and split into words
    return [word for word in tokens if word not in STOP_WORDS] # remove stop words
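
One detail worth checking in this step: the character class in `PUNCTUATION` does not list `_`, so the underscores introduced by `word2lexicon` are replaced with spaces again when the pattern is applied. The sketch below compares the notebook's pattern with a hypothetical variant, `PUNCTUATION_KEEP_UNDERSCORE`, that also keeps the underscore. The sample sentence is made up, and stop-word removal is left out to keep the example free of downloads.

```python
import re

# Pattern copied from the notebook
PUNCTUATION = u'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]'
# Hypothetical variant (an assumption, not part of the notebook) that keeps '_'
PUNCTUATION_KEEP_UNDERSCORE = u'[^a-zA-Z0-9_áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]'

text = 'no_entanto, o governo recuou 2%'

original_tokens = re.sub(PUNCTUATION, ' ', text).lower().split()
variant_tokens = re.sub(PUNCTUATION_KEEP_UNDERSCORE, ' ', text).lower().split()

print(original_tokens)  # the lexicon token is split back into two words
print(variant_tokens)   # the lexicon token survives as one unit
```

With the original pattern the split words then go through stop-word removal like any others, so the multi-word expressions may not reach the model as single tokens.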

### Loading the News

In [5]:
# load data
carta_capital = pd.read_csv("data_2018/carta_capital.csv") 
oantagonista = pd.read_csv("data_2018/oantagonista.csv") 
oglobo = pd.read_csv("data_2018/oglobo.csv") 
veja = pd.read_csv("data_2018/veja.csv") 
# concat all news
news = pd.concat((carta_capital, oantagonista, oglobo, veja), sort=False, ignore_index=True)

In [6]:
# processing news text
news['text'] = news['text'].apply(word2lexicon) 
news['text'] = news['text'].apply(processSentences)
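
Both functions call `str(text)` on their input, so a missing value would not raise an error; it would become the literal token `nan`. Whether the CSV files contain rows with an empty `text` field is not verified here. The sketch below only shows the effect, using a simplified `tokenize` and a hypothetical `is_missing` helper. In pandas, the equivalent filter would be `news.dropna(subset=['text'])` before the `apply` calls.

```python
import math
import re

PUNCTUATION = u'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ%]'  # pattern copied from the notebook


def tokenize(text):
    """Same cleaning steps as processSentences, minus stop-word removal."""
    return re.sub(PUNCTUATION, ' ', str(text)).lower().split()


def is_missing(value):
    """Hypothetical helper: True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing = float('nan')  # the value pandas uses by default for an empty CSV field
nan_tokens = tokenize(missing)
print(nan_tokens)

rows = ['Primeira notícia', missing, 'Segunda notícia']  # made-up rows
kept = [tokenize(r) for r in rows if not is_missing(r)]
print(kept)
```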

### Training Word2Vec

In [7]:
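# Train word2vec model
# settings: skip-gram architecture (sg=1), 300-dimensional embedding vectors
# note: `size` is the gensim < 4.0 parameter name; gensim 4.0 renamed it to `vector_size`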
model = gensim.models.Word2Vec(news['text'], workers=4, size=300, sg=1, window=5, min_count=5)
# Saving model
model.save('model/news-w2v.bin')
# Saving embeddings only (written in word2vec text format, since binary defaults to False)
model.wv.save_word2vec_format("model/news-vectors.bin")

2019-04-08 09:39:45,852 : INFO : collecting all words and their counts
2019-04-08 09:39:45,857 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-08 09:39:46,427 : INFO : PROGRESS: at sentence #10000, processed 2597527 words, keeping 96897 word types
2019-04-08 09:39:46,537 : INFO : PROGRESS: at sentence #20000, processed 3057991 words, keeping 100936 word types
2019-04-08 09:39:46,650 : INFO : PROGRESS: at sentence #30000, processed 3520893 words, keeping 104303 word types
2019-04-08 09:39:46,845 : INFO : PROGRESS: at sentence #40000, processed 4363232 words, keeping 114672 word types
2019-04-08 09:39:47,517 : INFO : PROGRESS: at sentence #50000, processed 7330638 words, keeping 145526 word types
2019-04-08 09:39:48,206 : INFO : PROGRESS: at sentence #60000, processed 10360830 words, keeping 170209 word types
2019-04-08 09:39:48,881 : INFO : PROGRESS: at sentence #70000, processed 13300961 words, keeping 190357 word types
2019-04-08 09:39:49,486 : INFO 

2019-04-08 09:40:49,227 : INFO : EPOCH 1 - PROGRESS: at 59.63% examples, 207200 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:50,267 : INFO : EPOCH 1 - PROGRESS: at 60.34% examples, 207281 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:51,304 : INFO : EPOCH 1 - PROGRESS: at 61.04% examples, 207384 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:52,321 : INFO : EPOCH 1 - PROGRESS: at 61.63% examples, 206724 words/s, in_qsize 8, out_qsize 0
2019-04-08 09:40:53,348 : INFO : EPOCH 1 - PROGRESS: at 62.23% examples, 206377 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:54,382 : INFO : EPOCH 1 - PROGRESS: at 62.88% examples, 206185 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:55,394 : INFO : EPOCH 1 - PROGRESS: at 63.57% examples, 206230 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:56,448 : INFO : EPOCH 1 - PROGRESS: at 64.26% examples, 206266 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:40:57,452 : INFO : EPOCH 1 - PROGRESS: at 64.92% examples, 206314 words/s, in_qsiz

2019-04-08 09:42:00,901 : INFO : EPOCH 2 - PROGRESS: at 36.77% examples, 203001 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:01,934 : INFO : EPOCH 2 - PROGRESS: at 37.48% examples, 203429 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:02,945 : INFO : EPOCH 2 - PROGRESS: at 38.15% examples, 203195 words/s, in_qsize 8, out_qsize 0
2019-04-08 09:42:03,971 : INFO : EPOCH 2 - PROGRESS: at 38.86% examples, 203307 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:04,976 : INFO : EPOCH 2 - PROGRESS: at 39.49% examples, 203143 words/s, in_qsize 6, out_qsize 1
2019-04-08 09:42:06,010 : INFO : EPOCH 2 - PROGRESS: at 40.21% examples, 203838 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:07,021 : INFO : EPOCH 2 - PROGRESS: at 40.84% examples, 203281 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:08,137 : INFO : EPOCH 2 - PROGRESS: at 41.56% examples, 203409 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:42:09,209 : INFO : EPOCH 2 - PROGRESS: at 42.28% examples, 203774 words/s, in_qsiz

2019-04-08 09:43:16,885 : INFO : EPOCH 2 - PROGRESS: at 95.07% examples, 205709 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:17,952 : INFO : EPOCH 2 - PROGRESS: at 95.75% examples, 205761 words/s, in_qsize 8, out_qsize 0
2019-04-08 09:43:18,985 : INFO : EPOCH 2 - PROGRESS: at 96.53% examples, 205860 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:19,993 : INFO : EPOCH 2 - PROGRESS: at 97.26% examples, 206045 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:21,006 : INFO : EPOCH 2 - PROGRESS: at 97.97% examples, 206087 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:22,055 : INFO : EPOCH 2 - PROGRESS: at 98.53% examples, 206095 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:23,068 : INFO : EPOCH 2 - PROGRESS: at 98.97% examples, 206149 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:24,073 : INFO : EPOCH 2 - PROGRESS: at 99.49% examples, 206248 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:43:24,772 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-

2019-04-08 09:44:29,776 : INFO : EPOCH 3 - PROGRESS: at 66.33% examples, 208697 words/s, in_qsize 6, out_qsize 1
2019-04-08 09:44:30,836 : INFO : EPOCH 3 - PROGRESS: at 67.06% examples, 208839 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:31,890 : INFO : EPOCH 3 - PROGRESS: at 67.74% examples, 208704 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:32,947 : INFO : EPOCH 3 - PROGRESS: at 68.35% examples, 208800 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:33,990 : INFO : EPOCH 3 - PROGRESS: at 68.92% examples, 208783 words/s, in_qsize 6, out_qsize 1
2019-04-08 09:44:35,026 : INFO : EPOCH 3 - PROGRESS: at 69.81% examples, 208781 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:36,105 : INFO : EPOCH 3 - PROGRESS: at 70.35% examples, 208775 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:37,162 : INFO : EPOCH 3 - PROGRESS: at 71.08% examples, 208862 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:44:38,260 : INFO : EPOCH 3 - PROGRESS: at 72.76% examples, 208842 words/s, in_qsiz

2019-04-08 09:45:41,700 : INFO : EPOCH 4 - PROGRESS: at 45.08% examples, 205021 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:42,747 : INFO : EPOCH 4 - PROGRESS: at 45.68% examples, 204891 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:43,847 : INFO : EPOCH 4 - PROGRESS: at 46.40% examples, 204992 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:44,906 : INFO : EPOCH 4 - PROGRESS: at 47.09% examples, 205063 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:45,947 : INFO : EPOCH 4 - PROGRESS: at 47.73% examples, 205120 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:46,995 : INFO : EPOCH 4 - PROGRESS: at 48.41% examples, 205220 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:48,050 : INFO : EPOCH 4 - PROGRESS: at 49.08% examples, 205083 words/s, in_qsize 8, out_qsize 0
2019-04-08 09:45:49,127 : INFO : EPOCH 4 - PROGRESS: at 49.82% examples, 205293 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:45:50,173 : INFO : EPOCH 4 - PROGRESS: at 50.48% examples, 205196 words/s, in_qsiz

2019-04-08 09:46:56,564 : INFO : EPOCH 4 - PROGRESS: at 99.93% examples, 199788 words/s, in_qsize 2, out_qsize 1
2019-04-08 09:46:56,566 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-08 09:46:56,575 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-08 09:46:56,598 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-08 09:46:56,599 : INFO : EPOCH - 4 : training on 22156621 raw words (21603892 effective words) took 108.1s, 199898 effective words/s
2019-04-08 09:46:57,668 : INFO : EPOCH 5 - PROGRESS: at 0.51% examples, 172359 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:46:58,689 : INFO : EPOCH 5 - PROGRESS: at 0.93% examples, 177152 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:46:59,756 : INFO : EPOCH 5 - PROGRESS: at 1.27% examples, 180441 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:47:00,824 : INFO : EPOCH 5 - PROGRESS: at 1.68% examples, 187530 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:47:01,864 :

2019-04-08 09:48:08,272 : INFO : EPOCH 5 - PROGRESS: at 68.64% examples, 199848 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:09,333 : INFO : EPOCH 5 - PROGRESS: at 69.53% examples, 199785 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:10,369 : INFO : EPOCH 5 - PROGRESS: at 70.04% examples, 199892 words/s, in_qsize 8, out_qsize 0
2019-04-08 09:48:11,381 : INFO : EPOCH 5 - PROGRESS: at 70.74% examples, 199977 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:12,420 : INFO : EPOCH 5 - PROGRESS: at 71.18% examples, 199951 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:13,463 : INFO : EPOCH 5 - PROGRESS: at 72.99% examples, 199982 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:14,481 : INFO : EPOCH 5 - PROGRESS: at 74.09% examples, 200189 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:15,505 : INFO : EPOCH 5 - PROGRESS: at 77.75% examples, 200303 words/s, in_qsize 7, out_qsize 0
2019-04-08 09:48:16,557 : INFO : EPOCH 5 - PROGRESS: at 78.82% examples, 200260 words/s, in_qsiz
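
The training call above passes `min_count=5`, which tells gensim to ignore every word type that occurs fewer than five times in the corpus. The sketch below shows that kind of frequency pruning on a toy corpus. `prune_vocab` and the toy sentences are hypothetical and are not gensim's implementation.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count word occurrences and keep only words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_corpus = [
    ["eleição", "presidente", "eleição"],
    ["presidente", "debate"],
    ["eleição", "turno"],
]
vocab = prune_vocab(toy_corpus, min_count=2)
print(sorted(vocab.items()))
```

Words that fall below the threshold get no embedding at all, so looking them up in `model.wv` raises a `KeyError`.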

In [10]:
print(model.wv.most_similar(positive=[u'bolsonaro'], negative=[u'presidente']))

[('jair', 0.4555439352989197), ('psl', 0.41817405819892883), ('presidenciável', 0.33089786767959595), ('elenão', 0.29954150319099426), ('pesselista', 0.28388893604278564), ('pedetista', 0.27705198526382446), ('bolsonarista', 0.2756008207798004), ('facada', 0.27309465408325195), ('bolsodoria', 0.26183512806892395), ('antipetista', 0.2597796320915222)]
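
The query above combines a positive and a negative word. As far as gensim's `most_similar` behaves in the versions I am aware of, it unit-normalizes the input vectors, averages them with weight +1 for positive and -1 for negative words, and ranks the rest of the vocabulary by cosine similarity to that mean, leaving out the query words themselves. The sketch below reproduces that arithmetic in plain Python. The three-dimensional vectors and `most_similar_sketch` are made up for illustration and have nothing to do with the trained model.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar_sketch(vectors, positive, negative, topn=3):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    # Summing and averaging point in the same direction once normalized
    query = unit(query)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # query words are excluded from the ranking
        cosine = sum(q * x for q, x in zip(query, unit(vec)))
        scores.append((word, cosine))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]


# Hypothetical toy vectors, chosen by hand
toy_vectors = {
    "bolsonaro": [0.9, 0.8, 0.1],
    "presidente": [0.1, 0.9, 0.2],
    "jair": [0.9, 0.1, 0.0],
    "temer": [0.0, 0.9, 0.3],
    "futebol": [0.1, 0.0, 0.9],
}
result = most_similar_sketch(toy_vectors, positive=["bolsonaro"], negative=["presidente"])
print(result)
```

In this toy setup, subtracting `presidente` removes most of the component the two query words share, so the top-ranked word is the one aligned with what is left of `bolsonaro`. This is consistent with the notebook's result, where `jair` and `psl` rank first.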
