# _Word Embeddings_

**Distributed semantic representation** builds on the distributional hypothesis, which states that the meaning of a word is given by the contexts in which it occurs [2]. Under this view, words are represented as vectors, and these word vectors can be used as features in a variety of applications, such as document classification [3], question answering [4], and named entity recognition [5]. Representing words as continuous vectors has a long history [6], [7], [8].

In this notebook, we train Word2Vec on a Brazilian Portuguese (PT-BR) Wikipedia corpus.


In [6]:
!mkdir data
!wget https://github.com/marcoaleixo/word2vec-train/raw/master/wiki.pt-br_part.text.zip
!wget https://github.com/marcoaleixo/word2vec-train/raw/master/text8.zip
!wget https://github.com/marcoaleixo/word2vec-train/raw/master/sinopses.txt


mkdir: cannot create directory ‘data’: File exists
--2018-03-01 18:00:07--  https://github.com/marcoaleixo/word2vec-train/raw/master/wiki.pt-br_part.text.zip
Resolving github.com (github.com)... 192.30.255.113, 192.30.255.112
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/marcoaleixo/word2vec-train/master/wiki.pt-br_part.text.zip [following]
--2018-03-01 18:00:07--  https://raw.githubusercontent.com/marcoaleixo/word2vec-train/master/wiki.pt-br_part.text.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100033487 (95M) [application/zip]
Saving to: ‘wiki.pt-br_part.text.zip.1’


2018-03-01 18:00:09 (125 MB/s) - ‘wiki.pt-br_part.text.zip.1’ saved 

In [7]:
!pip install gensim
#imports
import multiprocessing
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim.corpora import  WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence



In [8]:
#import data
import zipfile
from os.path import isfile, isdir


outp = "wiki.pt-br.word2vec.model"

inp = "data/wiki.pt-br_part.text"
dataset_filename = 'wiki.pt-br_part.text.zip'
dataset_folder_path = 'data'

with zipfile.ZipFile(dataset_filename) as zip_ref:
    zip_ref.extractall(dataset_folder_path)
!ls ./data
!ls

wiki.pt-br_part.text
data	 sinopses.txt	 text8.zip    wiki.pt-br_part.text.zip
datalab  sinopses.txt.1  text8.zip.1  wiki.pt-br_part.text.zip.1
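
The extraction cell above unpacks the archive on every run. A minimal sketch of a guarded variant follows, assuming it is enough to check for one expected file; the helper name `extract_if_missing` and the demo archive are hypothetical and are not part of the original notebook.

```python
import os
import tempfile
import zipfile


def extract_if_missing(zip_path, target_dir, expected_name):
    """Extract zip_path into target_dir unless expected_name already exists there.

    Returns True if extraction ran, False if it was skipped.
    """
    expected_path = os.path.join(target_dir, expected_name)
    if os.path.isfile(expected_path):
        return False
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zip_ref:
        zip_ref.extractall(target_dir)
    return True


# Self-contained demo on a throwaway archive (not the Wikipedia dump).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("demo.text", "uma frase por linha\n")
    out_dir = os.path.join(tmp, "data")
    first_run = extract_if_missing(demo_zip, out_dir, "demo.text")
    second_run = extract_if_missing(demo_zip, out_dir, "demo.text")

print(first_run, second_run)
```

In the notebook, the same call would presumably take `dataset_filename`, `dataset_folder_path`, and the base name of `inp` as arguments.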


## Word2Vec

Starting from the premise that basic techniques such as n-gram counting have reached their limits, Mikolov et al. (2013) [1] propose using neural-network language models to learn distributed representations of words. The main goal of the techniques they propose is to learn high-quality word vectors from very large datasets with billions of words. Somewhat surprisingly, the similarity between word representations was found to go beyond simple syntactic regularities. Using a simple algebraic operation on the word vectors, it was shown, for example, that:

> vector(**king**) - vector(**man**) + vector(**woman**) = a vector that is close to the vector representation of the word **queen**.
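
The analogy can be illustrated on a toy scale. The sketch below uses hand-picked two-dimensional vectors, not vectors learned by Word2Vec, so it only shows the arithmetic and the nearest-neighbour search by cosine similarity; the vocabulary, the values, and the `analogy` helper are all assumptions made for illustration.

```python
import math

# Hand-picked toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (positive = male).
toy_vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.7, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

With learned embeddings the result is only approximately equal to the target vector, which is why the search looks for the nearest word rather than an exact match.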

Mikolov et al. (2013) [1] propose two model architectures for learning distributed word representations, both designed to minimize computational complexity: the _Continuous Bag-of-Words_ (CBOW) model and the Skip-gram model.

* **CBOW** – The CBOW architecture is similar to the _feedforward_ NNLM, except that the non-linear hidden layer is removed and the projection layer (not just the projection matrix) is shared across all words. As a result, all words are projected into the same position. This architecture is called a _bag-of-words_ model because the order of the words does not influence the projection. CBOW uses a continuous distributed representation of the context. The architecture is shown in the figure below, where the weight matrix between the input and the projection layer is shared across all word positions (just as in the NNLM).
    
    
* **Skip-gram** – The Skip-gram architecture is similar to CBOW, but instead of predicting the current word from its context, Skip-gram tries to maximize the classification of a word based on another word in the same sentence. More precisely, each current word is used as input to a log-linear classifier that predicts the words within a certain range before and after it. Increasing the range improves the quality of the resulting word vectors but also increases computational complexity. The distance between a context word and the current word indicates how related they are: the more distant a word, the less related it is to the current word, so it can be given a lower weight.

https://github.com/marcoaleixo/word2vec-train/blob/master/images/CBOW_Skip-Gram.png
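
To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a single tokenized sentence. It assumes a fixed symmetric window and does not implement the lower weighting of distant words described above; the function names and the sample sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is used to predict each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context_words, center) examples: the neighbours jointly predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, center))
    return examples


sentence = "the cat sat on a mat".split()
sg_pairs = skipgram_pairs(sentence, window=1)
cbow = cbow_examples(sentence, window=1)

print(sg_pairs[:3])
print(cbow[:2])
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW produces one example per position, which is one reason a larger window costs more for Skip-gram.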

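### Parameters

The training cell below relies on the following `Word2Vec` parameters; those that are not passed explicitly keep their default values. The names follow the gensim 3.x API used in this notebook (gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`).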

* **sg**: selects the training algorithm. CBOW is the default (sg = 0); the alternative is skip-gram (sg = 1).

* **size**: dimensionality of the word vectors.

* **window**: maximum number of words before and after the target word that are used as context.

* **LineSentence**: not a `Word2Vec` parameter but the class used to feed the corpus to the model. It takes a file path (string) or a file object in which each line is one sentence.

* **min_count**: ignores all words whose total frequency is lower than **min_count**.

* **max_vocab_size**: limits RAM usage while building the vocabulary; if there are more unique words than **max_vocab_size**, the least frequent ones are pruned. Every 10 million word types require about 1 GB of RAM.

* **sample**: threshold that controls which high-frequency words are randomly downsampled. The default is 1e-3; the useful range is (0, 1e-5).

* **workers**: number of worker threads used for training, typically set to the number of CPU cores.

* **hs**: if 1, hierarchical softmax is used for training. If 0 (the default) and **negative** is non-zero, negative sampling is used instead.

* **negative**: if > 0, negative sampling is used; the value sets how many "noise words" are drawn (usually between 5 and 20). If **negative** is set to 0, negative sampling is not used.

* **cbow_mean**: if 0, uses the sum of the context word vectors; if 1 (the default), uses their mean. Applies only when CBOW is used.

* **hashfxn**: hash function used to randomly initialize the weights.

* **iter**: number of iterations (epochs) over the corpus. The default is 5.
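
A few of these parameters are easier to follow with numbers. The sketch below illustrates **min_count**, **sample**, and **cbow_mean** on a tiny made-up corpus. It is not gensim's code: the discard probability uses the formula from the NIPS paper listed under further reading, `1 - sqrt(t / f)`, and gensim's exact implementation may differ; all function names are hypothetical.

```python
import math
from collections import Counter

# Tiny made-up corpus, already tokenized (one list per sentence).
corpus = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
    "a dragon sleeps".split(),
]
counts = Counter(token for sentence in corpus for token in sentence)
total = sum(counts.values())


def prune_vocab(word_counts, min_count):
    """min_count: keep only words seen at least min_count times."""
    return {w: c for w, c in word_counts.items() if c >= min_count}


def discard_probability(word_count, total_count, sample):
    """sample: chance of dropping one occurrence of a word, 1 - sqrt(sample / f).

    f is the relative frequency of the word; the result is clipped at 0,
    so words rarer than the threshold are never dropped.
    """
    freq = word_count / total_count
    return max(0.0, 1.0 - math.sqrt(sample / freq))


def combine_context(context_vectors, cbow_mean):
    """cbow_mean: sum (0) or mean (1) of the context word vectors."""
    summed = [sum(column) for column in zip(*context_vectors)]
    if cbow_mean:
        return [s / len(context_vectors) for s in summed]
    return summed


vocab = prune_vocab(counts, min_count=2)
p_frequent = discard_probability(counts["the"], total, sample=1e-3)
p_rare = discard_probability(1, 10_000, sample=1e-3)
mean_vec = combine_context([[1.0, 2.0], [3.0, 4.0]], cbow_mean=1)
sum_vec = combine_context([[1.0, 2.0], [3.0, 4.0]], cbow_mean=0)

print(sorted(vocab), round(p_frequent, 3), p_rare, mean_vec, sum_vec)
```

In a corpus this small almost every word is "frequent" relative to the threshold, so the subsampling numbers only show the direction of the effect: the more frequent a word, the more often its occurrences are dropped.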


In [9]:
!ls

data	 sinopses.txt	 text8.zip    wiki.pt-br_part.text.zip
datalab  sinopses.txt.1  text8.zip.1  wiki.pt-br_part.text.zip.1


In [10]:
#train model
%time model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())

2018-03-01 18:00:35,599 : INFO : collecting all words and their counts
2018-03-01 18:00:35,602 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-01 18:00:39,644 : INFO : PROGRESS: at sentence #10000, processed 11332580 words, keeping 272468 word types
2018-03-01 18:00:42,687 : INFO : PROGRESS: at sentence #20000, processed 19965353 words, keeping 367913 word types
2018-03-01 18:00:46,349 : INFO : PROGRESS: at sentence #30000, processed 30160507 words, keeping 491080 word types
2018-03-01 18:00:48,429 : INFO : PROGRESS: at sentence #40000, processed 35913388 words, keeping 555342 word types
2018-03-01 18:00:51,150 : INFO : PROGRESS: at sentence #50000, processed 43666857 words, keeping 636715 word types
2018-03-01 18:00:52,012 : INFO : collected 663151 word types from a corpus of 45891425 raw words and 58065 sentences
2018-03-01 18:00:52,013 : INFO : Loading a fresh vocabulary
2018-03-01 18:00:52,830 : INFO : min_count=5 retains 178094 unique words (26% 

2018-03-01 18:01:25,047 : INFO : EPOCH 1 - PROGRESS: at 11.34% examples, 261485 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:26,062 : INFO : EPOCH 1 - PROGRESS: at 11.76% examples, 261346 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:27,089 : INFO : EPOCH 1 - PROGRESS: at 12.19% examples, 261495 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:01:28,104 : INFO : EPOCH 1 - PROGRESS: at 12.77% examples, 261325 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:29,112 : INFO : EPOCH 1 - PROGRESS: at 13.38% examples, 261337 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:30,117 : INFO : EPOCH 1 - PROGRESS: at 14.66% examples, 261255 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:01:31,126 : INFO : EPOCH 1 - PROGRESS: at 15.50% examples, 261470 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:32,135 : INFO : EPOCH 1 - PROGRESS: at 16.17% examples, 261349 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:01:33,169 : INFO : EPOCH 1 - PROGRESS: at 16.79% examples, 261509 words/s, in_qsiz

2018-03-01 18:02:06,717 : INFO : EPOCH 1 - PROGRESS: at 38.01% examples, 262688 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:07,724 : INFO : EPOCH 1 - PROGRESS: at 38.54% examples, 262681 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:08,745 : INFO : EPOCH 1 - PROGRESS: at 39.11% examples, 262720 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:09,778 : INFO : EPOCH 1 - PROGRESS: at 39.66% examples, 262758 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:10,788 : INFO : EPOCH 1 - PROGRESS: at 40.12% examples, 262719 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:11,810 : INFO : EPOCH 1 - PROGRESS: at 40.65% examples, 262717 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:12,862 : INFO : EPOCH 1 - PROGRESS: at 41.29% examples, 262728 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:13,905 : INFO : EPOCH 1 - PROGRESS: at 41.81% examples, 262780 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:14,914 : INFO : EPOCH 1 - PROGRESS: at 42.27% examples, 262877 words/s, in_qsiz

2018-03-01 18:02:48,520 : INFO : EPOCH 1 - PROGRESS: at 68.41% examples, 263102 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:49,525 : INFO : EPOCH 1 - PROGRESS: at 69.11% examples, 263145 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:50,531 : INFO : EPOCH 1 - PROGRESS: at 69.93% examples, 263162 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:51,557 : INFO : EPOCH 1 - PROGRESS: at 70.90% examples, 263116 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:52,578 : INFO : EPOCH 1 - PROGRESS: at 71.69% examples, 263184 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:53,579 : INFO : EPOCH 1 - PROGRESS: at 72.51% examples, 263166 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:02:54,600 : INFO : EPOCH 1 - PROGRESS: at 73.27% examples, 263179 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:55,601 : INFO : EPOCH 1 - PROGRESS: at 74.02% examples, 263186 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:02:56,635 : INFO : EPOCH 1 - PROGRESS: at 74.65% examples, 263208 words/s, in_qsiz

2018-03-01 18:03:26,336 : INFO : EPOCH 2 - PROGRESS: at 1.66% examples, 267644 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:27,367 : INFO : EPOCH 2 - PROGRESS: at 1.92% examples, 267546 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:28,390 : INFO : EPOCH 2 - PROGRESS: at 2.47% examples, 267550 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:29,408 : INFO : EPOCH 2 - PROGRESS: at 2.72% examples, 267859 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:30,432 : INFO : EPOCH 2 - PROGRESS: at 2.94% examples, 267505 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:31,461 : INFO : EPOCH 2 - PROGRESS: at 3.56% examples, 267767 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:32,468 : INFO : EPOCH 2 - PROGRESS: at 4.39% examples, 268197 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:33,488 : INFO : EPOCH 2 - PROGRESS: at 5.18% examples, 268440 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:03:34,507 : INFO : EPOCH 2 - PROGRESS: at 5.85% examples, 268276 words/s, in_qsize 3, out_

2018-03-01 18:04:08,047 : INFO : EPOCH 2 - PROGRESS: at 26.62% examples, 268311 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:09,047 : INFO : EPOCH 2 - PROGRESS: at 27.59% examples, 268296 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:10,066 : INFO : EPOCH 2 - PROGRESS: at 28.25% examples, 268337 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:04:11,098 : INFO : EPOCH 2 - PROGRESS: at 28.88% examples, 268498 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:12,105 : INFO : EPOCH 2 - PROGRESS: at 29.52% examples, 268548 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:13,111 : INFO : EPOCH 2 - PROGRESS: at 30.39% examples, 268630 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:14,136 : INFO : EPOCH 2 - PROGRESS: at 31.10% examples, 268714 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:15,173 : INFO : EPOCH 2 - PROGRESS: at 31.87% examples, 268778 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:16,183 : INFO : EPOCH 2 - PROGRESS: at 32.64% examples, 268950 words/s, in_qsiz

2018-03-01 18:04:49,794 : INFO : EPOCH 2 - PROGRESS: at 51.16% examples, 268148 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:50,809 : INFO : EPOCH 2 - PROGRESS: at 51.90% examples, 268092 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:51,817 : INFO : EPOCH 2 - PROGRESS: at 52.52% examples, 268054 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:52,831 : INFO : EPOCH 2 - PROGRESS: at 53.21% examples, 268039 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:04:53,839 : INFO : EPOCH 2 - PROGRESS: at 53.69% examples, 268034 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:54,841 : INFO : EPOCH 2 - PROGRESS: at 54.40% examples, 268043 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:55,859 : INFO : EPOCH 2 - PROGRESS: at 55.17% examples, 267997 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:56,896 : INFO : EPOCH 2 - PROGRESS: at 55.81% examples, 267992 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:04:57,947 : INFO : EPOCH 2 - PROGRESS: at 56.47% examples, 267976 words/s, in_qsiz

2018-03-01 18:05:31,475 : INFO : EPOCH 2 - PROGRESS: at 86.23% examples, 267955 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:32,509 : INFO : EPOCH 2 - PROGRESS: at 89.41% examples, 268018 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:33,510 : INFO : EPOCH 2 - PROGRESS: at 92.88% examples, 268080 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:34,530 : INFO : EPOCH 2 - PROGRESS: at 96.57% examples, 268171 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:35,533 : INFO : EPOCH 2 - PROGRESS: at 97.69% examples, 268115 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:36,561 : INFO : EPOCH 2 - PROGRESS: at 98.67% examples, 268114 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:37,565 : INFO : EPOCH 2 - PROGRESS: at 99.71% examples, 268149 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:05:37,798 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-01 18:05:37,806 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-01 18:05:37,807 : I

2018-03-01 18:06:10,415 : INFO : EPOCH 3 - PROGRESS: at 15.84% examples, 264757 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:11,418 : INFO : EPOCH 3 - PROGRESS: at 16.36% examples, 264657 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:06:12,445 : INFO : EPOCH 3 - PROGRESS: at 17.08% examples, 264451 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:13,450 : INFO : EPOCH 3 - PROGRESS: at 17.67% examples, 264448 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:14,451 : INFO : EPOCH 3 - PROGRESS: at 18.45% examples, 264485 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:15,466 : INFO : EPOCH 3 - PROGRESS: at 19.27% examples, 264590 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:16,492 : INFO : EPOCH 3 - PROGRESS: at 19.89% examples, 264657 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:17,496 : INFO : EPOCH 3 - PROGRESS: at 20.35% examples, 264411 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:18,507 : INFO : EPOCH 3 - PROGRESS: at 20.74% examples, 264334 words/s, in_qsiz

2018-03-01 18:06:52,132 : INFO : EPOCH 3 - PROGRESS: at 41.47% examples, 264185 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:53,164 : INFO : EPOCH 3 - PROGRESS: at 41.89% examples, 264044 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:54,166 : INFO : EPOCH 3 - PROGRESS: at 42.41% examples, 264121 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:55,217 : INFO : EPOCH 3 - PROGRESS: at 42.78% examples, 263973 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:56,232 : INFO : EPOCH 3 - PROGRESS: at 43.28% examples, 263942 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:57,235 : INFO : EPOCH 3 - PROGRESS: at 43.81% examples, 263895 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:58,275 : INFO : EPOCH 3 - PROGRESS: at 44.28% examples, 263902 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:06:59,276 : INFO : EPOCH 3 - PROGRESS: at 44.78% examples, 263884 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:07:00,283 : INFO : EPOCH 3 - PROGRESS: at 45.47% examples, 263843 words/s, in_qsiz

2018-03-01 18:07:33,897 : INFO : EPOCH 3 - PROGRESS: at 73.35% examples, 263482 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:34,910 : INFO : EPOCH 3 - PROGRESS: at 74.16% examples, 263510 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:35,940 : INFO : EPOCH 3 - PROGRESS: at 74.76% examples, 263450 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:36,944 : INFO : EPOCH 3 - PROGRESS: at 75.41% examples, 263432 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:37,948 : INFO : EPOCH 3 - PROGRESS: at 76.22% examples, 263407 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:38,952 : INFO : EPOCH 3 - PROGRESS: at 76.45% examples, 263850 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:39,963 : INFO : EPOCH 3 - PROGRESS: at 77.09% examples, 263918 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:07:40,985 : INFO : EPOCH 3 - PROGRESS: at 77.72% examples, 263864 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:07:42,004 : INFO : EPOCH 3 - PROGRESS: at 78.48% examples, 263857 words/s, in_qsiz

2018-03-01 18:08:12,713 : INFO : EPOCH 4 - PROGRESS: at 5.04% examples, 267775 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:13,742 : INFO : EPOCH 4 - PROGRESS: at 5.79% examples, 267400 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:14,752 : INFO : EPOCH 4 - PROGRESS: at 6.22% examples, 267648 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:15,774 : INFO : EPOCH 4 - PROGRESS: at 7.16% examples, 267591 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:16,792 : INFO : EPOCH 4 - PROGRESS: at 7.55% examples, 267799 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:17,797 : INFO : EPOCH 4 - PROGRESS: at 7.99% examples, 267855 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:18,800 : INFO : EPOCH 4 - PROGRESS: at 8.33% examples, 267690 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:08:19,830 : INFO : EPOCH 4 - PROGRESS: at 8.78% examples, 267965 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:20,839 : INFO : EPOCH 4 - PROGRESS: at 9.40% examples, 267708 words/s, in_qsize 3, out_

2018-03-01 18:08:54,469 : INFO : EPOCH 4 - PROGRESS: at 31.83% examples, 268408 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:55,470 : INFO : EPOCH 4 - PROGRESS: at 32.51% examples, 268460 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:56,478 : INFO : EPOCH 4 - PROGRESS: at 33.23% examples, 268525 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:57,501 : INFO : EPOCH 4 - PROGRESS: at 33.77% examples, 268495 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:08:58,524 : INFO : EPOCH 4 - PROGRESS: at 34.31% examples, 268511 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:08:59,530 : INFO : EPOCH 4 - PROGRESS: at 34.95% examples, 268479 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:00,535 : INFO : EPOCH 4 - PROGRESS: at 35.48% examples, 268402 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:01,538 : INFO : EPOCH 4 - PROGRESS: at 36.04% examples, 268319 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:02,562 : INFO : EPOCH 4 - PROGRESS: at 36.54% examples, 268263 words/s, in_qsiz

2018-03-01 18:09:36,108 : INFO : EPOCH 4 - PROGRESS: at 55.61% examples, 267262 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:37,145 : INFO : EPOCH 4 - PROGRESS: at 56.21% examples, 267255 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:38,176 : INFO : EPOCH 4 - PROGRESS: at 56.90% examples, 267210 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:09:39,203 : INFO : EPOCH 4 - PROGRESS: at 58.29% examples, 267226 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:09:40,210 : INFO : EPOCH 4 - PROGRESS: at 60.47% examples, 267187 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:41,212 : INFO : EPOCH 4 - PROGRESS: at 61.44% examples, 267187 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:42,223 : INFO : EPOCH 4 - PROGRESS: at 64.82% examples, 267222 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:43,224 : INFO : EPOCH 4 - PROGRESS: at 65.91% examples, 267216 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:09:44,249 : INFO : EPOCH 4 - PROGRESS: at 66.54% examples, 267189 words/s, in_qsiz

2018-03-01 18:10:17,588 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-03-01 18:10:17,596 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-03-01 18:10:17,597 : INFO : EPOCH - 4 : training on 45891425 raw words (37156323 effective words) took 139.1s, 267206 effective words/s
2018-03-01 18:10:18,608 : INFO : EPOCH 5 - PROGRESS: at 0.22% examples, 263173 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:19,618 : INFO : EPOCH 5 - PROGRESS: at 0.45% examples, 265558 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:20,637 : INFO : EPOCH 5 - PROGRESS: at 0.73% examples, 266477 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:21,665 : INFO : EPOCH 5 - PROGRESS: at 0.95% examples, 266026 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:22,705 : INFO : EPOCH 5 - PROGRESS: at 1.18% examples, 265930 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:10:23,721 : INFO : EPOCH 5 - PROGRESS: at 1.45% examples, 264922 words/s, in_qsize 3, out_qsize 0
2018-

2018-03-01 18:10:57,350 : INFO : EPOCH 5 - PROGRESS: at 20.49% examples, 266869 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:58,402 : INFO : EPOCH 5 - PROGRESS: at 20.95% examples, 266800 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:10:59,427 : INFO : EPOCH 5 - PROGRESS: at 21.55% examples, 266836 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:00,437 : INFO : EPOCH 5 - PROGRESS: at 22.09% examples, 266920 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:01,473 : INFO : EPOCH 5 - PROGRESS: at 22.62% examples, 266768 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:11:02,505 : INFO : EPOCH 5 - PROGRESS: at 23.43% examples, 266733 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:03,505 : INFO : EPOCH 5 - PROGRESS: at 24.71% examples, 266860 words/s, in_qsize 4, out_qsize 0
2018-03-01 18:11:04,505 : INFO : EPOCH 5 - PROGRESS: at 25.26% examples, 266841 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:05,535 : INFO : EPOCH 5 - PROGRESS: at 25.88% examples, 266801 words/s, in_qsiz

2018-03-01 18:11:38,968 : INFO : EPOCH 5 - PROGRESS: at 45.33% examples, 266777 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:39,985 : INFO : EPOCH 5 - PROGRESS: at 45.90% examples, 266745 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:41,011 : INFO : EPOCH 5 - PROGRESS: at 46.50% examples, 266737 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:42,064 : INFO : EPOCH 5 - PROGRESS: at 47.12% examples, 266724 words/s, in_qsize 4, out_qsize 0
2018-03-01 18:11:43,070 : INFO : EPOCH 5 - PROGRESS: at 47.59% examples, 266718 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:44,073 : INFO : EPOCH 5 - PROGRESS: at 48.18% examples, 266703 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:45,124 : INFO : EPOCH 5 - PROGRESS: at 48.81% examples, 266701 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:11:46,175 : INFO : EPOCH 5 - PROGRESS: at 49.53% examples, 266731 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:11:47,189 : INFO : EPOCH 5 - PROGRESS: at 50.15% examples, 266751 words/s, in_qsiz

2018-03-01 18:12:20,851 : INFO : EPOCH 5 - PROGRESS: at 78.80% examples, 266909 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:21,870 : INFO : EPOCH 5 - PROGRESS: at 79.65% examples, 266868 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:22,900 : INFO : EPOCH 5 - PROGRESS: at 80.22% examples, 266863 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:12:23,917 : INFO : EPOCH 5 - PROGRESS: at 80.97% examples, 266870 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:24,920 : INFO : EPOCH 5 - PROGRESS: at 81.77% examples, 266843 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:25,925 : INFO : EPOCH 5 - PROGRESS: at 82.55% examples, 266805 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:26,929 : INFO : EPOCH 5 - PROGRESS: at 83.25% examples, 266801 words/s, in_qsize 2, out_qsize 1
2018-03-01 18:12:27,938 : INFO : EPOCH 5 - PROGRESS: at 84.09% examples, 266728 words/s, in_qsize 3, out_qsize 0
2018-03-01 18:12:28,940 : INFO : EPOCH 5 - PROGRESS: at 85.23% examples, 266774 words/s, in_qsiz

CPU times: user 22min 53s, sys: 6.54 s, total: 22min 59s
Wall time: 12min 1s
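
The training cell passes `LineSentence(inp)` as the corpus. As a rough sketch of what such a line-based corpus reader does, the hypothetical `LineCorpus` class below streams a text file and yields one list of whitespace-separated tokens per non-empty line. It is written as a class with `__iter__` rather than a plain generator because the corpus is read more than once (the vocabulary pass plus one pass per epoch, as the log above shows). This is an approximation for illustration, not gensim's implementation.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable over a text file with one sentence per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated repeatedly
        # without holding all sentences in memory.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Self-contained demo on a throwaway file (not the Wikipedia corpus).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.text")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("o rei governa\n\na rainha governa\n")
    demo_corpus = LineCorpus(demo_path)
    first_pass = list(demo_corpus)
    second_pass = list(demo_corpus)

print(first_pass)
```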


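* **init_sims(replace=True)**: L2-normalizes the word vectors in place and discards the original vectors, so the model needs less memory. After this call the model can no longer be trained further.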

In [11]:
# trim unneeded model memory = use (much) less RAM
model.init_sims(replace=True)

2018-03-01 18:12:43,093 : INFO : precomputing L2-norms of word weight vectors
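
The log line above mentions L2 norms. As a small illustration of what unit-normalizing vectors buys, the sketch below normalizes two made-up vectors and checks that their dot product then equals the cosine similarity of the originals. The helper names are hypothetical and the code is independent of gensim.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = l2_norm(vector)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)

cosine_from_raw = dot(raw_a, raw_b) / (l2_norm(raw_a) * l2_norm(raw_b))
cosine_from_unit = dot(unit_a, unit_b)

print(cosine_from_raw, cosine_from_unit)
```

Once only the unit vectors are kept, similarity queries reduce to dot products, but the original magnitudes are gone.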


Save the model to the path specified in `outp`.

In [12]:
model.save(outp)
!ls

2018-03-01 18:12:57,289 : INFO : saving Word2Vec object under wiki.pt-br.word2vec.model, separately None
2018-03-01 18:12:57,291 : INFO : storing np array 'vectors' to wiki.pt-br.word2vec.model.wv.vectors.npy
2018-03-01 18:12:58,022 : INFO : not storing attribute vectors_norm
2018-03-01 18:12:58,036 : INFO : storing np array 'syn1neg' to wiki.pt-br.word2vec.model.trainables.syn1neg.npy
2018-03-01 18:12:59,086 : INFO : not storing attribute cum_table
2018-03-01 18:12:59,576 : INFO : saved wiki.pt-br.word2vec.model


data		wiki.pt-br_part.text.zip
datalab		wiki.pt-br_part.text.zip.1
sinopses.txt	wiki.pt-br.word2vec.model
sinopses.txt.1	wiki.pt-br.word2vec.model.trainables.syn1neg.npy
text8.zip	wiki.pt-br.word2vec.model.wv.vectors.npy
text8.zip.1


## Further reading

I suggest the following complementary readings on Word2Vec.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)
* [Deep Learning with Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) from Gensim

## References

[1] [Efficient estimation of word representations in vector space.](https://arxiv.org/abs/1301.3781)

[2] [Multimodal distributional semantics.](https://www.jair.org/media/4135/live-4135-7609-jair.pdf)

[3] [Machine learning in automated text categorization.](http://delivery.acm.org/10.1145/510000/505283/p1-sebastiani.pdf?ip=200.137.216.145&id=505283&acc=ACTIVE%20SERVICE&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1518017578_f5561e072809aadaea8bb04a71a5b21c)

[4] [Quantitative evaluation of passage retrieval algorithms for question answering.](http://delivery.acm.org/10.1145/870000/860445/p41-tellex.pdf?ip=200.137.216.145&id=860445&acc=ACTIVE%20SERVICE&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1518017629_0f0d78efd0501b7ad05a74e586cd7ef8)

[5] [Word representations: a simple and general method for semi-supervised learning.](http://delivery.acm.org/10.1145/1860000/1858721/p384-turian.pdf?ip=200.137.216.145&id=1858721&acc=OPEN&key=344E943C9DC262BB%2E0ACEC6856BE69272%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1518017678_cec1b87c9c6e3f9ccd8e61f591acaa26)

[6] [Distributed representations.](https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf)

[7] [Learning internal representations by back-propagating errors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition](http://lia.disi.unibo.it/Courses/SistInt/articoli/nnet1.pdf)

[8] [Finding structure in time.](http://onlinelibrary.wiley.com/doi/10.1207/s15516709cog1402_1/epdf)

[9] [Combining heterogeneous models for measuring relational similarity.](http://www.aclweb.org/anthology/N13-1120)

[10] [Neural network based language models for highly inflective languages.](http://ieeexplore.ieee.org/abstract/document/4960686/)