# Word2Vec 

##### Import libraries

In [1]:
import pandas as pd

import numpy as np
from numpy import array
from numpy import zeros

import matplotlib.pyplot as plt

import re
import multiprocessing

# Word2Vec
import gensim 
import gensim.models
from gensim.models import Word2Vec
from gensim.test.utils import datapath
from gensim import utils

##### Load data

In [2]:
posts = pd.read_csv('../data/posts-preprocessed.csv')

In [3]:
posts.head()

Unnamed: 0,author,created_utc,subreddit,timeframe,text_clean,sent_tokens,word_tokens
0,sub30605,1499390694,bulimia,pre-covid,chest anyone else experience chest purging kno...,['chest anyone else experience chest purging k...,"['chest', 'anyone', 'else', 'experience', 'che..."
1,sub27274,1499060654,bulimia,pre-covid,dying eat eating die study shifting coping men...,['dying eat eating die study shifting coping m...,"['dying', 'eat', 'eating', 'die', 'study', 'sh..."
2,sub6055,1499029087,bulimia,pre-covid,without purging way lose weight without exercise,['without purging way lose weight without exer...,"['without', 'purging', 'way', 'lose', 'weight'..."
3,sub40365,1498978259,bulimia,pre-covid,melancholy little month since slowly losing ha...,['melancholy little month since slowly losing ...,"['melancholy', 'little', 'month', 'since', 'sl..."
4,sub49857,1498814187,bulimia,pre-covid,relapsing upset right twice week good tired cy...,['relapsing upset right twice week good tired ...,"['relapsing', 'upset', 'right', 'twice', 'week..."


## Create Word2Vec model

We use the **Continuous Bag of Words (CBOW)** model. CBOW predicts the current word from the context words that fall within a fixed-size window around it. The input layer holds the context words and the output layer holds the current word. The hidden layer has as many units as the number of dimensions we choose for the embedding, so its weights become the learned word vectors.

https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
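
To make the three layers concrete, here is a minimal sketch of a single CBOW forward pass in plain Python. The weights are random and untrained, the five-word vocabulary is taken from the first post only for illustration, and the function name `cbow_forward` is hypothetical. Gensim does not compute a full softmax like this; by default it approximates it with negative sampling.

```python
import math
import random

def cbow_forward(context_ids, W_in, W_out):
    """Return a probability for every vocabulary word being the centre word."""
    dim = len(W_in[0])
    # Hidden layer: average of the context words' input embeddings.
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Output layer: one score per vocabulary word, turned into probabilities
    # with a numerically stable softmax.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["chest", "anyone", "else", "experience", "purging"]
dim = 3  # embedding size, far smaller than in the real model
rng = random.Random(0)
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

# Context words "anyone" and "experience" surround the centre word "else".
probs = cbow_forward([1, 3], W_in, W_out)
```

Training adjusts `W_in` and `W_out` so that the probability of the true centre word rises; the rows of `W_in` are what end up being used as word vectors.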

In [4]:
# How many cores am I working with?
cores = multiprocessing.cpu_count() # Count the number of cores in this computer
cores

8

Parameters:
- size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). Gensim 4.0 and later call this parameter `vector_size`.
- window: (default 5) The maximum distance between a target word and the words around it.
- min_count: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored.
- workers: (default 3) The number of worker threads to use while training.
- sg: (default 0) The training algorithm, either CBOW (0) or skip-gram (1).
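
The effect of `window` and `min_count` can be sketched without Gensim. The helper `cbow_examples` and the toy sentences below are hypothetical and only mimic the idea: rare words are dropped first, then each remaining word is paired with its neighbours. Gensim additionally samples a smaller effective window for each target and downsamples very frequent words, which this sketch leaves out.

```python
from collections import Counter

def cbow_examples(sentences, window=5, min_count=5):
    """Build (context words, target word) pairs the way CBOW sees them."""
    counts = Counter(tok for sent in sentences for tok in sent)
    examples = []
    for sent in sentences:
        # Words below min_count are removed before any windowing happens.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, target in enumerate(kept):
            # Up to `window` words on each side of the target.
            context = kept[max(0, i - window):i] + kept[i + 1:i + 1 + window]
            if context:
                examples.append((context, target))
    return examples

toy = [
    ["lose", "weight", "without", "exercise"],
    ["lose", "weight", "without", "purging"],
]
pairs = cbow_examples(toy, window=1, min_count=2)
print(len(pairs))  # 6
print(pairs[1])    # (['lose', 'without'], 'weight')
```

With `min_count=2`, "exercise" and "purging" never appear in any pair because each occurs only once in the toy corpus.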

Code help from:
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html
- https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

The word2vec algorithm processes documents sentence by sentence. This means we will preserve the sentence-based structure during cleaning.

In [5]:
posts.head(1)

Unnamed: 0,author,created_utc,subreddit,timeframe,text_clean,sent_tokens,word_tokens
0,sub30605,1499390694,bulimia,pre-covid,chest anyone else experience chest purging kno...,['chest anyone else experience chest purging k...,"['chest', 'anyone', 'else', 'experience', 'che..."


### Training a word2vec model on Reddit posts

We run each post through Gensim's `simple_preprocess` so that the corpus yields one list of word tokens after another, which is the input format `Word2Vec` expects. The approach follows the Gensim tutorial: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

In [6]:
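class MyCorpus:
    """An iterable that yields one list of str tokens per post."""

    def __iter__(self):
        for line in posts['sent_tokens']:
            # each row is one post's sentence list, stored as a string in the CSV;
            # simple_preprocess lowercases it, drops punctuation and returns
            # a flat list of tokens (2 to 15 characters long) for the whole post
            yield utils.simple_preprocess(line)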
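
Because `sent_tokens` comes back from the CSV as the text of a Python list, `simple_preprocess` tokenizes that whole string and the class above yields one token list per post. If one list per sentence is wanted instead, a sketch along the following lines could recover the sentences first. The function `iter_sentences`, the regex tokenizer (a stand-in for `simple_preprocess`) and the sample rows are hypothetical, and the sketch assumes every entry is a well-formed list literal.

```python
import ast
import re

def iter_sentences(sent_token_strings):
    """Yield one list of lowercase word tokens per sentence."""
    for raw in sent_token_strings:
        # Each entry looks like "['first sentence', 'second one']".
        for sentence in ast.literal_eval(raw):
            tokens = re.findall(r"[a-z]+", sentence.lower())
            if tokens:
                yield tokens

# Made-up rows shaped like the sent_tokens column shown above.
rows = [
    "['chest anyone else', 'experience chest purging']",
    "['without purging']",
]
sentences_demo = list(iter_sentences(rows))
print(sentences_demo)
# [['chest', 'anyone', 'else'], ['experience', 'chest', 'purging'], ['without', 'purging']]
```

To train on this, the generator would need to be wrapped in a class with `__iter__`, like `MyCorpus`, because the model iterates over the corpus more than once.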

In [7]:
sentences = MyCorpus()

In [8]:
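model = gensim.models.Word2Vec(sentences=sentences,
                               window=10,        # maximum context distance
                               size=150,         # embedding dimensions (`vector_size` in Gensim >= 4.0)
                               min_count=2,      # ignore words seen fewer than 2 times
                               workers=cores-1)  # run on all cores minus 1; sg stays at its default 0 (CBOW)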

After the model is fit, we print the size of the learned vocabulary.

In [9]:
# summarize vocabulary size in model
# (Gensim 3.x API; in Gensim >= 4.0 use model.wv.key_to_index instead of model.wv.vocab)
vocab = list(model.wv.vocab)
print('Vocabulary size: %d' % len(vocab))

Vocabulary size: 16910


Finally, we save the learned embedding vectors to file by calling `save_word2vec_format()` on the model's `wv` (word vector) attribute. The embedding is saved in plain-text (ASCII) format: a header line with the vocabulary size and vector dimension, followed by one word and its vector per line.

In [10]:
filename = '../embedding_word2vec.txt'

In [11]:
# save model in ASCII (word2vec) format
model.wv.save_word2vec_format(filename, binary=False)
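
The saved text file starts with a header line giving the number of words and the vector dimension, followed by one word and its vector per line. As a sketch of that layout, the hypothetical `parse_word2vec_text` below reads it in plain Python; the three sample vectors are invented for illustration and are not taken from the trained model.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into a {word: [floats]} dict."""
    lines = iter(lines)
    # Header: "<vocabulary size> <vector dimension>"
    vocab_size, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header count does not match number of rows")
    return vectors

# Invented sample in the same layout as the saved file.
sample = """3 2
purging 0.12 -0.40
weight 0.33 0.08
eat -0.25 0.51
""".splitlines()

embedding = parse_word2vec_text(sample)
print(embedding["weight"])  # [0.33, 0.08]
```

In practice the file can be loaded back with Gensim's `KeyedVectors.load_word2vec_format(filename, binary=False)`.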