# Intro to word embeddings

So far we have focused on bag-of-words approaches, i.e. representations of text as a vector of word frequencies. An alternative is to represent the words themselves (or bi-grams, phrases, etc.) as vectors. A _word vector_ has no meaning per se, but it is informative of the _context_ in which the word is used. In practice, this vector representation can come very close to capturing the semantic meaning of the word. Combined with simple vector operations, these representations can be used to find synonyms, to test analogies, etc. Word vectors can also be used as features in any subsequent task (dictionary methods, classification, etc.) instead of the simple word frequencies of the classical bag-of-words approach.

In this notebook we are going to construct word embeddings using neural networks. The spirit of the method is to use 'prediction as an excuse': either predict a target word conditional on its surrounding words (_continuous-bag-of-words_) or predict the surrounding words conditional on the target (_skip-gram_). What we care about is not the final output but the _hidden layer_ projection of a two-layer neural network designed to solve that prediction problem (see skip-gram diagram below from Mikolov et al. 2013).

<img src='img/wordembeddings_diagram.png' />
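To make the skip-gram prediction problem concrete, here is a minimal sketch of how (target, context) training pairs could be generated from one tokenized sentence. The helper `skipgram_pairs` and the fixed symmetric window are assumptions made for illustration; real implementations add refinements (such as subsampling frequent words) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ['she', 'attended', 'the', 'academy']
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair is one training example: the network sees the target word and is asked to predict the context word. A larger `window` yields more pairs per target and therefore a broader notion of context.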

If we choose the hidden layer to have $K$ neurons, then each target word is represented by a $K$-dimensional vector of hidden outputs, which we call its _embedding_. In practice $K$ is typically between 100 and 300, although the right value depends on the vocabulary size. For learning purposes we are going to apply Google's well-known skip-gram Word2Vec approach (Mikolov et al. 2013) to a corpus that is _too small_ by usual standards, so we reduce our word representations to vectors of 30 dimensions. Keep in mind that the typical required vocabulary size is at least half a million unique tokens.
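Why is the hidden output a vector "per word"? A minimal sketch, assuming one-hot encoded inputs and a hidden layer without activation function: multiplying a one-hot vector by the input weight matrix simply selects one row of that matrix, and that row is the word's embedding. The toy vocabulary, the random weights and the helper names below are all made up for illustration.

```python
import random

random.seed(0)

vocab = ['she', 'attended', 'the', 'academy']   # toy vocabulary (assumption)
K = 3                                           # embedding dimension

# Input-to-hidden weight matrix: one row of K weights per vocabulary word
W = [[random.uniform(-0.5, 0.5) for _ in range(K)] for _ in vocab]

def one_hot(word):
    """Vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_output(word):
    """Hidden layer with no activation: one-hot input times W."""
    x = one_hot(word)
    return [sum(x[i] * W[i][k] for i in range(len(vocab))) for k in range(K)]

embedding = hidden_output('academy')
# The hidden output is exactly the row of W that belongs to the word
print(embedding == W[vocab.index('academy')])
```

Training adjusts the rows of `W` so that words appearing in similar contexts end up with similar rows.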

Last note on Word2Vec: it is very powerful! When trained on very large corpora (like all of English Wikipedia), it can solve striking analogies, such as finding that the vector closest to the result of the operation 'king' - 'man' + 'woman' is 'queen'. Alternative approaches and packages exist, such as GloVe (Stanford NLP) or FastText.
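The analogy operation can be sketched with plain vector arithmetic and cosine similarity. The 2-dimensional vectors below are hand-picked for illustration (they are not trained embeddings), and the helpers `cosine` and `analogy` are hypothetical names, not part of any library.

```python
import math

# Hand-picked 2-d toy vectors (an assumption, not trained embeddings):
# first coordinate ~ "royalty", second coordinate ~ "gender"
toy = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.1,  1.0],
    'woman': [0.1, -1.0],
    'apple': [-1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

best = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy)
print(best)
```

With real embeddings the result is rarely this clean: the quality of such analogies depends heavily on the size of the training corpus.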

## Data

We are going to embed the vocabulary from the corpus of Frances Ellen Watkins Harper's books we encountered on day 1. First, let's read the files into a single string:

In [10]:
import codecs
import glob
import os

DATA_DIR = 'data'

# Collect every .txt file in the Harper corpus folder
fnames = glob.glob(os.path.join(DATA_DIR, 'harper', '*.txt'))

# Concatenate all files into one string ('utf-8-sig' drops a leading BOM if present)
raw = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        raw += f.read()

## Quick pre-processing

In [11]:
text = raw[1114:] # gets rid of meta information at the beginning

# A few modifications before sentence segmentation
text = text.replace('Mrs.', 'Mrs')
text = text.replace('Mr.', 'Mr')
text = text.replace('\n', ' ')
text = text.replace('\r', ' ')

# Sentence segmentation
import re
sent_boundary_pattern = r'[.?!]'
sentences = re.split(sent_boundary_pattern, text)

# Remove punctuation, special characters and upper cases
from string import punctuation
special = ['“', '”']
sentences = [''.join([ch for ch in sent if ch not in punctuation and ch not in special]) for sent in sentences]
sentences = [sent.lower() for sent in sentences]

# Remove white space
sentences = [sent.strip() for sent in sentences]

# Tokenization within sentence
list_of_list = [sent.split() for sent in sentences]
list_of_list[:2]

[['ocate', 'for', 'civil', 'rights'],
 ['she',
  'attended',
  'the',
  'academy',
  'for',
  'negro',
  'youth',
  'and',
  'was',
  'educated',
  'as',
  'a',
  'teacher']]
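Why remove the period from 'Mr.' and 'Mrs.' before segmenting? The sketch below applies the same boundary pattern to a made-up sample string, with and without that replacement. The helper `split_sentences` is hypothetical; unlike the cell above, it also drops the empty strings left over by the split.

```python
import re

sample = 'Mrs. Harper spoke. Did Mr. Smith listen? Yes!'   # made-up text

def split_sentences(text):
    # Same boundary pattern as above; drop empty pieces left by the split
    return [s.strip() for s in re.split(r'[.?!]', text) if s.strip()]

naive = split_sentences(sample)
fixed = split_sentences(sample.replace('Mrs.', 'Mrs').replace('Mr.', 'Mr'))
print(naive)
print(fixed)
```

Without the replacement, every title is treated as a sentence of its own, which would cut the context window of the following words short.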

## Train a skip-gram model with Word2Vec 

First, you'll need to install [Gensim](https://pypi.org/project/gensim/). You can do so directly in the notebook using `!pip install`.

In [14]:
!pip install gensim



In [17]:
from gensim.models import Word2Vec

model = Word2Vec(min_count=2, vector_size=30, sg=1)
model.build_vocab(list_of_list)  # prepare the model vocabulary
model.train(list_of_list, total_examples=model.corpus_count, epochs=model.epochs)  # Gensim 4 renamed `iter` to `epochs`


(314670, 450755)
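The `min_count=2` argument tells Word2Vec to ignore words whose total frequency is below 2, which is why the vocabulary of the model is smaller than the number of distinct tokens in the corpus. Here is a minimal sketch of that filtering step on a toy corpus; the corpus and the helper `vocabulary` are assumptions for illustration, and in the notebook you would pass `list_of_list` instead.

```python
from collections import Counter

# Toy tokenized corpus (an assumption; in the notebook this would be list_of_list)
toy_corpus = [
    ['she', 'was', 'a', 'teacher'],
    ['she', 'was', 'educated'],
    ['a', 'teacher', 'and', 'a', 'poet'],
]

def vocabulary(corpus, min_count):
    """Words whose total frequency is at least `min_count`."""
    counts = Counter(token for sent in corpus for token in sent)
    return sorted(w for w, c in counts.items() if c >= min_count)

vocab_all = vocabulary(toy_corpus, min_count=1)
vocab_kept = vocabulary(toy_corpus, min_count=2)
print(len(vocab_all), len(vocab_kept))
print(vocab_kept)
```

Words seen only once give the model a single context to learn from, so dropping them usually costs little and removes noisy vectors.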

## Assess model accuracy
### Size of vocabulary

In [19]:
print(len(model.wv.index_to_key))

4464


### Latent vector representation

In [21]:
print(model.wv.get_vector('woman'))

[ 0.06924564  0.28071073  0.56905425 -0.18667486  0.00204633 -0.1813182
  0.11958066  0.26388696 -0.5582225  -0.17833006  0.481077   -0.11125989
  0.05924989 -0.00705785 -0.26934934 -0.02716676  0.54685074  0.11274735
 -0.45409024  0.04612001  0.5777753  -0.6056014   0.26844725  0.38412955
  0.12074065 -0.09278563  0.84118444  0.07433166 -0.38083047 -0.4524818 ]


### Similarity between words

In [35]:
print(model.wv.similarity('woman', 'daughter'))

0.9143768


### Most similar words 

In [36]:
print(model.wv.similar_by_word('woman'))

[('girl', 0.9766966104507446), ('man', 0.9766955375671387), ('great', 0.975362241268158), ('known', 0.9738488793373108), ('true', 0.9689123034477234), ('slave', 0.9674014449119568), ('thought', 0.9663357138633728), ('knew', 0.9654760360717773), ('too', 0.9645689725875854), ('white', 0.964078962802887)]


In [38]:
vector = model.wv.get_vector('woman') - model.wv.get_vector('girl') 
print(model.wv.similar_by_vector(vector))

[('work', 0.37240225076675415), ('colored', 0.34611696004867554), ('copy', 0.3123661279678345), ('people', 0.3053143620491028), ('other', 0.29998040199279785), ('be', 0.295703262090683), ('terms', 0.2878722548484802), ('united', 0.2847304344177246), ('refund', 0.2731558084487915), ('associated', 0.2707774341106415)]


## Challenge

Try to improve the model by tuning its parameters:
- Increase the context window
- Construct continuous-bag-of-words representations
