<a id='top'></a><a name='top'></a>
# Chapter 6: Reasoning with word vectors (Word2vec)

## Word vectors (6.2.5 - 6.2.10)

* [Introduction](#introduction)
* [6.0 Imports and Setup](#6.0)
* [6.2 Word vectors](#6.2)
    - [6.2.5 Word2vec vs. GloVe (Global Vectors)](#6.2.5)
    - [6.2.6 fastText](#6.2.6)
    - [6.2.7 Word2Vec vs. LSA](#6.2.7)
    - [6.2.8 Visualizing word relationships](#6.2.8)
    - [6.2.9 Unnatural words](#6.2.9)
    - [6.2.10 Document similarity with Doc2vec](#6.2.10)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>


### Datasets

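* glove.6B.zip: [script](#glove.6B.zip), [source](https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip)
* glove.42B.300d.zip: [script](#glove.42B.300d.zip), [source](https://nlp.stanford.edu/data/glove.42B.300d.zip)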


### Explore

* How word vectors are created
* Using pretrained models for applications
* Reasoning with word vectors to solve real problems
* Visualizing word vectors
* Uses for word embeddings
* How every word has some geography, sentiment (positivity), and gender associated with it

### Key points

* Word vectors and vector-oriented reasoning can solve problems like analogy questions and non-synonymy relationships between words.
* It is possible to train Word2vec and other word vector embeddings on the words in your own application, so that an NLP pipeline isn't "polluted" by the GoogleNews meaning of words inherent in most pretrained Word2vec models.
* Gensim can be used to explore, visualize, and build word vector vocabularies.
* A PCA projection of geographic word vectors like US city names can reveal the cultural closeness of places that are geographically far apart.
* If you respect sentence boundaries with n-grams and are efficient at setting up word pairs for training, you can greatly improve the accuracy of your latent semantic analysis word embeddings. 
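
The analogy reasoning mentioned in the key points can be sketched with plain vector arithmetic. The snippet below is a minimal illustration only: the three-dimensional `toy_vectors` are hypothetical hand-picked values (not taken from any pretrained model), and `analogy` is a helper name introduced here for the example.

```python
import math

# Hypothetical 3-d toy vectors (roughly: royalty, maleness, femaleness).
# Real embeddings have hundreds of dimensions learned from a corpus.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is conventional for analogy tasks.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("man", "king", "woman", toy_vectors))
```

With real pretrained vectors the same arithmetic applies, but the answer is chosen from the full vocabulary, usually through a nearest-neighbor search.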

---
<a name='6.0'></a><a id='6.0'></a>
# 6.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import os
if not os.path.exists('setup'):
    os.mkdir('setup')

In [2]:
req_file = "setup/requirements_06.txt"

In [3]:
%%writefile {req_file}
annoy
isort
watermark

Overwriting setup/requirements_06.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
# if IS_COLAB:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [6]:
%%writefile setup/chp06_imports.py
import pickle
import locale
import pprint
import random
import warnings

import seaborn as sns
from annoy import AnnoyIndex
from tqdm.auto import tqdm
from watermark import watermark

Overwriting setup/chp06_imports.py


In [7]:
!isort setup/chp06_imports.py --sl
!cat setup/chp06_imports.py

Fixing /Users/gb/Desktop/examples/setup/chp06_imports.py
import locale
import pickle
import pprint
import random

import seaborn as sns
from annoy import AnnoyIndex
from tqdm.auto import tqdm
from watermark import watermark


In [8]:
import locale
import pickle
import pprint
import random
import warnings

import seaborn as sns
from annoy import AnnoyIndex
from tqdm.auto import tqdm
from watermark import watermark

In [9]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)
random.seed(23)

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

sys    : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
seaborn: 0.12.1



---


<a name='6.2.5'></a><a id='6.2.5'></a>
## 6.2.5 Word2vec vs. GloVe (Global Vectors)
<a href="#top">[back to top]</a>

Problem: How does the GloVe model differ from Word2vec?

Idea: Word2vec relies on backpropagation to update the weights that form the word embeddings. GloVe produces matrices equivalent to Word2vec's input and output weight matrices, but it does so by directly optimizing a cost function with gradient descent. That cost function is defined on global word-word co-occurrence statistics, that is, counts aggregated across the entire corpus rather than one context window at a time. The resulting representations can reveal interesting linear substructures of the word vector space.

Importance: In general, GloVe trains faster and is more likely to find the global optimum for the vector representations, which gives more accurate results. GloVe can also be trained on smaller corpora and still converge.
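
To make the "global co-occurrence" idea concrete, here is a minimal sketch of the two ingredients of the GloVe objective: a word-word co-occurrence count and the weighted least-squares loss from the GloVe paper, sum over pairs of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2. The toy corpus, the window size, and the function names are assumptions made for illustration. Unlike the reference implementation, this sketch does not down-weight distant context words and does not perform any training.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric word-word co-occurrences within a context window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, w, w_ctx, b, b_ctx):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    total = 0.0
    for (wi, wj), x in counts.items():
        dot = sum(p * q for p, q in zip(w[wi], w_ctx[wj]))
        total += glove_weight(x) * (dot + b[wi] + b_ctx[wj] - math.log(x)) ** 2
    return total


# Toy corpus (assumption): two short, pre-tokenized sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
X = cooccurrence_counts(corpus)

# Untrained parameters: all-zero vectors and biases, just to evaluate the loss.
vocab = sorted({t for s in corpus for t in s})
zeros = {t: [0.0, 0.0] for t in vocab}
bias = {t: 0.0 for t in vocab}
print(X[("the", "sat")], glove_loss(X, zeros, zeros, bias, bias))
```

Training GloVe amounts to adjusting the vectors and biases to drive this loss down. Only nonzero counts contribute, which is why the model can be fit on aggregated statistics rather than by streaming over the raw text.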

The example below is from *Real World Natural Language Processing* by Masato Hagiwara.

References:
* https://github.com/mhagiwara/realworldnlp/blob/master/examples/embeddings/glove_lookup.py
* https://nlp.stanford.edu/projects/glove/
* https://www.geeksforgeeks.org/pre-trained-word-embedding-using-glove-in-nlp-models/

<a id='glove.42B.300d.zip'></a><a name='glove.42B.300d.zip'></a>
### Dataset: glove.42B.300d.zip
<a href="#top">[back to top]</a>

In [10]:
data_dir = 'data/data_glove'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
data_glove_path = f'{data_dir}/glove.42B.300d.txt'
data_glove_src = f'{data_dir}/glove.42B.300d.zip'
print(data_glove_path)
HR()

!wget -P {data_dir} -nc https://nlp.stanford.edu/data/glove.42B.300d.zip
HR()
!ls -l {data_glove_src}

data/data_glove/glove.42B.300d.txt
----------------------------------------
File ‘data/data_glove/glove.42B.300d.zip’ already there; not retrieving.

----------------------------------------
-rw-r--r--@ 1 gb  staff  1877800501 Oct 25  2015 data/data_glove/glove.42B.300d.zip


In [11]:
# Reset project
# !rm -fr {data_glove_path}

In [12]:
%%time
import os.path

if not os.path.isfile(data_glove_path):
    print(f"{data_glove_path} not found, extracting now.")
    !unzip {data_glove_src} -d {data_dir}
    print("Done")
    !ls -l {data_glove_path}

CPU times: user 1.15 ms, sys: 1.24 ms, total: 2.38 ms
Wall time: 2.24 ms


In [13]:
GLOVE_FILE_PREFIX_TXT = f"{data_dir}/glove.42B.300d.txt"
GLOVE_FILE_PREFIX_ANN = f"{data_dir}/glove.42B.300d.ann"  # Annoy index file
GLOVE_FILE_PREFIX_PKL = f"{data_dir}/glove.42B.300d.pkl"
GLOVE_FILE_PREFIX_I2W = f"{data_dir}/glove.42B.300d.i2w"

HR()

print(GLOVE_FILE_PREFIX_TXT)
print(GLOVE_FILE_PREFIX_ANN)
print(GLOVE_FILE_PREFIX_PKL)
print(GLOVE_FILE_PREFIX_I2W)

----------------------------------------
data/data_glove/glove.42B.300d.txt
data/data_glove/glove.42B.300d.ann
data/data_glove/glove.42B.300d.pkl
data/data_glove/glove.42B.300d.i2w


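### Annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a library for approximate nearest-neighbor search. It is used here to index the GloVe vectors so that the words closest to a query word can be looked up quickly.

https://github.com/spotify/annoy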

In [14]:
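# From Real World NLP Book
# https://github.com/mhagiwara/realworldnlp/blob/master/examples/embeddings/glove_lookup.py
# Using annoy-1.15.1

EMBEDDING_DIM = 300
# Annoy's default metric, passed explicitly because newer releases warn when it is omitted.
ANNOY_METRIC = 'angular'


def build_index():
    """Read the GloVe text file, build an Annoy index, and save it with an index-to-word map."""
    print("Start building index..")
    num_trees = 10  # more trees give more precise queries but a larger index

    idx = AnnoyIndex(EMBEDDING_DIM, ANNOY_METRIC)
    index_to_word = {}

    # Each line holds a token followed by EMBEDDING_DIM space-separated floats.
    with open(GLOVE_FILE_PREFIX_TXT, encoding='utf-8') as f:
        for i, line in enumerate(tqdm(f)):
            fields = line.rstrip().split(' ')
            vec = [float(x) for x in fields[1:]]
            idx.add_item(i, vec)
            index_to_word[i] = fields[0]
            # if i > 100_000:
            #     break

    idx.build(num_trees)
    idx.save(GLOVE_FILE_PREFIX_ANN)
    with open(GLOVE_FILE_PREFIX_I2W, mode='wb') as f:
        pickle.dump(index_to_word, f)
    print("Done building index.")


def search(query, top_n=10):
    """Print the row id of `query`, then its top_n approximate nearest neighbors."""
    idx = AnnoyIndex(EMBEDDING_DIM, ANNOY_METRIC)
    idx.load(GLOVE_FILE_PREFIX_ANN)
    with open(GLOVE_FILE_PREFIX_I2W, mode='rb') as f:
        index_to_word = pickle.load(f)
    word_to_index = {word: index for index, word in index_to_word.items()}
    query_id = word_to_index[query]  # raises KeyError for out-of-vocabulary words
    print(query_id)
    word_ids = idx.get_nns_by_item(query_id, top_n)
    for word_id in word_ids:
        print(index_to_word[word_id])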
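
For comparison, the exact search that Annoy approximates can be written as a brute-force scan. The sketch below ranks every word by cosine similarity to the query; Annoy's angular distance orders neighbors the same way, but its tree-based index avoids scanning the whole vocabulary. The two-dimensional `toy` vectors and the `exact_neighbors` helper are hypothetical and exist only for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_neighbors(query, vectors, top_n=3):
    """Rank every word by similarity to `query`: one full scan per lookup."""
    q = vectors[query]
    ranked = sorted(vectors,
                    key=lambda w: cosine_similarity(q, vectors[w]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical 2-d vectors for illustration only; not taken from GloVe.
toy = {
    "dog": [1.0, 0.1],
    "puppy": [0.9, 0.2],
    "cat": [0.7, 0.6],
    "december": [0.0, 1.0],
}
print(exact_neighbors("dog", toy))
```

As with `search`, the query word itself comes back first because it is its own nearest neighbor. A scan like this is impractical per query on a vocabulary the size of the 42B GloVe file, which is the motivation for building an approximate index once and reusing it.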

In [15]:
%%time
import os.path
if not os.path.isfile(GLOVE_FILE_PREFIX_ANN):
    print("Build Annoy index file.")
    build_index()
else:
    print(f"Annoy index file {GLOVE_FILE_PREFIX_ANN} already exists.")
HR()

Annoy index file data/data_glove/glove.42B.300d.ann already exists.
----------------------------------------
CPU times: user 1.26 ms, sys: 1.83 ms, total: 3.09 ms
Wall time: 3.35 ms


In [16]:
!du -h {data_dir}/* | sort -h

 29M	data/data_glove/glove.42B.300d.i2w
1.7G	data/data_glove/glove.42B.300d.zip
2.4G	data/data_glove/glove.42B.300d.ann
4.7G	data/data_glove/glove.42B.300d.txt


In [17]:
search('dog')

828
dog
dogs
puppy
cat
cats
puppies
rabbit
paws
pig
toy


In [18]:
search('december')

543
december
january
october
november
september
february
august
july
april
march


In [19]:
search('sun')

821
sun
30
am
planet
ocean
see
looking
good
full
side


In [20]:
search('snow')

1947
snow
winter
snowfall
ice
ski
mountain
skiing
weather
riding
sand

