# Learning Word Embeddings

Source: 😊[Day 12 - Special Data Types: Natural Language Processing](https://github.com/core-skills/12-text-processing) *repository*

> ☝️Before moving on with this notebook, ensure that you have:
- Downloaded the [hotels review data](https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/reviews_data.txt.gz) (*reviews_data.txt.gz*) and placed it in the `./data` directory.
- Downloaded the GSWA data (*wamex_xml.zip*) and placed it in the `./data` directory.

**Overview**: Generating word embeddings using Gensim Word2Vec... Word Embeddings and Word2Vec
In this notebook, rather than loading word embeddings trained on large generic datasets, we train our own embedding models.

This notebook uses two types of datasets.
1. General Domain: 259,000 *hotel reviews* from [OpinRank](https://archive.ics.uci.edu/ml/datasets/opinrank+review+dataset) 
2. Domain Specific: *geological surveys* (GSWA)

**Supplementary Content**: ...s

Adapted from: [*Kavita Ganesan's tutorial*](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/) ([*repository*](https://github.com/kavgan/data-science-tutorials/tree/master/word2vec))

## Table of Contents
1. [Understanding Word2Vec parameters](#word2vec_parameters)
2. [Building general-domain word vectors](#general-domain_word_vectors)
3. [Building domain-specific word vectors](#domain-specific_word_vectors)

### Import Dependencies
- [logging](https://docs.python.org/3/library/logging.html) - library that we use to log and track progress of data preprocessing
- [gzip](https://docs.python.org/3/library/gzip.html) - library that we use to read data in gzip (.gz) format
- [tqdm](https://github.com/tqdm/tqdm) - library that we use to track the progress of various intensive operations
- [bokeh](https://bokeh.org/) - library that we use to interactively visualise word vectors 
- [gensim](https://radimrehurek.com/gensim/) - library that we will use to experiment with word embeddings/vectors

In [1]:
import logging
import gzip
from pathlib import Path
from typing import List
from tqdm import tqdm

import gensim
import bokeh

### Set up the notebook environment and load helper functions

In [2]:
def read_input(input_file_path: str) -> List[List[str]]:
    '''Parses a gzip-compressed input file into a list of token lists (one per line)'''
    assert input_file_path.endswith('.gz')
    
    corpus = []
    with gzip.open(input_file_path, 'rb') as f:
        for line in tqdm(f, desc="Reading file"): 
            # Perform pre-processing and return a list of words from each review text 
            corpus.append(gensim.utils.simple_preprocess(line))
    return corpus

In [3]:
def prettify_similarities(similarities: List[tuple]) -> str:
    ''' Formats the list of (word, similarity) tuples produced by Gensim as a numbered, newline-separated string.
    '''
    longest_str = max([len(sim[0]) for sim in similarities])
    return "\n".join([f'{idx+1}.\t{sim[0]:{longest_str+1}}\t{sim[1]*100:0.1f}%' for idx, sim in enumerate(similarities)])

## Understanding Word2Vec Parameters<a name="word2vec_parameters"></a>

Before training our custom embedding models, we need to understand some of the model's parameters. For reference, this is the command that we will use to train the model: `model = gensim.models.Word2Vec(sentences=documents, size=150, window=10, min_count=2, workers=10)`

#### Parameters of Interest
`sentences`: The corpus that the model will be trained on in the format of a list of lists of tokens.

`size`: The dimensionality of the dense vector that represents each token or word (this parameter is named `vector_size` in Gensim 4.0 and later). If you have very limited data, use a much smaller value. If you have lots of data, it's good to experiment with various sizes. Typical sizes are 100-300. A value of 100-150 has worked well for me.

`window`: The maximum distance between the target word and a neighboring word. Neighbors that lie further than the window width to the left or the right of the target word are not treated as context for it. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.

`min_count`: Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.

`workers`: The number of worker threads used to train the model.
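To make `window` and `min_count` more concrete, here is a minimal pure-Python sketch of how the two parameters shape the (target, context) pairs a Word2Vec-style model learns from. The function `context_pairs` and the toy corpus are hypothetical and only for illustration. The real implementation does more than this (for example, it can randomly shrink the window and subsample frequent words), which the sketch ignores.

```python
from collections import Counter
from typing import List, Tuple


def context_pairs(sentences: List[List[str]], window: int, min_count: int) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (target, context) pairs after dropping rare words."""
    # Count how often each token occurs across the whole corpus
    counts = Counter(token for sentence in sentences for token in sentence)

    pairs = []
    for sentence in sentences:
        # Words seen fewer than min_count times are removed before windowing
        kept = [token for token in sentence if counts[token] >= min_count]
        for i, target in enumerate(kept):
            # Look at most `window` positions to the left and to the right
            start = max(0, i - window)
            end = min(len(kept), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs


# Toy corpus: 'clean' and 'dirty' each occur once, so min_count=2 drops them
toy_corpus = [
    ["the", "room", "was", "clean"],
    ["the", "room", "was", "dirty"],
]
pairs = context_pairs(toy_corpus, window=1, min_count=2)
print(pairs)
```

Increasing `window` adds pairs between words that sit further apart, while increasing `min_count` removes rare words from both sides of every pair.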

<hr/>

## 🏢 Learning General-Domain Word Embeddings <a name="general-domain_word_vectors"></a>
In this section we will learn word embeddings from a general-domain corpus (hotel reviews). The hotel reviews dataset contains:
- Full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago) 
- There are about 80-700 hotels in each city 
- Extracted fields include date, review title and the full review 
- Total number of reviews: ~259,000

### Load and pre-process hotel reviews dataset
Before learning word embeddings, we need to load and pre-process the hotel reviews corpus. The helper function `read_input` aids us with this task. This helper reads the gzip-compressed text file (.gz) line by line, treating each line as one review. Each line is pre-processed with Gensim's `gensim.utils.simple_preprocess()` function, which converts the review into a list of lower-cased tokens.

⚠️Loading the hotel reviews documents will take a few minutes

In [4]:
# The data_file path will be different depending on where you've copied the notebooks...
data_file = '../data/reviews_data.txt.gz'

# Read the tokenized reviews into a list. Each review becomes a list of words, so the result is a list of lists.
documents = read_input(input_file_path=data_file)

Reading file: 255404it [01:07, 3799.15it/s]


In [5]:
# reviewing the tokenized hotel reviews documents
print(" ".join(documents[0]))

oct nice trendy hotel location not too bad stayed in this hotel for one night as this is fairly new place some of the taxi drivers did not know where it was and or did not want to drive there once have eventually arrived at the hotel was very pleasantly surprised with the decor of the lobby ground floor area it was very stylish and modern found the reception staff geeting me with aloha bit out of place but guess they are briefed to say that to keep up the coroporate image as have starwood preferred guest member was given small gift upon check in it was only couple of fridge magnets in gift box but nevertheless nice gesture my room was nice and roomy there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by bliss the location is not great it is at the last metro stop and you then need to take taxi but if you are not planning on going to see the historic sites in beijing then you will be ok chose to have some breakfast in the 

### Training general-domain word embeddings
Here we set up our Word2Vec embedding model and train it on the documents in the hotel reviews dataset. After training, we save it so it can be used without requiring re-training.

⚠️ Training the embedding model will take a while due to the 250k documents

📣 Alternatively, load the embeddings that we've already pre-trained.

In [None]:
modelpath = Path('../data/word2vec/word2vec_reviews.bin').resolve()

In [None]:
model = gensim.models.Word2Vec.load(str(modelpath))

### Reviewing learnt word embeddings
Similar to the notebook [12.2.1 - Word vector visualisation with Gensim](), we will explore the similarity of words represented with our general-domain embeddings.

#### Word Similarities

In [None]:
# print() renders the newlines and tabs in the formatted string
print(prettify_similarities(model.wv.most_similar(['polite'])))

In [None]:
# Get vector representation of a word
testWord = 'dirty'
print(f'Word - {testWord}\nVector Representation\n{model.wv.get_vector(testWord)}')
print(f'Shape of vector: {model.wv.get_vector(testWord).shape[0]}')

In [None]:
w1 = "dirty"
model.wv.most_similar(positive=w1)

In [None]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1,topn=6)

In [None]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1,topn=6)

In [None]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1,topn=6)

In [None]:
# Get everything related to stuff on the bed
w1 = ['bed','sheet','pillow']
w2 = ['couch']
model.wv.most_similar(positive=w1,negative=w2,topn=10)
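Conceptually, a query with `positive` and `negative` words adds the positive vectors, subtracts the negative ones, and then ranks the remaining vocabulary by cosine similarity to the result. The sketch below illustrates that idea in pure Python. It is a simplified illustration rather than Gensim's actual code, and the function `most_similar_sketch` and the 2-D toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def unit(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors: Dict[str, List[float]], positive: List[str],
                        negative: List[str], topn: int = 3) -> List[Tuple[str, float]]:
    # Positive words get weight +1, negative words get weight -1
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(vectors.values())))

    # Average the weighted unit vectors to build the query vector
    query = [0.0] * dims
    for word, weight in weighted:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += weight * x / len(weighted)
    query = unit(query)

    # Rank every other word by cosine similarity to the query
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue
        scores.append((word, sum(a * b for a, b in zip(query, unit(vec)))))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]


# Hypothetical 2-D vectors: first axis ~ "bedding", second axis ~ "seating"
toy_vectors = {
    "bed": [1.0, 0.1], "sheet": [0.9, 0.2], "pillow": [0.95, 0.15],
    "couch": [0.2, 1.0], "duvet": [0.9, 0.0], "sofa": [0.1, 0.9],
    "lobby": [0.5, 0.5],
}
result = most_similar_sketch(toy_vectors, ["bed", "sheet", "pillow"], ["couch"])
print(result)
```

Subtracting `couch` pushes the query away from the "seating" direction, which is why seating-like words fall to the bottom of the ranking.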

In [None]:
# Similarity between two different words
print(f'Similarity between the words dirty and smelly: {model.wv.similarity(w1="dirty",w2="smelly"):0.3}')

In [None]:
# Similarity between two identical words
print(f'Similarity between the same word dirty: {model.wv.similarity(w1="dirty",w2="dirty"):0.3}')

In [None]:
# Similarity between two words with opposite meanings
print(f'Similarity between the words dirty and clean: {model.wv.similarity(w1="dirty",w2="clean"):0.3}')
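The score returned by `model.wv.similarity` is the cosine similarity between the two word vectors: 1.0 for vectors pointing in the same direction, around 0 for unrelated directions, and negative for opposing ones. A minimal sketch of the calculation follows. The helper `cosine_similarity` and the 3-D vectors are hypothetical stand-ins, not values from the trained model.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (real embeddings here have 150 dimensions)
dirty = [0.9, 0.1, 0.3]
smelly = [0.8, 0.2, 0.4]
pool = [-0.2, 0.9, 0.1]

print(f"dirty vs dirty:  {cosine_similarity(dirty, dirty):0.3}")
print(f"dirty vs smelly: {cosine_similarity(dirty, smelly):0.3}")
print(f"dirty vs pool:   {cosine_similarity(dirty, pool):0.3}")
```

Because the measure depends only on direction, a word compared with itself always scores 1.0, regardless of the length of its vector.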

In [None]:
# Which one is the odd one out in this list?
oddOneOutList = ["cat", "dog", "france"]
print(f"Which word doesn't belong in the set [cat, dog, france]? Odd one out: {model.wv.doesnt_match(oddOneOutList)}")

In [None]:
# Which one is the odd one out in this list?
oddOneOutList = ["bed","pillow","duvet","shower"]
print(f"Which word doesn't belong in the set [bed, pillow, duvet, shower]? Odd one out: {model.wv.doesnt_match(oddOneOutList)}")
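One way to think about the odd-one-out query is: average the (unit-length) vectors of all words in the list, then return the word whose vector is least similar to that average. The pure-Python sketch below illustrates this idea. It is a simplified illustration, and `odd_one_out` and the 2-D toy vectors are hypothetical.

```python
import math
from typing import Dict, List


def unit(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors: Dict[str, List[float]], words: List[str]) -> str:
    # Normalise each word vector so that every word contributes equally
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])

    # Mean direction of the group
    mean = unit([sum(u[k] for u in units) / len(units) for k in range(dims)])

    # Cosine similarity of each word to the mean; the lowest score is the outlier
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]


# Hypothetical 2-D vectors: first axis ~ "bedding", second axis ~ "bathroom"
toy_vectors = {
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "duvet": [0.95, 0.05],
    "shower": [0.1, 1.0],
}
outlier = odd_one_out(toy_vectors, ["bed", "pillow", "duvet", "shower"])
print(outlier)
```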

## Visualising the word vectors in 2D space

Here we use t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique implemented in Scikit-Learn that is particularly suited to visualising high-dimensional data. Another popular option for dimensionality reduction is Principal Component Analysis (PCA).

Note: we chose a dimension of 150 for our word vectors; other typical values are 50, 100, and 300. Of these, a dimension of 300 is often reported to be the most effective at capturing the syntactic and semantic information of a word. However, it takes much longer to train.

In [None]:
# Load required statistical package
from sklearn.manifold import TSNE

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# Render Matplotlib figures inline in the notebook
%matplotlib inline
# Plot formatting
plt.rcParams["figure.figsize"] = [16,9]
font = {'size':16}
matplotlib.rc('font', **font)

In [None]:
def display_closestwords_tsnescatterplot(model, word):
    
    # Start with an empty array whose width matches the embedding dimension
    arr = np.empty((0, model.vector_size), dtype='f')
    word_labels = [word]

    # get close words
    close_words = model.similar_by_word(word)
    
    # add the vector for each of the closest words to the array
    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    # find tsne coords for 2 dimensions
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    # display scatter plot
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Pad the axis limits slightly so the outermost points are not clipped
    plt.xlim(x_coords.min()-0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()-0.00005, y_coords.max()+0.00005)
    plt.title(f'Words closest to: {word}')
    plt.show()

### Visualising closest words in 2D

In [None]:
display_closestwords_tsnescatterplot(model.wv, "bed")

## ⚙️ Learning Domain-Specific Word Embeddings <a name="domain-specific_word_vectors"></a>

Word embeddings are interesting and thought-provoking, but:
- What insights can we gain from them?
- How do we apply them to industry data?

This section of the notebook will explore word embeddings and their application to geological survey data (GSWA).

## Loading a small sample of Geological Survey of Western Australia (GSWA) data
Note: we have already added our Google Drive to the notebook, so we don't need to do this again.

In [None]:
import zipfile
import json

In [None]:
data_file="../data/wamex_xml.zip"

In [None]:
# Import Dataset
data = list()
with zipfile.ZipFile(data_file, "r") as z:
    # Number of files in the archive
    print(len(z.namelist()))
    for filename in z.namelist():
        with z.open(filename) as f:
            # Load the JSON file; the resulting `content` is a list
            content = json.loads(f.read())
            # Join the list into a single string
            content = "".join(content)
            # Add to the data list
            data.append(content)

In [None]:
# Previewing the data that we have loaded. This is very different to the hotel reviews dataset.
print(data[0])

In [None]:
def read_input(input_file):
    """Reads the input file, which is in zip format, and yields one token list per report"""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with zipfile.ZipFile(input_file, "r") as z:
        # Number of files in the archive
        print(len(z.namelist()))
        for i, filename in enumerate(z.namelist()):
            if i % 100 == 0:
                logging.info("read {0} reports".format(i))
            with z.open(filename) as f:
                # Load the JSON file; the resulting `content` is a list
                content = json.loads(f.read())
                # Join the list into a single string
                content = "".join(content)
                if len(content) >= 10:
                    # Tokenize and lower-case the report text
                    yield gensim.utils.simple_preprocess(content)
                else:
                    logging.info("removed {0} because of small size".format(filename))

In [None]:
# Read the tokenized reports into a list. Each report becomes a list of words, so the result is a list of lists.
documents = list(read_input(data_file))
logging.info("Done reading data file")

In [None]:
# Review the first 25 tokens of the first document to see how it has been pre-processed and tokenized.
print(documents[0][:25])

### Training a word embedding model on domain-specific text

Note: domain-specific terms often span more than one word, so we need to use bigrams to aid our model.
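To illustrate what using bigrams means here, the sketch below merges frequently co-occurring adjacent tokens into a single token (for example `iron_ore`) before training, so the model can learn one vector for the whole term. This is a deliberately simple, count-based illustration: the function `merge_frequent_bigrams`, its threshold, and the toy reports are hypothetical, and dedicated phrase-detection tools typically use a statistical score rather than a raw count.

```python
from collections import Counter
from typing import List


def merge_frequent_bigrams(sentences: List[List[str]], min_count: int) -> List[List[str]]:
    """Hypothetical helper: join adjacent token pairs seen at least min_count times."""
    # Count every adjacent token pair in the corpus
    pair_counts = Counter(
        (a, b) for sentence in sentences for a, b in zip(sentence, sentence[1:])
    )

    merged_sentences = []
    for sentence in sentences:
        merged, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                # Frequent pair: emit a single joined token and skip both words
                merged.append("_".join(pair))
                i += 2
            else:
                merged.append(sentence[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences


# Toy reports: only the pair ('iron', 'ore') occurs twice
reports = [
    ["iron", "ore", "was", "sampled"],
    ["high", "grade", "iron", "ore"],
    ["ore", "body", "mapped"],
]
merged_reports = merge_frequent_bigrams(reports, min_count=2)
print(merged_reports)
```

The merged token lists can then be passed to the embedding model in place of the original documents.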

In [None]:
import nltk

Similar to the hotel reviews, we load pre-trained embeddings rather than training them due to time requirements.

In [None]:
modelDS = gensim.models.Word2Vec.load('../data/word2vec/word2vec_gswa.bin')

## Looking at domain-specific outputs

In [None]:
# Preview the first 100 words in the domain-specific vocabulary
print(list(modelDS.wv.vocab.keys())[:100])

In [None]:
print(modelDS.wv['gold'])

In [None]:
w1 = "gold"
modelDS.wv.most_similar(positive = w1)

In [None]:
# look up top 6 words similar to 'iron'
w1 = ["iron"]
modelDS.wv.most_similar(positive = w1, topn = 6)

In [None]:
# Get terms related to gold, commodity and ore, steering away from 'rock'
w1 = ["gold",'commodity','ore']
w2 = ['rock']
modelDS.wv.most_similar(positive = w1, negative = w2, topn = 10)

In [None]:
# similarity between two different words
modelDS.wv.similarity(w1 = "gold", w2 = "ore")

In [None]:
# similarity between two identical words
modelDS.wv.similarity(w1 = "gold", w2 = "gold")

In [None]:
# similarity between two unrelated words
modelDS.wv.similarity(w1 = "gold", w2 = "rock")

In [None]:
# Which one is the odd one out in this list?
modelDS.wv.doesnt_match(["gold", "rock", "copper"])

### Comparing our domain-specific embeddings with embeddings trained on a different dataset (news articles).
- What are the nearest terms to 'commodity', 'ore', 'rock', etc.?

Unfortunately, due to their size, pretrained word embedding models have been omitted from this comparison, so the cells in this section compare the hotel review embeddings (`model`) with the GSWA embeddings (`modelDS`) instead. The commented-out code in the next cell shows how to load the Google News word2vec embedding model. This model is trained on 100 billion words and is 1.6GB in size (massive!).

In [None]:
# Load Google's pre-trained Word2Vec model.
# 1.6GB trained over 100B words; dimension of 300. Cannot use in this notebook due to massive size and time required to load.

#import gensim.downloader as api
#wv = api.load('word2vec-google-news-300')

In [None]:
# Define function to compare top-n similarities for a given word between two embedding models.
def compare_words(word, topn, model1, model2):
    similarWordsModel1 = model1.wv.most_similar(positive=word, topn=topn)
    similarWordsModel2 = model2.wv.most_similar(positive=word,topn=topn)

    print(f'Top {topn} words similar to {word}\n(format: n | model 1 | model 2 )\n')
    for n in range(topn):
        print(f'{n+1} |{similarWordsModel1[n][0]:<15} | {similarWordsModel2[n][0]:<15}')

In [None]:
# Looking at the word 'commodity'
compare_words(word = 'commodity', topn = 5, model1 = model, model2 = modelDS)

In [None]:
# Looking at the word 'ore'
compare_words(word = 'ore', topn = 5, model1 = model, model2 = modelDS)

In [None]:
# Looking at the word 'rock'
compare_words(word = 'rock', topn = 5, model1 = model, model2 = modelDS)

### Visualising the domain-specific word vectors in 2D space

In [None]:
def display_closestwords_tsnescatterplot(model, word):
    
    # Start with an empty array whose width matches the embedding dimension
    arr = np.empty((0, model.vector_size), dtype='f')
    word_labels = [word]

    # get close words
    close_words = model.similar_by_word(word)
    
    # add the vector for each of the closest words to the array
    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    # find tsne coords for 2 dimensions
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    # display scatter plot
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Pad the axis limits slightly so the outermost points are not clipped
    plt.xlim(x_coords.min()-0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()-0.00005, y_coords.max()+0.00005)
    plt.title(f'Words closest to: {word}')
    plt.show()

In [None]:
display_closestwords_tsnescatterplot(model = modelDS.wv, word = "gold")

#### Visualising all of the words in the vocabulary

In [None]:
tsne = TSNE(n_components=2)
print(type(modelDS.wv.vocab))
# Look up the vector for every word in the vocabulary via the KeyedVectors object
X = modelDS.wv[modelDS.wv.vocab]
# Shape of our model before t-SNE
print(f'Shape of model before t-SNE: {X.shape}')

In [None]:
# Fitting subset of data to t-SNE
points = 2500
X_limited = X[:points]
X_tsne = tsne.fit_transform(X_limited)

# Uncomment line below to fit entire model to t-SNE
# X_tsne = tsne.fit_transform(X)

plt.rcParams['figure.figsize'] = [10, 10]
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()

#### Interactive Visualisation

Refer to https://www.datascience.com/resources/notebooks/word-embeddings-in-python for more detail, and also for ideas on incorporating POS and bigrams into word2vec training.

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import push_notebook, output_notebook
from bokeh.models import ColumnDataSource, LabelSet
import pandas as pd

In [None]:
def interactive_tsne(text_labels, tsne_array):
    '''makes an interactive scatter plot with text labels for each point'''

    # Define a dataframe to be used by bokeh context
    bokeh_df = pd.DataFrame(tsne_array, text_labels, columns=['x','y'])
    bokeh_df['text_labels'] = bokeh_df.index

    # interactive controls to include to the plot
    TOOLS="hover, zoom_in, zoom_out, box_zoom, undo, redo, reset, box_select"

    p = figure(tools=TOOLS, plot_width=700, plot_height=700)

    # define data source for the plot
    source = ColumnDataSource(bokeh_df)

    # scatter plot
    p.scatter('x', 'y', source=source, fill_alpha=0.6,
              fill_color="#8724B5",
              line_color=None)

    # text labels
    labels = LabelSet(x='x', y='y', text='text_labels', y_offset=8,
                      text_font_size="8pt", text_color="#555555",
                      source=source, text_align='center')

    p.add_layout(labels)

    # show plot inline
    output_notebook()
    show(p)

Launch interactive visualisation of word embeddings

In [None]:
interactive_tsne(list(modelDS.wv.vocab.keys())[:points], X_tsne)