
# Self-Supervised Learning
Self-supervised learning is a type of supervised learning in which the supervision comes from the input data itself rather than from externally added labels. A common example is `language modeling`, in which we train a model to predict a hidden word from the words around it. Self-supervised learning is an important concept for two reasons:
1. Essentially any data can be converted into self-supervised data by masking or corrupting part of the input and using the rest of the data to predict what is missing.
2. By pretraining models on self-supervised tasks we can often dramatically reduce the amount of labeled training data needed to get good performance on supervised tasks. 

There are good reasons to suspect that much of human knowledge comes from self-supervised learning. After all, human learning occurs even in the absence of direct supervision, and there is little doubt the brain is constantly anticipating, imagining, and filling in missing information (for example, each of your eyes has a blind spot where the optic nerve passes through the retina, and I bet you never notice it).

# Word Embeddings
Word embeddings (i.e. dense vector representations of words) are currently one of the most popular uses of self-supervised learning (although they can be generated with direct supervision as well).  One of the motivations for creating word embeddings is to create word representations that capture the shared meaning between words, so that our NLP systems can recognize similarities between sentences even when no words are directly shared. Word embeddings are typically generated using some variation of the `language modeling` objective. One of the simplest and most efficient algorithms for training these is `word2vec`. There are several variants, but each essentially amounts to the following:
1. sample words
2. sample word contexts (surrounding words)
3. predict one from the other
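
The three steps above can be sketched in plain Python. This is only an illustration of how (target, context) training pairs might be formed in the skip-gram style; it is not gensim's implementation, and the function name `skipgram_pairs` and the example sentence are made up for this sketch.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        # the context is every word up to `window` positions left or right
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

example = ['ee', 'twisted', 'left', 'ankle']
pairs = skipgram_pairs(example, window=1)
print(pairs)
```

A model is then trained to predict the second element of each pair from the first (or, in other variants, the reverse).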

We will demonstrate how to train these on our MSHA dataset using the `gensim` library. If you don't have it already you can install it from the Anaconda command prompt with `pip install gensim`.

# Load the Data

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer

import pandas as pd

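# read in and separate the training and validation data
df = pd.read_excel(r'Data/msha_2003-2018.xlsx')
# assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to df
df['NARRATIVE'] = df['NARRATIVE'].fillna('')
df['INJ_BODY_PART'] = df['INJ_BODY_PART'].fillna('')
# note: rows from 2017 fall into neither split
df_train = df[df['YEAR'] < 2017].copy()
df_valid = df[df['YEAR'] == 2018].copy()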
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

Using TensorFlow backend.


training rows: 163352
validation rows: 6778


# Step 1: Tokenize the Data

Gensim expects text to be tokenized (broken into individual words) before it is fed into the word2vec algorithm. We will do this with spaCy as follows.

In [0]:
import spacy
from spacy.lang.en import English

# English() is a blank pipeline that contains only the tokenizer, so it's fast
nlp = English()

def tokenize(text):
  return [t.text.lower() for t in nlp(text)]  

In [0]:
df_train['TOKENS'] = df_train['NARRATIVE'].apply(tokenize)
df_train['TOKENS'].head()

0    [using, drill, press, in, shop, to, drill, hol...
1    [railroad, switch, was, iced, over, ., ee, jum...
2    [ee, was, lifting, 5, gallon, buckets, of, med...
3    [employee, was, working, on, sifter, blower, ....
4    [employee, was, stepping, over, berm, to, get,...
Name: TOKENS, dtype: object

# Step 2: Fit the Word2Vec Model
This is essentially a simple neural network where the inputs are target words and the outputs are context words (or vice versa, depending on the word2vec variant). The word embeddings are the activation formed by the hidden layer of the neural network when that word is presented as an input. The following are the main parameters of the model:
1. `sentences` - a list of the tokenized texts we will use
2. `size` - the dimensionality of the word embedding (100 means each word is mapped to a 100 element vector). 300 seems to be the most popular choice for embeddings trained on massive datasets. Note that gensim 4.0 and later rename this parameter to `vector_size`.
3. `window` - the distance in number of words considered "in context" for a given target word. Larger windows (>5) tend to produce embeddings that are more reflective of word meaning. Smaller windows tend to produce embeddings that are more reflective of word syntax (how the word is used in a sentence). For example, "good" and "bad" have opposite meanings but similar syntax, i.e. you can replace one with the other in most sentences without violating any grammatical rules.
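
The statement that a word's embedding is the hidden-layer activation for that word can be checked on a toy example. The sketch below assumes a one-hot input and a hidden layer with no bias or nonlinearity; the vocabulary and weight values are invented for illustration and are not taken from the trained model.

```python
vocab = ['knee', 'ankle', 'drill']      # toy vocabulary (made up)
W_in = [[0.2, -0.1, 0.7, 0.0],          # input-to-hidden weights,
        [0.3, -0.2, 0.6, 0.1],          # one row per vocabulary word,
        [-0.5, 0.9, 0.0, 0.4]]          # 4 hidden units = 4-dim embeddings

def hidden_activation(one_hot, weights):
    """Multiply a one-hot input vector by the input-to-hidden weight matrix."""
    n_dims = len(weights[0])
    return [sum(x * row[d] for x, row in zip(one_hot, weights))
            for d in range(n_dims)]

idx = vocab.index('ankle')
one_hot = [1.0 if i == idx else 0.0 for i in range(len(vocab))]
hidden = hidden_activation(one_hot, W_in)

# the activation is simply the weight-matrix row for 'ankle'
print(hidden == W_in[idx])
```

This is why, in practice, the embedding matrix is used as a lookup table rather than being multiplied by one-hot vectors.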


In [0]:
from gensim.models import Word2Vec

w2vmodel = Word2Vec(sentences=df_train['TOKENS'], size=100, window=15)

Word vectors are designed to capture similarity in meaning between words.

# Step 3: Use the Embeddings

Gensim makes it very easy to use the embeddings for a variety of purposes. We illustrate some below:

### Find similar words

In [0]:
w2vmodel.wv.most_similar('knee')

[('ankle', 0.8319008946418762),
 ('leg', 0.6941453218460083),
 ('hip', 0.6907269358634949),
 ('elbow', 0.6734733581542969),
 ('shoulder', 0.6668279767036438),
 ('foot', 0.656676173210144),
 ('shin', 0.6550939083099365),
 ('calf', 0.6223568916320801),
 ('thigh', 0.5957974195480347),
 ('heel', 0.5956833362579346)]

### Calculate similarity between words

In [0]:
w2vmodel.wv.similarity('contusions', 'bruises')

0.9211543

In [0]:
w2vmodel.wv.similarity('employee', 'bruises')

-0.15032719

### Similarity between sentences

In [0]:
s1 = tokenize('ee twisted ankle')
s2 = tokenize('employee sprained ankle')
w2vmodel.wv.n_similarity(s1, s2)

0.8895315
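
Gensim documents `n_similarity` as the cosine similarity between two sets of words. A minimal sketch of that idea, assuming each set is summarized by the mean of its word vectors, is shown below. The 3-dimensional vectors and the helper names (`mean_vector`, `cosine`, `sentence_similarity`) are made up for illustration and do not come from the trained model.

```python
import math

# toy vectors, invented for illustration only
toy_vectors = {
    'ee':       [0.9, 0.1, 0.0],
    'employee': [1.0, 0.0, 0.1],
    'twisted':  [0.0, 0.8, 0.3],
    'sprained': [0.1, 0.9, 0.2],
    'ankle':    [0.0, 0.2, 1.0],
}

def mean_vector(tokens):
    """Average the vectors of all tokens, dimension by dimension."""
    vecs = [toy_vectors[t] for t in tokens]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sentence_similarity(tokens1, tokens2):
    return cosine(mean_vector(tokens1), mean_vector(tokens2))

sim = sentence_similarity(['ee', 'twisted', 'ankle'],
                          ['employee', 'sprained', 'ankle'])
print(sim)
```

The same `cosine` function applied to two single word vectors corresponds to the word-to-word `similarity` call shown earlier.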

### Compute analogies

In [0]:
w2vmodel.wv.most_similar_cosmul(positive=['leg', 'shoulder'], negative=['arm'])

[('knee', 0.9422257542610168),
 ('hip', 0.9205620288848877),
 ('ankle', 0.9082322120666504),
 ('calf', 0.8602388501167297),
 ('buttock', 0.8295111656188965),
 ('thigh', 0.8266130685806274),
 ('foot', 0.820786714553833),
 ('shin', 0.8100503087043762),
 ('heel', 0.8018286228179932),
 ('kneecap', 0.7879019975662231)]

In [0]:
w2vmodel.wv.most_similar_cosmul(positive=['rock', 'bolts'], negative=['bolt'])

[('slate', 0.8299375176429749),
 ('thick', 0.8163900971412659),
 ('rocks', 0.8160615563392639),
 ('shale', 0.8149546980857849),
 ('cribs', 0.8007421493530273),
 ('timbers', 0.7989218831062317),
 ('straps', 0.7926580905914307),
 ('draw', 0.7889403700828552),
 ('tons', 0.7791443467140198),
 ('cemented', 0.7789480686187744)]
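
`most_similar_cosmul` ranks candidates with the multiplicative objective proposed by Levy and Goldberg: similarities to the positive words are multiplied together and divided by the similarities to the negative words. The sketch below illustrates that scoring rule as we understand it, with cosines shifted into the range [0, 1]. The toy vectors and the helper names (`shifted`, `cosmul_rank`) are invented for illustration, and gensim's actual implementation is vectorized and may differ in detail.

```python
import math

# toy vectors, invented for illustration only
toy = {
    'leg':      [1.0, 0.0, 0.2],
    'arm':      [0.0, 1.0, 0.2],
    'shoulder': [0.0, 1.0, 1.0],
    'hip':      [1.0, 0.0, 1.0],
    'elbow':    [0.0, 1.0, 0.6],
    'drill':    [0.2, 0.2, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    # map cosine from [-1, 1] to [0, 1] so products and ratios stay non-negative
    return (1.0 + cosine(u, v)) / 2.0

def cosmul_rank(positive, negative, eps=1e-6):
    scores = {}
    for word, vec in toy.items():
        if word in positive or word in negative:
            continue  # query words are left out of the results
        num = math.prod(shifted(vec, toy[p]) for p in positive)
        den = math.prod(shifted(vec, toy[n]) for n in negative) + eps
        scores[word] = num / den
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = cosmul_rank(positive=['leg', 'shoulder'], negative=['arm'])
print(ranking)
```

In this toy space the top-ranked word is the one that is close to both `leg` and `shoulder` while being far from `arm`.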

# Weaknesses of Word2Vec

One weakness of the original word2vec algorithm is that it has no way of handling words that were not in the training data. Take, for example, the following:

In [0]:
try:
  w2vmodel.wv.most_similar('kneee')
except KeyError as e:
  print(e)

"word 'kneee' not in vocabulary"


One solution to this is FastText, which is also available in Gensim.

# FastText with Gensim

FastText is an extension of word2vec which seeks to resolve out-of-vocabulary problems by breaking words down into smaller pieces, learning embeddings for these, and then combining these pieces to produce embeddings for whole words. We accomplish this in almost exactly the same way using gensim.
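
To make "smaller pieces" concrete, the sketch below lists the character n-grams of a word, using `<` and `>` as word-boundary markers. The range of 3 to 6 characters matches the defaults of gensim's `min_n` and `max_n` parameters; the function name `char_ngrams` is made up for this illustration. Real FastText additionally hashes the n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams('knee', min_n=3, max_n=4))

# the misspelling 'kneee' shares many pieces with 'knee',
# so it can still be given a sensible embedding
shared = set(char_ngrams('knee')) & set(char_ngrams('kneee'))
print(sorted(shared))
```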

Parameters are as follows:
* sentences - iterable of tokenized texts
* size - dimensionality of learned embeddings (renamed `vector_size` in gensim 4.0 and later)
* window - distance from target word considered same-context
* min_count - the minimum number of times a word must occur to be included in our vocabulary

In [0]:
from gensim.models import FastText

ftmodel = FastText(sentences=df_train['TOKENS'], 
                   size=100, window=15, min_count=5)

In [0]:
ftmodel.wv.most_similar('knee')

[('rt.knee', 0.9715948104858398),
 ('kneecap', 0.9335264563560486),
 ('kneel', 0.8777055740356445),
 ('ankle', 0.8228880763053894),
 ('knees', 0.7768344283103943),
 ('ankles', 0.7281184196472168),
 ('knew', 0.7249534130096436),
 ('elbow', 0.6834861636161804),
 ('leg', 0.6708968877792358),
 ('hip', 0.6534525156021118)]

In [0]:
ftmodel.wv.most_similar('kneee')

[('rt.knee', 0.8781274557113647),
 ('knee', 0.8767409920692444),
 ('kneecap', 0.8357614278793335),
 ('kneel', 0.7971817255020142),
 ('knees', 0.6991230845451355),
 ('knew', 0.6636230945587158),
 ('ankle', 0.6466142535209656),
 ('ankles', 0.5612426996231079),
 ('elbow', 0.534054160118103),
 ('kneeled', 0.5328353047370911)]

# Pre-Trained Embeddings

Word embeddings are only as good as the data they are trained on, and in this regard our MSHA data has some strengths and weaknesses. On the plus side, if we're working on an MSHA-related task, it is reflective of the data we're working with: lots of mining-related injury words occur in our data. On the other hand, it is a tiny slice of the text available for pretraining, and we can learn a lot more about language from the rest. One solution is to use pre-trained embeddings, that is, embeddings that have already been trained on massive amounts of data, like all of Wikipedia. We'll demonstrate this approach using publicly available embeddings.

For more information about the embeddings available for download see: [Gensim Data API Documentation](https://github.com/RaRe-Technologies/gensim-data).

In [0]:
import gensim.downloader as api

# download the pretrained embeddings
#glove_vectors = api.load("glove-wiki-gigaword-100")
#cn_vectors = api.load("conceptnet-numberbatch-17-06-300")
pre_ft_vectors = api.load('fasttext-wiki-news-subwords-300')


In [0]:
pre_ft_vectors.most_similar('knee')

[('ankle', 0.8057968616485596),
 ('elbow', 0.7759488224983215),
 ('knees', 0.7516900897026062),
 ('kneecap', 0.7214105129241943),
 ('thigh', 0.7110310196876526),
 ('knee-', 0.6905907392501831),
 ('groin', 0.6795806884765625),
 ('bended', 0.6676689982414246),
 ('knee-cap', 0.6633883714675903),
 ('shoulder', 0.6611182689666748)]

# References
* [Gensim Word2Vec Documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)
* [Gensim FastText Documentation](https://radimrehurek.com/gensim/models/fasttext.html)
* [Gensim KeyedVectors Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors)
* [Gensim Data Download API](https://github.com/RaRe-Technologies/gensim-data) - describes which pretrained embeddings are available for download
* [Word Mover Distance Paper](http://proceedings.mlr.press/v37/kusnerb15.pdf) - describes an effective measure of similarity between documents
