# Session 2, part IV

- Representing words and meanings
- Language modeling

<img src="images/_99.jpg" width="100%">

### Is there a statistical or machine learning approach that might work in place of an annotated ontology or a pattern-matching approach?

### Provided we have access to a reasonably large (and diverse) corpus of text, how can we represent the duality between words and meanings?

Statistical framework route
======================

+ Traditional NLP
    - Bag-of-Words
    - One-hot encodings

+ Modern NLP
    - Embeddings (e.g., a vector space generated via $\texttt{word2vec}$) 

<img src="images/_8.jpg" width="100%">

Bag-of-Words (BoW)
===========

Given a text corpus $D$, a BoW workflow involves the following steps:

1. $\forall d \in D$ (e.g., documents, statements, sentences, and even single words), get the tokens $\Phi(d)$
2. $\forall s \in S$ (i.e., the unique lexical items) and $\forall d \in D$, get the cardinality $\vert s \vert$, i.e., the number of times $s$ occurs in $d$

The set of all vectors a machine might create this way is called a vector space.

Such a vector space allows us to use linear algebra (and libraries such as NumPy, SciPy, or Numba) to manipulate lexical items and compute things like distances and statistics involving natural language data.
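The two steps can be sketched with `Counter` and NumPy. This is only an illustration: the three toy documents and the helper name `bow_vector` are made up for this example, and Euclidean distance is just one of several distances one could compute.

```python
from collections import Counter

import numpy as np

# toy corpus D (hypothetical documents, chosen for illustration)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# S: the unique lexical items across the corpus, sorted for a stable order
vocab = sorted({token for doc in docs for token in doc.split()})


def bow_vector(doc, vocab):
    """Count how many times each vocabulary item occurs in doc."""
    counts = Counter(doc.split())
    return np.array([counts[term] for term in vocab])


# one row per document, one column per vocabulary item
X = np.vstack([bow_vector(doc, vocab) for doc in docs])

# Euclidean distances between document vectors
dist_01 = np.linalg.norm(X[0] - X[1])
dist_02 = np.linalg.norm(X[0] - X[2])
print(X.shape, dist_01, dist_02)
```

The first two documents share most of their tokens, so their vectors end up closer to each other than to the third one.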

<img src="images/_12.png" width="100%">

Token sorting tray (source is Lane, Howard & Hapke 2019)

What can we do with BoW data?
==========================

For example, we can address search queries such as: *"What is the combination of words most likely to follow a particular bag of words?"* Or, if a user enters a sequence of words, *"What is the closest bag of words in our database to a bag-of-words vector provided by the user?"*
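The second kind of query can be sketched as a nearest-neighbour search over BoW vectors. The "database" and the query below are hypothetical, and cosine similarity is an assumption made here; other similarity measures would work too.

```python
from collections import Counter

import numpy as np

# hypothetical database of stored texts and a user query
database = [
    "success is not final",
    "failure is not fatal",
    "courage to continue counts",
]
query = "is failure fatal"

# shared vocabulary over the database and the query
vocab = sorted({t for text in database + [query] for t in text.split()})


def bow_vector(text):
    """BoW count vector of text over the shared vocabulary."""
    counts = Counter(text.split())
    return np.array([counts[t] for t in vocab], dtype=float)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


q = bow_vector(query)
scores = [cosine(q, bow_vector(text)) for text in database]
best = int(np.argmax(scores))
print(database[best])
```

Note that the match is driven purely by shared tokens: a stored text with no token in common with the query scores zero, whatever it means.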

General take-home on BoW:

+ a BoW approach can generate meaningful responses to queries
+ in a BoW approach, humans do not pass any rules to machines (remember pattern-matching?)
+ BoW leverages distributional data to capture the semantic similarity of lexical items

Warning: a BoW approach doesn't say anything about the specific meanings of lexical items.

BoW in action
============

In [1]:
# let's import Counter, a special kind of
# dictionary that computes the cardinality of
# elements
from collections import Counter

# sample sentence
sentence = """
Success is not final;
failure is not fatal:
It is the courage to continue that counts. 
-Winston S. Churchill
"""

# tokenization
tokens = sentence.split()

# BoW
bow = Counter(tokens)

# print
from pprint import pprint
pprint(bow)

Counter({'is': 3,
         'not': 2,
         'Success': 1,
         'final;': 1,
         'failure': 1,
         'fatal:': 1,
         'It': 1,
         'the': 1,
         'courage': 1,
         'to': 1,
         'continue': 1,
         'that': 1,
         'counts.': 1,
         '-Winston': 1,
         'S.': 1,
         'Churchill': 1})
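
The counts above keep punctuation and capitalization attached to the tokens ('final;', 'fatal:', 'Success'). A possible refinement, sketched under the assumption that lowercasing and a simple regular expression are good enough for this sentence, is to normalize the text before counting:

```python
import re
from collections import Counter

sentence = """
Success is not final;
failure is not fatal:
It is the courage to continue that counts.
-Winston S. Churchill
"""

# lowercase the text, then keep runs of letters (and apostrophes) as tokens
tokens = re.findall(r"[a-z']+", sentence.lower())

# BoW over the normalized tokens
bow = Counter(tokens)
print(bow)
```

Real tokenizers handle many more cases (numbers, hyphens, abbreviations), so this pattern is only a starting point.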


One-hot vectors 
=============

One of the main limitations of the BoW approach is the proliferation of unique vectors to compare and contrast.

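One-hot vectors (a form of discrete representation of lexical items) simplify matters by recording only whether a word is or is not present at a given position in a piece of text. Note that each vector is still as long as the vocabulary, so dimensionality grows with vocabulary size.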

In [2]:
# let's import numpy to manipulate the text
import numpy as np

# sample sentence
sentence = """
Success is not final; failure is not fatal:
It is the courage to continue that counts. 
-Winston S. Churchill
"""

# tokenization
tokens = sentence.split()

# vocabulary (unique words, sorted so that the column order is stable)
vocab = sorted(set(tokens))

# count of tokens
num_tokens = len(tokens)

# size of vocabulary
vocab_size = len(vocab)

# one-hot vector representation
# -- empty np array
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
# -- fill-in values
for i, word in enumerate(tokens):
    onehot_vectors[i, vocab.index(word)] = 1
    
# print np array
pprint(onehot_vectors)

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0,

In [3]:
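# some embellishments: show the matrix as a table
# with blank cells in place of zeros
import pandas as pd
df = pd.DataFrame(onehot_vectors, columns=vocab)
# cast to object so that cells can hold strings as well as integers
df = df.astype(object)
df[df == 0] = ''
pprint(df)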

   -Winston Churchill It S. Success continue counts. courage failure fatal:  \
0                                 1                                           
1                                                                             
2                                                                             
3                                                                             
4                                                                  1          
5                                                                             
6                                                                             
7                                                                         1   
8                      1                                                      
9                                                                             
10                                                                            
11                                                  

Limitations of one-hot encodings
===========================

There is no natural notion of similarity for one-hot vectors! (Manning, 2019)

**Example 1:** the vectors associated with 'good' and 'fine' are orthogonal:

```
good = [0, 0, 1, 0, 0, 0]

fine = [0, 0, 0, 0, 1, 0]
```
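
Orthogonality can be checked numerically. The sketch below uses the two vectors from Example 1 and then, as an extra illustration, compares all pairs of distinct one-hot vectors of length 6:

```python
import numpy as np

good = np.array([0, 0, 1, 0, 0, 0])
fine = np.array([0, 0, 0, 0, 1, 0])

dot = int(good @ fine)                     # a zero dot product means orthogonal
dist = float(np.linalg.norm(good - fine))  # Euclidean distance

# distances between all pairs of distinct one-hot vectors of length 6
eye = np.eye(6, dtype=int)
pairwise = {
    round(float(np.linalg.norm(eye[i] - eye[j])), 12)
    for i in range(6)
    for j in range(6)
    if i != j
}
print(dot, dist, pairwise)
```

Every pair of distinct words is equally far apart, so the representation cannot say that 'good' is closer to 'fine' than to any other word.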

**Example 2:** 'greasy spoon' and 'British cafe' express the same category of eatery but the intersection of their one-hot vectors is empty.

```
greasy spoon = [[0, 1, 0, 0], [0, 0, 1, 0]]

British cafe = [[1, 0, 0, 0], [0, 0, 0, 1]]
```

Shall we try to use WordNet’s list of synonyms to get similarity? Probably a bad idea: WordNet has severe limitations.

Modern NLP: Distributional Hypothesis + DL
====================================

From DH to word vectors
=====================

According to the Distributional Hypothesis, the meaning of a focal word $\omega$ is a function of its linguistic context, i.e., the lexical items in the neighborhood of the focal word.

Considering the many contexts in which $\omega$ occurs (e.g., $\omega$ = 'regulation') then helps to build an accurate vector representation of $\omega$.

Sample sentences containing the word 'regulation':

```
... to encourage and implement the adoption of common REGULATIONs for all forms of motor sports and series across the ...

countries should adhere to the cost-benefit paradigm of REGULATION, forcing bureaucrats to outline all the benefits of ...

Agencies create REGULATIONs (also known as "rules") under the authority of Congress to help ...

```
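
The "neighborhood" of a focal word can be made concrete with a fixed-size context window. In the sketch below, the helper name `context_windows`, the window size of 2, and the simplified, lowercased version of the third sample sentence are all assumptions made for illustration:

```python
def context_windows(tokens, focal, window=2):
    """Yield the tokens within `window` positions of each occurrence of `focal`."""
    for i, tok in enumerate(tokens):
        if tok == focal:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield left + right


# simplified, lowercased version of one of the sample sentences
text = "agencies create regulations under the authority of congress"
tokens = text.split()

contexts = list(context_windows(tokens, "regulations", window=2))
print(contexts)
```

Collecting such windows over a large corpus gives the raw material from which a vector for the focal word can be learned.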

Word vectors as dense, real valued vectors
===================================

Ultimately, by observing and analyzing the same word in multiple contexts, we aim to build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.

Below is a portion of the [vector](https://spacy.io/usage/vectors-similarity) associated with the word 'banana'.

```
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```
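
With dense vectors, similarity becomes measurable, for instance via cosine similarity. The 4-dimensional vectors below are made up for illustration and are not real embeddings (real ones typically have hundreds of dimensions):

```python
import numpy as np

# made-up vectors, for illustration only (not taken from any trained model)
vectors = {
    "banana": np.array([0.9, 0.1, 0.3, -0.2]),
    "apple": np.array([0.8, 0.2, 0.4, -0.1]),
    "regulation": np.array([-0.5, 0.9, -0.3, 0.6]),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_fruit = cosine(vectors["banana"], vectors["apple"])
sim_mixed = cosine(vectors["banana"], vectors["regulation"])
print(sim_fruit, sim_mixed)
```

Unlike one-hot vectors, dense vectors can place 'banana' closer to 'apple' than to 'regulation'.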

Overview of the $\texttt{word2vec}$ algorithm
================================

$\texttt{word2vec}$ (Mikolov et al. 2013) is a framework for learning word vectors.

Idea, given a corpus of text $D$:

+ each word $d$ is associated with a vector 
+ go through each position $k$ in the text, which has a center word $\omega$ and context words $\eta$
+ use the similarity of the word vectors for $\omega$ and $\eta$ to calculate the probability of $\eta$ given $\omega$ (or vice versa)
+ keep adjusting the word vectors to maximize this probability

Source is Manning 2019.
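
The steps above can be sketched in NumPy. This is a toy illustration, not the reference implementation: it assumes the skip-gram variant with a full softmax, two vectors per word (one used as center word, one as context word), a tiny made-up vocabulary, and a single gradient step. The names `U`, `V`, and `prob_context_given_center` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["agencies", "create", "regulations", "under", "authority"]
dim = 8

# V holds center-word vectors, U holds context-word vectors
V = rng.normal(scale=0.1, size=(len(vocab), dim))
U = rng.normal(scale=0.1, size=(len(vocab), dim))


def prob_context_given_center(center, U, V):
    """Softmax over dot products: P(eta | omega) for every candidate eta."""
    scores = U @ V[center]
    scores = scores - scores.max()  # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()


center = vocab.index("regulations")   # omega
context = vocab.index("create")       # eta, observed next to omega

p = prob_context_given_center(center, U, V)
p_before = p[context]

# gradients of log P(eta | omega) with respect to the vectors
one_hot = np.zeros(len(vocab))
one_hot[context] = 1.0
grad_V = U.T @ (one_hot - p)               # for the center vector
grad_U = np.outer(one_hot - p, V[center])  # for all context vectors

# one gradient-ascent step to increase the probability
lr = 0.5
V[center] += lr * grad_V
U += lr * grad_U

p_after = prob_context_given_center(center, U, V)[context]
print(p_before, p_after)
```

Repeating such steps over every position in the corpus is what "keep adjusting the word vectors" amounts to; practical implementations replace the full softmax with cheaper approximations.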

# Next week, we'll focus on $\texttt{word2vec}$ and word vectors only. 