# Session 3, part I

## Vector semantics and embeddings

<img src="images/_0.jpg" width="50%">

Learning goals
============


- Appreciating the anatomy of word2vec
- Assigning a role to new forms of embeddings
- Understanding the various steps to train ad hoc embeddings 

Week 2: Key tenets
================

+ modern NLP ~ Distributional Hypothesis + DL
+ words can be represented as vectors (Lenci 2018)
+ desiderata: by observing and analyzing the same word in multiple contexts, we aim at:
  - building a *dense, real-valued* vector for each word
  - ... chosen so that it is similar to the vectors of words that appear in similar contexts

Word vectors as dense, real-valued vectors
===================================

Ultimately, by observing and analyzing the same word in multiple contexts, we aim to build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts.
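One common way to make "similar vectors" concrete is cosine similarity. The sketch below is a minimal illustration, not part of the original slides: the 4-dimensional vectors in `toy_vectors` are invented for the example and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional vectors, invented for illustration only.
toy_vectors = {
    "banana": [0.9, 0.1, 0.0, 0.3],
    "mango": [0.8, 0.2, 0.1, 0.4],
    "engine": [0.0, 0.9, 0.8, 0.1],
}

sim_fruit = cosine_similarity(toy_vectors["banana"], toy_vectors["mango"])
sim_mixed = cosine_similarity(toy_vectors["banana"], toy_vectors["engine"])

print(f"banana vs mango:  {sim_fruit:.3f}")
print(f"banana vs engine: {sim_mixed:.3f}")
```

With these made-up values, the two fruit words score much higher than the fruit/engine pair, which is the behavior we would like trained vectors to show.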

Below is a portion of the [vector](https://spacy.io/usage/vectors-similarity) associated with the word 'banana'.

```
array([2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
       3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
      -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
       5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
      -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
       1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
       5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
       2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
       1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
       # ... and so on ...
       3.66849989e-01,   2.52470002e-03,  -6.40089989e-01,
      -2.97650009e-01,   7.89430022e-01,   3.31680000e-01,
      -1.19659996e+00,  -4.71559986e-02,   5.31750023e-01], dtype=float32)
```

How to retrieve word vectors from a pre-trained model of the language?
=========================================================

```python
import spacy

# The large English pipeline ships with static word vectors;
# the small one (en_core_web_sm) does not.
nlp = spacy.load('en_core_web_lg')

# Look up the vector stored for the lexeme 'banana'
nlp.vocab['banana'].vector
```

```
array([ 2.0228e-01, -7.6618e-02,  3.7032e-01,  3.2845e-02, -4.1957e-01,
        7.2069e-02, -3.7476e-01,  5.7460e-02, -1.2401e-02,  5.2949e-01,
       -5.2380e-01, -1.9771e-01, -3.4147e-01,  5.3317e-01, -2.5331e-02,
        1.7380e-01,  1.6772e-01,  8.3984e-01,  5.5107e-02,  1.0547e-01,
        3.7872e-01,  2.4275e-01,  1.4745e-02,  5.5951e-01,  1.2521e-01,
       -6.7596e-01,  3.5842e-01, -4.0028e-02,  9.5949e-02, -5.0690e-01,
       -8.5318e-02,  1.7980e-01,  3.3867e-01,  1.3230e-01,  3.1021e-01,
        2.1878e-01,  1.6853e-01,  1.9874e-01, -5.7385e-01, -1.0649e-01,
        2.6669e-01,  1.2838e-01, -1.2803e-01, -1.3284e-01,  1.2657e-01,
        8.6723e-01,  9.6721e-02,  4.8306e-01,  2.1271e-01, -5.4990e-02,
       -8.2425e-02,  2.2408e-01,  2.3975e-01, -6.2260e-02,  6.2194e-01,
       -5.9900e-01,  4.3201e-01,  2.8143e-01,  3.3842e-02, -4.8815e-01,
       -2.1359e-01,  2.7401e-01,  2.4095e-01,  4.5950e-01, -1.8605e-01,
       -1.0497e+00, -9.7305e-02, -1.8908e-01, -7.0929e-01,
        # ... and so on ...
```
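Once vectors can be looked up by word, a natural next step is to rank words by how close they are to a query word. The following is a minimal sketch under stated assumptions: `lookup` is a hypothetical stand-in dictionary with invented 3-dimensional values, and `nearest_neighbors` is an illustrative helper, not a spaCy function.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, k=2):
    """Rank the other words in `vectors` by cosine similarity to `query`."""
    scored = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical word-to-vector lookup with invented values.
lookup = {
    "banana": [0.9, 0.1, 0.2],
    "apple": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.7],
    "truck": [0.2, 0.8, 0.9],
}

for word, score in nearest_neighbors("banana", lookup):
    print(f"{word}: {score:.3f}")
```

With the pipeline loaded as above, the same function could presumably be fed a dictionary built from `nlp.vocab[w].vector` for a handful of words `w`; that wiring is left out here so the sketch runs without downloading a model.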

Why dense, real-valued vectors rather than sparse vectors?
=================================================

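+ convenience: dense vectors typically have a few hundred dimensions instead of one per vocabulary word, so an ML model has far fewer features (and weights) to deal with
+ they may do better at capturing synonymy/semantic similarity:
  - 'car' and 'automobile' are synonyms, but in a sparse representation they are distinct, unrelated dimensions
  - so a word with 'car' as a neighbor and a word with 'automobile' as a neighbor should be similar, but their sparse vectors do not show it
+ in practice, they work better than sparse vectors on NLP tasks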
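The 'car'/'automobile' point can be checked on a toy example. This is a sketch under loud assumptions: the four-word vocabulary and the 2-dimensional `dense` embeddings are invented, and averaging neighbor embeddings is only an illustrative shortcut, not how word2vec or any other model is trained.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Sparse view: one dimension per vocabulary word (hypothetical tiny vocabulary).
vocab = ["car", "automobile", "drive", "eat"]


def sparse_context_vector(neighbors):
    """Count how often each vocabulary word occurs among the neighbors."""
    return [neighbors.count(word) for word in vocab]


# Dense view: invented 2-d embeddings in which the synonyms lie close together.
dense = {
    "car": [0.90, 0.10],
    "automobile": [0.85, 0.15],
    "drive": [0.70, 0.30],
    "eat": [0.10, 0.90],
}


def dense_context_vector(neighbors):
    """Average the dense embeddings of the neighbors (illustrative shortcut)."""
    dims = len(dense[neighbors[0]])
    return [sum(dense[w][i] for w in neighbors) / len(neighbors) for i in range(dims)]


# Word A was seen next to 'car', word B next to 'automobile'.
sparse_sim = cosine(sparse_context_vector(["car"]), sparse_context_vector(["automobile"]))
dense_sim = cosine(dense_context_vector(["car"]), dense_context_vector(["automobile"]))

print(f"sparse similarity: {sparse_sim:.3f}")
print(f"dense similarity:  {dense_sim:.3f}")
```

In the sparse view the two context vectors share no non-zero dimension, so their similarity is zero; in the dense view the closeness of the two synonyms carries over to the words that occur next to them.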

Note: The material in the following slides is based on chapter 6 of Jurafsky and Martin's book 'Speech and Language Processing'.

Examples of word-vectors and embeddings
====================================

### [$\texttt{word2vec}$](https://code.google.com/archive/p/word2vec/) ― the pioneer

<img src="images/_1.png" width="15%">

### [**fastText**](http://www.fasttext.cc/) ― it takes word parts (character n-grams) into account

<img src="images/_2.png" width="6%">
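To give an idea of what "word parts" means, here is a minimal sketch of character n-gram extraction with `<` and `>` as word-boundary markers. The function name `char_ngrams` and its parameters are hypothetical, and the 3-to-6 default range is an assumption taken from the fastText paper; the real library also keeps the whole word and hashes the n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word vector can then be composed from the vectors of its n-grams, which is what lets such a model produce a vector for a word it never saw during training.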

### [**GloVe**](http://nlp.stanford.edu/projects/glove/) ― it is built from word co-occurrence statistics gathered over the whole text corpus

<img src="images/_3.png" width="22%">
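The co-occurrence statistics that GloVe starts from can be sketched as a simple windowed count. This is only the counting step, under stated assumptions: `cooccurrence_counts` is a hypothetical helper, the sentence is a toy corpus, and the actual GloVe pipeline additionally down-weights distant pairs and then fits vectors to the logarithm of these counts.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for illustration.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)

for pair, count in sorted(counts.items()):
    print(pair, count)
```

Because every pair is counted from both sides, the resulting table is symmetric: `counts[(a, b)]` equals `counts[(b, a)]`.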