# Vector Semantics and Word Embeddings

![](../figs/intro_nlp/vector/entelecheia_alphabet_letters.png)

## Vector Semantics and Word Embeddings

- Lexical semantics is the study of the meaning of words
- Distributional hypothesis: words that occur in similar contexts have similar meanings
- Sparse vectors: one-hot encoding
- Dense vectors: word embeddings

### What do words mean, and how do we represent that?

> `cassoulet`

Do we want to represent that ...

- "cassoulet" is a French dish?
- "cassoulet" contains meat and beans?
- "cassoulet" is a stew?

> `bar`

Do we want to represent that ...

- "bar" is a place where you can drink alcohol?
- "bar" is a long rod?
- "bar" is a verb meaning to block or prevent something?

### Different approaches to lexical semantics

NLP draws on two different approaches to lexical semantics:

- **Lexical semantics** (the lexicographic tradition):
  - Studies word meaning through hand-built lexical resources
  - Aims to capture the information represented in lexical entries in dictionaries
- **Distributional semantics**: 
  - The study of the meaning of words based on their distributional properties in large corpora
  - The distributional hypothesis: words that occur in similar contexts have similar meanings

#### Lexical semantics

- Uses resources such as `lexicons`, `thesauri`, and `ontologies` that capture explicit knowledge about word meanings.
- Assumes that words have `discrete word senses` that can be represented in a `lexicon`.
  - bank 1 = a financial institution
  - bank 2 = a river bank
- Can capture explicit knowledge about word meanings and relations, but cannot cover words or senses that are missing from the lexicon.
  - `dog` is a `canine` (lexicon)
  - `cars` have `wheels` (lexicon)
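
The idea of discrete word senses stored in a lexicon can be sketched with a plain dictionary. This is only an illustration: `TOY_LEXICON`, its entries, and `lookup_senses` are hypothetical names made up for this example, not part of any real lexical resource.

```python
# A hypothetical toy lexicon: each word maps to a list of discrete senses.
# The entries are illustrative only and are not taken from a real resource.
TOY_LEXICON = {
    "bank": [
        {"sense": "bank_1", "gloss": "a financial institution"},
        {"sense": "bank_2", "gloss": "a river bank"},
    ],
    "dog": [
        {"sense": "dog_1", "gloss": "a domesticated animal", "is_a": "canine"},
    ],
}


def lookup_senses(word):
    """Return the senses recorded for `word`, or an empty list if it is missing."""
    return TOY_LEXICON.get(word.lower(), [])


bank_senses = lookup_senses("bank")
unknown_senses = lookup_senses("cassoulet")

print(len(bank_senses))   # number of discrete senses listed for "bank"
print(unknown_senses)     # empty list: the word is not in the toy lexicon
```

The lookup for `cassoulet` returns nothing, which mirrors the limitation noted above: a lexicon only knows about the words and senses someone has entered into it.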

#### Distributional semantics

- Uses `large corpora of raw text` to learn the meaning of words from the contexts in which they occur.
- Maps words to `vector representations` that capture the `distributional properties` of the words in the corpus.
- Often uses neural networks to learn dense vector representations of words, `word embeddings`, from large corpora of raw text.
- Mapping each word to a single vector ignores the fact that a word can have multiple meanings or parts of speech.
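
As a minimal sketch of mapping words to vectors from raw text, the example below counts how often each word appears next to every other word in a tiny corpus. The corpus, the window size `WINDOW`, and the helper names `cooccurrence_counts` and `to_vector` are assumptions chosen for illustration. This is a simple count-based representation, not a learned embedding.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]

WINDOW = 1  # assumed context size: one word to the left and one to the right


def cooccurrence_counts(sentences, window):
    """Count, for each target word, the words that appear within `window` positions."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts


counts = cooccurrence_counts(corpus, WINDOW)
vocab = sorted({tok for sent in corpus for tok in sent})


def to_vector(word):
    """Lay out the context counts of `word` in vocabulary order."""
    return [counts[word][ctx] for ctx in vocab]


print(vocab)
print(to_vector("dog"))
print(to_vector("cat"))
```

Each word gets exactly one vector here, so a word with several senses would have all of its contexts merged into the same counts.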

### How do we represent words to capture word similarities?

- As `atomic symbols`
  - as in a traditional n-gram language model
  - or as explicit features in a machine learning model
  - this is equivalent to very high-dimensional one-hot vectors:
    - aardvark = [1,0,...,0], bear = [0,1,...,0], ..., zebra = [0,0,...,1]
    - any two distinct words are equally dissimilar: `height` and `tall` are as different as `aardvark` and `zebra`
- As very high-dimensional `sparse vectors`
  - to capture the distributional properties of words
- As low-dimensional `dense vectors`
  - word embeddings

### What should word representations capture?

- Vector representations of words were originally used to capture `lexical semantics` so that words with similar meanings would be represented by vectors that are close together in vector space.
- These representations may also capture some `morphological` and `syntactic` information about words (part of speech, inflections, stems, etc.).

#### The Distributional Hypothesis

Zellig Harris (1954):

- Words that occur in similar contexts have similar meanings.
- `oculist` and `eye doctor` occur in almost the same contexts.
- If A and B have almost the same environments, then A and B are synonymous.

John Firth (1957):

- "You shall know a word by the company it keeps."
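
The hypothesis can be illustrated on a toy corpus by comparing the sets of words that share a sentence with each target word. Everything in this sketch is an assumption made for illustration: the sentences are invented, `eye_doctor` is treated as a single token, and Jaccard overlap is used as one simple way to compare context sets.

```python
# Hypothetical toy corpus; "eye_doctor" is treated as a single token.
corpus = [
    ["the", "oculist", "examined", "my", "eyes"],
    ["the", "eye_doctor", "examined", "my", "eyes"],
    ["the", "oculist", "prescribed", "glasses"],
    ["the", "eye_doctor", "prescribed", "lenses"],
    ["the", "chef", "cooked", "a", "stew"],
]


def contexts(word):
    """Collect every other word that shares a sentence with `word`."""
    seen = set()
    for sentence in corpus:
        if word in sentence:
            seen.update(tok for tok in sentence if tok != word)
    return seen


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sim_synonyms = jaccard(contexts("oculist"), contexts("eye_doctor"))
sim_unrelated = jaccard(contexts("oculist"), contexts("chef"))

print(round(sim_synonyms, 2), round(sim_unrelated, 2))
```

In this toy data, `oculist` and `eye_doctor` share most of their contexts, while `oculist` and `chef` share only the word `the`, so the first pair scores much higher.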

> The `contexts` in which words occur tell us a lot about the meaning of words.
> 
> Words that occur in similar contexts have similar meanings.