# Vector Semantics and Word Embeddings

![](../figs/intro_nlp/vector/entelecheia_alphabet_letters.png)

## Vector Semantics and Word Embeddings

- Lexical semantics is the study of the meaning of words
- Distributional hypothesis: words that occur in similar contexts have similar meanings
- Sparse vectors: one-hot encoding
- Dense vectors: word embeddings

### What do words mean, and how do we represent that?

> `cassoulet`

Do we want to represent that ...

- "cassoulet" is a French dish?
- "cassoulet" contains meat and beans?
- "cassoulet" is a stew?

> `bar`

Do we want to represent that ...

- "bar" is a place where you can drink alcohol?
- "bar" is a long rod?
- "bar" means to block something from moving?

### Different approaches to lexical semantics

NLP draws on two different approaches to lexical semantics:

- **Lexical semantics** (the lexicographic tradition):
  - The study of the meaning of words
  - Aims to capture the information represented in lexical entries in dictionaries
- **Distributional semantics**: 
  - The study of the meaning of words based on their distributional properties in large corpora
  - The distributional hypothesis: words that occur in similar contexts have similar meanings

#### Lexical semantics

- Uses resources such as `lexicons`, `thesauri`, `ontologies` etc. that capture explicit knowledge about word meanings.
- Assumes that words have `discrete word senses` that can be represented in a `lexicon`.
  - bank 1 = a financial institution
  - bank 2 = a river bank
- Captures explicit knowledge about word meanings, but cannot say anything about words that are missing from the lexicon.
  -  `dog` is a `canine` (lexicon)
  -  `cars` have `wheels` (lexicon)





#### Distributional semantics

- Uses `large corpora of raw text` to learn the meaning of words from the contexts in which they occur.
- Maps words to `vector representations` that capture the `distributional properties` of the words in the corpus.
- Uses neural networks to learn the dense vector representations of words, `word embeddings`, from large corpora of raw text.
- Mapping each word to a single vector ignores the fact that a word can have multiple meanings or parts of speech.

### How do we represent words to capture word similarities?

- As `atomic symbols`
  - in a traditional n-gram language model
  - explicit features in a machine learning model
  - this is equivalent to very high-dimensional one-hot vectors:
    - aardvark = [1,0,...,0], bear = [0,1,...,0], ..., zebra = [0,0,...,1]
    - under this representation, `height` and `tall` are as different as `aardvark` and `zebra`
- As very high-dimensional `sparse vectors`
  - to capture the distributional properties of words
- As low-dimensional `dense vectors`
  - word embeddings
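
To see why atomic symbols cannot express similarity, here is a minimal sketch using NumPy. The five-word vocabulary and the helper names `one_hot` and `cosine` are assumptions made for illustration; they are not part of any particular library.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only).
vocab = ["aardvark", "bear", "height", "tall", "zebra"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two distinct words are orthogonal, so their similarity is always zero.
print(cosine(one_hot("height"), one_hot("tall")))     # 0.0
print(cosine(one_hot("aardvark"), one_hot("zebra")))  # 0.0
print(cosine(one_hot("tall"), one_hot("tall")))       # 1.0
```

Because every pair of distinct one-hot vectors is orthogonal, the representation itself carries no information about which words are related.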

### What should word representations capture?

- Vector representations of words were originally used to capture `lexical semantics` so that words with similar meanings would be represented by vectors that are close together in vector space.
- These representations may also capture some `morphological` and `syntactic` information about words (part of speech, inflections, stems, etc.).

#### The Distributional Hypothesis

Zellig Harris (1954):

- Words that occur in similar contexts have similar meanings.
- `oculist` and `eye doctor` occur in almost the same contexts.
- If A and B have almost the same environment, then A and B are synonymous.

John Firth (1957):

- "You shall know a word by the company it keeps."

> The `contexts` in which words occur tell us a lot about the meaning of words.
> 
> Words that occur in similar contexts have similar meanings.

#### Why do we care about word contexts?

What is `tezgüino`?

- A bottle of `tezgüino` is on the table.
- Everybody likes `tezgüino`.
- `Tezgüino` makes you drunk.
- We make `tezgüino` out of corn.

We don't know what `tezgüino` is, but we can guess that it is a drink because we understand these sentences.

If we have the following sentences:

- A bottle of `wine` is on the table.
- There is a `beer` bottle on the table.
- `Beer` makes you drunk.
- We make `bourbon` out of corn.
- Everybody likes `chocolate`.
- Everybody likes `babies`.

Could we guess that `tezgüino` is a drink like `wine` or `beer`?

However, there are also red herrings:

- Everybody likes `babies`.
- Everybody likes `chocolate`.
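
The comparison above can be sketched as a count of shared contexts. This is only a toy illustration: each sentence has been turned by hand into a frame with `_` in place of the word of interest, and the helper `shared_contexts` is a hypothetical name introduced here.

```python
from collections import defaultdict

# (word, context frame) pairs built by hand from the example sentences.
observations = [
    ("tezgüino", "a bottle of _ is on the table"),
    ("tezgüino", "everybody likes _"),
    ("tezgüino", "_ makes you drunk"),
    ("tezgüino", "we make _ out of corn"),
    ("wine", "a bottle of _ is on the table"),
    ("beer", "there is a _ bottle on the table"),
    ("beer", "_ makes you drunk"),
    ("bourbon", "we make _ out of corn"),
    ("chocolate", "everybody likes _"),
    ("babies", "everybody likes _"),
]

# Collect the set of contexts observed for each word.
contexts = defaultdict(set)
for word, frame in observations:
    contexts[word].add(frame)

def shared_contexts(a, b):
    """Return the context frames in which both words were observed."""
    return contexts[a] & contexts[b]

for other in ["wine", "beer", "bourbon", "chocolate", "babies"]:
    print(other, len(shared_contexts("tezgüino", other)))
```

With this little data, every word shares exactly one context with `tezgüino`, so `babies` looks as similar as `wine`. A single shared context is weak evidence; the drink reading only emerges when the full set of contexts is considered together.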


### Two ways NLP uses context for semantics

`Distributional similarity` (vector-space semantics):

- Assume that words that occur in similar contexts have similar meanings.
- Use the `set of all contexts` in which a word occurs to measure the `similarity` between words.

`Word sense disambiguation`:

- Assume that if a word has multiple meanings, then it will occur in different contexts for each meaning.
- Use the context of a particular occurrence of a word to identify the `sense` of the word in that context.

## Distributional Similarity

### Basic idea

- Measure the semantic `similarity of words` by measuring the `similarity of the contexts` in which they occur.

### How?

- Represent words as `sparse vectors` such that:
  - each `vector element` (dimension) represents a different `context`
  - the `value` of each element is the `frequency` with which the word occurs in that context, capturing how `strongly` the word is associated with it
- Compute the `semantic similarity of words` by measuring the `similarity of their context vectors`

Distributional similarity represents each word $w$ as a vector $v_w$ of context counts:

$$v_w = (v_{w,1}, \ldots, v_{w,N}) \in \mathbb{R}^N$$

in a vector space $\mathbb{R}^N$, where $N$ is the number of contexts.

- each dimension $i$ represents a different context $c_i$
- each element $v_{w,i}$ captures how strongly $w$ is associated with context $c_i$
- $v_{w,i}$ is the co-occurrence count of $w$ and $c_i$
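
As a minimal sketch of these definitions, the code below builds co-occurrence count vectors $v_w$ from a three-sentence toy corpus and compares them with cosine similarity. The corpus, the choice of a symmetric two-word window as the definition of "context", and the helper names `context_vector` and `cosine` are all assumptions made for illustration.

```python
from collections import Counter
import numpy as np

# Toy corpus (an assumption for illustration only).
corpus = [
    "we drink wine with dinner",
    "we drink beer with dinner",
    "we read books after dinner",
]
WINDOW = 2  # context = words within two positions on either side

sentences = [s.split() for s in corpus]
context_words = sorted({tok for sent in sentences for tok in sent})
context_index = {c: i for i, c in enumerate(context_words)}

def context_vector(word):
    """Return v_w: counts of each context word seen within WINDOW of `word`."""
    counts = Counter()
    for sent in sentences:
        for pos, tok in enumerate(sent):
            if tok != word:
                continue
            lo, hi = max(0, pos - WINDOW), min(len(sent), pos + WINDOW + 1)
            for j in range(lo, hi):
                if j != pos:
                    counts[sent[j]] += 1
    vec = np.zeros(len(context_words))
    for c, n in counts.items():
        vec[context_index[c]] = n
    return vec

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(context_vector("wine"), context_vector("beer")))   # 1.0
print(cosine(context_vector("wine"), context_vector("books")))  # 0.5
```

In this toy corpus, `wine` and `beer` occur in identical contexts and therefore get identical vectors, while `books` shares only `we` and `dinner` with them. Raw counts are the simplest possible association measure; other weightings are possible.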

### The Information Retrieval perspective: The Term-Document Matrix

In information retrieval, we search a collection of $N$ documents whose vocabulary $V$ contains $M$ terms:

- We can represent each `word` in the vocabulary $V$ as an $N$-dimensional vector $v_w$ where $v_{w,i}$ is the `frequency` of the word $w$ in document $i$.
- Conversely, we can represent each `document` as an $M$-dimensional vector $v_d$ where $v_{d,j}$ is the `frequency` of the term $t_j$ in document $d$.

Finding the `most relevant` documents for a query $q$ is equivalent to finding the documents that are `most similar` to the query vector $v_q$.

- Queries are also documents, so we can use the same vector representation for queries and documents.
- Use the similarity of the query vector $v_q$ to the document vectors $v_d$ to rank the documents.
- A document is similar to a query if the two share many terms.
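
The retrieval view can be sketched in a few lines. The three documents, the query, whitespace tokenization, raw term counts, and cosine similarity as the ranking function are all assumptions chosen for illustration; real systems typically apply further weighting and normalization.

```python
import numpy as np

# Toy collection and query (assumptions for illustration only).
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell today",
]
query = "cat on mat"

vocab = sorted({tok for d in docs for tok in d.split()})
term_index = {t: j for j, t in enumerate(vocab)}

def doc_vector(text):
    """Return the M-dimensional term-count vector for a document or query."""
    vec = np.zeros(len(vocab))
    for tok in text.split():
        if tok in term_index:  # ignore terms outside the vocabulary
            vec[term_index[tok]] += 1
    return vec

# Term-document matrix: one row per term (M rows), one column per document (N columns).
term_doc = np.column_stack([doc_vector(d) for d in docs])

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

q = doc_vector(query)
scores = [cosine(q, term_doc[:, i]) for i in range(len(docs))]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[0], docs[ranking[0]])
```

Each row of `term_doc` is the $N$-dimensional vector of a word and each column is the $M$-dimensional vector of a document. Note that the second document scores zero because `cats` and `cat` are different terms under plain whitespace tokenization.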

### Term-Document Matrix

![](../figs/intro_nlp/vector/2.png)