# Session 2, part I

- Representing words and meanings
- Language modeling

<img src="images/_99.jpg" width="60%">

Meanings are central to natural language 
=================================

Key **emergent properties** of natural language:

+ language is socially-oriented
+ words reflect symbols and social categories (that is, culture)
+ language conveys (ambiguous) meanings

The **concept of meaning** is the place to start for any natural language processing analysis.

Let's have a closer look at:

+ how 'meanings' are represented in computational linguistics
+ how machines look at 'meanings'.

<img src="images/_0.jpg" width="75%">

How to represent the meaning of a word?
=====================================

A (computational) linguist's perspective
----------------------------------

Two pillars that reflect how linguists think about meanings:

+ denotational semantics
+ distributional hypothesis

The intuition behind denotational semantics
---------------------------------------------------------------

Semantics, as the study of meanings, concerns the relationship between signifiers ― like words, phrases, signs, and symbols ― and what they stand for in reality, their denotation.

Denotations comprise both the salient features associated with an entity (whether a concrete instance or a category) and the cognitive and behavioral effects of using a signifier that invokes that entity.

Example: the lexeme 'hip-hop' conveys meanings about what constitutes a 'hip-hop' song as well as the values, norms, and beliefs that orient the behavior of 'hip-hop people.'

<img src="images/_2.jpg" width="100%">

The distributional hypothesis (DH)
============================

*''Difference of meaning correlates with difference of distribution''*

―Harris, 1954

*''Semantic similarity is a function of the contexts in which words are used.''*

―Miller & Charles, 1991

*''DS is not only a method for lexical analysis but also a theoretical framework to build computational models of semantic memory''*

―Lenci, 2018

DH lies at the heart of vector space models.

<img src="images/_1.png" width="100%">

Fig. 1 ― Distributional vectors of the lexemes car, cat, dog, and van. Notes: source is 'Lenci 2018 ― ARL'

Distributional representations
========================

The distributional representation of a lexical item is typically a distributional vector representing its co-occurrences with linguistic contexts ― hence the name vector space semantics.

The kind of co-occurrence relation between target and context lexemes determines different
types of collocates and distributional representations.

Context types (Firth, 1957: *''You shall know a word by the company it keeps!''*)

| Context types                                | Co-occurrences             |
| -------------------------------------------- | -------------------------- |
| Undirected window-based collocate            | $word$                     |
| Directed window-based collocate              | $\langle R, word \rangle$  |
| Dependency-filtered syntactic collocate [(see spaCy's documentation)](https://spacy.io/displacy-3504502e1d5463ede765f0a789717424.svg) | $word$                     |
| Dependency-typed syntactic collocate         | $\langle obj, word \rangle$|
| Text region                                  | Firth (1957)               |

Notes: source is 'Lenci 2018 ― ARL'
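The first two context types in the table can be illustrated with a small sketch. The function name `window_collocates`, the window size, and the example sentence are assumptions made for illustration; they do not come from Lenci (2018). Directed collocates are stored as `(side, word)` tuples, mirroring the $\langle R, word \rangle$ notation.

```python
from collections import Counter


def window_collocates(tokens, target, window=2, directed=False):
    """Count the collocates of `target` within +/- `window` tokens.

    Undirected: each collocate is counted as a plain word.
    Directed: each collocate is counted as a (side, word) pair,
    where side is "L" (left of the target) or "R" (right of it).
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the target itself
            if directed:
                side = "L" if j < i else "R"
                counts[(side, tokens[j])] += 1
            else:
                counts[tokens[j]] += 1
    return counts


# Invented toy sentence, whitespace-tokenized
tokens = "the dog bit the man and the man bit the dog".split()

undirected = window_collocates(tokens, "dog", window=2)
directed = window_collocates(tokens, "dog", window=2, directed=True)

print(undirected)
print(directed)
```

The syntactic context types would additionally require a dependency parser, which this sketch leaves out.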


Building distributional representations (1/3)
====================================

The basic method of building distributional vectors consists of the following procedure:

+ co-occurrences between lexical items and linguistic contexts are extracted from a corpus and counted
+ the distribution of lexical items is represented with a co-occurrence matrix, whose rows correspond to target lexical items, columns to contexts, and the entries to their co-occurrence frequency
+ raw frequencies are then usually transformed into significance weights to reflect the importance of the contexts
+ the semantic similarity between lexemes is measured with the similarity between their row vectors in the co-occurrence matrix

Suppose we have extracted and counted the co-occurrences of the targets $T =\{bike, car, dog, lion\}$ with the context lexemes $C =\{bite, buy, drive, eat, get, live, park, ride, tell\}$ in a corpus. Their distribution is represented with the following co-occurrence matrix $M_{T \times C}$, in which $m_{t,c}$ is the co-occurrence frequency of $t$ with $c$:

<img src="images/_4.png" width="70%">

Notes: source is 'Lenci 2018 ― ARL'
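The counting step can be sketched in a few lines of Python. The targets and contexts are the ones named above, but the list of `(target, context)` pairs is invented for illustration and does not reproduce the counts in Lenci's matrix. The helper name `cooccurrence_matrix` is also an assumption.

```python
import numpy as np

targets = ["bike", "car", "dog", "lion"]
contexts = ["bite", "buy", "drive", "eat", "get", "live", "park", "ride", "tell"]

# Invented (target, context) pairs standing in for co-occurrences
# extracted from a corpus; these are NOT the counts from Lenci (2018).
pairs = [
    ("bike", "ride"), ("bike", "ride"), ("bike", "buy"),
    ("car", "drive"), ("car", "drive"), ("car", "buy"), ("car", "park"),
    ("dog", "bite"), ("dog", "eat"), ("dog", "eat"),
    ("lion", "eat"), ("lion", "live"), ("lion", "bite"),
]


def cooccurrence_matrix(pairs, targets, contexts):
    """Build a |T| x |C| matrix of raw co-occurrence frequencies."""
    t_index = {t: i for i, t in enumerate(targets)}
    c_index = {c: j for j, c in enumerate(contexts)}
    M = np.zeros((len(targets), len(contexts)), dtype=int)
    for t, c in pairs:
        # Ignore pairs whose target or context is outside T or C
        if t in t_index and c in c_index:
            M[t_index[t], c_index[c]] += 1
    return M


M = cooccurrence_matrix(pairs, targets, contexts)
print(M)
```

Rows correspond to targets and columns to contexts, so `M[i, j]` plays the role of $m_{t,c}$.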

Building distributional representations (2/3)
====================================

The most common weighting function in distributional semantics (DS) is positive pointwise mutual information (PPMI) (Bullinaria & Levy 2007).

PPMI measures how much the probability of a target–context pair estimated in the training corpus is higher than the probability we should expect if the target and the context occurred independently of one another.

Matrix 3 contains the PPMI weights computed from the raw co-occurrence frequencies in matrix 1.

\begin{equation}
\mathrm{PPMI}(t,c) = \max \bigg( 0, \log_{2} \frac{p(t,c)}{p(t)\,p(c)} \bigg)
\end{equation}

<img src="images/_5.png" width="70%">

Notes: source is 'Lenci 2018 ― ARL'
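A minimal NumPy sketch of the PPMI weighting follows. It assumes that the probabilities are estimated as relative frequencies of the raw counts, and it maps cells with zero counts to a weight of 0. The function name `ppmi` and the 2 x 2 example matrix are invented for illustration.

```python
import numpy as np


def ppmi(M):
    """Positive pointwise mutual information of a co-occurrence matrix.

    Probabilities are estimated as relative frequencies:
    p(t,c) = m_tc / N, p(t) = row sum / N, p(c) = column sum / N.
    """
    M = np.asarray(M, dtype=float)
    p_tc = M / M.sum()
    p_t = p_tc.sum(axis=1, keepdims=True)   # marginal of each target
    p_c = p_tc.sum(axis=0, keepdims=True)   # marginal of each context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_tc / (p_t * p_c))
    # Zero counts give log2(0) = -inf (or nan for empty rows/columns);
    # both are mapped to 0 here.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


# Invented toy matrix: each target occurs with exactly one context
W = ppmi(np.array([[4, 0],
                   [0, 4]]))
print(W)

# Under perfect independence every weight is 0
W_indep = ppmi(np.array([[1, 1],
                         [1, 1]]))
print(W_indep)
```

In the first example each observed pair is twice as probable as independence would predict, so its weight is $\log_{2} 2 = 1$.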

Building distributional representations (3/3)
====================================

The distributional similarity between two lexemes $u$ and $v$ is measured with the similarity
between their distributional vectors $u$ and $v$.

Once we have computed the pairwise distributional similarity between the targets, we can identify the k nearest neighbors of each target t, that is, the k lexical items with the highest similarity score with t. The cosine is the most popular measure of vector similarity in DS:

\begin{equation}
\cos(u,v) = \frac{u \cdot v}{\Vert u \Vert \Vert v \Vert}
\end{equation}

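The cosine ranges from 1 for vectors pointing in the same direction (including identical vectors) to −1 for vectors pointing in opposite directions. The lower bound is 0 if the vectors do not contain negative values, as is the case with PPMI weights. The following matrix reports the cosines between the row vectors in matrix 3: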

<img src="images/_6.png" width="35%">

Notes: source is 'Lenci 2018 ― ARL'
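The cosine and the $k$ nearest neighbors can be sketched as follows. The weight matrix `W` is invented for illustration and is not matrix 3 from Lenci (2018). The helper names `cosine` and `nearest_neighbors` are assumptions, as is the choice to return 0 when either vector has zero length.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumption: an all-zero vector has similarity 0 with everything
    return float(u @ v / denom) if denom else 0.0


def nearest_neighbors(W, labels, target, k=1):
    """Return the k rows of W most similar to the row of `target`."""
    i = labels.index(target)
    scores = [
        (labels[j], cosine(W[i], W[j]))
        for j in range(len(labels))
        if j != i  # a target is not its own neighbor
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


labels = ["bike", "car", "dog", "lion"]

# Invented weights: the first two columns stand for vehicle-like
# contexts, the last two for animal-like contexts.
W = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

print(nearest_neighbors(W, labels, "bike", k=1))
print(nearest_neighbors(W, labels, "lion", k=2))
```

With these toy weights, bike and car share contexts and end up as each other's nearest neighbor, while bike and dog share none and have a cosine of 0.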

Distributional semantics and NLP frameworks/tools
==========================================

<img src="images/_7.png" width="90%">

Notes: source is 'Lenci 2018 ― ARL'