# XCS224N Natural Language Processing with Deep Learning


# Lecture 1

[CS224N](http://web.stanford.edu/class/cs224n/) / [XCS224N](http://scpd.stanford.edu/search/publicCourseSearchDetails.do?method=load&courseId=93933715) / [Lecture](https://youtu.be/8rXD5-xhemo) / [Slides](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture01-wordvecs1.pdf)

Objectives:
* Understand how deep learning methods can be applied to NLP
* Understand the challenges faced in NLP and language in general
* Learn how to build NLP systems

## Human Language

* Human language is a very slow medium for conveying information
* Humans compensate with a form of compression: speakers say relatively little and assume listeners will fill in nuance and context

## Word Meaning and Representation

### Meaning

Definition: <strong>meaning</strong> (Webster Dictionary)
* the idea that is represented by a word, phrase, etc.

Common linguistic way of thinking of meaning (<strong>denotational semantics</strong>):

<em> signifier (symbol) <==> signified (idea or thing) </em>

A common solution in computing is to use <strong>WordNet</strong>, a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships).

Problems with resources like WordNet:
* Great resource but lacks context and nuance
* New words or meanings of words may not be captured
* Subjective
* Requires human labor to maintain
* Cannot be used to compute accurate word similarity

### Localist Representation

Representing words as discrete symbols is called a <strong>localist representation</strong>.

Example:
```python
# One-hot vectors: a single 1 at the word's index, 0 everywhere else
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```
Traits:
* Vector dimension = number of words in vocabulary (e.g. 500,000)
* Each vector is sparse (a single 1, all other entries 0)

Limitations of localist representation:
* With <em>one-hot encoding</em>, the vectors for any two distinct words, such as hotel and motel, are orthogonal, so there is no notion of similarity between them
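
The orthogonality problem can be checked directly. The following is a minimal sketch that reuses the 15-dimensional toy vectors from the example above; the helper names `one_hot` and `dot` are illustrative and not part of any library.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


VOCAB_SIZE = 15  # toy size; a real vocabulary may have e.g. 500,000 entries
motel = one_hot(10, VOCAB_SIZE)
hotel = one_hot(7, VOCAB_SIZE)

print(dot(motel, hotel))  # 0: two different words never share a non-zero entry
print(dot(motel, motel))  # 1: a word only matches itself
```

Because every pair of distinct words has a dot product of 0, no similarity measure based on these vectors can tell "hotel" and "motel" apart from any other pair of words.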

Possible solutions:
* Use WordNet's lists of synonyms to get similarity. This will most likely fail badly due to incompleteness and other factors
* Instead: learn to <em>encode similarity</em> into the vectors themselves

### Distributed Representation

<strong>Distributional Semantics</strong>: A word's meaning is given by the words that frequently appear close-by.

The idea is that when a <strong>word w</strong> appears in a text, its <strong>context</strong> is the set of words that appear nearby (within a fixed-size window).
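
As a rough sketch of what "context within a fixed-size window" means in practice, the snippet below lists the context words for each position of a short token list. The function name `context_windows`, the example sentence, and the choice to truncate windows at the edges of the text are assumptions made for illustration.

```python
def context_windows(tokens, m):
    """Yield (center, context) pairs using up to m words on each side."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - m):t]    # up to m words before position t
        right = tokens[t + 1:t + 1 + m]   # up to m words after position t
        yield center, left + right


# Hypothetical example sentence, split on whitespace
sentence = "government debt problems turning into banking crises".split()

for center, context in context_windows(sentence, m=2):
    print(center, "->", context)
```

Words near the start or end of the text simply get fewer context words under this truncation choice.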

<br>

<img src="images/context_example.PNG" />

<strong>Word vectors</strong>: We build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Sometimes called:
* Word vectors
* Word embeddings
* Word representations
* Distributed representations
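
To illustrate how dense vectors can encode similarity, here is a small sketch using cosine similarity. The three-dimensional vectors are hand-picked toy values, not learned embeddings, and cosine similarity is just one common choice of similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy vectors (assumed values, for illustration only)
vectors = {
    "hotel": [0.8, 0.1, 0.6],
    "motel": [0.7, 0.2, 0.7],
    "banana": [-0.5, 0.9, 0.0],
}

print(cosine(vectors["hotel"], vectors["motel"]))   # close to 1
print(cosine(vectors["hotel"], vectors["banana"]))  # negative
```

Unlike one-hot vectors, dense vectors can place related words close together, so the similarity score carries information.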

<br>

<img src="images/word_vector_example.PNG" />

## Word2vec Introduction

<strong>Word2vec</strong> (Mikolov et al. 2013) is a framework for learning word vectors.

1. Consider a large corpus of text
2. Every word in a fixed vocabulary is represented by a vector
3. Iterate through each position <strong>t</strong> in the text, which has a <strong>center word c</strong> and <strong>context words o</strong>
4. Use the similarity of the word vectors for <strong>c</strong> and <strong>o</strong> to calculate the probability of <strong>o</strong> given <strong>c</strong> (or vice versa)
5. Keep adjusting the word vectors to maximise the probability
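
Step 4 can be sketched as follows. The notes above do not fix the formula, so this example assumes one common choice: the dot product as the similarity measure, normalised with a softmax over the vocabulary, with a separate table of context vectors. All vectors and names here are toy values chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def prob_context_given_center(center_vec, context_vecs, target):
    """P(target | center) as a softmax over dot-product similarities."""
    scores = {word: dot(vec, center_vec) for word, vec in context_vecs.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    return exps[target] / sum(exps.values())


# Toy vectors (assumed values, not learned)
center_vec = [0.5, 1.0]
context_vecs = {
    "banking": [0.4, 0.9],
    "crises": [0.1, 0.3],
    "banana": [-0.8, -0.2],
}

probs = {w: prob_context_given_center(center_vec, context_vecs, w)
         for w in context_vecs}
print(probs)
```

Under this assumption, a context word whose vector is more similar to the center vector receives a higher probability, and step 5 amounts to adjusting the vectors so that the words actually observed in context become more probable.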

<br>

<img src="images/word2vec_example.PNG" />

<br>

<img src="images/word2vec_example2.PNG" />

## Word2vec Derivations of Gradient

For each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$:

$$\text{Likelihood} = L(\theta) = \displaystyle\prod_{t=1}^T \displaystyle\prod_{\substack{-m\leq j \leq m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$

The objective function is the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \displaystyle\sum_{t=1}^T \displaystyle\sum_{\substack{-m\leq j \leq m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

$\theta$ - All variables to be optimized
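
The objective can be evaluated numerically for a toy corpus. The sketch below assumes a placeholder model `prob(o, c)` that returns $P(o \mid c)$, and it skips window positions that fall outside the text; both the function names and the boundary handling are assumptions for illustration.

```python
import math


def objective(tokens, m, prob):
    """J(theta) = -(1/T) * sum over t and j != 0, |j| <= m of log P(w_{t+j} | w_t)."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue  # skip the center word and positions outside the text
            total += math.log(prob(tokens[t + j], tokens[t]))
    return -total / T


tokens = "the cat sat on the mat".split()
vocab_size = len(set(tokens))  # 5 distinct words


def uniform(o, c):
    """Placeholder model: every word is equally likely given any center word."""
    return 1.0 / vocab_size


print(objective(tokens, m=1, prob=uniform))
```

With the uniform placeholder, every one of the 10 (center, context) pairs contributes $\log 5$, so $J = \frac{10}{6}\log 5$. A model that assigns higher probability to the observed context words yields a lower $J$, which is why minimising $J(\theta)$ is equivalent to maximising the likelihood.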