# Introduction
* Zipf's law: $ f*r = k $, where $f$ is frequency, $r$ is rank, $k$ is a constant
* Even in a large corpus, there will be many infrequent words

# Distributional semantics
## Vector semantics
* Model words with vectors, aka embeddings
* Shorter windows - more syntactic representation
* Longer windows - more semantic representation
* **PMI**: Is context word **informative** about a target word?
    * $ PMI(x,y) = \log_2 \large{\frac{P(x,y)}{P(x)P(y)}} $
    * PPMI: Replace negative values with 0
    * Biased toward infrequent events
        * Raise context prob. to the power $\alpha = .75$, which raises the prob. of rare words
        * Add-k smoothing
* **tf.idf**: Combine term frequency + inverse document frequency
    * $ w_{i,j} = tf_{i,j} * log(\frac{N}{df_i})$
    * $N$ = # docs, $df_i$ is # of docs with word
* **Cosine similarity**: $ cos(\vec{v},\vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}||\vec{w}|} $
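
A minimal sketch of PPMI and cosine similarity, assuming the usual base-2 log definition of PMI. The function names (`ppmi_table`, `cosine`) and the toy co-occurrence counts are made up for illustration; no smoothing or $\alpha$ weighting is applied.

```python
import math
from collections import Counter

def ppmi_table(pair_counts):
    """Map each (target, context) pair to its positive PMI (base-2 log)."""
    total = sum(pair_counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in pair_counts.items():
        target_totals[target] += n
        context_totals[context] += n
    table = {}
    for (target, context), n in pair_counts.items():
        if n == 0:
            # log(0) is undefined; treat unseen pairs as uninformative.
            table[(target, context)] = 0.0
            continue
        p_xy = n / total
        p_x = target_totals[target] / total
        p_y = context_totals[context] / total
        # Negative PMI values are replaced with 0.
        table[(target, context)] = max(0.0, math.log2(p_xy / (p_x * p_y)))
    return table

def cosine(v, w):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Hypothetical co-occurrence counts: (target word, context word) -> count.
toy_counts = {
    ("cherry", "pie"): 3, ("cherry", "data"): 1,
    ("digital", "pie"): 1, ("digital", "data"): 3,
}
ppmi = ppmi_table(toy_counts)
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

Here ("cherry", "data") co-occurs less often than chance predicts, so its PMI is negative and is clipped to 0.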
    

## Word sense disambiguation
* **Zeugma** combines distinct senses of a word in an uncomfortable way
* Same word, different senses:
    * Homonyms: Unrelated (financial bank vs river bank)
    * Polysemes: Related but distinct (financial bank vs blood bank vs tree bank)
    * Metonyms: "Stand-in" ("Washington" instead of "the US gov't")
* **Cohen's Kappa**: Quantify agreement between human annotators, corrected for chance agreement
    * $ K = \large{\frac{Pr(a) - Pr(e)}{1-Pr(e)}} $
    * a = actual agreement, e = chance agreement
* **Lesk's algorithm**: Count overlapping words between glosses & context
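
The kappa formula can be sketched for two annotators as below. This assumes chance agreement $Pr(e)$ is estimated from each annotator's own label frequencies; the function name and the sense labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    # Pr(a): fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pr(e): chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Undefined (division by zero) when expected == 1.
    return (observed - expected) / (1 - expected)

# Hypothetical sense annotations for four occurrences of "bank".
annotator_1 = ["finance", "finance", "river", "river"]
annotator_2 = ["finance", "river", "river", "river"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.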

### WSD as supervised classification
* Precision: % of selected items that are correct
* Recall: % of correct items that are selected
* $ F = \large{\frac{(\beta^2 + 1)PR}{\beta^2P+R}} $
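
A small sketch of precision, recall, and $F_\beta$ over sets of item IDs. The function name, the zero-division guards, and the example sets are assumptions for illustration.

```python
def precision_recall_f(selected, correct, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of item IDs."""
    selected, correct = set(selected), set(correct)
    true_positives = len(selected & correct)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision == 0.0 and recall == 0.0:
        # Guard against 0/0; reporting F = 0 here is a convention, not a rule.
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Hypothetical system output vs. gold standard.
p, r, f1 = precision_recall_f(selected={1, 2, 3, 4}, correct={2, 3, 5})
```

With $\beta = 1$ the formula reduces to the harmonic mean of precision and recall; $\beta > 1$ weights recall more heavily.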

## Perceptron
* $ f(x) = sign(w^Tx + b) $
* Update rule: if $sign(\hat{y}) \neq sign(y)$ update weights:
    * $w_d = w_d + yx_d$ for all $ d = 1...D$
    * $b = b + y$
* $x$ is the **feature vector**
* $w,b$ are **parameters**
* $MaxIter$ is a **hyperparameter**
* Voted perceptron: $\hat{y} = sign( \sum_{k=1}^Kc^{(k)}sign(w^{(k)}\cdot \hat{x} + b^{(k)}))$
* Averaged perceptron: $\hat{y} = sign((\sum_{k=1}^Kc^{(k)}w^{(k)})\cdot \hat{x} + \sum_{k=1}^Kc^{(k)}b^{(k)})$
* Converges after at most $\frac{R^2}{\gamma^2} $ updates if the data are linearly separable with margin $\gamma$ and $|x| \leq R$ for every example
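
The update rule can be sketched as a plain training loop. This is a minimal version under some assumptions: labels are $\pm 1$, an activation of exactly 0 counts as a mistake, and training stops early once an epoch makes no updates. The function names and the toy data are made up.

```python
def train_perceptron(data, max_iter=10):
    """data: list of (x, y) pairs with x a list of floats and y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_iter):
        mistakes = 0
        for x, y in data:
            activation = sum(wd * xd for wd, xd in zip(w, x)) + b
            if y * activation <= 0:  # wrong side of (or on) the boundary
                w = [wd + y * xd for wd, xd in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every example is classified correctly
    return w, b

def predict(w, b, x):
    activation = sum(wd * xd for wd, xd in zip(w, x)) + b
    return 1 if activation > 0 else -1

# Hypothetical linearly separable data.
toy_data = [
    ([2.0, 1.0], 1), ([1.0, 2.0], 1),
    ([-1.0, -1.5], -1), ([-2.0, -0.5], -1),
]
w, b = train_perceptron(toy_data)
predictions = [predict(w, b, x) for x, _ in toy_data]
```

`max_iter` plays the role of the $MaxIter$ hyperparameter; the voted and averaged variants would additionally track the survival count $c^{(k)}$ of each weight vector.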

## Logistic regression
* Sigmoid: $\sigma(z) = \large{\frac{1}{1+e^{-z}}}$
* $P(y=1) = \sigma(w^Tx + b)$
* $L_{CE}(w,b) = -[ylog\sigma(w^Tx + b) + (1-y)log(1-\sigma(w^Tx + b))]$
    * aka: $-[ylog\hat{y} + (1-y)log(1-\hat{y})]$
  
### Gradient descent
* Minimizes loss by steps in the opposite direction of the gradient
* $\frac{\partial L_{CE}(w,b)}{\partial w_j} = [\sigma(w^Tx+b)-y]x_j$
* $\theta = \theta - \eta g$, where $\theta$ = parameters, $\eta$ = learning rate, $g$ = gradient of the loss
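
One stochastic gradient step for binary logistic regression might look like the sketch below. It assumes a single training example per step and uses $\sigma(w^Tx+b) - y$ as the bias gradient, which the notes do not state explicitly; all names and numbers are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) under the current parameters."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(w, b, x, y):
    y_hat = predict_proba(w, b, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sgd_step(w, b, x, y, eta):
    """Move the parameters one step against the gradient of the loss."""
    error = predict_proba(w, b, x) - y  # sigma(w.x + b) - y
    new_w = [wj - eta * error * xj for wj, xj in zip(w, x)]
    new_b = b - eta * error
    return new_w, new_b

# Hypothetical example: two features, gold label 1, learning rate 0.1.
w, b = [0.0, 0.0], 0.0
x, y = [3.0, 2.0], 1
loss_before = cross_entropy(w, b, x, y)
w, b = sgd_step(w, b, x, y, eta=0.1)
loss_after = cross_entropy(w, b, x, y)
```

After the step, the loss on this example is lower than before, which is what moving against the gradient is meant to achieve for a small enough $\eta$.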

### Multiclass LR
* Softmax: $softmax(z_i) = \large{\frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}}$ for class $ 1 \leq i \leq K $
* $P(y = c|x) = \large{\frac{e^{w_c^Tx + b_c}}{\sum_{j=1}^K e^{w_j^Tx + b_j}}} $
* $L_{CE}(\hat{y},y) = -\sum_{k=1}^K 1\{y=k\}log p(y=k|x)$, where $1\{\} = 1$ if true, and $0$ otherwise
* $\frac{\partial L_{CE}(w,b)}{\partial w_k} = -(1\{y=k\}-\large{\frac{e^{w_k^Tx + b_k}}{\sum_{j=1}^K e^{w_j^Tx + b_j}}})x$
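
A sketch of softmax and the multiclass cross-entropy loss. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not in the notes; it leaves the result unchanged. The names and scores are hypothetical.

```python
import math

def softmax(z):
    """Turn a list of real-valued scores into a probability distribution."""
    shift = max(z)  # avoids overflow in exp() without changing the output
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_cross_entropy(probs, gold):
    """Only the gold class contributes, because the indicator is 0 elsewhere."""
    return -math.log(probs[gold])

# Hypothetical scores w_c . x + b_c for three classes.
scores = [1.0, 2.0, 3.0]
probs = softmax(scores)
loss = multiclass_cross_entropy(probs, gold=2)
```

The loss is small when the gold class receives most of the probability mass and grows without bound as that probability approaches 0.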

## N-gram language models
* Markov assumption: Approximate context history by last few words only
* Bigram: $P(w_i|w_1w_2...w_{i-1}) \approx P(w_i| w_{i-1})$
* Generally insufficient because language has long-distance dependencies
* $ P(w_i|w_{i-1}) = \large{\frac{count(w_{i-1}, w_i)}{count(w_{i-1})} }$
* Smoothing:
    * Counts are sparse; smoothing moves some prob. mass to unseen n-grams so the model generalizes better
    * Laplace (add-1) smoothing: Add one to all counts
        * Adjusted counts: $ c^*(w_{n-1}w_n) = \large{\frac{[c(w_{n-1}w_n)+1] \cdot c(w_{n-1})}{c(w_{n-1})+V}}$
    * Stupid backoff: Use less context
        * $S(w_i|w_{i-k+1}^{i-1}) = \large{\frac{count(w_{i-k+1}^i)}{count(w_{i-k+1}^{i-1})}}$ if $count(w_{i-k+1}^i) > 0$,
            $0.4S(w_i|w_{i-k+2}^{i-1})$ otherwise
        * $S(w_i) = \large{\frac{count(w_i)}{N}}$
* Can use an `<UNK>` token for unknown words
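
A minimal bigram model with Laplace smoothing and stupid backoff, as a sketch. It assumes whitespace tokenization, sentence boundary markers `<s>` and `</s>`, and a vocabulary size $V$ that counts those markers; the function names and the two-sentence corpus are made up.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def laplace_prob(prev, word, unigrams, bigrams):
    """Add-1 smoothed P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def stupid_backoff(prev, word, unigrams, bigrams, alpha=0.4):
    """Score (not a true probability): bigram estimate, else scaled unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / sum(unigrams.values())

# Hypothetical two-sentence corpus.
unigrams, bigrams = train_bigram(["i am sam", "sam i am"])
p_seen = laplace_prob("i", "am", unigrams, bigrams)
p_unseen = laplace_prob("i", "sam", unigrams, bigrams)
s_unseen = stupid_backoff("i", "sam", unigrams, bigrams)
```

The unseen bigram "i sam" gets a non-zero value under both schemes, whereas the unsmoothed count ratio would assign it 0.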

### Evaluating language models
* **Perplexity**: $PP(W) = P(w_1w_2...w_N)^{-\frac{1}{N}} = [\prod_{i=1}^N P(w_i|w_1...w_{i-1})]^{-\frac{1}{N}}$ 
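
Perplexity can be computed from the per-word conditional probabilities a model assigns to a test sequence. The sketch below works in log space to avoid underflow on long sequences, which is an implementation choice rather than part of the definition; the function name and the probabilities are illustrative.

```python
import math

def perplexity(word_probs):
    """word_probs[i] is the model's P(w_i | w_1 ... w_{i-1}); all must be > 0."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    # exp(-(1/N) * log P(W)) equals P(W) ** (-1/N).
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities.
uniform_pp = perplexity([0.1] * 5)
mixed_pp = perplexity([0.5, 0.25])
```

A model that assigns probability 0.1 to every word has perplexity 10, matching the intuition of perplexity as an average branching factor. Lower perplexity means the model finds the test data less surprising.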

## Neural network language model
* Represent words as one-hot vectors
* Probabilistic classifier to compute prob. of a word given n prev words
* Error: Same as multiclass LR
    * Corpus level: $error(\lambda) = -\sum_{E \in corpus} log P_\lambda (E) $
    * Word level: $-log P_\lambda (e_t|e_1...e_{t-1})$
* Learns word embeddings 