**Lecture 2 - Word Vectors, Word Senses, and Neural Network Classifiers**

**Optimization**

- for current value of _theta_, calculate gradient of cost function, then take small step in direction of negative gradient. repeat.

   

In [None]:
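# pseudocode: evaluate_gradient, J, corpus, theta, and alpha are assumed to be defined elsewhere
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)  # gradient of J over the whole corpus
    theta = theta - alpha * theta_grad  # small step along the negative gradient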

- problem: J(_theta_) is a function of ALL windows in the corpus, so its gradient is very expensive to compute
    - solution: "stochastic gradient descent" (SGD)
        - repeatedly sample windows, and update after each one

In [None]:
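# pseudocode: sample_window and evaluate_gradient are assumed to be defined elsewhere
while True:
    window = sample_window(corpus)  # one randomly chosen window
    theta_grad = evaluate_gradient(J, window, theta)  # gradient estimate from that window only
    theta = theta - alpha * theta_grad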
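
A minimal runnable sketch of the two loops above, assuming a toy one-parameter least-squares cost in place of the word2vec objective. The names `data`, `grad`, `gradient_descent`, and `sgd`, and the step size, are illustrative stand-ins and not from the lecture. The toy data is noise-free, which is why the stochastic version also settles on the exact answer here; on real data SGD only gives a noisy estimate of the gradient.

```python
import random

# Toy stand-in for the corpus: pairs (x, y) generated by y = 3 * x.
# The model has a single parameter theta and the cost is
# J(theta) = mean over samples of (theta * x - y)^2.
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1.0, 1.0) for _ in range(200)]]


def grad(theta, samples):
    """Gradient of the mean squared error over the given samples."""
    return sum(2.0 * (theta * x - y) * x for x, y in samples) / len(samples)


def gradient_descent(theta, alpha=0.5, steps=200):
    # Full batch: every step touches all samples.
    for _ in range(steps):
        theta = theta - alpha * grad(theta, data)
    return theta


def sgd(theta, alpha=0.5, steps=2000):
    # Stochastic: every step uses one randomly sampled example.
    for _ in range(steps):
        sample = random.choice(data)
        theta = theta - alpha * grad(theta, [sample])
    return theta


theta_gd = gradient_descent(0.0)
theta_sgd = sgd(0.0)
print(theta_gd, theta_sgd)
```

Each SGD step costs one sample instead of 200, which is the whole point when the "corpus" has billions of windows.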

**Word2Vec**
- **bag of words model:** 
    - makes the same predictions at each position
    - we want a model that gives a reasonably high probability estimate to ALL words that occur in the context

- why 2 vectors?
    - easier optimization. average both at the end
    - but can implement the algorithm with just one vector per word...and it helps a bit

- 2 model variants:
    - 1. **skip-grams (SG)** - predict context words (position independent) given center word
    - 2. **continuous bag of words (CBOW)** - predict center word from (bag of) context words

- loss functions for training:
    - 1. **naive softmax** (simple but expensive loss function, when many output classes)
    - 2. more optimized variants like **hierarchical softmax**
    - 3. **negative sampling**



**skip-gram model with negative sampling**

- train binary logistic regressions to differentiate:
    - a true pair (center word and a word in its context window) vs.
    - several "noise" pairs (the center word paired with a random word)
- we take K negative samples (drawn according to word probabilities, as described in the sampling bullet below)
- maximize probability of real outside word; minimize probability of random words
    - minimize $J_{\text{neg-sample}}(u_o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{k \in \{K \text{ sampled indices}\}} \log \sigma(-u_k^\top v_c)$
        - note: uses the **logistic/sigmoid function** $\sigma$ rather than softmax
- sample with P(w) = U(w)^(3/4) / Z: the unigram probability U(w) of each word raised to the 3/4 power, then renormalized by Z
    - still reflects word frequency
    - but ups the probability of less frequent words
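
A small sketch of the negative sampling loss and the 3/4-power sampling rule, assuming plain Python lists as vectors. The word counts, the 2-d vectors, and the helper names (`neg_sampling_loss`, `sampling_distribution`) are made up for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_loss(u_o, v_c, negatives):
    """-log sigma(u_o . v_c) - sum over k of log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


def sampling_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: wt / z for w, wt in weights.items()}


counts = {"the": 900, "bank": 90, "aardvark": 10}  # made-up counts
unigram = {w: c / sum(counts.values()) for w, c in counts.items()}
probs = sampling_distribution(counts)

# Draw K = 5 noise words from the smoothed distribution.
random.seed(0)
words = list(probs)
noise_words = random.choices(words, weights=[probs[w] for w in words], k=5)

# Hypothetical vectors: the loss is low when the true outside word aligns
# with the center word and the noise words point away from it.
v_c = [1.0, 0.0]
u_o = [2.0, 0.0]
good = neg_sampling_loss(u_o, v_c, [[-2.0, 0.0], [-1.0, 1.0]])
bad = neg_sampling_loss(u_o, v_c, [[2.0, 0.0], [1.0, 1.0]])
print(probs, noise_words, good, bad)
```

With these counts the rare word gets a larger share than under raw unigram sampling and the most frequent word a smaller one, which is the effect the 3/4 exponent is meant to have.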



**Co-occurrence**

- why not capture co-occurrence counts directly?
    - instead of iterating through the whole corpus
    - building a co-occurrence matrix X
        - 2 options: window vs. full document
        - window: similar to word2vec, use a window around each word --> captures some syntactic and semantic information ("word space")
        - full document: a word-document co-occurrence matrix gives general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis" ("document space")
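
A minimal sketch of the window-based option, assuming whitespace tokenization and a symmetric window. The three-sentence corpus and the function name `cooccurrence_counts` are illustrative only.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count symmetric co-occurrences within `window` tokens on each side."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts


corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
X = cooccurrence_counts(corpus, window=1)
print(X[("I", "like")], X[("like", "I")], X[("I", "enjoy")])
```

Storing only the nonzero pairs in a `Counter` is one way to exploit the sparsity of X; a dense vocabulary-by-vocabulary array would hold the same information.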

- simple count co-occurrence vectors
    - vectors increase in size with vocabulary
    - very high dimensional: require a lot of storage (though sparse)
    - subsequent classification models have sparsity issues --> models are less robust

- low-dimensional vectors
    - idea: store "most" of the important information in a fixed, small number of dimensions: a dense vector
    - usually 25-1000 dimensions, similar to word2vec


- how to reduce the dimensionality?
    - classic method: singular value decomposition (SVD) of co-occurrence matrix X
        - factorizes X into U * _SIGMA_ * V_transpose, where U and V are orthonormal (columns are unit vectors and mutually orthogonal) and _SIGMA_ is diagonal, holding the singular values
        - retain only the k largest singular values, in order to generalize
        - _X_hat_ is the best rank k approximation to X, in terms of least squares
        - classic linear algebra result. expensive to compute for large matrices
        - problem: running an SVD on raw counts doesn't work well
            - function words (the, he, has) are too frequent --> syntax has too much impact
        - solution: 
            - scaling the counts in the cells can help A LOT
                - log the frequencies
                - min(X,t), with t approx. 100
                - ignore the function words
            - ramped windows that count closer words more than farther away words
            - use pearson correlations instead of counts, then set negative values to 0
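
A sketch of the SVD route with log-scaled counts, assuming NumPy is available. The 4x4 count matrix is made up, and `log1p` (log of 1 + count) is one assumed way to "log the frequencies" without hitting log(0).

```python
import numpy as np

# Made-up symmetric co-occurrence counts for 4 words, for illustration only.
X = np.array([
    [0.0, 120.0, 3.0, 1.0],
    [120.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 9.0],
    [1.0, 4.0, 9.0, 0.0],
])

X_scaled = np.log1p(X)  # log(1 + count) dampens very frequent pairs

# Thin SVD: X_scaled = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

k = 2
word_vectors = U[:, :k] * S[:k]                 # one k-dimensional row per word
X_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation of X_scaled

# The least-squares error of the rank-k approximation equals the norm of the
# discarded singular values.
error = np.linalg.norm(X_scaled - X_hat)
print(word_vectors.shape, error)
```

Using `U[:, :k] * S[:k]` as the word vectors is one common convention; some treatments use `U[:, :k]` alone.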

                

**Encoding meaning components in vector differences**

- how can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
    - **log-bilinear model with vector differences:**
        - w_i . w_j = log[P(i|j)]
        - w_x . (w_a - w_b) = log[P(x|a) / P(x|b)]
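
As a short worked step, the second identity follows from the first by linearity of the dot product, assuming w_x . w_a = log[P(x|a)] holds for every pair:

```latex
\begin{aligned}
w_x \cdot (w_a - w_b)
  &= w_x \cdot w_a - w_x \cdot w_b \\
  &= \log P(x \mid a) - \log P(x \mid b) \\
  &= \log \frac{P(x \mid a)}{P(x \mid b)}
\end{aligned}
```

So a ratio of co-occurrence probabilities becomes a difference of vectors, which is what makes it a linear meaning component.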

**How to evaluate word vectors?**

- a general concept of evaluation in NLP: intrinsic vs. extrinsic
- **intrinsic:**
    - evaluation on a specific/intermediate subtask
    - fast to compute
    - helps to understand that system
    - not clear if really helpful unless correlation to real task is established
- **extrinsic**
    - evaluation on a real task
    - can take a long time to compute accuracy
    - unclear if the subsystem is the problem or its interaction or other subsystems
    - if replacing exactly one subsystem with another improves accuracy --> winning!

**Intrinsic word vector evaluation**

- word vector analogies
    - evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
    - discard the input words from the search!
    - problem: what if the information is there but not linear?
- word similarity: word vector distances and their correlation with human judgments
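
A toy sketch of the analogy evaluation, assuming hand-made 2-d vectors chosen so that the arithmetic works out exactly. Real evaluations use trained vectors and a large set of analogy questions; the `vectors` table and the `analogy` helper are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word d maximizing cos(x_b - x_a + x_c, x_d), excluding a, b, c."""
    target = [xb - xa + xc for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hand-made vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}

# man : king :: woman : ?
print(analogy("man", "king", "woman", vectors))
```

The `candidates` filter is the "discard the input words" step: with trained vectors the nearest neighbor of the target is often one of the three input words.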

**Extrinsic word vector evaluation**

- **named entity recognition (NER):** identifying references to a person, organization, or location
    - find and classify names in text by labeling word tokens

    - simple NER: window classification using binary logistic classifier
        - idea: classify each word in its context window of neighboring words
        - train logistic classifier on hand-labeled data to classify center word (y/n) for each class based on a concatenation of word vectors in a window
        - to classify all words: run classifier for each class on the vector centered on each word in the sentence
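
A sketch of the window classifier's input and scoring, assuming NumPy, random (untrained) embeddings and weights, and a zero vector as padding at the sentence edges. The sizes, the `LOC` label, and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4      # word vector size (illustrative)
window = 2   # words on each side of the center word

sentence = ["museums", "in", "Paris", "are", "amazing"]
embeddings = {w: rng.normal(size=dim) for w in sentence}
pad = np.zeros(dim)  # assumed padding vector for positions off the sentence edge


def window_vector(tokens, i):
    """Concatenate the vectors of the 2 * window + 1 tokens centered on position i."""
    parts = []
    for j in range(i - window, i + window + 1):
        parts.append(embeddings[tokens[j]] if 0 <= j < len(tokens) else pad)
    return np.concatenate(parts)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One binary classifier per class; these weights are random, not trained.
weights = {"LOC": rng.normal(size=dim * (2 * window + 1))}

x_paris = window_vector(sentence, 2)
p_loc = sigmoid(weights["LOC"] @ x_paris)

# To label the whole sentence, slide the window over every position.
scores = [sigmoid(weights["LOC"] @ window_vector(sentence, i)) for i in range(len(sentence))]
print(x_paris.shape, p_loc, scores)
```

In a trained system the weights (and possibly the embeddings) would be fit on the hand-labeled data, and there would be one such classifier for each entity class.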

**Word senses and word sense ambiguity**

- most words have lots of meanings
    - especially common words
    - especially words that have existed for a long time
- does one vector capture all these meanings or do we have a mess?

- improving word representations via global context and multiple word prototypes
    - idea: cluster word windows around words, retrain with each word assigned to multiple different clusters (bank1 - money, bank2 - river, etc.)

- linear algebraic structure of word senses
    - different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec
    - because of ideas from sparse coding, you can actually separate out the senses

**Neural classification**

- typical softmax classifier 
    - learned parameters _theta_ are just elements of W (not input representation x, which has sparse symbolic features)
    - problem: classifier gives linear decision boundary, which can be limiting

- **neural network classifier**:
    - we learn both W and (distributed) representations for words
    - word vectors x re-represent one-hot vectors, moving them around in an intermediate vector space, for easy classification with a (linear) softmax classifier
        - we have an embedding layer
    - we use deep networks, more layers, that let us re-represent and compose our data multiple times giving a non-linear classifier
    - 1. x (input) = [x_museums x_in x_Paris x_are x_amazing]
    - 2. h = f(Wx + b), where f() is the activation function
    - 3. score s = u_transpose * h
    - 4. predicted model prob of class = J_t(_theta_) = _sigmoid_(s) = 1 / (1 + e^(-s))
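
The four steps above can be sketched as a single forward pass, assuming NumPy, random untrained parameters, `tanh` as the activation f, and a hidden size of 8. All of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=20)        # step 1: concatenated window vector (5 words x 4 dims)
W = rng.normal(size=(8, 20))   # hidden layer weights
b = np.zeros(8)                # hidden layer bias
u = rng.normal(size=8)         # output weights

h = np.tanh(W @ x + b)         # step 2: h = f(Wx + b), with tanh as the assumed f
s = u @ h                      # step 3: score s = u_transpose * h
p = 1.0 / (1.0 + np.exp(-s))   # step 4: sigmoid(s), predicted probability of the class
print(h.shape, s, p)
```

Without the non-linearity f, the two layers would collapse into a single linear map, so the decision boundary would still be linear in x.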

- training with cross entropy loss
    - cross entropy = H(p,q) = -SUM( p(c) * log[q(c)] )
    - since the true distribution p is one-hot (1 at the right class, 0 everywhere else), only one term of the sum survives: loss = negative log of the model's probability of the true class y_i = -log[q(y_i)]
    - PyTorch: torch.nn.CrossEntropyLoss()
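
A small numeric check that the full cross entropy and the "negative log prob of the true class" shortcut agree for a one-hot target, written in plain Python rather than PyTorch. The logits are made up. In PyTorch, `torch.nn.CrossEntropyLoss` takes the raw scores (logits) and applies the log-softmax itself.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum over c of p(c) * log q(c), skipping terms where p(c) = 0."""
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)


logits = [2.0, 0.5, -1.0]   # made-up scores for 3 classes
q = softmax(logits)         # model distribution
true_class = 0
p = [1.0 if c == true_class else 0.0 for c in range(len(logits))]  # one-hot target

loss_full = cross_entropy(p, q)
loss_short = -math.log(q[true_class])
print(loss_full, loss_short)
```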

- neural network = running several logistic regressions at the same time
    - it is the final loss function that will direct what the intermediate hidden variables should be to predict targets for next layer well
    - allows us to re-represent and compose our data multiple times and to learn a classifier that is highly non-linear in terms of the original inputs
