# NLP 
-------------------
## 1. Basic Embedding Model

### Neural Network Language Model (Predict Next Word)

[A Neural Probabilistic Language Model - Bengio (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

__Statistical language models__ learn the joint probability function of sequences of words in a language. The curse of dimensionality is a major challenge for this task: the number of possible word combinations is enormous.

Traditional models were n-gram based: they generalize by concatenating short, overlapping sequences of words seen in the training set.

__Learning a distributed representation for words__: the model learns a representation for each word together with the probability function for word sequences. Unseen sentences can still be assigned a sensible probability if they are composed of words whose representations are similar to those of previously seen words.

#### Intro
The curse of dimensionality is a fundamental problem for language models. The task is much harder for discrete variables because the conditional probability may not exhibit local smoothness, and there are many, many combinations of variables.

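$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1})$$

where $w_1^{t-1}$ denotes the sequence $(w_1, \dots, w_{t-1})$.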

A **statistical language model** can be represented by the conditional probability of the next word given all the previous ones. The modeling problem becomes easier when word order is taken into account.

**n-grams** exploit the idea that temporally closer words in a sequence are statistically more dependent (i.e. certain words tend to occur next to each other). n-gram models construct conditional probabilities of the next word for each of a large number of contexts (i.e. combinations of the last n-1 words).

What happens when a new combination of n words appears
that was not seen in the training corpus?
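One way to see the problem is a count-based bigram model. The sketch below is only an illustration under stated assumptions: the toy corpus and the helper names `bigram_prob` and `sentence_prob` are hypothetical, and no smoothing or back-off is applied. It shows that a sentence built from seen bigrams gets a non-zero probability, while a single unseen bigram drives the whole product to zero.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count every bigram and every context word (a word that has a successor).
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word); 0.0 for an unseen context."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

def sentence_prob(words):
    """Product of P(w_t | w_{t-1}); the probability of the first word is ignored."""
    prob = 1.0
    for prev_word, word in zip(words[:-1], words[1:]):
        prob *= bigram_prob(prev_word, word)
    return prob

p_seen = sentence_prob("the cat sat".split())              # appears in the corpus
p_glued = sentence_prob("the dog sat on the mat".split())  # new sentence, seen bigrams
p_zero = sentence_prob("the cat ran".split())              # contains an unseen bigram
print(p_seen, p_glued, p_zero)
```

Practical n-gram models avoid the hard zero with smoothing or back-off, but they still cannot exploit the fact that two different words may be similar.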

#### Distributed Representations

1. associate with each word in the vocabulary a *distributed word feature vector* (a real-valued vector in $\mathbb{R}^m$),

2. express the *joint probability function* of word sequences in terms of the feature vectors of these words in the sequence, and

3. learn simultaneously the word feature vectors and the parameters of that probability function.
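As a small illustration of step 1, a word's feature vector can be stored as one row of a matrix and fetched by index, which gives the same result as multiplying a one-hot vector by that matrix. This is a sketch under assumptions: the vocabulary, the size m = 2, the matrix values, and the helper names are all made up for the example.

```python
vocab = ["I", "like", "love", "hate"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Hypothetical |V| x m feature matrix with m = 2; the values are made up.
C = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.6, -0.1],
    [-0.5, 0.4],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def lookup_by_matmul(word):
    """Feature vector obtained as one_hot(word) times C."""
    row = one_hot(word_to_id[word], len(vocab))
    m = len(C[0])
    return [sum(row[i] * C[i][j] for i in range(len(vocab))) for j in range(m)]

def lookup_by_index(word):
    """Feature vector obtained by reading the matching row of C."""
    return C[word_to_id[word]]

vec_a = lookup_by_matmul("love")
vec_b = lookup_by_index("love")
print(vec_a, vec_b)
```

Both routes return the same vector; the index lookup simply skips the multiplications by zero.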

The feature vector represents different aspects of the word: each word is associated with a point
in a vector space. The probability function is expressed as a product of conditional probabilities of the next word given the previous ones, (e.g. using a multilayer neural network to predict the next word given the previous ones, in the experiments). This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data.

In the proposed model, generalization occurs because “similar” words are expected to have similar feature vectors, and because the probability function is a smooth function of these feature values, so a small change in the features induces a small change in the probability. Therefore, the presence of a single sentence in the training data increases the probability not only of that sentence but also of its combinatorial number of “neighbors” in sentence space.
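The smoothness argument can be checked numerically with a toy probability function. The sketch below is an assumption-laden illustration, not the model from the paper: the 2-D feature vectors and output weights are chosen by hand, and `next_word_probs` is a hypothetical helper that applies a softmax to dot-product scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(context_vec, output_weights):
    """Toy smooth probability function: softmax of dot products with each weight row."""
    scores = [sum(w * x for w, x in zip(row, context_vec)) for row in output_weights]
    return softmax(scores)

# Hypothetical feature vectors, chosen by hand so that "cat" and "dog" are close.
features = {"cat": [0.90, 0.10], "dog": [0.85, 0.15], "the": [-0.70, 0.60]}
# Hypothetical output weights, one row per candidate next word.
output_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

p_cat = next_word_probs(features["cat"], output_weights)
p_dog = next_word_probs(features["dog"], output_weights)
p_the = next_word_probs(features["the"], output_weights)

# Largest difference in predicted probability between two contexts.
gap_cat_dog = max(abs(a - b) for a, b in zip(p_cat, p_dog))
gap_cat_the = max(abs(a - b) for a, b in zip(p_cat, p_the))
print(gap_cat_dog, gap_cat_the)
```

With these made-up numbers the two nearby vectors produce nearly identical distributions, while the distant vector produces a clearly different one.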

In the model proposed here, instead of characterizing the similarity with a discrete random or deterministic variable (which corresponds to a soft or hard partition of the set of words), we use a continuous real-valued vector for each word.


#### A Neural Model

We decompose the function $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ into two parts:

1. A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vectors associated with each word in the vocabulary. In practice, $C$ is represented by a $|V| \times m$ matrix of free parameters.

2. The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word $w_t$.
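The two parts above can be sketched as a forward pass in plain Python. This is a minimal illustration under assumptions, not the paper's exact architecture: it uses a single tanh hidden layer of the form softmax(b + U tanh(d + Hx)), the sizes `m` and `n_hidden` are arbitrary, the weights are random and untrained, and the helper names are hypothetical.

```python
import math
import random

random.seed(0)

vocab = ["I", "like", "love", "hate", "burritos", "coffee", "cheese"]
word_to_id = {w: i for i, w in enumerate(vocab)}

m = 3          # feature vector size (assumed)
n_context = 2  # number of context words, i.e. n - 1
n_hidden = 4   # number of hidden units (assumed)

def rand_matrix(rows, cols):
    """Matrix of small random values; stands in for untrained parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(len(vocab), m)            # part 1: |V| x m feature matrix
H = rand_matrix(n_hidden, n_context * m)  # input-to-hidden weights
d = [0.0] * n_hidden                      # hidden bias
U = rand_matrix(len(vocab), n_hidden)     # hidden-to-output weights
b = [0.0] * len(vocab)                    # output bias

def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def g(context_vectors):
    """Part 2: map the context feature vectors to a distribution over the vocabulary."""
    x = [value for vec in context_vectors for value in vec]  # concatenate the vectors
    hidden = [math.tanh(h + bias) for h, bias in zip(matvec(H, x), d)]
    scores = [s + bias for s, bias in zip(matvec(U, hidden), b)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def f(context_words):
    """Compose the lookup C with g to get P(next word | context words)."""
    return g([C[word_to_id[w]] for w in context_words])

probs = f(["I", "like"])
print(len(probs), sum(probs))
```

Because the weights are untrained, the distribution is close to uniform; training would adjust C, H, d, U and b jointly.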

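The function $g$ may be implemented by a feed-forward or recurrent neural network or another parametrized function, with parameters $\omega$. The overall parameter set is $\theta = (C, \omega)$. Training is achieved by looking for the $\theta$ that maximizes the penalized log-likelihood of the training corpus:

$$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$$

where $R(\theta)$ is a regularization term.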



In [1]:
import tensorflow as tf  # the code below uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session)
import numpy as np

In [None]:
sentences = ["I like burritos", "I love coffee", 'I hate cheese']

word_list = ' '.join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)} #word encodings
number_dict = {i: w for i, w in enumerate(word_list)}

n_class = len(word_dict)

n_step = 2  # number of context words fed to the model (n - 1)
n_hidden = 2  # number of hidden units

def make_batch(sentences):
    input_batch = []
    target_batch = []
    for sen in sentences:
        word = sen.split()  # split the sentence into a list of words
        context_ids = [word_dict[n] for n in word[:-1]]  # indices of every word except the last
        target = word_dict[word[-1]]  # index of the target (last) word
        # one-hot encode the context words: array of shape (len(word) - 1, n_class)
        input_batch.append(np.eye(n_class)[context_ids])
        # one-hot encode the target word: array of shape (n_class,)
        target_batch.append(np.eye(n_class)[target])
    return input_batch, target_batch

#----------------------------------------------
#TENSORFLOW
#model
X = tf.placeholder(tf.float32, [None, n_step, n_class])  # one-hot context words, [batch_size, n_step, n_class]
Y = tf.placeholder(tf.float32, [None, n_class])  # one-hot target word, [batch_size, n_class]

# concatenate the one-hot vectors of each context into one row: [batch_size, n_step * n_class]
flat_input = tf.reshape(X, shape=[-1, n_step * n_class])
H = tf.Variable(tf.random_normal([n_step * n_class, n_hidden]))  # input-to-hidden weights
d = tf.Variable(tf.random_normal([n_hidden]))  # hidden-layer bias, one value per hidden unit
U = tf.Variable(tf.random_normal([n_hidden, n_class]))  # hidden-to-output weights
b = tf.Variable(tf.random_normal([n_class]))  # output bias, one value per word class

# hidden layer: tanh(d + xH); d is the bias that shifts each hidden unit's pre-activation
tanh = tf.nn.tanh(d + tf.matmul(flat_input, H))
# output layer: unnormalized scores (logits) for each word class, tanh(...)U + b
model = tf.matmul(tanh, U) + b

# softmax turns the logits into a probability distribution over word classes; cross-entropy
# compares it with the one-hot label Y, and reduce_mean averages the loss over the batch
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=model, labels=Y))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)  # Adam with learning rate 0.001
prediction = tf.argmax(model, 1)  # index of the highest-scoring word class

#training
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

input_batch, target_batch = make_batch(sentences) #here n_step is 2 since all sentences have length 3 (predict last word)
for epoch in range(10000):
    _,loss = sess.run([optimizer, cost], feed_dict={X:input_batch, Y:target_batch})
    if (epoch+1)%1000==0:
        print("Epoch:", '%04d' %(epoch+1), 'cost =','{:.6f}'.format(loss))

#Predictions
predict = sess.run([prediction], feed_dict={X: input_batch})

Input = [sen.split()[:2] for sen in sentences]
print(Input,'->', [number_dict[n] for n in predict[0]])

In [23]:
#--------------------------------------------
#PYTORCH

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

dtype = torch.FloatTensor

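class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()
        # C maps a word index to its feature vector (the |V| x m matrix of the paper).
        # The original call had no arguments; nn.Embedding needs n_class rows, and reusing
        # n_hidden as the embedding size m is an assumption, not a value from the notes.
        self.C = nn.Embedding(n_class, n_hidden)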

### Word2Vec (Skip-Gram) (Embedding Words and Show Graph)

[Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al. (2013)](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)


**Distributed vector representations** learn a continuous vector for each word in a language that can be reused across all contexts (distributed). The first implementation learned representations by predicting the next word given the prior ones, since its task was also to build a probabilistic model that predicts the next word. **If we double down on learning a representation, we can use skip-gram to encode more information about the context of a word.**

These embeddings encode syntactic and semantic word relationships.

#### Introduction
Distributed representations of words in a vector space help learning algorithms to achieve better
performance in natural language processing tasks by grouping similar words.

The skip-gram model was introduced to learn distributed representations from large amounts of unstructured text. Unlike previous implementations, training the skip-gram model does not involve dense matrix multiplication, which makes it efficient.

The word representations computed using neural networks are very interesting because the learned
vectors explicitly encode many linguistic regularities and patterns.


#### Skip-Gram Model
The objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or document.

<img src='img/skip-gram.png'>

More formally, given a sequence of training words $w_1, w_2, w_3, \dots, w_T$, the objective of the Skip-gram model is to maximize the average log probability
<img src='img/skip-func.png'>
where c is the size of the training context. A larger c results in more training examples and thus can lead to higher accuracy, at the expense of training time.
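The effect of the context size c on the number of training examples can be illustrated by enumerating (center word, context word) pairs. This is a sketch under assumptions: `skipgram_pairs` is a hypothetical helper, the sentence is made up, and the window is simply truncated at the sentence boundaries.

```python
def skipgram_pairs(tokens, c):
    """Return (center, context) pairs for every position t and offset -c <= j <= c, j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the center word itself and offsets that fall outside the sentence.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

tokens = "I like burritos and coffee".split()
pairs_c1 = skipgram_pairs(tokens, c=1)
pairs_c2 = skipgram_pairs(tokens, c=2)
print(len(pairs_c1), len(pairs_c2))
```

Each pair contributes one log-probability term to the objective, so widening the window from c=1 to c=2 increases the number of terms the model has to process for the same text.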


##### Hierarchical Softmax