# Section 3

## Review of Language Modeling and Neural Networks

- Language model using bigrams: `p(current_word | previous_word)`. This is a Markov model, a probabilistic model in which predictions depend only on the current state.
- Recast the bigram language model as a neural model, using logistic regression.
- Extend that model to a neural network.
- Improve the efficiency of the above with an implementation trick.

### Bigrams and language models

- Language model: a model of the probability of a sequence of words
  - given a sentence `s`, the language model gives `p(s)`
  - creating the model typically involves making assumptions about the structure of a specific language
- **Note:** A model is never 100% correct. It rests on assumptions, which may be:
  - Correct most of the time, incorrect sometimes
  - Incorrect most of the time, but still powerful
- Example: a map. "The map is not the territory"

- Bigram: two consecutive words in a sentence
- Trigrams: three consecutive words in a sentence
- N-grams: sequence of `n` consecutive words
- Bigram model: `p(w_t | w_t-1)`
    - given a set of documents
    - build the model by counting: # appearances `w_t-1 w_t` / # appearances `w_t-1`
- Set of documents: a file or set of files containing sentences, i.e. the training corpus.
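
The counting rule above can be sketched in a few lines. This is a minimal illustration, assuming a made-up three-sentence corpus; the names `corpus` and `bigram_prob` are hypothetical and not part of the notes.

```python
from collections import Counter

# Toy corpus (hypothetical): each sentence is a list of lowercase tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    # zip pairs each word with its successor within the same sentence
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev_word, word):
    """Estimate p(word | prev_word) as count(prev_word word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # prev_word never seen, so no estimate is available
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # 2 of the 3 occurrences of "the" precede "cat"
print(bigram_prob("cat", "sat"))  # 1 of the 2 occurrences of "cat" precedes "sat"
```

Bigrams are counted within each sentence only, so the last word of one sentence is never paired with the first word of the next.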

- Using bigram models we'll build a language model:
    - Chain rule of probability (repeated application of the definition of conditional probability): `p(A -> B -> C) = p(C|A -> B) * p(A -> B) = p(C|A -> B) * p(B|A) * p(A)`
    - `p(B|A)` is a Bigram model
    - `p(C|A -> B) = count(A -> B -> C) / count(A -> B)`
    - `p(A) = count(A) / corpus length`
    - For long sentences this becomes problematic
    
- Long sentences, say `"A B C D E F"`, may never appear in the training corpus, which gives `p(F | A B C D E) = 0` even though the sentence may be perfectly valid.
    - Add-one smoothing: `p_smooth(B|A) = (count(A -> B) + 1) / (count(A) + V)`
    - V: vocabulary size = number of distinct words
    - Adding V to the denominator keeps the distribution valid, i.e. summing `p_smooth(B|A)` over all possible words B gives 1.
    - The smoothing also guarantees a nonzero probability for every pair A and B.
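
As a quick check of the two properties above, here is a small sketch of add-one smoothing on a made-up corpus. One assumption is labeled in the code: `count(A)` is taken as the number of times A is followed by another word, which is what makes the probabilities sum to exactly 1. The names `prev_counts` and `smoothed_prob` are hypothetical.

```python
from collections import Counter

# Toy corpus (hypothetical)
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

vocab = sorted({w for s in corpus for w in s})
V = len(vocab)  # number of distinct words

# Assumption: count(A) counts only occurrences of A that have a successor,
# so each row of the smoothed distribution normalizes exactly.
prev_counts = Counter()
bigram_counts = Counter()
for s in corpus:
    for prev_word, word in zip(s, s[1:]):
        prev_counts[prev_word] += 1
        bigram_counts[(prev_word, word)] += 1

def smoothed_prob(prev_word, word):
    """Add-one smoothing: (count(A -> B) + 1) / (count(A) + V)."""
    return (bigram_counts[(prev_word, word)] + 1) / (prev_counts[prev_word] + V)

print(smoothed_prob("the", "cat"))  # seen pair: (2 + 1) / (3 + 5)
print(smoothed_prob("cat", "dog"))  # unseen pair, still nonzero: 1 / (2 + 5)
print(sum(smoothed_prob("the", w) for w in vocab))  # sums to 1 up to rounding
```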
    
- Markov Assumption
    - What I see now depends *only* on what I saw in the previous step.
    - `p(w_t | w_t-1, w_t-2, ..., w_1) = p(w_t | w_t-1)`
    - Higher-order Markov models (second, third, ...) condition on the previous two, three, ... words instead of one.
    - `p(A B C D E) = p(E | D) * p(D | C) * p(C | B) * p(B | A) * p(A)`
    - All elements are bigrams!
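
The factorization above can be written as a short function. This is a sketch on a made-up corpus using unsmoothed estimates, with `p(A) = count(A) / corpus length` for the first word as defined earlier; `markov_sentence_prob` is a hypothetical name, and the function assumes a non-empty sentence.

```python
from collections import Counter

# Toy corpus (hypothetical)
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

unigram_counts = Counter(w for s in corpus for w in s)
bigram_counts = Counter(pair for s in corpus for pair in zip(s, s[1:]))
corpus_length = sum(unigram_counts.values())

def markov_sentence_prob(sentence):
    """p(w_1) times the product of p(w_t | w_{t-1}) for t >= 2, unsmoothed."""
    prob = unigram_counts[sentence[0]] / corpus_length
    for prev_word, word in zip(sentence, sentence[1:]):
        if unigram_counts[prev_word] == 0:
            return 0.0  # unseen context word
        prob *= bigram_counts[(prev_word, word)] / unigram_counts[prev_word]
    return prob

print(markov_sentence_prob(["the", "cat", "sat"]))  # (3/9) * (2/3) * (1/2)
print(markov_sentence_prob(["the", "dog", "ran"]))  # 0.0, "dog ran" never occurs
```

The second call shows why smoothing matters: a single unseen bigram zeroes out the whole sentence.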

- Long sentences are rare, but short sequences, like bigrams, may be really common.

Load the Brown corpus using NLTK

In [6]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [12]:
from nltk.corpus import brown
len(brown.sents())

57340

In [19]:
sentences = brown.sents()
print('\n\n'.join(' '.join(s) for s in sentences[:5]))

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .

The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .

`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .

The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .


Create a bigram language model with add-one smoothing. 
- Use lower case preprocessing.

Hints: 
- Use log probabilities to avoid underflow to 0
- Normalize each sentence's log probability by dividing it by the sentence length $T$, so that sentences of different lengths are comparable

$$
\frac{1}{T} \log p(w_1, \ldots, w_T) = \frac{1}{T} \left[ \log p(w_1) + \sum_{t=2}^{T} \log p(w_t \mid w_{t-1}) \right]
$$
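
To see why the log form matters, the sketch below multiplies 400 made-up per-word probabilities of 0.01 each. The values are purely illustrative and do not come from the Brown corpus.

```python
import math

# Hypothetical per-word probabilities for a 400-word sequence
probs = [0.01] * 400

# Direct product: the true value (1e-800) is far below what float64 can hold
product = 1.0
for p in probs:
    product *= p
print(product)  # underflows to 0.0

# Sum of logs stays well within range; dividing by T normalizes for length
log_prob = sum(math.log(p) for p in probs)
print(log_prob / len(probs))  # equals log(0.01) up to rounding
```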

Test your model
- Compare the probability of a real sentence from the corpus vs. a fake sentence (randomly generated words)
- Compare a fake sentence vs. a custom valid sentence written by me.
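
One possible shape for the exercise, shown on a tiny made-up corpus so it runs without downloading anything. The functions `train` and `score` are hypothetical names. Smoothing the first-word probability is an assumption as well, since the notes only define smoothing for bigrams. Passing `brown.sents()` to `train` instead of the toy list should work the same way.

```python
import math
import random
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams over lowercased sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = [w.lower() for w in s]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams):
    """Length-normalized log probability with add-one smoothing."""
    tokens = [w.lower() for w in sentence]
    V = len(unigrams)           # vocabulary size
    N = sum(unigrams.values())  # corpus length in tokens
    # Assumption: the first word uses a smoothed unigram estimate
    logp = math.log((unigrams[tokens[0]] + 1) / (N + V))
    for prev_word, word in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + V))
    return logp / len(tokens)

# Toy corpus (hypothetical)
corpus = [
    ["The", "cat", "sat", "on", "the", "mat"],
    ["The", "dog", "sat", "on", "the", "rug"],
    ["A", "cat", "ran", "to", "the", "dog"],
]
unigrams, bigrams = train(corpus)

real = corpus[0]
rng = random.Random(0)  # fixed seed so the fake sentence is reproducible
fake = rng.choices(sorted(unigrams), k=len(real))  # random words from the vocabulary

print("real:", " ".join(real), score(real, unigrams, bigrams))
print("fake:", " ".join(fake), score(fake, unigrams, bigrams))
```

Scores are negative, and a score closer to zero means the model finds the sentence more plausible.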