# Assignment-02
## Probability Models, a First Look: An Introduction to Language Models

### 1. Review the course online programming code. 

Please refer to `Lecture-02.ipynb`

### 2. Review the main points of this lesson. 

##### 1. How to use GitHub, and why do we use Jupyter and PyCharm?

**GitHub**: the most commonly used Git commands:
```bash
git init
git add [file_name]
git commit -m "commit message"
git push origin [branch name] 
```
[Cheat sheet](https://education.github.com/git-cheat-sheet-education.pdf)

**Jupyter Notebook**  
Used for writing scripts. It is well suited to presenting code alongside explanations and visualizations, and to iterating quickly on ideas.

**PyCharm**  
A full-fledged IDE used for building projects. It integrates a debugger, version control (VCS), code profiling, and many other features that help with development.

##### 2. What is a probability model?

A probability model is a mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event. The sample space S for a probability model is the set of all possible outcomes.  
reference: [probability models](http://www.stat.yale.edu/Courses/1997-98/101/probint.htm)
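
As a minimal illustration of the three ingredients named above (sample space, events, probabilities), here is a sketch that assumes a fair six-sided die. The die and the helper name `event_probability` are chosen for this example only and do not come from the lecture.

```python
from fractions import Fraction

# Sample space S: every possible outcome of rolling one fair die (assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

# Probability of each outcome; a fair die makes them all equal.
prob = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

def event_probability(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(prob[outcome] for outcome in event)

even = {2, 4, 6}
print(event_probability(even))          # 1/2
print(event_probability(sample_space))  # 1
```

Using `Fraction` keeps the arithmetic exact, so the probabilities over the whole sample space sum to exactly 1.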

##### 3. Can you come up with some scenarios in which we could use a probability model?

* language model
* weather prediction
* gambling
* disease diagnosis based on symptoms
* sales forecasting

##### 4. Why do we use probability, and what are the difficulties of programming based on parsing and pattern matching?

Many real-world problems are probabilistic in nature. In such cases, probability models can represent the problem more easily and more precisely than rule-based models.  
  
Difficulties of parsing and pattern matching:
* Defining and programming the patterns is complex, especially when the application scenario is complicated.
* The rules are inflexible: input that no pattern anticipates is not handled.

##### 5. What is a language model?

A (probabilistic) language model is a probability distribution over sequences of words. Given such a sequence, say of length $m$, it assigns a probability $P(w_1, w_2, \ldots, w_m)$ to the whole sequence.  
*Elaboration:* Assume that the probability of a word depends only on the previous $n - 1$ words. This is known as an `n-gram` model, or a unigram model when $n = 1$. The unigram model is also known as the `bag-of-words` model.  
reference: [Language Model](https://en.wikipedia.org/wiki/Language_model)
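
The n-gram assumption can be read as a truncation of the chain rule of probability. The identity below is the standard textbook form rather than anything specific to this course. Here $n$ is the n-gram order, so each word is conditioned on at most the $n - 1$ words before it:

$$
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
$$

The first equality is exact. The approximation on the right is the modeling assumption, and it is what makes the probabilities estimable from counts in a finite corpus.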

#####  6. Can you come up with some scenarios in which we could use a language model?

* speech recognition
* Optical Character Recognition (OCR)
* text understanding
* machine translation
* information retrieval

##### 7. What is the 1-gram language model?

Assume that the probability of each word depends only on that word's own probability in the document, so the model consists of one-state finite automata as units. The probabilities over the model's entire vocabulary sum to 1. In mathematical terms: 
$$P(w_1, w_2, \ldots, w_m) = P(w_1)P(w_2) \ldots P(w_m)$$
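
To make the formula concrete, here is a small sketch on a made-up nine-word corpus (the corpus and the function name `unigram_prob` are assumptions for illustration; the assignment itself uses Wikipedia). Each $P(w_i)$ is estimated as the word's relative frequency.

```python
from collections import Counter
from fractions import Fraction
from math import prod

# Toy corpus, made up for illustration only.
corpus = "the cat sat on the mat the dog sat".split()
counts = Counter(corpus)
total = len(corpus)

def unigram_prob(sentence):
    """P(w_1, ..., w_m) as the product of relative word frequencies."""
    return prod(Fraction(counts[w], total) for w in sentence.split())

print(unigram_prob("the cat sat"))  # 2/243
# Reordering the words leaves the product unchanged.
print(unigram_prob("sat cat the") == unigram_prob("the cat sat"))  # True
```

Here $P(\text{the}) = 3/9$, $P(\text{cat}) = 1/9$ and $P(\text{sat}) = 2/9$, whose product is $2/243$. A word that never occurs in the corpus gets probability 0 in this sketch, which is the situation smoothing is meant to handle.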

##### 8. What are the advantages and disadvantages of the 1-gram language model?

Advantages:
* Simple and effective  
  
Disadvantages:
* The sparse representation may cause scalability problems
* It ignores the context of words and loses word-order information

##### 9. What is the 2-gram model?

Assume that the probability of observing the $i$th word $w_i$ given the context history of the preceding $i - 1$ words can be approximated by the probability of observing it given the shortened context history of the single preceding word. In mathematical terms:
$$P(w_1, w_2, \ldots, w_m) = P(w_1)P(w_2 | w_1) \ldots P(w_m | w_{m-1})$$
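
A minimal sketch of this factorization on a made-up corpus follows. The corpus and the names `cond_prob` and `bigram_prob` are assumptions for illustration. Each conditional probability is the maximum-likelihood estimate $count(w_{i-1} w_i) / count(w_{i-1})$, with no smoothing applied.

```python
from collections import Counter
from fractions import Fraction

# Toy corpus, made up for illustration only.
tokens = "the cat sat on the mat the dog sat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def cond_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0 if prev was never seen."""
    if unigrams[prev] == 0:
        return Fraction(0)
    return Fraction(bigrams[(prev, word)], unigrams[prev])

def bigram_prob(sentence):
    """P(w_1) * P(w_2 | w_1) * ... * P(w_m | w_{m-1})."""
    words = sentence.split()
    p = Fraction(unigrams[words[0]], len(tokens))
    for prev, word in zip(words, words[1:]):
        p *= cond_prob(prev, word)
    return p

print(bigram_prob("the cat sat"))  # 1/9
print(bigram_prob("sat cat the"))  # 0
```

Unlike a unigram model, the bigram model is sensitive to word order: the scrambled sentence contains the unseen bigram "sat cat" and therefore gets probability 0 without smoothing.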

##### 10. What is a web crawler, and can you implement a simple one?

A web crawler is a program or automated script that systematically collects content from the web. More specifically, it starts from a seed page, extracts the links on that page, follows those links to other pages, and keeps going.  
**A simple crawler**:
1. Get the response for a URL taken from the list of URLs to crawl: `response = requests.get(url)`
2. Extract the information we need from the response: `re.findall(regex_pattern, response.text)`
3. Update the list of URLs to crawl
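
The three steps above can be sketched as a breadth-first loop. This is an illustrative sketch, not the course's implementation. The names `crawl`, `fetch_with_requests` and `LINK_PATTERN`, the `max_pages` limit, and the in-memory `site` used for the demo are all assumptions. The default fetcher assumes the third-party `requests` package is installed.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplistic link pattern; a real HTML parser would be more robust.
LINK_PATTERN = re.compile(r'href="([^"#]+)"')

def fetch_with_requests(url):
    """Default fetcher; assumes the third-party `requests` package is available."""
    import requests
    return requests.get(url, timeout=10).text

def crawl(seed_url, fetch=fetch_with_requests, max_pages=10):
    """Breadth-first crawl from seed_url; returns {url: html} for visited pages."""
    to_crawl = deque([seed_url])
    pages = {}
    while to_crawl and len(pages) < max_pages:
        url = to_crawl.popleft()
        if url in pages:
            continue
        html = fetch(url)                          # step 1: get the response
        pages[url] = html
        for href in LINK_PATTERN.findall(html):    # step 2: extract the links
            link = urljoin(url, href)
            if link not in pages:
                to_crawl.append(link)              # step 3: update the list
    return pages

# Demo on a fake in-memory site, so the example runs without network access.
site = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "no links here",
}
pages = crawl("http://example.test/", fetch=lambda u: site.get(u, ""))
print(list(pages))
```

Passing the fetcher in as a parameter keeps the crawl logic testable offline. Calling `crawl(url)` without the `fetch` argument would use `requests` against the live web.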

##### 11. Some issues can make writing a crawler difficult. What are they, and how do we solve them?

1. JavaScript-rendered websites  
*Solution:*  
In general, the crawler needs to behave like a browser: let all the content load, and only then fetch the HTML to parse. We may use `Selenium` (which drives a real browser so the crawler interacts with a website as a human would) or `WebKit` (an open-source web browser engine) to crawl such websites.
2. Anti-crawler measures that ban access from a particular IP address or user ID  
*Suggestion:*  
Be nice and follow the website's crawling policies. Slow the crawl down; some crawlers also disguise their requests by rotating IPs through proxy services.
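
One way to follow a site's crawling policy is to consult its `robots.txt` before each request and honor any requested delay. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `USER_AGENT` name, and the helper `polite_fetch` are hypothetical and chosen only for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a real site one would instead call
# parser.set_url("https://example.com/robots.txt") and then parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

USER_AGENT = "my-crawler"  # hypothetical crawler name

def polite_fetch(url, fetch):
    """Call fetch(url) only if robots.txt allows it, then wait before returning."""
    if not parser.can_fetch(USER_AGENT, url):
        return None
    html = fetch(url)
    # Use the site's requested delay, or 1 second if none is specified.
    time.sleep(parser.crawl_delay(USER_AGENT) or 1)
    return html

print(parser.can_fetch(USER_AGENT, "https://example.com/public/page"))    # True
print(parser.can_fetch(USER_AGENT, "https://example.com/private/data"))   # False
print(parser.crawl_delay(USER_AGENT))                                     # 2
```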

##### 12. What is a regular expression, and how is it used?

A regular expression is a sequence of characters that defines a search pattern.  
reference: [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression)  
Regular expressions are widely used for text matching in text-processing tasks such as web crawling.
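
A short sketch of typical usage with Python's built-in `re` module follows. The sample sentence is made up for this example: `re.findall` returns all matches, and `re.sub` replaces them.

```python
import re

text = "Python 3.0 was released in 2008; Python 3.8 followed in 2019."

# \d{4} matches exactly four digits; \b anchors the match at word boundaries.
years = re.findall(r"\b\d{4}\b", text)

# Parentheses define a capture group, so only the version number is returned.
versions = re.findall(r"Python (\d+\.\d+)", text)

# Remove every character that is not a letter, a digit, or a space.
cleaned = re.sub(r"[^A-Za-z0-9 ]", "", text)

print(years)     # ['2008', '2019']
print(versions)  # ['3.0', '3.8']
print(cleaned)   # Python 30 was released in 2008 Python 38 followed in 2019
```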

### 3. Use the Wikipedia dataset to build the language model.
English corpus: https://dumps.wikimedia.org/enwiki/20190320/

#### Create (part of) English Wikipedia Corpus 

In [2]:
import numpy as np
import pandas as pd
import re
import os
import glob
from collections import Counter

In [None]:
def create_corpus(path, outfile):
    """Concatenate the plain texts extracted by WikiExtractor and save as a .csv file."""
    text = []
    for d in glob.glob(path):
        for file in os.listdir(d):
            # Avoid UnicodeDecodeError: `errors='ignore'` drops any bytes
            # that cannot be decoded with the default encoding.
            with open(os.path.join(d, file), 'r', errors='ignore') as f:
                text += f.readlines()
    # Keep only non-blank lines.
    df = pd.DataFrame([line for line in text if line.strip()], columns=["text"])
    df.to_csv(outfile, index=False)

In [None]:
enwiki_path = "data/enwiki/*"
outfile = "data/enwiki_corpus.csv"
create_corpus(enwiki_path, outfile)

#### Load data

In [3]:
df = pd.read_csv("data/enwiki_corpus.csv")
text = df["text"].tolist()

In [4]:
def remove_html_tag(line):
    """Remove html tags in a string."""
    pattern = re.compile("<[^>]*>")
    return re.sub(pattern, "", line)

def remove_punctuation(line):
    """Remove special characters in a string."""
    pattern = re.compile("[^A-Za-z0-9- ]")
    return re.sub(pattern, "", line)

def preprocess(text):
    """Trim whitespaces, remove html tags and special characters of a list of strings."""
    return [remove_punctuation(remove_html_tag(line.strip().lower())) 
            for line in text]

In [7]:
def get_tokens(text):
    """Get all tokens in a list of strings."""
    # Iterate over the argument rather than the global `cleaned_text`.
    return [word
            for line in text if line.strip()
            for word in line.split()]

In [8]:
cleaned_text = preprocess(text)
VOCABULARY = get_tokens(cleaned_text)

#### Language Model

##### Some utility functions

In [9]:
def get_n_gram_counts(words, n):
    """Calculate the frequency of n-gram."""
    n_gram_phrases = [' '.join(words[i:i+n]) for i in range(len(words)-(n-1))]
    n_gram_count = Counter(n_gram_phrases)
    n_gram_total = len(n_gram_phrases)
    return (n_gram_count, n_gram_total)

In [None]:
# def get_prob(word):
#     """Calculate the probability of a single word."""
#     if word in words_count:
#         return words_count[word] / words_total
#     else:
#         return 1. / words_total # deal with OOV words

In [10]:
def get_joint_prob(n, *args):
    """Estimate the joint probability of the n-gram formed by args."""
    # Look up the module-level tables, e.g. _2_gram_count and _2_gram_total.
    count = globals()['_{}_gram_count'.format(n)]
    total = globals()['_{}_gram_total'.format(n)]
    _n_gram = ' '.join(args)
    # Unseen n-grams are treated as if they occurred once (OOV fallback).
    return count.get(_n_gram, 1) / total
    
def get_conditional_prob(w1, w2):
    """Calculate conditional probability P(w2|w1)."""
    return get_joint_prob(2, w1, w2) / get_joint_prob(1, w1)

#### Unigram Model

In [12]:
_1_gram_count, _1_gram_total = get_n_gram_counts(VOCABULARY, 1)
# unigram_count = Counter(VOCABULARY)
# unigram_total = len(VOCABULARY) # sum([f for w, f in words_count.most_common()])

In [13]:
def product(nums):
    """Calculate the product of all numbers."""
    return np.prod(nums)

In [14]:
def unigram_model(sen):
    """Return the probability of the sentence using unigram language model."""
    words = sen.strip().lower().split()
    return product([get_joint_prob(1, w) for w in words])

In [15]:
unigram_model("Today is a sunny day")

6.975108238625902e-17

In [16]:
unigram_model("tomorrow will rain")

8.819117218767523e-14

#### 2-gram Model

In [17]:
_2_gram_count, _2_gram_total = get_n_gram_counts(VOCABULARY, 2)

In [18]:
def two_gram_model(sen):
    """Return the probability of the sentence using 2-gram language model."""
    prob = 1.
    words = sen.strip().lower().split()
    for i, w in enumerate(words):
        if i == 0:
            prob *= get_joint_prob(1, w)
        else:
            prob *= get_conditional_prob(words[i-1], w)
            
    return prob

In [21]:
two_gram_model("how are you")

4.815572708081116e-10

In [None]:
two_gram_model("how do you do")

#### 3-gram Model

In [22]:
_3_gram_count, _3_gram_total = get_n_gram_counts(VOCABULARY, 3)

In [None]:
def three_gram_model(sen):
    """Return the probability of the sentence using 3-gram language model."""
    prob = 1.
    words = sen.strip().lower().split()
    for i, w in enumerate(words):
        if i == 0:
            prob *= get_joint_prob(1, w)
        elif i == 1:
            prob *= get_conditional_prob(words[i-1], w)
        else:
            # P(w_i | w_{i-2}, w_{i-1}) = P(w_{i-2}, w_{i-1}, w_i) / P(w_{i-2}, w_{i-1})
            prob *= (get_joint_prob(3, words[i-2], words[i-1], w) /
                     get_joint_prob(2, words[i-2], words[i-1]))
    return prob

In [None]:
three_gram_model("Today is a beautiful day")

### 4. Try some interesting sentence pairs, and check whether your model fits them

In [25]:
need_compared = [
    "One morning I shot an elephant & One morning I eat an elephant",
    "I went to China last Month & I went to Antarctica last Month",
    "The computer is running & The computer is walking"
]

for s in need_compared:
    s1, s2 = s.split('&')
    p1, p2 = two_gram_model(s1), two_gram_model(s2)
    
    better = s1 if p1 > p2 else s2
    
    print('{} is more probable'.format(better))
    print('-'*4 + ' {} with probability {}'.format(s1, p1))
    print('-'*4 + ' {} with probability {}'.format(s2, p2))

One morning I shot an elephant  is more probable
---- One morning I shot an elephant  with probability 1.0977121145145774e-18
----  One morning I eat an elephant with probability 4.142264610773665e-19
 I went to Antarctica last Month is more probable
---- I went to China last Month  with probability 3.411796448061578e-17
----  I went to Antarctica last Month with probability 5.4131492098626475e-17
The computer is running  is more probable
---- The computer is running  with probability 2.4745655473949806e-11
----  The computer is walking with probability 2.986544626166356e-12


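The model ranks the first and third sentence pairs as expected. For the second pair, however, it scores "I went to Antarctica last Month" higher than "I went to China last Month", which does not match intuition.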

### 5. If we need to solve the following problems, how can a language model help us? 

+ Speech Recognition.  
+ Sogou *pinyin* input.
+ Auto correction in search engine. 
+ Anomaly detection.

Use a language model to calculate the probability of a particular word sequence (the input). If the probability is abnormally small, something might be wrong with the input, and the language model can suggest a more probable word sequence as a correction.

### Compared to the previously learned parsing and pattern-matching approaches, what are the advantages and disadvantages of probability-based methods? 

**Advantages**:  
+ Easier to program and more flexible.
+ Can deal with unseen word sequences.

**Disadvantages**:  
+ Highly dependent on the quality of the corpus. The model performs poorly if the corpus is bad.
+ The probability of a sentence decreases as the sentence gets longer, even though longer sentences are not necessarily less plausible.   
When sentences are long, all probabilities are approximately 0, which makes it hard to tell which sentence is more likely to be correct.
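
One common workaround for the length problem, which is not part of the assignment code, is to sum log probabilities instead of multiplying raw probabilities, and to divide by the sentence length when comparing sentences. The sketch below uses hypothetical per-word probabilities. The function names `log_prob` and `per_word_log_prob` are assumptions for illustration.

```python
import math

def log_prob(word_probs):
    """Sum of log probabilities; avoids underflow from multiplying small numbers."""
    return sum(math.log(p) for p in word_probs)

def per_word_log_prob(word_probs):
    """Length-normalized score, comparable across sentences of different lengths."""
    return log_prob(word_probs) / len(word_probs)

# Hypothetical per-word probabilities for a short and a long sentence.
short_sentence = [1e-3] * 3
long_sentence = [1e-3] * 200

print(math.prod(long_sentence))   # 0.0, the raw product underflows
print(log_prob(long_sentence))    # about -1381.55, still usable
print(math.isclose(per_word_log_prob(short_sentence),
                   per_word_log_prob(long_sentence)))  # True
```

With normalization, the two sentences score the same per word, so length alone no longer decides the comparison.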

###  How to solve the *OOV* problem?

Some words may not appear in our dictionary or corpus. When using a language model, we need to handle this `out-of-vocabulary` (OOV) problem. Many researchers have proposed solutions to it. 

**Q1: How did you solve this problem in your programming task?**  
We set the probability of an OOV n-gram to $1 / \text{total count of n-grams}$.

**Q2: Read about the 'Good-Turing Estimator'. Can you explain the main points of this method and implement it in your programming task?**  
Good-Turing frequency estimation is a statistical technique for estimating the probability of encountering hitherto unseen species, given a set of past observations of objects from different species. For our language model, it can be used to address the OOV problem mentioned above. It is one of the commonly used smoothing methods.  
**Main point**: reallocate the probability mass of n-grams that occur $r + 1$ times in the training data to the n-grams that occur $r$ times. In particular, reallocate the probability mass of n-grams that were seen once to the n-grams that were never seen. In mathematical terms:
$$
P_r = \frac{(r+1)S(N_{r+1})}{N \, S(N_r)}
$$
where $N_r$ is the number of distinct n-grams that occur exactly $r$ times, $N$ is the total number of observed n-grams, and $S(\cdot)$ is the smoothed frequency of frequencies. For small $r$, $S(N_r) = N_r$; for large $r$, $S(N_r)$ is read off a linear fit $\log N_r = a + b \log r$.
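
As a worked example of the formula, the sketch below applies it to a made-up list of ten observations using the unsmoothed case $S(N_r) = N_r$. The data and the names `good_turing_prob` and `p_unseen_total` are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction

# Toy observations: a, b, c seen once; d, e seen twice; f seen three times.
words = "a b c d d e e f f f".split()
counts = Counter(words)           # r for each word
N = sum(counts.values())          # total number of observations (10)
Nr = Counter(counts.values())     # N_r: how many words were seen exactly r times

def good_turing_prob(r):
    """Probability of one word seen r >= 1 times, with S(N_r) = N_r (no smoothing)."""
    return Fraction((r + 1) * Nr[r + 1], N * Nr[r])

# Probability mass reserved for all unseen words together: N_1 / N.
p_unseen_total = Fraction(Nr[1], N)

print(p_unseen_total)        # 3/10
print(good_turing_prob(1))   # 2/15
print(good_turing_prob(2))   # 3/20
print(good_turing_prob(3))   # 0
```

The mass adds up to 1: $3/10$ for unseen words, $3 \times 2/15$ for the words seen once, and $2 \times 3/20$ for the words seen twice. The word seen three times gets probability 0 because $N_4 = 0$, which is why the counts $N_r$ are smoothed with $S(\cdot)$ for large $r$.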

Reference: 
+ https://www.wikiwand.com/en/Good%E2%80%93Turing_frequency_estimation
+ https://github.com/Computing-Intelligence/References/blob/master/NLP/Natural-Language-Processing.pdf, Page-37

In [26]:
def log_smoothed(r, slope, intercept):
    """Return the smoothed frequency of frequencies, exp(intercept + slope * log(r))."""
    return np.exp(intercept + slope * np.log(r))

In [32]:
def good_turing_estimator(words):
    """Fit log(N_r) = intercept + slope * log(r) and return (intercept, slope)."""
    # frequencies vector
    words_count = Counter(words)
    # frequency of frequencies vector
    count_freq = sorted(Counter(words_count.values()).items())
    # simple linear regression, estimate the slope and intercept
    r, nr = map(list, zip(*count_freq))
    mu_r, mu_nr = sum(np.log(r)) / len(r), sum(np.log(nr)) / len(nr)
    slope = sum([(np.log(r[i]) - mu_r) * (np.log(nr[i]) - mu_nr) for i in range(len(r))]) / \
            sum([(np.log(r[i]) - mu_r) ** 2 for i in range(len(r))])
    intercept = mu_nr - slope * mu_r
        
    return intercept, slope   

In [33]:
def get_good_turing_prob(word):
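    """Calculate the Good-Turing probability of a single word."""
    global slope, intercept
    global words_count, words_total
    r = words_count.get(word, 0)
    if r == 0:
        # Unseen word: (r + 1) * S(N_{r+1}) / N with r = 0, i.e. S(N_1) / N.
        nr_next = log_smoothed(r+1, slope, intercept)
        return (r + 1) * nr_next / words_total
    else:
        # Seen word: plain relative frequency (no discounting is applied).
        return r / words_total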

In [None]:
words_count, words_total = get_n_gram_counts(VOCABULARY, 1)
# good_turing_estimator returns (intercept, slope), in that order.
intercept, slope = good_turing_estimator(VOCABULARY)

In [35]:
# word exists
get_good_turing_prob("an")

0.004094423562529852

In [None]:
# unseen word
get_good_turing_prob("subsittute")