# Natural Language Processing
## 6️⃣ Language Model



### Language Model

A **language model** is a model that assigns a probability to a given sentence, that is, how likely the sentence is to occur in the text data.

This probability can be calculated as the product of the conditional probabilities of each word given all the words that precede it (the chain rule).

$$P(sentence1) = P(word_1) * P(word_2 | word_1) * P(word_3 | word_1, word_2) * P(word_4 | word_1, word_2, word_3) * P(word_5 | word_1, word_2, word_3, word_4) * ...$$
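
As a concrete instance of this chain rule, a four-word sentence such as "this is a dog" (chosen here only for illustration) would factorize as:

$$P(\text{this is a dog}) = P(\text{this}) * P(\text{is} | \text{this}) * P(\text{a} | \text{this}, \text{is}) * P(\text{dog} | \text{this}, \text{is}, \text{a})$$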

### N-gram Language Model

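Unlike the full model above, which conditions each word on all preceding words, an N-gram language model predicts the next word **based only on the previous *N-1* words**. For example, with N = 3 (a trigram model), the sentence probability is approximated as: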

$$P(sentence1) \approx P(word_1) * P(word_2 | word_1) * P(word_3 | word_1, word_2) * P(word_4 | word_2, word_3) * P(word_5 | word_3, word_4) * ...$$

Each N-gram conditional probability is estimated from **n-gram frequencies in the data**.

$$P(word_3 | word_1, word_2) = \frac{\text{frequency of }word_1, word_2, word_3\text{ in the entire data}}{\text{frequency of }word_1, word_2\text{ in the entire data}}$$
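
The frequency ratio above can be sketched directly in code. This is a minimal illustration, not a definitive implementation: the toy `corpus` and the helper names `count_ngrams` and `trigram_prob` are assumptions made for this example, and unseen contexts simply return 0 instead of being smoothed.

```python
from collections import Counter

# Toy corpus, assumed here only for illustration.
corpus = ['this is a dog', 'this is a cat', 'this is my horse']

def count_ngrams(docs, n):
    """Count every run of n adjacent words in the corpus."""
    counts = Counter()
    for line in docs:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def trigram_prob(w1, w2, w3, bigram_counts, trigram_counts):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # context never seen; this sketch returns 0 rather than smoothing
    return trigram_counts[(w1, w2, w3)] / context

bigram_counts = count_ngrams(corpus, 2)
trigram_counts = count_ngrams(corpus, 3)

# "this is a" occurs twice and "this is" occurs three times, so the estimate is 2/3.
p = trigram_prob('this', 'is', 'a', bigram_counts, trigram_counts)
print(p)
```

A `Counter` returns 0 for missing keys, which keeps the lookup for unseen n-grams free of `KeyError` handling.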

The example below uses the simplest case, a bigram model (N = 2): it counts the unigrams and bigrams in a small corpus and multiplies the bigram conditional probabilities of a sentence.

```python
data = ['this is a dog', 'this is a cat', 'this is my horse', 'my name is elice', 'my name is hank']

def count_unigram(docs):
    unigram_counter = dict()
    # Save the frequency of every unigram in the data to unigram_counter.
    for line in docs:
        for word in line.split():
            unigram_counter[word] = unigram_counter.get(word, 0) + 1
    return unigram_counter

def count_bigram(docs):
    bigram_counter = dict()
    # Save the frequency of every bigram (pair of adjacent words) to bigram_counter.
    for line in docs:
        words = line.split()
        for bigram in zip(words, words[1:]):
            bigram_counter[bigram] = bigram_counter.get(bigram, 0) + 1
    return bigram_counter

def cal_prob(sent, unigram_counter, bigram_counter):
    # Multiply P(word | previous_word) over the sentence.
    # The probability of the first word itself is not included.
    words = sent.split()
    result = 1.0
    for previous_word, word in zip(words, words[1:]):
        # Unseen bigrams count as 0 instead of raising a KeyError.
        top = bigram_counter.get((previous_word, word), 0)
        bottom = unigram_counter.get(previous_word, 0)
        if bottom == 0:
            return 0.0  # previous_word never appears in the data
        result *= top / bottom
    return result

unigram_counter = count_unigram(data)
bigram_counter = count_bigram(data)

print(cal_prob("this is elice", unigram_counter, bigram_counter))
```

```
0.2
```
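
Since an N-gram model is described above as predicting the next word, the same bigram counts can also be used to pick the most frequent follower of a word. The following is a rough sketch under that reading; `build_next_word_table` and `predict_next` are hypothetical helper names introduced for this example.

```python
from collections import Counter, defaultdict

# Same toy corpus as in the bigram example.
corpus = ['this is a dog', 'this is a cat', 'this is my horse',
          'my name is elice', 'my name is hank']

def build_next_word_table(docs):
    """Map each word to a Counter of the words that directly follow it."""
    table = defaultdict(Counter)
    for line in docs:
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def predict_next(word, table):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

table = build_next_word_table(corpus)
print(predict_next('this', table))  # "is" follows "this" in all three occurrences
print(predict_next('my', table))    # "name" (2 times) beats "horse" (1 time)
print(predict_next('dog', table))   # None: "dog" only appears at the end of a sentence
```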


### RNN Language Model

We can also use an **RNN** to build a language model.

Given each word in a sentence, the RNN is trained on the task of predicting the word that follows it.

In addition, by working with character-level data, the model can process and generate words that did not appear in the training data.
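
To make the training task concrete, the sketch below builds the (input word, target word) pairs for one sentence: the target sequence is the input sequence shifted by one position. It only shows how the pairs are formed, not the RNN itself, and `make_training_pairs` is a hypothetical helper name.

```python
def make_training_pairs(sentence):
    """Pair every word with the word that follows it in the sentence."""
    words = sentence.split()
    # Inputs are all words except the last; targets are all words except the first.
    return list(zip(words[:-1], words[1:]))

pairs = make_training_pairs('my name is elice')
for current_word, next_word in pairs:
    print(current_word, '->', next_word)
```

At each step the RNN receives the current word together with its hidden state, which summarizes the earlier words, so the prediction is not limited to a fixed window of N-1 words.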
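
A small sketch of why character-level data helps: a word that never appears in the corpus can still be encoded if all of its characters do. The corpus, the example word `monkey`, and the helper `encode_chars` are assumptions made for illustration.

```python
# Toy corpus, assumed here only for illustration.
corpus = ['this is a dog', 'this is a cat', 'this is my horse',
          'my name is elice', 'my name is hank']

# Word-level vocabulary: every distinct word in the corpus.
word_vocab = {word for line in corpus for word in line.split()}

# Character-level vocabulary: every distinct character, including the space.
char_vocab = sorted({ch for line in corpus for ch in line})
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}

def encode_chars(text, char_to_id):
    """Turn a string into a list of character ids."""
    return [char_to_id[ch] for ch in text]

new_word = 'monkey'
print(new_word in word_vocab)  # False: the word itself was never seen

# Every character of the word does occur in the corpus, so it can still be encoded.
ids = encode_chars(new_word, char_to_id)
decoded = ''.join(char_vocab[i] for i in ids)
print(decoded)
```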