# Building intelligent bots. Generative-based chatbots

In this section we work through only the n-gram model. Other methods are presented as demos of working open-source examples, because training them takes too long.

## N-gram

In this example, we use the Wall Street Journal corpus to generate new sentences. The corpus is available as part of the NLTK library, where `nltk.book` exposes it as `text7`. This example is based on [Erroll Wood's work](https://github.com/errollw/gengram).

In [None]:
from nltk.book import *

wall_street = text7.tokens

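We need to clean up the text and drop tokens that carry no meaning for us, such as stray punctuation and markup. The easiest way is to use a regular expression that keeps only tokens starting with a letter, a digit, or one of `.`, `!`, `?`: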

In [None]:
import re

tokens = wall_street

def cleanup():
    # Keep only tokens whose first character is a letter, a digit, or . ! ?
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match, tokens))
    return clean

tokens = cleanup()
#print(tokens)
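
To see what the filter does without loading the corpus, here is a minimal sketch on a handful of made-up tokens. The `sample_tokens` list is an assumption for illustration only; it merely imitates the kind of tokens found in the corpus.

```python
import re

# Hypothetical sample tokens (not taken from the corpus itself).
sample_tokens = ["Pierre", "Vinken", ",", "61", "years", "old", "*-1", ".", "``", "Why", "?"]

# Same character class as in cleanup(): a token is kept only if its
# first character is a letter, a digit, or one of . ! ?
pattern = re.compile(r"[a-zA-Z0-9.!?]")
kept = [tok for tok in sample_tokens if pattern.match(tok)]

print(kept)
# ['Pierre', 'Vinken', '61', 'years', 'old', '.', 'Why', '?']
```

Sentence-ending punctuation survives the filter on purpose, because the generator later uses it to detect the end of a sentence.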

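The next step is to build the n-grams: we slide a window of N tokens over the text and collect every run of N consecutive tokens (three at a time for N = 3). Note that `build_ngrams` reads the global `N`, which is set in the parameters cell further down, so call it only after that cell has run. You can print the n-grams to inspect them.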

In [None]:
def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    #print(ngrams)
    return ngrams
#build_ngrams()
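
The sliding window is easier to follow on a short list. The sketch below uses a hypothetical helper, `sliding_ngrams`, that takes the tokens and the window size as arguments instead of reading globals; it is an illustration, not a replacement for `build_ngrams`.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# Made-up example sentence, already tokenized.
sample = ["the", "market", "fell", "sharply", "."]
trigrams = sliding_ngrams(sample, 3)

print(trigrams)
# [['the', 'market', 'fell'], ['market', 'fell', 'sharply'], ['fell', 'sharply', '.']]
```

A list of 5 tokens yields 5 - 3 + 1 = 3 trigrams, which is where the `len(tokens)-N+1` bound in the loop comes from.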

The next step is to count, for each prefix made of the first N-1 tokens of an n-gram, how often each token follows it. If the same token follows the same prefix more than once, the counts are summed. There are 85826 n-grams and 54677 entries in the resulting frequency table.

In [None]:
def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts
#ngram_freqs(ngrams)
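
The shape of the frequency table is easiest to see on a toy input. The sketch below builds the same kind of nested mapping with `collections.defaultdict` and `collections.Counter`; the function name `follower_counts` and the sample trigrams are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def follower_counts(ngrams, sep=" "):
    """Map each (n-1)-token prefix to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for ngram in ngrams:
        prefix = sep.join(ngram[:-1])
        counts[prefix][ngram[-1]] += 1
    return counts

# Made-up trigrams; "the market" is followed by "fell" twice and "rose" once.
sample_ngrams = [
    ["the", "market", "fell"],
    ["the", "market", "rose"],
    ["the", "market", "fell"],
    ["market", "fell", "sharply"],
]
table = follower_counts(sample_ngrams)

print(dict(table["the market"]))
# {'fell': 2, 'rose': 1}
print(len(sample_ngrams), len(table))
# 4 2
```

Four n-grams collapse into two table entries here, which is the same effect that makes the frequency table smaller than the n-gram list for the corpus.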

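We choose the next word by looking up the N-1 most recent tokens in the frequency table and sampling one of their followers at random, with probability proportional to its count. The caller then appends the chosen word to the text.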

In [None]:
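import random

def next_word(text, N, counts):
    # Look up the N-1 most recent tokens of the text generated so far.
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    # Sample a follower with probability proportional to its count.
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r:
            return choice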
    assert False # should not reach here
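
The loop above is a hand-written weighted random choice. As a sketch of an alternative, the standard library's `random.choices` does the same job in one call. The helper name `sample_follower` and the toy `followers` dictionary are assumptions for illustration; this is not how the original example is written.

```python
import random

def sample_follower(followers, rng=random):
    """Pick one key of `followers` with probability proportional to its count."""
    words = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Toy follower counts: "fell" should come up about twice as often as "rose".
followers = {"fell": 2, "rose": 1}
rng = random.Random(0)  # seeded generator so that runs are repeatable
draws = [sample_follower(followers, rng) for _ in range(3000)]

share_fell = draws.count("fell") / len(draws)
print(round(share_fell, 2))  # expected to be close to 2/3
```

Passing a seeded `random.Random` instance is optional; it only makes the experiment reproducible.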

We need to set up a few parameters: the window size N, the number of sentences we want to generate, and the start of the text. The start string must consist of N-1 words that occur together as a prefix in our n-gram list. Note that the code lowercases it before use.

In [None]:
import random

N=3 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We have"

counts = ngram_freqs(ngrams)

if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

print(generated)
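
If the start string does not have exactly N-1 words, or does not occur as a prefix in the frequency table, `next_word` fails with a `KeyError` on the first lookup. The sketch below shows one way to check this up front. The helper `pick_start` and the toy table are hypothetical and not part of the original example.

```python
import random

def pick_start(counts, start_seq=None, n=3, sep=" ", rng=random):
    """Return a usable start prefix, or raise a descriptive error."""
    if start_seq is None:
        # No start given: fall back to a random known prefix.
        return rng.choice(list(counts.keys()))
    words = start_seq.split()
    if len(words) != n - 1:
        raise ValueError(f"start must have {n - 1} words, got {len(words)}")
    key = sep.join(words)
    if key not in counts:
        raise KeyError(f"prefix {key!r} does not occur in the n-grams")
    return key

# Toy frequency table for a 3-gram model (prefixes of two words).
toy_counts = {"we have": {"a": 1}, "have a": {"plan": 1}}

start = pick_start(toy_counts, "we have")
print(start)
# we have
```

The same check is useful when N changes, because a two-word start string no longer matches the longer prefixes of a larger model.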

## EXERCISE 3

There are three tasks:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the `.`, `!` or `?` that ends it.

Hint: for 2. and 3., implement a function called clean_generated() that takes the generated text and fixes both issues at once. It may be easier to fix the text after it is generated rather than making changes inside the while loop.

In [None]:
def clean_generated(text):
    # put your code here
    pass



N=3 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We have"

counts = ngram_freqs(ngrams)

if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

# put your code here:

print(generated)