<a href="https://colab.research.google.com/github/daoc-info/using_llms/blob/main/ngram_glm_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a simple generative language model based on N-grams (trigrams)

First we get all the modules, libraries and objects that we need to work with:
- *NLTK* is the Natural Language Toolkit module
- *movie_reviews* is a dataset from IMDb. We will use this data to train our model
- a *defaultdict* is a dictionary that returns a default value when a key does not exist. The default value is produced by a factory function, which we write here as a lambda. Our model will be a defaultdict.
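
As a quick illustration of the last point, here is a minimal sketch of how a defaultdict behaves. The name `counts` and the sample keys are made up for this example and are not part of the model built below.

```python
from collections import defaultdict

# The argument is a zero-argument "factory" that is called to build the default value
counts = defaultdict(lambda: 0)

counts["film"] += 1       # "film" is missing, so it starts from the default 0
print(counts["film"])     # 1
print(counts["unseen"])   # 0
print(len(counts))        # 2
```

Note that merely reading a missing key stores it with the default value, which is why the length is 2 at the end.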


In [None]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
from nltk import trigrams
from collections import defaultdict

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Our **model** will be a dictionary whose key is a tuple with the first two words of the trigram and whose value is another dictionary. If the key does not exist, it is created with an empty dictionary as its value.
This inner dictionary has the third word of the trigram as its key and the probability of that word as its value (it holds a raw count until we normalize it). If the key does not exist, it is created with the value 0.

In [None]:
model = defaultdict(lambda: defaultdict(lambda: 0))

First, we count every trigram in every sentence of the dataset. This gives us the frequency of each third word after each pair of words.

In [None]:
for sentence in movie_reviews.sents():
    for w1, w2, w3 in trigrams(sentence):
        model[(w1, w2)][w3] += 1

Then, we transform the frequencies into probabilities by dividing each count by the total count for its pair of words. The result answers the question: given words 1 and 2, what is the probability that word 3 comes next?

In [None]:
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
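
To check the counting and normalizing steps by hand, here is a sketch on a made-up corpus of three sentences. The names `toy_sentences` and `toy_model` are assumptions for illustration only, and `zip` over shifted copies of the sentence stands in for `nltk.trigrams` so the snippet runs without NLTK.

```python
from collections import defaultdict

# Hypothetical toy corpus, already tokenized (not taken from movie_reviews)
toy_sentences = [
    ["the", "film", "is", "great"],
    ["the", "film", "is", "long"],
    ["the", "film", "was", "great"],
]

toy_model = defaultdict(lambda: defaultdict(int))

# Step 1: count how often each third word follows each pair of words
for sentence in toy_sentences:
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        toy_model[(w1, w2)][w3] += 1

# Step 2: divide each count by the total for its pair
for context, followers in toy_model.items():
    total = sum(followers.values())
    for w3 in followers:
        followers[w3] /= total

print(dict(toy_model[("the", "film")]))
print(dict(toy_model[("film", "is")]))
```

After "the film", "is" appears twice and "was" once, so they end up with probabilities 2/3 and 1/3. After "film is", "great" and "long" each get 1/2. The probabilities for any pair always add up to 1.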

Now let's try the words "the fantastic". The highest probability belongs to "special" (about 0.17, twice that of any other candidate), so the prediction is: "the fantastic special"

In [None]:
print(model["the", "fantastic"].items())

dict_items([('preview', 0.08333333333333333), ('jon', 0.08333333333333333), ('animation', 0.08333333333333333), ('special', 0.16666666666666666), (',', 0.08333333333333333), ('four', 0.08333333333333333), ('back', 0.08333333333333333), ('concept', 0.08333333333333333), ('"', 0.08333333333333333), ('work', 0.08333333333333333), ('through', 0.08333333333333333)])


And now, let's generate a longer sentence, let's say 10 words in total:
- first, we create a list with the first two words
- then, we loop 8 times, to complete 10 words. In each iteration:
  - we get the last two words in the list as a tuple (our model uses two-word tuples as keys in the outer dictionary)
  - then, we get the elements (items) of the inner dictionary and sort them from highest to lowest (`reverse=True`) by value (`key=lambda x:x[1]`). The value, or probability, is the second element (index 1) of every item. The key, or word, is the first element (index 0)
  - finally (still inside the loop), we take the word with the highest probability and add it to the list of words
- the very last thing is to print the list of words as a string, joining the elements with a space, which gives: "the fantastic special effects , and the film , and"

In [None]:
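new_sentence = "the fantastic".split()

# Two words are already in the list, so 8 iterations bring the total to 10
for _ in range(2, 10):
    # The last two words form the key into the outer dictionary
    w1_w2 = tuple(new_sentence[-2:])
    # (word, probability) pairs for that key, most probable first
    candidates = sorted(model[w1_w2].items(), key=lambda x:x[1], reverse=True)
    new_sentence.append(candidates[0][0])

print(" ".join(new_sentence))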

the fantastic special effects , and the film , and
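
One caveat about the loop above: if the last two words never appear together in the corpus, the inner dictionary is empty, the sorted list is empty too, and indexing `[0]` raises an `IndexError`. The following is a minimal sketch of one way to guard against that, not the only one. The function name `generate_greedy` and the tiny `toy` model are hypothetical, and `max` replaces the full sort because only the top word is needed.

```python
def generate_greedy(model, first, second, total_words=10):
    """Greedy generation that stops early when the last two words were never seen."""
    words = [first, second]
    while len(words) < total_words:
        # .get does not insert empty entries into a defaultdict for unseen pairs
        followers = model.get((words[-2], words[-1]))
        if not followers:
            break
        # The word with the highest probability
        words.append(max(followers, key=followers.get))
    return " ".join(words)

# Hypothetical hand-written model with the same shape as the one trained above
toy = {
    ("the", "film"): {"is": 0.75, "was": 0.25},
    ("film", "is"): {"great": 1.0},
}

print(generate_greedy(toy, "the", "film"))   # the film is great
print(generate_greedy(toy, "an", "unseen"))  # an unseen
```

The same function should accept the trained `model` in place of `toy`, since a defaultdict supports `.get` like any dictionary.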


Try different starting words to see how different sentences get generated. You will also notice that the generation is deterministic: you always get the same sentence when you start with the same two words. That's because we always take the most probable word. You can add some degree of randomness when choosing the next word and see how the sentences start to vary from one run to the next. That is, in principle, how some real-world models work. Do it as an exercise, and try different degrees of randomness. You will see how the model shifts from more precise to more creative or hallucinatory!
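
As a starting point for the exercise, here is one possible sketch, under the assumption that the "degree of randomness" is controlled by a `temperature` parameter. The function `sample_next`, the `temperature` name and the sample `followers` dictionary are all introduced here for illustration. Each probability is raised to the power `1 / temperature` and the results are used as weights for `random.choices`.

```python
import random

def sample_next(followers, temperature=1.0, rng=random):
    """Pick a next word at random, weighted by probability ** (1 / temperature)."""
    words = list(followers)
    weights = [followers[w] ** (1.0 / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical inner dictionary, shaped like model[("the", "fantastic")]
followers = {"special": 0.5, "preview": 0.25, "work": 0.25}

rng = random.Random(0)  # fixed seed only so that reruns of this sketch are repeatable
for temperature in (0.05, 1.0, 5.0):
    picks = [sample_next(followers, temperature, rng) for _ in range(10)]
    print(temperature, picks)
```

A temperature close to 0 makes the most probable word win almost every time, which approaches the deterministic behavior above. A temperature of 1 samples according to the model's own probabilities, and larger values flatten the differences between words. Extremely small temperatures can push every weight down to 0, in which case `random.choices` raises an error, so a real implementation would need to handle that case.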

One last thing to notice is that this model treats punctuation marks such as the comma as tokens too, which is why you see them in the final sentence. In the end, that is a design choice. You could easily get rid of all punctuation marks before building your model. Nevertheless, you should agree that, at least in the example sentence, the commas are rather well placed!
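
If you want to try the punctuation-free variant, a minimal sketch could filter each sentence before counting trigrams. The helper `drop_punctuation` is hypothetical, and it assumes that "punctuation" means tokens made up only of characters in Python's `string.punctuation`.

```python
import string

def drop_punctuation(sentence):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [
        tok for tok in sentence
        if not all(ch in string.punctuation for ch in tok)
    ]

print(drop_punctuation(["the", "film", ",", "however", ",", "is", "great", "."]))
# ['the', 'film', 'however', 'is', 'great']
```

In the counting loop you would then iterate over `trigrams(drop_punctuation(sentence))` instead of `trigrams(sentence)`. Keep in mind that words separated by a comma become neighbors after filtering, which changes the trigrams the model learns.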