# N-gram Models from Text for Language Models

**(C) 2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**

**Version:** 1.0, February 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

**Prerequisites:**

In [None]:
!pip install -U nltk

## Simple File Processing

We will make use of the following modules and functions:

In [1]:
import nltk
from collections import defaultdict, Counter

Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again. The following code prints the first 300 characters of the text in memory:

In [2]:
# the with block closes the file automatically after reading
with open("data/HOPG.txt", mode='r', encoding='utf-8') as ifile:
    text = ifile.read()
print(text[:300], "...")

A HOUSE OF POMEGRANATES




Contents:

The Young King
The Birthday of the Infanta
The Fisherman and his Soul
The Star-child




THE YOUNG KING




[TO MARGARET LADY BROOKE--THE RANEE OF SARAWAK]


It was the night before the day fixed for his coronation, and the
young King was sitting alone in his b ...


The optional parameters in the *open* function above define the **mode** of operations on the file and the **encoding** of the content. For example, setting the **mode** to **r** opens the file for *reading* only, which is the only operation we perform in the following code. Setting the **encoding** to **utf-8** declares that the content of the file is encoded using the [Unicode](https://en.wikipedia.org/wiki/Unicode) encoding scheme [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
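As a small illustration of the two parameters, the following sketch writes a short string to a temporary file and reads it back, once with the matching encoding and once with a mismatched one. The file name and the sample string are made up for this example and are not part of the notebook's data.

```python
import os
import tempfile

# hypothetical sample text containing non-ASCII characters
sample = "a naïve café résumé"

# hypothetical temporary file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# mode='w' permits writing only; the encoding controls how characters become bytes
with open(path, mode='w', encoding='utf-8') as ofile:
    ofile.write(sample)

# mode='r' permits reading only; the with block closes the file for us
with open(path, mode='r', encoding='utf-8') as ifile:
    restored = ifile.read()

# reading the same bytes with a mismatched encoding garbles the non-ASCII characters
with open(path, mode='r', encoding='latin-1') as ifile:
    garbled = ifile.read()

print(restored == sample)
print(garbled == sample)
```

The first comparison holds because the file is decoded with the encoding it was written in. The second does not, because each non-ASCII character was stored as two bytes in UTF-8 and is read back as two separate characters.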

With the text in memory, we can use the [NLTK](https://www.nltk.org/) module to work with frequency profiles and [n-grams](https://en.wikipedia.org/wiki/N-gram) over the tokens or words in the text.

First we lowercase the text, that is, we normalize all characters to lower case:

In [3]:
text = text.lower()
print(text[:300], "...")

a house of pomegranates




contents:

the young king
the birthday of the infanta
the fisherman and his soul
the star-child




the young king




[to margaret lady brooke--the ranee of sarawak]


it was the night before the day fixed for his coronation, and the
young king was sitting alone in his b ...


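At this point the text is still a single string of characters, so a frequency profile computed over it directly would count characters, not words or tokens. To generate a frequency profile over the words/tokens in the text, we need a **tokenizer**. [NLTK](https://www.nltk.org/) provides basic tokenization functions. Depending on the installation, they may require a one-time download of NLTK's Punkt tokenizer data with *nltk.download* (`punkt`, or `punkt_tab` in recent NLTK releases). We will use the *word_tokenize* function to generate a list of tokens: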

In [4]:
tokens = nltk.word_tokenize(text)

We can now print the first 20 tokens:

In [49]:
print(tokens[:20])

['a', 'house', 'of', 'pomegranates', 'contents', ':', 'the', 'young', 'king', 'the', 'birthday', 'of', 'the', 'infanta', 'the', 'fisherman', 'and', 'his', 'soul', 'the']
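A frequency profile over tokens can be sketched with `Counter` from the standard library. The example below uses a short hand-written token list in place of the full `tokens` list, so the name `toy_tokens` and its content are assumptions made for illustration.

```python
from collections import Counter

# toy token list, assumed to come from a tokenizer such as nltk.word_tokenize
toy_tokens = ['the', 'young', 'king', 'the', 'birthday', 'of', 'the', 'infanta']

# absolute frequencies: how often each token occurs
counts = Counter(toy_tokens)

# relative frequencies: each count divided by the total number of tokens
total_tokens = len(toy_tokens)
profile = {token: count / total_tokens for token, count in counts.items()}

print(counts['the'])
print(profile['the'])
```

Here `the` occurs 3 times among 8 tokens, so its relative frequency is 0.375. The relative frequencies of all tokens sum to 1.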


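To generate a bigram model, we use the NLTK *ngrams* function to enumerate all bigrams, count them, and relativize the counts by dividing each one by the total number of bigrams in the text. The resulting values are relative frequencies of each bigram over the whole text, not conditional probabilities of the second token given the first: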

In [22]:
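# enumerate all pairs of adjacent tokens
myTokenBigrams = nltk.ngrams(tokens, 2)
# bigramModel[t1][t2] holds the count, later the relative frequency, of the bigram (t1, t2)
bigramModel = defaultdict(Counter)
# a text with N tokens contains N - 1 bigrams
total = float(len(tokens) - 1)
for t1, t2 in myTokenBigrams:
    bigramModel[t1][t2] += 1
# relativize: divide every count by the total number of bigrams
for a in bigramModel:
    for b in bigramModel[a]:
        bigramModel[a][b] = bigramModel[a][b] / total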

We can now look up the continuations of "white", each with the relative frequency of the corresponding bigram:

In [50]:
print(bigramModel['white'])

Counter({'rose': 0.00020983056182132927, 'gold': 0.00020983056182132927, 'foam': 7.868646068299847e-05, 'as': 7.868646068299847e-05, 'hands': 5.245764045533232e-05, 'feet': 5.245764045533232e-05, 'girl': 2.622882022766616e-05, 'glare': 2.622882022766616e-05, 'blossoms': 2.622882022766616e-05, 'velvet': 2.622882022766616e-05, 'rose-tree': 2.622882022766616e-05, 'snow-wreaths': 2.622882022766616e-05, 'mule': 2.622882022766616e-05, 'berries': 2.622882022766616e-05, 'statues': 2.622882022766616e-05, 'ceiling': 2.622882022766616e-05, 'petal': 2.622882022766616e-05, 'stars': 2.622882022766616e-05, 'ivory': 2.622882022766616e-05, 'hand': 2.622882022766616e-05, 'teeth': 2.622882022766616e-05, 'arms': 2.622882022766616e-05, 'rocks': 2.622882022766616e-05, 'doves': 2.622882022766616e-05, 'smiling': 2.622882022766616e-05, 'grapes': 2.622882022766616e-05, 'house': 2.622882022766616e-05, 'alabaster': 2.622882022766616e-05, 'leather': 2.622882022766616e-05, 'pigeons': 2.622882022766616e-05, 'peacock
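The values above are small because every bigram count is divided by the number of all bigrams in the text. If conditional probabilities are wanted instead, each count can be divided by the number of bigrams that share the same first token, so that the continuations of one word sum to 1. The following is a minimal sketch of that alternative normalization on a made-up token list; it is not the normalization used in this notebook, and the names `toy_tokens`, `bigram_counts`, and `conditional` are introduced only for illustration.

```python
from collections import Counter, defaultdict

# made-up token list standing in for the tokenized text
toy_tokens = ['white', 'rose', ',', 'white', 'gold', ',', 'white', 'rose', '.']

# count bigrams, grouped by their first token
bigram_counts = defaultdict(Counter)
for t1, t2 in zip(toy_tokens, toy_tokens[1:]):
    bigram_counts[t1][t2] += 1

# normalize per context: divide by the number of bigrams that start with t1
conditional = {}
for t1, followers in bigram_counts.items():
    context_total = sum(followers.values())
    conditional[t1] = {t2: count / context_total for t2, count in followers.items()}

print(conditional['white'])
```

In the toy data "white" is followed twice by "rose" and once by "gold", so the conditional probabilities are 2/3 and 1/3.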

We can now generate a trigram frequency profile and relativize the frequencies:

In [26]:
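# enumerate all triples of adjacent tokens
myTokenTrigrams = nltk.ngrams(tokens, 3)
# trigramModel[(t1, t2)][t3] holds the count, later the relative frequency, of (t1, t2, t3)
trigramModel = defaultdict(Counter)
# a text with N tokens contains N - 2 trigrams
total = float(len(tokens) - 2)
for t1, t2, t3 in myTokenTrigrams:
    trigramModel[(t1, t2)][t3] += 1
# relativize: divide every count by the total number of trigrams
for a in trigramModel:
    for b in trigramModel[a]:
        trigramModel[a][b] = trigramModel[a][b] / total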

We can now look up the continuation of a word sequence "white rose":

In [27]:
print(trigramModel[('white', 'rose')])

Counter({',': 0.00013114754098360657, '.': 2.622950819672131e-05, 'in': 2.622950819672131e-05, 'to': 2.622950819672131e-05})
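The bigram and trigram cells follow the same pattern, so they can be generalized to any n. The sketch below does this without NLTK by sliding a window of width n over the tokens with `zip`. The function name `build_ngram_model` and the toy token list are hypothetical. Unlike the bigram cell above, the context key here is always a tuple, even when it contains a single token.

```python
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    """Map each (n-1)-token context to a Counter of relative n-gram frequencies."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # slide a window of width n over the token list
    windows = list(zip(*(tokens[i:] for i in range(n))))
    model = defaultdict(Counter)
    for window in windows:
        # the first n-1 tokens are the context, the last token is the continuation
        model[window[:-1]][window[-1]] += 1
    # relativize: divide every count by the total number of n-grams
    total = len(windows)
    for context in model:
        for token in model[context]:
            model[context][token] /= total
    return model

# made-up token list standing in for the tokenized text
toy_tokens = ['the', 'white', 'rose', ',', 'the', 'white', 'rose', '.']
toy_model = build_ngram_model(toy_tokens, 3)
print(toy_model[('white', 'rose')])
```

The toy list contains 6 trigrams. The context `('white', 'rose')` is continued once by a comma and once by a period, so each continuation has a relative frequency of 1/6.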


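In the following loop we set a start word, append its first bigram continuation, and then extend the sequence by 10 more tokens using the trigram model. The code takes the first continuation stored for each context, which is the one observed first in the text, not the most frequent one:

In [48]:
state = ['in']
# continuations of the start word, in the order they were first observed
bigrams = list(bigramModel[state[-1]].items())
state.append(bigrams[0][0])
print(state)
for n in range(10):
    # continuations of the last two tokens, in first-observed order
    continuation = list(trigramModel[tuple(state[-2:])].items())
    state.append(continuation[0][0])
    print(state)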

['in', 'his']
['in', 'his', 'beautiful']
['in', 'his', 'beautiful', 'chamber']
['in', 'his', 'beautiful', 'chamber', '.']
['in', 'his', 'beautiful', 'chamber', '.', 'his']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers', 'had']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers', 'had', 'all']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers', 'had', 'all', 'taken']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers', 'had', 'all', 'taken', 'their']
['in', 'his', 'beautiful', 'chamber', '.', 'his', 'courtiers', 'had', 'all', 'taken', 'their', 'leave']
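A variant that always picks the most frequent continuation can be sketched with `Counter.most_common`. The example below builds small bigram and trigram counts from a made-up token list, so it runs without the text file. All names in it are assumptions for illustration, and raw counts are used because dividing by a constant total does not change which continuation is the most frequent.

```python
from collections import Counter, defaultdict

# made-up token list standing in for the tokenized text
toy_tokens = "the rose and the king saw the king".split()

# bigram and trigram counts, keyed by context
toy_bigrams = defaultdict(Counter)
for t1, t2 in zip(toy_tokens, toy_tokens[1:]):
    toy_bigrams[t1][t2] += 1
toy_trigrams = defaultdict(Counter)
for t1, t2, t3 in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_trigrams[(t1, t2)][t3] += 1

state = ['the']
# most_common(1) returns a list with the single highest-count (token, count) pair
state.append(toy_bigrams[state[-1]].most_common(1)[0][0])
for _ in range(4):
    candidates = toy_trigrams[tuple(state[-2:])]
    if not candidates:
        # the last two tokens were never followed by anything in the data
        break
    state.append(candidates.most_common(1)[0][0])

print(state)
```

In the toy data "the" is first followed by "rose" but most often by "king", so this variant continues with "king" where a first-observed lookup would pick "rose". Because the choice is deterministic, the generated sequence can run into a cycle; sampling continuations in proportion to their frequencies would be one way to avoid that.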

