## Ngrams lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** September 17


### Background
The purpose of this lab is to explore ngram models. Ngram models are a good introduction to language models generally. Language models are probabilistic representations of language. Ngrams have the benefit of being easy to interrogate and relatively easy to understand (as compared to neural networks). 

In this lab, you will build an ngram model from the corpus of your choosing. The example uses 'The War of the Worlds' from Project Gutenberg, but there's also a section on using any text file on your computer.


#### Notes
This lab is based heavily on the [nltk documentation](https://www.nltk.org/api/nltk.lm.html)

In [351]:
import numpy as np
import re

import nltk
# if you haven't downloaded punkt before, you only need to run the line below once 
#nltk.download('punkt')
from nltk import word_tokenize
from nltk import sent_tokenize

from nltk.util import bigrams
from nltk.lm.preprocessing import padded_everygram_pipeline

import requests

# Part 1
An example of how ngrams are generated

In [352]:
# you will need to leverage the requests package
r = requests.get(r'https://www.gutenberg.org/cache/epub/36/pg36.txt')
war_ot_worlds = r.text

# first, remove unwanted new line and tab characters from the text
for char in ["\n", "\r", "\t"]:
    war_ot_worlds = war_ot_worlds.replace(char, " ")

# check
print(war_ot_worlds[:100])

﻿The Project Gutenberg eBook of The War of the Worlds        This ebook is for the use of anyone any


In [353]:
# remove the metadata at the beginning - this is slightly different for each book
war_ot_worlds = war_ot_worlds[1928:]
print(war_ot_worlds[:150])

BOOK ONE  THE COMING OF THE MARTIANS          I.  THE EVE OF THE WAR.      No one would have believed in the last years of the nineteenth century  tha


In [354]:
print(len(war_ot_worlds))

361520


In [355]:
# also need to remove the legalese from the end of the book

print(war_ot_worlds[342639:])

                                  *** END OF THE PROJECT GUTENBERG EBOOK THE WAR OF THE WORLDS ***                      Updated editions will replace the previous one—the old editions will  be renamed.    Creating the works from print editions not protected by U.S. copyright  law means that no one owns a United States copyright in these works,  so the Foundation (and you!) can copy and distribute it in the United  States without permission and without paying copyright  royalties. Special rules, set forth in the General Terms of Use part  of this license, apply to copying and distributing Project  Gutenberg™ electronic works to protect the PROJECT GUTENBERG™  concept and trademark. Project Gutenberg is a registered trademark,  and may not be used if you charge for an eBook, except by following  the terms of the trademark license, including paying royalties for use  of the Project Gutenberg trademark. If you do not charge anything for  copies of this eBook, complying with the trademark l

In [356]:
war_ot_worlds = war_ot_worlds[:342639]

In [357]:
print(war_ot_worlds[342615:])

nted me, among the dead.


#### Txt locally
If you'd rather use a file on your computer, here's the code -- you just need to save the text file in your local directory, and change the variables throughout. 

The example is a report from the [Congressional Research Service](https://www.everycrsreport.com/files/2020-11-10_R45178_62d6238caecf6c02ddf495be33b3439f09eed744.pdf) on AI and National Security.
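A minimal sketch of loading a local file, assuming the document has already been saved as a UTF-8 plain-text file. The helper name `load_local_text` and the file name in the comment are hypothetical, not part of the lab's code. Unlike the Gutenberg cell above, it collapses every run of whitespace into a single space.

```python
from pathlib import Path


def load_local_text(path):
    """Read a UTF-8 text file and collapse newlines, tabs, and repeated spaces."""
    raw = Path(path).read_text(encoding="utf-8")
    # str.split() with no arguments splits on any whitespace run
    return " ".join(raw.split())


# hypothetical usage, with a file saved next to the notebook:
# text = load_local_text("my_corpus.txt")
```

The returned string can then stand in for `war_ot_worlds` in the cells that follow.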

In [358]:
# this is simplified for demonstration
def sample_clean_text(text: str):
    # lowercase
    text = text.lower()
    
    # remove punctuation from text
    text = re.sub(r"[^\w\s]", "", text)
    
    # tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # return your tokens
    return tokens

# call the function
sample_tokens = sample_clean_text(text = war_ot_worlds)

# check
print(sample_tokens[:50])

['book', 'one', 'the', 'coming', 'of', 'the', 'martians', 'i', 'the', 'eve', 'of', 'the', 'war', 'no', 'one', 'would', 'have', 'believed', 'in', 'the', 'last', 'years', 'of', 'the', 'nineteenth', 'century', 'that', 'this', 'world', 'was', 'being', 'watched', 'keenly', 'and', 'closely', 'by', 'intelligences', 'greater', 'than', 'mans', 'and', 'yet', 'as', 'mortal', 'as', 'his', 'own', 'that', 'as', 'men']


In [359]:
# create bigrams from the sample tokens
my_bigrams = bigrams(sample_tokens)

# check
list(my_bigrams)[:15]

[('book', 'one'),
 ('one', 'the'),
 ('the', 'coming'),
 ('coming', 'of'),
 ('of', 'the'),
 ('the', 'martians'),
 ('martians', 'i'),
 ('i', 'the'),
 ('the', 'eve'),
 ('eve', 'of'),
 ('of', 'the'),
 ('the', 'war'),
 ('war', 'no'),
 ('no', 'one'),
 ('one', 'would')]
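For intuition, the pairing that `bigrams` performs can be sketched in plain Python: each token is matched with the token that follows it. This is an illustrative equivalent, not NLTK's own implementation, and `make_bigrams` is a made-up name.

```python
def make_bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# the first few tokens printed in the cell above
first_tokens = ["book", "one", "the", "coming", "of", "the", "martians"]
first_bigrams = make_bigrams(first_tokens)
print(first_bigrams)
```

A list of 7 tokens yields 6 bigrams, because the last token has no successor.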

# Part 2 - creating an ngram model


In [360]:
# 2 is for bigrams
n = 2
#specify the text you want to use
text = war_ot_worlds

Now we are going to use an NLTK shortcut for preprocessing. This will:
* pad all of the sentences with `<s>` and `</s>` to train on sentence boundaries, too.
* create both unigrams and bigrams
* create a training set and a full vocab to train on

We need to give it a pre-tokenized text (we'll use nltk's tokenizer)
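To make the three bullet points concrete, here is a rough plain-Python sketch of what padding and "everygram" extraction do to a single toy sentence. It only approximates the idea behind the pipeline; the function names and the toy sentence are assumptions for illustration.

```python
def pad_sentence(tokens, n):
    """Add n-1 start and end markers so ngrams at the boundaries exist."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)


def everygrams_up_to(tokens, n):
    """Return every ngram of length 1..n as a tuple."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(tuple(tokens[start:start + size]))
    return grams


toy_sentence = ["the", "martians", "came"]
padded = pad_sentence(toy_sentence, n=2)
toy_grams = everygrams_up_to(padded, n=2)
print(padded)
print(toy_grams)
```

With `n=2`, the padded sentence has 5 tokens, giving 5 unigrams and 4 bigrams, including `('<s>', 'the')` and `('came', '</s>')`, which is how sentence boundaries end up in the training data.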

In [361]:
# step 1: tokenize the text into sentences
sentences = nltk.sent_tokenize(text)

# step 2: tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# step 3: convert each word to lowercase
tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

# notice the sentence breaks in the first 4 sentences of the tokenized text
print(tokenized_text[0:4])

[['book', 'one', 'the', 'coming', 'of', 'the', 'martians', 'i', '.'], ['the', 'eve', 'of', 'the', 'war', '.'], ['no', 'one', 'would', 'have', 'believed', 'in', 'the', 'last', 'years', 'of', 'the', 'nineteenth', 'century', 'that', 'this', 'world', 'was', 'being', 'watched', 'keenly', 'and', 'closely', 'by', 'intelligences', 'greater', 'than', 'man', '’', 's', 'and', 'yet', 'as', 'mortal', 'as', 'his', 'own', ';', 'that', 'as', 'men', 'busied', 'themselves', 'about', 'their', 'various', 'concerns', 'they', 'were', 'scrutinised', 'and', 'studied', ',', 'perhaps', 'almost', 'as', 'narrowly', 'as', 'a', 'man', 'with', 'a', 'microscope', 'might', 'scrutinise', 'the', 'transient', 'creatures', 'that', 'swarm', 'and', 'multiply', 'in', 'a', 'drop', 'of', 'water', '.'], ['with', 'infinite', 'complacency', 'men', 'went', 'to', 'and', 'fro', 'over', 'this', 'globe', 'about', 'their', 'little', 'affairs', ',', 'serene', 'in', 'their', 'assurance', 'of', 'their', 'empire', 'over', 'matter', '.']]


In [362]:
print(tokenized_text[2])

['no', 'one', 'would', 'have', 'believed', 'in', 'the', 'last', 'years', 'of', 'the', 'nineteenth', 'century', 'that', 'this', 'world', 'was', 'being', 'watched', 'keenly', 'and', 'closely', 'by', 'intelligences', 'greater', 'than', 'man', '’', 's', 'and', 'yet', 'as', 'mortal', 'as', 'his', 'own', ';', 'that', 'as', 'men', 'busied', 'themselves', 'about', 'their', 'various', 'concerns', 'they', 'were', 'scrutinised', 'and', 'studied', ',', 'perhaps', 'almost', 'as', 'narrowly', 'as', 'a', 'man', 'with', 'a', 'microscope', 'might', 'scrutinise', 'the', 'transient', 'creatures', 'that', 'swarm', 'and', 'multiply', 'in', 'a', 'drop', 'of', 'water', '.']


Why tokenize sentences and words?

We want to be able to retain sentence boundaries to encode that, too.

In [363]:
# notice what the first 10 tokens of the first sentence are:
#print(text[:10])
print(tokenized_text[0][:10])

['book', 'one', 'the', 'coming', 'of', 'the', 'martians', 'i', '.']


In [364]:
# we imported this function from nltk
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

In [365]:
from nltk.lm import MLE
# we imported this class from nltk's language modeling package (nltk.lm)
# MLE stands for Maximum Likelihood Estimation

# MLE is the model we will use
lm = MLE(n)

In [366]:
# currently the vocab length is 0: it has no prior knowledge
len(lm.vocab)

0

In [367]:
# fit the model 
# training data is the bigrams and unigrams 
# the vocab is all the sentence tokens in the corpus 

lm.fit(train_data, padded_sents)
len(lm.vocab)

7203

In [368]:
help(MLE.fit) # this breaks the following code!  Not sure why?

Help on function fit in module nltk.lm.api:

fit(self, text, vocabulary_text=None)
    Trains the model on a text.
    
    :param text: Training text as a sequence of sentences.



In [369]:
#tokenized_text[0] 

In [370]:
# inspect the model's vocabulary. 
# be sure that a sentence you know exists (from tokenized_text) is in the vocabulary
print(lm.vocab.lookup(tokenized_text[3]))

('with', 'infinite', 'complacency', 'men', 'went', 'to', 'and', 'fro', 'over', 'this', 'globe', 'about', 'their', 'little', 'affairs', ',', 'serene', 'in', 'their', 'assurance', 'of', 'their', 'empire', 'over', 'matter', '.')


In [371]:
print(lm.vocab.lookup(["There", "is", "no", "spoon", "."]))

('<UNK>', 'is', 'no', 'spoon', '.')


In [372]:
my_list = ["There", "is", "no", "spoon", "."]
print(my_list)

['There', 'is', 'no', 'spoon', '.']


In [373]:
print(lm.vocab.lookup(my_list))

('<UNK>', 'is', 'no', 'spoon', '.')


In [374]:
print(lm.vocab.lookup(tokenized_text[0][0])) 

book


In [375]:
# see what happens when we include a word that is not in the vocab. 
print(lm.vocab.lookup('then wear the gold hat iphone .'.split()))

('then', 'wear', 'the', 'gold', 'hat', '<UNK>', '.')


What did the model replace 'iphone' with? 

Given that it didn't just return an "out of vocab" error, what does that mean about our model? 

The vocabulary includes a special '<UNK>' token, so when the model encounters words that are not in our corpus it maps them to '<UNK>' instead of crashing.
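The lookup behaviour can be imitated with a few lines of plain Python. This is a sketch of the idea only, assuming a vocabulary stored as a set of lowercase strings; `lookup_tokens` and `toy_vocab` are illustrative names.

```python
def lookup_tokens(tokens, vocab, unk="<UNK>"):
    """Keep tokens that are in the vocabulary and replace the rest with `unk`."""
    return tuple(tok if tok in vocab else unk for tok in tokens)


toy_vocab = {"then", "wear", "the", "gold", "hat", "there", "."}
print(lookup_tokens("then wear the gold hat iphone .".split(), toy_vocab))
print(lookup_tokens(["There", "."], toy_vocab))
```

The second call shows why "There" came back as '<UNK>' earlier: the corpus was lowercased before training, so only "there" is known.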

In [376]:
# how many times does alien appear in the model?
print(lm.counts['alien'])

# what is the probability of alien appearing? 
# this is technically the relative frequency of alien appearing 
lm.score('alien')

1


1.3259961546111516e-05

In [377]:
# how many times does world appear in the model?
print(lm.counts['world'])

# what is the probability of world appearing? 
# this is technically the relative frequency of world appearing 
lm.score('world')

24


0.0003182390771066764

In [378]:
# how often does (world, war) occur and what is the relative frequency?
print(lm.counts[['world']]['war'])
lm.score('war', 'world'.split())

0


0.0

In [379]:
# how many times does martians appear in the model?
print(lm.counts['martians'])

# what is the probability of martians appearing? 
# this is technically the relative frequency of martians appearing 
lm.score('martians')

162


0.0021481137704700655

In [380]:
# how often does (martians, were) occur and what is the relative frequency?
print(lm.counts[['martians']]['were'])
lm.score('were', 'martians'.split())

16


0.09876543209876543

In [381]:
# From NLTK Documentation:
# Here’s how you get the score for a word given some preceding context. 

# For example, we want to know the chance that “were” comes next, given that the previous word is “martians”.

lm.score("were", ["martians"])

0.09876543209876543
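The score above is simply a ratio of two counts: how often "were" follows "martians", divided by how often "martians" is followed by anything. A small sketch, using a toy token list plus the two counts printed earlier in this notebook (16 and 162); the helper `mle_score` is an illustrative name.

```python
from collections import Counter


def mle_score(word, context, tokens):
    """Relative frequency of `word` among the tokens that follow `context`."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_total = sum(c for (prev, _), c in pair_counts.items() if prev == context)
    return pair_counts[(context, word)] / context_total if context_total else 0.0


toy_tokens = ["the", "martians", "were", "here", "the", "martians", "left"]
toy_score = mle_score("were", "martians", toy_tokens)  # 1 of the 2 follow-ups

# the same arithmetic with the counts printed in the notebook
notebook_score = 16 / 162
print(toy_score, notebook_score)
```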

In [382]:
# what is the score of 'UNK'? 

lm.score("<UNK>")

0.0

Does the relative frequency of 'UNK' change your assumption about how the model behaves? 

How should we change our model to account for the fact the `<UNK>` words are not accounted for by the model? 

(We would want to implement Laplace smoothing or something similar to account for the unknown words, per [NATURAL LANGUAGE PROCESSING N-gram language models: Part 1: Unigram model by Khanh Nguyen](https://medium.com/mti-technology/n-gram-language-model-b7c2fc322799).)

Note: *Programmatically implementing this solution is beyond the scope of this course.*
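For the curious, the core idea of Laplace (add-one) smoothing fits in a few lines: add 1 to every bigram count and add the vocabulary size to the denominator, so unseen pairs get a small non-zero probability. The sketch below is plain Python on a toy corpus with illustrative names, not the lab's model. NLTK also ships a ready-made `nltk.lm.Laplace` class.

```python
from collections import Counter


def laplace_score(word, context, bigram_counts, context_counts, vocab_size):
    """Add-one smoothed estimate of P(word | context)."""
    return (bigram_counts[(context, word)] + 1) / (context_counts[context] + vocab_size)


toy_tokens = ["the", "martians", "were", "here", "the", "martians", "left"]
toy_bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
toy_context_counts = Counter(toy_tokens[:-1])  # tokens that have a successor
toy_vocab_size = len(set(toy_tokens))

seen = laplace_score("were", "martians", toy_bigram_counts, toy_context_counts, toy_vocab_size)
unseen = laplace_score("here", "martians", toy_bigram_counts, toy_context_counts, toy_vocab_size)
print(seen, unseen)
```

Here "martians were" was seen once out of two, so its smoothed score is (1 + 1) / (2 + 5), while the unseen pair "martians here" gets (0 + 1) / (2 + 5) instead of zero.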

## Generate text
We want to start our sentence with a word, and use that to predict all the words that come after that. We'll specify how long it should be. 

Text generation from an n-gram model involves randomness: at each step, the next word is sampled from the words that followed the current context in the corpus, weighted by how often each one occurred. This keeps the output from being entirely deterministic; always picking the single most likely next word would produce the same text every time. Setting `random_seed` fixes the random draws, so the same seed gives the same result every time.
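A rough sketch of seeded, frequency-weighted sampling from bigram counts, using only the standard library. It illustrates the idea and is not how NLTK implements `generate`; the toy sentences and function names are assumptions.

```python
import random
from collections import Counter, defaultdict


def build_followers(sentences):
    """Count, for each token, which tokens follow it (with sentence padding)."""
    followers = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            followers[prev][nxt] += 1
    return followers


def sample_tokens(followers, start, max_words, seed):
    """Sample up to max_words tokens after `start`, stopping at '</s>'."""
    rng = random.Random(seed)  # same seed, same sequence of draws
    out, current = [], start
    for _ in range(max_words):
        options = followers.get(current)
        if not options:
            break
        words = list(options)
        weights = [options[w] for w in words]
        current = rng.choices(words, weights=weights, k=1)[0]
        if current == "</s>":
            break
        out.append(current)
    return out


toy_sentences = [
    ["the", "martians", "were", "coming"],
    ["the", "martians", "left"],
    ["the", "cylinder", "opened"],
]
toy_followers = build_followers(toy_sentences)
print(sample_tokens(toy_followers, "the", 10, seed=42))
```

Calling `sample_tokens` twice with the same seed returns the same tokens, while changing the seed can change the path taken through the counts.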

In [383]:
# generate 20 tokens seeded with 'martians'
# note: text_seed expects a list of tokens such as ['martians']; a bare string is read character by character

print(lm.generate(20, text_seed= 'martians', random_seed=42))

['park', ',', 'and', 'drawn', 'closely', 'to', 'the', 'captain', 'lay', '.', '</s>', 'in', 'a', 'ditch', 'for', 'my', 'brother', 'saw', 'the', 'advance']


This next code block is just to clean up the tokenized words and make them easier on human eyes. It is literally a detokenizer, which removes some extraneous text markup and reconciles some words back together. 

In [384]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(lm, num_words, text_seed, random_seed=42):
    """
    :param lm: A fitted ngram language model from `nltk.lm`.
    :param num_words: Max no. of words to generate.
    :param text_seed: Seed passed through to `lm.generate` as the starting context.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in lm.generate(num_words, text_seed=text_seed, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [385]:
# Now generate sentences that look much nicer. 
generate_sent(lm, 40, text_seed='martians', random_seed = 42)
# Stops at the first '</s>', so it returns at most one sentence.

'park, and drawn closely to the captain lay.'

Try a few more sentences, and try out another text. Once you are satisfied with what ngrams can (and cannot) do - post your code to your Github or another site. 

In [386]:
# Try a different random seed.
generate_sent(lm, 40, text_seed='martians', random_seed = 20)
# If no '</s>' is generated within 40 tokens, the output is cut off mid-sentence.

'what these strange tidings, one thinks of the cylinders have you are driving us and disease and turned back and silent mass of baker street—portman square, and striding by the cylinder, according to the dull radiation arrested'

In [387]:
generate_sent(lm, 40, text_seed='martians', random_seed = 200)

'account for the crest of his against the cylinder, and made a tall against the handling-machine, felled and myself now?'

In [388]:
generate_sent(lm, 40, text_seed='martians', random_seed = 547)

'account of its length might have filled my terror not see the sight of a multitude that the silence, astonished, with three or awake and dispersed by a puff of the last time, skirting the road and'

# Alice in Wonderland

In [389]:
# you will need to leverage the requests package
r = requests.get(r'https://www.gutenberg.org/cache/epub/28885/pg28885.txt')
alice = r.text

# first, remove unwanted new line and tab characters from the text
for char in ["\n", "\r", "\t"]:
    alice = alice.replace(char, " ")

# check
print(alice[:100])

﻿The Project Gutenberg eBook of Alice's Adventures in Wonderland        This ebook is for the use of


In [390]:
print(alice[6363:6400])

ALICE was beginning to get very tired


In [391]:
# remove the metadata at the beginning - this is slightly different for each book
alice = alice[6363:]
print(alice[:150])

ALICE was beginning to get very tired of sitting by her  sister on the bank, and of having nothing to do: once or twice she had  peeped into the book 


In [392]:
print(len(alice))

170878


In [393]:
361520 - 342639

18881

In [394]:
170878 - 18881

151997

In [395]:
# also need to remove the legalese from the end of the book
print(alice[151997:])

                       *** END OF THE PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***                      Updated editions will replace the previous one—the old editions will  be renamed.    Creating the works from print editions not protected by U.S. copyright  law means that no one owns a United States copyright in these works,  so the Foundation (and you!) can copy and distribute it in the United  States without permission and without paying copyright  royalties. Special rules, set forth in the General Terms of Use part  of this license, apply to copying and distributing Project  Gutenberg™ electronic works to protect the PROJECT GUTENBERG™  concept and trademark. Project Gutenberg is a registered trademark,  and may not be used if you charge for an eBook, except by following  the terms of the trademark license, including paying royalties for use  of the Project Gutenberg trademark. If you do not charge anything for  copies of this eBook, complying with the trademark l

In [396]:
alice = alice[:151997]

In [397]:
print(alice[150797:150871])

s, remembering her own child-life, and the happy summer days.      THE END


In [398]:
alice = alice[:150871]

In [399]:
print(alice[150780:])

their simple  joys, remembering her own child-life, and the happy summer days.      THE END


# Part 2 - creating an ngram model

In [400]:
# 2 is for bigrams
n = 2
#specify the text you want to use
text = alice

In [401]:
# step 1: tokenize the text into sentences
sentences = nltk.sent_tokenize(text)

# step 2: tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# step 3: convert each word to lowercase
tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

# notice the sentence breaks in the first 4 sentences of the tokenized text
print(tokenized_text[0:4])

[['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', '``', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "''", 'thought', 'alice', ',', '``', 'without', 'pictures', 'or', 'conversations', '?', "''"], ['so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', ')', 'whether', 'the', 'pleasure', 'of', 'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'white', 'rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', '.']

In [402]:
print(tokenized_text[2])

['there', 'was', 'nothing', 'so', '_very_', 'remarkable', 'in', 'that', ';', 'nor', 'did', 'alice', 'think', 'it', 'so', '_very_', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'rabbit', 'say', 'to', 'itself', ',', '``', 'oh', 'dear', '!']


In [403]:
# notice what the first 10 tokens of the first sentence are:
#print(text[:10])
print(tokenized_text[0][:10])

['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by']


In [404]:
# we imported this function from nltk
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

In [405]:
from nltk.lm import MLE
# we imported this class from nltk's language modeling package (nltk.lm)
# MLE stands for Maximum Likelihood Estimation

# MLE is the model we will use
lm = MLE(n)

In [406]:
# fit the model 
# training data is the bigrams and unigrams 
# the vocab is all the sentence tokens in the corpus 

lm.fit(train_data, padded_sents)
len(lm.vocab)

2778

In [407]:
# inspect the model's vocabulary. 
# be sure that a sentence you know exists (from tokenized_text) is in the vocabulary
print(lm.vocab.lookup(tokenized_text[3]))

('oh', 'dear', '!')


In [408]:
print(lm.vocab.lookup(["There", "is", "no", "spoon", "."]))

('<UNK>', 'is', 'no', 'spoon', '.')


In [409]:
print(lm.vocab.lookup(tokenized_text[0][0])) 

alice


In [410]:
# see what happens when we include a word that is not in the vocab. 
print(lm.vocab.lookup('then wear the gold hat iphone .'.split()))

('then', '<UNK>', 'the', '<UNK>', 'hat', '<UNK>', '.')


In [411]:
# how many times does alice appear in the model?
print(lm.counts['alice'])

# what is the probability of alice appearing? 
# this is technically the relative frequency of alice appearing 
lm.score('alice')

396


0.010450754776733875

In [412]:
# how often does (alice, was) occur and what is the relative frequency?
print(lm.counts[['alice']]['was'])
lm.score('was', 'alice'.split())

17


0.04292929292929293

In [413]:
# From NLTK Documentation:
# Here’s how you get the score for a word given some preceding context. 

# For example, we want to know the chance that “was” comes next, given that the previous word is “alice”.

lm.score("was", ["alice"])

0.04292929292929293

## Generate text

In [414]:
print(lm.generate(20, text_seed= 'alice', random_seed=42))

['--', "'", "''", '</s>', 'right', 'size', 'for', 'a', 'little', ',', '``', 'it', "'ll", 'do', "n't", 'matter', 'much', 'like', 'the', 'air']


In [415]:
# Now generate sentences that look much nicer. 
generate_sent(lm, 40, text_seed='alice', random_seed = 42)
# Stops at the first '</s>', so it returns at most one sentence.

'--"\''

In [416]:
generate_sent(lm, 40, text_seed='alice', random_seed = 20)

'-- i shall tell me my time together."'

In [417]:
generate_sent(lm, 40, text_seed='martians', random_seed = 200)

"'ll don't be talking about, and crept a great hurry."

In [418]:
generate_sent(lm, 40, text_seed='martians', random_seed = 547)

"'ll see anything more evidence _yet_, and tillie; for some time she appeared; to sink into a grown in things went on, ma'am, miss, just what to herself in; but she swam nearer"

# Part 3 - creating a tri-gram model?

In [419]:
# 3 is for tri-grams!!!
n = 3

In [420]:
# we imported this function from nltk
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

In [421]:
# MLE is the model we will use
lm = MLE(n)

In [422]:
# fit the model 
# training data is the unigrams, bigrams, and trigrams 
# the vocab is all the sentence tokens in the corpus 

lm.fit(train_data, padded_sents)
len(lm.vocab)

2778

In [423]:
# inspect the model's vocabulary. 
# be sure that a sentence you know exists (from tokenized_text) is in the vocabulary
print(lm.vocab.lookup(tokenized_text[3]))

('oh', 'dear', '!')


In [424]:
# From NLTK Documentation:
# Here’s how you get the score for a word given some preceding context. 
# For example, we want to know the chance that “beginning” comes next, given that the previous two words are “alice was”.

print(lm.counts[['alice', 'was']]['beginning'])
lm.score("beginning", ["alice", "was"])

2


0.11764705882352941

## Generate text

In [425]:
generate_sent(lm, 40, text_seed='alice', random_seed = 42)

'-- e--e--evening, beautiful, beauti--ful soup!'

In [426]:
generate_sent(lm, 40, text_seed='alice', random_seed = 55)

'-- evening, beautiful soup, and it stood for a minute or two she stood still where she was considering in her own child-life, and the executioner ran wildly up and down looking for eggs, as to'

In [427]:
generate_sent(lm, 40, text_seed='alice', random_seed = 200)

'-- e--e--e--e--e--e--evening, beautiful, beauti--ful soup!"'

In [428]:
generate_sent(lm, 40, text_seed='alice', random_seed = 547)

'-- evening, beautiful, beautiful soup!'

That doesn't work!  They are all almost the same...  Maybe we need two words as a text seed for a tri-gram?

In [429]:
print(lm.generate(20, text_seed= 'alice was', random_seed=42))

['larger', ',', 'i', "'ve", 'seen', 'them', 'so', 'often', ',', 'of', 'course', '?', "''", '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [430]:
# Now generate sentences that look much nicer. 
generate_sent(lm, 40, text_seed='alice was', random_seed = 42)
# Stops at the first '</s>', so it returns at most one sentence.

'larger, i\'ve seen them so often, of course?"'

In [431]:
generate_sent(lm, 40, text_seed='alice was', random_seed = 55)

', with a soldier on each side, to the mock turtle a little sharp bark just over her head!"'

In [432]:
generate_sent(lm, 40, text_seed='alice was', random_seed = 200)

"'s argument was, that attempt proved a failure."

In [433]:
generate_sent(lm, 40, text_seed='alice was', random_seed = 547)

'\'s really dreadful," said the duchess sneezed occasionally; and the party.'

That looks better.  Still not quite sure this is right, but it was a fun experiment!
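One possible explanation for the odd openings above, offered as a hedged reading rather than a confirmed diagnosis: `text_seed` is meant to be a sequence of tokens, and a Python string is a sequence of characters. Assuming the library turns the seed into a list and keeps its last n-1 items as the context, a string seed contributes letters, not words. The variable names below are illustrative.

```python
seed_as_string = "alice was"
seed_as_tokens = seed_as_string.split()

order = 3  # trigram model, so the context is the last 2 items of the seed

# what a string seed looks like once it is treated as a sequence
context_from_string = list(seed_as_string)[-(order - 1):]

# what a token-list seed looks like
context_from_tokens = seed_as_tokens[-(order - 1):]

print(context_from_string)
print(context_from_tokens)
```

If that reading is right, passing `text_seed=['alice', 'was']` (or `'alice was'.split()`) would be worth trying as the next experiment.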