Homework 3: n-gram LM
----

Due date: October 4th, 2023

Points: 105

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __one (individually)__

Allowed python modules:
- `numpy`, `matplotlib`, and all built-in python libraries (e.g. `math` and `string`)
- do not use `nltk` or `pandas`

Instructions:
- Complete outlined problems in this notebook.
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__.
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).

6120 students: complete __all__ problems.

4120 students: you are not required to complete problems marked "CS 6120 REQUIRED". If you complete these you will not get extra credit. We will not take points off if you attempt these problems and do not succeed.

Task 0: Name, References, Reflection (5 points)
---

Name: Anisha Kumari Kushwaha

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in python.
    
    
- https://towardsdatascience.com/perplexity-in-language-models-87a196019a94
    - Read about the perplexity score.
- https://towardsdatascience.com/evaluation-of-language-models-through-perplexity-and-shannon-visualization-method-9148fbe10bd0
    - Read about the perplexity score.
    
AI Collaboration
---
Following the *AI Collaboration Policy* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort? 

    Yes


2. What was/were the most challenging part(s) of the assignment? 

    Handling the `<UNK>` token when scoring probabilities.


3. If you want feedback, what function(s) or problem(s) would you like feedback on and why? 

    I would like feedback on perplexity, since I had no reference values to cross-check my results against.

Task 1: Berp Data Write-Up (5 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __berp__ data set.

1. Where did you get the data from? https://www1.icsi.berkeley.edu/Speech/berp.html


2. How was the data collected (where did the people acquiring the data get it from and how)?

      The data was likely collected by recording spontaneous, continuous speech from users asking about restaurants in the city of Berkeley, which is the system's knowledge domain.
      

3. How large is the dataset? (# lines, # tokens)

    It contains 7500 lines, with 1500 words
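
Counts like these can be checked directly. The following is a minimal sketch, assuming one sentence per line and whitespace tokenization; `count_lines_and_tokens` is a hypothetical helper (not part of `lm_model`), and the sample lines are a tiny stand-in for the file contents.

```python
def count_lines_and_tokens(lines):
    """Return (number of non-empty lines, number of whitespace-separated tokens)."""
    non_empty = [line for line in lines if line.strip()]
    n_tokens = sum(len(line.split()) for line in non_empty)
    return len(non_empty), n_tokens

# Tiny in-memory stand-in for the data file (hypothetical sample lines).
sample = ["a vegetarian meal", "about ten miles", ""]
n_lines, n_tokens = count_lines_and_tokens(sample)
print(n_lines, n_tokens)
```

Passing the list returned by a file reader in place of `sample` would report the same two figures for the real data. Whether sentence boundary markers count as tokens is a convention to state alongside the number.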


4. What is your data? (i.e. newswire, tweets, books, blogs, etc)

    The data is a plain-text file with one sentence per line.
    

5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)

    It was developed by the International Computer Science Institute in Berkeley, CA, in a project headed by Nelson Morgan.

Task 2: Implement an n-gram Language Model (90 points)
----

Implement the `LanguageModel` class as outlined in the provided `lm_starter.py` file. Do not change function signatures (the unit tests that we provide and those in the autograder will break).

Your language model:
- *must* work for both the unigram and bigram cases (5 points are allocated to an experiment involving larger values of `n`)
    - 6120 students must create a model that works for trigram cases as well
    - hint: try to implement the bigram case as a generalized "n greater than 1" case
- should be *token agnostic* (this means that if we give the model text tokenized as single characters, it will function as a character language model and if we give the model text tokenized as "words" (or "traditionally"), then it will function as a language model with those tokens)
- will use Laplace smoothing
- will replace all tokens that occur only once with `<UNK>` at train time
    - do not add `<UNK>` to your vocabulary if no tokens in the training data occur only once!
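
The last two requirements can be illustrated on a toy example. This is only a sketch under stated assumptions, not the starter code's interface: `replace_singletons` and `laplace_bigram_prob` are hypothetical helper names, the token list is made up, and bigrams are counted over one flat token list (so a pair spanning `</s>` and `<s>` is counted too).

```python
from collections import Counter

UNK = "<UNK>"

def replace_singletons(tokens):
    """Replace every token that occurs exactly once with <UNK>."""
    counts = Counter(tokens)
    return [tok if counts[tok] > 1 else UNK for tok in tokens]

def laplace_bigram_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

# Toy training tokens (hypothetical data, not from berp).
tokens = ["<s>", "i", "am", "sam", "</s>", "<s>", "i", "am", "ham", "</s>"]
train = replace_singletons(tokens)  # "sam" and "ham" each occur once

context_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))
vocab_size = len(set(train))  # <UNK> is in the vocabulary only because it occurs

print(train)
print(laplace_bigram_prob("i", "am", context_counts, bigram_counts, vocab_size))
print(laplace_bigram_prob("am", "i", context_counts, bigram_counts, vocab_size))
```

Because the vocabulary is built from the tokens after replacement, `<UNK>` appears in it only when at least one token was actually replaced. An unseen bigram such as `("am", "i")` still receives a non-zero probability, which is the point of the add-one term.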

We have provided:
- a function to read in files
- some functions to change a list of strings into tokens
- the skeleton of the `LanguageModel` class

You need to implement:
- all functions marked

You may implement:
- additional functions/methods as helpful to you

As a guideline, including comments, all code required for CS 6120 and some debugging code that can be run with `verbose` parameters, our solution is ~300 lines. (~+120 lines versus the starter code).

Points breakdown marked in code below.

In [1]:
# rename your lm_starter.py file to lm_model.py and put in the same directory as this file
import lm_model as lm
import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby


In [2]:
# test the language model (unit tests)
import test_minitrainingprovided as test

# passing all these tests is a good indication that your model
# is correct. They are *not a guarantee*, so make sure to look
# at the tests and the cases that they cover. (we'll be testing
# your model against all of the testing data in addition).

# autograder points in gradescope are assigned SIXTY points
# this is essentially 60 points for correctly implementing your
# underlying model
# there are an additional 10 points manually graded for the correctness
# parts of your sentence generation

# make sure all training files are in a "training_files" directory
# that is in the same directory as this notebook

unittest = test.TestMiniTraining()
unittest.test_createunigrammodellaplace()
unittest.test_createbigrammodellaplace()
unittest.test_unigramlaplace()
unittest.test_unigramunknownslaplace()
unittest.test_bigramlaplace()
unittest.test_bigramunknownslaplace()
# produces output
unittest.test_generateunigramconcludes()
# produces output
unittest.test_generatebigramconcludes()

unittest.test_onlyunknownsgenerationandscoring()

[['<s>', '</s>'], ['<s>', 'am', '</s>']]
[['<s>', 'ham', 'i', 'am', '</s>'], ['<s>', 'ham', '</s>']]


In [3]:
# 5 points

# instantiate a bigram language model, train it, and generate ten sentences
# make sure your output is nicely formatted!
ngram = 2
training_file_path = "training_files/berp-training.txt"
# optional parameter tells the tokenize function how to tokenize
by_char = False
data = lm.read_file(training_file_path)
tokens = lm.tokenize(data, ngram, by_char=by_char)

# YOUR CODE HERE
lmc = lm.LanguageModel(ngram)
lmc.train(tokens)

lines = lmc.generate(10)

print("Generated 10 sentences are: \n")
for i, line in enumerate(lines, start=1):
    # drop the sentence begin/end markers before printing
    print(i, ' '.join(line[1:-1]))



Generated 10 sentences are: 

1 not cost less
2 let's let's start over
3 lots of any type of the maharani serve <UNK> restaurant
4 what's the <UNK> and uh with mediterranean food for ten dollars
5 i want
6 it got a little more than five dollars
7 can i want to the closest to go in berkeley thai restaurants open
8 oh i'd like to see one hour
9 are available uh let's start over
10 i would like to icksee


In [4]:
# 5 points

# evaluate your bigram model on the test data
# score each line in the test data individually, then calculate the average score
# you need not re-train your model
test_path = "testing_files/berp-test.txt"
test_data = lm.read_file(test_path)

scores = []

# YOUR CODE HERE
scores = [lmc.score(lm.tokenize_line(line, ngram, by_char)) for line in test_data ]
# print("Individual Score: ",scores)
print("Average Score: ", np.average(scores))
print("Standard Deviation : ",np.std(scores))

# Print out the mean score and standard deviation
# for words-as-tokens, these values should be
# ~4.9 * 10^-5 and 0.000285


Average Score:  4.9620823627262653e-05
Standard Deviation :  0.000285298086084196


In [5]:
# 5 points

# see if you can train your model on the data you found for your first homework
ngram = 10
dataFile = "HW1text.txt"
by_char = False
data = lm.read_file(dataFile)
tokens = lm.tokenize(data, ngram, by_char=by_char)
# print("tokens:", tokens)
lmc = lm.LanguageModel(ngram)
lmc.train(tokens)

# what is the maximum value of n <= 10 that you can train a model *in your programming environment* in a reasonable amount of time? (less than 3 - 5 minutes)


# generate three sentences with this model
lines = lmc.generate(3)
print("Generated 3 sentences: \n")
for i, line in enumerate(lines, start=1):
    # drop the sentence begin/end markers before printing
    print(i, ' '.join(line[1:-1]))


Generated 3 sentences: 

1 that she would be
2 Avenue. It hadn't been <UNK> <UNK> von <UNK>
3 take advantage of "Fire!" <UNK> what of a doubt that <UNK> out his <UNK> He was still this question <UNK> me in. She was at the lamps had rushed at <UNK>


CS 6120 REQUIRED
----
Implement the corresponding function and evaluate the perplexity of your model on the first 20 lines in the test data for values of `n` from 1 to 3. Perplexity should be individually calculated for each line.
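
As a way to cross-check perplexity values by hand, the definition can be sketched independently of the model. This assumes the per-token probabilities for a line are already available, and that N is the number of scored tokens; which tokens count towards N (for example, whether `<s>` is included) is a convention that is not fixed here. The function below is a hypothetical stand-alone helper, not the `perplexity` method of `LanguageModel`.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N scored tokens."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Sanity check: if every token has probability 1/10, perplexity is about 10.
uniform = perplexity_from_probs([0.1, 0.1, 0.1, 0.1])

# Two tokens with probabilities 1/2 and 1/4: (1/8) ** (-1/2) = sqrt(8).
mixed = perplexity_from_probs([0.5, 0.25])

print(uniform, mixed)
```

Summing logs avoids the underflow that multiplying many small probabilities can cause. The uniform case is a useful check: a model that assigns every token probability 1/k should report a perplexity of k.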

In [6]:
test_path = "training_files/berp-test.txt"
test_data = lm.read_file(test_path)

for ngram in range(1, 4):
    print("********")
    print("Ngram model:", ngram)
    # YOUR CODE HERE
    data = lm.read_file("testing_files/berp-test.txt")
    tokens = lm.tokenize(data, ngram, by_char=by_char)
    lmc = lm.LanguageModel(ngram)
    lmc.train(tokens)

    for i in range(20):
        print("\nLine",i+1," : ",test_data[i] )
        print("    Perplexity:", lmc.perplexity(test_data[i]))



********
Ngram model: 1

Line 1  :  a vegetarian meal

    Perplexity: 13.909097001971821

Line 2  :  about ten miles

    Perplexity: 9.756733216337008

Line 3  :  and i'm willing to drive ten miles

    Perplexity: 10.195553650348451

Line 4  :  and this will be for dinner

    Perplexity: 9.818395235754153

Line 5  :  are any of these restaurants open for breakfast

    Perplexity: 10.23979910708653

Line 6  :  are there russian restaurants in berkeley

    Perplexity: 10.280828998956883

Line 7  :  between fifteen and twenty dollars

    Perplexity: 9.155844704408999

Line 8  :  can you at least list the nationality of these restaurants

    Perplexity: 10.89893425587025

Line 9  :  can you give me more information on viva taqueria 

    Perplexity: 11.166839008684798

Line 10  :  dining

    Perplexity: 11.854739896051703

Line 11  :  display sizzler

    Perplexity: 10.724109042023283

Line 12  :  do you have indonesian food 

    Perplexity: 9.999309436034542

Line 13  :  do you

1. What are the common attributes of the test sentences that cause very high perplexity?

    Lines scored with a higher value of n have higher perplexity than lines scored with a lower value of n.

5 points in this assignment are reserved for overall style (both for writing and for code submitted). All work submitted should be clear, easily interpretable, and checked for spelling, etc. (Re-read what you write and make sure it makes sense). Course staff are always happy to give grammatical help (but we won't pre-grade the content of your answers).