# HW 2: N-gram Language Models

## Date Out: Thursday, February 20
## Due Date: Thursday, March 5

This programming assignment is more open-ended than the previous ones. It centers on N-gram language models and asks you to:

* download and process a large text dataset in Python using the <code>csv</code> library
* perform sentence and word tokenization
* calculate N-gram counts and probabilities
* compare the characteristics of the N-grams across different models
* generate random sentences using the models

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

In [None]:
pip install nltk

In [1]:
import nltk

In [None]:
nltk.download('punkt')

In [2]:
import csv

In [3]:
import pickle

### Task #1

<u>Download two large text datasets from Kaggle.</u>

The <a href="http://kaggle.com">Kaggle competition hosting site</a> offers a number of free datasets that contain interesting text fields. For this assignment, we will use the "Wine Reviews" and "All the News" datasets. To find them, select the "Datasets" header and search for each dataset by name. Then choose "Data" from the sub-header, preview some of the csv data, and notice that at least one column in each dataset contains a substantial amount of text. I chose these two datasets because their text is interesting, their language characteristics differ, and both are csv files large enough to produce meaningful n-gram counts without being unmanageably large.

<em>(You can use other datasets if you wish. Others that looked interesting on Kaggle include the "Yelp Dataset" (but it's over 3 GB!), "SMS Spam Collection Dataset", "Russian Troll Tweets", and "A Million News Headlines".)</em>

### Task #2

<u>Process the downloaded <code>csv</code> files in Python.</u>

There's a nice csv library already included in Python for accessing values that are stored in a comma-separated values (csv) format. Read the <a href="https://docs.python.org/3/library/csv.html">csv library documentation</a>.
What is the delimiter in your csv files? Open each of the two .csv files that you downloaded using this library and be able to read in the data. Note that we really only care about the text column in this assignment.
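
One way to answer the delimiter question is to let the library guess it. The following is a minimal sketch, assuming the file has a header row and that the text lives in a column called `description`; the helper name `read_text_column`, the column name, and the demo data are illustrative assumptions, so check your own file's header.

```python
import csv
import io
from itertools import islice

def read_text_column(handle, column, max_rows=None):
    """Return (delimiter, texts) for one named column of an open CSV file."""
    sample = handle.read(4096)      # a small sample is enough for sniffing
    handle.seek(0)                  # rewind so the reader sees the whole file
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    reader = csv.DictReader(handle, dialect=dialect)
    rows = islice(reader, max_rows)  # max_rows=None reads every row
    return dialect.delimiter, [row[column] for row in rows]

# Tiny in-memory stand-in for a real file (made-up data).
demo = io.StringIO(
    'id,country,description\n'
    '0,US,"Ripe, juicy fruit."\n'
    '1,Spain,Dry and earthy.\n'
)
delimiter, texts = read_text_column(demo, "description")
print(repr(delimiter), texts)
```

For a real file, open it with `open(path, encoding='utf-8', newline='')` and pass the handle in. Note that `csv.Sniffer` is a heuristic and can guess wrong on unusual files, so it is worth confirming the result by eye.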

In [None]:
# PYTHON CODE HERE
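# Copy the review text (column index 2) to a plain-text file, one review per line.
with open(r'winemag-data_first150k.csv', 'r', encoding='utf-8', newline='') as wines, open(r'vocab.txt', 'w', encoding='utf-8') as vocab:
    result = csv.reader(wines, delimiter=',')
    next(result)  # skip the header row (assumes the file has one)
    for i in result:
        vocab.write(i[2] + '\n')  # the newline keeps adjacent reviews from running together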

In [None]:
with open(r'vocab.txt', 'r', encoding='utf-8') as vocab:
    content = vocab.read()

In [None]:
tokens2 = nltk.word_tokenize(content)

### Task #3

<u>Perform sentence segmentation and word tokenization.</u>

Utilize the nltk module to perform sentence segmentation and word tokenization. But at this point, there are a few decisions that need to be made:

* How should we handle the .csv rows from the previous step? If we ignore row boundaries and "lump everything together", how will that affect our language model?
* Do we want to remove punctuation? What is the effect of keeping punctuation in the model?
* Do we want to add sentence boundary markers, such as <samp>&lt;S&gt;</samp> and <samp>&lt;/S&gt;</samp>?
* Should the two words <samp>The</samp> and <samp>the</samp> be treated as the same? What are the effects of doing, or not doing, this?
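
The decisions above can be made explicit as switches in a small helper. This is only a sketch of one possible set of choices, not a required design: `prepare_row` and its parameters are hypothetical names, and the two "naive" tokenizers are stand-ins so the example runs without any downloads.

```python
def prepare_row(text, sent_split, word_split, lowercase=True, add_markers=True):
    """Turn the text of one CSV row into a list of token lists, one per sentence."""
    sentences = []
    for sentence in sent_split(text):
        words = word_split(sentence)
        if lowercase:
            words = [w.lower() for w in words]   # merges "The" and "the"
        if add_markers:
            words = ["<S>"] + words + ["</S>"]   # explicit sentence boundaries
        sentences.append(words)
    return sentences

# Stand-in tokenizers (assumptions for the demo, far cruder than NLTK's).
naive_sents = lambda text: [s.strip() for s in text.split(".") if s.strip()]
naive_words = str.split

rows = ["The wine is dry. The finish is long.", "Bright and fruity."]
corpus = [sent for row in rows for sent in prepare_row(row, naive_sents, naive_words)]
print(corpus[0])   # ['<S>', 'the', 'wine', 'is', 'dry', '</S>']
```

With NLTK, pass `nltk.sent_tokenize` and `nltk.word_tokenize` as the two tokenizer arguments. Processing each row separately means that no n-gram spans the end of one row and the start of the next.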

In [4]:
# Load the previously pickled sentences and tokens for faster start-up.
with open(r'tokens.pickle', 'rb') as pickleIn:
    tokens = pickle.load(pickleIn)
with open(r'sents.pickle', 'rb') as pickleIn:
    sents = pickle.load(pickleIn)

### Task #4

<u>Calculate N-gram counts and compute probabilities.</u>

Use a Python dictionary (or any suitable data structure) to first compute unigram counts. Then try bigram counts. Finally, trigram counts.
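
As a rough illustration of the counting step, the sketch below uses `collections.Counter` (a dictionary subclass) and counts n-grams one sentence at a time, so the full list of n-grams never has to be held in memory. The function name `ngram_counts` and the toy sentences are assumptions for the example.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams within each sentence; `sentences` is a list of token lists."""
    counts = Counter()
    for tokens in sentences:
        # Zipping n staggered views of the sentence yields its n-grams as tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

sentences = [
    ["<S>", "the", "wine", "</S>"],
    ["<S>", "the", "finish", "</S>"],
]
unigrams = ngram_counts(sentences, 1)
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
print(bigrams[("<S>", "the")])   # 2
```

Keys are always tuples, including the 1-tuples used for unigrams, which keeps the three tables uniform.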

How much memory are you using? How fast, or slow, is the code -- how long is this step taking? If it is taking too long, try only using a fraction of your corpus: instead of loading the entire .csv file, try only reading the first 1000 rows of data.
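
To put numbers on these questions, one option is the standard library's `time.perf_counter` and `tracemalloc`. The sketch below is an assumption-laden example rather than a prescribed method: `measure` is a hypothetical helper, and the generated rows merely stand in for rows read from a .csv file.

```python
import time
import tracemalloc
from collections import Counter
from itertools import islice

def measure(build, *args):
    """Run build(*args) and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = build(*args)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()   # returns (current, peak)
    tracemalloc.stop()
    return result, seconds, peak

def count_words(rows):
    """Toy workload: whitespace-split unigram counts."""
    return Counter(word for row in rows for word in row.split())

rows = (f"review number {i}" for i in range(5000))   # made-up stand-in for CSV rows
subset = list(islice(rows, 1000))                     # keep only the first 1000 rows
counts, seconds, peak = measure(count_words, subset)
print(f"{len(subset)} rows, {seconds:.4f} s, peak {peak / 1024:.1f} KiB")
```

`tracemalloc` only sees memory allocated by Python itself and slows the code down while active, so treat the figures as estimates.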

Using those counts, compute the probabilities for the unigrams, bigrams, and trigrams, and store those in a new Python dictionary (or some other data structure).
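
One common reading of "probabilities" for an N-gram language model is the maximum-likelihood conditional estimate, P(word | history) = count(history + word) / count(history). The sketch below computes it from a table of n-gram counts keyed by tuples; this is one possible interpretation (relative frequencies of whole n-grams are another), and `conditional_probs` and the toy counts are illustrative assumptions.

```python
from collections import Counter, defaultdict

def conditional_probs(ngram_counts):
    """Map each history tuple to {next_word: P(next_word | history)}."""
    history_totals = Counter()
    for ngram, count in ngram_counts.items():
        history_totals[ngram[:-1]] += count      # total mass for this history
    probs = defaultdict(dict)
    for ngram, count in ngram_counts.items():
        history = ngram[:-1]
        probs[history][ngram[-1]] = count / history_totals[history]
    return dict(probs)

bigram_counts = Counter({("the", "wine"): 3, ("the", "finish"): 1, ("wine", "is"): 2})
model = conditional_probs(bigram_counts)
print(model[("the",)])   # {'wine': 0.75, 'finish': 0.25}
```

For unigrams the history is the empty tuple `()`, so the same function reduces to plain relative frequencies.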

In [None]:
# Run this cell only on start-up to reload previously saved counts and probabilities.
with open(r'ctnsUnigram.pickle', 'rb') as unigram:
    unigramsCnts = pickle.load(unigram)
with open(r'ctnsBigram.pickle', 'rb') as bigram:
    bigramCnts = pickle.load(bigram)
with open(r'ctnsTrigram.pickle', 'rb') as trigram:
    trigramCnts = pickle.load(trigram)
with open(r'probUnigram.pickle', 'rb') as unigram:
    unigramProb = pickle.load(unigram)
with open(r'probBigram.pickle', 'rb') as bigram:
    bigramProb = pickle.load(bigram)
with open(r'probTrigram.pickle', 'rb') as trigram:
    trigramProb = pickle.load(trigram)

In [5]:
# PYTHON CODE HERE
unigramsOfText = tokens
bigramsOfText = list(nltk.bigrams(tokens))
trigramsOfText = list(nltk.trigrams(tokens))

In [None]:
with open(r'unigrams.pickle', 'wb') as unigram:
    pickle.dump(unigramsOfText ,unigram)

In [38]:
with open(r'bigrams.pickle', 'wb') as bigram:
    pickle.dump(bigramsOfText ,bigram)

MemoryError: 

In [None]:
with open(r'trigrams.pickle', 'wb') as trigram:
    pickle.dump(trigramsOfText ,trigram)

In [27]:
totalUnigrams = len(unigramsOfText)
totalBigrams = len(bigramsOfText)
totalTrigrams = len(trigramsOfText)

In [30]:
unigramsCnts = nltk.FreqDist(unigramsOfText)
bigramCnts = nltk.FreqDist(bigramsOfText)
trigramCnts = nltk.FreqDist(trigramsOfText)

In [43]:
with open(r'ctnsUnigram.pickle', 'wb') as unigram:
    pickle.dump(unigramsCnts ,unigram)
with open(r'ctnsBigram.pickle', 'wb') as bigram:
    pickle.dump(bigramCnts ,bigram)
with open(r'ctnsTrigram.pickle', 'wb') as trigram:
    pickle.dump(trigramCnts ,trigram)

In [42]:
# Relative frequency of each n-gram among all n-grams of the same order.
unigramProb = {x : unigramsCnts[x]/totalUnigrams for x in unigramsCnts}
bigramProb = {x : bigramCnts[x]/totalBigrams for x in bigramCnts}
trigramProb = {x : trigramCnts[x]/totalTrigrams for x in trigramCnts}

In [47]:
with open(r'probUnigram.pickle', 'wb') as unigram:
    pickle.dump(unigramProb ,unigram)
with open(r'probBigram.pickle', 'wb') as bigram:
    pickle.dump(bigramProb ,bigram)
with open(r'probTrigram.pickle', 'wb') as trigram:
    pickle.dump(trigramProb ,trigram)

### Task #5

<u>Compare the statistics of the corpora.</u>
                        
Use the results of those calculations that you just made the poor computer painstakingly compute. What are the differences in the most common unigrams between the two language models? Are there interesting differences between the bigram models or trigram models?

Be able to sort the n-grams to output the top k with the highest count or probability.
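
A minimal sketch of a top-k helper follows. It works on any mapping from n-gram to score, whether the scores are counts or probabilities; the name `top_k` and the toy sentence are assumptions for illustration.

```python
import heapq
from collections import Counter

def top_k(scores, k):
    """Return the k (item, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

counts = Counter("the wine is dry and the finish is long and the fruit is ripe".split())
total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

print(top_k(counts, 3))
print(top_k(probs, 3))
```

For count tables specifically, `Counter.most_common(k)` gives the same ranking, and `nltk.FreqDist` inherits that method because it subclasses `Counter`.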

In [None]:
# PYTHON CODE HERE

### Task #6

<u>Generate random sentences from the N-grams models for both datasets.</u>
                        
We briefly talked about this idea in class. It's also introduced at a high level in J&M 4.3. How can a random number in the range [0,1] probabilistically generate a word using your model?
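
One way to think about it: lay the words' probabilities end to end along the interval from 0 to 1, draw a random number, and pick the word whose segment the number falls in. The sketch below applies that idea to a bigram model. The model layout (`{(previous_word,): {word: probability}}`), the function names, and the toy model are all assumptions made for this example.

```python
import random

def sample_word(distribution, rng=random):
    """Pick a key of {word: probability} using one random number in [0, 1)."""
    r = rng.random()
    cumulative = 0.0
    for word, prob in distribution.items():
        cumulative += prob
        if r < cumulative:
            return word
    return word   # fallback if rounding leaves the total just under 1

def generate(model, start="<S>", end="</S>", max_len=20, rng=random):
    """Walk a bigram model from `start` until `end` is drawn or max_len is reached."""
    sentence, prev = [], start
    while len(sentence) < max_len:
        word = sample_word(model[(prev,)], rng)
        if word == end:
            break
        sentence.append(word)
        prev = word
    return sentence

toy_model = {
    ("<S>",): {"the": 1.0},
    ("the",): {"wine": 0.5, "finish": 0.5},
    ("wine",): {"</S>": 1.0},
    ("finish",): {"</S>": 1.0},
}
print(generate(toy_model, rng=random.Random(0)))
```

Here the sentence ends when the end marker is sampled, with `max_len` as a safety cap. A word that never appears as a history in the model would raise a `KeyError`, which a fuller implementation would need to handle.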

In [None]:
# PYTHON CODE HERE

### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include in your report the following details:

1. Names of the datasets used.
1. Does your model use all of the data in the .csv file or only a subset of it (e.g., the first 1,000 rows)?
1. What is the vocabulary and size of each dataset?
1. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in Task #3.
1. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?
1. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?
1. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

Also submit this Python notebook `.ipynb` to D2L.

In [None]:
# PYTHON CODE AND REPORT HERE