# Assignment 2a: Building a Unigram Language Model in NLP

## DTSC-685: Natural Language Processing

## Name:

### Overview

Objective: Develop a program that calculates the probability of each word within a provided text corpus using a unigram model.

Tasks:

- 1 - Importing the Data

- 2 - Data Cleaning and Tokenization: Preprocess the text by removing special characters and converting everything to lowercase to ensure uniformity. Split the text into individual words (unigrams) to analyze their frequencies.

- 3 - Probability Calculation: Determine the probability of each word by dividing its frequency by the total number of words in the text.

- 4 - Presentation: Display the top 10 most probable words along with their corresponding probabilities.


### 01 - Importing the Data

For this assignment, we will be working with Bram Stoker's classic novel "Dracula", published in 1897. Since this book is in the public domain, it is available at no cost through [Project Gutenberg](https://www.gutenberg.org/).

The downloaded text file of "Dracula" has been provided for you in the text file `pg345.txt`.

You will need to create a list named `dracula`, where each element is an individual line from the book.
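
As a minimal sketch of how `readlines()` behaves, the snippet below writes a small temporary file and reads it back line by line. The file contents, the leading byte order mark, and all variable names are assumptions made for illustration; this is not the real `pg345.txt`.

```python
import os
import tempfile

# Hypothetical two-line stand-in for pg345.txt, written with a leading byte order mark.
sample_text = "\ufeffThe Project Gutenberg eBook of Dracula\nby Bram Stoker\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write(sample_text)

    # readlines() returns one string per line and keeps the trailing newline.
    with open(sample_path, "r", encoding="utf-8") as file:
        lines_utf8 = file.readlines()

    # "utf-8-sig" decodes the same bytes but drops a leading byte order mark.
    with open(sample_path, "r", encoding="utf-8-sig") as file:
        lines_utf8_sig = file.readlines()

print(repr(lines_utf8[0]))
print(repr(lines_utf8_sig[0]))
```

With plain `utf-8`, a byte order mark at the start of a file stays attached to the first word as an invisible character. The assignment's own code reads the file with `utf-8`, so keep that encoding if you need to reproduce the reference output.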


In [4]:
# Step 1 - Importing the Data
with open('pg345.txt', 'r', encoding='utf-8') as file:
    dracula = file.readlines()


### 02 - Data Cleaning

In this part of the assignment, we will refine the list `dracula` by converting all the text to lowercase, removing punctuation, and eliminating 'stop words'. To accomplish this, you will create a function named `cleanText` that takes a list of text lines as input and returns a list of token lists, one per input line (see the sample output below). Each token should be in lowercase, and punctuation tokens and ['stop words'](https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47) should be removed.


- For the punctuation, import the `string` library (`import string`) and use it as a list (`list(string.punctuation)`).

- For the 'stop words', import `stopwords` (`from nltk.corpus import stopwords`), download the stop-word data (`nltk.download('stopwords')`), and use the list `stopwords.words('english')`.

- Also remove the tokens in this list: `["“","’","--","”"]`

- Instead of using the `split()` function to divide the lines into words, use the `word_tokenize` function from `nltk.tokenize`, which separates punctuation from words and handles complex word structures more robustly.
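
To see why whitespace splitting is not enough, here is a minimal sketch that uses only the standard library. The regular expression is a rough stand-in for a real tokenizer (it is not NLTK's algorithm), and `toy_stop_words` is a hypothetical three-word subset of the English stop-word list.

```python
import re
import string

line = "The dogs are fierce, he added."

# Hypothetical tiny stop-word set; the assignment uses stopwords.words('english').
toy_stop_words = {"the", "are", "he"}
punctuation = set(string.punctuation)

# Whitespace split: punctuation stays glued to the neighbouring word.
split_tokens = line.lower().split()

# Rough tokenizer stand-in: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", line.lower())

# Drop punctuation tokens and stop words.
cleaned = [
    token for token in regex_tokens
    if token not in punctuation and token not in toy_stop_words
]

print(split_tokens)
print(regex_tokens)
print(cleaned)
```

Because `split()` leaves `'fierce,'` and `'added.'` intact, a membership test against the punctuation list would never match them. Once punctuation is emitted as separate tokens, the same filter removes it cleanly.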


Output:
    
    dracula[400:410]
    
    [['seemed', 'darkness', 'closing', 'upon', 'us', 'great', 'masses'],
     ['greyness', 'bestrewed', 'trees', 'produced'],
     ['peculiarly', 'weird', 'solemn', 'effect', 'carried', 'thoughts'],
     ['grim', 'fancies', 'engendered', 'earlier', 'evening', 'falling', 'sunset'],
     ['threw', 'strange', 'relief', 'ghost-like', 'clouds', 'amongst'],
     ['carpathians', 'seem', 'wind', 'ceaselessly', 'valleys', 'sometimes'],
     ['hills', 'steep', 'despite', 'driver', 'haste', 'horses', 'could'],
     ['go', 'slowly', 'wished', 'get', 'walk', 'home'],
     ['driver', 'would', 'hear', 'said', 'must'],
     ['walk', 'dogs', 'fierce', 'added']]

P.S.:
    Don't forget to:
        
        nltk.download('stopwords')
        nltk.download('punkt')

In [6]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary nltk data
nltk.download('stopwords')
nltk.download('punkt')

# Define the function for cleaning text
def cleanText(text_lines):
    # List of punctuation to remove
    punctuation_list = list(string.punctuation) + ["“","’","--","”"]
    
    # Set of English stop words
    stop_words = set(stopwords.words('english'))
    
    cleaned_tokens = []
    
    for line in text_lines:
        # Convert to lowercase
        line = line.lower()
        # Tokenize the line
        tokens = word_tokenize(line)
        # Remove punctuation and stopwords
        cleaned_line = [word for word in tokens if word not in punctuation_list and word not in stop_words]
        cleaned_tokens.append(cleaned_line)
    
    return cleaned_tokens

# Apply the cleaning function to the Dracula text
cleaned_dracula = cleanText(dracula)

# Output a small sample
cleaned_dracula[400:410]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\MiyahSegura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MiyahSegura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['seemed', 'darkness', 'closing', 'upon', 'us', 'great', 'masses'],
 ['greyness', 'bestrewed', 'trees', 'produced'],
 ['peculiarly', 'weird', 'solemn', 'effect', 'carried', 'thoughts'],
 ['grim', 'fancies', 'engendered', 'earlier', 'evening', 'falling', 'sunset'],
 ['threw', 'strange', 'relief', 'ghost-like', 'clouds', 'amongst'],
 ['carpathians', 'seem', 'wind', 'ceaselessly', 'valleys', 'sometimes'],
 ['hills', 'steep', 'despite', 'driver', 'haste', 'horses', 'could'],
 ['go', 'slowly', 'wished', 'get', 'walk', 'home'],
 ['driver', 'would', 'hear', 'said', 'must'],
 ['walk', 'dogs', 'fierce', 'added']]

### 03 - Probability Calculation

In this section, we will calculate the probability of occurrence for each word within our text. This is achieved by dividing the frequency of each word by the total number of words present in the text.
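
Before turning to NLTK, the calculation itself can be sketched in a few lines of plain Python: the probability of a word is its count divided by the total number of tokens. The token list and variable names below are made up for illustration.

```python
from collections import Counter

# Hypothetical toy token list standing in for the cleaned corpus.
toy_tokens = ["count", "castle", "count", "night", "castle", "count"]

toy_counts = Counter(toy_tokens)
total_tokens = sum(toy_counts.values())

# Relative frequency: count(word) / total number of tokens.
toy_probs = {word: count / total_tokens for word, count in toy_counts.items()}

for word, prob in toy_probs.items():
    print(f"{word}: {prob:.4f}")
```

The probabilities of all distinct words sum to 1, which is a quick sanity check for any unigram model built this way.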

To accomplish this, we will utilize a selection of fundamental functions provided by the NLTK library:

#### **- padded_everygram_pipeline:**

    from nltk.lm.preprocessing import padded_everygram_pipeline
    
    This function prepares a sequence of textual data for training n-gram language models. It performs two operations: padding and the generation of everygrams. Padding adds special tokens at the beginning and end of each sentence to mark sentence boundaries, using <s> for the start and </s> for the end. The function adds n-1 padding tokens on each side, so a unigram model (n=1) receives no padding tokens. Everygrams are contiguous sequences of tokens, or n-grams, and this function produces all n-grams from size 1 up to the specified size.

    **Parameters:**
        `n`: The order of the model, which specifies the maximum size of n-grams to be created. For instance, n=2 would create bigrams (and unigrams), n=3 would create trigrams (and bigrams and unigrams), and so on.

        `text`: An iterable of lists of strings (sentences), where each string is a token. Essentially, this should be the tokenized text data split into sentences.

    **Returns:**
        The function returns a tuple with two elements:

        `train_data`: An iterable of lists of n-grams represented as tuples. For a unigram model (n=1), this would be an iterable of lists of single-word tuples. For higher-order models, these tuples would contain more words. The iterable is ready to be used as training data for an n-gram model.

        `padded_vocab`: An iterator over every token in the padded sentences, flattened into a single stream, so a token appears as many times as it occurs in the text. The model builds its vocabulary from this stream. Converting the iterator to a set gives the unique tokens, including any padding tokens.
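
The following sketch shows what `padded_everygram_pipeline` returns for a small corpus, once with `n=2` and once with `n=1`. The two sentences and all variable names are hypothetical. Both returned objects are lazy, so they are converted to lists here only for inspection.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: two already-tokenized sentences.
sentences = [["the", "count", "smiled"], ["the", "wolves", "howled"]]

# Order 2: each sentence gets one <s> and one </s>, then unigrams and bigrams are generated.
bigram_data, bigram_vocab = padded_everygram_pipeline(2, sentences)
bigram_ngrams = [list(sentence_ngrams) for sentence_ngrams in bigram_data]
bigram_tokens = list(bigram_vocab)

# Order 1: padding adds n - 1 = 0 symbols, so only the original tokens appear.
unigram_data, unigram_vocab = padded_everygram_pipeline(1, sentences)
unigram_ngrams = [list(sentence_ngrams) for sentence_ngrams in unigram_data]
unigram_tokens = list(unigram_vocab)

print(sorted(bigram_ngrams[0], key=len))  # n-grams of the first sentence, shortest first
print(bigram_tokens)                      # flat token stream, repeats included
print(unigram_ngrams[0])
print(unigram_tokens)
```

Note that the iterators can be consumed only once. If you list them for inspection, call the pipeline again before passing its results to a model.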
    


#### **- MLE:**

    from nltk.lm import MLE    
    
    The MLE class is used to create an n-gram language model where the probability of each n-gram is estimated using the Maximum Likelihood Estimation approach. This approach calculates the probabilities of n-grams purely based on their frequencies in the training data, without applying any smoothing. In other words, the probability of an n-gram is the number of times it occurs in the training data divided by the total number of occurrences of its (n-1)-gram prefix. For a unigram model the prefix is empty, so the denominator is simply the total number of tokens in the training data.

    - fit Function: The fit function is used to train the language model using the provided training data.

        **Parameters**: `train_data`: An iterable of lists of n-grams, where each n-gram is represented as a tuple. `vocab`: An iterable of all the unique tokens in the training data, which can include padding tokens.

        **Returns**: This function does not return a value; instead, it updates the MLE model in place with the probabilities calculated from the training data.


    - score Function: The score function is used to calculate the probability of a given n-gram within the model.

        **Parameters**: `word`: The n-gram for which the probability is being calculated. For unigram models, this would be a single word.
        `context`: (Optional) The preceding words (n-1) of the n-gram for which the probability is calculated. This parameter is not used in unigram models.

        **Returns**: A float representing the probability of the given n-gram. If the n-gram has not been seen in the training data, the probability returned is 0, indicating that, according to the model, the n-gram is impossible in the language.
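
As a minimal sketch of `fit` and `score`, the snippet below trains MLE models on a hypothetical four-token corpus, small enough to check the probabilities by hand. The corpus and variable names are assumptions for illustration only.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: two tokenized sentences, four tokens in total.
sentences = [["count", "castle"], ["count", "night"]]

# Unigram model: probability = word count / total token count.
train_data, vocab_stream = padded_everygram_pipeline(1, sentences)
toy_unigram = MLE(1)
toy_unigram.fit(train_data, vocab_stream)

p_count = toy_unigram.score("count")    # "count" is 2 of the 4 tokens
p_castle = toy_unigram.score("castle")  # "castle" is 1 of the 4 tokens
p_unseen = toy_unigram.score("garlic")  # never seen in training

# Bigram model: the optional context argument holds the preceding word(s).
train_data, vocab_stream = padded_everygram_pipeline(2, sentences)
toy_bigram = MLE(2)
toy_bigram.fit(train_data, vocab_stream)

# "count" is followed once by "castle" and once by "night".
p_castle_given_count = toy_bigram.score("castle", ["count"])

print(p_count, p_castle, p_unseen, p_castle_given_count)
```

The unseen word receives probability 0 because MLE applies no smoothing, which matches the description of `score` above.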
        

Please adhere to the sequence outlined below for your coding task:

1. Employ the `padded_everygram_pipeline` function, utilizing the correct parameter, to generate both `train_data` and `padded_vocab`.
   
2. Transform `padded_vocab` into a list format.

3. Instantiate a model of the MLE type, called `unigram_model`, ensuring to apply the relevant parameter.

4. Proceed to fit the model using `train_data` in conjunction with `padded_vocab`.

5. Construct a dictionary called `unigram_probs` using the `unigram_model.score` method, structured as follows:
   `{"word": probability_of_unigram}`


Output:
    
    unigram_probs (first 10 elements):
     
    ﻿the 1.3585110718652357e-05
    project 0.0012090748539600599
    gutenberg 0.0004211384322782231
    ebook 0.00017660643934248063
    dracula 0.0005298193180274419
    use 0.0007335959788072273
    anyone 8.151066431191414e-05
    anywhere 0.00024453199293574245
    united 0.00024453199293574245
    states 0.00029887243581035183

In [10]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Prepare the data for a unigram model
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, cleaned_dracula)

# Convert the padded vocabulary to a list
vocab_list = list(padded_vocab)

# Create and train an MLE unigram model
unigram_model = MLE(n)
unigram_model.fit(train_data, vocab_list)

# Calculate the unigram probabilities
unigram_probs = {word: unigram_model.score(word) for word in vocab_list}

# Output the first 10 unigram probabilities
for i, (word, prob) in enumerate(unigram_probs.items()):
    if i == 10:
        break
    print(f"{word}: {prob}")


﻿the: 1.3584741618214421e-05
project: 0.0012090420040210836
gutenberg: 0.00042112699016464707
ebook: 0.00017660164103678748
dracula: 0.0005298049231103625
use: 0.0007335760473835787
anyone: 8.150844970928653e-05
anywhere: 0.0002445253491278596
united: 0.0002445253491278596
states: 0.00029886431560071725


### 04 - Presentation

Display the top 10 most probable words along with their corresponding probabilities.
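
One way to pick the highest-probability entries from a dictionary is `heapq.nlargest`, sketched here on a hypothetical four-word dictionary. Sorting the items in descending order and slicing gives the same result and is equally acceptable.

```python
import heapq

# Hypothetical probabilities; the real values come from unigram_probs.
toy_probs = {"one": 0.3, "us": 0.1, "said": 0.4, "could": 0.2}

# Select the two entries with the largest probability, highest first.
top_two = heapq.nlargest(2, toy_probs.items(), key=lambda item: item[1])

for word, prob in top_two:
    print(f"{word}: {prob}")
```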

In [12]:
# Sort the unigram probabilities in descending order
sorted_unigrams = sorted(unigram_probs.items(), key=lambda item: item[1], reverse=True)

# Display the top 10 most probable words
print("Top 10 most probable words:")
for word, prob in sorted_unigrams[:10]:
    print(f"{word}: {prob}")


Top 10 most probable words:
said: 0.0077433027223822205
one: 0.006846709775580068
could: 0.0066836928761614955
us: 0.006262565885996848
must: 0.00611313372819649
would: 0.00586860837906863
shall: 0.005814269412595772
may: 0.005637667771558985
see: 0.00540672716404934
know: 0.005298049231103624


### 05 - Export the Model for CodeGrade Evaluation

Using the "pickle" library:

- Export the model `unigram_model` as "unigram_model.pkl".


In [14]:
import pickle

# Save the model to disk
with open('unigram_model.pkl', 'wb') as file:
    pickle.dump(unigram_model, file)
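
A quick way to confirm that an export worked is to load the file back and compare it with the original object. The sketch below does this round trip with a plain dictionary as a hypothetical stand-in for `unigram_model`, written to a temporary directory so that it does not touch the real `unigram_model.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model.
toy_model = {"count": 0.5, "castle": 0.25, "night": 0.25}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Write the object in binary mode.
    with open(toy_path, "wb") as file:
        pickle.dump(toy_model, file)

    # Read it back in binary mode.
    with open(toy_path, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == toy_model)
```

Only unpickle files that you created yourself or that come from a trusted source, since loading a pickle can execute arbitrary code.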


This material is for enrolled students' academic use only and protected under U.S. Copyright Laws. This content must not be shared outside the confines of this course, in line with Eastern University's academic integrity policies. Unauthorized reproduction, distribution, or transmission of this material, including but not limited to posting on third-party platforms like GitHub, is strictly prohibited and may lead to disciplinary action. You may not alter or remove any copyright or other notice from copies of any content taken from BrightSpace or Eastern University’s website.

© Copyright Notice 2024, Eastern University - All Rights Reserved