# Building a Unigram Language Model in NLP

## By Brea Koenes

### Overview

Objective: Develop a program that calculates the probability of each word within a provided text corpus using a unigram model.

Steps:

- 1 - Import the Data

- 2 - Data Cleaning and Tokenization

- 3 - Probability Calculation

- 4 - Presentation

- 5 - Export the Model

### 01 - Importing the Data

Data: Bram Stoker's classic novel "Dracula".

In [2]:
# Path to the plain-text file of the novel
data = 'pg345.txt'

In [3]:
# List where each element is a line from the book
dracula = []
with open(data) as file:
    for line in file:
        line = line.rstrip('\n')
        dracula.append(line)
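
The cleaned output in the next section begins with `'\ufeffthe'`, which suggests the file starts with a UTF-8 byte-order mark (BOM) that ends up glued to the first token. A minimal sketch, assuming the file is UTF-8 encoded and using a hypothetical one-line sample file in place of `pg345.txt`, of how the `utf-8-sig` codec drops that mark:

```python
import os
import tempfile

# Hypothetical sample standing in for the first line of pg345.txt,
# written with a leading UTF-8 byte-order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffThe Project Gutenberg eBook of Dracula\n".encode("utf-8"))
    sample_path = tmp.name

# Plain UTF-8 decoding keeps the BOM as the first character.
with open(sample_path, encoding="utf-8") as file:
    with_bom = [line.rstrip("\n") for line in file]

# "utf-8-sig" strips a leading BOM if one is present.
with open(sample_path, encoding="utf-8-sig") as file:
    without_bom = [line.rstrip("\n") for line in file]

os.remove(sample_path)

print(repr(with_bom[0][:4]))     # '\ufeffThe'
print(repr(without_bom[0][:4]))  # 'The '
```

With the mark removed, the first token would be a plain `the`, which the stop-word filter would then discard.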

### 02 - Data Cleaning

Refine the list `dracula` by tokenizing each line, converting the tokens to lowercase, and removing punctuation and stop words (very common words such as "the" and "of").

In [6]:
# Imports
import nltk
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Takes a list of text lines as input and returns a flat list of cleaned tokens.
def cleanText(texts):
    # Sets give constant-time membership tests
    punctuation = set(string.punctuation)
    stop_words = set(stopwords.words('english'))

    # Typographic and multi-character punctuation not covered by string.punctuation
    punctuation.update(["“", "’", "--", "”"])

    cleaned_texts = []

    for text in texts:
        # Tokenize the line and lowercase each token
        for token in word_tokenize(text):
            token = token.lower()

            # Skip punctuation and stop words
            if token in punctuation or token in stop_words:
                continue

            cleaned_texts.append(token)

    return cleaned_texts

# Apply to data
dracula = cleanText(dracula)
dracula[0:10]

['\ufeffthe',
 'project',
 'gutenberg',
 'ebook',
 'dracula',
 'ebook',
 'use',
 'anyone',
 'anywhere',
 'united']
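
The first tokens above come from the Project Gutenberg license header rather than from the novel, so they are counted along with the story text. A minimal sketch of trimming that boilerplate from the raw lines before cleaning, assuming the file marks the body with lines beginning `*** START OF` and `*** END OF` (the helper name and the sample lines are hypothetical):

```python
def strip_gutenberg_boilerplate(lines):
    """Return the lines between the START and END markers.

    Assumes the markers begin with "*** START OF" and "*** END OF";
    if either marker is missing, the lines are returned unchanged.
    """
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF") and start is None:
            start = i
        elif line.startswith("*** END OF"):
            end = i
            break
    if start is None or end is None or end <= start:
        return list(lines)
    return lines[start + 1:end]


# Hypothetical sample standing in for the raw lines of the file.
sample = [
    "The Project Gutenberg eBook of Dracula",
    "*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***",
    "DRACULA",
    "by Bram Stoker",
    "*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***",
    "Updated editions will replace the previous one.",
]
print(strip_gutenberg_boilerplate(sample))  # ['DRACULA', 'by Bram Stoker']
```

This would be applied to the list of raw lines before it is passed to `cleanText`.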

### 03 - Probability Calculation

Estimate the probability of each word as its relative frequency: the number of times the word occurs divided by the total number of tokens in the text, `P(w) = count(w) / N`.
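
The calculation itself needs nothing beyond counting. A minimal sketch in plain Python, using a hypothetical helper name and a toy token list in place of the cleaned corpus, that can serve as a cross-check on the library result:

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Map each distinct token to count(token) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy token list standing in for the cleaned corpus.
tokens = ["count", "dracula", "castle", "count", "night", "count", "castle", "wolves"]
probs = unigram_probabilities(tokens)

print(probs["count"])   # 3 of 8 tokens -> 0.375
print(probs["castle"])  # 2 of 8 tokens -> 0.25
```

Because every token contributes to exactly one count, the probabilities over all distinct words sum to 1.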

In [7]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Generate sets with 1 as the maximum size of n-grams and dracula as the data
train_data, padded_vocab = padded_everygram_pipeline(1, dracula)

# Transform padded_vocab into a list
padded_vocab = list(padded_vocab)

# Create an MLE unigram model
unigram_model = MLE(1)

# Fit the model using train_data and padded_vocab
unigram_model.fit(train_data, padded_vocab)

# Construct a dictionary of unigram probabilities, scoring each distinct entry once
unigram_probs = {word: unigram_model.score(word) for word in dict.fromkeys(padded_vocab)}
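
For word-level probabilities, `padded_everygram_pipeline` needs a list of tokenized sentences rather than a flat list of tokens. A minimal sketch, assuming the whole corpus is treated as a single sentence and using a hypothetical toy token list in place of `dracula`:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy token list standing in for the cleaned corpus.
tokens = ["count", "dracula", "castle", "count", "night", "count"]

# Wrap the flat token list in an outer list so it is read as one sentence
# of words rather than many "sentences" of characters.
train_data, vocab = padded_everygram_pipeline(1, [tokens])

word_model = MLE(1)
word_model.fit(train_data, vocab)

# Score each distinct word once.
word_probs = {word: word_model.score(word) for word in set(tokens)}
print(word_probs["count"])  # 3 of 6 tokens -> 0.5
```

With an order of 1 no padding symbols are added, so the scores are plain relative frequencies of the words.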

### 04 - Presentation

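Display the 10 most probable entries and their probabilities. Note that the entries printed here are single characters, not words: `padded_everygram_pipeline` in step 3 was given a flat list of tokens and treated each token as a sequence of characters.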

In [8]:
top_10_probs = sorted(unigram_probs.items(), key=lambda x: x[1], reverse=True)[:10]

for word, prob in top_10_probs:
    print(f"{word}:{prob}")

e:0.12772979339451343
a:0.07334004024144869
t:0.0697305785935123
s:0.06914658683810178
o:0.06852578887961917
n:0.06785100848996417
r:0.06663149629484222
i:0.06148108161162095
l:0.05518967463316484
d:0.049894488884526675
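
When only a handful of entries are needed, `heapq.nlargest` is an alternative to sorting the whole dictionary. A small sketch, with a hypothetical dictionary standing in for `unigram_probs` and probabilities rounded for display:

```python
import heapq

# Hypothetical probabilities standing in for unigram_probs.
sample_probs = {"count": 0.5, "castle": 0.25, "night": 0.125, "wolves": 0.125}

# nlargest avoids sorting the whole dictionary when only a few entries are needed.
top_2 = heapq.nlargest(2, sample_probs.items(), key=lambda item: item[1])

for word, prob in top_2:
    print(f"{word}: {prob:.4f}")
# count: 0.5000
# castle: 0.2500
```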


### 05 - Export the Model

Serialize the fitted model with `pickle` so it can be reused without retraining.

In [7]:
import pickle

# Save the model to disk
with open('unigram_model.pkl', 'wb') as file:
    pickle.dump(unigram_model, file)
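
Loading works the same way in reverse. A minimal round-trip sketch, using a hypothetical dictionary and a temporary file in place of the fitted model and `unigram_model.pkl`:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model: any picklable object works the same way.
saved_probs = {"count": 0.5, "castle": 0.25}

path = os.path.join(tempfile.mkdtemp(), "unigram_probs.pkl")

with open(path, "wb") as file:
    pickle.dump(saved_probs, file)

# Only unpickle files from a trusted source: loading can execute arbitrary code.
with open(path, "rb") as file:
    loaded_probs = pickle.load(file)

print(loaded_probs == saved_probs)  # True
```

To load the pickled NLTK model itself, NLTK must be installed in the environment that reads the file, since `pickle` stores a reference to the model class rather than its code.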