# Natural Language Processing Assignment

### **Jeffery Osagie**  

#### AI4PH

##### June, 2024

## Introduction

This notebook works through several basic Natural Language Processing (NLP) tasks using the Brown Corpus from the Natural Language Toolkit (NLTK) library. It covers raw text preprocessing, stemming, frequency analysis of tokens, and part-of-speech tagging. The tasks are:

1. Loading the Brown Corpus from NLTK with the corpus reader function `paras()`, which returns the text as paragraphs.

2. Removing punctuation and stopwords.

3. Applying the Lancaster Stemmer to reduce tokens to their stems.

4. Calculating term frequency (TF) over the entire corpus and printing the 10 words with the highest TF values.

5. Calculating term frequency-inverse document frequency (TF-IDF), treating each paragraph as a document, and printing the 10 highest-scoring words.

6. Part-of-speech (POS) tagging of tokens with the `pos_tag` function from NLTK.

7. Printing the 10 most common trigrams of word-tag pairs, with their frequencies, using `nltk.trigrams()`.

### Import Relevant Libraries and Download Necessary Data

In [None]:
# install required dependencies
%pip install nltk

In [None]:

# import relevant libraries

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from nltk.probability import FreqDist
from nltk import pos_tag, trigrams
from nltk.tokenize import word_tokenize
import string
import math
import re

# Download necessary NLTK data
nltk.download('brown')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### 1. Loading the Brown Corpus

The `paras()` function returns the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

In [3]:
# Load Brown Corpus paragraphs
brown_paras = brown.paras()

# Print the first two paragraphs
print(brown_paras[:2]) 


[[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']], [['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']]]
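
The three nesting levels can be made concrete with a small hand-built structure. The sketch below uses a hypothetical `toy_paras` list (written by hand, not read from NLTK) that has the same shape as the output of `paras()`, and counts its paragraphs, sentences, and words.

```python
# Hypothetical stand-in shaped like brown.paras(): paragraphs -> sentences -> words.
toy_paras = [
    [["The", "jury", "met", "."]],
    [["It", "praised", "the", "city", "."], ["No", "errors", "were", "found", "."]],
]

# Outer list: paragraphs.
num_paras = len(toy_paras)
# Middle level: sentences inside each paragraph.
num_sents = sum(len(para) for para in toy_paras)
# Inner level: word strings inside each sentence (punctuation counts as a word).
num_words = sum(len(sent) for para in toy_paras for sent in para)

print(num_paras, num_sents, num_words)
```

The same three expressions should work on `brown_paras`, although iterating over the full corpus takes noticeably longer.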


### **Preprocessing**


`brown.paras()` returns a nested structure: a list of paragraphs, each a list of sentences, each a list of words. We flatten each paragraph into a single string so that it can be passed to tools that expect plain text, such as the regex cleaning and `word_tokenize` steps used in the preprocessing function. Each paragraph stays a separate string because paragraphs later serve as the documents for TF-IDF.

In [4]:
# Join the words and sentences of each paragraph into one string per paragraph
paras = [' '.join([' '.join(sent) for sent in para]) for para in brown_paras]

# Print first two paragraphs of flattened list
print(paras[:2])

["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .", "The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted ."]


### 2. Removing Punctuation and Stopwords



In [5]:

# Define stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)


Each paragraph is converted to lowercase, stripped of non-letter characters, tokenized, and then filtered to remove **punctuation** and **stopwords**. The remaining tokens are stemmed.

**Tokenization** breaks the text into individual units (tokens/words) that can be processed and analyzed.

**Stopwords** are common words such as "*and*," "*the*," "*is*," and "*in*" that are typically filtered out during text processing. They are usually removed because they carry little meaningful information and can clutter the analysis.

**Stemming** reduces words to their root forms, which helps match and group related words and can improve tasks like document classification or information retrieval.

In [6]:
# Preprocess text: lowercase, strip non-letters, tokenize, filter tokens, and stem
def preprocess(text, stemmer):
    # Convert to lowercase and delete every character that is not a letter or whitespace
    # (digits, apostrophes, and hyphens are removed, so "term-end" becomes "termend")
    text_lower = re.sub(r'[^a-zA-Z\s]', '', text.lower())

    # Tokenize the text
    tokens = word_tokenize(text_lower)

    # Drop stopwords, tokens ending in "ed" (a rough past-tense filter that also
    # drops words such as "need"), and tokens shorter than 3 letters, then stem.
    # The punctuation, hyphen, and digit checks are kept as safeguards; the regex
    # above has already removed those characters.
    filtered_tokens = [stemmer.stem(token) for token in tokens
                       if token not in stop_words
                       and token not in punctuation
                       and not re.search(r'-', token)
                       and not re.search(r'^\d', token)
                       and not re.search(r'ed$', token)
                       and len(token) >= 3]
    return filtered_tokens
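
To see what the cleaning and filtering steps do to a sentence, here is a minimal standard-library sketch. It is not the notebook's function: `filter_tokens` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and stemming is left out.

```python
import re

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "of", "in", "that", "we", "to"}

def filter_tokens(text):
    """Mirror the notebook's cleaning and filtering steps, without stemming."""
    # Lowercase, then delete every character that is not a letter or whitespace.
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text.lower())
    # Whitespace splitting stands in for word_tokenize on the cleaned text.
    tokens = cleaned.split()
    # Keep tokens that are not stopwords, do not end in "ed", and have 3+ letters.
    return [t for t in tokens
            if t not in TOY_STOPWORDS
            and not re.search(r"ed$", t)
            and len(t) >= 3]

sample = "The jury said in term-end presentments that we need Atlanta's votes counted."
print(filter_tokens(sample))
# ['jury', 'said', 'termend', 'presentments', 'atlantas', 'votes']
```

The output shows two side effects worth keeping in mind: deleting hyphens and apostrophes fuses `term-end` into `termend` and `Atlanta's` into `atlantas`, and the `ed$` filter removes `need` along with the past-tense form `counted`.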

### 3. Applying the Lancaster Stemmer



In [7]:
# Initialize Lancaster Stemmer
lancaster = LancasterStemmer()

# Preprocess each paragraph
processed_paras = [preprocess(para, lancaster) for para in paras]


### **Term Frequencies**
Term frequency (TF) counts how often each term occurs in a document or corpus. It is a basic building block of many natural language processing tasks and serves as a simple proxy for how prominent a word is in the text.

### 4. Calculating and Displaying the Top 10 Words by Term Frequency (TF)
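Printing the top 10 words by term frequency gives a first indication of the main topics or themes in the text, which can be useful for tasks like topic modeling, document clustering, or keyword extraction. Note that the tokens are Lancaster stems rather than surface words, so an entry such as `tim` stands for a family of words sharing that stem.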


In [8]:
# Flatten the list of lists into a single list of tokens for TF calculation
all_tokens = [token for para in processed_paras for token in para]

# Calculate term frequency (TF)
tf_dist = FreqDist(all_tokens)
top_10_tf = tf_dist.most_common(10)

# Print top 10 words in terms of TF and their values
print("Top 10 words by Term Frequency (TF):")
for word, freq in top_10_tf:
    print(f"{word}: {freq}")


Top 10 words by Term Frequency (TF):
on: 3501
would: 2715
stat: 2061
said: 1961
tim: 1953
ev: 1946
man: 1796
new: 1785
year: 1689
could: 1602
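
The counting step can be reproduced on a toy corpus with the standard library alone. This sketch assumes a hypothetical `toy_paras` list of already-processed paragraphs and uses `collections.Counter` in place of `FreqDist`; it also prints a relative frequency (count divided by total tokens), which the notebook does not compute.

```python
from collections import Counter

# Hypothetical processed paragraphs (lists of stems), not taken from the Brown Corpus.
toy_paras = [["jury", "said", "elect"], ["jury", "prais", "elect", "elect"]]

# Flatten to one token list, as the notebook does before building the distribution.
all_toy_tokens = [tok for para in toy_paras for tok in para]

counts = Counter(all_toy_tokens)
total = sum(counts.values())

# Raw count is what the notebook reports; relative frequency is an added illustration.
for word, count in counts.most_common(2):
    print(f"{word}: count={count}, relative={count / total:.3f}")
```

Raw counts and relative frequencies give the same ranking within one corpus; the relative form only matters when comparing corpora of different sizes.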


### 5. Calculating and Displaying the Top 10 Words by Term Frequency-Inverse Document Frequency (TF-IDF)
To identify the relevance of a word in a document relative to a collection of documents, we weigh its frequency in the document against its rarity across the corpus. This allows for better feature extraction and improves the performance of information retrieval and text mining tasks by emphasizing significant terms and downplaying common ones. Because TF-IDF is defined per document, the function in this section sums each word's per-paragraph scores to obtain a single corpus-level ranking.

In [9]:
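# Calculate a corpus-level TF-IDF score for each word
def compute_tf_idf(corpus):
    num_docs = len(corpus)
    idf_dict = {}
    tf_idf_dict = {}

    # Document frequency: number of paragraphs that contain each word
    for doc in corpus:
        for word in set(doc):
            idf_dict[word] = idf_dict.get(word, 0) + 1

    # Convert document frequencies to IDF = ln(num_docs / doc_count)
    for word, doc_count in idf_dict.items():
        idf_dict[word] = math.log(num_docs / float(doc_count))

    # Sum each word's per-paragraph TF-IDF, where TF = count / paragraph length
    # (empty paragraphs have no words, so the division is never reached for them)
    for doc in corpus:
        word_counts = FreqDist(doc)
        doc_len = len(doc)
        for word, count in word_counts.items():
            tf = count / float(doc_len)
            idf = idf_dict[word]
            tf_idf = tf * idf
            tf_idf_dict[word] = tf_idf_dict.get(word, 0) + tf_idf

    return tf_idf_dict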


tf_idf_dict = compute_tf_idf(processed_paras)
sorted_tf_idf = sorted(tf_idf_dict.items(), key=lambda item: item[1], reverse=True)[:10]

# Print top 10 words in terms of TF-IDF and their values
print("\nTop 10 words by TF-IDF:")
for word, score in sorted_tf_idf:
    print(f"{word}: {score}")



Top 10 words by TF-IDF:
said: 315.14951335278766
on: 186.13514891683818
would: 173.55228489152353
man: 152.72032435666907
stat: 148.158996245877
tim: 140.29478480601392
get: 139.62499718459867
year: 137.1806091683378
ev: 136.2384551909466
new: 134.67350845426412
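
A worked example on a three-document toy corpus makes the scoring easier to check by hand. The sketch below follows the same summed TF-IDF scheme using only the standard library; `summed_tf_idf` and `toy_corpus` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

def summed_tf_idf(corpus):
    """Sum per-document TF-IDF scores for each word across the corpus."""
    num_docs = len(corpus)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    idf = {word: math.log(num_docs / df) for word, df in doc_freq.items()}
    scores = Counter()
    for doc in corpus:
        for word, count in Counter(doc).items():
            # TF is the word's share of the document's tokens.
            scores[word] += (count / len(doc)) * idf[word]
    return dict(scores)

# Hypothetical processed documents, not taken from the Brown Corpus.
toy_corpus = [
    ["jury", "elect"],
    ["jury", "prais"],
    ["jury", "elect", "elect", "city"],
]

scores = summed_tf_idf(toy_corpus)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

Here `jury` appears in every document, so its IDF is ln(3/3) = 0 and its score is 0 despite being the most frequent word. `prais` scores 0.5 × ln(3), and `elect` scores 0.5 × ln(1.5) + 0.5 × ln(1.5) = ln(1.5). Because the scores are summed over documents, words that occur in many paragraphs can still rank highly, which helps explain why the TF and TF-IDF top-10 lists in this notebook overlap so much.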


### **Part-of-Speech (POS) tagging**
To understand the grammatical structure and meaning of a sentence, we identify and label the part of speech (noun, verb, adjective, etc.) of each word. This supports tasks like syntactic parsing, named entity recognition, and sentiment analysis by providing context and disambiguating the roles words play within sentences. Note that the tagger here receives stemmed, stopword-filtered tokens rather than full sentences, so the resulting tags should be read as approximate.

### 6. Part-of-Speech (POS) Tagging of Tokens


In [10]:
# Tag tokens with part-of-speech tags
tagged_tokens = [pos_tag(para) for para in processed_paras]

### 7. Calculating and Displaying the Most Common Trigrams of Word-Tag Pairs
A trigram is a sequence of three consecutive items in a text. For example, in the sentence "The quick brown fox," the word trigrams are "The quick brown" and "quick brown fox." In this notebook each item is a (word, tag) pair, so a trigram is three consecutive tagged tokens within a paragraph. Trigrams capture more local context than individual words or bigrams.

In [11]:
# Create trigrams of word-tag pairs
trigrams_list = [trigrams(para) for para in tagged_tokens]
flat_trigrams = [trigram for sublist in trigrams_list for trigram in sublist]

# Calculate frequency distribution of trigrams
trigrams_freq = FreqDist(flat_trigrams)
top_10_trigrams = trigrams_freq.most_common(10)

# Print top 10 trigrams and their frequencies
print("\nTop 10 most common trigrams of word-tag pairs:")
for trigram, freq in top_10_trigrams:
    print(f"{trigram}: {freq}")



Top 10 most common trigrams of word-tag pairs:
(('new', 'JJ'), ('york', 'NN'), ('city', 'NN')): 28
(('new', 'JJ'), ('york', 'NN'), ('tim', 'NN')): 23
(('index', 'NN'), ('word', 'NN'), ('electron', 'NN')): 21
(('word', 'NN'), ('electron', 'NN'), ('switch', 'NN')): 20
(('new', 'JJ'), ('york', 'NN'), ('cent', 'NN')): 15
(('drug', 'NN'), ('chem', 'NN'), ('nam', 'FW')): 14
(('per', 'IN'), ('capit', 'NN'), ('incom', 'NN')): 14
(('two', 'CD'), ('year', 'NN'), ('ago', 'RB')): 13
(('john', 'NN'), ('not', 'RB'), ('govern', 'JJ')): 11
(('per', 'IN'), ('head', 'NN'), ('dai', 'NN')): 10
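
The trigram construction can be illustrated without NLTK. In this sketch `toy_trigrams` is a hypothetical helper built on `zip`, and the tagged paragraphs are hand-written examples whose tags were assigned by hand rather than produced by `pos_tag`.

```python
from collections import Counter

def toy_trigrams(items):
    """Return every run of three consecutive items (empty if fewer than three)."""
    return list(zip(items, items[1:], items[2:]))

# Hypothetical, hand-tagged paragraphs; the tags are not produced by pos_tag.
tagged_paras = [
    [("new", "JJ"), ("york", "NN"), ("city", "NN"), ("council", "NN")],
    [("new", "JJ"), ("york", "NN"), ("city", "NN")],
    [("per", "IN"), ("capit", "NN")],  # too short to yield a trigram
]

# Build trigrams per paragraph so that none spans a paragraph boundary.
counts = Counter(tri for para in tagged_paras for tri in toy_trigrams(para))

for trigram, freq in counts.most_common():
    print(f"{trigram}: {freq}")
```

Because stopwords are removed before trigrams are formed, the three items in a trigram are adjacent in the filtered token list but were not necessarily adjacent in the original text.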
