# CSE 5334 Programming Assignment 1 (P1)

### Dataset

We use a corpus of 15 inaugural addresses by different US presidents. The corpus has been preprocessed and is provided as a .zip file containing 15 .txt files.

In [67]:
import os

corpusroot = './US_Inaugural_Addresses'
documents = {}  # Use a dictionary to store content of each file

for filename in os.listdir(corpusroot):
    if filename.startswith('0') or filename.startswith('1'):
        try:
            with open(os.path.join(corpusroot, filename), "r", encoding='windows-1252') as file:
                doc = file.read().lower()
                documents[filename] = doc  # Store content in the dictionary with filename as key
        except Exception as e:
            print(f"Error reading {filename}: {e}")

(2) <b>Tokenize</b> the content of each file. For this, you need a tokenizer. A regular expression tokenizer can, for example, return all course numbers in a string. Experiment with it: change the regular expression and the input string and observe how the output changes.

For tokenizing the inaugural presidential speeches, we will use RegexpTokenizer(r'[a-zA-Z]+'), which keeps runs of ASCII letters and drops digits and punctuation.
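To see what the pattern `[a-zA-Z]+` keeps and discards, here is a minimal sketch that uses the standard library's `re.findall` as a stand-in for `RegexpTokenizer`. It assumes the tokenizer in its default mode simply returns the regex matches; `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern the notebook passes to RegexpTokenizer
TOKEN_PATTERN = re.compile(r'[a-zA-Z]+')

def simple_tokenize(text):
    """Return maximal runs of ASCII letters from the lowercased text."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sample sentence, not taken from the corpus
sample = "On the 14th day of the present month, I wasn't ready."
tokens = simple_tokenize(sample)
print(tokens)
# ['on', 'the', 'th', 'day', 'of', 'the', 'present', 'month', 'i', 'wasn', 't', 'ready']
```

Digits are dropped, so an ordinal such as "14th" leaves only 'th', and an apostrophe splits a contraction into two tokens. This is likely why a stray 'th' shows up in the tokenized output for the Washington address.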


In [68]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
stemmer = PorterStemmer()
# Define the stopword set here so this cell does not depend on a later cell having run
stop_words = set(stopwords.words('english'))

# 'documents' is the dictionary containing the content of all files from the previous step
tokenized_documents = {}
for filename, content in documents.items():
    tokens = tokenizer.tokenize(content)
    # Remove stopwords first (the list holds unstemmed words), then stem what remains
    stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    tokenized_documents[filename] = stemmed_tokens

    # Print the raw tokens (before stopword removal and stemming) for each file
    print(f"Tokenized content for {filename}:")
    print(tokens)
    print("\n" + "-"*50 + "\n")  # Separator for better readability

Tokenized content for 01_washington_1789.txt:
['george', 'washington', 'fellow', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', 'i', 'among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that', 'of', 'which', 'the', 'notification', 'was', 'transmitted', 'by', 'your', 'order', 'and', 'received', 'on', 'the', 'th', 'day', 'of', 'the', 'present', 'month', 'on', 'the', 'one', 'hand', 'i', 'was', 'summoned', 'by', 'my', 'country', 'whose', 'voice', 'i', 'can', 'never', 'hear', 'but', 'with', 'veneration', 'and', 'love', 'from', 'a', 'retreat', 'which', 'i', 'had', 'chosen', 'with', 'the', 'fondest', 'predilection', 'and', 'in', 'my', 'flattering', 'hopes', 'with', 'an', 'immutable', 'decision', 'as', 'the', 'asylum', 'of', 'my', 'declining', 'years', 'a', 'retreat', 'which', 'was', 'rendered', 'every', 'day', 'more', 'necessary', 'as', 'well', 'as', 'more', 

(3) Perform <b>stopword removal</b> on the obtained tokens. NLTK already comes with a stopword list, as a corpus in the "NLTK Data" (http://www.nltk.org/nltk_data/). You need to install this corpus. Follow the instructions at http://www.nltk.org/data.html. You can also find the instructions in this book: http://www.nltk.org/book/ch01.html (Section 1.2 Getting Started with NLTK). In short, run the following statements in the Python interpreter, with the call to nltk.download() uncommented. A pop-up window will appear. Click "Corpora" and choose "stopwords" from the list.

In [69]:
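import nltk
# nltk.download()             # opens the interactive downloader described above
# nltk.download('stopwords')  # alternative: fetch only the stopword corpus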

After the stopword list is downloaded, you will find a file named "english" in the folder nltk_data/corpora/stopwords, where nltk_data is the download directory from the step above. The file contains 179 stopwords. nltk.corpus.stopwords gives you this list of stopwords. Try the following piece of code.

In [70]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
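The order of stopword removal and stemming matters: the list above holds unstemmed words, so tokens should be filtered before they are stemmed, as the tokenization cell does. A minimal sketch of the filtering step follows. It uses a small hand-picked subset of the list printed above rather than the full NLTK corpus, and `SAMPLE_STOPWORDS` and `remove_stopwords` are illustrative names, not part of NLTK.

```python
# A few entries copied from the NLTK English stopword list (not the full list)
SAMPLE_STOPWORDS = {'i', 'was', 'by', 'my', 'the', 'of', 'and', 'with'}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

# Tokens as they appear near the start of the Washington address
tokens = ['i', 'was', 'summoned', 'by', 'my', 'country']
filtered = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(filtered)  # ['summoned', 'country']
```

Holding the stopwords in a set, as `set(stopwords.words('english'))` does, makes each membership test constant time on average, which matters when filtering every token of every document.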

(4) Also perform <b>stemming</b> on the obtained tokens. NLTK comes with a Porter stemmer. Try the following code and learn how to use the stemmer.

In [71]:

print(stemmer.stem('studying'))
print(stemmer.stem('vector'))
print(stemmer.stem('entropy'))
print(stemmer.stem('hispanic'))
print(stemmer.stem('ambassador'))

studi
vector
entropi
hispan
ambassador


In [72]:
import math

def compute_idf(tokenized_docs):
    N = len(tokenized_docs)
    idf_dict = {}
    for tokens in tokenized_docs.values():
        for token in set(tokens):
            idf_dict[token] = idf_dict.get(token, 0) + 1

    for token, df in idf_dict.items():
        idf_dict[token] = math.log10(N / df)

    return idf_dict

idf_values = compute_idf(tokenized_documents)

def getidf(term):
    stemmed_term = stemmer.stem(term)
    return idf_values.get(stemmed_term, -1)

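def compute_weights(tokenized_docs, idf_vals):
    weights = {}
    for filename, tokens in tokenized_docs.items():
        # Count each term once instead of calling tokens.count() for every token
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1

        # Logarithmic tf multiplied by idf
        tf_idf = {}
        for token, count in counts.items():
            tf = 1 + math.log10(count)
            tf_idf[token] = tf * idf_vals.get(token, 0)

        # Cosine normalization (skipped when every weight is zero, to avoid dividing by zero)
        norm = math.sqrt(sum(value**2 for value in tf_idf.values()))
        if norm > 0:
            for token, value in tf_idf.items():
                tf_idf[token] = value / norm

        weights[filename] = tf_idf

    return weights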

document_weights = compute_weights(tokenized_documents, idf_values)


def getweight(doc, term):
    stemmed_term = stemmer.stem(term)
    return document_weights.get(doc, {}).get(stemmed_term, 0)

def query(q):
    query_tokens = tokenizer.tokenize(q.lower())
    query_tokens = [stemmer.stem(token) for token in query_tokens if token not in stop_words]
    
    # Compute weights for query
    query_weights = {}
    for token in query_tokens:
        tf = 1 + math.log10(query_tokens.count(token))
        query_weights[token] = tf

    # Cosine normalization for query
    norm = math.sqrt(sum([value**2 for value in query_weights.values()]))
    for token, value in query_weights.items():
        query_weights[token] = value / norm

    # Compute similarity scores
    scores = {}
    for doc, weights in document_weights.items():
        score = sum([query_weights.get(token, 0) * weight for token, weight in weights.items()])
        scores[doc] = score

    # Get the document with the highest score
    best_doc = max(scores, key=scores.get)
    return best_doc, scores[best_doc]



(5) Using the tokens, we would like to compute the TF-IDF vector for each document. Given a query string, we can also calculate the query vector and calculate the similarity between the query and each document.

In class, we learned that we can use different weightings for queries and documents. The possible choices are shown below:

<img src = 'weighting_scheme.png'>

The notation of a weighting scheme is as follows: ddd.qqq, where ddd denotes the combination used for document vector and qqq denotes the combination used for query vector.

A very standard weighting scheme is ltc.lnc, where the document and query vectors are processed as follows:

- Document: logarithmic tf, logarithmic idf, cosine normalization
- Query: logarithmic tf, no idf, cosine normalization
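The ltc.lnc scheme can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not the assignment's reference solution: the three toy documents, the query, and the helper names `ltc_weights`, `lnc_weights` and `cosine` are all made up, and the tokens are assumed to be already stemmed and stopword-free.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """docs: {name: [token, ...]} -> {name: {token: weight}} (log tf, log idf, cosine)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(token for tokens in docs.values() for token in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        raw = {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
               for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights[name] = {t: (w / norm if norm else 0.0) for t, w in raw.items()}
    return weights

def lnc_weights(tokens):
    """Query weights: log tf, no idf, cosine normalization."""
    raw = {t: 1 + math.log10(c) for t, c in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

def cosine(query_w, doc_w):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * doc_w.get(t, 0.0) for t, w in query_w.items())

# Hypothetical toy corpus
toy_docs = {
    "d1": ["war", "war", "peace", "union"],
    "d2": ["peace", "union", "union"],
    "d3": ["union", "trade"],
}
doc_w = ltc_weights(toy_docs)
query_w = lnc_weights(["war", "peace"])
scores = {name: cosine(query_w, w) for name, w in doc_w.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy corpus "union" occurs in every document, so its idf is log10(3/3) = 0 and it contributes nothing to any document vector. "d1" ranks first because it is the only document containing "war", the rarer of the two query terms.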

Implement query-document similarity using the <b>ltc.lnc</b> weighting scheme and show the outputs for the following:

In [73]:
print("%.12f" % getidf('british'))
print("%.12f" % getidf('union'))
print("%.12f" % getidf('war'))
print("%.12f" % getidf('military'))
print("%.12f" % getidf('great'))
print("--------------")
print("%.12f" % getweight('02_washington_1793.txt','arrive'))
print("%.12f" % getweight('07_madison_1813.txt','war'))
print("%.12f" % getweight('12_jackson_1833.txt','union'))
print("%.12f" % getweight('09_monroe_1821.txt','british'))
print("%.12f" % getweight('05_jefferson_1805.txt','public'))
print("--------------")
print("(%s, %.12f)" % query("pleasing people"))
print("(%s, %.12f)" % query("british war"))
print("(%s, %.12f)" % query("false public"))
print("(%s, %.12f)" % query("people institutions"))
print("(%s, %.12f)" % query("violated willingly"))

0.698970004336
0.062147906749
0.096910013008
0.273001272064
0.096910013008
--------------
0.303506392956
0.016131105316
0.011635243734
0.042120287643
0.004181740452
--------------
(03_adams_john_1797.txt, 0.030495521512)
(07_madison_1813.txt, 0.075637848644)
(05_jefferson_1805.txt, 0.070378778537)
(12_jackson_1833.txt, 0.010531477991)
(02_washington_1793.txt, 0.287226944691)
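As a sanity check on the idf values printed above, the formula used in compute_idf, idf = log10(N / df), can be inverted to recover each term's document frequency. A short sketch, assuming N = 15 documents and copying the five reported values by hand (`reported_idf` and `recovered_df` are names introduced here):

```python
N = 15  # number of documents in the corpus

# idf values copied from the output above
reported_idf = {
    'british': 0.698970004336,
    'union': 0.062147906749,
    'war': 0.096910013008,
    'military': 0.273001272064,
    'great': 0.096910013008,
}

# Invert idf = log10(N / df) to get df = N / 10**idf
recovered_df = {term: round(N / 10 ** idf) for term, idf in reported_idf.items()}
print(recovered_df)
# {'british': 3, 'union': 13, 'war': 12, 'military': 8, 'great': 12}
```

Each value maps back to a whole number of documents, which is what one would expect if the idf computation is consistent. Because getidf stems its argument, these counts refer to documents containing the stem of each term.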
