<a href="https://colab.research.google.com/github/Venkatalakshmikottapalli/NLP/blob/main/V_Kottapalli_NLP_Assn3_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part-1

### Introduction to Vector Semantics and Word Embeddings
In NLP, it's important to understand the meanings of words and their contexts. The distributional hypothesis states that words appearing in similar contexts have similar meanings. Vector semantics represents words as numerical vectors (embeddings) based on their use in text.

Word embeddings are crucial because machine learning models require numerical input. One simple method is one-hot encoding, where each word is represented by a binary vector with a single 1 at that word's index and 0 everywhere else.

Note: This approach is inefficient for large vocabularies because a one-hot vector is sparse (most entries are zero). With 10,000 words in the vocabulary, each word's vector has 9,999 zeros, so 99.99% of its elements are zero.
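
The sparsity figure in the note can be checked with a few lines of plain Python. This is only a sketch: the vocabulary size matches the note, while `word_index` is a hypothetical position picked for illustration.

```python
# Hypothetical vocabulary size and word position, chosen only for illustration
vocab_size = 10_000
word_index = 42

# Build the one-hot vector: all zeros except a single 1 at the word's index
one_hot = [0] * vocab_size
one_hot[word_index] = 1

# Fraction of entries that carry no information
zero_fraction = one_hot.count(0) / vocab_size
print(f"{zero_fraction:.2%} of the entries are zero")  # 99.99% of the entries are zero
```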

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from numpy import argmax
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
nltk.download('punkt')

# %matplotlib inline
tokenizer = RegexpTokenizer(r'\w+')

# Define input string
data = 'Live as if you were to die tomorrow. Learn as if you were to live forever'

# Tokenize the string
wordlist = nltk.word_tokenize(data.lower())

# Remove punctuation tokens from the wordlist
wordlist_clean = []

for i in wordlist:  # Go through every word in your tokens list
    if (i not in string.punctuation):  # Remove punctuation
        wordlist_clean.append(i)

# Put the cleaned tokens into a one-column DataFrame, the input format OneHotEncoder expects
wordlist_clean_df = pd.DataFrame(data=wordlist_clean, columns=['words'])

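# Encode using scikit-learn
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2; on older versions use sparse=False
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoder.fit(wordlist_clean_df)
wordlist_clean_df_encoded = one_hot_encoder.transform(wordlist_clean_df)
# categories_ is a list with one array per input column; take the array for the 'words' column
wordlist_clean_df_encoded = pd.DataFrame(data=wordlist_clean_df_encoded, columns=one_hot_encoder.categories_[0])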
print('\n\n One-Hot Encoded Vector using SKLearn')
display(wordlist_clean_df_encoded)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




 One-Hot Encoded Vector using SKLearn


Unnamed: 0,as,die,forever,if,learn,live,to,tomorrow,were,you
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Comment:
- Imports and Downloads:
  - Libraries like nltk, pandas, and sklearn are imported.
  - nltk's tokenizer is downloaded to split sentences into words.
- Tokenization and Cleaning:
  - The sentence is tokenized into words.
  - Punctuation is removed.
- One-Hot Encoding:
  - The cleaned words are put into a DataFrame.
  - OneHotEncoder transforms the words into one-hot vectors.
- Output:
  - The one-hot encoded vectors are displayed in a readable format.

### Introduction to Dense Vector Encoding and Singular Value Decomposition (SVD)
In NLP, a dense vector represents a word or document with a short vector in which most elements are non-zero, unlike sparse vectors, which are mostly zeros. Techniques such as Singular Value Decomposition (SVD) produce dense vectors that aim to capture more detailed relationships between words than simpler methods like one-hot encoding.

Simpler encodings, such as assigning each word an integer or counting word frequencies, have several challenges:

- The integer-encoding is arbitrary (it does not capture any relationship between words).
- An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
- Word order is ignored.
- Raw absolute frequency counts of words do not necessarily represent the meaning of the text properly.
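
The heading names SVD, so here is a minimal sketch of how SVD can turn word counts into dense vectors. It assumes a tiny toy corpus and an arbitrary choice of `k = 2` latent dimensions; both are illustrations, not the notebook's data or a recommended setting.

```python
import numpy as np

# Toy corpus (an assumption for illustration only)
docs = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox",
    "brown fox quick dog",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document count matrix: one row per word, one column per document
counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Thin SVD: counts = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the k largest singular values to get one dense k-dimensional vector per word
k = 2  # arbitrary choice for this sketch
word_vectors = U[:, :k] * s[:k]

for word, vec in zip(vocab, word_vectors):
    print(f"{word:10s} {np.round(vec, 2)}")
```

Words that occur in exactly the same documents (here "sky" and "blue") have identical rows in the count matrix and therefore receive identical dense vectors, which is the kind of relationship an arbitrary integer encoding cannot express.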

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')

# Default Style Settings
# matplotlib.rcParams['figure.dpi'] = 150
# pd.options.display.max_colwidth = 200
# %matplotlib inline

tokenizer = RegexpTokenizer(r'\w+')

# Define input string
data = 'Live as if you were to die tomorrow. Learn as if you were to live forever'

# Tokenize the string
wordlist = nltk.word_tokenize(data.lower())

# Remove punctuation tokens from the wordlist
wordlist_clean = []
for i in wordlist:
    if (i not in string.punctuation):
        wordlist_clean.append(i)

# Count how often each unique word occurs.
# np.unique returns a tuple (sorted unique words, their counts), which becomes the two rows of the DataFrame.
wordlist_clean_df = pd.DataFrame(data=wordlist_clean, columns=['words'])
dense_vector = np.unique(wordlist_clean_df.values.ravel(), return_counts=True)
dense_vector_df = pd.DataFrame(data=dense_vector, columns=np.unique(wordlist_clean_df))

# Display the dense vector representation
display(dense_vector_df)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,as,die,forever,if,learn,live,to,tomorrow,were,you
0,as,die,forever,if,learn,live,to,tomorrow,were,you
1,2,1,1,2,1,2,2,1,2,2


#### Comment:
- Imports and Downloads:
  - Libraries such as numpy, pandas, nltk, re, and string are imported.
  - nltk's punkt tokenizer is downloaded to split sentences into words.
- Tokenization and Cleaning:
  - The input sentence is converted to lowercase and tokenized into individual words.
  - Punctuation is removed from the tokenized words to prepare the data.
- Dense Vector Encoding:
  - Each unique word is paired with the number of times it appears in the sentence.
  - This creates a dense vector in which each element is the frequency of one word in the sentence.
- Output:
  - The dense vector representation is shown as a DataFrame where the columns are the unique words, the first row repeats the words, and the second row holds their frequencies.
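
The frequency vector described above can also be sketched with the standard library alone. This is an alternative illustration, not the notebook's method: it assumes a simple `\w+` regular expression is an acceptable tokenizer for this sentence.

```python
import re
from collections import Counter

data = 'Live as if you were to die tomorrow. Learn as if you were to live forever'

# Lowercase and keep only runs of word characters, which also drops the punctuation
tokens = re.findall(r'\w+', data.lower())

counts = Counter(tokens)
vocab = sorted(counts)                      # unique words in alphabetical order
count_vector = [counts[w] for w in vocab]   # one frequency per vocabulary word

print(vocab)
print(count_vector)  # [2, 1, 1, 2, 1, 2, 2, 1, 2, 2]
```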

### Introduction to N-Gram Bag-of-Words Model
In natural language processing (NLP), the N-Gram Bag-of-Words model extends the traditional Bag-of-Words approach by counting sequences of n adjacent words (n-grams) instead of single words alone. This preserves local word order and some context, which a plain Bag-of-Words model discards, and that is useful for tasks like sentiment analysis and topic modeling.
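
As a minimal sketch of what an n-gram is, the hypothetical helper `ngrams` below slides a window of `n` tokens over a sentence. It only illustrates the idea and is not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'sky blue beautiful'.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['sky', 'blue', 'beautiful']
print(bigrams)   # ['sky blue', 'blue beautiful']
```

A bigram model therefore distinguishes "sky blue" from "blue sky", while a unigram Bag-of-Words model treats both orderings identically.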

In [None]:
import pandas as pd                        # Data manipulation with pandas
import numpy as np                         # Matrix algebra with numpy
import matplotlib.pyplot as plt            # Visual display of data with matplotlib
import nltk                                # Natural Language Toolkit
import re                                  # Regular expression operations
import string                              # String operations
nltk.download('stopwords')                 # Download stopwords from NLTK
from nltk.corpus import stopwords          # Import stopwords from NLTK

from nltk.stem import PorterStemmer        # Stemming
from sklearn.feature_extraction.text import CountVectorizer

# Default Style Settings
# matplotlib.rcParams['figure.dpi'] = 150
pd.options.display.max_colwidth = 200
# %matplotlib inline

corpus = [
    'The sky is blue and beautiful.', 'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    'I love green eggs, ham, sausages and bacon!',
    'The brown fox is quick and the blue dog is lazy!',
    'The sky is very blue and the sky is very beautiful today',
    'The dog is lazy but the brown fox is quick!'
]
labels = [
    'weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather',
    'animals'
]

corpus = np.array(corpus)  # Convert corpus to numpy array
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove everything except letters and whitespace.
    # flags must be passed by keyword: the fourth positional argument of re.sub is count.
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()  # Convert to lowercase
    doc = doc.strip()  # Strip leading and trailing whitespaces
    tokens = wpt.tokenize(doc)  # Tokenize document
    filtered_tokens = [token for token in tokens if token not in stop_words]  # Filter stopwords
    doc = ' '.join(filtered_tokens)  # Recreate document from filtered tokens
    return doc

normalize_corpus = np.vectorize(normalize_document)  # Vectorize normalization function

norm_corpus = normalize_corpus(corpus)
print(corpus)
print("="*50)
print(norm_corpus)

# Set n-gram range to (2, 2) for bigrams
bv = CountVectorizer(ngram_range=(2, 2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()  # Convert matrix to array
vocab = bv.get_feature_names_out()  # Get vocabulary (feature names)
pd.DataFrame(bv_matrix, columns=vocab)  # Display matrix as DataFrame


['The sky is blue and beautiful.' 'Love this blue and beautiful sky!'
 'The quick brown fox jumps over the lazy dog.'
 "A king's breakfast has sausages, ham, bacon, eggs, toast and beans"
 'I love green eggs, ham, sausages and bacon!'
 'The brown fox is quick and the blue dog is lazy!'
 'The sky is very blue and the sky is very beautiful today'
 'The dog is lazy but the brown fox is quick!']
['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog'
 'kings breakfast sausages ham bacon eggs toast beans'
 'love green eggs ham sausages bacon' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,bacon eggs,beautiful sky,beautiful today,blue beautiful,blue dog,blue sky,breakfast sausages,brown fox,dog lazy,eggs ham,...,lazy dog,love blue,love green,quick blue,quick brown,sausages bacon,sausages ham,sky beautiful,sky blue,toast beans
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,1,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
6,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
7,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


### Introduction to Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a weighting technique used in natural language processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two metrics:

- Term Frequency (TF): Measures how frequently a term appears in a document.
- Inverse Document Frequency (IDF): Measures how rare a term is across all documents in the corpus; terms that appear in many documents get a lower IDF.

TF-IDF assigns higher weights to terms that are frequent in a document but rare in the corpus, thus capturing their significance in representing the document's content.
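
To make the weighting concrete, the sketch below computes TF-IDF by hand for one document. It assumes scikit-learn's smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is what `smooth_idf=True` and `norm='l2'` select. The helper names `idf` and `tfidf` are hypothetical.

```python
import math

docs = [
    'sky blue beautiful', 'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'kings breakfast sausages ham bacon eggs toast beans',
    'love green eggs ham sausages bacon', 'brown fox quick blue dog lazy',
    'sky blue sky beautiful today', 'dog lazy brown fox quick',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    # Document frequency: number of documents that contain the term
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    # Raw weight = term count in the document times the term's IDF
    raw = {t: doc.count(t) * idf(t) for t in set(doc)}
    # L2-normalize so each document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenized[0])
rounded = {t: round(w, 2) for t, w in sorted(weights.items())}
print(rounded)  # {'beautiful': 0.6, 'blue': 0.53, 'sky': 0.6}
```

"blue" gets the lowest weight of the three because it appears in four of the eight documents, while "sky" and "beautiful" each appear in three.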

In [None]:
norm_corpus = ['sky blue beautiful', 'love blue beautiful sky',
 'quick brown fox jumps lazy dog',
 'kings breakfast sausages ham bacon eggs toast beans',
 'love green eggs ham sausages bacon', 'brown fox quick blue dog lazy',
 'sky blue sky beautiful today' ,'dog lazy brown fox quick']


from sklearn.feature_extraction.text import CountVectorizer
# get bag of words features in sparse format
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix

from sklearn.feature_extraction.text import TfidfTransformer


# Note: With TfidfTransformer you first compute word counts using CountVectorizer,
# then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores.
tt = TfidfTransformer(norm='l2',
                      use_idf=True,
                      smooth_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)
tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
tt_df = pd.DataFrame(np.round(tt_matrix, 2), columns=vocab)
display(tt_df)



# Note: With TfidfVectorizer, by contrast, you do all three steps at once.
# Under the hood, it computes the word counts, IDF values, and TF-IDF scores all using the same dataset.

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0.,
                     max_df=1.,
                     norm='l2',
                     use_idf=True,
                     smooth_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
tv_df  = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
display(tv_df)


Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


### Introduction to Document Similarity and Word Semantics
In natural language processing (NLP), understanding the semantic relationships between words and documents is crucial. Lexical semantics focuses on the meanings of words and their relationships, addressing nuances like word senses (e.g., "mouse" as a device or an animal). While synonyms share meanings, words like "cat" and "dog" share semantic similarities but aren't synonyms.

Document similarity metrics help quantify these relationships:

- Manhattan Distance: Sums the absolute differences across dimensions.
- Euclidean Distance: Calculates the straight-line distance between two points.
- Cosine Similarity: Computes similarity based on the angle between vectors, ignoring their magnitudes.
- Jaccard Similarity: Divides the number of words shared by two documents by the number of distinct words across both.
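
The four metrics above can be sketched in plain Python. The example assumes two short documents taken from the toy corpus used earlier and represents them as word-count vectors over their shared vocabulary; the function names are hypothetical.

```python
import math

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = 'sky blue beautiful'.split()
doc_b = 'love blue beautiful sky'.split()

# Count vectors over the combined vocabulary: beautiful, blue, love, sky
vocab = sorted(set(doc_a) | set(doc_b))
vec_a = [doc_a.count(w) for w in vocab]  # [1, 1, 0, 1]
vec_b = [doc_b.count(w) for w in vocab]  # [1, 1, 1, 1]

print(manhattan(vec_a, vec_b))                    # 1
print(euclidean(vec_a, vec_b))                    # 1.0
print(round(cosine_similarity(vec_a, vec_b), 3))  # 0.866
print(jaccard(doc_a, doc_b))                      # 0.75
```

The two distances are smaller for more similar documents, while the two similarities are larger, so they should not be compared on the same scale.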

In [None]:
import pandas as pd                        # data manipulation
import numpy as np                         # matrix algebra
import matplotlib                          # visual display of data
import matplotlib.pyplot as plt            # plotting interface for matplotlib
import nltk                                # Python library for NLP
import re                                  # regular expression operations
import string                              # string operations

nltk.download('stopwords')                 # stop word lists
nltk.download('brown')                     # the Brown corpus
from nltk.corpus import stopwords          # stop words that come with NLTK
from nltk.corpus import brown              # the corpus used for this exercise
from nltk.stem import PorterStemmer        # module for stemming
from sklearn.feature_extraction.text import CountVectorizer

#The seed() method is used to initialize the random number generator
np.random.seed(200)

brown_cat= brown.categories() # Creates a list of categories

docs=[]
for cat in brown_cat: # Collect (sentence, category) tuples in a list
    t1=brown.sents(categories=cat) # At each iteration, retrieve all sentences of a given category
    for doc in t1:
        docs.append((' '.join(doc), cat)) # Each tokenized sentence is joined into a string and appended as (sentence, category)

brown_df=pd.DataFrame(docs, columns=['sentence', 'category']) # The data frame is created from the list of tuples.

brown_df.head()


#Step 1. Pre-Processing the Brown Corpus Text
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # remove everything except letters and whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

# Because the Brown corpus is very large, select 10,000 random sentences.
# random_state fixes the sample so the same results can be reproduced.
sample_df = brown_df.sample(n=10000, random_state=200)

# Create a normalized corpus with the pre-processing function above
norm_corpus = normalize_corpus(sample_df['sentence'].values)

# Step 2. Create a tri-gram data frame and count its frequencies
# Initialize CountVectorizer for tri-gram features
vectorizer = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the normalized corpus to tri-gram BOW matrix
tri_gram_bow = vectorizer.fit_transform(norm_corpus)

# Get feature names (tri-gram vocabulary)
vocab = vectorizer.get_feature_names_out()

# Convert tri-gram BOW matrix to DataFrame and print
tri_gram_bow_df = pd.DataFrame(tri_gram_bow.toarray(), columns=vocab)
tri_gram_bow_df.head()

### END CODE HERE ###


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Unnamed: 0,aa people would,aaa splits way,aab follows vowel,aah go said,aajk mercedes lots,aaron blaustein standing,aaron burr mark,aaron burrs political,aaron cohn san,aaron mcbride field,...,zubkovskaya excellent lilac,zubkovskaya yuri kornevey,zur bestimmung von,zur khaneh latter,zurcher hillsboro phyllis,zurich prince boun,zurich statement said,zwei planeten also,zworykin farrar mr,zworykin novel technique
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Exercise: TF-IDF

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.corpus import stopwords, inaugural
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK resources if not already downloaded
nltk.download('stopwords')
nltk.download('inaugural')

# Set random seed for reproducibility
np.random.seed(200)

# Load inaugural speeches fileids
inaugural_fileids = inaugural.fileids()

# Load inaugural speeches text
docs = [(inaugural.raw(fileid), fileid) for fileid in inaugural_fileids]

# Create DataFrame from documents
df = pd.DataFrame(docs, columns=['document', 'fileid'])

# Step 1. Pre-processing the Text
wpt = nltk.WordPunctTokenizer()
stop_words = stopwords.words('english')

def normalize_document(doc):
    doc = re.sub(r'[^a-zA-Z\s]', '', doc)
    doc = doc.lower()
    doc = doc.strip()
    tokens = wpt.tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    doc = ' '.join(filtered_tokens)
    return doc

# Apply normalization to the document column
df['normalized_document'] = df['document'].apply(normalize_document)

# Step 2. Calculate TF-IDF
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(3, 3), stop_words='english')

# Fit and transform the normalized documents
tfidf_matrix = tfidf_vectorizer.fit_transform(df['normalized_document'])

# Get feature names (tri-grams in the vocabulary)
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()

# Convert TF-IDF matrix to DataFrame and print
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vocab)
tfidf_df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


Unnamed: 0,abandon acted great,abandon claims magnanimity,abandon delusions join,abandon government far,abandon great measures,abandon habits racism,abandon mastery pursue,abandon sectional prejudice,abandon tolerance abuse,abandoned enemy advancing,...,zealous exertions government,zealous labors political,zealous unceasing efforts,zealously contended preservation,zealously devote limits,zealously devoted long,zealously enforce laws,zealously steadily persevered,zealously unite coordinate,zone extending degrees
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.037531,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Comment:
Imports and Setup: The code imports necessary libraries, downloads NLTK resources, and sets a random seed.

Loading and Structuring Data: It loads all inaugural speeches, organizes them into tuples with their file IDs, and creates a DataFrame (df) with columns for the speech text and file ID.

Text Pre-processing: Defines a function (normalize_document) to clean each speech by removing non-alphabetic characters, converting to lowercase, tokenizing, removing stopwords, and rejoining into a cleaned text. Applies this function to create a new column (normalized_document) in the DataFrame.

Calculating TF-IDF: Initializes a TF-IDF Vectorizer (TfidfVectorizer) with ngram_range=(3, 3), so each term is a tri-gram, fits it to the cleaned speeches, and computes TF-IDF scores for each term-document pair.

Converting to DataFrame: Converts the TF-IDF matrix into a DataFrame (tfidf_df) where rows represent speeches and columns represent terms, with each cell containing the TF-IDF score for the corresponding term in the speech.

Output: Displays the first few rows (head()) of tfidf_df, showing TF-IDF scores calculated for the inaugural speeches.