<a href="https://colab.research.google.com/github/Zeaxanthin80/CAI2300C/blob/main/CAI2300C_20250124_Week_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP | Bag of Words (BoW), TF-IDF, and Embeddings

## Bag of Words

- BoW converts text into a numerical form. (Two truisms of AI: all data must be numeric, AND there can be no missing values in the dataset.)
- It focuses on the frequency of words while disregarding grammar and word order.
- Evolution of NLP:  BoW --> TF-IDF --> Word2Vec (context... embeddings) --> LLMs
- Bag of Words article:  [Bag of Words (BoW) Model: A Step-by-Step Guide for Beginners](https://drlee.io/bag-of-words-bow-model-a-step-by-step-guide-for-beginners-9110dfa08b0a)
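
The idea in the bullets above can be sketched in a few lines of standard-library Python. The two example sentences and the helper names `tokenize` and `bow_vector` are made up for illustration; they are not part of the notebook's own code.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())

# Two toy documents (hypothetical examples)
docs = ["The dog chased the ball", "The ball rolled away"]

# Vocabulary: every distinct word, in order of first appearance
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['the', 'dog', 'chased', 'ball', 'rolled', 'away']
print(vectors)  # [[2, 1, 1, 1, 0, 0], [1, 0, 0, 1, 1, 1]]
```

Because only counts are kept, word order is lost: "the ball chased the dog" would produce the same vector as the first document.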


In [None]:
# Install necessary library (if needed)
!pip install nltk

# Import necessary libraries
import nltk
import re # regular expression
import numpy as np
import pandas as pd

# Download the NLTK Punkt tokenizer data if not already present
nltk.download('punkt')  # Newer NLTK releases may need 'punkt_tab' instead

# Sample text
text = '''Daisy. I was trying to explain to somebody as we were hiking up, that’s oak. That’s pine. And they were very impressed at my botanical knowledge. Please give it up for Daisy once again for her boundless energy. I have a bunch of great memories here today, including an adventure I had with my best buddy, who is one of the finest companions in the world, my dog, Daisy, the Belgian Malinois. I also noticed, by the way, the old trail marker here, which I hadn’t seen in years, and somehow it still stands as it always has. And it’s great to see it again. I want to thank the forest and the fresh air for making it possible for me to share this with you. And I am deeply grateful for the bond I share with my dog, Daisy, who sets the stage for so much joy and adventure in my life. Now, I want to start by addressing the adventure we had. I know people are still wondering why we ended up climbing the ridge.'''

# Step 1: Split text into sentences manually
# We split the text using ". " as the delimiter. Note that this does not handle cases like "e.g." properly.
manual_sentences = text.split(". ")

# Step 2: Preprocess sentences and prepare data for tabular representation
processed_data = []
for i, sentence in enumerate(manual_sentences):
    # Step 2.1: Convert the sentence to lowercase for uniformity
    cleaned_sentence = sentence.lower()

    # Step 2.2: Remove non-word characters (e.g., punctuation, special symbols) using regex
    cleaned_sentence = re.sub(r'\W', ' ', cleaned_sentence)

    # Step 2.3: Collapse runs of whitespace into a single space and trim the ends
    cleaned_sentence = re.sub(r'\s+', ' ', cleaned_sentence).strip()

    # Step 2.4: Calculate the size (length) of the cleaned sentence
    sentence_size = len(cleaned_sentence)

    # Step 2.5: Append processed data as a list: [Index, Type, Size, Value]
    # The 'Type' column is hard-coded as 'str' because every value is a string
    processed_data.append([i + 1, 'str', sentence_size, cleaned_sentence])

# Step 3: Create a DataFrame for tabular representation
df = pd.DataFrame(processed_data, columns=["Index", "Type", "Size", "Value"])

# Display the DataFrame
df




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


| | Index | Type | Size | Value |
|---|---|---|---|---|
| 0 | 1 | str | 5 | daisy |
| 1 | 2 | str | 67 | i was trying to explain to somebody as we were... |
| 2 | 3 | str | 11 | that s pine |
| 3 | 4 | str | 54 | and they were very impressed at my botanical k... |
| 4 | 5 | str | 63 | please give it up for daisy once again for her... |
| 5 | 6 | str | 174 | i have a bunch of great memories here today in... |
| 6 | 7 | str | 125 | i also noticed by the way the old trail marker... |
| 7 | 8 | str | 30 | and it s great to see it again |
| 8 | 9 | str | 97 | i want to thank the forest and the fresh air f... |
| 9 | 10 | str | 123 | and i am deeply grateful for the bond i share ... |


In [None]:
# Split text into sentences and tokenize
manual_sentences = text.split(". ")

# Creating the Bag of Words dictionary
word2count = {}
for sentence in manual_sentences:
    words = re.findall(r'\w+', sentence.lower())  # Tokenize each sentence into words
    for word in words:
        # Start new words at 0, then add 1 for this occurrence
        word2count[word] = word2count.get(word, 0) + 1

# Convert the Bag of Words dictionary into a DataFrame format
processed_bow_data = []
for key, value in word2count.items():
    processed_bow_data.append([key, 'int', 1, value])

# Create the DataFrame
bow_df = pd.DataFrame(processed_bow_data, columns=["Key", "Type", "Size", "Value"])

# Display the DataFrame
bow_df

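| | Key | Type | Size | Value |
|---|---|---|---|---|
| 0 | daisy | int | 1 | 4 |
| 1 | i | int | 1 | 10 |
| 2 | was | int | 1 | 1 |
| 3 | trying | int | 1 | 1 |
| 4 | to | int | 1 | 6 |
| ... | ... | ... | ... | ... |
| 104 | wondering | int | 1 | 1 |
| 105 | why | int | 1 | 1 |
| 106 | ended | int | 1 | 1 |
| 107 | climbing | int | 1 | 1 |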


In [None]:
import heapq

# Select the top 100 most frequent words
freq_words = heapq.nlargest(100, word2count, key=word2count.get)

print("Top 100 Most Frequent Words:")
freq_words

Top 100 Most Frequent Words:


['the',
 'i',
 'to',
 'and',
 'it',
 'for',
 'my',
 'daisy',
 'we',
 'up',
 's',
 'adventure',
 'with',
 'in',
 'as',
 'were',
 'that',
 'again',
 'of',
 'great',
 'here',
 'had',
 'who',
 'dog',
 'by',
 'still',
 'want',
 'share',
 'was',
 'trying',
 'explain',
 'somebody',
 'hiking',
 'oak',
 'pine',
 'they',
 'very',
 'impressed',
 'at',
 'botanical',
 'knowledge',
 'please',
 'give',
 'once',
 'her',
 'boundless',
 'energy',
 'have',
 'a',
 'bunch',
 'memories',
 'today',
 'including',
 'an',
 'best',
 'buddy',
 'is',
 'one',
 'finest',
 'companions',
 'world',
 'belgian',
 'malinois',
 'also',
 'noticed',
 'way',
 'old',
 'trail',
 'marker',
 'which',
 'hadn',
 't',
 'seen',
 'years',
 'somehow',
 'stands',
 'always',
 'has',
 'see',
 'thank',
 'forest',
 'fresh',
 'air',
 'making',
 'possible',
 'me',
 'this',
 'you',
 'am',
 'deeply',
 'grateful',
 'bond',
 'sets',
 'stage',
 'so',
 'much',
 'joy',
 'life',
 'now',
 'start']

In [None]:
# Split text into sentences
manual_sentences = text.split(". ")

# Tokenizing sentences and creating a list of all words
word2count = {}
all_words = []
for sentence in manual_sentences:
    words = re.findall(r'\w+', sentence.lower())
    all_words.extend(words)
    for word in words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1

# Use the full vocabulary as features (no frequency cutoff is applied here)
freq_words = list(word2count)

# Building the binary Bag of Words matrix: 1 if the word occurs in the sentence, else 0
X = []
for sentence in manual_sentences:
    sentence_words = set(re.findall(r'\w+', sentence.lower()))  # Tokenize once per sentence
    vector = [1 if word in sentence_words else 0 for word in freq_words]
    X.append(vector)

# Convert to DataFrame
bow_df = pd.DataFrame(X, columns=freq_words)

# Display the DataFrame
bow_df

Unnamed: 0,daisy,i,was,trying,to,explain,somebody,as,we,were,...,start,addressing,know,people,are,wondering,why,ended,climbing,ridge
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Download the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
!wget -O smsspamcollection.zip $data_url
!unzip smsspamcollection.zip

import heapq
import nltk
import numpy as np
import pandas as pd
# Load the dataset
messages = pd.read_csv("SMSSpamCollection", sep='\t', names=["label", "message"], header=None)
# Preprocess the dataset
messages["message"] = messages["message"].str.lower()  # Convert to lowercase
messages["message"] = messages["message"].str.replace(r'\W', ' ', regex=True)  # Remove non-word characters
messages["message"] = messages["message"].str.replace(r'\s+', ' ', regex=True)  # Remove extra spaces
# Tokenize messages
nltk.download('punkt')
messages["tokens"] = messages["message"].apply(nltk.word_tokenize)
# Create BoW dictionary
word2count = {}
for tokens in messages["tokens"]:
    for word in tokens:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
# Select the top 1000 frequent words
freq_words = heapq.nlargest(1000, word2count, key=word2count.get)
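# Create binary BoW vectors (1 if the word occurs in the message, else 0)
X = []
for tokens in messages["tokens"]:
    token_set = set(tokens)  # Build a set once so each membership test is fast
    vector = [1 if word in token_set else 0 for word in freq_words]
    X.append(vector)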
X = np.asarray(X)
# Convert labels to binary
y = messages["label"].apply(lambda x: 1 if x == "spam" else 0).values
# Train a Naive Bayes Classifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# Evaluate the model
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

--2025-01-25 00:21:02--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z     [   <=>              ] 198.65K   308KB/s    in 0.6s    

2025-01-25 00:21:03 (308 KB/s) - ‘smsspamcollection.zip’ saved [203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Accuracy: 0.9865470852017937


## TF-IDF Vectorization

- TF-IDF measures how important a word is to one document relative to the rest of the corpus. Every word in every document gets a score.
- The "essence" of TF-IDF: words that are COMMON in a document but RARE across all of the documents are given a higher score.
  - Document: a single piece of text (for example, one sentence or one SMS message)
  - Corpus: ALL DOCUMENTS
- TF (term frequency): how often does a word appear in a single document?
- DF (document frequency): the number of documents in the corpus that contain the word
- IDF (inverse document frequency): reduces the weight of common words by taking the log of the ratio of the total number of documents to the document frequency
- TF-IDF: multiplies TF by IDF to give each word a weight in each document.
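
The definitions above can be worked through by hand. This is a minimal sketch using the plain textbook formula, IDF = log(N / DF), on a made-up three-document corpus; the corpus and the helper name `tf_idf` are illustrative assumptions. Libraries such as scikit-learn use a smoothed variant of IDF, so their numbers will differ from these.

```python
import math
import re
from collections import Counter

# Toy corpus (hypothetical example documents)
corpus = [
    "the dog chased the ball",
    "the cat watched the dog",
    "the ball rolled away",
]
docs = [re.findall(r"\w+", d.lower()) for d in corpus]
n_docs = len(docs)

# DF: number of documents that contain each word
df = Counter(word for doc in docs for word in set(doc))

# IDF: log of (total documents / document frequency)
idf = {word: math.log(n_docs / count) for word, count in df.items()}

def tf_idf(doc):
    """Return {word: TF * IDF} for one tokenized document."""
    counts = Counter(doc)
    return {word: (count / len(doc)) * idf[word] for word, count in counts.items()}

scores = tf_idf(docs[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In the first document, "the" is the most frequent word but appears in every document, so its IDF is log(3/3) = 0 and its score is 0. "chased" appears only in this document and gets the highest score.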

In [None]:
# Install necessary library
!pip install nltk
import nltk
import re
# Sample text
text = '''Daisy was excited about her new project. Dr. Lee encouraged her to use TF-IDF for document analysis. Together, they worked on analyzing text with precision and care.'''
# Sentence tokenization
nltk.download('punkt')
dataset = nltk.sent_tokenize(text)
# Preprocessing the sentences
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()  # Convert to lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])  # Remove non-word characters
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # Remove extra spaces
print("Preprocessed text:")
print(dataset)

Preprocessed text:
['daisy was excited about her new project ', 'dr lee encouraged her to use tf idf for document analysis ', 'together they worked on analyzing text with precision and care ']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
doc1 = 'Daisy loves analyzing text.'
doc2 = 'Dr. Lee is a mentor to Daisy.'
doc3 = 'They collaborate on text analysis projects.'
# Create a corpus
corpus = [doc1, doc2, doc3]
# Create a TfidfVectorizer object
tfidf = TfidfVectorizer()
# Calculate TF-IDF values
result = tfidf.fit_transform(corpus)
# Display IDF values
print("\nIDF values:")
for word, idf_value in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{word}: {idf_value}")
# Display TF-IDF Matrix
print("\nTF-IDF Matrix:")
result.toarray()


IDF values:
analysis: 1.6931471805599454
analyzing: 1.6931471805599454
collaborate: 1.6931471805599454
daisy: 1.2876820724517808
dr: 1.6931471805599454
is: 1.6931471805599454
lee: 1.6931471805599454
loves: 1.6931471805599454
mentor: 1.6931471805599454
on: 1.6931471805599454
projects: 1.6931471805599454
text: 1.2876820724517808
they: 1.6931471805599454
to: 1.6931471805599454

TF-IDF Matrix:


array([[0.        , 0.5628291 , 0.        , 0.42804604, 0.        ,
        0.        , 0.        , 0.5628291 , 0.        , 0.        ,
        0.        , 0.42804604, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.32200242, 0.42339448,
        0.42339448, 0.42339448, 0.        , 0.42339448, 0.        ,
        0.        , 0.        , 0.        , 0.42339448],
       [0.42339448, 0.        , 0.42339448, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.42339448,
        0.42339448, 0.32200242, 0.42339448, 0.        ]])
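
The IDF values printed above are larger than plain log(N / DF) would give. They are consistent with scikit-learn's default smoothed formula, IDF = ln((1 + N) / (1 + DF)) + 1, followed by scaling each row to unit L2 length. The sketch below reproduces the first row of the matrix with the standard library only; the regex is an approximation of the vectorizer's default token pattern, which drops single-character tokens such as "a", and the helper name `tfidf_row` is hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "Daisy loves analyzing text.",
    "Dr. Lee is a mentor to Daisy.",
    "They collaborate on text analysis projects.",
]

# Keep tokens of two or more word characters (single letters like "a" are dropped)
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
n_docs = len(docs)

# Document frequency, then smoothed IDF: ln((1 + N) / (1 + DF)) + 1
df = Counter(word for doc in docs for word in set(doc))
idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

def tfidf_row(doc):
    """Raw term count times IDF, scaled so the row has unit L2 length."""
    counts = Counter(doc)
    weights = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

row0 = tfidf_row(docs[0])
print(round(idf["daisy"], 4), round(idf["loves"], 4))  # 1.2877 1.6931
print({w: round(v, 4) for w, v in sorted(row0.items())})
```

"daisy" and "text" each appear in two of the three documents, so they receive the lower IDF (about 1.2877) and the smaller weights in each row.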

In [None]:
# Download dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
!wget -O smsspamcollection.zip $data_url
!unzip smsspamcollection.zip
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Load dataset
messages = pd.read_csv("SMSSpamCollection", sep='\t', names=["label", "message"], header=None)
# Preprocess messages
messages["message"] = messages["message"].str.lower().str.replace(r'\W', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
# Calculate TF-IDF vectors
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(messages["message"])
y = messages["label"].apply(lambda x: 1 if x == "spam" else 0).values
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# Evaluate Model
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

--2025-01-25 00:56:12--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z     [   <=>              ] 198.65K   308KB/s    in 0.6s    

2025-01-25 00:56:14 (308 KB/s) - ‘smsspamcollection.zip’ saved [203415]

Archive:  smsspamcollection.zip
replace SMSSpamCollection? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: SMSSpamCollection       
  inflating: readme                  
Accuracy: 0.9668161434977578


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

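# Load the product catalog (the code below expects 'id' and 'description' columns)
ds = pd.read_csv("/content/sample-data.csv")

# Build TF-IDF vectors over unigrams, bigrams, and trigrams, ignoring English stop words
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.0, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

# linear_kernel computes dot products; on L2-normalized TF-IDF rows this equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

results = {}

for idx, row in ds.iterrows():
    # Indices of up to 99 of the most similar items, most similar first
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]

    # Skip the first entry, which is the item itself
    results[row['id']] = similar_items[1:]

print('done!')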

def item(item_id):
    # Return the product name: the part of the description before the first ' - '
    return ds.loc[ds['id'] == item_id]['description'].tolist()[0].split(' - ')[0]

# Just reads the results out of the dictionary.
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

recommend(item_id=1, num=3)

done!
Recommending 3 products similar to Active classic boxers...
-------
Recommended: Cap 1 boxer briefs (score:0.22037921472617453)
Recommended: Active boxer briefs (score:0.16938950913002357)
Recommended: Cap 1 bottoms (score:0.16769458065321552)
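
The recommender above calls `linear_kernel` (a plain dot product) yet stores the result as `cosine_similarities`. That works because `TfidfVectorizer` scales every row to unit L2 length by default, and for unit vectors the dot product and the cosine similarity are the same number. A small sketch with two made-up vectors, assuming nothing beyond the standard library:

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Two made-up, unnormalized weight vectors
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)

print(cosine(u, v))       # cosine similarity of the raw vectors
print(dot(u_hat, v_hat))  # dot product of the unit vectors: the same value
```

If the vectorizer were created with normalization turned off, the dot product would no longer be a cosine similarity and longer descriptions would tend to score higher.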


## Embeddings

In [None]:
from openai import OpenAI

client = OpenAI(api_key="INSERT_KEY")
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text."
)

response_dict = response.model_dump()
print(response_dict)

{'data': [{'embedding': [-0.016230566427111626, -0.01696062460541725, 0.0343233086168766, 0.0007829607930034399, 0.01564863510429859, 0.008136443793773651, ...

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
from openai import OpenAI
import PyPDF2

# Set up OpenAI API client
client = OpenAI(api_key="INSERT_KEY")

# Function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    text = ""
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Function to create embeddings using OpenAI
def create_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.model_dump()["data"][0]["embedding"]

# Function to calculate cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = sum(a * a for a in vec1) ** 0.5
    magnitude2 = sum(b * b for b in vec2) ** 0.5
    return dot_product / (magnitude1 * magnitude2)

# Query the embeddings
def query_embeddings(embeddings, texts, query):
    query_embedding = create_embedding(query)
    similarities = [
        cosine_similarity(query_embedding, embedding)
        for embedding in embeddings
    ]
    most_relevant_index = similarities.index(max(similarities))
    return texts[most_relevant_index]

# Main workflow
if __name__ == "__main__":
    # Get PDF file paths
    pdf1_path = input("Enter the path of the first PDF: ")
    pdf2_path = input("Enter the path of the second PDF: ")

    # Extract text from PDFs
    text1 = extract_text_from_pdf(pdf1_path)
    text2 = extract_text_from_pdf(pdf2_path)

    # Create embeddings for the PDFs
    embedding1 = create_embedding(text1)
    embedding2 = create_embedding(text2)

    # Store texts and embeddings
    texts = [text1, text2]
    embeddings = [embedding1, embedding2]

    # User query
    query = input("Enter your query: ")
    result = query_embeddings(embeddings, texts, query)

    print("\nMost relevant text:\n")
    print(result)


Enter the path of the first PDF: /content/CAP2791C_MDC.pdf
Enter the path of the second PDF: /content/CAP2791C_MDC.pdf


BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 9266 tokens (9266 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
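
The error above occurs because the full text of a PDF (9266 tokens) was sent in a single request, which exceeds the model's 8192-token limit. One common workaround is to split the text into smaller chunks and embed each chunk separately. The sketch below splits on whitespace; the function name `chunk_text` and the defaults `max_words=500` and `overlap=50` are hypothetical choices, and word counts are only a rough stand-in for token counts, so a tokenizer-based count would be more exact.

```python
def chunk_text(text, max_words=500, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # The last chunk already reaches the end of the text
    return chunks

# Stand-in for extracted PDF text: 1,200 numbered words
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(sample, max_words=500, overlap=50)
print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [500, 500, 300]
```

Each chunk could then be passed to `create_embedding`, with the chunks stored in `texts` and their vectors in `embeddings`, so that `query_embeddings` returns the most relevant chunk rather than an entire PDF.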