
# Text Representation with Dynamic Dataset and Word Embeddings
This notebook covers:
- Building a small ham/spam dataset by repeating and shuffling hand-written messages
- TF-IDF vectorization
- Cosine similarity between TF-IDF vectors
- Naive Bayes classification
- spaCy word embeddings and semantic similarity

In [None]:
# Step 1: Generate a new dataset
import pandas as pd
import random

ham = [
    "Lunch at 12? Meet you by the cafeteria.",
    "Remember to bring your notebook to class.",
    "Can you check the door before leaving?",
    "I'll be offline after 5pm, text if needed.",
    "Meeting rescheduled to next Tuesday at 3.",
    "Found your charger in the conference room.",
    "I uploaded the slides to the shared folder.",
    "Pick up milk and eggs on your way home.",
    "Did you see the news this morning?",
    "We should review the budget before Friday."
]

spam = [
    "You're pre-approved for a $5000 credit line!",
    "Act fast! This crypto opportunity ends today.",
    "Lowest insurance rates, guaranteed. Click now!",
    "URGENT: Account suspended, verify immediately.",
    "Unlock unlimited streaming, no ads—subscribe here.",
    "You've been selected for a government grant!",
    "Earn money from home—no experience needed.",
    "Complete this survey to win a $200 gift card.",
    "You won't believe this investment tip—read more!",
    "Exclusive deal on designer watches—limited time!"
]

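# Repeat every message 5 times (100 rows in total), then shuffle the rows.
# No random seed is set, so the row order differs between runs.
dataset = (
    [["ham", msg] for _ in range(5) for msg in ham]
    + [["spam", msg] for _ in range(5) for msg in spam]
)
random.shuffle(dataset)
df = pd.DataFrame(dataset, columns=["label", "text"])
df.head()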

| | label | text |
|---|---|---|
| 0 | spam | Complete this survey to win a $200 gift card. |
| 1 | ham | Remember to bring your notebook to class. |
| 2 | ham | I'll be offline after 5pm, text if needed. |
| 3 | spam | Exclusive deal on designer watches—limited time! |
| 4 | spam | You're pre-approved for a $5000 credit line! |


In [None]:
# Step 2: TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
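# Drop common English stop words, then learn the vocabulary and IDF weights
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])  # sparse matrix with one row per message
y = df['label']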
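
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF. It assumes the smoothed IDF formula that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. The helper `tfidf_vectors`, the whitespace tokenizer, and the toy documents are illustrative assumptions; the real `TfidfVectorizer` also removes stop words and tokenizes differently, so its numbers will not match these.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        raw = {t: term_counts[t] * idf[t] for t in term_counts}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

toy_docs = ["win a gift card", "win money now", "bring your notebook"]
toy_tfidf = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, toy_tfidf):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first toy document, "win" receives a lower weight than "gift" because "win" also occurs in a second document, which is the behavior TF-IDF is designed to produce.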

In [None]:
# Step 3: Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
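# Compare the first two messages; slicing (X[0:1]) keeps each row 2-D, as cosine_similarity expects
cos_sim = cosine_similarity(X[0:1], X[1:2])
print("Cosine similarity between doc 0 and doc 1:", cos_sim[0][0])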

Cosine similarity between doc 0 and doc 1: 0.0
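
A score of exactly 0.0 means the two TF-IDF vectors have no term in common. Assuming this run compared the first two rows shown in the `df.head()` output, the only words the messages share ("to") are stop words, so nothing overlaps after filtering. The sketch below illustrates this with a hand-written `cosine` helper over sparse dictionaries; the helper and the unit weights are illustrative assumptions, not the values scikit-learn computes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Content words picked by hand; each weight is a plain count, not a real TF-IDF value.
survey_msg = {"complete": 1, "survey": 1, "win": 1, "200": 1, "gift": 1, "card": 1}
class_msg = {"remember": 1, "bring": 1, "notebook": 1, "class": 1}
prize_msg = {"win": 1, "gift": 1, "prize": 1}  # hypothetical third message

print(cosine(survey_msg, class_msg))  # no shared terms, so the dot product is zero
print(cosine(survey_msg, prize_msg))  # two shared terms, so the score is positive
```

Because the score depends only on shared terms, two messages with the same meaning but different wording also score 0.0 under TF-IDF.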


In [None]:
# Step 4: Train/Test Split and Naive Bayes Classifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Hold out 30% of the rows; stratify=y keeps the ham/spam ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        15
        spam       1.00      1.00      1.00        15

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
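
The perfect score should be read with care: every message appears five times in the dataset, so a random row-level split will usually place copies of the same message in both the training and the test set. The sketch below, using placeholder messages rather than the notebook's data, counts how many test rows also occur in training under a naive split, and contrasts that with a split made on unique messages first. All names here are illustrative, and label stratification is left out for brevity.

```python
import random

# Placeholder texts (assumption): 20 unique messages, each repeated 5 times, as in Step 1.
unique_msgs = [f"message {i}" for i in range(20)]
copies = [msg for msg in unique_msgs for _ in range(5)]

rng = random.Random(42)

# Naive split: shuffle all 100 rows and hold out 30% of them.
rows = copies[:]
rng.shuffle(rows)
n_test = len(rows) * 3 // 10
naive_test, naive_train = rows[:n_test], rows[n_test:]
naive_train_set = set(naive_train)
leaked = sum(1 for msg in naive_test if msg in naive_train_set)
print(f"naive split: {leaked}/{n_test} test rows also appear in training")

# Group-aware split: hold out whole messages, then expand back to their copies.
msgs = unique_msgs[:]
rng.shuffle(msgs)
n_test_msgs = len(msgs) * 3 // 10
test_msgs, train_msgs = set(msgs[:n_test_msgs]), set(msgs[n_test_msgs:])
group_test = [m for m in copies if m in test_msgs]
group_train = [m for m in copies if m in train_msgs]
overlap = set(group_test) & set(group_train)
print(f"group-aware split: {len(overlap)} messages shared between train and test")
```

With a group-aware split the classifier is scored only on messages it has never seen, which gives a more honest estimate. In scikit-learn, `GroupShuffleSplit` offers this kind of split when each row carries a group id.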



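## Step 5: Word Embeddings with spaCy
Now we compute semantic similarity with spaCy. `Doc.similarity` returns the cosine similarity between the two documents' vectors; by default, a document's vector is the average of its token vectors.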

In [None]:
# Requires the medium English model; if it is not installed, first run:
#   !python -m spacy download en_core_web_md
import spacy
nlp = spacy.load("en_core_web_md")

doc1 = nlp(df['text'].iloc[0])
doc2 = nlp(df['text'].iloc[1])
print("SpaCy semantic similarity:", doc1.similarity(doc2))

SpaCy semantic similarity: 0.780891478061676
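
As a rough picture of what that score measures, the sketch below averages word vectors into a document vector and takes the cosine of two such averages, which is how spaCy describes its default `Doc.similarity`. The 3-dimensional vectors, the word list, and both helper functions are hypothetical; real `en_core_web_md` vectors have 300 dimensions and different values.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
word_vectors = {
    "win":      [0.9, 0.1, 0.0],
    "prize":    [0.8, 0.2, 0.1],
    "bring":    [0.1, 0.9, 0.2],
    "notebook": [0.0, 0.8, 0.3],
}

def doc_vector(tokens):
    """Component-wise mean of the tokens' word vectors."""
    vecs = [word_vectors[t] for t in tokens]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine_dense(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

spam_like = doc_vector(["win", "prize"])
ham_like = doc_vector(["bring", "notebook"])
reordered = doc_vector(["prize", "win"])

print(round(cosine_dense(spam_like, ham_like), 3))   # different topics, lower score
print(round(cosine_dense(spam_like, reordered), 3))  # same words in another order
```

Averaging discards word order, and dense vectors are rarely orthogonal, so unrelated sentences can still score well above zero. This is one reason the embedding-based score for the first two messages differs so sharply from their TF-IDF cosine similarity of 0.0.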
