## CCS 249 Unit 5 - Assignment

#### by Cherilyn Marie Deocampo

1. Using Wikipedia as the corpus, obtain 5 different topics that will serve as your documents, and create a term-document matrix. You can use the shared code on GitHub as a reference.
- Term-document matrix using raw frequency.
- Term-document matrix using TF-IDF weights.

In [216]:
# Import necessary modules
import wikipedia
import re
from collections import Counter
from math import log, sqrt

In [217]:
# 1. Define Wikipedia topics
topics = [
    "Data Science",
    "Data Visualization",
    "Quantum computing",
    "Big data",
    "Augmented reality"
]

In [218]:
# 2. Fetch Wikipedia summaries for the topics
documents = []
for topic in topics:
    try:
        summary = wikipedia.summary(topic)
        documents.append(summary)
    except Exception as e:
        print(f"Error fetching summary for {topic}: {e}")

In [219]:
# 3. Tokenize and preprocess documents
tokenized_docs = [re.findall(r'\b\w+\b', doc.lower()) for doc in documents]

# 4. Build vocabulary (set of unique words across all documents)
vocabulary = set(word for doc in tokenized_docs for word in doc)

In [220]:
# Compute Term Frequency (TF)
def compute_tf(tokens, vocab):
    count = Counter(tokens)
    total_terms = len(tokens)
    return { term: count[term] / total_terms for term in vocab }

# Compute Inverse Document Frequency (IDF)
def compute_idf(tokenized_docs, vocab):
    N = len(tokenized_docs)
    idf_dict = {}
    for term in vocab:
        df = sum(term in doc for doc in tokenized_docs)
        idf_dict[term] = log(N / (df or 1))
    return idf_dict

# Compute TF-IDF
def compute_tfidf(tf_vector, idf, vocab):
    return { term: tf_vector[term] * idf[term] for term in vocab }

# Compute Cosine Similarity
def cosine_similarity(vec1, vec2, vocab):
    dot_product = sum(vec1[term] * vec2[term] for term in vocab)
    vec1_len = sqrt(sum(vec1[term]**2 for term in vocab))
    vec2_len = sqrt(sum(vec2[term]**2 for term in vocab))
    if vec1_len == 0 or vec2_len == 0:
        return 0.0
    return dot_product / (vec1_len * vec2_len)
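
For reference, the functions above correspond to the following formulas, written out as a summary of what the code computes (natural logarithm, with $|d|$ the number of tokens in document $d$, $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing term $t$):

$$
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\sum_{t} u_t v_t}{\sqrt{\sum_{t} u_t^2}\,\sqrt{\sum_{t} v_t^2}}
$$

A term that appears in every document has $\mathrm{df}(t) = N$, so its IDF, and therefore its TF-IDF weight, is 0.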

In [221]:
# Compute TF vectors for each document
tf_vectors = [compute_tf(doc, vocabulary) for doc in tokenized_docs]

# Compute IDF values for the entire corpus
idf = compute_idf(tokenized_docs, vocabulary)

# Compute TF-IDF vectors for each document
tfidf_vectors = [compute_tfidf(tf, idf, vocabulary) for tf in tf_vectors]

# Compute Cosine Similarity between Document 1 and Document 2
similarity = cosine_similarity(tfidf_vectors[0], tfidf_vectors[1], vocabulary)

In [222]:
# Print documents
for i, doc in enumerate(documents):
    print(f"\nDocument {i+1} ({topics[i]}):\n{doc}")


Document 1 (Data Science):
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data. 
Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.
Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science.

a. Term-document matrix using raw frequency

In [223]:
# Print term-document matrix of term frequencies (count / document length, as returned by compute_tf)
for i, tf_vector in enumerate(tf_vectors):
    print(f"\nDocument {i+1}:")
    for term, freq in tf_vector.items():
        if freq > 0:
            print(f"{term}: {freq}")


Document 1:
of: 0.021052631578947368
is: 0.031578947368421054
creates: 0.005263157894736842
noisy: 0.005263157894736842
changing: 0.005263157894736842
to: 0.021052631578947368
driven: 0.005263157894736842
mathematics: 0.005263157894736842
empirical: 0.005263157894736842
many: 0.005263157894736842
deluge: 0.005263157894736842
programming: 0.005263157894736842
summarize: 0.005263157894736842
actual: 0.005263157894736842
method: 0.005263157894736842
visualization: 0.005263157894736842
technology: 0.010526315789473684
or: 0.010526315789473684
also: 0.005263157894736842
however: 0.005263157894736842
statistical: 0.005263157894736842
sciences: 0.005263157894736842
imagined: 0.005263157894736842
their: 0.005263157894736842
knowledge: 0.021052631578947368
because: 0.005263157894736842
theoretical: 0.005263157894736842
scientist: 0.005263157894736842
a: 0.05263157894736842
with: 0.010526315789473684
workflow: 0.005263157894736842
within: 0.005263157894736842
research: 0.010526315789473684
can:
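
The `compute_tf` function divides each count by the document length, so the values printed above are relative frequencies rather than raw counts. If unnormalized counts are wanted, a minimal sketch could look like the following. The helper `compute_raw_counts` and the toy tokens are hypothetical and are not part of the notebook.

```python
from collections import Counter

def compute_raw_counts(tokens, vocab):
    """Return the raw (unnormalized) count of each vocabulary term in one document."""
    count = Counter(tokens)
    return {term: count[term] for term in vocab}

# Toy tokens standing in for one tokenized summary (illustrative only)
toy_tokens = ["data", "science", "is", "a", "science"]
# "quantum" is in the vocabulary but not in this document
toy_vocab = set(toy_tokens) | {"quantum"}

raw_counts = compute_raw_counts(toy_tokens, toy_vocab)

# Print only the terms that actually occur in the document
for term in sorted(raw_counts):
    if raw_counts[term] > 0:
        print(f"{term}: {raw_counts[term]}")
```

In the notebook, the same helper could be applied to each entry of `tokenized_docs` with `vocabulary` as the second argument.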

b. Term-document matrix using TF-IDF weights

In [224]:
# Print Term-document matrix using TF-IDF weights
for i, tfidf_vector in enumerate(tfidf_vectors):
    print(f"\nDocument {i+1}:")
    for term, weight in tfidf_vector.items():
        if weight > 0:
            print(f"{term}: {weight}")


Document 1:
creates: 0.008470725854916317
noisy: 0.008470725854916317
changing: 0.008470725854916317
driven: 0.004822582799337658
mathematics: 0.008470725854916317
empirical: 0.008470725854916317
many: 0.004822582799337658
deluge: 0.008470725854916317
programming: 0.008470725854916317
summarize: 0.008470725854916317
actual: 0.008470725854916317
method: 0.008470725854916317
visualization: 0.0026885559145578457
technology: 0.009645165598675317
also: 0.0026885559145578457
however: 0.004822582799337658
statistical: 0.0026885559145578457
sciences: 0.008470725854916317
imagined: 0.008470725854916317
their: 0.0026885559145578457
knowledge: 0.019290331197350633
because: 0.008470725854916317
theoretical: 0.008470725854916317
scientist: 0.008470725854916317
workflow: 0.008470725854916317
within: 0.0026885559145578457
research: 0.0023488794875179977
uses: 0.016941451709832633
informatics: 0.004822582799337658
everything: 0.008470725854916317
phenomena: 0.004822582799337658
winner: 0.008470725854

In [225]:
# Print Inverse Document Frequency
for term, idf_value in idf.items():
    print(f"{term}: {idf_value}")

interpret: 1.6094379124341003
etc: 1.6094379124341003
reported: 1.6094379124341003
performance: 1.6094379124341003
several: 1.6094379124341003
communication: 1.6094379124341003
veracity: 1.6094379124341003
exact: 1.6094379124341003
reasoning: 1.6094379124341003
overlaid: 1.6094379124341003
seeing: 1.6094379124341003
than: 0.9162907318741551
reasonable: 1.6094379124341003
of: 0.0
appealing: 1.6094379124341003
experiences: 1.6094379124341003
timelines: 1.6094379124341003
communicate: 1.6094379124341003
should: 1.6094379124341003
is: 0.0
was: 1.6094379124341003
hololens: 1.6094379124341003
these: 0.9162907318741551
not: 0.5108256237659907
displays: 1.6094379124341003
difficulty: 1.6094379124341003
creates: 1.6094379124341003
capabilities: 1.6094379124341003
widely: 1.6094379124341003
communications: 1.6094379124341003
correlation: 1.6094379124341003
ecosystem: 1.6094379124341003
air: 1.6094379124341003
noisy: 1.6094379124341003
textual: 1.6094379124341003
massively: 1.6094379124341003
qua

### 2. What are the differences between using TF-IDF weights and raw frequency?
- Raw frequency counts how often a word occurs within each document (here, the Wikipedia summaries for Data Science, Data Visualization, Quantum computing, Big data, and Augmented reality) without considering how common the word is across the other documents. TF-IDF reduces the weight of words that occur throughout all documents and gives greater weight to words concentrated in a single document. For instance, words that appear in every summary, such as "of" and "is", receive an IDF of 0, and words such as "data" and "information" that show up in several of the topics are weighted down, whereas distinctive words such as "quantum" for Quantum computing and "reality" for Augmented reality receive higher weights. This is why TF-IDF indicates what makes each document unique better than raw frequency alone.
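
The contrast can be illustrated on a tiny corpus. This is a minimal sketch that uses the same TF and IDF definitions as the notebook; the three toy documents are made up for illustration and are not the Wikipedia summaries.

```python
from collections import Counter
from math import log

# Toy corpus (illustrative only, not the Wikipedia summaries)
toy_docs = [
    "data science uses data".split(),
    "quantum computing uses qubits".split(),
    "big data uses storage".split(),
]

n_docs = len(toy_docs)
vocab = sorted({word for doc in toy_docs for word in doc})

# Document frequency: number of documents that contain each term
df = {term: sum(term in set(doc) for doc in toy_docs) for term in vocab}
idf = {term: log(n_docs / df[term]) for term in vocab}

def tfidf(doc):
    """TF-IDF weights for one document: (count / length) * idf."""
    counts = Counter(doc)
    return {term: (counts[term] / len(doc)) * idf[term] for term in vocab}

raw_doc0 = Counter(toy_docs[0])
weights_doc0 = tfidf(toy_docs[0])

for term in ["uses", "data", "science"]:
    print(f"{term}: raw={raw_doc0[term]}, tfidf={weights_doc0[term]:.4f}")
```

In this toy corpus "uses" appears in every document, so its weight is zero, and "science" outranks "data" under TF-IDF even though "data" has the higher raw count in the first document.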



3. Using cosine similarity, compare two documents and find out which of the documents is most similar.

In [226]:
# Cosine Similarity Between Documents
most_similar_score = -1
most_similar_pair = (None, None)

for i in range(len(tfidf_vectors)):
    for j in range(i + 1, len(tfidf_vectors)):
        sim = cosine_similarity(tfidf_vectors[i], tfidf_vectors[j], vocabulary)
        print(f"Document {i+1} vs Document {j+1}: {sim:.4f}")
        if sim > most_similar_score:
            most_similar_score = sim
            most_similar_pair = (i, j)

print(f"\nMost similar documents: {topics[most_similar_pair[0]]} and {topics[most_similar_pair[1]]} with similarity score of {most_similar_score:.4f}")


Document 1 vs Document 2: 0.1779
Document 1 vs Document 3: 0.0167
Document 1 vs Document 4: 0.2165
Document 1 vs Document 5: 0.0233
Document 2 vs Document 3: 0.0195
Document 2 vs Document 4: 0.2188
Document 2 vs Document 5: 0.0570
Document 3 vs Document 4: 0.0256
Document 3 vs Document 5: 0.0276
Document 4 vs Document 5: 0.0199

Most similar documents: Data Visualization and Big data with similarity score of 0.2188


4. Using the same dataset used above, use the word2vec package to create a classifier for dense vectors.

In [227]:
import numpy as np
from gensim.models import Word2Vec
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [228]:
# 1. Define Wikipedia topics
topics = [
    "Data Science",
    "Data Visualization",
    "Quantum computing",
    "Big data",
    "Augmented reality"
]

In [229]:
def get_doc_vector(doc_tokens):
    vectors = [w2v_model.wv[word] for word in doc_tokens if word in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.vector_size)

In [230]:
# 2. Fetch summaries
documents = [wikipedia.summary(topic) for topic in topics]

# 3. Tokenize documents
tokenized_documents = {i: re.findall(r'\b\w+\b', doc.lower()) for i, doc in enumerate(documents)}

In [231]:
# 4. Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_documents.values(), vector_size=100, window=5, min_count=1, workers=4, sg=1)
w2v_model.save("word2vec.model")

In [232]:
# 5. Create document vectors
X = np.array([get_doc_vector(tokens) for tokens in tokenized_documents.values()])

# 6. Define and encode labels
y_labels = ['data', 'data', 'tech', 'data', 'visual']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_labels)

a. Use Logistic Regression, with the appropriate configuration for the model and dataset.

In [233]:
# 7. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# 8. Train Logistic Regression model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

# 9. Predict on test set
y_pred = clf.predict(X_test)

# 10. Print classification report
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_, labels=np.unique(y)))

              precision    recall  f1-score   support

        data       0.50      1.00      0.67         1
        tech       0.00      0.00      0.00         0
      visual       0.00      0.00      0.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.17      0.33      0.22         2
weighted avg       0.25      0.50      0.33         2



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### 5. What are the differences of using word2vec compared to the tf-idf in terms of:

a. Vector Size?
- In TF-IDF, the vector size equals the number of unique words across all documents, which makes the document vectors very large and sparse. In Word2Vec, the vector size is fixed at 100 dimensions (`vector_size=100`) regardless of the vocabulary size, which makes the vectors more memory-efficient and better suited as input to models such as the Logistic Regression classifier trained above. With only five documents and a two-sample test set (micro-average F1 of 0.50), however, the classification report is too small to show whether the dense vectors actually perform well.

b. Vector Space?
- In the TF-IDF model, every document is mapped to a high-dimensional vector in which each dimension represents a distinct word of the vocabulary drawn from the five Wikipedia topics; the vectors are sparse because most words do not appear in every document. Word2Vec instead produces dense vectors by learning word relationships from their contexts in the tokenized summaries of the same topics. Every word is projected to a 100-dimensional dense vector, and each document is represented by the average of its word vectors, so related words (like "technology" and "computing") can end up with similar vector representations. This lets Word2Vec capture semantic similarity that TF-IDF, which treats each word as an independent dimension, struggles to capture.
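
The difference in size and sparsity can be seen in a small sketch. The random vectors below are stand-ins for trained Word2Vec embeddings, and the vocabulary, `embedding_dim`, and tokens are hypothetical values chosen for illustration (the notebook uses `vector_size=100`).

```python
import random

random.seed(0)

# Toy vocabulary; in the notebook this is every unique word in the five summaries
vocab = ["data", "science", "quantum", "computing",
         "reality", "augmented", "big", "uses"]

# Random stand-ins for learned word embeddings (NOT trained Word2Vec vectors)
embedding_dim = 4
toy_embeddings = {
    word: [random.gauss(0, 1) for _ in range(embedding_dim)] for word in vocab
}

doc_tokens = ["data", "science", "uses", "data"]

# Sparse-style vector: one dimension per vocabulary term, mostly zeros
bow_vector = [doc_tokens.count(word) for word in vocab]

# Dense vector: element-wise average of the document's word vectors
word_vectors = [toy_embeddings[word] for word in doc_tokens]
dense_vector = [sum(column) / len(word_vectors) for column in zip(*word_vectors)]

print("term-based length:", len(bow_vector), "zero entries:", bow_vector.count(0))
print("dense length:", len(dense_vector))
```

The term-based vector grows with the vocabulary, while the averaged vector keeps the embedding dimension no matter how many distinct words the corpus contains.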


### 6. How do we evaluate the performance of Semantic Models (i.e., TF-IDF and Word2Vec)?
- In this activity, I evaluated TF-IDF indirectly by using cosine similarity to measure how closely two documents were related; documents on similar topics showed higher similarity scores (for example, 0.2188 for Data Visualization and Big data versus 0.0167 for Data Science and Quantum computing). For Word2Vec, I checked the quality of the learned embeddings by using them as input features for a Logistic Regression classifier and measuring its performance with a classification report, which includes precision, recall, and F1-score. Good classification performance would indicate that the embeddings capture important meaning from the documents; here the test set held only two documents and the classifier got one of them right, so the result is not conclusive. In conclusion, I evaluated TF-IDF by checking document similarity and Word2Vec by looking at classification results.
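
To make the classification metrics concrete, the following sketch computes per-class precision, recall, and F1 by hand. The `y_true` and `y_pred` lists are inferred from the support and scores in the classification report above, not read from the notebook's variables, and `precision_recall_f1` is a hypothetical helper that treats an undefined ratio as 0.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1; undefined ratios are reported as 0.0."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the label was predicted
    actual = sum(t == label for t in y_true)     # times the label truly occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Inferred from the report: one "data" and one "visual" test document,
# both predicted as "data" (an assumption consistent with the printed scores)
y_true = ["data", "visual"]
y_pred = ["data", "data"]

for label in ["data", "tech", "visual"]:
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With a test set this small, a single prediction moves every score by a large amount, so a larger labeled corpus or cross-validation would be needed before drawing conclusions about the embeddings.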