## Week 11: Vector Space Modelling

In this tutorial, we walk through a simple example of vector space modelling. We then use cosine similarity to measure how similar each document is to a query and rank the documents accordingly.
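As a quick illustration of the measure used throughout, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the helper name `cosine` and the toy vectors are illustrative assumptions and not part of the tutorial code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector matches nothing
        return 0.0
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1, orthogonal vectors score 0
same_direction = cosine([1, 2], [2, 4])
orthogonal = cosine([1, 0], [0, 1])
print(same_direction, orthogonal)
```

The score depends only on the angle between the vectors, so a long document and a short query can still be compared fairly.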

In [2]:
sample_docs = ['The quick brown fox jumps over the lazy dog.',
               'A brown dog chased the fox.',
               'The dog is lazy.']

In [3]:
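import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

# One-time setup if the NLTK data is missing (the tokenizer resource is
# named 'punkt' or 'punkt_tab' depending on the NLTK version):
# nltk.download('punkt'); nltk.download('stopwords')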


In [4]:
## First step is to tokenize our text
tokenized_documents = [word_tokenize(document) for document in sample_docs]
tokenized_documents

[['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'],
 ['A', 'brown', 'dog', 'chased', 'the', 'fox', '.'],
 ['The', 'dog', 'is', 'lazy', '.']]

In [5]:
## The second step is to compute TF-IDF, which first requires preprocessing.
## For simplicity, we only lowercase the tokens and remove stop words.
## Punctuation tokens such as '.' are left in; TfidfVectorizer's default
## token pattern ignores them later.
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
cleaned_data = [[word.lower() for word in document if word.lower() not in english_stopwords] for document in tokenized_documents]
cleaned_data

[['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.'],
 ['brown', 'dog', 'chased', 'fox', '.'],
 ['dog', 'lazy', '.']]

In [6]:
## TfidfVectorizer takes strings as input, so let's join the tokens back into sentences
cleaned_sentences = [' '.join(document) for document in cleaned_data]
cleaned_sentences

['quick brown fox jumps lazy dog .', 'brown dog chased fox .', 'dog lazy .']

In [7]:
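## Let's define our vectorizer and fit it on the cleaned sentences
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_sentences)
## Each printed entry is (document index, term index) followed by the TF-IDF weight
print(tfidf_matrix)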

  (0, 2)	0.29225439586501756
  (0, 5)	0.37633074615060896
  (0, 4)	0.49482970636510465
  (0, 3)	0.37633074615060896
  (0, 0)	0.37633074615060896
  (0, 6)	0.49482970636510465
  (1, 1)	0.6317450542765208
  (1, 2)	0.3731188059313277
  (1, 3)	0.4804583972923858
  (1, 0)	0.4804583972923858
  (2, 2)	0.6133555370249717
  (2, 5)	0.7898069290660905


In [17]:
## Now that we have the TF-IDF vectors of the documents, let's write the query
query = "the brown dog"
## Preprocess the query the same way as the documents
query_tokens = word_tokenize(query)
query_cleaned = [word.lower() for word in query_tokens if word.lower() not in english_stopwords]
query_cleaned_combined = [' '.join(query_cleaned)]
query_cleaned_combined

['brown dog']

In [19]:
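## Get the TF-IDF vector of the query using the vectorizer fitted on the documents
## (transform, not fit_transform, so the vocabulary and IDF weights stay the same)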
query_tfIdf_vector = tfidf_vectorizer.transform(query_cleaned_combined)
print(query_tfIdf_vector)

  (0, 2)	0.6133555370249717
  (0, 0)	0.7898069290660905


In [20]:
## Now we need to find the cosine similarity between the query and documents
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(query_tfIdf_vector, tfidf_matrix)
cosine_similarities

array([[0.47648448, 0.60832386, 0.37620501]])
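Because `TfidfVectorizer` normalises every vector to unit length by default, the cosine similarity here reduces to a plain dot product. The sketch below checks this by hand with the weights copied from the printed sparse outputs, keyed by term index; the helper `dot` is an illustrative name.

```python
# Document vectors copied from the printed TF-IDF matrix: {term index: weight}
doc_vectors = [
    {2: 0.29225439586501756, 5: 0.37633074615060896, 4: 0.49482970636510465,
     3: 0.37633074615060896, 0: 0.37633074615060896, 6: 0.49482970636510465},
    {1: 0.6317450542765208, 2: 0.3731188059313277,
     3: 0.4804583972923858, 0: 0.4804583972923858},
    {2: 0.6133555370249717, 5: 0.7898069290660905},
]
# Query vector copied from the printed query output
query_vector = {2: 0.6133555370249717, 0: 0.7898069290660905}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v[i] for i, w in u.items() if i in v)

# Unit-length vectors: the dot product is already the cosine similarity
scores = [dot(query_vector, d) for d in doc_vectors]
print(scores)
```

Only the terms shared by the query and a document contribute, which is why the second document, containing both query terms with relatively high weights, scores highest.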

In [21]:
## To rank the documents, first pair each document with its similarity score
results = [(sample_docs[i], cosine_similarities[0][i]) for i in range(len(sample_docs))]
results

[('The quick brown fox jumps over the lazy dog.', 0.4764844828540594),
 ('A brown dog chased the fox.', 0.6083238568956406),
 ('The dog is lazy.', 0.37620501479919144)]

In [26]:
## Sorting the results based on similarity to rank the documents
results.sort(key=lambda x:x[1], reverse=True)
results

[('A brown dog chased the fox.', 0.6083238568956406),
 ('The quick brown fox jumps over the lazy dog.', 0.4764844828540594),
 ('The dog is lazy.', 0.37620501479919144)]

## Hands-On In-Class Exercise:

Given the following lists of relevant and retrieved documents, find the precision and recall of this retrieval system.<br>
Assume that every document is either relevant or retrieved, so there are no true negatives.

In [22]:
# Sample relevant documents and retrieved documents
relevant_documents = [0, 1, 2, 4]
retrieved_documents = [0, 1, 3, 5, 7]

# True positives: documents that are both relevant and retrieved
true_positives = set(relevant_documents) & set(retrieved_documents)
# False positives: retrieved but not relevant
fp_count = len([num for num in retrieved_documents if num not in relevant_documents])
# False negatives: relevant but not retrieved
fn_count = len([num for num in relevant_documents if num not in retrieved_documents])

precision = len(true_positives)/len(retrieved_documents)
recall = len(true_positives)/len(relevant_documents)
# Every document is assumed to be relevant or retrieved, so there are no true negatives
total_num_doc = len(set(relevant_documents) | set(retrieved_documents))
accuracy = len(true_positives)/total_num_doc

print(precision, recall, accuracy)

0.4 0.5 0.2857142857142857
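The same computation can be wrapped in a small reusable function. This is a minimal sketch; the name `precision_recall` and the choice to return 0.0 for empty inputs are assumptions made for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    # Guard against empty inputs (a convention chosen for this sketch)
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall([0, 1, 2, 4], [0, 1, 3, 5, 7])
print(precision, recall)  # 0.4 0.5
```

Working with sets makes the three counts explicit: precision divides by everything retrieved, recall by everything relevant.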


In [26]:
## Calculate precision at k for the following k_values
k_values = [1,3,5]

precision_at_k = []
# Calculate Precision at k
for k in k_values:
    retrieved_doc_trunc = retrieved_documents[:k]
    
    tp_count = set(relevant_documents) & set(retrieved_doc_trunc)
    precision = len(tp_count)/len(retrieved_doc_trunc)

    precision_at_k.append(precision) 

precision_at_k

[1.0, 0.6666666666666666, 0.4]

In [37]:
# Calculate the average precision

# First find the ranks at which a relevant document was retrieved
relevant_retrieved = set(relevant_documents) & set(retrieved_documents)
# Sort the positions so the precision values follow the ranking order
indices = sorted(retrieved_documents.index(val) for val in relevant_retrieved)

precision_at_k = []
# Calculate Precision at k
for k in indices:
    retrieved_doc_trunc = retrieved_documents[:k +1]
    
    tp_count = set(relevant_documents) & set(retrieved_doc_trunc)
    precision = len(tp_count)/len(retrieved_doc_trunc)

    precision_at_k.append(precision) 

avg_p_at_k = sum(precision_at_k)/len(precision_at_k)
precision_at_k

[1.0, 1.0]

In [49]:
## Calculate the Mean Average Precision, given that another IR system returns the following results
retrieved_documents_ir2 = [0, 3, 7, 2]

# Calculate avg_p_at_k for ir2
relevant_retrieved = set(relevant_documents) & set(retrieved_documents_ir2)
# Sort the positions so the precision values follow the ranking order
indices = sorted(retrieved_documents_ir2.index(val) for val in relevant_retrieved)

precision_at_k2 = []
# Calculate Precision at k
for k in indices:
    retrieved_doc_trunc = retrieved_documents_ir2[:k +1]
    
    tp_count = set(relevant_documents) & set(retrieved_doc_trunc)
    precision = len(tp_count)/len(retrieved_doc_trunc)

    precision_at_k2.append(precision) 

avg_p_at_k2 = sum(precision_at_k2)/len(precision_at_k2)


In [50]:
## Mean_Average_Precision

mean_avg_precision = (avg_p_at_k + avg_p_at_k2)/2
print(mean_avg_precision)

0.875
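The average precision steps above can be condensed into a single pass over each ranked list. The sketch below is one possible formulation, with illustrative names. By default it averages over the relevant documents that were actually retrieved, as the cells above do; the optional flag `normalise_by_all_relevant` is an assumption added here to show the other common convention, which divides by the total number of relevant documents and therefore penalises missed ones.

```python
def average_precision(relevant, ranked, normalise_by_all_relevant=False):
    """Average of precision@k over the ranks k that hold a relevant document.

    Assumes `ranked` contains no duplicate document ids.
    """
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    if not precisions:
        return 0.0
    denominator = len(relevant) if normalise_by_all_relevant else len(precisions)
    return sum(precisions) / denominator

relevant = [0, 1, 2, 4]
runs = [[0, 1, 3, 5, 7], [0, 3, 7, 2]]

ap_scores = [average_precision(relevant, run) for run in runs]
mean_ap = sum(ap_scores) / len(ap_scores)
print(ap_scores, mean_ap)  # [1.0, 0.75] 0.875
```

With `normalise_by_all_relevant=True` the same two runs score 0.5 and 0.375, so it is worth stating which convention a reported value uses.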


Given the following documents and queries, find the cosine similarity between each query and the documents, then rank the documents.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Natural language processing is a field of computer science.",
    "Machine learning algorithms analyze data to make predictions.",
    "Data preprocessing is essential for machine learning models.",
    "Python is a popular programming language for data science.",
    "Information retrieval involves finding relevant information in a collection.",
    "Neural networks are used in deep learning models.",
    "Statistical analysis helps in understanding data patterns.",
    "Big data technologies handle large volumes of data.",
    "Classification and regression are types of supervised learning.",
    "Clustering algorithms group similar data points together."
]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Queries
queries = [
    "What is the importance of preprocessing in machine learning?",
    "How do neural networks contribute to deep learning?"
]
# Convert each query to its TF-IDF representation
for query in queries:
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity between query and documents
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix)
    results = cosine_similarities[0].argsort()[-3:][::-1]  # Indices of the 3 most similar documents
    
    # Output top documents
    print("Top documents:")
    for idx in results:
        print(documents[idx])

Top documents:
Data preprocessing is essential for machine learning models.
Machine learning algorithms analyze data to make predictions.
Natural language processing is a field of computer science.
Top documents:
Neural networks are used in deep learning models.
Machine learning algorithms analyze data to make predictions.
Data preprocessing is essential for machine learning models.


In [38]:
list_of_actual_ranks = [[1,8,4],[1,2,5]]
## Calculate the MAP given the results you got from cosine similarity.

Kappa Measure Example Code:

In [55]:
# Annotator 1's relevance assessments
annotator1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 indicates relevant, 0 indicates not relevant

# Annotator 2's relevance assessments (with some disagreements)
annotator2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # 1 indicates relevant, 0 indicates not relevant

In [71]:
# Cohen's kappa computed by hand from the confusion matrix
from sklearn.metrics import confusion_matrix

# Rows are annotator2's labels, columns are annotator1's labels
conf_mat = confusion_matrix(annotator2, annotator1)
n = conf_mat.sum()
# Observed agreement: both say not relevant, or both say relevant
P_A = (conf_mat[0,0] + conf_mat[1,1])/n

# Chance agreement per class: product of the two annotators' marginal proportions
P_rel = (conf_mat[:,1].sum()/n) * (conf_mat[1,:].sum()/n)
P_norel = (conf_mat[:,0].sum()/n) * (conf_mat[0,:].sum()/n)

P_E = P_rel + P_norel
kappa = (P_A - P_E)/(1 - P_E)
print(round(float(kappa), 4))

0.4

In [42]:
from sklearn.metrics import cohen_kappa_score

# Compute Cohen's Kappa for relevance assessments
kappa_ir = cohen_kappa_score(annotator1, annotator2)

print(f"Cohen's Kappa for IR: {kappa_ir}")

Cohen's Kappa for IR: 0.4


A kappa of 0.4 suggests only fair agreement between the two annotators on the relevance assessments. To obtain more reliable judgments, add more annotators or replace one of the annotators.
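For reference, the kappa value can also be derived without any library, which makes the two ingredients visible: the observed agreement and the agreement expected by chance. This is a minimal sketch of Cohen's kappa using each annotator's own label proportions; the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each class, multiply the annotators' proportions
    classes = set(labels_a) | set(labels_b)
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    # Undefined when p_expected == 1 (both annotators use a single label)
    return (p_observed - p_expected) / (1 - p_expected)

annotator1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
annotator2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]

kappa = cohen_kappa(annotator1, annotator2)
print(round(kappa, 4))  # 0.4
```

Here the annotators agree on 7 of 10 items, but 5 of 10 agreements would be expected by chance alone, which is why the corrected score is much lower than the raw agreement.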