# Cosine similarity

Cosine similarity measures how similar two vectors are in direction, irrespective of their magnitude. In Natural Language Processing (NLP) and word vectorization, it is often used to determine the similarity between two text documents, or between two terms within a document corpus.

The cosine similarity is calculated as follows:

![cosine-similarity-formula.jpg](../imgs/cosine-similarity-formula.jpg)

where:
- A and B are the vectors representing two documents or terms.
- A · B is the dot product of the vectors.
- ||A|| and ||B|| are the magnitudes (or Euclidean norms) of the vectors.
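
In text form, the same formula can be written as below. The angle θ between the two vectors and the component index i are notation assumed here for illustration.

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}
$$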

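The resulting cosine similarity score ranges from -1 to 1:
* 1 indicates that the vectors point in the same direction (they need not be identical, since magnitude is ignored).
* 0 indicates that the vectors are orthogonal (no similarity).
* -1 indicates that the vectors point in opposite directions.

TF-IDF vectors have no negative components, so for the documents compared here the score always falls between 0 and 1.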
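
The three reference values can be checked numerically with a few toy vectors. This is a minimal sketch: the helper `cosine` and the example vectors are made up for illustration and are not part of the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])

same_direction = cosine(a, 3 * a)       # scaled copy: magnitude differs, direction does not
orthogonal = cosine([1, 0], [0, 1])     # perpendicular vectors
opposite = cosine(a, -a)                # same line, opposite direction

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three printed values are 1, 0, and -1. The first case shows why a score of 1 does not imply identical vectors.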

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Define the corpus with a few sample sentences (in Spanish)
corpus = [
    "que dia es hoy",
    "martes el dia de hoy es martes",
    "martes muchas gracias"
]

## Create a TF-IDF Vectorizer

In [3]:
# Create a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Convert the TF-IDF matrix to a DataFrame for better readability
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

In [4]:
print("TF-IDF Matrix:")
print(tfidf_df)

TF-IDF Matrix:
         de       dia        el        es   gracias       hoy    martes  \
0  0.000000  0.459854  0.000000  0.459854  0.000000  0.459854  0.000000   
1  0.406598  0.309228  0.406598  0.309228  0.000000  0.309228  0.618457   
2  0.000000  0.000000  0.000000  0.000000  0.622766  0.000000  0.473630   

     muchas       que  
0  0.000000  0.604652  
1  0.000000  0.000000  
2  0.622766  0.000000  


## Compute the cosine similarity matrix

In [5]:
cos_sim_matrix = cosine_similarity(X)

In [6]:
# Convert the cosine similarity matrix to a DataFrame for better readability
cos_sim_df = pd.DataFrame(cos_sim_matrix, index=["Doc1", "Doc2", "Doc3"], columns=["Doc1", "Doc2", "Doc3"])

In [7]:
print("\nCosine Similarity Matrix:")
print(cos_sim_df)


Cosine Similarity Matrix:
          Doc1      Doc2     Doc3
Doc1  1.000000  0.426599  0.00000
Doc2  0.426599  1.000000  0.29292
Doc3  0.000000  0.292920  1.00000
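
These values can be traced back to the TF-IDF matrix. With its default `norm="l2"` setting, `TfidfVectorizer` scales every row to unit length, so the cosine similarity reduces to a plain dot product. For example, Doc1 and Doc2 share only the terms `dia`, `es`, and `hoy`, giving 3 × 0.459854 × 0.309228 ≈ 0.4266. Doc1 and Doc3 share no terms, so their score is 0. The sketch below checks this, assuming the same corpus and default vectorizer settings as above. The variable names `row_norms` and `dot_products` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

corpus = [
    "que dia es hoy",
    "martes el dia de hoy es martes",
    "martes muchas gracias",
]

# Default settings: each TF-IDF row is L2-normalized (norm="l2")
X = TfidfVectorizer().fit_transform(corpus)

# Every row should have Euclidean norm 1
row_norms = np.linalg.norm(X.toarray(), axis=1)

# Plain pairwise dot products, with no division by the norms
dot_products = linear_kernel(X)

print(row_norms)
print(np.allclose(dot_products, cosine_similarity(X)))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and the two results would generally differ.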


## Manually compute the cosine similarity matrix

In [8]:
def manual_cosine_similarity(tfidf_matrix):
    # Pairwise dot products between all document vectors (rows)
    dot_products = np.dot(tfidf_matrix, tfidf_matrix.T)
    # Euclidean norm of each row
    norms = np.linalg.norm(tfidf_matrix, axis=1)
    # Divide each dot product by the product of the two row norms
    return dot_products / np.outer(norms, norms)

In [9]:
# Calculate cosine similarity manually
manual_cos_sim_matrix = manual_cosine_similarity(X.toarray())

# Convert the manually calculated cosine similarity matrix to a DataFrame
manual_cos_sim_df = pd.DataFrame(manual_cos_sim_matrix, index=["Doc1", "Doc2", "Doc3"], columns=["Doc1", "Doc2", "Doc3"])

# Display the manually calculated cosine similarity DataFrame
print("\nManually Calculated Cosine Similarity Matrix:")
print(manual_cos_sim_df)


Manually Calculated Cosine Similarity Matrix:
          Doc1      Doc2     Doc3
Doc1  1.000000  0.426599  0.00000
Doc2  0.426599  1.000000  0.29292
Doc3  0.000000  0.292920  1.00000
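
One caveat of the manual formula is that a row with zero norm leads to a division by zero and `NaN` entries. This can happen, for example, when a document contains no terms from the vocabulary. The variant below is a sketch of one way to guard against that. The name `safe_cosine_similarity`, the `eps` threshold, and the toy matrix are assumptions made for illustration, and returning 0 for zero vectors is a design choice rather than the only option.

```python
import numpy as np

def safe_cosine_similarity(matrix, eps=1e-12):
    """Pairwise cosine similarity between rows; all-zero rows get similarity 0."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Replace (near-)zero norms with 1 so zero rows stay zero instead of becoming NaN
    norms = np.where(norms < eps, 1.0, norms)
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

docs = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],   # empty document: all-zero vector
    [2.0, 4.0, 0.0],   # same direction as the first row
])

sim = safe_cosine_similarity(docs)
print(sim)
```

Here the first and third rows have a similarity of about 1, while every entry involving the all-zero row is 0 rather than `NaN`.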
