# Cosine Similarity

## Formula

$$\text{cosine similarity} = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:
- **A** and **B** = two vectors
- **A · B** = dot product of A and B
- **||A||** and **||B||** = magnitude (Euclidean norm) of vectors A and B
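
As a quick check of the formula, here is the arithmetic worked by hand for two illustrative vectors, A = (1, 2, 3) and B = (4, 5, 6):

$$A \cdot B = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32, \qquad \|A\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14}, \qquad \|B\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{77}$$

$$\text{cosine similarity} = \frac{32}{\sqrt{14} \times \sqrt{77}} = \frac{32}{\sqrt{1078}} \approx 0.9746$$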

## What It Measures

Cosine similarity measures the **cosine of the angle** between two vectors:
- **Range**: -1 to +1 (0 to +1 when all components are non-negative, as with TF-IDF weights or ratings)
- **+1**: Vectors point in the same direction (one is a positive multiple of the other)
- **0**: Vectors are orthogonal (no shared direction)
- **-1**: Vectors point in opposite directions
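
The three landmark values can be reproduced with a few lines of NumPy. This is only a sketch: `cos_sim` is a hypothetical helper written for this illustration, and the 2D vectors are made up so that the angles are easy to see.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two 1D vectors (hypothetical helper for this sketch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = [1, 0]

same_direction = cos_sim(reference, [3, 0])   # parallel, but three times longer
orthogonal = cos_sim(reference, [0, 2])       # 90 degrees apart
opposite = cos_sim(reference, [-5, 0])        # 180 degrees apart

print(same_direction, orthogonal, opposite)   # prints: 1.0 0.0 -1.0
```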

Unlike Euclidean distance, cosine similarity focuses on **orientation** rather than magnitude.
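
A small sketch of what "orientation rather than magnitude" means in practice: stretching one vector by a positive factor leaves the cosine similarity unchanged but moves the Euclidean distance a lot. The vectors and the factor of 10 are arbitrary choices for illustration, and `cos_sim` is a hypothetical helper.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity of two 1D vectors (hypothetical helper for this sketch)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])
b_scaled = 10 * b  # same direction as b, ten times longer

cos_before = cos_sim(a, b)
cos_after = cos_sim(a, b_scaled)

dist_before = float(np.linalg.norm(a - b))
dist_after = float(np.linalg.norm(a - b_scaled))

print(f"cosine:    {cos_before:.4f} -> {cos_after:.4f}")   # unchanged up to rounding
print(f"euclidean: {dist_before:.4f} -> {dist_after:.4f}")  # grows with the scaling
```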

## When to Use Cosine Similarity

**Use when:**
- Comparing text documents (TF-IDF, word embeddings)
- Building recommendation systems
- Measuring semantic similarity
- Working with high-dimensional data
- Direction matters more than magnitude
- Finding similar items in a database

**Avoid when:**
- Magnitude is important (use Euclidean distance)
- Working with binary or set-like data (consider Jaccard similarity)
- Features need different weights (cosine similarity treats every dimension equally unless the vectors are reweighted first)

## Cosine Similarity vs Euclidean Distance

| Metric | Measures | Range | Scale Sensitive |
|--------|----------|-------|-----------------|
| Cosine Similarity | Angle/Orientation | [-1, 1] | No |
| Euclidean Distance | Straight-line distance (affected by magnitude) | [0, ∞) | Yes |
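
The two metrics are closely related once vectors are scaled to unit length. For unit vectors $\hat{A}$ and $\hat{B}$, expanding the squared distance gives

$$\|\hat{A} - \hat{B}\|^2 = \|\hat{A}\|^2 + \|\hat{B}\|^2 - 2\,\hat{A} \cdot \hat{B} = 2 - 2 \times \text{cosine similarity}$$

so ranking by Euclidean distance and ranking by cosine similarity agree on normalized data. The sketch below checks the identity numerically on random vectors; the seed and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.normal(size=5)
b = rng.normal(size=5)

# Scale both vectors to unit length
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cosine = float(np.dot(a_hat, b_hat))
squared_distance = float(np.sum((a_hat - b_hat) ** 2))

# The two sides of the identity should match up to floating-point error
print(squared_distance, 2 - 2 * cosine)
```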

## Common Applications

1. **Information Retrieval**: Find similar documents
2. **Recommendation Systems**: Find similar users or items
3. **Natural Language Processing**: Measure semantic similarity
4. **Image Similarity**: Compare image feature vectors
5. **Clustering**: Group similar items together
6. **Anomaly Detection**: Find outliers based on similarity

## Key sklearn Functions

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Inputs are 2D arrays of shape (n_samples, n_features)
X = np.array([[1, 2, 3]])
Y = np.array([[4, 5, 6]])

# Cosine similarity (values from -1 to 1)
# Result has shape (n_samples_X, n_samples_Y)
similarity = cosine_similarity(X, Y)

# Cosine distance = 1 - cosine similarity (values from 0 to 2)
distance = cosine_distances(X, Y)
```

**Note**: Cosine distance = 1 - Cosine similarity
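
To make the note concrete, here is a NumPy-only sketch of cosine distance. `cosine_distance` is a hypothetical helper for 1D vectors, not the scikit-learn function, and the vectors are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity for two 1D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)

A = [1, 2, 3]
B = [4, 5, 6]

near = cosine_distance(A, B)             # similar direction, distance close to 0
far = cosine_distance(A, [-1, -2, -3])   # opposite direction, distance close to 2

print(f"{near:.4f}")  # prints: 0.0254
print(f"{far:.4f}")   # prints: 2.0000
```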

# Cosine Similarity from Scratch (NumPy only)

```python
import numpy as np

def cosine_similarity_manual(A, B):
    """
    Calculate cosine similarity between two vectors
    """
    # Dot product
    dot_product = np.dot(A, B)

    # Magnitudes
    magnitude_A = np.sqrt(np.sum(A**2))
    magnitude_B = np.sqrt(np.sum(B**2))

    # Cosine similarity
    similarity = dot_product / (magnitude_A * magnitude_B)

    return similarity

# Test
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

result = cosine_similarity_manual(A, B)
print(f"Cosine Similarity: {result:.4f}")
```

```
Cosine Similarity: 0.9746
```
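
One edge case is worth guarding: if either vector is all zeros, its magnitude is 0 and the division is undefined. The sketch below adds a guard. Returning `0.0` in that case is an assumed convention chosen for this example, and both `cosine_similarity_safe` and its `eps` threshold are hypothetical names.

```python
import numpy as np

def cosine_similarity_safe(A, B, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near-)zero length."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    denominator = np.linalg.norm(A) * np.linalg.norm(B)
    if denominator < eps:
        # Assumed convention: a zero vector has no direction, so report no similarity
        return 0.0

    return float(np.dot(A, B) / denominator)

zero_case = cosine_similarity_safe([0, 0, 0], [1, 2, 3])
normal_case = cosine_similarity_safe([1, 2, 3], [4, 5, 6])

print(zero_case)             # prints: 0.0
print(f"{normal_case:.4f}")  # prints: 0.9746
```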


# Quick Implementation with scikit-learn

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Two vectors (2D arrays with one row each, as scikit-learn expects)
A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

# Calculate cosine similarity
similarity = cosine_similarity(A, B)
print("Cosine Similarity:", similarity[0][0])
```

```
Cosine Similarity: 0.9746318461970762
```


# Compare Two Vectors

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Two documents represented as vectors (doc2 is exactly 2 * doc1)
doc1 = np.array([[1, 2, 3, 4, 5]])
doc2 = np.array([[2, 4, 6, 8, 10]])

similarity = cosine_similarity(doc1, doc2)
print(f"Similarity: {similarity[0][0]:.4f}")
```

```
Similarity: 1.0000
```


# Compare Multiple Vectors

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Multiple vectors
vectors = np.array([
    [1, 2, 3],      # Vector 1
    [4, 5, 6],      # Vector 2
    [7, 8, 9],      # Vector 3
    [1, 0, 0]       # Vector 4
])

# Calculate all pairwise similarities (row i, column j compares vector i with vector j)
similarity_matrix = cosine_similarity(vectors)

print("Similarity Matrix:")
print(similarity_matrix)
```

```
Similarity Matrix:
[[1.         0.97463185 0.95941195 0.26726124]
 [0.97463185 1.         0.99819089 0.45584231]
 [0.95941195 0.99819089 1.         0.50257071]
 [0.26726124 0.45584231 0.50257071 1.        ]]
```
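
The same pairwise matrix can be sketched with NumPy alone, which shows what the computation amounts to: scale each row to unit length, then take all dot products at once. This assumes no row is a zero vector.

```python
import numpy as np

vectors = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [1, 0, 0],
], dtype=float)

# Scale each row to unit length (assumes no all-zero rows)
row_norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit_vectors = vectors / row_norms

# Entry (i, j) is the dot product of unit vectors i and j, i.e. their cosine similarity
similarity_matrix = unit_vectors @ unit_vectors.T

print(np.round(similarity_matrix, 4))
```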


# Find Most Similar Items

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query vector
query = np.array([[1, 2, 3]])

# Database of vectors
database = np.array([
    [1, 2, 3],      # Item 0
    [4, 5, 6],      # Item 1
    [1, 0, 0],      # Item 2
    [10, 20, 30]    # Item 3
])

# Calculate similarities
similarities = cosine_similarity(query, database)[0]

# Get indices sorted by similarity (descending)
sorted_indices = np.argsort(similarities)[::-1]

print("Items ranked by similarity:")
for idx in sorted_indices:
    print(f"Item {idx}: {similarities[idx]:.4f}")
```

```
Items ranked by similarity:
Item 3: 1.0000
Item 0: 1.0000
Item 1: 0.9746
Item 2: 0.2673
```
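
Often only the best few matches are needed. Here is a NumPy-only sketch of a top-k lookup; `top_k_similar` and its parameter `k` are hypothetical names for this illustration. Items 0 and 3 point in the same direction as the query, so their scores are tied up to floating-point rounding and their relative order should not be relied on.

```python
import numpy as np

def top_k_similar(query, database, k=2):
    """Return (index, similarity) pairs for the k rows most similar to the query."""
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)

    # Cosine similarity of the query against every row (assumes no zero vectors)
    similarities = (database @ query) / (
        np.linalg.norm(database, axis=1) * np.linalg.norm(query)
    )

    # Sort descending and keep the first k indices
    order = np.argsort(-similarities, kind="stable")[:k]
    return [(int(i), float(similarities[i])) for i in order]

database = [
    [1, 2, 3],      # Item 0
    [4, 5, 6],      # Item 1
    [1, 0, 0],      # Item 2
    [10, 20, 30],   # Item 3
]

top = top_k_similar([1, 2, 3], database, k=2)
for index, score in top:
    print(f"Item {index}: {score:.4f}")
```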


# Text Similarity with TF-IDF

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Documents
documents = [
    "I love machine learning",
    "Machine learning is awesome",
    "I enjoy deep learning",
    "Python is great for data science"
]

# Convert to TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

print("Document Similarity Matrix:")
print(similarity_matrix)
```

```
Document Similarity Matrix:
[[1.         0.44371444 0.18433833 0.        ]
 [0.44371444 1.         0.16128176 0.16102913]
 [0.18433833 0.16128176 1.         0.        ]
 [0.         0.16102913 0.         1.        ]]
```
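
The exact zeros in the matrix come from document pairs that share no terms: with no overlapping dimensions the dot product is 0. The sketch below illustrates this with plain word counts instead of TF-IDF weights, so the non-zero values will differ from those above. `count_cosine` is a hypothetical helper, and lowercasing plus whitespace splitting is a simplifying assumption rather than the vectorizer's tokenization.

```python
import math
from collections import Counter

def count_cosine(text_a, text_b):
    """Cosine similarity of two texts using raw word counts (simplified sketch)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product
    shared_words = counts_a.keys() & counts_b.keys()
    dot_product = sum(counts_a[w] * counts_b[w] for w in shared_words)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot_product / (norm_a * norm_b)

no_overlap = count_cosine("I love machine learning",
                          "Python is great for data science")
some_overlap = count_cosine("I love machine learning",
                            "Machine learning is awesome")

print(no_overlap)    # prints: 0.0
print(some_overlap)  # prints: 0.5
```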


# Recommendation System Example

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User ratings for movies (rows=users, columns=movies)
# 0 = not rated
user_ratings = np.array([
    [5, 4, 0, 0, 1],  # User 1
    [4, 5, 0, 0, 2],  # User 2
    [0, 0, 5, 4, 0],  # User 3
    [0, 0, 4, 5, 0],  # User 4
])

# Calculate user similarity
user_similarity = cosine_similarity(user_ratings)

print("User Similarity Matrix:")
print(user_similarity)
print()
print("Users 1 and 2 similarity:", user_similarity[0][1])
print("Users 3 and 4 similarity:", user_similarity[2][3])
```

```
User Similarity Matrix:
[[1.         0.96609178 0.         0.        ]
 [0.96609178 1.         0.         0.        ]
 [0.         0.         1.         0.97560976]
 [0.         0.         0.97560976 1.        ]]

Users 1 and 2 similarity: 0.9660917830792959
Users 3 and 4 similarity: 0.9756097560975611
```
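
A typical next step in a recommender is to look up each user's nearest neighbor. The sketch below does this with NumPy only, recomputing the similarity matrix from the same ratings and masking the diagonal so that a user is never matched with themselves. Treating unrated movies as 0 is carried over from the example above and is a simplification.

```python
import numpy as np

user_ratings = np.array([
    [5, 4, 0, 0, 1],  # User 1
    [4, 5, 0, 0, 2],  # User 2
    [0, 0, 5, 4, 0],  # User 3
    [0, 0, 4, 5, 0],  # User 4
], dtype=float)

# Pairwise cosine similarity between users (assumes every user rated something)
unit_rows = user_ratings / np.linalg.norm(user_ratings, axis=1, keepdims=True)
user_similarity = unit_rows @ unit_rows.T

# Exclude self-matches, then pick the most similar other user per row
masked = user_similarity.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

for user, neighbor in enumerate(nearest):
    score = user_similarity[user, neighbor]
    print(f"User {user + 1} -> User {neighbor + 1} ({score:.4f})")
```

With these ratings, Users 1 and 2 pair up with each other, as do Users 3 and 4.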
