# Finding Similar Documents 
Objective: Learn how to calculate mathematically how similar two documents are. This is the core concept behind recommendation engines and semantic search.

## The Concept: From Text to Vectors
How can we know if two articles are similar? The core idea is this: _if we represent each document as a numerical vector, we can then use mathematical formulas to measure the "distance" or "angle" between these vectors_. Documents with similar vector representations are likely to be about similar topics.
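
To make the idea concrete, here is a minimal sketch that turns three tiny, made-up sentences into word-count vectors over a shared vocabulary. The sentences and the helper names `to_count_vector` and `shared_terms` are illustrative assumptions. Raw counts are used only to keep the example small; the notebook itself uses TF-IDF weights.

```python
from collections import Counter

# Three tiny, made-up documents (illustrative only, not from the BBC dataset)
docs = [
    "the striker scored a late goal",
    "the keeper saved a late penalty",
    "shares fell as the bank cut rates",
]

# Build a shared vocabulary: one vector position per distinct word
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]

def shared_terms(vec_a, vec_b):
    """Count vocabulary positions where both documents have a nonzero count."""
    return sum(1 for a, b in zip(vec_a, vec_b) if a > 0 and b > 0)

print("Vocabulary size:", len(vocabulary))
print("Doc 0 vs doc 1 shared terms:", shared_terms(vectors[0], vectors[1]))
print("Doc 0 vs doc 2 shared terms:", shared_terms(vectors[0], vectors[2]))
```

The two football sentences overlap in more vector positions than the football and finance sentences do, which is the signal a similarity measure picks up on.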

### Setup and Vectorization

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Note: We'll re-use a simplified cleaning process here
import re

def simple_clean(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Load and clean the data
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
bbc_df = pd.read_csv(url)
bbc_df['cleaned_text'] = bbc_df['text'].apply(simple_clean)

# Vectorize the text with TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(bbc_df['cleaned_text'])

print("Setup complete. Text has been vectorized.")
print("Shape of our TF-IDF Matrix:", tfidf_matrix.shape)

### Measuring Similarity: Cosine Similarity
The most common way to measure the similarity between two text vectors is Cosine Similarity.

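It measures the cosine of the angle between two vectors, so it depends on their direction and not on their length. Because TF-IDF weights are never negative, the score for two documents always falls between 0 and 1.
- A score of 1 means the vectors point in exactly the same direction (angle is 0°): the documents use the same terms in the same proportions.
- A score of 0 means the vectors are orthogonal (angle is 90°): the documents share no terms from the vocabulary.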
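
For reference, the score is the dot product of the two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on hand-picked toy vectors. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # second vector is the first scaled by 2
no_overlap = cosine([1, 0, 0], [0, 3, 0])      # no shared nonzero position
partial = cosine([1, 1, 0], [1, 0, 1])         # one shared position out of two each

print(f"same direction: {same_direction:.2f}")  # 1.00
print(f"no overlap:     {no_overlap:.2f}")      # 0.00
print(f"partial:        {partial:.2f}")         # 0.50
```

Note that scaling a vector does not change the score, which is why a long article and a short one on the same topic can still score close to 1.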

scikit-learn has a function that makes this very easy to calculate.

In [None]:
# Calculate the cosine similarity between every pair of documents
# The result is a square matrix: entry [i, j] is the similarity between document i and document j
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("Shape of the similarity matrix:", cosine_sim_matrix.shape)

# Let's look at the similarity of the first 5 docs to each other
print("\nSimilarity matrix for the first 5 documents:")
print(cosine_sim_matrix[0:5, 0:5])

### Building a Simple Recommendation Function
Let's put this together into a function that, given one article, finds the n most similar articles.
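
Before looking at the full function, here is a small sketch of the ranking step in isolation, run on a made-up list of scores. The helper name `top_n_similar` and the example numbers are hypothetical. It skips the query article by index and uses `heapq.nlargest` from the standard library, which is one reasonable way to pick the best few items without sorting the whole list.

```python
import heapq

def top_n_similar(scores, query_index, top_n=5):
    """Return (index, score) pairs for the top_n highest scores, skipping query_index."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_index)
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])

# Hypothetical similarity scores of article 0 against five articles (made-up numbers)
example_scores = [1.0, 0.12, 0.57, 0.03, 0.41]
print(top_n_similar(example_scores, query_index=0, top_n=3))
# [(2, 0.57), (4, 0.41), (1, 0.12)]
```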

In [None]:
def find_similar_articles(article_index, top_n=5):
    # Get the similarity scores for the given article
    similar_scores = list(enumerate(cosine_sim_matrix[article_index]))
    
    # Sort the articles based on similarity score in descending order
    sorted_similar_scores = sorted(similar_scores, key=lambda x: x[1], reverse=True)
    
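    # Get the indices of the top N most similar articles, skipping the article itself.
    # Filtering by index is safer than dropping the first result: an exact duplicate
    # article can tie with the original at the top of the sorted list.
    top_indices = [i for i, _ in sorted_similar_scores if i != article_index][:top_n]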
    
    print(f"--- Top {top_n} articles similar to Article {article_index} ---")
    print(f"ORIGINAL: {bbc_df['text'].iloc[article_index][:100]}...")
    print("-" * 50)
    
    # Return the most similar articles
    return bbc_df.iloc[top_indices]

# Let's test it! Find articles similar to the first one in the dataset.
find_similar_articles(0)

## Exercise
Your task is to use the find_similar_articles function to find the top 3 articles most similar to Article 500.
- Call the function with the correct parameters (article_index=500, top_n=3).
- Look at the category of the original article and the categories of the recommended articles.

In [None]:
# Your code for the Exercise here
# First, see what the original article is about
print("--- Original Article #500 ---")
print(bbc_df.iloc[500])
print("\n" + "="*50 + "\n")

# Now, find similar articles