# Cosine Similarity / Cosine Distance

> Cosine similarity is a metric, helpful in determining, how similar the data objects are irrespective of their size. In cosine similarity, data objects in a dataset are treated as a vector.
> 
> [GeeksforGeeks - Cosine Similarity](https://www.geeksforgeeks.org/cosine-similarity/)
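
To make the "irrespective of their size" point concrete, here is a minimal sketch that computes cosine similarity straight from its definition: the dot product of two vectors divided by the product of their lengths. The helper names `cosine_similarity` and `cosine_distance` and the sample vectors are illustrative assumptions, not part of this notebook's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance is defined as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]    # same direction as v, 10x the magnitude
orthogonal = [3.0, 0.0, -1.0]  # dot product with v is 0

print(cosine_similarity(v, scaled))      # approximately 1: magnitude is ignored
print(cosine_similarity(v, orthogonal))  # 0: the vectors are perpendicular
```

Because only the direction of the vectors matters, scaling a vector leaves its similarity to any other vector unchanged.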

## Setup

Import the required libraries, load environment variables, initialize the OpenAI client, and define a helper function that creates embeddings for a list of texts.

In [None]:
from dotenv import load_dotenv
from openai import OpenAI
from scipy.spatial import distance
import numpy as np

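# Load variables from a local .env file (the OpenAI client reads OPENAI_API_KEY from the environment)
load_dotenv()

client = OpenAI()

def create_embeddings(texts):
    """Return one embedding (a list of floats) for each string in `texts`."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )

    # Convert the response object to a plain dict and pull out the vectors
    response_dict = response.model_dump()

    embeddings = [item["embedding"] for item in response_dict["data"]]

    return embeddings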


## Create articles with embeddings

In [None]:
articles = [
  { "headline": "Federal Reserve Cuts Interest Rates Amid Softening Economy", "topic": "economy" },
  { "headline": "Stock Markets Surge to Record Highs Despite Volatility", "topic": "economy" },
  { "headline": "Agentic AI Dominates Technology Landscape", "topic": "technology" },
  { "headline": "Nvidia Invests $2 Billion in Synopsys, Deepens AI Partnership", "topic": "technology" },
  { "headline": "Coupang Hacked: Personal Data of Millions Exposed", "topic": "security" },
  { "headline": "Interstellar Comet 3I/ATLAS Makes Closest Approach to Earth", "topic": "science" },
  { "headline": "New Alzheimer's Theory Points to Lithium's Role", "topic": "science" },
  { "headline": "The agentic reality check: Preparing for a silicon-based workforce", "topic": "technology" },
  { "headline": "The great rebuild: Architecting an AI-native tech organization", "topic": "technology" },
  { "headline": "Deaths Rose in Emergency Rooms After Hospitals Were Acquired by Private Equity Firms", "topic": "healthcare" }
]

headlines = [article["headline"] for article in articles]

# Embed all headlines in a single request using the helper defined in Setup
embeddings = create_embeddings(headlines)

# Attach each embedding to its article; embeddings are matched to articles by position
for article, embedding in zip(articles, embeddings):
    article["embedding"] = embedding

## Create embedding for search text

In [None]:
search_text = "nvidia"
search_embedding = create_embeddings([search_text])[0]

## Calculate cosine distances and find closest article

In [None]:
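# Cosine distance = 1 - cosine similarity, so smaller values mean more similar
distances = []
for article in articles:
    dist = distance.cosine(search_embedding, article["embedding"])
    distances.append(dist)

# Index of the article with the smallest distance to the search text
min_dist_index = np.argmin(distances)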
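
The loop above yields only the single closest article. As a sketch of a possible extension, the same distances can be computed for all candidates at once with NumPy and sorted to produce a full ranking. The function `rank_by_cosine_distance` and the toy 3-dimensional vectors are hypothetical stand-ins so the example runs without an API call; they are not part of the notebook.

```python
import numpy as np

def rank_by_cosine_distance(query, candidates):
    """Return (indices sorted from closest to farthest, cosine distance per candidate)."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Cosine similarity of the query against every candidate row at once
    similarities = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    )
    distances = 1.0 - similarities
    return np.argsort(distances), distances

# Toy vectors standing in for real embeddings (illustrative values only)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

order, toy_distances = rank_by_cosine_distance(toy_query, toy_embeddings)
for i in order:
    print(i, round(float(toy_distances[i]), 4))
```

With the notebook's data, calling `rank_by_cosine_distance(search_embedding, [a["embedding"] for a in articles])` should put the same article first that `np.argmin` selects, followed by the remaining articles in order of increasing distance.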

## Output closest related article

In [None]:
print(f"Closest article to '{search_text}': {articles[min_dist_index]['headline']} (Distance: {distances[min_dist_index]:.4f})")

## References

[Wikipedia - Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

[GeeksforGeeks - Cosine Similarity](https://www.geeksforgeeks.org/dbms/cosine-similarity/)

[IBM - Cosine Similarity](https://www.ibm.com/think/topics/cosine-similarity)

[Scikit-learn - cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

[Scikit-learn - cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html)