# Vector Space

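A vector space is a collection of vectors that all have the same number of components, and that number is its dimension. In this notebook the vectors are text embeddings: each headline becomes a point in the space, and headlines with similar meaning tend to land close together. That geometry is what makes the vectors useful for machine learning tasks such as similarity search, clustering, and classification.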
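
As a rough illustration of how closeness between vectors can be measured, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors and their names are made up for this example; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only
economy_a = [0.9, 0.1, 0.0]
economy_b = [0.8, 0.2, 0.1]
science = [0.0, 0.2, 0.9]

sim_same_topic = cosine_similarity(economy_a, economy_b)
sim_cross_topic = cosine_similarity(economy_a, science)

# The two "economy" vectors point in nearly the same direction,
# so their similarity is much higher than the cross-topic one.
print(f"same topic:  {sim_same_topic:.3f}")
print(f"cross topic: {sim_cross_topic:.3f}")
```

A value near 1 means the vectors point in almost the same direction, and a value near 0 means they are close to orthogonal.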

## Setup

In [None]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import json
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

load_dotenv()

## Create a list of headline dictionaries

In [None]:
articles = [
  { "headline": "Federal Reserve Cuts Interest Rates Amid Softening Economy", "topic": "economy" },
  { "headline": "Stock Markets Surge to Record Highs Despite Volatility", "topic": "economy" },
  { "headline": "Agentic AI Dominates Technology Landscape", "topic": "technology" },
  { "headline": "Nvidia Invests $2 Billion in Synopsys, Deepens AI Partnership", "topic": "technology" },
  { "headline": "Coupang Hacked: Personal Data of Millions Exposed", "topic": "security"},
  { "headline": "Interstellar Comet 3I/ATLAS Makes Closest Approach to Earth", "topic": "science" },
  { "headline": "New Alzheimer's Theory Points to Lithium's Role", "topic": "science" },
  { "headline": "The agentic reality check: Preparing for a silicon-based workforce", "topic": "technology" },
  { "headline": "The great rebuild: Architecting an AI-native tech organization", "topic": "technology" },
  { "headline": "Deaths Rose in Emergency Rooms After Hospitals Were Acquired by Private Equity Firms", "topic": "healthcare" }
]

print(articles)

## Extract the headlines from the articles

In [None]:
headlines = [article["headline"] for article in articles]

print(headlines)

## Configure OpenAI and generate embeddings

In [None]:
# Set your API key before running (e.g. in the environment as OPENAI_API_KEY)
client = OpenAI()

response = client.embeddings.create(
  model="text-embedding-3-small",
  input=headlines
)

response_dict = response.model_dump()

# Create a copy for display purposes
response_dict_sample = response_dict.copy()
response_dict_sample["data"] = [response_dict["data"][0].copy()]

# Truncate the embedding for display purposes
emb = response_dict_sample["data"][0]["embedding"]
response_dict_sample["data"][0]["embedding"] = emb[:3] + ["..."] + emb[-3:]

print(json.dumps(response_dict_sample, indent=2))

## Map embeddings to headlines

In [None]:
for i, article in enumerate(articles):
    article["embedding"] = response_dict["data"][i]["embedding"]

print(json.dumps(articles, indent=2))

## Examine length of embeddings

In [None]:
print(len(articles[0]["embedding"]))
print(len(articles[1]["embedding"]))

**By default, `text-embedding-3-small` returns one embedding with 1536 dimensions for each input**
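
Because every embedding has the same length, any two headlines can be compared directly. The following is a minimal sketch of a similarity search over a list shaped like `articles`. The `toy_articles` list, its three-dimensional embeddings, and the helper `most_similar` are hypothetical stand-ins invented for illustration, not real model output.

```python
import math

# Toy stand-ins for `articles`; the embeddings are invented, not model output
toy_articles = [
    {"headline": "Federal Reserve Cuts Interest Rates Amid Softening Economy",
     "embedding": [0.9, 0.1, 0.1]},
    {"headline": "Stock Markets Surge to Record Highs Despite Volatility",
     "embedding": [0.8, 0.3, 0.0]},
    {"headline": "Agentic AI Dominates Technology Landscape",
     "embedding": [0.1, 0.9, 0.2]},
    {"headline": "New Alzheimer's Theory Points to Lithium's Role",
     "embedding": [0.0, 0.2, 0.9]},
]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_index, items, top_k=2):
    """Return (headline, score) pairs for the items closest to items[query_index]."""
    query = items[query_index]["embedding"]
    scored = [
        (item["headline"], cosine_similarity(query, item["embedding"]))
        for i, item in enumerate(items)
        if i != query_index  # skip the query itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# All embeddings must share one dimension before they can be compared
dimensions = {len(item["embedding"]) for item in toy_articles}
if len(dimensions) != 1:
    raise ValueError(f"inconsistent embedding dimensions: {dimensions}")

results = most_similar(0, toy_articles)
for headline, score in results:
    print(f"{score:.3f}  {headline}")
```

With real embeddings the same pattern applies unchanged, since the helper only relies on the `headline` and `embedding` keys.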

## Dimensionality Reduction using t-SNE


> t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets.
> 
> Laurens van der Maaten, Research scientist in artificial intelligence.
>
> [Source](https://lvdmaaten.github.io/tsne/)

### Implementing t-SNE with sklearn

In [None]:
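# Collect the embedding vectors in the same order as the articles
embeddings = [article["embedding"] for article in articles]

# random_state fixes the seed so the 2D layout is reproducible between runs
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
embeddings_2d = tsne.fit_transform(np.array(embeddings))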
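
A note on `perplexity`: scikit-learn requires it to be smaller than the number of samples, and its default of 30 would be rejected for the 10 headlines used here, which is why a small value is passed explicitly. The helper below is a hypothetical sketch (the name `choose_perplexity` is not part of scikit-learn) that caps a requested perplexity so it stays valid for small datasets.

```python
def choose_perplexity(n_samples, requested=30.0):
    """Return a perplexity strictly smaller than n_samples.

    `requested` defaults to 30.0, which matches scikit-learn's default.
    """
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    if requested <= 0:
        raise ValueError("perplexity must be positive")
    # Cap at n_samples - 1 so the value is always below the sample count
    return float(min(requested, n_samples - 1))

print(choose_perplexity(10))      # 10 headlines: the default is capped
print(choose_perplexity(10, 5))   # a small explicit value is kept as is
print(choose_perplexity(1000))    # large datasets keep the default
```

The returned value could then be passed as the `perplexity` argument of `TSNE`. Capping only guarantees a valid value, not a good one; the best setting depends on the data and is usually found by trying a few values.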

### Visualize the embeddings in 2D space

In [None]:
plt.scatter(embeddings_2d[:,0], embeddings_2d[:,1])

topics = [article["topic"] for article in articles]

for i, topic in enumerate(topics):
    plt.annotate(topic, (embeddings_2d[i,0], embeddings_2d[i,1]))

plt.show()

## References

- [Natural Language Processing with Classification and Vector Spaces](https://www.coursera.org/learn/classification-vector-spaces-in-nlp?specialization=natural-language-processing)
- [Vector Space Model - Wikipedia](https://en.wikipedia.org/wiki/Vector_space_model)
- [5.1. Vector Space Model](https://hannibunny.github.io/nlpbook/05representations/05representations.html)
- [t-SNE](https://lvdmaaten.github.io/tsne/)
- [scikit-learn TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)
- [scikit-learn Manifold Learning](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne)
- [scikit-learn Manifold API](https://scikit-learn.org/stable/api/sklearn.manifold.html)
- [datacamp - Introduction to t-SNE](https://www.datacamp.com/tutorial/introduction-t-sne)