# Introducing Text Embedding

Text embedding is a technique in natural language processing (NLP) that represents text data, such as words, phrases, or documents, as numerical vectors. These vectors capture the semantic and syntactic relationships between words, enabling machines to understand and process text data more effectively.  

The main idea behind text embedding is to map words or pieces of text to vectors in a continuous N-dimensional vector space. Words with similar meanings or contexts are mapped to vectors that are close together in this space, while dissimilar words are mapped to vectors that are far apart.

Because an embedding is a vector, we can reason about text geometrically. For example, semantic search looks for the pieces of text whose vectors lie closest to the query's vector.
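
As a minimal sketch of semantic search, the example below ranks a tiny corpus by cosine similarity to a query vector. The 3-dimensional vectors are hand-written stand-ins rather than real model output, and the names `corpus` and `semantic_search` are hypothetical helpers chosen for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", hand-written for illustration only.
# Real embeddings come from a model and have many more dimensions.
corpus = {
    "apple pie recipe": np.array([0.9, 0.1, 0.0]),
    "orange juice": np.array([0.7, 0.3, 0.1]),
    "horse riding lessons": np.array([0.0, 0.2, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_search(query_vector, corpus, top_k=2):
    """Return the top_k corpus texts ranked by cosine similarity to the query."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in corpus.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Pretend this is the embedding of a query such as "fruit dessert".
query = np.array([0.8, 0.2, 0.0])
results = semantic_search(query, corpus)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

With real embeddings the ranking logic stays the same; only the source of the vectors changes.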

Text embeddings have become a fundamental component in many NLP tasks, such as sentiment analysis, text classification, machine translation, and language modeling. They provide a way to represent textual data in a format that can be easily processed by machine learning algorithms.

The `text-embedding-3-small` model produces vectors with 1536 dimensions by default. These dimensions can be interpreted in the following ways:

1. Semantic features: Each dimension can be thought of as a learned feature of the input text. A single dimension rarely corresponds to one human-readable property, but taken together the dimensions can encode aspects of the text such as:
* Word choice and frequency
* Parts of speech
* Named entities
* Sentiment
* Syntax and grammar
* Contextual information
  
2. Latent semantic space: The 1536 dimensions can be viewed as a latent semantic space in which similar texts are mapped to nearby points. This allows for efficient similarity searches and clustering of texts.

3. Learned transformations: The dimensions can be seen as the output of learned transformations of the input text, where each dimension is computed from weighted combinations of features inside the model. This allows the model to capture complex relationships between the input text and the output vector.

4. High-dimensional space: Although 1536 dimensions is a high-dimensional space, it is far smaller than the effectively unbounded space of possible input texts. The embedding projects text from that space onto a fixed-size vector, which allows for efficient computation and storage.

5. Model's internal representation: The dimensions can be viewed as the internal representation of the model, where each dimension is a learned feature or a combination of features that the model uses to represent the input text.
   
When working with the Text Embedding model, you can use the 1536-dimensional vector representation as input to various downstream tasks, such as:
1. Text classification
2. Sentiment analysis
3. Information retrieval
4. Clustering
5. Dimensionality reduction
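
The last item in the list, dimensionality reduction, can be sketched with plain NumPy. The example below is an illustration under stated assumptions: it uses random unit vectors as stand-ins for real embeddings, and it applies PCA via singular value decomposition, which is one common choice among several. The name `sample_vectors` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in data: 10 random unit vectors with 1536 dimensions.
# In practice these rows would be embeddings returned by the model.
sample_vectors = rng.normal(size=(10, 1536))
sample_vectors /= np.linalg.norm(sample_vectors, axis=1, keepdims=True)

# PCA via SVD: centre the data, then project it onto the
# two right singular vectors with the largest singular values.
centered = sample_vectors - sample_vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ components[:2].T

print(points_2d.shape)
```

The resulting 2-D points can be passed to a plotting library to inspect how texts group together.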

In [1]:
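from dotenv import load_dotenv
import os

# Load variables from ../.env if the file exists.
# Reading the key unconditionally avoids a NameError when the file is missing
# and the key is supplied through the shell environment instead.
load_dotenv("../.env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")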

In [2]:
from langchain_openai.embeddings import OpenAIEmbeddings
import numpy as np

embeddings_model = OpenAIEmbeddings(
    api_key=OPENAI_API_KEY, model="text-embedding-3-small"
)
# other models https://platform.openai.com/docs/guides/embeddings

orangeVector = embeddings_model.embed_query("Orange")
print(f"Orange: {orangeVector}")
print(f"Orange dimensions: {len(orangeVector)}")
print(f"Orange norm: {np.linalg.norm(orangeVector)}")
print("-" * 120)

appleVector = embeddings_model.embed_query("Apple")
print(f"Apple: {appleVector}")
print(f"Apple dimensions: {len(appleVector)}")
print(f"Apple norm: {np.linalg.norm(appleVector)}")
print("-" * 120)

HorseVector = embeddings_model.embed_query("Horse")
print(f"Horse: {HorseVector}")
print(f"Horse dimensions: {len(HorseVector)}")
print(f"Horse norm: {np.linalg.norm(HorseVector)}")
print("-" * 120)

# Note: the norms printed above are ~1.0, i.e. OpenAI returns embeddings normalized to unit length

Orange: [-0.0131614652993618, 0.005170333476176813, -0.024003909324024283, 0.07664977424676198, 0.02839784476295883, -0.03007947628402706, 0.05847731918434064, -0.0011849387717972088, 0.04507852959718903, -0.020450788714301227, 0.006584123129695415, 0.028913182824003122, -0.018402996527536866, -0.03517861107688299, 0.05690418488638214, 0.028207983470081736, 0.026906076397412944, 0.00864547630519521, -0.05348667788930394, 0.04732974127159805, 0.015555076203057823, 0.028750445371242148, 0.010408474689998677, 0.014700699453788272, -0.014131115575156988, 0.009438825345526114, 0.003685346070659957, -0.009845671771396424, 0.04708563416113396, -0.03488025628618667, 0.0033293559136847485, -0.043641007049230136, 0.002902167539049973, -0.03005235244391094, -0.019609973885089737, -0.016653559131468834, 0.025346500294222005, 0.014239607210330971, 0.00945916729429058, 0.02131872552098359, -0.009554098872051752, -0.023258026072573965, 0.020274489341482193, 0.017819850728847522, 0.05373078499976802, 

## Understanding Vector Distances

The distances between word vectors in the embedding space reflect the semantic relationships between those words. Words that are semantically similar or related tend to have smaller vector distances (higher cosine similarity), while words that are dissimilar or unrelated tend to have larger vector distances (lower cosine similarity).

However, it's important to note that word embeddings have limitations. They may not capture certain nuances or context-specific meanings, and their performance can be influenced by the quality and diversity of the training data.

The exact meaning of the dimensions is specific to the model architecture and the training data. 

In [3]:
from scipy.spatial import distance

euclidean_distance0 = distance.euclidean(orangeVector, orangeVector)
euclidean_distance1 = distance.euclidean(orangeVector, appleVector)
euclidean_distance2 = distance.euclidean(orangeVector, HorseVector)

print(f"Distance Orange-Orange: {euclidean_distance0}")
print(f"Distance Orange-Apple: {euclidean_distance1}")
print(f"Distance Orange-Horse: {euclidean_distance2}")

Distance Orange-Orange: 0.0
Distance Orange-Apple: 1.0532548062900984
Distance Orange-Horse: 1.2334881181237594


## What is the Cosine Similarity?

Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them in a multi-dimensional space. It depends only on the direction of the vectors, not on their magnitude, and is widely used in information retrieval, natural language processing, and machine learning.

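### How to Calculate the Cosine Similarity?
The cosine similarity is the dot product of the two vectors divided by the product of their norms. The result is a value between -1 and 1, where:  
1 means the vectors point in the same direction  
0 means the vectors are orthogonal (perpendicular)  
-1 means the vectors point in opposite directions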
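
For unit-length vectors, Euclidean distance and cosine similarity carry the same information, because the squared distance equals 2 minus twice the cosine similarity. The sketch below checks this identity on two random stand-in vectors, not real embeddings; the names `a`, `b`, `cos_sim`, and `euclid` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Two random stand-in vectors, scaled to unit length.
a = rng.normal(size=1536)
b = rng.normal(size=1536)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors the dot product is already the cosine similarity.
cos_sim = float(np.dot(a, b))
euclid = float(np.linalg.norm(a - b))

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(euclid**2, 2 - 2 * cos_sim))
```

The Orange-Apple figures reported in this notebook agree with the identity: 1.0533 squared is about 1.109, and 2 - 2 × 0.4453 is also about 1.109. This is why both measures rank neighbours in the same order here.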

In [4]:
import numpy as np
from scipy.spatial import distance

cosine_similarity0 = 1 - distance.cosine(orangeVector, orangeVector)
cosine_similarity1 = 1 - distance.cosine(orangeVector, appleVector)
cosine_similarity2 = 1 - distance.cosine(orangeVector, HorseVector)
print(f"Cosine Orange-Orange: {cosine_similarity0}")
print(f"Cosine Orange-Apple: {cosine_similarity1}")
print(f"Cosine Orange-Horse: {cosine_similarity2}")


def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity


print(
    f"Cosine Orange-Orange: (eq) {cosine_similarity(orangeVector,orangeVector)}")
print(
    f"Cosine Orange-Apple: (eq) {cosine_similarity(orangeVector,appleVector)}")
print(
    f"Cosine Orange-Horse: (eq) {cosine_similarity(orangeVector,HorseVector)}")

Cosine Orange-Orange: 1.0
Cosine Orange-Apple: 0.4453271565134036
Cosine Orange-Horse: 0.23925353122375426
Cosine Orange-Orange: (eq) 1.0
Cosine Orange-Apple: (eq) 0.4453271565134036
Cosine Orange-Horse: (eq) 0.23925353122375423


In [5]:
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt

translations = {
    "En_apple": "apple",
    "Pt_maçã": "maçã",
    "Es_manzana": "manzana",
    "De_apfel": "Apfel",
    "Nl_appel": "Appel",
    "En_horse": "Horse",
    "En_orange": "Orange",
}

vectors = {
    key: embeddings_model.embed_query(value) for key, value in translations.items()
}

In [6]:
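# Compare every vector against the English "apple" vector
euclidean_distances = [
    format(distance.euclidean(value, vectors["En_apple"]), ".4f")
    for value in vectors.values()
]
# Note: despite the variable name, these are cosine similarities (1 - cosine distance)
cosine_distances = [
    format(1 - distance.cosine(value, vectors["En_apple"]), ".4f")
    for value in vectors.values()
]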

languages = [f"En_apple->{name}" for name in translations.keys()]
print(languages)
print(euclidean_distances)
print(cosine_distances)

['En_apple->En_apple', 'En_apple->Pt_maçã', 'En_apple->Es_manzana', 'En_apple->De_apfel', 'En_apple->Nl_appel', 'En_apple->En_horse', 'En_apple->En_orange']
['0.0000', '0.9599', '1.2143', '0.8805', '1.0987', '1.2493', '1.0768']
['1.0000', '0.5393', '0.2628', '0.6124', '0.3964', '0.2197', '0.4203']


In [7]:
import pandas as pd

data = {
    "Language": languages,
    "Euclidean": euclidean_distances,
    "Cosine": cosine_distances,
}
df = pd.DataFrame(data)
df = df.sort_values(by="Euclidean", ascending=True)
print(df.to_string(index=False))

            Language Euclidean Cosine
  En_apple->En_apple    0.0000 1.0000
  En_apple->De_apfel    0.8805 0.6124
   En_apple->Pt_maçã    0.9599 0.5393
 En_apple->En_orange    1.0768 0.4203
  En_apple->Nl_appel    1.0987 0.3964
En_apple->Es_manzana    1.2143 0.2628
  En_apple->En_horse    1.2493 0.2197
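
The table above compares every word against a single reference vector. To compare all pairs at once, one option is a similarity matrix, sketched below with hand-written toy vectors (hypothetical stand-ins, not model output). Normalizing the rows first means a single matrix product yields every pairwise cosine similarity.

```python
import numpy as np

# Hypothetical toy vectors standing in for real embeddings.
labels = ["apple", "orange", "horse"]
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])

# Normalize each row to unit length; dot products then equal cosine similarities.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
similarity = unit @ unit.T

for label, row in zip(labels, similarity):
    print(label.ljust(8), " ".join(f"{value:.3f}" for value in row))
```

The diagonal is 1.0 because each vector is compared with itself, and the matrix is symmetric.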


In [8]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

gpt = ChatOpenAI(temperature=0, model="gpt-3.5-turbo", api_key=OPENAI_API_KEY)

# chat with OpenAI


def chat(query):
    system = """
    You are an assistant that replies only in a clear, direct and concise way without any need for introductions.
    When you do not know the answer, do not reply anything.
    """
    messages = [
        SystemMessage(content=system),
        HumanMessage(content=query)
    ]
    return gpt.invoke(messages)


# Test the function with the word "Apple"
most_dissimilar_word = chat(
    "What is similarly the opposite of the fruit Apple ?")
print(most_dissimilar_word.content)

The opposite of the fruit Apple is Orange.
