# Text Embedding with Free LLMs

In [9]:
from dotenv import load_dotenv
import os

load_dotenv("../.env")

True

Note:
* Make sure Ollama is running and that the `nomic-embed-text` model used below is available <font color="Red"><b>before</b></font> executing the next cell
* Check file `1.Free LLMs.ipynb` for the commands to run Docker
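A quick way to confirm the first point is to probe the server before calling it. The sketch below is only an illustration: the helper name `ollama_is_running` is hypothetical, and it assumes Ollama listens on its default local address, `http://localhost:11434`. It checks that something answers over HTTP; it does not verify that the embedding model has been pulled.

```python
import http.client
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_running(url=OLLAMA_URL, timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or bad reply.
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the container first and rerun the check before executing the embedding cells.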

In [7]:
import ollama
import numpy as np

embeddings_model = ollama.embeddings(
    model="nomic-embed-text", prompt="Orange")
print(embeddings_model)
print("Embedded vector has: " + str(len(embeddings_model['embedding'])) + " dimensions ")

{'embedding': [0.17640408873558044, 0.9992102980613708, -6.178772926330566, -0.4140119254589081, -0.07329343259334564, -0.24235905706882477, 0.045288100838661194, 0.6890352368354797, -0.41832903027534485, -0.6768399477005005, -0.5575522780418396, 1.8139139413833618, 1.274893879890442, -0.268075168132782, 0.9826412200927734, -0.7586398124694824, -0.6269446611404419, -1.6365771293640137, -1.4819190502166748, 0.033576518297195435, -0.9945516586303711, -0.8853966593742371, -0.12487180531024933, 0.43810203671455383, 2.5989491939544678, 0.7287818193435669, 0.36205074191093445, 0.5461231470108032, -0.8242924809455872, 1.2908804416656494, 0.3841179311275482, -0.7681425213813782, -0.4637954831123352, -0.41008612513542175, -0.4977230131626129, -1.449981451034546, 1.6476339101791382, 0.26164644956588745, 0.4080559313297272, -0.37149128317832947, -0.3461301922798157, 0.656103253364563, -0.7025183439254761, -0.8793357610702515, 1.0409475564956665, -0.5338886380195618, -0.1799723207950592, 0.2580103

In [11]:
def ollama_embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)['embedding']

In [12]:


orangeVector = ollama_embed("Orange")
print(f"Orange: {orangeVector}")
print(f"Orange dimensions: {len(orangeVector)}")
print(f'Orange norm: {np.linalg.norm(orangeVector)}')
print("-"*120)

appleVector= ollama_embed("Apple")
print(f"Apple: {appleVector}")
print(f"Apple dimensions: {len(appleVector)}")
print(f'Apple norm: {np.linalg.norm(appleVector)}')
print("-"*120)

HorseVector= ollama_embed("Horse")
print(f"Horse: {HorseVector}")
print(f"Horse dimensions: {len(HorseVector)}")
print(f'Horse norm: {np.linalg.norm(HorseVector)}')
print("-"*120)

Orange: [0.17640408873558044, 0.9992102980613708, -6.178772926330566, -0.4140119254589081, -0.07329343259334564, -0.24235905706882477, 0.045288100838661194, 0.6890352368354797, -0.41832903027534485, -0.6768399477005005, -0.5575522780418396, 1.8139139413833618, 1.274893879890442, -0.268075168132782, 0.9826412200927734, -0.7586398124694824, -0.6269446611404419, -1.6365771293640137, -1.4819190502166748, 0.033576518297195435, -0.9945516586303711, -0.8853966593742371, -0.12487180531024933, 0.43810203671455383, 2.5989491939544678, 0.7287818193435669, 0.36205074191093445, 0.5461231470108032, -0.8242924809455872, 1.2908804416656494, 0.3841179311275482, -0.7681425213813782, -0.4637954831123352, -0.41008612513542175, -0.4977230131626129, -1.449981451034546, 1.6476339101791382, 0.26164644956588745, 0.4080559313297272, -0.37149128317832947, -0.3461301922798157, 0.656103253364563, -0.7025183439254761, -0.8793357610702515, 1.0409475564956665, -0.5338886380195618, -0.1799723207950592, 0.2580103278160

## Understanding Vector Distances

The distances between word vectors in the embedding space reflect the semantic relationships between those words. Words that are semantically similar or related tend to have smaller vector distances (higher cosine similarity), while words that are dissimilar or unrelated tend to have larger vector distances (lower cosine similarity).

However, word embeddings have limitations. They may miss nuances or context-specific meanings, and their quality depends on the quality and diversity of the training data. In the results below, for instance, "Orange" ends up closer to "Horse" than to "Apple" by both Euclidean distance and cosine similarity.

Individual dimensions have no fixed, human-readable meaning: what each one encodes depends on the model architecture and the training data.

In [13]:
from scipy.spatial import distance

euclidean_distance0 = distance.euclidean(orangeVector, orangeVector)
euclidean_distance1 = distance.euclidean(orangeVector, appleVector)
euclidean_distance2 = distance.euclidean(orangeVector, HorseVector)

print(f"Distance Orange-Orange: {euclidean_distance0}")
print(f"Distance Orange-Apple: {euclidean_distance1}")
print(f"Distance Orange-Horse: {euclidean_distance2}")

Distance Orange-Orange: 0.0
Distance Orange-Apple: 18.097128659365776
Distance Orange-Horse: 13.758842458194108
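
The distances above are far larger than 2, the maximum Euclidean distance between two unit-length vectors, so these embeddings are not unit-normalized and their magnitudes contribute to the result. The following sketch uses made-up 2-D vectors (not model output) to show how magnitude alone can dominate Euclidean distance, and that after normalizing to unit length the squared distance equals `2 - 2 * cos`. The helper names are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


# Toy vectors, for illustration only.
short = [3.0, 4.0]       # norm 5
stretched = [30.0, 40.0]  # same direction as `short`, norm 50
rotated = [4.0, 3.0]     # different direction, norm 5

# Same direction, yet far apart because of magnitude alone.
print(euclidean(short, stretched))  # 45.0
# Different direction, yet much closer.
print(euclidean(short, rotated))

# After normalization only direction matters.
u, v = normalize(short), normalize(rotated)
cos_uv = sum(x * y for x, y in zip(u, v))  # dot product of unit vectors
print(euclidean(normalize(short), normalize(stretched)))
print(euclidean(u, v) ** 2, 2 - 2 * cos_uv)  # equal up to rounding
```

On unit-length vectors, ranking by Euclidean distance and ranking by cosine similarity therefore give the same order.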


## What is the Cosine Similarity?

Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them in a multi-dimensional space. Because it depends only on direction and not on magnitude, it is widely used in information retrieval, natural language processing, and machine learning.

### How to Calculate the Cosine Similarity?
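The cosine similarity is the dot product of the two vectors divided by the product of their norms. The result is a value between -1 and 1, where:  
1 means the vectors point in the same direction (they need not be identical)  
0 means the vectors are orthogonal (perpendicular)  
-1 means the vectors point in opposite directions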
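
The three reference values can be checked on small hand-picked vectors. This is a minimal sketch with toy 2-D inputs, written with the standard library only so it runs without Ollama; it follows the same dot-product-over-norms formula.

```python
import math


def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two norms."""
    dot_product = sum(x * y for x, y in zip(vec1, vec2))
    norm_vec1 = math.sqrt(sum(x * x for x in vec1))
    norm_vec2 = math.sqrt(sum(y * y for y in vec2))
    return dot_product / (norm_vec1 * norm_vec2)


# Same direction, different length.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
# Perpendicular.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
# Opposite direction.
print(cosine_similarity([1.0, 0.0], [-5.0, 0.0]))  # -1.0
```

The first case shows why a similarity of 1 does not imply identical vectors: `[1.0, 0.0]` and `[2.0, 0.0]` differ in length but point the same way.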

In [14]:
import numpy as np
from scipy.spatial import distance

cosine_similarity0 = 1 - distance.cosine(orangeVector, orangeVector)
cosine_similarity1 = 1 - distance.cosine(orangeVector, appleVector)
cosine_similarity2 = 1 - distance.cosine(orangeVector, HorseVector)
print(f"Cosine Orange-Orange: {cosine_similarity0}")
print(f"Cosine Orange-Apple: {cosine_similarity1}")
print(f"Cosine Orange-Horse: {cosine_similarity2}")

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

print(f"Cosine Orange-Orange: (eq) {cosine_similarity(orangeVector,orangeVector)}")
print(f"Cosine Orange-Apple: (eq) {cosine_similarity(orangeVector,appleVector)}")
print(f"Cosine Orange-Horse: (eq) {cosine_similarity(orangeVector,HorseVector)}")

Cosine Orange-Orange: 1.0
Cosine Orange-Apple: 0.7234816593852156
Cosine Orange-Horse: 0.8391578073354502
Cosine Orange-Orange: (eq) 1.0
Cosine Orange-Apple: (eq) 0.7234816593852156
Cosine Orange-Horse: (eq) 0.8391578073354502


In [15]:
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt

translations = {
    "En_apple": "apple",
    "Pt_maçã": "maçã",
    "Es_manzana": "manzana",
    "De_apfel": "Apfel",
    "Nl_appel": "Appel",
    "En_horse": "Horse",
    "En_orange": "Orange",
    "En_Strawberry": "Strawberry",
    "De_Kartoffel":"Kartoffel",
    "De_Erdbeere":"Erdbeere",
    "En_BigApple":"Big Apple",
    "En_SmallApple":"Small Apple",
    "Es_Fresa":"Fresa",
    "Nl_Aardbei":"Aardbei "
}

vectors = {key: ollama_embed(value) for key, value in translations.items()}
vectors

{'En_apple': [0.019055936485528946,
  1.4032655954360962,
  -3.29129958152771,
  -0.5449592471122742,
  -0.5651657581329346,
  0.2633413076400757,
  -0.1758601814508438,
  1.5990731716156006,
  -0.6624545454978943,
  -0.8208480477333069,
  -1.6256874799728394,
  1.2577292919158936,
  0.9282630681991577,
  0.11898184567689896,
  1.2292563915252686,
  -2.2586472034454346,
  -0.3463354706764221,
  -0.7408260107040405,
  -1.0338680744171143,
  0.13187889754772186,
  0.9622464179992676,
  -0.915485143661499,
  -0.10011395812034607,
  0.3821070194244385,
  -0.33036014437675476,
  -0.24738366901874542,
  -0.6890381574630737,
  0.4665466547012329,
  -0.6985698342323303,
  0.31945493817329407,
  0.34680691361427307,
  0.08562454581260681,
  -0.4882517457008362,
  0.21334049105644226,
  -0.9029620885848999,
  -0.821465015411377,
  2.2532052993774414,
  0.8520066738128662,
  0.3595903515815735,
  0.10155634582042694,
  -0.5431708097457886,
  1.2137629985809326,
  -1.870334506034851,
  -1.55230116

In [16]:
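# Compare every vector against the English "apple" vector.
# Values are formatted as 4-decimal strings for display.
euclidean_distances = [
    format(distance.euclidean(value, vectors['En_apple']), ".4f")
    for value in vectors.values()]
# Despite the name, this list holds cosine similarities (1 - cosine distance).
cosine_distances = [
    format(1 - distance.cosine(value, vectors['En_apple']), ".4f")
    for value in vectors.values()]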

languages = [f"En_apple->{name}" for name in translations.keys()]
print(languages)
print(euclidean_distances)
print(cosine_distances)

['En_apple->En_apple', 'En_apple->Pt_maçã', 'En_apple->Es_manzana', 'En_apple->De_apfel', 'En_apple->Nl_appel', 'En_apple->En_horse', 'En_apple->En_orange', 'En_apple->En_Strawberry', 'En_apple->De_Kartoffel', 'En_apple->De_Erdbeere', 'En_apple->En_BigApple', 'En_apple->En_SmallApple', 'En_apple->Es_Fresa', 'En_apple->Nl_Aardbei']
['0.0000', '21.1384', '18.6926', '24.0714', '20.4961', '17.5112', '18.0971', '18.3202', '18.8642', '20.5087', '16.4677', '25.8372', '21.0962', '17.3009']
['1.0000', '0.6240', '0.7085', '0.5157', '0.6467', '0.7419', '0.7235', '0.7156', '0.7006', '0.6457', '0.7729', '0.4260', '0.6294', '0.7508']


In [17]:
import pandas as pd
data = {
    "Language": languages,
    "Euclidean": euclidean_distances,
    "Cosine": cosine_distances
}
df = pd.DataFrame(data)
df = df.sort_values(by="Cosine", ascending=False)
print(df.to_string(index=True))

                   Language Euclidean  Cosine
0        En_apple->En_apple    0.0000  1.0000
10    En_apple->En_BigApple   16.4677  0.7729
13     En_apple->Nl_Aardbei   17.3009  0.7508
5        En_apple->En_horse   17.5112  0.7419
6       En_apple->En_orange   18.0971  0.7235
7   En_apple->En_Strawberry   18.3202  0.7156
2      En_apple->Es_manzana   18.6926  0.7085
8    En_apple->De_Kartoffel   18.8642  0.7006
4        En_apple->Nl_appel   20.4961  0.6467
9     En_apple->De_Erdbeere   20.5087  0.6457
12       En_apple->Es_Fresa   21.0962  0.6294
1         En_apple->Pt_maçã   21.1384  0.6240
3        En_apple->De_apfel   24.0714  0.5157
11  En_apple->En_SmallApple   25.8372  0.4260
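
One caveat about the sort above: the `Euclidean` and `Cosine` columns hold strings produced by `format(..., ".4f")`, so they are compared as text. The order is correct here because every cosine value is non-negative, below 10, and has the same number of decimals. The plain-Python sketch below shows where text ordering would go wrong; `"9.5000"` is a hypothetical value added only to expose the problem.

```python
# Formatted strings, as produced by format(value, ".4f").
# "9.5000" is hypothetical; the other values come from the table above.
as_text = ["16.4677", "25.8372", "9.5000", "0.0000"]

# Text comparison goes character by character, so "9..." sorts after "2...".
print(sorted(as_text))             # ['0.0000', '16.4677', '25.8372', '9.5000']

# Converting to float for the comparison restores numeric order.
print(sorted(as_text, key=float))  # ['0.0000', '9.5000', '16.4677', '25.8372']
```

In pandas, one option would be to keep the columns numeric (for example with `astype(float)`) while sorting and to format them as strings only for display.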


In [18]:
import pandas as pd
from rich.console import Console
from rich.table import Table

# Create a table using the DataFrame
table = Table(title="Vector Distances")
table.add_column("Language", justify="center")
table.add_column("Euclidean", justify="left")
table.add_column("Cosine", justify="left")
for index, row in df.iterrows():
    table.add_row(row["Language"], row["Euclidean"], row["Cosine"])
    
console=Console()
console.print(table)

In [20]:
import numpy as np

bigVector = ollama_embed("Big")
redVector = ollama_embed("Red")
appleVector = ollama_embed("Apple")
bigRedAppleVector = ollama_embed("Big Red Apple")

sentenceVectors = [bigVector, redVector, appleVector]
resultVector = np.sum(sentenceVectors, axis=0)

print(distance.euclidean(bigRedAppleVector, resultVector))

59.68678479118035


The sum of the embeddings for **"big," "red,"** and **"apple"** generally does not equal the embedding of the phrase **"big red apple"**. When individual embeddings are summed, each word contributes equally and independently, so the result cannot reflect the meaning that arises from combining the words. The sum of several vectors also typically has a larger norm than a single embedding, which by itself inflates the Euclidean distance measured above.

Embeddings are designed to capture semantic meaning, and the context in which words appear together can significantly alter that meaning. For example, "big" and "red" both modify "apple" in the phrase "big red apple," creating a specific image that differs from the sum of its parts. Language is compositional: the meaning of a phrase depends on how its words combine, not only on the words taken in isolation.

Moreover, sentence embeddings often take word order and syntax into account, and both are lost when word vectors are simply summed. Averaging the word vectors corrects for the number of words but still ignores their order. Models such as **Doc2Vec**, or contextual embeddings from models like **BERT**, are better suited to capturing the meaning of phrases and sentences as a whole.

In summary, while summing word embeddings can provide a rough representation, it lacks the sophistication to fully capture the meaning of phrases and sentences, which is why the sum of word embeddings for "big," "red," and "apple" is not the same as the embedding for "big red apple."
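
Part of the large Euclidean distance comes purely from scale, which the following sketch tries to isolate. It uses made-up 3-D vectors as stand-ins for the word and phrase embeddings (they are not model output, and the helper names are illustrative), and compares the phrase vector with both the sum and the mean of the word vectors.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


# Toy stand-ins for embeddings, chosen by hand for illustration.
big = [1.0, 2.0, 2.0]
red = [2.0, 1.0, 2.0]
apple = [2.0, 2.0, 1.0]
phrase = [3.0, 1.0, 2.0]

summed = [sum(parts) for parts in zip(big, red, apple)]  # element-wise sum
mean = [x / 3 for x in summed]                           # element-wise mean

# The sum is much longer than the phrase vector.
print(norm(phrase), norm(summed))
# Euclidean distance shrinks when the sum is replaced by the mean...
print(euclidean(phrase, summed), euclidean(phrase, mean))
# ...while cosine similarity is the same for both, since it ignores scale.
print(cosine_similarity(phrase, summed), cosine_similarity(phrase, mean))
```

Under these assumptions, cosine similarity is the fairer way to compare a composed vector with a phrase embedding. Even so, neither the sum nor the mean points in exactly the same direction as the phrase vector, which is the compositionality gap described above.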

For more information, see [this article on Word2Vec word embeddings](https://www.baeldung.com/cs/word2vec-word-embeddings).