# Text Embeddings with Free LLMs

In [45]:
from dotenv import load_dotenv
import os
load_dotenv("../.env")

True

Note:
* Make sure Ollama is running with the Llama 3 model <font color="Red"><b>before</b></font> executing the next cell
* Check file `1.Free LLMs.ipynb` for the commands to run Docker
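
A small check like the following can confirm that an Ollama server is reachable before the remaining cells run. This is a sketch, not part of the original notebook: the helper `is_ollama_up` is hypothetical, and the URL assumes Ollama's default address `http://localhost:11434`, so adjust it if your Docker setup maps a different port. It only tests that the server answers; it does not verify that the Llama 3 model has been pulled.

```python
import http.client
import urllib.request

# Assumption: Ollama listens on its default address and port.
OLLAMA_URL = "http://localhost:11434"


def is_ollama_up(base_url=OLLAMA_URL, timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


print("Ollama reachable:", is_ollama_up())
```

Returning a boolean instead of raising keeps the check easy to use in a guard such as `if not is_ollama_up(): ...` at the top of the notebook.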

In [3]:
import ollama
import numpy as np

# Request the embedding of a single word; the response stores
# the vector under the 'embedding' key.
response = ollama.embeddings(
    model="llama3", prompt="Orange")
print(response)

{'embedding': [0.5173478126525879, 0.5874137878417969, -0.03643041104078293, -0.14705179631710052, 1.0758187770843506, -0.3079666793346405, -0.13057821989059448, -0.08783403784036636, 0.3792644739151001, -0.39223387837409973, -0.37235352396965027, 0.8282191753387451, 0.6625686287879944, -0.784276008605957, -0.32100892066955566, 1.307268500328064, -0.1794016808271408, 0.2516718804836273, -0.16518950462341309, 0.2160254865884781, 0.20154938101768494, 0.561964213848114, -0.2281237542629242, 2.329874277114868, 0.7335413694381714, -0.5578284859657288, -0.4219829738140106, -0.17355434596538544, -0.17348161339759827, 0.05560017004609108, -1.1306207180023193, 1.1704390048980713, -0.3709803819656372, 0.8322083353996277, -0.19681581854820251, 1.6903834342956543, 0.4774298071861267, -0.7286704182624817, 0.12633585929870605, -0.3962174952030182, 2.0130324363708496, -0.33217519521713257, 0.22591547667980194, -0.3828010857105255, 0.41575387120246887, 1.1062238216400146, 1.5389716625213623, -0.042229

In [46]:
def ollama_embed(text):
    """Return the Llama 3 embedding of `text` as a list of floats."""
    return ollama.embeddings(model="llama3", prompt=text)['embedding']

In [5]:


orangeVector = ollama_embed("Orange")
print(f"Orange: {orangeVector}")
print(f"Orange dimensions: {len(orangeVector)}")
print(f"Orange norm: {np.linalg.norm(orangeVector)}")
print("-"*120)

appleVector = ollama_embed("Apple")
print(f"Apple: {appleVector}")
print(f"Apple dimensions: {len(appleVector)}")
print(f"Apple norm: {np.linalg.norm(appleVector)}")
print("-"*120)

HorseVector = ollama_embed("Horse")
print(f"Horse: {HorseVector}")
print(f"Horse dimensions: {len(HorseVector)}")
print(f"Horse norm: {np.linalg.norm(HorseVector)}")
print("-"*120)

Orange: [0.5173478126525879, 0.5874137878417969, -0.03643041104078293, -0.14705179631710052, 1.0758187770843506, -0.3079666793346405, -0.13057821989059448, -0.08783403784036636, 0.3792644739151001, -0.39223387837409973, -0.37235352396965027, 0.8282191753387451, 0.6625686287879944, -0.784276008605957, -0.32100892066955566, 1.307268500328064, -0.1794016808271408, 0.2516718804836273, -0.16518950462341309, 0.2160254865884781, 0.20154938101768494, 0.561964213848114, -0.2281237542629242, 2.329874277114868, 0.7335413694381714, -0.5578284859657288, -0.4219829738140106, -0.17355434596538544, -0.17348161339759827, 0.05560017004609108, -1.1306207180023193, 1.1704390048980713, -0.3709803819656372, 0.8322083353996277, -0.19681581854820251, 1.6903834342956543, 0.4774298071861267, -0.7286704182624817, 0.12633585929870605, -0.3962174952030182, 2.0130324363708496, -0.33217519521713257, 0.22591547667980194, -0.3828010857105255, 0.41575387120246887, 1.1062238216400146, 1.5389716625213623, -0.042229659855

## Understanding Vector Distances

The distances between word vectors in the embedding space reflect the semantic relationships between those words. Words that are semantically similar or related tend to have smaller vector distances (higher cosine similarity), while words that are dissimilar or unrelated tend to have larger vector distances (lower cosine similarity).

However, it's important to note that word embeddings have limitations. They may not capture certain nuances or context-specific meanings, and their performance can be influenced by the quality and diversity of the training data.

Individual dimensions have no fixed, human-readable meaning; what each one encodes depends on the model architecture and the training data.
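
To make the idea of "closer means more related" concrete, here is a minimal sketch that ranks words by Euclidean distance to a query word. The three-dimensional vectors and the helper `nearest_neighbors` are hypothetical and hand-picked for illustration; they are not output from any model, and real embeddings have far more dimensions.

```python
import math

# Hypothetical toy vectors, chosen by hand so that the two fruits sit
# close together and the animal sits far away.
toy_vectors = {
    "orange": [0.9, 0.8, 0.1],
    "apple": [0.8, 0.9, 0.2],
    "horse": [0.1, 0.2, 0.9],
}


def nearest_neighbors(query, vectors):
    """Return (word, distance) pairs sorted from closest to farthest."""
    origin = vectors[query]
    distances = [
        (word, math.dist(origin, vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(distances, key=lambda pair: pair[1])


neighbors = nearest_neighbors("orange", toy_vectors)
for word, dist in neighbors:
    print(f"orange -> {word}: {dist:.4f}")
```

With these toy values, "apple" ranks ahead of "horse". With real embeddings the ranking depends entirely on the model.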

In [6]:
from scipy.spatial import distance

euclidean_distance0 = distance.euclidean(orangeVector, orangeVector)
euclidean_distance1 = distance.euclidean(orangeVector, appleVector)
euclidean_distance2 = distance.euclidean(orangeVector, HorseVector)

print(f"Distance Orange-Orange: {euclidean_distance0}")
print(f"Distance Orange-Apple: {euclidean_distance1}")
print(f"Distance Orange-Horse: {euclidean_distance2}")

Distance Orange-Orange: 0.0
Distance Orange-Apple: 5.548707654535384
Distance Orange-Horse: 170.85386053071207
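
Euclidean distance depends on the length of the vectors as well as their direction. One common way to remove the effect of length is to scale each vector to unit length first; for unit vectors, the squared Euclidean distance equals 2 × (1 − cosine similarity). The sketch below checks that identity on two toy vectors. The vectors and the helpers `normalize` and `cosine_sim` are illustrative assumptions, not part of the original notebook.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.hypot(*v)
    return [x / length for x in v]


def cosine_sim(a, b):
    """Dot product divided by the product of the two norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy vectors with very different lengths (5 and 10).
a = [3.0, 4.0]
b = [10.0, 0.0]

raw_distance = math.dist(a, b)
unit_distance = math.dist(normalize(a), normalize(b))
cos = cosine_sim(a, b)

print(f"Raw Euclidean distance:    {raw_distance:.4f}")
print(f"Unit-length distance:      {unit_distance:.4f}")
print(f"Squared unit distance:     {unit_distance ** 2:.4f}")
print(f"2 * (1 - cosine):          {2 * (1 - cos):.4f}")
```

The last two printed values agree, which is why ranking by Euclidean distance on normalized vectors gives the same order as ranking by cosine similarity.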


## What is the Cosine Similarity?

Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them in a multi-dimensional space. It depends only on the direction of the vectors, not on their magnitude, and is widely used in information retrieval, natural language processing, and machine learning.

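### How to Calculate the Cosine Similarity?
The cosine similarity of two vectors is their dot product divided by the product of their norms: $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$. The result is a value between -1 and 1, where:  
0 means the vectors are orthogonal (perpendicular)  
1 means the vectors point in the same direction  
-1 means the vectors point in opposite directions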
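
The three reference values can be reproduced with small hand-made vectors. This is a minimal sketch using only the standard library; the helper name `cosine_sim` and the two-dimensional vectors are assumptions chosen for illustration.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


v = [1.0, 2.0]

same_direction = cosine_sim(v, [3.0, 6.0])   # v scaled by 3
orthogonal = cosine_sim(v, [-2.0, 1.0])      # dot product is 0
opposite = cosine_sim(v, [-1.0, -2.0])       # v scaled by -1

print(f"Same direction: {same_direction:.4f}")
print(f"Orthogonal:     {orthogonal:.4f}")
print(f"Opposite:       {opposite:.4f}")
```

Note that `[3.0, 6.0]` is not identical to `v`, yet the similarity is still 1 up to floating-point rounding, because only the direction matters.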

In [7]:
import numpy as np
from scipy.spatial import distance

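# SciPy's distance.cosine returns the cosine distance (1 - similarity),
# so subtracting it from 1 recovers the similarity.
cosine_similarity0 = 1 - distance.cosine(orangeVector, orangeVector)
cosine_similarity1 = 1 - distance.cosine(orangeVector, appleVector)
cosine_similarity2 = 1 - distance.cosine(orangeVector, HorseVector)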
print(f"Cosine Orange-Orange: {cosine_similarity0}")
print(f"Cosine Orange-Apple: {cosine_similarity1}")
print(f"Cosine Orange-Horse: {cosine_similarity2}")

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

print(f"Cosine Orange-Orange: (eq) {cosine_similarity(orangeVector,orangeVector)}")
print(f"Cosine Orange-Apple: (eq) {cosine_similarity(orangeVector,appleVector)}")
print(f"Cosine Orange-Horse: (eq) {cosine_similarity(orangeVector,HorseVector)}")

Cosine Orange-Orange: 1.0
Cosine Orange-Apple: 0.9979992671715949
Cosine Orange-Horse: 0.0846885125566299
Cosine Orange-Orange: (eq) 1.0
Cosine Orange-Apple: (eq) 0.9979992671715949
Cosine Orange-Horse: (eq) 0.08468851255662993


In [35]:
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt

translations = {
    "En_apple": "apple",
    "Pt_maçã": "maçã",
    "Es_manzana": "manzana",
    "De_apfel": "Apfel",
    "Nl_appel": "Appel",
    "En_horse": "Horse",
    "En_orange": "Orange",
    "En_Strawberry": "Strawberry",
    "De_Kartoffel":"Kartoffel",
    "De_Erdbeere":"Erdbeere",
    "En_BigApple":"Big Apple",
    "En_SmallApple":"Small Apple",
    "Es_Fresa":"Fresa",
    "Nl_Aardbei":"Aardbei "
}

vectors = {key: ollama_embed(value) for key, value in translations.items()}
vectors

{'En_apple': [0.5030889511108398,
  0.46349403262138367,
  -0.032169077545404434,
  -0.16091327369213104,
  1.072113275527954,
  -0.27176159620285034,
  0.16791942715644836,
  -0.3174496591091156,
  0.3805071711540222,
  -0.44044744968414307,
  -0.43799614906311035,
  0.6459420323371887,
  0.5815314650535583,
  -0.6167280673980713,
  -0.20129965245723724,
  1.3093242645263672,
  -0.14833074808120728,
  0.16652809083461761,
  -0.26765936613082886,
  0.3024403750896454,
  0.07268454134464264,
  0.6897044777870178,
  -0.5709834098815918,
  2.2812185287475586,
  0.5564745664596558,
  -0.5939659476280212,
  -0.5984370112419128,
  -0.10422328859567642,
  -0.09839150309562683,
  -0.046567246317863464,
  -1.356825590133667,
  1.0138493776321411,
  -0.3236565589904785,
  1.1882214546203613,
  -0.4622006118297577,
  1.5665675401687622,
  0.49252745509147644,
  -0.8177951574325562,
  -0.016053320840001106,
  -0.12353163957595825,
  1.8168344497680664,
  -0.324513703584671,
  0.11771279573440552,


In [36]:
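# Compare every vector against 'En_apple'; values are formatted as strings for display
euclidean_distances = [
    format(distance.euclidean(value, vectors['En_apple']), ".4f")
    for value in vectors.values()]
# Despite the name, this list holds cosine similarities (1 - cosine distance)
cosine_distances = [
    format(1 - distance.cosine(value, vectors['En_apple']), ".4f")
    for value in vectors.values()]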

languages = [f"En_apple->{name}" for name in translations.keys()]
print(languages)
print(euclidean_distances)
print(cosine_distances)

['En_apple->En_apple', 'En_apple->Pt_maçã', 'En_apple->Es_manzana', 'En_apple->De_apfel', 'En_apple->Nl_appel', 'En_apple->En_horse', 'En_apple->En_orange', 'En_apple->En_Strawberry', 'En_apple->De_Kartoffel', 'En_apple->De_Erdbeere', 'En_apple->En_BigApple', 'En_apple->En_SmallApple', 'En_apple->Es_Fresa', 'En_apple->Nl_Aardbei']
['0.0000', '161.0516', '161.0984', '165.2453', '166.6039', '171.2935', '8.6675', '166.9270', '163.1571', '167.0986', '173.1039', '174.9008', '162.6291', '173.9968']
['1.0000', '0.1377', '0.1344', '0.1167', '0.1070', '0.0765', '0.9951', '0.0940', '0.1311', '0.1137', '0.0493', '0.0266', '0.1340', '0.0152']


In [37]:
import pandas as pd
data = {
    "Language": languages,
    "Euclidean": euclidean_distances,
    "Cosine": cosine_distances
}
df = pd.DataFrame(data)
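# The values are stored as strings, so convert to float for a numeric sort
df = df.sort_values(by="Cosine", ascending=False, key=lambda col: col.astype(float))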
print(df.to_string(index=True))

                   Language Euclidean  Cosine
0        En_apple->En_apple    0.0000  1.0000
6       En_apple->En_orange    8.6675  0.9951
1         En_apple->Pt_maçã  161.0516  0.1377
2      En_apple->Es_manzana  161.0984  0.1344
12       En_apple->Es_Fresa  162.6291  0.1340
8    En_apple->De_Kartoffel  163.1571  0.1311
3        En_apple->De_apfel  165.2453  0.1167
9     En_apple->De_Erdbeere  167.0986  0.1137
4        En_apple->Nl_appel  166.6039  0.1070
7   En_apple->En_Strawberry  166.9270  0.0940
5        En_apple->En_horse  171.2935  0.0765
10    En_apple->En_BigApple  173.1039  0.0493
11  En_apple->En_SmallApple  174.9008  0.0266
13     En_apple->Nl_Aardbei  173.9968  0.0152
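
Because the distances are stored as formatted strings, any sort applied to them compares text unless the values are converted back to numbers. The Euclidean column shows why this matters. The sketch below uses three rows copied from the table above and plain Python sorting to contrast the two orders; the variable names are illustrative.

```python
# A few (label, Euclidean distance) pairs copied from the table above.
rows = [
    ("En_apple->Es_manzana", 161.0984),
    ("En_apple->En_orange", 8.6675),
    ("En_apple->En_horse", 171.2935),
]

# Sorting the formatted strings compares character by character,
# so "8.6675" lands after "161.0984" because "8" > "1".
as_text = sorted(rows, key=lambda row: format(row[1], ".4f"))

# Sorting the floats compares magnitudes.
as_numbers = sorted(rows, key=lambda row: row[1])

text_order = [label for label, _ in as_text]
number_order = [label for label, _ in as_numbers]

print("Text order:   ", text_order)
print("Numeric order:", number_order)
```

A simple way to avoid the problem is to keep the numeric values as floats in the DataFrame and apply the `".4f"` formatting only when printing.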


In [38]:
import pandas as pd
from rich.console import Console
from rich.table import Table

# Create a table using the DataFrame
table = Table(title="Tensor Distances")
table.add_column("Language", justify="center")
table.add_column("Euclidean", justify="left")
table.add_column("Cosine", justify="left")
for index, row in df.iterrows():
    table.add_row(row["Language"], row["Euclidean"], row["Cosine"])
    
console = Console()
console.print(table)

In [59]:
import numpy as np

# Embed each word separately, then the whole phrase
bigVector = ollama_embed("Big")
redVector = ollama_embed("Red")
appleVector = ollama_embed("Apple")
bigRedAppleVector = ollama_embed("Big Red Apple")

# Element-wise sum of the three word vectors
sentenceVectors = [bigVector, redVector, appleVector]
resultVector = np.sum(sentenceVectors, axis=0)

print(distance.euclidean(bigRedAppleVector, resultVector))

308.1563806641457


The sum of the embeddings for **"big"**, **"red"**, and **"apple"** does not equal the embedding of the phrase **"big red apple"**: the cell above reports a Euclidean distance of about 308 between the two. When individual embeddings are summed, each word contributes equally to the result, which ignores the meaning that arises from combining the words in a phrase.

Word embeddings are designed to capture semantic meaning, and the context in which words appear together can significantly alter that meaning. For example, "big" and "red" both modify "apple" in the phrase "big red apple," creating a specific image that is different from the sum of its parts. This is because language is inherently compositional, meaning that the combination of words can convey more specific or different meanings than the words in isolation.

Moreover, sentence embeddings often reflect word order and syntax, both of which are lost when word vectors are simply summed. Averaging word vectors is a common baseline that has the same limitation, while models like **Doc2Vec** or contextual embeddings from models like **BERT** can better capture the meaning of phrases and sentences as a whole.

In summary, summing word embeddings gives only a rough representation of a phrase, which is why the sum of the embeddings for "big", "red", and "apple" differs from the embedding for "big red apple".
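
Part of a large Euclidean distance can come from magnitude alone, because adding several vectors produces a longer vector. Averaging rescales the sum without changing its direction, so the cosine similarity to the phrase vector stays the same while the Euclidean distance shrinks. The sketch below illustrates this with hypothetical toy vectors; they stand in for the word and phrase embeddings and say nothing about the actual values the model returns.

```python
import math


def cosine_sim(a, b):
    """Dot product divided by the product of the two norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def add_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


# Hypothetical stand-ins for three word embeddings and one phrase embedding.
word_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [2.0, 2.0, 0.0]]
phrase_vector = [1.0, 1.0, 1.5]

summed = add_vectors(word_vectors)
averaged = [x / len(word_vectors) for x in summed]

print(f"Euclidean, sum vs phrase:  {math.dist(summed, phrase_vector):.4f}")
print(f"Euclidean, mean vs phrase: {math.dist(averaged, phrase_vector):.4f}")
print(f"Cosine, sum vs phrase:     {cosine_sim(summed, phrase_vector):.4f}")
print(f"Cosine, mean vs phrase:    {cosine_sim(averaged, phrase_vector):.4f}")
```

For this reason, cosine similarity is often a more informative comparison than raw Euclidean distance when one of the vectors is a sum.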

See [this article on Word2Vec word embeddings](https://www.baeldung.com/cs/word2vec-word-embeddings) for more information.