## Using Word Embeddings

In the previous sections, we built a Word2Vec model from scratch using NumPy. Now let's see how word embeddings can be used for NLP tasks, focusing on sentiment analysis. Rather than the vectors we trained ourselves, we load the pretrained Google News Word2Vec vectors through gensim.


In [21]:
%pip install gensim

import gensim.downloader as api

# Download the pretrained Word2Vec vectors (GoogleNews-vectors-negative300)
word2vec_model = api.load('word2vec-google-news-300')


Note: you may need to restart the kernel to use updated packages.


In [27]:
(word2vec_model["really"] + word2vec_model["i"]) / 2

array([-6.46972656e-02, -2.41088867e-02, -8.78906250e-03,  1.91406250e-01,
       -6.05468750e-02,  6.73828125e-02, -1.09863281e-02, -9.81445312e-02,
        3.22570801e-02,  2.62451172e-02, -5.65185547e-02, -2.45117188e-01,
       -1.46362305e-01, -2.02148438e-01, -1.06689453e-01,  6.86035156e-02,
        6.16455078e-02,  1.58935547e-01,  7.89794922e-02, -5.77392578e-02,
       -1.19506836e-01,  1.08154297e-01,  2.25585938e-01,  6.68334961e-02,
        3.12500000e-02,  7.65380859e-02, -2.17895508e-01, -1.10351562e-01,
       -9.94873047e-03, -1.67236328e-02, -1.26953125e-02,  1.56250000e-01,
        6.70166016e-02,  3.39355469e-02,  5.46875000e-02,  6.81762695e-02,
       -1.03027344e-01,  9.21020508e-02, -2.90527344e-02,  1.15600586e-01,
        9.19494629e-02, -7.56835938e-02,  2.07031250e-01,  5.43823242e-02,
        6.46362305e-02,  7.95898438e-02,  1.25732422e-01, -1.62109375e-01,
        4.88281250e-03, -4.35791016e-02,  6.48803711e-02,  1.57470703e-01,
        4.19921875e-02,  

In [23]:
import pandas as pd

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip")
df.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | I really liked this Summerslam due to the look... | positive |
| 1 | Not many television shows appeal to quite as m... | positive |
| 2 | The film quickly gets to a major chase scene w... | negative |
| 3 | Jane Austen would definitely approve of this o... | positive |
| 4 | Expectations were somewhat high for me when I ... | negative |


In [28]:
import numpy as np

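def document_embedding(text, model):
    """Average the vectors of all in-vocabulary words in `text`."""
    # Whitespace tokenization: case and attached punctuation are kept as-is,
    # so a token like "movie." only matches if that exact string is in the vocabulary
    words = [w for w in text.split() if w in model]
    if words:
        return np.mean([model[w] for w in words], axis=0)
    else:
        # No known words: fall back to a zero vector of the right size
        return np.zeros(model.vector_size)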

df['avg_word'] = df['review'].apply(lambda x: document_embedding(x, word2vec_model))
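
Because `document_embedding` splits on whitespace only, tokens with attached punctuation can silently drop out of the average. The sketch below illustrates this on a toy vocabulary. `ToyVectors`, `tokenize`, and `document_embedding_tokens` are hypothetical names introduced for illustration, and the regular expression is just one simple assumption about what counts as a token, not the tokenization used to train the pretrained vectors.

```python
import re
import numpy as np

class ToyVectors(dict):
    """Hypothetical stand-in for the pretrained model: a dict plus vector_size."""
    vector_size = 3

toy_model = ToyVectors({
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
})

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def document_embedding_tokens(tokens, model):
    # Same mean pooling as document_embedding, but on a ready-made token list.
    known = [model[t] for t in tokens if t in model]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(model.vector_size)

# "movie." (with the period) is not in the toy vocabulary, so it is skipped.
naive = document_embedding_tokens("good movie.".split(), toy_model)
# After stripping punctuation both words contribute to the mean.
cleaned = document_embedding_tokens(tokenize("good movie."), toy_model)
print(naive, cleaned)
```

Whether to also lowercase the text is a separate design choice: it depends on whether the vocabulary of the vectors you load is case-sensitive.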

In [5]:
df.head()

| Unnamed: 0 | review | sentiment | avg_word |
|---|---|---|---|
| 0 | I really liked this Summerslam due to the look... | positive | [0.026841605, 0.044992644, 0.017120501, 0.0777... |
| 1 | Not many television shows appeal to quite as m... | positive | [0.044581123, 0.025324179, 0.020859266, 0.0939... |
| 2 | The film quickly gets to a major chase scene w... | negative | [0.04855869, 0.031639244, 0.0031714232, 0.0996... |
| 3 | Jane Austen would definitely approve of this o... | positive | [0.03325866, 0.024321612, -0.0009411083, 0.060... |
| 4 | Expectations were somewhat high for me when I ... | negative | [0.056710232, 0.029389769, 0.031232158, 0.0919... |


In [29]:
from sklearn.model_selection import train_test_split

# Convert avg_word column to numpy array
X = np.vstack(df['avg_word'].values)
y = df['sentiment'].map({'positive': 1, 'negative': 0}).values

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
X.shape

(5000, 300)

In [31]:
X[:5]

array([[ 0.02684161,  0.04499264,  0.0171205 , ..., -0.06149487,
         0.02433256, -0.02856859],
       [ 0.04458112,  0.02532418,  0.02085927, ..., -0.01347773,
         0.03515555, -0.0291379 ],
       [ 0.04855869,  0.03163924,  0.00317142, ..., -0.05177125,
         0.04260951, -0.02817884],
       [ 0.03325866,  0.02432161, -0.00094111, ..., -0.06046228,
         0.06452684, -0.00596103],
       [ 0.05671023,  0.02938977,  0.03123216, ..., -0.03327884,
         0.05188526, -0.03141383]], shape=(5, 300), dtype=float32)

In [9]:
y_train[:5]

array([0, 1, 1, 0, 1])
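
With a feature matrix and binary labels in hand, one natural next step is to fit a simple classifier. The following is a minimal sketch, not the notebook's own method: it assumes logistic regression from scikit-learn as the classifier and runs on synthetic data (`X_demo`, `y_demo`) shaped like the averaged embeddings, so the accuracy it prints says nothing about the IMDB reviews. To try it on the real data, substitute `X_train`, `X_test`, `y_train`, and `y_test` from the split above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (n_documents, 300) matrix of averaged vectors.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 300)).astype(np.float32)
X_demo[y_demo == 1] += 0.5  # shift the "positive" class so the classes differ

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# max_iter is raised above the default to give the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```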

## Getting Document Vectors

![Large Language Models](https://bea.stollnitz.com/images/gpt-transformer/3-transformer.png)

In [None]:
from openai import OpenAI

# "API_KEY_HERE" is a placeholder; replace it with your own key and avoid committing a real key
client = OpenAI(api_key="API_KEY_HERE")

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Get embedding for a text using OpenAI's API (v1.0+ compatible).
    
    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "text-embedding-3-small")
    
    Returns:
    list: The embedding vector
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# Embed a single short example string.
# Each call goes to the OpenAI API and is billed, so embedding every review takes time and costs money.

doc_embedding = get_openai_embedding("Hello, world!")
doc_embedding[:5]  # Show first 5 dimensions of the embedding

[-0.019143931567668915,
 -0.025292053818702698,
 -0.0017211713129654527,
 0.01883450709283352,
 -0.03382139280438423]

In [36]:
len(doc_embedding)

1536
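
To embed many reviews rather than one string, the `input` argument of `client.embeddings.create` also accepts a list of strings, which avoids one request per review. The sketch below is one possible way to batch the calls under stated assumptions: `batched` and `embed_texts` are hypothetical helpers, and `batch_size=100` is an arbitrary choice, not a documented limit. The client is passed in as an argument so that nothing is sent until you call the function yourself.

```python
def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_texts(client, texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding per text, in the same order as `texts`."""
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input within the batch;
        # sorting by it keeps the vectors aligned with the texts.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

A call such as `embed_texts(client, df["review"].tolist())` would then return a list of vectors that can be stacked with `np.vstack`, the same way as the averaged Word2Vec vectors. Every batch is a billed API request.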

In [39]:
import requests

def get_ollama_embedding(text, model="llama3.2:latest"):
    """
    Get embedding for a text using Ollama's local API.

    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "llama3.2:latest")

    Returns:
    list: The embedding vector
    """
    url = "http://localhost:11434/api/embeddings"
    payload = {
        "model": model,
        "prompt": text
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()["embedding"]

# Example usage:
doc_embedding = get_ollama_embedding("""
I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Dibiase, Steiner Brothers vs Heavenly Bodies, Shawn Michaels vs Curt Hening, this was the event where Shawn named his big monster of a body guard Diesel, IRS vs 1-2-3 Kid, Bret Hart first takes on Doink then takes on Jerry Lawler and stuff with the Harts and Lawler was always very interesting, then Ludvig Borga destroyed Marty Jannetty, Undertaker took on Giant Gonzalez in another terrible match, The Smoking Gunns and Tatanka took on Bam Bam Bigelow and the Headshrinkers, and Yokozuna defended the world title against Lex Luger this match was boring and it has a terrible ending. However it deserves 8/10
""")
doc_embedding[:5]  # Show first 5 dimensions of the embedding

[0.6366424560546875,
 0.37365591526031494,
 -1.7717797756195068,
 1.5997880697250366,
 -0.5661576986312866]
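
Some of the values printed above are larger than 1 in magnitude, so this vector cannot have unit length. When comparing document vectors from different sources, cosine similarity is a common choice because it ignores vector length and compares direction only. Below is a minimal sketch with made-up vectors. `cosine_similarity` is a helper defined here for illustration, and returning 0.0 for a zero vector is an assumption of this sketch rather than a standard convention.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption of this sketch: treat a zero vector as dissimilar to everything.
        return 0.0
    return float(np.dot(a, b) / denom)

# Made-up vectors for illustration only.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [3.0, 0.0, -1.0]  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Only vectors produced by the same embedding model are comparable this way: the models used above return vectors of different dimensionality in different vector spaces.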