Adapted from the OpenAI Cookbook
- https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings_from_dataset.ipynb

In [1]:
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding  # requires openai<1.0 (embeddings_utils was removed in 1.0)

# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [11]:
# load and inspect dataset
df = pd.read_csv("bc_cam_txt_df.csv")
# add a CEFR level column (created from the filename)
df['cefr'] = df['filename'].apply(lambda x: x[:2])
# convert C1 and C2 to C (there are only 6 C2 texts)
df['cefr'] = df['cefr'].replace({'C1': 'C', 'C2': 'C'})
df

Unnamed: 0,filename,text,cefr
0,A1Movers_1_1,"Look, Grandpa. My friend's family are in the g...",A1
1,A1Movers_1_2,"Come quickly, children. The train's waiting to...",A1
2,A1Movers_1_3,"Hello, Mrs Castle. Hello Sally, Oh I'm tired. ...",A1
3,A1Movers_1_4,"Dad, come and watch this DVD with me. What's i...",A1
4,A1Movers_1_5,Can you colour this mountain picture now? Yes!...,A1
...,...,...,...
723,C2Prof_16-20,"Today, we're talking to marine biologists Gina...",C
724,C2Prof_21-30,I knew I'd be short of money if I didn't work ...,C
725,C2Prof_3-4,"Last year, Tim Fitzgerald exhibited photograph...",C
726,C2Prof_5-6,One of my own thoughts about this piece is the...,C


In [12]:
encoding = tiktoken.get_encoding(embedding_encoding)

# Truncate texts that are too long to embed.
# This dataset probably has none, but a long transcript (e.g. from a YouTube video) could exceed the limit.
def truncate_text(text):
    tokens = encoding.encode(text)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    return encoding.decode(tokens)

df["text"] = df["text"].apply(truncate_text)
df

Unnamed: 0,filename,text,cefr
0,A1Movers_1_1,"Look, Grandpa. My friend's family are in the g...",A1
1,A1Movers_1_2,"Come quickly, children. The train's waiting to...",A1
2,A1Movers_1_3,"Hello, Mrs Castle. Hello Sally, Oh I'm tired. ...",A1
3,A1Movers_1_4,"Dad, come and watch this DVD with me. What's i...",A1
4,A1Movers_1_5,Can you colour this mountain picture now? Yes!...,A1
...,...,...,...
723,C2Prof_16-20,"Today, we're talking to marine biologists Gina...",C
724,C2Prof_21-30,I knew I'd be short of money if I didn't work ...,C
725,C2Prof_3-4,"Last year, Tim Fitzgerald exhibited photograph...",C
726,C2Prof_5-6,One of my own thoughts about this piece is the...,C
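
To check whether any text actually exceeds the limit, rather than guessing, the texts can be counted before truncation. The following is a minimal sketch, not part of the original notebook: `count_over_limit` and `whitespace_encode` are hypothetical names, and the whitespace tokenizer is only a stand-in so the example runs without tiktoken.

```python
from typing import Callable, Iterable, List


def count_over_limit(texts: Iterable[str],
                     encode: Callable[[str], List],
                     max_tokens: int) -> int:
    """Return how many texts encode to more than max_tokens tokens."""
    return sum(1 for text in texts if len(encode(text)) > max_tokens)


# Stand-in tokenizer for illustration only: splits on whitespace.
# In the notebook, pass encoding.encode (from tiktoken) instead.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


sample_texts = ["one two three", "one two three four five", "one"]
over = count_over_limit(sample_texts, whitespace_encode, max_tokens=3)
print(over)
```

In the notebook this would presumably be called as `count_over_limit(df["text"], encoding.encode, max_tokens)` before `truncate_text` is applied.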


Get Embeddings and save them for future use
- Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage
- I set the OPENAI_API_KEY environment variable in my virtual environment
- It can also be set directly in code: openai.api_key = "sk-..."
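
One way to fail early when the key is missing is to read it from the environment explicitly. This is a sketch under assumptions: `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is the variable name used in the openai-python README.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Possible use in the notebook (left commented out so this block has no side effects):
# import openai
# openai.api_key = load_api_key()
```

Raising before the embedding loop starts avoids discovering a missing key only after the first request fails.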

In [13]:
# This may take a few minutes
df["embedding"] = df.text.apply(lambda x: get_embedding(x, engine=embedding_model))
df.to_csv("bc_cam_with_ada_002_embeddings.csv")
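
When the saved CSV is read back later, each embedding arrives as a string such as "[0.1, -0.2, ...]" rather than a list, so it has to be parsed. The sketch below shows one way to do that with the standard library; `parse_embedding` is a hypothetical helper, and the in-memory CSV and its values are made up for illustration.

```python
import ast
import csv
import io
from typing import List


def parse_embedding(cell: str) -> List[float]:
    """Convert a stringified list such as "[0.1, -0.2]" back into floats."""
    values = ast.literal_eval(cell)
    return [float(v) for v in values]


# Tiny in-memory CSV standing in for the saved file (values are made up).
fake_csv = io.StringIO()
writer = csv.writer(fake_csv)
writer.writerow(["filename", "embedding"])
writer.writerow(["example_1", str([0.25, -0.5, 1.0])])
fake_csv.seek(0)

rows = list(csv.DictReader(fake_csv))
vector = parse_embedding(rows[0]["embedding"])
print(vector)
```

With pandas, the same helper could be applied after `pd.read_csv("bc_cam_with_ada_002_embeddings.csv")` as `df["embedding"] = df["embedding"].apply(parse_embedding)`. `ast.literal_eval` is used instead of `eval` because it only accepts Python literals.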

Testing the truncation function (I did this before running the above code)

In [7]:
sentence = "This is a sample sentence. "
long_text = sentence * 1600  # 1600 x 5 words = 8000 words, enough to exceed max_tokens once tokenized

# Add a different sentence at the end
long_text += "This is the end of the long text. "

print(long_text[-34:])

This is the end of the long text. 


In [10]:
encoding = tiktoken.get_encoding(embedding_encoding)

def truncate_text(text):
    tokens = encoding.encode(text)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    return encoding.decode(tokens)

truncated_text = truncate_text(long_text)

print(truncated_text[-34:])

This is a sample sentence. This is
