# Get Embeddings

### Step1. Set up Azure OpenAI

In [20]:
import os
import openai
from dotenv import load_dotenv
import pandas as pd

load_dotenv()

openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_key = os.getenv("OPENAI_API_KEY")
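
`os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file only shows up later as an authentication or connection error. A minimal sketch of an early check follows; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative and not part of the original notebook.

```python
import os

# Variables the notebook reads above; adjust if your .env uses other names.
REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` makes a missing key visible before any API call is attempted.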

### Step2. Load the data

In [18]:
df_orig = pd.read_csv("../data/rottentomatoes-20movies-wordcount.csv", sep='\t')
df = df_orig.copy()
df

Unnamed: 0,Movie,Publish,Review,Date,Score,Word_Count
0,SOLO: A STAR WARS STORY,Stuff.co.nz,The formula is strong with this one.,2018-05-24,70.0,7
1,BLACK PANTHER,Gone With The Twins,Just about the same as every other Marvel title.,2020-05-12,50.0,9
2,DUNKIRK,Screen Zealots,This is one heck of a stunning war picture.,2018-12-20,80.0,9
3,KNIVES OUT,Student Edge,Don't fear: No spoilers here. All you need to ...,2019-11-26,80.0,17
4,KNIVES OUT,Deep Focus Review,"Sharp and funny, Knives Out exceeds expectatio...",2022-02-23,100.0,29
...,...,...,...,...,...,...
6635,ROGUE ONE: A STAR WARS STORY,Movie Nation,This is more like it...the 'Star Wars' movie J...,2016-12-13,75.0,13
6636,ROGUE ONE: A STAR WARS STORY,Newsday,"This ""Star Wars"" spinoff doesn't spin very far...",2016-12-13,75.0,19
6637,ROGUE ONE: A STAR WARS STORY,Metro,Boasts thin characters played by great actors ...,2016-12-13,40.0,37
6638,ROGUE ONE: A STAR WARS STORY,Den of Geek,Rogue One builds to one of the best third acts...,2016-12-13,80.0,14


### Step3. Deploy the model
ref: 
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#text-search-embedding


In [22]:
desired_model = "text-embedding-ada-002"

deployment_id = None
result = openai.Deployment.list()

for deployment in result.data:
    if deployment["status"] != "succeeded":
        continue
    
    model = openai.Model.retrieve(deployment["model"])
    if model["id"] == desired_model:
        deployment_id = deployment["id"]
        
# If no succeeded deployment of the desired model exists, create one
if not deployment_id:
    print('No deployment with status: succeeded found.')

    # Now let's create the deployment
    print(f'Creating a new deployment with model: {desired_model}')
    result = openai.Deployment.create(model=desired_model, scale_settings={"scale_type":"standard"})
    deployment_id = result["id"]
    print(f'Successfully created {desired_model} with deployment_id {deployment_id}')
else:
    print(f'Found a succeeded deployment of "{desired_model}" that supports text search with id: {deployment_id}.')

Found a succeeded deployment of "text-embedding-ada-002" that supports text search with id: deployment-fae836fd05cf472288199ea73e995b43.
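
The selection logic in the cell above can be isolated from the API calls so it is easier to reason about. The sketch below works on plain dictionaries and assumes each deployment's `"model"` field already holds the model id, which skips the `openai.Model.retrieve` lookup used above; `find_deployment_id` is a hypothetical helper, and unlike the loop above it returns the first match rather than the last.

```python
def find_deployment_id(deployments, desired_model):
    """Return the id of the first succeeded deployment of desired_model, or None."""
    for deployment in deployments:
        if deployment.get("status") != "succeeded":
            continue  # skip deployments that are still creating or have failed
        # Assumption: "model" holds the model id directly.
        if deployment.get("model") == desired_model:
            return deployment["id"]
    return None


# Made-up sample data, for illustration only.
sample = [
    {"id": "dep-a", "status": "failed", "model": "text-embedding-ada-002"},
    {"id": "dep-b", "status": "succeeded", "model": "text-embedding-ada-002"},
]
print(find_deployment_id(sample, "text-embedding-ada-002"))  # prints dep-b
```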


### Step4. Create embeddings
ref: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/tutorials/embeddings?tabs=bash

In [24]:
# Test the embedding call on the first row
input = 'Movie title: ' + df['Movie'][0] + '\n' + df['Review'][0]
input

embedding = openai.Embedding.create(
    input=input,
    deployment_id=deployment_id)

# embedding
len(embedding["data"][0]["embedding"])

1536
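
The text sent for embedding is the movie title and the review joined by a newline. A small sketch of that construction as a reusable function is shown here; `build_input` is a hypothetical name, not something the notebook defines.

```python
def build_input(movie, review):
    """Combine a movie title and a review into the text that gets embedded."""
    return f"Movie title: {movie}\n{review}"


text = build_input("SOLO: A STAR WARS STORY", "The formula is strong with this one.")
print(repr(text))
# 'Movie title: SOLO: A STAR WARS STORY\nThe formula is strong with this one.'
```

Keeping the format in one place ensures the test call and the bulk loop embed identically formatted text.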

In [13]:
from ratelimiter import RateLimiter

@RateLimiter(max_calls=50, period=60)  # Allow at most 50 requests per 60 seconds to stay within the OpenAI API rate limit
def request_api(df, deployment_id, i):
    try:
        input = 'Movie title: ' + df['Movie'][i] + '\n' + df['Review'][i]
        embedding = openai.Embedding.create(input=input, deployment_id=deployment_id)
        # Write with df.at rather than chained indexing (df['embedding'].iloc[i] = ...),
        # which raises SettingWithCopyWarning and may assign to a temporary copy.
        df.at[i, 'embedding'] = embedding['data'][0]['embedding']
    except Exception as err:
        print(i)
        print(f"Unexpected {err=}, {type(err)=}")

In [14]:
df['embedding'] = ''

for i in range(len(df)):  # takes about 2 hours 10 minutes
    request_api(df, deployment_id, i)
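
The `RateLimiter` decorator caps calls at 50 per 60 seconds. For readers who want to see the idea without the extra dependency, here is a rough standard-library sketch of a sliding-window limiter. The class name `SlidingWindowLimiter` and its injectable `clock` and `sleep` parameters are assumptions made for illustration; this is not how the `ratelimiter` package is implemented.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Block so that at most max_calls happen within any `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def _drop_expired(self, now):
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Call before each request; sleeps if the window is already full."""
        now = self.clock()
        self._drop_expired(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._drop_expired(now)
        self.calls.append(now)
```

A loop would call `limiter.wait()` immediately before each embedding request, with `limiter = SlidingWindowLimiter(max_calls=50, period=60)` created once up front.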

In [15]:
df

Unnamed: 0,Movie,Publish,Review,Date,Score,Word_Count,embedding
0,SOLO: A STAR WARS STORY,Stuff.co.nz,The formula is strong with this one.,2018-05-24,70.0,7,"[-0.018743040040135384, -0.0029137227684259415..."
1,BLACK PANTHER,Gone With The Twins,Just about the same as every other Marvel title.,2020-05-12,50.0,9,"[-0.0029224527534097433, -0.016656650230288506..."
2,DUNKIRK,Screen Zealots,This is one heck of a stunning war picture.,2018-12-20,80.0,9,"[-0.02633911371231079, -0.0019438054878264666,..."
3,KNIVES OUT,Student Edge,Don't fear: No spoilers here. All you need to ...,2019-11-26,80.0,17,"[-0.0036253829021006823, 0.0177458543330431, -..."
4,KNIVES OUT,Deep Focus Review,"Sharp and funny, Knives Out exceeds expectatio...",2022-02-23,100.0,29,"[-0.014687717892229557, 0.021414518356323242, ..."
...,...,...,...,...,...,...,...
6635,ROGUE ONE: A STAR WARS STORY,Movie Nation,This is more like it...the 'Star Wars' movie J...,2016-12-13,75.0,13,"[-0.04067985340952873, 0.004907699301838875, 0..."
6636,ROGUE ONE: A STAR WARS STORY,Newsday,"This ""Star Wars"" spinoff doesn't spin very far...",2016-12-13,75.0,19,"[-0.01999182440340519, 0.016782935708761215, 0..."
6637,ROGUE ONE: A STAR WARS STORY,Metro,Boasts thin characters played by great actors ...,2016-12-13,40.0,37,"[-0.034423135221004486, 0.001245751278474927, ..."
6638,ROGUE ONE: A STAR WARS STORY,Den of Geek,Rogue One builds to one of the best third acts...,2016-12-13,80.0,14,"[-0.02347734197974205, 0.014124579727649689, 0..."


In [16]:
df.to_csv("../data/rottentomatoes-20movies-embeddings.csv", sep='\t', index=False)
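
Note that `to_csv` writes each embedding list as its string representation, so reading the file back yields strings rather than lists. A minimal sketch of converting a cell back into floats follows; `parse_embedding` is a hypothetical helper name.

```python
import ast


def parse_embedding(cell):
    """Turn a CSV cell such as '[0.1, -0.2]' back into a list of floats."""
    values = ast.literal_eval(cell)  # safely evaluates the list literal
    return [float(v) for v in values]


# Simulate what ends up in the CSV for one row.
cell = str([-0.018743040040135384, -0.0029137227684259415])
vector = parse_embedding(cell)
print(len(vector), type(vector[0]).__name__)  # prints: 2 float
```

With pandas, the helper can be passed through the `converters` argument when loading, for example `pd.read_csv(path, sep='\t', converters={'embedding': parse_embedding})`.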