# Summarise Personalised Reviews

### Step1. Set up Azure OpenAI

In [55]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_key = os.getenv("OPENAI_API_KEY")
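
The setup above reads `OPENAI_API_BASE` and `OPENAI_API_KEY` from the environment, and `os.getenv` silently returns `None` when either is missing. A minimal sketch of an early check, assuming those two names are the only required settings; `missing_env_vars` is a hypothetical helper, not part of the `openai` or `dotenv` packages.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    # Default to the process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Variable names taken from the setup cell above.
REQUIRED = ["OPENAI_API_BASE", "OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` would surface a misconfigured `.env` file before the first API call fails.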

### Step2. Deploy the model
ref: 
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models
- https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#text-search-embedding

In [58]:
# id of desired_model
query_model = "text-embedding-ada-002"

# list models deployed
deployment_id = None
result = openai.Deployment.list()

for deployment in result.data:
    if deployment["status"] != "succeeded":
        continue
    
    model = openai.Model.retrieve(deployment["model"])
    if model["id"] == query_model:
        deployment_id = deployment["id"]
        
# if no matching model is deployed, deploy one
if not deployment_id:
    print('No deployment with status: succeeded found.')

    # Now let's create the deployment
    print(f'Creating a new deployment with model: {query_model}')
    result = openai.Deployment.create(model=query_model, scale_settings={"scale_type":"standard"})
    deployment_id = result["id"]
    print(f'Successfully created {query_model} with deployment_id {deployment_id}')
else:
    print(f'Found a succeeded deployment of "{query_model}" that supports text search with id: {deployment_id}.')

No deployment with status: succeeded found.
Creating a new deployment with model: text-embedding-ada-002
Successfully created text-embedding-ada-002 with deployment_id deployment-fae836fd05cf472288199ea73e995b43


#### Step3. Load the data

In [45]:
import pandas as pd
fname = '../data/rottentomatoes-20movies-embeddings.csv'
df_orig = pd.read_csv(fname, delimiter='\t', index_col=False) # takes about 1 minute

In [46]:
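# Convert the string values stored in the embedding column to numpy arrays
import ast
import numpy as np

DEVELOPMENT = False  

if DEVELOPMENT:
    df = df_orig.sample(n=50, replace=False, random_state=9).copy()
else:
    df = df_orig.copy()

# Drop rows that contain NaN values
df.dropna(inplace=True)

# Parse each string into a numpy array; ast.literal_eval accepts only Python literals, unlike eval
df["embedding"] = df['embedding'].apply(ast.literal_eval).apply(np.array) # takes up to 10 minutes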
df.head()
df.shape

(6640, 7)
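
The `embedding` column holds each vector as a string, so it has to be parsed before use. A minimal sketch, assuming the strings are plain Python list literals such as `"[0.5, -1.0]"`; `parse_embedding` is a hypothetical helper name, not part of the notebook. It relies on `ast.literal_eval`, which accepts only literals and therefore rejects input that looks like executable code.

```python
import ast

def parse_embedding(text):
    """Parse a string like "[0.5, -1.0]" into a list of floats."""
    # literal_eval raises ValueError or SyntaxError for anything that is not a literal.
    values = ast.literal_eval(text)
    return [float(v) for v in values]

vector = parse_embedding("[0.5, -1.0, 2]")
print(vector)
```

The resulting list can be passed to `np.array` in the same way as in the cell above.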

In [47]:
df['Movie'].value_counts()

Movie
JOKER                               380
CAPTAIN MARVEL                      373
ONCE UPON A TIME... IN HOLLYWOOD    372
AVENGERS: ENDGAME                   370
US                                  358
STAR WARS: THE RISE OF SKYWALKER    351
A STAR IS BORN                      340
BLACK PANTHER                       339
AVENGERS: INFINITY WAR              329
SOLO: A STAR WARS STORY             321
STAR WARS: THE LAST JEDI            320
SPIDER-MAN: FAR FROM HOME           319
DUNKIRK                             316
KNIVES OUT                          311
TOY STORY 4                         309
READY PLAYER ONE                    308
WONDER WOMAN                        307
1917                                306
FIRST MAN                           306
ROGUE ONE: A STAR WARS STORY        305
Name: count, dtype: int64

#### Step4. Count tokens

In [48]:
import tiktoken
encoding = tiktoken.get_encoding('gpt2')

# Add a token_count column
df['token_count'] = 0

# Assign through a single .at[] indexer; chained indexing (df[col].loc[idx] = ...)
# may write to a temporary copy and raises SettingWithCopyWarning
for idx, review in zip(df.index, df['Review']):
    df.at[idx, 'token_count'] = len(encoding.encode(review))

df.head()

| Unnamed: 0 | Movie | Publish | Review | Date | Score | Word_Count | embedding | token_count |
|---|---|---|---|---|---|---|---|---|
| 0 | SOLO: A STAR WARS STORY | Stuff.co.nz | The formula is strong with this one. | 2018-05-24 | 70.0 | 7 | [-0.018743040040135384, -0.0029137227684259415... | 8 |
| 1 | BLACK PANTHER | Gone With The Twins | Just about the same as every other Marvel title. | 2020-05-12 | 50.0 | 9 | [-0.0029224527534097433, -0.016656650230288506... | 10 |
| 2 | DUNKIRK | Screen Zealots | This is one heck of a stunning war picture. | 2018-12-20 | 80.0 | 9 | [-0.02633911371231079, -0.0019438054878264666,... | 10 |
| 3 | KNIVES OUT | Student Edge | Don't fear: No spoilers here. All you need to ... | 2019-11-26 | 80.0 | 17 | [-0.0036253829021006823, 0.0177458543330431, -... | 23 |
| 4 | KNIVES OUT | Deep Focus Review | Sharp and funny, Knives Out exceeds expectatio... | 2022-02-23 | 100.0 | 29 | [-0.014687717892229557, 0.021414518356323242, ... | 37 |


#### Step5. Construct the prompt

In [59]:
import numpy as np

# Return the embedding of a text
def get_embedding(text, deployment_id=deployment_id):
    """ 
    Get embeddings for an input text. 
    """
    result = openai.Embedding.create(
      deployment_id=deployment_id,
      input=text
    )
    result = np.array(result["data"][0]["embedding"])
    return result

# Compute the similarity between two vectors
def vector_similarity(x, y):
    """
    Returns the similarity between two vectors.
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    similarity = np.dot(x, y)
    return similarity 

# Compute the similarity between the query and each document section, and return the sections sorted by descending similarity
def order_document_sections_by_query_similarity(query, contexts):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)

    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
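
The docstring of `vector_similarity` relies on the fact that, for unit-length vectors, the dot product equals the cosine similarity. A small standard-library sketch of that identity and of the descending sort used for ranking; the three toy vectors and the helper names (`dot`, `normalize`, `cosine`, `rank_by_similarity`) are made up for illustration and stand in for real embeddings and the numpy calls above.

```python
import math

def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(x, y):
    """Cosine similarity computed with explicit norms."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def rank_by_similarity(query_vec, contexts):
    """Return (similarity, key) pairs sorted from most to least similar."""
    return sorted(((dot(query_vec, vec), key) for key, vec in contexts.items()),
                  reverse=True)

query_vec = normalize([1.0, 2.0, 2.0])
toy_contexts = {
    "a": normalize([1.0, 2.0, 2.0]),   # same direction as the query
    "b": normalize([2.0, 1.0, 0.0]),   # partially aligned
    "c": normalize([-1.0, 0.0, 0.0]),  # pointing away from the query
}

ranking = rank_by_similarity(query_vec, toy_contexts)
print([key for _, key in ranking])
```

If the stored vectors were not normalized, the dot product would also reward longer vectors, so `cosine` would be the safer choice in that case.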

In [50]:
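MAX_SECTION_LEN = 500 # maximum combined length of the selected sections, in tokens
SEPARATOR = "\n* " # separator string placed before each section
ENCODING = "gpt2"  # approximates token counts for text-davinci-003 (tiktoken maps that model to p50k_base)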

encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))

In [60]:
# Find the sections most relevant to the given query and build a prompt from them together with the query
def construct_prompt(query: str, context_embeddings: pd.Series, df: pd.DataFrame) -> str:
    """
    Append sections of document that are most similar to the query.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(query, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section['token_count'] + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + 
                               'movie title: ' + document_section['Movie'] + ' ' +
                               document_section['Review'].replace("\n", " "))
        
        chosen_sections_indexes.append(str(section_index))
            
    # Diagnostic information
    print(f"Selected {len(chosen_sections)} document sections, with indexes:")    
    for i in chosen_sections_indexes:
        print(i + ' ' + df['Movie'].loc[int(i)])

    
    header = """Answer the question truthfully using context, if unsure, say "I don't know."\n\nContext:\n"""
    prompt = header + "".join(chosen_sections) + "\n\n Q: " + query + "\n A:"
    
    return prompt
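
The selection loop in `construct_prompt` is a greedy fill: sections are taken in relevance order until the token budget would be exceeded, and the loop stops at the first section that does not fit. A minimal sketch of that budget logic in isolation; `count_tokens` is a stand-in that counts whitespace-separated words (an assumption, not the tiktoken encoding used above), and `pack_sections` is a hypothetical name.

```python
def count_tokens(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())

def pack_sections(ranked_sections, max_len, separator="\n* "):
    """Greedily take sections, most relevant first, until the budget is exhausted."""
    separator_len = count_tokens(separator)
    chosen = []
    used = 0
    for text in ranked_sections:
        cost = count_tokens(text) + separator_len
        if used + cost > max_len:
            # Stop at the first section that does not fit, as the loop above does.
            break
        used += cost
        chosen.append(separator + text.replace("\n", " "))
    return chosen, used

sections = ["great visual effects", "a fun ride from start to finish", "too long"]
chosen, used = pack_sections(sections, max_len=10)
print(len(chosen), used)
```

Because the loop breaks instead of skipping, the short third section is never considered even though it would fit; that keeps the context strictly ordered by relevance at the cost of some unused budget.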

#### Prompt example

In [65]:
query = 'Summarise reviews of Captain Marvel.'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)

Selected 15 document sections, with indexes:
2839 CAPTAIN MARVEL
2745 CAPTAIN MARVEL
3082 CAPTAIN MARVEL
2953 CAPTAIN MARVEL
2866 CAPTAIN MARVEL
3054 CAPTAIN MARVEL
2884 CAPTAIN MARVEL
2898 CAPTAIN MARVEL
2777 CAPTAIN MARVEL
2984 CAPTAIN MARVEL
2874 CAPTAIN MARVEL
3023 CAPTAIN MARVEL
2725 CAPTAIN MARVEL
3007 CAPTAIN MARVEL
2885 CAPTAIN MARVEL
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: CAPTAIN MARVEL Captain Marvel does the job it was meant to do, and I can understand how some people might like it more than others.
* movie title: CAPTAIN MARVEL There's a lot to like with Captain Marvel, but there's a lot to forget as well.
* movie title: CAPTAIN MARVEL Captain Marvel is an entertaining, quality MCU instalment. But those who were hoping that this Marvel landmark would knock their socks off might be left feeling unsatisfied.
* movie title: CAPTAIN MARVEL Captain Marvel didn't markedly improve the experience for me. I enjoyed it w

In [79]:
def retrieve_information(prompt):
    """
    Send the prompt to the completion model and print the generated text.
    """
    try:
        # Request a completion from the deployed model
        response = openai.Completion.create(
            deployment_id= "text-davinci-003", # Assumed already deployed
            prompt=prompt,
            temperature=1,
            max_tokens=3000,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=1
        )

        # Print the text of the first choice
        result = response['choices'][0]['text']
        print(result)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")
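
`retrieve_information` prints the completion and returns nothing, so the answer cannot be reused in later cells. A hedged sketch of a variant that returns the text instead; the `complete` parameter is a hypothetical callable standing in for the `openai.Completion.create` call, and `fake_complete` only imitates the `choices[0]['text']` shape read above so that the example runs without network access.

```python
def answer_query(prompt, complete):
    """Return the completion text for `prompt`, or None if the call fails."""
    try:
        response = complete(prompt)
        # Same response shape as read in retrieve_information: choices[0]['text'].
        return response["choices"][0]["text"].strip()
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")
        return None

def fake_complete(prompt):
    """Hypothetical stand-in for the API call; returns a canned response."""
    return {"choices": [{"text": "  I don't know.\n"}]}

result = answer_query("Q: Is Joker a scary movie?\n A:", fake_complete)
print(result)
```

In the notebook, `complete` could be a small wrapper around the existing `openai.Completion.create(...)` call with the same parameters.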

## Example Queries

In [85]:
query = 'Summarise reviews of Captain Marvel.'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 15 document sections, with indexes:
2839 CAPTAIN MARVEL
2745 CAPTAIN MARVEL
3082 CAPTAIN MARVEL
2953 CAPTAIN MARVEL
2866 CAPTAIN MARVEL
3054 CAPTAIN MARVEL
2884 CAPTAIN MARVEL
2898 CAPTAIN MARVEL
2777 CAPTAIN MARVEL
2984 CAPTAIN MARVEL
2874 CAPTAIN MARVEL
3023 CAPTAIN MARVEL
2725 CAPTAIN MARVEL
3007 CAPTAIN MARVEL
2885 CAPTAIN MARVEL
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: CAPTAIN MARVEL Captain Marvel does the job it was meant to do, and I can understand how some people might like it more than others.
* movie title: CAPTAIN MARVEL There's a lot to like with Captain Marvel, but there's a lot to forget as well.
* movie title: CAPTAIN MARVEL Captain Marvel is an entertaining, quality MCU instalment. But those who were hoping that this Marvel landmark would knock their socks off might be left feeling unsatisfied.
* movie title: CAPTAIN MARVEL Captain Marvel didn't markedly improve the experience for me. I enjoyed it w

In [91]:
# Korean query, in English: "Is it OK to watch the movie Ready Player One?"
query = 'Ready Player One 영화를 봐도 될까?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 16 document sections, with indexes:
4784 READY PLAYER ONE
4756 READY PLAYER ONE
4978 READY PLAYER ONE
4827 READY PLAYER ONE
5013 READY PLAYER ONE
4775 READY PLAYER ONE
4925 READY PLAYER ONE
4875 READY PLAYER ONE
4858 READY PLAYER ONE
4101 READY PLAYER ONE
4871 READY PLAYER ONE
4753 READY PLAYER ONE
4819 READY PLAYER ONE
4804 READY PLAYER ONE
4779 READY PLAYER ONE
5040 READY PLAYER ONE
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: READY PLAYER ONE Ready Player One is a no-brainer, must-see in theaters. It's such a fun movie that you'll probably want to see it again.
* movie title: READY PLAYER ONE Pure escapism that begs the question, Will there ever be a video game movie that really works?
* movie title: READY PLAYER ONE Ready Player One is a movie designed to be embraced by the geekiest among us, and on that level it earns a high score.
* movie title: READY PLAYER ONE A cross between a thrill ride and a particularly sat

In [93]:
# Korean query, in English: "Should I watch Spiderman Far from Home? I like visual effects."
query = '내가 Spiderman Far from Home 봐야 할까? 나는 시각 효과를 좋아해.'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 13 document sections, with indexes:
2216 SPIDER-MAN: FAR FROM HOME
2115 SPIDER-MAN: FAR FROM HOME
2228 SPIDER-MAN: FAR FROM HOME
2262 SPIDER-MAN: FAR FROM HOME
2212 SPIDER-MAN: FAR FROM HOME
2325 SPIDER-MAN: FAR FROM HOME
2330 SPIDER-MAN: FAR FROM HOME
2186 SPIDER-MAN: FAR FROM HOME
2054 SPIDER-MAN: FAR FROM HOME
2296 SPIDER-MAN: FAR FROM HOME
2274 SPIDER-MAN: FAR FROM HOME
2076 SPIDER-MAN: FAR FROM HOME
2238 SPIDER-MAN: FAR FROM HOME
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: SPIDER-MAN: FAR FROM HOME I'm a very happy person right now...a good sequel, a good time, and I can't wait to see what they do with Tom Holland as Spider-Man...there's an aspect that feels fresh because it's not in New York City the whole time.
* movie title: SPIDER-MAN: FAR FROM HOME I like that "Far From Home" is trying something new and that its humor feels more real than the ironic cracks in most superhero movies. I just wish its good pieces

In [None]:
query = 'Why shouldn\'t I watch spiderman? I am big fan of visual effects.'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 11 document sections, with indexes:
2093 SPIDER-MAN: FAR FROM HOME
2309 SPIDER-MAN: FAR FROM HOME
2253 SPIDER-MAN: FAR FROM HOME
2090 SPIDER-MAN: FAR FROM HOME
2044 SPIDER-MAN: FAR FROM HOME
2228 SPIDER-MAN: FAR FROM HOME
2050 SPIDER-MAN: FAR FROM HOME
2110 SPIDER-MAN: FAR FROM HOME
2169 SPIDER-MAN: FAR FROM HOME
2310 SPIDER-MAN: FAR FROM HOME
2116 SPIDER-MAN: FAR FROM HOME
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: SPIDER-MAN: FAR FROM HOME Goodness, this is cleverly constructed stuff - full of fun, humour, youth and big visual effects, but underpinned by a screenplay that does a really well thought-through job of linking the MCU's future to its hugely popular past.
* movie title: SPIDER-MAN: FAR FROM HOME The story is pyrotechnical. The soundtrack is multi-decibel. There is no room for wit, thought, emotion or seriously challenging novelty. But, simultaneously, I'd rather watch Holland do this rubbish than most movi

In [99]:
query = 'Why shouldn\'t I watch 1917?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 16 document sections, with indexes:
821 1917
735 1917
820 1917
764 1917
778 1917
726 1917
759 1917
879 1917
895 1917
898 1917
826 1917
753 1917
774 1917
944 1917
737 1917
706 1917
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: 1917 This is a movie one does not watch so much as witness. It simply must be seen.
* movie title: 1917 It's hard to watch, but it feels important that we do.
* movie title: 1917 Making a hollow spectacle of war is ignoble. Sometimes it's dangerously irresponsible.
* movie title: 1917 1917 is Absolutely incredible, it's beyond stressful from start to finish. This is a film that demands to be seen in a movie theater.
* movie title: 1917 From the sole perspective of the filmmaking craft, "1917" is worth a watch.
* movie title: 1917 I was expecting an exciting and technical marvel, but it's an emotional one too. It's breathtaking, in that I didn't want to breathe.
* movie title: 1917 In other words, "1

In [97]:
query = 'I am not a big fan of lengthy movie, should I watch 1917?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 15 document sections, with indexes:
759 1917
674 1917
879 1917
895 1917
821 1917
764 1917
949 1917
726 1917
935 1917
846 1917
753 1917
774 1917
719 1917
770 1917
835 1917
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: 1917 In other words, "1917" often seems built more to wow audience than make them feel. And it may well have been a better film set around extended cuts than fully committing to the one-take gimmick.
* movie title: 1917 The film's final hour loses steam and is beset by more than a few narrative lapses it ultimately can't overcome. Still, this is a worthwhile epic best seen on the big screen.
* movie title: 1917 Sitting through it is like watching someone else playing a video game for two solid hours, and not an especially compelling one at that.
* movie title: 1917 '1917' unfolds like an overstuffed video game. It becomes rather silly and farfetched the further toward the front it proceeds. The movie feature

In [None]:
query = 'I love visual effects, should I watch Captain Marvel or Spiderman?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 11 document sections, with indexes:
2823 CAPTAIN MARVEL
2093 SPIDER-MAN: FAR FROM HOME
2148 SPIDER-MAN: FAR FROM HOME
2857 CAPTAIN MARVEL
2116 SPIDER-MAN: FAR FROM HOME
3091 CAPTAIN MARVEL
2785 CAPTAIN MARVEL
2253 SPIDER-MAN: FAR FROM HOME
2875 CAPTAIN MARVEL
2851 CAPTAIN MARVEL
3005 CAPTAIN MARVEL
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: CAPTAIN MARVEL Great visual effects and acting by Brie Larson make for an enjoyable watch that embraces a confident and smart woman character, something rarely seen in this genre.
* movie title: SPIDER-MAN: FAR FROM HOME Goodness, this is cleverly constructed stuff - full of fun, humour, youth and big visual effects, but underpinned by a screenplay that does a really well thought-through job of linking the MCU's future to its hugely popular past.
* movie title: SPIDER-MAN: FAR FROM HOME Special credit must be given to the effects team, which brings to life Mysterio's trippy, percep

In [109]:
query = 'I love emotional movies, what movie should I watch?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 15 document sections, with indexes:
2665 AVENGERS: ENDGAME
5719 DUNKIRK
2463 AVENGERS: ENDGAME
2998 CAPTAIN MARVEL
1817 TOY STORY 4
2550 AVENGERS: ENDGAME
5652 STAR WARS: THE LAST JEDI
5657 STAR WARS: THE LAST JEDI
3638 A STAR IS BORN
726 1917
5938 DUNKIRK
3734 A STAR IS BORN
3558 A STAR IS BORN
5843 DUNKIRK
2369 AVENGERS: ENDGAME
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: AVENGERS: ENDGAME "Endgame" is the most emotional of the Avengers movies, a tear-stained farewell that effectively tugs at the heartstrings while simultaneously blowing them up with a large army's worth of CGI warfare.
* movie title: DUNKIRK It's an emotional gauntlet, as you'll be glued to the edge of your seats with your eyes staring at the screening.
* movie title: AVENGERS: ENDGAME Emotional heft combines with a sweeping sense of the epic, often within the same scene.
* movie title: CAPTAIN MARVEL It's cool...I mean, it's enjoyable...I mean, it'

In [102]:
query = 'Is Joker a scary movie?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 15 document sections, with indexes:
1289 JOKER
1269 JOKER
1151 JOKER
1245 JOKER
1319 JOKER
1295 JOKER
1164 JOKER
1047 JOKER
1314 JOKER
1011 JOKER
1110 JOKER
1019 JOKER
1130 JOKER
1119 JOKER
1220 JOKER
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: JOKER This movie is indeed terrifying, but that's because it's horrifically, unintentionally insightful of how such men are viewed, both from without and within.
* movie title: JOKER It's a very good movie, and it features a blood-curdling performance from Joaquin Phoenix, in the most frightening portrayal of a violent maniac in decades.
* movie title: JOKER "Joker" stays with the viewer, forcing you to consider it more carefully. The picture you'll get is dark - haunting, disturbing and all too real.
* movie title: JOKER The Joker is deadly serious, a bleak but oddly beautiful horror film that evokes the nightmarish nihilism of Martin Scorsese's Taxi Driver.
* movie title: JOKE

In [103]:
query = 'What type of movie is Joker?'
prompt = construct_prompt(query=query, context_embeddings=df['embedding'], df=df); print(prompt)
retrieve_information(prompt=prompt)

Selected 12 document sections, with indexes:
1119 JOKER
1314 JOKER
1110 JOKER
1204 JOKER
1164 JOKER
1306 JOKER
1105 JOKER
1047 JOKER
1146 JOKER
1169 JOKER
1126 JOKER
1130 JOKER
Answer the question truthfully using context, if unsure, say "I don't know."

Context:

* movie title: JOKER A dark, deranged and often mesmerizing take on the superhero genre. The sort of movie that crawls into your guts and stays there awhile.
* movie title: JOKER A violent, nihilistic horror film masquerading as both a character drama and a comic book movie
* movie title: JOKER Intense and disturbing, Joker is a well-crafted movie, if rather open-ended in its intentions.
* movie title: JOKER 'Joker' is a film that is simply trying to elicit a response and provoke viewers. It is not what you will be expecting from a big-budget movie with a DC logo slapped on it. How Phillips made something that is on par with 'Taxi Driver' is utterly shocking.
* movie title: JOKER Joker is a devastatingly bold, brashly aggress