---
title: "chatGPT"
subtitle: "OpenAI Multimodality"
author:
  - name: 이광춘
    url: https://www.linkedin.com/in/kwangchunlee/
    affiliation: 한국 R 사용자회
    affiliation-url: https://github.com/bit2r
title-block-banner: true
#title-block-banner: "#562457"
format:
  html:
    css: css/quarto.css
    theme: flatly
    code-fold: true
    toc: true
    toc-depth: 3
    toc-title: Contents
    number-sections: true
    highlight-style: github    
    self-contained: false
filters:
   - lightbox
lightbox: auto
link-citations: true
knitr:
  opts_chunk: 
    message: false
    warning: false
    collapse: true
    comment: "#>" 
    R.options:
      knitr.graphics.auto_pdf: true
editor_options: 
  chunk_output_type: console
bibliography: bibliography.bib
csl: apa-single-spaced.csl    
editor: 
  markdown: 
    wrap: sentence
---

# API Setup

At the OpenAI developer conference held in November 2023, the existing OpenAI API interface was substantially overhauled.

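# Text

This section uses the OpenAI API to query and process information about well-known Korean figures.
First, import the required libraries and load the environment variables.
Then initialize the OpenAI client and send the question "Who are the three most famous people in Korea?" to the `gpt-3.5-turbo-1106` model.
Because the system prompt casts the model as a history expert, the answer consists of historical figures.
The request is configured to return JSON output.
The returned JSON string is parsed into a Python dictionary, which is then converted to a Pandas DataFrame and printed.
This is a typical way to structure text data with the OpenAI API.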

```{python}
#| eval: false
import os
from openai import OpenAI
from dotenv import load_dotenv
import json
import pandas as pd

load_dotenv()

client = OpenAI(
    api_key=os.getenv('OPENAI_API_KEY'),
)

text_response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    max_tokens=256,
    temperature=0,
    messages=[
        {
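            "role": "system",
            # "Act as Korea's top history expert. Please return the output as JSON."
            "content": "대한민국 최고 역사전문가로 역할을 수행해줘. 출력결과는 JSON 형식으로 부탁해",
        },      
        {
            "role": "user",
            # "Who are the three most famous people in Korea?"
            "content": "대한민국 최고 유명인 3 명은 누구야?",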
        }
    ],
    response_format={ "type": "json_object" }
)


# Parse the returned JSON string into a Python dict
text_output_json = text_response.choices[0].message.content
text_dict = json.loads(text_output_json)

# One row per key; the column label '유명인' means "famous person"
text_df = pd.DataFrame(list(text_dict.values()), index=text_dict.keys(), columns=['유명인'])
print(text_df)
```

```         
   유명인
1  윤보선
2   김구
3  유관순
```
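
JSON mode guarantees valid JSON but not a particular shape, so the one-line DataFrame conversion above only works when the model returns a flat `{"1": "...", "2": "..."}` object. The following is a minimal sketch of a more tolerant parser, assuming the model returns either a flat object or a single key wrapping a list. The helper name `json_to_rows` and the `key` field are hypothetical and not part of the OpenAI API.

```python
import json


def json_to_rows(json_text, column="유명인"):
    """Flatten a JSON object into a list of row dicts (hypothetical helper)."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    # Shape A (assumed): one key wrapping a list, e.g. {"people": ["a", "b"]}
    if len(data) == 1:
        (only_value,) = data.values()
        if isinstance(only_value, list):
            return [
                {"key": str(i), column: value}
                for i, value in enumerate(only_value, start=1)
            ]

    # Shape B: a flat mapping, e.g. {"1": "a", "2": "b"}
    return [{"key": str(k), column: v} for k, v in data.items()]


# Sample string in the same shape as the output shown above
sample = '{"1": "윤보선", "2": "김구", "3": "유관순"}'
rows = json_to_rows(sample)
print(rows)
```

Passing the result to `pd.DataFrame(rows).set_index("key")` yields the same kind of table as before.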

# Images

[Teemu Maatta, "How to use GPT-4 Vision API? - OpenAI released today API for GPT-4 Turbo Vision. In this tutorial, I will build you an application using this SOTA model.", medium, 2023-11-07](https://medium.com/@tmmtt/how-to-use-gpt-4-vision-api-ba6b57af569c)

## Image → Text

Use the `gpt-4-vision-preview` model to generate text that describes an image.

```{python}
#| eval: false
image_response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
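            "role": "system",
            # "Act as Korea's top history expert."
            "content": "대한민국 최고 역사전문가로 역할을 수행해줘.",
        },      
        {
            "role": "user",
            "content": [
                { 
                  "type": "text", 
                  # "Please describe the image."
                  "text": "이미지를 설명해주세요."
                },
                {
                  "type": "image_url",
                  # The image reference is passed as an object with a "url" field
                  "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/4/4e/An_Jung-geun.JPG"}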
                },
            ],
        }
    ],
    max_tokens=500
)

print(image_response.choices[0].message.content)
```

::: columns
::: column
![](https://upload.wikimedia.org/wikipedia/commons/4/4e/An_Jung-geun.JPG){fig-align="center" width="250"}
:::

::: column
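The model's response, translated from Korean:

The person in the photo is wearing a dark jacket, and the background appears to be a white wall.
The person is looking straight at the camera, with a relatively calm expression.
Judging from the tone and quality of the photo, it appears to be an old photograph.
Because there is no information about this person's identity, it is difficult to provide specific historical background or detailed information about them.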
:::
:::
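
The example above points the model at a public URL. For an image stored locally, a common alternative is to embed the bytes as a base64 data URL. The sketch below assumes the vision endpoint accepts a `data:` URL in the `url` field. The helper names `image_to_data_url` and `build_image_message` are hypothetical.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(question, image_url):
    """Build a user message holding one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

The dictionary returned by `build_image_message` can be placed in the `messages` list of `client.chat.completions.create` in place of the user message shown earlier.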

## Text → Image

Use the `dall-e-3` model to generate an image from text.

```{python}
#| eval: false

from IPython.display import display, Image
import requests

response = client.images.generate(
  model="dall-e-3",
  prompt="A black Scottish fold cat with light golden eyes laying down on white sheets",
  size="1024x1024",
  quality="standard",
  n=1,
)

# Save the image URL
image_url = response.data[0].url

# Fetch the image
image_response = requests.get(image_url)

# Display the image
img = Image(data=image_response.content)
display(img)
```

```{r}
#| eval: false
download.file(url = 'https://oaidalleapiprodscus.blob.core.windows.net/private/org-GpPkNlGHcRh9i7pQIlhT18p7/user-Qkv0ntrn5tQoUu6pocAidY5V/img-rypewgc6ys0EPhho7OTn7f5m.png?st=2023-11-16T08%3A52%3A12Z&se=2023-11-16T10%3A52%3A12Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-11-16T09%3A51%3A05Z&ske=2023-11-17T09%3A51%3A05Z&sks=b&skv=2021-08-06&sig=bPIlrK7Lhflj2Igsg6TWOSnT1DKQ3F2A0/vTTZSwgfI%3D', destfile = "images/dalle_gpt4_cat.jpg", mode = "wb")
```

![](images/dalle_gpt4_cat.jpg)
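
The image URL returned by the API is a time-limited signed link, which is why the file is downloaded and stored under `images/`. As a sketch, the expiry can be read from the URL itself, assuming the `se` query parameter carries the expiry time in UTC, as in the Azure Blob Storage link above. The helper names and the `example.com` URL below are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url):
    """Return the `se` expiry of a signed URL as an aware datetime, or None."""
    query = parse_qs(urlparse(url).query)  # parse_qs also undoes %3A escapes
    values = query.get("se")
    if not values:
        return None
    parsed = datetime.strptime(values[0], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(url, now=None):
    """True when the signed URL's expiry time has passed (False if none found)."""
    expiry = signed_url_expiry(url)
    if expiry is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now >= expiry


# Hypothetical URL with the same query layout as the generated link
demo_url = (
    "https://example.com/img.png"
    "?st=2023-11-16T08%3A52%3A12Z&se=2023-11-16T10%3A52%3A12Z&sp=r"
)
print(signed_url_expiry(demo_url), is_expired(demo_url))
```

A check like this can run before `requests.get(image_url)` to decide whether a new image has to be generated.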

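# Audio

## Text → Speech

Generate speech that reads Chapter 1, Article 1 of the Constitution of the Republic of Korea.

> 대한민국은 민주공화국이다. 대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다
>
> (The Republic of Korea shall be a democratic republic. The sovereignty of the Republic of Korea shall reside in the people, and all state authority shall emanate from the people.)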


```{python}
#| eval: false
from openai import OpenAI

client = OpenAI()
response = client.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input="대한민국은 민주공화국이다. 대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다",
)

# Save to an MP3 file.
with open("data/alloy-korean.mp3", "wb") as file:
  file.write(response.content)
  
```

```{r}
library(av)
library(embedr)

html_tag_audio <- function(file, type = c("wav")) {
  type <- match.arg(type)
  htmltools::tags$audio(
    controls = "",
    htmltools::tags$source(
      src = file,
      type = glue::glue("audio/{type}", type = type)
    )
  )
}

embedr::embed_audio("data/alloy-korean.mp3")
```

## Speech → Text

Next, let's look at an example that converts speech to text.

A variety of Korean speech datasets can be downloaded from the [Korean Speech Database for ASR](https://github.com/knlee-voice/AI.Tech/blob/master/docs/KoSpeechDB.md) page.

```{r}
embedr::embed_audio("data/audio_0001.wav", "wav")
```


```{python}
from openai import OpenAI
client = OpenAI()

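# Open the audio in binary mode; the context manager closes the file afterwards
with open("data/audio_0001.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"  # return a plain string instead of a JSON object
    )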

transcript
```

::::: columns
::: {.column width="45%"}
### Labeled transcript {.unnumbered}

커피숍에 쓰는 돈이 월 (100000원)/(십만 원)이 (넘더라고요)/(넘더라구요). 소비를 좀 줄이려고 캡슐커피머신 구매했는데 캡슐 가격이 너무 나가서 그게 그거 같아요.

:::

::: {.column width="10%"}
:::

::: {.column width="45%"}
### Recognized transcript {.unnumbered}

'커피숍에 쓰는 돈이 월 10만원이 넘더라고요. 소비를 좀 줄이라고 캡슐커피머신 구매했는데 캡슐가격이 너무 나가서 그게 그거인 것 같아요.\n'

:::
:::::
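
The side-by-side comparison can also be expressed as a number. The sketch below computes a character error rate (CER): the Levenshtein edit distance between the two texts divided by the length of the reference, with whitespace ignored. It assumes the `(A)/(B)` pattern in the label marks two alternative spellings of the same span. The helper names are hypothetical.

```python
import re


def pick_variant(label, which=0):
    """Keep one side of every "(A)/(B)" pair: which=0 keeps A, which=1 keeps B."""
    pattern = re.compile(r"\(([^)]*)\)/\(([^)]*)\)")
    return pattern.sub(lambda m: m.group(which + 1), label)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate, ignoring all whitespace."""
    ref = "".join(ref.split())
    hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


label = (
    "커피숍에 쓰는 돈이 월 (100000원)/(십만 원)이 (넘더라고요)/(넘더라구요). "
    "소비를 좀 줄이려고 캡슐커피머신 구매했는데 캡슐 가격이 너무 나가서 그게 그거 같아요."
)
recognized = (
    "커피숍에 쓰는 돈이 월 10만원이 넘더라고요. "
    "소비를 좀 줄이라고 캡슐커피머신 구매했는데 캡슐가격이 너무 나가서 그게 그거인 것 같아요."
)
print(cer(pick_variant(label, which=0), recognized))
```

Neither label variant spells the amount as `10만원`, so part of the measured error reflects notation rather than misrecognition.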

## Post-processing

Following the OpenAI [Speech to text](https://platform.openai.com/docs/guides/speech-to-text/quickstart) guide, this example passes the Whisper transcript to GPT-4 to correct its spelling and punctuation.

```{python}
#| eval: false
audio_file = open("data/audio_0001.wav", "rb")

system_prompt = "You are a helpful assistant for the company Korea R User Group. Your task is to correct any spelling discrepancies in the transcribed text. Make sure that the names of the following products are spelled correctly: Korea R User Group, 한국 R 사용자회. Only add necessary punctuation such as periods, commas, and capitalization, and use only the context provided."

def transcribe(audio_file):
    transcript = client.audio.transcriptions.create(
      model="whisper-1", 
      file=audio_file, 
      response_format="text"
    )
    
    return transcript
  
# transcribe(audio_file)

def generate_corrected_transcript(temperature, system_prompt, audio_file):
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": transcribe(audio_file)
            }
        ]
    )
    return response

corrected_text = generate_corrected_transcript(0, system_prompt, audio_file)

corrected_text.choices[0].message.content
```

```
'커피숍에 쓰는 돈이 월 10만원이 넘더라고요. 소비를 좀 줄이라고 캡슐 커피 머신을 구매했는데, 캡슐 가격이 너무 나가서 그게 그거인 것 같아요.'
```
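
To see what the correction step changed, the raw and corrected transcripts can be compared token by token. This is a minimal sketch using `difflib.SequenceMatcher` from the standard library, splitting on whitespace. The helper name `changed_tokens` is hypothetical.

```python
from difflib import SequenceMatcher


def changed_tokens(before, after):
    """List (tag, before_tokens, after_tokens) for every non-matching span."""
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


raw = "소비를 좀 줄이라고 캡슐커피머신 구매했는데 캡슐가격이 너무 나가서 그게 그거인 것 같아요."
corrected = "소비를 좀 줄이라고 캡슐 커피 머신을 구매했는데, 캡슐 가격이 너무 나가서 그게 그거인 것 같아요."

for tag, old, new in changed_tokens(raw, corrected):
    print(tag, old, "->", new)
```

Each tuple carries the `SequenceMatcher` tag (`replace`, `insert`, or `delete`) together with the tokens on each side, which makes spacing and punctuation changes easy to spot.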