<a href="https://colab.research.google.com/github/crosstar1228/Machine_Learning/blob/main/kpop_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

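## RAG vs. a plain LLM

1. Data: search for "Kpop trend", take the 10 resulting videos and 50 comments per video, and collect the 500 comments in total from the API responses.

2. Convert the comments into Document objects with LangChain (the text-splitting step is skipped because the data was already collected as individual comments).

3. Build a vector DB using LangChain, a Sentence Transformer embedding model, and ChromaDB.

4. Compare the answers of the Mistral-7B LLM when it is asked a question directly and when it is asked with RAG-retrieved context.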


In [1]:
#https://developers.google.com/youtube/v3/quickstart/python
!pip install --upgrade google-api-python-client



In [2]:
!pip install --upgrade google-auth-oauthlib google-auth-httplib2



## Step 1. Project and credential setup

In [3]:
import os
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import json


class YoutubeAPI:
  """
  get_video_ids: search for videos by keyword
  get_video_comments: extract comment info for a given video
  get_info: combine the two methods to assemble the final records
  dump_info: save the result of get_info to a JSON file
  """
  def __init__(self, api_key):
      self.api_key = api_key
      self.youtube = build('youtube', 'v3', developerKey=api_key)

  def get_video_ids(self, query:str, max_results:int=10) -> list:
      try:
        # request
        search_response = self.youtube.search().list(
            q=query,
            type='video',
            part='id,snippet',
            maxResults=max_results
        ).execute()

        # video info
        videos = []
        for search_result in search_response.get('items', []):
            video_id = search_result['id']['videoId']
            video_info = {
                'title': search_result['snippet']['title'],
                'video_id': video_id,
                'channel_title': search_result['snippet']['channelTitle']
            }
            videos.append(video_info)

      except HttpError as e:
        print(f'An error occurred: {e}')
        return []

      return videos

  def get_video_comments(self, video_id:str, max_comments:int=50) -> list:

    try:
      # request
      comments_response = self.youtube.commentThreads().list(
          part='snippet',
          videoId=video_id,
          maxResults=max_comments,
          textFormat='plainText'
      ).execute()

      # comment info
      comments = []
      for comment in comments_response.get('items', []):
          comment_snippet = comment['snippet']['topLevelComment']['snippet']
          comment_info = {
              'text': comment_snippet['textDisplay'],
              'author': comment_snippet['authorDisplayName'],
              'like_count': comment_snippet.get('likeCount', 0)
          }
          comments.append(comment_info)
      return comments

    except HttpError as e:
      print(f'An error occurred: {e}')
      return []

  def get_info(self, query:str, max_results:int=10, max_comments:int=50) -> list:
    videos = self.get_video_ids(query, max_results)
    comment_infos = []
    for video in videos:
      print(f"video info:{video}")
      comments = self.get_video_comments(video['video_id'], max_comments)
      for comment in comments:
        # convert it to dict
        comment_info= {'video_title': video['title'], 'comment': comment['text'], 'author': comment['author'], 'like_count': comment['like_count']}
        comment_infos.append(comment_info)
    return comment_infos

  def dump_info(self, query:str, max_results:int=10, max_comments:int=50) -> None:
    comment_infos = self.get_info(query, max_results, max_comments)
    # dump it to json
    with open('comment_infos.json', 'w', encoding='utf-8') as f:
      json.dump(comment_infos, f, indent = 4)
    return None
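
The class above requests a single page of comments per video. If more comments than one page were ever needed, the YouTube Data API returns a `nextPageToken` that can be passed back as `pageToken`. The following is a minimal sketch of that loop, not part of the original notebook: `collect_comment_items`, `fake_fetch`, and the toy pages are hypothetical names used only for illustration, and the real API call is replaced by a stub so the example runs offline.

```python
from typing import Callable, Optional


def collect_comment_items(fetch_page: Callable[[Optional[str]], dict], limit: int) -> list:
    """Follow nextPageToken until `limit` items are collected or pages run out."""
    items, token = [], None
    while len(items) < limit:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:  # no further page available
            break
    return items[:limit]


# Hypothetical stand-in for the API: three pages of two items each.
_FAKE_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3", "c4"], "nextPageToken": "p3"},
    "p3": {"items": ["c5", "c6"]},
}


def fake_fetch(page_token: Optional[str]) -> dict:
    return _FAKE_PAGES[page_token]


print(collect_comment_items(fake_fetch, limit=5))
# → ['c1', 'c2', 'c3', 'c4', 'c5']
```

With the real client, `fetch_page` would presumably wrap `commentThreads().list(...)` and forward the token as `pageToken`.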



In [4]:
# Enter your own YouTube Data API key here
API_KEY = "YOUTUBE_API_KEY"
youtube_api = YoutubeAPI(API_KEY)
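
One way to avoid pasting the key into the notebook is to read it from an environment variable. This is only a sketch under that assumption: `load_api_key` is a hypothetical helper, and the variable name `YOUTUBE_API_KEY` is chosen to match the placeholder string above.

```python
import os


def load_api_key(env_var: str = "YOUTUBE_API_KEY") -> str:
    """Return the API key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (assumes the variable has been set beforehand):
# youtube_api = YoutubeAPI(load_api_key())
```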

In [5]:
#youtube_api.dump_info("Kpop trend")

In [6]:
!pip install langchain-community
!pip install jq


In [7]:
import json
from langchain.schema import Document

# Read the JSON file
with open("comment_infos.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)

# Convert the JSON data into LangChain Document objects
docs = [
    Document(
        page_content=item["comment"],  # use the comment field as the main content
        metadata={"video_title": item["video_title"], "author": item["author"], "likes": item["like_count"]}
    )
    for item in json_data
]

# Print the result
for doc in docs[:3]:  # print only the first 3
    print(doc)

page_content='I love 💗 blackpink’s and New jeans style' metadata={'video_title': 'Kpop Untouchable Dance Trend', 'author': '@AlexaVega-gi4qh', 'likes': 0}
page_content='I love blackpink' metadata={'video_title': 'Kpop Untouchable Dance Trend', 'author': '@SjHaines', 'likes': 0}
page_content='OMG UR SOOO PRETTYYYYYYYY ❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤💖💖💖💖💖💖💅💅💅💅💅💅' metadata={'video_title': 'Kpop Untouchable Dance Trend', 'author': '@Krazysparks-love', 'likes': 1}


In [8]:
docs[1].page_content

'I love blackpink'

In [9]:
!pip install chromadb sentence-transformers
# Vector store using ChromaDB
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Use a Hugging Face embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create the ChromaDB vector store
vectorstore = Chroma.from_documents(docs, embedding_model, persist_directory="./chroma_db")

# Save to disk. With chromadb 0.4 and later, data appears to be persisted automatically
# when persist_directory is set, so this call is deprecated and only raises a warning.
vectorstore.persist()
print("✅ ChromaDB saved!")

✅ ChromaDB saved!


In [15]:
vectorstore.similarity_search("Jungkook")

[Document(metadata={'author': '@УткирИбрагимов-ц5в', 'likes': 1, 'video_title': 'Ta Ta Ta ft. Jungkook dance trend @ the airport 😗 #standingnexttoyou #3D #bayanni #jasonderulo'}, page_content="Jungkook's trend😂❤"),
 Document(metadata={'author': '@رفروفي', 'likes': 0, 'video_title': 'Ta Ta Ta ft. Jungkook dance trend @ the airport 😗 #standingnexttoyou #3D #bayanni #jasonderulo'}, page_content='Jungkook, oh my God, how wonderful he is. I love how he makes everyone dance without feeling. I love him so much. I can’t express my love for him, but I love him and wait for him impatiently 😩💗💗💗💗'),
 Document(metadata={'author': '@RAKESHKUMAR-verma', 'likes': 0, 'video_title': 'Ta Ta Ta ft. Jungkook dance trend @ the airport 😗 #standingnexttoyou #3D #bayanni #jasonderulo'}, page_content='Jankook trend star'),
 Document(metadata={'author': '@MdSakib-l2v', 'likes': 0, 'video_title': 'Ta Ta Ta ft. Jungkook dance trend @ the airport 😗 #standingnexttoyou #3D #bayanni #jasonderulo'}, page_content='Jonk
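
To make the idea behind `similarity_search` concrete, here is a toy sketch of nearest-neighbor retrieval over embedding vectors. It is an illustration only: it uses brute-force cosine similarity on hand-written 3-dimensional vectors, whereas ChromaDB uses an approximate index and its own configured distance metric, and the real embeddings come from the sentence-transformers model. The names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors standing in for embedded comments.
toy_vectors = {
    "comment_a": [0.9, 0.1, 0.0],
    "comment_b": [0.1, 0.9, 0.0],
    "comment_c": [0.7, 0.3, 0.1],
}

print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
# → ['comment_a', 'comment_c']
```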

In [17]:
!pip install langchain_openai


In [27]:
from langchain.llms import HuggingFaceHub
# Enter your own Hugging Face Hub API token here
api_token = "HUGGINGFACEHUB_API_TOKEN"

# Use a free LLM through the Hugging Face Inference API
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",  # free LLM model
    model_kwargs={"temperature": 0.1, "max_length": 512},
    huggingfacehub_api_token = api_token
)

In [35]:
llm.predict("Which idol star do you like the most?")



"Which idol star do you like the most?\n\nI don't have personal preferences. I'm here to provide information and assist users."

In [36]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Prompt template
template = '''Answer the question based only on the following context:
{context}

Question: {question}
'''

prompt = ChatPromptTemplate.from_template(template)

# Retriever (the existing ChromaDB vector store)
retriever = vectorstore.as_retriever()

# Combine Documents
def format_docs(docs):
    return '\n\n'.join(doc.page_content for doc in docs)

# RAG Chain
rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run
response = rag_chain.invoke("Which idol star do you like the most?")
print(response)



Human: Answer the question based only on the following context:
Lisa is both amazing and very talented! She works hard to be where she is now. She's smart to get away from the K-Pop management. She was mistreated compared to the other three Blankpink members due to the fact that she is not Korean. She is a proud and gorgeous Thai icon. She has made her home country proud with her beauty, talent and hard work. I will always be one of her biggest fans! 😍😍😍❤️❤️❤️

Even lisa

why i thout that was lisa

I'm more annoyed by the fact that VS invited TYLA when she's a one hit wonder who accomplished nothing plus she has a very bad attitude, very unlikable. 🙄😒

Question: Which idol star do you like the most?

Answer: Lisa
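
The printed response above appears to include the filled-in prompt followed by the model's answer. The sketch below reproduces only the prompt-assembly step in plain Python, without LangChain or a model call, to show how the retrieved comments end up in the `{context}` slot. `build_prompt` is a hypothetical helper, and the two sample comments are copied from the output above.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def format_docs(texts):
    """Join retrieved comment texts with blank lines, as the chain's format_docs does."""
    return "\n\n".join(texts)


def build_prompt(question, retrieved_texts):
    """Fill the template with the joined context and the user question."""
    return TEMPLATE.format(context=format_docs(retrieved_texts), question=question)


retrieved = ["Even lisa", "why i thout that was lisa"]
print(build_prompt("Which idol star do you like the most?", retrieved))
```

Because the template restricts the model to the supplied context, the answer depends on which comments the retriever happens to return for the question.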


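## Retrospective
1. Reconsider the data collection method
  - YouTube comments do not carry much information. They seem better suited as chatbot data than as a way to learn about YouTube trends.
  - Reddit or X (Twitter) could be worth trying.
2. RAG can be simpler than fine-tuning, but each approach has its pros and cons.
3. When storing large-scale data in a vector DB:
	1.	Set the chunk size with the embedding model's token limit and context preservation in mind.
	2.	Add overlap between chunks to minimize breaks in context.
	3.	Manage metadata and unique IDs.
	4.	Use a tool such as Pinecone for efficient indexing and search.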
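
The chunking, overlap, and ID points in item 3 can be sketched as follows. This is a simplified illustration under stated assumptions: it splits by characters rather than by model tokens, the default sizes are arbitrary, and `chunk_with_overlap` and `make_records` are hypothetical helpers rather than functions from LangChain, ChromaDB, or Pinecone.

```python
import hashlib


def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last chunk reached the end of the text
            break
    return chunks


def make_records(source_id, text, **chunk_kwargs):
    """Attach a deterministic ID and metadata to every chunk of one source text."""
    records = []
    for index, chunk in enumerate(chunk_with_overlap(text, **chunk_kwargs)):
        digest = hashlib.sha1(f"{source_id}:{index}:{chunk}".encode("utf-8")).hexdigest()[:12]
        records.append({
            "id": digest,
            "text": chunk,
            "metadata": {"source": source_id, "chunk_index": index},
        })
    return records


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Deterministic IDs make it possible to re-insert the same source without creating duplicate entries, provided the vector store is asked to upsert by ID.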
  