<a href="https://colab.research.google.com/github/ancestor9/24_fall_textmining_NLP/blob/main/1209_00_groq_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Groq API](https://wikidocs.net/259655)**

- https://console.groq.com/playground

### **HuggingFace Endpoint Embedding**
- **HuggingFaceEndpointEmbeddings computes embeddings through an InferenceClient internally, which makes it very similar to what HuggingFaceEndpoint does for LLMs.**
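
As a rough illustration of the endpoint-based class described above, the sketch below wraps its construction in a small helper. The helper name `build_endpoint_embeddings` is hypothetical and not part of the notebook. The `model`, `task`, and `huggingfacehub_api_token` keyword arguments are assumed to match the installed `langchain_huggingface` release, and actually calling the helper needs network access and a valid token.

```python
def build_endpoint_embeddings(token: str,
                              model: str = "intfloat/multilingual-e5-large-instruct"):
    """Return a remote embedding client (hypothetical helper, not from the notebook)."""
    # Imported lazily so that merely defining this helper does not require the package.
    from langchain_huggingface import HuggingFaceEndpointEmbeddings

    # Assumption: these keyword arguments match the installed langchain_huggingface release.
    return HuggingFaceEndpointEmbeddings(
        model=model,
        task="feature-extraction",
        huggingfacehub_api_token=token,
    )
```

Under these assumptions, `build_endpoint_embeddings(token).embed_query("some text")` would send the text to the hosted inference service instead of loading the model weights locally.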



In [1]:
!pip install langchain_huggingface --quiet
!pip install langchain-community --quiet

In [3]:
texts = [
    "안녕, 만나서 반가워.",
    "LangChain simplifies the process of building applications with large language models",
    "랭체인 한국어 튜토리얼은 LangChain의 공식 문서, cookbook 및 다양한 실용 예제를 바탕으로 하여 사용자가 LangChain을 더 쉽고 효과적으로 활용할 수 있도록 구성되어 있습니다. ",
    "LangChain은 초거대 언어모델로 애플리케이션을 구축하는 과정을 단순화합니다.",
    "Retrieval-Augmented Generation (RAG) is an effective technique for improving AI responses.",
]

### **Hugging Face authentication key**

In [15]:
# prompt: how to check whether the token is valid
from google.colab import userdata
from huggingface_hub import whoami
import os

HUGGINGFACEHUB_API_TOKEN = userdata.get('hugging-face')

# Check if the token exists
if HUGGINGFACEHUB_API_TOKEN:
    print("Hugging Face API token found.")
    # Expose the token under the name LangChain's Hugging Face integrations conventionally read
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

    # Test the token validity by asking the Hub which account owns it.
    # A local HuggingFaceEmbeddings call never sends the token anywhere,
    # so it would succeed even with an invalid token.
    try:
        whoami(token=HUGGINGFACEHUB_API_TOKEN)  # raises if the Hub rejects the token
        print("Hugging Face API token is valid.")

    except Exception as e:
        print(f"Error using the token: {e}")
        print("The Hugging Face API token might be invalid.")

else:
    print("Hugging Face API token not found in user data.")
    print("Please set the 'hugging-face' user data.")

Hugging Face API token found.
Hugging Face API token is valid.


### **Instantiating the HuggingFaceEmbeddings class**
- **Embedding the input text: "intfloat/multilingual-e5-large-instruct" produces 1024-dimensional embedding vectors.**
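
The model card for `intfloat/multilingual-e5-large-instruct` describes prefixing retrieval queries with a one-line task instruction, while documents are embedded as plain text. The helper below is a minimal sketch of that formatting. The function name and the task description string are illustrative assumptions, and the exact template should be checked against the model card before relying on it.

```python
def format_instruct_query(task_description: str, query: str) -> str:
    # Assumed template from the model card: "Instruct: <task>\nQuery: <query>"
    return f"Instruct: {task_description}\nQuery: {query}"

# Illustrative task description; choose one that fits your retrieval task.
task = "Given a web search query, retrieve relevant passages that answer the query"
formatted_query = format_instruct_query(task, "What is LangChain?")
print(formatted_query)
```

The formatted string would then be passed to `embed_query`, whereas the texts given to `embed_documents` stay unchanged.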

In [6]:
from langchain_huggingface import HuggingFaceEmbeddings

# Set the model name
model_name = "intfloat/multilingual-e5-large-instruct"
# Define embeddings instance (task argument removed)
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name
)

try:
    # Perform Document Embedding
    texts = ["Hello, world!", "How are you?"]  # test texts (this reassigns the earlier `texts` list)
    embedded_documents = hf_embeddings.embed_documents(texts)
except Exception as e:
    print(f"An error occurred: {e}")

In [8]:
# Print only the first few values; the full vector has 1024 entries
print(f"First 5 values of embedded_documents[0]: {embedded_documents[0][:5]}")
print("[HuggingFace Embedding]")
print(f"Model: \t\t{model_name}")
print(f"Dimension: \t{len(embedded_documents[0])}")

First 5 values of embedded_documents[0]: [0.014341098256409168, 0.012391467578709126, 0.012731142342090607, -0.038295548409223557, 0.038368869572877884]
[HuggingFace Embedding]
Model: 		intfloat/multilingual-e5-large-instruct
Dimension: 	1024

In [13]:
import numpy as np
np.array(embedded_documents)

array([[ 0.0143411 ,  0.01239147,  0.01273114, ..., -0.00126962,
        -0.01809308,  0.02280117],
       [ 0.03221318,  0.01226758, -0.01812768, ..., -0.04441429,
        -0.02057675,  0.00868573]])

In [14]:
np.array(embedded_documents).shape

(2, 1024)
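
The `(2, 1024)` shape means one 1024-dimensional vector per input text. A common next step with such vectors is to compare them by cosine similarity. The sketch below shows the calculation on toy 3-dimensional vectors that stand in for real embeddings; the function name `cosine_similarity` and the toy values are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real 1024-dimensional embeddings.
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]   # same direction as toy_a
toy_c = [3.0, 0.0, -1.0]  # orthogonal to toy_a: 3 + 0 - 3 = 0

print(cosine_similarity(toy_a, toy_b))
print(cosine_similarity(toy_a, toy_c))
```

With the notebook's data, the same function could be called as `cosine_similarity(embedded_documents[0], embedded_documents[1])`.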