Modelo Grande de Lenguaje (LLM) e IA Generativa

TP Clase 6

Carlos Villalobos

In [2]:
!pip install langchain pypdf faiss-cpu sentence-transformers transformers torch
!pip install -U langchain-community
!pip install langchain_groq
!pip install gradio langchain pypdf faiss-cpu langchain_groq transformers torch

Collecting langchain
  Downloading langchain-0.3.9-py3-none-any.whl.metadata (7.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Collecting langchain-core<0.4.0,>=0.3.21 (from langchain)
  Downloading langchain_core-0.3.21-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.2-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.147-py3-none-any.whl.metadata (14 kB)
Collecting packaging (from faiss-cpu)
  Downloading packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting requests-toolbelt<2.0.0,>=1.0.0 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl.metadata (14 kB)
Downloading 

In [3]:
from transformers import AutoModel

embedding_model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

config.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

configuration_bert.py:   0%|          | 0.00/8.24k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- configuration_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_bert.py:   0%|          | 0.00/97.7k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- modeling_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/275M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/373 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [4]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
import torch
from transformers import AutoTokenizer, AutoModel
import faiss
import numpy as np
import os

class JinaEmbeddings:
    def __init__(self):
        """Initialize the Jina embeddings model"""
        self.model_name = "jinaai/jina-embeddings-v2-base-en"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModel.from_pretrained(self.model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        
    def encode(self, texts):
        """
        Encode texts to embeddings
        
        Args:
            texts (str or list): Text or list of texts to encode
            
        Returns:
            numpy.ndarray: Embeddings
        """
        # Asegurar que el texto existe
        if isinstance(texts, str):
            texts = [texts]
            
        # Tokenizar y obtener las salidas del modelo
        encoded_input = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        ).to(self.device)
        
        with torch.no_grad():
            model_output = self.model(**encoded_input)
            
        # Mean pooling
        attention_mask = encoded_input['attention_mask']
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        
        return embeddings.cpu().numpy()

class DocumentQA:
    def __init__(self, pdf_path, groq_api_key):
        """
        Initialize the Document QA system
        
        Args:
            pdf_path (str): Path to the PDF document
            groq_api_key (str): Groq API key
        """
        self.pdf_path = pdf_path
        self.vector_store = None
        self.jina_embeddings = JinaEmbeddings()
        
        # Inicializar Groq LLM
        os.environ["GROQ_API_KEY"] = groq_api_key
        self.llm = ChatGroq(
            model_name="mixtral-8x7b-32768",
            temperature=0.3,
            max_tokens=1000
        )
        
        # Crear un template de QA prompt
        self.qa_prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on the provided context. 
            Your answers should be:
            1. Accurate and based solely on the provided context
            2. Comprehensive yet concise
            3. Well-structured and easy to understand
            If the context doesn't contain enough information to answer the question, say so.
            
            Context: {context}"""),
            ("human", "{question}")
        ])
        
    def load_and_split_document(self):
        """Load PDF and split into chunks"""
        # Cargar el PDF
        loader = PyPDFLoader(self.pdf_path)
        documents = loader.load()
        
        # Dividir el texto enchunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        self.chunks = text_splitter.split_documents(documents)
        return self.chunks
    
    def create_vector_store(self):
        """Create FAISS vector store from document chunks"""
        # Extraer el texto de los chunks
        texts = [doc.page_content for doc in self.chunks]
        
        # Generar los embeddings
        embeddings = self.jina_embeddings.encode(texts)
        
        # Crear el indice FAISS
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings)
        
        # Guardar el documento en chunks y crear el retriever
        self.vector_store = {
            'index': index,
            'documents': self.chunks,
            'embeddings': embeddings
        }
        return self.vector_store
    
    def retrieve_similar_chunks(self, query, k=3):
        """
        Retrieve similar chunks for a query
        
        Args:
            query (str): Query text
            k (int): Number of chunks to retrieve
            
        Returns:
            list: Similar document chunks
        """
        # Generar la query de embedding
        query_embedding = self.jina_embeddings.encode(query)
        
        # Buscar los vectores similares
        distances, indices = self.vector_store['index'].search(query_embedding, k)
        
        # Devolver los documentos que corresponder
        similar_docs = [self.vector_store['documents'][i] for i in indices[0]]
        return similar_docs
    
    def generate_answer(self, question, context_docs):
        """
        Generate answer using Groq LLM
        
        Args:
            question (str): Question to answer
            context_docs (list): List of relevant document chunks
            
        Returns:
            str: Generated answer
        """
        # Combinar el contexto de los documentos obtenidos
        context = "\n".join([doc.page_content for doc in context_docs])
        
        # Dar formato al prompt con context y question
        formatted_prompt = self.qa_prompt.format_messages(
            context=context,
            question=question
        )
        
        # Generar la respuesta usando Groq
        response = self.llm.invoke(formatted_prompt)
        return response.content
    
    def initialize_system(self):
        """Initialize the complete system"""
        print("Loading and splitting document...")
        self.load_and_split_document()
        print("Creating vector store...")
        self.create_vector_store()
        print("System initialized!")
    
    def ask_question(self, question):
        """
        Ask a question to the system
        
        Args:
            question (str): Question to ask
            
        Returns:
            dict: Contains answer and source documents
        """
        if not self.vector_store:
            raise ValueError("System not initialized. Call initialize_system() first.")
        
        # Recuperar los chunks relevantes
        similar_docs = self.retrieve_similar_chunks(question)
        
        # Generar respuesta usando Groq
        answer = self.generate_answer(question, similar_docs)
        
        return {
            "answer": answer,
            "sources": [doc.page_content for doc in similar_docs]
        }


In [6]:
# Ejemplo de uso
def main():
    # Incializa el sistema
    pdf_path = "/kaggle/input/curriculum/cv.pdf"
    groq_api_key = "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"  # Aqui va la Groq API key
    
    qa_system = DocumentQA(pdf_path, groq_api_key)
    qa_system.initialize_system()
    
    # Ejemplo de preguntas
    questions = [
        "What is your Professional Summary?",
        "What is your Technical Expertise?",
        "What are your Certifications?",
        ""
    ]
    
    # Get answers
    for question in questions:
        print(f"\nQuestion: {question}")
        print("Generating answer...")
        result = qa_system.ask_question(question)
        print(f"Answer: {result['answer']}")
        print("\nSources:")
        for i, source in enumerate(result['sources'], 1):
            print(f"Source {i}: {source[:200]}...")

if __name__ == "__main__":
    main()

Some weights of BertModel were not initialized from the model checkpoint at jinaai/jina-embeddings-v2-base-en and are newly initialized: ['embeddings.position_embeddings.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.intermedi

Loading and splitting document...
Creating vector store...
System initialized!

Question: What is your Professional Summary?
Generating answer...
Answer: The provided context includes a profile summary for Carlos Villalobos, an Electronics and Telecommunication Engineer with extensive experience in AI, computer vision, and technical solutions. Carlos is adept at developing innovative solutions to complex problems, collaborating across teams, and driving projects from concept to deployment. He has strong expertise in AI modeling, REST API development, and cloud integrations. Key skills include programming languages such as Python, JavaScript, and HTML/CSS, as well as experience with cloud technologies, web monitoring and logging, database management, and team collaboration.

Sources:
Source 1: •Conducted diagnostics and preventive maintenance for advanced telecommunication systems.•Spearheaded projects to enhance operational efﬁciency, reducing downtime by 15%.•Collaborated with cross-d