https://python.langchain.com/v0.2/docs/tutorials/local_rag/

### Install libraries

In [1]:
# Document loading, retrieval methods and text splitting
!pip install -qU langchain langchain_community

# Local vector store via Chroma
!pip install -qU langchain_chroma

# Local inference and embeddings via Ollama
!pip install -U langchain_ollama
!pip install -qU beautifulsoup4





In [2]:
!ollama list

NAME                       ID              SIZE      MODIFIED    
llama3:latest              365c0bd3c000    4.7 GB    5 days ago     
nomic-embed-text:latest    0a109f422b47    274 MB    12 days ago    


### Import libraries

In [3]:
#%pip install langchain

from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

USER_AGENT environment variable not set, consider setting it to identify your requests.


## STEP 1: Loading documents

In [4]:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "http://neuralnetworksanddeeplearning.com/acknowledgements.html"
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Pattern that matches hrefs starting with 'chap', optionally preceded by a '/'
pattern = re.compile(r'^(?:/)?chap.*$', re.IGNORECASE)


# Keep only the links that match the pattern
chapter_links = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if pattern.search(href):
        full_url = urljoin(base_url, href)
        chapter_links.append(full_url)

# Print the filtered links
for link in chapter_links:
    print(link)

# Now fetch the content of each chapter
chapters_content = []
for link in chapter_links:
    r = requests.get(link)
    if r.status_code == 200:
        chapter_soup = BeautifulSoup(r.text, 'html.parser')
        # Extract the full text of the chapter (adjust if needed)
        content = chapter_soup.get_text()
        chapters_content.append(content)
        print(f"Loaded: {link}")
    else:
        print(f"Failed to load: {link}")

http://neuralnetworksanddeeplearning.com/chap1.html
http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons
http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons
http://neuralnetworksanddeeplearning.com/chap1.html#the_architecture_of_neural_networks
http://neuralnetworksanddeeplearning.com/chap1.html#a_simple_network_to_classify_handwritten_digits
http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent
http://neuralnetworksanddeeplearning.com/chap1.html#implementing_our_network_to_classify_digits
http://neuralnetworksanddeeplearning.com/chap1.html#toward_deep_learning
http://neuralnetworksanddeeplearning.com/chap2.html
http://neuralnetworksanddeeplearning.com/chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network
http://neuralnetworksanddeeplearning.com/chap2.html#the_two_assumptions_we_need_about_the_cost_function
http://neuralnetworksanddeeplearning.com/chap2.html#the_hadamard_product_$s_\odot_t$
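
Many of the links printed above point to the same page and differ only in their `#fragment`, so the download loop fetches each chapter several times and `chapters_content` ends up holding repeated text. A minimal sketch of one way to avoid this, assuming it is acceptable to identify a page by its URL without the fragment. `unique_pages` and `sample_links` are hypothetical names used only for illustration.

```python
from urllib.parse import urldefrag


def unique_pages(links):
    """Strip '#fragment' parts and drop repeated pages, preserving order."""
    seen = set()
    pages = []
    for link in links:
        page, _fragment = urldefrag(link)  # split off the part after '#'
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


# A few of the links from the output above
sample_links = [
    "http://neuralnetworksanddeeplearning.com/chap1.html",
    "http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons",
    "http://neuralnetworksanddeeplearning.com/chap2.html",
    "http://neuralnetworksanddeeplearning.com/chap2.html#the_two_assumptions_we_need_about_the_cost_function",
]
print(unique_pages(sample_links))
```

Applying the same idea to `chapter_links` before the download loop would leave one entry per chapter.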


In [5]:
chapters_content

['\n\n\n\n\n\n\n\n\n\n\n\n\n\nNeural networks and deep learning\n\n\n\n\n\n\n\n\n\nCHAPTER 1\nUsing neural nets to recognize handwritten digits\nNeural Networks and Deep LearningWhat this book is aboutOn the exercises and problemsUsing neural nets to recognize handwritten digitsPerceptronsSigmoid neuronsThe architecture of neural networksA simple network to classify handwritten digitsLearning with gradient descentImplementing our network to classify digitsToward deep learning\nHow the backpropagation algorithm worksWarm up: a fast matrix-based approach to computing the output\r  from a neural networkThe two assumptions we need about the cost functionThe Hadamard product, $s \\odot t$The four fundamental equations behind backpropagationProof of the four fundamental equations (optional)The backpropagation algorithmThe code for backpropagationIn what sense is backpropagation a fast algorithm?Backpropagation: the big picture\nImproving the way neural networks learnThe cross-entropy cost fu


In [62]:

#loader = WebBaseLoader(chapters_content)
#data = loader.load()

## STEP 2: Splitting the documents

In [57]:
from langchain_core.documents import Document
docs = [Document(page_content=content) for content in chapters_content if content.strip()]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(docs)
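
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified stand-in that cuts text into fixed character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraph and line breaks. `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # no shared characters
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # neighbours share 2 characters
```

With `chunk_overlap=0`, as in the cell above, the chunks do not repeat any text, at the cost of possibly cutting a sentence at a chunk boundary.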

In [58]:
# Convert loaded documents into strings by concatenating their content
# and ignoring metadata
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

## STEP 3: Storing the documents

In [None]:
# Create the embeddings and build the vector store
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=local_embeddings)
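
Conceptually, the vector store keeps one embedding per chunk and, given a query embedding, returns the closest chunks. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have many more dimensions, and Chroma uses its own index and distance configuration. `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vec in ranked[:k]]


# Made-up vectors standing in for chunk embeddings
toy_index = [
    ("perceptrons", [1.0, 0.0, 0.0]),
    ("backpropagation", [0.0, 1.0, 0.0]),
    ("sigmoid neurons", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```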

## STEP 4: LLM

In [None]:
model = ChatOllama(
    model="llama3.1",
    base_url="https://ollama.gsi.upm.es/",
)

## STEP 5: Retrieval and generation

In [None]:
RAG_TEMPLATE = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

<context>
{context}
</context>

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)
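
The `{context}` and `{question}` placeholders in the template are filled in when the prompt is invoked. As a rough stand-in, plain `str.format` on a shortened template shows the resulting text. This is only an illustration: the real `ChatPromptTemplate` wraps the result in a chat message rather than returning a bare string. `TEMPLATE` and `filled` are hypothetical names.

```python
# Shortened version of the RAG template, for illustration only
TEMPLATE = """Use the following pieces of retrieved context to answer the question.

<context>
{context}
</context>

Answer the following question:

{question}"""

filled = TEMPLATE.format(
    context="The Hadamard product is the elementwise product of two vectors.",
    question="What is the Hadamard product?",
)
print(filled)
```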

### STEP 5.1: Manual Q&A with the vector store

In [46]:
'''chain = (
    RunnablePassthrough.assign(context=lambda input: format_docs(input["context"]))
    | rag_prompt
    | model
    | StrOutputParser()
)'''

#question = "What techniques can be used to improve the way neural networks learn?"

#docs = vectorstore.similarity_search(question)

# Run
#chain.invoke({"context": docs, "question": question})

'chain = (\n    RunnablePassthrough.assign(context=lambda input: format_docs(input["context"]))\n    | rag_prompt\n    | model\n    | StrOutputParser()\n)'

In [71]:
'''question = "¿Qué es el producto de Hadamard?"

# Run
chain.invoke({"context": docs, "question": question})'''

'question = "¿Qué es el producto de Hadamard?"\n\n# Run\nchain.invoke({"context": docs, "question": question})'

In [72]:
'''question = "¿Cuáles son las cuatro ecuaciones fundamentales detrás de la retropropagación?"

# Run
chain.invoke({"context": docs, "question": question})'''

'question = "¿Cuáles son las cuatro ecuaciones fundamentales detrás de la retropropagación?"\n\n# Run\nchain.invoke({"context": docs, "question": question})'

### STEP 5.2: Automatic Q&A with retrieval

In [None]:
retriever = vectorstore.as_retriever()

qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | model
    | StrOutputParser()
)
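
In the chain above, the leading dict sends the same input question down two paths: `"context"` goes through the retriever and then `format_docs`, while `"question"` is passed through unchanged. The plain-Python sketch below mirrors that data flow with stand-ins. All function names are hypothetical, and the fake model only echoes the question instead of calling an LLM.

```python
def fake_retriever(question):
    # Stand-in for `retriever`: returns chunk texts instead of Document objects
    return [
        "A perceptron takes several binary inputs.",
        "It produces a single binary output.",
    ]


def join_texts(texts):
    # Same joining rule as format_docs, applied to plain strings
    return "\n\n".join(texts)


def fake_prompt(inputs):
    # Stand-in for rag_prompt: fills context and question into a template
    return (
        f"<context>\n{inputs['context']}\n</context>\n\n"
        f"Answer the following question:\n\n{inputs['question']}"
    )


def fake_model(prompt_text):
    # Stand-in for the chat model plus StrOutputParser: echoes the last prompt line
    return "stub answer to: " + prompt_text.splitlines()[-1]


def run_chain(question):
    # The dict step: both values are computed from the same input
    inputs = {
        "context": join_texts(fake_retriever(question)),
        "question": question,
    }
    return fake_model(fake_prompt(inputs))


print(run_chain("What is a perceptron?"))
```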

In [None]:
question = "¿Qué es el producto de Hadamard?"

qa_chain.invoke(question)

In [None]:
question = "What techniques can be used to improve the way neural networks learn?"

qa_chain.invoke(question)

In [40]:
question = "¿Cuáles son las cuatro ecuaciones fundamentales detrás de la retropropagación?"

qa_chain.invoke(question)

'Las cuatro ecuaciones fundamentales detrás de la retropropagación se presentan en el capítulo "How the backpropagation algorithm works" del libro. Estas ecuaciones forman la base teórica detrás del algoritmo de retropropagación. Se pueden encontrar las ecuaciones específicas en el texto del contexto proporcionado.'

In [50]:
questions = ["What is a perceptron?",
            "What techniques can be used to improve learning in neural networks?",
            "How are the weights initialized in a neural network?"
]

In [28]:
# Spanish version of the same three questions. Note the singular name `question`:
# the evaluation loop further down reads `questions`, so this list is not the one evaluated.
question = ["¿Qué es un perceptron?",
            "¿Qué técnicas se pueden utilizar para mejorar el aprendizaje en redes neuronales?",
            "¿Cómo se inicializan los pesos en una red neuronal?"]

In [None]:
ground_truth = ["A perceptron is an artificial neuron developed in the 1950s and 1960s by Frank Rosenblatt, inspired by the work of Warren McCulloch and Walter Pitts. It takes several binary inputs and produces a single binary output based on weights and a threshold. Perceptrons can be used to compute elementary logical functions such as AND, OR, and NAND.",
                "Several techniques can improve learning in neural networks, including using the cross-entropy cost function, overfitting and regularization methods, and proper weight initialization. Other techniques and modifications to the cost function can be used to enhance performance",
                "Weights in a neural network are initialized randomly, often using a Numpy function to generate Gaussian distributions with a mean of 0 and a standard deviation of 1. This random initialization provides a starting point for the stochastic gradient descent algorithm. Later chapters may discuss better ways of initializing weights and biases"
    
]

In [51]:
ground_truth = ["A perceptron is a type of artificial neuron. Perceptrons were developed in the 1950s and 1960s by scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it is more common to use other models of artificial neurons, such as the sigmoid neuron.",
                "Learning in neural networks can be improved through cross-entropy cost function, regularization (L1 and L2, dropout and artificial expansion of training data) and initialization of weights.",
                "The input weights of a neuron are initialized as Gaussian random variables with mean 0 and standard deviation = 1, which gives the stochastic gradient descent algorithm a place to start from." ]


In [29]:
# Spanish ground truth, matching the Spanish questions.
# Running this cell after the English one overwrites `ground_truth`.
ground_truth = ["Un perceptrón es un tipo de neurona artificial. Los perceptrones fueron desarrollados en las décadas de 1950 y 1960 por el científico Frank Rosenblatt, inspirándose en trabajos anteriores de Warren McCulloch y Walter Pitts. Hoy en día, es más común utilizar otros modelos de neuronas artificiales, como la neurona sigmoidea",
                "El aprendizaje en redes neuronales puede mejorarse mediante la función de coste de entropía cruzada, la regularización (L1 y L2, abandono y expansión artificial de los datos de entrenamiento) y la inicialización de los pesos",
                "Los pesos de entrada de una neurona se inicializan como variables aleatorias gaussianas con media 0 y desviación típica = 1, lo que proporciona al algoritmo de descenso por gradiente estocástico un punto de partida." ]

In [52]:
answer = []
content = []


In [53]:
for query in questions:
    answer.append(qa_chain.invoke(query))
    # Note: this appends a generator expression, not a list; it is materialized in a later cell.
    # retriever.invoke replaces the deprecated get_relevant_documents.
    content.append(docs.page_content for docs in retriever.invoke(query))


In [None]:
#contexts= [list(c) for c in content]

In [54]:
content

[<generator object <genexpr> at 0x00000157B6B6F430>,
 <generator object <genexpr> at 0x00000157B6B6F0B0>,
 <generator object <genexpr> at 0x00000157B6BF8970>]

In [55]:
# `content` is a list of generators; convert each one to a list:
content = [list(gen) for gen in content]
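
The generator objects shown earlier appear because `append(x for x in ...)` stores a lazy generator expression rather than its values. Wrapping the expression in square brackets builds the list immediately, which makes the later conversion step unnecessary. A small sketch with stand-in objects: `FakeDoc` and `fake_retrieve` are hypothetical and only imitate the `page_content` attribute of retrieved documents.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Stand-in for a retrieved document
    page_content: str


def fake_retrieve(query):
    # Stand-in for the retriever call used in the loop above
    return [FakeDoc(f"{query} / chunk 1"), FakeDoc(f"{query} / chunk 2")]


queries = ["q1", "q2"]

# Parentheses: each element is a generator that has not produced anything yet
lazy = [(d.page_content for d in fake_retrieve(q)) for q in queries]

# Square brackets: each element is already a list of strings
eager = [[d.page_content for d in fake_retrieve(q)] for q in queries]

print(type(lazy[0]).__name__)
print(eager[0])
```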

In [56]:
answer

['', '', '']

In [None]:
data = {
    "question": questions,
    "ground_truth": ground_truth,
    "answer": answer,
    "contexts": content
}
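
The `answer` list displayed earlier contains only empty strings, and scoring empty answers would not say much about the pipeline. A minimal sketch of a sanity check that could run before building the dataset, assuming the four-column layout used above. `check_eval_data` is a hypothetical helper, not part of `datasets` or `ragas`.

```python
def check_eval_data(data):
    """Return a list of human-readable problems found in the evaluation dict."""
    problems = []

    # All columns must have one entry per question
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")

    # Empty answers usually mean the model call failed or returned nothing
    for i, answer in enumerate(data.get("answer", [])):
        if not answer.strip():
            problems.append(f"answer {i} is empty")

    # Each contexts entry should be a list of strings, not a generator
    for i, ctx in enumerate(data.get("contexts", [])):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts {i} is not a list of strings")

    return problems


sample = {
    "question": ["What is a perceptron?"],
    "ground_truth": ["A perceptron is a type of artificial neuron."],
    "answer": [""],
    "contexts": [["Perceptrons were developed in the 1950s and 1960s."]],
}
print(check_eval_data(sample))
```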

In [None]:
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [None]:
dataset

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness, 
    answer_relevancy,
    context_recall, 
    context_precision
)

In [None]:
result = evaluate(
    dataset=dataset,
    metrics=[context_precision, context_recall, answer_relevancy],
    llm=model,
    embeddings=local_embeddings,
)

In [None]:
result