# LangChain Core Mechanics: A Practical Study

This document outlines a series of experiments conducted to gain practical experience with the LangChain library. The focus is on deconstructing and implementing key components essential for building LLM-driven applications.

**The experimental process covers:**
1.  Loading and segmenting textual data for LLM consumption.
2.  Converting text segments into semantic vector representations (embeddings) via local Hugging Face models.
3.  Indexing these embeddings within a FAISS vector store to enable rapid retrieval of relevant information.
4.  Integrating the retrieval mechanism with a language model to perform context-aware question answering (Retrieval-Augmented Generation).

These results and observations are part of an ongoing learning effort. The exploration is foundational, and it provides a basis for more advanced LangChain development.

**Software Stack:** Python, LangChain, Hugging Face (`sentence-transformers`), FAISS, and OpenRouter (called via `requests`) for the LLM.
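
Step 4 in the list above relies on the "stuff" strategy (the `chain_type` used later in this notebook): all retrieved chunks are concatenated into one prompt. The sketch below illustrates that idea in plain Python. The `Chunk` class, the `build_stuffed_prompt` helper, the prompt wording, and the sample texts are all hypothetical; this is not LangChain's actual prompt template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    text: str
    source: str


def build_stuffed_prompt(question: str, chunks: List[Chunk]) -> str:
    """Concatenate every retrieved chunk into a single prompt ("stuff" strategy)."""
    context = "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
    return (
        "Use only the context below to answer the question. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative chunks; real ones would come from the retriever.
chunks = [
    Chunk("Agents combine planning, memory, and tool use.", "doc-a"),
    Chunk("Temperature controls sampling randomness.", "doc-b"),
]
prompt = build_stuffed_prompt("What do agents combine?", chunks)
print(prompt)
```

Because everything is packed into one prompt, the approach only works while `k` chunks of `chunk_size` characters fit in the model's context window.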

In [70]:
import warnings
import tqdm
warnings.filterwarnings('ignore')
from dotenv import load_dotenv
import os
import json
import requests
from typing import Any, List, Mapping, Optional

from langchain_core.documents import Document
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

load_dotenv()

True

In [71]:
class Claude2LLM(LLM):
    n: int
    model_to_use: str = "anthropic/claude-3-haiku-20240307" # Default model

    @property
    def _llm_type(self) -> str:
        # Identifies the kind of LLM wrapper, not the model; the model name lives in model_to_use.
        return "openrouter"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')
        if not OPENROUTER_API_KEY:
            raise ValueError("OPENROUTER_API_KEY not found in environment variables.")

        # Consider making YOUR_SITE_URL configurable or passed in __init__
        YOUR_SITE_URL = 'https://github.com/billabavandar/testBot'
        headers = {
            'Authorization': f'Bearer {OPENROUTER_API_KEY}',
            'HTTP-Referer': YOUR_SITE_URL,
            'X-Title': 'LangChain Learning Project',
            'Content-Type': 'application/json'
        }
        data = {
            'model': self.model_to_use,
            'messages': [{'role': 'user', 'content': prompt}]
        }

        response = requests.post(
            'https://openrouter.ai/api/v1/chat/completions',
            headers=headers,
            data=json.dumps(data),
            timeout=60,  # fail instead of hanging indefinitely on a stalled connection
        )

        if response.status_code != 200:
            raise ValueError(f"OpenRouter API request failed: {response.status_code} - {response.text}")
        try:
            response_json = response.json()
        except requests.exceptions.JSONDecodeError:
            raise ValueError(f"Failed to decode JSON from OpenRouter response: {response.text}")

        try:
            output = response_json['choices'][0]['message']['content']
        except (KeyError, IndexError, TypeError) as e:
            print(f"Error parsing OpenRouter response structure: {e}")
            print(f"Full JSON response was: {response_json}")
            raise

        if stop is not None:
            # Currently not passing stop to API, this is a placeholder
            print("Warning: 'stop' arguments were passed but are not implemented in this custom LLM.")
        return output

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"n": self.n, "model_name": self.model_to_use}
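
The custom LLM above only warns when `stop` sequences are passed. One possible way to honor them without changing the API request is to truncate the returned text on the client side. The sketch below is a minimal version of that idea; `truncate_at_stop` is a hypothetical helper, not part of LangChain or OpenRouter.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for seq in stop:
        if not seq:
            continue  # an empty stop sequence would match at index 0
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


print(truncate_at_stop("Answer: 42\nObservation: done", ["\nObservation:"]))
```

Inside `_call`, this could replace the warning with `output = truncate_at_stop(output, stop)`. The model still generates the full text, so this saves no tokens; it only trims what the chain sees.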

In [72]:
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://www.promptingguide.ai/introduction/settings",
    "https://python.langchain.com/v0.1/docs/modules/model_io/llms/" # LangChain LLM docs
]
print(f"Loading documents from {len(urls)} URLs...")
loader = WebBaseLoader(web_paths=urls)
all_web_docs = []
try:
    all_web_docs = loader.load()
    print(f"Successfully loaded {len(all_web_docs)} documents from the web.")
    if not all_web_docs:
        print("Warning: No documents were loaded. Check URLs and network connection.")
except Exception as e:
    print(f"Error loading documents from web: {e}")

Loading documents from 1 URLs...
Successfully loaded 1 documents from the web.


In [73]:
split_docs = []
if all_web_docs:
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150
    )
    split_docs = text_splitter.split_documents(all_web_docs)
    print(f"Split web documents into {len(split_docs)} chunks.")
else:
    print("Skipping document splitting as no documents were loaded.")

Split web documents into 1202 chunks.
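
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain sliding window over characters. This is a deliberate simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which reduces the chance that a sentence relevant to a query is cut in half at a boundary.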


In [74]:
# HuggingFaceEmbeddings is already imported in the first cell.
model_name = "BAAI/bge-large-en-v1.5" # biggest baddest embedder

embeddings_model = None
if split_docs:
    print(f"Initializing HuggingFaceEmbeddings with model: {model_name}")
    embeddings_model = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={'device': 'cpu'},
        show_progress=True,
        encode_kwargs={'normalize_embeddings': True}
    )
    print("HuggingFaceEmbeddings model initialized.")
else:
    print("Skipping embedding model initialization as there are no split documents.")

Initializing HuggingFaceEmbeddings with model: BAAI/bge-large-en-v1.5
HuggingFaceEmbeddings model initialized.
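
The `normalize_embeddings=True` setting matters for retrieval: once vectors have unit length, their inner product equals their cosine similarity, and squared L2 distance becomes `2 - 2 * cosine`, so both metrics rank neighbors identically. The sketch below checks this numerically on two toy vectors; the vectors and helper names are illustrative only.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

print("cosine(a, b):          ", cosine(a, b))
print("dot(ua, ub):           ", dot(ua, ub))
print("squared_l2(ua, ub):    ", squared_l2(ua, ub))
print("2 - 2 * dot(ua, ub):   ", 2 - 2 * dot(ua, ub))
```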


In [76]:
vector_store = None
index_path = "faiss_index_web_docs_v2" # Use a distinct name

if split_docs and embeddings_model:
    print(f"\nCreating FAISS vector store, will save to: {index_path}")
    vector_store = FAISS.from_documents(split_docs, embeddings_model)
    vector_store.save_local(index_path)
    print(f"FAISS vector store created and saved to {index_path}.")
else:
    print("Skipping FAISS store creation/saving as prerequisites are not met.")


Creating FAISS vector store, will save to: faiss_index_harrypotter1


Batches:   0%|          | 0/38 [00:00<?, ?it/s]

FAISS vector store created and saved to faiss_index_harrypotter1.
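
Conceptually, an exact ("flat") vector index answers a query by comparing it against every stored vector and keeping the `k` closest. The sketch below does this by brute force with squared L2 distance. It assumes, without the notebook confirming it, that the store built above uses an exact L2 index by default; FAISS itself runs the same comparison far more efficiently. `top_k_l2` and the toy vectors are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_l2(query: Sequence[float],
             vectors: Sequence[Sequence[float]],
             k: int) -> List[Tuple[int, float]]:
    """Return (index, squared L2 distance) for the k vectors closest to query."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    # nsmallest sorts by distance first, then by index to break ties
    return [(i, dist) for dist, i in heapq.nsmallest(k, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = top_k_l2([0.0, 1.0], vectors, k=2)
print(hits)
```

With `k=3` in the retriever, the same idea returns the three chunks whose embeddings lie closest to the embedded question, whether or not those chunks contain readable text.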


In [77]:
# Ensure embeddings_model is defined from Cell 5 if you run this cell directly in a new session
# This cell is for loading an already existing index.
vector_store_loaded = None
# index_path = "faiss_index_web_docs_v2" # Ensure this matches the save path

# if 'embeddings_model' not in locals() or embeddings_model is None:
#     print("Re-initializing embeddings model for loading FAISS index...")
#     model_name_for_load = "BAAI/bge-large-en-v1.5" # Must match the model used for saving
#     embeddings_model = HuggingFaceEmbeddings(
#         model_name=model_name_for_load,
#         model_kwargs={'device': 'cpu'},
#         show_progress=False, # No need for progress bar on just init
#         encode_kwargs={'normalize_embeddings': True}
#     )
#     print("Embeddings model re-initialized.")


# if os.path.exists(index_path) and embeddings_model:
#     print(f"\nLoading FAISS index from: {index_path}")
#     vector_store_loaded = FAISS.load_local(
#         index_path,
#         embeddings_model,
#         allow_dangerous_deserialization=True
#     )
#     print("FAISS Index loaded from disk.")
# else:
#     print(f"FAISS index not found at {index_path} or embeddings model not ready. Please run Cell 6 first or ensure embeddings_model is initialized.")

# Determine which vector store to use: the newly created one or the loaded one
# If Cell 6 was just run, vector_store will be populated.
# If you are in a new session and intend to load, vector_store_loaded would be used.
# For simplicity in a single pass notebook, we'll primarily use the one from Cell 6 if it ran.
if 'vector_store' in locals() and vector_store is not None:
    vector_store_to_use = vector_store
    print("Using newly created/updated vector store for Q&A.")
# elif vector_store_loaded is not None:
#     vector_store_to_use = vector_store_loaded
#     print("Using loaded vector store from disk for Q&A.")
else:
    vector_store_to_use = None
    print("Error: No vector store available for Q&A. Run Cell 6 or ensure Cell 7 loads successfully.")

Using newly created/updated vector store for Q&A.


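The cell above makes reruns easier: it selects the in-memory vector store for Q&A, and the commented-out block can reload a saved index from disk in a fresh session.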

In [78]:
if vector_store_to_use:
    retriever = vector_store_to_use.as_retriever(search_kwargs={"k": 3})

    print("\nInitializing LLM for Q&A...")
    llm = Claude2LLM(n=1, model_to_use="deepseek/deepseek-r1-0528:free")

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )

    print("\n--- Answering questions using RetrievalQA chain ---")

    questions_to_ask = [
        "What are the core components of an LLM-powered autonomous agent system as described by Lilian Weng?",
        "Explain the role of the 'temperature' setting in language model prompting.",
        "What is the base class for LLMs in LangChain?"
    ]

    for question in questions_to_ask:
        print(f"\nProcessing Question: {question}")
        try:
            print("--- Retrieving relevant documents ---")
            # retriever.invoke() replaces the deprecated get_relevant_documents()
            retrieved_docs = retriever.invoke(question)
            for i, doc in enumerate(retrieved_docs):
                print(f"  Retrieved Doc {i+1} (Source: {doc.metadata.get('source', 'N/A')}), Score (if available): {doc.metadata.get('score', 'N/A')}")
                print(f"    Content snippet: {doc.page_content[:150]}...")
            print("--- End of retrieved documents ---")

            result = qa_chain.invoke({"query": question})
            print(f"LLM Answer: {result['result']}")
            print("\nSource Documents actually used by LLM (might differ if chain does further processing):")
            for i, source_doc in enumerate(result['source_documents']): # These are the docs combined by 'stuff' chain
                print(f"  Source {i+1} (URL: {source_doc.metadata.get('source', 'N/A')}):")
                print(f"    Content snippet: {source_doc.page_content[:250]}...")
        except Exception as e:
            print(f"An error occurred while processing question '{question}': {e}")
            import traceback
            traceback.print_exc() # Prints full traceback for the error
        print("-" * 40)
else:
    print("\nSkipping Q&A as vector store is not available.")

print("\n--- End of Notebook Execution ---")


Initializing LLM for Q&A...

--- Answering questions using RetrievalQA chain ---

Processing Question: Who stole the Philosopher's stone
--- Retrieving relevant documents ---


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  Retrieved Doc 1 (Source: https://hazidesaratcollege.ac.in/library/uploads/85jkr_harrypotter_1.pdf), Score (if available): N/A
    Content snippet: [unreadable binary PDF stream data]...
  Retrieved Doc 2 (Source: https://hazidesaratcollege.ac.in/library/uploads/85jkr_harrypotter_1.pdf), Score (if available): N/A
    Content snippet: startxref
963612
%%EOF
1 0 obj<</CropBox[0 0 595 842]/Annots 1398 0 R/Parent 1264 0 R/Contents 3 0 R/Rotate 0/MediaBox[0 0 595 842]/Thumb 757 0 R/R...
  Retrieved Doc 3 (Source: https://hazidesaratcollege.ac.in/library/uploads/85jkr_harrypotter_1.pdf), Score (if available): N/A
endobj...j<</Parent 1309 0 R/A 1321 0 R/Next 1316 0 R/Prev 1320 0 R/Title(CHAPTER 13 - Nicolas Flamel)>>
--- End of retrieved documents ---


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

LLM Answer: Based solely on the provided context, it is impossible to determine who stole the Philosopher's Stone.

Here's why:

1.  **The Context is Fragmentary PDF Data:** The "pieces of context" you provided are garbled binary data and structural fragments extracted from a PDF file. This includes corrupted or unreadable streams, PDF object definitions (`obj`, `endobj`, `stream`, `endstream`), and metadata tags (like `Subject`, `Author`, `Creator`, `Title`).
2.  **Title & Metadata Acknowledged, But Content Missing:** The metadata repeatedly identifies the document as "*Harry Potter, Book 1; The Sorcerer's Stone*", confirming the subject matter. There are also references to chapter titles like "CHAPTER 13 - Nicolas Flamel".
3.  **Crucial Plot Elements Not Present:** While the metadata confirms the document is *about* Harry Potter and the Sorcerer's Stone, the specific narrative content revealing *who stole* the Stone is **not contained in the provided fragments**. The actual story tex
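
The retrieval printout above shows `Score (if available): N/A` because the retriever does not appear to store scores in document metadata. LangChain vector stores expose `similarity_search_with_score`, which returns `(document, score)` pairs. The sketch below wraps that call; `scored_hits` and `FakeStore` are hypothetical, and the fake store exists only so the example runs without FAISS installed.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def scored_hits(store: Any, query: str, k: int = 3) -> List[Tuple[str, float, str]]:
    """Return (source, score, snippet) for the top-k hits of a vector store."""
    results = store.similarity_search_with_score(query, k=k)
    return [
        (doc.metadata.get("source", "N/A"), float(score), doc.page_content[:150])
        for doc, score in results
    ]


class FakeStore:
    """Hypothetical stand-in that mimics the (document, score) return shape."""

    def similarity_search_with_score(self, query: str, k: int = 4):
        docs = [
            (SimpleNamespace(page_content="alpha " * 40, metadata={"source": "doc-a"}), 0.12),
            (SimpleNamespace(page_content="beta", metadata={}), 0.57),
        ]
        return docs[:k]


hits = scored_hits(FakeStore(), "anything", k=2)
for source, score, snippet in hits:
    print(f"{source}  score={score:.2f}  {snippet[:30]}...")
```

In the notebook this would be called as `scored_hits(vector_store_to_use, question)`. With an L2-based FAISS index the score is presumably a distance, so lower values mean closer matches.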

In [80]:
# Final Q&A Cell (Single Shot)

my_question = "What are the key components of an LLM agent system?"
if 'qa_chain' in locals() and qa_chain is not None:
    if my_question and my_question.strip():
        print(f"\nProcessing Question: \"{my_question}\"")
        try:
            result = qa_chain.invoke({"query": my_question})

            print("\nLLM Answer:")
            answer_text = result.get('result', "No answer found or 'result' key missing in response.")
            print(answer_text)

        except Exception as e:
            print(f"An error occurred while processing your question: {e}")

        print("-" * 40)
    else:
        print("The variable 'my_question' is empty. Please set a question.")
else:
    print("Error: 'qa_chain' is not defined. Please run the setup cells that initialize 'qa_chain' first.")


Processing Question: "Who stole Neville's remebrall?"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


LLM Answer:
Based solely on the provided context, I cannot determine who stole Neville's Remembrall.

**Here's why:**

1.  **The context provided is not the text of the book:** The text you provided appears to be corrupted PDF metadata and structural data (like bookmarks and object references). It contains chapter titles (like "CHAPTER 11 - Quidditch") but **does not contain the actual narrative text** of "Harry Potter and the Sorcerer's Stone" where events like the theft of Neville's Remembrall are described.
2.  **No mention of the event or characters:** The context does not contain the words "Neville", "Remembrall", names of characters involved (like Malfoy), or any description of the flying lesson where the incident occurred.
3.  **Only chapter outlines:** The relevant section only lists chapter titles (e.g., Chapter 9 "The Midnight Duel", Chapter 10 "Halloween", Chapter 11 "Quidditch"). While the Remembrall scene happens *around* the flying lessons early in the book (relevant to 
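
Both recorded answers point to the same problem: the retrieved chunks are raw PDF internals rather than readable text, which suggests that a PDF URL was loaded as if it were an HTML page. One possible guard is to inspect the first bytes of a download before choosing a loader, since PDF files begin with the `%PDF-` header. The sketch below is a minimal version of that check; both helper names are hypothetical.

```python
def looks_like_pdf(payload: bytes) -> bool:
    """Return True if the payload starts with the PDF magic header."""
    return payload.lstrip().startswith(b"%PDF-")


def pick_loader_kind(payload: bytes) -> str:
    """Choose a loader family from the first bytes of a downloaded resource."""
    return "pdf" if looks_like_pdf(payload) else "html"


print(pick_loader_kind(b"%PDF-1.4\n1 0 obj"))        # pdf
print(pick_loader_kind(b"<!DOCTYPE html><html>"))    # html
```

For sources classified as `pdf`, a PDF-aware loader such as `PyPDFLoader` from `langchain_community.document_loaders` would likely extract page text instead of object streams, so the chunks indexed in FAISS would contain the narrative the questions ask about.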