# LangChain Demo: Raw PDF vs Markdown
This notebook demonstrates why raw PDFs perform poorly with LLMs by comparing:
1. **Raw PDF**: Basic PyPDF2 extraction with no preprocessing
2. **Markdown**: Proper preprocessing with chunking, embeddings, and RAG

The goal is to show that proper document formatting is essential for good LLM performance.


In [1]:
# Automatically reload modules when they change
# This ensures changes to QandAFunctions.py are picked up automatically
%load_ext autoreload
%autoreload 2


In [2]:
from QandAFunctions import initialize_environment

# Load environment variables from config.env
initialize_environment()


✓ Environment initialized: OPENAI_API_KEY loaded


In [3]:
from pathlib import Path
from langchain_core.prompts import PromptTemplate

# Setup main variables
PDF_PATH = 'report.pdf'
MARKDOWN_PATH = Path("result") / "report.md"
QUESTION = "In 2021, what is the impact of electric trains on the green bond asset portfolio, in millions of euros?"

## Set up PDF Reading (No Preprocessing)


In [4]:
from QandAFunctions import load_pdf

# Load PDF and setup processing chain
pdf_content = load_pdf(PDF_PATH)


## Questions for the PDF

In [5]:
from QandAFunctions import ask_pdf

print(f"\n{'='*80}")
print(f"QUESTION : {QUESTION}")
print('='*80)

pdf_response = ask_pdf(pdf_content, QUESTION)

print("\nPDF ANSWER (NO PREPROCESSING):")
# Both the error and the success case print the 'answer' field;
# .get() avoids a KeyError if that field is missing.
print(pdf_response.get('answer', 'No answer'))



QUESTION : In 2021, what is the impact of electric trains on the green bond asset portfolio, in millions of euros?

PDF ANSWER (NO PREPROCESSING):
Chunk 1: Based on the provided content, there is no information regarding the impact of electric trains on the green bond asset portfolio in millions of euros in 2021. The content focuses on the Climate Financial Risk Forum Guide 2023 and the Data, Disclosures, and Metrics Working Group.

Chunk 2: The impact of electric trains on the green bond asset portfolio in millions of euros is not mentioned in the provided content.

Chunk 3: Based on the provided content, there is no specific information or mention about the impact of electric trains on the green bond asset portfolio in millions of euros in 2021. Therefore, it is not possible to determine the impact of electric trains on the green bond asset portfolio from the given content.

Chunk 4: Based on the provided content, there is no specific information about the impact of electric trains 
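
The `Chunk N:` labels in the output above suggest that the extracted text is split into pieces and the question is asked of each piece separately. The source of `ask_pdf` is not shown in this notebook, so the following is only a minimal sketch of that pattern, not the actual implementation. The names `split_into_chunks`, `ask_each_chunk`, and `fake_llm`, as well as the 4000-character default, are hypothetical; `fake_llm` stands in for a real model call so the sketch runs without an API key.

```python
from typing import Callable, Dict, List


def split_into_chunks(text: str, chunk_size: int = 4000) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def ask_each_chunk(text: str, question: str, llm: Callable[[str], str],
                   chunk_size: int = 4000) -> Dict[str, str]:
    """Ask the same question of every chunk and join the labelled answers."""
    answers = []
    for number, chunk in enumerate(split_into_chunks(text, chunk_size), start=1):
        prompt = f"Content:\n{chunk}\n\nQuestion: {question}"
        answers.append(f"Chunk {number}: {llm(prompt)}")
    return {"answer": "\n\n".join(answers)}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always gives the same reply."""
    return "This is not mentioned in the provided content."


demo = ask_each_chunk("x" * 10, "What is the impact?", fake_llm, chunk_size=4)
print(demo["answer"])
```

Because each chunk is queried in isolation, a figure that sits in one chunk while its row or column label sits in another can go unanswered in every chunk, which is one plausible reason for the repeated "no information" replies.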

## Set up Markdown File Reading

In [6]:
from QandAFunctions import load_markdown

# Load markdown and setup RAG chain
# md_rag_chain, md_splits = load_and_setup_markdown_rag(MARKDOWN_PATH)
md_content = load_markdown(MARKDOWN_PATH)


## Questions for the Formatted Markdown

In [9]:
from QandAFunctions import ask_markdown

print(f"\n{'='*80}")
print(f"QUESTION : {QUESTION}")
print('='*80)

# md_response = ask_markdown(md_rag_chain, QUESTION, "Markdown")
md_response = ask_markdown(md_content, QUESTION)

print("\n📝 MARKDOWN ANSWER:")
# Display the full answer without clipping. Both the error and the success
# case print the 'answer' field; .get() avoids a KeyError if it is missing.
print(md_response.get('answer', 'No answer'))


QUESTION : In 2021, what is the impact of electric trains on the green bond asset portfolio, in millions of euros?

📝 MARKDOWN ANSWER:
Chunk 1: Based on the provided content, there is no information about the impact of electric trains on the green bond asset portfolio in millions of euros in 2021. Therefore, the answer to the question is not applicable or cannot be determined from the given content.

Chunk 2: Based on the provided content, there is no specific mention of the impact of electric trains on the green bond asset portfolio in millions of euros in 2021. Therefore, the answer to the question cannot be determined from the given information.

Chunk 3: Based on the provided content, there is no specific mention of the impact of electric trains on the green bond asset portfolio in millions of euros in 2021. Therefore, it is not possible to provide an answer to this question based on the given information.

Chunk 4: Based on the provided content, the impact of electric trains o
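
To compare the two runs without reading every reply, the per-chunk answers can be split apart and the "nothing found" replies counted. This is a rough sketch that assumes the answer string uses the `Chunk N:` labels seen in the outputs above. The helper names and the keyword list are hypothetical, and keyword matching is only a heuristic for spotting such replies.

```python
import re
from typing import Dict

# Phrases taken from the replies printed above; a heuristic, not a full list.
NO_ANSWER_MARKERS = (
    "no information",
    "not mentioned",
    "no specific",
    "cannot be determined",
)


def split_chunk_answers(answer: str) -> Dict[int, str]:
    """Map each chunk number to the text that follows its 'Chunk N:' label."""
    parts = re.split(r"Chunk (\d+):", answer)
    # parts[0] is any text before the first label; after that, chunk numbers
    # and reply texts alternate.
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}


def count_unanswered(answer: str) -> int:
    """Count chunk replies that contain one of the 'nothing found' phrases."""
    replies = split_chunk_answers(answer).values()
    return sum(
        any(marker in reply.lower() for marker in NO_ANSWER_MARKERS)
        for reply in replies
    )


sample = "Chunk 1: There is no information here.\n\nChunk 2: The impact is stated."
print(split_chunk_answers(sample))
print(count_unanswered(sample))
```

Applied to `pdf_response['answer']` and `md_response['answer']`, the two counts give a quick, if coarse, way to see whether the Markdown input produced more usable replies than the raw PDF.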