# RAG Pipeline 🏛 Zuckerberg Congressional Hearing transcript | LlamaIndex 🦙

On March 25, 2021, the United States Congress called executives from Facebook, Twitter, and Google to testify at a hearing titled "Disinformation Nation: Social Media's Role in Promoting Extremism and Misinformation".

The transcript is in the `data` folder. The pipeline has four steps:

1. Extract data using `pdfplumber`
2. Convert to `LlamaIndex Documents`
3. Generate vector embeddings using `LlamaIndex` and `MiniLM`
4. Query! See what we can learn from all this

In [None]:
%pip install llama-index openai pdfplumber pandas sentence-transformers llama-index-embeddings-huggingface



#### Here's a sample snippet of the document:

![image](data/transcript-sample.png)

## 🧹 Step 1: Extract and Clean Data 🧼

#### 1️⃣ Line 2️⃣ Numbers 3️⃣

Each line of the transcript is numbered. Human readers skim past these numbers and occasionally find them useful. To LLMs they are very 22     harmful to the 23      semantic meaning 24. So we strip them with a regular expression.

In [1]:
import re

def clean_line_numbers(text: str) -> str:
    """Remove leading numeric indices from each line."""
    return re.sub(r"^\s*\d+\s+", "", text, flags=re.MULTILINE)
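
A quick check of the line-number regex on a small sample may help. The sample text below is made up to mimic the transcript layout and is not quoted from the PDF. The second call shows a caveat worth knowing: a line whose own text begins with a number loses that number as well.

```python
import re

def clean_line_numbers(text: str) -> str:
    """Remove leading numeric indices from each line."""
    return re.sub(r"^\s*\d+\s+", "", text, flags=re.MULTILINE)

# Made-up sample in the transcript's numbered-line layout
sample = (
    "22 *Mr. Zuckerberg. Thank you, Chairman.\n"
    "23 We take this seriously.\n"
    "24 I look forward to your questions."
)
cleaned = clean_line_numbers(sample)
print(cleaned)

# Caveat: an unnumbered line that starts with a number is trimmed too
caveat = clean_line_numbers("2021 was a difficult year.")
print(caveat)
```

For this transcript every line carries a number, so the caveat rarely bites, but it is worth remembering before reusing the helper on other documents.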

#### Speaker Name Extraction 🚜

The document marks speaker turns like this:

When a paragraph starts with `"*Mr. Zuckerberg."`, all the following paragraphs are Mark Zuckerberg's until the next speaker marker appears.

There is one special case: `*The Chairman` refers to Frank Pallone, who is the chairman of the committee.

This makes it easy to extract the speaker name with a regular expression and store it in metadata.

Here is how we extract the speaker's last name from a line:

In [None]:
import re
from typing import Optional

def extract_speaker(line: str) -> Optional[str]:
    """Return the speaker's last name if the line opens a new turn, else None."""
    line = line.strip()
    # Special case: "*The Chairman" is Frank Pallone
    if re.match(r"^\*The Chairman", line, flags=re.IGNORECASE):
        return "Pallone"

    # Turn markers look like "*Mr. Zuckerberg." (title, then last name)
    m = re.match(r"^\*([A-Za-z\.\-']+ [A-Za-z\.\-']+)", line)
    if m:
        full_name = m.group(1).strip()
        # return only the last name, without the Mr. or Ms. prefix
        return full_name.split()[-1].rstrip('.')
    # no speaker marker on this line
    return None
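
To see the three cases side by side (a titled speaker, the chairman special case, and a continuation line), here is a small self-contained check. The function is repeated so the snippet runs on its own, and the example lines are invented for illustration rather than quoted from the transcript.

```python
import re
from typing import Optional

def extract_speaker(line: str) -> Optional[str]:
    """Return the speaker's last name if the line opens a new turn, else None."""
    line = line.strip()
    if re.match(r"^\*The Chairman", line, flags=re.IGNORECASE):
        return "Pallone"
    m = re.match(r"^\*([A-Za-z\.\-']+ [A-Za-z\.\-']+)", line)
    if m:
        return m.group(1).strip().split()[-1].rstrip(".")
    return None

# Invented example lines in the transcript's marker style
examples = [
    "*Mr. Zuckerberg. Thank you.",
    "*The Chairman. The committee will come to order.",
    "*Ms. Rodgers. Thank you, Mr. Chairman.",
    "We are here today to talk about disinformation.",
]
results = [extract_speaker(line) for line in examples]
print(results)
```

Lines without a leading `*` return `None`, which is what lets the chunker treat them as a continuation of the current speaker's turn.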

## 🦙 Step 2: Convert to LlamaIndex documents 🦙

A `LlamaIndex` Document is the core data unit that LlamaIndex uses for 
indexing and retrieval. It holds both the raw text content and optional 
metadata (like speaker attribution, source, page number, or timestamps) to preserve context.

Converting the raw text into Document objects allows LlamaIndex to:
- Split and preprocess text intelligently (e.g., chunking, cleaning)
- Track source attribution for retrieved passages
- Embed, index, and query across data types uniformly

### 🧱 Step 2.1: Chunking 🧩

LLMs can only make sense of so much text at a time, so we split the transcript into chunks. In typical writing, paragraphs are a great unit for chunking: a well-written paragraph addresses a single topic or idea. With a transcript, it makes sense to chunk by speaker turn.

Our documents will look like this:

```
Document(
  metadata: speaker name
  text: quote from the transcript
)
```

Here's our helper function for chunking the document:

In [3]:
import re

from llama_index.core import Document

def chunk_transcript(text: str) -> list[Document]:
    """Split text into chunks per speaker turn with metadata."""
    lines = text.split("\n")
    chunks = []
    current_chunk = []
    current_speaker = None

    for line in lines:
        line = line.strip()
        if not line:
            continue

        speaker = extract_speaker(line)
        if speaker:
            # start a new chunk if current has content
            if current_chunk:
                chunks.append(
                    Document(
                        text="\n".join(current_chunk).strip(),
                        metadata={"speaker": current_speaker}
                    )
                )
                current_chunk = []
            current_speaker = speaker
            # remove the speaker marker from the text line
            if line.startswith("*"):
                # strip a leading "*Mr. Speaker." or "*Speaker:" marker
                line = re.sub(r"^\*\s*[^ ]+(?:\s+[^\.:\s]+)*[\.:]\s*", "", line)

        current_chunk.append(line)

    # append last chunk
    if current_chunk:
        chunks.append(
            Document(
                text="\n".join(current_chunk).strip(),
                metadata={"speaker": current_speaker}
            )
        )

    return chunks
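
To see what speaker-turn chunks look like without loading the PDF or installing LlamaIndex, here is a simplified, dependency-free sketch of the same grouping idea. It is not the helper above: `split_turns` and the `MARKER` pattern are hypothetical names introduced for illustration, the pattern only handles `Mr.`, `Ms.`, and `Mrs.` markers, plain dictionaries stand in for `Document` objects, and the sample transcript is made up.

```python
import re

# Hypothetical, simplified marker pattern: "*Mr. Name." at the start of a line
MARKER = re.compile(r"^\*(?:Mr|Ms|Mrs)\. ([A-Za-z\-']+)\.\s*")

def split_turns(text: str) -> list[dict]:
    """Group lines into one dict per speaker turn."""
    turns: list[dict] = []
    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        m = MARKER.match(line)
        if m:
            # A marker opens a new turn; keep only the words after it
            turns.append({"speaker": m.group(1), "lines": [line[m.end():]]})
        elif turns:
            # No marker: the line continues the current speaker's turn
            turns[-1]["lines"].append(line)
        else:
            # Text before the first marker has no known speaker
            turns.append({"speaker": None, "lines": [line]})
    return [
        {"speaker": t["speaker"], "text": "\n".join(t["lines"]).strip()}
        for t in turns
    ]

# Made-up sample in the transcript's style
sample = (
    "*Mr. Doyle. The chair recognizes Mr. Zuckerberg.\n"
    "*Mr. Zuckerberg. Thank you.\n"
    "We take this seriously.\n"
    "*Mr. Doyle. The gentleman yields back."
)
turns = split_turns(sample)
for turn in turns:
    print(turn)
```

Note that the mention of "Mr. Zuckerberg" inside the first turn does not start a new chunk, because only a `*` at the start of a line counts as a marker.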

### Step 2.2: Process PDF

1. Load the PDF using `pdfplumber`
2. Strip line numbers
3. Chunk (extract speaker name, create `LlamaIndex` Document)

In [8]:
import re
import pdfplumber
import pandas as pd

# --- Extract text from PDF ---
pdf_path = "data/disinformation-nation-transcript-20210325.pdf"
all_text = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""
        if text.strip():
            text = clean_line_numbers(text)
            all_text.append(text)

full_text = "\n".join(all_text)

# --- Chunk by speaker ---
documents = chunk_transcript(full_text)
print(f"Created {len(documents)} speaker-based chunks.")

Created 768 speaker-based chunks.


In [7]:
# --- Example usage: inspect two consecutive chunks ---
print(documents[11].metadata)
print(documents[11].text[:300], "...")
print(documents[12].metadata)
print(documents[12].text[:300], "...")

{'speaker': 'Pallone'}
Thank you, Chairman Doyle and
Schakowsky, for this very important hearing. We are here
today because the spread of disinformation and extremism has
been growing online, particularly on social media, where
there are little to no guardrails in place to stop it.
And unfortunately, this disinformation a ...
{'speaker': 'Doyle'}
The gentleman yields back. The chair now
recognizes Ms. Rodgers, the ranking member of the full
committee, for five minutes for her opening statement. ...


## 🤖 Step 3: Generate Vector Embeddings 🤖

Vector embeddings let us search the transcript by meaning. Each chunk is mapped to a list of numbers, and chunks with similar meaning end up close together.

You can find more on this at the end of the document if you want. For now, here is what you need to know.

`VectorStoreIndex` does two things:
1. Creates a vector embedding of each document using `all-MiniLM-L6-v2`
2. Organizes these embeddings into a structure optimized for semantic search
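
To make "semantic search" a little more concrete, here is a toy sketch of ranking chunks by cosine similarity, a common way to compare embeddings. The three-dimensional vectors and chunk labels are hand-made for illustration and are not real model output; `all-MiniLM-L6-v2` produces 384-dimensional vectors, and the index handles this comparison for you.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy embeddings, one per pretend chunk
chunks = {
    "vaccine misinformation": [0.9, 0.1, 0.0],
    "election misinformation": [0.6, 0.6, 0.1],
    "hearing logistics": [0.0, 0.1, 0.9],
}

# Pretend embedding of a question about vaccines
query_vector = [1.0, 0.2, 0.0]

# Rank chunk labels from most to least similar to the query
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query_vector, chunks[name]),
    reverse=True,
)
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query ranks first, even though no keywords are compared at any point.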


In [9]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# downloads the model from huggingface and caches it
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
hearing_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

## ❓ Step 4: Query our new knowledge base ❓

In [13]:
# --- 4️⃣ Query ---
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini")
query_engine = hearing_index.as_query_engine(llm=llm)

In [16]:
from IPython.display import Markdown

query = "Summarize the hearing"
response = query_engine.query(query)

Markdown(f"### 🧠 Let me think!\n\n{response}")

### 🧠 Let me think!

A joint remote hearing of the House Subcommittee on Communications and Technology and the Subcommittee on Consumer Protection and Commerce, titled "Disinformation Nation: Social Media's Role in Promoting Extremism and Misinformation," opened with procedural directions and an extended opening statement from the chair.

Key points:
- Logistics: hearing held by videoconference because of COVID-19; participants must unmute to speak and be visible to be recognized; a 15-minute recess was scheduled around 3:00 PM; documents for the record were to be submitted to designated staff email addresses.
- Chair's opening: social media platforms were described as having two faces: everyday uses (family, friends, quirky videos) alongside persistent extremist, conspiratorial, and misinformation content. The chair argued that platform recommendation and ranking systems steer users toward harmful content.
- January 6 and extremism: this was the first appearance of the three witnesses since the January 6 attack. The chair characterized that attack as planned and nourished on social media, with FBI materials and research cited showing use of platforms for planning, recruitment, and coordination.
- Misinformation scale: cited research figures were given for large exposures to false information: 1.1 billion exposures to election-related misinformation on one platform in the prior year, and an estimated 3.8 billion exposures to COVID-19 disinformation across five countries on that platform.
- Public-health impact: with hundreds of thousands of U.S. COVID deaths noted, the chair warned that ongoing vaccine misinformation on social platforms is undermining vaccination efforts and contributing to vaccine refusal.
- Procedure note: witnesses are permitted to have counsel present and members were reminded they may mute themselves while conferring.

Overall message: the chair framed social media as having enabled widespread misinformation and extremist organizing with real-world harms (including the Capitol attack and public-health consequences), and called out platforms for failing to protect users.

#### That looks great! Now let's get more specific

In [None]:
# Define a filter to retrieve only chunks from a specific speaker

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="speaker", value="Zuckerberg")]
)

# Create a query engine with the defined filters
zuckerberg_query_engine = hearing_index.as_query_engine(filters=filters)

# Query the engine
zuckerberg_response = zuckerberg_query_engine.query("Detail his thoughts on hate speech with direct quotes")

Markdown(f"### 🧠 Let me think about Zuckerberg!\n\n{zuckerberg_response}")


### 🧠 Let me think about Zuckerberg!

Zuckerberg emphasized the challenge in identifying hate speech due to the nuanced difference between racist remarks and statements denouncing racism. He mentioned the need to build systems to handle hate speech content in multiple languages globally. Additionally, he acknowledged the difficulty in distinguishing between hateful speech and speech denouncing hate when enforcing hate speech policies.
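
One thing to keep in mind with exact-match filters: the value has to equal what was stored in the metadata. `extract_speaker` stores only the last name, so `"Zuckerberg"` matches and `"Mr. Zuckerberg"` matches nothing. The toy sketch below shows the idea in plain Python. It is not LlamaIndex's implementation, the `exact_match` helper is a hypothetical name, and the chunk texts are made up.

```python
def exact_match(chunks: list[dict], key: str, value: str) -> list[dict]:
    """Keep only the chunks whose metadata value equals `value` exactly."""
    return [c for c in chunks if c["metadata"].get(key) == value]

# Made-up chunks shaped like our documents: text plus speaker metadata
chunks = [
    {"text": "The chair recognizes the witness.", "metadata": {"speaker": "Doyle"}},
    {"text": "Thank you for the question.", "metadata": {"speaker": "Zuckerberg"}},
    {"text": "We remove content that breaks our rules.", "metadata": {"speaker": "Zuckerberg"}},
]

matched = exact_match(chunks, "speaker", "Zuckerberg")
missed = exact_match(chunks, "speaker", "Mr. Zuckerberg")

print(len(matched), "chunks match 'Zuckerberg'")
print(len(missed), "chunks match 'Mr. Zuckerberg'")
```

If a filtered query comes back empty, printing a few `documents[i].metadata` values is a quick way to confirm how the speaker names were actually stored.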

In [None]:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

# The same kind of filter, written with an explicit operator.
# Metadata stores last names only, so the value is "Zuckerberg", not "Mr. Zuckerberg".
speaker_filter = MetadataFilter(
    key="speaker",
    value="Zuckerberg",
    operator=FilterOperator.EQ
)

# Filters are set when the query engine is built; query() takes only the query itself
filtered_query_engine = hearing_index.as_query_engine(
    llm=llm,
    similarity_top_k=5,
    filters=MetadataFilters(filters=[speaker_filter])
)

response = filtered_query_engine.query("Summarize Zuckerberg's responses about privacy")
print(response)

In [None]:
# Match several speakers at once with the IN operator.
# "The Chairman" is stored as "Pallone" by extract_speaker.
# Assumption: the vector store in use supports IN; operator support varies by store and version.
speaker_filter = MetadataFilter(
    key="speaker",
    value=["Zuckerberg", "Pallone"],
    operator=FilterOperator.IN
)

multi_speaker_query_engine = hearing_index.as_query_engine(
    llm=llm,
    filters=MetadataFilters(filters=[speaker_filter])
)

response = multi_speaker_query_engine.query("Summarize Zuckerberg's responses about privacy")
print(response)

## 🛑 Try that again 🔄

The numbers I spot-checked were accurate, but 147,000 million is hard to understand. Let's have it convert them to billions.

In [31]:
from IPython.display import Markdown

query = "Breakdown the sales numbers changes for me. Use billions instead of millions."
response = query_engine.query(query)

Markdown(f"### 🧠 Sales Numbers\n\n{response}")

### 🧠 Sales Numbers

Three months ended June 30 (Q2)
- North America: $90.033 bn → $100.068 bn; +$10.035 bn (+11.2%)
- International: $31.663 bn → $36.761 bn; +$5.098 bn (+16.1%)
- AWS: $26.281 bn → $30.873 bn; +$4.592 bn (+17.5%)
- Consolidated: $147.977 bn → $167.702 bn; +$19.725 bn (+13.3%)

Net sales by type (Q2)
- Net product sales: $61.569 bn → $68.246 bn; +$6.677 bn (+10.8%)
- Net service sales: $86.408 bn → $99.456 bn; +$13.048 bn (+15.1%)

Six months ended June 30
- North America: $176.374 bn → $192.955 bn; +$16.581 bn (+9.4%)
- International: $63.598 bn → $70.274 bn; +$6.676 bn (+10.5%)
- AWS: $51.318 bn → $60.140 bn; +$8.822 bn (+17.2%)
- Consolidated: $291.290 bn → $323.369 bn; +$32.079 bn (+11.0%)

Net sales by type (six months)
- Net product sales: $122.484 bn → $132.216 bn; +$9.732 bn (+7.9%)
- Net service sales: $168.806 bn → $191.153 bn; +$22.347 bn (+13.2%)

# üèÅüèÅüèÅ We're done! üèÅüèÅüèÅ

This was impossible to do just a matter of months ago. If you downgrade to gpt-4o-mini, you get a virtually unusable output. All of the numbers I spot checked were accurate.

The paper path that lead us here:

| Date | Paper | Lab | Description |
| --- | --- | --- | --- |
| June 2017 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | Google | The Transformer paper. It laid the foundation for LLMs |
| October 2018 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | Google | Uses the Transformer architecture to derive the semantic meaning of text |
| August 2019 | [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) | UKPLab | Cuts the time to find the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds. The `all-MiniLM-L6-v2` model we used builds on this approach |




## 📦 Package notes 📦

`llama-index`
- Open Source project to help with the RAG Pipeline

`llama-index-embeddings-huggingface`
- creates vector embeddings of text
- wraps embedding model (in this case... `sentence-transformers/all-MiniLM-L6-v2`)

`pdfplumber`
- https://github.com/jsvine/pdfplumber
- Built by data journalist jsvine
- Builds on the pdfminer.six parsing engine

`pandas`
- Data analysis tool
- Handles structured and labeled data

`sentence-transformers`
- UKPLab (Ubiquitous Knowledge Processing Lab) at TU Darmstadt, Germany
- Turns sentences (or paragraphs or pages) into embeddings
- The model `all-MiniLM-L6-v2` does this very fast; read the Sentence-BERT paper above to see how
