# Making History Accessible: Exploring the MLK Assassination Declassified Records

## Introduction

Dr. Martin Luther King Jr.'s legacy is one of courage, justice, and transformation. The recently declassified records surrounding his assassination (now hosted by the National Archives) are a vital part of the historical record. These documents provide important insights into a pivotal moment in American history and the civil rights movement.

In this notebook, we'll explore how modern AI and data processing technologies can make these historical documents more accessible and searchable, enabling researchers, educators, journalists, and the public to engage with this important material.

## Historical Context and Significance

Dr. Martin Luther King Jr. was assassinated on April 4, 1968, in Memphis, Tennessee. His assassination had profound impacts on the civil rights movement and the nation as a whole.

The declassified records surrounding his assassination provide valuable insights into:

- The investigation conducted by various government agencies
- The evidence collected and analyzed
- The broader historical and social context of the time

By making these records more accessible through modern AI and data processing technologies, we can help ensure that this important historical information is preserved and available for future generations to study and learn from.

## The Unstructured Platform and Document Processing Workflow

Before we build our question-answering application, let's understand the data processing workflow that makes this possible.

### How the MLK Records Were Prepared for Search

> *Note: The steps below were completed prior to this notebook. You do not need to rerun them—they’re included here to explain how the records were made searchable.*

The declassified MLK assassination records were processed using the **Unstructured platform** in a multi-step ETL pipeline to make them AI-ready and searchable:

---

#### **Step 1: Document Ingestion into Amazon S3**

- Original documents—including PDFs, images, and other file types—were streamed from the National Archives to **Amazon S3**, providing secure and scalable cloud storage.

---

#### **Step 2: Document Processing with Unstructured**

The Unstructured platform processed each document through a series of enrichment steps:

1. **VLM Partitioning**  
   Vision-Language Models (VLMs) segmented each document into meaningful sections, preserving layout and context. Most documents were scanned images of typed pages, which makes conventional OCR unreliable, so VLMs were chosen for partitioning.

2. **Title-Based Chunking**  
   Documents were split into semantically coherent chunks using structural cues (like section headers) to improve context retention.

3. **Named Entity Recognition (NER)**  
   Entities such as people, organizations, locations, and dates were extracted to enhance downstream filtering and relevance.

4. **Vector Embedding**  
   Each chunk was embedded using OpenAI’s `text-embedding-3-large` model (3072 dims), enabling semantic similarity search.
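
To make the title-based chunking step more concrete, here is a minimal sketch in plain Python. It is not the Unstructured platform's implementation: the `Element` class, the `chunk_by_title` function, the `max_chars` limit, and the placeholder text are all hypothetical and exist only to illustrate grouping partitioned elements under the section header that precedes them.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical partitioned element: a category label plus its text."""
    category: str  # e.g. "Title" or "NarrativeText" (assumed labels)
    text: str


def chunk_by_title(elements, max_chars=500):
    """Group elements into chunks.

    A new chunk starts at every Title element, and also whenever adding
    the next element would push the current chunk past max_chars.
    """
    chunks, current = [], []
    for el in elements:
        current_len = sum(len(e.text) for e in current)
        starts_section = el.category == "Title"
        too_long = current_len + len(el.text) > max_chars
        if current and (starts_section or too_long):
            chunks.append("\n".join(e.text for e in current))
            current = []
        current.append(el)
    if current:
        chunks.append("\n".join(e.text for e in current))
    return chunks


# Placeholder elements (not taken from the records)
elements = [
    Element("Title", "SECTION A"),
    Element("NarrativeText", "First paragraph of section A."),
    Element("NarrativeText", "Second paragraph of section A."),
    Element("Title", "SECTION B"),
    Element("NarrativeText", "Only paragraph of section B."),
]
chunks = chunk_by_title(elements)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each header together with the text beneath it is what lets a retrieved chunk carry its own context into the prompt.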

---

#### **Step 3: Indexing in Elasticsearch**

- The enriched document chunks—with metadata and vector embeddings—were indexed into **Elasticsearch**, enabling:
  - Fast full-text and semantic (vector) search  
  - Metadata-based filtering and sorting  
  - Scalable querying across large document sets
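
As a rough illustration of what such an index might look like, the sketch below builds an Elasticsearch mapping and a kNN query body as plain Python dictionaries. The actual mapping of the MLK index is not shown in this notebook, so this is an assumption: the field names `text` and `embeddings` are borrowed from the connection code later in the notebook, the `cosine` similarity setting and the `metadata` object field are guesses, and the helper names are hypothetical.

```python
import json

EMBEDDING_DIMS = 3072  # output size of text-embedding-3-large, as stated above


def build_index_mapping(dims=EMBEDDING_DIMS):
    """Return an index body with a text field and a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # full-text search
                "embeddings": {            # vector (semantic) search
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",  # assumed metric
                },
                "metadata": {"type": "object"},  # filename, page, entities, ...
            }
        }
    }


def build_knn_query(query_vector, k=5, num_candidates=50):
    """Return a search body asking for the k nearest chunks to query_vector."""
    return {
        "knn": {
            "field": "embeddings",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        }
    }


mapping = build_index_mapping()
print(json.dumps(mapping, indent=2))
```

Both dictionaries could be passed to the official Elasticsearch client (for index creation and search respectively), but nothing in this sketch contacts a cluster.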

---

This end-to-end pipeline transformed the raw historical documents into a searchable, structured knowledge base—optimized for natural language queries and intelligent retrieval. Unstructured made it possible to transform 243,496 pages of grainy text in a single day, with minimal engineering effort. 

### Benefits of This Approach

This workflow offers several advantages for historical document collections:

- **Preservation of Context**: The intelligent partitioning and chunking preserve the document's original structure and context.

- **Enhanced Searchability**: Both keyword and semantic search capabilities make it easier to find relevant information.

- **Metadata Enrichment**: Named entity recognition adds valuable metadata that can be used for filtering and organization.

- **Accessibility**: Makes historical documents more accessible to researchers, educators, and the public.

- **AI-Readiness**: The processed data is ready for use in advanced AI applications, including RAG (Retrieval-Augmented Generation) systems.

## Building a Question-Answering System for the MLK Assassination Records

Now, let's build a Retrieval-Augmented Generation (RAG) application with LangChain that queries the Elasticsearch index and answers questions about the declassified MLK assassination records.

### Setting Up the Environment

First, let's install the necessary packages:

In [None]:
# Install required packages
%pip install langchain langchain-elasticsearch langchain-openai langchain-anthropic python-dotenv elasticsearch

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load environment variables
load_dotenv()

False

### Connecting to Elasticsearch

We'll connect to the Elasticsearch instance where the processed MLK assassination records are stored:

In [None]:
# Elasticsearch connection settings
ELASTICSEARCH_HOSTS = os.getenv("ELASTICSEARCH_HOSTS", "your-host-url")
ELASTICSEARCH_API_KEY = os.getenv("ELASTICSEARCH_API_KEY", "your-elasticsearch-api-key")
ELASTICSEARCH_INDEX_NAME = os.getenv("ELASTICSEARCH_INDEX_NAME", "mlk-archive-public")

# API key for OpenAI (used for both embeddings and LLM)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-openai-api-key")

# Embedding model; it must match the model used at indexing time
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    openai_api_key=OPENAI_API_KEY,
)

# Vector store backed by the existing index; authentication uses `es_api_key`
es_store = ElasticsearchStore(
    es_url=ELASTICSEARCH_HOSTS,
    index_name=ELASTICSEARCH_INDEX_NAME,
    es_api_key=ELASTICSEARCH_API_KEY,
    embedding=embeddings,
    vector_query_field="embeddings",
    query_field="text",
)


# Create a retriever
retriever = es_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Retrieve top 5 most relevant documents
)
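
To show what a `"similarity"` search with `k=5` does conceptually, here is a small brute-force sketch. It is an illustration only: Elasticsearch typically uses an approximate nearest-neighbor index rather than scanning every vector, the cosine metric is an assumption about how the index is configured, and the toy 3-dimensional vectors and chunk names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id)
         for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (real ones have 3072 dimensions)
doc_vecs = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
print(ranked)  # ['chunk-a', 'chunk-b']
```

The retriever does the same kind of ranking: it embeds the question with the same model used for the chunks, then returns the chunks whose vectors lie closest to it.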

### Creating a Prompt

When dealing with sensitive historical material like the MLK assassination records, it's important to create a prompt that:

1. Respects the historical significance of the material
2. Provides accurate information based on the documents
3. Acknowledges the limitations of the available information
4. Avoids speculation beyond what's in the documents

In [None]:
# Define a custom prompt template
template = """
You are a respectful and knowledgeable assistant helping to provide information about the declassified MLK assassination records.

These documents are an important part of American history and the civil rights movement. Please treat this subject with the appropriate gravity and respect.

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

### Building the RAG Application

Now we'll create a Retrieval-Augmented Generation (RAG) application that:

1. Takes a user question about the MLK assassination records
2. Retrieves relevant document chunks from Elasticsearch
3. Uses the retrieved context to generate an accurate, respectful answer

In [None]:
# Always use OpenAI for the LLM
llm = ChatOpenAI(
    model_name="gpt-4o",
    temperature=0.1,  # Low temperature for more factual responses
    openai_api_key=OPENAI_API_KEY
)
print("Using OpenAI GPT-4o model for RAG responses")

# Create the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)

Using OpenAI GPT-4o model for RAG responses
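
The `"stuff"` chain type simply places all retrieved chunks into the prompt's `{context}` slot before calling the model. The sketch below imitates that step with plain string formatting. It is a simplification rather than LangChain's code: `stuff_prompt`, the shortened `demo_template`, and the placeholder chunks are hypothetical, and the double-newline separator is assumed to match LangChain's default.

```python
def stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Shortened stand-in for the notebook's prompt template
demo_template = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Placeholder chunks (not taken from the records)
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]

filled = stuff_prompt(demo_template, retrieved, "Example question?")
print(filled)
```

Because every chunk is sent in a single prompt, the number of retrieved chunks (`k`) and the chunk size together determine how much of the model's context window each question consumes.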


### Interactive Question Answering

Let's create a simple interface to ask questions about the MLK assassination records:

In [None]:
def ask_question(question):
    """
    Ask a question about the MLK assassination records and get an answer
    with source citations.
    """
    result = rag_chain.invoke({"query": question})
    
    # Extract the answer and source documents
    answer = result["result"]
    source_docs = result["source_documents"]
    
    # Print the answer
    print("Answer:")
    print("-" * 80)
    print(answer)
    print("-" * 80)
    
    # Print source information
    print("\nSources:")
    print("-" * 80)
    for i, doc in enumerate(source_docs):
        print(f"Source {i+1}:")
        print(f"  Text: {doc.page_content[:150]}...")
        if hasattr(doc, 'metadata') and doc.metadata:
            if 'filename' in doc.metadata:
                print(f"  Document: {doc.metadata['filename']}")
            if 'page_number' in doc.metadata:
                print(f"  Page: {doc.metadata['page_number']}")
        print()

### Example Questions

Here are some example questions you might ask about the MLK assassination records:

In [None]:
# Example 1: General information about the investigation
ask_question("What agencies were involved in investigating Dr. King's assassination?")


Answer:
--------------------------------------------------------------------------------
The agencies involved in investigating Dr. Martin Luther King Jr.'s assassination were the Federal Bureau of Investigation (FBI) and the Department of Justice. Additionally, local law enforcement, such as the Memphis Police Department, was involved in the immediate response and evidence collection following the assassination.
--------------------------------------------------------------------------------

Sources:
--------------------------------------------------------------------------------
Source 1:
  Text: Prefix: This chunk appears in the middle section of a December 1978 Summary of Findings and Recommendations report from the U.S. House Select Committe...
  Document: 00387609_final_report_of_the_selec_104-10322-10090.pdf
  Page: 10

Source 2:
  Text: Prefix: This chunk appears in the middle section of a 1979 House Select Committee on Assassinations report to Congress summarizing their findi

In [None]:
# Example 2: Information about James Earl Ray
ask_question("What information do the declassified records contain about James Earl Ray?")

Answer:
--------------------------------------------------------------------------------
The declassified records contain several pieces of information about James Earl Ray, the assassin of Dr. Martin Luther King Jr. These records include:

1. **CIA Memorandum (1975)**: This document indicates that the CIA conducted a file check on James Earl Ray in response to an inquiry from the Senate Select Committee. The file primarily consisted of news media material, State Department cables about his extradition from England, and memoranda regarding his first lawyer, Arthur J. Hanes. The CIA had no file on Ray prior to the assassination of Dr. King.

2. **FBI Wanted Notice and Criminal Record (1968)**: This document details Ray's criminal history from 1953 to 1966, including arrests for robbery, burglary, larceny, forgery, and other charges. It also notes his addition to the FBI's "Ten Most Wanted Fugitives" list following the assassination.

3. **CIA Document (1968)**: This document requests in

In [None]:
# Example 3: Timeline of events
ask_question("What was the timeline of events on April 4, 1968, according to the records?")

Answer:
--------------------------------------------------------------------------------
On April 4, 1968, the timeline of events according to the records is as follows:

- **1:00 a.m.**: Dr. King's brother, Rev. A.D. King, along with Mrs. Georgia M. Davis and Mrs. Lucie Ward, arrived in Memphis and registered at the Lorraine Motel.

- **4:30 a.m.**: Dr. King, along with Reverends Ralph Abernathy and Bernard Lee, returned to the Lorraine Motel from a strategy meeting and visited with his brother and others in room 207 until about 5:00 a.m.

- **5:00 a.m.**: Dr. King returned to room 306, where he and Rev. Abernathy were registered.

- **8:00 a.m.**: A strategy meeting was scheduled in room 306.

- **8:30 a.m.**: Solomon Jones, Jr., Dr. King's chauffeur, returned to the Lorraine Motel to take Dr. King to court, but Rev. Andrew Young went instead.

- **1:30 p.m.**: Dr. King visited Mrs. Davis in room 201 and was later joined by others, including his brother and SCLC staff.

- **5:30 p.m.

## Conclusion

In this notebook, we've:

1. Explored how the Unstructured platform processes historical documents through a multi-step workflow of partitioning, chunking, entity extraction, and embedding
2. Built a RAG application using LangChain to query the MLK assassination declassified records
3. Demonstrated how these technologies can make important historical documents more accessible

This approach can be applied to many other historical document collections, helping to preserve and make accessible our shared history.

The combination of modern document processing, vector embeddings, and large language models creates powerful tools for researchers, educators, journalists, and the public to engage with historical materials in new and meaningful ways.