In [7]:
import logging
import os
import re

import pandas as pd
from dotenv import find_dotenv, load_dotenv
from langchain import PromptTemplate
from langchain.chains import HypotheticalDocumentEmbedder, LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import Docx2txtLoader, JSONLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(message)s", level=logging.INFO
)
logger = logging.getLogger(__name__)




Guidelines:

•	introduce the context, what are the exonerations documents, why do we want to index officers
•	very briefly explain where we are in the process (e.g. we’ve done page classification, we have these transcripts, etc)

•	introduce the extraction method – do we want to address alternative approaches? 
    Maybe we discuss a regex-based solution just as a way of introducing our method and the strengths/weaknesses
•	describe the main problem: responses to multiple prompts, no automatic way to choose the best response
•	describe our solution with the summarizer

Exoneration documents, the records that formally vindicate individuals erroneously convicted of crimes, are a rich resource for wrongful conviction research. They are particularly revealing about the law enforcement personnel involved in such cases. However, these documents are both voluminous, with thousands of pages of text per case, and unstructured: they were printed and collected in legal file storage boxes.

We seek to make these collections searchable and useful for lawyers, advocates, and community members to better investigate patterns of police misconduct and corruption. In order to do so, we rely on a multi-stage process:

1. Metadata Compilation: We started by compiling a comprehensive CSV index. This structured approach forms the foundation of our file management system, enabling efficient file retrieval. The metadata we organize in this step includes:

    - file path and name
    - file type
    - sha1 content hash: we truncate this to create unique file identifiers
    - file size and number of pages
    - case ID: when we scanned the documents, we sorted them into directories by case ID. Here we extract and validate the directory name to add a structured case ID field to our metadata index.

2. Page classification: The documents in the collection are varied, representing all documents produced or acquired in the course of an exoneration case, with case timelines going back decades. After some internal review and discussions with Innocence Project lawyers, we narrowed our focus to three types of documents:

    - police reports: include mentions of officers involved in the arrest that led to the wrongful conviction, or related arrests.
    - transcripts: court transcripts, recorded by clerks of the court
    - testimonies: witness testimony

    [*Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval*](https://arxiv.org/abs/1502.07058) describes an effective approach for retrieving specific types of documents from disorganized collections: fine-tuning a pretrained convolutional neural network to label thumbnail images of document pages. In order to use this technique, we needed training data and a pretrained model.

3. To quickly assemble a training data set for our page classifier, we started by noticing that in many cases the file name indicated the document type. These documents were scanned by many people at different times, so we could not rely on this heuristic for comprehensive categorization of documents, but there was more than enough there to jumpstart our training process. We collected our initial training data by probing filenames for specific search terms. Once we had a trained classifier, we were able to measure generalization performance on documents that couldn't be classified via filename, and we were also better able to target additional training data, for example by reviewing pages where the classifier had low confidence about its prediction.

4. We then used [FastAI](https://docs.fast.ai/) to fine-tune the `ResNet34` architecture, pretrained on [ImageNet](https://www.image-net.org/), to identify reports, transcripts, and testimonies based on page thumbnails.

5. Information Extraction: Currently, we're engaged in extracting structured information from the documents we've identified, and that work is the focus of the current post. Our goal is to extract structured information related to each police officer or prosecutor mentioned in the documents, such as their names, ranks, and roles ("arresting officer", "handled evidence", etc).

6. Deduplication: The previous step leaves us with many distinct mentions, but some individuals are mentioned many times, within the same case or across cases. Here we rely on HRDAG's [extensive experience with database deduplication](https://hrdag.org/tech-notes/adaptive-blocking-writeup-1.html) to create a unique index of officers and prosecutors involved in wrongful convictions, along with a record of the role or roles each had in the wrongful conviction.

7. Cross-referencing: In the final stage, we'll cross-reference the officer names and roles we've extracted with the Louisiana Law Enforcement Accountability Database ([LLEAD.co](https://llead.co/)). This step will help us identify additional individuals associated with implicated officers (for example, those who are co-accused on misconduct complaints, or who are involved in the same use-of-force incidents), and will let us request public records in order to review arrests by these officers.
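
As an illustration of the metadata step (step 1), the sketch below builds one index row per file using only the Python standard library. The function names, the column names, and the 10-character truncation length are assumptions made for this example rather than the project's actual choices, and the page count is omitted because it requires a PDF library.

```python
import hashlib
from pathlib import Path


def file_id(path, length=10):
    """Return a truncated SHA-1 hex digest of the file's contents.

    The truncation length is an assumption for this sketch.
    """
    digest = hashlib.sha1(Path(path).read_bytes()).hexdigest()
    return digest[:length]


def index_row(path, root):
    """Build one metadata row for a file stored under root/<case id>/..."""
    path = Path(path)
    return {
        "filepath": str(path),
        "filetype": path.suffix.lstrip(".").lower(),
        "fileid": file_id(path),
        "filesize": path.stat().st_size,
        # The case ID is read from the first directory below the root.
        "case_id": path.relative_to(root).parts[0],
    }
```

Rows produced this way can be written out with `csv.DictWriter` or collected into a pandas DataFrame.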


Our process necessitates the extraction of officer information from documents, specifically the officer's name and associated role in wrongful convictions. Initially, a regex-based approach was considered, but its limitations became evident due to the complexity and variability of the data. For instance, a text string from a court transcript reading, "Sergeant Ruiz was mentioned as being involved in the joint investigation with Detective Martin Venezia regarding the Seafood City burglary and the murder of Kathy Ulfers," would pose a problem for regex because it fails to capture semantic context, making it unable to infer that Sergeant Ruiz acted as a lead detective in Kathy Ulfers' murder case. As a result, we pivoted to using the Generative Pretrained Transformer 4 (GPT-4), an AI model capable of comprehending and generating human-like text.
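
To make the limitation concrete, here is a small sketch of what a regex-based extractor might look like when applied to the sentence above. The pattern and the list of titles are illustrative assumptions, not a pattern we used in the pipeline.

```python
import re

# Hypothetical pattern: a rank or title followed by one or two capitalized words.
TITLE_PATTERN = re.compile(
    r"\b(Sergeant|Detective|Lieutenant|Captain|Officer)\s+"
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)"
)


def find_titled_names(text):
    """Return (title, name) pairs for every titled name found in the text."""
    return [(m.group(1), m.group(2)) for m in TITLE_PATTERN.finditer(text)]


sentence = (
    "Sergeant Ruiz was mentioned as being involved in the joint investigation "
    "with Detective Martin Venezia regarding the Seafood City burglary and the "
    "murder of Kathy Ulfers"
)

mentions = find_titled_names(sentence)
# mentions == [("Sergeant", "Ruiz"), ("Detective", "Martin Venezia")]
```

The pattern recovers the two titled names, but nothing in the match says what role either officer played in the investigation, which is the information we need.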

One issue we faced with the GPT-4 approach was the length of the documents we were processing, which can run to hundreds of pages. GPT-4 has a limit on the number of tokens it can process in a single request (its context window). To work within this limit, we developed a method to extract only the sections of each document that contain the relevant officer information.

We split the problem into two steps: identifying the relevant chunks of text content, and then extracting structured officer information from those chunks. We use [Langchain](https://docs.langchain.com/docs/), a natural language processing library, to manage this pipeline, with OpenAI's GPT-4 as the language model powering it.

For the first step, identifying the relevant chunks of text within the larger document, we used the Hypothetical Document Embeddings (HyDE) approach outlined in [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496). This method splits our information retrieval task into four steps:

1. First we split the document text into overlapping chunks and calculate an embedding for each chunk. Embeddings are mathematical representations of text in a high-dimensional space that capture its semantic meaning.
2. We then feed our query, which asks for the names and roles of mentioned officers, to an OpenAI language model, instructing it to compose a "hypothetical" document in response to the query.
3. We embed the hypothetical document using the same embedding system that we use to encode the text chunks from the document.
4. We use [Faiss](https://faiss.ai/) to do a similarity search, comparing the hypothetical document embedding against the chunk embeddings to find the chunks of text that most resemble our hypothetical document.
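
The four steps can be sketched end to end with toy components. In the sketch below, a bag-of-words count vector stands in for the OpenAI embeddings, a brute-force cosine ranking stands in for Faiss, and the hypothetical document is passed in as a string rather than generated by a language model. All function names are hypothetical, and the chunk splitter is purely character-based, which is simpler than Langchain's.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size, overlap):
    """Step 1a: split text into character chunks that overlap by `overlap`."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(chunks, hypothetical_document, k):
    """Steps 3 and 4: embed the hypothetical document, then rank chunks by similarity."""
    query_vector = embed(hypothetical_document)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vector), reverse=True)
    return ranked[:k]


chunks = [
    "Sergeant Venezia was assigned to the homicide division.",
    "The court recessed for lunch at noon.",
    "The jury was sworn in on Monday.",
]
# Step 2 would ask a language model to write this text; here it is hard-coded.
hypothetical = "Detective John Smith of the homicide division led the investigation."
top = hyde_search(chunks, hypothetical, k=1)
```

In this toy run the chunk about the homicide division ranks first, even though it shares no names with the hypothetical document. Matching on the shape of an answer rather than on the wording of the question is the idea behind HyDE.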

Here is the function we use to build the hypothetical document embedder. It combines a prompt, a language model, and a base embedding model, so that a query is first expanded into a hypothetical document and that document is then embedded.

In [8]:
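def generate_hyde():
    """Build an embedder that turns a query into a hypothetical document before embedding it."""
    llm = OpenAI()
    # The template is abridged ("..."); it is expected to reference {question},
    # as declared in input_variables below.
    prompt_template = """\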
    You're an AI assistant specializing in criminal justice research. 
    Your main focus is on identifying the names and providing detailed context of mention for each law enforcement personnel.
    ...
    """
    prompt = PromptTemplate(input_variables=["question"], template=prompt_template)
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    base_embeddings = OpenAIEmbeddings()
    embeddings = HypotheticalDocumentEmbedder(llm_chain=llm_chain, base_embeddings=base_embeddings)
    return embeddings

Building on the hypothetical document embedder, the `process_single_document` function is the first step in handling raw text. It uses Langchain's `RecursiveCharacterTextSplitter` to split each document into chunks of 500 characters (the splitter measures length in characters by default), with an overlap of 250 characters between consecutive chunks to preserve context across chunk boundaries.

Because our primary concern is to miss as few true officer mentions as possible, we evaluated configurations using the F-beta score with β=2, which weighs recall twice as heavily as precision. We tested chunk sizes of 2000, 1000, and 500, with corresponding overlaps of 1000, 500, and 250. A chunk size of 500 with an overlap of 250 performed best.

After splitting, `FAISS.from_documents` embeds each chunk with the embedder passed to the function and builds an index over the chunk embeddings for similarity search.

In [9]:
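def process_single_document(file_path, embeddings):
    """Load one document, split it into overlapping chunks, and index the chunks in FAISS."""
    logger.info(f"Processing document: {file_path}")

    # NOTE: Langchain's JSONLoader also requires a jq_schema argument that
    # selects the text within the JSON file; it depends on the file layout
    # and is not shown here.
    loader = JSONLoader(file_path)
    text = loader.load()
    logger.info(f"Text loaded from document: {file_path}")

    # chunk_size and chunk_overlap are measured in characters by default
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=250)
    docs = text_splitter.split_documents(text)

    # Embed every chunk and build the similarity-search index
    db = FAISS.from_documents(docs, embeddings)
    return db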
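
For reference, the F-beta score mentioned above combines precision P and recall R as F_β = (1 + β²)·P·R / (β²·P + R). The sketch below is a plain-Python version of that formula. The function name is hypothetical, and the precision and recall values in the usage note are made-up numbers chosen only to show how β=2 favors recall; they are not results from our evaluation.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: a weighted harmonic mean of precision and recall.

    With beta > 1, recall counts for more than precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)
```

With β=2, a configuration with precision 0.5 and recall 1.0 scores about 0.83, while one with precision 1.0 and recall 0.5 scores about 0.56, so the configuration that misses fewer officers is preferred.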

In the following sections, we define the core function get_response_from_query(db, query). This function serves as the backbone of our information extraction process, taking in a document database and a query, and returning the system's response to the query along with the documents that it considered while generating the response.

The process begins by setting up the relevant parameters. We use a prompt template to guide the query and a role template to define the roles we're interested in. We set the temperature parameter to 0 to maximize the determinism of our responses. The k parameter is set to 20, a decision guided by the F-beta score results from our testing phase, instructing the system to select and concatenate the top 20 relevant text chunks from the document corpus.

The query is then passed to the FAISS index, which runs a similarity search and returns the k most relevant chunks. Their text is concatenated into a single string, referred to as `docs_page_content`.

This string is passed to the `run` method of a Langchain `LLMChain`, together with the role template and the original query. The chain fills these values into the prompt template before sending it to the model.

The LLMChain processes these inputs and outputs a structured response to the original query.

Consider the following example:

**Query**

"Identify individuals, by name, with the specific titles of officers, sergeants, lieutenants, captains, detectives, homicide officers, and crime lab personnel in the transcript. Specifically, provide the context of their mention related to key events in the case, if available."

**Relevant Document**

(1 of the 20 documents identified by Faiss as relevant)
 
 Document(page_content=".\nMartin Venezia, New Orleans police sergeant.\nA\n16\n.01\nSergeant Venezia, where are you assigned now?\n: -\nA\nSecond Police District.\n13\n.\nAnd in October, September of 1979 and in\nQ\n19\nSeptember and October of 1980, where\nwere you assigned?\n:1\nHomicide division.\nA.\nAnd how long have you been on the police\ndepartment right now?\nThirteen and a half years.\nA\nOfficer Venezia, when did you or did you\never take over the investigation of\n...\nCathy Ulfers' murder?\nA", metadata={'source': '../../data/convictions/transcripts/iterative\\(C) Det. Martin Venezia Testimony - Trial One.docx'}),

**Response from the Model** 

Officer Name: Sergeant Martin Venezia

Officer Context: Sergeant Martin Venezia, formerly assigned to the Homicide Division, took over the investigation of Cathy Ulfers' murder.

Officer Role: Lead Detective 

Now, let's look at the Python function that makes it possible:

In [None]:
PROMPT_TEMPLATE_MODEL = PromptTemplate(
    input_variables=["roles", "question", "docs"],
    template="""
    As an AI assistant, my role is to meticulously analyze court transcripts, traditional officer roles, and extract information about law enforcement personnel.

    Query: {question}

    Transcripts: {docs}

    Roles: {roles}

    The response will contain:

    1) The name of an officer, detective, deputy, lieutenant, 
       sergeant, captain, officer, coroner, investigator, criminalist, patrolman, or technician - 
       if an individual's name is not associated with one of these titles they do not work in law enforcement.
       Please prefix the name with "Officer Name: ". 
       For example, "Officer Name: John Smith".

    2) If available, provide an in-depth description of the context of their mention. 
       If the context induces ambiguity regarding the individual's employment in law enforcement, 
       remove the individual.
       Please prefix this information with "Officer Context: ". 

    3) Review the context to discern the role of the officer.
       Please prefix this information with "Officer Role: "
       For example, the column "Officer Role: Lead Detective" will be filled with a value of 1 for officers who were the lead detective.
""",
)

ROLE_TEMPLATE = """
US-IPNO-Exonerations: Model Evaluation Guide 
Roles:
Lead Detective
•	Coordinates with other detectives and law enforcement officers on the case.
•	Liaises with the prosecutor's office, contributing to legal strategy and court proceedings.
•	May be involved in obtaining and executing search warrants.
•	Could be called to testify in court about the investigation.
"""

def get_response_from_query(db, query):
    # Set up the parameters
    prompt = PROMPT_TEMPLATE_MODEL
    roles = ROLE_TEMPLATE
    temperature = 0
    k = 20

    # Perform the similarity search
    docs = db.similarity_search(query, k=k)
    docs_page_content = " ".join([d.page_content for d in docs])

    # Create the chat model; temperature is a model setting, so it is set here
    llm = ChatOpenAI(model_name="gpt-4", temperature=temperature)

    # Create an instance of the LLMChain
    chain = LLMChain(llm=llm, prompt=prompt)

    # Run the LLMChain and print the response
    response = chain.run(roles=roles, question=query, docs=docs_page_content)
    print(response)

    # Return the response and the documents
    return response, docs
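
Because the prompt asks the model to prefix each field with "Officer Name: ", "Officer Context: ", and "Officer Role: ", the free-text response can be turned into records with a small parser. The sketch below is one possible way to do that. The function name is hypothetical, and it assumes that every officer in the response carries all three fields in that order, which the model does not guarantee.

```python
import re

# One record = the three prefixed fields, in order, up to the next record or the end.
FIELD_PATTERN = re.compile(
    r"Officer Name:\s*(?P<name>.+?)\s*"
    r"Officer Context:\s*(?P<context>.+?)\s*"
    r"Officer Role:\s*(?P<role>.+?)\s*(?=Officer Name:|\Z)",
    re.DOTALL,
)


def parse_officers(response):
    """Return a list of dicts with 'name', 'context', and 'role' keys."""
    return [match.groupdict() for match in FIELD_PATTERN.finditer(response)]
```

Applied to the example response shown earlier, this yields a single record whose `name` is "Sergeant Martin Venezia" and whose `role` is "Lead Detective". Responses that omit a field would need a more forgiving parser.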

## Evaluations, issues, improvements

Placeholder to talk about performance compared to hand-labeled data:
- How does HyDE help?
- What is the effect of different chunk_sizes/chunk_overlap parameters? 

Despite the strengths of AI, a major challenge remains: determining the best response from the AI model. Given a prompt, the model can yield multiple responses, and figuring out which response is the most accurate or relevant is not straightforward.

Let's consider a situation where we have multiple queries, and for each query, an officer may be identified more than once. This repetition is not a limitation but an inherent characteristic of our approach because it allows us to capture every possible mention of an officer. Hence, we end up with a rich, albeit redundant, dataset, where the same officer could be mentioned multiple times across different queries.
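
To show what this redundancy looks like in practice, the sketch below groups mentions from several queries under a normalized name, so that all responses about the same officer can be reviewed side by side. The function names, the `name` key, and the list of title prefixes are assumptions made for this example. It is a simple illustration of collecting repeated mentions, not the deduplication method described in step 6.

```python
import re
from collections import defaultdict

# Hypothetical list of rank prefixes to strip before comparing names.
TITLE_PREFIX = re.compile(
    r"^(officer|sergeant|sgt\.?|detective|det\.?|lieutenant|lt\.?|captain|capt\.?)\s+"
)


def normalize_name(name):
    """Lowercase the name, collapse whitespace, and drop a leading rank."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return TITLE_PREFIX.sub("", name)


def group_mentions(mentions):
    """Group mention dicts (each with a 'name' key) by normalized officer name."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[normalize_name(mention["name"])].append(mention)
    return dict(grouped)
```

With this grouping, "Sergeant Martin Venezia" returned by one query and "Det. Martin Venezia" returned by another end up under the same key, while their individual contexts and roles are kept for comparison.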