# Validation of Extracted Topics
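This notebook validates the topics extracted from the deposition transcript. A random sample of topics is drawn, each topic is paired with the transcript excerpt it points to, and a reasoning LLM judges whether the topic is relevant to that excerpt.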

In [None]:
# imports
import json
import random
import os
import re
from PyPDF2 import PdfReader
from ollama import chat, ChatResponse

In [None]:
# topics loader
def load_extracted_topics(file_path):
    with open(file_path, "r") as f:
        return json.load(f)

# topic file path
topics_file = "../outputs/extracted_topics.json"
extracted_topics = load_extracted_topics(topics_file)

In [None]:
# NUMBER OF TOPICS TO BE VALIDATED
# number_of_topics = int(input("Enter the number of random topics to validate: "))
number_of_topics = 50  # testing default

In [None]:
# pick a random sample from the pool of extracted topics
def select_random_topics(extracted_topics, num_samples):
    all_topics = []
    for page, topics in extracted_topics.items():
        for topic in topics:
            all_topics.append({"page": page, **topic})
    # random.sample raises ValueError if num_samples exceeds the pool size, so cap it
    return random.sample(all_topics, min(num_samples, len(all_topics)))

random_topics = select_random_topics(extracted_topics, number_of_topics)
print("Selected Topics:\n")
for i, topic in enumerate(random_topics, start=1):
    print(f"{i}. {topic['topic']}, Page: {topic['page_start']}")

Selected Topics:

1. Vervent Access to Documents, Page: 54
2. Data Transfer from Original Servicer, Page: 23
3. Consumer Financial Protection Bureau Concern - Borrower Misleading, Page: 60
4. The Witness's Request for Clarification, Page: 79
5. Witness Understanding of ITT Investigation Findings, Page: 74
6. Request to Rephrase Question, Page: 45
7. Lack of Recall, Page: 25
8. Vervent's Lack of Origination Involvement, Page: 73
9. East Coast Accent, Page: 15
10. Two CFPB Complaints, Page: 67
11. Number of Schools Receiving Loans, Page: 64
12. Question - John Purcell to Ms. Yu, Page: 7
13. CFPB Investigation of ITT, Page: 68
14. Reference to Page 11, Page: 62
15. Access Group Disclosures, Page: 51
16. Retention Rates, Page: 29
17. Definition of 'Outlier', Page: 34
18. Question - Settlement Date, Page: 72
19. Request for Clarification - Mr. Blood, Page: 84
20. Mr. Blood's Objection, Page: 65
21. Enforceability of PEAKS Loans, Page: 86
22. Misrepresentations About ITT Benefits - Mr. Purce

In [None]:
# transcript loader
def load_transcript(file_path):
    reader = PdfReader(file_path)
    return [page.extract_text() for page in reader.pages]

# path to transcript
transcript_file = "../inputs/Deposition for PersisYu_Link.pdf"
transcript_pages = load_transcript(transcript_file)

# returns the target line plus up to char_context characters on each side, to give the LLM context
def get_text_excerpt(page_number, line_start, transcript_pages, char_context=500):
    if not 1 <= page_number <= len(transcript_pages):
        return None
    page_text = transcript_pages[page_number - 1]
    lines = page_text.split("\n")
    if 0 <= line_start - 1 < len(lines):
        target_line = lines[line_start - 1]
        # offset of the target line: lengths of the preceding lines plus one newline each
        # (page_text.find(target_line) would return the first match if the line text repeats)
        target_index = sum(len(line) + 1 for line in lines[: line_start - 1])
        start_index = max(0, target_index - char_context)   # up to char_context chars before the target line
        end_index = min(len(page_text), target_index + len(target_line) + char_context) # up to char_context chars after the target line
        return page_text[start_index:end_index]
    return None

# excerpts attached to topics
for topic in random_topics:
    topic["text_excerpt"] = get_text_excerpt(
        int(topic["page_start"]), int(topic["line_start"]), transcript_pages
    )
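
The excerpt window can be tried on a small made-up page before running it on the real transcript. The sketch below is only an illustration: `excerpt_around_line` is a hypothetical stand-in that takes a single page string instead of the PDF pages, and the sample text is invented.

```python
def excerpt_around_line(page_text, line_number, char_context=500):
    """Hypothetical stand-in for get_text_excerpt, operating on one page string."""
    lines = page_text.split("\n")
    if not 1 <= line_number <= len(lines):
        return None
    # Offset of the target line: lengths of all earlier lines plus one newline each.
    start_of_line = sum(len(line) + 1 for line in lines[: line_number - 1])
    end_of_line = start_of_line + len(lines[line_number - 1])
    # Clamp the window to the page boundaries.
    start = max(0, start_of_line - char_context)
    end = min(len(page_text), end_of_line + char_context)
    return page_text[start:end]

page = "1 first line\n2 second line\n3 third line"  # invented sample page
print(repr(excerpt_around_line(page, 2, char_context=4)))
# 'ine\n2 second line\n3 t'
```

With the default `char_context=500`, a short page is returned whole because the window is clamped at both ends.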

# Validation
Validation uses the reasoning capabilities of DeepSeek R1 (8b).

When served through Ollama, R1 output is structured as follows:

```
<think>
Reasoning goes here.
</think>
Output goes here.
```

We can therefore display the reasoning alongside the output, then strip the `<think>` block (tags included) to obtain the final Yes/No answer. Those answers are what the accuracy is calculated from.
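
As a minimal sketch of that separation, the snippet below splits a response of the shape shown above into its reasoning and its final answer. `split_reasoning` is a hypothetical helper name and the sample string is made up; the notebook itself only strips the block with `re.sub`.

```python
import re

# Matches one <think>...</think> block; DOTALL lets "." span newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def split_reasoning(raw_response):
    """Return (reasoning, answer) from an R1-style response string."""
    match = THINK_BLOCK.search(raw_response)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", raw_response).strip()
    return reasoning, answer

sample = "<think>\nReasoning goes here.\n</think>\nYes."  # made-up response
reasoning, answer = split_reasoning(sample)
print(reasoning)  # Reasoning goes here.
print(answer)     # Yes.
```

If a response has no `<think>` block, the reasoning comes back empty and the whole string is treated as the answer.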

In [None]:
# VALIDATION WITH REASONING
def validate_topic_with_llm(topic):
    prompt = f"""
    The following is an excerpt from a deposition transcript file:
    "{topic['text_excerpt']}"

    The extracted topic is: "{topic['topic']}".
    Is the topic relevant to the excerpt?
    Provide either "Yes." or "No." and nothing else. Do not use formatting.
    """
    try:
        response: ChatResponse = chat(
            model="deepseek-r1",
            messages=[{"role": "user", "content": prompt}],
        )
        raw_response = response.message.content.strip()
        print(f"Topic: {topic['topic']}\n\nExcerpt:\n{topic['text_excerpt']}\n\nResponse:\n{raw_response}\n\n---------------")
        
        # strip the <think>...</think> block (tags included), leaving only the final answer
        cleaned_response = re.sub(r"<think>.*?</think>", "", raw_response, flags=re.DOTALL).strip()
        return cleaned_response
    except Exception as e:
        return f"Error: {e}"

# iterate and validate (topics that already have a response are skipped, so the cell can be re-run)
for i, topic in enumerate(random_topics, start=1):
    if "llm_response" in topic:
        continue
    print(f"\n\nTopic {i}/{len(random_topics)}:\n")
    topic["llm_response"] = validate_topic_with_llm(topic)



Topic 1/50:

Topic: Vervent Access to Documents

Excerpt:
1   was asked to look at is the documents that Vervent would     02:35
2   have had access to which it would have used to rely when     02:35
3   communicating to student loan borrowers about -- about the   02:35
4   terms and conditions of their loans.                         02:35
5            The fact that they didn't have those -- those       02:35
6   documents means that they could not definitively know when   02:35
7   a borrower says what is my interest rate, for example.       02:35
8   Without that document, the servicer can't answer the         02:35
9   question.                                                    02:35
10

Response:
<think>
Okay, let's look at this deposition transcript excerpt. It talks about documents that Vervent would have had access to which they might have used when communicating with student loan borrowers regarding terms and conditions. The speaker mentions that since those documents weren'

# Results

### Accuracy

In [7]:
# accuracy calculation
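def calculate_accuracy(random_topics):
    if not random_topics:   # avoid division by zero when nothing was sampled
        print("No topics were selected for validation.")
        return
    # only an exact "Yes." counts; any other response (including "Error: ...") counts as inaccurate
    correct_count = sum(1 for topic in random_topics if topic["llm_response"] == "Yes.")
    accuracy = (correct_count / len(random_topics)) * 100   # percentage of topics the LLM judged relevant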
    
    print(f"{len(random_topics)} random topics selected for validation.")
    print(f"Of those, {correct_count} were evaluated as accurate to the context.")
    print(f"\nValidation Accuracy: {accuracy:.2f}%")

calculate_accuracy(random_topics)

50 random topics selected for validation.
Of those, 48 were evaluated as accurate to the context.

Validation Accuracy: 96.00%
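
The exact comparison against `"Yes."` treats every other string as inaccurate, including a `Yes` without the period or an `Error: ...` value returned by the exception handler. A more tolerant tally might look like the sketch below. `normalize_answer` and `tally` are hypothetical names, the sample responses are invented, and leaving errors out of the denominator is an assumption that differs from the calculation above.

```python
from collections import Counter

def normalize_answer(response):
    """Map a cleaned LLM response to 'yes', 'no', 'error', or 'other'."""
    if response.startswith("Error:"):
        return "error"
    # Drop surrounding whitespace, periods, and quotes, then ignore case.
    text = response.strip().strip(".\"'").lower()
    return text if text in ("yes", "no") else "other"

def tally(responses):
    """Count normalized answers; accuracy is computed over yes/no answers only."""
    counts = Counter(normalize_answer(r) for r in responses)
    judged = counts["yes"] + counts["no"]
    accuracy = 100 * counts["yes"] / judged if judged else None
    return counts, accuracy

# Invented responses, for illustration only.
counts, accuracy = tally(["Yes.", "yes", "No.", "Error: connection refused"])
print(dict(counts))
print(accuracy)
```

Reporting the `error` and `other` counts separately makes it visible when a low accuracy comes from failed calls or malformed answers instead of from topics the model judged irrelevant.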


### Inaccurate Topics

In [8]:
def print_inaccurate(random_topics):
    for i, topic in enumerate(random_topics, start=1):
        if topic["llm_response"] == "No.":
            print(f"Topic {i}: {topic['topic']}, Page: {topic['page_start']}")

print_inaccurate(random_topics)

Topic 26: Question Regarding Vervent's Cessation of PEAKS Loan Servicing (Repeat), Page: 79
Topic 40: Witness's Response to Repeated Question, Page: 70
