# Validation of Extracted Topics
This notebook validates topics extracted from the deposition transcript. A random sample of topics is drawn, and a reasoning LLM judges whether each one is relevant to the transcript excerpt it points to.

In [1]:
# imports
import json
import random
import os
import re
from PyPDF2 import PdfReader
from ollama import chat, ChatResponse

In [2]:
# topics loader
def load_extracted_topics(file_path):
    with open(file_path, "r") as f:
        return json.load(f)

# topic file path
topics_file = "../outputs/toc.json"
extracted_topics = load_extracted_topics(topics_file)

In [3]:
# NUMBER OF TOPICS TO BE VALIDATED
# number_of_topics = int(input("Enter the number of random topics to validate: "))
number_of_topics = 50  # testing default

In [4]:
# random topics are picked from pool of extracted topics
def select_random_topics(extracted_topics, num_samples):
    all_topics = []
    for page, topics in extracted_topics.items():
        for topic in topics:
            all_topics.append({"page": page, **topic})
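    # never request more samples than there are topics (random.sample would raise ValueError)
    return random.sample(all_topics, min(num_samples, len(all_topics)))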

random_topics = select_random_topics(extracted_topics, number_of_topics)
print("Selected Topics:\n")
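# enumerate gives the position directly; list.index() rescans the list and returns the first equal entry
for i, topic in enumerate(random_topics, start=1):
    print(f"{i}. {topic['topic']}, Page: {topic['page_start']}")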

Selected Topics:

1. Witnesses discuss problematic wording and related work., Page: 120
2. Affidavit of John Purcell - Heather Turrey vs. Vervent, Inc., Page: 90
3. Witness avoids speculation on PEAKS loan issue., Page: 87
4. Purcell questions Blood about Civil Investigative Demand complaint, Page: 67
5. Witness clarifies cease collection instructions - Judge Miller, Page: 46
6. Witness unable to recall SEC discussion about Vervent, Page: 70
7. Detailed deposition segment analysis – various timestamps noted, Page: 93
8. Witness clarifies timeframe: last five to ten years., Page: 25
9. Questioning about PEAKS loan enforceability issues., Page: 49
10. Senate HELP Committee’s review of ITT retention rates., Page: 30
11. Examination of documents reveals loan defects - Purcell, Page: 49
12. Public nature of events, no court orders issued, Page: 62
13. Analysis of institution practices regarding ITT benefits., Page: 35
14. Defendant denies public awareness of wrongdoing., Page: 68
15. Witnes
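
The sample above changes on every run because no seed is set, so a reported accuracy cannot be tied back to a specific set of topics. A minimal sketch of reproducible sampling follows, assuming a hypothetical helper `sample_topics` and a hypothetical `seed` parameter, neither of which is part of the notebook:

```python
import random

def sample_topics(all_topics, num_samples, seed=None):
    """Draw up to num_samples topics; the same seed yields the same sample.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    k = min(num_samples, len(all_topics))  # avoid ValueError on small pools
    return rng.sample(all_topics, k)

# illustrative pool, not real transcript data
pool = [{"topic": f"topic {i}", "page_start": str(i)} for i in range(1, 11)]
first = sample_topics(pool, 3, seed=7)
second = sample_topics(pool, 3, seed=7)
print(first == second)
```

Recording the seed next to the accuracy figure makes it possible to re-check the same topics later.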

In [5]:
# transcript loader
def load_transcript(file_path):
    reader = PdfReader(file_path)
    return [page.extract_text() for page in reader.pages]

# path to transcript
transcript_file = "../inputs/deposition.pdf"
transcript_pages = load_transcript(transcript_file)

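# this function gets a text excerpt (up to ~1000 characters plus the target line) from the transcript to provide context to LLM
def get_text_excerpt(page_number, line_start, transcript_pages, char_context=500):
    # page_number and line_start are 1-based
    if not 1 <= page_number <= len(transcript_pages):
        return None
    page_text = transcript_pages[page_number - 1]
    lines = page_text.split("\n")
    if 0 <= line_start - 1 < len(lines):
        target_line = lines[line_start - 1]
        # offset of the target line = lengths of all preceding lines plus their newlines
        # (page_text.find(target_line) returns the first match, which is wrong for repeated lines)
        target_index = sum(len(line) + 1 for line in lines[:line_start - 1])
        start_index = max(0, target_index - char_context)   # 500 chars before target line
        end_index = min(len(page_text), target_index + len(target_line) + char_context) # 500 chars after target line
        return page_text[start_index:end_index]
    return None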

# excerpts attached to topics
for topic in random_topics:
    topic["text_excerpt"] = get_text_excerpt(
        int(topic["page_start"]), int(topic["line_start"]), transcript_pages
    )

# Validation
Validation relies on the reasoning capabilities of DeepSeek R1 (8b), run through Ollama.

R1 output returned by Ollama is structured as follows:

```
<think>
Reasoning goes here.
</think>
Output goes here.
```

This lets us display the reasoning alongside the output, then strip the `<think>` block (tags included) to obtain the final Yes/No answer. Accuracy is calculated from those answers.
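
As a minimal sketch of that separation, the snippet below splits a response in the format shown above into its reasoning and its final answer. The helper name `split_reasoning` is an assumption made for illustration; the notebook itself only strips the block with `re.sub`.

```python
import re

# DOTALL lets "." match newlines, since the reasoning spans several lines
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def split_reasoning(raw_response):
    """Return (reasoning, answer) from an R1-style response string.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    match = THINK_BLOCK.search(raw_response)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", raw_response).strip()
    return reasoning, answer

sample = "<think>\nReasoning goes here.\n</think>\nYes."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

If a response contains no `<think>` block at all, the reasoning comes back empty and the whole string is treated as the answer.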

In [6]:
# VALIDATION WITH REASONING
def validate_topic_with_llm(topic):
    prompt = f"""
    The following is an excerpt from a deposition transcript file:
    "{topic['text_excerpt']}"

    The extracted topic is: "{topic['topic']}".
    Is the topic relevant to the excerpt?
    Provide either "Yes." or "No." and nothing else. Do not use formatting.
    """
    try:
        response: ChatResponse = chat(
            model="deepseek-r1",
            messages=[{"role": "user", "content": prompt}],
        )
        raw_response = response.message.content.strip()
        print(f"Topic: {topic['topic']}\n\nExcerpt:\n{topic['text_excerpt']}\n\nResponse:\n{raw_response}\n\n---------------")
        
        # We remove everything inside <think> tags
        cleaned_response = re.sub(r"<think>.*?</think>", "", raw_response, flags=re.DOTALL).strip()
        return cleaned_response
    except Exception as e:
        return f"Error: {e}"

# iterate and validate (topics that already have a response are skipped, so the cell can be re-run)
for i, topic in enumerate(random_topics, start=1):
    if "llm_response" in topic:
        continue
    print(f"\n\nTopic {i}/{len(random_topics)}:\n")
    topic["llm_response"] = validate_topic_with_llm(topic)



Topic 1/50:

Topic: Witnesses discuss problematic wording and related work.

Excerpt:
32:2,13,16 33:7
34:22 37:6
38:6,20 39:22
41:7,25 43:4
46:4 48:11
49:4 51:20
52:9,13 53:4,17
53:24 58:4
59:11 63:3
64:8 65:13
69:13 70:3,12
70:20 72:8,11
77:15 78:7,24
79:6,15 80:13
80:16,20 81:14
81:23 82:10
83:23 84:5,9,16
85:11 87:8,16
88:2 89:10,19
90:18 91:13,16
92:2,5 93:24
witnesses 90:9
wondered
26:21
word75:16
76:22
wording 82:4
words33:8
43:11 72:2
80:5 84:18
work11:14,23
12:8,12,17
13:15 18:18,19
18:21 19:15
22:2 24:1825:14,15 27:19
64:12 77:15
worked 16:3,5
16:7,16 18:4,6
18:12 40:7
workers 18:2
working 14:18
18:16 24:21
40:7
works7:16
64:21
worse30:2
worthless 29:3
29:8 31:25
would've 27:23
36:24 38:2
87:4,9
write13:24
57:14
written 14:21
wrong67:21
68:5,15 69:3
73:6 74:11,15
78:5
wrongdoing
68:9 69:8
74:19 78:16,19
78:22,25
x
x5:1,5 92:9
y
yea

Response:
<think>
Okay, let's take a look at this query. The user provided an excerpt from a deposition transcript along with a topic: "Witnesse

# Results

### Accuracy

In [7]:
# accuracy calculation: only an exact "Yes." counts as accurate, so errors and any other reply lower the score
def calculate_accuracy(random_topics):
    correct_count = sum(1 for topic in random_topics if topic["llm_response"] == "Yes.")
    accuracy = (correct_count / len(random_topics)) * 100   # percentage of correctly identified topics
    
    print(f"{len(random_topics)} random topics selected for validation.")
    print(f"Of those, {correct_count} were evaluated as accurate to the context.")
    print(f"\nValidation Accuracy: {accuracy:.2f}%")

calculate_accuracy(random_topics)

50 random topics selected for validation.
Of those, 42 were evaluated as accurate to the context.

Validation Accuracy: 84.00%
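
The accuracy cell compares each response to the exact string "Yes.", so a reply such as "yes", "**Yes**" or an "Error: ..." string is counted as inaccurate. A more tolerant tally could look like the sketch below. `normalize_verdict` and `tally_verdicts` are hypothetical helpers, and excluding unparseable replies from the denominator is a design choice that differs from the calculation above, not the notebook's method.

```python
from collections import Counter

def normalize_verdict(response):
    """Map a free-form reply to 'yes', 'no', or 'other' (hypothetical helper)."""
    text = response.strip().strip("\"'*`").rstrip(".!").strip().lower()
    return text if text in ("yes", "no") else "other"

def tally_verdicts(responses):
    """Count verdicts and compute accuracy over the replies that could be parsed."""
    counts = Counter(normalize_verdict(r) for r in responses)
    judged = counts["yes"] + counts["no"]
    accuracy = 100 * counts["yes"] / judged if judged else 0.0
    return counts, accuracy

# illustrative replies only, not taken from the run above
replies = ["Yes.", "yes", "**No**", "Error: connection refused"]
counts, accuracy = tally_verdicts(replies)
print(dict(counts))
print(f"{accuracy:.2f}%")
```

Reporting the "other" count alongside the percentage keeps failed or malformed replies visible instead of silently folding them into the inaccurate group.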


### Inaccurate Topics

In [8]:
def print_inaccurate(random_topics):
    # enumerate keeps the numbering aligned with the validation log above
    for i, topic in enumerate(random_topics, start=1):
        if topic["llm_response"] == "No.":
            print(f"Topic {i}: {topic['topic']}, Page: {topic['page_start']}")

print_inaccurate(random_topics)

Topic 2: Affidavit of John Purcell - Heather Turrey vs. Vervent, Inc., Page: 90
Topic 9: Questioning about PEAKS loan enforceability issues., Page: 49
Topic 35: Counsel requests repetition of witness testimony - Mr. Purcell, Page: 83
Topic 37: Witness clarifies non-legal status of reviewed report., Page: 52
Topic 42: Request to review and modify testimony - Witness 1, Witness 2, Page: 47
Topic 43: Witness requests slower pace, marks exhibit - Terrific., Page: 10
Topic 44: Federal government approval regarding ITT’s institutional practices., Page: 64
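
The list above shows 7 topics, while the accuracy cell counted 42 of 50 responses as "Yes.". If the list is complete, one response was neither "Yes." nor "No.", for example an "Error: ..." string or an answer with extra text. A minimal sketch for surfacing such responses follows, assuming a hypothetical helper `find_unclassified` that is not part of the notebook:

```python
def find_unclassified(topics):
    """Return (position, topic) pairs whose response is neither 'Yes.' nor 'No.'.

    Hypothetical helper for illustration; positions are 1-based to match the log.
    """
    return [
        (position, topic)
        for position, topic in enumerate(topics, start=1)
        if topic.get("llm_response") not in ("Yes.", "No.")
    ]

# illustrative records only, not taken from the run above
example_topics = [
    {"topic": "A", "page_start": "1", "llm_response": "Yes."},
    {"topic": "B", "page_start": "2", "llm_response": "No."},
    {"topic": "C", "page_start": "3", "llm_response": "Error: timeout"},
]
for position, topic in find_unclassified(example_topics):
    print(f"Topic {position}: {topic['topic']} -> {topic['llm_response']!r}")
```

Running the same check on `random_topics` would show which response falls outside both groups and whether it should be re-validated.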
