<a href="https://colab.research.google.com/github/alex-jk/buyers-ID-manual-project/blob/main/process_studies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook extracts and cleans text from PDF reports for use in training the buyer profile extractor model.**

**Add input and output examples to the JSONL file**

In [3]:
#!git clone https://github.com/alex-jk/buyers-ID-manual-project.git
%cd buyers-ID-manual-project
!ls

/content/buyers-ID-manual-project
data			      LICENSE		     README.md
extraction_utils.py	      process_studies.ipynb
extract_text_from_pdfs.ipynb  prompt_template.txt


In [4]:
# import json

# input_text = """
# Buyers were willing to travel significant distances to reach underage victims.
# One man cited the Atlanta airport as a regular meeting place.
# """

# output_text = """
# Buyers are often mobile and willing to travel, including to major hubs like the Atlanta airport.
# """

# entry = {"input": input_text.strip(), "output": output_text.strip()}

# with open("data/labeled_chunks.jsonl", "a", encoding="utf-8") as f:
#     f.write(json.dumps(entry) + "\n")

**Import libraries**

In [3]:
# Install specific compatible versions for torch 2.3.1 / cu121
!pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 torchvision==0.18.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html

# Install the other pinned versions
!pip install transformers==4.41.2 accelerate==0.31.0 --no-deps

# Install compatible tokenizers version LAST to ensure it sticks
!pip install tokenizers==0.19.1

# Ensure base dependencies are present (redundant ok)
!pip install bitsandbytes sentencepiece

print("\nInstalled specific package versions (including compatible tokenizers). PLEASE RESTART RUNTIME NOW.")

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==2.3.1+cu121
  Downloading https://download.pytorch.org/whl/cu121/torch-2.3.1%2Bcu121-cp311-cp311-linux_x86_64.whl (781.0 MB)
Collecting torchaudio==2.3.1+cu121
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.3.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
Collecting torchvision==0.18.1+cu121
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.18.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.0 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1+cu121)
  Downloading nvidia_cuda_nvrtc_cu12-12.1

In [5]:
import pandas as pd
import json
import sys
import os

**Read the input/output CSV file and write it to JSONL**

In [6]:
input_output_df = pd.read_csv("data/buyers_id_manual_input_output.csv")

print(input_output_df.shape)
print(input_output_df.columns)
print("\n-------------------------------\n")
print(input_output_df.head())

(9, 3)
Index(['Study', 'Input', 'Output'], dtype='object')

-------------------------------

                                    Study  \
0  The Shapiro Group Georgia Demand Study   
1  The Shapiro Group Georgia Demand Study   
2  The Shapiro Group Georgia Demand Study   
3  The Shapiro Group Georgia Demand Study   
4  The Shapiro Group Georgia Demand Study   

                                               Input  \
0  Almost half these men are the age 30-39, with ...   
1  The data clearly debunk the myth that CSEC is ...   
2  Not only are 65% of men who buy sex with young...   
3  Craigslist is by far the most efficient medium...   
4  While many of the men who exploit these childr...   

                                              Output  
0  Almost half these men are the age 30-39, with ...  
1  Men who\nrespond to advertisements for sex wit...  
2  65% of men who buy sex with young females do s...  
3  Buyers respond to Craigslist ads 3 times more ...  
4  Nearly half of buyers

In [7]:
input_output_df = input_output_df.drop_duplicates(subset=["Study", "Input", "Output"])

# Convert to JSONL
with open("data/labeled_chunks.jsonl", "w", encoding="utf-8") as f:
    for _, row in input_output_df.iterrows():
        json_obj = {
            "study": row["Study"],
            "input": row["Input"],
            "output": row["Output"]
        }
        f.write(json.dumps(json_obj) + "\n")
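
A quick read-back check can catch problems in the JSONL file before it is used for training. The sketch below is illustrative only: `read_jsonl` and `check_records` are hypothetical helpers, not part of this repository. It assumes every record should carry non-empty string values for `study`, `input`, and `output`. One case it is meant to catch: an empty CSV cell is read by pandas as NaN, and `json.dumps` writes it as a bare `NaN` rather than a string.

```python
import json


def read_jsonl(path):
    """Return a list of dicts, one per non-empty line of a JSONL file."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}, line {line_number}: {err}") from err
    return records


def check_records(records, required_keys=("study", "input", "output")):
    """Return indices of records with a missing, non-string, or blank required value."""
    bad = []
    for i, record in enumerate(records):
        for key in required_keys:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break
    return bad
```

In this notebook it could be called as `check_records(read_jsonl("data/labeled_chunks.jsonl"))`; an empty list means every record passed.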

**Models for text extraction**

In [30]:
%cd /content/buyers-ID-manual-project

!git pull

/content/buyers-ID-manual-project
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 977 bytes | 977.00 KiB/s, done.
From https://github.com/alex-jk/buyers-ID-manual-project
   2971db4..a7686c5  main       -> origin/main
Updating 2971db4..a7686c5
Fast-forward
 extraction_utils.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


In [31]:
import importlib
import extraction_utils

# Reload the entire module
importlib.reload(extraction_utils)
print("Module 'extraction_utils' reloaded.")

from extraction_utils import extract_information, chunk_text_by_paragraphs, process_text_chunks_with_prompt

print("Functions imported from extraction_utils successfully.")

Module 'extraction_utils' reloaded.
Functions imported from extraction_utils successfully.


In [9]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [10]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [11]:
# Check installed versions of required packages
!pip list | grep -E 'transformers|torch|accelerate|flash-attn'

accelerate                            0.31.0
sentence-transformers                 3.4.1
torch                                 2.3.1+cu121
torchaudio                            2.3.1+cu121
torchsummary                          1.5.1
torchvision                           0.18.1+cu121
transformers                          4.41.2


In [12]:
# --- Configuration ---
model_id = "microsoft/Phi-3-mini-4k-instruct"

# --- Load Model and Tokenizer ---
# The model is loaded without quantization; 4-bit loading (requires bitsandbytes)
# is left commented out below and can be enabled to save memory
# device_map="auto" places the model on the GPU if one is available
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",         # Use GPU if available, otherwise CPU
        torch_dtype="auto",        # Automatically select appropriate dtype
        trust_remote_code=True,    # Phi-3 requires this
        # Optional: uncomment below for 4-bit loading (needs bitsandbytes)
        # load_in_4bit=True,
        # bnb_4bit_compute_dtype=torch.bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model '{model_id}' loaded successfully.")

    # --- Create a Hugging Face Pipeline for easier text generation ---
    # Note: max_new_tokens controls how long the generated output can be. Adjust as needed.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        # Adjust max_new_tokens if your extracted text might be longer
        # Needs to be long enough for the longest expected extraction + "NONE"
        max_new_tokens=256,
        # Sampling settings (temperature, top_p) only take effect when do_sample=True
        # temperature=0.0,
        # top_p=0.95, # Optional: nucleus sampling
        do_sample=False # Greedy decoding: deterministic output for extraction
    )

except Exception as e:
    print(f"Error loading model or creating pipeline: {e}")
    print("Ensure you have sufficient RAM/VRAM and necessary libraries installed.")
    print("Consider using Google Colab with a T4 GPU runtime.")
    # Re-raise so execution stops here instead of continuing without a model
    raise

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Model 'microsoft/Phi-3-mini-4k-instruct' loaded successfully.
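
The loading cell mentions 4-bit quantization as a way to save memory. The arithmetic below is a rough sketch of how much the weights alone would occupy at different precisions. It assumes roughly 3.8 billion parameters for Phi-3-mini (the figure quoted on the model card, taken here as an assumption) and ignores quantization overhead, activations, and the KV cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: approximate parameter count taken from the model card.
PHI3_MINI_PARAMS = 3.8e9

for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: ~{weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights take about 7.6 GB and 4-bit weights about 1.9 GB, which is why quantization leaves more GPU memory free for long inputs.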


**Example Usage**

In [13]:
input_text_1 = "Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67."

input_text_2 = "The data clearly debunk the myth that CSEC is a problem relegated to the urban core. Men who respond to advertisements for sex with young females come from all over metro Atlanta, the geographic market where the advertisements in this study were targeted."

input_text_3 = "This paragraph discusses unrelated economic factors in the region and contains no information about buyers or traffickers."

input_text_4 = "Research indicates traffickers often groom potential buyers by displaying luxury goods online and frequenting specific forums known for risky behavior discussions. Victims may appear withdrawn or display signs of coaching. Reporting suspicions anonymously through the national hotline is encouraged."

In [14]:
prompt_buyers_profiles = "prompt_template.txt"

In [15]:
!cat /content/buyers-ID-manual-project/extraction_utils.py

import torch
import os

# --- Function to Load a Specific Prompt File ---
def load_system_prompt(prompt_filename):
    """Loads system prompt text from a specified file."""
    system_prompt = ""
    try:
        # Assumes prompt file is in the same directory as this script
        script_dir = os.path.dirname(__file__)
        prompt_path = os.path.join(script_dir, prompt_filename)
        with open(prompt_path, 'r', encoding='utf-8') as f:
            system_prompt = f.read()
        if not system_prompt:
            return f"ERROR: Prompt file empty {prompt_filename}" # Return specific error
    except FileNotFoundError:
        print(f"ERROR: Prompt file not found at {prompt_path}")
        return f"ERROR: Prompt file not found {prompt_filename}" # Return specific error
    except Exception as e:
        print(f"ERROR: Could not read prompt file {prompt_path}: {e}")
        return f"ERROR: Could not read prompt file {prompt_filename}" # Return specific error
    return system_promp

In [16]:
print("\n--- Extracting Information ---")

# Check if the pipeline object exists before trying to use it
if 'pipe' in locals() and pipe is not None:
    print(f"\nRunning prompt: {prompt_buyers_profiles}")

    print(f"\nInput 1:\n{input_text_1}")
    extraction_1 = extract_information(input_text_1, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 1:\n{extraction_1}")

    print("-" * 20)

    print(f"\nInput 2:\n{input_text_2}")
    extraction_2 = extract_information(input_text_2, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 2:\n{extraction_2}")

    print("-" * 20)

    print(f"\nInput 3:\n{input_text_3}")
    extraction_3 = extract_information(input_text_3, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 3:\n{extraction_3}")

    print("-" * 20)

    print(f"\nInput 4:\n{input_text_4}")
    extraction_4 = extract_information(input_text_4, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 4:\n{extraction_4}")

else:
    print("\nERROR: The 'pipe' object (text generation pipeline) was not found or not created successfully.")
    print("Please ensure the model loading cell was run successfully after importing functions.")


--- Extracting Information ---

Running prompt: prompt_template.txt

Input 1:
Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67.





Extraction 1:
Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67.
--------------------

Input 2:
The data clearly debunk the myth that CSEC is a problem relegated to the urban core. Men who respond to advertisements for sex with young females come from all over metro Atlanta, the geographic market where the advertisements in this study were targeted.

Extraction 2:
Men who respond to advertisements for sex with young females come from all over metro Atlanta.
--------------------

Input 3:
This paragraph discusses unrelated economic factors in the region and contains no information about buyers or traffickers.

Extraction 3:
NONE
--------------------

Input 4:
Research indicates traffickers often groom potential buyers by displaying luxury goods online and frequenting specific forums known for risky behavior discussions. Victims may appear withdr
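
The four near-identical print/extract blocks in the cell above could be folded into a loop. This is a minimal sketch, not the notebook's own code: `run_extractions` is a hypothetical helper, and `extract_fn` stands for any one-argument callable that wraps the real extraction call.

```python
def run_extractions(texts, extract_fn, separator="-" * 20):
    """Run extract_fn on each text, print input and result, and return the results."""
    results = []
    for i, text in enumerate(texts, start=1):
        print(f"\nInput {i}:\n{text}")
        extraction = extract_fn(text)
        print(f"\nExtraction {i}:\n{extraction}")
        if i < len(texts):
            print(separator)  # divider between examples, not after the last one
        results.append(extraction)
    return results
```

With the objects defined in this notebook, a call might look like `run_extractions([input_text_1, input_text_2, input_text_3, input_text_4], lambda t: extract_information(t, pipe, prompt_buyers_profiles))`.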

**Research Articles - extract main ideas**

In [27]:
# 1. DEFINE INPUT and PROMPT FILENAMES
input_txt_filename = "buyers-manual-article-01.txt"  # <--- CHANGE to your uploaded .txt filename
prompt_to_use = "prompt_main_idea.txt"    # <--- CHANGE to the prompt file you want to apply

input_txt_filepath = os.path.join('/content/buyers-ID-manual-project', input_txt_filename)

In [26]:
!ls /content/

buyers-ID-manual-project  sample_data


In [32]:
# 2. Check if pipeline 'pipe' exists from previous cells
if 'pipe' in locals() and callable(pipe):
    # 3. Read the text file
    print(f"Reading text file: {input_txt_filepath}")
    try:
        with open(input_txt_filepath, 'r', encoding='utf-8') as f:
            full_text_content = f.read()
        print(f"Successfully read {len(full_text_content)} characters.")

        # 4. Chunk the text
        if full_text_content:
            text_chunks = chunk_text_by_paragraphs(full_text_content)

            # 5. Process the chunks with the selected prompt
            if text_chunks:
                final_results = process_text_chunks_with_prompt(text_chunks, pipe, prompt_to_use)

                # 6. Display aggregated results for this prompt
                print(f"\n--- Aggregated Extractions for Prompt: {prompt_to_use} ---")
                if final_results:
                    for i, result in enumerate(final_results):
                        print(f"Extraction {i+1}:\n{result}\n{'-'*20}")
                else:
                    print(f"No relevant information extracted using '{prompt_to_use}'.")
            else:
                print("Text file was read, but no chunks were created (check chunking logic or file content).")
        else:
            print("Text file is empty.")

    except FileNotFoundError:
        print(f"ERROR: Input text file not found at '{input_txt_filepath}'. Please upload it.")
    except Exception as e:
        print(f"An error occurred during processing: {e}")

else:
    print("\nERROR: The 'pipe' object (text generation pipeline) was not found or not created successfully.")
    print("Please ensure the model loading cell was run successfully BEFORE this cell.")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Reading text file: /content/buyers-ID-manual-project/buyers-manual-article-01.txt
Successfully read 65405 characters.
Split text into 3 chunks.
Processing 3 chunks using prompt 'prompt_main_idea.txt'...

  Processing chunk 1/3. Length: 1708 characters.
    Found relevant info in chunk 1: The paper discusses the use of augmented intelligence in deterring sex buyers in the online sexual e...

  Processing chunk 2/3. Length: 15827 characters.
Error during generation pipeline call for prompt_main_idea.txt: CUDA out of memory. Tried to allocate 1.93 GiB. GPU 
    Found relevant info in chunk 2: Error: Exception during generation. (OutOfMemoryError)...

  Processing chunk 3/3. Length: 47863 characters.
Error during generation pipeline call for prompt_main_idea.txt: CUDA out of memory. Tried to allocate 7.39 GiB. GPU 
    Found relevant info in chunk 3: Error: Exception during generation. (OutOfMemoryError)...
Finished processing chunks. Found 3 relevant extractions for this prompt.

--- Aggr
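
The log above shows two things worth handling. Paragraph chunking produced chunks of 15,827 and 47,863 characters, and both ran out of GPU memory; the resulting error strings were then counted as relevant extractions. The sketch below is one possible mitigation, not the behaviour of `extraction_utils`. `split_oversized_chunks` and `drop_failed_extractions` are hypothetical helpers, the 2,000-character default is an arbitrary assumption, and a character budget is only a rough proxy for the model's token limit.

```python
def split_oversized_chunks(chunks, max_chars=2000):
    """Break any chunk longer than max_chars into pieces of at most max_chars.

    Cuts are made at the last sentence end (". ") inside the limit, else at the
    last space, else as a hard cut at max_chars.
    """
    result = []
    for chunk in chunks:
        text = chunk.strip()
        while len(text) > max_chars:
            window = text[:max_chars]
            cut = window.rfind(". ")
            if cut != -1:
                cut += 1  # keep the period with the left-hand piece
            else:
                cut = window.rfind(" ")
            if cut <= 0:
                cut = max_chars  # no natural break point found
            result.append(text[:cut].strip())
            text = text[cut:].strip()
        if text:
            result.append(text)
    return result


def drop_failed_extractions(results, error_prefix="Error:"):
    """Remove results that are error messages rather than extracted text."""
    return [r for r in results if not r.strip().startswith(error_prefix)]
```

A possible use is to pass `split_oversized_chunks(text_chunks)` to `process_text_chunks_with_prompt` and then filter its return value with `drop_failed_extractions` before printing. The `"Error:"` prefix matches the messages in the log above; adjust it if the utility reports failures differently.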