<a href="https://colab.research.google.com/github/alex-jk/buyers-ID-manual-project/blob/main/process_studies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook extracts and cleans text from PDF reports for use in training the buyer profile extractor model.**

**Add input and output examples to a JSONL file**

In [1]:
!git clone https://github.com/alex-jk/buyers-ID-manual-project.git

Cloning into 'buyers-ID-manual-project'...
remote: Enumerating objects: 79, done.
remote: Counting objects: 100% (79/79), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 79 (delta 34), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (79/79), 2.86 MiB | 14.13 MiB/s, done.
Resolving deltas: 100% (34/34), done.


In [1]:
%cd buyers-ID-manual-project
!ls

/content/buyers-ID-manual-project
buyers-manual-article-01.txt  LICENSE		     __pycache__
data			      process_studies.ipynb  README.md
extraction_utils.py	      prompt_main_idea.txt
extract_text_from_pdfs.ipynb  prompt_template.txt


In [2]:
# Step 1: Remove the preinstalled NLTK and any cached NLTK data
!pip uninstall -y nltk
!rm -rf /root/nltk_data /content/nltk_data /usr/local/nltk_data

# Step 2: Reinstall a pinned version
!pip install nltk==3.8.1

Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Collecting nltk==3.8.1
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 64.4 MB/s eta 0:00:00
Installing collected packages: nltk
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textblob 0.19.0 requires nltk>=3.9, but you have nltk 3.8.1 which is incompatible.
Successfully installed nltk-3.8.1


**Import libraries**

In [2]:
# Install specific compatible versions for torch 2.3.1 / cu121
!pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 torchvision==0.18.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html

# Install the other pinned versions
!pip install transformers==4.41.2 accelerate==0.31.0 --no-deps

# Install compatible tokenizers version LAST to ensure it sticks
!pip install tokenizers==0.19.1

# Ensure base dependencies are present (redundant ok)
!pip install bitsandbytes sentencepiece

print("\nInstalled specific package versions (including compatible tokenizers). PLEASE RESTART RUNTIME NOW.")

Looking in links: https://download.pytorch.org/whl/torch_stable.html

Installed specific package versions (including compatible tokenizers). PLEASE RESTART RUNTIME NOW.


In [3]:
import pandas as pd
import json
import sys
import os
import nltk

In [4]:
import nltk
print("Attempting NLTK 'punkt' download directly...")
nltk.download('punkt')  # quiet=True and try/except are left out so any download error is visible
print("NLTK download command finished.")

Attempting NLTK 'punkt' download directly...
NLTK download command finished.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Read the input/output file and write it to JSONL**

In [5]:
input_output_df = pd.read_csv("data/buyers_id_manual_input_output.csv")

print(input_output_df.shape)
print(input_output_df.columns)
print("\n-------------------------------\n")
print(input_output_df.head())

(9, 3)
Index(['Study', 'Input', 'Output'], dtype='object')

-------------------------------

                                    Study  \
0  The Shapiro Group Georgia Demand Study   
1  The Shapiro Group Georgia Demand Study   
2  The Shapiro Group Georgia Demand Study   
3  The Shapiro Group Georgia Demand Study   
4  The Shapiro Group Georgia Demand Study   

                                               Input  \
0  Almost half these men are the age 30-39, with ...   
1  The data clearly debunk the myth that CSEC is ...   
2  Not only are 65% of men who buy sex with young...   
3  Craigslist is by far the most efficient medium...   
4  While many of the men who exploit these childr...   

                                              Output  
0  Almost half these men are the age 30-39, with ...  
1  Men who\nrespond to advertisements for sex wit...  
2  65% of men who buy sex with young females do s...  
3  Buyers respond to Craigslist ads 3 times more ...  
4  Nearly half of buyers

In [6]:
input_output_df = input_output_df.drop_duplicates(subset=["Study", "Input", "Output"])

# Convert to JSONL
with open("data/labeled_chunks.jsonl", "w") as f:
    for _, row in input_output_df.iterrows():
        json_obj = {
            "study": row["Study"],
            "input": row["Input"],
            "output": row["Output"]
        }
        f.write(json.dumps(json_obj) + "\n")
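
A quick way to sanity-check a JSONL file like the one written above is to read it back and compare it with the records that went in. The sketch below uses only the standard library; `write_jsonl`, `read_jsonl`, and the sample records are hypothetical names and data introduced for illustration, not part of the notebook.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records standing in for rows of input_output_df.
sample_records = [
    {"study": "Example Study", "input": "First line.\nSecond line.", "output": "First line."},
    {"study": "Example Study", "input": "Unrelated text.", "output": "NONE"},
]

# Write to a temporary file so the example does not touch data/labeled_chunks.jsonl.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    write_jsonl(sample_records, sample_path)
    round_tripped = read_jsonl(sample_path)

print("Round trip preserved all records:", round_tripped == sample_records)
```

Because `json.dumps` escapes embedded newlines, a multi-line `input` value still occupies a single line of the file, which is what keeps the one-object-per-line format intact.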

**Models for text extraction**

In [7]:
%cd /content/buyers-ID-manual-project

!git pull

/content/buyers-ID-manual-project
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 24.50 KiB | 3.50 MiB/s, done.
From https://github.com/alex-jk/buyers-ID-manual-project
   b54874a..7633ab8  main       -> origin/main
Updating b54874a..7633ab8
Fast-forward
 process_studies.ipynb | 5210 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 5101 insertions(+), 109 deletions(-)


In [8]:
import importlib
import extraction_utils

# Reload the entire module
importlib.reload(extraction_utils)
print("Module 'extraction_utils' reloaded.")

from extraction_utils import extract_information, chunk_text_by_tokens, process_text_chunks_with_prompt

print("Functions imported from extraction_utils successfully.")

Module 'extraction_utils' reloaded.
Functions imported from extraction_utils successfully.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [10]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [11]:
# Check installed versions of required packages
!pip list | grep -E 'transformers|torch|accelerate|flash-attn'

accelerate                            0.31.0
sentence-transformers                 3.4.1
torch                                 2.3.1+cu121
torchaudio                            2.3.1+cu121
torchsummary                          1.5.1
torchvision                           0.18.1+cu121
transformers                          4.41.2
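
The same version check can be done from Python without shelling out to `pip list`. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` and the package list are assumptions chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the install cells earlier in the notebook.
for name, ver in installed_versions(
    ["torch", "transformers", "accelerate", "tokenizers", "nltk"]
).items():
    print(f"{name:<15} {ver or 'not installed'}")
```

Unlike the `grep` pattern, this matches exact distribution names, so unrelated packages such as `sentence-transformers` or `torchsummary` do not appear in the listing.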


In [12]:
# --- Configuration ---
model_id = "microsoft/Phi-3-mini-4k-instruct"

# --- Load Model and Tokenizer ---
# 4-bit quantization (requires bitsandbytes) is available but commented out below; enable it to save memory
# device_map="auto" places the model on the GPU if one is available
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",         # Use GPU if available, otherwise CPU
        torch_dtype="auto",        # Automatically select appropriate dtype
        trust_remote_code=True,    # Phi-3 requires this
        # Optional: uncomment below for 4-bit loading (needs bitsandbytes)
        # load_in_4bit=True,
        # bnb_4bit_compute_dtype=torch.bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model '{model_id}' loaded successfully.")

    # --- Create a Hugging Face Pipeline for easier text generation ---
    # Note: max_new_tokens controls how long the generated output can be. Adjust as needed.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        # Adjust max_new_tokens if your extracted text might be longer
        # Needs to be long enough for the longest expected extraction + "NONE"
        max_new_tokens=256,
        # Temperature=0 means more deterministic output, higher means more creative/random
        # temperature=0.0,
        # top_p=0.95, # Optional: nucleus sampling
        do_sample=False # Set to False for more deterministic extraction
    )

except Exception as e:
    print(f"Error loading model or creating pipeline: {e}")
    print("Ensure you have sufficient RAM/VRAM and necessary libraries installed.")
    print("Consider using Google Colab with a T4 GPU runtime.")
    # Re-raise so the failure is visible and later cells do not run without 'pipe'
    raise

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Model 'microsoft/Phi-3-mini-4k-instruct' loaded successfully.


**Example Usage**

In [13]:
input_text_1 = "Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67."

input_text_2 = "The data clearly debunk the myth that CSEC is a problem relegated to the urban core. Men who respond to advertisements for sex with young females come from all over metro Atlanta, the geographic market where the advertisements in this study were targeted."

input_text_3 = "This paragraph discusses unrelated economic factors in the region and contains no information about buyers or traffickers."

input_text_4 = "Research indicates traffickers often groom potential buyers by displaying luxury goods online and frequenting specific forums known for risky behavior discussions. Victims may appear withdrawn or display signs of coaching. Reporting suspicions anonymously through the national hotline is encouraged."

In [14]:
prompt_buyers_profiles = "prompt_template.txt"

In [15]:
!cat /content/buyers-ID-manual-project/extraction_utils.py

import torch
import os
import re
import nltk
nltk.download('punkt')

# --- Function to Load a Specific Prompt File ---
def load_system_prompt(prompt_filename):
    """Loads system prompt text from a specified file."""
    system_prompt = ""
    try:
        # Assumes prompt file is in the same directory as this script
        script_dir = os.path.dirname(__file__)
        prompt_path = os.path.join(script_dir, prompt_filename)
        with open(prompt_path, 'r', encoding='utf-8') as f:
            system_prompt = f.read()
        if not system_prompt:
            return f"ERROR: Prompt file empty {prompt_filename}" # Return specific error
    except FileNotFoundError:
        print(f"ERROR: Prompt file not found at {prompt_path}")
        return f"ERROR: Prompt file not found {prompt_filename}" # Return specific error
    except Exception as e:
        print(f"ERROR: Could not read prompt file {prompt_path}: {e}")
        return f"ERROR: Could not read prompt file {prompt_filename}" # 

In [16]:
print("\n--- Extracting Information ---")

# Check if the pipeline object exists before trying to use it
if 'pipe' in locals() and pipe is not None:
    print(f"\nRunning prompt: {prompt_buyers_profiles}")

    print(f"\nInput 1:\n{input_text_1}")
    extraction_1 = extract_information(input_text_1, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 1:\n{extraction_1}")

    print("-" * 20)

    print(f"\nInput 2:\n{input_text_2}")
    extraction_2 = extract_information(input_text_2, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 2:\n{extraction_2}")

    print("-" * 20)

    print(f"\nInput 3:\n{input_text_3}")
    extraction_3 = extract_information(input_text_3, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 3:\n{extraction_3}")

    print("-" * 20)

    print(f"\nInput 4:\n{input_text_4}")
    extraction_4 = extract_information(input_text_4, pipe, prompt_buyers_profiles)
    print(f"\nExtraction 4:\n{extraction_4}")

else:
    print("\nERROR: The 'pipe' object (text generation pipeline) was not found or not created successfully.")
    print("Please ensure the model loading cell was run successfully after importing functions.")


--- Extracting Information ---

Running prompt: prompt_template.txt

Input 1:
Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67.





Extraction 1:
Almost half these men are the age 30-39, with the next largest group being men under age 30. The mean age is 33 and the median 31. The youngest survey participant was 18, and the oldest was 67.
--------------------

Input 2:
The data clearly debunk the myth that CSEC is a problem relegated to the urban core. Men who respond to advertisements for sex with young females come from all over metro Atlanta, the geographic market where the advertisements in this study were targeted.

Extraction 2:
Men who respond to advertisements for sex with young females come from all over metro Atlanta.
--------------------

Input 3:
This paragraph discusses unrelated economic factors in the region and contains no information about buyers or traffickers.

Extraction 3:
NONE
--------------------

Input 4:
Research indicates traffickers often groom potential buyers by displaying luxury goods online and frequenting specific forums known for risky behavior discussions. Victims may appear withdr
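
The four extraction calls in the cell above follow the same print, extract, print pattern, so they could be folded into a loop. The sketch below shows one way to do that; `run_extractions` and `stub_extract` are hypothetical helpers, and the stub only imitates the NONE behavior seen in the output rather than calling a model.

```python
def run_extractions(texts, extract_fn):
    """Apply `extract_fn` to each text, print both, and return (text, extraction) pairs."""
    results = []
    for index, text in enumerate(texts, start=1):
        extraction = extract_fn(text)
        print(f"\nInput {index}:\n{text}")
        print(f"\nExtraction {index}:\n{extraction}")
        print("-" * 20)
        results.append((text, extraction))
    return results


def stub_extract(text):
    """Illustration-only stand-in for the model call; it does not run a model."""
    return "NONE" if "unrelated" in text.lower() else text


demo_results = run_extractions(
    ["Buyers respond to online ads.", "This paragraph discusses unrelated factors."],
    stub_extract,
)
```

In the notebook itself, `extract_fn` could be `lambda text: extract_information(text, pipe, prompt_buyers_profiles)`, which matches the call used for each input above.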

**Research Articles - extract main ideas**

In [17]:
# 1. DEFINE INPUT and PROMPT FILENAMES
input_txt_filename = "buyers-manual-article-01.txt"  # <--- CHANGE to your uploaded .txt filename
prompt_to_use = "prompt_main_idea.txt"    # <--- CHANGE to the prompt file you want to apply

input_txt_filepath = os.path.join('/content/buyers-ID-manual-project', input_txt_filename)

In [18]:
!ls /content/

buyers-ID-manual-project  nltk_data  sample_data


In [19]:
import nltk
import os

# Force a fixed nltk data path
nltk_data_dir = "/content/nltk_data"
os.makedirs(nltk_data_dir, exist_ok=True)
nltk.data.path.clear()
nltk.data.path.append(nltk_data_dir)

# Download 'punkt' there
nltk.download('punkt', download_dir=nltk_data_dir, quiet=False)

# 2. Prompt files to apply to each chunk (the pipeline/tokenizer check follows below)
prompts_to_run = [
    "prompt_main_idea.txt"
]

CHUNK_MAX_TOKENS = 2500

if 'pipe' in locals() and callable(pipe) and 'tokenizer' in locals() and hasattr(tokenizer, 'encode'):
    print(f"Pipeline and tokenizer are ready.")
    # --- Read Input File ---
    print(f"Reading text file: {input_txt_filepath}")
    try:
        with open(input_txt_filepath, 'r', encoding='utf-8') as f:
            full_text_content = f.read()
        print(f"Successfully read {len(full_text_content)} characters.")

        if full_text_content:
            # --- Chunk Text ---
            # Calls the function imported from extraction_utils.py
            try:
                print("DEBUG: Checking/Downloading NLTK punkt right before chunking...")

                # nltk_data_dir was created and added to nltk.data.path at the top of this cell;
                # re-run the download so 'punkt' is present in that directory.
                nltk.download('punkt', download_dir=nltk_data_dir, quiet=True)

                print("DEBUG: NLTK download check complete.")
            except Exception as nltk_e:
                print(f"DEBUG: NLTK download right before use failed: {nltk_e}")
                raise RuntimeError("NLTK download failed immediately before use.") from nltk_e

            print(f"Chunking text with max_chunk_tokens = {CHUNK_MAX_TOKENS}...")
            text_chunks = chunk_text_by_tokens(full_text_content, tokenizer, max_chunk_tokens=CHUNK_MAX_TOKENS)

            if text_chunks:
                # --- Optional Debug: Print Chunk Token Counts ---
                print("\n--- Checking Chunk Token Counts ---")
                for i, chunk in enumerate(text_chunks):
                    try:
                        chunk_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
                        print(f"Chunk {i+1} token count: {chunk_tokens}")
                    except Exception as e:
                        print(f"Could not tokenize chunk {i+1}: {e}")
                print("------------------------------------\n")

                # --- Process Chunks for Each Prompt ---
                all_results_by_prompt = {}  # Dictionary to store results per prompt
                for prompt_file in prompts_to_run:
                    # Calls the function imported from extraction_utils.py
                    final_results = process_text_chunks_with_prompt(text_chunks, pipe, prompt_file)
                    all_results_by_prompt[prompt_file] = final_results

                # --- Display All Aggregated Results ---
                print("\n\n" + "="*40)
                print("          FINAL AGGREGATED RESULTS")
                print("="*40)
                for prompt_file, results in all_results_by_prompt.items():
                    print(f"\n--- Results for Prompt: {prompt_file} ---")
                    if results:
                        for i, result in enumerate(results):
                            print(f"Extraction {i+1}:\n{result}\n{'-'*20}")
                    else:
                        print(f"No relevant information extracted using '{prompt_file}'.")
                    print("="*30)

            else:
                print("Text file was read, but no chunks were created (check chunking logic/content).")
        else:
            print("Text file is empty.")

    except FileNotFoundError:
        print(f"ERROR: Input text file not found at '{input_txt_filepath}'. Check filename and current directory.")
    except Exception as e:
        print(f"An unexpected error occurred during processing: {e}")
        import traceback
        traceback.print_exc() # Print detailed traceback for debugging

else:
    print("\nERROR: The 'pipe' object and/or 'tokenizer' object not found or not ready.")
    print("Please ensure the model loading cell was run successfully BEFORE this cell.")

Pipeline and tokenizer are ready.
Reading text file: /content/buyers-ID-manual-project/buyers-manual-article-01.txt
Successfully read 65405 characters.
DEBUG: Checking/Downloading NLTK punkt right before chunking...
DEBUG: NLTK download check complete.
Chunking text with max_chunk_tokens = 2500...
DEBUG: nltk.data.find('tokenizers/punkt') succeeded inside chunk_text_by_tokens.
ERROR: NLTK 'punkt' tokenizer data not found. Ensure nltk.download('punkt') ran successfully.
Text file was read, but no chunks were created (check chunking logic/content).


[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
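
The source of `chunk_text_by_tokens` is not shown in this notebook, so the following is only a generic sketch of sentence-aware chunking under a token budget, not the implementation in `extraction_utils.py`. It assumes a naive regex sentence splitter in place of NLTK and a whitespace token counter in place of the model tokenizer; `split_sentences`, `chunk_by_token_budget`, and `whitespace_token_count` are hypothetical names.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_token_budget(text, count_tokens, max_chunk_tokens):
    """Greedily pack whole sentences into chunks of at most `max_chunk_tokens`.

    A single sentence longer than the budget becomes its own oversized chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_token_count(s):
    """Stand-in token counter: number of whitespace-separated words."""
    return len(s.split())


demo_text = "One two three. Four five six seven. Eight nine. Ten!"
demo_chunks = chunk_by_token_budget(demo_text, whitespace_token_count, max_chunk_tokens=5)
for i, chunk in enumerate(demo_chunks, start=1):
    print(f"Chunk {i} ({whitespace_token_count(chunk)} tokens): {chunk}")
```

With a Hugging Face tokenizer, `count_tokens` could be `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`, the same call the cell above uses to report chunk sizes. Per-sentence counts may not sum exactly to the count of the joined chunk with a subword tokenizer, so leaving some headroom below the model's context limit is a reasonable precaution.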


In [20]:
import nltk
import os
from nltk.tokenize import sent_tokenize

# Force specific download path
nltk_data_dir = "/content/nltk_data"
os.makedirs(nltk_data_dir, exist_ok=True)
nltk.data.path.clear()
nltk.data.path.append(nltk_data_dir)
nltk.download('punkt', download_dir=nltk_data_dir, quiet=False)

# Test tokenization
sample_text = "This is a test. Here is another sentence. Is it working?"
print("nltk.data.path = ", nltk.data.path)

try:
    sentences = sent_tokenize(sample_text)
    print("✅ Tokenization succeeded. Sentences:")
    print(sentences)
except Exception as e:
    print("❌ Tokenization failed:", e)


nltk.data.path =  ['/content/nltk_data']
❌ Tokenization failed: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/content/nltk_data'
**********************************************************************



[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [23]:
from nltk.tokenize import sent_tokenize

text = "This is a test. Here's another sentence. And one more!"
sentences = sent_tokenize(text)
print("✅ It works. Sentences:")
print(sentences)


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/content/nltk_data'
**********************************************************************
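
The LookupError names `punkt_tab`, while the cells above only download `punkt`, and the traceback itself suggests `nltk.download('punkt_tab')`. A minimal sketch of that fix is to fetch both resources into the directory that is already on `nltk.data.path`. The helper name `ensure_sentence_tokenizer_data` is hypothetical, and whether this resolves the error in this particular runtime is an assumption that has not been verified here.

```python
import os


def ensure_sentence_tokenizer_data(data_dir, resources=("punkt", "punkt_tab")):
    """Download the given NLTK resources into `data_dir` and add it to NLTK's search path.

    Returns a dict mapping each resource name to whether nltk.download reported success.
    """
    import nltk  # imported lazily so the helper can be defined before nltk is installed

    os.makedirs(data_dir, exist_ok=True)
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)
    return {
        name: bool(nltk.download(name, download_dir=data_dir, quiet=True))
        for name in resources
    }
```

In this notebook the call would be `ensure_sentence_tokenizer_data("/content/nltk_data")`, run before `sent_tokenize`. Checking the returned dict shows which download, if any, reported failure.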
