# ITAR Topic Modeling

Notes: 
- This code prints the LLM results in plain text but does not store them in a dataframe, json file, spreadsheet, etc.
- This code has not been optimized for performance.
- If you run into an Out of Memory (OOM) error, try reducing `batch_size`.
- Make sure that `final_batch_size` is much smaller than `batch_size`.
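
Regarding the first note, one possible way to persist the results instead of only printing them is sketched below. The helper name `save_topics`, the default path `topics.json`, and the record layout are illustrative assumptions and are not part of this notebook.

```python
import json
from pathlib import Path


def save_topics(batch_topics, final_topics, path="topics.json"):
    """Write per-batch and final topic text to a JSON file (hypothetical helper)."""
    record = {
        "batch_topics": list(batch_topics),  # one LLM response per batch
        "final_topics": final_topics,        # the consolidated LLM response
    }
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

With the variables defined later in this notebook, it could be called as `save_topics(batched_description_topics, final_topics)` once the last cell has run.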

In [1]:
%pip install transformers -q
%pip install accelerate -q
%pip install openpyxl -q

Note: you may need to restart the kernel to use updated packages.


## Model
**IMPORTANT**: This notebook is intended to load the model with `device_map="auto"`. However, with the latest version of `meta-llama/Llama-3.2-3B-Instruct`, `device_map="auto"` may raise `RuntimeError: probability tensor contains either inf, nan or element < 0`. If you encounter this error, set `device_map="cuda"`, as the cell below does.

In [2]:
import os
import torch
import transformers
from transformers import pipeline

# Hugging Face user token. Replace the placeholder with your own token; never commit a real one.
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"

model_id = "meta-llama/Llama-3.2-3B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    temperature = 0.01,
    top_k = 1
)
transformers.logging.set_verbosity_error()
print('[Done]')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[Done]


## Data

In [3]:
import pandas as pd
import openpyxl

FILE_XLS = 'x_dnllp_dhs_itar_itar-SME Review-241016.xlsx'
df_conditions = pd.read_excel(FILE_XLS, sheet_name='Conditions')
print(f'Num rows: {len(df_conditions)}')

# Replace all single quotes with double quotes to prevent issues with list of strings below.
# DataFrame.replace returns a new frame, so the result must be assigned back.
df_conditions = df_conditions.replace({'\'': '"'}, regex=True)

df_hq_coord_group = df_conditions.loc[df_conditions['Opened By Group'] == 'ITAR HQ Coordinator Group']
print(f'Num rows for HQ Coord Group: {len(df_hq_coord_group)}')

Num rows: 17620
Num rows for HQ Coord Group: 10524


## Utils

In [4]:
import time
from datetime import datetime as dt
import pytz


def get_elapsed(start):
    elapsed = time.time() - start
    # time.gmtime avoids datetime.utcfromtimestamp, which is deprecated since Python 3.12.
    return time.strftime('%Hh:%Mm:%Ss', time.gmtime(elapsed))


def get_datetime():
    EST = pytz.timezone("America/New_York")
    datetime_est = dt.now(EST)
    return datetime_est.strftime("%m/%d/%Y, %H:%M:%S")

## Topics Inference

The functions below are used twice: first to identify the significant topics in each batch of Descriptions, *AND* then to identify the significant topics across the resulting per-batch topic lists.

In [5]:
def get_prompt(batched_descriptions, num_topics):
    messages = [
        {"role": "system", "content": """The following is a list <B> of texts <T>. List the """ + str(num_topics) + 
             """ most significant topics across all texts in <B>. Provide a 2-3 sentence description of each topic including how 
             many texts out of <B> were relevant to the topic. Do NOT include any additional text. Do not include any XML tags."""},
        {"role": "user", "content": batched_descriptions},
    ]
    return messages


def get_topics(batched_descriptions, num_topics):
    prompt = get_prompt(batched_descriptions, num_topics)
    
    outputs = pipe(
        prompt,
        max_new_tokens=512,
    )
    x = outputs[0]["generated_text"][-1]
    #print(f'x.get("content"): {x.get("content")}')
    return x.get("content")
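
The last two lines of `get_topics` rely on the chat-style output of the text-generation pipeline, where `generated_text` holds the input messages with the assistant reply appended at the end. The sketch below illustrates that lookup on a hand-written stand-in for the pipeline output; `extract_reply` and `fake_outputs` are hypothetical names, and the sample content is not real model output.

```python
def extract_reply(outputs):
    """Return the content of the last message in the first generation."""
    messages = outputs[0]["generated_text"]
    return messages[-1]["content"]


# Hand-written stand-in for the pipeline's return value (illustrative only).
fake_outputs = [{
    "generated_text": [
        {"role": "system", "content": "List the topics."},
        {"role": "user", "content": "<B>\n<T>example</T>\n</B>\n"},
        {"role": "assistant", "content": "1. Example topic"},
    ]
}]

reply = extract_reply(fake_outputs)
```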

## Analyze

### Get Batched Topics
- Break down list of Descriptions into a set of `batched_descriptions`, where a `batched_descriptions` is a string of concatenated Descriptions.
- For each `batched_descriptions`, generate a description of significant `topics`.
- Add each `topics` description to a `batched_description_topics` list.
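
The first step above can also be sketched as a small generator, which avoids tracking a running counter and a separate last-batch branch. This is a minimal sketch rather than the code used in the cell below; `iter_batches` is a hypothetical helper name.

```python
def iter_batches(descriptions, batch_size):
    """Yield one <B>...</B> string per group of up to batch_size descriptions."""
    for start in range(0, len(descriptions), batch_size):
        chunk = descriptions[start:start + batch_size]
        # Wrap each description in <T> tags, matching the prompt's conventions.
        body = "".join(f"<T>{text}</T>\n" for text in chunk)
        yield "<B>\n" + body + "</B>\n"


batches = list(iter_batches(["a", "b", "c"], batch_size=2))
```

With the data frame from this notebook, the input list could be built with `df_hq_coord_group['Description'].astype(str).tolist()`.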

In [6]:
import math

count = 0
num_batches = 0
batch_size = 100
num_topics = 5
batched_descriptions = "<B>\n"
batched_description_topics = []

print(f'Starting... {(get_datetime())}')
print(f'- Num records in group: {len(df_hq_coord_group)}')
print(f'- Batch size: {batch_size}')
print(f'- Num topics: {num_topics}')

total_num_batches = math.ceil(len(df_hq_coord_group) / (batch_size))

start = time.time()

for row, cols in df_hq_coord_group.iterrows():
    count += 1
    batched_descriptions += "<T>" + str(cols['Description']) + "</T>\n"
    
    if count % batch_size == 0:
        # We have a batch
        num_batches += 1

        print(f'Processing {num_batches} of {total_num_batches} batches...', end='\r')
        batched_descriptions += "</B>\n"
        #print(batched_descriptions)
        topics = get_topics(batched_descriptions, num_topics)
        #print(topics)
        batched_description_topics.append(topics)
        
        # Reset vars
        batched_descriptions = "<B>\n"
        
    elif count == len(df_hq_coord_group):
        # We have the last batch
        num_batches += 1
        print(f'Processing final batch {num_batches} of {total_num_batches}...')
        batched_descriptions += "</B>\n"
        #print(batched_descriptions)
        topics = get_topics(batched_descriptions, num_topics)
        #print(topics)
        batched_description_topics.append(topics)

print(f'len batched_description_topics: {len(batched_description_topics)}')
#print(f'batched_description_topics: {batched_description_topics}')
print(f'\n[End (elapsed: {get_elapsed(start)})]')

Starting... 11/01/2024, 16:13:14
- Num records in group: 10524
- Batch size: 100
- Num topics: 5
Processing final batch 106 of 106...
len batched_description_topics: 106

[End (elapsed: 00h:25m:22s)]


### Batch Significant Topics from Batched Description Topics

- Break down `batched_description_topics` list into a set of `batched_topics`, where `batched_topics` is a string of concatenated topics for a batch of Descriptions. 
- For each `batched_topics`, generate a description of significant `pre_final_topics`.
- For `pre_final_topics`, generate the `final_topics`.
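
The two-stage reduction described above can be sketched independently of the model by passing the summarizer in as a function. This is a sketch under stated assumptions: `reduce_topics` and `fake_summarize` are hypothetical names, and the stand-in summarizer only records its inputs so the example runs without a GPU.

```python
def reduce_topics(batch_topics, final_batch_size, summarize):
    """Summarize groups of per-batch topics, then summarize the combined result."""
    pre_final = ""
    for start in range(0, len(batch_topics), final_batch_size):
        group = batch_topics[start:start + final_batch_size]
        body = "".join(f"<T>{topics}</T>\n" for topics in group)
        pre_final += summarize("<B>\n" + body + "</B>\n")
    # The final pass sees the concatenated intermediate summaries.
    return summarize(pre_final)


calls = []  # every text the stand-in summarizer receives


def fake_summarize(text):
    """Stand-in for the LLM call; returns a numbered placeholder."""
    calls.append(text)
    return f"summary#{len(calls)}"


result = reduce_topics(["t1", "t2", "t3"], 2, fake_summarize)
```

In this notebook, `summarize` would correspond to something like `lambda text: get_topics(text, final_num_topics)`.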

In [7]:
final_topics = ""
batched_topics = "<B>\n"
# Note: final_batch_size must be smaller than batch_size.
final_batch_size = 25
final_num_topics = 5
count = 0
pre_final_topics = ""
num_batches = 0

num_batched_description_topics = len(batched_description_topics)
total_num_batches = math.ceil((num_batched_description_topics) / (final_batch_size))

start = time.time()

print(f'Starting... {(get_datetime())}')
print(f'- Num records in num_batched_description_topics: {num_batched_description_topics}')
print(f'- Final batch size: {final_batch_size}')
print(f'- Final num topics: {final_num_topics}')

for i in range(num_batched_description_topics):
    count += 1
    batched_topics += "<T>" + batched_description_topics[i] + "</T>\n"
    
    if count % final_batch_size == 0:
        # We have a batch
        num_batches += 1

        print(f'Processing {num_batches} of {total_num_batches} batched topics...', end='\r')
        batched_topics += "</B>\n"
        #print(batched_topics)
        pre_final_topics += get_topics(batched_topics, final_num_topics)
        
        # Reset vars
        batched_topics = "<B>\n"
        
    elif count == num_batched_description_topics:
        # We have the last batch
        num_batches += 1
        print(f'Processing final batch {num_batches} of {total_num_batches} batched topics...')
        batched_topics += "</B>\n"
        #print(batched_topics)
        pre_final_topics += get_topics(batched_topics, final_num_topics)

        
#print(f'pre_final_topics:\n {pre_final_topics}')

# Announce the final pass before running it, not after.
print('\nGetting final topics...')
final_topics = get_topics(pre_final_topics, final_num_topics)
print(f'\n---------------\nFinal Topics:\n{final_topics}')
print(f'\n[End (elapsed: {get_elapsed(start)})]')

Starting... 11/01/2024, 16:38:37
- Num records in num_batched_description_topics: 106
- Final batch size: 25
- Final num topics: 5
Processing final batch 5 of 5 batched topics...

Getting final topics...

---------------
Final Topics:
1. **Procurement Tab Information**: This topic involves entering information into the Procurement Tab, including vendor name, award amount, Contract #/Procurement Instrument Identifier (PIID), and other relevant details. This topic is relevant to 14 texts.

2. **Unique Investment Identifier (UII) Information**: This topic involves the entry of UII information into the "Notes" tab in the "Additional Comments" section of the request's Dashboard, identifying UIIs that are new, changed, or remained the same. This topic is relevant to 14 texts.

3. **Contract Award and Procurement Tab**: This topic involves the entry of award information, such as vendor name, award amount, and Contract #/Procurement Instrument Identifier (PIID), into the Procurement tab of the