# TopicGPT_Python package

`topicgpt_python` consists of five modules:
- `generate_topic_lvl1` generates high-level, generalizable topics.
- `generate_topic_lvl2` generates low-level, specific topics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assignment is grounded in the topic list.

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Create a new Python 3.9+ environment using virtualenv or conda.
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- The package supports the OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run.
- Refer to https://openai.com/pricing/ for OpenAI API pricing and to https://cloud.google.com/vertex-ai/pricing for Vertex AI API pricing.

In [1]:
# Run in shell
#!pip install --upgrade topicgpt_python

# Needed only for the OpenAI API deployment
#export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
#export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
#export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

# Needed only for Gemini deployment
#export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
#export AZURE_OPENAI_API_KEY={your_azure_api_key}
#export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
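
Before running the pipeline, it can help to confirm that the variables for your chosen backend are actually set. The sketch below is not part of `topicgpt_python`: `REQUIRED_ENV` and `missing_env_vars` are hypothetical names, and the backend labels used as dictionary keys are illustrative assumptions. Only the variable names come from the list above.

```python
import os

# Environment variables listed above, grouped by backend.
# The backend labels used as keys are illustrative assumptions.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "vllm": [],  # local inference; no variable is listed above for it
}


def missing_env_vars(backend, environ=None):
    """Return the variables for `backend` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(backend, []) if not environ.get(name)]
```

Calling `missing_env_vars("azure")` returns an empty list when both Azure variables are set, and the names of the missing ones otherwise.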

## Usage
1. Define the file paths for I/O operations in `config.yml`.
2. Import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and set the `data_sample` path in `config.yml` to point to it.

- Prepare your `.jsonl` data file with one JSON object per line, each containing the following fields:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
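
The record above is shown pretty-printed for readability; in a `.jsonl` file each record occupies a single line. A minimal sketch of writing and reading such a file with the standard library follows. The helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of `topicgpt_python`.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line, the layout a .jsonl file uses."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # "text" is the only field not marked optional above.
            if "text" not in record:
                raise ValueError("each record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([{"id": "0", "text": "First document"}], "data/input/my_data.jsonl")` would produce a one-line file; the file name here is only a placeholder.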

In [2]:
import os

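# Disable libuv for PyTorch's distributed store before torch is imported;
# typically only needed on PyTorch builds without libuv support (e.g. on Windows).
os.environ['USE_LIBUV'] = '0'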

In [3]:
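import yaml

# Import only the pipeline functions used in this notebook.
from topicgpt_python import (
    assign_topics,
    correct_topics,
    generate_topic_lvl1,
    generate_topic_lvl2,
    refine_topics,
)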

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

INFO 02-12 23:11:41 __init__.py:192] Automatically detected platform cuda.
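
The cells in this notebook read a fixed set of keys from `config`, so a missing key only surfaces as a `KeyError` partway through a run. The sketch below checks for them up front. The key names are taken from the calls in this notebook; `EXPECTED_KEYS` and `missing_config_keys` are hypothetical helpers, not part of `topicgpt_python`.

```python
# Keys that the cells in this notebook read from `config`.
# None marks top-level keys; other entries are nested sections.
EXPECTED_KEYS = {
    None: ["data_sample", "verbose", "refining_topics", "generate_subtopics"],
    "generation": ["prompt", "seed", "output", "topic_output"],
    "refinement": ["prompt", "output", "topic_output", "remove", "mapping_file"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
    "correction": ["prompt", "output"],
}


def missing_config_keys(config):
    """Return dotted names of expected keys that are absent from `config`."""
    missing = []
    for section, keys in EXPECTED_KEYS.items():
        scope = config if section is None else config.get(section)
        prefix = "" if section is None else section + "."
        if not isinstance(scope, dict):
            # The whole section is absent or is not a mapping.
            missing.append(section)
            continue
        missing.extend(prefix + key for key in keys if key not in scope)
    return missing
```

Running `missing_config_keys(config)` right after loading `config.yml` gives a list that should be empty before any model is loaded.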


### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`.
- Define the API type and model.
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify the few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Early stopping is currently set to 100: generation stops if no new topic has been generated in the last 100 iterations.
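
The early-stopping rule in the last bullet can be pictured as a counter that resets whenever a new topic appears. The following is a minimal sketch of that idea, not the package's actual loop: `run_with_early_stopping` and the `propose_topics` callable (a stand-in for the model call) are hypothetical, and the package may count iterations differently.

```python
def run_with_early_stopping(documents, propose_topics, patience=100):
    """Collect topics until `patience` consecutive documents add nothing new.

    `propose_topics` is a hypothetical callable standing in for the model
    call; it takes a document and the topics so far and returns topic names.
    """
    topics = []
    stale = 0  # consecutive iterations that produced no new topic
    for doc in documents:
        added = False
        for topic in propose_topics(doc, topics):
            if topic not in topics:
                topics.append(topic)
                added = True
        if added:
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # no new topic in the last `patience` iterations
    return topics
```

With a small `patience`, the loop ends as soon as the model keeps returning topics that are already in the list, which saves calls on large corpora at the risk of missing rare topics that only appear late.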

In [None]:
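# Backend: uncomment the one you want to use.
# MODEL_API = 'vllm'
MODEL_API = 'custom_llm'

# Model: local path to the checkpoint (alternatives kept for reference).
# MODEL_NAME = 'C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16'
# MODEL_NAME = 'C:/git/Mistral-7B-Instruct-v0.3-quantized.w8a16'
MODEL_NAME = 'C:/git/Mistral-7B-Instruct-v0.1'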

generate_topic_lvl1(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

Preparing to import and load with custom_llm


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic generation...
Model: C:/git/Mistral-7B-Instruct-v0.1
Data file: data/input/sample.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: data/output/sample/generation_1.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


iterative_prompt system_message: You are a helpful assistant.
iterative_prompt prompt: You will receive a document and a set of top-level topics from a topic hierarchy. Your task is to identify generalizable topics within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set, please add them. Otherwise, output the existing top-level topics as identified in the document.

[Top-level topics]
[1] Trade

[Examples]
Example 1: Adding "[1] Agriculture"
Document: 
Saving Essential American Sailors Act or SEAS Act - Amends the Moving Ahead for Progress in the 21st Century Act (MAP-21) to repeal the Act's repeal of the agricultural export requirements that: (1) 25% of the gross tonnage of certain agricultural commodities or their products exported each fiscal year be transported on U.S. commercial vessels, and (2) the Secretary of Transportation (DOT) finance any increased ocean freight charges incurred in the transportation of 

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


type(custom_llm_output): <class 'torch.Tensor'>
custom_llm_output: tensor([[    1,   733, 16289, 28793,   995,   622,  5556,   264,  3248,   304,
           264,   808,   302,  1830, 28733,  4404, 13817,   477,   264,  9067,
         25846, 28723,  3604,  3638,   349,   298,  9051,  2952, 11552, 13817,
          2373,   272,  3248,   369,   541,   960,   390,  1830, 28733,  4404,
         13817,   297,   272, 25846, 28723,  1047,   707,  8598, 13817,   460,
          6925,   477,   272,  3857,   808, 28725,  4665,   967,   706, 28723,
         15510, 28725,  3825,   272,  6594,  1830, 28733,  4404, 13817,   390,
         10248,   297,   272,  3248, 28723,    13,    13, 28792,  6228, 28733,
          4404, 13817, 28793,    13, 28792, 28740, 28793, 17684,    13,    13,
         28792,   966,  9874, 28793,    13, 20275, 28705, 28740, 28747,  3301,
           288, 14264, 28740, 28793, 23837,   482, 28739,    13,  7364, 28747,
         28705,    13, 28735,  1652, 11299,  2256,  2556,   318,

### Topic Refinement
Topics generated by a weaker model sometimes include irrelevant or redundant entries. This module:
- Merges similar topics.
- Removes overly specific or redundant topics that occur < 1% of the time (you can skip this by setting `remove` to `False` in `config.yml`).
- Expect the refined topics in `data/output/{data_name}/refinement_1.md` and `data/output/{data_name}/refinement_1.jsonl`. If nothing changes, the topic list is already coherent.
- If you're unsatisfied with the refined topics, call the function again, passing in the refined topic file and the refined output file from the previous iteration.
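
To make the "< 1% of the time" rule concrete, here is one plausible reading of it: a topic is dropped when its share of all topic occurrences falls below the threshold. This is a sketch under that assumption, not the package's implementation; `drop_rare_topics` and `min_share` are hypothetical names.

```python
def drop_rare_topics(topic_counts, min_share=0.01):
    """Keep topics whose share of all topic occurrences is at least `min_share`.

    `topic_counts` maps a topic name to how often it was generated.
    """
    total = sum(topic_counts.values())
    if total == 0:
        return {}
    return {
        topic: count
        for topic, count in topic_counts.items()
        if count / total >= min_share
    }
```

For instance, with counts of 150, 49, and 1 across three topics, the last topic accounts for 0.5% of occurrences and would be removed under this reading.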

In [None]:
# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        MODEL_API,
        MODEL_NAME,
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.

In [None]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        MODEL_API,
        MODEL_NAME,
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- The model used here is often a weaker model to save cost, so the topics may not be grounded in the topic list. To correct this, use the `correct_topics` module. If there are still errors/hallucinations, run the `correct_topics` module again.
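
A quick way to see whether correction is needed is to compare the assigned topic names against the topic list. The sketch below assumes topic lines of the form `[level] Name`, as in the `[1] Trade` line of the prompt shown earlier, optionally followed by a count or description; that trailing format, and the helper names `parse_topic_names` and `ungrounded_topics`, are assumptions for illustration. How you extract assigned names from `assignment.jsonl` depends on its layout, which is not covered here.

```python
import re

# Matches lines such as "[1] Trade" or "[2] Export Controls: description".
# The name ends at the first ":" or "(" (assumed delimiters).
TOPIC_LINE = re.compile(r"^\s*\[(\d+)\]\s*([^:(\n]+)")


def parse_topic_names(topic_text):
    """Extract topic names from lines that start with "[level] Name"."""
    names = set()
    for line in topic_text.splitlines():
        match = TOPIC_LINE.match(line)
        if match:
            names.add(match.group(2).strip())
    return names


def ungrounded_topics(assigned_names, topic_text):
    """Return assigned names that do not appear in the topic list."""
    known = parse_topic_names(topic_text)
    return [name for name in assigned_names if name not in known]
```

A non-empty result from `ungrounded_topics` suggests that the assignments contain topics outside the list and that `correct_topics` is worth running.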

In [None]:
# Assignment
assign_topics(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # TODO: switch to config["generation_2"]["topic_output"] if you generated subtopics,
    # or to config["refinement"]["topic_output"] if you refined topics.
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

In [None]:
# Correction
correct_topics(
    MODEL_API,
    MODEL_NAME,
    config["assignment"]["output"],
    config["correction"]["prompt"],
    # TODO: switch to config["generation_2"]["topic_output"] if you generated subtopics,
    # or to config["refinement"]["topic_output"] if you refined topics.
    config["generation"]["topic_output"],
    config["correction"]["output"],
    verbose=config["verbose"],
)