# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific topics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the generated topics by reprompting the model so that the topic assignment is grounded in the topic list. 

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. 
- Please refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing. 

In [None]:
# Run in shell
#!pip install --upgrade topicgpt_python

# Needed only for the OpenAI API deployment
#export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
#export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
#export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

# Needed only for Gemini deployment
#export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
#export AZURE_OPENAI_API_KEY={your_azure_api_key}
#export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 
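
Before running the pipeline, it can help to confirm that `config.yml` defines every key the calls in this notebook read. The following is a minimal sketch: the key names are taken from the function calls in this notebook, while `missing_config_keys` is a hypothetical helper and not part of `topicgpt_python`.

```python
# Keys read by the calls in this notebook (not an official schema).
TOP_LEVEL_KEYS = ["data_sample", "verbose", "refining_topics", "generate_subtopics"]
SECTION_KEYS = {
    "generation": ["prompt", "seed", "output", "topic_output"],
    "refinement": ["prompt", "output", "topic_output", "remove", "mapping_file"],
    "generation_2": ["prompt", "output", "topic_output"],
    "assignment": ["prompt", "output"],
    "correction": ["prompt", "output"],
}


def missing_config_keys(config):
    """Return dotted names of keys this notebook reads but `config` lacks."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in config]
    for section, keys in SECTION_KEYS.items():
        # A missing or empty section counts as having none of its keys.
        section_values = config.get(section) or {}
        missing.extend(
            f"{section}.{key}" for key in keys if key not in section_values
        )
    return missing
```

Calling `missing_config_keys(config)` on the dictionary returned by `yaml.safe_load` gives an empty list when all of these keys are present.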

- Prepare your `.jsonl` data file in the following format:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
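
In a `.jsonl` file each record sits on its own line, so the object above is written as a single line per document. A minimal sketch of writing and reading such a file follows; `write_jsonl` and `read_jsonl` are hypothetical helper names, not functions from `topicgpt_python`.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line; every record must carry a 'text' field."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if "text" not in record:
                raise ValueError("each record needs a 'text' field")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The `id` and `label` fields can be left out of a record, matching the optional fields in the format above.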

In [1]:
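# Import only the five pipeline functions used in this notebook
from topicgpt_python import (
    generate_topic_lvl1,
    generate_topic_lvl2,
    refine_topics,
    assign_topics,
    correct_topics,
)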
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

INFO 02-13 02:22:49 __init__.py:192] Automatically detected platform cuda.


### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`. 
- Define the API type and model. 
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Early stopping is currently set to 100: if no new topic has been generated in the last 100 iterations, the generation process stops.
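
The early-stopping rule can be illustrated with a short sketch. This is not the package's implementation: `propose_topics` is a hypothetical stand-in for the model call, and `patience` plays the role of the 100-iteration limit described above.

```python
def run_with_early_stopping(documents, propose_topics, patience=100):
    """Collect topics until `patience` consecutive documents add nothing new."""
    topics = set()
    stale = 0  # consecutive iterations without a new topic
    for doc in documents:
        new_topics = set(propose_topics(doc)) - topics
        if new_topics:
            topics |= new_topics
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return topics
```

A smaller `patience` ends generation sooner but risks missing topics that only appear late in the data.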

In [2]:
#MODEL_API = 'vllm'
MODEL_API = 'custom_llm'

MODEL_NAME = "C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16"
#MODEL_NAME = "C:/git/Mistral-7B-Instruct-v0.3-quantized.w8a16"
#MODEL_NAME = 'C:/git/Mistral-7B-Instruct-v0.1'

In [None]:
generate_topic_lvl1(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

### Topic Refinement
Topics generated by a weaker model sometimes include irrelevant or redundant entries. This module: 
- Merges similar topics.
- Removes overly specific or redundant topics that occur less than 1% of the time (skip this step by setting `remove` to False in `config.yml`).
- Expect the refined topics in `data/output/{data_name}/refinement_1.md` and `data/output/{data_name}/refinement_1.jsonl`. If nothing changes, the topic list is already coherent.
- If you're unsatisfied with the refined topics, call the function again, passing the refined topic file and the refined output file from the previous iteration as inputs.
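
The 1% removal rule amounts to a frequency filter over topic occurrences. The sketch below shows the idea under the assumption that each occurrence is one topic label in a flat list; how `refine_topics` actually counts occurrences may differ, and `drop_rare_topics` is a hypothetical name.

```python
from collections import Counter


def drop_rare_topics(topic_labels, min_share=0.01):
    """Split topics into (kept, removed) by their share of all occurrences."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        return set(), set()
    kept = {topic for topic, n in counts.items() if n / total >= min_share}
    removed = set(counts) - kept
    return kept, removed
```

With 200 occurrences in total, for example, a topic that appears once has a share of 0.5% and would be removed.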

In [None]:
# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        MODEL_API,
        MODEL_NAME,
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.
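
The per-topic iteration can be pictured as grouping documents by their high-level topic and drawing a few examples from each group. This is only a sketch of that idea: the input shape, the sample size `k`, and the helper name `sample_examples_per_topic` are assumptions, not the behavior of `generate_topic_lvl2`.

```python
import random
from collections import defaultdict


def sample_examples_per_topic(doc_topics, k=3, seed=0):
    """Pick up to `k` example documents for each high-level topic.

    doc_topics: iterable of (document_text, high_level_topic) pairs.
    """
    by_topic = defaultdict(list)
    for text, topic in doc_topics:
        by_topic[topic].append(text)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        topic: rng.sample(texts, min(k, len(texts)))
        for topic, texts in by_topic.items()
    }
```

Each sampled group would then be sent to the model together with its high-level topic to request subtopics.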

In [None]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        MODEL_API,
        MODEL_NAME,
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- The model used here is often a weaker model to save cost, so the topics may not be grounded in the topic list. To correct this, use the `correct_topics` module. If there are still errors/hallucinations, run the `correct_topics` module again.
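
To illustrate what "grounded in the topic list" means, the sketch below checks whether the labels in a model response appear in the topic file. It assumes topic lines of the form `[1] Topic Name: description`, as in the prompt printed by the cells below; the regular expression and both function names are hypothetical and may not match how `correct_topics` parses responses.

```python
import re

# Matches "[<level>] <name>" and stops the name at ":" or "(".
TOPIC_PATTERN = re.compile(r"\[(\d+)\]\s*([^:(\n]+)")


def parse_topic_names(topic_markdown):
    """Collect the topic names listed in a topic file."""
    return {name.strip() for _, name in TOPIC_PATTERN.findall(topic_markdown)}


def check_assignment(response, allowed_names):
    """Classify a response as 'ok', 'no_topics', or 'hallucinated'."""
    found = [name.strip() for _, name in TOPIC_PATTERN.findall(response)]
    if not found:
        return "no_topics", []
    unknown = [name for name in found if name not in allowed_names]
    if unknown:
        return "hallucinated", unknown
    return "ok", found
```

Responses classified as `no_topics` or `hallucinated` are the ones that would need to be reprompted.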

In [3]:
# Assignment
assign_topics(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # TODO: use config["generation_2"]["topic_output"] if you have subtopics,
    # or config["refinement"]["topic_output"] if you refined topics.
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

CUDA extension not installed.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic assignment...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Data file: data/input/sample.jsonl
Prompt file: prompt/assignment.txt
Output file: data/output/sample/assignment.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 11.78it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


batch_prompt prompts: ['You will receive a document and a topic hierarchy. Assign the document to the most relevant topics the hierarchy. Then, output the topic labels, assignment reasoning and supporting quotes from the document. DO NOT make up new topics or quotes.  \n\n[Topic Hierarchy]\n[1] Forest Conservation: Regulates the management and protection of forest areas to maintain their natural state.\n[1] Oil and Gas Exploration: Discusses the exploration, extraction, and management of oil and gas resources.\n[1] Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.</s>\n\n[Examples]\nExample 1: Assign "[1] Agriculture" to the document\nDocument: \nSaving Essential American Sailors Act or SEAS Act - Amends the Moving Ahead for Progress in the 21st Century Act



In [4]:
# Correction
correct_topics(
    MODEL_API,
    MODEL_NAME,
    config["assignment"]["output"],
    config["correction"]["prompt"],
    # TODO: use config["generation_2"]["topic_output"] if you have subtopics,
    # or config["refinement"]["topic_output"] if you refined topics.
    config["generation"]["topic_output"],
    config["correction"]["output"],
    verbose=config["verbose"],
)

Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic correction...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Data file: data/output/sample/assignment.jsonl
Prompt file: prompt/correction.txt
Output file: data/output/sample/assignment_corrected.jsonl
Topic file: data/output/sample/generation_1.md
-------------------
Error: Row 0 has no topics.
Error: Row 2 has no topics.
Error: Row 3 has no topics.
Error: Row 4 has no topics.
Number of errors: 4
Number of hallucinated topics: 0


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


iterative_prompt system_message: You are a helpful assistant.
iterative_prompt prompt: You will receive a document and a topic hierarchy. Assign the document to the most relevant topics the hierarchy. Then, output the topic labels, assignment reasoning and supporting quotes from the document. DO NOT make up new topics or quotes.  

[Topic Hierarchy]
[1] Forest Conservation: Regulates the management and protection of forest areas to maintain their natural state.
[1] Oil and Gas Exploration: Discusses the exploration, extraction, and management of oil and gas resources.
[1] Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.</s>

[Examples]
Example 1: Assign "[1] Agriculture" to the document
Document: 
Saving Essential American Sailors Act or SEAS Act - Amends 



post_processed_output_text: 

[1] Topic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.
Assignment reasoning: The document discusses the identification and management of roadless areas within the National Forest System, which can be considered a part of the transportation system as it involves the planning and design of roads.
Supporting quote: "Identifies roadless areas within the National Forest System set forth in specified maps as National Forest Inventoried Roadless Areas."</s>
Document 1: 

[1] Topic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, s



post_processed_output_text: 

[T] pic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.

Assignment reasoning: The document discusses the decommissioning, reuse, and management of oil and gas platforms, which can be considered a part of the infrastructure development in the Gulf of Mexico.

Supporting quote: "Rigs to Reefs Habitat Protection Act - Directs the Secretary of the Interior to assess each offshore oil and gas platform in the Gulf of Mexico that is no longer useful for operations..."</s>
Document 3: 

[T] pic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure dev



post_processed_output_text: 
[1] Topic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. (Supporting quote: "Directs the Secretary of Transportation to establish an aerotropolis grant program to assist in the development of aerotropolis transportation systems...")</s>
Document 4: 
[1] Topic Label: Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. (Supporting quote: "Directs the Secretary of Transportation to establish an aerotropolis grant program to assist in the development of aerotropolis transportation systems...")</s>
--------------------
iterative_prompt system_message: You are a helpful assistant.
iterative_prompt prompt: You will receive a document and a topic hierarchy. Assign the document to the most relevant topics the hierarchy. Then, output the topic labels, assignment reasoning and suppor

Correcting topics: 100%|█████████████████████████████████████████████████████████████████| 4/4 [03:17<00:00, 49.50s/it]

post_processed_output_text: 

[1] Topic Label: Forest Conservation: Regulates the management and protection of forest areas to maintain their natural state.

Assignment reasoning: The document discusses changes to the REAL ID Act of 2005, which is related to driver's licenses and identification documents, a type of identification system. The document also mentions the requirement for states to comply with certain citizenship or lawful immigration status verification requirements, which can be seen as a form of regulation to maintain the natural state of the system by ensuring that only eligible individuals are granted driver's licenses.

Supporting quote: "Prevention of Unsafe Licensing Act - Amends the REAL ID Act of 2005 to prohibit a state from issuing a driver's license or identification document to a person unless the state has complied with certain citizenship or lawful immigration status verification requirements."</s>
Document 5: 

[1] Topic Label: Forest Conservation: Regulate


