# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific topics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assigned topic is grounded in the topic list. 

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. 
- Please refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing. 

In [1]:
# Run in shell
#!pip install --upgrade topicgpt_python

# Needed only for the OpenAI API deployment
#export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
#export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
#export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

# Needed only for Gemini deployment
#export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
#export AZURE_OPENAI_API_KEY={your_azure_api_key}
#export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
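
Before running the notebook, it can help to confirm that the variables for your chosen deployment are actually set. The following is a minimal sketch, not part of `topicgpt_python`: the variable names come from the cell above, while the dictionary keys (`"openai"`, `"vertex"`, and so on) and the helper `missing_env_vars` are labels invented for this example.

```python
import os

# Environment variables from the setup cell, grouped by deployment.
# The keys below are labels chosen for this sketch; they are not
# guaranteed to match the API identifiers the package expects.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(deployment, environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS[deployment] if not environ.get(name)]


# Report the status of every deployment in the current environment.
for deployment in REQUIRED_ENV_VARS:
    missing = missing_env_vars(deployment)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{deployment}: {status}")
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.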

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file with one JSON object per line, each with the following fields (shown here on multiple lines for readability):
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
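
As a rough illustration of the format above, the sketch below writes a few records to a `.jsonl` file (one JSON object per line), reads them back, and flags records without a usable `text` field. The helper names and the sample documents are made up for this example; only the field names `id`, `text`, and `label` come from the format description.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file into a list of dictionaries, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def records_missing_text(records):
    """Return the indices of records without a non-empty string "text" field."""
    return [
        i
        for i, record in enumerate(records)
        if not isinstance(record.get("text"), str) or not record["text"].strip()
    ]


# Illustrative documents; "id" and "label" are optional.
docs = [
    {"id": "1", "text": "A bill about agricultural exports.", "label": "Agriculture"},
    {"text": "A bill about forest roadless areas."},
]

# A temporary directory stands in for data/input in this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(docs, demo_path)
loaded = read_jsonl(demo_path)
print(len(loaded), "records loaded;", len(records_missing_text(loaded)), "missing text")
```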

In [2]:
from topicgpt_python import *
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)

INFO 02-13 01:23:20 __init__.py:192] Automatically detected platform cuda.


### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`. 
- Define the API type and model. 
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Early stopping is currently set to 100: generation stops if no new topic has been generated in the last 100 iterations.
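
The early-stopping rule can be pictured with a small loop. This is a sketch of the idea only, not the package's implementation: `propose_topics` is a hypothetical callable standing in for the model call, and the function name and `early_stop` parameter are invented for the example.

```python
def generate_with_early_stopping(documents, propose_topics, early_stop=100):
    """Collect topics until `early_stop` consecutive documents add nothing new.

    `propose_topics` is a hypothetical stand-in for the model call: it takes a
    document and the current topic list and returns a list of topic labels.
    """
    topics = []
    stale_iterations = 0  # consecutive documents that produced no new topic
    for document in documents:
        added = False
        for label in propose_topics(document, topics):
            if label not in topics:
                topics.append(label)
                added = True
        stale_iterations = 0 if added else stale_iterations + 1
        if stale_iterations >= early_stop:
            break
    return topics


# Toy run: the "model" simply proposes the document itself as a topic.
docs = ["trade", "trade", "trade", "forest"]
print(generate_with_early_stopping(docs, lambda doc, topics: [doc], early_stop=2))
```

With `early_stop=2`, the toy run stops after two consecutive documents that add nothing, so the final document is never seen.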

In [3]:
#MODEL_API = 'vllm'
MODEL_API = 'custom_llm'

MODEL_NAME = "C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16"
#MODEL_NAME = "C:/git/Mistral-7B-Instruct-v0.3-quantized.w8a16"
#MODEL_NAME = 'C:/git/Mistral-7B-Instruct-v0.1'

generate_topic_lvl1(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

CUDA extension not installed.
CUDA extension not installed.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic generation...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Data file: data/input/sample.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: data/output/sample/generation_1.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


iterative_prompt system_message: You are a helpful assistant.
iterative_prompt prompt: You will receive a document and a set of top-level topics from a topic hierarchy. Your task is to identify generalizable topics within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set, please add them. Otherwise, output the existing top-level topics as identified in the document.

[Top-level topics]
[1] Trade

[Examples]
Example 1: Adding "[1] Agriculture"
Document: 
Saving Essential American Sailors Act or SEAS Act - Amends the Moving Ahead for Progress in the 21st Century Act (MAP-21) to repeal the Act's repeal of the agricultural export requirements that: (1) 25% of the gross tonnage of certain agricultural commodities or their products exported each fiscal year be transported on U.S. commercial vessels, and (2) the Secretary of Transportation (DOT) finance any increased ocean freight charges incurred in the transportation of 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


post_processed_output_text: [1]  Forest Conservation: Regulates the management and protection of forest areas to maintain their natural state.

[2]  National Forest Roadless Area Conservation Act: A legislative act that identifies and protects roadless areas within the National Forest System. It requires the Secretary of Agriculture to manage these areas to maintain their roadless character and allows for modifications to improve accuracy or inclusiveness. Any substantial modification must go through the national forest management planning process and be documented in an environmental impact statement.</s>
Invalid topic format: . Skipping...
Lower level topics are not allowed: [2]  National Forest Roadless Area Conservation Act: A legislative act that identifies and protects roadless areas within the National Forest System. It requires the Secretary of Agriculture to manage these areas to maintain their roadless character and allows for modifications to improve accuracy or inclusivenes

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


post_processed_output_text: 1. Topic Level: 1
Topic Label: Government Policy
Topic Description: Policies and acts established by government entities to regulate various aspects of society, including but not limited to, land management, resource distribution, and financial compensation.

2. Topic Level: 1
Topic Label: Hydropower
Topic Description: The production of electricity through the movement of water, typically through dams or water turbines.

3. Topic Level: 1
Topic Label: Native American Rights
Topic Description: The rights and claims of Native American tribes, including land rights, resource rights, and financial compensation for historical injustices.

4. Topic Level: 1
Topic Label: Land Management
Topic Description: The management, conservation, and development of land resources, including but not limited to, public lands, tribal lands, and national parks.

5. Topic Level: 1
Topic Label: Energy Production
Topic Description: The generation and distribution of energy, including

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


post_processed_output_text: [1]  Oil and Gas Exploration: Discusses the exploration, extraction, and management of oil and gas resources.

[2]  Marine Habitats: Focuses on the ecosystems and biodiversity found in marine environments.

[2]  Fisheries Management: Deals with the management and conservation of fish populations and their habitats.

[2]  Coral Reefs: Concentrates on the biology, ecology, and conservation of coral reefs.

[2]  Protected Species: Addresses the protection and conservation of endangered or threatened species.

[2]  Recreational Fishing: Discusses the practice of fishing for pleasure and recreation.

[2]  Commercial Fishing: Focuses on the fishing industry and the economic aspects of fishing.

[2]  Artificial Reefs: Deals with the creation and management of artificial reefs to enhance marine habitats.

[2]  National Fishing Enhancement Act of 1984: Refers to the legislation that provides for the enhancement of fisheries resources.

[2]  Gulf of Mexico: Focuses on

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


post_processed_output_text: [1]  Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.</s>
Topics: [1]  Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.</s>
--------------------
iterative_prompt system_message: You are a helpful assistant.
iterative_prompt prompt: You will receive a document and a set of top-level topics from a topic hierarchy. Your task is to identify generalizable topics within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set, please add them

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [08:08<00:00, 97.61s/it]

post_processed_output_text: [T] pic Label: Immigration Policy: Regulations governing the entry, stay, and departure of individuals in a country, including citizenship and lawful status verification.</s>
Invalid topic format: [T] pic Label: Immigration Policy: Regulations governing the entry, stay, and departure of individuals in a country, including citizenship and lawful status verification.</s>. Skipping...
Topics: [T] pic Label: Immigration Policy: Regulations governing the entry, stay, and departure of individuals in a country, including citizenship and lawful status verification.</s>
--------------------





<topicgpt_python.utils.TopicTree at 0x1fcb303fb10>
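
The log above shows responses being skipped with `Invalid topic format` or `Lower level topics are not allowed`. A minimal sketch of that kind of check follows, assuming topics are expected as `[level] Label: Description` lines. The regular expression, the stripping of the `</s>` token, and the function name are assumptions made for illustration; the package's own parser may differ.

```python
import re

# Assumed line shape: "[1] Label: Description"
TOPIC_LINE = re.compile(r"^\[(\d+)\]\s+([^:]+):\s*(.+)$")


def parse_topic_line(line, max_level=1):
    """Return (level, label, description), or None if the line is rejected."""
    cleaned = line.replace("</s>", "").strip()  # drop the end-of-sequence token
    match = TOPIC_LINE.match(cleaned)
    if match is None:
        return None  # does not look like "[level] Label: Description"
    level = int(match.group(1))
    if level > max_level:
        return None  # lower-level topics are not accepted at this stage
    return level, match.group(2).strip(), match.group(3).strip()


print(parse_topic_line("[1]  Forest Conservation: Regulates the management of forest areas."))
print(parse_topic_line("[T] pic Label: Immigration Policy: Regulations governing entry.</s>"))
```

Smaller models often drift from the requested format, as the `1. Topic Level: 1` response in the log shows, so a strict check like this discards a noticeable share of their output.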

### Topic Refinement
Topics generated by a weaker model are sometimes irrelevant or redundant. This module: 
- Merges similar topics.
- Removes overly specific or redundant topics that occur < 1% of the time (you can skip this by setting `remove` to `False` in `config.yml`).

Expect the refined topics in `data/output/{data_name}/refinement_1.md` and `data/output/{data_name}/refinement_1.jsonl`. If nothing changes, the topic list is already coherent. If you're unsatisfied with the refined topics, call the function again, passing the refined output file and the refined topic file from the previous iteration as inputs.
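
The removal rule above (topics occurring < 1% of the time) can be read as a simple frequency filter. The sketch below assumes the share is computed over the total number of topic occurrences; whether the package uses that denominator is an assumption, and the function name and the counts are illustrative.

```python
def split_by_frequency(topic_counts, threshold=0.01):
    """Split topics into (kept, removed) by their share of all occurrences."""
    total = sum(topic_counts.values())
    kept, removed = {}, {}
    if total == 0:
        return kept, removed
    for label, count in topic_counts.items():
        # A topic is removed when its share falls strictly below the threshold.
        target = removed if count / total < threshold else kept
        target[label] = count
    return kept, removed


# Illustrative counts: "Gulf of Mexico" accounts for 1 of 200 occurrences (0.5%).
counts = {"Trade": 150, "Agriculture": 49, "Gulf of Mexico": 1}
kept, removed = split_by_frequency(counts)
print("kept:", kept)
print("removed:", removed)
```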

In [4]:
# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        MODEL_API,
        MODEL_NAME,
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic refinement...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Input data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/refinement.txt
Output file: data/output/sample/refinement.md
Topic file: data/output/sample/generation_1.md
-------------------
No topic pairs to be merged.
No topics removed.
Node('/Topics', count=1, desc='Root topic', lvl=0)
├── Node('/Topics/Forest Conservation', count=1, desc='Regulates the management and protection of forest areas to maintain their natural state.', lvl=1)
├── Node('/Topics/Oil and Gas Exploration', count=1, desc='Discusses the exploration, extraction, and management of oil and gas resources.', lvl=1)
└── Node('/Topics/Infrastructure Development', count=1, desc='Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This top

### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.
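
To make "a few example documents associated with the high-level topic" concrete, here is a sketch that groups documents by their top-level topic and samples a handful per topic. It illustrates the idea only: the record field `topics`, the `per_topic` parameter, and the function name are hypothetical, and the package's own selection logic may differ.

```python
import random
from collections import defaultdict


def sample_documents_per_topic(records, per_topic=3, seed=0):
    """Group document texts by topic label and sample up to `per_topic` of each.

    Each record is assumed to be a dict with a "text" string and a "topics"
    list of top-level labels (hypothetical field names for this sketch).
    """
    by_topic = defaultdict(list)
    for record in records:
        for label in record["topics"]:
            by_topic[label].append(record["text"])
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        label: rng.sample(texts, min(per_topic, len(texts)))
        for label, texts in by_topic.items()
    }


records = [
    {"text": "doc a", "topics": ["Trade"]},
    {"text": "doc b", "topics": ["Trade", "Agriculture"]},
    {"text": "doc c", "topics": ["Trade"]},
    {"text": "doc d", "topics": ["Trade"]},
]
samples = sample_documents_per_topic(records, per_topic=3)
print({label: len(texts) for label, texts in samples.items()})
```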

In [5]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        MODEL_API,
        MODEL_NAME,
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic generation (lvl 2)...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/generation_2.txt
Seed file: data/output/sample/generation_1.md
Output file: data/output/sample/generation_2.jsonl
Topic file: data/output/sample/generation_2.md
-------------------
Number of remaining documents for prompting: 0


  0%|                                                                                            | 0/3 [00:00<?, ?it/s]

Current topic: [1] Forest Conservation





KeyError: 'text'

### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- The model used here is often a weaker one to save cost, so the assigned topics may not be grounded in the topic list. To correct this, use the `correct_topics` module. If errors or hallucinations remain, run `correct_topics` again.
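
A quick way to see what "grounded in the topic list" means is to compare the labels cited in a response against the allowed labels. The sketch below assumes responses cite topics as `[level] Label: ...`; that response shape, the regular expression, and the function name are assumptions for illustration, not the check `correct_topics` performs.

```python
import re


def ungrounded_labels(response, topic_list):
    """Return labels cited as "[level] Label" that are absent from `topic_list`."""
    allowed = {label.strip().lower() for label in topic_list}
    cited = re.findall(r"\[\d+\]\s*([^:\n]+)", response)
    unknown = []
    for label in (c.strip() for c in cited):
        if label.lower() not in allowed and label not in unknown:
            unknown.append(label)
    return unknown


topic_list = ["Forest Conservation", "Oil and Gas Exploration", "Infrastructure Development"]
response = (
    "[1] Forest Conservation: The bill protects roadless areas.\n"
    "[1] Space Travel: Mentioned nowhere in the topic list."
)
print(ungrounded_labels(response, topic_list))
```

A non-empty result marks an assignment that would need correction.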

In [7]:
# Assignment
assign_topics(
    MODEL_API,
    MODEL_NAME,
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    # TODO: change to generation_2 if you have subtopics, or
    # config['refinement']['topic_output'] if you refined topics
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

Preparing to import and load with custom_llm
Model loaded with custom_llm
Tokenizer loaded with custom_llm
-------------------
Initializing topic assignment...
Model: C:/git/Mistral-7B-Instruct-v0.3-quantized.w4a16
Data file: data/input/sample.jsonl
Prompt file: prompt/assignment.txt
Output file: data/output/sample/assignment.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:36<00:00,  7.29s/it]

batch_prompt prompts: ['You will receive a document and a topic hierarchy. Assign the document to the most relevant topics the hierarchy. Then, output the topic labels, assignment reasoning and supporting quotes from the document. DO NOT make up new topics or quotes.  \n\n[Topic Hierarchy]\n[1] Forest Conservation: Regulates the management and protection of forest areas to maintain their natural state.\n[1] Oil and Gas Exploration: Discusses the exploration, extraction, and management of oil and gas resources.\n[1] Infrastructure Development: Mentions the planning, design, environmental review, or land acquisition activities for transportation systems. This topic is broad enough to accommodate future subtopics related to infrastructure development, such as roads, bridges, ports, and airports.</s>\n\n[Examples]\nExample 1: Assign "[1] Agriculture" to the document\nDocument: \nSaving Essential American Sailors Act or SEAS Act - Amends the Moving Ahead for Progress in the 21st Century Act




TypeError: can only concatenate str (not "bool") to str

In [None]:
# Correction
correct_topics(
    MODEL_API,
    MODEL_NAME,
    config["assignment"]["output"],
    config["correction"]["prompt"],
    # TODO: change to generation_2 if you have subtopics, or
    # config['refinement']['topic_output'] if you refined topics
    config["generation"]["topic_output"],
    config["correction"]["output"],
    verbose=config["verbose"],
)
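
Both the assignment and correction cells carry a TODO about which topic file to pass. One way to avoid editing the cells by hand is a small helper that picks the file from the flags already in `config.yml`. This is a sketch under assumptions: the helper name is invented, and giving subtopics precedence over refinement is a choice made for this example. The config keys are the ones used in the cells above, and the paths mirror those printed in the logs.

```python
def select_topic_file(config):
    """Pick the topic file for assignment/correction from the config flags.

    Assumed precedence for this sketch: subtopics, then refined topics,
    then the original level-1 topics.
    """
    if config.get("generate_subtopics"):
        return config["generation_2"]["topic_output"]
    if config.get("refining_topics"):
        return config["refinement"]["topic_output"]
    return config["generation"]["topic_output"]


# Illustrative config fragment using the paths shown in the logs.
example_config = {
    "refining_topics": True,
    "generate_subtopics": False,
    "generation": {"topic_output": "data/output/sample/generation_1.md"},
    "refinement": {"topic_output": "data/output/sample/refinement.md"},
    "generation_2": {"topic_output": "data/output/sample/generation_2.md"},
}
print(select_topic_file(example_config))
```

The returned path could then be passed in place of `config["generation"]["topic_output"]` in the `assign_topics` and `correct_topics` calls.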