# TopicGPT_Python package

`topicgpt_python` consists of five modules in total: 
- `generate_topic_lvl1` generates high-level and generalizable topics. 
- `generate_topic_lvl2` generates low-level, specific subtopics for each high-level topic.
- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.
- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.
- `correct_topics` corrects the topic assignments by reprompting the model so that each assignment is grounded in the topic list.

![topicgpt_python](assets/img/pipeline.png)

## Setup
1. Make a new Python 3.9+ environment using virtualenv or conda. 
2. Install the required packages: `pip install --upgrade topicgpt_python`.
- Our package supports OpenAI API, Google Cloud Vertex AI API, and vLLM inference. vLLM requires GPUs to run. 
- Please refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex API pricing. 
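
As a rough guide to cost, the token counts and dollar figures printed in this notebook's logs are consistent with a flat per-token rate. The sketch below reproduces that arithmetic. The rates are inferred from the sample logs (610 prompt tokens ~ $0.00305, 18 response tokens ~ $0.00027), not taken from an official price list, and `estimate_cost` is a hypothetical helper rather than part of `topicgpt_python`. Check the pricing pages above for current numbers.

```python
# Hypothetical helper: not part of topicgpt_python.
# Rates are inferred from the sample logs in this notebook, not official pricing.
PROMPT_RATE_PER_TOKEN = 5 / 1_000_000
RESPONSE_RATE_PER_TOKEN = 15 / 1_000_000


def estimate_cost(prompt_tokens, response_tokens):
    """Return (prompt_cost, response_cost) in US dollars."""
    return (
        prompt_tokens * PROMPT_RATE_PER_TOKEN,
        response_tokens * RESPONSE_RATE_PER_TOKEN,
    )


prompt_cost, response_cost = estimate_cost(610, 18)
print(f"Prompt: ~${prompt_cost:.5f}, response: ~${response_cost:.5f}")
```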

In [1]:
# Run in shell
pip install --upgrade topicgpt_python
export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1

## Usage
1. First, define the necessary file paths for I/O operations in `config.yml`. 
2. Then, import the necessary modules and functions from `topicgpt_python`.
3. Store your data in `data/input` and modify the `data_sample` path in `config.yml`. 

- Prepare your `.jsonl` data file in the following format:
    ```
    {
        "id": "IDs (optional)",
        "text": "Documents",
        "label": "Ground-truth labels (optional)"
    }
    ```
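
A minimal sketch of writing documents to a `.jsonl` file in this format, using only the standard library. The example documents and the file name `my_data.jsonl` are hypothetical; only the `text` field is required.

```python
import json
from pathlib import Path

# Hypothetical example documents; replace with your own data.
documents = [
    {"id": "doc-1", "text": "First document text.", "label": "Environment"},
    {"text": "Second document text."},  # "id" and "label" are optional
]

# Hypothetical file name under data/input.
out_path = Path("data/input/my_data.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)

# JSON Lines: one JSON object per line.
with out_path.open("w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```

Point `data_sample` in `config.yml` at the resulting file.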

In [3]:
from topicgpt_python import *
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
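
The cells in this notebook read several keys from `config`, so it can help to fail early when one is missing. The sketch below is one way to check this. The key list is collected from the calls in this notebook, `missing_keys` is a hypothetical helper (not part of `topicgpt_python`), and `example_config` is a stand-in for the result of `yaml.safe_load`, with paths taken from the logs shown here and `verbose` assumed to be `True`.

```python
# Hypothetical helper: not part of topicgpt_python.
# Dotted key paths collected from the calls in this notebook.
REQUIRED_KEYS = [
    "data_sample",
    "verbose",
    "generation.prompt",
    "generation.seed",
    "generation.output",
    "generation.topic_output",
    "assignment.prompt",
    "assignment.output",
    "correction.prompt",
    "correction.output",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted key paths that are absent from a nested config dict."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Stand-in for yaml.safe_load(...); an incomplete config on purpose.
example_config = {
    "data_sample": "data/input/sample.jsonl",
    "verbose": True,  # assumed value
    "generation": {
        "prompt": "prompt/generation_1.txt",
        "seed": "prompt/seed_1.md",
        "output": "data/output/sample/generation_1.jsonl",
        "topic_output": "data/output/sample/generation_1.md",
    },
}
print(missing_keys(example_config))
```

Here the check reports the `assignment.*` and `correction.*` keys, since the stand-in config omits those sections.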

### Topic Generation 
Generate high-level topics using `generate_topic_lvl1`. 
- Define the API type and model.
- Define your seed topics in `prompt/seed_1.md`.
- (Optional) Modify few-shot examples in `prompt/generation_1.txt`.
- Expect the generated topics in `data/output/{data_name}/generation_1.md` and `data/output/{data_name}/generation_1.jsonl`.
- Early stopping is currently set to 100: generation stops if no new topic has been produced in the last 100 iterations.
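
The early-stopping rule can be illustrated with a small sketch. This is not the package's implementation, only the rule as described above: stop once a fixed number of consecutive iterations have added no new topic. `run_with_early_stopping` and the toy stream are hypothetical, and the patience is lowered to 3 so the effect is visible.

```python
def run_with_early_stopping(topic_stream, patience=100):
    """Collect topics until `patience` consecutive iterations add nothing new.

    `topic_stream` yields the topic label proposed at each iteration.
    Returns the topics in first-seen order and the number of iterations consumed.
    """
    seen = []
    stale = 0  # consecutive iterations without a new topic
    iterations = 0
    for topic in topic_stream:
        iterations += 1
        if topic in seen:
            stale += 1
        else:
            seen.append(topic)
            stale = 0
        if stale >= patience:
            break
    return seen, iterations


# Toy stream: the last entry is never reached because the run stops early.
stream = [
    "Environment", "Environment", "Transportation", "Environment",
    "Transportation", "Environment", "Health",
]
topics, n = run_with_early_stopping(stream, patience=3)
print(topics, n)
```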

In [4]:
generate_topic_lvl1(
    "openai",
    "gpt-4o",
    config["data_sample"],
    config["generation"]["prompt"],
    config["generation"]["seed"],
    config["generation"]["output"],
    config["generation"]["topic_output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic generation...
Model: gpt-4o
Data file: data/input/sample.jsonl
Prompt file: prompt/generation_1.txt
Seed file: prompt/seed_1.md
Output file: data/output/sample/generation_1.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


 20%|██        | 1/5 [00:03<00:13,  3.35s/it]

Prompt token usage: 610 ~$0.0030499999999999998
Response token usage: 18 ~$0.00027
Topics: [1] Environment: Involves the conservation and management of natural resources and ecosystems.
--------------------


 40%|████      | 2/5 [00:04<00:06,  2.25s/it]

Prompt token usage: 1271 ~$0.0063549999999999995
Response token usage: 24 ~$0.00036
Topics: [1] Environment: Mentions the use of land and natural resources, including hydropower generation and land management.
--------------------


 60%|██████    | 3/5 [00:06<00:03,  1.92s/it]

Prompt token usage: 823 ~$0.004115
Response token usage: 17 ~$0.000255
Topics: [1] Environment: Mentions the protection and management of natural habitats and ecosystems.
--------------------


 80%|████████  | 4/5 [00:07<00:01,  1.76s/it]

Prompt token usage: 632 ~$0.0031599999999999996
Response token usage: 24 ~$0.00036
Topics: [1] Transportation: Involves the development and management of systems and infrastructure for the movement of people and goods.
--------------------


100%|██████████| 5/5 [00:10<00:00,  2.01s/it]

Prompt token usage: 571 ~$0.002855
Response token usage: 28 ~$0.00042
Topics: [1] Transportation: Relates to the systems and methods for moving people or goods from one place to another, including regulations and licensing.
--------------------





<topicgpt_python.utils.TopicTree at 0x34b3364c0>

### Topic Refinement
Topics generated by a weaker model sometimes include irrelevant or redundant entries. This module:
- Merges similar topics.
- Removes overly specific or redundant topics that occur < 1% of the time (you can skip this by setting `remove` to False in `config.yml`).
- Expect the refined topics in `data/output/{data_name}/refinement_1.md` and `data/output/{data_name}/refinement_1.jsonl`. If nothing is merged or removed, the topic list is already coherent.
- If you're unsatisfied with the refined topics, call the function again, using the refined `.md` and `.jsonl` files from the previous iteration as input.
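
To make the 1% removal rule concrete, here is a sketch of a frequency filter. It assumes that "occur < 1% of the time" means a topic's share of all topic counts; the package may compute this differently. `split_by_frequency` is a hypothetical helper, and the second set of counts is invented for illustration.

```python
def split_by_frequency(topic_counts, threshold=0.01):
    """Split topics into (kept, removed) by their share of the total count."""
    total = sum(topic_counts.values())
    kept, removed = {}, {}
    for topic, count in topic_counts.items():
        share = count / total if total else 0.0
        if share < threshold:
            removed[topic] = count
        else:
            kept[topic] = count
    return kept, removed


# Counts from the topic tree printed in this notebook: nothing falls below 1%.
kept_small, removed_small = split_by_frequency({"Environment": 3, "Transportation": 2})

# Hypothetical larger run in which one topic is rare enough to be dropped.
kept_large, removed_large = split_by_frequency(
    {"Environment": 300, "Transportation": 200, "Rare Topic": 2}
)
print(removed_small, removed_large)
```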

In [5]:
# Optional: Refine topics if needed
if config["refining_topics"]:
    refine_topics(
        "openai",
        "gpt-4o",
        config["refinement"]["prompt"],
        config["generation"]["output"],
        config["generation"]["topic_output"],
        config["refinement"]["topic_output"],
        config["refinement"]["output"],
        verbose=config["verbose"],
        remove=config["refinement"]["remove"],
        mapping_file=config["refinement"]["mapping_file"]
    )

-------------------
Initializing topic refinement...
Model: gpt-4o
Input data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/refinement.txt
Output file: data/output/sample/refinement.md
Topic file: data/output/sample/generation_1.md
-------------------
No topic pairs to be merged.
No topics removed.
Node('/Topics', count=1, desc='Root topic', lvl=0)
├── Node('/Topics/Environment', count=3, desc='Involves the conservation and management of natural resources and ecosystems.', lvl=1)
└── Node('/Topics/Transportation', count=2, desc='Involves the development and management of systems and infrastructure for the movement of people and goods.', lvl=1)


### Subtopic Generation 
Generate subtopics using `generate_topic_lvl2`.
- This function iterates over each high-level topic and generates subtopics based on a few example documents associated with the high-level topic.
- Expect the generated topics in `data/output/{data_name}/generation_2.md` and `data/output/{data_name}/generation_2.jsonl`.
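
For post-processing, topic lines such as those printed in the logs (`[1] Environment: ...`, `[2] Conservation (Document: 1): ...`) can be grouped into a two-level structure. The sketch below assumes that line format; the layout of the generated `.md` file may differ, and `parse_topic_lines` is a hypothetical helper, not part of `topicgpt_python`.

```python
import re

# "[level] Name (Key: N): description", where the parenthetical is optional.
TOPIC_LINE = re.compile(
    r"^\s*\[(?P<level>\d+)\]\s+(?P<name>.+?)"
    r"(?:\s+\((?P<key>\w+):\s*(?P<value>\d+)\))?"
    r":\s+(?P<desc>.*)$"
)


def parse_topic_lines(lines):
    """Group level-2 topic names under the most recent level-1 topic."""
    tree = {}
    current = None
    for line in lines:
        match = TOPIC_LINE.match(line)
        if match is None:
            continue  # skip lines that are not topic entries
        level, name = int(match["level"]), match["name"]
        if level == 1:
            current = name
            tree.setdefault(current, [])
        elif level == 2 and current is not None:
            tree[current].append(name)
    return tree


# Sample lines copied from the log output in this notebook.
sample = [
    "[1] Environment: Involves the conservation and management of natural resources and ecosystems.",
    "   [2] Conservation (Document: 1): Focuses on the management and protection of natural areas to maintain their ecological integrity, such as roadless areas in national forests.",
    "   [2] Marine Habitat Protection (Document: 3): Involves the protection and management of marine ecosystems, particularly concerning the conversion of decommissioned oil and gas platforms into artificial reefs.",
    "[1] Transportation: Involves the development and management of systems and infrastructure for the movement of people and goods.",
    "    [2] Aerotropolis Development (Document: 1): Involves planning and developing transportation systems centered around major airports.",
]
tree = parse_topic_lines(sample)
print(tree)
```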

In [6]:
# Optional: Generate subtopics
if config["generate_subtopics"]:
    generate_topic_lvl2(
        "openai",
        "gpt-4o",
        config["generation"]["topic_output"],
        config["generation"]["output"],
        config["generation_2"]["prompt"],
        config["generation_2"]["output"],
        config["generation_2"]["topic_output"],
        verbose=config["verbose"],
    )

-------------------
Initializing topic generation (lvl 2)...
Model: gpt-4o
Data file: data/output/sample/generation_1.jsonl
Prompt file: prompt/generation_2.txt
Seed file: data/output/sample/generation_1.md
Output file: data/output/sample/generation_2.jsonl
Topic file: data/output/sample/generation_2.md
-------------------
Number of remaining documents for prompting: 5


  0%|          | 0/2 [00:00<?, ?it/s]

Current topic: [1] Environment


 50%|█████     | 1/2 [00:02<00:02,  2.19s/it]

Subtopics: [1] Environment
   [2] Conservation (Document: 1): Focuses on the management and protection of natural areas to maintain their ecological integrity, such as roadless areas in national forests.
   [2] Indigenous Rights and Compensation (Document: 2): Pertains to the rights and compensation of indigenous tribes for the use of their lands, particularly in relation to resource development and energy projects.
   [2] Marine Habitat Protection (Document: 3): Involves the protection and management of marine ecosystems, particularly concerning the conversion of decommissioned oil and gas platforms into artificial reefs.
Conservation (Count: 0): Focuses on the management and protection of natural areas to maintain their ecological integrity, such as roadless areas in national forests.
Indigenous Rights and Compensation (Count: 0): Pertains to the rights and compensation of indigenous tribes for the use of their lands, particularly in relation to resource development and energy projec

100%|██████████| 2/2 [00:03<00:00,  1.64s/it]

Subtopics: [1] Transportation
    [2] Aerotropolis Development (Document: 1): Involves planning and developing transportation systems centered around major airports.
    [2] Licensing and Identification (Document: 2): Concerns regulations and requirements for issuing driver's licenses and identification documents.
Aerotropolis Development (Count: 0): Involves planning and developing transportation systems centered around major airports.
Licensing and Identification (Count: 0): Concerns regulations and requirements for issuing driver's licenses and identification documents.
--------------------------------------------------





### Topic Assignment
Assign the generated topics to the input text using `assign_topics`. Each assignment is supported by a quote from the input text.
- Expect the assigned topics in `data/output/{data_name}/assignment.jsonl`. 
- The model used here is often a weaker model to save cost, so the topics may not be grounded in the topic list. To correct this, use the `correct_topics` module. If there are still errors/hallucinations, run the `correct_topics` module again.
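
To see what "grounded in the topic list" means in practice, the sketch below flags topic labels in a response that do not appear in the topic list. It only detects mismatches and does not reprompt the model the way `correct_topics` does. It assumes labels look like `[1] Environment:` at the start of a line; `find_ungrounded_topics` and the "Energy" response are hypothetical.

```python
import re


def find_ungrounded_topics(response, topic_list):
    """Return topic labels in `response` that are not in `topic_list`."""
    # Labels are assumed to look like "[1] Environment:" at the start of a line.
    labels = re.findall(r"^\s*\[\d+\]\s+([^:\n]+):", response, flags=re.MULTILINE)
    allowed = {topic.lower() for topic in topic_list}
    return [label.strip() for label in labels if label.strip().lower() not in allowed]


topic_list = ["Environment", "Transportation"]

# Grounded label, in the style of the responses logged in this notebook.
grounded = "[1] Environment: The document discusses the management of roadless areas."

# Hypothetical response containing a label that is not in the topic list.
hallucinated = (
    "[1] Environment: The document discusses land management.\n\n"
    "[1] Energy: The document discusses hydropower generation."
)

print(find_ungrounded_topics(grounded, topic_list))
print(find_ungrounded_topics(hallucinated, topic_list))
```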

In [7]:
# Assignment
assign_topics(
    "openai",
    "gpt-4o-mini",
    config["data_sample"],
    config["assignment"]["prompt"],
    config["assignment"]["output"],
    config["generation"][
        "topic_output"
    ],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
    verbose=config["verbose"],
)

-------------------
Initializing topic assignment...
Model: gpt-4o-mini
Data file: data/input/sample.jsonl
Prompt file: prompt/assignment.txt
Output file: data/output/sample/assignment.jsonl
Topic file: data/output/sample/generation_1.md
-------------------


 20%|██        | 1/5 [00:02<00:08,  2.10s/it]

Prompt token usage: 509 ~$0.002545
Response token usage: 49 ~$0.000735
Response: [1] Environment: The document discusses the management of roadless areas within the National Forest System, which involves conservation efforts to maintain their natural state. ("...directs the Secretary of Agriculture to manage such Areas to maintain their roadless character.")
--------------------


 40%|████      | 2/5 [00:06<00:09,  3.21s/it]

Prompt token usage: 1165 ~$0.005825
Response token usage: 208 ~$0.00312
Response: [1] Environment: The document discusses the compensation for the Spokane Tribe of Indians for the use of their land for hydropower generation, which relates to the management of natural resources and ecosystems. The mention of land being held in trust and the federal trust responsibility indicates a focus on environmental and resource management. 

Supporting quote: "the purpose of this Act is to compensate the Spokane Tribe of Indians of the Spokane Reservation, Washington State for the use of its land for hydropower generation by the Grand Coulee Dam." 

[1] Transportation: The document also touches on the management of land and resources related to the Grand Coulee Dam, which is a significant infrastructure project for the movement of hydropower. The mention of the Bonneville Power Administration, which markets power produced at the dam, indicates a connection to transportation of energy resources.

Su

 60%|██████    | 3/5 [00:08<00:05,  2.62s/it]

Prompt token usage: 717 ~$0.0035849999999999996
Response token usage: 80 ~$0.0012000000000000001
Response: [1] Environment: The document discusses the assessment and protection of marine habitats related to offshore oil and gas platforms, which directly relates to the conservation and management of natural resources and ecosystems. 

Supporting quote: "Directs the Secretary of the Interior to assess each offshore oil and gas platform in the Gulf of Mexico that is no longer useful for operations, and has become critical for a marine fisheries habitat..."
--------------------


 80%|████████  | 4/5 [00:09<00:02,  2.11s/it]

Prompt token usage: 526 ~$0.00263
Response token usage: 51 ~$0.000765
Response: [1] Transportation: The document discusses the establishment of an aerotropolis grant program aimed at developing transportation systems that enhance connectivity and efficiency. ("...establish an aerotropolis grant program to assist in the development of aerotropolis transportation systems...")
--------------------


100%|██████████| 5/5 [00:10<00:00,  2.18s/it]

Prompt token usage: 460 ~$0.0023
Response token usage: 62 ~$0.00093
Response: [1] Transportation: The document discusses the issuance of driver's licenses and identification documents, which relates to the management of systems for the movement of people. ("...prohibit a state from issuing a driver's license or identification document to a person unless the state has complied with certain citizenship or lawful immigration status verification requirements.")
--------------------





In [8]:
# Correction
correct_topics(
    "openai",
    "gpt-4o-mini",
    config["assignment"]["output"],
    config["correction"]["prompt"],
    config["generation"][
        "topic_output"
    ],  # TODO: change to generation_2 if you have subtopics, or config['refinement']['topic_output'] if you refined topics
    config["correction"]["output"],
    verbose=config["verbose"],
)

-------------------
Initializing topic correction...
Model: gpt-4o-mini
Data file: data/output/sample/assignment.jsonl
Prompt file: prompt/correction.txt
Output file: data/output/sample/assignment_corrected.jsonl
Topic file: data/output/sample/generation_1.md
-------------------
Number of errors: 0
Number of hallucinated topics: 0
All topics are correct.
