
The code for generating the initial concepts #22

Closed
ayhem18 opened this issue Jan 10, 2024 · 3 comments

Comments

ayhem18 commented Jan 10, 2024

First of all, great job on the paper and the overall project!

I went through the repository, but I couldn't find the code for generating the initial concepts, i.e., the ones before the selection mechanism with submodular optimization. Will this part of the code be publicly shared?

Thanks in advance.

YueYANG1996 (Owner) commented

Please refer to this for our prompts.
For extracting concepts from LLM-generated sentences, you can use our T5 model:

  • Download the model weights here.
  • Inference code:
from transformers import T5Tokenizer, T5ForConditionalGeneration

t5_tokenizer = T5Tokenizer.from_pretrained("t5-large")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-large-concept-extractor").to("cuda:0")

def sentences2concepts(sentences, batch_size):
    # The model was trained with this task prefix prepended to each sentence.
    task_prefix = "extract concepts from sentence: "
    concepts = []
    outputs = []
    # Process the sentences in batches; the final, possibly partial, batch is included.
    for i in range(len(sentences) // batch_size + 1):
        current_sentences = sentences[i * batch_size:(i + 1) * batch_size]
        if not current_sentences:
            continue
        inputs = t5_tokenizer(
            [task_prefix + sentence for sentence in current_sentences],
            return_tensors="pt",
            padding=True,
        )
        output_sequences = t5_model.generate(
            input_ids=inputs["input_ids"].to("cuda:0"),
            attention_mask=inputs["attention_mask"].to("cuda:0"),
            do_sample=False,  # greedy decoding for deterministic extraction
        )
        outputs += t5_tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    # Each decoded output is a "; "-separated list of concepts.
    for output in outputs:
        concepts += output.split("; ")
    return concepts
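The post-processing step at the end of the function can be illustrated without loading the model. The decoded strings below are hypothetical examples of the model's "; "-separated output format; the optional deduplication is an addition, not part of the original snippet:

```python
# Post-processing sketch: the T5 model emits one string per input sentence,
# with concepts joined by "; ". These example outputs are hypothetical.
decoded_outputs = [
    "long tail; black stripes; orange fur",
    "black stripes; sharp claws",
]

concepts = []
for output in decoded_outputs:
    concepts += output.split("; ")

# Optional extra step: deduplicate while preserving order.
unique_concepts = list(dict.fromkeys(concepts))
```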

Since LaBo used GPT-3, which is less powerful than current LLMs, we needed this extra step to extract concepts. With ChatGPT or GPT-4, you can easily control the output format through prompts, so I recommend designing your own pipeline to generate candidate concepts.
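One way to exploit that output-format control is to instruct the model to answer with a JSON array of concepts, which removes the need for a separate extraction model. This is a hypothetical sketch, not the repository's code: `reply` stands in for a real chat-model response, and `parse_concepts` is an illustrative helper name.

```python
import json

def parse_concepts(reply: str) -> list[str]:
    # Parse a model reply that was prompted to answer with a JSON array,
    # e.g. 'List visual concepts of a tiger as a JSON array of strings.'
    return json.loads(reply)

# Hypothetical model reply when the prompt requests a JSON array.
reply = '["long tail", "black stripes", "orange fur"]'
tiger_concepts = parse_concepts(reply)
```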


ayhem18 commented Jan 13, 2024

Thank you for your prompt response. I'd like to clarify one point.
According to the JSON files here, each class is associated with around 500 sentences. Those should be generated by GPT-3, right?
If this is the case, then the prompts (#12 (comment)) should be called multiple times. Can you share the details for these calls?

I am conducting a DL experiment on top of the concept generation and selection mechanisms suggested in LaBo, so I would like my preparation work to deviate as little as possible from the literature (LaBo, in my case).

YueYANG1996 (Owner) commented

Calling each prompt will generate 10 sentences, and you have 5 prompts, so just call each prompt 10 times with a for loop.
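The loop described above can be sketched as follows. `query_gpt3` is a stand-in for the real GPT-3 API call (which returned 10 sentences per call), and the prompt templates are placeholders, not the actual LaBo prompts:

```python
def query_gpt3(prompt: str) -> list[str]:
    # Stand-in for the real GPT-3 API call; each call yields 10 sentences.
    return [f"sentence {i} for: {prompt}" for i in range(10)]

def generate_sentences(class_name: str) -> list[str]:
    # 5 prompt templates x 10 calls each x 10 sentences per call = 500 sentences.
    prompt_templates = [  # placeholders, not the actual LaBo prompts
        "describe what the {} looks like:",
        "describe the appearance of the {}:",
        "describe the color of the {}:",
        "describe the pattern of the {}:",
        "describe the shape of the {}:",
    ]
    sentences = []
    for template in prompt_templates:
        for _ in range(10):  # 10 calls per prompt
            sentences += query_gpt3(template.format(class_name))
    return sentences
```

With this stub, `generate_sentences("tiger")` produces 500 sentences per class, matching the counts in the released JSON files.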
