# AHAM: Adapt, Help, Ask, Model - Harvesting LLMs for literature mining

AHAM adapts a topic modeling framework to a specific domain by minimizing the AHAM metric, as introduced in [AHAM: Adapt, Help, Ask, Model - Harvesting LLMs for literature mining](https://arxiv.org/pdf/2312.15784). By doing so, it reduces the proportion of outlier topics and lowers the lexical or semantic similarity between the generated topic labels, resulting in more distinct and domain-relevant topics.
 


### AHAM METRIC

The AHAM metric is defined as:

$$
\text{AHAM} = 2 \times \left(\frac{|\text{outliers}|}{|\text{topics}|}\right) \times \text{(average pairwise topic similarity)}
$$

This metric combines the ratio of outlier topics to total topics with the average pairwise similarity of topic labels. Minimizing this metric drives the adaptation process, ensuring that topics are both distinct and well-aligned with the target domain.
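To make the formula concrete, here is a minimal sketch in plain Python. It is not the `aham` package's implementation: `aham_score` and `jaccard` are hypothetical helpers, the counts are taken exactly as the formula above states them, and token-level Jaccard overlap merely stands in for whichever lexical or semantic similarity is actually used.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two labels (illustrative stand-in)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def aham_score(n_outliers, n_topics, labels, similarity=jaccard):
    """2 * (|outliers| / |topics|) * average pairwise label similarity."""
    if n_topics == 0:
        raise ValueError("n_topics must be positive")
    pairs = list(combinations(labels, 2))
    # With fewer than two labels there are no pairs, so similarity is taken as 0.
    avg_similarity = mean(similarity(a, b) for a, b in pairs) if pairs else 0.0
    return 2 * (n_outliers / n_topics) * avg_similarity


labels = ["outlier detection", "link discovery", "outlier discovery"]
score = aham_score(n_outliers=1, n_topics=3, labels=labels)
print(round(score, 3))  # 0.148
```

Both factors lie in [0, 1] in this sketch, so a lower score means fewer outliers, less overlap between labels, or both.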

In [None]:
#!pip install git+https://github.com/bkolosk1/aham --quiet

In [None]:
from aham.data import load_ida_dataset

In [None]:
abstracts, titles = load_ida_dataset()

In [None]:
# Next, we need to configure the grid

In [None]:
# First, we select the LLM.

In [None]:
DEFAULT_MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

In [None]:
DEFAULT_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

In [None]:
# Set the generation parameters for the LLM.
DEFAULT_LLAMA_GEN_PARAMS = {"temperature": 0.1, "max_new_tokens": 300, "repetition_penalty": 1.1}

In [None]:
DEFAULT_UMAP_PARAMS = {"n_neighbors": 15, "n_components": 32, "min_dist": 0.0, "metric": "cosine", "random_state": 42}

In [None]:
DEFAULT_HDBSCAN_PARAMS = {"min_cluster_size": 10, "metric": "euclidean", "cluster_selection_method": "eom"}

In [None]:
DEFAULT_CHAT_TEMPLATE = [
    {"role": "system", "content": "You are a helpful, respectful, and honest research assistant for labeling topics."},
    {"role": "user", "content": (
        "I have a topic that contains the following documents:\n"
        "- Bisociative Knowledge Discovery by Literature Outlier Detection.\n"
        "- Evaluating Outliers for Cross-Context Link Discovery.\n"
        "- Exploring the Power of Outliers for Cross-Domain Literature Mining.\n"
        "The topic is described by the following keywords: bisociative, knowledge discovery, "
        "outlier detection, data mining, cross-context, link discovery, cross-domain, machine learning.\n"
        "Based on the information above, please create a simple, short and concise computer science label for this topic. "
        "Make sure you only return the label."
    )},
    {"role": "assistant", "content": "Outlier-based knowledge discovery"},
    {"role": "user", "content": (
        "\n"
        "I have a topic that contains the following documents: [DOCUMENTS]\n"
        "The topic is described by the following keywords: '[KEYWORDS]'.\n"
        "Based on the information above, please create a simple, short and concise computer science label for this topic. "
        "Make sure you only return the label."
    )}
]
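The last user message in the template carries two placeholders, `[DOCUMENTS]` and `[KEYWORDS]`, which are filled in for each topic before the prompt reaches the LLM. How the library performs that substitution is not shown in this notebook; the sketch below assumes plain string replacement, and `fill_prompt` is a hypothetical helper, not part of `aham`.

```python
def fill_prompt(template: str, documents: list[str], keywords: list[str]) -> str:
    """Substitute representative documents and keywords into a prompt template."""
    # One bullet per document, mirroring the few-shot example in the template.
    doc_block = "\n" + "\n".join(f"- {doc}" for doc in documents)
    keyword_text = ", ".join(keywords)
    return template.replace("[DOCUMENTS]", doc_block).replace("[KEYWORDS]", keyword_text)


template = (
    "I have a topic that contains the following documents: [DOCUMENTS]\n"
    "The topic is described by the following keywords: '[KEYWORDS]'.\n"
)
prompt = fill_prompt(template, ["Paper A.", "Paper B."], ["outliers", "link discovery"])
print(prompt)
```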


In [3]:
DEFAULT_TOPIC_SIMILARITY_METHOD = "semantic"

In [4]:
from aham.aham_topic_modeling import AHAMTopicModeling


In [None]:
# Now we can set up the config.

In [None]:
config = {
    "chat_template": DEFAULT_CHAT_TEMPLATE,
    "model_id": DEFAULT_MODEL_ID,
    "token": "",
    "llm_gen_params": DEFAULT_LLAMA_GEN_PARAMS,
    "embedding_model_name": DEFAULT_EMBEDDING_MODEL,
    "umap_params": DEFAULT_UMAP_PARAMS,
    "hdbscan_params": DEFAULT_HDBSCAN_PARAMS,
    "sem_model_name": DEFAULT_EMBEDDING_MODEL,
    "topic_similarity_method": DEFAULT_TOPIC_SIMILARITY_METHOD,
}
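Since adaptation means searching for the configuration with the lowest AHAM score, one way to set up such a search is to expand a base config into a grid of candidates. The sketch below uses only the standard library; `expand_grid` is a hypothetical helper, the candidate values are illustrative rather than taken from the paper, and the `aham` package may provide its own grid-search utility.

```python
from itertools import product


def expand_grid(base_config, grid):
    """Yield one config per combination of the values listed in `grid`.

    `grid` maps (section, parameter) pairs to lists of candidate values.
    """
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        # Copy nested dicts so the base config is never mutated.
        candidate = {
            name: (dict(value) if isinstance(value, dict) else value)
            for name, value in base_config.items()
        }
        for (section, param), value in zip(keys, values):
            candidate[section][param] = value
        yield candidate


base = {
    "umap_params": {"n_neighbors": 15, "n_components": 32},
    "hdbscan_params": {"min_cluster_size": 10},
}
grid = {
    ("umap_params", "n_neighbors"): [10, 15],
    ("hdbscan_params", "min_cluster_size"): [5, 10, 20],
}
candidates = list(expand_grid(base, grid))
print(len(candidates))  # 6
```

Each candidate could then be fitted and scored in turn, keeping the one with the lowest score.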

In [None]:
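# Fit the model with the config defined above and compute its score.
# Note: the explicit "fuzzy" argument differs from the "semantic" setting stored in `config`.
estimator = AHAMTopicModeling(config=config, topic_similarity_method="fuzzy")
estimator.fit(abstracts)
score = estimator.score()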

In [None]:
print(score)