# Exploring meta-analyzer features

In [1]:
import pandas as pd

In [2]:
bowen_df = pd.read_json("../../data/helpsteer2_train_tags.jsonl", lines=True)
input_df = pd.read_json(
    "../../data/helpsteer2_human_vs_gpt4_weighted_for_llama.jsonl", lines=True
)

Maybe, instead of doing the data processing in the feature extractor, we can just update the input dataset to include these tags as fields.
However, there are things to check first: do the tags apply to the prompt only, or do they also take the response into account?
We also need to find the code that generates `prompt_hash` in the original repository.
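
One way to check the hashing question before merging is to recompute the hash from the prompt text and compare it with the stored `prompt_hash`. The sketch below assumes MD5 over the UTF-8 encoded prompt, which is a guess and not the confirmed method from the original repository. `md5_hex` and `hash_match_rate` are hypothetical helpers, the data is a toy frame, and the assumption that the `text` column holds the raw prompt is also unverified.

```python
import hashlib

import pandas as pd


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a UTF-8 encoded string (assumed hashing scheme)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_match_rate(df: pd.DataFrame, text_col: str, hash_col: str) -> float:
    """Fraction of rows whose stored hash equals the recomputed MD5."""
    recomputed = df[text_col].map(md5_hex)
    return float((recomputed == df[hash_col]).mean())


# Toy data: the first stored hash is correct, the second is deliberately wrong.
toy = pd.DataFrame(
    {
        "text": ["c#", "bacillus subtilus"],
        "prompt_hash": [md5_hex("c#"), "not-a-real-hash"],
    }
)
rate = hash_match_rate(toy, "text", "prompt_hash")
```

Running `hash_match_rate(input_df, "text", "prompt_hash")` on the real data would show whether the scheme holds. A rate well below 1.0 would suggest the original hash was computed differently, for example on a templated or stripped prompt.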

In [3]:
import hashlib

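# Flatten the nested `tags` dict into one column per tag
bowen_df_normalized = pd.concat(
    [bowen_df.drop(columns=["tags"]), pd.json_normalize(bowen_df["tags"])],
    axis=1,
)
# Hash the prompt so the tags can be joined to input_df on `prompt_hash`.
# NOTE: MD5 over the UTF-8 prompt is an assumption; confirm against the original repository.
bowen_df_normalized["prompt_hash"] = bowen_df_normalized["prompt"].apply(
    lambda x: hashlib.md5(x.encode("utf-8")).hexdigest()
)
# Keep one row per prompt hash
bowen_df_normalized = bowen_df_normalized.drop_duplicates(subset=["prompt_hash"])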
# Remove unnecessary columns
bowen_df_normalized = bowen_df_normalized.drop(
    columns=[
        "response",
        "helpfulness",
        "correctness",
        "coherence",
        "complexity",
        "verbosity",
    ]
).reset_index(drop=True)

In [4]:
bowen_df_normalized.head(3)

| | prompt | subject_of_expertise | expertise_level | languages | open_endedness | safety_concern | complexity_of_intents | type_of_in_context_material | format_constraints | prompt_hash |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c# | [Computer sciences] | basic domain knowledge | [English] | high | safe | simple | [] | [] | 240aa2cec4b29c56f3bee520a8dcee7e |
| 1 | bacillus subtilus | [Biology] | basic domain knowledge | [English] | low | safe | simple | [] | [] | 823aad83e7d34af60a868febc39e2acb |
| 2 | Write long detailed essay about ancient type o... | [Religion, Anthropology, History] | basic domain knowledge | [English] | high | safe | simple | [] | [long, detailed essay] | 598ee4176ccd9dc6e162568930ea3d89 |


In [5]:
updated_df = input_df.merge(
    bowen_df_normalized.drop(columns=["prompt"]), how="left", on="prompt_hash"
)
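
Since the left merge silently leaves tag columns empty for any `prompt_hash` that has no match, it may be worth measuring coverage. The sketch below uses toy stand-ins for `input_df` and the deduplicated tag table; the names `left`, `tags`, `unmatched`, and `coverage` are illustrative. `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for input_df and the deduplicated tag table.
left = pd.DataFrame(
    {"prompt_hash": ["a", "a", "b", "c"], "pref_human": [1, 0, 1, 0]}
)
tags = pd.DataFrame(
    {"prompt_hash": ["a", "b"], "open_endedness": ["high", "low"]}
)

merged = left.merge(
    tags,
    how="left",
    on="prompt_hash",
    validate="many_to_one",  # raises MergeError if `tags` has duplicate hashes
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows of the input that found no tags (hash "c" in this toy example).
unmatched = int((merged["_merge"] == "left_only").sum())
coverage = 1 - unmatched / len(merged)
```

If coverage on the real data is far from 1.0, the hashing scheme used for `prompt_hash` is the first thing to suspect.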

In [6]:
updated_df.to_json(
    "helpsteer2_human_vs_gpt4_weighted_for_llama.jsonl", lines=True, orient="records"
)

In [8]:
updated_df.columns

Index(['prompt_hash', 'text', 'response_a', 'response_b', 'pref_human',
       'pref_gpt4', 'rating_human', 'rating_gpt4', 'completions',
       'subject_of_expertise', 'expertise_level', 'languages',
       'open_endedness', 'safety_concern', 'complexity_of_intents',
       'type_of_in_context_material', 'format_constraints'],
      dtype='object')

## Data types

- `subject_of_expertise`: list[str]
- `expertise_level`: str (3 labels) (basic domain knowledge, general public, expert domain knowledge)
- `languages`: list[str]
- `open_endedness`: str (4 labels?) (moderate, high, low, no)
- `safety_concern`: str (safe, low, moderate, high)
- `complexity_of_intents`: str (simple, moderate, complex, high)
- `type_of_in_context_material`: list[str]
- `format_constraints`: list[str]
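
The label sets above (including the uncertain "4 labels?") could be confirmed directly from the data. A minimal sketch on a toy frame, where `distinct_labels` is a hypothetical helper that handles both `str` and `list[str]` columns:

```python
import pandas as pd

# Toy frame mirroring one str column and one list[str] column.
toy = pd.DataFrame(
    {
        "open_endedness": ["high", "low", "high"],
        "subject_of_expertise": [
            ["Computer sciences"],
            ["Biology"],
            ["Religion", "History"],
        ],
    }
)


def distinct_labels(df: pd.DataFrame, column: str) -> list:
    """Sorted distinct labels; list-valued cells are flattened first."""
    # explode() leaves scalar cells unchanged and turns empty lists into NaN.
    values = df[column].explode().dropna()
    return sorted(values.unique())


str_labels = distinct_labels(toy, "open_endedness")
list_labels = distinct_labels(toy, "subject_of_expertise")
```

Calling `distinct_labels(updated_df, col)` for each tag column would give the actual label inventory to validate constraints against.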

If the data type is `list[str]`, matching needs a rule. Say the domains of a prompt are "ICT" and "Sociology". Our constraint should be something like: at least one of these domains falls under what we specified. Then perhaps there is a `strict` option for when we want the domains to be exactly what we specified.
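
The matching rule for `list[str]` fields described above can be sketched as follows. `matches_constraint` and its `strict` flag are hypothetical names, not part of any existing feature extractor. The default mode treats the constraint as "any overlap", and the strict mode requires the two sets to be equal, ignoring order and duplicates.

```python
from typing import Iterable


def matches_constraint(
    values: Iterable[str], allowed: Iterable[str], strict: bool = False
) -> bool:
    """Check a list-valued tag against a constraint.

    strict=False: at least one value is in `allowed`.
    strict=True: the values are exactly the `allowed` set.
    """
    values_set, allowed_set = set(values), set(allowed)
    if strict:
        return values_set == allowed_set
    return bool(values_set & allowed_set)


domains = ["ICT", "Sociology"]
loose = matches_constraint(domains, ["ICT"])
strict_partial = matches_constraint(domains, ["ICT"], strict=True)
strict_exact = matches_constraint(domains, ["Sociology", "ICT"], strict=True)
```

Here `loose` holds because "ICT" overlaps, `strict_partial` fails because "Sociology" is not in the constraint, and `strict_exact` holds. An empty tag list never matches in the default mode, which seems like the safer default for prompts with no domain assigned.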