# Reproducibility snippets
*"Multimodal feature extraction for assistive technology: evaluation and dataset"*

## Dataset creation
The dataset was generated using the Pixtral-12b model from Mistral AI. Each request combined the prompt below with an image of the product. The model's output was constrained to JSON and validated against the schema below.
### Prompt
>You are an expert assistive technology advisor. Please output a JSON object that adheres to the following schema. The object should represent an assistive technology item, which is given to you with the product's name in <ITEM_NAME></ITEM_NAME>, description in < DESCRIPTION >< /DESCRIPTION > and the image, with the following properties: {{detailed_description}}: A paragraph that describes the item in detail, which builds on the original description and image provided by the user. {{goals}}: An array of strings, where each string contains a life-related goal for a user of the item which motivates the use or purchase. The goals should be abstract and relate to high-level life objectives, rather than specific product features. {{healthcare_challenges}}: An array of strings, where each string contains a challenge defined as a disability and/or healthcare challenge associated with the item. These challenges can and should be both broadly descriptive and also include specific diagnoses relevant to the item. {{goal_classification}}: an object containing each of the following goal domains, [choice_and_control, daily_life, work, where_live, health_wellbeing, learning, relationships, social_community], and a boolean indicating if each domain is relevant to the item. Always output all domains in the JSON object. 
The goal domains are defined here: [choice_and_control: Support to make choices about one's life and minimise the need for external assistance.; daily_life: Support to participate in and manage the basic activities of daily life, including personal care, meal preparation, and managing one's living environment.; work: Support to participate in and maintain employment or self-employment.; where_live: Support to access and maintain suitable housing, including modifications; health_wellbeing: Support to directly maintain physical and mental health.; learning: Support to access enable or improve education and training opportunities, including formal and informal learning, and to develop skills for independent living.; relationships: Support to develop and maintain relationships with family, friends, and the community.; social_community: Support to participate in social and community activities, including recreation, leisure, and cultural pursuits.].{{user_age_minimum}}: An integer between 0 and 100 that estimates an absolute lower bound for a user's age such that the item is suitable (safe and able to be used). In terms of item size compatibility, enter 15 for an adult, 5 for a child and 0 for an infant as a lower bound. {{user_age_maximum}}: An integer between 0 and 100 that estimates an absolute upper bound for a user's age such that the item is suitable. {{estimated_cost}}: An integer number that indicates the estimated cost of the item, which should be a non-negative value. Make sure to include realistic values for each property. Do not include any keys in the JSON that are not mentioned above within {{}} or in the provided schema. "<ITEM_NAME>{item['serienavneng']}</ITEM_NAME> < DESCRIPTION >{item['seriebeskrivelseeng']}< /DESCRIPTION >.
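
The placeholders in the prompt suggest it was assembled as a Python f-string: `{item['serienavneng']}` and `{item['seriebeskrivelseeng']}` would be interpolated per item, while doubled braces such as `{{goals}}` would render as single literal braces. The sketch below illustrates that behaviour on a shortened template. This is an assumption about how the prompt was built, not the authors' code; the `build_item_suffix` helper and the sample item are hypothetical.

```python
def build_item_suffix(item):
    """Fill a shortened version of the prompt for one item (hypothetical helper)."""
    # Doubled braces are f-string escapes and come out as single literal braces.
    return (
        f"{{goals}}: An array of strings. "
        f"<ITEM_NAME>{item['serienavneng']}</ITEM_NAME> "
        f"< DESCRIPTION >{item['seriebeskrivelseeng']}< /DESCRIPTION >."
    )


# Hypothetical catalogue record, for illustration only.
sample_item = {
    "serienavneng": "Folding walking stick",
    "seriebeskrivelseeng": "Height-adjustable aluminium stick.",
}
print(build_item_suffix(sample_item))
```

Under this reading, the model would have seen the property names wrapped in single braces rather than the doubled braces shown in the template.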

### Schema
> {
    "type": "object",
    "properties": {
      "detailed_description": {
        "type": "string",
        "description": "An extended description of the item."
      },
      "goals": {
        "type": "array",
        "description": "A list of goals a user would have that are associated or fulfilled with the item.",
        "items": {
            "type": "string"
        }
      },
      "healthcare_challenges": {
        "type": "array",
        "description": "A list of disability and/or healthcare challenges associated with the item.",
        "items": {
            "type": "string"
        }
      },
      "goal_classification": {
        "type": "object",
        "properties": {
          "choice_and_control": {"type": "boolean"},
          "daily_life": {"type": "boolean"},
          "work": {"type": "boolean"},
          "where_live": {"type": "boolean"},
          "health_wellbeing": {"type": "boolean"},
          "learning": {"type": "boolean"},
          "relationships": {"type": "boolean"},
          "social_community": {"type": "boolean"}
        },
        "required": [
          "choice_and_control",
          "daily_life",
          "work",
          "where_live",
          "health_wellbeing",
          "learning",
          "relationships",
          "social_community"
        ]
      },
      "user_age_minimum": {
        "type": "integer",
        "description": "The absolute minimum suitable age for a user of the product. To approximate an infant, enter 0; for a child, enter 5. To approximate an adult, enter 15.",
        "minimum": 0,
        "maximum": 100
      },
      "user_age_maximum": {
        "type": "integer",
        "description": "The absolute maximum suitable age for a user of the product. To approximate an infant, enter 0; for a child, enter 5. To approximate an adult, enter 15.",
        "minimum": 0,
        "maximum": 100
      },
      "estimated_cost": {
        "type": "number",
        "description": "The estimated cost of the item.",
        "minimum": 0
      }
    },
    "required": ["detailed_description", "goals", "healthcare_challenges", "goal_classification", "user_age_minimum", "user_age_maximum", "estimated_cost"]
    }

### Validation
Validation uses the [jsonschema library](https://github.com/python-jsonschema/jsonschema). The `data` parameter is the model's JSON output, parsed into a Python object, and `schema` is the schema above.


In [6]:
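import jsonschema

def validate_json(data, schema, item_id):
    """Return True if `data` conforms to `schema`; otherwise print the error and return False."""
    try:
        jsonschema.validate(instance=data, schema=schema)
        print("JSON data is valid.")
        return True
    except jsonschema.exceptions.ValidationError as e:
        # Report which item failed and why
        print(f"JSON data for item {item_id} is invalid: {e.message}")
        return False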
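
Because `jsonschema.validate` operates on Python objects, the raw model text has to be parsed before it reaches `validate_json`. A minimal sketch of that step follows, assuming the model returns its JSON as a string; `parse_model_output` is a hypothetical helper and not part of the original notebook.

```python
import json


def parse_model_output(raw_text):
    """Parse raw model text into a dict, or return None if it is unusable."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as e:
        print(f"Model output is not valid JSON: {e.msg}")
        return None
    # The schema's top-level type is "object", which maps to a Python dict.
    if not isinstance(parsed, dict):
        print("Model output is valid JSON but not an object.")
        return None
    return parsed
```

The returned dictionary can then be passed as `data` to `validate_json`, and a `None` result can be treated as a failed generation.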

---
## Multimodal experimental dataset production
The production prompt above was modified to remove references to the attached image, item name, or item description, depending on the experiment type. Model output is available in the following files:
- Text only: ablation_text_only.json
- Images only: ablation_images_only.json
- Image and title: ablation_image_and_title.json

---
## Annotator agreement
This section contains code snippets for calculating annotator agreement on the boolean classification tasks.
### Cohen's kappa (example between annotators and LLM)
Uses the [scikit-learn library](https://scikit-learn.org). 

In [None]:
import json
from sklearn.metrics import cohen_kappa_score

def alignment_all(merged_file, categories):
    # Per-category and pooled label lists for the annotator ('user') and the LLM
    scores_dict = {category: {'user': [], 'llm': []} for category in categories}
    scores_user = []
    scores_llm = []
    with open(merged_file, 'r') as file:
        data = json.load(file)

    with open('annotation_subset_llm_outputs.json', 'r') as file:
        full_dataset = json.load(file)

    # Index the LLM outputs by item id so each annotation is paired with its own item
    llm_by_id = {item['id']: item['LLM_output']['goal_classification'] for item in full_dataset}

    for each in data:
        llm_scores = llm_by_id.get(each['id'])
        if llm_scores is None:
            # No matching LLM output: skip rather than reuse the previous item's scores
            continue
        # The annotator's labels are read from the 'LLM_output' key of the annotation file
        user_scores = each['LLM_output']['goal_classification']

        for category in categories:
            scores_dict[category]['user'].append(user_scores[category])
            scores_user.append(user_scores[category])
            scores_dict[category]['llm'].append(llm_scores[category])
            scores_llm.append(llm_scores[category])

    # Calculate and print Cohen's kappa for each category
    for category in categories:
        kappa = cohen_kappa_score(scores_dict[category]['user'], scores_dict[category]['llm'])
        print(f"Kappa for {category}: {kappa}")

    # Kappa over the labels pooled across all categories
    print(f"All category Cohen's Kappa is {cohen_kappa_score(scores_user, scores_llm)}")

categories = [
    'choice_and_control',
    'daily_life',
    'work',
    'where_live',
    'health_wellbeing',
    'learning',
    'relationships',
    'social_community'
]

# Annotator A - LLM
alignment_all('annotator_a_annotations.json', categories)
# Annotator A - LLM Unprimed
alignment_all('annotator_a_no_priming_annotations.json', categories)
# Annotator B - LLM
alignment_all('annotator_b_annotations.json', categories)
# Annotator B - LLM Unprimed
alignment_all('annotator_b_no_priming_annotations.json', categories)
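
For reference, Cohen's kappa compares the observed agreement `p_o` with the agreement `p_e` expected by chance from each rater's label frequencies: `kappa = (p_o - p_e) / (1 - p_e)`. The pure-Python sketch below illustrates the statistic on a toy example. It is not the scikit-learn implementation; `cohen_kappa` is a hypothetical name and the labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters need the same non-zero number of labels.")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Toy labels for one goal domain (illustrative only).
rater_1 = [True, True, False, False]
rater_2 = [True, False, False, False]
print(cohen_kappa(rater_1, rater_2))
```

The pooled "all category" figure printed by `alignment_all` applies the same statistic to the labels of every category concatenated into one list per rater.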



### Cohen's kappa (inter-annotator agreement)
Additional files for calculating the kappas between annotators are supplied as:
- annotator_a_crosscheck.json (annotator A's overlap of tasks assigned to B)
- annotator_b_crosscheck.json (annotator B's overlap of tasks assigned to A)

In [None]:
import json
from sklearn.metrics import cohen_kappa_score

def inter_annotator_agreement(categories):
    # Per-category boolean labels from each annotator, aligned by item id
    scores_dict = {category: {'annotator_a': [], 'annotator_b': []} for category in categories}

    # Annotator A's cross-check items, paired with annotator B's own annotations
    with open('annotator_a_crosscheck.json', 'r') as file:
        a_crosscheck_data = json.load(file)
    with open('annotator_b_annotations.json', 'r') as file:
        b_by_id = {item['id']: item['LLM_output']['goal_classification'] for item in json.load(file)}

    for each in a_crosscheck_data:
        b_labels = b_by_id.get(each['id'])
        if b_labels is None:
            continue  # no matching item: skip rather than reuse the previous match
        for category in categories:
            scores_dict[category]['annotator_a'].append(each['LLM_output']['goal_classification'][category])
            scores_dict[category]['annotator_b'].append(b_labels[category])

    # Annotator B's cross-check items, paired with annotator A's own annotations
    with open('annotator_b_crosscheck.json', 'r') as file:
        b_crosscheck_data = json.load(file)
    with open('annotator_a_annotations.json', 'r') as file:
        a_by_id = {item['id']: item['LLM_output']['goal_classification'] for item in json.load(file)}

    for each in b_crosscheck_data:
        a_labels = a_by_id.get(each['id'])
        if a_labels is None:
            continue  # no matching item: skip rather than reuse the previous match
        for category in categories:
            scores_dict[category]['annotator_b'].append(each['LLM_output']['goal_classification'][category])
            scores_dict[category]['annotator_a'].append(a_labels[category])

    # Per-category kappa, plus kappa over the labels pooled across all categories
    combined_a = []
    combined_b = []
    for category in categories:
        kappa = cohen_kappa_score(scores_dict[category]['annotator_a'], scores_dict[category]['annotator_b'])
        combined_a.extend(scores_dict[category]['annotator_a'])
        combined_b.extend(scores_dict[category]['annotator_b'])
        print(f"Kappa for {category}: {kappa}")

    print(f"All category Cohen's Kappa is {cohen_kappa_score(combined_a, combined_b)}")

# `categories` is the list of goal domains defined in the previous cell
inter_annotator_agreement(categories)

        

---
## BERTScore 
The [BERTScore library](https://github.com/Tiiiger/bert_score) was used to calculate semantic similarity between texts. This encompasses inter-annotator similarity, annotator-LLM similarity and similarity between generations in our ablation study. The example below calculates BERTScore for each of these comparisons using the annotation objects included in this directory.

Hash: roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.46.3)
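
For reference, the `no-idf` entry in the hash corresponds to the unweighted form of BERTScore, and `L17` appears to denote the `roberta-large` layer the embeddings are taken from. Writing `x` for the reference tokens and `\hat{x}` for the candidate tokens, with unit-normalised contextual embeddings in bold, the definitions from the BERTScore paper can be sketched as:

```latex
\begin{align}
R_{\mathrm{BERT}} &= \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \\
P_{\mathrm{BERT}} &= \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \\
F_{\mathrm{BERT}} &= 2 \, \frac{P_{\mathrm{BERT}} \, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
\end{align}
```

Because F1 is symmetric in precision and recall, swapping the reference and candidate arguments exchanges the precision and recall values but leaves F1 unchanged.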

In [None]:
# TODO: modify to use JSON reproducibility files. Ablation needs to use aggregated annotations, not LLM generations, as constants and ablations are the candidates. 

import json
from bert_score import score

def bert_score_text_only(reference_file, candidate_file):
    references_descriptions = []
    candidates_descriptions = [] 

    references_goals = []
    candidates_goals = [] 

    references_challenges = []
    candidates_challenges = [] 

    with open(reference_file, 'r') as ref:
        reference_data = json.load(ref)
    
    with open(candidate_file, 'r') as can:
        candidate_data = json.load(can)
    
    # Index candidates by id so references and candidates stay aligned pair-for-pair
    candidates_by_id = {item['id']: item['LLM_output'] for item in candidate_data}

    for each in reference_data:
        candidate = candidates_by_id.get(each['id'])
        if candidate is None:
            continue  # skip references without a matching candidate
        references_descriptions.append(each['LLM_output']['detailed_description'])
        references_goals.append(each['LLM_output']['goals'])
        references_challenges.append(each['LLM_output']['healthcare_challenges'])
        candidates_descriptions.append(candidate['detailed_description'])
        candidates_goals.append(candidate['goals'])
        candidates_challenges.append(candidate['healthcare_challenges'])

    # bert_score.score takes the candidates first, then the references
    P, R, F1 = score(candidates_descriptions, references_descriptions, lang='en', verbose=True)

    P1, R1, F1_1 = score([', '.join(sublist) for sublist in candidates_goals], [', '.join(sublist) for sublist in references_goals], lang='en', verbose=True)

    P2, R2, F1_2 = score([', '.join(sublist) for sublist in candidates_challenges], [', '.join(sublist) for sublist in references_challenges], lang='en', verbose=True)
    
    mean_results = {"Descriptions precision": P.mean(), "Descriptions recall": R.mean(), "Descriptions F1": F1.mean(), "Goals precision": P1.mean(), "Goals recall": R1.mean(), "Goals F1": F1_1.mean(),"Challenges precision": P2.mean(), "Challenges recall": R2.mean(), "Challenges F1": F1_2.mean()}
    median_results = {"Descriptions precision": P.median(), "Descriptions recall": R.median(), "Descriptions F1": F1.median(), "Goals precision": P1.median(), "Goals recall": R1.median(), "Goals F1": F1_1.median(),"Challenges precision": P2.median(), "Challenges recall": R2.median(), "Challenges F1": F1_2.median()}
    minimum_results = {"Descriptions precision": P.min(), "Descriptions recall": R.min(), "Descriptions F1": F1.min(), "Goals precision": P1.min(), "Goals recall": R1.min(), "Goals F1": F1_1.min(),"Challenges precision": P2.min(), "Challenges recall": R2.min(), "Challenges F1": F1_2.min()}

    all_results = {"mean_results": mean_results, "median_results": median_results, "minimum_results": minimum_results}
    some_results = {"descriptions_f1": mean_results['Descriptions F1'], "goals_f1": mean_results['Goals F1'], "challenges_f1": mean_results['Challenges F1']}

    return some_results

annotator_a_llm = bert_score_text_only('annotator_a_annotations.json', 'annotation_subset_llm_outputs.json')
annotator_b_llm = bert_score_text_only('annotator_b_annotations.json', 'annotation_subset_llm_outputs.json')
inter_annotator = bert_score_text_only('crosscheck_merged.json','merged_annotations.json')
ablation_image = bert_score_text_only('merged_annotations.json','ablation_images_only.json')
ablation_text = bert_score_text_only('merged_annotations.json','ablation_text_only.json')
ablation_image_title = bert_score_text_only('merged_annotations.json','ablation_image_and_title.json')


print(f"Annotator A - LLM similarity: {annotator_a_llm}")
print(f"Annotator B - LLM similarity: {annotator_b_llm}")
print(f"Inter-annotator similarity: {inter_annotator}")
print(f"Ablation study - image only to gold standard: {ablation_image}")
print(f"Ablation study - text only to gold standard: {ablation_text}")
print(f"Ablation study - image + item title only to gold standard: {ablation_image_title}")



## Qualitative score
Annotators were given the following instructions to rate the generated candidates qualitatively.
> Rate the accuracy and comprehensiveness of the generative output after annotation on a scale of 1-3 where 1 requires significant revision, 2 requires moderate revision and 3 requires no or a limited amount of revision.