In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
# Data from Eric; all four of these files are confirmed to correspond to each other
datahunt = pd.read_csv('evidence_eric/Covid_Evidencev1-Task-2224-DataHunt.csv')
schema = pd.read_csv('evidence_eric/45dce5251bd3ea6e908fa33ac9e6a8e17e6830215912ce1626cf4206e159819c.csv')
iaa = pd.read_csv('evidence_eric/Covid_Evidencev1.IAA-edb1510f-1923-4d6f-a678-95f53d752bea-Tags.csv')
adj = pd.read_csv('evidence_eric/Covid_Evidence2020_03_21.adjudicated-edb1510f-1923-4d6f-a678-95f53d752bea-Tags.csv')

The iaa (IAA, inter-annotator agreement) and adj (adjudicated, gold-standard) files are tag files, and each contains a row for each converged answer to a given task. The iaa file has converged answers based on regular user responses to tasks, while the adj file has answers from an experienced Public Editor user (e.g. Nick, Emlen, Eric).

These tag files contain the answer_uuid for each converged answer (a string of letters and numbers, not human-interpretable), but not the answer_label (e.g. 'T1.Q1.A2') or information such as the question type. This additional data is stored in the schema file, which we can combine with the tag files by merging on the answer_uuid column.
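
As a minimal sketch of that merge, the toy frames below stand in for a tag file and a schema file; every value in them is made up for illustration. It also shows one likely reason the IAA merge ends up with `question_type_x` and `question_type_y`: when both frames share a non-key column name, pandas appends `_x` and `_y` suffixes.

```python
import pandas as pd

# Toy stand-ins for a tag file and a schema file (all values are invented).
tags = pd.DataFrame({
    "answer_uuid": ["a1", "a2", "a3"],
    "source_task_uuid": ["task-1", "task-1", "task-1"],
    "question_type": ["RADIO", "CHECKBOX", "CHECKBOX"],
})
schema_toy = pd.DataFrame({
    "answer_uuid": ["a1", "a2", "zz"],
    "answer_label": ["T1.Q1.A1", "T1.Q2.A1", "T1.Q9.A9"],
    "question_type": ["RADIO", "CHECKBOX", "TEXT"],
})

# An inner merge keeps only answer_uuids present in both frames.
merged = tags.merge(schema_toy, how="inner", on="answer_uuid")

print(merged.columns.tolist())
print(len(merged))
```

Because the merge is an inner join, a tag row whose answer_uuid is missing from the schema (here `a3`) is dropped silently, so comparing row counts before and after the merge is a cheap sanity check.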

In [3]:
answers_iaa = iaa.merge(schema, how="inner", on="answer_uuid")
answers_adj = adj.merge(schema, how="inner", on="answer_uuid")

# filter down IAA tags file, replace nan values to prevent errors
answers_iaa = answers_iaa[["answer_uuid", "source_task_uuid", "tua_uuid",
                           "target_text", "question_label", "answer_label",
                           "question_type_x", "question_type_y", "answer_count",
                           "alpha_distance"]]

answers_iaa = answers_iaa.replace(np.nan, '', regex=True)

# filter down adjudicated/gold standard tags file, replace nan values to prevent errors
answers_adj = answers_adj[["answer_uuid", "source_task_uuid", "tua_uuid",
                           "target_text", "question_label", "answer_label",
                           "question_type", "answer_count", "alpha_distance"]]

answers_adj = answers_adj.replace(np.nan, '', regex=True)

## get_consensus function
Function that takes in an answers dataframe (IAA or adjudicated), a **question_label**, and a **source_task_uuid/quiz_task_uuid**, and returns a list of all the corresponding consensus answers. This could be an empty list if there is no consensus for that question, a length-one list if the question is a "select one" question (including ordinal questions), or a list of multiple answers if it is a "select all"/checkbox question.

If consensus answers exist, they will be in the form T1.Q_.A_

Together, the following two columns uniquely identify a (set of) consensus answer(s):
- **question_label** is in the form T1.Q_
- **source_task_uuid/quiz_task_uuid** identifies a unique combination of a schema, an article that schema is applied to, and the bolded text unit of analysis.
    - quiz_task_uuid and source_task_uuid are essentially the same, but quiz_task_uuid is the column name in the datahunt CSVs, and source_task_uuid is the column name in the “answer” CSVs

- Motivation: use something like 
```
df['iaa_consensus'] = df.apply(lambda x: get_consensus(answers_iaa, x['question_label'], x['quiz_task_uuid']), axis=1)
```
to add a column to the datahunt dataframe with the iaa consensus answers or gold standard answers (works the exact same way for both) 

In [4]:
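def get_consensus(answers, question_label, quiz_task_uuid):
    """Return the unique consensus answer labels for one question in one task."""
    # Keep only the rows for this question within this task.
    answer_df = answers.loc[(answers["question_label"] == question_label)
                            & (answers["source_task_uuid"] == quiz_task_uuid)]

    # Deduplicate labels; set ordering is arbitrary, so the list order may vary.
    return list(set(answer_df["answer_label"].tolist()))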

In [5]:
datahunt['iaa_consensus'] = datahunt.apply(lambda x: get_consensus(answers_iaa, x['question_label'], x['quiz_task_uuid']), axis=1)

In [6]:
datahunt['adj_consensus'] = datahunt.apply(lambda x: get_consensus(answers_adj, x['question_label'], x['quiz_task_uuid']), axis=1)
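
The two row-wise `apply` calls above filter the whole answers dataframe once per datahunt row. A possible alternative, sketched here on toy data with invented values, is to build the consensus lists once with `groupby` and attach them with a left merge. The frame names `answers_toy` and `datahunt_toy` are hypothetical, and labels are sorted for a stable order, which differs from the arbitrary set order of `get_consensus`.

```python
import pandas as pd

# Toy stand-ins for a merged answers frame and a datahunt frame (invented values).
answers_toy = pd.DataFrame({
    "question_label": ["T1.Q1", "T1.Q2", "T1.Q2"],
    "source_task_uuid": ["task-1", "task-1", "task-1"],
    "answer_label": ["T1.Q1.A1", "T1.Q2.A1", "T1.Q2.A3"],
})
datahunt_toy = pd.DataFrame({
    "question_label": ["T1.Q1", "T1.Q2", "T1.Q4"],
    "quiz_task_uuid": ["task-1", "task-1", "task-1"],
})

# One sorted list of unique answer labels per (question, task) pair.
consensus = (
    answers_toy
    .groupby(["question_label", "source_task_uuid"])["answer_label"]
    .agg(lambda labels: sorted(set(labels)))
    .rename("consensus")
    .reset_index()
    .rename(columns={"source_task_uuid": "quiz_task_uuid"})
)

result = datahunt_toy.merge(consensus, how="left",
                            on=["question_label", "quiz_task_uuid"])

# Questions with no consensus come back as NaN; turn those into empty lists.
result["consensus"] = result["consensus"].apply(
    lambda v: v if isinstance(v, list) else [])

print(result)
```

The toy rows cover the three cases described for `get_consensus`: a single answer (T1.Q1), multiple answers (T1.Q2), and no consensus (T1.Q4, which yields an empty list).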

In [7]:
datahunt[['question_label', 'quiz_task_uuid', 'answer_label', 'iaa_consensus', 'adj_consensus']].head()

Unnamed: 0,question_label,quiz_task_uuid,answer_label,iaa_consensus,adj_consensus
0,T1.Q1,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q1.A1,[T1.Q1.A1],[T1.Q1.A1]
1,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A1,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]"
2,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A2,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]"
3,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A3,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]"
4,T1.Q4,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q4.A6,[T1.Q4.A4],[T1.Q4.A2]


Notice that the iaa_consensus and adj_consensus columns in the output are very similar, aside from some small differences (for example, T1.Q4 in row 4). This is a good sign, because it means that the converged answers from regular user task responses are very similar to the task responses from very experienced users.
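
One way to put a number on that similarity is to compare the two columns row by row as sets, so that list order does not matter. The sketch below uses a small made-up frame rather than the real datahunt data, so the rate it prints says nothing about the actual files.

```python
import pandas as pd

# Invented rows shaped like the iaa_consensus / adj_consensus columns.
toy = pd.DataFrame({
    "iaa_consensus": [["T1.Q1.A1"], ["T1.Q2.A3", "T1.Q2.A1"], ["T1.Q4.A4"]],
    "adj_consensus": [["T1.Q1.A1"], ["T1.Q2.A1", "T1.Q2.A3"], ["T1.Q4.A2"]],
})

# Compare as sets so that differing list order still counts as agreement.
matches = toy.apply(
    lambda row: set(row["iaa_consensus"]) == set(row["adj_consensus"]), axis=1)

agreement_rate = matches.mean()
print(f"{agreement_rate:.2f}")
```

On the real datahunt frame each question appears once per selected answer, so it may be worth dropping duplicates on question_label and quiz_task_uuid first to avoid counting the same question several times.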

## get_question_meta function
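Function that takes in the IAA and adjudicated answer dataframes, a **question_label**, and a **source_task_uuid/quiz_task_uuid**, and returns a **tuple containing the question type and number of answer choices**, which will help with scoring questions. If neither dataframe has a matching row, it returns an empty tuple.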

If the question_type and num_answer_choices work as expected, we shouldn't need to hardcode the question schema data anymore.

Returned question type will be NOMINAL (select one), ORDINAL (select one), CHECKBOX (select all), or TEXT (short answer).
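
In the function defined in this section, a question whose schema type is RADIO takes its final type from the upper-cased alpha_distance value, while any other type is returned unchanged. A minimal sketch of that rule follows; `resolve_question_type` is a hypothetical helper name, and the lowercase "nominal"/"ordinal" inputs are an assumption about how alpha_distance is stored.

```python
def resolve_question_type(question_type, alpha_distance):
    """Hypothetical helper mirroring the RADIO rule in get_question_meta."""
    if question_type == "RADIO":
        # "Select one" questions are split into NOMINAL or ORDINAL
        # according to the alpha_distance column.
        return alpha_distance.upper()
    # CHECKBOX, TEXT, and any other type pass through unchanged.
    return question_type


print(resolve_question_type("RADIO", "nominal"))
print(resolve_question_type("RADIO", "ordinal"))
print(resolve_question_type("CHECKBOX", ""))
```

Because NaN values were replaced with empty strings earlier, a RADIO row with a missing alpha_distance would resolve to an empty string under this rule, which is worth checking for before scoring.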

- Motivation: use something like 
```
df['question_meta'] = df.apply(lambda x: get_question_meta(answers_iaa, answers_adj, x['question_label'], x['quiz_task_uuid']), axis=1)
```
to add a column to the datahunt df with the question metadata

In [8]:
def get_question_meta(answers_iaa, answers_adj, question_label, quiz_task_uuid):
    """Return (question_type, num_answer_choices), preferring IAA rows over adjudicated rows."""
    answer_iaa = answers_iaa.loc[(answers_iaa["question_label"] == question_label)
                         & (answers_iaa["source_task_uuid"] == quiz_task_uuid)]
    
    answer_adj = answers_adj.loc[(answers_adj["question_label"] == question_label)
                         & (answers_adj["source_task_uuid"] == quiz_task_uuid)]
    
    if len(answer_iaa["question_type_y"]) > 0 and len(answer_iaa["answer_count"]) > 0:
        question_type_y = answer_iaa["question_type_y"].iloc[0]
        num_answer_choices = answer_iaa["answer_count"].iloc[0]

        if question_type_y == "RADIO":
            question_type_y = answer_iaa["alpha_distance"].iloc[0].upper()

        return (question_type_y, num_answer_choices)
    
    elif len(answer_adj["question_type"]) > 0 and len(answer_adj["answer_count"]) > 0:
        question_type = answer_adj["question_type"].iloc[0]
        num_answer_choices = answer_adj["answer_count"].iloc[0]

        if question_type == "RADIO":
            question_type = answer_adj["alpha_distance"].iloc[0].upper()

        return (question_type, num_answer_choices)
    
    else:
        return ()

In [9]:
datahunt['question_meta'] = datahunt.apply(lambda x: get_question_meta(answers_iaa, answers_adj,
                                                                       x['question_label'],
                                                                       x['quiz_task_uuid']), axis=1)

In [10]:
datahunt[['question_label', 'quiz_task_uuid', 'answer_label', 'iaa_consensus', 'adj_consensus', 'question_meta']].head()

Unnamed: 0,question_label,quiz_task_uuid,answer_label,iaa_consensus,adj_consensus,question_meta
0,T1.Q1,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q1.A1,[T1.Q1.A1],[T1.Q1.A1],"(CHECKBOX, 3)"
1,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A1,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","(CHECKBOX, 9)"
2,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A2,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","(CHECKBOX, 9)"
3,T1.Q2,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q2.A3,"[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","[T1.Q2.A3, T1.Q2.A1, T1.Q2.A5, T1.Q2.A2]","(CHECKBOX, 9)"
4,T1.Q4,edb1510f-1923-4d6f-a678-95f53d752bea,T1.Q4.A6,[T1.Q4.A4],[T1.Q4.A2],"(NOMINAL, 6)"


In [11]:
datahunt.to_csv("test/evidence_datahunt_with_consensus.csv", encoding='utf-8', index=False)