# Answer Consensus

In [1]:
import numpy as np
import pandas as pd
import re
import os
from collections import Counter

## Goal: ## 
This notebook is meant to work hand in hand with our **Convergence Tool**. It computes the general consensus for each question and writes the results to a CSV file.

## Reading In The Data and Organizing It ##
**This is the same as in our Convergence Tool.**
We need to read in our data files. A regular expression matches each file name in the data directory against the task names, so every file for a task is loaded.

In [2]:
#These are all the tasks we have for now. Requires changes if we have more tasks or the representation of tasks in the 
#datafile names changes.
tasks = ['Evidence', 'Language', 'Probability', 'Reasoning'] 

data_dir = '../testing-format/'  # trailing / is necessary
data_files = os.listdir(data_dir)
data_files
dfs = {}

for t in tasks:
    print(t)
    data_files_t = [path for path in data_files if re.match('.*{}.*'.format(t), path)]
    dfs[t] = [pd.read_csv(data_dir + data_file_t) for data_file_t in data_files_t]


# df_master = pd.read_csv('../newDataFormat/BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv')
# Functions about the schema are deprecated
# schema = pd.read_csv("../urap/Demo1Evi-2019-06-22T0023-Schema.csv")
# backup_schema = schema
# df_master.head()

Evidence
Language
Probability
Reasoning
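
As an illustration of the file matching above, here is a minimal sketch that applies the same regex filter to a hypothetical directory listing. Only the Language file name appears in this notebook (in a commented-out line); the other two names are made up for the example.

```python
import re

tasks = ['Evidence', 'Language', 'Probability', 'Reasoning']

# Hypothetical directory listing; only the Language name comes from this notebook.
example_files = [
    'BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv',
    'BETA_Evidence-example-DataHuntSubmitted.csv',
    'notes.txt',
]

# re.match anchors at the start of the string, so '.*' on both sides
# means "the task name may appear anywhere in the file name".
files_by_task = {
    t: [f for f in example_files if re.match('.*{}.*'.format(t), f)]
    for t in tasks
}

print(files_by_task['Language'])
print(files_by_task['Probability'])  # no matching file, so an empty list
```

A task with no matching file simply gets an empty list, so the later loops skip it without raising an error.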


For now, a hard-coded schema distinguishes single-choice questions from multiple-choice questions.

In [3]:
# For simplicity, this dict is currently mutated by later code
multi_choice = {"Evidence": ["T1.Q2"], 
                "Language": ["T1.Q1", "T1.Q6"], 
                "Probability": ["T1.Q12"], 
                "Reasoning": ["T1.Q1", "T1.Q2", "T1.Q3", "T1.Q6"]}

The following code organizes the data by contributor ID. It shows which questions each contributor reached and how they answered them. Multiple-choice questions are one-hot encoded into one column per answer label.

In [4]:
def hotcode_multiple_choices(s):
    '''
    s: a series whose elements are lists of answer labels (or NaN)

    Return:
        A dataframe, one-hot encoded, with one 0/1 column per answer label
    '''
    # Collect every answer label that appears in any list
    keys = set()
    for elem in s:
        if isinstance(elem, list):
            keys.update(elem)
    keys = list(keys)  # freeze one iteration order for the columns

    encode_data = np.zeros((len(s), len(keys)), dtype=np.uint8)
    for i, elem in enumerate(s):
        if isinstance(elem, list):  # non-list entries (NaN) stay all zeros
            for j, key in enumerate(keys):
                if key in elem:
                    encode_data[i, j] = 1
    return pd.DataFrame(encode_data, columns=keys, index=s.index)

def getArticle(df, article_id, task):
    '''
    Gets one article from the 2020 feed and converts it into the format of the 2019 feed so the code below can be reused.

    df: the dataframe in 2020 format, e.g. BETA_Language-2020-01-18T0225-DataHuntSubmitted.csv
    article_id: article_number in df
    task: the task name used as a key into multi_choice, e.g. 'Language'

    Return:
        A pandas dataframe similar to the format of the 2019 feed
    '''
    article = df[df['article_number'] == article_id]
    #We use quiz_taskrun_uuid as dummy index, since it is a primary index for 2019 feed
    tbl = pd.pivot_table(article, values='answer_label', index='quiz_taskrun_uuid', columns='question_label', aggfunc=list)
    for col in tbl.columns:
        if col in multi_choice[task]:
            encoded = hotcode_multiple_choices(tbl[col])
            for col_ in encoded.columns:
                tbl[col_] = encoded[col_]
            tbl = tbl.drop(labels=col, axis=1)
    return tbl
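
To make the reshaping concrete, here is a minimal sketch on a toy feed. The column names match the ones used above, but the rows are made up, and the one-hot step is written inline rather than calling the helper, so the block runs on its own. It assumes `T1.Q2` is the multiple-choice question.

```python
import pandas as pd

# Toy long-format feed; the values are invented for illustration.
toy = pd.DataFrame({
    'quiz_taskrun_uuid': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'question_label':    ['T1.Q1', 'T1.Q2', 'T1.Q2', 'T1.Q1', 'T1.Q2'],
    'answer_label':      ['T1.Q1.A1', 'T1.Q2.A1', 'T1.Q2.A3', 'T1.Q1.A1', 'T1.Q2.A3'],
})

# One row per contributor, one column per question, each cell a list of answers.
wide = pd.pivot_table(toy, values='answer_label', index='quiz_taskrun_uuid',
                      columns='question_label', aggfunc=list)

# Expand the multiple-choice column into one 0/1 column per answer label.
labels = sorted({a for answers in wide['T1.Q2'] for a in answers})
for label in labels:
    wide[label] = wide['T1.Q2'].apply(lambda answers, lab=label: int(lab in answers))
wide = wide.drop(columns='T1.Q2')

print(wide)
```

After this step the single-choice column `T1.Q1` still holds lists of labels, while `T1.Q2.A1` and `T1.Q2.A3` hold plain 0/1 values, which is why the two kinds of columns need different handling later.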

## Consensus Function ##
The following function finds the consensus for each question by returning the most frequent answer.

In [5]:
def getConsensus(answers):
    qcounts = Counter(answers)
    maxKey = max(qcounts.keys(), key=lambda key: qcounts[key])
    return maxKey
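
A short sketch of how this majority vote behaves, including ties. The function name `most_common_answer` and the vote lists are hypothetical; the logic mirrors the `Counter` plus `max` approach above.

```python
from collections import Counter

def most_common_answer(answers):
    """Return the most frequent answer; on a tie, the answer seen first wins."""
    counts = Counter(answers)
    # max() keeps the first maximal key, and Counter preserves insertion order.
    return max(counts, key=counts.get)

votes = ['T1.Q1.A1', 'T1.Q1.A3', 'T1.Q1.A1']
print(most_common_answer(votes))                       # T1.Q1.A1
print(most_common_answer(['T1.Q1.A2', 'T1.Q1.A1']))    # T1.Q1.A2 (tie, first seen)
```

Note that a tie is resolved silently by input order, so a reported consensus does not by itself indicate how strongly contributors agreed.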

To help improve the efficiency of our code, we convert all of the answers from strings to integers.

In [6]:
def strToInt(lst):
    vals = lst.unique()
    mapper = {vals[i]:i for i in range(len(vals))}
    return lst.replace(mapper)
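
For comparison, here is a minimal sketch of the same string-to-integer idea using `pd.factorize`, a pandas function swapped in for illustration rather than the helper above. The sample answers are made up.

```python
import pandas as pd

answers = pd.Series(['T1.Q1.A1', 'T1.Q1.A3', 'T1.Q1.A1'])

# factorize assigns integer codes in order of first appearance
# and also returns the distinct values, so the mapping can be reversed.
codes, uniques = pd.factorize(answers)

print(codes.tolist())   # [0, 1, 0]
print(list(uniques))    # ['T1.Q1.A1', 'T1.Q1.A3']
```

Keeping `uniques` around makes it possible to translate an integer consensus back into its original answer label.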

A helper function that aggregates a pandas series whose elements are lists of items into a single flat list.

In [7]:
def agg(series):
    ret = []
    for elem in series:
        ret += elem
    return ret
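
The flattening step can also be sketched with the standard library. This example uses made-up answer lists and `itertools.chain.from_iterable` as an alternative to the loop above.

```python
from itertools import chain

# Each contributor's answer to a question is stored as a list of labels.
per_contributor = [['T1.Q3.A1'], ['T1.Q3.A2'], ['T1.Q3.A1']]

# chain.from_iterable concatenates the inner lists into one flat sequence.
flat = list(chain.from_iterable(per_contributor))

print(flat)  # ['T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A1']
```

The flat list is what the consensus function needs, since it counts individual labels rather than lists.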

Below we put everything together to compute the answer consensus and collect it into a dataframe, which is then written to a CSV file.

In [27]:
cols = ['Task File', 'Article Number', 'Question Label', 'Answer Label']
lst=[]

for t in tasks:
    for df_master in dfs[t]:
        for article_id in df_master['article_number'].unique():
            df_q = getArticle(df_master, article_id, t)
            for q in df_q.columns:
                # TODO: SOME PROCESS HERE
                helper = df_q[q].dropna()
                # Single-choice columns hold lists of labels and must be flattened;
                # one-hot columns produced by getArticle already hold 0/1 values.
                if helper.map(lambda x: isinstance(x, list)).all():
                    helper = agg(helper)
                consensus = getConsensus(helper)  # compute once, reuse below
                print('Task File:', t, 'Article Number:', article_id, 'Question:', q, 'Answer:', consensus)
                lst.append([t, article_id, q, consensus])

consensus_df = pd.DataFrame(lst, columns=cols)

Task File: Evidence Article Number: 1712 Question: T1.Q1 Answer: T1.Q1.A1
Task File: Evidence Article Number: 1712 Question: T1.Q10 Answer: T1.Q10.A5
Task File: Evidence Article Number: 1712 Question: T1.Q11 Answer: T1.Q11.A2
Task File: Evidence Article Number: 1712 Question: T1.Q12 Answer: T1.Q12.A3
Task File: Evidence Article Number: 1712 Question: T1.Q13 Answer: T1.Q13.A5
Task File: Evidence Article Number: 1712 Question: T1.Q14 Answer: T1.Q14.A7
Task File: Evidence Article Number: 1712 Question: T1.Q3 Answer: T1.Q3.A1
Task File: Evidence Article Number: 1712 Question: T1.Q4 Answer: T1.Q4.A4
Task File: Evidence Article Number: 1712 Question: T1.Q5 Answer: T1.Q5.A5
Task File: Evidence Article Number: 1712 Question: T1.Q6 Answer: T1.Q6.A3
Task File: Evidence Article Number: 1712 Question: T1.Q7 Answer: T1.Q7.A1
Task File: Evidence Article Number: 1712 Question: T1.Q8 Answer: T1.Q8.A4
Task File: Evidence Article Number: 1712 Question: T1.Q9 Answer: T1.Q9.A1
Task File: Evidence Article 

Task File: Language Article Number: 100002 Question: T1.Q1.A2 Answer: 1
Task File: Language Article Number: 100002 Question: T1.Q1.A13 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q1.A10 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q1.A4 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q1.A3 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q1.A12 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q1.A8 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A8 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A7 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A3 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A5 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A2 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A1 Answer: 0
Task File: Language Article Number: 100002 Question: T1.Q6.A4

Task File: Probability Article Number: 100003 Question: T1.Q1 Answer: T1.Q1.A1
Task File: Probability Article Number: 100003 Question: T1.Q11 Answer: T1.Q11.A3
Task File: Probability Article Number: 100003 Question: T1.Q13 Answer: T1.Q13.A3
Task File: Probability Article Number: 100003 Question: T1.Q14 Answer: T1.Q14.A7
Task File: Probability Article Number: 100003 Question: T1.Q2 Answer: T1.Q2.A4
Task File: Probability Article Number: 100003 Question: T1.Q5 Answer: T1.Q5.A3
Task File: Probability Article Number: 100003 Question: T1.Q6 Answer: T1.Q6.A1
Task File: Probability Article Number: 100003 Question: T1.Q12.A4 Answer: 0
Task File: Probability Article Number: 100004 Question: T1.Q1 Answer: T1.Q1.A3
Task File: Probability Article Number: 100004 Question: T1.Q11 Answer: T1.Q11.A4
Task File: Probability Article Number: 100004 Question: T1.Q13 Answer: T1.Q13.A5
Task File: Probability Article Number: 100004 Question: T1.Q14 Answer: T1.Q14.A7
Task File: Probability Article Number: 1000

In [28]:
consensus_df[consensus_df['Task File'] == 'Probability']

| | Task File | Article Number | Question Label | Answer Label |
|---|---|---|---|---|
| 218 | Probability | 1712 | T1.Q1 | T1.Q1.A1 |
| 219 | Probability | 1712 | T1.Q10 | T1.Q10.A3 |
| 220 | Probability | 1712 | T1.Q11 | T1.Q11.A2 |
| 221 | Probability | 1712 | T1.Q13 | T1.Q13.A5 |
| 222 | Probability | 1712 | T1.Q14 | T1.Q14.A7 |
| ... | ... | ... | ... | ... |
| 279 | Probability | 100026 | T1.Q14 | T1.Q14.A6 |
| 280 | Probability | 100026 | T1.Q2 | T1.Q2.A3 |
| 281 | Probability | 100026 | T1.Q5 | T1.Q5.A1 |
| 282 | Probability | 100026 | T1.Q6 | T1.Q6.A1 |


In [29]:
consensus_df.to_csv(r'Answer_Consensus.csv')

In [30]:
def get_question_base_label(row):
    if "A" not in row['Question Label']:
        return row['Question Label']
    else:
        base_end_index = row['Question Label'].index("A") - 1
        return row['Question Label'][:base_end_index]

In [31]:
consensus_df["Question Base Label"] = consensus_df.apply(get_question_base_label, axis=1)
consensus_df.loc[consensus_df["Question Base Label"] == "T1.Q1", "Task File"].unique()

array(['Evidence', 'Language', 'Probability', 'Reasoning'], dtype=object)