# Credibility Indicator Variance

Author: Ewen Dai <br>
Completed: 28 July 2019 <br>
Contact Info: ewendai@berkeley.edu <br>

In [166]:
import pandas as pd
import numpy as np

### Definitions

- Assignment: The specialist assignments that each article goes through. (Ex: Language, Reasoning, Probability, Evidence, etc)
- Article: The piece of written work that we are analyzing. Identified by an article number.
- Task: Each run of an assignment on an article

### Variance Calculation Method

#### Variance of a single question of a task
Find the count of each provided answer and take the variance of the given numbers. For instance, Q7 has three options. 4 responses indicate the first option, 5 responses indicate the second option, and 1 response indicate the third option. The variance of Q7 would be `np.var([4, 5, 1])`, which is `2.889`.

##### Suggested Calculation Method
1. One-hot-encode answers.
2. Sum each column of the one-hot-encoded data.
3. Take the variance of the column sums.

#### Variance of a single question of an article
The mean of the variances of each task from the article.

#### Variance of an article
The weighted average of the variances of each question where weights are based on the "importance" of each question, which is decided separately. If there are no specified weights, assume questions are weighted equally.

## General Functions

`numVariance` is designed to find the variance of numerical questions/data. Expected format of numerical data is that they are already one-hot-encoded.

Inputs: 
* `df` = df with columns ['quiz_task_uuid', 'article_number'] and columns of the questions (e.g. ['T1.Q2.A1', 'T1.Q2.A2', 'T1.Q2.A3', 'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 'T1.Q3.A5', ...])
* `columns` = list of column names containing question data as strings
* `questions` = list of question names (e.g. ['T1.Q2', 'T1.Q3', ...])

Output:
* dictionary of dictionaries: keys of "outer" dictionary are task numbers, keys of "inner" dictionary are question numbers

In [167]:
def numVariance(df, columns, questions):
    if len(columns) == 0 or len(questions) == 0:
        return {}
    
    articleNums = df["article_number"].unique()
    quizTaskUUID = df["quiz_task_uuid"].unique()

    dictionary = {}
    for i in articleNums:
        dictionary[i] = {}
        for j in quizTaskUUID:
            dfTemp = df[(df["article_number"] == i) & (df["quiz_task_uuid"] == j)]
            for q in questions:
                if not q in dictionary[i]:
                    dictionary[i][q] = []
                if not dfTemp.empty:
                    tempList = []
                    for c in columns:
                        if c.startswith(q):
                            tempList.append(c)

                    counts = []
                    for t in tempList:
                        counts.append(sum(dfTemp[t]))
                dictionary[i][q].append(np.var(counts))
        for q in questions:
            dictionary[i][q] = np.mean(dictionary[i][q])
    return dictionary

`textVariance` is designed to find the variance of text-based questions/data. Expected format of data is that all responses to one question is in one column.

Inputs: 
* `df` = df with columns ['quiz_task_uuid', 'article_number'] and columns of the questions (e.g. ['T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8' ...])
* `columns` = list of column names containing question data as strings
* `questions` = list of question names. should be the same as `columns` (e.g. ['T1.Q4', 'T1.Q5', ...])

Output:
* dictionary of dictionaries: keys of "outer" dictionary are task numbers, keys of "inner" dictionary are question numbers

In [168]:
def textVariance(df, questions):
    if len(questions) == 0:
        return {}
    
    articleNums = df["article_number"].unique()
    quizTaskUUID = df["quiz_task_uuid"].unique()
    
    dictionary = {}
    for i in articleNums:
        dictionary[i] = {}
        for j in quizTaskUUID:
            dfTemp = df[(df["article_number"] == i) & (df["quiz_task_uuid"] == j)]
            for q in questions:
                if not q in dictionary[i]:
                    dictionary[i][q] = []
                if not dfTemp.empty:
                    dictionary[i][q].append(pd.get_dummies(dfTemp[q]).sum().values)
        for q in questions:
            dictionary[i][q] = np.mean([np.var(arr) for arr in dictionary[i][q]])
    return dictionary

`finalVariance` is designed to find the total variance of a task

Assumptions:
* Questions are either numerical or text, and cannot be both
* Questions are numbered in order, with no questions having no data

Inputs: 
* `articles` = list of article numbers
* `numDict` = dictionary of dictionaries containing calculated variances of numerical data
* `textDict` = dictionary of dictionaries containing calculated variances of text-based data
* `weights` = list of weights for each question. len(weights) must equal the number of questions as provided in `numDict` and `textDict`. `weights` is assumed to correspond to questions, in order. For unspecified weights, use `np.ones(numQuestions).tolist()`
* `questions` = list of all questions in this task

Output: 
* dictionary: keys are task numbers and corresponding values are the calculated variances

In [256]:
def finalVariance(articles, numDict, textDict, weights, questions):
    dictionary = {}
    for i in articles:
        if len(numDict) == 0 and len(textDict) == 0:
            return {}
        elif len(numDict) == 0:
            lengths = [len(v) for v in textDict.values()]
            if all(x == lengths[0] for x in lengths): 
                if lengths[0] == len(weights) and len(weights) == len(questions):
                    dictionary[i] = np.mean([w*variance for w, variance in zip(weights, textDict[i].values())])
                else:
                    raise ValueError("Number of weight(s) do not equal the number of question(s) or given weights/questions is incorrect.")
            else:
                raise ValueError("Given textDict does not have variance for each question for each task")
        elif len(textDict) == 0:
            lengths = [len(v) for v in numDict.values()]
            if all(x == lengths[0] for x in lengths): 
                if lengths[0] == len(weights) and len(weights) == len(questions):
                    dictionary[i] = np.mean([w*variance for w, variance in zip(weights, numDict[i].values())])
                else:
                    raise ValueError("Number of weight(s) do not equal the number of question(s) or given weights/questions is incorrect.")
            else:
                raise ValueError("Given numDict does not have variance for each question for each task")
        else: # both len(numDict) and len(textDict) > 0
            lengthText = [len(v) for v in textDict.values()]
            lengthNum = [len(v) for v in numDict.values()]
            if all(x == lengthText[0] for x in lengthText) and all(x == lengthNum[0] for x in lengthNum): 
                if lengthText[0] + lengthNum[0] == len(weights) and len(weights) == len(questions):
                    combinedDict = {}
                    # data is pulled from textDict[i] and numDict[i]
                    for q in questions:
                        if q in textDict[i]:
                            # if the variance does not exist, we currently set to 0 for ease of calculation.
                            # will change to not considering the nan value in the future.
                            if np.isnan(textDict[i][q]):
                                textDict[i][q] = 0
                            combinedDict[q] = textDict[i][q]
                        else:
                            if np.isnan(numDict[i][q]):
                                numDict[i][q] = 0
                            combinedDict[q] = numDict[i][q]
                    dictionary[i] = np.mean([w*variance for w, variance in zip(weights, combinedDict.values())])
                else:
                    raise ValueError("Number of weight(s) do not equal the number of question(s). numDict has info on ", 
                                     lengthNum[0], " question(s) and textDict has info on ", lengthText[0], 
                                    " questions. There are ", len(questions), 
                                     " questions total. Alternatively, given weights/questions may be incorrect.")
            else:
                raise ValueError("Given dictionary/dictionaries do(es) not have variance for each question for each task")
    return dictionary

## Argument Relevance

In [257]:
argRel = pd.read_csv('demo1/Demo1ArgRel3-2018-09-01T2239-DataHuntAnswers.csv')
argRel = argRel.loc[:, ['quiz_task_uuid', 'article_number', 'quiz_taskrun_uuid', 
                        'contributor_uuid', 'T1.Q1']]

In [258]:
articles = argRel["article_number"].unique()
numDict = {}
textDict = textVariance(argRelSimp, ['T1.Q1'])
weights = np.ones(1).tolist()
questions = ['T1.Q1']

In [259]:
argRelVariances = finalVariance(articles, numDict, textDict, weights, questions)
argRelVariances

{1721: 0.125, 1712: 0.041666666666666664, 1737: 0.0}

## Evidence

In [260]:
evidence = pd.read_csv('demo1/Demo1Evi-2018-09-01T2239-DataHuntAnswers.csv')
evidence = evidence.drop(columns=['schema_namespace', 'schema_sha256', 'task_url', 
                                                    'article_batch_name', 'article_filename', 'article_sha256',
                                                    'quiz_taskrun_uuid', 'contributor_uuid', 'created',
                                                    'finish_time', 'elapsed_seconds', 'final_queue'])

In [261]:
numericaldf = evidence.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q2.A1', 'T1.Q2.A2', 'T1.Q2.A3', 'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 
           'T1.Q3.A5', 'T1.Q3.A6', 'T1.Q3.A7', 'T1.Q3.A8', 'T1.Q3.A9']]
columns = ['T1.Q2.A1', 'T1.Q2.A2', 'T1.Q2.A3', 'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 
           'T1.Q3.A5', 'T1.Q3.A6', 'T1.Q3.A7', 'T1.Q3.A8', 'T1.Q3.A9']
questions = ['T1.Q2', 'T1.Q3']

In [262]:
numDict = numVariance(numericaldf, columns, questions)

In [263]:
textdf = evidence.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q11', 
                                   'T1.Q12', 'T1.Q13', 'T1.Q14', 'T1.Q15', 'T1.Q16']]
questions = ['T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q11', 
                                   'T1.Q12', 'T1.Q13', 'T1.Q14', 'T1.Q15', 'T1.Q16']

In [264]:
textDict = textVariance(textdf, questions)

In [265]:
articles = evidence["article_number"].unique()
weights = np.ones(14).tolist()
questions = ['T1.Q2', 'T1.Q3', 'T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9',
             'T1.Q11', 'T1.Q12', 'T1.Q13', 'T1.Q14', 'T1.Q15', 'T1.Q16']

In [266]:
evidenceVariances = finalVariance(articles, numDict, textDict, weights, questions)
evidenceVariances

{1721: 0.5960055359592397, 1712: 0.7523105893592004, 1737: 0.614108735057809}

## Probability

In [267]:
prob = pd.read_csv('demo1/Demo1Prob-2018-09-01T2240-DataHuntAnswers.csv')
prob = prob.drop(columns=['schema_namespace', 'schema_sha256', 'task_url', 
                                                    'article_batch_name', 'article_filename', 'article_sha256',
                                                    'quiz_taskrun_uuid', 'contributor_uuid', 'created',
                                                    'finish_time', 'elapsed_seconds', 'final_queue'])

In [268]:
numericaldf = prob.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q12.A1', 'T1.Q12.A2', 'T1.Q12.A3', 'T1.Q12.A4']]
columns = ['T1.Q12.A1', 'T1.Q12.A2', 'T1.Q12.A3', 'T1.Q12.A4']
questions = ['T1.Q12']

In [269]:
numDict = numVariance(numericaldf, columns, questions)

In [270]:
textdf = prob.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q1', 'T1.Q2', 'T1.Q4', 'T1.Q5', 
                      'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q10', 'T1.Q11', 'T1.Q13', 'T1.Q14']]
questions = ['T1.Q1', 'T1.Q2', 'T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 
             'T1.Q10', 'T1.Q11', 'T1.Q13', 'T1.Q14']

In [271]:
textDict = textVariance(textdf, questions)

In [272]:
articles = prob["article_number"].unique()
weights = np.ones(13).tolist()
questions = ['T1.Q1', 'T1.Q2', 'T1.Q4', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 
             'T1.Q10', 'T1.Q11', 'T1.Q12', 'T1.Q13', 'T1.Q14']

In [273]:
probVariances = finalVariance(articles, numDict, textDict, weights, questions)
probVariances

{1737: 0.9082051282051282, 1712: 0.6904700854700855, 1721: 0.9754273504273503}

## Quote Source

In [274]:
quosour = pd.read_csv('demo1/Demo1QuoSour-2018-09-01T2240-DataHuntAnswers.csv')
quosour = quosour.drop(columns=['schema_namespace', 'schema_sha256', 'task_url', 
                                                    'article_batch_name', 'article_filename', 'article_sha256',
                                                    'quiz_taskrun_uuid', 'contributor_uuid', 'created',
                                                    'finish_time', 'elapsed_seconds', 'final_queue'])

In [275]:
numericaldf = quosour.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 
                              'T1.Q1.A5', 'T1.Q1.A6', 'T1.Q1.A7', 'T1.Q1.A8', 'T1.Q1.A9', 'T1.Q2.A1', 'T1.Q2.A2', 
                              'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 'T1.Q3.A5']]
columns = ['T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 'T1.Q1.A5', 'T1.Q1.A6', 'T1.Q1.A7', 'T1.Q1.A8', 
           'T1.Q1.A9', 'T1.Q2.A1', 'T1.Q2.A2', 'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 'T1.Q3.A5']
questions = ['T1.Q1', 'T1.Q2', 'T1.Q3']

In [276]:
numDict = numVariance(numericaldf, columns, questions)

In [277]:
textdf = quosour.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q4']]
questions = ['T1.Q4']

In [278]:
textDict = textVariance(textdf, questions)

In [279]:
articles = quosour["article_number"].unique()
weights = np.ones(4).tolist()
questions = ['T1.Q1', 'T1.Q2', 'T1.Q3', 'T1.Q4']

In [280]:
quosourVariances = finalVariance(articles, numDict, textDict, weights, questions)
quosourVariances

{1712: 0.7030632716049383, 1721: 0.411820987654321, 1737: 0.22109567901234572}

## Reasoning

In [281]:
reas = pd.read_csv('demo1/Demo1Reas-2018-09-01T2241-DataHuntAnswers.csv')
reas = reas.drop(columns=['schema_namespace', 'schema_sha256', 'task_url', 
                                                    'article_batch_name', 'article_filename', 'article_sha256',
                                                    'quiz_taskrun_uuid', 'contributor_uuid', 'created',
                                                    'finish_time', 'elapsed_seconds', 'final_queue'])

In [282]:
numericaldf = reas.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 
                              'T1.Q1.A5', 'T1.Q2.A1', 'T1.Q2.A2', 'T1.Q2.A3', 'T1.Q2.A4', 'T1.Q2.A5', 'T1.Q2.A6',
                              'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 'T1.Q3.A5', 'T1.Q3.A6', 'T1.Q3.A7',
                              'T1.Q6.A1', 'T1.Q6.A2', 'T1.Q6.A3', 'T1.Q6.A4', 'T1.Q6.A5', 'T1.Q6.A6', 'T1.Q6.A7', 
                              'T1.Q6.A8', 'T1.Q6.A9']]
columns = ['T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 'T1.Q1.A5', 'T1.Q2.A1', 'T1.Q2.A2', 'T1.Q2.A3', 
           'T1.Q2.A4', 'T1.Q2.A5', 'T1.Q2.A6', 'T1.Q3.A1', 'T1.Q3.A2', 'T1.Q3.A3', 'T1.Q3.A4', 'T1.Q3.A5', 
           'T1.Q3.A6', 'T1.Q3.A7', 'T1.Q6.A1', 'T1.Q6.A2', 'T1.Q6.A3', 'T1.Q6.A4', 'T1.Q6.A5', 'T1.Q6.A6', 
           'T1.Q6.A7', 'T1.Q6.A8', 'T1.Q6.A9']
questions = ['T1.Q1', 'T1.Q2', 'T1.Q3', 'T1.Q6']

In [283]:
numDict = numVariance(numericaldf, columns, questions)

In [284]:
textdf = reas.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q4', 'T1.Q5', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q10']]
questions = ['T1.Q4', 'T1.Q5', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q10']

In [285]:
textDict = textVariance(textdf, questions)

In [286]:
articles = reas["article_number"].unique()
weights = np.ones(10).tolist()
questions = ['T1.Q1', 'T1.Q2', 'T1.Q3', 'T1.Q4', 'T1.Q5',  'T1.Q6', 'T1.Q7', 'T1.Q8', 'T1.Q9', 'T1.Q10']

In [287]:
reasVariances = finalVariance(articles, numDict, textDict, weights, questions)
reasVariances

{1712: 0.6269523389602755}

## Language

In [288]:
lang = pd.read_csv('demo1/DemoLang-2018-09-01T2240-DataHuntAnswers.csv')
lang = lang.drop(columns=['schema_namespace', 'schema_sha256', 'task_url', 
                                                    'article_batch_name', 'article_filename', 'article_sha256',
                                                    'quiz_taskrun_uuid', 'contributor_uuid', 'created',
                                                    'finish_time', 'elapsed_seconds', 'final_queue'])

In [289]:
numericaldf = lang.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 
                           'T1.Q1.A5', 'T1.Q1.A6', 'T1.Q1.A7', 'T1.Q1.A8', 'T1.Q1.A9', 'T1.Q1.A10', 'T1.Q1.A11', 
                           'T1.Q1.A12', 'T1.Q1.A13', 'T1.Q6.A1', 'T1.Q6.A2', 'T1.Q6.A3', 'T1.Q6.A4', 'T1.Q6.A5', 
                           'T1.Q6.A6', 'T1.Q6.A7', 'T1.Q6.A8', 'T1.Q12.A1', 'T1.Q12.A2', 'T1.Q12.A3', 'T1.Q12.A4']]
columns = ['T1.Q1.A1', 'T1.Q1.A2', 'T1.Q1.A3', 'T1.Q1.A4', 'T1.Q1.A5', 'T1.Q1.A6', 'T1.Q1.A7', 'T1.Q1.A8', 
           'T1.Q1.A9', 'T1.Q1.A10', 'T1.Q1.A11', 'T1.Q1.A12', 'T1.Q1.A13', 'T1.Q6.A1', 'T1.Q6.A2', 'T1.Q6.A3', 
           'T1.Q6.A4', 'T1.Q6.A5', 'T1.Q6.A6', 'T1.Q6.A7', 'T1.Q6.A8', 'T1.Q12.A1', 'T1.Q12.A2', 'T1.Q12.A3',
           'T1.Q12.A4']
questions = ['T1.Q1', 'T1.Q6', 'T1.Q12']

In [290]:
numDict = numVariance(numericaldf, columns, questions)

In [291]:
textdf = lang.loc[:, ['quiz_task_uuid', 'article_number', 'T1.Q2', 'T1.Q3', 'T1.Q5', 'T1.Q7', 
                      'T1.Q9', 'T1.Q10', 'T1.Q11', 'T1.Q13', 'T1.Q15']]
questions = ['T1.Q2', 'T1.Q3', 'T1.Q5', 'T1.Q7', 'T1.Q9', 'T1.Q10', 'T1.Q11', 'T1.Q13', 'T1.Q15']

In [292]:
textDict = textVariance(textdf, questions)

In [293]:
articles = lang["article_number"].unique()
weights = np.ones(12).tolist()
questions = ['T1.Q1', 'T1.Q2', 'T1.Q3', 'T1.Q5', 'T1.Q6', 'T1.Q7', 'T1.Q9', 'T1.Q10', 'T1.Q11', 'T1.Q12', 'T1.Q13', 'T1.Q15']

In [294]:
langVariances = finalVariance(articles, numDict, textDict, weights, questions)
langVariances

{1712: 1.4098379829873124, 1737: 1.1954882136678202}