# Public Editor Convergence Tool 

In [1]:
import numpy as np
import pandas as pd
import re
from collections import Counter

## Goal: ## 
The goal of this tool is to find the minimum number of users it takes to reach a consensus for each question. We treat the consensus (the most common answer across all users) as the "correct answer". We then use a bootstrapping method: we repeatedly draw random samples of users, with replacement, at increasing sample sizes until the sample's majority reliably matches the correct answer for that question.

## Reading In The Data and Organizing It ##
We'll need to read in our data file and its schema. For now we handle the data file by file, since each DataHuntAnswers.csv and Schema.csv pair covers a different set of questions and answers. In the future, we hope to build a tool that can read in all the data easily.

In [2]:
df = pd.read_csv("../urap/Demo1Evi-2019-06-22T0023-DataHuntAnswers.csv")
backup_df = df.copy()  # independent copy, so later changes to df do not affect the backup
schema = pd.read_csv("../urap/Demo1Evi-2019-06-22T0023-Schema.csv")
backup_schema = schema.copy()

We will be using the Schema file in order to help us disregard open-ended questions since those types of questions will not have a convergence or a majority answer. 

In [3]:
textQ = list(schema[schema['question_type'] == 'TEXT']["question_label"])

The following code takes in the data and organizes it by contributor ID. It shows which questions each user reached and their answers to those questions.

In [4]:
# Take in the feed, get the contributor_uuid and their answers for each question
def getContributorAndAns(df):
    cols = df.columns
    Q_s = [col_name for col_name in cols if re.search(r"\.Q[0-9]", col_name) and col_name not in textQ]
    selected_cols = ['contributor_uuid'] + Q_s
    return df[selected_cols]
    
df = getContributorAndAns(df)

df.head()

Unnamed: 0,contributor_uuid,T1.Q2.A1,T1.Q2.A2,T1.Q2.A3,T1.Q3.A1,T1.Q3.A2,T1.Q3.A3,T1.Q3.A4,T1.Q3.A5,T1.Q3.A6,...,T1.Q5,T1.Q6,T1.Q7,T1.Q9,T1.Q11,T1.Q12,T1.Q13,T1.Q14,T1.Q15,T1.Q16
0,e1ae8875-a398-4dde-8f4e-4b21109784e3,1,0,0,1,1,0,0,0,0,...,Slightly representative,,,Somewhat less likely,Yes,Very Unlikely,Somewhat Unlikely,"Yes, expicitly",5: Middling difficulty,8
1,2abbe1a3-c9a7-41ab-9738-64ca86756d37,1,0,0,1,1,0,1,1,0,...,Fairly representative,Can't tell; not enough info,,Somewhat less likely,Yes,Somewhat Unlikely,Somewhat Unlikely,"Yes, expicitly",3,7
2,e4a2f20b-0ee3-4d4c-a622-e392d1150ec8,0,0,1,0,0,0,0,0,0,...,,,,,No,,,"Yes, implicitly",2,10: 100% certain about all my answers
3,f9143626-bfe0-4e69-b652-6d1525ab4eb0,0,0,0,0,0,0,0,0,0,...,,,,,No,,,"No, not at all",4,2
4,6ae640df-8dbc-4401-ae14-0636d2c0d086,0,0,1,0,0,0,0,0,0,...,,,,,Yes,Somewhat Unlikely,Not Sure,"Yes, expicitly",4,8


Again, we dropped all open-ended questions. For questions that can have more than one answer (i.e. the checkbox questions), the data is one-hot encoded (OHE): each answer choice gets its own 0/1 column. We originally wanted to rename the columns, but decided that it was not necessary.
#### Note: #### 
Columns named T1.Q#.A# belong to questions that can have more than one answer and are one-hot encoded.
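
To make the column naming concrete, here is a minimal sketch of the same filter applied to a small invented table. The data, the `article_number` metadata column, and the names `toy`, `toy_text_questions`, `question_cols`, and `selected` are all hypothetical. It also assumes, as the notebook does, that the schema's question labels match the column names.

```python
import re
import pandas as pd

# Hypothetical toy data: column names follow the T1.Q# / T1.Q#.A# pattern,
# but every value here is invented for illustration.
toy = pd.DataFrame({
    "contributor_uuid": ["u1", "u2", "u3"],
    "T1.Q2.A1": [1, 0, 1],                        # checkbox option 1 of Q2 (one-hot)
    "T1.Q2.A2": [0, 1, 1],                        # checkbox option 2 of Q2 (one-hot)
    "T1.Q4": ["free text", "more text", None],    # open-ended question
    "T1.Q5": ["Yes", "No", "Yes"],                # single-answer question
    "article_number": [7, 7, 7],                  # hypothetical metadata column
})

# Assumed to come from the schema rows whose question_type is TEXT.
toy_text_questions = ["T1.Q4"]

# Keep only question columns, skipping the open-ended ones.
question_cols = [
    col for col in toy.columns
    if re.search(r"\.Q[0-9]", col) and col not in toy_text_questions
]
selected = toy[["contributor_uuid"] + question_cols]
print(list(selected.columns))
```

The open-ended column and the metadata column are dropped, while each checkbox option stays as its own column and is later treated as a separate yes/no question.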

## Bootstrapping ##
We used a bootstrapping strategy to find the number of users it takes to match the consensus.

We chose a p-value of 0.01, meaning the majority of a sample must match the "correct" consensus more than 99% of the time.
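
For a question with only two possible answers, the 99% criterion can also be checked exactly rather than by simulation. The sketch below is not part of the tool. It assumes independent draws with replacement, an odd sample size (so there are no ties), and a hypothetical share `p` of users who gave the consensus answer. The function names are made up for this illustration.

```python
from math import comb

def majority_match_prob(p, n):
    """Probability that more than half of n independent draws equal the
    consensus answer, when each draw matches it with probability p."""
    k_min = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def min_odd_sample_size(p, threshold=0.99, max_n=199):
    """Smallest odd n whose majority matches the consensus with probability
    above threshold, or None if no odd n up to max_n qualifies."""
    for n in range(1, max_n + 1, 2):
        if majority_match_prob(p, n) > threshold:
            return n
    return None

print(majority_match_prob(0.9, 3))
print(min_odd_sample_size(0.9))
```

With an evenly split question (`p = 0.5`) the probability stays at 0.5 for every odd sample size, which is why questions without a clear majority cannot converge under this criterion.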

In the future, we hope to integrate the user reputation tool and be able to put weights on the users. As of right now, we do not have the tool so the user weights/reputation scores are all just 1. 

The following function finds the consensus for a question by returning the most frequent answer.

In [5]:
def getConsensus(answers):
    qcounts = Counter(answers)
    maxKey = max(qcounts.keys(), key=lambda key: qcounts[key])
    return maxKey
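
As a small illustration, an equivalent consensus function can be written with `Counter.most_common`. The name `consensus_of` and the example answers are hypothetical. When two answers are tied, both this version and the one above return whichever answer was encountered first.

```python
from collections import Counter

def consensus_of(answers):
    """Return the most frequent answer (hypothetical helper for illustration)."""
    # most_common(1) returns a list with one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]

print(consensus_of(["Yes", "No", "Yes"]))   # clear majority
print(consensus_of(["No", "Yes"]))          # tie: first answer seen wins
```

Because ties are broken by order of appearance, an evenly split question can produce a "consensus" that depends only on row order.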

To help improve the efficiency of our code, we converted all of the answers from strings to integers.

In [6]:
def strToInt(lst):
    vals = lst.unique()
    mapper = {vals[i]:i for i in range(len(vals))}
    return lst.replace(mapper)
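
An alternative way to get the same kind of integer codes is pandas' built-in `pd.factorize`, which numbers values in order of first appearance. This is only a sketch of an option, not what the tool uses, and the example answers are invented.

```python
import pandas as pd

# Hypothetical answer column for one question.
answers = pd.Series(["Yes", "No", "Yes", "Not Sure"])

# codes holds one integer per row; labels holds the original value of each code.
codes, labels = pd.factorize(answers)
print(list(codes))
print(list(labels))
```

Keeping `labels` around makes it possible to translate an integer consensus back into the original answer text.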

The following function is to bootstrap and find the minimum number of users needed to find a convergence for each question. If we are unable to reach a confidence of 99% after sampling a max number of users, then we cannot find a convergence. The number of users that we sample can be changed in the future. 

In [7]:
# n = number of answer choices for questions
# c = consensus of entire dataset for question
# answers = answer column

def getN(questionName):
    answers = df[questionName].dropna()
    answers = strToInt(answers)
    n = len(pd.unique(answers))
    c = getConsensus(answers)
    
    for i in range(n + 1, max_group_size):
        count = 0
        for s in range(0, 1000):
            sample = np.random.choice(answers, i, replace=True)
            consensus = getConsensus(sample)
            if consensus == c:
                count += 1
        
        if count/1000 > .99:
            return i
            
    return "Not converged"
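
The same procedure can be sketched as a self-contained function that takes its settings as arguments instead of reading the global `max_group_size`, and that uses a seeded random generator so runs are repeatable. This is an illustrative variant, not the implementation above. The name `min_group_size`, its parameters, and the `None` return value for non-convergence are assumptions made for this sketch.

```python
from collections import Counter
import numpy as np

def min_group_size(answers, max_group_size=100, n_trials=1000,
                   threshold=0.99, seed=0):
    """Smallest sample size whose majority matches the overall consensus in
    more than `threshold` of bootstrap samples, or None if none is found."""
    rng = np.random.default_rng(seed)        # seeded for repeatable results
    answers = np.asarray(answers)
    values = answers.tolist()
    target = Counter(values).most_common(1)[0][0]   # consensus of all answers
    n_choices = len(set(values))                    # distinct answers observed

    for size in range(n_choices + 1, max_group_size):
        # Draw all bootstrap samples of this size at once: one row per trial.
        samples = rng.choice(answers, size=(n_trials, size), replace=True)
        hits = sum(
            Counter(row.tolist()).most_common(1)[0][0] == target
            for row in samples
        )
        if hits / n_trials > threshold:
            return size
    return None

print(min_group_size(["Yes"] * 10))
print(min_group_size(["Yes", "No"] * 5, max_group_size=10, n_trials=200))
```

Returning `None` instead of a string keeps the result column numeric-friendly, at the cost of differing from the "Not converged" label used in the output below.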

We ran the function with a maximum group size of 100. We used a print statement to monitor how quickly each question finishes and organized the final results in a pandas DataFrame.

In [8]:
cols = ['Question', 'Min']
lst=[]
max_group_size = 100 # FEEL FREE TO CHANGE THIS

for q in df.columns[1:]:
    if len(df[q].dropna().unique()) < 20:
        n = getN(q)
        lst.append([q,n])
        print("Question:", q, "Converge:", n)
    
df1 = pd.DataFrame(lst, columns=cols)
df1

Question: T1.Q2.A1 Converge: Not converged
Question: T1.Q2.A2 Converge: 19
Question: T1.Q2.A3 Converge: 45
Question: T1.Q3.A1 Converge: 71
Question: T1.Q3.A2 Converge: 7
Question: T1.Q3.A3 Converge: 3
Question: T1.Q3.A4 Converge: 9
Question: T1.Q3.A5 Converge: 31
Question: T1.Q3.A6 Converge: 5
Question: T1.Q3.A7 Converge: 7
Question: T1.Q3.A8 Converge: 3
Question: T1.Q3.A9 Converge: 3
Question: T1.Q5 Converge: Not converged
Question: T1.Q6 Converge: 73
Question: T1.Q7 Converge: Not converged
Question: T1.Q9 Converge: Not converged
Question: T1.Q11 Converge: 30
Question: T1.Q12 Converge: Not converged
Question: T1.Q13 Converge: Not converged
Question: T1.Q14 Converge: 45
Question: T1.Q15 Converge: Not converged
Question: T1.Q16 Converge: Not converged


Unnamed: 0,Question,Min
0,T1.Q2.A1,Not converged
1,T1.Q2.A2,19
2,T1.Q2.A3,45
3,T1.Q3.A1,71
4,T1.Q3.A2,7
5,T1.Q3.A3,3
6,T1.Q3.A4,9
7,T1.Q3.A5,31
8,T1.Q3.A6,5
9,T1.Q3.A7,7
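
Because the `Min` column mixes integers with the string "Not converged", it cannot be summarized numerically as is. One possible approach, shown here on the first three rows of the results above, is to coerce the column to numbers so that non-converged questions become `NaN`. The names `results`, `Min_numeric`, and `converged` are hypothetical.

```python
import pandas as pd

# First three rows of the results table above.
results = pd.DataFrame({
    "Question": ["T1.Q2.A1", "T1.Q2.A2", "T1.Q2.A3"],
    "Min": ["Not converged", 19, 45],
})

# Non-numeric entries such as "Not converged" become NaN.
results["Min_numeric"] = pd.to_numeric(results["Min"], errors="coerce")

# Keep only the questions that reached convergence.
converged = results.dropna(subset=["Min_numeric"])
print(converged[["Question", "Min_numeric"]])
```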
