# Annotating Training Data With MTurk

## Prerequisites
If you haven't already, you'll need to set up MTurk and AWS accounts and link them together to use MTurk with Python. The MTurk account is used to post tasks to the MTurk crowd. The AWS account is used to connect to MTurk via the API and to provide access to any additional AWS resources needed to execute your task.

1. If you don't have an AWS account already, visit https://aws.amazon.com and create an account you can use for your project.
2. If you don't have an MTurk Requester account already, visit https://requester.mturk.com and create a new account.

After you've set up your accounts, you will need to link them. While logged into both the root user of your AWS account and your MTurk account, visit https://requester.mturk.com/developer to link them.

From your AWS console, create a new AWS IAM User or select an existing one you plan to use. Add the AmazonMechanicalTurkFullAccess policy to your user. Then select the Security Credentials tab and create a new Access Key. Copy the Access Key and Secret Access Key for future use.

If you haven't installed the awscli yet, install it with pip (pip install awscli) and configure a profile using the access key and secret key above (aws configure --profile mturk). 
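One way to confirm that the profile was written is to look for it in the shared credentials file. The sketch below assumes the default location `~/.aws/credentials` and the profile name `mturk` used above. `has_aws_profile` is a hypothetical helper written for this notebook, not part of awscli or boto3.

```python
import configparser
from pathlib import Path


def has_aws_profile(profile, credentials_path=None):
    """Return True if `profile` has both key entries in the credentials file."""
    # Assumption: credentials are stored in the default shared credentials file.
    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    # Disable interpolation so '%' characters in secret keys are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is ignored and yields no sections
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section
```

Calling `has_aws_profile("mturk")` should return `True` once `aws configure --profile mturk` has completed. It only checks that the entries exist, not that the keys are valid.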

To post tasks to MTurk for Workers to complete you will first need to add funds to your account that will be used to reward Workers. Visit https://requester.mturk.com/account to get started with as little as $1.00.

We also recommend installing xmltodict as shown below.

In [1]:
!pip install xmltodict


## Overview
Amazon Mechanical Turk allows you to post tasks for Workers to complete at https://worker.mturk.com. To post a task to MTurk, you create an HTML form that collects the information you want Workers to provide. In this example we'll ask Workers to rate the sentiment of tweets on a scale of 1 (negative) to 10 (positive).

MTurk has a Sandbox environment that can be used for testing. Workers won't see your tasks in the Sandbox, but you can log in at https://workersandbox.mturk.com and complete them yourself to test the task interface. We recommend testing in the Sandbox first to make sure your task returns the data you need before moving to the Production environment. There is no cost to use the Sandbox environment.

In [1]:
import boto3
import xmltodict
import json

In [2]:
create_hits_in_production = False  # Test in the Sandbox first; set to True to post paid HITs
environments = {
        "production": {
            "endpoint": "https://mturk-requester.us-east-1.amazonaws.com",
            "preview": "https://www.mturk.com/mturk/preview"
        },
        "sandbox": {
            "endpoint": "https://mturk-requester-sandbox.us-east-1.amazonaws.com",
            "preview": "https://workersandbox.mturk.com/mturk/preview"
        },
}
mturk_environment = environments["production"] if create_hits_in_production else environments["sandbox"]

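# Use the "mturk" profile created earlier with `aws configure --profile mturk`
session = boto3.Session(profile_name='mturk')
client = session.client(
    service_name='mturk',
    region_name='us-east-1',
    endpoint_url=mturk_environment['endpoint'],
)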
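Because Production HITs spend real funds, it can help to make the Sandbox the default and require an explicit opt-in for Production. The following is a minimal sketch of that idea. The `MTURK_ENV` variable name and the `choose_endpoint` helper are assumptions made for illustration; only the two endpoint URLs come from the cell above.

```python
import os

# Endpoint URLs copied from the environments dictionary above.
ENVIRONMENTS = {
    "production": "https://mturk-requester.us-east-1.amazonaws.com",
    "sandbox": "https://mturk-requester-sandbox.us-east-1.amazonaws.com",
}


def choose_endpoint(env=None):
    """Return (name, endpoint), falling back to the Sandbox when nothing is set."""
    # MTURK_ENV is a hypothetical environment variable, not one that boto3 reads.
    name = (env or os.environ.get("MTURK_ENV", "sandbox")).lower()
    if name not in ENVIRONMENTS:
        raise ValueError("Unknown MTurk environment: {!r}".format(name))
    return name, ENVIRONMENTS[name]


print(choose_endpoint("sandbox"))
```

The returned endpoint could then be passed as `endpoint_url` when creating the client.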



In [5]:
# This will return your current MTurk balance if you are connected to Production.
# If you are connected to the Sandbox it will return $10,000.
print(client.get_account_balance()['AvailableBalance'])


## Define your task
For this project we will collect sentiment ratings for a set of tweets, which we plan to use as training data for a sentiment model. We will create an MTurk Human Intelligence Task (HIT) for each tweet.

In [None]:
tweets = ['in science class right now... urgh... stupid project..',
          'hmmm what to have for breaky?... Honey on toast ',
          'Doing home work  x',
          'Headed out of town for a few days. Will miss my girls']

MTurk accepts an XML document containing the HTML that will be displayed to Workers. Workers will see this HTML for each tweet that is submitted. To use the HTML for this example task, download it from [here](https://s3.amazonaws.com/mturk/samples/jupyter-examples/SentimentQuestion.html) and store it in the same directory as this notebook. Within the HTML is a variable ${content} that will be replaced with a different tweet when each HIT is created.

Here the HTML is loaded and inserted into the XML Document.

In [None]:
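# Read the task layout downloaded above; the context manager closes the file
with open('./SentimentQuestion.html', 'r') as html_file:
    html_layout = html_file.read()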
QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
        <HTMLContent><![CDATA[{}]]></HTMLContent>
        <FrameHeight>650</FrameHeight>
        </HTMLQuestion>"""
question_xml = QUESTION_XML.format(html_layout)
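Tweets are free text, so a tweet containing characters such as `<` or `&` could be interpreted as markup once it replaces `${content}`. A minimal sketch of escaping the tweet before substitution is shown below. The `sample_layout` string is a stand-in for the downloaded HTML file, and `fill_layout` is a hypothetical helper, not part of the MTurk API.

```python
import html

# Stand-in for the contents of SentimentQuestion.html; the real file is larger.
sample_layout = "<p>Rate this tweet:</p><blockquote>${content}</blockquote>"


def fill_layout(layout, tweet):
    """Replace the ${content} placeholder with an HTML-escaped tweet."""
    # html.escape converts &, <, > and quote characters to HTML entities.
    return layout.replace("${content}", html.escape(tweet))


print(fill_layout(sample_layout, "I <3 science & maths"))
# prints: <p>Rate this tweet:</p><blockquote>I &lt;3 science &amp; maths</blockquote>
```

Escaping `>` also means a tweet cannot contain the literal sequence `]]>`, which would otherwise end the CDATA section early.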

In Mechanical Turk each task is represented by a Human Intelligence Task (HIT): an individual item you want annotated by one or more Workers, together with the interface that should be displayed. The definition below requests that five Workers review each item, that the HIT remain live on the worker.mturk.com website for no more than an hour, and that Workers provide a response for each item in less than ten minutes. Each response has a reward of \$0.05, so the total Worker reward for each HIT is \$0.25 (5 × \$0.05) plus \$0.05 in MTurk fees. An appropriate title, description, and keywords are also provided to let Workers know what is involved in this task.

In [None]:
TaskAttributes = {
    'MaxAssignments': 5,                  # How many Workers will review each item
    'LifetimeInSeconds': 60*60,           # How long the task will be available on the MTurk website (1 hour)
    'AssignmentDurationInSeconds': 60*10, # How long Workers have to complete each item (10 minutes)
    'Reward': '0.05',                     # The reward you will offer Workers for each response
    'Title': 'Provide sentiment for a Tweet',
    'Keywords': 'sentiment, tweet',
    'Description': 'Rate the sentiment of a tweet on a scale of 1 to 10.'
}
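Before posting, it can be useful to estimate what the whole batch will cost. The sketch below multiplies the per-response reward by the number of assignments and items, then adds a fee. The 20% fee rate is an assumption inferred from the \$0.05 fee on \$0.25 of rewards quoted above; check the current MTurk pricing before relying on it. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal


def estimate_cost(num_items, reward, max_assignments, fee_rate=Decimal("0.20")):
    """Return (worker_rewards, fees, total) as Decimal values."""
    # Decimal avoids binary floating point rounding on currency amounts.
    rewards = Decimal(str(reward)) * max_assignments * num_items
    fees = rewards * fee_rate  # assumed flat fee rate, see the note above
    return rewards, fees, rewards + fees


# Four tweets, five Workers each, at $0.05 per response.
rewards, fees, total = estimate_cost(num_items=4, reward="0.05", max_assignments=5)
print(f"rewards=${rewards:.2f} fees=${fees:.2f} total=${total:.2f}")
# prints: rewards=$1.00 fees=$0.20 total=$1.20
```

Comparing the total with the balance returned by `get_account_balance` before creating HITs is one way to avoid a partially posted batch.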


## Create the tasks
Here a HIT is created for each tweet so that it can be completed by Workers. Prior to creating the HIT, the tweet is inserted into the Question XML content. The HIT Id returned for each task is stored in a results array so that we can retrieve the results later.

In [None]:
results = []
hit_type_id = ''

for tweet in tweets:
    response = client.create_hit(
        **TaskAttributes,
        Question=question_xml.replace('${content}',tweet)
    )
    hit_type_id = response['HIT']['HITTypeId']
    results.append({
        'tweet': tweet,
        'hit_id': response['HIT']['HITId']
    })
    
print("You can view the HITs here:")
print(mturk_environment['preview'] + "?groupId={}".format(hit_type_id))
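Results can take a while to arrive, so the HIT Ids may need to outlive the notebook kernel. A minimal sketch of saving the `results` list to disk and loading it again is shown below. The file name `hits.json` and both helper functions are assumptions for illustration.

```python
import json
from pathlib import Path


def save_hits(results, path="hits.json"):
    """Write the list of {'tweet': ..., 'hit_id': ...} records to a JSON file."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def load_hits(path="hits.json"):
    """Read back the records written by save_hits."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `save_hits(results)` after the loop and `results = load_hits()` in a later session would let the retrieval step run without recreating the HITs.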


## Get Results
Depending on the task, results can take anywhere from a few minutes to a few hours to become available. Here we retrieve the status of each HIT and the responses that have been provided by Workers.

In [None]:
for item in results:
    
    # Get the status of the HIT
    hit = client.get_hit(HITId=item['hit_id'])
    item['status'] = hit['HIT']['HITStatus']

    # Get a list of the Assignments that have been submitted by Workers
    assignmentsList = client.list_assignments_for_hit(
        HITId=item['hit_id'],
        AssignmentStatuses=['Submitted', 'Approved'],
        MaxResults=10
    )

    assignments = assignmentsList['Assignments']
    item['assignments_submitted_count'] = len(assignments)

    answers = []
    for assignment in assignments:
    
        # Retrieve the attributes for each Assignment
        worker_id = assignment['WorkerId']
        assignment_id = assignment['AssignmentId']
        
        # Retrieve the value submitted by the Worker from the XML
        answer_dict = xmltodict.parse(assignment['Answer'])
        answer = answer_dict['QuestionFormAnswers']['Answer']['FreeText']
        answers.append(int(answer))
        
        # Approve the Assignment (if it hasn't already been approved)
        if assignment['AssignmentStatus'] == 'Submitted':
            client.approve_assignment(
                AssignmentId=assignment_id,
                OverrideRejection=False
            )
    
    # Add the answers that have been retrieved for this item
    item['answers'] = answers
    if len(answers) > 0:
        item['avg_answer'] = sum(answers)/len(answers)

print(json.dumps(results,indent=2))
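The lookup `['Answer']['FreeText']` works when the form submits a single field. With several fields, xmltodict returns a list for `Answer` and that lookup fails. The sketch below shows one alternative using the standard library, mapping each field name to its value. The sample XML, its namespace URL, and the field names `sentiment` and `comment` are made up for illustration, and `parse_answers` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parse_answers(answer_xml):
    """Map each QuestionIdentifier in the answer XML to its FreeText value."""
    root = ET.fromstring(answer_xml)
    fields = {}
    for answer in root:
        values = {}
        for child in answer:
            # Drop the "{namespace}" prefix that ElementTree adds to tag names.
            values[child.tag.rsplit("}", 1)[-1]] = (child.text or "").strip()
        if "QuestionIdentifier" in values:
            fields[values["QuestionIdentifier"]] = values.get("FreeText")
    return fields


# Illustrative payload only; the namespace and field names are placeholders.
sample = """<QuestionFormAnswers xmlns="http://example.com/placeholder-namespace">
  <Answer><QuestionIdentifier>sentiment</QuestionIdentifier><FreeText>7</FreeText></Answer>
  <Answer><QuestionIdentifier>comment</QuestionIdentifier><FreeText>sounds neutral</FreeText></Answer>
</QuestionFormAnswers>"""

print(parse_answers(sample))
# prints: {'sentiment': '7', 'comment': 'sounds neutral'}
```

Values are returned as strings, so a numeric rating still needs `int(...)` before averaging.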