# Table of Contents
* [Take-home MTurk task](#Take-home-MTurk-task)
	* [Instructions from Eric](#Instructions-from-Eric)
	* [Creating batch process from image links in the json data file](#Creating-batch-process-from-image-links-in-the-json-data-file)
		* [Setting data paths](#Setting-data-paths)
		* [Exploring the linked images](#Exploring-the-linked-images)
		* [Writing formatted csv for MTurk batch submission](#Writing-formatted-csv-for-MTurk-batch-submission)
		* [Experimenting with Boto](#Experimenting-with-Boto)
	* [Processing batch results](#Processing-batch-results)
		* [First look at responses and my observations](#First-look-at-responses-and-my-observations)
		* [Results with strict string matching](#Results-with-strict-string-matching)
		* [Laxer string matching](#Laxer-string-matching)


# Take-home MTurk task

## Instructions from Eric

In [11]:
import pandas as pd
from scipy.stats.mstats import mode

from IPython.display import Image
from IPython.core.display import HTML 
from IPython.core.display import display

## Creating batch process from image links in the json data file

### Setting data paths

In [2]:
data_dir = './data/'
link_data = 'data.json'
results_csv = 'Batch_123190_batch_results.csv'

In [3]:
image_link_df = pd.read_json(data_dir+link_data)
image_link_df.head(2)

Unnamed: 0,image_name,text_images
0,0.png,[https://s3-us-west-2.amazonaws.com/ai2-vision...
1,1.png,[https://s3-us-west-2.amazonaws.com/ai2-vision...


### Exploring the linked images

The images linked in the JSON file seem to be formatted in the following way:
original image -> list of links to the crops of the individual words in the original image

### Writing formatted csv for MTurk batch submission

The template I'm using for my HITs expects the batch file to be a csv formatted as:

header 

url_1

url_2

...

<br>

The links in the text_images column appear as a list of links derived from a single image:

header

[url_1, url_2, ...]

[url_6, url_7, ...]

<br>

I need to flatten the values in this column before writing them to a csv.

In [5]:
def write_url_csv(list_of_url_lists, filename='word_image_links.csv'):
    """
    Writes a csv formatted for a text transcription MTurk batch submission

    :param list_of_url_lists: List of url lists. In this instance, the text_images column from the link dataframe
    :param filename: output file name
    :return: None
    """
    # flatten the list of lists (use the function argument, not a global variable)
    url_list = [url for sublist in list_of_url_lists for url in sublist]
    # I now have a regular python list. I convert it to a pandas series for convenience when writing the csv
    image_url_series = pd.Series(url_list, name='image_url')  # series name here is set to match the MTurk template
    image_url_series.to_csv(filename, header=True, index=False)
    return None

In [6]:
list_of_lists = image_link_df['text_images']
write_url_csv(list_of_lists)
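
Flattening discards which source image each crop came from. If that link is needed later, one option is to keep a separate lookup. The sketch below is only an illustration: `crop_to_source` is a hypothetical helper and the urls are placeholders, not values from data.json.

```python
def crop_to_source(records):
    """
    Builds a lookup from each crop url to the name of the image it was cut from.

    :param records: iterable of (image_name, list_of_crop_urls) pairs
    :return: dict mapping crop url -> image_name
    """
    return {url: image_name for image_name, urls in records for url in urls}


# Placeholder records standing in for rows of the link dataframe.
example_records = [
    ('0.png', ['https://example.com/0.png-0.png', 'https://example.com/0.png-1.png']),
    ('1.png', ['https://example.com/1.png-0.png']),
]
lookup = crop_to_source(example_records)
print(lookup['https://example.com/1.png-0.png'])
```

With the dataframe above, the records could presumably be built with `zip(image_link_df['image_name'], image_link_df['text_images'])`.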

### Experimenting with Boto

Jumping to the GUI to submit the HITs seemed less than optimal, so I experimented with boto, a package that provides a Python interface to AWS/MTurk.

I've tested the function below, and it does create a HIT with the properties I specified.
The difficulties with this approach are:

1) creating these as a batch

2) generating the template for the HIT layout

3) boto is not as well documented as I'd like



<br>
I've abandoned this for now and will proceed with the web interface.

In [1]:
import boto

In [24]:
import boto.mturk.connection as tc
import boto.mturk.question as tq
from keysTkingdom import ai2_aws
from keysTkingdom import aws_tokes
from keysTkingdom import mturk_ai2

In [26]:
# Credentials should not be printed in a shared notebook; the output below has been redacted.
print(mturk_ai2.access_key,mturk_ai2.access_secret_key)

[access key and secret key redacted]


In [27]:
sandbox_host = 'mechanicalturk.sandbox.amazonaws.com' 
mturk = tc.MTurkConnection(
    aws_access_key_id = mturk_ai2.access_key,
    aws_secret_access_key = mturk_ai2.access_secret_key,
    host = sandbox_host,
    debug = 1 # debug = 2 prints out all requests.
)

In [28]:
mturk.get_account_balance()

[$10,000.00]

In [20]:
ex_url = 'https://s3-us-west-2.amazonaws.com/ai2-vision-turk-data/textbook-annotation-test/build/index.html?url=Daily_Science_Grade_5_Evan_Moor_149.jpeg&id=149'

In [21]:
baseurl = 'https://s3-us-west-2.amazonaws.com/ai2-vision-turk-data/textbook-annotation-test/build/index.html'
page_n = 149
current_url = baseurl + '?url=Daily_Science_Grade_5_Evan_Moor_{}.jpeg&id={}'.format(page_n, page_n)

In [22]:
current_url

'https://s3-us-west-2.amazonaws.com/ai2-vision-turk-data/textbook-annotation-test/build/index.html?url=Daily_Science_Grade_5_Evan_Moor_149.jpeg&id=149'

In [30]:
def create_hit(url):
    """
    Creates a single HIT from a provided url and returns the result reported by boto
    """
    title = "Annotate Science Textbook"
    description = "Choose which category a text entry best belongs to"
    keywords = ['image', 'science']
    frame_height = 1000  # the height of the iframe holding the external hit
    amount = .05  # reward per assignment, in dollars

    questionform = tq.ExternalQuestion(url, frame_height)

    create_hit_result = mturk.create_hit(
        title = title,
        description = description,
        keywords = keywords,
        question = questionform,
        reward = boto.mturk.price.Price(amount=amount),
        max_assignments=3
    )
    return create_hit_result

In [31]:
hit_result = create_hit(current_url)
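
The first difficulty listed above (creating HITs as a batch) could plausibly be handled by generating one url per page and looping over them. The sketch below only builds the urls and hands each one to a callable; `build_page_urls` and `submit_batch` are hypothetical helpers, and the `submit` argument stands in for the single-HIT function defined above. Nothing here talks to MTurk.

```python
def build_page_urls(base_url, page_numbers,
                    name_template='Daily_Science_Grade_5_Evan_Moor_{}.jpeg'):
    """Builds one annotation url per page number, following the pattern used above."""
    return ['{}?url={}&id={}'.format(base_url, name_template.format(n), n)
            for n in page_numbers]


def submit_batch(urls, submit):
    """Calls submit(url) for every url and collects whatever it returns."""
    return [submit(url) for url in urls]


# Placeholder base url and a dummy submit function, for illustration only.
urls = build_page_urls('https://example.com/build/index.html', [149, 150])
results = submit_batch(urls, submit=lambda url: 'submitted ' + url)
print(urls[0])
```

Passing the submit function as an argument keeps the loop testable against the sandbox or a dummy function before any real HITs are created.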

## Processing batch results

In [7]:
batch_results_df = pd.read_csv(data_dir+results_csv)
print(batch_results_df.shape)
batch_results_df.head(2)

(75, 31)


Unnamed: 0,HITId,HITTypeId,Title,Description,Keywords,Reward,CreationTime,MaxAssignments,RequesterAnnotation,AssignmentDurationInSeconds,...,RejectionTime,RequesterFeedback,WorkTimeInSeconds,LifetimeApprovalRate,Last30DaysApprovalRate,Last7DaysApprovalRate,Input.image_url,Answer.NumberOfItems,Approve,Reject
0,3P4ZBJFX2V4QBEY70GV6EHICGAAWFV,31R230RZ7QPLBWK4IGNFGUD2RJ6L8V,Write the words shown in an image,Write the words shown in an image,"image, write, transcription",$0.05,Wed Feb 24 09:02:13 PST 2016,3,BatchId:123190;,86400,...,,,5,0% (0/0),0% (0/0),0% (0/0),https://s3-us-west-2.amazonaws.com/ai2-vision/...,hair,,
1,3P4ZBJFX2V4QBEY70GV6EHICGAAWFV,31R230RZ7QPLBWK4IGNFGUD2RJ6L8V,Write the words shown in an image,Write the words shown in an image,"image, write, transcription",$0.05,Wed Feb 24 09:02:13 PST 2016,3,BatchId:123190;,86400,...,,,7,0% (0/0),0% (0/0),0% (0/0),https://s3-us-west-2.amazonaws.com/ai2-vision/...,hAir,,


In [8]:
batch_results_df.columns

Index(['HITId', 'HITTypeId', 'Title', 'Description', 'Keywords', 'Reward',
       'CreationTime', 'MaxAssignments', 'RequesterAnnotation',
       'AssignmentDurationInSeconds', 'AutoApprovalDelayInSeconds',
       'Expiration', 'NumberOfSimilarHITs', 'LifetimeInSeconds',
       'AssignmentId', 'WorkerId', 'AssignmentStatus', 'AcceptTime',
       'SubmitTime', 'AutoApprovalTime', 'ApprovalTime', 'RejectionTime',
       'RequesterFeedback', 'WorkTimeInSeconds', 'LifetimeApprovalRate',
       'Last30DaysApprovalRate', 'Last7DaysApprovalRate', 'Input.image_url',
       'Answer.NumberOfItems', 'Approve', 'Reject'],
      dtype='object')

The columns of interest here are Input.image_url and Answer.NumberOfItems.

There are 75 rows, one for each MTurker's response. I should have three responses for each unique input image_url.
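
If every image really has three responses, 75 rows would correspond to 25 unique urls. A quick way to check that assumption is to count rows per url and list any url that deviates. This is a minimal sketch on placeholder urls; `urls_with_unexpected_counts` is a hypothetical helper.

```python
from collections import Counter


def urls_with_unexpected_counts(urls, expected=3):
    """Returns {url: count} for every url that does not appear exactly `expected` times."""
    counts = Counter(urls)
    return {url: n for url, n in counts.items() if n != expected}


# Placeholder data: 'a.png' has three responses, 'b.png' only two.
example_urls = ['a.png', 'a.png', 'a.png', 'b.png', 'b.png']
print(urls_with_unexpected_counts(example_urls))
```

On the real data the input would be `batch_results_df['Input.image_url']`; an empty dict would confirm the three-per-image assumption.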

Here are the responses grouped by image.

In [9]:
grouped_results_df = batch_results_df.groupby('Input.image_url')
for image_response in grouped_results_df:
    print(image_response[1]['Answer.NumberOfItems'])

0    hair
1    hAir
2    HAIR
Name: Answer.NumberOfItems, dtype: object
3    NOSE
4    NOSE
5    Nose
Name: Answer.NumberOfItems, dtype: object
6    eYeS
7    EYES
8    EYES
Name: Answer.NumberOfItems, dtype: object
9          EARS
10    EARS EARS
11         ears
Name: Answer.NumberOfItems, dtype: object
12    Mouth
13    MOUTH
14    MOUTH
Name: Answer.NumberOfItems, dtype: object
15    fcea
16    FACE
17    FACE
Name: Answer.NumberOfItems, dtype: object
18    DDD
19      d
20      D
Name: Answer.NumberOfItems, dtype: object
21      F
22      B
23    BEE
Name: Answer.NumberOfItems, dtype: object
24    {}
25     A
26     a
Name: Answer.NumberOfItems, dtype: object
27    C
28    C
29    C
Name: Answer.NumberOfItems, dtype: object
30      Citrulluscolocynths
31    Citrullus colocynthis
32    Citrullus colocynthis
Name: Answer.NumberOfItems, dtype: object
33    Zygophyllum sp.
34      Zygophyllumsp
35    Zyogphyllum sp.
Name: Answer.NumberOfItems, dtype: object
36      Fagonia indica
37   

### First look at responses and my observations

Things to be aware of:

1) inconsistent spelling and capitalization

2) white space

3) additional symbols, e.g. hyphens and parentheses

<br>

The instructions are not explicit on how strict the matching should be. There are a few options, in increasing difficulty/flexibility:

a) exact string matching

b) stripping white space and converting to lower case 

c) fuzzy string matching to deal with spelling errors and extraneous symbols. I could look at character match ratios to allow some flexibility.

<br>

I'll start with approach a), and then compare to approach b). Option c) is probably too open-ended for this exercise.

In the future, I should be more explicit about capitalization in the HIT instructions.
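
For reference, option c) could look roughly like the sketch below, which uses the character match ratio from the standard library's `difflib`. It is not used anywhere in this notebook. The function names and the 0.85 threshold are assumptions chosen for illustration, and the threshold would need tuning on real responses.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Strips surrounding white space and lower-cases the response."""
    return text.strip().lower()


def similarity(a, b):
    """Character match ratio between two normalized responses, from 0 to 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_consensus(responses, threshold=0.85):
    """
    Returns the (normalized) response that is similar to the most other responses,
    or None when no two responses reach the threshold.
    """
    best, best_support = None, 0
    for i, candidate in enumerate(responses):
        support = sum(
            1 for j, other in enumerate(responses)
            if i != j and similarity(candidate, other) >= threshold
        )
        if support > best_support:
            best, best_support = candidate, support
    return normalize(best) if best is not None else None


print(fuzzy_consensus(['Zygophyllum sp.', 'Zygophyllumsp', 'Zyogphyllum sp.']))
print(fuzzy_consensus(['F', 'B', 'BEE']))
```

Ties are resolved in favour of the earliest response, which is an arbitrary choice; a real implementation might prefer the most frequent exact spelling among the similar responses.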

In [20]:
def most_common_strict(image_response):
    """
    returns the consensus response of the three raw response strings for a given image
    """
    most_common = image_response[1]['Answer.NumberOfItems'].mode()
    if most_common.empty:
        most_common = pd.Series(['NO AGREEMENT'])
    return most_common
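
One caveat: the empty check relies on `mode()` returning nothing when all three responses differ. I believe newer pandas versions return every value in that case instead, so the check would never fire there. A version-independent alternative is to count votes explicitly. The sketch below uses only the standard library; `majority_response` and its `minimum_votes` parameter are hypothetical names.

```python
from collections import Counter


def majority_response(responses, minimum_votes=2, fallback='NO AGREEMENT'):
    """
    Returns the most frequent raw response if at least `minimum_votes`
    workers gave it, otherwise the fallback string.
    """
    responses = list(responses)
    if not responses:
        return fallback
    value, votes = Counter(responses).most_common(1)[0]
    return value if votes >= minimum_votes else fallback


print(majority_response(['NOSE', 'NOSE', 'Nose']))
print(majority_response(['hair', 'hAir', 'HAIR']))
```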

### Results with strict string matching

In [15]:
def find_transcriptions_matches(batch_results_df, response_matcher):
    """
    returns a pandas series with the consensus response for each image
    """
    agreed_responses = pd.Series(dtype=object)
    for image_response in batch_results_df.groupby('Input.image_url'):
        most_common = response_matcher(image_response)
        agreed_responses = pd.concat([agreed_responses, most_common])
    # Each matcher result carries its own index, so reset_index is needed to get a clean running index
    return agreed_responses.reset_index(name = 'result_string')['result_string'] 

In [16]:
transcription_results = find_transcriptions_matches(batch_results_df, most_common_strict)

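I have the responses in a pandas series here. I now need to pair them to the original image urls. Note that groupby sorts its keys while pd.unique keeps the order of first appearance, so the pairing below relies on the rows already being ordered by url.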

In [17]:
image_urls = pd.Series(pd.unique(batch_results_df['Input.image_url']))

transcription_results.index = image_urls
transcription_results.head(5)

https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-0.png    NO AGREEMENT
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-1.png            NOSE
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-2.png            EYES
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-3.png    NO AGREEMENT
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-4.png           MOUTH
Name: result_string, dtype: object

In [18]:
transcription_results.to_json('./transcription_resonses_strict.json')

I've written the transcription results to a JSON file here. A quick note: pandas writes the file with the '/' characters in the links escaped as '\/'. This is allowed in JSON, but not required. I've decided to leave them in.
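
To confirm that the escaping is harmless, the standard library's JSON parser can be used to check that an escaped and an unescaped url decode to the same string. The url below is a placeholder, not one from the batch.

```python
import json

# Placeholder url; the first literal contains \/ sequences, the second plain slashes.
escaped_text = r'"https:\/\/example.com\/0.png-0.png"'
plain_text = '"https://example.com/0.png-0.png"'

decoded_escaped = json.loads(escaped_text)
decoded_plain = json.loads(plain_text)

# Both forms decode to the same Python string.
print(decoded_escaped == decoded_plain)
```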

### Laxer string matching

I'll now, at the very least, deal with capitalization and white space.

In [21]:
def most_common_lax(image_response):
    """
    returns the consensus response after stripping white space and converting the responses to lower case
    """
    simple_sanitizer = lambda x : x.strip().lower()
    most_common = image_response[1]['Answer.NumberOfItems'].apply(simple_sanitizer).mode()
    if most_common.empty:
        most_common = pd.Series(['NO AGREEMENT'])
    return most_common

In [22]:
transcription_results_lax = find_transcriptions_matches(batch_results_df, most_common_lax)
transcription_results_lax.index = image_urls
transcription_results_lax.head(5)

https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-0.png     hair
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-1.png     nose
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-2.png     eyes
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-3.png     ears
https://s3-us-west-2.amazonaws.com/ai2-vision/example-turk/0.png-4.png    mouth
Name: result_string, dtype: object

In [23]:
print(sum(transcription_results_lax != 'NO AGREEMENT'))
print(sum(transcription_results != 'NO AGREEMENT'))

18
11


This approach captures 7 more consensus responses. Unless capital letter detection is important, I'm inclined to use these results.
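
To see which images account for the difference, the two result sets can be compared url by url. The sketch below works on plain dicts with placeholder urls; `newly_agreed` is a hypothetical helper, and the two series above could be passed in after calling `.to_dict()` on each.

```python
def newly_agreed(strict, lax, sentinel='NO AGREEMENT'):
    """Returns the urls with no agreement under strict matching but agreement under lax matching."""
    return [url for url in lax
            if strict.get(url) == sentinel and lax[url] != sentinel]


# Placeholder results for three images.
strict_example = {'0.png': 'NO AGREEMENT', '1.png': 'NOSE', '2.png': 'NO AGREEMENT'}
lax_example = {'0.png': 'hair', '1.png': 'nose', '2.png': 'NO AGREEMENT'}
print(newly_agreed(strict_example, lax_example))
```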

In [24]:
transcription_results_lax.to_json('./transcription_resonses_lax.json')