# Identify worker contribution for Cognito-backed private work teams in Amazon SageMaker Ground Truth

## Overview

SageMaker Ground Truth allows you to integrate domain experts into your labeling process via a private workforce. Often there is a need to reward or compensate the effort invested by the labelers.

This notebook walks you through sample code that lists how much each member of your private workforce contributed to a labeling job, along with their contact details.
This allows you to easily distribute rewards or compensate workers based on their contribution to the labeling effort!

Here is an example of a CSV file generated by this notebook:

![worker_contribution.png](worker_contribution.png)

This example notebook requires a private workforce that uses an Amazon Cognito user pool as its identity provider for managing worker credentials and authentication. If you are using a custom identity provider, you may have to adapt the last part of the notebook to retrieve user information. Have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private.html) to learn more about private workforce authentication options.

This notebook was tested with the Python 3 Data Science Kernel.

## Define the name of your labeling job

Define the names of the labeling jobs you want to analyze here. All labeling jobs should use the same private workforce. If this is your first time creating a labeling job, you can find a good example of how to create one using a private workforce [here](https://github.com/aws/amazon-sagemaker-examples/blob/master/ground_truth_labeling_jobs/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification/from_unlabeled_data_to_deployed_machine_learning_model_ground_truth_demo_image_classification.ipynb).

In [None]:
JOB_LIST = [ '<your_labeling_job_name_1>', '<your_labeling_job_name_2>']

Importing required libraries and general setup

In [None]:
import boto3
import pprint
import json
from urllib.parse import urlparse
import os
import glob
import pandas as pd
from pathlib import Path

sm_client = boto3.client('sagemaker')
s3_client  = boto3.client('s3')
cognito_client = boto3.client('cognito-idp')


## Download worker responses

Ground Truth saves all labeling activities in the output location specified as part of the labeling job. One of these outputs is the set of individual worker responses, which can be found at `<s3_output_path>/<labeling_job_name>/annotations/worker-response`.

The first step is to download all worker responses from the output location, so you can start analyzing them. If you want to learn more about the output format of labeling jobs, have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html).
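
As a rough illustration of how the worker-response location is composed, the sketch below joins a job's output path, the job name, and the `annotations/worker-response/` prefix, then splits the resulting S3 URI into bucket and key prefix with `urlparse`. The bucket name, the job name, and the helper `worker_response_location` are made up for this example.

```python
from urllib.parse import urlparse


def worker_response_location(s3_output_path, job_name):
    """Return (bucket, key_prefix) for a job's worker responses.

    Hypothetical helper: assumes responses live under
    <s3_output_path>/<job_name>/annotations/worker-response/.
    """
    uri = f"{s3_output_path.rstrip('/')}/{job_name}/annotations/worker-response/"
    parsed = urlparse(uri)
    # netloc is the bucket name; path (minus the leading slash) is the key prefix
    return parsed.netloc, parsed.path.lstrip("/")


# placeholder values, not a real bucket or job
bucket, key_prefix = worker_response_location("s3://example-bucket/output/", "my-job")
print(bucket)
print(key_prefix)
```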

In [None]:
prefix_worker_response = "annotations/worker-response/"
work_dir = "gt-assets"

In [None]:
# utility function to download an entire folder in S3
def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when no objects match the prefix
        contents = results.get('Contents', [])
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    # recreate "folder" placeholder keys locally
    for d in dirs:
        os.makedirs(os.path.dirname(os.path.join(local, d)), exist_ok=True)
    # download each object, creating parent folders as needed
    for k in keys:
        dest_pathname = os.path.join(local, k)
        os.makedirs(os.path.dirname(dest_pathname), exist_ok=True)
        client.download_file(bucket, k, dest_pathname)

In [None]:
def download_worker_responses_for_job(labelingJobName=None,directory ="."):
    response = sm_client.describe_labeling_job(LabelingJobName=labelingJobName)
    output_uri = response['OutputConfig']['S3OutputPath'].strip("/")
    worker_response_path = f'{output_uri}/{labelingJobName}/{prefix_worker_response}'
    url = urlparse(worker_response_path)
    print(f'Downloading worker responses for labeling job {labelingJobName} from {worker_response_path}')
    download_dir(url.path.lstrip('/'),directory,url.netloc)
    return

# for each job download worker responses to a local folder
for job in JOB_LIST:
    download_worker_responses_for_job(job,work_dir)



## Load worker responses into a Pandas Dataframe

To make the data easy to analyze, you will parse the individual worker response files and store them in a Pandas DataFrame for further processing. Worker responses are stored in JSON format, and each file contains an `answers` list. Here is an example of a single answer from an image classification job using a private workforce backed by an Amazon Cognito user pool.

```json
{
      "acceptanceTime": "2021-03-12T10:11:27.475Z",
      "answerContent": {
        "crowd-image-classifier": {
          "label": "Musical Instrument"
        }
      },
      "submissionTime": "2021-03-12T10:11:29.490Z",
      "timeSpentInSeconds": 2.015,
      "workerId": "private.eu-west-1.88f516d0d40e6db7",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_joL9srkLl",
          "sub": "5b7d7d09-7257-4087-8c15-xxxxxx2ce934d"
        }
      }
    }
```
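
To make the structure concrete, the following minimal sketch parses a trimmed copy of the sample answer above and pulls out the fields used later in this notebook. It assumes the Cognito user pool ID is the last path segment of the `issuer` URL, as in the sample.

```python
import json

# trimmed copy of the sample answer shown above
answer_json = """
{
  "submissionTime": "2021-03-12T10:11:29.490Z",
  "workerId": "private.eu-west-1.88f516d0d40e6db7",
  "workerMetadata": {
    "identityData": {
      "identityProviderType": "Cognito",
      "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_joL9srkLl",
      "sub": "5b7d7d09-7257-4087-8c15-xxxxxx2ce934d"
    }
  }
}
"""

answer = json.loads(answer_json)
identity = answer["workerMetadata"]["identityData"]

# the last path segment of the issuer URL is the Cognito user pool ID
user_pool_id = identity["issuer"].split("/")[-1]
print(identity["sub"], user_pool_id)
```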

In [None]:
def load_worker_responses_into_df():
    labels = []

    # each worker-response file contains an 'answers' list
    for file in glob.glob(f'{work_dir}/**/*.json', recursive=True):
        with open(file) as json_file:
            data = json.load(json_file)

        for answer in data['answers']:
            identity = answer['workerMetadata']['identityData']
            labels.append([
                answer['submissionTime'],
                answer['workerId'],
                identity['identityProviderType'],
                identity['sub'],
                # the user pool ID is the last path segment of the issuer URL
                identity['issuer'].split("/")[-1],
            ])

    return pd.DataFrame(labels, columns=['submissionTime', 'workerId', 'identityProviderType', 'sub', 'userPoolId'])
    
df_worker_results = load_worker_responses_into_df()
df_worker_results

## Count label events by worker

You can use the [value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html) function in Pandas to create a count of labels per user.

In [None]:
# Get labeled items by worker sub
aggregated = df_worker_results['sub'].value_counts().rename_axis('user_sub').reset_index(name='label_count')
aggregated
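
To see what this chain of calls produces, here is a small sketch on a toy DataFrame. The `sub` values are invented placeholders, not real worker identifiers.

```python
import pandas as pd

# made-up subs standing in for real worker identifiers
df = pd.DataFrame({"sub": ["aaa", "bbb", "aaa", "aaa", "bbb", "ccc"]})

counts = (
    df["sub"]
    .value_counts()            # Series: index = sub, values = number of rows
    .rename_axis("user_sub")   # name the index so it becomes a column
    .reset_index(name="label_count")
)
print(counts)
```

The result has one row per worker, sorted by descending count, with the columns `user_sub` and `label_count`.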

## Match sub to username for each worker

The final step is to match each user sub to a human-readable username. A user's sub is a unique identifier stored as part of the user's attributes. You can use the [list_users](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cognito-idp.html#CognitoIdentityProvider.Client.list_users) function to get the username for a particular sub. Note that by default the username matches the user's email address.

In [None]:
user_pool = df_worker_results['userPoolId'][0]

def get_user_for_sub(row):
    response = cognito_client.list_users(UserPoolId=user_pool, Filter=f'sub="{row["user_sub"]}"')
    if len(response['Users']) > 0:
        return response['Users'][0]['Username']
    else:
        return None
aggregated['username'] = aggregated.apply(get_user_for_sub, axis=1)
aggregated = aggregated[['username', 'user_sub','label_count']]
aggregated

## Save to CSV

Optionally, you can now save the output to a CSV file for further processing.

In [None]:
aggregated.to_csv('label_results.csv')