# Product Image Classifier

This document details the entire process of creating the labeled dataset, defining and training the model, and deploying the results to a SageMaker endpoint.

### Quick Warning

Most of these collation/creation notebooks are _destructive_, meaning they will overwrite existing content. You can generate a completely separate working dataset nondestructively by changing the `DATASET_NAME` property in `config/test_config.py`, or by copying that config, changing the same property in the copy, and updating the notebooks you're running to point to the new config explicitly. This is obviously not ideal, but it's the case for now.

# Table of contents
1. [Collecting the product data](#1)
    1. [Crawl solution: Query Redshift](#1.1)
    1. [Crawl solution: Collating product data by hitting the product service directly](#1.2) — [[notebook](notebooks/collate-products.ipynb)]
2. [Preparing to run a Ground Truth Labeling Job](#2)
    1. [Creating the manifest file](#2.1) — [[notebook](notebooks/create-ground-truth-manifest.ipynb)]
    1. [Uploading the product images to S3](#2.2) — [[notebook](notebooks/write-product-images-to-s3.ipynb)]
3. [Creating the Ground Truth labeling job](#3)
    1. [Category Taxonomy](#3.1) — [[notebook](notebooks/generate-taxonomy.ipynb)]
    1. [Labeling Tool Template](#3.2)
        1. [Rendering values into HTML](#3.2.1)
        1. [Developing the template](#3.2.2) — [[notebook](notebooks/create-labeling-tool.ipynb)]
    1. [Pre-annotation Lambda](#3.3)
    1. [Post-annotation Lambda](#3.4)
    1. [Creating the Labeling Job](#3.5)
4. [Evaluating Ground Truth Performance](#4)
    1. [Performance metrics](#4.1)
    1. [Running the same test at 3 different prices per task](#4.2) — [[notebook](notebooks/evaluating-labeling-performance.ipynb)]
    1. [Reduced taxonomy and uncropped product photos](#4.3)
    1. [Sanity check on new changes](#4.4)
    1. [What to expect](#4.5)

<a name="1"></a>
## 1. Collecting the product data

Our dataset will include the product image from the brand/retailer's site, the title, description, and price. We want to create a dataset that's reflective of actual content our influencers post, so we start by pulling products that have been posted in an LTK. Because there is a spamming problem (influencers will post many, often irrelevant, products), we choose to only work with the first product in each LTK as a proxy for "actually relevant".

<a name="1.1"></a>
### 1.1 Crawl solution: Query Redshift

We can pull this data by hitting Redshift directly:
``` mysql
select l.id, p.product_id, p.image_url, l.profile_id, l.date_published
from ltk_ltks l
inner join ltk_ltk_products p
	on l.id = p.ltk_id
where l.status = 2
and extract(year from l.date_published)=2020
and p.position=0
order by l.id
```

I currently just run this on Periscope in [this dashboard](https://app.periscopedata.com/app/rewardstyle/737156/ltk-product-pulls). Eventually the plan is to pull this—and all—data from the data lake, but for now this gets manually stored on S3 [here](s3://data-science-product-image/collation/products.csv).

<a name="1.2"></a>
### 1.2 Crawl solution: Collating product data by hitting the product service directly

Next, we take our list of IDs from S3 and pull the associated product data from the product service. I have to run this locally because the VPC that the product service is on is not accessible from this role. The collation and writing back to S3 happens in [this notebook](./notebooks/collate-products.ipynb).

<a name="2"></a>
## 2. Preparing to run a Ground Truth labeling job

Next, we have to take the product data we've collated and create everything necessary for Ground Truth to run a labeling job, so basically just an unlabeled dataset.

<a name="2.1"></a>
### 2.1 Creating the manifest file
Ground Truth uses a JSON lines "manifest" to serve as an index of every product we want labeled. So all we do is pull down our collated data and generate the file in the expected format with only the fields we're interested in using for labeling. This process occurs in [this notebook](./notebooks/create-ground-truth-manifest.ipynb).
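
As a rough illustration of the format, the sketch below writes one JSON object per line. The `source-ref`, `title`, and `description` keys are the ones the pre-annotation lambda in section 3.3 reads from each row. The function name, the input key `image_s3_uri`, and the sample values are hypothetical and not taken from the notebook.

```python
import json

def build_manifest_lines(products):
    """Return one JSON-lines manifest row per product.

    Assumes each product is a dict with the hypothetical keys
    'image_s3_uri', 'title', and 'description'.
    """
    lines = []
    for product in products:
        row = {
            "source-ref": product["image_s3_uri"],  # S3 URI of the product image
            "title": product["title"],
            "description": product["description"],
        }
        lines.append(json.dumps(row))
    return lines

# Hypothetical sample data, for illustration only.
sample_products = [
    {"image_s3_uri": "s3://example-bucket/images/1.jpg", "title": "Denim jacket", "description": "Classic fit"},
    {"image_s3_uri": "s3://example-bucket/images/2.jpg", "title": "Running shoes", "description": "Lightweight"},
]

# A manifest is just these rows joined by newlines.
manifest_text = "\n".join(build_manifest_lines(sample_products)) + "\n"
print(manifest_text)
```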

<a name="2.2"></a>
### 2.2 Uploading the product images to S3
The last step in our labeling prep work is to store all our product images on S3. This process happens in [this notebook](./notebooks/write-product-images-to-s3.ipynb).

<a name="3"></a>
## 3. Creating the Ground Truth labeling job
Now that we have our manifest file and all the images on S3, we can create the Ground Truth components used in our labeling job.

<a name="3.1"></a>
### 3.1 Category Taxonomy
I pulled the list of categories from the [product team's category work](https://docs.google.com/spreadsheets/d/1WtKqCdNpncA9744qQSiDZs7j6GEaPmA2xH4qPONMRPQ/edit#gid=896054480), and generated a JSON representation for easy use in the labeling tool, because we need to allow labelers to drill down into categories and back. The taxonomy can be found [here](./files/taxonomy.json). The format is as follows:

``` json
[
    { "id": <id1>, "name": <name1>, "subcategories": [<subcat1>, <subcat2>...], "parent": <parent_id> },
    { "id": <id2>, "name": <name2>, "subcategories": [<subcat1>, <subcat2>...], "parent": <parent_id> },
    ...
]
```

If there is no parent (only the root category "Main"), the field is not included on the JSON object, and if there are no subcategories, it is defined as an empty list. The taxonomy generation occurs in [this notebook](./notebooks/generate-taxonomy.ipynb).
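
To show how this format supports drilling down and back up, here is a minimal sketch that walks from a category to the root. It assumes `subcategories` holds child ids, which the format above does not specify; the helper names, ids, and category names are made up for illustration.

```python
def index_by_id(taxonomy):
    """Map each category id to its taxonomy entry."""
    return {entry["id"]: entry for entry in taxonomy}

def path_to_root(category_id, by_id):
    """Return category names from the root down to the given category."""
    names = []
    current = by_id[category_id]
    while True:
        names.append(current["name"])
        if "parent" not in current:  # the root entry omits "parent"
            break
        current = by_id[current["parent"]]
    return list(reversed(names))

# Hypothetical three-level taxonomy (assumption: subcategories are child ids).
taxonomy = [
    {"id": 0, "name": "Main", "subcategories": [1]},
    {"id": 1, "name": "Womens", "subcategories": [2], "parent": 0},
    {"id": 2, "name": "Skirts", "subcategories": [], "parent": 1},
]

by_id = index_by_id(taxonomy)
print(path_to_root(2, by_id))
```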

<a name="3.2"></a>
### 3.2 Labeling Tool Template
Because our labeling is more complex than just image -> label, we need to make a custom template for the Ground Truth labeling job. On the backend, Ground Truth uses a templating system to render values into placeholders in your template, so first we have to decide what data we want to display.

<a name="3.2.1"></a>
#### 3.2.1 Rendering values into HTML
We need to display the product image, title and description, and we also need to display the category selection tree. The variables come attached to the `task.input` object. For example, here are the image, title and description placeholders for a product in the labeling tool template:

``` html
  <div class="left">
    <h3>Product Details:</h3>
    <crowd-card>
      <div class="card">
        <h4 style="padding-bottom: 5px;">Image</h4>
        <img src="{{ task.input.image | grant_read_access }}" />

        <h4>Title</h4>
        <p>{{ task.input.title }}</p>

        <h4>Description</h4>
        <p>{{ task.input.description }}</p>
      </div>
    </crowd-card>
  </div>
```

The "piping" to `grant_read_access` of the image is a method of converting the S3 path to a short-lived HTTPS image URL that the worker's browser can load. More information on customizing Ground Truth templates can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step2.html).

<a name="3.2.2"></a>
#### 3.2.2 Developing the template
I used [this notebook](./notebooks/create-labeling-tool.ipynb) to work on creating the template. It just mocks the template stuff so you don't have to try to set that up locally and allows you to iterate on the design without leaving SageMaker. The template itself is [here](./files/product-classifier.liquid.html).
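
For a sense of what mocking the templating can look like, here is a minimal sketch that substitutes `{{ task.input.<name> }}` placeholders with sample values. It is not the notebook's actual approach: `mock_render` and the sample values are hypothetical, filters such as `grant_read_access` are simply ignored, and no HTML escaping is done.

```python
import re

# Matches {{ task.input.name }} with an optional "| filter" suffix.
PLACEHOLDER = re.compile(r"\{\{\s*task\.input\.(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}")

def mock_render(template, task_input):
    """Substitute task.input placeholders with values from task_input.

    A stand-in for local iteration only: filters are ignored, so image
    values must already be URLs a browser can load.
    """
    def substitute(match):
        name = match.group(1)
        return str(task_input[name])
    return PLACEHOLDER.sub(substitute, template)

template = '<img src="{{ task.input.image | grant_read_access }}" /><p>{{ task.input.title }}</p>'
html = mock_render(template, {"image": "https://example.com/1.jpg", "title": "Denim jacket"})
print(html)
```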

<a name="3.3"></a>
### 3.3 Pre-annotation Lambda
Another requirement of the Ground Truth "custom workflow" is that a pre-annotation lambda processes each manifest row before it is passed to the templating engine. In our case, I just added all the values we want to render in our labeling tool. The pre-annotation lambda is named `sagemaker-ground-truth-product-labeling-pre-annotation`. The code is very simple:

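``` python
import json

def lambda_handler(event, context):
    # Ground Truth passes one manifest row per invocation under 'dataObject'.
    entry = event['dataObject']

    # Collect every value the labeling template renders via task.input.
    manifest_row = {}
    manifest_row['title'] = entry['title']
    manifest_row['description'] = entry['description']
    manifest_row['image'] = entry['source-ref']
    manifest_row['category_tree'] = <taxonomy JSON>  # placeholder for the taxonomy from section 3.1

    # The returned 'taskInput' is what the template sees as task.input.
    return {
        "taskInput": manifest_row
    }
```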

<a name="3.4"></a>
### 3.4 Post-annotation Lambda
Similarly, we have to handle "consolidating" the labels. When we create a job, we select the number of unique workers that label a given product. When those results are passed to our post-annotation lambda, we have to decide what to do with labels that are not unanimous. If you don't use a custom workflow, AWS offers built-in methods for selecting the winning label and calculating confidence. I decided to skirt this problem for now and provide full information in the output manifest by calculating confidence as `max_label_count/total_annotations_per_product` and including all the individual workers' responses. We can therefore change how we choose a winning label (or discard it altogether) in the future.

The lambda is called `sagemaker-ground-truth-product-labeling-post-annotation`, and here is the code:

``` python
import json
import boto3
from urllib.parse import urlparse

def consolidate(labelAttributeName, dataset, threshold=0.5):
    label_counts = {}
    workers = []
    max_count = 0
    max_label = None
    for annotation_json in dataset['annotations']:
        annotation = json.loads(annotation_json['annotationData']['content'])
        label = annotation["product-category"]
        worker = {
                'worker_id': annotation_json['workerId'],
                'label': label
            }
        workers.append(worker)
        if label not in label_counts:
            label_counts[label] = 1
        else:
            label_counts[label] += 1
        if label_counts[label] > max_count:
            max_count = label_counts[label]
            max_label = label
    confidence = max_count / float(len(dataset['annotations']))
    return {
            'datasetObjectId': dataset['datasetObjectId'],
            'consolidatedAnnotation' : {
                'content': {
                    labelAttributeName: {
                        'workers': workers,
                        'confidence': confidence,
                        'result': {'product-category': max_label},
                        'labeledContent': dataset['dataObject']
                        }
                    }
                }
            }

def lambda_handler(event, context):
    consolidated_labels = []

    parsed_url = urlparse(event['payload']['s3Uri'])
    s3 = boto3.client('s3')
    textFile = s3.get_object(Bucket = parsed_url.netloc, Key = parsed_url.path[1:])
    filecont = textFile['Body'].read()
    annotations = json.loads(filecont)
    
    for dataset in annotations:
        annotation = consolidate(event['labelAttributeName'], dataset)
        if annotation:
            consolidated_labels.append(annotation)

    return consolidated_labels
```
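
The confidence calculation can be checked locally without any AWS setup. The sketch below is a compact restatement of the vote counting, not the deployed lambda: `vote_confidence` is a hypothetical helper, the labels are made up, and ties go to the label seen first, which may differ from the loop above.

```python
from collections import Counter

def vote_confidence(labels):
    """Return (winning_label, confidence) for one product's worker labels.

    Confidence is max_label_count / total_annotations_per_product.
    Ties go to the label seen first, which may differ from the lambda above.
    """
    counts = Counter(labels)
    winner, top_count = counts.most_common(1)[0]
    return winner, top_count / len(labels)

# Hypothetical labels from three workers for one product.
winner, confidence = vote_confidence(["Skirts", "Skirts", "Dresses"])
print(winner, round(confidence, 3))
```

With three workers, confidence can only be 1/3, 2/3, or 1, so any later consensus rule reduces to picking which of those values counts as agreement.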

<a name="3.5"></a>
### 3.5 Creating the Labeling Job
Finally, all of the components are created and we can actually kick off the labeling job.

First, create a new labeling job in the SageMaker console: choose a name for the job, select manual setup and point to the manifest file for input, choose the folder containing the input manifest as the output location, and choose the SageMaker execution role.

<img src="files/specify-job-details-01.png" width="800">

Secondly, choose Custom task type:

<img src="files/specify-job-details-02.png" width="800">

Next, select Mechanical Turk and the rest of the parameters you want for the job:

<img src="files/specify-job-details-03.png" width="800">

Lastly, select Custom template, paste in your custom template, and select the pre and post annotation lambdas:

<img src="files/specify-job-details-04.png" width="800">

<a name="4"></a>
## 4. Evaluating Ground Truth performance
After we've created a job, we can look at the output. Unfortunately, I cannot find any information on how exactly to format the output data from the annotation consolidation, so getting things to look nice in Ground Truth's already terrible evaluation tools is difficult/impossible. But the label data is in the output manifest, so we can look at it there.

One thing worth noting is that the higher the price, the faster the labeling gets done.

<a name="4.1"></a>
### 4.1 Performance metrics
There are 2 things we'll look for in evaluating the performance of the labeling:

1. How accurately do the labels match a personally labeled version of the dataset?
2. What percentage of labeled rows are usable?

We'll use some function of these 2 values, alongside cost, to determine the best parameters for labeling.
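
A minimal sketch of how these two values could be computed is shown below. It assumes a row is "usable" when its confidence exceeds a threshold; that rule, the function name, and the sample data are assumptions for illustration and are not taken from the evaluation notebook.

```python
def labeling_metrics(gold, consolidated, threshold=0.5):
    """Compute the usable fraction and the accuracy on usable rows.

    gold: dict of product id -> gold-standard label.
    consolidated: dict of product id -> (winning_label, confidence).
    A row counts as usable when confidence exceeds the threshold (an assumption).
    """
    usable = {pid: label for pid, (label, conf) in consolidated.items() if conf > threshold}
    correct = sum(1 for pid, label in usable.items() if gold[pid] == label)
    usable_fraction = len(usable) / len(consolidated)
    accuracy = correct / len(usable) if usable else 0.0
    return usable_fraction, accuracy

# Hypothetical data for four products.
gold = {1: "Skirts", 2: "Dresses", 3: "Shoes", 4: "Bags"}
consolidated = {
    1: ("Skirts", 1.0),
    2: ("Skirts", 2 / 3),
    3: ("Shoes", 2 / 3),
    4: ("Bags", 1 / 3),
}
print(labeling_metrics(gold, consolidated))
```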

<a name="4.2"></a>
### 4.2 Running the same test at 3 different prices per task
To take a first stab at this, I ran a test at 3 different prices, \\$0.036, \\$0.048 and \\$0.060 per task. The dataset size was 100, which I manually labeled as the gold standard. I then compared each of the jobs to the gold standard.

Unfortunately this just isn't a lot of data, and the experiments aren't cheap. While results confirm what we expect (that we get better performance at the higher rates), there are just so many factors at play that it wouldn't be right to draw any strong conclusions.

The results for each job (100 items each) were:

| Price per task | Items with consensus | Items matching the gold standard | Items with consensus matching the gold standard |
| --- | --- | --- | --- |
| \\$0.036 | 26 | 42 | 22 / 26 |
| \\$0.048 | 47 | 58 | 41 / 47 |
| \\$0.060 | 35 | 60 | 31 / 35 |

The results were computed in [this notebook](notebooks/evaluating-labeling-performance.ipynb).

~I think the smart course of action would be to drastically pare down the category taxonomy, find/nix similar categories etc. We're also using the cropped product images, which might make the task unnecessarily difficult. These will be what I address first when I get back from vacation.~ [Updated](#4.3)

That said, both of the higher prices reach close to 90\% accuracy on items with consensus (41 / 47 and 31 / 35), which is definitely serviceable.

<a name="4.3"></a>
### 4.3 Reduced taxonomy and uncropped product photos
I cut down on the total categories in a way that removed confusing branching, such as `Womens` -> `Clothing` -> `Activewear` -> `Skirts` and `Womens` -> `Clothing` -> `Skirts` by adding `Activewear` as a child to `Skirts` (and all other previous activewear subcategories). This was a purely manual process.

I also omitted cropping from the image service request when asking for the product image. This will hopefully avoid some poorly cropped cases.

<a name="4.4"></a>
### 4.4 Sanity check on new changes
After pruning the taxonomy and switching to uncropped product images, I ran 2 sanity checks, both at \\$0.048 per task, and the results were better. One had consensus on 59 of the 100 products, the other had consensus on 75. Of the 59 with consensus in the first test, 56 had the same label as the second test, compared to 68 out of 100 for the entire dataset. 

Of the 75 in the second dataset, 72 match the gold standard, and all but 1 were within the realm of reason. This is a strong signal that consensus is a fair proxy for accuracy.

<a name="4.5"></a>
### 4.5 What to expect
If we conservatively assume the lower-performing job, where roughly 60\% of products reach labeling consensus, and run with 3 independent labelers per product at \\$0.048 per task, it will cost us about \\$24,000 to get to the 100K product dataset we'd like.
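
The arithmetic behind that estimate can be sketched as follows, assuming each labeler's annotation of a product is billed as one task and that only products reaching consensus count toward the 100K target. The function and parameter names are hypothetical.

```python
def labeling_cost(target_products, consensus_rate, labelers_per_product, price_per_task):
    """Estimate the total cost to end up with target_products usable labels."""
    # Only a fraction of labeled products reach consensus, so label more than the target.
    products_to_label = target_products / consensus_rate
    # Each product is labeled independently by several workers, one task each.
    tasks = products_to_label * labelers_per_product
    return tasks * price_per_task

cost = labeling_cost(
    target_products=100_000,
    consensus_rate=0.60,
    labelers_per_product=3,
    price_per_task=0.048,
)
print(f"${cost:,.0f}")
```

Roughly 167K products need labeling to yield 100K with consensus, which is 500K tasks at the chosen price.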