Our objective is to help authors add the appropriate tags to their projects so that the community can discover them. To do this, we want to use the metadata provided with each project to determine which tags are relevant. We'll start with the most influential features and iteratively experiment with additional ones.

### Load data

In [2]:
from collections import Counter, OrderedDict
import ipywidgets as widgets
import itertools
import json
import pandas as pd
from urllib.request import urlopen

In [3]:
# Load projects
url = "https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/projects.json"
projects = json.loads(urlopen(url).read())
print (json.dumps(projects[-305], indent=2))

{
  "id": 2106,
  "created_on": "2020-08-08 15:06:18",
  "title": "Fast NST for Videos (+ person segmentation) \ud83c\udfa5 + \u26a1\ud83d\udcbb + \ud83c\udfa8 = \u2764\ufe0f",
  "description": "Create NST videos and pick separate styles for the person in the video and for the background.",
  "tags": [
    "code",
    "tutorial",
    "video",
    "computer-vision",
    "style-transfer",
    "neural-style-transfer"
  ]
}


In [4]:
# Create dataframe
df = pd.DataFrame(projects)
print (f"{len(df)} projects")
df.head(5)

2032 projects


| | id | created_on | title | description | tags |
|---|---|---|---|---|---|
| 0 | 1 | 2020-02-17 06:30:41 | Machine Learning Basics | A practical set of notebooks on machine learni... | [code, tutorial, keras, pytorch, tensorflow, d... |
| 1 | 2 | 2020-02-17 06:41:45 | Deep Learning with Electronic Health Record (E... | A comprehensive look at recent machine learnin... | [article, tutorial, deep-learning, health, ehr] |
| 2 | 3 | 2020-02-20 06:07:59 | Automatic Parking Management using computer vi... | Detecting empty and parked spaces in car parki... | [code, tutorial, video, python, machine-learni... |
| 3 | 4 | 2020-02-20 06:21:57 | Easy street parking using region proposal netw... | Get a text on your phone whenever a nearby par... | [code, tutorial, python, pytorch, machine-lear... |
| 4 | 5 | 2020-02-20 06:29:18 | Deep Learning based parking management system ... | Fastai provides easy to use wrappers to quickl... | [code, tutorial, fastai, deep-learning, parkin... |


We add features iteratively because each new one introduces more complexity and effort. We may have additional data about each project, such as author info, HTML from links in the description, etc. While these may carry meaningful signal, we want to introduce them slowly, after we close the loop.

### Auxiliary Data

We're also going to be using an auxiliary dataset which contains a collection of all the tags with their aliases and parent/child relationships. This auxiliary dataset was used by our application to automatically add the relevant parent tags when the child tags were present.

In [5]:
# Load tags
url = "https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/tags.json"
tags = json.loads(urlopen(url).read())
tags_dict = {}
for item in tags:
    key = item.pop("tag")
    tags_dict[key] = item
print (f"{len(tags_dict)} tags")

400 tags
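
The parent/child relationships described above can be used to expand a project's tags. The following is a minimal sketch, not the application's actual logic: it assumes each entry in `tags_dict` stores its parents under a `"parents"` key, and both the field names and the toy entries are hypothetical stand-ins for the real `tags.json` schema.

```python
def add_parent_tags(tags, tags_dict):
    """Return tags plus any ancestors found in tags_dict, preserving order."""
    expanded = list(tags)
    queue = list(tags)
    while queue:
        tag = queue.pop(0)
        # Unknown tags and entries without parents contribute nothing
        for parent in tags_dict.get(tag, {}).get("parents", []):
            if parent not in expanded:
                expanded.append(parent)
                queue.append(parent)  # follow grandparents as well
    return expanded


# Toy stand-in for tags_dict (hypothetical entries and field names)
toy_tags_dict = {
    "question-answering": {"aliases": ["qa"], "parents": ["natural-language-processing"]},
    "natural-language-processing": {"aliases": ["nlp"], "parents": []},
}

print(add_parent_tags(["code", "question-answering"], toy_tags_dict))
# ['code', 'question-answering', 'natural-language-processing']
```

Tags that are missing from the dictionary are passed through unchanged, so the helper is safe to run on author-provided tags that have not been curated yet.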


In [6]:
@widgets.interact(tag=list(tags_dict.keys()))
def display_tag_details(tag='question-answering'):
    print (json.dumps(tags_dict[tag], indent=2))


## Data Imbalance

There are several techniques to mitigate data imbalance, including resampling (oversampling minority classes / undersampling majority classes) and accounting for the data distribution via the loss function (since that drives the learning process).
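
As one way to account for the data distribution in the loss, we could weight each tag inversely to its frequency. The sketch below is an illustration under stated assumptions rather than a prescribed method: the helper name `inverse_frequency_weights` and the toy tag lists are hypothetical, and the formula adapts the common "balanced" heuristic (total count divided by number of classes times class count) to tag occurrences.

```python
from collections import Counter
import itertools


def inverse_frequency_weights(tag_lists):
    """Map each tag to a weight that grows as the tag becomes rarer."""
    counts = Counter(itertools.chain.from_iterable(tag_lists))
    total = sum(counts.values())
    num_classes = len(counts)
    return {tag: total / (num_classes * count) for tag, count in counts.items()}


# Toy tag lists (hypothetical); in the notebook these could come from df.tags
toy_tags = [["code", "pytorch"], ["code", "tensorflow"], ["code"], ["code", "pytorch"]]
weights = inverse_frequency_weights(toy_tags)
for tag, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{tag}: {weight:.3f}")
```

Rare tags such as `tensorflow` in the toy data receive the largest weights, so errors on them would contribute more to a weighted loss than errors on the ubiquitous `code` tag.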

## Libraries

We could have used the user-provided tags as our labels, but what if the user added a wrong tag or forgot to add a relevant one? To remove this dependency on the user to provide the gold standard labels, we can leverage labeling tools and platforms. These tools allow for quick and organized labeling of the dataset to ensure its quality. And instead of starting from scratch and asking our labeler to provide all the relevant tags for a given project, we can provide the author's original tags and ask the labeler to add / remove tags as necessary. The specific labeling tool may need to be custom built, or it may leverage something from the ecosystem.

### General

- <a href="https://scale.com/">Scale AI</a>: the data platform for high quality training and validation data for AI applications.
- <a href="https://github.com/heartexlabs/label-studio">Label Studio</a>: a multi-type data labeling and annotation tool with standardized output format.
- <a href="https://github.com/UniversalDataTool/universal-data-tool">Universal Data Tool</a>: collaborate and label any type of data, images, text, or documents in an easy web interface or desktop app.
- <a href="https://github.com/explosion/prodigy-recipes">Prodigy</a>: recipes for Prodigy, a fully scriptable annotation tool.
- <a href="https://github.com/janfreyberg/superintendent">Superintendent</a>: an ipywidget-based interactive labeling tool for your data to enable active learning.

### Natural language processing

- <a href="https://github.com/doccano/doccano">Doccano</a>: an open source text annotation tool for text classification, sequence labeling and sequence to sequence tasks.
- <a href="https://github.com/nlplab/brat">BRAT</a>: a rapid annotation tool for all your textual annotation needs.

### Computer Vision
- <a href="https://github.com/tzutalin/labelImg">LabelImg</a>: a graphical image annotation tool for labeling object bounding boxes in images.
- <a href="https://github.com/openvinotoolkit/cvat">CVAT</a>: a free, online, interactive video and image annotation tool for computer vision.
- <a href="https://github.com/Microsoft/VoTT">VoTT</a>: an electron app for building end-to-end object detection models from images and videos.
- <a href="https://github.com/SkalskiP/make-sense">makesense.ai</a>: a free to use online tool for labelling photos.
- <a href="https://github.com/rediscovery-io/remo-python">remo</a>: an app for annotations and images management in computer vision.
- <a href="https://github.com/aralroca/labelai">Labelai</a>: an online tool designed to label images, useful for training AI models.

### Audio
- <a href="https://github.com/midas-research/audino">Audino</a>: an open source audio annotation tool for voice activity detection (VAD), diarization, speaker identification, automated speech recognition, emotion recognition tasks, etc.
- <a href="https://github.com/CrowdCurio/audio-annotator">audio-annotator</a>: a JavaScript interface for annotating and labeling audio files.
- <a href="https://github.com/ritazh/EchoML">EchoML</a>: a web app to play, visualize, and annotate your audio files for machine learning.

### Miscellaneous
- <a href="https://github.com/CogStack/MedCAT">MedCAT</a>: a medical concept annotation tool that can extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS.

## Active Learning
Even with a powerful labeling tool and established workflows, it's easy to see how involved and expensive labeling can be. Therefore, many teams employ active learning to iteratively label the dataset and evaluate the model.
1. Label a small, initial dataset to train a model.
2. Ask the trained model to predict on some unlabeled data.
3. Determine which new data points to label from the unlabeled data based on:
   - entropy over the predicted class probabilities
   - samples with the lowest predicted, calibrated confidence (uncertainty sampling)
   - discrepancy in predictions from an ensemble of trained models
4. Repeat until the desired performance is achieved.

> This can be significantly more cost-effective and faster than labeling the entire dataset.
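
The entropy criterion from step 3 can be sketched as follows. This is a minimal illustration and not the API of any particular active learning library: the function names and the toy predicted probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the k sample ids whose predictions have the highest entropy."""
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:k]


# Toy model outputs (hypothetical): sample id -> class probabilities
toy_predictions = {
    "a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "b": [0.34, 0.33, 0.33],  # nearly uniform, high entropy
    "c": [0.70, 0.20, 0.10],
}

print(select_most_uncertain(toy_predictions, k=2))
# ['b', 'c']
```

The selected samples would be sent to the labeler, added to the training set, and the loop repeated with a retrained model.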

### Libraries
- <a href="https://github.com/modAL-python/modAL">ModAL</a>: a modular active learning framework for Python.
- <a href="https://github.com/ntucllab/libact">libact</a>: pool-based active learning in Python.
- <a href="https://github.com/NUAA-AL/ALiPy">ALiPy</a>: active learning python toolbox, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods.

## Weak supervision
If we have samples that need labeling, or if we simply want to validate existing labels, we can use weak supervision to generate labels instead of hand labeling all of them. With weak supervision, we write labeling functions that label both our existing and new data. These labeling functions can be built from keywords, pattern expressions, knowledge bases and generalized models. We can add to the labeling functions over time and even mitigate conflicts among the different labeling functions.

In [8]:
!pip install snorkel


In [9]:
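import re
from snorkel.labeling import labeling_function

@labeling_function()
def contains_tensorflow(text):
    # Match whole words only so "tf" doesn't fire inside words like "outfit"
    condition = re.search(r"\b(tensorflow|tf)\b", text.lower()) is not None
    # Note: Snorkel's appliers expect integer labels (with -1 for abstain);
    # a string / None is returned here only to keep the example readable
    return "tensorflow" if condition else None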
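
To illustrate how conflicts among labeling functions might be mitigated, here is a plain-Python sketch that combines several keyword-based functions with a simple majority vote. It deliberately avoids the Snorkel API so that it runs on its own; the function names, the `ABSTAIN` constant and the tie-breaking rule are all hypothetical choices, and a learned label model would typically replace the vote in practice.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_keyword_tensorflow(text):
    return "tensorflow" if re.search(r"\btensorflow\b", text.lower()) else ABSTAIN


def lf_keyword_pytorch(text):
    return "pytorch" if re.search(r"\b(pytorch|torch)\b", text.lower()) else ABSTAIN


def lf_keras_implies_tensorflow(text):
    return "tensorflow" if "keras" in text.lower() else ABSTAIN


def majority_vote(text, lfs):
    """Return the most voted label, or ABSTAIN on no votes or a tie."""
    votes = Counter(label for label in (lf(text) for lf in lfs) if label is not ABSTAIN)
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # conflicting functions with equal support
    return top


lfs = [lf_keyword_tensorflow, lf_keyword_pytorch, lf_keras_implies_tensorflow]
print(majority_vote("Porting a Keras model from TensorFlow to PyTorch", lfs))
# tensorflow
```

Abstaining on ties keeps ambiguous samples out of the generated labels, which makes them natural candidates for hand labeling.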

## Iteration
Labeling isn't just a one-time event or something we repeat identically. As new data becomes available, we'll want to strategically label the appropriate samples and improve slices of our data that are lacking in quality. In fact, there's an entire workflow related to labeling that is initiated when we want to iterate. We'll learn more about this iterative labeling process in our continual learning lesson.