# A simple intro to the Eto Labs SDK

The Eto Labs SDK helps you manage your deep learning datasets and is designed to fit into your existing workflow.
This notebook walks through the basic features available on the platform: ingestion, the data registry, pandas integration, and the PyTorch data loader.

In [1]:
import os
import eto
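# Alternative: configure with an account name and a token read from the environment
# eto.configure(account='demo', token=os.environ.get('ETO_API_TOKEN'))
# This notebook points at a local test deployment instead
eto.config.Config.create_config(url='http://api:5000', token='testtest')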
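
The cell above hard-codes a test token. Outside a local demo, connection settings are usually read from the environment. The helper below is a minimal sketch of that idea in plain Python: `load_eto_settings` is a made-up name, `ETO_API_URL` is a hypothetical variable, and only `ETO_API_TOKEN` appears in this notebook.

```python
import os


def load_eto_settings(env=os.environ):
    """Read connection settings from the environment instead of hard-coding them.

    ETO_API_TOKEN is the variable named in this notebook.
    ETO_API_URL is a hypothetical name chosen for this sketch.
    """
    token = env.get("ETO_API_TOKEN")
    if not token:
        raise RuntimeError("ETO_API_TOKEN is not set")
    # Fall back to the local test URL used in this notebook
    url = env.get("ETO_API_URL", "http://api:5000")
    return {"url": url, "token": token}
```

The returned dict has the same `url` and `token` keys that `create_config` is called with above, so it could presumably be passed as keyword arguments.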

## Ingest a COCO dataset

In [2]:
eto.ingest_coco('little_coco', source={'image_dir': 's3://eto-public/little_coco_raw/val2017/',
                                       'annotation': 's3://eto-public/little_coco_raw/annotations/instances_val2017.json'},
                mode='overwrite')

{'created_at': '2021-12-20T04:51:58.869511+00:00',
 'id': '9ad3cfc6-5ff8-446a-982f-e54c854073e6',
 'status': 'created'}
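
For context, a COCO annotation file stores images, annotations, and categories as separate flat lists linked by ids, with boxes given as `[x, y, width, height]`. The dataset shown later in this notebook has one row per image, with a nested `annotations` list and corner-style boxes. The sketch below illustrates that reshaping on a toy dictionary. It is not the SDK's ingestion code, and `group_coco_annotations` is a name made up for illustration.

```python
from collections import defaultdict


def group_coco_annotations(coco):
    """Group COCO annotations per image and convert boxes to corner form."""
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        by_image[ann["image_id"]].append({
            "ann_id": ann["id"],
            "label_id": ann["category_id"],
            "label": categories[ann["category_id"]],
            "area": ann["area"],
            "bbox": (x, y, x + w, y + h),  # (xmin, ymin, xmax, ymax)
        })
    # One output row per image; images without annotations get an empty list
    return [
        {"image_id": img["id"],
         "file_name": img["file_name"],
         "annotations": by_image.get(img["id"], [])}
        for img in coco["images"]
    ]


toy = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "categories": [{"id": 28, "name": "umbrella"}],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 28,
         "area": 1200, "bbox": [10, 20, 30, 40]},
    ],
}
rows = group_coco_annotations(toy)
```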

## Access the data registry

In [3]:
eto.list_datasets()

[{'created_at': '2021-12-20T02:19:38',
  'dataset_id': 'little_coco',
  'project_id': 'default',
  'uri': 'file:/var/data/warehouse/little_coco'}]

In [4]:
eto.get_dataset('little_coco')

{'created_at': '2021-12-20T02:19:38',
 'dataset_id': 'little_coco',
 'project_id': 'default',
 'schema': {'fields': [{'metadata': {},
                        'name': 'date_captured',
                        'nullable': True,
                        'type': 'string'},
                       {'metadata': {},
                        'name': 'width',
                        'nullable': True,
                        'type': 'integer'},
                       {'metadata': {},
                        'name': 'height',
                        'nullable': True,
                        'type': 'integer'},
                       {'metadata': {},
                        'name': 'file_name',
                        'nullable': True,
                        'type': 'string'},
                       {'metadata': {},
                        'name': 'image_id',
                        'nullable': True,
                        'type': 'integer'},
                       {'metadata': {},
                 

## DataFrame integration

### Eto adds a pandas reader for datasets in the Eto data registry

In [5]:
import pandas as pd
df = pd.read_eto('little_coco')
df

2021-12-20 04:52:08,697 INFO Rikai (dataset.py:111): Loading parquet files: ['file:///var/data/warehouse/little_coco/part-00000-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00001-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00002-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00003-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00004-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00005-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00006-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00007-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/par

Unnamed: 0,date_captured,width,height,file_name,image_id,annotations,image
0,2013-11-16 23:31:18,640,427,000000124798.jpg,124798,"[{'image_id': 124798, 'area': 550.204900000000...",Image(<embedded>)
1,2013-11-16 23:51:13,640,480,000000566758.jpg,566758,"[{'image_id': 566758, 'area': 91964.6944500000...",Image(<embedded>)
2,2013-11-21 20:54:54,640,480,000000195842.jpg,195842,"[{'image_id': 195842, 'area': 1177.45305, 'lab...",Image(<embedded>)
3,2013-11-17 16:54:43,640,416,000000356612.jpg,356612,"[{'image_id': 356612, 'area': 3110.81619999999...",Image(<embedded>)
4,2013-11-23 18:02:01,527,640,000000312263.jpg,312263,"[{'image_id': 312263, 'area': 3687.52364999999...",Image(<embedded>)
...,...,...,...,...,...,...,...
4947,2013-11-19 23:46:47,469,640,000000576566.jpg,576566,"[{'image_id': 576566, 'area': 20906.4590499999...",Image(<embedded>)
4948,2013-11-14 23:37:45,427,640,000000570736.jpg,570736,"[{'image_id': 570736, 'area': 7615.19894999999...",Image(<embedded>)
4949,2013-11-16 03:15:19,640,427,000000131386.jpg,131386,"[{'image_id': 131386, 'area': 19764.42945, 'la...",Image(<embedded>)
4950,2013-11-18 16:20:38,640,480,000000562561.jpg,562561,"[{'image_id': 562561, 'area': 4971.72075, 'lab...",Image(<embedded>)


### The Rikai image type is very convenient to work with
See the [Rikai Image](https://github.com/eto-ai/rikai/blob/master/python/rikai/types/vision.py#L43) documentation for details.

In [6]:
df.image[0]

Image(<embedded>)

Images convert easily to NumPy arrays:

In [7]:
df.image[0].to_numpy().shape

(427, 640, 3)
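
The shape above reads as (height, width, channels): 427 rows, 640 columns, and 3 color channels, matching the `height` and `width` columns of the first row. PyTorch vision models expect channels first and float values in [0, 1], which is the conversion `torchvision.transforms.ToTensor` performs later in this notebook. Here is a NumPy-only sketch of that conversion, using a synthetic array of the same shape rather than the real image:

```python
import numpy as np

# Synthetic stand-in for df.image[0].to_numpy(): height=427, width=640, RGB
hwc = np.zeros((427, 640, 3), dtype=np.uint8)
hwc[..., 0] = 255  # fill the red channel

# Move channels first and rescale 0..255 integers to 0..1 floats
chw = hwc.transpose(2, 0, 1).astype(np.float32) / 255.0
print(chw.shape)
```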

`Image` supports both embedded and externalized (URI-referenced) images:

In [8]:
from rikai.types.vision import Image
path = '/tmp/test.png'
df.image[0].save(path)
Image(path)

Image(uri=/tmp/test.png)

### Annotations are converted to Rikai geometry types
For example [Box2d](https://github.com/eto-ai/rikai/blob/master/python/rikai/types/geometry.py#L70)

In [9]:
bbox = [ann['bbox'] for ann in df.annotations[0]]
bbox

[Box2d(xmin=2.88, ymin=269.87, xmax=66.21, ymax=290.02),
 Box2d(xmin=99.65, ymin=276.6, xmax=143.14000000000001, ymax=292.87),
 Box2d(xmin=125.83, ymin=278.87, xmax=157.8, ymax=293.25),
 Box2d(xmin=148.9, ymin=279.88, xmax=176.09, ymax=293.93),
 Box2d(xmin=47.81, ymin=271.98, xmax=88.03, ymax=301.59000000000003),
 Box2d(xmin=244.82, ymin=267.85, xmax=271.05, ymax=294.99),
 Box2d(xmin=170.82, ymin=238.56, xmax=408.39, ymax=364.03),
 Box2d(xmin=41.36, ymin=299.01, xmax=137.45999999999998, ymax=369.93),
 Box2d(xmin=574.77, ymin=311.69, xmax=640.0, ymax=425.88),
 Box2d(xmin=0.0, ymin=293.68, xmax=58.12, ymax=375.64),
 Box2d(xmin=546.99, ymin=313.23, xmax=593.32, ymax=346.48),
 Box2d(xmin=496.03, ymin=314.05, xmax=536.2099999999999, ymax=340.96000000000004),
 Box2d(xmin=472.8, ymin=313.48, xmax=499.79, ymax=338.63),
 Box2d(xmin=530.88, ymin=314.0, xmax=551.61, ymax=333.45),
 Box2d(xmin=591.86, ymin=315.52, xmax=611.97, ymax=336.84),
 Box2d(xmin=408.67, ymin=299.31, xmax=476.41, ymax=353.63)]

In [10]:
# TODO: `Image.draw` does not exist in this version of Rikai, so the call below raises AttributeError
df.image[0].draw(bbox)

AttributeError: 'Image' object has no attribute 'draw'
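
Until a drawing method is available, boxes can be overlaid on the NumPy form of the image. The helper below is a sketch that uses only NumPy slicing. `draw_boxes` is a made-up name, not part of Rikai, and it is demonstrated on a blank canvas rather than the real image.

```python
import numpy as np


def draw_boxes(img, boxes, color=(255, 0, 0), thickness=2):
    """Return a copy of an (H, W, 3) uint8 array with rectangle outlines drawn.

    `boxes` is an iterable of (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    out = img.copy()
    h, w = out.shape[:2]
    t = thickness
    for xmin, ymin, xmax, ymax in boxes:
        # Round to pixel indices and clamp to the image bounds
        x0, y0 = max(int(round(xmin)), 0), max(int(round(ymin)), 0)
        x1, y1 = min(int(round(xmax)), w - 1), min(int(round(ymax)), h - 1)
        out[y0:y0 + t, x0:x1 + 1] = color                    # top edge
        out[max(y1 - t + 1, 0):y1 + 1, x0:x1 + 1] = color    # bottom edge
        out[y0:y1 + 1, x0:x0 + t] = color                    # left edge
        out[y0:y1 + 1, max(x1 - t + 1, 0):x1 + 1] = color    # right edge
    return out


canvas = np.zeros((427, 640, 3), dtype=np.uint8)
boxed = draw_boxes(canvas, [(2.88, 269.87, 66.21, 290.02)])
```

With the notebook's data, something like `draw_boxes(df.image[0].to_numpy(), [(b.xmin, b.ymin, b.xmax, b.ymax) for b in bbox])` should work, assuming `Box2d` exposes the four attributes shown in its repr.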

You can compute the intersection over union (IoU) of one box against a list of boxes by calling the `iou` method:

In [11]:
bbox[0].iou(bbox[1:])

array([0.        , 0.        , 0.        , 0.15546788, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])
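
For reference, IoU is the area where two boxes overlap divided by the area of their union. The NumPy sketch below implements that definition for one box against many. It is not Rikai's implementation, and `iou_one_to_many` is a made-up name. The coordinates are taken from the first, second, and fifth boxes listed above.

```python
import numpy as np


def iou_one_to_many(box, others):
    """IoU of one (xmin, ymin, xmax, ymax) box against an (N, 4) set of boxes."""
    box = np.asarray(box, dtype=float)
    others = np.asarray(others, dtype=float).reshape(-1, 4)
    # Corners of the intersection rectangle
    ix0 = np.maximum(box[0], others[:, 0])
    iy0 = np.maximum(box[1], others[:, 1])
    ix1 = np.minimum(box[2], others[:, 2])
    iy1 = np.minimum(box[3], others[:, 3])
    # Negative widths or heights mean no overlap, so clip them to zero
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)


first = (2.88, 269.87, 66.21, 290.02)
rest = [
    (99.65, 276.6, 143.14, 292.87),   # no overlap with `first`
    (47.81, 271.98, 88.03, 301.59),   # partial overlap
]
ious = iou_one_to_many(first, rest)
```

The second value comes out at about 0.1555, in line with the single non-zero entry in the output above.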

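With the annotations flattened into their own DataFrame, it is easy to compute analytics on your datasets, such as the number of distinct labels per image: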

In [12]:
ann_df = pd.json_normalize(df.annotations.explode())
ann_df

Unnamed: 0,image_id,area,label,label_id,ann_id,bbox,supercategory
0,124798,550.20490,umbrella,28,1424448,"(2.88, 269.87, 66.21, 290.02)",accessory
1,124798,351.36185,umbrella,28,1426270,"(99.65, 276.6, 143.14000000000001, 292.87)",accessory
2,124798,231.88560,umbrella,28,1828906,"(125.83, 278.87, 157.8, 293.25)",accessory
3,124798,196.79755,umbrella,28,1830048,"(148.9, 279.88, 176.09, 293.93)",accessory
4,124798,461.03390,umbrella,28,1831062,"(47.81, 271.98, 88.03, 301.59000000000003)",accessory
...,...,...,...,...,...,...,...
36776,233727,370.34715,bicycle,2,1765908,"(570.35, 220.91, 591.6, 262.25)",vehicle
36777,233727,207.53745,bicycle,2,1766498,"(494.93, 223.53, 506.31, 260.77)",vehicle
36778,233727,141.86895,bicycle,2,1766800,"(492.45, 212.97, 508.45, 228.05)",vehicle
36779,233727,58.73225,bicycle,2,1767320,"(480.72, 213.22, 494.3, 223.13)",vehicle


In [13]:
ann_df.groupby('image_id')['label'].nunique()

image_id
139       10
285        1
632        4
724        3
776        2
          ..
581317     2
581357     3
581482     1
581615     1
581781     1
Name: label, Length: 4952, dtype: int64
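
The same explode, normalize, and group-by pattern can be tried without the registry. The sketch below uses a small hand-made DataFrame with invented values. It converts the exploded Series to a list before calling `pd.json_normalize`, and drops rows for images that have no annotations.

```python
import pandas as pd

# Toy data with the same nesting as the `annotations` column
toy = pd.DataFrame({
    "image_id": [1, 2],
    "annotations": [
        [{"image_id": 1, "label": "umbrella", "area": 10.0},
         {"image_id": 1, "label": "umbrella", "area": 12.5},
         {"image_id": 1, "label": "person", "area": 40.0}],
        [{"image_id": 2, "label": "bicycle", "area": 7.0}],
    ],
})

# One row per annotation: explode the lists, then expand each dict into columns
flat = pd.json_normalize(toy["annotations"].explode().dropna().tolist())

labels_per_image = flat.groupby("image_id")["label"].nunique()
mean_area = flat.groupby("label")["area"].mean()
```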

## PyTorch data loader

In [14]:
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

from rikai.torch.vision import Dataset

In [15]:
def ann_transform(ann_lst):
    """Convert one image's annotation list into a torchvision detection target dict."""
    rs = {}
    rs['boxes'] = torch.as_tensor([x['bbox'] for x in ann_lst], dtype=torch.float32)
    rs['labels'] = torch.as_tensor([x['label_id'] for x in ann_lst], dtype=torch.int64)
    # all annotations in the list belong to the same image
    rs['image_id'] = torch.tensor([ann_lst[0]['image_id']])
    rs['area'] = torch.as_tensor([x['area'] for x in ann_lst], dtype=torch.float32)
    # treat every instance as non-crowd
    rs['iscrowd'] = torch.zeros((len(ann_lst),), dtype=torch.int64)
    return rs

In [16]:
def get_model(num_classes):
    # build a Faster R-CNN object detection model; with pretrained=False the
    # COCO-trained detection weights are not loaded
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the box predictor head with one sized for num_classes
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

In [17]:
dataset = Dataset('little_coco', image_column='image', target_column='annotations', 
                  transform=torchvision.transforms.ToTensor(),
                  target_transform=ann_transform)

In [18]:
# select device (whether GPU or CPU)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

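# label ids are used directly as class indices (0 is background),
# so the head needs max id + 1 outputs
num_classes = int(ann_df.label_id.max()) + 1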
model = get_model(num_classes)

# move model to the right device
model.to(device)

# parameters
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params, lr=0.005, momentum=0.9, weight_decay=0.005
)


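# DataLoader; the collate_fn keeps each batch as a tuple of images and a tuple
# of targets, because images differ in size and cannot be stacked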
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=10,
    num_workers=4,
    collate_fn=lambda batch: tuple(zip(*batch)),
)

# Training
num_epochs = 1
for epoch in range(num_epochs):
    model.train()
    for i, (imgs, annotations) in enumerate(data_loader, start=1):
        imgs = [img.to(device) for img in imgs]
        annotations = [{k: v.to(device) for k, v in t.items()} for t in annotations]
        # in training mode the model returns a dict of loss terms
        loss_dict = model(imgs, annotations)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        print(f"Batch: {i}, Loss: {losses.item()}")
        # demo only: stop after the first batch
        break

2021-12-20 04:52:57,316 INFO Rikai (dataset.py:102): Running in distributed mode, world size=4, rank=0
2021-12-20 04:52:57,317 INFO Rikai (dataset.py:102): Running in distributed mode, world size=4, rank=2
2021-12-20 04:52:57,317 INFO Rikai (dataset.py:102): Running in distributed mode, world size=4, rank=1
2021-12-20 04:52:57,333 INFO Rikai (dataset.py:102): Running in distributed mode, world size=4, rank=3
2021-12-20 04:52:57,334 INFO Rikai (dataset.py:111): Loading parquet files: ['file:///var/data/warehouse/little_coco/part-00000-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00001-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00002-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00003-b82537bd-3885-4631-9687-2ee6b1c9a2df-c000.snappy.parquet', 'file:///var/data/warehouse/little_coco/part-00004-b82537bd-3885-4631-9

Batch: 1, Loss: 5.549132347106934
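
A note on the `collate_fn` used in the training cell: the default collate function tries to stack samples into one tensor, which fails when images have different sizes, and torchvision detection models take a list of images and a list of target dicts anyway. `tuple(zip(*batch))` simply transposes a list of `(image, target)` pairs. The sketch below shows the effect on placeholder values, using a named function (`collate_detection`, a made-up name) instead of a lambda. A module-level function can be pickled, which matters on platforms where worker processes are spawned rather than forked.

```python
def collate_detection(batch):
    """Turn [(img, target), ...] into ((img, ...), (target, ...))."""
    return tuple(zip(*batch))


# Placeholder samples standing in for (image tensor, target dict) pairs
batch = [
    ("img_a", {"labels": [28, 28]}),
    ("img_b", {"labels": [2]}),
]
imgs, targets = collate_detection(batch)
```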
