# Merging Datasets

This recipe demonstrates a simple pattern for merging FiftyOne Datasets via [Dataset.merge_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=merge_samples#fiftyone.core.dataset.Dataset.merge_samples).

Merging datasets is an easy way to:

-   Combine multiple datasets with information about the same underlying raw media (images and videos)
-   Add model predictions to a FiftyOne dataset, to compare with ground truth annotations and/or other models

## Setup

If you haven't already, install FiftyOne:

In [None]:
!pip install fiftyone

In this recipe, we'll work with a dataset downloaded from the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/zoo.html).

To access the dataset, install `torch` and `torchvision`, if necessary:

In [1]:
!pip install torch torchvision



In [2]:
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

In [3]:
# fo.delete_datasets("d*")

Create three datasets to merge, each a random 1,000-sample subset of the CIFAR-10 test split:

In [4]:
dsets = [
    foz.load_zoo_dataset(
        "cifar10",
        split="test",
        dataset_name=f"d{i+1}",
        max_samples=1000,
        shuffle=True,
    )
    for i in range(3)
]

Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |███████████████| 1000/1000 [245.0ms elapsed, 0s remaining, 4.1K samples/s]      
Dataset 'd1' created
Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |███████████████| 1000/1000 [293.7ms elapsed, 0s remaining, 3.4K samples/s]      
Dataset 'd2' created
Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |███████████████| 1000/1000 [241.3ms elapsed, 0s remaining, 4.1K samples/s]      
Dataset 'd3' created


In [5]:
d1, d2, d3 = dsets

In [6]:
for d in dsets:
    print(f"{d.name} has {d.count()} samples")

d1 has 1000 samples
d2 has 1000 samples
d3 has 1000 samples


## Basic Merge

Calling `merge_samples()` on a FiftyOne dataset will merge the samples from the input dataset into the target dataset. Let's see this by cloning `d1` and merging `d2` into it:

In [7]:
d1_clone = d1.clone(name="d1_clone")
d1_clone.merge_samples(d2)
print(d1_clone)

Name:        d1_clone
Media type:  image
Num samples: 1912
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


By default, samples with the same `filepath` are merged, so every filepath in the merged dataset `d1_clone` is unique:

In [18]:
len(d1_clone.distinct(F("filepath")))

1912

To be explicit, let's do the math:

In [26]:
d1_fps = set(d1.distinct("filepath"))
d2_fps = set(d2.distinct("filepath"))

# Union
print(f"Num unique filepaths: {len(d1_fps.union(d2_fps))}")

# Intersection
print(f"Num duplicate filepaths: {len(d1_fps.intersection(d2_fps))}")

Num unique filepaths: 1912
Num duplicate filepaths: 88
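
The merged sample count follows from inclusion-exclusion: the union of two sets has as many elements as both sets combined, minus the elements they share. A minimal plain-Python check using the counts printed above (no FiftyOne calls are involved, and the variable names are chosen here for illustration):

```python
# Counts taken from the outputs above
n_d1 = 1000         # samples in d1
n_d2 = 1000         # samples in d2
n_duplicates = 88   # filepaths present in both datasets

# |d1 ∪ d2| = |d1| + |d2| - |d1 ∩ d2|
n_merged = n_d1 + n_d2 - n_duplicates
print(n_merged)  # 1912
```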


If you merge the same samples again, the sample count does not change, because every `filepath` in `d2` already exists in `d1_clone`:

In [27]:
d1_clone.merge_samples(d2)
print(d1_clone.count())

1912


If you do want to add all samples from `d2` to `d1_clone`, creating new samples even for samples in `d2` whose `filepath` already exists in `d1_clone`, use `add_collection()`:

In [28]:
d1_clone.add_collection(d2)
print(d1_clone.count())

2912


You can also customize the merge key by passing either a `key_field` or a `key_fcn`.
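
As a rough sketch of the `key_fcn` option: a key function receives a sample and returns the value used to decide whether two samples match. The function name `basename_key` and the file paths below are hypothetical, and the `SimpleNamespace` objects are only stand-ins for samples so that the snippet runs without a dataset:

```python
import os
from types import SimpleNamespace

def basename_key(sample):
    """Hypothetical key function: match samples by filename, ignoring directories."""
    return os.path.basename(sample.filepath)

# Stand-ins for samples, used only to show what the key function computes
a = SimpleNamespace(filepath="/data/copy_a/000123.jpg")
b = SimpleNamespace(filepath="/data/copy_b/000123.jpg")
print(basename_key(a) == basename_key(b))  # True

# With FiftyOne datasets, the function would be passed like this:
# d1_clone.merge_samples(d2, key_fcn=basename_key)
```

A key function like this may be useful when the same media is stored under different directories in the two datasets, so that the full `filepath` values never match.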

## Merge with Key Field

Let's merge two datasets based on a key field "position". Note what fields are on the samples in the merged datasets:

In [46]:
import numpy as np
d1_keyfield = d1.clone(name="d1_keyfield")

# number by position
d1_keyfield.set_values("position", np.arange(d1_keyfield.count()))
d2.set_values("position", np.arange(d2.count()))

# create new fields on d1_keyfield and d2 for illustration
d1_keyfield.set_values("field1", np.random.rand(d1_keyfield.count()))
d2.set_values("field2", np.random.rand(d2.count()))

print(d1_keyfield)

print(d2)

Name:        d1_keyfield
Media type:  image
Num samples: 1000
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    position:     fiftyone.core.fields.IntField
    field1:       fiftyone.core.fields.FloatField
Name:        d2
Media type:  image
Num samples: 1000
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth:

In [47]:
d1_keyfield.merge_samples(d2, key_field="position")
print(d1_keyfield)

Name:        d1_keyfield
Media type:  image
Num samples: 1000
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    position:     fiftyone.core.fields.IntField
    field1:       fiftyone.core.fields.FloatField
    keyfield:     fiftyone.core.fields.IntField
    field2:       fiftyone.core.fields.FloatField


Note that the merged dataset still has only 1,000 samples, because samples with the same `position` are merged. Also note that although `field1` existed only in `d1_keyfield` and `field2` only in `d2`, both are present in the merged dataset: merging datasets with different fields adds the missing fields to the target dataset's schema.

Now suppose that only some samples have matching `position` values:

In [57]:
d1_partial_overlap = d1.clone(name="d1_po")

# number by position with half overlap
d1_partial_overlap.set_values("position", np.arange(d1_partial_overlap.count()) + d1.count() // 2)

In [58]:
d1_partial_overlap.merge_samples(d2, key_field="position")
print(d1_partial_overlap)

Name:        d1_po
Media type:  image
Num samples: 1500
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    position:     fiftyone.core.fields.IntField
    keyfield:     fiftyone.core.fields.IntField
    field2:       fiftyone.core.fields.FloatField


This merges the overlapping samples and adds the non-overlapping samples from `d2` as new samples in the merged dataset. To merge only into samples that already exist and skip the rest, pass `insert_new=False`.
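
The counts above can be reproduced with a toy model of a keyed merge over plain dictionaries. This is a simplified sketch, not FiftyOne's implementation: the helper `merge_by_key` is hypothetical, and it assumes that values from the source overwrite values in the target when keys match.

```python
def merge_by_key(target, source, key, insert_new=True):
    """Toy model of a keyed merge over lists of dicts (not FiftyOne's implementation)."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        k = row[key]
        if k in merged:
            merged[k].update(row)      # matching key: combine fields
        elif insert_new:
            merged[k] = dict(row)      # unmatched key: add as a new record
    return list(merged.values())

# Mirror the example above: positions 500..1499 vs. 0..999
target = [{"position": i} for i in range(500, 1500)]
source = [{"position": i, "field2": 0.5} for i in range(1000)]

print(len(merge_by_key(target, source, "position")))                    # 1500
print(len(merge_by_key(target, source, "position", insert_new=False)))  # 1000
```

With `insert_new=False`, the 500 source records whose `position` is below 500 are dropped, so only the original 1,000 records remain, 500 of which have gained `field2`.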

## Choosing Fields to Merge

When you merge two datasets, you can choose which fields to merge. By default, all fields are merged. You can specify the fields to merge with the `fields` argument, or you can specify the fields to exclude with the `omit_fields` argument. Let's try both:

In [73]:
# remove "field1" if it is left over from an earlier run
if "field1" in d3.get_field_schema():
    d3.delete_sample_field("field1")

In [81]:
fo.delete_datasets("d3_*")

In [82]:
d3_a = d3.clone(name="d3_a")
d3_b = d3.clone(name="d3_b")
d3_c = d3.clone(name="d3_c")

In [83]:
# create new dummy fields on d3_a, d3_b, d3_c
# set_values() writes directly to the dataset; set_field() only returns a
# view, so calling save() on the dataset would not persist its values

d3_a.add_sample_field("field_a", fo.IntField)
d3_a.set_values("field_a", [1] * d3_a.count())

d3_b.add_sample_field("field_b1", fo.IntField)
d3_b.set_values("field_b1", [1] * d3_b.count())

d3_b.add_sample_field("field_b2", fo.IntField)
d3_b.set_values("field_b2", [1] * d3_b.count())

d3_c.add_sample_field("field_c", fo.IntField)
d3_c.set_values("field_c", [1] * d3_c.count())

In [84]:
# merge all fields from d3_b except "field_b1" into d3_a
d3_a.merge_samples(d3_b, omit_fields="field_b1")
print(d3_a)

Name:        d3_a
Media type:  image
Num samples: 1000
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    field_a:      fiftyone.core.fields.IntField
    field_b2:     fiftyone.core.fields.IntField


In [85]:
# merge only field "field_a" from d3_a into d3_c (not "field_b2")
d3_c.merge_samples(d3_a, fields=["field_a"])
print(d3_c)

Name:        d3_c
Media type:  image
Num samples: 1000
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    field_c:      fiftyone.core.fields.IntField
    field_a:      fiftyone.core.fields.IntField
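
The effect of the two arguments can be summarized with a toy model of field selection on a single record. This is only an illustrative sketch in plain Python: the helper `select_fields` is hypothetical and is not part of FiftyOne.

```python
def select_fields(record, fields=None, omit_fields=None):
    """Toy model of field selection (not FiftyOne's implementation)."""
    # Start from all fields, or only the requested ones
    names = set(record) if fields is None else set(fields) & set(record)
    # Then drop any explicitly omitted fields
    names -= set(omit_fields or [])
    return {name: record[name] for name in record if name in names}

# Like merging d3_b into d3_a while omitting "field_b1"
sample_b = {"field_b1": 1, "field_b2": 1}
print(select_fields(sample_b, omit_fields=["field_b1"]))  # {'field_b2': 1}

# Like merging only "field_a" from d3_a into d3_c
sample_a = {"field_a": 1, "field_b2": 1}
print(select_fields(sample_a, fields=["field_a"]))  # {'field_a': 1}
```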
