# Datapath Example 3

This notebook shows how to build relatively simple data paths.
It assumes that you understand the concepts presented in the Example 2
notebook.

## Example Data Model
The examples require that you understand a little bit about the example
catalog data model, which is based on the FaceBase project.

### Key tables
- `'dataset'` : represents a unit of data, usually a `'study'` or `'experiment'`
- `'sample'` : a biosample
- `'assay'` : a bioassay (typically RNA-seq or ChIP-seq assays)

### Relationships
- `dataset <- sample`: A dataset may have one to many samples. I.e., there 
  is a foreign key reference from sample to dataset.
- `sample <- assay`: A sample may have one to many assays. I.e., there is a
  foreign key reference from assay to sample.
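
To make the direction of the foreign keys concrete, here is a minimal sketch that models the three tables as plain Python lists of dicts. The rows, ids, and helper functions are invented for illustration and are not taken from the FaceBase catalog or the deriva API; only the direction of the references (sample to dataset, assay to sample) follows the description above.

```python
# Toy rows, invented for illustration only (not real catalog data).
datasets = [{"id": 1}, {"id": 2}]
samples = [
    {"id": 10, "dataset": 1},  # "dataset" is the foreign key to datasets.id
    {"id": 11, "dataset": 1},
    {"id": 12, "dataset": 2},
]
assays = [
    {"id": 100, "sample": 10},  # "sample" is the foreign key to samples.id
    {"id": 101, "sample": 10},
    {"id": 102, "sample": 12},
]


def samples_of(dataset_id):
    """Return the samples whose foreign key points at the given dataset."""
    return [s for s in samples if s["dataset"] == dataset_id]


def assays_of(sample_id):
    """Return the assays whose foreign key points at the given sample."""
    return [a for a in assays if a["sample"] == sample_id]


print([s["id"] for s in samples_of(1)])
print([a["id"] for a in assays_of(10)])
```

In this toy data, sample 11 has no assays, which is consistent with the one-to-many relationships described above.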

In [1]:
# Import deriva modules
from deriva_common import ErmrestCatalog, get_credential

In [2]:
# Connect with the deriva catalog
protocol = 'https'
hostname = 'www.facebase.org'
catalog_number = 1
credential = None
# If you need to authenticate, use Deriva Auth agent and get the credential
# credential = get_credential(hostname)
catalog = ErmrestCatalog(protocol, hostname, catalog_number, credential)

In [3]:
# Get the path builder interface for this catalog
pb = catalog.getPathBuilder()

## Building a Datapath
We will build a data path by linking tables from the catalog. To make things a little easier, we will use Python variables to reference the tables. This is not necessary, but it simplifies the examples.

In [4]:
dataset = pb.isa.dataset
sample = pb.isa.sample
assay = pb.isa.assay

Build a data path by linking together different tables that are related.
By default, a data path returns entities for the _last_ linked entity set
in the path. The following data path will therefore return assays, not
datasets.

In [5]:
path = dataset.path            # a new path rooted at the "dataset" table
path.link(sample).link(assay)  # extended path dataset<-sample<-assay
print(path.uri)                # URI for this path

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay
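
The printed URI follows a regular pattern: one `alias:=schema:table` segment per table instance, joined by `/` after the `entity` keyword. As a rough sketch of that pattern only, the same string can be assembled by hand. The helper `entity_uri` is hypothetical and is not part of the deriva API; it merely reproduces the format seen in the output above.

```python
def entity_uri(base, instances):
    """Assemble an entity URI from (alias, schema, table) tuples in path order.

    Hypothetical helper for illustration; it only mimics the URI format
    printed by the notebook.
    """
    segments = [f"{alias}:={schema}:{table}" for alias, schema, table in instances]
    return f"{base}/entity/" + "/".join(segments)


base = "https://www.facebase.org/ermrest/catalog/1"
uri = entity_uri(
    base,
    [
        ("dataset", "isa", "dataset"),
        ("sample", "isa", "sample"),
        ("assay", "isa", "assay"),
    ],
)
print(uri)
```

The last segment names the `assay` instance, which matches the rule that the path returns entities for the last linked table.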


Get the entity set for this linked data path.

In [6]:
entities = path.entities()
len(entities)

171

## Filtering a Datapath

Building off of the path, a filter can be added. In this filter, the assay's
attributes may be referenced in the expressions. We did not have to split this
step from the prior step.

**Note**:
In these binary comparisons,
the left operand must be an attribute while the right operand must be a literal
value.

In [7]:
path.filter(assay.molecule_type == 'mRNA')
print(path.uri)

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA
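
It may help to see how a comparison such as `assay.molecule_type == 'mRNA'` can produce a filter expression rather than a Python boolean. The sketch below uses a hypothetical `Column` class that overloads `==` to return the `name=value` fragment seen at the end of the URI above. This is an illustration under stated assumptions, not deriva's implementation; in particular, percent-encoding the literal with `urllib.parse.quote` is an assumption about how reserved characters would be handled.

```python
from urllib.parse import quote


class Column:
    """Hypothetical stand-in for a table attribute (not a deriva class)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, literal):
        # Return a URL filter fragment instead of a boolean.
        # Percent-encoding the operands is an assumption of this sketch.
        return f"{quote(self.name, safe='')}={quote(str(literal), safe='')}"


molecule_type = Column("molecule_type")
fragment = molecule_type == "mRNA"
print(fragment)
```

Because the attribute object is the one that defines the comparison, writing it as the left operand keeps the expression unambiguous.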


In [8]:
entities = path.entities()
len(entities)

6

## Slicing EntitySets
Any entity set can be sliced too.

In [9]:
print(entities[2:4])

[{'id': 15, 'dataset': 14068, 'sample': 2, 'replicate': '5', 'sample_composition': 'maxillary process', 'sample_type': 'RNA-seq', 'molecule_type': 'mRNA', 'sample_purification': 'excision', 'markers': 'histology', 'isolation_protocol': '', 'cell_count': 'NA', 'protocol': '', 'pretreatment': 'Trizol', 'fragmentation_method': 'Fragmentation Buffer from Illumina', 'reagent': 'TruSeq stranded total RNA kit', 'reagent_source': 'Illumina', 'reagent_catalog_number': '15032619.0', 'reagent_batch_number': '', 'selection': 'totalRNA', 'library_id': 75, 'alignment_id': 55, 'tracks_id': 35}, {'id': 20, 'dataset': 14068, 'sample': 4, 'replicate': '5', 'sample_composition': 'mandibular process', 'sample_type': 'RNA-seq', 'molecule_type': 'mRNA', 'sample_purification': 'excision', 'markers': 'histology', 'isolation_protocol': '', 'cell_count': 'NA', 'protocol': '', 'pretreatment': 'Trizol', 'fragmentation_method': 'Fragmentation Buffer from Illumina', 'reagent': 'TruSeq stranded total RNA kit', 'reag

Let's see it rendered as a Pandas DataFrame.

In [10]:
entities.dataframe

| | alignment_id | cell_count | dataset | fragmentation_method | id | isolation_protocol | library_id | markers | molecule_type | pretreatment | ... | reagent_batch_number | reagent_catalog_number | reagent_source | replicate | sample | sample_composition | sample_purification | sample_type | selection | tracks_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | | 14068 | Fragmentation Buffer from Illumina | 1 | | 61 | histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 1 | medial nasal process | excision | RNA-seq | totalRNA | 21 |
| 1 | 46 | | 14068 | Fragmentation Buffer from Illumina | 6 | | 66 | histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 3 | latero nasal process | excision | RNA-seq | totalRNA | 26 |
| 2 | 55 | | 14068 | Fragmentation Buffer from Illumina | 15 | | 75 | histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 2 | maxillary process | excision | RNA-seq | totalRNA | 35 |
| 3 | 60 | | 14068 | Fragmentation Buffer from Illumina | 20 | | 80 | histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 4 | mandibular process | excision | RNA-seq | totalRNA | 40 |
| 4 | 62 | | 14130 | Fragmentation Buffer from Illumina | 25 | | 85 | Histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 1088 | face | Excision | RNA-seq | totalRNA | 43 |
| 5 | 64 | | 14130 | Fragmentation Buffer from Illumina | 30 | | 90 | Histology | mRNA | Trizol | ... | | 15032619.0 | Illumina | 5 | 1089 | face | Excision | RNA-seq | totalRNA | 43 |


# Table Instances
So far we have discussed _base_ tables. A _base_ table is a representation of the table as it is stored in the ERMrest catalog. A table _instance_ is a usage or reference of a table _within the context_ of a data path. We may link together multiple tables and thus create multiple table instances within a data path.

For example, in `dataset.path.link(sample).link(assay)` the table instance `sample` is no longer the same as the original base table `sample` because _within the context_ of this data path the `sample` entities must satisfy the constraints of the data path. The `sample` entities must reference a `dataset` entity, and they must be referenced by an `assay` entity. Thus within this path, the entity set for `sample` may be quite different from the entity set for the base table on its own.
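
The difference between a base table and a table instance can be sketched with a small in-memory example. The rows below are invented for illustration and the filtering is done by hand rather than through the deriva API; the point is only that the `sample` rows that survive inside the path are the ones that both reference a dataset and are referenced by an assay.

```python
# Toy rows, invented for illustration only (not real catalog data).
datasets = [{"id": 1}, {"id": 2}]
samples = [
    {"id": 10, "dataset": 1},
    {"id": 11, "dataset": 1},  # no assay references this sample
    {"id": 12, "dataset": 2},
]
assays = [{"id": 100, "sample": 10}, {"id": 101, "sample": 12}]

dataset_ids = {d["id"] for d in datasets}
assayed_sample_ids = {a["sample"] for a in assays}

# The "sample" instance within the path dataset <- sample <- assay:
# each row must reference a dataset and be referenced by an assay.
sample_instance = [
    s
    for s in samples
    if s["dataset"] in dataset_ids and s["id"] in assayed_sample_ids
]

print(len(samples), "rows in the base table")
print(len(sample_instance), "rows in the table instance")
```

Here sample 11 belongs to the base table but drops out of the instance because no assay refers to it.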

## Table instances are bound to the path
Whenever you initiate a data path (e.g., `table.path`) or link a table to a path (e.g., `path.link(table)`) a table instance is created and bound to the DataPath object (e.g., `path`). These table instances can be referenced via the `DataPath`'s `table_instances` container or directly as a property of the `DataPath` object itself.

In [11]:
dataset_instance = path.table_instances['dataset']
# or
dataset_instance = path.dataset

## Aliases for table instances
Whenever a table instance is created and bound to the path, it is given a name. If no name is specified for it, it will be named after its base table. For example, a table named "My Table" will result in a table instance named "My Table". Tables may appear _more than once_ in a path, and if the name is already taken, the instance will be given the base name followed by a number (e.g., "My Table2").
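
The naming rule can be sketched as a small function. This is a hypothetical helper, not deriva's code: the text above only states that a second "My Table" becomes "My Table2", so starting the counter at 2 and incrementing it on further collisions are assumptions of this sketch.

```python
def instance_name(base_name, taken):
    """Pick a name for a new table instance given the names already in use.

    Hypothetical illustration of the rule described above; the behavior
    after the first collision (3, 4, ...) is an assumption.
    """
    if base_name not in taken:
        return base_name
    number = 2
    while f"{base_name}{number}" in taken:
        number += 1
    return f"{base_name}{number}"


taken = set()
for _ in range(3):
    name = instance_name("My Table", taken)
    taken.add(name)
    print(name)
```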

You may wish to specify the name of your table instance. In database terms, these alternate names are called "aliases".

In [12]:
path.link(dataset.alias('D'))

<deriva_common.datapath.DataPath at 0x10b3675f8>

In [13]:
path.D.uri

'https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA/D:=isa:dataset'

You'll notice that in this path we added an additional _instance_ of the `dataset` table from our catalog model. In addition, we linked it to the `isa.assay` table. This was possible because in this model, there is a foreign key reference from the base table `assay` to the base table `dataset`. The entities for the table instance named `dataset` and the instance named `D` will likely consist of different entities because the constraints for each are different.

## Selecting Attributes From Linked Entities

Returning to the initial example, if we want to project additional attributes
from other entities in the DataPath, we need to be able to reference the
"table instances" at any point in the path. First, we will build our original path.

In [14]:
path = dataset.path.link(sample).link(assay).filter(assay.molecule_type == 'mRNA')
print(path.uri) 

https://www.facebase.org/ermrest/catalog/1/entity/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA


Now let's fetch an entity set with attributes pulled from each of the table instances in the path.

In [15]:
entities = path.entities(path.dataset.accession, local_sample_id=path.sample.local_identifier, assay_molecule=path.assay.molecule_type)
print(entities.uri)

https://www.facebase.org/ermrest/catalog/1/attribute/dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA/dataset:accession,local_sample_id:=sample:local_identifier,assay_molecule:=assay:molecule_type
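
The projection at the end of this URI has a regular shape: positional attributes appear as `instance:column`, and keyword attributes appear as `alias:=instance:column`, all separated by commas. The sketch below rebuilds the URI from the path URI printed earlier. The `projection` helper is hypothetical and not part of the deriva API; it only reproduces the format seen in the output above.

```python
def projection(*columns, **aliased):
    """Build the attribute list from (instance, column) tuples.

    Hypothetical helper: positional tuples become "instance:column" and
    keyword tuples become "alias:=instance:column".
    """
    parts = [f"{inst}:{col}" for inst, col in columns]
    parts += [f"{alias}:={inst}:{col}" for alias, (inst, col) in aliased.items()]
    return ",".join(parts)


path_uri = (
    "https://www.facebase.org/ermrest/catalog/1/entity/"
    "dataset:=isa:dataset/sample:=isa:sample/assay:=isa:assay/molecule_type=mRNA"
)
suffix = projection(
    ("dataset", "accession"),
    local_sample_id=("sample", "local_identifier"),
    assay_molecule=("assay", "molecule_type"),
)
# The attribute URI uses "attribute" in place of "entity" and appends the projection.
attribute_uri = path_uri.replace("/entity/", "/attribute/", 1) + "/" + suffix
print(attribute_uri)
```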


**Notice** that the `EntitySet` also has a `uri` property. This URI may differ from the original path URI because the attribute projection does not get appended to the path URI.

In [16]:
path.uri == entities.uri

False

As usual, `fetch(...)` the entities from the catalog.

In [17]:
entities.fetch(limit=5)
for e in entities:
    print(e)

{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MNP', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_LNP', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MX', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000806.2', 'local_sample_id': 'E11.5_MD', 'assay_molecule': 'mRNA'}
{'accession': 'FB00000807.2', 'local_sample_id': 'CS22_11865', 'assay_molecule': 'mRNA'}
