episb-provider interface #6

Closed
nsheff opened this issue Nov 27, 2018 · 26 comments

Comments

@nsheff
Member

commented Nov 27, 2018

Provider design interface

In order to link a provider to a hub, we must specify an interface for that connection, which we call the provider interface (other suggestions for a name?). Each data provider must provide this provider interface, which is essentially a summary of what kind of data the provider provides. Here is an example of such an interface, which describes 3 experiments that all subscribe to one segmentation:

{
	"provider_name": "regdb",
	"provider_description": "Shefflab Regulatory Elements Data Provider",
	"segmentations": [
		{"@id": "http://episb.org/segmentations/DHS",
			"experiments": [
			{"@id": "experiment1",
			"celltype": "HUVEC",
			"description": "ChIP-seq in HUVEC cells for factor X",
			"annotation_key": "value",
			"annotation_range_start": 0,
			"annotation_range_end": 1000},
			{"@id": "experiment2",
			"celltype": "MSC",
			"description": "ChIP-seq in MSC cells for factor Y",
			"annotation_key": "value",
			"annotation_range_start": 0,
			"annotation_range_end": 1000},
			{"@id": "experiment3",
			"description": "CpG island annotation",
			"annotation_key": "value",
			"annotation_range_start": 0,
			"annotation_range_end": 1}
			]
		}]
}

We need to:

  • Create such a design interface for this provider.
  • Develop and document the format for how such files can be created generically.

This issue could really go either in the provider or the hub since it is the connection between the two.
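Documenting the format could start from a small checker that states which fields an interface must carry. The following is a minimal Python sketch, not part of episb-provider: the function name `validate_interface`, the choice of required keys, and the experiment `broken` are assumptions made for illustration (`celltype` is treated as optional because `experiment3` omits it).

```python
# Keys assumed to be mandatory for every experiment in this sketch.
REQUIRED_EXPERIMENT_KEYS = (
    "@id", "annotation_key", "annotation_range_start", "annotation_range_end")


def validate_interface(interface):
    """Return a list of human-readable problems found in an interface dict."""
    problems = []
    for key in ("provider_name", "provider_description", "segmentations"):
        if key not in interface:
            problems.append("missing top-level key: " + key)
    for segmentation in interface.get("segmentations", []):
        seg_id = segmentation.get("@id", "<no @id>")
        for experiment in segmentation.get("experiments", []):
            exp_id = experiment.get("@id", "<no @id>")
            missing = [k for k in REQUIRED_EXPERIMENT_KEYS if k not in experiment]
            for key in missing:
                problems.append("%s/%s: missing %s" % (seg_id, exp_id, key))
            # Only compare the bounds when both are present.
            if not missing and (experiment["annotation_range_start"]
                                > experiment["annotation_range_end"]):
                problems.append(
                    "%s/%s: range start exceeds range end" % (seg_id, exp_id))
    return problems


example = {
    "provider_name": "regdb",
    "provider_description": "Shefflab Regulatory Elements Data Provider",
    "segmentations": [{
        "@id": "http://episb.org/segmentations/DHS",
        "experiments": [
            {"@id": "experiment3", "description": "CpG island annotation",
             "annotation_key": "value",
             "annotation_range_start": 0, "annotation_range_end": 1},
            # Hypothetical malformed entry, included to show a reported problem.
            {"@id": "broken", "annotation_key": "value",
             "annotation_range_start": 5, "annotation_range_end": 1},
        ],
    }],
}
problems = validate_interface(example)
print(problems)
```

Returning a list of problems rather than raising on the first one lets a provider admin see every issue in a single pass.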

@nsheff

Member Author

commented Nov 27, 2018

I think probably this depends on solving #7 first.

@nsheff

Member Author

commented Nov 28, 2018

Discussion in databio/episb-hub#5

Once this data is in the data provider, it could probably automatically create the provider interface based on what it knows about the data it has. So, you're right, being an API point does make more sense, because it means this interface file doesn't have to be created manually.

So this interface file should probably be provided by an API point (though it may be nice if it could also just be a file, for the use case of a data provider that doesn't provide an API but just some file-server data)

@nsheff added this to the version 0.2 milestone on Nov 28, 2018

@oddodaoddo

Contributor

commented Jan 3, 2019

I am working on this issue now. If I understand it correctly, we are supposed to extrapolate the annotation value range from a file such as DHS and include it in the segmentation interface? Right now, all I am doing is getting the segments out of the files, to create segmentations. It is a "dumb" way of doing so because it does these things verbatim. This means that we do not check whether there are overlapping segmentations already present in the database, for example (we may or may not want to do this check?). It is not difficult to "learn" the annotation value range from a bed file used for a segmentation. I just want to make sure I understand what I am doing here 😁. Thanks!

@nsheff

Member Author

commented Jan 3, 2019

yeah, I think so... we shouldn't have to extrapolate anything, though. we should be able to just grab it directly, using min and max functions or something.

I don't see how overlapping segments (segmentations?) have anything to do with this.

for each experiment, I just want to know the range of values it takes, that's it.
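Grabbing the range directly with min and max could look like the sketch below. It is only an illustration: the function name `annotation_range` is hypothetical, and the default column (0-based index 4, the BED score column) is an assumption, since the thread does not say which column holds the annotation value.

```python
def annotation_range(lines, value_column=4):
    """Return (min, max) of the numeric values in one column of BED-like lines.

    `value_column` is a 0-based index; the default is an assumption.
    """
    lowest = highest = None
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header/comment lines BED files may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        value = float(line.split("\t")[value_column])
        lowest = value if lowest is None else min(lowest, value)
        highest = value if highest is None else max(highest, value)
    if lowest is None:
        raise ValueError("no annotation values found")
    return lowest, highest


# Made-up rows for illustration only.
sample = [
    "chr1\t100\t200\tpeak1\t12.5",
    "chr1\t300\t400\tpeak2\t0",
    "chr2\t50\t90\tpeak3\t1000",
]
start, end = annotation_range(sample)
print(start, end)
```

Because the function consumes lines one at a time, it can be handed an open file object and never needs the whole file in memory.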

@oddodaoddo

Contributor

commented Jan 3, 2019

Dr. Sheffield, it was just a comment so that you understand there is no extra processing involved when grabbing the segmentations out of bed files. That's all. Thank you for answering - yes, by learning I meant min/max.

@oddodaoddo

Contributor

commented Jan 4, 2019

@nsheff: quick question. After re-reading the above ticket, I am curious about something: we are separating segmentations from annotations and making experiments lists of (annotation_value, segmentation:segment_id) tuples. In your description above, it looks like we have 3 experiments that subscribe to the same segmentation.

q: Were the annotation ranges derived from the experiments or from the time a segmentation was created? For example, did we run through the DHS bed file to create a DHS segmentation and at the same time we grabbed the annotation values and got a min/max to use in the design interface? Or did we "back-fill" this annotation value from an independent experiment that happens to subscribe to the segmentation?

@oddodaoddo

Contributor

commented Jan 4, 2019

Never mind, it seems obvious we are back-filling these from the experiments that are subscribing to the segmentation...

@nsheff

Member Author

commented Jan 4, 2019

definitely not from when a segmentation is created, because this doesn't involve any annotations, which can be added afterwards.

Yeah, I think you're right... it could be done at the time the experiment is created, though.

@oddodaoddo

Contributor

commented Jan 4, 2019

There are two options: when an experiment is added, or a periodic sweep that reconciles things. I think the former is better.
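The first option, recording the range at the moment an experiment is added, might be sketched as follows. This is an in-memory illustration only; the class `InterfaceRegistry` and its method are hypothetical names, not the provider's actual code.

```python
class InterfaceRegistry:
    """Keeps one interface entry per experiment, written when it is added."""

    def __init__(self):
        # Maps (segmentation_id, experiment_id) to the entry dict.
        self.entries = {}

    def add_experiment(self, segmentation_id, experiment_id, values):
        """Record the experiment and its value range in a single step."""
        values = list(values)
        if not values:
            raise ValueError("experiment has no annotation values")
        entry = {
            "segmentation": segmentation_id,
            "@id": experiment_id,
            "annotation_range_start": min(values),
            "annotation_range_end": max(values),
        }
        self.entries[(segmentation_id, experiment_id)] = entry
        return entry


registry = InterfaceRegistry()
# Made-up annotation values for illustration.
registry.add_experiment("DHS", "experiment1", [3, 0, 1000, 42])
registry.add_experiment("DHS", "experiment3", [0, 1, 1, 0])
print(registry.entries[("DHS", "experiment1")])
```

Since the range is computed while the values are already in hand, no later pass over stored data is needed to keep the interface current.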

@nsheff

Member Author

commented Jan 4, 2019

yeah, I agree.

@oddodaoddo

Contributor

commented Jan 5, 2019

Now that I finished #15 I have an easy path to finish this ticket. Will be working on it tonight.

@oddodaoddo

Contributor

commented Jan 7, 2019

OK, rough code for this is being committed. Here is an example screenshot of an interface that was created as part of loading a test segmentation / test experiment pairing.
[Screenshot: interfaces]

@oddodaoddo

Contributor

commented Jan 7, 2019

I have a lot of FIXMEs in the code but the general trajectory is that we are using our own REST services to add experiments, design interfaces etc. There are a lot of issues to take care of down the road such as network timeouts, graceful handling of errors, large file uploads etc. etc. It is a start at least - we have the functionality to add experiments with exact segment matches right now, as well as functionality to add design interfaces... 😃

@nsheff

Member Author

commented Jan 8, 2019

what do you mean by "add design interfaces"? These are not added, but returned automatically by the server based on what segmentations/experiments it knows about.

@oddodaoddo

Contributor

commented Jan 8, 2019

The server "knows" what segmentations/experiments it is keeping by storing these documents verbatim in Elasticsearch, in the index "interfaces". Every time an experiment is added, a design interface describing the segmentation/experiment pairing has to be added to the interfaces index. Finding out what design interfaces are available is a matter of recalling documents from this index by segmentation or experiment name (I have to add API points to expose this functionality).
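The store-then-recall pattern described here can be illustrated without a running Elasticsearch instance. The sketch below uses a plain Python list as a stand-in for the "interfaces" index; the class name and the field names `segmentationName` and `experimentName` are assumptions, not the actual document schema.

```python
class InterfacesIndex:
    """In-memory stand-in for the "interfaces" index described above."""

    def __init__(self):
        self._documents = []

    def add(self, document):
        # Store a copy so later changes by the caller do not alter the index.
        self._documents.append(dict(document))

    def find(self, **criteria):
        """Return copies of documents whose fields equal every criterion."""
        return [
            dict(doc) for doc in self._documents
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


index = InterfacesIndex()
index.add({"segmentationName": "DHS", "experimentName": "experiment1"})
index.add({"segmentationName": "DHS", "experimentName": "experiment2"})
index.add({"segmentationName": "other-segmentation",
           "experimentName": "experiment3"})

by_segmentation = index.find(segmentationName="DHS")
by_experiment = index.find(experimentName="experiment3")
print(len(by_segmentation), by_experiment)
```

A real deployment would replace `find` with a query against the index, but the contract stays the same: one document per pairing, looked up by either name.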

@oddodaoddo

Contributor

commented Jan 8, 2019

If the above is satisfactory, can we close this ticket? Thanks!

@nsheff

Member Author

commented Jan 8, 2019

the way I imagine it, there is 1 interface, which describes all the data in that data provider.
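One way to reconcile per-pairing documents with a single interface is to fold them together at request time. The sketch below assumes each stored pairing is a dict with a hypothetical `segmentation` key; `merge_interfaces` is an illustrative name, and the output mirrors the shape of the example at the top of this issue.

```python
import json


def merge_interfaces(name, description, pairings):
    """Fold per-experiment pairing documents into one provider interface."""
    by_segmentation = {}
    for pairing in pairings:
        fields = dict(pairing)  # copy so the stored document is not mutated
        segmentation_id = fields.pop("segmentation")
        by_segmentation.setdefault(segmentation_id, []).append(fields)
    return {
        "provider_name": name,
        "provider_description": description,
        "segmentations": [
            {"@id": seg_id, "experiments": experiments}
            for seg_id, experiments in by_segmentation.items()
        ],
    }


pairings = [
    {"segmentation": "http://episb.org/segmentations/DHS",
     "@id": "experiment1", "annotation_key": "value",
     "annotation_range_start": 0, "annotation_range_end": 1000},
    {"segmentation": "http://episb.org/segmentations/DHS",
     "@id": "experiment3", "annotation_key": "value",
     "annotation_range_start": 0, "annotation_range_end": 1},
]
interface = merge_interfaces(
    "regdb", "Shefflab Regulatory Elements Data Provider", pairings)
print(json.dumps(interface, indent=2))
```

With this approach the pairing documents remain the stored source of truth, and the single interface is only a view built when it is requested.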

let's close it once we have a functional, documented API point that can be pasted into the subscription interface (once it works).

The place to document this would be here: http://code.databio.org/episb/provider-api/

@nsheff

Member Author

commented Jan 8, 2019

Can you show/document how to retrieve the design interface for our current data provider?

@oddodaoddo

Contributor

commented Jan 10, 2019

episb-provider now includes an API point such as:

/segmentations/get/all -> returns a list of all design interfaces owned by the provider.

I am preparing documentation on how to use all the API points next.

@oddodaoddo

Contributor

commented Jan 18, 2019

For now I think the provider interface can be something like:

{
    providerName:String,
    providerDescription:String,
    providerInstitution:String,
    providerAdmin:String,
    providerAdminContact:String,
    segmentationProvider:Boolean,
    segmentationsStored:Int,
    experimentsStored:Int,
    regionsStored:Int,
    segmentationsAPIpt:String,
    experimentsAPIpt:String
}

although I am not entirely certain that we need to list the actual segmentations and experiments API points - they can be assumed to be uniformly the same across all providers that declare the field segmentationProvider:Boolean to be "true".

This file can be a json file that's manually edited by the provider admin and assumed to exist in a certain (well documented) location.

Thoughts?

@oddodaoddo

Contributor

commented Jan 18, 2019

Now that I slept on it, I don't like it. We can use the file we already use to configure the episb-provider (a HOCON file in a specific location). That way some of the above can be populated from the config file, and the rest can be reported live when the /provider-interface/ link is visited.

@oddodaoddo

Contributor

commented Jan 18, 2019

If we use the episb-provider/src/main/resources/application.conf file that we have been using for other configuration related things, like host:port of elasticsearch etc., we can have something like the following in it:

    provider-name = "SheffieldLab"
    provider-description = "API and DATA provider for EPISB"
    provider-institution = "University of Virginia, Sheffield Lab of Computational Biology"
    provider-admin = "Ognen Duzlevski"
    provider-contact = "od5t@virginia.edu"
    segmentation-provider = true

This will be used to complete the above mentioned JSON document, some of which can be enriched on the fly.

The API point to submit to the hub would be: http://episb-provider-url/provider-interface
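Combining static config fields with counts computed on the fly could work roughly as sketched below. Several things here are assumptions: the key-to-field mapping is inferred from this thread, the counts are placeholders, and `parse_flat_settings` handles only flat `key = value` lines, so it is not a HOCON parser.

```python
import json

# Assumed mapping from config keys to JSON fields, inferred from the thread.
CONFIG_TO_JSON = {
    "provider-name": "providerName",
    "provider-description": "providerDescription",
    "provider-institution": "providerInstitution",
    "provider-admin": "providerAdmin",
    "provider-contact": "providerAdminContact",
    "segmentation-provider": "segmentationProvider",
}


def parse_flat_settings(text):
    """Parse flat `key = value` lines; this is NOT a full HOCON parser."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines without "=" (such as block braces) and comments are skipped.
        if "=" not in line or line.startswith(("#", "//")):
            continue
        key, _, raw = line.partition("=")
        raw = raw.strip()
        if raw in ("true", "false"):
            value = raw == "true"
        else:
            value = raw.strip('"')
        settings[key.strip()] = value
    return settings


def provider_interface(settings, live_counts):
    """Combine static config fields with counts computed at request time."""
    document = {
        json_key: settings[config_key]
        for config_key, json_key in CONFIG_TO_JSON.items()
        if config_key in settings
    }
    document.update(live_counts)
    return document


config_text = '''
provider-name = "SheffieldLab"
provider-description = "API and DATA provider for EPISB"
segmentation-provider = true
'''
# The counts are placeholders; a real provider would query its datastore.
doc = provider_interface(parse_flat_settings(config_text),
                         {"segmentationsStored": 0, "experimentsStored": 0})
print(json.dumps(doc, indent=2))
```

Keeping the mapping in one table makes it easy to see which fields are static and which must be supplied live.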

@nsheff

Member Author

commented Jan 18, 2019

I think I understand -- but your example doesn't have the configuration related things you mentioned, like host:port of elasticsearch.

@oddodaoddo

Contributor

commented Jan 18, 2019

> I think I understand -- but your example doesn't have the configuration related things you mentioned, like host:port of elasticsearch.

Correct, I just added the section that was of relevance here. So, for example, the full one would be:

episb-utils {
    elastic-host = provider.episb.org
    elastic-port = 9300
    elastic-cluster-name = episb-elastic-cluster
    episb-provider-url = provider.episb.org
    episb-provider-url-base = "/episb-provider"
    episb-provider-port = 8080

    provider-name = "SheffieldLab"
    provider-description = "API and DATA provider for EPISB"
    provider-institution = "University of Virginia, Sheffield Lab of Computational Biology"
    provider-admin = "Ognen Duzlevski"
    provider-contact = "od5t@virginia.edu"
    segmentation-provider = true
}

@oddodaoddo

Contributor

commented Jan 19, 2019

A call to http://provider.episb.org/episb-provider/provider-interface will return something like:

{
    "result": [
        {
            "providerName": "SheffieldLab",
            "providerDescription": "API and DATA provider for EPISB",
            "providerInstitution": "University of Virginia, Sheffield Lab of Computational Biology",
            "providerAdmin": "Ognen Duzlevski",
            "providerAdminContact": "od5t@virginia.edu",
            "segmentationsProvided": true,
            "segmentationsNo": 6,
            "regionsNo": 3534878,
            "annotationsNo": 571339,
            "experimentsNo": 1
        }
    ],
    "error": "None"
}
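A hub-side consumer might unpack this envelope as in the sketch below. The function name is hypothetical, and treating the string "None" as "no error" is an assumption read off the sample response rather than documented behaviour. The body is an abbreviated copy of the sample, so no network call is made.

```python
import json


def parse_provider_response(body):
    """Return the first result from a {"result": [...], "error": ...} envelope."""
    envelope = json.loads(body)
    error = envelope.get("error")
    # Assumption: the literal string "None" signals that no error occurred.
    if error not in (None, "None"):
        raise RuntimeError("provider reported an error: %s" % error)
    results = envelope.get("result") or []
    if not results:
        raise ValueError("response contained no provider interface")
    return results[0]


# Abbreviated copy of the sample response, used instead of a live call.
sample_body = ('{"result":[{"providerName":"SheffieldLab",'
               '"segmentationsProvided":true,"segmentationsNo":6,'
               '"experimentsNo":1}],"error":"None"}')
info = parse_provider_response(sample_body)
print(info["providerName"], info["segmentationsNo"])
```

Checking the `error` field before touching `result` keeps a failed request from being mistaken for an empty provider.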
@oddodaoddo

Contributor

commented Jan 20, 2019

I am loading segmentations and experiments one by one. provider-interface is implemented on the episb-provider side. I am going to create separate tickets to change the documentation and add explanation of provider-interface API point, as well as tickets to work on the episb-hub side of things.
