Biom datatype #941

Closed
fescudie opened this issue Oct 19, 2015 · 8 comments

@fescudie
Contributor

Hi,

In metagenomics, the standard format for sample-by-observation contingency tables is BIOM (http://biom-format.org/). Unfortunately, Galaxy has no datatype for it. Could you add one?
Below is the implementation we use in our instance.

Add the datatype to datatypes_conf.xml, and register its sniffer just before Json so that BIOM v1 files, which are valid JSON, are tested against Biom1 first:

<registration converters_path="lib/galaxy/datatypes/converters">
    ...
    <datatype extension="biom1" type="galaxy.datatypes.text:Biom1" display_in_upload="True" subclass="True" mimetype="application/json" />
    ...
</registration>
<sniffers>    
    ...
    <sniffer type="galaxy.datatypes.text:Biom1"/>
    <sniffer type="galaxy.datatypes.text:Json"/>
    ...
</sniffers>
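
The ordering constraint above can be checked mechanically. The following is a minimal sketch using only the Python standard library; the helper name `sniffer_precedes` is hypothetical, and the XML string is a trimmed stand-in for datatypes_conf.xml whose `<datatypes>` root element is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for datatypes_conf.xml; the <datatypes> root is an assumption.
CONF = """
<datatypes>
  <registration converters_path="lib/galaxy/datatypes/converters">
    <datatype extension="biom1" type="galaxy.datatypes.text:Biom1" display_in_upload="True" subclass="True" mimetype="application/json" />
  </registration>
  <sniffers>
    <sniffer type="galaxy.datatypes.text:Biom1"/>
    <sniffer type="galaxy.datatypes.text:Json"/>
  </sniffers>
</datatypes>
"""


def sniffer_precedes(conf_xml, first, second):
    """Return True if sniffer `first` is listed before sniffer `second`."""
    root = ET.fromstring(conf_xml)
    # Collect sniffer types in document order.
    order = [sniffer.get("type") for sniffer in root.iter("sniffer")]
    if first not in order or second not in order:
        return False
    return order.index(first) < order.index(second)


print(sniffer_precedes(CONF, "galaxy.datatypes.text:Biom1", "galaxy.datatypes.text:Json"))
```

If the two `<sniffer>` lines were swapped, the function would return False, which is the situation in which a generic JSON sniffer could claim a BIOM v1 file first.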

Add the Biom and Biom1 classes to lib/galaxy/datatypes/text.py (module galaxy.datatypes.text); the sniffer needs json and os to be imported there:

class Biom( Text ):
    file_ext = "biom"

    def set_peek( self, dataset, is_multi_byte=False ):
        if not dataset.dataset.purged:
            dataset.peek = get_file_peek( dataset.file_name, is_multi_byte=is_multi_byte )
            dataset.blurb = "Biological Observation Matrix"
        else:
            dataset.peek = 'file does not exist'
            dataset.blurb = 'file purged from disc'

    def display_peek( self, dataset ):
        try:
            return dataset.peek
        except Exception:
            # Fall back to a generic description when no peek is available.
            return "BIOM file (%s)" % ( nice_size( dataset.get_size() ) )

class Biom1( Biom ):
    edam_format = "format_3464"
    file_ext = "biom1"

    def set_peek( self, dataset, is_multi_byte=False ):
        super(Biom1, self).set_peek( dataset, is_multi_byte )
        if not dataset.dataset.purged:
            dataset.blurb = "Biological Observation Matrix v1"

    def sniff( self, filename ):
        return self._looks_like_biom( filename )

    def _looks_like_biom( self, filepath, check_limit_size=104857600 ):
        """
        @param filepath: [str] The path to the evaluated file.
        @param check_limit_size: [int] The maximum size of the checked file (in
                                 bytes). Larger files are not parsed and are
                                 therefore never recognised as BIOM v1.
        """
        biom_expected_fields = ["id", "format", "format_url", "type", "generated_by", "date", "rows", "columns", "matrix_type", "matrix_element_type", "shape", "data"]
        try:
            if os.path.getsize(filepath) >= check_limit_size:
                return False
            # Close the file handle once the JSON has been parsed.
            with open(filepath, "r") as handle:
                biom = json.load( handle )
        except (IOError, OSError, ValueError):
            # Unreadable file or invalid JSON: not BIOM v1.
            return False
        # A BIOM v1 document is a JSON object holding every expected field.
        return isinstance( biom, dict ) and all( field in biom for field in biom_expected_fields )
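
The field check at the heart of the sniffer can be tried outside Galaxy. The sketch below is a standalone approximation, not the Galaxy code itself: `looks_like_biom1` and `write_temp_json` are hypothetical helper names, the size limit is left out for brevity, and the values in `minimal_biom` are illustrative placeholders rather than a real dataset.

```python
import json
import os
import tempfile

# Top-level keys required by the sniffer in this issue.
BIOM1_FIELDS = ["id", "format", "format_url", "type", "generated_by", "date",
                "rows", "columns", "matrix_type", "matrix_element_type",
                "shape", "data"]


def looks_like_biom1(path):
    """Return True if `path` holds a JSON object with every expected key."""
    try:
        with open(path) as handle:
            doc = json.load(handle)
    except (OSError, ValueError):
        # Unreadable file or invalid JSON.
        return False
    return isinstance(doc, dict) and all(field in doc for field in BIOM1_FIELDS)


def write_temp_json(doc):
    """Write `doc` as JSON to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as handle:
        json.dump(doc, handle)
    return path


# Illustrative values only; this is not a real dataset.
minimal_biom = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "example",
    "date": "2015-10-19T00:00:00",
    "rows": [{"id": "OTU_1", "metadata": None}],
    "columns": [{"id": "Sample_1", "metadata": None}],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [1, 1],
    "data": [[0, 0, 5]],
}

biom_path = write_temp_json(minimal_biom)
plain_path = write_temp_json({"hello": "world"})
biom_result = looks_like_biom1(biom_path)
plain_result = looks_like_biom1(plain_path)
os.remove(biom_path)
os.remove(plain_path)
print(biom_result, plain_result)
```

The second file is valid JSON but lacks the BIOM keys, so only the first one passes; this is the distinction that lets Biom1 sit ahead of the generic Json sniffer.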

Note: BIOM exists in two main versions, v1 (JSON) and v2 (HDF5). Some tools accept only v1, others only v2, and others accept both. This is why we implemented Biom1 and Biom2 as subclasses of Biom.

Thanks in advance.

@martenson
Member

Hi @fescudie, and thanks for the idea. Is there a reason you did not open a PR, given that you already have an implementation in use?

@nturaga
Contributor

nturaga commented Oct 19, 2015

@fescudie Just out of curiosity, are there any tools in the Galaxy Tool Shed that already use this datatype?

@martenson
Member

@nitesh1989 I would assume these do: https://github.com/geraldinepascal/FROGS/tree/master/tools
They don't seem to be in the Tool Shed, though.

@nturaga
Contributor

nturaga commented Oct 19, 2015

@martenson Sweet, thanks!

@fescudie
Contributor Author

@martenson I have not opened a PR because I'm not yet very used to Git.

@hexylena
Member

@fescudie I'm happy to open a PR for this on your behalf (with you listed as author) if you'd like; otherwise, I'm happy to point you to resources for learning Git. I'm very excited to see BIOM datatypes in Galaxy!

@fescudie
Contributor Author

The pull request has been created: #950

@fescudie
Contributor Author

The pull request #950 has been merged.
Thanks.
