Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Classical linkage and haplotype datatypes #4894

Merged
merged 13 commits into from Nov 24, 2017

Conversation

Projects
None yet
4 participants
@mtekman
Copy link
Contributor

commented Oct 30, 2017

Datatypes for pedigree analysis commonly used in programs such as:

Here common input linkage formats are defined (compatible with: Allegro, Genehunter, Merlin, Simwalk, and Swiftlink):

  • pedin
  • datain
  • mapin

Input formats for Alohomora are also defined which can generate the above input formats for more upstream analysis:

  • genotype matrix
  • pre-makeped pedigree
  • minor allele frequency table
  • pre-makeped marker map

Output linkage formats will default to Allegro conversion (via a multi-purpose conversion script wrapped within the tool XML) that will yield the Allegro output types:

  • ihaplo
  • descent
  • LOD
(squashed)
Added common linkage file formats and sniffers

@mtekman mtekman force-pushed the mtekman:linkage branch from f3d17af to 07ab214 Nov 6, 2017

@mtekman

This comment has been minimized.

Copy link
Contributor Author

commented Nov 6, 2017

review request please - @bgruening

@bgruening
Copy link
Member

left a comment

@mtekman thanks a few initial comments-

<!-- Common input formats: Generated by alohomora, but user may also upload these manually -->
<datatype extension="linkage_datain" type="galaxy.datatypes.genetics:DataIn" />
<datatype extension="linkage_pedin" type="galaxy.datatypes.genetics:PreMakePed" />
<datatype extension="linkage_map" type="galaxy.datatypes.genetics:MarkerMap" />

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

please don't include white space to align attributes

# iterate whole file without errors
self.eof_res = True

@staticmethod

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Why does this need to be a static method?
If you subclass it you will have access to this function, imho.

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

I assumed it was bad practice, but okay will remove

self.eof_res = True

@staticmethod
def __is_binary_file(fstream):

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Is it really needed to check for binary? All your files are text right?

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

It was required locally when testing in isolation, but will remove

return result

@staticmethod
def tokenizer(line, sep=None):

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Mh, if line is a string, the first line of a file, why do you use splitlines?
If the only purpose is to split a string into tokens split() should be enough.

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

force of habit from my mac2unix/dos2unix days -- will remove

"""
self.lcount += 1
if self.lcount > self.max_lines:
return -1

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Do you mean False here?

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

no:

  • False would be passed to calling function which would then stop all further processing and reject the file
  • True would ..... and accept the file
  • None would tell the calling function that it needs to keep processing
  • -1 tells the calling function to stop further processing (max lines exceeded) and to evaluate the eof function

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

using integers as return values by default seemed less readable, and I didn't want to implement needless enums

def sniffer(self, filename):
with open(filename, "r") as fio:

if LinkageStudies.__is_binary_file(fio):

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

This is not needed here, right?

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

faster in local testing, but will remove



class PreMakePed(LinkageStudies):
"""

This comment has been minimized.

Copy link
@bgruening
"""
>>> cname = "PreMakePed"
>>> from galaxy.datatypes.sniff import get_test_fname
>>> extn_true = eval(cname)().file_ext

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Don't over engineer things, you know all the extensions and they should stay stable.
xref: https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

Sorry, this was leftover back when I was trying to implement inheritable doc-tests. Will cleanup asap

@galaxyproject galaxyproject deleted a comment from mtekman Nov 11, 2017

@galaxyproject galaxyproject deleted a comment from mtekman Nov 11, 2017

def header_check(self, fio):
header = fio.readline().splitlines()[0].split()
try:
if header[0] == "family" and header[1] == "location":

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

Can you not simply compare the lists?

# iterate whole file without errors
self.eof_res = True

def tokenizer(self, line, sep=None):

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

This function becomes now obsolete, right? It is exactly what split() is doing.

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 11, 2017

Author Contributor

I initially sought to standardize, but yes it's likely extra overhead now

return line_res

return self.eof_function()
return False

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

This return will never be reached.


if self.num_cols == -1:
self.num_cols = len(tokens)

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

these line-breaks can be removed, if-else belong together.

mtekman added some commits Nov 11, 2017

def header_check(self, fio):
header = fio.readline().splitlines()[0].split()
try:
if header[:4] == ["family", "location", "LOD", "marker"]:

This comment has been minimized.

Copy link
@bgruening

bgruening Nov 11, 2017

Member

remove the try, except and ask in the same if condition if it has 4 elements ...

@mtekman mtekman force-pushed the mtekman:linkage branch from 7d39078 to c9f4439 Nov 20, 2017

@erasche
Copy link
Member

left a comment

Looks like it is getting into pretty good shape. Is this out of WIP?

if not self.header_check(fio):
return False

for line in fio:

This comment has been minimized.

Copy link
@erasche

erasche Nov 21, 2017

Member

use something like

for lcount, line in enumerate(fio)
    if lcount > ...

instead and then you can remove self.lcount everywhere. It seems strange since it isn't the real lcount, it's "the number of lines I've seen so far as long as it isn't greater than 10". I would find myself mis-using that variable if I had not read the code closely enough.

Text.__init__(self, **kwd)
self.lcount = 0
self.max_lines = 10
# iterated whole file without errors

This comment has been minimized.

Copy link
@erasche

erasche Nov 21, 2017

Member

iterated the whole .. first ten lines of the file without errors.

This comment has been minimized.

Copy link
@mtekman

mtekman Nov 24, 2017

Author Contributor

Thanks, a lot of these are legacy from when the LinkageStudies superclass actually did something. Should be fixed now!

self.lcount = 0
self.max_lines = 10
# iterated whole file without errors
self.eof_res = True

This comment has been minimized.

Copy link
@erasche

erasche Nov 21, 2017

Member

I'm not a big fan of this. Compared to most other datatypes I've read / written which just return true if they want to return true rather than returning a class variable.

@mtekman mtekman changed the title [WIP] Classical linkage and haplotype datatypes Classical linkage and haplotype datatypes Nov 24, 2017

@mtekman mtekman changed the title Classical linkage and haplotype datatypes Add Classical linkage and haplotype datatypes Nov 24, 2017

@erasche

This comment has been minimized.

Copy link
Member

commented Nov 24, 2017

Great work, thanks @mtekman

@erasche erasche merged commit 5c6e4a5 into galaxyproject:dev Nov 24, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: JavaScript No alert changes
Details

@mtekman mtekman deleted the mtekman:linkage branch Nov 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.