Description
Currently, snps normalizes data into a dataframe with four columns named rsid, chrom, pos, and genotype.
genotype can either be np.nan or a string of length 1 or 2. For autosomal SNPs and the X chromosome, the genotype is always a length 2 string. See #43.
Genotypes on the Y chromosome and mtDNA are often strings of length 1; however, SNPs in the pseudoautosomal regions of the X and Y chromosomes often have two alleles reported.
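To make the mixed lengths concrete, here is a small sketch of what the normalized dataframe might look like. The rows are invented for illustration and do not come from any real file; only the four column names are taken from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only.
df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["1", "X", "Y", "MT"],
        "pos": [1000, 2000, 3000, 4000],
        "genotype": ["AG", "CC", "T", np.nan],
    }
)

# Length of each genotype string; missing genotypes stay missing.
genotype_len = df["genotype"].str.len()
print(genotype_len)
```

Any code that consumes genotype has to handle all three cases (length 2, length 1, and missing).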
So, to better handle the varying number of alleles reported for the X and Y chromosomes and mtDNA, consider refactoring the genotype column into two columns, allele1 and allele2.
Note that this would also support phased genotypes more naturally than indexing a length 2 genotype string: allele1 could hold the alleles on one chromosome, and allele2 the alleles on the other. See #44.
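A minimal sketch of the proposed split, assuming pandas and a dataframe shaped as described above. The helper name split_genotype and the sample rows are hypothetical and not part of snps; a single reported allele goes into allele1 and leaves allele2 missing, which is one possible convention rather than a settled design.

```python
import numpy as np
import pandas as pd


def split_genotype(df):
    """Return a copy of df with genotype replaced by allele1 and allele2.

    Hypothetical helper, not part of snps.
    """
    out = df.copy()
    genotype = out["genotype"]
    # .str[i] yields NaN for missing genotypes and for strings shorter than i + 1.
    out["allele1"] = genotype.str[0]
    out["allele2"] = genotype.str[1]
    return out.drop(columns="genotype")


# Hypothetical rows, made up for illustration only.
df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["1", "X", "Y", "MT"],
        "pos": [1000, 2000, 3000, 4000],
        "genotype": ["AG", "CC", "T", np.nan],
    }
)

result = split_genotype(df)
print(result)
```

With this layout, a single-allele call and a missing call are distinguished by which of the two columns is NaN, with no need to check string lengths.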