# Allele coding format
We assume that the alleles in the genotype files are in **illumina top** format, so there is no need to convert the genotypes in *MAP/PED* files. However, is this always the case? Can I verify that genotypes are always coded in **top**? How can I convert them to the **forward** format (which is needed by VCF)? How can I determine whether an allele is the reference or the alternate? Let's start by importing and initializing a few things:

In [1]:
import csv
import itertools

from tqdm.notebook import tqdm

from src.features.smarterdb import Dataset, VariantSheep, Location, global_connection

conn = global_connection()
sniffer = csv.Sniffer()

Let's test the *Texel* dataset. Get its information from the database:

In [2]:
dataset = Dataset.objects(file="TEXEL_INIA_UY.zip").get()
mapfile = "TEXEL_UY.map"
pedfile = "TEXEL_UY.ped"
mappath = dataset.working_dir / mapfile
pedpath = dataset.working_dir / pedfile

Time to read SNP positions and *ids*. Open the *MAP* file with the `csv` module:

In [3]:
with open(mappath) as handle:
    dialect = sniffer.sniff(handle.read(2048))
    handle.seek(0)
    reader = csv.reader(handle, dialect=dialect)
    mapdata = list(reader)
print(f"read {len(mapdata)} snps from {mappath}")
for line in itertools.islice(mapdata, 5):
    print(line)

read 51135 snps from /home/paolo/Projects/SMARTER-database/data/interim/604f75a61a08c53cebd09b67/TEXEL_UY.map
['15', '250506CS3900065000002_1238.1', '0', '5327353']
['23', '250506CS3900140500001_312.1', '0', '27428869']
['7', '250506CS3900176800001_906.1', '0', '89002990']
['16', '250506CS3900211600001_1041.1', '0', '44955568']
['2', '250506CS3900218700001_1294.1', '0', '157820235']
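
As a side note, the dialect detection used above can be tried in isolation. The following is only a minimal sketch: it assumes the *MAP* file is tab-separated (the real delimiter is whatever the sniffer finds in the file) and reuses the first three records printed above as an in-memory sample. Restricting the candidate delimiters with the `delimiters` argument is an extra precaution of this sketch, not something the cell above does.

```python
import csv
import io

# In-memory stand-in for a MAP file: chromosome, SNP id, genetic distance, position.
# ASSUMPTION: fields are tab-separated; the real file may use another delimiter.
sample = (
    "15\t250506CS3900065000002_1238.1\t0\t5327353\n"
    "23\t250506CS3900140500001_312.1\t0\t27428869\n"
    "7\t250506CS3900176800001_906.1\t0\t89002990\n"
)

handle = io.StringIO(sample)

# Let the sniffer choose only between tab and space, so that characters
# occurring once per line inside the SNP ids are never picked as delimiter.
dialect = csv.Sniffer().sniff(handle.read(2048), delimiters="\t ")

# Rewind before parsing, exactly as done with the real file.
handle.seek(0)
rows = list(csv.reader(handle, dialect=dialect))

print(repr(dialect.delimiter))
for row in rows:
    print(row)
```

Reading only the first 2048 characters keeps the detection cheap on large files, which is why the handle must be rewound before parsing.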


Ok, now get the corresponding information from the SMARTER database:

In [4]:
locations = []
print("Loading locations...")
for line in tqdm(mapdata):
    variant = VariantSheep.objects(name=line[1]).get()
    location = variant.get_location(version='Oar_v3.1')
    locations.append(location)

print("Done!")

for line in itertools.islice(locations, 5):
    print(line)

Loading locations...


  0%|          | 0/51135 [00:00<?, ?it/s]

Done!
(SNPchiMp v.3:Oar_v3.1) 15:5870057
(SNPchiMp v.3:Oar_v3.1) 23:26298017
(SNPchiMp v.3:Oar_v3.1) 7:81648528
(SNPchiMp v.3:Oar_v3.1) 16:41355381
(SNPchiMp v.3:Oar_v3.1) 2:148802744


Now `mapdata` and `locations` share the same indexes (the same position in both lists refers to the same SNP). Ok, time to read from the *PED* file:

In [5]:
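def is_top(genotype: list, location: Location, missing: str = "0") -> bool:
    """Return True if genotype is compatible with illumina TOP coding

    Args:
        genotype (list): the two alleles read from the PED file
        location (Location): the variant location, providing `illumina_top`
        missing (str): the symbol used for a missing allele (default "0")

    Returns:
        bool: True if every non-missing allele is one of the TOP alleles
    """

    # get illumina data as an array
    top = location.illumina_top.split("/")

    for allele in genotype:
        # skip missing alleles: they can never match illumina_top
        if allele == missing:
            continue

        if allele not in top:
            return False

    return True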

print("Check if PED is in top coordinates")

with open(pedpath) as handle:
    dialect = sniffer.sniff(handle.read(2048))
    handle.seek(0)
    reader = csv.reader(handle, dialect=dialect)
    for i, line in enumerate(tqdm(reader, total=dataset.n_of_individuals)):
        # check every SNP of this sample, in MAP file order
        for j, mapline in enumerate(mapdata):
            # get location from locations list (read previously from db)
            location = locations[j]

            # skip the first 6 columns (ped extra fields)
            a1 = line[6+j*2]
            a2 = line[6+j*2+1]

            # define genotype as an array
            genotype = [a1, a2]

            if not is_top(genotype, location):
                # print snp name and info on locations
                print(f"{line[:2]}:{mapline[1]}: from ped: [{a1}/{a2}] -> from snpchimp [{location.illumina_top}]")

                raise Exception("Not illumina top")
            
        # debug on first sample
        # break

Check if PED is in top coordinates


  0%|          | 0/169 [00:00<?, ?it/s]
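
To see what the check does without touching the database, here is a small self-contained sketch on invented data. `FakeLocation` and `split_genotypes` are hypothetical helpers introduced only for this illustration (the real `Location` comes from `src.features.smarterdb`), and the sample line and TOP alleles are made up. It assumes the standard *PED* layout: six leading columns, then two alleles per SNP.

```python
from dataclasses import dataclass


@dataclass
class FakeLocation:
    # Hypothetical stand-in for Location: only the attribute used by the check.
    illumina_top: str


def split_genotypes(ped_line, n_extra=6):
    """Yield (a1, a2) pairs from a PED record, skipping the leading columns."""
    alleles = ped_line[n_extra:]
    for j in range(0, len(alleles), 2):
        yield alleles[j], alleles[j + 1]


def is_top(genotype, location, missing="0"):
    """Same membership test as above, written with all()."""
    top = location.illumina_top.split("/")
    return all(allele == missing or allele in top for allele in genotype)


# Invented data: one sample genotyped at three SNPs.
ped_line = ["FAM1", "IND1", "0", "0", "0", "-9", "A", "G", "0", "0", "T", "C"]
locations = [FakeLocation("A/G"), FakeLocation("A/C"), FakeLocation("A/G")]

results = [
    is_top(list(genotype), location)
    for genotype, location in zip(split_genotypes(ped_line), locations)
]
print(results)  # [True, True, False]
```

The first genotype matches its TOP alleles, the second is missing and is therefore skipped, and the third (`T/C`) fails against `A/G`. Note that this is a membership test only: for SNPs whose two alleles are complementary (`A/T` or `C/G`) the allele set is the same on both strands, so a passing result does not by itself prove TOP coding.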