# Spanish sheep
This notebook describes the latest data received from Spain. They sent multiple files. Files with the `CHUOJA` prefix contain the same samples for breeds `{'Churra', 'Ojalada'}` already imported from sheephapmap. The other files seem to contain new samples; however, they come from an *affymetrix* array I don't have.
* [SMARTER-500-ASSAF](#dataset0)
* [Castellana](#dataset1)
* [Churra](#dataset2)

In [1]:
import re
import os
import logging
import zipfile
from collections import defaultdict
from pathlib import Path

import pandas as pd
from tqdm.notebook import tqdm

from src.features.smarterdb import global_connection, Dataset
from src.features.plinkio import AffyPlinkIO, TextPlinkIO, CodingException
from src.features.utils import get_interim_dir
from src.data.common import WORKING_ASSEMBLIES

_ = global_connection()
OAR3 = WORKING_ASSEMBLIES["OAR3"]
logger = logging.getLogger('src.features.plinkio')
logger.setLevel(logging.CRITICAL)

In [2]:
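class CustomMixin():
    """Add helpers to test which allele coding a pedfile is written in"""
    n_of_individuals = None
    
    def process_pedfile(self, coding="top"):
        """Check every genotype against the requested coding"""
        # a CodingException raised here is caught by the is_* methods below
        for line in tqdm(self.read_pedfile(), total=self.n_of_individuals):
            _ = self._process_genotypes(line, coding)
            
        return True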
    
    def is_top(self):
        try:
            return self.process_pedfile(coding='top')
        
        except CodingException:
            return False
    
    def is_forward(self):
        try:
            return self.process_pedfile(coding='forward')
        
        except CodingException:
            return False
        
    def is_affymetrix(self):
        try:
            return self.process_pedfile(coding='affymetrix')
        
        except CodingException:
            return False
        
class CustomTextPlinkIO(CustomMixin, TextPlinkIO):
    pass


class CustomAffyPlinkIO(CustomMixin, AffyPlinkIO):
    """This is not a cellfile, but a plink made by affymetrix"""
    
    def read_pedfile(self, *args, **kwargs):
        """Open the pedfile for reading and return an iterator over its records"""

        with open(self.pedfile) as handle:
            # affy files mix " " and "\t" as field separators
            for record in handle:
                # affy data may contain comment lines
                if record.startswith("#"):
                    logger.info(f"Skipping {record}")
                    continue

                line = re.split('[ \t]+', record.strip())

                yield line
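
The parsing rule used by `read_pedfile` can be tried on its own. The following is a minimal sketch on toy records (they are invented for illustration and not taken from the real files); `parse_ped_records` is a hypothetical helper name:

```python
import re

# Toy records, invented for illustration: one comment line, one
# tab-separated line and one line mixing spaces and tabs.
records = [
    "# exported by some genotyping software\n",
    "FAM1\tS1\t0\t0\t0\t-9\tA\tG\n",
    "FAM1 S2  0 0\t0 -9\tC C\n",
]


def parse_ped_records(handle):
    """Yield one list of fields for every non-comment record."""
    for record in handle:
        # comment lines start with "#" and carry no genotypes
        if record.startswith("#"):
            continue

        # a run of spaces and/or tabs counts as a single separator
        yield re.split(r"[ \t]+", record.strip())


parsed = list(parse_ped_records(records))
for fields in parsed:
    print(fields)
```

Both data lines end up with the same eight fields layout, regardless of the separator used.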

<a id='dataset0'></a>
## SMARTER-500-ASSAF
Let's explore the ASSAF dataset. It seems to be an affymetrix dataset; however, the file is in plink text format:

In [3]:
assaf_dataset = Dataset.objects.get(file="SMARTER-500-ASSAF.zip")
plinkio = CustomTextPlinkIO(
    prefix=str(assaf_dataset.working_dir / "SMARTER-500-ASSAF"), 
    species=assaf_dataset.species, 
    chip_name=assaf_dataset.chip_name)
plinkio.n_of_individuals = assaf_dataset.n_of_individuals

Start by reading the coordinates, then determine how many of these SNPs I have in the SMARTER database:

In [4]:
plinkio.read_mapfile()
plinkio.fetch_coordinates(
    version="Oar_v4.0",
    imported_from="affymetrix",
    search_field="probeset_id"
)

In [5]:
snps_found = len(plinkio.mapdata)-len(plinkio.filtered)
perc_missing = round(100 - (snps_found / len(plinkio.mapdata) * 100), 2)

print(f"I can retrieve {snps_found} of {len(plinkio.mapdata)} SNPs ({perc_missing}% missing)")

I can retrieve 38843 of 49702 SNPs (21.85% missing)
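
The same percentage is computed for every dataset in this notebook, so it could be wrapped in a small function. This is only a sketch: `missing_percentage` is a hypothetical name, not part of the `src` package, and the numbers are the ones printed above.

```python
def missing_percentage(total, found):
    """Return the percentage of SNPs that were not found, rounded to 2 decimals."""
    if total == 0:
        raise ValueError("the map file has no SNPs")

    return round(100 - (found / total * 100), 2)


# values from the ASSAF output above: 38843 SNPs found out of 49702
assaf_missing = missing_percentage(total=49702, found=38843)
print(f"{assaf_missing}% missing")
```

With `plinkio` objects, `total` would be `len(plinkio.mapdata)` and `found` would be `len(plinkio.mapdata) - len(plinkio.filtered)`.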


Is this dataset in *top* coding?

In [6]:
plinkio.is_top()

  0%|          | 0/504 [00:00<?, ?it/s]

Error for SNP 6:AX-123194370: C/C <> A/G


False

Is this file in *affymetrix forward* coding?

In [7]:
plinkio.is_affymetrix()

  0%|          | 0/504 [00:00<?, ?it/s]

Error for SNP 3:AX-123245398: G/G <> T/C


False

This isn't expected; maybe the reference genome is *OAR3* and not the latest *OAR4*. I think I need to add the old manifest data to the database.
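
To make the two error messages concrete: I read `C/C <> A/G` as "the genotype called in the file is `C/C`, while the alleles expected for this SNP are `A/G`". The sketch below is an assumption about what such a check amounts to, not the actual `_process_genotypes` implementation; `genotype_matches` is a hypothetical name.

```python
# plink text files use "0" for a missing allele by default
MISSING = "0"


def genotype_matches(genotype, expected_alleles):
    """Return True when every called allele is one of the expected alleles."""
    return all(
        allele == MISSING or allele in expected_alleles
        for allele in genotype
    )


# the first failure reported above: C/C against A/G
print(genotype_matches(("C", "C"), {"A", "G"}))

# a genotype compatible with the expected alleles
print(genotype_matches(("A", "G"), {"A", "G"}))

# missing calls can't contradict any coding
print(genotype_matches(("0", "0"), {"A", "G"}))
```

A single incompatible genotype is enough to reject a coding, which is why both checks stop at the first sample.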

<a id='dataset1'></a>
## Castellana_Ovine
Let's explore another Spanish dataset. This dataset contains a plink file for the whole affymetrix chip and a subset of samples made to test the creation of a smaller and cheaper chip. Samples and SNPs are the same, so the 10K dataset can be ignored entirely. The 50K file is an affymetrix plink file; however, it doesn't come from a *cell file*: it's a *tab separated* plink file with comments.

In [8]:
castellana_ovine = Dataset.objects.get(file="Castellana.zip")
plinkio = CustomAffyPlinkIO(
    prefix=str(castellana_ovine.working_dir / "Castellana/20220131 Ovine"), 
    species=castellana_ovine.species, 
    chip_name=castellana_ovine.chip_name)
plinkio.n_of_individuals = castellana_ovine.n_of_individuals

Start by reading the coordinates, then determine how many of these SNPs I have in the SMARTER database:

In [9]:
plinkio.read_mapfile()
plinkio.fetch_coordinates(
    version="Oar_v4.0",
    imported_from="affymetrix",
    search_field="probeset_id"
)

In [10]:
snps_found = len(plinkio.mapdata)-len(plinkio.filtered)
perc_missing = round(100 - (snps_found / len(plinkio.mapdata) * 100), 2)

print(f"I can retrieve {snps_found} of {len(plinkio.mapdata)} SNPs ({perc_missing}% missing)")

I can retrieve 38843 of 49702 SNPs (21.85% missing)


Is this dataset in *top* coding?

In [11]:
plinkio.is_top()

  0%|          | 0/186 [00:00<?, ?it/s]

Error for SNP 6:AX-123194370: C/C <> A/G


False

Is this file in *affymetrix forward* coding?

In [12]:
plinkio.is_affymetrix()

  0%|          | 0/186 [00:00<?, ?it/s]

Error for SNP 3:AX-123245398: G/G <> T/C


False

This is the same behaviour seen for the *Assaf* file. Which breeds do I have in this dataset?

In [13]:
breeds_castellana = set()
samples_castellana = set()
for line in plinkio.read_pedfile():
    breed, sample = line[0], line[1]
    # sets ignore duplicates, so no membership test is needed
    breeds_castellana.add(breed)
    samples_castellana.add(sample)

print(f"Got {breeds_castellana} breeds")

Got {'Assaf', 'SMARTER'} breeds


<a id='dataset2'></a>
## Churra
Let's explore the last Spanish dataset. This dataset is an affymetrix plink file, with a mix of *rs_id* and *affy ids* as SNP names.

In [14]:
churra_dataset = Dataset.objects.get(file="Churra.zip")
plinkio = CustomAffyPlinkIO(
    prefix=str(churra_dataset.working_dir / "Churra/Churra_SMARTER_JJsent"), 
    species=churra_dataset.species, 
    chip_name=churra_dataset.chip_name)
plinkio.n_of_individuals = churra_dataset.n_of_individuals

Here the problem is that SNP names are a mix of `rs_id` and `probeset_id`, so I need to search both fields:

In [15]:
plinkio.read_mapfile()
plinkio.fetch_coordinates(
    version="Oar_v4.0",
    imported_from="affymetrix",
    search_field="probeset_id"
)
probeset_found = len(plinkio.mapdata)-len(plinkio.filtered)
print(f"Found {probeset_found} SNPs using 'probeset_id'")

Found 69 SNPs using 'probeset_id'


In [16]:
plinkio.read_mapfile()
plinkio.fetch_coordinates(
    version=OAR3.version,
    imported_from=OAR3.imported_from,
    search_field="rs_id"
)
rs_found = len(plinkio.mapdata)-len(plinkio.filtered)
print(f"Found {rs_found} SNPs using 'rs_id'")

Found 45999 SNPs using 'rs_id'


In [17]:
snps_found = probeset_found + rs_found
perc_missing = round(100 - (snps_found / len(plinkio.mapdata) * 100), 2)

print(f"I can retrieve {snps_found} of {len(plinkio.mapdata)} SNPs ({perc_missing}% missing)")

I can retrieve 46068 of 60379 SNPs (23.7% missing)
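
Before querying the database, the mix of names could also be inspected by prefix. This is a rough sketch under two assumptions: affymetrix probeset ids start with `AX-` (as the ids in the error messages of the previous datasets do) and rs ids start with `rs`. `guess_id_type` is a hypothetical helper, and the last two names in the list are invented.

```python
from collections import Counter


def guess_id_type(snp_name):
    """Guess which database field a SNP name should be searched in."""
    if snp_name.startswith("AX-"):
        return "probeset_id"

    if snp_name.startswith("rs"):
        return "rs_id"

    return "unknown"


# two affymetrix ids seen in this notebook plus two invented names
names = ["AX-123194370", "AX-123245398", "rs123456", "some_other_name"]

counts = Counter(guess_id_type(name) for name in names)
print(counts)
```

Names classified as `unknown` would be candidates for the SNPs I can't retrieve with either search field.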


OK, let's create a metadata table that includes the few GPS coordinates I have:

In [18]:
coordinates = {'AV': (42.097806, -5.283205), 'VG': (41.86830, -5.39687)}
data = defaultdict(list)
for line in plinkio.read_pedfile():
    # define the minimal set of smarter metadata
    data["original_id"].append(line[1])
    
    # this breed is already in smarter
    data["breed_name"].append("Churra")
    data["breed_code"].append("CHU")
    
    # other data I know
    data["country"].append("Spain")
    data["purpose"].append("Milk")
    
    # determining GPS coordinates
    key = line[1][:2]
    latlong = coordinates[key]
    data["latitude"].append(latlong[0])
    data["longitude"].append(latlong[1])
    
# transform the collected columns into a dataframe
df = pd.DataFrame(data=data)
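
The lookup above relies on every sample id starting with a known two-letter prefix. If that were not the case, a more tolerant variant could leave the coordinates empty instead of failing. This is a sketch only: `lookup_coordinates` is a hypothetical helper and the sample ids are invented.

```python
coordinates = {'AV': (42.097806, -5.283205), 'VG': (41.86830, -5.39687)}


def lookup_coordinates(original_id, coordinates):
    """Return (latitude, longitude) for a sample, or (None, None) if unknown."""
    # the first two characters of the sample id identify the location
    return coordinates.get(original_id[:2], (None, None))


# invented sample ids: two with a known prefix, one without
for original_id in ["AV001", "VG002", "XX003"]:
    latitude, longitude = lookup_coordinates(original_id, coordinates)
    print(original_id, latitude, longitude)
```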

In [19]:
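outfile = Path(churra_dataset.file).stem + ".xlsx"
outpath = get_interim_dir() / outfile
df.to_excel(str(outpath), index=False)

# store the spreadsheet under metadata/ in the archive, without changing
# the working directory; the context manager closes the archive
archive_path = get_interim_dir() / "Churra_metadata.zip"
with zipfile.ZipFile(archive_path, "w") as metadata_file:
    metadata_file.write(outpath, arcname=f"metadata/{outfile}")

# the spreadsheet is now inside the archive, remove the temporary copy
outpath.unlink()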
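
The layout of such an archive can be checked with `ZipFile.namelist()`. The sketch below rebuilds the same structure in a temporary directory with a placeholder file, so it doesn't touch the interim directory; the file content is a stand-in, not a real spreadsheet.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    tmpdir = Path(tmpdir)

    # placeholder standing in for the spreadsheet written by to_excel
    payload = tmpdir / "Churra.xlsx"
    payload.write_bytes(b"placeholder, not a real spreadsheet")

    # same layout as above: the file goes under metadata/ in the archive
    archive = tmpdir / "Churra_metadata.zip"
    with zipfile.ZipFile(archive, "w") as handle:
        handle.write(payload, arcname=f"metadata/{payload.name}")

    # reopen the archive and list its members
    with zipfile.ZipFile(archive) as handle:
        members = handle.namelist()

print(members)
```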