# ISHEEP dataset

* [Introduction](#introduction)
* [Data preparation](#data-preparation)
  * [Convert VCF files into a valid format](#vcf-convert)
  * [Convert sorted VCF file to plink binaries](#vcf-plink)
* [Importing stuff](#importing)
* [ISHEEP 50K dataset](#snp50k)
* [ISHEEP 600K dataset](#snpHD)

<a id='introduction'></a>
## Introduction

This notebook is an exploratory analysis of the [isheep dataset](https://ngdc.cncb.ac.cn/isheep/). Data were downloaded from their [ftp site](https://ngdc.cncb.ac.cn/isheep/download). There are two datasets in *VCF* format, with 1512 samples in the 50K dataset and 911 samples in the 600K (HD) dataset. Headers are missing or invalid; the files were probably generated using plink itself. Genomic coordinates are expected to be *OAR4* and genotypes in forward coding (as VCF should be). SNPs were renamed using the **RSID** (not the Illumina name). SNPs without an rs_id have no name and are placed on chromosome 99. There is no *FID* in the VCF files, since it is an attribute of plink files. This additional information could be retrieved from the [isheep supplementary material](https://www.frontiersin.org/articles/10.3389/fgene.2021.714852/full#supplementary-material).

<a id='data-preparation'></a>
## Data preparation

<a id='vcf-convert'></a>
### Convert VCF files into a valid format

VCF files were fixed and modified so that they could be sorted and indexed. Chromosome `99` was renamed to `0` (with `awk`) so that it can be processed with plink. Records were then sorted by chromosome and position using `grep` and `sort`, since the headers were malformed and the files couldn't be processed with `bcftools`, `picard` or other utilities. The final file was compressed and indexed. For 50K:

```bash
python createVCFheader.py > header.txt
cat header.txt 50K-all.vcf > 50K-all.fix.vcf
awk '{if ($1=="99") sub($1, 0); print }' 50K-all.fix.vcf > 50K-all.fix-no99.vcf
sed -i 's/ID=99/ID=0/' 50K-all.fix-no99.vcf
grep "^#" 50K-all.fix-no99.vcf > 50K-all.fix-no99.sort.vcf
grep -v "^#" 50K-all.fix-no99.vcf | sort -k1,1V -k2,2g >> 50K-all.fix-no99.sort.vcf
bgzip -@24 50K-all.fix-no99.sort.vcf
tabix 50K-all.fix-no99.sort.vcf.gz
```

Similarly for 600K:

```bash
cat 600K-all.vcf | awk '{if ($1=="99") sub($1, 0); print }' > 600K-all.fix-no99.vcf
sed -i 's/ID=99/ID=0/' 600K-all.fix-no99.vcf
grep "^#" 600K-all.fix-no99.vcf > 600K-all.fix-no99.sort.vcf
grep -v "^#" 600K-all.fix-no99.vcf | sort -k1,1V -k2,2g >> 600K-all.fix-no99.sort.vcf
bgzip -@24 600K-all.fix-no99.sort.vcf 
tabix 600K-all.fix-no99.sort.vcf.gz
```
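
The rename-and-sort logic of the shell pipelines above can also be sketched in Python. This is a minimal illustration, not the procedure actually used: `chrom_sort_key` and `rename_and_sort` are hypothetical helpers, the sort key only approximates `sort -k1,1V -k2,2g` (numeric chromosomes first, then any other name), and the example records are made up.

```python
def chrom_sort_key(chrom):
    """Numeric chromosomes first (in numeric order), then any other name."""
    return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)


def rename_and_sort(lines, old_chrom="99", new_chrom="0"):
    """Rename a placeholder chromosome and sort VCF records by position."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("#"):
            # keep header lines in their original order, fixing contig IDs
            header.append(line.replace(f"ID={old_chrom}", f"ID={new_chrom}"))
            continue
        fields = line.split("\t")
        if fields[0] == old_chrom:
            fields[0] = new_chrom
        records.append(fields)
    # sort by chromosome first, then by numeric position
    records.sort(key=lambda f: (chrom_sort_key(f[0]), int(f[1])))
    return header + ["\t".join(fields) for fields in records]


# made-up records, only to show the effect of the function
example = [
    "##contig=<ID=99>",
    "#CHROM\tPOS\tID\tREF\tALT",
    "2\t500\trs2\tA\tG",
    "99\t0\t.\tC\tT",
    "1\t1200\trs1\tG\tA",
    "1\t300\trs3\tT\tC",
]
for row in rename_and_sort(example):
    print(row)
```

For the real files, streaming through `sort` as shown above is likely the better choice, since this sketch keeps all records in memory.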

<a id='vcf-plink'></a>
### Convert sorted VCF file to plink binaries

The VCF files are then converted into plink binary files so that they can be processed using the `SMARTER-database` library:

```bash
mkdir 50K-all
mkdir 600K-all
plink --chr-set 26 --allow-extra-chr --vcf 50K-all.fix-no99.sort.vcf.gz --make-bed --out 50K-all/50K-all
plink --chr-set 26 --allow-extra-chr --vcf 600K-all.fix-no99.sort.vcf.gz --make-bed --out 600K-all/600K-all
```
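
If the conversion needs to be scripted, the two commands above can be generated from Python. This is only a sketch: `plink_vcf_to_bed` is a hypothetical helper that reuses the same flags shown above (`--chr-set 26` matches the 26 sheep autosomes), and the call to `subprocess.run` is left commented out because it requires `plink` to be installed.

```python
import shlex


def plink_vcf_to_bed(vcf_path, out_prefix, chr_set=26):
    """Build the plink argument list used to convert a VCF into binary files."""
    return [
        "plink",
        "--chr-set", str(chr_set),
        "--allow-extra-chr",
        "--vcf", str(vcf_path),
        "--make-bed",
        "--out", str(out_prefix),
    ]


for name in ("50K-all", "600K-all"):
    cmd = plink_vcf_to_bed(f"{name}.fix-no99.sort.vcf.gz", f"{name}/{name}")
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # where plink is available
```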

<a id='importing'></a>
## Importing stuff

Now it's time to load the datasets:

In [17]:
import logging
from pathlib import Path
from collections import Counter

from plinkio import plinkfile
from tqdm.notebook import tqdm

from src.features.smarterdb import global_connection, VariantSheep
from src.features.utils import get_project_dir
from src.features.plinkio import BinaryPlinkIO, CodingException
from src.data.common import WORKING_ASSEMBLIES

In [18]:
_ = global_connection()
OAR3 = WORKING_ASSEMBLIES["OAR3"]
logging.getLogger('src.features.plinkio').setLevel(logging.CRITICAL)

In [19]:
isheep_50K = plinkfile.open(str(get_project_dir() / "data/external/SHE/ISHEEP/50K-all/50K-all"))
isheep_600K = plinkfile.open(str(get_project_dir() / "data/external/SHE/ISHEEP/600K-all/600K-all"))

<a id='snp50k'></a>
## ISHEEP 50K dataset
How many samples in the 50K dataset are also in the 600K dataset?

In [20]:
isheep_50K_samples = {sample.iid for sample in isheep_50K.get_samples()}
isheep_600K_samples = {sample.iid for sample in isheep_600K.get_samples()}

common_samples = isheep_600K_samples.intersection(isheep_50K_samples)
print(f"There are {len(common_samples)} samples in common between 50K and 600K datasets")

There are 0 samples in common between 50K and 600K datasets


OK, it seems that these two datasets can be managed separately, and perhaps they were created independently. Let's consider the SNPs: they have the *rs_id* as name, and SNPs without a position have no name at all. How many SNPs with an rs_id do I have?

In [21]:
isheep_50K_all_variants = isheep_50K.get_loci()
isheep_50K_valid_variants = list(filter(lambda snp: snp.name != 'NULL', isheep_50K_all_variants))
print(f"There are {len(isheep_50K_valid_variants)} SNPs with a valid 'rs_id' from {len(isheep_50K_all_variants)} total SNPs")

There are 48221 SNPs with a valid 'rs_id' from 51132 total SNPs


How many variants can I find in my database starting from the rs_id? Consider that Affymetrix SNPs have the same 'rs_id' for multiple probes, so force the SNPs to belong to the appropriate chip:

In [22]:
variants = VariantSheep.objects.filter(rs_id__in=[variant.name for variant in isheep_50K_valid_variants], chip_name="IlluminaOvineSNP50")
print(f"I can found {variants.count()}/{len(isheep_50K_valid_variants)} in my SMARTER-database")

I can found 48221/48221 in my SMARTER-database


Loading data from the other datasets already provided all the SNPs I need, so I can add almost all SNPs, namely those with an *rs_id*. The next question is: are those SNPs in forward coding? To check, I need to override my class methods:

In [23]:
class CustomBinaryPlinkIO(BinaryPlinkIO):
    def process_pedfile(self, src_coding="top"):
        for line in tqdm(self.read_pedfile(), total=len(self.plink_file.get_samples())):
            _ = self._process_genotypes(line, src_coding=src_coding)

        return True

    def is_top(self):
        try:
            return self.process_pedfile(src_coding='top')

        except CodingException:
            return False

    def is_forward(self):
        try:
            return self.process_pedfile(src_coding='forward')

        except CodingException:
            return False

In [24]:
plinkio = CustomBinaryPlinkIO(species="Sheep", chip_name="IlluminaOvineSNP50")
plinkio.plink_file = isheep_50K
plinkio.read_mapfile()
plinkio.fetch_coordinates(search_field="rs_id", chip_name="IlluminaOvineSNP50", src_assembly=OAR3)

Is this file in *forward*?

In [25]:
plinkio.is_forward()

False

OK, so is this file in Illumina *top*?

In [26]:
plinkio.is_top()

True
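
The try/except pattern behind `is_forward` and `is_top` can be illustrated with a toy example. This is a simplified sketch and not the `SMARTER-database` implementation: `CodingError`, `REFERENCE`, `check_sample` and `is_coding` are hypothetical names, and the allele table is made up. The idea is that a file is compatible with a coding convention only if every non-missing allele of every sample belongs to the alleles expected for that SNP in that convention.

```python
class CodingError(Exception):
    """Raised when a genotype does not match the expected allele coding."""


# hypothetical reference table: allowed alleles for each SNP in each coding
REFERENCE = {
    "snp1": {"top": {"A", "G"}, "forward": {"T", "C"}},
    "snp2": {"top": {"A", "C"}, "forward": {"A", "C"}},
}

MISSING = "0"  # plink convention for a missing allele


def check_sample(genotypes, coding):
    """Raise CodingError at the first allele incompatible with `coding`."""
    for snp, alleles in genotypes.items():
        allowed = REFERENCE[snp][coding]
        for allele in alleles:
            if allele != MISSING and allele not in allowed:
                raise CodingError(f"{snp}: {allele} not in {sorted(allowed)}")


def is_coding(samples, coding):
    """Return True only if every sample is compatible with `coding`."""
    try:
        for genotypes in samples:
            check_sample(genotypes, coding)
    except CodingError:
        return False
    return True


samples = [
    {"snp1": ("A", "G"), "snp2": ("A", "C")},
    {"snp1": ("G", "G"), "snp2": ("0", "0")},
]
print(is_coding(samples, "forward"))  # False: "A" is not a forward allele of snp1
print(is_coding(samples, "top"))      # True
```

Note that `snp2` has the same alleles in both conventions, so it cannot discriminate between them on its own; the check becomes informative only across many SNPs.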

<a id='snpHD'></a>
## ISHEEP 600K dataset
How many SNPs with an rs_id do I have?

In [27]:
isheep_600K_all_variants = isheep_600K.get_loci()
isheep_600K_valid_variants = list(filter(lambda snp: snp.name.upper() != 'NULL', isheep_600K_all_variants))
print(f"There are {len(isheep_600K_valid_variants)} SNPs with a valid 'rs_id' from {len(isheep_600K_all_variants)} total SNPs")

There are 604151 SNPs with a valid 'rs_id' from 606006 total SNPs


Well, I know that there are multiple probes in the *HD* chip with the same `rs_id`:

In [28]:
counter = Counter([variant.name for variant in isheep_600K_valid_variants])
{x: count for x, count in counter.items() if count > 1}

{'rs160403113': 2,
 'rs402137533': 2,
 'rs411572125': 2,
 'rs414994086': 2,
 'rs417009700': 2,
 'rs419271878': 2,
 'rs421030064': 2,
 'rs424177120': 2,
 'rs424922202': 2,
 'rs429936770': 2,
 'rs409530414': 2,
 'rs401964070': 2,
 'rs403536877': 2,
 'rs160076408': 2,
 'rs404810128': 2,
 'rs427172981': 2,
 'rs399767812': 2,
 'rs408149659': 2,
 'rs407812192': 2,
 'rs421290240': 2,
 'rs402828512': 2,
 'rs424493804': 2,
 'rs418396733': 2}

How many variants can I find in my database starting from the rs_id? Consider that Affymetrix SNPs have the same 'rs_id' for multiple probes, so force the SNPs to belong to the appropriate chip:

In [29]:
variants = VariantSheep.objects.filter(rs_id__in=[variant.name for variant in isheep_600K_valid_variants], chip_name="IlluminaOvineHDSNP")
print(f"I can found {variants.count()}/{len(isheep_600K_valid_variants)} in my SMARTER-database")

I can found 604149/604151 in my SMARTER-database
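
Since some `rs_id` values are duplicated in this dataset, the two counts above are not directly comparable, and comparing sets of identifiers is a more reliable way to see which names are missing. The snippet below is a minimal sketch with made-up identifiers; `missing_ids` is a hypothetical helper, and in the notebook the `found` list would have to be collected from the query results (for example from an `rs_id` attribute, which is an assumption about the `VariantSheep` model).

```python
def missing_ids(requested, found):
    """Return the requested identifiers that are absent from `found`, sorted."""
    return sorted(set(requested) - set(found))


# made-up identifiers: "rs3" is duplicated, as happens with multiple probes
requested = ["rs1", "rs2", "rs3", "rs3", "rs4"]
found = ["rs1", "rs3"]
print(missing_ids(requested, found))  # ['rs2', 'rs4']
```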


Not bad. Now check the coding convention:

In [30]:
plinkio = CustomBinaryPlinkIO(species="Sheep", chip_name="IlluminaOvineHDSNP")
plinkio.plink_file = isheep_600K
plinkio.read_mapfile()
plinkio.fetch_coordinates(search_field="rs_id", chip_name="IlluminaOvineHDSNP", src_assembly=OAR3)

Is this file in *forward*?

In [31]:
plinkio.is_forward()

False

OK, so is this file in Illumina *top*?

In [32]:
plinkio.is_top()

True