# ISHEEP WGS DATASET

This is an attempt to collect SNPs from the [isheep WGS dataset](https://ngdc.cncb.ac.cn/isheep/download) and track them in the SMARTER database. There is no information at all about *probeset IDs*, so the only way to collect SNPs from this dataset is by their position on *OAR4*. Data are split by chromosome and were downloaded from the [isheep WGS ftp folder](ftp://download.big.ac.cn/isheep/SNP).

## Data preparation

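The downloaded files were compressed with the standard `gzip` utility, while `tabix` requires `bgzip` (BGZF) compression. Decompress every file, recompress it with `bgzip`, test the result, then replace the original file and index it with `tabix`: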

```bash
for compressed in *.vcf.gz; do echo "Processing $compressed"; vcf="${compressed%.*}"; bgzip -d --stdout "$compressed" | bgzip -@24 --compress-level 9 --stdout > "$vcf.bgzip"; done
for compressed in *.bgzip; do echo "Processing $compressed"; bgzip --test "$compressed"; done
for compressed in *.bgzip; do echo "Processing $compressed"; vcf="${compressed%.*}.gz"; mv "$compressed" "$vcf"; tabix "$vcf"; done
```
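
To check from Python whether a file is already BGZF-compressed rather than plain `gzip`, one option is to inspect the gzip header for the `BC` extra subfield that the BGZF format defines. This is only a minimal sketch: the helper name `is_bgzf` is hypothetical, and the sample bytes are the well-known BGZF end-of-file block, not data from the isheep files.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) starts a BGZF block."""
    if len(header) < 12:
        return False
    # gzip magic bytes, deflate method, and the FEXTRA flag (bit 2) must be set
    if header[:3] != b"\x1f\x8b\x08" or not header[3] & 4:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# The 28-byte BGZF end-of-file marker block, used here as a BGZF sample
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)
# A plain gzip stream built in memory (hypothetical content)
PLAIN_GZIP = gzip.compress(b"##fileformat=VCFv4.2\n")

print(is_bgzf(BGZF_EOF), is_bgzf(PLAIN_GZIP))
```

For a real file, passing the first 18 bytes read in binary mode should be enough for output written by `bgzip`, assuming the `BC` subfield comes first in the extra field.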

Now I can try to extract my probes using their positions and `tabix`. The problem is that more than one probe can map to the same position, so I can't assign a unique `VariantSheep.name` to a given SNP. Moreover, positions need to be checked against `OAR4` before querying the VCF. For the moment, I will try to extract the information I need from the data I have.
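
The position clash can be made explicit by grouping probe names by `(chrom, position)` and keeping the groups with more than one member. The sketch below uses hypothetical records and a hypothetical helper, `group_by_position`, rather than the real `VariantSheep` documents:

```python
from collections import defaultdict


def group_by_position(variants):
    """Map (chrom, position) -> list of variant names sharing that position."""
    groups = defaultdict(list)
    for name, chrom, position in variants:
        groups[(chrom, position)].append(name)
    return dict(groups)


# Hypothetical records (name, chrom, position), not taken from the database
variants = [
    ("probe_a", "1", 1000),
    ("probe_b", "1", 1000),
    ("probe_c", "2", 5000),
]

groups = group_by_position(variants)
# positions that cannot be assigned a unique name
ambiguous = {pos: names for pos, names in groups.items() if len(names) > 1}
print(ambiguous)
```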

In [1]:
import csv

from tqdm.notebook import tqdm

from src.features.smarterdb import global_connection, VariantSheep
from src.features.utils import get_interim_dir
from src.data.common import WORKING_ASSEMBLIES

In [2]:
_ = global_connection()
OAR4 = WORKING_ASSEMBLIES["OAR4"]

## Creating the regions file

As described in the `tabix` documentation, I can extract records by region: a *tab-separated* file with chromosome and position is enough. I have one VCF per chromosome, so I need to write one regions file per chromosome:

In [3]:
chromosomes = [str(chrom) for chrom in range(1, 27)] + ["X"]

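for chrom in tqdm(chromosomes):
    # match variants having an OAR4 location on this chromosome
    condition = OAR4._asdict()
    condition['chrom'] = chrom
    variants = VariantSheep.objects.filter(
        locations__match=condition
    ).fields(
        # project only the first location matching OAR4, plus name and rs_id
        elemMatch__locations=OAR4._asdict(),
        name=1,
        rs_id=1
    )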
    
    with open(get_interim_dir() / f"chr{chrom}_regions.tsv", "w") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for variant in variants:
            location = variant.locations[0]
            writer.writerow([location.chrom, location.position])
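
Since several probes may share a position, the same row can end up in a regions file more than once. As a possible variation, assuming a sorted and de-duplicated file is preferable, the rows for one chromosome could be written like this. The helper `write_regions` and the sample positions are hypothetical:

```python
import csv
import io


def write_regions(positions, handle):
    """Write unique (chrom, position) pairs, sorted by position, as TSV rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chrom, position in sorted(set(positions), key=lambda item: item[1]):
        writer.writerow([chrom, position])


# Hypothetical positions on a single chromosome, with one duplicate
positions = [("1", 250), ("1", 100), ("1", 250)]

buffer = io.StringIO()
write_regions(positions, buffer)
print(buffer.getvalue())
```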
            


## Extract variants from VCF

It's time to extract the SMARTER sheep variants from the VCF files, then merge all the files into a single VCF and convert it to PLINK binary format:

```bash
for i in $(seq 1 26); do tabix -h -R chr${i}_regions.tsv output_chr${i}.snp.filtered.vcf.gz | bgzip --compress-level 9 --stdout > smarter_chr${i}_regions.vcf.gz; tabix smarter_chr${i}_regions.vcf.gz ; done
tabix -h -R chrX_regions.tsv output_chrX.snp.filtered.vcf.gz | bgzip --compress-level 9 --stdout > smarter_chrX_regions.vcf.gz
tabix smarter_chrX_regions.vcf.gz
vcf-concat smarter_chr*.vcf.gz | bgzip -@24 --stdout > WGS-all/WGS-all.smarter.vcf.gz
tabix WGS-all/WGS-all.smarter.vcf.gz
plink --allow-extra-chr --vcf WGS-all/WGS-all.smarter.vcf.gz --make-bed --double-id --out WGS-all/WGS-all.smarter
```

Please note that SNPs in the final plink file `WGS-all.smarter.bim` have no name: this needs to be fixed using the `VariantSheep.name` field before the data can be merged with the SMARTER dataset.
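
One way to approach the renaming is to rewrite the second column of the `.bim` file from a `(chrom, position)` lookup, leaving a row untouched when the position is unknown or maps to more than one probe. This is only a sketch under assumptions: the helper `rename_bim_lines` and the sample rows are hypothetical, and the chromosome codes in the `.bim` file are assumed to match the keys of the lookup.

```python
def rename_bim_lines(lines, names_by_position):
    """Yield .bim lines, replacing the variant ID when the position is unambiguous."""
    for line in lines:
        # .bim columns: chrom, variant ID, cM position, bp position, allele 1, allele 2
        chrom, variant_id, cm, position, allele1, allele2 = line.split()
        names = names_by_position.get((chrom, int(position)), [])
        if len(names) == 1:
            variant_id = names[0]
        yield "\t".join([chrom, variant_id, cm, position, allele1, allele2]) + "\n"


# Hypothetical .bim rows and lookup, not taken from the real dataset
bim_lines = [
    "1\t.\t0\t1000\tA\tG\n",
    "2\t.\t0\t5000\tC\tT\n",
    "3\t.\t0\t42\tA\tC\n",
]
names_by_position = {
    ("1", 1000): ["probe_a", "probe_b"],  # ambiguous: left unnamed
    ("2", 5000): ["probe_c"],
}

renamed = list(rename_bim_lines(bim_lines, names_by_position))
print("".join(renamed))
```

Rows left unnamed here still need a decision on which probe, if any, should represent that position.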