# Sheep Coordinates and Ensembl
Ensembl stores SNP information for sheep assembly 3.1. I want to verify whether this information is in sync with SNPchimp.

In [1]:
import os
import itertools
import random

from pathlib import Path
from importlib import reload

from ensemblrest import EnsemblRest

import src.features.illumina
import src.features.snpchimp

# seed the RNG so that the random sample drawn later is reproducible
random.seed(42)

Start by reading the SNPchimp and Illumina SNP chip data from files

In [2]:
project_dir = Path.cwd().parents[1]
illumina_chip = project_dir / "data/external/SHE/ILLUMINA/ovinesnp50-genome-assembly-oar-v3-1.csv"
snpchimp_file = project_dir / "data/external/SHE/SNPCHIMP/SNPchimp_SHE_SNP50v1_oar3.1.csv"

In [3]:
new_chip3 = dict()
for record in src.features.illumina.read_snpChip(illumina_chip):
    new_chip3[record.name] = (record.chr, record.mapinfo)

In [4]:
snpchimp3 = dict()
for record in src.features.snpchimp.read_snpChimp(snpchimp_file):
    snpchimp3[record.snp_name] = [(record.chromosome, record.position), record.rs]

Ok, first track the SNP ids whose positions differ between the two files. Skip SNPs without an rsID

In [5]:
count = 0
rs_different = {}

for i, (key, value) in enumerate(new_chip3.items()):
    if value != snpchimp3[key][0]:
        count += 1
        if snpchimp3[key][1] != 'NULL':
            rs_different[snpchimp3[key][1]] = key
    if i < 10:
        print(key, value, snpchimp3[key])
    elif i == 10:
        print("...")
        
print(f"\nN of SNPs in different positions from illumina to SNPchimp: {count}")
print(f"\nN of SNPs with rsID with different positions from illumina to SNPchimp: {len(rs_different)}")

250506CS3900065000002_1238.1 ('15', 5870057) [('15', 5870057), 'rs55630613']
250506CS3900140500001_312.1 ('23', 26298017) [('23', 26298017), 'rs55630642']
250506CS3900176800001_906.1 ('7', 81648528) [('7', 81648528), 'rs55630654']
250506CS3900211600001_1041.1 ('16', 41355381) [('16', 41355381), 'rs55630658']
250506CS3900218700001_1294.1 ('2', 148802744) [('2', 148802744), 'rs55630663']
250506CS3900283200001_442.1 ('1', 188498238) [('99', 0), None]
250506CS3900371000001_1255.1 ('11', 35339123) [('11', 35339123), 'rs417377113']
250506CS3900386000001_696.1 ('16', 62646307) [('16', 62646307), 'rs55631041']
250506CS3900414400001_1178.1 ('1', 103396552) [('1', 103396552), 'rs119102699']
250506CS3900435700001_1658.1 ('12', 45221821) [('99', 0), None]
...

N of SNPs in different positions from illumina to SNPchimp: 6463

N of SNPs with rsID with different positions from illumina to SNPchimp: 447


Select some SNPs to test their coordinates against Ensembl

In [6]:
# filter out None rs
rs_keys = [rs for rs in rs_different.keys() if rs is not None]
selected_rsID = random.sample(sorted(rs_keys), 10)

# add a custom rs
selected_rsID.append("rs408606108")
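
The sample above is only repeatable if the random generator is actually seeded. A minimal sketch of one way to make that explicit, assuming a hypothetical helper `sample_rs` and a stand-in `rs_keys` list (the ids are copied from the output printed in this notebook), is to sort the ids and draw from a private seeded `random.Random` instance:

```python
import random

# Hypothetical stand-in for rs_keys; ids copied from the notebook output
rs_keys = [
    "rs428267146", "rs406134692", "rs415771800",
    "rs409291640", "rs408458247", "rs422423877",
]


def sample_rs(keys, k, seed=42):
    """Draw k ids reproducibly: sort first, then sample with a private seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(sorted(keys), k)


first = sample_rs(rs_keys, 3)
# the input order does not matter, since the ids are sorted before sampling
second = sample_rs(list(reversed(rs_keys)), 3)
print(first == second)
```

A private generator keeps the selection independent of any other code that touches the module-level `random` state.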

Ok, search in Ensembl for snp coordinates:

In [7]:
ensRest = EnsemblRest()
result = ensRest.getVariationByMultipleIds(ids=selected_rsID, species="ovis_aries", genotyping_chips=True)
for key, value in result.items():
    for location in value['mappings']:
        print(key, location['seq_region_name'], location['start'], snpchimp3[rs_different[key]], new_chip3[rs_different[key]])

rs428267146 3 5146200 [('3', 5146199), 'rs428267146'] ('3', 5146197)
rs406134692 7 29834504 [('7', 29834504), 'rs406134692'] ('7', 29834566)
rs415771800 2 27912185 [('2', 27912184), 'rs415771800'] ('2', 27912185)
rs409291640 11 61430571 [('11', 61430571), 'rs409291640'] ('11', 61430572)
rs425665688 3 177824621 [('99', 0), 'rs425665688'] ('0', 0)
rs425665688 1 91348961 [('99', 0), 'rs425665688'] ('0', 0)
rs408606108 8 54058725 [('8', 54058724), 'rs408606108'] ('8', 54058725)
rs408458247 8 17823087 [('8', 17823087), 'rs408458247'] ('8', 17823088)
rs422423877 24 5726120 [('24', 5726120), 'rs422423877'] ('24', 5726121)
rs411384047 7 96461265 [('7', 96461265), 'rs411384047'] ('0', 0)


The coordinates differ: sometimes Ensembl agrees with SNPchimp, sometimes with the Illumina chip, and sometimes it differs from both. I think this depends on the dbSNP version used for the build, or on the alignment method in general.
When neither SNPchimp nor the Illumina chip has coordinates, the SNP has multiple alignments (for example, <code>rs425665688</code> maps to both chromosome 3 and chromosome 1).
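
To put numbers on this observation, the comparison can be tallied. The following is a minimal sketch, assuming a hypothetical helper `classify_position` and a few rows copied by hand from the output above; it only tests for exact equality of `(chromosome, position)` pairs and does not try to explain off-by-one differences:

```python
from collections import Counter


def classify_position(ensembl, snpchimp, illumina):
    """Return which source(s) report the same (chrom, pos) pair as Ensembl."""
    matches = []
    if ensembl == snpchimp:
        matches.append("snpchimp")
    if ensembl == illumina:
        matches.append("illumina")
    return "+".join(matches) if matches else "neither"


# (ensembl, snpchimp, illumina) triples copied from the printed output
rows = [
    (("3", 5146200), ("3", 5146199), ("3", 5146197)),    # rs428267146
    (("7", 29834504), ("7", 29834504), ("7", 29834566)),  # rs406134692
    (("2", 27912185), ("2", 27912184), ("2", 27912185)),  # rs415771800
    (("8", 54058725), ("8", 54058724), ("8", 54058725)),  # rs408606108
]

summary = Counter(classify_position(*row) for row in rows)
print(summary)
```

In the notebook the triples could be built from `(location['seq_region_name'], location['start'])`, `snpchimp3[rs_different[key]][0]` and `new_chip3[rs_different[key]]`. The equality test only works if all three sources use the same types, as they appear to in the output above (chromosome as a string, position as an integer).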

<div class="alert alert-block alert-danger">
    <b>Danger:</b> <code>rs408606108</code>, for example, failed the alignment in Ensembl because the <code>C/T</code> alleles don't match the reference genome (which is <code>T</code>). In EVA this SNP is deprecated in Oarv4.0. Does this apply to any other SNPs? Are they BOT/reverse? All SNP data need to be validated.
</div>