# About Goat Coordinates
Ensembl stores goat coordinates on the **ARS1** assembly, the same assembly used by the [GoatGenome project](http://www.goatgenome.org/projects.html). SNPchimp has no information on this assembly, but it does seem to keep the correspondence between *SNP names* and *rsID*.

In [1]:
import os
import itertools
import random

from pathlib import Path
from importlib import reload

from ensemblrest import EnsemblRest

import src.features.illumina
import src.features.snpchimp

# seed the generator (assigning to random.seed would overwrite the function)
random.seed(42)

As with sheep, try to read the data files for goat:

In [2]:
project_dir = Path.cwd().parents[1]
illumina_chip = project_dir / "data/external/GOA/ILLUMINA/Goat_IGGC_65K_v2_15069617X365016_A2.csv"
snpchimp_file = project_dir / "data/external/GOA/SNPCHIMP/SNPchimp_GOAT_SNP50_chi1.0.csv"

In [3]:
ars_chip = dict()
for record in src.features.illumina.read_snpChip(illumina_chip, delimiter=','):
    ars_chip[record.name] = (record.chr, record.mapinfo)

In [4]:
snpchimp = dict()
for record in src.features.snpchimp.read_snpChimp(snpchimp_file):
    snpchimp[record.snp_name] = [(record.chromosome, record.position), record.rs]

I'm pretty sure the coordinates don't match, since the assemblies are different. Do the two chips have the same SNP names?

In [5]:
sorted(snpchimp.keys()) == sorted(ars_chip.keys())

False

So the keys are different. Does the new chip contain all the SNPchimp entries?

In [6]:
snpchimp_keys = set(snpchimp.keys())
ars_chip_keys = set(ars_chip.keys())
print("Snps in SNPchimp: %s" %(len(snpchimp_keys)))
print("Snp in chip which are also in SNPchimp: %s" % (len(snpchimp_keys.intersection(ars_chip_keys))))
print("New SNPs in chip: %s" % len(ars_chip_keys.difference(snpchimp_keys)))

Snps in SNPchimp: 53347
Snp in chip which are also in SNPchimp: 53347
New SNPs in chip: 6380


So the new chip version contains every SNPchimp entry plus 6380 new SNPs. Do the coordinates match for the SNPs shared by the two datasets? I suppose **NO**, since the assemblies are different.

In [7]:
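count = 0
rs_different = {}  # rsID -> SNP name, for SNPs whose positions differ

for i, (key, value) in enumerate(ars_chip.items()):
    # skip SNPs that are only on the new chip
    if key not in snpchimp:
        continue

    # compare (chromosome, position) between the chip and SNPchimp
    if value != snpchimp[key][0]:
        count += 1
        if snpchimp[key][1] != 'NULL':
            rs_different[snpchimp[key][1]] = key

    # show the first few records
    if i < 10:
        print(key, value, snpchimp[key])

    elif i == 10:
        print("...")
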
print(f"\nN of SNPs in different positions from illumina to SNPchimp: {count}")
print(f"\nN of SNPs with rsID with different positions from illumina to SNPchimp: {len(rs_different)}")


N of SNPs in different positions from illumina to SNPchimp: 53347

N of SNPs with rsID with different positions from illumina to SNPchimp: 53345


So none of the shared SNPs has the same coordinates in the two datasets. Almost all of them have an rsID in SNPchimp (53345 distinct rsIDs for 53347 SNPs).
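
The two counts above differ by 2. Since `rs_different` is keyed by rsID, that gap could come either from SNPs whose rsID is `'NULL'` or from two SNP names sharing one rsID. The sketch below shows one way to tell the two cases apart. It assumes the same `name -> [(chrom, pos), rs]` layout as the `snpchimp` dictionary; the helper `rsid_summary` and the toy records are made up for illustration.

```python
from collections import Counter


def rsid_summary(snpchimp):
    """Return (number of SNPs without rsID, {rsID: count} for shared rsIDs)."""
    rs_values = [value[1] for value in snpchimp.values()]

    # SNPchimp marks a missing rsID with the string 'NULL'
    missing = sum(1 for rs in rs_values if rs == "NULL")

    # rsIDs assigned to more than one SNP name
    counts = Counter(rs for rs in rs_values if rs != "NULL")
    duplicated = {rs: n for rs, n in counts.items() if n > 1}

    return missing, duplicated


# toy data (hypothetical names, positions and rsIDs)
toy = {
    "snp_a": [("1", 100), "rs1"],
    "snp_b": [("1", 200), "rs2"],
    "snp_c": [("2", 300), "rs2"],
    "snp_d": [("3", 400), "NULL"],
}

missing, duplicated = rsid_summary(toy)
print(f"SNPs without rsID: {missing}")
print(f"rsIDs shared by several SNPs: {duplicated}")
```

Calling `rsid_summary(snpchimp)` on the real dictionary should show which of the two cases accounts for the gap.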

In [8]:
selected_rsID = random.sample(sorted(rs_different), 20)
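
The 20 rsIDs drawn here change between runs unless the random generator is seeded with a call such as `random.seed(42)`. As a minimal sketch, a private `random.Random` instance gives the same reproducibility without touching the global state. The helper `pick` and the rsIDs below are invented for the example.

```python
import random


def pick(ids, k, seed):
    """Draw k items from ids, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, global state untouched
    # sort first so the result does not depend on the input ordering
    return rng.sample(sorted(ids), k)


# made-up rsIDs, only for illustration
rsids = {f"rs{n}" for n in range(1000, 1100)}

first = pick(rsids, 5, seed=42)
second = pick(rsids, 5, seed=42)
print(first == second)
```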

In [9]:
ensRest = EnsemblRest()
result = ensRest.getVariationByMultipleIds(ids=selected_rsID, species="capra_hircus")
for key, value in result.items():
    for location in value['mappings']:
        print(key, location['seq_region_name'], location['start'], snpchimp[rs_different[key]], ars_chip[rs_different[key]])

rs268267859 2 96832550 [('2', 38735494), 'rs268267859'] ('2', 96832550)
rs268278731 3 83126133 [('3', 35289018), 'rs268278731'] ('0.NW_scaffold', 11865)
rs268267544 9 41101060 [('9', 40838835), 'rs268267544'] ('9', 41101060)
rs268283082 2 78014805 [('2', 57558876), 'rs268283082'] ('2', 78014805)
rs268251314 5 4738366 [('5', 4709049), 'rs268251314'] ('5', 4738366)
rs268276522 20 37193614 [('20', 36929497), 'rs268276522'] ('20', 37193614)
rs268234717 19 14879682 [('19', 14577961), 'rs268234717'] ('19', 14879682)
rs268253096 1 5392991 [('1', 6233916), 'rs268253096'] ('1', 5392991)
rs268260220 1 138469529 [('1', 136455683), 'rs268260220'] ('1', 138469529)
rs268293116 6 86209997 [('6', 82907013), 'rs268293116'] ('6', 86209997)
rs268236795 16 64044670 [('16', 63062129), 'rs268236795'] ('16', 64044670)
rs268283889 26 6903525 [('26', 43500435), 'rs268283889'] ('26', 6903525)
rs268244330 26 26883778 [('26', 23159349), 'rs268244330'] ('26', 26883778)
rs268237037 17 22501157 [('17', 48254896), 'r

It seems to me that the coordinates stored in the chip are identical to those in Ensembl, with one exception (rs268278731, which the chip places on `0.NW_scaffold`), at least for this subset of SNPs.
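
Rather than comparing the printed lines by eye, the check could be automated. The sketch below assumes the response layout used in the loop above (a `mappings` list with `seq_region_name` and `start` for each rsID) and the `name -> (chrom, pos)` layout of `ars_chip`. The function `compare_with_chip` is hypothetical, and the SNP names `snp_a` and `snp_b` are placeholders; the coordinates are taken from the first two output rows.

```python
def compare_with_chip(ensembl_result, chip, rs2name):
    """Split rsIDs into those matching the chip position and those that don't."""
    matches, mismatches = [], []

    for rsid, record in ensembl_result.items():
        chip_pos = chip[rs2name[rsid]]
        # normalize types so ('2', 100) and (2, '100') compare equal
        normalized = (str(chip_pos[0]), int(chip_pos[1]))

        # a variant can have several mappings: accept a match on any of them
        ensembl_pos = [
            (str(m["seq_region_name"]), int(m["start"]))
            for m in record["mappings"]
        ]

        if normalized in ensembl_pos:
            matches.append(rsid)
        else:
            mismatches.append((rsid, ensembl_pos, chip_pos))

    return matches, mismatches


# two records shaped like the output above (SNP names are placeholders)
ensembl_result = {
    "rs268267859": {"mappings": [{"seq_region_name": "2", "start": 96832550}]},
    "rs268278731": {"mappings": [{"seq_region_name": "3", "start": 83126133}]},
}
rs2name = {"rs268267859": "snp_a", "rs268278731": "snp_b"}
chip = {"snp_a": ("2", 96832550), "snp_b": ("0.NW_scaffold", 11865)}

matches, mismatches = compare_with_chip(ensembl_result, chip, rs2name)
print(f"matching: {len(matches)}, mismatching: {len(mismatches)}")
for rsid, ensembl_pos, chip_pos in mismatches:
    print(rsid, ensembl_pos, chip_pos)
```

In the notebook the call would be `compare_with_chip(result, ars_chip, rs_different)`.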