# About Goat Coordinates
Ensembl stores goat coordinates on the **ARS1** assembly, the same assembly used by the [GoatGenome project](http://www.goatgenome.org/projects.html). SNPchimp has no information on this assembly, but it does seem to keep the correspondence between *SNP names* and *rsIDs*.

In [1]:
import os
import itertools
import random
import pandas

from collections import defaultdict
from pathlib import Path
from importlib import reload
from ensemblrest import EnsemblRest

import src.features.illumina
import src.features.snpchimp

# seed the generator so the random sample drawn later is reproducible
random.seed(42)

As with sheep, try to read the data files for goat:

In [2]:
project_dir = Path.cwd().parents[1]
illumina_chip = project_dir / "data/external/GOA/ILLUMINA/Goat_IGGC_65K_v2_15069617X365016_A2.csv"
snpchimp_file = project_dir / "data/external/GOA/SNPCHIMP/SNPchimp_GOAT_SNP50_chi1.0.csv"

In [3]:
ars_chip = defaultdict(list)
for record in src.features.illumina.read_Manifest(illumina_chip, delimiter=','):
    ars_chip['name'].append(record.name)
    ars_chip['chr'].append(record.chr)
    ars_chip['position'].append(record.mapinfo)
    ars_chip['ilmnstrand'].append(record.ilmnstrand)
    ars_chip['snp'].append(record.snp)
ars_chip = pandas.DataFrame.from_dict(ars_chip)
ars_chip.set_index('name', inplace=True)
ars_chip.head()

name,chr,position,ilmnstrand,snp
1_101941444_AF-PAKI,1,101941444,TOP,A/G
1_10408764_AF-PAKI,1,10408764,TOP,A/G
1_104453302_AF-PAKI,1,104453302,TOP,A/G
1_107080965_AF-PAKI,1,107080965,BOT,T/C
1_109839943_AF-PAKI,1,109839943,BOT,T/C


In [4]:
snpchimp = defaultdict(list)
for record in src.features.snpchimp.read_snpChimp(snpchimp_file):
    snpchimp['name'].append(record.snp_name)
    snpchimp['chr'].append(record.chromosome)
    snpchimp['position'].append(record.position)
    snpchimp['ilmnstrand'].append(record.strand)
    snpchimp['snp'].append(record.alleles_a_b_top)
    snpchimp['rsID'].append(record.rs)
snpchimp = pandas.DataFrame.from_dict(snpchimp)
snpchimp.set_index('name', inplace=True)
snpchimp.head()

name,chr,position,ilmnstrand,snp,rsID
snp1-scaffold1-2170,22,27222753,top,A/C,rs268233143
snp1-scaffold708-1421224,14,90885671,bottom,A/G,rs268293133
snp10-scaffold1-352655,22,26872268,top,A/G,rs268233152
snp1000-scaffold1026-533890,8,68958341,top,A/G,rs268291433
snp10000-scaffold1356-652219,7,50027003,top,A/G,rs268242876


I'm pretty sure the coordinates don't match, since the assemblies are different. Do the two chips at least have the same SNP names?

In [5]:
sorted(list(snpchimp.index)) == sorted(list(ars_chip.index))

False

So the keys are different. Does the new chip contain all the SNPchimp entries?

In [6]:
snpchimp_keys = set(list(snpchimp.index))
ars_chip_keys = set(list(ars_chip.index))
print("Snps in SNPchimp: %s" %(len(snpchimp_keys)))
print("Snp in chip which are also in SNPchimp: %s" % (len(snpchimp_keys.intersection(ars_chip_keys))))
print("New SNPs in chip: %s" % len(ars_chip_keys.difference(snpchimp_keys)))

Snps in SNPchimp: 53347
Snp in chip which are also in SNPchimp: 53347
New SNPs in chip: 6380
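
The same comparison can be done directly on the pandas indexes, without converting them to Python sets. This is only a minimal sketch on toy data: the two `pd.Index` objects below are hypothetical stand-ins for `snpchimp.index` and `ars_chip.index`, filled with a few names taken from the tables above.

```python
import pandas as pd

# Toy stand-ins for snpchimp.index and ars_chip.index (assumption: a few
# names copied from the tables above, not the real 53k/59k indexes)
snpchimp_index = pd.Index(["snp1-scaffold1-2170", "snp10-scaffold1-352655"])
ars_chip_index = pd.Index(
    ["snp1-scaffold1-2170", "snp10-scaffold1-352655", "1_101941444_AF-PAKI"]
)

# SNP names present in both datasets
shared = snpchimp_index.intersection(ars_chip_index)

# SNP names only in the new chip
new_in_chip = ars_chip_index.difference(snpchimp_index)

# SNP names in SNPchimp that the new chip lacks
missing_from_chip = snpchimp_index.difference(ars_chip_index)

print(f"Shared SNPs: {len(shared)}")
print(f"New SNPs in chip: {len(new_in_chip)}")
print(f"SNPchimp SNPs missing from chip: {len(missing_from_chip)}")
```

Checking `missing_from_chip` explicitly answers the question in both directions: an empty result means every SNPchimp entry is still on the new chip.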


There are 6380 new SNPs in the new chip version. Do the coordinates match for the SNPs shared by the two datasets? I suppose **NO**, since the assemblies are different.

In [7]:
count = 0
rs_different = {}

for i, ars_chip_row in ars_chip.iterrows():
    if ars_chip_row.name not in snpchimp_keys:
        continue

    # get the matching SNPchimp row by SNP name
    snpchimp_row = snpchimp.loc[ars_chip_row.name, :]
    
    if (snpchimp_row['chr'], snpchimp_row['position']) != (ars_chip_row['chr'], ars_chip_row['position']):
        count += 1
        if snpchimp_row['rsID'] != 'NULL':
            rs_different[snpchimp_row['rsID']] = snpchimp_row.name
    
print(f"\nN of SNPs in different positions from illumina to SNPchimp: {count}")
print(f"\nN of SNPs with rsID with different positions from illumina to SNPchimp: {len(rs_different)}")


N of SNPs in different positions from illumina to SNPchimp: 53347

N of SNPs with rsID with different positions from illumina to SNPchimp: 53345


So no coordinates match. Almost all SNPs have an rsID in SNPchimp.
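
The row-by-row loop above could also be written as a join on the SNP name, which tends to be faster on tens of thousands of rows. The sketch below is not the notebook's method: it runs on two toy frames, with two rows copied from the outputs in this notebook plus one invented row (`toy_same`) whose coordinates agree, so that both outcomes are exercised.

```python
import pandas as pd

# Toy frames indexed by SNP name. "toy_same" is an invented row with
# identical coordinates in both frames; the others come from this notebook.
snpchimp = pd.DataFrame(
    {
        "chr": ["13", "23", "1"],
        "position": [9179149, 40785604, 100],
        "rsID": ["rs268239058", "rs268243138", "NULL"],
    },
    index=pd.Index(
        ["snp6081-scaffold1216-103970", "snp10268-scaffold1368-2529699", "toy_same"],
        name="name",
    ),
)
ars_chip = pd.DataFrame(
    {
        "chr": ["13", "23", "1", "1"],
        "position": [9575467, 6733488, 100, 101941444],
    },
    index=pd.Index(
        [
            "snp6081-scaffold1216-103970",
            "snp10268-scaffold1368-2529699",
            "toy_same",
            "1_101941444_AF-PAKI",
        ],
        name="name",
    ),
)

# Inner join keeps only SNPs present in both datasets
merged = ars_chip.join(snpchimp, how="inner", lsuffix="_chip", rsuffix="_snpchimp")

# True where chromosome or position disagree
differs = (merged["chr_chip"] != merged["chr_snpchimp"]) | (
    merged["position_chip"] != merged["position_snpchimp"]
)

count = int(differs.sum())
# Distinct rsIDs among the differing rows, ignoring the 'NULL' placeholder
with_rs = merged.loc[differs & (merged["rsID"] != "NULL"), "rsID"].nunique()

print(f"N of SNPs in different positions: {count}")
print(f"N of SNPs with rsID in different positions: {with_rs}")
```

Note that the comparison assumes `chr` has the same type in both frames (strings here); a mix of integers and strings would make every row look different.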

In [8]:
selected_rsID = random.sample(sorted(list(rs_different.keys())), 20)

In [9]:
ensRest = EnsemblRest()
result = ensRest.getVariationByMultipleIds(ids=selected_rsID, species="capra_hircus")
for key, value in result.items():
    for location in value['mappings']:
        print(
            key, rs_different[key],
            location['seq_region_name'], location['start'], location['allele_string'], 
            list(snpchimp.loc[rs_different[key], ['chr', 'position', 'ilmnstrand', 'snp']]), 
            list(ars_chip.loc[rs_different[key], ['chr', 'position', 'ilmnstrand', 'snp']])
        )

rs268239058 snp6081-scaffold1216-103970 13 9575467 C/A ['13', 9179149, 'top', 'A/C'] ['13', 9575467, 'TOP', 'A/C']
rs268292422 snp35435-scaffold427-949089 9 41528091 G/A ['9', 41264923, 'top', 'A/G'] ['9', 41528091, 'TOP', 'A/G']
rs268278628 snp46818-scaffold653-2097388 11 99026982 G/T ['11', 97952477, 'top', 'A/C'] ['11', 99026982, 'BOT', 'T/G']
rs268248074 snp15346-scaffold163-2748091 22 38417436 C/T ['22', 38293677, 'top', 'A/G'] ['22', 38417436, 'TOP', 'A/G']
rs268282075 snp50367-scaffold72-2629775 21 11607025 T/C ['21', 10396630, 'bottom', 'A/G'] ['21', 11607025, 'BOT', 'T/C']
rs268241883 snp8977-scaffold1327-155697 16 22609992 A/G ['16', 21823346, 'top', 'A/G'] ['16', 22609992, 'TOP', 'A/G']
rs268243138 snp10268-scaffold1368-2529699 23 6733488 G/A ['23', 40785604, 'top', 'A/G'] ['23', 6733488, 'TOP', 'A/G']
rs268269625 snp37544-scaffold46-1948609 1 153745989 C/T ['1', 151371167, 'top', 'A/G'] ['1', 153745989, 'BOT', 'T/C']
rs268268850 snp36756-scaffold445-2969981 15 48728463 T/C 

It seems that the coordinates stored in the chip manifest are identical to those in Ensembl (at least for this subset of SNPs).
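
Rather than comparing the printed rows by eye, the check could be automated. The sketch below is only an illustration: `compare_with_chip` is a hypothetical helper, and `result`, `rs_to_name` and `chip` are toy stand-ins shaped like the Ensembl response, the `rs_different` dictionary and the `ars_chip` frame used above, filled with two rows from the printed output so that it runs without a network call.

```python
import pandas as pd

# Toy stand-in for the Ensembl REST response: rsID -> {'mappings': [...]}
# (values copied from the output printed above)
result = {
    "rs268239058": {
        "mappings": [
            {"seq_region_name": "13", "start": 9575467, "allele_string": "C/A"}
        ]
    },
    "rs268243138": {
        "mappings": [
            {"seq_region_name": "23", "start": 6733488, "allele_string": "G/A"}
        ]
    },
}

# Toy stand-in for rs_different: rsID -> SNP name
rs_to_name = {
    "rs268239058": "snp6081-scaffold1216-103970",
    "rs268243138": "snp10268-scaffold1368-2529699",
}

# Toy stand-in for ars_chip, indexed by SNP name
chip = pd.DataFrame(
    {"chr": ["13", "23"], "position": [9575467, 6733488]},
    index=["snp6081-scaffold1216-103970", "snp10268-scaffold1368-2529699"],
)


def compare_with_chip(result, rs_to_name, chip):
    """Hypothetical helper: map each rsID to True when at least one
    Ensembl mapping has the same chromosome and start as the chip row."""
    matches = {}
    for rs_id, record in result.items():
        row = chip.loc[rs_to_name[rs_id]]
        matches[rs_id] = any(
            str(mapping["seq_region_name"]) == str(row["chr"])
            and int(mapping["start"]) == int(row["position"])
            for mapping in record["mappings"]
        )
    return matches


matches = compare_with_chip(result, rs_to_name, chip)
print(f"{sum(matches.values())}/{len(matches)} rsIDs match the chip coordinates")
```

Only chromosome and position are compared here; the allele strings are left out because Ensembl and the manifest may list the same alleles in a different order or on a different strand.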