# Sheep Affymetrix chip aligned
I aligned the Affymetrix sheep chip probe set to the older `OAR4` assembly with megablast. Here I compare my results with the positions I get from dbSNP.

In [1]:
from collections import defaultdict

from src.features.smarterdb import global_connection, VariantSheep
from src.data.common import AssemblyConf

import pandas as pd

In [2]:
conn = global_connection()
dbSNP152 = AssemblyConf(version="Oar_v4.0", imported_from="dbSNP152")

First, load my alignment results and set `snp_name` as the index:

In [3]:
results = pd.read_csv("Axiom_Ovi_Can.na35.r3.a3.annot.csv-GCA_000298735.2_Oar_v4.0_genomic.fna.blastn.csv")
results.set_index("snp_name", inplace=True)
results.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt
Affx-293815543,10,86367023,C/T,T/C,T/C,BOT,forward,T,C
Affx-139979198,0,0,,T/G,,BOT,,,
Affx-139969918,0,0,,C/G,,TOP,,,
Affx-139932950,0,0,,T/C,,BOT,,,
Affx-139939859,0,0,,A/G,,TOP,,,
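
A small sketch of the loading step, using a toy in-memory CSV instead of the real file. It shows `index_col` as an alternative to calling `set_index` afterwards. Reading `chrom` as a string is an assumption about what is desirable here, so that names such as `0` and `X` compare consistently later. The names `toy_csv` and `toy_results` are illustrative only.

```python
import io

import pandas as pd

# Toy CSV with the same layout as the results table (two rows copied from above)
toy_csv = io.StringIO(
    "snp_name,chrom,position,alleles\n"
    "Affx-293815543,10,86367023,C/T\n"
    "Affx-139979198,0,0,\n"
)

# index_col sets the index while reading; dtype keeps chromosome names as strings
toy_results = pd.read_csv(toy_csv, index_col="snp_name", dtype={"chrom": str})

print(toy_results["chrom"].tolist())
```

With a string `chrom` column, a filter such as `toy_results["chrom"] == "0"` selects the unplaced SNPs regardless of which other chromosome names appear in the file.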


Next, load the errors: if I can't place a SNP on a chromosome, it has no position in the results table, and the reason is recorded in this table:

In [4]:
errors = pd.read_csv("Axiom_Ovi_Can.na35.r3.a3.annot.csv-GCA_000298735.2_Oar_v4.0_genomic.fna.blastn.err")
errors.set_index("snp_name", inplace=True)
errors.head()

snp_name,illumina,illumina_strand,reason
Affx-139979198,T/G,BOT,No valid alignments after filtering
Affx-139969918,C/G,TOP,No valid alignments after filtering
Affx-139932950,T/C,BOT,No valid alignments after filtering
Affx-139939859,A/G,TOP,No valid alignments after filtering
Affx-139991202,A/G,TOP,No valid alignments after filtering


Now get my sheep variants and focus on the *NCBI* data. I could have more variants than *NCBI* if some probes are more recent than dbSNP152:

In [5]:
ncbi_variants = VariantSheep.objects.filter(chip_name="AffymetrixAxiomOviCan", locations__match=dbSNP152._asdict(), rs_id__exists=True)
ncbi_variants.count()

39105

Now extract the dbSNP locations from `ncbi_variants`:

In [6]:
tmp = defaultdict(list)

for variant in ncbi_variants:
    location = variant.get_location(**dbSNP152._asdict())
    tmp["snp_name"].append(variant.affy_snp_id)
    tmp["rs_id"].append(",".join(variant.rs_id))
    tmp["ncbi_chrom"].append(location.chrom)
    tmp["ncbi_position"].append(location.position)
    
    
ncbi_locations = pd.DataFrame.from_dict(tmp)
ncbi_locations.set_index('snp_name', inplace=True)
ncbi_locations.head()

snp_name,rs_id,ncbi_chrom,ncbi_position
Affx-256854517,rs10721113,18,64294536
Affx-122852950,"rs406297509,rs1087899539",16,68777502
Affx-122806470,"rs402039066,rs1093088087",5,34727924
Affx-122839502,rs119102699,1,103285485
Affx-122821645,rs159412897,1,121010442
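
The extraction loop above can also be written as a list of per-variant records, which keeps the fields of each SNP together. This is only a sketch: `toy_variants` uses `SimpleNamespace` stand-ins for the database objects, with `chrom` and `position` stored directly on the object instead of being fetched through `get_location`.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for VariantSheep documents (values copied from the table above)
toy_variants = [
    SimpleNamespace(affy_snp_id="Affx-256854517", rs_id=["rs10721113"],
                    chrom="18", position=64294536),
    SimpleNamespace(affy_snp_id="Affx-122852950", rs_id=["rs406297509", "rs1087899539"],
                    chrom="16", position=68777502),
]

# One dictionary per variant; multiple rs_id values are joined with a comma
records = [
    {
        "snp_name": v.affy_snp_id,
        "rs_id": ",".join(v.rs_id),
        "ncbi_chrom": v.chrom,
        "ncbi_position": v.position,
    }
    for v in toy_variants
]

toy_locations = pd.DataFrame(records).set_index("snp_name")
print(toy_locations)
```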


Merge these data into a new dataframe and drop the SNPs that are not in *NCBI*. Note that the left merge introduces *NA* values in `ncbi_position`, which turns the column into floats, so it has to be converted back to *integer* after filtering:

In [7]:
tmp = results.merge(ncbi_locations, how="left", on="snp_name")
ncbi_results = tmp[tmp['ncbi_chrom'].notna()].astype({'ncbi_position': 'int'})
ncbi_results.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position
Affx-122847494,3,30880580,A/G,A/G,A/G,TOP,forward,A,G,rs424489686,3,30880580
Affx-122829181,2,219365951,A/G,A/G,A/G,TOP,forward,G,A,rs401909860,2,219365951
Affx-122816720,1,120533735,A/G,A/G,A/G,TOP,forward,A,G,rs398687222,1,120533735
Affx-122808678,16,22212130,C/T,T/C,T/C,BOT,forward,C,T,rs415806402,16,22212130
Affx-122814061,1,4556384,C/T,T/C,T/C,BOT,forward,C,T,rs55630584,1,4556384
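
To illustrate why the integer conversion is needed, here is a minimal sketch on toy data (the frames `left` and `right` are made up for the example, with values borrowed from the tables above). A left merge fills missing matches with NaN, so an integer column becomes float until the unmatched rows are dropped and the column is cast back.

```python
import pandas as pd

left = pd.DataFrame(
    {"position": [30880580, 0]},
    index=pd.Index(["Affx-122847494", "Affx-139979198"], name="snp_name"),
)
right = pd.DataFrame(
    {"ncbi_chrom": ["3"], "ncbi_position": [30880580]},
    index=pd.Index(["Affx-122847494"], name="snp_name"),
)

# The second SNP has no NCBI record, so its NCBI columns are filled with NaN
merged = left.merge(right, how="left", on="snp_name")
print(merged["ncbi_position"].dtype)

# Keep only rows with an NCBI location, then restore the integer type
kept = merged[merged["ncbi_chrom"].notna()].astype({"ncbi_position": "int"})
print(kept["ncbi_position"].dtype)
```

If the SNPs without an NCBI record are not needed at all, `how="inner"` would avoid the NaN values in the first place. The left merge is kept here because it matches the notebook.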


Now focus on the differences between my alignment and NCBI:

In [8]:
differences = ncbi_results.query("chrom != ncbi_chrom | position != ncbi_position")
differences.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position
Affx-122859153,0,0,,A/G,,TOP,,,,rs428412553,2,6796820
Affx-122859040,0,0,,T/C,,BOT,,,,rs409528638,6,111790031
Affx-122858615,0,0,,T/G,,BOT,,,,rs415678531,3,41459193
Affx-122858602,0,0,,T/C,,BOT,,,,rs429068120,23,2448098
Affx-122858585,0,0,,A/G,,TOP,,,,rs408135675,26,31305174


In [9]:
differences.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2162 entries, Affx-122859153 to Affx-122806049
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   chrom             2162 non-null   object
 1   position          2162 non-null   int64 
 2   alleles           53 non-null     object
 3   illumina          2162 non-null   object
 4   illumina_forward  53 non-null     object
 5   illumina_strand   2162 non-null   object
 6   strand            53 non-null     object
 7   ref               53 non-null     object
 8   alt               53 non-null     object
 9   rs_id             2162 non-null   object
 10  ncbi_chrom        2162 non-null   object
 11  ncbi_position     2162 non-null   int64 
dtypes: int64(2), object(10)
memory usage: 219.6+ KB


There are 2162 differences between my alignment and NCBI. Count them by the chromosome I assigned:

In [10]:
differences["chrom"].value_counts()

0                 2109
2                    6
1                    6
6                    5
18                   4
5                    4
26                   4
12                   3
4                    2
8                    2
24                   2
16                   2
21                   2
19                   2
13                   1
3                    1
11                   1
10                   1
14                   1
X                    1
7                    1
AMGL02043384.1       1
25                   1
Name: chrom, dtype: int64

Now count how many SNPs I can't place while *NCBI* can:

In [11]:
not_placed = differences[differences["chrom"] == '0']
print(f"There are {not_placed.shape[0]} SNPs that I can't map to genome")
not_placed.merge(errors["reason"], how="left", on="snp_name").sort_values("reason")

There are 2109 SNPs that I can't map to genome


snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position,reason
Affx-122833434,0,0,,T/C,,BOT,,,,rs406770223,X,61594894,Allele doesn't match to reference
Affx-122850925,0,0,,T/G,,BOT,,,,rs418865333,25,38924792,Allele doesn't match to reference
Affx-122850925,0,0,,T/G,,BOT,,,,rs418865333,25,38924792,Allele doesn't match to reference
Affx-122830050,0,0,,T/C,,BOT,,,,"rs407048207,rs401626637",15,52107496,Allele doesn't match to reference
Affx-122830050,0,0,,T/C,,BOT,,,,"rs407048207,rs401626637",15,52107496,Allele doesn't match to reference
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Affx-122828030,0,0,,T/C,,BOT,,,,rs426199307,26,11423020,Too many alignments after filtering
Affx-122843237,0,0,,T/C,,BOT,,,,rs424022354,4,45287904,Too many alignments after filtering
Affx-122843237,0,0,,T/C,,BOT,,,,rs424022354,4,45287904,Too many alignments after filtering
Affx-122843237,0,0,,T/C,,BOT,,,,rs424022354,4,45287904,Too many alignments after filtering


Many SNPs cannot be placed. Group them by reason:

In [12]:
not_placed.merge(errors["reason"], how="left", on="snp_name")["reason"].value_counts()

No valid alignments after filtering                    2268
Allele doesn't match to reference                        69
Too many alignments after filtering                      64
Cannot determine a unique position for SNP A/G (36)       9
Cannot determine a unique position for SNP A/G (35)       7
Cannot determine a unique position for SNP T/C (36)       6
Cannot determine a unique position for SNP A/C (35)       4
Cannot determine a unique position for SNP T/C (35)       3
Cannot determine a unique position for SNP T/G (36)       2
Cannot determine a unique position for SNP T/G (37)       1
Cannot determine a unique position for SNP T/G (35)       1
Cannot determine a unique position for SNP T/G (38)       1
Cannot determine a unique position for SNP T/C (33)       1
Cannot determine a unique position for SNP A/C (36)       1
Name: reason, dtype: int64
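
The counts above add up to 2437, more than the 2109 unplaced SNPs, and the merged tables show repeated rows. This suggests that some `snp_name` values occur more than once in `errors`, so the left merge multiplies the matching rows. The sketch below reproduces the effect on toy data and removes exact duplicate (`snp_name`, `reason`) pairs before merging. Whether dropping duplicates is the right choice for the real error file is an assumption, and the `*_toy` names are illustrative.

```python
import pandas as pd

not_placed_toy = pd.DataFrame(
    {"rs_id": ["rs418865333", "rs406770223"]},
    index=pd.Index(["Affx-122850925", "Affx-122833434"], name="snp_name"),
)

# The first SNP is reported twice in the toy error table
errors_toy = pd.DataFrame(
    {"reason": ["Allele doesn't match to reference"] * 3},
    index=pd.Index(
        ["Affx-122850925", "Affx-122850925", "Affx-122833434"], name="snp_name"
    ),
)

# Naive merge: the duplicated error row duplicates the SNP row
naive = not_placed_toy.merge(errors_toy["reason"], how="left", on="snp_name")

# Drop repeated (snp_name, reason) pairs, keeping distinct reasons if any
deduplicated = errors_toy.reset_index().drop_duplicates().set_index("snp_name")
clean = not_placed_toy.merge(deduplicated["reason"], how="left", on="snp_name")

print(len(naive), len(clean))
```

A quick check on the real data would be `errors.index.duplicated().sum()`, which counts the repeated `snp_name` entries.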

In [13]:
tmp = not_placed.merge(errors["reason"], how="left", on="snp_name")
tmp[tmp["reason"] == "No valid alignments after filtering"]

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position,reason
Affx-122859153,0,0,,A/G,,TOP,,,,rs428412553,2,6796820,No valid alignments after filtering
Affx-122859040,0,0,,T/C,,BOT,,,,rs409528638,6,111790031,No valid alignments after filtering
Affx-122858615,0,0,,T/G,,BOT,,,,rs415678531,3,41459193,No valid alignments after filtering
Affx-122858602,0,0,,T/C,,BOT,,,,rs429068120,23,2448098,No valid alignments after filtering
Affx-122858585,0,0,,A/G,,TOP,,,,rs408135675,26,31305174,No valid alignments after filtering
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Affx-122807329,0,0,,A/G,,TOP,,,,rs417861422,1,106450244,No valid alignments after filtering
Affx-122807016,0,0,,T/G,,BOT,,,,rs416491353,2,67161544,No valid alignments after filtering
Affx-122807055,0,0,,T/C,,BOT,,,,rs419002996,26,33849475,No valid alignments after filtering
Affx-122807055,0,0,,T/C,,BOT,,,,rs419002996,26,33849475,No valid alignments after filtering


Are there any *SNPs* that I map to a different position than NCBI?

In [14]:
different = differences.query("chrom != '0' and ncbi_chrom != '0'")
print(f"There are {different.shape[0]} SNPs that I can map to a different position")
different.merge(errors["reason"], how="left", on="snp_name").sort_values("reason")

There are 50 SNPs that I can map to a different position


snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position,reason
Affx-122854353,25,26911111,A/G,A/G,A/G,TOP,forward,G,A,"rs403904639,rs419622107",25,26911110,
Affx-122853463,1,79028293,A/G,A/G,A/G,TOP,forward,G,A,"rs398751024,rs404404860",1,79028292,
Affx-122853409,8,29628216,C/T,T/C,T/C,BOT,forward,C,T,rs421478715,8,29628215,
Affx-122852870,4,103901630,C/T,T/C,T/C,BOT,forward,T,C,rs412596792,4,103901753,
Affx-122850247,1,249991139,A/T,T/A,A/T,BOT,reverse,A,T,rs424770920,1,249991140,
Affx-122850826,1,109938389,A/G,A/G,A/G,TOP,forward,G,A,"rs427231222,rs421380864",1,109938388,
Affx-122847381,19,4624278,A/G,A/G,A/G,TOP,forward,G,A,rs424927702,19,4624173,
Affx-122846971,2,53100192,C/G,C/G,G/C,TOP,reverse,G,C,rs402596963,2,53100191,
Affx-122844107,4,5551297,A/G,A/G,A/G,TOP,forward,A,G,rs417507408,4,5551154,
Affx-122844311,24,38152051,C/G,G/C,C/G,BOT,reverse,C,G,rs402978156,24,38152050,
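
Several of the rows above differ from NCBI by a single base on the same chromosome, while others are more than a hundred bases apart. A possible way to separate the two cases is to compute the signed offset between the two positions. The sketch below does this on three rows copied from the table; the `offset` column and the `toy` frame are names introduced for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "chrom": ["25", "4", "1"],
        "position": [26911111, 103901630, 249991139],
        "ncbi_chrom": ["25", "4", "1"],
        "ncbi_position": [26911110, 103901753, 249991140],
    },
    index=pd.Index(
        ["Affx-122854353", "Affx-122852870", "Affx-122850247"], name="snp_name"
    ),
)

# Offsets are only meaningful when both sources agree on the chromosome
same_chrom = toy[toy["chrom"] == toy["ncbi_chrom"]].copy()
same_chrom["offset"] = same_chrom["position"] - same_chrom["ncbi_position"]

# Split off-by-one differences from larger displacements
off_by_one = same_chrom[same_chrom["offset"].abs() == 1]
print(same_chrom["offset"].tolist())
print(off_by_one.index.tolist())
```

On the full `different` table, `same_chrom["offset"].abs().value_counts()` would summarize how many discrepancies fall into each size class.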
