# Sheep Illumina HD chip aligned
I've aligned the sheep chip probeset to the older `OAR4` assembly with megablast. I want to compare my results with the positions I get from dbSNP.

In [1]:
import os
import pymongo
import pandas as pd

from dotenv import find_dotenv, load_dotenv
from pymongo import MongoClient
from pymongoarrow.monkey import patch_all
from pymongoarrow.api import Schema

In [2]:
load_dotenv(find_dotenv())
patch_all()

First, load my aligned data and set `snp_name` as the index:

In [3]:
results = pd.read_csv("ovinesnpHD-genome-assembly-oar-v3-1.csv-GCA_000298735.2_Oar_v4.0_genomic.fna.blastn.csv", low_memory=False)
results.set_index("snp_name", inplace=True)
results.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt
250506CS3900140500001_312.1,23,26243215,C/T,T/C,T/C,BOT,forward,C,T
250506CS3900176800001_906.1,7,81590897,C/T,T/C,T/C,BOT,forward,C,T
250506CS3900211600001_1041.1,16,41363310,G/T,T/G,T/G,BOT,forward,G,T
250506CS3900218700001_1294.1,2,148834939,C/T,T/C,T/C,BOT,forward,C,T
250506CS3900283200001_442.1,1,188328803,A/C,T/G,A/C,BOT,reverse,A,C


Next, load the errors. If I can't place a SNP on a chromosome, it has no position in the results table and the reason is recorded in this table:

In [4]:
errors = pd.read_csv("ovinesnpHD-genome-assembly-oar-v3-1.csv-GCA_000298735.2_Oar_v4.0_genomic.fna.blastn.err")
errors.set_index("snp_name", inplace=True)
errors.head()

snp_name,illumina,illumina_strand,reason
DU175804_598.1,T/C,BOT,Allele doesn't match to reference
DU178311_404.1,T/C,BOT,No valid alignments after filtering
DU179070_177.1,A/G,TOP,No valid alignments after filtering
DU186191_327.1,T/C,BOT,Allele doesn't match to reference
DU191809_420.1,T/C,BOT,No valid alignments after filtering
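The relationship between the two tables can be checked directly. The following is a minimal sketch on toy data, assuming (as the rows in this notebook suggest) that unplaced SNPs are stored with `chrom` equal to the string `'0'`; the small dataframes are built by hand from a few rows shown here and are not the real files:

```python
import pandas as pd

# Toy subset of the results table: two placed SNPs and two unplaced ones
results = pd.DataFrame(
    {"chrom": ["23", "7", "0", "0"], "position": [26243215, 81590897, 0, 0]},
    index=pd.Index(
        [
            "250506CS3900140500001_312.1",
            "250506CS3900176800001_906.1",
            "DU175804_598.1",
            "DU178311_404.1",
        ],
        name="snp_name",
    ),
)

# Toy subset of the errors table
errors = pd.DataFrame(
    {
        "reason": [
            "Allele doesn't match to reference",
            "No valid alignments after filtering",
        ]
    },
    index=pd.Index(["DU175804_598.1", "DU178311_404.1"], name="snp_name"),
)

# SNPs without a placement (assumption: chrom == '0' marks them)
unplaced = results[results["chrom"] == "0"].index

# Unplaced SNPs that have no recorded reason: expected to be empty
missing = unplaced.difference(errors.index)
print(f"{len(unplaced)} unplaced SNPs, {len(missing)} without a reason")
```

If `missing` is not empty on the real data, some SNPs were dropped without an explanation and the two files are out of sync.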


Ok, now get my sheep variants and focus on *NCBI* data: I could have more variants than *NCBI* if some probes are more recent than dbSNP152. I use `pymongoarrow` to collect the data; first connect to the database and get a collection:

In [5]:
conn = MongoClient(
    'mongodb://localhost:27017/',
    username=os.getenv("MONGODB_SMARTER_USER"),
    password=os.getenv("MONGODB_SMARTER_PASS")
)
smarter = conn['smarter']
variantSheep = smarter['variantSheep']

Now define a *MongoDB* pipeline that collects and transforms the data in the simplest way:

In [6]:
pipeline = [
    # match the SNPs I want
    {"$match": {
        "chip_name": "IlluminaOvineHDSNP", 
        "locations": {"$elemMatch": {"version": "Oar_v4.0", "imported_from": "dbSNP152"}}
    }},
    # now limit the fields I need
    {"$project": {
        "snp_name": "$name",
        # this will join a list of strings, like ",".join(list)
        "rs_id": {
            "$reduce": {
                "input": "$rs_id", 
                "initialValue": "", 
                "in": {
                    "$concat": [
                        "$$value", 
                        {'$cond': [{'$eq': ['$$value', '']}, '', ', ']}, 
                        "$$this"
                    ]
                }
            }
        },
        # this is how to do an $elemMatch in a projection step of a pipeline
        "locations": {
            "$filter": {
                "input": "$locations", 
                "as": "location", 
                "cond": {
                    "$and": [
                        {"$eq": ["$$location.imported_from", "dbSNP152"]}, 
                        {"$eq": ["$$location.version", "Oar_v4.0"]}
                    ]
                }
            }
        }
    }},
    # attempt to simplify locations, get a row for each item of array (unpack the only item)
    {"$unwind": "$locations"}, 
    # track the fields I'm interested in
    {"$set": {
        "ncbi_chrom": "$locations.chrom", 
        "ncbi_position": "$locations.position"
    }},
    # remove the field I don't want
    {"$unset": "locations"}
]
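To make the effect of the pipeline easier to follow, here is a plain-Python sketch of what the `$project`, `$unwind`, `$set` and `$unset` stages do to a single document. The document below is hypothetical (the SNP name, the rs IDs, the positions and the `"manifest"` value for `imported_from` are made up for illustration), and the `flatten` helper is not part of the database code:

```python
# Hypothetical document, shaped like the ones the pipeline expects
doc = {
    "name": "example_snp.1",
    "rs_id": ["rs0000001", "rs0000002"],
    "locations": [
        {"version": "Oar_v3.1", "imported_from": "manifest",
         "chrom": "1", "position": 100},
        {"version": "Oar_v4.0", "imported_from": "dbSNP152",
         "chrom": "1", "position": 95},
    ],
}


def flatten(document, version="Oar_v4.0", imported_from="dbSNP152"):
    """Yield one flat row for each location matching version and source."""
    # equivalent of the $reduce/$concat step: join the rs IDs in one string
    rs_id = ", ".join(document["rs_id"])

    # equivalent of $filter followed by $unwind: one row per kept location
    for location in document["locations"]:
        if (location["version"] == version
                and location["imported_from"] == imported_from):
            # equivalent of $set + $unset: keep only the flattened fields
            yield {
                "snp_name": document["name"],
                "rs_id": rs_id,
                "ncbi_chrom": location["chrom"],
                "ncbi_position": location["position"],
            }


rows = list(flatten(doc))
print(rows)
```

Only the `Oar_v4.0` location imported from `dbSNP152` survives, so the document yields a single flat row with the four fields that are loaded into pandas.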

Next, define a schema so that `pymongoarrow` can load the pipeline output into a pandas dataframe:

In [7]:
schema = Schema({"snp_name": str, "rs_id": str, "ncbi_chrom": str, "ncbi_position": int})

Here I execute the aggregation pipeline and set the index as I did for the other chips:

In [8]:
ncbi_locations = variantSheep.aggregate_pandas_all(pipeline, schema=schema)
ncbi_locations.set_index('snp_name', inplace=True)
ncbi_locations.head()

snp_name,rs_id,ncbi_chrom,ncbi_position
250506CS3900140500001_312.1,rs55630642,23,26243215
250506CS3900176800001_906.1,rs55630654,7,81590897
250506CS3900211600001_1041.1,rs55630658,16,41363310
250506CS3900218700001_1294.1,rs55630663,2,148834939
250506CS3900371000001_1255.1,rs417377113,11,35291132


Ok, merge these data into a new dataframe with a left join and drop the SNPs that are not in *NCBI*. The left join leaves *NA* values in `ncbi_position`, which turns the column into floats, so it needs to be converted back to *integer* after filtering:

In [9]:
tmp = results.merge(ncbi_locations, how="left", on="snp_name")
ncbi_results = tmp[tmp['ncbi_chrom'].notna()].astype({'ncbi_position': 'int'})
ncbi_results.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position
250506CS3900140500001_312.1,23,26243215,C/T,T/C,T/C,BOT,forward,C,T,rs55630642,23,26243215
250506CS3900176800001_906.1,7,81590897,C/T,T/C,T/C,BOT,forward,C,T,rs55630654,7,81590897
250506CS3900211600001_1041.1,16,41363310,G/T,T/G,T/G,BOT,forward,G,T,rs55630658,16,41363310
250506CS3900218700001_1294.1,2,148834939,C/T,T/C,T/C,BOT,forward,C,T,rs55630663,2,148834939
250506CS3900371000001_1255.1,11,35291133,C/T,T/C,T/C,BOT,forward,T,C,rs417377113,11,35291132
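The dtype issue in the merge step can be reproduced on a toy example. This is a minimal sketch with made-up SNP names and positions; it also shows that an inner join is a possible alternative when the unmatched rows are not needed at all, because it never introduces *NA* values:

```python
import pandas as pd

# Toy version of my alignment results (hypothetical names and positions)
mine = pd.DataFrame(
    {"chrom": ["1", "2", "0"], "position": [100, 200, 0]},
    index=pd.Index(["snp_a", "snp_b", "snp_c"], name="snp_name"),
)

# Toy version of the NCBI locations: snp_b is missing
ncbi = pd.DataFrame(
    {"ncbi_chrom": ["1", "5"], "ncbi_position": [100, 50]},
    index=pd.Index(["snp_a", "snp_c"], name="snp_name"),
)

# The left join keeps snp_b and fills its NCBI columns with NA,
# so the integer column is upcast to float
merged = mine.merge(ncbi, how="left", on="snp_name")
print(merged["ncbi_position"].dtype)

# Drop the SNPs missing from NCBI, then cast the column back to integer
kept = merged[merged["ncbi_chrom"].notna()].astype({"ncbi_position": "int"})
print(kept)

# Alternative: an inner join keeps only the shared SNPs and the integer dtype
inner = mine.merge(ncbi, how="inner", on="snp_name")
print(inner["ncbi_position"].dtype)
```

The left join followed by a filter is still useful when the intermediate table is needed to count the SNPs that *NCBI* does not have.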


Ok, focus on the differences between my alignment and *NCBI*:

In [10]:
differences = ncbi_results.query("chrom != ncbi_chrom | position != ncbi_position")
differences.head()

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position
250506CS3900371000001_1255.1,11,35291133,C/T,T/C,T/C,BOT,forward,T,C,rs417377113,11,35291132
DU172264_319.1,25,20596182,A/G,T/C,A/G,BOT,reverse,G,A,rs55632153,25,20596183
DU175804_598.1,0,0,,T/C,,BOT,,,,rs409850824,13,12526490
DU178311_404.1,0,0,,T/C,,BOT,,,,rs55631803,6,36768153
DU179070_177.1,0,0,,A/G,,TOP,,,,rs55628106,1,111700859


In [11]:
differences.info()

<class 'pandas.core.frame.DataFrame'>
Index: 56470 entries, 250506CS3900371000001_1255.1 to s75909.1
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   chrom             56470 non-null  object
 1   position          56470 non-null  int64 
 2   alleles           976 non-null    object
 3   illumina          56470 non-null  object
 4   illumina_forward  976 non-null    object
 5   illumina_strand   56470 non-null  object
 6   strand            976 non-null    object
 7   ref               976 non-null    object
 8   alt               976 non-null    object
 9   rs_id             56470 non-null  object
 10  ncbi_chrom        56470 non-null  object
 11  ncbi_position     56470 non-null  int64 
dtypes: int64(2), object(10)
memory usage: 5.6+ MB


I have ~56K differences from *NCBI*; let's see how they are distributed across the chromosomes I assigned:

In [12]:
differences["chrom"].value_counts()

0                 55494
2                   119
1                   113
3                    84
7                    48
4                    48
6                    45
17                   39
10                   36
8                    34
15                   34
5                    34
18                   33
11                   30
12                   29
9                    28
13                   27
20                   25
16                   21
21                   19
19                   19
22                   18
23                   18
14                   17
25                   17
26                   16
24                   13
X                     5
KQ725143.1            1
AMGL02044162.1        1
AMGL02046412.1        1
KQ726038.1            1
AMGL02042721.1        1
AMGL02042569.1        1
KQ725427.1            1
Name: chrom, dtype: int64

Ok, count how many SNPs I can't place while *NCBI* can:

In [13]:
not_placed = differences[differences["chrom"] == '0']
print(f"There are {not_placed.shape[0]} SNPs that I can't map to genome")
not_placed.merge(errors["reason"], how="left", on="snp_name").sort_values("reason")

There are 55494 SNPs that I can't map to genome


snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position,reason
DU175804_598.1,0,0,,T/C,,BOT,,,,rs409850824,13,12526490,Allele doesn't match to reference
OAR1_193942646.1,0,0,,T/G,,BOT,,,,rs421413285,1,179560591,Allele doesn't match to reference
s36152.1,0,0,,T/G,,BOT,,,,rs398134079,13,34476763,Allele doesn't match to reference
OAR4_109410038.1,0,0,,A/G,,TOP,,,,rs429699161,4,102752787,Allele doesn't match to reference
OAR4_35980076.1,0,0,,T/C,,BOT,,,,rs428522734,4,34062779,Allele doesn't match to reference
...,...,...,...,...,...,...,...,...,...,...,...,...,...
oar3_OAR1_24022311,0,0,,A/G,,TOP,,,,rs403680498,1,23988389,Too many alignments after filtering
oar3_OAR9_74023539,0,0,,T/C,,BOT,,,,rs408020205,9,73900233,Too many alignments after filtering
oar3_OAR2_18680383,0,0,,A/G,,TOP,,,,rs422911826,2,18719319,Too many alignments after filtering
oar3_OAR2_183179891,0,0,,T/C,,BOT,,,,rs426856844,2,183203388,Too many alignments after filtering


Well, there are a lot of SNPs I cannot match. Group them by reason:

In [14]:
not_placed.merge(errors["reason"], how="left", on="snp_name")["reason"].value_counts()

No valid alignments after filtering    49254
Too many alignments after filtering     3067
Can't find T/C in alignment             1320
Can't find A/G in alignment             1200
Can't find T/G in alignment              296
Can't find A/C in alignment              281
Allele doesn't match to reference         75
Can't find C/G in alignment                1
Name: reason, dtype: int64
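The five `Can't find X/Y in alignment` reasons differ only in the allele pair, so it can help to collapse them into a single category. The sketch below starts from the counts printed above; the `collapse` helper and the label `Can't find alleles in alignment` are my own naming, not something produced by the alignment script:

```python
import re

import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "No valid alignments after filtering": 49254,
    "Too many alignments after filtering": 3067,
    "Can't find T/C in alignment": 1320,
    "Can't find A/G in alignment": 1200,
    "Can't find T/G in alignment": 296,
    "Can't find A/C in alignment": 281,
    "Allele doesn't match to reference": 75,
    "Can't find C/G in alignment": 1,
})


def collapse(reason):
    """Replace the allele pair in a "Can't find" reason with a generic label."""
    return re.sub(
        r"^Can't find [ACGT]/[ACGT] in alignment$",
        "Can't find alleles in alignment",
        reason,
    )


# groupby() with a function applies it to each index label
collapsed = counts.groupby(collapse).sum().sort_values(ascending=False)
print(collapsed)

# Share of each category over all the unplaced SNPs
print((collapsed / collapsed.sum() * 100).round(2))
```

The same `collapse` function could be applied to the `reason` column with `Series.map` before calling `value_counts()`.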

In [15]:
tmp = not_placed.merge(errors["reason"], how="left", on="snp_name")
tmp[tmp["reason"] == "No valid alignments after filtering"]

snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position,reason
DU178311_404.1,0,0,,T/C,,BOT,,,,rs55631803,6,36768153,No valid alignments after filtering
DU179070_177.1,0,0,,A/G,,TOP,,,,rs55628106,1,111700859,No valid alignments after filtering
DU191809_420.1,0,0,,T/C,,BOT,,,,"rs428367938, rs405711682",1,186920065,No valid alignments after filtering
DU192841_628.1,0,0,,T/C,,BOT,,,,rs55632389,23,19282729,No valid alignments after filtering
DU201902_316.1,0,0,,A/G,,TOP,,,,rs55628129,12,10990371,No valid alignments after filtering
...,...,...,...,...,...,...,...,...,...,...,...,...,...
s75759.1,0,0,,T/C,,BOT,,,,rs429941367,6,108740789,No valid alignments after filtering
s75799.1,0,0,,T/C,,BOT,,,,rs400816683,21,31837963,No valid alignments after filtering
s75819.1,0,0,,T/C,,BOT,,,,rs413852725,22,23316563,No valid alignments after filtering
s75898.1,0,0,,T/C,,BOT,,,,rs410893602,5,33620418,No valid alignments after filtering


Are there any *SNPs* that I map to a different position than *NCBI*?

In [16]:
different = differences.query("chrom != '0' and ncbi_chrom != '0'")
print(f"There are {different.shape[0]} SNPs that I can map to a different position")
different

There are 936 SNPs that I can map to a different position


snp_name,chrom,position,alleles,illumina,illumina_forward,illumina_strand,strand,ref,alt,rs_id,ncbi_chrom,ncbi_position
250506CS3900371000001_1255.1,11,35291133,C/T,T/C,T/C,BOT,forward,T,C,rs417377113,11,35291132
DU172264_319.1,25,20596182,A/G,T/C,A/G,BOT,reverse,G,A,rs55632153,25,20596183
DU206327_107.1,17,14328515,C/T,T/C,T/C,BOT,forward,C,T,rs417906482,17,14328514
DU206996_498.1,5,33118535,A/C,T/G,A/C,BOT,reverse,A,C,"rs403872294, rs421000549",5,33118534
DU325612_517.1,18,25506831,A/G,T/C,A/G,BOT,reverse,G,A,rs416903259,18,25506832
...,...,...,...,...,...,...,...,...,...,...,...,...
s74353.1,11,61175955,C/T,T/C,T/C,BOT,forward,C,T,rs405211381,11,61175954
s74415.1,4,60933502,A/C,T/G,A/C,BOT,reverse,C,A,rs399304728,4,60933503
s75491.1,4,37184952,A/G,A/G,A/G,TOP,forward,G,A,rs425245947,4,37184951
s75543.1,17,44671573,A/C,T/G,A/C,BOT,reverse,C,A,rs424526798,17,44671575
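Several of the rows shown above sit on the same chromosome in both sources and differ by only one or two bases. A possible next step, sketched here on five rows copied from the table above, is to compute the offset between the two positions and tabulate it by strand; the `offset` column is introduced only for this illustration:

```python
import pandas as pd

# Five rows copied from the table above
different = pd.DataFrame(
    {
        "chrom": ["11", "25", "17", "4", "17"],
        "position": [35291133, 20596182, 14328515, 60933502, 44671573],
        "strand": ["forward", "reverse", "forward", "reverse", "reverse"],
        "ncbi_chrom": ["11", "25", "17", "4", "17"],
        "ncbi_position": [35291132, 20596183, 14328514, 60933503, 44671575],
    },
    index=pd.Index(
        [
            "250506CS3900371000001_1255.1",
            "DU172264_319.1",
            "DU206327_107.1",
            "s74415.1",
            "s75543.1",
        ],
        name="snp_name",
    ),
)

# Offsets are meaningful only when both sources agree on the chromosome
same_chrom = different[different["chrom"] == different["ncbi_chrom"]]
same_chrom = same_chrom.assign(
    offset=same_chrom["position"] - same_chrom["ncbi_position"]
)

# Count the SNPs for each combination of strand and offset
by_strand = same_chrom.groupby(["strand", "offset"]).size()
print(by_strand)
```

Running the same grouping on the full `different` dataframe would show whether small offsets account for most of the disagreements or whether some SNPs are placed far away from the *NCBI* position.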
