In [1]:
#Start

#### Code Summary

The code imports the required libraries (gzip, pandas) and loads mutation data from a gzipped, tab-separated input file. It reads the PMID, Type, Concept ID, Mentions, and Resource columns into a DataFrame and then drops Type. After sampling the data, it keeps only the rows whose Concept ID starts with `rs`, that is, SNP mutations identified by an RSID.

Rows with a null Mentions value are removed, because spot checks of those PMIDs showed the annotations to be inconsistent. The filtered SNP data is written to a .tsv file.

Finally, summary statistics are computed on the filtered data: total RSID annotations, unique PMIDs, and unique SNPs.

#### Output File Data Description

The columns in the dataset are:

i. PMID: PubMed abstract identifier <br>
ii. Concept ID: Database identifier for the mutation, e.g. RSID for SNPs <br>
iii. Mentions: Text mentions of the mutation concepts found in the abstract <br>
iv. Resource: Sources of the annotation data, e.g. dbSNP, ClinVar, tmVar

In [2]:
import gzip
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Input file names from part1-data_retrival.ipynb
input_filename = 'mutation2pubtatorcentral.gz'

In [16]:
max_cols = 5
headers = "PMID,Type,Concept ID,Mentions,Resource".split(',')

with gzip.open(input_filename, 'rt') as f:
    df = pd.read_csv(f, sep='\t', header=0, names=headers, usecols=range(max_cols))
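
The notebook does not state whether the input file has a header row. The sketch below uses a two-line in-memory sample (illustrative values, not real data) to show how `header=0` combined with `names` treats the first line as a header and discards it, while `header=None` keeps it as data. If the input has no header row, `header=None` would be the safer choice.

```python
import io
import pandas as pd

# Illustrative tab-separated sample WITHOUT a header row (made-up values).
raw = "1\tSNP\trs1\trs1\ttmVar\n2\tSNP\trs2\trs2\ttmVar\n"
names = ["PMID", "Type", "Concept ID", "Mentions", "Resource"]

# header=0: the first line is consumed as a header and replaced by `names`.
with_header0 = pd.read_csv(io.StringIO(raw), sep="\t", header=0, names=names)

# header=None: every line is treated as data.
with_none = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=names)

print(len(with_header0), len(with_none))
```

Comparing the two lengths on the first few lines of the real file is a quick way to confirm which setting applies.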

In [25]:
print(df.sample(n=10))

             PMID           Concept ID                   Mentions     Resource
4065     26342000           rs30259360                 rs30259360  tmVar|dbSNP
1117437  32266159             rs254560                   rs254560        tmVar
5042467  12751820  tmVar:c|SUB|C|125|A                      C125A        tmVar
286339   15616038            rs1501299                  rs1501299  tmVar|dbSNP
4251750  33537542          rs762098236                   c.590T>C        tmVar
2956850  23273425     tmVar:p|SUB|W||D  tryptophan with aspartate        tmVar
1842858  24033266          rs144505461                        NaN      ClinVar
4361421  21376568         rs1785363766                        NaN      ClinVar
3180031  27490458     tmVar:c|FOR|10|C                   C for 10        tmVar
544164   35849076  tmVar:p|SUB|N|156|Q                      N156Q        tmVar


In [27]:
len(df)

6361507

As the sample above shows, the data combines both gene and protein mutations, so further filtering is performed.
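
To make the mix of identifier forms concrete, here is a minimal sketch that labels Concept IDs taken from the sample output. The helper `classify_concept_id` and its labels are hypothetical, and reading the `tmVar:c|` prefix as DNA-level and `tmVar:p|` as protein-level is an assumption based on the usual `c.`/`p.` notation. The pattern `rs` followed by digits is also stricter than the prefix check used in the notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumes an RSID is "rs" followed only by digits.
RSID_PATTERN = re.compile(r"rs\d+", re.IGNORECASE)

def classify_concept_id(concept_id: str) -> str:
    """Return a coarse label for a Concept ID string."""
    if RSID_PATTERN.fullmatch(concept_id):
        return "rsid"
    if concept_id.startswith("tmVar:c|"):
        return "tmvar_dna"      # assumption: c = DNA-level change
    if concept_id.startswith("tmVar:p|"):
        return "tmvar_protein"  # assumption: p = protein-level change
    return "other"

# Concept IDs copied from the sample output above.
sample_ids = ["rs1501299", "tmVar:c|SUB|C|125|A", "tmVar:p|SUB|W||D", "rs762098236"]
labels = [classify_concept_id(c) for c in sample_ids]
print(labels)
```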

In [20]:
#Delete Type column
df = df.drop('Type', axis=1)

In [35]:
# Keep rows where 'Concept ID' is in RSID format (starts with 'rs', case-insensitive)
snp_df = df[df['Concept ID'].str.match('^rs', case=False, na=False)]
print(snp_df.sample(n=10))

             PMID    Concept ID     Mentions     Resource
4435286  26332579   rs318236639  rs318236639        tmVar
5925856  34936921     rs1558902    rs1558902        tmVar
696974    9536098   rs897784116          NaN      ClinVar
5546178  25741868   rs121434352          NaN      ClinVar
1162816  36143166     rs7743373    rs7743373        tmVar
6121804  18072964        rs7521       rs7521  tmVar|dbSNP
4147382  28492532   rs751518628          NaN      ClinVar
4925512  31822803    rs61757643     c.485G>A        tmVar
4030384  28492532  rs1787875866          NaN      ClinVar
4660034  35234610   rs797045013        T767I        tmVar


In [36]:
len(snp_df)

2650051

In [37]:
# Sum of null values per column in snp_df
null_per_column = snp_df.isnull().sum()
print("Null values per column:")
print(null_per_column)
print("\n")

Null values per column:
PMID                0
Concept ID          0
Mentions      1157211
Resource            0
dtype: int64




In [59]:
Null_mention_df = snp_df[snp_df['Mentions'].isnull()]
print(Null_mention_df.sample(n=10))

             PMID    Concept ID Mentions Resource
1980833  15157284     rs8179178      NaN    dbSNP
5381142  25741868   rs986731225      NaN  ClinVar
4144419  28492532  rs2078660025      NaN  ClinVar
5550125  25741868  rs1560965164      NaN  ClinVar
3990096  28492532   rs371280399      NaN  ClinVar
3727534  28492532  rs1319798252      NaN  ClinVar
2956876  29054425   rs137852801      NaN  ClinVar
2945561  23006423    rs11668609      NaN    dbSNP
3106687  15689448   rs121912935      NaN  ClinVar
2651325  25637381    rs28936405      NaN  ClinVar


In [60]:
Null_mention_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1157211 entries, 183 to 6361473
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   PMID        1157211 non-null  int64 
 1   Concept ID  1157211 non-null  object
 2   Mentions    0 non-null        object
 3   Resource    1157211 non-null  object
dtypes: int64(1), object(3)
memory usage: 44.1+ MB


After manually checking several random PMIDs in Null_mention_df, we observed that the annotations are inconsistent and cannot be relied on for the result. Hence we remove the rows where Mentions is null.
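
One way to see where the null Mentions come from is to count them per Resource. The sketch below uses a toy frame shaped like `snp_df` (the values are illustrative, not real data); the same expression could be run on the real frame.

```python
import pandas as pd

# Toy rows shaped like snp_df; values are illustrative only.
toy = pd.DataFrame({
    "PMID": [1, 1, 2, 3],
    "Concept ID": ["rs1", "rs2", "rs3", "rs4"],
    "Mentions": ["rs1", None, None, "c.590T>C"],
    "Resource": ["tmVar", "ClinVar", "dbSNP", "tmVar"],
})

# Flag missing mentions, then count the flags within each Resource value.
null_by_resource = (
    toy.assign(mention_missing=toy["Mentions"].isna())
       .groupby("Resource")["mention_missing"]
       .sum()
)
print(null_by_resource)
```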

In [61]:
#filter snp_df
snp_df = snp_df[snp_df['Mentions'].notnull()]
len(snp_df)

1492840

In [63]:
#store processed data to tsv format
output_file = 'filtered_snp_data.tsv'
snp_df.to_csv(output_file, sep='\t', index=False)
print(f"Filtered SNP Mutation data is stored in {output_file}")

Filtered SNP Mutation data is stored in filtered_snp_data.tsv


In [64]:
#Stats

In [65]:
total_lines = 0
unique_pmids = set()
unique_snps = set()

with open(output_file, 'r') as tsv_file:
    # Skip the header
    next(tsv_file)
    
    # Process each line in the file
    for line in tsv_file:
        values = line.strip().split('\t')
        
        # Increment the total RSID annotations count
        total_lines += 1
        
        # Add the PMID to the set of unique PMIDs
        unique_pmids.add(values[0])
        
        # Add the Concept ID (SNP) to the set of unique SNPs
        unique_snps.add(values[1])

# Calculated stats
print(f"Total RSID annotations: {total_lines}")
print(f"Total unique PMID: {len(unique_pmids)}")
print(f"Total unique SNPs: {len(unique_snps)}")

Total RSID annotations: 1492840
Total unique PMID: 373734
Total unique SNPs: 496218
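
The same three statistics can also be computed directly on a DataFrame without re-reading the file line by line. This is a minimal sketch on a toy frame (illustrative values); on the real data `toy` would be replaced by `snp_df`. Note that `nunique` compares strings exactly, so IDs differing only in letter case would be counted separately.

```python
import pandas as pd

# Toy frame with the same columns as the output file; values are illustrative.
toy = pd.DataFrame({
    "PMID": [101, 101, 102],
    "Concept ID": ["rs1", "rs2", "rs1"],
    "Mentions": ["rs1", "rs2", "rs1"],
    "Resource": ["tmVar", "tmVar", "tmVar|dbSNP"],
})

stats = {
    "total_annotations": len(toy),                  # one row per annotation
    "unique_pmids": toy["PMID"].nunique(),          # distinct abstracts
    "unique_snps": toy["Concept ID"].nunique(),     # distinct RSIDs
}
print(stats)
```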


In [None]:
# END