In [1]:
#Start

### Code Summary

The JSON file of filtered abstracts is loaded and converted into a dictionary keyed by PMID for faster lookups, with the abstract text lowercased. The TSV file of SNP data is loaded, and its mentions are lowercased to match. For every SNP row, the mention text in the corresponding abstract is replaced with its Concept ID (RSID). The unique RSIDs are collected and saved to a pickle file. Each abstract is then split into sentences with NLTK's sentence tokenizer, and multiprocessing is used to extract, for each PMID, the sentences that contain any of the RSIDs. The PMID-to-sentences mappings are saved as a JSON Lines file: each line holds one PMID together with the sentences from its abstract that mention an RSID. The key steps are:
1. Substituting mentions with Concept IDs 
2. Splitting into sentences 
3. Extracting sentences with RSIDs 
4. Saving the PMID-to-sentences mapping.

In [2]:
import glob
import json
import pandas as pd
import numpy as np
import re
import warnings
import pickle
import shutil
warnings.filterwarnings("ignore")

In [4]:
zip_file_path = './filtered_snp_abstracts.zip'
extract_to_path = './'

# Extract the zip file
shutil.unpack_archive(zip_file_path, extract_to_path, 'zip')

In [5]:
file_path = "./filtered_snp_abstracts.json"
with open(file_path, "r") as f:
    abstracts_data = [json.loads(line) for line in f]

In [6]:
# Convert to a dict keyed by PMID for fast lookups; lowercase the text so matching is case-insensitive
abstracts_dict = {entry['ID']: entry['Abstract'].lower() for entry in abstracts_data}

In [7]:
# Load the TSV file
file_path = "./filtered_snp_data.tsv"
snp_df = pd.read_csv(file_path, delimiter='\t')

# Convert mentions to lowercase for matching
snp_df['Mentions'] = snp_df['Mentions'].str.lower()

In [8]:
# Collect the Concept ID (RSID) of every SNP row whose PMID has an abstract
# (duplicates are removed in the next cell)
rsid_tokens = []

# Replace each mention in its abstract with the corresponding Concept ID
for _, row in snp_df.iterrows():
    pmid = row['PMID']
    mention = row['Mentions']
    concept_id = row['Concept ID']
    if pmid in abstracts_dict:
        # Plain substring replacement: every occurrence of the mention is substituted
        abstracts_dict[pmid] = abstracts_dict[pmid].replace(mention, str(concept_id))
        rsid_tokens.append(concept_id)
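
The loop above uses plain `str.replace`, which also rewrites a mention that happens to sit inside a longer token. A minimal sketch of a stricter alternative follows, assuming mentions should only match when they are not glued to a neighbouring letter, digit, or underscore. The helper name `replace_mention` is hypothetical and is not part of the original notebook.

```python
import re

def replace_mention(text: str, mention, concept_id: str) -> str:
    """Replace whole-token occurrences of `mention` in `text` with `concept_id`."""
    # Skip missing or empty mentions (pandas yields NaN, a float, for blank cells).
    if not isinstance(mention, str) or not mention:
        return text
    # The lookarounds reject matches that touch a word character on either side,
    # so the mention "a12t" does not match inside "a12tx".
    pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
    # A function replacement inserts concept_id literally, with no backslash handling.
    return re.sub(pattern, lambda _match: concept_id, text)

# Usage on a made-up sentence
print(replace_mention("the -174g/c polymorphism", "-174g/c", "rs1800795"))
```

Whether this stricter matching is appropriate depends on how the mentions were annotated; if annotated mentions can legitimately appear inside longer strings, the original substring replacement may be the better fit.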

In [9]:
rsid_tokens = list(set(rsid_tokens))

#Store RSID tokens
file_path = "./rsid_tokens.pkl"

# Save the list to the specified file using pickle
with open(file_path, 'wb') as f:
    pickle.dump(rsid_tokens, f)
print(f"Saved rsid_tokens to {file_path}")

Saved rsid_tokens to ./rsid_tokens.pkl


In [10]:
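# Rough sanity check. The substring "rs" also occurs in ordinary words
# (e.g. "disorders"), so this count is only an upper bound on abstracts containing an RSID.
changed_abstracts = sum(1 for abstract in abstracts_dict.values() if "rs" in abstract)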
print(f"Total abstracts where the mentions are changed: {changed_abstracts} out of {len(abstracts_dict)}")

Total abstracts where the mentions are changed: 107631 out of 107631
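
Because `"rs" in abstract` is true for almost any English text, a count of 107631 out of 107631 does not by itself confirm that every abstract received an RSID. A tighter check can be sketched as below, assuming Concept IDs follow the usual `rs` plus digits form seen in the sample abstract (for example `rs1800797`). The names `RSID_PATTERN` and `count_abstracts_with_rsid` are illustrative, not from the notebook.

```python
import re

# Assumption: an RSID is the lowercase prefix "rs" followed by digits, as a whole token.
RSID_PATTERN = re.compile(r"\brs\d+\b")

def count_abstracts_with_rsid(abstracts: dict) -> int:
    """Count abstracts that contain at least one RSID-shaped token."""
    return sum(1 for text in abstracts.values() if RSID_PATTERN.search(text))

# Toy data: the first abstract contains "rs" only inside the word "disorders".
sample = {
    1: "four disorders were studied",
    2: "the rs1800795 variant was genotyped",
}
loose = sum(1 for text in sample.values() if "rs" in text)
strict = count_abstracts_with_rsid(sample)
print(f"loose check: {loose}, strict check: {strict}")
```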


In [11]:
# Print one abstract and its SNP rows to check that mentions were replaced with RSIDs
print(abstracts_dict[10747905])
print('\n')
print(snp_df[snp_df['PMID'] == 10747905])

interleukin 6 (il6) plays key roles in hematopoiesis, immune, and acute phase responses. dysregulated il6 expression is implicated in diseases such as atherosclerosis and arthritis. we have examined the functional effect of four polymorphisms in the il6 promoter (rs1800797, rs1800796, -373a(n)t(n), rs1800795) by identifying the naturally occurring haplotypes and comparing their effects on reporter gene expression. the results indicate different transcriptional regulation in the ecv304 cell line compared with the hela cell line, suggesting cell type-specific regulation of il6 expression. the haplotypes showed functional differences in the ecv304 cell line; transcription was higher from the gg9/11g haplotype and lower from the ag8/12g allele. the differences suggest that more than one of the polymorphic sites is functional; the base differences at distinct polymorphic sites do not act independently of one another, and one polymorphism influences the functional effect of variation at othe

In [12]:
#Sentence Split

In [13]:
from concurrent.futures import ProcessPoolExecutor

import nltk
import tqdm
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer models used by sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [14]:
def extract_rsid_sentences_for_pmid(pmid_abstract_tuple):
    """Extract sentences containing RSID tokens from a given abstract for a PMID."""
    pmid, abstract = pmid_abstract_tuple
    sentences = [sentence for sentence in sent_tokenize(abstract) if any(rsid in sentence for rsid in rsid_tokens_set)]
    return (pmid, sentences)

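# Deduplicate the RSID tokens. Note that the function above runs a substring test
# against every token for each sentence, so the set does not make the check a constant-time lookup.
rsid_tokens_set = set(rsid_tokens)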

# Dictionary to store PMID to list of sentences containing RSIDs
pmid_to_rsid_sentences = {}

# Use ProcessPoolExecutor to parallelize the extraction
with ProcessPoolExecutor() as executor:
    results = list(tqdm.tqdm(executor.map(extract_rsid_sentences_for_pmid, abstracts_dict.items()), total=len(abstracts_dict), desc="Extracting RSID sentences"))

# Populate the dictionary with the results
for pmid, sentences in results:
    if sentences:
        pmid_to_rsid_sentences[pmid] = sentences

Extracting RSID sentences: 100%|██████████| 107631/107631 [52:18<00:00, 34.29it/s]
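
The 52-minute runtime comes largely from testing every known RSID as a substring of every sentence. One possible speed-up, sketched below, is to pull RSID-shaped tokens out of each sentence with a regular expression and intersect them with the known set, so the work per sentence depends on the sentence length rather than on the number of RSIDs. This assumes Concept IDs are lowercase `rs` plus digits; the function name `filter_rsid_sentences` is hypothetical, and it takes already-split sentences so that it can be tried without NLTK.

```python
import re

# Assumption: RSIDs appear in the text as whole tokens of the form rs<digits>.
TOKEN_PATTERN = re.compile(r"\brs\d+\b")

def filter_rsid_sentences(sentences, rsid_set):
    """Keep the sentences that contain at least one token from rsid_set."""
    kept = []
    for sentence in sentences:
        # Candidate tokens found in this sentence, checked by real set membership.
        candidates = set(TOKEN_PATTERN.findall(sentence))
        if candidates & rsid_set:
            kept.append(sentence)
    return kept

# Usage on made-up sentences
known = {"rs1800795", "rs1800796"}
demo = [
    "we genotyped rs1800795 in all samples.",
    "no variant is named here.",
    "rs18007950 is a different identifier.",
]
print(filter_rsid_sentences(demo, known))
```

Unlike the substring test, this version does not treat `rs1800795` as present inside `rs18007950`, so results can differ slightly from the original cell.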



In [15]:
# Path of the output file (written to the current working directory)
file_path = "./rsid_sentences.json"

# Save each PMID and its sentences as a separate line in the file
with open(file_path, 'w') as file:
    for pmid, sentences in pmid_to_rsid_sentences.items():
        file.write(json.dumps({pmid: sentences}) + "\n")

print(f"File saved to {file_path}")

File saved to ./rsid_sentences.json
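
Since each line of the output file is a separate JSON object with a single PMID key, the file cannot be read with one `json.load` call. A minimal sketch of reading it back into a dictionary follows; `load_rsid_sentences` is a hypothetical helper, and converting keys back to `int` assumes PMIDs are integers, as in the lookups earlier in the notebook (JSON itself stores object keys as strings).

```python
import json

def load_rsid_sentences(path):
    """Read the one-object-per-line file back into {pmid: [sentences]}."""
    mapping = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            for pmid, sentences in record.items():
                # JSON object keys are strings; restore the integer PMID.
                mapping[int(pmid)] = sentences
    return mapping
```

For example, `load_rsid_sentences("./rsid_sentences.json")` would return the same PMID-to-sentences mapping that the cell above wrote out.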


In [None]:
#END

In [None]:
#Sentences Extracted.