# Clean data

Author: Yifei Wan\
E-mail: wanyifei0123@gmail.com\
Clean data for NLP.


## Read and explore data
### Read and verify signature

In [1]:
import pickle
from hmac_signature import *

In [2]:
input_title = "data/title_abstract_refs/pmid_title.pickle"  # title dictionary
input_abstract = "data/title_abstract_refs/pmid_abstract.pickle"  # abstract dictionary
key = "wanyifei"

with open(input_title, "rb") as title:
    title_signature_old = title.readline().decode("utf-8")  # first line stores the signature
    title_content = b"".join(title.readlines())  # the rest is the pickled payload
    # Only unpickle after the signature check passes
    if verify_hmac_signature(title_content, key, title_signature_old):
        pmids_title = pickle.loads(title_content)
        print("Dict pmids_title is ready!")
    else:
        raise ValueError("Signature mismatch for {}".format(input_title))

with open(input_abstract, "rb") as abstract:
    abstract_signature_old = abstract.readline().decode("utf-8")
    abstract_content = b"".join(abstract.readlines())
    if verify_hmac_signature(abstract_content, key, abstract_signature_old):
        pmids_abstract = pickle.loads(abstract_content)
        print("Dict pmids_abstract is ready!")
    else:
        raise ValueError("Signature mismatch for {}".format(input_abstract))


Dict pmids_title is ready!
Dict pmids_abstract is ready!
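
`verify_hmac_signature` is imported from the local `hmac_signature` module, whose source is not shown here. As a rough sketch of how such a check might work, the following assumes the signature is a hex-encoded HMAC-SHA256 digest stored on the first line of the file. The names `sign_bytes` and `verify_bytes` and the choice of SHA-256 are illustrative assumptions, not the module's actual API.

```python
import hashlib
import hmac
import pickle


def sign_bytes(content: bytes, key: str) -> str:
    """Return a hex HMAC-SHA256 digest of content (hypothetical helper)."""
    return hmac.new(key.encode("utf-8"), content, hashlib.sha256).hexdigest()


def verify_bytes(content: bytes, key: str, signature: str) -> bool:
    """Compare digests in constant time; strip the newline left by readline()."""
    expected = sign_bytes(content, key)
    return hmac.compare_digest(expected, signature.strip())


payload = pickle.dumps({"25820570": "example title"})
signature_line = sign_bytes(payload, "secret") + "\n"  # as it would be stored on the first line

is_valid = verify_bytes(payload, "secret", signature_line)
is_tampered_valid = verify_bytes(payload + b"x", "secret", signature_line)
```

Checking the signature before calling `pickle.loads` matters because unpickling untrusted bytes can execute arbitrary code.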


### View title and abstract

In [3]:
print("Preview of title:\n {}\n".format(dict(list(pmids_title.items())[0:2])))
print("Preview of abstract:\n {}\n".format(dict(list(pmids_abstract.items())[0:1])))

Preview of title:
 {'25820570': 'Functional Complementation Assay for 47 MUTYH Variants in a MutY-Disrupted Escherichia coli Strain.', '25330149': 'Mutations predisposing to breast cancer in 12 candidate genes in breast cancer patients from Poland.'}

Preview of abstract:
 {'25820570': 'MUTYH-associated polyposis (MAP) is an adenomatous polyposis transmitted in an autosomal-recessive pattern, involving biallelic inactivation of the MUTYH gene. Loss of a functional MUTYH protein will result in the accumulation of G:T mismatched DNA caused by oxidative damage. Although p.Y179C and p.G396D are the two most prevalent MUTYH variants, more than 200 missense variants have been detected. It is difficult to determine whether these variants are disease-causing mutations or single-nucleotide polymorphisms. To understand the functional consequences of these variants, we generated 47 MUTYH gene variants via site-directed mutagenesis, expressed the encoded proteins in MutY-disrupted Escherichia coli
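
The previews above build a full list of items before slicing. For large dictionaries, `itertools.islice` returns the same leading entries without materializing everything. A small sketch, using a made-up `sample` dictionary and a hypothetical `preview` helper:

```python
from itertools import islice

# Stand-in data for illustration only
sample = {"25820570": "title a", "25330149": "title b", "29264456": "title c"}


def preview(mapping, n):
    """Return the first n items of a dict without copying all items."""
    return dict(islice(mapping.items(), n))


first_two = preview(sample, 2)
```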

## Convert to lowercase

In [4]:
pmids_title_l = {pmid: title.lower() for pmid, title in pmids_title.items()}
pmids_abstract_l = {pmid: abstract.lower() for pmid, abstract in pmids_abstract.items()}

print("Preview of title in lower case:\n {}\n".format(dict(list(pmids_title_l.items())[0:2])))
print("Preview of abstract in lower case:\n {}\n".format(dict(list(pmids_abstract_l.items())[0:1])))

Preview of title in lower case:
 {'25820570': 'functional complementation assay for 47 mutyh variants in a muty-disrupted escherichia coli strain.', '25330149': 'mutations predisposing to breast cancer in 12 candidate genes in breast cancer patients from poland.'}

Preview of abstract in lower case:
 {'25820570': 'mutyh-associated polyposis (map) is an adenomatous polyposis transmitted in an autosomal-recessive pattern, involving biallelic inactivation of the mutyh gene. loss of a functional mutyh protein will result in the accumulation of g:t mismatched dna caused by oxidative damage. although p.y179c and p.g396d are the two most prevalent mutyh variants, more than 200 missense variants have been detected. it is difficult to determine whether these variants are disease-causing mutations or single-nucleotide polymorphisms. to understand the functional consequences of these variants, we generated 47 mutyh gene variants via site-directed mutagenesis, expressed the encoded proteins in mut

## Expand the contractions

In [5]:
import contractions

example = "He doesn't like apple."
print("Example before contraction fix: {}".format(example))
print("Example after contraction fix: {}".format(contractions.fix(example)))

Example before contraction fix: He doesn't like apple.
Example after contraction fix: He does not like apple.


In [6]:
pmids_title_lc = {pmid: contractions.fix(title) for pmid, title in pmids_title_l.items()}
pmids_abstract_lc = {pmid: contractions.fix(abstract) for pmid, abstract in pmids_abstract_l.items()}

## Noise removal

In [7]:
import re  # used below for the HTML and variant patterns

from data_clean_toolkit import *

### Remove URL

In [8]:
pmids_title_lcu = {pmid: remove_url(title) for pmid, title in pmids_title_lc.items()}
pmids_abstract_lcu = {pmid: remove_url(abstract) for pmid, abstract in pmids_abstract_lc.items()}

example_pmid = [pmid for pmid, abstract in pmids_abstract_lc.items() if "http" in abstract][0]
print("Before remove URL:\n {}\n".format(pmids_abstract_lc[example_pmid]))
print("After remove URL:\n {}".format(pmids_abstract_lcu[example_pmid]))

Before remove URL:
 as sequencing becomes more economical, we are identifying sequence variations in the population faster than ever. for disease-associated genes, it is imperative that we differentiate a sequence variant as either benign or pathogenic, such that the appropriate therapeutic interventions or surveillance can be implemented. <i>pten</i> is a frequently mutated tumor suppressor that has been linked to the pten hamartoma tumor syndrome. although the domain structure of pten and the functional impact of a number of its most common tumor-linked mutations have been characterized, there is a lack of information about many recently identified clinical variants. to address this challenge, we developed a cell-based assay that utilizes a premalignant phenotype of normal mammary epithelial cells lacking pten. we measured the ability of pten variants to rescue the spheroid formation phenotype of <i>pten<sup>-/-</sup></i> mcf10a cells maintained in suspension. as proof of concept, we
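
`remove_url` comes from the local `data_clean_toolkit` module and its implementation is not shown. One plausible regex-based version is sketched below; the pattern and the name `remove_url_sketch` are assumptions for illustration.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www. up to the next whitespace
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_url_sketch(text):
    """Delete URL-like tokens; surrounding whitespace is left as-is."""
    return URL_PATTERN.sub("", text)


cleaned = remove_url_sketch("data are available at https://example.org/pten and on request.")
```

Because the pattern runs to the next whitespace, punctuation glued to the end of a URL (such as a closing period) is removed with it.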

### Remove HTML

In [9]:
pmids_title_lcuh = {pmid: remove_html(title) for pmid, title in pmids_title_lcu.items()}
pmids_abstract_lcuh = {pmid: remove_html(abstract) for pmid, abstract in pmids_abstract_lcu.items()}

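html_pattern = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
# re.match only matches at the start of the string, so this picks an abstract that begins with a tag or entity
example_pmid = [pmid for pmid, abstract in pmids_abstract_lcu.items() if re.match(html_pattern, abstract)][0]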
print(example_pmid)
print("Before remove HTML:\n {}\n".format(pmids_abstract_lcu[example_pmid]))
print("After remove HTML:\n {}".format(pmids_abstract_lcuh[example_pmid]))

28223274
Before remove HTML:
 <b>purpose:</b> maintenance therapy with olaparib has improved progression-free survival in women with high-grade serous ovarian cancer (hgsoc), particularly those harboring <i>brca1/2</i> mutations. the objective of this study was to characterize long-term (lt) versus short-term (st) responders to olaparib.<b>experimental design:</b> a comparative molecular analysis of study 19 (nct00753545), a randomized phase ii trial assessing olaparib maintenance after response to platinum-based chemotherapy in hgsoc, was conducted. lt response was defined as response to olaparib/placebo >2 years, st as <3 months. molecular analyses included germline <i>brca1/2</i> status, three-biomarker homologous recombination deficiency (hrd) score, <i>brca1</i> methylation, and mutational profiling. another olaparib maintenance study (study 41; nct01081951) was used as an additional cohort.<b>results:</b> thirty-seven lt (32 olaparib) and 61 st (21 olaparib) patients were identif
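
In the later previews for PMID 28223274, the text between `<3 months` and the next tag is gone (`st as brca1/2 status`). A loose tag pattern such as `<.*?>` would produce exactly this, because it treats a literal `<` in the prose as the start of a tag. Whether `remove_html` uses that pattern is an assumption; the sketch below only contrasts it with a stricter one that requires a letter after `<`.

```python
import re

LOOSE_TAG = re.compile(r"<.*?>")                 # same tag part as html_pattern above
STRICT_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")   # assumed alternative: tag name must start with a letter

text = "response >2 years, st as <3 months. germline <i>brca1/2</i> status"

loose = LOOSE_TAG.sub("", text)    # swallows "<3 months. germline <i>"
strict = STRICT_TAG.sub("", text)  # removes only <i> and </i>
```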

### Remove special characters
Remove special characters, including symbols, emojis, and other graphic characters.

In [10]:
pmids_title_lcuhs = {pmid: remove_special_characters(title) for pmid, title in pmids_title_lcuh.items()}
pmids_abstract_lcuhs = {pmid: remove_special_characters(abstract) for pmid, abstract in pmids_abstract_lcuh.items()}

print("Before remove special characters:\n {}\n".format(pmids_abstract_lcuh["28223274"]))
print("After remove special characters:\n {}".format(pmids_abstract_lcuhs["28223274"]))

Before remove special characters:
 purpose: maintenance therapy with olaparib has improved progression-free survival in women with high-grade serous ovarian cancer (hgsoc), particularly those harboring brca1/2 mutations. the objective of this study was to characterize long-term (lt) versus short-term (st) responders to olaparib.experimental design: a comparative molecular analysis of study 19 (nct00753545), a randomized phase ii trial assessing olaparib maintenance after response to platinum-based chemotherapy in hgsoc, was conducted. lt response was defined as response to olaparib/placebo >2 years, st as brca1/2 status, three-biomarker homologous recombination deficiency (hrd) score, brca1 methylation, and mutational profiling. another olaparib maintenance study (study 41; nct01081951) was used as an additional cohort.results: thirty-seven lt (32 olaparib) and 61 st (21 olaparib) patients were identified. treatment was significantly associated with outcome (p p tp53, brca1, and brca2 
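
`remove_special_characters` is another `data_clean_toolkit` helper whose code is not shown. The output above keeps ASCII symbols such as `>` and `/`, so a sketch consistent with that behaviour might filter only non-ASCII characters by Unicode category. The category choice and the function name are assumptions.

```python
import unicodedata

KEEP_MAJOR_CATEGORIES = {"L", "M", "N", "P", "Z"}  # letters, marks, numbers, punctuation, separators


def remove_special_characters_sketch(text):
    kept = []
    for ch in text:
        if ch.isascii():
            kept.append(ch)  # leave ASCII untouched, including '>' inside variants
        elif unicodedata.category(ch)[0] in KEEP_MAJOR_CATEGORIES:
            kept.append(ch)  # keep non-ASCII letters such as Greek beta
        # non-ASCII symbols (emoji, check marks, ...) are dropped
    return "".join(kept)


sample = "brca1 \u2713 \u03b2-catenin \U0001F600 >2 years"
cleaned = remove_special_characters_sketch(sample)
```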

### Remove punctuation

In [11]:
# Try removing punctuation directly
pmids_title_lcuhsp = {pmid: remove_punctuation(title) for pmid, title in pmids_title_lcuhs.items()}
pmids_abstract_lcuhsp = {pmid: remove_punctuation(abstract) for pmid, abstract in pmids_abstract_lcuhs.items()}

print("Before remove punctuations:\n {}\n".format(pmids_abstract_lcuhs["28223274"]))
print("After remove punctuations:\n {}".format(pmids_abstract_lcuhsp["28223274"]))

Before remove punctuations:
 purpose: maintenance therapy with olaparib has improved progression-free survival in women with high-grade serous ovarian cancer (hgsoc), particularly those harboring brca1/2 mutations. the objective of this study was to characterize long-term (lt) versus short-term (st) responders to olaparib.experimental design: a comparative molecular analysis of study 19 (nct00753545), a randomized phase ii trial assessing olaparib maintenance after response to platinum-based chemotherapy in hgsoc, was conducted. lt response was defined as response to olaparib/placebo >2 years, st as brca1/2 status, three-biomarker homologous recombination deficiency (hrd) score, brca1 methylation, and mutational profiling. another olaparib maintenance study (study 41; nct01081951) was used as an additional cohort.results: thirty-seven lt (32 olaparib) and 61 st (21 olaparib) patients were identified. treatment was significantly associated with outcome (p p tp53, brca1, and brca2 mutati

#### Check variants

In [12]:
variant = "c.188T>C"
print("Test punctuation remover on variant:\n")
print("Variant before remove punctuation: {}\n".format(variant))
print("Variant after remove punctuation: {}\n".format(remove_punctuation(variant)))

Test punctuation remover on variant:

Variant before remove punctuation: c.188T>C

Variant after remove punctuation: c188TC



The regular punctuation removal method also strips punctuation from variants (here `c.188T>C` becomes `c188TC`). To avoid this, a variant-safe approach is necessary.
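
The idea can be sketched with the standard library alone: swap each variant for a punctuation-free placeholder, strip punctuation, then swap the variants back. This is a regex-only illustration under stated assumptions: the function name is hypothetical, the placeholder naming mirrors the `placeholder0` style seen in the notebook output, and `string.punctuation` stands in for whatever `remove_punctuation` removes.

```python
import re
import string

VAR_PATTERN = re.compile(r"[cC]\.(\d+|\*\d+|-\d+)([+-]\d+)?([GCTAgcta])?>([GCTAgcta])")
PLACEHOLDER_PATTERN = re.compile(r"placeholder\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation_keep_variants(text):
    placeholders = {}

    def protect(match):
        # Replace the variant with a token that contains no punctuation
        token = "placeholder{}".format(len(placeholders))
        placeholders[token] = match.group(0)
        return token

    protected = VAR_PATTERN.sub(protect, text)
    stripped = protected.translate(PUNCT_TABLE)
    # Restore each token; unknown tokens are left unchanged
    return PLACEHOLDER_PATTERN.sub(lambda m: placeholders.get(m.group(0), m.group(0)), stripped)


result = strip_punctuation_keep_variants("the c.188T>C variant (novel), was found.")
```

Restoring through a single regex pass avoids the pitfall of `placeholder1` being substituted inside `placeholder10`.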

In [13]:
# Find an example of a variant in the text
var_pattern = r"[cC]\.(\d+|\*\d+|-\d+)([+-]\d+)?([GCTAgcta])?>([GCTAgcta])"
# re.match anchors at the start of the string, so this selects an abstract that begins with a variant
example_pmid = [pmid for pmid, abstract in pmids_abstract_lcuhs.items() if re.match(var_pattern, abstract)][0]
print(pmids_abstract_lcuhs[example_pmid])
print(example_pmid)

c.224g>a, p.arg75gln (r75q) presumably leads to an amino-acid change from arginine to glutamine in the membrane-spanning domain of the cftr protein. initially reported as a benign sequence variation, p.arg75gln was shown to be associated with a high risk of pancreatitis, a risk that was strikingly higher when p.arg75gln was combined with a spink1 variant. in addition, it was shown that p.arg75gln alters bicarbonate but not chloride conductance and that the mutation also induces exon 3 skipping. to investigate the role of p.arg75gln in idiopathic chronic pancreatitis (icp), we performed genotyping of the cftr gene in 880 patients with icp, 198 patients with idiopathic bronchiectasis (ib), 74 patients with classical cystic fibrosis (cf), 48 patients with congenital bilateral absence of the vas deferens (cbavd) and 148 healthy controls. p.arg75gln variant was identified in 3.3% (29/880) of patients with icp, 3.3% (9/272) patients with a pulmonary disease, 2.1% (1/48) of patients with cbav
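
The example above starts with a variant because `re.match` only tests the beginning of each abstract. To find abstracts that mention a variant anywhere, `re.search` is the usual choice. A small comparison on made-up abstracts (the `pmid_*` keys and texts are invented for illustration):

```python
import re

var_pattern = re.compile(r"[cC]\.(\d+|\*\d+|-\d+)([+-]\d+)?([GCTAgcta])?>([GCTAgcta])")

abstracts = {
    "pmid_a": "c.224g>a, p.arg75gln presumably leads to an amino-acid change.",
    "pmid_b": "we identified the variant c.188t>c in two patients.",
    "pmid_c": "no variants were reported in this cohort.",
}

# match: variant must be at position 0; search: variant may appear anywhere
starts_with_variant = [p for p, t in abstracts.items() if var_pattern.match(t)]
contains_variant = [p for p, t in abstracts.items() if var_pattern.search(t)]
```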

#### Map variants to placeholders

In [16]:
import more_itertools
from hgvs_variants_capture import *
from tmVar_rest_client import *


# Define a function that maps variants to placeholders
def map_var_placeholder(pmid, text, vartype):
    # tmVar method
    soup = tmvar_rest_api(pmid, "mutation")
    tmvar_ner = retrieve_variant_entity(soup)
    tmvar_entities = get_variant_entity_tmvar(text, tmvar_ner)
    key_2_placeholder_init = {}
    key_2_placeholder_tmvar = build_var_temp_encoding(tmvar_entities, key_2_placeholder_init, notice=False)
    tmvar_text = replace_var_2_temp_encoding(key_2_placeholder_tmvar, text, notice=False)
    # conventional regex method
    regex_entities = recognize_variant_entity(vartype, tmvar_text)  # conventional regex NER
    key_2_placeholder = build_var_temp_encoding(regex_entities, key_2_placeholder_tmvar)
    new_text = replace_var_2_temp_encoding(key_2_placeholder, tmvar_text)
    return new_text

# Map variants to placeholders in the first 10 titles and abstracts
vartype = "dna"
pmids_title_lcuhsv = {pmid: map_var_placeholder(pmid, title, vartype) for pmid, title in more_itertools.take(10, pmids_title_lcuhs.items())}
pmids_abstract_lcuhsv = {pmid: map_var_placeholder(pmid, abstract, vartype) for pmid, abstract in more_itertools.take(10, pmids_abstract_lcuhs.items())}

print("Before map variants: \n{}".format(more_itertools.take(1, pmids_abstract_lcuhs.values())))
print("\nAfter map variants: \n{}".format(more_itertools.take(1, pmids_abstract_lcuhsv.values())))

Try to call tmVar API of PMID: 25820570
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=25820570&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 25330149
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=25330149&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 29264456
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=29264456&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 16140997
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=16140997&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 24733792
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=24733792&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 26719302
From URL: https://www.ncbi.nlm.nih.gov/resear

Most variants appear to be mapped correctly.
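
The log lines above show the request URL and an attempt counter. The actual `tmvar_rest_api` code lives in `tmVar_rest_client` and is not shown, so the following is only a sketch of how the URL could be assembled and how a retry loop might look. `build_tmvar_url`, `call_with_retries`, and the stand-in `flaky_fetch` are hypothetical names; no network request is made here.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml"


def build_tmvar_url(pmid, concept="mutation"):
    """Assemble the export URL seen in the log output."""
    return "{}?{}".format(BASE_URL, urlencode({"pmids": pmid, "concepts": concept}))


def call_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) until it succeeds or attempts run out (assumed retry policy)."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
            time.sleep(delay)
    raise RuntimeError("giving up after {} attempts".format(attempts)) from last_error


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("temporary failure")
    return "<collection/>"


url = build_tmvar_url("25820570")
body = call_with_retries(flaky_fetch, url)
```

For a real request, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=30).read()`, with a non-zero `delay` between attempts.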

In [17]:
# Map variants to placeholders, remove punctuation, then restore the variants
def variant_safe_punctuation_remover(pmid, text, vartype):
    # tmVar method
    soup = tmvar_rest_api(pmid, "mutation")
    tmvar_ner = retrieve_variant_entity(soup)
    tmvar_entities = get_variant_entity_tmvar(text, tmvar_ner)
    key_2_placeholder_init = {}  # initialize a dict as the parameter of build_var_temp_encoding function
    key_2_placeholder_tmvar = build_var_temp_encoding(tmvar_entities, key_2_placeholder_init, notice=False)
    tmvar_text = replace_var_2_temp_encoding(key_2_placeholder_tmvar, text, notice=True)
    # conventional regex method
    regex_entities = recognize_variant_entity(vartype, tmvar_text)  # conventional regex NER
    key_2_placeholder = build_var_temp_encoding(regex_entities, key_2_placeholder_tmvar)
    new_text = replace_var_2_temp_encoding(key_2_placeholder, tmvar_text)
    new_text = remove_punctuation(new_text)
    new_text = replace_temp_encoding_2_var(key_2_placeholder, new_text, notice=False)
    return new_text

In [18]:
# Use the first PMID as a test
pmids_title_lcuhsv = {pmid: variant_safe_punctuation_remover(pmid, title, vartype) for pmid, title in more_itertools.take(1, pmids_title_lcuhs.items())}
pmids_abstract_lcuhsv = {pmid: variant_safe_punctuation_remover(pmid, abstract, vartype) for pmid, abstract in more_itertools.take(1, pmids_abstract_lcuhs.items())}

Try to call tmVar API of PMID: 25820570
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=25820570&concepts=mutation
Call tmVar attempt 1
Try to call tmVar API of PMID: 25820570
From URL: https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmids=25820570&concepts=mutation
Call tmVar attempt 1
Variant: p.y179c
Placeholder: placeholder0

Original text:
mutyh-associated polyposis (map) is an adenomatous polyposis transmitted in an autosomal-recessive pattern, involving biallelic inactivation of the mutyh gene. loss of a functional mutyh protein will result in the accumulation of g:t mismatched dna caused by oxidative damage. although p.y179c and p.g396d are the two most prevalent mutyh variants, more than 200 missense variants have been detected. it is difficult to determine whether these variants are disease-causing mutations or single-nucleotide polymorphisms. to understand the functional consequences of these varia
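
The log above shows the tmVar API being called twice for PMID 25820570, once for the title and once for the abstract. Caching the response per PMID would avoid the duplicate request. A minimal sketch with `functools.lru_cache`, where `fetch_annotations` is a hypothetical stand-in for the real network call:

```python
from functools import lru_cache

request_log = []


def fetch_annotations(pmid):
    # Stand-in for the real API call; records each request so duplicates are visible
    request_log.append(pmid)
    return "annotations for {}".format(pmid)


@lru_cache(maxsize=None)
def fetch_annotations_cached(pmid):
    """Return the cached response for a PMID, fetching only on first use."""
    return fetch_annotations(pmid)


title_result = fetch_annotations_cached("25820570")     # triggers one request
abstract_result = fetch_annotations_cached("25820570")  # served from the cache
```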

In [20]:
print("Before remove punctuation: \n{}".format(more_itertools.take(1, pmids_abstract_lcuhs.values())))
print("\nAfter remove punctuation: \n{}".format(more_itertools.take(1, pmids_abstract_lcuhsv.values())))

Before remove punctuation: 
['mutyh-associated polyposis (map) is an adenomatous polyposis transmitted in an autosomal-recessive pattern, involving biallelic inactivation of the mutyh gene. loss of a functional mutyh protein will result in the accumulation of g:t mismatched dna caused by oxidative damage. although p.y179c and p.g396d are the two most prevalent mutyh variants, more than 200 missense variants have been detected. it is difficult to determine whether these variants are disease-causing mutations or single-nucleotide polymorphisms. to understand the functional consequences of these variants, we generated 47 mutyh gene variants via site-directed mutagenesis, expressed the encoded proteins in muty-disrupted escherichia coli, and assessed their abilities to complement the functional deficiency in the e. coli by monitoring spontaneous mutation rates. although the majority of variants exhibited intermediate complementation relative to the wild type, some variants severely interfe