# Creating a SPARQL query to find useful MeSH concepts

The purpose of this notebook is to create a SPARQL query to run against the MeSH dataset. We want to find useful MeSH concepts that may be connected with keywords found in our hypotheses. To do that, we previously extracted merged_noun_chunks and keywords representing the nouns in these hypotheses. Here we tokenize the keywords and use the tokens in a regex-based query to find any MeSH concepts whose labels start with one of these tokens.

First we load essential libraries.

In [None]:
import pandas as pd

We read the file with data from abstracts. The file is a CSV table with the columns object (abstract), pattern_match (sentence containing "hypoth"), merged_noun_chunks (from the sentence), merged_sent, and keywords. We are interested only in the keywords column.

In [None]:
data = pd.read_csv('abstract_data.csv')
keywords = data.keywords

Next we prepare the keywords from the CSV file for tokenization. Each cell is read as a string, so we strip the list brackets and then split it into individual keywords.

In [None]:
clean_items = []
for i in keywords:
    i = i.replace('[', '')   
    i = i.replace(']', '')  
    clean_items.append(i)
    
data.keywords = clean_items

In [None]:
def clean_text(series):
    clean_words = []
    list_of_words = series.split(',')
    for word in list_of_words:
        # word = word.replace('_', ' ')  # disabled here; underscores are replaced later in clean_keyword
        word = word.replace("'", '')
        word = word.lower()
        word = word.strip(' ')
        clean_words.append(word)
    return clean_words

In [None]:
data.keywords = data.keywords.apply(clean_text)
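
As a minimal sketch of what the two cleaning steps above do to one cell, the snippet below applies the same operations to a single string. The sample value and the helper name `parse_keywords` are hypothetical; the sketch assumes each raw keywords cell is a stringified Python list.

```python
def parse_keywords(raw):
    """Turn a stringified list of keywords into a list of clean, lowercase strings."""
    # Strip the list brackets, as in the first cleaning loop.
    stripped = raw.replace('[', '').replace(']', '')
    # Split on commas, then drop quotes, lowercase, and trim spaces, as in clean_text.
    return [w.replace("'", '').lower().strip(' ') for w in stripped.split(',')]

sample = "['heart_rate', 'Blood Pressure']"  # hypothetical cell value, not taken from the data
parsed = parse_keywords(sample)
print(parsed)  # ['heart_rate', 'blood pressure']
```

Underscores are left in place at this stage; they are replaced in the next step.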

The keywords are the noun entities taken from the merged noun chunks.
Here we replace their underscores with spaces and then split them into single-word tokens.


In [None]:
def clean_keyword(text):
    cleaned_keywords = []
    for word in text:
        # underscores join multi-word keywords; turn them back into spaces
        new_word = word.replace('_', ' ')
        cleaned_keywords.append(new_word)
    
    return cleaned_keywords

def tokenize(text):
    res = [sub.split() for sub in text]
    flattened = [i for j in res for i in j]
    return flattened


In [None]:
# clean the keywords (replace underscores with spaces)
data['keywords_clean'] = data['keywords'].apply(clean_keyword)
# tokenize
data['tokens'] = data['keywords_clean'].apply(tokenize)
# put the tokens into a set (the column is named tuple_tokens, but it holds sets)
data['tuple_tokens'] = data['tokens'].apply(set)
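
To illustrate the effect of these steps on one row, here is a small self-contained sketch. The input list is hypothetical, and the two helpers are compact re-statements of the functions above rather than new logic.

```python
def clean_keyword(words):
    """Replace underscores with spaces in every keyword."""
    return [w.replace('_', ' ') for w in words]

def tokenize(words):
    """Split every keyword on whitespace and flatten the result into one list."""
    return [tok for w in words for tok in w.split()]

row = ['heart_rate', 'blood pressure', 'heart']  # hypothetical keywords for one abstract
tokens = tokenize(clean_keyword(row))
token_set = set(tokens)  # the set removes the repeated 'heart'

print(tokens)  # ['heart', 'rate', 'blood', 'pressure', 'heart']
```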

We start with the keywords, which are the nouns extracted from the noun chunks.

In [None]:
for row in data.keywords[0:5]:  # first five rows of the keywords column
    print(row, '\n')

And these are the tokens obtained from those keywords.

In [None]:
for row in data.tokens[0:5]:  # first five rows of the tokens column
    print(row, '\n')

The following functions are run on the token sets to clean up the tokens column. The remove_freq_words function requires the file "datasets_freq_words.csv" to work.

In [None]:
def drop_double_char(ents):
    """Drop any entities that are shorter than three characters.
    
    Keyword arguments:
    ents -- a set of entities
    
    """
    drop_ents = {ent for ent in ents if len(ent) < 3}
    return ents - drop_ents

def keep_alpha(ents):
    """Keep only entities made of ASCII letters, hyphens, and spaces.
    
    Keyword arguments:
    ents -- a set of entities
    
    """
    keep_char = set('-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')
    drop_ents = {ent for ent in ents if not set(ent).issubset(keep_char)}
    return ents - drop_ents

def drop_single_char_nps(ents):
    """Within an entity, drop single characters. 
    
    Keyword arguments:
    ents -- a set of entities
    
    """
    return {' '.join([e for e in ent.split(' ') if not len(e) == 1]) for ent in ents}

def remove_freq_words(entities):
    """Drop any entities in the 5000 most common words in the English language.
    
    Keyword arguments:
    entities -- a set of entities
    
    """
    # the file is re-read on every call; the first row of the 'Word' column is skipped
    freq_words = pd.read_csv('datasets_freq_words.csv')['Word'].iloc[1:]
    for word in freq_words:
        # discard() ignores frequent words that are not in the set of abstract entities
        entities.discard(word)
    return entities

def add_clean_ents(df, funcs=()):
    """Create new column in data frame with cleaned entities.
    
    Keyword arguments:
    df -- a dataframe object
    funcs -- a list of heuristic functions to be applied to entities
    
    """
    col = 'tuple_tokens_clean'
    df[col] = df['tuple_tokens']
    for f in funcs:
        df[col] = df[col].apply(f)

We apply all the functions, in order, with add_clean_ents.

In [None]:
functions = [drop_double_char, keep_alpha, drop_single_char_nps, remove_freq_words]
add_clean_ents(data, functions)
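
To see how the heuristics combine, here is a self-contained sketch on a hypothetical token set. The frequent-word set is a small in-memory stand-in for datasets_freq_words.csv, an assumption made so the example runs without the file. The helper `apply_all` is also hypothetical and mirrors the loop in add_clean_ents for a single set.

```python
ALLOWED_CHARS = set('-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')
FREQ_WORDS = {'time', 'people'}  # hypothetical stand-in for datasets_freq_words.csv

def drop_double_char(ents):
    """Keep only entities with at least three characters."""
    return {e for e in ents if len(e) >= 3}

def keep_alpha(ents):
    """Keep only entities made of ASCII letters, hyphens, and spaces."""
    return {e for e in ents if set(e).issubset(ALLOWED_CHARS)}

def drop_single_char_nps(ents):
    """Within each entity, drop single-character parts."""
    return {' '.join(p for p in e.split(' ') if len(p) != 1) for e in ents}

def remove_freq_words(ents):
    """Drop entities that appear in the frequent-word set."""
    return ents - FREQ_WORDS

def apply_all(ents, funcs):
    """Apply each heuristic in turn to one set of entities."""
    for f in funcs:
        ents = f(ents)
    return ents

sample = {'of', 'il-6', 'p53', 'heart', 'time', 'α-synuclein'}  # hypothetical tokens
cleaned = apply_all(
    sample, [drop_double_char, keep_alpha, drop_single_char_nps, remove_freq_words]
)
print(cleaned)  # {'heart'}
```

Note that keep_alpha removes tokens containing digits or non-ASCII letters, so names such as 'il-6' or 'α-synuclein' do not reach the query.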

Let's take a look at the cleaned tokens.

In [None]:
for row in data.tuple_tokens_clean[0:5]:  # first five rows of the cleaned tokens column
    print(row, '\n')

We convert each row's set into a list, concatenate the lists across all rows, and build one final set of unique tokens.

In [None]:
def large_list(text):
    """Convert a set of tokens into a list without duplicates."""
    words = []
    for word in text:
        if word not in words:
            words.append(word)
    return words

data["list_clean"] = data["tuple_tokens_clean"].apply(large_list)
# summing a column of lists concatenates them into one long list
aggregated_list = data.list_clean.sum()

# a set keeps each token only once
unique_tokens = set(aggregated_list)
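
The same aggregation can also be written as a single set union. The sketch below uses a plain list of hypothetical sets in place of the tuple_tokens_clean column so that it runs without the data; with the dataframe, the list would be replaced by the column itself.

```python
# Hypothetical per-row token sets, standing in for data['tuple_tokens_clean'].
rows = [{'heart', 'rate'}, {'blood', 'heart'}, set()]

# Union of all row sets: every token appears once, empty rows contribute nothing.
unique_tokens = set().union(*rows)
print(sorted(unique_tokens))  # ['blood', 'heart', 'rate']
```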

This is the flattened set that we use to create the SPARQL query.

In [None]:
unique_tokens

We define a function to create the SPARQL query.

In [None]:
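def sparql_query(text):
    # one REGEX clause per token; '^' anchors the match at the start of the label
    clauses = [f"REGEX(?paLabel, '^{keyword}', 'i')" for keyword in text]
    print("WHERE {")
    print("?sub meshv:preferredConcept ?pa .")
    print("?pa rdfs:label ?paLabel .")
    print("FILTER(")
    # join with '||' so that no dangling operator follows the last clause
    print(" ||\n".join(clauses))
    print(")")
    print("}\n")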

In [None]:
sparql_query(unique_tokens)
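
The function above prints only the WHERE block. As a hedged sketch of how a complete query could be assembled as a string, the hypothetical helper `build_sparql_query` below adds prefix declarations and a SELECT clause. The choice of selected variables is an assumption, not part of the original notebook; the prefixes are the standard rdfs namespace and the MeSH vocabulary namespace. The sketch also assumes the tokens contain only letters, hyphens, and spaces, as keep_alpha guarantees, so no regex escaping is applied.

```python
MESH_PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>\n"
)

def build_sparql_query(tokens):
    """Return a full SELECT query matching labels that start with any token."""
    # Sort for reproducible output and skip empty tokens, which would match every label.
    clauses = [f"REGEX(?paLabel, '^{t}', 'i')" for t in sorted(tokens) if t]
    if not clauses:
        raise ValueError("at least one non-empty token is required")
    # Joining with '||' avoids a dangling operator after the last clause.
    filter_body = " ||\n    ".join(clauses)
    return (
        MESH_PREFIXES
        + "SELECT DISTINCT ?sub ?pa ?paLabel\n"  # assumed selection of variables
        + "WHERE {\n"
        + "  ?sub meshv:preferredConcept ?pa .\n"
        + "  ?pa rdfs:label ?paLabel .\n"
        + "  FILTER(\n    " + filter_body + "\n  )\n"
        + "}\n"
    )

query = build_sparql_query({'heart', 'blood'})  # hypothetical tokens
print(query)
```

Returning the query as a string, rather than printing it, makes it easier to pass to a SPARQL client or save to a file.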