# Semantic fingerprinting of scientific journals

To compare texts for similarity, it helps to have an aggregate representation of each one. This notebook visualizes the semantic overlap in content between major scientific journals in the biomedical field.

We will use the [cortical.io](http://cortical.io) API to get a semantic fingerprint for the abstracts from every journal, compare the fingerprints using the [Jaccard distance metric](https://en.wikipedia.org/wiki/Jaccard_index), and finally plot the distances in an interactive map using [Bokeh](https://bokeh.pydata.org/en/latest).

Semantic fingerprinting is a technique that embeds a word or text in a context so that its conceptual links to other concepts are revealed. There are many ways to embed a text in a vector space. The method I show in this notebook relies on [semantic folding](https://en.wikipedia.org/wiki/Semantic_folding), which has its origins in theories of how the brain might store information, e.g., theories about [sparse distributed representations](https://arxiv.org/ftp/arxiv/papers/1503/1503.07469.pdf).
Find more information about semantic fingerprinting [here](http://www.cortical.io/technology_semantic.html).
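
As a minimal sketch of how two fingerprints can be compared, the snippet below treats each fingerprint as a collection of active positions (the comparison code later in this notebook converts fingerprints to sets in the same way) and computes their Jaccard distance. The helper name `jaccard_distance` and the toy positions are made up for illustration; real fingerprints returned by the API are much longer.

```python
def jaccard_distance(fp_a, fp_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two iterables of positions."""
    set_a, set_b = set(fp_a), set(fp_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1 - len(set_a & set_b) / len(union)

# Hypothetical toy fingerprints, not real API output
fp_neuro = [3, 17, 42, 99, 128]
fp_brain = [3, 17, 42, 256, 512]
fp_ethics = [7, 900, 1024]

print(jaccard_distance(fp_neuro, fp_brain))   # 3 shared of 7 distinct positions
print(jaccard_distance(fp_neuro, fp_ethics))  # no shared positions
```

A distance of 0 means the two fingerprints are identical, and a distance of 1 means they share no positions at all.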

The data used in this notebook (a sample of 200 abstracts for each journal) stems from PubMed queries and can be obtained, e.g., using the scripts in [this repository](https://github.com/wiheto/readabilityinscience). With adjustments, however, the notebook can be used to gather and compare semantic fingerprints for any text-based data.

## Imports

In [1]:
import retinasdk
import pprint
import os
import numpy as np
import pandas as pd
import unidecode
import time
import json
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer

import sys
sys.path.append('./')

## Initializations

In [2]:
timer = 10  # n seconds delay between every bulk request to the API

In [3]:
basePath = './data/'
outPath = './output/'
ending = '/id_article/abstracttext_pubdate_year_pmid_articletitle_journal_title_keyword_doi/searchresults'

Obtain a free API key by filling out the form at http://www.cortical.io/resources_apikey.html and enter it below:

In [4]:
cortical_api_key = 'YOUR_API_KEY'  # Insert your API key here
liteClient = retinasdk.LiteClient(cortical_api_key)  # Only need the liteClient for the operations below

We use the names of all folders in `./data/` as journal names:

In [5]:
# Code from https://stackoverflow.com/questions/800197/how-to-get-all-of-the-immediate-subdirectories-in-python
def get_immediate_subdirectories(a_dir):
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name))]

In [6]:
journalList = get_immediate_subdirectories(basePath)

### Which journals are we looking at?

In [7]:
pprint.pprint(journalList)

['adv_anat_embryol_cell_biol.[journal]',
 'adv_drug_deliv_rev.[journal]',
 'adv_virus_res[journal]',
 'aids[journal]',
 'allergy[journal]',
 'am_fam_physician[journal]',
 'am_j_hum_genet[journal]',
 'am_j_neuroradiol[journal]',
 'am_j_psychiat[journal]',
 'anim_behav[journal]',
 'annu_rev_biochem[journal]',
 'annu_rev_cell_dev_biol.[journal]',
 'annu_rev_genet[journal]',
 'annu_rev_immunol[journal]',
 'annu_rev_med[journal]',
 'annu_rev_microbiol[journal]',
 'annu_rev_neurosci[journal]',
 'annu_rev_physiol[journal]',
 'annu_rev_phytopathol[journal]',
 'annu_rev_psychol[journal]',
 'ann_intern_med[journal]',
 'ann_ny_acad_sci[journal]',
 'antivir_res[journal]',
 'aquat_toxicol[journal]',
 'arch_gen_psychiat[journal]',
 'arch_intern_med[journal]',
 'arch_toxicol[journal]',
 'behav_brain_sci[journal]',
 'biofouling[journal]',
 'biol_psychiat[journal]',
 'biol_rev[journal]',
 'bmj[journal]',
 'brain_behav_immun[journal]',
 'brit_j_psychiat[journal]',
 'br_j_gen_pract.[journal]',
 'ca-cance
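
Some of the folder names above end in `.[journal]` rather than `[journal]`, so removing only the `[journal]` tag leaves a trailing dot. A small helper along the following lines could tidy the names for display; `clean_journal_name` is a hypothetical name and not part of the original notebook.

```python
def clean_journal_name(folder_name):
    """Strip the '[journal]' search tag and any trailing dot from a folder name."""
    return folder_name.replace('[journal]', '').rstrip('.')

# Folder names taken from the listing above
folders = ['adv_anat_embryol_cell_biol.[journal]', 'aids[journal]', 'bmj[journal]']
print([clean_journal_name(name) for name in folders])
# ['adv_anat_embryol_cell_biol', 'aids', 'bmj']
```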

### Initializing a Pandas DataFrame for the output

In [8]:
df_journal_fingerprints = pd.DataFrame(index=range(0,len(journalList)), columns=['fingerprint', 'journal', 'similar', 'abstract_count', 'median_year'])

## Getting a semantic fingerprint for every journal using the cortical.io API

In [None]:
for i in range(0, len(journalList)):
    fingerprintsList = []
    overallidList = []
    overallyearList = []
    combinedPath = (basePath + journalList[i] + ending)
    df_temp = pd.read_json(combinedPath)
    df_temp.sort_index(inplace=True)
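    # Assign the result back instead of using inplace=True on a column selection,
    # which may operate on a copy and leave df_temp unchanged
    df_temp['abstracttext'] = df_temp['abstracttext'].replace('', np.nan)
    df_reduced = df_temp.dropna(subset=['abstracttext'])  # Drop records without abstract text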
    df_reduced.index = range(len(df_reduced))
    abstractList = df_reduced['abstracttext'].tolist()
    
    # Aggregate up to the first 200 abstracts into one text per journal and take
    # the median publication year of those abstracts (slicing also covers
    # journals with fewer than 200 abstracts)
    abstractlength = min(len(abstractList), 200)
    journal_abstracts = ' '.join(abstractList[:abstractlength])
    try:
        median_year = np.median(df_reduced['pubdate_year'][:abstractlength])
    except Exception:  # e.g. missing or non-numeric publication years
        median_year = 'NaN'

    journalfingerprint = liteClient.getFingerprint(journal_abstracts)  # Alternative would be to use createCategoryFilter on unjoined list
    similarterms = liteClient.getSimilarTerms(journal_abstracts)

    # Use .at for cell-wise assignment; chained indexing (df[col][i] = ...)
    # may write to a temporary copy instead of the DataFrame itself
    df_journal_fingerprints.at[i, 'fingerprint'] = journalfingerprint
    df_journal_fingerprints.at[i, 'journal'] = journalList[i].replace('[journal]','')
    df_journal_fingerprints.at[i, 'similar'] = similarterms
    df_journal_fingerprints.at[i, 'abstract_count'] = abstractlength
    df_journal_fingerprints.at[i, 'median_year'] = median_year
    print("Journal " + journalList[i].replace('[journal]','') + " done")
    print("Waiting " + str(timer) + " seconds, about " + str(int((len(journalList)-i) * timer / 60)) + " minutes left")
    time.sleep(timer)

# Saving the final dataframe
df_journal_fingerprints.reset_index().to_json(outPath + 'journal_fingerprints.json')
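
The aggregation step inside the loop above can be illustrated without pandas or the API. The sketch below is a simplified stand-in: `aggregate_abstracts` and the record layout are assumptions made for this example, and unlike the loop it skips missing years rather than falling back to `'NaN'`.

```python
from statistics import median

def aggregate_abstracts(records, max_abstracts=200):
    """Join up to `max_abstracts` non-empty abstracts and summarise them.

    `records` is a list of dicts with 'abstracttext' and 'pubdate_year' keys,
    a simplified stand-in for the rows of the DataFrame used above.
    """
    usable = [r for r in records if r.get('abstracttext')][:max_abstracts]
    text = ' '.join(r['abstracttext'] for r in usable)
    years = [r['pubdate_year'] for r in usable if r.get('pubdate_year') is not None]
    median_year = median(years) if years else None
    return text, len(usable), median_year

# Hypothetical records, not real PubMed data
records = [
    {'abstracttext': 'First abstract.', 'pubdate_year': 2001},
    {'abstracttext': '', 'pubdate_year': 2005},  # dropped: empty text
    {'abstracttext': 'Second abstract.', 'pubdate_year': 2011},
    {'abstracttext': 'Third abstract.', 'pubdate_year': None},
]
text, count, median_year = aggregate_abstracts(records, max_abstracts=2)
print(count, median_year)
```

The joined `text` is what would be sent to the API as a single document per journal.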

A scientific abstract from `abstractList` (which holds the abstracts of the last journal processed) looks like this:

In [None]:
print(abstractList[0])

## Plotting an interactive [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) map between fingerprints using Bokeh

In [11]:
# Load the fingerprints dataframe if necessary
df_journal_fingerprints = pd.read_json(outPath + 'journal_fingerprints.json')

# Remove journals with under 200 proper abstracts
reduced_fingerprints = df_journal_fingerprints[df_journal_fingerprints['abstract_count'] >= 200].copy().sort_values('journal', ascending=False)

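# Lemmatize the similar terms to reduce the number of singular/plural duplicates a bit
# (WordNetLemmatizer needs the WordNet corpus, e.g. via nltk.download('wordnet'))
wnl = WordNetLemmatizer()
def lemmatize(row):
    return [wnl.lemmatize(word) for word in row]  # Lemmatize every term in the list

def extract_unique(row):
    # Keep each term once, in order of first appearance
    return pd.unique(pd.Series(row))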

reduced_words = pd.Series(reduced_fingerprints['similar'].apply(lemmatize))
unique_words = reduced_words.apply(extract_unique)
reduced_fingerprints['similar_reduced'] = unique_words.values
    
comparison_array = np.empty((len(reduced_fingerprints),len(reduced_fingerprints),))
comparison_array[:] = np.nan  # np.NAN is an alias that was removed in NumPy 2.0

def compute_jaccard_index(set_1, set_2):
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)
    
for j in range(0,len(reduced_fingerprints)):
    for k in range(j,len(reduced_fingerprints)):
        comparison_array[j][k] = 1 - compute_jaccard_index(set(reduced_fingerprints['fingerprint'].iloc[j]), set(reduced_fingerprints['fingerprint'].iloc[k])) # 1 minus Jaccard to get Jaccard distance


# Interactive plotting with Bokeh, based on http://bokeh.pydata.org/en/latest/docs/gallery/les_mis.html

from bokeh.plotting import figure, show, output_file
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.io import reset_output

reset_output()

subset_fingerprints = reduced_fingerprints[['journal','similar_reduced']]
nodes = json.loads(subset_fingerprints.to_json(orient="records"))

names = [node['journal'] for node in nodes]
N = len(nodes)

counts = comparison_array
normcounts = comparison_array * 1.0/comparison_array[~np.isnan(comparison_array)].max()  # Normalize the array for color display

xname = []
yname = []
color = []
alpha = []
similar = []
for i, node1 in enumerate(nodes):
    for j, node2 in enumerate(nodes):
        xname.append(node1['journal'])
        yname.append(node2['journal'])
        similar.append([itm for itm in node1['similar_reduced'] if itm in node2['similar_reduced']])

        if j > i:
            color.append('orange')
            # Take (1 - value) for smaller distances to appear darker
            alpha.append(1 - normcounts[i,j])
        elif j == i:
            color.append('white')
            alpha.append(1)
        else:
            color.append('lightgrey')
            alpha.append(1)

source = ColumnDataSource(data=dict(
    xname=xname,
    yname=yname,
    alphas=alpha,
    colors=color,
    count=counts.flatten(),
    similar=similar,
))

p = figure(title="Semantic similarity of biomedical journals from a sample of 200 abstracts per journal",
           x_axis_location="above", tools="hover,save",
           x_range=list(reversed(names)), y_range=names)

# Note: Bokeh 3.0 and later use p.width / p.height instead of plot_width / plot_height
p.plot_width = 1200
p.plot_height = 1200
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "5pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = np.pi/3

p.rect('xname', 'yname', 0.9, 0.9, source=source,
       color='colors', alpha='alphas', line_color=None,
       hover_line_color='black', hover_color='colors', hover_alpha='alphas')

p.select_one(HoverTool).tooltips = [
    ('Journals', '@yname, @xname'),
    ('Jaccard', '@count'),
    ('Concepts', '@similar'),
]

output_file("journals_200.html", title="Semantic similarity of biomedical journals from a sample of 200 abstracts per journal")

show(p)

The output can then be viewed as an HTML file (`journals_200.html`) in the main directory or online [here](http://bcschiffler.com/journals_200.html). Hovering over the white cells on the diagonal shows the similar terms associated with each journal's fingerprint. The upper triangle of the matrix reflects the Jaccard distance between all pairs of fingerprints in the dataset; a darker orange colour indicates fingerprints that are more similar to each other. Hovering over these cells shows the names of the two journals being compared, their Jaccard distance, and the similar terms shared by both fingerprints.
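
The mapping from distance to colour intensity in the plot can be sketched in isolation: distances are divided by the largest distance and then inverted, so that small distances become large opacity values. The function name `distances_to_alphas` and the toy distances are assumptions made for this example.

```python
def distances_to_alphas(distances):
    """Map upper-triangle distances to opacities in [0, 1].

    `distances` maps (row, column) index pairs to Jaccard distances.
    Smaller distances give larger alpha values, i.e. darker cells.
    """
    largest = max(distances.values())
    if largest == 0:
        return {pair: 1.0 for pair in distances}  # all fingerprints identical
    return {pair: 1 - d / largest for pair, d in distances.items()}

# Hypothetical distances between three journals
distances = {(0, 1): 0.25, (0, 2): 0.5, (1, 2): 1.0}
alphas = distances_to_alphas(distances)
print(alphas)
# {(0, 1): 0.75, (0, 2): 0.5, (1, 2): 0.0}
```

One consequence of this normalization is that the most distant pair in the dataset always ends up fully transparent.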

A few examples of interest in this map indicate that semantic overlaps indeed reflect journal similarity:

![Comparison between Nature Neuroscience and Neuron](img/nat_neurosci_neuron.png)

The journals Nature Neuroscience and Neuron are quite similar, as indicated by a low Jaccard distance, and both of their fingerprints are related to concepts like "receptor", "neuron" and "inhibition".
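
The shared concepts shown on hover are the similar terms that occur in both journals' term lists. A minimal sketch of that intersection follows; `shared_terms` is a hypothetical helper, and the term lists are invented apart from the three concepts named above.

```python
def shared_terms(terms_a, terms_b):
    """Return terms present in both lists, once each, in the order of `terms_a`."""
    lookup = set(terms_b)
    return [term for term in dict.fromkeys(terms_a) if term in lookup]

# Hypothetical term lists for two journals
terms_a = ['receptor', 'neuron', 'synapse', 'inhibition', 'neuron']
terms_b = ['inhibition', 'cortex', 'receptor', 'neuron']
print(shared_terms(terms_a, terms_b))
# ['receptor', 'neuron', 'inhibition']
```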

![Comparison between Psychological Medicine and Schizophrenia Bulletin](img/psychol_med_schizophr_bull.png)

The journals Psychological Medicine and Schizophrenia Bulletin are also similar, but here their fingerprints overlap on concepts such as "disorder", "symptom" and "therapy".

![Comparison between Psychological Bulletin and Science and Engineering Ethics](img/psychol_bull_sci_eng_ethics.png)

On the other hand, the journal Science and Engineering Ethics does not overlap strongly with most other journals, as indicated by relatively high Jaccard distance values. This is likely because it focuses thematically on the ethics of science and therefore takes more of a meta-approach. On this meta-level, however, there are similarities with some journals that focus on reviews of the literature, such as Psychological Bulletin. This is reflected in the overlap in terms such as "understanding", "approach", or "concept".
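
Pairs like these can also be looked up programmatically rather than by hovering over the map. The sketch below ranks journal pairs by distance; `closest_pairs`, the journal labels, and the distance values are all hypothetical and only illustrate the idea.

```python
def closest_pairs(names, distances, top_n=3):
    """Return the `top_n` journal pairs with the smallest distance.

    `distances[(i, j)]` holds the distance between names[i] and names[j], i < j.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [(names[i], names[j], d) for (i, j), d in ranked[:top_n]]

# Hypothetical journal labels and distances, for illustration only
names = ['journal_a', 'journal_b', 'journal_c']
distances = {(0, 1): 0.62, (0, 2): 0.91, (1, 2): 0.58}
for first, second, d in closest_pairs(names, distances, top_n=2):
    print(first, second, d)
```

Sorting in descending order instead would surface the journals that overlap least with the rest of the sample.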