## Collect text for analysis

The example data comes from a PubMed query for 'climate change health', restricted to the years 1972 through 2000.

I used the PubMed web interface to save the query results to a text file, then wrote some custom code to pull out the title and abstract of each article. (This is a kludge; it would be better to fetch the data in a structured format to begin with: see 'To Do' below.)
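
As a rough sketch of the structured alternative, the BioC URL pattern quoted in the To Do section can be wrapped in a small helper. The names `bioc_url` and `fetch_bioc_json` are hypothetical, the two PMIDs are taken from the results table further down, and the sketch assumes the endpoint still responds with JSON. The fetch function is defined but deliberately not called here.

```python
import json
import urllib.request

# URL pattern copied from the To Do section; assumed to still be valid.
BIOC_URL = ("https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
            "pubmed.cgi/BioC_json/{pmid}/unicode")


def bioc_url(pmid):
    """Build the BioC JSON URL for one PubMed ID (hypothetical helper)."""
    return BIOC_URL.format(pmid=pmid)


def fetch_bioc_json(pmid, timeout=30):
    """Download and decode one record; assumes the response body is JSON."""
    with urllib.request.urlopen(bioc_url(pmid), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not fetch) the URLs for two PMIDs from the example data.
example_urls = [bioc_url(pmid) for pmid in ("11019462", "8604175")]
```

Fetching one record per PMID would replace the regex-based parsing of the saved text file, at the cost of one HTTP request per article.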

## Install dependencies

* I am using Python 3.8.2
* pip install pandas
* pip install spacy
* pip install spacy-universal-sentence-encoder
* pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.3/en_use_md-0.4.3.tar.gz#en_use_md-0.4.3
* pip install scipy
* pip install scikit-learn


## To Do

* One option might be to fetch the abstracts in a structured format using the [BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).
    - `url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_json/{pmid}/unicode"`

* The most reliable option, however, might be the more script-oriented approach using `eutils`, as described in the blog post "[How to Write a Python Program to Query Biomedical Journal Citations in the PubMed Database](https://rruntsch.medium.com/how-to-write-a-python-program-to-query-biomedical-journal-citations-in-the-pubmed-database-c7e842e4df89)".

* Once we have some way to evaluate performance, we should try re-featurizing with the 'large' language model (`en_use_lg`).

* Try running TF-IDF on lowercased text ("The" should be treated as a stop word).
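
The last item can be illustrated without scikit-learn. The sketch below uses a tiny hypothetical stop-word set and a whitespace tokenizer (both assumptions, not the notebook's actual vectorizer) to show why a lowercase stop list misses a capitalized "The" unless the text is lowercased first.

```python
# Tiny hypothetical stand-in for a real (lowercase) stop-word list.
STOP_WORDS = {"the", "of", "and"}


def content_tokens(text, lowercase):
    """Split on whitespace, optionally lowercase, then drop stop words."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


text = "The effect of climate on the health of children"
kept_case = content_tokens(text, lowercase=False)  # "The" survives filtering
lowered = content_tokens(text, lowercase=True)     # "the" is removed
```

With case preserved, sentence-initial "The" is compared against "the" and slips through, which is how it can end up inside candidate cluster names.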

In [1]:
INPUT_FILE = "climate_change_health_abstract.txt"
OUTPUT_FILE = 'climate_change_health_document_sentence_vector.csv'


import re
import numpy as np
import pandas as pd
from collections import Counter

with open(INPUT_FILE, 'r', encoding='utf-8') as fh:
    all_data = fh.read()

all_data = all_data.replace("\n\n", 'XXX').replace("\n", ' ').replace("XXX", '\n')

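# Raw string avoids the invalid-escape warning for \d; the capturing group keeps each PMID in the result.
info_pmid = re.split(r'(PMID: \d+)[^\n]+\n', all_data)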


def get_title(info):
    lines = info.split('\n')
    return lines[1]

def get_abstract(info):
    lines = info.split('\n')
    line_lengths = [len(line) for line in lines] # the longest one is (usually?) the abstract. KLUDGE ALERT!
    return lines[np.argmax(line_lengths)]

document_text = pd.DataFrame([
    {
        'pmid': info_pmid[i + 1].replace('PMID: ', ''),
        'text': get_title(info_pmid[i]) + ' ' + get_abstract(info_pmid[i]),
    }
    for i in range(0, len(info_pmid) - 1, 2)  # even items are record text, odd items are PMIDs
])

document_text

|  | pmid | text |
| --- | --- | --- |
| 0 | 11019462 | Climate change and vector-borne diseases: a re... |
| 1 | 8604175 | Global climate change and emerging infectious ... |
| 2 | 9509631 | Climate change, human health, and sustainable ... |
| 3 | 11215015 | Impacts of climate and climate change on medic... |
| 4 | 8559115 | The potential health impacts of climate change... |
| ... | ... | ... |
| 359 | 1297311 | Psychosocial work environment in relation to c... |
| 360 | 4027452 | The seasonality of infant deaths due to diarrh... |
| 361 | 8850768 | Midwifery in New Zealand: responding to changi... |
| 362 | 10123355 | Five original recipes for strengthening hospit... |
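
To make the parsing step easier to follow, here is a minimal sketch on a hypothetical two-record sample (the journal lines, titles, and PMIDs are invented, and the real export may differ in detail). It shows how `re.split` with a capturing group yields alternating record text and PMID markers, and how the "second line is the title, longest line is the abstract" heuristic then applies.

```python
import re

# Hypothetical two-record sample, already normalized to one paragraph per line.
sample = (
    "1. J Example. 1999;1(1):1-2.\n"
    "First title.\n"
    "Author A, Author B.\n"
    "First abstract, which is the longest line in this record by a wide margin.\n"
    "PMID: 111 [Indexed for MEDLINE]\n"
    "2. J Example. 2000;2(1):3-4.\n"
    "Second title.\n"
    "Author C.\n"
    "Second abstract, again the longest line in its record by a wide margin.\n"
    "PMID: 222 [Indexed for MEDLINE]\n"
)

# The capturing group keeps each "PMID: n" in the result, so the list
# alternates: [text0, pmid0, text1, pmid1, trailing_text].
parts = re.split(r"(PMID: \d+)[^\n]+\n", sample)

records = []
for i in range(0, len(parts) - 1, 2):
    lines = parts[i].split("\n")
    records.append({
        "pmid": parts[i + 1].replace("PMID: ", ""),
        "title": lines[1],                # second line of the record
        "abstract": max(lines, key=len),  # longest line, same heuristic as above
    })
```

The heuristic fails whenever another field (a long author list, for example) is longer than the abstract, which is one reason a structured source would be preferable.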


## Break documents into sentences

In [2]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')

sentences = [ [sent for sent in nlp(txt).sents] for txt in document_text['text'].values]
sentences

[[Climate change and vector-borne diseases: a regional analysis.,
  Current evidence suggests that inter-annual and inter-decadal climate  variability have a direct influence on the epidemiology of vector-borne  diseases.,
  This evidence has been assessed at the continental level in order to  determine the possible consequences of the expected future climate change.,
  By  2100 it is estimated that average global temperatures will have risen by 1.0-3.5  degrees C, increasing the likelihood of many vector-borne diseases in new areas.,
   The greatest effect of climate change on transmission is likely to be observed  at the extremes of the range of temperatures at which transmission occurs.,
  For  many diseases these lie in the range 14-18 degrees C at the lower end and about  35-40 degrees C at the upper end.,
  Malaria and dengue fever are among the most  important vector-borne diseases in the tropics and subtropics; Lyme disease is  the most common vector-borne disease in the USA an

## Vectorize sentences

This cell takes a while the first time you run it, because it has to download the language model (987.47MB).

In [3]:
import spacy

# Flatten to one row per sentence: (document index, sentence index within document, text)
document_sentences = [(doc_id, sent_id, str(sent))
                      for doc_id, doc in enumerate(sentences)
                      for sent_id, sent in enumerate(doc)]

document_sentence_pdf = pd.DataFrame(document_sentences, columns=['document_id', 'sentence_number', 'sentence'])

sentence_encoder = spacy.load('en_use_md')
document_sentence_pdf['vector'] = [sentence_encoder(s).vector for s in document_sentence_pdf['sentence'].values]

document_sentence_pdf.head()

|  | document_id | sentence_number | sentence | vector |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | Climate change and vector-borne diseases: a re... | [0.05090928, 0.033927415, -0.06473036, 0.06657... |
| 1 | 0 | 1 | Current evidence suggests that inter-annual an... | [0.061849233, 0.062392477, -0.032037795, 0.035... |
| 2 | 0 | 2 | This evidence has been assessed at the contine... | [0.046435084, -0.033241905, -0.07296619, -0.04... |
| 3 | 0 | 3 | By 2100 it is estimated that average global t... | [-0.0381868, 0.035116114, -0.03148239, -0.0298... |
| 4 | 0 | 4 | The greatest effect of climate change on tran... | [0.06649031, 0.05183534, -0.038783565, 0.01175... |


In [4]:
document_sentence_pdf[['document_id', 'sentence_number']]

|  | document_id | sentence_number |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 2 |
| 3 | 0 | 3 |
| 4 | 0 | 4 |
| ... | ... | ... |
| 3324 | 362 | 4 |
| 3325 | 363 | 0 |
| 3326 | 363 | 1 |
| 3327 | 363 | 2 |


## Cluster sentences by embedding vectors

In [5]:
from scipy.cluster.hierarchy import ward, fcluster
from scipy.spatial.distance import pdist

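X = document_sentence_pdf['vector'].tolist()
y = pdist(X, metric='cosine')  # condensed vector of pairwise cosine distances
# Caveat: SciPy documents Ward linkage as correctly defined only for Euclidean
# distances, so using it with cosine distances here is a heuristic.
z = ward(y)  # linkage matrix describing the cluster hierarchy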
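
For readers unfamiliar with the condensed distance format, the following standard-library sketch computes what a pairwise cosine-distance step produces on three toy vectors. The helper names and the toy vectors are hypothetical; this is an illustration of the definition, not a replacement for `pdist`.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def condensed_cosine(vectors):
    """Distances for every pair (i, j) with i < j, in row-major order."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


# Two vectors pointing the same way and one orthogonal to both.
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
toy_distances = condensed_cosine(toy_vectors)  # pairs (0,1), (0,2), (1,2)
```

For n vectors the condensed form holds n(n-1)/2 distances, so memory grows quadratically with the number of sentences.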

In [6]:
from collections import Counter

# Try different distance thresholds to get large, medium, and small clusters, e.g.:
# len(Counter(fcluster(z, 1.2, criterion='distance')).most_common())
# Number of clusters found at each threshold:
# 4.0: 11, 3.5: 17, 3.0: 29, 2.0: 86, 1.5: 107, 1.2: 379, 1.0: 617

document_sentence_pdf['clusterA'] = fcluster(z, 3.5, criterion='distance')
document_sentence_pdf['clusterB'] = fcluster(z, 2.0, criterion='distance')
document_sentence_pdf['clusterC'] = fcluster(z, 1.2, criterion='distance')

## Get candidate names for clusters

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer


def get_candidate_names(cluster_text_pdf, cluster_col, text_col='sentence', 
                        sentence_sep=' ... ', max_ngram_length=3, top_n=1):
    corpus = cluster_text_pdf[[cluster_col, text_col]]\
                    .groupby([cluster_col])[text_col]\
                    .transform(lambda s: sentence_sep.join(s))\
                    .values
    
    stop_words = 'english' # ['a', 'an', 'the', 'and', 'of', 'at', 'in', 'or', 'been']
    tfidf = TfidfVectorizer(ngram_range=(1, max_ngram_length), 
                            stop_words=stop_words, # sublinear_tf=True, # max_df=0.5,
                            lowercase=False)
    X = tfidf.fit_transform(corpus)
    feature_names = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    
    # give slight advantage to terms containing more words
    npX = np.array(X.todense())
    smidge = 1e-12  # just enough to break ties, not enough to affect real differences
    feature_tfidf_adjustment = [smidge * len(n.split(' ')) for n in feature_names]
    adjusted_X = npX + np.array(feature_tfidf_adjustment)  # broadcasts the per-feature bonus across all rows
    # candidate_names = [x for x in np.array(feature_names)[adjusted_X.argmax(axis=1)].tolist()]
    
    top_idx = [ [i for i in v[-top_n:][::-1]] for v in adjusted_X.argsort(axis=1)] # [-n:]
    candidate_names = [', '.join(x) for x in np.array(feature_names)[top_idx].tolist()]
    
    return candidate_names
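
The tie-breaking adjustment inside `get_candidate_names` can be seen in isolation with a small sketch. The function `pick_name` and the toy scores are hypothetical; only the `1e-12` constant and the "one smidge per word" rule come from the code above.

```python
SMIDGE = 1e-12  # same tie-breaking constant as in get_candidate_names


def pick_name(scores):
    """Return the highest-scoring term, preferring longer n-grams on ties."""
    adjusted = {term: score + SMIDGE * len(term.split(" "))
                for term, score in scores.items()}
    return max(adjusted, key=adjusted.get)


# Exact tie between a unigram and a bigram: the bigram wins.
toy_scores = {"health": 0.5, "health care": 0.5, "nursing": 0.3}
best_tie = pick_name(toy_scores)

# A real difference in score is far larger than the bonus, so it still decides.
toy_scores_no_tie = {"health": 0.51, "health care": 0.5}
best_clear = pick_name(toy_scores_no_tie)
```

Because the bonus is many orders of magnitude smaller than typical TF-IDF differences, it only changes the outcome when scores are otherwise equal.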

In [8]:
document_sentence_pdf['clusterA_name'] = get_candidate_names(document_sentence_pdf, cluster_col='clusterA', text_col='sentence')
document_sentence_pdf['clusterB_name'] = get_candidate_names(document_sentence_pdf, cluster_col='clusterB', text_col='sentence')
document_sentence_pdf['clusterC_name'] = get_candidate_names(document_sentence_pdf, cluster_col='clusterC', text_col='sentence')
# document_sentence_pdf['clusterB_top3'] = get_candidate_names(document_sentence_pdf, cluster_col='clusterB', text_col='sentence', top_n=3)

In [9]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

keep_cols = ['document_id', 'sentence', 'clusterA','clusterB','clusterC','clusterA_name','clusterB_name','clusterC_name']
document_sentence_pdf[keep_cols].sort_values(by=['clusterA', 'clusterB', 'clusterC'])

|  | document_id | sentence | clusterA | clusterB | clusterC | clusterA_name | clusterB_name | clusterC_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2460 | 259 | It is key in today's health care climate that the nursing administrator be visible. | 1 | 1 | 1 | nursing | nursing | administrator |
| 2461 | 259 | One method to assure visibility is for staff to actually see a role model serving as an extension of the nurse administrator. | 1 | 1 | 1 | nursing | nursing | administrator |
| 2464 | 259 | There is no better method of communicating this than having someone who visibly echoes the values of the nurse administrator. | 1 | 1 | 1 | nursing | nursing | administrator |
| 2469 | 259 | It is important that the values of the nurse administrator be disseminated throughout the organization.(ABSTRACT TRUNCATED AT 250 WORDS) | 1 | 1 | 1 | nursing | nursing | administrator |
| 1182 | 124 | Midwives and doctors, including paediatricians, were identified as the major professional sources of these beliefs. | 1 | 1 | 2 | nursing | nursing | means health care |
| 1833 | 190 | Even though they are the most important means of health care for these women, the US public knows little about them. | 1 | 1 | 2 | nursing | nursing | means health care |
| 1852 | 191 | This means that health care managers need to view stakeholders as parts of larger bundles rather than only as individual organizations. | 1 | 1 | 2 | nursing | nursing | means health care |
| 2027 | 214 | The director of the center, Rose Mazibuko, is a nurse who believes that primary health care in its broadest sense is a human right. | 1 | 1 | 2 | nursing | nursing | means health care |
| 2435 | 255 | While many physicians are depressed by the present health care climate, feeling a loss of power and a loss in spirit, the vision of the physician manager must carry them and the organizations they build forward through uncharted waters to a future which is every bit as exciting as our past. | 1 | 1 | 2 | nursing | nursing | means health care |
| 1054 | 114 | The new ward manager: an evaluation of the changing role of the charge nurse. | 1 | 1 | 3 | nursing | nursing | nurse |


## Dimension reduction

Reduce the 512-dimensional sentence vectors to 2D with t-SNE, and save the results as (x, y) coordinates.

In [10]:
from sklearn.manifold import TSNE
v0 = np.zeros(512)  # len(document_sentence_pdf['vector'][0])
vec_2d = TSNE(n_components=2).fit_transform([v if v is not None else v0 for v in document_sentence_pdf['vector'].values])

document_sentence_pdf['x'] = vec_2d[:,0]
document_sentence_pdf['y'] = vec_2d[:,1]

## Save the file

In [11]:
document_sentence_pdf.sort_values(by=['clusterA', 'clusterB', 'clusterC']).to_csv(OUTPUT_FILE)