# To recap

### Data so far

1. We have created a list of people and their ORCID identifiers - [data/pure_person_to_orcid.txt](data/pure_person_to_orcid.txt)
2. We have created a list of ORCID to PubMed identifiers - [data/orcid.tsv](data/orcid.tsv)
3. We have created a list of PubMed IDs to PubMed info - [data/pubmed.tsv](data/pubmed.tsv)
4. We have created a list of the top 100 terms for each person - [data/orcid-tf-idf.txt](data/orcid-tf-idf.txt)

### Questions

Now we need to use the common identifiers in these data to answer some questions:

1. Can we produce a set of potential collaborators for each person? Here a collaborator is someone with whom they share significant terms but have not previously published.
2. Given a piece of text, such as a conference description or a publication abstract, can we select the set of people who most closely match it?
3. If we were to compare the publications from a person outside of our group, where would they fit?


### Create some dataframes

Normally at this point we would use a database, and the phrase `knowledge graph` in the title of the workshop suggests that would be the aim here (if I were to use a database at this point, I would use Neo4j - https://neo4j.com/). However, to keep it all in a single language and framework, and as the data themselves are relatively small, we can use pandas - https://pandas.pydata.org/. 
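
To make the idea of linking on common identifiers concrete, here is a minimal sketch using toy data. The column names `pure_person_id`, `orcid_id` and `pmid` match those used later in this notebook, but every value below is invented for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are invented.
person_to_orcid = pd.DataFrame({
    "pure_person_id": ["p1", "p2"],
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0002"],
})
orcid_to_pubmed = pd.DataFrame({
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0001", "0000-0000-0000-0002"],
    "pmid": [111, 222, 222],
})

# Join on the shared identifier: person -> ORCID -> publication.
person_pubs = person_to_orcid.merge(orcid_to_pubmed, on="orcid_id", how="inner")

# Count distinct publications per person.
pubs_per_person = person_pubs.groupby("pure_person_id")["pmid"].nunique().to_dict()
print(pubs_per_person)
```

Each merge plays roughly the role that following an edge would play in a graph database, which is why small data like these can be handled comfortably in pandas.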

So, following on from the data descriptions above:


In [None]:
import pandas as pd
import config

# 1. People and their ORCID identifiers 
personToOrcid = pd.read_csv(config.demoPureOrcidFile,sep='\t')
print(personToOrcid.head())
print(personToOrcid.shape)

# 2. ORCID to PubMed identifiers
orcidToPubmed = pd.read_csv(config.demoOrcidFile,sep='\t')
print(orcidToPubmed.head())
print(orcidToPubmed.shape)

# 3. PubMed IDs to PubMed info
pubmedToInfo = pd.read_csv(config.demoPubmedFile,sep='\t')
print(pubmedToInfo.head())
print(pubmedToInfo.shape)

# 4. Top 100 terms for each person
topTerms = pd.read_csv(config.demoTfidfFile,sep='\t')
print(topTerms.head())
print(topTerms.shape)

# 5. PURE to name
pureToName = pd.read_csv(config.demoPurePeopleFile,sep='\t')
print(pureToName.head())
print(pureToName.shape)

A first pass is to count the number of matching terms between each pair of people. To start with, let's just compare the first person in the list to everyone else:

In [None]:
#loop through people, limit to top n in list
def compare_people(limit):
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing',i,iData.pure_person_id,iData.orcid_id)
        #get all terms and tf-idf values
        iTopTerms=topTerms[topTerms['orcid_id']==iData.orcid_id]
        iTerms = iTopTerms['term']

        #compare to all other people
        for j in range(i+1,personToOrcid.shape[0]):
            jData=personToOrcid.iloc[j]
            jTopTerms=topTerms[topTerms['orcid_id']==jData.orcid_id]
            jTerms = jTopTerms['term']
            com = list(set(iTerms) & set(jTerms))
            #only show matches with more than 1 matching term
            if len(com) > 1:
                print(jData.orcid_id,com)
    
compare_people(1)        

There are few matches with many overlapping terms, with one exception: **0000-0001-7086-8882**. Let's look at the research pages of these two people:

https://research-information.bristol.ac.uk/en/persons/melody-a-s-sylvestre(81e7f06c-77d8-4020-9608-0b30dd001c43)/publications.html https://research-information.bristol.ac.uk/en/persons/nicholas-a-teanby(4ec18a96-fd2e-4311-a6fa-7ec65696a4e9)/publications.html 

This suggests they are from similar research areas, and have indeed co-published.   

Let's try again, this time with the first 3 people in the list:

In [None]:
compare_people(3)  

The second person matched no one; the third, however, matched quite a few. Let's modify the function to sort the matches by the number of shared terms and show only the top 10:

In [None]:
def compare_people(limit):
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing',i,iData.pure_person_id,iData.orcid_id)
        #get all terms and tf-idf values
        iTopTerms=topTerms[topTerms['orcid_id']==iData.orcid_id]
        iTerms = iTopTerms['term']
        jComp={}
        for j in range(i+1,personToOrcid.shape[0]):
            jData=personToOrcid.iloc[j]
            jTopTerms=topTerms[topTerms['orcid_id']==jData.orcid_id]
            jTerms = jTopTerms['term']
            com = list(set(iTerms) & set(jTerms))
            #only keep matches with more than 1 matching term
            if len(com) > 1:
                jComp[jData.orcid_id]=com
        #create a list of (orcid, terms) pairs sorted by number of shared terms
        jComp = sorted(jComp.items(), key=lambda kv: len(kv[1]), reverse=True)
        for p in jComp[0:10]:
            print(p)
            
compare_people(3)
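
The same counting can also be done without explicit loops. The following is only a sketch of one alternative, using a self-merge on the term column; the toy `top_terms` frame and its values are invented, and it assumes each person/term pair appears only once.

```python
import pandas as pd

# Toy stand-in for topTerms; values are invented for illustration.
top_terms = pd.DataFrame({
    "orcid_id": ["A", "A", "A", "B", "B", "C", "C"],
    "term": ["gene", "heart", "cohort", "gene", "heart", "titan", "gene"],
})

# Pair up every two rows that share a term.
pairs = top_terms.merge(top_terms, on="term", suffixes=("_1", "_2"))

# Keep each unordered pair of people once, and drop self-matches.
pairs = pairs[pairs["orcid_id_1"] < pairs["orcid_id_2"]]

# Collect the shared terms per pair and count them.
shared = (
    pairs.groupby(["orcid_id_1", "orcid_id_2"])["term"]
    .agg(list)
    .reset_index()
)
shared["n_shared"] = shared["term"].map(len)

# Same threshold as above: more than 1 shared term, largest overlap first.
shared = shared[shared["n_shared"] > 1].sort_values("n_shared", ascending=False)
print(shared)
```

The self-merge grows quickly when many people share common terms, so for larger data the loop above may actually be easier on memory.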

This only shows us the terms that are in common. Two people could share lots of terms without those terms being the most representative of either person, and a simple count ignores how important each term is to each person. It would be better to compare all terms and their tf-idf values simultaneously.

Back in the TF-IDF section (http://localhost:8888/lab#TF-IDF-using-sklearn) we created a similarity matrix from the TF-IDF model.
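
As a reminder of what such a matrix contains, here is a minimal sketch on toy data. It assumes one block of text per ORCID and uses scikit-learn's `TfidfVectorizer` and `cosine_similarity`; the exact settings used in the earlier section may differ, and the `toy_` names and texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for token_dict: one block of publication text per ORCID.
toy_docs = {
    "orcid-A": "titan atmosphere methane cassini spectra",
    "orcid-B": "titan stratosphere cassini temperature spectra",
    "orcid-C": "myocardial infarction gene expression blood",
}

# One tf-idf vector per person.
vectorizer = TfidfVectorizer()
toy_tfs = vectorizer.fit_transform(list(toy_docs.values()))

# Square matrix: entry [i][j] is the similarity between person i and person j.
toy_matrix = cosine_similarity(toy_tfs)

for orcid, row in zip(toy_docs, toy_matrix):
    print(orcid, [round(float(v), 2) for v in row])
```

Rows and columns follow the order of the dictionary keys, which is why the code below looks up ORCIDs by position.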

In [None]:
#get similarity matrix for all people

matrixCom = {}

#get tfidf matrix and orcidText dictionary
%store -r matrix
%store -r token_dict

#ORCIDs in the same order as the rows and columns of the matrix
orcids = list(token_dict)

for i in range(0,len(matrix)):
    iOrcid=orcids[i]
    matrixCom[iOrcid]={}
    for j in range(0,len(matrix)):
        matrixCom[iOrcid][orcids[j]]=matrix[i][j]

#write every pair to file, most similar first
with open(config.orcidToOrcid,'w') as o:
    o.write('orcid_1\torcid_2\ttf-idf\n')
    counter=0
    for m in matrixCom:
        sorted_res = sorted(matrixCom[m].items(), key=lambda kv: kv[1], reverse=True)
        #print the 5 most similar people for the first 3 ORCIDs only
        if counter<3:
            print(m)
            for s in sorted_res[0:5]:
                print('\t',s[0],s[1])
            counter+=1
        for s in sorted_res:
            o.write(m+'\t'+s[0]+'\t'+str(s[1])+'\n')


### 1. Collaboration recommendation engine

These similarity data describe how similar each pair of people's publication text is, based on tf-idf. Often, similarities arise due to co-publication, so a more informative recommender might identify cases where people have significant overlap in their publication text but have never previously co-published.
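
The filtering idea can be sketched on toy data as follows. The column names follow the files described above, but the function `recommend`, the frames and all values are hypothetical and only meant to show the logic: rank by similarity, then drop anyone who shares a PubMed ID with the person of interest.

```python
import pandas as pd

# Toy similarity scores and publication lists; all values are invented.
similarity = pd.DataFrame({
    "orcid_1": ["A", "A", "A"],
    "orcid_2": ["B", "C", "D"],
    "tf-idf": [0.9, 0.6, 0.1],
})
pubs = pd.DataFrame({
    "orcid_id": ["A", "A", "B", "C", "D"],
    "pmid": [1, 2, 2, 3, 4],
})

def recommend(orcid, similarity, pubs, top_n=5):
    """Most similar people who share no publication with `orcid`."""
    # Set of PubMed IDs per person.
    pmids = {o: set(g) for o, g in pubs.groupby("orcid_id")["pmid"]}
    own = pmids.get(orcid, set())
    # Everyone scored against this person, excluding the person themselves.
    candidates = similarity[
        (similarity["orcid_1"] == orcid) & (similarity["orcid_2"] != orcid)
    ]
    # Keep only people with no PubMed ID in common.
    mask = [own.isdisjoint(pmids.get(o, set())) for o in candidates["orcid_2"]]
    return candidates.loc[mask].sort_values("tf-idf", ascending=False).head(top_n)

recs = recommend("A", similarity, pubs)
print(recs)
```

Here B is the most similar to A but is dropped because they share a publication, leaving C and D as suggestions.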

In [None]:
#load the new data into a dataframe

orcidToOrcid = pd.read_csv(config.demoOrcidToOrcid,sep='\t')
print(orcidToOrcid.shape)
print(orcidToOrcid.head())

In [None]:
#filter on co-publishing
example_orcid='0000-0001-7328-4233'
#example_orcid='0000-0003-0924-3247'
example_pubs=orcidToPubmed[orcidToPubmed['orcid_id']==example_orcid]['pmid']

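def find_overlapping_tfidf(orcid1,orcid2):
    #terms shared by two people, with each person's tf-idf value
    o1=topTerms[topTerms['orcid_id']==orcid1]
    o2=topTerms[topTerms['orcid_id']==orcid2]
    m = o1.merge(o2, on='term')
    #keep the term and the two tf-idf columns (positions assume the column order orcid_id, term, tf-idf)
    return m.iloc[:, [1,2,4]]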

counter=0
oColab=orcidToOrcid[orcidToOrcid['orcid_1']==example_orcid]['orcid_2']
for o in oColab:
    colabPubs=orcidToPubmed[orcidToPubmed['orcid_id']==o]['pmid']
    pubCom=len(set(example_pubs).intersection(set(colabPubs)))
    #find cases where there are no common publications
    if pubCom==0:
        if counter<5:
            #get overlapping tf-idf terms
            overlapPubs = find_overlapping_tfidf(example_orcid,o)
            r = orcidToOrcid[(orcidToOrcid['orcid_1']==example_orcid) & (orcidToOrcid['orcid_2']==o)]
            print(o,r['tf-idf'].to_string(header=False))
            print(overlapPubs,'\n')
            counter+=1

### 2. Matching text to people

Using the same tf-idf model, we can try and match any piece of text to the most relevant people. To do this we will compare the tf-idf vectors for a piece of text to each person using the cosine similarity of the vectors. For more info - https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/

>The cosine measure similarity is another similarity metric that depends on envisioning user preferences as points in space. Hold in mind the image of user preferences as points in an n-dimensional space. Now imagine two lines from the origin, or point (0,0,…,0), to each of these two points. When two users are similar, they’ll have similar ratings, and so will be relatively close in space—at least, they’ll be in roughly the same direction from the origin. The angle formed between these two lines will be relatively small. In contrast, when the two users are dissimilar, their points will be distant, and likely in different directions from the origin, forming a wide angle. This angle can be used as the basis for a similarity metric in the same way that the Euclidean distance was used to form a similarity metric. In this case, the cosine of the angle leads to a similarity value. If you’re rusty on trigonometry, all you need to remember to understand this is that the cosine value is always between –1 and 1: the cosine of a small angle is near 1, and the cosine of a large angle near 180 degrees is close to –1. This is good, because small angles should map to high similarity, near 1, and large angles should map to near –1.
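
For anyone who prefers code to trigonometry, the calculation itself is short. This is a plain NumPy sketch with invented vectors; the function name `cosine` is ours and is not part of the workshop code.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Define the similarity as 0 when either vector is all zeros.
    return float(u @ v / denom) if denom else 0.0

# Same direction, different length: similarity close to 1.
same_direction = cosine([1, 2, 0], [2, 4, 0])

# No non-zero dimensions in common: similarity of 0.
no_shared_terms = cosine([1, 0, 0], [0, 3, 0])

# Opposite directions: similarity close to -1.
opposite = cosine([1, 0], [-1, 0])

print(same_direction, no_shared_terms, opposite)
```

Because tf-idf weights are never negative, the similarities in this notebook fall between 0 and 1 rather than between -1 and 1; the negative half of the range only appears for vectors that can have negative entries.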

In [None]:
#https://stackoverflow.com/questions/55677314/using-sklearn-how-do-i-calculate-the-tf-idf-cosine-similarity-between-documents
%store -r tfs
corpus_tfidf = tfs
%store -r tfidf

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

#merge pure people data  to get names
pure_people_orcid = personToOrcid.merge(pureToName, left_on='pure_person_id', right_on='pure_person_id')

def compare_text(query,topMatches=5):
    print('\n','# Comparing text:',query[:100],'...')
    query_tfidf = tfidf.transform([query])
    
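    #get terms for query
    #get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    feature_names = tfidf.get_feature_names_out()
    res={}
    for col in query_tfidf.nonzero()[1]:
        res[feature_names[col]]=query_tfidf[0, col]
    #reverse sort the results (outside the loop, so it is defined even if no query term is in the vocabulary)
    sorted_res = sorted(res.items(), key=lambda kv: kv[1], reverse=True)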
    
    #compare tfidf vectors using cosine similarity
    cosineSimilarities = cosine_similarity(query_tfidf, corpus_tfidf).flatten()
    cosineData={}
    
    #order cosine similarities 
    for c in range(0,len(cosineSimilarities)):
        cosineData[c]=cosineSimilarities[c]
    sortedCosineData = sorted(cosineData.items(), key=lambda kv: kv[1], reverse=True)
    
    #process the output and get more info for matches
    for s in sortedCosineData[0:topMatches]:
        #get pureID 
        orcid=list(token_dict)[s[0]]
        p = pure_people_orcid[pure_people_orcid['orcid_id']==orcid]
        if not p.empty:
            print('\n',s[1],orcid,p['person_name'].values[0])
            #get overlapping terms
            o1=topTerms[topTerms['orcid_id']==orcid]
            o2=sorted_res[:100]
            #create a dataframe for results
            data=[]
            for i in o2:
                if i[0] in o1.term.values:
                    data.append([i[0], i[1], o1[o1['term']==i[0]]['tf-idf'].values[0]])
            df=pd.DataFrame(data, columns=["term", "doc-tfidf", "person-tfidf"])
            print(df)

In [None]:
#PubMed article 26930047
query="""
Diagnosis of Coronary Heart Diseases Using Gene Expression Profiling; Stable Coronary Artery Disease, Cardiac Ischemia with and without Myocardial Necrosis.
Cardiovascular disease (including coronary artery disease and myocardial infarction) is one of the leading causes of death in Europe, and is influenced by both 
environmental and genetic factors. With the recent advances in genomic tools and technologies there is potential to predict and diagnose heart disease using 
molecular data from analysis of blood cells. We analyzed gene expression data from blood samples taken from normal people (n = 21), non-significant coronary artery 
disease (n = 93), patients with unstable angina (n = 16), stable coronary artery disease (n = 14) and myocardial infarction (MI; n = 207). We used a feature 
selection approach to identify a set of gene expression variables which successfully differentiate different cardiovascular diseases. The initial features 
were discovered by fitting a linear model for each probe set across all arrays of normal individuals and patients with myocardial infarction. Three different 
feature optimisation algorithms were devised which identified two discriminating sets of genes, one using MI and normal controls (total genes = 6) and another 
one using MI and unstable angina patients (total genes = 7). In all our classification approaches we used a non-parametric k-nearest neighbour (KNN) classification 
method (k = 3). The results proved the diagnostic robustness of the final feature sets in discriminating patients with myocardial infarction from healthy controls. 
Interestingly it also showed efficacy in discriminating myocardial infarction patients from patients with clinical symptoms of cardiac ischemia but no myocardial 
necrosis or stable coronary artery disease, despite the influence of batch effects and different microarray gene chips and platforms.
"""
compare_text(query)

In [None]:
query="""
genome wide association gwas
"""
compare_text(query)

In [None]:
query="""
quantum mechanics
"""
compare_text(query)

In [None]:
#load from file
#https://www.bristol.ac.uk/cancer/events/2019/bioc-16may.html
text=open('data/cancer-conf.txt','r').read()
compare_text(text.lower(),topMatches=5)

### 3. Compare a person

How would a person outside of this group of people compare? If you have an ORCID, or know of one, try it and see.

In [None]:
from scripts.common_functions import orcid_to_pubmedData

example_orcid='0000-0001-9918-058X'

#get publication data from orcid and write to file
orcid_to_pubmedData([example_orcid])

#load publication ids for orcid data
pubData=set()
with open(config.orcidFile,'r') as f:
    for line in f:
        orcid,pmid = line.rstrip().split('\t')
        if orcid==example_orcid:
            pubData.add(pmid)
#print(pubData)

#load publication text matching PubMed IDs
text=''
with open(config.pubmedFile,'r') as f:
    for line in f:
        pmid,year,title,abstract = line.rstrip().split('\t')
        if pmid in pubData:
            #trailing space stops the last word of one record running into the next title
            text+=title+' '+abstract+' '
if len(text)>10:
    compare_text(text.lower(),topMatches=10)
else:
    print('Not enough text',len(text))
    

## Final thoughts

This is a very crude demonstration of how you might do this kind of thing. Natural language processing methods are developing at an incredible rate, and there are many alternative methods, e.g. word2vec, doc2vec, BERT. We haven't covered many important areas, such as entity tagging, context-aware text processing, stemming/lemmatizing, building graphs, network analysis, etc. Maybe next time :)

There are also lots of alternative tools, e.g. Scopus, SciVal, Fingerprints (Pure), but these are mainly closed source and provide no specific information on their methods. I plan to develop the above ideas in relation to AXON and welcome collaborations and ideas.