# To recap

### Data so far

1. We have created a list of people and their ORCID identifiers - [output/pure_person_to_orcid.txt](output/pure_person_to_orcid.txt)
2. We have created a list of ORCID to PubMed identifiers - [output/orcid.tsv](output/orcid.tsv)
3. We have created a list of PubMed IDs to PubMed info - [output/pubmed.tsv](output/pubmed.tsv)
4. We have created a list of the top 100 terms for each person - [output/orcid-tf-idf.txt](output/orcid-tf-idf.txt)

### Questions

Now we need to use the common identifiers in these data to answer some questions:

1. Can we produce a set of potential collaborators for each person? A potential collaborator is someone with whom they have significant terms in common but have not previously published.
2. Can we select the set of people who most closely match a specific piece of text?


### Pandas dataframes

Normally at this point we would use a database. However, to keep it all in a single language and framework, and as the data themselves are relatively small, we can use pandas - https://pandas.pydata.org/

>pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
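
As a rough illustration of how pandas can stand in for a database here, the sketch below joins two tables on a shared identifier and then aggregates, much like a SQL `JOIN` followed by `GROUP BY`. The column names `pure_person_id`, `orcid_id` and `pmid` follow the ones used in this notebook; the rows themselves are invented toy values, not real data.

```python
import pandas as pd

# Toy stand-ins for the real tables; every ID below is made up for illustration.
person_to_orcid = pd.DataFrame({
    "pure_person_id": [1, 2],
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0002"],
})
orcid_to_pubmed = pd.DataFrame({
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0001", "0000-0000-0000-0002"],
    "pmid": [111, 222, 222],
})

# Roughly: SELECT * FROM person_to_orcid JOIN orcid_to_pubmed USING (orcid_id)
person_to_pubmed = person_to_orcid.merge(orcid_to_pubmed, on="orcid_id", how="inner")
print(person_to_pubmed)

# Roughly: SELECT pure_person_id, COUNT(DISTINCT pmid) ... GROUP BY pure_person_id
pubs_per_person = person_to_pubmed.groupby("pure_person_id")["pmid"].nunique()
print(pubs_per_person)
```

The same `merge` pattern is what links people, ORCID identifiers, publications and terms in the rest of this section.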



So, following on from the data descriptions above, we load each file into a dataframe (plus a fifth file that maps PURE person IDs to names):


In [None]:
import pandas as pd

# 1. People and their ORCID identifiers 
personToOrcid = pd.read_csv('data/pure_person_to_orcid.txt',sep='\t')
print(personToOrcid.shape)
print(personToOrcid.head())

# 2. ORCID to PubMed identifiers
orcidToPubmed = pd.read_csv('data/orcid.tsv',sep='\t')
print(orcidToPubmed.shape)
print(orcidToPubmed.head())

# 3. PubMed IDs to PubMed info
pubmedToInfo = pd.read_csv('data/pubmed.tsv',sep='\t')
print(pubmedToInfo.shape)
print(pubmedToInfo.head())

# 4. Top 100 terms for each person
topTerms = pd.read_csv('data/orcid-tf-idf.txt',sep='\t')
print(topTerms.shape)
print(topTerms.head())

# 5. PURE to name
pureToName = pd.read_csv('data/pure_people.txt',sep='\t')
print(pureToName.shape)
print(pureToName.head())


A first pass would be to count the number of matching terms between each pair of people. To start with, let's just compare the first person in the list to everyone else:
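
The idea of counting shared terms can be sketched with plain Python sets before applying it to the dataframes. The person names and term lists here are hypothetical, and the "more than one shared term" threshold is simply the one this notebook happens to use.

```python
# Hypothetical top-term sets for three people (invented for illustration).
top_terms = {
    "person_a": {"cancer", "genes", "cell", "data"},
    "person_b": {"cancer", "cell", "breast"},
    "person_c": {"quantum", "data"},
}

def shared_terms(first, terms_by_person, min_shared=2):
    """Return {other_person: sorted shared terms} for people sharing at least min_shared terms."""
    matches = {}
    for other, terms in terms_by_person.items():
        if other == first:
            continue
        common = terms_by_person[first] & terms
        if len(common) >= min_shared:
            matches[other] = sorted(common)
    return matches

matches = shared_terms("person_a", top_terms)
print(matches)
```

Here `person_c` shares only one term with `person_a`, so it is filtered out.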

In [None]:
#loop through the first `limit` people, comparing each one to everyone after them in the list
def compare_people(limit):
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing',i,iData.pure_person_id,iData.orcid_id)
        #get all terms and tf-idf values
        iTopTerms=topTerms[topTerms['orcid_id']==iData.orcid_id]
        #print(iTopTerms.head())
        iTerms = iTopTerms['term']

        for j in range(i+1,personToOrcid.shape[0]):
            jData=personToOrcid.iloc[j]
            jTopTerms=topTerms[topTerms['orcid_id']==jData.orcid_id]
            #print(jTopTerms['term'].head())
            jTerms = jTopTerms['term']
            #print(iTerms,jTerms)
            com = list(set(iTerms) & set(jTerms))
            #only show matches with more than 1 matching term
            if len(com) > 1:
                print(jData.orcid_id,com)
    
compare_people(1)        

Few people share many terms with the first person; the exception is **0000-0001-7086-8882**. Let's look at the research pages of these two people:

https://research-information.bristol.ac.uk/en/persons/melody-a-s-sylvestre(81e7f06c-77d8-4020-9608-0b30dd001c43)/publications.html

https://research-information.bristol.ac.uk/en/persons/nicholas-a-teanby(4ec18a96-fd2e-4311-a6fa-7ec65696a4e9)/publications.html

These pages suggest that they work in similar research areas, and they have indeed co-published.

Let's try again, this time with the top 3:

In [None]:
compare_people(3)  

The second person matched no one; the third, however, matched quite a few. Let's modify the function to sort the matches by the number of shared terms and print only the top 10:

In [None]:
def compare_people(limit):
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing',i,iData.pure_person_id,iData.orcid_id)
        #get all terms and tf-idf values
        iTopTerms=topTerms[topTerms['orcid_id']==iData.orcid_id]
        iTerms = iTopTerms['term']
        jComp={}
        for j in range(i+1,personToOrcid.shape[0]):
            jData=personToOrcid.iloc[j]
            jTopTerms=topTerms[topTerms['orcid_id']==jData.orcid_id]
            jTerms = jTopTerms['term']
            com = list(set(iTerms) & set(jTerms))
            #only show matches with more than 1 matching term
            if len(com) > 1:
                jComp[jData.orcid_id]=com
        #convert to a list of (orcid, terms) pairs sorted by number of shared terms, largest first
        jComp = sorted(jComp.items(), key=lambda kv: len(kv[1]), reverse=True)
        for p in jComp[0:10]:
            print(p)
            
compare_people(3)

This only shows us which terms are in common. Two people could have lots of terms in common without those terms being the most representative of either person. It would be better to compare all terms, weighted by their tf-idf values.
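
To see why weights matter, here is a minimal sketch that compares people by cosine similarity over their term weights instead of by a raw count of shared terms. Cosine similarity is assumed here as the weighted measure, and all names and weights are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {term: weight} dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical weights: 'person' is dominated by the term 'cancer'.
person = {"cancer": 0.9, "using": 0.1, "data": 0.1, "analysis": 0.1}
# Shares three terms with 'person', but only generic, low-weight ones.
many_weak = {"using": 0.1, "data": 0.1, "analysis": 0.1, "quantum": 0.9}
# Shares a single term, but it is the most representative one.
few_strong = {"cancer": 0.8, "cell": 0.3}

print("many weak terms :", round(cosine(person, many_weak), 3))
print("few strong terms:", round(cosine(person, few_strong), 3))
```

Counting shared terms would rank `many_weak` first (three terms against one), whereas the weighted comparison ranks `few_strong` far higher.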

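Back in the TF-IDF section (http://localhost:8888/lab#TF-IDF-using-sklearn) we created a similarity matrix from the TF-IDF model. Here we reload it, print the five closest matches for the first three people, and write every pairwise score to a file.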



In [None]:
#get similarity matrix for all people

matrixCom = {}

#get tfidf matrix and orcidText dictionary
%store -r matrix
%store -r token_dict

#row and column i of the matrix correspond to the i-th key of token_dict
orcids = list(token_dict)
for i, iOrcid in enumerate(orcids):
    matrixCom[iOrcid] = {}
    for j, jOrcid in enumerate(orcids):
        matrixCom[iOrcid][jOrcid] = matrix[i][j]

#write every pair to file, most similar first, and print the top 5 for the first 3 people
with open('output/orcid-to-orcid-tf-idf.tsv','w') as o:
    o.write('orcid_1\torcid_2\ttf-idf\n')
    counter=0
    for m in matrixCom:
        sorted_res = sorted(matrixCom[m].items(), key=lambda kv: kv[1], reverse=True)
        if counter<3:
            print(m)
            for s in sorted_res[0:5]:
                print('\t',s[0],s[1])
            counter+=1
        for s in sorted_res:
            o.write(m+'\t'+s[0]+'\t'+str(s[1])+'\n')
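
As an alternative to the nested loops, a square similarity matrix can also be flattened into the same three-column layout directly in pandas. This is only a sketch on a toy 3x3 matrix with made-up identifiers; it assumes the matrix is a dense NumPy array whose rows and columns follow the same order as the identifier list.

```python
import numpy as np
import pandas as pd

# Toy identifiers and similarity scores (invented for illustration).
ids = ["id_a", "id_b", "id_c"]
sim = np.array([[1.0, 0.2, 0.5],
                [0.2, 1.0, 0.1],
                [0.5, 0.1, 1.0]])

# Label rows and columns, then stack the columns into a long table of pairs.
wide = pd.DataFrame(sim, index=ids, columns=ids)
wide = wide.rename_axis(index="orcid_1", columns="orcid_2")
pairs = wide.stack().rename("tf-idf").reset_index()

# Most similar first within each person, as in the file written above.
pairs = pairs.sort_values(["orcid_1", "tf-idf"], ascending=[True, False], ignore_index=True)
print(pairs)
```

The resulting dataframe could then be written out with `pairs.to_csv(..., sep='\t', index=False)`.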


### Collaboration recommendation engine

These similarity scores measure how similar each pair of people's publication text is, based on tf-idf. High similarity often arises simply from co-publication, so a more informative recommender would identify cases where people have significant overlap in their publication text but have never previously co-published.
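
The filtering logic can be sketched on toy data as follows. The column names (`orcid_1`, `orcid_2`, `tf-idf`, `orcid_id`, `pmid`) mirror the ones in this notebook, but the function name `recommend`, the single-letter identifiers and the scores are all hypothetical.

```python
import pandas as pd

# Invented pairwise similarity scores for person "A".
similarity = pd.DataFrame({
    "orcid_1": ["A", "A", "A", "A"],
    "orcid_2": ["A", "B", "C", "D"],
    "tf-idf": [1.0, 0.6, 0.4, 0.1],
})
# Invented publication lists: A and B share publication 2.
pubs = pd.DataFrame({
    "orcid_id": ["A", "A", "B", "C", "D"],
    "pmid": [1, 2, 2, 3, 4],
})

def recommend(orcid, similarity, pubs, top_n=5):
    """Most similar people who share no publication with `orcid`."""
    pmids_by_person = pubs.groupby("orcid_id")["pmid"].apply(set)
    own = pmids_by_person.get(orcid, set())
    # Everyone compared against this person, excluding the person themselves.
    candidates = similarity[(similarity["orcid_1"] == orcid) & (similarity["orcid_2"] != orcid)]
    # Keep only candidates whose publication set does not overlap with ours.
    no_copub = candidates["orcid_2"].map(
        lambda other: own.isdisjoint(pmids_by_person.get(other, set()))
    )
    return candidates[no_copub].sort_values("tf-idf", ascending=False).head(top_n)

result = recommend("A", similarity, pubs)
print(result)
```

Person `B` is the most similar to `A` but is dropped because the two share a publication, leaving `C` and `D` as suggestions.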

In [None]:
#load the new data into a dataframe

orcidToOrcid = pd.read_csv('data/orcid-to-orcid-tf-idf.tsv',sep='\t')
print(orcidToOrcid.shape)
print(orcidToOrcid.head())

In [180]:
#filter on co-publishing
example_orcid='0000-0001-7328-4233'
#example_orcid='0000-0003-0924-3247'
example_pubs=orcidToPubmed[orcidToPubmed['orcid_id']==example_orcid]['pmid']

def find_overlapping_tfidf(orcid1,orcid2):
    #return the terms shared by two people, with each person's tf-idf value (_x for orcid1, _y for orcid2)
    o1=topTerms[topTerms['orcid_id']==orcid1]
    o2=topTerms[topTerms['orcid_id']==orcid2]
    m = o1.merge(o2, on='term')
    return m[['term','tf-idf_x','tf-idf_y']]

counter=0
oColab=orcidToOrcid[orcidToOrcid['orcid_1']==example_orcid]['orcid_2']
for o in oColab:
    colabPubs=orcidToPubmed[orcidToPubmed['orcid_id']==o]['pmid']
    pubCom=len(set(example_pubs).intersection(set(colabPubs)))
    #print(o)
    if pubCom==0:
        if counter<5:
            #get overlapping tf-idf terms
            overlapPubs = find_overlapping_tfidf(example_orcid,o)
            r = orcidToOrcid[(orcidToOrcid['orcid_1']==example_orcid) & (orcidToOrcid['orcid_2']==o)]
            print(o,r['tf-idf'].to_string(header=False))
            print(overlapPubs,'\n')
            counter+=1

0000-0002-1407-8314 536    0.102831
       term  tf-idf_x  tf-idf_y
0    cancer  0.085914  0.079305
1     genes  0.061123  0.050338
2      data  0.054236  0.064645
3     using  0.035485  0.060613
4  analysis  0.032270  0.051746 

0000-0003-1562-891X 537    0.098246
            term  tf-idf_x  tf-idf_y
0           cell  0.091965  0.151794
1         cancer  0.085914  0.264330
2         breast  0.059255  0.247853
3  breast cancer  0.038263  0.198393
4    cancer cell  0.032882  0.042547
5    development  0.031938  0.032471 

0000-0001-8186-5708 542    0.074373
     term  tf-idf_x  tf-idf_y
0    cell  0.091965  0.274139
1  cancer  0.085914  0.062854 

0000-0002-2715-9930 543    0.071174
          term  tf-idf_x  tf-idf_y
0       cancer  0.085914  0.042130
1        genes  0.061123  0.115058
2         data  0.054236  0.045467
3   mechanisms  0.040592  0.030163
4  development  0.031938  0.035598 

0000-0002-6009-7137 544    0.070957
     term  tf-idf_x  tf-idf_y
0    cell  0.091965  0.241663
1

### Matching text to people

Using the same tf-idf model, we can try to match any piece of text to the most relevant people.
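
The key step is to transform the query with the already fitted vectorizer (not to refit it), so that the query lands in the same vector space as the corpus. Below is a minimal, self-contained sketch using scikit-learn on a three-document toy corpus; the corpus text, the person labels and the helper name `rank_people` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: one concatenated "publication text" per hypothetical person.
corpus = {
    "person_a": "myocardial infarction coronary artery disease patients",
    "person_b": "gene expression genome wide association studies",
    "person_c": "quantum optics photon entanglement",
}

# Fit the model once on the corpus.
vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform(list(corpus.values()))

def rank_people(query, top_n=3):
    """Rank people by cosine similarity between the query and their text."""
    # transform(), not fit_transform(): reuse the vocabulary and idf weights of the corpus.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, corpus_matrix).flatten()
    ranked = sorted(zip(corpus, scores), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

for person, score in rank_people("coronary artery disease"):
    print(person, round(float(score), 3))
```

Query words that never occurred in the corpus are simply ignored by `transform`, so a query made up entirely of unseen words scores zero against everyone.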

In [None]:
#https://stackoverflow.com/questions/55677314/using-sklearn-how-do-i-calculate-the-tf-idf-cosine-similarity-between-documents
%store -r tfs
corpus_tfidf = tfs
%store -r tfidf

In [218]:
from sklearn.metrics.pairwise import cosine_similarity

#merge pure people data
pure_people_orcid = personToOrcid.merge(pureToName, left_on='pure_person_id', right_on='pure_person_id')

def compare_text(query):
    print('\n','# Comparing text:',query[:100],'...')
    query_tfidf = tfidf.transform([query])
    
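    #get terms for query
    #(get_feature_names() was removed in scikit-learn 1.2; use it instead on versions before 1.0)
    feature_names = tfidf.get_feature_names_out()
    res={}
    for col in query_tfidf.nonzero()[1]:
        res[feature_names[col]]=query_tfidf[0, col]
    #reverse sort the results (once, after the loop, so sorted_res exists even if no query term is in the vocabulary)
    sorted_res = sorted(res.items(), key=lambda kv: kv[1], reverse=True)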
    
    #compare tfidf
    cosineSimilarities = cosine_similarity(query_tfidf, corpus_tfidf).flatten()
    cosineData={}
    
    for c in range(0,len(cosineSimilarities)):
        cosineData[c]=cosineSimilarities[c]
    sortedCosineData = sorted(cosineData.items(), key=lambda kv: kv[1], reverse=True)
    
    for s in sortedCosineData[0:5]:
        #get the person's name from their ORCID
        orcid=list(token_dict)[s[0]]
        p = pure_people_orcid[pure_people_orcid['orcid_id']==orcid]
        print('\n',s[1],orcid,p['person_name'].values[0])
        
        #get overlapping terms: print term, query tf-idf, person tf-idf
        o1=topTerms[topTerms['orcid_id']==orcid]
        o2=sorted_res[:100]
        for i in o2:
            if i[0] in o1.term.values:
                print(i[0], i[1], o1[o1['term']==i[0]]['tf-idf'].values[0])
        
        
#26930047
query1="""
Diagnosis of Coronary Heart Diseases Using Gene Expression Profiling; Stable Coronary Artery Disease, Cardiac Ischemia with and without Myocardial Necrosis.
Cardiovascular disease (including coronary artery disease and myocardial infarction) is one of the leading causes of death in Europe, and is influenced by both 
environmental and genetic factors. With the recent advances in genomic tools and technologies there is potential to predict and diagnose heart disease using 
molecular data from analysis of blood cells. We analyzed gene expression data from blood samples taken from normal people (n = 21), non-significant coronary artery 
disease (n = 93), patients with unstable angina (n = 16), stable coronary artery disease (n = 14) and myocardial infarction (MI; n = 207). We used a feature 
selection approach to identify a set of gene expression variables which successfully differentiate different cardiovascular diseases. The initial features 
were discovered by fitting a linear model for each probe set across all arrays of normal individuals and patients with myocardial infarction. Three different 
feature optimisation algorithms were devised which identified two discriminating sets of genes, one using MI and normal controls (total genes = 6) and another 
one using MI and unstable angina patients (total genes = 7). In all our classification approaches we used a non-parametric k-nearest neighbour (KNN) classification 
method (k = 3). The results proved the diagnostic robustness of the final feature sets in discriminating patients with myocardial infarction from healthy controls. 
Interestingly it also showed efficacy in discriminating myocardial infarction patients from patients with clinical symptoms of cardiac ischemia but no myocardial 
necrosis or stable coronary artery disease, despite the influence of batch effects and different microarray gene chips and platforms.
"""
compare_text(query1)

query2="""
genome wide association gwas
"""
compare_text(query2)

query3="""
quantum
"""
compare_text(query3)


 # Comparing text: 
Diagnosis of Coronary Heart Diseases Using Gene Expression Profiling; Stable Coronary Artery Diseas ...

 0.1498055959132002 0000-0002-2515-0852 Chiara Bucciarelliducci
myocardial 0.19096778316249413 0.21954367831038224
coronary 0.14344988815995552 0.12280020987203692
myocardial infarction 0.14301498823094844 0.0809045507024542
infarction 0.1376311519301692 0.08278356899359958
artery 0.1260582378238358 0.0407330570196053
mi 0.08962962902132333 0.03268908257777348
disease 0.08950236549400564 0.07128833603762648
patients 0.0840322723215066 0.10959539873839973

 0.14495921424895872 0000-0003-0924-3247 Tom R Gaunt
coronary 0.14344988815995552 0.06361745729766614
disease 0.08950236549400564 0.09837724280174748
gene 0.06140523925283896 0.09681124182367212
genes 0.04930607122405027 0.0931818434280978

 0.14106425254783597 0000-0002-1753-3730 Gianni D Angelini
myocardial 0.19096778316249413 0.05845243236233413
coronary 0.14344988815995552 0.18075640109051788
coronary arter