# To recap

### Data so far

1. We have created a list of people and their ORCID identifiers - [output/pure_person_to_orcid.txt](output/pure_person_to_orcid.txt)
2. We have created a list of ORCID to PubMed identifiers - [output/orcid.tsv](output/orcid.tsv)
3. We have created a list of PubMed IDs to PubMed info - [output/pubmed.tsv](output/pubmed.tsv)
4. We have created a list of the top 100 terms for each person - [output/orcid-tf-idf.txt](output/orcid-tf-idf.txt)

### Questions

Now we need to use the common identifiers in these data to answer some questions:

1. Can we produce a set of potential collaborators for each person? Here a potential collaborator is someone with whom they share significant terms but have not previously published.
2. Can we select the set of people who most closely match a specific piece of text?


### Pandas dataframes

Normally at this point we would use a database. However, to keep everything in a single language and framework, and because the data are relatively small, we can use pandas (https://pandas.pydata.org/):

>pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

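As a rough illustration of how pandas can stand in for a database here, the sketch below joins two small tables on their shared `orcid_id` column and counts papers per person, much as a SQL `JOIN` and `GROUP BY` would. The column names match the files listed above, but the rows and the variable names are made-up toy values, not the real data.

```python
import pandas as pd

# Toy stand-ins for two of the tables above; all IDs here are invented.
person_to_orcid = pd.DataFrame({
    "pure_person_id": ["person-a", "person-b"],
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0002"],
})
orcid_to_pubmed = pd.DataFrame({
    "orcid_id": ["0000-0000-0000-0001", "0000-0000-0000-0001", "0000-0000-0000-0002"],
    "pmid": [111, 222, 333],
})

# Roughly: SELECT * FROM person_to_orcid JOIN orcid_to_pubmed USING (orcid_id)
person_to_pubmed = person_to_orcid.merge(orcid_to_pubmed, on="orcid_id", how="inner")

# Roughly: SELECT pure_person_id, COUNT(pmid) ... GROUP BY pure_person_id
papers_per_person = person_to_pubmed.groupby("pure_person_id")["pmid"].count()
print(papers_per_person)
```

The same `merge` call works on the full dataframes once they are loaded, because the identifier columns are shared between the files.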


So, following on from the data descriptions above:


In [23]:
import pandas as pd

# 1. People and their ORCID identifiers 
personToOrcid = pd.read_csv('data/pure_person_to_orcid.txt',sep='\t')
print(personToOrcid.shape)
print(personToOrcid.head())

# 2. ORCID to PubMed identifiers
orcidToPubmed = pd.read_csv('data/orcid.tsv',sep='\t')
print(orcidToPubmed.shape)
print(orcidToPubmed.head())

# 3. PubMed IDs to PubMed info
pubmedToInfo = pd.read_csv('data/pubmed.tsv',sep='\t')
print(pubmedToInfo.shape)
print(pubmedToInfo.head())

# 4. Top 100 terms for each person
topTerms = pd.read_csv('data/orcid-tf-idf.txt',sep='\t')
print(topTerms.shape)
print(topTerms.head())


(898, 2)
                         pure_person_id             orcid_id
0  4ec18a96-fd2e-4311-a6fa-7ec65696a4e9  0000-0003-3108-5775
1  7a04c325-7a39-4560-8308-8cbcaa763747  0000-0002-6772-7111
2  b4014828-88e9-4861-ae1d-5c369b6ae35a  0000-0001-7328-4233
3  ccedc4d6-6d7e-4a60-8313-f59409ecc6dd  0000-0001-6224-3073
4  66e3df0f-6cbe-4cf4-a034-9e1b57684f6b  0000-0003-0300-4990
(15722, 2)
              orcid_id      pmid
0  0000-0001-5001-3350  26961927
1  0000-0001-5001-3350  30605491
2  0000-0001-5008-0705  29118635
3  0000-0001-5017-9473  18194108
4  0000-0001-5017-9473  18607707
(11160, 4)
       pmid  year                                              title  \
0  25475436  2015  Sixty-five common genetic variants and predict...   
1  25011450  2014  Association between alcohol and cardiovascular...   
2  28968714  2018  FATHMM-XF: accurate prediction of pathogenic p...   
3  21965548  2012  Four genetic loci influencing electrocardiogra...   
4  26930047  2016  Diagnosis of Coronary Hear

A first pass would be to count the number of matching terms between each pair of people. To start with, let's just compare the first person in the list with everyone else:

In [24]:
# loop through people (limit to the first `limit` rows)
def compare_people(limit):
    # note: uses the index label i as a row position, so this assumes the default 0..n-1 index
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing', i, iData.pure_person_id, iData.orcid_id)
        # get the top terms for person i
        iTopTerms = topTerms[topTerms['orcid_id'] == iData.orcid_id]
        iTerms = iTopTerms['term']

        # compare against every person later in the list
        for j in range(i + 1, personToOrcid.shape[0]):
            jData = personToOrcid.iloc[j]
            jTopTerms = topTerms[topTerms['orcid_id'] == jData.orcid_id]
            jTerms = jTopTerms['term']
            # terms that appear in both people's top-term lists
            com = list(set(iTerms) & set(jTerms))
            # only show matches with more than 1 matching term
            if len(com) > 1:
                print(jData.orcid_id, com)

compare_people(1)

### Comparing 0 4ec18a96-fd2e-4311-a6fa-7ec65696a4e9 0000-0003-3108-5775
0000-0001-7086-8882 ['polar vortices', 'titans winter', 'rapid mesospheric cooling', 'relatively long atmospheric', 'polar vortex formation', 'solar heating', 'vortex formation 20102011', 'unexpected rapid mesospheric', 'radiative cooling efficiency', 'temperatures 26 years', 'produced mesospheric hotspot', 'postequinox cooling far', 'trace gases 2012', 'extreme', 'vortex configuration results', 'polar vortex saturns', 'titans', 'formation', 'spring equinox', 'rapid mesospheric', 'unusually cold', 'winter polar vortex', 'stable vortex configuration', 'seasonal effects', 'saturns', 'strong seasonal effects', 'vortices following', 'subsidence winter polar', 'postequinox long', 'saturns largest', 'timeframe reach stable', 'southpolar subsidence', 'produced mesospheric', 'postequinox', 'enrichment', 'radiative time', 'subsidence', 'southpolar', 'trace', 'subsidence winter', 'trace gases relatively', 'polar vortices fo

Few matches have many overlapping terms; the one exception is **0000-0001-7086-8882**. Let's look at the research pages of these two people:

https://research-information.bristol.ac.uk/en/persons/melody-a-s-sylvestre(81e7f06c-77d8-4020-9608-0b30dd001c43)/publications.html

https://research-information.bristol.ac.uk/en/persons/nicholas-a-teanby(4ec18a96-fd2e-4311-a6fa-7ec65696a4e9)/publications.html

This suggests they are from similar research areas, and have indeed co-published.   

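Co-publication can also be checked from the data rather than by reading the research pages. The following is a minimal sketch, assuming a table with the `orcid_id` and `pmid` columns of `orcid.tsv`; the helper name `shared_pmids` and the toy rows are hypothetical. With the real data, one would pass `orcidToPubmed` and the two ORCID iDs above.

```python
import pandas as pd

def shared_pmids(orcid_to_pubmed, orcid_a, orcid_b):
    """Return the set of PubMed IDs listed against both ORCID iDs."""
    pmids_a = set(orcid_to_pubmed.loc[orcid_to_pubmed["orcid_id"] == orcid_a, "pmid"].tolist())
    pmids_b = set(orcid_to_pubmed.loc[orcid_to_pubmed["orcid_id"] == orcid_b, "pmid"].tolist())
    return pmids_a & pmids_b

# Toy data (invented): A and B share paper 2, C shares nothing with A.
toy_pubs = pd.DataFrame({
    "orcid_id": ["A", "A", "B", "C"],
    "pmid": [1, 2, 2, 3],
})

shared_ab = shared_pmids(toy_pubs, "A", "B")
shared_ac = shared_pmids(toy_pubs, "A", "C")
print(shared_ab, shared_ac)
```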
Let's try again, this time with the top 3:

In [None]:
compare_people(3)  

The second person matched no one; the third, however, matched quite a few. Let's modify the function to sort the matches by the number of shared terms and print only the top 10:

In [25]:
def compare_people(limit):
    for i, iData in personToOrcid.head(n=limit).iterrows():
        print('### Comparing', i, iData.pure_person_id, iData.orcid_id)
        # get the top terms for person i
        iTopTerms = topTerms[topTerms['orcid_id'] == iData.orcid_id]
        iTerms = iTopTerms['term']
        # orcid_id -> list of shared terms
        jComp = {}
        for j in range(i + 1, personToOrcid.shape[0]):
            jData = personToOrcid.iloc[j]
            jTopTerms = topTerms[topTerms['orcid_id'] == jData.orcid_id]
            jTerms = jTopTerms['term']
            com = list(set(iTerms) & set(jTerms))
            # only keep matches with more than 1 matching term
            if len(com) > 1:
                jComp[jData.orcid_id] = com
        # sort by number of shared terms; this gives a list of (orcid_id, terms) tuples
        jComp = sorted(jComp.items(), key=lambda kv: len(kv[1]), reverse=True)
        # print the 10 people with the most shared terms
        for p in jComp[0:10]:
            print(p)

compare_people(3)

### Comparing 0 4ec18a96-fd2e-4311-a6fa-7ec65696a4e9 0000-0003-3108-5775
('0000-0001-7086-8882', ['polar vortices', 'titans winter', 'rapid mesospheric cooling', 'relatively long atmospheric', 'polar vortex formation', 'solar heating', 'vortex formation 20102011', 'unexpected rapid mesospheric', 'radiative cooling efficiency', 'temperatures 26 years', 'produced mesospheric hotspot', 'postequinox cooling far', 'trace gases 2012', 'extreme', 'vortex configuration results', 'polar vortex saturns', 'titans', 'formation', 'spring equinox', 'rapid mesospheric', 'unusually cold', 'winter polar vortex', 'stable vortex configuration', 'seasonal effects', 'saturns', 'strong seasonal effects', 'vortices following', 'subsidence winter polar', 'postequinox long', 'saturns largest', 'timeframe reach stable', 'southpolar subsidence', 'produced mesospheric', 'postequinox', 'enrichment', 'radiative time', 'subsidence', 'southpolar', 'trace', 'subsidence winter', 'trace gases relatively', 'polar vortice

This only tells us which terms two people have in common. Two people could share many terms without those terms being the most representative of either person. It would be better to compare all terms, weighted by their tf-idf values.

Back in the TF-IDF section (http://localhost:8888/lab#TF-IDF-using-sklearn) we created a similarity matrix from the TF-IDF model.

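For reference, the usual way to compare two tf-idf vectors is cosine similarity, which is consistent with the self-similarity values of about 1.0 in the output further down. Whether the matrix from the earlier section was built exactly this way is an assumption here. The sketch below uses plain dictionaries of `{term: weight}` with invented terms and weights, and the function name is hypothetical.

```python
import math

def cosine_similarity(weights_a, weights_b):
    """Cosine similarity between two sparse {term: tf-idf weight} dicts."""
    # Only terms present in both dicts contribute to the dot product.
    dot = sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented weights for three hypothetical people.
person_a = {"polar vortex": 0.8, "titan": 0.6}
person_b = {"polar vortex": 0.6, "saturn": 0.8}
person_c = {"genetic variants": 1.0}

sim_aa = cosine_similarity(person_a, person_a)
sim_ab = cosine_similarity(person_a, person_b)
sim_ac = cosine_similarity(person_a, person_c)
print(sim_aa, sim_ab, sim_ac)
```

Unlike a raw count of shared terms, this measure rewards overlap on highly weighted terms and ignores terms that only one person uses.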


In [26]:
# get similarity matrix for all people

matrixCom = {}

# restore the tf-idf similarity matrix and the orcid-to-text dictionary
%store -r matrix
%store -r token_dict

# row/column i of matrix is taken to correspond to the i-th key of token_dict
orcids = list(token_dict)
for i, iOrcid in enumerate(orcids):
    matrixCom[iOrcid] = {}
    for j, jOrcid in enumerate(orcids):
        matrixCom[iOrcid][jOrcid] = matrix[i][j]

# write every pair to file, most similar first; print the top 5 for the first 3 people
with open('output/orcid-to-orcid-tf-idf.tsv', 'w') as o:
    o.write('orcid_1\torcid_2\ttf-idf\n')
    for counter, m in enumerate(matrixCom):
        sorted_res = sorted(matrixCom[m].items(), key=lambda kv: kv[1], reverse=True)
        if counter < 3:
            print(m)
            for s in sorted_res[0:5]:
                print('\t', s[0], s[1])
        for s in sorted_res:
            o.write(m + '\t' + s[0] + '\t' + str(s[1]) + '\n')


0000-0003-0924-3247
	 0000-0003-0924-3247 0.9999999999997207
	 0000-0002-1407-8314 0.618225606944109
	 0000-0002-7141-9189 0.5897571412339664
	 0000-0003-2052-4840 0.5690516398208536
	 0000-0002-6793-2262 0.531473900766906
0000-0001-7328-4233
	 0000-0001-7328-4233 1.0000000000000322
	 0000-0003-0920-1055 0.11566240835056221
	 0000-0003-0924-3247 0.11331746330572383
	 0000-0002-7141-9189 0.11135289016971364
	 0000-0002-4680-3517 0.10955988442674883
0000-0001-5001-3350
	 0000-0001-5001-3350 0.9999999999999893
	 0000-0002-1407-8314 0.15759457124910695
	 0000-0003-2052-4840 0.12386681519489502
	 0000-0003-0920-1055 0.10839962131879305
	 0000-0002-7141-9189 0.09669331068447609


### Collaboration recommendation engine

These similarity data describe how similar each pair of people's publication text is, based on tf-idf. Often, similarity arises from co-publication, so a more informative recommender might identify cases where people have significant overlap in their publication text but have never co-published.

In [28]:
#load the new data into a dataframe

orcidToOrcid = pd.read_csv('data/orcid-to-orcid-tf-idf.tsv',sep='\t')
print(orcidToOrcid.shape)
print(orcidToOrcid.head())

(279841, 3)
               orcid_1              orcid_2    tf-idf
0  0000-0003-0924-3247  0000-0003-0924-3247  1.000000
1  0000-0003-0924-3247  0000-0002-1407-8314  0.618226
2  0000-0003-0924-3247  0000-0002-7141-9189  0.589757
3  0000-0003-0924-3247  0000-0003-2052-4840  0.569052
4  0000-0003-0924-3247  0000-0002-6793-2262  0.531474

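One possible way to exclude existing co-authors is sketched below. It assumes a similarity table with the `orcid_1`, `orcid_2` and `tf-idf` columns shown above and a publication table with `orcid_id` and `pmid` columns, as in `orcid.tsv`. The function name `recommend_collaborators`, the `top_n` parameter and the toy rows are hypothetical, and this is only one way the filtering could be done.

```python
import pandas as pd

def recommend_collaborators(similarity, orcid_to_pubmed, top_n=5):
    """Rank similar people per ORCID iD, skipping self-matches and co-authors."""
    # Self-join on pmid: every pair of ORCID iDs attached to the same paper.
    pairs = orcid_to_pubmed.merge(orcid_to_pubmed, on="pmid", suffixes=("_1", "_2"))
    copublished = (
        pairs[["orcid_id_1", "orcid_id_2"]]
        .drop_duplicates()
        .rename(columns={"orcid_id_1": "orcid_1", "orcid_id_2": "orcid_2"})
    )
    # A left join with an indicator column marks similarity rows with no shared paper.
    flagged = similarity.merge(
        copublished, on=["orcid_1", "orcid_2"], how="left", indicator=True
    )
    keep = (flagged["_merge"] == "left_only") & (flagged["orcid_1"] != flagged["orcid_2"])
    ranked = flagged[keep].drop(columns="_merge").sort_values(
        ["orcid_1", "tf-idf"], ascending=[True, False]
    )
    # Keep the top_n most similar remaining people for each orcid_1.
    return ranked.groupby("orcid_1").head(top_n).reset_index(drop=True)

# Toy data (invented): A and B have co-published paper 1; C has not published with either.
toy_similarity = pd.DataFrame({
    "orcid_1": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "orcid_2": ["A", "B", "C", "B", "A", "C", "C", "A", "B"],
    "tf-idf": [1.0, 0.9, 0.4, 1.0, 0.9, 0.2, 1.0, 0.4, 0.2],
})
toy_pubs = pd.DataFrame({
    "orcid_id": ["A", "B", "B", "C"],
    "pmid": [1, 1, 2, 3],
})

recommendations = recommend_collaborators(toy_similarity, toy_pubs)
print(recommendations)
```

In the toy data, A and B are highly similar but are dropped from each other's lists because they share a paper, which leaves C as the only suggestion for both.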