# Exploration

## Takeaways

### Authors CSV file
* 374 authors depicted by id, full name, h-index, and research sector. No null values.
* 5 author names each appear twice, with the same full name and research sector but different ids. 2 of these pairs also share the same h-index.
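
The duplicate counts above can be checked with a group-by on the full name. The following is a minimal sketch on a toy frame, not the notebook's own code: the rows are a hand-copied subset of the authors data shown in this notebook, and `toy`, `dup_summary`, `dup_names`, and `same_hindex` are illustrative names.

```python
import pandas as pd

# toy frame: two duplicated names plus one unique author (subset of the real data)
toy = pd.DataFrame({
    "author_id": [352187825414, 335007990736, 1082332353958, 798864434338, 309238221625],
    "FullName": ["I. Mandić", "I. Mandić", "L. X. Chung", "L. X. Chung", "Guillaume Lemaître"],
    "HIndex": [2, 48, 2, 2, 10],
    "research_sector": [1631149, 1631149, 27313889, 27313889, 32057259],
})

# one row per full name: number of distinct ids, sectors, and h-index values
dup_summary = toy.groupby("FullName").agg(
    n_ids=("author_id", "nunique"),
    n_sectors=("research_sector", "nunique"),
    n_hindex=("HIndex", "nunique"),
)

# names carried by more than one id
dup_names = dup_summary[dup_summary.n_ids > 1]

# among those, names whose records also share a single h-index value
same_hindex = dup_names[dup_names.n_hindex == 1]

print(len(dup_names), len(same_hindex))
```

Run against the full `authors` frame, the same aggregation should give the two counts reported in the bullet above.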

### Publications CSV file
* 244 publications depicted by id, authors, topics, publication year, and DOI. No null values.
* Authors are detailed in a list by id and full name. 
* Topics are detailed in a list by id.
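
Because each author entry carries both an id and a name, both can be recovered from the raw string with one regular expression. This is a hedged sketch: the `raw` value below is an illustrative string in the same shape as the `authors` column, not an actual row, and `AUTHOR_PATTERN` and `parse_authors` are hypothetical names.

```python
import re

# illustrative cell value shaped like the raw `authors` column
# (not an actual row; it only reuses id/name pairs visible in the data samples)
raw = "['id:300648343950, name:Tadeusz Kaczorowski'\n 'id:635655697657, name:Astrid James']"

# each entry looks like: id:<digits>, name:<text up to the closing quote>
AUTHOR_PATTERN = re.compile(r"id:(\d+), name:([^'\"]+)")

def parse_authors(cell):
    """Return a list of (author_id, full_name) tuples from one raw cell."""
    return AUTHOR_PATTERN.findall(cell)

pairs = parse_authors(raw)
print(pairs)
```

Anchoring on the `id:` prefix avoids capturing digits that might appear inside a name, which a bare `\d+` pattern would also match. The pattern assumes names contain no quote characters; a name with an apostrophe would be cut short.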

### Incoming publications CSV file
* 218 publications depicted by id, authors, topics, publication year, and DOI. No null values.
* Authors are detailed in a list by id and full name. 
* Topics are detailed in a list by id.

### Topics CSV file
* 714971 topics depicted by id and name.
* 1 topic (id = 164917456) has no name.

### Authors CSV file

In [375]:
# imports
import pandas as pd
import re

In [376]:
# read authors data and get overall overview
authors = pd.read_csv('../data/authors.csv')
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   author_id        374 non-null    int64 
 1   FullName         374 non-null    object
 2   HIndex           374 non-null    int64 
 3   research_sector  374 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 11.8+ KB


In [377]:
# analyze authors id duplicated
authors[authors.author_id.duplicated(keep=False)]

Unnamed: 0,author_id,FullName,HIndex,research_sector


In [378]:
# analyze authors full name duplicated
authors[authors.FullName.duplicated(keep=False)]

Unnamed: 0,author_id,FullName,HIndex,research_sector
6,352187825414,I. Mandić,2,1631149
34,1494649202701,G Testera,2,23376214
54,1082332353958,L. X. Chung,2,27313889
60,335007990736,I. Mandić,48,1631149
128,498216830546,Pauline Hall Barrientos,4,17040978
250,1022202726879,G Testera,6,23376214
294,1133871876862,J.-Y. Roussé,10,27313889
302,17180409555,J.-Y. Roussé,8,27313889
303,214748918399,Pauline Hall Barrientos,4,17040978
305,798864434338,L. X. Chung,2,27313889


### Publications CSV file

In [379]:
# read publications data and get overall overview
publications = pd.read_csv('../data/publications.csv')
publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PublicationId     244 non-null    int64 
 1   authors           244 non-null    object
 2   topics            244 non-null    object
 3   publication_year  244 non-null    int64 
 4   Doi               244 non-null    object
dtypes: int64(2), object(3)
memory usage: 9.7+ KB


In [380]:
# publications data sample
publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,465031,"['id:300648343950, name:Tadeusz Kaczorowski'\n...",[185592680 89423630],2019,10.3390/V11070657
1,8590182776,"['id:1151051732647, name:E. Paoloni' 'id:16406...",[192562407 120665830],2016,10.1088/1748-0221/11/12/C12018
2,8590359559,"['id:987842971362, name:Z. Galloway' 'id:33500...",[ 49040817 121332964],2019,10.1016/J.NIMA.2018.08.041
3,8590382155,"['id:962073216979, name:S. Zhamkochyan'\n 'id:...",[121332964 185544564],2019,10.1016/J.NIMA.2019.04.063
4,8590416941,"['id:111669774394, name:Eva Nordberg Karlsson'...",[185592680 55493867],2019,10.1107/S2059798319013330
5,17180318214,"['id:51540165879, name:L. Bosisio' 'id:1520419...",[121332964 185544564],2020,10.1016/J.NIMA.2019.05.025
6,34360166020,"['id:987842971362, name:Z. Galloway'\n 'id:627...",[ 49040817 121332964],2019,10.1016/J.NIMA.2018.08.123
7,34360228672,"['id:1477469340278, name:R G H Robertson'\n 'i...",[121332964 114614502],2020,10.1088/1742-6596/1342/1/012026
8,51539698674,"['id:197569082586, name:Michele Montuschi'\n '...",[121332964 185544564],2016,10.1088/1742-6596/675/1/012025
9,51539963557,"['id:635655697657, name:Astrid James'\n 'id:24...",[161191863 71924100],2017,10.1371/JOURNAL.PMED.1002315


### Incoming publications CSV file

In [381]:
# read incoming publications data and get overall overview
incomming_publications = pd.read_csv('../data/incomming_publications.csv')
incomming_publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PublicationId     218 non-null    int64 
 1   authors           218 non-null    object
 2   topics            218 non-null    object
 3   publication_year  218 non-null    int64 
 4   Doi               218 non-null    object
dtypes: int64(2), object(3)
memory usage: 8.6+ KB


In [382]:
# incoming publications data sample
incomming_publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,551926,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[144133560 155202549],2021,10.1136/BMJMILITARY-2021-001983
1,551957,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/FAMPRA/CMAB101
2,8590488517,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[144133560 175605778],2021,10.5152/TURKARCHPEDIATR.2021.21060921
3,25770356408,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/EUROPACE/EUAB226
4,34360290764,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/JTM/TAAB125
5,34360290891,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 86803240],2021,10.1093/HMG/DDAB231
6,60130094586,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 86803240],2021,10.1093/HMG/DDAB231
7,68720028858,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1111/BCP.15048
8,68720028878,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1093/JNCI/DJAB156
9,68720028879,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1093/JPP/RGAB127


### Topics CSV file

In [383]:
# read topics data and get overall overview
topics = pd.read_csv('../data/topics.csv')
topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714971 entries, 0 to 714970
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   topic_id  714971 non-null  int64 
 1   name      714970 non-null  object
dtypes: int64(1), object(1)
memory usage: 10.9+ MB


In [384]:
# topics data sample
topics.head(25)

Unnamed: 0,topic_id,name
0,42812,Partition (number theory)
1,70630,Perpendicular bisector construction
2,114263,Elliptical wing
3,182566,Organizational structure
4,202113,Cauchy number
5,205068,Face (geometry)
6,234837,Conceptual graph
7,294558,Newtonian fluid
8,301363,Eukaryotic Small Ribosomal Subunit
9,322993,Julian day


In [385]:
# retrieve null name topic
topics.query('name.isnull()')

Unnamed: 0,topic_id,name
634168,164917456,


# Cleaning up

## Pre-processing actions

### Authors CSV file
* Column names are streamlined: lowercased, with underscores as word separators.
* Column _author\_id_ values cast to string datatype.
* Column _research\_sector_ values cast to string datatype.
* For each duplicated author name, the record with the highest h-index is kept and the other one is removed. When both records have the same h-index, the first one is kept.
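
The keep-the-highest-h-index rule can also be written without an explicit loop. The sketch below is an alternative formulation under stated assumptions, not the code used later in this notebook: it works on a toy frame with already-streamlined column names, and `keep_idx` and `deduped` are illustrative names.

```python
import pandas as pd

# toy frame: one pair with different h-indexes, one pair with a tie
toy = pd.DataFrame({
    "author_id": ["352187825414", "335007990736", "1082332353958", "798864434338"],
    "full_name": ["I. Mandić", "I. Mandić", "L. X. Chung", "L. X. Chung"],
    "h_index": [2, 48, 2, 2],
})

# per name, the row label of the highest h-index (idxmax returns the first one on ties)
keep_idx = toy.groupby("full_name")["h_index"].idxmax()

# keep only those rows, restoring the original row order
deduped = toy.loc[keep_idx.sort_values()]

print(deduped)
```

Names that appear only once are their own maximum, so they pass through unchanged.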

### Publications CSV file
* Column names are streamlined: lowercased, with underscores as word separators.
* Column _publication\_id_ values cast to string datatype.
* Column _authors_ renamed to _author\_list_. Values replaced by a list of author ids (list of strings).
* Column _topics_ renamed to _topic\_list_. Values replaced by a list of topic ids (list of strings).
* Author ids removed from the authors CSV file are replaced by the id of the kept record.
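
One way to picture the id replacement is as a lookup from removed id to kept id, applied to every author list. This is a minimal sketch on toy data: the `id_map` entry is the I. Mandić pair from the authors data, while `toy_pubs`, `remap`, and the `p1`/`p2` publication ids are hypothetical.

```python
import pandas as pd

# mapping: removed author_id -> kept author_id (I. Mandić, h-index 2 -> h-index 48)
id_map = {"352187825414": "335007990736"}

# toy publications frame with hypothetical publication ids
toy_pubs = pd.DataFrame({
    "publication_id": ["p1", "p2"],
    "author_list": [["352187825414", "987842971362"], ["335007990736"]],
})

def remap(ids, mapping):
    """Replace whole ids found in mapping; leave all other ids untouched."""
    return [mapping.get(i, i) for i in ids]

toy_pubs["author_list"] = toy_pubs["author_list"].apply(lambda ids: remap(ids, id_map))

print(toy_pubs)
```

Matching on the whole id rather than on a substring keeps an id from being altered just because a removed id happens to occur inside it. If a list could contain both the removed and the kept id, it would end up with a repeated entry, so a de-duplication step may be worth considering.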

### Incoming publications CSV file
* Column names are streamlined: lowercased, with underscores as word separators.
* Column _publication\_id_ values cast to string datatype.
* Column _authors_ renamed to _author\_list_. Values replaced by a list of author ids (list of strings).
* Column _topics_ renamed to _topic\_list_. Values replaced by a list of topic ids (list of strings).
* Author ids removed from the authors CSV file are replaced by the id of the kept record.
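
The column-name streamlining listed for each file is done by hand in this notebook, but it could also be derived from the original names. The helper below is a sketch under that assumption; `to_snake_case` is a hypothetical function, and it does not cover the separate renaming of _authors_ and _topics_.

```python
import re

# zero-width positions where a word boundary is implied by letter case:
# - after a lowercase letter or digit, before an uppercase letter (FullName)
# - between two uppercase letters when the second starts a word (HIndex)
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def to_snake_case(name):
    """Lowercase a column name, using '_' as the word separator."""
    return _BOUNDARY.sub("_", name).lower()

original = ["author_id", "FullName", "HIndex", "research_sector", "PublicationId", "Doi"]
streamlined = [to_snake_case(c) for c in original]
print(streamlined)
```

Names that are already in the target form, such as `author_id`, are returned unchanged.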

### Topics CSV file
* Name of the topic with _id = 164917456_ (originally blank) is set to _Not Available_.
* Column _topic\_id_ values cast to string datatype.

### Authors CSV file

In [386]:
# streamline column names
authors.columns = ['author_id', 'full_name', 'h_index', 'research_sector']
    
# author id and research sector id casted to strings
authors.author_id = authors.author_id.astype(str)
authors.research_sector = authors.research_sector.astype(str)

authors.head(25)

Unnamed: 0,author_id,full_name,h_index,research_sector
0,309238221625,Guillaume Lemaître,10,32057259
1,747324850364,Patrick L. Meras,4,8258574
2,987843024183,S. I. Konovalov,5,30262949
3,1348620307694,A. Seiden,19,1631149
4,257698535807,F. Guescini,78,1631149
5,386547616079,Kirsten Patrick,11,7352532
6,352187825414,I. Mandić,2,1631149
7,730145054695,G. Lukyanchenko,15,23376214
8,747324904932,G. Rizzo,74,15925557
9,231928734231,S Watanuki,2,15925557


### Publications CSV file

In [387]:
# streamline column names
publications.columns = ['publication_id', 'author_list', 'topic_list', 'publication_year', 'doi']

# publication id cast to string
publications.publication_id = publications.publication_id.astype(str)

# author id extracted as list of strings
publications.author_list = publications.author_list.apply(lambda x: re.findall(r'\d+', x))

# topic id extracted as list of strings
publications.topic_list = publications.topic_list.apply(lambda x: re.findall(r'\d+', x))

publications.head(25)

Unnamed: 0,publication_id,author_list,topic_list,publication_year,doi
0,465031,"[300648343950, 566936217113, 214748948560]","[185592680, 89423630]",2019,10.3390/V11070657
1,8590182776,"[1151051732647, 1640678070519, 120259680919, 7...","[192562407, 120665830]",2016,10.1088/1748-0221/11/12/C12018
2,8590359559,"[987842971362, 335007990736, 1348620307694, 14...","[49040817, 121332964]",2019,10.1016/J.NIMA.2018.08.041
3,8590382155,"[962073216979, 274878535359, 867583901561]","[121332964, 185544564]",2019,10.1016/J.NIMA.2019.04.063
4,8590416941,"[111669774394, 403727562291, 300648328930, 300...","[185592680, 55493867]",2019,10.1107/S2059798319013330
5,17180318214,"[51540165879, 1520419039372, 1340030381809, 17...","[121332964, 185544564]",2020,10.1016/J.NIMA.2019.05.025
6,34360166020,"[987842971362, 627065791804, 1279900868211]","[49040817, 121332964]",2019,10.1016/J.NIMA.2018.08.123
7,34360228672,"[1477469340278, 755914786301, 1219771280085, 7...","[121332964, 114614502]",2020,10.1088/1742-6596/1342/1/012026
8,51539698674,"[197569082586, 1494649202701, 1236951107143, 4...","[121332964, 185544564]",2016,10.1088/1742-6596/675/1/012025
9,51539963557,"[635655697657, 240518737946, 326418070679]","[161191863, 71924100]",2017,10.1371/JOURNAL.PMED.1002315


### Incoming publications CSV file

In [388]:
# streamline column names
incomming_publications.columns = ['publication_id', 'author_list', 'topic_list', 'publication_year', 'doi']

# publication id cast to string
incomming_publications.publication_id = incomming_publications.publication_id.astype(str)

# author id extracted as list of strings
incomming_publications.author_list = incomming_publications.author_list.apply(lambda x: re.findall(r'\d+', x))

# topic id extracted as list of strings
incomming_publications.topic_list = incomming_publications.topic_list.apply(lambda x: re.findall(r'\d+', x))

incomming_publications.head(25)

Unnamed: 0,publication_id,author_list,topic_list,publication_year,doi
0,551926,"[206159056594, 386547616079, 532576531850, 429...","[144133560, 155202549]",2021,10.1136/BMJMILITARY-2021-001983
1,551957,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/FAMPRA/CMAB101
2,8590488517,"[206159056594, 386547616079, 532576531850, 429...","[144133560, 175605778]",2021,10.5152/TURKARCHPEDIATR.2021.21060921
3,25770356408,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/EUROPACE/EUAB226
4,34360290764,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/JTM/TAAB125
5,34360290891,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 86803240]",2021,10.1093/HMG/DDAB231
6,60130094586,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 86803240]",2021,10.1093/HMG/DDAB231
7,68720028858,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1111/BCP.15048
8,68720028878,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1093/JNCI/DJAB156
9,68720028879,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1093/JPP/RGAB127


### Topics CSV file

In [389]:
# streamline column names
topics.columns = ['topic_id', 'name']
    
# topic id casted to string
topics.topic_id = topics.topic_id.astype(str)

# fill null value on topic with topic_id = 164917456
topics.name.fillna('Not Available', inplace=True)

topics.head(25)

Unnamed: 0,topic_id,name
0,42812,Partition (number theory)
1,70630,Perpendicular bisector construction
2,114263,Elliptical wing
3,182566,Organizational structure
4,202113,Cauchy number
5,205068,Face (geometry)
6,234837,Conceptual graph
7,294558,Newtonian fluid
8,301363,Eukaryotic Small Ribosomal Subunit
9,322993,Julian day


### Clean authors and publications up

In [390]:
# duplicated authors
duplicated = authors[authors.duplicated(['full_name'], keep=False)]

# for each duplicated author name:
# - the author record with the highest h_index is kept (the first one on ties)
# - the other record is removed from the authors data
# - in the publications and incomming_publications data,
#   the removed author_id is replaced by the kept one
for name in set(duplicated.full_name):

    # records sharing this full name (a boolean mask avoids quoting issues in query strings)
    same_name = authors[authors.full_name == name]

    # index of the row with the highest h_index (first one on ties)
    to_keep = same_name.h_index.idxmax()
    id_to_keep = authors.loc[to_keep].author_id

    # index of the row with the lowest h_index (last one on ties, hence the reversal)
    to_remove = same_name.iloc[::-1].h_index.idxmin()
    id_to_remove = authors.loc[to_remove].author_id

    # remove the authors record with the lowest h_index
    authors.drop(to_remove, inplace=True)

    # in publications data, the removed author_id is replaced by the kept one
    # (exact match on the whole id, not a substring replacement)
    publications.author_list = publications.author_list.apply(
        lambda ids: [id_to_keep if i == id_to_remove else i for i in ids])

    # in incomming_publications data, the removed author_id is replaced the same way
    incomming_publications.author_list = incomming_publications.author_list.apply(
        lambda ids: [id_to_keep if i == id_to_remove else i for i in ids])

In [392]:
authors

Unnamed: 0,author_id,full_name,h_index,research_sector
0,309238221625,Guillaume Lemaître,10,32057259
1,747324850364,Patrick L. Meras,4,8258574
2,987843024183,S. I. Konovalov,5,30262949
3,1348620307694,A. Seiden,19,1631149
4,257698535807,F. Guescini,78,1631149
...,...,...,...,...
369,833224177629,P P Martel,7,17040978
370,1340030382486,C Dunagan,7,30262949
371,446677101812,Isaac J. Arnquist,16,30262949
372,635655697657,Astrid James,14,7352532


In [398]:
# set of the topic ids used by publications (both published and incoming ones)
pub_topics = set(pd.concat([publications, incomming_publications]).topic_list.explode())

# keep only the details of topics mentioned in publications
pub_topic_details = topics[topics.topic_id.isin(pub_topics)]
pub_topic_details

Unnamed: 0,topic_id,name
2168,71240020,Stereochemistry
3500,114614502,Combinatorics
3693,121332964,Physics
4544,149923435,Demography
4654,153911025,Molecular biology
...,...,...
632397,105795698,Statistics
632735,116915560,Nuclear engineering
633859,155202549,International trade
634076,162324750,Economics
