# Takeaways

## Authors CSV file
* 374 authors described by id, full name, h-index, and research sector. No null values.
* 5 full names each appear twice, with the same research sector but different ids. For 2 of them (L. X. Chung and Pauline Hall Barrientos), the two records also share the same h-index.

## Publications CSV file
* 244 publications described by id, authors, topics, publication year, and DOI. No null values.
* Authors are given as a list of id and full name pairs.
* Topics are given as a list of ids.

## Topics CSV file
* 714,971 topics described by id and name.
* 1 topic (id = 164917456) has no name.

## Incoming publications
* 218 publications described by id, authors, topics, publication year, and DOI. No null values.
* Authors are given as a list of id and full name pairs.
* Topics are given as a list of ids.

In [1]:
# imports
import pandas as pd
import re

In [2]:
# read authors data and get an overview
authors = pd.read_csv('../data/authors.csv')
authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   author_id        374 non-null    int64 
 1   FullName         374 non-null    object
 2   HIndex           374 non-null    int64 
 3   research_sector  374 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 11.8+ KB


In [3]:
# check for duplicated author ids
authors[authors.author_id.duplicated(keep=False)]

Unnamed: 0,author_id,FullName,HIndex,research_sector


In [4]:
# check for duplicated author full names
authors[authors.FullName.duplicated(keep=False)]

Unnamed: 0,author_id,FullName,HIndex,research_sector
6,352187825414,I. Mandić,2,1631149
34,1494649202701,G Testera,2,23376214
54,1082332353958,L. X. Chung,2,27313889
60,335007990736,I. Mandić,48,1631149
128,498216830546,Pauline Hall Barrientos,4,17040978
250,1022202726879,G Testera,6,23376214
294,1133871876862,J.-Y. Roussé,10,27313889
302,17180409555,J.-Y. Roussé,8,27313889
303,214748918399,Pauline Hall Barrientos,4,17040978
305,798864434338,L. X. Chung,2,27313889
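
The duplicated-name output above can be summarized per name to see how far each pair of records agrees. The following is a minimal sketch that rebuilds the ten rows shown above by hand; the `dupes`, `summary`, and `same_hindex` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rows copied from the duplicated-name output above, in the same order.
dupes = pd.DataFrame(
    {
        "author_id": [352187825414, 1494649202701, 1082332353958, 335007990736,
                      498216830546, 1022202726879, 1133871876862, 17180409555,
                      214748918399, 798864434338],
        "FullName": ["I. Mandić", "G Testera", "L. X. Chung", "I. Mandić",
                     "Pauline Hall Barrientos", "G Testera", "J.-Y. Roussé",
                     "J.-Y. Roussé", "Pauline Hall Barrientos", "L. X. Chung"],
        "HIndex": [2, 2, 2, 48, 4, 6, 10, 8, 4, 2],
        "research_sector": [1631149, 23376214, 27313889, 1631149, 17040978,
                            23376214, 27313889, 27313889, 17040978, 27313889],
    }
)

# For each repeated name, count distinct ids, sectors, and h-index values.
summary = dupes.groupby("FullName").agg(
    n_ids=("author_id", "nunique"),
    n_sectors=("research_sector", "nunique"),
    n_hindex=("HIndex", "nunique"),
)

# Names whose two records agree on both sector and h-index.
same_hindex = summary[(summary["n_sectors"] == 1) & (summary["n_hindex"] == 1)]

print(summary)
print(sorted(same_hindex.index))
```

Records that agree on name, sector, and h-index are the most likely to be the same person stored under two ids, but the data shown here cannot confirm that, so merging them would be a separate decision.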


In [5]:
# read publications data and get an overview
publications = pd.read_csv('../data/publications.csv')
publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PublicationId     244 non-null    int64 
 1   authors           244 non-null    object
 2   topics            244 non-null    object
 3   publication_year  244 non-null    int64 
 4   Doi               244 non-null    object
dtypes: int64(2), object(3)
memory usage: 9.7+ KB


In [6]:
# publications data sample
publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,465031,"['id:300648343950, name:Tadeusz Kaczorowski'\n...",[185592680 89423630],2019,10.3390/V11070657
1,8590182776,"['id:1151051732647, name:E. Paoloni' 'id:16406...",[192562407 120665830],2016,10.1088/1748-0221/11/12/C12018
2,8590359559,"['id:987842971362, name:Z. Galloway' 'id:33500...",[ 49040817 121332964],2019,10.1016/J.NIMA.2018.08.041
3,8590382155,"['id:962073216979, name:S. Zhamkochyan'\n 'id:...",[121332964 185544564],2019,10.1016/J.NIMA.2019.04.063
4,8590416941,"['id:111669774394, name:Eva Nordberg Karlsson'...",[185592680 55493867],2019,10.1107/S2059798319013330
5,17180318214,"['id:51540165879, name:L. Bosisio' 'id:1520419...",[121332964 185544564],2020,10.1016/J.NIMA.2019.05.025
6,34360166020,"['id:987842971362, name:Z. Galloway'\n 'id:627...",[ 49040817 121332964],2019,10.1016/J.NIMA.2018.08.123
7,34360228672,"['id:1477469340278, name:R G H Robertson'\n 'i...",[121332964 114614502],2020,10.1088/1742-6596/1342/1/012026
8,51539698674,"['id:197569082586, name:Michele Montuschi'\n '...",[121332964 185544564],2016,10.1088/1742-6596/675/1/012025
9,51539963557,"['id:635655697657, name:Astrid James'\n 'id:24...",[161191863 71924100],2017,10.1371/JOURNAL.PMED.1002315


In [7]:
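# parse the authors field into a list of author ids; matching on the "id:" prefix
# keeps any digits that appear inside a name from being captured
# parse the topics field into a list of topic ids
publications.authors = publications.authors.apply(lambda x: re.findall(r'id:(\d+)', x))
publications.topics = publications.topics.apply(lambda x: re.findall(r'\d+', x))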
publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,465031,"[300648343950, 566936217113, 214748948560]","[185592680, 89423630]",2019,10.3390/V11070657
1,8590182776,"[1151051732647, 1640678070519, 120259680919, 7...","[192562407, 120665830]",2016,10.1088/1748-0221/11/12/C12018
2,8590359559,"[987842971362, 335007990736, 1348620307694, 14...","[49040817, 121332964]",2019,10.1016/J.NIMA.2018.08.041
3,8590382155,"[962073216979, 274878535359, 867583901561]","[121332964, 185544564]",2019,10.1016/J.NIMA.2019.04.063
4,8590416941,"[111669774394, 403727562291, 300648328930, 300...","[185592680, 55493867]",2019,10.1107/S2059798319013330
5,17180318214,"[51540165879, 1520419039372, 1340030381809, 17...","[121332964, 185544564]",2020,10.1016/J.NIMA.2019.05.025
6,34360166020,"[987842971362, 627065791804, 1279900868211]","[49040817, 121332964]",2019,10.1016/J.NIMA.2018.08.123
7,34360228672,"[1477469340278, 755914786301, 1219771280085, 7...","[121332964, 114614502]",2020,10.1088/1742-6596/1342/1/012026
8,51539698674,"[197569082586, 1494649202701, 1236951107143, 4...","[121332964, 185544564]",2016,10.1088/1742-6596/675/1/012025
9,51539963557,"[635655697657, 240518737946, 326418070679]","[161191863, 71924100]",2017,10.1371/JOURNAL.PMED.1002315
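
To make the parsing step concrete, here is a minimal sketch applied to a single raw value. The ids are taken from row 0 above; the second and third author names are placeholders, because the output truncates the real ones. Converting the matches to `int` is an assumption made here so that the ids would line up with the `int64` `author_id` and `topic_id` columns; the notebook itself keeps them as strings.

```python
import re

# One raw `authors` cell in the format shown in the sample output.
# Only the first name is real; the other two are placeholders.
raw_authors = (
    "['id:300648343950, name:Tadeusz Kaczorowski'\n"
    " 'id:566936217113, name:Placeholder Author 2'\n"
    " 'id:214748948560, name:Placeholder Author 3']"
)
raw_topics = "[185592680  89423630]"

# A bare \d+ also picks up digits that occur inside names.
naive_matches = re.findall(r"\d+", raw_authors)

# Anchoring on the "id:" prefix captures only the ids.
author_ids = [int(i) for i in re.findall(r"id:(\d+)", raw_authors)]

# Topics contain nothing but ids, so a bare \d+ is enough there.
topic_ids = [int(t) for t in re.findall(r"\d+", raw_topics)]

print(len(naive_matches), author_ids, topic_ids)
```

In this sample the bare pattern returns five matches for three authors, which is why anchoring on `id:` is the safer choice for the `authors` column.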


In [8]:
# read topics data and get an overview
topics = pd.read_csv('../data/topics.csv')
topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714971 entries, 0 to 714970
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   topic_id  714971 non-null  int64 
 1   name      714970 non-null  object
dtypes: int64(1), object(1)
memory usage: 10.9+ MB


In [9]:
# topics data sample
topics.head(25)

Unnamed: 0,topic_id,name
0,42812,Partition (number theory)
1,70630,Perpendicular bisector construction
2,114263,Elliptical wing
3,182566,Organizational structure
4,202113,Cauchy number
5,205068,Face (geometry)
6,234837,Conceptual graph
7,294558,Newtonian fluid
8,301363,Eukaryotic Small Ribosomal Subunit
9,322993,Julian day


In [10]:
# retrieve the topic whose name is null
topics.query('name.isnull()')

Unnamed: 0,topic_id,name
634168,164917456,
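
One possible explanation for the missing name, not confirmed by anything shown here, is that the CSV holds a literal string such as `null` or `NA`, which `pd.read_csv` converts to NaN by default. The sketch below illustrates that behaviour on a two-row in-memory CSV; the `null` value in the second row is purely hypothetical, and only the topic ids and the first name come from the outputs above.

```python
import io

import pandas as pd

# Hypothetical CSV content: the real raw value for topic 164917456 is unknown.
csv_text = (
    "topic_id,name\n"
    "42812,Partition (number theory)\n"
    "164917456,null\n"
)

# Default parsing treats strings such as "null", "NA", and "NaN" as missing.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves those strings untouched.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["name"].isna().sum())
print(repr(raw["name"].iloc[1]))
```

Re-reading `topics.csv` with `keep_default_na=False` and inspecting the row for id 164917456 would show whether this applies to the actual file.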


In [11]:
# read incoming publications data and get an overview
in_publications = pd.read_csv('../data/incomming_publications.csv')
in_publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PublicationId     218 non-null    int64 
 1   authors           218 non-null    object
 2   topics            218 non-null    object
 3   publication_year  218 non-null    int64 
 4   Doi               218 non-null    object
dtypes: int64(2), object(3)
memory usage: 8.6+ KB


In [12]:
# incoming publications data sample
in_publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,551926,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[144133560 155202549],2021,10.1136/BMJMILITARY-2021-001983
1,551957,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/FAMPRA/CMAB101
2,8590488517,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[144133560 175605778],2021,10.5152/TURKARCHPEDIATR.2021.21060921
3,25770356408,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/EUROPACE/EUAB226
4,34360290764,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 71924100],2021,10.1093/JTM/TAAB125
5,34360290891,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 86803240],2021,10.1093/HMG/DDAB231
6,60130094586,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 86803240],2021,10.1093/HMG/DDAB231
7,68720028858,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1111/BCP.15048
8,68720028878,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1093/JNCI/DJAB156
9,68720028879,"['id:206159056594, name:Lukoye Atwoli'\n 'id:3...",[47768531 17744445],2021,10.1093/JPP/RGAB127


In [13]:
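# parse the authors field into a list of author ids; matching on the "id:" prefix
# keeps any digits that appear inside a name from being captured
# parse the topics field into a list of topic ids
in_publications.authors = in_publications.authors.apply(lambda x: re.findall(r'id:(\d+)', x))
in_publications.topics = in_publications.topics.apply(lambda x: re.findall(r'\d+', x))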
in_publications.head(25)

Unnamed: 0,PublicationId,authors,topics,publication_year,Doi
0,551926,"[206159056594, 386547616079, 532576531850, 429...","[144133560, 155202549]",2021,10.1136/BMJMILITARY-2021-001983
1,551957,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/FAMPRA/CMAB101
2,8590488517,"[206159056594, 386547616079, 532576531850, 429...","[144133560, 175605778]",2021,10.5152/TURKARCHPEDIATR.2021.21060921
3,25770356408,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/EUROPACE/EUAB226
4,34360290764,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 71924100]",2021,10.1093/JTM/TAAB125
5,34360290891,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 86803240]",2021,10.1093/HMG/DDAB231
6,60130094586,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 86803240]",2021,10.1093/HMG/DDAB231
7,68720028858,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1111/BCP.15048
8,68720028878,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1093/JNCI/DJAB156
9,68720028879,"[206159056594, 386547616079, 532576531850, 429...","[47768531, 17744445]",2021,10.1093/JPP/RGAB127
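
Rows 5 and 6 of the sample above carry the same DOI (10.1093/HMG/DDAB231) under different `PublicationId` values, so a duplicate check on `Doi`, mirroring the ones done for authors, may be worthwhile. This is a minimal sketch on four rows copied from that sample; the `sample` and `dup_doi` names are illustrative.

```python
import pandas as pd

# Rows 4 to 7 of the incoming publications sample (ids and DOIs only).
sample = pd.DataFrame(
    {
        "PublicationId": [34360290764, 34360290891, 60130094586, 68720028858],
        "Doi": [
            "10.1093/JTM/TAAB125",
            "10.1093/HMG/DDAB231",
            "10.1093/HMG/DDAB231",
            "10.1111/BCP.15048",
        ],
    }
)

# keep=False flags every row of a duplicated group, not only the repeats.
dup_doi = sample[sample["Doi"].duplicated(keep=False)]

print(dup_doi)
```

Applied to the full table, the same expression would be `in_publications[in_publications.Doi.duplicated(keep=False)]`.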
