In [1]:
import pandas as pd

In [2]:
metadata_df = pd.read_csv('metadata.tsv', delimiter='\t')
metadata_df.head(10)

Unnamed: 0,id,doi,space,title,authors,year,journal,citation_count
0,9065511,10.1523/jneurosci.17-07-02512.1997,MNI,Environmental knowledge is subserved by separa...,"Aguirre GK, D'Esposito M",1997,The Journal of neuroscience : the official jou...,180.0
1,9084599,10.1152/jn.1997.77.3.1313,MNI,Anatomy of motor learning. I. Frontal cortex a...,"Jueptner M, Stephan KM, Frith CD, Brooks DJ, F...",1997,Journal of neurophysiology,464.0
2,9114263,10.1152/jn.1997.77.4.2164,MNI,Multiple nonprimary motor areas in the human c...,"Fink GR, Frackowiak RS, Pietrzyk U, Passingham RE",1997,Journal of neurophysiology,367.0
3,9185551,10.1523/jneurosci.17-13-05136.1997,TAL,A role for the right anterior temporal lobe in...,"Small DM, Jones-Gotman M, Zatorre RJ, Petrides...",1997,The Journal of neuroscience : the official jou...,115.0
4,9256495,10.1073/pnas.94.17.9406,TAL,Pattern of neuronal activity associated with c...,"Sahraie A, Weiskrantz L, Barbur JL, Simmons A,...",1997,Proceedings of the National Academy of Science...,197.0
5,9391021,10.1016/s0168-0102(97)82145-1,MNI,Role of the supplementary motor area and the r...,"Sadato N, Yonekura Y, Waki A, Yamada H, Ishii Y",1997,The Journal of neuroscience : the official jou...,0.0
6,9405692,10.1073/pnas.94.26.14792,UNKNOWN,Role of left inferior prefrontal cortex in ret...,"Thompson-Schill SL, D'Esposito M, Aguirre GK, ...",1997,Proceedings of the National Academy of Science...,1508.0
7,9412517,10.1523/jneurosci.18-01-00411.1998,TAL,Masked presentations of emotional facial expre...,"Whalen PJ, Rauch SL, Etcoff NL, McInerney SC, ...",1998,The Journal of neuroscience : the official jou...,1551.0
8,9465007,10.1523/jneurosci.18-05-01827.1998,UNKNOWN,Transition of brain activation from frontal to...,"Sakai K, Hikosaka O, Miyauchi S, Takino R, Sas...",1998,The Journal of neuroscience : the official jou...,360.0
9,9491989,10.1016/S0896-6273(00)80456-0,TAL,Functional-anatomic correlates of object primi...,"Buckner RL, Goodman J, Burock M, Rotte M, Kout...",1998,Neuron,443.0


In [3]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14371 entries, 0 to 14370
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              14371 non-null  int64  
 1   doi             14371 non-null  object 
 2   space           14371 non-null  object 
 3   title           14371 non-null  object 
 4   authors         14371 non-null  object 
 5   year            14371 non-null  int64  
 6   journal         14371 non-null  object 
 7   citation_count  14366 non-null  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 898.3+ KB


In [4]:
na_counts = metadata_df.isna().sum()
na_counts = na_counts[na_counts > 0]

na_counts

citation_count    5
dtype: int64

In [5]:
metadata_df = metadata_df.dropna(subset=['citation_count'])

In [6]:
metadata_df['space'].unique()

array(['MNI', 'TAL', 'UNKNOWN'], dtype=object)

In [7]:
# Share of each coordinate space, as a percentage of all rows.
total = len(metadata_df)
for space in ['MNI', 'TAL', 'UNKNOWN']:
    pct = (len(metadata_df[metadata_df['space'] == space]) / total) * 100
    print(f'{space} Percentage: {pct}')

MNI Percentage: 70.99401364332452
TAL Percentage: 16.253654461923986
UNKNOWN Percentage: 12.752331894751498
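
The same shares can also be read off in a single call with `value_counts(normalize=True)`. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are illustrative assumptions, not data from `metadata.tsv`.

```python
import pandas as pd

# Toy stand-in for metadata_df; the values are invented for illustration.
toy_df = pd.DataFrame({'space': ['MNI', 'MNI', 'MNI', 'TAL', 'UNKNOWN']})

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
space_pct = toy_df['space'].value_counts(normalize=True) * 100
print(space_pct)
```

This avoids repeating the filter for every category and picks up any new category automatically.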


We treat UNKNOWN as NA because it is not specific and indicates missing information.  
We could fill the UNKNOWN values with MNI or TAL, remove the samples with UNKNOWN space, or remove the feature.  
Because a large portion of the data (about 12.75%) has UNKNOWN space, we decided not to remove those samples; we also did not want to modify the data.  
Therefore, we decided to remove the space feature.
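
For comparison, the three options could look roughly like this. This is only a sketch on a made-up frame: `toy_df` is hypothetical, and filling with the most frequent known space is one assumed fill strategy among several, not something the analysis above settled on.

```python
import pandas as pd

# Toy stand-in for metadata_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'space': ['MNI', 'TAL', 'UNKNOWN', 'MNI'],
})

# Option 1: fill UNKNOWN with the most frequent known space (modifies the data).
known = toy_df.loc[toy_df['space'] != 'UNKNOWN', 'space']
filled = toy_df.assign(space=toy_df['space'].replace('UNKNOWN', known.mode()[0]))

# Option 2: remove the rows with UNKNOWN space (loses samples).
rows_dropped = toy_df[toy_df['space'] != 'UNKNOWN']

# Option 3: remove the feature entirely (the option chosen in this notebook).
feature_dropped = toy_df.drop(columns=['space'])
```

Only the third option keeps every sample without inventing values, which matches the reasoning above.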

Observing the metadata, we decided on a few modifications to apply to the dataset:
1.  Drop the doi column: a DOI is a standardized unique identifier given to articles, papers, and books. It is needed for identification purposes but is irrelevant to our prediction model.  
2.  Drop the space column.  
3.  Drop the citation_count and journal columns: they already exist in abtracts_df, which will be joined to metadata_df.  
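
To illustrate why the duplicated columns are dropped before joining, here is a hedged sketch. The frames `meta_toy` and `abstracts_toy`, their contents, and the use of `id` as the join key are all assumptions made for illustration; the actual join is not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for metadata_df and the abstracts frame.
meta_toy = pd.DataFrame({
    'id': [1, 2],
    'title': ['A', 'B'],
    'journal': ['J1', 'J2'],
    'citation_count': [10.0, 5.0],
})
abstracts_toy = pd.DataFrame({
    'id': [1, 2],
    'abstract': ['text a', 'text b'],
    'journal': ['J1', 'J2'],
    'citation_count': [10.0, 5.0],
})

# Columns present in both frames, excluding the assumed join key.
overlap = meta_toy.columns.intersection(abstracts_toy.columns).difference(['id'])

# Dropping the overlap first avoids the journal_x / journal_y suffixes
# that merge would otherwise create for same-named columns.
merged = meta_toy.drop(columns=overlap).merge(abstracts_toy, on='id', how='inner')
print(merged.columns.tolist())
```

Without the drop, pandas would keep both copies of each shared column under suffixed names.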

In [8]:
metadata_df = metadata_df.drop(['doi', 'space', 'journal', 'citation_count'], axis=1)
metadata_df = metadata_df.rename(columns={'year': 'year_published'})

metadata_df.head(3)

Unnamed: 0,id,title,authors,year_published
0,9065511,Environmental knowledge is subserved by separa...,"Aguirre GK, D'Esposito M",1997
1,9084599,Anatomy of motor learning. I. Frontal cortex a...,"Jueptner M, Stephan KM, Frith CD, Brooks DJ, F...",1997
2,9114263,Multiple nonprimary motor areas in the human c...,"Fink GR, Frackowiak RS, Pietrzyk U, Passingham RE",1997


In [9]:
metadata_df.to_csv('metadata_df_eda.csv')