# Part 1. READ "match_trt" and "match_con"

In [1]:
import pandas as pd
import numpy as np

## READ "match_trt" and "match_con"
df_trt = pd.read_csv('./Test_data/src_data/match_trt_df.csv')
df_con = pd.read_csv('./Test_data/src_data/match_con_df.csv')
print( df_trt.shape, df_con.shape)

# examples
df_trt.head(2)

(6826, 26) (5132, 26)


Unnamed: 0,paper_url,arxiv_id,title,abstract,url_abs,url_pdf,proceeding,authors,tasks,date,...,doi,venue_raw,venue_id,venue_type,authors_name,authors_id,authors_org,fos_name,fos_w,ref
0,https://paperswithcode.com/paper/large-scale-p...,1706.03736,large-scale plant classification with deep neu...,This paper discusses the potential of applying...,http://arxiv.org/abs/1706.03736v1,http://arxiv.org/pdf/1706.03736v1.pdf,,['Ignacio Heredia'],[],2017-06-12,...,10.1145/3075564.3075590,Computing Frontiers,2626327000.0,C,['Ignacio Heredia'],[2622430314],"['Instituto de Fisica de Cantabria (CSIC-UC), ...","['Fork (system call)', 'Data mining', 'Compute...","[Decimal('0.4434'), Decimal('0.45259'), Decima...","[1522301498, 1556850077, 1606347560, 191666523..."
1,https://paperswithcode.com/paper/deep-boltzman...,1509.06535,deep boltzmann machines in estimation of distr...,Estimation of Distribution Algorithms (EDAs) r...,http://arxiv.org/abs/1509.06535v2,http://arxiv.org/pdf/1509.06535v2.pdf,,"['Malte Probst', 'Franz Rothlauf']",['Combinatorial Optimization'],2015-09-22,...,,Neural and Evolutionary Computing,2596710000.0,J,"['Malte Probst', 'Franz Rothlauf']","[2137666597, 2171731261]",[],"['Convergence (routing)', 'Estimation of distr...","[Decimal('0.48188'), Decimal('0.60781'), Decim...","[157468466, 189596042, 1536712125, 1597878669,..."


In [54]:
## Assign "group" variable for Treated/control index 
df_trt['group']= 1
df_con['group']= 0

## Stack both trt and control data as "df_all"
df_all = pd.concat([df_trt, df_con], axis = 0, ignore_index=True)
df_all.shape

(11958, 27)

In [55]:
## create "year" value
df_all['year'] = df_all['date'].str[:4].astype(int)
df_all.groupby(['group', 'year']).size()

group  year
0      1995       1
       1997       2
       1998       1
       1999       1
       2000       1
       2002       3
       2003       2
       2006       2
       2007       1
       2008       4
       2009       3
       2010       8
       2011      17
       2012      35
       2013      47
       2014     109
       2015     253
       2016     516
       2017    1077
       2018    1937
       2019    1111
       2020       1
1      2012       4
       2013      23
       2014      69
       2015     267
       2016     716
       2017    1410
       2018    2574
       2019    1741
       2020      22
dtype: int64

Based on the publication-year distributions above, I restrict the study to papers published in **2016-2018**, for two reasons:
- 1) from 2016 onward, the sample contains many more papers that mention code;
- 2) citation counts are sensitive to the publication year. For instance, papers published in 2019 may need more time to accumulate citations that reflect their academic influence.

In [59]:
## 2016-2018 period -- DT
DT = df_all.loc[ (df_all['year'] >= 2016) & (df_all['year'] <= 2018), :]
print( DT.shape )
DT.groupby(['group', 'year']).size()

(8230, 28)


group  year
0      2016     516
       2017    1077
       2018    1937
1      2016     716
       2017    1410
       2018    2574
dtype: int64

**To Dos:**
- describe variables ??

# Part 2. Feature Extraction_simple
**Dependent Variable (DV)**
- n_citation: number of citations

**MAIN IV**
- group: whether the paper mentions code (1 = treated, 0 = control)

**CREATE Basic features**
- 1) title_len: title length
- 2) abs_len: abstract length


- 3) no_author: number of authors
- 4) no_affi:   number of affiliations --> too many unknown values (skipped)
- 5) no_ref:    number of references given by the dblp data

In [60]:
print( DT.keys() )
DT.head(1)

Index(['paper_url', 'arxiv_id', 'title', 'abstract', 'url_abs', 'url_pdf',
       'proceeding', 'authors', 'tasks', 'date', 'id', 'n_citation',
       'doc_type', 'publisher', 'volume', 'issue', 'doi', 'venue_raw',
       'venue_id', 'venue_type', 'authors_name', 'authors_id', 'authors_org',
       'fos_name', 'fos_w', 'ref', 'group', 'year'],
      dtype='object')


Unnamed: 0,paper_url,arxiv_id,title,abstract,url_abs,url_pdf,proceeding,authors,tasks,date,...,venue_id,venue_type,authors_name,authors_id,authors_org,fos_name,fos_w,ref,group,year
0,https://paperswithcode.com/paper/large-scale-p...,1706.04,large-scale plant classification with deep neu...,This paper discusses the potential of applying...,http://arxiv.org/abs/1706.03736v1,http://arxiv.org/pdf/1706.03736v1.pdf,,['Ignacio Heredia'],[],2017-06-12,...,2626327000.0,C,['Ignacio Heredia'],[2622430314],"['Instituto de Fisica de Cantabria (CSIC-UC), ...","['Fork (system call)', 'Data mining', 'Compute...","[Decimal('0.4434'), Decimal('0.45259'), Decima...","[1522301498, 1556850077, 1606347560, 191666523...",1,2017


In [159]:
DT = DT.fillna('NA')

# 1) title_len: title length in words
DT.loc[:, 'title_len'] = DT['title'].apply(lambda x: len(x.strip().split()))

# 2) abs_len: abstract length in words ('NA' marks a missing abstract)
def get_abslen(x):
    if x == 'NA':
        return 'NA'
    return len(x.strip().split())
DT.loc[:, 'abs_len'] = DT['abstract'].apply(get_abslen)
print( '\n', 'The number of unknown abstracts: {}, \n \
      so I remove those samples with unknown abstracts info'.format(
           sum(DT['abs_len'] == 'NA'))  )
# .copy() avoids SettingWithCopyWarning on the column assignments below
DT = DT.loc[DT['abs_len'] != 'NA', :].copy()

# 3) no_author: number of authors
# NOTE: 'authors_name' is loaded from CSV as a string such as "['A', 'B']",
# so len(x) counts characters rather than authors.
DT.loc[:, 'no_author'] = DT['authors_name'].apply(len)

# 4) no_affi: number of affiliations
# round() now wraps only the percentage (a stray ", 2" was passed to format())
print( '\n', 'The percentage of unknown authors_affilations: {}%, \n \
      so I drop this feature'.format(
           round( sum( DT['authors_org'] == '[]' )/DT.shape[0] * 100 )))

# 5) no_ref: number of references given by the dblp data
# NOTE: 'ref' is also a stringified list, so the same caveat as in 3) applies.
DT.loc[:, 'no_ref'] = DT['ref'].apply(len)


 The number of unknown abstracts: 140, 
       so I remove those samples with unknown abstracts info

 The percentage of unknown authors_affilations: 77%, 
       so I drop this feature
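
The list-valued columns (`authors_name`, `ref`, `authors_org`) come back from `pd.read_csv` as strings such as `"['Ignacio Heredia']"`, so `len(x)` measures characters rather than list elements. Below is a minimal sketch of counting elements instead, assuming each cell holds a valid Python list literal. The helper name `count_items` is hypothetical and not part of this notebook.

```python
import ast

def count_items(cell):
    """Count elements in a cell holding the string form of a Python list.

    Returns None when the cell is missing ('NA') or cannot be parsed.
    """
    if not isinstance(cell, str) or cell == 'NA':
        return None
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    return len(parsed) if isinstance(parsed, (list, tuple)) else None

# Example cells copied from the two rows displayed in Part 1.
print(count_items("['Ignacio Heredia']"))                 # 1
print(count_items("['Malte Probst', 'Franz Rothlauf']"))  # 2
print(count_items("[]"))                                  # 0
# For comparison, the plain string length of the first cell:
print(len("['Ignacio Heredia']"))                         # 19
```

Applied as `DT['authors_name'].apply(count_items)` and `DT['ref'].apply(count_items)`, this would presumably yield counts that differ from the character-based values stored in `DT_base`.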


In [161]:
## get "DT_base" : DT with Basic features
DT_base = DT[['n_citation', 'group', 'title_len', 'abs_len', 'no_author',
             'no_ref']]
print(DT_base.shape)
DT_base.head()

(8090, 6)


Unnamed: 0,n_citation,group,title_len,abs_len,no_author,no_ref
0,1,1,7,95,19,168
2,0,1,11,104,95,264
5,9,1,7,201,33,405
6,6,1,9,226,52,143
7,2,1,7,94,16,155


In [162]:
### SAVE DT_base
DT_base.to_csv('./Test_data/src_data/DT_base.csv', index=False)


# Part 3. Feature Extraction_complex  [Not finished yet]

**CREATE Complex features**
- 6) a field-of-study indicator vector: one entry per unique "fos_name", with the matching "fos_w" weight as its value
- 7) unsupervised Topic Modeling using abstract data? -- LDA
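
Item 6 could be sketched as follows. This is only an illustration, assuming that `fos_name` and `fos_w` are aligned by position and that both are stored as strings in the format displayed in Part 1. The helpers `parse_weights` and `fos_vector` are hypothetical names, and the toy rows are made up rather than taken from the data.

```python
import ast
import re

def parse_weights(cell):
    """Extract floats from a string like "[Decimal('0.4434'), Decimal('0.45259')]"."""
    return [float(w) for w in re.findall(r"Decimal\('([^']+)'\)", cell)]

def fos_vector(names_cell, weights_cell, vocabulary):
    """Return one weight per vocabulary entry (0.0 where the paper lacks that field)."""
    names = ast.literal_eval(names_cell)
    weights = parse_weights(weights_cell)
    lookup = dict(zip(names, weights))  # assumes names and weights align by position
    return [lookup.get(name, 0.0) for name in vocabulary]

# Toy rows for illustration only; the field names and weights are made up.
rows = [
    ("['Data mining', 'Computer science']", "[Decimal('0.45'), Decimal('0.40')]"),
    ("['Computer science']", "[Decimal('0.50')]"),
]

# The vocabulary is the sorted set of all field names seen in the rows.
vocabulary = sorted({n for names, _ in rows for n in ast.literal_eval(names)})
vectors = [fos_vector(n, w, vocabulary) for n, w in rows]

print(vocabulary)  # ['Computer science', 'Data mining']
print(vectors)     # [[0.4, 0.45], [0.5, 0.0]]
```

On the full data the vocabulary may be large, so a sparse representation or a cutoff on rare fields might be preferable to a dense list per paper.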


In [None]:
### SAVE DT_cplx
DT_cplx.to_csv('./Test_data/src_data/DT_cplx.csv', index=False)
