# Lognormal Generalized Linear Model

We use a *Lognormal Generalized Linear Model* (LogGLM) to estimate the effects of **National Multi-affiliated authors (NM)** and **International Multi-affiliated authors (IM)** on citation counts.

As control variables, we also include several factors that have been reported to be associated with citation counts.
In *R-style* formula notation, the regression equation is
```R
TC ~ NM_mark + IM_mark + N_ins + N_c + N_refs + N_a
```
where

- TC: total citation count of the paper (the dependent variable)
- NM_mark: 1 if the paper has at least one NM author, otherwise 0
- IM_mark: 1 if the paper has at least one IM author, otherwise 0
- N_ins: number of institutions
- N_c: number of countries
- N_refs: number of references
- N_a: number of authors*

*Only papers with no more than 10 authors are considered.
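
Written out, and assuming the Gaussian family with a log link that the fitting code in this notebook passes to `smf.glm`, the formula above corresponds to the following model for the expected citation count. The coefficient symbols $\beta_0, \dots, \beta_6$ are notation introduced here for illustration only.

$$
\log \mathbb{E}[\mathrm{TC}] = \beta_0 + \beta_1\,\mathrm{NM\_mark} + \beta_2\,\mathrm{IM\_mark} + \beta_3\,\mathrm{N\_ins} + \beta_4\,\mathrm{N\_c} + \beta_5\,\mathrm{N\_refs} + \beta_6\,\mathrm{N\_a}
$$

Equivalently, $\mathbb{E}[\mathrm{TC}]$ is the exponential of the linear predictor on the right-hand side.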

In [71]:
# folders
project = 'MultipleAffiliations'
data_dir = f'D:/Data/{project}/data/'
result_dir = f'D:/Data/{project}/result/'
regression_result_dir = f"{result_dir}/GLM_log/"

In [72]:
# import packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import glob
import os
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

In [73]:
# pmark(c, p): append significance stars to the coefficient string c
# based on p-value p (* p<0.05, ** p<0.01, *** p<0.001)
def pmark(c,p):
    if 0.01<=p<0.05:
        mark = '*'
    elif 0.001<=p<0.01:
        mark = '**'
    elif p<0.001:
        mark = '***'
    else:
        mark = ''
    return f'{c}{mark}'
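
As a quick check of the star thresholds, the sketch below repeats `pmark` so that it runs on its own and applies it to a few illustrative coefficient strings and p-values (the inputs are made up for demonstration).

```python
def pmark(c, p):
    """Append significance stars to the formatted coefficient c based on p-value p."""
    if 0.01 <= p < 0.05:
        mark = '*'
    elif 0.001 <= p < 0.01:
        mark = '**'
    elif p < 0.001:
        mark = '***'
    else:
        mark = ''
    return f'{c}{mark}'

# Illustrative inputs: (formatted coefficient, p-value)
examples = [
    pmark('0.098', 0.00001),  # p < 0.001          -> three stars
    pmark('1.2', 0.005),      # 0.001 <= p < 0.01  -> two stars
    pmark('1.2', 0.03),       # 0.01 <= p < 0.05   -> one star
    pmark('-0.023', 0.4048),  # p >= 0.05          -> no star
]
print(examples)  # ['0.098***', '1.2**', '1.2*', '-0.023']
```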

In [74]:
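# fitting function: fit_model(data, cate, IVs, reg_file)
def fit_model(data, cate, IVs, reg_file):
    # `data` must already be restricted to one subject, e.g. df[df['Subject']==cate]
    # Gaussian-family GLM with a log link; the link is passed as an instance
    mod = smf.glm(f"TC ~ {' + '.join(IVs)}", data,
                  family=sm.families.Gaussian(sm.families.links.Log()))
    res = mod.fit()
    cate = cate.replace('/','&')
    # save the full regression summary to a text file
    with open(reg_file,'w') as fw:
        print(res.summary2(),file=fw)
    # intercept on the log scale; predictor effects as percent change, (exp(coef)-1)*100
    intercept = pmark(f"{res.params['Intercept']:.3f}", res.pvalues['Intercept'])
    effects = [pmark(f"{(np.exp(res.params[x])-1)*100:.1f}", res.pvalues[x]) for x in IVs]
    r2 = f"{1-res.deviance/res.null_deviance:.2f}"
    return [cate, intercept] + effects + [r2, res.aic, res.bic]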
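
`fit_model` reports each predictor's effect as `(exp(coef) - 1) * 100`, that is, the percent change in the expected citation count for a one-unit increase in that predictor under the log link. The short sketch below applies this conversion to the `IM_mark` and `NM_mark` coefficients printed in the immunology example in this notebook. The helper name `pct_change` is hypothetical and used only for illustration.

```python
import math

def pct_change(coef):
    """Percent change in the expected outcome per one-unit increase in a predictor."""
    return (math.exp(coef) - 1) * 100

# Coefficients taken from the immunology summary output in this notebook
im_effect = pct_change(0.0978)   # IM_mark
nm_effect = pct_change(-0.0227)  # NM_mark

print(f"IM_mark: {im_effect:.1f}%  NM_mark: {nm_effect:.1f}%")
# IM_mark: 10.3%  NM_mark: -2.2%
```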

## Example

We take immunology papers as an example.

In [75]:
# load data
df = pd.read_csv(f"{data_dir}/IIC_reg_19subject.csv")
df.columns = ['UT', 'TC', 'NM_mark', 'IM_mark', 'S_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a', 'Subject']

In [76]:
# add one to the total citation count
df['TC'] = df['TC'] +1

In [77]:
for iv in ['NM','IM','S']:
    df[f'{iv}_mark'] = df[f'{iv}_mark'].apply(lambda x:1 if x=='Y' else 0)

In [78]:
cate = 'IMM'
data = df[df['Subject']==cate]
n = data.shape[0]
data = data[data['N_a']<=10] # we only consider papers with <= 10 authors
m = data.shape[0]
n,m

(18586, 13653)

### VIF test

The Variance Inflation Factor (VIF) is used to check for multicollinearity among the independent variables.

In [79]:
IVs = ['NM_mark','IM_mark','N_refs','N_ins','N_c','N_a']
X = add_constant(data[IVs])
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif =[cate]+vif
df_vif = pd.DataFrame([vif], columns=['Category']+list(X.columns))
df_vif

| | Category | const | NM_mark | IM_mark | N_refs | N_ins | N_c | N_a |
|---|---|---|---|---|---|---|---|---|
| 0 | IMM | 20.164208 | 1.413633 | 1.023131 | 1.063238 | 2.174727 | 1.532999 | 1.212383 |


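All predictor VIFs are below 2.2, well under the commonly used thresholds of 5 or 10, so the independent variables are not highly correlated. The value for `const` belongs to the intercept and is usually disregarded.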
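
To make the VIF values easier to interpret, here is a minimal sketch of how a VIF can be computed by hand: regress each predictor on the remaining ones (plus an intercept) and take $1/(1-R_j^2)$. The function name `vif_manual` and the synthetic data are assumptions for illustration. For the non-constant columns this should agree with `variance_inflation_factor` when a constant is included, as in the cell above.

```python
import numpy as np

def vif_manual(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j
    on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()             # R^2 of the auxiliary regression
        vifs.append(1 / (1 - r2))
    return vifs

# Synthetic predictors (made up): b is independent, c is nearly collinear with a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_manual(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

In this toy setup the VIFs of `a` and `c` come out large while that of `b` stays close to 1, which is the pattern the test is meant to detect.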

### Fitting

In [80]:
IVs = ['NM_mark','IM_mark','N_refs','N_ins','N_c','N_a']

In [81]:
# pass the log link as an instance to the Gaussian family
mod = smf.glm(f"TC ~ {' + '.join(IVs)}", data, family=sm.families.Gaussian(sm.families.links.Log()))
res = mod.fit()
print(res.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                     TC   No. Observations:                13653
Model:                            GLM   Df Residuals:                    13646
Model Family:                Gaussian   Df Model:                            6
Link Function:                    log   Scale:                          180.53
Method:                          IRLS   Log-Likelihood:                -54839.
Date:                Tue, 14 Jan 2020   Deviance:                   2.4636e+06
Time:                        21:13:24   Pearson chi2:                 2.46e+06
No. Iterations:                    17                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0498      0.041     49.606      0.0



In [82]:
print(res.summary2())

                Results: Generalized linear model
Model:              GLM              AIC:            109692.3966 
Link Function:      log              BIC:            2333629.5338
Dependent Variable: TC               Log-Likelihood: -54839.     
Date:               2020-01-14 21:13 LL-Null:        -55525.     
No. Observations:   13653            Deviance:       2.4636e+06  
Df Model:           6                Pearson chi2:   2.46e+06    
Df Residuals:       13646            Scale:          180.53      
Method:             IRLS                                         
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
Intercept      2.0498    0.0413  49.6056  0.0000   1.9688   2.1308
NM_mark       -0.0227    0.0272  -0.8330  0.4048  -0.0760   0.0307
IM_mark        0.0978    0.0221   4.4189  0.0000   0.0544   0.1412
N_refs         0.005

**R-squared** is calculated from the deviances as $R^2 = 1 - \frac{\text{residual deviance}}{\text{null deviance}}$.


In [83]:
f"R-squared:{1-res.deviance/res.null_deviance:.2f}"

'R-squared:0.09'
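
The deviance-based R-squared can be illustrated without statsmodels. For a Gaussian-family model the deviance is the residual sum of squares, and the null deviance is the sum of squares around the sample mean (the fit of an intercept-only model). The sketch below uses made-up observed and fitted values, and the helper name `deviance_r2` is hypothetical.

```python
def deviance_r2(y, mu):
    """1 - residual deviance / null deviance for a Gaussian-family model."""
    y_bar = sum(y) / len(y)
    residual_deviance = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    null_deviance = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - residual_deviance / null_deviance

# Toy observed values and fitted means (not from the citation data)
y = [1.0, 2.0, 4.0, 9.0]
mu = [1.5, 2.5, 4.5, 7.5]

r2 = deviance_r2(y, mu)
print(f"R-squared: {r2:.2f}")  # R-squared: 0.92
```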

AIC and BIC are reported as goodness-of-fit measures:

In [84]:
f"AIC: {res.aic:.2f}",f"BIC: {res.bic:.2f}"

('AIC: 109692.40', 'BIC: 2333629.53')
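
The BIC above is far larger than the AIC, which suggests the two are not on the same scale. A plausible reading, stated here as an assumption about the statsmodels version used, is that the AIC is likelihood-based while the GLM BIC is deviance-based, `deviance - df_resid * log(n)`. The sketch below checks this against the rounded values printed in the summary output, so agreement is only expected up to rounding.

```python
import math

# Rounded values read from the summary output in this notebook
log_likelihood = -54839.0
deviance = 2.4636e6
n_obs = 13653
df_model = 6
df_resid = 13646

# Likelihood-based AIC: -2 * logL + 2 * (number of mean parameters incl. intercept)
aic = -2 * log_likelihood + 2 * (df_model + 1)

# Deviance-based BIC (assumed definition): deviance - df_resid * log(n)
bic_deviance = deviance - df_resid * math.log(n_obs)

print(f"AIC ~ {aic:.0f} (reported 109692.40)")
print(f"deviance-based BIC ~ {bic_deviance:.0f} (reported 2333629.53)")
```

Both values land within rounding error of the reported figures, which is consistent with the assumed definitions. The BIC should therefore be compared only with other deviance-based BICs, not with the AIC.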

## Across Disciplines

### Institutional Collaboration

In [85]:
# load data
df = pd.read_csv(f"{data_dir}/Factor_IC.csv")
df.columns = ['UT', 'TC', 'NM_mark', 'IM_mark', 'S_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a', 'Subject']
df=df[df['N_a']<=10]

In [86]:
df['TC'] = df['TC'] +1

In [87]:
for iv in ['NM','IM','S']:
    df[f'{iv}_mark'] = df[f'{iv}_mark'].apply(lambda x:1 if x=='Y' else 0)

#### VIF test

In [88]:
# vif test
vifs = []
for cate,data in df.groupby('Subject'):    
    IVs = ['NM_mark','IM_mark','N_refs','N_ins','N_c','N_a']
    X = add_constant(data[IVs])
    vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vifs.append([cate]+vif)

df_vif = pd.DataFrame(vifs, columns=['Category']+list(X.columns))
df_vif

| | Category | const | NM_mark | IM_mark | N_refs | N_ins | N_c | N_a |
|---|---|---|---|---|---|---|---|---|
| 0 | AGR | 14.801898 | 1.111909 | 1.313493 | 1.025844 | 1.446552 | 1.565893 | 1.167522 |
| 1 | BIO | 15.305312 | 1.219432 | 1.380594 | 1.024301 | 1.56012 | 1.709745 | 1.134153 |
| 2 | CHE | 16.103251 | 1.184236 | 1.360849 | 1.011681 | 1.538391 | 1.743881 | 1.124487 |
| 3 | CLI | 13.305229 | 1.123049 | 1.286456 | 1.017947 | 1.434947 | 1.561606 | 1.116686 |
| 4 | COM | 14.254099 | 1.190514 | 1.211634 | 1.012692 | 1.713483 | 1.601197 | 1.207819 |
| 5 | ENG | 15.503698 | 1.140109 | 1.247712 | 1.020795 | 1.486757 | 1.540843 | 1.140294 |
| 6 | ENV | 11.609784 | 1.176832 | 1.374624 | 1.031189 | 1.873638 | 1.850714 | 1.270242 |
| 7 | GEO | 10.610943 | 1.194339 | 1.268281 | 1.044108 | 2.119532 | 1.811895 | 1.403498 |
| 8 | IMM | 15.026661 | 1.201426 | 1.313674 | 1.052732 | 1.606743 | 1.681364 | 1.172332 |
| 9 | MATE | 15.687966 | 1.162498 | 1.379841 | 1.029935 | 1.501608 | 1.730371 | 1.132541 |


In [89]:
excel = pd.ExcelWriter(f'{regression_result_dir}/DIS_VIF.xlsx')
df_vif.to_excel(excel,index=False)
excel.close()

#### Fitting

In [90]:
# fitting
IVs = ['NM_mark','IM_mark','N_refs','N_ins','N_c','N_a']
#cates = list(df['Subject'].unique())
models = [fit_model(data,cate,IVs,f"{regression_result_dir}/Discipline/{cate}.txt") for cate,data in df.groupby('Subject')]
cols = ['Subject','Intercept']+IVs+['R-Squared']+['AIC','BIC']
models = pd.DataFrame(models, columns=cols)



In [91]:
models.set_index('Subject',inplace=True)
idx = ['SPA','NEU','PSY','IMM','CLI','PHA','PHY','MOL','BIO','MIC','PLA','ENV','GEO','CHE','AGR','MATE','COM','ENG','MATH']
models.loc[idx]

| Subject | Intercept | NM_mark | IM_mark | N_refs | N_ins | N_c | N_a | R-Squared | AIC | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| SPA | 2.024*** | -11.0*** | 4.2* | 0.4*** | 4.3*** | -6.0*** | 2.0*** | 0.08 | 197027.3 | 2987198.0 |
| NEU | 1.601*** | 5.7*** | 2.8* | 0.4*** | -0.5 | 13.4*** | 1.3*** | 0.07 | 569330.3 | 5736446.0 |
| PSY | 1.152*** | 10.4*** | 3.9 | 0.7*** | -1.5* | 14.3*** | 3.8*** | 0.09 | 176160.1 | 996543.1 |
| IMM | 1.862*** | 1.6 | 14.0*** | 0.6*** | 3.0*** | 7.8*** | -3.3*** | 0.09 | 272491.8 | 4178401.0 |
| CLI | 1.247*** | 10.7*** | -4.1*** | 0.3*** | -1.1** | 24.2*** | 3.3*** | 0.02 | 3192057.0 | 100196100.0 |
| PHA | 1.799*** | 5.2*** | -5.4*** | 0.2*** | 1.0 | 11.8*** | -2.3*** | 0.05 | 442262.7 | 3957333.0 |
| PHY | 1.345*** | 14.6*** | 11.1*** | 0.2*** | -0.6 | 22.2*** | 3.0*** | 0.04 | 1250129.0 | 18528020.0 |
| MOL | 1.961*** | 13.1*** | 8.9** | 0.5*** | -2.4* | 8.9*** | -0.2 | 0.02 | 507326.6 | 28842510.0 |
| BIO | 1.758*** | 11.4*** | 3.6* | 0.1*** | -2.2*** | 17.3*** | 0.5 | 0.01 | 838030.8 | 16031780.0 |
| MIC | 1.778*** | -1.8 | 7.1*** | 0.3*** | 0.7 | 10.4*** | -0.9** | 0.07 | 215649.4 | 1820651.0 |


In [92]:
excel = pd.ExcelWriter(f'{regression_result_dir}/DIS.xlsx')
models.loc[idx].to_excel(excel,index=True)
excel.close()

## Across Countries and Disciplines

### Institutional Collaboration 

In [93]:
files = glob.glob(f"{data_dir}/country/*.csv")

#### Number of records

In [94]:
frames = []
for file in files:
    df = pd.read_csv(file)
    df.columns = ['UT', 'TC', 'DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark','ForeignIM_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a', 'Subject']
    df['TC'] = df['TC'] + 1
    country = os.path.splitext(os.path.basename(file))[0]  # file name without extension, e.g. 'BR'
    df['country'] = country
    frames.append(df)


In [95]:
df = pd.concat(frames)

In [96]:
df = df[df['N_a']<=10]
df = pd.pivot_table(df,values='UT',index='Subject',columns='country',aggfunc='count')
df.head(2)

| Subject | BR | CA | CN | DE | FR | IN | IT | JP | RU | UK | US | ZA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AGR | 8215 | 2654 | 9921 | 3171 | 3367 | 3145 | 3441 | 2971 | 331 | 2645 | 11985 | 698 |
| BIO | 3494 | 4606 | 17774 | 9215 | 6034 | 4522 | 5471 | 8180 | 1835 | 8308 | 30880 | 530 |


In [97]:
excel = pd.ExcelWriter(f"{regression_result_dir}/number_obs_country_discipline.xlsx")
df.to_excel(excel)
excel.close()

#### VIF test

In [98]:
# vif test
vifs = []
IVs = ['DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark','ForeignIM_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a']
for file in files:
    df = pd.read_csv(file)
    df.columns = ['UT', 'TC', 'DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark','ForeignIM_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a', 'Subject']

    df = df[df['N_a']<=10] # only consider papers with <= 10 authors
    country = os.path.splitext(os.path.basename(file))[0]  # file name without extension, e.g. 'BR'
    for iv in ['DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark', 'ForeignIM_mark']:
        df[f'{iv}'] = df[f'{iv}'].apply(lambda x:1 if x=='Y' else 0)
    for cate,data in df.groupby('Subject'):
        X = add_constant(data[IVs])
        vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        vifs.append([country,cate]+vif)

df_vif = pd.DataFrame(vifs, columns=['Country','Category']+list(X.columns))
df_vif

| | Country | Category | const | DomesticNM_mark | DomesticIM_mark | ForeignNM_mark | ForeignIM_mark | N_refs | N_ins | N_c | N_a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BR | AGR | 18.530956 | 1.060037 | 1.211888 | 1.108075 | 1.241542 | 1.159229 | 1.296338 | 1.713922 | 1.119690 |
| 1 | BR | BIO | 16.399821 | 1.188910 | 1.245379 | 1.258686 | 1.284104 | 1.028372 | 1.797475 | 2.064364 | 1.106289 |
| 2 | BR | CHE | 17.467469 | 1.177435 | 1.226529 | 1.179001 | 1.305664 | 1.021160 | 1.732024 | 1.883072 | 1.202658 |
| 3 | BR | CLI | 15.308445 | 1.222377 | 1.176643 | 1.255745 | 1.209599 | 1.036266 | 1.812964 | 1.996822 | 1.109240 |
| 4 | BR | COM | 14.365915 | 1.215298 | 1.093997 | 1.133444 | 1.259180 | 1.025046 | 1.997995 | 1.885732 | 1.269868 |
| 5 | BR | ENG | 15.430703 | 1.129447 | 1.124476 | 1.160198 | 1.171233 | 1.050383 | 1.805027 | 1.658848 | 1.223371 |
| 6 | BR | ENV | 12.444946 | 1.181756 | 1.238318 | 1.194680 | 1.325276 | 1.125296 | 1.958191 | 2.198936 | 1.242411 |
| 7 | BR | GEO | 11.336970 | 1.149493 | 1.088253 | 1.246704 | 1.288549 | 1.139872 | 2.528635 | 2.072367 | 1.507647 |
| 8 | BR | IMM | 17.057812 | 1.276073 | 1.184368 | 1.264594 | 1.282007 | 1.044671 | 1.977626 | 2.192313 | 1.109057 |
| 9 | BR | MATE | 15.896360 | 1.158964 | 1.236561 | 1.235553 | 1.239276 | 1.062306 | 1.769546 | 1.855672 | 1.255964 |


In [99]:
excel = pd.ExcelWriter(f'{regression_result_dir}/country_discipline_VIF.xlsx')
df_vif.to_excel(excel,index=False)
excel.close()

In [100]:
df_vif.max()

Country                 ZA
Category               SPA
const              23.9502
DomesticNM_mark     1.7696
DomesticIM_mark    1.80785
ForeignNM_mark     2.27201
ForeignIM_mark     1.69554
N_refs             1.37947
N_ins              4.75107
N_c                3.92743
N_a                2.62247
dtype: object

#### Fitting

For each country and subject, we fit the regression model.

In [101]:
# fitting
IVs = ['DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark', 'ForeignIM_mark',
       'N_refs', 'N_ins', 'N_c', 'N_a']

for file in files:
    df = pd.read_csv(file)
    df.columns = ['UT', 'TC', 'DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark','ForeignIM_mark', 'N_refs', 'N_ins',
       'N_c', 'N_a', 'Subject']
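    df = df[df['N_a']<=10] # only consider papers with <= 10 authors
    # note: TC is used as loaded here (no +1 shift, unlike the discipline-level cells above)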
    country = os.path.splitext(os.path.basename(file))[0]  # file name without extension, e.g. 'BR'
    os.makedirs(f"{regression_result_dir}/IC/{country}/", exist_ok=True)
    print(country,end='\r')
    for iv in ['DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark', 'ForeignIM_mark']:
        df[f'{iv}'] = df[f'{iv}'].apply(lambda x:1 if x=='Y' else 0)
#cates = list(df['Subject'].unique())
    models = [fit_model(data,cate,IVs,f"{regression_result_dir}/IC/{country}/{cate}.txt") for cate,data in df.groupby('Subject')]
    cols = ['Subject','Intercept']+IVs+['R-Squared']+['AIC','BIC']
    models = pd.DataFrame(models, columns=cols)

    excel = pd.ExcelWriter(f'{regression_result_dir}/countries_summary/IC_{country}.xlsx')
    models.to_excel(excel,index=False)
    excel.close()

BR



ZA

#### Merge tables

In [102]:
files = glob.glob(f"{regression_result_dir}/countries_summary/IC_*.xlsx")

In [103]:
files

['D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_BR.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_CA.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_CN.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_DE.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_FR.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_IN.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_IT.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_JP.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_RU.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_UK.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_US.xlsx',
 'D:/Data/MultipleAffiliations/result//GLM_log//countries_summary\\IC_ZA.xlsx']

In [104]:
IVs = ['DomesticNM_mark', 'DomesticIM_mark', 'ForeignNM_mark', 'ForeignIM_mark',
       'N_refs', 'N_ins', 'N_c', 'N_a']

In [105]:
df = pd.read_excel(files[0])
df.head(2)

| | Subject | Intercept | DomesticNM_mark | DomesticIM_mark | ForeignNM_mark | ForeignIM_mark | N_refs | N_ins | N_c | N_a | R-Squared | AIC | BIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AGR | 0.225*** | 51.8*** | 52.3*** | 39.3*** | 4.2 | 1.3*** | -17.6*** | 36.9*** | 6.5*** | 0.16 | 44541.367793 | 34652.71325 |
| 1 | BIO | 0.985*** | 8.3 | -2.5 | 12.1 | -16.2* | 0.6*** | -12.6*** | 52.7*** | -0.4 | 0.18 | 21945.147421 | 80291.855646 |


In [106]:
from collections import defaultdict
frames = defaultdict(list)

for file in files:
    df = pd.read_excel(file)
    for iv in IVs:
        frames[iv].append(df[['Subject',iv]].set_index('Subject'))

In [107]:
countries = [file.split('_')[-1].split('.')[0] for file in files]

In [108]:
for iv in IVs:
    df = pd.concat(frames[iv], axis=1)
    #print(df.head(2))
    df.columns = countries
    excel = pd.ExcelWriter(f"{regression_result_dir}/IC/IC_{iv}.xlsx")
    idx = ['SPA','NEU','PSY','IMM','CLI','PHA','PHY','MOL','BIO','MIC','PLA','ENV','GEO','CHE','AGR','MATE','COM','ENG','MATH']
    cols = ['CA','DE','FR','UK','IT','JP','US','BR','CN','IN','RU','ZA']
    df.loc[idx,cols].to_excel(excel)
    excel.close()
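
The merge step reshapes the per-country summaries (one table per country, one column per variable) into per-variable tables (one table per variable, one column per country). The sketch below reproduces that reshaping on two tiny in-memory tables instead of Excel files. The `BR` entries echo the first rows shown above, the `CA` entries are made up, and the names `summaries` and `merged` are hypothetical.

```python
import pandas as pd

# Per-country summary tables: BR values echo the notebook output, CA values are invented
summaries = {
    'BR': pd.DataFrame({'Subject': ['AGR', 'BIO'],
                        'N_refs': ['1.3***', '0.6***'],
                        'N_a': ['6.5***', '-0.4']}),
    'CA': pd.DataFrame({'Subject': ['AGR', 'BIO'],
                        'N_refs': ['0.9***', '0.4**'],
                        'N_a': ['2.1*', '1.0']}),
}
ivs = ['N_refs', 'N_a']

merged = {}
for iv in ivs:
    # one single-column frame per country, indexed by Subject
    cols = [df[['Subject', iv]].set_index('Subject') for df in summaries.values()]
    table = pd.concat(cols, axis=1)        # align on Subject, countries side by side
    table.columns = list(summaries.keys())  # rename columns to country codes
    merged[iv] = table

print(merged['N_refs'])
```

Each entry of `merged` then holds one row per subject and one column per country, which is the layout written to the `IC_{iv}.xlsx` files.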