# Analyze graphs
Logit regression on the graph obtained in `get_graphs.ipynb`, to find relations between the studied metrics.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
DATA_PATH = '/dlabdata1/turkish_wiki/processed_data/'

In [11]:
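# Read the DataFrame (DATA_PATH already ends with '/')
pageset_edits = pd.read_csv(f'{DATA_PATH}pageset_edits_pct.csv')
# The logit model needs a numeric 0/1 response
pageset_edits['migrated'] = pageset_edits['migrated'].astype('float')
# Keep only users with at least one neighbor
pageset_edits = pageset_edits[pageset_edits['number_of_neighbors'] > 0]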

In [14]:
pageset_edits.head()

| Unnamed: 0 | event_user_id | page_id | number_of_neighbors | centrality | migrated | migration_percentage | edits_per_day |
|---|---|---|---|---|---|---|---|
| 0 | 25 | {815064, 222351} | 5 | 0.000458 | 0.0 | 1.0 | 0.002018 |
| 1 | 39 | {9312, 1067105, 1571688, 644809, 414218, 18022... | 178 | 0.006456 | 0.0 | 0.44382 | 0.014127 |
| 2 | 47 | {1664134, 2124426, 2124428, 2124437, 61624} | 12 | 0.000638 | 0.0 | 0.583333 | 0.007064 |
| 3 | 137 | {647363, 473637, 1631751, 16958, 8148, 1908, 9... | 98 | 0.004934 | 0.0 | 0.520408 | 0.010091 |
| 4 | 146 | {518121, 554324, 919417} | 11 | 0.000868 | 1.0 | 0.727273 | 0.003027 |


In [39]:
mod = smf.logit(formula='migrated ~ migration_percentage + edits_per_day + number_of_neighbors + centrality', data=pageset_edits)
res = mod.fit()
print(res.summary())

Optimization terminated successfully.
         Current function value: 0.216870
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:               migrated   No. Observations:                37012
Model:                          Logit   Df Residuals:                    37007
Method:                           MLE   Df Model:                            4
Date:                Mon, 19 Apr 2021   Pseudo R-squ.:                 0.07520
Time:                        15:35:18   Log-Likelihood:                -8026.8
converged:                       True   LL-Null:                       -8679.5
Covariance Type:            nonrobust   LLR p-value:                2.256e-281
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -4.4790      0.099    -45.385      0.000      -4.672      -4.286

All variables are significant, with very small p-values. All of them influence the migration of a user positively, except for the number of neighbours.
The results can be interpreted as:

* A user is likelier to migrate if their neighbours on the preblock graph were active during the block period.

* A user is likelier to migrate the more they edited in the preblock period.

* A user is likelier to migrate if they were central in the preblock graph.
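
The signs above only give the direction of each effect. Exponentiating a logit coefficient gives the multiplicative change in the odds of the outcome per unit increase of that variable, which is often easier to read. A minimal sketch on synthetic stand-in data (the `toy` frame and the `odds_ratios` helper are illustrative assumptions, not part of the original analysis):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def odds_ratios(res):
    """Odds ratios and 95% confidence bounds for a fitted logit result."""
    ci = res.conf_int()  # columns 0 and 1 hold the lower and upper bounds
    return pd.DataFrame({
        'odds_ratio': np.exp(res.params),
        'ci_lower': np.exp(ci[0]),
        'ci_upper': np.exp(ci[1]),
    })


# Synthetic data (NOT the Wikipedia data): y depends positively on x.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=2000)})
prob = 1 / (1 + np.exp(-(-1.0 + 1.0 * toy['x'])))
toy['y'] = (rng.random(2000) < prob).astype(float)

toy_res = smf.logit('y ~ x', data=toy).fit(disp=0)
toy_or = odds_ratios(toy_res)
print(toy_or)
```

Calling `odds_ratios(res)` on the fit above would produce the same kind of table for the real coefficients. An odds ratio above 1 corresponds to a positive coefficient, below 1 to a negative one.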

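The negative coefficient for the number of neighbours is counterintuitive. A plausible cause is collinearity: a user's centrality and number of neighbours measure closely related things, so the model may not separate their effects cleanly. We therefore remove the centrality from the regression and fit again.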
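
One way to check whether two predictors overlap enough to distort each other's coefficients is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix of the predictors. The sketch below uses synthetic data in which `centrality` is built to track `number_of_neighbors`; the `vif_table` helper and the `toy` frame are hypothetical, and the numbers say nothing about the real dataset:

```python
import numpy as np
import pandas as pd


def vif_table(df, cols):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = df[cols].corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=cols, name='vif')


# Synthetic data (assumption): centrality is roughly proportional to degree.
rng = np.random.default_rng(0)
n = 1000
neighbors = rng.poisson(20, size=n) + 1
toy = pd.DataFrame({
    'number_of_neighbors': neighbors,
    'centrality': neighbors * 4e-5 + rng.normal(scale=1e-4, size=n),
    'edits_per_day': rng.exponential(0.01, size=n),  # independent of the others
})

cols = ['number_of_neighbors', 'centrality', 'edits_per_day']
print(toy[cols].corr().round(2))
vif = vif_table(toy, cols)
print(vif.round(2))
```

On the real data the same call would be `vif_table(pageset_edits, ['migration_percentage', 'edits_per_day', 'number_of_neighbors', 'centrality'])`. A VIF near 1 indicates little overlap with the other predictors; the larger the value, the less reliable the sign and size of that coefficient.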
    


In [41]:
mod = smf.logit(formula='migrated ~ migration_percentage + edits_per_day + number_of_neighbors',  data=pageset_edits)
res = mod.fit()
print(res.summary())

Optimization terminated successfully.
         Current function value: 0.224985
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:               migrated   No. Observations:                37012
Model:                          Logit   Df Residuals:                    37008
Method:                           MLE   Df Model:                            3
Date:                Mon, 19 Apr 2021   Pseudo R-squ.:                 0.04060
Time:                        15:36:04   Log-Likelihood:                -8327.1
converged:                       True   LL-Null:                       -8679.5
Covariance Type:            nonrobust   LLR p-value:                2.006e-152
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -2.9709      0.062    -48.011      0.000      -3.092      -2.850

Now the number of neighbours is positively associated with the migration of a user.
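
The two summaries also allow a direct comparison of fit. The sketch below recomputes McFadden's pseudo R-squared from the reported log-likelihoods and runs a likelihood-ratio test between the two nested models. Only the values printed in the summaries are used; the variable names are illustrative.

```python
import math

# Values copied from the two regression summaries above
ll_null = -8679.5     # LL-Null (intercept-only model)
ll_full = -8026.8     # Log-Likelihood with centrality (Df Model = 4)
ll_reduced = -8327.1  # Log-Likelihood without centrality (Df Model = 3)

# McFadden's pseudo R-squared: 1 - LL / LL_null
r2_full = 1 - ll_full / ll_null
r2_reduced = 1 - ll_reduced / ll_null

# Likelihood-ratio statistic for dropping one parameter (centrality)
lr_stat = 2 * (ll_full - ll_reduced)

# Chi-squared survival function for 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f'pseudo R2 with centrality:    {r2_full:.4f}')
print(f'pseudo R2 without centrality: {r2_reduced:.4f}')
print(f'LR statistic: {lr_stat:.1f}, p-value: {p_value:.3g}')
```

The recomputed pseudo R-squared values match the summaries (0.0752 and 0.0406), and the likelihood-ratio statistic is about 600.6 on one degree of freedom. So dropping centrality makes the sign of `number_of_neighbors` easier to interpret, but the model without it fits the data noticeably worse.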