# Harassment and Newcomer Retention (Paper)


Regression analysis notebook for a study of the effect of harassment on newcomer retention on Wikipedia. See the [research project page](https://meta.wikimedia.org/wiki/Research:Detox/Harassment_and_User_Retention) for an overview.

In [68]:
%matplotlib inline
import pandas as pd
from dateutil.relativedelta import relativedelta
import statsmodels.formula.api as sm
import requests
from io import StringIO
import math

### Load Data and take sample

WIP, documentation coming

In [69]:
# Features computed in ./Harassment and Newcomer Retention Data Munging.ipynb
df_reg = pd.read_csv("../../data/retention/newcomer_sample_features.csv")

In [70]:
df_reg.shape

(111290, 24)

In [71]:
df_newcomer_sample = pd.read_csv("../../data/retention/newcomer_sample.csv")
df_reg = df_reg.merge(df_newcomer_sample, on = "user_text", how = "inner")

In [72]:
df_reg.shape

(111290, 26)

In [73]:
df_reg['sample'].value_counts()

random      92469
attacked    18821
Name: sample, dtype: int64

In [74]:
df_reg['t1_harassment_received'].value_counts()

0    108887
1      2403
Name: t1_harassment_received, dtype: int64

In [75]:
df_reg['has_gender'].value_counts()

0    105896
1      5394
Name: has_gender, dtype: int64

In [76]:
df_reg['t1_active'].value_counts()

1    111290
Name: t1_active, dtype: int64

In [77]:
df_reg = pd.concat(
    [df_reg.query('t1_harassment_received == 1')
     , df_reg.query("sample == 'random'").sample(50000, random_state = 12)]
).drop_duplicates(subset = "user_text")

In [78]:
df_reg['sample'].value_counts()

random      50014
attacked     2375
Name: sample, dtype: int64

In [79]:
df_reg['t1_harassment_received'].value_counts()

0    49986
1     2403
Name: t1_harassment_received, dtype: int64

In [80]:
df_reg.shape

(52389, 26)

In [81]:
column_map = {
        't1_num_days_active': 'm1_days_active',
        't2_num_days_active' : 'm2_days_active',
        't1_harassment_received': 'm1_received_harassment',
        't1_harassment_made': 'm1_made_harassment',
        't1_fraction_ns0_deleted': 'm1_fraction_ns0_deleted',
        't1_fraction_ns0_reverted': 'm1_fraction_ns0_reverted',
        't1_num_warnings_recieved': 'm1_warnings',
        }
        
df_reg = df_reg.rename(columns=column_map)

### Regression Analysis

In [82]:
def regress(df, f, family = 'linear'):
    """Fit the formula f on df and return the coefficient table."""
    if family == 'linear':
        results = sm.ols(formula=f, data=df).fit()
    elif family == 'logistic':
        results = sm.logit(formula=f, data=df).fit(disp=0)
    else:
        raise ValueError("family must be 'linear' or 'logistic'")
    return results.summary().tables[1]
    

def get_latex_table(results, f, family = 'linear'):
    """
    Mess of a function for turning a statsmodels SimpleTable
    into a nice latex table string.
    f is the regression formula, used as the table caption.
    """
    
    results = pd.read_csv(StringIO(results.as_csv()))
    
    if family == 'linear':
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
           'P>|t| ': "p-val",
            '    t    ': "z-stat",
           ' [95.0% Conf. Int.]': "95% CI"
        }

    elif family == 'logistic':
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
           'P>|z| ': "p-val",
            '    z    ': "z-stat",
           ' [95.0% Conf. Int.]': "95% CI"
        }
    else:
        return
        
        
    results = results.rename(columns=column_map)
    results.index = results[""]
    del results[""]
    results = results[['coef', "z-stat", "p-val", "95% CI"]]
    results['coef'] = results['coef'].apply(lambda x: round(float(x), 2))
    results['z-stat'] = results['z-stat'].apply(lambda x: round(float(x), 1))
    results['p-val'] = results['p-val'].apply(lambda x: round(float(x), 3))
    results['95% CI'] = results['95% CI'].apply(reformat_ci)
    header = """
\\begin{table}[h]
\\begin{center}
    """
    footer = """
\\end{center}
\\caption{%s}
\\label{tab:}
\\end{table}
    """
    # Escape the formula string f for use as the LaTeX caption
    caption = f.replace("_", "\\_").replace("~", "\\texttildelow\\")
    latex = header + results.to_latex() + footer % caption
    print(latex)
    return results
        
    
def reformat_ci(s):
    ci = s.strip().split()
    ci = (round(float(ci[0]), 1), round(float(ci[1]), 1))
    return "[%.1f, %.1f]" % ci    

#### RQ1: Do newcomers in general show reduced activity after experiencing harassment?

In [83]:
f ="m2_days_active ~ m1_received_harassment"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.1788 | 0.007 | 23.884 | 0.000 | 0.164 0.193 |
| m1_received_harassment | 1.8599 | 0.035 | 53.213 | 0.000 | 1.791 1.928 |


In [84]:
f= "m2_days_active ~ m1_days_active + m1_received_harassment"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.5654 | 0.007 | -81.425 | 0.000 | -0.579 -0.552 |
| m1_days_active | 0.4933 | 0.003 | 190.999 | 0.000 | 0.488 0.498 |
| m1_received_harassment | -0.4743 | 0.029 | -16.084 | 0.000 | -0.532 -0.416 |


The first regression shows that newcomers who are harassed in m1 tend to be more active in m2, which, taken at face value, would suggest that harassment does not have a chilling effect on continued newcomer activity. However, this result is an artifact of harassed newcomers being more active in general. After controlling for the level of activity in m1, users who are harassed are significantly less active in m2 than users with comparable m1 activity who are not harassed (coefficient -0.47, p < 0.001).
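The sign flip described above can be reproduced on synthetic data. The following is a minimal sketch, not the study's data or code: the data-generating process, the sample size, and the helper `ols_coefs` are all hypothetical, and it uses NumPy least squares rather than the statsmodels call used in this notebook. It assumes that more active users are more likely to be harassed and that harassment lowers next-month activity by a fixed amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (NOT the study's data):
# more active users are more likely to be harassed, and harassment
# lowers next-month activity by 0.5 days.
m1_days_active = rng.poisson(3.0, size=n).astype(float)
p_harassed = 1.0 / (1.0 + np.exp(-(m1_days_active - 6.0)))
m1_received_harassment = (rng.random(n) < p_harassed).astype(float)
m2_days_active = (
    0.5 * m1_days_active
    - 0.5 * m1_received_harassment
    + rng.normal(0.0, 1.0, size=n)
)


def ols_coefs(y, *regressors):
    """Least-squares coefficients; the first entry is the intercept."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Without the activity control, harassment picks up the effect of activity.
naive = ols_coefs(m2_days_active, m1_received_harassment)
# With the control, the harassment coefficient reflects its own effect.
controlled = ols_coefs(m2_days_active, m1_days_active, m1_received_harassment)

print("naive harassment coef:     ", round(naive[1], 3))
print("controlled harassment coef:", round(controlled[2], 3))
```

In this simulation the naive coefficient comes out positive and the controlled one negative, mirroring the pattern in the two regressions above. It illustrates the mechanism only and says nothing about the size of the effect in the real data.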

#### RQ2: Does a newcomer's gender affect how they behave after experiencing harassment?

In [85]:
f="m1_received_harassment ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -1.8962 | 0.083 | -22.712 | 0.000 | -2.060 -1.733 |
| is_female | 0.5148 | 0.177 | 2.901 | 0.004 | 0.167 0.863 |


In [87]:
f="m2_days_active ~ m1_days_active + m1_received_harassment + m1_received_harassment : is_female"
regress(df_reg.query("has_gender == 1"), f)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.8618 | 0.090 | -9.548 | 0.000 | -1.039 -0.685 |
| m1_days_active | 0.7157 | 0.017 | 41.515 | 0.000 | 0.682 0.749 |
| m1_received_harassment | -1.6535 | 0.291 | -5.691 | 0.000 | -2.223 -1.084 |
| m1_received_harassment:is_female | -0.0869 | 0.472 | -0.184 | 0.854 | -1.013 0.839 |


For our gender analysis, we reduce our sample to the set of users who reported a gender. First, we observe that newcomers who end up reporting a female gender are more likely to receive harassment in m1. To investigate whether the impact of receiving harassment differs across genders, we ran the same regression as in RQ1, restricted to users who supplied a gender, and added an interaction term between gender and our measure of harassment in m1. Within this subset, we again see that users who received harassment have reduced activity in m2. The interaction term between harassment and gender is not significant (p = 0.854), indicating that the impact is not significantly different for males and females.
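The logistic regression coefficients above are on the log-odds scale, which makes "more likely" hard to read off directly. As a small worked example, the sketch below converts the reported coefficients into an odds ratio and into fitted probabilities. The helper `inverse_logit` is introduced here for illustration. The probabilities describe this regression sample, which includes every harassed newcomer alongside a random subsample, so they should not be read as population rates.

```python
import math

# Coefficients copied from the logistic regression output above.
intercept = -1.8962
coef_is_female = 0.5148


def inverse_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Odds of receiving harassment for is_female = 1 relative to is_female = 0.
odds_ratio = math.exp(coef_is_female)

# Fitted probabilities within this sample of users who reported a gender.
p_not_female = inverse_logit(intercept)
p_female = inverse_logit(intercept + coef_is_female)

print(f"odds ratio (is_female = 1 vs. 0): {odds_ratio:.2f}")
print(f"P(harassed | is_female = 0):      {p_not_female:.3f}")
print(f"P(harassed | is_female = 1):      {p_female:.3f}")
```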

#### RQ3: How do good faith newcomers behave after experiencing harassment?

In [88]:
f="m2_days_active ~ m1_days_active + m1_received_harassment +  m1_received_harassment : m1_made_harassment + m1_received_harassment : m1_warnings"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.5666 | 0.007 | -81.218 | 0.000 | -0.580 -0.553 |
| m1_days_active | 0.4941 | 0.003 | 188.288 | 0.000 | 0.489 0.499 |
| m1_received_harassment | -0.3896 | 0.033 | -11.781 | 0.000 | -0.454 -0.325 |
| m1_received_harassment:m1_made_harassment | -0.1049 | 0.061 | -1.724 | 0.085 | -0.224 0.014 |
| m1_received_harassment:m1_warnings | -0.1512 | 0.019 | -7.920 | 0.000 | -0.189 -0.114 |


A serious potential confound in our analyses is that the users who receive harassment may simply be bad-faith newcomers or sock-puppets. They get attacked for their misbehavior and reduce their activity in m2 because they get blocked or because they never intended to stick around past their own attacks. To reduce this confound, we account for whether the user harassed anyone in m1 and for the user warnings (of any type) they received. The results show that even users who receive harassment but did not harass anyone or receive a user warning show reduced activity in m2 (coefficient -0.39).
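Because this model includes interaction terms, the estimated association between harassment and m2 activity depends on the user's profile. The sketch below evaluates it for a few profiles using the point estimates from the table above. The function `harassment_effect` is a hypothetical helper, and treating `m1_made_harassment` as a 0/1 indicator is an assumption; confidence intervals are ignored.

```python
# Coefficients copied from the RQ3 regression output above.
COEF_HARASSMENT = -0.3896
COEF_HARASSMENT_X_MADE = -0.1049
COEF_HARASSMENT_X_WARNINGS = -0.1512


def harassment_effect(made_harassment, warnings):
    """Estimated change in m2_days_active associated with receiving
    harassment in m1, given the values of the two interacting features.

    made_harassment is assumed to be a 0/1 indicator (an assumption).
    """
    return (
        COEF_HARASSMENT
        + COEF_HARASSMENT_X_MADE * made_harassment
        + COEF_HARASSMENT_X_WARNINGS * warnings
    )


# The (0, 0) profile corresponds to the "good faith" newcomer in the text.
for made, warnings in [(0, 0), (0, 2), (1, 0), (1, 2)]:
    effect = harassment_effect(made, warnings)
    print(f"made_harassment={made}, warnings={warnings}: {effect:.4f}")
```

For the good-faith profile the interaction terms vanish and only the main coefficient remains, which is the quantity the paragraph above refers to.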

#### RQ4: How does experiencing harassment compare to previously studied barriers to newcomer socialization?

[Halfaker et al.](https://www-users.cs.umn.edu/~halfak/publications/The_Rise_and_Decline/halfaker13rise-preprint.pdf) examine how user warnings, deletions, and reverts correlate with newcomer retention. Here we add those features and see how they compare to our measure of harassment.

In [89]:
f = "m2_days_active ~ m1_days_active + m1_received_harassment + m1_warnings +  m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("t1_num_ns0_edits > 0"), f)

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.6018 | 0.012 | -49.840 | 0.000 | -0.625 -0.578 |
| m1_days_active | 0.5094 | 0.003 | 158.863 | 0.000 | 0.503 0.516 |
| m1_received_harassment | -0.5343 | 0.041 | -12.992 | 0.000 | -0.615 -0.454 |
| m1_warnings | -0.1200 | 0.016 | -7.667 | 0.000 | -0.151 -0.089 |
| m1_fraction_ns0_deleted | -0.0483 | 0.044 | -1.099 | 0.272 | -0.134 0.038 |
| m1_fraction_ns0_reverted | -0.0034 | 0.019 | -0.180 | 0.857 | -0.041 0.034 |


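WIP: In this model, receiving harassment is associated with a larger drop in m2 activity (-0.53 days) than each additional warning message (-0.12 days), or than having all of a newcomer's first month's work deleted (-0.05) or reverted (-0.003).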
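One way to check a comparison like this against the RQ4 table is to express the harassment coefficient in units of the other coefficients. The sketch below does that arithmetic with the point estimates only; it ignores the confidence intervals, and the deleted and reverted coefficients are not statistically significant in the table, so the result is a rough guide rather than a finding.

```python
# Coefficients copied from the RQ4 regression output above.
coefs = {
    "m1_received_harassment": -0.5343,
    "m1_warnings": -0.1200,
    "m1_fraction_ns0_deleted": -0.0483,
    "m1_fraction_ns0_reverted": -0.0034,
}

harassment = coefs["m1_received_harassment"]

# Number of warnings whose combined point estimate matches harassment.
warnings_equivalent = harassment / coefs["m1_warnings"]

# The fraction features range from 0 to 1, so each coefficient is the
# estimated change when all ns0 edits are deleted (or all are reverted).
all_deleted_and_reverted = (
    coefs["m1_fraction_ns0_deleted"] + coefs["m1_fraction_ns0_reverted"]
)

print(f"harassment is equivalent to about {warnings_equivalent:.1f} warnings")
print(f"harassment: {harassment:.4f}, "
      f"all work deleted and reverted: {all_deleted_and_reverted:.4f}")
```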