# Harassment and Newcomer Retention

In this notebook we investigate how receiving harassment correlates with newcomer activity and retention. For the purposes of this study, our measures of harassment are classifiers over individual discussion comments for personal attacks, aggression and toxicity. These classifiers were developed [in previous work](https://arxiv.org/abs/1610.08914). We investigate the relationship between harassment and newcomer retention by running regression models that use measures of editing activity and harassment in time span t1 as independent variables and a measure of editing activity in time span t2 as the dependent variable.

In [2]:
%matplotlib inline
import pandas as pd
from dateutil.relativedelta import relativedelta
import statsmodels.formula.api as sm
import requests

### Regression Modeling

In this section we explore various regression models that use measures of editing activity and harassment in time span t1 as independent variables and a measure of editing activity in time span t2 as the dependent variable.

In [3]:
df_reg = pd.read_csv("../../data/retention/newcomer_sample_features.csv")

In [4]:
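def regress(df, formula, family='linear'):
    """Fit `formula` on `df` and return the coefficient table."""
    if family == 'linear':
        result = sm.ols(formula=formula, data=df).fit()
    elif family == 'logistic':
        result = sm.logit(formula=formula, data=df).fit(disp=0)
    else:
        # Fail loudly instead of reaching an undefined `result` below.
        raise ValueError("family must be 'linear' or 'logistic'")
    return result.summary().tables[1]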

#### Logistic Regression: Does receiving harassment in t1 make you less likely to make an edit in t2?

In [5]:
f = "t2_active ~ t1_harassment_received"
regress(df_reg, f, family = 'logistic')

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -2.5973 | 0.013 | -207.074 | 0.000 | [-2.622, -2.573] |
| `t1_harassment_received` | 1.8786 | 0.078 | 24.133 | 0.000 | [1.726, 2.031] |


This model suggests that users who receive harassment have an increased probability of being active in t2, compared to users who did not receive harassment. Let's control for how active the user was in t1 to see if the result holds.

In [6]:
f ="t2_active ~ t1_harassment_received + t1_num_days_active"
regress(df_reg, f, family = 'logistic')

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -3.7756 | 0.021 | -179.910 | 0.000 | [-3.817, -3.734] |
| `t1_harassment_received` | -0.1524 | 0.115 | -1.321 | 0.187 | [-0.378, 0.074] |
| `t1_num_days_active` | 0.5172 | 0.006 | 79.999 | 0.000 | [0.505, 0.530] |


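After controlling for the number of days a user was active in t1, the sign of the harassment coefficient flips: users receiving harassment have a lower estimated probability of activity in t2. However, this coefficient is not statistically significant (p = 0.187, and the 95% confidence interval includes 0).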
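Logistic regression coefficients are on the log-odds scale, which can be hard to read directly. The following is a minimal sketch that converts the coefficients reported in the two tables above into probabilities and odds ratios. The helper `inv_logit` and the variable names are introduced here for illustration only, and the numbers are copied by hand from the output rather than read from the fitted model.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Coefficients copied from the unadjusted model (t2_active ~ t1_harassment_received).
intercept = -2.5973
coef_harassment = 1.8786

p_not_harassed = inv_logit(intercept)
p_harassed = inv_logit(intercept + coef_harassment)
odds_ratio = math.exp(coef_harassment)

print(f"P(active in t2 | no harassment) = {p_not_harassed:.3f}")
print(f"P(active in t2 | harassment)    = {p_harassed:.3f}")
print(f"unadjusted odds ratio           = {odds_ratio:.2f}")

# Adjusted model: harassment coefficient and its 95% interval, as odds ratios.
adj_coef, adj_low, adj_high = -0.1524, -0.378, 0.074
adj_or, adj_or_low, adj_or_high = (math.exp(v) for v in (adj_coef, adj_low, adj_high))
print(f"adjusted odds ratio = {adj_or:.2f} ({adj_or_low:.2f} to {adj_or_high:.2f})")
```

An odds-ratio interval that contains 1 corresponds to a log-odds interval that contains 0, which is another way of reading the non-significant adjusted coefficient.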

#### Linear Regression: Is harassment correlated with reduction in activity from t1 to t2?

Instead of running a logistic regression using an indicator for activity in t2 as our dependent variable, we will run a linear regression using the number of days active in t2 as our dependent variable.

In [7]:
f ="t2_num_days_active ~ t1_harassment_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | 0.3336 | 0.007 | 48.537 | 0.000 | [0.320, 0.347] |
| `t1_harassment_received` | 1.5663 | 0.078 | 20.057 | 0.000 | [1.413, 1.719] |


We see a similar result as above. Without controlling for activity in t1, users who receive harassment have, on average, more active days in t2.
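One plausible reason for a positive unadjusted association is confounding by activity: more active users are more exposed to other editors and are also more likely to stay active. The sketch below illustrates the mechanism on purely synthetic data. The sample size, the exposure rate `0.03 * t1_days` and the effect sizes are all made-up assumptions, not estimates from our sample.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

users = []
for _ in range(20000):
    t1_days = random.randint(1, 10)
    # Assumption: more active users are more exposed, so more likely to be harassed.
    harassed = random.random() < 0.03 * t1_days
    # Assumption: harassment lowers t2 activity by 0.5 days at every activity level.
    t2_days = 0.6 * t1_days - 0.5 * harassed + random.gauss(0, 1)
    users.append((t1_days, harassed, t2_days))

def mean_gap(rows):
    """Mean t2 days of harassed users minus mean t2 days of everyone else."""
    hit = [t2 for _, h, t2 in rows if h]
    rest = [t2 for _, h, t2 in rows if not h]
    return mean(hit) - mean(rest)

# Unadjusted comparison over all users.
raw_gap = mean_gap(users)

# Compare only users with the same number of active days in t1, then average.
strata_gaps = [mean_gap([u for u in users if u[0] == d]) for d in range(1, 11)]
adjusted_gap = mean(strata_gaps)

print(f"raw gap:      {raw_gap:+.2f} days")
print(f"adjusted gap: {adjusted_gap:+.2f} days")
```

In this toy setup the raw gap is positive even though harassment lowers t2 activity by construction. Comparing users within the same t1 activity level recovers the negative gap, which is the same idea as adding `t1_num_days_active` to the regression formula.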

In [8]:
f= "t2_num_days_active ~ t1_num_days_active + t1_harassment_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7705 | 0.006 | -138.232 | 0.000 | [-0.781, -0.760] |
| `t1_num_days_active` | 0.6635 | 0.002 | 350.196 | 0.000 | [0.660, 0.667] |
| `t1_harassment_received` | -1.5108 | 0.053 | -28.522 | 0.000 | [-1.615, -1.407] |


However, when we control for the number of days a user is active in t1, we see that users who receive harassment have fewer active days in t2. The coefficient is significantly less than 0 and larger in magnitude than the coefficient on the number of active days in t1. Instead of using an indicator for whether the user received harassment, let's use the count of various types of harassment received (i.e., personal attacks, aggression, toxicity):

In [9]:
f= "t2_num_days_active ~ t1_num_days_active + t1_num_attacks_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7703 | 0.006 | -138.105 | 0.000 | [-0.781, -0.759] |
| `t1_num_days_active` | 0.6623 | 0.002 | 349.970 | 0.000 | [0.659, 0.666] |
| `t1_num_attacks_received` | -1.3215 | 0.050 | -26.356 | 0.000 | [-1.420, -1.223] |


In [10]:
f ="t2_num_days_active ~ t1_num_days_active + t1_num_aggression_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7703 | 0.006 | -138.160 | 0.000 | [-0.781, -0.759] |
| `t1_num_days_active` | 0.6627 | 0.002 | 350.171 | 0.000 | [0.659, 0.666] |
| `t1_num_aggression_received` | -1.3447 | 0.049 | -27.417 | 0.000 | [-1.441, -1.249] |


In [11]:
f="t2_num_days_active ~ t1_num_days_active + t1_num_toxicity_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7669 | 0.006 | -137.068 | 0.000 | [-0.778, -0.756] |
| `t1_num_days_active` | 0.6548 | 0.002 | 348.819 | 0.000 | [0.651, 0.658] |
| `t1_num_toxicity_received` | -0.3383 | 0.091 | -3.725 | 0.000 | [-0.516, -0.160] |


We see the same general pattern as above, except that toxic comments received seem to have a weaker association with lower activity in t2 than personal attacks and aggression. Also, the magnitude of the coefficients decreased since we are using a count and not an indicator as above.
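To make the count coefficients concrete, here is a small worked example using the attacks model above. The function `predict_t2_days` is a hypothetical helper written for this illustration, with coefficients copied by hand from the table. It is a sketch of how to read the fitted line, not a replacement for the model's own prediction method.

```python
# Coefficients copied from: t2_num_days_active ~ t1_num_days_active + t1_num_attacks_received
INTERCEPT = -0.7703
COEF_T1_DAYS = 0.6623
COEF_ATTACKS = -1.3215

def predict_t2_days(t1_days, n_attacks):
    """Predicted number of active days in t2 under the fitted linear model."""
    return INTERCEPT + COEF_T1_DAYS * t1_days + COEF_ATTACKS * n_attacks

# A newcomer active on 5 days in t1, receiving 0, 1 or 2 personal attacks.
predictions = {n: predict_t2_days(5, n) for n in (0, 1, 2)}
for n_attacks, days in predictions.items():
    print(f"{n_attacks} attacks -> {days:.2f} predicted active days in t2")
```

Each additional attack shifts the prediction by the same amount, the attack coefficient. The prediction for two attacks comes out slightly below zero, a reminder that a linear model does not constrain predicted day counts to be non-negative.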

Before we regress an activity measure in t2 on an activity measure in t1 and multiple measures of harassment, let's see how our different measures of harassment correlate:

In [12]:
from scipy.stats import pearsonr 
print(pearsonr(df_reg['t1_num_attacks_received'] ,  df_reg['t1_num_aggression_received']))
print(pearsonr(df_reg['t1_num_toxicity_received'] , df_reg['t1_num_aggression_received']))
print(pearsonr(df_reg['t1_num_toxicity_received'] , df_reg['t1_num_attacks_received']))

(0.97434616050296752, 0.0)
(0.49586164323313414, 0.0)
(0.48773009729386224, 0.0)


Personal attacks and aggression are very highly correlated (r ≈ 0.97). This is probably because both questions appeared on the same form. The toxicity measure has a lower, though still substantial, correlation (r ≈ 0.49) with both personal attacks and aggression. Let's try a regression using toxicity and one of the two other measures:

In [13]:
f="t2_num_days_active ~ t1_num_days_active + t1_num_toxicity_received + t1_num_attacks_received"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7712 | 0.006 | -138.331 | 0.000 | [-0.782, -0.760] |
| `t1_num_days_active` | 0.6632 | 0.002 | 350.277 | 0.000 | [0.659, 0.667] |
| `t1_num_toxicity_received` | 1.0852 | 0.104 | 10.469 | 0.000 | [0.882, 1.288] |
| `t1_num_attacks_received` | -1.6150 | 0.057 | -28.125 | 0.000 | [-1.728, -1.502] |


This result is harder to interpret: the coefficient on `t1_num_toxicity_received` flips from negative to positive once `t1_num_attacks_received` is included, which may reflect the correlation (r ≈ 0.49) between the two measures.
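A rough way to gauge how much a correlation between two predictors inflates coefficient uncertainty is the variance inflation factor (VIF). The sketch below uses the two-predictor simplification VIF = 1 / (1 - r²) with the pairwise correlations printed above. This simplification ignores `t1_num_days_active`, which is also in the model, so the numbers are indicative only. The helper name `vif_from_pairwise_r` is introduced here for illustration.

```python
def vif_from_pairwise_r(r):
    """Variance inflation factor for a model with exactly two predictors."""
    return 1.0 / (1.0 - r ** 2)

# Pairwise Pearson correlations copied from the output above.
pairs = {
    "attacks vs aggression": 0.97434616050296752,
    "toxicity vs aggression": 0.49586164323313414,
    "toxicity vs attacks": 0.48773009729386224,
}

vifs = {name: vif_from_pairwise_r(r) for name, r in pairs.items()}
for name, vif in vifs.items():
    print(f"{name}: VIF = {vif:.2f}")
```

By this rough measure, personal attacks and aggression are far more collinear with each other than either is with toxicity, which is consistent with pairing toxicity with only one of the two.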

#### Linear Regression: How do gender, harassment and activity interact?

Gender in Wikipedia is not well defined. After registering, users have the ability to report their gender in their user preferences. The vast majority of users do not report their gender. This may be because reporting their gender is not important to them, they don't want to report a gender, or they simply are unaware of the feature. There is anecdotal evidence that users often report an incorrect gender. Overall, this means that we should expect users who report their gender to be different from the rest, and we cannot be sure that reported genders are correct. Another caveat for the following analysis is that we do not know when the user reported their gender; they may have changed their user preference after our two-month interval of interest.




In [14]:
f="t1_harassment_received ~ has_gender"
regress(df_reg, f, family = 'logistic')

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -5.0119 | 0.040 | -125.282 | 0.000 | [-5.090, -4.933] |
| `has_gender` | 1.6434 | 0.095 | 17.333 | 0.000 | [1.458, 1.829] |


Users who supply a gender are more likely to receive harassment! Let's see if this is different for males and females:

In [15]:
f="t1_harassment_received ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -3.4824 | 0.100 | -34.980 | 0.000 | [-3.678, -3.287] |
| `is_female` | 0.5424 | 0.198 | 2.741 | 0.006 | [0.155, 0.930] |


Among users who report a gender, females have an increased probability of receiving harassment in t1 compared to males.

In [16]:
f="t2_active ~ has_gender"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | 0.0641 | 0.001 | 77.514 | 0.000 | [0.062, 0.066] |
| `has_gender` | 0.1701 | 0.004 | 42.328 | 0.000 | [0.162, 0.178] |


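Users who supply a gender are also more likely to be active in t2! Note that this cell fits the binary `t2_active` outcome with the default linear family, so the coefficient is a difference in probability rather than a log-odds. Again, let's see if this is different for males and females: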

In [17]:
f="t2_active ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -1.1576 | 0.040 | -29.161 | 0.000 | [-1.235, -1.080] |
| `is_female` | -0.1637 | 0.100 | -1.640 | 0.101 | [-0.359, 0.032] |


Females have a decreased probability of being active in t2, although the effect is not significant. Let's see what happens when we control for activity in t1.

In [18]:
f="t2_num_days_active ~ t1_num_days_active + has_gender"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7741 | 0.006 | -138.212 | 0.000 | [-0.785, -0.763] |
| `t1_num_days_active` | 0.6485 | 0.002 | 340.575 | 0.000 | [0.645, 0.652] |
| `has_gender` | 0.4127 | 0.023 | 17.812 | 0.000 | [0.367, 0.458] |


Users who supply a gender appear to be more active in t2 even after controlling for activity in t1.

In [19]:
f="t2_num_days_active ~ t1_num_days_active + is_female"
regress(df_reg.query("has_gender == 1"), f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.9551 | 0.067 | -14.236 | 0.000 | [-1.087, -0.824] |
| `t1_num_days_active` | 0.8225 | 0.009 | 91.613 | 0.000 | [0.805, 0.840] |
| `is_female` | -0.3886 | 0.139 | -2.801 | 0.005 | [-0.661, -0.117] |


Females appear to have decreased activity in t2 even after controlling for activity in t1 compared to males.

In [20]:
f="t2_num_days_active ~ t1_num_days_active + t1_harassment_received * has_gender"
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7794 | 0.006 | -139.606 | 0.000 | [-0.790, -0.768] |
| `t1_num_days_active` | 0.6575 | 0.002 | 342.434 | 0.000 | [0.654, 0.661] |
| `t1_harassment_received` | -1.4099 | 0.058 | -24.253 | 0.000 | [-1.524, -1.296] |
| `has_gender` | 0.4556 | 0.023 | 19.476 | 0.000 | [0.410, 0.501] |
| `t1_harassment_received:has_gender` | -0.7548 | 0.137 | -5.526 | 0.000 | [-1.022, -0.487] |


It seems like users who supply a gender and receive harassment have even more strongly reduced activity in t2 compared to users who do not supply a gender and get harassed.
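With an interaction term, the association between harassment and t2 activity differs by group: it is the main harassment coefficient for users without a reported gender, and the main coefficient plus the interaction coefficient for users with one. The short sketch below does this arithmetic with values copied by hand from the table above. The dictionary and variable names are illustrative only.

```python
# Coefficients copied from:
# t2_num_days_active ~ t1_num_days_active + t1_harassment_received * has_gender
coef = {
    "t1_harassment_received": -1.4099,
    "has_gender": 0.4556,
    "t1_harassment_received:has_gender": -0.7548,
}

# Change in predicted t2 active days associated with harassment, holding t1 activity fixed.
effect_no_gender = coef["t1_harassment_received"]
effect_with_gender = (
    coef["t1_harassment_received"] + coef["t1_harassment_received:has_gender"]
)

print(f"no gender reported: {effect_no_gender:+.4f} days")
print(f"gender reported:    {effect_with_gender:+.4f} days")
```

These are differences in predicted active days between harassed and non-harassed users within each group, not causal effects.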

In [21]:
f="t2_num_days_active ~ t1_num_days_active + t1_harassment_received * is_female"
regress(df_reg.query("has_gender == 1"), f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.9257 | 0.066 | -13.943 | 0.000 | [-1.056, -0.796] |
| `t1_num_days_active` | 0.8402 | 0.009 | 93.345 | 0.000 | [0.823, 0.858] |
| `t1_harassment_received` | -3.3005 | 0.336 | -9.820 | 0.000 | [-3.959, -2.642] |
| `is_female` | -0.3325 | 0.140 | -2.372 | 0.018 | [-0.607, -0.058] |
| `t1_harassment_received:is_female` | 0.4436 | 0.660 | 0.672 | 0.501 | [-0.850, 1.737] |


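Among users who report a gender, receiving harassment is associated with strongly reduced activity in t2. The interaction term is positive but not significant (p = 0.501), so we find no evidence that this reduction differs between females and males.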

#### Linear regression: addressing the bad newcomer confound

A serious potential confound in our analyses could be that the users who receive harassment are just bad-faith newcomers or sock-puppets. They get attacked for their misbehavior and reduce their activity in t2 because they get blocked or because they never intended to stick around past their own attacks. To reduce this confound, we control for whether the user harassed anyone in t1 and for whether they received a user warning of any type:

In [23]:
f="t2_num_days_active ~ t1_num_days_active + t1_harassment_received + t1_harassment_made * t1_harassment_received + t1_num_warnings_recieved * t1_harassment_received "
regress(df_reg, f)

| | coef | std err | t | P>\|t\| | 95.0% Conf. Int. |
|---|---|---|---|---|---|
| Intercept | -0.7702 | 0.006 | -137.792 | 0.000 | [-0.781, -0.759] |
| `t1_num_days_active` | 0.6690 | 0.002 | 348.027 | 0.000 | [0.665, 0.673] |
| `t1_harassment_received` | -1.4574 | 0.055 | -26.318 | 0.000 | [-1.566, -1.349] |
| `t1_harassment_made` | -0.7307 | 0.050 | -14.742 | 0.000 | [-0.828, -0.634] |
| `t1_harassment_made:t1_harassment_received` | -0.2378 | 0.190 | -1.251 | 0.211 | [-0.610, 0.135] |
| `t1_num_warnings_recieved` | -0.1032 | 0.020 | -5.130 | 0.000 | [-0.143, -0.064] |
| `t1_num_warnings_recieved:t1_harassment_received` | 0.0942 | 0.035 | 2.667 | 0.008 | [0.025, 0.163] |


Even users who receive harassment but did not harass anyone or receive a user warning show reduced activity in t2.

#### Linear regression: add revert and deletion features

### Investigate some newcomer experiences

Our regression analyses suggest that, after controlling for activity in t1, newcomers who receive harassment show lower subsequent activity than those who do not. Let's look at a few examples of newcomers, what edits they made and how others interacted with them.