# Test Assumptions of Survival Analysis
[J. Nathan Matias](https://github.com/natematias)
February 2019

Retention measures and survival analyses both make the assumption that if someone doesn't edit in a given period, that they are finished with Wikipedia, or at least are unlikely to be an editor who makes regular contributions to Wikipedia. 

How true is this? If accounts leave Wikipedia and come back the next week or month, then estimates of retention or survival analyses might be under-counting participation in Wikipedia.

In this short analysis, I compare the assumptions of several possible measures to try to develop a valid, usable measure of retention. The dataset analyzed in this code is [historical data prepared by Max Klein](https://docs.google.com/document/d/1RKJZqoWKQuWDoKk94drIEsJWK6kBUeZ8KIJOyEqDTTE/edit).

## Load Libraries

In [1]:
import os, sys, csv
import matplotlib as plt
import seaborn as sns

## temporarily, until Max fixes the problem with lots of serialized timestamp objects within lines
csv.field_size_limit(sys.maxsize)

131072

## Load Files

In [2]:
homedir = "/home/civilservant"
data_path = "Tresors/CivilServant/projects/wikipedia-integration/gratitude-study/datasets/power_analysis"

de_power_df = []

with open(os.path.join(homedir, data_path, 
                       "de_gratitude_power-analysis_dataset_sim_date_20180306_v1.csv"), 
          "r") as f:
#    for line in f:
#        de_power_df.append(line)
    for row in csv.DictReader(f):
        for key in row.keys():
            if key.find("num")>-1:
                row[key] = int(float(row[key]))
            elif key.find("any") >-1:
                row[key] = row[key] == "True"
            elif key.find("hours") >-1:
                row[key] = float(row[key])
        de_power_df.append(row)

# Diagnostic Analyses

## One Week Windows
When looking retention in week windows, within experience groups, estimate the number of weeks with at least one edit that follow at least one observation of 0 edits in a week, over at least a month-long period

In [3]:
week_range = [1,2,3,4,5,6,7,8,9,10,11,12]
max_zero_eligible = 8

In [4]:
def week_key(week):
    return "any_edits_week_" + str(week) + "_post_treatment"

for participant in de_power_df:
    participant['any_edits_one_week_eligible_zeroes'] = None
    participant['any_edits_one_week_false_dropouts'] = None

for participant in de_power_df:

    eligible_zeroes = 0
    num_false_dropouts = 0
    for week in week_range:
        any_edits = participant[week_key(week)]
        if(week <= max_zero_eligible):
            if(any_edits == False):
                eligible_zeroes +=1
        if(eligible_zeroes > 0 and any_edits == True):
            num_false_dropouts += 1
    participant['any_edits_one_week_eligible_zeroes'] = eligible_zeroes
    participant['any_edits_one_week_false_dropouts'] = num_false_dropouts


In [5]:
# print the false dropout rate for a subset of Wikipedians in the sample
# df: dataframe
# lang: string of the language used (for subsampling and output)
# newcomer: binary indicator for whether to subselect on newcomers
def print_false_dropout_rate(df, dropout_key, weeks, lang, newcomer):
    if(newcomer):
        dropout_sum = sum([1 for x in df if 
                                   x[dropout_key] and
                                   x['lang'] == lang and
                                   x['experience_level_pre_treatment']=="bin_0"])*100 / \
                          len([x for x in df if 
                                   x['lang'] == lang and 
                                   x['experience_level_pre_treatment']=="bin_0"])
    else:
        dropout_sum = sum([1 for x in df if 
                                   x[dropout_key] and
                                   x['lang'] == lang])*100 / \
                          len([x for x in df if 
                                   x['lang'] == lang])        
    
    newcomer_string = ""
    if(newcomer):
        newcomer_string = " newcomer"
    
    print("A total of {:.1f}% of sampled ".format(dropout_sum) + 
          lang + newcomer_string + 
          " Wikipedia accounts have false dropouts over 4 weeks with " +
          weeks + " week observation windows.")


In [6]:
print_false_dropout_rate(de_power_df, 'any_edits_one_week_false_dropouts', 
                         "one", "de", True)
print()
print_false_dropout_rate(de_power_df, 'any_edits_one_week_false_dropouts', 
                         "one", "de", False)

A total of 3.0% of sampled de newcomer Wikipedia accounts have false dropouts over 4 weeks with one week observation windows.

A total of 15.0% of sampled de Wikipedia accounts have false dropouts over 4 weeks with one week observation windows.


## When looking retention in 2 week windows, estimate the number of 2 week periods with at least one edit that follow at least one observation of 0 edits, over at least a month-long period

In [7]:
for participant in de_power_df:
    participant['any_edits_two_week_eligible_zeroes'] = None
    participant['any_edits_two_week_false_dropouts'] = None
    
for participant in de_power_df:

    eligible_zeroes = 0
    num_false_dropouts = 0
    for week in range(1,week_range[-1]+1, 2):
        any_edits = participant[week_key(week)] or participant[week_key(week+1)]
        if(week < max_zero_eligible-1):
            if(any_edits == False):
                eligible_zeroes +=1
        if(eligible_zeroes > 0 and any_edits == True):
            num_false_dropouts += 1
    participant['any_edits_two_week_eligible_zeroes'] = eligible_zeroes
    participant['any_edits_two_week_false_dropouts'] = num_false_dropouts

In [8]:
print_false_dropout_rate(de_power_df, 'any_edits_two_week_false_dropouts', 
                         "two", "de", True)
print()
print_false_dropout_rate(de_power_df, 'any_edits_two_week_false_dropouts', 
                         "two", "de", False)

A total of 2.3% of sampled de newcomer Wikipedia accounts have false dropouts over 4 weeks with two week observation windows.

A total of 11.8% of sampled de Wikipedia accounts have false dropouts over 4 weeks with two week observation windows.


## When looking retention in 3 week windows, estimate the number of 3 week periods with at least one edit that follow at least one observation of 0 edits, over at least a month-long period

In [9]:
for participant in de_power_df:
    participant['any_edits_three_week_eligible_zeroes'] = None
    participant['any_edits_three_week_false_dropouts'] = None
    
for participant in de_power_df:

    eligible_zeroes = 0
    num_false_dropouts = 0
    for week in range(1,week_range[-3]+1, 3):
        any_edits = participant[week_key(week)] or participant[week_key(week+1)] or participant[week_key(week+2)]
        if(week < max_zero_eligible-2):
            if(any_edits == False):
                eligible_zeroes +=1
        if(eligible_zeroes > 0 and any_edits == True):
            num_false_dropouts += 1
    participant['any_edits_three_week_eligible_zeroes'] = eligible_zeroes
    participant['any_edits_three_week_false_dropouts'] = num_false_dropouts


In [10]:
print_false_dropout_rate(de_power_df, 'any_edits_three_week_false_dropouts', 
                         "three", "de", True)
print()
print_false_dropout_rate(de_power_df, 'any_edits_three_week_false_dropouts', 
                         "three", "de", False)

A total of 1.7% of sampled de newcomer Wikipedia accounts have false dropouts over 4 weeks with three week observation windows.

A total of 8.9% of sampled de Wikipedia accounts have false dropouts over 4 weeks with three week observation windows.


### ## When looking retention in 4 week windows, estimate the number of 4 week periods with at least one edit that follow at least one observation of 0 edits, over at least a month-long period

In [11]:
for participant in de_power_df:
    participant['any_edits_four_week_eligible_zeroes'] = None
    participant['any_edits_four_week_false_dropouts'] = None
    
for participant in de_power_df:
    eligible_zeroes = 0
    num_false_dropouts = 0
    for week in range(1,week_range[-4]+1, 4):
        any_edits = participant[week_key(week)] or participant[week_key(week+1)] or participant[week_key(week+2)] or participant[week_key(week+3)]
        if(week < max_zero_eligible-4):
            if(any_edits == False):
                eligible_zeroes +=1
        if(eligible_zeroes > 0 and any_edits == True):
            num_false_dropouts += 1
    participant['any_edits_four_week_eligible_zeroes'] = eligible_zeroes
    participant['any_edits_four_week_false_dropouts'] = num_false_dropouts


In [12]:
print_false_dropout_rate(de_power_df, 'any_edits_four_week_false_dropouts', 
                         "four", "de", True)
print()
print_false_dropout_rate(de_power_df, 'any_edits_four_week_false_dropouts', 
                         "four", "de", False)

A total of 1.7% of sampled de newcomer Wikipedia accounts have false dropouts over 4 weeks with four week observation windows.

A total of 6.0% of sampled de Wikipedia accounts have false dropouts over 4 weeks with four week observation windows.
