# Goal
1. Build the dataset described in this [Google doc](https://docs.google.com/document/d/1YCsWf2G89ABSyE3MMB5bWqaSyOkvvYp2IPwBkI1mlMs/edit).
1. One participant per row, where a participant is any user that was ever seen by the app.
1. Include the survey results from Julia.
1. Then add replica data on behaviour (probably with caching).


### connections
1. connecting to aws studies mysql on `3311`
    1. `ssh -N studies.cs -L 3311:localhost:3306`
2. connecting to wmf replicas on `3310`
    1. `ssh -N maximilianklein@tools-login.wmflabs.org -L 3310:enwiki.analytics.db.svc.eqiad.wmflabs:3306`
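
Both tunnels need to be up before the queries in this notebook can run. A minimal sketch for checking the local ends of the tunnels, using only the standard library; the helper names `port_is_open` and `tunnel_status` are hypothetical and not part of `civilservant`.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Local ends of the two SSH tunnels listed above.
TUNNELS = {"studies mysql": 3311, "wmf replicas": 3310}

def tunnel_status(tunnels=TUNNELS, host="127.0.0.1"):
    """Map each tunnel name to whether its local port accepts connections."""
    return {name: port_is_open(host, port) for name, port in tunnels.items()}
```

Calling `tunnel_status()` returns a dict of booleans; a `False` entry suggests the corresponding `ssh -N ... -L` command is not running.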

In [1]:
from civilservant.models.core import ExperimentThing, ExperimentAction
# from thanks.utils import _get_experiment_id
from civilservant.util import read_config_file
import os
from civilservant.db import init_session, init_engine
from sqlalchemy.dialects import mysql
import sqlalchemy
import pandas as pd
import datetime
# import uuid
CACHEDIR='/data/project/cache'
TRESORDIR='CivilServant/projects/wikipedia-integration/fr-newcomer-study'

In [2]:
no_pii_f_stem = os.path.join(os.getenv('TRESORDIR'), TRESORDIR, 'datasets', 'post-experiment')
df = pd.read_pickle(os.path.join(no_pii_f_stem, 'user_records_and_survey.pickle'))

acct_map = pd.read_pickle(os.path.join(no_pii_f_stem, 'acct_map.pickle'))

In [3]:
acct_map.head(1)

Unnamed: 0,lang,user_name,user_id,public_anonymous_id,private_anonymous_id
0,fr,Blasquin,3708538,e22d958a-a5a6-4cc7-8845-27b9733823f4,e116132f-531a-5131-94e4-6e40c70eac3e


In [4]:
df = df.merge(acct_map[['private_anonymous_id','user_name','user_id']], on='private_anonymous_id')

## Getting external data

1. seven.day.activation:
    + Binary variable: whether the participant made an edit within 7 days of registration.
1. labor.hours: (int) The estimated number of labor hours in the 84-day period after registration.
    + This measure is the total labor hours in the 84 days after the treatment. Labor hours are a measure based on an account's edit history [1]. To generate this integer, we take the floating-point number of labor hours an account contributed to Wikipedia in the 84-day period after receiving the intervention and round it down.
1. four.week.retention:
    + Binary variable: whether the account is considered to be still contributing to Wikipedia, defined as making at least one edit to Wikipedia in a five-week period starting at the beginning of a given week. For example, a subject is retained after four weeks if they made a contribution between the 22nd and 56th day after registering (that is, between the 1st day of the 4th week and the last day of the 8th week).
    + NOTE: the experiment plan incorrectly paired the 56th day with the last day of the 9th week.
1. labor.hours.nontruncated
    + Like labor.hours, but the original float rather than the int.
1. mentor.respond
    + Responding to sender (only comparing those in Treatment 1 and Treatment 2). Binary variable: whether the subject sent a message to the sender within 7 days.
1. forum.des.nouveaux (stretch goal)
    + Contributing to the Newcomer Forum (only comparing those in Treatment 1 and Treatment 2). Binary variable: whether the subject made a contribution to the Newcomer Forum within 7 days.
1. social.response <- mentor.respond | forum.des.nouveaux
1. sandbox
    + Using the sandbox (only comparing those in Treatment 1 and Treatment 2). Binary variable: whether the subject made a contribution in their sandbox within 7 days. In French the sandbox is called "Brouillon".
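
Three of the definitions above reduce to very small computations. The following is a sketch on plain Python values rather than the notebook's DataFrames; the function names are hypothetical, and the choice of a half-open window `(registration, registration + 7 days]` is an assumption.

```python
import datetime
import math

def is_activated(registration, edit_times, days=7):
    """seven.day.activation: any edit within `days` days after registration."""
    cutoff = registration + datetime.timedelta(days=days)
    return any(registration < t <= cutoff for t in edit_times)

def truncate_labor_hours(labor_hours_nontruncated):
    """labor.hours: the floating-point labor hours rounded down to an int."""
    return math.floor(labor_hours_nontruncated)

def any_social_response(mentor_respond, forum_des_nouveaux):
    """social.response: logical OR of the two social outcomes."""
    return bool(mentor_respond or forum_des_nouveaux)
```

For example, `truncate_labor_hours(7.502778)` gives `7`, and a participant who posted to the forum but never replied to a mentor still counts as a social response.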



In [5]:
from civilservant.wikipedia.queries.revisions import get_timestamps_within_range
from civilservant.wikipedia.queries.user_interactions import get_thanks_sending
from civilservant.wikipedia.queries.users import get_user_basic_data 
from civilservant.wikipedia.utils import make_cached_df, make_sessions, calc_labour_hours,\
                                            to_wmftimestamp, from_wmftimestamp, bin_from_td, decode_or_nan
from civilservant.wikipedia.connections.database import make_wmf_con

wmf_con = make_wmf_con()

MAXIMUM_OBS_WINDOW = datetime.timedelta(days=84)

@make_cached_df('fr-welcome-timestamps')
def get_user_edits_before_and_after_obs(lang, user_name, user_registration):
    start_date = user_registration
    end_date = start_date + MAXIMUM_OBS_WINDOW
    ts = get_timestamps_within_range(lang=lang, start_date=start_date, end_date=end_date, user_name=user_name,
                         con=wmf_con, with_page_id=True)
    return ts
    

In [6]:
# delete this for full run
# df = df[:1000]
# print(f'data frame length: {len(df)}')

In [7]:
df['labor_hours_ts_df'] = df.apply(lambda row: get_user_edits_before_and_after_obs(row['lang'],
                                                                                  row['user_name'],
                                                                                  row['user_registration'])
                                                                                   , axis=1)

In [8]:
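def window_ts_df(user_registration, start_days_offset, end_days_offset, ts_df):
    """Return the rows of ts_df with rev_timestamp in (start_date, end_date].

    start_days_offset is counted from user_registration; end_days_offset is
    counted from start_date, so the window is end_days_offset days long.
    """
    start_date = user_registration + datetime.timedelta(days=start_days_offset)
    end_date = start_date + datetime.timedelta(days=end_days_offset)
    between_cond = (ts_df['rev_timestamp']>start_date) & (ts_df['rev_timestamp']<=end_date)
    return ts_df[between_cond]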

In [9]:
def seven_day_activation(user_registration, ts_df):
    between = window_ts_df(user_registration=user_registration, 
                           start_days_offset=0, 
                           end_days_offset=7, 
                           ts_df=ts_df)
    return len(between)>0

In [10]:
def four_week_retention(user_registration, ts_df):
    between = window_ts_df(user_registration=user_registration, 
                           start_days_offset=22, 
                           end_days_offset=56, 
                           ts_df=ts_df)
    return len(between)>0
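
Because `window_ts_df` adds `end_days_offset` to the start date, the call above covers a window that ends 22 + 56 = 78 days after registration, which is longer than the five-week window in the written definition. A sketch of the alternative, with both offsets measured from registration, may help when comparing the two; the helper names are hypothetical, and counting the first 24 hours after registration as day 1 (so that day 22 begins 21 full days in) is an assumption.

```python
import datetime

def window_bounds(registration, start_days, end_days):
    """Window bounds with BOTH offsets measured from registration."""
    start = registration + datetime.timedelta(days=start_days)
    end = registration + datetime.timedelta(days=end_days)
    return start, end

def edited_in_window(registration, edit_times, start_days, end_days):
    """True if any edit time falls in (start, end]."""
    start, end = window_bounds(registration, start_days, end_days)
    return any(start < t <= end for t in edit_times)

reg = datetime.datetime(2020, 2, 1)
start, end = window_bounds(reg, 21, 56)  # 22nd day through 56th day
print((end - start).days)  # 35, i.e. five weeks
```

Under this convention an edit 60 days after registration does not count towards four-week retention, whereas it does under the offsets used in the cell above.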

In [11]:
df['seven_day_activation'] = df.apply(lambda row: seven_day_activation(row['user_registration'], row['labor_hours_ts_df'],
                                                                    ), axis=1)

In [12]:
df['seven_day_activation'].sum()

21124

In [13]:
df['labor_hours_ts_df'].apply(lambda s:len(s)>0).sum()

22194

In [14]:
df['four_week_retention'] = df.apply(lambda row: four_week_retention(row['user_registration'], row['labor_hours_ts_df'],
                                                                    ), axis=1)

In [15]:
df['four_week_retention'].sum()

2516

In [16]:
def num_labor_hours(before_after, behavior_start_dt, ts_df):
    """Labor hours in the MAXIMUM_OBS_WINDOW before or after behavior_start_dt."""
    if before_after == 'before':
        start_dt, end_dt = behavior_start_dt - MAXIMUM_OBS_WINDOW, behavior_start_dt
    else:
        start_dt, end_dt = behavior_start_dt, behavior_start_dt + MAXIMUM_OBS_WINDOW

    # Named window_df so it is not confused with the window_ts_df function.
    window_df = ts_df[(ts_df['rev_timestamp'] > start_dt) & (ts_df['rev_timestamp'] <= end_dt)]
    if len(window_df) == 0:
        return 0
    window_dts = [pd.to_datetime(np_dt) for np_dt in window_df['rev_timestamp'].values]
    return calc_labour_hours(window_dts)
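
The body of `calc_labour_hours` is not shown in this notebook. As an illustration of how an edit-history-based measure can work, here is a sketch of a session-based estimate: edits separated by less than a cutoff are grouped into one session, and each session contributes its duration plus a fixed allowance. This is not the `civilservant` implementation. The one-hour cutoff is an assumption, and the 430-second allowance is a guess based on the value 0.119444 hours (430 seconds) that appears in the outputs further down.

```python
import datetime

def session_labor_hours(edit_times, cutoff_minutes=60, padding_seconds=430):
    """Estimate labor hours from edit sessions (hypothetical parameters)."""
    if not edit_times:
        return 0.0
    ordered = sorted(edit_times)
    cutoff = datetime.timedelta(minutes=cutoff_minutes)
    padding = datetime.timedelta(seconds=padding_seconds)
    total = datetime.timedelta()
    session_start = prev = ordered[0]
    for t in ordered[1:]:
        if t - prev > cutoff:
            # The gap is too long: close the current session and start a new one.
            total += (prev - session_start) + padding
            session_start = t
        prev = t
    total += (prev - session_start) + padding  # close the final session
    return total.total_seconds() / 3600
```

With these parameters a single isolated edit yields 430 / 3600 hours, which truncates to a `labor_hours` of 0.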

In [17]:
df['labor_hours_nontruncated'] = df.apply(lambda row: num_labor_hours('after', 
                                                                row['user_registration'],
                                                                row['labor_hours_ts_df']),
                                    axis=1)

In [18]:
df['labor_hours'] = df['labor_hours_nontruncated'].apply(int)

In [19]:
df['labor_hours'].max(), df['labor_hours_nontruncated'].max()

(520, 520.0369444444449)

## get the page-ids of the user-talk pages of all the signers

In [20]:
mentors = set(df['mentor_user_name'])

In [21]:
@make_cached_df('frwiki-welcome-page-ids-of-users')
def get_user_page_ids(lang, user_name):
    wmf_con.execute(f'use {lang}wiki_p;')

    page_id_sql = """select page_id, page_title, page_namespace
    from page where page_title=:user_name"""

    page_id_sql_esc = sqlalchemy.text(page_id_sql)
    params = {"user_name": user_name,
              "lang": lang}
    page_id_df = pd.read_sql(page_id_sql_esc, con=wmf_con, params=params)
    page_id_df['page_title'] = page_id_df['page_title'].apply(decode_or_nan)
    return page_id_df
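
The query above matches `page_title` in every namespace, so it can also return article pages that happen to share a mentor's name. A sketch of a namespace-restricted lookup follows, run against an in-memory SQLite table so that it is self-contained; the function name and the sample rows are hypothetical. In MediaWiki, namespace 2 is User and 3 is User talk, and `page_title` stores spaces as underscores.

```python
import sqlite3

USER_NS, USER_TALK_NS = 2, 3  # MediaWiki namespace numbers

def user_page_ids(con, user_name, namespaces=(USER_NS, USER_TALK_NS)):
    """Look up a user's pages, restricted to the given namespaces."""
    title = user_name.replace(" ", "_")  # page_title uses underscores for spaces
    placeholders = ",".join("?" for _ in namespaces)
    sql = (
        "select page_id, page_title, page_namespace from page "
        f"where page_title = ? and page_namespace in ({placeholders})"
    )
    return con.execute(sql, (title, *namespaces)).fetchall()

# Illustrative data only: an article, a user page and a user-talk page.
con = sqlite3.connect(":memory:")
con.execute("create table page (page_id integer, page_title text, page_namespace integer)")
con.executemany(
    "insert into page values (?, ?, ?)",
    [(1, "Example_mentor", 0), (2, "Example_mentor", 2), (3, "Example_mentor", 3)],
)
rows = user_page_ids(con, "Example mentor")
```

Here `rows` contains only the namespace 2 and 3 pages; passing `namespaces=(USER_TALK_NS,)` would narrow the lookup to talk pages alone.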

In [22]:
mentor_id_dfs = []
for mentor in mentors:
#     print(mentor)
    page_id_df = get_user_page_ids('fr', mentor)
    mentor_id_dfs.append(page_id_df)

In [23]:
mentor_df = pd.concat(mentor_id_dfs)

In [24]:
mentor_df

Unnamed: 0,page_id,page_title,page_namespace
0,446121,Floflo,2
1,13018341,Floflo,3
0,7065621,Goombiis,2
1,10321851,Goombiis,3
0,9881003,Sijysuis,2
...,...,...,...
0,2978542,Trizek,2
1,9289907,Trizek,3
0,6847384,Frakir,0
1,7267572,Frakir,2


In [25]:
mentor_page_ids = mentor_df['page_id'].values

In [26]:
df.iloc[1]['labor_hours_ts_df']

Unnamed: 0,rev_timestamp,rev_page


In [27]:
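def responded_to_mentor(ts_df, mentor, any_mentor=False):
    """True if the user edited a page belonging to their mentor (or to any mentor).

    Adds a mentor_reply column to ts_df in place. ts_df covers the whole
    observation window, so edits are not restricted to the first 7 days.
    """
    if any_mentor:
        target_ids = mentor_page_ids
    else:
        target_ids = mentor_df[mentor_df['page_title']==mentor]['page_id'].values
    ts_df['mentor_reply'] = ts_df['rev_page'].isin(target_ids)
    return ts_df['mentor_reply'].sum() > 0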
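
The written definition of mentor.respond asks for a message within 7 days, whereas the function above considers every edit in the fetched window. A sketch of a time-limited check on plain `(timestamp, page_id)` pairs follows; the function name is hypothetical, and measuring the 7 days from registration is an assumption, since the time the welcome message was delivered is not available in this notebook.

```python
import datetime

def responded_within(edits, target_page_ids, registration, days=7):
    """True if any edit to a target page falls within `days` of registration.

    edits: iterable of (timestamp, page_id) pairs. Nothing is mutated.
    """
    cutoff = registration + datetime.timedelta(days=days)
    targets = set(target_page_ids)
    return any(
        page_id in targets and registration < ts <= cutoff
        for ts, page_id in edits
    )
```

An edit to the mentor's talk page on day 10 would therefore count with `days=14` but not with the default of 7.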

In [28]:
df['mentor_respond'] = df.apply(lambda row: responded_to_mentor(row['labor_hours_ts_df'], row['mentor_user_name']), axis=1)
# df['mentor_respond_any'] = df.apply(lambda row: responded_to_mentor(row[''], row[''], True))

In [29]:
df['mentor_respond'].mean()

0.0031704326502014365

In [30]:
df[df['mentor_respond']==True]

Unnamed: 0,private_anonymous_id,lang,user_registration,randomization_arm,randomization_block_id,complier,survey.consent,manipulation.check,efficacy,help,...,survey_invitation,mentor_user_name,user_name,user_id,labor_hours_ts_df,seven_day_activation,four_week_retention,labor_hours_nontruncated,labor_hours,mentor_respond
303,5f90f24a-0b1d-57c7-be86-de610b4a6c60,fr,2020-01-31 11:56:21,2,52,,,,,,...,True,Bastenbas,Luc dc Luc,3709118,rev_timestamp rev_page mentor_reply ...,True,False,0.119444,0,True
1389,c8d81685-731e-5f10-9400-124793b5a169,fr,2020-02-02 12:54:45,1,233,,,,,,...,True,Braaark,CathyDocLPPR,3710953,rev_timestamp rev_page mentor_reply...,True,True,7.502778,7,True
1458,86edade5-136f-5fdf-ac07-261fa0621020,fr,2020-02-02 15:02:22,1,244,,,,,,...,True,Bastenbas,DUPOIZAT,3711066,rev_timestamp rev_page mentor_reply ...,True,False,0.119444,0,True
1823,d27d179b-f96d-542d-b16a-7f81b63e1094,fr,2020-02-03 08:58:55,1,305,,,,,,...,True,Panam2014,Revuedroitetlittérature,3711750,rev_timestamp rev_page mentor_reply...,True,True,2.377500,2,True
2154,37f08ce4-6dae-567e-b59f-8467e7cf32bd,fr,2020-02-03 16:30:20,2,360,,,,,,...,True,Floflo,ClaireMarieV,3712228,rev_timestamp rev_page mentor_reply...,True,False,1.505000,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56270,4af09b77-0182-5b7f-b3cb-26a21f8a2500,fr,2020-05-10 11:29:10,2,9379,,,,,,...,True,Culex,Vik-corto,3811599,rev_timestamp rev_page mentor_reply...,True,True,3.339444,3,True
56433,0bde495e-457e-5966-a1b0-64b10759a676,fr,2020-05-10 16:12:51,2,9407,,,,,,...,True,Erdrokan,Cocodet,3811891,rev_timestamp rev_page mentor_reply...,True,False,3.137222,3,True
56503,34c84b19-66b8-5d9c-9141-e164699b7aa4,fr,2020-05-10 18:05:14,1,9418,,,,,,...,True,Bastenbas,Tristan-vic,3812001,rev_timestamp rev_page mentor_reply ...,False,True,0.758056,0,True
56726,cb75c241-ddff-5bc9-8f06-cbb3e91d12ca,fr,2020-05-11 06:56:14,1,9455,,,,,,...,True,AvatarFR,Soisita,3812436,rev_timestamp rev_page mentor_reply...,True,False,4.343611,4,True


## brouillon

In [31]:
# ## this is a little copypasta of the mentor, so we should abstract these, they only differ in the page title
# wmf_con.execute(f'use frwiki_p;')

# @make_cached_df('frwiki-welcome-broullion')
# def get_broullion_df(lang, user_name):

#     page_id_sql = """select page_id, page_title, page_namespace
#     from page where page_title=:broullion_title"""

#     page_id_sql_esc = sqlalchemy.text(page_id_sql)
#     broullion_title = f'{user_name}/Brouillon'
#     params = {"broullion_title": broullion_title,
#               "lang": lang}
#     page_id_df = pd.read_sql(page_id_sql_esc, con=wmf_con, params=params)
#     page_id_df['page_title'] = page_id_df['page_title'].apply(decode_or_nan)
#     return page_id_df

# def broullion_exists(user_name):
#     broullion_df = get_broullion_df('fr', user_name)
#     return len(broullion_df) > 0

In [32]:
# df_sm = df[:900]

In [33]:
# df_sm['sandbox'] = df_sm['user_name'].apply(broullion_exists)

In [34]:
# df_sm['sandbox'].sum()

In [35]:
# df_sm[['sandbox','sandbox_quarry']].sum()

## brouillon deux


In [36]:
# executed on Quarry on 2020-08-21
# use frwiki_p;

    
# select page_id, page_title, page_namespace
#     from page where page_title like '%/Brouillon'
#     and page_namespace=2;

q_broui = pd.read_csv(os.path.join(no_pii_f_stem, 'from_quary_brouillons_20200821.csv'))

In [37]:
q_broui['sandbox'] = True

In [38]:
df['user_name_brouillon'] = df['user_name'].apply(lambda user_name: f'{user_name}/Brouillon')

In [39]:
df = df.merge(q_broui[['page_title','sandbox']], how='left', left_on='user_name_brouillon', right_on='page_title')

In [40]:
df['sandbox'] = df['sandbox'].fillna(False)
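
One thing to watch in the merge above: MediaWiki's `page` table stores titles with underscores in place of spaces, so if the Quarry export keeps that form, user names containing spaces will not match `user_name_brouillon`. A sketch of the title normalisation follows; the helper names and sample titles are hypothetical. Note also that this flag records whether the sandbox page existed at export time, not whether it was edited within 7 days.

```python
def sandbox_title(user_name, suffix="Brouillon"):
    """Build the sandbox page title in the underscore form used by page_title."""
    return f"{user_name}/{suffix}".replace(" ", "_")

def has_sandbox(user_name, existing_titles):
    """True if the user's sandbox title is among the exported titles."""
    return sandbox_title(user_name) in existing_titles

# Illustrative titles only, not taken from the export.
titles = {"Example_user/Brouillon", "Autre/Brouillon"}
```

With this normalisation, `has_sandbox("Example user", titles)` matches the exported `Example_user/Brouillon`.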

In [41]:
df['sandbox'].sum()

1954

## forum des nouveaux

In [42]:
fdn = pd.read_csv(os.path.join(no_pii_f_stem, 'fr_user_ids_posting_forum_des_nouveaux_20200801.csv'), index_col=0)

In [43]:
fdn['forum_des_nouveaux'] = True

In [44]:
fdn.groupby('user_id').size().max(), fdn.groupby('user_id').size().count() , len(fdn)

(1, 6650, 6650)

In [45]:
df = df.merge(fdn[['user_id','forum_des_nouveaux']], how='left', left_on='user_id', right_on='user_id')

In [46]:
df['forum_des_nouveaux'] = df['forum_des_nouveaux'].fillna(False)

In [47]:
df['forum_des_nouveaux'].sum()

298

In [48]:
df['social_response'] = df.apply(lambda row: row['mentor_respond'] or row['forum_des_nouveaux'], axis=1)

In [49]:
df['social_response'].sum(), df['mentor_respond'].sum(), df['forum_des_nouveaux'].sum()

(458, 181, 298)

# Output 

In [50]:
df.columns

Index(['private_anonymous_id', 'lang', 'user_registration',
       'randomization_arm', 'randomization_block_id', 'complier',
       'survey.consent', 'manipulation.check', 'efficacy', 'help', 'role',
       'trust', 'friendliness', 'close.community', 'close.individuals.1',
       'close.individuals.2', 'close.individuals.3',
       'control_accidentally_treated', 'failed_to_treat',
       'failed_to_treat_already_created', 'failed_to_treat_blocked',
       'block_failed_to_treat', 'block_control_accidentally_treated',
       'survey_invitation', 'mentor_user_name', 'user_name', 'user_id',
       'labor_hours_ts_df', 'seven_day_activation', 'four_week_retention',
       'labor_hours_nontruncated', 'labor_hours', 'mentor_respond',
       'user_name_brouillon', 'page_title', 'sandbox', 'forum_des_nouveaux',
       'social_response'],
      dtype='object')

In [51]:
OUTPUT_COLS = [
'private_anonymous_id',
'seven_day_activation', 
'labor_hours', 
'efficacy',
'friendliness', 
'four_week_retention',
'labor_hours_nontruncated', 
'mentor_respond',
'forum_des_nouveaux',
'social_response',
'sandbox',
'help', 
'role',
'trust', 
'close.community', 
'close.individuals.1',
'close.individuals.2',
'close.individuals.3',
'randomization_arm', 
'randomization_block_id',
'control_accidentally_treated', 
'failed_to_treat',
'failed_to_treat_already_created',
'failed_to_treat_blocked',
'block_failed_to_treat', 
'block_control_accidentally_treated',
'survey_invitation',   
'complier',
'survey.consent',
'manipulation.check', 
'lang', 
'user_registration',
]

In [52]:
df_out = df[OUTPUT_COLS]
r_col_names = [cname.replace('_','.') for cname in df_out.columns]
df_out.columns = r_col_names
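
Replacing underscores with dots is safe only as long as no two columns collapse to the same name. A small sketch that performs the rename and guards against that case; the function name `to_r_names` is hypothetical.

```python
def to_r_names(columns):
    """Rename columns to R-style dotted names, refusing duplicate results."""
    renamed = [c.replace("_", ".") for c in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming produced duplicate column names")
    return renamed
```

Columns that already use dots, such as `close.individuals.1`, pass through unchanged.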

In [53]:
df_out.columns

Index(['private.anonymous.id', 'seven.day.activation', 'labor.hours',
       'efficacy', 'friendliness', 'four.week.retention',
       'labor.hours.nontruncated', 'mentor.respond', 'forum.des.nouveaux',
       'social.response', 'sandbox', 'help', 'role', 'trust',
       'close.community', 'close.individuals.1', 'close.individuals.2',
       'close.individuals.3', 'randomization.arm', 'randomization.block.id',
       'control.accidentally.treated', 'failed.to.treat',
       'failed.to.treat.already.created', 'failed.to.treat.blocked',
       'block.failed.to.treat', 'block.control.accidentally.treated',
       'survey.invitation', 'complier', 'survey.consent', 'manipulation.check',
       'lang', 'user.registration'],
      dtype='object')

In [54]:
# col_rename = {'num.skips':'number.skips.received',
#              'num.thanks':'number.thanks.received'}
# df = df.rename(columns=col_rename)

In [55]:
# output_col_present = [oc in df_out.columns for oc in r_col_names]
# all(output_col_present)

In [56]:
# list(zip(OUTPUT_COLS, output_col_present))

In [57]:
out_fname = 'frwiki-welcome-post-treatment-vars.csv'
df_out[r_col_names].to_csv(os.path.join(no_pii_f_stem, out_fname), index=False)

In [58]:
df_valid = df_out[df_out['block.control.accidentally.treated']==False]

In [59]:
df_valid.groupby('randomization.arm').mean()

randomization.arm,seven.day.activation,labor.hours,efficacy,friendliness,four.week.retention,labor.hours.nontruncated,mentor.respond,forum.des.nouveaux,social.response,sandbox,...,randomization.block.id,failed.to.treat,failed.to.treat.already.created,failed.to.treat.blocked,block.failed.to.treat,block.control.accidentally.treated,survey.invitation,complier,survey.consent,manipulation.check
0,0.363698,0.276534,3.798107,3.526814,0.044003,0.375399,0.0,0.004602,0.004602,0.034069,...,4566.709996,0.0,0.0,0.0,0.038054,0.0,0.956558,0.709751,1.0,0.194888
1,0.370952,0.263483,3.890688,3.62753,0.043717,0.364044,0.003816,0.004714,0.008137,0.032942,...,4566.197542,0.010494,0.010326,0.000168,0.038049,0.0,0.950895,0.681564,1.0,0.340164
2,0.370973,0.231227,3.856574,3.557769,0.045067,0.331713,0.005276,0.005893,0.010495,0.034684,...,4566.453755,0.010327,0.010271,5.6e-05,0.038051,0.0,0.953811,0.69337,1.0,0.278884
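
When reading the table above, keep in mind that pandas `mean` skips missing values by default, so the survey columns (for example `efficacy`) are averaged over respondents only, not over all participants in the arm. A plain-Python sketch of the same per-group mean follows; the function name and the sample rows are illustrative, not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

def group_means(rows, group_key, value_key):
    """Mean of value_key per group, ignoring rows where the value is None."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_key]
        if value is not None:
            groups[row[group_key]].append(value)
    return {group: mean(values) for group, values in groups.items()}

# Illustrative rows: arm 1 has one non-respondent (efficacy is None).
rows = [
    {"arm": 0, "efficacy": 4},
    {"arm": 0, "efficacy": 3},
    {"arm": 1, "efficacy": 5},
    {"arm": 1, "efficacy": None},
]
means = group_means(rows, "arm", "efficacy")
```

In this toy data arm 1 averages to 5, because the non-respondent is dropped rather than counted as zero.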


In [60]:
# df[df['user.name']=="Lujonae"]