# Process data

This notebook processes the dumps downloaded in ```download_wikis.ipynb```. The downloaded dumps are cleaned and combined into aggregated datasets.

The aggregated data is then used to derive the following main signals for an initial analysis:
* Number of new registrations
* Number of edits 
* Number of reverts and number of pages that were reverted
* Revert rate

In [1]:
import pandas as pd
import numpy as np
import logging
import time
from pathlib import Path

In [2]:
RES_PATH = "/dlabdata1/turkish_wiki"

In [3]:
DUMPS_PATH = '/dlabdata1/turkish_wiki'

In [4]:
DATA_PATH = '/dlabdata1/turkish_wiki'

The column names of the dumps were scraped from https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps. They are available in the ```column_names.csv``` file in this directory. The csv file also contains the data types of the dump columns, which we use later to preprocess the data.

In [5]:
column_names = pd.read_csv('column_names.csv')

# I. Preprocess the data
## 1) Combine the dumps of each year into one dataset
The dumps of Turkish Wikipedia are split into one dump per year. The function below concatenates the data of the different dumps and returns them in one dataset. The data is then saved in ```/dlabdata1/turkish_wiki/aggregated.tsv.gz```.

In [6]:
def combine_yearly_dumps(column_names, dtypes, lang ='tr',  path=DUMPS_PATH, 
                         ending='tsv.bz2', years= list(range(2002, 2022))):
    """
    Combines the yearly Wikipedia dumps into one aggregated DataFrame.
    
    Parameters
    ----------
        column_names : list
            names of the columns of the dumps to be combined
        dtypes : type or dict
            dtype passed to pd.read_csv; use str to load every column as text
        lang : str
            language of the dump, Default = 'tr'
        path : str
            location of the dumps, Default = DUMPS_PATH
        ending : str
            file extension of the dumps, Default = 'tsv.bz2'
        years : list
            years to be aggregated into one DataFrame
    
    Returns
    -------
        df_lang: pd.DataFrame
            combined DataFrame
    """
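    frames = []
    for year in years:
        start = time.time()
        try:
            # on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
            frames.append(pd.read_csv(f'{path}/{lang}-{year}.{ending}', sep='\t',
                                      names=list(column_names), dtype=dtypes,
                                      on_bad_lines='warn'))
            logging.warning(f'Loaded {lang}-{year} in {time.time() - start:.1f}s')
        except Exception:
            # logging.exception records the message together with the traceback
            logging.exception(f'Error when processing {lang}-{year}')
    # Concatenate once at the end instead of growing the DataFrame inside the loop
    return pd.concat(frames) if frames else pd.DataFrame()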

In [None]:
# Pass the dump column names (the col_name column), not the whole schema DataFrame
df_tr = combine_yearly_dumps(column_names.col_name, dtypes=str)

In [None]:
Path(f'{RES_PATH}').mkdir(parents=True, exist_ok=True)
df_tr.to_csv(f'{RES_PATH}/aggregated.tsv.gz', index=False, sep="\t", compression="gzip")

### How to read the aggregated (raw) dumps

In [10]:
# on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
df = pd.read_csv(f'{RES_PATH}/aggregated.tsv.gz', sep="\t", dtype=str,
                 on_bad_lines='warn', usecols=column_names.col_name[1:-1].values,
                 compression='gzip')

In [11]:
df.head()

Unnamed: 0,event_entity,event_type,event_timestamp,event_comment,event_user_id,event_user_text_historical,event_user_text,event_user_blocks_historical,event_user_blocks,event_user_groups_historical,...,revision_text_sha1,revision_content_model,revision_content_format,revision_is_deleted_by_page_deletion,revision_deleted_by_page_deletion_timestamp,revision_is_identity_reverted,revision_first_identity_reverting_revision_id,revision_seconds_to_identity_revert,revision_is_identity_revert,revision_is_from_before_page_creation
0,revision,create,2002-12-05 22:51:28.0,(moved from tr.wikipedia.com),,209.162.17.70,209.162.17.70,,,,...,8h2s3vbsvhk0xfyymbit06i0ef6j26s,,,False,,False,,,False,True
1,user,create,2002-12-05 22:54:39.0,,,,,,,,...,,,,,,,,,,
2,revision,create,2002-12-05 22:54:39.0,,1.0,Brion VIBBER,Brion VIBBER,,,,...,jevuozi5divb9m74s5x4gr3xrr12ch4,,,True,2016-10-07 18:22:17.0,False,,,False,False
3,revision,create,2002-12-05 23:39:38.0,"language links added - good luck, turkish wiki...",,80.128.44.46,80.128.44.46,,,,...,7s919m6k15itrhd1v3wwr8b1ci069on,,,False,,False,,,False,True
4,revision,create,2002-12-13 17:59:34.0,,,193.140.196.133,193.140.196.133,,,,...,8sqxjw60e25kh1siv8e19jozs6tj52s,,,False,,False,,,False,True


## 2) Clean and preprocess the data 
### a. Get column datatypes from scraped Wikipedia table.
The datatypes of each of the columns are available in the scraped ```column_names``` DataFrame. We map the dump datatypes to the corresponding pandas dtypes to reduce the size of the dumps in memory and to remove inconsistent entries from the dumps. The dump schema is documented at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps.

In [6]:
def transform_data_type_to_dtype(col_name, data_type):
    """
    Associates mediawiki datatypes to Python dtypes.
    
    Parameters
    ----------
        col_name : str
            column name
        data_type : str
            mediawiki datatype
    Returns 
    -------
        dtype: str
            corresponding python dtype
    """
    
    if (col_name == 'event_entity') or (col_name == 'event_type'):
        return 'category'
    elif ('timestamp' in col_name):
        return 'datetime64[ns, UTC]'
    elif (data_type == 'string') or (data_type == 'array<string>'):
        return 'object'
    elif (data_type == 'bigint') or (data_type == 'int') :
        return 'Int64'
    elif (data_type == 'boolean'):
        return 'boolean'
    else:
        return 'object'

In [7]:
col_to_dtype = column_names[['col_name', 'data_type']].set_index('col_name').to_dict()['data_type']

In [8]:
col_to_dtype = {k: transform_data_type_to_dtype(k, v) for k, v in col_to_dtype.items()}

In [9]:
category_cols = list({k  for k, v in col_to_dtype.items() if v == 'category' })
timestamp_cols = list({k  for k, v in col_to_dtype.items() if v == 'datetime64[ns, UTC]' })
numerical_cols = list({k  for k, v in col_to_dtype.items() if v == 'Int64' })
boolean_cols = list({k  for k, v in col_to_dtype.items() if v == 'boolean' })
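
As a quick sanity check of the mapping, the sketch below applies the same rules to a small hand-written schema table. The `schema` DataFrame and the `to_dtype` helper are illustrative stand-ins (assumptions), not the contents of `column_names.csv`.

```python
import pandas as pd

# Hypothetical excerpt of the schema table; the real rows live in column_names.csv
schema = pd.DataFrame({
    "col_name": ["event_entity", "event_timestamp", "event_user_id",
                 "page_is_redirect", "page_title"],
    "data_type": ["string", "string", "bigint", "boolean", "string"],
})

def to_dtype(col_name, data_type):
    """Condensed restatement of the rules in transform_data_type_to_dtype."""
    if col_name in ("event_entity", "event_type"):
        return "category"
    if "timestamp" in col_name:
        return "datetime64[ns, UTC]"
    if data_type in ("bigint", "int"):
        return "Int64"
    if data_type == "boolean":
        return "boolean"
    return "object"

col_to_dtype_demo = {c: to_dtype(c, t) for c, t in zip(schema.col_name, schema.data_type)}
numerical_demo = [c for c, d in col_to_dtype_demo.items() if d == "Int64"]
print(col_to_dtype_demo)
```

Note that the column name takes precedence over the declared type: `event_timestamp` is declared as a string in this toy table but is still mapped to a timezone-aware datetime.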

### b. Preprocess the dataset to reduce size in memory
The aggregated dataset is then transformed with the correct datatypes to ease future use. This preprocessing step also removes errors that appear in the dumps. For example, invalid values in timestamp columns (e.g. an IP address where a timestamp is expected) are replaced by missing values. The transformation also reduces the size that the DataFrames occupy in memory, since the conversions use more compact datatypes.

#### Convert low cardinality categorical columns

In [None]:
df[category_cols] = df[category_cols].astype("category")
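
To illustrate why this conversion saves memory, the sketch below compares the footprint of a toy column stored as Python strings and as a category. The data is made up and only mimics a low-cardinality column such as `event_entity`.

```python
import pandas as pd

# Toy column mimicking event_entity: three distinct values repeated many times
entity = pd.Series(["revision", "user", "page"] * 10_000, dtype=object)

bytes_as_object = entity.memory_usage(deep=True)
bytes_as_category = entity.astype("category").memory_usage(deep=True)

print(f"object: {bytes_as_object} bytes, category: {bytes_as_category} bytes")
```

A categorical column stores each distinct value once plus a small integer code per row, which is why the saving grows with the number of rows.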

#### Convert Timestamps

In [None]:
df[timestamp_cols] = df[timestamp_cols].apply(pd.to_datetime, utc =True, errors='coerce')
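
The effect of `errors='coerce'` can be seen on a toy column with one corrupted entry. This is a minimal sketch on made-up values: the corrupted entry becomes a missing timestamp (`NaT`) instead of raising an error.

```python
import pandas as pd

# Toy timestamp column with one corrupted entry (an IP address)
raw = pd.Series(["2002-12-05 22:51:28.0", "209.162.17.70", "2016-10-07 18:22:17.0"])

parsed = pd.to_datetime(raw, utc=True, errors="coerce")
print(parsed)
```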

#### Convert numerical columns

In [None]:
df[numerical_cols] = df[numerical_cols].apply(pd.to_numeric, errors='coerce').convert_dtypes()

#### Convert booleans

In [62]:
df[boolean_cols] = df[boolean_cols].replace({'true': True,'false': False})
df[boolean_cols] = df[boolean_cols].where(df[boolean_cols].applymap(type) == bool)
df[boolean_cols] = df[boolean_cols].convert_dtypes()
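
The three steps above can be traced on a toy column. The sketch below uses a single made-up Series rather than the full DataFrame, so `Series.map` stands in for `DataFrame.applymap`.

```python
import pandas as pd

# Toy flag column: valid strings, a stray non-boolean value and a missing value
flags = pd.Series(["true", "false", "209.162.17.70", None], dtype=object)

step1 = flags.replace({"true": True, "false": False})  # map the textual flags to bools
step2 = step1.where(step1.map(type) == bool)           # anything that is not a bool becomes missing
cleaned = step2.convert_dtypes()                       # object -> nullable 'boolean'
print(cleaned)
```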

In [63]:
# Save the cleaned aggregated raw dump dataset
df.to_csv(f'{RES_PATH}/cleaned_trwiki_1.tsv.gz', index=False, sep="\t", compression="gzip")

### c. Separate DataFrame for user, revision and page
As the dumps documentation shows, some columns apply only to events related to users, pages or revisions. We therefore split the aggregated DataFrame into three DataFrames to simplify further use.
* ```user_df```: Has all events related to user activities. The events can correspond to the registering of a new account, changing the name of a user, changing the groups (rights) of a user or the blocking/unblocking of a user. Saved in ```/dlabdata1/turkish_wiki/user_events.tsv.gz```
* ```page_df```: Has all events related to page activities. The events can correspond to the creation, deletion or merging of a page. Saved in ```/dlabdata1/turkish_wiki/page_events.tsv.gz```
* ```revision_df```: Has all events related to revisions (edits). Saved in ```/dlabdata1/turkish_wiki/revision_events.tsv.gz```
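
The split can be sketched on a toy event table as follows. The values and the shortened column lists in `entity_cols` are hypothetical; the real selections are the longer lists used in the cells of this section.

```python
import pandas as pd

# Toy event table; column names follow the dump schema, values are made up
events = pd.DataFrame({
    "event_entity": ["user", "page", "revision", "revision"],
    "event_type": ["create", "create", "create", "create"],
    "user_id": [10, None, None, None],
    "page_id": [None, 5, 5, 7],
    "revision_id": [None, None, 100, 101],
})

common_cols = ["event_entity", "event_type"]
# Entity-specific columns (hypothetical, much shorter than the real lists)
entity_cols = {
    "user": ["user_id"],
    "page": ["page_id"],
    "revision": ["page_id", "revision_id"],
}

# One DataFrame per entity: filter the rows, then keep only the relevant columns
frames = {
    entity: events.loc[events["event_entity"] == entity, common_cols + cols]
    for entity, cols in entity_cols.items()
}
print({entity: frame.shape for entity, frame in frames.items()})
```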

In [85]:
user_df = df[df['event_entity'] == 'user'][['event_entity', 'event_type', 'event_timestamp', 'event_comment',
       'event_user_id', 'event_user_text_historical', 'event_user_text',
       'event_user_blocks_historical', 'event_user_blocks',
       'event_user_groups_historical', 'event_user_groups',
       'event_user_is_bot_by_historical', 'event_user_is_bot_by',
       'event_user_is_created_by_self', 'event_user_is_created_by_system',
       'event_user_is_created_by_peer', 'event_user_is_anonymous',
       'event_user_registration_timestamp', 'event_user_creation_timestamp',
       'event_user_first_edit_timestamp', 'event_user_revision_count',
       'event_user_seconds_since_previous_revision',
       'user_id', 'user_text_historical', 'user_text',
       'user_blocks_historical', 'user_blocks', 'user_groups_historical',
       'user_groups', 'user_is_bot_by_historical', 'user_is_bot_by',
       'user_is_created_by_self', 'user_is_created_by_system',
       'user_is_created_by_peer', 'user_is_anonymous',
       'user_registration_timestamp', 'user_creation_timestamp',
       'user_first_edit_timestamp']]

In [None]:
user_df.to_csv(f'{RES_PATH}/user_events.tsv.gz', index=False, sep="\t", compression="gzip")

In [None]:
page_df = df[df['event_entity'] == 'page'][['event_entity', 'event_type', 'event_timestamp', 'event_comment',
       'event_user_id', 'event_user_text_historical', 'event_user_text',
       'event_user_blocks_historical', 'event_user_blocks',
       'event_user_groups_historical', 'event_user_groups',
       'event_user_is_bot_by_historical', 'event_user_is_bot_by',
       'event_user_is_created_by_self', 'event_user_is_created_by_system',
       'event_user_is_created_by_peer', 'event_user_is_anonymous',
       'event_user_registration_timestamp', 'event_user_creation_timestamp',
       'event_user_first_edit_timestamp', 'event_user_revision_count',
       'event_user_seconds_since_previous_revision', 'page_id',
       'page_title_historical', 'page_title', 'page_namespace_historical',
       'page_namespace_is_content_historical', 'page_namespace',
       'page_namespace_is_content', 'page_is_redirect', 'page_is_deleted',
       'page_creation_timestamp', 'page_first_edit_timestamp',
       'page_revision_count', 'page_seconds_since_previous_revision']]

In [None]:
page_df.to_csv(f'{RES_PATH}/page_events.tsv.gz', index=False, sep="\t", compression="gzip")

In [None]:
revision_df = df[df['event_entity'] == 'revision'][['event_entity', 'event_type', 'event_timestamp', 'event_comment',
       'event_user_id', 'event_user_text_historical', 'event_user_text',
       'event_user_blocks_historical', 'event_user_blocks',
       'event_user_groups_historical', 'event_user_groups',
       'event_user_is_bot_by_historical', 'event_user_is_bot_by',
       'event_user_is_created_by_self', 'event_user_is_created_by_system',
       'event_user_is_created_by_peer', 'event_user_is_anonymous',
       'event_user_registration_timestamp', 'event_user_creation_timestamp',
       'event_user_first_edit_timestamp', 'event_user_revision_count',
       'event_user_seconds_since_previous_revision', 'page_id',
       'page_title_historical', 'page_title', 'page_namespace_historical',
       'page_namespace_is_content_historical', 'page_namespace',
       'page_namespace_is_content', 'page_is_redirect', 'page_is_deleted',
       'page_creation_timestamp', 'page_first_edit_timestamp',
       'page_revision_count', 'page_seconds_since_previous_revision',
       'revision_id', 'revision_parent_id',
       'revision_minor_edit', 'revision_deleted_parts',
       'revision_deleted_parts_are_suppressed', 'revision_text_bytes',
       'revision_text_bytes_diff', 'revision_text_sha1',
       'revision_content_model', 'revision_content_format',
       'revision_is_deleted_by_page_deletion',
       'revision_deleted_by_page_deletion_timestamp',
       'revision_is_identity_reverted',
       'revision_first_identity_reverting_revision_id',
       'revision_seconds_to_identity_revert', 'revision_is_identity_revert',
       'revision_is_from_before_page_creation']]

In [None]:
revision_df.to_csv(f'{RES_PATH}/revision_events.tsv.gz', index=False, sep="\t", compression="gzip")

# II. Get main signals
In this part we retrieve the main signals mentioned above from the dumps.
## 1) Newcomers

### Get all user creation events

Processes the data and returns a DataFrame called ```all_registrations``` where all registration events to Turkish Wikipedia are available. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/all_registrations.csv```

The format of the DataFrame is as such:
 * date : Timestamp of the registration event
 * user_id : ID of the registered user

In [15]:
# on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
user_df = pd.read_csv(f'{RES_PATH}/user_events.tsv.gz', sep="\t", on_bad_lines='warn', compression='gzip')

In [16]:
try:
    
    # Convert data types
    user_df = user_df.convert_dtypes()
    
    # Process dates
    user_timestamp_columns = [col for col in user_df.columns if 'timestamp' in col]
    user_df[user_timestamp_columns] = user_df[user_timestamp_columns].apply(pd.to_datetime, utc =True, errors='coerce')
    user_df["date"] = user_df.event_timestamp.dt.strftime("%Y-%m-%d")

    # User registration event
    create_event_mask = (user_df.event_entity == 'user') & (user_df.event_type == 'create')
    # Filter bots
    no_bot_mask = (user_df['event_user_is_bot_by'].isna() | user_df['event_user_is_bot_by_historical'].isna())
    # Additional Filters
    self_creation_mask = (user_df['event_user_is_created_by_self'] == True)
    no_anon_mask = (user_df['event_user_is_anonymous'] != True)

    
    all_registrations = user_df[create_event_mask & no_anon_mask & no_bot_mask & self_creation_mask][['event_timestamp', 'event_user_id']]
    all_registrations.columns = ['date', 'user_id']
    all_registrations.to_csv(f'{DATA_PATH}/processed_data/all_registrations.csv', index =False)
    
except Exception as e:
    logging.error(f'Error: {str(e)}')  


In [17]:
all_registrations.head()

Unnamed: 0,date,user_id
762,2005-09-08 00:14:22+00:00,2985
763,2005-09-08 00:38:01+00:00,2986
764,2005-09-08 06:48:49+00:00,2987
765,2005-09-08 08:37:43+00:00,2988
766,2005-09-08 09:07:11+00:00,2989


### Get daily number of registrations

Processes the data and returns a DataFrame called ```group_creation``` where daily number of registrations to Turkish Wikipedia are available. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/newcomers.csv```

The format of the DataFrame is as such:
 * date : Timestamp of the day
 * number_of_newcomers : Number of registrations that day
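
The daily aggregation can be sketched on a few made-up registration events. The filtering masks are left out here, so only the grouping step is shown.

```python
import pandas as pd

# Toy registration events (made-up values)
events = pd.DataFrame({
    "event_timestamp": pd.to_datetime(
        ["2005-09-08 00:14:22", "2005-09-08 06:48:49", "2005-09-09 10:00:00"], utc=True),
    "event_user_id": [2985, 2987, 2990],
})

# Truncate each timestamp to its calendar day, then count the events per day
events["date"] = events["event_timestamp"].dt.strftime("%Y-%m-%d")
daily = events.groupby("date")["event_user_id"].size().reset_index()
daily.columns = ["date", "number_of_newcomers"]
print(daily)
```

Days without any registration do not appear in the result of this grouping.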

In [19]:
try:
    
    # Convert data types
    user_df = user_df.convert_dtypes()

    # Process dates
    user_timestamp_columns = [col for col in user_df.columns if 'timestamp' in col]
    user_df[user_timestamp_columns] = user_df[user_timestamp_columns].apply(pd.to_datetime, utc =True, errors='coerce')
    user_df["date"] = user_df.event_timestamp.dt.strftime("%Y-%m-%d")

    # User registration event
    create_event_mask = (user_df.event_entity == 'user') & (user_df.event_type == 'create')
    # Filter bots
    no_bot_mask = (user_df['event_user_is_bot_by'].isna() | user_df['event_user_is_bot_by_historical'].isna())
    # Additional Filters
    self_creation_mask = (user_df['event_user_is_created_by_self'] == True)
    no_anon_mask = (user_df['event_user_is_anonymous'] != True)

    # Group registrations by calendar day and get number of registrations
    group_creation = user_df[create_event_mask & no_anon_mask & no_bot_mask & self_creation_mask].groupby(['date'])['event_user_id'].size()

    # Format the data
    group_creation = group_creation.reset_index()
    group_creation.columns = ['date', 'number_of_newcomers']
    group_creation['date'] = pd.to_datetime(group_creation['date'],   utc = True)
    group_creation.to_csv(f'{DATA_PATH}/processed_data/newcomers.csv', index =False)
    
except Exception as e:
    logging.error(f'Error: {str(e)}')  


In [21]:
group_creation.head()

Unnamed: 0,date,number_of_newcomers
0,2005-09-08 00:00:00+00:00,22
1,2005-09-09 00:00:00+00:00,16
2,2005-09-10 00:00:00+00:00,12
3,2005-09-11 00:00:00+00:00,13
4,2005-09-12 00:00:00+00:00,19


## 2) Edits

In [22]:
# on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
revision_df = pd.read_csv(f'{RES_PATH}/revision_events.tsv.gz', sep="\t", usecols=[
       'event_entity', 'event_type', 'event_timestamp',
       'event_user_id', 'event_user_groups', 'event_user_is_bot_by',
       'event_user_revision_count', 'event_user_seconds_since_previous_revision',
       'page_id', 'page_title', 'page_namespace', 'page_revision_count',
       'revision_minor_edit', 'revision_text_bytes', 'revision_text_bytes_diff',
       'revision_is_identity_reverted', 'revision_is_identity_revert'],
       on_bad_lines='warn', compression='gzip')

In [23]:
revision_df.head()

Unnamed: 0,event_entity,event_type,event_timestamp,event_user_id,event_user_groups,event_user_is_bot_by,event_user_revision_count,event_user_seconds_since_previous_revision,page_id,page_title,page_namespace,page_revision_count,revision_minor_edit,revision_text_bytes,revision_text_bytes_diff,revision_is_identity_reverted,revision_is_identity_revert
0,revision,create,2002-12-05 22:51:28+00:00,,,,,,2740662.0,Anasayfa,0.0,1.0,True,809.0,809.0,False,False
1,revision,create,2002-12-05 22:54:39+00:00,1.0,,,1.0,,5.0,Main_Page,0.0,1.0,False,24.0,24.0,False,False
2,revision,create,2002-12-05 23:39:38+00:00,,,,,,2740662.0,Anasayfa,0.0,2.0,False,1010.0,201.0,False,False
3,revision,create,2002-12-13 17:59:34+00:00,,,,,,2740662.0,Anasayfa,0.0,3.0,False,890.0,-120.0,False,False
4,revision,create,2002-12-13 18:01:20+00:00,,,,,,2740662.0,Anasayfa,0.0,4.0,False,891.0,1.0,False,False


### Get number of edits per day, user kind and page id

Processes the data and returns a DataFrame called ```dict_edits_byid```. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/edits.csv```

The format of the DataFrame is as such:
 * date : Timestamp of the day
 * page_id : ID of the edited page
 * user_kind : User kind: account, anonymous or bot 
 * event_user_id : Number of edits
 * revision_text_bytes : Sum of the sizes, in bytes, of the revisions made
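
The grouping can be sketched on a few made-up revisions. The classification is written with `np.select`, a vectorised equivalent of the rule used in this notebook: no user id means anonymous, a non-empty `event_user_is_bot_by` means bot, anything else is an account.

```python
import numpy as np
import pandas as pd

# Toy revisions (made-up values); a missing user id marks an anonymous edit
revs = pd.DataFrame({
    "date": ["2002-12-05"] * 4,
    "page_id": [5, 5, 5, 9],
    "event_user_id": [1.0, np.nan, np.nan, 7.0],
    "event_user_is_bot_by": [np.nan, np.nan, np.nan, "name"],
    "revision_text_bytes": [24, 809, 1010, 50],
})

# The first matching condition wins, so anonymous is checked before bot
revs["user_kind"] = np.select(
    [revs["event_user_id"].isna(), revs["event_user_is_bot_by"].notna()],
    ["anonymous", "bot"],
    default="account",
)

# 'size' counts rows per group, including rows whose user id is missing
edits = revs.groupby(["date", "page_id", "user_kind"]).agg(
    {"event_user_id": "size", "revision_text_bytes": "sum"})
print(edits)
```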

In [24]:
try:
    
    # Data type conversion
    revision_df = revision_df.convert_dtypes()

    # Choose revisions
    create_revision_mask = (revision_df.event_entity=='revision') & (revision_df.event_type == 'create')
    # Namespace 0 selects edits to articles
    ns_mask = revision_df.page_namespace == 0
    
    # .copy() so that the column assignments below modify an independent DataFrame
    revision_df = revision_df[create_revision_mask & ns_mask].copy()

    revision_df['revision_text_bytes'] = pd.to_numeric(revision_df['revision_text_bytes'], errors='coerce').fillna(0)
    revision_df['event_timestamp'] = pd.to_datetime(revision_df['event_timestamp'],  utc = True, errors = 'coerce')
    
    revision_df["date"] = revision_df.event_timestamp.dt.strftime("%Y-%m-%d")

    # Get user kinds: no user id -> anonymous, bot flag present -> bot, otherwise account
    revision_df['user_kind'] = np.select(
        [revision_df.event_user_id.isna(), revision_df.event_user_is_bot_by.notna()],
        ['anonymous', 'bot'], default='account')

    # group by date, page_id, user_kind

    dict_edits_byid = revision_df.groupby(['date', 'page_id', 'user_kind']).agg(
        {'event_user_id': 'size', 'revision_text_bytes': 'sum'})

except Exception as e:
    logging.error(f'Error: {str(e)}')  

In [84]:
dict_edits_byid.to_csv(f'{DATA_PATH}/processed_data/edits.csv')

### Get daily number of edits by user_kind

Processes the data and returns a DataFrame called ```daily_edits```. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/daily_edits.csv```

The format of the DataFrame is as such:
 * date : Timestamp of the day
 * user_kind : User kind: account, anonymous or bot 
 * number_of_edits : Number of edits
 * total_edited_bytes : Sum of ```revision_text_bytes``` over those edits

In [26]:
try:

    dict_edits_byid = pd.read_csv(f'{DATA_PATH}/processed_data/edits.csv')
    daily_edits = dict_edits_byid.groupby(['date', 'user_kind']).agg({'event_user_id': 'sum', 'revision_text_bytes': 'sum'})
    daily_edits = daily_edits.reset_index()
    daily_edits['date'] = pd.to_datetime(daily_edits['date'],   utc = True)
    daily_edits.columns = ['date', 'user_kind', 'number_of_edits', 'total_edited_bytes']
    daily_edits.to_csv(f'{DATA_PATH}/processed_data/daily_edits.csv')
    
except Exception as e:
    logging.error(f'Error: {str(e)}')  

In [28]:
daily_edits.head()

Unnamed: 0,date,user_kind,number_of_edits,total_edited_bytes
0,2002-12-05 00:00:00+00:00,account,1,24
1,2002-12-05 00:00:00+00:00,anonymous,2,1819
2,2002-12-13 00:00:00+00:00,anonymous,2,1781
3,2002-12-16 00:00:00+00:00,anonymous,2,4766
4,2002-12-17 00:00:00+00:00,anonymous,1,4310


### Get raw edits coming from registered accounts

Processes the data and returns a DataFrame called ```revision_df``` containing raw edit data of edits from registered accounts. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/account_edits.csv```

In [36]:
try:
    # on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
    revision_df = pd.read_csv(f'{RES_PATH}/revision_events.tsv.gz', sep="\t", usecols=[
       'event_entity', 'event_type', 'event_timestamp',
       'event_user_id', 'event_user_groups', 'event_user_is_bot_by',
       'event_user_revision_count', 'event_user_seconds_since_previous_revision',
       'page_id', 'page_title', 'page_namespace', 'page_revision_count',
       'revision_minor_edit', 'revision_text_bytes', 'revision_text_bytes_diff',
       'revision_is_identity_reverted', 'revision_is_identity_revert'],
       on_bad_lines='warn', compression='gzip')
    
    revision_df = revision_df.convert_dtypes()

    create_revision_mask = (revision_df.event_entity=='revision') & (revision_df.event_type == 'create')
    ns_mask = revision_df.page_namespace == 0
    
    account_mask = (~revision_df.event_user_id.isna()) & (revision_df.event_user_is_bot_by.isna())
    
    # .copy() so that the column assignments below modify an independent DataFrame
    revision_df = revision_df[create_revision_mask & ns_mask & account_mask].copy()

    revision_df['revision_text_bytes'] = pd.to_numeric(revision_df['revision_text_bytes'], errors='coerce').fillna(0)
    revision_df['event_timestamp'] = pd.to_datetime(revision_df['event_timestamp'],  utc = True, errors = 'coerce')
    
    # Keep each column once; listing a column twice would duplicate it in the saved file
    revision_df = revision_df[['event_type', 'event_timestamp', 
       'event_user_id', 'event_user_groups', 'event_user_revision_count',
       'event_user_seconds_since_previous_revision', 'page_id', 
       'event_user_is_bot_by',
       'page_title', 'page_revision_count', 'revision_minor_edit',
       'revision_text_bytes', 'revision_text_bytes_diff','revision_is_identity_revert']]
    
    
    revision_df.to_csv(f'{DATA_PATH}/processed_data/account_edits.csv')

except Exception as e:
    logging.error(f'Error: {str(e)}')  

### Get raw edits with only the relevant columns

Processes the data and returns a DataFrame called ```revision_df``` where all raw edit information is stored. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/all_edits.csv```



In [34]:
try:
    
    # on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
    revision_df = pd.read_csv(f'{RES_PATH}/revision_events.tsv.gz', sep="\t", usecols=[
       'event_entity', 'event_type', 'event_timestamp',
       'event_user_id', 'event_user_text_historical', 'event_user_groups',
       'event_user_is_bot_by', 'event_user_revision_count',
       'event_user_seconds_since_previous_revision',
       'page_id', 'page_title', 'page_namespace', 'page_revision_count',
       'revision_minor_edit', 'revision_text_bytes', 'revision_text_bytes_diff',
       'revision_is_identity_reverted', 'revision_is_identity_revert'],
       on_bad_lines='warn', compression='gzip')
    
    
    revision_df = revision_df.convert_dtypes()

    create_revision_mask = (revision_df.event_entity=='revision') & (revision_df.event_type == 'create')
    ns_mask = revision_df.page_namespace == 0
    
    
    # .copy() so that the column assignment below modifies an independent DataFrame
    revision_df = revision_df[create_revision_mask & ns_mask].copy()

    revision_df['event_timestamp'] = pd.to_datetime(revision_df['event_timestamp'],  utc = True, errors = 'coerce')
    
    revision_df = revision_df[['event_type', 'event_timestamp', 
       'event_user_id', 'event_user_text_historical', 'page_id', 'revision_minor_edit',
       'revision_is_identity_revert', 'revision_is_identity_reverted']]
    
    revision_df.to_csv(f'{DATA_PATH}/processed_data/all_edits.csv')

except Exception as e:
    logging.error(f'Error: {str(e)}')  

## 3) Reverts

In [None]:
# on_bad_lines='warn' skips malformed rows and emits a warning (pandas >= 1.3)
revision_df = pd.read_csv(f'{RES_PATH}/revision_events.tsv.gz', sep="\t", usecols=[
   'event_entity', 'event_type', 'event_timestamp',
   'event_user_id', 'event_user_text_historical', 'event_user_groups',
   'event_user_is_bot_by', 'event_user_revision_count',
   'event_user_seconds_since_previous_revision',
   'page_id', 'page_title', 'page_namespace', 'page_revision_count',
   'revision_minor_edit', 'revision_text_bytes', 'revision_text_bytes_diff',
   'revision_is_identity_reverted', 'revision_is_identity_revert'],
   on_bad_lines='warn', compression='gzip')

revision_df['event_timestamp'] = pd.to_datetime(revision_df['event_timestamp'],  utc = True, errors = 'coerce')
    
revision_df["date"] = revision_df.event_timestamp.dt.strftime("%Y-%m-%d")

# Get user kinds: no user id -> anonymous, bot flag present -> bot, otherwise account
revision_df['user_kind'] = np.select(
    [revision_df.event_user_id.isna(), revision_df.event_user_is_bot_by.notna()],
    ['anonymous', 'bot'], default='account')

### Get daily number of reverted edits or revert edits

Processes the data and returns DataFrames called ```df_reverted``` and ```df_reverts```. ```df_reverted``` corresponds to all edits that were reverted later on, and ```df_reverts``` corresponds to all edits that are reverts. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/...```.

The format of the DataFrames is as such:
 * date : Timestamp of the day
 * user_kind : User kind
 * revision_is_identity_reverted/revision_is_identity_revert: Number of edits that were reverted later, or that were reverts, on the corresponding day by the corresponding user_kind.
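
Because only matching rows are counted, some (date, user_kind) pairs are missing from the grouped result. The sketch below, on made-up counts, shows how reindexing against the full product of dates and user kinds fills those gaps with 0.

```python
import pandas as pd

# Toy daily counts with gaps: some (date, user_kind) pairs are missing
counts = pd.Series(
    [3, 1],
    index=pd.MultiIndex.from_tuples(
        [("2020-01-01", "account"), ("2020-01-03", "bot")], names=["date", "user_kind"]),
)
all_dates = ["2020-01-01", "2020-01-02", "2020-01-03"]

# Every combination of date and observed user kind
full_index = pd.MultiIndex.from_product(
    [all_dates, counts.index.levels[1]], names=["date", "user_kind"])
filled = counts.reindex(full_index, fill_value=0)
print(filled)
```

Only user kinds that occur at least once in the counts appear in `levels[1]`, so a kind with no matching rows at all would still be absent after reindexing.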

In [41]:
try:
    
    #get reverts per day as well as reverted
    df_reverted = revision_df[revision_df['revision_is_identity_reverted'] == True].groupby(['date', 'user_kind'])['revision_is_identity_reverted'].size()
    df_reverts = revision_df[revision_df['revision_is_identity_revert'] == True].groupby(['date', 'user_kind'])['revision_is_identity_revert'].size()

    # reindex so all dates are filled
    df_reverted = df_reverted.reindex(
        pd.MultiIndex.from_product([revision_df.date.unique(), df_reverted.index.levels[1]], names=['date', 'user_kind']), fill_value=0)
    df_reverts = df_reverts.reindex(
        pd.MultiIndex.from_product([revision_df.date.unique(), df_reverts.index.levels[1]], names=['date', 'user_kind']), fill_value=0)


except Exception as e:
    logging.error(f'Error: {str(e)}')  

In [None]:
df_reverted.to_csv(f'{DATA_PATH}/processed_data/df_reverted.csv')
df_reverts.to_csv(f'{DATA_PATH}/processed_data/df_reverts.csv')

### Get daily number of reverted edits or revert edits by page id

Processes the data and returns two DataFrames called ```df_reverted_pid``` and ```df_reverts_pid```. The DataFrames are the same as above only that they are also grouped by page_id. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/..```

In [53]:
try:
    
    # get reverts per day as well as reverted
    df_reverted_pid = revision_df[revision_df['revision_is_identity_reverted'] == True].groupby(['date','page_id', 'user_kind'])['revision_is_identity_reverted'].size()
    df_reverts_pid = revision_df[revision_df['revision_is_identity_revert'] == True].groupby(['date', 'page_id','user_kind'])['revision_is_identity_revert'].size()


except Exception as e:
    logging.error(f'Error: {str(e)}')  

In [103]:
df_reverted_pid.to_csv(f'{DATA_PATH}/processed_data/df_reverted_by_pageid.csv')
df_reverts_pid.to_csv(f'{DATA_PATH}/processed_data/df_reverts_by_pageid.csv')

### Get revert rate
Revert rate is defined as the ratio of reverts to all edits coming from non-bot accounts in a given time frame. It is a measure of the amount of conflict among editors. Here we calculate the revert rate on a daily basis. The data is saved at ```/dlabdata1/turkish_wiki/processed_data/revert_rate.csv```
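
The computation can be sketched on made-up daily totals. For simplicity this sketch combines the two tables with an index `join` (an assumption made here for illustration) and then places the rate on a continuous daily index.

```python
import pandas as pd

# Toy daily totals (made-up values), indexed by date
daily_non_bot_edits = pd.DataFrame(
    {"event_user_id": [40, 10]},
    index=pd.Index(["2020-01-01", "2020-01-03"], name="date"))
daily_identity_reverts = pd.DataFrame(
    {"revision_is_identity_revert": [4, 5]},
    index=pd.Index(["2020-01-01", "2020-01-03"], name="date"))

# Align both tables on the date index and take the ratio
rate = daily_identity_reverts.join(daily_non_bot_edits, how="outer")
rate["revert_rate"] = rate["revision_is_identity_revert"] / rate["event_user_id"]

# Continuous daily index; days without data are filled with 0
rate.index = pd.to_datetime(rate.index, utc=True)
idx = pd.date_range(rate.index.min(), rate.index.max())
rate = rate[["revert_rate"]].reindex(idx, fill_value=0)
print(rate)
```

With this fill value, a day without any recorded edits is indistinguishable from a day with edits but no reverts, which is worth keeping in mind when interpreting the series.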



In [54]:
edits = pd.read_csv(f'{DATA_PATH}/processed_data/edits.csv')
reverts = pd.read_csv(f'{DATA_PATH}/processed_data/df_reverts.csv')

In [55]:
daily_non_bot_edits = edits[edits['user_kind'] != 'bot'].groupby(['date'])[['event_user_id']].sum()
daily_identity_reverts = reverts.groupby(['date'])[['revision_is_identity_revert']].sum()
revert_rate = pd.merge(daily_identity_reverts, daily_non_bot_edits, on = 'date', how = 'outer')
revert_rate['revert_rate'] = revert_rate['revision_is_identity_revert']/revert_rate['event_user_id']
revert_rate =revert_rate.reset_index()

In [56]:
revert_rate['date'] = pd.to_datetime(revert_rate['date'], utc=True)

In [57]:
revert_rate = revert_rate.set_index(['date'])

In [58]:
revert_rate = revert_rate[['revert_rate']]

In [59]:
idx = pd.date_range(revert_rate.index.min(), revert_rate.index.max())

In [60]:
revert_rate = revert_rate.reindex(idx, fill_value=0)

In [40]:
revert_rate.to_csv(f'{DATA_PATH}/processed_data/revert_rate.csv')