# Predict edits to articles in the Tropical Cyclones category

## Goals & Plans

Predict daily edits to pages in jawiki category: Tropical Cyclones (熱帯低気圧)

### Steps

#### ETL to get processed-input dataframes:

- tcyc_page_ids_and_names  (data dump or https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)
- target_dayseries (edit counts)  (data dump or https://wikimedia.org/api/rest_v1/#/)  (timezone?)
- target_dayseries (pageviews) https://wikimedia.org/api/rest_v1/#/Pageviews_data/get_metrics_pageviews
- jp_cyc (landfall data etc)
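
As a rough sketch of the pageviews source listed above, the request URL for one article can be assembled as follows. The helper name `pageviews_url` and its defaults are assumptions made for illustration (the defaults mirror the `all-access` / `user` settings in the pageviews link under Sources); no request is sent here.

```python
from datetime import date
from urllib.parse import quote

# Base path of the per-article pageviews endpoint in the Wikimedia REST API
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(article, start, end, project="ja.wikipedia.org",
                  access="all-access", agent="user", granularity="daily"):
    """Build the request URL for one article's pageviews (hypothetical helper)."""
    # Percent-encode the UTF-8 title; spaces in titles are written as underscores
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([
        PAGEVIEWS_BASE, project, access, agent, title, granularity,
        start.strftime("%Y%m%d"), end.strftime("%Y%m%d"),
    ])

url = pageviews_url("熱帯低気圧", date(2021, 1, 1), date(2021, 1, 31))
print(url)
```

The resulting URL could then be fetched with any HTTP client. The date range above is arbitrary.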

#### Quant Analysis

- overall lags
    - Estimate mean/median/mode lags of edits around a tcyc versus landfall
    - T-test (?) on each of the mean/median/mode lags to check whether it differs from zero
- proportions of edits by time category
    - time categories
        - reactive edits
        - in-season edits
        - off-season edits
    - cross reference time-categories with user-categories
        - weather-specialist editors
        - frequent editors
        - infrequent registered editors
        - IP editors (by location?)
    - cross-reference time-categories with different subcategories
        - tcyc in Japan versus storms in other particular continents
        - tcyc-science articles versus tcyc-storm articles
        - compare edit counts by damages and storm strength
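
A minimal sketch of the "overall lags" step, assuming each lag is measured as edit date minus landfall date in days. The lag values below are made up for illustration, and the one-sample t statistic shown applies to the mean only; the median and mode would probably need a different test.

```python
import math
import statistics

# Hypothetical lags in days: edit date minus landfall date (negative = before landfall)
lags = [-1, 0, 0, 1, 1, 1, 2, 3, 5, 8]

mean_lag = statistics.mean(lags)
median_lag = statistics.median(lags)
mode_lag = statistics.mode(lags)

# One-sample t statistic for the null hypothesis that the mean lag is zero
n = len(lags)
std_err = statistics.stdev(lags) / math.sqrt(n)   # stdev uses the n-1 denominator
t_stat = mean_lag / std_err

print(mean_lag, median_lag, mode_lag, round(t_stat, 2), f"df={n - 1}")
```

The t statistic would then be compared against a t distribution with `n - 1` degrees of freedom, for example via a statistics library.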

#### Predict

- baseline preds by day
    - get residuals from raw values
    - values could be:
        - editcounts
        - residuals
        - reverted-edits
- predict residual pageviews & edits based on landfall, landfall-severity, lagged vars (including self)
    - plot edit-spread around landfall and predictions
- predict non-tcyc edits
    - with similar features
- interesting questions:
    - Are non-tcyc edits "diverted" to tcyc's?
    - Are edits during crises less/more likely to get quickly reverted?
        - Are users whose first edit occurs during a crisis more or less likely to become long-term contributors?
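
One way the lagged variables mentioned above could be built is with `shift` on a day-indexed dataframe. This is only a sketch: the column names `resid_edits` and `landfall`, the two-day lag depth, and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy day series; the values are made up for illustration
days = pd.date_range("2021-09-01", periods=6, freq="D", name="edit_day")
df = pd.DataFrame(
    {"resid_edits": [0.0, 1.0, 4.0, 9.0, 3.0, 1.0],
     "landfall":    [0, 0, 1, 0, 0, 0]},
    index=days,
)

# Lagged copies of each predictor, including the target's own past values
for col in ["resid_edits", "landfall"]:
    for lag in (1, 2):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

# Rows without a full lag history cannot be used for fitting
features = df.dropna()
print(features)
```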

### Details of processed-input dataframes

- Table **tcyc_page_ids_and_names:**
    - Purpose: Help generate the target_dayseries
    - Includes: One row for each page_id / page name combo in wikiproject
    - Primary key (2-col pk): page_id, page name
    - Other columns: none
    - Sorting: By page_id, then by page name

- Table **target_dayseries:**
    - Purpose: hold raw targets, baseline predictions, and deltas for analysis
    - Includes: One row for every day during period
    - Primary key: edit_day (datetime)
    - Other columns:  
```
editcount_all_raw,        editcount_cat_raw,  
editcount_all_basepreds,     editcount_cat_basepreds,  
delta_all_actual_basepreds,  delta_cat_actual_basepreds,
```
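
To illustrate how the three column families relate, here is a small sketch for the `_cat_` columns. The baseline used (mean of the previous three days) is an assumption chosen only to make the example concrete, and the counts are made up.

```python
import pandas as pd

edit_day = pd.date_range("2021-09-01", periods=8, freq="D", name="edit_day")
target_dayseries = pd.DataFrame(
    {"editcount_cat_raw": [2, 4, 3, 3, 20, 12, 5, 3]},   # made-up counts
    index=edit_day,
)

# Assumed baseline: mean of the previous 3 days (shifted so a day never predicts itself)
target_dayseries["editcount_cat_basepreds"] = (
    target_dayseries["editcount_cat_raw"].shift(1).rolling(3).mean()
)

# Delta = actual count minus baseline prediction
target_dayseries["delta_cat_actual_basepreds"] = (
    target_dayseries["editcount_cat_raw"] - target_dayseries["editcount_cat_basepreds"]
)
print(target_dayseries)
```

The first three days have no baseline because the window is not yet full.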

### Sources

- Articles in the Tropical Cyclones category: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))  
- Articles to exclude: 
    - People who died in tcyc's ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E6%B4%9E%E7%88%BA%E4%B8%B8%E4%BA%8B%E6%95%85%E3%81%AE%E7%8A%A0%E7%89%B2%E8%80%85))
    - English-name-articles for names that got assigned to tcyc's: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E5%8F%B0%E9%A2%A8%E3%81%AE%E8%8B%B1%E5%90%8D))
- Pageviews:
    - 熱帯低気圧 ([jawiki page](https://ja.wikipedia.org/wiki/%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)) ([data source](https://pageviews.toolforge.org/?project=ja.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))

## Get data

### data prep imports

In [1]:
import pandas as pd, numpy as np, seaborn as sns, mysql.connector as mysql
import os, re, sqlalchemy, datetime
# from pathlib import Path
import pickle

### read prepped data

In [2]:
damage = pd.read_csv('../data/raw/weather/japan_nii/damage.tsv', sep='\t')
landfall = pd.read_csv('../data/raw/weather/japan_nii/landfall.tsv', sep='\t')

### Import jawiki into sql

#### table list

- unified tables:
    - category
    - categorylinks
    - page
- revisions tables:
    - (each year)

#### import script

Import these tables into a single MySQL database: "jawiki"  
```bash
mysql --user=root --password=XXXXXXXX
```

```SQL
CREATE DATABASE jawiki CHARACTER SET utf8 COLLATE utf8_bin;
USE jawiki;
GRANT ALL PRIVILEGES ON jawiki.* TO bhrdwj@localhost IDENTIFIED BY 'XXXXXXX';
EXIT
```

```bash
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-category.sql > output.tab
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-page.sql > output2.tab
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-categorylinks.sql > output3.tab
```

*See notebook 1.15 for importing history tables*

### Get sql data

#### turn on sql in bash

```bash
sudo service mysqld start
netstat -lnp | grep mysql
```

In [6]:
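from getpass import getpass
mysql_superuser = 'root'
# Prompt for the password without echoing it; the next cell needs mysql_su_pass
mysql_su_pass = getpass(f'Enter the MySQL password for user {mysql_superuser}: ')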

#### Initialize a connection and cursor

In [7]:
host='localhost'; user=mysql_superuser; passwd=mysql_su_pass; dbname='jawiki';
cxn = mysql.connect(host=host,user=user,passwd=passwd, database=dbname) # not the first time around
cur = cxn.cursor()

#### Check what's in mysql, pick a database (SILENCED)

#### Initialize an engine

In [8]:
# Instantiate a SQLAlchemy engine so result sets can be passed into pandas and evaluated here
connection_str = 'mysql+mysqlconnector://'+user+':'+passwd+'@'+host+'/'+dbname  # port omitted (was +':'+dbport after host)
try:
    engine1 = sqlalchemy.create_engine(connection_str)
    conn1 = engine1.connect()
except sqlalchemy.exc.SQLAlchemyError:
    print('Database connection error - check creds')
    raise  # stop here instead of failing later on an undefined conn1
# Reflect the table definitions through the engine
metadata = sqlalchemy.MetaData()
metadata.reflect(bind=engine1)
metadata.tables.keys()

dict_keys(['category', 'categorylinks', 'page'])
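
A password containing characters such as `@` or `/` would break a connection string built by plain concatenation. A minimal sketch of one workaround, assuming a hypothetical helper `build_connection_str` and a made-up password:

```python
from urllib.parse import quote_plus

def build_connection_str(user, passwd, host, dbname, driver="mysql+mysqlconnector"):
    """Assemble a database URL, percent-encoding the user name and password (hypothetical helper)."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(passwd)}@{host}/{dbname}"

# A made-up password with characters that would otherwise be read as URL delimiters
print(build_connection_str("root", "p@ss/word", "localhost", "jawiki"))
```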

#### Count rows in a table (SILENCED)

#### Show table head (SILENCED)

#### Get tables' schemas

In [9]:
tblnames = list(metadata.tables.keys())
schemas = {}
for tn in tblnames:
    schemas[tn] = pd.read_sql(f'DESCRIBE {tn};', engine1)
    print(tn)
    display(schemas[tn])

category


| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | cat_id | int(10) unsigned | NO | PRI | | auto_increment |
| 1 | cat_title | varbinary(255) | NO | UNI | | |
| 2 | cat_pages | int(11) | NO | MUL | 0.0 | |
| 3 | cat_subcats | int(11) | NO | | 0.0 | |
| 4 | cat_files | int(11) | NO | | 0.0 | |


categorylinks


| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | cl_from | int(8) unsigned | NO | PRI | 0 | |
| 1 | cl_to | varbinary(255) | NO | PRI | | |
| 2 | cl_sortkey | varbinary(230) | NO | | | |
| 3 | cl_timestamp | timestamp | NO | | current_timestamp() | on update current_timestamp() |
| 4 | cl_sortkey_prefix | varbinary(255) | NO | | | |
| 5 | cl_collation | varbinary(32) | NO | MUL | | |
| 6 | cl_type | enum('page','subcat','file') | NO | | page | |


page


| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | page_id | int(8) unsigned | NO | PRI | | auto_increment |
| 1 | page_namespace | int(11) | NO | MUL | 0.0 | |
| 2 | page_title | varbinary(255) | NO | | | |
| 3 | page_restrictions | varbinary(255) | YES | | | |
| 4 | page_is_redirect | tinyint(1) unsigned | NO | MUL | 0.0 | |
| 5 | page_is_new | tinyint(1) unsigned | NO | | 0.0 | |
| 6 | page_random | double unsigned | NO | MUL | 0.0 | |
| 7 | page_touched | varbinary(14) | NO | | | |
| 8 | page_links_updated | varbinary(14) | YES | | | |
| 9 | page_latest | int(8) unsigned | NO | | 0.0 | |


## get all nested subcats and pages in category: 熱帯低気圧

### define functions, gcloud credentials

#### simple sql queries:

In [10]:
def mw_query(colval_dict, tblname, engine=engine1):
    """
    Simple query for a SQL table like pandas "loc", for mediawiki data dumps.
    Accepts: dict of filtering pairs: {colname:val, ...}
    Returns tuple: (df, query)
    Presumes all 'object' cols are bytearrays. (MAKE THIS OPTIONAL LATER.)
    """
    d = colval_dict
    
    query = (
        f'SELECT * FROM {tblname} WHERE ' +
        ' AND '.join([
                f'{col} = (_BINARY "{d[col]}")' 
                    if isinstance(d[col], str) else 
                f'{col} = {d[col]}'
            for col in d
        ]) +
        ';'
    )
    
    selected_rows = pd.read_sql(query, engine)
    selected_rows = decode_df(selected_rows)
    
    return (selected_rows, query)

def decode_df(df, encoding='utf-8'):
    """Presume all 'object'-type cols of a pandas df are cols of bytearrays, and decode them in place."""
    for col in df.select_dtypes(['object']):    # columns that need decoding
        df[col] = df[col].str.decode(encoding)  # honor the encoding argument
    return df
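
Because `mw_query` interpolates values straight into the SQL text, a title containing a double quote would break the query. One possible alternative, sketched here with a hypothetical helper `build_mw_query`, keeps the values out of the SQL string and returns them as named parameters. Table and column names are still interpolated, so they are assumed to come from trusted code.

```python
def build_mw_query(colval_dict, tblname):
    """Return (sql, params) with named placeholders instead of interpolated values (hypothetical helper)."""
    clauses = [f"{col} = :{col}" for col in colval_dict]
    sql = f"SELECT * FROM {tblname} WHERE " + " AND ".join(clauses)
    # varbinary columns compare against bytes, so encode str values as UTF-8
    params = {
        col: val.encode("utf-8") if isinstance(val, str) else val
        for col, val in colval_dict.items()
    }
    return sql, params

# The title below is made up to show a value with an embedded double quote
sql, params = build_mw_query({"page_namespace": 14, "page_title": '台風"'}, "page")
print(sql)
print(params)
```

The pair could then, presumably, be passed along the lines of `pd.read_sql(sqlalchemy.text(sql), engine1, params=params)`.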

#### translate:  ja2en_txt, ja2en_ser

##### Get gcloud credentials

In [12]:
import os, six
from google.cloud import translate_v2 as translate

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/bhrdwj/git/.my-translation-sa_keys.json'
translate_client = translate.Client()

##### functions for translation

In [13]:
translate_client = translate.Client()

def ja2en_txt(el):
    """Translate Japanese text into English with the Cloud Translation v2 API.
    Bytes are decoded as UTF-8 first; non-string values are returned unchanged.
    Language codes must be ISO 639-1 codes.
    See https://g.co/cloud/translate/v2/translate-reference#supported_languages
    """
    if isinstance(el, six.binary_type):
        el = el.decode("utf-8")
    if not isinstance(el, str):
        return el

    # Reuse the module-level client instead of creating a new one on every call
    return translate_client.translate(
        el, target_language='EN', source_language='JA'
    )['translatedText']


def ja2en_ser(ser):
    """
    Maps each element of an object-dtype series with ja2en_txt, which decodes
    bytes and passes non-string elements through; other series are returned as-is.
    """
    assert isinstance(ser,pd.Series)
    if pd.api.types.is_object_dtype(ser):
        return ser.map(ja2en_txt)
    else:
        return ser
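
Since every element is sent to the API separately, repeated titles are translated repeatedly. A rough sketch of translating each distinct string only once is shown below. `translate_unique` and `fake_translate` are hypothetical names; the stand-in translator exists only so the example runs without credentials.

```python
def translate_unique(values, translate_fn):
    """Translate each distinct string once and map results back in order (hypothetical helper)."""
    cache = {}
    out = []
    for v in values:
        if isinstance(v, str):
            if v not in cache:
                cache[v] = translate_fn(v)   # one call per distinct string
            out.append(cache[v])
        else:
            out.append(v)                    # leave non-strings unchanged
    return out

# Stand-in translator that records its calls; a real run would pass ja2en_txt instead
calls = []
def fake_translate(text):
    calls.append(text)
    return {"台風": "Typhoon", "サイクロン": "Cyclone"}.get(text, text)

result = translate_unique(["台風", "サイクロン", "台風", 42], fake_translate)
print(result, len(calls))
```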

### get all subcategories of category 熱帯低気圧

#### Initialize stack and output-collection

In [14]:
topcattitle = '熱帯低気圧'
slxn, query = mw_query({'cat_title': topcattitle}, tblname='category')

# stack of categories with subcats that haven't been queried yet
catstack = [slxn.cat_id.values[0]]

# storing the found categories' ids and titles
# (the key is cat_id for the top category, and the category page's page_id for each subcat found later)
all_subcats = {slxn.cat_id.values[0]: topcattitle}  # {id: title}

In [15]:
display(catstack)

[43024]

#### Add subcats to stack of category page_id's, and pop thru them, until all are found 

In [16]:
while len(catstack) > 0:
    cattitle = all_subcats[catstack.pop()]
    slxn, query = mw_query({'cl_to':cattitle}, tblname='categorylinks')

    # page_ids of the subcategory pages; skip ids already seen so a category cycle cannot loop forever
    cids = [cid for cid in slxn.loc[slxn.cl_type=='subcat'].cl_from.tolist() if cid not in all_subcats]
    catstack = catstack + cids

    cttls = [mw_query({'page_id':cid}, tblname='page')[0].page_title[0] for cid in cids]
    all_subcats.update(dict(zip(cids, cttls)))

In [17]:
print(query)
display(pd.Series(all_subcats))

SELECT * FROM categorylinks WHERE cl_to = (_BINARY "2013年の台風");


43024          熱帯低気圧
159919            台風
626534         サイクロン
626516         ハリケーン
2923547    地域別の熱帯低気圧
             ...    
4113125     1828年の台風
4113116     1924年の台風
4117210     1917年の台風
4164115      989年の台風
4164618     1856年の台風
Length: 149, dtype: object

### get all member-pages of category 熱帯低気圧

In [18]:
all_pages = {}  # {page_id: page_title}

for cat_title in all_subcats.values():
    slxn, query = mw_query({'cl_to':cat_title}, tblname='categorylinks')
    pids = slxn.loc[slxn.cl_type=='page'].cl_from.tolist()
    pttls = [mw_query({'page_id':pid}, tblname='page')[0].page_title[0] for pid in pids]
    all_pages.update(dict(zip(pids, pttls)))

In [19]:
print(query)
display(pd.Series(all_pages))

SELECT * FROM categorylinks WHERE cl_to = (_BINARY "1856年の台風");


26863              サイクロン
94451              ハリケーン
527125             藤原の効果
36650                積乱雲
147548         ウィリー・ウィリー
               ...      
4152789    ヴィッキー_(曖昧さ回避)
1917792              エミー
1006963             アイオン
4098865       令和元年台風第19号
4098869       令和元年台風第15号
Length: 2519, dtype: object
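
The loop above issues one query per page to look up its title. A single JOIN between `categorylinks` and `page` might do the same work in one round trip. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL tables, with only the columns needed and a few toy rows (the ids and memberships are made up).

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables, holding only the columns used here
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
""")
con.executemany("INSERT INTO page VALUES (?, ?)",
                [(1, "台風"), (2, "藤原の効果"), (3, "積乱雲")])   # toy rows
con.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)",
                [(1, "熱帯低気圧", "subcat"),
                 (2, "熱帯低気圧", "page"),
                 (3, "台風", "page")])

# One placeholder per category title, so the titles are bound rather than interpolated
cat_titles = ["熱帯低気圧", "台風"]
placeholders = ", ".join("?" for _ in cat_titles)
rows = con.execute(
    f"""SELECT DISTINCT p.page_id, p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_type = 'page' AND cl.cl_to IN ({placeholders})
        ORDER BY p.page_id""",
    cat_titles,
).fetchall()
all_pages_toy = dict(rows)
print(all_pages_toy)
```

Against the real dump the same query shape would need MySQL's placeholder style and the byte decoding that `decode_df` performs.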

## get time series of revisions to these pages
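
A minimal sketch of turning revision timestamps into a daily edit-count series, assuming the timestamps arrive in MediaWiki's 14-digit UTC format and that days should be counted in Japan Standard Time (the timezone question noted in the plan). The timestamps are made up, and `editcount_cat_raw` is borrowed from the target_dayseries column list.

```python
import pandas as pd

# Made-up revision timestamps in MediaWiki's 14-digit UTC format
rev_timestamps = ["20211009150000", "20211009160500", "20211010023000", "20211012120000"]

ts = pd.to_datetime(pd.Series(rev_timestamps), format="%Y%m%d%H%M%S", utc=True)

# Convert to JST, drop the timezone, and truncate to the calendar day
edit_day = ts.dt.tz_convert("Asia/Tokyo").dt.tz_localize(None).dt.normalize()
counts = edit_day.value_counts()

# Reindex onto every day in the period so days without edits count as zero
full_range = pd.date_range(edit_day.min(), edit_day.max(), freq="D", name="edit_day")
editcount_cat_raw = counts.reindex(full_range, fill_value=0).rename("editcount_cat_raw")
print(editcount_cat_raw)
```

Note that the first two edits fall on 9 October in UTC but on 10 October in JST, so the choice of timezone changes the daily counts.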


# Failed methods

## Get parent-categories. (I need children.) DON'T NEED THIS.

## Scrape page information. (Hard b/c lazy loading.) DATA DUMPS ARE BETTER

Failed because Wikipedia's [Extension: CategoryTree](https://www.mediawiki.org/wiki/Extension:CategoryTree) loads subcategories via AJAX, so scraping it carefully might require Selenium.

### scrape page identifiers