# Predict edits to articles in the Tropical Cyclones category

## Goals & Plans

Predict daily edits to pages in jawiki category: Tropical Cyclones (熱帯低気圧)

### Steps

#### ETL to get processed-input dataframes:

- tcyc_page_ids_and_names  (data dump or https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)
- target_dayseries (edit counts)  (data dump or https://wikimedia.org/api/rest_v1/#/)  (timezone?)
- target_dayseries (pageviews) https://wikimedia.org/api/rest_v1/#/Pageviews_data/get_metrics_pageviews
- jp_cyc (landfall data etc)
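
The pageviews source in the list above can be queried per article and per day. A minimal sketch, assuming the REST API's per-article pageviews route and its `items` / `timestamp` / `views` response fields; the helper names `pageviews_url` and `parse_pageviews` are hypothetical, and the HTTP call itself is left out so the block runs offline:

```python
from urllib.parse import quote
from datetime import datetime

API_ROOT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article'

def pageviews_url(article, start, end, project='ja.wikipedia.org',
                  access='all-access', agent='user'):
    """Build the daily per-article pageviews URL; start/end are YYYYMMDD strings."""
    # Titles use underscores for spaces and must be percent-encoded in the path.
    title = quote(article.replace(' ', '_'), safe='')
    return f'{API_ROOT}/{project}/{access}/{agent}/{title}/daily/{start}/{end}'

def parse_pageviews(payload):
    """Turn the decoded JSON response into a sorted list of (date, views)."""
    rows = []
    for item in payload.get('items', []):
        # Daily timestamps look like 'YYYYMMDD00'; keep only the date part.
        day = datetime.strptime(item['timestamp'][:8], '%Y%m%d').date()
        rows.append((day, item['views']))
    return sorted(rows)
```

The response timestamps are in UTC, which bears on the "(timezone?)" question above: a day boundary in the API is nine hours behind a day boundary in Japan.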

#### Quant Analysis

- overall lags
    - Estimate mean/median/mode lags of edits around a tcyc versus landfall
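    - Test whether each of the mean/median/mode lags differs from zero (a one-sample t-test fits the mean; the median and mode likely need a different test, e.g. a sign test or a bootstrap)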
- proportions of edits by time category
    - time categories
        - reactive edits
        - in-season edits
        - off-season edits
    - cross reference time-categories with user-categories
        - weather-specialist editors
        - frequent editors
        - infrequent registered editors
        - IP editors (by location?)
    - cross-reference time-categories with different subcategories
        - tcyc in Japan versus storms in other particular continents
        - tcyc-science articles versus tcyc-storm articles
        - compare edit counts based on damages and storm strength
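
The "overall lags" step above can be sketched as follows. This assumes a lag is defined as edit day minus landfall day, in whole days (the plan does not fix a definition), and it computes only the t statistic for the mean; the function name `lag_summary` is hypothetical.

```python
import math
import statistics

def lag_summary(lags_days):
    """Summarise edit lags (assumed: edit day minus landfall day, in days)."""
    n = len(lags_days)                         # needs at least 2 observations
    mean = statistics.mean(lags_days)
    sd = statistics.stdev(lags_days)           # sample standard deviation
    t_stat = mean / (sd / math.sqrt(n))        # one-sample t statistic for H0: mean lag = 0
    return {
        'n': n,
        'mean': mean,
        'median': statistics.median(lags_days),
        'mode': statistics.mode(lags_days),    # first mode if several tie (Python 3.8+)
        't_stat': t_stat,
        'dof': n - 1,
    }
```

A p-value for the mean can then come from `scipy.stats.ttest_1samp(lags_days, 0.0)`, if SciPy is available.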

#### Predict

- baseline preds by day
    - get residuals from raw values
    - values could be:
        - editcounts
        - residuals
        - reverted-edits
- predict residual pageviews & edits based on landfall, landfall-severity, lagged vars (including self)
    - plot edit-spread around landfall and predictions
- predict non-tcyc edits
    - with similar features
- interesting questions:
    - Are non-tcyc edits "diverted" to tcyc's?
    - Are edits during crises less/more likely to get quickly reverted?
        - Are users whose first edit occurs during a crisis more or less likely to become long-term contributors?
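
The "baseline preds by day" and "get residuals from raw values" items above can be sketched as below. The trailing-mean baseline is an assumption chosen for illustration; the plan does not say which baseline model is intended, and both function names are hypothetical.

```python
def trailing_baseline(values, window=7):
    """Baseline for day i = mean of the previous `window` values (None until enough history)."""
    baseline = []
    for i in range(len(values)):
        if i < window:
            baseline.append(None)              # not enough history yet
        else:
            past = values[i - window:i]        # excludes day i itself
            baseline.append(sum(past) / window)
    return baseline

def residuals(values, baseline):
    """Residual = raw value minus baseline; None where no baseline exists."""
    return [None if b is None else v - b for v, b in zip(values, baseline)]
```

Because the baseline for a day uses only earlier days, a landfall-driven spike shows up in that day's residual before it starts to inflate later baselines.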

### Details of processed-input dataframes

- Table **tcyc_page_ids_and_names:**
    - Purpose: Help generate the target_dayseries
    - Includes: One row for each page_id / page name combo in the category
    - Primary key (2-col pk): page_id, page name
    - Other columns: none
    - Sorting: By page_id, then by page name

- Table **target_dayseries:**
    - Purpose: Hold raw targets, baseline predictions, and deltas for analysis
    - Includes: One row for every day during period
    - Primary key: edit_day (datetime)
    - Other columns:  
```
editcount_all_raw,        editcount_cat_raw,  
editcount_all_basepreds,     editcount_cat_basepreds,  
delta_all_actual_basepreds,  delta_cat_actual_basepreds,
```
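
The "one row for every day" requirement and the `edit_day` key can be sketched as below. This assumes edit timestamps arrive as UTC MediaWiki strings (`YYYYMMDDHHMMSS`, as in `page_touched`), for example from a revision table that this notebook has not loaded, and that days are bucketed in Japan Standard Time. Both function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))   # assumption: bucket days in Japan Standard Time

def edit_day(mw_timestamp, tz=JST):
    """Convert a UTC MediaWiki timestamp (YYYYMMDDHHMMSS) to a calendar date in tz."""
    utc = datetime.strptime(mw_timestamp, '%Y%m%d%H%M%S').replace(tzinfo=timezone.utc)
    return utc.astimezone(tz).date()

def dayseries(mw_timestamps, first_day, last_day, tz=JST):
    """Return [(day, edit_count)] with one row per day, including zero-edit days."""
    counts = Counter(edit_day(ts, tz) for ts in mw_timestamps)
    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]
```

Filling zero-edit days explicitly keeps the series gap-free, so lagged features line up by position.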

### Sources

- Articles in the Tropical Cyclones category: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))  
- Articles to exclude: 
    - People who died in tcyc's ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E6%B4%9E%E7%88%BA%E4%B8%B8%E4%BA%8B%E6%95%85%E3%81%AE%E7%8A%A0%E7%89%B2%E8%80%85))
    - English-name-articles for names that got assigned to tcyc's: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E5%8F%B0%E9%A2%A8%E3%81%AE%E8%8B%B1%E5%90%8D))
- Pageviews:
    - 熱帯低気圧 ([jawiki page](https://ja.wikipedia.org/wiki/%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)) ([data source](https://pageviews.toolforge.org/?project=ja.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))

## Next-steps

### imports

In [88]:
import pandas as pd, numpy as np, seaborn as sns, mysql.connector as mysql
import os, re, sqlalchemy, datetime
from pathlib import Path

### read prepped data

In [2]:
damage = pd.read_csv('../data/raw/weather/japan_nii/damage.tsv', sep='\t')
landfall = pd.read_csv('../data/raw/weather/japan_nii/landfall.tsv', sep='\t')

### Import jawiki: {category, categorylinks, page}

Import these tables into a single MySQL database: "jawiki"

```bash
mysql --user=root --password=XXXXXXXX
```

```SQL
CREATE DATABASE jawiki CHARACTER SET utf8 COLLATE utf8_bin;
USE jawiki;
GRANT ALL PRIVILEGES ON jawiki.* TO 'bhrdwj'@'localhost' IDENTIFIED BY 'XXXXXXX';
EXIT
```

```bash
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-category.sql > output.tab
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-page.sql > output2.tab
mysql --user=bhrdwj --password=XXXXXXX jawiki < jawiki-20211220-categorylinks.sql > output3.tab

```

### Get sql data

#### turn on sql in bash

```bash
sudo service mysqld start
netstat -lnp | grep mysql
```

In [14]:
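from getpass import getpass
mysql_superuser = 'root'
# Prompt without echoing the password; mysql_su_pass is needed by the connection cell below
mysql_su_pass = getpass(f'Enter the MySQL password for user {mysql_superuser}: ')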

#### Initialize a connection and cursor

In [71]:
host = 'localhost'; user = mysql_superuser; passwd = mysql_su_pass; dbname = 'jawiki'
# Pass database= only once the jawiki database exists (omit it on the very first run)
cxn = mysql.connect(host=host, user=user, passwd=passwd, database=dbname)
cur = cxn.cursor()

#### Check what's in mysql, pick a database ***(Unnecessary after raw tables fully loaded)***

#### Initialize an engine

In [72]:
# Instantiate a SQLAlchemy engine so result sets can be passed into pandas and evaluated here
connection_str = f'mysql+mysqlconnector://{user}:{passwd}@{host}/{dbname}'  # default port, so no ':'+dbport after host
try:
    engine1 = sqlalchemy.create_engine(connection_str)
    conn1 = engine1.connect()
except sqlalchemy.exc.SQLAlchemyError:
    print('Database connection error - check creds')
    raise  # stop here so later cells don't fail on an undefined engine
metadata = sqlalchemy.MetaData()
metadata.reflect(bind=engine1)  # MetaData(bind) was removed in SQLAlchemy 2.0; reflect(bind=...) works in 1.x and 2.x
metadata.tables.keys()

dict_keys(['category', 'categorylinks', 'page'])

#### Count rows in a table
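
The cell for this step is not present in the notebook export. A minimal sketch of one way to do it, assuming any DB-API cursor such as `cur` from the connection cell; the helper name `count_rows` is hypothetical:

```python
import re

def count_rows(cursor, tblname):
    """Return the number of rows in tblname using any DB-API cursor."""
    # Table names cannot be bound as query parameters, so validate before interpolating.
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', tblname):
        raise ValueError(f'unexpected table name: {tblname!r}')
    cursor.execute(f'SELECT COUNT(*) FROM {tblname}')
    return cursor.fetchone()[0]
```

For example, `{t: count_rows(cur, t) for t in ('category', 'categorylinks', 'page')}` would report all three imported tables at once.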

#### Check a table's head and schema

In [None]:
tblname = 'categorylinks'
cur.execute(f'SELECT * FROM {tblname} LIMIT 10;');   
firstfew = cur.fetchall();   firstfew;

In [173]:
tblname = 'categorylinks'
schema = pd.read_sql(f'DESCRIBE {tblname};', engine1)
schema

| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | cl_from | int(8) unsigned | NO | PRI | 0 | |
| 1 | cl_to | varbinary(255) | NO | PRI | | |
| 2 | cl_sortkey | varbinary(230) | NO | | | |
| 3 | cl_timestamp | timestamp | NO | | current_timestamp() | on update current_timestamp() |
| 4 | cl_sortkey_prefix | varbinary(255) | NO | | | |
| 5 | cl_collation | varbinary(32) | NO | MUL | | |
| 6 | cl_type | enum('page','subcat','file') | NO | | page | |


## get list of pages in category: 熱帯低気圧

https://stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t

### define function for simple sql queries

In [188]:
def sql_loc(val, colname, tblname, engine=engine1):
    """
    Simple query for a SQL table like pandas "loc".
    Returns tuple: (df, query)
    """
    # query (values are interpolated directly into the SQL, so only pass trusted input)
    get_row_by_val = f'SELECT * FROM {tblname} WHERE {colname}=(_BINARY "{val}");'
    selected_rows = pd.read_sql(get_row_by_val, engine)
    
    # decode bytearrays for display in pandas
    str_df = selected_rows.select_dtypes(['object'])       # get the columns that need decoding
    str_df = str_df.stack().str.decode('utf-8').unstack()   # decode those columns
    for col in str_df:
        selected_rows[col] = str_df[col]                   # replace in original df
    return selected_rows, get_row_by_val

### get pages from page database with title "熱帯低気圧"

In [189]:
selection, query = sql_loc(val='熱帯低気圧', colname='page_title', tblname='page', engine=engine1)
print(query); selection

SELECT * FROM page WHERE page_title=(_BINARY "熱帯低気圧");


| | page_id | page_namespace | page_title | page_restrictions | page_is_redirect | page_is_new | page_random | page_touched | page_links_updated | page_latest | page_len | page_content_model | page_lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31168 | 0 | 熱帯低気圧 | | 0 | 0 | 0.467697 | 20211218082619 | 20211218082615 | 87074895 | 78092 | wikitext | |
| 1 | 626482 | 14 | 熱帯低気圧 | | 0 | 0 | 0.919185 | 20210831034053 | 20210831034134 | 79263384 | 303 | wikitext | |
| 2 | 669114 | 1 | 熱帯低気圧 | | 0 | 0 | 0.275103 | 20210913090002 | 20210804091932 | 84859939 | 3527 | wikitext | |
| 3 | 761911 | 10 | 熱帯低気圧 | | 0 | 0 | 0.868552 | 20210808233839 | 20210808233839 | 84947664 | 1006 | wikitext | |


### get categories named 熱帯低気圧 from category table

In [190]:
selection, query = sql_loc(val='熱帯低気圧', colname='cat_title', tblname='category', engine=engine1)
print(query); selection

SELECT * FROM category WHERE cat_title=(_BINARY "熱帯低気圧");


| | cat_id | cat_title | cat_pages | cat_subcats | cat_files |
|---|---|---|---|---|---|
| 0 | 43024 | 熱帯低気圧 | 17 | 4 | 0 |


### get subcategories and member-pages of category 熱帯低気圧

In [192]:
selection, query = sql_loc(val='熱帯低気圧', colname='cl_to', tblname='categorylinks', engine=engine1)
print(query); selection

SELECT * FROM categorylinks WHERE cl_to=(_BINARY "熱帯低気圧");


| | cl_from | cl_to | cl_sortkey | cl_timestamp | cl_sortkey_prefix | cl_collation | cl_type |
|---|---|---|---|---|---|---|---|
| 0 | 26863 | 熱帯低気圧 | さいくろん\nサイクロン | 2006-08-02 15:43:31 | さいくろん | uppercase | page |
| 1 | 94451 | 熱帯低気圧 | はりけん\nハリケーン | 2006-08-02 15:44:30 | はりけん | uppercase | page |
| 2 | 159919 | 熱帯低気圧 | たいふう\n台風 | 2006-08-02 15:50:18 | たいふう | uppercase | subcat |
| 3 | 626534 | 熱帯低気圧 | さいくろん\nサイクロン | 2006-08-02 16:25:55 | さいくろん | uppercase | subcat |
| 4 | 527125 | 熱帯低気圧 | ふしわらのこうか\n藤原の効果 | 2006-08-02 16:42:54 | ふしわらのこうか | uppercase | page |
| 5 | 36650 | 熱帯低気圧 | せきらんうん\n積乱雲 | 2006-09-07 10:39:53 | せきらんうん | uppercase | page |
| 6 | 147548 | 熱帯低気圧 | ういりういり\nウィリー・ウィリー | 2007-09-15 14:34:59 | ういりういり | uppercase | page |
| 7 | 15090 | 熱帯低気圧 | たいふう\n台風 | 2008-07-25 16:55:05 | たいふう | uppercase | page |
| 8 | 31168 | 熱帯低気圧 | \*\n熱帯低気圧 | 2008-12-27 13:15:02 | \* | uppercase | page |
| 9 | 1682974 | 熱帯低気圧 | はりけんはんた\nハリケーン・ハンター | 2009-02-07 05:48:43 | はりけんはんた | uppercase | page |


# GET NESTED CATEGORIES AND MEMBER-PAGES!!!!   😼
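
A minimal sketch of one way to walk the nested categories, not the notebook's own implementation. It relies on the MediaWiki schema, where `cl_from` is the page_id of the member and `cl_to` is the parent category's title, so a subcategory's own title has to be looked up from the `page` table (namespace 14). The function name `walk_category`, the `exclude` parameter, and the `max_depth` cap are hypothetical; the inputs are plain Python so the traversal can be checked without a database.

```python
from collections import deque

def walk_category(root_title, categorylinks, cat_titles, exclude=frozenset(), max_depth=5):
    """
    Breadth-first walk of a category tree.
    categorylinks: iterable of (cl_from, cl_to, cl_type) rows
    cat_titles:    dict mapping page_id -> page_title for namespace-14 pages
    Returns (set of category titles visited, set of member page_ids).
    """
    # Index the rows by parent category title.
    children = {}
    for cl_from, cl_to, cl_type in categorylinks:
        children.setdefault(cl_to, []).append((cl_from, cl_type))

    seen_cats = {root_title}
    member_pages = set()
    queue = deque([(root_title, 0)])
    while queue:
        cat, depth = queue.popleft()
        for cl_from, cl_type in children.get(cat, []):
            if cl_type == 'page':
                member_pages.add(cl_from)
            elif cl_type == 'subcat' and depth < max_depth:
                sub = cat_titles.get(cl_from)
                # seen_cats guards against cycles in the category graph
                if sub is not None and sub not in seen_cats and sub not in exclude:
                    seen_cats.add(sub)
                    queue.append((sub, depth + 1))
    return seen_cats, member_pages
```

The rows could come from `SELECT cl_from, cl_to, cl_type FROM categorylinks` and `SELECT page_id, page_title FROM page WHERE page_namespace = 14`, decoded to `str` first. The `exclude` set is where the subcategories listed under "Articles to exclude" would go.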

# Failed methods

## scrape page information

Failed because Wikipedia's [Extension: CategoryTree](https://www.mediawiki.org/wiki/Extension:CategoryTree) loads nested subcategories via AJAX, so a static `requests` fetch only sees the top level; scraping the full tree would probably require Selenium.
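
One alternative to scraping rendered HTML, not tried in this notebook, is the MediaWiki Action API's `list=categorymembers` query, which returns members as JSON. A minimal sketch, with hypothetical helper names and the HTTP call left out so the block runs offline:

```python
def categorymembers_params(category_title, cmcontinue=None, limit=500):
    """Query parameters for one page of category members (pages and subcategories)."""
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': f'Category:{category_title}',
        'cmtype': 'page|subcat',
        'cmlimit': str(limit),
        'format': 'json',
    }
    if cmcontinue:
        params['cmcontinue'] = cmcontinue      # continuation token from the previous response
    return params

def split_members(payload):
    """Split one decoded API response into (pages, subcats, next_cmcontinue)."""
    members = payload.get('query', {}).get('categorymembers', [])
    pages = [(m['pageid'], m['title']) for m in members if m['ns'] != 14]
    subcats = [m['title'] for m in members if m['ns'] == 14]   # namespace 14 = Category
    next_token = payload.get('continue', {}).get('cmcontinue')
    return pages, subcats, next_token
```

A call such as `requests.get('https://ja.wikipedia.org/w/api.php', params=categorymembers_params('熱帯低気圧')).json()` would supply the payload; repeating the request with the returned token until it is `None` pages through large categories.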

### scrape page identifiers

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
req = requests.get('https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7')
bs = BeautifulSoup(req.content, features='html.parser')

#### get pages

In [None]:
pages_in_TropicalCyclones_htmlchunk = bs.find('body').find('div',{'id':'mw-pages'}).find('div',{'class':'mw-category'}).find_all('li')

In [None]:
pages_in_TropicalCyclones = {}
for li in pages_in_TropicalCyclones_htmlchunk:
    pages_in_TropicalCyclones[
            li.find('a').attrs['href']
        ] = li.find('a').attrs['title']

In [None]:
pages_in_TropicalCyclones

#### get subcategories

In [None]:
subcats_in_TropicalCyclones_htmlchunk = bs.find('body').find('div',{'id':'mw-subcategories'}).find_all('div',{'class':'CategoryTreeItem'})

In [None]:
subcats_in_TropicalCyclones = {}
for li in subcats_in_TropicalCyclones_htmlchunk:
    subcats_in_TropicalCyclones[
            li.find('a').attrs['href']
        ] = li.find('a').attrs['title']

In [None]:
bs.find('body').find('div',{'id':'mw-subcategories'})

In [None]:
subcats_in_TropicalCyclones_htmlchunk