# Predict edits to articles in the Tropical Cyclones category

## Goals & Plans

Predict daily edits to pages in jawiki category: Tropical Cyclones (熱帯低気圧)

### Steps

#### ETL to get processed-input dataframes:

- tcyc_page_ids_and_names  (data dump or https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)
- target_dayseries (edit counts)  (data dump or https://wikimedia.org/api/rest_v1/#/)  (timezone?)
- target_dayseries (pageviews) https://wikimedia.org/api/rest_v1/#/Pageviews_data/get_metrics_pageviews
- jp_cyc (landfall data etc)
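
The `(timezone?)` note matters because MediaWiki revision timestamps are in UTC, while a day series compared against Japanese landfall dates would presumably be keyed on JST (UTC+9) calendar days. The following is a minimal sketch under that assumption; the helper name `daily_edit_counts` and the sample timestamps are hypothetical, and only the standard library is used so the idea stays independent of the data source.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset (no daylight saving time)
JST = timezone(timedelta(hours=9), name="JST")

def daily_edit_counts(utc_timestamps, tz=JST):
    """Count edits per local calendar day from ISO-8601 UTC timestamp strings."""
    counts = Counter()
    for ts in utc_timestamps:
        # Timestamps are assumed to look like '2019-10-12T15:30:00Z'
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        counts[dt.astimezone(tz).date().isoformat()] += 1
    return dict(sorted(counts.items()))

# Hypothetical timestamps straddling midnight JST (15:00 UTC)
example = ["2019-10-11T14:59:00Z", "2019-10-11T15:00:00Z", "2019-10-12T03:00:00Z"]
counts_jst = daily_edit_counts(example)
counts_utc = daily_edit_counts(example, tz=timezone.utc)
```

In this example two of the three edits move to a different calendar day depending on the timezone, which is the kind of shift that could bias a lag estimate by a full day.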

#### Quant Analysis

- overall lags
    - Estimate mean/median/mode lags of edits around a tcyc versus landfall
    - T-test (?) for each of the mean/median/mode lags to check whether it differs from zero
- proportions of edits by time category
    - time categories
        - reactive edits
        - in-season edits
        - off-season edits
    - cross reference time-categories with user-categories
        - weather-specialist editors
        - frequent editors
        - infrequent registered editors
        - IP editors (by location?)
    - cross-reference time-categories with different subcategories
        - tcyc in Japan versus storms in other particular continents
        - tcyc-science articles versus tcyc-storm articles
        - compare edit counts based on damages, storm strength
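
The lag test in the "overall lags" item can be sketched as a one-sample t-statistic. This is only an illustration: the function name `one_sample_t` and the lag values are hypothetical (not real data), and the lag is assumed to be measured in days from landfall to each storm's peak edit day.

```python
import math
import statistics

def one_sample_t(lags_days, mu0=0.0):
    """Return (t_statistic, degrees_of_freedom) for H0: mean lag == mu0."""
    n = len(lags_days)
    if n < 2:
        raise ValueError("need at least two lags")
    sd = statistics.stdev(lags_days)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all lags are identical; the t-statistic is undefined")
    t_stat = (statistics.mean(lags_days) - mu0) / (sd / math.sqrt(n))
    return t_stat, n - 1

# Hypothetical lags in days, one per storm; NOT real data
example_lags = [1, 2, 0, 3, 1, 2, -1, 2]
t_stat, dof = one_sample_t(example_lags)
```

Comparing `abs(t_stat)` with the two-sided critical value (about 2.365 at the 5% level for 7 degrees of freedom) gives the decision; `scipy.stats.ttest_1samp` would return a p-value directly if SciPy is available. The t-test addresses the mean only, so the median and mode lags would need something else, such as a sign test or a bootstrap.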

#### Predict

- baseline preds by day
    - get residuals from raw values
    - values could be:
        - editcounts
        - residuals
        - reverted-edits
- predict residual pageviews & edits based on landfall, landfall-severity, lagged vars (including self)
    - plot edit-spread around landfall and predictions
- predict non-tcyc edits
    - with similar features
- interesting questions:
    - Are non-tcyc edits "diverted" to tcyc's?
    - Are edits during crises less/more likely to get quickly reverted?
        - Are users whose first edit occurs during a crisis more or less likely to become long-term contributors?
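
The "baseline preds by day" and "get residuals from raw values" steps could look like the sketch below. It assumes a trailing-mean baseline, which is only one possible choice and not necessarily the one this project will use; the function names and the sample counts are hypothetical.

```python
from statistics import mean

def trailing_mean_baseline(values, window=7):
    """Baseline for day i is the mean of the previous `window` days (None until enough history)."""
    return [mean(values[i - window:i]) if i >= window else None
            for i in range(len(values))]

def residuals(values, preds):
    """Actual minus baseline; None wherever no baseline is available."""
    return [None if p is None else v - p for v, p in zip(values, preds)]

# Hypothetical daily edit counts with a spike on day 6; NOT real data
editcount_cat_raw = [3, 3, 3, 3, 3, 3, 21, 12]
editcount_cat_basepreds = trailing_mean_baseline(editcount_cat_raw, window=3)
delta_cat_actual_basepreds = residuals(editcount_cat_raw, editcount_cat_basepreds)
```

Note that the spike on day 6 inflates the baseline for day 7 (9 instead of 3), so the second residual is understated. A trailing median, or a baseline fitted only on days far from any landfall, would be less sensitive to this.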

### Details of processed-input dataframes

- Table **tcyc_page_ids_and_names:**
    - Purpose: Help generate the target_dayseries
    - Includes: One row for each page_id / page name combo in wikiproject
    - Primary key (2-col pk): page_id, page name
    - Other columns: none
    - Sorting: By page_id, then by page name

- Table **target_dayseries:**
    - Purpose: hold raw targets, baseline predictions, and deltas for analysis
    - Includes: One row for every day during period
    - Primary key: edit_day (datetime)
    - Other columns:  
```
editcount_all_raw,        editcount_cat_raw,  
editcount_all_basepreds,     editcount_cat_basepreds,  
delta_all_actual_basepreds,  delta_cat_actual_basepreds,
```

### Sources

- Articles in the Tropical Cyclones category: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))  
- Articles to exclude: 
    - People who died in tcyc's ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E6%B4%9E%E7%88%BA%E4%B8%B8%E4%BA%8B%E6%95%85%E3%81%AE%E7%8A%A0%E7%89%B2%E8%80%85))
    - English-name-articles for names that got assigned to tcyc's: ([jawiki categories](https://ja.wikipedia.org/wiki/Category:%E5%8F%B0%E9%A2%A8%E3%81%AE%E8%8B%B1%E5%90%8D))
- Pageviews:
    - 熱帯低気圧 ([jawiki page](https://ja.wikipedia.org/wiki/%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7)) ([data source](https://pageviews.toolforge.org/?project=ja.wikipedia.org&platform=all-access&agent=user&redirects=0&range=all-time&pages=%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7))
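
For the pageviews source, the Wikimedia REST API exposes a per-article endpoint whose path encodes project, access, agent, article, granularity, and date range. A minimal sketch of building such a URL follows; the helper name `pageviews_url` and the date range are hypothetical, and the defaults mirror the `all-access` / `user` settings in the Toolforge link above. No request is sent here.

```python
from urllib.parse import quote

PAGEVIEWS_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(title, start, end, project="ja.wikipedia.org",
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be percent-encoded
    article = quote(title.replace(" ", "_"), safe="")
    return "/".join([PAGEVIEWS_ROOT, project, access, agent,
                     article, granularity, start, end])

# Hypothetical date range for the main category article
url = pageviews_url("熱帯低気圧", "20190101", "20191231")
```

Fetching the URL (for example with `requests.get`) should be done with a descriptive `User-Agent` header, which Wikimedia asks API clients to send.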

## Next-steps

### imports

In [216]:
import pandas as pd, numpy as np, seaborn as sns, mysql.connector as mysql
import os, psycopg2, sqlalchemy
from pathlib import Path

### read prepped data

In [83]:
damage = pd.read_csv('../data/raw/weather/japan_nii/damage.tsv', sep='\t')
landfall = pd.read_csv('../data/raw/weather/japan_nii/landfall.tsv', sep='\t')

### Get sql data

#### turn on sql in bash

```bash
sudo service mysqld start
netstat -lnp | grep mysql
```

In [160]:
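from getpass import getpass
mysql_superuser = 'root'
# Prompt without echoing, so the password is never stored in the notebook
mysql_su_pass = getpass(f'Enter the MySQL password for user {mysql_superuser}: ')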

#### Initialize a connection and cursor

In [236]:
host = 'localhost'; user = mysql_superuser; passwd = mysql_su_pass
# database='latest_categorylinks' can also be passed here, but not on the first run
cxn = mysql.connect(host=host, user=user, passwd=passwd)
cur = cxn.cursor()

#### Check what's in mysql, pick a database, pick a table

In [237]:
cur.execute('select user();');       print(cur.fetchall())
cur.execute('show databases;');      print(cur.fetchall())

[('root@localhost',)]
[('information_schema',), ('latest_categ',), ('latest_categorylinks',), ('latest_page',), ('mysql',), ('performance_schema',)]


In [238]:
dbname = 'latest_categorylinks'
cur.execute(f'use {dbname};')
cur.execute('select database();');   print(cur.fetchall())
cur.execute('show tables;');         print(cur.fetchall())

[('latest_categorylinks',)]
[('categorylinks',)]


In [244]:
tblname = 'categorylinks'
cur.execute(f'DESCRIBE {tblname};')
schema_categorylinks = pd.DataFrame(cur.fetchall(), columns=['Field','Type','Null','Key','Default','Extra'])
schema_categorylinks;

In [243]:
cur.execute(f'select * from {tblname} limit 10;');   
firstfew = cur.fetchall();   firstfew;

#### Initialize an engine

In [245]:
# Finally, instantiate a SQLAlchemy engine so result sets can be passed into pandas and evaluated here
connection_str = f'mysql+mysqlconnector://{user}:{passwd}@{host}/{dbname}'  # port omitted (was host + ':' + dbport)
try:
    engine1 = sqlalchemy.create_engine(connection_str)
    conn1 = engine1.connect()
except sqlalchemy.exc.SQLAlchemyError:
    print('Database connection error - check creds')
    raise  # stop here; continuing would fail on the undefined engine
metadata = sqlalchemy.MetaData()
metadata.reflect(bind=engine1)  # binding at reflect time also works on SQLAlchemy 2.x
metadata.tables.keys()

dict_keys(['categorylinks'])

In [None]:
# group by
#     event
# order by
#     count(distinct username) desc


In [249]:
sql_eda2 = """
select
    count(*)
from
    categorylinks
;
"""
pd.read_sql(sql_eda2,engine1)

| | count(*) |
|---|---|
| 0 | 171083519 |


## Finding subcategories via categorylinks

https://stackoverflow.com/questions/21782410/finding-subcategories-of-a-wikipedia-category-using-category-and-categorylinks-t
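
The approach in that question amounts to walking the category graph: look up the members of a category, keep the articles, and repeat for each subcategory. The sketch below shows the traversal only. It assumes the dump uses the classic `cl_from` / `cl_to` / `cl_type` columns (check `schema_categorylinks`) and that titles can be resolved from the `page` table; `walk_category`, `fetch_members`, and the in-memory `fake_graph` are hypothetical names used for illustration.

```python
from collections import deque

# One possible query behind fetch_members (assumed schema; binary columns
# may come back as bytes and need decoding):
#   SELECT p.page_id, p.page_title, cl.cl_type
#   FROM latest_categorylinks.categorylinks AS cl
#   JOIN latest_page.page AS p ON p.page_id = cl.cl_from
#   WHERE cl.cl_to = %s

def walk_category(root, fetch_members, max_depth=5):
    """Breadth-first walk; returns {page_id: title} for articles under `root`."""
    pages, seen, queue = {}, {root}, deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        for page_id, title, cl_type in fetch_members(cat):
            if cl_type == "page":
                pages[page_id] = title
            elif cl_type == "subcat" and title not in seen and depth < max_depth:
                seen.add(title)  # guards against cycles in the category graph
                queue.append((title, depth + 1))
    return pages

# Tiny in-memory stand-in for the database, including a cycle (A -> B -> A)
fake_graph = {
    "A": [(1, "Page one", "page"), (10, "B", "subcat")],
    "B": [(2, "Page two", "page"), (11, "A", "subcat")],
}
all_pages = walk_category("A", lambda cat: fake_graph.get(cat, []))
top_level_only = walk_category("A", lambda cat: fake_graph.get(cat, []), max_depth=0)
```

Passing the lookup as a function keeps the traversal testable without a database; the depth limit is there because Wikipedia category trees can drift far from the original topic. The excluded categories listed under Sources could be skipped by filtering inside `fetch_members`.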

# Failed methods

## scrape page information

Failed because Wikipedia's [Extension: CategoryTree](https://www.mediawiki.org/wiki/Extension:CategoryTree) loads the contents of subcategories via AJAX, so the static HTML only contains the top level; scraping the full tree would probably require Selenium.
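
One alternative that avoids both scraping and Selenium is the MediaWiki Action API, whose `list=categorymembers` query returns the pages and subcategories of a category as JSON. This was not tried in this notebook; the sketch below only builds the request URL, and the helper name `categorymembers_url` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://ja.wikipedia.org/w/api.php"

def categorymembers_url(category_title, cmtype="page|subcat", limit=500, cmcontinue=None):
    """Build an Action API URL listing the members of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,  # must include the 'Category:' prefix
        "cmtype": cmtype,           # which member types to return
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        # Continuation token from the previous response, for paging
        params["cmcontinue"] = cmcontinue
    return f"{API_ENDPOINT}?{urlencode(params)}"

members_url = categorymembers_url("Category:熱帯低気圧")
```

Each subcategory returned would need its own request, so the same recursive walk applies as with the SQL dump.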

### scrape page identifiers

In [140]:
import requests
from bs4 import BeautifulSoup

In [85]:
req = requests.get('https://ja.wikipedia.org/wiki/Category:%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7')
bs = BeautifulSoup(req.content, features='html.parser')

#### get pages

In [133]:
pages_in_TropicalCyclones_htmlchunk = bs.find('body').find('div',{'id':'mw-pages'}).find('div',{'class':'mw-category'}).find_all('li')

In [138]:
pages_in_TropicalCyclones = {}
for li in pages_in_TropicalCyclones_htmlchunk:
    pages_in_TropicalCyclones[
            li.find('a').attrs['href']
        ] = li.find('a').attrs['title']

In [139]:
pages_in_TropicalCyclones

{'/wiki/%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7': '熱帯低気圧',
 '/wiki/%E3%82%A6%E3%82%A3%E3%83%AA%E3%83%BC%E3%83%BB%E3%82%A6%E3%82%A3%E3%83%AA%E3%83%BC': 'ウィリー・ウィリー',
 '/wiki/%E3%82%B3%E3%83%AA%E3%82%AA%E3%83%AA%E3%81%AE%E5%8A%9B': 'コリオリの力',
 '/wiki/%E3%82%B5%E3%82%A4%E3%82%AF%E3%83%AD%E3%83%B3': 'サイクロン',
 '/wiki/%E7%A9%8D%E4%B9%B1%E9%9B%B2': '積乱雲',
 '/wiki/%E5%8F%B0%E9%A2%A8': '台風',
 '/wiki/%E5%9C%9F%E7%94%A8%E6%B3%A2': '土用波',
 '/wiki/%E7%86%B1%E5%B8%AF%E6%B3%A2': '熱帯波',
 '/wiki/%E3%83%8F%E3%83%AA%E3%82%B1%E3%83%BC%E3%83%B3': 'ハリケーン',
 '/wiki/%E3%83%8F%E3%83%AA%E3%82%B1%E3%83%BC%E3%83%B3%E3%83%BB%E3%83%8F%E3%83%B3%E3%82%BF%E3%83%BC': 'ハリケーン・ハンター',
 '/wiki/%E8%97%A4%E5%8E%9F%E3%81%AE%E5%8A%B9%E6%9E%9C': '藤原の効果',
 '/wiki/%E5%8F%B0%E9%A2%A8%E3%81%AE%E7%9B%AE': '台風の目',
 '/wiki/%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7%E3%81%AE%E5%91%BD%E5%90%8D': '熱帯低気圧の命名'}

#### get categories

In [152]:
subcats_in_TropicalCyclones_htmlchunk = bs.find('body').find('div',{'id':'mw-subcategories'}).find_all('div',{'class':'CategoryTreeItem'})

In [153]:
subcats_in_TropicalCyclones = {}
for li in subcats_in_TropicalCyclones_htmlchunk:
    subcats_in_TropicalCyclones[
            li.find('a').attrs['href']
        ] = li.find('a').attrs['title']

In [155]:
bs.find('body').find('div',{'id':'mw-subcategories'})

<div id="mw-subcategories">
<h2>下位カテゴリ</h2>
<p>このカテゴリには下位カテゴリ 4 件が含まれており、そのうち以下の 4 件を表示しています。
</p><div class="mw-content-ltr" dir="ltr" lang="ja"><h3>*</h3>
<ul><li><div class="CategoryTreeSection"><div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="地域別の熱帯低気圧"></span> </span> <a href="/wiki/Category:%E5%9C%B0%E5%9F%9F%E5%88%A5%E3%81%AE%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7" title="Category:地域別の熱帯低気圧">地域別の熱帯低気圧</a>‎ <span dir="ltr" title="下位カテゴリ 4 件、ページ 0 件、ファイル 0 件を含んでいます">（4サブカテゴリ）</span></div><div class="CategoryTreeChildren" style="display:none"></div></div></li></ul><h3>さ</h3>
<ul><li><div class="CategoryTreeSection"><div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="サイクロン"></span> </span> <a href="/wiki/Category:%E3%82%B5%E3%82%A4%E3%82%AF%E3%83%AD%E3%83%B3" title="Category:サイクロン">サイクロン</a>‎ <span dir="ltr

In [154]:
subcats_in_TropicalCyclones_htmlchunk

[<div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="地域別の熱帯低気圧"></span> </span> <a href="/wiki/Category:%E5%9C%B0%E5%9F%9F%E5%88%A5%E3%81%AE%E7%86%B1%E5%B8%AF%E4%BD%8E%E6%B0%97%E5%9C%A7" title="Category:地域別の熱帯低気圧">地域別の熱帯低気圧</a>‎ <span dir="ltr" title="下位カテゴリ 4 件、ページ 0 件、ファイル 0 件を含んでいます">（4サブカテゴリ）</span></div>,
 <div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="サイクロン"></span> </span> <a href="/wiki/Category:%E3%82%B5%E3%82%A4%E3%82%AF%E3%83%AD%E3%83%B3" title="Category:サイクロン">サイクロン</a>‎ <span dir="ltr" title="下位カテゴリ 5 件、ページ 5 件、ファイル 0 件を含んでいます">（5サブカテゴリ、5ページ）</span></div>,
 <div class="CategoryTreeItem"><span class="CategoryTreeBullet"><span class="CategoryTreeToggle" data-ct-state="collapsed" data-ct-title="台風"></span> </span> <a href="/wiki/Category:%E5%8F%B0%E9%A2%A8" title="Category:台風">台風</a>‎ <span dir="ltr" tit