# Collecting revisions from the Wikipedia API

Wikipedia offers several ways to collect its data.
You can experiment with the API through the interface at https://meta.wikimedia.org/wiki/Special:ApiSandbox.
For the examples here, the parameters were chosen in that sandbox and then copied into this notebook.

The articles used as examples are part of the dataset from Professor Daniel Hassan's dissertation.

In [None]:
import pandas as pd

dataset = pd.read_csv('wikipedia_dataset_hasan/wikipedia.csv')
# Keep the first five titles as a plain list of strings, so that titles[0] is a str.
titles = dataset['page_title'].head(5).tolist()
titles

Below is an example of using the Wikipedia API. Setting the "format" parameter to "json" makes the API return JSON instead of XML or HTML; `.json()` then decodes it into a Python dictionary, which is easier to manipulate.

In [None]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
title = titles[0]

params = {
    "action": "query",
    "format": "json",
    "titles": title,
}

response = S.get(url=URL, params=params).json()
response

The values in the _response_ can be accessed like any Python dictionary, using the dict[key] syntax.

In [None]:
query = response["query"]
print(query)

In [None]:
pages = query["pages"] # or response["query"]["pages"]
print(pages)

In [None]:
pages_list = list(pages.values()) # or list(response["query"]["pages"].values()) 
print(pages_list)

first_page_title = pages_list[0]["title"] # or list(response["query"]["pages"].values())[0]["title"]
print(first_page_title)
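
With the default response format, `pages` is keyed by page ID, so the entries have to be iterated unless the ID is already known. A minimal sketch of a helper that returns the first page entry, or `None` when the response has no `query` section. The name `first_page` is hypothetical, and the sample dictionary only mirrors the shape of the responses shown in this notebook:

```python
def first_page(response):
    """Return the first page entry of a query response, or None if absent."""
    pages = response.get("query", {}).get("pages", {})
    # The default format returns a dict keyed by page ID;
    # formatversion=2 returns a list of pages instead.
    entries = list(pages.values()) if isinstance(pages, dict) else list(pages)
    return entries[0] if entries else None


sample = {
    "batchcomplete": "",
    "query": {"pages": {"59643877": {"pageid": 59643877, "ns": 0, "title": "Spoo"}}},
}
print(first_page(sample)["title"])
print(first_page({"error": {"code": "badtimestamp_rvend"}}))
```

Using `.get` with defaults avoids a `KeyError` when the API answers with an error payload instead of a `query` section.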

Revisions can be collected, among other information. The [documentation](https://en.wikipedia.org/w/api.php) is available online.

There is also [another API](https://en.wikipedia.org/api/rest_v1/#/), which has not been explored yet.

In [1]:
from wiki_revision_crawler import get_revisions_info, date_range_monthly, parse_revisions_info_monthly

date_start = "2017-10-01T00:00:00Z"
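# NOTE: the month here is not zero-padded, so the API rejects the value (see the
# badtimestamp_rvend error in the output). The valid form is "2017-08-01T00:00:00Z".
date_end = "2017-8-01T00:00:00Z"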
 
print(date_range_monthly(date_end, date_start))

date_start = "2006-01-03T00:00:00Z"

response = get_revisions_info("Dypsis onilahensis", date_start, date_end)
print(response)

revisions_info = parse_revisions_info_monthly(response, date_start, date_end)


['2017-10-01T00:00:00Z', '2017-09-01T00:00:00Z', '2017-08-01T00:00:00Z']
{'error': {'code': 'badtimestamp_rvend', 'info': 'Invalid value "2017-8-01T00:00:00Z" for timestamp parameter "rvend".', '*': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1287'}


KeyError: 'query'
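
The `badtimestamp_rvend` error above is caused by the unpadded month in `"2017-8-01T00:00:00Z"`, and the `KeyError` follows because an error response has no `query` key. A minimal sketch of two guards that could be applied before and after the request. Both helper names are hypothetical and are not part of `wiki_revision_crawler`:

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def normalize_timestamp(value):
    """Re-emit a timestamp with zero-padded fields, e.g. 2017-8-01 -> 2017-08-01."""
    # strptime accepts a single-digit month; strftime always pads it.
    return datetime.strptime(value, ISO_FORMAT).strftime(ISO_FORMAT)


def require_query(response):
    """Return response["query"], raising a descriptive error for API error payloads."""
    if "error" in response:
        error = response["error"]
        raise RuntimeError(f"API error {error.get('code')}: {error.get('info')}")
    return response["query"]


print(normalize_timestamp("2017-8-01T00:00:00Z"))  # 2017-08-01T00:00:00Z
```

Raising early with the API's own error code makes the failure easier to diagnose than the bare `KeyError`.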

In [4]:
revisions_info

([{'access': '2019-07-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-06-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-05-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-04-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'pare

In [None]:
import datetime
import dateutil.parser

def parse_date(date):
    """Parse date from ISO 8601 format to datetime
 
    Parameters:
        date (str): date in the format ISO 8601: 2001-01-15T14:56:00Z
        
    Returns:
        date (datetime): date parsed

    """
    return dateutil.parser.parse(date)

def format_date(date):
    """Format a datetime as an ISO 8601 string (e.g. 2001-01-15T14:56:00Z)
 
    Parameters:
        date (datetime): date
        
    Returns:
        date (str): ISO 8601 formatted date
    
    """
    s = "%Y-%m-%dT%H:%M:%SZ"
    return date.strftime(s)

dates = pd.date_range("2017-01-03T01:32:59Z", "2017-04-03T01:32:59Z", freq='MS').strftime("%Y-%m-%dT%H:%M:%SZ").tolist()

# Print each month-start timestamp in the range.
for month_start in dates:
    print(month_start)
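
The implementation of `date_range_monthly` from `wiki_revision_crawler` is not shown in this notebook. As a hypothetical stand-in, the sketch below uses only the standard library to list month starts from newest to oldest. It reproduces the list printed earlier for the 2017-08 to 2017-10 range, but it is an assumption about the behavior, not the actual implementation:

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def month_starts_descending(oldest, newest):
    """List the first instant of every month in [oldest, newest], newest first."""
    start = datetime.strptime(oldest, ISO_FORMAT)
    end = datetime.strptime(newest, ISO_FORMAT)
    year, month = start.year, start.month
    if start > datetime(year, month, 1):
        # `oldest` falls inside a month, so begin at the following month start.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    dates = []
    while datetime(year, month, 1) <= end:
        dates.append(datetime(year, month, 1).strftime(ISO_FORMAT))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return dates[::-1]


print(month_starts_descending("2017-08-01T00:00:00Z", "2017-10-01T00:00:00Z"))
```

Returning the newest date first matches the `"rvdir": "older"` direction used by the requests in this notebook.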

In [None]:
import pandas as pd

#revisions = data["query"]["pages"][page_id]["revisions"]
#revisions = list(data["query"]["pages"].values())[0]["revisions"]

# pandas.io.json.json_normalize is deprecated; pandas >= 1.0 exposes it as pd.json_normalize.
pd.json_normalize(revisions_info)

In [11]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

def get_revision(title, access_date):
    # title = "Procellariidae"
    # date = "2017-04-03T01:32:59.000Z"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
       # "rvprop": "timestamp|user|comment|content|ids",
        "rvprop": "content",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 1,
        "rvstart": access_date,
        "rvdir": "older",
    }

    response = S.get(url=URL, params=params)
    
    return response.json()


def parse_revision_response(response, date):
    page = list(response["query"]["pages"])[0]
    revision = list(page["revisions"])[0]
   # return (page["pageid"], page["title"], revision["user"], revision["timestamp"], revision["comment"], revision["slots"]["main"]["content"])
    return revision["slots"]["main"]["content"]

date = "2009-01-03T00:00:00.000Z"
response = get_revision("Procellariidae", date)
parse_revision_response(response, date)
#category = parse_revision_category_content(response)
# page = list(response["query"]["pages"])[0]
# revision = list(page["revisions"])[0]
# content = revision["slots"]["main"]["content"]
# content
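
The request in the cell above combines `rvstart`, `"rvdir": "older"` and `"rvlimit": 1`, which asks for the newest revision at or before `access_date`. When a page's revision metadata has already been downloaded, the same selection could be done locally. A minimal sketch, assuming the revisions are sorted by ascending timestamp and use zero-padded UTC timestamps, so that string order equals chronological order. `revision_at` is a hypothetical helper, and the sample values are taken from the Sesame Street table at the end of this notebook:

```python
from bisect import bisect_right


def revision_at(revisions, access_date):
    """Return the newest revision whose timestamp is <= access_date, else None."""
    timestamps = [revision["timestamp"] for revision in revisions]
    # bisect_right counts how many timestamps are <= access_date.
    count = bisect_right(timestamps, access_date)
    return revisions[count - 1] if count else None


revisions = [
    {"revid": 202293140, "timestamp": "2008-03-31T13:52:53Z"},
    {"revid": 209190520, "timestamp": "2008-04-30T08:36:38Z"},
    {"revid": 216027846, "timestamp": "2008-05-30T19:26:02Z"},
]
print(revision_at(revisions, "2008-05-01T00:00:00Z")["revid"])  # 209190520
```

This avoids one HTTP request per access date once the revision list is available.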

In [7]:
response

{'continue': {'rvcontinue': '20081113210734|251630846', 'continue': '||'},
 'query': {'pages': [{'pageid': 224443,
    'ns': 0,
    'title': 'Procellariidae',
    'revisions': [{'slots': {'main': {'contentmodel': 'wikitext',
        'contentformat': 'text/x-wiki',
        'content': '{{Taxobox\n| name = Procellariidae\n| image = Cape Petrel (Pintado) at Antarctic Convergence Zone.jpg\n| image_width = 250px\n| image_caption = [[Cape Petrel]] \'\'Daption capense\'\'\n| regnum = [[Animal]]ia\n| phylum = [[Chordate|Chordata]]\n| classis = [[bird|Aves]]\n| ordo = [[Procellariiformes]]\n| familia = \'\'\'Procellariidae\'\'\'\n| familia_authority = [[William Elford Leach|Leach]], 1820\n| subdivision_ranks = Genera\n| subdivision = \nSeveral, [[List of Procellariidae]].\n}}\nThe [[family (biology)|family]] \'\'\'Procellariidae\'\'\' is a group of [[seabird]]s that comprises the [[fulmarine petrel]]s, the [[gadfly petrel]]s, the [[prion (bird)|prions]], and the [[shearwater]]s. This family is p

In [27]:
def get_revision_info(title, access_date):
    # title = "Procellariidae"
    # date = "2017-04-03T01:32:59.000Z"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment|ids",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 500,
        "rvstart": access_date,
        "rvdir": "older",
              
    }

    response = S.get(url=URL, params=params)
    
    return response.json()


title = 'Spoo'
date_end = "2009-01-03T00:00:00Z"
get_revision_info(title, date_end)

{'batchcomplete': True,
 'query': {'pages': [{'pageid': 59643877, 'ns': 0, 'title': 'Spoo'}]}}
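
A single request returns at most `rvlimit` revisions. When more exist, the response carries a `continue` block (as in the `Procellariidae` output earlier), and its values must be sent back with the next request. The response above also shows that a page with no revisions in range has no `revisions` key at all. A minimal sketch that handles both cases, assuming `formatversion=2` so that `pages` is a list. `iter_revisions` and `fake_fetch` are hypothetical, and the revision IDs in the fake responses are made up for illustration:

```python
def iter_revisions(fetch, params):
    """Yield revisions from every batch, following the API's `continue` block.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response, for example: lambda p: S.get(url=URL, params=p).json()
    """
    request = dict(params)
    while True:
        response = fetch(request)
        for page in response.get("query", {}).get("pages", []):
            # Pages without revisions in range carry no "revisions" key.
            yield from page.get("revisions", [])
        if "continue" not in response:
            break
        # Merge the continuation values into the original parameters.
        request = {**params, **response["continue"]}


def fake_fetch(request):
    """Stand-in for the HTTP call: two batches with made-up revision IDs."""
    if "rvcontinue" not in request:
        return {
            "continue": {"rvcontinue": "20081113210734|251630846", "continue": "||"},
            "query": {"pages": [{"title": "Example", "revisions": [{"revid": 2}]}]},
        }
    return {
        "batchcomplete": True,
        "query": {"pages": [{"title": "Example", "revisions": [{"revid": 1}]}]},
    }


print(list(iter_revisions(fake_fetch, {"action": "query", "prop": "revisions"})))
```

Passing the fetch function as an argument keeps the loop testable without network access.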

In [28]:
def get_page_redirect(title):
    """Query a page with redirect resolution enabled and return the decoded JSON response."""
    PARAMS = {
        'action': "query",
        'format': "json",
        'titles': title,
       # 'prop': "redirects"
        'redirects' : 1,
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    return DATA

title = 'Spoo'
print(get_page_redirect(title))

# title = 'Variegated fairywren'
# print(get_page_redirect(title))

title = 'List of Danish football champions'
print(get_page_redirect(title))

{'batchcomplete': '', 'query': {'pages': {'59643877': {'pageid': 59643877, 'ns': 0, 'title': 'Spoo'}}}}
{'batchcomplete': '', 'query': {'pages': {'4862862': {'pageid': 4862862, 'ns': 0, 'title': 'List of Danish football champions'}}}}
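
Neither response above contains a `redirects` entry. When the requested title is a redirect and `redirects` is set, the `query` section additionally lists `from`/`to` pairs (and a `normalized` list when the title's spelling was adjusted). A minimal sketch of following those mappings. `resolve_title` is a hypothetical helper, and the titles in the sample are placeholders, not real API output:

```python
def resolve_title(response, title):
    """Follow the `normalized` and `redirects` mappings of a query response."""
    query = response.get("query", {})
    # Normalization is applied before redirect resolution, so check it first.
    for key in ("normalized", "redirects"):
        for entry in query.get(key, []):
            if entry["from"] == title:
                title = entry["to"]
    return title


# Placeholder response illustrating the shape of a resolved redirect.
sample = {
    "query": {
        "normalized": [{"from": "old title", "to": "Old title"}],
        "redirects": [{"from": "Old title", "to": "New title"}],
        "pages": {},
    }
}
print(resolve_title(sample, "old title"))
```

When the response has no mapping for the title, the function returns the title unchanged.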


In [79]:
def parse_revision_category_content(text):
    """Extract (project, class) pairs from the templates in a talk page's wikitext."""
    classes = []
    if text is not None:
        # Turn every "{{" into "}}" so that splitting on "}}" isolates each template body.
        lines = text.replace("{{","}}").replace("\n","").split("}}")
        for line in lines:
            if "class" in line:
                print(line)
                attributes = line.split("|")
                wiki_project = ""
                wiki_class = "-"
                for idx, attribute in enumerate(attributes):
                    attribute_values = attribute.split("=")
                    if idx == 0:
                        # The first field is the template (WikiProject) name.
                        wiki_project = attribute_values[0]
                    if attribute_values[0] == "class" and len(attribute_values) > 1:
                        wiki_class = attribute_values[1]
                classes.append((wiki_project, wiki_class))
    return classes
response = get_revision("Talk:OpenBSD", date)
content = parse_revision_response(response, date)
parse_revision_category_content(content)

 -->{| class="messagebox standard-talk"|-<!-- |[[Image:Crystal kthememgr.png|50px|Userbox!]] -->|'''
}V0.5|class=FA|category=Engtech
{| class="infobox" width="270px"|-!align="center"|[[Image:Vista-file-manager.png|50px|Archive]]<br>[[Wikipedia:How to archive a talk page|Archives]]----|-|* [[


[(' -->{', '-'), ('}V0.5', 'FA'), ('{', '-')]

In [80]:
import re

re.findall("{{.*}}", content)

['{{featured|36052463}}',
 '{{Mainpage date|April 10|2006}}',
 '{{oldpeerreview}}',
 '{{talkheader}}',
 '{{todo priority}}',
 '{{CryptographyProject}}',
 "{{PAGENAME}}''' has a userbox and category at [[:Category:Wikipedians who use {{PAGENAME}}",
 '{{V0.5|class=FA|category=Engtech}}',
 '{{NAMESPACE}}:{{PAGENAME}}',
 '{{NAMESPACE}}:{{PAGENAME}}',
 '{{NAMESPACE}}:{{PAGENAME}}',
 '{{{2|}}}',
 '{{{2|}}}']

In [82]:
import re

re.findall("{{.*}}", content2)

['{{WikiProject Color|class=start|importance=high}}',
 '{{British-English}}',
 '{{#if:{{{1|}}}',
 '{{{1}}}',
 '{{#if:{{{1|}}}',
 '{{{1}}}',
 '{{#if:{{{1|}}}',
 '{{{1}}}',
 '{{#if:{{{1|}}}',
 '{{{1}}}']
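
The pattern `"{{.*}}"` is greedy, which is why one of the OpenBSD matches above runs from the first `{{PAGENAME}}` to the last one on the same line. A minimal sketch of an alternative that matches only innermost templates and reads their `class` parameter. `extract_template_classes` is a hypothetical helper, and the sample text is assembled from templates that appear in this notebook's outputs:

```python
import re

# Match "{{ ... }}" with no braces inside, i.e. innermost templates only.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def extract_template_classes(wikitext):
    """Return (template name, class) pairs; an empty class is reported as "-"."""
    results = []
    for body in TEMPLATE_RE.findall(wikitext):
        name, *params = body.split("|")
        for param in params:
            key, separator, value = param.partition("=")
            if separator and key.strip() == "class":
                results.append((name.strip(), value.strip() or "-"))
                break
    return results


sample = (
    "{{talkheader}}\n"
    "{{V0.5|class=FA|category=Engtech}}\n"
    "{{WikiProject Color|class=start|importance=high}}\n"
    "{{WP1.0 Arts|class=|small=yes}}"
)
print(extract_template_classes(sample))
```

Excluding braces from the match body means nested templates are skipped rather than partially matched, which is a trade-off: a rating template that itself contains a nested template would be missed.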

In [64]:
import os
folder = "collected_data/revision_info_200701-200901/data"
titles = os.listdir(folder)
titles[:10]

['The Lion King',
 'Luther Burbank',
 'Avatar: The Last Airbender',
 'Freak the Sheep Vol. 2',
 'Heroes of Wrestling',
 'Lady',
 'New York State Route 345',
 'Universe of Kingdom Hearts',
 'Helpless Automaton',
 'Samuel G. Arnold']

In [75]:
import pandas as pd
title = "Sesame Street"
# Avoid shadowing the built-in input(); build the path with os.path.join.
input_path = os.path.join(folder, title)
df = pd.read_csv(input_path)

In [84]:
def get_category(title, date):
    response = get_revision(f"Talk:{title}", date)
    content = parse_revision_response(response, date)
    return re.findall("{{.*class.*}}", content)

for i, row in df.iterrows():
    date = row['revision.timestamp']
    category = get_category(title, date)
    print(category)
    df.loc[i, 'raw_category'] = str(category)
df

    

['{{TelevisionWikiProject|class=B|importance=top}}', '{{WP1.0 Arts|class=|small=yes}}', '{{V0.5|class=|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=B|importance=top}}', '{{WP1.0 Arts|class=|small=yes}}', '{{V0.5|class=|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=B|importance=top}}', '{{WP1.0 Arts|class=|small=yes}}', '{{V0.5|class=|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=B|importance=top}}', '{{WP1.0 Arts|class=|small=yes}}', '{{V0.5|class=|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=FA|importance=top}}', '{{WP1.0 Arts|class=FA|small=yes}}', '{{V0.5|class=FA|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=FA|importance=top}}', '{{WP1.0 Arts|class=FA|small=yes}}', '{{V0.5|class=FA|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=FA|importance=top}}', '{{WP1.0 Arts|class=FA|small=yes}}', '{{V0.5|class=FA|category=Socsci|small=yes}}']
['{{TelevisionWikiProject|class=FA|importance=top}}', 

Unnamed: 0,access,revision.anon,revision.comment,revision.parentid,revision.revid,revision.timestamp,revision.user,raw_category
0,2009-01-01T00:00:00Z,,[[User:Rjwilmsi#Other_fixes|gen fixes]]: (3) f...,259962296,260654675,2008-12-29T11:11:45Z,Rjwilmsi,['{{TelevisionWikiProject|class=B|importance=t...
1,2008-12-01T00:00:00Z,,Reverted 2 edits by [[Special:Contributions/98...,254921379,254922104,2008-11-30T03:44:50Z,HexaChord,['{{TelevisionWikiProject|class=B|importance=t...
2,2008-11-01T00:00:00Z,,dab,248344628,248347536,2008-10-29T04:21:52Z,Jc37,['{{TelevisionWikiProject|class=B|importance=t...
3,2008-10-01T00:00:00Z,,,241625943,241899484,2008-09-30T00:28:35Z,Jasonstru,['{{TelevisionWikiProject|class=B|importance=t...
4,2008-09-01T00:00:00Z,,,235295043,235295143,2008-08-31T01:35:33Z,Azumanga1,['{{TelevisionWikiProject|class=FA|importance=...
5,2008-08-01T00:00:00Z,,"Adding ""citation needed"" notices.",228974673,229045386,2008-07-31T16:40:46Z,Peter4Truth,['{{TelevisionWikiProject|class=FA|importance=...
6,2008-07-01T00:00:00Z,,"/* Movies, videos, and specials */",221091360,221937120,2008-06-26T19:41:54Z,Colt9033,['{{TelevisionWikiProject|class=FA|importance=...
7,2008-06-01T00:00:00Z,,,214879142,216027846,2008-05-30T19:26:02Z,Cookie81910,['{{TelevisionWikiProject|class=FA|importance=...
8,2008-05-01T00:00:00Z,,/* Criticism */,209090873,209190520,2008-04-30T08:36:38Z,121.218.65.39,['{{TelevisionWikiProject|class=FA|importance=...
9,2008-04-01T00:00:00Z,,Reverted edits by [[Special:Contributions/DJ Z...,202291854,202293140,2008-03-31T13:52:53Z,RobertG,['{{TelevisionWikiProject|class=FA|importance=...
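
The `raw_category` column stores each list of templates as a string. A minimal sketch of turning such a string back into a list and keeping only the non-empty class values. `classes_from_raw_category` is a hypothetical helper, and the sample string is copied from the output above:

```python
import ast
import re

# Capture the value of a "class" parameter up to the next "|" or "}".
CLASS_RE = re.compile(r"\|\s*class\s*=\s*([^|}]*)")


def classes_from_raw_category(raw_category):
    """Parse a stringified list of templates and return their non-empty classes."""
    templates = ast.literal_eval(raw_category)
    classes = []
    for template in templates:
        match = CLASS_RE.search(template)
        if match and match.group(1).strip():
            classes.append(match.group(1).strip())
    return classes


raw = str([
    "{{TelevisionWikiProject|class=B|importance=top}}",
    "{{WP1.0 Arts|class=|small=yes}}",
    "{{V0.5|class=|category=Socsci|small=yes}}",
])
print(classes_from_raw_category(raw))  # ['B']
```

It could be applied to the whole column with `df['raw_category'].apply(classes_from_raw_category)`.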
