# Collecting revisions from the Wikipedia API

Wikipedia offers several ways to collect its data. One of them is the MediaWiki Action API, which can be explored interactively through the sandbox at https://meta.wikimedia.org/wiki/Special:ApiSandbox.
For the examples here, the parameters were chosen in that sandbox and then copied into this notebook.

The articles used as examples come from the dataset of Professor Daniel Hassan's dissertation.

In [14]:
import pandas as pd

dataset = pd.read_csv('wikipedia_dataset_hasan/wikipedia.csv')
titles = pd.DataFrame(dataset, columns = ['page_title']).head(5).values
titles

array([['Casino Royale (2006 film)'],
       ['Procellariidae'],
       ['Kakapo'],
       ['2005 NFL Draft'],
       ['Danish football champions']], dtype=object)

Below is an example of using the Wikipedia API. Setting the "format" parameter to "json" makes the API return JSON, which `.json()` parses into a Python dictionary. That is easier to manipulate than XML or HTML.

In [43]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
title = titles[0][0]  # titles is a 2-D array, so take the string itself

params = {
    "action": "query",
    "format": "json",
    "titles": title,
}

response = S.get(url=URL, params=params).json()
response

{'batchcomplete': '',
 'query': {'pages': {'930379': {'pageid': 930379,
    'ns': 0,
    'title': 'Casino Royale (2006 film)'}}}}
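
The `titles` parameter also accepts several titles in a single request, separated by `|`. The sketch below only builds the parameter dictionaries and makes no network call. The helper name `build_title_batches` and the default batch size of 50 are assumptions for illustration; the actual per-request limit should be checked against the API documentation.

```python
def build_title_batches(titles, batch_size=50):
    """Yield query parameter dicts, each covering up to `batch_size` titles.

    Hypothetical helper; batch_size=50 is an assumed limit.
    """
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        yield {
            "action": "query",
            "format": "json",
            # Several titles go into one request, separated by "|".
            "titles": "|".join(batch),
        }


example_titles = [
    "Casino Royale (2006 film)",
    "Procellariidae",
    "Kakapo",
    "2005 NFL Draft",
    "Danish football champions",
]

# A small batch size is used here only to show the splitting.
for params in build_title_batches(example_titles, batch_size=2):
    print(params["titles"])
```

Each yielded dictionary could then be passed to `S.get(url=URL, params=params)` exactly as in the single-title cell.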

The values of the _response_ can be accessed like any Python dictionary, using the dict[key] syntax.

In [29]:
query = response["query"]
print(query)

{'pages': {'930379': {'pageid': 930379, 'ns': 0, 'title': 'Casino Royale (2006 film)'}}}


In [36]:
pages = query["pages"] # or response["query"]["pages"]
print(pages)

{'930379': {'pageid': 930379, 'ns': 0, 'title': 'Casino Royale (2006 film)'}}


In [41]:
pages_list = list(pages.values()) # or list(response["query"]["pages"].values()) 
print(pages_list)

first_page_title = pages_list[0]["title"] # or list(response["query"]["pages"].values())[0]["title"]
print(first_page_title)

[{'pageid': 930379, 'ns': 0, 'title': 'Casino Royale (2006 film)'}]
Casino Royale (2006 film)
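
The chained lookups above raise a `KeyError` if the response has no `query` or `pages` key. A slightly more defensive version could look like the sketch below. The helper name `first_page` is hypothetical, and the sample dictionary simply repeats the response printed earlier.

```python
def first_page(response):
    """Return the first page dict from an Action API response, or None.

    Hypothetical helper, not part of any library.
    """
    pages = response.get("query", {}).get("pages", {})
    # With the default JSON format, "pages" is a dict keyed by page id.
    return next(iter(pages.values()), None)


sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "930379": {
                "pageid": 930379,
                "ns": 0,
                "title": "Casino Royale (2006 film)",
            }
        }
    },
}

page = first_page(sample_response)
print(page["title"])
print(first_page({}))  # no "query" key, so the helper returns None
```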


Revisions can be collected, among other information. The [documentation](https://en.wikipedia.org/w/api.php) describes the available parameters.

There is also [another API](https://en.wikipedia.org/api/rest_v1/#/) that has not been explored here yet.
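
As a rough idea of what a revisions request looks like, the sketch below assembles the parameters for listing a page's revisions between two timestamps. It is not the code of `wiki_revision_crawler`, whose implementation is not shown in this notebook. The function name `build_revisions_params` and the default limit of 50 are assumptions.

```python
def build_revisions_params(title, newest, oldest, limit=50):
    """Build Action API parameters for listing revisions of one page.

    Hypothetical helper; `limit=50` is an arbitrary illustrative default.
    """
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        # With rvdir=older the listing starts at the newest timestamp
        # and walks back towards the oldest one.
        "rvdir": "older",
        "rvstart": newest,
        "rvend": oldest,
    }


params = build_revisions_params(
    "Casino Royale (2006 film)",
    newest="2009-01-03T00:00:00Z",
    oldest="2007-01-03T00:00:00Z",
)
print(params)
```

The resulting dictionary could be sent with `requests` in the same way as the earlier query.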

In [17]:
from wiki_revision_crawler import get_revisions_info, date_range_monthly, parse_revisions_info_monthly

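# In this notebook date_start is the more recent timestamp and date_end
# the older one; the response lists revisions newest first.
date_start = "2017-10-01T00:00:00Z"
date_end = "2017-08-01T00:00:00Z"

print(date_range_monthly(date_end, date_start))

date_start = "2009-01-03T00:00:00Z"
date_end = "2007-01-03T00:00:00Z"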

response = get_revisions_info("Casino Royale (2006 film)", date_start, date_end)
print(response)

revisions_info = parse_revisions_info_monthly(response, date_start, date_end)


['2017-10-01T00:00:00Z', '2017-09-01T00:00:00Z', '2017-08-01T00:00:00Z']
{'continue': {'rvcontinue': '20080508080539|210987364', 'continue': '||'}, 'query': {'pages': {'930379': {'pageid': 930379, 'ns': 0, 'title': 'Casino Royale (2006 film)', 'revisions': [{'revid': 261118362, 'parentid': 261097463, 'user': 'X201', 'timestamp': '2008-12-31T17:18:03Z', 'comment': '[[WP:UNDO|Undid]] revision 261097463 by [[Special:Contributions/81.110.187.3|81.110.187.3]] ([[User talk:81.110.187.3|talk]])'}, {'revid': 261097463, 'parentid': 261044359, 'user': '81.110.187.3', 'anon': '', 'timestamp': '2008-12-31T15:16:52Z', 'comment': ''}, {'revid': 261044359, 'parentid': 260874979, 'user': 'Croctotheface', 'timestamp': '2008-12-31T06:40:55Z', 'comment': '/* Plot */ tightening up'}, {'revid': 260874979, 'parentid': 260874586, 'user': 'Ian Rose', 'timestamp': '2008-12-30T13:27:56Z', 'comment': 'Revert to revision 260585018 dated 2008-12-29 00:48:15 by J.delanoy using [[:en:Wikipedia:Tools/Navigation_popup
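
The `continue` key in the response above signals that more revisions are available. One common way to page through them is to merge the returned tokens into the next request until no `continue` key comes back. The sketch below assumes a caller-supplied `fetch` function (hypothetical) that takes a parameter dict and returns the decoded JSON; here it is replaced by a stub with placeholder data so the example runs without network access.

```python
def iter_query_batches(fetch, params):
    """Yield each response batch, following the API's `continue` tokens.

    `fetch` is a hypothetical callable: params dict -> decoded JSON dict.
    """
    request = dict(params)
    while True:
        response = fetch(request)
        yield response
        if "continue" not in response:
            break
        # Merge the continuation tokens (e.g. rvcontinue) into the next request.
        request = {**params, **response["continue"]}


def fake_fetch(request):
    # Stand-in for S.get(url=URL, params=request).json(); stub data only.
    if "rvcontinue" not in request:
        return {
            "continue": {"rvcontinue": "20080508080539|210987364", "continue": "||"},
            "query": {"pages": {"930379": {"revisions": [{"revid": 261118362}]}}},
        }
    return {
        "batchcomplete": "",
        "query": {"pages": {"930379": {"revisions": [{"revid": 210987364}]}}},
    }


batches = list(iter_query_batches(fake_fetch, {"action": "query", "prop": "revisions"}))
print(len(batches))
```

With a real session, `fetch` could be `lambda p: S.get(url=URL, params=p).json()`.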

In [None]:
import json
with open("response.json", 'w') as fp:
    json.dump(response, fp)

In [None]:
import datetime
import dateutil.parser

def parse_date(date):
    """Parse date from ISO 8601 format to datetime
 
    Parameters:
        date (str): date in the format ISO 8601: 2001-01-15T14:56:00Z
        
    Returns:
        date (datetime): date parsed

    """
    return dateutil.parser.parse(date)

def format_date(date):
    """Format date ISO 8601 (i.e. 2001-01-15T14:56:00Z)
 
    Parameters:
        date (datetime): date
        
    Returns:
        date (str): ISO 8601 formatted date
    
    """
    s = "%Y-%m-%dT%H:%M:%SZ"
    return date.strftime(s)

dates = pd.date_range("2017-01-03T01:32:59Z", "2017-04-03T01:32:59Z", freq='MS').strftime("%Y-%m-%dT%H:%M:%SZ").tolist()

# Print each month-start timestamp in the range
for month_start in dates:
    print(month_start)
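
The descending monthly list printed earlier by `date_range_monthly` could also be produced with the standard library alone. The sketch below is not the implementation from `wiki_revision_crawler`, which is not shown here. The name `month_starts_desc` is hypothetical, and the function assumes both inputs use the `%Y-%m-%dT%H:%M:%SZ` layout.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def month_starts_desc(older, newer):
    """Return first-of-month timestamps between two dates, newest first.

    Hypothetical helper; both arguments are strings in ISO_FORMAT.
    """
    start = datetime.strptime(older, ISO_FORMAT)
    end = datetime.strptime(newer, ISO_FORMAT)

    year, month = start.year, start.month
    # If `older` is not itself a month start, begin at the next month.
    if start != datetime(year, month, 1):
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    result = []
    while datetime(year, month, 1) <= end:
        result.append(datetime(year, month, 1).strftime(ISO_FORMAT))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    # Reverse so the most recent month comes first.
    return result[::-1]


print(month_starts_desc("2017-08-01T00:00:00Z", "2017-10-01T00:00:00Z"))
# ['2017-10-01T00:00:00Z', '2017-09-01T00:00:00Z', '2017-08-01T00:00:00Z']
```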

In [None]:
import pandas as pd

#revisions = data["query"]["pages"][page_id]["revisions"]
#revisions = list(data["query"]["pages"].values())[0]["revisions"]

# pd.json_normalize replaces the deprecated pandas.io.json import path
pd.json_normalize(revisions_info)

In [16]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

def get_revision(title, access_date):
    # title = "Procellariidae"
    # date = "2017-04-03T01:32:59.000Z"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment|content|ids",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 1,
        "rvstart": access_date,
        "rvdir": "older",
    }

    response = S.get(url=URL, params=params)
    
    return response.json()

def parse_revision_response(response, date):
    # With formatversion=2, "pages" and "revisions" are already lists.
    # The `date` argument is currently unused.
    page = response["query"]["pages"][0]
    revision = page["revisions"][0]
    return (page["pageid"], page["title"], revision["user"], revision["timestamp"],
            revision["comment"], revision["slots"]["main"]["content"])

date = "2009-01-03T00:00:00.000Z"
response = get_revision("Procellariidae", date)
#parse_revision_response(response, date)
#category = parse_revision_category_content(response)
page = response["query"]["pages"][0]  # formatversion=2 returns pages as a list
revision = page["revisions"][0]
content = revision["slots"]["main"]["content"]
content

'{{Taxobox\n| name = Procellariidae\n| image = Cape Petrel (Pintado) at Antarctic Convergence Zone.jpg\n| image_width = 250px\n| image_caption = [[Cape Petrel]] \'\'Daption capense\'\'\n| regnum = [[Animal]]ia\n| phylum = [[Chordate|Chordata]]\n| classis = [[bird|Aves]]\n| ordo = [[Procellariiformes]]\n| familia = \'\'\'Procellariidae\'\'\'\n| familia_authority = [[William Elford Leach|Leach]], 1820\n| subdivision_ranks = Genera\n| subdivision = \nSeveral, [[List of Procellariidae]].\n}}\nThe [[family (biology)|family]] \'\'\'Procellariidae\'\'\' is a group of [[seabird]]s that comprises the [[fulmarine petrel]]s, the [[gadfly petrel]]s, the [[prion (bird)|prions]], and the [[shearwater]]s. This family is part of the [[bird]] order [[Procellariiformes]] (or tubenoses), which also includes the [[albatross]]es, the [[storm-petrel]]s, and the [[diving petrel]]s. \n\nThe procellariids are the most numerous family of tubenoses, and the most diverse. They range in size from the [[giant petre
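
Indexing straight into `pages` and `revisions` fails when a title does not exist or has no revision before the requested date. With `formatversion=2`, a missing page is reported with a `missing` flag and no `revisions` key. The sketch below handles those cases by returning `None`. The function name `extract_latest_revision` and the sample dictionaries are illustrative assumptions, not real API output.

```python
def extract_latest_revision(response):
    """Return a dict describing the first revision, or None if there is none.

    Hypothetical helper for formatversion=2 responses.
    """
    pages = response.get("query", {}).get("pages", [])
    if not pages:
        return None
    page = pages[0]
    revisions = page.get("revisions", [])
    # A missing page carries a "missing" flag and no "revisions" key.
    if page.get("missing") or not revisions:
        return None
    revision = revisions[0]
    return {
        "pageid": page.get("pageid"),
        "title": page.get("title"),
        "user": revision.get("user"),
        "timestamp": revision.get("timestamp"),
        "comment": revision.get("comment"),
        "content": revision.get("slots", {}).get("main", {}).get("content"),
    }


# Made-up sample data shaped like a formatversion=2 response.
found = {
    "query": {
        "pages": [
            {
                "pageid": 1,
                "title": "Example",
                "revisions": [
                    {
                        "user": "SomeUser",
                        "timestamp": "2009-01-01T00:00:00Z",
                        "comment": "",
                        "slots": {"main": {"content": "Sample wikitext"}},
                    }
                ],
            }
        ]
    }
}
missing = {"query": {"pages": [{"title": "No such page", "missing": True}]}}

print(extract_latest_revision(found)["content"])
print(extract_latest_revision(missing))
```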

In [None]:
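def parse_revision_category_content(text):
    """Collect (template name, class value) pairs from wikitext chunks mentioning "class"."""
    classes = []
    if text is not None:
        # Turn every "{{" into "}}" so that a single split isolates each template chunk
        lines = text.replace("{{", "}}").replace("\n", "").split("}}")
        for line in lines:
            if "class" in line:
                print(line)
                attributes = line.split("|")
                wiki_project = ""
                wiki_class = "-"  # default when no class value is found
                for idx, attribute in enumerate(attributes):
                    attribute_values = attribute.split("=")
                    if idx == 0:
                        wiki_project = attribute_values[0]
                    # Strip whitespace so that "class = B" also matches
                    if attribute_values[0].strip() == "class" and len(attribute_values) > 1:
                        wiki_class = attribute_values[1].strip()
                classes.append((wiki_project, wiki_class))
    return classes

parse_revision_category_content(content)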
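
To make the intent of the parser above easier to see, here is a regex-based variant run on a short made-up wikitext string. It is a sketch under two assumptions: templates are not nested, and only templates that actually set a `class` parameter are reported. The second point differs from the function above, which also records a "-" placeholder. The name `extract_template_classes` and the sample string are hypothetical.

```python
import re


def extract_template_classes(wikitext):
    """Return (template_name, class_value) pairs for templates setting `class`.

    Hypothetical helper; handles only non-nested {{...}} templates.
    """
    results = []
    # Match {{...}} bodies that contain no further braces.
    for body in re.findall(r"\{\{([^{}]*)\}\}", wikitext or ""):
        parts = [part.strip() for part in body.split("|")]
        name, params = parts[0], parts[1:]
        for param in params:
            key, sep, value = param.partition("=")
            if sep and key.strip() == "class":
                results.append((name, value.strip()))
                break
    return results


# Invented sample text, not taken from a real page.
sample = (
    "{{WikiProject Birds|class=FA|importance=High}}\n"
    "{{Talk header}}\n"
    "{{WikiProject Film | class = B }}"
)
print(extract_template_classes(sample))
# [('WikiProject Birds', 'FA'), ('WikiProject Film', 'B')]
```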