# Collecting revisions from the Wikipedia API

Wikipedia offers several ways to collect its data, one of which is its web API.
The API can be explored interactively through the interface at https://meta.wikimedia.org/wiki/Special:ApiSandbox.
For the examples here, the parameters were chosen in that sandbox and then copied into the code.
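
As a rough illustration of how sandbox parameters carry over into code, the sketch below encodes a parameter dictionary into a request URL using only the standard library. The parameter values are illustrative choices, not the ones used later in this notebook, and `API_URL` and `sandbox_params` are names made up for this example.

```python
from urllib.parse import urlencode

# Endpoint of the English Wikipedia API
API_URL = "https://en.wikipedia.org/w/api.php"

# Illustrative parameters, as they might be picked in the sandbox
sandbox_params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Kakapo",
    "rvprop": "timestamp|user|comment|ids",
    "rvlimit": 1,
    "format": "json",
}

# urlencode percent-encodes special characters such as "|" and spaces
request_url = f"{API_URL}?{urlencode(sandbox_params)}"
print(request_url)
```

Passing the same dictionary as `params=` to `requests` performs this encoding automatically.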

The articles used as examples are part of the dataset from Professor Daniel Hassan's dissertation.

In [2]:
import pandas as pd

dataset = pd.read_csv('wikipedia_dataset_hasan/wikipedia_old.csv')
titles = pd.DataFrame(dataset, columns = ['page_title']).head(5).values
titles

array([['Casino Royale (2006 film)'],
       ['Procellariidae'],
       ['Kakapo'],
       ['2005 NFL Draft'],
       ['Danish football champions']], dtype=object)

Below is an example of a call to the Wikipedia API. Setting the "format" parameter to json makes the API return JSON instead of XML or HTML; `requests` decodes it into a Python dictionary, which is easier to manipulate.

In [119]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"
#title = titles[0]

params = {
        "action": "query",
        "prop": "revisions",
        "titles": "Prostitution in the People's Republic of China",
        "rvprop": "timestamp|user|comment|content|ids",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 1,
        "rvstart": '2008-12-19T16:51:08Z',
        "rvdir": "older",
    }

response = S.get(url=URL, params=params).json()
response

{'batchcomplete': True,
 'query': {'pages': [{'pageid': 40275401,
    'ns': 0,
    'title': "Prostitution in the People's Republic of China"}]}}

The values in the _response_ can be accessed like those of any Python dictionary, using the dict[key] syntax.

In [92]:
query = response["query"]
print(query)

{'normalized': [{'from': 'Wikipedia:Featured_articles', 'to': 'Wikipedia:Featured articles'}], 'pages': {'5921878': {'pageid': 5921878, 'ns': 4, 'title': 'Wikipedia:Featured articles', 'links': [{'ns': 0, 'title': '1924 Rose Bowl'}, {'ns': 0, 'title': '1926 World Series'}, {'ns': 0, 'title': '1927 Chicago mayoral election'}, {'ns': 0, 'title': '1928 Okeechobee hurricane'}, {'ns': 0, 'title': '1930 FIFA World Cup'}, {'ns': 0, 'title': '1933 Atlantic hurricane season'}, {'ns': 0, 'title': '1937 Fox vault fire'}, {'ns': 0, 'title': "1937 Social Credit backbenchers' revolt"}, {'ns': 0, 'title': '1940 Brocklesby mid-air collision'}, {'ns': 0, 'title': '1941 Atlantic hurricane season'}]}}}


In [93]:
pages = query["pages"] # or response["query"]["pages"]
print(pages)

{'5921878': {'pageid': 5921878, 'ns': 4, 'title': 'Wikipedia:Featured articles', 'links': [{'ns': 0, 'title': '1924 Rose Bowl'}, {'ns': 0, 'title': '1926 World Series'}, {'ns': 0, 'title': '1927 Chicago mayoral election'}, {'ns': 0, 'title': '1928 Okeechobee hurricane'}, {'ns': 0, 'title': '1930 FIFA World Cup'}, {'ns': 0, 'title': '1933 Atlantic hurricane season'}, {'ns': 0, 'title': '1937 Fox vault fire'}, {'ns': 0, 'title': "1937 Social Credit backbenchers' revolt"}, {'ns': 0, 'title': '1940 Brocklesby mid-air collision'}, {'ns': 0, 'title': '1941 Atlantic hurricane season'}]}}


In [96]:
pages_list = list(pages.values()) # or list(response["query"]["pages"].values()) 
print(pages_list)

first_page_title = pages_list[0]["title"] # or list(response["query"]["pages"].values())[0]["title"]
print(first_page_title)

[{'pageid': 5921878, 'ns': 4, 'title': 'Wikipedia:Featured articles', 'links': [{'ns': 0, 'title': '1924 Rose Bowl'}, {'ns': 0, 'title': '1926 World Series'}, {'ns': 0, 'title': '1927 Chicago mayoral election'}, {'ns': 0, 'title': '1928 Okeechobee hurricane'}, {'ns': 0, 'title': '1930 FIFA World Cup'}, {'ns': 0, 'title': '1933 Atlantic hurricane season'}, {'ns': 0, 'title': '1937 Fox vault fire'}, {'ns': 0, 'title': "1937 Social Credit backbenchers' revolt"}, {'ns': 0, 'title': '1940 Brocklesby mid-air collision'}, {'ns': 0, 'title': '1941 Atlantic hurricane season'}]}]
Wikipedia:Featured articles
[{'ns': 0, 'title': '1924 Rose Bowl'}, {'ns': 0, 'title': '1926 World Series'}, {'ns': 0, 'title': '1927 Chicago mayoral election'}, {'ns': 0, 'title': '1928 Okeechobee hurricane'}, {'ns': 0, 'title': '1930 FIFA World Cup'}, {'ns': 0, 'title': '1933 Atlantic hurricane season'}, {'ns': 0, 'title': '1937 Fox vault fire'}, {'ns': 0, 'title': "1937 Social Credit backbenchers' revolt"}, {'ns': 0, 

In [118]:
def parse_links(response):
    result = []
    for page in list(response["query"]["pages"].values()):
        r = {
            "title": page["title"],
            "links": [x["title"] for x in page["links"]]
        }
        result.append(r)
    return result

parse_links(response)

[{'title': 'Wikipedia:Featured articles',
  'links': ['1924 Rose Bowl',
   '1926 World Series',
   '1927 Chicago mayoral election',
   '1928 Okeechobee hurricane',
   '1930 FIFA World Cup',
   '1933 Atlantic hurricane season',
   '1937 Fox vault fire',
   "1937 Social Credit backbenchers' revolt",
   '1940 Brocklesby mid-air collision',
   '1941 Atlantic hurricane season']}]
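
In the outputs above, `pages` is a dictionary keyed by page id, which is the default layout. The request made with `"formatversion": "2"` returns `pages` as a list instead. A minimal sketch of a helper that accepts both layouts follows; `iter_pages` is a hypothetical name, and the two sample responses are copied from outputs elsewhere in this notebook.

```python
def iter_pages(response):
    """Yield page dictionaries from either response layout."""
    pages = response.get("query", {}).get("pages", [])
    if isinstance(pages, dict):
        # Default layout: {"<pageid>": {...}, ...}
        yield from pages.values()
    else:
        # formatversion 2 layout: [{...}, ...]
        yield from pages


# Sample responses in each layout
v1 = {"batchcomplete": "",
      "query": {"pages": {"59643877": {"pageid": 59643877, "ns": 0, "title": "Spoo"}}}}
v2 = {"batchcomplete": True,
      "query": {"pages": [{"pageid": 59643877, "ns": 0, "title": "Spoo"}]}}

print([page["title"] for page in iter_pages(v1)])
print([page["title"] for page in iter_pages(v2)])
```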

Revisions can be collected, among other information. The [documentation](https://en.wikipedia.org/w/api.php) describes the available parameters.

There is also [another API](https://en.wikipedia.org/api/rest_v1/#/) that has not been explored yet.

In [1]:
from wiki_revision_crawler import get_revisions_info, date_range_monthly, parse_revisions_info_monthly

date_start = "2017-10-01T00:00:00Z"
date_end = "2017-8-01T00:00:00Z"  # month is not zero-padded; the API rejects this value (see the error output)
 
print(date_range_monthly(date_end, date_start))

date_start = "2006-01-03T00:00:00Z"

response = get_revisions_info("Dypsis onilahensis", date_start, date_end)
print(response)

revisions_info = parse_revisions_info_monthly(response, date_start, date_end)


['2017-10-01T00:00:00Z', '2017-09-01T00:00:00Z', '2017-08-01T00:00:00Z']
{'error': {'code': 'badtimestamp_rvend', 'info': 'Invalid value "2017-8-01T00:00:00Z" for timestamp parameter "rvend".', '*': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1287'}


KeyError: 'query'
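
The `badtimestamp_rvend` error above is presumably caused by the unpadded month in `2017-8-01T00:00:00Z`. One way to guard against this is to normalize every timestamp before sending it. The sketch below uses only the standard library; `normalize_timestamp` is a hypothetical helper, not part of `wiki_revision_crawler`.

```python
from datetime import datetime

# Timestamp layout used throughout this notebook
API_TIMESTAMP = "%Y-%m-%dT%H:%M:%SZ"


def normalize_timestamp(value):
    """Re-format a timestamp so every field is zero-padded.

    Raises ValueError if the value does not follow the expected layout.
    """
    parsed = datetime.strptime(value, API_TIMESTAMP)
    return parsed.strftime(API_TIMESTAMP)


print(normalize_timestamp("2017-8-01T00:00:00Z"))
```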

In [4]:
revisions_info

([{'access': '2019-07-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-06-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-05-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'parentid': 825632911,
    'user': 'Tom.Reding',
    'timestamp': '2018-03-29T14:04:16Z',
    'comment': '+[[:Category:Taxonomy articles created by Polbot\u200e\u200e]]; cleanup; [[WP:GenFixes]] on, using [[Project:AWB|AWB]]'}},
  {'access': '2019-04-01T00:00:00Z',
   'revision': {'revid': 833068079,
    'pare

In [None]:
import datetime
import dateutil.parser

def parse_date(date):
    """Parse date from ISO 8601 format to datetime
 
    Parameters:
        date (str): date in the format ISO 8601: 2001-01-15T14:56:00Z
        
    Returns:
        date (datetime): date parsed

    """
    return dateutil.parser.parse(date)

def format_date(date):
    """Format date ISO 8601 (i.e. 2001-01-15T14:56:00Z)
 
    Parameters:
        date (datetime): date
        
    Returns:
        date (str): ISO 8601 formatted date
    
    """
    s = "%Y-%m-%dT%H:%M:%SZ"
    return date.strftime(s)

dates = pd.date_range("2017-01-03T01:32:59Z","2017-04-03T01:32:59Z", freq='MS').strftime("%Y-%m-%dT%H:%M:%SZ").tolist()

for month_start in dates:
    print(month_start)
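
The implementation of `date_range_monthly`, imported earlier from `wiki_revision_crawler`, is not shown in this notebook. As a sketch of one way to produce a list like the one it printed (month starts, newest first), the standard-library version below may help. `month_starts_descending` is a hypothetical name, and it always emits midnight timestamps, which is an assumption.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def next_month(year, month):
    """Return the (year, month) pair that follows the given one."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def month_starts_descending(earliest, latest):
    """List the first day of each month within [earliest, latest], newest first."""
    start = datetime.strptime(earliest, ISO_FORMAT)
    end = datetime.strptime(latest, ISO_FORMAT)

    # Find the first month start that is not before `earliest`
    year, month = start.year, start.month
    if start != datetime(year, month, 1):
        year, month = next_month(year, month)

    result = []
    current = datetime(year, month, 1)
    while current <= end:
        result.append(current.strftime(ISO_FORMAT))
        year, month = next_month(year, month)
        current = datetime(year, month, 1)
    return result[::-1]


print(month_starts_descending("2017-08-01T00:00:00Z", "2017-10-01T00:00:00Z"))
```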

In [None]:
import pandas as pd

#revisions = data["query"]["pages"][page_id]["revisions"]
#revisions = list(data["query"]["pages"].values())[0]["revisions"]

# json_normalize is available as pd.json_normalize since pandas 1.0
pd.json_normalize(revisions_info)

In [33]:
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

def get_revision(title, access_date):
    # title = "Procellariidae"
    # date = "2017-04-03T01:32:59.000Z"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
       # "rvprop": "timestamp|user|comment|content|ids",
        "rvprop": "content",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 1,
        "rvstart": access_date,
        "rvdir": "older",
    }

    response = S.get(url=URL, params=params)
    
    return response.json()


def parse_revision_response(response, date):
    page = list(response["query"]["pages"])[0]
    revision = list(page["revisions"])[0]
   # return (page["pageid"], page["title"], revision["user"], revision["timestamp"], revision["comment"], revision["slots"]["main"]["content"])
    return revision["slots"]["main"]["content"]

date = "2009-01-03T00:00:00.000Z"
response = get_revision("Procellariidae", date)
content = parse_revision_response(response, date)
#category = parse_revision_category_content(response)
# page = list(response["query"]["pages"])[0]
# revision = list(page["revisions"])[0]
# content = revision["slots"]["main"]["content"]
# content

In [27]:
def get_revision_info(title, access_date):
    # title = "Procellariidae"
    # date = "2017-04-03T01:32:59.000Z"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment|ids",
        "rvslots": "main",
        "formatversion": "2",
        "format": "json",
        "rvlimit": 500,
        "rvstart": access_date,
        "rvdir": "older",
              
    }

    response = S.get(url=URL, params=params)
    
    return response.json()


title = 'Spoo'
date_end = "2009-01-03T00:00:00Z"
get_revision_info(title, date_end)

{'batchcomplete': True,
 'query': {'pages': [{'pageid': 59643877, 'ns': 0, 'title': 'Spoo'}]}}

In [28]:
def get_page_redirect(title):
    """Query a page by title, asking the API to resolve redirects.

    Returns the decoded JSON response as a dictionary.
    """
    params = {
        'action': "query",
        'format': "json",
        'titles': title,
        # 'prop': "redirects"
        'redirects': 1,
    }

    response = S.get(url=URL, params=params)
    return response.json()

title = 'Spoo'
print(get_page_redirect(title))

# title = 'Variegated fairywren'
# print(get_page_redirect(title))

title = 'List of Danish football champions'
print(get_page_redirect(title))

{'batchcomplete': '', 'query': {'pages': {'59643877': {'pageid': 59643877, 'ns': 0, 'title': 'Spoo'}}}}
{'batchcomplete': '', 'query': {'pages': {'4862862': {'pageid': 4862862, 'ns': 0, 'title': 'List of Danish football champions'}}}}
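
Neither response above reports a redirect. When a title is resolved through a redirect, the MediaWiki API is documented to list it under `query.redirects` as entries with `from` and `to` keys. The sketch below reads that list; `resolve_redirect` is a hypothetical helper, and the sample response is invented for illustration rather than taken from the API.

```python
def resolve_redirect(data, title):
    """Return the redirect target for `title`, or `title` itself if none is listed."""
    for entry in data.get("query", {}).get("redirects", []):
        if entry.get("from") == title:
            return entry.get("to", title)
    return title


# Hypothetical response, for illustration only (not real API output)
sample = {
    "batchcomplete": "",
    "query": {
        "redirects": [{"from": "Old title", "to": "New title"}],
        "pages": {"1": {"pageid": 1, "ns": 0, "title": "New title"}},
    },
}

print(resolve_redirect(sample, "Old title"))
print(resolve_redirect(sample, "Unrelated title"))
```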


In [30]:
import re
result = re.findall(r'(class=(.+?)\n|class=(.+?)\||class=(.+?)})', "dad{{class=A|sadad}}ad")
result
#return result[0][1] if result[0][0][-1:] == '\n' else result[0][2] if result[0][0][-1:] == '|' else result[0][3]


[('class=A|', '', 'A', '')]

In [318]:
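def parse_revision_category_content(text):
    """Return the value that follows the first 'class=' match in the wikitext."""
    result = re.findall(r'(class=(.+?)}|class=(.+?)\||class=(.+?)\n)', text)
    # Each match is (full, before '}', before '|', before newline); pick the group that matched
    full = result[0][0]
    if full.endswith('}'):
        return result[0][1]
    if full.endswith('|'):
        return result[0][2]
    return result[0][3]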

title = "GNOME"
date = "2007-01-03T00:00:00.000Z"

response = get_revision(f"Talk:{title}", date)
content = parse_revision_response(response, date)
parse_revision_category_content(content)

'GA|importance=Top|bc-current=yes|nested=yes'
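
In the output above, the lazy pattern runs on to the closing brace, so every parameter after `class=` is returned along with the class. A narrower pattern that stops at the first `|`, `}` or newline is sketched below as one possible alternative. `extract_class` is a hypothetical name, and the pattern would still pick up unrelated `class=` attributes such as those in HTML.

```python
import re

# Capture everything after "class=" up to the first "|", "}" or newline
CLASS_PATTERN = re.compile(r"class[ \t]*=[ \t]*([^|}\n]*)")


def extract_class(wikitext):
    """Return the first non-empty class value found, or None."""
    for match in CLASS_PATTERN.finditer(wikitext):
        value = match.group(1).strip()
        if value:
            return value
    return None


for sample in ["{{class=A}}", "{{class=A|aodkaod}}", "{{class=A\n|}}", "{{class=A }}"]:
    print(repr(sample), extract_class(sample))
```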

In [64]:
import os
folder = "collected_data/revision_info_200701-200901/data"
titles = os.listdir(folder)
titles[:10]

['The Lion King',
 'Luther Burbank',
 'Avatar: The Last Airbender',
 'Freak the Sheep Vol. 2',
 'Heroes of Wrestling',
 'Lady',
 'New York State Route 345',
 'Universe of Kingdom Hearts',
 'Helpless Automaton',
 'Samuel G. Arnold']

In [36]:
lista = ['The Lion King',
         'Luther Burbank',
         'Avatar: The Last Airbender',
         'Freak the Sheep Vol. 2',
         'Heroes of Wrestling',
         'Lady',
         'New York State Route 345',
         'Universe of Kingdom Hearts',
         'Helpless Automaton',
         'Samuel G. Arnold']
title = lista[1]
response = get_revision(f"Talk:{title}", date)
parse_revision_response(response, date)

'{{ArticleHistory\n|action1=FAC\n|action1date=14:39, 2 September 2006\n|action1link=Wikipedia:Featured article candidates/Luther Burbank/archive1\n|action1result=not promoted\n|action1oldid=73066167\n|currentstatus=FFAC\n}}\n{{WikiProjectBanners\n|1={{WikiProject Plants|class=B|importance=Mid}}\n|2={{WPBiography|living=no|class=B|priority=Mid\n|needs-infobox=yes|s&a-work-group=yes|listas= Burbank, Luther}}\n|3={{HistSci|class=B|importance=Low}}\n|4={{WPReligion|class=|importance=|UnitarianUniversalism=yes}}\n|5={{WikiProject California|class=B|importance=Mid}}\n|6={{SFBAProject|class=B|importance=Mid}}\n}}\n==Unitarian Universalist?==\nMost of my knowledge of Burbank comes from reading the Krafts\' book, which was based mainly on interviews with people who knew him and on original documents.  They did not give me the impression that he was a member of any religion, although he had spiritual beliefs.  I\'m not sure if I should take off this tag in case the Unitarian Universalists might 

In [107]:
import pandas as pd
title = "Nelvana"
input_path = os.path.join(folder, title)  # avoid shadowing the built-in input()
df = pd.read_csv(input_path)

In [336]:
# Pope, Nelvana, Big Bang, Munster, Fremen, Chicago 19, GNOME, Stargate, Enron, Namco, Pholcidae, Freenet
title = "Pope"
def get_category(title, date):
    """Return the raw 'class=' value from the talk page of `title` as of `date`."""
    response = get_revision(f"Talk:{title}", date)
    content = parse_revision_response(response, date)
    #return re.findall("{{.*class=.*}}", content)
    # return re.search('class=(.+?)\||}}]', content).group(1)
    # Without re.MULTILINE, '^' only matches at the start of the string,
    # so the first alternative below never matches.
    result = re.findall(r'(class=(.+?)^[|\n ]}|class=(.+?)\||class=(.+?)\n)', content)
    full = result[0][0]
    if full.endswith('}'):
        return result[0][1]
    if full.endswith('|'):
        return result[0][2]
    return result[0][3]

for i, row in df.iterrows():
    date = row['revision.timestamp']
    category = get_category(title, date)
    print(category)
    df.loc[i, 'raw_category'] = str(category)
df

    

B
B
A
A
A
A
A
A
A
A
A
A
A


KeyError: 'revisions'
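
The `KeyError: 'revisions'` above is what happens when the API returns a page without any revision, as in the `Spoo` response shown earlier. A defensive variant of the parsing step could return `None` in that case. The sketch below assumes the `formatversion` 2 layout, where `pages` is a list; `first_revision_content` is a hypothetical name, and the `with_content` sample is invented for illustration.

```python
def first_revision_content(response):
    """Return the wikitext of the first revision, or None if there is none."""
    pages = response.get("query", {}).get("pages", [])
    if not pages:
        return None  # error response or empty result
    revisions = pages[0].get("revisions", [])
    if not revisions:
        return None  # the page has no revision at the requested date
    return revisions[0].get("slots", {}).get("main", {}).get("content")


# Response copied from the 'Spoo' output earlier in this notebook
no_revisions = {"batchcomplete": True,
                "query": {"pages": [{"pageid": 59643877, "ns": 0, "title": "Spoo"}]}}

# Hypothetical response, for illustration only
with_content = {"query": {"pages": [{"title": "Talk:Example", "revisions": [
    {"slots": {"main": {"content": "{{WikiProject Example|class=B}}"}}}]}]}}

print(first_revision_content(no_revisions))
print(first_revision_content(with_content))
```

Callers can then skip dates for which the result is `None` instead of catching the exception.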

In [20]:
date_start = "2002-01-03T00:00:00Z"
date_end = "2011-01-03T00:00:00Z"

dates = pd.date_range(date_start, date_end, freq='3MS').strftime("%Y-%m-%dT%H:%M:%SZ").tolist()[::-1]


In [345]:
# Pope, Nelvana, Big Bang, Munster, Fremen, Chicago 19, GNOME, Stargate, Enron, Namco, Pholcidae, Freenet, Kakapo
title = "Pigment"
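for date in dates:
    try:
        category = get_category(title, date)
        print(f"{date} {category}")
    except (KeyError, IndexError):
        # No revision at that date, or no 'class=' match in the talk page
        pass

# Forms of the class parameter that the pattern has to handle:
# {{class=A}}
# {{class=A|aodkaod}}
# {{class=A\n|}}
# {{class=A }}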

2010-11-01T00:00:00Z B
2010-08-01T00:00:00Z B
2010-05-01T00:00:00Z B
2010-02-01T00:00:00Z B
2009-11-01T00:00:00Z B
2009-08-01T00:00:00Z B
2009-05-01T00:00:00Z GA
2009-02-01T00:00:00Z GA
2008-11-01T00:00:00Z GA
2008-08-01T00:00:00Z GA
2008-05-01T00:00:00Z GA
2008-02-01T00:00:00Z GA
2007-11-01T00:00:00Z GA
2007-08-01T00:00:00Z GA
2007-05-01T00:00:00Z GA
2007-02-01T00:00:00Z GA
2006-11-01T00:00:00Z GA
2006-08-01T00:00:00Z "thumb tleft">


In [18]:

dataset = dataset[dataset.page_title.isin(titles_not_found)]
list(dataset["page_id"])


NameError: name 'titles_not_found' is not defined