# Creating the corpus
* Extracting a text corpus from Wikipedia made of plain text sentences
* selected so as to have a roughly balanced corpus in terms of training data (each target category should be associated with the same number of sentences). 

* Including 6 categories:
architects, mathematicians, painters, politicians, singers and writers.

***

* SCRIPT INPUT:
    
    * a number k of persons per category

    * a number n of sentences per person
    (persons whose wikipedia description is too short to have n sentences, should be ignored). 

* SCRIPT OUTPUT:

    * the text of the corresponding Wikipedia page
    
    * the corresponding data type (A or Z) and category
    
    * WikiData description
    
            ▶ Store these into a csv or a json file and save it on your hard drive

## Part 1
Create a list of persons you want to work with. Those persons
should fall into two main types: artists (A) and non artists (Z).
Singers, writers and painters are of type A while architects, politicians and mathematicians of type Z. For each category, select 30
persons of that category. So in total you should have a list of 180
persons (half of them are artists and half of them are not).
You can use the wikidata warehouse to find persons of the expected categories. More precisely, the Wikidata collection can be
filtered out using the SPARQL language and the following endpoint: https://query.wikidata.org/.
You can thus use the SPARQLwrapper python library to apply a
SPARQL query to the Wikidata warehouse and retrieve the required item identifiers.

In [19]:
# import modules
import wikipedia
from random import Random
from spacy.lang.en import English
from itertools import islice
import pandas as pd
import json
import wptools
import requests
import nltk
import random
from SPARQLWrapper import SPARQLWrapper, JSON

In [41]:
keywords = ['architect', 'mathematician', 'painter', 'politician', 'singer', 'writer']

In [21]:
endpoint = SPARQLWrapper('https://query.wikidata.org/sparql')
endpoint.setReturnFormat(JSON)

In [43]:

def get_ids(keyword):

    result = requests.get('https://www.wikidata.org/w/api.php',
                        params={'format':'json',
                                'action':'wbsearchentities',
                                'search':keyword,
                                'language':'en'})
    result = result.json()

    key_id = result['search'][0]['id']

    query = '''SELECT ?personLabel
    WHERE
    {
    ?person wdt:P106 wd:'''+key_id+'''.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
    }
    '''

    endpoint.setQuery(query)
    results = endpoint.query()

    uri = json.loads(''.join(map(lambda x: x.decode(), results)))
    
    ids = []
    for person in uri['results']['bindings']:
        ids.append(person['personLabel']['value'])

    return ids 

In [44]:
ids = {}
for keyword in keywords:
    ids[keyword] = get_ids(keyword)

HTTPError: HTTP Error 429: Too Many Requests

## Part 2
for each selected person, retrieve his.her Wikidata description and
Wikipedia page title. This can be done using the wikidata API
along with the wptools python library.

In [25]:
def get_title_and_description(id_):
    result = requests.get('https://www.wikidata.org/w/api.php',
                        params={'action':'wbgetentities',
                                'ids':id_,
                                'format':'json',
                                'props':'descriptions|sitelinks'})
    result = result.json()
    try:
        title = result['entities'][id_]['sitelinks']['enwiki']['title']
    except:
        return None
    
    description = result['entities'][id_]['descriptions']['en']['value']
    
    return title, description

In [26]:
get_title_and_description('Q43799')

## Part 3
Once you have a list of wikipedia page titles, fetch (if it exists) the corresponding English wikipedia page, and extract the n first sentences of its content.

In [27]:
def get_content(title : str,
                n_sentences,
                sentence_tokenize=nltk.sent_tokenize):
    
    page = wikipedia.page(title)
    content = page.content
    sentences = [sent.strip() for sent in islice(sentence_tokenize(content), n_sentences)]
    
    return ' '.join(sentences)

In [36]:

def get_data(ids, n_people):
    data = {}

    for keyw,ppl_ids in ids.items():
        data[keyw] = []
        print (keyw,ppl_ids)
        
        kw_ids = random.sample(ppl_ids, 30)

        for id_ in kw_ids:
            
            article = {}
            
            t, d = get_title_and_description(id_)
            
            try:
                c = get_content(wikipedia.page(t), n_sentences=10)
            except wikipedia.PageError:
                continue

            article['title'] = t
            article['description'] = d
            article['content'] = c
            
            data[keyw].append(article)

In [37]:
get_data(ids, 30)

architects ['Q84312758']


ValueError: Sample larger than population or is negative

In [196]:
# split categories
A = ['singer', 'writer', 'painter']
Z = ['architect', 'politician', 'mathematician']

A_cat = {a:data[a] for a in A}

Z_cat = {z:data[z] for z in Z}

data = {'A':A_cat, 'Z':Z_cat}

In [198]:
with open('data.json', 'w') as f:
    json.dump(data, f)