# Creating the corpus
* Extract a corpus of plain-text sentences from Wikipedia.
* Select the sentences so that the corpus is roughly balanced in terms of training data: each target category should be associated with the same number of sentences.

* Including 6 categories:
architects, mathematicians, painters, politicians, singers and writers.

***

* SCRIPT INPUT:
    
    * a number k of persons per category

    * a number n of sentences per person
    (persons whose Wikipedia page has fewer than n sentences should be ignored).

* SCRIPT OUTPUT:

    * the text of the corresponding Wikipedia page
    
    * the corresponding data type (A or Z) and category
    
    * WikiData description
    
* Store these in a CSV or JSON file and save it on your hard drive.
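
A minimal sketch of the output step, assuming one flat record per person. The column names in `FIELDS` and the helper `write_corpus_csv` are illustrative and not part of the original script; only the standard-library `csv` module is used.

```python
import csv
import io

# Hypothetical column layout: one row per person
FIELDS = ["title", "type", "category", "description", "text"]

def write_corpus_csv(records, handle):
    """Write person records (dicts keyed by FIELDS) as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Illustrative record; the text value is a placeholder, not real page content
example = [{
    "title": "Johann_Baptist_Moser",
    "type": "A",
    "category": "singer",
    "description": "Austrian folk singer",
    "text": "First sentence. Second sentence.",
}]

buffer = io.StringIO()
write_corpus_csv(example, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, 'w', newline='', encoding='utf-8')` as `handle`.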

## Part 1
Create a list of the persons you want to work with. These persons
should fall into two main types: artists (A) and non-artists (Z).
Singers, writers and painters are of type A, while architects, politicians and mathematicians are of type Z. For each category, select 30
persons, so that in total you have a list of 180
persons (half of them artists, half of them not).
You can use the Wikidata warehouse to find persons of the expected categories. More precisely, the Wikidata collection can be
filtered with the SPARQL language through the following endpoint: https://query.wikidata.org/.
You can thus use the SPARQLWrapper Python library to send a
SPARQL query to Wikidata and retrieve the required item identifiers.
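
The type assignment and the expected totals can be written down as a small lookup table. This is only a sketch; the names `CATEGORY_TYPE` and `expected_counts` are illustrative.

```python
# Mapping from category keyword to data type (A = artist, Z = non-artist)
CATEGORY_TYPE = {
    "singer": "A", "writer": "A", "painter": "A",
    "architect": "Z", "politician": "Z", "mathematician": "Z",
}

def expected_counts(persons_per_category):
    """Return (total persons, persons of type A, persons of type Z)."""
    n_a = sum(1 for t in CATEGORY_TYPE.values() if t == "A") * persons_per_category
    n_z = sum(1 for t in CATEGORY_TYPE.values() if t == "Z") * persons_per_category
    return n_a + n_z, n_a, n_z

total, n_artists, n_non_artists = expected_counts(30)
print(total, n_artists, n_non_artists)  # 180 90 90
```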

In [1]:
# import modules
import wikipedia
from random import Random
from spacy.lang.en import English
from itertools import islice
import pandas as pd
import json
import wptools
import requests
import nltk
import random
from SPARQLWrapper import SPARQLWrapper, JSON
import threading

In [2]:
keywords = ['architect', 'mathematician', 'painter', 'politician', 'singer', 'writer']

In [3]:
endpoint = SPARQLWrapper('https://query.wikidata.org/sparql')
endpoint.setReturnFormat(JSON)

In [4]:

def get_ids(keyword):
    """Return the Wikidata item IDs (Q-ids) of persons whose occupation matches `keyword`."""

    # Resolve the keyword to a Wikidata entity; the first search hit is taken as the occupation
    result = requests.get('https://www.wikidata.org/w/api.php',
                        params={'format':'json',
                                'action':'wbsearchentities',
                                'search':keyword,
                                'language':'en'})
    result = result.json()

    key_id = result['search'][0]['id']

    # P106 is the "occupation" property; select the item itself rather than its label,
    # because the later steps look persons up by Q-id
    query = '''SELECT ?person
    WHERE
    {
    ?person wdt:P106 wd:'''+key_id+'''.
    }
    '''

    endpoint.setQuery(query)
    results = endpoint.query().convert()

    ids = []
    for person in results['results']['bindings']:
        # Each binding is an entity URI ending in the Q-id; keep only the Q-id
        ids.append(person['person']['value'].rsplit('/', 1)[-1])

    return ids
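
The query in `get_ids` returns every matching person, which can be a very large result set for broad occupations. One possible variant, sketched below, caps the result with `LIMIT`. The helper name `build_occupation_query` is hypothetical, and the ID `Q177220` is assumed here to be the Wikidata item for "singer"; check it on Wikidata before relying on it.

```python
def build_occupation_query(occupation_qid, limit):
    """Build a SPARQL query selecting up to `limit` persons with the given occupation (P106)."""
    if not (occupation_qid.startswith("Q") and occupation_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {occupation_qid!r}")
    return (
        "SELECT ?person WHERE {\n"
        f"  ?person wdt:P106 wd:{occupation_qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )

# Q177220 is assumed to be the item for "singer" (illustrative value)
query = build_occupation_query("Q177220", limit=500)
print(query)
```

The resulting string can be passed to `endpoint.setQuery(query)`. Note that `LIMIT` keeps an arbitrary subset of the matches rather than a random sample, so the later shuffling only randomizes within that subset.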

In [5]:
ids = {}
for keyword in keywords:
    print('▶ keyword:', keyword)
    ids[keyword] = get_ids(keyword)

▶ keyword: architect
▶ keyword: mathematician
▶ keyword: painter
▶ keyword: politician
▶ keyword: singer
▶ keyword: writer


## Part 2
For each selected person, retrieve their Wikidata description and
Wikipedia page title. This can be done using the Wikidata API
along with the wptools Python library.

In [6]:
def get_title_and_description(id_):
    
    page = wptools.page(wikibase=id_)
    
    page.get_wikidata()
    title = page.data['title']
    description = page.data['description']
    
    return title, description

In [7]:
t,d = get_title_and_description('Q43799')
t,d

www.wikidata.org (wikidata) Q43799
www.wikidata.org (labels) Q40|Q188|Q665807|P648|Q11122389|Q147770...
Johann Baptist Moser (en) data
{
  claims: <dict(26)> P19, P20, P27, P214, P227, P31, P569, P570, P...
  description: Austrian folk singer
  label: Johann Baptist Moser
  labels: <dict(37)> Q40, Q188, Q665807, P648, Q11122389, Q1477702...
  modified: <dict(1)> wikidata
  requests: <list(2)> wikidata, labels
  title: Johann_Baptist_Moser
  what: human
  wikibase: Q43799
  wikidata: <dict(26)> place of birth (P19), place of death (P20),...
  wikidata_pageid: 45987
  wikidata_url: https://www.wikidata.org/wiki/Q43799
}


('Johann_Baptist_Moser', 'Austrian folk singer')

In [8]:
page = wptools.page(wikibase='Q43799')
type(page)

wptools.page.WPToolsPage
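
As a sketch of what the Wikidata API alone provides, the function below pulls the English description and the English Wikipedia title out of a `wbgetentities` response. The helper name `parse_entity` is hypothetical, and the sample response is hand-written and trimmed to the fields used here; it is not a real API capture.

```python
def parse_entity(response, qid, lang="en"):
    """Return (title, description) for `qid` from a wbgetentities JSON response.

    Either value is None when the entity has no sitelink or description in `lang`.
    """
    entity = response.get("entities", {}).get(qid, {})
    sitelink = entity.get("sitelinks", {}).get(f"{lang}wiki")
    description = entity.get("descriptions", {}).get(lang)
    title = sitelink["title"] if sitelink else None
    text = description["value"] if description else None
    return title, text

# Hand-written sample mirroring the response layout (not real API output)
sample = {
    "entities": {
        "Q43799": {
            "descriptions": {"en": {"language": "en", "value": "Austrian folk singer"}},
            "sitelinks": {"enwiki": {"site": "enwiki", "title": "Johann Baptist Moser"}},
        }
    }
}

print(parse_entity(sample, "Q43799"))
# ('Johann Baptist Moser', 'Austrian folk singer')
```

Such a response would come from a request to `https://www.wikidata.org/w/api.php` with `action=wbgetentities`, `ids=<Q-id>`, `props=descriptions|sitelinks` and `format=json`. Returning `None` for missing fields lets the caller skip persons without an English page or description.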

## Part 3
Once you have a list of Wikipedia page titles, fetch the corresponding English Wikipedia page (if it exists) and extract the first n sentences of its content.

In [9]:
def get_content(title : str,
                n_sentences,
                sentence_tokenize=nltk.sent_tokenize):
    
    page = wikipedia.page(title)
    content = page.content
    sentences = [sent.strip() for sent in islice(sentence_tokenize(content), n_sentences)]
    
    return ' '.join(sentences)

In [10]:
c = get_content(t, 10)
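
`get_content` returns whatever sentences are available, even when the page has fewer than n. A minimal sketch of the "too short, ignore" rule from the script specification is shown below. The names `first_n_sentences` and `naive_sent_tokenize` are hypothetical, and the regex splitter is only a rough stand-in so the example runs without NLTK data.

```python
import re
from itertools import islice

def naive_sent_tokenize(text):
    """Very rough sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def first_n_sentences(content, n_sentences, sentence_tokenize=naive_sent_tokenize):
    """Return the first n sentences joined by spaces, or None if the text is too short."""
    sentences = [s.strip() for s in islice(sentence_tokenize(content), n_sentences)]
    if len(sentences) < n_sentences:
        return None
    return " ".join(sentences)

# Made-up text for demonstration only
text = "He was born in Vienna. He wrote songs. He died in 1900."
print(first_n_sentences(text, 2))   # He was born in Vienna. He wrote songs.
print(first_n_sentences(text, 5))   # None
```

In the notebook, `sentence_tokenize=nltk.sent_tokenize` could be passed instead of the naive splitter, and a `None` result would mark the person as one to skip.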

In [11]:
class DataExtractor(threading.Thread):
    def __init__(self, id_):
        super().__init__()
        self.id = id_
        self.article = None
    
    def run(self):
            
        article = {}
            
        try:
            t, d = get_title_and_description(self.id)
        except LookupError:
            return
        try:
            # get_content expects the page title and fetches the page itself
            c = get_content(t, n_sentences=10)
        except (wikipedia.PageError, wikipedia.DisambiguationError):
            return 
            
        if d is None:
            return

        article['title'] = t
        article['description'] = d
        article['content'] = c

        self.article = article

In [12]:
def get_data(ids, n_people):
    data = {}
    for keyw,ppl_ids in ids.items():
        print('▶ keyw,ppl_ids[:n_people]', keyw,ppl_ids[:n_people])
        data[keyw] = []
        random.shuffle(ppl_ids)
        print('▶ len(ppl_ids)', len(ppl_ids))
        
        counter = 0
        
        i_ppl_ids = iter(ppl_ids)

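        while counter < n_people:
            articles_left = n_people - counter
            extractors = []
            # Start one thread per candidate, only as many as articles are still missing
            for id_ in i_ppl_ids:
                extractor = DataExtractor(id_)
                extractor.start()
                extractors.append(extractor)
                if len(extractors)==articles_left:
                    break

            # No candidates left for this category: stop instead of looping forever
            if not extractors:
                break
            
            for extractor in extractors:
                extractor.join()
                if extractor.article is not None:
                    counter += 1
                    data[keyw].append(extractor.article)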
    return data
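
The batching logic in `get_data` (fetch as many candidates as articles are still missing, then top up until the quota is met) can also be expressed with a thread pool. This is a sketch of the same idea, not the notebook's implementation: `collect_articles` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def collect_articles(candidate_ids, n_wanted, fetch, max_workers=8):
    """Fetch candidates in parallel batches until n_wanted articles are collected.

    `fetch(id_)` returns an article, or None when the candidate must be skipped.
    Stops early if the candidates run out.
    """
    articles = []
    remaining = iter(candidate_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(articles) < n_wanted:
            # Take only as many candidates as articles are still missing
            batch = list(islice(remaining, n_wanted - len(articles)))
            if not batch:
                break
            for article in pool.map(fetch, batch):
                if article is not None:
                    articles.append(article)
    return articles

# Stand-in fetcher for demonstration: pretend every odd ID has no usable page
def fake_fetch(id_):
    return {"id": id_} if id_ % 2 == 0 else None

print(collect_articles(range(10), 3, fake_fetch))
# [{'id': 0}, {'id': 2}, {'id': 4}]
```

A real `fetch` would need to catch its own lookup errors and return `None`, since an exception raised inside `pool.map` propagates to the caller.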


In [13]:
%%time
data = get_data(ids=ids, n_people=30)

  wikidata_pageid: 2901913
  wikidata_url: https://www.wikidata.org/wiki/Q3035539
}
Nikolaus Höniger (en) data
{
  claims: <dict(26)> P21, P227, P20, P27, P31, P569, P570, P106, P...
  description: German author and translator
  label: Nikolaus Höniger
  labels: <dict(37)> Q188, Q482980, P1006, Q23540, P5587, P906, P2...
  modified: <dict(1)> wikidata
  requests: <list(2)> wikidata, labels
  title: Nikolaus_Höniger
  what: human
  wikibase: Q1986840
  wikidata: <dict(26)> sex or gender (P21), GND ID (P227), place o...
  wikidata_pageid: 1915542
  wikidata_url: https://www.wikidata.org/wiki/Q1986840
}
en.wikipedia.org (imageinfo) File:Josephine-pinckney-smoking.jpg
en.wikipedia.org (imageinfo) File:François Edouard Raynal A.Quine...
Hassouna Mosbahi (en) data
{
  claims: <dict(27)> P31, P21, P27, P735, P569, P106, P1559, P213,...
  description: Tunisian writer, literary critic and journalist
  image: <list(1)> {'file': 'File:Svět knihy 2011 - Hasúna Al-Musb...
  label: Hassouna Mosbahi


In [14]:
print(len(data['writer']),
        '\n',data.keys())
        

30 
 dict_keys(['architect', 'mathematician', 'painter', 'politician', 'singer', 'writer'])


In [15]:
# split categories
A = ['singer', 'writer', 'painter']
Z = ['architect', 'politician', 'mathematician']

A_cat = {a:data[a] for a in A}

Z_cat = {z:data[z] for z in Z}

data = {'A':A_cat, 'Z':Z_cat}

In [17]:
import os
os.makedirs('data', exist_ok=True)  # create the output directory if it does not exist
with open('data/data.json', 'w') as f:
    json.dump(data, f)
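
To check the balance of the saved corpus, the nested `{type: {category: [articles]}}` structure can be flattened into one row per person. The sketch below uses a tiny stand-in dictionary rather than the real `data.json`; `flatten` is a hypothetical helper.

```python
from collections import Counter

def flatten(nested):
    """Turn {type: {category: [article, ...]}} into a flat list of rows."""
    rows = []
    for type_, categories in nested.items():
        for category, articles in categories.items():
            for article in articles:
                rows.append({"type": type_, "category": category, **article})
    return rows

# Tiny stand-in for the nested structure written to data.json
nested = {
    "A": {"singer": [{"title": "s1"}, {"title": "s2"}]},
    "Z": {"architect": [{"title": "a1"}, {"title": "a2"}]},
}

rows = flatten(nested)
per_category = Counter(row["category"] for row in rows)
print(per_category)                           # Counter({'singer': 2, 'architect': 2})
print(len(set(per_category.values())) == 1)   # True when every category has the same size
```

The flat rows can also be loaded into a table with `pd.DataFrame(rows)` if a CSV export is preferred.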