# Creating the corpus
* Extract a corpus of plain-text sentences from Wikipedia.
* Select the sentences so that the corpus is roughly balanced as training data: each target category should be associated with the same number of sentences.

* Include 6 categories:
architects, mathematicians, painters, politicians, singers and writers.

***

* SCRIPT INPUT:
    
    * a number k of persons per category

    * a number n of sentences per person
    (persons whose Wikipedia page is too short to have n sentences should be ignored).

* SCRIPT OUTPUT:

    * the text of the corresponding Wikipedia page
    
    * the corresponding data type (A or Z) and category
    
    * WikiData description
    
    ▶ Store these in a CSV or JSON file and save it to disk.
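
The inputs and outputs listed above could map onto a small command-line interface and one record per person. The following is only a sketch: the flag names, the defaults and the `make_record` helper are assumptions made for illustration and are not part of the original script.

```python
import argparse

def parse_args(argv=None):
    """Parse the script inputs (flag names and defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Build a balanced Wikipedia corpus")
    parser.add_argument("-k", "--persons", type=int, default=30,
                        help="number of persons per category")
    parser.add_argument("-n", "--sentences", type=int, default=10,
                        help="number of sentences kept per person")
    parser.add_argument("-o", "--output", default="data.json",
                        help="path of the JSON file to write")
    return parser.parse_args(argv)

def make_record(title, category, description, sentences):
    """Assemble one output record; 'type' is A for artists, Z otherwise."""
    artist_categories = {"singer", "writer", "painter"}
    return {
        "title": title,
        "type": "A" if category in artist_categories else "Z",
        "category": category,
        "description": description,
        "sentences": sentences,
    }
```

Calling `parse_args(["-k", "30", "-n", "10"])` reproduces the values used in the rest of this notebook.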

## Part 1
Create a list of the persons you want to work with. They
should fall into two main types: artists (A) and non-artists (Z).
Singers, writers and painters are of type A, while architects, politicians and mathematicians are of type Z. For each category, select 30
persons. In total you should have a list of 180
persons (half of them artists, half of them not).
You can use Wikidata to find persons of the expected categories. More precisely, Wikidata can be
queried with the SPARQL language through the following endpoint: https://query.wikidata.org/.
You can thus use the SPARQLWrapper Python library to run a
SPARQL query against Wikidata and retrieve the required item identifiers.
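
A minimal sketch of the SPARQL route described above, assuming the SPARQLWrapper package is installed and that programmatic queries go to `https://query.wikidata.org/sparql`. The occupation identifiers in `OCCUPATIONS` are my reading of Wikidata and should be double-checked there. The function names and the user-agent string are hypothetical. Note that `LIMIT` returns the first k matches, not a random sample.

```python
# Wikidata identifiers of the six occupations (values of property P106).
# Assumption: verify these on wikidata.org before relying on them.
OCCUPATIONS = {
    "singer": "Q177220",
    "writer": "Q36180",
    "painter": "Q1028181",
    "architect": "Q42973",
    "politician": "Q82955",
    "mathematician": "Q170790",
}

def build_query(category, k):
    """Return a SPARQL query selecting k humans with the given occupation
    who also have an English Wikipedia article."""
    return f"""
    SELECT ?person ?personLabel ?article WHERE {{
      ?person wdt:P31 wd:Q5 ;
              wdt:P106 wd:{OCCUPATIONS[category]} .
      ?article schema:about ?person ;
               schema:isPartOf <https://en.wikipedia.org/> .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {int(k)}
    """

def fetch_persons(category, k, agent="corpus-builder-example/0.1"):
    """Run the query against the public endpoint (needs network access)."""
    # imported here so build_query stays usable without the library installed
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent=agent)
    sparql.setQuery(build_query(category, k))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # keep the item identifier (last URI segment) and the English label
    return [(b["person"]["value"].rsplit("/", 1)[-1], b["personLabel"]["value"])
            for b in bindings]
```

`fetch_persons("writer", 30)` would then return up to 30 `(item_id, label)` pairs. Broad occupations may be slow on the public endpoint, so a timeout or a narrower query may be needed in practice.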

In [1]:
# import modules
import wikipedia
from random import Random
from spacy.lang.en import English
from itertools import islice
import json

In [2]:
# store wiki list pages
wiki_pages = {}
wiki_pages['singer'] = wikipedia.page('List of singers')
wiki_pages['writer'] = wikipedia.page('List of writers')
wiki_pages['painter'] = wikipedia.page('List of painters')
wiki_pages['architect'] = wikipedia.page('List of architects')
wiki_pages['politician'] = wikipedia.page('List of politicians by nationality')
wiki_pages['mathematician'] =  wikipedia.page('List of mathematicians')

In [3]:
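def check_article_keyword(wiki_page):
    # early draft, not used by the later cells (see KeywordArticleChecker)
    # use the page title as keyword, dropping a leading "List(s) of"
    keyword = wiki_page.title
    for prefix in ('Lists of ', 'List of '):
        if keyword.startswith(prefix):
            keyword = keyword[len(prefix):]
            break
    # the page matches if one of its categories mentions the keyword
    for category in wiki_page.categories:
        if keyword in category:
            return True
    return False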

In [4]:
class KeywordArticleChecker:
    def __init__(self,keyword):
        self.keyword = keyword

    def __call__(self,wiki_page):
        for category in wiki_page.categories:
            if self.keyword in category:
                return True
        return False

In [5]:
def check_list_keyword(wiki_page, keyword):
    # True if the keyword appears in the page title
    # (keyword is now a parameter; it was an undefined name before)
    return keyword in wiki_page.title

In [6]:
def check_is_not_list(wiki_page):   
    for category in wiki_page.categories:
        if category.lower().startswith('list'):
            return False
    return True

In [7]:
def get_articles_from_dic(wiki_page, 
                            n_articles,
                            page_filter=lambda x: True,
                            check_is_final_node=check_is_not_list,
                            rng=Random(),
                            max_depth = 10,
                            _memory=None):

    print('  Checking:', wiki_page.title)
    # if the function is being called for the first time, assign an empty set 
    if _memory is None:
        _memory = set()

    # trivial cases: return an empty list if
    # - the current wiki_page has already been visited,
    # - no more articles are requested,
    # - the page is rejected by page_filter,
    # - or the maximum recursion depth has been reached
    if (wiki_page.title in _memory 
        or n_articles <= 0
        or not page_filter(wiki_page)
        or max_depth == 0):
        return []

    # if the page has not been visited yet, add it to the _memory
    _memory.add(wiki_page.title)

    # if the wiki_page is an article (not a list),
    # return it 
    if check_is_final_node(wiki_page):
        print('▶ Added to list:',wiki_page.title)
        return [wiki_page]
    
    # else, the page is a list
    else:
        articles = []
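        # draw one link at random (with replacement) until enough articles are found;
        # note: this may loop forever if the list has too few usable links
        while n_articles > 0:
            # choices() returns a list, so take its single element as the title
            title = rng.choices(wiki_page.links, k=1)[0]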
            try:
                page = wikipedia.page(title)
            except wikipedia.exceptions.WikipediaException:
                continue

            # get articles from the list links and add them to the list of articles
            local_articles = get_articles_from_dic(wiki_page=page, 
                                                    n_articles=1, 
                                                    page_filter=page_filter,
                                                    check_is_final_node=check_is_final_node,
                                                    rng=rng,
                                                    max_depth = max_depth-1,
                                                    _memory=_memory)
            articles.extend(local_articles)

            n_articles -= len(local_articles)

            # if the number of articles retrieved satisfies the request
            # return the articles
            if n_articles == 0:
                return articles
            elif n_articles < 0:
                return articles[:n_articles]

        return articles
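
To see the traversal logic without network access, here is a simplified sketch run on a hypothetical in-memory link graph. `TOY_LINKS` and `sample_articles` are made up for illustration. Unlike the function above, this variant visits each link at most once (sampling without replacement), so it always terminates even when fewer articles exist than requested.

```python
from random import Random

# Hypothetical link graph: list pages map to their links, articles map to [].
TOY_LINKS = {
    "List of writers": ["List of Irish writers", "List of Kenyan writers"],
    "List of Irish writers": ["Writer A", "Writer B"],
    "List of Kenyan writers": ["Writer C", "List of writers"],  # cycle on purpose
    "Writer A": [], "Writer B": [], "Writer C": [],
}

def sample_articles(title, n_articles, rng, max_depth=10, visited=None):
    """Collect up to n_articles leaf pages reachable from title."""
    if visited is None:
        visited = set()
    # same stopping conditions as above, minus the page filter
    if title in visited or n_articles <= 0 or max_depth == 0:
        return []
    visited.add(title)
    links = TOY_LINKS[title]
    if not links:  # an article, not a list
        return [title]
    articles = []
    # visit the links in random order; each link is tried at most once
    for link in rng.sample(links, k=len(links)):
        found = sample_articles(link, n_articles - len(articles),
                                rng, max_depth - 1, visited)
        articles.extend(found)
        if len(articles) >= n_articles:
            break
    return articles[:n_articles]

picked = sample_articles("List of writers", 2, Random(0))
```

The shared `visited` set plays the role of `_memory`: it is what prevents the deliberate cycle back to "List of writers" from recursing forever.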

In [16]:
wiki_keywords_people = {}
for k,v in wiki_pages.items():
    wiki_keywords_people[k] = get_articles_from_dic(wiki_page=v,
                                        n_articles=30, 
                                        page_filter=KeywordArticleChecker(k),
                                        rng=Random(0))
wiki_keywords_people

  Checking: List of French-language authors
  Checking: List of Belarusian writers
  Checking: List of Emirati writers
  Checking: Tobias S. Buckell
  Checking: List of Salvadoran writers
  Checking: List of German-language authors
  Checking: List of Guinean writers
  Checking: List of Colombian writers
  Checking: List of Macedonian writers
  Checking: List of Russian-language writers
  Checking: List of Peruvian writers
  Checking: List of Irish writers
  Checking: Gus John
  Checking: List of Uruguayan writers
  Checking: List of Guinean writers
  Checking: List of Ghanaian writers
  Checking: Alister Hughes
  Checking: List of Austrian writers
  Checking: List of Trinidad and Tobago writers
  Checking: List of Kenyan writers
  Checking: List of Russian-language writers
  Checking: List of Azerbaijani writers
  Checking: List of Emirati writers
  Checking: List of Tanzanian writers
  Checking: List of Sri Lankan writers
  Checking: List of Chinese writers
  Checking: Don Rojas
  

## Part 2
For each selected person, retrieve their Wikidata description and
Wikipedia page title. This can be done using the Wikidata API
along with the wptools Python library.
Once you have a list of Wikipedia page titles, fetch the corresponding English Wikipedia page (if it exists) and extract the first n sentences of its content.
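
One possible way to obtain the Wikidata description is sketched below. It calls the Wikidata `wbgetentities` API directly through the standard library instead of going through wptools, which is a substitution on my part. The function names and the user-agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"

def build_url(title, lang="en"):
    """URL asking Wikidata for the description of the item linked to a wiki title."""
    params = {
        "action": "wbgetentities",
        "sites": f"{lang}wiki",
        "titles": title,
        "props": "descriptions",
        "languages": lang,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"

def extract_description(payload, lang="en"):
    """Return the first description found in a wbgetentities response, else None."""
    for entity in payload.get("entities", {}).values():
        description = entity.get("descriptions", {}).get(lang)
        if description:
            return description["value"]
    return None

def fetch_description(title, lang="en"):
    """Network call; the User-Agent string is a placeholder."""
    request = Request(build_url(title, lang),
                      headers={"User-Agent": "corpus-builder-example/0.1"})
    with urlopen(request) as response:
        return extract_description(json.load(response), lang)
```

Keeping the parsing in `extract_description` separate from the network call makes it possible to check the logic on a saved response. A page without a linked Wikidata item yields `None`, which the caller can use to skip that person.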

In [None]:
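class SpacySentenceTokenizer:
    def __init__(self, nlp=None):
        # build the pipeline here rather than in a default argument,
        # so that each tokenizer gets its own English() instance
        self.nlp = English() if nlp is None else nlp
        # add the rule-based sentence splitter only once
        # (spaCy v2 API; in spaCy v3 use nlp.add_pipe('sentencizer'))
        if 'sentencizer' not in self.nlp.pipe_names:
            self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))

    def __call__(self, txt):
        # generator over the sentence spans of txt
        return self.nlp(txt).sents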

In [None]:
def get_titles_descriptions_sentences(wiki_page : wikipedia.WikipediaPage,
                                        n_sentences,
                                        sentence_tokenize=SpacySentenceTokenizer()):
    
    title = wiki_page.title
    description = wiki_page.summary # TODO: wikidata
    content = wiki_page.content
    # keep the first n_sentences sentences (Span.text works in spaCy v2 and v3)
    sentences = [sent.text.strip() for sent in islice(sentence_tokenize(content), n_sentences)]
    
    return (title, description, sentences)

In [None]:
data = {}
n_sentences = 10

for k,v in wiki_keywords_people.items():
    data[k] = []
    for page in v:
        t,d,s = get_titles_descriptions_sentences(page, n_sentences)
        # ignore persons whose page has fewer than n sentences
        if len(s) < n_sentences:
            continue
        article = {}
        article['title'] = t
        article['description'] = d
        article['sentences'] = s
        data[k].append(article)

In [None]:
data['singer']

In [None]:
# split categories
A = ['singer', 'writer', 'painter']
Z = ['architect', 'politician', 'mathematician']

A_cat = {a:data[a] for a in A}

Z_cat = {z:data[z] for z in Z}

data = {'A':A_cat, 'Z':Z_cat}
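
Since the goal is a balanced corpus, it can help to count sentences per category once the data has this nested shape. The sketch below uses two hypothetical helpers, `flatten` and `sentences_per_category`, and a tiny made-up dictionary that only mimics the structure of `data`.

```python
from collections import Counter

def flatten(data):
    """Turn {type: {category: [article, ...]}} into one row per sentence."""
    rows = []
    for data_type, categories in data.items():
        for category, articles in categories.items():
            for article in articles:
                for sentence in article["sentences"]:
                    rows.append({"type": data_type, "category": category,
                                 "title": article["title"], "sentence": sentence})
    return rows

def sentences_per_category(rows):
    """Count how many sentence rows each category contributes."""
    return Counter(row["category"] for row in rows)

# tiny made-up example with the same shape as `data`
toy = {
    "A": {"singer": [{"title": "S1", "description": "", "sentences": ["a", "b"]}]},
    "Z": {"architect": [{"title": "A1", "description": "", "sentences": ["c", "d"]}]},
}
counts = sentences_per_category(flatten(toy))
# the corpus is balanced when every category has the same count
is_balanced = len(set(counts.values())) == 1
```

On the real data, `sentences_per_category(flatten(data))` should show the same count for each of the six categories if every category kept the same number of persons.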

## Part 3
Save the data for preprocessing.

In [None]:
with open('data.json', 'w') as f:
    json.dump(data, f)
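
The task statement also allows a CSV file. A minimal sketch of that alternative follows, assuming one row per person and the column names in `FIELDS`, both of which are my choices rather than a required layout. The demo writes to an in-memory buffer instead of the hard drive.

```python
import csv
import io

FIELDS = ["type", "category", "title", "description", "text"]

def write_csv(data, handle):
    """Write one CSV row per person from {type: {category: [article, ...]}}."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for data_type, categories in data.items():
        for category, articles in categories.items():
            for article in articles:
                writer.writerow({
                    "type": data_type,
                    "category": category,
                    "title": article["title"],
                    "description": article["description"],
                    # sentences are joined so that each person fits in one cell
                    "text": " ".join(article["sentences"]),
                })

# made-up record with the same shape as the entries of `data`
toy = {"A": {"painter": [{"title": "P1", "description": "a painter",
                          "sentences": ["First.", "Second."]}]}}
buffer = io.StringIO()
write_csv(toy, buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

To write a real file, pass a handle opened with `open('data.csv', 'w', newline='', encoding='utf-8')`, as the `csv` module expects `newline=''`.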