# Creating the corpus
* Extract a text corpus of plain-text sentences from Wikipedia.
* Select the sentences so that the corpus is roughly balanced in terms of training data (each target category should be associated with the same number of sentences).

* Include 6 categories:
architects, mathematicians, painters, politicians, singers and writers.

***

* SCRIPT INPUT:
    
    * a number k of persons per category

    * a number n of sentences per person
    (persons whose Wikipedia page is too short to have n sentences should be ignored).

* SCRIPT OUTPUT:

    * the text of the corresponding Wikipedia page
    
    * the corresponding data type (A or Z) and category
    
    * WikiData description
    
            ▶ Store these in a CSV or JSON file and save it on your hard drive
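
The input/output contract above can be sketched as a small selection step. This is only an illustration: the function name `select_balanced` and the record fields `title` and `sentences` are hypothetical and do not come from the notebook.

```python
from typing import Dict, List


# Hypothetical record layout: each candidate is a dict with a "title" and a
# list of already-tokenized "sentences" (names chosen for this sketch only).
def select_balanced(candidates: Dict[str, List[dict]], k: int, n: int) -> Dict[str, List[dict]]:
    """Keep at most k persons per category, each truncated to exactly n sentences."""
    selected = {}
    for category, persons in candidates.items():
        kept = []
        for person in persons:
            # Persons whose page has fewer than n sentences are ignored
            if len(person["sentences"]) < n:
                continue
            kept.append({**person, "sentences": person["sentences"][:n]})
            # Stop as soon as the category has its k persons
            if len(kept) == k:
                break
        selected[category] = kept
    return selected
```

With the same k and n for every category, each category contributes at most k × n sentences, which is what keeps the corpus roughly balanced.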

## Part 1
Create a list of persons you want to work with. These persons
should fall into two main types: artists (A) and non-artists (Z).
Singers, writers and painters are of type A, while architects, politicians and mathematicians are of type Z. For each category, select 30
persons, so that in total you have a list of 180
persons (half of them artists, half of them not).
You can use the Wikidata warehouse to find persons of the expected categories. More precisely, the Wikidata collection can be
filtered using the SPARQL language and the following endpoint: https://query.wikidata.org/.
You can thus use the SPARQLWrapper Python library to run a
SPARQL query against Wikidata and retrieve the required item identifiers.
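
A minimal sketch of the SPARQL route described above, assuming the query service's machine endpoint `https://query.wikidata.org/sparql`. The occupation QIDs, the helper names `build_query` and `fetch_persons`, and the user-agent string are assumptions made for illustration; check the QIDs on wikidata.org before relying on them.

```python
# Wikidata occupation items (values of property P106). These QIDs are
# assumptions for this sketch and should be verified on wikidata.org.
OCCUPATIONS = {
    "singer": "Q177220",
    "writer": "Q36180",
    "painter": "Q1028181",
    "architect": "Q42973",
    "politician": "Q82955",
    "mathematician": "Q170790",
}

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(occupation_qid: str, limit: int = 30) -> str:
    """Humans with the given occupation that have an English Wikipedia article."""
    return f"""
    SELECT ?person ?personLabel ?article WHERE {{
      ?person wdt:P31 wd:Q5 ;
              wdt:P106 wd:{occupation_qid} .
      ?article schema:about ?person ;
               schema:isPartOf <https://en.wikipedia.org/> .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """


def fetch_persons(occupation_qid: str, limit: int = 30):
    """Run the query and return (item identifier, label) pairs."""
    # Imported here so that build_query works without the dependency installed
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="corpus-builder-example/0.1")
    sparql.setQuery(build_query(occupation_qid, limit))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # The ?person value is a full entity URI; keep only the trailing QID
    return [(b["person"]["value"].rsplit("/", 1)[-1], b["personLabel"]["value"])
            for b in bindings]
```

Calling `fetch_persons(OCCUPATIONS["singer"])` would then return up to 30 item identifiers for singers. The notebook cells that follow take a different route and crawl Wikipedia list pages instead.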

In [166]:
# import modules
import wikipedia
from random import Random
from spacy.lang.en import English
from itertools import islice
import pandas as pd
import json

In [167]:
# store wiki list pages
wiki_pages = {}
wiki_pages['singer'] = wikipedia.page('List of singers')
wiki_pages['writer'] = wikipedia.page('List of writers')
wiki_pages['painter'] = wikipedia.page('List of painters')
wiki_pages['architect'] = wikipedia.page('List of architects')
wiki_pages['politician'] = wikipedia.page('List of politicians by nationality')
wiki_pages['mathematician'] =  wikipedia.page('List of mathematicians')

In [168]:
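def check_article_keyword(wiki_page):
    # derive the keyword from the page title:
    # for a list page, keep the text after "List(s) of"
    # (e.g. 'List of singers' -> 'singers'); otherwise use the title as is
    title = wiki_page.title
    is_list = title.lower().startswith('list') and ' of ' in title
    keyword = title.split(' of ', 1)[1] if is_list else title

    # the page matches if any of its categories mentions the keyword
    for category in wiki_page.categories:
        if keyword in category:
            return True
    return False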

In [169]:
class KeywordArticleChecker:
    def __init__(self,keyword):
        self.keyword = keyword

    def __call__(self,wiki_page):
        for category in wiki_page.categories:
            if self.keyword in category:
                return True
        return False

In [170]:
def check_list_keyword(wiki_page, keyword):
    # keyword was previously undefined here; it is now an explicit argument
    return keyword in wiki_page.title

In [171]:
def check_is_not_list(wiki_page):   
    for category in wiki_page.categories:
        if category.lower().startswith('list'):
            return False
    return True

In [172]:
def get_articles_from_dic(wiki_page, 
                            n_articles,
                            page_filter=lambda x: True,
                            check_is_final_node=check_is_not_list,
                            rng=Random(),
                            max_depth = 7,
                            _memory=None):

    print('  Checking:', wiki_page.title)
    # if the function is being called for the first time, assign an empty set 
    if _memory is None:
        _memory = set()

    # trivial cases: return an empty list if
    # - the current wiki_page has already been visited,
    # - the number of articles retrieved satisfies the request,
    # - the page is rejected by page_filter, or
    # - the maximum recursion depth has been reached
    if (wiki_page.title in _memory 
        or n_articles <= 0
        or not page_filter(wiki_page)
        or max_depth == 0):
        return []

    # if the page has not been visited yet, add it to the _memory
    _memory.add(wiki_page.title)

    # if the wiki_page is an article (not a list),
    # return it 
    if check_is_final_node(wiki_page):
        print('▶ Added to list:',wiki_page.title)
        return [wiki_page]
    
    # else, the page is a list
    else:
        articles = []
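        # draw random links from the list page until enough articles are found
        while n_articles > 0:
            # choices() returns a list, so take its single element
            title = rng.choices(wiki_page.links, k=1)[0]
            try:
                page = wikipedia.page(title)
            except wikipedia.exceptions.WikipediaException:
                # skip links that are ambiguous or do not resolve to a page
                continue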

            # recurse into the linked page and add what it returns to the list of articles
            local_articles = get_articles_from_dic(wiki_page=page, 
                                                    n_articles=1, 
                                                    page_filter=page_filter,
                                                    check_is_final_node=check_is_final_node,
                                                    rng=rng,
                                                    max_depth = max_depth-1,
                                                    _memory=_memory)
            articles.extend(local_articles)

            n_articles -= len(local_articles)

            # if the number of articles retrieved satisfies the request
            # return the articles
            if n_articles == 0:
                return articles
            elif n_articles < 0:
                return articles[:n_articles]

        return articles

In [177]:
wiki_keywords_people = {}
for k,v in wiki_pages.items():
    wiki_keywords_people[k] = get_articles_from_dic(wiki_page=v,
                                        n_articles=3, 
                                        page_filter=KeywordArticleChecker(k),
                                        rng=Random(0))
wiki_keywords_people

  Checking: Lists of singers
  Checking: List of scat singers
  Checking: Nikki Yanofsky
▶ Added to list: Nikki Yanofsky
  Checking: List of Nepalese singers
  Checking: Karna Das
▶ Added to list: Karna Das
  Checking: List of Romanian singers
  Checking: Joseph Schmidt
▶ Added to list: Joseph Schmidt
  Checking: Lists of writers
  Checking: List of poetry awards
  Checking: List of early-modern women playwrights (UK)
  Checking: Frances Brooke
▶ Added to list: Frances Brooke
  Checking: List of French-language authors
  Checking: Jean du Vergier de Hauranne
  Checking: Henri Michaux
▶ Added to list: Henri Michaux
  Checking: List of historical novelists
  Checking: List of Hebrew-language authors
  Checking: List of Hebrew-language playwrights
  Checking: List of Hebrew-language poets
  Checking: Solomon ibn Gabirol
▶ Added to list: Solomon ibn Gabirol
  Checking: Lists of painters
  Checking: Plastic arts
  Checking: Model (art)
  Checking: List of Carlo Maratta pupils and assistants

{'singer': [<WikipediaPage 'Nikki Yanofsky'>,
  <WikipediaPage 'Karna Das'>,
  <WikipediaPage 'Joseph Schmidt'>],
 'writer': [<WikipediaPage 'Frances Brooke'>,
  <WikipediaPage 'Henri Michaux'>,
  <WikipediaPage 'Solomon ibn Gabirol'>],
 'painter': [<WikipediaPage 'Filippo Tancredi'>,
  <WikipediaPage 'Yuki Ogura'>,
  <WikipediaPage 'Esteban Márquez de Velasco'>],
 'architect': [<WikipediaPage 'Roger Taillibert'>,
  <WikipediaPage 'Paulo Mendes da Rocha'>,
  <WikipediaPage 'Hippodamus of Miletus'>],
 'politician': [<WikipediaPage 'Joseph Henry Burrows'>,
  <WikipediaPage 'John Quinn (New York politician)'>,
  <WikipediaPage 'Forrest Goodwin'>],
 'mathematician': [<WikipediaPage 'Prasanta Chandra Mahalanobis'>,
  <WikipediaPage 'Pafnuty Chebyshev'>,
  <WikipediaPage 'Varāhamihira'>]}

## Part 2
For each selected person, retrieve their Wikidata description and
Wikipedia page title. This can be done using the Wikidata API
along with the wptools Python library.
Once you have a list of Wikipedia page titles, fetch (if it exists) the corresponding English Wikipedia page, and extract the first n sentences of its content.
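
The cells below use the page summary as a stand-in for the Wikidata description. As a hedged alternative, the description can be requested from the Wikidata API's `wbgetentities` action by English Wikipedia title. This sketch calls the API directly with the standard library rather than through wptools; the helper names `parse_description` and `fetch_description` and the user-agent string are assumptions.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def parse_description(response: dict, lang: str = "en"):
    """Return the description from a wbgetentities response, or None if absent."""
    for entity in response.get("entities", {}).values():
        description = entity.get("descriptions", {}).get(lang)
        if description:
            return description["value"]
    return None


def fetch_description(title: str, lang: str = "en"):
    """Look up the Wikidata description of the item linked to an enwiki title."""
    params = urllib.parse.urlencode({
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": title,
        "props": "descriptions",
        "languages": lang,
        "format": "json",
    })
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"User-Agent": "corpus-builder-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_description(json.load(reply), lang)
```

Keeping the parsing separate from the network call makes it possible to check the parsing logic on a saved response without querying Wikidata.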

In [183]:
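class SpacySentenceTokenizer:
    def __init__(self, nlp=None):
        # build the pipeline here so instances do not share one default object
        self.nlp = English() if nlp is None else nlp
        # rule-based sentence splitter (spaCy 2.x API;
        # on spaCy 3.x use self.nlp.add_pipe('sentencizer') instead)
        if 'sentencizer' not in self.nlp.pipe_names:
            self.nlp.add_pipe(self.nlp.create_pipe('sentencizer'))

    def __call__(self, txt):
        # returns a generator of sentence spans
        return self.nlp(txt).sents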

In [184]:
def get_titles_descriptions_sentences(wiki_page : wikipedia.WikipediaPage,
                                        n_sentences,
                                        sentence_tokenize=SpacySentenceTokenizer()):
    
    title = wiki_page.title
    description = wiki_page.summary # TODO: wikidata
    content = wiki_page.content
    # Span.text is available in both spaCy 2.x and 3.x (Span.string was removed in 3.x)
    sentences = [sent.text.strip() for sent in islice(sentence_tokenize(content), n_sentences)]
    
    return (title, description, sentences)

In [186]:
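data = {}
n_sentences = 10  # n: number of sentences kept per person

for k,v in wiki_keywords_people.items():
    data[k] = []
    for page in v:
        t,d,s = get_titles_descriptions_sentences(page, n_sentences)
        # ignore persons whose page is too short to have n sentences
        if len(s) < n_sentences:
            continue
        article = {}
        article['title'] = t
        article['description'] = d
        article['sentences'] = s
        data[k].append(article)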

In [194]:
data['singer']

[{'title': 'Nikki Yanofsky',
  'description': 'Nicole Rachel "Nikki" Yanofsky (born February 8, 1994) is a jazz-pop singer from Montreal, Quebec. She sang the CTV Olympic broadcast theme song, "I Believe", which was also the theme song of the 2010 Winter Olympic Games. She also performed at the opening and closing ceremonies for the Olympics and at the opening ceremony of the 2010 Winter Paralympic Games. She has released three studio albums to date, including Nikki in 2010, Little Secret in 2014, and Turn Down the Sound in 2020.',
  'sentences': ['Nicole Rachel "Nikki" Yanofsky (born February 8, 1994) is a jazz-pop singer from Montreal, Quebec.',
   'She sang the CTV Olympic broadcast theme song, "I Believe", which was also the theme song of the 2010 Winter Olympic Games.',
   'She also performed at the opening and closing ceremonies for the Olympics and at the opening ceremony of the 2010 Winter Paralympic Games.',
   'She has released three studio albums to date, including Nikki in 

In [196]:
# split categories
A = ['singer', 'writer', 'painter']
Z = ['architect', 'politician', 'mathematician']

A_cat = {a:data[a] for a in A}

Z_cat = {z:data[z] for z in Z}

data = {'A':A_cat, 'Z':Z_cat}

## Part 3
Save data for preprocessing

In [198]:
with open('data.json', 'w') as f:
    json.dump(data, f)
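
Since the task also allows CSV output, here is a minimal sketch that flattens the nested `data` dictionary into one row per person. The helper names `flatten` and `write_csv` and the column names are assumptions for illustration, and the sentences are joined into a single `text` column.

```python
import csv

FIELDNAMES = ["type", "category", "title", "description", "text"]


def flatten(data):
    """Yield one row per person from the {type: {category: [article, ...]}} dict."""
    for data_type, categories in data.items():
        for category, articles in categories.items():
            for article in articles:
                yield {
                    "type": data_type,
                    "category": category,
                    "title": article["title"],
                    "description": article["description"],
                    "text": " ".join(article["sentences"]),
                }


def write_csv(data, stream):
    """Write the flattened rows, with a header line, to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(flatten(data))
```

A possible call is `write_csv(data, f)` inside `with open('data.csv', 'w', newline='', encoding='utf-8') as f:`.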