# Creating the corpus
* Extracting a text corpus from Wikipedia made of plain text sentences
* selected so as to have a roughly balanced corpus in terms of training data (each target category should be associated with the same number of sentences). 

* Including 6 categories:
architects, mathematicians, painters, politicians, singers and writers.

***

* SCRIPT INPUT:
    
    * a number k of persons per category

    * a number n of sentences per person
    (persons whose wikipedia description is too short to have n sentences, should be ignored). 

* SCRIPT OUTPUT:

    * the text of the corresponding Wikipedia page
    
    * the corresponding data type (A or Z) and category
    
    * WikiData description
    
            ▶ Store these into a csv or a json file and save it on your hard drive

## Part 1
Create a list of persons you want to work with. Those persons
should fall into two main types: artists (A) and non artists (Z).
Singers, writers and painters are of type A while architects, politicians and mathematicians of type Z. For each category, select 30
persons of that category. So in total you should have a list of 180
persons (half of them are artists and half of them are not).
You can use the wikidata warehouse to find persons of the expected categories. More precisely, the Wikidata collection can be
filtered out using the SPARQL language and the following endpoint: https://query.wikidata.org/.
You can thus use the SPARQLwrapper python library to apply a
SPARQL query to the Wikidata warehouse and retrieve the required item identifiers.

In [1]:
# install packages
!pip install wikipedia



In [58]:
# import modules
import wikipedia
from random import Random 

# input number of articles required
while True:
    try:
        n = int(input("How many aricles would you like to search?\n"))
        break
    except ValueError:
        print("\nNot a valid number! :(\nPlease, try again.\n")

    ask for keyword

In [2]:
# store wiki list pages
wiki_page_singers = wikipedia.page('List of singers')
wiki_page_writers = wikipedia.page('List of writers')
wiki_page_painters = wikipedia.page('List of painters')
wiki_page_architects = wikipedia.page('List of architects')
wiki_page_politicians = wikipedia.page('List of politicians')
wiki_page_mathematicians = wikipedia.page('List of mathematicians')

In [3]:
dir(wiki_page_architects)

['_WikipediaPage__continued_query',
 '_WikipediaPage__load',
 '_WikipediaPage__title_query_param',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'categories',
 'content',
 'coordinates',
 'html',
 'images',
 'links',
 'original_title',
 'pageid',
 'parent_id',
 'references',
 'revision_id',
 'section',
 'sections',
 'summary',
 'title',
 'url']

In [35]:
def check_article_keyword(wiki_page):
    #if article
    

    #if list
    keyword = wiki_page.title #after list(s) of
    for category in wiki_page.categories:
        if keyword in category:
            return True
    return False

In [48]:
class KeywordArticleChecker:
    def __init__(self,keyword):
        self.keyword = keyword

    def __call__(self,wiki_page):
        for category in wiki_page.categories:
            if self.keyword in category:
                return True
        return False

In [36]:
def check_list_keyword(wiki_page):
    
    if keyword in wiki_page.title:
        return True
    return False

In [37]:
def check_is_not_list(wiki_page):   
    for category in wiki_page.categories:
        if 'list' in category.lower():
            return False
    return True

In [38]:
wiki_page_singers.categories

['Articles with short description',
 'Lists of performers lists',
 'Lists of singers',
 'Short description is different from Wikidata',
 'Use dmy dates from April 2020']

In [60]:
def get_articles_from_list(wiki_page, 
                            n_articles,
                            page_filter=lambda x: True,
                            check_is_final_node=check_is_not_list,
                            rng=Random(),
                            _memory=None):
    '''
    '''

    print(wiki_page.title)
    # if the function is being called for the first time, assign an empty set 
    if _memory is None:
        _memory = set()

    # trivial cases:
    # if the number of articles retrieved satisfies the request
    # or the current wiki_page has already been visited
    # return an empty list
    if (wiki_page.title in _memory 
        or n_articles <= 0
        or not page_filter(wiki_page)):
        return []

    # if the page has not been visited yet, add it to the _memory
    _memory.add(wiki_page.title)

    # if the wiki_page is an article (not a list),
    # return it 
    if check_is_final_node(wiki_page):
        return [wiki_page]
    
    # else, the page is a list
    else:
        articles = []
        # shuffle the links of the list
        while n_articles > 0:
            title = rng.choices(wiki_page.links, k=1)   
            try:
                page = wikipedia.page(title)
            except wikipedia.exceptions.WikipediaException:
                continue

            # get articles from the list links and add them to the list of articles
            local_articles = get_articles_from_list(wiki_page=page, 
                                                    n_articles=1, 
                                                    check_is_final_node=check_is_final_node,
                                                    page_filter=page_filter,
                                                    rng=rng,
                                                    _memory=_memory)
            articles.extend(local_articles)

            n_articles -= len(local_articles)

            # if the number of articles retrieved satisfies the request
            # return the articles
            if n_articles == 0:
                return articles
            elif n_articles < 0:
                return articles[:n_articles]

        return articles

In [70]:
print(get_articles_from_list(wiki_page_mathematicians,
 1000, 
 KeywordArticleChecker("mathematician"),
 rng=Random(3)))

Lists of mathematicians
List of Indian mathematicians


KeyboardInterrupt: 

# input number of articles required
while True:
    try:
        n = int(input("How many aricles would you like to search?\n"))
        break
    except ValueError:
        print("\nNot a valid number! :(\nPlease, try again.\n")

In [None]:
# split categories
A = ['singer', 'writer', 'painter']
Z = ['architect', 'politician', 'mathematician']

# create list of people



## Part 2
for each selected person, retrieve his.her Wikidata description and
Wikipedia page title. This can be done using the wikidata API
along with the wptools python library.

## Part 3
Once you have a list of wikipedia page titles, fetch (if it exists)
the corresponding English wikipedia page, and extract the n first
sentences of its content.