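Connect to Wikipedia using the `wikipediaapi` library. The call below uses the older one-argument form; recent releases of the library also expect a user agent string when creating the `Wikipedia` object.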

In [16]:
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

# Getting a list of writers

In [19]:
wiki_wiki.page("Category:Writers").categorymembers

{'Writer': Writer (id: ??, ns: 0),
 'Category:Writers by award': Category:Writers by award (id: ??, ns: 14),
 'Category:Writers by city': Category:Writers by city (id: ??, ns: 14),
 'Category:Writers by continent': Category:Writers by continent (id: ??, ns: 14),
 'Category:Writers by ethnicity': Category:Writers by ethnicity (id: ??, ns: 14),
 'Category:Writers by format': Category:Writers by format (id: ??, ns: 14),
 'Category:Writers by genre': Category:Writers by genre (id: ??, ns: 14),
 'Category:Writers by language': Category:Writers by language (id: ??, ns: 14),
 'Category:Writers by nationality': Category:Writers by nationality (id: ??, ns: 14),
 'Category:Writers by period': Category:Writers by period (id: ??, ns: 14),
 'Category:Writers by religion': Category:Writers by religion (id: ??, ns: 14),
 'Category:Writers by subject area': Category:Writers by subject area (id: ??, ns: 14),
 'Category:Lists of writers': Category:Lists of writers (id: ??, ns: 14),
 'Category:Works abou

We can see that the Writers category in Wikipedia has many subcategories.
We will work with the subcategory `Writers by nationality`, assuming that every writer is included in at least one of its sublists.

In [77]:
wiki_wiki.page("Category:Writers by nationality").categorymembers.keys()

dict_keys(['Category:Writers by century and nationality', 'Category:Writers by language and nationality', 'Category:Writers by nationality and century', 'Category:Writers by nationality and city', 'Category:Writers by nationality and language', 'Category:Bloggers by nationality', "Category:Children's writers by nationality", 'Category:Comics writers by nationality', 'Category:Communist writers by nationality', 'Category:Crime writers by nationality', 'Category:Dramatists and playwrights by nationality', 'Category:Fiction writers by nationality', 'Category:Marxist writers by nationality', 'Category:Newspaper writers by nationality', 'Category:Non-fiction writers by nationality', 'Category:Poets by nationality', 'Category:Propagandists by nationality', 'Category:Romantic fiction writers by nationality', 'Category:Scholars and academics by nationality', 'Category:Screenwriters by nationality', 'Category:Songwriters by nationality', 'Category:Spiritual writers by nationality', 'Category:Te

We can see that the list of subcategories by nationality starts with some additional groupings that appear before the actual per-nationality categories (starting with `Afghan writers`). The categories we're interested in have the form `Category:<words> writers`.

We can use a regular expression to match only categories of this type.

In [35]:
import re
pattern = re.compile(r'Category:[A-Za-z ]* writers$')
all_cats = wiki_wiki.page("Category:Writers by nationality").categorymembers.keys()
cats = [c for c in all_cats if pattern.match(c)]
cats

['Category:Afghan writers',
 'Category:Albanian writers',
 'Category:Algerian writers',
 'Category:American writers',
 'Category:Andorran writers',
 'Category:Angolan writers',
 'Category:Anguillan writers',
 'Category:Antigua and Barbuda writers',
 'Category:Argentine writers',
 'Category:Armenian writers',
 'Category:Aruban writers',
 'Category:Australian writers',
 'Category:Austrian writers',
 'Category:Azerbaijani writers',
 'Category:Bahamian writers',
 'Category:Bahraini writers',
 'Category:Bangladeshi writers',
 'Category:Barbadian writers',
 'Category:Belarusian writers',
 'Category:Belgian writers',
 'Category:Belizean writers',
 'Category:Beninese writers',
 'Category:Bermudian writers',
 'Category:Bhutanese writers',
 'Category:Bolivian writers',
 'Category:Bosnia and Herzegovina writers',
 'Category:Botswana writers',
 'Category:Brazilian writers',
 'Category:British writers',
 'Category:Bruneian writers',
 'Category:Bulgarian writers',
 'Category:Burmese writers',
 'Cate
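
As a quick offline check of what this pattern accepts and rejects, the sketch below applies it to a few titles copied from the outputs above. The last title is hypothetical and is only there to show that the character class `[A-Za-z ]` would skip names containing a hyphen (or accented letters), which may or may not matter for the real category list.

```python
import re

# Same pattern as above: "Category:", then letters/spaces, then " writers" at the end
pattern = re.compile(r'Category:[A-Za-z ]* writers$')

sample_titles = [
    'Category:Afghan writers',
    'Category:Antigua and Barbuda writers',
    'Category:Bloggers by nationality',
    "Category:Children's writers by nationality",
    'Category:Writers by nationality and city',
    'Category:Example-Land writers',  # hypothetical title containing a hyphen
]

kept = [t for t in sample_titles if pattern.match(t)]
dropped = [t for t in sample_titles if not pattern.match(t)]

print(kept)
print(dropped)
```

If hyphenated or accented names turn out to be present, widening the character class would be one way to include them.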

Now we can have a look at the response structure for each of the subcategories:

In [37]:
wiki_wiki.page(cats[0]).categorymembers.keys()

dict_keys(['Hyder Akbar', 'Hasan Akhund', 'Sonita Alizadeh', 'Mohammad Ishaq Aloko', 'Awista Ayub', 'Durkhanai Ayubi', 'Mohammad Yousuf Azraq', 'Abdul Hamid Bahij', 'Wasef Bakhtari', 'Farhad Bitani', 'Zohre Esmaeli', 'Masuma Esmati-Wardak', 'Fereshteh Forough', 'Abdul Hai Habibi', 'Chékéba Hachemi', 'Mohammad Shafiq Hamdam', 'M. Jamil Hanifi', 'Saifuddin Jalal', 'Malalai Joya', 'Faizullah Kakar', 'Kazem Kazemi', 'Qiamuddin Khadim', 'Hafizullah Khaled', 'Khalilullah Khalili', 'Gharzai Khwakhuzhi', 'Mohammad Ibraheem Khwakhuzhi', 'Jamil Jan Kochai', 'Maryam Mahboob', 'Razaq Mamoon', 'Lutfullah Mashal', 'Mohammad Daud Miraki', 'Y. Misdaq', 'Nazar Mohammad Mutmaeen', 'Ali Mohaqiq Nasab', 'Hakan Massoud Navabi', 'Fariba Nawa', 'Massoud Nawabi', 'Qais Akbar Omar', 'Rahraw Omarzad', 'Parween Pazhwak', 'Layla Sarahat Rushani', 'Amanullah Sailaab Sapi', 'Humira Saqib', 'Idries Shah', 'Asef Soltanzadeh', 'Gul Pacha Ulfat', 'Mohammad Amin Wakman', 'Burhanuddin Kushkaki', 'Sakena Yacoobi', 'Hamid 

We can see that the subcategories contain subcategories of their own as well as the pages of actual writers.
We can use a regular expression to keep only individual pages in the final list, that is, titles that do not start with `Category:` or `List of`.

In [72]:
p2 = re.compile(r'^(?!(Category\:)|(List of)).*')
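
To illustrate how the negative lookahead behaves, here is a small sketch that runs the same pattern over a mixed list of titles. The two person names come from the Afghan writers output above; the `Category:` and `List of` titles are hypothetical placeholders.

```python
import re

# Negative lookahead: reject titles that start with "Category:" or "List of"
p2 = re.compile(r'^(?!(Category\:)|(List of)).*')

titles = [
    'Hyder Akbar',
    'Sonita Alizadeh',
    'Category:Example subcategory',  # hypothetical nested category
    'List of example writers',       # hypothetical list page
]

# match() returns None when the lookahead fails at the start of the string
people = [t for t in titles if p2.match(t)]
print(people)
```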

Using the filtering pattern, we can create a list of all writers and keep each writer's subcategory in case it is needed later. We will store the list in a CSV file and can load it into Pandas for visualization.

In [4]:
import pandas as pd

In [55]:
p3 = re.compile(r'(?<=Category\:).*')

In [61]:
re.search(p3, 'Category:Afghan writers').group(0)

'Afghan writers'

In [83]:
def get_people_list(cat_list, top_cat, pattern):

    all_people_list = []
    p3 = re.compile(r'(?<=Category\:).*')

    # Go through all the writers by nationality subcategories we filtered above
    for cat in cat_list:
        # Include only individual pages from these categories.
        # Note: the set only guards against repeats within a single category;
        # a person listed under two nationality categories still appears once per category.
        people = {p for p in wiki_wiki.page(cat).categorymembers.keys() if pattern.match(p)}

        # In the Pandas DataFrame we want to show the name of the person, the subcategory and the high-level category (e.g. Writers).
        # We create a dictionary for each row of the future DataFrame.
        # We remove `Category:` in front of the subcategory name for better readability using pattern p3.
        for p in people:
            people_dict = {'Person': p, 'Subcategory': re.search(p3, cat).group(0), 'Category': top_cat}
            all_people_list.append(people_dict)
            
    return all_people_list

In [84]:
all_writers = get_people_list(cats, 'Writers', p2)
all_writers_df = pd.DataFrame(all_writers)
all_writers_df

Unnamed: 0,Person,Subcategory,Category
0,Khalilullah Khalili,Afghan writers,Writers
1,Mohammad Hashem Zamani,Afghan writers,Writers
2,Mohammad Ishaq Aloko,Afghan writers,Writers
3,Mohammad Shafiq Hamdam,Afghan writers,Writers
4,M. Jamil Hanifi,Afghan writers,Writers
...,...,...,...
11543,Denford Magora,Zimbabwean writers,Writers
11544,Charles Mudede,Zimbabwean writers,Writers
11545,Tendai Huchu,Zimbabwean writers,Writers
11546,Edmund Chipamaunga,Zimbabwean writers,Writers
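
Because the list is built one subcategory at a time, the same person could in principle appear in more than one row. The sketch below shows one way to check for that and to keep a single row per person, using a toy DataFrame in the same shape as the one above. `Example Person` is a hypothetical entry, and keeping the first subcategory seen is an arbitrary choice.

```python
import pandas as pd

# Toy rows in the same shape as the DataFrame above
rows = [
    {'Person': 'Khalilullah Khalili', 'Subcategory': 'Afghan writers', 'Category': 'Writers'},
    {'Person': 'Example Person', 'Subcategory': 'Afghan writers', 'Category': 'Writers'},
    {'Person': 'Example Person', 'Subcategory': 'American writers', 'Category': 'Writers'},
]
toy_df = pd.DataFrame(rows)

# Rows whose Person value occurs more than once (keep=False marks every occurrence)
repeated = toy_df[toy_df['Person'].duplicated(keep=False)]
print(repeated)

# Keep only the first subcategory seen for each person
unique_people = toy_df.drop_duplicates(subset='Person', keep='first')
print(len(toy_df), len(unique_people))
```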


We can store the results in a CSV if needed for further research.

In [85]:
all_writers_df.to_csv('all_writers_list.csv')

# Getting all astronauts

We will repeat the investigation for Astronauts now.

In [74]:
wiki_wiki.page("Category:Astronauts").categorymembers

{'Astronaut': Astronaut (id: ??, ns: 0),
 'Women in space': Women in space (id: ??, ns: 0),
 'Commercial astronaut': Commercial astronaut (id: ??, ns: 0),
 'ESA CAVES': ESA CAVES (id: ??, ns: 0),
 'Max Q (astronaut band)': Max Q (astronaut band) (id: ??, ns: 0),
 'Mission specialist': Mission specialist (id: ??, ns: 0),
 'Payload specialist': Payload specialist (id: ??, ns: 0),
 'Astronaut ranks and positions': Astronaut ranks and positions (id: ??, ns: 0),
 'Category:Astronauts by nationality': Category:Astronauts by nationality (id: ??, ns: 14),
 'Category:Astronauts by space program': Category:Astronauts by space program (id: ??, ns: 14),
 'Category:Lists of astronauts': Category:Lists of astronauts (id: ??, ns: 14),
 'Category:Books by astronauts': Category:Books by astronauts (id: ??, ns: 14),
 'Category:Astronaut candidates': Category:Astronaut candidates (id: ??, ns: 14),
 'Category:Commercial astronauts': Category:Commercial astronauts (id: ??, ns: 14),
 'Category:Cultural depi

There is a similar subcategory `Astronauts by nationality` which we can use as a base.

In [75]:
wiki_wiki.page("Category:Astronauts by nationality").categorymembers.keys()

dict_keys(['Category:Lists of astronauts by nationality', 'Category:Afghan cosmonauts', 'Category:American astronauts', 'Category:Australian astronauts', 'Category:Austrian astronauts', 'Category:Belgian astronauts', 'Category:Brazilian astronauts', 'Category:British astronauts', 'Category:Bulgarian cosmonauts', 'Category:Canadian astronauts', 'Category:Chinese astronauts', 'Category:Costa Rican astronauts', 'Category:Cuban cosmonauts', 'Category:Czech cosmonauts and astronauts', 'Category:Czechoslovak cosmonauts', 'Category:Danish astronauts', 'Category:Dutch astronauts', 'Category:Emirati astronauts', 'Category:French spationauts', 'Category:German astronauts', 'Category:Hungarian astronauts', 'Category:Indian astronauts', 'Category:Iranian astronauts', 'Category:Israeli astronauts', 'Category:Italian astronauts', 'Category:Japanese astronauts', 'Category:Kazakhstani cosmonauts', 'Category:Kyrgyzstani cosmonauts', 'Category:Lithuanian astronauts', 'Category:Malaysian astronauts', 'Ca

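Almost all the subcategories we're interested in have the form `Category:<words> astronauts` or `Category:<words> cosmonauts`.
We can add a regex to match them. Note that this pattern does not match `Category:French spationauts` from the listing above, so that subcategory is left out of the results below: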

In [80]:
p4 = re.compile(r'Category:[A-Za-z ]* (astronauts|cosmonauts)$')
all_astr_cats = wiki_wiki.page("Category:Astronauts by nationality").categorymembers.keys()
astr_cats = [c for c in all_astr_cats if p4.match(c)]
astr_cats

['Category:Afghan cosmonauts',
 'Category:American astronauts',
 'Category:Australian astronauts',
 'Category:Austrian astronauts',
 'Category:Belgian astronauts',
 'Category:Brazilian astronauts',
 'Category:British astronauts',
 'Category:Bulgarian cosmonauts',
 'Category:Canadian astronauts',
 'Category:Chinese astronauts',
 'Category:Costa Rican astronauts',
 'Category:Cuban cosmonauts',
 'Category:Czech cosmonauts and astronauts',
 'Category:Czechoslovak cosmonauts',
 'Category:Danish astronauts',
 'Category:Dutch astronauts',
 'Category:Emirati astronauts',
 'Category:German astronauts',
 'Category:Hungarian astronauts',
 'Category:Indian astronauts',
 'Category:Iranian astronauts',
 'Category:Israeli astronauts',
 'Category:Italian astronauts',
 'Category:Japanese astronauts',
 'Category:Kazakhstani cosmonauts',
 'Category:Kyrgyzstani cosmonauts',
 'Category:Lithuanian astronauts',
 'Category:Malaysian astronauts',
 'Category:Mexican astronauts',
 'Category:Mongolian cosmonauts'
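
`Category:French spationauts` appears in the category listing but not in the filtered result, because the pattern only allows `astronauts` or `cosmonauts`. A minimal sketch of an extended pattern follows, assuming French spationauts should be included; `p4_extended` is a name introduced here for illustration.

```python
import re

# Original pattern plus "spationauts" (an assumption about what should be included)
p4_extended = re.compile(r'Category:[A-Za-z ]* (astronauts|cosmonauts|spationauts)$')

# Titles copied from the category listing above
titles = [
    'Category:Lists of astronauts by nationality',
    'Category:Afghan cosmonauts',
    'Category:Czech cosmonauts and astronauts',
    'Category:French spationauts',
]

matched = [t for t in titles if p4_extended.match(t)]
print(matched)
```

Using the extended pattern would change the row counts shown in the outputs that follow.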

Now we can create a DataFrame using all these subcategories:

In [86]:
all_astronauts = get_people_list(astr_cats, 'Astronauts', p2)
all_astronauts_df = pd.DataFrame(all_astronauts)
all_astronauts_df

Unnamed: 0,Person,Subcategory,Category
0,Abdul Ahad Momand,Afghan cosmonauts,Astronauts
1,John-David F. Bartoe,American astronauts,Astronauts
2,Robert J. Cenker,American astronauts,Astronauts
3,Marion Dietrich,American astronauts,Astronauts
4,Jared Isaacman,American astronauts,Astronauts
...,...,...,...
446,Pavel Popovich,Ukrainian cosmonauts,Astronauts
447,Yaroslav Pustovyi,Ukrainian cosmonauts,Astronauts
448,Eugene H. Trinh,Vietnamese astronauts,Astronauts
449,Phạm Tuân,Vietnamese astronauts,Astronauts


We can also store this information in a CSV:

In [87]:
all_astronauts_df.to_csv('all_astronauts_list.csv')

_______________________

Overall, there are many more writers (11,547 rows) than astronauts (450 rows), an imbalance we will need to address later.

# Getting texts for each person

Now that we have the lists of people, we can store the content of each article in a folder named after its category.

In [94]:
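import os
def save_people_text(people, category):
    os.makedirs(category, exist_ok=True)  # make sure the target folder exists
    for p in people:
        # One file per person: <category>/<person>_<category>.txt
        with open(f'{category}/{p}_{category}.txt', 'w') as f:
            f.write(wiki_wiki.page(p).text)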
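
One thing to watch with this naming scheme is that a page title containing `/` would be interpreted as a nested path. The sketch below shows one possible way to guard against that. `safe_filename` and `article_path` are hypothetical helpers, the title is made up, and placeholder text is written to a temporary folder instead of calling Wikipedia.

```python
import os
import tempfile

def safe_filename(name):
    # Hypothetical helper: replace path separators so the title stays a single file name
    return name.replace('/', '_').replace(os.sep, '_')

def article_path(folder, person, category):
    # Same naming scheme as save_people_text: <folder>/<person>_<category>.txt
    return os.path.join(folder, f'{safe_filename(person)}_{category}.txt')

# Demonstration in a temporary folder with placeholder text
with tempfile.TemporaryDirectory() as tmp:
    path = article_path(tmp, 'AC/DC example title', 'Writers')  # hypothetical title
    with open(path, 'w', encoding='utf-8') as f:
        f.write('placeholder article text')
    saved = os.listdir(tmp)

print(saved)
```

If such a helper were used when saving, the same helper would have to be used when reading the files back.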

In [95]:
save_people_text(all_astronauts_df['Person'], 'Astronauts')

In [107]:
save_people_text(all_writers_df['Person'], 'Writers')

# Combined DataFrame

We know that there are many more writer articles than astronaut articles.
To address this, we will first draw a random sample of the same size from Writers and from Astronauts and then store their texts in a DataFrame.

In [108]:
def get_sample_2cat(size, df1, df2):
    return pd.concat([df1.sample(n=size), df2.sample(n=size)])
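
The function above draws a different sample on every run and raises an error if `size` exceeds the smaller DataFrame. As a sketch only, a variant that fixes the random seed and caps the sample size could look like this; `get_balanced_sample`, its `seed` parameter and the toy data are assumptions introduced for illustration.

```python
import pandas as pd

def get_balanced_sample(df1, df2, size=None, seed=0):
    # Hypothetical variant of get_sample_2cat: cap the size at the smaller
    # DataFrame and fix the random seed so the sample can be reproduced
    max_size = min(len(df1), len(df2))
    n = max_size if size is None else min(size, max_size)
    return pd.concat([df1.sample(n=n, random_state=seed),
                      df2.sample(n=n, random_state=seed)])

# Toy data standing in for the astronaut and writer DataFrames
small = pd.DataFrame({'Person': [f'A{i}' for i in range(5)], 'Category': 'Astronauts'})
large = pd.DataFrame({'Person': [f'W{i}' for i in range(50)], 'Category': 'Writers'})

sample = get_balanced_sample(small, large, size=10)
print(sample['Category'].value_counts())
```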

In [111]:
combined_df = get_sample_2cat(200, all_astronauts_df, all_writers_df)
combined_df

Unnamed: 0,Person,Subcategory,Category
104,Jeanette Epps,American astronauts,Astronauts
288,Sergey Avdeev,Russian cosmonauts,Astronauts
198,Matthias Maurer,German astronauts,Astronauts
365,Andriyan Nikolayev,Soviet cosmonauts,Astronauts
412,Gennady Manakov,Soviet cosmonauts,Astronauts
...,...,...,...
7053,Uhwudong,Korean writers,Writers
5383,Óttar M. Norðfjörð,Icelandic writers,Writers
4093,Claude Phillips,English writers,Writers
7477,Francis Moto,Malawian writers,Writers


We can now merge the DataFrame with the actual text of the person's article:

In [132]:
def add_text_to_df(df):
    texts = []
    # Read the saved article for each (category, person) pair, in row order
    for cat, p in zip(df['Category'], df['Person']):
        with open(f'{cat}/{p}_{cat}.txt') as f:
            texts.append(f.read())

    # The column is added to the DataFrame passed in, which is also returned
    df['Text'] = texts
    return df

In [133]:
df_with_texts = add_text_to_df(combined_df)
df_with_texts

Unnamed: 0,Person,Subcategory,Category,Text
104,Jeanette Epps,American astronauts,Astronauts,"Jeanette Jo Epps (born November 3, 1970) is an..."
288,Sergey Avdeev,Russian cosmonauts,Astronauts,Sergei Vasilyevich Avdeyev (Сергей Васильевич ...
198,Matthias Maurer,German astronauts,Astronauts,Matthias Josef Maurer (born 18 March 1970) is ...
365,Andriyan Nikolayev,Soviet cosmonauts,Astronauts,Andriyan Grigoryevich Nikolayev (Chuvash and R...
412,Gennady Manakov,Soviet cosmonauts,Astronauts,Gennady Mikhailovich Manakov (Russian: Геннади...
...,...,...,...,...
7053,Uhwudong,Korean writers,Writers,"Eowudong or Uhwudong (어우동, 於宇同; 1440 - 18 Octo..."
5383,Óttar M. Norðfjörð,Icelandic writers,Writers,Óttar Martin Norðfjörð (born 1980) is an Icela...
4093,Claude Phillips,English writers,Writers,Sir Claude Phillips (29 January 1846 – 9 Augus...
7477,Francis Moto,Malawian writers,Writers,Professor Francis P. B. Moto (born 1952) is a ...


We can store this DataFrame in a CSV file as well:

In [134]:
df_with_texts.to_csv('df_with_texts.csv')

In [11]:
pd.read_csv('df_with_texts.csv', index_col=0)

Unnamed: 0,Person,Subcategory,Category,Text
104,Jeanette Epps,American astronauts,Astronauts,"Jeanette Jo Epps (born November 3, 1970) is an..."
288,Sergey Avdeev,Russian cosmonauts,Astronauts,Sergei Vasilyevich Avdeyev (Сергей Васильевич ...
198,Matthias Maurer,German astronauts,Astronauts,Matthias Josef Maurer (born 18 March 1970) is ...
365,Andriyan Nikolayev,Soviet cosmonauts,Astronauts,Andriyan Grigoryevich Nikolayev (Chuvash and R...
412,Gennady Manakov,Soviet cosmonauts,Astronauts,Gennady Mikhailovich Manakov (Russian: Геннади...
...,...,...,...,...
7053,Uhwudong,Korean writers,Writers,"Eowudong or Uhwudong (어우동, 於宇同; 1440 - 18 Octo..."
5383,Óttar M. Norðfjörð,Icelandic writers,Writers,Óttar Martin Norðfjörð (born 1980) is an Icela...
4093,Claude Phillips,English writers,Writers,Sir Claude Phillips (29 January 1846 – 9 Augus...
7477,Francis Moto,Malawian writers,Writers,Professor Francis P. B. Moto (born 1952) is a ...


# Text cleaning

In [2]:
with open('Writers/A B M Shawkat Ali_Writers.txt', 'r') as f:
    text = f.read()
print(text)

A B M Shawkat Ali is a Bangladeshi origin-Australian author, computer scientist and data analyst. He author of several books in the area of Data Mining, Computational Intelligence, and Smart Grid. He is a newspaper columnist. He is an academic and well-known researcher in the areas of Machine Learning and Data Science. He is also the founder of a research center and international conferences in Data Science and Engineering. He is now an Adjunct Professor in Data Science in the School of Engineering and Technology, Central Queensland University, Australia. Early life and education Ali was born (July 30, 1969) just before the independent date of Bangladesh in Rajapur, Jamalpur. His parents Md. Saifuddin Sarker was a farmer and businessman and Mrs. Soufia Khatun was a housewife. Ali has two brothers and three sisters. He completed year five Primary education with regional first position from Rajapur Primary School in 1978, Secondary School Certificate (SSC) in 1984 securing First Division

We can see that the texts contain multiple line breaks, which can affect tokenization. We can replace all the line breaks with spaces to avoid unneeded tokens.

In [1]:
from util import text_clean

In [3]:
print(text_clean(text))

A B M Shawkat Ali is a Bangladeshi origin-Australian author, computer scientist and data analyst. He author of several books in the area of Data Mining, Computational Intelligence, and Smart Grid. He is a newspaper columnist. He is an academic and well-known researcher in the areas of Machine Learning and Data Science. He is also the founder of a research center and international conferences in Data Science and Engineering. He is now an Adjunct Professor in Data Science in the School of Engineering and Technology, Central Queensland University, Australia. Early life and education Ali was born (July 30, 1969) just before the independent date of Bangladesh in Rajapur, Jamalpur. His parents Md. Saifuddin Sarker was a farmer and businessman and Mrs. Soufia Khatun was a housewife. Ali has two brothers and three sisters. He completed year five Primary education with regional first position from Rajapur Primary School in 1978, Secondary School Certificate (SSC) in 1984 securing First Division
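
The `util` module is not shown in this notebook. As a rough idea of what such a cleaning function might do, here is a minimal sketch, assuming it only collapses line breaks and other whitespace runs into single spaces. `text_clean_sketch` is a hypothetical stand-in, not the actual `text_clean`.

```python
import re

def text_clean_sketch(text):
    # Hypothetical stand-in for util.text_clean, which is not shown here.
    # Replace every run of whitespace (including line breaks) with one space.
    return re.sub(r'\s+', ' ', text).strip()

raw = 'First paragraph.\n\nEarly life and education\nSecond paragraph.'
print(text_clean_sketch(raw))
```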

We will apply this cleaning to the generated DataFrame above:

In [5]:
df = pd.read_csv('df_with_texts.csv', index_col=0)
df

Unnamed: 0,Person,Subcategory,Category,Text
104,Jeanette Epps,American astronauts,Astronauts,"Jeanette Jo Epps (born November 3, 1970) is an..."
288,Sergey Avdeev,Russian cosmonauts,Astronauts,Sergei Vasilyevich Avdeyev (Сергей Васильевич ...
198,Matthias Maurer,German astronauts,Astronauts,Matthias Josef Maurer (born 18 March 1970) is ...
365,Andriyan Nikolayev,Soviet cosmonauts,Astronauts,Andriyan Grigoryevich Nikolayev (Chuvash and R...
412,Gennady Manakov,Soviet cosmonauts,Astronauts,Gennady Mikhailovich Manakov (Russian: Геннади...
...,...,...,...,...
7053,Uhwudong,Korean writers,Writers,"Eowudong or Uhwudong (어우동, 於宇同; 1440 - 18 Octo..."
5383,Óttar M. Norðfjörð,Icelandic writers,Writers,Óttar Martin Norðfjörð (born 1980) is an Icela...
4093,Claude Phillips,English writers,Writers,Sir Claude Phillips (29 January 1846 – 9 Augus...
7477,Francis Moto,Malawian writers,Writers,Professor Francis P. B. Moto (born 1952) is a ...


In [6]:
df['Text'] = df['Text'].apply(text_clean)

We will store it back in the CSV file:

In [7]:
df.to_csv('df_with_texts.csv')

## Clean all texts

We can also clean all previously scraped texts:

In [72]:
import os

In [73]:
writers = os.listdir('Writers')

In [74]:
for writer in writers:
    with open(f'Writers/{writer}', 'r') as f:
        text = f.read()
        
    text_cleaned = text_clean(text)
    with open(f'Writers/{writer}', 'w') as f:
        f.write(text_cleaned)

In [75]:
astronauts = os.listdir('Astronauts')

In [76]:
for astro in astronauts:
    with open(f'Astronauts/{astro}', 'r') as f:
        text = f.read()
        
    text_cleaned = text_clean(text)
    with open(f'Astronauts/{astro}', 'w') as f:
        f.write(text_cleaned)
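
The two loops above differ only in the folder name, so they could be folded into one helper. The following is a sketch under stated assumptions: `clean_folder` and `collapse_linebreaks` are hypothetical names, the cleaner is a stand-in for `text_clean`, the explicit UTF-8 encoding is a choice made here, and the demonstration runs on a temporary folder rather than on the real `Writers` and `Astronauts` folders.

```python
import os
import re
import tempfile

def collapse_linebreaks(text):
    # Hypothetical cleaner standing in for util.text_clean
    return re.sub(r'\s+', ' ', text).strip()

def clean_folder(folder, cleaner=collapse_linebreaks):
    # Rewrite every file in the folder with its cleaned text; return the file count
    names = os.listdir(folder)
    for name in names:
        path = os.path.join(folder, name)
        with open(path, 'r', encoding='utf-8') as f:
            cleaned = cleaner(f.read())
        with open(path, 'w', encoding='utf-8') as f:
            f.write(cleaned)
    return len(names)

# Demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'Example_Writers.txt'), 'w', encoding='utf-8') as f:
        f.write('Line one.\n\nLine two.')
    count = clean_folder(tmp)
    with open(os.path.join(tmp, 'Example_Writers.txt'), encoding='utf-8') as f:
        result = f.read()

print(count, result)
```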