# Text mining and processing


This notebook is split in two parts:
- __Text mining__: Documents are created from text extracted from Wikipedia pages using the Wikipedia API and [`Wikipedia-API`](https://pypi.org/project/Wikipedia-API/), a Python wrapper. This is done with custom functions defined in the Python script [extract_wikipedia_data.py](extract_wikipedia_data.py).
- __Text processing__: The text of the documents is then processed using [spaCy](https://spacy.io/).
    This includes
    - tokenization of words,
    - clean-up,
    - lemmatization,
    - POS tagging.

Finally, a dictionary is created, mapping words to numerical ids, and the documents are converted to a bag-of-words format.
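
As a rough illustration of this last step, the sketch below builds a word-to-id mapping and bag-of-words vectors with the standard library only. It is not how gensim implements `corpora.Dictionary` or `doc2bow` (the ids gensim assigns may differ); `build_vocabulary` and `to_bow` are hypothetical names, and the two toy documents are made up.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a list of tokens into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy corpus (made up for illustration).
toy_docs = [["robot", "planet", "robot"], ["planet", "war"]]
token2id = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only word frequencies remain.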

In [135]:
# `tqdm_notebook` is deprecated in recent tqdm releases; `tqdm.notebook.tqdm` replaces it.
from tqdm.notebook import tqdm

# for text mining
import wikipediaapi
import re
wiki = wikipediaapi.Wikipedia('en')

from extract_wikipedia_data import get_page_text, get_categorymembers, get_authors

# for text processing
import spacy
spacy_nlp = spacy.load('en_core_web_sm')

from gensim.models.atmodel import construct_author2doc
from gensim import models, corpora 

import json

## Text mining

My goal is to apply topic modelling to the content of science-fiction novels in English, using only the information available on Wikipedia.

Here I gather all Wikipedia pages corresponding to science-fiction novels. Wikipedia pages are members of categories, which can be probed via the Wikipedia API.

I use the category ["Science_fiction_novels_by_year"](https://en.wikipedia.org/wiki/Category:Science_fiction_novels_by_year), which seems to include the largest number of distinct SF novels. Each of its members is itself a category that contains pages about novels of a given year. I also collect the pages under the category `Science_fiction_novels_by_writer` and merge the two sets, dropping duplicates.
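
A minimal sketch of walking such nested categories, assuming a toy category tree stored in a dict instead of live API calls. `collect_pages` and the tree contents are hypothetical; this is not the implementation of `get_categorymembers` in the accompanying script.

```python
def collect_pages(category, tree, max_level=None, level=0):
    """Recursively collect page titles found under `category`.

    `tree` maps a category name to the list of its members. A member that is
    itself a key of `tree` is treated as a subcategory, anything else as a page.
    """
    pages = []
    for member in tree.get(category, []):
        if member in tree:
            # Subcategory: descend unless the depth limit has been reached.
            if max_level is None or level < max_level:
                pages += collect_pages(member, tree, max_level, level + 1)
        else:
            pages.append(member)
    return pages


# Toy category tree (made up for illustration).
toy_tree = {
    "Novels by year": ["1965 novels", "1969 novels"],
    "1965 novels": ["Novel A", "Novel B"],
    "1969 novels": ["Novel C", "Series X novels"],
    "Series X novels": ["Novel A", "Novel D"],
}

found = collect_pages("Novels by year", toy_tree)
# A page can be reached through several categories, so drop duplicates
# while keeping the order of first appearance.
unique_pages = list(dict.fromkeys(found))
print(unique_pages)
```

The `max_level` argument caps the recursion depth, which is useful because category trees on a wiki can be deep.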

- First, I use custom functions written in [extract_wikipedia_data.py](extract_wikipedia_data.py) to collect all relevant pages.
- Then, the text of each page is split into its sections, and only relevant sections are kept. These are sections about the content of the novel (e.g. "plot", "theme"), rather than metadata about the author or the book.
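
The section filtering in the second step can be sketched as follows. This assumes a case-insensitive substring match on section titles, which may differ from what `get_page_text` actually does; `keep_relevant_sections` and the toy sections are hypothetical.

```python
def keep_relevant_sections(sections, keywords):
    """Return the text of the sections whose title contains any of `keywords`.

    `sections` is a list of (title, text) pairs. Matching is a case-insensitive
    substring test, which is an assumption of this sketch.
    """
    kept = []
    for title, text in sections:
        title_lower = title.lower()
        if any(kw in title_lower for kw in keywords):
            kept.append(text)
    return kept


# Toy page split into sections (made up for illustration).
toy_sections = [
    ("Plot summary", "A scientist travels to the far future."),
    ("Publication history", "First published as a serial."),
    ("Major themes", "Class division and entropy."),
]

relevant = keep_relevant_sections(toy_sections, ["plot", "theme"])
print(relevant)
```

Substring matching means a keyword such as "theme" also matches a title like "Major themes".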

Later, I also extract information about the author of each novel from its Wikipedia page.

In [143]:
# Extract members and submembers of the following two categories that are not themselves categories.
cat = wiki.page("Category:Science_fiction_novels_by_year")
cat_years = [page for page in cat.categorymembers.values()]

cat = wiki.page("Category:Science_fiction_novels_by_writer")
cat_writers = [page for page in cat.categorymembers.values()]

pages_by_year = [
    get_categorymembers(c, n_pages_per_level_threshold=1, level=0, max_level=None, verbose=False)
    for c in cat_years
]

pages_by_writer = [
    get_categorymembers(c, n_pages_per_level_threshold=1, level=0, max_level=None, verbose=False)
    for c in cat_writers
]

pages_years = [p for page_list in pages_by_year for p in page_list]
pages_writers = [p for page_list in pages_by_writer for p in page_list]

pages_all = pages_writers + [p for p in pages_years if p not in pages_writers]


# Use a raw string so that `\d` is not treated as an (invalid) string escape.
years = [int(re.search(r'\d+', c.title).group()) for c in cat_years]

print(f"There are {len(pages_all)} wikipedia pages (novels) written between {min(years)} and {max(years)}.")

There are 4851 wikipedia pages (novels) written between 1840 and 2019.


In [144]:
# Extract the text of the wikipedia pages.
# Only sections containing the following keywords are considered:

kws = ['plot', 'summary', 'topic', 'theme', 'summari',
      'background', 'origin', 'introduction', 'concept', 'symbol',
      'synopsis', 'content']

# `documents` is a list of list of str, for each section of each document.
documents = []
for p in tqdm(pages_all):
    documents.append(get_page_text(p, keywords=kws, verbose=False, use_summary_if_empty=True))


## Text processing


The text is processed for latent Dirichlet allocation (LDA):
1. The text of each document is tokenized into words.
2. Stopwords are removed. I use spaCy's set of stopwords, to which I added words that occur frequently without being relevant (e.g. 'story', 'character', 'novel').
3. Using the POS tagging feature of spaCy, only words with an allowed tag are kept. Here I keep nouns and verbs.
4. Words are reduced to their lemmas (e.g. the lemma of 'went', 'gone', and 'goes' is 'go').

Some pages end up containing no tokens (if they have no relevant sections, or no sections at all). Those pages are removed from the corpus.
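
The order of these steps can be sketched without spaCy, assuming tokens that already carry a POS tag and a lemma. The hand-tagged tokens, the tiny stopword set, and the name `filter_tokens` are hypothetical and only meant to show the filtering logic, keeping nouns and verbs.

```python
def filter_tokens(tagged_tokens, stopwords, allowed_pos, min_len=3):
    """Filter (text, pos, lemma) triples and return the lowercased lemmas."""
    lemmas = []
    for text, pos, lemma in tagged_tokens:
        if len(text) < min_len:
            continue  # token too short
        if text.lower() in stopwords:
            continue  # stopword in its surface form
        if pos not in allowed_pos:
            continue  # POS tag not allowed
        lemma = lemma.lower()
        if lemma in stopwords:
            continue  # stopword that only shows up after lemmatization
        lemmas.append(lemma)
    return lemmas


# Hand-tagged toy sentence: "The robots went to a distant planet in the stories"
toy_tokens = [
    ("The", "DET", "the"),
    ("robots", "NOUN", "robot"),
    ("went", "VERB", "go"),
    ("to", "ADP", "to"),
    ("a", "DET", "a"),
    ("distant", "ADJ", "distant"),
    ("planet", "NOUN", "planet"),
    ("in", "ADP", "in"),
    ("the", "DET", "the"),
    ("stories", "NOUN", "story"),
]
toy_stopwords = {"the", "to", "a", "in", "story"}
allowed = {"NOUN", "VERB"}

toy_docs = {
    "page_1": filter_tokens(toy_tokens, toy_stopwords, allowed),
    "page_2": filter_tokens([("The", "DET", "the")], toy_stopwords, allowed),
}
# Drop pages that ended up with no tokens.
non_empty = {name: toks for name, toks in toy_docs.items() if toks}
print(non_empty)
```

Checking the stopword list a second time after lemmatization catches inflected forms such as 'stories', whose lemma 'story' is on the list.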

In [145]:
def prepare_text_for_lda(document,
                         lemmatize=True,
                         allowed_postags=None,
                         min_len=3,
                         additional_stopwords=None):
    
    """
    Returns a list of tokens or lemmas for the text of `document`, filtering out stopwords.
    
    Args:
    ----
    document (list of str): text of each section of the document.
    lemmatize (bool): if True, returns a list of lemmas, otherwise a list of tokens.
    allowed_postags (list of str): list of allowed POS tags. Tokens without an allowed POS tag are ignored.
        If None, all POS tags are allowed.
    min_len (int): tokens with fewer than `min_len` characters are ignored.
    additional_stopwords (list of str): list of stopwords to add to the default stopwords of spaCy.
        If None, no stopwords are added.
        Note, the additional stopwords are also filtered out after lemmatization.
    
    Returns:
    -------
    list of str: list of valid tokens or lemmas.
    
    """
    if not document:
        return []
    
    # The default `None` means "no additional stopwords"; without this,
    # the membership tests below would raise a TypeError.
    additional_stopwords = set(additional_stopwords or [])
    
    tokens = []
    for section in document:
        tokens += [word for word in spacy_nlp(section) if
                   (len(word.text) >= min_len and
                    not word.is_stop and
                    word.text not in additional_stopwords)]

    if allowed_postags is not None:
        tokens = [t for t in tokens if t.pos_ in allowed_postags]
    if lemmatize:
        tokens = [t.lemma_.lower() for t in tokens]
    else:
        tokens = [t.text.lower() for t in tokens]
    return [t for t in tokens if t not in additional_stopwords]

In [152]:
additional_stopwords=['story', 'character', 'novel', 'book', 'write',
                      'writer', 'fiction', 'series', 'publish', 'year',
                      'television', 'feature', 'american', 'british', 'narrator',
                      'original', 'reference', 'author', 'chapter', 'film',
                      'episode', 'release']

indices_empty_documents = []
tokenized_data = []

for i, doc in tqdm(list(enumerate(documents))):
    lda_tokens = prepare_text_for_lda(
        doc,
        lemmatize=True,
        allowed_postags=['NOUN', 'VERB'],
#         allowed_postags=['NOUN', 'VERB', 'PROPN'],
        min_len=3,
        # additional stopwords (after lemmatization)
        additional_stopwords=additional_stopwords,
    )
    if len(lda_tokens) == 0:
        indices_empty_documents.append(i)
    tokenized_data.append(lda_tokens)
print(f"{len(indices_empty_documents)} pages were ignored (they ended up empty after processing).")

# remove pages without any valid tokens
pages = [p for i, p in enumerate(pages_all) if i not in indices_empty_documents]
tokenized_data = [t for i, t in enumerate(tokenized_data) if i not in indices_empty_documents]
print(f'There are {len(tokenized_data)} documents (wikipedia pages).')

6 pages were ignored (they ended up empty after processing).
There are 4845 documents (wikipedia pages).


In [153]:
# Build a Dictionary - associate a numeric id to each word
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

## Additional text mining

Finally, I mine the Wikipedia pages to obtain information about the __author__ of each novel.

I did not find an easy way of doing this with the wrapper, so I use the Wikipedia API directly to extract the information in the __infoboxes__ of the Wikipedia pages (the box in the top-right part of a page).
Currently, if a page's infobox does not contain the name of the author, or if the page has no infobox, the author is left unspecified.

(This could be improved by mining the information in the main text of the page.)
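
Because the Wikipedia API accepts at most 50 page ids per request, the pages are queried in batches. A minimal sketch of that batching, with a hypothetical helper `batched` and made-up page ids (no API call is made here):

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy list standing in for the collected pages.
toy_page_ids = list(range(120))
batches = list(batched(toy_page_ids, 50))
print([len(b) for b in batches])
```

Stepping through the list with `range(0, len(items), size)` handles the last, shorter batch without a separate call, and never produces an empty batch.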

In [154]:
# The Wikipedia API accepts a maximum of 50 page ids at a time
N = 50

authors = []
# Step through `pages` in batches of N; the last batch may be shorter.
for start in tqdm(range(0, len(pages), N)):
    authors += get_authors(pages[start:start + N])

doc2author = dict([(i, [author]) for i, author in enumerate(authors)])

# replace the 'NA' tag for unknown authors with 'unknown_i', different for each novel.
i = 0
for key, value in doc2author.items():
    if value == ['NA']:
        doc2author[key] = ['unknown_' + str(i)]
        i += 1
        
print(i, 'novels have an unknown author.')
    
# remove unknown authors (indicated by 'NA')
# doc2author = {key: [elem for elem in value if elem != 'NA'] for key, value in doc2author.items()}
author2doc = construct_author2doc(doc2author)

671 novels have an unknown author.
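
For reference, the mapping built above can be pictured with a small standard-library sketch: unknown authors get a unique placeholder, and the document-to-author mapping is then inverted. `invert_doc2author` is a hypothetical stand-in for illustration, not gensim's `construct_author2doc`, and the toy author list is made up.

```python
from collections import defaultdict


def invert_doc2author(doc2author):
    """Turn {doc_id: [authors]} into {author: [doc_ids]}."""
    author2doc = defaultdict(list)
    for doc_id, doc_authors in doc2author.items():
        for author in doc_authors:
            author2doc[author].append(doc_id)
    return dict(author2doc)


# Toy author list, one entry per document; 'NA' marks an unknown author.
toy_authors = ["Author A", "NA", "Author A", "NA"]

toy_doc2author = {}
n_unknown = 0
for doc_id, author in enumerate(toy_authors):
    if author == "NA":
        # Give each unknown author a distinct label so that unrelated
        # novels are not grouped under a single 'NA' author.
        author = f"unknown_{n_unknown}"
        n_unknown += 1
    toy_doc2author[doc_id] = [author]

toy_author2doc = invert_doc2author(toy_doc2author)
print(toy_author2doc)
```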


## Save the data for later use

In [155]:
with open("data/author2doc.json", 'w') as f:
    json.dump(author2doc, f, indent=2)

with open("data/tokenized_data.json", 'w') as f:
    json.dump(tokenized_data, f, indent=2)
    
dictionary.save('data/dictionary')
corpora.MmCorpus.serialize('data/corpus.mm', corpus)