# Replicability and transparency in topic modelling: developing best practice guidelines for the digital humanities

_Copyright (c) 2023 [Andressa Gomide, Mathew Gillings, Diego Gimenez]_

This file is part of Gomide et al. 2023.

This project is licensed under the terms of the MIT license.

This Jupyter Notebook is divided in three sections: (i) data collection and cleaning; (ii) tokenization, tagging and cleaning; and (iii) applying TM.
Each section represents one file in the folder [link]

### Data collection and cleaning
@get_gutemberg.py
The codes in this section are used to download books from the Gutemberg Project (https://www.gutenberg.org/); to remove unecessary elements (e.g. boilerplates, page numbers); to
extract the metadata for each book; and to save the original book file (html), the cleaned content (txt), and the metadata (tsv)

### Tokenization, Tagging and Cleaning
@create_bow.py
This section
- reads plain text files in a give folder
- applies Spacy Lang model
- creates different bags of words ('all_tokens', 'full_clean', 'custom_tok')
and saves:
- the original book file (html)
- the cleaned content (txt)
- the metadata (tsv)

### Applying TM
@apply_tm.py



Functions in this file

 `download_url` - it takes a string with with url path as an argument and returns the content of the url as bytes

### Import libraries
- TODO add short explanation of libraries

In [8]:
# For section 1
import re # for regular expressions
from urllib.request import urlopen # to request the content from the internet
from bs4 import BeautifulSoup # to work with html files (bs4 is known to be user friendly)
import pandas as pd # to store metadata as dataframe

# For section 2
import spacy # to tokenize and annotate the data
import pandas as pd # to store metadata as dataframe
from gensim.models import Phrases # to compute the bigrams
import utilsNLP # our library with functions

# For section 3

## Data Collection and Cleaning

- keep original data (as obtained from source)
- keep as much metadata as possible
- if it doesn't require a lot of work, it might be a better idea to use your own code, as you have more awareness and avoid having a lot of dependencies. In this example, its better to use our own codes than importing https://pypi.org/project/Gutenberg/. In our case, to avoid unecessary repetition, we created the function `download_url` to get the content of the book from the website.
- it is always easier to work with plain text, but preserving section breaks can lead to further analysis
- sometimes the same data content is available in different formats. it is a good idea to test extracting two different formats to get an idea which one will be better for the project.
- in our case, getting the data from html format sounds better and easier to (a) preserve the sections boundaries (b) to make cleaning easier

In [9]:
def download_url(urlpath):
    ''' 
    download content from an url address
    Args: 
        urlpath (str): the url path
    Returns:
        connection.read() (bytes): the content of the page 
    '''
    try:
        # open a connection to the server
        with urlopen(urlpath, timeout=3) as connection:
            # return content of the url read as bytes
            return connection.read()
    except:
        # return None
        print(f"There was an issue when trying to download{urlpath}")

Once we know the books we want to download, we create a list with their identification numbers (IDs).
The IDs can be retrieved at https://www.gutenberg.org/.
Here we got all books by Machado de Assis in Portugese available at the Gutenberg project.

54829   Memorias Posthumas de Braz Cubas
55682   Quincas Borba
55752   Dom Casmurro
55797   Memorial de Ayres
56737   Esau e Jacob
57001   Papeis Avulsos
67935   Reliquias de Casa Velha
33056   Historias Sem Data
53101   A Mao e A Luva
67162   Helena
67780   Yayá Garcia
61653   Poesias Completas

In [11]:
book_id_list = ["54829", "55682", "55752", "55797", "56737", "57001", "67935", "33056", "53101", "67162", "67780", "61653"]

- sometimes the same data content is available in different formats. it is a good idea to test extracting two different formats to get an idea which one will be better for the project.
- it is almost always easier to work with plain text, but preserving section breaks can lead to further analysis
- in our case, getting the data from html format sounds better and easier to (a) preserve the sections boundaries (b) to make cleaning easier

### getting the books from the plain format

we first create a data frame that will serve to store the metadata

In [12]:
df = pd.DataFrame(columns = ['author', 'title', 'lang', 'subj', 'datepub'])

then we go through the list of book ids

In [13]:
for book_id in book_id_list:
    # url for plain text book
    url_plain = f'https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt'

    # download the content
    data_plain = download_url(url_plain)

    # plain text link doesnt include metadata. 
    # we have to go to the previous page
    url_meta = f'https://www.gutenberg.org/ebooks/{book_id}'
    metadata = download_url(url_meta)

    # parse document 
    soup = BeautifulSoup(metadata, 'html.parser')

    # get metadata
    author = soup.find('a', {'about': re.compile(r'\/authors\/.*')}).text
    lang = soup.find('a', {'href': re.compile(r'\/browse\/languages\/.*')}).text
    subj = soup.find('a', {'href': re.compile(r'\/ebooks\/subject\/*')}).text
    title = soup.find('td', {'itemprop': 'headline'}).text
    datepub = soup.find('td', {'itemprop': 'datePublished'}).text

    # remove line breaks
    meta_list = [sub.replace('\n', '') for sub in [author, title, lang, subj, datepub]]


    # df.loc[book_id] = [book_id, meta_list[0], meta_list[1], meta_list[2], meta_list[3], meta_list[4]]
    df.loc[book_id] = [meta_list[0], meta_list[1], meta_list[2], meta_list[3], meta_list[4]]

    # write book content to file
    with open(f"input/{book_id}.txt", 'wb') as file:
        file.write(data_plain)

There was an issue when trying to downloadhttps://www.gutenberg.org/cache/epub/55797/pg55797.txt


TypeError: a bytes-like object is required, not 'NoneType'

and save the metadata as a tsv file

In [6]:
# see the data
print(df)

# write metadata to file
df.to_csv('output/books_metadata.tsv', sep='\t', encoding='utf-8')

                                       author  \
44540  Alencar, José Martiniano de, 1829-1877   
55682             Machado de Assis, 1839-1908   
31971              Queirós, Eça de, 1845-1900   

                                               title        lang  \
44540                                  Cinco minutos  Portuguese   
55682                                  Quincas Borba  Portuguese   
31971  O crime do padre Amaro, scenas da vida devota  Portuguese   

                                                    subj       datepub  
44540                                            Fiction  Dec 29, 2013  
55682  Brazil -- History -- Empire, 1822-1889 -- Fiction   Oct 5, 2017  
31971                                Portugal -- Fiction  Apr 13, 2010  


## to get the books from html
# create empty df to store the metadata

In [7]:
df = pd.DataFrame(columns = ['author', 'title', 'lang', 'subj', 'datepub'])

for book_id in book_id_list:
    url_html = f'https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}-images.html'
    data_html = download_url(url_html)

    # parse
    soup = BeautifulSoup(data_html, 'html.parser')

    # get metadata
    author = soup.find('meta', {'name' : 'AUTHOR'})['content'] if soup.find('meta', {'name' : 'AUTHOR'}) is not None else 'NA'
    lang = soup.find('meta', {'name' : 'dc.language'})['content'] if soup.find('meta', {'name' : 'dc.language'}) is not None else 'NA'
    subj = soup.find('meta', {'name' : 'dc.subject'})['content'] if soup.find('meta', {'name' : 'dc.subject'}) is not None else 'NA'
    title = soup.find('meta', {'property' : 'og:title'})['content'] if soup.find('meta', {'property' : 'og:title'}) is not None else 'NA'
    datepub = soup.find('meta', {'name' : 'dcterms.created'})['content'] if soup.find('meta', {'name' : 'dcterms.created'}) is not None else 'NA'

    ## remove unnecessary elements
    # style
    for i in soup.find_all('style'):
        i.decompose()

    # boiler plates
    for i in soup.find_all('section', {'class': re.compile('.*boilerplate.*')}):
        i.decompose()

    # editor comments
    for i in soup.find_all('div', {'class': 'fbox'}):
        i.decompose()

    # page numbers
    for i in soup.find_all('span', {'class': 'pagenum'}):
        i.decompose()

    # remove br tags
    for i in soup.find_all('br'):
        i.unwrap()

    # remove head
    soup.find('head').decompose()

    # get metadata
    df.loc[book_id] = [author, title, lang, subj, datepub]


    # write to file with tags
    with open(f'input/html/{book_id}.html', 'w', encoding = 'utf-8') as file:
        file.write(str(soup.prettify()))
    # write to file without tags
    with open(f'input/plain/{book_id}.txt', 'w', encoding = 'utf-8') as file:
        file.write(soup.text)

print(df)
# write metadata to file
df.to_csv('output/books_metadata.tsv', sep='\t', encoding='utf-8')

KeyboardInterrupt: 

@create_bow
This script
- reads plain text files in a give folder
- applies Spacy Lang model
- creates different bags of words ('all_tokens', 'full_clean', 'custom_tok')
and saves:
- the original book file (html)
- the cleaned content (txt)
- the metadata (tsv)

## load language model

there are different models availables at https://spacy.io/models 

we can also create our own

here we will use a small model to be more efficient

In [None]:
nlp = spacy.load('pt_core_news_sm')

## get list with files
the folder input has the plain files prepared with @get_gutemberg.py
the function get_file_list creates a list and append the files names to it

In [None]:

file_list = utilsNLP.get_file_list('input/plain')

## Cleaning
list of elements to be removed (we can also here our own)

In [None]:
# POS tags to be removed
pos_rm = ['PUNCT', 'DET', 'SPACE', 'NUM', 'SYM']
# Named Entities tags to be removed
ner_rm = ['PER', 'LOC']
# words to be removed
wrd_rm = ['ella', 'elle']

go through the files extracting the words and save the bag of words

In [None]:
# create empty df to store the different bag of words (BoWs)
df = pd.DataFrame(columns = ['all_tokens', 'full_clean', 'custom_tok'])

# iterate each file and create the 3 different BoWs
for val in file_list:
    # read file
    with open(val, 'r', encoding='utf-8') as f:
        text_org = f.read()
    
    # remove line breaks
    text_oneline = text_org.replace("\n", " ")

    # apply model
    nlp_text = nlp(text_org)

    # create a list to store the NER labes to be 
    ne2rm = []
    for ent in nlp_text.ents:
        if ent.label_ in ner_rm:
            ne2rm.append(ent.text.lower())

    # get lis of unique values for the ner found
    ne2rm = list(set(ne2rm))

    # other possibilities
    # - remove numbers, but not words that contain numbers...
    # - Remove words that are only one character...

    # all tokens (no space)
    print(f'getting all tokens BoW for {val.stem}...')
    all_tokens = [token.text.lower() for token in nlp_text if token.pos_ != 'SPACE']

    # get all lemma that are not in the removel list neither in the stop list and that is alpha (not letters)
    print("getting BoW with a 'full clean' approach ...")
    full_clean = [token.lemma_.lower() for token in nlp_text if token.pos_ not in pos_rm and not token.is_stop and token.is_alpha]

    # remove locations and named person/family
    print("getting customized BoW")
    custom_tok = [token.text.lower() for token in nlp_text if token.text.lower() not in ne2rm and token.text.lower() not in wrd_rm and token.pos_ not in pos_rm and not token.is_stop]

    # add BoWs to dataframe
    df.loc[val.stem] = [all_tokens, full_clean, custom_tok]

# write dataframe to file
df.to_csv('output/bows.tsv', sep='\t', encoding='utf-8')

# print df 
print(df)

## Compute bigrams.
as this can be a very heavy (and slow) process, we make it separately 
and save it in a seperate file

In [None]:

# get only the values from the all_tokens column
bow = df['all_tokens']

len(bow[0]) # 93208
len(df['all_tokens'][0]) # 1369489

# get bigrams that occur at least 5 times
bigrams = Phrases(bow, min_count=5)

# add bigrams to BoW
for idx in range(len(bow)):
    for token in bigrams[bow[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            bow[idx].append(token)

# save to file
bow.to_csv('output/bow_with2gram.tsv', sep='\t', encoding='utf-8')


