defoe-code/CDCS_Text_Mining_Lab

To run defoe queries in Cirrus, we first have to start a Spark cluster within a SLURM job. Once the Spark cluster is running, we can submit defoe queries to that cluster.

We have divided the work performed in the CDCS TDM Lab into two Rounds: Round 1 and Round 2. Each Round has a different set of studies, and each study has a set of defoe queries. In parallel, we have started other studies, such as Geoparsing the Scottish Gazetteers and Trade Legacy Slavery. Details of those can be found at the end of this document.

To better understand how defoe works, we recommend first reading this paper, along with the documentation on how to run defoe queries and how to specify data to queries.

Presentations about defoe are also available at this link. Note that the most recent presentation introducing defoe was given at the Research Libraries UK (RLUK) Text and Data Mining Seminar and is available here.

Below is a summary of the instructions needed to replicate this work.

1. Spark installation steps

Download Spark 2.4.0

wget http://apache.mirrors.nublue.co.uk/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar xvf spark-2.4.0-bin-hadoop2.7.tgz
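
Optionally, you may also point SPARK_HOME at the extracted directory and add its bin directory to your PATH. This is only a convenience and assumes Spark was extracted into $HOME; the SLURM scripts in this repository may set their own paths.

# optional convenience - adjust the path if you extracted Spark elsewhere
export SPARK_HOME=$HOME/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH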

Copy this entire repository into your $HOME directory

In your $HOME you need to have the following:

2. Creating a conda Python 3 environment in Cirrus

To create a Python 3 environment in Cirrus, do:

module load anaconda/python3 
conda create -n cirrus-py36 python=3.6 anaconda

To activate the environment, use:

source activate cirrus-py36

To deactivate an active environment, use:

source deactivate

3. Installing defoe in Cirrus (inside the conda environment)

To install defoe on the Cirrus HPC cluster, do:

git clone https://github.com/defoe-code/defoe.git
source activate cirrus-py36
cd defoe
./requirements.sh
zip -r defoe.zip defoe

Note: every time you change something inside the defoe library, you need to re-zip the defoe code. If you have not changed anything, you do not need to zip it again.

4. Starting Spark Cluster in Cirrus

To start a Spark cluster in Cirrus, the only thing needed is to run the following command:

sbatch sparkcluster_driver.slurm 

You will need to wait until the job is running before proceeding to run defoe queries.
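
You can check the state of the job with a standard SLURM command:

# PD = pending, R = running
squeue -u $USER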

You can modify sparkcluster_driver.slurm according to your needs, for example to change the amount of time, number of nodes, and account. The current script configures a Spark cluster of 324 cores (9 nodes x 36 cores per node).

#SBATCH --job-name=SPARKCLUSTER
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --nodes=9
#SBATCH --tasks-per-node=36
#SBATCH --cpus-per-task=1
#SBATCH --account=XXXX
#SBATCH --partition=standard
#SBATCH --qos=standard

5. Submitting defoe queries for Round 1 and Round 2

During this summer, we conducted a series of studies within the CDCS text-mining lab, in which we worked with humanities and social science researchers who ask complex questions of large-scale datasets. We selected four research projects for Round 1, and two for Round 2.

A description of each research project/study can be found below:

Round 1:

Round 2:

Each research project/study has a series of defoe queries. In most of them, we first submitted a frequency query modifying different parameters (e.g. article count vs term count, date, lexicon, target words, preprocessing treatment), and then we submitted another query to get the details (text) of the desired/filtered articles/pages. The requirements were collected using this document as a baseline for formulating defoe queries.

Later we created two slurm jobs, one per Round (Round1.slurm and Round2.slurm), for running all defoe queries in Cirrus. You can comment out the studies that you do not want to run. To run all the studies (with all defoe queries) included in Round 1, type the following command:

sbatch Round1.slurm

Similarly, to run all the studies (with all defoe queries) included in Round 2, type the following command:

sbatch Round2.slurm

Note that to run the Round[1|2].slurm jobs, the sparkcluster_driver.slurm job must already be running.

Also, you need to modify the Round[1|2].slurm files according to your needs - e.g. time, account, job name. However, you only need to reserve 1 node (36 cores) for submitting defoe queries to the Spark cluster. The parallelisation of defoe relies on the number of nodes that the Spark cluster has been configured with (inside sparkcluster_driver.slurm - in this case 9 nodes), not on the number of nodes used to submit defoe queries to the Spark cluster.

#!/bin/bash
#SBATCH --job-name=Round1
#SBATCH --time=20:00:00
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --tasks-per-node=36
#SBATCH --cpus-per-task=1
#SBATCH --account=XXXX
#SBATCH --partition=standard
#SBATCH --qos=standard
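
For reference, inside the Round slurm files each defoe query is submitted to the running cluster with spark-submit. The sketch below only illustrates the general shape of such a command, following the pattern in the defoe "how to run defoe queries" documentation; the master URL, data file, model, query, configuration file, results file and core count are placeholders, the paths assume the installation from step 3, and the exact argument order may differ slightly between defoe versions.

# sketch only - all angle-bracket values are placeholders
source activate cirrus-py36
spark-submit --master spark://<spark-master-node>:7077 \
    --py-files $HOME/defoe/defoe.zip \
    $HOME/defoe/run_query.py \
    <data_file>.txt <model> defoe.<model>.queries.<query> <query_config>.yml \
    -r <results_file> -n <num_cores>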

6. Long-S fix

Many historical documents use a long S, which OCR tends to confuse with an f. To fix this kind of OCR error, most of our defoe queries apply the longsfix_sentence function before any other preprocessing treatment of the text's words. This function automatically calls a set of scripts produced for the Edinburgh Geoparser to fix the long-S errors.

def longsfix_sentence(sentence, defoe_path, os_type)

As shown above, the longsfix_sentence function needs two user parameters (apart from the sentence/word to inspect): the user's operating system - os_type (either linux or mac) - and the path of the user's defoe installation - defoe_path. The longsfix_sentence function (LINE 263) calls a different set of scripts depending on the user's operating system.

Both parameters are usually specified in a configuration file (example), like the one below:

preprocess: normalize
data: music.txt
defoe_path: /lustre/home/sc048/rosaf4/defoe/
os_type: linux

The long-S fix can also be tested as a standalone script called long_s.py. To run it, you just need to do the following (after changing the defoe_path and os_type variables according to your needs - in this case only, those variables are specified in the Python script itself).

cd $HOME/defoe/defoe/long_s_fix/
python long_s.py

More information about the long-S can be found in this paper.

7. Datasets

We have worked with the following datasets:

We had also planned to work with the British Library Books (BL Books), which are stored at the UoE DataStore /sg/datastore/lib/groups/lac-store/blpaper. However, this dataset is too big to store in Cirrus.

Transferring the 20th century TDA newspapers to Cirrus

Example of how to transfer a subset of TDA newspapers to Cirrus - e.g. from 1900 to 1999 (20th century) - using SFTP.

mkdir -p $HOME/TDA_GDA_1785-2009/
cd $HOME/TDA_GDA_1785-2009/
sftp -oPort=22222 XXX@chss.datastore.ed.ac.uk:/chss/datastore/chss/groups/Digital-Cultural-Heritage/LBORO/TimesDigitalArchive_XMLS/TDA_GDA_1785-2009

Connected to chss.datastore.ed.ac.uk.
Changing to: /chss/datastore/chss/groups/Digital-Cultural-Heritage/LBORO/TimesDigitalArchive_XMLS/TDA_GDA_1785-2009

sftp> get 19[0-9][0-9]/*/*.xml .

Example of how to create a data file with all XML files

This data file is needed for running defoe queries against the downloaded dataset.

find $HOME/TDA_GDA_1785-2009/ -name "*.xml" | sort > tda_1900_1999.txt
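
A quick sanity check is to count how many XML files were listed in the data file:

# each line of the data file is one XML page file
wc -l tda_1900_1999.txt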

Round 1 and Round 2 Results

Results of these studies (Round 1 and Round 2) are uploaded here.

Furthermore, we have also created several notebooks for visualizing frequency results:

Trade Legacy Slavery Study

We also started an investigation into the slave trade and how it permeates the different volumes of the Encyclopaedia Britannica (EB). We have a lexicon, slavery_trade.txt (a sketch of the lexicon format is shown after the list below), that we looked up at two levels:

  • Page level: returning a snippet (40 words before and after each term) every time a term from the lexicon is found in a page.
  • Article level: returning an article every time a term from the lexicon is found in an article. To do this, we first need to extract all the articles per EB page and store them in CSV files (one per edition). See more information about extracting articles below. Once the articles per page have been extracted, we can use another defoe query to filter them by the lexicon.
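
The lexicon itself is a plain text file. As a sketch - assuming the usual defoe convention of one term or phrase per line, and using purely illustrative entries rather than the real contents of slavery_trade.txt:

# illustrative entries only - not the real lexicon
cat > slavery_trade_example.txt << 'EOF'
slave
slavery
slave trade
abolition
EOF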

At page level we also ran the frequency query using the same lexicon.

Defoe queries

  • Page level: The query used for doing this work can be found under the nls defoe model.
  • Article level: All queries used for doing this work can be found under the nlsArticles and hdfs defoe models.

Defoe queries configuration file

This configuration file might need to be modified according to your setup and needs.

Slurm job

The SLURM job to run this study can be found here.

Data files

The data files used in the above slurm job can be found here, as nls_[first|second|etc].txt.

Preliminary Results

  • Page level: Text (snippets) results can be found here. Frequency results can be visualized here.

  • Article level: Text (articles) results can be found here.

Geoparsing the Scottish Gazetteers

Furthermore, we have continued our work on devising automatic and parallel methods for geoparsing large digital historical textual data by combining the strengths of three natural language processing (NLP) tools, the Edinburgh Geoparser, spaCy and defoe, and employing different tokenisation and named entity recognition (NER) techniques. We apply these tools to a large collection of nineteenth century Scottish geographical dictionaries.

This work is being conducted in collaboration with the Language Technology Group at Informatics.

For running the defoe geoparsing queries we have not used Cirrus, since georesolving locations requires the compute nodes to have an internet connection. Therefore, we have used a VM for this. Instructions for how we set up this VM with defoe, the Edinburgh Geoparser and Spark can be found here, along with examples of how to run defoe geoparser queries using different configurations.

We have two defoe geoparser queries under the NLS model:

These two queries are also available under the ES model:

Note: to use the queries under the ES model, you first need to write the data to ES using this query and a configuration file like this one (which might need to be modified according to your setup and needs).

A paper describing this work can be found here, and the notebooks presented in this paper can be visualized here.

Recently, we have also extended this work to automatically geoparse the Encyclopaedia Britannica. Therefore, we have four configuration files, since the Scottish Gazetteers and the Encyclopaedia Britannica use different gazetteers and bounding box configurations (which might need to be modified according to your setup and needs):

Notice that the geoparser also calls a different set of scripts depending on the user's operating system. Therefore, all configuration files used to run those queries need to specify the os_type and defoe_path parameters.

For more details about the Edinburgh Geoparser, you can follow the tutorial at this link.

Extracting automatically articles from the Encyclopaedia Britannica (EB)

Finally, we have created a new defoe query for automatically extracting articles from the EB. The articles are stored in CSV files, one per edition (and we also have them all in a single file).

Important: for this work, instead of adding this query under the defoe NLS model, we created a new model called nlsArticles. This is because automatically extracting the articles from the pages required specific modifications at the page and archive level - for capturing headers and text columns. Therefore, this query lives under nlsArticles and not under nls.

Note that to run this query, apart from Spark, you need to have Hadoop installed in your computing environment. Instructions for installing it in Cirrus can be found here, and for installing it in a VM can be found here. (Note: I used a VM for running this query.)
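
Once Hadoop is installed, a quick way to check that HDFS is reachable before running the query is to list the filesystem root and create a working directory (the directory name below is only an example):

hdfs dfs -ls /
hdfs dfs -mkdir -p /user/$USER/eb_articles   # example path - adjust to your setup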

Furthermore, this query also needs a configuration file to specify the operating system and the defoe path for the long-S fix:

Articles metadata

Each CSV file has a row per article found within a page, with the following columns (the most important being term and definition; a quick way to list the header columns is shown after this list):

  • title: title of the book (e.g. Encyclopaedia Britannica)
  • edition: edition of the book (e.g Eighth edition, Volume 2, A-Anatomy)
  • year: year of publication/edition (e.g. 1853)
  • place: place (e.g. Edinburgh)
  • archive_filename: directory path of the book (e.g. /home/rosa_filgueira_vicente/datasets/single_EB/193322698/)
  • source_text_filename: directory path of the page (e.g. alto/193403113.34.xml)
  • text_unit: unit that represents each ALTO XML file. This can be Page or Issue.
  • text_unit_id: id of the page (e.g. Page704)
  • num_text_unit: number of pages (e.g. 904)
  • type_archive: type of archive. This can be book or newspaper.
  • model: defoe model used for ingesting this dataset (nlsArticles)
  • type_page: the page classification done by defoe. This can be Topic, Articles, Mix or Full Page.
  • header: the header of the page (e.g. AMERICA)
  • term: term that is going to be described (e.g. AMERICA)
  • definition: words describing an article/topic/full page (e.g. “AMERICA. being inhabited. The Aleutian …”)
  • num_articles: number of articles per page. If a page has been classified as Topic or Full Page, the number of articles is 1.
  • num_page_words: number of words per page (e.g. 1373)
  • num_article_words: number of words of an article (e.g. 1362)
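
As mentioned above, a quick way to list the header columns of one of these CSV files from the shell (assuming the first row is the header, as in the files listed in the download section below):

# print each column name of the header row on its own line
head -1 eb_first_edition_total_articles.csv | tr ',' '\n'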

We have detected two types of articles with two different patterns at “page” level:

  • Short articles (named as articles): Usually introduced by a TERM in the main text in uppercase, followed by a “,” (e.g. ALARM, ) and then a DESCRIPTION of the TERM (similar to a dictionary entry). This description is normally one or two paragraphs, but of course there are exceptions.

    • Term: ALARM
    • Definition: in the Military Art, denotes either the apprehension of being suddenly attacked, or the notice thereof signified by firing a cannon, firelock, or the like. False alarms are frequently made use of to harass the enemy, by keeping them constantly under arms. , ….
  • Long articles (named as topics): In this case, the Encyclopaedia introduces a TERM in the header of a page (which is not the case for the short articles), and then it normally uses several pages to describe that topic (very often combining text, pictures, tables, etc.). For example, the “topic” AMERICA goes from page 677 to 724 (47 pages!).

Important: Topic is just the name we gave to long articles that span more than one page. It does not refer to an “NLP topic”.

Downloading EB articles datasets

Those files (one per edition, and also one with all articles) can be downloaded from here. Once you have decompressed the eb_articles_per_page.tar file, you will find the following:

> tar -zxvf eb_articles_per_page.tar
> ls -lht 
36M 24 Aug 12:01 eb_first_edition_total_articles.csv
59M 24 Aug 12:08 eb_second_edition_total_articles.csv
158M 24 Aug 12:26 eb_third_edition_total_articles.csv
105M 24 Aug 12:06 eb_fourth_edition_total_articles.csv
110M 24 Aug 11:59 eb_fifth_edition_total_articles.csv
110M 24 Aug 12:19 eb_sixth_edition_total_articles.csv
129M 24 Aug 12:14 eb_seventh_edition_total_articles.csv
137M 24 Aug 11:54 eb_eighth_edition_total_articles.csv
70M 24 Aug 11:12 eb_4_5_6_suplement_total_articles.csv
913M 24 Aug 11:48 eb_all_editions_total_articles.csv -- It has all articles for all editions!

Slurm job

In this SLURM job (in the second part of the file - at article level), you can find the defoe queries necessary for extracting the articles per edition and storing them in HDFS files.

Data Files

The data files used in the above slurm job can be found here, as nls_[first|second|etc].txt.

About

Repository of scripts and documents for running defoe text mining queries in Cirrus for CDCS Text Mining Lab
