# Digital Corpus Processing of Fernando Pessoa's work with Python's spaCy module

## Project description:

Fernando Pessoa was a Portuguese writer and one of the most important European poets of the 20th century. An enigmatic figure, Pessoa divided his work between his orthonymous writing and his writing under various aliases or, as he called them, heteronyms (of which there are around seventy-five). The writer firmly believed that there were multiple consciousnesses living inside him, each with its own biography, passions and views on life. This is reflected in the various themes and perspectives explored by Pessoa and his heteronyms in poetry and prose. For further reading on Pessoa and his heteronyms, https://poetrysociety.org/poems-essays/tributes/fernando-pessoa-his-heteronyms is a good resource.

Unfortunately, only a rather modest part of the body of work by Pessoa and his heteronyms has been translated into English, and an even smaller part is available digitally. This project aims to encourage digital humanities scholars, as well as scholars from other disciplines who are interested in the computational analysis of large text corpora, to build a digital, annotated corpus of the work of Pessoa and his heteronyms, in order to open new avenues for exploring his writing with text processing and distant reading techniques.

An important digital resource for Pessoa's work can be found at https://www.pessoadigital.pt/en/index.html. Although its texts are in Portuguese, this resource gathers a large part of the corpus of Pessoa and his heteronyms and digitizes it in machine-readable format, along with a transcription of each text. It could serve as a starting point for translating Pessoa's works into English and for using annotation tools such as the ones presented in the code below to process and extract important linguistic features from his work. For speakers of Portuguese, the spaCy module used in the code below (set here to the English pipeline) can also be used with a Portuguese pipeline to perform similar text processing and annotation tasks.

The preliminary case study that I present here uses three works that Pessoa originally wrote in English, downloaded from Project Gutenberg. Even with this small dataset one can explore some of the textual features present in these three collections of poems and sonnets.


## Processing steps:

In [1]:
# Install the following modules from the command prompt or Anaconda Navigator if you do not have them installed already 
# pip install spacy

# pip install pandas 
# pip install plotly
# pip install nbformat

In [2]:
# Install the gutenberg-cleaner package for cleaning gutenberg texts 
# pip install gutenberg-cleaner

To begin, let's import the Python modules that we will use to work with our dataset and transform it into DataFrame objects. Let's also import the spaCy module and install the small English language model (en_core_web_sm) that will be used for the text processing tasks below.

In [3]:
# Import spacy
import spacy

# Install English language model
!spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In this step we will write some code to access the folder that holds our corpus. If you work with your own files, either add them to the corpus folder (here named 'data') or change the directory from which Python reads the texts. After setting the correct directory, we add the filenames and the contents of the files to two separate lists. As a last step, we create a dictionary object that pairs the list of filenames with the list of texts.

In [5]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Iterate through each file in the folder (sorted, so that the order is reproducible)
for _file_name in sorted(os.listdir('data')):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Open the file in a with-block so that it is closed after reading
        with open(os.path.join('data', _file_name), 'r', encoding='utf-8') as _file:
            # Append contents of each text file to text list
            texts.append(_file.read())
        # Append name of each file to file name list
        file_names.append(_file_name)
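
The loading step above can also be wrapped in a small reusable function. The following is only a sketch: `load_corpus` is a hypothetical helper name that does not appear elsewhere in this Notebook, and the choice of the `utf-8-sig` encoding is an assumption made here so that a byte order mark at the start of a file, if present, is dropped while reading.

```python
from pathlib import Path


def load_corpus(folder):
    """Return (file_names, texts) for every .txt file in `folder`.

    Hypothetical helper for illustration; not part of the original Notebook.
    """
    file_names, texts = [], []
    # sorted() gives a reproducible order across operating systems
    for path in sorted(Path(folder).glob('*.txt')):
        # 'utf-8-sig' reads UTF-8 and skips a leading byte order mark if there is one
        texts.append(path.read_text(encoding='utf-8-sig'))
        file_names.append(path.name)
    return file_names, texts
```

Calling `file_names, texts = load_corpus('data')` would then yield the same two lists that the loop builds.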

In [6]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Document':texts}

In [7]:
# Import simple_cleaner module from gutenberg-cleaner
from gutenberg_cleaner import simple_cleaner

Now we can turn our corpus into a DataFrame object; the texts themselves will be cleaned in a later step.

In [8]:
# Turn dictionary into a dataframe
text_df = pd.DataFrame(d)

In [9]:
# Display the result
text_df.head()

Unnamed: 0,Filename,Document
0,19978.txt,﻿The Project Gutenberg eBook of 35 Sonnets\n ...
1,66039.txt,"﻿The Project Gutenberg eBook of English Poems,..."
2,66040.txt,"﻿The Project Gutenberg eBook of English Poems,..."


Since our corpus was downloaded from Project Gutenberg, we need to clean it up slightly before doing any text processing on it. Project Gutenberg .txt files contain additional information besides the text of the work (such as the header and the license), so we will use the simple_cleaner function from the gutenberg_cleaner package to tidy up our texts and remove information that is not needed for the task we want to perform. The cleaned text will be added to a new column called 'Raw_text', so that the documents are also preserved in their original state.

In [10]:
text_df['Raw_text'] = text_df['Document'].apply(simple_cleaner)
text_df.head()

Unnamed: 0,Filename,Document,Raw_text
0,19978.txt,﻿The Project Gutenberg eBook of 35 Sonnets\n ...,\r\n\r\n\r\n\r\n35 Sonnets\r\n\r\nby Fernando ...
1,66039.txt,"﻿The Project Gutenberg eBook of English Poems,...",\r\nENGLISH\r\nPOEMS\r\n\r\n\r\n\r\n\r\nBY\r\n...
2,66040.txt,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH\r\nPOEMS\r\n\r\n\r\n\r\n\r\nBY\r\nFERN...
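
To give an idea of what this kind of cleaning involves, here is a simplified sketch that cuts a text down to the part between the `*** START OF ... ***` and `*** END OF ... ***` marker lines found in Project Gutenberg files. It is an illustration only and not the implementation used by gutenberg_cleaner; the function name `strip_gutenberg_boilerplate` and the miniature sample text are made up for this example.

```python
import re

# Marker lines that delimit the actual work in Project Gutenberg .txt files
START_RE = re.compile(r'^\*{3}\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$',
                      re.MULTILINE | re.IGNORECASE)
END_RE = re.compile(r'^\*{3}\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$',
                    re.MULTILINE | re.IGNORECASE)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines (hypothetical helper)."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    # If a marker is missing, fall back to the beginning or end of the text
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Made-up miniature sample that imitates the layout of a Gutenberg file
sample = (
    'The Project Gutenberg eBook of 35 Sonnets\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK 35 SONNETS ***\n'
    '35 Sonnets\nby Fernando Pessoa\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK 35 SONNETS ***\n'
    'License text follows here.\n'
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

A dedicated package is still preferable for real work, since actual files contain further boilerplate (for example transcriber notes) that two markers alone do not capture.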


Upon inspection, one can notice that the text still contains some characters that we would like to avoid for our processing purposes. Let's get rid of them and also of the '.txt' extension at the end of our filenames.

Note: if you feel that the presence of certain characters, such as the newline (\n), is relevant for your analysis, simply remove the line of code below that deletes that character from our texts.

In [11]:
# Remove unwanted characters from the text. 
text_df['Raw_text'] = text_df['Raw_text'].str.replace('\r', ' ', regex=True).str.strip()
text_df['Raw_text'] = text_df['Raw_text'].str.replace('\t', ' ', regex=True).str.strip()
text_df['Raw_text'] = text_df['Raw_text'].str.replace('\n', ' ', regex=True).str.strip()
# Remove the .txt extension from each filename (regex=False treats '.txt' as a literal string)
text_df['Filename'] = text_df['Filename'].str.replace('.txt', '', regex=False)
text_df.head()

Unnamed: 0,Filename,Document,Raw_text
0,19978,﻿The Project Gutenberg eBook of 35 Sonnets\n ...,35 Sonnets by Fernando Pessoa I. ...
1,66039,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...
2,66040,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...
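
The clean-up in this step can also be expressed with the standard `re` module. The sketch below makes two points under stated assumptions: a single `\s+` pattern collapses every run of spaces, tabs, carriage returns and newlines into one space (unlike separate one-for-one replacements, which leave runs of spaces behind), and a dot in a regular expression has to be escaped if it is meant literally. The helper names `normalize_whitespace` and `strip_txt_extension` and the example strings are hypothetical.

```python
import re


def normalize_whitespace(text):
    # \s+ matches any run of whitespace; each run is replaced by a single space
    return re.sub(r'\s+', ' ', text).strip()


def strip_txt_extension(filename):
    # The dot is escaped and the pattern is anchored, so only a trailing '.txt' is removed
    return re.sub(r'\.txt$', '', filename)


print(normalize_whitespace('\r\n35 Sonnets\r\n\r\nby Fernando\tPessoa\r\n'))
print(strip_txt_extension('19978.txt'))

# With an unescaped dot, '.txt' means "any character followed by txt",
# so the 'atxt' inside this made-up name is removed as well
print(re.sub('.txt', '', 'datxt_19978.txt'))
```

On the DataFrame these helpers could be applied with `text_df['Raw_text'].apply(normalize_whitespace)` and `text_df['Filename'].apply(strip_txt_extension)`.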


We can now proceed with loading the metadata for the three texts in our corpus. This can be done by downloading the metadata .csv file from the Gutenberg website (called pg_catalog.csv) and writing some code that will select only the metadata of the texts that we include in our dataset.

In [12]:
# Load the Gutenberg metadata csv
metadata_df = pd.read_csv('pg_catalog.csv')

# List of Gutenberg IDs for the chosen books
chosen_book_ids = [19978, 66039, 66040]  # Extract only the metadata for the three texts by Pessoa in our corpus

# Filter the DataFrame to include only rows with these IDs
filtered_metadata = metadata_df[metadata_df['Text#'].isin(chosen_book_ids)]

Print filtered_metadata to check that the filter worked properly. If it did, we now have the required metadata for the works in our corpus and can export it to a '.csv' file.

In [13]:
# Display the filtered metadata
print(filtered_metadata)

# Save the filtered metadata to a new CSV
filtered_metadata.to_csv('pessoa_gutenberg_metadata.csv', index=False)

       Text#  Type      Issued                            Title Language  \
19907  19978  Text  2006-11-30                       35 Sonnets       en   
65960  66039  Text  2021-08-11  English Poems, Volume 01 (of 2)       en   
65961  66040  Text  2021-08-11  English Poems, Volume 02 (of 2)       en   

                           Authors Subjects LoCC  \
19907  Pessoa, Fernando, 1888-1935   Poetry   PQ   
65960  Pessoa, Fernando, 1888-1935   Poetry   PQ   
65961  Pessoa, Fernando, 1888-1935   Poetry   PQ   

                                  Bookshelves  
19907  Browsing: Literature; Browsing: Poetry  
65960  Browsing: Literature; Browsing: Poetry  
65961  Browsing: Literature; Browsing: Poetry  


In [14]:
# Load the newly created pessoa_gutenberg_metadata.csv into the metadata DataFrame
metadata_df = pd.read_csv('pessoa_gutenberg_metadata.csv')
metadata_df.head()

Unnamed: 0,Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
0,19978,Text,2006-11-30,35 Sonnets,en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry
1,66039,Text,2021-08-11,"English Poems, Volume 01 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry
2,66040,Text,2021-08-11,"English Poems, Volume 02 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry


In order to merge the two DataFrames that we have created so far, we need a common column on which the merge will be executed. Let's rename the 'Text#' column in our metadata_df to 'Filename' so that it matches the 'Filename' column in text_df. Another issue to take care of before merging is that the data type of 'Text#' in the .csv file downloaded from Project Gutenberg is integer, while that of the 'Filename' column in text_df is string. To solve this, we will convert the column in metadata_df to strings.

In [15]:
# Rename column from Text# to Filename in order to merge the two tables
metadata_df.rename(columns={"Text#": "Filename"}, inplace=True)

# Convert the data type of the Filename column into strings to allow the merging of the metadata and text tables in the next step
metadata_df['Filename'] = metadata_df['Filename'].astype(str)

# Merge the files to their metadata in a new DataFrame
pessoa_df = metadata_df.merge(text_df,on='Filename')
pessoa_df.head()

Unnamed: 0,Filename,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves,Document,Raw_text
0,19978,Text,2006-11-30,35 Sonnets,en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,﻿The Project Gutenberg eBook of 35 Sonnets\n ...,35 Sonnets by Fernando Pessoa I. ...
1,66039,Text,2021-08-11,"English Poems, Volume 01 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...
2,66040,Text,2021-08-11,"English Poems, Volume 02 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...
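
Because the default inner merge drops rows whose keys do not match exactly without any warning, it can be worth checking the result. The sketch below uses two miniature stand-ins for metadata_df and text_df (the placeholder texts are made up) and relies on the `validate` and `indicator` parameters of `DataFrame.merge` to flag duplicate keys and unmatched rows.

```python
import pandas as pd

# Miniature stand-ins for metadata_df and text_df; the text values are placeholders
metadata = pd.DataFrame({
    'Text#': [19978, 66039, 66040],
    'Title': ['35 Sonnets',
              'English Poems, Volume 01 (of 2)',
              'English Poems, Volume 02 (of 2)'],
})
texts = pd.DataFrame({
    'Filename': ['19978', '66039', '66040'],
    'Raw_text': ['placeholder text 1', 'placeholder text 2', 'placeholder text 3'],
})

# Same preparation as in the Notebook: rename the key column and convert it to strings
metadata = metadata.rename(columns={'Text#': 'Filename'})
metadata['Filename'] = metadata['Filename'].astype(str)

# An outer merge keeps unmatched rows; indicator=True adds a '_merge' column that
# says where each row came from; validate raises an error if a key is duplicated
merged = metadata.merge(texts, on='Filename', how='outer',
                        validate='one_to_one', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

An empty `unmatched` DataFrame means that every text found its metadata row and vice versa.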


Now we can save our merged DataFrame, which contains both the metadata of each text in our corpus and the content of each file, to a new '.csv' file.

In [16]:
# Save merged DataFrame as csv to your computer's working directory
pessoa_df.to_csv('Metadata_and_full_texts.csv', encoding='utf-8', index=False, header=True)

After merging the two dataframes we can start with processing the content of our DataFrames using spaCy. See the Notebook at https://github.com/yevgenm/corpus-analysis-spacy/ for a detailed description of all the steps that are performed in the following lines of code.

In [17]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check which components the pipeline contains
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [18]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

In [19]:
# Apply the function to the "Raw_text" column, so that the nlp pipeline is called on each of the three Pessoa works in our corpus
pessoa_df['Doc'] = pessoa_df['Raw_text'].apply(process_text)
pessoa_df.head()

Unnamed: 0,Filename,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves,Document,Raw_text,Doc
0,19978,Text,2006-11-30,35 Sonnets,en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,﻿The Project Gutenberg eBook of 35 Sonnets\n ...,35 Sonnets by Fernando Pessoa I. ...,"(35, Sonnets, , by, Fernando, Pessoa, ..."
1,66039,Text,2021-08-11,"English Poems, Volume 01 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...,"(ENGLISH, , POEMS, , BY, , FERNAN..."
2,66040,Text,2021-08-11,"English Poems, Volume 02 (of 2)",en,"Pessoa, Fernando, 1888-1935",Poetry,PQ,Browsing: Literature; Browsing: Poetry,"﻿The Project Gutenberg eBook of English Poems,...",ENGLISH POEMS BY FERNANDO PESSOA ...,"(ENGLISH, , POEMS, , BY, , FERNANDO..."


In [20]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

In [21]:
# Run the token retrieval function on the doc objects in the dataframe
pessoa_df['Tokens'] = pessoa_df['Doc'].apply(get_token)

In [22]:
# Display the list of tokens in each text within the corpus
tokens = pessoa_df[['Title', 'Tokens']].copy()
tokens.head()

Unnamed: 0,Title,Tokens
0,35 Sonnets,"[35, Sonnets, , by, Fernando, Pessoa, ..."
1,"English Poems, Volume 01 (of 2)","[ENGLISH, , POEMS, , BY, , FERNAN..."
2,"English Poems, Volume 02 (of 2)","[ENGLISH, , POEMS, , BY, , FERNANDO..."
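
The empty-looking entries in the token lists above are whitespace tokens: when a text contains runs of several spaces, spaCy keeps the extra whitespace as tokens of its own. If they are not wanted, they can be skipped with the `is_space` attribute of a token. The sketch below uses a blank English pipeline (tokenizer only, so no model download is needed); `nlp_blank` and `get_token_no_space` are hypothetical names introduced for this example.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is enough for this illustration
nlp_blank = spacy.blank('en')

# Note the double space between 'Sonnets' and 'by'
doc = nlp_blank('35 Sonnets  by Fernando Pessoa')


def get_token_no_space(doc):
    # Hypothetical variant of get_token that leaves out whitespace-only tokens
    return [token.text for token in doc if not token.is_space]


all_tokens = [token.text for token in doc]
word_tokens = get_token_no_space(doc)
print(all_tokens)
print(word_tokens)
```

The same filter would work on the Doc objects created with the full en_core_web_sm pipeline, for example via `pessoa_df['Doc'].apply(get_token_no_space)`.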


In [23]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
pessoa_df['Lemmas'] = pessoa_df['Doc'].apply(get_lemma)


In [24]:
# Display the lemmas of the words found in each text
lemmas = pessoa_df[['Title', 'Lemmas']].copy()
lemmas.head()

Unnamed: 0,Title,Lemmas
0,35 Sonnets,"[35, sonnet, , by, Fernando, Pessoa, ..."
1,"English Poems, Volume 01 (of 2)","[ENGLISH, , poems, , by, , FERNAN..."
2,"English Poems, Volume 02 (of 2)","[ENGLISH, , poems, , by, , FERNANDO..."


In [25]:
# Define a function to retrieve part-of-speech from a doc object
def get_pos(doc):
    #Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Run the part-of-speech retrieval function on the doc objects in the dataframe
pessoa_df['POS'] = pessoa_df['Doc'].apply(get_pos)

In [26]:
# Display the POS found in each text within the corpus
pos = pessoa_df[['Title', 'POS']].copy()
pos.head()

Unnamed: 0,Title,POS
0,35 Sonnets,"[(NUM, CD), (NOUN, NNS), (SPACE, _SP), (ADP, I..."
1,"English Poems, Volume 01 (of 2)","[(PROPN, NNP), (SPACE, _SP), (NOUN, NN), (SPAC..."
2,"English Poems, Volume 02 (of 2)","[(PROPN, NNP), (SPACE, _SP), (NOUN, NN), (SPAC..."


In [27]:
# Define function to extract verbs from Doc object
def extract_verbs(doc):
    return [token.text for token in doc if token.pos_ == 'VERB']

# Apply function to Doc column and store resulting verbs in new column
pessoa_df['Verbs'] = pessoa_df['Doc'].apply(extract_verbs)

In [28]:
# Display just the title of each work in the corpus and the verbs (in their original, non-lemmatized form) present in it.
verbs = pessoa_df[['Title', 'Verbs']].copy()
verbs.head()

Unnamed: 0,Title,Verbs
0,35 Sonnets,"[write, speak, do, look, transfused, give, ges..."
1,"English Poems, Volume 01 (of 2)","[published, meant, annul, supersede, published..."
2,"English Poems, Volume 02 (of 2)","[Set, come, Let, tell, comparing, lay, awaking..."
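
As a first step towards exploring such a column, the verbs of each work can be counted. This is a minimal sketch using `collections.Counter` from the standard library; the helper name `top_verbs` and the short `sample_verbs` list are made up for illustration and are not actual counts from the corpus.

```python
from collections import Counter


def top_verbs(verbs, n=3):
    # Lowercase the verbs so that 'Let' and 'let' are counted together
    counts = Counter(verb.lower() for verb in verbs)
    # most_common returns (verb, count) pairs, most frequent first
    return counts.most_common(n)


# Made-up sample list, not real data from the corpus
sample_verbs = ['Set', 'come', 'Let', 'tell', 'let', 'come', 'let']
print(top_verbs(sample_verbs, n=2))
```

Applied to the DataFrame, `pessoa_df['Verbs'].apply(top_verbs)` would return the most frequent verbs for each of the three works.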


After all the processing is done it is sensible to export the complete DataFrame into a final '.csv' file in case someone wants to just read it with Pandas in their own environment and perform further processing without repeating the steps done in this Notebook.

In [29]:
# Save the final DataFrame with all its columns as a CSV file
pessoa_df.to_csv('Corpus_data_and_annotations.csv', index=False, encoding='utf-8', header = True)
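
One caveat when reading such a file back in: a '.csv' file stores every cell as text, so columns that held Python lists come back as strings. The sketch below shows one way to restore them, assuming the lists contain only plain strings or tuples, by passing `ast.literal_eval` through the `converters` parameter of `pd.read_csv`. The miniature DataFrame and the in-memory buffer are stand-ins made up for this example. The 'Doc' column is saved only as its text and cannot be turned back into spaCy Doc objects this way.

```python
import ast
import io

import pandas as pd

# Miniature stand-in for pessoa_df with one list-valued column
df = pd.DataFrame({'Title': ['35 Sonnets'], 'Verbs': [['write', 'speak', 'do']]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read back without converters: the list comes back as a string
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read back with a converter: the string is parsed into a list again
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={'Verbs': ast.literal_eval})

print(type(as_text['Verbs'][0]), type(as_lists['Verbs'][0]))
```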