# Workflow for text analysis from PDF

This notebook formalizes the workflow used in several earlier projects into a single pipeline that can be applied to a range of texts, including long-form documents extracted from PDF.

<a target="_blank" href="https://colab.research.google.com/github/arielsaffer/pest-text-pipeline/blob/main/notebooks/text_mining_workflow.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Steps

0. Set up the workspace
1. Extract text from PDF to produce a corpus of "documents" (e.g., pages, paragraphs, sentences)
2. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)
3. Machine learning to select for topics about presence
4. Geoparsing to extract locations from presence records

### 0. Set up the workspace

In [None]:
run_on = "Local" # "Local" or "Colab"
# Local assumes that you have cloned the full Github repository to your local machine
# Colab assumes that you are running this notebook on Google Colab
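
If you would rather not set `run_on` by hand, the environment can be detected instead. The following is a minimal sketch, assuming that the `google.colab` package is importable only inside a Colab runtime. The helper name `detect_runtime` is hypothetical and not part of the repository.

```python
import importlib.util


def detect_runtime() -> str:
    """Return "Colab" when the google.colab package is importable, else "Local"."""
    try:
        found = importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "google" itself is missing
        found = False
    return "Colab" if found else "Local"


run_on = detect_runtime()
print(run_on)
```

Keeping the manual setting is still reasonable when you want to force one code path, for example while testing the Colab branch of the setup cell.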

In [68]:
# Import general libraries

import os  # needed in both environments (used below for path handling)
import pandas as pd
pd.set_option('display.max_colwidth', 100)

# Set up workspace

if run_on == "Local":
    # Set up pytesseract
    import os
    import pytesseract
    os.chdir("..")
    # This should be the path of the tesseract installation
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    import text_analysis.data_functions as ta
    
elif run_on == "Colab":
    # Setup Google Drive mount
    from google.colab import drive
    drive.mount('/content/drive')

    # Install required programs and packages
    !sudo apt install tesseract-ocr
    !pip install pytesseract
    !pip install pdf2image
    !pip install tomotopy
    !python -m spacy download en_core_web_md

    # Import the data functions from the Github repository
    !git clone https://github.com/arielsaffer/pest_text_pipeline.git
    import pest_text_pipeline.text_analysis.data_functions as ta
    

### 1. Extract text from PDF to produce a corpus of documents

In [76]:
# Provide the location of the PDF file to be processed
# Either a local path, relative path (to the repository root), 
# or a Google Drive path (typically starts with "/content/drive/My Drive/")
data_dir = r"data"
pdf_path = f"{data_dir}/DowleyBook6.19.24.pdf"  # forward slash works on Windows, Linux, and Colab
# Provide the language of the text
language = 'eng'
# Determine how the document should be subdivided ("page", "paragraph", or "sentence")
document_level = "paragraph"

In [84]:
root_dir = os.path.dirname(pdf_path)
file_name = os.path.basename(pdf_path)

In [86]:
file_name

'DowleyBook6.19.24.pdf'

In [85]:
root_dir

'data'

In [None]:
text_corpus = ta.pdf_to_corpus(pdf_path=pdf_path, lang=language, document_level=document_level)
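
The internals of `ta.pdf_to_corpus` are not shown in this notebook. As a rough sketch of what this step involves, the code below assumes that pages are rendered with `pdf2image` and read with `pytesseract` (both installed during setup), that a blank line marks a paragraph break, and that sentences end with `.`, `!`, or `?`. The function names `ocr_pdf_pages` and `split_documents` and the sample text are hypothetical and may differ from what the repository does.

```python
import re


def ocr_pdf_pages(pdf_path, lang="eng"):
    """OCR each page of a PDF and return one string per page.

    Imports are local so the splitting helper below can be used
    without pdf2image/pytesseract installed.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path)  # one PIL image per page
    return [pytesseract.image_to_string(img, lang=lang) for img in images]


def split_documents(pages, document_level="paragraph"):
    """Split page texts into documents at the requested level."""
    if document_level == "page":
        return [p.strip() for p in pages if p.strip()]
    if document_level == "paragraph":
        # Assumption: a blank line marks a paragraph break
        parts = [para for p in pages for para in re.split(r"\n\s*\n", p)]
    elif document_level == "sentence":
        # Naive rule: split after ., ! or ? followed by whitespace
        flat = " ".join(" ".join(p.split()) for p in pages)
        parts = re.split(r"(?<=[.!?])\s+", flat)
    else:
        raise ValueError(f"Unknown document_level: {document_level!r}")
    # Collapse internal line breaks and drop empty pieces
    return [" ".join(part.split()) for part in parts if part.strip()]


# Hypothetical sample text standing in for OCR output
sample_pages = ["First paragraph,\nstill first.\n\nSecond paragraph.", "Third one. Short!"]
print(split_documents(sample_pages, "paragraph"))
print(split_documents(sample_pages, "sentence"))
```

The sentence rule is deliberately simple and will split on abbreviations such as "Esq."; a tokenizer from a library such as spaCy is likely a better choice for sentence-level corpora.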

In [74]:
# Look at the result

text_corpus

0       THE FARMERS GAZETTE  References to  THE FAMINE PERIOD AND POTATO DISEASES IN IRELAND 1844-1847  ...
1       Dowley  from the  Farmers Gazette and Journal of Practical Horticulture 1844-1847  Department of...
2       Landscape Gardener  Published every Saturday morning at 23 Bachelors Walk, Dublin  January 2011 ...
3        At home, the Gazette regularly reported the proceedings of the Royal Agricultural Improvement S...
4       It is true that the control and membership of the above societies were largely in the hands of t...
                                                       ...                                                 
8282    Also from John Dillon Croker, Esq., Mallow, enclosing sonie excellent specimens of bread made pa...
8283                            Also from William Phibbs, Esc., Seafield, Sligo, enclosing a report from Mr
8284    Cooper of Markree, on the subjects; all which communications, together with various others were ...
8285     The secretary state

In [83]:
# Save the text corpus to a CSV file

text_corpus.to_csv(f"{pdf_path[:-4]}_{document_level}.csv", index=False)

### 2. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)

#### LDA

In [59]:
topic_table = ta.text_to_topics(text_corpus, iterations=10, num_topics=20)
# 10 iterations here just for testing; use more in practice

Text data preprocessed (tokenized, lowercased, stopwords removed).


Topic Model Training...


Iteration: 0	Log-likelihood: -9.347740383133466
Iteration: 1	Log-likelihood: -9.060880276318592
Iteration: 2	Log-likelihood: -8.932297997150279
Iteration: 3	Log-likelihood: -8.862250716003233
Iteration: 4	Log-likelihood: -8.809935294084484
Iteration: 5	Log-likelihood: -8.772942098421428
Iteration: 6	Log-likelihood: -8.75029810112901
Iteration: 7	Log-likelihood: -8.726211430604836
Iteration: 8	Log-likelihood: -8.703085268561262
Iteration: 9	Log-likelihood: -8.68506839065715
Top 10 words for each topic extracted.
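
For readers who want to see what a call like `ta.text_to_topics` might involve, here is a minimal sketch using `tomotopy` (installed during setup). It is an assumption about the general approach, not the repository's implementation: the names `preprocess` and `train_lda`, the tiny stopword list, and the choice to drop one-letter tokens are all illustrative.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "were", "it"}


def preprocess(text):
    """Lowercase, tokenize on letters, and drop stopwords and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def train_lda(texts, num_topics=20, iterations=100, top_n=10):
    """Fit an LDA model and return the top words for each topic."""
    import tomotopy as tp  # imported here so preprocess() works without it

    model = tp.LDAModel(k=num_topics)
    for text in texts:
        tokens = preprocess(text)
        if tokens:  # skip documents left empty by preprocessing
            model.add_doc(tokens)
    for i in range(iterations):
        model.train(1)  # one Gibbs sampling iteration per call
        print(f"Iteration: {i}\tLog-likelihood: {model.ll_per_word}")
    return {
        k: [word for word, _ in model.get_topic_words(k, top_n=top_n)]
        for k in range(num_topics)
    }


print(preprocess("The potato crop was diseased in 1845."))
```

With a corpus in hand, `train_lda(list(text_corpus), num_topics=20, iterations=500)` would print a log-likelihood trace similar to the one above; the per-word log-likelihood typically flattens out once the model has converged, which is one rough guide for choosing the number of iterations.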


In [60]:
# Look at the topics
topic_table


|   | Topic Number | Top Words |
|---|---|---|
| 0 | 0 | [diseased, tubers, sound, potato, found, disease, potatoes, crop, parts, healthy] |
| 1 | 1 | [may, thus, potato, plant, place, long, early, reason, late, vegetable] |
| 2 | 2 | [upon, oo, price, c, country, made, already, would, high, carried] |
| 3 | 3 | [potatoes, seed, may, crop, produce, best, planting, potato, quantity, much] |
| 4 | 4 | [upon, also, present, meeting, would, general, council, state, purpose, best] |
| 5 | 5 | [would, could, much, many, mr, must, done, might, least, cannot] |
| 6 | 6 | [mr, shall, many, time, given, first, subject, far, yet, attention] |
| 7 | 7 | [potato, farmers, disease, crop, page, potatoes, november, last, 1845, failure] |
| 8 | 8 | [would, think, one, could, bad, part, might, good, great, state] |
| 9 | 9 | [much, may, subject, system, upon, practical, great, feel, give, means] |


#### Keyword search

In [62]:
# Define keywords you are interested in using to search for relevant text

keywords_of_interest = ["famine", "hunger", "hungry", "shortage", "starv"]

In [63]:
# Search for the keywords

ta.keyword_search(text_data=text_corpus, keywords=keywords_of_interest)

|   | Text | Keywords Found |
|---|---|---|
| 11 | In the potato, the main diseases prior to the famine were potato leaf roll virus (Curl), black-... | True |
| 15 | In the period leading up to the famine the main disease referred to in the Gazette appeared to ... | True |
| 228 | A tour of the area around Kilberry Cross, near Navan, in August 1980 reminded me very much of wh... | True |
| 295 | It is quite certain that starch, or materials corresponding to it, exist to _ acertain amount in... | True |
| 407 | It may be as a result of this criticism that their utterances on the potato problem seemed to di... | True |
| ... | ... | ... |
| 8105 | The conference covered all aspects of research on Phytophthora infestans, including the scientif... | True |
| 8110 | What was the Farmers Gazette? The Farmers Gazette and Journal of Practical Horticulture was an... | True |
| 8113 | Given the absence of modern means of communication at the time of the famine, it is extraordinar... | True |
| 8204 | Potatoes being their all, their sole subsistence—if they fail, the people look in terror to a fa... | True |
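
The keyword list mixes whole words with the stem `starv`, which suggests substring matching. A minimal sketch of that idea with the standard `re` module follows; it assumes case-insensitive substring matching, reuses the name `keyword_search` only for illustration, and may not match how `ta.keyword_search` is implemented. The sample sentences are made up.

```python
import re


def keyword_search(texts, keywords):
    """Return (index, text) pairs for texts containing any keyword.

    Keywords are matched case-insensitively as substrings, so the stem
    "starv" also matches "starving" and "starvation".
    """
    if not keywords:
        return []  # an empty pattern would otherwise match every text
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [(i, t) for i, t in enumerate(texts) if pattern.search(t)]


# Hypothetical sample documents, not taken from the corpus
sample = [
    "The crop looks healthy this season.",
    "Many fear Starvation if the potatoes fail.",
    "A shortage of seed was reported.",
]
hits = keyword_search(sample, ["famine", "hunger", "hungry", "shortage", "starv"])
print(hits)
```

On a pandas Series of texts, `Series.str.contains` with the same joined pattern and `case=False` should give a comparable boolean mask. Escaping each keyword with `re.escape` keeps punctuation in a keyword from being read as regex syntax; drop it if you want to pass real regular expressions.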


### 3. Machine learning to select for topics about presence

In [None]:
# This requires pre-labeled data to train the model
# i.e., a column in the text_corpus dataframe with the labels

train_path = f"{data_dir}/DowleyBook6.19.24_train.csv"  # forward slash works on Windows, Linux, and Colab
train_data = pd.read_csv(train_path)



### 4. Geoparsing to extract locations from presence records