# Workflow for text analysis from PDF

This notebook consolidates the approaches used in several earlier projects into a single workflow that can be applied to a range of texts, including longer-form documents extracted from PDF.

<a target="_blank" href="https://colab.research.google.com/github/arielsaffer/pest-text-pipeline/blob/main/notebooks/text_mining_workflow.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
# Google Colab setup

from google.colab import drive
drive.mount('/content/drive')

## Steps

1. Extract text from the PDF.
2. Determine how documents are divided (e.g., chapter, page, paragraph, or sentence).
3. Apply exploratory text analysis: topic modeling (LDA) and keyword search (regex).
4. Apply machine learning to select for topics about presence.
5. Apply geoparsing to extract locations from presence records.

In [1]:
# Import general libraries

import pandas as pd
import os
import glob

# OCR

import pytesseract

pd.set_option('display.max_colwidth', 100)

In [2]:
os.chdir("..")
# Path to the Tesseract executable; adjust this to match your installation
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
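
The hard-coded Windows path above will not exist on Colab or Linux. A minimal sketch of a more portable lookup, assuming Tesseract is on the system `PATH` wherever it is installed; `find_tesseract` is a hypothetical helper and not part of this project:

```python
import shutil


def find_tesseract(fallback=r"C:\Program Files\Tesseract-OCR\tesseract"):
    """Return the tesseract executable found on PATH, else the fallback path."""
    found = shutil.which("tesseract")  # None when tesseract is not on PATH
    return found or fallback


# Usage (assumes pytesseract has been imported as in the cell above):
# pytesseract.pytesseract.tesseract_cmd = find_tesseract()
```

The fallback keeps the original Windows default, so behavior is unchanged on a machine where that path is correct.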

### 1. Extract text from PDF (scans)
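
The extraction itself happens inside `pdf_to_corpus`, whose implementation is not shown in this notebook. As a rough illustration only, OCR on a scanned PDF can be sketched as rendering each page to an image and passing it to Tesseract. The sketch assumes the third-party `pdf2image` package (which needs Poppler installed) in addition to `pytesseract`; `ocr_pdf_pages` and its `render` and `ocr` parameters are hypothetical names introduced here.

```python
def ocr_pdf_pages(pdf_path, language="eng", render=None, ocr=None):
    """Return one OCR'd text string per page of a scanned PDF.

    `render` maps a PDF path to a list of page images and `ocr` maps one
    image to text. Both default to pdf2image and pytesseract and can be
    swapped out, for example in tests.
    """
    if render is None:
        # Assumption: pdf2image is installed; it is not imported in this notebook
        from pdf2image import convert_from_path
        render = convert_from_path
    if ocr is None:
        import pytesseract

        def ocr(image):
            return pytesseract.image_to_string(image, lang=language)

    return [ocr(image) for image in render(pdf_path)]


# Usage (requires Tesseract, Poppler, and a real scanned PDF):
# page_texts = ocr_pdf_pages(pdf_path, language="eng")
```

Keeping the renderer and the OCR step as parameters makes the page loop easy to check without a Tesseract installation.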

### 2. Determine how documents are divided

Set the division to page, paragraph, or sentence.

In [51]:
pdf_path = os.path.join("data", "DowleyBook6.19.24.pdf")  # portable across Windows and Linux/Colab
language = 'eng'
document_level = "paragraph" # "page", "paragraph", or "sentence"

In [None]:
from text_analysis.data_functions import pdf_to_corpus

text_corpus = pdf_to_corpus(pdf_path, language, document_level)
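
To give a sense of what `document_level` controls, here is a minimal sketch of splitting one page of extracted text at each level. It is not the code inside `pdf_to_corpus`, which is not shown here; `split_page` is a hypothetical helper, and the sentence rule is deliberately naive (it will split after abbreviations such as "Mr.").

```python
import re


def split_page(page_text, document_level="paragraph"):
    """Split one page of text into documents at the requested level."""
    if document_level == "page":
        parts = [page_text]
    elif document_level == "paragraph":
        # Treat one or more blank lines as a paragraph break
        parts = re.split(r"\n\s*\n", page_text)
    elif document_level == "sentence":
        # Flatten line breaks, then split after ., ! or ? followed by whitespace
        flat = " ".join(page_text.split())
        parts = re.split(r"(?<=[.!?])\s+", flat)
    else:
        raise ValueError(f"Unknown document_level: {document_level!r}")
    # Collapse internal whitespace and drop empty pieces
    return [" ".join(part.split()) for part in parts if part.strip()]
```

Paragraph-level splitting depends on the OCR output preserving blank lines between paragraphs, so it is worth inspecting a few pages before relying on it.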

### 3. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)

#### LDA

In [55]:
from text_analysis.data_functions import text_to_topics

In [58]:
text_corpus = pd.Series(text_corpus)

In [59]:
topic_table = text_to_topics(text_corpus, iterations=10, num_topics=10)
# Only 10 iterations here for testing; use more in practice

Text data preprocessed (tokenized, lowercased, stopwords removed).


Topic Model Training...


Iteration: 0	Log-likelihood: -9.347740383133466
Iteration: 1	Log-likelihood: -9.060880276318592
Iteration: 2	Log-likelihood: -8.932297997150279
Iteration: 3	Log-likelihood: -8.862250716003233
Iteration: 4	Log-likelihood: -8.809935294084484
Iteration: 5	Log-likelihood: -8.772942098421428
Iteration: 6	Log-likelihood: -8.75029810112901
Iteration: 7	Log-likelihood: -8.726211430604836
Iteration: 8	Log-likelihood: -8.703085268561262
Iteration: 9	Log-likelihood: -8.68506839065715
Top 10 words for each topic extracted.


In [60]:
# Look at the topics
topic_table


|  | Topic Number | Top Words |
|---|---|---|
| 0 | 0 | [diseased, tubers, sound, potato, found, disease, potatoes, crop, parts, healthy] |
| 1 | 1 | [may, thus, potato, plant, place, long, early, reason, late, vegetable] |
| 2 | 2 | [upon, oo, price, c, country, made, already, would, high, carried] |
| 3 | 3 | [potatoes, seed, may, crop, produce, best, planting, potato, quantity, much] |
| 4 | 4 | [upon, also, present, meeting, would, general, council, state, purpose, best] |
| 5 | 5 | [would, could, much, many, mr, must, done, might, least, cannot] |
| 6 | 6 | [mr, shall, many, time, given, first, subject, far, yet, attention] |
| 7 | 7 | [potato, farmers, disease, crop, page, potatoes, november, last, 1845, failure] |
| 8 | 8 | [would, think, one, could, bad, part, might, good, great, state] |
| 9 | 9 | [much, may, subject, system, upon, practical, great, feel, give, means] |
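
Several of the topics above are dominated by function words such as "would", "could", "may", "upon", and "mr", which suggests that a longer stopword list would give more interpretable topics. The sketch below shows the preprocessing reported in the log (tokenize, lowercase, remove stopwords) with a few corpus-specific additions. It is a standalone illustration: whether and how `text_to_topics` accepts a custom stopword list is not shown here, and both word sets are small hypothetical examples.

```python
import re

# Tiny illustrative sets; a real run would start from a full stopword list
BASE_STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "it"}
EXTRA_STOPWORDS = {"would", "could", "may", "might", "must",
                   "upon", "mr", "also", "much", "many"}


def preprocess(text, stopwords=BASE_STOPWORDS | EXTRA_STOPWORDS):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stopwords]
```

Digits are kept as tokens so that years such as 1845, which appears in topic 7, survive preprocessing.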


#### Keyword search

In [61]:
from text_analysis.data_functions import keyword_search

In [62]:
keywords_of_interest = ["famine", "hunger", "hungry", "shortage", "starv"]

In [63]:
# Search for the keywords

keyword_search(text_corpus, keywords_of_interest)

|  | Text | Keywords Found |
|---|---|---|
| 11 | In the potato, the main diseases prior to the famine were potato leaf roll virus (Curl), black-... | True |
| 15 | In the period leading up to the famine the main disease referred to in the Gazette appeared to ... | True |
| 228 | A tour of the area around Kilberry Cross, near Navan, in August 1980 reminded me very much of wh... | True |
| 295 | It is quite certain that starch, or materials corresponding to it, exist to _ acertain amount in... | True |
| 407 | It may be as a result of this criticism that their utterances on the potato problem seemed to di... | True |
| ... | ... | ... |
| 8105 | The conference covered all aspects of research on Phytophthora infestans, including the scientif... | True |
| 8110 | What was the Farmers Gazette? The Farmers Gazette and Journal of Practical Horticulture was an... | True |
| 8113 | Given the absence of modern means of communication at the time of the famine, it is extraordinar... | True |
| 8204 | Potatoes being their all, their sole subsistence—if they fail, the people look in terror to a fa... | True |
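
The keyword list includes the stem "starv" so that it can match "starve", "starving", and "starvation". A minimal sketch of that kind of regex search over a list of texts follows. The internals of `keyword_search` are not shown in this notebook, so the word-boundary anchor and the case-insensitive matching are assumptions, and `find_keyword_matches` is a hypothetical name.

```python
import re


def find_keyword_matches(texts, keywords):
    """Return (index, text) pairs for texts containing any keyword or stem."""
    # \b anchors each keyword to the start of a word; the end is left open
    # so that a stem like "starv" also matches longer word forms
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)
    return [(i, text) for i, text in enumerate(texts) if pattern.search(text)]
```

Escaping each keyword with `re.escape` keeps punctuation in a keyword from being read as regex syntax.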
