# Workflow for text analysis from PDF

This notebook consolidates the workflow used across several projects into a single pipeline that can be applied to a range of texts, including long-form documents extracted from PDF.

<a target="_blank" href="https://colab.research.google.com/github/arielsaffer/pest-text-pipeline/blob/main/notebooks/text_mining_workflow.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Steps

0. Set up the workspace
1. Extract text from PDF to produce a corpus of "documents" (e.g., pages, paragraphs, sentences)
2. Apply exploratory text analysis: Topic modeling (LDA) and keyword search (regex)
3. Machine learning to select for topics about presence
4. Geoparsing to extract locations from presence records
5. Visualize results

## 0. Set up the workspace

Define where you will be running the notebook:

"Local" assumes that you have cloned the full GitHub repository to your local machine. "Colab" assumes that you are running this notebook on Google Colab.



In [1]:
# @title Define where you will be running the notebook. { display-mode: "form" }

run_on = 'Local' # @param ["Local", "Colab"]
print('You will run the notebook on', run_on)


You will run the notebook on Local
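If you would rather not set `run_on` by hand, the environment can usually be detected. The following is a minimal sketch, assuming that the `google.colab` package is importable only on Colab; `detect_environment` is a hypothetical helper and not part of the repository.

```python
import importlib.util


def detect_environment():
    """Return "Colab" when the google.colab package is importable, else "Local"."""
    try:
        on_colab = importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package "google" is not installed at all
        on_colab = False
    return "Colab" if on_colab else "Local"


run_on = detect_environment()
print('You will run the notebook on', run_on)
```

The form field above remains useful as a manual override, for example when running on another hosted notebook service.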


In [2]:
# Import general libraries

import os  # needed in both environments to build and check the PDF path

import pandas as pd
pd.set_option('display.max_colwidth', 100)

# Set up workspace

if run_on == "Local":
    # Set up pytesseract
    import os
    import pytesseract
    os.chdir("..")
    # This should be the path of the tesseract installation
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    import text_analysis.data_functions as ta
    
elif run_on == "Colab":
    # Setup Google Drive mount
    from google.colab import drive
    drive.mount('/content/drive')

    # Install required programs and packages
    !sudo apt install tesseract-ocr
    !pip install pytesseract
    !pip install pdf2image
    !pip install tomotopy
    !python -m spacy download en_core_web_md

    # Import the data functions from the Github repository
    !git clone https://github.com/arielsaffer/pest_text_pipeline.git
    import pest_text_pipeline.text_analysis.data_functions as ta
    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asaffer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# @title Describe the PDF file. { display-mode: "form" }
# @markdown Provide the location of the PDF file to be processed
# @markdown Either a local path, relative path (to the repository root), 
# @markdown or a Google Drive path (typically starts with "/content/drive/My Drive/")

file_location = r"data" # @param {type:"string", placeholder:"data"}
pdf_name = r"DowleyBook6.19.24.pdf" # @param {type:"string", placeholder:"DowleyBook6.19.24.pdf"}

# @markdown Provide the language of the text
language = 'English' # @param ['Afrikaans', 'Amharic', 'Arabic', 'Assamese', 'Azerbaijani', 'Azerbaijani - Cyrilic', 'Belarusian', 'Bengali', 'Tibetan', 'Bosnian', 'Breton', 'Bulgarian', 'Catalan; Valencian', 'Cebuano', 'Czech', 'Chinese - Simplified', 'Chinese - Traditional', 'Cherokee', 'Corsican', 'Welsh', 'Danish', 'German', 'German (Fraktur Latin)', 'Dzongkha', 'Greek, Modern (1453-)', 'English', 'English, Middle (1100-1500)', 'Esperanto', 'Math / equation detection module', 'Estonian', 'Basque', 'Faroese', 'Persian', 'Filipino (old - Tagalog)', 'Finnish', 'French', 'German - Fraktur (now deu_latf)', 'French, Middle (ca.1400-1600)', 'Western Frisian', 'Scottish Gaelic', 'Irish', 'Galician', 'Greek, Ancient (to 1453) (contrib)', 'Gujarati', 'Haitian; Haitian Creole', 'Hebrew', 'Hindi', 'Croatian', 'Hungarian', 'Armenian', 'Inuktitut', 'Indonesian', 'Icelandic', 'Italian', 'Italian - Old', 'Javanese', 'Japanese', 'Kannada', 'Georgian', 'Georgian - Old', 'Kazakh', 'Central Khmer', 'Kirghiz; Kyrgyz', 'Kurmanji (Kurdish - Latin Script)', 'Korean', 'Korean (vertical)', 'Lao', 'Latin', 'Latvian', 'Lithuanian', 'Luxembourgish', 'Malayalam', 'Marathi', 'Macedonian', 'Maltese', 'Mongolian', 'Maori', 'Malay', 'Burmese', 'Nepali', 'Dutch; Flemish', 'Norwegian', 'Occitan (post 1500)', 'Oriya', 'Orientation and script detection module', 'Panjabi; Punjabi', 'Polish', 'Portuguese', 'Pushto; Pashto', 'Quechua', 'Romanian; Moldavian; Moldovan', 'Russian', 'Sanskrit', 'Sinhala; Sinhalese', 'Slovak', 'Slovenian', 'Sindhi', 'Spanish; Castilian', 'Spanish; Castilian - Old', 'Albanian', 'Serbian', 'Serbian - Latin', 'Sundanese', 'Swahili', 'Swedish', 'Syriac', 'Tamil', 'Tatar', 'Telugu', 'Tajik', 'Thai', 'Tigrinya', 'Tonga', 'Turkish', 'Uighur; Uyghur', 'Ukrainian', 'Urdu', 'Uzbek', 'Uzbek - Cyrilic', 'Vietnamese', 'Yiddish', 'Yoruba']
# @markdown Determine how the document should be subdivided ("page", "paragraph", or "sentence")
document_level = "paragraph" # @param ["page", "paragraph", "sentence"]


In [43]:
# Build the path to the PDF file
pdf_path = os.path.join(file_location, pdf_name)

# Map the language to its OCR code, NLTK language name, and spaCy language code

ocr_code, nltk_lang, spacy_code = ta.map_language(language)

# Check that the file exists

if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"File {pdf_path} not found! Please double-check the location.")
else:
    print(f'You will analyze {pdf_path} in {language} at the {document_level} level')

You will analyze data\DowleyBook6.19.24.pdf in English at the paragraph level
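The implementation of `ta.map_language` is not shown in this notebook. As a rough illustration of the idea, the sketch below maps a form label to the three codes that the later steps need. The dictionary is an illustrative subset, and `LANGUAGE_CODES` and `map_language_sketch` are hypothetical names; the repository's function may be structured differently.

```python
# Illustrative subset only; the real mapping covers every language offered
# in the form above and may use a different structure.
LANGUAGE_CODES = {
    # form label: (Tesseract code, NLTK stopwords name, spaCy language code)
    "English": ("eng", "english", "en"),
    "French": ("fra", "french", "fr"),
    "German": ("deu", "german", "de"),
    "Spanish; Castilian": ("spa", "spanish", "es"),
}


def map_language_sketch(language):
    """Return (ocr_code, nltk_lang, spacy_code) for a form label."""
    try:
        return LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"No language codes known for {language!r}") from None


ocr_code, nltk_lang, spacy_code = map_language_sketch("English")
print(ocr_code, nltk_lang, spacy_code)
```

Not every language that Tesseract can read has NLTK stopwords or a spaCy model, so a fuller mapping would need to handle missing entries for the later steps.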


## 1. Extract text from PDF to produce a corpus of documents

In [None]:
text_corpus = ta.pdf_to_corpus(pdf_path=pdf_path, lang=ocr_code, document_level=document_level)

In [12]:
# Look at the result

text_corpus

Unnamed: 0,Text
0,THE FARMERS GAZETTE References to THE FAMINE PERIOD AND POTATO DISEASES IN IRELAND 1844-1847 ...
1,Dowley from the Farmers Gazette and Journal of Practical Horticulture 1844-1847 Department of...
2,"Landscape Gardener Published every Saturday morning at 23 Bachelors Walk, Dublin January 2011 ..."
3,"At home, the Gazette regularly reported the proceedings of the Royal Agricultural Improvement S..."
4,It is true that the control and membership of the above societies were largely in the hands of t...
...,...
8280,"Also from John Dillon Croker, Esq., Mallow, enclosing sonie excellent specimens of bread made pa..."
8281,"Also from William Phibbs, Esc., Seafield, Sligo, enclosing a report from Mr"
8282,"Cooper of Markree, on the subjects; all which communications, together with various others were ..."
8283,The secretary stated that he had received numerous returns from the local societies in reply to...
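The internals of `ta.pdf_to_corpus` are not shown here. To illustrate what the `document_level` choice means, the sketch below subdivides page texts that have already been extracted (for example by OCR). It assumes paragraphs are separated by blank lines and uses a naive punctuation rule for sentences; `split_into_documents` is a hypothetical helper, not the repository's implementation.

```python
import re


def split_into_documents(page_texts, document_level="paragraph"):
    """Split a list of page strings into pages, paragraphs, or sentences."""
    if document_level == "page":
        docs = list(page_texts)
    elif document_level == "paragraph":
        # Assumes a blank line marks a paragraph break
        docs = [para for page in page_texts for para in re.split(r"\n\s*\n", page)]
    elif document_level == "sentence":
        # Naive rule: break after ., ! or ? followed by whitespace.
        # This also splits after abbreviations such as "Mr."
        docs = re.split(r"(?<=[.!?])\s+", " ".join(page_texts))
    else:
        raise ValueError(f"Unknown document_level: {document_level!r}")
    # Collapse line breaks and repeated spaces, and drop empty documents
    docs = [re.sub(r"\s+", " ", doc).strip() for doc in docs]
    return [doc for doc in docs if doc]


pages = [
    "First line\ncontinues here.\n\nSecond paragraph. It has two sentences.",
    "Third paragraph.",
]
for level in ("page", "paragraph", "sentence"):
    print(level, len(split_into_documents(pages, level)))
```

The resulting list can be wrapped as `pd.DataFrame({"Text": docs})` to match the single `Text` column used in the rest of the notebook.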


In [31]:
# Because this is a scanned text, you may need to do some 
# additional cleaning.

# For example, I noticed that " | " appears instead of "I " in the text.
# The pattern matches a pipe that stands alone between whitespace
# (or at the start or end of the text).

text_corpus["Text"] = text_corpus["Text"].str.replace(r"(?<!\S)\|(?!\S)", "I", regex=True)


# And " O ", " OQ ", and " QO " appear, probably where there were marks on the page.
# Replace each with a single space so that the neighboring words stay separate.

text_corpus["Text"] = text_corpus["Text"].str.replace(" OQ ", " ", regex=False)
text_corpus["Text"] = text_corpus["Text"].str.replace(" O ", " ", regex=False)
text_corpus["Text"] = text_corpus["Text"].str.replace(" QO ", " ", regex=False)
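Cleaning rules like these are easy to get subtly wrong, so it can help to try them on a few hand-written strings before touching the corpus. The sketch below is one way to do that; `clean_ocr_artifacts` and both patterns are illustrative assumptions, not functions from the repository.

```python
import re

# A pipe standing alone between whitespace (or at the start or end of the text)
STANDALONE_PIPE = re.compile(r"(?<!\S)\|(?!\S)")
# Stray "O", "OQ", or "QO" tokens standing alone, plus any trailing whitespace.
# Caution: this would also remove a legitimate standalone "O".
STRAY_MARKS = re.compile(r"(?<!\S)(?:O|OQ|QO)(?!\S)\s*")


def clean_ocr_artifacts(text):
    """Apply the cleaning rules to one string and normalize whitespace."""
    text = STANDALONE_PIPE.sub("I", text)
    text = STRAY_MARKS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


samples = [
    "When | came home",
    "the O crop OQ failed",
    "O'Connell spoke",
    "a|b",
]
for sample in samples:
    print(repr(sample), "->", repr(clean_ocr_artifacts(sample)))
```

Once the rules behave as intended on the samples, the same function can be applied to the corpus with `text_corpus["Text"].map(clean_ocr_artifacts)`.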


In [17]:
# Save the text corpus to a CSV file

text_corpus.to_csv(f"{pdf_path[:-4]}_{document_level}.csv", index=False)
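Slicing off the last four characters works as long as the file name ends in a three-letter extension such as `.pdf`. A slightly more defensive sketch, using `pathlib` from the standard library, is shown below; `corpus_csv_path` is a hypothetical helper name.

```python
from pathlib import Path


def corpus_csv_path(pdf_path, document_level):
    """Build the CSV path next to the PDF, e.g. name.pdf -> name_paragraph.csv."""
    pdf_path = Path(pdf_path)
    # .stem drops only the final extension, so dots inside the name are kept
    return pdf_path.with_name(f"{pdf_path.stem}_{document_level}.csv")


print(corpus_csv_path("data/DowleyBook6.19.24.pdf", "paragraph"))
```

The same helper can be reused when reloading the corpus, so the save and load steps cannot drift apart.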

## 2. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)

#### LDA

In [15]:
# You can restart here by loading your text corpus from the CSV file

text_corpus = pd.read_csv(f"{pdf_path[:-4]}_{document_level}.csv")

In [22]:
topic_table = ta.text_to_topics(text_corpus["Text"], lang = nltk_lang, num_topics=20, num_iter=10)
# 10 iterations here just for testing; use more in practice

Text data preprocessed (tokenized, lowercased, stopwords removed).


Topic Model Training...


Iteration: 0	Log-likelihood: -9.356104189489258
Iteration: 1	Log-likelihood: -9.087671357995603
Iteration: 2	Log-likelihood: -8.92950738336612
Iteration: 3	Log-likelihood: -8.851733474641065
Iteration: 4	Log-likelihood: -8.792270440067417
Iteration: 5	Log-likelihood: -8.770231976723602
Iteration: 6	Log-likelihood: -8.741254776723668
Iteration: 7	Log-likelihood: -8.713777518791224
Iteration: 8	Log-likelihood: -8.698384516607549
Iteration: 9	Log-likelihood: -8.687894034474917
Top 10 words for each topic extracted.


In [23]:
# Look at the topics
topic_table


Unnamed: 0,Topic Number,Top Words
0,0,"[may, much, great, best, would, land, use, necessary, place, mode]"
1,1,"[much, potato, cause, may, yet, far, disease, less, many, crop]"
2,2,"[would, us, one, let, give, part, little, well, man, class]"
3,3,"[every, must, cannot, could, even, would, many, great, means, public]"
4,4,"[present, food, people, may, country, state, must, means, ireland, cannot]"
5,5,"[potato, farmers, page, disease, crop, potatoes, november, failure, october, pages]"
6,6,"[potatoes, diseased, sound, seed, found, one, quite, two, many, taken]"
7,7,"[per, would, ireland, acres, produce, year, grain, average, fair, corn]"
8,8,"[mr, meeting, society, lord, council, also, agricultural, general, letter, different]"
9,9,"[one, potatoes, two, dry, kept, good, laid, mixed, three, lime]"
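The log above reports that the text was tokenized, lowercased, and stripped of stopwords before training. The sketch below illustrates that preprocessing step only, using the standard library. The stopword set is a tiny illustrative stand-in for the NLTK list used by the notebook, and `preprocess` is a hypothetical name, not the repository's function.

```python
import re

# Tiny illustrative stopword list; the notebook uses the NLTK stopwords
# for the selected language instead.
EXAMPLE_STOPWORDS = {"the", "of", "and", "in", "is", "a", "to", "was"}


def preprocess(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(preprocess("The disease of the potato crop was found in Sligo."))
```

Several topics above are dominated by generic words such as "may", "much", and "would". Passing a larger stopword set that includes such words is one possible way to make the topics more distinct, although whether it helps for this corpus would need to be checked.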


#### Keyword search

In [24]:
# Define keywords you are interested in using to search for relevant text

hunger_keywords = ["famine", "hunger", "hungry", "shortage", "starv"]

In [26]:
# Search for the keywords

ta.keyword_search(text_data=text_corpus["Text"], keywords=hunger_keywords)

Unnamed: 0,Text,Keywords Found
11,"In the potato, the main diseases prior to the famine were potato leaf roll virus (Curl), black-...",True
15,In the period leading up to the famine the main disease referred to in the Gazette appeared to ...,True
228,"A tour of the area around Kilberry Cross, near Navan, in August 1980 reminded me very much of wh...",True
295,"It is quite certain that starch, or materials corresponding to it, exist to _ acertain amount in...",True
407,It may be as a result of this criticism that their utterances on the potato problem seemed to di...,True
...,...,...
8103,"The conference covered all aspects of research on Phytophthora infestans, including the scientif...",True
8108,What was the Farmers Gazette? The Farmers Gazette and Journal of Practical Horticulture was an...,True
8111,"Given the absence of modern means of communication at the time of the famine, it is extraordinar...",True
8202,"Potatoes being their all, their sole subsistence—if they fail, the people look in terror to a fa...",True
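The implementation of `ta.keyword_search` is not shown, but the results above suggest substring matching (the stem "starv" is meant to catch "starving" and "starvation"). A minimal standard-library sketch of that behavior follows; it assumes case-insensitive substring matching, and `keyword_search_sketch` is a hypothetical name.

```python
import re


def keyword_search_sketch(texts, keywords):
    """Return (position, text) pairs for texts containing any keyword."""
    # re.escape keeps keywords literal even if they contain regex characters
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [(i, text) for i, text in enumerate(texts) if pattern.search(text)]


texts = [
    "The people were starving.",
    "A fine crop of oats.",
    "FAMINE threatens the west.",
]
hits = keyword_search_sketch(texts, ["famine", "hunger", "hungry", "shortage", "starv"])
print(hits)
```

With pandas, a comparable filter is `text_corpus["Text"].str.contains(pattern, case=False, regex=True)`. Note that substring matching also hits longer words that merely contain a keyword, so short or common keywords can produce false positives.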


## 3. Machine learning to select for topics about presence

In [36]:
# Since I don't have labeled data, here I am just going to consider all documents with 
# "disease report keywords" as positives.
# These are very imperfect! (e.g., "report", "found", and "present" have many common uses)
# These could be refined or, ideally, a small sample of documents should be labeled manually

disease_report_keywords = ["report", "found",  "suffer",
                           "loss", "present", "disease", 
                           ]

# Add a Label column

text_corpus["Label"] = 0

# Set the label to 1 if any of the keywords are in the text

positive_locs = ta.keyword_search(
    text_data=text_corpus["Text"], keywords=disease_report_keywords
    ).index

text_corpus.loc[positive_locs, "Label"] = 1

# Take a stratified sample of 20% of the data as our "labeled data"

labeled_data = text_corpus.groupby('Label', group_keys=False).apply(lambda x: x.sample(frac=0.2)).reset_index(drop=True)

In [37]:
# Take a look at the labeled data

labeled_data

Unnamed: 0,Text,Label
0,"Garden further says—"" It is quite possible that my letter may have been likely to mar the prospe...",0
1,"Dig your potatoes in dry weather, if you can; and if you cannot, get them dry somehow as fast as...",0
2,"I fear the recommendation of leading politicians to get rid of the small farmer of Ireland, and...",0
3,The letter is sufficiently plain,0
4,"238 INDIAN CORN Farmers Gazette, February 14, 1846, page 682 sIR—As there is some prospect of...",0
...,...,...
1652,"Charred turf is not good, but wherever I find a wet pit either here (the model farm) or with th...",1
1653,"Instances of accidental application of salt for other purposes preventing disease, when adjoinin...",1
1654,"I have been lately through the counties of Galway, Roscommon, Westmeath, Kildare, Dublin, Meath...",1
1655,"Rogers does not come before the public as a theorist, and experience should have its proper weig...",1


In [38]:
# Test several models for classification 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier

# Define the models to test

models = [
    LinearSVC(),
    LogisticRegression(),
    ComplementNB(),
    DecisionTreeClassifier()
]

# Define the vectorizer

vectorizer = TfidfVectorizer(stop_words='english', min_df=0.001, ngram_range=(1, 2))

In [39]:
# @title Define the metric that will be used to select the top model. { display-mode: "form" }

selection_metric = "fscore" # @param ["accuracy", "precision", "recall", "fscore"]
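The choice of metric matters because the four options can disagree sharply when positives are rare. The sketch below computes each one from the counts of a confusion matrix. The counts are made up for illustration and are not results from this corpus.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-score from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


# Made-up counts: a cautious model that finds only a fifth of the positives
metrics = classification_metrics(tp=20, fp=2, fn=80, tn=298)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy and precision look strong while recall is poor, and the F-score, as the harmonic mean of precision and recall, is pulled toward the weaker of the two. That is one reason to prefer it when missing relevant records is costly.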

In [40]:
# Test the models

model_testing_df = ta.test_multiple_models(
    X = labeled_data["Text"], 
    y = labeled_data["Label"], 
    models = models, 
    vectorizer = vectorizer, 
    k = 10, 
    random_state = 40
    )

# Look at the results

model_testing_df

Unnamed: 0,model,accuracy,accuracy_sd,precision,precision_sd,recall,recall_sd,fscore,fscore_sd
3,DecisionTreeClassifier(),0.928167,0.015693,0.811326,0.051849,0.807027,0.088169,0.806436,0.058516
0,LinearSVC(),0.923921,0.019677,0.936888,0.068357,0.650847,0.078038,0.764158,0.059485
1,ComplementNB(),0.796605,0.023947,0.476895,0.100424,0.628058,0.089306,0.539344,0.094121
2,LogisticRegression(),0.84123,0.028078,0.915714,0.079446,0.200735,0.05263,0.325766,0.074215
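The internals of `ta.test_multiple_models` are not shown. As a rough sketch of how the means and standard deviations in a table like this can be produced, the code below runs stratified k-fold cross-validation with scikit-learn. Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is refit on the training folds only. `cross_validate_text_model` and the toy texts are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# scikit-learn calls the F-score "f1"
METRICS = ["accuracy", "precision", "recall", "f1"]


def cross_validate_text_model(texts, labels, model, k=10, random_state=40):
    """Return {metric: (mean, standard deviation)} over k stratified folds."""
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=random_state)
    scores = cross_validate(pipeline, texts, labels, cv=folds, scoring=METRICS)
    return {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std()) for m in METRICS}


# Toy data, invented for this example
toy_texts = [
    "blight destroyed the potato crop in Cork",
    "potato disease spread through the fields",
    "the crop rotted from blight in Mayo",
    "disease ruined potato fields near Sligo",
    "blight and rot struck the potato crop",
    "potato fields suffered disease and blight",
    "the society meeting elected a new secretary",
    "the council read a letter from the lord",
    "members of the society held a meeting",
    "the secretary presented a letter to the council",
    "a general meeting of the agricultural society",
    "the lord addressed the council meeting",
]
toy_labels = [1] * 6 + [0] * 6

results = cross_validate_text_model(toy_texts, toy_labels, LogisticRegression(), k=3)
for metric, (mean, sd) in results.items():
    print(f"{metric}: {mean:.3f} (sd {sd:.3f})")
```

Calling the function once per candidate model and collecting the results in a DataFrame would give a comparison table similar in shape to the one above.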


In [41]:
# Select the best model according to the selection metric

best_model = model_testing_df.loc[model_testing_df[selection_metric].idxmax(), "model"]

# Train the best model on the full labeled data

best_model.fit(X = vectorizer.fit_transform(labeled_data["Text"]), y = labeled_data["Label"])

# Predict the labels for the full text corpus

text_corpus["Predicted_Label"] = best_model.predict(vectorizer.transform(text_corpus["Text"]))

In [42]:
# Show the positive results

text_corpus.loc[text_corpus["Predicted_Label"] == 1]

Unnamed: 0,Text,Label,Predicted_Label
0,THE FARMERS GAZETTE References to THE FAMINE PERIOD AND POTATO DISEASES IN IRELAND 1844-1847 ...,0,1
3,"At home, the Gazette regularly reported the proceedings of the Royal Agricultural Improvement S...",1,1
9,The frequent references to failures and diseases of the crop highlight the importance of these a...,1,1
10,References to Potato Diseases Prior to Late Blight While the discipline of Plant Pathology was...,1,1
11,"In the potato, the main diseases prior to the famine were potato leaf roll virus (Curl), black-...",1,1
...,...,...,...
8275,"The Secretary read a number of letters, from various quarters, in reference to the potato disea...",1,1
8276,"A letter was also read from Professor Johnston, of Durham, requesting to be made acquainted wit...",1,1
8279,"A communication was also read from the Lord Lieutenant, enclosing a letter detailing the ""resul...",1,1
8281,"Also from William Phibbs, Esc., Seafield, Sligo, enclosing a report from Mr",1,1


## 4. Geoparsing to extract locations from presence records

In [None]:
# Load the language-specific spaCy NER model

# Assumes load_lang_nlp is provided by the data functions module imported as `ta`
nlp = ta.load_lang_nlp(spacy_code)
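Once a pipeline is loaded, the recognition half of geoparsing amounts to keeping the named entities whose labels denote places. The sketch below shows that filtering step. To keep it runnable without downloading a model, it uses a blank English pipeline with a rule-based entity ruler as a stand-in for the statistical NER model. `extract_locations`, `demo_nlp`, and the patterns are illustrative assumptions, and the `GPE` and `LOC` labels are those used by spaCy's English models (other languages may use a different label set).

```python
import spacy

# Entity labels treated as locations (as used by spaCy's English models)
LOCATION_LABELS = {"GPE", "LOC"}


def extract_locations(nlp, text):
    """Return the text of every entity whose label denotes a place."""
    return [ent.text for ent in nlp(text).ents if ent.label_ in LOCATION_LABELS]


# Stand-in pipeline: blank English tokenizer plus hand-written entity patterns.
# In the notebook, the loaded `nlp` model would be passed instead.
demo_nlp = spacy.blank("en")
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Sligo"},
    {"label": "GPE", "pattern": "Mallow"},
    {"label": "PERSON", "pattern": "William Phibbs"},
])

sample = "Also from William Phibbs, Seafield, Sligo, enclosing a report."
print(extract_locations(demo_nlp, sample))
```

Applied to the corpus, this could look like `text_corpus.loc[text_corpus["Predicted_Label"] == 1, "Text"].map(lambda t: extract_locations(nlp, t))`. Resolving the extracted names to coordinates is a separate step that is not covered by this sketch.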



## 5. Visualize results