# Workflow for text analysis from PDF

This notebook consolidates the workflow used in various projects into a single pipeline that can be applied to a range of texts, including long-form documents extracted from PDF.

<a target="_blank" href="https://colab.research.google.com/github/arielsaffer/pest-text-pipeline/blob/main/notebooks/text_mining_workflow.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Steps

0. Set up the workspace
1. Extract text from PDF to produce a corpus of "documents" (e.g., pages, paragraphs, sentences)
2. Apply exploratory text analysis: topic modeling (LDA) and keyword search (regex)
3. Machine learning to select for topics about presence
4. Geoparsing to extract locations from presence records
5. Visualize results

## 0. Set up the workspace

In [None]:
run_on = "Local" # "Local" or "Colab"
# Local assumes that you have cloned the full GitHub repository to your local machine
# Colab assumes that you are running this notebook on Google Colab

In [None]:
# Import general libraries

import pandas as pd
pd.set_option('display.max_colwidth', 100)

# Set up workspace

if run_on == "Local":
    # Set up pytesseract
    import os
    import pytesseract
    os.chdir("..")
    # This should be the path of the tesseract installation
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    import text_analysis.data_functions as ta
    
elif run_on == "Colab":
    # Setup Google Drive mount
    from google.colab import drive
    drive.mount('/content/drive')

    # Install required programs and packages
    !sudo apt install tesseract-ocr
    !pip install pytesseract
    !pip install pdf2image
    !pip install tomotopy
    !python -m spacy download en_core_web_md

    # Import the data functions from the GitHub repository
    !git clone https://github.com/arielsaffer/pest_text_pipeline.git
    import pest_text_pipeline.text_analysis.data_functions as ta
    

In [None]:
# Provide the location of the PDF file to be processed
# Either a local path, relative path (to the repository root), 
# or a Google Drive path (typically starts with "/content/drive/My Drive/")
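data_dir = r"data"
# Use a forward slash: it works on Windows, macOS, Linux, and Colab, whereas
# a backslash before "D" is read as an invalid escape sequence
pdf_path = f"{data_dir}/DowleyBook6.19.24.pdf"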
# Provide the language of the text
language = 'eng'
# Determine how the document should be subdivided ("page", "paragraph", or "sentence")
document_level = "paragraph"

## 1. Extract text from PDF to produce a corpus of documents

In [None]:
text_corpus = ta.pdf_to_corpus(pdf_path=pdf_path, lang=language, document_level=document_level)
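
To make the `document_level` options concrete, here is a minimal sketch of how extracted page text might be subdivided. `split_into_documents` is a hypothetical helper written for illustration; it is not the implementation of `ta.pdf_to_corpus`, and the blank-line and punctuation rules it uses are simplifying assumptions.

```python
import re


def split_into_documents(pages, document_level="paragraph"):
    """Split a list of page strings into smaller documents (hypothetical helper)."""
    if document_level not in ("page", "paragraph", "sentence"):
        raise ValueError(f"Unknown document_level: {document_level}")

    if document_level == "page":
        return [page.strip() for page in pages if page.strip()]

    documents = []
    for page in pages:
        # Assumption: paragraphs are separated by one or more blank lines
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]
        if document_level == "paragraph":
            documents.extend(paragraphs)
        else:
            for paragraph in paragraphs:
                # Join wrapped lines, then split after ., ! or ? followed by whitespace
                flat = " ".join(paragraph.split())
                documents.extend(s for s in re.split(r"(?<=[.!?])\s+", flat) if s)
    return documents


pages = ["The crop failed. Prices rose.\n\nRelief was slow\nto arrive.", "A second page."]
print(split_into_documents(pages, "paragraph"))
print(split_into_documents(pages, "sentence"))
```

Real OCR output rarely has such clean paragraph breaks, so the choice of level is worth checking against a few pages of the actual corpus.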

In [None]:
# Look at the result

text_corpus

In [None]:
# Because this is a scanned text, you may need to do some 
# additional cleaning.

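# For example, I noticed that " | " appears instead of " I " in the text.
# regex=False keeps "|" literal (as a regex it would mean "or").
# The column is named "Text" here to match the saved corpus used in section 3.

text_corpus["Text"] = text_corpus["Text"].str.replace(" | ", " I ", regex=False)

# And " O ", " OQ ", and " QO " appear, probably where there were marks on the page.
# Replace them with a single space so the neighbouring words are not joined.

text_corpus["Text"] = text_corpus["Text"].str.replace(" OQ ", " ", regex=False)
text_corpus["Text"] = text_corpus["Text"].str.replace(" O ", " ", regex=False)
text_corpus["Text"] = text_corpus["Text"].str.replace(" QO ", " ", regex=False)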
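
The same kind of clean-up can also be expressed with regular expressions, which makes it easy to replace stray marks with a single space so that neighbouring words stay separated. A minimal standard-library sketch; `clean_ocr_text` is a hypothetical helper, and the list of stray tokens is simply the one noted above for this particular scan.

```python
import re

# Stray tokens seen in this scan (assumption: they never occur as real words here)
STRAY_TOKENS = re.compile(r"\s+(?:OQ|QO|O)(?=\s)")


def clean_ocr_text(text):
    """Fix a few common OCR artifacts in one string (hypothetical helper)."""
    # A vertical bar standing alone between spaces is treated as a misread "I"
    text = re.sub(r"(?<=\s)\|(?=\s)", "I", text)
    # Drop stray tokens; the whitespace after each one is kept as the separator
    text = STRAY_TOKENS.sub("", text)
    # Collapse any runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


print(clean_ocr_text("When | arrived OQ the crop O was lost"))
```

A function like this could be applied to the corpus text column with `Series.map`.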


In [None]:
# Save the text corpus to a CSV file

text_corpus.to_csv(f"{pdf_path[:-4]}_{document_level}.csv", index=False)
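
Slicing with `[:-4]` assumes a three-letter file extension. A slightly more robust way to derive the output name is `pathlib`, sketched below; `corpus_csv_path` is a hypothetical helper name, not a function from the repository.

```python
from pathlib import Path


def corpus_csv_path(pdf_path, document_level):
    """Return '<pdf name without extension>_<document_level>.csv' next to the PDF."""
    pdf_path = Path(pdf_path)
    # .stem drops only the final suffix, so dots inside the file name are preserved
    return pdf_path.with_name(f"{pdf_path.stem}_{document_level}.csv")


print(corpus_csv_path("data/DowleyBook6.19.24.pdf", "paragraph"))
```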

## 2. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)

#### LDA

In [None]:
topic_table = ta.text_to_topics(text_corpus, iterations=10, num_topics=20)
# 10 iterations here just for testing; this should be higher in practice

In [None]:
# Look at the topics
topic_table


#### Keyword search

In [None]:
# Define keywords you are interested in using to search for relevant text

hunger_keywords = ["famine", "hunger", "hungry", "shortage", "starv"]

In [None]:
# Search for the keywords

ta.keyword_search(text_data=text_corpus, keywords=hunger_keywords)
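
For readers without the repository at hand, the idea behind a regex keyword search can be sketched as follows. This is an illustration, not the implementation of `ta.keyword_search`; `find_keyword_matches` is a hypothetical helper, and matching keywords as case-insensitive word prefixes (so that "starv" catches "starving" and "starvation") is an assumption.

```python
import re


def find_keyword_matches(documents, keywords):
    """Return (index, document) pairs whose text contains any keyword as a word prefix."""
    if not keywords:
        return []
    # \b anchors each keyword to the start of a word; re.escape guards special characters
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")",
        flags=re.IGNORECASE,
    )
    return [(i, doc) for i, doc in enumerate(documents) if pattern.search(doc)]


documents = [
    "Famine was reported in the western counties.",
    "The meeting discussed drainage improvements.",
    "Many labourers were starving by winter.",
]
hunger_keywords = ["famine", "hunger", "hungry", "shortage", "starv"]
for index, doc in find_keyword_matches(documents, hunger_keywords):
    print(index, doc)
```

Prefix matching is deliberately generous; stems such as "starv" trade some precision for recall.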

## 3. Machine learning to select for topics about presence

In [3]:
# Since I don't have labeled data, here I am just going to treat all documents that contain
# "disease report keywords" as positives.
# These are very imperfect! (e.g., "report", "found", and "present" have many common uses)
# They could be refined or, ideally, a small sample of documents should be labeled manually

disease_report_keywords = ["report", "found",  "suffer",
                           "loss", "present", "disease", 
                           ]

# Define the metric that will be used for model selection
# Options: "accuracy", "precision", "recall", "fscore"
selection_metric = "fscore"

# Create the training dataframe

text_corpus = pd.read_csv(f"{pdf_path[:-4]}_{document_level}.csv")

# Add a Label column

text_corpus["Label"] = 0

# Set the label to 1 if any of the keywords are in the text

positive_locs = ta.keyword_search(
    text_data=text_corpus["Text"], keywords=disease_report_keywords
    ).index

text_corpus.loc[positive_locs, "Label"] = 1

# Take a stratified sample of 20% of the data as our "labeled data"
# (random_state makes the sample reproducible; 40 matches the value used for model testing)

labeled_data = (
    text_corpus.groupby("Label")
    .sample(frac=0.2, random_state=40)
    .reset_index(drop=True)
)
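
"Stratified" here means that the same fraction is drawn from each label group, so the share of positives in the sample matches the full corpus. A standard-library sketch of that idea; `stratified_sample` is a hypothetical helper and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(labels, frac, seed=0):
    """Return sorted indices of a sample that takes `frac` of each label group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    sampled = []
    for indices in groups.values():
        n = round(len(indices) * frac)
        sampled.extend(rng.sample(indices, n))
    return sorted(sampled)


# Toy corpus: 80 negatives and 20 positives
labels = [0] * 80 + [1] * 20
sample = stratified_sample(labels, frac=0.2, seed=40)
print(len(sample), sum(labels[i] for i in sample))
```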

In [4]:
# Test several models for classification 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier

# Define the models to test

models = [
    LinearSVC(),
    LogisticRegression(),
    ComplementNB(),
    DecisionTreeClassifier()
]

# Define the vectorizer

vectorizer = TfidfVectorizer(stop_words='english', min_df=0.001, ngram_range=(1, 2))
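
To see what `ngram_range=(1, 2)` means in practice, the sketch below builds unigram and bigram features by hand. It mimics, in simplified form, how `TfidfVectorizer` tokenizes by default (lowercasing, tokens of two or more word characters, stop words removed before n-grams are formed). `STOP_WORDS` is a tiny illustrative subset, not scikit-learn's full English list, and `word_ngrams` is a hypothetical helper.

```python
import re

STOP_WORDS = {"the", "was", "in", "of", "a", "and"}  # illustrative subset only


def word_ngrams(text, ngram_range=(1, 2)):
    """Return the word n-grams of `text` for every n in the inclusive range."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(word_ngrams("The potato disease was reported in Sligo"))
```

A float `min_df` such as `0.001` is a proportion: a term is kept only if it appears in at least 0.1% of the documents.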

In [5]:
# Test the models

model_testing_df = ta.test_multiple_models(
    X = labeled_data["Text"], 
    y = labeled_data["Label"], 
    models = models, 
    vectorizer = vectorizer, 
    k = 10, 
    random_state = 40
    )

# Look at the results

model_testing_df

|   | model | accuracy | accuracy_sd | precision | precision_sd | recall | recall_sd | fscore | fscore_sd |
|---|---|---|---|---|---|---|---|---|---|
| 3 | DecisionTreeClassifier() | 0.926371 | 0.018472 | 0.816093 | 0.075313 | 0.799078 | 0.077066 | 0.803565 | 0.054994 |
| 0 | LinearSVC() | 0.911267 | 0.027993 | 0.91652 | 0.075365 | 0.599697 | 0.093675 | 0.720664 | 0.08273 |
| 1 | ComplementNB() | 0.818916 | 0.031248 | 0.530066 | 0.085706 | 0.579584 | 0.082907 | 0.549784 | 0.068503 |
| 2 | LogisticRegression() | 0.839996 | 0.039545 | 0.9275 | 0.149353 | 0.19278 | 0.096581 | 0.309408 | 0.140848 |
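
The gap between precision and recall in these results (for example, `LogisticRegression()` is rarely wrong when it predicts a positive but misses most positives) is easier to read with the definitions at hand. A minimal sketch of the three metrics computed from true and predicted labels; the helper name and the toy labels are illustrative, not data from this corpus.

```python
def precision_recall_fscore(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fscore


# A cautious classifier: both positive predictions are right, but it finds only 2 of 8 positives
y_true = [1] * 8 + [0] * 12
y_pred = [1, 1] + [0] * 6 + [0] * 12
print(precision_recall_fscore(y_true, y_pred))
```

If, as the `_sd` columns suggest, each metric is averaged over the `k` folds, the mean F-score need not equal the F-score computed from the mean precision and mean recall.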


In [7]:
# Apply the best model to the full text corpus

best_model = model_testing_df.loc[model_testing_df[selection_metric].idxmax(), "model"]

# Train the best model on the full labeled data

best_model.fit(X = vectorizer.fit_transform(labeled_data["Text"]), y = labeled_data["Label"])

# Predict the labels for the full text corpus

text_corpus["Predicted_Label"] = best_model.predict(vectorizer.transform(text_corpus["Text"]))

In [9]:
# Show the positive results

text_corpus.loc[text_corpus["Predicted_Label"] == 1]

|   | Text | Label | Predicted_Label |
|---|---|---|---|
| 0 | THE FARMERS GAZETTE References to THE FAMINE... | 0 | 1 |
| 3 | At home, the Gazette regularly reported the p... | 1 | 1 |
| 9 | The frequent references to failures and diseas... | 1 | 1 |
| 10 | References to Potato Diseases Prior to Late B... | 1 | 1 |
| 11 | In the potato, the main diseases prior to the... | 1 | 1 |
| ... | ... | ... | ... |
| 8275 | The Secretary read a number of letters, from ... | 1 | 1 |
| 8276 | A letter was also read from Professor Johnsto... | 1 | 1 |
| 8279 | A communication was also read from the Lord L... | 1 | 1 |
| 8281 | Also from William Phibbs, Esc., Seafield, Slig... | 1 | 1 |


## 4. Geoparsing to extract locations from presence records

## 5. Visualize results