# Workflow for text analysis from PDF

This notebook consolidates the workflows from several earlier projects into a single workflow that can be applied to a range of texts, including longer-form documents extracted from PDF.

## Steps

1. Extract text from PDF.
2. Determine how documents are divided (e.g., chapter, page, paragraph, sentence).
3. Apply exploratory text analysis: topic modeling (LDA) and keyword search (regex).
4. Apply machine learning to select topics about presence.
5. Apply geoparsing to extract locations from presence records.

In [3]:
# Import general libraries

import numpy as np
import pandas as pd
import os
import sys
import unicodedata

pd.set_option('display.max_colwidth', 100)

In [None]:
# Move up to the parent directory

os.chdir("..")

### 1. Extract text from PDF (scans)

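Scanned PDFs contain page images rather than embedded text, so this step requires optical character recognition (OCR), which itself involves several sub-steps.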
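OCR itself depends on external tools that are not covered here, but the cleanup that typically follows it can be sketched with the standard library. The function below is a minimal sketch, not this project's actual code: `clean_ocr_text` is a hypothetical name, and the rules it applies (Unicode NFKC normalization, re-joining words hyphenated across line breaks, and collapsing whitespace) are assumptions about what scanned text usually needs.

```python
import re
import unicodedata


def clean_ocr_text(raw_text: str) -> str:
    """Normalize raw OCR output into plain, consistently spaced text.

    Hypothetical helper: the cleanup rules are illustrative assumptions.
    """
    # Fold compatibility characters (e.g. the "fi" ligature) into plain forms
    text = unicodedata.normalize("NFKC", raw_text)
    # Re-join words hyphenated across a line break: "infes-\ntans" -> "infestans"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn a single line break inside a paragraph into a space,
    # but keep blank lines, which mark paragraph boundaries
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

The de-hyphenation rule will also merge genuine compounds that happen to break at a line end, so the cleaned text may need spot-checking.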

### 2. Determine how documents are divided

Set the division to page, paragraph, or sentence.
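Choosing the division could be sketched as a single splitting function. This is a minimal sketch under stated assumptions, not the notebook's implementation: `split_text` is a hypothetical name, pages are assumed to be separated by form-feed characters (true for some extraction tools but not all), paragraphs by blank lines, and sentences by a naive punctuation rule.

```python
import re


def split_text(text: str, division: str = "paragraph") -> list[str]:
    """Split text into units at the chosen division level (hypothetical helper)."""
    if division == "page":
        # Assumption: the extractor marks page breaks with form feeds
        parts = text.split("\f")
    elif division == "paragraph":
        # Assumption: one or more blank lines separate paragraphs
        parts = re.split(r"\n\s*\n", text)
    elif division == "sentence":
        # Naive rule: split on whitespace that follows ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"Unknown division: {division!r}")
    # Strip surrounding whitespace and drop empty units
    return [part.strip() for part in parts if part.strip()]
```

The sentence rule will split after abbreviations such as "e.g.", so a dedicated sentence tokenizer may be preferable for real documents.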

In [None]:
# For now, use the Tweets as the data source
text_corpus = pd.read_csv("data/bw/bw_clean.csv", usecols=["Snippet", "Page Type", "Engagement Type", "Language"])

# Keep only original (non-retweet) Tweets in English
is_original_english_tweet = (
    (text_corpus["Language"] == "en")
    & (text_corpus["Page Type"] == "twitter")
    & (text_corpus["Engagement Type"] != "RETWEET")
)
# Keep just the Tweet text as a Series with a fresh index
text_corpus = text_corpus.loc[is_original_english_tweet, "Snippet"].reset_index(drop=True)

### 3. Apply exploratory text analysis: topic modeling (LDA), keyword search (regex)

#### LDA

In [3]:
from text_analysis.data_functions import text_to_topics, preprocess_text, model_topics, get_topics

In [13]:
topic_table = text_to_topics(text_corpus)

Text data preprocessed (tokenized, lowercased, stopwords removed).


Topic Model Training...


Iteration: 0	Log-likelihood: -6.304393730652546
Iteration: 1	Log-likelihood: -5.8098289345721135
Iteration: 2	Log-likelihood: -5.614641522053729
Iteration: 3	Log-likelihood: -5.513694763050845
Iteration: 4	Log-likelihood: -5.447391061854338
Iteration: 5	Log-likelihood: -5.408710166864365
Iteration: 6	Log-likelihood: -5.382227972967669
Iteration: 7	Log-likelihood: -5.361475161185057
Iteration: 8	Log-likelihood: -5.344765056319163
Iteration: 9	Log-likelihood: -5.339349165916599
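The log above reports that the text was tokenized, lowercased, and had stopwords removed before training. The source of `preprocess_text` is not shown here, so the following is only a minimal sketch of what such a step might look like: `simple_preprocess` and the tiny `STOPWORDS` set are hypothetical, and the real function may tokenize differently or use a much longer stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "on", "the", "to", "was"}


def simple_preprocess(document: str) -> list[str]:
    """Lowercase, tokenize, and remove stopwords from one document."""
    # Lowercase, then keep runs of word characters (letters, digits, underscore)
    tokens = re.findall(r"\w+", document.lower())
    # Drop tokens that appear in the stopword list
    return [token for token in tokens if token not in STOPWORDS]
```

Applied to every document in the corpus, this yields the lists of tokens that a topic model takes as input.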


In [15]:
# Look at the topics
topic_table


|  | Topic Number | Top Words |
|---|---|---|
| 0 | 0 | [potato, late, blight, control, gene, using, phytophthora, infestans, 3, cloning] |
| 1 | 1 | [late, blight, tomato, potato, confirmed, rt, plants, found, growers, disease] |
| 2 | 2 | [infestans, phytophthora, de, resistance, relation, bary, potatoes, potato, tubers, behaviour] |
| 3 | 3 | [late, potato, blight, disease, gene, model, caused, control, tomato, wins] |
| 4 | 4 | [infestans, phytophthora, potato, plants, content, relation, carbohydrate, reaction, tuber, susc... |
| 5 | 5 | [potato, late, blight, disease, serious, farmers, well, agriculture, education, studets] |
| 6 | 6 | [infestans, phytophthora, resistance, potatoes, dna, involvement, invasion, factors, field, spec... |
| 7 | 7 | [de, potato, areas, late, blight, infestans, epidemiology, producer, mont, la] |
| 8 | 8 | [potato, phytophthora, irish, infestans, famine, caused, blight, late, disease, rt] |
| 9 | 9 | [potato, late, blight, new, farmers, resistance, resistant, variety, disease, genes] |


#### Keyword search

In [6]:
from text_analysis.data_functions import keyword_search

In [7]:
keywords_of_interest = ["famine", "hunger", "hungry", "shortage", "starv"]

In [8]:
# Search for the keywords

keyword_search(text_corpus, keywords_of_interest)

|  | Text | Keywords Found |
|---|---|---|
| 10 | @steezthemeans I'm confused the potato famine in Ireland was caused by a fungus called Phytophth... | True |
| 23 | Jean Ristaino, a lead investigator for the @NC_PSI, found original famine-era plant samples and ... | True |
| 39 | Irish Potato Famine: Crop Failure Decimates Ireland's Population - As potato crops failed in 184... | True |
| 50 | Just learned about the Irish potato famine, it's crazy that potato blight caused that much of a ... | True |
| 58 | @Ken47188750 @FiatLuxGenesis I'm not saying Phytophthora infestans was made in a bio lab, but on... | True |
| ... | ... | ... |
| 16870 | RT @mcmgmd: Information on late blight from the USDA. This is the disease which caused the Irish... | True |
| 16872 | Information on late blight from the USDA. This is the disease which caused the Irish potato fami... | True |
| 16904 | The potato disease "late blight" was the principal cause of the Irish potato famine, which kille... | True |
| 16910 | Lumpers Variety very susceptible to late blight disease, dates back to the 1850's,cause of the p... | True |
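
The implementation of `keyword_search` is not shown in this notebook, so the following is only a minimal sketch of how a regex-based keyword search could work. The name `find_keyword_matches` is hypothetical, and case-insensitive substring matching is an assumption, chosen so that stems such as "starv" also match "starving" and "starvation".

```python
import re


def find_keyword_matches(texts, keywords):
    """Return (index, text) pairs for texts containing any keyword."""
    if not keywords:
        # An empty pattern would match every text, so return nothing instead
        return []
    # Escape each keyword and join them into one case-insensitive alternation
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Keep the original position of each matching text
    return [(i, text) for i, text in enumerate(texts) if pattern.search(text)]
```

Because this matches substrings without word boundaries, a short keyword can also match inside unrelated longer words, so the keyword list is worth reviewing against a sample of the results.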
