<section id="title-slide">
  <h1 class="title">The ABC of Computational Text Analysis</h1>
  <h2 class="subtitle">#7 Working with (your own) Data</h2>
  <p class="author">Alex Flückiger</p><p class="date">11 April 2024</p>
</section>

## Game Plan for today's coding
Extend the Python basics before extracting text from PDFs!

## Update the course material
1. Navigate to the course folder using `cd` in your command line
2. Update the files with `git pull`
3. If `git pull` fails due to file conflicts, run `git restore .` first (note: this discards local changes you have not committed)

## Getting started 
1. Open VS Code
2. Windows: Make sure that you are connected to WSL (green badge in the lower-left corner)
3. Open the `KED2024` folder via the menu: `File` > `Open Folder`
4. Navigate to `KED2024/ked2024/materials/code/KED2024_07.ipynb` and open it with a double-click
5. Run the code with `Run all` via the top menu

## Best Practices
- Check the values of variables in the `Variable Explorer`
- Use `tab` for auto-completion

## Working with texts
Texts are represented as strings of any length.


In [None]:
sentence_1 = "I love NLP and social science."
sentence_2 = "Computational Social Science applies NLP to social questions."

text = sentence_1 + " " + sentence_2
text

## Modify text

In [None]:
# replace `.` with `!`
text_exclaimed = text.replace(".", "!")

# change text to lowercase letters
text_lowercased = text.lower()

# split text at spaces, yields words as a list
words = text.split(" ")
words
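
Strings are immutable: each method returns a new string and leaves the original unchanged, so the individual steps can also be chained. A minimal sketch with a made-up sentence (`example` and `tokens` are illustrative names, not part of the course material):

```python
example = "I love NLP. NLP is fun."

# each method returns a new string, so calls can be chained left to right
tokens = example.replace(".", "!").lower().split(" ")

print(tokens)
print(example)  # the original string is unchanged
```

Note that splitting at spaces keeps punctuation attached to the words (`nlp!`, `fun!`).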


## Count words

In [None]:
from collections import Counter

# initialize a counter object
counter = Counter()

# split the text and pass all elements (~words) to the counter
counter.update(text.split(" "))

# get the three most common words
counter.most_common(3)


## Read from a textfile

In [None]:
from pathlib import Path

# define the path to the file
infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")

# read the file
text = infile.read_text()

# show first 100 characters of file 
print(text[0:100])


## Write into a textfile

In [None]:
import re

# lowercase the text
text = text.lower()

# replace repeated newlines with a single newline
text = re.sub(r"\n+", "\n", text)

# define the path to the output file
outfile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019_lowercased.txt")

# write content to file
with outfile.open("w") as f:
    f.write(text)
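
When no encoding is given, reading and writing can fall back to a platform-dependent default. German party programmes contain umlauts, so it can help to name the encoding explicitly. A sketch, assuming the files are UTF-8 encoded; the temporary folder and the file name `demo.txt` are placeholders for illustration only:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    demo_file = Path(tmpdir) / "demo.txt"

    # write and read with an explicit encoding (assumption: UTF-8)
    demo_file.write_text("Für eine grüne Zukunft\n", encoding="utf-8")
    restored = demo_file.read_text(encoding="utf-8")

print(restored)
```

The same `encoding="utf-8"` argument is accepted by `outfile.open("w", encoding="utf-8")`.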


## Counting words in a textfile

In [None]:
import re
from collections import Counter
from pathlib import Path

infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")
text = infile.read_text()

# lowercase all text
text = text.lower()

# extract alphanumeric words without punctuation
words = re.findall(r"\w+", text)

# count words
vocab = Counter(words)

# write to file, one word and its frequency per line
outfile = Path("../analysis/gruene_programme_vocab_frq.tsv")
with outfile.open("w") as f:
    for word, frq in vocab.most_common():
        line = f"{word}\t{frq}\n"
        f.write(line)

vocab.most_common(5)
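
Absolute counts depend on the length of a document. Dividing each count by the total number of tokens gives relative frequencies that are easier to compare across texts. A minimal sketch with a made-up sentence instead of the party programme (all names starting with `sample_`, as well as `total` and `rel_freq`, are illustrative):

```python
import re
from collections import Counter

sample_text = "the green party wants a green economy and a green future"

# tokenize and count as before
sample_words = re.findall(r"\w+", sample_text.lower())
sample_vocab = Counter(sample_words)

# total number of tokens in the text
total = sum(sample_vocab.values())

# relative frequency = count of a word / total number of tokens
rel_freq = {word: count / total for word, count in sample_vocab.most_common()}

# show the three most frequent words with their share
for word, share in list(rel_freq.items())[:3]:
    print(f"{word}\t{share:.3f}")
```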


## PDF: Digitized or digital?

### Two flavours of PDF documents
![Digitized PDF made from a scanned page](../../lectures/images/pdf_scan.png)
![Native PDF converted from digital document (e.g., docx)](../../lectures/images/pdf_digital.png)

## Conversion of a single native PDF

### Use case: [Swiss party programmes](https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_dataset/)


In [None]:
import re
from pathlib import Path

from pypdf import PdfReader

pdf_path = Path("../data/swiss_party_programmes/pdf/gruene_programmes/gruene_programme_2019.pdf")

# set up PDF reader
reader = PdfReader(pdf_path)

text = ""

# iterate over pages
for page in reader.pages:
    text_page = page.extract_text()
    
    # clean up repeated empty lines
    text_page = re.sub(r"\n\s*\n", "\n", text_page)

    # add text of page to text of document
    text += " " + text_page

print(text[:500])


## Optical Character Recognition (OCR)


- OCR ~ convert images into text
  - extract text from scans/images
- `tesseract` performs OCR
  - language-specific models
  - supports handwriting + Fraktur texts
- image quality is crucial

<img src="../../lectures/images/ocr.png" alt="Steps when performing OCR" style="width: 500px;"/>

## Conversion of a single digitized PDF

### Use case: [historical party programmes](https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_dataset/)

1. extract image from PDF
2. run optical character recognition (OCR) on the image

In [1]:
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path

# path to PDF file
pdf_path = Path("../data/scanned_pdf_sample/fdp_scan_party_programme_1947.pdf")

# convert PDF to images (one image per page)
pages = convert_from_path(pdf_path, fmt="png")

# initialize text to collect the text per page
text = ""

# iterate over the page images
for page_image in pages:
    # extract text from image per page (German language model)
    text_page = pytesseract.image_to_string(page_image, lang="deu")

    # append text for each page
    text += " " + text_page

print(text[:100])

## Extract the text from all PDFs in a folder

In [None]:
# path to PDF directory
indir = Path("../data/scanned_pdf_sample/")
outdir = Path("../data/scanned_pdf_sample/extracted")

# create output folder if it does not exist
outdir.mkdir(parents=True, exist_ok=True)

# iterate over all PDFs in input folder
for infile in indir.glob("*.pdf"):
    print(f"Reading PDF file: {infile}")
    
    pages = convert_from_path(infile, fmt="png")
    text = ""
    
    for page_image in pages:
        text_page = pytesseract.image_to_string(page_image, lang="deu")
        text += " " + text_page

    # define name of outfile (name.pdf -> name.txt)
    outfile = outdir / (infile.stem + ".txt")

    # write content to file
    with outfile.open("w") as f:
        f.write(text)
    
    print(f"Extracted text to: {outfile}")
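
OCR can be slow, so re-running the loop over a large folder may take a while. One option is to skip PDFs whose text file already exists. A sketch of that check using only `pathlib`; the temporary folder and the empty placeholder files stand in for real PDFs, and the OCR step itself is left out:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    demo_indir = Path(tmpdir)
    demo_outdir = demo_indir / "extracted"
    demo_outdir.mkdir(parents=True, exist_ok=True)

    # empty placeholder files instead of real PDFs (hypothetical names)
    (demo_indir / "a.pdf").touch()
    (demo_indir / "b.pdf").touch()

    # pretend that a.pdf was already processed in an earlier run
    (demo_outdir / "a.txt").touch()

    todo = []
    # sorted() makes the processing order reproducible
    for demo_infile in sorted(demo_indir.glob("*.pdf")):
        demo_outfile = demo_outdir / (demo_infile.stem + ".txt")
        if demo_outfile.exists():
            continue  # skip PDFs that were already extracted
        todo.append(demo_infile.name)

print(todo)
```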

## Bonus: Clean up artifacts

- remove empty lines
- remove page numbers
- remove footer
- merge hyphenated words
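
The clean-up steps listed above can be sketched with regular expressions. The patterns are assumptions about what the artifacts look like (page numbers alone on a line, a footer starting with a known phrase, words split by a hyphen at the end of a line) and need adjusting to the actual documents; the sample text and its footer are made up.

```python
import re

raw = """Die Partei setzt sich für eine nachhal-
tige Wirtschaft ein.


12

Parteiprogramm 1947 - Seite 12
Wir fordern gerechte Löhne.
"""

clean = raw

# remove footer lines (hypothetical footer starting with "Parteiprogramm 1947")
clean = re.sub(r"^Parteiprogramm 1947.*\n", "", clean, flags=re.MULTILINE)

# remove page numbers, i.e. lines that consist of digits only
clean = re.sub(r"^[ \t]*\d+[ \t]*\n", "", clean, flags=re.MULTILINE)

# merge words hyphenated across a line break (nachhal-\ntige -> nachhaltige)
clean = re.sub(r"(\w)-\n(\w)", r"\1\2", clean)

# collapse empty lines into a single newline
clean = re.sub(r"\n\s*\n", "\n", clean)

print(clean)
```

The hyphen rule also joins genuine compounds that happen to break at a line end (e.g. `Nord-` / `Süd`), so it is worth checking a sample of the output.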

## Remove parts across lines

In [None]:
# Remove multiple lines in a string using regular expressions

import re

text = """
This is an example Text.

YOUR_PATTERN REMOVE THIS
whatever is written here
UNTIL HERE.

Keep this and the following.
"""

# remove a multiline string by substituting the match with an empty string
# re.DOTALL makes `.` also match the newline character \n
# `.*?` is non-greedy and stops at the first "UNTIL HERE."; `\.` matches a literal dot
text_clean = re.sub(r"YOUR_PATTERN.*?UNTIL HERE\.", "", text, flags=re.DOTALL)

print(text_clean)
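
Whether the quantifier is greedy matters as soon as the end marker occurs more than once in a document. A minimal sketch with a made-up text (`START` and `END` are placeholder markers):

```python
import re

demo = "keep A\nSTART\ndrop 1\nEND\nkeep B\nSTART\ndrop 2\nEND\nkeep C\n"

# greedy: `.*` runs to the LAST "END", so "keep B" is removed as well
greedy = re.sub(r"START.*END", "", demo, flags=re.DOTALL)

# non-greedy: `.*?` stops at the FIRST "END" after each "START"
lazy = re.sub(r"START.*?END", "", demo, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```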

## In-class: Exercises I

1. Go to [swissinfo.ch](https://www.swissinfo.ch), copy the content of a random article, and save it as a `.txt` file.
2. Read this file with Python, count its vocabulary, and write all the word counts into a `.tsv` file.
3. Open the `.tsv` file in a spreadsheet program and compute the relative frequency of each word.