<section id="title-slide">
  <h1 class="title">The ABC of Computational Text Analysis</h1>
  <h2 class="subtitle">#7 Working with (your own) Data</h2>
  <p class="author">Alex Flückiger</p><p class="date">11/25 May 2023</p>
</section>

## Working with Texts
Texts are represented as strings of any length.


In [1]:
sentence_1 = "I love NLP and social science"
sentence_2 = "Computational Social Science applies NLP to question of social questions."

text = sentence_1 + " " + sentence_2
text

'I love NLP and social science Computational Social Science applies NLP to question of social questions.'

## Text Modifications

In [2]:
# replace `.` with `!`
text.replace(".", "!")

# change text to lowercased letters
text.lower()

# split text at space (~words)
text.split(" ")


['I',
 'love',
 'NLP',
 'and',
 'social',
 'science',
 'Computational',
 'Social',
 'Science',
 'applies',
 'NLP',
 'to',
 'question',
 'of',
 'social',
 'questions.']
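
The cell above only displays the result of its last expression, and none of the calls change `text` itself. As a minimal sketch (the shortened example sentence is an assumption for illustration), the following shows that string methods return new strings and that they can be chained:

```python
text = "I love NLP and social science."

# String methods return a new string; the original is left untouched.
shouted = text.replace(".", "!")

# Methods can be chained: drop the full stop, lowercase, then split at spaces.
tokens = text.replace(".", "").lower().split(" ")

print(text)     # the original string is unchanged
print(shouted)
print(tokens)
```

To keep a modification, assign the result to a variable, as done with `shouted` and `tokens` here.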

## Count Words

In [3]:
from collections import Counter

# initialize a counter object
counter = Counter()

# split the text and pass all elements (~words) to the counter
counter.update(text.split(" "))

# get the three most common words
counter.most_common(3)


[('NLP', 2), ('social', 2), ('I', 1)]
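
Splitting at spaces treats `social`, `Social` and `questions.` as different items. A small sketch of how lowercasing and stripping the full stop before counting changes the result; it reuses the example sentences from above, and the variable names are chosen for illustration only:

```python
from collections import Counter

text = (
    "I love NLP and social science "
    "Computational Social Science applies NLP to question of social questions."
)

# count the raw tokens, case and punctuation included
raw_counts = Counter(text.split(" "))

# normalize first: lowercase and drop the full stop, then count
normalized_tokens = text.lower().replace(".", "").split(" ")
normalized_counts = Counter(normalized_tokens)

print(raw_counts["social"], raw_counts["Social"])
print(normalized_counts["social"])
print(normalized_counts.most_common(3))
```

Whether such normalization is desirable depends on the research question; for proper nouns or abbreviations, case can carry meaning.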

## Read from a Textfile

In [11]:
from pathlib import Path

infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")

text = infile.read_text()

# show first 100 characters of file 
print(text[0:100])


IMPRESSUM
GRÜNE Schweiz
Waisenhausplatz 21
3011 Bern
Tel. 031 326 66 00
www.gruene.ch
gruene@gruene.
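
`read_text()` decodes the file with the platform's default encoding unless told otherwise, which can garble umlauts on some systems. The sketch below assumes the file is stored as UTF-8 and uses a temporary file with a hypothetical name so that it runs without the course data:

```python
import tempfile
from pathlib import Path

sample = "GRÜNE Schweiz\nWaisenhausplatz 21\n3011 Bern\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "sample.txt"  # hypothetical file name

    # state the encoding explicitly when writing and reading
    path.write_text(sample, encoding="utf-8")
    text = path.read_text(encoding="utf-8")

# split the content into lines without the trailing newline characters
first_line = text.splitlines()[0]

print(text[:13])
print(first_line)
```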


## Write into a Textfile

In [13]:
import re

# lowercase the text
text = text.lower()

# replace repeated newlines with a single newline
text = re.sub(r"\n+", "\n", text)

# define the output path
outfile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019_lowercased.txt")

# write content to file
with outfile.open("w") as f:
    f.write(text)
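
The pattern `r"\n+"` only merges newlines that directly follow each other. As a hedged illustration on a made-up string, the variant `r"\n\s*\n"` additionally removes lines that contain nothing but whitespace:

```python
import re

text = "impressum\n\n\ngrüne schweiz\n \nwaisenhausplatz 21\n"

# merges directly adjacent newlines; the line holding a single space survives
collapsed = re.sub(r"\n+", "\n", text)

# also removes lines that consist only of whitespace
collapsed_ws = re.sub(r"\n\s*\n", "\n", text)

print(repr(collapsed))
print(repr(collapsed_ws))
```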


## Counting Words in a Textfile

In [8]:
from pathlib import Path
from collections import Counter
import re

infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")
text = infile.read_text()

# lowercase all text
text = text.lower()

# extract alphanumeric words without punctuation
words = re.findall(r"\w+", text)

# count words
vocab = Counter(words)

# write to file, one word and its frequency per line
outfile = Path("../analysis/vocab_frq_test.tsv")
with outfile.open("w") as f:
    for word, frq in vocab.most_common():
        line = f"{word}\t{frq}\n"
        f.write(line)

vocab.most_common(5)


[('und', 595), ('die', 517), ('der', 394), ('für', 217), ('in', 161)]
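
A tab-separated file like the one written above can be read back for later analysis. The following sketch uses the standard `csv` module and a temporary file instead of `../analysis/vocab_frq_test.tsv`; the three counts are copied from the output above, and the file name is hypothetical:

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

vocab = Counter({"und": 595, "die": 517, "der": 394})

with tempfile.TemporaryDirectory() as tmpdir:
    tsv_path = Path(tmpdir) / "vocab_frq.tsv"  # hypothetical file name

    # write one word and its frequency per line, separated by a tab
    with tsv_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(vocab.most_common())

    # read the rows back and convert the frequency to an integer
    with tsv_path.open(encoding="utf-8", newline="") as f:
        rows = [(word, int(frq)) for word, frq in csv.reader(f, delimiter="\t")]

print(rows)
```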

## PDF: Digitalized or Digital?

### Two flavours of PDF documents
![Digitalized PDF made from a scanned page](../../lectures/images/pdf_scan.png)
![Native PDF converted from digital document (e.g., docx)](../../lectures/images/pdf_digital.png)

## Conversion of a Single Native PDF

### Use case: [Swiss party programmes](https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_dataset/)


In [7]:
from pypdf import PdfReader

# path to PDF file
pdf_path = Path("../data/swiss_party_programmes/pdf/gruene_programmes/gruene_programme_2019.pdf")

text = ""
reader = PdfReader(pdf_path)

for page in reader.pages:
    text_page = page.extract_text()
    
    # clean up repeated empty lines within the current page
    text_page = re.sub(r"\n\s*\n", "\n", text_page)
    text += " " + text_page

print(text[:500])


  
 Wahlplattform der GRÜNEN Schweiz 2019  – 2023   ii  
IMPRESSUM  
GRÜNE  Schweiz  
Waisenhausplatz 21  
3011 Bern  
Tel. 031 326 66 00  
www.gruene.ch  
gruene@gruene.ch  
Postkonto 80-26747 -3 
Wahlplattform 201 9 – 2023  
Beschlossen an der  
Delegiertenversammlung vom 12. Januar 2019 in Emmen,  
ergänzt durch den Resolutionsbeschluss der  
Delegiertenversammlung vom 6. April 2019 in Sierre.  
   Wahlplattform der GRÜNEN Schweiz 2019  – 2023   iii INHALTSVERZEICHNIS  
GENDERNEUTRALE SPRACHE


## Optical Character Recognition (OCR)


- OCR ~ convert images into text
  - extract text from scans/images
- `tesseract` performs OCR
  - language-specific models
  - supports Fraktur texts; handwriting only to a limited degree
- image quality is crucial

<img src="../../lectures/images/ocr.png" alt="Steps when performing OCR" style="width: 500px;"/>

## Conversion of a single digitalized PDF

### Use case: [historical party programmes](https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_dataset/)

1. convert each page of the PDF into an image
2. run optical character recognition (OCR) on each image

In [8]:
import pytesseract
from pdf2image import convert_from_path

# path to PDF file
pdf_path = Path("../data/scanned_pdf_sample/fdp_scan_party_programme_1947.pdf")

# convert PDF to images (one image per page)
pages = convert_from_path(pdf_path, fmt="png")

# initialize text to collect the text per page
text = ""

# iterate over pages
for page_image in pages:
    # extract text from image per page
    text_page = pytesseract.image_to_string(page_image, lang="deu")

    # append text for each page
    text += " " + text_page

print(text[:100])

 G3 420 47

FREIHEIT
FÜR UNSERE ZEIT

Vorschläge zur Orientierung
der freisinnigen Politik

nach dem


## Extract the text from all PDFs in a folder

In [9]:
# path to PDF directory
indir = Path("../data/scanned_pdf_sample/")
outdir = Path("../data/scanned_pdf_sample/extracted")

# create output folder if it does not exist
outdir.mkdir(parents=True, exist_ok=True)

# iterate over all PDFs
for infile in indir.glob(pattern="*.pdf"):
    print(f"Reading PDF file: {infile}")
    
    pages = convert_from_path(infile, fmt="png")
    text = ""
    
    for page_image in pages:
        text_page = pytesseract.image_to_string(page_image, lang="deu")
        text += " " + text_page

    # define name of outfile (name.pdf -> name.txt)
    outfile = outdir / (infile.stem + ".txt")

    # write content to file
    with outfile.open("w") as f:
        f.write(text)
    
    print(f"Extracted text to: {outfile}")

Reading PDF file: ../data/scanned_pdf_sample/fdp_scan_party_programme_1947.pdf
Extracted text to: ../data/scanned_pdf_sample/extracted/fdp_scan_party_programme_1947.txt
Reading PDF file: ../data/scanned_pdf_sample/sp_scan_party_programme_1947.pdf
Extracted text to: ../data/scanned_pdf_sample/extracted/sp_scan_party_programme_1947.txt
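
The file handling of the loop above can be tried out without running OCR. This sketch creates empty placeholder files with hypothetical names in a temporary folder and shows how `glob` selects only the PDFs and how each output name is derived; `sorted()` is added here to make the processing order reproducible:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    indir = Path(tmpdir)
    outdir = indir / "extracted"

    # create output folder if it does not exist
    outdir.mkdir(parents=True, exist_ok=True)

    # empty placeholder files (hypothetical names)
    for name in ["b_programme.pdf", "a_programme.pdf", "notes.txt"]:
        (indir / name).touch()

    mapping = {}
    for infile in sorted(indir.glob("*.pdf")):
        # name.pdf -> name.txt inside the output folder
        outfile = outdir / infile.with_suffix(".txt").name
        mapping[infile.name] = outfile.name

print(mapping)
```

`infile.with_suffix(".txt").name` yields the same file name as `infile.stem + ".txt"` in the cell above.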


## Bonus: Clean up Artefacts

- remove empty lines
- remove page numbers
- remove footer
- merge hyphenated words

In [5]:
# Remove multiple lines in a string using regular expressions

import re

text = """
This is an example Text.

YOUR_PATTERN REMOVE THIS
whatever is written here
UNTIL HERE.

Keep this and the following.
"""

# remove a multiline span by substituting the match with an empty string
# re.DOTALL makes . also match the newline character \n
# .*? is non-greedy, so the match ends at the first "UNTIL HERE."
text_clean = re.sub(r"YOUR_PATTERN.*?UNTIL HERE\.", "", text, flags=re.DOTALL)

print(text_clean)


This is an example Text.



Keep this and the following.
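
The four clean-up steps listed at the top of this section could be sketched as follows. The sample text, the footer line `Wahlplattform 2019` and all patterns are assumptions for illustration and need to be adapted to the actual documents. In particular, merging at every line-final hyphen would also join genuine compounds that happen to break at the hyphen:

```python
import re

text = """Die Schweiz braucht eine nach-
haltige Politik.

3

Wahlplattform 2019
Wir setzen uns dafür ein.
"""

# merge words hyphenated across a line break (only before a lowercase letter)
text = re.sub(r"(\w)-\n([a-zäöüß])", r"\1\2", text)

# remove lines that consist only of a page number
text = re.sub(r"^[ \t]*\d+[ \t]*\n", "", text, flags=re.MULTILINE)

# remove a running footer (hypothetical footer text)
text = re.sub(r"^Wahlplattform 2019[ \t]*\n", "", text, flags=re.MULTILINE)

# remove empty lines
text = re.sub(r"\n\s*\n", "\n", text)

print(text)
```

The order matters in this sketch: removing page numbers and footers first leaves empty lines behind, which the last step then cleans up.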

