<section id="title-slide">
  <h1 class="title">The ABC of Computational Text Analysis</h1>
  <h2 class="subtitle">#7 Working with (your own) Data</h2>
  <p class="author">Alex Flückiger</p><p class="date">10 April 2025</p>
</section>

## Game Plan for today's coding
Extend the Python basics before extracting text from PDFs!

## Update the course material
1. Navigate to the course folder `KED2025` using `cd` in your command line
2. Update the files with `git pull`
3. If `git pull` doesn't work due to file conflicts, run `git restore .` first

## Getting started 
1. Open VS Code
2. Windows: Make sure that you are connected to WSL (blue badge in the lower-left corner)
3. Open the `KED2025` folder via the menu: `File` > `Open Folder`
4. Navigate to `KED2025/ked/materials/code/KED2025_07.ipynb` and open it with a double-click
5. Run the code with `Run all` via the top menu

## Best Practices
- Check the values of variables in the `Variable Explorer`
- Use `tab` for auto-completion

## Working with texts
Texts are represented as strings of any length.


In [36]:
sentence_1 = "I love NLP and social science."
sentence_2 = "Computational Social Science applies NLP to social questions."

text = sentence_1 + " " + sentence_2
text

'I love NLP and social science. Computational Social Science applies NLP to social questions.'
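
Because a text is just a string, the usual string operations apply to it. A minimal sketch of three of them (length, slicing, and membership test), using one of the sentences above; the variable names are only illustrative:

```python
sentence = "I love NLP and social science."

# number of characters, including spaces and punctuation
n_chars = len(sentence)

# slice: characters from position 0 up to (but not including) position 6
first_chars = sentence[0:6]

# check whether a substring occurs in the text (case-sensitive)
has_nlp = "NLP" in sentence

print(n_chars, first_chars, has_nlp)
```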

## Modify text

In [37]:
# replace `.` with `!`
text_modified = text.replace(".", "!")

# change text to lowercased letters
text_modified = text_modified.lower()

# split text at space, yields words as list
text_modified = text_modified.split(" ")
text_modified


['i',
 'love',
 'nlp',
 'and',
 'social',
 'science!',
 'computational',
 'social',
 'science',
 'applies',
 'nlp',
 'to',
 'social',
 'questions!']
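
Splitting at spaces leaves punctuation attached to the words, as in `'science!'` above. A minimal sketch of one way to handle this, assuming that only leading and trailing ASCII punctuation should be dropped:

```python
import string

text = "I love NLP and social science. Computational Social Science applies NLP to social questions."

# lowercase and split at any whitespace
tokens = text.lower().split()

# strip leading and trailing ASCII punctuation from each token
tokens_clean = [token.strip(string.punctuation) for token in tokens]

print(tokens_clean)
```

Punctuation inside a token (e.g., the hyphen in `well-known`) is kept, because `strip` only touches both ends of a string.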

## Count words

In [38]:
from collections import Counter

# initialize a counter object
counter = Counter()

# split the text and pass all elements (~words) to the counter
counter.update(text.split(" "))

# get the five most common words
counter.most_common(5)


[('NLP', 2), ('social', 2), ('I', 1), ('love', 1), ('and', 1)]
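
Note that the counter is case-sensitive and treats `science.` and `Science` as different words. A minimal sketch, assuming that case and punctuation should be ignored, lowercases the text and extracts the words with a regular expression before counting:

```python
import re
from collections import Counter

text = "I love NLP and social science. Computational Social Science applies NLP to social questions."

# lowercase first so that "Social" and "social" count as the same word
# \w+ matches runs of letters, digits and underscores, i.e. drops punctuation
words = re.findall(r"\w+", text.lower())

counter = Counter(words)
print(counter.most_common(3))
```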

## Read from a textfile

In [39]:
from pathlib import Path

# define the path to the file
infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")

# read the file
text = infile.read_text()

# show first 200 characters of file
print(text[0:200])


IMPRESSUM
GRÜNE Schweiz
Waisenhausplatz 21
3011 Bern
Tel. 031 326 66 00
www.gruene.ch
gruene@gruene.ch
Postkonto 80-26747-3
Wahlplattform 2019 – 2023
Beschlossen an der
Delegiertenversammlung vom 12. 


## Write into a textfile

In [40]:
import re

# lowercase the text
text = text.lower()

# replace repeated newlines with a single newline
text = re.sub(r"\n+", "\n", text)

# define the path to the output file
outfile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019_lowercased.txt")

# write to file
with outfile.open("w") as f:
    f.write(text)
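
The default text encoding depends on the operating system, so umlauts such as `Ü` can break when a file moves between machines. A minimal sketch, assuming the files are UTF-8, passes the encoding explicitly. The temporary directory and the file name are placeholders for the course folders:

```python
import tempfile
from pathlib import Path

text = "GRÜNE Schweiz\nWaisenhausplatz 21\n"

# a temporary directory stands in for the course data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    outfile = Path(tmp_dir) / "example_lowercased.txt"

    # write and read with an explicit encoding
    outfile.write_text(text.lower(), encoding="utf-8")
    text_read = outfile.read_text(encoding="utf-8")

print(text_read)
```

`Path.write_text` opens, writes, and closes the file in one call, which is a compact alternative to the `with outfile.open("w")` block.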


## Counting words in a textfile

In [41]:
import re
from pathlib import Path
from collections import Counter

infile = Path("../data/swiss_party_programmes/txt/gruene_programmes/gruene_programme_2019.txt")
text = infile.read_text()

# lowercase all text
text = text.lower()

# extract alphanumeric words without punctuation
words = re.findall(r"\w+", text)

# count words
vocab = Counter(words)

# write to file, one word and its frequency per line
outfile = Path("../analysis/gruene_programme_vocab_frq.tsv")
with outfile.open("w") as f:
    for word, frq in vocab.most_common():
        line = f"{word}\t{frq}\n"
        f.write(line)

vocab.most_common(5)


[('und', 595), ('die', 517), ('der', 394), ('für', 217), ('in', 161)]
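
To check the result, the `.tsv` file can be read back in Python. A minimal sketch using the standard `csv` module, with a small made-up vocabulary and a temporary file in place of the course data:

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# made-up word list for illustration
vocab = Counter(["und", "die", "und", "der", "und", "die"])

with tempfile.TemporaryDirectory() as tmp_dir:
    outfile = Path(tmp_dir) / "vocab_frq.tsv"

    # write one word and its frequency per line, separated by a tab
    with outfile.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(vocab.most_common())

    # read the file back; all values arrive as strings, so convert the counts
    with outfile.open(encoding="utf-8", newline="") as f:
        rows = [(word, int(frq)) for word, frq in csv.reader(f, delimiter="\t")]

print(rows)
```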

## Conversion of a single native PDF

### Use case: [Swiss party programmes](https://visuals.manifesto-project.wzb.eu/mpdb-shiny/cmp_dashboard_dataset/)


In [42]:
import re
from pathlib import Path

from pypdf import PdfReader

pdf_path = Path("../data/swiss_party_programmes/pdf/gruene_programmes/gruene_programme_2019.pdf")

# set up PDF reader
reader = PdfReader(pdf_path)

text = ""

# iterate over pages
for page in reader.pages:
    text_page = page.extract_text()
    
    # clean up repeated empty lines
    text_page = re.sub(r"\n\s*\n", "\n", text_page)

    # add text of page to text of document
    text += " " + text_page

print(text[:500])


  
 Wahlplattform der GRÜNEN Schweiz 2019  – 2023   ii  
IMPRESSUM  
GRÜNE  Schweiz  
Waisenhausplatz 21  
3011 Bern  
Tel. 031 326 66 00  
www.gruene.ch  
gruene@gruene.ch  
Postkonto 80-26747 -3 
Wahlplattform 201 9 – 2023  
Beschlossen an der  
Delegiertenversammlung vom 12. Januar 2019 in Emmen,  
ergänzt durch den Resolutionsbeschluss der  
Delegiertenversammlung vom 6. April 2019 in Sierre.  
   Wahlplattform der GRÜNEN Schweiz 2019  – 2023   iii INHALTSVERZEICHNIS  
GENDERNEUTRALE SPRACHE


## Bonus: Clean up artifacts

- remove empty lines
- remove page numbers
- remove footer
- merge hyphenated words
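
The clean-up steps above can be sketched with a few regular expressions. This is only an illustration on a made-up sample: the header text `Example Programme 2019` is hypothetical, and the patterns need to be adapted to each document.

```python
import re

# made-up sample that mimics typical PDF extraction artifacts
text = (
    "Example Programme 2019   12\n"
    "We want to protect the cli-\n"
    "mate and biodiversity.\n"
    "\n"
    "\n"
    "13\n"
    "Example Programme 2019   14\n"
    "More text follows here.\n"
)

# remove the running header/footer together with its page number
text = re.sub(r"^Example Programme 2019[ \t]+\d+[ \t]*$", "", text, flags=re.MULTILINE)

# remove lines that consist of a number only (assumed to be page numbers)
text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)

# merge words that are hyphenated across a line break
text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

# collapse empty lines and trim the start and end of the text
text = re.sub(r"\n\s*\n", "\n", text).strip()

print(text)
```

Be careful with the hyphen rule: it also joins genuine compounds that happen to break at a line end, so check a sample of the output by hand.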

## Remove parts across lines

In [43]:
# Remove multiple lines in a string using regular expressions

import re

text = """
This is an example Text.

YOUR_PATTERN REMOVE THIS
whatever is written here
UNTIL HERE.

Keep this and the following.
"""

# remove a multiline span by substituting the match with an empty string
# re.DOTALL makes `.` also match the newline character \n
# `.*?` stops at the first "UNTIL HERE."; `\.` matches a literal dot
text_clean = re.sub(r"YOUR_PATTERN.*?UNTIL HERE\.", "", text, flags=re.DOTALL)

print(text_clean)


This is an example Text.



Keep this and the following.
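
When the start and end markers occur more than once, it matters whether the pattern is greedy. A minimal sketch of the difference, using the made-up markers `START` and `END`:

```python
import re

text = "keep A START drop 1 END keep B START drop 2 END keep C"

# greedy: .* runs to the LAST "END" and therefore also removes "keep B"
greedy = re.sub(r"START.*END", "", text, flags=re.DOTALL)

# non-greedy: .*? stops at the FIRST "END" after each "START"
non_greedy = re.sub(r"START.*?END", "", text, flags=re.DOTALL)

print(greedy)
print(non_greedy)
```

For removing a footer that repeats on every page, the non-greedy variant is usually the safer choice.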



## In-class: Exercises I

1. Open VS Code, then open the folder `KED2025`.
2. Go to [swissinfo.ch](https://www.swissinfo.ch) and copy the content of a random news article.
3. Create a new file in VS Code, paste the news article into it, and save it as a `.txt` file.
4. Create another new file, save it as an `.ipynb` file, and do the following:
   - Read the textfile containing the news article into a variable.
   - Split the text into words.
   - Count all words (i.e., the vocabulary) and write all the word counts into a `.tsv` file.
   - Open the `.tsv` file in a spreadsheet program.
   - Bonus: Compute the relative frequency of each word.
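
A hint for the bonus: the relative frequency of a word is its count divided by the total number of words. A minimal sketch with a made-up word list (the variable names are only suggestions):

```python
from collections import Counter

# made-up word list standing in for the words of your article
words = ["the", "council", "the", "vote", "the", "council"]
vocab = Counter(words)

# total number of words in the text
n_words = sum(vocab.values())

# relative frequency = count of a word / total number of words
rel_frq = {word: frq / n_words for word, frq in vocab.items()}

# print word, absolute and relative frequency, separated by tabs
for word, frq in vocab.most_common():
    print(f"{word}\t{frq}\t{rel_frq[word]:.3f}")
```

Relative frequencies make texts of different lengths comparable, which absolute counts do not.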