
# Assignment 2 - Example Solution
### Alex Flückiger
### Seminar: The ABC of Computational Text Analysis
### University of Lucerne

## Task 1: Get the data

Download the play `Romeo and Juliet` from Project Gutenberg with `wget`. Steps:
1. Open a shell 
2. Run `wget https://www.gutenberg.org/ebooks/1513.txt.utf-8`

Make sure that the downloaded file is in the same folder as the code below. Otherwise, you need to adapt the file path in the code.
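
If `wget` is not installed (for example on Windows), the same file can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_play` and its parameters are hypothetical and not part of the assignment.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the wget command above
URL = "https://www.gutenberg.org/ebooks/1513.txt.utf-8"


def download_play(url=URL, target="1513.txt.utf-8"):
    """Download the file once and return its path (hypothetical helper)."""
    path = Path(target)
    if not path.exists():  # skip the download if the file is already there
        urlretrieve(url, str(path))
    return path


# Usage (requires an internet connection):
# infile = download_play()
```

Saving under the same name as `wget` does means the code in Task 2 works without changes.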

## Task 2: Analyze the data

In [4]:
from pathlib import Path
import re
from collections import Counter

infile = Path("1513.txt.utf-8")
text = infile.read_text(encoding="utf-8")

# remove the header with metadata
text = re.sub(r"The Project Gutenberg.*START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET \*\*\*", "", text, flags=re.DOTALL)

# remove the footer with terms and conditions
text = re.sub(r"\*\*\* END OF THE PROJECT GUTENBERG.*", "", text, flags=re.DOTALL)

# lowercase all text
text = text.lower()

### Bonus ###
#  join lines, keep paragraphs
def join_lines_keep_paragraphs(text):
    # Split the text into paragraphs
    paragraphs = text.split('\n\n')  # Assuming paragraphs are separated by two newlines

    # Join lines within each paragraph
    paragraphs = [' '.join(paragraph.split('\n')) for paragraph in paragraphs]

    # Join the paragraphs back together
    joined_text = '\n\n'.join(paragraphs)

    return joined_text

text = join_lines_keep_paragraphs(text)
### End of Bonus ###

# write to file to check cleaned version
outfile = Path("romeo_julia_cleaned.txt")
with outfile.open("w", encoding="utf-8") as f:
    f.write(text)

# extract alphanumeric words without punctuation
words = re.findall(r"\w+", text)

print(words[:10])

# count words
vocab = Counter(words)

# write to file, one word and its frequency per line
outfile = Path("romeo_julia_vocab_frq.tsv")
with outfile.open("w", encoding="utf-8") as f:
    for word, frq in vocab.most_common():
        line = f"{word}\t{frq}\n"
        f.write(line)

vocab.most_common(5)


['the', 'tragedy', 'of', 'romeo', 'and', 'juliet', 'by', 'william', 'shakespeare', 'contents']


[('and', 736), ('the', 688), ('i', 659), ('to', 544), ('a', 488)]
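
One side effect of the `\w+` pattern is worth knowing: it splits words at apostrophes and hyphens, which are common in Shakespeare's contractions. The sketch below shows the effect on a made-up sample phrase and compares it with an alternative pattern that keeps such words together. The alternative pattern is an assumption for illustration, not part of the assignment solution.

```python
import re

# made-up sample phrase for illustration
sample = "o'er the star-cross'd lovers"

# pattern used above: apostrophes and hyphens split words
simple = re.findall(r"\w+", sample)

# assumed alternative: allow ', ’ or - between word characters
joined = re.findall(r"\w+(?:['’-]\w+)*", sample)

print(simple)  # ['o', 'er', 'the', 'star', 'cross', 'd', 'lovers']
print(joined)  # ["o'er", 'the', "star-cross'd", 'lovers']
```

Which variant is preferable depends on the question: the simple pattern inflates the counts of fragments such as `d` or `er`, while the alternative treats every contraction as a separate vocabulary entry.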

## Mini-Analysis

- The most frequent words are hard to interpret because they carry little semantic meaning (so-called stopwords).
- `Romeo` is mentioned more often than `Juliet`.
- The `Capulet` family is mentioned more often than the `Montague` family.
- `love` is among the highest-ranking nouns, which matches the underlying theme of the play.
- The adjective `good` is used more often than `bad`, which might indicate a positive sentiment.
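
Comparisons like these can be read directly from the `Counter`. The following sketch uses a toy text so that it runs on its own; the helper `compare` is hypothetical, and in the notebook you would pass the existing `vocab` instead of `toy_vocab`. Note that all words must be looked up in lowercase because the text was lowercased.

```python
import re
from collections import Counter

# toy text for illustration only; in the notebook, reuse the existing `vocab`
toy_text = "romeo loves juliet and juliet loves romeo but romeo is a montague"
toy_vocab = Counter(re.findall(r"\w+", toy_text.lower()))


def compare(vocab, word_a, word_b):
    """Return the counts of both words; a Counter yields 0 for unseen words."""
    return vocab[word_a], vocab[word_b]


print(compare(toy_vocab, "romeo", "juliet"))      # (3, 2)
print(compare(toy_vocab, "capulet", "montague"))  # (0, 1)
```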


The word distribution is highly skewed and roughly follows [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).
In its idealized form, the law states that the frequency of a word is inversely proportional to its rank: the most frequent word occurs twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Real texts follow this pattern only approximately, as the top five counts above show. The words at the top of the ranking are typically stopwords, so a frequency ranking is often used to identify them.
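
To see how closely the play matches the idealized law, the observed counts can be compared with the prediction `f(rank) = f(1) / rank`. This is a minimal sketch that reuses the five counts printed above and assumes an exponent of exactly 1, which is a simplification.

```python
# five most frequent words as printed by the cell above
top5 = [("and", 736), ("the", 688), ("i", 659), ("to", 544), ("a", 488)]

top_frq = top5[0][1]  # frequency of the rank-1 word

rows = []
for rank, (word, frq) in enumerate(top5, start=1):
    expected = top_frq / rank  # idealized Zipf prediction
    rows.append((rank, word, frq, round(expected, 1)))
    print(f"{rank}\t{word}\t{frq}\t{expected:.1f}")
```

For rank 2, the prediction is 368.0 while the observed count is 688, so the top of the ranking is considerably flatter than the idealized law suggests.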

If you skip the first few dozen words, however, you will usually see keywords that hint at the most important protagonists, themes, and the general plot of a book.
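
Skipping the top ranks is a one-line slice on the output of `most_common()`. In the sketch below, the helper `keywords_after` and its default cutoff of 50 are assumptions for illustration; the toy counter only makes the example runnable on its own, and in the notebook you would pass the existing `vocab`.

```python
from collections import Counter


def keywords_after(vocab, skip=50, n=20):
    """Return n (word, frequency) pairs after skipping the top `skip` ranks."""
    return vocab.most_common()[skip:skip + n]


# toy counter for illustration; in the notebook, pass the existing `vocab`
toy_vocab = Counter({"and": 9, "the": 8, "i": 7, "romeo": 4, "love": 3})
print(keywords_after(toy_vocab, skip=3, n=2))  # [('romeo', 4), ('love', 3)]
```

A fixed cutoff is crude: it may drop a frequent content word or keep a rarer stopword, so the value is worth adjusting by inspecting the list.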