# Assignment 3 - Example Solution
### Alex Flückiger
### Seminar: The ABC of computational Text Analysis
### University of Lucerne


## Preparation (not part of the assignment)

The working directory is set to KED2022 instead of the directory where this Jupyter notebook is located.
Thus, I can specify the paths relative to the top folder KED2022.

In [1]:
import os
print("Old working directory:", os.getcwd())
os.chdir('../..')
print("New working directory:", os.getcwd())

Old working directory: /home/alex/KED2022/assignments/assignment_3
New working directory: /home/alex/KED2022


To run this notebook as a script from the command line, you can export it as a .py file via the menu bar and save it to assignments/assignment_3/flueckiger_KED2022_3_solutions.py.

Then, you can navigate to the KED2022 directory and call:

```
python assignments/assignment_3/flueckiger_KED2022_3_solutions.py
```
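
One thing to keep in mind: `os.chdir('../..')` is resolved against the current working directory. If the exported script still contains the preparation cell and is started from KED2022, it would move two levels further up. The following is a minimal sketch of one way to make the script independent of where it is started, assuming the top folder is literally named KED2022. The helper `find_project_root` is a hypothetical name, not part of the assignment.

```python
import os
from pathlib import Path


def find_project_root(start, root_name="KED2022"):
    """Return `start` or its closest ancestor folder that is named `root_name`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if candidate.name == root_name:
            return candidate
    raise FileNotFoundError(f"no folder named {root_name!r} at or above {start}")


# In an exported script, __file__ points to the script itself. In a notebook
# it is undefined, so fall back to the current working directory.
try:
    here = Path(__file__).parent
except NameError:
    here = Path.cwd()

try:
    os.chdir(find_project_root(here))
    print("Working directory:", os.getcwd())
except FileNotFoundError:
    # assumption: outside the course folder we simply keep the current directory
    print("KED2022 not found, working directory unchanged")
```

With this in place of the preparation cell, the script should behave the same whether it is started from KED2022 or from assignments/assignment_3.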

## Tasks of Assignment

In [2]:
# task 3: import the modules that are needed
import textacy
import pandas as pd

In [3]:
# task 4: create a corpus
# define new function to read text from csv file (copied as given in assignment)

def get_texts_from_csv(f_csv, text_column):

    # read dataframe
    df = pd.read_csv(f_csv)

    # keep only documents that have text
    filtered_df = df[df[text_column].notnull()]

    # iterate over rows in dataframe
    for idx, row in filtered_df.iterrows():

        # read text and join lines (hard line-breaks)
        text = row[text_column].replace("\n", " ")

        # use all columns as metadata, except the column with the actual text
        metadata = row.to_dict()
        del metadata[text_column]

        yield (text, metadata)


# set the path relative to the working directory (here: the top folder KED2022)
f_csv = "materials/data/dataset_speeches_federal_council_2019.csv"

# stream the csv-dataset by calling the function defined above
texts = get_texts_from_csv(f_csv, text_column="Text")

# load German language model
de = textacy.load_spacy_lang("de_core_news_sm")

# create a corpus with all the texts
corpus_speeches = textacy.Corpus(de, data=texts)
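
To make the shape of the streamed records more tangible, here is a small sketch that mimics the generator above using only the standard library `csv` module on a made-up, in-memory sample. The function name `iter_records` and the sample rows are assumptions for illustration. Unlike `pd.read_csv`, the `csv` module does not infer types, so the year stays a string here.

```python
import csv
import io


def iter_records(lines, text_column):
    """Yield (text, metadata) pairs from CSV lines, skipping rows without text."""
    for row in csv.DictReader(lines):
        # remove the text from the row; what remains serves as metadata
        text = row.pop(text_column, None)
        if not text:
            continue
        # join hard line-breaks, as in the function given in the assignment
        yield text.replace("\n", " "), row


# tiny made-up sample, only meant to show what a record looks like
sample = io.StringIO(
    'Jahr,Sprache,Text\n'
    '1999,de,"Liebe Mitbürgerinnen\nund Mitbürger"\n'
    '2001,fr,\n'
)
records = list(iter_records(sample, text_column="Text"))
print(records)
```

Each record is a tuple of the text and a dictionary with the remaining columns. The second sample row has no text and is therefore skipped, just like the rows dropped by `notnull()` above.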

In [4]:
# task 5: two subcorpora
# define two functions filtering by language and period
# similar to the lambda functions shown in the assignment, yet they may be simpler to understand

def filter_func_pre(doc):
    return doc._.meta.get("Sprache") == "de" and doc._.meta.get("Jahr") < 2000


# greater-equal to include the year 2000
def filter_func_post(doc):
    return doc._.meta.get("Sprache") == "de" and doc._.meta.get("Jahr") >= 2000


#########################################
# ALTERNATIVE task 5: filter by gender instead of period
# You just need to call these functions when creating the subcorpora below.
# Everything else stays the same.
def filter_func_women(doc):
    return (
        doc._.meta.get("Sprache") == "de"
        and doc._.meta.get("Jahr") >= 1999
        and doc._.meta.get("Geschlecht") == "f"
    )


def filter_func_men(doc):
    return (
        doc._.meta.get("Sprache") == "de"
        and doc._.meta.get("Jahr") >= 1999
        and doc._.meta.get("Geschlecht") == "m"
    )


#########################################

# create two new subcorpora after applying filter function
subcorpus_pre = textacy.corpus.Corpus(de, data=corpus_speeches.get(filter_func_pre))
subcorpus_post = textacy.corpus.Corpus(de, data=corpus_speeches.get(filter_func_post))
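
The boundary between the two periods is easy to get wrong by one year. As a sketch, the filter logic can be written as a plain function on a metadata dictionary and checked on a few made-up examples before it is applied to the corpus. The name `in_period` and the sample dictionaries are assumptions, and the check for missing years goes beyond what the assignment asks for.

```python
def in_period(meta, lang="de", year_from=None, year_to=None):
    """Check language and the half-open year range [year_from, year_to)."""
    year = meta.get("Jahr")
    # reject a missing year: None, or NaN (which is never equal to itself)
    if year is None or year != year:
        return False
    if meta.get("Sprache") != lang:
        return False
    if year_from is not None and year < year_from:
        return False
    if year_to is not None and year >= year_to:
        return False
    return True


# made-up metadata, only to test the boundary year 2000
examples = [
    {"Sprache": "de", "Jahr": 1999},
    {"Sprache": "de", "Jahr": 2000},
    {"Sprache": "fr", "Jahr": 2005},
    {"Sprache": "de"},  # year missing
]
pre = [m for m in examples if in_period(m, year_to=2000)]
post = [m for m in examples if in_period(m, year_from=2000)]
print(len(pre), len(post))
```

For the corpus, the same function could be wrapped as `lambda doc: in_period(doc._.meta, year_to=2000)` and passed to `corpus_speeches.get()`.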

In [5]:
# task 6: print number of docs for both subcorpora
print("# documents in Subcorpus 1 (before 2000):", subcorpus_pre.n_docs)
print("# documents in Subcorpus 2 (as of 2000):", subcorpus_post.n_docs)

# documents in Subcorpus 1 (before 2000): 63
# documents in Subcorpus 2 (as of 2000): 97


In [6]:
# task 7: export the vocabulary of both subcorpora into file

# lowercased, unigram
vocab_pre = subcorpus_pre.word_counts(
    by="lower_",
    filter_stops=True,
    filter_punct=True,
    filter_nums=True,
    weighting="freq",  # get relative frequency instead of absolute
)
vocab_post = subcorpus_post.word_counts(
    by="lower_",
    filter_stops=True,
    filter_punct=True,
    filter_nums=True,
    weighting="freq",
)

# sort vocabulary by descending frequency
vocab_sorted_pre = sorted(vocab_pre.items(), key=lambda x: x[1], reverse=True)
vocab_sorted_post = sorted(vocab_post.items(), key=lambda x: x[1], reverse=True)

# write to file, one word and its frequency per line
fname = "assignments/assignment_3/vocab_frq_pre.txt"
with open(fname, "w", encoding="utf-8") as f:
    for word, frq in vocab_sorted_pre:
        line = f"{word}\t{frq}\n"
        f.write(line)

fname = "assignments/assignment_3/vocab_frq_post.txt"
with open(fname, "w", encoding="utf-8") as f:
    for word, frq in vocab_sorted_post:
        line = f"{word}\t{frq}\n"
        f.write(line)
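
Sorting and writing are done twice with identical code, so they could be moved into a small helper. The following is only a sketch: `write_vocab` is a hypothetical name, the demo values are illustrative, and the explicit UTF-8 encoding is my assumption to keep umlauts intact on every platform.

```python
import os
import tempfile


def write_vocab(path, counts):
    """Write one `term<TAB>frequency` line per term, most frequent first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", encoding="utf-8") as f:
        for term, frq in ranked:
            f.write(f"{term}\t{frq}\n")
    return len(ranked)  # number of lines written


# illustrative values, written to a temporary folder instead of the assignment folder
demo = {"bürgerinnen": 0.000386, "schweiz": 0.004}
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "vocab_demo.txt")
    n_written = write_vocab(target, demo)
    with open(target, encoding="utf-8") as f:
        first_line = f.readline()
print(n_written, repr(first_line))
```

In the notebook, the two loops would then shrink to `write_vocab("assignments/assignment_3/vocab_frq_pre.txt", vocab_pre)` and the corresponding call for `vocab_post`.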

### task 8: comparison of both vocabularies - what has changed? Which terms are used most often?

### Interpretation
The top ten words look surprisingly similar. Tradition seems to be reflected not only in the ritual itself but in the language of these speeches as well.

On second glance, there are interesting differences though:
- speakers start to talk more about "europa" and "EU" after 2000
- nowadays, the term "Eidgenossenschaft" is primarily used by right-wing people and has generally been replaced by "Schweiz"
- gender has become important: women are greeted as "Damen" alongside "Herren" (notably, there is not only "Schweizerinnen und Schweizer")
- "Sicherheit" becomes a hot topic after the relatively calm post-war period
- there is more talk of "Kultur", "Identität" and "Werte". Who are we (no longer limited to Swiss people) in a globalized and multi-cultural world?

To compare the vocabularies more systematically than by simple eyeballing,
it helps to compute the change in relative frequency between the two periods.
This goes beyond what we have touched upon in the seminar. Yet, it is worth looking at.

In [7]:
# create dataframe for both subcorpora
df_pre = pd.DataFrame(vocab_sorted_pre, columns=["term", "frq"])
df_post = pd.DataFrame(vocab_sorted_post, columns=["term", "frq"])

# merge them into a single dataset on the basis of term
df = df_pre.merge(df_post, on="term", how="inner", suffixes=("_pre", "_post"))

# compute the difference in relative frequency (post minus pre) and sort by it in ascending order
df["diff"] = df["frq_post"] - df["frq_pre"]
df.sort_values("diff", inplace=True)

# save the results as csv dataset
fname = "assignments/assignment_3/vocab_diff_periods.csv"
df.to_csv(fname)
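
The column `diff` is the absolute difference between two relative frequencies, which tends to favour words that are frequent anyway. A ratio between the periods, here on a log2 scale, expresses the change relative to the starting level. The sketch below uses plain Python and a few frequencies copied from the outputs of this notebook. The function `log2_ratio` and the smoothing constant are assumptions for illustration, not part of the assignment.

```python
import math


def log2_ratio(frq_pre, frq_post, smoothing=1e-6):
    """log2 of post/pre; the small constant avoids a division by zero."""
    return math.log2((frq_post + smoothing) / (frq_pre + smoothing))


# (frq_pre, frq_post) for a few terms, copied from the outputs of this notebook
freqs = {
    "identität": (5e-05, 0.000392),
    "herren": (0.000168, 0.000549),
    "unseres": (0.002268, 0.001013),
    "unsern": (0.000639, 1.3e-05),
}
changes = {term: log2_ratio(pre, post) for term, (pre, post) in freqs.items()}

# print from strongest relative decrease to strongest relative increase
for term, change in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{term}\t{change:+.2f}")
```

A value of +1 means the relative frequency doubled, -1 means it halved. On this scale "unsern" declines more strongly than "unseres", although "unseres" shows the larger absolute difference. The smoothing constant would matter most if the merge above used `how="outer"` and missing frequencies were filled with 0, so that terms occurring in only one period are kept.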

In [8]:
# show the terms with the biggest increase in relative frequency
df.tail(20)

| | term | frq_pre | frq_post | diff |
|---|---|---|---|---|
| 82 | feiern | 0.000437 | 0.000719 | 0.000282 |
| 305 | bevölkerung | 0.000185 | 0.000477 | 0.000292 |
| 220 | kultur | 0.000235 | 0.000536 | 0.000301 |
| 949 | bürgerinnen | 6.7e-05 | 0.000386 | 0.000319 |
| 927 | erfolg | 6.7e-05 | 0.000392 | 0.000325 |
| 195 | wohlstand | 0.000269 | 0.000595 | 0.000326 |
| 1212 | identität | 5e-05 | 0.000392 | 0.000342 |
| 355 | braucht | 0.000168 | 0.00051 | 0.000342 |
| 344 | herren | 0.000168 | 0.000549 | 0.000381 |
| 1260 | wichtig | 5e-05 | 0.000438 | 0.000388 |


In [9]:
# show the terms with the biggest decrease in relative frequency
df.head(20)

| | term | frq_pre | frq_post | diff |
|---|---|---|---|---|
| 2 | unseres | 0.002268 | 0.001013 | -0.001255 |
| 11 | eidgenossen | 0.001126 | 0.000216 | -0.00091 |
| 10 | eidgenossenschaft | 0.00121 | 0.000314 | -0.000896 |
| 5 | landes | 0.001663 | 0.00102 | -0.000644 |
| 9 | staat | 0.001277 | 0.000641 | -0.000636 |
| 38 | unsern | 0.000639 | 1.3e-05 | -0.000625 |
| 23 | probleme | 0.000891 | 0.000268 | -0.000622 |
| 20 | bund | 0.000907 | 0.000288 | -0.00062 |
| 13 | volk | 0.001109 | 0.000497 | -0.000612 |
| 34 | aufgaben | 0.000655 | 0.000105 | -0.000551 |
