<section id="title-slide">
  <h1 class="title">The ABC of Computational Text Analysis</h1>
  <h2 class="subtitle">#10: NLP with Python</h2>
  <p class="author">Alex Flückiger</p><p class="date">02/16 May 2024</p>
</section>

## Update the course material
1. Navigate to the course folder using `cd` in your command line
2. Update the files with `git pull`
3. If `git pull` doesn't work due to file conflicts, run `git restore .` first

## Getting started 
1. Open VS Code
2. Windows: Make sure that you are connected to WSL (green badge in the lower-left corner)
3. Open the `KED2024` folder via the menu: `File` > `Open Folder`
4. Navigate to `KED2024/ked2024/materials/code/KED2024_10.ipynb` and open it with a double-click
5. Run the code with `Run all` via the top menu

# Overview analysis

- get linguistic information from text
- explore differences between two corpora 
  - using political party programmes
- visualize term frequency over time
  - using 1 August speeches by Swiss Federal Councillors

# Do Natural Language Processing (NLP)

## Modules
#### Standing on the shoulders of giants
- [spaCy](https://spacy.io/usage/spacy-101): use or build state-of-the-art NLP pipeline
- [textaCy](https://textacy.readthedocs.io): do high-level analysis, extends spaCy
- [pandas](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html): analyze tabular data 
- [plotnine](https://plotnine.readthedocs.io): visualize anything (*ggplot for Python*)

# Importing modules

various ways of importing

In [None]:
# standard import
import textacy
import spacy

# import with a short name
import pandas as pd
import scattertext as st

# import specific/all objects from a module
from pathlib import Path
from plotnine import *


# Basic NLP
Process a single document

In [None]:
# example text (to read from a file see below)
text = "Apple's CEO Tim Cook is looking at buying U.K. startup for $1 billion."

# load the English language model
en = textacy.load_spacy_lang("en_core_web_sm")

# process document (tokenizing, tagging, parsing, recognizing named entities)
doc = textacy.make_spacy_doc(text, lang=en)


## Linguistic features
Features per token and their linguistic dependencies

In [None]:
# visualize dependencies
spacy.displacy.render(doc, style="dep")


## Get linguistic features

In [None]:
# iterate over tokens of a document
for token in doc:
    print(
        token.text,
        "-->",
        token.lemma_,
        token.pos_,
        token.dep_,
        token.shape_,
        token.is_alpha,
        token.is_stop, 
    )


## Named Entity Recognition (NER)

In [None]:
# visualize named entities
spacy.displacy.render(doc, style="ent")


In [None]:
# iterate over named entities of a document
for ent in doc.ents:
    print(f"{ent.text} --> {ent.label_} ({spacy.explain(ent.label_)})")


## Read from a file

In [None]:
# alternatively, read from a single txt file
f_text = "../data/swiss_party_programmes/txt/sp_programmes/1920_parteiprogramm_d.txt"
text = Path(f_text).read_text()

print(text[:200])


# Working with a corpus

## Steps to create a corpus

How to make a corpus from many text files?

1. list all files of a folder 
2. read text from each file
3. parse metadata from file name
4. return each document sequentially

&rarr; wrap all this in a function `get_texts_and_metadata()`

## Define function

In [None]:
def get_texts_and_metadata(dir_texts):
    """
    Sequentially stream all documents from a given folder, including metadata.
    """
    p = Path(dir_texts)  # set base directory

    # iterate over all documents in base directory recursively
    for fname in p.glob("**/*.txt"):

        print("Parsing file:", fname.name)

        text = Path(fname).read_text()
        # join lines as there are hard line-breaks
        text = text.replace("\n", " ")
        # further modify the text here if needed

        # parse year from filename and set a metadata
        # example: 1920_parteiprogramm_d.txt --> year=1920
        try:
            year = int(fname.name.split("_")[0])
        except ValueError:
            print("WARNING: Parsing meta data has failed:", fname.name)
            continue

        # add more metadata here if needed
        metadata = {"fname": fname.name, "year": year}

        # return documents one after another (sequentially)
        yield (text, metadata)
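
The step `# add more metadata here if needed` could, for example, pull further fields out of the file name. The sketch below assumes that every file follows the pattern `<year>_<doctype>_<language>.txt`, as in `1920_parteiprogramm_d.txt`. The helper name `parse_filename` and the keys `doctype` and `lang` are made up for illustration, so check them against your own files.

```python
from pathlib import Path


def parse_filename(fname):
    """
    Split a file name like 1920_parteiprogramm_d.txt into metadata fields.
    Assumes the pattern <year>_<doctype>_<language>.txt (an assumption).
    """
    stem = Path(fname).stem  # file name without the .txt extension
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"Unexpected file name: {fname}")
    year, doctype, lang = parts
    return {
        "fname": Path(fname).name,
        "year": int(year),  # raises ValueError if the year is not a number
        "doctype": doctype,
        "lang": lang,
    }


print(parse_filename("1920_parteiprogramm_d.txt"))
```

Inside `get_texts_and_metadata()`, the returned dictionary could replace the hand-written `metadata` dictionary, with the same `try`/`except ValueError` guard around the call.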


## Create a corpus from TXT
Process documents and create corpus

In [None]:
# stream texts from a given folder
dir_texts = "../data/swiss_party_programmes/txt/sp_programmes/"
texts_and_metadata = get_texts_and_metadata(dir_texts)

# load German language model
de = textacy.load_spacy_lang("de_core_news_sm")

# create corpus from processed documents
corpus = textacy.Corpus(de, data=texts_and_metadata)


# Basic corpus statistics

In [None]:
print("# documents:", corpus.n_docs)
print("# sentences:", corpus.n_sents)
print("# tokens:", corpus.n_tokens)


## Export word counts

In [None]:
# get lowercased and filtered corpus vocabulary
vocab = corpus.word_counts(
    by="lemma_", # text for un-lemmatized words
    weighting="count", # freq for relative frequency
    filter_stops=True,
    filter_punct=True,
    filter_nums=True,
)

# sort vocabulary by descending frequency
vocab_sorted = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

# write to file, one word and its frequency per line
fname = "../analysis/vocab_frq.txt"
with open(fname, "w") as f:
    for word, frq in vocab_sorted:
        line = f"{word}\t{frq}\n"
        f.write(line)

vocab_sorted[:5]
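
Sorting a word-count dictionary can also be done with `collections.Counter` from the standard library. The toy counts below are invented and only stand in for the `vocab` dictionary of the notebook.

```python
from collections import Counter

# invented toy counts standing in for the `vocab` dictionary
toy_vocab = {"Arbeit": 12, "Partei": 30, "Staat": 7}

# most_common(n) returns the n most frequent items, sorted by descending count
top_two = Counter(toy_vocab).most_common(2)
print(top_two)  # [('Partei', 30), ('Arbeit', 12)]
```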


# Working with subcorpus

Interested in a group of documents only?

In [None]:
# select the first document in corpus
first_doc = corpus[0]
print(first_doc._.meta)
print(first_doc.text[:50])


In [None]:
# function to filter by metadata, e.g. publication year after 1900
def filter_func(doc):
    return doc._.meta.get("year") > 1900


# create new corpus after applying filter function
subcorpus = textacy.corpus.Corpus(de, data=corpus.get(filter_func))

subcorpus.n_docs, corpus.n_docs
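
If you need several subcorpora, a small function that builds the filter for you can save some typing. This is only a sketch: the name `make_year_filter` is hypothetical, and the stand-in objects below merely mimic the `doc._.meta` attribute so that the example runs without spaCy.

```python
from types import SimpleNamespace


def make_year_filter(start, end):
    """Return a filter function that keeps documents with start <= year <= end."""

    def filter_func(doc):
        year = doc._.meta.get("year")
        # documents without a year are dropped instead of raising an error
        return year is not None and start <= year <= end

    return filter_func


# stand-in objects that mimic doc._.meta (a real spaCy doc is not needed here)
fake_docs = [
    SimpleNamespace(_=SimpleNamespace(meta={"year": y})) for y in (1888, 1920, 1959)
]

interwar = make_year_filter(1918, 1939)
kept = [d._.meta["year"] for d in fake_docs if interwar(d)]
print(kept)  # [1920]
```

With the real corpus, the resulting function would be passed to `corpus.get()` in the same way as `filter_func` above.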


# Key Word in Context (KWIC)

Show words in their original context

In [None]:
# iterate over documents and print matches
# the keyword can be a regular expression
for doc in corpus:
    results = textacy.extract.kwic.keyword_in_context(
        doc.text, keyword="(Ausland|Inland)", ignore_case=True, window_width=50
    )
    for match in results:
        print(f"{match[0]}  {match[1]}  {match[2]}")
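
To see what happens behind the scenes, here is a minimal KWIC sketch using only the standard library. It is not textacy's implementation; the function name `simple_kwic` and the sample sentence are invented for illustration.

```python
import re


def simple_kwic(text, keyword, window_width=20, ignore_case=True):
    """Yield (left context, match, right context) for every regex match."""
    flags = re.IGNORECASE if ignore_case else 0
    for m in re.finditer(keyword, text, flags=flags):
        # cut a window of characters before and after the match
        left = text[max(0, m.start() - window_width) : m.start()]
        right = text[m.end() : m.end() + window_width]
        yield (left, m.group(), right)


sample = "Handel mit dem Ausland und Arbeit im Inland"
kwic_results = list(simple_kwic(sample, "(Ausland|Inland)", window_width=10))

for left, match, right in kwic_results:
    print(f"{left:>10}  {match}  {right:<10}")
```

The window is measured in characters, not words, which is why the context may start or end in the middle of a word.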


# Export results to TXT File

collect any information and write to file
- particular terms or linguistic constructions
- Named Entities (NE)
- ...

In [None]:
results = []

# collect information
for doc in corpus:
    for sent in doc.sents:
        if "Armut" in sent.text:
            # match contains the sentence where the term occurs, preceded by the filename (tab-separated)
            match = f"{doc._.meta['fname']}\t{sent.text}"
            results.append(match)

# write information to file
fname = "../analysis/sents_poverty.txt"
with open(fname, "w") as f:
    f.write("\n".join(results))

print(results[0])


# Export corpus as CSV Dataset
We have created a corpus containing all party programmes. Now, let's save it as a CSV dataset.

In [None]:
# merge dictionary with metadata and dictionary with actual text for each document in the corpus
data = [doc._.meta | {"text": doc.text} for doc in corpus]

# export corpus as csv
f_csv = "../data/swiss_party_programmes/corpus_party_programmes.csv"
textacy.io.csv.write_csv(data, f_csv, fieldnames=data[0].keys())

# check the data of the first party programme
data[0]
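
The merging step with `|` (dictionary union, Python 3.9 or newer) can be tried out in isolation. The sketch below uses invented values in place of `doc._.meta` and `doc.text`, and writes the row with the standard-library `csv` module into an in-memory buffer instead of a file.

```python
import csv
import io

# invented stand-ins for doc._.meta and doc.text
meta = {"fname": "1920_parteiprogramm_d.txt", "year": 1920}
row = meta | {"text": "Beispieltext"}  # dict union, requires Python 3.9+

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()  # first line: column names
writer.writerow(row)  # second line: values of the document

print(buffer.getvalue())
```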



# In-class: Exercises I

1. Make sure that your local copy of the GitHub repository KED2024 is up-to-date with `git pull`. You can find the relevant material as follows:
- notebook `/KED2024/ked2024/materials/code/KED2024_10.ipynb`
- party programmes `/KED2024/ked2024/materials/data/swiss_party_programmes/txt`

2. Open the notebook in VS Code. *@Windows people*: Make sure that you are connected to WSL Ubuntu (check green badge).

3. Run all the code in the notebook by clicking `Run All`.

4. Process another English sentence with spaCy instead of the one mentioning Apple.

5. Load the German language model and process a German sentence. Display the linguistic information and check the difference between the lemma and the form as it occurs in the text.

6. Play around with the code as it is a good way to learn. Modify one thing, run the code, and see if the output matches your expectations. Start easy and then get increasingly brave until the code breaks. Fix the issue and try again.


# Explore corpus of 1 August speeches interactively

![Example Scattertext](../analysis/viz_party_differences.png)

# What the graph shows
- speeches by the Swiss Federal Councillors on 1 August
- visualize the difference between speakers of *Social Democratic Party of Switzerland* (SP) and other parties
- interpretation
  -  top right: terms used by all
  -  top left: terms primarily used by SP
  -  lower right: terms primarily used by other parties
 
[Explore interactively in your browser](https://aflueckiger.github.io/KED2024/materials/analysis/viz_party_differences.html)

# Scattertext

- how does language differ between two groups
    - organization, person, gender, time etc.
- interactive exploration
- find discriminative terms
- scoring function *rank-frequency*
    - normalized by number of terms `[0,1]`
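
To get a feeling for a rank-based score in `[0,1]`, consider the simplified sketch below. It is an illustration only and not Scattertext's actual implementation: terms are ranked by frequency (ties share a rank) and the rank is divided by the highest rank. The function name `rank_scale` and the toy counts are invented.

```python
def rank_scale(freqs):
    """Map term frequencies to [0, 1] by their dense rank (ties share a rank)."""
    distinct = sorted(set(freqs.values()))  # ascending distinct frequencies
    rank_of = {f: i for i, f in enumerate(distinct)}  # lowest frequency -> rank 0
    top = max(len(distinct) - 1, 1)  # avoid division by zero for a single rank
    return {term: rank_of[f] / top for term, f in freqs.items()}


toy = {"schweiz": 50, "volk": 10, "heimat": 10, "solidarität": 2}
scaled = rank_scale(toy)
print(scaled)
```

Because only the rank matters, a term that occurs 50 times and one that occurs 5000 times would both end up at 1.0 if they are the most frequent term in their group.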

## Load CSV File

Load a dataset of 1 August speeches by Swiss Federal Councillors (received from [Republik, original article](https://www.republik.ch/2019/08/01/anleitung-fuer-die-perfekte-ansprache-zum-1-august))

In [None]:
# read dataset from csv file
f_csv = "../data/dataset_speeches_federal_council_2019.csv"
df = pd.read_csv(f_csv)

# make new column containing all relevant metadata
df["descripton"] = df[["Redner", "Partei", "Jahr"]].astype(str).agg(", ".join, axis=1)

# filter out non-German texts or very short texts
df_sub = df[(df["Sprache"] == "de") & (df["Text"].str.len() > 10)]

# sneak peek of dataset
df_sub.head()


## Create scattertext plot

In [None]:
censor_tags = set(['CARD']) # tags to ignore in corpus, e.g. numbers

# stop words to ignore in corpus
de_stopwords = spacy.lang.de.stop_words.STOP_WORDS # default stop words
custom_stopwords = set(['[', ']', '%', "*"])
de_stopwords = de_stopwords.union(custom_stopwords) # extend with custom stop words

# create corpus from dataframe
# lemmatized and lowercased terms, no stopwords, no numbers
corpus_speeches = st.CorpusFromPandas(df_sub, # dataset
                             category_col='Partei', # index differences by ...
                             text_col='Text', 
                             nlp=de, # German model
                             feats_from_spacy_doc=st.FeatsFromSpacyDoc(tag_types_to_censor=censor_tags, use_lemmas=True),
                             ).build().get_stoplisted_unigram_corpus(de_stopwords)
# produce visualization (interactive html)
html = st.produce_scattertext_explorer(corpus_speeches,
            category='SP', # set attribute to divide corpus into two parts
            category_name='SP',
            not_category_name='other parties',
            metadata=df_sub['descripton'],
            width_in_pixels=1000,
            minimum_term_frequency=5, # drop terms occurring less than 5 times
            save_svg_button=True,                          
)

# write visualization to html file
fname = "../analysis/viz_party_differences.html"
with open(fname, "w", encoding="utf-8") as f:
    f.write(html)

# Plot term frequencies over time

![Example](../analysis/rel_term_frq_nation.png)

## Create corpus from CSV

How to make a corpus from a dataset in `.csv`-format?

&rarr; define a new function `get_texts_from_csv`, similar to `get_texts_and_metadata`

In [None]:
def get_texts_from_csv(f_csv, text_column):
    """
    Read dataset from a csv file and sequentially stream the rows,
    including metadata.
    """

    # read dataframe
    df = pd.read_csv(f_csv)

    # keep only documents that have text
    filtered_df = df[df[text_column].notnull()]

    # iterate over rows in dataframe
    for idx, row in filtered_df.iterrows():

        # get text and join lines (remove hard line-breaks)
        text = row[text_column].replace("\n", " ")

        # use all columns as metadata, except the column with the actual text
        metadata = row.to_dict()
        del metadata[text_column]

        yield (text, metadata)


f_csv = "../data/dataset_speeches_federal_council_2019.csv"
texts = get_texts_from_csv(f_csv, text_column="Text")

corpus_speeches = textacy.Corpus(de, data=texts)


## Create a group-term matrix

In [None]:
# define how groups are formed and what terms should be included
# here, we get a list of lemmatized words (incl. stop words) and labels (=years) for each document
tokenized_docs, labels = textacy.io.unzip(
    (
        textacy.extract.utils.terms_to_strings(
            textacy.extract.words(doc, filter_stops=False), by="lemma"
        ),
        doc._.meta["Jahr"],
    )
    for doc in corpus_speeches
)

# define how to count
# here: relative term frequency
vectorizer = textacy.representations.vectorizers.GroupVectorizer(
    tf_type="linear",  # absolute term frequency
    dl_type="linear",  # normalized by document length
    vocabulary_grps=range(1950, 2019),  # limit to years from 1950 to 2018 (end of range is exclusive)
)

# create group-term matrix with frequency counts
grp_term_matrix = vectorizer.fit_transform(tokenized_docs, labels)

# create dataframe from matrix
df_terms = pd.DataFrame(
    grp_term_matrix.toarray(), index=vectorizer.grps_list, columns=vectorizer.terms_list
)
df_terms["year"] = df_terms.index

# change shape of dataframe
df_tidy = df_terms.melt(id_vars="year", var_name="term", value_name="frequency")
df_tidy
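
Conceptually, a relative term frequency per group means: count each term within a group and divide by the total number of tokens in that group. The sketch below shows this idea with the standard library only and invented toy data; details of textacy's `GroupVectorizer` may differ.

```python
from collections import Counter, defaultdict

# invented toy data: one token list and one label (year) per document
docs = [["Volk", "Schweiz", "Volk"], ["Schweiz"], ["Nation", "Schweiz"]]
years = [1950, 1950, 1951]

# count terms per group (all documents of a year are pooled)
counts = defaultdict(Counter)
for tokens, year in zip(docs, years):
    counts[year].update(tokens)

# divide each count by the total number of tokens in the group
rel_freq = {
    year: {term: n / sum(c.values()) for term, n in c.items()}
    for year, c in counts.items()
}

print(rel_freq[1950]["Schweiz"])  # 0.5
```

Normalizing by group length keeps years with many or long speeches comparable to years with only a few.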


## Plot frequencies over time

In [None]:
# filter the dataset for the following terms
terms = ["Volk", "Schweiz", "Nation"]
df_terms = df_tidy[df_tidy["term"].isin(terms)]

# plot the relative frequency for the terms above
(
    ggplot(df_terms, aes(x="year", y="frequency", color="term"))
    + geom_point()  # show individual points
    + stat_smooth(method="lowess", span=0.15, se=False)  # overlay points with a smoothed line
    + theme_classic()  # make the plot look nicer
)


## Save Plot

In [None]:
# check some other terms
terms = ["Solidarität", "Kultur", "Wert"]

df_terms = df_tidy[df_tidy["term"].isin(terms)]

p = (
    ggplot(df_terms, aes("year", "frequency", color="term"))
    + geom_point(alpha=0.5, stroke=0)
    + stat_smooth(method="lowess", span=0.10, se=False)
    + theme_classic()
)

# save as png
fname = "../analysis/rel_term_frq_culture.png"
p.save(filename=fname, dpi=150, verbose=False)
p


# Number of documents per year

In [None]:
docs_per_year = (
    df_sub.groupby("Jahr")
    .agg({"Text": "count"})
    .reset_index()
    .rename(columns={"Text": "count"})
)

(
    ggplot(docs_per_year, aes(x="Jahr", y="count"))
    + geom_line(color="darkblue")
    + labs(title="Number of Speeches per Year", x="Year", y="absolute frequency")
    + scale_y_continuous(breaks=range(0, 20, 2), expand=(0, 1))
    + scale_x_continuous(breaks=range(1930, 2021, 10), expand=(0, 10))
    + theme_classic()
)

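In [32]:
# df1 and df2 are placeholders for two dataframes that you have already loaded
# add a column with a constant value, then stack both dataframes row-wise
df1["new_column"] = "Your value for all rows in this column"
df_combined = pd.concat([df1, df2], ignore_index=True)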

Unnamed: 0,Jahr,Status,Vollständigkeit,Redner,Geschlecht,Funktion,Partei,Partei-Original,Typ,Bemerkung,Sprache,Originalsprache,Ort,Titel,Anrede,Text,Originaltext,Quelle,descripton
0,2018,done,vollständig,Alain Berset,m,BP,SP,SP,BP-Rede,,de,,,,Sehr geehrte Damen und Herren,Wir leben in der Schweiz in Frieden und Wohlst...,,https://www.admin.ch/gov/de/start/dokumentatio...,"Alain Berset, SP, 2018"
1,2018,done,vollständig,Alain Berset,m,BP,SP,SP,Lokal,,"de, fr",,"Euschels, Belfaux, Rütli",,,Wir leben in der Schweiz in Frieden und Wohlst...,,https://www.admin.ch/gov/de/start/dokumentatio...,"Alain Berset, SP, 2018"
2,2018,done,vollständig,Alain Berset,m,BP,SP,SP,Lokal,,"de, fr",,"Alp Oberer Euschels, Belfaux, Rütli",,,Wir leben in der Schweiz in Frieden und Wohlst...,,https://www.admin.ch/gov/de/start/dokumentatio...,"Alain Berset, SP, 2018"
3,2018,done,vollständig,Doris Leuthard,f,BR,CVP,CVP,Lokal,,de,,Villmergen,,Liebe Mitbürgerinnen und Mitbürger,Ich bedanke mich für die Einladung zu Ihrer 1....,,https://www.admin.ch/gov/de/start/dokumentatio...,"Doris Leuthard, CVP, 2018"
4,2018,done,vollständig,Guy Parmelin,m,BR,SVP,SVP,Lokal,,de,,,«Armbrust und Hellebarde»,"Sehr geehrte Eidgenossen, Meine Damen und Herren",Eine 1.-August-Rede ist eine der heikelsten rh...,,https://www.admin.ch/gov/de/start/dokumentatio...,"Guy Parmelin, SVP, 2018"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,1937,done,vollständig,Giuseppe Motta,m,BP,CVP,SKVP,BP-Rede,für Auslandschweizer,de,,,,"Meine Damen und Herren, liebe Landsleute,","Es ist nicht das erste Mal, dass ich die Freud...",,"NZZ vom 2. August 1937, Mittagsausgabe Nr. 139...","Giuseppe Motta, CVP, 1937"
166,1936,done,vollständig,Albert Meyer,m,BP,FDP,FDP,BP-Rede,,de,,,,,In republikanischer Schlichtheit und Einfachhe...,,"NZZ vom 3. August 1936, Morgenausgabe Nr. 1825...","Albert Meyer, FDP, 1936"
167,1935,done,vollständig,Rudolf Minger,m,BP,SVP,BGB,BP-Rede,wurde Lokal gehalten,de,,,,"Berner, Eidgenossen!","«Als Demut weint und Hochmut lacht, da ward de...",,"Rudolf Minger spricht. Francke Verlag Bern, 1967","Rudolf Minger, SVP, 1935"
168,1934,fehlt,,Marcel Pilet-Golaz,m,BP,FDP,FDP,BP-Rede,,,,,,,,,,"Marcel Pilet-Golaz, FDP, 1934"


# Working on mini-project

Ask questions, <br>
I am ready to help!

![Help!](../../lectures/images/help_frog.gif)

# Resources

#### tutorials for spaCy

- [official spaCy 101](https://spacy.io/usage/spacy-101)
- [official online course spaCy](https://course.spacy.io/en/chapter1)
- [Hitchhiker's Guide to NLP in spaCy](https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy)