PROCEDURE:

1) Import modules
2) Define function
3) Create corpus
4) Create subcorpus (both genders, after 2000)
5) Save subcorpus as csv
6) Import csv for use with scattertext
7) Span the two axes based on the variable Geschlecht (gender, m/f)

(If all column names are correct, it should work as written.)
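
The column-name caveat can be checked before running the pipeline. The following is a minimal sketch, assuming the column names used in the later steps (`text`, `Jahr`, `Sprache`, `Geschlecht`, `Redner`, `Partei`). The helper `missing_columns` and the in-memory sample are hypothetical and not part of the original notebook.

```python
import csv
import io

# Column names the later steps rely on (assumption: taken from the code in steps 3 to 7).
REQUIRED_COLUMNS = ["text", "Jahr", "Sprache", "Geschlecht", "Redner", "Partei"]

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the csv header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]

# Hypothetical in-memory sample standing in for the real dataset file.
sample = io.StringIO("Redner,Partei,Jahr,Sprache,text\nA,X,2019,de,Hallo\n")
print(missing_columns(sample))  # prints ['Geschlecht']
```

For a real file, something like `with open(f_csv, newline='', encoding='utf-8') as f: missing_columns(f)` should work; an empty result means all expected columns are present.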

In [None]:
### Step 1) Import modules

import spacy
import textacy
import scattertext as st
import pandas as pd
import plotnine as p9 

In [None]:
### Step 2) Define function

def get_texts_from_csv(f_csv, text_column):
    """
    Read dataset from a csv file and sequentially stream the rows,
    including metadata.
    """

    # read dataframe
    df = pd.read_csv(f_csv)

    # keep only documents that have text
    filtered_df = df[df[text_column].notnull()]
    
    # iterate over rows in dataframe
    for idx, row in filtered_df.iterrows():
        
        #read text and join lines (hard line-breaks)
        text = row[text_column].replace('\n', ' ')

        #use all columns as metadata, except the column with the actual text
        metadata = row.to_dict()
        del metadata[text_column]

        # return documents one after another (sequentially)
        yield (text, metadata)

In [None]:
### Step 3) Create Corpus

# stream texts (with metadata) from a csv file
f_csv = '../KED2022/materials/data/dataset_speeches_federal_council_2019.csv'
texts = get_texts_from_csv(f_csv, text_column='text')

# load German language model
de = textacy.load_spacy_lang("de_core_news_sm")

# create corpus from processed documents
corpus_speeches_XY = textacy.Corpus(de, data=texts)

In [None]:
### Step 4) Create Subcorpus (both genders, after the year 2000)

## subcor (filtering by the meta attribute "Jahr": after 2000)

# function to filter by metadata 
def filter_func_1(doc):
    # skip documents without a year instead of raising a TypeError
    year = doc._.meta.get("Jahr")
    return year is not None and year > 2000

# create new corpus after applying filter function
subcor = textacy.Corpus(de, data=corpus_speeches_XY.get(filter_func_1))

In [None]:
### Step 5) Export corpus as csv dataset

# merge metadata and actual content for each document in the corpus
# (dictionary unpacking copies the metadata and adds the text under the key 'text')
data = [{**doc._.meta, 'text': doc.text} for doc in subcor]

# export corpus as csv
f_csv = '../KED2022/materials/data/dataset_speeches.csv'
textacy.io.csv.write_csv(data, f_csv, fieldnames=data[0].keys())

# csv is the most convenient format to load into scattertext
# inspect the first exported record
data[0]

In [None]:
### Step 6) Import csv to use in scattertext: load file

# read dataset from csv file
f_csv = '../KED2022/materials/data/dataset_speeches.csv'
df = pd.read_csv(f_csv)

# keep only German texts longer than 10 characters
# (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
df_sub = df[(df['Sprache'] == 'de') & (df['text'].str.len() > 10)].copy()

# make new column containing all relevant metadata (showing in plot later on)
df_sub['descripton'] = df_sub[['Redner', 'Partei', 'Jahr']].astype(str).agg(', '.join, axis=1)

# sneak peek of dataset
df_sub.head()

In [None]:
### Step 7) Create scattertext plot, axes based on the variable "Geschlecht" (gender)

censor_tags = set(['CARD']) # tags to ignore in corpus, e.g. numbers

# stop words to ignore in corpus
# explicit import: `import spacy` alone does not guarantee that spacy.lang.de is loaded
from spacy.lang.de.stop_words import STOP_WORDS
de_stopwords = set(STOP_WORDS) # default stop words
custom_stopwords = set(['[', ']', '%', '*', '•', '2.', '19.', '21.', '9.', '3.'])
de_stopwords = de_stopwords.union(custom_stopwords) # extend with custom stop words

# create corpus from dataframe
# lowercased terms, no stopwords, no numbers
# lemmas are disabled: use them for English only, German lemmatization quality is too poor
corpus_speeches = st.CorpusFromPandas(df_sub, # dataset
                             category_col='Geschlecht', # index differences by ...
                             text_col='text', 
                             nlp=de, # German model
                             feats_from_spacy_doc=st.FeatsFromSpacyDoc(tag_types_to_censor=censor_tags, use_lemmas=False),
                             ).build().get_stoplisted_unigram_corpus(de_stopwords)
# produce visualization (interactive html)
html = st.produce_scattertext_explorer(corpus_speeches,
            category='m', # set attribute to divide corpus into two parts
            category_name='male',
            not_category_name='female',
            metadata=df_sub['descripton'],
            width_in_pixels=1000,
            minimum_term_frequency=5, # drop terms occurring fewer than 5 times
            save_svg_button=True,                          
)

# write visualization to html file
fname = '../KED2022/materials/data/gender_differences_final.html'
with open(fname, 'w', encoding='utf-8') as f:
    f.write(html)

In [None]:
How to read the scattertext plot (orientation):

Terms in upper right    = frequently used by male and female speakers alike
Terms in lower right    = often used by female speakers
Terms in upper left     = often used by male speakers
Terms in lower left     = infrequently used by male and female speakers alike
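
The four regions above can be sketched as a small lookup. This is a simplification, assuming both frequencies are already scaled to the range 0 to 1; the function `quadrant` and the 0.5 threshold are hypothetical, and scattertext itself positions terms on continuous axes rather than in hard quadrants.

```python
def quadrant(male_freq, female_freq, threshold=0.5):
    """Map scaled frequencies in [0, 1] to the plot region described above.

    The y-axis carries the male frequency, the x-axis the female frequency.
    """
    male_high = male_freq >= threshold
    female_high = female_freq >= threshold
    if male_high and female_high:
        return "upper right: frequent for both"
    if male_high:
        return "upper left: often used by male speakers"
    if female_high:
        return "lower right: often used by female speakers"
    return "lower left: infrequent for both"

# Hypothetical scaled frequencies, for illustration only.
print(quadrant(0.9, 0.1))  # prints upper left: often used by male speakers
```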

### What can we see?

Top 3 terms in male speeches:       Geschichte, Zusammenhalt, Tessin  
Top 3 terms in female speeches:     gemeinsam, Grenzen, brauchen

Top 3 terms in both genders:        Menschen, Land, Schweiz

### Important to keep in mind

Document count total:   96
Male document count:    67
Female document count:  29
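
Because the two groups differ in size, raw term counts are not directly comparable. A minimal sketch of the arithmetic, using the document counts above; the helper `per_document` is hypothetical and only illustrates normalising a raw count by group size.

```python
# Document counts as reported above.
doc_counts = {"male": 67, "female": 29}
total = sum(doc_counts.values())  # 96 documents in total

# Share of each group in percent, rounded to one decimal place.
shares = {group: round(100 * n / total, 1) for group, n in doc_counts.items()}
print(shares)  # prints {'male': 69.8, 'female': 30.2}

def per_document(term_count, group):
    """Average number of occurrences of a term per document in a group."""
    return term_count / doc_counts[group]
```

With these counts, a term that occurs equally often in both groups is in fact more than twice as frequent per document in the female speeches.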