# Contrastive Text Analysis with DraCor and Scattertext

[Scattertext](https://github.com/JasonKessler/scattertext) visualizes the linguistic differences between two groups of texts in a two-dimensional scatter plot. Here we use it to contrast the text spoken by characters of different genders.

## Requirements

We first install the libraries that are necessary to process the data:

In [None]:
!pip install scattertext spacy spacy-transformers pandas
!python -m spacy download de_dep_news_trf

## Acquiring the Corpus

In [None]:
from io import StringIO
from urllib import request

import pandas as pd

dracor_api = "https://dracor.org/api"                # API endpoint for DraCor

def get_character_text(corpus, play):
    """Download the spoken text of each character of a play as a dataframe."""
    url = f"{dracor_api}/corpora/{corpus}/play/{play}/spoken-text-by-character"
    req = request.Request(url)
    req.add_header("Accept", "text/csv")             # ask for CSV output
    with request.urlopen(req) as resp:               # download data
        data = resp.read().decode("utf-8")
    return pd.read_csv(StringIO(data))               # parse CSV into dataframe
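
The URL above is assembled by plain string concatenation, which works for the play names used here. As a minimal sketch of a slightly more defensive variant, the path segments can be percent-encoded first. The helper name `build_spoken_text_url` is hypothetical and not part of DraCor or of this notebook; the URL layout is simply the one used above.

```python
from urllib.parse import quote

DRACOR_API = "https://dracor.org/api"  # same endpoint as in the cell above

def build_spoken_text_url(corpus, play, base=DRACOR_API):
    """Hypothetical helper: build the request URL with percent-encoded path segments."""
    corpus_part = quote(corpus, safe="")   # safe="" also encodes "/" inside a segment
    play_part = quote(play, safe="")
    return f"{base}/corpora/{corpus_part}/play/{play_part}/spoken-text-by-character"

print(build_spoken_text_url("ger", "goethe-faust-eine-tragoedie"))
```

Encoding each segment separately keeps an unexpected character in a corpus or play name from silently changing the request path.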

We download the speaker text for [Goethe's Faust](https://dracor.org/ger/goethe-faust-eine-tragoedie):

In [None]:
play = "goethe-faust-eine-tragoedie"
text = get_character_text("ger", play)

What's the gender distribution of the speakers?

In [None]:
text.Gender.value_counts()

Scattertext contrasts exactly two categories, one per axis, so we remove the text of speakers whose gender is unknown:

In [None]:
text = text[text.Gender != "UNKNOWN"]
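
The effect of this filter can be checked on a small, hand-written dataframe. The sketch below assumes the same column names as the downloaded data (`Label`, `Gender`, `Text`); the four sample rows are made up for illustration and are not taken from the API response.

```python
import pandas as pd

# Hand-written sample rows; column names mirror those used in this notebook.
sample = pd.DataFrame({
    "Label": ["Faust", "Margarete", "Chor", "Mephistopheles"],
    "Gender": ["MALE", "FEMALE", "UNKNOWN", "MALE"],
    "Text": ["Habe nun, ach!", "Meine Ruh ist hin.",
             "Christ ist erstanden!", "Ich bin der Geist."],
})

# Keep only speakers with a known gender.
known = sample[sample.Gender != "UNKNOWN"]

# Exactly two categories should remain for a two-axis plot.
remaining = sorted(known.Gender.unique())
counts = known.Gender.value_counts()
print(remaining)
print(counts)
```

If more than two values remain after filtering, the extra categories would have to be merged or dropped before building the corpus.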

## Building the Scattertext Page

We largely follow [this tutorial](https://github.com/JasonKessler/scattertext#using-scattertext-as-a-text-analysis-library-finding-characteristic-terms-and-their-associations).

First, we load the trained German language model:

In [None]:
import spacy
nlp = spacy.load("de_dep_news_trf")

Then we create a Scattertext corpus:

In [None]:
import scattertext as st
corpus = st.CorpusFromPandas(text, category_col='Gender', text_col='Text', nlp=nlp).build()
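
To get an intuition for what such a corpus tallies, here is a plain-Python sketch that counts how often each word occurs per category. It is not Scattertext's implementation: the function name `term_freqs_by_category` is hypothetical, the regex tokenizer is a stand-in for the spaCy pipeline, and the two sample lines are hand-written.

```python
import re
from collections import Counter, defaultdict

def term_freqs_by_category(rows):
    """Count lower-cased word tokens separately for each category."""
    freqs = defaultdict(Counter)
    for category, text in rows:
        tokens = re.findall(r"\w+", text.lower())  # crude stand-in for spaCy tokenization
        freqs[category].update(tokens)
    return freqs

rows = [
    ("MALE", "Ich bin der Geist, der stets verneint"),
    ("FEMALE", "Meine Ruh ist hin, mein Herz ist schwer"),
]
freqs = term_freqs_by_category(rows)
print(freqs["MALE"].most_common(3))
print(freqs["FEMALE"].most_common(3))
```

Per-category counts of this kind are, roughly, what determines where a term ends up in the scatter plot: frequent in one category and rare in the other pushes it toward a corner.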

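Next, we print the ten terms that most strongly differentiate our corpus from Scattertext's background corpus. Note that the tutorial describes this background as a general English corpus, so for a German play the ranking should be read with caution: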

In [None]:
list(corpus.get_scaled_f_scores_vs_background().index[:10])

Finally, we create an HTML page containing the Scattertext visualization:

In [None]:
html = st.produce_scattertext_explorer(corpus,
          category='MALE',
          category_name='Male',
          not_category_name='Female',
          width_in_pixels=1000,
          metadata=text['Label'])
with open(play + ".html", "w", encoding="utf-8") as f:   # close the file reliably
    f.write(html)

Here's the result: [goethe-faust-eine-tragoedie.html](goethe-faust-eine-tragoedie.html)