# Poems: "Deutscher Lyrik-Korpus"

We use the dataset [Deutscher Lyrik-Korpus](https://github.com/thomasnikolaushaider/DLK) collected by Thomas Nikolaus Haider.
The dataset is stored as bzip2-compressed JSON. We use [pandas](https://pandas.pydata.org) to load it as a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html); with `compression='infer'`, pandas detects the compression from the `.bz2` file extension:

In [None]:
import pandas as pd

DATASET = 'deutscher-lyrik-korpus-v2.json.bz2'

poems = pd.read_json(DATASET, compression='infer')
poems.head()

## A quick look at the dataset

Let us get an overview of the dataset, starting with the total number of poems:

In [None]:
len(poems)

To visualize statistics of the dataset, we use the statistical plotting library [seaborn](https://seaborn.pydata.org/) and select SVG as the output format for inline figures.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# Render inline figures as SVG. This replaces
# IPython.display.set_matplotlib_formats, which is deprecated since IPython 7.23.
%config InlineBackend.figure_format = 'svg'
sns.set()

### Chronological distribution of poems

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax = sns.histplot(poems['year'].astype(int))
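
The histogram shows the overall shape, but exact counts are easier to read from a table. The following is a minimal sketch of counting poems per decade; the `years` series holds made-up values standing in for `poems['year']`, and the names `decades` and `poems_per_decade` are only illustrative.

```python
import pandas as pd

# Hypothetical toy years standing in for poems['year'].
years = pd.Series([1779, 1781, 1788, 1827, 1844, 1929])

# Integer division by 10 maps each year to the first year of its decade.
decades = (years.astype(int) // 10) * 10

# Count poems per decade and order the result chronologically.
poems_per_decade = decades.value_counts().sort_index()
print(poems_per_decade)
```

On the real data, the same two lines could be applied to `poems['year']` directly, assuming every entry converts cleanly to an integer.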

### Most frequent authors

In [None]:
poems_per_author = poems['author'].value_counts()
top_authors = poems_per_author.head(20)
top_authors

In [None]:
ax = top_authors.plot.barh()

### Distribution of poem lengths

In [None]:
lengths = [len(poem) for poem in poems['text']]

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax = sns.histplot(lengths)

A few very long poems stretch the x-axis so far that the histogram shows little detail, so let us cap the lengths at a maximum of 10000:

In [None]:
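# Cap every length at 10000; longer poems all end up in the last bin.
cut_lengths = [min(length, 10000) for length in lengths]
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax = sns.histplot(cut_lengths)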
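
Capping can also be expressed with `pandas.Series.clip`, which makes it easy to report how many values were affected. This is a sketch on made-up lengths, not on the corpus itself; the threshold name `MAX_LENGTH` and the counter `n_clipped` are assumptions introduced for illustration.

```python
import pandas as pd

MAX_LENGTH = 10000  # same cap as in the cell above

# Hypothetical toy lengths standing in for the real `lengths` list.
lengths = pd.Series([120, 950, 40000, 10000, 15])

# clip(upper=...) replaces every value above the cap with the cap itself.
cut_lengths = lengths.clip(upper=MAX_LENGTH)

# Number of poems that were actually shortened by the cap.
n_clipped = int((lengths > MAX_LENGTH).sum())
print(cut_lengths.tolist(), n_clipped)
```

Knowing `n_clipped` helps when reading the capped histogram, because all clipped poems pile up in the rightmost bin.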

## Selecting poems for a classification experiment 

We now select a slice of this dataset for a classification experiment, focussing on a few authors:

In [None]:
authors = ['Goethe, Johann Wolfgang', 'Heine, Heinrich', 'Tucholsky, Kurt']
selected_poems = poems[poems['author'].isin(authors)]
selected_poems['author'].value_counts()
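
Note that `isin` silently ignores names that do not occur in the data, so a misspelled author simply yields no rows. The sketch below, on a made-up frame with placeholder author names, shows one way to detect such names and to inspect the class balance as relative shares; `missing` and `shares` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for `poems`.
toy = pd.DataFrame({
    'author': ['A', 'A', 'B', 'C', 'A', 'D'],
    'text': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

# 'Z' does not occur in the data, e.g. because of a typo.
authors = ['A', 'B', 'C', 'Z']
selected = toy[toy['author'].isin(authors)]

# Requested authors that matched no poem at all.
missing = set(authors) - set(selected['author'])

# Relative share of each author among the selected poems.
shares = selected['author'].value_counts(normalize=True)
print(missing, shares.to_dict())
```

Strongly unequal shares are worth knowing about before training a classifier, since they affect how accuracy figures should be read.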

In [None]:
# Save the selection as bzip2-compressed JSON for the classification experiment.
EXTRACT = 'selected_poems.json.bz2'
selected_poems.to_json(EXTRACT, compression='bz2')
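
To confirm that an extract like this can be read back, a round trip through `to_json` and `read_json` is a cheap check. The following sketch uses a made-up frame and a temporary directory instead of the real file; the file name `toy_poems.json.bz2` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame; the non-contiguous index mimics a filtered selection.
toy = pd.DataFrame(
    {'author': ['A', 'B'], 'year': [1800, 1900], 'text': ['x', 'y']},
    index=[3, 7],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_poems.json.bz2')
    # Write and read with the same compression setting.
    toy.to_json(path, compression='bz2')
    restored = pd.read_json(path, compression='bz2')

# Compare shape and content of the restored frame with the original.
same_shape = restored.shape == toy.shape
same_authors = list(restored['author']) == list(toy['author'])
print(same_shape, same_authors)
```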