# Introduction

In this assignment, you are asked to produce an analysis that follows a set of instructions. You can do this any way you like, as long as you show me your results and the code you used to get there. The easier your work is for me to replicate, and the clearer the code is, the higher your mark will be. One option is to make a copy of this file, add code snippets, and submit the RMarkdown file along with a PDF of the completed results. Another option is to send me a link to an .ipynb notebook file on GitHub.


### 1. Install dependencies and load data from URL


In [1]:
# Install dependencies
!pip install corpus_toolkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Import dependencies
import requests
import pandas as pd
from bs4 import BeautifulSoup
from random import randint
import re
from corpus_toolkit import corpus_tools as ct
import plotly.express as px
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [3]:
# get text file from URL
response = requests.get("https://www.gutenberg.org/cache/epub/1934/pg1934.txt")
text_raw = response.text

In [4]:
# Split full text into the individual lines and store result in pandas df
content = text_raw.splitlines()
df = pd.DataFrame(content, columns=["text"])
# We see that the empty lines are retained and included as empty rows in the dataframe
df.head()

Unnamed: 0,text
0,﻿The Project Gutenberg eBook of Songs of Innoc...
1,
2,This eBook is for the use of anyone anywhere i...
3,most other parts of the world at no cost and w...
4,"whatsoever. You may copy it, give it away or r..."


In [5]:
# Extract content of full document corpus and drop Gutenberg project headers and footers
start_index = content.index("CONTENTS")
end_index = content.index("*** END OF THE PROJECT GUTENBERG EBOOK SONGS OF INNOCENCE AND OF EXPERIENCE ***")
content = content[start_index:end_index]

In [6]:
# Print the sliced list from index 5 onward (labelled "Head") and its last five items (labelled "Tail")
print(f"Head: {content[5:]}; \nTail: {content[-5:]}")

Head: ['The Shepherd', 'The Echoing Green', 'The Lamb', 'The Little Black Boy', 'The Blossom', 'The Chimney-Sweeper', 'The Little Boy Lost', 'The Little Boy Found', 'Laughing Song', 'A Cradle Song', 'The Divine Image', 'Holy Thursday', 'Night', 'Spring', 'Nurse’s Song', 'Infant Joy', 'A Dream', 'On Another’s Sorrow', '', '             SONGS OF EXPERIENCE', '', 'Introduction', 'Earth’s Answer', 'The Clod and the Pebble', 'Holy Thursday', 'The Little Girl Lost', 'The Little Girl Found', 'The Chimney-Sweeper', 'Nurse’s Song', 'The Sick Rose', 'The Fly', 'The Angel', 'The Tiger', 'My Pretty Rose-Tree', 'Ah, Sunflower', 'The Lily', 'The Garden of Love', 'The Little Vagabond', 'London', 'The Human Abstract', 'Infant Sorrow', 'A Poison Tree', 'A Little Boy Lost', 'A Little Girl Lost', 'A Divine Image', 'A Cradle Song', 'To Tirzah', 'The Schoolboy', 'The Voice of the Ancient Bard', '', '', '', '', 'SONGS OF INNOCENCE', '', '', '', '', 'INTRODUCTION', '', '', 'Piping down the valleys wild,', ' 

Next, we define functions that check for specific patterns in the list to extract titles, stanzas, and poems.

We first show, with one example each, what the **row patterns** that identify new poem and book titles look like:

In [7]:
# Example pattern for a poem title
i = content.index("THE SHEPHERD")
content[i-4:i+3]

['', '', '', '', 'THE SHEPHERD', '', '']

In [8]:
# Example pattern for new book title
i = content.index("SONGS OF EXPERIENCE")
content[i-4:i+5]

['', '', '', '', 'SONGS OF EXPERIENCE', '', '', '', '']

In [9]:
def is_book_title(index):
  """
  Check whether the item at `index` in the content list matches the pattern of a book title:
  four empty lines before it and four empty lines after it.
  """
  return all(item == '' for item in content[index-4:index]) and all(item == '' for item in content[index+1:index+5])

In [10]:
def is_poem_title(index):
  """
  Check whether the item at `index` in the content list matches the pattern of a poem title:
  four empty lines before it and two empty lines after it.
  (Book titles also satisfy this pattern, so the book title check has to come first.)
  """
  if is_book_title(index):
    return False
  return all(item == '' for item in content[index-4:index]) and all(item == '' for item in content[index+1:index+3])

In [11]:
# Check that the number of matches is correct (2 book titles, 47 poem titles)
nr_matched_book_titles = sum(is_book_title(i) for i in range(len(content)))
nr_matched_poem_titles = sum(is_poem_title(i) for i in range(len(content)))

print(f"Number of matched book titles: {nr_matched_book_titles}\nNumber of matched poem titles: {nr_matched_poem_titles}")

Number of matched book titles: 2
Number of matched poem titles: 47
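The blank-line patterns can be tried out on a small hand-made list. The following is a minimal sketch, not the notebook's own functions: the toy list, the helper `blanks`, and the names `looks_like_book_title` / `looks_like_poem_title` are hypothetical, and the extra checks (the line itself must be non-empty, the slice start is clamped at 0) are assumptions added for robustness.

```python
# Toy list mimicking the layout described above (hypothetical data).
toy = (
    [""] * 4 + ["BOOK ONE"] + [""] * 4        # book title: 4 blanks before, 4 after
    + ["FIRST POEM"] + ["", ""]                # poem title: 4 blanks before, 2 after
    + ["line a", "line b"]
    + [""] * 4 + ["SECOND POEM"] + ["", ""]
    + ["line c"]
)


def blanks(lines, start, stop):
    """True if every item in lines[start:stop] is an empty string."""
    return all(item == "" for item in lines[max(start, 0):stop])


def looks_like_book_title(lines, i):
    # Assumption: a title line is itself non-empty.
    return lines[i] != "" and blanks(lines, i - 4, i) and blanks(lines, i + 1, i + 5)


def looks_like_poem_title(lines, i):
    # Book titles also satisfy the poem pattern, so rule them out first.
    if looks_like_book_title(lines, i):
        return False
    return lines[i] != "" and blanks(lines, i - 4, i) and blanks(lines, i + 1, i + 3)


book_titles = [toy[i] for i in range(len(toy)) if looks_like_book_title(toy, i)]
poem_titles = [toy[i] for i in range(len(toy)) if looks_like_poem_title(toy, i)]
print(book_titles, poem_titles)
```

Clamping the slice start avoids negative indices wrapping around to the end of the list for the first few lines.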


In [12]:
# Define output list that contains one dict per line, which will later become a row in the df
rows = []

# Temporary variables that store the current titles and counters
current_book_title = None
current_poem_title = None
current_stanza_number = None
current_line_number = None

for index, line in enumerate(content):
  if is_book_title(index):
    current_book_title = line
    print(f"New Book Title set: {current_book_title}")
    continue
  elif is_poem_title(index): # Only check whether poem title pattern is matched if it is not a book title
    current_poem_title = line
    print(f"New Poem Title set: {current_poem_title}\nStanza and Line counters reset!")
    current_line_number = 0
    current_stanza_number = -1 # reset stanza number to -1 since there are two empty lines after a new poem title
    continue
  elif current_poem_title is None:
    # Skip everything before the first poem title (e.g. the table of contents)
    continue
  elif line == '':
    current_stanza_number += 1
    continue # skip empty lines and don't add them to the resulting dataframe
  else:
    current_line_number += 1
    #print(f"Current Book: {current_book_title}\nCurrent Poem: {current_poem_title}\nStanza Number: {current_stanza_number}\nLine Number: {current_line_number}")

  rows.append({
      'line_text': line,
      'book_title': current_book_title,
      'poem_title': current_poem_title,
      'stanza_number': current_stanza_number,
      'line_number': current_line_number
  })
   


New Book Title set: SONGS OF INNOCENCE
New Poem Title set: INTRODUCTION
Stanza and Line counters reset!
New Poem Title set: THE SHEPHERD
Stanza and Line counters reset!
New Poem Title set: THE ECHOING GREEN
Stanza and Line counters reset!
New Poem Title set: THE LAMB
Stanza and Line counters reset!
New Poem Title set: THE LITTLE BLACK BOY
Stanza and Line counters reset!
New Poem Title set: THE BLOSSOM
Stanza and Line counters reset!
New Poem Title set: THE CHIMNEY-SWEEPER
Stanza and Line counters reset!
New Poem Title set: THE LITTLE BOY LOST
Stanza and Line counters reset!
New Poem Title set: THE LITTLE BOY FOUND
Stanza and Line counters reset!
New Poem Title set: LAUGHING SONG
Stanza and Line counters reset!
New Poem Title set: A CRADLE SONG
Stanza and Line counters reset!
New Poem Title set: THE DIVINE IMAGE
Stanza and Line counters reset!
New Poem Title set: HOLY THURSDAY
Stanza and Line counters reset!
New Poem Title set: NIGHT
Stanza and Line counters reset!
New Poem Title set: S

In [13]:
# Store results in dataframe and print it formatted
df = pd.DataFrame(rows)
df

Unnamed: 0,line_text,book_title,poem_title,stanza_number,line_number
0,"Piping down the valleys wild,",SONGS OF INNOCENCE,INTRODUCTION,1,1
1,"Piping songs of pleasant glee,",SONGS OF INNOCENCE,INTRODUCTION,1,2
2,"On a cloud I saw a child,",SONGS OF INNOCENCE,INTRODUCTION,1,3
3,And he laughing said to me:,SONGS OF INNOCENCE,INTRODUCTION,1,4
4,‘Pipe a song about a Lamb!’,SONGS OF INNOCENCE,INTRODUCTION,2,5
...,...,...,...,...,...
904,Tangled roots perplex her ways;,SONGS OF EXPERIENCE,THE VOICE OF THE ANCIENT BARD,1,7
905,How many have fallen there!,SONGS OF EXPERIENCE,THE VOICE OF THE ANCIENT BARD,1,8
906,They stumble all night over bones of the dead;,SONGS OF EXPERIENCE,THE VOICE OF THE ANCIENT BARD,1,9
907,And feel—they know not what but care;,SONGS OF EXPERIENCE,THE VOICE OF THE ANCIENT BARD,1,10


## Getting and parsing texts

To start with, you are asked to retrieve *Songs of Innocence and of Experience* by William Blake from Project Gutenberg. It is located at https://www.gutenberg.org/cache/epub/1934/pg1934.txt. This is a collection of poems in two books: *Songs of Innocence* and *Songs of Experience*.

Parse this into a dataframe where each row is a line of a poem (there should be no empty lines). The following columns should describe where each line was found:

- line_number 
- stanza_number
- poem_title
- book_title



## Visualising text data

- Create a histogram showing the number of lines per poem



In [14]:
# Group by book and poem title and aggregate the line_number column with the max function
# It is important to also group by the book title, since some poems in the two books share a name but are in fact different poems
df_histogram = df[['book_title','poem_title', 'line_number']].groupby(['book_title','poem_title']).agg(line_total_number = ('line_number', 'max'))
df_histogram

Unnamed: 0_level_0,Unnamed: 1_level_0,line_total_number
book_title,poem_title,Unnamed: 2_level_1
SONGS OF EXPERIENCE,A CRADLE SONG,16
SONGS OF EXPERIENCE,A DIVINE IMAGE,8
SONGS OF EXPERIENCE,A LITTLE BOY LOST,24
SONGS OF EXPERIENCE,A LITTLE GIRL LOST,34
SONGS OF EXPERIENCE,A POISON TREE,16
SONGS OF EXPERIENCE,"AH, SUNFLOWER",8
SONGS OF EXPERIENCE,EARTH’S ANSWER,25
SONGS OF EXPERIENCE,HOLY THURSDAY,16
SONGS OF EXPERIENCE,INFANT SORROW,8
SONGS OF EXPERIENCE,INTRODUCTION,20


In [43]:
# Plot histogram
fig = px.histogram(df_histogram, 
                   x="line_total_number", 
                   text_auto=True,
                   template="simple_white",
                   color_discrete_sequence=['#5ab4ac'],
                   labels = {
                       'line_total_number': "Number of lines"
                   },
                   width=700,
                   title='Number of Lines per Poem<br><sup>Histogram including all 47 poems from both books in the data.</sup>')
fig.update_layout(
    yaxis_title = "Count"
)
fig.show()

- Create a document feature matrix treating each line as a document

In [16]:
# Use NLTK's lemmatizer and stopwords
# Build sklearn's default tokenizer and the lemmatizer once, instead of on every call
default_tokenizer = CountVectorizer().build_tokenizer()
lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(str_input):
    # Tokenize with sklearn's default token pattern
    tokens = default_tokenizer(str_input)

    # Lemmatize and remove stopwords
    return [lemmatizer.lemmatize(token) for token in tokens if token not in stopwords]

In [17]:
lines = df.line_text
vec = CountVectorizer(
    # CountVectorizer lowercases the text by default; the custom tokenizer removes English stopwords,
    # and min_df=3 drops tokens that appear in fewer than three documents (document frequency < 3).
    # The tokenization pattern used in the custom tokenizer is the same as CountVectorizer's default.
    # The custom tokenizer function is only needed to add lemmatization via the nltk library.
    min_df = 3,
    tokenizer=lemma_tokenizer
)

# Apply vectorizer
X = vec.fit_transform(lines)

# Print features to see if clean
print(vec.get_feature_names_out())

['among' 'angel' 'another' 'arise' 'armed' 'around' 'art' 'asleep' 'away'
 'babe' 'beam' 'bear' 'beast' 'bed' 'began' 'beguiles' 'bird' 'black'
 'bless' 'blossom' 'book' 'bore' 'born' 'bosom' 'bound' 'boy' 'break'
 'breast' 'bright' 'bud' 'call' 'came' 'care' 'chain' 'child' 'church'
 'clime' 'cloud' 'cold' 'come' 'could' 'covered' 'cruelty' 'cry' 'dale'
 'dare' 'dark' 'day' 'death' 'deep' 'delight' 'desert' 'desire' 'destroy'
 'dew' 'didst' 'divine' 'done' 'door' 'doth' 'dread' 'dream' 'dress'
 'drink' 'drive' 'earth' 'er' 'evening' 'ever' 'every' 'eye' 'face'
 'father' 'fear' 'feel' 'filled' 'fire' 'fled' 'flow' 'flower' 'fly' 'foe'
 'follow' 'foot' 'form' 'free' 'garden' 'gave' 'girl' 'give' 'go' 'god'
 'gold' 'golden' 'gone' 'grass' 'green' 'grey' 'grief' 'ground' 'grove'
 'hair' 'hand' 'happy' 'head' 'hear' 'heard' 'hears' 'heart' 'heat'
 'heaven' 'high' 'hill' 'holy' 'home' 'human' 'image' 'immortal' 'infant'
 'innocent' 'jealousy' 'joy' 'keep' 'kiss' 'know' 'lamb' 'laugh'
 'laug
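The effect of a minimum document frequency can be illustrated without scikit-learn. The sketch below is a plain-Python illustration of the idea, not scikit-learn's implementation; `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df):
    """Keep tokens that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return sorted(tok for tok, n in doc_freq.items() if n >= min_df)


docs = [
    ["little", "lamb", "little"],
    ["little", "boy"],
    ["little", "lamb"],
    ["lamb", "night"],
]
vocab = filter_by_document_frequency(docs, min_df=3)
print(vocab)
```

Note that the threshold counts documents, not occurrences: a token repeated many times within a single line still has a document frequency of one.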

In [18]:
# Build dfm from fitted vectorizer
dfm = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

dfm

Unnamed: 0,among,angel,another,arise,armed,around,art,asleep,away,babe,...,winter,wish,woe,work,worn,would,wrath,year,youth,youthful
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
904,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
905,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
906,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
907,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


- Create a separate document feature matrix treating each poem as a document

In [19]:
# Again, it is important to also group by the book title
# Construct a series with the aggregated text per poem
poems = df[['book_title', 'poem_title', 'line_text']].groupby(['book_title', 'poem_title']).agg(lambda col: ' '.join(col)).line_text
poems

book_title           poem_title                   
SONGS OF EXPERIENCE  A CRADLE SONG                    Sleep, sleep, beauty bright, Dreaming in the j...
                     A DIVINE IMAGE                   Cruelty has a human heart,     And Jealousy a ...
                     A LITTLE BOY LOST                ‘Nought loves another as itself,     Nor vener...
                     A LITTLE GIRL LOST               Children of the future age, Reading this indig...
                     A POISON TREE                    I was angry with my friend: I told my wrath, m...
                     AH, SUNFLOWER                    Ah, sunflower, weary of time,     Who countest...
                     EARTH’S ANSWER                       Earth raised up her head From the darkness...
                     HOLY THURSDAY                    Is this a holy thing to see     In a rich and ...
                     INFANT SORROW                    My mother groaned, my father wept: Into the da...
             

In [20]:
# Once again, apply vectorizer
X = vec.fit_transform(poems)

# Print features to see if clean
print(vec.get_feature_names_out())

['among' 'angel' 'another' 'arise' 'around' 'art' 'asleep' 'away' 'babe'
 'beam' 'bear' 'beast' 'bed' 'began' 'bird' 'black' 'blossom' 'book'
 'bore' 'born' 'bosom' 'bound' 'boy' 'break' 'breast' 'bright' 'bud'
 'call' 'came' 'care' 'chain' 'child' 'church' 'clime' 'cloud' 'cold'
 'come' 'could' 'covered' 'cruelty' 'cry' 'dale' 'dark' 'day' 'death'
 'deep' 'delight' 'desire' 'destroy' 'dew' 'done' 'door' 'dream' 'dress'
 'drive' 'earth' 'er' 'evening' 'ever' 'every' 'eye' 'face' 'father'
 'fear' 'feel' 'filled' 'fled' 'flow' 'flower' 'fly' 'follow' 'foot'
 'form' 'free' 'garden' 'girl' 'give' 'go' 'god' 'gold' 'golden' 'gone'
 'grass' 'green' 'grey' 'ground' 'hair' 'hand' 'happy' 'head' 'hear'
 'heard' 'heart' 'heaven' 'high' 'holy' 'home' 'human' 'image' 'infant'
 'jealousy' 'joy' 'kiss' 'know' 'lamb' 'laughing' 'lay' 'led' 'let' 'lick'
 'life' 'light' 'like' 'lion' 'little' 'live' 'look' 'lost' 'love' 'made'
 'make' 'maker' 'man' 'mane' 'many' 'may' 'meet' 'men' 'mercy' 'merry'
 'mil

In [21]:
# Build dfm
dfm_poems = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out(), index=poems.index)
dfm_poems

Unnamed: 0_level_0,Unnamed: 1_level_0,among,angel,another,arise,around,art,asleep,away,babe,beam,...,wild,wind,wing,winter,wish,woe,work,worn,youth,youthful
book_title,poem_title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
SONGS OF EXPERIENCE,A CRADLE SONG,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
SONGS OF EXPERIENCE,A DIVINE IMAGE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SONGS OF EXPERIENCE,A LITTLE BOY LOST,0,0,2,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SONGS OF EXPERIENCE,A LITTLE GIRL LOST,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,1
SONGS OF EXPERIENCE,A POISON TREE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SONGS OF EXPERIENCE,"AH, SUNFLOWER",0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
SONGS OF EXPERIENCE,EARTH’S ANSWER,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
SONGS OF EXPERIENCE,HOLY THURSDAY,0,0,0,0,0,0,0,0,2,0,...,0,0,0,1,0,0,0,0,0,0
SONGS OF EXPERIENCE,INFANT SORROW,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SONGS OF EXPERIENCE,INTRODUCTION,1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,1,0,0


- Using one of these document feature matrices, create a plot that compares the frequency of words in each book. Comment on the features that are more or less frequent in one book than another.

In [22]:
dfm_poems_innocence = dfm_poems.filter(like='SONGS OF INNOCENCE', axis=0)
dfm_poems_experience = dfm_poems.filter(like='SONGS OF EXPERIENCE', axis=0)

In [23]:
# get feature counts and sort descending
counts_innocence = dfm_poems_innocence.sum(axis=0).sort_values(ascending=False)
counts_experience = dfm_poems_experience.sum(axis=0).sort_values(ascending=False)

print(f"Five most frequent features per book:\nSongs of Innocence: \n{counts_innocence[:5]}\n\nSongs of Experience: \n{counts_experience[:5]}")

Five most frequent features per book:
Songs of Innocence: 
thee      25
little    21
sweet     19
joy       18
lamb      17
dtype: int64

Songs of Experience: 
night     20
thy       19
love      14
sleep     13
little    13
dtype: int64


In [24]:
df_counts = pd.concat([counts_innocence, counts_experience], keys=["Songs of Innocence", "Songs of Experience"], names=['book_title', 'feature']).to_frame(name="count").reset_index()
df_counts

Unnamed: 0,book_title,feature,count
0,Songs of Innocence,thee,25
1,Songs of Innocence,little,21
2,Songs of Innocence,sweet,19
3,Songs of Innocence,joy,18
4,Songs of Innocence,lamb,17
...,...,...,...
479,Songs of Experience,boy,0
480,Songs of Experience,call,0
481,Songs of Experience,stream,0
482,Songs of Experience,merry,0


First, we look at the terms with the largest difference in number of occurrences between the two books and plot the top ten:

In [25]:
# Get the ten features with the largest difference in occurrences between the two books
df_top_ten_diff_words = df_counts[['feature', 'count']].groupby(['feature']).agg(diff = ('count',np.ptp)).sort_values('diff', ascending=False).head(10).reset_index()
df_top_ten_diff_words

Unnamed: 0,feature,diff
0,thee,20
1,lamb,16
2,night,11
3,sweet,10
4,merry,9
5,fear,9
6,green,9
7,infant,8
8,boy,8
9,little,8
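With only two books per feature, the peak-to-peak value (maximum minus minimum) is simply the absolute difference between the two counts. A small plain-Python sketch of that ranking follows; the Innocence counts and the "night" counts are taken from the outputs above, while the Experience counts for "thee" and "lamb" are inferred from the reported differences, and `count_range` is a hypothetical helper.

```python
counts = {
    "Songs of Innocence": {"thee": 25, "lamb": 17, "night": 9},
    # "thee" and "lamb" are inferred from the reported differences (20 and 16)
    "Songs of Experience": {"thee": 5, "lamb": 1, "night": 20},
}


def count_range(feature):
    """Peak-to-peak of a feature's counts across the books (max minus min)."""
    values = [book[feature] for book in counts.values()]
    return max(values) - min(values)


ranked = sorted(counts["Songs of Innocence"], key=count_range, reverse=True)
print([(feature, count_range(feature)) for feature in ranked])
```

Because the measure is unsigned, it does not say which book uses the word more often; that only becomes visible in the grouped bar chart.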


In [26]:
# Filter df_counts for top ten diff words
df_counts_top_diff = df_counts[df_counts['feature'].isin(df_top_ten_diff_words.feature)]
df_counts_top_diff

Unnamed: 0,book_title,feature,count
0,Songs of Innocence,thee,25
1,Songs of Innocence,little,21
2,Songs of Innocence,sweet,19
4,Songs of Innocence,lamb,17
8,Songs of Innocence,green,12
10,Songs of Innocence,infant,11
16,Songs of Innocence,merry,9
20,Songs of Innocence,night,9
26,Songs of Innocence,boy,8
156,Songs of Innocence,fear,2


In [27]:
# Plot a grouped bar chart of the ten words with the largest difference in counts
# We use colorblind-friendly colors from ColorBrewer here
fig = px.bar(
    df_counts_top_diff, 
    x="count", 
    y="feature",
    color='book_title', 
    barmode='group',
    template="simple_white",
    title="Absolute Feature Counts per Book<br><sup>Including the ten tokens with highest difference in counts across the two books.</sup>",
    labels={
      'book_title': 'Book',
      'count': 'Count',
      'feature': 'Feature'
    },
    text_auto=True,
    color_discrete_map = {
        'Songs of Experience': '#1b9e77',
        'Songs of Innocence': '#d95f02'
    }
)
fig.update_layout(
    legend_title="Book",
    font=dict(
        family="Corbel",
        size=18,
        color="black"
    )
)
fig.show()

**Comment:**

This plot shows the feature counts for the ten tokens whose counts differ most between the two books.

The differences are large for very specific tokens such as "lamb" and "infant". This makes sense, since Songs of Innocence contains poems that have these as their core subjects: The Lamb and Infant Joy.

The very large difference for the token "thee" is, after a qualitative look at the text, also easy to explain: it occurs very frequently in the poem "The Lamb" in Songs of Innocence.

Next, we run a statistical analysis of relative frequencies: keyness.

Keyness is calculated from two term frequency dictionaries (containing raw frequency values), one for each book:


In [28]:
# Create frequency dictionaries per book
frequ_dict_innocence = df_counts[df_counts['book_title']=="Songs of Innocence"].set_index('feature')['count'].to_dict()
frequ_dict_experience = df_counts[df_counts['book_title']=="Songs of Experience"].set_index('feature')['count'].to_dict()

In [29]:
# Calculate keyness
# For this we use the Python library corpus-toolkit, whose keyness implementation can be found here: https://github.com/kristopherkyle/corpus_toolkit/blob/master/corpus_toolkit/corpus_tools.py
# I use the log ratio here since it generally gives a better estimate for low-frequency terms than, e.g., chi-square.
# The log ratio is the binary log of the ratio of relative frequencies, so every extra point of log ratio represents a doubling of the frequency difference between the two books for the keyword under consideration
corp_key = ct.keyness(frequ_dict_innocence,frequ_dict_experience, effect = "log-ratio")
# print top 10 features with highest keyness
ct.head(corp_key,10)

merry	29.841947384117425
boy	29.67202238267511
small	29.256984883396267
call	28.993950477562475
name	28.993950477562475
laughing	28.67202238267511
stream	28.256984883396267
girl	28.256984883396267
lamb	4.184060464826551
pity	3.0965976235762125
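The log ratio described in the comments can be worked through by hand. The following is a minimal sketch of the formula only: the function `log_ratio`, the corpus sizes, and the `zero_substitute` value are assumptions for illustration and are not taken from corpus_toolkit.

```python
import math


def log_ratio(freq_a, size_a, freq_b, size_b, zero_substitute=1e-8):
    """Binary log of the ratio of relative frequencies (corpus A over corpus B).

    Assumption: a zero count is replaced by a tiny constant so that the
    ratio stays defined; the constant here is purely illustrative.
    """
    freq_a = freq_a or zero_substitute
    freq_b = freq_b or zero_substitute
    return math.log2((freq_a / size_a) / (freq_b / size_b))


# Four times as frequent in A as in B (equal corpus sizes): two doublings.
four_times = log_ratio(20, 1000, 5, 1000)
# Same raw count, but corpus B is twice as large: one doubling.
size_effect = log_ratio(10, 1000, 10, 2000)
# Absent from B: the substitute constant produces a very large value.
absent_in_b = log_ratio(9, 1000, 0, 1000)
print(four_times, size_effect, absent_in_b)
```

Under this assumption, a word that is missing from one corpus gets an extreme score that mainly reflects the substitute constant, so such values are better read as "occurs in one corpus only" than as a measured effect size.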


In [30]:
# Store keyness dict in df
df_keyness = pd.DataFrame(corp_key.items(), columns=['feature', 'log_ratio'])

# Select only top 10 largest and smallest values for df
df_keyness_smallest_largest = pd.concat([df_keyness.nlargest(10,'log_ratio'), df_keyness.nsmallest(10,'log_ratio')])
df_keyness_smallest_largest

Unnamed: 0,feature,log_ratio
16,merry,29.841947
26,boy,29.672022
35,small,29.256985
49,call,28.99395
58,name,28.99395
76,laughing,28.672022
90,stream,28.256985
109,girl,28.256985
4,lamb,4.18406
28,pity,3.096598


In [31]:
# Assign book titles to the df (from the log ratio metric and the order in which we passed the two frequency dicts, we know that all positive values are "innocence" and all negative values are "experience")
df_keyness_smallest_largest.loc[df_keyness_smallest_largest['log_ratio'] < 0, 'book_title'] = 'Songs of Experience'
df_keyness_smallest_largest.loc[df_keyness_smallest_largest['log_ratio'] > 0, 'book_title'] = 'Songs of Innocence'

# Round the values to two decimals
df_keyness_smallest_largest = df_keyness_smallest_largest.round(2)

df_keyness_smallest_largest

Unnamed: 0,feature,log_ratio,book_title
16,merry,29.84,Songs of Innocence
26,boy,29.67,Songs of Innocence
35,small,29.26,Songs of Innocence
49,call,28.99,Songs of Innocence
58,name,28.99,Songs of Innocence
76,laughing,28.67,Songs of Innocence
90,stream,28.26,Songs of Innocence
109,girl,28.26,Songs of Innocence
4,lamb,4.18,Songs of Innocence
28,pity,3.1,Songs of Innocence


In [32]:
# Plot again
# We use colorblind-friendly colors from ColorBrewer here
fig = px.bar(
    df_keyness_smallest_largest, 
    x="log_ratio", 
    y="feature",
    color='book_title', 
    barmode='group',
    height=700,
    template="simple_white",
    title="Keyness of Features between Books<br><sup>Plot depicts log ratio of relative frequency statistics (every extra point of Log Ratio score represents a doubling in size of the frequency difference)</sup>",
    labels={
      'book_title': 'Book',
      'log_ratio': 'Log Ratio',
      'feature': 'Feature'
    },
    text_auto=True,
    color_discrete_map = {
        'Songs of Experience': '#1b9e77',
        'Songs of Innocence': '#d95f02'
    }
)
fig.update_layout(
    legend_title="Book",
    font=dict(
        family="Corbel",
        size=18,
        color="black"
    )
)

fig.update_traces(width=0.7)
fig.show()

**Comment:**

While this plot looks similar to the absolute frequency difference plot above, it still contains some differences.

For example, the relative frequency difference is highest for "merry", which makes sense since "merry" appears only in poems of the book "Songs of Innocence". The same holds for the other features with log ratios between 28 and 30 (such as "boy", "call" and "stream"): their count in "Songs of Experience" is zero, so these extreme values mean "occurs in only one book" rather than a literal effect size.

In comparison with the absolute frequency plot, "lamb" is no longer among the features with the very highest keyness values, since its relative frequency difference is less drastic.

## Parsing XML text data

Now we will work with German Parliamentary data, which is available in XML format [here](https://www.bundestag.de/services/opendata) for the last two parliamentary periods. Remember that XML format is very similar to HTML format, and we can parse it using a scraper and CSS selectors. Speeches are contained in `<rede>` elements, which each contain a paragraph element describing the speaker, and paragraph elements recording what they said. Note that class selectors won't work, because the class attribute is called "klasse". You can use normal attribute selectors.

Choose one of the sessions, and retrieve it using R or Python. Using a scraper, get a list of all the `<rede>` elements. For each element, get the name of the speaker, and a single string containing everything that they said. Put this into a dataframe. Print the number of speeches, and the content of the first speech, by a politician of your choice.
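Selecting on the `klasse` attribute can be sketched with Python's standard library alone. The miniature XML below is hypothetical: only the element names `rede`, `p`, `redner`, `vorname`, `nachname` and the attribute name `klasse` follow the description, while the nesting, the attribute values and the text are made up, so the real protocol may need different paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature protocol (content and attribute values are made up).
xml_text = """
<sitzung>
  <rede id="r1">
    <p klasse="redner"><redner><vorname>Anna</vorname><nachname>Beispiel</nachname></redner>Anna Beispiel:</p>
    <p klasse="absatz">Erster Absatz.</p>
    <p klasse="absatz">Zweiter Absatz.</p>
  </rede>
</sitzung>
"""

root = ET.fromstring(xml_text)
speeches = []
for rede in root.iter("rede"):
    # Attribute selector on "klasse" instead of a class selector
    speaker_p = rede.find("p[@klasse='redner']")
    name = f"{speaker_p.findtext('.//vorname')} {speaker_p.findtext('.//nachname')}"
    # Everything except the speaker paragraph, joined into a single string
    text = " ".join(
        p.text.strip() for p in rede.findall("p") if p.get("klasse") != "redner"
    )
    speeches.append((name, text))

print(speeches)
```

The same idea carries over to CSS-style attribute selectors such as `p[klasse="redner"]` in scraping libraries that support them.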



In [33]:
# retrieve website html for session: Plenarprotokoll der 58. Sitzung von Freitag, den 30. September 2022
html = requests.get("https://www.bundestag.de/resource/blob/913444/aeecd11842a5e9e64c0aac4fbd2dd4b9/20058-data.xml")

# parse html text
soup = BeautifulSoup(html.text, "html.parser")

In [34]:
# extract all speeches via the <rede> tag
reden_html = soup.find_all('rede')

In [35]:
speeches = []

# retrieve desired attributes from each speech of the session
for rede in reden_html:
  # Extract titel, vorname and nachname tags for the speech
  titel_html = rede.find('titel')
  vorname_html = rede.find('vorname')
  nachname_html = rede.find('nachname')

  # Get content of tag for each tag that is not None (the person has a title/vorname/nachname that is contained in xml) and join the resulting list into one string
  full_name = " ".join([item.get_text() for item in [titel_html, vorname_html, nachname_html] if item is not None])

  # Get all paragraphs (<p>) of the current <rede>
  rede_paragraphs_html = rede.find_all("p")

  # Drop the paragraph in each <rede> that contains the <redner> tag, since that paragraph holds the personal information about the speaker
  rede_paragraphs_html = [p for p in rede_paragraphs_html if p.find('redner') is None]
  
  # Note: At the end of every speech there are one or two paragraphs containing the moderation by the president of the parliament. Strictly speaking, to end up with only the speeches, we would need to exclude these texts.
  # For this assignment, I only filter out the paragraphs containing the announcement "nächste rednerin/nächster redner", since this gave very robust filtering results. Perfectly tidy speeches would need a more complex approach, which is probably beyond the scope of this assignment, so I left it out here.
  rede_paragraphs_html = [p for p in rede_paragraphs_html if re.search("nächste rednerin|nächster redner", p.get_text().lower()) is None]
  
  # Extract texts of all paragraphs tags <p> to get full speech text
  speech_text = " ".join([item.get_text() for item in rede_paragraphs_html])

  # Apply some preprocessing to speech text to clean data
  speech_text = speech_text.strip()

  speeches.append({
      'speaker_name': full_name,
      'speech_text': speech_text
  })

  print(f"Speaker of next speech: {full_name}")

Speaker of next speech: Christian Lindner
Speaker of next speech: Dr. Mathias Middelberg
Speaker of next speech: Tim Klüssendorf
Speaker of next speech: Klaus Stöber
Speaker of next speech: Katharina Beck
Speaker of next speech: Christian Leye
Speaker of next speech: Till Mansmann
Speaker of next speech: Alois Rainer
Speaker of next speech: Carlos Kasper
Speaker of next speech: Dr. Sebastian Schäfer
Speaker of next speech: Parsa Marvi
Speaker of next speech: Fritz Güntzler
Speaker of next speech: Dieter Janecek
Speaker of next speech: Timon Gremmels
Speaker of next speech: Patricia Lips
Speaker of next speech: Bettina Hagedorn
Speaker of next speech: Albrecht Glaser
Speaker of next speech: Jamila Schäfer
Speaker of next speech: Dr. Gesine Lötzsch
Speaker of next speech: Dr. Thorsten Lieb
Speaker of next speech: Yannick Bury
Speaker of next speech: Johannes Schraps
Speaker of next speech: Norbert Kleinwächter
Speaker of next speech: Andreas Audretsch
Speaker of next speech: Florian Oßne
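When paragraphs are dropped from a list, it matters whether the list is changed in place during the loop or rebuilt. The toy sketch below uses plain strings instead of parsed paragraphs (the data and `is_unwanted` are hypothetical) to show the difference.

```python
paragraphs = ["speaker info", "speaker info", "first point", "second point"]


def is_unwanted(p):
    return p == "speaker info"


# Removing while iterating: after a removal, the next item slides into the
# freed slot and is never examined by the loop.
mutated = list(paragraphs)
for p in mutated:
    if is_unwanted(p):
        mutated.remove(p)

# Building a new list examines every item exactly once.
filtered = [p for p in paragraphs if not is_unwanted(p)]

print(mutated)
print(filtered)
```

In this toy case the in-place version leaves one unwanted item behind, which is why rebuilding the list is the safer choice when two paragraphs to be dropped can follow each other directly.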

In [36]:
# Store results in dataframe and print it formatted
df_speeches = pd.DataFrame(speeches)
df_speeches

Unnamed: 0,speaker_name,speech_text
0,Christian Lindner,"Frau Präsidentin, liebe Kolleginnen und Kolleg..."
1,Dr. Mathias Middelberg,"Herr Minister, das, was wir heute diskutieren,..."
2,Tim Klüssendorf,Frau Präsidentin! Liebe Kolleginnen und Kolleg...
3,Klaus Stöber,Sehr geehrte Frau Präsidentin! Sehr geehrte Ko...
4,Katharina Beck,Frau Präsidentin! Liebe Kolleginnen und Kolleg...
...,...,...
64,Jürgen Trittin,Frau Präsidentin! Meine Damen und Herren! Ich ...
65,Pascal Kober,Frau Präsidentin! Liebe Kolleginnen und Kolleg...
66,Matthias Helferich,Sehr geehrte Frau Präsidentin! Sehr geehrte Da...
67,Thorsten Frei,Frau Präsidentin! Liebe Kolleginnen und Kolleg...


In [37]:
# Print number of speeches in bold
print(f"The Plenarprotokoll der 58. Sitzung von Freitag, den 30. September 2022 contains \033[1m {len(df_speeches)} \033[0m speeches")

The Plenarprotokoll der 58. Sitzung von Freitag, den 30. September 2022 contains  69  speeches


In [38]:
# Print the first speech of a random politician to uphold scientific objectivity ;-)

# Select a random politician from the full df
rand_politician = df_speeches.speaker_name[randint(0,len(df_speeches)-1)]

# Print the first speech of the random politician
first_speech = df_speeches[df_speeches['speaker_name'] == rand_politician].speech_text.values[0]
print(f"The first speech of {rand_politician} has the following transcript:\n{first_speech}")

The first speech of Dieter Janecek has the following transcript:
Sehr geehrte Frau Präsidentin! Wenn man den Vorschlägen der Union vom 9. März gefolgt wäre, lieber Herr Merz, nämlich Nord Stream 1 abzustellen, dann hätten wir heute eine Gasmangellage. Weil man den Vorschlägen der Union in Bezug auf die Abhängigkeit von Russland über 16 Jahre gefolgt ist, haben wir heute diese massive Krise, die Sie mitzuverantworten haben. Das muss an den Anfang der Rede gestellt werden; denn wenn Sie, Herr Güntzler, sich hierhinstellen und sagen: „Die Probleme sind verursacht durch die Ampel“, dann kann ich nur lachen. Wir haben einen russischen Angriffskrieg in der Ukraine, wir haben eine Energiekrise, auch ausgelöst durch die massiven fossilen Abhängigkeiten, die Sie über Jahrzehnte geschaffen haben, und wir haben jetzt – der Finanzminister hat es ja gesagt – einen Energiekrieg, der geführt wird von Russland gegen die Europäische Union. Deswegen brauchen wir jetzt einen Abwehrschirm. Und deswegen is

## Using regular expressions

Using a regular expression, get a list of words spoken in your parliamentary protocol that contain (in upper or lower case) the string "kohle" (coal). Show the number of occurrences of each of these words. If there are no mentions in the debate you have selected, try another protocol.

In [39]:
# Example of how our regex pattern works
matches = re.findall("[a-zA-Z]*[kK]ohle[a-zA-Z]*", "Ein Kohlebergwerk kostet viel Kohle egal ob man kohle groß oder klein schreiben mag!")

print(f"Matches: {matches}")

Matches: ['Kohlebergwerk', 'Kohle', 'kohle']


In [40]:
# Placeholder to store matched words
kohle_words = []

# Get the list of words that contain kohle/Kohle
# Define a pattern that looks for occurrences of the string "kohle" or "Kohle" within words of the entire protocol
# Note: I interpreted the "upper or lower case" instruction to mean that we ignore whether the "k" in Kohle is upper or lower case.
# If the assignment also meant to include words such as "koHle", the pattern would look very similar (e.g. "[a-zA-Z]*[kK]o[Hh]le[a-zA-Z]*")
pattern = "[a-zA-Z]*[kK]ohle[a-zA-Z]*"
for speech in df_speeches.speech_text:
  kohle_words.extend(re.findall(pattern, speech))

# Get the list of unique words contained in the protocol that contain kohle/Kohle
unique_kohle_words = np.unique(np.array(kohle_words)).tolist()

print(f"Unique words that contain kohle/Kohle: {unique_kohle_words}")

Unique words that contain kohle/Kohle: ['Braunkohlekraftwerke', 'Kohle', 'Kohlekraftwerke', 'Kohlekraftwerken', 'Kohlestrom']
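One limitation of a character class like `[a-zA-Z]` is that it does not cover umlauts or "ß", so a German word can be cut off at the first such letter. A hedged alternative, shown on a made-up sentence, is to match any Unicode letter and to ignore case entirely; `KOHLE_PATTERN` is a hypothetical name, and whether this changes the results for the protocol above has not been checked.

```python
import re

# [^\W\d_] matches any word character that is neither a digit nor an
# underscore, i.e. any letter, including ä, ö, ü and ß.
KOHLE_PATTERN = re.compile(r"[^\W\d_]*kohle[^\W\d_]*", flags=re.IGNORECASE)

sample = "Die Kohleförderung sinkt, STEINKOHLE bleibt, und kohle kostet Kohle."

ascii_only = re.findall("[a-zA-Z]*[kK]ohle[a-zA-Z]*", sample)
unicode_aware = KOHLE_PATTERN.findall(sample)

print(ascii_only)
print(unicode_aware)
```

The ASCII-only pattern truncates "Kohleförderung" and misses the all-caps spelling, while the Unicode-aware pattern returns both words whole.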


In [41]:
# count how often each unique "kohle word" appears in the list of all matches across the protocol and store counts in nested list
counts = [[word_unique,kohle_words.count(word_unique)] for word_unique in unique_kohle_words]

# Map counts into a dataframe and show it in descending order
df_unique_counts = pd.DataFrame(counts, columns=["word_unique", "count"]).sort_values('count', ascending=False)
df_unique_counts

Unnamed: 0,word_unique,count
1,Kohle,5
2,Kohlekraftwerke,3
0,Braunkohlekraftwerke,1
3,Kohlekraftwerken,1
4,Kohlestrom,1


Here we see the number of occurrences of each word containing kohle/Kohle that we identified in the protocol under analysis.
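Counting matches per word can also be done in a single pass with the standard library. This is a sketch on a made-up list of matches, not the protocol's actual matches.

```python
from collections import Counter

# Toy list of regex matches (hypothetical values)
matches = ["Kohle", "Kohlekraftwerke", "Kohle", "Kohlestrom", "Kohle"]

word_counts = Counter(matches)
# most_common() returns (word, count) pairs sorted by count, descending
ranked = word_counts.most_common()
print(ranked)
```

`Counter` walks the list once, whereas calling `list.count` for every unique word rescans the whole list each time; for a handful of matches the difference is negligible, but the single pass scales better.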