In [17]:
import nltk
import re
import string
TOKENIZER = nltk.tokenize.word_tokenize
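def is_punct(t):
    # Escape the punctuation so that ']', '\' and '^' are literal inside the character class.
    return re.fullmatch(f'[{re.escape(string.punctuation)}]+', t) is not None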

In [20]:
import pandas as pd

# Language Change in Dutch Pop Song Lyrics (1989 - 2018)

_Alie Lassche_


## 1. Introduction

Is today's popular music worse than it was several years ago? In an [article](https://slate.com/technology/2014/08/musical-nostalgia-the-psychology-and-neuroscience-for-song-preference-and-the-reminiscence-bump.html) on Slate, Mark Joseph Stern asks why the songs he heard as a teenager sound sweeter than anything he listens to as an adult. To answer this question he investigates the brain's relationship with music and argues that our reaction to music depends on how we interact with it: the more we like a song, the more we are treated to neurochemical bliss. Since our brains undergo rapid neurological development between the ages of 12 and 22, the music we love during that decade seems to get wired into our lobes for good. Combine that with the fact that songs from our youth form the soundtrack to what feel, at the time, like the most vital and momentous years of our lives, and you arrive at Stern's conclusion: you'll never love another song the way you loved the music of your youth.

Carl Sharpe asks almost the same question in an [article](https://towardsdatascience.com/49-years-of-lyrics-why-so-angry-1adf0a3fa2b4) on Towards Data Science: "I _know_ late 90s music was the best music of all time (see Neural Nostalgia article above), but how could I prove/disprove that? How could I measure something so subjective?" In his article he presents the results of a Python-based study of language change in popular music from 1970 to 2018. One of the hypotheses he tests is that lyrics have become more aggressive and profane over the past 49 years. His dataset contains popular songs that were in the Billboard Top 100 between 1970 and 2018. Other studies on the language of song lyrics (take a look [here](https://www.johnwmillr.com/trucks-and-beer/), [here](https://towardsdatascience.com/does-country-music-drink-more-than-other-genres-a21db901940b) and [here](https://github.com/Hugo-Nattagh/2017-Hip-Hop)) also use corpora of English pop songs. Drawing inspiration from these quantitative studies, I will investigate language change in the lyrics of Dutch popular songs.

The research question I will answer in this study is: how do the dominant sentiments in Dutch song lyrics change between 1989 and 2018? To answer this question I use the Linguistic Inquiry and Word Count ([LIWC](http://liwc.wpengine.com)), a software program that analyses text by counting words in 66 psychologically meaningful categories, each defined by a list of words in a dictionary. LIWC reads a given text and calculates the percentage of words that fall into each category. Since it was originally developed by researchers with an interest in social, clinical, health and cognitive psychology, the language categories were created to capture people's social and psychological states. The LIWC dictionary was originally compiled for English, but it has been translated into many languages, including Dutch. In this study I use the Dutch translation of the LIWC 2007 version.
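To make the counting principle concrete, here is a minimal sketch of a LIWC-style analysis. The two-category dictionary `toy_liwc` and the function `category_shares` are hypothetical stand-ins invented for illustration, and the real Dutch LIWC contains far more categories and terms. As in LIWC, a trailing `*` is assumed to mark a prefix that matches every word starting with it.

```python
from collections import Counter

# Hypothetical toy dictionary: category -> list of terms ('*' marks a prefix).
toy_liwc = {
    "Posemo": ["lief*", "mooi", "blij"],
    "Negemo": ["verdriet*", "pijn"],
}

def category_shares(tokens, dictionary):
    """Return, per category, the share of tokens matching one of its terms."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    shares = {}
    for category, terms in dictionary.items():
        hits = 0
        for term in terms:
            if term.endswith("*"):
                prefix = term[:-1]
                hits += sum(n for word, n in counts.items() if word.startswith(prefix))
            else:
                hits += counts[term]  # a Counter returns 0 for unseen words
        shares[category] = hits / total if total else 0.0
    return shares

tokens = "Mijn liefde is mooi maar de pijn blijft".split()
print(category_shares(tokens, toy_liwc))
# {'Posemo': 0.25, 'Negemo': 0.125}
```

Two of the eight tokens ("liefde" via the prefix `lief*`, and "mooi") count as positive emotion, and one ("pijn") as negative emotion. Note that "blijft" does not match the exact term "blij", which is why prefix terms need the explicit `*`.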

In what follows I will first describe how I built the dataset, after which I will discuss the Dutch LIWC in detail. The subsequent sections cover the analysis and the results. I will end with a conclusion and a discussion.


## 2. Corpus

To create a dataset, I used a method similar to Sharpe's. Instead of using the Billboard Top 100, I went to the Dutch equivalent: the [Top40](https://www.top40.nl). Here the 'Top 100-Jaaroverzicht' (the yearly Top 100 overview) can be found for every year from 1965 until 2018. I manually checked each list from 1989 through 2018 for artists who wrote songs in Dutch. I then listed these artists per decade, which resulted in a dataframe with three columns, one for each of the following decades: 1989 - 1998, 1999 - 2008 and 2009 - 2018.
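The grouping into decades can be sketched as follows. This is only an illustration under assumptions: the `chart_entries` list and the artist names are placeholders, since the actual lists were compiled by hand from the Top40 website.

```python
from collections import defaultdict

# The three decades used in this study, as inclusive (start, end) pairs.
DECADES = [(1989, 1998), (1999, 2008), (2009, 2018)]

def decade_label(year):
    """Map a chart year to the label of the decade it falls in."""
    for start, end in DECADES:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"{year} falls outside the period 1989-2018")

# Placeholder (year, artist) pairs standing in for the manually checked lists.
chart_entries = [
    (1990, "Artist A"),
    (1995, "Artist A"),
    (2003, "Artist B"),
    (2010, "Artist B"),
    (2015, "Artist C"),
]

# A set per decade keeps each artist only once, however often they charted.
artists_by_decade = defaultdict(set)
for year, artist in chart_entries:
    artists_by_decade[decade_label(year)].add(artist)

for label, artists in sorted(artists_by_decade.items()):
    print(label, sorted(artists))
```

An artist who charted in more than one decade ("Artist B" here) ends up in several columns, which seems to match the description above.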

After that I wrote a script that, given an artist from the above-mentioned dataframe, scrapes the corresponding song titles from [Genius](https://www.genius.com). The name of the artist and the song titles were saved in a dictionary, one for each decade. The next step was to clean this dictionary:

- remove English songs
- remove wrong artists

However, this can also be done after scraping the lyrics, which may even be easier.
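One possible way to do this cleaning after the lyrics have been scraped is sketched below. It is not the script used for this study: the nested `catalogue` structure, the function names and the two small function-word lists are assumptions made for illustration, and the language check is a crude heuristic rather than a real language detector.

```python
# Small, hand-picked function-word lists (an assumption, not a complete lexicon).
DUTCH_HINTS = {"de", "het", "een", "ik", "je", "niet", "en", "van", "dat", "maar"}
ENGLISH_HINTS = {"the", "a", "i", "you", "not", "and", "of", "that", "but", "my"}

def looks_dutch(lyrics):
    """Guess the language by comparing counts of common function words."""
    tokens = lyrics.lower().split()
    dutch = sum(token in DUTCH_HINTS for token in tokens)
    english = sum(token in ENGLISH_HINTS for token in tokens)
    return dutch > english

def clean_catalogue(catalogue, wanted_artists):
    """Keep only the wanted artists and, for each, only songs that look Dutch.

    catalogue is assumed to map artist -> {song title -> lyrics}.
    """
    cleaned = {}
    for artist, songs in catalogue.items():
        if artist not in wanted_artists:
            continue  # a wrong artist returned by the search
        dutch_songs = {title: text for title, text in songs.items() if looks_dutch(text)}
        if dutch_songs:
            cleaned[artist] = dutch_songs
    return cleaned

# Placeholder data: one artist with a Dutch and an English song, one wrong artist.
catalogue = {
    "Artist A": {
        "Song 1": "ik hou van je maar dat weet je niet",
        "Song 2": "i love you and the night is not over",
    },
    "Artist B": {"Song 3": "de zon en het strand"},
}
cleaned = clean_catalogue(catalogue, wanted_artists={"Artist A"})
print(cleaned)
```

Here "Song 2" is dropped because it contains more English than Dutch function words, and "Artist B" is dropped because the name is not in the list of wanted artists. Songs that mix both languages would need a more careful rule.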



## 3. Analysis

## 4. Results

## 5. Conclusion

## 6. Discussion

TODO: MAKE SURE THE INDEX COLUMN CONTAINS THE NAME OF THE ARTIST

In [141]:
from __future__ import division
from __future__ import print_function
import os
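# The built-in open() accepts an encoding in Python 3, so codecs.open is not needed.

#------------ DUTCH DATA --------------

csv = '/Users/alielassche/documents/github/cultural-analytics/LIWC_Dutch.csv'		#load Dutch LIWC data

with open(csv, "r", encoding='utf-8') as csvfile:
	liwcfile = csvfile.read().split("\n")

# Each row holds a category name followed by the terms that belong to it.
liwc_nl_dict = dict()
for line in liwcfile:
	if not line.strip():	# skip blank lines such as a trailing newline
		continue
	line = line.rsplit(",")
	liwc_nl_dict[line[0]] = [term for term in line[1:] if term]	# drop empty cells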


#----------- FUNCTIONS ----------------


def freqdict(text):

	"""This function returns a frequency dictionary of the input list. All words are transformed to lower case."""
	
	freq_dict = dict() 
	for word in text:
		word = word.lower()
		if word in freq_dict:
			freq_dict[word] += 1
		else:
			freq_dict[word] = 1
	return freq_dict

def liwc(text,output='rel',lang='nl'):

	"""This function takes a list of tokens as input and returns a dictionary with the relative (output='rel') or absolute (output='abs') frequencies for every LIWC category. Only the Dutch dictionary (lang='nl') is loaded in this notebook."""

	#decide on relative or absolute frequency
	if output == 'abs': #absolute frequency as output
		division = 1
	elif output == 'rel': #relative frequency as output
		division = max(len(text), 1) #avoid ZeroDivisionError on empty input
	else:
		raise ValueError("output must be 'rel' or 'abs'")

	#make frequency dictionary of the text to diminish number of runs in further for loop
	freq_dict = freqdict(text) 	
	
	if lang == 'nl':
		liwc_dict = liwc_nl_dict
	else:
		#liwc_en_dict is never defined here, so fail with a clear message instead of a NameError
		raise ValueError("only lang='nl' is available: no English LIWC dictionary is loaded")
	
	features = dict()		
	for category in liwc_dict:
		freq = 0
		for term in liwc_dict[category]:
			term = term.lower()
			if term.endswith(u"*"): #'*' indicates partial words that should match the beginning of the word (include variations on words); endswith() is also safe for empty terms
				for word in freq_dict:
					if word.startswith(term[:-1]):
						freq += freq_dict[word]
			else:
				if term in freq_dict:
					freq += freq_dict[term]
		features[category] = freq / division
		
	return features

def liwc_nl(text,output="rel"):
	"""This function applies Dutch liwc() on input. Output is relative frequencies of liwc categories."""
	return liwc(text,output=output,lang="nl")


if __name__ == '__main__':
	doc = '/Users/alielassche/documents/github/cultural-analytics/decade1/Lyrics_WillekeAlberti.txt'
	
	
# 	with open(doc, 'r') as myfile:
# 		data = myfile.read().replace('\n', ' ')
# 		data = data.split(" ")
# 		liwc(data,output='rel',lang='nl')

In [142]:
with open(doc, 'r') as myfile:
    chars = myfile.read().replace('\n', ' ')

# word_tokenize returns individual tokens (not sentences), so filter them directly.
words = [t.lower() for t in TOKENIZER(chars, language="dutch") if not is_punct(t)]

In [143]:
pd.DataFrame.from_dict(liwc(words,output='rel', lang='nl'), orient='index', columns=['Andre'])

Unnamed: 0,Andre
Othref,0.041667
Time,0.090278
Inhib,0.000000
Space,0.017677
Posemo,0.027146
Self,0.075126
Social,0.080808
Humans,0.001263
Sports,0.000631
Other,0.003788


In [145]:
df = pd.read_csv('decade1_liwc_names.csv', sep=';', index_col=0)
df.head()

Unnamed: 0,Achieve,Affect,Anger,Anx,Article,Assent,Body,Cause,Certain,Cogmech,...,Social,Space,Sports,Swear,TV,Tentat,Time,Up,We,You
De Sjonnies,0.0,0.033939,0.003152,0.001697,0.061576,0.013576,0.013091,0.005818,0.020848,0.044121,...,0.069818,0.01697,0.00703,0.000485,0.0,0.012121,0.033697,0.016242,0.003152,0.019152
Hakkûhbar,0.0,0.032243,0.013751,0.0,0.028924,0.004742,0.007587,0.002371,0.014699,0.028924,...,0.048364,0.0422,0.009957,0.001897,0.0,0.014699,0.044097,0.013276,0.000948,0.013751
DJ Madman,0.0,0.086677,0.0,0.012841,0.022472,0.001605,0.024077,0.0,0.004815,0.065811,...,0.165329,0.004815,0.001605,0.0,0.0,0.038523,0.083467,0.006421,0.0,0.110754
Ome Henk,0.0,0.024903,0.003874,0.000553,0.054234,0.002214,0.016602,0.007194,0.007748,0.043165,...,0.037631,0.022136,0.000553,0.0,0.000553,0.012175,0.02435,0.010515,0.001107,0.008854
Mannenkoor karrespoor,0.0,0.049751,0.0,0.0,0.074627,0.004975,0.00995,0.004975,0.00995,0.034826,...,0.039801,0.004975,0.0,0.0,0.0,0.0,0.049751,0.014925,0.0199,0.0


In [155]:
print(df.loc['Total'].sort_values())

Fillers    0.003894
TV         0.006488
Nonfl      0.006778
Swear      0.016422
Groom      0.017999
School     0.024512
Down       0.025390
Inhib      0.025856
Death      0.032974
Job        0.052025
Friends    0.057547
Anx        0.065253
Achieve    0.074014
Sports     0.075452
Relig      0.078856
Home       0.091372
Assent     0.101676
Anger      0.102907
Music      0.103822
Sleep      0.106605
Optim      0.108378
We         0.108784
Metaph     0.111785
Eating     0.124774
Family     0.126232
Money      0.127853
Occup      0.151109
Cause      0.155353
Sexual     0.169150
Humans     0.177949
             ...   
Leisure    0.362944
Up         0.377851
Body       0.425210
Negate     0.442613
Tentat     0.502498
Certain    0.504781
Insight    0.596597
Negemo     0.656299
Space      0.744772
Physcal    0.767056
Posemo     0.838418
Senses     0.866276
Discrep    0.907894
Motion     0.930377
Past       0.983815
You        1.291172
Excl       1.423621
Affect     1.514999
Article    1.765142


In [158]:
df3 = pd.read_csv('songlyrics/scripts/decade3_liwc.csv', sep='\t', index_col=0)
df3.head()

Unnamed: 0,Achieve,Affect,Anger,Anx,Article,Assent,Body,Cause,Certain,Cogmech,...,Social,Space,Sports,Swear,TV,Tentat,Time,Up,We,You
0,0.000584,0.031536,0.00292,0.000908,0.026604,0.006164,0.00584,0.004088,0.013432,0.056778,...,0.101616,0.012978,0.000389,0.001687,0.00026,0.006684,0.033288,0.009344,0.00266,0.052755
1,0.0,0.014433,0.003093,0.0,0.038144,0.0,0.005155,0.008247,0.001031,0.030928,...,0.041237,0.01134,0.0,0.001031,0.0,0.002062,0.036082,0.004124,0.002062,0.025773
2,0.001547,0.021454,0.004023,0.001908,0.024291,0.00263,0.010572,0.003816,0.00459,0.052347,...,0.075245,0.00753,0.00165,0.000722,0.000155,0.003816,0.029654,0.008097,0.002476,0.040175
3,0.00149,0.025034,0.000894,0.000596,0.022798,0.004768,0.008195,0.001192,0.008195,0.035911,...,0.075399,0.01207,0.000596,0.0,0.000149,0.010431,0.044554,0.004917,0.002235,0.027418
4,0.000612,0.031478,0.002959,0.001377,0.058364,0.002551,0.006785,0.008214,0.009897,0.068007,...,0.069741,0.016581,0.000765,0.001071,0.000153,0.019336,0.045049,0.010714,0.000459,0.036886


In [159]:
# Add a 'Total' row holding the column-wise sum over all rows.
df3.loc['Total'] = df3.sum()

In [160]:
print(df3.loc['Total'].sort_values())

TV         0.004875
Fillers    0.006405
Nonfl      0.010015
Groom      0.011801
Death      0.013517
Inhib      0.021108
Down       0.021496
School     0.024558
Job        0.026784
Sports     0.028976
Achieve    0.032880
Anx        0.034871
Relig      0.038880
Swear      0.039141
Home       0.045516
Friends    0.051548
Metaph     0.052306
Sexual     0.058973
Music      0.060422
We         0.060703
Money      0.062509
Optim      0.063558
Sleep      0.067687
Eating     0.079396
Occup      0.084941
Assent     0.089424
Anger      0.093270
Number     0.099132
Feel       0.115572
Family     0.117340
             ...   
Future     0.238001
Hear       0.238942
Up         0.246167
Body       0.262449
Tentat     0.264099
Certain    0.270369
Comm       0.270879
Posemo     0.405964
Negemo     0.424767
Space      0.436754
Insight    0.437796
Physcal    0.440804
Past       0.506967
Senses     0.585264
Discrep    0.634611
Motion     0.684641
Affect     0.838845
Excl       0.999689
Article    1.002130
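A note on comparing the two 'Total' rows: because each total is a sum of relative frequencies over all rows, it grows with the number of artists in a decade. If the decades contain different numbers of artists, a per-artist mean may be easier to compare. The sketch below illustrates this with made-up numbers. The function `category_means` and the nested dictionaries are hypothetical and only mirror the layout of the dataframes above (rows are artists, columns are LIWC categories).

```python
from statistics import mean

def category_means(rows):
    """Average each category over all artists.

    rows is assumed to map artist -> {category -> relative frequency}.
    """
    categories = sorted({cat for freqs in rows.values() for cat in freqs})
    return {cat: mean(freqs.get(cat, 0.0) for freqs in rows.values()) for cat in categories}

def category_sums(rows):
    """Sum each category over all artists (what a 'Total' row holds)."""
    categories = sorted({cat for freqs in rows.values() for cat in freqs})
    return {cat: sum(freqs.get(cat, 0.0) for freqs in rows.values()) for cat in categories}

# Made-up values: two artists in one decade, four in the other.
decade_small = {
    "artist_a": {"Posemo": 0.02, "Negemo": 0.01},
    "artist_b": {"Posemo": 0.04, "Negemo": 0.03},
}
decade_large = {
    name: {"Posemo": 0.03, "Negemo": 0.02}
    for name in ("artist_c", "artist_d", "artist_e", "artist_f")
}

print("sums :", category_sums(decade_small), category_sums(decade_large))
print("means:", category_means(decade_small), category_means(decade_large))
```

In this toy case the sums suggest that positive emotion doubles between the two decades, while the means show that the average artist uses it equally often. With the dataframes in this notebook, a comparable figure could presumably be obtained with `df.drop(index='Total').mean()`.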
