# Language Change in Dutch Pop Song Lyrics (1989 - 2018)

_Alie Lassche_


## 1. Introduction

Is today's popular music worse than it was several years ago? In an [article](https://slate.com/technology/2014/08/musical-nostalgia-the-psychology-and-neuroscience-for-song-preference-and-the-reminiscence-bump.html) on Slate, Mark Joseph Stern asks himself why the songs he heard as a teenager sound sweeter than anything he listens to as an adult. To answer this question he investigates the brain's relationship with music and states that our reaction to music depends on how we interact with it. The more we like a song, the more we get treated to neurochemical bliss. Since our brains undergo rapid neurological development between the ages of 12 and 22, the music we love during that decade seems to get wired into our lobes for good. Combine that with the fact that songs from our youth form the soundtrack to what feels, at that time, like the most vital and momentous years of our lives, and you arrive at Stern's conclusion: you'll never love another song the way you loved the music of your youth.

Carl Sharpe asks himself almost the same question in an [article](https://towardsdatascience.com/49-years-of-lyrics-why-so-angry-1adf0a3fa2b4) on Towards Data Science: "I _know_ late 90s music was the best music of all time (see Neural Nostalgia article above), but how could I prove/disprove that? How could I measure something so subjective?" In his article he presents the results of a Python-based study of language change in popular music from 1970 to 2018. One of the hypotheses he tests is that lyrics have become more aggressive and profane over those 49 years. His dataset contains popular songs that were in the Billboard Top 100 between 1970 and 2018. Other studies on the language of song lyrics (take a look [here](https://www.johnwmillr.com/trucks-and-beer/), [here](https://towardsdatascience.com/does-country-music-drink-more-than-other-genres-a21db901940b) and [here](https://github.com/Hugo-Nattagh/2017-Hip-Hop)) also use corpora of English pop songs. Drawing inspiration from these quantitative studies, I will research language change in the lyrics of Dutch popular songs.

The research question I will answer in this study is: how do the dominant sentiments in Dutch song lyrics change between 1989 and 2018? To answer this question I use the Linguistic Inquiry and Word Count ([LIWC](http://liwc.wpengine.com)), a software program that analyses text by counting words in 66 psychologically meaningful categories, each of which is defined by a list of words in a dictionary. LIWC reads a given text and calculates the percentage of words that fall into each category. Since it was originally developed by researchers with an interest in social, clinical, health and cognitive psychology, the language categories were created to capture people's social and psychological states. The LIWC dictionary was written for English, but it has been translated into many languages, including Dutch. In this study I use the Dutch translation of the LIWC 2007 version.

In what follows I will first describe how I built the dataset, after which I will discuss the Dutch LIWC in detail. Sections on the analysis and the results follow. I will end with a conclusion and a discussion.


## 2. Corpus

To create a dataset, I used a method similar to Sharpe's. Instead of using the Billboard Top 100, I went to the Dutch equivalent: the [Top40](https://www.top40.nl). Here the 'Top 100-Jaaroverzicht' (the year-end Top 100) can be found for every year from 1965 until 2018. I manually checked each list from 1989 to 2018 for artists who wrote songs in Dutch. I created a list with the names of these artists and divided the thirty years into three decades, resulting in a dataframe with three columns, each containing the names of the artists that were in the Top 100 during one of the following decades: 1989 - 1998, 1999 - 2008, 2009 - 2018.
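
The grouping into decades can be sketched as follows. The artist and year pairs are placeholders rather than entries from the actual year-end lists, and `decade_label` and `group_artists_by_decade` are hypothetical helper names.

```python
from collections import defaultdict

# The three decades used in this study
DECADES = [(1989, 1998), (1999, 2008), (2009, 2018)]


def decade_label(year):
    """Map a year to the label of the decade it belongs to."""
    for start, end in DECADES:
        if start <= year <= end:
            return f"{start} - {end}"
    raise ValueError(f"{year} falls outside 1989 - 2018")


def group_artists_by_decade(entries):
    """Collect the unique artist names per decade from (artist, year) pairs."""
    grouped = defaultdict(set)
    for artist, year in entries:
        grouped[decade_label(year)].add(artist)
    return {label: sorted(artists) for label, artists in grouped.items()}


# Placeholder entries, not real chart data
entries = [
    ("Artist A", 1989),
    ("Artist B", 2003),
    ("Artist A", 1995),
    ("Artist C", 2018),
    ("Artist A", 2001),
]
artists_by_decade = group_artists_by_decade(entries)
print(artists_by_decade)
```

An artist who charted in more than one decade, like the placeholder "Artist A", appears in each of the corresponding columns.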

After that I wrote a script that, given an artist from the above-mentioned dataframe, scrapes the corresponding song titles from [Genius](www.genius.com). The name of the artist and the song titles were saved in a dictionary, one for each decade. The next step was to clean this dictionary:

- remove English songs
- remove wrong artists

However, this can also be done after scraping the lyrics, which might even be easier.
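
The two cleaning steps could look roughly like this when applied after scraping. It is only a sketch: the function-word lists are small, hand-picked assumptions, the language check is a crude heuristic rather than a real language detector, the lyrics are invented placeholder lines, and all names (`looks_dutch`, `clean_songs`, `wanted_artists`) are hypothetical.

```python
# Assumed mini word lists; a real check would need a proper language detector
DUTCH_WORDS = {"de", "het", "een", "ik", "je", "en", "niet", "van", "dat", "mijn"}
ENGLISH_WORDS = {"the", "and", "you", "not", "of", "that", "my", "it", "to", "with"}


def looks_dutch(lyrics):
    """Heuristic: True if more Dutch than English function words occur."""
    tokens = lyrics.lower().split()
    dutch = sum(token in DUTCH_WORDS for token in tokens)
    english = sum(token in ENGLISH_WORDS for token in tokens)
    return dutch > english


def clean_songs(songs, wanted_artists):
    """Keep only wanted artists and, for those, only songs that look Dutch.

    songs maps artist -> {title: lyrics}.
    """
    cleaned = {}
    for artist, titles in songs.items():
        if artist not in wanted_artists:
            continue  # remove wrong artists
        dutch_titles = {
            title: lyrics for title, lyrics in titles.items() if looks_dutch(lyrics)
        }  # remove English songs
        if dutch_titles:
            cleaned[artist] = dutch_titles
    return cleaned


# Invented placeholder data
songs = {
    "Artist A": {
        "Song 1": "ik heb de hele dag op je gewacht",
        "Song 2": "you and the rest of my life",
    },
    "Artist X": {"Song 3": "het is niet van mij"},
}
cleaned = clean_songs(songs, wanted_artists={"Artist A"})
print(cleaned)
```

Because the heuristic needs running text, it only works once the lyrics themselves are available, not on song titles alone.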



## 3. Analysis

## 4. Results

## 5. Conclusion

## 6. Discussion

TODO: make sure the index column contains the name of the artist.

In [29]:
from __future__ import division
from __future__ import print_function
import os
from codecs import open

#------------ DUTCH DATA --------------

csv = '/Users/alielassche/documents/github/cultural-analytics/LIWC_Dutch.csv'		#load Dutch LIWC data

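#every row of the CSV holds a category name followed by the terms in that category
with open(csv,"r", encoding='utf-8') as csvfile:
	liwcfile = csvfile.read().split("\n")

liwc_nl_dict = dict()
for line in liwcfile:
	if not line.strip(): #skip empty lines, such as the one after the final newline
		continue
	line = line.strip().split(",")
	liwc_nl_dict[line[0]] = [term for term in line[1:] if term] #drop empty cells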


#----------- FUNCTIONS ----------------


def freqdict(text):

	"""This function returns a frequency dictionary of the input list. All words are transformed to lower case."""
	
	freq_dict = dict() 
	for word in text:
		word = word.lower()
		if word in freq_dict:
			freq_dict[word] += 1
		else:
			freq_dict[word] = 1
	return freq_dict

def liwc(text,output='rel',lang='nl'):

	"""This function takes a list of tokens as input and returns a dictionary with the relative (output='rel') or absolute (output='abs') frequencies for every LIWC category. Only the Dutch dictionary (lang='nl') is loaded here, so other languages raise a ValueError."""

	#decide on relative or absolute frequency
	if output == 'abs': #absolute frequency as output
		division = 1
	elif output == 'rel': #relative frequency as output
		division = max(len(text), 1) #avoid division by zero for empty input
	else:
		raise ValueError("output must be 'rel' or 'abs'")

	#make frequency dictionary of the text so that every distinct word is checked only once in the loop below
	freq_dict = freqdict(text)
	
	if lang == 'nl':
		liwc_dict = liwc_nl_dict
	else:
		#no English LIWC dictionary (liwc_en_dict) is loaded in this notebook
		raise ValueError("only lang='nl' is supported")
	
	features = dict()		
	for category in liwc_dict:
		freq = 0
		for term in liwc_dict[category]:
			term = term.lower()
			if term.endswith(u"*"): #'*' indicates partial words that should match the beginning of the word (include variations on words); endswith() is also safe for empty terms
				for word in freq_dict:
					if word.startswith(term[:-1]):
						freq += freq_dict[word]
			else:
				if term in freq_dict:
					freq += freq_dict[term]
		features[category] = freq / division
		
	return features

def liwc_nl(text,output="rel"):
	"""This function applies Dutch liwc() on input. Output is relative frequencies of liwc categories."""
	return liwc(text,output=output,lang="nl")


if __name__ == '__main__':
	doc = '/Users/alielassche/documents/github/cultural-analytics/decade1/Lyrics_AndrévanDuin.txt'
	
	
# 	with open(doc, 'r') as myfile:
# 		data = myfile.read().replace('\n', ' ')
# 		data = data.split(" ")
# 		liwc(data,output='rel',lang='nl')

In [30]:
import nltk
import re
import string
TOKENIZER = nltk.tokenize.word_tokenize
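def is_punct(t):
    """Return True if the token consists solely of punctuation characters."""
    # re.escape keeps characters such as ']' and '\' literal inside the character class
    return re.fullmatch(f'[{re.escape(string.punctuation)}]+', t) is not None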

In [31]:
with open(doc, 'r', encoding='utf-8') as myfile:
    chars = myfile.read().replace('\n', ' ')
    words = []
    # word_tokenize returns individual tokens, not sentences
    for token in TOKENIZER(chars, language="dutch"):
        if not is_punct(token):
            words.append(token.lower())

In [33]:
words

['hé',
 'jongens',
 'ja',
 'heb',
 'je',
 "'t",
 'al',
 'gehoord',
 'nee',
 'bij',
 'andré',
 'van',
 'duin',
 'staan',
 '35',
 'koeien',
 'in',
 "z'n",
 'tuin',
 'koeien',
 'in',
 "z'n",
 'tuin',
 'wat',
 'een',
 'zooitje',
 'hahaha',
 'ja',
 'ik',
 'heb',
 'altijd',
 'wat',
 "m'n",
 'hele',
 'tuin',
 'leg',
 'plat',
 "m'n",
 'gras',
 'is',
 'naar',
 "z'n",
 'moer',
 'ik',
 'ben',
 'in',
 'rep',
 'en',
 'roer',
 'en',
 'o',
 'wat',
 'maken',
 'ze',
 'toch',
 'een',
 'lawaai',
 "m'n",
 'hele',
 'tuin',
 'is',
 'ene',
 'grote',
 'koeievlaai',
 'nou',
 'zeker',
 'ja',
 'tjongejonge',
 'wat',
 'een',
 'lucht',
 'nou',
 '35',
 'koeien',
 'hoor',
 'die',
 'krengen',
 'loeien',
 'en',
 "m'n",
 'gras',
 'verknoeien',
 'o',
 "m'n",
 'tuin',
 'ligt',
 'volledig',
 'in',
 'puin',
 'dat',
 'kost',
 'duizenden',
 'guldens',
 'om',
 'dat',
 'weer',
 'op',
 'te',
 'knappen',
 'weet',
 'je',
 'dat',
 'ja',
 'komt',
 'nooit',
 'meer',
 'goed',
 'nou',
 'je',
 'kunt',
 'beter',
 'laten',
 'asfalteren',

In [36]:
import pandas as pd

In [39]:
pd.DataFrame.from_dict(liwc(words,output='rel', lang='nl'), orient='index', columns=['Andre'])

| Category | Andre |
|---|---|
| Othref | 0.040362 |
| Time | 0.043956 |
| Inhib | 0.000787 |
| Space | 0.021601 |
| Posemo | 0.016946 |
| Self | 0.053781 |
| Social | 0.069220 |
| Humans | 0.003903 |
| Sports | 0.002157 |
| Other | 0.008867 |
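
To get one row per artist, with the artist's name in the index column, the per-artist dictionaries returned by `liwc()` could be collected in a dictionary keyed by artist. The sketch below uses only the standard library and made-up placeholder values; `features_by_artist` and `to_csv_table` are hypothetical names. With pandas, passing the same nested dictionary to `pd.DataFrame.from_dict(..., orient='index')` should give an equivalent table.

```python
import csv
import io


def to_csv_table(features_by_artist):
    """Render {artist: {category: value}} as CSV text, one row per artist."""
    # Use the union of all categories so every row has the same columns
    categories = sorted(
        {category for feats in features_by_artist.values() for category in feats}
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Artist"] + categories)
    for artist, feats in features_by_artist.items():
        # Categories missing for an artist are filled with 0.0
        writer.writerow([artist] + [feats.get(category, 0.0) for category in categories])
    return buffer.getvalue()


# Placeholder values, not real LIWC output
features_by_artist = {
    "Artist A": {"Posemo": 0.02, "Time": 0.04},
    "Artist B": {"Posemo": 0.01, "Social": 0.07},
}
table = to_csv_table(features_by_artist)
print(table)
```

Keying the outer dictionary by artist name means the name travels with the scores, so no separate renaming of the index is needed afterwards.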

