# Music Culture and Psychological States
## Replicating and Expanding DeWall et al. 2011 "Tuning in to Psychological Change"<sup>1</sup>

DeWall et al. 2011 tested three relationships among music lyric legisigns, looking for evidence that the replication of self-focus and antisocial legisigns has increased over time in song lyrics from the Billboard Hot 100, while the replication of other-focus, social interaction, and positive emotion legisigns has decreased. Having confirmed these expectations, they argue that this pattern in musical legisign replication can be generalized: U.S. culture has become more self-focused and antisocial. If we accept Frith's assertion that songs "provide people with the means to articulate [...] feelings" (1996: 164)<sup>2</sup>, then the finding of increasingly antisocial and self-focused language in the most popular U.S. songs should be suggestive of a more general cultural psychological state. In Peircean terms, this means that, on one level, each word in the lyrics is a replica of a dicent indexical legisign. For instance, replicas of the "self-focus" words I/me/my/mine stand for the singer singing the word (and for the listener who uses the word to articulate their own feelings), index that person's internal psychological state, and produce a dicent interpretant associating that psychological state with them. Today, we're going to attempt to replicate and expand DeWall et al.'s study (using Billboard Hot 100 song lyrics from 1950-2015) by answering the following questions:

1. Can we identify increasing self-focus (first person singular pronouns) and decreasing other-focus ("communion," first person plural pronouns)?
2. Can we identify decreasing "social connection" legisign replication?
3. Is there an increase in antisocial legisign replication and a decrease in pro-social legisign replication?
4. Extending DeWall et al.'s study, are the topics and themes in songs changing as well? Do the topics of songs also reflect this antisocial, self-focused interpretation?

Our data come from a couple of public GitHub repositories (see [here](https://github.com/kevinschaich/billboard) and [here](https://towardsdatascience.com/billboard-hot-100-analytics-using-data-to-understand-the-shift-in-popular-music-in-the-last-60-ac3919d39b49)) that compiled extensive historical data about the Billboard Hot 100 from 1950-2015, along with data about the music itself from Spotify.
    
Note: The authors of the study tested individual words but did not provide all of the words they used for Social Connection, Angry, and Positive Emotion (unless the few words given in the text on page 3 are the only words they checked). In the original study, DeWall et al. include music genre as dummy variables in their regression models, along with changes in the ranking formula that account for digital downloads and streamed media. We will not include these factors in our analysis: genre is tricky to pin down because so many songs cross genre boundaries, and genre information is often unavailable for early songs from the 1950s. We will instead consider the overall effects across genres and ranking formulas.
    
---------------------------------
  
<sup>1</sup>DeWall, C. N., Pond, R. S., Jr., Campbell, W. K., & Twenge, J. M. 2011. "Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics." *Psychology of Aesthetics, Creativity, and the Arts*.

<sup>2</sup>Frith, Simon. 1996. “Songs as Texts.” In *Performing Rites: On the Value of Popular Music*. Cambridge, MA: Harvard University Press, pp. 158-182.


In [64]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import string
from gensim import corpora, models

# Some Functions from Last Time to get us started:
def get_wordnet_pos(word):

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": nltk.corpus.wordnet.ADJ,
                "N": nltk.corpus.wordnet.NOUN,
                "V": nltk.corpus.wordnet.VERB,
                "R": nltk.corpus.wordnet.ADV}

    return tag_dict.get(tag, nltk.corpus.wordnet.NOUN)

def get_lemmas(text):

    stop = nltk.corpus.stopwords.words('english') + list(string.punctuation)
    tokens = [i for i in nltk.word_tokenize(text.lower()) if i not in stop]
    lemmas = [nltk.stem.WordNetLemmatizer().lemmatize(t, get_wordnet_pos(t)) for t in tokens]
    return lemmas

# Variant of get_lemmas that keeps stop words: we need to retain "I", "me", etc. for the pronoun counts
def get_tokens(text):
    # drop punctuation, but keep stop words for the initial word counting
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text.lower())
    return tokens

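def fill_topic_weights(df_row, bow_corpus):
    # Fill in the weight of each topic for one song (one row of the DataFrame).
    # Relies on the global `ldamodel`; topics the model does not report for a song are left unset.
    try:
        for topic_id, weight in ldamodel[bow_corpus[df_row.name]]:
            df_row[str(topic_id)] = weight
    except Exception:
        # e.g. no bag-of-words entry for this row: return the row as it stands
        pass
    return df_row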

def top_songs_by_topic(ldamodel, corpus, ntop=1):
    # For each topic, collect the ntop most representative songs,
    # i.e. the songs with the highest weight on that topic.
    # Relies on the global `music_df` to look up titles.
    doc_topics = [dict(doc) for doc in ldamodel[corpus]]  # run inference once, not once per topic
    topn_songs_by_topic = {}
    for i in range(ldamodel.num_topics):
        top = sorted(range(len(doc_topics)), reverse=True, key=lambda j: doc_topics[j].get(i, 0.0))
        topn_songs_by_topic[i] = top[:ntop]
        # Print the top songs for each topic; their indices are returned as a dictionary for further analysis
        print("Topic " + str(i))
        print(music_df[['title','year','artist']].loc[topn_songs_by_topic[i]])
        print("*******************************")
    return topn_songs_by_topic
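
One thing to keep in mind about `get_tokens`: it strips punctuation before tokenizing, so a contraction such as "I'm" should come out as the single token "im" and would not be counted as a first person singular pronoun. The sketch below illustrates the effect and one possible alternative, a regular-expression tokenizer that splits at apostrophes. It uses only the standard library; `strip_then_split` (a stand-in for `get_tokens`, with `str.split` in place of `nltk.word_tokenize`) and `regex_tokens` are hypothetical names, not part of the original analysis.

```python
import re
import string

def strip_then_split(text):
    # Stand-in for get_tokens: remove punctuation first, then split on whitespace.
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return cleaned.lower().split()

def regex_tokens(text):
    # Alternative (assumption, not the notebook's method): keep runs of letters,
    # so "I'm" yields the two tokens "i" and "m".
    return re.findall(r"[a-z]+", text.lower())

line = "I'm not the one I was"
print(strip_then_split(line).count("i"))  # counts only the standalone "I"
print(regex_tokens(line).count("i"))      # also counts the "I" inside "I'm"
```

Whether contractions should count toward self-focus is a judgment call; the point is only that the choice of tokenizer changes the pronoun counts.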

In [2]:
music_df = pd.read_csv('music_df.csv')
# relabel the decade bins with the full century (e.g. '50s' -> '1950s', '00s' -> '2000s')
music_df['year_bin'] = music_df['year_bin'].apply(lambda x: '20'+x if x in ('10s', '00s') else '19'+x)
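
The relabeling above depends on the short labels already stored in `year_bin`. An alternative, sketched here under the assumption that `year` holds four-digit years, derives the decade label from the year itself. `decade_label` is a hypothetical helper, not something the dataset or its authors provide.

```python
def decade_label(year):
    # Integer-divide by 10 to drop the final digit, then rebuild the decade,
    # e.g. 1957 -> 1950 -> "1950s".
    return f"{int(year) // 10 * 10}s"

for y in (1950, 1999, 2015):
    print(y, decade_label(y))
```

In the notebook this could be applied with `music_df['year'].apply(decade_label)`.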

In [4]:
print(music_df.columns)
music_df.head()
# today we only work with the lyrics, year, and year_bin columns;
# the many others were defined by Spotify and by the authors of the GitHub repositories

Index(['lyrics', 'num_syllables', 'pos', 'year', 'fog_index', 'flesch_index',
       'num_words', 'num_lines', 'title', 'f_k_grade', 'artist',
       'difficult_words', 'num_dupes', 'neg', 'neu', 'compound', 'id',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'time_signature', 'uri', 'analysis_url',
       'artist_with_features', 'year_bin', 'image', 'cluster', 'Gender'],
      dtype='object')


Unnamed: 0,lyrics,num_syllables,pos,year,fog_index,flesch_index,num_words,num_lines,title,f_k_grade,...,tempo,duration_ms,time_signature,uri,analysis_url,artist_with_features,year_bin,image,cluster,Gender
0,"Mona Lisa, Mona Lisa, men have named you\nYou'...",189.0,0.199,1950,5.2,88.74,145,17,Mona Lisa,2.9,...,86.198,207573.0,3,spotify:track:3k5ycyXX5qsCjLd7R2vphp,https://api.spotify.com/v1/audio-analysis/3k5y...,,1950s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
1,I wanna be Loved\nBy Andrews Sisters\n\nOooo-o...,270.9,0.224,1950,4.4,82.31,189,31,I Wanna Be Loved,3.3,...,170.869,198027.0,5,spotify:track:4UY81WrDU3jTROGaKuz4uZ,https://api.spotify.com/v1/audio-analysis/4UY8...,Gordon Jenkins,1950s,https://i.scdn.co/image/42e4dc3ab9b190056a1ca1...,String Lover,Group
2,I was dancing with my darling to the Tennessee...,174.6,0.351,1950,5.2,88.74,138,16,Tennessee Waltz,2.9,...,86.335,182733.0,3,spotify:track:6DKt9vMnMN0HmlnK3EAHRQ,https://api.spotify.com/v1/audio-analysis/6DKt...,,1950s,https://i.scdn.co/image/353b05113b1a140d64d83d...,String Lover,female
3,Each time I hold someone new\nMy arms grow col...,135.9,0.231,1950,4.4,99.23,117,18,I'll Never Be Free,0.9,...,82.184,158000.0,3,spotify:track:0KnD456yC5JuweN932Ems3,https://api.spotify.com/v1/audio-analysis/0KnD...,Kay Starr,1950s,https://i.scdn.co/image/4bd427bb9181914d0fa448...,String Lover,male
4,"Unfortunately, we are not licensed to display ...",46.8,0.079,1950,6.0,69.79,32,3,All My Love,6.0,...,123.314,190933.0,4,spotify:track:05sXHTLqIpwywbpui1JT4o,https://api.spotify.com/v1/audio-analysis/05sX...,,1950s,https://i.scdn.co/image/353b05113b1a140d64d83d...,String Lover,female


--------------------
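
Each measure in the next cell is a proportion: the number of tokens in a song that belong to a small word list, divided by the song's total token count. A minimal sketch of that shared pattern follows, assuming token lists like those `get_tokens` returns. `word_share` is a hypothetical helper, and returning 0.0 for an empty token list is an assumption added here (plain division would raise a `ZeroDivisionError` for such a song).

```python
from collections import Counter

def word_share(tokens, words):
    # Proportion of tokens that appear in `words`; 0.0 for an empty song (assumption).
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in words) / len(tokens)

tokens = ["i", "love", "my", "friend", "and", "we", "talk"]
print(word_share(tokens, ["i", "me", "my", "mine"]))    # 2 of the 7 tokens
print(word_share(tokens, ["we", "us", "our", "ours"]))  # 1 of the 7 tokens
```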

In [None]:
# write functions to answer the questions
# the lyrics need to be tokenized first
lyrics = music_df['lyrics'].apply(get_tokens) # apply the function to the lyrics column
# each function below returns the proportion of a song's tokens that belong to one word list
def first_p_sg_perc(text): # self-focus (first person singular pronouns)
    first_p_sg_count = text.count("i") + text.count("me") + text.count("my") + text.count("mine")
    return first_p_sg_count/len(text)

def first_p_pl_perc(text): # other-focus (first person plural pronouns)
    first_p_pl_count = text.count("we") + text.count("us") + text.count("our") + text.count("ours")
    return first_p_pl_count/len(text)

# expectation: an increase in first person singular and a decrease in first person plural

# social connection
def social_connection_perc(text):
    social_count = text.count("mate") + text.count("talk") + text.count("child") + text.count("together") + text.count("friend")
    return social_count/len(text)
# possible extension: use word2vec to find clusters of related words, then count those as well

def antisocial_perc(text):
    antisocial_count = text.count("kill") + text.count("hate") + text.count("annoyed") + text.count("damn") + text.count("fuck")
    return antisocial_count/len(text)

def positive_perc(text):
    positive_count = text.count("love") + text.count("nice") + text.count("sweet")
    return positive_count/len(text)

agency = lyrics.apply(first_p_sg_perc) # self focus score
communion = lyrics.apply(first_p_pl_perc) # other focus
social_connection = lyrics.apply(social_connection_perc)
antisocial = lyrics.apply(antisocial_perc)
positive = lyrics.apply(positive_perc)

# put them all into a DataFrame by passing them as dictionary values
df = pd.DataFrame({'Agency': agency, 'Communion' : communion,
                   'Social Connection': social_connection, 'Antisocial': antisocial,
                   'Positive': positive, 'Year': music_df['year'], 'Year_Bin': music_df['year_bin']})

# plot the relationships to look for visual patterns
# plot by decade rather than by year, because yearly means are too noisy
# double brackets select several columns and return a DataFrame; groupby then groups the rows by decade
df[['Agency', 'Communion', 'Year_Bin']].groupby('Year_Bin').mean().plot();
# plot the other measures as well
df[['Social Connection', 'Year_Bin']].groupby('Year_Bin').mean().plot();
df[['Antisocial', 'Positive', 'Year_Bin']].groupby('Year_Bin').mean().plot();

# look at regression equations
# fit regressions on year and on year bin; repeat for each of the measures (only Agency is shown here)

import statsmodels.formula.api as smf # formula API, for writing models as formulas

mod = smf.ols(formula = 'Agency ~ Year', data = df) # ols = ordinary least squares; Year is a numeric predictor
res = mod.fit() # fit the line
print(res.summary()) # print the summary

mod = smf.ols(formula = 'Agency ~ Year_Bin', data = df) # Year_Bin is a string column, so it is treated as categorical
res = mod.fit() # fit the model
print(res.summary()) # print the summary


# apart from using word2vec, we can assess thematically whether songs as a whole are changing (systemic shifts)
# topic modeling is one way to get at this
# to do this, the data needs to be in a form gensim can work with: lemmatize the lyrics using the get_lemmas function
# lemmatizing takes a long time, so in class we read in previously lemmatized data instead
lemmas = pd.read_pickle('lyric_lemmas.pkl')
lemmas[0][:10] # first ten lemmas of the first song (Nat King Cole, "Mona Lisa")
# "mona lisa" is split into two tokens; we can deal with this using word2vec or gensim's built-in bigram finder
# bigrams: two words that tend to co-occur are combined into a single unit and analyzed as such
bigram = models.Phrases(lemmas, min_count = 5) # min_count: minimum number of times a pair must co-occur to make the list of bigrams
bigram_mod = models.phrases.Phraser(bigram) # export the model to a lighter object so it doesn't use a ton of memory
print(bigram_mod[lemmas[0]]) # see if it worked on the first song
# now "mona lisa" is treated as one token, as is "broken heart"; "work of art" became the bigram work_art because we removed small words like "of"
# make bigrams for all of the songs
def make_bigrams(texts):
    # for each document in texts (each song), merge the detected bigrams; return the results as a single list
    return [bigram_mod[doc] for doc in texts]
lemmas = make_bigrams(lemmas)
# can also make trigrams, etc.
# now identify topics
# we use a particular approach, LDA, in which topics are probability distributions over words
# it finds a set of such distributions that describes where the words in the songs come from
# the number of topics has to be chosen a priori; we don't know it, so we usually pick the number that makes the topics as interpretable as possible (there are multiple methods; topic coherence, based on how often a topic's top words co-occur, is commonly used)
# based on this (see the key for the code), we are using five topics
# kind of like latent factor analysis
# the model needs a dictionary (word ids) and a bag-of-words corpus; the key has the versions used in class
# assumption: they are rebuilt here in the standard gensim way; if the key filters the dictionary, use the key's versions so the ids match the saved model
dictionary = corpora.Dictionary(lemmas)
bow_corpus = [dictionary.doc2bow(doc) for doc in lemmas]
# the following takes ages to run (unsupervised model):
ldamodel = models.ldamodel.LdaModel(bow_corpus, num_topics = 5, id2word = dictionary)
# so instead load the results of a previous run
ldamodel = models.ldamodel.LdaModel.load('lda5p20_i400.model')
topics = ldamodel.print_topics(num_words=20) # print_topics is a gensim method; show the 20 top words per topic
# print them out
for topic in topics:
    print(topic)

# the number attached to each word is its weight in the topic
# these are hard to interpret, so use the function from the top (prints the top-scoring song for each topic)
# this needs the bag-of-words corpus (see key)
top_songs = top_songs_by_topic(ldamodel, bow_corpus) # pass the LDA model (topic values) and the bag-of-words corpus; a new name avoids overwriting the function
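
To relate the topics back to question 4, one option is to average each topic's weight within each decade and see how those averages move over time. The sketch below does this in plain Python on made-up data: `doc_topics` mimics the list of `(topic_id, weight)` pairs that an LDA model reports for each song, and `decades`, `doc_topics`, and `mean_topic_weight_by_decade` are hypothetical names and values, not results from the actual model.

```python
from collections import defaultdict

def mean_topic_weight_by_decade(decades, doc_topics, num_topics):
    # Sum each topic's weight per decade; topics not reported for a song count as 0.
    totals = defaultdict(lambda: [0.0] * num_topics)
    n_songs = defaultdict(int)
    for decade, topics in zip(decades, doc_topics):
        n_songs[decade] += 1
        for topic_id, weight in topics:
            totals[decade][topic_id] += weight
    # Divide by the number of songs in the decade to get the mean weight.
    return {d: [w / n_songs[d] for w in totals[d]] for d in totals}

# Toy input, invented for illustration only.
decades = ["1950s", "1950s", "2000s"]
doc_topics = [[(0, 0.9), (1, 0.1)], [(0, 0.5), (1, 0.5)], [(1, 1.0)]]
means = mean_topic_weight_by_decade(decades, doc_topics, num_topics=2)
for decade in sorted(means):
    print(decade, means[decade])
```

In the notebook, `fill_topic_weights` could supply the same per-song weights as DataFrame columns, after which grouping by `year_bin` and taking the mean should give the equivalent table.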
