Predicting Mountain Goats Album Era with Sentiment Analysis

ehighland/goats_stats

goats stats

Predicting Mountain Goats Album Era with Sentiment Analysis

This is a binary classification problem: I predict album era from the sentiment intensity of the lyrics. To make era binary, I split the albums into the lo-fi years and the hi-fi years, a conceptual split that is common among fans. Lo-fi refers to low-fidelity recordings, which are less faithful to the original sound because of background noise, static, and similar artifacts. Hi-fi refers to high-fidelity recordings, which sound more polished and truer to the original sound.

I gathered lyrics by album from Kyle Barbour's Annotated Mountain Goats site, copying and pasting the lyrics for each album into a .txt file. Cleaning was partly manual and partly computational: I manually removed song titles from each file, then used Python scripts to remove numbers. In this repository, the .txt files are cleaned further, have stopwords removed, and are tokenized both by word and by sentence using NLTK.
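The number-removal step could look something like the sketch below. This is not the script used in this repository; the function name `strip_numbers` and the sample text are made up for illustration, and the whitespace cleanup is an assumption about what "removing numbers" leaves behind.

```python
import re


def strip_numbers(text: str) -> str:
    """Remove runs of digits, then tidy the whitespace left on each line."""
    without_digits = re.sub(r"\d+", "", text)
    # Collapse repeated spaces/tabs and trim each line separately so that
    # line breaks (useful later for sentence tokenizing) are preserved.
    lines = [
        re.sub(r"[ \t]+", " ", line).strip()
        for line in without_digits.splitlines()
    ]
    return "\n".join(lines)


# Placeholder text, not actual lyrics.
sample = "12 the house is quiet now\nwe drove 40 miles north"
cleaned = strip_numbers(sample)
print(cleaned)
```

In practice the same function would be applied to the contents of each album's .txt file before stopword removal and tokenizing.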

I calculated sentiment polarity for both the word tokens and the sentence tokens, using NLTK's VADER to analyze sentiment intensity. For each album, I calculated the mean, median, and standard deviation of the sentiment intensity scores, separately for the sentence-tokenized and word-tokenized data. This produced a large number of features, so I also used sklearn's k-best feature selection.
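As a rough illustration of the per-album summary step, the sketch below turns a list of sentiment scores into mean, median, and standard deviation features. The scores are invented placeholders standing in for VADER output, the album names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, median, stdev


def summarize_scores(scores):
    """Return mean, median, and sample standard deviation of the scores."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        # stdev needs at least two values; fall back to 0.0 for one score.
        "std": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Invented scores per album. With VADER, each value could be the "compound"
# entry returned by SentimentIntensityAnalyzer().polarity_scores(token).
album_scores = {
    "album_a": [-0.5, 0.0, 0.5],
    "album_b": [0.2, 0.4, 0.6, 0.8],
}

features = {album: summarize_scores(s) for album, s in album_scores.items()}
for album, stats in features.items():
    print(album, stats)
```

Running this once on word-level scores and once on sentence-level scores would give the two separate feature sets described above.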

I tried four classification algorithms on both the sentence- and word-tokenized data: Naive Bayes, K Nearest Neighbors (KNN), Decision Tree, and Random Forest. For each tokenized dataset, I ran them on the whole set of numeric features and on the k best features. Although feature selection improved the performance of Naive Bayes and KNN, neither algorithm performed well overall. Sentence-tokenized data fared best with Random Forest, reaching a high of 81.7% accuracy both before and after feature selection. Word-tokenized data fared best with Decision Tree, reaching a high of 88.3% accuracy both with and without feature selection.
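A comparison along these lines can be sketched with sklearn as follows. Several things here are assumptions rather than details from this project: the data is synthetic (generated with `make_classification`), Naive Bayes is taken to be `GaussianNB`, the k-best scoring function is `f_classif`, `k=4` is arbitrary, and accuracy is estimated with 5-fold cross-validation. The numbers it prints will not match the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the numeric sentiment features and lo-fi/hi-fi labels.
X, y = make_classification(
    n_samples=60, n_features=12, n_informative=4, random_state=0
)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}


def compare(X, y, k=None, folds=5):
    """Mean cross-validated accuracy per classifier, optionally on k best features."""
    results = {}
    for name, clf in classifiers.items():
        # Putting SelectKBest inside the pipeline means features are
        # selected on each training fold only, not on the held-out fold.
        model = clf if k is None else make_pipeline(SelectKBest(f_classif, k=k), clf)
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[name] = scores.mean()
    return results


all_features = compare(X, y)
k_best = compare(X, y, k=4)
for name in classifiers:
    print(f"{name}: all={all_features[name]:.3f}, k-best={k_best[name]:.3f}")
```

Wrapping feature selection and the classifier in one pipeline is a design choice for this sketch; it keeps the selection step from seeing the test fold during evaluation.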