# Predicting Mountain Goats Album Era with Sentiment Analysis
This is a binary classification problem. I am predicting album era based on the sentiment intensity of the lyrics. In order to make era binary, I have separated the data according to the lo-fi years and the hi-fi years. This is a common conceptual split for fans. Lo-fi refers to low fidelity recordings, or recordings that are not faithful to the original sound due to background noise, static, etc. Hi-fi refers to high fidelity records, which are more polished sounding and more true to the original sound. 

I gathered lyrics from Kyle Barbour's [Annotated Mountain Goats site](https://annotatedtmg.org/) by album. I copied/pasted the lyrics for each album into a .txt file. Cleaning was partially manual and partially computational. I cleaned each file by manually removing song titles before using Python scripts to remove numbers. Herein, these .txt files are further cleaned, have stopwords removed, and are tokenized, both by word and by sentence. I tokenized using [NLTK](https://www.nltk.org/index.html).

Sentiment polarity was calculated both using the tokenized words and sentences. I used NLTK's [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) to analyze sentiment intensity. I calculated mean, median, and standard deviation for the sentiment intensity scores for each album. I did this separately for sentence-tokenized and word-tokenized data. This gave me a lot of data features, so I also used the K best algorithm for feature selection with sklearn.

I used four different classification algorithms on both the sentence- and word-tokenized data. The four classification algorithms I tried are Naive Bayes, K Nearest Neighbors, Decision Tree, and Random Forest. I did this using the whole set of numeric features and the k best features for each tokenized dataset. In both cases, feature selection via k best improved the model's accuracy and variability. Overall, the Random Forest classifier performed the best.

### Step One: Imports, Functions, and Variables. Oh my!

In [1]:
# NLTK
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [2]:
# sklearn libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

  from numpy.core.umath_tests import inner1d


In [3]:
# Misc.
import statistics as stat
import pandas as pd
import numpy as np
import string

In [4]:
import matplotlib.pyplot as plt

In [5]:
# global variables
nums = string.digits
punc = string.punctuation

In [6]:
# Get tokenized sentences and words
def tokens(filename):
    tmp_str = ''
    
    # Open file
    with open(filename) as f:
        f_in = f.read()
    f.close()
    
    # Check for numbers or punctuation
    # If a character is a number or punctuation mark,
    # skip it and do not add it to the growing tmp string
    for line in f_in:
        line_str = str(line)
        if line_str in nums or line_str in punc:
            pass
        else:
            tmp_str += line_str
            
    # Replace newlines with a period + space to make it a sentence
    # This is to help define sentences for tokenizing later
    # Consolidate extraneous spaces
    f_prep = tmp_str.replace('\n','. ').replace('  ','')

    # Tokenize sentences
    # Some sentences became only ".", so I removed those
    f_sent = sent_tokenize(f_prep)
    f_sent = [s for s in f_sent if s != '.']
    
    # Tokenize words
    # Some words were stored as only ".", so I removed those
    f_words = word_tokenize(f_prep)
    f_words = [s for s in f_words if s != '.']
    
    # Remove stop words
    f_sent2 = [f for f in f_sent if f not in stop_words]
    f_words2 = [f for f in f_words if f not in stop_words]
    
    # return the tokenized sentences and words
    return f_sent2, f_words2  

In [7]:
# All text files are in their own directory (text)
# Get sentences and individual words for each album
aed_sentence, aed_words = tokens('text/all-eternals-deck.txt')
btc_sentence, btc_words = tokens('text/beat-the-champ.txt')
heretic_sentence, heretic_words = tokens('text/heretic-pride.txt')
gl_sentence, gl_words = tokens('text/get-lonely.txt')
life_sentence, life_words = tokens('text/tlotwtc.txt')
ty_sentence, ty_words = tokens('text/transcendental-youth.txt')
sunset_sentence, sunset_words = tokens('text/sunset-tree.txt')
wsabh_sentence, wsabh_words = tokens('text/wsabh.txt')
talla_sentence, talla_words = tokens('text/tallahassee.txt')
ahwt_sentence, ahwt_words = tokens('text/ahwt.txt')
tcg_sentence, tcg_words = tokens('text/the-coroners-gambit.txt')
ffg_sentence, ffg_words = tokens('text/full-force-galesburg.txt')
nfj_sentence, nfj_words = tokens('text/nothing-for-juice.txt')
sweden_sentence, sweden_words = tokens('text/sweden.txt')
zopilote_sentence, zopilote_words = tokens('text/zopilote-machine.txt')
nfj_sentence, nfj_words = tokens('text/nothing-for-juice.txt')
sweden_sentence, sweden_words = tokens('text/sweden.txt')
zopilote_sentence, zopilote_words = tokens('text/zopilote-machine.txt')

In [8]:
# Sentences
album_lines = [
    btc_sentence,
          ty_sentence,
          aed_sentence,
          life_sentence,
          heretic_sentence,
          gl_sentence,
          sunset_sentence,
         wsabh_sentence,
         talla_sentence,
          ahwt_sentence,
          tcg_sentence,
          ffg_sentence,
          nfj_sentence,
          sweden_sentence,
          zopilote_sentence
         ]

In [9]:
# Words
album_words = [aed_words,
          btc_words,
          heretic_words,
          gl_words,
          life_words,
          ty_words,
          sunset_words,
         wsabh_words,
         talla_words,
          ahwt_words,
          tcg_words,
          ffg_words,
          nfj_words,
          sweden_words,
          zopilote_words    
]

In [10]:
# Album names
album_names = [
    'Beat the Champ',
    'Transcendental Youth',
    'All Eternals Deck',
    'The Life of the World to Come',
    'Heretic Pride',
    'Get Lonely',
    'The Sunset Tree',
    'We Shall All Be Healed',
    'Tallahassee',
    'All Hail West Texas',
    "The Coroner's Gambit",
    'Full Force Galesburg',
    'Nothing for Juice',
    'Sweden',
    'Zopilote Machine'
]

In [11]:
# Album release years
album_years = [2015,
          2012,
          2011,
          2009,
          2008,
          2006,
          2005,
          2004,
          2002,
          2002,
          2000,
          1997,
            1996,
            1995,
            1994
         ]

In [12]:
# Album labels
album_labels = ['Merge',
             'Merge',
             'Merge',
             '4AD',
             '4AD',
             '4AD',
             '4AD',
             '4AD',
             '4AD',
             'Emperor Jones',
             'Absolutely Kosher',
             'Emperor Jones',
             'Ajax',
             'Shrimper',
             'Ajax'
            ]

In [13]:
# Hi-fi or lo-fi 
fi = [
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'hi-fi',
    'lo-fi',
    'lo-fi',
    'lo-fi',
    'lo-fi',
    'lo-fi',
    'lo-fi'
]
# Hi-fi or lo-fi represented as 1 or 0, respectively
fi_binary = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

In [14]:
# Lambda functions to get the negative, positive, 
# and neutral sentiment polarity scores individually
neg = lambda n : [n[i]['neg'] for i in range(len(n))]
posi = lambda p : [p[i]['pos'] for i in range(len(p))]
neu = lambda ne : [ne[i]['neu'] for i in range(len(ne))]

In [15]:
# Function to get mean, median, and standard deviation of input
def get_stats(info):
    me = stat.mean(info)
    med = stat.median(info)
    sdev = stat.stdev(info)
    return me,med,sdev

In [16]:
# Function to get sentiment polarities and statistics
def get_sentiment(lyrics):
    # Empty list to store each set of stats
    stats_list = []
    
    # Run the sentiment intensity analysis for input parameter
    for l in lyrics:
        # Get polarity scores
        tmp_sia = [sia.polarity_scores(y) for y in l]
        # Get mean, median, and standard dev
        # For negative, positive, and neutral sentiments
        tmp_neg_one,tmp_neg_two,tmp_neg_three = get_stats(neg(tmp_sia))
        tmp_pos_one,tmp_pos_two,tmp_pos_three = get_stats(posi(tmp_sia))
        tmp_neu_one,tmp_neu_two,tmp_neu_three = get_stats(neu(tmp_sia))
        
        # Add all stats to temporary list
        tmp_list = [tmp_neg_one,tmp_neg_two,tmp_neg_three,
                    tmp_pos_one,tmp_pos_two,tmp_pos_three,
                    tmp_neu_one,tmp_neu_two,tmp_neu_three
                   ]
        # Make stats_list a list of lists
        stats_list.append(tmp_list)
        
    # Return the list of lists    
    return stats_list

In [17]:
# Statistics for sentences
album_stats_lines = get_sentiment(album_lines)

In [18]:
# Statistics for words
album_stats_words = get_sentiment(album_words)

In [19]:
# Function to make pandas data frames
# Since this is the only dataset in use, the
# column names are hard-coded in to make it easy
def make_df(stats_list):
    album_list = []
    for i in range(15):
        tmp = [album_names[i],album_labels[i],fi[i],album_years[i]]
        tmp_list = tmp + stats_list[i]
        album_list.append(tmp_list)
    my_df = pd.DataFrame(data = album_list,
                         columns = ['Album.Name',
                                    'Album.Label',
                                    'Hi.Lo.Fi',
                                    'Album.Year',
                                    'Avg.Neg','Med.Neg','StDev.Neg',
             'Avg.Posi','Med.Posi','StDev.Posi',
            'Avg.Neu','Med.Neu','StDev.Neu']
                        )
    return my_df
                                    
                         
    

### Step Two: Visualize with Pandas DataFrames

The first shown DataFrame contains statistics about the sentiment of the lines in each song on an album. These were generated with nltk's sentence tokenizer. The second DataFrame contains the same statistics, but calculated based on the sentiment scores of individual words. This was done with nltk's word tokenizer.

In [20]:
# Make and display dataframe for sentences statistics
goats_df = make_df(album_stats_lines)
goats_df

Unnamed: 0,Album.Name,Album.Label,Hi.Lo.Fi,Album.Year,Avg.Neg,Med.Neg,StDev.Neg,Avg.Posi,Med.Posi,StDev.Posi,Avg.Neu,Med.Neu,StDev.Neu
0,Beat the Champ,Merge,hi-fi,2015,0.097624,0.0,0.201724,0.103033,0.0,0.181572,0.799352,1.0,0.250419
1,Transcendental Youth,Merge,hi-fi,2012,0.07342,0.0,0.149858,0.106268,0.0,0.1884,0.820313,1.0,0.221928
2,All Eternals Deck,Merge,hi-fi,2011,0.057328,0.0,0.137868,0.107996,0.0,0.195333,0.834683,1.0,0.231174
3,The Life of the World to Come,4AD,hi-fi,2009,0.06858,0.0,0.149791,0.069929,0.0,0.146388,0.861494,1.0,0.200583
4,Heretic Pride,4AD,hi-fi,2008,0.067439,0.0,0.149653,0.086585,0.0,0.194525,0.84598,1.0,0.228919
5,Get Lonely,4AD,hi-fi,2006,0.067858,0.0,0.160274,0.062006,0.0,0.133065,0.87014,1.0,0.195595
6,The Sunset Tree,4AD,hi-fi,2005,0.064758,0.0,0.140908,0.071815,0.0,0.155714,0.863433,1.0,0.202443
7,We Shall All Be Healed,4AD,hi-fi,2004,0.042608,0.0,0.118405,0.085246,0.0,0.164869,0.872147,1.0,0.202518
8,Tallahassee,4AD,hi-fi,2002,0.072182,0.0,0.150827,0.109621,0.0,0.197108,0.818205,1.0,0.24992
9,All Hail West Texas,Emperor Jones,lo-fi,2002,0.060354,0.0,0.137448,0.096566,0.0,0.181431,0.843078,1.0,0.218602


In [21]:
# Make and display dataframe for word statistics
goats_df2 = make_df(album_stats_words)
goats_df2

Unnamed: 0,Album.Name,Album.Label,Hi.Lo.Fi,Album.Year,Avg.Neg,Med.Neg,StDev.Neg,Avg.Posi,Med.Posi,StDev.Posi,Avg.Neu,Med.Neu,StDev.Neu
0,Beat the Champ,Merge,hi-fi,2015,0.04447,0.0,0.206195,0.080388,0.0,0.27197,0.855758,1.0,0.351435
1,Transcendental Youth,Merge,hi-fi,2012,0.068724,0.0,0.253072,0.082048,0.0,0.274534,0.819074,1.0,0.385092
2,All Eternals Deck,Merge,hi-fi,2011,0.051188,0.0,0.220449,0.063376,0.0,0.243712,0.834857,1.0,0.371423
3,The Life of the World to Come,4AD,hi-fi,2009,0.048679,0.0,0.21527,0.052156,0.0,0.222418,0.819889,1.0,0.384414
4,Heretic Pride,4AD,hi-fi,2008,0.050294,0.0,0.218622,0.063357,0.0,0.243684,0.826911,1.0,0.378448
5,Get Lonely,4AD,hi-fi,2006,0.059303,0.0,0.236279,0.086731,0.0,0.281545,0.828021,1.0,0.377502
6,The Sunset Tree,4AD,hi-fi,2005,0.049574,0.0,0.217147,0.055771,0.0,0.229567,0.848954,1.0,0.358232
7,We Shall All Be Healed,4AD,hi-fi,2004,0.036558,0.0,0.187744,0.074638,0.0,0.262907,0.843869,1.0,0.363118
8,Tallahassee,4AD,hi-fi,2002,0.058533,0.0,0.234832,0.092384,0.0,0.289669,0.797602,1.0,0.401929
9,All Hail West Texas,Emperor Jones,lo-fi,2002,0.046989,0.0,0.211664,0.073449,0.0,0.260931,0.812044,1.0,0.390766


### Step Three: How well can album sentiment predict album era?

First, I made each DataFrame into a numpy array. Next, I removed the non-numeric data. I also removed the year. As stated previously, lo-fi and hi-fi albums correspond to certain years, hence using the binary distinction of lo- or hi-fi for era. I did this first for the sentence-tokenized and second for the word-tokenized data.

This is a binary classification problem on a small dataset. I will be trying multiple classification algorithms from sklearn and performing 5-fold cross validation. The classifiers I will are trying are:
* Naive Bayes
* K Nearest Neighbors
* Decision Tree
* Random Forest

I created a function (below) to generate a DataFrame with information about the performance of each classifier.

In [24]:
def make_clf_df(input_array):
    clf_nb = GaussianNB()
    clf_knn = KNeighborsClassifier(n_neighbors=3)
    clf_dt = DecisionTreeClassifier(random_state=0)
    clf_rf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    
    nb_scores = cross_val_score(clf_nb, input_array, fi_binary, cv=5)
    knn_scores = cross_val_score(clf_knn, input_array, fi_binary, cv=5)
    dt_scores = cross_val_score(clf_dt, input_array, fi_binary, cv=5)
    rf_scores = cross_val_score(clf_rf, input_array, fi_binary, cv=5)
    
    scores_list = [nb_scores,knn_scores,dt_scores,rf_scores]
    
    df_data = [(s.mean(),s.std()) for s in scores_list]
    
    my_df = pd.DataFrame(data = df_data,
                         index = ['Naive Bayes',
                                    'K Nearest Neighbors',
                                    'Decision Tree',
                                    'Random Forest'],
                         columns = ['Mean',
                                 'Standard.Dev']
                        )
    return my_df

#### Sentence-Tokenized

In [22]:
goats_np = goats_df.to_numpy()
goats_nums = [g[4:14] for g in goats_np]
goats_nums_df = pd.DataFrame(data=goats_nums)
goats_np_num = goats_nums_df.to_numpy()

In [25]:
whole_df = make_clf_df(goats_np_num)
whole_df

Unnamed: 0,Mean,Standard.Dev
Naive Bayes,0.4,0.08165
K Nearest Neighbors,0.566667,0.249444
Decision Tree,0.6,0.226078
Random Forest,0.816667,0.152753


The Random Forest classifier is outperforming the others by a significant margin. While the mean Random Forest accuracy score is ~82, the accuracy varied substantially given the high standard deviation. Still, the RF standard deviation is lower than both the Decision Tree and KNN approaches. Naive Bayes had the lowest variability, but is also the least accurate.

There are a lot of features in this data set, so I decided to use the k best algorithm for feature selection. Using k=4, the highest scoring features are the mean and standard deviation for negative sentiment intensity score, the mean positive sentiment intensity score, and the mean neutral sentiment intensity score. 

In [26]:
x,y = goats_np_num, fi_binary
k_best = SelectKBest(chi2, k=4).fit_transform(x,y)
k_best

  if np.issubdtype(mask.dtype, np.int):


array([[0.09762388, 0.20172438, 0.10303284, 0.79935224],
       [0.07341972, 0.14985775, 0.10626761, 0.82031268],
       [0.05732766, 0.13786792, 0.10799574, 0.83468298],
       [0.06857988, 0.14979107, 0.06992899, 0.86149408],
       [0.06743946, 0.1496531 , 0.0865852 , 0.84597982],
       [0.06785756, 0.16027391, 0.06200581, 0.87013953],
       [0.06475839, 0.14090822, 0.07181544, 0.86343289],
       [0.04260778, 0.11840521, 0.08524551, 0.87214671],
       [0.07218159, 0.15082707, 0.10962148, 0.8182046 ],
       [0.060354  , 0.13744771, 0.096566  , 0.843078  ],
       [0.05684821, 0.13411923, 0.09298214, 0.8501756 ],
       [0.04168598, 0.11535008, 0.06910061, 0.88921341],
       [0.05318387, 0.16714171, 0.06206774, 0.88473548],
       [0.06322977, 0.14659495, 0.08487702, 0.85189644],
       [0.06028367, 0.1557444 , 0.08808596, 0.85163324]])

In [27]:
k_best_df = make_clf_df(k_best)
k_best_df

Unnamed: 0,Mean,Standard.Dev
Naive Bayes,0.5,0.258199
K Nearest Neighbors,0.566667,0.249444
Decision Tree,0.7,0.266667
Random Forest,0.816667,0.152753


This time, the RF approach and the KNN approach are just as accurate and variable as before. However, the accuracy for the other two algorithms is much higher, although they still vary a lot. The Naive Bayes and Decision Tree approaches both became 10% more accurate, but the variability of NB rose substantially.

#### Word-Tokenized

In [32]:
goats_np2 = goats_df2.to_numpy()
goats_nums2 = [g[4:14] for g in goats_np2]
goats_nums_df2 = pd.DataFrame(data=goats_nums2)
goats_np_num2 = goats_nums_df2.to_numpy()

In [33]:
whole_df2 = make_clf_df(goats_np_num2)
whole_df2

Unnamed: 0,Mean,Standard.Dev
Naive Bayes,0.7,0.266667
K Nearest Neighbors,0.766667,0.2
Decision Tree,0.883333,0.145297
Random Forest,0.816667,0.152753


For the word-tokenized data, the Decision Tree approach is leading the pack in accuracy and has the lowest standard deviation. Random Forest is still doing quite well and its performance is equal to that of the sentence-tokenized data. The accuracy of NB and KNN are much higher, both now 20% more accurate. For both NB and KNN, the standard deviation has decreased, although it is still relatively high.

I also used k best with k=4 for feature selection in the word-tokenized data set. This time, the highest scoring features were different. They were mean and standard deviation for both negative sentiment intensity score and positive sentiment intensity score. Neutral scores were not as high scoring for the word-tokenized data.

In [30]:
x2,y2 = goats_np_num2, fi_binary
k_best2 = SelectKBest(chi2, k=4).fit_transform(x2,y2)
k_best2

  if np.issubdtype(mask.dtype, np.int):


array([[0.04446978, 0.2061952 , 0.08038769, 0.27196999],
       [0.0687237 , 0.25307246, 0.08204769, 0.27453363],
       [0.0511883 , 0.2204488 , 0.06337599, 0.24371227],
       [0.04867872, 0.21527036, 0.05215577, 0.22241841],
       [0.05029393, 0.2186222 , 0.06335728, 0.24368407],
       [0.05930319, 0.23627887, 0.08673091, 0.2815447 ],
       [0.04957397, 0.21714723, 0.05577072, 0.22956735],
       [0.0365575 , 0.18774423, 0.07463823, 0.26290685],
       [0.05853315, 0.23483165, 0.09238364, 0.28966891],
       [0.04698905, 0.21166369, 0.07344891, 0.26093146],
       [0.04439746, 0.20604914, 0.068358  , 0.2524482 ],
       [0.03286034, 0.17833758, 0.0545183 , 0.22712238],
       [0.03722084, 0.18938088, 0.04631927, 0.21026261],
       [0.04058442, 0.19740555, 0.06412338, 0.24507207],
       [0.03745072, 0.18992595, 0.05387648, 0.22584799]])

In [31]:
k_best_df2 = make_clf_df(k_best2)
k_best_df2

Unnamed: 0,Mean,Standard.Dev
Naive Bayes,0.766667,0.2
K Nearest Neighbors,0.833333,0.210819
Decision Tree,0.883333,0.145297
Random Forest,0.883333,0.145297


This time, KNN, DT, and RF are equally accurate and DT and RF have the same variability. NB is at its most accurate level with its lowest variability, but does not come close to the other algorithms.