In [290]:
from gmusicapi import Mobileclient
import pandas as pd
import numpy as np
from PyLyrics import *
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

# Recently, I've been curious about some real-life applications of NLP. Here, I try my hand at using NLP to classify music genres based on lyrics. Is it true that country singers only sing about mama and trucks? Do rappers only rap about money, cars, and clothes? NLP can offer some insight into these patterns.

### I start with my personal music library as the collection to examine. I listen to a bit of everything, so it seems like a good place to start. Thankfully, GmusicAPI exists. This API lets me access my personal playlists and extract data from them. For this analysis, I'm going to use the playlist of songs that I've 'thumbed up'. GmusicAPI makes it simple to log in and access a lot of information about music.

In [23]:
api = Mobileclient()
api.login(username, password, Mobileclient.FROM_MAC_ADDRESS)
playlists = api.get_all_user_playlist_contents()

### Here, I'm creating and filling a dataframe with the information that I will later use to retrieve lyrics. I'm intentionally ignoring some potentially useful metadata that GmusicAPI offers because I want to limit the analysis to lyrics. Some of the fields I need aren't available for every song, so I'm skipping those songs for now.

In [155]:
# Collect one record per track, then build the dataframe in a single step.
rows = []
for playlist in playlists:
    if playlist['name'] == 'ThumbbedUP':
        for song in playlist['tracks']:
            try:
                track = song['track']
                rows.append({'title': track['title'], 'genre': track['genre'], 'artist': track['artist']})
            except KeyError:
                # Skip songs that are missing track metadata.
                continue
# The lyrics columns stay empty for now; column order matches the output shown below.
songDataFrame = pd.DataFrame(rows, columns=['artist', 'genre', 'lyrics', 'lyricsCleaned', 'title'])

### Let's take a look at what we have so far.

In [156]:
songDataFrame

Unnamed: 0,artist,genre,lyrics,lyricsCleaned,title
0,Sam Smith,Pop,,,Too Good At Goodbyes
1,Wayne Toups,Folk,,,Take My Hand (feat. Zydecajun)
2,Robert Earl Keen,Alt Country,,,The Road Goes On Forever
3,John Mayer,Adult Contemporary,,,Half of My Heart
4,Imagine Dragons,Alternative/Indie,,,Believer
5,Hayes Carll,Folk,,,The Magic Kid
6,Billy Paul,R&B/Soul,,,Me and Mrs. Jones
7,Kaleo,Alternative/Indie,,,Way Down We Go
8,Tom Petty,Rock,,,Runnin' Down A Dream
9,Tom Petty,Rock,,,Free Fallin'


In [157]:
songDataFrame.describe()

Unnamed: 0,artist,genre,lyrics,lyricsCleaned,title
count,609,609,0.0,0.0,609
unique,355,63,0.0,0.0,567
top,Eric Church,Alternative/Indie,,,Everlong
freq,12,65,,,4


### So far so good. One thing I've noticed is the large number of unique genres. I'm going to handle that by merging some of the very specific genres into more general genres that I will analyze later. I'm sure there's a more intelligent way to do this, but I'll work on that later.

In [158]:
songDataFrame.loc[songDataFrame['genre'].str.contains('Rock', case=False), 'genre'] = 'Rock'
songDataFrame.loc[songDataFrame['genre'].str.contains('Hip-Hop', case=False), 'genre'] = 'Rap'
songDataFrame.loc[songDataFrame['genre'].str.contains('Country', case=False), 'genre'] = 'Country'
songDataFrame.loc[songDataFrame['genre'].str.contains('Metal', case=False), 'genre'] = 'Rock'
songDataFrame.loc[songDataFrame['genre'].str.contains('Pop', case=False), 'genre'] = 'Pop'
songDataFrame.loc[songDataFrame['genre'].str.contains('Punk', case=False), 'genre'] = 'Rock'
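
### One possible way to tidy up the merging step is to keep the rules in a single ordered list. This is only a sketch: GENRE_RULES and generalizeGenre are hypothetical names, and the sample genres are made up for illustration. The first matching keyword wins, which mirrors the order of the statements above.

```python
# Ordered (keyword, general genre) pairs; the first matching keyword wins.
# These names are illustrative and not part of the original notebook.
GENRE_RULES = [
    ('rock', 'Rock'),
    ('hip-hop', 'Rap'),
    ('country', 'Country'),
    ('metal', 'Rock'),
    ('pop', 'Pop'),
    ('punk', 'Rock'),
]

def generalizeGenre(genre):
    """Return the general genre for a specific one, or the input unchanged."""
    lowered = genre.lower()
    for keyword, general in GENRE_RULES:
        if keyword in lowered:
            return general
    return genre

# Made-up sample genres to exercise the rules.
sampleGenres = ['Alt Country', 'Pop Punk', 'Hard Rock', 'Hip-Hop/Rap', 'Folk']
generalized = [generalizeGenre(genre) for genre in sampleGenres]
print(generalized)
```

### On the real data this could be applied with songDataFrame['genre'].map(generalizeGenre); genres that match no rule, like 'Folk', are left as they are.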

In [159]:
songDataFrame.describe()

Unnamed: 0,artist,genre,lyrics,lyricsCleaned,title
count,609,609,0.0,0.0,609
unique,355,31,0.0,0.0,567
top,Eric Church,Rock,,,Everlong
freq,12,162,,,4


### Merging the specific genres into general ones cut the number of unique genres from 63 to 31. I'm going to further limit the analysis to rock, rap, and country music.

In [160]:
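# .copy() makes the filtered frame independent, so later assignments don't touch a view of the original.
songDataFrame = songDataFrame[songDataFrame['genre'].isin(['Rock','Country','Rap'])].copy()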

In [161]:
songDataFrame.describe()

Unnamed: 0,artist,genre,lyrics,lyricsCleaned,title
count,301,301,0.0,0.0,301
unique,173,3,0.0,0.0,286
top,Eric Church,Rock,,,Sixteen Saltines
freq,12,162,,,2


### Here, I'm going to use the PyLyrics API to retrieve the actual lyrics. This API is a life saver. It automatically retrieves lyrics for most songs from lyrics.wikia.com, given the artist and the title.

In [163]:
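for index, row in songDataFrame.iterrows():
    try:
        # PyLyrics.getLyrics takes the artist first, then the song title.
        lyrics = PyLyrics.getLyrics(row['artist'], row['title'])
    except Exception:
        # No lyrics found; the empty string is filtered out in the next step.
        lyrics = ''
    # Assign with a single .loc call so the value is written to the dataframe itself.
    songDataFrame.loc[index, 'lyrics'] = lyrics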

### Prepping the data for text classification. This is pretty kludgey, but I don't want punctuation or special characters to affect the analysis.

In [166]:
# string.punctuation already contains (){}<>, so no extra characters are needed.
rmPunctuation = str.maketrans('', '', string.punctuation)
rmNewline = str.maketrans('\n', ' ')
# Strip punctuation, then turn newlines into spaces, for every song at once.
songDataFrame['lyricsCleaned'] = songDataFrame['lyrics'].map(
    lambda text: str(text).translate(rmPunctuation).translate(rmNewline))
# Drop songs for which no lyrics were retrieved.
songDataFrame = songDataFrame[songDataFrame['lyrics'].map(len) > 1]
songDataFrame

Unnamed: 0,artist,genre,lyrics,lyricsCleaned,title
2,Robert Earl Keen,Country,Sherry was a waitress at the only joint in tow...,Sherry was a waitress at the only joint in tow...,The Road Goes On Forever
8,Tom Petty,Rock,"It was a beautiful day, the sun beat down\nI h...",It was a beautiful day the sun beat down I had...,Runnin' Down A Dream
9,Tom Petty,Rock,"Unfortunately, we are not licensed to display ...",Unfortunately we are not licensed to display t...,Free Fallin'
11,JAY-Z,Rap,Do I find it so hard\nWhen I know in my heart\...,Do I find it so hard When I know in my heart I...,4:44
12,The Darkness,Rock,Can't explain all the feelings that you're mak...,Cant explain all the feelings that youre makin...,I Believe In A Thing Called Love
13,Lifehouse,Rock,Desperate for changing\nStarving for truth\nI'...,Desperate for changing Starving for truth Im c...,Hanging By a Moment
19,The Killers,Rock,You sit there in your heartache\nWaitin' on so...,You sit there in your heartache Waitin on some...,When You Were Young
25,Macklemore,Rap,"Donna Missal\nWe got that bad love, but it tas...",Donna Missal We got that bad love but it taste...,Over It (feat. Donna Missal)
27,DJ Khaled,Rap,Would you fuck me for free?\nAnother one\nWe t...,Would you fuck me for free Another one We the ...,For Free (feat. Drake)
28,Drake,Rap,You know alot of girls be\nThinkin' my songs a...,You know alot of girls be Thinkin my songs are...,Best I Ever Had
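
### As a small standalone illustration of the cleaning step, here is a sketch of the same idea as a single function. cleanLyrics is a hypothetical helper, not something from the notebook, and it differs slightly from the cell above: it also collapses runs of whitespace into a single space.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def cleanLyrics(text):
    """Strip punctuation and replace newlines (and any repeated whitespace) with single spaces."""
    withoutPunctuation = text.translate(REMOVE_PUNCTUATION)
    return ' '.join(withoutPunctuation.split())

cleaned = cleanLyrics("It was a beautiful day, the sun beat down\nI had the radio on")
print(cleaned)
```

### Note that removing apostrophes turns contractions such as "she's" into "shes", which is why tokens like that show up among the features later on.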


### I finally have all the data I want to use in the data frame. Time to get down to business... I'll start by splitting the data into training data and test data.

In [271]:
X_train, X_test = train_test_split(songDataFrame, test_size=.25,stratify=songDataFrame.genre)
y_train = X_train.genre
y_test = X_test.genre
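
### To show what the stratify argument buys, here is a minimal sketch on made-up labels (the toy songs and the random_state value are assumptions for illustration only). Stratifying keeps each genre's share the same in the training and test sets.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with an imbalance loosely like the playlist (hypothetical data).
labels = ['Rock'] * 8 + ['Country'] * 4 + ['Rap'] * 4
songs = ['song %d' % i for i in range(len(labels))]

songsTrain, songsTest, labelsTrain, labelsTest = train_test_split(
    songs, labels, test_size=.25, stratify=labels, random_state=0)

# Both splits keep the 2:1:1 ratio of Rock to Country to Rap.
print(Counter(labelsTrain))
print(Counter(labelsTest))
```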

### I chose the TfidfVectorizer from the scikit-learn package to vectorize the songs. I picked it over the CountVectorizer to reduce the effect that lyric length could have on classification, since it L2-normalizes each song's vector by default. The result is a sparse matrix in which the rows are the samples (songs) and the columns are the features (words here, or n-grams if ngram_range is widened).

In [272]:
vectorizer = TfidfVectorizer(analyzer='word',max_df=.85, ngram_range=(1,1),stop_words='english')
vectorizer.fit(songDataFrame['lyricsCleaned'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.85, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
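
### A quick sketch of why TF-IDF with the default norm='l2' dampens the effect of lyric length. The three toy "songs" below are made up; the second is just the first repeated three times, and after normalization the two end up with identical vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical): the second is the first repeated three times.
docs = [
    'truck road whiskey',
    'truck road whiskey truck road whiskey truck road whiskey',
    'money cars clothes',
]

toyVectorizer = TfidfVectorizer()  # norm='l2' is the default
dense = toyVectorizer.fit_transform(docs).toarray()

# Every row has unit length, so a longer song does not get a "bigger" vector.
rowNorms = np.linalg.norm(dense, axis=1)
print(rowNorms)
print(np.allclose(dense[0], dense[1]))
```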

### I'm going to use a logistic regression classifier because I want to keep it simple. Depending on how this works, I may explore a Naive Bayes classifier or an SVM-based classifier. I'm using the 'lbfgs' solver because my data is small-ish. I'm also using a 'balanced' class weight to compensate for my unbalanced classes.

In [273]:
clf=LogisticRegression(solver='lbfgs',class_weight='balanced')
clf.fit(vectorizer.transform(X_train['lyricsCleaned']).toarray(),y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [274]:
predictions = clf.predict(vectorizer.transform(X_test['lyricsCleaned']).toarray())
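
### One thing to be aware of: the vectorizer above is fit on the whole dataframe, so its vocabulary and IDF weights also see the test songs. A possible alternative, sketched here on made-up lyrics, is to wrap both steps in a scikit-learn pipeline so that everything is fit on the training data only. This is a variation on the notebook's approach, not the code that produced the results shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training lyrics and genres, for illustration only.
trainLyrics = [
    'truck whiskey road', 'whiskey truck town',
    'money cars clothes', 'money cars chains',
    'guitar loud drums', 'guitar loud amps',
]
trainGenres = ['Country', 'Country', 'Rap', 'Rap', 'Rock', 'Rock']

# The pipeline fits the vectorizer and the classifier on the training data together.
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(solver='lbfgs', class_weight='balanced'),
)
model.fit(trainLyrics, trainGenres)

# Unseen text is vectorized with the training vocabulary before prediction.
toyPredictions = model.predict(['whiskey on a dirt road', 'money and chains'])
print(toyPredictions)
```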

### Time to see how well the classifier did at predicting genre. I'm going to use the weighted F1 score because of my unbalanced classes.

In [275]:
F1 = f1_score(y_test,predictions,average='weighted')
round(F1,2)

0.79000000000000004

### An F1 score of .79 indicates that the classifier is performing reasonably well, given that this is a multi-class classification on somewhat noisy data (I'm assuming that words common to all genres make it noisy). Let's get a better look at what the classifier is picking up on... This function will display the top 10 features (words) associated with each genre.

In [280]:
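def seeTopFeatures(vectorizer, clf, classLabels, n=10):
    """Print the n highest-weighted words per class; classLabels must follow the order of clf.classes_."""
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
    featureNames = vectorizer.get_feature_names_out()
    for i, classLabel in enumerate(classLabels):
        # argsort is ascending, so the last n indices carry the largest coefficients.
        topFeatures = np.argsort(clf.coef_[i])[-n:]
        print("%s: %s" % (classLabel, " ".join(featureNames[j] for j in topFeatures)))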

In [281]:
seeTopFeatures(vectorizer,clf, ['Country', 'Rap', 'Rock'])

Country: days easy runs just drink creepin sudden wave shes little
Rap: dun clean fuckin hol beast like money bitch niggas nigga
Rock: feel youve come jealous stop sharona ooh away gimme oh


### This is sort of what I expected to see. I can't draw any definite conclusions from this analysis, but as a person who listens to country music, I can definitely understand how 'drink' and 'shes' made it into the top features. The other top features for country do not seem to be country-specific.

### Rap definitely takes the cake for having genre-specific features. I suspect this is because these features make up a large portion of the rap lyrics in my playlists.

### Rock is unique in that some of the features are more like 'fillers' than words. 'ooh' and 'oh' are examples of this. The other words associated with rock are general words, as was the case with country. I suspect that most of the classification power is coming from the unique vocabulary associated with rap. Maybe some more digging can determine whether or not I'm right.

In [289]:
print(classification_report(y_test,predictions, labels=['Country', 'Rap', 'Rock'] ))

             precision    recall  f1-score   support

    Country       0.88      0.54      0.67        13
        Rap       0.88      0.64      0.74        11
       Rock       0.78      0.95      0.85        37

avg / total       0.82      0.80      0.79        61
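
### As a sanity check, the weighted F1 score can be reproduced from the rounded per-class numbers in the report above: each class's F1 is weighted by its support. The macro average is included for comparison; it treats all three genres equally, so it is less dominated by rock.

```python
# Rounded per-class F1 scores and supports, copied from the classification report.
report = {
    'Country': {'f1': 0.67, 'support': 13},
    'Rap': {'f1': 0.74, 'support': 11},
    'Rock': {'f1': 0.85, 'support': 37},
}

totalSupport = sum(row['support'] for row in report.values())

# Weighted average: each class counts in proportion to its number of test songs.
weightedF1 = sum(row['f1'] * row['support'] for row in report.values()) / totalSupport

# Macro average: every class counts equally, regardless of size.
macroF1 = sum(row['f1'] for row in report.values()) / len(report)

print(totalSupport, round(weightedF1, 2), round(macroF1, 2))
```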



### Surprisingly, the classifier performs best on rock music, with acceptable precision (0.78) and excellent recall (0.95). Rock is also by far the largest class (37 of the 61 test songs), which likely helps. Rap has high precision (0.88), meaning the classifier rarely labels a song as rap when it is not; however, its recall is lower (0.64), so it misses roughly a third of the rap songs. Country suffers from the same problem: high precision (0.88) with even lower recall (0.54). An overall weighted F1 score of .79 indicates a classifier that performs reasonably well given the nature of the data.