## Logistic Regression With Lyrics

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
import pandas as pd


In [3]:
lyrics_data = pd.read_csv("cleaned_2.csv")
lyrics_data['genre'].value_counts()

Rock       14000
Metal      14000
Hip-Hop    14000
Country    14000
Pop        14000
Name: genre, dtype: int64

In [5]:
X = lyrics_data['lyrics']
Y = lyrics_data['genre']

Creating a count-vector (bag-of-words) representation of the lyrics: each song becomes a vector of word counts over the whole vocabulary.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
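# Learn the vocabulary and build the sparse document-term matrix.
# astype('U') coerces every entry (including missing lyrics) to a string.
X_train = vectorizer.fit_transform(X.astype('U'))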

In [8]:
X_train.shape

(70000, 220295)
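To make the count-vector representation concrete, here is a minimal sketch on a made-up three-line corpus. The names `toy_lyrics`, `toy_vectorizer` and `toy_counts` are illustrative only and are not part of our dataset. With the default settings, single-character tokens are dropped and the columns follow the alphabetically sorted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from cleaned_2.csv
toy_lyrics = [
    "love me love me",
    "ride the highway",
    "love the highway",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_lyrics)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sorting the words gives column order
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_counts.toarray())  # safe to densify here because the matrix is tiny
```

Each row holds the word counts for one document, which is the same structure as `X_train`, only with 5 columns instead of 220295.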

In [9]:
# Dense view for inspection only: 70000 x 220295 int64 values is nominally about 123 GB.
X_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Applying multiclass logistic regression to classify lyrics into respective genres.

In [9]:
logreg = LogisticRegression(C=1.0, solver='lbfgs', multi_class='multinomial', max_iter = 300)

In [10]:
logreg.fit(X_train, Y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=300, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
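As a rough illustration of what the multinomial model computes, the sketch below fits the same kind of classifier on a made-up six-line corpus and checks that its class probabilities are the softmax of the per-class linear scores. The toy data and the names `toy_model`, `toy_scores` and `toy_probs` are assumptions for illustration. The sketch also assumes a recent scikit-learn, where the `lbfgs` solver fits a multinomial model by default for more than two classes, so `multi_class` is not passed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus with three genres, two songs each
toy_lyrics = [
    "truck road whiskey", "whiskey farm road",
    "guitar loud scream", "scream dark guitar",
    "dance night party", "party dance love",
]
toy_genres = ["Country", "Country", "Metal", "Metal", "Pop", "Pop"]

toy_counts = CountVectorizer().fit_transform(toy_lyrics)
toy_model = LogisticRegression(C=1.0, solver="lbfgs", max_iter=300)
toy_model.fit(toy_counts, toy_genres)

# One linear score per class: counts @ coef_.T + intercept_
toy_scores = toy_model.decision_function(toy_counts)  # shape (6, 3)

# Softmax turns the scores into probabilities that sum to 1 per row
shifted = toy_scores - toy_scores.max(axis=1, keepdims=True)  # for numerical stability
toy_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.allclose(toy_probs, toy_model.predict_proba(toy_counts)))
```

The predicted genre for a song is the class with the largest probability in its row.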

In [11]:
expected = Y
predicted = logreg.predict(X_train)

Evaluating the model on its own training data. Accuracy here should be relatively high, since the model has already seen these examples.

In [12]:
from sklearn import metrics
print(metrics.accuracy_score(expected, predicted))

0.8077428571428571


In [14]:
from sklearn.model_selection import cross_val_score
y = Y
scores = cross_val_score(logreg, X_train, y, cv=5, scoring='accuracy')



In [15]:
scores.mean()

0.6197142857142858
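One variation on the cross-validation above, sketched here on made-up data, is to put the vectorizer and the classifier in a single pipeline so that the vocabulary is learned only from each training fold. In this notebook the vectorizer was fit on all 70000 songs before splitting; we have not measured how much that affects the score, and for raw counts the effect may well be small. The toy corpus and the names `toy_pipeline` and `toy_cv_scores` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: two genres, four songs each
toy_lyrics = [
    "truck road whiskey", "whiskey farm road",
    "farm truck dirt", "dirt road whiskey",
    "guitar loud scream", "scream dark guitar",
    "dark loud thunder", "thunder scream guitar",
]
toy_genres = ["Country"] * 4 + ["Metal"] * 4

# The pipeline refits CountVectorizer inside every fold, on training rows only
toy_pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(C=1.0, solver="lbfgs", max_iter=300),
)

# cv=2 because the toy corpus is tiny; the notebook itself uses cv=5
toy_cv_scores = cross_val_score(toy_pipeline, toy_lyrics, toy_genres, cv=2, scoring="accuracy")
print(toy_cv_scores.mean())
```

On the real data, the same pattern would take the raw `lyrics` column as input instead of the precomputed `X_train`.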

Five-fold cross-validation gives a mean accuracy of about 0.620, well below the 0.808 we measured on the training data and not as good as we had hoped. The gap suggests the model is overfitting to the training lyrics. We also have to keep in mind that predicting a song's genre from its lyrics alone can be very difficult, even for humans. As a result, we decided to use the Spotify API to extract additional features about each song.