Does increased lexical diversity in lyrics predict song hotness? What about artist hotness?

In [155]:
import pandas as pd
import sqlite3
from nltk.corpus import stopwords

## Background

Information about musiXmatch can be found at http://labrosa.ee.columbia.edu/millionsong/musixmatch

The MSD team partnered with musiXmatch to create a dataset that provides lyrics for many of the MSD tracks. The lyrics come in bag-of-words format: each track is described as the word-counts for a dictionary of the top 5000 words across the set. 

This dataset is available in two formats: 

1. As two text files, a training set and a test set
2. As a SQLite database

Their suggestion is to use the SQLite version: it's faster and "more convenient" (though I'm not sure what "more convenient" refers to).

The [SQLite database](http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/mxm_dataset.db) was saved to the same directory as this file. It contains 2.3 GB of data. We first need to create a connection to our database file.

In [157]:
conn = sqlite3.connect('mxm_dataset.db')

To pull data from the .db file we need a cursor, a database object used to traverse the records of a query result. The connection's .execute method is a shortcut: it creates a cursor and then calls the cursor's execute method.
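A minimal sketch of the shortcut versus the long form. It assumes a throwaway in-memory database with a made-up three-word `words` table, so it runs without mxm_dataset.db; the names `demo`, `shortcut_rows` and `explicit_rows` are only for illustration.

```python
import sqlite3

# In-memory database with a made-up table, so the sketch runs without mxm_dataset.db
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
demo.executemany("INSERT INTO words VALUES (?)", [("i",), ("the",), ("you",)])

# Shortcut: Connection.execute creates a cursor and runs the statement on it
shortcut_rows = demo.execute("SELECT word FROM words ORDER BY ROWID").fetchall()

# Long form: create the cursor explicitly, then call its execute method
cur = demo.cursor()
cur.execute("SELECT word FROM words ORDER BY ROWID")
explicit_rows = cur.fetchall()

print(shortcut_rows == explicit_rows)
demo.close()
```

Both routes return the same list of row tuples; the shortcut just saves a line.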

The sqlite_master table defines the schema for the database, so we'll use it to get information about the database. More information can be found at https://www.sqlite.org/faq.html#q7

In [158]:
res = conn.execute("Select * FROM sqlite_master where type = 'table'")

Next, the fetchall() method fetches the remaining rows of a query result, and returns a list.

In [159]:
res.fetchall()

[('table', 'words', 'words', 2, 'CREATE TABLE words (word TEXT PRIMARY KEY)'),
 ('table',
  'lyrics',
  'lyrics',
  4,
  'CREATE TABLE lyrics (track_id, mxm_tid INT, word TEXT, count INT, is_test INT, FOREIGN KEY(word) REFERENCES words(word))')]

It's worth talking about what's going on here. We have two tables: words and lyrics.

The words table has a single column, word.

The lyrics table has five columns:

* track_id: MSD track id
* mxm_tid: musiXmatch track id
* word: one of the words in the table words
* count: the word count for that track
* is_test: tells you if a track is in the test set(1) or not(0)
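The column names of a query result or a table can also be read programmatically. The sketch below assumes a throwaway in-memory database with only a `words` table; `master_columns` and `table_columns` are illustrative names.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")

# cursor.description holds one 7-tuple per result column; item 0 is the column name
res = demo.execute("SELECT * FROM sqlite_master WHERE type = 'table'")
master_columns = [col[0] for col in res.description]
print(master_columns)

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk)
table_columns = [row[1] for row in demo.execute("PRAGMA table_info(words)")]
print(table_columns)
demo.close()
```

On the real database, the same description lookup names the five values in each sqlite_master row shown above: type, name, tbl_name, rootpage and sql.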

You can also list just the table names by selecting name.

In [160]:
res = conn.execute("SELECT name FROM sqlite_master where type = 'table'")
res.fetchall()

[('words',), ('lyrics',)]

I'm not really sure how this stuff works (for example, I don't understand what the first bit before the table name is), so I'm going to [follow along with Thierry Bertin-Mahieux (TBM)](http://labrosa.ee.columbia.edu/millionsong/blog/11-4-11-musixmatch-dataset-connecting-lyrics) for a bit. The next thing he does is select word from words.

## Table Exploration

In [161]:
res = conn.execute("SELECT word FROM words")

In [162]:
len(res.fetchall())

5000

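Now if we try to look at those terms using the same cursor, we get an empty list. Why? Because the first fetchall() already consumed every row of the result, so there is nothing left to fetch. To see the words we have to run the query again.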

In [163]:
res.fetchall()

[]
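A minimal sketch of that behaviour on a throwaway in-memory table (the table contents and the names `first_fetch`, `second_fetch` are made up for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
demo.executemany("INSERT INTO words VALUES (?)", [("i",), ("the",), ("you",)])

res = demo.execute("SELECT word FROM words")
first_fetch = res.fetchall()   # consumes every row of the result
second_fetch = res.fetchall()  # nothing left to fetch
print(len(first_fetch), second_fetch)

# Keeping the rows in a variable lets us reuse them without re-running the query
words = [row[0] for row in first_fetch]
demo.close()
```

Storing the fetched rows in a variable, as the next cells do, avoids having to execute the query every time the list is needed.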

In [179]:
res = conn.execute("SELECT word FROM words")

Now we can look at the list of words. Looking at the list, it appears we could shrink it by removing stop words, but doing so only removes 106 words, so it isn't really worth it.

In [180]:
words = res.fetchall()
words

[('i',),
 ('the',),
 ('you',),
 ('to',),
 ('and',),
 ('a',),
 ('me',),
 ('it',),
 ('not',),
 ('in',),
 ('my',),
 ('is',),
 ('of',),
 ('your',),
 ('that',),
 ('do',),
 ('on',),
 ('are',),
 ('we',),
 ('am',),
 ('will',),
 ('all',),
 ('for',),
 ('no',),
 ('be',),
 ('have',),
 ('love',),
 ('so',),
 ('know',),
 ('this',),
 ('but',),
 ('with',),
 ('what',),
 ('just',),
 ('when',),
 ('like',),
 ('now',),
 ('que',),
 ('time',),
 ('can',),
 ('come',),
 ('de',),
 ('there',),
 ('go',),
 ('up',),
 ('oh',),
 ('la',),
 ('one',),
 ('they',),
 ('out',),
 ('down',),
 ('get',),
 ('she',),
 ('was',),
 ('see',),
 ('if',),
 ('got',),
 ('never',),
 ('from',),
 ('he',),
 ('feel',),
 ('want',),
 ('let',),
 ('make',),
 ('way',),
 ('say',),
 ('take',),
 ('would',),
 ('as',),
 ('ca',),
 ('day',),
 ('at',),
 ('babi',),
 ('away',),
 ('life',),
 ('yeah',),
 ('y',),
 ('back',),
 ('by',),
 ('her',),
 ('heart',),
 ('here',),
 ('how',),
 ('could',),
 ('night',),
 ('need',),
 ('our',),
 ('look',),
 ('where',),
 ('en',),

Back to TBM.

In [111]:
res = conn.execute("SELECT word FROM words WHERE ROWID=4703")
res.fetchone()[0]

'brooklyn'

Again, each row can only be fetched once, so we run the query a second time.

In [112]:
res = conn.execute("SELECT word FROM words WHERE ROWID=4703")
res.fetchone()

('brooklyn',)

The two results differ because fetchone() returns the whole row as a tuple, while adding [0] pulls out the value of the first column.
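A small sketch of what fetchone() hands back, using a throwaway in-memory table (the table and its single row are made up for illustration):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
demo.execute("INSERT INTO words VALUES ('brooklyn')")

row = demo.execute("SELECT word FROM words WHERE ROWID = 1").fetchone()
value = row[0]  # first (and only) column of the row
print(row, value)

# fetchone returns None when no row matches, so check before indexing
missing = demo.execute("SELECT word FROM words WHERE ROWID = 99").fetchone()
print(missing)
demo.close()
```

Because a missing row comes back as None, indexing with [0] directly is only safe when the query is known to match something.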

Now let's look at the track metadata SQLite database. We'll create a separate connection for it.

In [113]:
conn_tmdb = sqlite3.connect('track_metadata.db')

Let's see how many tracks contain the word pretti (the stemmed form of pretty).

In [114]:
res = conn.execute("SELECT track_id FROM lyrics WHERE word ='pretti'")
len(res.fetchall())

6703
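The same count can be computed inside SQLite instead of fetching every matching row into Python. This is a sketch on a made-up four-row `lyrics` table; it assumes, as in the bag-of-words format, that each (track, word) pair appears at most once.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE lyrics (track_id, word TEXT, count INT)")
demo.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("T1", "pretti", 2), ("T1", "you", 5), ("T2", "pretti", 1), ("T3", "love", 4)],
)

# COUNT(*) is evaluated by SQLite, so only a single row comes back to Python
n_tracks = demo.execute(
    "SELECT COUNT(*) FROM lyrics WHERE word = ?", ("pretti",)
).fetchone()[0]
print(n_tracks)
demo.close()
```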

We can get a song at random with the word pretti. We assign this to the variable song so we can plug it into our query. 

In [115]:
res = conn.execute("SELECT track_id FROM lyrics WHERE word='pretti' ORDER BY RANDOM() LIMIT 1")
song = res.fetchone()[0]  # [0] extracts the track_id string from the one-element row tuple
song

'TRNFSDB128F4280398'

SQLite supports parameter substitution: put a ? placeholder in the query and pass the values as a sequence. More info about parameter substitution can be found at http://stackoverflow.com/questions/228912/sqlite-parameter-substitution-problem , since the info in the Python reference library didn't seem to work for me.
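A short sketch of the two placeholder styles the sqlite3 module accepts, on a made-up one-row `songs` table (the track id, artist and title are placeholders, not real data):

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE songs (track_id TEXT, artist_name TEXT, title TEXT)")
demo.execute("INSERT INTO songs VALUES ('TR0001', 'Example Artist', 'Example Title')")

track = "TR0001"

# qmark style: parameters are passed as a sequence, one item per ?
by_position = demo.execute(
    "SELECT artist_name, title FROM songs WHERE track_id = ?", (track,)
).fetchone()

# named style: parameters are passed as a dict keyed by the placeholder names
by_name = demo.execute(
    "SELECT artist_name, title FROM songs WHERE track_id = :tid", {"tid": track}
).fetchone()

print(by_position, by_name)
demo.close()
```

The value has to be wrapped in a list or tuple even when there is only one. A bare string is treated as a sequence of characters, which fails with an "incorrect number of bindings" error.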

In [116]:
res = conn_tmdb.execute("SELECT artist_name, title FROM songs where track_id = ?", [song])
res.fetchone()[0]

'The Whitlams'

Now we can see what other words are listed in that song.

In [117]:
res = conn.execute("SELECT word, count FROM lyrics WHERE track_id=? ORDER BY count DESC", [song])
res.fetchall()

[('you', 20),
 ('as', 13),
 ('is', 12),
 ('when', 10),
 ('are', 8),
 ('it', 7),
 ('he', 7),
 ('true', 7),
 ('look', 6),
 ('pretti', 6),
 ('and', 5),
 ('of', 5),
 ('yeah', 5),
 ('the', 4),
 ('by', 4),
 ('his', 4),
 ('better', 4),
 ('than', 4),
 ('a', 3),
 ('all', 3),
 ('this', 3),
 ('get', 3),
 ('music', 3),
 ('anyth', 3),
 ('vocal', 3),
 ('not', 2),
 ('in', 2),
 ('but', 2),
 ('at', 2),
 ('life', 2),
 ('back', 2),
 ('too', 2),
 ('him', 2),
 ('chorus', 2),
 ('sound', 2),
 ('late', 2),
 ('vers', 2),
 ('bridg', 2),
 ('drum', 2),
 ('electr', 2),
 ('bass', 2),
 ('piano', 2),
 ('rob', 2),
 ('my', 1),
 ('on', 1),
 ('with', 1),
 ('like', 1),
 ('up', 1),
 ('they', 1),
 ('out', 1),
 ('see', 1),
 ('caus', 1),
 ('wo', 1),
 ('still', 1),
 ('were', 1),
 ('did', 1),
 ('had', 1),
 ('run', 1),
 ('head', 1),
 ('befor', 1),
 ('word', 1),
 ('start', 1),
 ('even', 1),
 ('same', 1),
 ('deep', 1),
 ('ah', 1),
 ('solo', 1),
 ('water', 1),
 ('laugh', 1),
 ('bed', 1),
 ('speak', 1),
 ('near', 1),
 ('met', 1),
 (

## Turning it into a Dataframe

One way to turn the query result into a DataFrame:

In [144]:
res = conn.execute("SELECT track_id, sum(count) FROM lyrics GROUP BY track_id")
df_lyrics = pd.DataFrame(res.fetchall())

Another way to turn it into a DataFrame. This way is better: it retains the column names from the query.

In [150]:
df_lyrics_sql = pd.read_sql('SELECT track_id, sum(count) FROM lyrics GROUP BY track_id', con=conn)

In [151]:
len(df_lyrics_sql)

237662

In [141]:
len(df_lyrics) # lots of rows!

237662

In [152]:
df_lyrics.head(5)

| | 0 | 1 |
|---|---|---|
| 0 | TRAAAAV128F421A322 | 103 |
| 1 | TRAAABD128F429CF47 | 226 |
| 2 | TRAAAED128E0783FAB | 421 |
| 3 | TRAAAEF128F4273421 | 139 |
| 4 | TRAAAEW128F42930C0 | 115 |


In [153]:
df_lyrics_sql.head(5)

| | track_id | sum(count) |
|---|---|---|
| 0 | TRAAAAV128F421A322 | 103 |
| 1 | TRAAABD128F429CF47 | 226 |
| 2 | TRAAAED128E0783FAB | 421 |
| 3 | TRAAAEF128F4273421 | 139 |
| 4 | TRAAAEW128F42930C0 | 115 |
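The manual route can keep the column names too, by reading them from the cursor. The sketch below runs on a made-up three-row `lyrics` table; the `total_words` alias and the `ORDER BY` clause are additions for readability, not part of the queries above.

```python
import sqlite3
import pandas as pd

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE lyrics (track_id, word TEXT, count INT)")
demo.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("T1", "you", 5), ("T1", "pretti", 2), ("T2", "love", 4)],
)

query = (
    "SELECT track_id, SUM(count) AS total_words "
    "FROM lyrics GROUP BY track_id ORDER BY track_id"
)

# Route 1: fetch rows manually and take the column names from cursor.description
res = demo.execute(query)
df_manual = pd.DataFrame(res.fetchall(), columns=[col[0] for col in res.description])

# Route 2: let pandas run the query and label the columns itself
df_sql = pd.read_sql(query, con=demo)

print(df_manual)
print(df_sql)
demo.close()
```

Aliasing the aggregate with AS also gives a friendlier column name than sum(count).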


In [143]:
df_lyrics[[2]]

| | 2 |
|---|---|
| 0 | element |
| 1 | goal |
| 2 | ideal |
| 3 | adam |
| 4 | leak |
| 5 | runaway |
| 6 | gi |
| 7 | fleet |
| 8 | profit |
| 9 | defend |

In [None]:
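stop = stopwords.words('english')
# df_words is not defined earlier in this notebook; build it from the words table
df_words = pd.read_sql('SELECT word FROM words', con=conn)
new_list = df_words['word']
# number of dictionary words that are also English stop words
sum(df_words['word'].isin(stop))
type(new_list), len(new_list), new_list.head(5)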
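A self-contained sketch of the stop-word check. It assumes tiny stand-ins for both inputs: a five-word `df_words` instead of the 5000-word table, and a hand-written `stop` set instead of nltk's English list.

```python
import pandas as pd

# Stand-ins: the notebook would use the full words table and stopwords.words('english')
df_words = pd.DataFrame({"word": ["i", "the", "love", "babi", "night"]})
stop = {"i", "the", "and"}

is_stop = df_words["word"].isin(stop)   # boolean mask, True where the word is a stop word
n_stop = int(is_stop.sum())             # how many dictionary words are stop words
content_words = df_words.loc[~is_stop, "word"].tolist()
print(n_stop, content_words)
```

Since the dictionary is stemmed, a stop-word list built from unstemmed words may miss some entries.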

In [183]:
conn.close()
conn_tmdb.close()