This utility notebook builds a dictionary that maps each track ID (key) to its lyrics (value) for use with text mining algorithms. The output is a JSON file, "lyrics_dict_subset.json", which can be used in other modules.

This version includes lyrics only for tracks that also appear in the 10k-song subset. For a more general version, use the file "Create-Dictionary-From-Word-Freq.ipynb".

We first import the three modules we'll use.

In [1]:
import pandas as pd
import sqlite3
import json

Next, we create a connection to our database, [mxm_dataset.db](http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/mxm_dataset.db), which has lyrics for many of the tracks.

In [2]:
conn_lyrics = sqlite3.connect('../Data/mxm_dataset.db')

From mxm_dataset.db, we want a list of unique tracks. The lyrics table holds one row per track and word, so we select the distinct track IDs. We find lyrics for 237,662 tracks.

In [3]:
tracks = pd.read_sql("SELECT DISTINCT track_id FROM lyrics", con = conn_lyrics)
len(tracks) # 237662 tracks

237662

Next, we import a file we created earlier from the subset, which includes the track_id. We'll use this to restrict our cluster analysis to those songs that are in both the lyrics database and the subset. 

After importing, we drop the first two characters (b') and the last character ('), which are artifacts of converting the byte-string IDs to text.

In [4]:
save_load_path = '../Data/MillionSongSubset/data'
project_df = pd.read_pickle(save_load_path+'/project_df.pkl')
# str() of a bytes ID looks like "b'TR...'"; [2:-1] strips the b' prefix and the closing '
track_id = project_df['track_id'].map(lambda x: str(x)[2:-1])
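
To make the cleanup step concrete, here is a minimal sketch of where the b' wrapper comes from and two ways to get rid of it. The ID value is a hypothetical example, and the `decode` variant is an alternative that assumes the column really holds bytes objects; it is not what the cell above does.

```python
# Hypothetical track ID stored as a bytes object
raw_id = b'TRAAAAW128F429D538'

# str() on a bytes object returns its repr, including the b'...' wrapper
as_text = str(raw_id)

# Slicing off the first two characters and the last one removes the wrapper
sliced = as_text[2:-1]

# Decoding converts the bytes to text directly, so no wrapper appears at all
decoded = raw_id.decode('utf-8')

print(as_text)
print(sliced)
print(decoded)
```

Both approaches give the same string for bytes input; slicing also tolerates IDs that were already stored as text in the form b'...'.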

We turn this into a pandas dataframe.

In [5]:
track_id = pd.DataFrame(track_id)

Next, we merge the tracks that have lyrics available with the tracks in the subset. Since we're only interested in the tracks for which there are lyrics, we use an 'inner' join, which keeps the intersection of keys from both dataframes. For more on the different merge types, see http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra .

We find that there are 2350 such tracks.

In [6]:
df_tracks = track_id.merge(tracks, how='inner', on='track_id')
len(df_tracks)

2350
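
As a rough illustration of what the inner join keeps, the sketch below reproduces the same idea with plain Python lists. The IDs are hypothetical, and the equivalence with a set intersection assumes each ID appears at most once on each side.

```python
# Hypothetical IDs standing in for the two real columns
subset_ids = ['TRA001', 'TRA002', 'TRA003', 'TRA004']   # tracks in the subset
lyrics_ids = ['TRA002', 'TRA004', 'TRA999']             # tracks with lyrics

# An inner join on a single key column keeps only keys present on both sides
lyrics_lookup = set(lyrics_ids)
common = [t for t in subset_ids if t in lyrics_lookup]

print(common)       # ['TRA002', 'TRA004']
print(len(common))  # 2
```

Tracks that appear on only one side (TRA001, TRA003, TRA999 here) are dropped, which is why the merged dataframe is much smaller than either input.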

Next we use the tracks from the dataframe to pull the lyrics from the database.

In [7]:
# set this to the number of tracks you want to pull from db
num_tracks = len(df_tracks['track_id']) 
# initialize empty dictionary to store tracks and lyrics
my_dict = {}

for i in range(0,num_tracks): 
    # assign the value of the track at current index to current track
    current_track = df_tracks.track_id[i]

    # pull the lyrics for that track and store it in a list
    res = conn_lyrics.execute("SELECT word, count FROM lyrics WHERE track_id = ?", [current_track])
    results = res.fetchall()

    # repeat each word as many times as it occurs
    li = [' '.join([word] * count) for word, count in results]

    # use this version to get a single copy of each word
    # li = [word for word, count in results]

    # join the words with single spaces; unlike str(li).replace(...), this
    # leaves no list brackets behind and keeps apostrophes inside words
    li = ' '.join(li)
    
    # add track and lyrics to dictionary
    my_dict[current_track] = li
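
The loop above issues one query per track. As a hedged alternative, the sketch below reads the table in a single pass and groups rows in Python. It runs against a toy in-memory table that mimics only the three columns used here, and `wanted`, `word_lists`, and `toy_dict` are hypothetical names introduced for the illustration.

```python
import sqlite3

# Toy in-memory table with the three columns the loop relies on
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [('TRA002', 'love', 2), ('TRA002', 'you', 1),
     ('TRA004', 'night', 3), ('TRA999', 'rain', 1)],
)

wanted = {'TRA002', 'TRA004'}   # stand-in for the IDs in df_tracks['track_id']
word_lists = {}

# One pass over the table instead of one query per track
query = "SELECT track_id, word, count FROM lyrics ORDER BY rowid"
for track, word, count in conn.execute(query):
    if track in wanted:
        # repeat each word as many times as it occurs
        word_lists.setdefault(track, []).extend([word] * count)

# Join each track's words into a single space-separated string
toy_dict = {track: ' '.join(words) for track, words in word_lists.items()}
conn.close()

print(toy_dict)
```

Whether a single scan is faster than per-track lookups depends on how the real table is indexed and on how small the wanted set is, so this is worth timing before adopting.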

We verify the number of tracks processed. To inspect the contents of my_dict, uncomment the line in the cell after that.

In [8]:
len(my_dict)

2350

In [9]:
#my_dict

Finally, we save the data to a JSON file. Why JSON? A Python dictionary with string keys and string values maps directly onto a JSON object, so the format is a natural fit. It is also human readable: we can open the file in Notepad and see our data. In the benchmark linked [here](https://kovshenin.com/2010/pickle-vs-json-which-is-faster/), JSON was also faster than pickle.

We've also included the code to save the data to a pickle file, in case those reasons aren't compelling enough to overcome a preference for pickle files.

In [10]:
# save to json file in same directory
import json
with open('lyrics_dict.json', 'w') as fp:
    # arguments can include indent=n or None, sort_keys = True
    json.dump(my_dict, fp, indent=None)
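
For other modules that consume the file, a quick round trip confirms that the dictionary survives serialization unchanged. The sketch below uses a hypothetical two-track dictionary and a temporary directory so it does not touch the real output file.

```python
import json
import os
import tempfile

# Hypothetical dictionary in the same shape as my_dict
sample = {'TRA002': 'love love you', 'TRA004': 'night night night'}

# Write and read inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_lyrics.json')

    with open(path, 'w') as fp:
        json.dump(sample, fp)

    with open(path) as fp:
        restored = json.load(fp)

print(restored == sample)  # True
```

The round trip is lossless here because both keys and values are strings; non-string keys would be converted to strings by `json.dump`.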

In [11]:
# save dictionary to pickle
#import pickle
#with open('lyrics_dict.p', 'wb') as fp:
#    pickle.dump(my_dict, fp)
    
#with open('lyrics_dict.p', 'rb') as fp:
#    data_pickle = pickle.load(fp)


In [12]:
conn_lyrics.close()