### Rating a song in the MSD database
We need to:
- identify an individual song
- get all lyrics corresponding to this song
- rate them by offensiveness

In [1]:
import sqlite3

In [2]:
conn = sqlite3.connect("datasets/mxm_dataset.db")

The database contains the following:

https://github.com/tbertinmahieux/MSongsDB/blob/master/Tasks_Demos/Lyrics/README.txt
_More details on the database:
   - it contains two tables, 'words' and 'lyrics'
   - table 'words' has one column: 'word'. Words are entered according
     to popularity, check their ROWID if you want to check their position.
     ROWID is an implicit column in SQLite, it starts at 1.
   - table 'lyrics' contains 5 columns, see below
   - column 'track_id' -> as usual, track id from the MSD
   - column 'mxm_tid' -> track ID from musiXmatch
   - column 'word' -> a word that is also in the 'words' table
   - column 'cnt' -> word count for the word
   - column 'is_test' -> 0 if this example is from the train set, 1 if test_
   
We want to connect our insights to the Million Song Dataset (MSD) and its metadata.
Therefore we use the track_id to identify songs.

The lyrics table contains one row for every word that occurs in a track, together with the count of that word.
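
A minimal sketch of looking up a single song by its `track_id`. It runs against a small in-memory stand-in for the `lyrics` table rather than the real database: the rows and the helper name `words_for_track` are made up for illustration, and the column is assumed to be called `count`.

```python
import sqlite3

# In-memory stand-in for the lyrics table; the rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    'CREATE TABLE lyrics (track_id, mxm_tid INT, word TEXT, "count" INT, is_test INT)'
)
demo_conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?, ?, ?)",
    [
        ("TRACK_A", 1, "i", 6, 0),
        ("TRACK_A", 1, "the", 4, 0),
        ("TRACK_B", 2, "you", 3, 0),
    ],
)


def words_for_track(connection, track_id):
    """Return (word, count) pairs for one track, using a parameterized query."""
    cursor = connection.execute(
        'SELECT word, "count" FROM lyrics WHERE track_id = ? ORDER BY "count" DESC',
        (track_id,),
    )
    return cursor.fetchall()


track_a_words = words_for_track(demo_conn, "TRACK_A")
demo_conn.close()
```

Passing the id as a `?` parameter keeps the query safe from quoting problems, and the same helper would work on the real connection if the schema matches.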

In [3]:
# Show the columns of the lyrics table.
# Note: the count column is named "count", not "cnt" as the README states.
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(lyrics);")
print(cursor.fetchall())
cursor.close()

[(0, 'track_id', '', 0, None, 0), (1, 'mxm_tid', 'INT', 0, None, 0), (2, 'word', 'TEXT', 0, None, 0), (3, 'count', 'INT', 0, None, 0), (4, 'is_test', 'INT', 0, None, 0)]


### Use SQL to extract information

We want all track ids and, for every track id, its words and their counts.

The data is big, but even as a pandas DataFrame, the size stays below 5GB.
I think the comfort of pandas is enough to warrant loading this into memory.
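
As a hedged alternative to fetching tuples and building the DataFrame by hand, `pd.read_sql_query` can load a query result directly. The sketch below uses a toy in-memory table with invented rows, and also shows how the memory footprint could be checked. The names `toy_conn`, `toy_frame` and `memory_bytes` are illustrative.

```python
import sqlite3

import pandas as pd

# Toy stand-in for mxm_dataset.db; the rows are invented for illustration.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute('CREATE TABLE lyrics (track_id, word TEXT, "count" INT)')
toy_conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("TRACK_A", "i", 6), ("TRACK_A", "the", 4), ("TRACK_B", "i", 2)],
)

# read_sql_query runs the query and builds the DataFrame in one step.
toy_frame = pd.read_sql_query(
    "SELECT track_id, word, count FROM lyrics ORDER BY track_id", toy_conn
)
toy_conn.close()

# Repeated strings such as track ids can be stored as categoricals to save memory.
toy_frame["track_id"] = toy_frame["track_id"].astype("category")

# Memory footprint in bytes, including the Python string objects.
memory_bytes = toy_frame.memory_usage(deep=True).sum()
```

Whether the categorical conversion pays off on the full table is an assumption worth measuring with `memory_usage(deep=True)` before and after.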

In [4]:
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT track_id FROM lyrics ORDER BY track_id;")
track_ids = cursor.fetchall()
cursor.close()
track_ids[:5]

[('TRAAAAV128F421A322',),
 ('TRAAABD128F429CF47',),
 ('TRAAAED128E0783FAB',),
 ('TRAAAEF128F4273421',),
 ('TRAAAEW128F42930C0',)]

In [5]:
cursor = conn.cursor()
cursor.execute("SELECT track_id, word, count FROM lyrics ORDER BY track_id;")
track_word_count = cursor.fetchall()
cursor.close()
track_word_count[:5]

[('TRAAAAV128F421A322', 'i', 6),
 ('TRAAAAV128F421A322', 'the', 4),
 ('TRAAAAV128F421A322', 'you', 2),
 ('TRAAAAV128F421A322', 'to', 2),
 ('TRAAAAV128F421A322', 'and', 5)]

In [6]:
conn.close()

In [7]:
import pandas as pd

In [8]:
id_series = pd.Series(track_ids)
del track_ids

In [9]:
sqldb_frame = pd.DataFrame(track_word_count, columns=["track_id", "word", "count"])
del track_word_count

### Constructing a table to hold song ratings

We want to create a table which allows intuitive indexing into the rating of a song.

The table will contain the frequency of each word category. We set up a MultiIndex to allow slicing along the different characteristics of a word.

In [10]:
word_table = pd.read_pickle("pickles/word_table_cleaned.pickle")

In [11]:
word_table.head()

| word | category | strength | target |
|---|---|---|---|
| arse | non-discriminatory | mild | |
| bloody | non-discriminatory | mild | |
| bugger | non-discriminatory | mild | |
| cow | non-discriminatory | mild | |
| crap | non-discriminatory | mild | |


In [12]:
index_tuples=[]

for strength in ["mild", "medium", "strong", "strongest"]:
    for category in ["discriminative", "non-discriminatory"]:
        for target in ["None", "race", "mental or physical ability", "sexuality"]:
            index_tuples.append([strength, category, target])

In [13]:
index = pd.MultiIndex.from_tuples(index_tuples, names=["strength", "category", "target"])
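
The nested loops plus `from_tuples` can also be written with `pd.MultiIndex.from_product`, which builds the Cartesian product of the level values directly. This is a sketch of an equivalent construction, not a change to the cells above; the variable names are illustrative.

```python
import pandas as pd

strengths = ["mild", "medium", "strong", "strongest"]
categories = ["discriminative", "non-discriminatory"]
targets = ["None", "race", "mental or physical ability", "sexuality"]

# from_product iterates the last level fastest, which matches the order
# produced by the three nested loops.
product_index = pd.MultiIndex.from_product(
    [strengths, categories, targets],
    names=["strength", "category", "target"],
)
```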

In [14]:
rating_frame = pd.DataFrame(index=["track_id"], columns = index)

In [15]:
rating_frame

strength,mild,mild,mild,mild,mild,mild,mild,mild,medium,medium,...,strong,strong,strongest,strongest,strongest,strongest,strongest,strongest,strongest,strongest
category,discriminative,discriminative,discriminative,discriminative,non-discriminatory,non-discriminatory,non-discriminatory,non-discriminatory,discriminative,discriminative,...,non-discriminatory,non-discriminatory,discriminative,discriminative,discriminative,discriminative,non-discriminatory,non-discriminatory,non-discriminatory,non-discriminatory
target,None,race,mental or physical ability,sexuality,None,race,mental or physical ability,sexuality,None,race,...,mental or physical ability,sexuality,None,race,mental or physical ability,sexuality,None,race,mental or physical ability,sexuality
track_id,,,,,,,,,,,...,,,,,,,,,,
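
The frame above has a single row whose label is the literal string "track_id". If the goal is one row per song, a possible construction is sketched below with a reduced set of columns and two hypothetical track ids (`TRACK_A`, `TRACK_B`). In the notebook the ids would come from the lyrics table.

```python
import pandas as pd

example_columns = pd.MultiIndex.from_product(
    [["mild", "strong"], ["discriminative", "non-discriminatory"]],
    names=["strength", "category"],
)
# Hypothetical track ids, used only for this illustration.
example_ids = pd.Index(["TRACK_A", "TRACK_B"], name="track_id")

# One zero-initialised row per track.
example_ratings = pd.DataFrame(0, index=example_ids, columns=example_columns)

# Update one cell by (row label, column tuple), then slice by the outer column level.
example_ratings.loc["TRACK_A", ("mild", "non-discriminatory")] = 1
mild_part = example_ratings["mild"]
```

Selecting `example_ratings["mild"]` returns only the columns under that strength, which is the kind of slicing the MultiIndex is meant to allow.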


### Populate the rating frame

In [16]:
for track_id in id_series:
    print(track_id)
    break

('TRAAAAV128F421A322',)


In [17]:
mentioned_words = sqldb_frame[sqldb_frame["track_id"]==track_id[0]]

In [18]:
def construct_dict():
    entry={}
    for strength in ["mild", "medium", "strong", "strongest"]:
        entry[strength]={}
        for category in ["discriminative", "non-discriminatory"]:
            entry[strength][category]={}
            for target in ["None", "race", "mental or physical ability", "sexuality"]:
                entry[strength][category][target]=0
    
    return entry

In [19]:
entry = construct_dict()
total_count = 0

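for _, row in mentioned_words.iterrows():  # "_" avoids overwriting the MultiIndex stored in index
    word = row["word"]
    count = row["count"]
    
    # total number of word occurrences in this track
    total_count += count
    
    if word in word_table.index:
        rating = word_table.loc[word]
        strength = rating["strength"]
        category = rating["category"]
        target = rating["target"]
        
        # words without a target are counted under the string key "None"
        if pd.isna(target):
            target = "None"
        
        entry[strength][category][target] += count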
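
The loop above handles one track at a time. A possible vectorised variant, sketched here on toy data, joins the lyric rows with the word table and sums the counts per track and rating combination. The frames `toy_lyrics` and `toy_word_table` and their contents are invented stand-ins, and the sketch assumes each word appears only once in the word table.

```python
import pandas as pd

# Toy stand-ins; the tracks and counts are invented for illustration.
toy_lyrics = pd.DataFrame(
    {
        "track_id": ["TRACK_A", "TRACK_A", "TRACK_A", "TRACK_B"],
        "word": ["the", "crap", "bloody", "crap"],
        "count": [4, 1, 2, 3],
    }
)
toy_word_table = pd.DataFrame(
    {
        "category": ["non-discriminatory", "non-discriminatory"],
        "strength": ["mild", "mild"],
        "target": [None, None],
    },
    index=pd.Index(["bloody", "crap"], name="word"),
)

# Inner join keeps only the lyric rows whose word appears in the word table.
rated = toy_lyrics.merge(toy_word_table, left_on="word", right_index=True, how="inner")
# Missing targets become the string "None" so they are not dropped by groupby.
rated["target"] = rated["target"].fillna("None")

# Sum the counts per track and rating combination.
per_track = rated.groupby(["track_id", "strength", "category", "target"])["count"].sum()
```

The result is a Series indexed by track and rating combination. Calling `per_track.unstack(["strength", "category", "target"])` would be one way to pivot it into a frame with one row per track.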

In [33]:
import numpy as np

In [37]:
rating_frame.index

Index(['track_id'], dtype='object')

In [31]:
entry["mild"]["discriminative"].keys()

dict_keys(['None', 'race', 'mental or physical ability', 'sexuality'])

In [22]:
entry

{'medium': {'discriminative': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0},
  'non-discriminatory': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0}},
 'mild': {'discriminative': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0},
  'non-discriminatory': {'None': 1,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0}},
 'strong': {'discriminative': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0},
  'non-discriminatory': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0}},
 'strongest': {'discriminative': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0},
  'non-discriminatory': {'None': 0,
   'mental or physical ability': 0,
   'race': 0,
   'sexuality': 0}}}

In [23]:
entry["mild"]

{'discriminative': {'None': 0,
  'mental or physical ability': 0,
  'race': 0,
  'sexuality': 0},
 'non-discriminatory': {'None': 1,
  'mental or physical ability': 0,
  'race': 0,
  'sexuality': 0}}

In [24]:
word_table.index = word_table.index.astype(str)

In [25]:
word_table.loc["arse"]["target"]

In [44]:
index_tuples=[]

for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
        
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])

In [45]:
dataframe = pd.DataFrame(index=["city"], columns = index)

In [46]:
dataframe

distance,near,near,far,far
vehicle,bike,car,bike,car
city,,,,


In [48]:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
my_home_city

{'far': {'bike': 0, 'car': 1}, 'near': {'bike': 1, 'car': 0}}

In [52]:
dataframe["my_home_city"]=my_home_city

ValueError: Length of values does not match length of index
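
The assignment above most likely fails because `dataframe["my_home_city"] = ...` tries to add a column, and the two top-level dict entries do not match the single row in the index. One way around this, sketched under the assumption that each nested dict should become a row, is to read the values out in column order and build the frame from those rows. The names `travel_columns`, `cities`, `rows` and `travel_frame` are illustrative.

```python
import pandas as pd

travel_columns = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"]], names=["distance", "vehicle"]
)

# Nested dicts keyed by row label; only one city here, as in the cell above.
cities = {
    "my_home_city": {"near": {"bike": 1, "car": 0}, "far": {"bike": 0, "car": 1}},
}

# Iterating a MultiIndex yields (distance, vehicle) tuples, so each nested dict
# is flattened into a list that follows the column order.
rows = {
    city: [nested[distance][vehicle] for distance, vehicle in travel_columns]
    for city, nested in cities.items()
}

travel_frame = pd.DataFrame(
    list(rows.values()), index=list(rows.keys()), columns=travel_columns
)
```

The same pattern should carry over to the rating frame: flatten each `entry` dict in the order of the `strength`, `category`, `target` columns and collect one row per track before constructing the DataFrame.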