## Parallelizing the computation, 2nd approach

In the previous notebook we created a table of track ids and offensiveness ratings.
In the process we flattened the table and lost the detailed multi-index structure.

Flattening is not necessary for the pandas join to work.
Here we repeat the procedure and generate a second table that keeps a deeply nested multi-index.

The structure of this notebook is otherwise essentially identical to the previous one.

In [2]:
import numpy as np
import pandas as pd

In [3]:
word_table = pd.read_pickle("../pickles/word_table_cleaned.pickle")
word_table.head()

word,category,strength,target
bonk,non-discriminatory,mild,
bukkake,non-discriminatory,strong,
cocksucker,non-discriminatory,strong,
dildo,non-discriminatory,strong,
ho,non-discriminatory,strong,


In [5]:
import sqlite3
conn = sqlite3.connect("../datasets/mxm_dataset.db")

cursor = conn.cursor()
cursor.execute("SELECT track_id, word, count FROM lyrics ORDER BY track_id;")
track_word_count = cursor.fetchall()
cursor.close()
track_word_count[:5]

[('TRAAAAV128F421A322', 'i', 6),
 ('TRAAAAV128F421A322', 'the', 4),
 ('TRAAAAV128F421A322', 'you', 2),
 ('TRAAAAV128F421A322', 'to', 2),
 ('TRAAAAV128F421A322', 'and', 5)]

In [6]:
sqldb_frame = pd.DataFrame(track_word_count, columns=["track_id", "word", "count"])
del track_word_count
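
As an aside, the cursor, `fetchall` and `DataFrame` steps above can be collapsed into one call. The following is a minimal sketch, assuming an in-memory SQLite database with a few made-up rows in place of `mxm_dataset.db`; the name `toy_frame` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for mxm_dataset.db (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("T2", "you", 2), ("T1", "i", 6), ("T1", "the", 4)],
)

# read_sql_query runs the query and returns a DataFrame whose column
# names are taken from the SELECT list, so no explicit cursor is needed.
toy_frame = pd.read_sql_query(
    "SELECT track_id, word, count FROM lyrics ORDER BY track_id;", conn
)
conn.close()

print(toy_frame)
```

Whether this is faster or lighter on memory for the full 19-million-row table has not been measured here; it mainly saves the intermediate `track_word_count` variable.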

In [7]:
sqldb_frame["word"] = sqldb_frame["word"].astype(str)

## Performing a left outer join to match words between lyrics and offensiveness ratings

In [8]:
joint = sqldb_frame.join(word_table, on="word", how="left", lsuffix='_caller', rsuffix='_other')
print(joint.shape)
joint.head()

(19045332, 6)


,track_id,word,count,category,strength,target
0,TRAAAAV128F421A322,i,6,,,
1,TRAAAAV128F421A322,the,4,,,
2,TRAAAAV128F421A322,you,2,,,
3,TRAAAAV128F421A322,to,2,,,
4,TRAAAAV128F421A322,and,5,,,
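
The effect of the left join can be reproduced on a toy example. This is a sketch under stated assumptions: the frames `lyrics` and `ratings` and their rows are made up for illustration (only the `ho` rating is copied from the word table shown earlier), and the rating table is assumed to be indexed by `word`, as `word_table` appears to be.

```python
import pandas as pd

# Hypothetical lyrics rows: one rated word ("ho") and two unrated ones.
lyrics = pd.DataFrame(
    {
        "track_id": ["T1", "T1", "T2"],
        "word": ["i", "ho", "the"],
        "count": [6, 1, 4],
    }
)

# Rating table indexed by word, mirroring the layout of word_table.
ratings = pd.DataFrame(
    {"category": ["non-discriminatory"], "strength": ["strong"], "target": [None]},
    index=pd.Index(["ho"], name="word"),
)

# on="word" matches the "word" column of lyrics against the index of ratings.
# how="left" keeps every lyrics row; words without a rating get NaN.
toy_joint = lyrics.join(ratings, on="word", how="left")

print(toy_joint)
```

Every row of `lyrics` survives the join, and the three rating columns are carried over unflattened. The `lsuffix` and `rsuffix` arguments only matter when both frames share a column name, which is not the case in this toy example.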


In [9]:
joint_indexed = joint.set_index(["track_id", "category", "strength", "target"])
joint_indexed.loc[("TRAADYI128E078FB38",),]

category,strength,target,word,count
,,,i,16
,,,the,26
,,,you,6
,,,to,13
,,,and,29
,,,a,2
,,,me,3
,,,it,7
,,,not,5
,,,in,9


## Aggregating the data
The joined table is now indexed by track id and by the offensiveness attributes (category, strength, target). This is the structure we want, but the table still holds one row per word.

We now aggregate the rows that share an index entry by summing their counts. This gives the total word count per track in each category.
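
The aggregation described above can be sketched on toy data. This is not the call used in this notebook: it groups on the key columns with `dropna=False`, which assumes pandas 1.1 or later, and the frame `toy` with its rows is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical joined rows: most words carry no rating (NaN), one is rated.
toy = pd.DataFrame(
    {
        "track_id": ["T1", "T1", "T1", "T2", "T2"],
        "category": [np.nan, np.nan, "non-discriminatory", np.nan, np.nan],
        "strength": [np.nan, np.nan, "mild", np.nan, np.nan],
        "target": [np.nan] * 5,
        "count": [6, 4, 1, 2, 5],
    }
)

keys = ["track_id", "category", "strength", "target"]

# Sum the counts of all rows that share the same key combination.
# dropna=False keeps the groups whose keys are NaN (the unrated words);
# with the default dropna=True those groups would be discarded.
totals = toy.groupby(keys, dropna=False)["count"].sum()

print(totals)
```

The result is a Series with a four-level `MultiIndex` and one entry per distinct key combination, here three entries for five input rows.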

In [10]:
joint_indexed.index.is_unique

False

In [11]:
joint_indexed_filtered = joint_indexed["count"]
joint_indexed_filtered.head()

track_id            category  strength  target
TRAAAAV128F421A322  NaN       NaN       NaN       6
                                        NaN       4
                                        NaN       2
                                        NaN       2
                                        NaN       5
Name: count, dtype: int64

In [12]:
# Without grouping, "sum" collapses the whole Series into one scalar: the total word count.
joint_aggregated = joint_indexed_filtered.agg("sum")

In [13]:
# Key each row by str(index tuple): NaN levels survive, but the result is indexed by plain strings.
offensiveness_rating = joint_indexed_filtered.groupby(str).agg("sum")

In [14]:
offensiveness_rating.head()

('TRAAAAV128F421A322', 'non-discriminatory', 'mild', nan)      1
('TRAAAAV128F421A322', nan, nan, nan)                        102
('TRAAABD128F429CF47', nan, nan, nan)                        226
('TRAAAED128E0783FAB', nan, nan, nan)                        421
('TRAAAEF128F4273421', nan, nan, nan)                        139
Name: count, dtype: int64

In [15]:
offensiveness_rating.to_pickle("../pickles/offensiveness_rating_structured")