# 02 Final Data Prep - part 1

The data processing consists of three steps:

1. initial data collection and cleaning
2. creation of the BERT tokens. Tokenization can take a long time, so run this step on Colab with a GPU enabled.
3. combining the tokens with the cleaned dataset

## Data files needed to run this notebook:
- `metal_songs.csv`
- `artists-data.csv`
- `lyrics-data.csv`

## Settings:
- Set `COLAB = True` if you run this on Colab. The data files can then be placed in the root directory; otherwise they are read from `./data/`.
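
Instead of setting the flag by hand, it could be derived at runtime. The following is only a sketch: it assumes that the `google.colab` package can be found only inside a Colab runtime, and the helper name `running_on_colab` is hypothetical and not part of this notebook.

```python
import importlib.util


def running_on_colab() -> bool:
    """Return True when the google.colab package can be found (assumed Colab-only)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent package `google` is not installed at all
        return False


COLAB = running_on_colab()
# Same directory convention as the notebook: root on Colab, ./data/ locally
path = "./" if COLAB else "./data/"
print(COLAB, path)
```

Keeping the manual flag is still reasonable when you want to force one code path regardless of where the notebook runs.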

In [10]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean, matmul


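required = {'spacy', 'scikit-learn', 'numpy',
            'pandas', 'torch', 'matplotlib',
            'transformers', 'allennlp==0.9.0'}
installed = {pkg.key for pkg in pkg_resources.working_set}
# Compare on the bare package name so that a pinned spec such as
# 'allennlp==0.9.0' is not reported as missing on every run
missing = {req for req in required if req.split('==')[0] not in installed}

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)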
import spacy
import numpy as np
import pandas as pd

# SciKit Learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.svm import SVC


# Spacy
from spacy.lang.en import English
en = English()

# !python -m spacy download en_core_web_md # includes GloVe Vectors
# !python -m spacy download en_core_web_sm
# !python -m spacy download en

# import en_core_web_sm
# import en_core_web_md


# PyTorch
import torch
# import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader



# File management
import os
from os import listdir
from pathlib import Path
import pickle
import gzip

In [11]:
LOAD_DATA = False # read saved data instead of regenerating it
SAVE_DATA = False # overwrite previously generated data?

COLAB = False

In [12]:
if COLAB:
  # Google Colab
  path = "./"
  device = torch.device("cuda:0") # use the first GPU
else:
  # Laptop
  path = "./data/"
  device = torch.device("cpu")
#   !pip install ipywidgets
#   !jupyter nbextension enable --py widgetsnbextension


In [13]:
def save_pickle(filename, data):
    with open(f"{path}{filename}", "wb") as fp: 
      pickle.dump(data, fp)

def load_pickle(filename):
    with open(f"{path}{filename}", 'rb') as f:
      return pickle.load(f)
    
def save_file(filename, train_tokens, test_tokens, val_tokens):
  data = {"train_tokens": train_tokens, "test_tokens" : test_tokens, "val_tokens": val_tokens}

  with open(f"{path}{filename}", "wb") as fp: 
    pickle.dump(data, fp)
    
def load_file(filename):
  with open(f"{path}{filename}", 'rb') as f:
      all_text = pickle.load(f)
      return (all_text["train_tokens"], all_text["test_tokens"], all_text["val_tokens"])


In [14]:
# Read data
df_artist =  pd.read_csv(f'{path}artists-data.csv')

In [15]:
df_lyrics = pd.read_csv(f'{path}lyrics-data.csv')

In [16]:
df_metal = pd.read_csv(f'{path}metal_songs.csv')

## Lyrics

In [17]:
df_lyrics_en = df_lyrics.query("Idiom == 'ENGLISH'")
df_lyrics_en = df_lyrics_en.drop(["SLink", "Idiom"], axis=1)
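
The language filter can be checked on a small frame before running it on the full file. This is a sketch on made-up rows; only the column names (`ALink`, `SName`, `SLink`, `Lyric`, `Idiom`) are taken from the notebook, and the boolean mask is used here as an equivalent of the `query` call.

```python
import pandas as pd

# Toy rows; the values are invented for illustration
toy = pd.DataFrame({
    "ALink": ["/a/", "/a/", "/b/"],
    "SName": ["Song 1", "Song 2", "Song 3"],
    "SLink": ["/a/song-1.html", "/a/song-2.html", "/b/song-3.html"],
    "Lyric": ["hello world", "olá mundo", "good night"],
    "Idiom": ["ENGLISH", "PORTUGUESE", "ENGLISH"],
})

# Boolean-mask equivalent of toy.query("Idiom == 'ENGLISH'"),
# followed by dropping the columns that are no longer needed
toy_en = toy[toy["Idiom"] == "ENGLISH"].drop(columns=["SLink", "Idiom"])
print(toy_en)
```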

In [18]:
# optional filter: keep only the Rock, Hip Hop and Pop genres
df_artist_sm = df_artist.query("Genre == 'Rock' | Genre == 'Hip Hop' | Genre == 'Pop'")


## Artist

In [19]:
df_artist_sm_red = df_artist_sm.drop(["Genres", "Popularity", "Songs"], axis = 1)
df_artist_sm_red

Unnamed: 0,Artist,Link,Genre
0,10000 Maniacs,/10000-maniacs/,Rock
1,12 Stones,/12-stones/,Rock
2,311,/311/,Rock
3,4 Non Blondes,/4-non-blondes/,Rock
4,A Cruz Está Vazia,/a-cruz-esta-vazia/,Rock
...,...,...,...
3227,Tati Quebra Barraco,/tati-quebra-barraco/,Hip Hop
3228,Valesca Popozuda,/valesca-popozuda/,Pop
3229,Vine Rodry,/vine-rodry/,Pop
3234,Leandro Sapucahy,/leandro-sapucahy/,Pop


In [20]:
# drop duplicates
df_artist_dedup = df_artist_sm_red.drop_duplicates(subset="Link")
df_artist_sm_red

Unnamed: 0,Artist,Link,Genre
0,10000 Maniacs,/10000-maniacs/,Rock
1,12 Stones,/12-stones/,Rock
2,311,/311/,Rock
3,4 Non Blondes,/4-non-blondes/,Rock
4,A Cruz Está Vazia,/a-cruz-esta-vazia/,Rock
...,...,...,...
3227,Tati Quebra Barraco,/tati-quebra-barraco/,Hip Hop
3228,Valesca Popozuda,/valesca-popozuda/,Pop
3229,Vine Rodry,/vine-rodry/,Pop
3234,Leandro Sapucahy,/leandro-sapucahy/,Pop
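
The de-duplication step keeps the first row for each `Link` and discards later repeats. Below is a small sketch on invented rows (for example, an artist listed under two genres) that shows which rows `duplicated` flags and what remains afterwards.

```python
import pandas as pd

# Made-up rows: artist "A" appears twice with the same Link
toy = pd.DataFrame({
    "Artist": ["A", "A", "B"],
    "Link": ["/a/", "/a/", "/b/"],
    "Genre": ["Rock", "Pop", "Pop"],
})

dup_mask = toy["Link"].duplicated()         # True for every repeat after the first
dedup = toy.drop_duplicates(subset="Link")  # keeps the first row per Link

print(dup_mask.tolist())  # [False, True, False]
print(dedup)
```

Note that the genre of the surviving row depends on row order, so an artist listed under several genres ends up labelled with whichever genre comes first.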


## Metal

In [21]:
df_metal = df_metal[df_metal["Lyric"]!= ' Instrumental   ']
df_metal

Unnamed: 0.1,Unnamed: 0,Genre,Artist,Song,Lyric
1,1,Metal,QUANTICE NEVER CRASHED,Pins And Needles,Picture a parade of mannequins ivory white S...
2,2,Metal,QUANTICE NEVER CRASHED,Shaolin Casanova,Fuck you I want to know how it feels that I m...
3,3,Metal,QUANTICE NEVER CRASHED,Lighthouses,The ties that bind can gag and I m bound by bo...
4,4,Metal,QUANTICE NEVER CRASHED,Running Man,I ve built walls around me I ve surrounded my...
5,5,Metal,QUANTICE NEVER CRASHED,Two Bullets And A Gun,I guess I never told you I was never one to g...
...,...,...,...,...,...
49994,49994,Metal,ensiferum,The New Dawn,Through the storm like the wind we ride Leavin...
49996,49996,Metal,ensiferum,Victory Song,The plan of invasion an Evil deception Was mad...
49997,49997,Metal,ensiferum,Lady In Black,originally by Uriah Heep She came to me one m...
49998,49998,Metal,ensiferum,One More Magic Potion,Once when we were returning from a battle and ...


In [22]:
df_metal = df_metal[~df_metal["Lyric"].isnull()]

In [23]:
l = df_metal["Lyric"].apply(lambda x: x.strip())
l

1        Picture a parade of mannequins  ivory white  S...
2        Fuck you  I want to know how it feels that I m...
3        The ties that bind can gag and I m bound by bo...
4        I ve built walls around me  I ve surrounded my...
5        I guess I never told you  I was never one to g...
                               ...                        
49994    Through the storm like the wind we ride Leavin...
49996    The plan of invasion an Evil deception Was mad...
49997    originally by Uriah Heep She came to me one mo...
49998    Once when we were returning from a battle and ...
49999    In time bleeding wounds will heal Unlike some ...
Name: Lyric, Length: 49276, dtype: object

In [24]:
df_metal = df_metal[l!='']
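
The three metal filters above (instrumental placeholder, missing lyrics, empty lyrics) can also be written as one boolean mask. This is a sketch on invented rows; unlike the notebook, which compares against the exact string `' Instrumental   '`, it matches the placeholder case-insensitively after stripping whitespace, which is an assumption about what should count as instrumental.

```python
import pandas as pd

# Made-up rows covering each case that should be removed
toy = pd.DataFrame({
    "Song": ["S1", "S2", "S3", "S4"],
    "Lyric": ["Through the storm", " Instrumental   ", None, "   "],
})

lyric = toy["Lyric"]
stripped = lyric.str.strip()  # missing values stay missing

keep = (
    lyric.notna()                                # drop missing lyrics
    & (stripped != "")                           # drop whitespace-only lyrics
    & (stripped.str.lower() != "instrumental")   # drop the placeholder (broadened match)
)
toy_clean = toy[keep]
print(toy_clean)
```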

In [25]:
df_metal2 = pd.DataFrame({"SName": df_metal["Song"],
                          "Lyric": df_metal["Lyric"],
                          "Artist": df_metal["Artist"],
                          "Genre": df_metal["Genre"]})
df_metal2

Unnamed: 0,SName,Lyric,Artist,Genre
1,Pins And Needles,Picture a parade of mannequins ivory white S...,QUANTICE NEVER CRASHED,Metal
2,Shaolin Casanova,Fuck you I want to know how it feels that I m...,QUANTICE NEVER CRASHED,Metal
3,Lighthouses,The ties that bind can gag and I m bound by bo...,QUANTICE NEVER CRASHED,Metal
4,Running Man,I ve built walls around me I ve surrounded my...,QUANTICE NEVER CRASHED,Metal
5,Two Bullets And A Gun,I guess I never told you I was never one to g...,QUANTICE NEVER CRASHED,Metal
...,...,...,...,...
49994,The New Dawn,Through the storm like the wind we ride Leavin...,ensiferum,Metal
49996,Victory Song,The plan of invasion an Evil deception Was mad...,ensiferum,Metal
49997,Lady In Black,originally by Uriah Heep She came to me one m...,ensiferum,Metal
49998,One More Magic Potion,Once when we were returning from a battle and ...,ensiferum,Metal


## Join

In [26]:
# join the two data frames
df_lyrics_prejoin = df_lyrics_en.set_index('ALink')
df_artist_prejoin = df_artist_dedup.set_index('Link')
df_total = df_lyrics_prejoin.join(df_artist_prejoin)

In [27]:
# drop na values that couldn't be matched
df_total = df_total[~df_total["Genre"].isnull()]
df_total

Unnamed: 0,SName,Lyric,Artist,Genre
/10000-maniacs/,More Than This,I could feel at the time. There was no way of ...,10000 Maniacs,Rock
/10000-maniacs/,Because The Night,"Take me now, baby, here as I am. Hold me close...",10000 Maniacs,Rock
/10000-maniacs/,These Are Days,These are. These are days you'll remember. Nev...,10000 Maniacs,Rock
/10000-maniacs/,A Campfire Song,"A lie to say, ""O my mountain has coal veins an...",10000 Maniacs,Rock
/10000-maniacs/,Everyday Is Like Sunday,Trudging slowly over wet sand. Back to the ben...,10000 Maniacs,Rock
...,...,...,...,...
/zz-top/,Whiskey'n Mama,"I'm so tired, you on my head.. Whiskey'n mama,...",ZZ Top,Rock
/zz-top/,Woke Up With Wood,When I woke up this morning. I was feeling mig...,ZZ Top,Rock
/zz-top/,World of Swirl,"I hit the street running, had an angle in mind...",ZZ Top,Rock
/zz-top/,Your Legs Are As Hairy As My Beard,I've got a beard. And it is long. And you've g...,ZZ Top,Rock


In [28]:
# add metal dataset
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_total = pd.concat([df_total, df_metal2], ignore_index=True)
df_total

Unnamed: 0,SName,Lyric,Artist,Genre
0,More Than This,I could feel at the time. There was no way of ...,10000 Maniacs,Rock
1,Because The Night,"Take me now, baby, here as I am. Hold me close...",10000 Maniacs,Rock
2,These Are Days,These are. These are days you'll remember. Nev...,10000 Maniacs,Rock
3,A Campfire Song,"A lie to say, ""O my mountain has coal veins an...",10000 Maniacs,Rock
4,Everyday Is Like Sunday,Trudging slowly over wet sand. Back to the ben...,10000 Maniacs,Rock
...,...,...,...,...
155478,The New Dawn,Through the storm like the wind we ride Leavin...,ensiferum,Metal
155479,Victory Song,The plan of invasion an Evil deception Was mad...,ensiferum,Metal
155480,Lady In Black,originally by Uriah Heep She came to me one m...,ensiferum,Metal
155481,One More Magic Potion,Once when we were returning from a battle and ...,ensiferum,Metal
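
Stacking the two frames row-wise aligns columns by name, and `ignore_index=True` replaces the mixed index (artist links on one side, integers on the other) with a fresh 0..n-1 range, which is why the output above is numbered from 0. A sketch on invented rows with deliberately different column orders:

```python
import pandas as pd

# Made-up rows: `top` is indexed by artist link, `bottom` by integer
top = pd.DataFrame(
    {"SName": ["Song 1"], "Lyric": ["la la"], "Artist": ["A"], "Genre": ["Rock"]},
    index=["/a/"],
)
bottom = pd.DataFrame(
    {"Genre": ["Metal"], "Artist": ["B"], "SName": ["Song 2"], "Lyric": ["grr"]},
    index=[49994],
)

# Columns are matched by name, not by position
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```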


## Additional Cleaning

In [29]:
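import re

def clean_lyrics(df, field):
  """Strip [bracketed] and (parenthesised) spans from df[field]; modifies df in place."""
  # raw strings keep the regex escapes intact and avoid invalid-escape warnings
  y = [re.sub(r'\[.*?\]', '', t) for t in df[field]]
  z = [re.sub(r'\(.*?\)', '', t) for t in y]
  df[field] = z
  return df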

In [30]:
df_total_cleaned = clean_lyrics(df_total, "Lyric")
df_total_cleaned.head()

Unnamed: 0,SName,Lyric,Artist,Genre
0,More Than This,I could feel at the time. There was no way of ...,10000 Maniacs,Rock
1,Because The Night,"Take me now, baby, here as I am. Hold me close...",10000 Maniacs,Rock
2,These Are Days,These are. These are days you'll remember. Nev...,10000 Maniacs,Rock
3,A Campfire Song,"A lie to say, ""O my mountain has coal veins an...",10000 Maniacs,Rock
4,Everyday Is Like Sunday,Trudging slowly over wet sand. Back to the ben...,10000 Maniacs,Rock
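
To see what the two substitutions do, they can be applied to a single string. This is a sketch on a made-up lyric; the helper name `strip_annotations` is hypothetical, and the final whitespace collapse is an addition that the notebook's `clean_lyrics` does not perform.

```python
import re


def strip_annotations(text: str) -> str:
    """Remove [...] and (...) spans, then collapse the whitespace left behind."""
    text = re.sub(r"\[.*?\]", "", text)   # e.g. section markers such as [Chorus]
    text = re.sub(r"\(.*?\)", "", text)   # e.g. backing vocals in parentheses
    return re.sub(r"\s+", " ", text).strip()  # extra step, not in the notebook


sample = "[Chorus] These are days (oh yeah) you'll remember"
print(strip_annotations(sample))  # These are days you'll remember
```

Because `.` does not match a newline by default, a bracketed span that runs across a line break would be left untouched by these patterns.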


In [31]:
df_total_cleaned.shape

(155483, 4)

In [32]:
df_total_cleaned["Genre"].value_counts()

Rock       55802
Metal      45998
Pop        34974
Hip Hop    18709
Name: Genre, dtype: int64
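
The counts above can be turned into class shares to make the imbalance between genres explicit. The sketch below reuses the four counts printed by the notebook; on the real frame, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"Rock": 55802, "Metal": 45998, "Pop": 34974, "Hip Hop": 18709})

total = counts.sum()                  # matches the 155483 rows reported by .shape
shares = (counts / total).round(3)    # fraction of songs per genre
print(total)
print(shares)
```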

In [33]:
file_name = 'df_total_cleaned'
df_total_cleaned.to_csv(f"{path}{file_name}.csv")

In [34]:
save_pickle(f"{file_name}.pkl", df_total_cleaned)

In [35]:
# use a context manager so the gzip file is flushed and closed
with gzip.open(f'{path}{file_name}.pkl.gz', 'wb') as fp:
    pickle.dump(df_total_cleaned, fp)
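
Reading the compressed pickle back mirrors the write. The sketch below is a round trip in a temporary directory, with a plain dictionary standing in for the DataFrame so that it runs without the data files; the file name is reused from the notebook, and everything else is illustrative.

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# Stand-in for df_total_cleaned; the values are placeholders
data = {"SName": ["More Than This"], "Genre": ["Rock"]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "df_total_cleaned.pkl.gz"

    # Write: the context manager closes the gzip stream and flushes its trailer
    with gzip.open(target, "wb") as fp:
        pickle.dump(data, fp)

    # Read: open with the same compression and unpickle
    with gzip.open(target, "rb") as fp:
        restored = pickle.load(fp)

print(restored == data)
```

For a pickled DataFrame, `pd.read_pickle` should also be able to open the `.pkl.gz` file directly, since by default it infers the compression from the file extension.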