### In this notebook we use the tokenized, class-balanced data generated in the Basic cleaning notebook to train a Word2Vec model. The model is a CBOW (Continuous Bag of Words) model.

In [1]:
import gensim
import pandas as pd
from ast import literal_eval


Load the data with tokenized lyrics for dictionary creation

In [2]:
trackLyricsFeaturesTokenized = pd.read_csv('./tracksLyricFeatures/tracksLyricFeaturesTokenzised.csv')
trackLyricsFeaturesTokenized.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics
0,127936,Armageddon,Synapsis,Electronic,Elektro Lyrics[Verse One]\nRight from the intr...,__label__en,0.910224,\nRight from the intro\nThe God opens your ear...,"['right', 'from', 'the', 'intro', 'the', 'god'..."
1,52632,Boom,Jason Shaw,Electronic,Cali Shit Lyrics[Verse 1: J-Easie]\nStay true ...,__label__en,0.927561,\nStay true to my sound I'm just the same\nI s...,"['stay', 'true', 'to', 'my', 'sound', 'just', ..."
2,99309,Trapped in a Single Celled Organism,Ample Mammal,Electronic,"Liquid Meets Land Lyrics[Andrew Bagadounts, Il...",__label__en,0.904999,"\nFluid, raging flood\nWrote my name in blood ...","['fluid', 'raging', 'flood', 'wrote', 'my', 'n..."
3,97992,Through Your Chest,Ant The Symbol,Electronic,Dungeons and Dragons Lyrics[Verse 1: Slug]\nKi...,__label__en,0.929616,\nKinetic responses were heard frequent in the...,"['kinetic', 'responses', 'were', 'heard', 'fre..."
4,112381,Blank Letter to God,statusq,Electronic,Automatic Systematic Lyrics()\nAutomatic syste...,__label__en,0.854843,\nAutomatic systematic\nI used to know the liv...,"['automatic', 'systematic', 'used', 'to', 'kno..."


In [3]:
trackLyricsFeaturesTokenized.shape

(7144, 9)

Python lists saved to a CSV file are written as plain strings, so their structure is lost when the file is loaded again.
We need to convert the tokenized column from string back to list using **literal_eval** before building the Word2Vec model.

In [4]:
trackLyricsFeaturesTokenized['tokenized_lyrics'] = trackLyricsFeaturesTokenized['tokenized_lyrics'].apply(literal_eval)
type(trackLyricsFeaturesTokenized.iloc[0,8])

list
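
The round trip described above can be reproduced on a toy row. This is a minimal sketch that uses the standard-library `csv` module as a stand-in for the pandas save and load steps; the track name and tokens are illustrative values, not a reproduction of the notebook's pipeline.

```python
import csv
import io
from ast import literal_eval

tokens = ["right", "from", "the", "intro"]

# Write one row to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Armageddon", tokens])

# Read the row back: the second field is now a plain string, not a list.
buffer.seek(0)
track, raw_tokens = next(csv.reader(buffer))
print(type(raw_tokens).__name__)  # str

# literal_eval parses the Python literal back into a real list.
restored = literal_eval(raw_tokens)
print(type(restored).__name__)  # list
```

`literal_eval` only evaluates Python literals (lists, strings, numbers and so on), which makes it a safer choice than `eval` for data read from a file.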

Let's start building the Word2Vec CBOW model.

The first model uses a vector size of 250 dimensions, to be comparable with the 250-dimensional vectors created for the MFCC features of the FMA songs data.

In [5]:
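lyrics_dictionary_model = gensim.models.Word2Vec(
    sentences=trackLyricsFeaturesTokenized["tokenized_lyrics"],  # one token list per track
    window=5,         # maximum distance between the target word and a context word
    min_count=1,      # keep every token, including those seen only once
    workers=1,        # single worker thread
    vector_size=250,  # embedding dimensionality
    sg=0,             # 0 = CBOW, 1 = skip-gram
    epochs=5          # passes over the corpus
)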
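
To make the `sg=0` and `window` settings concrete, the sketch below lists the (context, target) pairs that a CBOW model learns from: the surrounding words are used to predict the word in the middle. It is a simplified illustration in plain Python, not gensim's implementation; the helper name `cbow_pairs` is hypothetical, and gensim's own sampling details (such as frequent-word subsampling) are ignored.

```python
def cbow_pairs(tokens, window=5):
    """Yield (context_words, target_word) pairs for one tokenized lyric."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after the target
        context = left + right
        if context:
            yield context, target


tokens = ["right", "from", "the", "intro", "the", "god"]
pairs = list(cbow_pairs(tokens, window=2))  # smaller window so the output stays readable

for context, target in pairs:
    print(context, "->", target)
```

With `window=2`, the target `"intro"` is predicted from `["from", "the", "the", "god"]`; with the notebook's `window=5`, up to ten surrounding words would be used.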

Export the 250-dimension word vectors in the binary word2vec format

In [6]:
lyrics_dictionary_model.wv.save_word2vec_format("./tracksLyricFeatures/Lyrics_Dictionary_Word2Vec.bin", binary=True)

Save the full 250-dimension model in gensim's native format

In [7]:
lyrics_dictionary_model.save("./tracksLyricFeatures/Lyrics_Dictionary_Word2Vec.model")


Test the model by finding the words most similar to "love"

In [9]:
lyrics_dictionary_model.wv.most_similar("love")

[('passion', 0.5150929689407349),
 ('hate', 0.4149256646633148),
 ('loved', 0.41420456767082214),
 ('affection', 0.4109235107898712),
 ('loving', 0.4073731005191803),
 ('envy', 0.39163070917129517),
 ('pity', 0.38703957200050354),
 ('happiness', 0.38439705967903137),
 ('hope', 0.3827247619628906),
 ('joy', 0.3818041682243347)]
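
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes the same kind of ranking in plain Python on made-up 3-dimensional vectors; `toy_vectors` and the helper names are hypothetical and the numbers have no relation to the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors purely for illustration.
toy_vectors = {
    "love": [0.9, 0.1, 0.3],
    "passion": [0.8, 0.2, 0.4],
    "intro": [-0.2, 0.9, -0.5],
}

ranked = rank_by_similarity("love", toy_vectors)
print(ranked)
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 or below means the words appear in unrelated contexts.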

Create a second CBOW model with a vector size of 125, to match the comparable MFCC feature dataset being created

In [10]:
lyrics_dictionary_model_125 = gensim.models.Word2Vec(
    sentences=trackLyricsFeaturesTokenized["tokenized_lyrics"],
    window=5,
    min_count=1,
    workers=1,
    vector_size=125,
    sg=0,
    epochs=5
)

In [11]:
lyrics_dictionary_model_125.save("./tracksLyricFeatures/Lyrics_Dictionary_Word2Vec_125.model")

Test the 125-dimension model

In [12]:
lyrics_dictionary_model_125.wv.most_similar("love")

[('passion', 0.6218207478523254),
 ('hate', 0.5335445404052734),
 ('fear', 0.5171015858650208),
 ('uncomeliness', 0.5126555562019348),
 ('heart', 0.5076101422309875),
 ('loved', 0.5032976269721985),
 ('hope', 0.5010501146316528),
 ('affection', 0.4981165826320648),
 ('soul', 0.49656733870506287),
 ('life', 0.48137032985687256)]

These Word2Vec models will now be used to get the vector representations of the theme words determined in the next notebook.