In the first notebook of this project, [notebook_01_webscraping_Evanescence_Within_Temptation.ipynb](http://localhost:8888/notebooks/Project_Evanescence_Within_Temptation/notebook_01_webscraping_Evanescence_Within_Temptation.ipynb), we used web scraping to obtain the lyrics of `Evanescence` and `Within Temptation`. After that, in [notebook_02_retrieve_Spotify_data-Evanescence_Within_Temptation.ipynb](http://localhost:8888/notebooks/Project_Evanescence_Within_Temptation/notebook_02_retrieve_Spotify_data-Evanescence_Within_Temptation.ipynb), we used the Spotify API to retrieve more details about both bands: information about the artists, their albums, and their tracks, including not only metadata but also audio features (e.g. valence, energy, tempo, liveness).

Now it is time to use the retrieved data and to explore and visualize as much as we can. Our goal is not only to explore text data, but also to visualize numeric and categorical features.

Regarding NLP (Natural Language Processing), I want to do the following:

1. Text analysis: analyze both bands and compare them through their lyrics, using some metrics and word clouds.
2. Sentiment analysis: explore the sentiment (polarity and subjectivity) of the lyrics as computed by [TextBlob](https://textblob.readthedocs.io/en/dev/index.html), and compare both bands through visualization.
3. Connect the tracks' metadata with the sentiment of the lyrics to draw conclusions.
4. Analyze some of the audio features, in particular the ones that have been pointed out as mood features, i.e., valence and energy, and see whether there is a relation between them and the sentiment of a track's lyrics.

Let’s get started!

# Loading all data

## Lyrics

In [1]:
import pandas as pd


In [2]:
df_lyrics_evanescence = pd.read_csv("./data/lyrics_evanescence_2020-02-16.csv")
df_lyrics_evanescence.sort_values(by='song_title').head(20)

Unnamed: 0,song_title,lyrics
3,4th of july,Shower in the dark day. Clean sparks driving d...
69,all that im living for,All that I'm living for. All that I'm dying fo...
50,angel of mine,You are everything I need to see. Smile and su...
41,anything for you,I'd give anything to give me to you. Can you f...
61,anywhere,"Dear my love, haven't you wanted to be with me..."
39,away from me,I hold my breath. as this life starts to take ...
78,before the dawn,Meet me after dark again. and I'll hold you. I...
59,bleed,How can I pretend that I don't see. What you h...
67,breathe no more,I've been looking in the mirror for so long.. ...
6,bring me to life,how can you see into my eyes. like open doors....


In [3]:
df_lyrics_within_temptation = pd.read_csv("./data/lyrics_within_temptation_2020-02-16.csv")
df_lyrics_within_temptation.sort_values(by='song_title').head(15)

Unnamed: 0,song_title,lyrics
17,a dangerous mind,Cause something is not right. I follow the sig...
64,a demons fate,"Ooh, ooh, ooh, ooh, ooh. Ooh, ooh, ooh, ooh, o..."
21,all i need,I'm dying to catch my breath. Oh why don't I e...
60,angels,Sparkling angel I believed. You were my saviou...
40,another day,I know you are going away. I take my love into...
51,aquarius,I hear your whispers. Break the silence and it...
29,bittersweet,If I tell you. Will you listen?. Will you stay...
67,blue eyes,Blue eyes wide to the world. Full of dreams an...
56,caged,These are the darkest clouds. They have surrou...
31,candles,Take away. These hands of darkness. Reaching f...


## Spotify's data

Of all the data retrieved, I'll concentrate on the track information. I'll be using the .csv file we saved after removing at least some of the duplicated tracks.

In [4]:
df_tracks_evanescence = pd.read_csv("./data/info_tracks_evanescence_without_duplicates_2020-02-16.csv")
df_tracks_evanescence.sort_values(by='track_name').head(10)

Unnamed: 0,album_name,track_id,track_name,track_duration,track_popularity,track_preview,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
51,Evanescence (Deluxe Version),3UkDyGtriDY7NzOJbF0rIH,a new way to bleed,226400,44,,0.378,0.895,1,-4.347,1,0.0531,5e-05,0.0252,0.15,0.258,155.946
37,The Open Door,4iDQezFTnOwgnrPYiqQ6TP,all that i am living for,228706,48,,0.514,0.809,3,-4.396,0,0.0617,0.0121,0.0,0.0763,0.385,136.881
58,Lost Whispers,2lH8hMXxuIcjpbIok9KbUj,breathe no more b side version,228809,49,,0.62,0.186,11,-8.527,0,0.0284,0.971,1e-06,0.117,0.219,96.992
19,Anywhere But Home (Live),2zn4moJkEmIVfV83iye9t5,"breathe no more live from le zénith,france/2004",213853,33,,0.562,0.431,11,-10.67,0,0.0307,0.323,0.0185,0.955,0.167,108.012
1,Fallen,0COqiPhxzoWICwFCS4eZcp,bring me to life,235893,77,,0.331,0.943,4,-3.188,0,0.0698,0.00721,2e-06,0.242,0.296,94.612
73,Synthesis Live,1rvxZ0qg96Nkr3PLhHTbCA,bring me to life live,264026,29,https://p.scdn.co/mp3-preview/87cbd661e1853b8f...,0.149,0.813,4,-5.26,0,0.056,0.346,2.1e-05,0.914,0.242,90.642
21,Anywhere But Home (Live),1AjCrY9w0edn2jAGEAkzJ7,"bring me to life live from le zénith,france/...",283760,40,,0.341,0.825,4,-7.22,0,0.0622,0.0221,0.0306,0.522,0.0398,94.992
64,Synthesis,4vHFFk4Vm9NWhGq2FAsTlj,bring me to life synthesis,257320,6,,0.362,0.785,4,-3.876,0,0.0567,0.61,1e-06,0.0722,0.16,90.904
27,The Open Door,663Karu2rvKLdnY0eo1n3M,call me when you're sober,214706,64,,0.45,0.883,7,-4.094,1,0.0524,0.00193,0.0,0.293,0.328,93.41
30,The Open Door,6Sh05fnlrLbMfSuI8Qur6a,cloud nine,262173,44,,0.125,0.893,3,-4.217,0,0.21,0.0432,8.5e-05,0.151,0.19,194.55


In [5]:
df_tracks_within_temptation = pd.read_csv("./data/info_tracks_within_temptation_without_duplicates_2020-02-16.csv")
df_tracks_within_temptation.sort_values(by='track_name').head(10)

Unnamed: 0,album_name,track_id,track_name,track_duration,track_popularity,track_preview,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
32,The Silent Force,6D5ih8y9mKmCSkuZO2Up2Q,a dangerous mind,256533,34,https://p.scdn.co/mp3-preview/c2c47b037fe1394c...,0.365,0.894,6,-5.491,0,0.0727,0.0711,0.000378,0.135,0.476,180.2
75,The Unforgiving,6ivwIJGFnzTRPG2dHvKA07,a demon's fate,329537,40,https://p.scdn.co/mp3-preview/cd99a3cda30d3714...,0.46,0.912,5,-3.444,0,0.0596,0.000579,0.000217,0.104,0.311,134.074
42,The Heart Of Everything,0lW4J9tzxpODQ8IExSumDW,all i need,290946,24,https://p.scdn.co/mp3-preview/15ddd25586b4f62e...,0.233,0.73,10,-4.855,1,0.0449,0.201,4e-06,0.13,0.123,152.972
55,An Acoustic Night At The Theatre,1tbSP6d2KwBB2DZUJLalRZ,all i need live,320946,21,https://p.scdn.co/mp3-preview/eae98f3734badfd0...,0.368,0.674,7,-5.859,0,0.0328,0.424,0.0,0.951,0.124,149.204
91,Hydra (Special Edition),6MubsJeQrVa0k7lJSxcdaM,and we run,230067,7,https://p.scdn.co/mp3-preview/af524142f40dcacf...,0.544,0.837,6,-4.618,0,0.0465,0.0596,0.0,0.0698,0.159,128.98
99,Hydra (Special Edition),13cZ2hORsadxvc2KLUBZoA,and we run evolution track,341497,5,https://p.scdn.co/mp3-preview/fc476336bac928d0...,0.578,0.723,6,-7.949,0,0.0507,0.136,2e-06,0.196,0.165,129.054
131,Let Us Burn - Elements & Hydra Live In Concert,301osYEEEVs4EQNXZXStCi,and we run live 2014,236746,0,https://p.scdn.co/mp3-preview/3789af5d464b633f...,0.51,0.865,9,-4.793,1,0.0431,0.19,3.1e-05,0.679,0.451,129.013
27,The Silent Force,3TEwbiC0GhIRStn3Eabtu7,angels,240440,55,https://p.scdn.co/mp3-preview/1dbf69a32db3b4d2...,0.341,0.867,7,-4.727,0,0.0492,0.293,0.0,0.257,0.2,182.023
114,Let Us Burn - Elements & Hydra Live In Concert,6oQdvGElasxvHYutiewDSc,angels live 2012,252226,0,https://p.scdn.co/mp3-preview/9f9cc354c35bf303...,0.438,0.852,7,-5.567,0,0.0387,0.147,0.0,0.976,0.246,91.061
105,Enter + The Dance,4nroowkyOM1HB9BOwUVV3M,another day,348453,16,https://p.scdn.co/mp3-preview/f76426030a7cfb44...,0.15,0.637,10,-6.177,1,0.0344,0.000843,0.00302,0.357,0.174,150.038


One thing to notice is that the song titles in the lyrics data contain no apostrophes (`'`), while the track names from Spotify do. E.g.: song_title: call me when youre sober x track_name: call me when you're sober.

So I'll remove the apostrophes from every track_name.

In [6]:
# Remove apostrophes so that track_name matches the format of song_title
df_tracks_evanescence["track_name"] = df_tracks_evanescence["track_name"].str.replace("'", "", regex=False)
df_tracks_within_temptation["track_name"] = df_tracks_within_temptation["track_name"].str.replace("'", "", regex=False)

# Try to merge lyrics data with Spotify's data

In [7]:
df_evanescence_merged = df_lyrics_evanescence.merge(df_tracks_evanescence, left_on='song_title', right_on='track_name')

In [8]:
df_evanescence_merged

Unnamed: 0,song_title,lyrics,album_name,track_id,track_name,track_duration,track_popularity,track_preview,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,lost in paradise,"I’ve been believing in something so distant, a...",Evanescence,7c8unZeNL9gI6Go9DgGYpb,lost in paradise,282293,46,,0.266,0.571,8,-4.983,1,0.0376,0.0532,4e-06,0.073,0.104,113.059
1,going under,now I will tell you what I've done for you. 50...,Fallen,3UygY7qW2cvG9Llkay6i1i,going under,214946,68,,0.37,0.858,11,-4.885,0,0.0545,0.00815,2.1e-05,0.229,0.464,175.077
2,lose control,You don't remember my name. I don't really car...,The Open Door,3DxKCst1JPt8qw7soauFXc,lose control,290000,43,,0.308,0.726,8,-6.547,0,0.0608,0.0966,0.0134,0.102,0.161,90.126
3,bring me to life,how can you see into my eyes. like open doors....,Fallen,0COqiPhxzoWICwFCS4eZcp,bring me to life,235893,77,,0.331,0.943,4,-3.188,0,0.0698,0.00721,2e-06,0.242,0.296,94.612
4,cloud nine,"If you want to live, let live.. If you want to...",The Open Door,6Sh05fnlrLbMfSuI8Qur6a,cloud nine,262173,44,,0.125,0.893,3,-4.217,0,0.21,0.0432,8.5e-05,0.151,0.19,194.55
5,lacrymosa,Out on your own. Cold and alone again. Can thi...,The Open Door,1M8YN6ekSgCnjc5UckHYpq,lacrymosa,217466,45,,0.439,0.775,4,-5.35,0,0.0727,0.00801,5.3e-05,0.104,0.194,136.973
6,call me when youre sober,Don't cry to me. If you loved me. You would be...,The Open Door,663Karu2rvKLdnY0eo1n3M,call me when youre sober,214706,64,,0.45,0.883,7,-4.094,1,0.0524,0.00193,0.0,0.293,0.328,93.41
7,sweet sacrifice,"It's true, we're all a little insane. But it's...",The Open Door,7hlXiMxN81uctLsvbtHZ8x,sweet sacrifice,185533,53,,0.486,0.87,2,-4.947,1,0.0903,0.00679,3e-06,0.312,0.336,97.013
8,imperfection,The more you try to fight it. The more you try...,Synthesis,4jdfxZWpcsTIAYs37OcY1y,imperfection,262893,10,,0.527,0.832,5,-3.593,0,0.0615,0.204,4.4e-05,0.108,0.339,83.04
9,farther away,"I took their smiles and I made them mine.. I,I...",Lost Whispers,63Yk0ZcjJSv37O8Vy7PFZi,farther away,239037,46,,0.304,0.868,9,-3.834,1,0.044,0.000285,0.155,0.185,0.325,170.093


In [9]:
df_evanescence_merged.shape[0]/df_lyrics_evanescence.shape[0]

0.47674418604651164

By simply merging our dataframes, we obtained complete information (lyrics, metadata, and audio features) for 41 songs. This means that 47.7% of the `Evanescence` songs for which we retrieved lyrics now have additional information.

In [10]:
df_within_temptation_merged = df_lyrics_within_temptation.merge(df_tracks_within_temptation, left_on='song_title', right_on='track_name')
df_within_temptation_merged

Unnamed: 0,song_title,lyrics,album_name,track_id,track_name,track_duration,track_popularity,track_preview,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,lost,My hope is on fire. My dreams are for sale. I ...,The Unforgiving,0v2Ad5NPKP8LKv48m0pVHx,lost,313469,38,https://p.scdn.co/mp3-preview/73c8077260af8b7b...,0.218,0.727,1,-3.751,1,0.0369,0.026,1e-06,0.12,0.122,93.493
1,the swan song,"Winter has come for me, can't carry on. The ch...",The Silent Force,0hShpzIaJdb5nd1nNnBKmQ,the swan song,237573,34,https://p.scdn.co/mp3-preview/f7a7c64a916cf764...,0.336,0.533,8,-6.654,0,0.0288,0.417,1.2e-05,0.0983,0.282,79.958
2,in perfect harmony,At the end of a closing day. A little child wa...,Mother Earth,2Y7cKGD4mm9gXXHvUDIxvl,in perfect harmony,419453,23,https://p.scdn.co/mp3-preview/3aba1baa7c7af95e...,0.336,0.353,2,-10.836,1,0.0331,0.907,0.00157,0.205,0.112,113.877
3,see who i am,"Is it true what they say,. are we too blind to...",The Silent Force,0CwkOVolvoJhO9Q2OH7zJf,see who i am,291533,43,https://p.scdn.co/mp3-preview/8b52de665a7413e1...,0.317,0.921,11,-4.25,0,0.0727,0.104,0.0,0.25,0.272,187.361
4,forsaken,Now the day has come. We are forsaken this tim...,The Silent Force,1O69rWQLUiNrhIRdeiLa6S,forsaken,293493,41,https://p.scdn.co/mp3-preview/f18f45599e52849c...,0.441,0.922,9,-4.555,0,0.0665,0.00852,0.00012,0.102,0.235,98.961
5,the howling,"We've been seeing what you want,. You've got u...",The Heart Of Everything,19BMw9z6SBfVQJyfCnAFyo,the howling,333866,27,https://p.scdn.co/mp3-preview/5a1858b2ce45e2a4...,0.444,0.953,6,-4.171,0,0.0681,0.0115,0.00837,0.105,0.279,93.008
6,what have you done,What have you done now. I know I'd better stop...,The Heart Of Everything,209U1Dxfp3k9FSYQ9oMUwk,what have you done,313333,30,https://p.scdn.co/mp3-preview/2b800076c0b572cd...,0.258,0.932,6,-3.813,0,0.187,0.00358,0.000138,0.0982,0.123,113.333
7,where is the edge,In the shadows it awakes the desire. But you k...,The Unforgiving,3MGoTdExMjBuJzaaFv8HbY,where is the edge,239253,35,https://p.scdn.co/mp3-preview/6fee29acaa320b87...,0.38,0.895,8,-3.924,0,0.044,0.00255,1e-06,0.0906,0.292,154.03
8,forgiven,Love you so it hurts my soul. Can you forgive ...,The Heart Of Everything,3diWdX9Upe8r0EPjoKBLmx,forgiven,292066,24,https://p.scdn.co/mp3-preview/7a2686a9a5e1999f...,0.523,0.342,7,-8.933,0,0.0298,0.932,0.000816,0.135,0.172,99.888
9,never ending story,Armies have conquered. And fallen in the end. ...,Mother Earth,5STGuT6TzYESk0uuz9CHZj,never ending story,244453,23,https://p.scdn.co/mp3-preview/8caa2ea438613597...,0.634,0.322,2,-11.425,0,0.0301,0.435,0.000427,0.141,0.135,137.083


In [11]:
df_within_temptation_merged.shape[0]/df_lyrics_within_temptation.shape[0]

0.7162162162162162

In the case of `Within Temptation`, we got complete information about 71.6% of the songs for which we have lyrics.
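
To understand why some lyrics were left out of the merge, it can help to list the titles that found no counterpart in the Spotify data. The sketch below is one possible way to do that, using a left merge with pandas' `indicator=True` option. The helper name `match_report` is hypothetical and not part of the original notebooks, and the report is computed on unique titles, which is an assumption that may give a slightly different ratio from the row-based one above if a title appears more than once.

```python
import pandas as pd


def match_report(df_lyrics, df_tracks):
    """Return (share of lyric titles matched, sorted list of unmatched titles).

    Hypothetical helper: compares unique song_title values with unique
    track_name values, so duplicated tracks do not inflate the ratio.
    """
    titles = df_lyrics[["song_title"]].drop_duplicates()
    names = df_tracks[["track_name"]].drop_duplicates()

    # indicator=True adds a "_merge" column: "both" or "left_only"
    merged = titles.merge(
        names,
        how="left",
        left_on="song_title",
        right_on="track_name",
        indicator=True,
    )
    matched = merged["_merge"] == "both"
    unmatched_titles = sorted(merged.loc[~matched, "song_title"])
    return matched.mean(), unmatched_titles
```

Calling `match_report(df_lyrics_evanescence, df_tracks_evanescence)` would return the matched share together with the titles still to be reconciled. Pairs such as `all that im living for` (lyrics) and `all that i am living for` (Spotify), both visible in the tables above, are the kind of mismatch this list should surface.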

# Analysing lyrics

To start, I'll build some word clouds and analyze some metrics based solely on the lyrics data, since the merge did not match every lyric to a track and the merged dataframes would leave some lyrics out.

Further investigation would probably increase the number of lyrics matched, but for now we will keep it like this.

In this part I will perform the following steps:

1. Clean `lyrics` and save the result in a new column `lyrics_clean`.
2. Apply [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) and create columns with the sentiment property information (`polarity` and `subjectivity`). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is very negative, 0 is neutral, and 1.0 is very positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
3. Use polarity to define a lyric as positive or negative. This information will be in a binary column `is_positive`, where `1` indicates a positive lyric and `0` a negative one.
4. Create feature `lyrics_len` with the length of the lyric.
5. Create feature `num_words` with the lyric's number of words.
6. Build word clouds.
7. Build some graphs.
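
Steps 2 to 5 above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the helper names `textblob_sentiment` and `add_lyric_features` are hypothetical, a lyric counts as positive only when its polarity is strictly greater than 0 (so neutral lyrics end up with `is_positive = 0`), and `lyrics_len` is measured in characters of the cleaned lyric. TextBlob is imported lazily so that the feature logic can also be tried with another scoring function.

```python
import pandas as pd


def textblob_sentiment(text):
    """Return (polarity, subjectivity) as computed by TextBlob."""
    # Imported lazily; requires `pip install textblob`
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def add_lyric_features(df, text_col="lyrics_clean", sentiment_fn=textblob_sentiment):
    """Return a copy of df with sentiment and length features added."""
    out = df.copy()

    # One (polarity, subjectivity) pair per lyric
    scores = pd.DataFrame(
        [sentiment_fn(text) for text in out[text_col]],
        index=out.index,
        columns=["polarity", "subjectivity"],
    )
    out["polarity"] = scores["polarity"]
    out["subjectivity"] = scores["subjectivity"]

    # Assumption: strictly positive polarity means a positive lyric
    out["is_positive"] = (out["polarity"] > 0).astype(int)

    # Length in characters and number of whitespace-separated words
    out["lyrics_len"] = out[text_col].str.len()
    out["num_words"] = out[text_col].str.split().str.len()
    return out
```

Once `lyrics_clean` exists, a call such as `add_lyric_features(df_lyrics_evanescence)` would add the five columns in one pass.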


There are some nice word clouds in R that I wanted to include in this project, and I have already made a start, which is available on [GitHub](https://github.com/dpbac/evanescence_and_within_temptation_in_R).

For the R project, my intention was to use the original lyrics csv that we have just loaded. However, while developing the R word cloud part some strange characters showed up, and the R options I tried for cleaning them had no effect. So I came back to Python and decided to apply the following line of code:

        unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

found at https://github.com/kaparker/gameofthrones-wordclouds/blob/master/gotwordcloud.py, which worked just fine.
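
To make the effect of that line concrete, here is a small standalone sketch. The wrapper name `to_ascii` is hypothetical and used only for illustration. NFKD normalization splits an accented letter into its base letter plus a combining mark, and the ASCII encoding step with `'ignore'` then drops everything that is not plain ASCII.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters and drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("utf-8", "ignore")


print(to_ascii("le zénith"))   # le zenith
print(to_ascii("I’ve been"))   # Ive been
```

Note the second example: the typographic apostrophe (’) has no ASCII equivalent, so it is removed entirely. Since some lyrics use it (e.g. "I’ve been believing"), those contractions lose their apostrophe before any contraction expansion can see them. Replacing ’ with ' before normalizing might be worth considering.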

The function `nlp_clean` below includes some basic cleaning and also lemmatization. If you need more cleaning, the code presented at https://github.com/kaparker/gameofthrones-wordclouds/blob/master/gotwordcloud.py can be very useful.

In some cases, stemming is used in place of lemmatization. Stemming reduces words to their root form, which is not always a real word. Because I prefer to have the dictionary form in my clouds, I chose to apply lemmatization.

To deal with contractions I tried both TokTok and Moses, and I settled on TokTok because I didn't like getting, for instance, gonna as gon na. https://stackoverflow.com/questions/43041039/dont-want-nltk-word-tokenize-to-tokenize-a-single-word-gotta-into-got-and

**TIP**: If you don't succeed in installing pycontractions using `pip install pycontractions`, use:

python -m pip install _wheel_file_

In my case, _wheel_file_ was https://files.pythonhosted.org/packages/a6/f5/d3ec9491c530cbc03af32ca2c6b69b0e89660daeb2856b485d90f9d82e5e/pycontractions-2.0.1-py3-none-any.whl

In [12]:
# https://pypi.org/project/pycontractions/

import unicodedata

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from pycontractions import Contractions

# Specify any model from the gensim.downloader api - that was what worked for me
cont = Contractions(api_key="glove-twitter-100")
# Optional, prevents loading on first expand_texts call
cont.load_models()

# Build these once instead of on every call to nlp_clean
toktok = ToktokTokenizer()
english_stops = set(stopwords.words('english'))
WNlemmatizer = WordNetLemmatizer()

def nlp_clean(text):

    # Remove strange/accented characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    # Split the lyrics into sentences, because expand_texts expects a list of texts
    sentences = sent_tokenize(text)

    # Expand contractions; expand_texts returns a generator, so collect and rejoin the result
    text = ' '.join(cont.expand_texts(sentences))

    # Lower case and keep only alphanumeric tokens - I used toktok because of problems such as gonna = gon na
    tokens = [w for w in toktok.tokenize(text.lower()) if w.isalnum()]

    # Remove English stop words
    no_stops = [t for t in tokens if t not in english_stops]

    # Build a lemmatized list
    lem_tokens = [WNlemmatizer.lemmatize(token) for token in no_stops]

    return ' '.join(lem_tokens)

Error initializing LanguageTool


JavaError: can't find Java

In [None]:
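# Cleaning lyrics: apply nlp_clean to each lyric and store the result in lyrics_clean

df_lyrics_evanescence['lyrics_clean'] = df_lyrics_evanescence['lyrics'].apply(nlp_clean)
df_lyrics_within_temptation['lyrics_clean'] = df_lyrics_within_temptation['lyrics'].apply(nlp_clean)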

## Word clouds

In this part I'll create some word clouds: simple ones, and also some shaped by image masks. The steps are:



1. Clean the text and save it in a new column.
2. Create a mask from the image we want to use.
3. Create a word cloud using the mask.
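
The mask and word cloud steps could be sketched along these lines, assuming the `wordcloud`, `numpy` and `Pillow` packages are installed. The function names `word_frequencies` and `build_word_cloud`, the white background and the 200-word limit are all illustrative choices, not something fixed by this project. The third-party imports are done inside `build_word_cloud` so that the frequency counting can be used on its own.

```python
from collections import Counter


def word_frequencies(cleaned_lyrics):
    """Count how often each word occurs across an iterable of cleaned lyrics."""
    counts = Counter()
    for lyric in cleaned_lyrics:
        counts.update(lyric.split())
    return dict(counts)


def build_word_cloud(frequencies, mask_path=None):
    """Build a WordCloud from word frequencies, optionally shaped by an image mask."""
    # Lazy imports: only needed when a cloud is actually drawn
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    # In a mask image, pure white areas are left empty by WordCloud
    mask = np.array(Image.open(mask_path)) if mask_path else None

    cloud = WordCloud(background_color="white", mask=mask, max_words=200)
    return cloud.generate_from_frequencies(frequencies)
```

A possible usage would be `build_word_cloud(word_frequencies(df_lyrics_evanescence['lyrics_clean']))`, followed by `.to_file(...)` or `plt.imshow(...)` to save or display the result.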

# Sentiment Analysis