# Task for Cuetessa, Inc. – Predicting Valence of Pop Songs

## Overview
The aim of this task is to develop a Python-based module that predicts the valence of newly released pop songs. Two approaches are considered, which differ in their input: 1) the audio data of songs (e.g., .wav files) and 2) the lyrics of songs. Publicly available datasets can be used for training and testing.

## Data Description
We found a lyrics dataset on Kaggle, labeled_lyrics_cleaned.csv, that contains the full lyrics and labels of more than 150,000 songs [6]. The label is the Spotify valence attribute, which ranges from 0 to 1 and describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (happy, cheerful), while tracks with low valence sound more negative (sad, depressed).
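
As a quick illustration of the label's scale, the sketch below checks that a set of valence values lies in the documented 0 to 1 range and splits it into "low" and "high" groups. The 0.5 cut-off, the helper name `valence_group`, and the toy values are assumptions made for illustration only; they are not part of the dataset or of Spotify's definition.

```python
import pandas as pd

def valence_group(valence: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Label each value 'low' or 'high' (hypothetical helper; the threshold is an assumption)."""
    # The label is documented as ranging from 0 to 1, so reject anything outside that range
    if not valence.between(0, 1).all():
        raise ValueError("valence values are expected to lie in [0, 1]")
    return valence.apply(lambda v: "high" if v >= threshold else "low")

# Toy values for illustration only
toy = pd.Series([0.232, 0.626, 0.934], name="valence")
groups = valence_group(toy)
print(groups.tolist())
```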

In [6]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from tqdm.auto import tqdm
import re
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, accuracy_score, make_scorer
from sklearn.preprocessing import StandardScaler

In [11]:
# Load dataset and rename columns for a clearer understanding
df = pd.read_csv('/Users/gguillau/Desktop/Practicum/labeled_lyrics_cleaned.csv')
df.rename(columns= {'label': 'valence', 'seq':'lyrics'}, inplace=True)
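
The `Unnamed: 0` column that appears in the outputs of this notebook repeats the row index. One possible way to avoid carrying it around is to read the first CSV column as the index. The sketch below uses a small in-memory stand-in for the file; the toy rows and the assumption that the file's first header cell is empty are illustrative, not taken from the Kaggle file itself.

```python
import io
import pandas as pd

# Toy stand-in for the CSV (assumed layout: unnamed first column, then artist, seq, song, label)
csv_text = (
    ",artist,seq,song,label\n"
    "0,Elijah Blake,\"No, no\",Everyday,0.626\n"
    "1,Elijah Blake,The drinks go down,Live Till We Die,0.63\n"
)

# index_col=0 uses the first (unnamed) column as the row index instead of a data column
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
toy_df = toy_df.rename(columns={"label": "valence", "seq": "lyrics"})
print(list(toy_df.columns))
```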

In [12]:
df.sample()

| | Unnamed: 0 | artist | lyrics | song | valence |
|---|---|---|---|---|---|
| 75670 | 75670 | The Beta Band | \r\nBeta Band - Eclipse Lyrics\r\n\r\nAlbum: ... | Eclipse | 0.232 |


In [18]:
# function to report the share and count of null values in each column
def get_percent_of_na(df, num):
    """Print the percentage (to `num` decimal places) and count of nulls per column."""
    count = 0
    null_counts = df.isna().sum()
    for column, num_of_nulls in null_counts.items():
        if num_of_nulls == 0:
            continue
        count += 1
        percent = num_of_nulls / df.shape[0]
        print('Column {} has {:.{}%} nulls ({} rows)'.format(column, percent, num, num_of_nulls))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")

# function to display general information about the dataset
def get_info(df):
    """
    Display general information about the dataset using head(), info(),
    describe(), the shape attribute and duplicated().
    """
    print("\033[1m" + '-'*100 + "\033[0m")
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    df.info()  # info() prints its report directly and returns None
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print('-'*100)
    print()
    print('Columns with nulls:')
    get_percent_of_na(df, 4)  # prints its own report and returns None
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicated:')
    print("\033[1m" + 'We have {} duplicated rows.\n'.format(df.duplicated().sum()) + "\033[0m")

In [19]:
get_info(df)

----------------------------------------------------------------------------------------------------
Head:



| | Unnamed: 0 | artist | lyrics | song | valence |
|---|---|---|---|---|---|
| 0 | 0 | Elijah Blake | No, no\r\nI ain't ever trapped out the bando\r... | Everyday | 0.626 |
| 1 | 1 | Elijah Blake | The drinks go down and smoke goes up, I feel m... | Live Till We Die | 0.63 |
| 2 | 2 | Elijah Blake | She don't live on planet Earth no more\r\nShe ... | The Otherside | 0.24 |
| 3 | 3 | Elijah Blake | Trippin' off that Grigio, mobbin', lights low\... | Pinot | 0.536 |
| 4 | 4 | Elijah Blake | I see a midnight panther, so gallant and so br... | Shadows & Diamonds | 0.371 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158353 entries, 0 to 158352
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  158353 non-null  int64  
 1   artist      158353 non-null  object 
 2   lyrics      158353 non-null  object 
 3   song        158353 non-null  object 
 4   valence     158353 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 6.0+ MB


----------------------------------------------------------------------------------------------------
Describe:



| | Unnamed: 0 | valence |
|---|---|---|
| count | 158353.0 | 158353.0 |
| mean | 79176.0 | 0.491052 |
| std | 45712.717926 | 0.249619 |
| min | 0.0 | 0.0 |
| 25% | 39588.0 | 0.286 |
| 50% | 79176.0 | 0.483 |
| 75% | 118764.0 | 0.691 |
| max | 158352.0 | 0.998 |


----------------------------------------------------------------------------------------------------

Columns with nulls:

There are no columns with NA.

----------------------------------------------------------------------------------------------------
Shape:
(158353, 5)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.


In [23]:
# Clean lyrics column: replace each carriage return and newline with a space
df['lyrics'] = df['lyrics'].str.replace(r"[\r\n]", " ", regex=True)

In [24]:
# Visualize cleaned lyrics column
df['lyrics'].head(20)

0     No, no  I ain't ever trapped out the bando  Bu...
1     The drinks go down and smoke goes up, I feel m...
2     She don't live on planet Earth no more  She fo...
3     Trippin' off that Grigio, mobbin', lights low ...
4     I see a midnight panther, so gallant and so br...
5     I just want to ready your mind  'Cause I'll st...
6     To believe  Or not to believe  That is the que...
7     Dieses ist lange her.  Da ich deine schmalen H...
8     A child is born  Out of the womb of a mother  ...
9     Out of the darkness you came   You looked so t...
10    Each night I lie in my bed   And I think about...
11    Nebel zieh'n gespentisch vor   Der Sucher setz...
12    I'm a lonely stranger   In this world of pain ...
13    Schwere Tranen   Vergebens geweint   Rinnen wi...
14    Come calm my anger  Our love is like a perfect...
15    I was walking through the night Suddenly I rem...
16    Why can't I hear you breath Why can't I hold y...
17    Your cold cold heart has drowned my life i
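
The cleaned output above still shows double spaces where a `\r\n` pair was replaced character by character. If single spacing is preferred, one possible follow-up is to collapse every run of whitespace into one space. The sketch below runs on a toy Series (`toy_lyrics` is a made-up stand-in, not data from the file) and is a suggestion rather than a step the notebook performs.

```python
import pandas as pd

# Toy stand-in for the lyrics column
toy_lyrics = pd.Series([
    "No, no\r\nI ain't ever trapped\r\nout",
    "One line\nTwo  lines",
])

# \s+ matches any run of whitespace (spaces, \r, \n, tabs); each run becomes one space
cleaned = toy_lyrics.str.replace(r"\s+", " ", regex=True).str.strip()
print(cleaned.tolist())
```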

In [25]:
# Take a random sample from original dataset
data = df.sample(frac=0.40, random_state=12345)
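
Fixing `random_state` makes the sample reproducible: the same rows are drawn on every run. The sketch below shows this on a ten-row toy frame; `toy_df`, `sample_a` and `sample_b` are hypothetical names used only for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({"valence": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})

# The same frac and random_state select the same rows each time
sample_a = toy_df.sample(frac=0.40, random_state=12345)
sample_b = toy_df.sample(frac=0.40, random_state=12345)

print(len(sample_a))                           # 40% of 10 rows
print(sample_a.index.equals(sample_b.index))   # identical row labels in identical order
```

Note that the sample keeps the original row labels, so its index is shuffled rather than running from 0 upward.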

In [26]:
get_info(data)

----------------------------------------------------------------------------------------------------
Head:



| | Unnamed: 0 | artist | lyrics | song | valence |
|---|---|---|---|---|---|
| 119994 | 119994 | Letters to Cleo | The saddest sound I've ever heard; the saddest... | Wasted | 0.465 |
| 44328 | 44328 | Gregory Isaacs | I thought it would be better Now I'm a brande... | Hot Stepper | 0.934 |
| 139867 | 139867 | HONNE | Ten out of ten You killed it once again 'Cau... | Woman | 0.246 |
| 152222 | 152222 | Richard Marx | We're all victims of the system Still we love... | Hands in Your Pocket | 0.789 |
| 90485 | 90485 | Billy Currington | Hey, girl, what's your name, girl I've been lo... | Hey Girl | 0.652 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63341 entries, 119994 to 1033
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  63341 non-null  int64  
 1   artist      63341 non-null  object 
 2   lyrics      63341 non-null  object 
 3   song        63341 non-null  object 
 4   valence     63341 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 2.9+ MB


----------------------------------------------------------------------------------------------------
Describe:



| | Unnamed: 0 | valence |
|---|---|---|
| count | 63341.0 | 63341.0 |
| mean | 79241.518006 | 0.490853 |
| std | 45741.642903 | 0.249305 |
| min | 1.0 | 0.0 |
| 25% | 39588.0 | 0.287 |
| 50% | 79211.0 | 0.483 |
| 75% | 118775.0 | 0.69 |
| max | 158352.0 | 0.996 |


----------------------------------------------------------------------------------------------------

Columns with nulls:

There are no columns with NA.

----------------------------------------------------------------------------------------------------
Shape:
(63341, 5)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.


In [28]:
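# Round down to 60000 rows
# Note: this slices the original `df` by label (rows 0 through 59999 in file order),
# not the random sample `data` created above
df = df.loc[:59999]
len(df)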

60000
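
The cell above slices `df`, whose index is the default range starting at 0, so `.loc[:59999]` returns the first 60,000 rows in file order. If the intent were instead to trim the random sample `data`, positional slicing with `.iloc` would be the safer choice, because the sample keeps its original, shuffled labels. A minimal sketch on toy data (the names `toy_df`, `toy_sample` and `n_keep` are hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({"valence": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
toy_sample = toy_df.sample(frac=1.0, random_state=0)  # shuffled rows, original labels kept

n_keep = 4  # stand-in for 60000

# iloc counts positions, so this always keeps the first n_keep sampled rows
trimmed = toy_sample.iloc[:n_keep]
print(len(trimmed))

# On the unshuffled frame, loc[:n_keep - 1] selects labels 0 through n_keep - 1 in file order
first_rows = toy_df.loc[:n_keep - 1]
print(first_rows.index.tolist())
```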