# Data Cleaning

In [48]:
import pandas as pd
from nltk import RegexpTokenizer, PorterStemmer

### Data Dictionary

|Key|Value Type|Value Description|
|---|---|---|
|danceability|float|Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.|
|energy|float|Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.|
|key|int|The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.|
|loudness|float|The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.|
|mode|int|Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.|
|speechiness|float|Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.|
|acousticness|float|A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.|
|instrumentalness|float|Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.|
|liveness|float|Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.|
|valence|float|A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).|
|tempo|float|The overall estimated tempo of a track in beats per minute (BPM).|
|duration_sec|float|The duration of the track in seconds.|
|time_signature|int|An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).|
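
The encodings in the table above can be turned into readable labels. The following is a minimal sketch, not part of the original notebook: the helper names `describe_key` and `speechiness_band` are hypothetical, and the band boundaries simply follow the 0.33 and 0.66 thresholds quoted in the speechiness description.

```python
# Pitch Class notation: 0 = C, 1 = C♯/D♭, 2 = D, and so on up to 11 = B.
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']


def describe_key(key, mode):
    """Combine the 'key' and 'mode' columns into a label such as 'G major'."""
    if key == -1:
        # -1 means that no key was detected
        return 'no key detected'
    quality = 'major' if mode == 1 else 'minor'
    return f'{PITCH_CLASSES[key]} {quality}'


def speechiness_band(value):
    """Map a speechiness score to the three bands described in the table."""
    if value > 0.66:
        return 'probably entirely spoken words'
    if value >= 0.33:
        return 'may contain both music and speech'
    return 'most likely music or other non-speech audio'


print(describe_key(7, 1))
print(speechiness_band(0.109))
```

Both helpers take plain Python values, so they could be applied row by row to a dataframe if labelled columns are wanted for EDA.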

### Dropped Columns

|Key|Value Type|Value Description|
|---|---|---|
|Unnamed: 0|int|Index from csv|
|type|string|The object type: “audio_features”|
|id|string|The Spotify ID for the track.|
|uri|string|The Spotify URI for the track.|
|track_href|string|A link to the Web API endpoint providing full details of the track.|
|analysis_url|string|An HTTP URL to access the full audio analysis of this track. An access token is required to access this data.|
|duration_ms|int|The duration of the track in milliseconds.|
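
The `Unnamed: 0` column is described above as the index from the CSV. The sketch below uses hypothetical toy data to show how such a column typically arises: `to_csv` writes the index by default, and `read_csv` then reads it back as an unnamed column. It illustrates pandas behaviour only and does not claim to show how the source files were produced.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for an audio-features table
features = pd.DataFrame({'tempo': [120.0, 95.0]})

# Default to_csv() writes the index as a first column with an empty header
buffer = io.StringIO()
features.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# Passing index=False leaves the index out of the file
buffer = io.StringIO()
features.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))     # ['Unnamed: 0', 'tempo']
print(list(without_index.columns))  # ['tempo']
```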

### Clean and Merge Audio Feature DataFrames

In [3]:
# Define a function that concatenates two audio-feature dataframes and cleans the result
# Inputs: two audio-feature dataframes; returns the cleaned, combined dataframe
def clean_master(df1, df2):
    
    # Concatenate dataframes
    master_df = pd.concat([df1, df2])
    
    # Drop columns that will not be used for analysis
    master_df.drop(['Unnamed: 0', 'type', 'id', 'uri', 'track_href', 'analysis_url'], axis=1, inplace=True)
    
    # Create 'duration_sec' by converting 'duration_ms' (milliseconds) to seconds
    master_df['duration_sec'] = master_df['duration_ms'] / 1000
    
    # Drop 'duration_ms' column
    master_df.drop('duration_ms', axis=1, inplace=True)
    
    # Drop all duplicates
    master_df.drop_duplicates(inplace=True)
    return master_df
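
To see what the function does on a small input, here is a hedged sketch. `clean_master_sketch` is a non-mutating variant written for illustration, not the notebook's own function, and `toy_frame` with its values is hypothetical. Duplicates are detected after `id` has been dropped, so rows are compared on feature values alone.

```python
import pandas as pd

DROP_COLS = ['Unnamed: 0', 'type', 'id', 'uri', 'track_href', 'analysis_url']


def clean_master_sketch(df1, df2):
    """Concatenate two feature tables, drop unused columns, convert ms to s, dedupe."""
    master = pd.concat([df1, df2], ignore_index=True)
    master = master.drop(columns=DROP_COLS)
    master['duration_sec'] = master['duration_ms'] / 1000
    master = master.drop(columns='duration_ms')
    return master.drop_duplicates().reset_index(drop=True)


def toy_frame(rows):
    """Build a tiny stand-in table from (id, tempo, duration_ms) tuples."""
    return pd.DataFrame({
        'Unnamed: 0': range(len(rows)),
        'type': 'audio_features',
        'id': [r[0] for r in rows],
        'uri': 'placeholder',
        'track_href': 'placeholder',
        'analysis_url': 'placeholder',
        'tempo': [r[1] for r in rows],
        'duration_ms': [r[2] for r in rows],
    })


billboard = toy_frame([('a', 120.0, 200000), ('b', 95.0, 180500)])
songfacts = toy_frame([('a', 120.0, 200000), ('c', 140.0, 210000)])

result = clean_master_sketch(billboard, songfacts)
print(result)
```

Track `a` appears in both toy sources with identical features, so only one copy survives and the result has three rows.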

### 2018 audio features

In [4]:
bb_18_ft = pd.read_csv('./data/2018_billboard_features')

In [5]:
sf_18_ft = pd.read_csv('./data/2018_songfacts_features')

In [6]:
# Run function on 2018 billboard and songfacts audio features
master_18_ft = clean_master(bb_18_ft, sf_18_ft)

In [8]:
master_18_ft.shape

(1395, 13)

In [9]:
master_18_ft.head()

| |danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|tempo|time_signature|duration_sec|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.754|0.449|7|-9.211|1|0.109|0.0332|8.3e-05|0.552|0.357|77.169|4|198.973|
|1|0.599|0.448|8|-6.312|1|0.0232|0.163|0.0|0.106|0.168|95.05|3|263.4|
|2|0.643|0.783|10|-6.458|1|0.0856|0.047|0.0|0.083|0.579|154.084|4|163.87|
|3|0.765|0.523|2|-4.333|1|0.03|0.184|3.6e-05|0.132|0.394|104.988|4|217.307|
|4|0.587|0.535|5|-6.09|0|0.0898|0.117|6.6e-05|0.131|0.14|159.847|4|218.147|


In [23]:
# Create master csv file for EDA and modeling
master_18_ft.to_csv('./data/MASTER_2018_audio_features')

### 2013 audio features

In [10]:
bb_13_ft = pd.read_csv('./data/2013_billboard_features')

In [11]:
sf_13_ft = pd.read_csv('./data/2013_songfacts_features')

In [12]:
# Run function on 2013 billboard and songfacts audio features
master_13_ft = clean_master(bb_13_ft, sf_13_ft)

In [13]:
master_13_ft.shape

(1499, 13)

In [14]:
master_13_ft.head()

| |danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|tempo|time_signature|duration_sec|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.781|0.526|6|-6.985|0|0.293|0.0619|0.0|0.0457|0.662|94.992|4|235.613|
|1|0.862|0.608|7|-4.762|1|0.0402|0.00373|6e-06|0.0856|0.836|120.002|4|263.827|
|2|0.448|0.784|9|-3.686|1|0.0627|0.106|0.000108|0.668|0.236|136.245|4|186.813|
|3|0.452|0.794|0|-5.151|1|0.0483|0.0111|0.00182|0.416|0.282|137.825|4|196.664|
|4|0.641|0.922|2|-4.457|1|0.0786|0.0291|0.0|0.0862|0.847|146.078|4|258.343|


In [24]:
# Create master csv file for EDA and modeling
master_13_ft.to_csv('./data/MASTER_2013_audio_features')

### 2008 audio features

In [15]:
bb_08_ft = pd.read_csv('./data/2008_billboard_features')

In [17]:
sf_08_ft = pd.read_csv('./data/2008_songfacts_features')

In [18]:
# Run function on 2008 billboard and songfacts audio features
master_08_ft = clean_master(bb_08_ft, sf_08_ft)

In [19]:
master_08_ft.shape

(1531, 13)

In [20]:
master_08_ft.head()

| |danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|tempo|time_signature|duration_sec|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.584|0.569|0|-4.259|1|0.166|0.0508|0.0|0.0924|0.501|89.779|4|217.603|
|1|0.638|0.656|5|-5.886|1|0.0357|0.188|0.0|0.146|0.225|104.036|4|262.467|
|2|0.703|0.748|1|-6.047|1|0.0435|0.123|0.0|0.0642|0.625|111.943|4|184.08|
|3|0.828|0.433|0|-9.716|1|0.199|0.0656|0.000876|0.122|0.44|148.073|4|299.333|
|4|0.591|0.718|8|-6.025|1|0.0368|0.348|0.000118|0.107|0.468|117.995|4|208.107|


In [25]:
# Create master csv file for EDA and modeling
master_08_ft.to_csv('./data/MASTER_2008_audio_features')

### Clean, Process and Merge Track Lyrics DataFrames

In [49]:
# Create a function to tokenize, stem, and clean a lyrics dataframe
# The dataframe is modified in place; the function returns None
def process_master(df): 
    
    # Instantiate tokenizer with specific regular expression
    tokenizer = RegexpTokenizer(r'\w+')
    # Instantiate stemmer
    stemmer = PorterStemmer()

    # List to append stemmed words
    stemmed = []        
    # List to append tokenized words
    tokenized = []
    
    # Create a for loop to iterate through all the rows in specific column
    for i in df['lyrics']:                          
        
        # Converting lyrics text to tokens
        tokens = tokenizer.tokenize(i.lower()) 
        tokenized.append(tokens)

        # Stemming all tokens
        stems = [stemmer.stem(token) for token in tokens]  
        # Appending stems to stemmed list
        stemmed.append(stems)                                         
    
    # Creating new dataframe columns
    df['tokenized_lyrics'] = [' '.join(i) for i in tokenized]    
    df['stemmed_lyrics'] = [' '.join(i) for i in stemmed]
    
    # Drop unnecessary column
    df.drop('Unnamed: 0', axis=1, inplace=True)
    # Drop duplicates
    df.drop_duplicates(inplace=True)
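
The tokenization step can be approximated with the standard library to see what the `\w+` pattern keeps and discards. This is a sketch under the assumption that `RegexpTokenizer(r'\w+')`, in its default non-gap mode, returns the pattern's matches; `tokenize_lyrics` is a hypothetical helper, and the guard for non-string values (such as a missing lyric read in as NaN) is an addition that the function above does not have. Stemming is not reproduced here.

```python
import re


def tokenize_lyrics(text):
    """Lowercase the text and keep only runs of word characters."""
    if not isinstance(text, str):
        # Assumption: treat missing or non-text lyrics as having no tokens
        return []
    return re.findall(r'\w+', text.lower())


tokens = tokenize_lyrics('(Award to the Artist and to the Producer(s)')
print(' '.join(tokens))
```

Punctuation acts as a separator, which is why `Producer(s)` ends up as the two tokens `producer` and `s`.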

### 2018 track lyrics

In [50]:
bb_18_lyr = pd.read_csv('./data/2018_billboard_lyrics')

In [51]:
sf_18_lyr = pd.read_csv('./data/2018_songfacts_lyrics')

In [52]:
# Concatenate dataframes
master_18_lyr = pd.concat([bb_18_lyr, sf_18_lyr])

In [53]:
# Run function on 2018 master lyrics dataframe
process_master(master_18_lyr)

In [54]:
master_18_lyr.head()

| |lyrics|tokenized_lyrics|stemmed_lyrics|
|---|---|---|---|
|0|Yeah they wishin and wishin and wishin and wis...|yeah they wishin and wishin and wishin and wis...|yeah they wishin and wishin and wishin and wis...|
|1|I found a love for me Oh darling just dive rig...|i found a love for me oh darling just dive rig...|i found a love for me oh darl just dive right ...|
|2|Baby lay on back and relax Kick your pretty fe...|baby lay on back and relax kick your pretty fe...|babi lay on back and relax kick your pretti fe...|
|3|Hey Havana ooh na na (ayy) Half of my heart i...|hey havana ooh na na ayy half of my heart is i...|hey havana ooh na na ayi half of my heart is i...|
|4|(Award to the Artist and to the Producer(s) Re...|award to the artist and to the producer s reco...|award to the artist and to the produc s record...|


In [55]:
master_18_lyr.shape

(1469, 3)

In [60]:
# Create master csv file for EDA and modeling
master_18_lyr.to_csv('./data/MASTER_2018_lyrics')

### 2013 track lyrics

In [57]:
bb_13_lyr = pd.read_csv('./data/2013_billboard_lyrics')

In [58]:
sf_13_lyr = pd.read_csv('./data/2013_songfacts_lyrics')

In [59]:
# Concatenate dataframes
master_13_lyr = pd.concat([bb_13_lyr, sf_13_lyr])

In [61]:
# Run function on 2013 master lyrics dataframe
process_master(master_13_lyr)

In [62]:
master_13_lyr.head()

| |lyrics|tokenized_lyrics|stemmed_lyrics|
|---|---|---|---|
|0|Hey Macklemore can we go thrift shopping What ...|hey macklemore can we go thrift shopping what ...|hey macklemor can we go thrift shop what what ...|
|1|Everybody get up WOO! Hey hey hey Hey hey hey ...|everybody get up woo hey hey hey hey hey hey h...|everybodi get up woo hey hey hey hey hey hey h...|
|2|Whoah oh Whoah oh Whoah oh Whoah Im waking up...|whoah oh whoah oh whoah oh whoah im waking up ...|whoah oh whoah oh whoah oh whoah im wake up to...|
|3|Con los terroristas tas tas tas tas tas ...|con los terroristas tas tas tas tas tas tas ta...|con lo terrorista ta ta ta ta ta ta ta ta ta t...|
|4|Arent you somethin to admire Cause your shine ...|arent you somethin to admire cause your shine ...|arent you somethin to admir caus your shine is...|


In [63]:
master_13_lyr.shape

(2765, 3)

In [64]:
# Create master csv file for EDA and modeling
master_13_lyr.to_csv('./data/MASTER_2013_lyrics')

### 2008 track lyrics

In [65]:
bb_08_lyr = pd.read_csv('./data/2008_billboard_lyrics')

In [66]:
sf_08_lyr = pd.read_csv('./data/2008_songfacts_lyrics')

In [67]:
# Concatenate dataframes
master_08_lyr = pd.concat([bb_08_lyr, sf_08_lyr])

In [68]:
# Run function on 2008 master lyrics dataframe
process_master(master_08_lyr)

In [69]:
master_08_lyr.head()

| |lyrics|tokenized_lyrics|stemmed_lyrics|
|---|---|---|---|
|0|Hmm mmm mmm mmm mmm mmm Let me talk to em let ...|hmm mmm mmm mmm mmm mmm let me talk to em let ...|hmm mmm mmm mmm mmm mmm let me talk to em let ...|
|1|Closed off from love I didnt need the pain Onc...|closed off from love i didnt need the pain onc...|close off from love i didnt need the pain onc ...|
|2|I just want you close Where you can stay forev...|i just want you close where you can stay forev...|i just want you close where you can stay forev...|
|3|Oww! Uh huh No homo Young Mula baby I said hes...|oww uh huh no homo young mula baby i said hes ...|oww uh huh no homo young mula babi i said he s...|
|4|Im holding on your rope got me ten feet off th...|im holding on your rope got me ten feet off th...|im hold on your rope got me ten feet off the g...|


In [70]:
master_08_lyr.shape

(1994, 3)

In [71]:
# Create master csv file for EDA and modeling
master_08_lyr.to_csv('./data/MASTER_2008_lyrics')