# Importing Lyrics Dataset

#### Metrolyrics Analysis

https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics

- ~266k of Non null rows
- ~380,000 Total Rows (more like 260k w/ 95k missing for lyrics)
- Also missing about 60k of correct genre names
- Columns: Song, Year, Artist, Genre and Lyrics

In [59]:
import pandas as pd
from datetime import datetime

In [2]:
lyrics = pd.read_csv("lyrics.csv")
del lyrics['index']

In [60]:
lyrics.head()

Unnamed: 0,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


##### Looking at all the rows

In [4]:
# View the unique genre's
print("Unique Genre's", lyrics.genre.unique())
print(lyrics.genre.value_counts())

Unique Genre's ['Pop' 'Hip-Hop' 'Not Available' 'Other' 'Rock' 'Metal' 'Country' 'Jazz'
 'Electronic' 'Folk' 'R&B' 'Indie']
Rock             131377
Pop               49444
Hip-Hop           33965
Not Available     29814
Metal             28408
Other             23683
Country           17286
Jazz              17147
Electronic        16205
R&B                5935
Indie              5732
Folk               3241
Name: genre, dtype: int64


In [5]:
# View the unique artist's
print("Unique Artists", lyrics.artist.unique())
print(lyrics.artist.value_counts())

Unique Artists ['beyonce-knowles' 'eazy-e' 'asher-monroe' ... 'ale-mrthller' 'dee-smgn'
 'edens-edge']
dolly-parton        755
american-idol       700
elton-john          680
b-b-king            667
chris-brown         655
                   ... 
dani-harmer           1
enrique-infante       1
brittany-barber       1
dijaay-aadi           1
commando-nicolas      1
Name: artist, Length: 18231, dtype: int64


In [6]:
# View number of missing values in each column
# lyrics.lyrics.isna()
lyrics.isnull().sum(axis = 0)

song          2
year          0
artist        0
genre         0
lyrics    95680
dtype: int64

In [7]:
# Show number of Non Missing Values
lyrics.notnull().sum(axis = 0)

song      362235
year      362237
artist    362237
genre     362237
lyrics    266557
dtype: int64

In [8]:
# Created a lyrics count list for how many words are in a song
lyrics_count = lyrics['lyrics'].str.count(' ') + 1

In [9]:
print(lyrics_count.max())
lyrics_count.min()

7106.0


1.0

In [10]:
# Some info about the length of the songs for all Rows
print("Average # of words per song:", lyrics_count.mean())
print("Maximum # of words per song:", lyrics_count.max())
print("Minimum # of words per song:", lyrics_count.min())
print("StanDev # of words per song:", lyrics_count.std())
print("Total Number of Rows:", lyrics.count())

Average # of words per song: 193.35993052142695
Maximum # of words per song: 7106.0
Minimum # of words per song: 1.0
StanDev # of words per song: 138.2510659779955
Total Number of Rows: song      362235
year      362237
artist    362237
genre     362237
lyrics    266557
dtype: int64


In [11]:
lyrics_count.tail(15)

362222      NaN
362223      NaN
362224      NaN
362225    151.0
362226     71.0
362227    313.0
362228    262.0
362229    176.0
362230    203.0
362231    193.0
362232    280.0
362233    175.0
362234    213.0
362235    145.0
362236    287.0
Name: lyrics, dtype: float64

##### Looking at only NonNull Rows

In [12]:
lyrics_NotNull = lyrics.dropna()
lyrics_NotNull

Unnamed: 0,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."
...,...,...,...,...,...
362232,who-am-i-drinking-tonight,2012,edens-edge,Country,"I gotta say\nBoy, after only just a couple of ..."
362233,liar,2012,edens-edge,Country,I helped you find her diamond ring\nYou made m...
362234,last-supper,2012,edens-edge,Country,Look at the couple in the corner booth\nLooks ...
362235,christ-alone-live-in-studio,2012,edens-edge,Country,When I fly off this mortal earth\nAnd I'm meas...


In [13]:
# Values for Genre's for NonNull rows
lyrics_NotNull.genre.value_counts()

Rock             109235
Pop               40466
Hip-Hop           24850
Not Available     23941
Metal             23759
Country           14387
Jazz               7970
Electronic         7966
Other              5189
R&B                3401
Indie              3149
Folk               2243
Name: genre, dtype: int64

In [14]:
lyrics_count_NotNull = lyrics_NotNull['lyrics'].str.count(' ') + 1

In [15]:
# Some info about the length of the songs for NONNULL rows
print("Average # of words per song:", lyrics_count_NotNull.mean())
print("Maximum # of words per song:", lyrics_count_NotNull.max())
print("Minimum # of words per song:", lyrics_count_NotNull.min())
print("StanDev # of words per song:", lyrics_count_NotNull.std())
print("Total Number of Rows:", lyrics_NotNull.count())

Average # of words per song: 193.3595229520251
Maximum # of words per song: 7106
Minimum # of words per song: 1
StanDev # of words per song: 138.25116516859234
Total Number of Rows: song      266556
year      266556
artist    266556
genre     266556
lyrics    266556
dtype: int64


In [16]:
len(lyrics_NotNull.artist.unique())

14543

# Importing Song Shape Dataset

#### FMA: A Dataset For Music Analysis

https://archive.ics.uci.edu/ml/datasets/FMA%3A+A+Dataset+For+Music+Analysis

In [17]:
# import the features, and skip the unnecessary rows/rename the column
features = pd.read_csv("features.csv", skiprows=[1,2,3])
features.rename(columns={"feature": "track_id"}, inplace=True)

# import the genres
genres = pd.read_csv("raw_genres.csv")

# import the track list, and skip the unnecessary rows/rename the column
tracks = pd.read_csv("tracks.csv", skiprows=[0,2])
tracks.rename(columns={tracks.columns[0]: "track_id"}, inplace=True)

In [18]:
list(features.columns)

['track_id',
 'chroma_cens',
 'chroma_cens.1',
 'chroma_cens.2',
 'chroma_cens.3',
 'chroma_cens.4',
 'chroma_cens.5',
 'chroma_cens.6',
 'chroma_cens.7',
 'chroma_cens.8',
 'chroma_cens.9',
 'chroma_cens.10',
 'chroma_cens.11',
 'chroma_cens.12',
 'chroma_cens.13',
 'chroma_cens.14',
 'chroma_cens.15',
 'chroma_cens.16',
 'chroma_cens.17',
 'chroma_cens.18',
 'chroma_cens.19',
 'chroma_cens.20',
 'chroma_cens.21',
 'chroma_cens.22',
 'chroma_cens.23',
 'chroma_cens.24',
 'chroma_cens.25',
 'chroma_cens.26',
 'chroma_cens.27',
 'chroma_cens.28',
 'chroma_cens.29',
 'chroma_cens.30',
 'chroma_cens.31',
 'chroma_cens.32',
 'chroma_cens.33',
 'chroma_cens.34',
 'chroma_cens.35',
 'chroma_cens.36',
 'chroma_cens.37',
 'chroma_cens.38',
 'chroma_cens.39',
 'chroma_cens.40',
 'chroma_cens.41',
 'chroma_cens.42',
 'chroma_cens.43',
 'chroma_cens.44',
 'chroma_cens.45',
 'chroma_cens.46',
 'chroma_cens.47',
 'chroma_cens.48',
 'chroma_cens.49',
 'chroma_cens.50',
 'chroma_cens.51',
 'chroma_ce

In [19]:
list(genres.columns)

['genre_id', 'genre_color', 'genre_handle', 'genre_parent_id', 'genre_title']

In [20]:
list(tracks.columns)

['track_id',
 'comments',
 'date_created',
 'date_released',
 'engineer',
 'favorites',
 'id',
 'information',
 'listens',
 'producer',
 'tags',
 'title',
 'tracks',
 'type',
 'active_year_begin',
 'active_year_end',
 'associated_labels',
 'bio',
 'comments.1',
 'date_created.1',
 'favorites.1',
 'id.1',
 'latitude',
 'location',
 'longitude',
 'members',
 'name',
 'related_projects',
 'tags.1',
 'website',
 'wikipedia_page',
 'split',
 'subset',
 'bit_rate',
 'comments.2',
 'composer',
 'date_created.2',
 'date_recorded',
 'duration',
 'favorites.2',
 'genre_top',
 'genres',
 'genres_all',
 'information.1',
 'interest',
 'language_code',
 'license',
 'listens.1',
 'lyricist',
 'number',
 'publisher',
 'tags.2',
 'title.1']

In [21]:
# check the structure of the genre columns
tracks.loc[:,["genre_top'", "genres", "genres_all"]]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,genre_top',genres,genres_all
0,,[21],[21]
1,,[21],[21]
2,,[21],[21]
3,,[10],[10]
4,,"[76, 103]","[17, 10, 76, 103]"
...,...,...,...
106569,,[25],"[25, 12]"
106570,,[25],"[25, 12]"
106571,,[25],"[25, 12]"
106572,,[25],"[25, 12]"


In [44]:
# select on the needed columns from the track
tracks_2 = tracks[['track_id','title.1','genre_top','genres','genres_all','name','type','subset']]
tracks_2.rename(columns={"title.1":"title"}, inplace=True)
tracks_2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(**kwargs)


Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset
0,2,Food,Hip-Hop,[21],[21],AWOL,Album,small
1,3,Electric Ave,Hip-Hop,[21],[21],AWOL,Album,medium
2,5,This World,Hip-Hop,[21],[21],AWOL,Album,small
3,10,Freeway,Pop,[10],[10],Kurt Vile,Album,small
4,20,Spiritual Level,,"[76, 103]","[17, 10, 76, 103]",Nicky Cook,Album,large
...,...,...,...,...,...,...,...,...
106569,155316,The Auger,Rock,[25],"[25, 12]",Spowder,Live Performance,large
106570,155317,Let's Skin Ruby,Rock,[25],"[25, 12]",Spowder,Live Performance,large
106571,155318,My House Smells Like Kim Deal/Pulp,Rock,[25],"[25, 12]",Spowder,Live Performance,large
106572,155319,The Man With Two Mouths,Rock,[25],"[25, 12]",Spowder,Live Performance,large


In [45]:
# merge the tracks and features columns
fma = tracks_2.merge(features, on="track_id", how="inner")

In [46]:
# check the final dataframe
fma

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,2,Food,Hip-Hop,[21],[21],AWOL,Album,small,7.180653,5.230309,...,0.054125,0.012226,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448
1,3,Electric Ave,Hip-Hop,[21],[21],AWOL,Album,medium,1.888963,0.760539,...,0.063831,0.014212,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330
2,5,This World,Hip-Hop,[21],[21],AWOL,Album,small,0.527563,-0.077654,...,0.040730,0.012691,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861
3,10,Freeway,Pop,[10],[10],Kurt Vile,Album,small,3.702245,-0.291193,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800
4,20,Spiritual Level,,"[76, 103]","[17, 10, 76, 103]",Nicky Cook,Album,large,-0.193837,-0.198527,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106569,155316,The Auger,Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.490129,0.463834,...,0.128410,0.022547,0.019816,4.448255,0.172852,0.028773,0.028320,0.003906,0.955388,0.012385
106570,155317,Let's Skin Ruby,Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.461559,-0.229601,...,0.132964,0.023548,0.026527,3.270612,0.196289,0.031116,0.027832,0.002441,1.283060,0.019059
106571,155318,My House Smells Like Kim Deal/Pulp,Rock,[25],"[25, 12]",Spowder,Live Performance,large,0.552473,-0.110498,...,0.108324,0.017540,0.020471,2.356727,0.212891,0.038450,0.037109,0.003418,0.828569,0.017904
106572,155319,The Man With Two Mouths,Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.176901,0.187208,...,0.088311,0.018328,0.017936,6.188604,0.167480,0.041480,0.038086,0.004883,1.818740,0.020133


In [47]:
fma["type"].value_counts()

Album               87549
Radio Program        6551
Live Performance     5035
Single Tracks         917
Contest                14
Name: type, dtype: int64

# Merge the Two Datasets (Lyrics and FMA)

In [48]:
# restructure the artist/song name in the FMA dataset
# lowercase everything
fma["title"] = fma["title"].str.lower()
fma["name"] = fma["name"].str.lower()

# # remove live performances
# fma = fma[fma["type"] != "Live Performance"]

# clear blank spaces
fma["title"] = fma["title"].str.replace(" ","")
fma["name"] = fma["name"].str.replace(" ","")

# remove -
fma["title"] = fma["title"].str.replace("-","")
fma["name"] = fma["name"].str.replace("-","")

In [49]:
fma

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,2,food,Hip-Hop,[21],[21],awol,Album,small,7.180653,5.230309,...,0.054125,0.012226,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448
1,3,electricave,Hip-Hop,[21],[21],awol,Album,medium,1.888963,0.760539,...,0.063831,0.014212,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330
2,5,thisworld,Hip-Hop,[21],[21],awol,Album,small,0.527563,-0.077654,...,0.040730,0.012691,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861
3,10,freeway,Pop,[10],[10],kurtvile,Album,small,3.702245,-0.291193,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800
4,20,spirituallevel,,"[76, 103]","[17, 10, 76, 103]",nickycook,Album,large,-0.193837,-0.198527,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106569,155316,theauger,Rock,[25],"[25, 12]",spowder,Live Performance,large,-0.490129,0.463834,...,0.128410,0.022547,0.019816,4.448255,0.172852,0.028773,0.028320,0.003906,0.955388,0.012385
106570,155317,let'sskinruby,Rock,[25],"[25, 12]",spowder,Live Performance,large,-0.461559,-0.229601,...,0.132964,0.023548,0.026527,3.270612,0.196289,0.031116,0.027832,0.002441,1.283060,0.019059
106571,155318,myhousesmellslikekimdeal/pulp,Rock,[25],"[25, 12]",spowder,Live Performance,large,0.552473,-0.110498,...,0.108324,0.017540,0.020471,2.356727,0.212891,0.038450,0.037109,0.003418,0.828569,0.017904
106572,155319,themanwithtwomouths,Rock,[25],"[25, 12]",spowder,Live Performance,large,-0.176901,0.187208,...,0.088311,0.018328,0.017936,6.188604,0.167480,0.041480,0.038086,0.004883,1.818740,0.020133


In [86]:
# restructure the artist/song name in the lyrics dataset
# clear blank spaces
lyrics_NotNull["song"] = lyrics_NotNull["song"].str.replace(" ","")
lyrics_NotNull["artist"] = lyrics_NotNull["artist"].str.replace(" ","")

# remove -
lyrics_NotNull["song"] = lyrics_NotNull["song"].str.replace("-","")
lyrics_NotNull["artist"] = lyrics_NotNull["artist"].str.replace("-","")

In [87]:
lyrics_NotNull

Unnamed: 0,song,year,artist,genre,lyrics
0,egoremix,2009,beyonceknowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,thentellme,2009,beyonceknowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonceknowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,youaremyrock,2009,beyonceknowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,blackculture,2009,beyonceknowles,Pop,"Party the people, the people the party it's po..."
...,...,...,...,...,...
362232,whoamidrinkingtonight,2012,edensedge,Country,"I gotta say\nBoy, after only just a couple of ..."
362233,liar,2012,edensedge,Country,I helped you find her diamond ring\nYou made m...
362234,lastsupper,2012,edensedge,Country,Look at the couple in the corner booth\nLooks ...
362235,christaloneliveinstudio,2012,edensedge,Country,When I fly off this mortal earth\nAnd I'm meas...


In [52]:
# merge the two datasets
lyrics_NotNull.merge(fma, left_on=["song", "artist"], right_on=["title", "name"], how="inner")

Unnamed: 0,song,year,artist,genre,lyrics,track_id,title,genre_top,genres,genres_all,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,skeletonsandspirits,2009,allisoncrowe,Jazz,Skeletons and Spirits\nWords & Music by Alliso...,13439,skeletonsandspirits,,"[10, 12]","[10, 12]",...,0.127967,0.024352,0.020429,48.341877,0.411133,0.033720,0.028809,0.001953,4.894382,0.024780
1,baitingthepublic,2006,fuckedup,Rock,"I want to smash your house, I want to scratch ...",11086,baitingthepublic,Rock,[109],"[25, 12, 109]",...,0.066577,0.012715,0.013905,1.893210,0.153320,0.048856,0.046875,0.000000,0.835074,0.015209
2,sexads,2009,coldcave,Pop,And sometimes the people are quite interesting...,5966,sexads,,"[15, 70, 71]","[70, 71, 12, 15, 25]",...,0.110774,0.028120,0.013421,10.446946,0.414062,0.043889,0.032227,0.009277,3.170191,0.037506
3,lovecomesclose,2009,coldcave,Pop,Everything change & we'll all change again\nSu...,5965,lovecomesclose,,"[15, 70, 71]","[70, 71, 12, 15, 25]",...,0.058607,0.014036,0.011297,2.202057,0.227051,0.066790,0.067383,0.001953,-0.187499,0.017968
4,qualmsofconscience,2010,diabloswingorchestra,Metal,[Instrumental],19256,qualmsofconscience,,"[2, 31, 98]","[2, 12, 98, 31]",...,0.177656,0.033107,0.038389,-0.392362,0.051270,0.021078,0.020996,0.004883,0.247039,0.007728
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,mysweetiewentaway,2015,bessiesmith,Jazz,I've got a lovesick tale to tell to you\nThoug...,122883,mysweetiewentaway,Old-Time / Historic,[8],[8],...,0.122691,0.029637,0.030487,1.305738,0.120605,0.039042,0.035645,0.011230,1.121782,0.014704
110,microcastle,2008,deerhunter,Rock,"All I've seen, all I've seen\nTry to describe ...",19466,microcastle,Rock,[58],"[58, 12]",...,0.151122,0.032622,0.029537,41.735195,0.339844,0.034911,0.031250,0.002930,4.768773,0.024125
111,agoraphobia,2008,deerhunter,Rock,Come for me\nYou come for me\nCome for me\nCom...,19470,agoraphobia,Rock,[58],"[58, 12]",...,0.068370,0.014176,0.018141,1.697590,0.171387,0.045997,0.044434,0.003418,0.852582,0.021800
112,neverstops,2008,deerhunter,Rock,I had dreams\nThat frightened me awake\nI happ...,19462,neverstops,Rock,[58],"[58, 12]",...,0.075649,0.018126,0.018396,5.569551,0.258301,0.043658,0.041992,0.002930,1.200580,0.022232


Attempt with difflib

In [69]:
fma["song"] = ""
fma["artist"] = ""

In [66]:
song_names_list = list(lyrics_NotNull['song'])
len(song_names_list)

266556

In [71]:
# get the start time
start = datetime.now()

# import the tool for fuzzy matching
import difflib 

try:
    # create song/artist columns for the fma dataset using the closest matches to the lyrics dataset
    fma['song'] = fma['title'].apply(lambda x: difflib.get_close_matches(x, song_names_list, cutoff=0.6)[0])
    fma['artist'] = fma['name'].apply(lambda x: difflib.get_close_matches(x, song_names_list, cutoff=0.6)[0])
except:
    fma['song'] = ""
    fma['artist'] = ""

# merge the two datasets
lyrics_NotNull.merge(fma, on=["song", "artist"], how="inner")

# get the end time and the overall time to run
end = datetime.now()
print("Overall Run Time for:", end - start)

Overall Run Time for: 0:00:16.121935


    106574
Name: song, dtype: int64

Attempt with fuzzywuzzy #1

In [62]:
fma_test = fma.title[0:100]
fma_test

0                      food
1               electricave
2                 thisworld
3                   freeway
4            spirituallevel
              ...          
95            nobodylovesme
96    it'sbecomingastranger
97                  newlife
98       mastersofenjoyment
99            wewanttobeold
Name: title, Length: 100, dtype: object

In [63]:
from fuzzywuzzy import fuzz

def match_name(name, list_names):
    
    # define the min score
    min_score=75
    
    # -1 score for no matches
    max_score = -1
    
    # empty name for no matches
    max_name = ""
    
    # Iternating over all names in the list
    for name2 in list_names:
        
        #Finding fuzzy match score
        score = fuzz.ratio(name, name2)
        
        # Checking if we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)

####################################
# run the code on the FMA dataframe
####################################

# get the start time
start = datetime.now()

# List for dicts for easy dataframe creation
dict_list = []

# list over the column of song titles
for song in fma_test:
    
    # Use our method to find best match, we can set a threshold here
    match = match_name(song, list(lyrics_NotNull.song))
    
    # New dict for storing data
    dict_ = {}
    dict_.update({"song_name" : song})
    dict_.update({"match_name" : match[0]})
    dict_.update({"score" : match[1]})
    dict_list.append(dict_)
    
merge_table = pd.DataFrame(dict_list)

# get the end time and the overall time to run
end = datetime.now()
print("Overall Run Time for:", end - start)

# Display results
merge_table

KeyboardInterrupt: 

Attempt with fuzzywuzzy #2

In [88]:
song_names_list = list(lyrics_NotNull['song'])
song_names_list

['egoremix',
 'thentellme',
 'honesty',
 'youaremyrock',
 'blackculture',
 'allicoulddowascry',
 'onceinalifetime',
 'waiting',
 'slowlove',
 'whydontyouloveme',
 'savethehero',
 'telephone',
 'icecreamtruck',
 'nobrokenheartedgirl',
 'control',
 'imalonenow',
 'poison',
 'worldwidewomen',
 'beautifulliar',
 'beautifulliarspanish',
 'beautifulliarspanglishversion',
 'bienvenue',
 'beutifulliarremix',
 'hestilllovesmefchoirfromfightingtemptations',
 'belloembustero',
 'irreplable',
 'thefirstday',
 'myman',
 'whatsitgonnabe',
 'if',
 'agirlwithnoname',
 'belloembusterobeautifulliarspanishversion',
 'voissurtonchemin',
 'flawsall',
 'icanttakenomore',
 'bonnieandclydeftjayz',
 'keepgivinyourlovetome',
 'listen',
 'welcometohollywood',
 'myfirsttime',
 'stillinlove',
 'daddyhiddentrack',
 'angel',
 'diamonds',
 'everthingido',
 'worldwidewomanwww',
 'whatsgoodwithyou',
 'amorgitano',
 'oyelistenspanishversion',
 'irreemplazablenortenaremix',
 'thejoyofpepsicarmen',
 'irreemplazableirrepla

In [89]:
fma_test = fma.iloc[0:1000,:]
fma_test

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6,song,artist
0,2,food,Hip-Hop,[21],[21],awol,Album,small,7.180653,5.230309,...,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448,food,
1,3,electricave,Hip-Hop,[21],[21],awol,Album,medium,1.888963,0.760539,...,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330,electric,
2,5,thisworld,Hip-Hop,[21],[21],awol,Album,small,0.527563,-0.077654,...,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861,the-sword,
3,10,freeway,Pop,[10],[10],kurtvile,Album,small,3.702245,-0.291193,...,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800,redway,
4,20,spirituallevel,,"[76, 103]","[17, 10, 76, 103]",nickycook,Album,large,-0.193837,-0.198527,...,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993,spiritual,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1270,cocksuckerbastardmotherfucker,Rock,[25],"[25, 12]",necropolis,Album,small,1.043248,0.251903,...,0.020245,13.859073,0.315918,0.080323,0.082031,0.000000,1.102002,0.023122,,
996,1271,giveup,Rock,[25],"[25, 12]",necropolis,Album,large,1.014890,2.480498,...,0.011647,-0.480145,0.212891,0.105331,0.104980,0.018555,0.117001,0.031885,,
997,1272,nochance,Rock,[25],"[25, 12]",necropolis,Album,large,1.148416,0.190993,...,0.012976,7.018421,0.409180,0.109652,0.099609,0.006836,2.261055,0.051228,,
998,1273,"notongue,notalking",Rock,[25],"[25, 12]",necropolis,Album,large,1.553778,2.985933,...,0.012072,0.920666,0.340332,0.124915,0.125977,0.014648,0.178685,0.033859,,


In [90]:
from fuzzywuzzy import fuzz

def match_name_2(name):
    
    # define the min score
    min_score=75
    
    # -1 score for no matches
    max_score = -1
    
    # empty name for no matches
    max_name = ""
    
    # Iternating over all names in the list
    for name2 in song_names_list:
        
        #Finding fuzzy match score
        score = fuzz.ratio(name, name2)
        
        # Checking if we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
    return (max_name)

####################################
# run the code on the FMA dataframe
####################################

# get the start time
start = datetime.now()

fma_test['song'] = fma_test['title'].apply(lambda x: match_name_2(x))

# get the end time and the overall time to run
end = datetime.now()
print("Overall Run Time for:", end - start)

Overall Run Time for: 2:08:05.922641


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [91]:
fma_test

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6,song,artist
0,2,food,Hip-Hop,[21],[21],awol,Album,small,7.180653,5.230309,...,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448,aflood,
1,3,electricave,Hip-Hop,[21],[21],awol,Album,medium,1.888963,0.760539,...,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330,electricblue,
2,5,thisworld,Hip-Hop,[21],[21],awol,Album,small,0.527563,-0.077654,...,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861,thesword,
3,10,freeway,Pop,[10],[10],kurtvile,Album,small,3.702245,-0.291193,...,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800,redway,
4,20,spirituallevel,,"[76, 103]","[17, 10, 76, 103]",nickycook,Album,large,-0.193837,-0.198527,...,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993,spiritual,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1270,cocksuckerbastardmotherfucker,Rock,[25],"[25, 12]",necropolis,Album,small,1.043248,0.251903,...,0.020245,13.859073,0.315918,0.080323,0.082031,0.000000,1.102002,0.023122,,
996,1271,giveup,Rock,[25],"[25, 12]",necropolis,Album,large,1.014890,2.480498,...,0.011647,-0.480145,0.212891,0.105331,0.104980,0.018555,0.117001,0.031885,giveup,
997,1272,nochance,Rock,[25],"[25, 12]",necropolis,Album,large,1.148416,0.190993,...,0.012976,7.018421,0.409180,0.109652,0.099609,0.006836,2.261055,0.051228,onemorechance,
998,1273,"notongue,notalking",Rock,[25],"[25, 12]",necropolis,Album,large,1.553778,2.985933,...,0.012072,0.920666,0.340332,0.124915,0.125977,0.014648,0.178685,0.033859,,
