# Importing Lyrics Dataset

#### Metrolyrics Analysis

https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics

- 362,237 rows load from the CSV (the dataset is billed as 380,000)
- 266,557 rows have lyrics; 95,680 rows are missing them
- About 53k rows have no usable genre (29,814 labelled "Not Available" plus 23,683 labelled "Other")
- Columns: song, year, artist, genre and lyrics

In [1]:
import pandas as pd
from datetime import datetime

In [2]:
lyrics = pd.read_csv("lyrics.csv")
del lyrics['index']

In [3]:
lyrics.head()

Unnamed: 0,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


##### Looking at all the rows

In [4]:
# View the unique genres
print("Unique Genres", lyrics.genre.unique())
print(lyrics.genre.value_counts())

Unique Genres ['Pop' 'Hip-Hop' 'Not Available' 'Other' 'Rock' 'Metal' 'Country' 'Jazz'
 'Electronic' 'Folk' 'R&B' 'Indie']
Rock             131377
Pop               49444
Hip-Hop           33965
Not Available     29814
Metal             28408
Other             23683
Country           17286
Jazz              17147
Electronic        16205
R&B                5935
Indie              5732
Folk               3241
Name: genre, dtype: int64
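
The counts above include two placeholder labels, "Not Available" and "Other", that carry no usable genre. A minimal sketch, on a toy frame standing in for `lyrics`, of turning those labels into missing values so they show up in `isnull()` counts. The `PLACEHOLDER_GENRES` name and the `genre_clean` column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lyrics frame (values are illustrative only)
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "genre": ["Rock", "Not Available", "Other", "Pop"],
})

# Hypothetical name: labels that carry no genre information
PLACEHOLDER_GENRES = ["Not Available", "Other"]

# Keep real genres, replace placeholders with NaN so pandas counts them as missing
toy["genre_clean"] = toy["genre"].where(~toy["genre"].isin(PLACEHOLDER_GENRES), np.nan)

n_missing_genre = int(toy["genre_clean"].isnull().sum())
print(n_missing_genre)
```

Whether "Other" should count as missing depends on how the genre will be used later, so treat that choice as a judgment call.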


In [5]:
# View the unique artists
print("Unique Artists", lyrics.artist.unique())
print(lyrics.artist.value_counts())

Unique Artists ['beyonce-knowles' 'eazy-e' 'asher-monroe' ... 'ale-mrthller' 'dee-smgn'
 'edens-edge']
dolly-parton                 755
american-idol                700
elton-john                   680
b-b-king                     667
chris-brown                  655
                            ... 
capital-b                      1
art-beatz                      1
danakadan                      1
the-butterbeer-experience      1
danny-malone                   1
Name: artist, Length: 18231, dtype: int64


In [6]:
# View number of missing values in each column
# lyrics.lyrics.isna()
lyrics.isnull().sum(axis = 0)

song          2
year          0
artist        0
genre         0
lyrics    95680
dtype: int64

In [7]:
# Show number of Non Missing Values
lyrics.notnull().sum(axis = 0)

song      362235
year      362237
artist    362237
genre     362237
lyrics    266557
dtype: int64

In [8]:
# Approximate the number of words per song by counting spaces and adding one
# (line breaks are not counted as separators, so this undercounts multi-line lyrics)
lyrics_count = lyrics['lyrics'].str.count(' ') + 1
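
Counting spaces only approximates the word count, because the lyrics also separate words with `\n`. A small sketch on toy strings comparing the space-count approach with splitting on any whitespace; the `toy_lyrics` series is made up for illustration.

```python
import pandas as pd

# Toy lyrics: the second entry separates two of its words with a newline only
toy_lyrics = pd.Series(["oh baby how you doing", "if you search\nfor tenderness", None])

# Approach used above: count spaces and add one
by_spaces = toy_lyrics.str.count(" ") + 1

# Alternative: split on any run of whitespace and count the tokens
by_tokens = toy_lyrics.str.split().str.len()

print(by_spaces.tolist())
print(by_tokens.tolist())
```

The two series agree on the first entry and differ on the second, where the newline hides one word boundary from the space count. Missing lyrics stay `NaN` in both.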

In [11]:
print(lyrics_count.max())
lyrics_count.min()

7106.0


1.0

In [12]:
# Some info about the length of the songs for all Rows
print("Average # of words per song:", lyrics_count.mean())
print("Maximum # of words per song:", lyrics_count.max())
print("Minimum # of words per song:", lyrics_count.min())
print("StanDev # of words per song:", lyrics_count.std())
# note: DataFrame.count() returns the non-null count per column, not a single row total
print("Total Number of Rows:", lyrics.count())

Average # of words per song: 193.35993052142695
Maximum # of words per song: 7106.0
Minimum # of words per song: 1.0
StanDev # of words per song: 138.2510659779955
Total Number of Rows: song      362235
year      362237
artist    362237
genre     362237
lyrics    266557
dtype: int64


In [13]:
lyrics_count.tail(15)

362222      NaN
362223      NaN
362224      NaN
362225    151.0
362226     71.0
362227    313.0
362228    262.0
362229    176.0
362230    203.0
362231    193.0
362232    280.0
362233    175.0
362234    213.0
362235    145.0
362236    287.0
Name: lyrics, dtype: float64

##### Looking at only Non-Null Rows

In [93]:
lyrics_NotNull = lyrics.dropna()
lyrics_NotNull

Unnamed: 0,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."
...,...,...,...,...,...
362232,who-am-i-drinking-tonight,2012,edens-edge,Country,"I gotta say\nBoy, after only just a couple of ..."
362233,liar,2012,edens-edge,Country,I helped you find her diamond ring\nYou made m...
362234,last-supper,2012,edens-edge,Country,Look at the couple in the corner booth\nLooks ...
362235,christ-alone-live-in-studio,2012,edens-edge,Country,When I fly off this mortal earth\nAnd I'm meas...
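
`dropna()` with no arguments drops a row when any column is null, so rows missing only a song title are removed along with rows missing lyrics. A hedged sketch on a toy frame of restricting the drop to the `lyrics` column and taking an explicit copy, which avoids `SettingWithCopyWarning` on later column assignments. The frame and the `with_lyrics` name are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "song": ["ego-remix", None, "honesty"],
    "artist": ["beyonce-knowles", "beyonce-knowles", "beyonce-knowles"],
    "lyrics": ["Oh baby", "some words", None],
})

# Drop rows only when the lyrics are missing; .copy() gives an independent frame
with_lyrics = toy.dropna(subset=["lyrics"]).copy()

# Assigning a column on the copy is safe and leaves `toy` unchanged
with_lyrics["song"] = with_lyrics["song"].str.replace("-", "", regex=False)

print(len(toy.dropna()), len(with_lyrics))
```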


In [16]:
# Genre counts for the non-null rows
lyrics_NotNull.genre.value_counts()

Rock             109235
Pop               40466
Hip-Hop           24850
Not Available     23941
Metal             23759
Country           14387
Jazz               7970
Electronic         7966
Other              5189
R&B                3401
Indie              3149
Folk               2243
Name: genre, dtype: int64

In [17]:
lyrics_count_NotNull = lyrics_NotNull['lyrics'].str.count(' ') + 1

In [18]:
# Some info about the length of the songs for NONNULL rows
print("Average # of words per song:", lyrics_count_NotNull.mean())
print("Maximum # of words per song:", lyrics_count_NotNull.max())
print("Minimum # of words per song:", lyrics_count_NotNull.min())
print("StanDev # of words per song:", lyrics_count_NotNull.std())
# note: DataFrame.count() returns the non-null count per column, not a single row total
print("Total Number of Rows:", lyrics_NotNull.count())

Average # of words per song: 193.3595229520251
Maximum # of words per song: 7106
Minimum # of words per song: 1
StanDev # of words per song: 138.25116516859234
Total Number of Rows: song      266556
year      266556
artist    266556
genre     266556
lyrics    266556
dtype: int64


In [19]:
len(lyrics_NotNull.artist.unique())

14543

# Importing Song Shape Dataset

#### FMA: A Dataset For Music Analysis

https://archive.ics.uci.edu/ml/datasets/FMA%3A+A+Dataset+For+Music+Analysis

In [74]:
# import the features, and skip the unnecessary rows/rename the column
features = pd.read_csv("features.csv", skiprows=[1,2,3])
features.rename(columns={"feature": "track_id"}, inplace=True)

# import the genres
genres = pd.read_csv("raw_genres.csv")

# import the track list, and skip the unnecessary rows/rename the column
tracks = pd.read_csv("tracks.csv", skiprows=[0,2])
tracks.rename(columns={tracks.columns[0]: "track_id"}, inplace=True)
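
Skipping header rows flattens a multi-level header and leaves duplicated names such as `title` and `title.1`. As an alternative sketch, assuming the file has a two-level header followed by an index-name row (which the `skiprows=[0,2]` call suggests), `read_csv` can keep both levels. The CSV text below is a toy stand-in, not the real `tracks.csv`.

```python
import io
import pandas as pd

# Toy CSV mimicking a two-level header followed by an index-name row
# (column names and values are illustrative, not the real FMA file)
raw = (
    ",album,album,track,track\n"
    ",title,type,title,genre_top\n"
    "track_id,,,,\n"
    "2,Album A,Album,Song One,Hip-Hop\n"
    "3,Album A,Album,Song Two,Hip-Hop\n"
)

# header=[0, 1] keeps both header levels as a column MultiIndex
toy_tracks = pd.read_csv(io.StringIO(raw), index_col=0, header=[0, 1])

# Flatten to single-level names such as "album_title" and "track_title"
toy_tracks.columns = ["_".join(col) for col in toy_tracks.columns]

print(list(toy_tracks.columns))
```

With prefixed names it is harder to confuse an album-level field with the track-level field of the same name.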

In [70]:
list(features.columns)

['track_id',
 'chroma_cens',
 'chroma_cens.1',
 'chroma_cens.2',
 'chroma_cens.3',
 'chroma_cens.4',
 'chroma_cens.5',
 'chroma_cens.6',
 'chroma_cens.7',
 'chroma_cens.8',
 'chroma_cens.9',
 'chroma_cens.10',
 'chroma_cens.11',
 'chroma_cens.12',
 'chroma_cens.13',
 'chroma_cens.14',
 'chroma_cens.15',
 'chroma_cens.16',
 'chroma_cens.17',
 'chroma_cens.18',
 'chroma_cens.19',
 'chroma_cens.20',
 'chroma_cens.21',
 'chroma_cens.22',
 'chroma_cens.23',
 'chroma_cens.24',
 'chroma_cens.25',
 'chroma_cens.26',
 'chroma_cens.27',
 'chroma_cens.28',
 'chroma_cens.29',
 'chroma_cens.30',
 'chroma_cens.31',
 'chroma_cens.32',
 'chroma_cens.33',
 'chroma_cens.34',
 'chroma_cens.35',
 'chroma_cens.36',
 'chroma_cens.37',
 'chroma_cens.38',
 'chroma_cens.39',
 'chroma_cens.40',
 'chroma_cens.41',
 'chroma_cens.42',
 'chroma_cens.43',
 'chroma_cens.44',
 'chroma_cens.45',
 'chroma_cens.46',
 'chroma_cens.47',
 'chroma_cens.48',
 'chroma_cens.49',
 'chroma_cens.50',
 'chroma_cens.51',
 ...

In [71]:
list(genres.columns)

['genre_id', 'genre_color', 'genre_handle', 'genre_parent_id', 'genre_title']

In [75]:
list(tracks.columns)

['track_id',
 'comments',
 'date_created',
 'date_released',
 'engineer',
 'favorites',
 'id',
 'information',
 'listens',
 'producer',
 'tags',
 'title',
 'tracks',
 'type',
 'active_year_begin',
 'active_year_end',
 'associated_labels',
 'bio',
 'comments.1',
 'date_created.1',
 'favorites.1',
 'id.1',
 'latitude',
 'location',
 'longitude',
 'members',
 'name',
 'related_projects',
 'tags.1',
 'website',
 'wikipedia_page',
 'split',
 'subset',
 'bit_rate',
 'comments.2',
 'composer',
 'date_created.2',
 'date_recorded',
 'duration',
 'favorites.2',
 'genre_top',
 'genres',
 'genres_all',
 'information.1',
 'interest',
 'language_code',
 'license',
 'listens.1',
 'lyricist',
 'number',
 'publisher',
 'tags.2',
 'title.1']

In [68]:
# check the structure of the genre columns
tracks.loc[:,["genre_top", "genres", "genres_all"]]

Unnamed: 0,genre_top,genres,genres_all
0,Hip-Hop,[21],[21]
1,Hip-Hop,[21],[21]
2,Hip-Hop,[21],[21]
3,Pop,[10],[10]
4,,"[76, 103]","[17, 10, 76, 103]"
...,...,...,...
106569,Rock,[25],"[25, 12]"
106570,Rock,[25],"[25, 12]"
106571,Rock,[25],"[25, 12]"
106572,Rock,[25],"[25, 12]"


In [89]:
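# select only the needed columns from tracks
# note: given the column order listed above, 'title' and 'type' appear to be album-level
# fields and 'name' the artist name; the track-level title appears to be 'title.1'
tracks_2 = tracks[['track_id','title','genre_top','genres','genres_all','name','type','subset']]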
tracks_2

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset
0,2,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,small
1,3,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,medium
2,5,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,small
3,10,Constant Hitmaker,Pop,[10],[10],Kurt Vile,Album,small
4,20,Niris,,"[76, 103]","[17, 10, 76, 103]",Nicky Cook,Album,large
...,...,...,...,...,...,...,...,...
106569,155316,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large
106570,155317,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large
106571,155318,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large
106572,155319,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large


In [111]:
# merge the tracks and features columns
fma = tracks_2.merge(features, on="track_id", how="inner")
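
An inner merge silently drops any `track_id` present on only one side. A minimal sketch, on toy frames, of using the `validate` and `indicator` arguments of `DataFrame.merge` to check that the key is unique and to see which rows fail to match. The toy frames and the `check` name are illustrative.

```python
import pandas as pd

toy_tracks = pd.DataFrame({"track_id": [2, 3, 5], "title": ["t2", "t3", "t5"]})
toy_features = pd.DataFrame({"track_id": [2, 3, 10], "zcr": [0.08, 0.07, 0.05]})

# validate raises MergeError if track_id repeats on either side;
# indicator adds a _merge column saying where each row came from
check = toy_tracks.merge(
    toy_features, on="track_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows that an inner merge would have dropped
unmatched = check[check["_merge"] != "both"]
print(unmatched[["track_id", "_merge"]])
```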

In [114]:
# check the final dataframe
fma

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,2,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,small,7.180653,5.230309,...,0.054125,0.012226,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448
1,3,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,medium,1.888963,0.760539,...,0.063831,0.014212,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330
2,5,AWOL - A Way Of Life,Hip-Hop,[21],[21],AWOL,Album,small,0.527563,-0.077654,...,0.040730,0.012691,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861
3,10,Constant Hitmaker,Pop,[10],[10],Kurt Vile,Album,small,3.702245,-0.291193,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800
4,20,Niris,,"[76, 103]","[17, 10, 76, 103]",Nicky Cook,Album,large,-0.193837,-0.198527,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106569,155316,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.490129,0.463834,...,0.128410,0.022547,0.019816,4.448255,0.172852,0.028773,0.028320,0.003906,0.955388,0.012385
106570,155317,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.461559,-0.229601,...,0.132964,0.023548,0.026527,3.270612,0.196289,0.031116,0.027832,0.002441,1.283060,0.019059
106571,155318,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large,0.552473,-0.110498,...,0.108324,0.017540,0.020471,2.356727,0.212891,0.038450,0.037109,0.003418,0.828569,0.017904
106572,155319,"Live at Monty Hall, 2/17/2017",Rock,[25],"[25, 12]",Spowder,Live Performance,large,-0.176901,0.187208,...,0.088311,0.018328,0.017936,6.188604,0.167480,0.041480,0.038086,0.004883,1.818740,0.020133


In [115]:
fma["type"].value_counts()

Album               87549
Radio Program        6551
Live Performance     5035
Single Tracks         917
Contest                14
Name: type, dtype: int64

# Merge the Two Datasets (Lyrics and FMA)

In [120]:
# restructure the artist/song name in the FMA dataset
# lowercase everything
fma["title"] = fma["title"].str.lower()
fma["name"] = fma["name"].str.lower()

# # remove live performances
# fma = fma[fma["type"] != "Live Performance"]

# clear blank spaces
fma["title"] = fma["title"].str.replace(" ","")
fma["name"] = fma["name"].str.replace(" ","")

# remove -
fma["title"] = fma["title"].str.replace("-","")
fma["name"] = fma["name"].str.replace("-","")
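
The cleanup above has to produce the same key format in both datasets, so it may help to keep it in one function. The sketch below is an assumption, not the notebook's method: the hypothetical `normalize_key` helper also strips punctuation such as commas and slashes, which goes beyond the space and hyphen removal used here.

```python
import re
import pandas as pd

def normalize_key(text):
    """Lowercase and drop everything except letters and digits (hypothetical helper)."""
    if not isinstance(text, str):
        return text  # leave NaN/None untouched
    return re.sub(r"[^a-z0-9]", "", text.lower())

toy = pd.Series(["AWOL - A Way Of Life", "beyonce-knowles", "Live at Monty Hall, 2/17/2017", None])
print(toy.map(normalize_key).tolist())
```

Applying the same function to `title`/`name` in the FMA frame and to `song`/`artist` in the lyrics frame would keep the two sides consistent.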

In [121]:
fma

Unnamed: 0,track_id,title,genre_top,genres,genres_all,name,type,subset,chroma_cens,chroma_cens.1,...,tonnetz.39,tonnetz.40,tonnetz.41,zcr,zcr.1,zcr.2,zcr.3,zcr.4,zcr.5,zcr.6
0,2,awolawayoflife,Hip-Hop,[21],[21],awol,Album,small,7.180653,5.230309,...,0.054125,0.012226,0.012111,5.758890,0.459473,0.085629,0.071289,0.000000,2.089872,0.061448
1,3,awolawayoflife,Hip-Hop,[21],[21],awol,Album,medium,1.888963,0.760539,...,0.063831,0.014212,0.017740,2.824694,0.466309,0.084578,0.063965,0.000000,1.716724,0.069330
2,5,awolawayoflife,Hip-Hop,[21],[21],awol,Album,small,0.527563,-0.077654,...,0.040730,0.012691,0.014759,6.808415,0.375000,0.053114,0.041504,0.000000,2.193303,0.044861
3,10,constanthitmaker,Pop,[10],[10],kurtvile,Album,small,3.702245,-0.291193,...,0.074358,0.017952,0.013921,21.434212,0.452148,0.077515,0.071777,0.000000,3.542325,0.040800
4,20,niris,,"[76, 103]","[17, 10, 76, 103]",nickycook,Album,large,-0.193837,-0.198527,...,0.095003,0.022492,0.021355,16.669037,0.469727,0.047225,0.040039,0.000977,3.189831,0.030993
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106557,155303,scissorspaperstone,Folk,"[17, 103]","[17, 103]",gregatkinson,Album,large,-0.846720,-1.161092,...,0.142653,0.032535,0.027655,106.866531,0.718750,0.030033,0.020508,0.003418,9.441682,0.053503
106558,155304,scissorspaperstone,Folk,"[17, 103]","[17, 103]",gregatkinson,Album,large,-0.420769,-0.099274,...,0.078540,0.019037,0.017350,5.408319,0.326660,0.048426,0.037109,0.000000,1.850671,0.037827
106559,155305,scissorspaperstone,Folk,"[17, 103]","[17, 103]",gregatkinson,Album,large,-0.832872,-0.809870,...,0.140615,0.025470,0.024669,29.294441,0.323730,0.027366,0.019043,0.002441,4.632303,0.028712
106560,155306,scissorspaperstone,Folk,"[17, 103]","[17, 103]",gregatkinson,Album,medium,0.644815,-0.879404,...,0.125615,0.027110,0.028815,17.076239,0.247070,0.031542,0.025879,0.004883,3.179079,0.022023


In [122]:
# restructure the artist/song name in the lyrics dataset
# work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
lyrics_NotNull = lyrics_NotNull.copy()

# clear blank spaces
lyrics_NotNull["song"] = lyrics_NotNull["song"].str.replace(" ","")
lyrics_NotNull["artist"] = lyrics_NotNull["artist"].str.replace(" ","")

# remove -
lyrics_NotNull["song"] = lyrics_NotNull["song"].str.replace("-","")
lyrics_NotNull["artist"] = lyrics_NotNull["artist"].str.replace("-","")

In [123]:
lyrics_NotNull

Unnamed: 0,song,year,artist,genre,lyrics
0,egoremix,2009,beyonceknowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,thentellme,2009,beyonceknowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonceknowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,youaremyrock,2009,beyonceknowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,blackculture,2009,beyonceknowles,Pop,"Party the people, the people the party it's po..."
...,...,...,...,...,...
362232,whoamidrinkingtonight,2012,edensedge,Country,"I gotta say\nBoy, after only just a couple of ..."
362233,liar,2012,edensedge,Country,I helped you find her diamond ring\nYou made m...
362234,lastsupper,2012,edensedge,Country,Look at the couple in the corner booth\nLooks ...
362235,christaloneliveinstudio,2012,edensedge,Country,When I fly off this mortal earth\nAnd I'm meas...


In [129]:
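# get the start time
start = datetime.now()

# import the tool for fuzzy matching
import difflib

# deduplicate the candidate names once so each lookup scans fewer strings
song_names = lyrics_NotNull['song'].dropna().unique().tolist()
artist_names = lyrics_NotNull['artist'].dropna().unique().tolist()

def closest(value, candidates):
    # n and cutoff are arguments of get_close_matches, not of apply();
    # return None when the value is missing or nothing clears the cutoff
    if not isinstance(value, str):
        return None
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None

# create song/artist columns for the fma dataset using the closest matches to the lyrics dataset
fma['song'] = fma['title'].apply(closest, candidates=song_names)
fma['artist'] = fma['name'].apply(closest, candidates=artist_names)

# merge the two datasets and keep the result
merged = lyrics_NotNull.merge(fma, on=["song", "artist"], how="inner")

# get the end time and the overall time to run
end = datetime.now()
print("Overall Run Time:", end - start)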

KeyboardInterrupt: 
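
Every `get_close_matches` call scans the full candidate list, so matching about 106k FMA rows against about 266k lyric rows is very slow, which is consistent with the interrupted run above. A hedged sketch of one way to cut the work: match each unique value once, accept exact matches without a fuzzy search, and map the results back. The `build_match_map` helper and the toy series are assumptions for illustration.

```python
import difflib
import pandas as pd

def build_match_map(queries, candidates, cutoff=0.6):
    """Map each unique query to its closest candidate, or None (hypothetical helper)."""
    candidate_list = sorted(set(candidates))
    candidate_set = set(candidate_list)
    mapping = {}
    for q in set(queries):
        if q in candidate_set:
            mapping[q] = q  # exact hit, no fuzzy search needed
        else:
            hits = difflib.get_close_matches(q, candidate_list, n=1, cutoff=cutoff)
            mapping[q] = hits[0] if hits else None
    return mapping

# Toy stand-ins for fma["title"] and lyrics_NotNull["song"]
fma_titles = pd.Series(["honesty", "egoremixx", "zzzz", "honesty"])
lyric_songs = pd.Series(["egoremix", "honesty", "thentellme"])

match_map = build_match_map(fma_titles.dropna(), lyric_songs.dropna())
matched = fma_titles.map(match_map)
print(matched.tolist())
```

This reduces repeated work but does not change the cost of each fuzzy lookup, so the full datasets may still need a narrower candidate set, for example by matching artists first and then only songs by the matched artist.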

In [None]:
fma