In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/genius-song-lyrics-with-language-information/song_lyrics.csv


In [2]:
lyric_database = pd.read_csv("/kaggle/input/genius-song-lyrics-with-language-information/song_lyrics.csv")

Dataset source: https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

This dataset originates from the website 'Genius', where users post song lyrics along with community-submitted explanations of the meanings behind them. The dataset has song titles, the full string of lyrics, and a few other columns. The column of note is the `tag` field, which stores the genre each song belongs to (e.g. rap, pop, rock).

For my project, I would like to train an NLP AI to classify the lyrics of a song by genre. There is not necessarily any innate connection between lyrics and genre, which is why I especially want to see how the AI tackles this problem. Perhaps it will be able to find some connection that I am unable to see.

This dataset has a few problems that need to be addressed before using it for this project. Firstly, it contains lyrics in many different languages. For the purpose of this project, I'm going to limit the scope to English songs. There may be a deeper correlation to be found by using all languages, but for readability I chose to limit it to English. This is easy to do by keeping only the rows whose `language` column is 'en'.

Additionally, the lyrics of a song on Genius are split into verses, choruses, and other sections, each marked by an annotation in brackets that is kept separate from the lyrics themselves. For example, a song can have the line `[Verse 1: Drake]` immediately before a verse sung by Drake. For the purposes of this project, those lines will remain in the lyrics. They aren't technically lyrics, but the AI may notice patterns in artists and verse/chorus/introduction layouts that help it classify the genre of a song.
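To get a feel for what these bracketed annotations look like as text, here is a minimal sketch that pulls them out of a lyrics string with a regular expression. The helper name `section_headers` and the sample string are hypothetical and only for illustration; the project itself leaves the annotations in place.

```python
import re

# Matches one bracketed annotation such as "[Verse 1: Drake]" or "[Chorus]".
HEADER_PATTERN = re.compile(r"\[([^\[\]\n]+)\]")

def section_headers(lyrics):
    """Return the text inside each bracketed annotation, in order of appearance."""
    return HEADER_PATTERN.findall(lyrics)

# Made-up sample in the Genius layout (not taken from the dataset).
sample = "[Intro]\nla la la\n\n[Verse 1: Drake]\nfirst line\n\n[Chorus]\nhook line"
print(section_headers(sample))
```

Running something like this over a handful of songs is a quick way to see which annotation styles are common before deciding how much signal they might carry.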

Finally, Genius also allows users to upload text that isn't a song, such as scripts or poems. These are all tagged as 'misc', but some songs are tagged as 'misc' too. For the sake of simplicity, I will be culling all lyrics tagged as 'misc' from the database, even if they are songs.

In [3]:
# Keep English songs that have a usable genre tag and non-empty lyrics
db = (
    lyric_database
    .loc[lyric_database["language"] == "en"]
    .loc[lyric_database["tag"] != "misc"]
    .loc[lyric_database["tag"] != ""]
    .loc[lyric_database["tag"].notna()]     # notna() drops missing values; `!= None` does not
    .loc[lyric_database["lyrics"].notna()]
    .loc[lyric_database["lyrics"] != ""]
)
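The filter can be sanity-checked on a small hand-made frame before running it on the full dataset. The sketch below uses hypothetical rows (`toy`) rather than real data. It also shows why missing values are better handled with `notna()`: pandas stores missing text as `NaN`, and as far as I know an element-wise `!= None` comparison comes out true for every row, so it drops nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case the filter has to handle.
toy = pd.DataFrame({
    "language": ["en", "en", "es", "en", "en"],
    "tag": ["rap", "misc", "pop", np.nan, "rock"],
    "lyrics": ["some words", "a script", "letra", "more words", np.nan],
})

# Comparing against None element-wise does not single out the missing row.
print((toy["tag"] != None).all())

# Same conditions as the notebook filter, combined into one boolean mask.
keep = (
    (toy["language"] == "en")
    & (toy["tag"] != "misc")
    & (toy["tag"] != "")
    & toy["tag"].notna()
    & toy["lyrics"].notna()
    & (toy["lyrics"] != "")
)
filtered = toy.loc[keep]
print(filtered)
```

Only the first row survives: the others are dropped for being 'misc', non-English, missing a tag, or missing lyrics.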

This dataset has several columns that could be used in all sorts of ways, but for the sake of this project only the `lyrics`, `tag`, `title`, `artist` and `id` columns are of interest to us.

`id` is how we iterate through each song, and `title` is how we display the name of the song found by the id. `artist` lets us tell apart songs that share a title.

`lyrics` is our text input and `tag` is our classification output. We need these to train the AI.

The `language` column is extraneous now that we have filtered out any songs that are not in English.
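As a rough illustration of how `lyrics` and `tag` serve as input and output, the sketch below separates a small made-up frame into a text column and integer class labels using `pd.factorize`. The names `toy`, `texts`, `labels`, and `genres` are hypothetical, and the eventual model may well encode its labels differently.

```python
import pandas as pd

# Made-up rows with the same two columns the classifier needs.
toy = pd.DataFrame({
    "lyrics": ["[Verse 1]\nwords", "[Chorus]\nmore words", "[Intro]\nla la"],
    "tag": ["rap", "pop", "rap"],
})

texts = toy["lyrics"]                       # model input: raw lyric strings
labels, genres = pd.factorize(toy["tag"])   # integer class ids, and the genre each id stands for

print(labels.tolist())
print(list(genres))
```

Here `genres[labels[i]]` recovers the original tag for row `i`, which is handy when turning predictions back into genre names.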

In [4]:
columns_of_interest = ['id', 'title', 'artist', 'lyrics', 'tag']
db = db[columns_of_interest]

In [5]:
# Find a song by its ID and print each field on its own line
for item in db.loc[db["id"] == 31601].values:
    for x in item:
        print(f"{x}\n")

31601

Take This Ride

Miilkbone

[Intro 2X: Miilkbone]
Come and take this rii-ide, baby
And let love get lii-ive, baby
I'll even let you dri-ive, baby
If I can park insi-ide

[Miilkbone]
I pulled up, picked her up like 6 o'clock
She came out the house with a body like I'm Sir Mix-A-Lot
Got the fly car, fly girl, my masterpiece
Dressed in caramel, skin color tone match the seats
Only difference she's got a fast release, I tell her
My car's like my love and you can help me keep it greased
Hit the park on route 69, the hard way
We might crash, c'mon now girl I got triple weight
You know me better than that, my tires never been flat
Turned on the radio, Mariah made her lay the seat back
Let me teach you how to drive my stick, slide my shit
Like ba-ba-bump in traffic, I wanted her to match it

[Chorus: Miilkbone] + (female)
Come and take this rii-ide, baby
And let love get lii-ive, baby
I'll even let you dri-ive, baby
If I can park insi-ide
(Come and take this ride, baby)
(Make me wet insi

In [6]:
# Find a song by its title and artist (both conditions combined into one mask)
match = (db["title"] == "FAMILY VAN") & (db["artist"] == "cleopatrick")
for item in db.loc[match].values:
    for x in item:
        print(f"{x}\n")

5402526

FAMILY VAN

cleopatrick

[Verse 1]
Caught you biting my shit when
I was mixing macchiatos for these old men
Wondering what it's worth, wondering who I am
You  feeling gutty with your pad and pen
I used to call work early
Up late tryna isolate what hurt me
I was hoping that the poetry would give me peace
Do you heard my heat, you observe this beat? Yeah
Look you can't just diss and come tell man sorry Yeah
Can't hear me talk and go tell my story, yeah
That's word from Aubrey, dog

[Chorus]
I know you only fucked with me 'cause I'm alone
'Cause I'm alone, yeah
But in the end you'll always know you're hollow, hollow
Yeah, yeah

[Verse 2]
All you rock dummies double take when I talk
Play it safe, got no heart
Fumble takes, got no sauce
Double back, double cross
Never break, take the loss
Can't keep pace, label boss
Label shelf, label toss
I just
[Chorus]
I know you only fucked with me 'cause I'm alone
'Cause I'm alone, yeah
But in the end you'll always know you're hollow, hollow
A

In [7]:
db.sort_values(by="id")
# The IDs go all the way up to 7.8 million despite only having 5.1 million songs in the database, so the IDs are not contiguous

Unnamed: 0,id,title,artist,lyrics,tag
0,1,Killa Cam,Cam'ron,"[Chorus: Opera Steve & Cam'ron]\nKilla Cam, Ki...",rap
1,3,Can I Live,JAY-Z,"[Produced by Irv Gotti]\n\n[Intro]\nYeah, hah,...",rap
2,4,Forgive Me Father,Fabolous,Maybe cause I'm eatin\nAnd these bastards fien...,rap
3,5,Down and Out,Cam'ron,[Produced by Kanye West and Brian Miller]\n\n[...,rap
4,6,Fly In,Lil Wayne,"[Intro]\nSo they ask me\n""Young boy\nWhat you ...",rap
...,...,...,...,...,...
5134847,7882838,Everything Is Alright Now,Chuck Bernard,"Everything is alright now\nOh yes, baby\nEvery...",pop
5134849,7882840,White Lies,ElementD,[Verse 1]\nHalf truth and half you\nDidn't we ...,pop
5134851,7882842,Ocean,Effemar,[Verse 1]\nDance for me now\nKeeping yourself ...,pop
5134853,7882845,Raise Our Hands,"Culture Code, Pag & Mylo",[Verse 1]\nHere our purpose feels alive\nWe ar...,pop
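The gap between the largest ID and the number of rows can be checked directly. This is a minimal sketch on a hypothetical stand-in frame (`toy`) with the same `id` column; on the real data the same expressions would be applied to `db`. It assumes IDs are unique and start at 1.

```python
import pandas as pd

# Stand-in for db: ids are sorted but not contiguous, as in the output above.
toy = pd.DataFrame({"id": [1, 3, 4, 5, 6, 10]})

max_id = int(toy["id"].max())
row_count = len(toy)
# Number of ids in 1..max_id that have no row (assumes unique ids starting at 1).
missing_ids = max_id - row_count

print(max_id, row_count, missing_ids)
```

Because of these gaps, looping over `range(1, max_id + 1)` would hit IDs with no song, so it is safer to iterate over the values in the `id` column itself.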
