In this notebook, we will prepare the lyrics data downloaded from [Kaggle](https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres?select=lyrics-data.csv).

_Credit: for the preparation of the lyrics data, we adapted some of the code provided in this Towards Data Science [post](https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272)._

# Main Steps
1. Load in the datasets
2. Merge the two datasets in order to combine the genre and lyrics
3. Keep only the data that meets the following criteria:
    - Lyrics written in English (``Idiom == 'ENGLISH'``)
    - Popular enough (``Popularity > 5``)
4. Select the desired genre of the songs (``Genre == 'Pop'``)
5. Drop the columns we no longer need, lowercase the column names, and remove duplicate rows

In [4]:
import pandas as pd
from tqdm import tqdm, trange

# load the two raw datasets
lyrics = pd.read_csv('./raw_data/lyrics-data.csv')
artists = pd.read_csv('./raw_data/artists-data.csv')

# merge the two datasets
lyrics_data = lyrics.merge(artists[['Artist','Popularity', 'Genre', 'Link']],
                  left_on='ALink', right_on='Link', how='inner')

# keep only English-language pop songs from artists with popularity above 5
crit_1 = lyrics_data['Idiom'] == 'ENGLISH'
crit_2 = lyrics_data['Genre'].isin(['Pop'])
crit_3 = lyrics_data['Popularity'] > 5

lyrics_data = lyrics_data[crit_1 & crit_2 & crit_3]
                          
# drop columns we don't need
lyrics_data = lyrics_data.drop(columns=['ALink','SLink','Idiom','Link'])

# reformat the columns names
lyrics_data.rename(columns={'SName':'song_name'}, inplace=True)
lyrics_data.columns = lyrics_data.columns.str.lower()

# drop duplicates 
lyrics_data = lyrics_data.drop_duplicates().reset_index(drop=True)
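
The merge-and-filter logic above can be checked on a tiny in-memory example before running it on the full files. The sketch below is only illustrative: the function name `filter_lyrics`, its parameters, and the toy rows are assumptions made for this example and are not part of the original notebook or the Kaggle data.

```python
import pandas as pd

def filter_lyrics(lyrics, artists, genre='Pop', min_popularity=5):
    """Hypothetical helper mirroring the cell above (illustrative only)."""
    # attach artist-level metadata to each song via the artist link
    merged = lyrics.merge(artists[['Artist', 'Popularity', 'Genre', 'Link']],
                          left_on='ALink', right_on='Link', how='inner')
    # keep English lyrics of the requested genre above the popularity threshold
    mask = ((merged['Idiom'] == 'ENGLISH')
            & (merged['Genre'] == genre)
            & (merged['Popularity'] > min_popularity))
    out = merged[mask].drop(columns=['ALink', 'SLink', 'Idiom', 'Link'])
    # same column clean-up as in the notebook cell
    out = out.rename(columns={'SName': 'song_name'})
    out.columns = out.columns.str.lower()
    return out.drop_duplicates().reset_index(drop=True)

# made-up toy rows, not taken from the Kaggle files
toy_lyrics = pd.DataFrame({
    'ALink': ['/a/', '/a/', '/b/'],
    'SName': ['Song 1', 'Song 2', 'Song 3'],
    'SLink': ['/a/1', '/a/2', '/b/3'],
    'Lyric': ['la la', 'le le', 'na na'],
    'Idiom': ['ENGLISH', 'PORTUGUESE', 'ENGLISH'],
})
toy_artists = pd.DataFrame({
    'Artist': ['A', 'B'],
    'Popularity': [7.0, 1.0],
    'Genre': ['Pop', 'Pop'],
    'Link': ['/a/', '/b/'],
})

toy_result = filter_lyrics(toy_lyrics, toy_artists)
```

In this toy case, `Song 2` is dropped because of its language label and `Song 3` because its artist falls below the popularity threshold, so only `Song 1` should remain.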

In [5]:
lyrics_data

Unnamed: 0,song_name,lyric,artist,popularity,genre
0,Careless Whisper,I feel so unsure. As I take your hand and lead...,George Michael,5.1,Pop
1,Freedom '90,I won't let you down. I will not give you up. ...,George Michael,5.1,Pop
2,One More Try,I've had enough of danger. And people on the s...,George Michael,5.1,Pop
3,Father Figure,"That's all I wanted. Something special, someth...",George Michael,5.1,Pop
4,Heal The Pain,Let me tell you a secret. Put it in your heart...,George Michael,5.1,Pop
...,...,...,...,...,...
10234,Life Is A Party,"Ah, life is a party. it´s a ride in your jeep....",Xuxa,14.1,Pop
10235,Our Song Of Peace,How great it´s to sing this song. ioioioia. wi...,Xuxa,14.1,Pop
10236,Quem Dorme É o Leão,"Ih, Ih, Ih, Ih, Ih... Amamauê"". ""Ih, Ih, Ih, I...",Xuxa,14.1,Pop
10237,Rainbow,i will paint a rainbow filled with energy. if ...,Xuxa,14.1,Pop


In [10]:
lyrics_data[lyrics_data.lyric.str.contains('Que')]

Unnamed: 0,song_name,lyric,artist,popularity,genre
10236,Quem Dorme É o Leão,"Ih, Ih, Ih, Ih, Ih... Amamauê"". ""Ih, Ih, Ih, I...",Xuxa,14.1,Pop
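
A side note on the search above: `Series.str.contains` interprets its pattern as a regular expression by default, and it returns a missing value for rows whose lyric is missing, which cannot be used directly as a boolean mask. A minimal sketch on made-up data (the frame `toy` and its rows are assumptions for illustration) shows a more defensive form using `regex=False` and `na=False`:

```python
import pandas as pd

# made-up rows, including one missing lyric
toy = pd.DataFrame({
    'song_name': ['Song 1', 'Song 2', 'Song 3'],
    'lyric': ['Que bom', None, 'What will be'],
})

# regex=False: treat 'Que' as a literal substring
# na=False: count missing lyrics as non-matches instead of propagating NaN
mask = toy['lyric'].str.contains('Que', regex=False, na=False)
matches = toy[mask]
```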


In [9]:
lyrics.loc[lyrics.SName == 'Quem Dorme É o Leão'].Lyric

153325    Ih, Ih, Ih, Ih, Ih... Amamauê". "Ih, Ih, Ih, I...
Name: Lyric, dtype: object

In [3]:
# save the prepared dataset to disk
lyrics_data.to_csv('lyrics_data.csv')
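
Because `to_csv` writes the row index by default, reading the file back with a plain `pd.read_csv` yields an extra `Unnamed: 0` column. The sketch below illustrates two common ways to handle this. It uses an in-memory buffer and made-up rows, both of which are assumptions for illustration; the notebook itself writes to `lyrics_data.csv`.

```python
import io
import pandas as pd

# made-up rows standing in for the prepared data
toy = pd.DataFrame({'song_name': ['Song 1', 'Song 2'],
                    'genre': ['Pop', 'Pop']})

# default behaviour: the index is written as a first column with an empty header
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)             # gains an 'Unnamed: 0' column

# option 1: tell read_csv that the first column is the index
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)

# option 2: do not write the index in the first place
buf2 = io.StringIO()
toy.to_csv(buf2, index=False)
buf2.seek(0)
no_index = pd.read_csv(buf2)
```

Which option fits best depends on whether later steps rely on the saved index.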