#Generating song lyrics with an RNN

##Imports

In [1]:
import pandas as pd
import re
import os

##Import data

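Load the data from the local environment or from Google Drive. Run only the cell that matches your setup; the Colab cell overrides BASE_PATH with the Drive folder.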

In [None]:
BASE_PATH = '.'

In [10]:
from google.colab import drive
drive.mount('/content/drive')
BASE_PATH = '/content/drive/MyDrive/Colab Notebooks/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
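
The two cells above could be folded into a single one that picks the base path automatically. The following is only a sketch: the helper name resolve_base_path and its parameters are hypothetical, and it assumes that the google.colab package shows up in sys.modules when the notebook runs in Colab. Drive still has to be mounted with drive.mount before the Colab path can be read.

```python
import sys

def resolve_base_path(modules=None,
                      colab_path='/content/drive/MyDrive/Colab Notebooks',
                      local_path='.'):
    """Pick the Drive folder when running inside Colab, else the local folder."""
    # `modules` is injectable only to make the helper easy to test
    modules = sys.modules if modules is None else modules
    # Assumption: Colab preloads google.colab, so its presence is the signal
    return colab_path if 'google.colab' in modules else local_path

BASE_PATH = resolve_base_path()
print(BASE_PATH)
```

Leaving the trailing slash off the Colab path keeps later f-strings such as f'{BASE_PATH}/data/...' free of doubled separators.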


The data consists of two sets:
- lyrics-data:
  * ALink - Link to the artist profile in Vagalume.com
  * SName - Song's name
  * SLink - Link to the song lyrics in Vagalume.com
  * Lyric - Song lyrics
  * language - Language of the lyrics

- artists-data:
  * Artist
  * Genres
  * Songs
  * Popularity
  * Link - Link to the artist profile in Vagalume.com
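
A quick schema check can catch a renamed or missing column before the merge fails further down. This is a minimal sketch: the helper missing_columns is hypothetical, and the expected names are taken from the column list above and from how the columns are used later in the notebook, so they are assumed to match the CSV headers.

```python
import pandas as pd

# Assumed to match the CSV headers; taken from the notebook's own usage
LYRICS_COLUMNS = ['ALink', 'SName', 'SLink', 'Lyric', 'language']
ARTISTS_COLUMNS = ['Artist', 'Genres', 'Songs', 'Popularity', 'Link']

def missing_columns(frame, expected):
    """Return the expected column names that are absent from the frame, in order."""
    return [name for name in expected if name not in frame.columns]

# Toy frame standing in for lyrics-data.csv, with one column deliberately left out
toy_lyrics = pd.DataFrame(columns=['ALink', 'SName', 'SLink', 'Lyric'])
print(missing_columns(toy_lyrics, LYRICS_COLUMNS))  # ['language']
```

On the real data the same call would be made with lyrics_df and artists_df right after loading them.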


In [11]:
lyrics_df = pd.read_csv(f'{BASE_PATH}/data/raw/lyrics-data.csv')
artists_df = pd.read_csv(f'{BASE_PATH}/data/raw/artists-data.csv')

print(f"Lyrics dataset: {lyrics_df.head()}")
print(f"\nArtists dataset: {artists_df.head()}")

Lyrics dataset:              ALink                            SName  \
0  /ivete-sangalo/                            Arerê   
1  /ivete-sangalo/  Se Eu Não Te Amasse Tanto Assim   
2  /ivete-sangalo/                      Céu da Boca   
3  /ivete-sangalo/            Quando A Chuva Passar   
4  /ivete-sangalo/                     Sorte Grande   

                                               SLink  \
0                          /ivete-sangalo/arere.html   
1  /ivete-sangalo/se-eu-nao-te-amasse-tanto-assim...   
2                     /ivete-sangalo/chupa-toda.html   
3          /ivete-sangalo/quando-a-chuva-passar.html   
4                   /ivete-sangalo/sorte-grande.html   

                                               Lyric language  
0  Tudo o que eu quero nessa vida,\nToda vida, é\...       pt  
1  Meu coração\nSem direção\nVoando só por voar\n...       pt  
2  É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...       pt  
3  Quando a chuva passar\n\nPra quê falar\nSe voc...       p

##Prepare dataset

Merge the lyrics and artists tables on the artist link (ALink = Link), then keep only English songs whose Genres value is exactly Rock or Pop (for now).

In [12]:

full_df = lyrics_df.merge(artists_df[['Artist', 'Genres', 'Link']], left_on='ALink', right_on='Link')
df = full_df[full_df['language'] == 'en'].copy()
target_genres = ['Rock', 'Pop']
df = df[df['Genres'].isin(target_genres)]
print(f"Song count after filtering: {len(df)}")


Song count after filtering: 7216
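
The filter above relies on isin, which compares whole cell values. The toy example below illustrates what that means. It assumes, without having checked the real file, that some Genres cells may list several genres separated by a semicolon; the helper has_target_genre and all sample rows are made up for illustration.

```python
import pandas as pd

lyrics = pd.DataFrame({
    'ALink': ['/a/', '/a/', '/b/', '/c/'],
    'Lyric': ['la la', 'na na', 'oh oh', 'hey hey'],
    'language': ['en', 'pt', 'en', 'en'],
})
artists = pd.DataFrame({
    'Artist': ['A', 'B', 'C'],
    'Genres': ['Rock', 'Pop; Rock', 'Samba'],
    'Link': ['/a/', '/b/', '/c/'],
})

# Same steps as the notebook: inner merge on the artist link, then English only
merged = lyrics.merge(artists, left_on='ALink', right_on='Link')
english = merged[merged['language'] == 'en']

# isin keeps a row only when the whole Genres cell equals 'Rock' or 'Pop'
exact = english[english['Genres'].isin(['Rock', 'Pop'])]

def has_target_genre(genres, targets=frozenset({'Rock', 'Pop'})):
    """True when any ';'-separated genre in the cell is one of the targets."""
    if not isinstance(genres, str):
        return False
    return any(part.strip() in targets for part in genres.split(';'))

# Variant that also keeps artists tagged with a target genre among others
any_match = english[english['Genres'].apply(has_target_genre)]

print(len(exact), len(any_match))  # 1 2
```

Which behavior is preferable depends on the goal: the exact match yields a narrower set of artists, while the variant trades that purity for more training data.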


Data cleaning:

* Returns an empty string for non-string inputs (such as missing values) so the function never crashes, and lowercases the text to reduce vocabulary size.

* Uses regular expressions to strip bracketed and parenthesized text such as [Chorus] or (Verse), so the model focuses purely on the lyrics. Any other text in parentheses is removed as well.

* Removes emojis and special symbols, keeping only lowercase letters, digits, whitespace, periods, commas, and apostrophes.

* Collapses runs of blank lines into a single newline and runs of spaces into a single space to maintain a consistent song structure.

* Filters out entries of 50 characters or fewer to ensure the model trains only on meaningful data.


In [13]:
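def clean_and_normalize(text):
    # Non-string values (e.g. NaN for missing lyrics) become an empty string
    if not isinstance(text, str):
        return ""
    text = text.lower()

    # Strip bracketed and parenthesized text such as [chorus] or (verse)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'\(.*?\)', '', text)

    # Keep only letters, digits, whitespace, periods, commas and apostrophes
    text = re.sub(r"[^a-z0-9\s.,']", '', text)

    # Collapse blank lines into one newline and repeated spaces into one space
    text = re.sub(r'\n\s*\n', '\n', text)
    text = re.sub(r' +', ' ', text)

    return text.strip()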

df['Cleaned_Lyric'] = df['Lyric'].apply(clean_and_normalize)

df = df[df['Cleaned_Lyric'].str.len() > 50]

print("Cleaned song example:")
print(df['Cleaned_Lyric'].iloc[0][:200])

Cleaned song example:
are we gonna make it
is this gonna hurt
oh, we can try to sedate it
but that never works
yeah
i start to imagine a world where we don't collide
it's making me sick but we'll heal and the sun will rise
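
To make the individual cleaning steps easier to follow, the sketch below applies the same regular expressions one at a time to a made-up sample string. The character class is written in a compact form that should be equivalent to the one in clean_and_normalize; the sample text and variable names are illustrative only.

```python
import re

sample = "[Chorus]\nHello, WORLD! (yeah) 🎵\n\n\nIt's   me..."

# 1. Lowercase everything
lowered = sample.lower()

# 2. Drop bracketed and parenthesized text
untagged = re.sub(r'\[.*?\]', '', lowered)
untagged = re.sub(r'\(.*?\)', '', untagged)

# 3. Keep only letters, digits, whitespace, periods, commas and apostrophes
filtered = re.sub(r"[^a-z0-9\s.,']", '', untagged)

# 4. Collapse blank lines, then repeated spaces, and trim both ends
collapsed = re.sub(r'\n\s*\n', '\n', filtered)
cleaned = re.sub(r' +', ' ', collapsed).strip()

print(repr(cleaned))  # "hello, world \nit's me..."
```

The space left after "world" shows that a single trailing space at the end of a line survives the cleaning. If that matters for training, a further substitution could trim spaces before each newline.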


Build the processed dataset with the cleaned lyrics and save it as a CSV file under data/processed.

In [14]:
output_path = f'{BASE_PATH}/data/processed/processed_lyrics_data.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df[['Artist', 'Genres', 'Cleaned_Lyric']].to_csv(output_path, index=False)

print(f"Dataset on location: {output_path}")
print(f"Song count: {len(df)}")

Dataset on location: /content/drive/MyDrive/Colab Notebooks//data/processed/processed_lyrics_data.csv
Song count: 7206
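
Because the cleaned lyrics contain newlines, it can be worth confirming that the CSV round-trips as expected. The sketch below does this on made-up rows in a temporary folder rather than on the real output file. It also builds the path with os.path.join, one way to avoid the doubled slash visible in the printed location above when the base path ends with a separator.

```python
import os
import tempfile
import pandas as pd

# Toy rows standing in for the processed dataset (values are made up)
demo_df = pd.DataFrame({
    'Artist': ['Artist A', 'Artist B'],
    'Genres': ['Rock', 'Pop'],
    'Cleaned_Lyric': ["first line\nsecond line", "it's a one, two, three"],
})

with tempfile.TemporaryDirectory() as base_path:
    # os.path.join does not add a second separator after a trailing '/'
    demo_path = os.path.join(base_path + '/', 'data', 'processed', 'demo.csv')
    os.makedirs(os.path.dirname(demo_path), exist_ok=True)
    demo_df.to_csv(demo_path, index=False)

    # Read the file back while the temporary folder still exists
    reloaded = pd.read_csv(demo_path)

# Multi-line lyrics are quoted on write, so they should come back unchanged
same_lyrics = reloaded['Cleaned_Lyric'].tolist() == demo_df['Cleaned_Lyric'].tolist()
print(list(reloaded.columns), same_lyrics)
```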
