In [1]:
from dotenv import load_dotenv
import lyricsgenius
import os
import pandas as pd
import re

To get the lyrics of the songs we are going to use the `lyricgenius` library. This library requires a Genius API key which can be generated creating a free account in [Genius](https://genius.com).

In [2]:
# load enviroment variables from .env file
load_dotenv()

# acces the variable
token = os.getenv("GENIUS_API_TOKEN")

Now we create the genius object that we will use to fetch the songs.

In [3]:
genius = lyricsgenius.Genius(token)
genius.remove_section_headers = True 
genius.verbose = False

To found the songs of a given artist we need its `id`:

In [38]:
artist_names = ["Kase.O", "SFDK", "Ayax y Prok", "Kaze"]

artists = {}

for name in artist_names:
    artist_id = genius.search_artist(name, max_songs=0).id
    artists[artist_id] = name

print(artists)

{27132: 'Kase.O', 55782: 'SFDK', 1556128: 'Ayax y Prok', 1062440: 'Kaze'}


Now we can go through the pages of the website to find songs of our artist. We will omit colaborations, live songs and remixes.

In [39]:
def find_songs(artist_id, page):  
    songs = genius.artist_songs(artist_id, page=3)["songs"]
    
    songs = list(filter(lambda x: len(x["featured_artists"]) == 0, songs))
    songs = list(filter(lambda x: x["artist_names"] == artists[artist_id], songs))
    
    omit_words = ["Directo", "Remix", "Mix"]
    pattern = '|'.join(map(re.escape, omit_words))
    
    songs = list(filter(lambda x: not re.search(pattern, x["full_title"], re.IGNORECASE), songs))
    
    song_ids = [song["id"] for song in songs]
    
    return song_ids

We will keep fetching songs until we have 10k words per artist.

In [47]:
texts = {}

MINIMUM_WORDS = 10000

for artist_id, artist_name in artists.items():
    wordcount = 0
    songcount = 0
    done = False
    page = 1
    texts[artist_name] = ""
    
    while not done:
        for song_id in find_songs(artist_id, page):
            song = genius.search_song(song_id=song_id)

            if song:
                text = song.to_text()
                texts[artist_name] += text

                wordcount += len(text.split())
                songcount += 1
            
                if wordcount > MINIMUM_WORDS:
                    done = True
                    break
    
        page +=1 

    print(f"Finished downloading {artist_name} songs: {songcount} songs, {page} pages")

Finished downloading Kase.O songs: 8 songs, 3 pages
Finished downloading SFDK songs: 9 songs, 2 pages
Finished downloading Ayax y Prok songs: 11 songs, 3 pages
Finished downloading Kaze songs: 9 songs, 3 pages


Now we have to clear the texts

In [48]:
texts["Kaze"][:10000]

'2 ContributorsDe Una; Pt. 2: Ya No Puedo Más Lyrics\n\nLatidos no me guían el camino\nYo persigo lo que quiero y lo consigo con mis\u2005manos\nOs\u2005miro como miro\u2005al rico que dice ser superior\nHasta\u2005que se da cuenta que vive en la tierra igual que yo\nComo me toca la moral llego la hora de elaborar una idea\nYa que el rap no es laboral\nY por ahora mola pero mas me molaría volar\nSin tener que pagar demoras hasta el día en que me muera\nComo duele tener que ver llorar a la abuela\nYa no me fastidia la envidia de sanguijuela\nSi el dolor mas grande no es el de muelas\nEs tener que ver el nombre de los que quieres en una esquela\nY me cago en dios y  hasta en la virgen\nNo hago mas  que ayudar a la gente pero di para que me sirve\nSuerte tiene Torbe no alguien que crece firme\nNi drogas ni amigos ni fiestas pa\' desinhibirse\n\nY más, y ya no puedo mas\nLo malo que me traen lo bueno que se va\nLas veces que te caes te ayudo a levantar\nY ya no puedo mas, y ya no aguanto m