## 🎵 Top Artist Song Lyrics Dataset (2017–2024) — Educational Use Only

This notebook collects and compiles a rich dataset of song lyrics by globally recognized artists from **2017 to 2024**, using publicly available data from music sources.

It includes:
- Artist names and song titles (fetched from AZLyrics)
- Corresponding song lyrics (fetched via AZLyrics and Genius where needed)

⚠️ **Disclaimer:** This dataset is intended strictly for **educational and non-commercial research**. All lyrics remain the intellectual property of their respective rights holders.


## 🧭 Notebook Structure & Steps

This notebook follows a step-by-step process to collect and organize song lyrics data from top artists between **2017 and 2024** using publicly available online sources.

---

### 🔹 Step 1: Import Required Libraries  
Load all necessary Python libraries including:
- `requests`, `BeautifulSoup` (for scraping)
- `pandas`, `time`, `re`, etc.

---

### 🔹 Step 2: Scrape Artist Names & Song List Links  
- Loop through the alphabet (A–Z) to fetch artist pages  
- Extract artist names and corresponding song list URLs  
- Save as `azlyrics_artists_links.csv`

---

### 🔹 Step 3: Load Top Artist List (2010–2024)  
- Load a CSV file (`top_artist.csv`) containing the most popular global artists  
- Clean artist names for matching

---

### 🔹 Step 4: Match Top Artists with AZLyrics Artist List  
- Filter the artists from AZLyrics to include only the top artists from 2017–2024  
- Save final matched artists and their song page links as `filtered_artist_links.csv`

---

### 🔹 Step 5: Scrape Each Artist's Songs and Links  
- For each matched artist, extract all songs listed on their page  
- Also collect song URLs and album/year information  
- Save output as `song_links_&_name.csv`

---

### 🔹 Step 6: Fetch Song Lyrics (AZLyrics or Genius)  
- Using the song links, fetch lyrics directly  
- If AZLyrics blocks access, fall back to the **Genius API** (via `lyricsgenius`)  
- Apply text cleaning for NLP-friendly formatting  
- Save full lyrics data as `top_and_famous_artist_songs_with_lyrics.csv`

---

### 🔹 Step 7: (Optional) Genius API Usage  
- If enabled, fetch lyrics using the official Genius API with your API key  
- Use only when AZLyrics scraping fails  
- Add results to your final dataset

---

### 🔹 Step 8: Final Dataset Overview  
- Load and inspect the final dataset  
- Columns: `artist`, `songs`, `songs_lyrics`  
- Explore sample rows and text length

---

### 🔹 Step 9: Save & Upload  
- Save all outputs in CSV format  
- Upload to Kaggle with full documentation, license, and educational disclaimer

---

📌 **Note:**  
All lyrics are the property of their respective copyright holders.  
This project is strictly intended for **research and educational purposes only.**


In [72]:
import re  # used later by split_mixed_text()
import time

import lyricsgenius
import pandas as pd
import requests
from bs4 import BeautifulSoup

## 🔍 Scraping Strategy

1. Loop through all alphabet letters (`a` to `z`) which represent artist categories on AZLyrics.
2. For each letter:
   - Visit the corresponding URL (e.g., `https://www.azlyrics.com/a.html`).
   - Parse the artist names and the links to their song list pages.
3. Save all artist names and URLs for later use.
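
The parsing step in the strategy above can also be tried without a network call. The following is a minimal sketch, not the method this notebook uses: it keeps anchors by the shape of their `href` (a short directory followed by a `.html` file, as in `a/aaliyah.html`) instead of by their position in the page. The `ArtistLinkParser` class, the regular expression, and the sample HTML are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed shape of an artist link, based on values such as "a/aaliyah.html".
ARTIST_HREF = re.compile(r"^[a-z0-9]{1,2}/[a-z0-9_-]+\.html$")


class ArtistLinkParser(HTMLParser):
    """Collect (artist name, href) pairs from <a> tags whose href looks like an artist page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the matching <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if ARTIST_HREF.match(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical sample markup, not copied from the real site.
sample_html = """
<a href="/">Home</a>
<a href="a/aaliyah.html">Aaliyah</a>
<a href="a/abba.html">ABBA</a>
<a href="https://example.com/contact.html">Contact</a>
"""

parser = ArtistLinkParser()
parser.feed(sample_html)
print(parser.pairs)
```

Filtering by link shape avoids depending on how many navigation links surround the artist list, at the cost of depending on the assumed URL pattern.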


In [2]:
# Letters to iterate over (each represents a separate artist listing page)
letters = list("abcdefghijklmnopqrstuvwxyz")

# Store data
all_artists = []
all_links = []

# Loop through each letter and scrape artist names and their song list links
for letter in letters:
    time.sleep(5.5)  # Be respectful to AZLyrics servers
    url = f"https://www.azlyrics.com/{letter}.html"
    print(f"Fetching: {url}")
    
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all <a> tags and filter only relevant ones
    links = soup.find_all("a")[28:-8]  # Skip unrelated links (based on inspection)
    artist_names = [link.text for link in links]
    artist_urls = [link.get("href") for link in links]

    print(f"Found {len(artist_names)} artists for '{letter.upper()}'")

    all_artists.extend(artist_names)
    all_links.extend(artist_urls)


Fetching: https://www.azlyrics.com/a.html
Found 1363 artists for 'A'


In [4]:
all_links

['a/a1.html',
 'f/floyda1bentley.html',
 'a/a1xj1.html',
 'a/a.html',
 'a/a2h.html',
 'a/a36.html',
 'a/a4.html',
 'a/a92.html',
 's/snohaalegra.html',
 'a/aaliyah.html',
 's/saaraaalto.html',
 'a/aaradhna.html',
 'a/aarne.html',
 'a/aaroncarpenter.html',
 'c/carter.html',
 'a/aaroncole.html',
 'a/aarondoh.html',
 'a/aaronfresh.html',
 'a/aarongoodvin.html',
 'a/aaronhall.html',
 'a/aaronlewis.html',
 'l/lines.html',
 'a/aaronmay.html',
 'a/aaronneville.html',
 'a/aaronpritchett.html',
 'a/aaronshust.html',
 'a/aaronsmithuk.html',
 'a/aaronsmith.html',
 'a/aarontaos.html',
 'a/aarontaylor.html',
 'a/aarontippin.html',
 'a/aaronwatson.html',
 'a/aaronwestandtheroaringtwenties.html',
 'a/aaryanshah.html',
 'a/aasthagill.html',
 'a/ab6ix.html',
 'a/abandonallships.html',
 'a/abandonedpools.html',
 'a/abba.html',
 'a/abbath.html',
 'a/abbeycone.html',
 'a/abbeyglover.html',
 'a/abbiefalls.html',
 'g/gregoryabbott.html',
 'a/abbyanderson.html',
 'a/abbycates.html',
 'a/abc.html',
 'a/abdul.

## 📁 Combine and Save Data

We now save the artist names and their respective song page URLs into a structured DataFrame and export it as a CSV file for future use.
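
Turning relative links into full URLs can be sketched with the standard library. This is an illustrative alternative under stated assumptions, not the notebook's own approach: the helper name `to_absolute` is hypothetical, and the sample links only mimic the format seen in the outputs.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.azlyrics.com/"


def to_absolute(link: str) -> str:
    """Resolve a relative link against BASE_URL; absolute URLs are returned unchanged."""
    return urljoin(BASE_URL, link)


print(to_absolute("a/aaliyah.html"))
print(to_absolute("../a/aaliyah.html"))
print(to_absolute("https://www.azlyrics.com/19/347aidan.html"))
```

`urljoin` resolves `../` segments according to URL rules, so the same helper covers plain relative paths, parent-relative paths, and links that are already absolute.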


In [7]:
# Create full URLs from relative paths (AZLyrics uses relative links)
base_url = "https://www.azlyrics.com/"

# Convert relative links to full URLs.
# Note: lstrip("../") strips any leading "." and "/" characters, not the literal prefix "../".
full_links = [link if link.startswith("http") else base_url + link.lstrip("../") for link in all_links]

# Create a DataFrame
df = pd.DataFrame({
    "artist": all_artists,
    "link": full_links
})

# Preview
df.head()


Unnamed: 0,artist,link
0,a1,https://www.azlyrics.com/a/a1.html
1,A1,https://www.azlyrics.com/f/floyda1bentley.html
2,A1 x J1,https://www.azlyrics.com/a/a1xj1.html
3,A,https://www.azlyrics.com/a/a.html
4,A2H,https://www.azlyrics.com/a/a2h.html


In [None]:
df.to_csv("azlyrics_artists_links.csv", index=False)

## 🎯 Filter Top Artists from Full AZLyrics Dataset

Now that we have saved the complete list of all artists and their AZLyrics links to `azlyrics_artists_links.csv`, we will:

1. Load our list of **Top Artists** (e.g., downloaded from Kaggle or another source).
2. Match them against the AZLyrics artist names.
3. Filter and extract only the matching artists along with their lyrics page links.
4. Save this filtered list to a new file: `filtered_artist_links.csv`.
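
The matching in steps 2 and 3 can be sketched with a set lookup instead of comparing every pair of names. This is a minimal sketch under assumptions: `normalize` and `match_artists` are hypothetical helper names, the rows are plain tuples rather than DataFrame rows, and the sample data is illustrative.

```python
def normalize(name: str) -> str:
    """Lowercase and drop spaces, mirroring the comparison key used in this notebook."""
    return str(name).lower().replace(" ", "").strip()


def match_artists(top_artists, az_rows):
    """Return (artist, link) rows from az_rows whose normalized name is in top_artists.

    az_rows is an iterable of (artist, link) tuples. Duplicate rows are kept once.
    """
    wanted = {normalize(name) for name in top_artists}
    seen = set()
    matches = []
    for artist, link in az_rows:
        row = (artist, link)
        if normalize(artist) in wanted and row not in seen:
            seen.add(row)
            matches.append(row)
    return matches


# Illustrative rows; names and links follow the format seen in the outputs.
az_rows = [
    ("Rihanna", "r/rihanna.html"),
    ("Lady Gaga", "l/ladygaga.html"),
    ("ABBA", "a/abba.html"),
]
top_artists = ["rihanna", "Lady  Gaga", "Drake"]

print(match_artists(top_artists, az_rows))
```

Note one difference in this sketch: results come back in the order of the AZLyrics rows, not in the order of the top-artist list.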


In [33]:
import pandas as pd

# Load full artist dataset with links from AZLyrics
dff = pd.read_csv("azlyrics_artists_links.csv")  # Columns: artist, link

# Load top artists list (downloaded separately)
data1 = pd.read_csv("top_artist.csv")  # Column: artist

# Create a DataFrame to hold matching artist-link pairs
filtered_df = pd.DataFrame(columns=["artist", "link"])

# Match top artists with AZLyrics artists
for i in range(len(data1)):
    top_artist = str(data1["artist"][i]).lower().replace(" ", "").strip()

    for j in range(len(dff)):
        az_artist = str(dff["artist"][j]).lower().replace(" ", "").strip()

        if top_artist == az_artist:
            print(f"✅ MATCH: {top_artist} == {az_artist}")
            filtered_df.loc[len(filtered_df)] = [dff["artist"][j], dff["link"][j]]

# Drop duplicates
filtered_df = filtered_df.drop_duplicates()

# The filtered list is saved to CSV in a later cell



✅ MATCH: kesha == kesha
✅ MATCH: rihanna == rihanna
✅ MATCH: ladygaga == ladygaga
✅ MATCH: jasonderulo == jasonderulo
✅ MATCH: katyperry == katyperry
✅ MATCH: brunomars == brunomars
✅ MATCH: taiocruz == taiocruz
✅ MATCH: blackeyedpeas == blackeyedpeas
✅ MATCH: usher == usher
✅ MATCH: taylorswift == taylorswift
✅ MATCH: b.o.b == b.o.b
✅ MATCH: eminem == eminem
✅ MATCH: train == train
✅ MATCH: mikeposner == mikeposner
✅ MATCH: drake == drake
✅ MATCH: onerepublic == onerepublic
✅ MATCH: justinbieber == justinbieber
✅ MATCH: nickiminaj == nickiminaj
✅ MATCH: adamlambert == adamlambert
✅ MATCH: ludacris == ludacris
✅ MATCH: davidguetta == davidguetta
✅ MATCH: ladyantebellum == ladyantebellum
✅ MATCH: jay-z == jay-z
✅ MATCH: pitbull == pitbull
✅ MATCH: iyaz == iyaz
✅ MATCH: timbaland == timbaland
✅ MATCH: nelly == nelly
✅ MATCH: laroux == laroux
✅ MATCH: daughtry == daughtry
✅ MATCH: shontelle == shontelle
✅ MATCH: jaysean == jaysean
✅ MATCH: fareastmovement == fareastmovement
✅ MATCH: selen

✅ MATCH: daya == daya
✅ MATCH: dnce == dnce
✅ MATCH: shawnmendes == shawnmendes
✅ MATCH: lukasgraham == lukasgraham
✅ MATCH: zayn == zayn
✅ MATCH: charlieputh == charlieputh
✅ MATCH: g-eazy == g-eazy
✅ MATCH: zaralarsson == zaralarsson
✅ MATCH: xambassadors == xambassadors
✅ MATCH: kiiara == kiiara
✅ MATCH: majorlazer == majorlazer
✅ MATCH: jamesbay == jamesbay
✅ MATCH: halsey == halsey
✅ MATCH: flume == flume
✅ MATCH: troyesivan == troyesivan
✅ MATCH: ruthb. == ruthb.
✅ MATCH: gnash == gnash
✅ MATCH: torylanez == torylanez
✅ MATCH: jonbellion == jonbellion
✅ MATCH: kentjones == kentjones
✅ MATCH: niallhoran == niallhoran
✅ MATCH: gwenstefani == gwenstefani
✅ MATCH: camilacabello == camilacabello
✅ MATCH: rachelplatten == rachelplatten
✅ MATCH: desiigner == desiigner
✅ MATCH: mnek == mnek
✅ MATCH: jordanfisher == jordanfisher
✅ MATCH: shawnhook == shawnhook
✅ MATCH: robinschulz == robinschulz
✅ MATCH: oliviao'brien == oliviao'brien
✅ MATCH: kai == kai
✅ MATCH: grace == grace
✅ MATCH: t

✅ MATCH: zachbryan == zachbryan
✅ MATCH: kylieminogue == kylieminogue
✅ MATCH: summerwalker == summerwalker
✅ MATCH: tylayaweh == tylayaweh
✅ MATCH: playboicarti == playboicarti
✅ MATCH: 'nsync == 'nsync
✅ MATCH: gracieabrams == gracieabrams
✅ MATCH: teddyswims == teddyswims
✅ MATCH: fettywap == fettywap
✅ MATCH: omi == omi
✅ MATCH: elleking == elleking
✅ MATCH: natalielarose == natalielarose
✅ MATCH: silento == silento
✅ MATCH: ellahenderson == ellahenderson
✅ MATCH: vancejoy == vancejoy
✅ MATCH: shaggy == shaggy
✅ MATCH: sagethegemini == sagethegemini
✅ MATCH: alunageorge == alunageorge
✅ MATCH: asaprocky == asaprocky
✅ MATCH: skrillex == skrillex
✅ MATCH: paulmccartney == paulmccartney
✅ MATCH: disciples == disciples
✅ MATCH: omarion == omarion
✅ MATCH: sheppard == sheppard
✅ MATCH: lunchmoneylewis == lunchmoneylewis
✅ MATCH: rudimental == rudimental
✅ MATCH: conradsewell == conradsewell
✅ MATCH: jidenna == jidenna
✅ MATCH: thomasrhett == thomasrhett


In [34]:
filtered_df

Unnamed: 0,artist,link
0,Kesha,k/keha.html
1,Rihanna,r/rihanna.html
2,Lady Gaga,l/ladygaga.html
3,Jason Derulo,j/jasonderulo.html
4,Katy Perry,k/katyperry.html
...,...,...
536,LunchMoney Lewis,l/lunchmoneylewis.html
537,Rudimental,r/rudimental.html
538,Conrad Sewell,c/conradsewell.html
539,Jidenna,j/jidenna.html


In [None]:
filtered_df.to_csv("filtered_artist_links.csv", index=False)

## 🎵 Fetch All Songs and Song Links for Each Artist

This function visits each artist's AZLyrics profile page and collects:

- Song titles
- Corresponding song URLs
- The album/single/EP name the song belongs to

It returns a structured dataset with this information. Instrumentals and non-song sections are skipped.
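
The core idea of pairing each song with the year of its album can be sketched in a single pass over the page entries. This is a simplified illustration, not the notebook's function: `pair_songs_with_year` is a hypothetical name, the header prefixes are taken from the checks in `fetch_songs()`, and the sketch assumes headers start with those prefixes.

```python
import re

# Header prefixes checked in fetch_songs(); treated here as an assumption
# about how section headers are labeled on an artist page.
HEADER_PREFIXES = ("album:", "single:", "EP:", "soundtrack:", "demo:", "mixtape", "split EP")
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")


def pair_songs_with_year(entries):
    """Walk a flat list of header lines and song titles; return (title, year) pairs.

    year is None for songs listed under "other songs:" or before any header.
    """
    pairs = []
    current_year = None
    for entry in entries:
        text = entry.strip()
        if not text:
            continue
        if text.startswith("other songs:"):
            current_year = None
        elif text.startswith(HEADER_PREFIXES):
            found = YEAR_IN_PARENS.search(text)
            current_year = found.group(1) if found else None
        elif "Instrumental" in text:
            continue  # instrumentals are skipped, as described above
        else:
            pairs.append((text, current_year))
    return pairs


entries = [
    'album: "Music Of The Sun" (2005)',
    "Pon De Replay",
    "Here I Go Again",
    "other songs:",
    "World Peace",
]
print(pair_songs_with_year(entries))
```

Because the list is only read and never modified during the loop, no entries are skipped while iterating.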


In [36]:
d1 = pd.read_csv("fillter_artist_name_and_links.csv")
d1

Unnamed: 0.1,Unnamed: 0,artist,links
0,0,Kesha,k/keha.html
1,1,Rihanna,r/rihanna.html
2,2,Lady Gaga,l/ladygaga.html
3,3,Jason Derulo,j/jasonderulo.html
4,4,Katy Perry,k/katyperry.html
...,...,...,...
566,567,Jeremy Zucker,j/jeremyzucker.html
567,568,Sandro Cavazza,s/sandrocavazza.html
568,569,Jake Miller,j/jakemiller.html
569,570,347aidan,https://www.azlyrics.com/19/347aidan.html


In [47]:
# Avoid naming the variable `dict`, which would shadow the built-in type
song_columns = {'artist':[],
       'songs':[],
       'songs_links':[],
       'song_year':[]
      }

newdata = pd.DataFrame(song_columns)

In [45]:
def song_clean(name):
    a = name.find("(")
    if a == -1:
        song1 = name
    else:
        song1 = name[a:].strip()

    return song1
def fetch_songs(data,ept_data,loop):
    not_fetch_artist = []
    try:
        for i in range(loop, len(data["artist"])):
            print("loop : ",i)
            val = data["links"][i] 
            a_name = data["artist"][i]
            print("Artist name",a_name)
           # r = random.randint(8,15)
            time.sleep(5)
            #print("time",r)
            urls = f"https://www.azlyrics.com/{val}"
            print(urls)
            try:
                reqs = requests.get(urls)
            except:
                print("change proxy")
                break
            if (reqs.status_code != 200):
                print(" connection error ")
                break
            else:
                soup = BeautifulSoup(reqs.content,"html.parser")
                find = soup.find_all("div",{"class": "listalbum-item"})
                if not find:
                    break
                #space = [title.text for title in find]
                #re_name = list(map(lambda x: x.replace('\n', ""), space))
                try:
                    all_name = [title.text for title in soup.find("div",id="listAlbum")]

                except:
                    all_name = [title.text for title in find]
                    if (all_name == []):
                        break
                #all_name = [title.text for title in soup.find("div",id="listAlbum")]
                #songs  links fetching  
                songs_link=[]
                a_tag = []
                for i in range(len(find)):
                    a_tag.append(find[i].find("a"))
                new_a_tag = []
                for i in a_tag:
                    if i != None:
                        new_a_tag.append(i) 
                for link in new_a_tag:
                    links = link.get('href')
                    songs_link.append(links)
                # fetching songs name 
                xx = []
                for i in all_name:
                    xx.append(i.replace("\n",""))
                while '' in xx:
                    xx.remove('')
                name =  xx[:-1]
                for k in range(4):
                    for m in name:
                        if("Instrumental" in m or "(Instrumental)" in m):
                            name.remove(m)
                        for jj in name:
                            if("Instrumental" in jj or "(Instrumental)" in jj):
                                name.remove(jj)
                new_name = []
                year = []
                a = "Nan"
                for i in name:
                    if "album:" in i or "single:" in i or "EP:" in i or "soundtrack:" in i or "demo:" in i or "mixtape" in i or "split EP" in i:
                        a = i
                        print(a)
                        name.remove(i)
                        #year.append(song_clean(a))
                    elif "other songs:" in i:
                        a = "Nan"
                        print(i)
                        name.remove(i)
                        #year.append(song_clean(a))
                    elif "Instrumental" in i:
                        print(i)
                        name.remove(i)
                        #year.append(song_clean(a))
                    year.append(song_clean(a))
                        #new_name.append(i)

                loop = i
                if len(name) == len(songs_link):
                    for j in range(len(name)):
                        ept_data.loc[len(ept_data.index)] = [a_name, name[j],songs_link[j],year[j]]

                else:
                    print(a_name, "could not be parsed consistently; removing this artist from the dataset")
                    an = ept_data[ept_data['artist'] == a_name].index
                    ept_data.drop(index=an,inplace=True)
                    not_fetch_artist.append(a_name)
        return ept_data , not_fetch_artist,1,i
    except TypeError:
        print("something wrong in code")
        

## Function Purpose Recap

The original `fetch_songs()` function was written to:

- Fetch all songs from each artist's page on AZLyrics
- Save the following info per song:
  - Artist name
  - Song name
  - Song link
  - Song year
- Allow resuming from a specific index (e.g. after a crash or block)
- Avoid fetching duplicate data if partially completed
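
The resume-from-index idea can be sketched independently of any scraping. This is a minimal sketch with hypothetical names (`process_from`, `flaky_upper`); the failing handler only simulates a blocked request.

```python
def process_from(items, start, handler):
    """Apply handler to items[start:], stopping at the first failure.

    Returns (results, next_index). next_index == len(items) means everything was
    processed; otherwise it is the index to pass as `start` on the next attempt.
    """
    results = []
    for index in range(start, len(items)):
        try:
            results.append(handler(items[index]))
        except Exception as error:  # e.g. a blocked request
            print(f"Stopped at index {index}: {error}")
            return results, index
    return results, len(items)


# Hypothetical handler that fails on its third call, standing in for a blocked request.
attempts = {"count": 0}


def flaky_upper(name):
    attempts["count"] += 1
    if attempts["count"] == 3:
        raise RuntimeError("simulated block")
    return name.upper()


artists = ["kesha", "rihanna", "lady gaga", "drake"]
first, resume_at = process_from(artists, 0, flaky_upper)
second, done_at = process_from(artists, resume_at, flaky_upper)
print(first + second, done_at)
```

Returning the index of the item that failed, rather than the last one that succeeded, means the second call retries that item and nothing is fetched twice.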

In [48]:
# Example: We fetch data for only one artist here to demonstrate 
# how the function works and verify it returns song names, links, 
# and album info correctly.
fetch_songs(d1,newdata,1)

loop :  1
Artist name Rihanna
https://www.azlyrics.com/r/rihanna.html
album: "Music Of The Sun" (2005)
album: "A Girl Like Me" (2006)
album: "Good Girl Gone Bad" (2007)
album: "Rated R" (2009)
album: "Loud" (2010)
album: "Talk That Talk" (2011)
album: "Unapologetic" (2012)
soundtrack: "Home" (2015)
album: "Anti" (2016)
other songs:


(      artist                           songs  \
 0    Rihanna                   Pon De Replay   
 1    Rihanna                 Here I Go Again   
 2    Rihanna    If It's Lovin' That You Want   
 3    Rihanna  You Don't Love Me (No, No, No)   
 4    Rihanna                 That La, La, La   
 ..       ...                             ...   
 163  Rihanna                   We Found Love   
 164  Rihanna                Whipping My Hair   
 165  Rihanna                Who's That Chick   
 166  Rihanna                   Winning Women   
 167  Rihanna                     World Peace   
 
                                       songs_links song_year  
 0                /lyrics/rihanna/pondereplay.html    (2005)  
 1               /lyrics/rihanna/hereigoagain.html    (2005)  
 2      /lyrics/rihanna/ifitslovinthatyouwant.html    (2005)  
 3        /lyrics/rihanna/youdontlovemenonono.html    (2005)  
 4                 /lyrics/rihanna/thatlalala.html    (2005)  
 ..                             

In [51]:
# Preview the data after fetching all song links and years
newdata.head()

Unnamed: 0,artist,songs,songs_links,song_year
0,Rihanna,Pon De Replay,/lyrics/rihanna/pondereplay.html,(2005)
1,Rihanna,Here I Go Again,/lyrics/rihanna/hereigoagain.html,(2005)
2,Rihanna,If It's Lovin' That You Want,/lyrics/rihanna/ifitslovinthatyouwant.html,(2005)
3,Rihanna,"You Don't Love Me (No, No, No)",/lyrics/rihanna/youdontlovemenonono.html,(2005)
4,Rihanna,"That La, La, La",/lyrics/rihanna/thatlalala.html,(2005)


In [None]:
# Final save
newdata.to_csv("song_links_&_name.csv")

## 🎶  Fetch Full Song Lyrics Using Collected Song Links

Now that we've successfully scraped and stored:

- Artist names
- Song titles
- Song years
- Song page links (from AZLyrics)

It's time to **move forward and fetch the full lyrics** for each song.

We'll load the saved dataset `song_links_&_name.csv` and use the `songs_links` column to visit each song's page and extract its lyrics.



In [52]:
# Load the saved dataset containing song names and links
data1 = pd.read_csv("data/song_links_&_name.csv")

# Drop the unnecessary 'Unnamed: 0' index column (if it exists)
df = data1.drop(columns=["Unnamed: 0"], errors="ignore")

# Display the dataset to verify structure
df

Unnamed: 0,artist,songs,songs_links,song_year
0,Kesha,Bastards,/lyrics/keha/bastards.html,2017
1,Kesha,Let 'Em Talk,/lyrics/keha/letemtalk.html,2017
2,Kesha,Woman,/lyrics/keha/woman.html,2017
3,Kesha,Hymn,/lyrics/keha/hymn.html,2017
4,Kesha,Praying,/lyrics/keha/praying.html,2017
...,...,...,...,...
35846,347aidan,The Weekend (Remix),https://www.azlyrics.com/lyrics/88rising/thewe...,Nan
35847,347aidan,Till The Sun Comes,/lyrics/347aidan/tillthesuncomes.html,Nan
35848,347aidan,TROUBLE,/lyrics/347aidan/trouble.html,Nan
35849,347aidan,Up & Down,https://www.azlyrics.com/lyrics/chainsmokers/u...,Nan


In [73]:
import re  # required by split_mixed_text()
import lyricsgenius
# 📘 Utility to separate mixed case words like "LoveYouMore" → "Love You More"
def split_mixed_text(text):
    words = re.split(r'(\s+|(?<=[a-z])(?=[A-Z])|-|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-zA-Z])(?=\d))', text)
    words = [word for word in words if word.strip()]
    words = [re.sub(r'[^\w\s]', '', word) for word in words]
    return " ".join(words)

# 📘 Clean extra details from song title, like "Song Name (Acoustic Version)" → "Song Name"
def song_clean(name):
    a = name.find("(")
    if a == -1:
        song1 = name
    else:
        song1 = name[:a].strip()
    return song1

# 📘 Remove common structural text from lyrics like "Chorus", "Verse 1", brackets, etc.
def clean(result_string):
    # (old, new) pairs applied in order. Order matters: "Pre-Chorus" must be
    # removed before "Chorus", and "84Embed" before "Embed".
    replacements = [
        ("Verse 1", ""), ("Pre-Chorus", ""), ("Chorus", ""), ("Verse 2", ""),
        ("Outro", ""), ("[]", ""), (",", ""), ("Intro", ""), ("'", ""),
        ("[", ""), ("]", " "), ("(", " "), (")", " "),
        ("84Embed", ""), ("8Embed", ""), ("Embed", ""),
        ("?", ""), ("Contributors", ""), ("\n", " "), ("\r", ""),
        (".", ""), ("Bridge", ""), ("Verse 3", ""), ("\u2005", ""),
    ]
    cleaned = result_string
    for old, new in replacements:
        cleaned = cleaned.replace(old, new)
    return cleaned

# 📘 Fallback method: if AZLyrics is unavailable, fetch lyrics through the Genius API (lyricsgenius)
def genius_api(df, loop, data):
    # Replace the placeholder with your own Genius API token; never publish a real token
    genius = lyricsgenius.Genius("YOUR_GENIUS_API_KEY")
    i = loop  # keeps the return value defined if the range below is empty
    for i in range(loop, len(df["artist"])):
        artist_name = df["artist"][i]
        song_name = df["songs"][i]
        artist = genius.search_artist(artist_name, max_songs=1, sort="title")
        lyrics = artist.song(song_name)
        print(i)
        try:
            # Skip the leading "<song name>Lyrics" header by its length
            cln = clean(lyrics.lyrics)[len(song_name + "Lyrics"):]
            data.loc[len(data.index)] = [artist_name, song_name, cln]
        except Exception:
            # Return the next row index so the caller can resume from there
            return i + 1
    return i


# 📘 Fetch lyrics from AZLyrics (preferred) using the stored song link
def azlayrics_fetch(df, loop, data):
    i = loop  # keeps the return value defined if the range below is empty
    for i in range(loop, len(df["artist"])):
        print(i)
        link = df["songs_links"][i]
        artist = df["artist"][i]
        song = df["songs"][i]
        # Links are stored either as full URLs or as paths relative to the site root
        url = link if link.startswith("http") else f"https://www.azlyrics.com{link}"
        print(url)
        try:
            reqs = requests.get(url, timeout=10)
        except requests.RequestException:
            print("connection error")
            break
        so = BeautifulSoup(reqs.content, "html.parser")
        # The lyrics are taken from the 25th <div> (position-based, so fragile if the layout changes)
        fi = so.find_all("div")[24:25]
        spac = [title.text for title in fi]
        ro = [text.replace("\u2005", "") for text in spac]
        result_string = " ".join(ro)
        lay = clean(result_string)[2:].lower()
        if lay == "":
            # An empty result may mean the request was blocked; stop so the run can be resumed
            break
        data.loc[len(data.index)] = [artist, song, lay]
    return i

## What This Setup Allows You To Do:
- **Primary source:** Fetch lyrics from AZLyrics (most accurate + direct).
- **Backup option:** If AZLyrics is blocked or unavailable, use Genius through the `lyricsgenius` API.
- **Cleanup:** Lyrics are stripped of metadata like "Verse 1", "Bridge", etc., for clean analysis.
- **Flexibility:** Works for any artist and song, given the artist name and song title.
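
The primary/backup arrangement can be sketched as an ordered list of sources that are tried in turn for each song. This is a minimal sketch under assumptions: `fetch_with_fallback` is a hypothetical helper, and the two fetchers are stand-ins that do not contact AZLyrics or Genius.

```python
def fetch_with_fallback(artist, title, sources):
    """Try each (name, fetcher) pair in order; return (source_name, lyrics) or (None, None).

    A fetcher takes (artist, title) and returns lyrics text, an empty string,
    or raises an exception when the source is unavailable.
    """
    for name, fetcher in sources:
        try:
            lyrics = fetcher(artist, title)
        except Exception as error:
            print(f"{name} failed for {title!r}: {error}")
            continue
        if lyrics and lyrics.strip():
            return name, lyrics
    return None, None


# Stand-in fetchers; real ones would wrap the AZLyrics and Genius logic.
def blocked_primary(artist, title):
    raise ConnectionError("simulated block")


def stub_backup(artist, title):
    return f"placeholder lyrics for {title} by {artist}"


source, text = fetch_with_fallback(
    "Example Artist",
    "Example Song",
    [("azlyrics", blocked_primary), ("genius", stub_backup)],
)
print(source, text)
```

Deciding the fallback per song, rather than once for the whole run, lets a single blocked page be recovered without restarting the loop.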


In [53]:
# Create a new DataFrame to store the lyrics
# (avoid naming the variable `dict`, which would shadow the built-in type)
lyrics_columns = {'artist':[],
        'songs':[],
        'songs_lyrics':[]
       }

data = pd.DataFrame(lyrics_columns)
data

Unnamed: 0,artist,songs,songs_lyrics


## 🎤 Using Genius API to Fetch Lyrics (Backup to AZLyrics)

If AZLyrics blocks our IP address or stops responding, we can use the official **Genius API** as a backup to fetch song lyrics.

The function `genius_api()` uses the `lyricsgenius` Python library to:
- Authenticate using a Genius API key
- Search for the artist and their song
- Fetch and clean the lyrics
- Store results in a DataFrame
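
Some rows in this notebook's outputs begin with fragments such as `ics` or `rics`, which suggests that cutting the header by length does not always line up. The sketch below shows one alternative, assuming the raw text starts with a header ending in `Lyrics` and may end with a `<number>Embed` tag; these patterns are inferred from residue visible in the outputs, and `strip_genius_residue` is a hypothetical helper.

```python
import re

# Assumed residue patterns, inferred from strings such as "Contributors", "Lyrics",
# "You might also like" and "84Embed" that show up in this notebook's outputs.
HEADER = re.compile(r"^.*?Lyrics", re.DOTALL)
FOOTER = re.compile(r"\d*Embed\s*$")


def strip_genius_residue(raw: str) -> str:
    """Drop everything up to the first 'Lyrics' marker and a trailing '<n>Embed' tag."""
    text = HEADER.sub("", raw, count=1)
    text = FOOTER.sub("", text)
    text = text.replace("You might also like", " ")
    return " ".join(text.split())


# Made-up input in the assumed shape; not real lyrics.
raw = "12 ContributorsExample Song Lyrics\nfirst line\nYou might also likesecond line84Embed"
print(strip_genius_residue(raw))
```

A limitation of this sketch: if the header is missing and the word "Lyrics" appears inside the song text, everything before that word is removed, so the result should be spot-checked.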

### 🔑 Prerequisite:
You need a Genius API token. Pass your own token when creating the client inside `genius_api()`:
```python
genius = lyricsgenius.Genius("YOUR_GENIUS_API_KEY")
```


In [65]:
# Demo of how these functions work
azlayrics_fetch(df,0,data)

0
https://www.azlyrics.com/lyrics/keha/bastards.html
1
https://www.azlyrics.com/lyrics/keha/letemtalk.html


1

In [66]:
data

Unnamed: 0,artist,songs,songs_lyrics
0,Kesha,Bastards,one two three four one i got too many people ...
1,Kesha,Let 'Em Talk,you you got your own opinions but baby i dont ...


In [79]:
row_num=[]

In [None]:
try:
    azlayrics_fetch(df,0,data)
except Exception:
    az = genius_api(df,0,data)
    # If the loop stops, pass the saved row_num value to continue where it stopped
    row_num.append(az)

In [81]:
# After collecting all song lyrics, the data looks like this
data

Unnamed: 0,artist,songs,songs_lyrics
0,Kesha,Bastards,one two three four one i got too many people ...
1,Kesha,Let 'Em Talk,you you got your own opinions but baby i dont ...
2,Kesha,Woman,lets be serious come on this is a real this is...
3,Kesha,Hymn,even the stars and the moon dont shine quite l...
4,Kesha,Praying,"music-video spoken intro: ""am i dead or is th..."
...,...,...,...
36477,347aidan,WHEN THE DEVIL CRIES,ics Yeah ayy No no no no no no no yeah yeah W...
36478,347aidan,WHEN THE DEVIL CRIES,ics Yeah ayy No no no no no no no yeah yeah W...
36479,347aidan,WHEN THE DEVIL CRIES,ics Yeah ayy No no no no no no no yeah yeah W...
36480,347aidan,WHEN THE DEVIL CRIES,ics Yeah ayy No no no no no no no yeah yeah W...


In [82]:
# Now drop duplicate rows
data = data.drop_duplicates(subset=["songs_lyrics"],keep='first')
data

Unnamed: 0,artist,songs,songs_lyrics
0,Kesha,Bastards,one two three four one i got too many people ...
1,Kesha,Let 'Em Talk,you you got your own opinions but baby i dont ...
2,Kesha,Woman,lets be serious come on this is a real this is...
3,Kesha,Hymn,even the stars and the moon dont shine quite l...
4,Kesha,Praying,"music-video spoken intro: ""am i dead or is th..."
...,...,...,...
36277,347aidan,SUNDAY,ics Yeah its like this I woke up this mor...
36278,347aidan,The Weekend (Remix),ics: BIBI Barbie wanna party like all night A...
36279,347aidan,Till The Sun Comes,rics Its hard to uh Get a grasp of whats goin...
36280,347aidan,TROUBLE,ics Now I knew that you were trouble And now ...


In [83]:
#ready to save data
data.to_csv("data/top_and_famous_artist_songs_with_lyrics.csv")

In [88]:
data.songs_lyrics[3434]

'Lyrics Sure you can have the towels You can take my money Drag my name round town I dont mind I changed it anyway They say the worst parts of someone Come out to play when shit goes wrong So shit must hit the fan with you all day hey   As you run your mouth puff your chest Play cowboy in the wild wild west I dont mind you know best Keep on ridin til you cant see us I learned the hard way about love Sometimes it just isnt enough   But sure you can have the towels  Theyre all yours  Tell em all Im crazy  Loco  What a nice cliché blamin me I guess Some things never change  Bitch  Finding out what went down As soon as I wasnt around Broke my heart but hey Ill be okay hey You might also like As you run your mouth puff your chest Play cowboy in the wild wild west I dont mind you know best Keep on ridin til you cant see us Did you know theres only one truth And in the end you have to live with what you do     Ride ride along   Far far away  Oh-oh oh oh oh  Ride ride along  Oh-oh  Far far awa

## 🎶 Lyrics Dataset Project (2017–2024) — AZLyrics + Genius API

This project involves collecting and curating a high-quality lyrics dataset using a combination of **web scraping** and **official API integration**. It is tailored specifically for **modern music analysis**, using **top and popular artists** from 2017 to 2024.

---

### 📥 Data Collection Highlights

- ✅ Scraped artist pages and song lyrics from **AZLyrics**
- 🔁 Used **Genius API** as a backup when AZLyrics failed or was blocked
- 🎤 Focused on **top global artists** and **famous musicians**
- ⏳ Included only **songs released from 2017 to 2024** to ensure relevance
- 🧹 Cleaned out non-lyrical content (e.g., "Verse", "Chorus", "Outro")
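
Removing section labels can also be sketched with a regular expression that targets only bracketed tags, so the same words are left alone when they occur inside a lyric line. This is an illustrative sketch, not the cleaning used to build the dataset; the label list and the helper name `remove_section_tags` are assumptions.

```python
import re

# Bracketed section tags such as "[Verse 1]" or "[Pre-Chorus]"; the label list is an assumption.
SECTION_TAG = re.compile(
    r"\[\s*(?:Intro|Verse|Pre-Chorus|Chorus|Bridge|Outro)[^\]]*\]",
    re.IGNORECASE,
)


def remove_section_tags(lyrics: str) -> str:
    """Remove bracketed section tags and collapse runs of whitespace."""
    without_tags = SECTION_TAG.sub(" ", lyrics)
    return " ".join(without_tags.split())


# Placeholder text, not real lyrics.
sample = "[Verse 1]\nplaceholder line one\n[Chorus]\nthe chorus of placeholder line two"
print(remove_section_tags(sample))
```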

---

### 📁 Dataset Structure

| Column         | Description                            |
|----------------|----------------------------------------|
| `artist`       | Artist name                            |
| `songs`        | Title of the song                      |
| `songs_lyrics` | Cleaned full lyrics                    |

---

### 🔬 Real-World Use Cases

This dataset is ideal for various **data science, NLP, and machine learning tasks**, such as:

- 🧠 **Sentiment Analysis**  
  Understand the emotions conveyed in songs or build models to detect mood.

- 🎧 **Music Recommendation Systems**  
  Recommend songs to users based on lyrical similarity or themes.

- 🗣️ **Topic Modeling & Theme Extraction**  
  Discover recurring themes like love, heartbreak, freedom, etc.

- 🧬 **Stylometry & Artist Profiling**  
  Analyze linguistic style and word choices across different artists or genres.

- 📈 **Trend Analysis Over Time**  
  Observe how lyrical content has evolved from 2017 to 2024.

- 🤖 **Text Generation & Language Modeling**  
  Train models to generate lyrics in the style of particular artists.

- 🔍 **Genre or Gender-Based Analysis**  
  Study differences in lyrics between genres (e.g., pop vs. hip-hop) or male vs. female artists.

---

### 📂 Dataset Access

This dataset is uploaded and publicly available on **Kaggle**:

🔗 [**Download on Kaggle**](https://www.kaggle.com/datasets/uvaissaifi/top-artist-songs-with-lyrics-20172024/data)

---

### 🛠️ Tools & Libraries Used

- `requests`, `BeautifulSoup` — Web scraping from AZLyrics & Genius
- `lyricsgenius` — API access to Genius lyrics
- `pandas`, `re`, `time`, `selenium` — Data parsing and cleaning
- `CSV` export — For structured, portable data format

---

### ✅ Conclusion

This dataset is suited to work on **music intelligence**, **language processing**, and **AI-driven music apps**, since it pairs each song title and artist with cleaned lyrics text.

Whether you're a **researcher, data scientist, or developer**, it offers a starting point for building lyrics-based models.

