# Genres Scraping

The code below extracts each artist's genre from two sources: Last.fm (through its API) and Wikipedia (scraped with the BeautifulSoup library).
We used two sources because each was missing genres for some artists. The results were then combined into a single dataset.


In [1]:
pip install beautifulsoup4 requests

Note: you may need to restart the kernel to use updated packages.


In [None]:
import pandas as pd
import re 
from bs4 import BeautifulSoup
import requests
import random

## Retrieve genres from Last.fm

In [None]:

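def get_first_genre_for_artist(artist):
    """Return the artist's first Last.fm tag, or None if no tag is available."""
    api_key = 'fabafa53711c66dad147eda0c4fe07b1'
    url = 'http://ws.audioscrobbler.com/2.0/'
    # Pass the query through `params` so that names containing '&' or spaces are URL-encoded
    params = {'method': 'artist.getinfo', 'artist': artist, 'api_key': api_key, 'format': 'json'}

    response = requests.get(url, params=params)
    data = response.json()

    # Extract the first tag, which we use as the genre
    try:
        genres = data['artist']['tags']['tag']
        return genres[0]['name'] if genres else None
    except (KeyError, IndexError, TypeError):
        return None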

dataset_songs = pd.read_csv('dataset_songs.csv') 


for index, row in dataset_songs.iterrows():
    artist = row['Artist']  
    first_genre = get_first_genre_for_artist(artist)
    dataset_songs.at[index, 'Genres'] = first_genre  


dataset_songs.to_csv('genre_dataset.csv', index=False)
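
Many artists appear on several rows of the chart data, so the loop above asks the API for the same artist more than once. A minimal sketch of caching one result per distinct artist is shown here. `lookup_unique` and `fake_lookup` are hypothetical names introduced for illustration; `fake_lookup` stands in for `get_first_genre_for_artist` so the example runs without network access.

```python
def lookup_unique(artists, lookup):
    """Call `lookup` once per distinct artist and return one result per input row."""
    cache = {}
    results = []
    for artist in artists:
        if artist not in cache:
            cache[artist] = lookup(artist)
        results.append(cache[artist])
    return results


# Offline stand-in for get_first_genre_for_artist (hypothetical data).
calls = []

def fake_lookup(artist):
    calls.append(artist)
    return {"Elvis Presley": "rock n roll"}.get(artist)


genres_demo = lookup_unique(["Elvis Presley", "Brenda Lee", "Elvis Presley"], fake_lookup)
# Three rows are filled, but the lookup ran only twice.
```

In the notebook this would be used as `dataset_songs['Genres'] = lookup_unique(dataset_songs['Artist'], get_first_genre_for_artist)`, assuming the column names shown above.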

## Retrieve genres from Wikipedia

In [111]:
df= pd.read_csv('dataset_songs.csv')
artists= df['Artist'].to_list()
genres= [] 

The following function splits the artist string at the first occurrence of "&", "and", "featuring", or "feat." (or "feat"), which covers the different ways collaborations are written. It returns only the part before the delimiter: Wikipedia pages describe individual artists, so we keep the genre of the first one.

In [2]:

def split_string(input_string):
    
    parts = re.split(r'\s*\&\s*|\s*and\s*|\s*featuring\s*|\s*feat\.?\s*', input_string, maxsplit=1)
    return parts[0]
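
A quick check of how this pattern behaves on names from the dataset. Because `and` is matched anywhere in the string, it also fires inside a name such as "Randy". `BOUNDED` is a hypothetical variant (not part of the original notebook) that assumes word delimiters must be surrounded by whitespace; `first_artist` is a helper introduced only for this comparison.

```python
import re

# The pattern from the cell above, copied verbatim.
ORIGINAL = r'\s*\&\s*|\s*and\s*|\s*featuring\s*|\s*feat\.?\s*'
# Hypothetical variant: same delimiters, but the words must stand alone.
BOUNDED = r'\s*&\s*|\s+and\s+|\s+featuring\s+|\s+feat\.?\s+'


def first_artist(name, pattern):
    """Return the part of `name` before the first delimiter matched by `pattern`."""
    return re.split(pattern, name, maxsplit=1)[0]


print(first_artist("Sam Smith and Kim Petras", ORIGINAL))  # Sam Smith
print(first_artist("Randy & the Rainbows", ORIGINAL))      # R
print(first_artist("Randy & the Rainbows", BOUNDED))       # Randy
```

Even the bounded variant cuts "Randy & the Rainbows" down to "Randy", so splitting on "&" remains a poor fit for band names.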



In [113]:
count=0
f=0

for artist in artists: 
    artist_splitted= split_string(artist)
    
    #Replace spaces with underscores in the artist name
    artist_formatted = artist_splitted.replace(' ', '_')
    print(artist_formatted)
    url = f"https://it.wikipedia.org/wiki/{artist_formatted}"
    print(url)
    
    user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    ]
    #Retrieve the HTML page of the URL, using a randomly chosen User-Agent
    page= requests.get(url, headers={'User-Agent': random.choice(user_agents_list)})

    #Parse the HTML content (page.content) using BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')

    #Find the <a> tag with the title "Genere musicale"
    a_tag = soup.find('a', title="Genere musicale")
    if a_tag:
        #Access the parent <th> tag
        th_tag = a_tag.parent if a_tag.parent.name == 'th' else None
        if not th_tag:
            print("The <a> tag is not contained within a <th> tag.")
            first_genre= 'Genre not found'
            genres.append(first_genre)
            f+=1 
            continue
    else:
        print("No <a> tag with title 'Genere musicale' was found.")
        first_genre= 'Genre not found'
        genres.append(first_genre)
        f+=1 
        continue

    #The genres are listed in the <td> cell that follows the <th> header; keep the first link
    td_tag= th_tag.find_next_sibling('td')
    first_link = td_tag.find('a') if td_tag else None
    first_genre = first_link.text if first_link else 'Genre not found'
    print(first_genre)
    genres.append(first_genre)
    count+=1
    print('Scraped:', count)
    print('Failed:', f)

    

Percy_Faith
https://it.wikipedia.org/wiki/Percy_Faith
Easy listening
Scraped: 1
Failed: 0
Jim_Reeves
https://it.wikipedia.org/wiki/Jim_Reeves
Country
Scraped: 2
Failed: 0
The_Everly_Brothers
https://it.wikipedia.org/wiki/The_Everly_Brothers
Rock and roll
Scraped: 3
Failed: 0
Johnny_Preston
https://it.wikipedia.org/wiki/Johnny_Preston
Pop
Scraped: 4
Failed: 0
Mark_Dinning
https://it.wikipedia.org/wiki/Mark_Dinning
Pop
Scraped: 5
Failed: 0
Brenda_Lee
https://it.wikipedia.org/wiki/Brenda_Lee
Pop
Scraped: 6
Failed: 0
Elvis_Presley
https://it.wikipedia.org/wiki/Elvis_Presley
Rock and roll
Scraped: 7
Failed: 0
Jimmy_Jones
https://it.wikipedia.org/wiki/Jimmy_Jones
No <a> tag with title 'Genere musicale' was found.
Elvis_Presley
https://it.wikipedia.org/wiki/Elvis_Presley
Rock and roll
Scraped: 8
Failed: 1
Chubby_Checker
https://it.wikipedia.org/wiki/Chubby_Checker
Twist
Scraped: 9
Failed: 1
Connie_Francis
https://it.wikipedia.org/wiki/Connie_Francis
Musica leggera
Scraped: 10
Failed: 1
Bobby_

In [114]:
df['tag']= genres

In [116]:
df.to_csv('dataset_songs_genres.csv', index=False)

## Combine genres scraped from Wikipedia and Last.fm

In [132]:
df1= pd.read_csv('dataset_songs_genres.csv')
df1.drop('Unnamed: 0', axis=1, inplace=True)


In [133]:
df1= df1[df1['Year']!=2023].reset_index(drop=True)
df1

| | Year | Song Title | Artist | tag |
|---|---|---|---|---|
| 0 | 1960 | "Theme from A Summer Place" | Percy Faith | Easy listening |
| 1 | 1960 | "He'll Have to Go" | Jim Reeves | Country |
| 2 | 1960 | "Cathy's Clown" | The Everly Brothers | Rock and roll |
| 3 | 1960 | "Running Bear" | Johnny Preston | Pop |
| 4 | 1960 | "Teen Angel" | Mark Dinning | Pop |
| ... | ... | ... | ... | ... |
| 6296 | 2022 | "Flower Shops" | Ernest featuring Morgan Wallen | Genre not found |
| 6297 | 2022 | "To the Moon" | Jnr Choi and Sam Tompkins | Genre not found |
| 6298 | 2022 | "Unholy" | Sam Smith and Kim Petras | Pop soul |
| 6299 | 2022 | "One Mississippi" | Kane Brown | Country |


In [134]:
df2= pd.read_excel("genre_dataset.xlsx")

df2.columns= ['Unnamed: 0', 'Year', 'Song Title', 'Artist', 'Genre']
df2.drop('Unnamed: 0', axis=1, inplace=True)
df2= df2[df2['Year']!=2023].reset_index(drop=True)


In [135]:
df2= df2.iloc[1:]
df2.reset_index(drop=True, inplace=True)
df2

| | Year | Song Title | Artist | Genre |
|---|---|---|---|---|
| 0 | 1960 | "Theme from A Summer Place" | Percy Faith | easy listening |
| 1 | 1960 | "He'll Have to Go" | Jim Reeves | country |
| 2 | 1960 | "Cathy's Clown" | The Everly Brothers | oldies |
| 3 | 1960 | "Running Bear" | Johnny Preston | 60s |
| 4 | 1960 | "Teen Angel" | Mark Dinning | oldies |
| ... | ... | ... | ... | ... |
| 6296 | 2022 | "Flower Shops" | Ernest featuring Morgan Wallen | |
| 6297 | 2022 | "To the Moon" | Jnr Choi and Sam Tompkins | |
| 6298 | 2022 | "Unholy" | Sam Smith and Kim Petras | |
| 6299 | 2022 | "One Mississippi" | Kane Brown | country |


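The following loop fills in the genres missing from df1 (scraped from Wikipedia) with the corresponding values from df2 (retrieved from Last.fm). Rows are matched by position, so both DataFrames must list the same songs in the same order.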

In [136]:
for index, row in df1.iterrows(): 
    if row['tag']=='Genre not found':
        #if row['Song Title']== df2.loc[index, 'Song Title']:
        df1.at[index, 'tag']= df2.loc[index, 'Genre']
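
The same fill can be written without an explicit loop. The sketch below uses two small made-up frames in place of `df1` and `df2`, and assumes we want to verify that the song titles line up before trusting the positional match (the check that is commented out in the loop above).

```python
import pandas as pd

# Toy stand-ins for df1 (Wikipedia) and df2 (Last.fm); the values are invented.
wiki = pd.DataFrame({
    "Song Title": ["A", "B", "C"],
    "tag": ["Pop", "Genre not found", "Genre not found"],
})
lastfm = pd.DataFrame({
    "Song Title": ["A", "B", "C"],
    "Genre": ["pop", "country", None],
})

# Positional filling is only meaningful if both frames list the same songs in the same order.
assert wiki["Song Title"].equals(lastfm["Song Title"])

# Overwrite only the rows where the Wikipedia scrape failed.
missing = wiki["tag"] == "Genre not found"
wiki.loc[missing, "tag"] = lastfm.loc[missing, "Genre"]
print(wiki)
```

Rows that neither source could resolve end up as missing values, which is what the second Wikipedia attempt then looks for.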
        

### Second attempt to get the missing genres from Wikipedia.

Since there are still missing genres in the resulting dataset, we make a second attempt to retrieve them from Wikipedia.

In [137]:
df1[df1['tag'].isna()]

| | Year | Song Title | Artist | tag |
|---|---|---|---|---|
| 61 | 1960 | "Image of a Girl" | The Safaris & The Phantom's Band | |
| 286 | 1962 | "Percolator (Twist)" | Billy Joe and the Checkmates | |
| 298 | 1962 | "I Wish That We Were Married" | Ronnie & the Hi-Lites | |
| 318 | 1963 | "Mockingbird" | Inez and Charlie Foxx | |
| 326 | 1963 | "Denise" | Randy & the Rainbows | |
| ... | ... | ... | ... | ... |
| 6273 | 2022 | "Better Days" | Neiked, Mae Muller and Polo G | |
| 6274 | 2022 | "Meet Me at Our Spot" | The Anxiety: Willow and Tyler Cole | |
| 6292 | 2022 | "Never Say Never" | Cole Swindell and Lainey Wilson | |
| 6296 | 2022 | "Flower Shops" | Ernest featuring Morgan Wallen | |


In [47]:
artists_not_found= df1[df1['tag'].isna()]['Artist'].tolist()
artists_not_found

["The Safaris & The Phantom's Band",
 'Billy Joe and the Checkmates',
 'Ronnie & the Hi-Lites',
 'Inez and Charlie Foxx',
 'Randy & the Rainbows',
 'Ruby & the Romantics',
 'Garnet Mimms & the Enchanters',
 'J. Frank Wilson and the Cavaliers',
 'Ronny & the Daytonas',
 'Chad & Jeremy',
 'Gary Lewis & the Playboys',
 'Gary Lewis & the Playboys',
 'Gary Lewis & the Playboys',
 'Dino, Desi & Billy',
 'Paul Revere & the Raiders',
 'Paul Revere & the Raiders',
 'Gary Lewis & the Playboys',
 'Paul Revere & the Raiders',
 'Mitch Ryder & The Detroit Wheels',
 'Gary Lewis & the Playboys',
 'Jay & the Techniques',
 'John Fred & His Playboy Band',
 'Friend & Lover',
 'Tommy Boyce & Bobby Hart',
 'Gary Lewis & the Playboys',
 'Jr. Walker & The All Stars',
 '"Mama" Cass Elliot',
 'Checkmates, Ltd.',
 'Joe Simon (singer)',
 'Paul Revere & the Raiders',
 'Paul Revere & the Raiders',
 'Pacific Gas & Electric',
 'Hamilton, Joe Frank & Reynolds',
 'Delaney & Bonnie & Friends',
 'Mac and Katie Kissoon',


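This revised version of split_string is more conservative. If the name contains "featuring", it splits only there; otherwise it splits at the first standalone "and" or at the first comma. It no longer splits on "&", so band names such as "Randy & the Rainbows" are kept whole.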

In [3]:


def split_string(input_string):
    #First check if 'featuring' is present; if so, split only on 'featuring'
    if 'featuring' in input_string:
        parts = re.split(r'\s*featuring\s*', input_string, maxsplit=1)
    else:
        #Otherwise split on the first standalone 'and' (word boundaries keep names like 'Randy' intact) or on a comma
        parts = re.split(r'\b\s*and\s*\b|\s*,\s*', input_string, maxsplit=1)
    
    #Return the first element from the split result
    return parts[0]




In [49]:
artists_not_found= [split_string(artist) for artist in artists_not_found ]
artists_not_found

["The Safaris & The Phantom's Band",
 'Billy Joe',
 'Ronnie & the Hi-Lites',
 'Inez',
 'Randy & the Rainbows',
 'Ruby & the Romantics',
 'Garnet Mimms & the Enchanters',
 'J. Frank Wilson',
 'Ronny & the Daytonas',
 'Chad & Jeremy',
 'Gary Lewis & the Playboys',
 'Gary Lewis & the Playboys',
 'Gary Lewis & the Playboys',
 'Dino',
 'Paul Revere & the Raiders',
 'Paul Revere & the Raiders',
 'Gary Lewis & the Playboys',
 'Paul Revere & the Raiders',
 'Mitch Ryder & The Detroit Wheels',
 'Gary Lewis & the Playboys',
 'Jay & the Techniques',
 'John Fred & His Playboy Band',
 'Friend & Lover',
 'Tommy Boyce & Bobby Hart',
 'Gary Lewis & the Playboys',
 'Jr. Walker & The All Stars',
 '"Mama" Cass Elliot',
 'Checkmates',
 'Joe Simon (singer)',
 'Paul Revere & the Raiders',
 'Paul Revere & the Raiders',
 'Pacific Gas & Electric',
 'Hamilton',
 'Delaney & Bonnie & Friends',
 'Mac',
 'Paul Stookey',
 'Mouth & MacNeal',
 'Dr. Hook & the Medicine Show',
 'Jerry Butler & Brenda Lee Eager',
 'Origin

In [50]:

count=0
f=0
genres= []

for artist in artists_not_found: 
    artist_splitted= split_string(artist)
    
    #Replace spaces with underscores in the artist name
    artist_formatted = artist_splitted.replace(' ', '_')
    print(artist_formatted)
    url = f"https://it.wikipedia.org/wiki/{artist_formatted}"
    print(url)
    
    user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    ]
    #Retrieve the HTML page of the URL, using a randomly chosen User-Agent
    page= requests.get(url, headers={'User-Agent': random.choice(user_agents_list)})

    #Parse the HTML content (page.content) using BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')

    #Find the <a> tag with the title "Genere musicale"
    a_tag = soup.find('a', title="Genere musicale")
    if a_tag:
        #Access the parent <th> tag
        th_tag = a_tag.parent if a_tag.parent.name == 'th' else None
        if not th_tag:
            print("The <a> tag is not contained within a <th> tag.")
            first_genre= 'Genre not found'
            genres.append(first_genre)
            f+=1 
            continue
    else:
        print("No <a> tag with title 'Genere musicale' was found.")
        first_genre= 'Genre not found'
        genres.append(first_genre)
        f+=1 
        continue

    #The genres are listed in the <td> cell that follows the <th> header; keep the first link
    td_tag= th_tag.find_next_sibling('td')
    first_link = td_tag.find('a') if td_tag else None
    first_genre = first_link.text if first_link else 'Genre not found'
    print(first_genre)
    genres.append(first_genre)
    count+=1
    print('Scraped:', count)
    print('Failed:', f)

The_Safaris_&_The_Phantom's_Band
https://it.wikipedia.org/wiki/The_Safaris_&_The_Phantom's_Band
No <a> tag with title 'Genere musicale' was found.
Billy_Joe
https://it.wikipedia.org/wiki/Billy_Joe
No <a> tag with title 'Genere musicale' was found.
Ronnie_&_the_Hi-Lites
https://it.wikipedia.org/wiki/Ronnie_&_the_Hi-Lites
No <a> tag with title 'Genere musicale' was found.
Inez
https://it.wikipedia.org/wiki/Inez
No <a> tag with title 'Genere musicale' was found.
Randy_&_the_Rainbows
https://it.wikipedia.org/wiki/Randy_&_the_Rainbows
Doo-wop
Scraped: 1
Failed: 4
Ruby_&_the_Romantics
https://it.wikipedia.org/wiki/Ruby_&_the_Romantics
No <a> tag with title 'Genere musicale' was found.
Garnet_Mimms_&_the_Enchanters
https://it.wikipedia.org/wiki/Garnet_Mimms_&_the_Enchanters
No <a> tag with title 'Genere musicale' was found.
J._Frank_Wilson
https://it.wikipedia.org/wiki/J._Frank_Wilson
No <a> tag with title 'Genere musicale' was found.
Ronny_&_the_Daytonas
https://it.wikipedia.org/wiki/Ronny_&

In [100]:
#Keep only the genres that were actually found (drop the 'Genre not found' placeholders)

new_genres= []
for i in genres:
    if i!='Genre not found':
        new_genres.append(i)

print(new_genres)

['Doo-wop', 'Folk', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Blues rock', 'Pop', 'Rock', 'Rock', 'Pop', 'Rock', 'Soul', 'Pop rock', 'Pop rock', 'Pop rock', 'Pop rock', 'Pop rock', 'Pop rock', 'Pop', 'Pop rap', 'Pop rap', 'Pop rap', 'Pop rap', 'Hip hop', 'Pop rap', 'Contemporary R&B', 'Country', 'Hip hop', 'Pop', 'Pop', 'Pop', 'Pop', 'Musica elettronica', 'Country', 'Pop', 'Pop', 'Pop', 'Hip hop', 'Pop', 'Pop', 'Musica elettronica', 'Hip hop', 'Electro house', 'Trap', 'Pop', 'Pop', 'Trap', 'Southern hip hop', 'Hip hop', 'Pop', 'Country', 'Pop', 'Country', 'Pop', 'Hip hop', 'Hip hop']
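
To see how many lookups failed and which genres dominate, the list can be tallied instead of filtered. A small sketch using the standard library follows; `summarize` is a hypothetical helper and the input list is invented for the example.

```python
from collections import Counter


def summarize(genres, missing_label="Genre not found"):
    """Return the number of failed lookups and the found genres ranked by frequency."""
    counts = Counter(genres)
    n_missing = counts.pop(missing_label, 0)
    return n_missing, counts.most_common()


n_missing, ranking = summarize(["Rock", "Genre not found", "Pop", "Rock", "Genre not found"])
print(n_missing)  # 2
print(ranking)    # [('Rock', 2), ('Pop', 1)]
```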


In [101]:
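#Assign the genres in new_genres, in order, to the rows of df1 whose 'tag' is still missing.
#Caution: the failed lookups were removed from new_genres, so it is shorter than the list of missing rows.
#A genre is therefore not guaranteed to land on the artist it was scraped for, and pop(0) raises IndexError once the list is empty.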

df= df1[df1['tag'].isna()]
for index, row in df.iterrows(): 
    print(index)
    df1.at[index, 'tag']= new_genres.pop(0)

61
286
298
318
326
332
342
408
438
449
516
558
560
596
608
660
686
690
695
698
720
824
848
871
894
919
954
965
974
994
999
1093
1142
1167
1171
1192
1225
1264
1284
1290
1351
1361
1456
1481
1545
1574
1597
1600
1621
1642
1773
1800
1828
1879
1968
2024
2038
2076
2129


IndexError: pop from empty list
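
The IndexError arises because the filtered list is shorter than the set of rows being filled. A sketch of an alternative that keeps every scrape result, including the failures, paired with the row it belongs to is given below. The frame and the `scraped` list are invented stand-ins for `df1` and the unfiltered `genres` list; this illustrates the pairing idea and is not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy stand-in for df1: three rows still lack a genre.
frame = pd.DataFrame({
    "Artist": ["A & B", "C", "D and E", "F"],
    "tag": [None, "Pop", None, None],
})
# One scrape result per missing row, in the same order as the rows were listed.
scraped = ["Genre not found", "Rock", "Genre not found"]

missing_index = frame.index[frame["tag"].isna()]
# Every missing row must have exactly one result, found or not.
assert len(missing_index) == len(scraped)

for idx, genre in zip(missing_index, scraped):
    if genre != "Genre not found":
        frame.at[idx, "tag"] = genre

print(frame)
```

Only the row whose lookup succeeded is updated; the others stay missing and can be reported afterwards.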

In [102]:
df1[df1['tag'].isna()] #Genres not found 

| | Year | Song Title | Artist | tag |
|---|---|---|---|---|
| 2129 | 1981 | "Guilty" | Barbra Streisand & Barry Gibb | |
| 2147 | 1981 | "The Breakup Song (They Don't Write 'Em)" | The Greg Kihn Band | |
| 2185 | 1981 | "What Kind of Fool" | Barbra Streisand & Barry Gibb | |
| 2242 | 1982 | "Pac-Man Fever" | Buckner & Garcia | |
| 2321 | 1983 | "Jeopardy" | The Greg Kihn Band | |
| ... | ... | ... | ... | ... |
| 6273 | 2022 | "Better Days" | Neiked, Mae Muller and Polo G | |
| 6274 | 2022 | "Meet Me at Our Spot" | The Anxiety: Willow and Tyler Cole | |
| 6292 | 2022 | "Never Say Never" | Cole Swindell and Lainey Wilson | |
| 6296 | 2022 | "Flower Shops" | Ernest featuring Morgan Wallen | |


In [123]:
top_100= pd.read_csv('Billboard_top.csv')
top_100= top_100[top_100['Year']!=2023].reset_index(drop=True)


In [124]:
#Add the genre column to the top 100 dataset
genres= df1['tag'].to_list()
top_100['Genre']= genres
top_100

| | Unnamed: 0 | Year | Song Title | Artist | Label | Lyrics | URL | Genre |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1960 | Theme from A Summer Place | Percy Faith | 1 | There's a summer placeWhere it may rain or sto... | https://genius.com/percy-faith-theme-from-a-su... | Easy listening |
| 1 | 1 | 1960 | He'll Have to Go | Jim Reeves | 1 | Put your sweet lips a little closer to the pho... | https://genius.com/jim-reeves-hell-have-to-go-... | Country |
| 2 | 2 | 1960 | Cathy's Clown | The Everly Brothers | 1 | Don't want your love anymoreDon't want your ki... | https://genius.com/the-everly-brothers-cathys-... | Rock and roll |
| 3 | 3 | 1960 | Running Bear | Johnny Preston | 1 | \*vocalizations\*On the bank of the riverStood R... | https://genius.com/johnny-preston-running-bear... | Pop |
| 4 | 4 | 1960 | Teen Angel | Mark Dinning | 1 | Teen AngelTeen AngelTeen AngelThat fateful nig... | https://genius.com/mark-dinning-teen-angel-lyrics | Pop |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6296 | 6296 | 2022 | Flower Shops | Ernest featuring Morgan Wallen | 1 | It's a beautiful day, she's been cryin' all ni... | https://genius.com/ernest-flower-shops-lyrics | |
| 6297 | 6297 | 2022 | To the Moon | Jnr Choi and Sam Tompkins | 1 | Sit by myselfTalking to the moonTeh, ha, pull ... | https://genius.com/jnr-choi-to-the-moon-lyrics | |
| 6298 | 6298 | 2022 | Unholy | Sam Smith and Kim Petras | 1 | Mummy don't know Daddy's getting hot At the Bo... | | Pop soul |
| 6299 | 6299 | 2022 | One Mississippi | Kane Brown | 1 | You and IHad this off and on so longYou've bee... | https://genius.com/kane-brown-one-mississippi-... | Country |


In [125]:
#Remove songs with no lyrics from the final dataset
#(filter and drop in one chained expression, so no in-place change is made on a slice)

has_lyrics = top_100['Lyrics'].notna() & (top_100['Lyrics'] != "Lyrics not found.")
top_100 = top_100[has_lyrics].drop(columns='Unnamed: 0').reset_index(drop=True)


## Final dataset with genres 

In [131]:
top_100.to_csv('Billboard_top_genres.csv', index=False)