# Data downloading from Spotify

We would like to get the musical genre associated to an artist from Spotify.

## Spotify Authentication

In order to use the Spotify API, we need to register on their websites to obtain our client_id and secret. Then we can ask for a token allowing us to get the wanted information on the artists.

In [12]:
import spotipy
import spotipy.oauth2 as oauth2
import spotipy.util as util

import pandas as pd
import re

In [64]:
client_id = "c829ff34022d4c3a9e65c30c10623765"
client_secret = "8e998357ebf04fe092edab4915946e00"

credentials = oauth2.SpotifyClientCredentials(client_id, client_secret)

token = credentials.get_access_token()
sp = spotipy.Spotify(auth=token)

### Mapping Artist to the Genres

Import the artists and already pre-defined genres in a pandas dataframe

In [37]:
filepath = "kaggleDataset\music.csv" #lyrics
lyrics_df = pd.read_csv(filepath)
useful_df = lyrics_df[['artist.name', 'terms']]
useful_df.head()

Unnamed: 0,artist.name,terms
0,Casual,hip hop
1,The Box Tops,blue-eyed soul
2,Sonora Santanera,salsa
3,Adam Ant,pop rock
4,Gob,pop punk


Fetch all the genres of the artists on Spotify

In [65]:
def get_artist_genres(artist_name):
    """ Fetch on Spotify the genres of the artist"""
    query = sp.search(q='artist:' + artist_name, type='artist')
    if len(query['artists']['items']) > 0:
        genres = query['artists']['items'][0]['genres'] # list: ["pop rock", "hard rock", "alternative rock"...]
        return genres
    else:
        #print("{} not on Spotify".format(artist_name))
        return []

    
artist_genre = []
artists = list(set(useful_df['artist.name']))

print("Initial number of artists: {}".format(len(artists)))
for i, artist in enumerate(artists):
    if i % 500 == 0:
        print(i)
    genres = get_artist_genres(artist)
    if len(genres) > 0:
        artist_genre.append([artist, genres])
print("Number of artists with genres on Spotify: {}".format(len(artist_genre)))

Initial number of artists: 2159
0
500
1000
1500
2000
Number of artists with genres on Spotify: 474


In [57]:
artist_genres_df = pd.DataFrame(artist_genre, columns = {'artist', 'genre'})
artist_genres_df.head()

Unnamed: 0,artist,genre
0,Black Eyed Peas,"[dance pop, pop, pop rap]"
1,Jimmy Hughes,[southern soul]
2,Joy Division,"[alternative rock, art rock, dance rock, garag..."
3,Mint Condition,"[dance pop, funk, hip pop, neo soul, new jack ..."
4,Plump DJs,"[big beat, breakbeat, electronic, nu skool bre..."
5,Jope Ruonansuu,"[classic finnish pop, classic iskelma, finnish..."
6,Rick Astley,"[dance rock, europop, new romantic, new wave, ..."
7,Mike Jones,"[dirty south rap, gangster rap, hip hop, pop r..."
8,George Jones,"[country, country gospel, country rock, cowboy..."
9,Kanda Bongo Man,"[afropop, highlife, makossa, mande pop, mbalax..."


Careful with the save as it overrides the actual file, thus I commented the line

In [71]:
# Save the genres:
# artist_genres_fin.to_csv("kaggleDataset/artist_genre.csv", sep=';')

In [7]:
useful_df = lyrics_df[['artist.name', 'terms']]
artist_genre_dict = {}
print(len(useful_df))

# Create d ictionary from the Dataframe
for i, row in useful_df.iterrows():
    artist = row['artist.name']
    genre = row['terms']
    if artist not in artist_genre_dict.keys():
        artist_genre_dict[artist] = genre
    else:
        found_genre = artist_genre_dict[artist]
        if found_genre != genre:
            print("Artist {}, found genre: {}, actual genre: {}".format(artist, found_genre, genre))
            # What to do ?

10000
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist Bert Kaempfert And His Orchestra, found genre: orchestra, actual genre: easy listening
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: southern gospel
Artist Margaret Becker, found genre: ccm, actual genre: reggae
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist Twinkle Twinkle Little Rock Star, found genre: nan, actual genre: nan
Artist John Hammond, found genre: blues, actual genre: blues-rock
Artist fIREHOSE, found genre: alternative rock, actual genre: glam metal
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist Bill & Gloria Gaither, found genre: ccm, actual genre: country gospel
Artist The Plasmatics, found genre: trip hop, actual genre: riot grrrl
Artist Coc

In [30]:
MAIN_GENRES = {'rap', 'metal', 'rock', 'pop', 'blues'}

def get_main_genre(genres, main_genres):
    """ Count occurences of main genres terms in the genres list"""
    all_genres = ' '.join(genres) # string: "pop rock hard rock alternative rock ..."
    main_genres_occ = [(g, len(re.findall(g, all_genres))) for g in main_genres if len(re.findall(g, all_genres)) > 0] # list: [(rock, 3), (pop, 1), (hard, 1)...]
    if len(main_genres_occ) > 0:
        max_genre = sorted(main_genres_occ, key = lambda x: x[1], reverse = True)[0][0]
        return max_genre
    else:
        return "No main genre"
    
def get_artist_genre(artist_name, main_genres = MAIN_GENRES):
    """ Fetch on Spotify the main genre of the artist, depending on the main_genres required"""
    query = sp.search(q='artist:' + artist_name, type='artist')
    if len(query['artists']['items']) > 0:
        genres = query['artists']['items'][0]['genres'] # list: ["pop rock", "hard rock", "alternative rock"...]
        
        main_genre = get_main_genre(genres, main_genres)
        return main_genre
    else:
        print("{} not on Spotify".format(artist_name))
        return "No artist"

We test 2 queries to know if they work correctly:

In [31]:
test1 = get_artist_genre('Northlane')
print(test1)
test2 = get_artist_genre('SomethingRandom')
print(test2)

metal
SomethingRandom not on Spotify
No artist


### Getting genres of a given list of artist

In [9]:
t = ['hardcore des familles', 'rap pop rock']
full_genres = ' '.join(t)
split_genres = full_genres.split(' ')
print(split_genres)

['hardcore', 'des', 'familles', 'rap', 'pop', 'rock']


In [36]:
artist_sp_genres_dict = {}

artist_on_sp = 0
for artist, genre in artist_genre_dict.items():
    sp_genres = get_artist_genre(artist) #
    if sp_genres != "No genre":
        # Artist found on Spotify
        artist_on_sp += 1
        genre = get_artist_genre(artist)
        print(artist, genre)
        artist_sp_genres_dict[artist] = genre
    else:
        print("Not on spotify: {}".format(artist))
        
print("Ratio of artists found: {}".format(float(artist_on_sp)/len(artist_genre_dict.keys())))
            

Casual No main genre
The Box Tops pop
Sonora Santanera rock
Adam Ant rock
Gob pop
Jeff And Sheri Easter No main genre
Rated R No main genre
Tweeterfriendly Music No main genre
Planet P Project No main genre
Clp No main genre
JennyAnyKind No main genre
Wayne Watson rock
Andy Andy No main genre
Bob Azzam No main genre
Lionel Richie rock
Blue Rodeo rock
Richard Souther No main genre
Faiz Ali Faiz No main genre
Tesla rock
lextrical No main genre
Jimmy Wakely No main genre
Alice Stuart No main genre
Elena pop
The Dillinger Escape Plan metal
SUE THOMPSON rock
Five Bolt Main metal
Tim Wilson No main genre
Willie Bobo No main genre
Faye Adams blues
Terry Callier No main genre
John Wesley pop
The Shangri-Las pop
Billie Jo Spears No main genre
Mike Jones (Featuring CJ_ Mello & Lil' Bran) not on Spotify
Mike Jones (Featuring CJ_ Mello & Lil' Bran) not on Spotify
Mike Jones (Featuring CJ_ Mello & Lil' Bran) No artist
Sierra Maestra No main genre
Butthole Surfers rock
Despina Vandi No main genre
Ja

KeyboardInterrupt: 