**This notebook gathers Spotify artist data by matching Peloton artists against Kaggle dataset 2 (the larger one), then uses spotipy to look up any artists left unmatched. The final output, `larger_spot_artist_id_class_count`, is pickled at the end of this notebook.**

**I kept track of my "notes to self" in Next Steps where I was documenting possible next moves with dataset creation. The bolded bullets are the paths I took. I have kept other bullets for future iterations.**

**Move on to notebook 5_.. for similar work on songs from Peloton dataset.**

### Next Steps
- **Continue down artist path - join with kaggle dataset**
    - https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_by_artist_o.csv
    - data_by_artist_o.csv
    - but this is by artist name... 
- **Continue down artist path - use Spotify API to gather this same info, but using API and IDs collected**
    - would make the most sense
- Go down genre path
    - https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_by_artist_o.csv
    - data_by_genres_o.csv
- Try same exercise with artists but for songs - can't zip both, but can search songs and take top title
    - need to look more into what info to use
    - tracks.csv by track id and track name, artist id and artist name - has popularity/danceability/energy
    - data_o.csv by track id, has artist name, has more audio information

In [2]:
# Basics 
import pandas as pd
import numpy as np

# Text Cleaning
import re

#Spotify
# Source: https://betterprogramming.pub/how-to-extract-any-artists-data-using-spotify-s-api-python-and-spotipy-4c079401bc37
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import time 

# Importing dfs with heavy processing
import pickle

In [3]:
# import peloton data
# Thank you to okaykristinakay on Reddit for the data!
# Source: https://www.reddit.com/r/pelotoncycle/comments/m18xnr/peloton_class_list_march_update/
df = pd.read_excel('../../../data/original_datasets/AGF_Peloton Classes March.xlsx')
df.head()

Unnamed: 0,classId,className,classDescription,classDifficulty,classDuration,classType,classLength,classLocation,classOriginalAirdate,classRating,classRatingCount,instructorName,instructorBio,classEquipment,classSongs,classArtists,classUrl
0,7f66378211c9476b9b5619bf989f91d0,20 min Peace Meditation,A guided meditation that focuses on cultivatin...,4.3333,20,Meditation,23,psny-studio-2,2021-09-03 13:25:00,0.9847,131,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...",Yoga Block,Meditation 22,RIOPY,https://members.onepeloton.com/classes/bootcam...
1,54ac61803b364b2fa8378acd9f593cdb,15 min Bodyweight Strength,"No equipment, no problem. Join us for a high-e...",5.7755,15,Strength,19,psny-studio-3,2021-09-03 13:19:00,0.9933,297,Olivia Amato,"Born and raised in New York, Olivia grew up pl...",Workout Mat,"California Gurls,Let's Get Loud,Let It Rock (f...","Katy Perry,Snoop Dogg,Jennifer Lopez,Kevin Rud...",https://members.onepeloton.com/classes/bootcam...
2,c75fd4831573483c9d45739aae11d083,20 min Focus Flow: Lower Body,This yoga flow class focuses on poses that eng...,4.3664,20,Yoga,23,psny-studio-2,2021-09-03 12:51:00,1.0,159,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...","Yoga Blanket,Yoga Block,Yoga Mat","Interlude No 1,Oceansize,She Just Likes To Fig...","James Vincent McMorrow,Oh Wonder,Four Tet,Grim...",https://members.onepeloton.com/classes/bootcam...
3,470086936f7a4723ab5a53cb80b571ff,45 min Pop Bootcamp,Split your workout 50/50 between cardio on the...,7.8312,45,Tread Bootcamp,50,psny-studio-4,2021-09-03 11:56:00,0.9737,152,Olivia Amato,"Born and raised in New York, Olivia grew up pl...","Workout Mat,Medium Weights",34+35 (Remix) (feat. Doja Cat & Megan Thee Sta...,"Ariana Grande,Doja Cat,Megan Thee Stallion,Jus...",https://members.onepeloton.com/classes/bootcam...
4,9680a817bf2149d2b91990c87166a400,20 min Pop Ride,We dare you not to dance as you ride to all th...,7.4,20,Cycling,24,uk,2021-09-03 07:52:00,1.0,82,Sam Yo,Sam is a pro at many things but shines when it...,,"Señorita,Marry You,Irreplaceable,What Do You M...","Justin Timberlake,Bruno Mars,Beyoncé,Justin Bi...",https://members.onepeloton.com/classes/bootcam...


In [8]:
# Opening pickle files
# Use classId where you can as key for merging
base_clean_df = pd.read_pickle("../../../data/pickled_dfs/base_clean_df.pkl")
artist_df = pd.read_pickle("../../../data/pickled_dfs/artist_df.pkl")
artist_class_count_df = pd.read_pickle("../../../data/pickled_dfs/artist_class_count_df.pkl")
songs_df = pd.read_pickle("../../../data/pickled_dfs/songs_df.pkl")
song_class_count_df = pd.read_pickle("../../../data/pickled_dfs/song_class_count_df.pkl")
class_type_reorg_df = pd.read_pickle("../../../data/pickled_dfs/class_type_reorg_df.pkl")
new_class_types_with_artists = pd.read_pickle("../../../data/pickled_dfs/new_class_types_with_artists.pkl")

In [4]:
base_clean_df
# Difference from df:
# Dropped null 'instructorBio' rows - only 9
# Dropped rows with null classSongs/classArtists
# Dropped row value with classSongs as float (1 row)
# Populate rows with classType Cycling and classEquipment None to classEquipment = Bike

Unnamed: 0,classId,className,classDescription,classDifficulty,classDuration,classType,classLength,classLocation,classOriginalAirdate,classRating,classRatingCount,instructorName,instructorBio,classEquipment,classSongs,classArtists,classUrl
0,7f66378211c9476b9b5619bf989f91d0,20 min Peace Meditation,A guided meditation that focuses on cultivatin...,4.3333,20,Meditation,23,psny-studio-2,2021-09-03 13:25:00,0.9847,131,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...",Yoga Block,Meditation 22,RIOPY,https://members.onepeloton.com/classes/bootcam...
1,54ac61803b364b2fa8378acd9f593cdb,15 min Bodyweight Strength,"No equipment, no problem. Join us for a high-e...",5.7755,15,Strength,19,psny-studio-3,2021-09-03 13:19:00,0.9933,297,Olivia Amato,"Born and raised in New York, Olivia grew up pl...",Workout Mat,"California Gurls,Let's Get Loud,Let It Rock (f...","Katy Perry,Snoop Dogg,Jennifer Lopez,Kevin Rud...",https://members.onepeloton.com/classes/bootcam...
2,c75fd4831573483c9d45739aae11d083,20 min Focus Flow: Lower Body,This yoga flow class focuses on poses that eng...,4.3664,20,Yoga,23,psny-studio-2,2021-09-03 12:51:00,1.0000,159,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...","Yoga Blanket,Yoga Block,Yoga Mat","Interlude No 1,Oceansize,She Just Likes To Fig...","James Vincent McMorrow,Oh Wonder,Four Tet,Grim...",https://members.onepeloton.com/classes/bootcam...
3,470086936f7a4723ab5a53cb80b571ff,45 min Pop Bootcamp,Split your workout 50/50 between cardio on the...,7.8312,45,Tread Bootcamp,50,psny-studio-4,2021-09-03 11:56:00,0.9737,152,Olivia Amato,"Born and raised in New York, Olivia grew up pl...","Workout Mat,Medium Weights",34+35 (Remix) (feat. Doja Cat & Megan Thee Sta...,"Ariana Grande,Doja Cat,Megan Thee Stallion,Jus...",https://members.onepeloton.com/classes/bootcam...
4,9680a817bf2149d2b91990c87166a400,20 min Pop Ride,We dare you not to dance as you ride to all th...,7.4000,20,Cycling,24,uk,2021-09-03 07:52:00,1.0000,82,Sam Yo,Sam is a pro at many things but shines when it...,Bike,"Señorita,Marry You,Irreplaceable,What Do You M...","Justin Timberlake,Bruno Mars,Beyoncé,Justin Bi...",https://members.onepeloton.com/classes/bootcam...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16743,a144fcf1595740caa610f6484deb3cdb,20 min Funk Fun Run,This 20 minute Fun Run is a musically driven c...,6.4567,20,Running,23,nyc-gw-tread,14/09/2018 10:22,0.9933,150,Becs Gentry,Becs joins Peloton from London as an accomplis...,,"Man Funk (feat. Leron Thomas),If It Ain't Funk...","Guts,Leron Thomas,Soopasoul,FDEL,Basement Frea...",https://members.onepeloton.com/classes/bootcam...
16744,bd89a56c884b48e59ce9c43ab4f4d86f,45 min 90s Pop Ride,The 90s were a booming and inventive time in m...,8.1481,45,Cycling,49,nyc,13/09/2018 23:19,0.9925,1877,Denis Morton,"Raised in Florida, schooled in Tennessee, stee...",Light Weights,"Too Close - Radio Edit,Motownphilly - Original...","Next,Boyz II Men,Blues Traveler,Eagle-Eye Cher...",https://members.onepeloton.com/classes/bootcam...
16745,e5a99c83d7fc4f0686c68b02e5932204,10 min Arms Toning,Join Jess as she takes you through this 10-min...,7.4028,10,Strength,11,nyc,13/09/2018 21:02,0.9978,4481,Jess King,Jess is a charismatic instructor with a boundl...,Light Weights,"Hustle,Day and Night,Just Got Paid,Waiting for...","Daphne Willis,Gizzle,Lo Air,Sigala,Ella Eyre,M...",https://members.onepeloton.com/classes/bootcam...
16746,c57e123004d04df1bfae94c5e4399bb3,45 min 90s Ride,The 90s were a booming and inventive time in m...,8.3982,45,Cycling,49,nyc,13/09/2018 11:53,0.9922,2307,Emma Lovewell,Emma Lovewell is a Martha’s Vineyard native an...,Light Weights,"Show Me Love - Radio Version,Don't Speak,Gonna...","Robyn,No Doubt,C & C Music Factory,Technotroni...",https://members.onepeloton.com/classes/bootcam...


Note:
classId = 8840f82cdc624aa28a5d1babc51a3916 is the row to watch out for (its classSongs value is a float)

In [5]:
# Check lengths of dfs
print(f'artist_df: {len(artist_df)}')
print(f'artist_class_count_df: {len(artist_class_count_df)}')
print(f'songs_df: {len(songs_df)}')
print(f'song_class_count_df: {len(song_class_count_df)}')
print(f'class_type_reorg_df: {len(class_type_reorg_df)}')
print(f'new_class_types_with_artists: {len(new_class_types_with_artists)}')
# No inconsistencies where there shouldn't be

artist_df: 16748
artist_class_count_df: 10941
songs_df: 16748
song_class_count_df: 29904
class_type_reorg_df: 16748
new_class_types_with_artists: 16748


#### Before cleaning Artist List try to match to Kaggle Spotify df for artist IDs

In [6]:
artist_class_count_df = artist_class_count_df.sort_values(by='Artist').reset_index(drop=True)

In [7]:
artist_class_count_df.rename(columns={'Artist':'name'}, inplace=True)

In [8]:
artist_class_count_df

Unnamed: 0,name,Class Count
0,ACE HOOD,1
1,AKON,1
2,Cambridge,1
3,Cr3on,6
4,DJ Deeon,13
...,...,...
10936,yutaka hirasaka,1
10937,Álvaro Faria,4
10938,Ásgeir,12
10939,Ådå,1


In [9]:
# Import LARGER Kaggle Spotify artist IDs for matching
# Source: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_by_artist_o.csv
kag2_spot_artist_df = pd.read_csv('kag2_spot_artists.zip', compression='zip')
kag2_spot_artist_df.head()

Unnamed: 0,id,followers,genres,name,popularity
0,0DheY5irMjBUeLybbCUEZ2,0.0,[],Armid & Amir Zare Pashai feat. Sara Rouzbehani,0
1,0DlhY15l3wsrnlfGio2bjU,5.0,[],ปูนา ภาวิณี,0
2,0DmRESX2JknGPQyO15yxg7,0.0,[],Sadaa,0
3,0DmhnbHjm1qw6NCYPeZNgJ,0.0,[],Tra'gruda,0
4,0Dn11fWM7vHQ3rinvWEl4E,2.0,[],Ioannis Panoutsopoulos,0


In [10]:
artist_class_count_spotid = artist_class_count_df.merge(kag2_spot_artist_df, on='name', how='left', indicator=True)
# matched slightly more than first Kaggle dataset in notebook 3_...v1

In [11]:
len(artist_class_count_spotid)

12266

In [12]:
artist_class_count_spotid['_merge'].value_counts()
# Row count rose from 10941 to 12266: the left join added duplicate rows
# because some artist names map to more than one Spotify id

both          10342
left_only      1924
right_only        0
Name: _merge, dtype: int64

In [13]:
artist_class_count_spotid[artist_class_count_spotid['name'].duplicated(keep='first')].sort_values(by='name')['name']

56                  10
63                1991
98                 4AM
137      A R I Z O N A
156                ABC
             ...      
12155             Zion
12157             Zion
12162             Zion
12187           anders
12263           Ásgeir
Name: name, Length: 1325, dtype: object

In [14]:
# Drop duplicate rows, keeping the first Spotify id matched (removes the 1325 extra rows: 12266 - 10941)
artist_class_count_spotid.drop_duplicates(subset='name', keep='first', inplace=True)

In [15]:
# Reset index post drops
artist_class_count_spotid = artist_class_count_spotid.reset_index(drop=True)
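Keeping the first match means the Spotify id that survives depends on row order in the Kaggle file. One possible alternative, sketched below on toy data, is to sort by `popularity` first so the most popular candidate is kept for each name. The frames, names, and ids here are made up for illustration, and using popularity as the tie-breaker is an assumption, not what this notebook did.

```python
import pandas as pd

# Toy stand-ins for the Peloton artist counts and the Kaggle artist table (hypothetical values).
artists = pd.DataFrame({"name": ["Zion", "ABC"], "Class Count": [3, 5]})
kaggle = pd.DataFrame({
    "id": ["id_a", "id_b", "id_c"],
    "name": ["Zion", "Zion", "ABC"],
    "popularity": [12, 60, 45],
})

# The left join yields one row per (artist, candidate id) pair.
merged = artists.merge(kaggle, on="name", how="left")

# Sort so the most popular candidate comes first within each name,
# then keep a single row per artist.
best_match = (
    merged.sort_values(["name", "popularity"], ascending=[True, False])
          .drop_duplicates(subset="name", keep="first")
          .reset_index(drop=True)
)
print(best_match)
```

Rows without a match have a missing `popularity`; `sort_values` places missing values last by default, so they are only kept when an artist has no candidate at all.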

In [16]:
artist_class_count_spotid

Unnamed: 0,name,Class Count,id,followers,genres,popularity,_merge
0,ACE HOOD,1,,,,,left_only
1,AKON,1,,,,,left_only
2,Cambridge,1,,,,,left_only
3,Cr3on,6,,,,,left_only
4,DJ Deeon,13,,,,,left_only
...,...,...,...,...,...,...,...
10936,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0,both
10937,Álvaro Faria,4,,,,,left_only
10938,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0,both
10939,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0,both


In [17]:
artist_class_count_spotid['id'].isnull().sum()
# ~1924 artists left to match

1924

In [18]:
len(artist_class_count_spotid)

10941

In [19]:
leftover_artists = list(artist_class_count_spotid[artist_class_count_spotid['id'].isnull()]['name'])

Clean leftover artists for Spotify API work

In [20]:
# Remove leading and trailing spaces from values
leftover_artists = [x.strip() for x in leftover_artists]

In [21]:
# Remove punctuation
leftover_artists = [re.sub(r'[^\w\s]', '', (x)) for x in leftover_artists]

In [22]:
leftover_artists = sorted(leftover_artists)

In [23]:
# Spaces after punctuation removed, re-strip
leftover_artists = [x.strip() for x in leftover_artists]

In [24]:
# First value is a blank space
leftover_artists = leftover_artists[1:]

In [25]:
len(leftover_artists)

1923
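The cleaning steps above overwrite the original names, so a cleaned name such as `Adam  the Ants` no longer equals the name stored in `artist_class_count_spotid`, and a later join on `name` can miss it. A minimal sketch of one way around this, assuming it is acceptable to carry both versions, is to keep each original name paired with its cleaned search term. The helper name `clean_artist_name` and the sample names are hypothetical.

```python
import re

def clean_artist_name(name):
    """Strip punctuation and surrounding whitespace, mirroring the cells above."""
    return re.sub(r"[^\w\s]", "", name).strip()

# Hypothetical raw names, including one that cleans down to an empty string.
raw_names = [" ACE HOOD", "Adam & the Ants", " "]

# Keep (original, cleaned) pairs so results can be joined back on the original name.
name_pairs = []
for raw in raw_names:
    cleaned = clean_artist_name(raw)
    if cleaned:  # skip names that are empty after cleaning
        name_pairs.append((raw, cleaned))

print(name_pairs)
```

Searching with the cleaned term and merging on the original name would avoid dropping the blank entry by position.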

### Spotify API Work

In [26]:
#Spotify API set up
# Source: https://betterprogramming.pub/how-to-extract-any-artists-data-using-spotify-s-api-python-and-spotipy-4c079401bc37
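# Read credentials from the environment rather than hard-coding them in the notebook.
# The variable names follow spotipy's convention and must be set before running this cell.
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']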

client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [27]:
# leftover_artists

['Ingrosso',
 'Kyle',
 'L',
 '000 Maniacs',
 '2014 Remastered Version',
 '2016 Remaster',
 '24 Karatz',
 '24KGoldn',
 '30 Seconds To Mars',
 '40 Thevz',
 '433',
 '4FRNT',
 '7 Version',
 '70s Greatest Hits',
 '70s Music',
 '7th Heaven',
 '80s Hits',
 '80s Pop',
 '90s Maniacs',
 '90s Pop Band',
 '9Tendo',
 'A Manetta',
 'A Rocket To the Moon',
 'ACE HOOD',
 'AKON',
 'APM Music',
 'AUDÉ',
 'AWE',
 'AWE Remix',
 'AY AY',
 'AZTX',
 'Aanysa',
 'Academy Of Choir Art Of Russia',
 'Acoustic Chill',
 'Acoustic Covers',
 'Adam  the Ants',
 'Adam Christoper',
 'Adam Clough',
 'Adam F Mix',
 'Adam Port feat Gigi',
 'Addy van der Zwan',
 'Afra Hines',
 'Afterhrs',
 'Age of Bass',
 'Agent Orange DJ',
 'Agge',
 'Ahmad Lewis',
 'Ahmad Simmons',
 'Aire Atlantica',
 'Alan Wiggins',
 'Alana D',
 'Album version',
 'Ale Benassi',
 'Alejandro Fernandez',
 'Alemany',
 'Alessandro Benassi',
 'Alesso Remix',
 'Alex Natale DJ',
 'Alex Van Halen',
 'Alexander Lewis Remix',
 'Alexander Technique',
 'Alexio La Best

In [28]:
# # Set up search term 
# searchtest = leftover_artists[0]
# searchtest

'Ingrosso'

In [29]:
# # Pull out artistid
# result = sp.search(searchtest)
# artistidtest = result['tracks']['items'][0]['album']['artists'][0]['uri']
# artistidtest[15:]

'2XnBwblw31dfGnspMIwgWz'

In [30]:
# leftover_artists_df = pd.DataFrame(leftover_artists, columns=['artist'])

In [31]:
# # Create function to get artist id
# def getartistId(artist):
#     """Return the Spotify id of the first artist on the top search hit, or "error"."""
#     try:
#         result = sp.search(artist)
#         artist_uri = result['tracks']['items'][0]['album']['artists'][0]['uri']
#     except Exception:  # no match found or the request failed
#         return "error"
#     # URI has the form 'spotify:artist:<id>'; keep only the id
#     return artist_uri.split(':')[-1]

In [32]:
# leftover_artists_df['spotifyId_man'] = leftover_artists_df['artist'].map(lambda a: getartistId(a))
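The lookup above uses spotipy's default search type (tracks) and takes the first album artist of the top track, so a query can resolve to an unrelated artist. A hedged sketch of an alternative is to search artists directly with `type="artist"`. The function name `get_artist_id` and the `FakeClient` stub are hypothetical; the stub only exists so the example runs offline, and in the notebook the real `sp` client would be passed instead.

```python
def get_artist_id(client, artist_name):
    """Return the Spotify id of the top artist-search hit, or None if there is no hit."""
    result = client.search(q=artist_name, type="artist", limit=1)
    items = result.get("artists", {}).get("items", [])
    return items[0]["id"] if items else None


class FakeClient:
    """Hypothetical offline stand-in for a spotipy.Spotify client."""

    def search(self, q, type="track", limit=10):
        # Mimics the shape of an artist search response with made-up data.
        hits = {"Surf Mesa": [{"id": "fake_id_123", "name": "Surf Mesa"}]}
        return {"artists": {"items": hits.get(q, [])}}


fake = FakeClient()
found = get_artist_id(fake, "Surf Mesa")
missing = get_artist_id(fake, "no such artist")
print(found, missing)
```

Returning `None` instead of the string `"error"` keeps unmatched artists visible to `isnull()` after the later merge.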

In [33]:
# leftover_artists_df

Unnamed: 0,artist,spotifyId_man
0,Ingrosso,2XnBwblw31dfGnspMIwgWz
1,Kyle,4qBgvVog0wzW75IQ48mU7v
2,L,0du5cEVh5yTK9QJze8zA0C
3,000 Maniacs,0MBIKH9DjtBkv8O3nS6szj
4,2014 Remastered Version,1SQRv42e4PjEYfPhS0Tk9E
...,...,...
1918,r3x,0vRvGUQVUjytro0xpb26bs
1919,suns,74oJ4qxwOZvX6oSsu1DGnw
1920,surf mesa,1lmU3giNF3CSbkVSQmLpHQ
1921,the London Cast of Chicago,7DbmkhVKGV7UiwrbegHL9t


In [34]:
# leftover_artists_df['spotifyId_man'].value_counts()
# # 97.9% of artists assigned spotifyId

0LyfQWJT6nXafLPZqxe9Of    186
error                      40
6MDME20pz9RveH9rEXvrOM      9
6OTr0YwLwGdv7mlmX27hRX      6
0FtMEXCUB6K8FLTMazOZeQ      6
                         ... 
2gBjLmx6zQnFGQJCAQpRgw      1
46SHBwWsqBkxI7EeeBEQG7      1
5QM6WPXiNFRTcOetPgMFtC      1
10Khz9BDdDT2mzm3330Cvu      1
5T4UKHhr4HGIC0VzdZQtAE      1
Name: spotifyId_man, Length: 1396, dtype: int64
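The counts above show a single id assigned to 186 different search terms, which suggests generic terms (for example remix or playlist labels) are collapsing onto one artist. A small sketch for flagging such ids for manual review follows; the toy data and the threshold of 3 are assumptions for illustration only.

```python
import pandas as pd

# Toy lookup table with made-up ids (hypothetical values).
lookup = pd.DataFrame({
    "artist": ["80s Hits", "80s Pop", "Album version", "Kyle"],
    "spotifyId_man": ["id_x", "id_x", "id_x", "id_k"],
})

# Count how many search terms share each id, then flag ids at or above the threshold.
id_counts = lookup["spotifyId_man"].value_counts()
suspect_ids = id_counts[id_counts >= 3].index
suspect_rows = lookup[lookup["spotifyId_man"].isin(suspect_ids)]
print(suspect_rows)
```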

In [35]:
# leftover_artists_df

Unnamed: 0,artist,spotifyId_man
0,Ingrosso,2XnBwblw31dfGnspMIwgWz
1,Kyle,4qBgvVog0wzW75IQ48mU7v
2,L,0du5cEVh5yTK9QJze8zA0C
3,000 Maniacs,0MBIKH9DjtBkv8O3nS6szj
4,2014 Remastered Version,1SQRv42e4PjEYfPhS0Tk9E
...,...,...
1918,r3x,0vRvGUQVUjytro0xpb26bs
1919,suns,74oJ4qxwOZvX6oSsu1DGnw
1920,surf mesa,1lmU3giNF3CSbkVSQmLpHQ
1921,the London Cast of Chicago,7DbmkhVKGV7UiwrbegHL9t


In [38]:
# # pickle df - done
# leftover_artists_df.to_pickle("./larger_spot_leftover_artists_df.pkl")

In [None]:
# Reload the pickled leftover-artists lookup (built from the larger Kaggle dataset, so fewer artists were left to look up)
# Loaded under the name leftover_artists_df because the cells below use that name
leftover_artists_df = pd.read_pickle("./larger_spot_leftover_artists_df.pkl")

In [39]:
artist_class_count_spotid

Unnamed: 0,name,Class Count,id,followers,genres,popularity,_merge
0,ACE HOOD,1,,,,,left_only
1,AKON,1,,,,,left_only
2,Cambridge,1,,,,,left_only
3,Cr3on,6,,,,,left_only
4,DJ Deeon,13,,,,,left_only
...,...,...,...,...,...,...,...
10936,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0,both
10937,Álvaro Faria,4,,,,,left_only
10938,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0,both
10939,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0,both


In [40]:
# Drop _merge column from original df, prepare for next merge with spotifyId_man column addition
artist_class_count_spotid = artist_class_count_spotid.drop(columns="_merge")

In [41]:
leftover_artists_df

Unnamed: 0,artist,spotifyId_man
0,Ingrosso,2XnBwblw31dfGnspMIwgWz
1,Kyle,4qBgvVog0wzW75IQ48mU7v
2,L,0du5cEVh5yTK9QJze8zA0C
3,000 Maniacs,0MBIKH9DjtBkv8O3nS6szj
4,2014 Remastered Version,1SQRv42e4PjEYfPhS0Tk9E
...,...,...
1918,r3x,0vRvGUQVUjytro0xpb26bs
1919,suns,74oJ4qxwOZvX6oSsu1DGnw
1920,surf mesa,1lmU3giNF3CSbkVSQmLpHQ
1921,the London Cast of Chicago,7DbmkhVKGV7UiwrbegHL9t


In [42]:
leftover_artists_df.rename(columns={'artist':'name'}, inplace=True)

In [43]:
artist_class_count_spotid

Unnamed: 0,name,Class Count,id,followers,genres,popularity
0,ACE HOOD,1,,,,
1,AKON,1,,,,
2,Cambridge,1,,,,
3,Cr3on,6,,,,
4,DJ Deeon,13,,,,
...,...,...,...,...,...,...
10936,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0
10937,Álvaro Faria,4,,,,
10938,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0
10939,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0


In [44]:
# Repeat the left join, this time between artist_class_count_spotid and leftover_artists_df, on 'name'
artist_class_count_spotid_2 = artist_class_count_spotid.merge(leftover_artists_df, on='name', how='left', indicator=True)

In [45]:
artist_class_count_spotid_2

Unnamed: 0,name,Class Count,id,followers,genres,popularity,spotifyId_man,_merge
0,ACE HOOD,1,,,,,,left_only
1,AKON,1,,,,,,left_only
2,Cambridge,1,,,,,,left_only
3,Cr3on,6,,,,,,left_only
4,DJ Deeon,13,,,,,,left_only
...,...,...,...,...,...,...,...,...
10941,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0,,left_only
10942,Álvaro Faria,4,,,,,3Vop0LjMsp1743VPBhYJXl,both
10943,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0,,left_only
10944,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0,,left_only


In [46]:
# Duplicates added
artist_class_count_spotid_2['_merge'].value_counts()

left_only     9339
both          1607
right_only       0
Name: _merge, dtype: int64

In [47]:
# View duplicate artists
artist_class_count_spotid_2[artist_class_count_spotid_2['name'].duplicated(keep='first')].sort_values(by='name')['name']

4263             Ingrosso
5693           Lil Boosie
6847             N Douwma
7319              P Diddy
8642    Sherry St Germain
Name: name, dtype: object

In [48]:
# Drop duplicate rows, keeping the first Spotify id matched (removes the 5 extra rows: 10946 - 10941)
artist_class_count_spotid_2.drop_duplicates(subset='name', keep='first', inplace=True)

In [49]:
# Reset index post drops
artist_class_count_spotid_2 = artist_class_count_spotid_2.reset_index(drop=True)
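An alternative to dropping duplicates after the join is to make the lookup table unique on `name` beforehand and let pandas check the assumption with the `validate` argument of `merge`. This is a sketch on toy frames with made-up ids, not the notebook's data.

```python
import pandas as pd

# Toy frames (hypothetical values): the right side has a repeated name.
left = pd.DataFrame({"name": ["Ingrosso", "P Diddy"], "Class Count": [4, 2]})
right = pd.DataFrame({
    "name": ["Ingrosso", "Ingrosso", "P Diddy"],
    "spotifyId_man": ["id_1", "id_2", "id_3"],
})

# Deduplicate the lookup first so the join cannot add rows.
right_unique = right.drop_duplicates(subset="name", keep="first")

# validate="one_to_one" raises MergeError if either side still has repeated keys.
merged = left.merge(right_unique, on="name", how="left", validate="one_to_one")
print(len(left), len(merged))
```

With this approach the row count after the merge equals the row count before it, so no post-merge cleanup or index reset is needed.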

In [50]:
artist_class_count_spotid_2

Unnamed: 0,name,Class Count,id,followers,genres,popularity,spotifyId_man,_merge
0,ACE HOOD,1,,,,,,left_only
1,AKON,1,,,,,,left_only
2,Cambridge,1,,,,,,left_only
3,Cr3on,6,,,,,,left_only
4,DJ Deeon,13,,,,,,left_only
...,...,...,...,...,...,...,...,...
10936,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0,,left_only
10937,Álvaro Faria,4,,,,,3Vop0LjMsp1743VPBhYJXl,both
10938,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0,,left_only
10939,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0,,left_only


In [51]:
artist_class_count_spotid_2['id'].isnull().sum()

1924

In [53]:
# Fill missing values in 'id' with the ids found in the manual pull ('spotifyId_man')
artist_class_count_spotid_2['id'] = artist_class_count_spotid_2['id'].fillna(
    artist_class_count_spotid_2['spotifyId_man'])

In [54]:
# Clean df by dropping 'spotifyId_man', '_merge' helper columns
artist_class_count_spotid_2.drop(columns=['spotifyId_man', '_merge'], inplace=True)

In [55]:
artist_class_count_spotid_2['id'].isnull().sum()

342

In [97]:
1-((artist_class_count_spotid_2['id'].isnull().sum())/len(artist_class_count_spotid_2))
# 96.9% of artists have a Spotify id

0.9687414313134083
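The figure above counts every non-null `id`. If any `"error"` placeholders from the lookup step were carried into `id`, they would be counted as matches. A minimal sketch of a coverage check that excludes such a sentinel, on made-up values:

```python
import pandas as pd

# Hypothetical id column: two real ids, one missing value, one "error" placeholder.
ids = pd.Series(["id_1", None, "error", "id_2"])

# Count a row as matched only if it is non-null and not the sentinel string.
has_real_id = ids.notna() & (ids != "error")
coverage = has_real_id.mean()
print(coverage)
```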

In [58]:
len(artist_class_count_spotid_2)

10941

In [60]:
# Describe with suppressed scientific notation
# Source: https://stackoverflow.com/questions/40347689/dataframe-describe-suppress-scientific-notation
artist_class_count_spotid_2.describe().apply(lambda s: s.apply('{0:.5f}'.format))

Unnamed: 0,Class Count,followers,popularity
count,10941.0,9017.0,9017.0
mean,16.00585,620274.34446,49.93889
std,56.5979,2562293.91061,18.38083
min,1.0,0.0,0.0
25%,1.0,4944.0,40.0
50%,2.0,48580.0,52.0
75%,8.0,315454.0,62.0
max,1436.0,78900234.0,100.0


In [63]:
artist_class_count_spotid_2

Unnamed: 0,name,Class Count,id,followers,genres,popularity
0,ACE HOOD,1,,,,
1,AKON,1,,,,
2,Cambridge,1,,,,
3,Cr3on,6,,,,
4,DJ Deeon,13,,,,
...,...,...,...,...,...,...
10936,yutaka hirasaka,1,0stmdx2IonUUUIlWQ9bLYZ,12363.0,"['j-ambient', 'japanese guitar', 'japanese jaz...",52.0
10937,Álvaro Faria,4,3Vop0LjMsp1743VPBhYJXl,,,
10938,Ásgeir,12,7xUZ4069zcyBM4Bn10NQ1c,252028.0,"['icelandic folk', 'icelandic indie', 'iceland...",60.0
10939,Ådå,1,0Ll3Su2buPphrFk5OdPEvp,1176.0,[],32.0


In [65]:
# pickle df - done
artist_class_count_spotid_2.to_pickle("./larger_spot_artist_id_class_count.pkl")

In [61]:
base_clean_df

Unnamed: 0,classId,className,classDescription,classDifficulty,classDuration,classType,classLength,classLocation,classOriginalAirdate,classRating,classRatingCount,instructorName,instructorBio,classEquipment,classSongs,classArtists,classUrl
0,7f66378211c9476b9b5619bf989f91d0,20 min Peace Meditation,A guided meditation that focuses on cultivatin...,4.3333,20,Meditation,23,psny-studio-2,2021-09-03 13:25:00,0.9847,131,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...",Yoga Block,Meditation 22,RIOPY,https://members.onepeloton.com/classes/bootcam...
1,54ac61803b364b2fa8378acd9f593cdb,15 min Bodyweight Strength,"No equipment, no problem. Join us for a high-e...",5.7755,15,Strength,19,psny-studio-3,2021-09-03 13:19:00,0.9933,297,Olivia Amato,"Born and raised in New York, Olivia grew up pl...",Workout Mat,"California Gurls,Let's Get Loud,Let It Rock (f...","Katy Perry,Snoop Dogg,Jennifer Lopez,Kevin Rud...",https://members.onepeloton.com/classes/bootcam...
2,c75fd4831573483c9d45739aae11d083,20 min Focus Flow: Lower Body,This yoga flow class focuses on poses that eng...,4.3664,20,Yoga,23,psny-studio-2,2021-09-03 12:51:00,1.0000,159,Aditi Shah,"To Aditi, yoga goes beyond movement and can br...","Yoga Blanket,Yoga Block,Yoga Mat","Interlude No 1,Oceansize,She Just Likes To Fig...","James Vincent McMorrow,Oh Wonder,Four Tet,Grim...",https://members.onepeloton.com/classes/bootcam...
3,470086936f7a4723ab5a53cb80b571ff,45 min Pop Bootcamp,Split your workout 50/50 between cardio on the...,7.8312,45,Tread Bootcamp,50,psny-studio-4,2021-09-03 11:56:00,0.9737,152,Olivia Amato,"Born and raised in New York, Olivia grew up pl...","Workout Mat,Medium Weights",34+35 (Remix) (feat. Doja Cat & Megan Thee Sta...,"Ariana Grande,Doja Cat,Megan Thee Stallion,Jus...",https://members.onepeloton.com/classes/bootcam...
4,9680a817bf2149d2b91990c87166a400,20 min Pop Ride,We dare you not to dance as you ride to all th...,7.4000,20,Cycling,24,uk,2021-09-03 07:52:00,1.0000,82,Sam Yo,Sam is a pro at many things but shines when it...,Bike,"Señorita,Marry You,Irreplaceable,What Do You M...","Justin Timberlake,Bruno Mars,Beyoncé,Justin Bi...",https://members.onepeloton.com/classes/bootcam...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16743,a144fcf1595740caa610f6484deb3cdb,20 min Funk Fun Run,This 20 minute Fun Run is a musically driven c...,6.4567,20,Running,23,nyc-gw-tread,14/09/2018 10:22,0.9933,150,Becs Gentry,Becs joins Peloton from London as an accomplis...,,"Man Funk (feat. Leron Thomas),If It Ain't Funk...","Guts,Leron Thomas,Soopasoul,FDEL,Basement Frea...",https://members.onepeloton.com/classes/bootcam...
16744,bd89a56c884b48e59ce9c43ab4f4d86f,45 min 90s Pop Ride,The 90s were a booming and inventive time in m...,8.1481,45,Cycling,49,nyc,13/09/2018 23:19,0.9925,1877,Denis Morton,"Raised in Florida, schooled in Tennessee, stee...",Light Weights,"Too Close - Radio Edit,Motownphilly - Original...","Next,Boyz II Men,Blues Traveler,Eagle-Eye Cher...",https://members.onepeloton.com/classes/bootcam...
16745,e5a99c83d7fc4f0686c68b02e5932204,10 min Arms Toning,Join Jess as she takes you through this 10-min...,7.4028,10,Strength,11,nyc,13/09/2018 21:02,0.9978,4481,Jess King,Jess is a charismatic instructor with a boundl...,Light Weights,"Hustle,Day and Night,Just Got Paid,Waiting for...","Daphne Willis,Gizzle,Lo Air,Sigala,Ella Eyre,M...",https://members.onepeloton.com/classes/bootcam...
16746,c57e123004d04df1bfae94c5e4399bb3,45 min 90s Ride,The 90s were a booming and inventive time in m...,8.3982,45,Cycling,49,nyc,13/09/2018 11:53,0.9922,2307,Emma Lovewell,Emma Lovewell is a Martha’s Vineyard native an...,Light Weights,"Show Me Love - Radio Version,Don't Speak,Gonna...","Robyn,No Doubt,C & C Music Factory,Technotroni...",https://members.onepeloton.com/classes/bootcam...


In [62]:
artist_df

Unnamed: 0,classId,classArtists,Charles Lloyd,Marshall Jefferson,Electric Six,Bjonr,Gerald Alston,Blue Cheer,Madden,Jessy,...,Maxence Cyrin,B.B. Jay,Yizzy,Bellman,Lauren Pritchard,The Japanese Popstars,James Jacob,Radio Remix,Mayer Hawthorne,John Ryan
0,7f66378211c9476b9b5619bf989f91d0,[RIOPY],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,54ac61803b364b2fa8378acd9f593cdb,"[Katy Perry, Snoop Dogg, Jennifer Lopez, Kevin...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,c75fd4831573483c9d45739aae11d083,"[James Vincent McMorrow, Oh Wonder, Four Tet, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,470086936f7a4723ab5a53cb80b571ff,"[Ariana Grande, Doja Cat, Megan Thee Stallion,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9680a817bf2149d2b91990c87166a400,"[Justin Timberlake, Bruno Mars, Beyoncé, Justi...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16743,a144fcf1595740caa610f6484deb3cdb,"[Guts, Leron Thomas, Soopasoul, FDEL, Basement...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16744,bd89a56c884b48e59ce9c43ab4f4d86f,"[Next, Boyz II Men, Blues Traveler, Eagle-Eye ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16745,e5a99c83d7fc4f0686c68b02e5932204,"[Daphne Willis, Gizzle, Lo Air, Sigala, Ella E...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16746,c57e123004d04df1bfae94c5e4399bb3,"[Robyn, No Doubt, C & C Music Factory, Technot...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
