## 🔮💿 PRÉDIRE LES CERTIFICATIONS DES ALBUMS ET SINGLES

Dans ce projet, je souhaite comprendre **la naissance d'un hit en France** grâce à l'exploitation de plusieurs variables. Cela me permettra par la suite de **réaliser un modèle permettant de prédire la certification d'un single** (or, platine ou diamant).  
Ce notebook contient le _data-preprocessing_ : nous nous appuierons sur les données Spotify. En effet, la plateforme suédoise de streaming est celle qui permet de récolter un grand nombre de données. Cette collecte est facilitée par la librairie **spotipy**. De plus, d'autres données seront collectées grâce au scraping.

## Importation des modules

In [1]:
import pandas as pd
import numpy as np

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import collections

import requests
from bs4 import BeautifulSoup

_____________________________________________________

## Data extraction

In [2]:
# Connexion à l'API Spotify
client_id = "3692c6af54244957b5114da1ae8667d2"
client_secret = "1c9aa410d1e94b0ea53356f1cab56d54"
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager) 

_________________________________

## Création des dataframes

Pour réaliser notre modèle de prédiction, nous collectons différentes données :  
* Singles sans certification : on crée un dataframe contenant les singles non certifiés de plusieurs playlists Spotify. Nous faisons en sorte qu'il y ait une grande diversité de genres musicaux. On récupère ensuite les données que fournit Spotify.
* Singles certifiés : on récupère le fichier disponible au format csv sur le site du SNEP. On collectera ensuite les données que fournit Spotify sur toutes les chansons certifiées.
* Albums certifiés : on récupère le fichier disponible au format csv sur le site du SNEP.
* TOP 200 Spotify : on récupère le classement actuel des chansons les plus streamés (format csv), puis on collectera les données fournies par Spotify.
* TOP 200 SNEP : on récupère les informations en scrapant le site du SNEP. Les datas spotify seront récoltées également.

### Singles sans certification

In [3]:
# On récupère les chansons non certifiées et leurs features dans les playlists Spotify
draft = pd.read_csv('/Users/anissa/P8/la_draft.csv')
nouvelle_vague = pd.read_csv('/Users/anissa/P8/nouvelle_vague.csv')
metal_francais = pd.read_csv('/Users/anissa/P8/metal_francais.csv')
rock_fr = pd.read_csv('/Users/anissa/P8/rock_fr.csv')
pop_fr = pd.read_csv('/Users/anissa/P8/pop_fr.csv')
reggae_fr = pd.read_csv('/Users/anissa/P8/reggae_fr.csv')
indie_fr = pd.read_csv('/Users/anissa/P8/indie_fr.csv')
a_surveiller = pd.read_csv('/Users/anissa/P8/a_surveiller.csv')
nouvelle_scene = pd.read_csv('/Users/anissa/P8/nouvelle_scene.csv')
alternatif = pd.read_csv('/Users/anissa/P8/alternatif.csv')
bachata = pd.read_csv('/Users/anissa/P8/bachata.csv')
bossa_nova = pd.read_csv('/Users/anissa/P8/bossa_nova.csv')
coupe_decale = pd.read_csv('/Users/anissa/P8/coupe_decale.csv')
dancehall_official = pd.read_csv('/Users/anissa/P8/dancehall_official.csv')
dancehall_queen = pd.read_csv('/Users/anissa/P8/dancehall_queen.csv')
dancehall_station = pd.read_csv('/Users/anissa/P8/dancehall_station.csv')
equal_france = pd.read_csv('/Users/anissa/P8/equal_france.csv')
equal_us = pd.read_csv('/Users/anissa/P8/equal_us.csv')
exception_fr = pd.read_csv('/Users/anissa/P8/exception_fre.csv')
flashback = pd.read_csv('/Users/anissa/P8/flashback.csv')
folk_arc = pd.read_csv('/Users/anissa/P8/folk_arc.csv')
folk_pop = pd.read_csv('/Users/anissa/P8/folk_pop.csv')
funky_jams = pd.read_csv('/Users/anissa/P8/funky_jams.csv')
jaleo = pd.read_csv('/Users/anissa/P8/jaleo.csv')
la_vie_est_belle = pd.read_csv('/Users/anissa/P8/la_vie_est_belle.csv')
made_in_france = pd.read_csv('/Users/anissa/P8/made_in_france.csv')
mellow_bars = pd.read_csv('/Users/anissa/P8/mellow_bars.csv')
pop_fem = pd.read_csv('/Users/anissa/P8/pop_fem.csv')
rap_qub = pd.read_csv('/Users/anissa/P8/rap_qub.csv')
rap_uk = pd.read_csv('/Users/anissa/P8/rap_uk.csv')
reggae_fr = pd.read_csv('/Users/anissa/P8/reggae_franais.csv')
stars_en_herbe = pd.read_csv('/Users/anissa/P8/stars_en_herbe.csv')
street_soul = pd.read_csv('/Users/anissa/P8/street_soul.csv')
variet = pd.read_csv('/Users/anissa/P8/varit_daujourdhui.csv')
hiphop = pd.read_csv('/Users/anissa/P8/women_of_hip-hop.csv')

# Concaténation
sans_certif = pd.concat([draft,nouvelle_vague, metal_francais, 
                         rock_fr, pop_fr, reggae_fr, indie_fr, a_surveiller, nouvelle_scene, alternatif, 
                         bachata, bossa_nova, coupe_decale, dancehall_official, dancehall_queen, dancehall_station, equal_france, equal_us, exception_fr,
                        flashback, folk_arc, folk_pop, funky_jams, jaleo, la_vie_est_belle, made_in_france, mellow_bars, pop_fem, rap_qub,
                        rap_uk, reggae_fr, stars_en_herbe, street_soul, variet, hiphop])

# On supprime les colonnes non utiles
sans_certif = sans_certif.drop(columns=['Spotify ID', 'Artist IDs', 'Added By', 'Added At'])

sans_certif.head()

Unnamed: 0,Track Name,Album Name,Artist Name(s),Release Date,Duration (ms),Popularity,Genres,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Time Signature
0,Étincelle - Maradona,Étincelle (Maradona),Tiakola,2021-09-09,212946,68,"pop urbaine,rap francais",0.682,0.521,8.0,-8.058,1.0,0.0759,0.435,6.3e-05,0.23,0.683,140.093,4.0
1,Outils (feat. DA Uzi),Outils (feat. DA Uzi),"Lamatrix,DA Uzi",2021-09-30,210493,55,"french hip hop,pop urbaine,rap francais",0.663,0.839,2.0,-4.785,0.0,0.45,0.542,0.0,0.35,0.513,179.933,3.0
2,Aggravage 2,Prend Tous Les Risques,LE RISQUE,2021-09-22,233684,58,rap francais,0.539,0.593,6.0,-9.534,1.0,0.29,0.39,0.0,0.127,0.472,138.601,4.0
3,RS DRILL #1,RS DRILL #1,"Pirate,SRK",2021-08-16,163098,51,rap francais,0.828,0.797,0.0,-5.549,1.0,0.321,0.584,0.0,0.0742,0.742,142.0,4.0
4,Emotion,Emotion,Fresh laDouille,2021-07-02,206665,59,"french hip hop,pop urbaine,rap francais",0.762,0.628,5.0,-6.278,0.0,0.0943,0.268,0.0,0.133,0.82,141.931,4.0


In [5]:
# Suppression des doublons
sans_certif = sans_certif.drop_duplicates(subset=['Track Name','Artist Name(s)'], keep=False)

In [7]:
# On récupère uniquement le premier nom des featurings
sans_certif['Artist Name(s)'] = [x.split(',')[0] for x in sans_certif['Artist Name(s)']]

In [8]:
# On crée des listes pour les genres musicaux
sans_certif['Genres'] = sans_certif['Genres'].str.split(',')
sans_certif.head()

Unnamed: 0,Track Name,Album Name,Artist Name(s),Release Date,Duration (ms),Popularity,Genres,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Time Signature
0,Étincelle - Maradona,Étincelle (Maradona),Tiakola,2021-09-09,212946,68,"[pop urbaine, rap francais]",0.682,0.521,8.0,-8.058,1.0,0.0759,0.435,6.3e-05,0.23,0.683,140.093,4.0
1,Outils (feat. DA Uzi),Outils (feat. DA Uzi),Lamatrix,2021-09-30,210493,55,"[french hip hop, pop urbaine, rap francais]",0.663,0.839,2.0,-4.785,0.0,0.45,0.542,0.0,0.35,0.513,179.933,3.0
2,Aggravage 2,Prend Tous Les Risques,LE RISQUE,2021-09-22,233684,58,[rap francais],0.539,0.593,6.0,-9.534,1.0,0.29,0.39,0.0,0.127,0.472,138.601,4.0
3,RS DRILL #1,RS DRILL #1,Pirate,2021-08-16,163098,51,[rap francais],0.828,0.797,0.0,-5.549,1.0,0.321,0.584,0.0,0.0742,0.742,142.0,4.0
4,Emotion,Emotion,Fresh laDouille,2021-07-02,206665,59,"[french hip hop, pop urbaine, rap francais]",0.762,0.628,5.0,-6.278,0.0,0.0943,0.268,0.0,0.133,0.82,141.931,4.0


In [27]:
# Exportation au format csv
sans_certif.to_csv("/Users/anissa/P8/csv_features/sans_certif.csv", index = False)

In [9]:
# On récupère les features manquantes

In [10]:
temp = 0
dict_albums = collections.defaultdict(dict)
track_ids_list = list(sans_certif['Track Name'])
id_bug = []
for i in range(0,1680):
    #len(track_ids_list)
 try : 
    data3 = track_ids_list[i] 
    
    tracks_df = sp.search(data3)
    #tracks_features = sp.audio_features(tracks_df['tracks']['items'][0]['id'])
    audio_analysis = sp.audio_analysis(tracks_df['tracks']['items'][0]['id'])
    artist_genre = sp.artist(tracks_df['tracks']['items'][0]['artists'][0]['uri'])
    albums_data = sp.album(tracks_df['tracks']['items'][0]['album']['uri'])
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_name"] = tracks_df['tracks']["items"][0]["artists"][0]["name"]
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["date_release"] = tracks_df['tracks']['items'][0]['album']["release_date"]
      
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_popularity"] = tracks_df['tracks']['items'][0]['popularity']
        
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_popularity"] = artist_genre['popularity']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["followers"] = artist_genre['followers']['total']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["explicit_content"] = albums_data['tracks']['items'][0]['explicit']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["labels"] = albums_data['copyrights'][0]['text']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["end_of_fade_in"] = audio_analysis['track']['end_of_fade_in']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["start_of_fade_out"] = audio_analysis['track']['start_of_fade_out']
    
    temp+=1
    
 except :
    print(temp) # gestion des erreurs
    id_bug.append(temp)
    pass

209
238
260
291
294
424
454
457
688
797
999
1014
1021
1360
1535
1664


In [11]:
# Création du df
sans_certif_df = pd.DataFrame.from_dict(dict_albums, orient='index').reset_index(drop=True)

# Exportation au format csv
sans_certif_df.to_csv("/Users/anissa/P8/csv_features/sans_certif_df.csv", index = False)

In [28]:
sans_certif_df.shape

(1618, 9)

### Top singles SNEP

#### Scraping

In [13]:
url = 'https://snepmusique.com/les-tops/le-top-de-la-semaine/top-albums/?categorie=Top%20Singles&semaine=40'

page = requests.get(url)

soup = BeautifulSoup(page.text, "html.parser")
titre = soup.findAll('div',{'class':'titre'})

data = []
for link in soup.findAll('div',{'class':'titre'}):
    data.append(link.get_text())
data = data[:200]

data = []
for link in soup.findAll('div',{'class':'titre'}):
    data.append(link.get_text())
data = data[:200]

data2 = []
for link in soup.findAll('div',{'class':'artiste'}):
    data2.append(link.get_text())
    
data3 = []
for link in soup.findAll('div',{'class':'editeur'}):
    data3.append(link.get_text())
    
data4 = []
for link in soup.findAll('div',{'class':'positions'}):
    data4.append(link.get_text())
    
position = soup.findAll('div',{'class':'positions'})
for mytag in position:
    mytag.get_text()
    
rang = soup.findAll('div',{'class':'rang'})
for mytag in rang:
    mytag.get_text()
    
data5 = []
for link in soup.findAll('div',{'class':'rang'}):
    data5.append(link.get_text())
    
df = pd.DataFrame()
df['titre'] = data
df['artiste'] = data2
df['editeur'] = data3
df['positions'] = data4
df['rang'] = data5

df.head()

Unnamed: 0,titre,artiste,editeur,positions,rang
0,LOVE NWANTITI,CKAY,WARNER / CHOCOLATE CITY MUSIC,\n1erLa Semaine Dernière\n,1
1,TDB,OBOY,BELIEVE / SYNDICATE RECORDS,\n2eLa Semaine Dernière\n,2
2,PEPAS,FARRUKO,SONY MUSIC ENTERTAINMENT / RCA,\n3eLa Semaine Dernière\n,3
3,LA KIFFANCE,NAPS,BELIEVE / OK MANY INDUSTRIE ET 13ÈME ART MUSIC,\n5eLa Semaine Dernière\n,4
4,STAY,THE KID LAROI,SONY MUSIC ENTERTAINMENT / COLUMBIA,\n6eLa Semaine Dernière\n,5


In [14]:
df['titre'] = df['titre'].str.lower()

df = df.apply(lambda x: x.str.strip() if isinstance(x, object) else x) # suppression '/'
df.head()

Unnamed: 0,titre,artiste,editeur,positions,rang
0,love nwantiti,CKAY,WARNER / CHOCOLATE CITY MUSIC,1erLa Semaine Dernière,1
1,tdb,OBOY,BELIEVE / SYNDICATE RECORDS,2eLa Semaine Dernière,2
2,pepas,FARRUKO,SONY MUSIC ENTERTAINMENT / RCA,3eLa Semaine Dernière,3
3,la kiffance,NAPS,BELIEVE / OK MANY INDUSTRIE ET 13ÈME ART MUSIC,5eLa Semaine Dernière,4
4,stay,THE KID LAROI,SONY MUSIC ENTERTAINMENT / COLUMBIA,6eLa Semaine Dernière,5


In [15]:
# On récupère les data disponibles
temp = 0
dict_albums = collections.defaultdict(dict)
track_ids_list = list(df.titre)
id_bug = []
for i in range(0,201):
    #len(track_ids_list)
 try : 
    data2 = track_ids_list[i] 
    
    tracks_df = sp.search(data2)
    tracks_features = sp.audio_features(tracks_df['tracks']['items'][0]['id'])
    audio_analysis = sp.audio_analysis(tracks_df['tracks']['items'][0]['id'])
    artist_genre = sp.artist(tracks_df['tracks']['items'][0]['artists'][0]['uri'])
    albums_data = sp.album(tracks_df['tracks']['items'][0]['album']['uri'])
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_id"] = tracks_df['tracks']['items'][0]['id']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_name"] = tracks_df['tracks']["items"][0]["artists"][0]["name"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["album_name"] = tracks_df['tracks']['items'][0]['album']['name']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_name"] = tracks_df['tracks']['items'][0]['name']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_popularity"] = tracks_df['tracks']['items'][0]['popularity']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["date_release"] = tracks_df['tracks']['items'][0]['album']["release_date"]
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["acousticness"] = tracks_features[0]["acousticness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["danceability"] = tracks_features[0]["danceability"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["duration_ms"] = tracks_features[0]["duration_ms"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["energy"] = tracks_features[0]["energy"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["instrumentalness"] = tracks_features[0]["instrumentalness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["speechiness"] = tracks_features[0]["speechiness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["key"] = tracks_features[0]["key"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["liveness"] = tracks_features[0]["liveness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["loudness"] = tracks_features[0]["loudness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["mode"] = tracks_features[0]["mode"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["tempo"] = tracks_features[0]["tempo"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["time_signature"] = tracks_features[0]["time_signature"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["valence"] = tracks_features[0]["valence"]
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_genre"] = artist_genre['genres']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_popularity"] = artist_genre['popularity']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["followers"] = artist_genre['followers']['total']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["explicit_content"] = albums_data['tracks']['items'][0]['explicit']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["labels"] = albums_data['copyrights'][0]['text']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["end_of_fade_in"] = audio_analysis['track']['end_of_fade_in']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["start_of_fade_out"] = audio_analysis['track']['start_of_fade_out']   
    
    temp+=1
    
 except :
    print(temp)
    id_bug.append(temp)
    pass
 #artist_nametracks_df.append(['tracks']["items"][0]["artists"][0]["name"])

23
195
198


In [16]:
# Création du df
df2 = pd.DataFrame.from_dict(dict_albums, orient='index').reset_index(drop=True)
df2.head()

# Exportation du df
df2.to_csv("/Users/anissa/P8/csv_features/df2.csv", index = False)

### Top 200 Spotify

In [17]:
top200 = pd.read_csv('/Users/anissa/top200.csv', header=1)
top200.head()

Unnamed: 0,Position,Track Name,Artist,Streams,URL
0,1,TDB,OBOY,2477643,https://open.spotify.com/track/5YoDQgB4j7oBHWX...
1,2,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,2171563,https://open.spotify.com/track/5Z9KJZvQzH6PFmb...
2,3,Pepas,Farruko,1839896,https://open.spotify.com/track/5fwSHlTEWpluwOM...
3,4,STAY (with Justin Bieber),The Kid LAROI,1823895,https://open.spotify.com/track/5PjdY0CKGZdEuoN...
4,5,La kiffance,Naps,1778740,https://open.spotify.com/track/1bwhDrXcAtqePp8...


In [18]:
# On récupère les data disponibles
artist_name = []
artist_uri = []
album_id = []
album_name = []
album_uri = []
track_name = []
popularity = []
track_id = []
track_uri = []
date_release = []
track_ids_list = list(top200['URL'])
for i in range(0,len(track_ids_list),50):
 data = track_ids_list[i:i + 50]
 tracks_df1 = sp.tracks(data)
 
 for idx, track in enumerate(tracks_df1['tracks']):
  artist_name.append(track['artists'][0]['name'])
  artist_uri.append(track['artists'][0]['uri'])
  album_id.append(track['album']['id'])
  album_name.append(track['album']['name'])
  album_uri.append(track['album']['uri'])
  track_name.append(track['name'])
  popularity.append(track['popularity'])
  track_id.append(track['id'])
  track_uri.append(track['uri'])
  date_release.append(track['album']['release_date'])
    
tracks_dataframe = pd.DataFrame({'track_id' : track_id, 'track_uri' : track_uri, 'artist_name' : artist_name, 'artist_uri' : artist_uri, 
                                 'album_id': album_id, 'album_name': album_name, 'album_uri' : album_uri, 'track_name' : track_name, 
                                 'track_id' : track_id, 'popularity' : popularity, 'date_release':date_release})

top200 = top200.rename(columns={'Track Name':'track_name'})

data_df = tracks_dataframe.merge(top200, how='inner', on='track_name')
data_df = data_df[['track_id', 'track_uri', 'artist_name', 'album_id', 'album_name', 'track_name', 'popularity', 'date_release', 'Position', 'Streams', 'URL']]
data_df = data_df.drop_duplicates()
data_df = data_df.reset_index(drop=True)

track_id = []
acousticness = []
analysis_url = []
danceability = []
duration_ms = []
energy = []
instrumentalness = []
key = []
liveness = []
loudness = []
mode = []
speechiness = []
tempo = []
time_signature = []
track_href = []
valence = []
feature_type = []
track_ids_list = list(data_df['track_id'])
for i in range(0,len(track_ids_list),100):
 data = track_ids_list[i:i + 100]
 tracks_features = sp.audio_features(data)
 for idx, track in enumerate(tracks_features):
  track_id.append(track['id'])
  acousticness.append(track['acousticness'])
  analysis_url.append(track['analysis_url'])
  danceability.append(track['danceability'])
  duration_ms.append(track['duration_ms'])
  energy.append(track['energy'])
  instrumentalness.append(track['instrumentalness'])
  key.append(track['key'])
  liveness.append(track['liveness'])
  loudness.append(track['loudness'])
  mode.append(track['mode'])
  speechiness.append(track['speechiness'])
  tempo.append(track['tempo'])
  time_signature.append(track['time_signature'])
  track_href.append(track['track_href'])
  valence.append(track['valence'])
    
tracks_features_df = pd.DataFrame({'track_id' : track_id, 'track_href':track_href, 'analysis_url' : analysis_url,  'acousticness' : acousticness, 'danceability' : danceability, 
                                   'duration_ms' : duration_ms, 'energy':energy, 'instrumentalness':instrumentalness, 'key':key, 'liveness':liveness, 'loudness':loudness, 'mode':mode, 
                                   'speechiness':speechiness, 'tempo':tempo, 'time_signature':time_signature, 'valence':valence})
data_featured = data_df.merge(tracks_features_df, how='inner', on='track_id')
data_featured = data_featured.drop_duplicates()
data_featured = data_featured.reset_index(drop=True)

data_featured.head()

Unnamed: 0,track_id,track_uri,artist_name,album_id,album_name,track_name,popularity,date_release,Position,Streams,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,5YoDQgB4j7oBHWXH3DcwFD,spotify:track:5YoDQgB4j7oBHWXH3DcwFD,OBOY,4kC4sAGhPKek3ZKxrc1TW5,No crari,TDB,70,2021-09-10,1,2477643,...,0.675,0.000598,3,0.124,-6.004,1,0.141,140.092,4,0.758
1,5Z9KJZvQzH6PFmb8SNkxuk,spotify:track:5Z9KJZvQzH6PFmb8SNkxuk,Lil Nas X,6pOiDiuDQqrmo5DbG0ZubR,MONTERO,INDUSTRY BABY (feat. Jack Harlow),91,2021-09-17,2,2171563,...,0.691,0.0,10,0.0476,-7.395,0,0.0672,150.087,4,0.892
2,5fwSHlTEWpluwOM0Sxnh5k,spotify:track:5fwSHlTEWpluwOM0Sxnh5k,Farruko,2A5ksnhz7YWbRLvFMjX1j1,Pepas,Pepas,96,2021-06-24,3,1839896,...,0.766,7e-05,7,0.128,-3.955,1,0.0343,130.001,4,0.442
3,5PjdY0CKGZdEuoNab3yDmX,spotify:track:5PjdY0CKGZdEuoNab3yDmX,The Kid LAROI,4bZJWQhHKJckFLJuYdvyX2,F*CK LOVE 3: OVER YOU,STAY (with Justin Bieber),93,2021-07-23,4,1823895,...,0.764,0.0,1,0.103,-5.484,1,0.0483,169.928,4,0.478
4,1bwhDrXcAtqePp8mNJ0pTe,spotify:track:1bwhDrXcAtqePp8mNJ0pTe,Naps,3mLAiqAlaOd5KaVOLJLtYM,Les mains faites pour l'or,La kiffance,76,2021-04-02,5,1778740,...,0.637,0.0,6,0.071,-6.515,1,0.0869,140.045,4,0.678


In [19]:
# Exportation au format csv
data_featured.to_csv("/Users/anissa/P8/csv_features/data_featured.csv", index = False)

### Certifications Singles

In [20]:
certif_singles = pd.read_csv('/Users/anissa/certif-singles.csv', sep=';')
certif_singles.head()

Unnamed: 0,Interprete,Titre,Label,Catégorie,Certification,Date de sortie,Date de constat
0,HATIK FEAT. JOK'AIR\t,LA MEILLEURE,Low Wood / UNIVERSAL MUSIC FRANCE,Singles,Diamant,06/03/2020,25/12/2020
1,HARRY STYLES,WATERMELON SUGAR,COLUMBIA / SONY MUSIC ENTERTAINMENT,Singles,Platine,17/11/2019,25/12/2020
2,NAZA FEAT. NISKA,JOLI BÉBÉ,Capirolesque / BELIEVE,Singles,Diamant,11/09/2020,25/12/2020
3,24KGOLDN FEAT. IANN DIOR\t,MOOD,Records / Columbia / SONY MUSIC ENTERTAINMENT,Singles,Platine,24/07/2020,25/12/2020
4,DAMSO,LIFE LIFE,TRENTE-QUATRE CENTIMES / UNIVERSAL MUSIC FRANCE,Singles,Or,17/09/2020,25/12/2020


In [21]:
certif_singles = certif_singles.apply(lambda x: x.str.strip() if isinstance(x, object) else x) # suppression '/'
certif_singles['Titre'] = certif_singles['Titre'].str.lower() 

In [22]:
certif_singles.shape

(2434, 7)

In [23]:
# On récupère les data disponibles
temp = 0
dict_albums = collections.defaultdict(dict)
track_ids_list = list(certif_singles.Titre)
id_bug = []
for i in range(0,2435):
    #len(track_ids_list)
 try : 
    data2 = track_ids_list[i] 
    
    tracks_df = sp.search(data2)
    tracks_features = sp.audio_features(tracks_df['tracks']['items'][0]['id'])
    audio_analysis = sp.audio_analysis(tracks_df['tracks']['items'][0]['id'])
    artist_genre = sp.artist(tracks_df['tracks']['items'][0]['artists'][0]['uri'])
    albums_data = sp.album(tracks_df['tracks']['items'][0]['album']['uri'])
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_name"] = tracks_df['tracks']["items"][0]["artists"][0]["name"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["album_name"] = tracks_df['tracks']['items'][0]['album']['name']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_name"] = tracks_df['tracks']['items'][0]['name']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["track_popularity"] = tracks_df['tracks']['items'][0]['popularity']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["date_release"] = tracks_df['tracks']['items'][0]['album']["release_date"]
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["acousticness"] = tracks_features[0]["acousticness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["danceability"] = tracks_features[0]["danceability"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["duration_ms"] = tracks_features[0]["duration_ms"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["energy"] = tracks_features[0]["energy"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["instrumentalness"] = tracks_features[0]["instrumentalness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["speechiness"] = tracks_features[0]["speechiness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["key"] = tracks_features[0]["key"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["liveness"] = tracks_features[0]["liveness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["loudness"] = tracks_features[0]["loudness"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["mode"] = tracks_features[0]["mode"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["tempo"] = tracks_features[0]["tempo"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["time_signature"] = tracks_features[0]["time_signature"]
    dict_albums[tracks_df['tracks']['items'][0]['id']]["valence"] = tracks_features[0]["valence"]
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_genre"] = artist_genre['genres']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["artist_popularity"] = artist_genre['popularity']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["followers"] = artist_genre['followers']['total']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["explicit_content"] = albums_data['tracks']['items'][0]['explicit']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["labels"] = albums_data['copyrights'][0]['text']
    
    dict_albums[tracks_df['tracks']['items'][0]['id']]["end_of_fade_in"] = audio_analysis['track']['end_of_fade_in']
    dict_albums[tracks_df['tracks']['items'][0]['id']]["start_of_fade_out"] = audio_analysis['track']['start_of_fade_out']   
    
    temp+=1
    
 except :
    print(temp)
    id_bug.append(temp)
    pass

719
924
956
998
1065
1313
1411
1436
1465
1493
1541
1565
1647
1841
1841
1842
1895
1963
1971
2026
2153
2382
2386
2386
2410


In [24]:
# Création du df
df1 = pd.DataFrame.from_dict(dict_albums, orient='index').reset_index(drop=True)

# Exportation du df
df1.to_csv("/Users/anissa/P8/csv_features/df1.csv", index = False)

In [25]:
df1.shape

(2188, 25)

In [26]:
df1.head()

Unnamed: 0,artist_name,album_name,track_name,track_popularity,date_release,acousticness,danceability,duration_ms,energy,instrumentalness,...,tempo,time_signature,valence,artist_genre,artist_popularity,followers,explicit_content,labels,end_of_fade_in,start_of_fade_out
0,Hatik,Chaise pliante II,La meilleure,63,2020-03-06,0.652,0.757,185120,0.517,0.00527,...,102.008,4,0.218,[francoton],68,602812,True,© 2020 Low Wood,2.08399,178.31764
1,Harry Styles,Fine Line,Watermelon Sugar,89,2019-12-13,0.122,0.548,174000,0.816,0.0,...,95.39,4,0.557,[pop],87,17191998,False,"(P) 2019 Erskine Records Limited, under exclus...",0.0,169.66821
2,Naza,Gros bébé,Joli bébé,66,2020-11-13,0.586,0.689,178333,0.623,0.0,...,154.95,4,0.26,"[francoton, french hip hop, pop urbaine]",67,996442,True,2020 Capricarolesque,0.44113,178.33333
3,24kGoldn,El Dorado,Mood (feat. iann dior),89,2021-03-26,0.174,0.701,140533,0.716,0.0,...,91.007,4,0.732,"[cali rap, pop rap]",82,1409137,True,"(P) 2021 Records Label, LLC / Columbia",0.22045,136.34177
4,Damso,QALF infinity,LIFE LIFE,61,2021-04-28,0.195,0.851,160667,0.62,1.2e-05,...,122.003,4,0.276,"[belgian hip hop, french hip hop, pop urbaine,...",77,3206285,True,"© 2021 Trente-quatre Centimes, exclusively dis...",0.06907,153.62321
