![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

<body>
    <p style="font-size:28px;text-align:center"><b>Project 03 - Part 01 | Web Scraping & API</b></p>
</body>

# Introduction

The objective of the first part of this project was to collect data to answer a problem statement, while practicing web scraping and the use of APIs.

---

<body>
    <p style="font-size:20px"><b>Problem Statement</b></p>
</body>

_Do some TikTok viral songs have common characteristics?_

---

To answer this question, the 69 songs listed in a **PopSugar** post were collected, and the 66 of them that could be matched on Spotify were analyzed. The post was published on March 27th, 2020 by Hedy Phillips, around the time people started to quarantine because of the COVID-19 pandemic and began to use the app more often while staying at home.

The sources of information used to gather data were **Spotify** and **Chartmetric**.

---

Sources:
- Website:
  - PopSugar: https://www.popsugar.com/entertainment/popular-tiktok-songs-47289804?stream_view=1#photo-47289832
 
- APIs
  - Spotify API: https://developer.spotify.com/
  - Spotipy (Spotify API wrapper for Python): https://spotipy.readthedocs.io/en/2.15.0/
  - Chartmetric API: https://api.chartmetric.com/apidoc/

# Setup

## Import

In [1]:
import os
import re
import requests

from ast import literal_eval
from time import sleep

import numpy as np
import pandas as pd
import spotipy

from bs4 import BeautifulSoup
from dotenv import load_dotenv, find_dotenv
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth
from tqdm.auto import tqdm

# Web Scraping

Web scraping was necessary to collect the following data:

<table>
  <thead>
    <tr>
      <th>INFORMATION</th>
      <th>SOURCE</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>69 TikTok viral songs</td>
      <td>PopSugar</td>
    </tr>
    <tr>
      <td>Music genres</td>
      <td>MusicBrainz</td>
    </tr>
  </tbody>
</table>



## List of viral songs on TikTok

### Get response

In [2]:
# Get response from the url and check it
url = 'https://www.popsugar.com/entertainment/popular-tiktok-songs-47289804?stream_view=1#photo-47289832'
response = requests.get(url)
response

<Response [200]>

### Data Collection

In [3]:
# Get the content in the url
content_popsugar = BeautifulSoup(response.text)

# Get the date the post was made
popsugar_date = content_popsugar.find('time').text.replace('\n', '').strip()

# Get only the songs and artists
popsugar_html = content_popsugar.find_all('span', attrs={'class': 'count-copy'})

In [4]:
# Convert the list 'popsugar_html' to a Pandas DataFrame
df_base = pd.DataFrame([re.split(' by ', song.text.replace('"', '').strip()) for song in popsugar_html], 
                       columns=['song', 'artists'])

# Check the result
df_base

Unnamed: 0,song,artists
0,Roxanne,Arizona Zervas
1,Say So,Doja Cat
2,My Oh My,Camila Cabello feat. DaBaby
3,Moon,Kid Francescoli
4,Vibe,Cookiee Kawaii
...,...,...
64,What the Hell,Avril Lavigne
65,Towards the Sun,Rihanna
66,I Think I'm OKAY,"Machine Gun Kelly, YUNGBLUD, and Travis Barker"
67,Myself,Bazzi
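
The split on `' by '` above assumes that the separator appears exactly once in each entry. A minimal sketch of a more defensive variant follows; `parse_entry` is a hypothetical helper (not part of the notebook) that splits on the last `' by '`, so a title that itself contains the word "by" does not produce extra columns. The second entry below is an invented example, not one of the scraped songs.

```python
def parse_entry(text):
    """Split a scraped entry such as '"Say So" by Doja Cat' into (song, artists)."""
    cleaned = text.replace('"', '').strip()
    # rpartition splits on the LAST ' by ', so a title containing ' by ' stays intact
    song, separator, artists = cleaned.rpartition(' by ')
    if not separator:
        # No ' by ' found: keep the whole text as the song and leave the artists empty
        return cleaned, ''
    return song, artists


print(parse_entry('"Say So" by Doja Cat'))
print(parse_entry('"Stand by Me" by Ben E. King'))  # invented entry, for illustration only
```

An artist name containing `' by '` would still be split incorrectly, so the result is worth checking by eye, as done with `df_base` above.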


### Data Cleaning

In [5]:
# Create a column with a list of artists for each song
df_base['artists_list'] = [re.split(r',* and |, * | [Ff]eat\. ', artists.strip()) for artists in df_base.artists]

# Create a column with the number of artists for each song
df_base['number_artists'] = df_base.artists_list.apply(len)

In [6]:
# Check possible number of artists for one song
df_base.number_artists.value_counts()

1    48
2    18
3     3
Name: number_artists, dtype: int64

As the result above shows, a song has at most 3 artists.
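
To make the splitting rule easier to follow, here is a small standalone sketch of the separators used to build `artists_list`, applied to three artist strings from the list. `split_artists` is a hypothetical helper name. The pattern is the notebook's, except that the dot in `feat.` is escaped so it only matches a literal dot.

```python
import re

# Separators: ' and ' (optionally preceded by commas), ', ', and ' feat. ' / ' Feat. '
ARTIST_SEPARATORS = re.compile(r',* and |, * | [Ff]eat\. ')


def split_artists(credit):
    """Return the list of individual artists in a credit string."""
    return ARTIST_SEPARATORS.split(credit.strip())


for credit in ['Camila Cabello feat. DaBaby',
               'Frank Ocean and L.Dre',
               'Machine Gun Kelly, YUNGBLUD, and Travis Barker']:
    print(credit, '->', split_artists(credit))
```

One limitation of this approach: an artist whose name contains " and " would be split into two artists, so the counts are only as reliable as the credit strings.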

In [7]:
# Check the dataframe
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists
0,Roxanne,Arizona Zervas,[Arizona Zervas],1
1,Say So,Doja Cat,[Doja Cat],1
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2
3,Moon,Kid Francescoli,[Kid Francescoli],1
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1


### Make backup dataframe

In [8]:
df_base_raw_bck = df_base.copy()

# Spotify

With the help of Spotipy, data about each song is gathered from Spotify. Note that some songs may not be available in Spotify's library.

In [9]:
# Create a copy of the dataframe
df_sp = df_base.copy()

# Check the result
df_sp.head()

Unnamed: 0,song,artists,artists_list,number_artists
0,Roxanne,Arizona Zervas,[Arizona Zervas],1
1,Say So,Doja Cat,[Doja Cat],1
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2
3,Moon,Kid Francescoli,[Kid Francescoli],1
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1


## Connecting to the API

In [10]:
load_dotenv(find_dotenv())

True

In [11]:
cid = os.getenv('spotify_p03_key')
csecret = os.getenv('spotify_p03_secret')
cc_manager = SpotifyClientCredentials(client_id=cid, client_secret=csecret)
sp = spotipy.Spotify(client_credentials_manager=cc_manager)
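
`os.getenv` returns `None` when a variable is missing from the `.env` file, and the problem would then only surface later as an authentication error. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper, not part of Spotipy or python-dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'Missing environment variable: {name}')
    return value


# Usage in the notebook, after load_dotenv(find_dotenv()):
# cid = require_env('spotify_p03_key')
# csecret = require_env('spotify_p03_secret')
```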

## Songs

### Search information about each song

In [12]:
# Search for information about each song using Spotipy
spotify_songs = [sp.search(q=df_base.iloc[index, 0], type='track', limit=50) for index in tqdm(df_base.index)]

In [13]:
# Check if there are 69 items in this list
len(spotify_songs)

69

In [14]:
# Add a column in the dataframe with the data that were just collected
df_sp['spotify_search'] = spotify_songs

In [15]:
# Check the result
df_sp.head()

Unnamed: 0,song,artists,artists_list,number_artists,spotify_search
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,{'tracks': {'href': 'https://api.spotify.com/v...
1,Say So,Doja Cat,[Doja Cat],1,{'tracks': {'href': 'https://api.spotify.com/v...
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,{'tracks': {'href': 'https://api.spotify.com/v...
3,Moon,Kid Francescoli,[Kid Francescoli],1,{'tracks': {'href': 'https://api.spotify.com/v...
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,{'tracks': {'href': 'https://api.spotify.com/v...


In [16]:
# Function to add new information in a copy of the dataframe
def get_spotify_track_info(df):
    
    '''
    Filters some data of the songs and adds them to a copy of the dataframe
    
    Args:
    -----
        df (Pandas DataFrame): a dataframe containing the songs and their artists
    
    Returns:
    --------
        df_copy (Pandas DataFrame): a copy of the dataframe with some new information appended
    '''
    
    # Create auxiliary empty lists (final lists)
    list_spotify_track_name = []
    list_spotify_track_id = []
    list_spotify_track_duration = []
    list_spotify_track_popularity = []
    list_spotify_album_release_date = []
    list_spotify_track_explicit = []
    list_spotify_track_artists = []
    lists_spotify = [list_spotify_track_name, list_spotify_track_id, list_spotify_track_duration, 
                     list_spotify_track_popularity, list_spotify_album_release_date, list_spotify_track_explicit,
                     list_spotify_track_artists]
    
    
    # Check for each row of the dataframe
    for index in df.index:
        
        # Information necessary from the dataframe to use during the process
        song_name = df.iloc[index, 0][:3].lower()
        artists_list = [artist.lower() for artist in df.iloc[index, 2]]
        total_artists = df.iloc[index, 3]
        mask = df.iloc[index, 4]['tracks']['items']
        
        # If the track was not found in the Spotify library, a 'not-found' string is added to the final lists
        if len(mask) == 0:
            for lst in lists_spotify:
                lst.append('not-found')
            #print(f'{index} - {song_name} - NOT FOUND')
        
        # If the track was found in the Spotify
        else:
            
            # Variable necessary to check if the information about a song has been added to the final lists
            added = 0
            
            # The search returned up to 50 tracks related to the query
            for each_found in mask:
                
                # Skip empty results
                if each_found is None:
                    continue
                
                # Information necessary from the Spotify API to use during the process
                track_name = each_found['name'].lower()
                track_name_normal = each_found['name']
                track_id = each_found['id']
                track_duration = each_found['duration_ms']
                track_popularity = each_found['popularity']
                album_release_date = each_found['album']['release_date']
                track_explicit = each_found['explicit']
                n_artists = len(each_found['artists'])
                first_artist_name = each_found['artists'][0]['name'].lower()
                artists_names = [each_artist['name'] for each_artist in each_found['artists']]

                # Check if the name of the song and the artists from both sources match
                if ((song_name in track_name) and (total_artists == n_artists) and
                    ((first_artist_name in artists_list) or (artists_list[0][:5] in first_artist_name))):
                    list_spotify_track_name.append(track_name_normal)
                    list_spotify_track_id.append(track_id)
                    list_spotify_track_duration.append(track_duration)
                    list_spotify_track_popularity.append(track_popularity)
                    list_spotify_album_release_date.append(album_release_date)
                    list_spotify_track_explicit.append(track_explicit)
                    list_spotify_track_artists.append(artists_names)
                    added += 1
                    #print(f'{index} - {track_name} - {track_id}')
                    
                    # Keep only the first match
                    break
            
            # If no track found in the search is a match (including when the last result is empty),
            # add a 'not-found' string to the final lists
            if added == 0:
                for lst in lists_spotify:
                    lst.append('not-found')
                #print(f'{index} - {song_name} - NOT FOUND')
    
    # Make a copy of the dataframe
    df_copy = df.copy()
    
    # Add columns with the desired information
    # Not an inplace process
    df_copy['sp_name'] = list_spotify_track_name
    df_copy['sp_id'] = list_spotify_track_id
    df_copy['sp_duration_ms'] = list_spotify_track_duration
    df_copy['sp_popularity'] = list_spotify_track_popularity
    df_copy['sp_release_date'] = list_spotify_album_release_date
    df_copy['sp_explicit'] = list_spotify_track_explicit
    df_copy['sp_artists_name'] = list_spotify_track_artists
                    
    return df_copy

In [17]:
# Add desired information to the dataframe
df_sp = get_spotify_track_info(df_sp)

# Check the result
df_sp

Unnamed: 0,song,artists,artists_list,number_artists,spotify_search,sp_name,sp_id,sp_duration_ms,sp_popularity,sp_release_date,sp_explicit,sp_artists_name
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,{'tracks': {'href': 'https://api.spotify.com/v...,ROXANNE,696DnlkuDOXcMAnKlTgXXK,163636,88,2019-10-10,True,[Arizona Zervas]
1,Say So,Doja Cat,[Doja Cat],1,{'tracks': {'href': 'https://api.spotify.com/v...,Say So,3Dv1eDb0MEgF93GpLXlucZ,237893,88,2019-11-07,True,[Doja Cat]
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,{'tracks': {'href': 'https://api.spotify.com/v...,My Oh My (feat. DaBaby),3yOlyBJuViE2YSGn3nVE1K,170746,82,2019-12-06,False,"[Camila Cabello, DaBaby]"
3,Moon,Kid Francescoli,[Kid Francescoli],1,{'tracks': {'href': 'https://api.spotify.com/v...,Moon (And It Went Like),24upABZ8A0sAepfu91sEYr,390638,70,2017-03-03,False,[Kid Francescoli]
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,{'tracks': {'href': 'https://api.spotify.com/v...,Vibe (If I Back It Up),4gOgQTv9RYYFZ1uQNnlk3q,83940,72,2019-03-29,True,[Cookiee Kawaii]
...,...,...,...,...,...,...,...,...,...,...,...,...
64,What the Hell,Avril Lavigne,[Avril Lavigne],1,{'tracks': {'href': 'https://api.spotify.com/v...,What the Hell,2z4U9d5OAA4YLNXoCgioxo,220706,74,2011-03-08,False,[Avril Lavigne]
65,Towards the Sun,Rihanna,[Rihanna],1,{'tracks': {'href': 'https://api.spotify.com/v...,"Towards The Sun - From The ""Home"" Soundtrack",1UuZhGTon3gzXQAJzNa2A4,273293,55,2015-03-23,False,[Rihanna]
66,I Think I'm OKAY,"Machine Gun Kelly, YUNGBLUD, and Travis Barker","[Machine Gun Kelly, YUNGBLUD, Travis Barker]",3,{'tracks': {'href': 'https://api.spotify.com/v...,I Think I'm OKAY (with YUNGBLUD & Travis Barker),2gTdDMpNxIRFSiu7HutMCg,169397,81,2019-07-05,True,"[Machine Gun Kelly, YUNGBLUD, Travis Barker]"
67,Myself,Bazzi,[Bazzi],1,{'tracks': {'href': 'https://api.spotify.com/v...,Myself,5YLHLxoZsodDWjqSgjhBf3,167552,76,2018-04-12,False,[Bazzi]


In [18]:
# Check if songs were not found in Spotify
df_sp[df_sp.sp_id == 'not-found']

Unnamed: 0,song,artists,artists_list,number_artists,spotify_search,sp_name,sp_id,sp_duration_ms,sp_popularity,sp_release_date,sp_explicit,sp_artists_name
11,How About Now,Bryson Tiller,[Bryson Tiller],1,{'tracks': {'href': 'https://api.spotify.com/v...,not-found,not-found,not-found,not-found,not-found,not-found,not-found
36,Shibuya — Chanel Funk Remix,Frank Ocean and L.Dre,"[Frank Ocean, L.Dre]",2,{'tracks': {'href': 'https://api.spotify.com/v...,not-found,not-found,not-found,not-found,not-found,not-found,not-found
39,WOP,J. Dash feat. Flo Rida,"[J. Dash, Flo Rida]",2,{'tracks': {'href': 'https://api.spotify.com/v...,not-found,not-found,not-found,not-found,not-found,not-found,not-found


Three songs were not found on Spotify. They were checked manually in the Spotify desktop app: the first two do not exist there, and the third exists, but not in the version with the featured artist listed in the dataframe.
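
The matching rule behind these results can be sketched in isolation. A search result is accepted when the first three letters of the song title appear in the track name, the number of artists is the same, and the first Spotify artist is either in the artist list or contains the first five letters of the first listed artist. The helper names and the sample search results below (including their `id` values) are hypothetical and only illustrate the rule.

```python
def is_match(song, artists_list, track):
    """Apply the notebook's matching heuristic to one Spotify search result."""
    song_prefix = song[:3].lower()
    artists = [artist.lower() for artist in artists_list]
    track_name = track['name'].lower()
    first_artist = track['artists'][0]['name'].lower()
    return (song_prefix in track_name
            and len(artists) == len(track['artists'])
            and (first_artist in artists or artists[0][:5] in first_artist))


def first_match(song, artists_list, items):
    """Return the first search result that matches, or None when nothing matches."""
    for track in items:
        if track is not None and is_match(song, artists_list, track):
            return track
    return None


# Hypothetical search results, shaped like response['tracks']['items']
items = [
    None,
    {'name': 'Roxanne - Remastered', 'id': 'id-a', 'artists': [{'name': 'The Police'}]},
    {'name': 'ROXANNE', 'id': 'id-b', 'artists': [{'name': 'Arizona Zervas'}]},
]
print(first_match('Roxanne', ['Arizona Zervas'], items)['id'])

# A solo version does not match a credit with a featured artist (different artist count)
solo_only = [{'name': 'Wop', 'id': 'id-c', 'artists': [{'name': 'J. Dash'}]}]
print(first_match('WOP', ['J. Dash', 'Flo Rida'], solo_only))
```

The second call shows why a song can be reported as not found even though a version of it exists on Spotify: the artist count must agree.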

### Find the audio features for each song

In [19]:
# Search in the API wrapper
spotify_audio_features = [sp.audio_features(track_id) if track_id != 'not-found' else 'not-found'
                          for track_id in tqdm(df_sp.sp_id)]

In [20]:
# Check if there are 69 items in this list
len(spotify_audio_features)

69
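
The cell above sends one request per track. Spotipy's `audio_features` also accepts a list of track IDs (the Spotify endpoint takes up to 100 per call), so the same data could be fetched in far fewer requests. The sketch below is one possible way to do that; `fetch_audio_features` and `chunked` are hypothetical helpers, and `fetch` stands for any callable with the same behavior as `sp.audio_features`. Unlike the cell above, it returns one dictionary per track rather than a one-item list.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_audio_features(track_ids, fetch, batch_size=100):
    """Fetch audio features in batches, keeping 'not-found' placeholders in place."""
    found_ids = [track_id for track_id in track_ids if track_id != 'not-found']
    features_by_id = {}
    for batch in chunked(found_ids, batch_size):
        # `fetch` is expected to return one result per requested ID, in the same order
        for track_id, features in zip(batch, fetch(batch)):
            features_by_id[track_id] = features
    return [features_by_id.get(track_id, 'not-found') for track_id in track_ids]


# Usage in the notebook (assumption: `sp` is the authenticated Spotipy client):
# spotify_audio_features = fetch_audio_features(list(df_sp.sp_id), sp.audio_features)
```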

In [21]:
# Function to add new information in a copy of the dataframe
def get_spotify_audio_features(df, audio_features: list):
    
    '''
    Adds new information from a list to a copy of the dataframe
    
    Args:
    -----
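        df (Pandas DataFrame): a dataframe containing the songs and their artists
        audio_features (list): the audio features of each song, in the same order as the rows of the dataframe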
    
    Returns:
    --------
        df_copy (Pandas DataFrame): a copy of the dataframe with some new information appended
    '''
    
    # Create auxiliary empty lists (final lists)
    list_danceability = []
    list_energy = []
    list_key = []
    list_loudness = []
    list_mode = []
    list_speechiness = []
    list_acousticness = []
    list_instrumentalness = []
    list_liveness = []
    list_valence = []
    list_tempo = []
    list_time_signature = []
    lists_features = [list_danceability, list_energy, list_key, list_loudness, list_mode, list_speechiness, 
                      list_acousticness, list_instrumentalness, list_liveness, list_valence, list_tempo,
                      list_time_signature]
    
    # Check for each row of the dataframe
    for index in df.index:
        
        # Get the track's Spotify id (column 6 is 'sp_id'; column 5 is 'sp_name')
        track_id = df.iloc[index, 6]
        
        # If the track was not found in the Spotify library, a 'not-found' string is added to the final lists
        if track_id == 'not-found':
            
            for lst in lists_features:
                lst.append('not-found')
    
        # If the track was found in the Spotify library
        else:
            
            # Add the information to the final lists
            list_danceability.append(audio_features[index][0]['danceability'])
            list_energy.append(audio_features[index][0]['energy'])
            list_key.append(audio_features[index][0]['key'])
            list_loudness.append(audio_features[index][0]['loudness'])
            list_mode.append(audio_features[index][0]['mode'])
            list_speechiness.append(audio_features[index][0]['speechiness'])
            list_acousticness.append(audio_features[index][0]['acousticness'])
            list_instrumentalness.append(audio_features[index][0]['instrumentalness'])
            list_liveness.append(audio_features[index][0]['liveness'])
            list_valence.append(audio_features[index][0]['valence'])
            list_tempo.append(audio_features[index][0]['tempo'])
            list_time_signature.append(audio_features[index][0]['time_signature'])
     
    # Make a copy of the dataframe
    df_copy = df.copy()
    
    # Add columns with the desired information
    # Not an inplace process
    df_copy['sp_danceability'] = list_danceability
    df_copy['sp_energy'] = list_energy
    df_copy['sp_key'] = list_key
    df_copy['sp_loudness'] = list_loudness
    df_copy['sp_mode'] = list_mode
    df_copy['sp_speechiness'] = list_speechiness
    df_copy['sp_acousticness'] = list_acousticness
    df_copy['sp_instrumentalness'] = list_instrumentalness
    df_copy['sp_liveness'] = list_liveness
    df_copy['sp_valence'] = list_valence
    df_copy['sp_tempo'] = list_tempo
    df_copy['sp_time_signature'] = list_time_signature
    
    return df_copy

In [22]:
# Add new information to the dataframe
df_sp = get_spotify_audio_features(df_sp, spotify_audio_features)

# Check result
df_sp.head()

Unnamed: 0,song,artists,artists_list,number_artists,spotify_search,sp_name,sp_id,sp_duration_ms,sp_popularity,sp_release_date,...,sp_key,sp_loudness,sp_mode,sp_speechiness,sp_acousticness,sp_instrumentalness,sp_liveness,sp_valence,sp_tempo,sp_time_signature
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,{'tracks': {'href': 'https://api.spotify.com/v...,ROXANNE,696DnlkuDOXcMAnKlTgXXK,163636,88,2019-10-10,...,6,-5.616,0,0.148,0.0522,0.0,0.46,0.457,116.735,5
1,Say So,Doja Cat,[Doja Cat],1,{'tracks': {'href': 'https://api.spotify.com/v...,Say So,3Dv1eDb0MEgF93GpLXlucZ,237893,88,2019-11-07,...,11,-4.577,0,0.158,0.256,3.57e-06,0.0904,0.786,110.962,4
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,{'tracks': {'href': 'https://api.spotify.com/v...,My Oh My (feat. DaBaby),3yOlyBJuViE2YSGn3nVE1K,170746,82,2019-12-06,...,8,-6.024,1,0.0296,0.018,1.29e-05,0.0887,0.383,105.046,4
3,Moon,Kid Francescoli,[Kid Francescoli],1,{'tracks': {'href': 'https://api.spotify.com/v...,Moon (And It Went Like),24upABZ8A0sAepfu91sEYr,390638,70,2017-03-03,...,7,-10.002,1,0.0345,0.288,0.856,0.102,0.0584,117.986,4
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,{'tracks': {'href': 'https://api.spotify.com/v...,Vibe (If I Back It Up),4gOgQTv9RYYFZ1uQNnlk3q,83940,72,2019-03-29,...,10,-8.719,1,0.344,0.0635,0.00932,0.118,0.175,159.947,4
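
The twelve auxiliary lists in the function above can also be expressed as one dictionary per song. The sketch below is an alternative under stated assumptions, not the notebook's implementation: `features_row` is a hypothetical helper that takes one item of `spotify_audio_features` (either `'not-found'` or a one-item list) and returns the prefixed columns. It uses `None` instead of `'not-found'` for missing values, which would let pandas keep numeric dtypes for these columns.

```python
FEATURE_KEYS = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                'time_signature']


def features_row(entry, keys=FEATURE_KEYS):
    """Map one audio-features result to a dict of 'sp_'-prefixed columns."""
    # `entry` is 'not-found', or a one-item list holding a dict of features (or None)
    if entry == 'not-found' or not entry or entry[0] is None:
        return {f'sp_{key}': None for key in keys}
    return {f'sp_{key}': entry[0][key] for key in keys}


# Usage in the notebook (assumption: pandas imported as pd):
# df_features = pd.DataFrame([features_row(entry) for entry in spotify_audio_features])
```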


### Final Dataframe

In [23]:
df_tracks_sp = df_sp[~(df_sp.sp_id == 'not-found')]

# Check the result
df_tracks_sp

Unnamed: 0,song,artists,artists_list,number_artists,spotify_search,sp_name,sp_id,sp_duration_ms,sp_popularity,sp_release_date,...,sp_key,sp_loudness,sp_mode,sp_speechiness,sp_acousticness,sp_instrumentalness,sp_liveness,sp_valence,sp_tempo,sp_time_signature
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,{'tracks': {'href': 'https://api.spotify.com/v...,ROXANNE,696DnlkuDOXcMAnKlTgXXK,163636,88,2019-10-10,...,6,-5.616,0,0.148,0.0522,0,0.46,0.457,116.735,5
1,Say So,Doja Cat,[Doja Cat],1,{'tracks': {'href': 'https://api.spotify.com/v...,Say So,3Dv1eDb0MEgF93GpLXlucZ,237893,88,2019-11-07,...,11,-4.577,0,0.158,0.256,3.57e-06,0.0904,0.786,110.962,4
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,{'tracks': {'href': 'https://api.spotify.com/v...,My Oh My (feat. DaBaby),3yOlyBJuViE2YSGn3nVE1K,170746,82,2019-12-06,...,8,-6.024,1,0.0296,0.018,1.29e-05,0.0887,0.383,105.046,4
3,Moon,Kid Francescoli,[Kid Francescoli],1,{'tracks': {'href': 'https://api.spotify.com/v...,Moon (And It Went Like),24upABZ8A0sAepfu91sEYr,390638,70,2017-03-03,...,7,-10.002,1,0.0345,0.288,0.856,0.102,0.0584,117.986,4
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,{'tracks': {'href': 'https://api.spotify.com/v...,Vibe (If I Back It Up),4gOgQTv9RYYFZ1uQNnlk3q,83940,72,2019-03-29,...,10,-8.719,1,0.344,0.0635,0.00932,0.118,0.175,159.947,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,What the Hell,Avril Lavigne,[Avril Lavigne],1,{'tracks': {'href': 'https://api.spotify.com/v...,What the Hell,2z4U9d5OAA4YLNXoCgioxo,220706,74,2011-03-08,...,6,-3.689,0,0.0548,0.00472,0.0127,0.14,0.877,149.976,4
65,Towards the Sun,Rihanna,[Rihanna],1,{'tracks': {'href': 'https://api.spotify.com/v...,"Towards The Sun - From The ""Home"" Soundtrack",1UuZhGTon3gzXQAJzNa2A4,273293,55,2015-03-23,...,4,-6.207,0,0.0392,0.0531,0,0.152,0.263,170.18,4
66,I Think I'm OKAY,"Machine Gun Kelly, YUNGBLUD, and Travis Barker","[Machine Gun Kelly, YUNGBLUD, Travis Barker]",3,{'tracks': {'href': 'https://api.spotify.com/v...,I Think I'm OKAY (with YUNGBLUD & Travis Barker),2gTdDMpNxIRFSiu7HutMCg,169397,81,2019-07-05,...,7,-4.718,1,0.0379,0.0257,0,0.313,0.277,119.921,4
67,Myself,Bazzi,[Bazzi],1,{'tracks': {'href': 'https://api.spotify.com/v...,Myself,5YLHLxoZsodDWjqSgjhBf3,167552,76,2018-04-12,...,9,-5.513,0,0.072,0.465,1.12e-06,0.0338,0.902,195.918,4


In [24]:
# Dataframe metadata
df_tracks_sp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66 entries, 0 to 68
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   song                 66 non-null     object
 1   artists              66 non-null     object
 2   artists_list         66 non-null     object
 3   number_artists       66 non-null     int64 
 4   spotify_search       66 non-null     object
 5   sp_name              66 non-null     object
 6   sp_id                66 non-null     object
 7   sp_duration_ms       66 non-null     object
 8   sp_popularity        66 non-null     object
 9   sp_release_date      66 non-null     object
 10  sp_explicit          66 non-null     object
 11  sp_artists_name      66 non-null     object
 12  sp_danceability      66 non-null     object
 13  sp_energy            66 non-null     object
 14  sp_key               66 non-null     object
 15  sp_loudness          66 non-null     object
 16  sp_mode   

### Export dataset

In [25]:
# Export
df_tracks_sp.to_csv('exported_df/sp_tracks_info.csv', index=False)

## Artists

### Get artists ID

In [26]:
# Function to create a dataframe with the artists and their Spotify IDs
def get_spotify_artists_id(df):
    
    '''
    Creates a new dataframe about the artists.
    
    Args:
    -----
        df (Pandas DataFrame): a dataframe containing the songs and their artists
    
    Returns:
    --------
        df_artists_id (Pandas DataFrame): a new dataframe with the name and Spotify ID of the artists
    '''
    
    # Create auxiliary empty dictionary
    dict_artists_ids = {}
      
    # Check for each row of the dataframe
    for index in df.index:
                      
        # Information necessary from the dataframe to use during the process
        song_name = df.iloc[index, 0][:3].lower()
        artists_list = [artist.lower() for artist in df.iloc[index, 2]]
        total_artists = df.iloc[index, 3]
        mask = df.iloc[index, 4]['tracks']['items']
        added = 0
        
        # The search returned up to 50 tracks related to the query
        for idx, each_found in enumerate(mask):
                
            if mask[idx] is None:
                    pass
                
            else:
            
                # Information necessary from the Spotify API to use during the process
                track_name = mask[idx]['name'].lower()
                n_artists = len(mask[idx]['artists'])
                first_artist_name = mask[idx]['artists'][0]['name'].lower()

                # Check if the name of the song and the artists from both sources match and if the song was found
                # in the Spotify library (column 6 is 'sp_id')
                if ((song_name in track_name) & (total_artists == n_artists) & 
                    ((first_artist_name in artists_list) | (artists_list[0][:5] in first_artist_name)) & (added == 0)
                    & (df.iloc[index, 6] != 'not-found')):

                    for artist in mask[idx]['artists']:
                        artist_name = artist['name']
                        artist_id = artist['id']

                        # Add to dict
                        dict_artists_ids[artist_name] = artist_id

    # Create a Pandas DataFrame
    
    df_artists_id = pd.DataFrame(dict_artists_ids.items(), columns=['artist', 'sp_id'])
                    
    return df_artists_id

In [27]:
# Create a Pandas DataFrame with artists
df_artists_sp = get_spotify_artists_id(df_sp)

# Check result
df_artists_sp

Unnamed: 0,artist,sp_id
0,Arizona Zervas,0vRvGUQVUjytro0xpb26bs
1,Doja Cat,5cj0lLjcoR7YOSnhnX0Po5
2,Camila Cabello,4nDoRrQiYLoBzwC5BhVJzF
3,DaBaby,4r63FhuTkUYltbVAg5TQnk
4,Kid Francescoli,2G7QgTep5IsJHGHm1hXygD
...,...,...
83,Machine Gun Kelly,6TIYQ3jFPwQSRmorSezPxX
84,YUNGBLUD,6Ad91Jof8Niiw0lGLLi3NW
85,Travis Barker,4exLIFE8sISLr28sqG1qNX
86,Bazzi,4GvEc3ANtPPjt1ZJllr5Zl


In [28]:
# Search in the API wrapper
spotify_artists_info = [sp.artist(artist) for artist in tqdm(df_artists_sp.sp_id)]

In [29]:
# Add a column in the dataframe with the data that were just collected
df_artists_sp['spotify_artist'] = spotify_artists_info

# Check the result
df_artists_sp.head()

Unnamed: 0,artist,sp_id,spotify_artist
0,Arizona Zervas,0vRvGUQVUjytro0xpb26bs,{'external_urls': {'spotify': 'https://open.sp...
1,Doja Cat,5cj0lLjcoR7YOSnhnX0Po5,{'external_urls': {'spotify': 'https://open.sp...
2,Camila Cabello,4nDoRrQiYLoBzwC5BhVJzF,{'external_urls': {'spotify': 'https://open.sp...
3,DaBaby,4r63FhuTkUYltbVAg5TQnk,{'external_urls': {'spotify': 'https://open.sp...
4,Kid Francescoli,2G7QgTep5IsJHGHm1hXygD,{'external_urls': {'spotify': 'https://open.sp...


### Add information about the artists to the dataframe

In [30]:
# Function to add new information in a copy of the dataframe
def get_spotify_artist_info(df):
    
    '''
    Adds information about the artists to a copy of the dataframe.
    
    Args:
    -----
        df (Pandas DataFrame): a dataframe containing the artists and their Spotify ID
    
    Returns:
    --------
        df_copy (Pandas DataFrame): a copy of the dataframe with some new information appended
    '''
    
    # Create auxiliary empty lists (final lists)
    list_spotify_artist_genres = []
    list_spotify_artist_popularity = []
    list_spotify_artist_followers = []
    
    # Check for each row of the dataframe
    for index in df.index:
        
        sp_id = df.iloc[index, 1]
        mask = df.iloc[index, 2]
        search_sp_id = mask['id']
        #artist_name = mask['name']
        
        # Check if 'id's match
        if sp_id == search_sp_id:
            #print(f'{index} - {artist_name}: OK - {n_artist}')
            
            search_sp_genres = mask['genres']
            list_spotify_artist_genres.append(search_sp_genres)
            
            search_sp_popularity = mask['popularity']
            list_spotify_artist_popularity.append(search_sp_popularity)
            
            search_sp_followers = mask['followers']['total']
            list_spotify_artist_followers.append(search_sp_followers)
    
    # Make a copy of the dataframe
    df_copy = df.copy()
    
    # Add columns with the desired information
    # Not an inplace process
    df_copy['sp_genres'] = list_spotify_artist_genres
    df_copy['sp_popularity'] = list_spotify_artist_popularity
    df_copy['sp_followers'] = list_spotify_artist_followers
                    
    return df_copy

In [31]:
# Add desired information to the dataframe
df_artists_sp = get_spotify_artist_info(df_artists_sp)

# Check the result
df_artists_sp.head()

Unnamed: 0,artist,sp_id,spotify_artist,sp_genres,sp_popularity,sp_followers
0,Arizona Zervas,0vRvGUQVUjytro0xpb26bs,{'external_urls': {'spotify': 'https://open.sp...,"[pop rap, rap, rhode island rap]",80,487372
1,Doja Cat,5cj0lLjcoR7YOSnhnX0Po5,{'external_urls': {'spotify': 'https://open.sp...,"[la indie, pop]",88,3456204
2,Camila Cabello,4nDoRrQiYLoBzwC5BhVJzF,{'external_urls': {'spotify': 'https://open.sp...,"[dance pop, pop, post-teen pop]",87,18000350
3,DaBaby,4r63FhuTkUYltbVAg5TQnk,{'external_urls': {'spotify': 'https://open.sp...,"[north carolina hip hop, rap]",95,4334477
4,Kid Francescoli,2G7QgTep5IsJHGHm1hXygD,{'external_urls': {'spotify': 'https://open.sp...,"[electronica, french indie pop, french indietr...",61,82376


### Final Dataframe

In [32]:
df_artists_sp

Unnamed: 0,artist,sp_id,spotify_artist,sp_genres,sp_popularity,sp_followers
0,Arizona Zervas,0vRvGUQVUjytro0xpb26bs,{'external_urls': {'spotify': 'https://open.sp...,"[pop rap, rap, rhode island rap]",80,487372
1,Doja Cat,5cj0lLjcoR7YOSnhnX0Po5,{'external_urls': {'spotify': 'https://open.sp...,"[la indie, pop]",88,3456204
2,Camila Cabello,4nDoRrQiYLoBzwC5BhVJzF,{'external_urls': {'spotify': 'https://open.sp...,"[dance pop, pop, post-teen pop]",87,18000350
3,DaBaby,4r63FhuTkUYltbVAg5TQnk,{'external_urls': {'spotify': 'https://open.sp...,"[north carolina hip hop, rap]",95,4334477
4,Kid Francescoli,2G7QgTep5IsJHGHm1hXygD,{'external_urls': {'spotify': 'https://open.sp...,"[electronica, french indie pop, french indietr...",61,82376
...,...,...,...,...,...,...
83,Machine Gun Kelly,6TIYQ3jFPwQSRmorSezPxX,{'external_urls': {'spotify': 'https://open.sp...,"[ohio hip hop, pop rap, rap]",86,2457132
84,YUNGBLUD,6Ad91Jof8Niiw0lGLLi3NW,{'external_urls': {'spotify': 'https://open.sp...,"[british indie rock, modern alternative rock, ...",78,1087080
85,Travis Barker,4exLIFE8sISLr28sqG1qNX,{'external_urls': {'spotify': 'https://open.sp...,[rap rock],77,252472
86,Bazzi,4GvEc3ANtPPjt1ZJllr5Zl,{'external_urls': {'spotify': 'https://open.sp...,"[pop, post-teen pop]",83,3674143


### Export the dataset

In [33]:
# Export
df_artists_sp.to_csv('exported_df/sp_artists_info.csv', index=False)

## Playlists

In [34]:
# Search in the API wrapper
spotify_tiktok = sp.search(q='tiktok', type='playlist', limit=50)

In [35]:
# Check the number of playlists
len(spotify_tiktok['playlists']['items'])

50

### Get playlist ID

In [36]:
# Get the playlists IDs and convert to a dataframe
sp_playlists = pd.DataFrame([playlist['id'] for playlist in spotify_tiktok['playlists']['items']], 
                            columns=['sp_playlist_id'])
sp_playlists.head()

Unnamed: 0,sp_playlist_id
0,37i9dQZF1DX2L0iB23Enbq
1,65LdqYCLcsV0lJoxpeQ6fW
2,0JFatPoPq82gNcPa4esOzj
3,2NNzPH70CakBbbU8JHrZRG
4,4FLeoROn5GT7n2tZq5XB4V


### Get information about the playlists

In [37]:
# Get additional information about the playlists
sp_playlists['sp_playlist_info'] = [sp.playlist(id) for id in tqdm(sp_playlists.sp_playlist_id)]

# Check the result
sp_playlists.head()

Unnamed: 0,sp_playlist_id,sp_playlist_info
0,37i9dQZF1DX2L0iB23Enbq,"{'collaborative': False, 'description': 'Viral..."
1,65LdqYCLcsV0lJoxpeQ6fW,"{'collaborative': False, 'description': 'Best ..."
2,0JFatPoPq82gNcPa4esOzj,"{'collaborative': False, 'description': 'bigge..."
3,2NNzPH70CakBbbU8JHrZRG,"{'collaborative': False, 'description': 'TIK T..."
4,4FLeoROn5GT7n2tZq5XB4V,"{'collaborative': False, 'description': 'The m..."


### Export dataset

In [38]:
sp_playlists.to_csv('exported_df/sp_playlists_info.csv', index=False)

# Chartmetric

## Connecting to the API

In [None]:
load_dotenv(find_dotenv())

In [None]:
def get_token_cmc():
    
    '''
    Requests a Chartmetric access token using the refresh token stored
    in the `chartmetric_rftoken` environment variable
    
    Returns: str
    '''
    
    url = "https://api.chartmetric.com/api/token"
    payload = {"refreshtoken": os.getenv('chartmetric_rftoken')}

    # `json=` serializes the payload and sets the Content-Type header
    response = requests.post(url, json=payload)
    
    # Stop here with a clear error if the request was rejected
    response.raise_for_status()
    
    # The access token is returned under the "token" key
    return response.json()['token']

In [None]:
token = get_token_cmc()
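
Every request function in the following sections rebuilds the same base URL and `Authorization` header. A minimal sketch of how that shared part could be factored out is shown here. The names `CMC_BASE_URL` and `build_cmc_request` are hypothetical and not part of the original notebook, and the helper only assembles the URL and headers, so no request is sent.

```python
from urllib.parse import urlencode

# Base URL taken from the endpoints used in this notebook
CMC_BASE_URL = "https://api.chartmetric.com/api"


def build_cmc_request(path, token, **params):
    """Return (url, headers) for an authorized Chartmetric GET request.

    Hypothetical helper: `path` is the endpoint path after /api and
    `params` are URL-encoded into the query string.
    """
    url = f"{CMC_BASE_URL}/{path.lstrip('/')}"
    if params:
        url = f"{url}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers


# Example with a placeholder token and a placeholder Spotify track ID
url, headers = build_cmc_request(
    "search", "dummy-token", q="https://open.spotify.com/track/abc", type="tracks"
)
```

The returned pair can be passed straight to `requests.get(url, headers=headers)`.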

## Songs

In [None]:
# Create a dataframe with songs for the Chartmetric dataset
df_cmc_tracks = df_sp.iloc[:, 0:6].drop(columns='spotify_search')
df_cmc_tracks

### Get Chartmetric track ID

In [None]:
def get_cmc_id_spotify(query: str, limit=10, offset=0):
    
    '''
    Searches for the Chartmetric track ID using the Spotify track ID
    
    Args: query (Spotify track ID), limit and offset (search pagination)
    
    Returns: json
    '''
    
    search = f'https://open.spotify.com/track/{query}'
    url = f'https://api.chartmetric.com/api/search?q={search}&limit={limit}&offset={offset}&type=tracks'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_tracks_id = [get_cmc_id_spotify(id) if id != 'not-found' else 'not-found' for id in tqdm(df_cmc_tracks.sp_id)]

In [None]:
len(cmc_tracks_id)
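
The list comprehension above fires one request per track with no delay between them. If the API throttles rapid calls, a short pause between requests may help. The sketch below is one way to do that, assuming a hypothetical helper name `call_with_pause` and an arbitrary default pause; the actual rate limit is not stated in this notebook.

```python
from time import sleep


def call_with_pause(func, ids, pause=0.5, skip_value="not-found"):
    """Call func(item) for each item, sleeping between calls.

    Hypothetical helper: items equal to skip_value are passed through
    unchanged, mirroring the list comprehensions in this notebook.
    """
    results = []
    for item in ids:
        if item == skip_value:
            results.append(skip_value)
            continue
        results.append(func(item))
        sleep(pause)  # assumed delay; tune it to the API's actual limit
    return results


# Example with a stand-in function instead of a real API call
demo = call_with_pause(lambda x: {"id": x}, ["a", "not-found", "b"], pause=0)
```

With the real function this would read `call_with_pause(get_cmc_id_spotify, df_cmc_tracks.sp_id)`.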

In [None]:
df_cmc_tracks['cmc_song_seach'] = cmc_tracks_id
df_cmc_tracks

In [None]:
# Take the ID of the first search hit; rows without a Spotify ID stay 'not-found'
df_cmc_tracks['cmc_id'] = [cmc_search['obj']['tracks'][0]['id'] if cmc_search != 'not-found' 
                           else 'not-found' for cmc_search in df_cmc_tracks.cmc_song_seach]

df_cmc_tracks.head()
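
The lookup above assumes that every search response contains at least one result under `['obj']['tracks']`. A more defensive version could fall back to `'not-found'` when the list is empty or the response has an unexpected shape. This is only a sketch; `first_result_id` is a hypothetical name and the sample responses are made up for illustration.

```python
def first_result_id(search, kind):
    """Return the ID of the first hit under search['obj'][kind].

    Falls back to 'not-found' when the response is not a dict,
    has no 'obj' dict, or contains no results of the given kind.
    """
    if not isinstance(search, dict):
        return "not-found"
    obj = search.get("obj")
    if not isinstance(obj, dict):
        return "not-found"
    results = obj.get(kind) or []
    return results[0]["id"] if results else "not-found"


# Illustrative responses, not real API output
found = first_result_id({"obj": {"tracks": [{"id": 123}]}}, "tracks")
empty = first_result_id({"obj": {"tracks": []}}, "tracks")
skipped = first_result_id("not-found", "tracks")
```

The same helper would work for the artist search by passing `'artists'` as `kind`.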

### Get track metadata

In [None]:
def get_cmc_metadata(cmc_id: str):
    
    '''
    Gets the track metadata from Chartmetric
    
    Args: Chartmetric track ID
    
    Returns: json
    '''
    
    url = f'https://api.chartmetric.com/api/track/{cmc_id}'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_track_metadata = [get_cmc_metadata(id) if id != 'not-found' else 'not-found' for id in tqdm(df_cmc_tracks.cmc_id)]

In [None]:
df_cmc_tracks['cmc_track_metadata'] = cmc_track_metadata

In [None]:
df_cmc_tracks.head()

### Get track stats - TikTok

In [None]:
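def get_cmc_track_stats_tiktok(cmc_id, since="2020-01-01"):
    
    '''
    Gets the TikTok stats of a track from Chartmetric
    
    Args: Chartmetric track ID, start date (YYYY-MM-DD)
    
    Returns: json
    '''
    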
    url = f'https://api.chartmetric.com/api/track/{cmc_id}/tiktok/stats?since={since}'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_track_stats_tiktok = [get_cmc_track_stats_tiktok(id) if id != 'not-found' else 'not-found' 
                          for id in tqdm(df_cmc_tracks.cmc_id)]

In [None]:
df_cmc_tracks['cmc_track_stats_tiktok'] = cmc_track_stats_tiktok
df_cmc_tracks.head()

### Export dataset

In [None]:
df_cmc_tracks.to_csv('cmc_tracks_info.csv', index=False)

## Artists

In [None]:
df_cmc_artists = df_artists.drop(columns=['spotify_artist'])
df_cmc_artists

### Get Artists IDs

In [None]:
def cmc_search_artist_spotify(query: str, limit=10, offset=0):
    
    '''
    Searches for the Chartmetric artist ID using the Spotify artist ID
    
    Args: query (Spotify artist ID), limit and offset (search pagination)
    
    Returns: json
    '''
    
    search = f'https://open.spotify.com/artist/{query}'
    url = f'https://api.chartmetric.com/api/search?q={search}&limit={limit}&offset={offset}&type=artists'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_artist_id = [cmc_search_artist_spotify(id) for id in tqdm(df_cmc_artists['sp_id'])]

In [None]:
df_cmc_artists['cmc_search'] = cmc_artist_id

In [None]:
df_cmc_artists.head()

In [None]:
df_cmc_artists.cmc_search[0]['obj']['artists'][0]

In [None]:
df_cmc_artists['cmc_id'] = [search['obj']['artists'][0]['id'] for search in df_cmc_artists.cmc_search]
df_cmc_artists

### Get artist metadata

In [None]:
def get_cmc_artist_metadata(cmc_id: str):
    
    '''
    Gets the artist metadata from Chartmetric
    
    Args: Chartmetric artist ID
    
    Returns: json
    '''
    
    url = f'https://api.chartmetric.com/api/artist/{cmc_id}'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_art_metadata = [get_cmc_artist_metadata(id) for id in tqdm(df_cmc_artists.cmc_id)]

In [None]:
df_cmc_artists['cmc_art_metadata'] = cmc_art_metadata
df_cmc_artists

### Get Spotify monthly listeners by city

In [None]:
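def get_cmc_artist_sp_city(cmc_id: str):
    
    '''
    Gets the cities where people listen to an artist on Spotify
    
    Args: Chartmetric artist ID
    
    Returns: json
    '''
    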
    url = f'https://api.chartmetric.com/api/artist/{cmc_id}/where-people-listen'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_art_sp_city = [get_cmc_artist_sp_city(id) for id in tqdm(df_cmc_artists.cmc_id)]

In [None]:
df_cmc_artists['cmc_art_sp_city'] = cmc_art_sp_city
df_cmc_artists.head()

### Get TikTok audience data


In [None]:
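def get_cmc_artist_tiktok_audience_data(cmc_id: str):
    
    '''
    Gets the TikTok audience stats of an artist from Chartmetric
    
    Args: Chartmetric artist ID
    
    Returns: json
    '''
    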
    url = f'https://api.chartmetric.com/api/artist/{cmc_id}/tiktok-audience-stats'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_art_tiktok_audience = [get_cmc_artist_tiktok_audience_data(id) for id in tqdm(df_cmc_artists.cmc_id)]

In [None]:
df_cmc_artists['cmc_art_tiktok_audience'] = cmc_art_tiktok_audience
df_cmc_artists.head()

### Export dataset

In [None]:
df_cmc_artists.to_csv('cmc_artists.csv', index=False)

In [None]:
# Reload the exported file to check that it was written correctly
df_teste2 = pd.read_csv('cmc_artists.csv')
df_teste2
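
Several columns in the exported dataframes hold nested dictionaries (the API responses). When written to CSV they are stored as their text representation, so after reloading they come back as plain strings. A minimal sketch of how they could be turned back into Python objects with `literal_eval` follows; `parse_cell` is a hypothetical helper name and the sample record is made up for illustration.

```python
from ast import literal_eval


def parse_cell(text):
    """Convert a stringified Python literal back into an object.

    Values that are not valid literals, such as the 'not-found'
    marker used in this notebook, are returned unchanged.
    """
    try:
        return literal_eval(text)
    except (ValueError, SyntaxError):
        return text


# Illustrative record, not real API output
record = {"collaborative": False, "tracks": {"total": 3}, "description": None}
as_text = str(record)           # what ends up in the CSV cell
restored = parse_cell(as_text)  # back to a dictionary
marker = parse_cell("not-found")
```

With pandas, the same function could be applied while reading, for example through the `converters` argument of `pd.read_csv`.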

## Charts

In [None]:
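def get_cmc_tiktok_charts(date):
    
    '''
    Gets the weekly TikTok tracks chart from Chartmetric
    
    Args: chart date (YYYY-MM-DD)
    
    Returns: json
    '''
    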
    url = f'https://api.chartmetric.com/api/charts/tiktok/tracks?date={date}&interval=weekly'
    
    headers = {
    'Authorization': 'Bearer ' + token
    }
    
    response = requests.get(url, headers=headers)
    
    return response.json()

In [None]:
cmc_tiktok_chart = get_cmc_tiktok_charts('2020-09-08')

In [None]:
df_cmc_chart = pd.DataFrame(cmc_tiktok_chart).reset_index()
df_cmc_chart.to_csv('cmc_chart.csv', index=False)
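
The chart above is requested for a single date. To collect several weekly charts, the dates could be generated in 7-day steps and passed to `get_cmc_tiktok_charts` one by one. The sketch below only builds the list of dates; `weekly_dates` is a hypothetical helper, and whether the API expects a particular weekday is an assumption left unchecked here.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Yield ISO date strings from start to end (inclusive) in 7-day steps."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        yield current.isoformat()
        current += timedelta(days=7)


# Four consecutive weeks starting from the date used in this notebook
dates = list(weekly_dates("2020-09-08", "2020-09-29"))
```

Each element of `dates` has the same `YYYY-MM-DD` format as the argument used in the call above.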