![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

<body>
    <p style="font-size:28px;text-align:center"><b>Project 03 | Web Scraping & API</b></p>
</body>

# Introduction

The objective of the first part of this project was to collect data to answer a problem statement, practicing web scraping and the use of APIs.

---

<body>
    <p style="font-size:20px"><b>Problem Statement</b></p>
</body>

_What do some of the TikTok viral songs have in common?_

---

To answer this question, 69 songs obtained from the **PopSugar** website were analyzed. The post containing this list was published on March 27th, 2020 by Hedy Phillips, around the time people started to quarantine because of the COVID-19 pandemic and began to use TikTok more to spend their time at home.

The sources of information used to gather data were **Spotify**, **Last.fm** and **MusicBrainz**.

---

Sources:
- Websites:
  - PopSugar: https://www.popsugar.com/entertainment/popular-tiktok-songs-47289804?stream_view=1#photo-47289832
  - MusicBrainz: https://musicbrainz.org/genres

- APIs:
  - Spotify API: https://developer.spotify.com/
  - Spotipy (Spotify API wrapper for Python): https://spotipy.readthedocs.io/en/2.15.0/
  - Last.fm API: https://www.last.fm/api


# Setup

## Import

In [1]:
import os
import re
import requests
from time import sleep

import numpy as np
import pandas as pd
import spotipy

from bs4 import BeautifulSoup
from dotenv import load_dotenv, find_dotenv
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth
from tqdm.auto import tqdm

# Web Scraping

Web scraping was necessary to collect the following data:

<table>
  <thead>
    <tr>
      <th>INFORMATION</th>
      <th>SOURCE</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>69 TikTok viral songs</td>
      <td>PopSugar</td>
    </tr>
    <tr>
      <td>Music genres</td>
      <td>MusicBrainz</td>
    </tr>
  </tbody>
</table>



## List of viral songs on TikTok

### Get response

In [2]:
# Get response from the url and check it
url = 'https://www.popsugar.com/entertainment/popular-tiktok-songs-47289804?stream_view=1#photo-47289832'
response = requests.get(url)
response

<Response [200]>

### Data Collection

In [3]:
# Get the content in the url
content_popsugar = BeautifulSoup(response.text)

# Get the date the post was made
popsugar_date = content_popsugar.find('time').text.replace('\n', '').strip()

# Get only the songs and artists
popsugar_html = content_popsugar.find_all('span', attrs={'class': 'count-copy'})

In [4]:
# Convert the list 'popsugar_html' to a Pandas DataFrame
df_base = pd.DataFrame([re.split(' by ', song.text.replace('"', '').strip()) for song in popsugar_html], 
                       columns=['song', 'artists'])

# Check the result
df_base

Unnamed: 0,song,artists
0,Roxanne,Arizona Zervas
1,Say So,Doja Cat
2,My Oh My,Camila Cabello feat. DaBaby
3,Moon,Kid Francescoli
4,Vibe,Cookiee Kawaii
...,...,...
64,What the Hell,Avril Lavigne
65,Towards the Sun,Rihanna
66,I Think I'm OKAY,"Machine Gun Kelly, YUNGBLUD, and Travis Barker"
67,Myself,Bazzi
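
The split on `' by '` above works as long as the separator appears exactly once in each entry. As a minimal sketch (the helper name `split_song_entry` and the second sample string are hypothetical and not part of the scraped list), splitting on the last occurrence keeps a title that itself contains ' by ' in one piece:

```python
def split_song_entry(entry):
    """Split a '"Title" by Artist' string into (title, artists)."""
    # Drop the double quotes around the title, as the notebook does
    cleaned = entry.replace('"', '').strip()
    # Split only on the LAST ' by ' so a title containing ' by ' stays whole
    title, _, artists = cleaned.rpartition(' by ')
    return title, artists


print(split_song_entry('"Say So" by Doja Cat'))
print(split_song_entry('"Stand by Me" by Ben E. King'))
```

The trade-off is that an artist name containing ' by ' would now be the part that splits wrongly, so the entries still deserve a visual check either way.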


### Data Cleaning

In [5]:
# Create a column with a list of artists for each song
df_base['artists_list'] = [re.split(',* and |, * | [Ff]eat. ', artists.strip()) for artists in df_base.artists]

# Create a column with the number of artists for each song
df_base['number_artists'] = df_base.artists_list.apply(len)

In [6]:
# Check possible number of artists for one song
df_base.number_artists.value_counts()

1    48
2    18
3     3
Name: number_artists, dtype: int64

As the result above shows, the maximum number of artists for a song is 3.
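
To make the splitting rule easier to inspect, here is a small standalone sketch. The pattern is close to the one used in the notebook but not identical (the dot after "feat" is escaped and the comma handling is simplified), and the name `split_artists` is hypothetical:

```python
import re

# Separators handled: ' and ' (optionally preceded by a comma), ', ' and ' feat. '
ARTIST_SEPARATOR = re.compile(r',? and |, | [Ff]eat\. ')


def split_artists(artists):
    """Return the individual artist names found in a credit string."""
    return ARTIST_SEPARATOR.split(artists.strip())


print(split_artists('Camila Cabello feat. DaBaby'))
print(split_artists('Machine Gun Kelly, YUNGBLUD, and Travis Barker'))
```

Note that any artist whose own name contains ' and ' would also be split by this rule, so the counts are only as reliable as the credit strings.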

In [7]:
# Create columns to store one artist each
df_base['artist_1'] = df_base.artists_list.apply(lambda x : x[0] if len(x) >= 1 else 'no-artist')
df_base['artist_2'] = df_base.artists_list.apply(lambda x : x[1] if len(x) >= 2 else 'no-artist')
df_base['artist_3'] = df_base.artists_list.apply(lambda x : x[2] if len(x) == 3 else 'no-artist')

In [8]:
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist
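
The three `artist_n` columns hard-code the maximum of 3 artists found above. A possible generalisation, shown only as a sketch (the helper `pad_artists` is hypothetical), pads each list to a fixed width with the same `'no-artist'` filler:

```python
def pad_artists(artists, width, filler='no-artist'):
    """Return exactly `width` entries, filling the missing slots with `filler`."""
    # Append enough fillers, then cut to the requested width
    return (list(artists) + [filler] * width)[:width]


print(pad_artists(['Camila Cabello', 'DaBaby'], 3))
```

Lists longer than `width` are truncated, so the width should be taken from the largest value of `number_artists`.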


### Make backup dataframe

In [9]:
df_base_raw_bck = df_base.copy()

## List of music genres

### Get the response

In [10]:
# Get response from the url and check it
url = 'https://musicbrainz.org/genres'
response = requests.get(url)
response

<Response [200]>

### Data collection

In [11]:
# Get the content in the url
musicbrainz_content = BeautifulSoup(response.content)

# Create a list with the music genres listed in the url
musicbrainz_genre = [genre.text for genre in musicbrainz_content.find_all('bdi')]

The data cleaning for this list will be done later.

# Spotify

With the help of Spotipy, some data about each song will be gathered from Spotify. Note that some songs may not be in Spotify's library.

## Connecting to the API

In [12]:
load_dotenv(find_dotenv())

True

In [13]:
cid = os.getenv('spotify_p03_key')
csecret = os.getenv('spotify_p03_secret')
cc_manager = SpotifyClientCredentials(client_id=cid, client_secret=csecret)
sp = spotipy.Spotify(client_credentials_manager=cc_manager)
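
`os.getenv` returns `None` when a variable is missing from the `.env` file, which only surfaces later as an authentication error. A minimal sketch of an early check (the helper `require_env` is hypothetical, not part of the notebook):

```python
import os


def require_env(name):
    """Return the value of the environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'Environment variable {name!r} is not set')
    return value
```

With this helper, the credentials could be read as `cid = require_env('spotify_p03_key')` so that a missing key fails at the point where it is read.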

## Songs

### Search information about each song

In [14]:
# Search information about each song, using the Spotipy
spotify_songs = [sp.search(q=df_base.iloc[index, 0], type='track', limit=50) for index in tqdm(df_base.index)]

In [15]:
# Check if there are 69 items in this list
len(spotify_songs)

69

In [16]:
# Add a column in the dataframe with the data that were just collected
df_base['spotify_search'] = spotify_songs

In [17]:
# Check the result
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3,spotify_search
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...


In [18]:
# Function to add new information in a copy of the dataframe
def get_spotify_track_info(df):
    
    '''
    Filters some data of the songs and adds them to a copy of the dataframe
    
    Args:
    -----
        df (Pandas DataFrame): a dataframe containing the songs and their artists
    
    Returns:
    --------
        df_copy (Pandas DataFrame): a copy of the dataframe with some new information appended
    '''
    
    # Create auxiliary empty lists (final lists)
    list_spotify_track_id = []
    list_spotify_track_duration = []
    list_spotify_track_popularity = []
    list_spotify_album_release_date = []
    list_spotify_track_explicit = []
    
    # Check for each row of the dataframe
    for index in df.index:
        
        # Information necessary from the dataframe to use during the process
        song_name = df.iloc[index, 0].lower()
        artists_list = [artist.lower() for artist in df.iloc[index, 2]]
        total_artists = df.iloc[index, 3]
        mask = df.iloc[index, 7]['tracks']['items']
        
        # If the track was not found in the Spotify library, a 'not-found' string is added to the final lists
        if len(mask) == 0:
            list_spotify_track_id.append('not-found')
            list_spotify_track_duration.append('not-found')
            list_spotify_track_popularity.append('not-found')
            list_spotify_album_release_date.append('not-found')
            list_spotify_track_explicit.append('not-found')
            #print(f'{index} - {song_name} - NOT FOUND')
        
        # If the track was found in the Spotify
        else:
            
            # Variable necessary to check if the information about a song has been added to the final lists
            added = 0
            
            # Up to 50 tracks related to the query were returned for each song
            for idx, each_found in enumerate(mask):
                
                # Information necessary from the Spotify API to use during the process
                track_name = mask[idx]['name'].lower()
                track_id = mask[idx]['id']
                track_duration = mask[idx]['duration_ms']
                track_popularity = mask[idx]['popularity']
                album_release_date = mask[idx]['album']['release_date']
                track_explicit = mask[idx]['explicit']
                n_artists = len(mask[idx]['artists'])
                first_artist_name = mask[idx]['artists'][0]['name'].lower()
            
                # Check that the song name and the artists from both sources match, and that no information
                # about the song has been added to the final lists yet
                if ((song_name in track_name) and (total_artists == n_artists)
                        and (first_artist_name in artists_list) and (added == 0)):
                    list_spotify_track_id.append(track_id)
                    list_spotify_track_duration.append(track_duration)
                    list_spotify_track_popularity.append(track_popularity)
                    list_spotify_album_release_date.append(album_release_date)
                    list_spotify_track_explicit.append(track_explicit)
                    added += 1
                    #print(f'{index} - {track_name} - {track_id}')
                
                # If the track found in the search is not a match, it is the last one and information about the
                # track has not been added to the final lists, then add a 'not-found' string to the final lists
                elif (idx == len(mask) - 1) and (added == 0):
                    list_spotify_track_id.append('not-found')
                    list_spotify_track_duration.append('not-found')
                    list_spotify_track_popularity.append('not-found')
                    list_spotify_album_release_date.append('not-found')
                    list_spotify_track_explicit.append('not-found')
                    #print(f'{index} - {song_name} - NOT FOUND')
    
    # Make a copy of the dataframe
    df_copy = df.copy()
    
    # Add columns with the desired information
    # Not an inplace process
    df_copy['sp_id'] = list_spotify_track_id
    df_copy['sp_duration_ms'] = list_spotify_track_duration
    df_copy['sp_popularity'] = list_spotify_track_popularity
    df_copy['sp_release_date'] = list_spotify_album_release_date
    df_copy['sp_explicit'] = list_spotify_track_explicit
                    
    return df_copy
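
The matching rule inside `get_spotify_track_info` can be read in isolation: a search result is accepted when the song name is contained in the track name, the number of artists is the same, and the first Spotify artist is one of the listed artists. The sketch below restates that rule on plain dictionaries; the function name `find_matching_track` and the trimmed sample items are hypothetical, and real search items carry many more fields.

```python
def find_matching_track(song_name, artists, items):
    """Return the first search item that satisfies the matching rule, else None."""
    song_name = song_name.lower()
    artists = [artist.lower() for artist in artists]
    for item in items:
        same_title = song_name in item['name'].lower()
        same_artist_count = len(item['artists']) == len(artists)
        known_first_artist = item['artists'][0]['name'].lower() in artists
        if same_title and same_artist_count and known_first_artist:
            return item
    return None


# Hypothetical, heavily trimmed search items
items = [
    {'name': 'Say So (Remix)', 'artists': [{'name': 'Doja Cat'}, {'name': 'Another Artist'}]},
    {'name': 'Say So', 'artists': [{'name': 'Doja Cat'}]},
]
match = find_matching_track('Say So', ['Doja Cat'], items)
print(match['name'])
```

Returning on the first hit removes the need for the `added` counter, and a `None` result plays the role of the `'not-found'` marker.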

In [21]:
# Add desired information to the dataframe
df_base = get_spotify_track_info(df_base)

# Check the result
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3,spotify_search,sp_id,sp_duration_ms,sp_popularity,sp_release_date,sp_explicit
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,696DnlkuDOXcMAnKlTgXXK,163636,89,2019-10-10,True
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3Dv1eDb0MEgF93GpLXlucZ,237893,89,2019-11-07,True
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3yOlyBJuViE2YSGn3nVE1K,170746,83,2019-12-06,False
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,24upABZ8A0sAepfu91sEYr,390638,70,2017-03-03,False
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,4gOgQTv9RYYFZ1uQNnlk3q,83940,73,2019-03-29,True


### Find the audio features for each song

In [22]:
# Search in the API wrapper
spotify_audio_features = [sp.audio_features(track_id) if track_id != 'not-found' else 'not-found'
                          for track_id in tqdm(df_base.sp_id)]

In [23]:
# Check if there are 69 items in this list
len(spotify_audio_features)

69

In [24]:
# Function to add new information in a copy of the dataframe
def get_spotify_audio_features(df, audio_features: list):
    
    '''
    Adds new information from a list to a copy of the dataframe
    
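    Args:
    -----
        df (Pandas DataFrame): a dataframe containing the songs and their artists
        audio_features (list): the audio features returned by the Spotify API for each song, in the same order as
            the rows of the dataframe ('not-found' for the songs without a Spotify id)
    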
    Returns:
    --------
        df_copy (Pandas DataFrame): a copy of the dataframe with some new information appended
    '''
    
    # Create auxiliary empty lists (final lists)
    list_danceability = []
    list_energy = []
    list_key = []
    list_loudness = []
    list_mode = []
    list_speechiness = []
    list_acousticness = []
    list_instrumentalness = []
    list_liveness = []
    list_valence = []
    list_tempo = []
    list_time_signature = []
    
    # Check for each row of the dataframe
    for index in df.index:
        
        # Get the track's Spotify id
        track_id = df.iloc[index, 8]
        
        # If the track was not found in the Spotify library, a 'not-found' string is added to the final lists
        if track_id == 'not-found':
            
            list_danceability.append('not-found')
            list_energy.append('not-found')
            list_key.append('not-found')
            list_loudness.append('not-found')
            list_mode.append('not-found')
            list_speechiness.append('not-found')
            list_acousticness.append('not-found')
            list_instrumentalness.append('not-found')
            list_liveness.append('not-found')
            list_valence.append('not-found')
            list_tempo.append('not-found')
            list_time_signature.append('not-found')
    
        # If the track was found in the Spotify library
        else:
            
            # Add the information to the final lists
            list_danceability.append(audio_features[index][0]['danceability'])
            list_energy.append(audio_features[index][0]['energy'])
            list_key.append(audio_features[index][0]['key'])
            list_loudness.append(audio_features[index][0]['loudness'])
            list_mode.append(audio_features[index][0]['mode'])
            list_speechiness.append(audio_features[index][0]['speechiness'])
            list_acousticness.append(audio_features[index][0]['acousticness'])
            list_instrumentalness.append(audio_features[index][0]['instrumentalness'])
            list_liveness.append(audio_features[index][0]['liveness'])
            list_valence.append(audio_features[index][0]['valence'])
            list_tempo.append(audio_features[index][0]['tempo'])
            list_time_signature.append(audio_features[index][0]['time_signature'])
     
    # Make a copy of the dataframe
    df_copy = df.copy()
    
    # Add columns with the desired information
    # Not an inplace process
    df_copy['sp_danceability'] = list_danceability
    df_copy['sp_energy'] = list_energy
    df_copy['sp_key'] = list_key
    df_copy['sp_loudness'] = list_loudness
    df_copy['sp_mode'] = list_mode
    df_copy['sp_speechiness'] = list_speechiness
    df_copy['sp_acousticness'] = list_acousticness
    df_copy['sp_instrumentalness'] = list_instrumentalness
    df_copy['sp_liveness'] = list_liveness
    df_copy['sp_valence'] = list_valence
    df_copy['sp_tempo'] = list_tempo
    df_copy['sp_time_signature'] = list_time_signature
    
    return df_copy
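
The twelve parallel lists above all follow the same pattern, so the same result can be described with one list of feature names. This is only a sketch of that idea: `FEATURE_NAMES` and `features_row` are hypothetical names, and `features` stands for one element of `spotify_audio_features` (a one-item list holding a dictionary, or the `'not-found'` string).

```python
FEATURE_NAMES = [
    'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
    'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
    'time_signature',
]


def features_row(track_id, features):
    """Build the 'sp_*' values for one song as a dictionary."""
    if track_id == 'not-found':
        # Same placeholder the notebook uses for songs without a Spotify id
        return {f'sp_{name}': 'not-found' for name in FEATURE_NAMES}
    # The API wrapper returns a list with one dictionary per requested track
    return {f'sp_{name}': features[0][name] for name in FEATURE_NAMES}
```

A list of such rows could then be turned into columns in one step, for example with `pd.DataFrame(rows, index=df.index)` joined to the original dataframe.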

In [26]:
df_base = get_spotify_audio_features(df_base, spotify_audio_features)
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3,spotify_search,sp_id,sp_duration_ms,...,sp_key,sp_loudness,sp_mode,sp_speechiness,sp_acousticness,sp_instrumentalness,sp_liveness,sp_valence,sp_tempo,sp_time_signature
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,696DnlkuDOXcMAnKlTgXXK,163636,...,6,-5.616,0,0.148,0.0522,0.0,0.46,0.457,116.735,5
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3Dv1eDb0MEgF93GpLXlucZ,237893,...,11,-4.577,0,0.158,0.256,3.57e-06,0.0904,0.786,110.962,4
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3yOlyBJuViE2YSGn3nVE1K,170746,...,8,-6.024,1,0.0296,0.018,1.29e-05,0.0887,0.383,105.046,4
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,24upABZ8A0sAepfu91sEYr,390638,...,7,-10.002,1,0.0345,0.288,0.856,0.102,0.0584,117.986,4
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,4gOgQTv9RYYFZ1uQNnlk3q,83940,...,10,-8.719,1,0.344,0.0635,0.00932,0.118,0.175,159.947,4


## Artists

In [51]:
df_artist = pd.DataFrame([each_artist for each_list in df_base.artists_list 
                          for each_artist in each_list], columns=['artist']).drop_duplicates()
df_artist.head()

Unnamed: 0,artist
0,Arizona Zervas
1,Doja Cat
2,Camila Cabello
3,DaBaby
4,Kid Francescoli
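
The cell above flattens the per-song artist lists and drops repeated names. The same step in plain Python, as a sketch with a hypothetical helper name, relies on dictionaries keeping insertion order:

```python
def unique_artists(artists_lists):
    """Flatten a list of artist lists and keep the first occurrence of each name."""
    flat = [artist for artists in artists_lists for artist in artists]
    # dict.fromkeys removes duplicates while preserving the original order
    return list(dict.fromkeys(flat))


print(unique_artists([['Doja Cat'], ['Camila Cabello', 'DaBaby'], ['DaBaby']]))
```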


In [55]:
# Search in the API wrapper
spotify_artists = [sp.search(q=artist, type='artist') for artist in tqdm(df_artist.artist)]

In [81]:
spotify_artists[3]['artists']['items'][:2]

[{'external_urls': {'spotify': 'https://open.spotify.com/artist/4r63FhuTkUYltbVAg5TQnk'},
  'followers': {'href': None, 'total': 4291213},
  'genres': ['north carolina hip hop', 'rap'],
  'href': 'https://api.spotify.com/v1/artists/4r63FhuTkUYltbVAg5TQnk',
  'id': '4r63FhuTkUYltbVAg5TQnk',
  'images': [{'height': 640,
    'url': 'https://i.scdn.co/image/f68192e6516d89a77a2b16904725a77b75b42056',
    'width': 640},
   {'height': 320,
    'url': 'https://i.scdn.co/image/f88e08cf9132c7facc2ee9fbdd1be3924b5c5a74',
    'width': 320},
   {'height': 160,
    'url': 'https://i.scdn.co/image/1b6dd2116962f6d9741d0181708e31006b5048a7',
    'width': 160}],
  'name': 'DaBaby',
  'popularity': 95,
  'type': 'artist',
  'uri': 'spotify:artist:4r63FhuTkUYltbVAg5TQnk'},
 {'external_urls': {'spotify': 'https://open.spotify.com/artist/7MCrEuHBgUcjP8eMxM2IFC'},
  'followers': {'href': None, 'total': 144},
  'genres': [],
  'href': 'https://api.spotify.com/v1/artists/7MCrEuHBgUcjP8eMxM2IFC',
  'id': '7MCrE

## Playlists

In [86]:
# Search in the API wrapper
spotify_tiktok = sp.search(q='tiktok', type='playlist', limit=50)

In [87]:
len(spotify_tiktok['playlists']['items'])

50

# Last.fm

## Connecting to the API and collecting data

Example of how to connect to the API and collect information about a song:

```python
response = lastfm_get({
    'method': 'track.getInfo',
    'track': 'My Oh My',
    'artist': 'Camila Cabello'
})
```

In [27]:
lastfm_key = os.getenv('lastfm_p03_key')
lastfm_user = 'gnakasato'

In [28]:
# Function to connect to the API and get data
def lastfm_get(payload):
    
    '''
    Connects to the Last.fm API
    
    Args:
    -----
        payload (dict): a dictionary containing additional information to connect to the Last.fm API
    
    Returns:
    --------
        response: the response for the API connection request
    '''
    
    # Define the headers and the url
    headers = {'user-agent': lastfm_user}
    url = 'http://ws.audioscrobbler.com/2.0/'
    
    # Add the API key to the payload. No 'format' parameter is sent, so the API returns its default XML,
    # which the cells below parse with BeautifulSoup
    payload['api_key'] = lastfm_key
    
    response = requests.get(url, headers=headers, params=payload)
    
    return response
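
`lastfm_get` writes the API key into the dictionary it receives, so the caller's payload is modified as a side effect. The sketch below shows one way to build the query parameters on a copy instead; `build_lastfm_params`, `LASTFM_URL` and the placeholder key are hypothetical names used for illustration. The optional `format` parameter is only added when requested: `json` is the value Last.fm accepts for JSON output, and leaving it out gives the default XML that the later cells parse.

```python
from urllib.parse import urlencode

LASTFM_URL = 'http://ws.audioscrobbler.com/2.0/'


def build_lastfm_params(payload, api_key, response_format=None):
    """Return a new dict with the API key (and optional format) added."""
    params = dict(payload)  # copy, so the caller's dictionary is left untouched
    params['api_key'] = api_key
    if response_format is not None:
        params['format'] = response_format
    return params


request = {'method': 'track.getInfo', 'track': 'My Oh My', 'artist': 'Camila Cabello'}
params = build_lastfm_params(request, api_key='YOUR_API_KEY')
print(f'{LASTFM_URL}?{urlencode(params)}')
```

The resulting dictionary can be passed as `params=` to `requests.get`, exactly as `lastfm_get` does.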

In [29]:
# Create an auxiliary list
lastfm_track_info = []

# Connect to the API and collect data
for index in tqdm(range(df_base.shape[0])):
    
    # Connect to the API
    response = lastfm_get({'method': 'track.getInfo',
                           'track': df_base.iloc[index, 0],
                           'artist': df_base.iloc[index, 4],
                           'autocorrect': 1})
    
    # Convert the data into a more convenient format
    track_info = BeautifulSoup(response.text)
    
    # Add the data to the auxiliary list
    lastfm_track_info.append(track_info)
    
    # Wait 1 second to check the next track
    sleep(1)

## Data Cleaning

In [30]:
# Create a list with the tags of each track
# If the track was not found in the API, the value to the corresponding track is 'not-found'
lastfm_tags = [track.find_all('tag') if track.text.replace('\n\n', '') != 'Track not found' else 'not-found' for track 
                in lastfm_track_info]

# Check the result for the first two tracks
lastfm_tags[:2]

[[<tag><name>2019</name>
  <url>https://www.last.fm/tag/2019</url>
  </tag>,
  <tag><name>2010s</name>
  <url>https://www.last.fm/tag/2010s</url>
  </tag>,
  <tag><name>arizona zervas</name>
  <url>https://www.last.fm/tag/arizona+zervas</url>
  </tag>,
  <tag><name>Hip-Hop</name>
  <url>https://www.last.fm/tag/Hip-Hop</url>
  </tag>,
  <tag><name>rap</name>
  <url>https://www.last.fm/tag/rap</url>
  </tag>],
 [<tag><name>pop</name>
  <url>https://www.last.fm/tag/pop</url>
  </tag>,
  <tag><name>Disco</name>
  <url>https://www.last.fm/tag/Disco</url>
  </tag>,
  <tag><name>rap</name>
  <url>https://www.last.fm/tag/rap</url>
  </tag>,
  <tag><name>Hip-Hop</name>
  <url>https://www.last.fm/tag/Hip-Hop</url>
  </tag>,
  <tag><name>female vocalists</name>
  <url>https://www.last.fm/tag/female+vocalists</url>
  </tag>]]
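
Since the responses are XML, the tag names can also be pulled out with the standard library instead of BeautifulSoup. This is a swapped-in technique, not what the notebook does, and the sample document below is a hypothetical, trimmed response modelled on the output shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed response modelled on the output above
SAMPLE_XML = '''<lfm status="ok">
  <track>
    <name>Say So</name>
    <toptags>
      <tag><name>pop</name><url>https://www.last.fm/tag/pop</url></tag>
      <tag><name>Disco</name><url>https://www.last.fm/tag/Disco</url></tag>
    </toptags>
  </track>
</lfm>'''


def tag_names(xml_text):
    """Return the text of every <name> element that sits directly inside a <tag>."""
    root = ET.fromstring(xml_text)
    return [name.text for name in root.findall('.//tag/name')]


print(tag_names(SAMPLE_XML))
```

Restricting the path to `tag/name` avoids picking up the track's own `<name>` element.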

The idea is to create a nested list containing the list of tags of each track.

In [31]:
# Create auxiliary empty lists 
tracks_tags = []  # Final list with the tags
tags_exist = []  # List with all tags

# Get the tag names for each track from the raw (messy) data
for track_tag in lastfm_tags:
    
    # If the track was not found in the Last.fm API
    if track_tag == 'not-found':
        tracks_tags.append(['not-found'])
    
    # If the track was found, but there is no tag related to the track
    elif len(track_tag) == 0:
        tracks_tags.append(['no-tag'])
    
    # If the track was found and there are tags related to the track
    else:
        
        # Create an auxiliary list that stores the clean tags of the current track
        each_track_tags = []
        
        # Each track has a list with messy tags, so it is necessary to clean this data, checking each tag of the track
        for each_tag in track_tag:
            
            # Get the tag name in lower case, replacing hyphens with spaces
            tag = each_tag.find('name').text.lower().replace('-', ' ')
            
            # Keep the tag only if it is one of the genres listed in the 'musicbrainz_genre'
            if tag in musicbrainz_genre:
                each_track_tags.append(tag)
                tags_exist.append(tag)
                
        # Add the list of clean tags of the track to the final list
        tracks_tags.append(each_track_tags)
        #print(f'\n{tracks_tags}\n')
        
# Add a column in the 'df_base' with the tags found in Last.fm API
df_base['lastfm_tags'] = tracks_tags
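
The filtering step keeps a Last.fm tag only when its cleaned form appears in the genre list. A compact sketch of that idea follows; it assumes the genre names are normalised in the same way as the tags, which the notebook does not do at this point, and the short `genres` list is a hypothetical stand-in for `musicbrainz_genre`.

```python
def normalise(label):
    """Lower-case a label and replace hyphens with spaces."""
    return label.lower().replace('-', ' ')


def keep_genre_tags(tags, genres):
    """Return the normalised tags that are also known genres, in their original order."""
    known = {normalise(genre) for genre in genres}  # set gives fast membership tests
    return [normalise(tag) for tag in tags if normalise(tag) in known]


genres = ['hip hop', 'pop', 'disco']  # hypothetical stand-in for the scraped list
tags = ['2019', '2010s', 'arizona zervas', 'Hip-Hop', 'rap']
print(keep_genre_tags(tags, genres))
```

Tags such as years or artist names drop out because they are not genres, which is the reason for scraping the MusicBrainz list in the first place.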

In [32]:
# Check the result
df_base.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3,spotify_search,sp_id,sp_duration_ms,...,sp_loudness,sp_mode,sp_speechiness,sp_acousticness,sp_instrumentalness,sp_liveness,sp_valence,sp_tempo,sp_time_signature,lastfm_tags
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,696DnlkuDOXcMAnKlTgXXK,163636,...,-5.616,0,0.148,0.0522,0.0,0.46,0.457,116.735,5,[hip hop]
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3Dv1eDb0MEgF93GpLXlucZ,237893,...,-4.577,0,0.158,0.256,3.57e-06,0.0904,0.786,110.962,4,"[pop, disco, hip hop]"
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,3yOlyBJuViE2YSGn3nVE1K,170746,...,-6.024,1,0.0296,0.018,1.29e-05,0.0887,0.383,105.046,4,[pop]
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,24upABZ8A0sAepfu91sEYr,390638,...,-10.002,1,0.0345,0.288,0.856,0.102,0.0584,117.986,4,"[chillout, indie pop]"
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist,{'tracks': {'href': 'https://api.spotify.com/v...,4gOgQTv9RYYFZ1uQNnlk3q,83940,...,-8.719,1,0.344,0.0635,0.00932,0.118,0.175,159.947,4,[no-tag]


# Final Dataframe

In [103]:
df = df_base.drop(['spotify_search'], axis=1)
df.head()

Unnamed: 0,song,artists,artists_list,number_artists,artist_1,artist_2,artist_3,sp_id,sp_duration_ms,sp_popularity,...,sp_loudness,sp_mode,sp_speechiness,sp_acousticness,sp_instrumentalness,sp_liveness,sp_valence,sp_tempo,sp_time_signature,lastfm_tags
0,Roxanne,Arizona Zervas,[Arizona Zervas],1,Arizona Zervas,no-artist,no-artist,696DnlkuDOXcMAnKlTgXXK,163636,89,...,-5.616,0,0.148,0.0522,0.0,0.46,0.457,116.735,5,[hip hop]
1,Say So,Doja Cat,[Doja Cat],1,Doja Cat,no-artist,no-artist,3Dv1eDb0MEgF93GpLXlucZ,237893,89,...,-4.577,0,0.158,0.256,3.57e-06,0.0904,0.786,110.962,4,"[pop, disco, hip hop]"
2,My Oh My,Camila Cabello feat. DaBaby,"[Camila Cabello, DaBaby]",2,Camila Cabello,DaBaby,no-artist,3yOlyBJuViE2YSGn3nVE1K,170746,83,...,-6.024,1,0.0296,0.018,1.29e-05,0.0887,0.383,105.046,4,[pop]
3,Moon,Kid Francescoli,[Kid Francescoli],1,Kid Francescoli,no-artist,no-artist,24upABZ8A0sAepfu91sEYr,390638,70,...,-10.002,1,0.0345,0.288,0.856,0.102,0.0584,117.986,4,"[chillout, indie pop]"
4,Vibe,Cookiee Kawaii,[Cookiee Kawaii],1,Cookiee Kawaii,no-artist,no-artist,4gOgQTv9RYYFZ1uQNnlk3q,83940,73,...,-8.719,1,0.344,0.0635,0.00932,0.118,0.175,159.947,4,[no-tag]
