## GNOD 1


#### Business goal:
Check the case_study_gnod.md file.

Make sure you've understood the big picture of your project:

- the goal of the company (Gnod),
- their current product (Gnoosic),
- their strategy, and
- how your project fits into this context.

Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.





#### Instructions - Scraping popular songs
Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular at the moment.

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.




--> Alternative source scraped in this notebook: https://www.popvortex.com/music/charts/top-100-songs.php

In [1]:
# Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import random


In [2]:
url = 'https://www.popvortex.com/music/charts/top-100-songs.php'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
display(response.status_code)
#soup

200

In [None]:
# Artist Name // #chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p > em

# Song title // #chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p > cite

# genre // #chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > ul > li:nth-child(2) > a

# release date // #chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > ul > li:nth-child(3)

In [None]:
#soup.select("p")

In [4]:
soup.select("p cite")[0:10]

[<cite class="title">Last Night</cite>,
 <cite class="title">Fast Car</cite>,
 <cite class="title">Flowers</cite>,
 <cite class="title">Calm Down</cite>,
 <cite class="title">Like Crazy (Deep House Remix)</cite>,
 <cite class="title">Anti-Hero</cite>,
 <cite class="title">Like Crazy (UK Garage Remix)</cite>,
 <cite class="title">Eyes Closed</cite>,
 <cite class="title">Heart Like A Truck</cite>,
 <cite class="title">Like Crazy (Instrumental)</cite>]

In [5]:
soup.select("p em")[0:10]

[<em class="artist">Morgan Wallen</em>,
 <em class="artist">Luke Combs</em>,
 <em class="artist">Miley Cyrus</em>,
 <em class="artist">Rema &amp; Selena Gomez</em>,
 <em class="artist">Jimin</em>,
 <em class="artist">Taylor Swift</em>,
 <em class="artist">Jimin</em>,
 <em class="artist">Ed Sheeran</em>,
 <em class="artist">Lainey Wilson</em>,
 <em class="artist">Jimin</em>]

In [6]:
soup.select("div.chart-content.col-xs-12.col-sm-8 > ul > li:nth-child(2) > a")[0:10]

[<a href="/music/charts/top-country-songs.php">Country</a>,
 <a href="/music/charts/top-kpop-songs.php">K-Pop</a>,
 <a href="/music/charts/top-kpop-songs.php">K-Pop</a>,
 <a href="/music/charts/top-pop-songs.php">Pop</a>,
 <a href="/music/charts/top-kpop-songs.php">K-Pop</a>,
 <a href="/music/charts/top-alternative-songs.php">Alternative</a>,
 <a href="/music/charts/top-kpop-songs.php">K-Pop</a>,
 <a href="/music/charts/top-kpop-songs.php">K-Pop</a>,
 <a href="/music/charts/top-christian-gospel-songs.php">Christian &amp; Gospel</a>,
 <a href="/music/charts/top-pop-songs.php">Pop</a>]

In [7]:
soup.select("div.chart-content.col-xs-12.col-sm-8 > ul > li:nth-child(3)")[0:10]

[<li><strong>Release Date</strong>: March 24, 2023</li>,
 <li class="billboard-chart billboard-number-one">The current #1 hit song  in the U.S. on the <cite><a href="http://www.billboard.com/charts/hot-100" rel="noopener" target="_blank">Billboard Hot 100</a></cite> chart.<br/>2 weeks at #1.</li>,
 <li><strong>Release Date</strong>: March 26, 2023</li>,
 <li class="billboard-chart">Former #1 song on the <cite>Billboard Hot 100</cite> chart.</li>,
 <li><strong>Release Date</strong>: March 26, 2023</li>,
 <li><strong>Release Date</strong>: March 24, 2023</li>,
 <li><strong>Release Date</strong>: March 26, 2023</li>,
 <li><strong>Release Date</strong>: </li>,
 <li><strong>Release Date</strong>: March 23, 2023</li>,
 <li><strong>Release Date</strong>: March 24, 2023</li>]

In [8]:
#Empty Lists
title = []
artist = []

# Iterations
num_iter = len(soup.select("p cite"))

title_song = soup.select("p cite")
artist_song = soup.select("p em")

for i in range(num_iter):
    title.append(title_song[i].get_text())
    artist.append(artist_song[i].get_text())
    
print(title[0:10])
print(artist[0:10])


#

print(len(title))
print(len(artist))
#print(len(genre))
#print(len(release_date))

['Last Night', 'Fast Car', 'Flowers', 'Calm Down', 'Like Crazy (Deep House Remix)', 'Anti-Hero', 'Like Crazy (UK Garage Remix)', 'Eyes Closed', 'Heart Like A Truck', 'Like Crazy (Instrumental)']
['Morgan Wallen', 'Luke Combs', 'Miley Cyrus', 'Rema & Selena Gomez', 'Jimin', 'Taylor Swift', 'Jimin', 'Ed Sheeran', 'Lainey Wilson', 'Jimin']
100
100
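
The page-wide selectors above return one flat list per field, and the `li:nth-child(3)` output shows that the third list item is a Billboard note for some songs and a release date for others. Genre and release date collected this way would therefore not line up with `title` and `artist`. One possible way around this is to parse a single chart entry at a time. The sketch below assumes a simplified, hypothetical HTML fragment modeled on the output shown above (not the real page), and `parse_entries` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the notebook output, not the real page
sample_html = """
<div class="chart-content">
  <p><cite class="title">Last Night</cite> <em class="artist">Morgan Wallen</em></p>
  <ul>
    <li>Position note</li>
    <li><a href="/music/charts/top-country-songs.php">Country</a></li>
    <li><strong>Release Date</strong>: March 24, 2023</li>
  </ul>
</div>
<div class="chart-content">
  <p><cite class="title">Fast Car</cite> <em class="artist">Luke Combs</em></p>
  <ul>
    <li>Position note</li>
    <li><a href="/music/charts/top-country-songs.php">Country</a></li>
    <li class="billboard-chart">Former #1 song on the chart.</li>
    <li><strong>Release Date</strong>: </li>
  </ul>
</div>
"""

def parse_entries(html):
    """Return one dict per chart entry so that fields stay aligned."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select("div.chart-content"):
        title = entry.select_one("cite.title")
        artist = entry.select_one("em.artist")
        release_date = None
        # Find the release date by its label instead of by its position
        for li in entry.select("ul > li"):
            label = li.find("strong")
            if label and label.get_text(strip=True) == "Release Date":
                text = li.get_text(" ", strip=True)
                release_date = text.split(":", 1)[1].strip() or None
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "artist": artist.get_text(strip=True) if artist else None,
            "release_date": release_date,
        })
    return rows

for row in parse_entries(sample_html):
    print(row)
```

A missing or empty release date becomes `None` for that entry only, so the resulting list of dicts can be passed to `pd.DataFrame` without the columns drifting apart.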


In [9]:
# Columns of a dataframe
topsongs = pd.DataFrame({"title":title,
                       "artist":artist})

In [10]:
display(topsongs.head(20), len(topsongs))

Unnamed: 0,title,artist
0,Last Night,Morgan Wallen
1,Fast Car,Luke Combs
2,Flowers,Miley Cyrus
3,Calm Down,Rema & Selena Gomez
4,Like Crazy (Deep House Remix),Jimin
5,Anti-Hero,Taylor Swift
6,Like Crazy (UK Garage Remix),Jimin
7,Eyes Closed,Ed Sheeran
8,Heart Like A Truck,Lainey Wilson
9,Like Crazy (Instrumental),Jimin


100

## GNOD 2

The first step you took yesterday was to create a list of top songs and artists by scraping websites.


You should have ended up with your lists in a dataframe containing at least the song title and artist.


Today you are creating a recommender: the user inputs a song title, and you check whether that song is in the list you created. If it is, give a different random song and artist from the list. If it is not on the list, let the user know that you have no recommendation at this time.


In [11]:
# Previous dataframe
topsongs.head(5)

Unnamed: 0,title,artist
0,Last Night,Morgan Wallen
1,Fast Car,Luke Combs
2,Flowers,Miley Cyrus
3,Calm Down,Rema & Selena Gomez
4,Like Crazy (Deep House Remix),Jimin


In [12]:
import random
from IPython.display import Markdown, display  # renders Markdown (bold text, etc.) in the notebook

def similar_top100(topsongs, song_searched):
    
    # Lowercase the input and a local copy of 'title' for case-insensitive matching
    # (the caller's dataframe is left unchanged)
    song_searched = song_searched.lower() 
    titles_lower = topsongs['title'].str.lower()
    
    # IF the song in the input IS IN the 'title' column
    if song_searched in titles_lower.values:
        # Keep drawing until the recommended song differs from the searched one
        # (assumes the list contains at least one other title)
        while True:
            # Generating a random number to select another song from the list
            random_num = random.randint(0, len(topsongs)-1)
            # Picking the song 'title' with random_num index
            song_recommended_title = topsongs.iloc[random_num]['title']
            # Getting also the artist name 
            song_recommended_artist = topsongs.iloc[random_num]['artist']
            # If the recommended song IS NOT the same as the searched one, stop the while
            if song_recommended_title.lower() != song_searched:
                break
        # Joining song and artist to display in the output (original capitalization is kept)
        song_recommended = f"{song_recommended_title} by {song_recommended_artist}"
        # Displaying the output with both 'title' and 'artist' in bold
        display(Markdown(f"You should listen to: **{song_recommended}**!"))
    
    # If the song in the input IS NOT in the 'title' column
    else:
        print("Sorry, you have very bad musical taste. Try another one...")

In [13]:
# Input
song_searched = input("Introduce the name of a song: ")
# Applying function
similar_top100(topsongs,song_searched)


Introduce the name of a song: Flowers


You should listen to: **Thank God I Do by Lauren Daigle**!
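
As a complement to the retry loop above, the same rule (recommend a different song from the list, or nothing if the input is unknown) can also be expressed by filtering first and choosing once. This is only a sketch on plain Python tuples; `recommend_from_chart` and the three-song list are made up for illustration.

```python
import random

def recommend_from_chart(songs, song_searched, rng=random):
    """Return a (title, artist) pair whose title differs from the input, or None."""
    wanted = song_searched.strip().casefold()
    known_titles = {title.casefold() for title, _ in songs}
    if wanted not in known_titles:
        return None  # the song is not on the chart: no recommendation
    # Every entry except the searched title is a valid recommendation
    candidates = [(t, a) for t, a in songs if t.casefold() != wanted]
    if not candidates:
        return None  # the chart holds nothing else to recommend
    return rng.choice(candidates)

chart = [
    ("Last Night", "Morgan Wallen"),
    ("Fast Car", "Luke Combs"),
    ("Flowers", "Miley Cyrus"),
]
print(recommend_from_chart(chart, "flowers"))       # one of the other two songs
print(recommend_from_chart(chart, "Unknown Song"))  # prints None
```

Filtering before choosing avoids a loop that could spin forever when the list holds only the searched song.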

## GNOD 3

To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!



These are the songs that we will cluster. And, later, when the user inputs a song, we will find the cluster to which the song belongs and recommend a song from the same cluster. The more songs you have, the more accurate and diverse recommendations you'll be able to give. Although... you might want to make sure the collected songs are "curated" in a certain way. Try to find playlists of songs that are diverse, but also that meet certain standards.
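
Once clusters exist, the cluster-based recommendation described above could look roughly like the sketch below. The text does not say which clustering method or which features will be used, so everything here is an assumption: the two features, the hand-written centroids, the tiny collection, and the helper names are hypothetical placeholders.

```python
import math
import random

# Hypothetical cluster centroids in (danceability, energy) space
centroids = {0: (0.3, 0.4), 1: (0.8, 0.9)}

# Hypothetical collection: title -> (features, cluster label)
collection = {
    "Song A": ((0.35, 0.45), 0),
    "Song B": ((0.25, 0.38), 0),
    "Song C": ((0.82, 0.88), 1),
    "Song D": ((0.75, 0.95), 1),
}

def nearest_cluster(features):
    """Label of the centroid closest to the given features (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

def recommend_same_cluster(title, features, rng=random):
    """Pick a different song from the cluster the input song falls into."""
    label = nearest_cluster(features)
    candidates = [t for t, (_, c) in collection.items() if c == label and t != title]
    return rng.choice(candidates) if candidates else None

print(recommend_same_cluster("Song C", (0.82, 0.88)))
```

In practice the features would usually be scaled before distances are computed, since columns such as tempo and loudness live on very different ranges.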



The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

An idea for collecting as many songs as possible is to start with all the songs of a big, diverse playlist and then go to every artist present in the playlist and grab every song of every album of that artist. The number of songs you collect per playlist will multiply quickly!
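
The artist-expansion idea above can be sketched as two nested paging loops. The function expects a Spotipy client and uses its `artist_albums`, `album_tracks`, and `next` methods. The function name and the offline stand-in client are assumptions added so the example runs without credentials.

```python
def collect_artist_tracks(sp, artist_id):
    """Collect (title, uri) for every track on every album of one artist."""
    collected = []
    albums_page = sp.artist_albums(artist_id, limit=50)
    while albums_page:
        for album in albums_page["items"]:
            tracks_page = sp.album_tracks(album["id"], limit=50)
            while tracks_page:
                for track in tracks_page["items"]:
                    collected.append((track["name"], track["uri"]))
                # sp.next returns None when there is no further page
                tracks_page = sp.next(tracks_page)
        albums_page = sp.next(albums_page)
    return collected


class FakeSpotify:
    """Offline stand-in that mimics only the three client methods used above."""

    def artist_albums(self, artist_id, limit=50):
        return {"items": [{"id": "album1"}, {"id": "album2"}], "next": None}

    def album_tracks(self, album_id, limit=50):
        track = {"name": f"{album_id} track", "uri": f"spotify:track:{album_id}"}
        return {"items": [track], "next": None}

    def next(self, page):
        return None


print(collect_artist_tracks(FakeSpotify(), "artist1"))
```

With a real client, the same song often appears on several releases of an artist, so removing repeated URIs afterwards and pausing between requests are both worth considering.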


In [16]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#### Authentication

# Read the credentials file; each non-empty line is expected to look like "key: value"
with open("secrets.txt", "r") as secrets_file:
    string = secrets_file.read()
#string.split('\n')

# Dictionary
secrets_dict = {}
for line in string.split('\n'):
    if len(line) > 0:  # excluding empty lines
        # Split on the first ':' only, so a value that contains ':' stays intact
        key, value = line.split(':', 1)
        secrets_dict[key.strip()] = value.strip()


#### Authentication with secrets text file

In [17]:
#Initialize SpotiPy with user credentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=secrets_dict['cid'],
                                                           client_secret=secrets_dict['csecret']))


#### Playlists selected from Spotify

In [None]:
# Top 2000 (1999): 37i9dQZF1DWTmvXBN4DgpA
# Top 5000 (2705) - private: 5DPT4gtwr5AeFlf3YVvdmK
# Top 5000 (3318) - private: 4RVf1hHtwvMEED3yuCNi8q
# Best Music of all times (1723): 1BHsIy6qBuhJVzys6nr2uo

#### Audio features

In [18]:
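# List of playlist ids
# (the original trailing comma after the closing bracket turned this into a tuple holding one list)
playlist_id = ['37i9dQZF1DWTmvXBN4DgpA',
               '5DPT4gtwr5AeFlf3YVvdmK',
               '4RVf1hHtwvMEED3yuCNi8q',
               # '1BHsIy6qBuhJVzys6nr2uo',
              ]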


In [19]:
playlist1 = sp.user_playlist_tracks("spotify", "37i9dQZF1DWTmvXBN4DgpA")
playlist2 = sp.user_playlist_tracks("spotify", "5DPT4gtwr5AeFlf3YVvdmK")
playlist3 = sp.user_playlist_tracks("spotify", "4RVf1hHtwvMEED3yuCNi8q")
#playlist4 = sp.user_playlist_tracks("spotify", "1BHsIy6qBuhJVzys6nr2uo")


display(type(playlist1), playlist1['total'])
display(type(playlist2), playlist2['total'])
display(type(playlist3), playlist3['total'])
#display(type(playlist4), playlist4['total'])

dict

1999

dict

2705

dict

3318

In [20]:
#playlist1

In [21]:
# Checking the path to get 'title', 'artist' and 'uri'
display(playlist1["items"][0]["track"]["name"]) # title
display(playlist1["items"][0]["track"]["artists"][0]["name"]) # artist
display(playlist1["items"][0]["track"]["uri"]) # uri

'Bohemian Rhapsody - Remastered 2011'

'Queen'

'spotify:track:7tFiyTwD0nx5a1eklYtX2J'

#### Retrieving all data from playlists and creating a new dataframe


In [22]:
# List of playlists (the playlist dicts themselves, not their variable names as strings)
playlists_list = [playlist1, playlist2, playlist3]  #, playlist4]


In [23]:
# Function to retrieve ALL data of "title", "artist", "uri" and the audio features
# from the playlist dict and convert them into columns of a new dataframe

def create_playlist_dataframe(playlist, sp):
    # List of track dicts (one per song)
    tracks = []
    
    # While there is a page of results in the playlist
    while playlist:
        # For each item in the current page:
        for item in playlist['items']:
            # Skip entries without track data (e.g. removed or unavailable tracks)
            if not item.get('track') or not item['track'].get('uri'):
                continue
            
            # Look up the audio features by uri
            song_uri = item['track']['uri']
            features = sp.audio_features(song_uri)[0]
            
            # New dict retrieving values of 'title', 'artist' and 'uri'
            track = {
                'title': item['track']['name'],
                'artist': item['track']['artists'][0]['name'],
                'uri': song_uri,
            }
            
            # Add each feature to the track dict: key (column name) and value.
            # The entry is None when no features are available for a track.
            if features:
                track.update(features)
                
            # Append the track to the list
            tracks.append(track)
        
        # Get the next page of results (None when there are no more pages)
        playlist = sp.next(playlist)
        
        # Sleep for a random time between 1 and 3 seconds to avoid triggering rate limits
        sleep_time = random.randint(1, 3)
        time.sleep(sleep_time)
    
    # Return the results as a DataFrame
    return pd.DataFrame(tracks)
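
Requesting audio features one track at a time means one HTTP call per song. Spotipy's `audio_features` also accepts a list of track ids or URIs (the Web API allows up to 100 per request), so a batched variant is possible. The sketch below shows only the batching step; `chunked`, `fetch_features_in_batches`, and the offline `StubClient` are illustrative names, not part of the notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features_in_batches(sp, uris, batch_size=100):
    """Map each uri to its audio-features dict, using one request per batch."""
    features_by_uri = {}
    for batch in chunked(uris, batch_size):
        # Results come back in request order; an entry is None when
        # no features are available for that track
        for uri, features in zip(batch, sp.audio_features(batch)):
            if features:
                features_by_uri[uri] = features
    return features_by_uri


class StubClient:
    """Offline stand-in that counts how many requests would be sent."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        return [{"uri": uri, "tempo": 120.0} for uri in tracks]


stub = StubClient()
uris = [f"spotify:track:{n}" for n in range(250)]
features = fetch_features_in_batches(stub, uris)
print(len(features), "tracks fetched in", stub.calls, "requests")
```

For a playlist of about 2,000 songs, this would reduce roughly 2,000 feature requests to about 20, which also lowers the risk of hitting rate limits.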


In [24]:
# Applying the function
playlist1_df = create_playlist_dataframe(playlist1, sp)

In [25]:
display(playlist1_df.head(5), playlist1_df.shape)

Unnamed: 0,title,artist,uri,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,track_href,analysis_url,duration_ms,time_signature
0,Bohemian Rhapsody - Remastered 2011,Queen,spotify:track:7tFiyTwD0nx5a1eklYtX2J,0.392,0.402,0,-9.961,0,0.0536,0.288,0.0,0.243,0.228,143.883,audio_features,7tFiyTwD0nx5a1eklYtX2J,https://api.spotify.com/v1/tracks/7tFiyTwD0nx5...,https://api.spotify.com/v1/audio-analysis/7tFi...,354320,4
1,Roller Coaster,Danny Vera,spotify:track:5B5YKjgne3TZzNpMsN9aj1,0.401,0.383,9,-10.048,1,0.0279,0.51,0.0078,0.121,0.285,96.957,audio_features,5B5YKjgne3TZzNpMsN9aj1,https://api.spotify.com/v1/tracks/5B5YKjgne3TZ...,https://api.spotify.com/v1/audio-analysis/5B5Y...,269986,4
2,Hotel California - 2013 Remaster,Eagles,spotify:track:40riOy7x9W7GXjyGp4pjAv,0.579,0.508,2,-9.484,1,0.027,0.00574,0.000494,0.0575,0.609,147.125,audio_features,40riOy7x9W7GXjyGp4pjAv,https://api.spotify.com/v1/tracks/40riOy7x9W7G...,https://api.spotify.com/v1/audio-analysis/40ri...,391376,4
3,Piano Man,Billy Joel,spotify:track:3FCto7hnn1shUyZL42YgfO,0.334,0.472,0,-8.791,1,0.0277,0.6,4e-06,0.317,0.431,179.173,audio_features,3FCto7hnn1shUyZL42YgfO,https://api.spotify.com/v1/tracks/3FCto7hnn1sh...,https://api.spotify.com/v1/audio-analysis/3FCt...,336093,3
4,Fix You,Coldplay,spotify:track:7LVHVU3tWfcxj5aiPFEW4Q,0.209,0.417,3,-8.74,1,0.0338,0.164,0.00196,0.113,0.124,138.178,audio_features,7LVHVU3tWfcxj5aiPFEW4Q,https://api.spotify.com/v1/tracks/7LVHVU3tWfcx...,https://api.spotify.com/v1/audio-analysis/7LVH...,295533,4


(1999, 20)

In [None]:
# Applying the function
playlist2_df = create_playlist_dataframe(playlist2, sp)

In [None]:
display(playlist2_df.head(5), playlist2_df.shape)

In [None]:
# Applying the function
playlist3_df = create_playlist_dataframe(playlist3, sp)

In [None]:
display(playlist3_df.head(5), playlist3_df.shape)

In [None]:
# Applying the function
#playlist4_df = create_playlist_dataframe(playlist4, sp)

In [None]:
#display(playlist4_df.head(5), playlist4_df.shape)

#### Checking if df are ready for concatenation

In [None]:
# The dataframes themselves, not their variable names as strings
df_list = [playlist1_df, playlist2_df, playlist3_df]

In [None]:
# Checking same names and order for each dataframe's columns
display(playlist1_df.columns == playlist2_df.columns,
        playlist1_df.columns == playlist3_df.columns)

In [None]:
# Information about their shape
display(playlist1_df.shape, playlist2_df.shape, playlist3_df.shape)

#### Concatenating dataframes

In [None]:
# Concatenating on axis=0 (rows); ignore_index=True gives the result a fresh 0..n-1 index
spotify_data = pd.concat([playlist1_df, playlist2_df, playlist3_df], axis = 0, ignore_index = True)

In [None]:
display(spotify_data.head(5), spotify_data.tail(5), spotify_data.shape)
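
Because the selected playlists may share songs, the concatenated dataframe can contain the same track more than once. One option, sketched here on two tiny made-up dataframes rather than the real data, is to drop repeated `uri` values after concatenating.

```python
import pandas as pd

# Made-up stand-ins for two playlist dataframes that share one track
first = pd.DataFrame({"title": ["A", "B"], "uri": ["u1", "u2"]})
second = pd.DataFrame({"title": ["B", "C"], "uri": ["u2", "u3"]})

combined = (
    pd.concat([first, second], ignore_index=True)
    .drop_duplicates(subset="uri")  # keep the first occurrence of each track
    .reset_index(drop=True)
)
print(combined)
```

Deduplicating on `uri` rather than `title` keeps different songs that happen to share a name.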