<div style='text-align: right;'>
  bojan.gjokjevski
</div>

## Lab | Web Scraping Single Page (GNOD part 1)

---

Business goal:

* Check the [case_study_gnod.md](https://github.com/ta-data-remote/lab-web-scraping-single-page/blob/master/case-study-gnod.md) file.
* Make sure you've understood the big picture of your project:
    * the goal of the company (Gnod),
    * their current product (Gnoosic),
    * their strategy, and
    * how your project fits into this context

* Re-read the business case and the e-mail from the CTO.

---

**Instructions - Scraping popular songs**

Your product will take a song as an input from the user and will output another song (the recommendation). In most cases, the recommended song will have to be similar to the inputted song, but the CTO thinks that if the song is on the top charts at the moment, the user will also enjoy a recommendation of another song that is popular at the moment.

You have to find data on the internet about currently popular songs. Popvortex maintains a weekly Top 100 of "hot" songs here:<br> 
• http://www.popvortex.com/music/charts/top-100-songs.php

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

---

In [None]:
# Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import random
import pickle
import spotipy

In [3]:
from bs4 import BeautifulSoup
from pandas import json_normalize
from random import randint
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn import cluster, datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from matplotlib.lines import Line2D

In [11]:
# Connecting with Spotify API

secrets_file = open("secrets.txt","r")
string = secrets_file.read()

secrets_dict = {}

for line in string.split('\n'):
    
    if len(line) > 0:
        
        #print(line.split(':'))
        secrets_dict[line.split(':')[0]] = line.split(':')[1].strip()
        

In [12]:
# Initialize SpotiPy with user credentials

sp = spotipy.Spotify(
    
    auth_manager = SpotifyClientCredentials(
        
        client_id = secrets_dict['clientid'],
        client_secret = secrets_dict['clientsecret']
    )
)


In [6]:
pd.set_option('display.max_columns', None)

In [8]:
# Importing CSV files

hot_songs_df = pd.read_csv('hot_songs.csv')
display(hot_songs_df.head())
display(hot_songs_df.shape)

sp_songs_df = pd.read_csv('sp_songs_df.csv')
display(sp_songs_df.head())
display(sp_songs_df.shape)

Unnamed: 0,song,artist
0,Beautiful Things,Benson Boone
1,Enough (Miami),Cardi B
2,Fri(End)S,V
3,Lose Control,Teddy Swims
4,Texas Hold 'Em,Beyoncé


(200, 2)

Unnamed: 0,track.name,name,song_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,cluster
0,2K,Nosaj Thing,33xMbeHzmWd6Od0BmLZEUs,0.31,0.445,7,-13.355,0,0.0863,0.094,0.0678,0.113,0.122,95.36,audio_features,33xMbeHzmWd6Od0BmLZEUs,spotify:track:33xMbeHzmWd6Od0BmLZEUs,https://api.spotify.com/v1/tracks/33xMbeHzmWd6...,https://api.spotify.com/v1/audio-analysis/33xM...,152560,3,5
1,4 Billion Souls,The Doors,3UnyplmZaq547hwsfOR5yy,0.419,0.565,5,-11.565,1,0.0347,0.137,0.337,0.128,0.648,151.277,audio_features,3UnyplmZaq547hwsfOR5yy,spotify:track:3UnyplmZaq547hwsfOR5yy,https://api.spotify.com/v1/tracks/3UnyplmZaq54...,https://api.spotify.com/v1/audio-analysis/3Uny...,197707,4,4
2,4 Minute Warning,Radiohead,1w8QCSDH4QobcQeT4uMKLm,0.354,0.302,9,-13.078,1,0.0326,0.59,0.0709,0.111,0.223,123.753,audio_features,1w8QCSDH4QobcQeT4uMKLm,spotify:track:1w8QCSDH4QobcQeT4uMKLm,https://api.spotify.com/v1/tracks/1w8QCSDH4Qob...,https://api.spotify.com/v1/audio-analysis/1w8Q...,244285,4,2
3,7 Element,Vitas,7J9mBHG4J2eIfDAv5BehKA,0.727,0.785,5,-6.707,0,0.0603,0.325,0.126,0.31,0.96,129.649,audio_features,7J9mBHG4J2eIfDAv5BehKA,spotify:track:7J9mBHG4J2eIfDAv5BehKA,https://api.spotify.com/v1/tracks/7J9mBHG4J2eI...,https://api.spotify.com/v1/audio-analysis/7J9m...,249940,4,3
4,#9 Dream,R.E.M.,1VZedwJj1gyi88WFRhfThb,0.571,0.724,0,-5.967,1,0.026,0.0231,0.00311,0.0919,0.385,116.755,audio_features,1VZedwJj1gyi88WFRhfThb,spotify:track:1VZedwJj1gyi88WFRhfThb,https://api.spotify.com/v1/tracks/1VZedwJj1gyi...,https://api.spotify.com/v1/audio-analysis/1VZe...,278320,4,0


(17172, 22)

In [15]:
pd.set_option('display.max_colwidth', None)

display(sp_songs_df['track_href'].head())

pd.reset_option('display.max_colwidth')

0    https://api.spotify.com/v1/tracks/33xMbeHzmWd6Od0BmLZEUs
1    https://api.spotify.com/v1/tracks/3UnyplmZaq547hwsfOR5yy
2    https://api.spotify.com/v1/tracks/1w8QCSDH4QobcQeT4uMKLm
3    https://api.spotify.com/v1/tracks/7J9mBHG4J2eIfDAv5BehKA
4    https://api.spotify.com/v1/tracks/1VZedwJj1gyi88WFRhfThb
Name: track_href, dtype: object

In [18]:
# Scaler and fitted K-Means model

#scaler
Standardtransformer = pickle.load(open('Standardtransformer.pkl','rb'))

#K-means model
kmeans = pickle.load(open('kmean.pkl', 'rb'))

In [28]:
# Building algorithm: song_finder

def song_finder(song_title):
    
    title_lower_case = song_title.lower()
    
    if title_lower_case in hot_songs['song'].str.lower().tolist():
        
        recommended_song = random.choice(hot_songs['song'])
        return f"We recommend you to listen '{recommended_song}' as well, one of top hot hits right now!"
        
    else:
        
        results = sp.search(
            q = song_title.lower(), limit = 10
        )
        
        tracks = json_normalize(
            results["tracks"]["items"]
        )
        
        def expand_list_dict(row):
            
            df = json_normalize(row['artists'])
            df['song_id'] = row['id']
            
            return df
            
        tracks['artists_dfs'] = tracks.apply(expand_list_dict, axis=1)
        artist_df = pd.DataFrame(columns=['external_urls.spotify', 'href', 'id', 'name', 'type', 'uri', 'song_id'])
        
        for mini_df in tracks['artists_dfs']:
            
            artist_df = pd.concat([artist_df, mini_df], axis=0)
            
            df_merged = pd.merge(
                
                left=tracks,
                right=artist_df,
                how='inner',
                left_on='id',
                right_on='song_id'
            )
        
        # saving into a df_final the name of the song, artist and song_id associated with input song
        df_final = df_merged[['name_x', 'name_y', 'song_id']]
        
        # now need to confirm with user the song and artist from list.
        # if yes, then do another request to Spotify to get song data; if not, ask the user again until possible.
        
        row_index = 0 #so then it starts selecting the first row
        
        while row_index < len(df_final):
            
            x = input('Did you mean '+df_final['name_x'].iloc[row_index]+' by '+df_final['name_y'].iloc[row_index]+'?').lower()
            
            if x in ['yes', 'y', 'ys', 'es', 'si', 'oui']:
                
                song_info = json_normalize(sp.audio_features(df_final['song_id'].iloc[row_index]))
                
                # to break the while loop
                break 
            
            else:
                
                print('ok, let me try again')
                row_index += 1 # to repeat the process but taking the next song in the df_final
                
        song_input_df = song_info[
        [
            'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 
            'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
        ]]
        
        # scaling
        song_input_df_scaled = Standardtransformer.transform(song_input_df)

        # creating df
        song_input_df_scaled = pd.DataFrame(song_input_df_scaled, columns = song_input_df.columns)
        
        cluster_song = kmeans.predict(song_input_df_scaled)
        
        array = sp_songs_df.name[sp_songs_df.cluster == cluster_song[0]].reset_index(drop=True)
        
        recommended_song = random.choice(array)
        
        return f"We recommend you to listen '{recommended_song}' as well:)"
        

In [29]:
song_finder('poker face')

Did you mean Poker Face by Lady Gaga? n


ok, let me try again


Did you mean Poker Face by Bread Beatz? y


"We recommend you to listen 'Brian Epstein' as well:)"

In [30]:
song_finder('walk alone')

Did you mean Boulevard of Broken Dreams by Green Day? y


"We recommend you to listen 'Mac DeMarco' as well:)"

In [31]:
song_finder('dear god 2.0')

Did you mean Dear God 2.0 by The Roots? y


"We recommend you to listen 'The 5th Dimension' as well:)"

In [34]:
song_finder('kind of blue')

Did you mean So What (feat. John Coltrane, Cannonball Adderley & Bill Evans) by Miles Davis? y


"We recommend you to listen 'Fitz and The Tantrums' as well:)"

In [36]:
song_finder('enough')

Did you mean Enough (Miami) by Cardi B? y


"We recommend you to listen 'Philip Glass' as well:)"

---
<div style='text-align: center;'>
    • the end •
    </div>