![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Music

<table><tr>
<td style="font-size:8px;"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Treble_a.svg/1920px-Treble_a.svg.png" alt="Musical Staff" style="width: 350px;"/><br><a href="https://en.wikipedia.org/wiki/Musical_note#/media/File:Treble_a.svg">By Dbolton - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=17813642</a></td>
<td> <img src="https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_CMYK_Green.png" alt="Spotify Logo" style="width: 650px;"/> </td>
</tr></table>

Music is an art form loved by people around the world, and it has long been an important part of people's lives.

On a regular day, you might be listening to your favourite artist or trying to play your favourite songs. In this hackathon notebook, let's find out more about the most popular songs and what they have in common. Hopefully you will discover some insights that would be difficult to find otherwise, while learning some new coding skills.

## Getting ready

This section sets up several things behind the scenes that the rest of the notebook needs. Most of the code cells in this section are ready to run, so you won't have to modify them. You don't need to understand everything these cells do in order to complete the challenges. However, feel free to ask the mentors about anything that makes you curious.

### 1. Install/Import libraries

Run the cell below to import the required Python libraries. It may take a few moments to finish running.

In [None]:
import pandas as pd
import plotly.express as px
print('Setup Complete')

Run the next few cells to load libraries and pre-defined functions that will help us complete the challenges later.

In [None]:
#!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group4_Music/helper_code/music.py -P helper_code -nc
# load helper code
#from helper_code.music import *

### 2. Import data and create a dataframe

[Spotify](https://en.wikipedia.org/wiki/Spotify), an audio streaming platform, compiles and publishes the list of [most streamed songs](https://spotifycharts.com/regional) at the end of every year. The dataset used in this notebook is a combination of two Spotify datasets available on Kaggle:
 - [Top 100 Spotify tracks 2017](https://www.kaggle.com/nadintamer/top-tracks-of-2017)
 - [Top 100 Spotify tracks 2018](https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018)

Each of the datasets mentioned above contains various features of the 100 most popular songs of 2017 and 2018 respectively. For this hackathon, the dataset is stored in cloud storage so we can import it into this notebook. Running the cells below will also create a dataframe and show you some interesting facts about the dataset.

In [None]:
#music = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-top-100-from-2017-2018.csv')
music = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify.csv')
music

In [None]:
music['track_id'] = music['uri'].str.split(':', expand=True)[2]
music
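
The cell above pulls the track ID out of each Spotify URI. Here is a minimal sketch of how that works, assuming every URI has the form `spotify:track:<id>`. The small dataframe is for illustration only, and its second URI is a made-up placeholder.

```python
import pandas as pd

# toy dataframe; the second URI is a made-up placeholder
demo = pd.DataFrame({"uri": ["spotify:track:7aNjMJ05FvUXACPWZ7yJmv",
                             "spotify:track:EXAMPLE_ID"]})

# split each URI on ':' into three columns (0, 1 and 2)
parts = demo["uri"].str.split(":", expand=True)

# the last column holds the track ID
demo["track_id"] = parts[2]
print(demo)
```

If a URI had a different number of parts, column 2 might be missing or hold something else, so it is worth looking at the result before relying on it.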

In [None]:
'''
import requests
#import json
CLIENT_ID = ''  # paste your own Spotify client ID here
CLIENT_SECRET = ''  # paste your own Spotify client secret here (never share it)

AUTH_URL = 'https://accounts.spotify.com/api/token'

auth_response = requests.post(AUTH_URL, {'grant_type': 'client_credentials', 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET})
auth_response_data = auth_response.json() # convert the response to JSON
access_token = auth_response_data['access_token']  # save the access token
#access_token = ''

headers = {'Authorization': 'Bearer {token}'.format(token=access_token)}

BASE_URL = 'https://api.spotify.com/v1/'

uri = 'spotify:track:7aNjMJ05FvUXACPWZ7yJmv'

track_id = uri.split(':')[2]

#r = requests.get(BASE_URL + 'audio-features/' + track_id, headers=headers)
r = requests.get(BASE_URL + 'tracks/' + track_id, headers=headers)
r.json()
'''

In [None]:
# to get release dates
import requests
import time

CLIENT_ID = ''
CLIENT_SECRET = ''

AUTH_URL = 'https://accounts.spotify.com/api/token'
auth_response = requests.post(AUTH_URL, {'grant_type':'client_credentials', 'client_id':CLIENT_ID, 'client_secret':CLIENT_SECRET})
auth_response_data = auth_response.json() # convert the response to JSON
access_token = auth_response_data['access_token']  # save the access token
headers = {'Authorization': 'Bearer {token}'.format(token=access_token)}


# get release dates function

def get_release_date(track_id):
    r = requests.get('https://api.spotify.com/v1/tracks/' + track_id, headers=headers)
    if r.status_code == 429:
        retry_after = int(r.headers['Retry-After'])
        if retry_after > 300:
            print('error 429, try again in', retry_after, 'seconds')
            return None
        else:
            print('error 429, retrying in', retry_after ,'seconds')
            time.sleep(retry_after)
            r = requests.get('https://api.spotify.com/v1/tracks/' + track_id, headers=headers)
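    try:
        # the release date is stored with the track's album information
        rd = r.json()['album']['release_date']
    except (KeyError, ValueError):
        # KeyError: the expected fields are missing; ValueError: the response was not valid JSON
        print('error', r.status_code, 'for track_id', track_id)
        return None
    return rd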
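
The rate-limit handling in `get_release_date()` comes down to one decision: when the server answers with status 429, read the `Retry-After` header and either wait that long or give up. The sketch below isolates that decision so it can be tried without calling the API. The function name `seconds_to_wait` and the `max_wait` parameter are made up for this illustration.

```python
def seconds_to_wait(status_code, headers, max_wait=300):
    """Hypothetical helper mirroring the 429 logic in get_release_date().

    Returns 0 if no wait is needed, the number of seconds to wait if the
    delay is acceptable, or None if the delay is longer than max_wait.
    """
    if status_code != 429:
        return 0  # not rate limited
    retry_after = int(headers.get("Retry-After", 0))
    if retry_after > max_wait:
        return None  # too long, better to try again later
    return retry_after

print(seconds_to_wait(200, {}))                      # normal response
print(seconds_to_wait(429, {"Retry-After": "30"}))   # short wait
print(seconds_to_wait(429, {"Retry-After": "4000"})) # too long
```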

In [None]:
# make a dictionary of track_id:release_date

release_dates = {}
for row in music.itertuples():
    track_id = row.track_id
    print(row.Index, row.artist, row.track, track_id)
    rd = get_release_date(track_id)
    if rd is not None:
        release_dates[track_id] = rd
    else:
        print('We will try later')
        break

# see how many tracks we have release dates for
len(release_dates)

In [None]:
# continue where we left off
import numpy as np

music2 = pd.read_csv('spotify_with_release_dates.csv', low_memory=False)

more_release_dates = {}
for row in music2.itertuples():
    track_id = row.track_id
    release_date = row.release_date
    if not isinstance(release_date, str):
        print(row.Index, row.artist, row.track, track_id)
        rd = get_release_date(track_id)
        if rd is not None:
            more_release_dates[track_id] = rd # add to dictionary
            #music2.loc[row.Index, 'release_date'] = rd # add to dataframe
        else:
            print('We will try later')
            break

In [None]:
# make a dataframe of release dates
#yearsdf = pd.DataFrame.from_dict(release_dates, orient='index', columns=['release_date'])
yearsdf = pd.DataFrame.from_dict(more_release_dates, orient='index', columns=['release_date'])
yearsdf = yearsdf.reset_index().rename(columns={'index': 'track_id'})

# merge the two dataframes
music3 = music2.merge(yearsdf, how='left', on='track_id')
music3.to_csv('spotify_with_release_dates2.csv', index=False)
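
One thing to watch for with `merge()`: if both dataframes already have a `release_date` column, pandas keeps both and renames them `release_date_x` and `release_date_y`. An alternative that fills only the missing values in the existing column is sketched below, using made-up rows and a hypothetical dictionary of newly fetched dates.

```python
import pandas as pd

# made-up rows for illustration; None marks a release date that is still missing
tracks = pd.DataFrame({"track_id": ["a1", "b2", "c3"],
                       "release_date": ["2017-01-06", None, None]})
new_dates = {"b2": "2018-05-11"}  # hypothetical newly fetched date

# look up every track_id in the dictionary (IDs not found become NaN),
# then use the looked-up value only where release_date is missing
looked_up = tracks["track_id"].map(new_dates)
tracks["release_date"] = tracks["release_date"].fillna(looked_up)
print(tracks)
```

Tracks that are in neither source, like `c3` here, simply stay empty and can be fetched on a later run.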

In [None]:
# what are the column names?
for c in music.columns:
    print(c)

Now you know which columns are in the dataset, but what do those columns refer to?

**Danceability**: How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.

**Energy**: A perceptual measure of intensity and activity that ranges from 0 to 1. Typically, energetic tracks feel fast, loud, and noisy.

**Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

**Loudness**: The average loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.

**Mode**: The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

**Speechiness**: Indicates the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, while values below 0.33 most likely represent music and other non-speech-like tracks.

**Acousticness**: A confidence measure indicating whether the track is acoustic. A value of 1 represents the highest confidence.

**Instrumentalness**: Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater likelihood the track contains no vocal content.

**Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.

**Valence**: A measure to describe the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**Tempo**: The overall estimated tempo (speed or pace) of a track in beats per minute (BPM).

**duration_ms**: The duration of the track in milliseconds.

**time_signature**: An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar (or measure).
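
The numeric codes for **Key** and **Mode** are easier to read once they are turned into names. The sketch below does that for a few made-up rows, assuming the columns are called `key` and `mode`. The lookup list follows the pitch class numbering described above.

```python
import pandas as pd

# pitch class 0 is C, 1 is C♯/D♭, and so on up to 11, which is B
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

# made-up values for illustration
demo = pd.DataFrame({"key": [0, 1, 9], "mode": [1, 0, 0]})

# translate the integers into names; values not in the lookup become NaN
demo["key_name"] = demo["key"].map(dict(enumerate(PITCH_CLASSES)))
demo["mode_name"] = demo["mode"].map({1: "major", 0: "minor"})
print(demo)
```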

## Add a new column

We can add a new column to show the duration of the track in seconds instead of milliseconds.

In [None]:
music["duration_s"] = music["duration_ms"]/1000

# display the top 10 rows
music.head(10)

## Part B: Longest popular song of 2018

Let us find out the duration of the longest song of 2018 from the dataset. We will use the **duration_s** column (duration in seconds) for this purpose.
  

In [None]:
# use pre-defined function get_data_by_year() and supply the column name we are interested in i.e. "duration_s"
duration_by_year = get_data_by_year(music,"duration_s")

# display first 5 rows
duration_by_year.head()
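
`get_data_by_year()` comes from the hackathon's helper code, which is not shown in this notebook. The sketch below is a rough stand-in, not the real helper. It assumes the dataframe has a `year` column and that the result has one column per year, which is a guess based on how `duration_by_year` is used in the following cells. The name `data_by_year_sketch` and the sample values are made up.

```python
import pandas as pd

def data_by_year_sketch(df, column):
    """Hypothetical stand-in: one output column per year, holding `column`."""
    per_year = {
        # restart the row numbers at 0 so the years line up side by side
        year: group[column].reset_index(drop=True)
        for year, group in df.groupby("year")
    }
    return pd.DataFrame(per_year)

# made-up durations for illustration
demo = pd.DataFrame({"year": [2017, 2017, 2018, 2018],
                     "duration_s": [210.0, 185.5, 240.2, 199.9]})

by_year = data_by_year_sketch(demo, "duration_s")
print(by_year)
```

Each row of the result pairs up songs by their position within a year, so the rows themselves have no meaning. Only the columns do.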

We have obtained the durations of the popular songs of 2017 and 2018. Next, we need to sort them using the **sort_values()** function. Note that the sorting is based on the 2018 column.

In [None]:
# sort by "2018" column in descending order
duration_by_year.sort_values(2018, ascending = False).head()

The duration of the longest popular song of 2018 is almost 7 minutes!

Let us plot the distribution of duration of songs for both years using a [histogram](https://www.mathsisfun.com/data/histograms.html).

In [None]:
duration_by_year.iplot(kind = "histogram", subplots=True)

### Challenges:

- Find the duration of the shortest popular song of 2017.
- Create a [box plot](https://www.mathsisfun.com/definitions/box-and-whisker-plot.html) for the duration of the songs by changing `kind = "histogram"` to `kind = "box"`. Also, remove `subplots=True` to keep both box plots in a single figure. Which plot helps you better visualize the distribution?

## Part C: Find the song with highest energy

Let us find the song with the highest energy. First, let's calculate some summary statistics (minimum, maximum, and mean values) for the energy of the songs in both years using the **agg()** function.

In [None]:
# use pre-defined function get_data_by_year() and supply the column name we are interested in i.e. "energy"
energy_by_year = get_data_by_year(music,"energy")

# calculate the additional statistics
energy_stats = energy_by_year.agg(['min', 'max', 'mean'])

# show the dataframe
energy_stats
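
To see what **agg()** returns, here is a small sketch on made-up numbers, laid out with one column per year as in the cell above. Each requested statistic becomes one row of the result.

```python
import pandas as pd

# made-up energy values for illustration, one column per year
energy_demo = pd.DataFrame({2017: [0.5, 0.9, 0.7],
                            2018: [0.6, 0.8, 0.4]})

# one row per statistic, one column per year
stats_demo = energy_demo.agg(["min", "max", "mean"])
print(stats_demo)

# a single value can be picked out by row label and column label
print(stats_demo.loc["max", 2017])
```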

In [None]:
# plot the statistics using bar chart
energy_stats.iplot(kind = "bar", yTitle='Energy')

We can see from the plot that both years have similar statistics for the energy of the songs. Does this indicate that the overall energy distribution of popular songs did not change significantly?

You might have noticed that the song with maximum energy is from 2017. Let's find out more information about that song.

In [None]:
# Find the maximum energy value from year 2017
max_energy_song = energy_stats.loc["max",2017]

# Find the other details for the song with maximum energy
music[music["energy"] == max_energy_song]
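
The cell above filters the dataframe by comparing every row's energy with the maximum value. Below is a small sketch of the same idea on made-up songs, along with `idxmax()`, which is another way to reach the row with the largest value. The column names `name` and `energy` are assumed here.

```python
import pandas as pd

# made-up songs for illustration
demo = pd.DataFrame({"name": ["Song A", "Song B", "Song C"],
                     "energy": [0.52, 0.93, 0.71]})

# approach 1: keep every row whose energy equals the maximum
max_energy = demo["energy"].max()
matches = demo[demo["energy"] == max_energy]
print(matches)

# approach 2: idxmax() gives the row label of the (first) largest value
top_row = demo.loc[demo["energy"].idxmax()]
print(top_row["name"])
```

The first approach returns all tied songs if several share the maximum, while `idxmax()` returns only the first one.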

We can even find this song on YouTube and play it in the notebook.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('RJOqJ-RitOg')

### Challenges:

- Calculate statistics for **valence** (positiveness) and compare it for both years.
     - Which year had higher average valence?
     - Find the name of the song with the highest valence. 
     - Search on YouTube for that song with highest valence and include it in the notebook. Do you agree that this song is very positive?


## Part D: Artists with most number of popular songs

Let's start by finding the artists who contributed the highest number of popular songs in 2017.

In [None]:
# from the original dataframe we take just the year 2017
music_2017 = music[music["year"]==2017]

# calculate the number of rows for every artist and save it as new column - "Count"
song_number_2017 = music_2017.groupby("artists").size().reset_index(name="Count")

# sort by Count, to display the artist with the largest number of songs at the top
song_number_2017 = song_number_2017.sort_values("Count", ascending = False)

# display the dataframe
song_number_2017.head()
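
The counting step can be tried on a tiny made-up list of artists. The sketch below repeats the `groupby()` approach from the cell above and also shows `value_counts()`, a shorter pandas method that gives the same counts already sorted from largest to smallest.

```python
import pandas as pd

# made-up artist names for illustration, one row per song
demo = pd.DataFrame({"artists": ["A", "B", "A", "C", "A", "B"]})

# count the rows for each artist, then sort with the largest count first
counts = (demo.groupby("artists").size()
              .reset_index(name="Count")
              .sort_values("Count", ascending=False))
print(counts)

# value_counts() does the counting and sorting in one step
print(demo["artists"].value_counts())
```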

The top two popular artists in 2017 were "Ed Sheeran" and "The Chainsmokers". 

Let us compare the averaged characteristics of the songs for these artists to the yearly average. In this code cell we compare "danceability", "energy", "speechiness", "mode", "acousticness", "liveness", and "valence", which are all on the same scale (between 0 and 1). However, feel free to edit that list of columns.

In [None]:
# call pre-defined function get_average_by_artist() and supply the year and list of artists
avg_by_artist_2017 = get_average_by_artist(music,2017,["Ed Sheeran","The Chainsmokers"])

# feel free to select different columns
columns = ["danceability","energy","speechiness","mode","acousticness","liveness","valence"]

# select these columns only and transpose(flip the data) in order to better visualize it
stats_by_artist_2017 = avg_by_artist_2017[columns].T

# display the dataframe
stats_by_artist_2017
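
`get_average_by_artist()` is another helper whose code is not shown in this notebook. The sketch below is a guess at what such a function could do, not the real implementation. It assumes `year` and `artists` columns and returns one row per artist plus a row for the yearly average. The function name, the row label and the sample data are all made up.

```python
import pandas as pd

def average_by_artist_sketch(df, year, artists):
    """Hypothetical stand-in: mean of the numeric columns per artist and for the whole year."""
    in_year = df[df["year"] == year]
    rows = {
        artist: in_year[in_year["artists"] == artist].mean(numeric_only=True)
        for artist in artists
    }
    rows["Yearly average"] = in_year.mean(numeric_only=True)
    # each dictionary entry becomes a column, so transpose to get one row per artist
    return pd.DataFrame(rows).T

# made-up songs for illustration
demo = pd.DataFrame({"year": [2017, 2017, 2017, 2017],
                     "artists": ["A", "A", "B", "C"],
                     "energy": [0.4, 0.6, 0.8, 0.2],
                     "valence": [0.5, 0.7, 0.1, 0.3]})

result = average_by_artist_sketch(demo, 2017, ["A", "B"])
print(result[["energy", "valence"]].T)
```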

Let's visualize this dataframe using a bar chart.

In [None]:
# create a bar chart
stats_by_artist_2017.iplot(kind = "bar")

From the bar chart, we can observe that Ed Sheeran's songs have more valence and more energy than the yearly average. The Chainsmokers' songs are a little more danceable than Ed Sheeran's, but both artists are below the yearly average for danceability.

### Challenges:

- Can you perform a similar analysis for the most popular artists in 2018 and answer the following?
    
   - Find the top three artists with most songs.
   - Compare the averaged characteristics (of your choice) of their songs to the yearly average.
   - Share your thoughts on how these averaged characteristics of popular artists can be useful for a new artist.

## Part E: Danceability and Energy in 2018

Let us find out more about danceability and energy of the popular songs in 2018 using a [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot).

In [None]:
# create a scatter plot
music[music["year"]==2018].iplot(kind="scatter", # type of plot
                                 mode='markers', # show only markers(dots), not lines
                                 x="danceability", # which column will be used for x-values
                                 y="energy", # which column will be used for y-values
                                 text="name", # name of the song will be displayed when you hover your mouse over a marker
                                 xTitle= "Danceability", # x-axis title
                                 yTitle="Energy") # y-axis title

### Challenges:

- Explore the plot and answer the following questions:
    - Which songs had high danceability as well as energy? 
    - Which songs had high danceability but low energy?
    - Do you see any significant [positive correlation](https://examples.yourdictionary.com/positive-correlation-examples.html) between *Energy* and *Danceability*?


- Create a similar plot for the year 2017 with *valence* on the x-axis and *energy* on the y-axis. 
    - Find out the songs with highest valence as well as energy.
    - Find out the songs with lowest valence as well as energy.
    - Is the correlation between *Valence* and *Energy* higher than that of *Energy* and *Danceability*?
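
A scatter plot shows correlation visually, and pandas can also put a number on it. The sketch below uses the `corr()` method on made-up values. It returns the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) through 0 (no linear relationship) to 1 (perfect positive correlation).

```python
import pandas as pd

# made-up values for illustration, not taken from the Spotify dataset
demo = pd.DataFrame({"danceability": [0.2, 0.4, 0.6, 0.8],
                     "energy": [0.3, 0.5, 0.6, 0.9]})

# Pearson correlation between the two columns
r = demo["danceability"].corr(demo["energy"])
print(round(r, 2))
```

The same call works on any two numeric columns of the `music` dataframe, so it can be used to compare different pairs of characteristics.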

## Summary

This notebook analyzes the **Music** dataset from Spotify, available on Kaggle, with the help of Python code cells. The durations of the most popular songs are analyzed to find the longest and the shortest of them. The artists with the highest number of popular songs are also identified, and the characteristics of their songs are visualized and compared.

By taking part in this hackathon and completing these challenges, you learned how to analyze a large dataset that would be impractical to work through manually, created visualizations and, most importantly, developed [*computational thinking*](https://en.wikipedia.org/wiki/Computational_thinking) abilities that can be used to solve many kinds of problems.

## Hackathon Reflections
Write about some or all of the following questions, either individually in separate markdown cells or as a group.
- What is something you learned through this process?
- How well did your group work together? Why do you think that is?
- What were some of the hardest parts?
- What are you proud of? What would you like to show others?
- Are you curious about anything else related to this? Did anything surprise you?
- How can you apply your learning to future activities?

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)