![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

In [None]:
#from IPython.display import HTML, display
#display(HTML("<table><tr><td><img src='data/spotify.png' width='650'></td><td><img src='data/instruments.jpeg' width='350'></td></tr></table>"))

### Prep work

In [None]:
#library should be installed already
#!pip install cufflinks ipywidgets

Run the next cells to load libraries and pre-defined functions:

In [None]:
!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group4_Music/helper_code/music.py -P helper_code -nc

In [None]:
# load libraries and helper code
import pandas as pd

import cufflinks as cf
cf.go_offline()

from IPython.display import YouTubeVideo


#to enable plotting in colab
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
    init_notebook_mode(connected=False)

get_ipython().events.register('pre_run_cell', enable_plotly_in_cell)

#helper code
from helper_code.music import *

# Group goal

 
Work through the analysis below and complete the challenges.


**Extra challenge**:

Is there anything else interesting you can find and visualize for this data? 

### Getting data
This dataset is a combination of two datasets:
 - [Top 100 Spotify tracks 2017](https://www.kaggle.com/nadintamer/top-tracks-of-2017)
 - [Top 100 Spotify tracks 2018](https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018)

**Spotify** is a Swedish audio streaming platform that provides DRM-protected music and podcasts from record labels and media companies. At the end of each year, Spotify compiles a playlist of the songs streamed most often over the course of that year.

This dataset has 200 songs: the 100 most popular songs of 2017 combined with the 100 most popular songs of 2018.

In [None]:
#reading from cloud object storage
target_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/top_tracks.csv"

In [None]:
#reading the input file and creating dataframe
music = pd.read_csv(target_url) 

In [None]:
#how many rows and columns does the dataframe have?
music.shape

In [None]:
#what are the column names?
music.columns

**The description of the columns**:

**Danceability**: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

**Energy**: a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

**Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

**Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

**Mode**: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

**Speechiness**: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

**Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

**Instrumentalness**: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

**Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

**Valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**Tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

**duration_ms**: The duration of the track in milliseconds.

**time_signature**: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
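
The **Key** and **Mode** encodings described above can be turned back into readable names. The following is a minimal sketch, not part of the helper code: `PITCH_CLASSES` and `describe_key()` are hypothetical names introduced here for illustration, and the function assumes keys are integers from 0 to 11.

```python
# Pitch Class notation: positions 0..11 map to the twelve pitch classes.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def describe_key(key, mode):
    """Return a readable name such as 'D minor' (hypothetical helper)."""
    if not 0 <= key <= 11:
        raise ValueError("key must be an integer between 0 and 11")
    # Mode: 1 means major, 0 means minor
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"

print(describe_key(0, 1))
print(describe_key(2, 0))
```

With the real data, a function like this could be applied row by row to the `key` and `mode` columns to label each song.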

In [None]:
#display the first 5 rows to explore what the data looks like
music.head()

In [None]:
#creating a new column - duration in seconds
music["duration_s"] = music["duration_ms"]/1000

### Challenge
 - Create a new column **duration_m** - duration in minutes

## Comparing year 2017 to year 2018

Let's compare data from year 2017 to year 2018.   
We start with the **duration_s** column - song duration in seconds.
  

### Calculating the min and max values

In [None]:
# get  data by year by calling function get_data_by_year()

# specify the column name we are interested in  - "duration_s"
duration_by_year = get_data_by_year(music,"duration_s")

#displaying first 5 rows
duration_by_year.head()
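
The implementation of `get_data_by_year()` lives in the helper file and is not shown here. The sketch below is one plausible way to build a table with one column per year, assuming the dataframe has a `year` column. The function name `data_by_year_sketch()` and the toy values are made up for illustration and are not the helper's actual code or the real data.

```python
import pandas as pd

# Toy rows with made-up values, only to show the shape of the result
toy = pd.DataFrame({
    "year":       [2017, 2017, 2018, 2018],
    "duration_s": [200.0, 250.0, 180.0, 300.0],
})

def data_by_year_sketch(df, column):
    """Hypothetical stand-in: one column per year, rows renumbered from 0."""
    return pd.DataFrame({
        year: group[column].reset_index(drop=True)
        for year, group in df.groupby("year")
    })

toy_by_year = data_by_year_sketch(toy, "duration_s")
print(toy_by_year)
```

Renumbering the rows from 0 lets the two years sit side by side even though they came from different rows of the original dataframe.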

What is the longest song for 2017?    
Let's use the **sort_values()** function here to order the durations by the 2017 column.   

In [None]:
#sort_values has an `ascending = False` parameter here, what happens if we set it to True?
duration_by_year.sort_values(2017, ascending = False).head()

The longest song for 2017 was 343 seconds - more than 5 minutes! 
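
To read a duration in seconds as minutes and seconds, Python's built-in `divmod()` can be used. This is a small sketch, and `to_minutes_seconds()` is a hypothetical name introduced for illustration.

```python
def to_minutes_seconds(seconds):
    """Format a duration in seconds as 'minutes:seconds' (hypothetical helper)."""
    # divmod returns the whole minutes and the leftover seconds
    minutes, remainder = divmod(int(seconds), 60)
    return f"{minutes}:{remainder:02d}"

print(to_minutes_seconds(343))  # 5:43
```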

### Challenge
- Using the example above - create new cell(s) and try to find the shortest song for year 2017

### Plotting
Let's plot the data we have created.   

In [None]:
duration_by_year.iplot(kind = "histogram",subplots=True)

### Challenge
We are using a **histogram** plot here.
 - Try deleting `,subplots=True` to display both years together.
 - Try changing `kind = "histogram"` to `kind = "box"` to get a boxplot instead.
    
Which kind of plot helps you understand the data better?  
    
More information about [histograms](https://www.mathsisfun.com/data/histograms.html) and [boxplots](https://www.mathsisfun.com/definitions/box-and-whisker-plot.html) 
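
A boxplot is drawn from a five-number summary: minimum, first quartile, median, third quartile, and maximum. The sketch below computes these numbers with pandas on made-up durations. Plotting libraries may use slightly different quartile conventions, so the values on a plot are not guaranteed to match exactly.

```python
import pandas as pd

# Made-up durations in seconds, for illustration only
toy_durations = pd.Series([150, 180, 200, 220, 300])

# Quantiles 0 and 1 are the minimum and maximum; 0.5 is the median
summary = toy_durations.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```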

### Calculating additional statistics and comparing the years:
There is another way to calculate the min and max values: the **agg()** function.  
In addition, we are going to calculate the average:

In [None]:
duration_stats = duration_by_year.agg(['min', 'max', 'mean'])

duration_stats

In [None]:
#plotting the statistics to compare years
duration_stats.iplot(kind = "bar")

We can see from the plot that year 2018 definitely has more variety - the shortest song is shorter and the longest song is longer than in 2017!   

Let's find out the name of the shortest song in 2018:

In [None]:
min_duration_2018 = duration_stats.loc["min",2018]

music[music["duration_s"]== min_duration_2018]
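
The lookup above matches on the duration value alone, so a song from the other year with exactly the same duration would also be returned. The sketch below uses made-up rows to show two ways of restricting the match to one year: adding a second condition, or using `idxmin()` on the filtered rows. The song names and values are invented for illustration.

```python
import pandas as pd

# Made-up rows; "Song B" (2017) and "Song C" (2018) deliberately tie
toy = pd.DataFrame({
    "name":       ["Song A", "Song B", "Song C", "Song D"],
    "year":       [2017, 2017, 2018, 2018],
    "duration_s": [200.0, 180.0, 180.0, 300.0],
})

toy_2018 = toy[toy["year"] == 2018]
shortest_2018 = toy_2018["duration_s"].min()

# Matching on the value alone returns songs from both years
by_value = toy[toy["duration_s"] == shortest_2018]

# Option 1: combine two conditions with & (each wrapped in parentheses)
by_value_and_year = toy[(toy["duration_s"] == shortest_2018) & (toy["year"] == 2018)]

# Option 2: idxmin() gives the row label of the minimum within 2018
by_label = toy_2018.loc[toy_2018["duration_s"].idxmin()]

print(by_value["name"].tolist())
print(by_value_and_year["name"].tolist())
print(by_label["name"])
```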

We can even find this song on YouTube and include it in the notebook:

In [None]:
YouTubeVideo('7JGDWKJfgxQ')

### Challenge
Using the example above - create new cell(s) and compare **valence** across the two years
 - Which year had higher average valence?
 - Find the name of the song with the highest valence in 2018 
 - Try searching for it on YouTube. Do you agree that this song is very positive?


## Comparing artists with the most songs

Let's start by finding the most popular artists in 2017:

In [None]:
# from the original dataframe we take just the year 2017
music_2017 = music[music["year"]==2017]

# count the rows (songs) for every artist and save the result as a new column - "Count"
song_number_2017 = music_2017.groupby("artists").size().reset_index(name="Count")

# sort by Count, to display the artist with the largest number of songs at the top
song_number_2017 = song_number_2017.sort_values("Count", ascending = False)

song_number_2017.head()
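
The same kind of count can also be produced with pandas' `value_counts()`, which counts and sorts in one step. This is a sketch on made-up artist names, not on the real data.

```python
import pandas as pd

# Made-up artist names, for illustration only
toy = pd.DataFrame({"artists": ["A", "B", "A", "C", "A", "B"]})

# value_counts() counts each artist and sorts from most to fewest songs
counts = toy["artists"].value_counts()

# Turn the result into a two-column table: "artists" and "Count"
table = counts.rename_axis("artists").reset_index(name="Count")
print(table)
```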

It looks like the most popular artists in 2017 were "Ed Sheeran" and "The Chainsmokers". 
Let's compare the average data for these artists to the yearly average:

In [None]:
#call pre-defined function get_average_by_artist() giving it a year, and list of artists
avg_by_artist_2017 = get_average_by_artist(music,2017,["Ed Sheeran","The Chainsmokers"])

avg_by_artist_2017
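
The code of `get_average_by_artist()` is in the helper file and is not shown here. The sketch below is one plausible way to compute per-artist averages next to a yearly average, assuming the dataframe has `year` and `artists` columns. The function name `average_by_artist_sketch()`, the row label for the yearly average, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "artists": ["A", "A", "B", "C", "A"],
    "year":    [2017, 2017, 2017, 2017, 2018],
    "energy":  [0.2, 0.4, 0.6, 0.8, 0.9],
})

def average_by_artist_sketch(df, year, artists):
    """Hypothetical stand-in: mean per selected artist plus the yearly mean."""
    # keep one year and drop the year column so it is not averaged
    features = df[df["year"] == year].drop(columns="year")
    selected = features[features["artists"].isin(artists)]
    result = selected.groupby("artists").mean(numeric_only=True)
    # add one extra row holding the average over all songs of that year
    result.loc[f"{year} average"] = features.mean(numeric_only=True)
    return result

toy_avg = average_by_artist_sketch(toy, 2017, ["A"])
print(toy_avg)
```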

The columns "danceability", "energy", "speechiness", "mode", "acousticness", "liveness" and "valence" are on the same scale, between 0 and 1. Let's display and compare them:

In [None]:
##feel free to select different columns
columns = ["danceability","energy","speechiness","mode","acousticness","liveness","valence"]

#select these columns only and transpose(flip the data) in order to better visualize it
stats_by_artist_2017 = avg_by_artist_2017[columns].T

stats_by_artist_2017

In [None]:
stats_by_artist_2017.iplot(kind = "bar")

Looking at the plot, we can see that Ed Sheeran's songs have more valence and more energy than the yearly average. The Chainsmokers' songs are a little more danceable than Ed Sheeran's, but both artists are below the yearly average for danceability.

### Challenge


Using the example above - create new cell(s) and do the same analysis for the most popular artists in 2018.
 - Do you notice anything interesting?   
Try combining `avg_by_artist_2017` and `avg_by_artist_2018` and compare the most popular artists across both years

`avg_by_artist = pd.concat([avg_by_artist_2018, avg_by_artist_2017])`
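
`pd.concat()` expects the dataframes to be passed together as a list. The sketch below shows this on two made-up single-row tables, and also shows the optional `keys` argument, which labels each row with the table it came from.

```python
import pandas as pd

# Made-up averages, for illustration only
toy_2017 = pd.DataFrame({"energy": [0.6]}, index=["Artist A"])
toy_2018 = pd.DataFrame({"energy": [0.7]}, index=["Artist B"])

# The dataframes go inside one list; rows are stacked in list order
combined = pd.concat([toy_2018, toy_2017])

# keys adds an outer index level recording where each row came from
labeled = pd.concat([toy_2018, toy_2017], keys=[2018, 2017])

print(combined)
print(labeled)
```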


### Danceability and Energy in 2018

In [None]:
music[music["year"]==2018].iplot(kind="scatter",mode='markers',y="energy",x="danceability",text="name",
                                xTitle= "Danceability",yTitle="Energy")

### Challenge
- Explore the plot:
    - Which songs had both high danceability and high energy? 
    - Which songs had high danceability but low energy?
- Create a similar plot for year 2017 comparing valence and energy, and analyze it

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)