# Analysis of the Top 100 Spotify Tracks of 2018

At the end of each year, Spotify compiles a playlist of the songs streamed most often over the course of that year. This year's playlist (Top Tracks of 2018) includes 100 songs. The question is: **`What do these top songs have in common? Why do people like them?`**

**`Original Data Source:`** The audio features for each song were extracted using the Spotify Web API and the spotipy Python library. Credit goes to Spotify for calculating the audio feature values.

**`Data Description:`** There is one .csv file in the dataset (top2018.csv). This file includes:

* Spotify URI for the song
* Name of the song
* Artist(s) of the song
* Audio features for the song (such as danceability, tempo, key etc.)
* A more detailed explanation of the audio features can be found in the Metadata tab.

**`Exploring the Data:`** Some suggestions for what to do with the data:

1. Look for patterns in the audio features of the songs. Why do people stream these songs the most?
2. Try to predict one audio feature based on the others
3. See which features correlate the most

**`NOTE:`** *At the end of this notebook you will be providing the conclusions of the study*

# Import libraries

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
from scipy.stats import pearsonr
%matplotlib inline 

In [None]:
df=pd.read_csv('../input/top2018.csv')

In [None]:
df.info()

**We convert the duration column from milliseconds to minutes (as a decimal number).**

In [None]:
df['Duration_min']=df['duration_ms']/60000

In [None]:
df.drop(columns='duration_ms',inplace=True)
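
The conversion above yields decimal minutes (for example, 3.5 for three and a half minutes). If a minutes:seconds label is wanted instead, a small helper along these lines could be used. This is only a sketch: `format_duration` is a hypothetical name that is not part of the original notebook, and the sample values are made up.

```python
def format_duration(duration_ms):
    """Return a 'm:ss' string for a duration given in milliseconds."""
    total_seconds = int(round(duration_ms / 1000))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


# Toy values, not taken from the dataset
print(format_duration(210000))
print(format_duration(61000))
```

Such a helper would have to be applied to `duration_ms` before that column is dropped.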

# Identification of correlations between columns.

In this step we will use the pandas `corr()` function and then draw a heat map that clearly shows the correlations between columns. Ideally, before this step, we would already have an idea of which columns are likely to be correlated.

In [None]:
sns.heatmap(df.select_dtypes('number').corr(),cmap="YlOrRd")  # numeric columns only; text columns cannot be correlated

**ANALYSIS:** The darkest tones mark the strongest correlations between columns. At first glance the `loudness` and `energy` pair stands out, but we should still explore the behavior of each variable.
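
Reading the strongest pairs off a heat map by eye can be error-prone, so the same information could also be listed as a ranked table. The sketch below is an assumption about how one might do that; `strongest_pairs` is a hypothetical helper and the data frame is toy data, not the Spotify dataset.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame, top=5):
    """List column pairs ordered by absolute Pearson correlation."""
    corr = frame.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Toy data with made-up values
toy = pd.DataFrame({
    "loudness": [-9.0, -7.5, -6.0, -4.5, -3.0],
    "energy": [0.40, 0.52, 0.61, 0.74, 0.83],
    "tempo": [120.0, 95.0, 140.0, 100.0, 128.0],
})
print(strongest_pairs(toy, top=3))
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a sequential color map can hide.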

# Top 10 artists with the largest presence in the Top 100

In [None]:
df['artists'].value_counts().head(10)

# Danceability column analysis

In [None]:
sns.set_style(style='darkgrid')
sns.histplot(df['danceability'],kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn versions

**ANALYSIS:** In this graph we can see that most of the tracks have danceability values above 0.5. For a better analysis we divide them into 3 groups:

* 0.75 or higher: very danceable
* From 0.5 up to (but not including) 0.75: regularly danceable
* Below 0.5: non-danceable or instrumental music

In [None]:
# Set conditions
Vd=df['danceability']>=0.75
Ld=(df['danceability']>=0.5) & (df['danceability']<0.75)
Nd=df['danceability']<0.5

In [None]:
# Create DataFrame 

In [None]:
data=[Vd.sum(),Ld.sum(),Nd.sum()]
Dance=pd.DataFrame(data,columns=['percent'],
                   index=['Very','Regular','Instrumental'])

In [None]:
Dance
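
The three-way split could also be sketched with `pd.cut`, which assigns each score to exactly one bin. The values below are toy data rather than the Spotify dataset, and the upper edge of 1.01 is an assumption chosen so that a score of exactly 1.0 still lands in the top bin.

```python
import pandas as pd

# Toy danceability scores with made-up values
scores = pd.Series([0.92, 0.75, 0.74, 0.50, 0.49, 0.31], name="danceability")

# right=False makes each bin include its lower edge: [0, 0.5), [0.5, 0.75), [0.75, 1.01)
groups = pd.cut(
    scores,
    bins=[0.0, 0.5, 0.75, 1.01],
    labels=["Instrumental", "Regular", "Very"],
    right=False,
)
counts = groups.value_counts().reindex(["Very", "Regular", "Instrumental"])
print(counts)
```

Because the playlist holds 100 tracks, the counts per group can also be read directly as percentages.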

# Energy tracks

In [None]:
sns.histplot(df['energy'],kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn versions

In [None]:
# Set conditions
Ve=df['energy']>=0.75
Re=(df['energy']>=0.5) & (df['energy']<0.75)
Le=df['energy']<0.5

In [None]:
#Create DataFrame
data=[Ve.sum(),Re.sum(),Le.sum()]
Energy=pd.DataFrame(data,columns=['percent'],
                   index=['Very Energy','Regular Energy','Low Energy'])

In [None]:
Energy

# Correlation  Zone 

In this section we focus on the most important variables according to the preliminary analysis of the heat map shown earlier.


In [None]:
Correlation=df[['danceability','energy','valence','loudness','tempo']]

In [None]:
sns.heatmap(Correlation.corr(),annot=True,cmap="YlOrRd")

In [None]:
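sns.jointplot(data=Correlation,y='energy',x='loudness',kind='reg')
print(pearsonr(Correlation['loudness'],Correlation['energy']))  # computed separately: jointplot no longer accepts stat_func in recent seaborn versions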

**ANALYSIS:** We can clearly see that the closer the loudness values are to zero, the more likely a song is to have high energy. In general, these values stay below zero but not far from it.
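
To make the meaning of the Pearson coefficient concrete, the sketch below computes it by hand and compares it with NumPy's result. The numbers are toy values chosen for illustration, not values from the Spotify dataset.

```python
import numpy as np

# Toy values with made-up numbers
loudness = np.array([-9.0, -7.5, -6.0, -4.5, -3.0])
energy = np.array([0.40, 0.52, 0.61, 0.74, 0.83])

# Pearson's r: sum of paired deviations divided by the product of the two spreads
dx = loudness - loudness.mean()
dy = energy - energy.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(loudness, energy)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value close to 1 means the two variables rise together almost linearly. It says nothing about which one drives the other.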

# What is the musical tempo?
Early musical scores gave few or no tempo indications, so each performer chose a tempo freely. This began to change in the eighteenth and nineteenth centuries, possibly because composers grew tired of hearing their works played at completely arbitrary speeds. A notation then emerged to express the speed or manner in which a work should be performed.

The 5 most common markings were:

* Largo: very slow (20 bpm); this notebook labels it `Length`
* Adagio: slow and majestic (66 to 76 bpm)
* Andante: at a walking pace, calm, slightly lively (76 to 108 bpm)
* Allegro: lively and fast (110 to 168 bpm)
* Presto: very fast (168 to 200 bpm)

In [None]:
df['Rhythm']=df['tempo'].astype(object)  # object dtype so text labels can replace the numbers

In [None]:
# Boundaries are contiguous so that no tempo is left unlabeled
df.loc[df['tempo']>168,'Rhythm']='Presto'
df.loc[(df['tempo']>=110) & (df['tempo']<=168),'Rhythm']='Allegro'
df.loc[(df['tempo']>=76) & (df['tempo']<110),'Rhythm']='Andante'
df.loc[(df['tempo']>=66) & (df['tempo']<76),'Rhythm']='Adagio'
df.loc[df['tempo']<66,'Rhythm']='Length'
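
The same labeling could be sketched with `pd.cut`, which guarantees by construction that every tempo falls into exactly one class. The bin edges below are an assumption chosen to make the ranges contiguous, the `Length` label follows this notebook's naming, and the tempos are toy values rather than the Spotify dataset.

```python
import pandas as pd

# Toy tempos in bpm with made-up values
tempos = pd.Series([60.0, 65.5, 70.0, 95.0, 109.0, 120.0, 180.0])

# Contiguous, right-open bins: [0, 66), [66, 76), [76, 110), [110, 168), [168, inf)
labels = ["Length", "Adagio", "Andante", "Allegro", "Presto"]
rhythm = pd.cut(
    tempos,
    bins=[0, 66, 76, 110, 168, float("inf")],
    labels=labels,
    right=False,
)
print(rhythm.tolist())
```

With these edges a tempo of exactly 168 bpm is classed as Presto. Moving that boundary is a one-number change in `bins`.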


# Classification according to the tempo of the track

In [None]:
df['Rhythm'].value_counts()

In [None]:
sns.set_style(style='darkgrid')
Rhy=df['Rhythm'].value_counts()
# Plot from the Series directly; the column name produced by value_counts() depends on the pandas version
sns.barplot(x=Rhy.values, y=Rhy.index, palette="viridis")
plt.title('Tracks per tempo class')

**NOTE:** As we can see, most of these songs fall within the intermediate tempo ranges, which are characteristic of the following genres:
 * ***Hip hop***
 * ***reggaeton***
 * ***Pop***
 * ***Rap***

# Top 10 of the most danceable songs

In [None]:
df[['name','artists','danceability','valence','tempo','Rhythm']].sort_values(by='danceability',ascending=False).head(10)

# Top 10 songs with the most energy

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

In [None]:
df[['name','artists','energy','valence','tempo','Rhythm']].sort_values(by='energy',ascending=False).head(10)

# Top 10 songs more likely to create positive feelings

Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).


In [None]:
df[['name','artists','energy','valence','tempo','Rhythm']].sort_values(by='valence',ascending=False).head(10)

# Analysis of artists with greater presence in the 100 most played songs of 2018

We are going to analyze the 4 artists with the most songs in the Top 100 to look for patterns in listeners' tastes.

In [None]:
df['artists'].value_counts().head(4)

# Artist XXXTENTACION

In [None]:
XXXTENT=df[df['artists']=='XXXTENTACION']

In [None]:
XXXTENT[['name','danceability','energy','loudness','valence','tempo','Rhythm']]

**NOTE:** Although the data show that this artist's tracks are mostly danceable, one could argue that his presence in this ranking is tied to the hip hop genre, since the tempo of his tracks falls within the "ALLEGRO" range.

# Artist Post Malone

In [None]:
PMalone=df[df['artists']=='Post Malone']

In [None]:
PMalone[['name','danceability','energy','loudness','valence','tempo','Rhythm']]

**NOTE:** In the case of this artist we can observe the same tendency with respect to the rhythm "ALLEGRO".

# Artist Drake

In [None]:
Drake=df[df['artists']=='Drake']

In [None]:
Drake[['name','danceability','energy','loudness','valence','tempo','Rhythm']]

# Artist Ed Sheeran

In [None]:
Edshe=df[df['artists']=='Ed Sheeran']
Edshe[['name','danceability','energy','loudness','valence','acousticness','tempo','Rhythm']]
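
The four per-artist tables repeat the same filter, so they could be produced by one small helper. The sketch below assumes an `artists` column holding a single artist name per row; `tracks_by` is a hypothetical function and the rows are toy data, not the Spotify dataset.

```python
import pandas as pd

FEATURES = ["name", "danceability", "energy", "tempo"]


def tracks_by(frame, artist, columns=FEATURES):
    """Return the selected columns for every track credited to `artist`."""
    return frame.loc[frame["artists"] == artist, list(columns)]


# Toy rows with made-up values
toy = pd.DataFrame({
    "name": ["Track A", "Track B", "Track C"],
    "artists": ["Artist X", "Artist Y", "Artist X"],
    "danceability": [0.80, 0.55, 0.70],
    "energy": [0.60, 0.75, 0.50],
    "tempo": [120.0, 98.0, 140.0],
})

# Loop over the most frequent artists instead of writing one cell per artist
for artist in toy["artists"].value_counts().head(2).index:
    print(artist)
    print(tracks_by(toy, artist))
```

An exact match like this would miss tracks where the artist appears only as a featured guest, if the column ever lists several names.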


# Data treatment regarding the 'mode' column

We restrict the analysis to tracks with values of 0.5 or higher in the danceability and energy columns, because most of the data are concentrated in that range and we do not want values below 0.5 to affect a possible correlation between columns.

**Mode Column**

Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

In [None]:
Mayores=df[df['mode']==1]
Menores=df[df['mode']==0]

In [None]:
# Split the variables by scale (major or minor)

In [None]:
MayoresD=Mayores[Mayores['danceability']>=0.5]
MenoresD=Menores[Menores['danceability']>=0.5]

In [None]:
# We drop the columns that add nothing to the study

In [None]:
# Drop from the filtered frames so the danceability filter above is kept
MayoresD=MayoresD.drop(columns=['mode','time_signature'])
MenoresD=MenoresD.drop(columns=['mode','time_signature'])

In [None]:
# Heat map for Major scales

In [None]:
sns.heatmap(MayoresD.select_dtypes('number').corr(),cmap="YlOrRd")  # numeric columns only

In [None]:
# Heat map for minor scales

In [None]:
sns.heatmap(MenoresD.select_dtypes('number').corr(),cmap="YlOrRd")  # numeric columns only

In [None]:
# We create the variables and assign the columns that we want to correlate

In [None]:
MaycorD=MayoresD[['danceability','energy','valence','loudness','tempo']]
MencorD=MenoresD[['danceability','energy','valence','loudness','tempo']]

In [None]:
# Major scale correlation

In [None]:
sns.heatmap(MaycorD.corr(),annot=True,cmap="YlOrRd")

In [None]:
sns.heatmap(MencorD.corr(),annot=True,cmap="YlOrRd")


**ANALYSIS:** As we can see, the tracks in minor scales show stronger correlations between the most important variables.
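
A comparison between the two modes could also be made numerically, one coefficient per group, rather than by comparing two heat maps visually. The sketch below uses toy rows with made-up values, not the Spotify dataset, and only illustrates the mechanics.

```python
import pandas as pd

# Toy rows with made-up values (mode: 1 = major, 0 = minor)
toy = pd.DataFrame({
    "mode": [1, 1, 1, 1, 0, 0, 0, 0],
    "loudness": [-8.0, -6.0, -5.0, -3.0, -9.0, -7.0, -4.0, -2.0],
    "energy": [0.50, 0.70, 0.55, 0.80, 0.40, 0.55, 0.75, 0.90],
})

# One Pearson coefficient per mode value
r_by_mode = {
    mode: group["loudness"].corr(group["energy"])
    for mode, group in toy.groupby("mode")
}
print({"major": round(r_by_mode[1], 3), "minor": round(r_by_mode[0], 3)})
```

With groups of very different sizes, a difference between the two coefficients may simply reflect sampling noise, so the number of tracks per mode is worth checking alongside the coefficients.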


# Keys
The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

In [None]:
# Map pitch-class integers to note names in one step; values already converted are left untouched
key_names={0:'C',1:'C#',2:'D',3:'D#',4:'E',5:'F',6:'F#',7:'G',8:'G#',9:'A',10:'A#',11:'B'}
df['key']=df['key'].replace(key_names)

In [None]:
sns.set_style(style='darkgrid')
keys=df['key'].value_counts()
# Plot from the Series directly; the column name produced by value_counts() depends on the pandas version
sns.barplot(x=keys.values, y=keys.index, palette="viridis")
plt.title('Popular keys')

# Parameter relationship by key

As we can see, the most danceable tracks are in the keys G and C#, and the most energetic ones are in F#, C# and D.

In [None]:
df[['danceability','energy','valence','key']].groupby(by='key').mean().sort_values(by='danceability',ascending=False)
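
A mean taken over only one or two tracks can be misleading, so it may help to show the number of tracks per key next to the averages. The sketch below uses named aggregation on toy rows with made-up values, not the Spotify dataset.

```python
import pandas as pd

# Toy rows with made-up values
toy = pd.DataFrame({
    "key": ["G", "G", "G", "C#", "D", "D"],
    "danceability": [0.80, 0.70, 0.75, 0.90, 0.60, 0.64],
    "energy": [0.60, 0.70, 0.65, 0.50, 0.80, 0.70],
})

# Track count and feature means per key, most danceable key first
summary = (
    toy.groupby("key")
    .agg(
        tracks=("danceability", "size"),
        danceability=("danceability", "mean"),
        energy=("energy", "mean"),
    )
    .sort_values("danceability", ascending=False)
)
print(summary)
```

In this toy table the top key rests on a single track, which is exactly the situation the `tracks` column is meant to expose.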

# Conclusions:

To answer the two initial questions posed by this dataset: the only clear correlation found was between energy and loudness. The main reason people like these songs appears to be related to the tempo of the track: most tracks fall within the ***"ALLEGRO"*** and ***"ANDANTE"*** ranges, which are characteristic of the following genres:

* ***Hip hop***
* ***reggaeton***
* ***Pop***
* ***Rap***



**FINAL!!!!**