## Analysing Spotify 2023 Data

Step 1: Start by importing the required modules

In [1]:
import pandas as pd
import re

Step 2: Read the provided CSV file and load it into a DataFrame called `spotify`

In [3]:
spotify = pd.read_csv('spotify-2023-dirt.csv')
spotify.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125.0,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92.0,C#,Major,71,61,74,7,0,10,4
2,vampire,OLIVIA RODRIGO,1,2023,6,30,1397,113,140003974,94,...,138.0,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,TAYLOR SWIFT,1,2019,8,23,7858,100,800840817,116,...,170.0,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144.0,A,Minor,65,23,80,14,63,11,6


### Objectives

The objective of this practical exercise is to produce a reliable dataset for subsequent analysis. We have received data from Spotify that includes information about the most listened-to tracks in 2023. We want to explore the data and visualize the possible relationships between artists, the style of music, and the number of streams they have had.

However, to achieve this and ensure that our conclusions are meaningful, the data we analyse must be clean and free of errors, at least to the maximum extent of our ability.

## Part 1: Artist Names

We want to obtain the list of all the artists that are included in the top 2023 songs list to be able to do some analytics on those later on.

Step 3: Output the distinct artist names

In [4]:
spotify['artist(s)_name'].unique()

array(['Latto, Jung Kook', 'Myke Towers', 'OLIVIA RODRIGO',
       'TAYLOR SWIFT', 'Bad Bunny', 'Dave, Central Cee',
       'ESLABON ARMADO, PESO PLUMA', 'Quevedo', 'Gunna',
       'Peso Pluma, Yng Lvcas', 'Bad Bunny, Grupo Frontera', 'NewJeans',
       'Miley Cyrus', 'David Kushner', 'Harry Styles', 'SZA',
       'Fifty Fifty', 'BILLIE EILISH', 'Feid, Young Miko', 'Jimin',
       'Gabito Ballesteros, Junior H, Peso Pluma', 'Taylor Swift',
       'Arctic Monkeys', 'Bizarrap, Peso Pluma',
       'The Weeknd, Madonna, Playboi Carti', 'Fuerza Regida',
       'Rma, Selena G', 'Tainy, Bad Bunny', 'Morgan Wallen', 'Dua Lipa',
       'Troye Sivan', 'Peso Pluma, Grupo Frontera',
       'The Weeknd, 21 Savage, Metro Boomin', 'KAROL G, SHAKIRA',
       'BIG ONE, DUKI, LIT KILLAH, MARIA BECERRA, FMK, RUSHERKING, EMILIA, TIAGO PZK',
       'Yahritza Y Su Esencia, Grupo Frontera', 'Junior H, Peso Pluma',
       'POST MALONE, SWAE LEE', 'Bebe Rexha, David Guetta',
       'Tyler, The Creator, Kali Uc

Step 4: Split songs that have multiple artists into separate rows, one row per artist
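One possible way to do this split, sketched on a toy DataFrame rather than the real file. The `demo` frame and the new `artist` column are illustrative names, not part of the original notebook. Note the caveat that a plain comma split also breaks names that contain a comma themselves, such as 'Tyler, The Creator'.

```python
import pandas as pd

# Toy rows standing in for the real data; column names follow the notebook.
demo = pd.DataFrame({
    "track_name": ["Seven (feat. Latto) (Explicit Ver.)", "LALA"],
    "artist(s)_name": ["Latto, Jung Kook", "Myke Towers"],
    "streams": [141381703, 133716286],
})

# Turn the comma-separated string into a list, then emit one row per list element.
exploded = (
    demo.assign(artist=demo["artist(s)_name"].str.split(","))
        .explode("artist")
        .reset_index(drop=True)
)

# Remove the whitespace left around each name by the split.
exploded["artist"] = exploded["artist"].str.strip()

print(exploded[["track_name", "artist", "streams"]])
```

Each exploded row keeps the full `streams` value of its song, so a collaboration counts once for every artist involved.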

Step 5: Ensure that artist names are unique and free of errors (for example, Taylor Swift appears both in all caps and in mixed case). Ignore encoding errors (accents) for the time being.

We can already see that some artist names are in upper case and others are not. Let's convert everything to lower case and check how many duplicates that removes.
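A minimal sketch of that check on a handful of made-up entries. The `artists` Series is a stand-in for the real artist column, and stripping whitespace is an extra normalisation step assumed here.

```python
import pandas as pd

# Stand-in for the artist column; the values are illustrative only.
artists = pd.Series(["TAYLOR SWIFT", "Taylor Swift", "Bad Bunny", " bad bunny", "SZA"])

before = artists.nunique()

# Normalise case and surrounding whitespace before counting again.
normalised = artists.str.strip().str.lower()
after = normalised.nunique()

print(f"{before} distinct names before, {after} after, {before - after} merged")
```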

We have removed 65 duplicates by transforming everything to lower case, but some entries still look odd, such as `bad bunny` and `bad bunyn`. Let's analyze this further by looking at the distance between strings.
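The notebook does not say which string distance it uses. As a stand-in, the sketch below builds a pairwise similarity matrix with `difflib.SequenceMatcher` from the standard library; an edit distance such as Levenshtein would serve the same purpose. The `names` list is illustrative.

```python
from difflib import SequenceMatcher

import numpy as np

# Illustrative names; in the notebook this would be the list of distinct artists.
names = ["bad bunny", "bad bunyn", "taylor swift", "sza"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


n = len(names)
sim = np.ones((n, n))  # the diagonal stays at 1.0 (each name equals itself)
for i in range(n):
    for j in range(i + 1, n):
        # Compute each pair once and mirror it so the matrix is symmetric.
        sim[i, j] = sim[j, i] = similarity(names[i], names[j])

print(np.round(sim, 2))
```

A matrix like `sim` can be passed to a plotting function such as matplotlib's `imshow` to get a heat map, where high values away from the diagonal indicate near-duplicate names.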

We can see some hot spots off the diagonal, which means that some strings are very close to each other. Let's find out which ones they are.
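One way to list the suspicious pairs is to keep every pair of names whose similarity exceeds a cut-off. The threshold of 0.85, the `difflib` similarity measure, and the misspelling `tailor swift` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative names; "tailor swift" is a made-up misspelling.
names = ["bad bunny", "bad bunyn", "taylor swift", "tailor swift", "sza"]
THRESHOLD = 0.85  # assumed cut-off; tune it by inspecting the matches

close_pairs = []
for a, b in combinations(names, 2):
    score = SequenceMatcher(None, a, b).ratio()
    if score >= THRESHOLD:
        close_pairs.append((a, b, round(score, 2)))

for a, b, score in close_pairs:
    print(f"{a!r} ~ {b!r} (similarity {score})")
```

The pairs still need a manual review, since two genuinely different artists can have very similar names.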

Let's fix those.
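A simple, explicit way to apply the fixes is a hand-curated mapping from each misspelling to its canonical name. The `corrections` dictionary below contains only the one example mentioned in the text; the real mapping would come from reviewing the close pairs.

```python
import pandas as pd

# Stand-in for the lower-cased artist column.
artists = pd.Series(["bad bunny", "bad bunyn", "taylor swift", "bad bunny"])

# Hand-curated corrections: misspelling -> canonical name.
corrections = {"bad bunyn": "bad bunny"}

# With a dict, Series.replace swaps whole values that match a key exactly.
cleaned = artists.replace(corrections)

print(cleaned.value_counts())
```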

Step 6: Calculate the average streams per artist
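A sketch of the per-artist average, assuming the data has already been split into one row per artist in a column called `artist` (a hypothetical name). It also assumes `streams` may contain non-numeric text, so it is coerced to numbers first.

```python
import pandas as pd

# Toy data: one row per (song, artist); the last streams value is deliberately dirty.
demo = pd.DataFrame({
    "artist": ["bad bunny", "bad bunny", "taylor swift", "sza"],
    "streams": ["100", "300", "800", "not available"],
})

# Unparseable values become NaN and are skipped by mean().
demo["streams"] = pd.to_numeric(demo["streams"], errors="coerce")

avg_streams = (
    demo.groupby("artist")["streams"]
        .mean()
        .sort_values(ascending=False)
)

print(avg_streams)
```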

Step 7: Find the 10 artists with the most streams (and also give their stream counts)
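A possible sketch of the ranking, interpreting "most streams" as the total across all of an artist's songs. The helper `top_artists` and the `artist` column are hypothetical names; the toy data is not taken from the dataset.

```python
import pandas as pd


def top_artists(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n artists with the highest total streams, largest first."""
    return df.groupby("artist")["streams"].sum().nlargest(n)


# Toy data: one row per (song, artist).
demo = pd.DataFrame({
    "artist": ["a", "b", "a", "c"],
    "streams": [10, 50, 45, 5],
})

# n=2 only because the toy data is tiny; the default n=10 matches the step.
print(top_artists(demo, n=2))
```

The returned Series carries the artist as the index and the stream total as the value, which covers both parts of the step.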

## Part 2: Analysing distributions of data and identifying outliers

We are now going to work with the following columns:

- bpm: Beats per minute, a measure of song tempo
- danceability_%: Percentage indicating how suitable the song is for dancing
- valence_%: Positivity of the song's musical content
- energy_%: Perceived energy level of the song
- acousticness_%: Amount of acoustic sound in the song
- instrumentalness_%: Amount of instrumental content in the song
- speechiness_%: Amount of spoken words in the song

These columns should:

- Be numerical
- Represent a percentage (except bpm)

Step 7: Study the columns in percentage format and remove data outside the desired range

We know percentages must lie in the interval 0-100. A negative percentage has no meaning here, and neither does a value above 100.

We will therefore set those values to NaN instead of dropping the rows. This lets us keep the values in a row that are still useful when others are not; for example, bpm might be invalid while valence_% is valid.
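A minimal sketch of this masking on toy data. It assumes some percentage columns may hold non-numeric text, so they are coerced to numbers first; the `demo` frame and its values are illustrative.

```python
import pandas as pd

pct_cols = ["danceability_%", "valence_%"]

# Toy data with out-of-range and non-numeric entries.
demo = pd.DataFrame({
    "danceability_%": [80, -5, 71, 250],
    "valence_%": ["89", "61", "abc", "32"],
    "bpm": [125.0, 92.0, 138.0, 170.0],
})

for col in pct_cols:
    # Non-numeric entries become NaN.
    values = pd.to_numeric(demo[col], errors="coerce")
    # Keep values inside [0, 100]; everything else becomes NaN.
    demo[col] = values.where(values.between(0, 100))

print(demo)
```

No rows are dropped, so a valid bpm survives even when a percentage in the same row is masked.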

Step 8: Study the normality of the variable bpm

For a significance level of 1%, the critical value of the Anderson-Darling test is 1.089. We obtained a statistic much greater than that, so we can reject the null hypothesis that the data follow a normal distribution.
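The comparison can be sketched with `scipy.stats.anderson`, assuming that is the test implementation used. The sample below is synthetic and deliberately skewed; it is not the real bpm column, and the exact critical value depends on the sample size.

```python
import numpy as np
from scipy import stats

# Synthetic, deliberately skewed sample standing in for the bpm column.
rng = np.random.default_rng(0)
bpm_like = rng.exponential(scale=30.0, size=500) + 60

result = stats.anderson(bpm_like, dist="norm")

# For dist="norm" the reported significance levels are 15, 10, 5, 2.5 and 1 percent.
levels = list(result.significance_level)
crit_1pct = result.critical_values[levels.index(1.0)]

# Reject normality when the statistic exceeds the critical value.
reject = result.statistic > crit_1pct
print(f"statistic={result.statistic:.3f}, critical value at 1%={crit_1pct:.3f}, reject={reject}")
```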

Step 9: Make a boxplot of the bpm variable and use the IQR to remove remaining outliers
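A sketch of the boxplot and the IQR filter using the common 1.5 x IQR rule. The `bpm` values are invented, with one obviously wrong entry. Setting outliers to NaN instead of dropping the rows is a choice made here to match the earlier handling of percentages.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented values; 999 plays the role of an obvious outlier.
bpm = pd.Series([92, 120, 125, 130, 138, 144, 170, 999], name="bpm", dtype=float)

# Boxplot of the raw values (NaNs removed, since boxplot does not skip them).
fig, ax = plt.subplots()
ax.boxplot(bpm.dropna())
ax.set_ylabel("bpm")

# Fences at 1.5 * IQR below the first quartile and above the third.
q1, q3 = bpm.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside the fences; outliers become NaN.
bpm_clean = bpm.where(bpm.between(lower, upper))
print(bpm_clean)
```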