# Spotify Top 50 (2020) Analysis

## Loading Dataset

In [5]:
import pandas as pd

df = pd.read_csv("spotifytoptracks.csv")

print(df)

    Unnamed: 0           artist  \
0            0       The Weeknd   
1            1      Tones And I   
2            2      Roddy Ricch   
3            3        SAINt JHN   
4            4         Dua Lipa   
5            5           DaBaby   
6            6     Harry Styles   
7            7            Powfu   
8            8    Trevor Daniel   
9            9    Lewis Capaldi   
10          10          KAROL G   
11          11   Arizona Zervas   
12          12      Post Malone   
13          13        Lil Mosey   
14          14    Justin Bieber   
15          15            Drake   
16          16    Lewis Capaldi   
17          17         Doja Cat   
18          18         Maroon 5   
19          19           Future   
20          20        Jawsh 685   
21          21     Harry Styles   
22          22            Topic   
23          23         24kGoldn   
24          24    Billie Eilish   
25          25     Shawn Mendes   
26          26    Billie Eilish   
27          27      
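
The printed frame carries an `Unnamed: 0` column, which usually appears when a CSV was written together with its row index. A minimal sketch of one way to avoid it, assuming that column really is just a saved index. The in-memory CSV and its rows are hypothetical stand-ins for `spotifytoptracks.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for spotifytoptracks.csv;
# the empty first header cell mimics a saved index column.
sample_csv = io.StringIO(
    ",artist,track_name\n"
    "0,Artist A,Track A\n"
    "1,Artist B,Track B\n"
)

# index_col=0 uses the first column as the row index instead of data
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

Loading the real file this way would also keep the index column out of later numeric summaries such as the correlation matrix.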

## Data Cleaning

In [6]:
""" Check for missing Values """

df.isnull().sum()

Unnamed: 0          0
artist              0
album               0
track_name          0
track_id            0
energy              0
danceability        0
key                 0
loudness            0
acousticness        0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
genre               0
dtype: int64

In [7]:
""" Remove duplicate samples """

df.drop_duplicates(inplace=True)

In [8]:
""" Handling Outliers """

# Select only numeric columns
numeric_cols = df.select_dtypes(include=["float64", "int64"])

Q1 = numeric_cols.quantile(0.25)
Q3 = numeric_cols.quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers based on the bounds
outliers = (numeric_cols < lower_bound) | (numeric_cols > upper_bound)
print(outliers)

    Unnamed: 0  energy  danceability    key  loudness  acousticness  \
0        False   False         False  False     False         False   
1        False   False         False  False     False          True   
2        False   False         False  False     False         False   
3        False   False         False  False     False         False   
4        False   False         False  False     False         False   
5        False   False         False  False     False         False   
6        False   False         False  False     False         False   
7        False   False         False  False     False          True   
8        False   False         False  False     False         False   
9        False   False         False  False     False          True   
10       False   False         False  False     False         False   
11       False   False         False  False     False         False   
12       False   False         False  False     False         False   
13    
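
The boolean mask above is hard to read at a glance. The sketch below applies the same 1.5 × IQR rule and then condenses the mask into per-column counts and a list of affected rows. It runs on hypothetical toy data (`toy`), not on the notebook's `df`:

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the numeric columns of df
toy = pd.DataFrame({
    "acousticness": [0.10, 0.12, 0.11, 0.13, 0.95],
    "tempo": [100.0, 102.0, 98.0, 101.0, 99.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Same 1.5 * IQR rule as in the cell above
mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Number of flagged values per column
outlier_counts = mask.sum()
print(outlier_counts)

# Rows that contain at least one flagged value
flagged_rows = toy[mask.any(axis=1)]
print(flagged_rows)
```

Whether flagged values should be removed, capped, or kept is a separate decision. With only 50 tracks, keeping them and simply noting them may be the safer choice.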

## Exploratory Data Analysis (EDA)

### General Information:

In [28]:
""" Number of observations """

n_rows = df.shape[0]  # shape[0] is the number of rows
print(f"The dataset has {n_rows} rows")

The dataset has 50 rows


In [29]:
""" Number of features """

n_cols = df.shape[1]  # shape[1] is the number of columns
print(f"The dataset has {n_cols} columns")

The dataset has 17 columns


### Feature Classification:

In [31]:
# Categorical features
categorical_features = df.select_dtypes(include=["object"]).columns
print(f"Categorical features are: {list(categorical_features)}")

Categorical features are: ['artist', 'album', 'track_name', 'track_id', 'genre']


In [32]:
# Numeric features
numeric_features = df.select_dtypes(include=["float64", "int64"]).columns
print(f"Numeric features are: {list(numeric_features)}")

Numeric features are: ['Unnamed: 0', 'energy', 'danceability', 'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']


### Questions:

#### 1. Artists with more than one popular track:

In [37]:
# Getting artists with more than one popular track
artist_counts = df["artist"].value_counts()  # compute the counts once
multiple_tracks = artist_counts[artist_counts > 1]

# Print clean output
print("Artists with more than one popular track:\n")
for artist, count in multiple_tracks.items():
    print(f"{artist}: {count} songs")

Artists with more than one popular track:

Billie Eilish: 3 songs
Dua Lipa: 3 songs
Travis Scott: 3 songs
Justin Bieber: 2 songs
Harry Styles: 2 songs
Lewis Capaldi: 2 songs
Post Malone: 2 songs


#### 2. Most popular artist:

In [38]:
most_popular = df["artist"].value_counts().idxmax()
print(f"The most popular artist in the Spotify Top 50 (2020) is {most_popular}")

The most popular artist in the Spotify Top 50 (2020) is Billie Eilish
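
Note that `idxmax()` returns only the first label with the maximum count, and the earlier output shows three artists tied at 3 songs each. A minimal sketch that reports every artist sharing the top count. The `artist_counts` series here is rebuilt by hand from part of that printed output rather than computed from `df`:

```python
import pandas as pd

# Counts copied from the "Artists with more than one popular track" output
artist_counts = pd.Series({
    "Billie Eilish": 3,
    "Dua Lipa": 3,
    "Travis Scott": 3,
    "Justin Bieber": 2,
    "Harry Styles": 2,
})

# Keep every artist whose count equals the maximum, not just the first one
top_count = artist_counts.max()
top_artists = artist_counts[artist_counts == top_count].index.tolist()

print(f"Artists tied for most tracks ({top_count} each): {', '.join(top_artists)}")
```

In the notebook, the same filter could be applied directly to `df["artist"].value_counts()`.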


#### 3. Total number of artists:

In [35]:
unique_artist = df["artist"].nunique()
print(f"There are {unique_artist} unique artists in the dataset.")

There are 40 unique artists in the dataset.


#### 4. Albums with more than 1 popular track:

In [39]:
# Getting all Albums with multiple songs featured in top 50
album_counts = df["album"].value_counts()  # compute the counts once
multiple_track_album = album_counts[album_counts > 1]

# Printing clean output
print("Albums with multiple songs featured in top 50:\n")
for album, count in multiple_track_album.items():
    print(f"{album}: {count} songs")

Albums with multiple songs featured in top 50:

Future Nostalgia: 3 songs
Hollywood's Bleeding: 2 songs
Fine Line: 2 songs
Changes: 2 songs


#### 5. Total Number of Albums:

In [40]:
total_albums = df["album"].nunique()
print(f"There are {total_albums} unique albums in the Spotify Top 50 (2020).")

There are 45 unique albums in the Spotify Top 50 (2020).


#### 6. Tracks with a danceability score above 0.7:

In [41]:
# Filter tracks with a danceability score above 0.7
high_danceability = df[df["danceability"] > 0.7]["track_name"]

# Print the results in a clean format
print("Tracks with danceability above 0.7:\n")
for track in high_danceability:
    print(track)

Tracks with danceability above 0.7:

Dance Monkey
The Box
Roses - Imanbek Remix
Don't Start Now
ROCKSTAR (feat. Roddy Ricch)
death bed (coffee for your head)
Falling
Tusa
Blueberry Faygo
Intentions (feat. Quavo)
Toosie Slide
Say So
Memories
Life Is Good (feat. Drake)
Savage Love (Laxed - Siren Beat)
Breaking Me
everything i wanted
Señorita
bad guy
WAP (feat. Megan Thee Stallion)
Sunday Best
Godzilla (feat. Juice WRLD)
Break My Heart
Dynamite
Supalonely (feat. Gus Dapperton)
Sunflower - Spider-Man: Into the Spider-Verse
Hawái
Ride It
goosebumps
RITMO (Bad Boys For Life)
THE SCOTTS
SICKO MODE


#### 7. Tracks with a danceability score below 0.4:

In [42]:
# Filtering tracks with danceability score below 0.4
low_danceability = df[df["danceability"] < 0.4]["track_name"]

# Print clean format
print("Tracks with danceability below 0.4:\n")
for track in low_danceability:
    print(track)

Tracks with danceability below 0.4:

lovely (with Khalid)


#### 8. Tracks with loudness above -5:

In [43]:
# Filtering tracks with loudness above -5
loud_tracks = df[df["loudness"] > -5]["track_name"]

# Printing the names of tracks with loudness above -5
print("The loudest tracks are:\n")
for track in loud_tracks:
    print(track)

The loudest tracks are:

Don't Start Now
Watermelon Sugar
Tusa
Circles
Before You Go
Say So
Adore You
Mood (feat. iann dior)
Break My Heart
Dynamite
Supalonely (feat. Gus Dapperton)
Rain On Me (with Ariana Grande)
Sunflower - Spider-Man: Into the Spider-Verse
Hawái
Ride It
goosebumps
Safaera
Physical
SICKO MODE


#### 9. Tracks with loudness below -8:

In [44]:
# Filtering tracks with loudness below -8
quiet_tracks = df[df["loudness"] < -8]["track_name"]

# Print in a clean format
print("The Quietest tracks are :\n")
for track in quiet_tracks:
    print(track)

The Quietest tracks are :

death bed (coffee for your head)
Falling
Toosie Slide
Savage Love (Laxed - Siren Beat)
everything i wanted
bad guy
HIGHEST IN THE ROOM
lovely (with Khalid)
If the World Was Ending - feat. Julia Michaels


#### 10. Longest and Shortest track:

In [22]:
# Longest track: look up the row once, then read both fields from it
longest = df.loc[df["duration_ms"].idxmax()]
print(f"The longest track is '{longest['track_name']}' by {longest['artist']}")

# Shortest track
shortest = df.loc[df["duration_ms"].idxmin()]
print(f"The shortest track is '{shortest['track_name']}' by {shortest['artist']}")

The longest track is 'SICKO MODE' by Travis Scott
The shortest track is 'Mood (feat. iann dior)' by 24kGoldn


#### 11. Most popular genres and genres with only one song:

In [23]:
# Most Popular Genres:

# Getting value counts for the genre
genre_counts = df["genre"].value_counts()

# Resetting the index so "genre" doesn't appear at the top of output
genre_counts = genre_counts.reset_index()

# Rename columns
genre_counts.columns = ["Genre", "Count"]

# Display the result
print("The Most Popular Genres are:\n")
print(genre_counts)


# Genres with only one song:

# Getting value counts for the genre
genre_value_counts = df["genre"].value_counts()  # compute the counts once
single_track_genres = genre_value_counts[genre_value_counts == 1]

# Resetting the index so "genre" doesn't appear at the top of output
single_track_genres = single_track_genres.reset_index()

# Rename columns
single_track_genres.columns = ["Genre", "Count"]

# Display the result
print("\nThe Genres with only one song are:\n")
print(single_track_genres)

The Most Popular Genres are:

                                 Genre  Count
0                                  Pop     14
1                          Hip-Hop/Rap     13
2                     Dance/Electronic      5
3                    Alternative/Indie      4
4                             R&B/Soul      2
5                          Electro-pop      2
6                             Nu-disco      1
7              R&B/Hip-Hop alternative      1
8                        Pop/Soft Rock      1
9                              Pop rap      1
10                        Hip-Hop/Trap      1
11                     Dance-pop/Disco      1
12                           Disco-pop      1
13                Dreampop/Hip-Hop/R&B      1
14  Alternative/reggaeton/experimental      1
15                         Chamber pop      1

The Genres with only one song are:

                                Genre  Count
0                            Nu-disco      1
1             R&B/Hip-Hop alternative      1
2               

### Correlations:

#### 1. Strong positive/negative correlations:

In [24]:
# Selecting only numeric columns
numeric_cols = df.select_dtypes(include="number").columns  # all numeric dtypes

# Calculate the correlation matrix
corr_matrix = df[numeric_cols].corr()

# Display the correlation matrix
print(corr_matrix)

                  Unnamed: 0    energy  danceability       key  loudness  \
Unnamed: 0          1.000000  0.030381     -0.176321 -0.052844  0.034935   
energy              0.030381  1.000000      0.152552  0.062428  0.791640   
danceability       -0.176321  0.152552      1.000000  0.285036  0.167147   
key                -0.052844  0.062428      0.285036  1.000000 -0.009178   
loudness            0.034935  0.791640      0.167147 -0.009178  1.000000   
acousticness       -0.036557 -0.682479     -0.359135 -0.113394 -0.498695   
speechiness         0.095790  0.074267      0.226148 -0.094965 -0.021693   
instrumentalness   -0.003126 -0.385515     -0.017706  0.020802 -0.553735   
liveness           -0.063216  0.069487     -0.006648  0.278672 -0.069939   
valence            -0.034159  0.393453      0.479953  0.120007  0.406772   
tempo               0.081289  0.075191      0.168956  0.080475  0.102097   
duration_ms         0.309563  0.081971     -0.033763 -0.003345  0.064130   

           

##### Key Insights:

##### Energy & Loudness:
    - High positive correlation (0.79), meaning energetic songs tend to be louder.

##### Energy & Acousticness:
    - Strong negative correlation (-0.68), meaning energetic songs are usually not acoustic.

##### Danceability & Valence:
    - Moderate positive correlation (0.48), meaning more danceable songs tend to have higher valence (a happier, more positive mood).
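
The pairs above were picked out by eye. As a sketch, strong pairs can also be pulled out of a correlation matrix programmatically. The small matrix below reuses three coefficients from the printed output. The name `toy_corr` and the 0.6 cut-off are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Toy 3x3 matrix built from coefficients in the printed output above
features = ["energy", "loudness", "acousticness"]
toy_corr = pd.DataFrame(
    [
        [1.0, 0.791640, -0.682479],
        [0.791640, 1.0, -0.498695],
        [-0.682479, -0.498695, 1.0],
    ],
    index=features,
    columns=features,
)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper_mask).stack().dropna()

# 0.6 is an arbitrary illustrative threshold for "strong"
strong = pairs[pairs.abs() > 0.6].sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```

Applied to the full `corr_matrix`, the same steps would list every feature pair whose absolute correlation exceeds the chosen threshold.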


#### 2. Genre comparison across specific features:

In [25]:
# Filtering specific genres
selected_genres = df[
    df["genre"].isin(["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"])
]

# Group by genre and calculate the mean of selected features
genre_comp = selected_genres.groupby("genre")[
    ["danceability", "loudness", "acousticness"]
].mean()

# Print the result
print(genre_comp)

                   danceability  loudness  acousticness
genre                                                  
Alternative/Indie      0.661750 -5.421000      0.583500
Dance/Electronic       0.755000 -5.338000      0.099440
Hip-Hop/Rap            0.765538 -6.917846      0.188741
Pop                    0.677571 -6.460357      0.323843


##### Comparative Analysis of Genres Based on Danceability, Loudness, and Acousticness:

##### Danceability: 
    - Hip-Hop/Rap (0.766) and Dance/Electronic (0.755) have the highest mean danceability.
    - Pop (0.678) and Alternative/Indie (0.662) score somewhat lower, although all four genre means lie within roughly 0.1 of each other.

##### Loudness:
    - Dance/Electronic is the loudest (-5.34 dB), followed closely by Alternative/Indie (-5.42 dB). 
    - Hip-Hop/Rap (-6.92 dB) and Pop (-6.46 dB) are quieter, suggesting a softer production style compared to the more aggressive sounds of Dance/Electronic.

##### Acousticness:
    - Alternative/Indie (0.583) has the most acoustic elements, reflecting its organic sound, while Dance/Electronic (0.099) has the least, focusing on electronic production. Pop (0.324) and Hip-Hop/Rap (0.189) fall in the middle, blending acoustic and electronic elements.

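##### Summary:
    - Dance/Electronic and Hip-Hop/Rap have the highest danceability. Dance/Electronic is also the loudest and least acoustic genre, while Hip-Hop/Rap is the quietest. Alternative/Indie is the most acoustic, and Pop sits in between on all three features. These means rest on small groups (for example, 4 Alternative/Indie and 5 Dance/Electronic tracks), so the differences are indicative rather than conclusive.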
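
The rankings in this comparison can be checked directly from the grouped means. The sketch below rebuilds the `genre_comp` table by hand from the printed output (rather than from `df`) and asks pandas for the highest and lowest genre per feature:

```python
import pandas as pd

# Means copied from the printed genre_comp output above
genre_comp = pd.DataFrame(
    {
        "danceability": [0.661750, 0.755000, 0.765538, 0.677571],
        "loudness": [-5.421000, -5.338000, -6.917846, -6.460357],
        "acousticness": [0.583500, 0.099440, 0.188741, 0.323843],
    },
    index=["Alternative/Indie", "Dance/Electronic", "Hip-Hop/Rap", "Pop"],
)

# Genre with the highest and lowest mean for each feature
highest = genre_comp.idxmax()
lowest = genre_comp.idxmin()

print("Highest per feature:\n", highest, sep="")
print("\nLowest per feature:\n", lowest, sep="")
```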