# Spotify-2023 dataset cleaning

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Load data
spotify_df = pd.read_csv("spotify-2023.csv", encoding_errors='replace', header=0)

In [2]:
spotify_df.columns

Index(['track_name', 'artist(s)_name', 'artist_count', 'released_year',
       'released_month', 'released_day', 'in_spotify_playlists',
       'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts',
       'in_deezer_playlists', 'in_deezer_charts', 'in_shazam_charts', 'bpm',
       'key', 'mode', 'danceability_%', 'valence_%', 'energy_%',
       'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%'],
      dtype='object')

In [3]:
spotify_df.shape

(953, 24)

### Fixing errors with the track and artist names

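Count the 'track_name' entries that contain unknown characters, first by matching the replacement character directly and then by matching anything outside an allowed character set.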

In [4]:
spotify_df['track_name'].str.contains('�|ý').sum()

70

In [5]:
spotify_df['track_name'].str.contains(r'[^a-zA-Z0-9\s():"\'\.,&!?#|$+\[\]/_-]', regex=True).sum()

70

Count the 'artist(s)_name' entries that contain unknown characters, using the same two checks.

In [6]:
spotify_df['artist(s)_name'].str.contains('�').sum()

48

In [7]:
spotify_df['artist(s)_name'].str.contains(r'[^a-zA-Z0-9\s():"\'\.,&!?#|$+\[\]/_-]', regex=True).sum()

48
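
One plausible explanation for these characters, offered as an assumption rather than something this notebook confirms: `encoding_errors='replace'` substitutes U+FFFD (�) for every byte that is invalid in the encoding used to read the file (UTF-8 by default). The sketch below uses a made-up byte string, not a row from the dataset, to show the effect; Latin-1 as the source encoding is only a guess.

```python
# Hypothetical sample: "Señorita" encoded as Latin-1 (not taken from the dataset).
raw = "Señorita".encode("latin-1")

# Decoding as UTF-8 with errors="replace" turns the invalid byte into U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were written in recovers the original text.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

If that assumption holds for "spotify-2023.csv", passing an explicit `encoding` argument to `pd.read_csv` may be worth trying before any manual cleaning.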

The track and artist names containing unknown characters were cleaned manually.<br>
Diacritics (accent marks) such as é or ñ were removed, since they appear to cause many of the errors.<br>
The result was saved as "spotify_2023_cleaned.csv".
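
The cleaning here was done by hand. For names whose accented letters are still intact, the diacritic removal could also be approximated in code. The following is a minimal sketch using the standard-library `unicodedata` module; `strip_diacritics` is a hypothetical helper and the sample strings are invented, not taken from the dataset.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up sample values, not rows from the dataset.
samples = ["Señorita", "Beyoncé", "Flowers"]
cleaned = [strip_diacritics(name) for name in samples]
print(cleaned)
```

This approach cannot restore a letter that has already been replaced by �, because the original byte is lost at that point; those entries still need a manual lookup.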

In [8]:
# Load cleaned data
spotify_clean_df = pd.read_csv("spotify_2023_cleaned.csv", encoding_errors='replace', header=0)

In [9]:
spotify_clean_df['track_name'].str.contains(r'[^a-zA-Z0-9\s():"\'\.,&!?#|$+\[\]/_*-]', regex=True).sum()

0

In [10]:
spotify_clean_df['track_name'].str.contains('�|ý').sum()

0

In [11]:
spotify_clean_df['artist(s)_name'].str.contains(r'[^a-zA-Z0-9\s():"\'\.,&!?#|$+\[\]/_*-]', regex=True).sum()

0

In [12]:
spotify_clean_df['artist(s)_name'].str.contains('�').sum()

0

### Fixing the 'key' column

Some of the tracks have a blank key.

In [13]:
spotify_df['key'].value_counts()

key
C#    120
G      96
G#     91
F      89
B      81
D      81
A      75
F#     73
E      62
A#     57
D#     33
Name: count, dtype: int64

There are no tracks with the key of C, which is highly unlikely.

In [14]:
spotify_df['key'].isna().sum()

95

Looking up the tracks with empty keys on Tunebat showed that these songs are all in the key of C.

In [15]:
spotify_clean_df['key'] = spotify_clean_df['key'].fillna('C')

In [16]:
spotify_clean_df['key'].isna().sum()

0

In [17]:
spotify_clean_df['key'].value_counts()

key
C#    120
G      96
C      95
G#     91
F      89
B      81
D      81
A      75
F#     73
E      62
A#     57
D#     33
Name: count, dtype: int64
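
To confirm that the fill changed only the rows that were missing a key, the filled column can be compared against the original. A minimal sketch on invented values (the `before` and `after` names are illustrative, not part of the notebook):

```python
import pandas as pd

# Invented stand-in for the 'key' column; None marks a missing key.
before = pd.Series(["C#", None, "G", None], name="key")
after = before.fillna("C")

# Rows whose value differs after the fill.
changed = before.ne(after)

# Only rows that were missing should have changed, and all of them become 'C'.
assert (changed == before.isna()).all()
assert (after[changed] == "C").all()
print(after.tolist())
```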

In [19]:
spotify_clean_df.isna().sum()

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                      0
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64

The dataset is ready to use, apart from the 50 missing values that remain in 'in_shazam_charts'.

In [18]:
spotify_clean_df.to_csv('spotify_2023_cleaned.csv', index=False)
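
Before relying on the saved file, a few sanity checks can re-verify the properties established above. The sketch below runs them on a small stand-in frame: the rows are invented, only the column names come from the dataset, and `DISALLOWED` is an illustrative name for the pattern used in the earlier cells.

```python
import pandas as pd

# Matches any character outside the set the notebook accepts in names.
DISALLOWED = r'[^a-zA-Z0-9\s():"\'\.,&!?#|$+\[\]/_*-]'

# Invented stand-in rows; only the column names come from the dataset.
demo_df = pd.DataFrame({
    "track_name": ["Flowers", "Senorita"],
    "artist(s)_name": ["Miley Cyrus", "Shawn Mendes, Camila Cabello"],
    "key": ["C", "A"],
    "in_shazam_charts": ["12", None],
})

# No name may contain a character outside the accepted set.
for col in ["track_name", "artist(s)_name"]:
    assert not demo_df[col].str.contains(DISALLOWED, regex=True).any()

# Every track must have a key after the fill.
assert demo_df["key"].notna().all()

# List the columns that still contain missing values.
missing = demo_df.isna().sum()
print(missing[missing > 0])
```

Replacing `demo_df` with `spotify_clean_df` would apply the same checks to the real data.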