# 1. Selection

In [1]:
import pandas as pd
import os
from utils_io import load_step, save_step

# Load the dataset
print(os.getcwd())
df = pd.read_csv('data/Spotify_Songs.csv')

c:\Users\komit\Documents\Education\University\University of Mannheim\1st Sem\Data Mining (IE500)\3. Group Project\datamining_group12


## General Info

In [2]:
# General Info
print('Dataset Info:')
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   index             114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  livene

## Outliers
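We identified outliers by filtering on the duration_ms, tempo and time_signature attributes and removed them: tracks shorter than 1 minute or longer than 10 minutes, and tracks with a time signature or tempo of 0.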

In [None]:
# Select rows with a duration under 1 min or over 10 min, or with a time signature or tempo of 0
drop_clause = (df['duration_ms'] < 60000) | (df['duration_ms'] > 600000) | (df['time_signature'] == 0) | (df['tempo'] == 0)

# Find the index of that condition in our dataset
drop_index = df[drop_clause].index

# Drop those rows
df = df.drop(drop_index)

# Verify that outliers are removed
print('Dataset without Outliers:')
df.info()

Dataset without Outliers:
<class 'pandas.core.frame.DataFrame'>
Index: 112392 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   index             112392 non-null  int64  
 1   track_id          112392 non-null  object 
 2   artists           112392 non-null  object 
 3   album_name        112392 non-null  object 
 4   track_name        112392 non-null  object 
 5   popularity        112392 non-null  int64  
 6   duration_ms       112392 non-null  int64  
 7   explicit          112392 non-null  bool   
 8   danceability      112392 non-null  float64
 9   energy            112392 non-null  float64
 10  key               112392 non-null  int64  
 11  loudness          112392 non-null  float64
 12  mode              112392 non-null  int64  
 13  speechiness       112392 non-null  float64
 14  acousticness      112392 non-null  float64
 15  instrumentalness  112392 non-null  float64
 16 
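To see how much each filter rule contributes, the rules can be kept as separate boolean masks before they are combined. The following is a minimal sketch on made-up toy data; the names `toy`, `rules` and `flagged_per_rule` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy data standing in for the Spotify table (values are made up for illustration).
toy = pd.DataFrame({
    "duration_ms":    [30_000, 200_000, 700_000, 180_000, 240_000],
    "time_signature": [4, 4, 4, 0, 3],
    "tempo":          [120.0, 0.0, 95.0, 110.0, 128.0],
})

# One boolean mask per filter rule, mirroring the conditions used for the outlier filter.
rules = {
    "shorter than 1 min": toy["duration_ms"] < 60_000,
    "longer than 10 min": toy["duration_ms"] > 600_000,
    "time_signature == 0": toy["time_signature"] == 0,
    "tempo == 0": toy["tempo"] == 0,
}

# Rows flagged per rule (a single row can be flagged by several rules).
flagged_per_rule = {name: int(mask.sum()) for name, mask in rules.items()}

# Combine all masks with a logical OR and keep the rows that match none of them.
drop_mask = pd.concat(rules, axis=1).any(axis=1)
toy_clean = toy[~drop_mask]

print(flagged_per_rule)
print(len(toy), "->", len(toy_clean), "rows")
```

Keeping the rules separate makes it easier to report which threshold removes the most rows before committing to the drop.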

## Feature Subset Selection

Feature subset selection normally matters most in high-dimensional spaces. For now, we only select the features with which we can compute the model. Later, we will repeat feature subset selection (forward selection and backward elimination) to choose the most valuable predictors.
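As a rough illustration of forward selection, the sketch below greedily adds the feature that improves a linear least-squares fit the most. Everything here is an assumption for illustration: the notebook does not fix a model or a stopping rule, the data is synthetic, and the score is the in-sample R^2 (in practice a validation score would be preferable). The helpers `r_squared` and `forward_selection` are hypothetical names.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()

def forward_selection(X, y, names, min_gain=0.01):
    """Greedily add the feature that raises R^2 the most and stop once the
    improvement drops below `min_gain`."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return [names[j] for j in selected], best_score

# Synthetic data: the target depends only on columns "a" and "c".
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=500)

chosen, score = forward_selection(X, y, ["a", "b", "c", "d"])
print(chosen, round(score, 3))
```

Backward elimination works the other way round: start with all features and repeatedly remove the one whose removal hurts the score the least.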

We only keep valuable features and therefore apply a basic heuristic: remove nominal attributes in which more than p% of the values are identical or more than p% of the values are different.

-> Hence, index and track_name are removed.

Remove false predictors: a song's popularity might be influenced by the artist name or the album name.

-> Hence, artists and album_name are removed.

We keep track_id during all preprocessing steps for clarity. We will remove it before the data mining step.
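The p% heuristic for nominal attributes can be sketched as follows. This is an illustrative version, not the code used in the notebook: the function name `flag_uninformative_nominals`, the threshold `p=0.95` and the toy columns are all assumptions.

```python
import pandas as pd

def flag_uninformative_nominals(frame, columns, p=0.95):
    """Flag columns where one value covers more than `p` of the rows
    (almost constant) or where more than `p` of the rows hold distinct
    values (almost an identifier)."""
    flagged = {}
    for col in columns:
        values = frame[col].dropna()
        if values.empty:
            continue
        top_share = values.value_counts(normalize=True).iloc[0]
        unique_share = values.nunique() / len(values)
        if top_share > p:
            flagged[col] = "almost constant"
        elif unique_share > p:
            flagged[col] = "almost unique"
    return flagged

# Toy data (made up): one identifier-like column, one near-constant column,
# and one column with a useful spread of values.
toy = pd.DataFrame({
    "track_name": [f"song {i}" for i in range(100)],
    "label":      ["same"] * 99 + ["other"],
    "genre":      ["rock", "pop", "jazz", "folk"] * 25,
})

result = flag_uninformative_nominals(toy, ["track_name", "label", "genre"], p=0.95)
print(result)
```

The threshold p is a judgment call; a stricter or looser value changes which attributes are flagged.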

### Irrelevant Features: False Predictors & Heuristics

In [None]:
# What do the textual attributes look like?
df.describe(include='object')

# Remove "artists", "album_name", "track_name" and "index" features (KEEP track_id for clarity)
df = df.drop(columns=['artists', 'album_name', 'track_name', 'index'])

                      track_id  track_genre
count                   112392       112392
unique                   88266          114
top     6S3JlDAGk3uu3NtZbPnuhS  alternative
freq                         9         1000


In [5]:
# Check uniqueness of "time_signature"
# Hypothesis: most of the songs have time_signature 4

s = df["time_signature"]

time_sig_table = (
    s.value_counts(dropna=False)                    # count each value (including NaN)
     .sort_index()                                  # sort by the meter value (3,4,5,...)
     .rename_axis("time_signature")
     .reset_index(name="count")
)

time_sig_table["share"] = (time_sig_table["count"] / time_sig_table["count"].sum()).round(4)
time_sig_table

   time_signature   count   share
0               1     900  0.0080
1               3    8938  0.0795
2               4  100802  0.8969
3               5    1752  0.0156


We want to explore further which characteristics songs with time_signature == 1 have and whether they can be considered genuine songs.
What should we do with entries that have time_signature == 1? This value has no obvious musical interpretation; it is more likely a missing value or a time signature that could not be parsed. Nevertheless, the other musical characteristics of these songs are inconspicuous.

In [6]:
# Group and compute mean
duration_by_ts = (
    df.groupby("time_signature", dropna=False)[["duration_ms", "tempo", "energy", "danceability", "loudness"]]
      .mean()
      .sort_values("time_signature")
)

print(duration_by_ts)

                  duration_ms       tempo    energy  danceability   loudness
time_signature                                                              
1               208880.207778  107.951697  0.458268      0.425865 -12.804916
3               218694.093533  122.705017  0.451654      0.439766 -11.651288
4               226844.265907  122.647836  0.664817      0.583546  -7.776472
5               203230.401826  113.174825  0.488683      0.464670 -11.765884
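If time_signature == 1 is treated as a placeholder for an unknown meter, two simple options are to mark the value as missing or to drop the affected rows. The sketch below shows both on made-up toy data; it assumes the placeholder interpretation, which the notebook has not confirmed, and the names `as_missing` and `dropped` are illustrative.

```python
import pandas as pd

# Toy data (made up) with two rows carrying time_signature == 1.
toy = pd.DataFrame({
    "time_signature": [4, 1, 3, 4, 1, 5],
    "tempo": [120.0, 98.5, 140.2, 110.0, 105.3, 90.0],
})

# Assumption: a value of 1 stands for "meter unknown", not for a real meter.
is_placeholder = toy["time_signature"] == 1

# Option A: keep the rows but mark the value as missing (the column becomes float).
as_missing = toy["time_signature"].mask(is_placeholder)

# Option B: drop the affected rows entirely.
dropped = toy[~is_placeholder]

print(as_missing.isna().sum(), "values marked missing")
print(len(dropped), "rows remain after dropping")
```

Option A keeps the other audio features of these tracks available, which fits the observation that they look inconspicuous.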


In [7]:
# Verify that irrelevant features are removed
print('Dataset Info without Irrelevant Features:')
df.info()  # info() prints directly and returns None, so no print() wrapper is needed

Dataset Info without Irrelevant Features:
<class 'pandas.core.frame.DataFrame'>
Index: 112392 entries, 0 to 113999
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          112392 non-null  object 
 1   popularity        112392 non-null  int64  
 2   duration_ms       112392 non-null  int64  
 3   explicit          112392 non-null  bool   
 4   danceability      112392 non-null  float64
 5   energy            112392 non-null  float64
 6   key               112392 non-null  int64  
 7   loudness          112392 non-null  float64
 8   mode              112392 non-null  int64  
 9   speechiness       112392 non-null  float64
 10  acousticness      112392 non-null  float64
 11  instrumentalness  112392 non-null  float64
 12  liveness          112392 non-null  float64
 13  valence           112392 non-null  float64
 14  tempo             112392 non-null  float64
 15  time_signature    112392 non-nu

Several columns/features need a closer look because their meaning is not obvious at first sight:

- explicit = indicates whether the track contains explicit content (we might transform the boolean values into 0s and 1s)
- key = musical key of the track (0 = C, ..., 11 = B); this might be an interesting feature. It is unclear whether it should be treated as ordinal; if we treat it as nominal, we should apply one-hot encoding
- mode = modality (major = 1, minor = 0) (can be left as it is)
- tempo = estimated tempo in beats per minute (this might also be a relevant feature)
- time_signature = time signature of the track (e.g. 4 = 4/4) (we should probably apply one-hot encoding here as well)
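The encodings suggested in the list above could look like the following sketch. It uses made-up toy data and assumes that key and time_signature are treated as nominal; the variable names `toy` and `encoded` are illustrative.

```python
import pandas as pd

# Toy data (made up) with the columns discussed in the list.
toy = pd.DataFrame({
    "explicit": [False, True, False],
    "key": [0, 7, 11],
    "mode": [1, 0, 1],
    "time_signature": [4, 3, 4],
})

# Boolean flag -> 0/1 integer.
toy["explicit"] = toy["explicit"].astype(int)

# One-hot encode the columns treated as nominal; mode is already binary.
encoded = pd.get_dummies(toy, columns=["key", "time_signature"], dtype=int)

print(encoded.columns.tolist())
```

One-hot encoding key adds up to 12 columns and time_signature a handful more, so the benefit should be weighed against the growth in dimensionality.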


In [8]:
# Explore explicit, key, mode, tempo and time_signature: all except tempo are boolean or int64 rather than continuous
# (track_id, popularity, duration_ms, track_genre are excluded since they will be explored separately)

print(df[["explicit", "key", "mode", "tempo", "time_signature"]].head(5))
print(df[["explicit", "key", "mode", "tempo", "time_signature"]].describe())

   explicit  key  mode    tempo  time_signature
0     False    1     0   87.917               4
1     False    1     1   77.489               4
2     False    0     1   76.332               4
3     False    0     1  181.740               3
4     False    2     1  119.949               4
                 key           mode          tempo  time_signature
count  112392.000000  112392.000000  112392.000000   112392.000000
mean        5.316215       0.636789     122.387033        3.912040
std         3.559547       0.480927      29.636057        0.399308
min         0.000000       0.000000      30.200000        1.000000
25%         2.000000       0.000000      99.600500        4.000000
50%         5.000000       1.000000     122.035500        4.000000
75%         8.000000       1.000000     140.099000        4.000000
max        11.000000       1.000000     243.372000        5.000000


### Redundant Features: Correlation

Redundant features will be examined during preprocessing, either through PCA or through feature subset selection.
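A first look at redundancy can be taken with a plain correlation matrix before PCA or subset selection is applied. The sketch below uses synthetic data in which loudness is deliberately constructed to follow energy; the threshold of 0.8 is an illustrative cut-off and is not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: loudness is built to track energy, valence is independent.
rng = np.random.default_rng(42)
energy = rng.uniform(0, 1, 200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 0.5, 200),
    "valence": rng.uniform(0, 1, 200),
})

# Absolute Pearson correlations between all numeric columns.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # illustrative cut-off
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 3))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

Pairs above the threshold are candidates for dropping one of the two features or for combining them through PCA.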

## Save Selection

In [9]:
save_step(df, "step1_selection")

Saved step1_selection.csv
