# Feature Engineering Validation Notebook


## Notebook Purpose
This notebook tests and validates the `features.py` module during development.
It checks the processing applied to the cleaned dataset, which involves the following steps:
- Converting columns with Yes/No values to binary 1s and 0s.
- Converting ordinal features from string to numeric representations.
- One-hot encoding categorical columns.
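
The three steps above can be sketched on a toy frame. This is only an illustration under stated assumptions: the `encode` helper and the toy values are hypothetical and are not taken from `features.py`, and the ordinal mapping is assumed to be the inverse of the reverse mapping verified later in this notebook.

```python
import pandas as pd

# Toy rows that mimic the survey's column types (illustrative values only).
toy = pd.DataFrame({
    "Instrumentalist": ["Yes", "No", "Yes"],
    "Frequency [Jazz]": ["Never", "Sometimes", "Very frequently"],
    "Fav genre": ["Rock", "Pop", "Rock"],
})

# Assumed ordinal mapping (inverse of the reverse mapping checked later on).
FREQ_MAP = {"Never": 0, "Rarely": 1, "Sometimes": 2, "Very frequently": 3}


def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the encoding steps, not the real module."""
    out = frame.copy()
    # Step 1: Yes/No -> 1/0
    out["Instrumentalist"] = out["Instrumentalist"].map({"Yes": 1, "No": 0})
    # Step 2: ordinal strings -> integers
    out["Frequency [Jazz]"] = out["Frequency [Jazz]"].map(FREQ_MAP)
    # Step 3: one-hot encode the nominal column
    return pd.get_dummies(out, columns=["Fav genre"], dtype=float)


encoded = encode(toy)
print(encoded)
```

`pd.get_dummies` names the new columns `<column>_<value>`, which is the naming pattern the dummy-column checks in this notebook rely on.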

## Objectives  
- ✅ Load and review the dataset after feature engineering.
- ✅ Identify categorical features requiring encoding
- ✅ Apply and compare encoding methods such as One-Hot Encoding, Label Encoding, and Ordinal Encoding as appropriate  
- ✅ Ensure the encoded data preserves the integrity of the original variables 

## Dataset Source  
- [Music & Mental Health Dataset on Kaggle](https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results)

## Notes  
- This notebook is part of a larger data science pipeline aimed at exploring the relationship between music preferences and mental health indicators  
- All preprocessing choices will be documented and justified in context


In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Resampling (over- and under-sampling)
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import TomekLinks

# Statistics
from scipy import stats

from music_and_mental_health_survey_analysis.config import (
    PROCESSED_DATA_DIR, SAMPLING_EVALUATIONS_DIR, RAW_DATA_DIR
)
from music_and_mental_health_survey_analysis.config import (
    load_config_file
)

2025-06-10 09:28:44.437 | INFO     | music_and_mental_health_survey_analysis.config:<module>:15 - PROJ_ROOT path is: /home/arsen/Documents/dsc_projects/music_health_survey/music_mental_health_analysis


In [2]:
%load_ext autoreload
%autoreload 2

# ✅ Load and Review Dataset

In [3]:
df = pd.read_csv(PROCESSED_DATA_DIR / 'features.csv')
df.head()

Unnamed: 0,Age,Hours per day,BPM,While working,Instrumentalist,Composer,Exploratory,Foreign languages,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,improved,Primary streaming service_Apple Music,Primary streaming service_I do not use a streaming service.,Primary streaming service_Other streaming service,Primary streaming service_Pandora,Primary streaming service_Spotify,Primary streaming service_YouTube Music,Fav genre_Classical,Fav genre_Country,Fav genre_EDM,Fav genre_Folk,Fav genre_Gospel,Fav genre_Hip hop,Fav genre_Jazz,Fav genre_K pop,Fav genre_Latin,Fav genre_Lofi,Fav genre_Metal,Fav genre_Pop,Fav genre_R&B,Fav genre_Rap,Fav genre_Rock,Fav genre_Video game music
0,18.0,3.0,156.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,0.0,3.0,3.0,1.0,0.0,3.0,2.0,3.0,0.0,2.0,3,0,1,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,63.0,1.5,119.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,1.0,3.0,1.0,2.0,1.0,0.0,2.0,2.0,1.0,3.0,1.0,7,2,2,1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,18.0,4.0,132.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,1.0,1.0,3.0,0.0,2.0,2.0,1.0,0.0,1.0,1.0,3.0,7,7,10,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,61.0,2.5,84.0,1.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,1.0,2.0,0.0,3.0,2.0,3.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,9,7,3,3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,18.0,4.0,107.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,3.0,0.0,3.0,2.0,2.0,0.0,2.0,3.0,3.0,0.0,1.0,7,2,5,9,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [4]:
df.shape

(715, 51)

In [5]:
df.describe()

Unnamed: 0,Age,Hours per day,BPM,While working,Instrumentalist,Composer,Exploratory,Foreign languages,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,improved,Primary streaming service_Apple Music,Primary streaming service_I do not use a streaming service.,Primary streaming service_Other streaming service,Primary streaming service_Pandora,Primary streaming service_Spotify,Primary streaming service_YouTube Music,Fav genre_Classical,Fav genre_Country,Fav genre_EDM,Fav genre_Folk,Fav genre_Gospel,Fav genre_Hip hop,Fav genre_Jazz,Fav genre_K pop,Fav genre_Latin,Fav genre_Lofi,Fav genre_Metal,Fav genre_Pop,Fav genre_R&B,Fav genre_Rap,Fav genre_Rock,Fav genre_Video game music
count,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0,715.0
mean,24.93958,3.484685,123.226294,0.795804,0.321678,0.172028,0.717483,0.552448,1.334266,0.823776,1.025175,1.01958,0.379021,1.398601,1.033566,0.735664,0.605594,1.07972,1.213986,2.065734,1.26993,1.33986,2.081119,1.254545,5.840559,4.806993,3.717483,2.613986,0.753846,0.06993,0.092308,0.067133,0.012587,0.626573,0.131469,0.072727,0.034965,0.048951,0.040559,0.006993,0.047552,0.027972,0.036364,0.004196,0.013986,0.117483,0.159441,0.047552,0.027972,0.254545,0.058741
std,11.602233,2.618192,29.571452,0.403395,0.467447,0.377669,0.450539,0.49759,0.989518,0.924151,1.050241,1.012337,0.694401,1.028582,0.938736,1.003524,0.863821,1.026577,1.131581,0.918157,1.057787,1.052151,1.031232,1.072754,2.769842,3.008709,3.059387,2.813896,0.431071,0.255208,0.289662,0.250427,0.111563,0.484053,0.338149,0.25987,0.18382,0.215917,0.197405,0.08339,0.212966,0.165008,0.187324,0.064684,0.117515,0.32222,0.366343,0.212966,0.165008,0.43591,0.235304
min,10.0,0.1,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,2.0,104.7,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.5,0.0,4.0,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,21.0,3.0,120.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,6.0,5.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,27.0,5.0,140.0,1.0,1.0,0.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,3.0,2.0,8.0,7.0,6.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,80.0,16.0,210.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,10.0,10.0,10.0,10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# Check for duplicate columns

print(f"Duplicated columns: {', '.join(df.columns[df.columns.duplicated()])}")

Duplicated columns: 


In [7]:
# Check whether any columns are non-numeric

df.select_dtypes(exclude='number').columns

Index([], dtype='object')

In [8]:
# Check for missing values

print(f"Columns with missing values: {df.columns[df.isna().sum() > 0]}")

Columns with missing values: Index([], dtype='object')
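
The three structural checks above (duplicate columns, non-numeric columns, missing values) could also be bundled into one helper so they are easy to rerun after each change to the module. A minimal sketch, where `structural_report` and the toy frame are hypothetical names introduced only for illustration:

```python
import pandas as pd


def structural_report(frame: pd.DataFrame) -> dict:
    """Collect duplicated, non-numeric, and incomplete column names."""
    return {
        "duplicated": frame.columns[frame.columns.duplicated()].tolist(),
        "non_numeric": frame.select_dtypes(exclude="number").columns.tolist(),
        "missing": frame.columns[frame.isna().any().to_numpy()].tolist(),
    }


# Toy frame with one missing value and one string column (illustrative only).
toy = pd.DataFrame({"Age": [18.0, None], "Fav genre": ["Rock", "Pop"]})
report = structural_report(toy)
print(report)
```

For a fully encoded dataset, all three lists would be expected to be empty.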


# Identify Encoded Features

- All categorical features were encoded.
  - ✅ Binary features
  - ✅ Ordinal features
  - ✅ Dummy features
- Target feature was encoded.

## ✅ Binary columns

In [9]:
# Check binary columns

binary_cols = ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']
df[binary_cols].nunique()

While working        2
Instrumentalist      2
Composer             2
Exploratory          2
Foreign languages    2
dtype: int64

In [10]:
# Confirm all values are 0.0s or 1.0s

df[binary_cols].isin([0.0, 1.0]).all()

While working        True
Instrumentalist      True
Composer             True
Exploratory          True
Foreign languages    True
dtype: bool
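
The check above confirms the value range but not that each 1 corresponds to an original "Yes". A possible stricter comparison against the cleaned data is sketched below on toy frames. It assumes the cleaned columns hold the strings `Yes` and `No` and that both frames share the same row order; the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the cleaned and encoded frames (illustrative values only).
cleaned_toy = pd.DataFrame({"Composer": ["Yes", "No", "No"],
                            "Exploratory": ["No", "Yes", "Yes"]})
encoded_toy = pd.DataFrame({"Composer": [1.0, 0.0, 0.0],
                            "Exploratory": [0.0, 1.0, 1.0]})

binary_cols = ["Composer", "Exploratory"]

# Rebuild the expected encoding from the original strings and compare cell by cell.
expected = cleaned_toy[binary_cols].eq("Yes").astype(float)
binary_ok = bool((expected == encoded_toy[binary_cols]).all().all())
print(f"Binary columns match cleaned Yes/No values: {binary_ok}")
```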

In [11]:
# Load cleaned data prior to feature engineering

cleaned_df = pd.read_csv(PROCESSED_DATA_DIR / 'cleaned.csv')

In [12]:
# Check improved

df['improved'].unique()

array([1., 0.])

In [13]:
improve_index = cleaned_df[cleaned_df['Music effects'] == 'Improve'].index
print(f"improved correctly generated from Music effects: {all(df.loc[improve_index, 'improved'] == 1)}")

improved correctly generated from Music effects: True


## ✅ Ordinal columns

In [14]:
# Check ordinal values

freq_cols = [col for col in df.columns if col.startswith('Frequency')]
df[freq_cols].isin([0.0, 1.0, 2.0, 3.0]).all()

Frequency [Classical]           True
Frequency [Country]             True
Frequency [EDM]                 True
Frequency [Folk]                True
Frequency [Gospel]              True
Frequency [Hip hop]             True
Frequency [Jazz]                True
Frequency [K pop]               True
Frequency [Latin]               True
Frequency [Lofi]                True
Frequency [Metal]               True
Frequency [Pop]                 True
Frequency [R&B]                 True
Frequency [Rap]                 True
Frequency [Rock]                True
Frequency [Video game music]    True
dtype: bool

In [15]:
# Use reverse mapping to verify encoding

mapping = {0: 'Never', 1: 'Rarely', 2: 'Sometimes', 3: 'Very frequently'}
random_sample = df.sample(n=5)
equality = []

for index, row in random_sample.iterrows():
    for col in freq_cols:
        equality.append(mapping[row[col]] == cleaned_df.loc[index, col])

print(f"Mapping of frequency columns valid: {all(equality)}")

Mapping of frequency columns valid: True
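
The spot check above samples five rows. If an exhaustive comparison is preferred, the reverse mapping can be applied to every frequency column at once. A minimal sketch on toy frames, assuming both frames share the same row order (the frame names are hypothetical):

```python
import pandas as pd

# Same reverse mapping as in the cell above.
FREQ_MAP = {0: "Never", 1: "Rarely", 2: "Sometimes", 3: "Very frequently"}

# Toy stand-ins for the cleaned and encoded frames (illustrative values only).
cleaned_toy = pd.DataFrame({
    "Frequency [Jazz]": ["Never", "Sometimes", "Very frequently"],
    "Frequency [Pop]": ["Rarely", "Rarely", "Never"],
})
encoded_toy = pd.DataFrame({
    "Frequency [Jazz]": [0.0, 2.0, 3.0],
    "Frequency [Pop]": [1.0, 1.0, 0.0],
})

freq_cols = [c for c in encoded_toy.columns if c.startswith("Frequency")]

# Decode every cell, then compare the decoded frame with the original strings.
decoded = encoded_toy[freq_cols].apply(lambda s: s.astype(int).map(FREQ_MAP))
mapping_ok = bool((decoded == cleaned_toy[freq_cols]).all().all())
print(f"Mapping of frequency columns valid for all rows: {mapping_ok}")
```

Unlike a random sample, this comparison is deterministic, so it gives the same answer on every run.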


## ✅ Dummy columns

In [16]:
# Check dummy cols

dummy_cols = [col for col in df.columns if '_' in col]
dummy_cols

['Primary streaming service_Apple Music',
 'Primary streaming service_I do not use a streaming service.',
 'Primary streaming service_Other streaming service',
 'Primary streaming service_Pandora',
 'Primary streaming service_Spotify',
 'Primary streaming service_YouTube Music',
 'Fav genre_Classical',
 'Fav genre_Country',
 'Fav genre_EDM',
 'Fav genre_Folk',
 'Fav genre_Gospel',
 'Fav genre_Hip hop',
 'Fav genre_Jazz',
 'Fav genre_K pop',
 'Fav genre_Latin',
 'Fav genre_Lofi',
 'Fav genre_Metal',
 'Fav genre_Pop',
 'Fav genre_R&B',
 'Fav genre_Rap',
 'Fav genre_Rock',
 'Fav genre_Video game music']

In [17]:
# Verify values in dummy cols

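# Note: `df[col].all() in [0, 1]` is always True because a bool equals 0 or 1, so test membership per value instead
print(f"All values 0 or 1: {df[dummy_cols].isin([0, 1]).all().all()}")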

All values 0 or 1: True
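
Beyond the 0/1 range, a one-hot group is usually expected to contain exactly one 1 per row. The sketch below illustrates that check on a toy frame. `one_hot_group_ok` is a hypothetical helper, and the exactly-one rule assumes the source column had no missing values and that no dummy level was dropped.

```python
import pandas as pd

# Toy encoded frame with two one-hot groups (illustrative values only).
encoded_toy = pd.DataFrame({
    "Fav genre_Pop": [0.0, 1.0, 0.0],
    "Fav genre_Rock": [1.0, 0.0, 1.0],
    "Primary streaming service_Spotify": [1.0, 0.0, 1.0],
    "Primary streaming service_Pandora": [0.0, 1.0, 0.0],
})


def one_hot_group_ok(frame: pd.DataFrame, prefix: str) -> bool:
    """True if the group's cells are all 0/1 and every row has exactly one 1."""
    cols = [c for c in frame.columns if c.startswith(prefix + "_")]
    block = frame[cols]
    only_binary = block.isin([0, 1]).all().all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(only_binary and one_per_row)


for prefix in ["Fav genre", "Primary streaming service"]:
    print(f"{prefix}: {one_hot_group_ok(encoded_toy, prefix)}")
```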


In [18]:
# Verify all primary streaming service values included in dummy cols

service_vals = cleaned_df['Primary streaming service'].unique()
dummy_service_cols = [col for col in df.columns if col.startswith('Primary streaming service')]
print(f"All primary streaming service values converted to dummy cols: {len(service_vals) == len(dummy_service_cols)}")


All primary streaming service values converted to dummy cols: True


In [19]:
# Verify all fav genre values included in dummy cols

genre_vals = cleaned_df['Fav genre'].unique()
dummy_genre_cols = [col for col in df.columns if col.startswith('Fav genre')]
print(f"All Fav genre values converted to dummy cols: {len(genre_vals) == len(dummy_genre_cols)}")


All Fav genre values converted to dummy cols: True
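
Comparing counts confirms that the number of dummy columns is right, but not that their labels match the original categories. A set comparison covers both. The sketch below uses toy frames, and `dummy_labels_match` is a hypothetical helper that assumes the default `<column>_<value>` naming produced by `pd.get_dummies`.

```python
import pandas as pd

# Toy stand-ins for the cleaned and encoded frames (illustrative values only).
cleaned_toy = pd.DataFrame({"Fav genre": ["Rock", "Pop", "Rock"]})
encoded_toy = pd.DataFrame({"Fav genre_Pop": [0.0, 1.0, 0.0],
                            "Fav genre_Rock": [1.0, 0.0, 1.0]})


def dummy_labels_match(cleaned: pd.DataFrame, encoded: pd.DataFrame, column: str) -> bool:
    """True if the dummy column suffixes equal the set of original category values."""
    prefix = f"{column}_"
    expected = set(cleaned[column].dropna().unique())
    actual = {c[len(prefix):] for c in encoded.columns if c.startswith(prefix)}
    return expected == actual


genre_labels_ok = dummy_labels_match(cleaned_toy, encoded_toy, "Fav genre")
print(f"Fav genre labels match dummy columns: {genre_labels_ok}")
```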


## ✅ Target feature: `improved`

In [21]:
df['improved'].unique()

array([1., 0.])

In [29]:
positive_index = df[df['improved'] == 1].index
cleaned_effects = cleaned_df.loc[positive_index, 'Music effects']
print(f"Music effects correctly encoded as improved: {all(cleaned_effects == 'Improve')}")

Music effects correctly encoded as improved: True
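
The two directional checks on `improved` (every "Improve" row is 1, and every 1 row is "Improve") can be collapsed into a single element-wise comparison. A minimal sketch on toy frames, assuming both frames share the same row order; the non-"Improve" labels below are illustrative placeholders:

```python
import pandas as pd

# Toy stand-ins for the cleaned and encoded frames (illustrative values only).
cleaned_toy = pd.DataFrame({"Music effects": ["Improve", "No effect", "Worsen", "Improve"]})
encoded_toy = pd.DataFrame({"improved": [1.0, 0.0, 0.0, 1.0]})

# Rebuild the expected target from the original labels.
expected = (cleaned_toy["Music effects"] == "Improve").astype(float)

# Equality in every row implies both directions of the check at once.
target_ok = bool((expected == encoded_toy["improved"]).all())
print(f"improved matches Music effects in both directions: {target_ok}")
```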
