# Introducing the masker (sound source) as an input feature

### First, transform the masker column to numbers: one-hot encoding


One-hot encoding converts a categorical variable into a numerical format that machine learning algorithms can use. It is particularly useful when the categories have no inherent order or hierarchy, because it avoids implying one.

Here's how one-hot encoding works:

1) Identify Unique Categories:
First, you identify all the unique categories present in the categorical variable.

2) Create Binary Columns:
For each unique category, you create a new binary column. Each binary column corresponds to one unique category.

3) Assign Values:
In each binary column, you assign a value of 1 if the observation belongs to the category represented by that column, and 0 otherwise.
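
The three steps above can be sketched in plain Python on a toy list of labels. The labels are made up for illustration and are not taken from the dataset; sorting the categories is an assumption made here only so that the column order is reproducible.

```python
# Toy labels (hypothetical, for illustration only)
labels = ["silence", "water", "traffic", "silence"]

# 1) Identify the unique categories (sorted for a reproducible column order)
categories = sorted(set(labels))

# 2) + 3) One binary column per category: 1 where the label matches, 0 otherwise
one_hot = [
    [1 if label == category else 0 for category in categories]
    for label in labels
]

print(categories)
for label, row in zip(labels, one_hot):
    print(label, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.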



In [1]:
import sklearn.linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [16]:
responses = pd.read_csv(os.path.join('..','data','responses.csv'), dtype = {'participant':str})
print(responses.columns.shape, responses.shape, responses.columns)

(160,) (27255, 160) Index(['participant', 'fold_r', 'soundscape', 'masker', 'smr',
       'stimulus_index', 'time_taken', 'is_attention', 'pleasant', 'eventful',
       ...
       'M04000_0_r', 'M05000_0_r', 'M06300_0_r', 'M08000_0_r', 'M10000_0_r',
       'M12500_0_r', 'M16000_0_r', 'M20000_0_r', 'Leq_L_r', 'Leq_R_r'],
      dtype='object', length=160)


Extract only the `masker` column, which is the one to be one-hot encoded.

In [6]:
maskers=responses["masker"]

Next, extract the masker type from each file name (format `type_number.wav`, e.g. `silence_00001.wav`) and list the distinct masker types that occur.

In [8]:
# Generate maskers column with just masker type
maskers_type=maskers.str.split("_").str[0]
print(maskers_type)

0             silence
1             silence
2               water
3             traffic
4             traffic
             ...     
27250         traffic
27251         silence
27252    construction
27253         silence
27254         silence
Name: masker, Length: 27255, dtype: object


In [13]:
# List the distinct masker types (in order of first appearance)
maskers_variety=maskers_type.unique().tolist()
print(maskers_variety)

['silence', 'water', 'traffic', 'construction', 'wind', 'bird']


Now, generate the one-hot encoded dataframe

In [14]:
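# One column per masker type, named masker_<type>, in alphabetical order.
# dtype=int gives 0/1 values instead of booleans on recent pandas versions.
one_hot_encoded=pd.get_dummies(maskers_type, prefix="masker", dtype=int)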
print(one_hot_encoded)

       masker_bird  masker_construction  masker_silence  masker_traffic  \
0                0                    0               1               0   
1                0                    0               1               0   
2                0                    0               0               0   
3                0                    0               0               1   
4                0                    0               0               1   
...            ...                  ...             ...             ...   
27250            0                    0               0               1   
27251            0                    0               1               0   
27252            0                    1               0               0   
27253            0                    0               1               0   
27254            0                    0               1               0   

       masker_water  masker_wind  
0                 0            0  
1                 0          
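
A quick sanity check on an encoding like this could look as follows. This is a minimal sketch on a toy series (the values are hypothetical, not the real responses): it assumes one label per row, so every row of the encoded frame should sum to 1, and the original label should be recoverable from the active column. `toy_types`, `toy_one_hot` and `recovered` are names introduced here for illustration only.

```python
import pandas as pd

# Toy stand-in for the masker types (hypothetical values, not the real data)
toy_types = pd.Series(["silence", "water", "traffic", "silence", "bird"], name="masker")

# dtype=int forces 0/1 integers; recent pandas versions default to booleans
toy_one_hot = pd.get_dummies(toy_types, prefix="masker", dtype=int)

# Every row should have exactly one active column
row_sums = toy_one_hot.sum(axis=1)
assert (row_sums == 1).all()

# Recover the original label from the active column and compare
recovered = toy_one_hot.idxmax(axis=1).str.replace("masker_", "", regex=False)
assert recovered.tolist() == toy_types.tolist()

print(toy_one_hot.columns.tolist())
```

The same two assertions could be run on `one_hot_encoded` and `maskers_type` before concatenating, to catch missing or malformed masker names early.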

Finally, concatenate the one-hot encoded dataframe with the original column-wise (both share the same row index) and store the result as a new CSV file.

In [15]:
# Concatenate
complete_responses=pd.concat([responses, one_hot_encoded], axis=1)
print(complete_responses.shape, complete_responses)

(27255, 166)        participant  fold_r                          soundscape  \
0      ARAUS_00001      -1  R0091_segment_binaural_44100_1.wav   
1      ARAUS_00001       1  R0079_segment_binaural_44100_1.wav   
2      ARAUS_00001       1  R0056_segment_binaural_44100_2.wav   
3      ARAUS_00001       1  R0046_segment_binaural_44100_2.wav   
4      ARAUS_00001       1  R0092_segment_binaural_44100_1.wav   
...            ...     ...                                 ...   
27250  ARAUS_10005       0    R1007_segment_binaural_44100.wav   
27251  ARAUS_10005       0    R1006_segment_binaural_44100.wav   
27252  ARAUS_10005       0    R1008_segment_binaural_44100.wav   
27253  ARAUS_10005       0    R1007_segment_binaural_44100.wav   
27254  ARAUS_10005      -1  R0091_segment_binaural_44100_1.wav   

                       masker  smr  stimulus_index  time_taken  is_attention  \
0           silence_00001.wav    0               1      98.328             0   
1           silence_00001.wav    6

In [18]:
# Store locally
complete_responses.to_csv("../data/responses_maskers_onehot.csv", index=False)