# Demo notebook for analyzing audio data

## Introduction

Audio data, as recorded by smartphones or other portable devices, can carry important information about individuals' environments. This may give insights into their activity, sleep, and social interaction. However, using audio data can be difficult due to privacy concerns. A possible solution is to compute more general characteristics (e.g. frequency) and use those instead of the raw audio. Still, some aggregation and preprocessing needs to be done. To address this last part, `niimpy` includes the function `extract_features_audio` to clean, downsample, and extract features from audio data. This function employs other functions to extract the following features:

- `audio_count_silent`: number of times when the environment has been silent
- `audio_count_speech`: number of times when there has been some sound in the environment that matches the range of human speech frequency (65-255 Hz)
- `audio_count_loud`: number of times when there has been some sound in the environment above 70 dB
- `audio_min_freq`: minimum frequency of the recorded audio snippets
- `audio_max_freq`: maximum frequency of the recorded audio snippets
- `audio_mean_freq`: mean frequency of the recorded audio snippets
- `audio_median_freq`: median frequency of the recorded audio snippets
- `audio_std_freq`: standard deviation of the frequency of the recorded audio snippets
- `audio_min_db`: minimum decibels of the recorded audio snippets
- `audio_max_db`: maximum decibels of the recorded audio snippets
- `audio_mean_db`: mean decibels of the recorded audio snippets
- `audio_median_db`: median decibels of the recorded audio snippets
- `audio_std_db`: standard deviation of the decibels of the recorded audio snippets
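
The count features in the list above can be sketched with plain pandas on a made-up dataframe. This is only an illustration of the stated definitions, not `niimpy`'s implementation: the toy values and the helper name `count_events` are assumptions, and the column names follow the sample data used in this notebook.

```python
import pandas as pd

# Made-up observations; column names follow the niimpy sample data.
toy = pd.DataFrame(
    {
        "is_silent": [1, 0, 0, 0],
        "double_decibels": [36, 84, 55, 76],
        "double_frequency": [2914, 120, 91, 7419],
    }
)


def count_events(df):
    """Count silent, speech-range, and loud observations (illustrative only)."""
    # Observations flagged as silent by the device
    silent = int((df["is_silent"] == 1).sum())
    # Speech range taken from the feature list: 65-255 Hz (inclusive here)
    speech = int(df["double_frequency"].between(65, 255).sum())
    # Loudness threshold taken from the feature list: above 70 dB
    loud = int((df["double_decibels"] > 70).sum())
    return {"silent": silent, "speech": speech, "loud": loud}


counts = count_events(toy)
print(counts)  # {'silent': 1, 'speech': 2, 'loud': 2}
```

In `niimpy` these counts are computed per user and per aggregation window; the sketch leaves that grouping out to keep the thresholds visible.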

In the following, we will analyze the audio sample data provided by `niimpy`.

## Read data

In [14]:
import niimpy
import niimpy.preprocessing.audio as au

In [15]:
data = niimpy.read_csv(niimpy.sampledata.AUDIO_FILE, tz='Europe/Helsinki')
data.shape

(33, 7)

There are 33 datapoints with 7 columns in the dataset. Let us have a quick look at the data:

In [18]:
data.head()

,user,device,time,is_silent,double_decibels,double_frequency,datetime
2020-01-09 02:08:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578528000.0,0,84,4935,2020-01-09 02:08:03.896000+02:00
2020-01-09 02:38:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,0,89,8734,2020-01-09 02:38:03.896000+02:00
2020-01-09 03:08:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578532000.0,0,99,1710,2020-01-09 03:08:03.896000+02:00
2020-01-09 03:38:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578534000.0,0,77,9054,2020-01-09 03:38:03.896000+02:00
2020-01-09 04:08:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578536000.0,0,80,12265,2020-01-09 04:08:03.896000+02:00


In [19]:
data.tail()

,user,device,time,is_silent,double_decibels,double_frequency,datetime
2019-08-13 15:02:17.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565698000.0,1,44,2914,2019-08-13 15:02:17.657999872+03:00
2019-08-13 15:28:59.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565699000.0,1,49,7195,2019-08-13 15:28:59.657999872+03:00
2019-08-13 15:59:01.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565701000.0,0,55,91,2019-08-13 15:59:01.657999872+03:00
2019-08-13 16:29:03.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565703000.0,0,76,3853,2019-08-13 16:29:03.657999872+03:00
2019-08-13 16:59:05.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565705000.0,0,84,7419,2019-08-13 16:59:05.657999872+03:00


The dataframe seems to be complete. Its index consists of timestamps, and it has three main columns: `is_silent`, `double_decibels`, and `double_frequency`. In addition, the dataframe contains information from multiple users.
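
Both observations (completeness and multiple users) can be checked explicitly rather than by eye. The sketch below uses a small made-up dataframe with the same column names; the user IDs and values are placeholders, not the sample data.

```python
import pandas as pd

# Made-up stand-in for the sample data: two users, timestamp index.
idx = pd.to_datetime(
    ["2020-01-09 02:08:03", "2020-01-09 02:38:03", "2019-08-13 15:59:01"]
)
toy = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_b"],
        "is_silent": [0, 0, 1],
        "double_decibels": [84, 89, 44],
        "double_frequency": [4935, 8734, 2914],
    },
    index=idx,
)

# Number of observations recorded for each user
rows_per_user = toy.groupby("user").size()
# Number of missing values in each column (all zeros means complete)
missing_per_column = toy.isna().sum()

print(rows_per_user)
print(missing_per_column)
```

The same two lines can be run on `data` to confirm how many users it contains and whether any column has missing values.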

## Extracting features

To extract audio features, we use the function `extract_features_audio`. This function needs two inputs: a dataframe and a dictionary. The dataframe should contain the audio observations, and the dictionary is used to pass customizable arguments to the function. The function comes with default parameters. Let's have a look at those first.

### Default option

The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_audio` function with its default options, simply call it with `features=None`.

In [21]:
default = au.extract_features_audio(data, features=None)

computing audio_count_silent...
computing audio_count_speech...
computing audio_count_loud...
computing audio_min_freq...
computing audio_max_freq...
computing audio_mean_freq...
computing audio_median_freq...
computing audio_std_freq...
computing audio_min_db...
computing audio_max_db...
computing audio_mean_db...
computing audio_median_db...
computing audio_std_db...


The function prints the name of each feature as it is computed, so you can track its progress. Now let's have a look at the outputs.

In [22]:
default.head()

user,,audio_count_silent,audio_count_speech,audio_count_loud,audio_min_freq,audio_max_freq,audio_mean_freq,audio_median_freq,audio_std_freq,audio_min_db,audio_max_db,audio_mean_db,audio_median_db,audio_std_db
iGyXetHE3S8u,2019-08-13 07:00:00+03:00,0,,,7735.0,7735.0,7735.0,7735.0,,51.0,51.0,51.0,51.0,
iGyXetHE3S8u,2019-08-13 07:30:00+03:00,0,,1.0,13609.0,13609.0,13609.0,13609.0,,90.0,90.0,90.0,90.0,
iGyXetHE3S8u,2019-08-13 08:00:00+03:00,0,,1.0,7690.0,7690.0,7690.0,7690.0,,81.0,81.0,81.0,81.0,
iGyXetHE3S8u,2019-08-13 08:30:00+03:00,0,,0.0,8347.0,8347.0,8347.0,8347.0,,58.0,58.0,58.0,58.0,
iGyXetHE3S8u,2019-08-13 09:00:00+03:00,1,,0.0,13592.0,13592.0,13592.0,13592.0,,36.0,36.0,36.0,36.0,


In [23]:
default.tail()

user,,audio_count_silent,audio_count_speech,audio_count_loud,audio_min_freq,audio_max_freq,audio_mean_freq,audio_median_freq,audio_std_freq,audio_min_db,audio_max_db,audio_mean_db,audio_median_db,audio_std_db
jd9INuQ5BBlW,2020-01-09 08:30:00+02:00,0,,0.0,,,,,,,,,,
jd9INuQ5BBlW,2020-01-09 09:00:00+02:00,0,,1.0,4569.0,4569.0,4569.0,4569.0,,93.0,93.0,93.0,93.0,
jd9INuQ5BBlW,2020-01-09 09:30:00+02:00,0,,1.0,2590.0,2590.0,2590.0,2590.0,,78.0,78.0,78.0,78.0,
jd9INuQ5BBlW,2020-01-09 10:00:00+02:00,0,,1.0,13981.0,13981.0,13981.0,13981.0,,98.0,98.0,98.0,98.0,
jd9INuQ5BBlW,2020-01-09 10:30:00+02:00,0,,1.0,9601.0,9601.0,9601.0,9601.0,,97.0,97.0,97.0,97.0,


The function output is also a dataframe, where each column stands for a feature. Its index has two levels: the subject (user) and the timestamp of the aggregation window.
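
Because the output is indexed by user and timestamp, standard pandas MultiIndex tools apply to it. The following sketch builds a small made-up frame with the same layout; the level name `timestamp` and all values are assumptions for illustration (the real output may leave the second level unnamed).

```python
import pandas as pd

# Made-up feature table indexed by (user, timestamp)
index = pd.MultiIndex.from_tuples(
    [
        ("user_a", pd.Timestamp("2019-08-13 07:00")),
        ("user_a", pd.Timestamp("2019-08-13 07:30")),
        ("user_b", pd.Timestamp("2020-01-09 09:00")),
    ],
    names=["user", "timestamp"],
)
features = pd.DataFrame(
    {
        "audio_max_freq": [7735.0, 13609.0, 4569.0],
        "audio_max_db": [51.0, 90.0, 93.0],
    },
    index=index,
)

# All windows of one user; the user level is dropped from the result
one_user = features.xs("user_a", level="user")

# Collapse the time level: loudest window per user
loudest_per_user = features.groupby(level="user")["audio_max_db"].max()

print(one_user)
print(loudest_per_user)
```

If the second index level has no name, `level=0` and `level=1` can be used in place of the level names.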

### Customized features

The `extract_features_audio` function can also be customized. We can:
- extract some of the features
- modify the aggregation periods

All of these modifications need to be inside the dictionary input. 

Let's see how to use this to compute only some of the features. To do so, we need to create a dictionary where the keys are the names of the features we want to compute, and the values are empty dictionaries.

In [24]:
custom = {}
custom['audio_max_freq'] = {}
custom['audio_max_db'] = {}

In [26]:
custom_output = au.extract_features_audio(data, features=custom)
custom_output.head()

computing audio_max_freq...
computing audio_max_db...


user,,audio_max_freq,audio_max_db
iGyXetHE3S8u,2019-08-13 07:00:00+03:00,7735.0,51.0
iGyXetHE3S8u,2019-08-13 07:30:00+03:00,13609.0,90.0
iGyXetHE3S8u,2019-08-13 08:00:00+03:00,7690.0,81.0
iGyXetHE3S8u,2019-08-13 08:30:00+03:00,8347.0,58.0
iGyXetHE3S8u,2019-08-13 09:00:00+03:00,13592.0,36.0


As we see, this time only two features were computed in a 30-min aggregated period. Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments from the `pandas.DataFrame.resample` function. 

For this example, we will aggregate the features `audio_mean_freq` and `audio_median_db`. The mean frequency will be computed on a daily basis and the median decibels will be computed in 5-hour periods with a 5-min offset.

In [48]:
features = {"audio_mean_freq":{"audio_column_name":"double_frequency","resample_args":{"rule":"1D"}},
               "audio_median_db":{"audio_column_name":"double_decibels","resample_args":{"rule":"5H","offset":"5min"}}}

As we see, we have an input dictionary in which the main keys are the names of the features to compute. For each feature, we also have a dictionary. This new dictionary has some other arguments, mainly the name of the column that we would like to use for the computation and another dictionary named `resample_args`. The name of the column helps us in case our dataframe has some other naming conventions. The `resample_args` dictionary contains the arguments to pass for the resampling (see `pandas.DataFrame.resample`).
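
To make the effect of `resample_args` concrete, the sketch below unpacks such a dictionary into `pandas.Series.resample` on made-up decibel readings. The data are invented, and the lowercase `"5h"` alias is used because recent pandas versions deprecate the uppercase `"5H"` spelling.

```python
import pandas as pd

# Made-up decibel readings for one user, one every 2 hours from 01:00
times = pd.date_range("2020-01-09 01:00", periods=6, freq="2h")
db = pd.Series([80, 82, 76, 77, 97, 98], index=times, name="double_decibels")

resample_args = {"rule": "5h", "offset": "5min"}

# Equivalent to db.resample(rule="5h", offset="5min")
median_db = db.resample(**resample_args).median()

# Without the offset, the windows start on the full hour instead
plain_median_db = db.resample("5h").median()

print(median_db)        # windows start at 00:05, 05:05, 10:05
print(plain_median_db)  # windows start at 00:00, 05:00, 10:00
```

The reading at 05:00 shows what the offset does: with the 5-min offset it still belongs to the window starting at 00:05, whereas without the offset it opens the window starting at 05:00.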

In [49]:
custom_output = au.extract_features_audio(data, features=features)
custom_output.head(15)

computing audio_mean_freq...
computing audio_median_db...


user,,audio_mean_freq,audio_median_db
iGyXetHE3S8u,2019-08-13 00:00:00+03:00,7822.058824,
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,8157.8125,
iGyXetHE3S8u,2019-08-13 05:05:00+03:00,,69.5
iGyXetHE3S8u,2019-08-13 10:05:00+03:00,,88.0
iGyXetHE3S8u,2019-08-13 15:05:00+03:00,,65.5
jd9INuQ5BBlW,2020-01-09 00:05:00+02:00,,82.0
jd9INuQ5BBlW,2020-01-09 05:05:00+02:00,,76.5
jd9INuQ5BBlW,2020-01-09 10:05:00+02:00,,97.5


The output is once again a dataframe, this time showing two aggregations. The first is the daily aggregation computed for the `audio_mean_freq` feature. The second is the 5-hour aggregation with a 5-min offset for the median decibels. This is why the user IDs are repeated. Note that because the `audio_median_db` feature is not aggregated daily, it is NaN at the daily timestamps. Similarly, because the `audio_mean_freq` feature is not aggregated in 5-hour windows, it is NaN at the 5-hour timestamps for all subjects.
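
The NaN pattern is what one gets whenever features computed on different time grids are placed side by side. The sketch below reproduces it with plain pandas on made-up data for a single user; combining the features with `pd.concat` is an assumption made for illustration and not necessarily how `niimpy` joins them internally.

```python
import pandas as pd

# Made-up observations for one user, one every 2 hours from 01:00
times = pd.date_range("2020-01-09 01:00", periods=6, freq="2h")
toy = pd.DataFrame(
    {
        "double_frequency": [4935, 8734, 1710, 9054, 12265, 4569],
        "double_decibels": [84, 89, 99, 77, 80, 93],
    },
    index=times,
)

# Daily mean frequency: a single window starting at midnight
daily_mean_freq = (
    toy["double_frequency"].resample("1D").mean().rename("audio_mean_freq")
)
# Median decibels in 5-hour windows shifted by 5 minutes
windowed_median_db = (
    toy["double_decibels"]
    .resample("5h", offset="5min")
    .median()
    .rename("audio_median_db")
)

# The two features share no timestamps, so each row is filled in one column only
combined = pd.concat([daily_mean_freq, windowed_median_db], axis=1)
print(combined)
```

When features with different windows are needed for modelling, selecting one column and calling `dropna()` on it recovers that feature on its own grid.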

Finally, just for the sake of demonstration, we will compute the same feature with different column names. Note that one of these computations will be incorrect because we will be passing wrong values from the start. 

In [53]:
features = {"audio_mean_freq":{"audio_column_name":"double_frequency","resample_args":{"rule":"12H"}},
               "audio_max_db":{"audio_column_name":"double_frequency","resample_args":{"rule":"12H"}}}

custom_output = au.extract_features_audio(data, features=features)
custom_output.head()

computing audio_mean_freq...
computing audio_max_db...


user,,audio_mean_freq,audio_max_db
iGyXetHE3S8u,2019-08-13 00:00:00+03:00,8982.8,13609
iGyXetHE3S8u,2019-08-13 12:00:00+03:00,6163.857143,14529
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,8157.8125,14408


Since we are passing frequency observations as decibels, the `audio_max_db` values are wrong. Nevertheless, the example demonstrates how easily we can pass column names in case the dataframe does not follow the standard naming conventions.
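
Mistakes like this one are easy to miss because the computation runs without errors. A simple range check before extracting features can catch them. The helper `check_db_range` below is hypothetical (it is not part of `niimpy`), and the 0-140 dB bounds are an assumed plausible range for sound levels, not a documented limit.

```python
import pandas as pd


def check_db_range(df, column, low=0, high=140):
    """Return True if every value in `column` lies within [low, high].

    Hypothetical helper; the default bounds are an assumed plausible
    range for decibel readings.
    """
    return bool(df[column].between(low, high).all())


# Made-up observations with the sample data's column names
toy = pd.DataFrame(
    {
        "double_decibels": [84, 89, 99],
        "double_frequency": [4935, 8734, 1710],
    }
)

db_ok = check_db_range(toy, "double_decibels")     # values look like decibels
freq_ok = check_db_range(toy, "double_frequency")  # values are far too large
print(db_ok, freq_ok)  # True False
```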

## Implementing own features

We can easily implement our own features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps.
To make the feature readily available in the default options, we need to add the *audio* prefix to the new function's name (e.g. `audio_my_new_feature`).

In [59]:
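import pandas as pd

def audio_sum_freq(df, feature_functions=None):
    # Copy the arguments so the caller's dictionary is not modified;
    # this also handles feature_functions=None
    feature_functions = dict(feature_functions or {})
    # Fall back to the default column and 30-minute windows
    col_name = feature_functions.get("audio_column_name", "double_frequency")
    resample_args = feature_functions.get("resample_args", {"rule": "30min"})

    # Nothing to aggregate: return an empty frame with the expected column
    if len(df) == 0:
        return pd.DataFrame(columns=['audio_sum_freq'])

    # Sum the frequencies per user and aggregation window
    result = df.groupby('user')[col_name].resample(**resample_args).sum()
    return result.to_frame(name='audio_sum_freq')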

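Then, we can try to call our new function through the `extract_features_audio` function. In the run recorded below, the call ends in a `NameError`, which suggests that feature names are looked up inside the `niimpy` audio module rather than in the notebook's namespace, so a function defined only in the notebook is not found.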

In [63]:
customized_features = au.extract_features_audio(data, features={"audio_sum_freq": {}})

computing audio_sum_freq...


NameError: name 'audio_sum_freq' is not defined
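
Independently of `extract_features_audio`, a custom feature function can be tested by calling it directly on a small dataframe. The sketch below does this for a hypothetical feature, `audio_range_db` (maximum minus minimum decibels per user and window). The function, the toy data, and the `"30min"` default are assumptions for illustration; only the calling convention (a dataframe plus an argument dictionary) follows the notebook.

```python
import pandas as pd


def audio_range_db(df, feature_functions=None):
    """Hypothetical feature: max minus min decibels per user and window."""
    config = dict(feature_functions or {})
    col_name = config.get("audio_column_name", "double_decibels")
    resample_args = config.get("resample_args", {"rule": "30min"})

    # One resampler per user, then the spread within each window
    per_window = df.groupby("user")[col_name].resample(**resample_args)
    result = per_window.max() - per_window.min()
    return result.to_frame(name="audio_range_db")


# Made-up observations for two users
times = pd.to_datetime(
    ["2020-01-09 02:00", "2020-01-09 02:10", "2020-01-09 02:40", "2020-01-09 02:05"]
)
toy = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_a", "user_b"],
        "double_decibels": [84, 89, 99, 44],
    },
    index=times,
)

out = audio_range_db(toy)
print(out)
```

The result is indexed by user and window start, the same layout as the built-in features, so it can be inspected or joined with them using ordinary pandas operations.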