# Training and test sets creation
The first step of the project consists of creating the training and test sets.

## Feature extraction
The dataset is created with the `Features` class.
Each audio file is loaded into memory and the following features are extracted:
- mfcc 
- chroma
- rms

Every feature array is then reduced with the following functions: 
- min
- max
- median
- mean

The results are concatenated, giving a total of 132 features per audio file.
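
The reduction described above can be sketched with NumPy alone. This is a minimal illustration, not the `Features` implementation: the random matrices stand in for real audio features, and the row counts (20 MFCC, 12 chroma, 1 RMS) are an assumption that matches librosa's defaults and the stated total of 132. The helper name `reduce_feature` is hypothetical.

```python
import numpy as np

# Assumed per-file feature matrices, shaped (coefficients, frames).
# Row counts are an assumption: 20 MFCC + 12 chroma + 1 RMS = 33 rows.
rng = np.random.default_rng(0)
n_frames = 173  # arbitrary number of analysis frames
mfcc = rng.normal(size=(20, n_frames))
chroma = rng.random(size=(12, n_frames))
rms = rng.random(size=(1, n_frames))


def reduce_feature(matrix):
    """Collapse the time axis with min, max, median and mean."""
    return np.concatenate([
        matrix.min(axis=1),
        matrix.max(axis=1),
        np.median(matrix, axis=1),
        matrix.mean(axis=1),
    ])


# Reduce each feature separately, then concatenate the results.
feature_vector = np.concatenate(
    [reduce_feature(m) for m in (mfcc, chroma, rms)]
)
print(feature_vector.shape)  # (132,)
```

Under these assumptions, 33 rows times 4 statistics gives the 132 values per file.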

## Structure
Each row of the dataset has the following structure:
$$\mathit{class}, \; \mathit{feature}_1, \; \dots, \; \mathit{feature}_n$$
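
As a rough illustration of this layout, the sketch below builds a header and one row. The column names (`feature_1`, ...) and the label value are hypothetical placeholders; the text does not state the real names or how the class is encoded.

```python
# Hypothetical column names; the real names are not given in the text.
n_features = 132
columns = ["class"] + [f"feature_{i}" for i in range(1, n_features + 1)]

# One illustrative row: a class label followed by placeholder values.
row = ["dog_bark"] + [0.0] * n_features
record = dict(zip(columns, row))
print(len(columns))  # 133
```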

## Scaling
A standard scaler is fitted on the training set and saved to disk.
When the test folds are processed, that same scaler is applied to the data.
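
The fit-once, reuse-everywhere pattern can be sketched as follows. This is a plain NumPy stand-in for a standard scaler, not the code used by `Features`: the data are random, the dictionary format is an assumption, and the file path is a temporary placeholder.

```python
import os
import pickle
import tempfile

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# "Fit": per-feature mean and standard deviation, training data only.
scaler = {"mean": train.mean(axis=0), "std": train.std(axis=0)}

# Persist the fitted statistics (placeholder path in a temp directory).
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler_example.pkl")
with open(scaler_path, "wb") as fh:
    pickle.dump(scaler, fh)

# Later: reload and apply the same statistics to a test fold.
with open(scaler_path, "rb") as fh:
    loaded = pickle.load(fh)

train_scaled = (train - scaler["mean"]) / scaler["std"]
test_scaled = (test - loaded["mean"]) / loaded["std"]
```

Using the training statistics for the test folds keeps information about the test data out of the preprocessing step.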

## Dask speed up
Dask is used to speed up the computation.
Four workers run in parallel to extract the features, reducing the processing time for a single fold from about 70 seconds to just under 30.
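
A minimal sketch of this kind of parallel extraction is shown below. How `Features` actually configures Dask is not shown in this notebook, so the local threaded scheduler is an assumption, and `extract_features` and the file names are hypothetical stand-ins.

```python
import time

from dask import compute, delayed


def extract_features(path):
    """Hypothetical stand-in for per-file feature extraction."""
    time.sleep(0.01)  # simulate I/O and computation
    return path, len(path)


# Placeholder file names, not real dataset paths.
paths = [f"clip_{i}.wav" for i in range(8)]

# Build one lazy task per file, then run them on 4 workers.
tasks = [delayed(extract_features)(p) for p in paths]
results = compute(*tasks, scheduler="threads", num_workers=4)
print(len(results))  # 8
```

The results come back in the same order as the input files, so they can be assembled into rows directly.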

## Training dataset
The first step is to build the training dataset from folds 1, 2, 3, 4 and 6. The resulting dataset contains 4499 samples.


In [None]:
!pip install wanglaoshi

In [None]:
from wanglaoshi import JupyterEnv as JE
JE.jupyter_kernel_list()

In [None]:
JE.install_kernel()

Refresh the environment.

In [None]:
!pip install -r /mnt/workspace/urban-sound-classification/src/requirements.txt

In [None]:
!pip install dask

In [None]:
!pip install "dask[distributed]" --upgrade

## Fetching the data

Run the following commands in order in a terminal:


![0oKZjO](https://upiclw.oss-cn-beijing.aliyuncs.com/uPic/0oKZjO.png)

```shell
mkdir -p /mnt/workspace/urban-sound-classification/data/raw/
```

```shell
git clone https://github.com/WangLaoShi/UrbanSound8K.git /mnt/workspace/urban-sound-classification/data/raw/
```

```shell
ls /mnt/workspace/urban-sound-classification/data/raw
```

```shell
mv /mnt/workspace/urban-sound-classification/data/raw/UrbanSound8K/metadata/ /mnt/workspace/urban-sound-classification/data/raw/
```

```shell
mv /mnt/workspace/urban-sound-classification/data/raw/UrbanSound8K/audio/ /mnt/workspace/urban-sound-classification/data/raw/
```



In [30]:
import librosa
print(librosa.__version__)

0.10.1


In [33]:
import sys
sys.path.append("..")
from src.data import Features
import pandas as pd
import numpy as np

## Unscaled training set 
The following cell extracts the unscaled training set.

In [34]:
f = Features(save_path="../data/processed/initial",
             save_name="train_unscaled",
             folds=[1,2,3,4,6])

training_dataframe = f.get_dataframe()
f.save_dataframe(training_dataframe)

OSError: [Errno 116] Stale file handle: '../data/raw/metadata/UrbanSound8K.csv'

In [None]:
training_dataframe

## Scaling the dataset
A standard scaler is fitted on the training dataset and saved so that it can later be applied to the test sets.

In [None]:
scaled_df = f.scale_dataframe(training_dataframe, 
                              save_scaler=True)
scaled_df

In [None]:

f.save_dataframe(scaled_df, save_name="train_scaled")

## Test datasets

After building the training set, one test set is obtained from each of the remaining folds.
Each test set is saved in both unscaled and scaled form, so that the effect of scaling can be evaluated.
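
The fold split can be checked with a short sketch. It relies on UrbanSound8K's 10 predefined folds, and the variable names are illustrative only.

```python
# UrbanSound8K ships with 10 predefined folds.
all_folds = set(range(1, 11))
train_folds = {1, 2, 3, 4, 6}

# The test folds are whatever the training set does not use.
test_folds = sorted(all_folds - train_folds)
print(test_folds)  # [5, 7, 8, 9, 10]
```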

In [None]:
for fold in [5, 7, 8, 9, 10]:
    print(f"Processing fold {fold}")

    # Extract the features of a single test fold
    f = Features(save_path="../data/processed/initial",
                 save_name=f"test_{fold}_unscaled",
                 folds=[fold])

    # Save the unscaled version first
    df = f.get_dataframe()
    f.save_dataframe(df)

    # Reuse the scaler saved from the training set, then save the scaled version
    scaled = f.apply_scaling(df, "../models/scalers/scaler_training.pkl")
    f.save_dataframe(scaled, save_name=f"test_{fold}_scaled")