# 1. Processing Audio Files with TorchAudio

## Python Audio Manipulation Packages
### Torchaudio
<img src="https://github.com/pytorch/audio/raw/main/docs/source/_static/img/logo.png" height=120>

The aim of torchaudio is to apply PyTorch to the audio domain. 


# [Library]
- `torch` : deep learning library (PyTorch) that makes it easy to design and train models
- `torchaudio` : library for handling audio as torch tensors
- `pandas` : library for tabular data (dataframes, csv, excel)
- `matplotlib` : visualization library
- `IPython.display` : library that provides IPython display widgets
- `pathlib` : path-handling library that makes file paths easy to work with

In [None]:
import torch
import torchaudio     
import torchaudio.transforms as T
import torch.nn.functional as F

import pandas as pd
import matplotlib.pyplot as plt
import IPython.display as ipd
from pathlib import Path

In [None]:
%matplotlib inline

# [Load Audio File]
## Data : free-spoken-digit-dataset

An audio version of the MNIST dataset

https://github.com/Jakobovski/free-spoken-digit-dataset

<img src="https://drive.google.com/uc?id=1yEjXMS5-KTrYriyPhrSaJqeneBTStao_">

### Current status
- 6 speakers
- 3,000 recordings (50 of each digit per speaker)
- English pronunciations
### Organization
Files are named in the following format: `{digitLabel}_{speakerName}_{index}.wav`. Example: `7_jackson_32.wav`

### Usage
The test set officially consists of the first 10% of the recordings. Recordings numbered 0-4 (inclusive) are in the test set and recordings numbered 5-49 are in the training set.
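
The naming scheme and the split rule above can be sketched in a few lines. The helper names `parse_fsdd_name` and `is_test_recording` are hypothetical and not part of the dataset repository; the sketch assumes that speaker names contain no underscore.

```python
from pathlib import Path

def parse_fsdd_name(path):
    """Split '{digitLabel}_{speakerName}_{index}.wav' into its three fields."""
    digit, speaker, index = Path(path).stem.split("_")
    return int(digit), speaker, int(index)

def is_test_recording(path):
    """Recordings with index 0-4 are test data, 5-49 are training data."""
    return parse_fsdd_name(path)[2] <= 4

example = "7_jackson_32.wav"
digit, speaker, index = parse_fsdd_name(example)
split = "test" if is_test_recording(example) else "train"
print(digit, speaker, index, split)
```

The same two helpers could be mapped over a list of file paths to build a table with label, speaker and split columns.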

In [None]:
!git clone https://github.com/Jakobovski/free-spoken-digit-dataset

In [None]:
# check dir

In [None]:
# get list of audio file path



In [None]:
# show audios



In [None]:
# show audio sample name



# [Check Audio File]

## Checking a wav file with `ipd.Audio`

In [None]:
??ipd.Audio

In [None]:
# check audio sample with ipd.Audio



## Checking a wav file with `torchaudio`

### Audio Meta data

- `sample_rate` is the sampling rate of the audio
- `num_channels` is the number of channels
- `num_frames` is the number of frames per channel
- `bits_per_sample` is bit depth
- `encoding` is the sample coding format
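
To make these fields concrete, here is a minimal sketch that reads the same kind of metadata with Python's standard-library `wave` module rather than `torchaudio.info`. The in-memory file, the 440 Hz tone and all variable names are illustrative assumptions, not dataset values.

```python
import io
import math
import struct
import wave

sample_rate = 8000   # samples per second (assumed 8 kHz recording)
num_frames = 4000    # 0.5 s of audio

# Synthesize a 440 Hz tone as 16-bit signed integers.
samples = [
    int(0.5 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(num_frames)
]

# Write a mono, 16-bit PCM wav file into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)            # 2 bytes = 16 bits per sample
    writer.setframerate(sample_rate)
    writer.writeframes(struct.pack(f"<{num_frames}h", *samples))

# Read the header back and collect the metadata fields.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    metadata = {
        "sample_rate": reader.getframerate(),
        "num_channels": reader.getnchannels(),
        "num_frames": reader.getnframes(),
        "bits_per_sample": reader.getsampwidth() * 8,
    }

print(metadata)
```

The `wave` module only handles PCM data, so the `encoding` field has no direct counterpart in this sketch.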

In [None]:
??torchaudio.info

In [None]:
# check audio sample with torchaudio



# [Load audio file]

## Loading an audio file with torchaudio
### Loading audio data
To load audio data, you can use `torchaudio.load()`.

This function accepts a path-like object or file-like object as input.

The returned value is a tuple of waveform (`Tensor`) and sample rate (`int`).

By default, the resulting tensor object has `dtype=torch.float32` and its value range is `[-1.0, 1.0]`.

For the list of supported formats, please refer to the torchaudio documentation.
```
waveform, sample_rate = torchaudio.load(SAMPLE_WAV)
```
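
The `[-1.0, 1.0]` range can be illustrated with 16-bit PCM samples. The sketch below assumes the common convention of dividing by `2**15`; the exact scaling torchaudio applies may depend on the backend and file format, so treat the divisor as an assumption.

```python
import torch

# Hypothetical 16-bit PCM samples covering the full integer range.
pcm = torch.tensor([-32768, -16384, 0, 16384, 32767], dtype=torch.int16)

# Convert to float32 and scale into [-1.0, 1.0] (assumed convention: divide by 2**15).
waveform = pcm.to(torch.float32) / 32768.0

print(waveform.dtype, waveform)
```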

In [None]:
??torchaudio.load

In [None]:
# torchaudio.load



### Sampling Rate

In [None]:
# sampling rate


### Audio Tensor

In [None]:
# audio tensor


In [None]:
# audio tensor type


### Duration
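The duration in seconds is the number of frames divided by the sampling rate. A minimal sketch, using hypothetical values (a mono waveform of 4000 frames at 8000 Hz) rather than a real recording:

```python
def duration_seconds(num_frames, sample_rate):
    """Length of a recording in seconds."""
    return num_frames / sample_rate

# Hypothetical values: a waveform of shape (num_channels, num_frames).
waveform_shape = (1, 4000)
sample_rate = 8000

duration = duration_seconds(waveform_shape[-1], sample_rate)
print(f"{duration:.2f} s")
```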

In [None]:
# audio shape, sampling rate


In [None]:
# duration


With `ipd.Audio`, a variable of type `torch.Tensor` can be loaded and played back.

In [None]:
# display audio sample


## Visualizing a `torch.Tensor` waveform
An audio sample can be visualized with `matplotlib.pyplot`.

Python slicing lets you zoom in on a specific interval.

### sine wave
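One possible way to build a sine wave as a tensor is sketched below. The sampling rate, frequency and slice length are illustrative assumptions.

```python
import math
import torch

sample_rate = 8000    # samples per second (assumed)
frequency = 440.0     # tone frequency in Hz (assumed)
duration = 1.0        # length in seconds

# Time axis in seconds, one entry per sample.
t = torch.arange(int(sample_rate * duration)) / sample_rate
sine = torch.sin(2 * math.pi * frequency * t)

# One period spans sample_rate / frequency (about 18) samples,
# so the first 40 samples show roughly two periods.
zoomed = sine[:40]
print(sine.shape, zoomed.shape)
```

Passing `zoomed` to `plt.plot` draws the zoomed-in view, while plotting all of `sine` at this scale looks like a solid band.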

In [None]:
# sin wave


In [None]:
# plot


### audio sample 

In [None]:
# audio tensor


In [None]:
# plot


### slicing audio sample

In [None]:
# slicing audio sample


### Viewing the audio sample in different ways

In [None]:
start,dur = 1000,150

plt.figure(figsize=(10,2),dpi=100)
plt.plot(y[0][start:start+dur])
plt.show()

In [None]:
plt.figure(figsize=(10,2),dpi=100)
# `use_line_collection` was removed from recent matplotlib releases, so it is omitted here
plt.stem(y[0][start:start+dur])
plt.show()

In [None]:
plt.figure(figsize=(10,2),dpi=100)
plt.plot(y[0][start:start+dur])
plt.show()

plt.figure(figsize=(10,2),dpi=100)
plt.plot(y[0][start:start+dur])
# overlay the individual samples on the line plot (no `use_line_collection` argument)
plt.stem(y[0][start:start+dur])
plt.show()

# [Audio feature extraction] 
## Overview of audio features

<img src="https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png" width=600>

STFT (DFT)

<img src="https://upload.wikimedia.org/wikipedia/commons/6/61/FFT-Time-Frequency-View.png?20171130134719" width=600>

<img src="https://kr.mathworks.com/help/dsp/ref/stft_output.png" width=600>

### Raw Audio sample

In [None]:
# load audio sample



### Framing
- `n_fft` is the size of the FFT; it produces `n_fft // 2 + 1` frequency bins
- `win_length` is the window size (e.g. 25 ms = 200 samples at 8 kHz)
- `hop_length` is the hop (stride) between consecutive STFT windows
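
The three parameters can be tried out on a dummy signal. This is a minimal sketch that cuts frames with `Tensor.unfold` and no padding, which is an assumption of the sketch: `torch.stft` pads the signal by default (`center=True`) and therefore yields a different frame count. The 50% overlap and the random waveform are illustrative.

```python
import torch

sample_rate = 8000
win_length = sample_rate * 25 // 1000   # 25 ms -> 200 samples at 8 kHz
hop_length = win_length // 2            # 50% overlap (illustrative choice)
n_fft = 256                             # FFT size, at least win_length

y = torch.randn(4000)                   # stand-in for a 0.5 s waveform

# Cut the signal into overlapping frames: shape (num_frames, win_length).
frames = y.unfold(0, win_length, hop_length)

num_frames = (len(y) - win_length) // hop_length + 1
num_bins = n_fft // 2 + 1
print(frames.shape, num_frames, num_bins)
```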

In [None]:
??T.Spectrogram

In [None]:
# n_fft, win_length, hop_length



In [None]:
# framing



### Windowing
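- A technique that smooths the frame boundaries by applying a window function to each frame
- In a frame that is simply cut out, the signal starts and stops abruptly at both ends. Applying the Fourier transform to such a frame introduces spurious high-frequency components.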
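
The effect can be sketched on a single frame. The example below assumes a cosine that does not fit a whole number of periods into the frame (10.3 periods, an arbitrary choice) and compares its spectrum with and without a Hann window.

```python
import math
import torch

win_length = 200
n = torch.arange(win_length)

# A frame cut from a tone with 10.3 periods per frame, so its ends do not line up.
frame = torch.cos(2 * math.pi * 10.3 * n / win_length)

window = torch.hann_window(win_length)   # tapers towards zero at the frame edges
windowed = frame * window

spectrum_raw = torch.fft.rfft(frame).abs()
spectrum_win = torch.fft.rfft(windowed).abs()

# Far from the tone (bin 50), the windowed frame carries much less energy.
print(spectrum_raw[50].item(), spectrum_win[50].item())
```

The windowed spectrum keeps its peak near bin 10 but drops off much faster away from it, which is the reduction in unwanted high-frequency content described above.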

In [None]:
# hann_window



In [None]:
# show hann_window 


In [None]:
# sliced sample


In [None]:
# apply hann_window


### Spectrogram
- The x-axis is time, the y-axis is frequency, and the z-axis (color) is amplitude

In [None]:
# spectrogram converter


In [None]:
# spectrogram shape


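`torch.Size([1, 129, 20])`

1 : batch size or channel

129 : n_fft // 2 + 1 (n_fft = 256)

20 : number of time frames, roughly ceil(len(y) / hop_length); with the default `center=True` it is exactly `len(y) // hop_length + 1`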
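
The shape rule can be checked with `torch.stft` on a dummy signal. With `center=True` the frame count is `len(y) // hop_length + 1`, which matches the ceiling expression except when the length is an exact multiple of the hop. The signal length and parameters below are assumptions chosen for illustration, not the values used to produce the shape above.

```python
import torch

n_fft, win_length, hop_length = 256, 200, 100
y = torch.randn(1, 4000)   # (channel, time), stand-in for a real waveform

stft = torch.stft(
    y,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    window=torch.hann_window(win_length),
    center=True,             # pad both ends so frames are centered on their hop positions
    return_complex=True,
)
spec = stft.abs() ** 2       # power spectrogram

expected_frames = y.shape[-1] // hop_length + 1
print(spec.shape, expected_frames)
```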

In [None]:
# spectrogram shape check


In [None]:
# show spectrogram



In [None]:
# Spec by time frame


### AmplitudeToDB
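The decibel conversion itself is a logarithm: 10·log10 for power values and 20·log10 for amplitude values. A minimal sketch of that relation follows; the function names and `amin` floors are assumptions, and `T.AmplitudeToDB` offers further options such as `top_db` clipping that are left out here.

```python
import torch

def power_to_db(power, amin=1e-10):
    """10 * log10 of a power value, floored at amin to avoid log(0)."""
    return 10.0 * torch.log10(torch.clamp(power, min=amin))

def amplitude_to_db(amplitude, amin=1e-5):
    """20 * log10 of an amplitude value, floored at amin."""
    return 20.0 * torch.log10(torch.clamp(amplitude, min=amin))

power = torch.tensor([1.0, 10.0, 100.0])
amplitude = power.sqrt()

db_from_power = power_to_db(power)
db_from_amplitude = amplitude_to_db(amplitude)
print(db_from_power, db_from_amplitude)
```

Both calls describe the same signal, so they give the same decibel values: a tenfold increase in power is +10 dB.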

In [None]:
# db converter


In [None]:
# show db_spec


In [None]:
# Amplitude, DB relation



### Mel-Spectrogram
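The mel scale spaces frequencies evenly in perceived pitch rather than in Hz. The sketch below uses the HTK formula, which to my knowledge corresponds to torchaudio's default `mel_scale="htk"`; the helper names and the choice of 8 filters up to 4000 Hz are assumptions for illustration.

```python
import math

def hz_to_mel(hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * i) for i in range(1, n_mels + 1)]

centers = mel_center_frequencies(n_mels=8, f_min=0.0, f_max=4000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(c) for c in centers])
```

The gaps between neighboring centers grow with frequency, so low frequencies get finer resolution than high frequencies.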

In [None]:
??T.MelScale

In [None]:
# define n_mels, sample_rate, f_min, f_max, n_stft


In [None]:
# define mel Scale


In [None]:
# show mel Scale filterbank


In [None]:
# define torchaudio.functional.melscale_fbanks


In [None]:
# filterbank low, high frequency


In [None]:
# define mel_converter


In [None]:
# mel_spec


In [None]:
# mel_spec -> db_converter


### MFCC
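- MFCCs are obtained by taking the log of the mel-spectrogram and applying a discrete cosine transform (DCT), which decorrelates the features. The step is sometimes described as an inverse transform because the cepstrum is defined through an inverse Fourier transform.
- MFCCs discard a lot of information during their construction, so recent deep-learning models more often use the mel spectrum or the log-mel spectrum.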
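
The final step of the MFCC computation can be sketched as a matrix product with an orthonormal type-II DCT basis. The function name `dct_matrix`, the sizes and the random input are assumptions for illustration; `T.MFCC` builds its own basis internally.

```python
import math
import torch

def dct_matrix(n_mfcc, n_mels):
    """Orthonormal DCT-II basis of shape (n_mfcc, n_mels)."""
    n = torch.arange(n_mels, dtype=torch.float32)
    k = torch.arange(n_mfcc, dtype=torch.float32).unsqueeze(1)
    dct = torch.cos(math.pi / n_mels * (n + 0.5) * k)
    dct[0] *= 1.0 / math.sqrt(2.0)      # extra scaling for the constant basis row
    dct *= math.sqrt(2.0 / n_mels)      # makes the rows orthonormal
    return dct

n_mfcc, n_mels, n_frames = 13, 40, 21
log_mel = torch.randn(n_mels, n_frames)   # stand-in for a log-mel spectrogram
basis = dct_matrix(n_mfcc, n_mels)
mfcc = basis @ log_mel                    # (n_mfcc, n_frames)
print(mfcc.shape)
```

Keeping only the first `n_mfcc` rows of the basis is where information is discarded: 40 mel bands are summarized by 13 coefficients per frame.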

In [None]:
??T.MFCC

In [None]:
# define melkwargs, mfcc_converter


In [None]:
# mfcc converter


In [None]:
# show mfcc 
