<a href="https://colab.research.google.com/github/frankx1/deepspeech_project/blob/main/DeepSpeech_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using the pre-trained model**
- In this section, we use the pre-trained model (v0.9.2) published by the DeepSpeech project.
- We test the model on two separate groups of audio files.
- The first group contains audio files without noise.
- The second group contains audio with a specific type of background noise, which was processed with pyaudio.

In [39]:
# Install the inference package plus helpers for downloading files and cloning repositories
!pip install deepspeech
!pip install wget
!pip install gitpython
from deepspeech import Model
import numpy as np
import wget, wave, csv
import os
import git



In [40]:
wget.download('https://github.com/mozilla/DeepSpeech/releases/download/v0.9.2/deepspeech-0.9.2-models.pbmm')
wget.download('https://github.com/mozilla/DeepSpeech/releases/download/v0.9.2/deepspeech-0.9.2-models.scorer')

'deepspeech-0.9.2-models.scorer'

In [41]:
model_file_path = 'deepspeech-0.9.2-models.pbmm'
dsModel = Model(model_file_path)
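
The scorer file downloaded above is not attached to the model in this notebook, so decoding runs without the external language model. The sketch below shows one way it could be attached, assuming the `enableExternalScorer` method of the 0.9.x `Model` class; `attach_scorer` is a hypothetical helper name, not part of DeepSpeech.

```python
import os


def attach_scorer(model, scorer_path):
    """Enable an external scorer on a DeepSpeech model if the file exists.

    `attach_scorer` is a hypothetical helper. `model` is assumed to expose
    enableExternalScorer(path), as the deepspeech 0.9.x Model class does.
    Returns True when the scorer was enabled, False when the file is missing.
    """
    if not os.path.isfile(scorer_path):
        return False
    model.enableExternalScorer(scorer_path)
    return True


# In this notebook the call would look like:
# attach_scorer(dsModel, 'deepspeech-0.9.2-models.scorer')
```

Whether the scorer improves the results reported here is not something this notebook measures; the helper only shows where the call would go.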

In [42]:
git.Repo.clone_from('https://github.com/frankx1/deepspeech_project','audio_no_noise', branch = 'audiononoise')

<git.repo.base.Repo '/content/audio_no_noise/.git'>

In [43]:
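# Collect the clip paths in sorted order so they line up with the rows of the CSV
audio = sorted('audio_no_noise/' + x for x in os.listdir('audio_no_noise'))
deepspeech_transcript = []
for filename in audio:
    # Skip the CSV and any other non-audio entries
    if not filename.endswith('.wav'):
        continue
    # Read the whole clip as raw bytes and close the file afterwards
    with wave.open(filename, 'r') as w:
        buffer = w.readframes(w.getnframes())
    # Interpret the bytes as 16-bit PCM samples
    data = np.frombuffer(buffer, dtype=np.int16)
    # Run speech-to-text on the sample array
    text = dsModel.stt(data)
    deepspeech_transcript.append(text)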
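
`np.frombuffer(buffer, dtype=np.int16)` interprets the bytes as 16-bit samples regardless of what the file actually contains, and the released English model expects 16 kHz mono audio. A minimal sketch of a format check that could run before transcription, using only the standard library. The name `check_wav_format` and its default arguments are illustrative assumptions; in the notebook the expected rate could instead be read from `dsModel.sampleRate()`.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1, expected_width=2):
    """Return a list of human-readable problems with a WAV file's format.

    An empty list means the file has the expected sample rate, channel
    count and sample width (2 bytes = 16-bit). The defaults are assumptions
    that match 16 kHz mono 16-bit audio.
    """
    problems = []
    with wave.open(path, 'rb') as w:
        if w.getframerate() != expected_rate:
            problems.append(
                f'sample rate is {w.getframerate()} Hz, expected {expected_rate} Hz')
        if w.getnchannels() != expected_channels:
            problems.append(
                f'{w.getnchannels()} channel(s), expected {expected_channels}')
        if w.getsampwidth() != expected_width:
            problems.append(
                f'sample width is {w.getsampwidth()} byte(s), expected {expected_width}')
    return problems
```

A clip that returns a non-empty list would need resampling or conversion before its samples are passed to `stt`.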

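- The printed results show that the pre-trained model is highly accurate on audio without noise: seven of the ten clips match the reference exactly, and 1.wav differs only in "trial" versus "trail". The one clear weakness is that it fails on the two single-word clips ("six" and "firefox").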

In [44]:
with open('audio_no_noise/audio_no_noise.csv', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    # Print each reference transcript with the model output indented below it
    for row, predicted in zip(reader, deepspeech_transcript):
        print(row['FILENAME'], row['TRANSCRIPT'])
        print('     ', predicted)

0.wav it just could be a good night
      it just could be a good night
1.wav author of the danger trial
      author of the danger trail
2.wav six
      al
3.wav firefox
      mor
4.wav it depends on the decisions of the member states
      it depends on the decisions of the member states
5.wav on the night it was not enough
      on the night it was not enough
6.wav this is very good news
      this is very good news
7.wav i was awful
      i was awful
8.wav but we finished the show
      but we finished the show
9.wav she has always been very friendly
      she has always been very friendly
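
One common way to quantify the comparison above is the word error rate (WER): the word-level edit distance between the reference and the model output, divided by the number of reference words. A small standard-library sketch follows; `word_error_rate` is a hypothetical helper, not a DeepSpeech function.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError('reference transcript is empty')
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra hypothesis word inserted
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate('author of the danger trial', 'author of the danger trail'))  # 0.2
print(word_error_rate('six', 'al'))  # 1.0
```

Averaging this value over all clips in a group would give a single number per group instead of a visual comparison.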


- In the following cell, we run the pre-trained model on the audio files with background noise. As the results show, the predicted text matches the reference transcript in only a few words, which indicates that the model does not cope well with this background noise.

In [45]:
git.Repo.clone_from('https://github.com/frankx1/deepspeech_project', 'audio_with_noise', branch='audiowithnoise')
# Collect the noisy clip paths in sorted order so they line up with the CSV rows
audiowithnoise = sorted('audio_with_noise/' + x for x in os.listdir('audio_with_noise'))
deepspeech_transcript = []
for filename in audiowithnoise:
    # Skip the CSV and any other non-audio entries
    if not filename.endswith('.wav'):
        continue
    # Read the whole clip as raw bytes and close the file afterwards
    with wave.open(filename, 'r') as w:
        buffer = w.readframes(w.getnframes())
    # Interpret the bytes as 16-bit PCM samples
    data = np.frombuffer(buffer, dtype=np.int16)
    text = dsModel.stt(data)
    deepspeech_transcript.append(text)

with open('audio_with_noise/audio_with_noise.csv', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    # Print each reference transcript with the model output indented below it
    for row, predicted in zip(reader, deepspeech_transcript):
        print(row['FILENAME'], row['TRANSCRIPT'])
        print('     ', predicted)

0.wav hello today i have a phili cheese steak as breakfast
      helo t ey i am have a facis take as referenc
1.wav i always prefer making everything easier
      i always prefer ma in everything easier
2.wav i always give up the hardest question in the exam
      always give up the heartes the pition e is that exact
3.wav the first time when i came to the us i preferred having macdonald
      the first iv we li cim thogh the yu s as the presheveyn mygona
4.wav today is a good day and i have some fries for dinner
      e is a go o a an as has some pret for tener
5.wav hello this is a good night and we watched some movies
      o thisiciia sac an a we was some morning
6.wav hello we went to the macy to see parade at time square
      el who wen o the ra i an sed of paris a pan toer
7.wav we went to the north pole and we didn't see any penguin
      we ran tha nort tough and evre dedency any ting wat
8.wav everytime when i watch nba the network is not stable
      o retid when i what hav

# **Training our own model**
- Set up the environment.
- Install the required modules and create the working directories.
- The checkpoint files have to be uploaded manually because they are too large to be stored remotely.


In [46]:
!pip install virtualenv

git.Repo.clone_from('https://github.com/mozilla/DeepSpeech','DeepSpeech')

Collecting virtualenv
  Downloading https://files.pythonhosted.org/packages/1a/c6/bb564f5eec616d241e85d741f00a07f5f50ea12989022ad49bc66876993c/virtualenv-20.2.2-py2.py3-none-any.whl (5.7MB)
Collecting distlib<1,>=0.3.1
  Downloading https://files.pythonhosted.org/packages/f5/0a/490fa011d699bb5a5f3a0cf57de82237f52a6db9d40f33c53b2736c9a1f9/distlib-0.3.1-py2.py3-none-any.whl (335kB)
Collecting appdirs<2,>=1.4.3
  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Installing collected packages: distlib, appdirs, virtualenv
Successfully installed appdirs-1.4.4 distlib-0.3.1 virtualenv-20.2.2


<git.repo.base.Repo '/content/DeepSpeech/.git'>

In [47]:
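!apt-get install python3-venv
!python3 -m venv /content/deepspeech-train-venv/
# Note: each `!` command runs in its own subshell, so this activation does not
# carry over to later cells; the installs below go into Colab's system Python.
!source /content/deepspeech-train-venv/bin/activate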

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  python-pip-whl python3.6-venv
The following NEW packages will be installed:
  python-pip-whl python3-venv python3.6-venv
0 upgraded, 3 newly installed, 0 to remove and 14 not upgraded.
Need to get 1,660 kB of archives.
After this operation, 1,902 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python-pip-whl all 9.0.1-2.3~ubuntu1.18.04.4 [1,653 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python3.6-venv amd64 3.6.9-1~18.04ubuntu1.3 [6,180 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 python3-venv amd64 3.6.7-1~18.04 [1,208 B]
Fetched 1,660 kB in 1s (2,084 kB/s)
Selecting previously unselected package python-pip-whl.
(Reading database ... 144865 files and directories currently installed.)
Preparing to unpack .../python-pip-

In [48]:
os.chdir('DeepSpeech')
!pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
!pip3 install --upgrade -e .

Collecting pip==20.2.2
  Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB)

Obtaining file:///content/DeepSpeech
Collecting pyxdg
  Downloading pyxdg-0.27-py2.py3-none-any.whl (49 kB)
Collecting attrdict
  Downloading attrdict-2.0.1-py2.py3-none-any.whl (9.9 kB)
Collecting semver
  Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Collecting opuslib==2.0.0
  Downloading opuslib-2.0.0.tar.gz (7.3 kB)
Collecting optuna
  Downloading optuna-2.3.0.tar.gz (258 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sox
  Downloading sox-1.4.1-py2.py3-none-any.whl (39 kB)
Collecting numba==0.47.0
  Downloading numba-0.47.0-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB)
Collecting soundfile
  Downloading SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Col

In [49]:
!sudo apt-get install python3-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-dev is already the newest version (3.6.7-1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.


In [50]:
!pip3 uninstall tensorflow
!pip3 install 'tensorflow-gpu==1.15.4'

Found existing installation: tensorflow 1.15.4
Uninstalling tensorflow-1.15.4:
  Would remove:
    /usr/local/bin/estimator_ckpt_converter
    /usr/local/bin/freeze_graph
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.6/dist-packages/tensorflow-1.15.4.dist-info/*
    /usr/local/lib/python3.6/dist-packages/tensorflow/*
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/*
Proceed (y/n)? y
  Successfully uninstalled tensorflow-1.15.4
Collecting tensorflow-gpu==1.15.4
  Downloading tensorflow_gpu-1.15.4-cp36-cp36m-manylinux2010_x86_64.whl (411.0 MB)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.15.4


In [51]:
!make Dockerfile.train

sed \
	-e "s|#DEEPSPEECH_REPO#|https://github.com/mozilla/DeepSpeech.git|g" \
	-e "s|#DEEPSPEECH_SHA#|origin/master|g" \
	< Dockerfile.train.tmpl > Dockerfile.train


In [52]:
git.Repo.clone_from('https://github.com/frankx1/deepspeech_project','audioset', branch = 'audioset')

<git.repo.base.Repo '/content/DeepSpeech/audioset/.git'>

In [53]:
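# Create the checkpoint directory; exist_ok avoids an error when the cell is re-run
os.makedirs('fine_tuning_checkpoints', exist_ok=True)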

# **Fine-tuning the model**
- Since the model originally provided on the DeepSpeech website cannot recognize audio with noise, our goal is to make the model work with noise. To achieve this, we use the "fine-tuning" method described in the DeepSpeech documentation. The idea is to start from a pre-trained model and continue training it on our noisy audio.
- The model is from the DeepSpeech website (deepspeech-0.9.2). Its checkpoint files must already be in `fine_tuning_checkpoints` when training starts; the log of the run below reports that no checkpoint was found and that all variables were initialized, which means that run did not start from the pre-trained weights.
- Since our training set is fairly small, we also use it as test data to check whether the model has fit the training clips.
- Training runs for 200 epochs with a learning rate of 0.0001.
- After 200 epochs, the loss is low and the transcripts appear to match closely. However, this holds only for the clips the model was trained on; the model does not generalize to other audio with this type of noise. We believe this is due to our small sample size.
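
`DeepSpeech.py` reads its training, validation and test sets from CSV files with the columns `wav_filename`, `wav_filesize` and `transcript`. The sketch below shows how such a file could be produced from a directory of clips, assuming a source CSV with `FILENAME` and `TRANSCRIPT` columns like the ones used earlier in this notebook. The helper name `write_deepspeech_csv` and all paths are hypothetical, and this is not necessarily how `my-train.csv` was built.

```python
import csv
import os


def write_deepspeech_csv(clip_dir, source_csv, out_csv):
    """Write a CSV with the columns wav_filename, wav_filesize, transcript.

    Assumes `source_csv` has FILENAME and TRANSCRIPT columns and that each
    FILENAME names a file inside `clip_dir`. Returns the number of rows written.
    """
    rows = []
    with open(source_csv, encoding='utf-8', newline='') as f:
        for row in csv.DictReader(f):
            wav_path = os.path.abspath(os.path.join(clip_dir, row['FILENAME']))
            rows.append({
                'wav_filename': wav_path,
                'wav_filesize': os.path.getsize(wav_path),  # size in bytes
                # Assumption: transcripts are lower-cased to fit a lowercase alphabet
                'transcript': row['TRANSCRIPT'].strip().lower(),
            })
    with open(out_csv, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(
            f, fieldnames=['wav_filename', 'wav_filesize', 'transcript'])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing separate files for training, validation and testing from disjoint sets of clips would also make it possible to check generalization, which reusing the training clips as test data cannot do.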

In [54]:
!python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir fine_tuning_checkpoints --epochs 200 --train_files audioset/my-train.csv --dev_files audioset/my-dev.csv --test_files audioset/my-test.csv --learning_rate 0.0001

I1210 22:40:21.161862 140525104273280 utils.py:141] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:16 | Steps: 11 | Loss: 235.484944     
Epoch 0 | Validation | Elapsed Time: 0:00:02 | Steps: 11 | Loss: 159.860506 | Dataset: audioset/my-dev.csv
I Saved new best validating model with loss 159.860506 to: fine_tuning_checkpoints/best_dev-11
--------------------------------------------------------------------------------
Epoch 1 |   Training | Elapsed Time: 0:00:09 | Steps: 11 | Loss: 158.447895     
Epoch 1 | Validation | Elapsed Time: 0:00:02 | Steps: 11 | Loss: 155.490394 | Dataset: audioset/my-dev.csv
I Saved new best validating model with loss 155.490394 to: fine_tuning_checkpoints/best_dev-22
--------------------------------------------------------------------------------
Epoch 2 |   Training | Elapsed Time: 0:00:09 | 