<a href="https://colab.research.google.com/github/alicater/Random_Prompts/blob/main/Speech_To_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech to Text Solution
I found the original prompt on Upwork: the poster wanted a script that preprocesses audio files to improve recognition accuracy and then runs a Speech-to-Text solution using a pre-trained model. Here is the original prompt:

    Preprocess audio files (e.g., denoise, normalize) to improve recognition accuracy.
    Implement a Speech-to-Text solution using pre-trained models (e.g., Wav2Vec2, DeepSpeech, or similar).
    Test and evaluate the model’s performance with provided audio samples.
    Deliver a simple script or API that accepts audio files and returns text output.
    Provide basic documentation for usage and future scalability.

I found this prompt interesting, so here's my go at it.

In [None]:
# Install the necessary libraries; Colab should already have most of these, but this makes sure
!pip install torch transformers librosa soundfile noisereduce

# Colab helper used later to upload audio files
from google.colab import files



In [None]:
# import all the libraries
import librosa
import soundfile as sf
import noisereduce as nr
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import numpy as np
from IPython.display import display, Audio
import os

## Documentation For Each Library
**Librosa:** https://librosa.org/doc/latest/index.html

**Soundfile:** https://python-soundfile.readthedocs.io/en/0.11.0/

**Noisereduce:** https://pypi.org/project/noisereduce/

**Torch:** https://pytorch.org/docs/stable/index.html

**Transformers:** https://pypi.org/project/transformers/

**Numpy:** https://numpy.org/doc/stable/user/index.html#user

**IPython:** https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

**OS:** https://docs.python.org/3/library/os.html

In [None]:
print("----- Loading Wav2Vec2 Model -----")
MODEL_NAME = "facebook/wav2vec2-large-960h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)
print("----- Model Loaded -----")

----- Loading Wav2Vec2 Model -----


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


----- Model Loaded -----


In [None]:
# Create Functions

# function to preprocess audio file
def preprocess_audio(input_file):
  print("----- Preprocessing Audio -----")
  y, sr = librosa.load(input_file, sr=16000) # resample to 16kHz for the model

  # denoise:
  denoise = nr.reduce_noise(y=y, sr=sr)

  # normalize:
  normalize = librosa.util.normalize(denoise)

  # save as new file:
  temp_wav = "temp.wav"
  sf.write(temp_wav, normalize, sr)
  print("----- Preprocessing Complete -----")

  return temp_wav

# function to get speech to text; returns (path to cleaned audio, transcription)
def speech_to_text(input_file):
  clean_audio = preprocess_audio(input_file)

  print("----- Transcribing Audio -----")

  speech, sr = librosa.load(clean_audio, sr=16000)
  input_values = processor(speech, return_tensors="pt", sampling_rate=16000).input_values

  # inference
  with torch.no_grad():
    logits = model(input_values).logits

  # decode logits to text
  predicted_ids = torch.argmax(logits, dim=-1)
  transcription = processor.decode(predicted_ids[0])
  print("----- Transcription Complete -----")

  return clean_audio, transcription

In [None]:
print("----- Upload Audio File -----")
uploaded = files.upload()

----- Upload Audio File -----


Saving test_recording.m4a to test_recording (1).m4a


In [None]:
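audio_file = list(uploaded.keys())[0]  # use the first uploaded file
# note: speech_to_text returns a (cleaned file path, transcription) tuple
transcription = speech_to_text(audio_file)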

----- Preprocessing Audio -----


  y, sr = librosa.load(input_file, sr=16000) # resample to 16kHz for the model
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


----- Preprocessing Complete -----
----- Transcribing Audio -----
----- Transcription Complete -----


In [None]:
print("----- Transcription -----")
print(transcription)

----- Transcription -----
('temp.wav', "THIS IS A TEST WITH SOME BACKGROUNDS THERE'S GOING TO BE MORE BACKGROUND NOISE AND IT'S GETTING LOUDER LET'S SEE HOW THIS MODEL DOES")


In [None]:
clean_audio = preprocess_audio(audio_file)
print("----- Playing Audio -----")
display(Audio(clean_audio))

print("----- Original Audio -----")
display(Audio(audio_file))

----- Preprocessing Audio -----


  y, sr = librosa.load(input_file, sr=16000) # resample to 16kHz for the model
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


----- Preprocessing Complete -----
----- Playing Audio -----


----- Original Audio -----


# Thoughts
Completing this little project took a little over an hour. I had to look into which Python libraries could actually preprocess audio. I honestly don't know if all three audio-related libraries are necessary, but hey, it works. I already had a little experience with soundfile and librosa and figured those two would work well for the prompt. I had to do a little research (basic Google search hahah) to find noisereduce. I think all three paired well. Again, I'm sure there's a more streamlined way to handle this, but this combines what I already knew with something I was unfamiliar with.

Not sure if this script is exactly what the original poster on Upwork wanted, but it covers the basic outline of the task.

I used quite a few libraries, most of which I was already familiar with. That streamlined my work process and made it easier to get started and test code. Looking back, all of these may be overkill; however, I think the libraries also leave room to scale and modify the script.

# Scalability
This was one of the things that the original poster mentioned quite a bit (more was mentioned beyond the prompt I copy-pasted here). I find it interesting for this use case because I genuinely think you can do a lot with this simple idea. Two ideas come to mind immediately:

* Bulk uploads
* Finetuning for specific audio

A user could add another code snippet or adjust the existing code to handle batch file uploads in several ways: through the existing upload widget (which already accepts multiple files), by connecting Google Drive and pointing at a specific folder, or by connecting to other cloud storage services such as AWS S3 through their APIs.
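A rough sketch of the folder idea, assuming the audio files already sit in a local or mounted folder. The name `transcribe_folder`, the `AUDIO_EXTENSIONS` set, and the Drive path in the comment are all made up for illustration; `transcribe` is any function that takes a file path and returns text.

```python
from pathlib import Path

# Extensions treated as audio here; this set is an assumption, adjust as needed.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}


def transcribe_folder(folder, transcribe):
    """Run `transcribe` on every audio file directly inside `folder`.

    Returns a dict mapping file name -> transcript (None if that file failed).
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # skip sub-folders and non-audio files
        try:
            results[path.name] = transcribe(str(path))
        except Exception as err:  # one bad file should not stop the whole batch
            print(f"Skipping {path.name}: {err}")
            results[path.name] = None
    return results


# Hypothetical usage inside this notebook (the folder path is an example only):
# results = transcribe_folder("/content/drive/MyDrive/audio",
#                             lambda p: speech_to_text(p)[1])
```

The `[1]` in the usage comment is there because `speech_to_text` returns a `(cleaned file, transcription)` tuple. Passing the transcriber in as an argument also means the loop can be tried with a dummy function before the model is loaded.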

You could also finetune the model (honestly, if you want to use this for any serious work, I'd suggest it) to get even better results. In the examples I've tested, the model transcribes pretty well but sometimes misses a word or two, especially when the audio dips to a lower register and/or volume. You could use the `torchaudio` library to help accomplish this. The reasoning: realistically, this poster probably has a specific use case in mind, and the audio clips that go into the script will most likely be similar-ish to one another. The poster mentioned testing the script on provided audio examples, which I obviously don't have access to; however, those clips may be a solid starting place to help train.
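To tell whether finetuning actually helps, you need a number to compare before and after. Word error rate (WER) is the usual one for speech recognition: word-level edit distance divided by the number of words in a reference transcript. Below is a minimal sketch in plain Python; the function name is my own, and it assumes you have a hand-written reference transcript for each clip. It uppercases both sides because this model outputs uppercase text, and it does not strip punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.0 means the output matches the reference word for word; dropping one word from a four-word reference gives 0.25.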

You could also use this as groundwork for other things, like adding export options, adding a GUI, turning it into a Flask app, etc.
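As one example of an export option, here is a small sketch that writes transcripts to JSON and/or CSV using only the standard library. The function name and the `{file name: transcript}` input shape are assumptions for illustration, not something the notebook currently produces.

```python
import csv
import json


def export_transcripts(transcripts, json_path=None, csv_path=None):
    """Write a {file name: transcript} dict to JSON and/or CSV."""
    if json_path:
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(transcripts, f, indent=2, ensure_ascii=False)
    if csv_path:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["file", "transcript"])  # header row
            for name, text in transcripts.items():
                writer.writerow([name, text])
```

In Colab, the written file could then be downloaded with `files.download(...)` from the `google.colab` helper that is already imported at the top.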

# Documentation
To use the script, open the "Runtime" menu at the very top and select "Run all". Scroll down to the "Upload Audio File" code cell; once that cell is running, choose the file to upload from your computer. The file must be in an audio format (it doesn't need to be .wav as of now). The script continues running once you've selected your audio file and prints the transcript, along with players for the cleaned audio and the original audio. If you want to listen to either one, just press the play button.

# Using Colab
I specifically chose Colab to build this in case I want to come back and finetune the model. That's pretty much the only reason, that and it's easy to add my thoughts/conclusions and share the script. I think Colab also makes file management and debugging a bit easier.

# Summary
In general this was a fun little project to do while on winter break from university. I'll probably look at Upwork a bit more to see if there are other interesting prompts to complete (I didn't apply for the job because apparently you need "connection" credits to apply??). So here's a basic script that covers what the person asked for; I may or may not come back to clean it up a bit and add to it.