
Automatic Training - Google Colab Errors #70

Closed
Ricetato opened this issue Oct 22, 2023 · 18 comments

Comments

@Ricetato

Ricetato commented Oct 22, 2023

Hi,

I just tried giving the Google Colab notebook a few goes with variations on the target word "oi mate" (tried "oy mayte" and others as well) and kept getting the same errors. I've popped my output below; I'm hoping this isn't user error. It looks like something isn't going right in the "3. Train the Model" script though:

At the very top:
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
torchvision is not available - cannot save figures

Near the bottom there are some errors after "Generating negative clips for training", before it fails to find "my_custom_model/oi_mate.onnx".

Cheers

Outputs:
1. Test Example Training Clip Generation.txt
2. Download Data.txt
3. Train the Model.txt

@dscripka
Owner

This seems to be a similar error to #65, but I'm having trouble reproducing it on my end (even when using the "oi mate" target word).

Can you download a copy of the colab notebook where you are getting this error and attach it to this issue?

@Ricetato
Author

automatic_model_training_simple.ipynb.zip

I'm hoping this is what you're after; I have zero experience with Google Colab, sorry!

@ThreepE0

Having the same issue. I'm banging my head against the wall trying to get training to work with a GPU. I've tried every combo of Python, torch, torchvision, piper_phonemize, and CUDA that I can think of. On CUDA 11.7, I'm getting FFT errors when trying to train the model. CPU seems to work fine but would take ages.

@AnkushMalaker

I'm not sure if this warrants another issue, but I ran into
NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
when running !pip install datasets. Adding a code block at the start of the notebook with the following was enough to fix it. Leaving it here for anyone who runs into the same:

import locale

# Colab sometimes reports ANSI_X3.4-1968 as the preferred encoding;
# monkey-patch locale.getpreferredencoding so everything sees UTF-8.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
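
A quick sanity check that the patch took effect (my addition, not part of the original workaround):

import locale
print(locale.getpreferredencoding())  # should print "UTF-8" after the patch above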

@synesthesiam

Possibly related: home-assistant/home-assistant.io#29464

@ThreepE0

ThreepE0 commented Oct 24, 2023

Copied the Colab to my local machine and got it working (finally!) with:

conda create -n "openwakeword_train" python pytorch==1.13.1 torchvision torchaudio pytorch-cuda=11.7 tensorflow jupyter pybind11 -c pytorch -c nvidia

Then:

  • activate the new conda env
  • install rhasspy/espeak-ng from GitHub
  • git clone piper-phonemize, then cd into it and run python setup.py install
  • launch with jupyter lab automatic_model_training_simple.ipynb --ip=0.0.0.0

That let me get the lab working through all the steps on my local machine. I'm not sure what all the different versions are in the Google Colab, but I think there's a mismatch somewhere.

Based on one of the comments in the notebook: do I understand correctly that the conversion to .tflite is expected to fail if run in Jupyter, and that the conversion should be done on the command line instead? I just ran all three train commands from the command line once I had an edited yaml file, and that worked for me.

I'll try to get this working in the Google Colab later on if I get some time, but I need a break from mucking around with this at the moment, haha.

All that being said, I wonder if there's a way to package this up with pyinstaller to make things easier for people. It'd be handy to have a sort of black-box exe that you know isn't going to just stop working with updates, etc. Lastly, I wonder if we can share trained models centrally somewhere; I'm sure a lot of people are using similar wake words, and it'd be cool to just be able to download a community-provided model if someone doesn't want to go through the trouble of training.

EDIT: I spoke too soon; conversion to tflite failed. I've updated my comment above to go back to torch 1.13.1 and set up the environment with matching versions.

For reference, these are the versions from the Google Colab linked from the Home Assistant site:
cuda version: 11.8
torch version: 2.1.0+cu118
torchaudio version: 2.1.0+cu118
tensorflow version: 2.14.0
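
A quick way to print the same versions in your own environment for comparison (a minimal sketch, not from the notebook; assumes torch, torchaudio, and tensorflow are importable):

import torch, torchaudio
import tensorflow as tf

print("cuda version:", torch.version.cuda)
print("torch version:", torch.__version__)
print("torchaudio version:", torchaudio.__version__)
print("tensorflow version:", tf.__version__)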

@ThreepE0

Sorry to spam here, but there still seem to be a couple of issues, one of which I think you might be able to fix by adjusting setup.py. I think I've finally nailed this down after probably too many hours, but I'm going to follow up with a fresh build once again to be sure and will report back.

  • The first issue is that setup.py requires 'tensorflow==2.8.1'; however, pytorch needs to be 1.13.1, which in turn needs CUDA 11.7, and tensorflow 2.8.1 is not compatible with that. I've had to comment the 2.8.1 requirement out.

  • The second issue was a bit harder to track down:

    • libcufft==10.9.0.58 needs to be installed. There is a known issue with cuFFT in the CUDA 11.7 PyTorch build, so the cuFFT from CUDA 11.8 needs to be installed instead. PyTorch 1.13.1 didn't ship the fixed cuFFT binary because it is twice as large as the previous one. (A quick repro is sketched after the reference below.)

reference:
pytorch/pytorch#88038
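
For anyone unsure whether their install is affected, here is a minimal repro sketch (my assumption is that any CUDA FFT op exercises cuFFT; the exact op openWakeWord's training hits may differ). On a broken install this raises a cuFFT error instead of returning a spectrogram:

import torch

# Any CUDA FFT op goes through cuFFT; on the broken CUDA 11.7 build
# this fails with a cuFFT error rather than computing the STFT.
x = torch.randn(1, 16000, device="cuda")
spec = torch.stft(x, n_fft=512, return_complex=True)
print(spec.shape)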

@dscripka
Owner

@Ricetato, @ThreepE0 thank you for providing such detailed information! I've fixed a few bugs in the training code and made adjustments to the dependency installation sections of both Google Colab notebooks for model training. In particular, I've adjusted the dependencies to only require tensorflow-cpu, as tensorflow is only needed for tflite model conversion, and allowed for a wider range of pytorch versions.

From my tests both training notebooks are working correctly now, but please let me know if you are still experiencing issues.

As for doing this in a local environment, this is certainly still possible, but to @ThreepE0's point, dependencies for a GPU-enabled environment are very complex and system-dependent, which makes standardized testing difficult. In the future I may try to release a Docker image that is preconfigured for training to help reduce these types of issues.

@Ricetato
Author

Thanks @dscripka, I've managed to run your Colab notebook from a couple of different machines and it seems to be working without issues for me personally.

@ThreepE0

ThreepE0 commented Oct 27, 2023

I haven't had a chance to try the hosted Colab again yet, but I wanted to get my thoughts out here for a sec if that's OK:

I'm really happy that the Colab is working, and that's really the most important part, I think: people can train their own models and use openWakeWord.

The things that are kinda nagging at me:

  • With CUDA, training goes SO much quicker, and more efficiently.
  • I think it should be mentioned, and probably highlighted, that the Colab is intended for CPU use; I don't know about others, but I saw the CPU message and the speed of training and took those things as errors/problems to be fixed.
  • The only real thing turning the dependencies into a definitely-impossible jenga puzzle is onnx-tf, which is no longer being maintained. I've found a few configurations that leave me with a valid onnx file, but an error while creating the tflite.
  • I'm not sure why Home Assistant only supports tflite and not onnx for openWakeWord. There's probably a reason, but I'd like to understand. Without knowing why, I'm left thinking it'd be great if I could just use the onnx file that is easy to generate quickly.
  • Would it be possible to train directly to a tensorflow or tflite model?
  • I'd like to provide a packaged exe for people to train with here on GitHub, maybe a couple (one for CPU, one for CUDA xx, another for CUDA xx+1, etc.). Do you think that's a reasonable thing to do, and a decent idea? I just know that in my professional career, handing folks an exe that "just works" in most cases makes people who don't want to tinker (I'll never understand them, lol) smile.

Now I'll be the first to admit that I'm just plain not smart enough to get an alternative to onnx-tf in place and working. I'm kinda OK at troubleshooting, mostly because I cause so much trouble for myself in the technology domain, haha. I know there's a new repo that popped up, onnx2tf, but I don't know how to describe the onnx model for it to do the transformation. I just want to throw out there that I think this is the primary issue preventing a Colab from being created for people to train in easily using GPU resources.

All that being said, I wonder if it makes sense to separate the training from the model conversion. The dependency on the deprecated onnx-tf seems precarious at best and, in my humble opinion, highlights that it might be best to split this function out.

Sorry for the wall of text. Please let me know what you think.

@nj-banks

nj-banks commented Oct 27, 2023

I found a new way of converting onnx to tflite that also works with newer versions of the onnx format:

python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install onnx==1.14.1 onnxruntime==1.16.0 onnxsim==0.4.33 simple_onnx_processing_tools psutil==5.9.5 ml_dtypes==0.2.0 tensorflow==2.14.0 onnx2tf
pip3 install nvidia-pyindex
pip3 install h5py==3.7.0
pip3 install onnx-graphsurgeon
for file in *.onnx; do onnx2tf -i ${file} -kat onnx____Flatten_0; done
rm saved_model/*_float16.tflite

The string after -kat must match the input_op_name of the onnx model exactly. It can be found in the output of onnx2tf. There is no warning if it doesn't match, but the tflite file will fail to load in openWakeWord.
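
If you'd rather not run onnx2tf just to discover the name, the input names can also be read directly from the model with the onnx package (a small sketch; the path is the example file name from this thread):

import onnx

model = onnx.load("my_custom_model/oi_mate.onnx")  # example path from this thread
# The graph inputs carry the name that onnx2tf expects after -kat
print([inp.name for inp in model.graph.input])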

@ThreepE0

@nj-banks MVP right here, thank you so much! Going to give this a try today.

Could you describe what you mean by "found in the output of onnx2tf"? I have a folder with onnx files and I'm not sure I recall what their input_op_name would be, for example.

@nj-banks

@ThreepE0 If you run onnx2tf -i file.onnx you should see this: INFO: input_op_name: onnx____Flatten_0

[screenshot]

@ThreepE0

ThreepE0 commented Oct 27, 2023


I owe you several coffees. Thank you very much. Seriously though, if you have a donation link, send it over.

@dscripka
Owner

(Replying to @ThreepE0's comment above.)

The Colab notebook should actually work fine with a GPU as well, as all of the libraries should be compatible with a GPU-enabled Colab runtime. The same should work for a local deployment, assuming that the appropriate CUDA libraries are installed and an Nvidia GPU is being used.

With the excellent suggestion of @nj-banks, it's also possible to move away from the no-longer-maintained onnx-tf library, which enables more current Tensorflow versions as well.

Training a model directly in Tensorflow is possible, but it would require converting the training code, which is currently written in Pytorch. However, with easier conversion of ONNX files this seems like a lower priority.

Something like a packaged .exe would be the simplest way to enable anyone to train models easily, I agree. The challenge here (in my experience) is that designing and maintaining an application like this requires substantial and continuous effort, and it isn't something I can focus on currently. An intermediate solution could be Docker images, for those users who are familiar with Docker.

@dscripka
Owner

(Replying to @nj-banks' comment above.)

This is quite useful @nj-banks, thank you very much for sharing!

@ThreepE0

@dscripka OK, I must have made an incorrect assumption based on your fix ("I've adjusted dependencies to only require tensorflow-cpu").

Is it the case that, because tensorflow isn't being used to train, CUDA can still be used?

@dscripka
Owner

Ah, I see how that is confusing. And you are correct: since Tensorflow is only needed for the tflite conversion, it does not need to use the GPU, and the dependencies can be simplified by installing the tensorflow-cpu package.

As long as the environment is set up for PyTorch to use the GPU properly, training and example generation can happen on the GPU.
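
A quick way to confirm that PyTorch can see the GPU before starting a training run (a minimal check, not from the notebook itself):

import torch

# True means training and example generation can run on the GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))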

dscripka closed this as completed Nov 6, 2023