Automatic Training - Google Colab Errors #70
This seems to be a similar error to #65, but I'm having trouble reproducing it on my end (even when using the "oi mate" target word). Can you download a copy of the Colab notebook where you are getting this error and attach it to this issue?
automatic_model_training_simple.ipynb.zip I'm hoping this is what you're after; I have 0 experience with this Google Colab, sorry!
I'm having the same issue. I'm banging my head against the wall trying to get training to work with a GPU. I've tried every different combo of python, torch, torchvision, piper_phonemize, and CUDA that I can think of. On CUDA 11.7, I'm getting FFT errors when trying to train the model. CPU seems to work fine but would take ages.
I'm not sure if this warrants another issue; I ran into
Possibly related: home-assistant/home-assistant.io#29464
I copied the Colab to my local machine and (finally!) got it working with:

activate the new conda env
jupyter lab automatic_model_training_simple.ipynb --ip=0.0.0.0

Then I was able to get the lab working and through all the steps on my local machine. I'm not sure what all the different versions are in the Google Colab, but I think there's a mismatch somewhere.

Based on one of the comments in the notebook: do I understand correctly that the conversion to .tflite is expected to fail if run in Jupyter, and that the conversion should be done on the command line instead? I just ran all three train commands from the command line once I had an edited yaml file, and that worked for me. I'll try to get this working in the Google Colab later on if I get some time, but I need a break from mucking around with this at the moment haha.

All that being said, I wonder if there's a way to package this up with pyinstaller to make things easier for people. It'd be handy to have a sort of black-box exe that you know isn't going to just stop working with updates etc.

Lastly, I wonder if we can share trained models centrally somewhere; I'm sure a lot of people are using similar wake-words, and it'd be cool to just be able to download a community-provided model if someone doesn't want to go through the trouble of training.

EDIT: I spoke too soon; conversion to tflite failed. I've updated my comment to go back to torch 1.13.1 and set up the environment with matching versions. This is from the Google Colab linked from the Home Assistant site:
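The local workflow described above can be sketched roughly as follows. This is a hypothetical setup, not the project's official instructions: the environment name is a placeholder, and the only version pin taken from the thread is torch 1.13.1 (paired with its matching torchvision 0.14.1).

```shell
# Hypothetical local setup; env name and pins are illustrative, not official.
conda create -n oww-train python=3.10 -y
conda activate oww-train

# Matching torch/torchvision versions, per the comment above:
pip install torch==1.13.1 torchvision==0.14.1
pip install jupyterlab

# Serve the training notebook on all interfaces, as described above:
jupyter lab automatic_model_training_simple.ipynb --ip=0.0.0.0
```

The key point from the comment is that the torch and torchvision versions must match; mixing versions is one way to hit the `undefined symbol` warning quoted later in this issue.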
Sorry to spam here, but there do still seem to be a couple of issues, one of which I think you might be able to fix by adjusting the setup.py. I think I have this nailed down finally after probably too many hours, but I'm going to follow up with a fresh build once again to be sure and will report back.
reference:
@Ricetato, @ThreepE0 thank you for providing such detailed information! I've fixed a few bugs in the training code and made adjustments to the dependency installation sections of both Google Colab notebooks for model training. In particular, I've adjusted dependencies to only require `tensorflow-cpu`.

From my tests both training notebooks are working correctly now, but please let me know if you are still experiencing issues.

As for doing this in a local environment, this is certainly still possible, but to @ThreepE0's point, dependencies for a GPU-enabled environment are very complex and system-dependent, which makes standardized testing difficult. In the future I may try to release a Docker image that is preconfigured for training to help reduce these types of issues.
Thanks @dscripka, I've managed to run your Colab workbook from a couple of different machines and it seems to be working without issues for me personally.
I haven't had a chance to try the hosted Colab again yet, but I wanted to get my thoughts out here for a sec if it's ok: I'm really happy that the Colab is working, and that's really the most important part I think: people can train their own models and use openWakeWord. The things that are kinda nagging at me:
Now I'll be the first to admit that I'm just plain not smart enough to get an alternative to onnx-tf in place and working. I'm kinda ok at troubleshooting, mostly because I cause so much trouble for myself in the technology domain haha. I know there's a new repo that popped up, "onnx2tf," but I don't know how to describe the onnx model for it to do the transformation. I just want to throw out there that I think this is the primary issue preventing a Colab being created for people to train in easily using GPU resources.

All that being said, I wonder if it makes sense to separate the training and the conversion to onnx. The dependency on a deprecated onnx-tf seems precarious at best and, in my humble opinion, highlights that it might be best to separate out this function.

Sorry for the wall of text. Please let me know what you think.
I found a new way of converting onnx to tflite that also works with newer versions of the onnx format:
The string behind
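The exact commands from this comment were lost in extraction. As a rough, hypothetical sketch of what basic onnx2tf usage looks like (the file and folder names are placeholders, and the flags reflect onnx2tf's CLI as I understand it, not this commenter's exact invocation):

```shell
# Install the converter (replaces the deprecated onnx-tf workflow):
pip install onnx2tf

# Convert an ONNX model; onnx2tf writes a TensorFlow SavedModel plus
# float32/float16 .tflite files into the output folder:
onnx2tf -i my_custom_model/oi_mate.onnx -o converted_model
```

During conversion, onnx2tf prints the graph it is processing, which is one place to read off the model's input and output op names if you don't already know them.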
@nj-banks MVP right here, thank you so much! Going to give this a try today. Could you describe what you mean by "found in the output of onnx2tf"? I have a folder with onnx files and I'm not sure if I recall what their input_op_name would be, for example.
@ThreepE0 If you run
I owe you several coffees. Thank you very much. Seriously though, if you have a link for donations, send it over.
The Colab notebook should actually work fine with a GPU as well, as all of the libraries should be compatible with a GPU-enabled Colab notebook. The same should work for a local deployment, assuming that the appropriate CUDA libraries are installed (even when using an Nvidia GPU).

With the excellent suggestion of @nj-banks, it's also possible to move away from the no-longer-maintained `onnx-tf`.

Training a model directly in Tensorflow is possible, but it would require converting the training code, as it is currently written to use Pytorch. However, with easier conversion of ONNX files this seems like a lower priority.

Something like a packaged .exe would be the simplest way to enable anyone to train models easily, I agree. The challenge here (in my experience) is that designing and maintaining an application like this requires substantial (and continuous) effort, and it isn't something I can focus on currently. An intermediate solution could be Docker images, for those users who are familiar with Docker.
This is quite useful @nj-banks, thank you very much for sharing!
@dscripka ok, I must have made an incorrect assumption based on your fix: "I've adjusted dependencies to only require tensorflow-cpu". Is it the case that, because tensorflow isn't being used to train, cuda can still be used?
Ah, I see how that is confusing. And you are correct: since Tensorflow is only needed for the tflite conversion, it does not need to use the GPU, and the dependencies can be simplified by installing the `tensorflow-cpu` package.

As long as the environment is set up for PyTorch to use the GPU properly, training and example generation can happen on the GPU.
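A quick way to confirm the split described above (PyTorch on GPU, TensorFlow CPU-only) is a pair of one-liners. This assumes torch and tensorflow-cpu are already installed in the active environment:

```shell
# Should print True if PyTorch can see the GPU for training/example generation:
python -c "import torch; print(torch.cuda.is_available())"

# An empty list here is expected and harmless with tensorflow-cpu,
# since TF is only used for the tflite export:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```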
Hi,
I just tried giving the Google Colab workbook a few goes with variations on the target word "oi mate" (I tried "oy mayte" and others as well) and kept getting the same errors. I've popped my output below; I'm hoping this isn't user error. It looks like something isn't going right in the "3. Train the Model" script though:
At the very top:
```
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'
If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
torchvision is not available - cannot save figures
```
Near the bottom there are some errors after "Generating negative clips for training", before it fails to find "my_custom_model/oi_mate.onnx".
Cheers
Outputs:
1. Test Example Training Clip Generation.txt
2. Download Data.txt
3. Train the Model.txt