Help Re-writing 04_fasttext_train_vectors.py for Windows 10 Compatibility #105
I was able to solve the problem by inferring the format of both vectors.txt and vocab.txt from the _get_shape and read_vocab functions in the 05_export.py script. Also, in the 05_export.py script I had to change line 15 to
Here is a complete Windows 10 compatible 04_fasttext_train_vectors.py script:
@dshefman1: thanks for this! Do you think it's possible to have one version of the script that works across all platforms? It would be great to incorporate your changes into the source here, so others don't run into the same problems. Would you feel like contributing a PR?
@dshefman1: could you please elaborate on in_dir? When I refer to the dir where the .s2v file is saved, I get a "SystemExit: 1" and "✘ Not a valid input directory". I'm using Jupyter.
I do think it is possible.
I'd be happy to contribute a PR.
@Z-e-e As per the docstring in the original script, it expects a "directory of preprocessed .s2v input files, will concatenate them (using a temporary file on disk) and will use fastText to train a word2vec model." You may be making the same mistake I made the first time I ran the script, which is to pass the file path of the .s2v file itself. Instead, you are expected to pass the path of the directory that contains it, which in my case is "./corpus_parsed3" but is probably named something else on your machine.
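For anyone hitting the same "Not a valid input directory" error, here is a minimal sketch of the behaviour the docstring describes: check that the argument is a directory, collect the .s2v files in it, and concatenate them into a temporary file. This is not the code from the original script; the function names `collect_s2v_files` and `concatenate` are hypothetical, and the UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path


def collect_s2v_files(in_dir):
    """Return the sorted .s2v files found directly inside in_dir."""
    path = Path(in_dir)
    # Passing the path of a single .s2v file instead of its folder fails here.
    if not path.is_dir():
        raise NotADirectoryError(f"Not a valid input directory: {path}")
    files = sorted(p for p in path.iterdir() if p.is_file() and p.suffix == ".s2v")
    if not files:
        raise FileNotFoundError(f"No .s2v files found in: {path}")
    return files


def concatenate(files):
    """Concatenate the given text files into one temporary file and return its path."""
    # delete=False so the file can be reopened by name later, which Windows
    # does not allow while the temporary file is still open.
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".s2v", delete=False, encoding="utf8", newline="\n"
    )
    with tmp:
        for file_path in files:
            with file_path.open("r", encoding="utf8") as f:
                for line in f:
                    # Make sure the last line of each file ends with a newline,
                    # so lines from two files are never glued together.
                    tmp.write(line if line.endswith("\n") else line + "\n")
    return Path(tmp.name)
```

With this sketch, `collect_s2v_files("./corpus_parsed3")` succeeds if that folder holds .s2v files, while passing the path of a single .s2v file raises `NotADirectoryError`. The caller is responsible for deleting the temporary file afterwards.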
Awesome. Reopening this to track the progress.
@svlandeg I'm struggling to finalize my code to resolve Issue 105. The problem is that I still don't quite understand the outputs of the CLI commands in 04_fasttext_train_vectors.py. For example, line 63 or line 76 appears to create a .bin file of the fastText model. However, the 05_export.py script does not take this file as an input. So I'm not sure whether the model.bin file exists only so that the "vocab.txt" file can be created on line 76. If so, I plan to make saving the model to disk optional rather than required. However, if the model.bin file serves another purpose related to the 05_export.py script, please let me know. Also, 05_export.py expects a "vectors.txt" file as an input, but the 04_fasttext_train_vectors.py script does not explicitly create a "vectors.txt" file in the way it explicitly creates vocab.txt on lines 75-77. I'm assuming that the "vectors.txt" file is created on lines 61-64. Am I correct? If not, could you help me understand at what point the "vectors.txt" file is created? This would help me ensure that I don't cause a new problem while fixing the Windows compatibility problem.
I'll have a detailed look tomorrow!
[EDIT: this was a reply to a question asking which changes exactly were made by @dshefman1. Though that question has now been deleted, it's still good to link the relevant PR to this Issue :-)]
Hi @dshefman1, I finally had some time to look into this in more detail. The main reason why the scripts are "incompatible" with Windows is because you should be able to build the binary file from the
However, the other option is also to just download the binary files from the unofficial release for Windows: https://github.com/xiamx/fastText/releases. That works for me on Windows just fine. I also didn't run into any trouble with the BOM etc. I think this may be the best option in the end, as the changes you started making to the script were quite big, and we need to make sure that it also keeps working on other platforms. What do you think?
Hi @svlandeg, it sounds like you are saying that the current version of 04_fasttext_train_vectors.py already meets your criteria "to have one version of the script that works across all platforms." If so, then it sounds like a fine solution to me.
I mean, ideally, we wouldn't need to depend on a platform-dependent binary file. But it looks like working around it gets quite involved, and I'm also not sure what all the different intermediate files are for. So I'm just wondering whether it's worth putting more time into this if we can use that unofficial Windows release instead?
@svlandeg I was able to answer my own earlier questions about the intermediate files, so they are not an issue for the code I wrote. The code I wrote provides all of the appropriate inputs for 05_export.py, but it does need to be tested by non-Windows users. However, if you think it is not worth putting more time into this, then I can get on board with that. I have plenty of high-priority work to keep me busy these days.
Oh, OK, I thought you were still having open issues! So basically the PR works for you as-is on Windows?
@svlandeg Yes, it does. Sorry, it was my mistake not to report back that the PR works as-is on Windows. Also, since the PR uses the pip-installed fastText library instead of a binary build of fastText, I would expect it to work with almost any operating system, but I don't have a non-Windows machine to test it on.
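For readers without access to the PR, the following is a rough sketch of what "using the pip-installed fastText library" can look like. It is not the PR's code. It assumes that vectors.txt uses the word2vec text layout (a header line with the entry count and vector dimension, then one entry per line) and that vocab.txt holds one key and its frequency per line; the function names, the `skipgram` setting, and the default `dim=300` are hypothetical choices for illustration.

```python
from pathlib import Path


def train_model(corpus_file, dim=300):
    """Train unsupervised vectors with the fastText Python bindings."""
    # Imported lazily so the export helper below works without fastText installed.
    import fasttext

    return fasttext.train_unsupervised(str(corpus_file), model="skipgram", dim=dim)


def export_model(model, out_dir):
    """Write vectors.txt and vocab.txt for a trained model, without any CLI call."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    words, freqs = model.get_words(include_freq=True)
    vectors_file = out_path / "vectors.txt"
    vocab_file = out_path / "vocab.txt"
    # Plain UTF-8 (no BOM) and "\n" line endings on every platform.
    with vectors_file.open("w", encoding="utf8", newline="\n") as f:
        # Assumed header: "<number of entries> <vector dimension>".
        f.write(f"{len(words)} {model.get_dimension()}\n")
        for word in words:
            values = " ".join(f"{x:.6f}" for x in model.get_word_vector(word))
            f.write(f"{word} {values}\n")
    with vocab_file.open("w", encoding="utf8", newline="\n") as f:
        # Assumed layout: "<key> <frequency>" per line.
        for word, freq in zip(words, freqs):
            f.write(f"{word} {freq}\n")
    return vectors_file, vocab_file
```

Because nothing here shells out to a platform-specific binary, the same code path should in principle run on any OS where the fastText package installs, though that remains untested in this thread. Saving the model itself with `model.save_model(...)` would be an optional extra step.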
The two os.system(cmd) portions of the 04_fasttext_train_vectors.py script, on lines 61-67 and lines 75-81, do not work for Windows users. So, I re-wrote lines 61-67 using the fastText word representations documentation. However, lines 75-81 are proving more difficult to rewrite because I don't know the structure of the vocab_file output file created by lines 76-78, included below. The 3rd-to-last line of my code below uses the save_model function to save the model to a binary file for later loading, as shown here. However, this is not the input file format expected in 05_export.py. Could you please provide a sample of what the vocab_file output file looks like? Or, better yet, do you have any suggestions for how to replace lines 75-78 in a way that doesn't involve os.system, CLI code, or the fasttext_bin?
Lines 76-78 of 04_fasttext_train_vectors.py:
Here is the code I used in place of 04_fasttext_train_vectors.py to make it Windows compatible: