Help Re-writing 04_fasttext_train_vectors.py for Windows 10 Compatibility #105

Closed
dshefman1 opened this issue Apr 15, 2020 · 15 comments · Fixed by #106
Labels
scripts Code examples in /scripts

Comments

@dshefman1

dshefman1 commented Apr 15, 2020

The two os.system(cmd) calls in the 04_fasttext_train_vectors.py script (lines 61-67 and lines 75-81) do not work for Windows users, so I rewrote lines 61-67 using the fastText Word representations documentation. Lines 75-81 are proving harder to rewrite because I don't know the structure of the vocab_file output file created by lines 76-78, included below. The third-to-last line of my code below uses the save_model function to save the model to a binary file for later loading, as shown here. However, this is not the input file format that 05_export.py expects. Could you please provide a sample of what the vocab_file output file looks like? Or, better yet, do you have any suggestions for replacing lines 75-78 that don't involve os.system, CLI code, or the fasttext_bin?

Lines 76-78 of 04_fasttext_train_vectors.py:

vocab_file = output_path / "vocab.txt"
cmd = f"{fasttext_bin} dump {output_file.with_suffix('.bin')} dict > {vocab_file}"
print(cmd)
vocab_cmd = os.system(cmd)
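
For reference, the `>` in the command above is shell redirection, which is why the call has to go through a shell. A minimal sketch of the same step without a shell, using subprocess.run from the standard library. The helper name dump_stdout_to_file is hypothetical, and the real argument list would be built from the fasttext_bin, dump, model path and dict values in the snippet above. Note that this still needs a fastText binary, so it only removes the shell dependency, not the binary one.

```python
import subprocess
from pathlib import Path


def dump_stdout_to_file(args, out_path):
    """Run a command without a shell and write its stdout to out_path.

    Hypothetical helper: `args` is the command as a list, for example
    [fasttext_bin, "dump", model_path, "dict"].
    """
    out_path = Path(out_path)
    # Opening the file ourselves replaces the shell's `>` redirection.
    with out_path.open("wb") as out_file:
        # check=True raises CalledProcessError on a non-zero exit code,
        # instead of returning a status that is easy to ignore.
        subprocess.run([str(a) for a in args], stdout=out_file, check=True)
    return out_path
```

Passing the arguments as a list also avoids quoting problems when a path contains spaces.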

Here is the code I used in place of 04_fasttext_train_vectors.py to make it Windows compatible:

from pathlib import Path
from wasabi import msg
import fasttext

in_dir = "./corpus_parsed3"
out_dir = "./fasttext_model3"
n_threads = 26
min_count = 50
vector_size = 300
verbose = 2

input_path = Path(in_dir)
output_path = Path(out_dir)
if not input_path.exists() or not input_path.is_dir():
    msg.fail("Not a valid input directory", in_dir, exits=1)
if not output_path.exists():
    output_path.mkdir(parents=True)
    msg.good(f"Created output directory {out_dir}")
output_file = output_path / f"vectors_w2v_{vector_size}dim.bin"

# fastText expects only one input file and only reads from disk and not
# stdin, so we need to create a temporary file that concatenates the inputs
tmp_path = input_path / "s2v_input.tmp"
input_files = [p for p in input_path.iterdir() if p.suffix == ".s2v"]
if not input_files:
    msg.fail("Input directory contains no .s2v files", in_dir, exits=1)
with tmp_path.open("a", encoding="utf8") as tmp_file:
    for input_file in input_files:
        with input_file.open("r", encoding="utf-8") as f:
            tmp_file.write(f.read())
msg.info("Created temporary merged input file", tmp_path)

# Train on the merged file; the paths are passed as plain strings
sense2vec_model = fasttext.train_unsupervised(
    str(tmp_path),
    thread=n_threads,
    epoch=5,
    dim=vector_size,
    minn=0,
    maxn=0,
    minCount=min_count,
    verbose=verbose,
)
sense2vec_model.save_model(str(output_file))

tmp_path.unlink()
msg.good("Deleted temporary input file", tmp_path)
@svlandeg svlandeg added the demo Online demo label Apr 15, 2020
@dshefman1
Author

I was able to solve the problem by inferring the format of both vectors.txt and vocab.txt from the _get_shape and read_vocab functions in the 05_export.py script. Also, in 05_export.py I had to change line 15 to first_line = next(file_).replace('\ufeff','').split() because of the UTF-8 BOM signature that was included at the beginning of my UTF-8 text files on Windows.
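
As an alternative to stripping '\ufeff' by hand, Python's "utf-8-sig" codec drops a leading BOM when one is present and otherwise decodes like plain UTF-8. A minimal sketch, assuming the header is the first whitespace-separated line of the file; read_header is a hypothetical helper, not a function from 05_export.py.

```python
from pathlib import Path


def read_header(path):
    """Return the fields of the first line, ignoring a UTF-8 BOM if present."""
    # "utf-8-sig" removes a leading BOM; files without one are read unchanged.
    with Path(path).open("r", encoding="utf-8-sig") as file_:
        return next(file_).split()
```

With this, a header such as "3 300" splits into the same two fields whether or not the file starts with a BOM.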

Here is a complete Windows 10 compatible 04_fasttext_train_vectors.py script:

from pathlib import Path
from wasabi import msg
import fasttext

in_dir = "./corpus_parsed3"
out_dir = "./fasttext_model3"
n_threads = 26
min_count = 50
vector_size = 300
verbose = 2

input_path = Path(in_dir)
output_path = Path(out_dir)
if not input_path.exists() or not input_path.is_dir():
    msg.fail("Not a valid input directory", in_dir, exits=1)
if not output_path.exists():
    output_path.mkdir(parents=True)
    msg.good(f"Created output directory {out_dir}")
output_file = output_path / f"vectors_w2v_{vector_size}dim.bin"

# fastText expects only one input file and only reads from disk and not
# stdin, so we need to create a temporary file that concatenates the inputs
tmp_path = input_path / "s2v_input.tmp"
input_files = [p for p in input_path.iterdir() if p.suffix == ".s2v"]
if not input_files:
    msg.fail("Input directory contains no .s2v files", in_dir, exits=1)
# "w" rather than "a", so a leftover temp file from an interrupted run is overwritten
with tmp_path.open("w", encoding="utf8") as tmp_file:
    for input_file in input_files:
        with input_file.open("r", encoding="utf-8") as f:
            tmp_file.write(f.read())
msg.info("Created temporary merged input file", tmp_path)

# Train on the merged file; the paths are passed as plain strings
sense2vec_model = fasttext.train_unsupervised(
    str(tmp_path),
    thread=n_threads,
    epoch=5,
    dim=vector_size,
    minn=0,
    maxn=0,
    minCount=min_count,
    verbose=verbose,
)
# Optional: also keep the binary model for later loading
# sense2vec_model.save_model(str(output_file))

tmp_path.unlink()
msg.good("Deleted temporary input file", tmp_path)

words, freqs = sense2vec_model.get_words(include_freq=True)

# vocab.txt: one "<word> <frequency> word" line per vocabulary entry
with (output_path / "vocab.txt").open("w", encoding="utf-8") as f:
    for word, freq in zip(words, freqs):
        f.write(f"{word} {freq} word\n")

# vectors.txt: a "<number of words> <dimensions>" header, then one
# "<word> <v1> <v2> ..." line per word. Adapted from
# https://stackoverflow.com/questions/58337469/how-to-save-fasttext-model-in-vec-format
with (output_path / "vectors.txt").open("w", encoding="utf-8") as file_out:
    file_out.write(f"{len(words)} {sense2vec_model.get_dimension()}\n")
    for w in words:
        v = sense2vec_model.get_word_vector(w)
        # No bare try/except here: silently skipping a word would leave
        # fewer rows in the file than the header announces
        file_out.write(w + " " + " ".join(str(vi) for vi in v) + "\n")
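
A quick way to sanity-check the two output files before running 05_export.py is to read them back and compare the row count with the header. This is a minimal sketch under the layout inferred in this comment (a "<rows> <dims>" header in vectors.txt, and "<key> <freq> word" lines in vocab.txt); check_export is a hypothetical helper and is not part of the sense2vec scripts.

```python
from pathlib import Path


def check_export(output_dir):
    """Return (n_rows, n_dims, n_vocab) or raise ValueError on a mismatch."""
    output_path = Path(output_dir)

    # "utf-8-sig" tolerates a leading BOM in case an editor added one.
    with (output_path / "vectors.txt").open("r", encoding="utf-8-sig") as f:
        n_rows, n_dims = (int(x) for x in next(f).split())
        seen = 0
        for line in f:
            # Split from the right so the key keeps any spaces it contains.
            parts = line.rstrip("\n").rsplit(" ", n_dims)
            if len(parts) != n_dims + 1:
                raise ValueError(f"row {seen + 1} does not have {n_dims} values")
            [float(x) for x in parts[1:]]  # raises ValueError on non-numbers
            seen += 1
    if seen != n_rows:
        raise ValueError(f"header announces {n_rows} rows, file has {seen}")

    n_vocab = 0
    with (output_path / "vocab.txt").open("r", encoding="utf-8-sig") as f:
        for line in f:
            item = line.rstrip()
            if item.endswith(" word"):
                item = item[: -len(" word")]
            fields = item.rsplit(" ", 1)
            if len(fields) != 2:
                continue  # skip lines without a "<key> <freq>" pair
            int(fields[1])  # raises ValueError if the frequency is not an int
            n_vocab += 1
    return n_rows, n_dims, n_vocab
```

Calling check_export("./fasttext_model3") after the script above should return matching row and vocabulary counts if both files were written completely.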

@svlandeg
Member

@dshefman1 : thanks for this! Do you think it's possible to have one version of the script that works across all platforms? It would be great to incorporate your changes into the source here, so others don't run into the same problems. Would you feel like contributing a PR?

@Z-e-e

Z-e-e commented Apr 16, 2020

@dshefman1 : could you please elaborate on the in_dir, when referring to the dir where the .s2v file is saved, I get a "SystemExit: 1" and "✘ Not a valid input directory".

Using Jupyter.

@dshefman1
Author

dshefman1 commented Apr 20, 2020

@svlandeg

thanks for this! Do you think it's possible to have one version of the script that works across all platforms?

I do think it is possible.

It would be great to incorporate your changes into the source here, so others don't run into the same problems. Would you feel like contributing a PR?

I'd be happy to contribute a PR.

@dshefman1
Author

dshefman1 commented Apr 20, 2020

could you please elaborate on the in_dir, when referring to the dir where the .s2v file is saved, I get a "SystemExit: 1" and "✘ Not a valid input directory".

@Z-e-e As per the docstring in the original script, it expects a "directory of preprocessed .s2v input files, will concatenate them (using a temporary file on disk) and will use fastText to train a word2vec model."

You may be making the same mistake I made the first time I ran the script, which is to provide a string reference to the filepath of the .s2v file. Instead, you are expected to provide a string reference to the directory, which in my case is "./corpus_parsed3", but it's probably named something else on your machine.
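
A minimal sketch of a check that tells the two cases apart, so that passing a file path produces a more specific message than "Not a valid input directory". find_s2v_files is a hypothetical helper, not part of the original script. In Jupyter, also keep in mind that a relative path such as "./corpus_parsed3" is resolved against the notebook's current working directory.

```python
from pathlib import Path


def find_s2v_files(in_dir):
    """Return the .s2v files inside in_dir, or raise a descriptive error."""
    input_path = Path(in_dir)
    if input_path.is_file():
        # The common mistake: a path to the .s2v file instead of its folder.
        raise NotADirectoryError(
            f"{in_dir} is a file; pass its directory ({input_path.parent}) instead"
        )
    if not input_path.is_dir():
        raise FileNotFoundError(f"{in_dir} is not an existing directory")
    files = sorted(p for p in input_path.iterdir() if p.suffix == ".s2v")
    if not files:
        raise ValueError(f"{in_dir} contains no .s2v files")
    return files
```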

@svlandeg
Member

I'd be happy to contribute a PR.

Awesome. Reopening this to track the progress.

@svlandeg svlandeg reopened this Apr 20, 2020
@dshefman1
Author

dshefman1 commented Apr 20, 2020

@svlandeg I'm struggling to finalize my code to resolve issue #105. The problem is that I still don't quite understand the outputs of the CLI commands in 04_fasttext_train_vectors.py. For example, line 63 or line 76 appears to create a .bin file of the fastText model, but the 05_export.py script does not take this file as an input. So I'm not sure whether the model.bin file is created only so that the "vocab.txt" file can be produced on line 76. If so, I plan to make saving the model to disk optional rather than required. However, if the model.bin file serves another purpose related to the 05_export.py script, please let me know.

Also, 05_export.py expects a "vectors.txt" file as an input. However, the 04_fasttext_train_vectors.py script does not explicitly create a "vectors.txt" file in the way it explicitly creates vocab.txt on lines 75-77. I'm assuming that the "vectors.txt" file is created on lines 61-64. Am I correct? If not, could you help me understand at what point the "vectors.txt" file is created? This would help me ensure that I don't cause a new problem while fixing the Windows compatibility problem.

@svlandeg
Member

svlandeg commented Apr 20, 2020

I'll have a detailed look tomorrow!
[EDIT: update, sorry, something more urgent came up, but will definitely have a look in the coming week ;-)]

@svlandeg svlandeg added scripts Code examples in /scripts and removed demo Online demo labels Apr 21, 2020
@svlandeg
Member

svlandeg commented Apr 23, 2020

@Z-e-e : cf. PR #106

[EDIT: this was a reply to a question asking which changes exactly were made by @dshefman1. Though that question has now been deleted, it's still good to link the relevant PR to this Issue :-)]

@svlandeg
Member

Hi @dshefman1, I finally had some time to look into this in more detail.

The main reason why the scripts are "incompatible" with Windows is that they rely on the fastText binary, which you should be able to build from the fastText GitHub repo. This should be doable with the instructions given for cmake.

The other option is to just download the binary files from the unofficial release for Windows: https://github.com/xiamx/fastText/releases. That works fine for me on Windows. I also didn't run into any trouble with the BOM etc.

I think this may be the best option in the end, as the changes you started making to the script were quite big, and we need to make sure that it also keeps working on other platforms. What do you think?

@dshefman1
Author

Hi @svlandeg It sounds like you are saying that the current version of 04_fasttext_train_vectors.py already meets your criteria "to have one version of the script that works across all platforms." If so, then it sounds like a fine solution to me.

@svlandeg
Member

I mean, ideally, we wouldn't need to depend on a platform-dependent binary file. But it looks like working around it gets quite involved, and I'm also not sure what all the different intermediate files are for. So I'm just wondering whether it's worth putting more time into this if we can use that unofficial Windows release instead?

@dshefman1
Author

@svlandeg I was able to answer my own intermediate-files questions from before, so the intermediate files are not an issue for the code I wrote. The code I wrote provides all of the appropriate inputs for 05_export.py, but it does need to be tested by non-Windows users. However, if you think it is not worth putting more time into this, then I can get on board with that. I have plenty of high-priority work to keep me busy these days.

@svlandeg
Member

@svlandeg I was able to answer my own intermediate-files questions from before. So, the intermediate files are not an issue for the code I wrote.

Oh, OK, I thought you were still having open issues! So basically the PR works for you as-is on Windows?

@dshefman1
Author

@svlandeg Yes, it does. Sorry, it was my mistake not to report back that the PR works as-is on Windows. Also, since the PR uses the pip-installed fastText library instead of a binary build of fastText, I would think that it would work with almost any operating system, but I don't have a non-Windows machine to test it on.
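
One way a single script could serve both setups is to detect at runtime whether the pip-installed module or a binary on the PATH is available. This is only a sketch of that idea, not what the PR does: pick_fasttext_backend and its return values are hypothetical, and "fasttext" as the binary name is an assumption.

```python
import importlib.util
import shutil


def pick_fasttext_backend(binary_name="fasttext"):
    """Return "python", "binary", or None depending on what is installed."""
    # Prefer the pip-installed module, since it needs no platform-specific build.
    if importlib.util.find_spec("fasttext") is not None:
        return "python"
    # Otherwise fall back to an executable found on the PATH.
    if shutil.which(binary_name) is not None:
        return "binary"
    return None
```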
