add Multi-instance translate chapter in README and its script example #361
Conversation
I am not sure how generic and re-usable this script is for users (for example by hardcoding newstest2016).
OK. We will make the script easier to use. First, the file is divided by instance number and each instance handles one part. Then, after every instance finishes translating, the translated files are put back together into one output file. @huangzhiyuan
We have added an option for whether to run as a benchmark.
Can you help check this PR, @fhieber? Thanks very much.
Hi @rongzha1,
I had another look and I have to say that I don't think this should be part of core Sockeye in the current form.
- Splitting input files and translating them in parallel with multiple Sockeye translate processes is a nice idea, but Sockeye is a Python package and if this is supposed to be a top-level feature it would be more useful to have a Python CLI for this that is properly tested.
- The speed numbers that are added to README.md will be outdated quite quickly. I would remove this.
- The description added to README.md is not clear to me. What do you mean by "translate 24 instances in C4.8xlarge"?
- What are the assumptions for the environment so that this script works for arbitrary users? Does this only run on C4/5 instances? What about GPU instances?
How about making this a separate tutorial and placing it under tutorial/cpu_benchmarking, for example?
mlt-trans.sh
Outdated
output=$2".en"
find . -name "*.result.en" | sort | xargs cat | head -n $line > $output
rm mlt-gnmt.log* -f
What is gnmt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this log is not used. We will remove it in the next version.
mlt-trans.sh
Outdated
do
if [ $i == $[$1-1] ]
then
taskset -c $i-$i python3 -m sockeye.translate -m $2 -i $3 -o mlt-en.result.en --batch-size $4 --output-type benchmark --use-cpu > /dev/null 2>&1
Why hardcode the output files?
OK, we will add the output file name as a parameter.
mlt-trans.sh
Outdated
mv temp.log $file
}

# benchmark: inference use dummy data
What kind of dummy data?
We use the input file as dummy data, but without splitting it. That is to say, each instance translates the whole input file. This is used as a benchmark test.
Hi @fhieber, we will add a brief description of multi-instance translation in README.md and move the script to tutorial/cpu_benchmarking.
# benchmark: multi-instances translate using same input file
def benchmark(cores, args):
Why do we need separate benchmark and merge_translate functions? They are doing basically the same work.
These serve two different uses. One is for benchmarking, where each core runs inference on the whole file. The other is where each core runs inference on only a part of it (file_lines // core_number lines).
If each core translates the whole file, the measured performance is better when the sentence count is larger.
If each core translates only its part, the total time is shorter.
# split file to small files
def split_file(cores, fileInput):
    lines = len(open(fileInput).readlines())
This will load the whole file into memory. For bigger files it would be more useful to iterate over the file and increment a counter.
Agree, will change later.
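For example, a minimal sketch of the iterative count (the count_lines helper name here is hypothetical):

```python
def count_lines(path):
    """Count lines without loading the whole file into memory."""
    count = 0
    with open(path) as f:
        for _ in f:  # iterates lazily, line by line
            count += 1
    return count
```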
    rema = lines % cores
    if rema != 0:
        quot += 1
    os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))
- This is basically replicating lines so that the size is a multiple of quot*cores, right? Why do you need this? If you spawn multiple jobs they can have different translation sizes.
- Please don't hardcode filenames. Use the tempfile library for generating temporary files (see the sketch below).
- Please use %d instead of the combination of %s and str().
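A minimal sketch of those two points, assuming the same variables (fileInput, cores, lines, quot) as the script; the pad_input helper name is hypothetical:

```python
import os
import tempfile

def pad_input(file_input, cores, lines, quot):
    """Copy the input and pad it with repeated lines so its length is a
    multiple of quot (illustrative only; the padding itself was questioned)."""
    # tempfile avoids hardcoded names like "mlt-nmt-temp.log"
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
    tmp.close()
    # %d formats integers directly; no str() wrapping needed
    os.system("cp %s %s && head -n%d %s >> %s"
              % (file_input, tmp.name, quot * cores - lines, file_input, tmp.name))
    return tmp.name
```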
        quot += 1
    os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))

    os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot))
Same here with %d.
Agree, will change later.
os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot)) | ||
|
||
# merge_translate: multi-instances translation, each instance trans whole_lines/instance_num lines of file, and merge into one complete output file
def merge_translate(cores, args):
Actually there is no merging taking place in this function.
Agree, will change this function later.
    os.system(ompStr)

    # the total lines of input file will be translated
    lines = len(open(args.input_file).readlines())
If the number of lines is computed here, you can pass it down so that it doesn't need to be recomputed in split_files().
[And same comment above about memory applies here as well]
Yes, will change later.
    #force stop 100s after the first instance completes to jump out of the loop
    while (process_cnt > 0 and stop_cnt < 1000):
        process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
        time.sleep(0.1)
0.1 is probably too short a time, waking the main process 10 times per second for a process that will take at least minutes to complete. A better solution would probably use os.wait() or similar functions.
OK, I'll change to os.wait
    while (process_cnt > 0 and stop_cnt < 1000):
        process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
        time.sleep(0.1)
        stop_cnt += 1
Wouldn't the stop_cnt trigger for really large files? Also there is no code to differentiate the two exit conditions.
OK, will use os.wait instead.
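A hedged sketch of that direction, using subprocess.Popen plus wait() rather than polling ps. The split-file names mirror the mlt-nmt.log.NN naming produced by the split command above; the per-instance output names and the run_instances helper itself are illustrative assumptions:

```python
import subprocess

def run_instances(cores, model, batch_size):
    """Launch one pinned sockeye.translate process per core and block until
    all of them have exited, instead of polling ps in a sleep loop."""
    procs = []
    for i in range(cores):
        cmd = ("taskset -c %d python3 -m sockeye.translate -m %s "
               "-i mlt-nmt.log.%02d -o %02d.result.en --batch-size %d --use-cpu"
               % (i, model, i, i, batch_size))
        procs.append(subprocess.Popen(cmd, shell=True))
    for p in procs:
        p.wait()  # no busy-wait; returns when the child process finishes
```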
    #merge each part into one complete output file
    os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))
    #rm temp file
    os.system("rm mlt-nmt* -rf")
This wouldn't be needed with the use of tempfile, as pointed out above.
        stop_cnt += 1
    end = time.time()
    #merge each part into one complete output file
    os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))
You only want the files in the current directory, right? A simple 'ls *.result.en' could replace both 'find [...]' and 'sort' (ls sorts its output). Actually 'cat *.result.en' would also work for a number of files below several thousand, which you are actually assuming given the naming scheme.
OK, will change later.
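A minimal Python alternative to the find/sort/xargs pipeline, assuming the same '*.result.en' naming; the merge_outputs helper is hypothetical:

```python
import glob

def merge_outputs(file_output, total_lines):
    """Concatenate the per-instance outputs in sorted order and keep only the
    first total_lines lines (drops any padding added during splitting)."""
    written = 0
    with open(file_output, "w") as out:
        for part in sorted(glob.glob("*.result.en")):
            with open(part) as f:
                for line in f:
                    if written >= total_lines:
                        return
                    out.write(line)
                    written += 1
```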
Thanks for updating the PR!
README.md
Outdated
@@ -161,6 +161,15 @@ You can translate as follows:
This will take the best set of parameters found during training and then translate strings from STDIN and
write translations to STDOUT.

#### Multi-instance Translate |
Can you add a list entry in tutorials/README.md and move this section to tutorials/$your_folder_name/README.md?
OK
@@ -0,0 +1,129 @@
# Describtion: This script is used for CPU Multi-instance Translate, |
I know @fhieber asked for moving this to cpu_benchmarking, but I find this name confusing as this is not really about benchmarking. What do people think about process_per_core_translation?
OK
README.md
Outdated
@@ -161,6 +161,15 @@ You can translate as follows:
This will take the best set of parameters found during training and then translate strings from STDIN and
write translations to STDOUT.

#### Multi-instance Translate
Multi-instance can be used to greatly speedup translation in one multi-core processor computer. |
Maybe add a few more words about what multi-instance means, namely that you run one process per core and set the CPU affinity, as I don't think this will be clear to users otherwise.
OK
Thanks for iterating. You will need to rebase on the current master before merging.
@@ -0,0 +1,13 @@
# CPU process per core translation
On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
Using this methord, translation on each core can be parallel.
typo: method
OK
On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
Using this methord, translation on each core can be parallel.

One python script example is givne and you can run it as follows:
typo: given
Changed. Thanks.
@@ -0,0 +1,129 @@
# Describtion: This script is used for CPU Multi-instance Translate, which process per core translation.
# It can greatly speedup translate perefomance.
# FileName: cpu_process_per_core_translation.py
Filename and version seem unnecessary. The usage would be best as part of the argparse description: argparse.ArgumentParser(description=...).
Done. Thanks
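For illustration, an argparse setup along those lines; the exact option names and defaults below are assumptions, not necessarily what the final script uses:

```python
import argparse

parser = argparse.ArgumentParser(
    description="CPU multi-instance translation: run one sockeye.translate "
                "process per core to speed up translation on multi-core CPUs.")
parser.add_argument("-m", "--model", required=True, help="Folder with the trained model.")
parser.add_argument("-i", "--input-file", required=True, help="Source file to translate.")
parser.add_argument("-o", "--output-file", required=True, help="File to write translations to.")
parser.add_argument("-b", "--batch-size", type=int, default=16, help="Batch size per instance.")
args = parser.parse_args()
# argparse derives dests like args.input_file from the long option names,
# and --help now prints the description that used to live in header comments.
```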
def task(args):
    os.system(args)

# benchmark: multi-instances translating, each instance trans the same input file separately
Could you make this a proper docstring for the method?

def benchmark(...):
    """
    doc goes here
    """

This also applies to the other functions below.
Good style, will follow.
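For example, a docstring for benchmark() in that style might read (wording is illustrative):

```python
def benchmark(cores, args):
    """Run one sockeye.translate process per core, each instance translating
    the whole input file separately. Intended for benchmarking only; the
    per-instance outputs are not merged.
    """
```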
Thanks for iterating!
Can you add a header stating the license (Apache) and authors?
Add Multi-instance translate in README.md and its script example.
Using multi-instance translate gives a 10+ times speedup.
Here is some data:
AWS EC2 C5.18xlarge:
Pull Request Checklist
- "... until you can check this box": done.
- Unit tests (pytest): didn't change any code.
- No tests modified.
- System tests (pytest test/system): didn't change any code.
- Code style check (./style-check.sh): done, no error in README.md and mlt-trans.sh.
- No, didn't change any code.
- Version in sockeye/__init__.py (major version bump if this is a backwards incompatible change): not an incompatible change.
- Seems no need to.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@pengzhao-intel @huangzhiyuan