add Multi-instance translate chapter in README and its script example #361

rongzha1 · 2018-04-19T08:09:30Z

(description of the change)
Add a multi-instance translation section to README.md, together with an example script.
Multi-instance translation gives a speed-up of roughly 10x (about 7x to 11x in the measurements below).
Here is some data, measured on an AWS EC2 C5.18xlarge:

| Batch size | 1 instance (sent/sec) | 36 instances (sent/sec) |
|---|---|---|
| 1 | 7.50 | 77.172 |
| 2 | 11.30 | 78.46 |
| 8 | 16.30 | 183.93 |
| 16 | 19.70 | 194.88 |
| 32 | 21.40 | 200.30 |
| 64 | 22.50 | 199.563 |
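
For reference, the speed-up implied by each row can be computed directly from the figures above. This is only a small sketch: the dictionary restates the table, and the variable names are illustrative.

```python
# Throughput in sentences/sec from the table: batch size -> (1 instance, 36 instances).
throughput = {
    1: (7.50, 77.172),
    2: (11.30, 78.46),
    8: (16.30, 183.93),
    16: (19.70, 194.88),
    32: (21.40, 200.30),
    64: (22.50, 199.563),
}

# Speed-up is the ratio of multi-instance to single-instance throughput.
speedups = {batch: multi / single for batch, (single, multi) in throughput.items()}

for batch, ratio in speedups.items():
    print(f"batch {batch:>2}: {ratio:.1f}x")
```

The ratio is smallest at batch size 2 and largest at batch size 8.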

Pull Request Checklist

- [ ] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
  Done.
- [ ] Unit tests pass (pytest).
  Didn't change any code.
- Were system tests modified? If so, did you run these at least 5 times to account for the variation across runs?
  No tests modified.
- System tests pass (pytest test/system).
  Didn't change any code.
- Passed code style checking (./style-check.sh).
  Done, no errors in README.md and mlt-trans.sh.
- You have considered writing a test.
  No, didn't change any code.
- Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  Not an incompatible change.
- Updated CHANGELOG.md.
  Seems there is no need to.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pengzhao-intel @huangzhiyuan

fhieber · 2018-04-19T15:36:10Z

I am not sure how generic and re-usable this script is for users (for example by hardcoding newstest2016).

rongzha1 · 2018-04-20T01:54:40Z

OK. We will make the script easier to use. It will first split the input file by the number of instances, so that each instance handles one part, and then concatenate the translated parts into a single output file once every instance has finished translating. @huangzhiyuan

rongzha1 · 2018-04-27T08:56:41Z

We have added an option to choose whether to run as a benchmark.
When run as a benchmark, each instance translates its own copy of the whole input file.
Otherwise, the input file is split by the number of instances, and each instance handles one part.
Our script is an example of how to do multi-instance translation and shows its best performance.
@huangzhiyuan @fhieber
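
The two modes described above can be sketched as follows. This is a minimal illustration, not the script from this PR: `split_lines` and `plan_inputs` are hypothetical names, and the sketch gives instances chunks of unequal size instead of padding the input.

```python
from typing import List


def split_lines(lines: List[str], instances: int) -> List[List[str]]:
    """Split lines into `instances` contiguous chunks whose sizes differ by at most one."""
    quot, rema = divmod(len(lines), instances)
    chunks = []
    start = 0
    for i in range(instances):
        # The first `rema` chunks take one extra line each.
        size = quot + (1 if i < rema else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def plan_inputs(lines: List[str], instances: int, benchmark: bool) -> List[List[str]]:
    """Return the input that each instance should translate."""
    if benchmark:
        # Benchmark mode: every instance translates the whole input.
        return [list(lines) for _ in range(instances)]
    # Normal mode: each instance translates one part of the input.
    return split_lines(lines, instances)
```

Because the chunks are contiguous and ordered, concatenating the translated parts in instance order restores the original sentence order.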

rongzha1 · 2018-05-04T06:53:50Z

Could you help check this PR, @fhieber? Thanks very much.

fhieber

Hi @rongzha1,
I had another look and I have to say that I don't think this should be part of core Sockeye in the current form.

1. Splitting input files and translating them in parallel with multiple Sockeye translate processes is a nice idea, but Sockeye is a Python package and if this is supposed to be a top-level feature it would be more useful to have a Python CLI for this that is properly tested.
2. The speed numbers that are added to README.md will be outdated quite quickly. I would remove this.
3. The description added to README.md is not clear to me. What do you mean by "translate 24 instances in C4.8xlarge"?
4. What are the assumptions for the environment so that this script works for arbitrary users? Does this only run on C4/5 instances? What about GPU instances?

How about making this a separate tutorial and place it under tutorial/cpu_benchmarking for example?

fhieber · 2018-05-04T07:16:22Z

mlt-trans.sh

+
+output=$2".en"
+find . -name "*.result.en" | sort | xargs cat | head -n $line > $output
+rm mlt-gnmt.log* -f


What is gnmt?

Actually, this log is not used. We will remove it in the next version.

fhieber · 2018-05-04T07:16:35Z

mlt-trans.sh

+	do
+		if [ $i == $[$1-1] ]
+		then
+ 			taskset -c $i-$i python3 -m sockeye.translate -m $2 -i $3 -o mlt-en.result.en --batch-size $4 --output-type benchmark --use-cpu > /dev/null 2>&1 


why hardcode the output files?

OK, we will add the output file name as a parameter.

fhieber · 2018-05-04T07:17:21Z

mlt-trans.sh

+	mv temp.log $file
+}
+
+# benchmark: inference use dummy data


what kind of dummy data?

We use the input file as dummy data, but it is not split. That is, each instance translates the whole input file. This is used for benchmark testing.

rongzha1 · 2018-05-04T08:35:15Z

Hi @fhieber,
Thanks for your quick response!

1. We will change the shell script to a Python script.
2. Agreed, we will remove the speed numbers.
3. We will make README.md more accurate and clear. "translate 24 instances in C4.8xlarge" means using 24 instances to translate on the C4.8xlarge platform. (Actually, 24 is a typo: we use 18 instances on C4.8xlarge because it has 18 physical cores per socket.)
4. We assume the script can be run in any multi-core CPU environment. We tested it on Broadwell and Skylake servers, and it works well. I am not sure whether GPUs can benefit from multi-instance translation.

We will add a brief description of multi-instance translation to README.md and move the script to tutorial/cpu_benchmarking.

davvil · 2018-05-25T08:54:46Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+
+
+# benchmark: multi-instances translate using same input file
+def benchmark(cores, args):


Why do we need separate benchmark and merge_translate functions? They are doing basically the same work.

These serve two different uses. One is benchmarking, where each core runs inference on the whole file. The other is where each core runs inference on only a part of the file (file_lines // number_of_cores lines).
If each core translates the whole file, performance is better for longer sentences.
If each core translates only a part of the file, the total time is shorter.

davvil · 2018-05-25T08:58:11Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+
+# split file to small files
+def split_file(cores, fileInput):
+    lines = len(open(fileInput).readlines())


This will load the whole file into memory. For bigger files it could be more useful to iterate over the file and increment a counter.

Agreed. Will change later.
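
The suggestion can be sketched as follows. `count_lines` is a hypothetical helper name, and UTF-8 input is an assumption.

```python
def count_lines(path: str) -> int:
    """Count lines by iterating over the file instead of reading it all at once."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        # Iterating over a file object yields one line at a time,
        # so memory use stays constant regardless of file size.
        for _ in handle:
            count += 1
    return count
```

The result can be computed once and passed to whichever function needs it, which avoids a second pass over the file.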

davvil · 2018-05-25T09:49:04Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+    rema = lines % cores
+    if rema != 0:
+        quot += 1
+        os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))


This is basically replicating lines so that the size is a multiple of cores (quot*cores), right? Why do you need this? If you spawn multiple jobs, they can have different translation sizes.

Please don't hardcode filenames. Use the tempfile library for generating temporary files.

Please use %d instead of the combination %s and str().
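
A minimal sketch of both suggestions, assuming the chunks have already been computed. `write_chunks` and the `part` prefix are hypothetical names, not taken from the script under review.

```python
import os
import tempfile


def write_chunks(chunks, directory):
    """Write each chunk of lines to its own temporary file and return the paths."""
    paths = []
    for index, chunk in enumerate(chunks):
        # mkstemp picks a unique name, so no filename is hardcoded.
        # %02d formats the integer directly, with no str() conversion needed.
        fd, path = tempfile.mkstemp(prefix="part%02d." % index, dir=directory, text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.writelines(chunk)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as workdir:
    paths = write_chunks([["a\n", "b\n"], ["c\n"]], workdir)
    print("wrote %d files" % len(paths))
# The directory and everything in it is removed on exit from the `with` block.
```

With a temporary directory, no explicit cleanup command is needed at the end of the run.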

davvil · 2018-05-25T09:49:16Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+        quot += 1
+        os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))
+
+    os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot))


Same here with %d.

Agreed. Will change later.

davvil · 2018-05-25T09:50:00Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+    os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot))
+
+# merge_translate: multi-instances translation, each instance trans  whole_lines/instance_num lines of file, and merge into one complete output file
+def merge_translate(cores, args):


Actually there is no merging taking place in this function.

Agreed. Will change this function later.

davvil · 2018-05-25T09:55:53Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+    os.system(ompStr)
+
+    # the total lines of input file will be translated
+    lines = len(open(args.input_file).readlines())


If the number of lines is computed here, you can pass it down so that it doesn't need to be recomputed in split_file().

[And same comment above about memory applies here as well]

Yes. Will change later.

davvil · 2018-05-25T10:02:12Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+        #force stop 100s after the fisrt instance complete to jump out of the loop
+        while (process_cnt > 0 and stop_cnt < 1000):
+            process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
+            time.sleep(0.1)


0.1 s is probably too short an interval: it wakes the main process 10 times per second for a job that will take at least minutes to complete. A better solution would probably use os.wait() or a similar function.

OK, I'll change to os.wait

davvil · 2018-05-25T10:03:30Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+        while (process_cnt > 0 and stop_cnt < 1000):
+            process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
+            time.sleep(0.1)
+            stop_cnt += 1


Wouldn't the stop_cnt trigger for really large files? Also there is no code to differentiate the two exit conditions.

OK. Will use os.wait instead.
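
One way to block on the child processes without polling is sketched below. Note that it uses `subprocess.Popen.wait()` rather than the `os.wait()` named above, and the commands are stand-ins for the real translate processes; `run_all` is a hypothetical name.

```python
import subprocess
import sys


def run_all(commands):
    """Start every command, then block until each one exits; return the exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    # Popen.wait() sleeps until the child exits, so there is no polling loop
    # and no timeout counter that could fire early on large inputs.
    return [proc.wait() for proc in processes]


# Stand-in commands for illustration; real use would launch the translate processes.
exit_codes = run_all([[sys.executable, "-c", "pass"] for _ in range(2)])
print(exit_codes)
```

The returned exit codes also make it possible to tell a normal completion from a failed process.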

davvil · 2018-05-25T10:04:46Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+        #merge each part into one complete output file
+        os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))
+        #rm temp file
+        os.system("rm mlt-nmt* -rf")


This wouldn't be needed with the use of tempfile, as pointed out above.

davvil · 2018-05-25T10:07:49Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

+            stop_cnt += 1
+        end = time.time()
+        #merge each part into one complete output file
+        os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))


You only want the files in the current directory, right? A simple 'ls *.result.en' could replace both 'find [...]' and 'sort' (ls sorts its output). Actually 'cat *.result.en' would also work as long as the number of files stays below several thousand, which you are already assuming given the naming scheme.

OK. Will change later.
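
The same merge can be done in Python without a shell pipeline. This is a sketch under two assumptions: the part files carry zero-padded suffixes so that name order equals part order, and the output path does not itself match the pattern. `merge_parts` is a hypothetical name.

```python
import glob
import os


def merge_parts(directory, pattern, output_path):
    """Concatenate the files matching `pattern` in `directory`, in sorted name order."""
    # glob.glob does not recurse here, so only files directly in `directory` match.
    parts = sorted(glob.glob(os.path.join(directory, pattern)))
    with open(output_path, "w", encoding="utf-8") as out:
        for part in parts:
            with open(part, encoding="utf-8") as handle:
                for line in handle:
                    out.write(line)
    return parts
```

A call such as `merge_parts(".", "*.result.en", "output.en")` would then replace the `find | sort | xargs cat` pipeline.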

tdomhan

thanks for updating the PR!

tdomhan · 2018-06-21T07:14:18Z

README.md

@@ -161,6 +161,15 @@ You can translate as follows:
 This will take the best set of parameters found during training and then translate strings from STDIN and
 write translations to STDOUT.

+#### Multi-instance Translate


Can you add a list entry in tutorials/README.md and move this section to tutorials/$your_folder_name/README.md?

tdomhan · 2018-06-21T07:19:11Z

tutorials/cpu_benchmarking/mlt_cpu_trans_benchmark.py

@@ -0,0 +1,129 @@
+# Describtion: This script is used for CPU Multi-instance Translate,


I know @fhieber asked for moving this to cpu_benchmarking, but I find this name confusing as this is not really about benchmarking. What do people think about process_per_core_translation?

tdomhan · 2018-06-21T07:21:26Z

README.md

@@ -161,6 +161,15 @@ You can translate as follows:
 This will take the best set of parameters found during training and then translate strings from STDIN and
 write translations to STDOUT.

+#### Multi-instance Translate
+Multi-instance can be used to greatly speedup translation in one multi-core processor computer.


Maybe add a few more words about what multi-instance means, namely that you can run one process per core, setting the CPU affinity, as I don't think this will be clear to users otherwise.
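
To make "one process per core with CPU affinity" concrete, here is a sketch that builds one `taskset`-pinned command per core. The `sockeye.translate` flags are copied from the shell script quoted earlier in this thread and may differ in other Sockeye versions; `build_commands` and its per-core file lists are hypothetical.

```python
def build_commands(cores, model, inputs, outputs, batch_size):
    """Build one `taskset`-pinned translate command per core.

    `inputs` and `outputs` are hypothetical per-core file lists of length `cores`.
    """
    commands = []
    for core in range(cores):
        commands.append([
            "taskset", "-c", str(core),  # pin this process to a single core
            "python3", "-m", "sockeye.translate",
            "-m", model,
            "-i", inputs[core],
            "-o", outputs[core],
            "--batch-size", str(batch_size),
            "--use-cpu",
        ])
    return commands
```

Each command list can be handed to a process launcher, so that every core runs its own independent translate process.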

tdomhan · 2018-06-25T15:58:01Z

thanks for iterating. You will need to rebase on the current master before merging.

tdomhan · 2018-06-25T15:54:18Z

tutorials/process_per_core_translation/README.md

@@ -0,0 +1,13 @@
+# CPU process per core translation
+On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
+Using this methord, translation on each core can be parallel.


typo: method

tdomhan · 2018-06-25T15:54:23Z

tutorials/process_per_core_translation/README.md

+On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
+Using this methord, translation on each core can be parallel.
+
+One python script example is givne and you can run it as follows:


typo: given

changed. Thanks

tdomhan · 2018-06-25T15:55:53Z

tutorials/process_per_core_translation/cpu_process_per_core_translation.py

@@ -0,0 +1,129 @@
+# Describtion: This script is used for CPU Multi-instance Translate, which process per core translation.
+# It can greatly speedup translate perefomance.
+# FileName: cpu_process_per_core_translation.py


filename and version seem unnecessary. The usage would be best as part of the argparse description argparse.ArgumentParser(description=....

Done. Thanks
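
Moving the usage text into the parser could look like the sketch below. The description and option names are illustrative, not the script's actual ones.

```python
import argparse


def build_parser():
    """Create a parser whose description doubles as the usage documentation."""
    parser = argparse.ArgumentParser(
        description="Translate on CPU with one process per core. "
                    "(Illustrative description and options, not the script's actual ones.)")
    parser.add_argument("--input-file", help="File to translate.")
    parser.add_argument("--benchmark", action="store_true",
                        help="Have every process translate the whole input file.")
    return parser


args = build_parser().parse_args(["--input-file", "in.txt", "--benchmark"])
print(args.input_file, args.benchmark)
```

The description then appears in the `--help` output, so it does not need to be repeated in a header comment.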

tdomhan · 2018-06-25T15:56:39Z

tutorials/process_per_core_translation/cpu_process_per_core_translation.py

+def task(args):
+    os.system(args)
+
+# benchmark: multi-instances translating, each instance trans the same input file separately


could you make this a proper doc-string for the method?

def benchmark(...):
    """ doc goes here """

Also applies for the other functions below.

Good style. Will follow.

tdomhan · 2018-06-26T09:59:41Z

thanks for iterating!

davvil · 2018-06-27T10:12:28Z

Can you add a header stating the license (Apache) and authors?

rongzha1 requested review from davvil, fhieber, mjdenkowski and tdomhan as code owners April 19, 2018 08:09

fhieber reviewed May 4, 2018

fhieber added the WIP work in progress label May 23, 2018

davvil suggested changes May 25, 2018

tdomhan requested changes Jun 21, 2018

tdomhan reviewed Jun 25, 2018

rongzha1 requested a review from mjpost as a code owner June 26, 2018 08:13

rongzha1 added 14 commits June 26, 2018 16:28

5e6a92f add chapter multi-instance translate for README.md and script mlt-trans to run multi-instance
2aa4a97 add chapter multi-instance translate for README.md
bebf3f2 change method to get physical cores number and add hyperlink for C4 and C5 in README.md
849684b add option for running as benchmark or not
4a06324 rewrite mlt-trans script using python and move it to tutorials
2ba9140 change code according to the commnets
b77546a move mlt_cpu_trans_benchmark to tutorials and rename it process_per_core_translation
9e868a5 fix typo
96757a7 add chapter multi-instance translate for README.md and script mlt-trans to run multi-instance
8abb672 add chapter multi-instance translate for README.md
e2e32eb change method to get physical cores number and add hyperlink for C4 and C5 in README.md
709a7e3 add option for running as benchmark or not
c4ab968 rewrite mlt-trans script using python and move it to tutorials
98fe512 change code according to the commnets

rongzha1 added 2 commits June 26, 2018 16:30

2ff05c2 move mlt_cpu_trans_benchmark to tutorials and rename it process_per_core_translation
6c6eb0c add python script in tutortial

tdomhan approved these changes Jun 26, 2018

9231d94 add license (Apache) and authors

davvil approved these changes Jun 28, 2018

tdomhan merged commit 951e71e into awslabs:master Jun 28, 2018

pengzhao-intel mentioned this pull request Nov 8, 2018

Performence with multi thead inference is slow apache/mxnet#13075

Open


