

add Multi-instance translate chapter in README and its script example #361

Merged 17 commits on Jun 28, 2018

Conversation

@rongzha1 rongzha1 commented Apr 19, 2018

Add a Multi-instance translate chapter to README.md with an example script.
Using multi-instance translation gives a 10+ times speedup.
Here is some data from an AWS EC2 C5.18xlarge:

| Batch | 1 instance (sent/sec) | 36 instances (sent/sec) |
|------:|----------------------:|------------------------:|
| 1     | 7.50                  | 77.172                  |
| 2     | 11.30                 | 78.46                   |
| 8     | 16.30                 | 183.93                  |
| 16    | 19.70                 | 194.88                  |
| 32    | 21.40                 | 200.30                  |
| 64    | 22.50                 | 199.563                 |

Pull Request Checklist

  • [ ] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box).
    done
  • [ ] Unit tests pass (pytest)
    didn't change any code.
  • Were system tests modified? If so, did you run these at least 5 times to account for the variation across runs?
    no tests modified
  • System tests pass (pytest test/system)
    didn't change any code.
  • Passed code style checking (./style-check.sh)
    done, no errors in README.md and mlt-trans.sh
  • You have considered writing a test
    no, didn't change any code.
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
    not an incompatible change
  • Updated CHANGELOG.md
    there seems to be no need.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pengzhao-intel @huangzhiyuan


fhieber commented Apr 19, 2018

I am not sure how generic and re-usable this script is for users (for example by hardcoding newstest2016).

@rongzha1 (Contributor Author)

OK. We will make the script easier to use. First, divide the input file by instance number so that each instance handles one part; then, after all instances finish translating, concatenate the translated parts into one output file. @huangzhiyuan


rongzha1 commented Apr 27, 2018

We have added an option for whether to run as a benchmark.
If run as a benchmark, each instance copies the input file and translates it in full.
If not, the input file is divided by instance number, and each instance handles one part.
Our script is an example of how to do multi-instance translation and shows its best performance.
@huangzhiyuan @fhieber
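The two modes described above (benchmark: every instance translates the full input; normal: the input is split across instances) can be sketched as follows. `plan_inputs` is a hypothetical helper for illustration, not part of the PR's script:

```python
def plan_inputs(lines, n_instances, benchmark=False):
    """Return the chunk of lines each translate instance would receive.

    benchmark=True: every instance gets a copy of the whole input.
    benchmark=False: the input is split into contiguous chunks,
    one per instance (the last chunk may be shorter).
    """
    if benchmark:
        return [list(lines) for _ in range(n_instances)]
    size = -(-len(lines) // n_instances)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]
```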


rongzha1 commented May 4, 2018

Can you help check this PR, @fhieber? Thanks very much.


@fhieber fhieber left a comment


Hi @rongzha1,
I had another look and I have to say that I don't think this should be part of core Sockeye in the current form.

  • Splitting input files and translating them in parallel with multiple Sockeye translate processes is a nice idea, but Sockeye is a Python package and if this is supposed to be a top-level feature it would be more useful to have a Python CLI for this that is properly tested.
  • The speed numbers that are added to README.md will be outdated quite quickly. I would remove this.
  • The description added to README.md is not clear to me. What do you mean by "translate 24 instances in C4.8xlarge"?
  • What are the assumptions for the environment so that this script works for arbitrary users? Does this only run on C4/5 instances? What about GPU instances?

How about making this a separate tutorial and place it under tutorial/cpu_benchmarking for example?

mlt-trans.sh Outdated

output=$2".en"
find . -name "*.result.en" | sort | xargs cat | head -n $line > $output
rm mlt-gnmt.log* -f
Contributor:

What is gnmt?

Contributor Author:

Actually, this log is not used. We will remove it in the next version.

mlt-trans.sh Outdated
do
if [ $i == $[$1-1] ]
then
taskset -c $i-$i python3 -m sockeye.translate -m $2 -i $3 -o mlt-en.result.en --batch-size $4 --output-type benchmark --use-cpu > /dev/null 2>&1
Contributor:

why hardcode the output files?

Contributor Author:

OK, we will add the output file name as a parameter.

mlt-trans.sh Outdated
mv temp.log $file
}

# benchmark: inference use dummy data
Contributor:

what kind of dummy data?

Contributor Author:

We use the input file as dummy data without dividing it; that is, each instance translates the whole input file. This is used for benchmark testing.

@rongzha1
Copy link
Contributor Author

rongzha1 commented May 4, 2018

Hi @fhieber,
Thanks for your quick response!
We will change the shell script to a Python script.
Agree to remove the speed numbers.
We will make README.md more accurate and clear. "translate 24 instances in C4.8xlarge" means using 24 instances to translate on a C4.8xlarge platform. (Actually, 24 is a typo; we use 18 instances on C4.8xlarge, since it has 18 physical cores per socket.)
We assume the script runs in a multi-core CPU environment. We tested it on Broadwell and Skylake servers and it works well. I am not sure whether GPUs can benefit from multi-instance translation.

We will add a brief description of multi-instance translation to README.md and move the script to tutorial/cpu_benchmarking.

@fhieber fhieber added the WIP work in progress label May 23, 2018


# benchmark: multi-instances translate using same input file
def benchmark(cores, args):
Contributor:

Why do we need separate benchmark and merge_translate functions? They are doing basically the same work.

Contributor Author:

This is for two different uses. One is for benchmarking: each core runs inference on the whole file. The other: each core runs inference on only a part (file_lines // core_count lines).
If each core translates the whole file, throughput is better for larger inputs.
If each core translates only a part, total time is lower.


# split file to small files
def split_file(cores, fileInput):
lines = len(open(fileInput).readlines())
Contributor:

This will load the whole file into memory. For bigger files it could be more useful to iterate over the file and increment a counter.

Contributor Author:

Agree. will change later
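The streaming line count suggested above could look like this sketch (`count_lines` is a hypothetical name):

```python
def count_lines(path):
    """Count lines by iterating over the file, without
    loading its entire contents into memory."""
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count
```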

rema = lines % cores
if rema != 0:
quot += 1
os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))
Contributor:

  1. This is basically replicating lines so that the size is a multiple of the core count, right? Why do you need this? If you spawn multiple jobs, they can have different translation sizes.

  2. Please don't hardcode filenames. Use the tempfile library for generating temporary files.

  3. Please use %d instead of the combination %s and str().
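The tempfile suggestion might look like this sketch, which reproduces the script's padding trick (append the first few input lines again so the total is a multiple of the core count) without a hardcoded filename; `pad_to_multiple` is a hypothetical name:

```python
import os
import tempfile

def pad_to_multiple(src_path, cores):
    """Copy src into a temp file, appending its first lines again so the
    line count becomes a multiple of `cores`. Returns the temp file path."""
    with open(src_path) as src:
        lines = src.readlines()
    extra = (-len(lines)) % cores  # lines needed to reach the next multiple
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as tmp:
        tmp.writelines(lines)
        tmp.writelines(lines[:extra])
    return tmp_path
```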

quot += 1
os.system("cp %s mlt-nmt-temp.log && head -n%s %s >> mlt-nmt-temp.log " % (fileInput, str(quot * cores - lines), fileInput))

os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot))
Contributor:

Same here with %d.

Contributor Author:

Agree. will change later

os.system("split -l %s mlt-nmt-temp.log -d -a 2 mlt-nmt.log." % str(quot))

# merge_translate: multi-instances translation, each instance trans whole_lines/instance_num lines of file, and merge into one complete output file
def merge_translate(cores, args):
Contributor:

Actually there is no merging taking place in this function.

Contributor Author:

Agree. will change this function later

os.system(ompStr)

# the total lines of input file will be translated
lines = len(open(args.input_file).readlines())
Contributor:

If the number of lines is computed here, you can pass it down so that it doesn't need to be recomputed in split_files().

[And same comment above about memory applies here as well]

Contributor Author:

Yes. will change later

#force stop 100s after the fisrt instance complete to jump out of the loop
while (process_cnt > 0 and stop_cnt < 1000):
process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
time.sleep(0.1)
Contributor:

0.1 s is probably too short a time, waking the main process 10 times per second for a job that will take at least minutes to complete. A better solution would probably use os.wait() or similar functions.

Contributor Author:

OK, I'll change to os.wait
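Waiting on the child processes directly, rather than polling `ps` every 0.1 s, might look like this sketch (`run_all_and_wait` is a hypothetical name):

```python
import subprocess

def run_all_and_wait(commands):
    """Start one process per shell command and block until every one
    exits. Returns the exit codes in the same order as `commands`."""
    procs = [subprocess.Popen(cmd, shell=True) for cmd in commands]
    return [p.wait() for p in procs]
```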

while (process_cnt > 0 and stop_cnt < 1000):
process_cnt = int(os.popen("ps -ef | grep 'sockeye.translate' | grep -v grep | wc -l").read().split('\n')[0])
time.sleep(0.1)
stop_cnt += 1
Contributor:

Wouldn't the stop_cnt trigger for really large files? Also there is no code to differentiate the two exit conditions.

Contributor Author:

OK. will use os.wait instead

#merge each part into one complete output file
os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))
#rm temp file
os.system("rm mlt-nmt* -rf")
Contributor:

This wouldn't be needed with the use of tempfile, as pointed out above.

stop_cnt += 1
end = time.time()
#merge each part into one complete output file
os.system("find . -name '*.result.en' | sort | xargs cat | head -n %s > %s" % (str(lines), fileOutput))
Contributor:

You only want the files in the current directory, right? A simple 'ls *.result.en' could replace both 'find [...]' and 'sort' (ls sorts its output). Actually, 'cat *.result.en' would also work for file counts below several thousand, which you are already assuming given the naming scheme.

Contributor Author:

OK, will change later.
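A Python-side merge that avoids the `find | sort | xargs cat | head` pipeline entirely could be sketched as follows (`merge_outputs` is a hypothetical name):

```python
import glob

def merge_outputs(pattern, out_path, total_lines):
    """Concatenate per-instance result files in sorted name order,
    keeping only the first `total_lines` lines (the remainder are
    padding replicas introduced by the input-splitting step)."""
    written = 0
    with open(out_path, "w") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part) as f:
                for line in f:
                    if written == total_lines:
                        return
                    out.write(line)
                    written += 1
```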


@tdomhan tdomhan left a comment


thanks for updating the PR!

README.md Outdated
@@ -161,6 +161,15 @@ You can translate as follows:
This will take the best set of parameters found during training and then translate strings from STDIN and
write translations to STDOUT.

#### Multi-instance Translate
Contributor:

Can you add a list entry in tutorials/README.md and move this section to tutorials/$your_folder_name/README.md?

Contributor Author:

OK

@@ -0,0 +1,129 @@
# Describtion: This script is used for CPU Multi-instance Translate,
Contributor:

I know @fhieber asked for moving this to cpu_benchmarking, but I find this name confusing as this is not really about benchmarking. What do people think about process_per_core_translation?

Contributor Author:

OK

README.md Outdated
@@ -161,6 +161,15 @@ You can translate as follows:
This will take the best set of parameters found during training and then translate strings from STDIN and
write translations to STDOUT.

#### Multi-instance Translate
Multi-instance can be used to greatly speedup translation in one multi-core processor computer.
Contributor:

maybe add a few more words about what multi-instance means, namely that you run one process per core, setting the CPU affinity, as I don't think this will be clear to users otherwise.

Contributor Author:

OK
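One process per core with CPU affinity, as described, can also be done from Python without `taskset`. A sketch, Linux-only (`os.sched_setaffinity` is not available on macOS/Windows), with `launch_pinned` as a hypothetical name:

```python
import os
import subprocess

def launch_pinned(core, cmd):
    """Start `cmd` with its CPU affinity restricted to a single core,
    mirroring the `taskset -c $i-$i ...` line in the shell script.
    Linux-only: relies on os.sched_setaffinity."""
    def pin():
        os.sched_setaffinity(0, {core})  # 0 = the child process itself
    return subprocess.Popen(cmd, shell=True, preexec_fn=pin)
```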


tdomhan commented Jun 25, 2018

thanks for iterating. You will need to rebase on the current master before merging.

@@ -0,0 +1,13 @@
# CPU process per core translation
On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
Using this methord, translation on each core can be parallel.
Contributor:

typo: method

Contributor Author:

OK

On multi-core processor computer, translation per core separately can speedup translation performance, due to some operation can't be handled parallel in one process.
Using this methord, translation on each core can be parallel.

One python script example is givne and you can run it as follows:
Contributor:

typo: given

Contributor Author:

changed. Thanks

@@ -0,0 +1,129 @@
# Describtion: This script is used for CPU Multi-instance Translate, which process per core translation.
# It can greatly speedup translate perefomance.
# FileName: cpu_process_per_core_translation.py
Contributor:

The filename and version seem unnecessary. The usage text would be best as part of the argparse description: argparse.ArgumentParser(description=...).

Contributor Author:

Done. Thanks
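Moving the usage text into the argparse description, as suggested, could look like this sketch (the option names are illustrative, not necessarily the PR's final flags):

```python
import argparse

def build_parser():
    """Parser whose --help output carries the script's description,
    replacing the filename/version/usage header comment."""
    parser = argparse.ArgumentParser(
        description="CPU multi-instance translation: run one "
                    "sockeye.translate process per core and merge outputs.")
    parser.add_argument("--input-file", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--benchmark", action="store_true")
    return parser
```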

def task(args):
os.system(args)

# benchmark: multi-instances translating, each instance trans the same input file separately
Contributor:

could you make this a proper doc-string for the method?

def benchmark(...):
   """
    doc goes here
   """

Also applies for the other functions below.

Contributor Author:

Good style. will follow.

@rongzha1 rongzha1 requested a review from mjpost as a code owner June 26, 2018 08:13

tdomhan commented Jun 26, 2018

thanks for iterating!


davvil commented Jun 27, 2018

Can you add a header stating the license (Apache) and authors?

Labels
WIP work in progress
4 participants