Why is parallel (multi-process) mode sometimes slower? #63

Closed

piaolingxue opened this issue Jun 21, 2013 · 9 comments

@piaolingxue (Contributor)

Here is my test code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time

def test_segment(parallel=False):
    """Segment pins.json line by line and report the elapsed time."""
    if parallel:
        jieba.enable_parallel(2)
    start = time.time()
    with open('pins.json') as f:  # 'with' closes the file automatically
        for line in f:
            words = jieba.cut(line)
            " ".join(words)  # consume the generator
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time sequential vs. parallel runs."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

The results are as follows:

Building Trie..., from /home/matrix/workspace/huaban_segmentation_service/data/dict.txt.big
loading model from cache /tmp/jieba.user.5209089772060894903.cache
loading model cost  1.39316701889 seconds.
Trie has been built succesfully.
parallel:False, time elapsed:1.211206 second
parallel:True, time elapsed:2.321984 second
@fxsjy (Owner) commented Jun 21, 2013

@piaolingxue, hi. Looking at your code: you split the text into lines yourself and then call jieba.cut on each line, so the multi-process speedup never gets a chance to kick in.

jieba's parallel mode works like this: given a large text (say 100,000 lines) and 2 worker processes, one process segments the first 50,000 lines and the other segments the last 50,000, running at the same time. The partial results are then merged and returned.

Because each of your jieba.cut calls only ever segments a single line, effectively only one process does the work.
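[Editor's note] A minimal sketch of this split-and-merge idea, using Python's multiprocessing.Pool rather than jieba's actual internals; segment_chunk and the chunking scheme are illustrative:

from multiprocessing import Pool

import jieba

def segment_chunk(lines):
    # Each worker process segments its own slice of the input.
    return [" ".join(jieba.cut(line)) for line in lines]

def parallel_cut(lines, processes=2):
    # Split the input into one slice per process, segment the slices
    # concurrently, then merge the partial results back in order.
    size = (len(lines) + processes - 1) // processes
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(processes) as pool:
        parts = pool.map(segment_chunk, chunks)
    return [seg for part in parts for seg in part]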

@piaolingxue (Contributor, Author)

Understood. So the trick is to use multiple threads on the calling side. I plan to wrap jieba as a web service with tornado, with clients calling it over a RESTful interface; I wonder how the performance would be.
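[Editor's note] A minimal sketch of such a service, assuming tornado is installed; the /cut route and the "text" parameter are illustrative, not from this thread. Note that jieba.cut is CPU-bound, so a single tornado process would still block on each request, which ties into the GIL point made below.

import jieba
import tornado.ioloop
import tornado.web

class CutHandler(tornado.web.RequestHandler):
    def post(self):
        # Segment the posted text and return space-joined tokens.
        text = self.get_body_argument("text", default="")
        self.write(" ".join(jieba.cut(text)))

if __name__ == "__main__":
    jieba.initialize()  # load the dictionary once at startup
    app = tornado.web.Application([(r"/cut", CutHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()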

@fxsjy (Owner) commented Jun 21, 2013

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time

def test_segment(parallel=False):
    """Segment pins.json in a single jieba.cut call and report the elapsed time."""
    if parallel:
        jieba.enable_parallel(2)
    start = time.time()
    with open('pins.json') as f:
        words = jieba.cut(f.read())  # hand jieba the whole text at once
        " ".join(words)  # consume the generator
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time sequential vs. parallel runs."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

fxsjy closed this as completed Jun 21, 2013
fxsjy reopened this Jun 21, 2013
@fxsjy (Owner) commented Jun 21, 2013

@piaolingxue, it's multi-process, not multi-threaded. I originally wanted to use threads too, but found that for CPU-bound tasks Python cannot use multiple CPU cores even with multiple threads (because of the GIL), so I switched to a multi-process implementation.

As for tornado, I haven't tried that integration, so I don't know whether it causes problems; my demo site is built on bottle.
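[Editor's note] A toy demonstration of that GIL point, as a sketch rather than anything from this thread: the same pure-Python CPU-bound function run on two threads takes roughly as long as running it serially, while two processes can actually use two cores.

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def burn(n):
    # Pure-Python CPU-bound loop; a thread running it holds the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers=2, n=5000000):
    start = time.time()
    with executor_cls(workers) as ex:
        list(ex.map(burn, [n] * workers))
    return time.time() - start

if __name__ == '__main__':
    print('threads:   %.2f s' % timed(ThreadPoolExecutor))   # ~serial, GIL-bound
    print('processes: %.2f s' % timed(ProcessPoolExecutor))  # ~parallel across cores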

@piaolingxue (Contributor, Author)

What I meant was using multiple threads on the calling side. For example, with two CPUs, call jieba from two threads:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time
import threading

class SegmentThread(threading.Thread):

    def __init__(self, lines, jieba):
        """Segment the given lines on a dedicated thread."""
        threading.Thread.__init__(self)
        self.lines = lines
        self.jieba = jieba

    def run(self):
        for line in self.lines:
            results = self.jieba.cut(line)
            "".join(results)  # consume the generator


def test_segment(parallel=False):
    """Time segmentation with one calling thread vs. two."""
    if parallel:
        jieba.enable_parallel(2)
    threads = []
    start = time.time()
    with open('pins.json') as f:
        lines = f.readlines()
    half = len(lines) // 2  # integer division, so Python 3 works too
    if parallel:
        t1 = SegmentThread(lines=lines[:half], jieba=jieba)
        t2 = SegmentThread(lines=lines[half:], jieba=jieba)
        threads.extend([t1, t2])
        t1.start()
        t2.start()
    else:
        t1 = SegmentThread(lines=lines, jieba=jieba)
        t1.start()
        threads.append(t1)
    for t in threads:
        t.join()
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time both configurations."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

Looked at this way, parallel mode is indeed faster:

Building Trie..., from /home/matrix/workspace/huaban_segmentation_service/data/dict.txt.big
loading model from cache /tmp/jieba.user.5209089772060894903.cache
loading model cost  1.37951397896 seconds.
Trie has been built succesfully.
parallel:False, time elapsed:1.249306 second
parallel:True, time elapsed:1.093940 second

@callzhang
But what if I need the output line by line?

with open('pins.json') as f:
    words = jieba.cut(f.read())
    " ".join(words)

This way I only get one big block of output.
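[Editor's note] One workaround, sketched here on the assumption (confirmed in the next comment) that jieba passes newline characters through as tokens: segment the whole file in one parallel cut, then split the joined output on '\n' to recover the line boundaries.

import jieba

jieba.enable_parallel(2)
with open('pins.json') as f:
    words = jieba.cut(f.read())  # one parallel cut over the whole file
# '\n' survives as a token, so splitting on it restores the lines.
for line in " ".join(words).split("\n"):
    print(line.strip())  # strip the stray spaces around each line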

@kn45 commented Dec 9, 2016

The newline character is also treated as a token, so each line ends up with an extra space at its start and end. That doesn't affect training, though.

@shm007g commented Aug 6, 2018

With a csv like the one below, the label gets split into several pieces: '|', '__', 'label', '__', 'edu'. The only fix is to substitute digits for the labels. That becomes a real problem when a tool mandates a label format, such as fastText's __label__.

content label
abcdefg __label__edu
hijklmnopq __label__finance
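[Editor's note] A sketch of that digit-substitution workaround; the placeholder scheme and the cut_with_labels helper are illustrative, not part of jieba. Protect each fastText-style label with a purely numeric token, which jieba keeps whole, then map it back after segmentation.

import re
import jieba

def cut_with_labels(line):
    # Swap each __label__ tag for a numeric placeholder before cutting.
    placeholders = {}
    for i, lab in enumerate(set(re.findall(r'__label__\w+', line))):
        ph = '9%04d9' % i  # a pure-digit run stays a single jieba token
        placeholders[ph] = lab
        line = line.replace(lab, ph)
    # Segment, then restore the original labels.
    return [placeholders.get(tok, tok) for tok in jieba.cut(line)]

print(' '.join(cut_with_labels('abcdefg __label__edu')))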

@MrKZZ commented Jul 6, 2022

How can I watch the progress of segmenting a text? With multi-process mode the work does run in parallel, but I can no longer use tqdm to see a progress bar.
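[Editor's note] One way to get a progress bar back, sketched here assuming tqdm is installed: feed jieba multi-line chunks so parallel mode still helps, and advance the bar once per chunk. The chunk size is illustrative.

import jieba
from tqdm import tqdm

jieba.enable_parallel(2)
CHUNK = 1000  # lines per jieba.cut call; tune to taste

with open('pins.json') as f:
    lines = f.readlines()

results = []
for i in tqdm(range(0, len(lines), CHUNK)):
    text = ''.join(lines[i:i + CHUNK])         # one multi-line chunk
    results.append(' '.join(jieba.cut(text)))  # parallel cut per chunk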
