Why is parallel (multi-process) mode sometimes slower? #63

Closed

piaolingxue opened this issue Jun 21, 2013 · 9 comments

@piaolingxue (Contributor)

Here is my test code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time

def test_segment(parallel=False):
    """Segment pins.json line by line and report the elapsed time."""
    if parallel:
        jieba.enable_parallel(2)
    start = time.time()
    with open('pins.json') as f:  # 'with' closes the file automatically
        for line in f:
            words = jieba.cut(line)
            " ".join(words)  # consume the generator
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time sequential vs. parallel runs."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

The results are as follows:

Building Trie..., from /home/matrix/workspace/huaban_segmentation_service/data/dict.txt.big
loading model from cache /tmp/jieba.user.5209089772060894903.cache
loading model cost  1.39316701889 seconds.
Trie has been built succesfully.
parallel:False, time elapsed:1.211206 second
parallel:True, time elapsed:2.321984 second
@fxsjy (Owner) commented Jun 21, 2013

@piaolingxue, hi. Looking at your code: you split the text into lines yourself and then call jieba.cut on each line, so the multi-process speedup never gets a chance to kick in.

jieba's parallel mode works like this: given a large text (say 100,000 lines) and 2 worker processes, one process segments the first 50,000 lines and the other segments the last 50,000, running at the same time. The partial results are then merged and returned.

Because each of your jieba.cut calls only ever segments a single line, effectively only one process does the work.
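[Editor's note] A minimal sketch of this split-and-merge idea, using Python's multiprocessing.Pool rather than jieba's actual internals; segment_chunk and the chunking scheme are illustrative:

from multiprocessing import Pool

import jieba

def segment_chunk(lines):
    # Each worker process segments its own slice of the input.
    return [" ".join(jieba.cut(line)) for line in lines]

def parallel_cut(lines, processes=2):
    # Split the input into one slice per process, segment the slices
    # concurrently, then merge the partial results back in order.
    size = (len(lines) + processes - 1) // processes
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(processes) as pool:
        parts = pool.map(segment_chunk, chunks)
    return [seg for part in parts for seg in part]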

@piaolingxue (Contributor, Author)

Understood. So the trick is to use multiple threads on the calling side. I plan to wrap jieba as a web service with tornado, with clients calling it over a RESTful interface; I wonder how the performance would be.
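[Editor's note] A minimal sketch of such a service, assuming tornado is installed; the /cut route and the "text" parameter are illustrative, not from this thread. Note that jieba.cut is CPU-bound, so a single tornado process would still block on each request, which ties into the GIL point made below.

import jieba
import tornado.ioloop
import tornado.web

class CutHandler(tornado.web.RequestHandler):
    def post(self):
        # Segment the posted text and return space-joined tokens.
        text = self.get_body_argument("text", default="")
        self.write(" ".join(jieba.cut(text)))

if __name__ == "__main__":
    jieba.initialize()  # load the dictionary once at startup
    app = tornado.web.Application([(r"/cut", CutHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()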

@fxsjy (Owner) commented Jun 21, 2013

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time

def test_segment(parallel=False):
    """Segment pins.json in a single jieba.cut call and report the elapsed time."""
    if parallel:
        jieba.enable_parallel(2)
    start = time.time()
    with open('pins.json') as f:
        words = jieba.cut(f.read())  # hand jieba the whole text at once
        " ".join(words)  # consume the generator
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time sequential vs. parallel runs."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

fxsjy closed this as completed Jun 21, 2013
fxsjy reopened this Jun 21, 2013
@fxsjy (Owner) commented Jun 21, 2013

@piaolingxue, it's multi-process, not multi-threaded. I originally wanted to use threads too, but found that for CPU-bound tasks Python cannot use multiple CPU cores even with multiple threads (because of the GIL), so I switched to a multi-process implementation.

As for tornado, I haven't tried that integration, so I don't know whether it causes problems; my demo site is built on bottle.
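[Editor's note] A toy demonstration of that GIL point, as a sketch rather than anything from this thread: the same pure-Python CPU-bound function run on two threads takes roughly as long as running it serially, while two processes can actually use two cores.

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def burn(n):
    # Pure-Python CPU-bound loop; a thread running it holds the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers=2, n=5000000):
    start = time.time()
    with executor_cls(workers) as ex:
        list(ex.map(burn, [n] * workers))
    return time.time() - start

if __name__ == '__main__':
    print('threads:   %.2f s' % timed(ThreadPoolExecutor))   # ~serial, GIL-bound
    print('processes: %.2f s' % timed(ProcessPoolExecutor))  # ~parallel across cores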

@piaolingxue (Contributor, Author)

What I meant was using multiple threads on the calling side. For example, with two CPUs, call jieba from two threads:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time
import threading

class SegmentThread(threading.Thread):

    def __init__(self, lines, jieba):
        """Segment the given lines on a dedicated thread."""
        threading.Thread.__init__(self)
        self.lines = lines
        self.jieba = jieba

    def run(self):
        for line in self.lines:
            results = self.jieba.cut(line)
            "".join(results)  # consume the generator


def test_segment(parallel=False):
    """Time segmentation with one calling thread vs. two."""
    if parallel:
        jieba.enable_parallel(2)
    threads = []
    start = time.time()
    with open('pins.json') as f:
        lines = f.readlines()
    half = len(lines) // 2  # integer division, so Python 3 works too
    if parallel:
        t1 = SegmentThread(lines=lines[:half], jieba=jieba)
        t2 = SegmentThread(lines=lines[half:], jieba=jieba)
        threads.extend([t1, t2])
        t1.start()
        t2.start()
    else:
        t1 = SegmentThread(lines=lines, jieba=jieba)
        t1.start()
        threads.append(t1)
    for t in threads:
        t.join()
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))


def main():
    """Load the big dictionary, then time both configurations."""
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main()

Looked at this way, parallel mode is indeed faster:

Building Trie..., from /home/matrix/workspace/huaban_segmentation_service/data/dict.txt.big
loading model from cache /tmp/jieba.user.5209089772060894903.cache
loading model cost  1.37951397896 seconds.
Trie has been built succesfully.
parallel:False, time elapsed:1.249306 second
parallel:True, time elapsed:1.093940 second

@callzhang
But what if I need the output line by line?

with open('pins.json') as f:
    words = jieba.cut(f.read())
    " ".join(words)

This way I only get one big block of output.
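[Editor's note] One workaround, sketched here on the assumption (confirmed in the next comment) that jieba passes newline characters through as tokens: segment the whole file in one parallel cut, then split the joined output on '\n' to recover the line boundaries.

import jieba

jieba.enable_parallel(2)
with open('pins.json') as f:
    words = jieba.cut(f.read())  # one parallel cut over the whole file
# '\n' survives as a token, so splitting on it restores the lines.
for line in " ".join(words).split("\n"):
    print(line.strip())  # strip the stray spaces around each line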

@kn45 commented Dec 9, 2016

The newline character is also treated as a token, so each line ends up with an extra space at its start and end. That doesn't affect training, though.

@shm007g commented Aug 6, 2018

With a csv like the one below, the label gets split into several pieces: '|', '__', 'label', '__', 'edu'. The only fix is to substitute digits for the labels. That becomes a real problem when a tool mandates a label format, such as fastText's __label__.

content label
abcdefg __label__edu
hijklmnopq __label__finance
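[Editor's note] A sketch of that digit-substitution workaround; the placeholder scheme and the cut_with_labels helper are illustrative, not part of jieba. Protect each fastText-style label with a purely numeric token, which jieba keeps whole, then map it back after segmentation.

import re
import jieba

def cut_with_labels(line):
    # Swap each __label__ tag for a numeric placeholder before cutting.
    placeholders = {}
    for i, lab in enumerate(set(re.findall(r'__label__\w+', line))):
        ph = '9%04d9' % i  # a pure-digit run stays a single jieba token
        placeholders[ph] = lab
        line = line.replace(lab, ph)
    # Segment, then restore the original labels.
    return [placeholders.get(tok, tok) for tok in jieba.cut(line)]

print(' '.join(cut_with_labels('abcdefg __label__edu')))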

@MrKZZ commented Jul 6, 2022

How can I watch the progress of segmenting a text? With multi-process mode the work does run in parallel, but I can no longer use tqdm to see a progress bar.
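[Editor's note] One way to get a progress bar back, sketched here assuming tqdm is installed: feed jieba multi-line chunks so parallel mode still helps, and advance the bar once per chunk. The chunk size is illustrative.

import jieba
from tqdm import tqdm

jieba.enable_parallel(2)
CHUNK = 1000  # lines per jieba.cut call; tune to taste

with open('pins.json') as f:
    lines = f.readlines()

results = []
for i in tqdm(range(0, len(lines), CHUNK)):
    text = ''.join(lines[i:i + CHUNK])         # one multi-line chunk
    results.append(' '.join(jieba.cut(text)))  # parallel cut per chunk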
