Why is parallel (multiprocess) mode sometimes slower? #63
@piaolingxue, hi. Looking at your code: you split the text into lines yourself and then call jieba.cut once per line, which throws away the advantage of the multiprocess mode. Jieba's parallel mode works like this: given one large text (say 100,000 lines) and 2 worker processes, one process segments the first 50,000 lines while the other segments the last 50,000 lines, both running at the same time; the results are then merged and returned. Since each of your jieba.cut calls only segments a single line, only one process ever does any work. |
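[Editor's note] The split/merge principle described above can be sketched with the standard library alone. This is a hypothetical illustration, not jieba's actual implementation: `fake_cut` is a stand-in tokenizer, and `parallel_cut` shows why the caller must hand over the whole text, not one line at a time.

```python
# Hypothetical sketch of the chunk/merge principle behind
# jieba.enable_parallel, with a stand-in tokenizer so the example
# stays standard-library only.
from multiprocessing import Pool


def fake_cut(chunk):
    # Stand-in for the real per-process segmentation work.
    return chunk.split()


def parallel_cut(text, processes=2):
    lines = text.splitlines()
    half = len(lines) // 2
    # Each worker gets one big chunk; calling this once per line
    # would leave every other worker idle.
    chunks = ["\n".join(lines[:half]), "\n".join(lines[half:])]
    with Pool(processes) as pool:
        results = pool.map(fake_cut, chunks)
    # Merge the per-chunk results back into a single word list.
    merged = []
    for part in results:
        merged.extend(part)
    return merged


if __name__ == "__main__":
    print(parallel_cut("a b\nc d\ne f\ng h"))
```

The key point is that the split happens once, over the whole input, so both workers stay busy for the entire run.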
I see, so it's the caller that should use multiple threads. I plan to wrap jieba as a web service with tornado and have clients call it through a RESTful interface; I wonder how the performance will be. |
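[Editor's note] A tokenizer-as-a-service can be sketched without tornado, using only the standard library. Everything here is an assumption for illustration: `fake_cut` stands in for jieba.cut, and the `/cut` endpoint name is invented.

```python
# Hypothetical minimal sketch of a segmentation web service,
# standard library only (tornado and jieba are stood in for).
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_cut(text):
    # Stand-in for jieba.cut: naive whitespace split.
    return text.split()


class SegmentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw UTF-8 body, segment it, answer with JSON.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps({"words": fake_cut(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


# Bind to an OS-chosen free port and serve from a background thread.
server = HTTPServer(("127.0.0.1", 0), SegmentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client call: POST plain text, get the word list back.
url = "http://127.0.0.1:%d/cut" % server.server_address[1]
req = urllib.request.Request(url, data="hello parallel world".encode("utf-8"))
with urllib.request.urlopen(req) as resp:
    words = json.loads(resp.read())["words"]
server.shutdown()
print(words)
```

One caveat worth noting: a per-request service segments one small text at a time, so it runs into the same "one small call can't be parallelized" issue discussed above unless requests are batched.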
#!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time

def test_segment(parallel=False):
    """Segment pins.json once and report the elapsed time."""
    if parallel:
        jieba.enable_parallel(2)
    start = time.time()
    with open('pins.json') as f:  # `file()` no longer exists in Python 3
        words = jieba.cut(f.read())
        " ".join(words)
    # The with-block closes the file; the old bare `f.closed` was a no-op.
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))

def main():
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main() |
@piaolingxue, it is multiprocessing, not multithreading. I originally wanted to use threads too, but found that for CPU-bound work Python threads cannot make use of multiple cores, so I switched to a multiprocess implementation. As for tornado, I haven't tried integrating with it; my demo site is built on bottle. |
What I meant is using multiple threads on the calling side: with two CPUs, call it from two threads. #!/usr/bin/python
# -*- coding: utf-8 -*-
import jieba
import time
import threading

class SegmentThread(threading.Thread):
    def __init__(self, lines, jieba):
        """Worker thread that segments the given lines."""
        threading.Thread.__init__(self)
        self.lines = lines
        self.jieba = jieba

    def run(self):
        for line in self.lines:
            results = self.jieba.cut(line)
            "".join(results)

def test_segment(parallel=False):
    """Segment pins.json with one or two caller threads and time it."""
    if parallel:
        jieba.enable_parallel(2)
    threads = []
    start = time.time()
    with open('pins.json') as f:  # `file()` no longer exists in Python 3
        lines = f.readlines()
    if parallel:
        half = len(lines) // 2  # integer division in Python 3
        t1 = SegmentThread(lines=lines[:half], jieba=jieba)
        t2 = SegmentThread(lines=lines[half:], jieba=jieba)
        threads.append(t1)
        threads.append(t2)
        t1.start()
        t2.start()
    else:
        t1 = SegmentThread(lines=lines, jieba=jieba)
        t1.start()
        threads.append(t1)
    for t in threads:
        t.join()
    print('parallel:%s, time elapsed:%f second' % (parallel, time.time() - start))

def main():
    jieba.set_dictionary('data/dict.txt.big')
    jieba.initialize()
    test_segment()
    test_segment(True)

if __name__ == '__main__':
    main() Looking at this, parallel mode is indeed faster.
|
The problem is: what do I do when I need the output line by line? |
The newline is also treated as a character, so each line ends up with a space at its start and end. It doesn't hurt training, though. |
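[Editor's note] Because the newline survives as a token, per-line output can be recovered after segmenting the whole text at once. A hedged sketch of that idea, with a stand-in tokenizer imitating jieba's newline-preserving behavior:

```python
# Hedged sketch: recovering per-line results after segmenting a
# multi-line text in one call. The stand-in tokenizer keeps "\n"
# as its own token, like the behavior described in this thread.
import re


def fake_cut(text):
    # Stand-in for jieba.cut: emit words and newline tokens.
    return re.findall(r"\n|\S+", text)


def cut_lines(text):
    joined = " ".join(fake_cut(text))
    # Splitting on the newline token gives one result per input line;
    # strip() removes the stray spaces around each newline.
    return [line.strip() for line in joined.split("\n")]


print(cut_lines("foo bar\nbaz qux"))
```

This way the big single call (which parallel mode needs) and per-line output are no longer in conflict.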
With a CSV like the following, the label gets split into several segments
|
How can I see the progress of segmentation? With multiprocessing the work does run in parallel, but I can no longer use tqdm to watch the progress. |
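[Editor's note] One hedged workaround, not a jieba API: chunk the text yourself and report progress once per chunk. Smaller chunks give finer-grained progress but less benefit from parallel mode, so the chunk size is a trade-off. `fake_cut` again stands in for jieba.cut:

```python
# Hypothetical sketch: segmenting in chunks so that progress can be
# reported per chunk (tqdm could wrap the loop the same way).
def fake_cut(text):
    # Stand-in for jieba.cut: naive whitespace split.
    return text.split()


def cut_with_progress(lines, chunk_size=2):
    words = []
    total = len(lines)
    for i in range(0, total, chunk_size):
        chunk = "\n".join(lines[i:i + chunk_size])
        words.extend(fake_cut(chunk))
        done = min(i + chunk_size, total)
        # With tqdm installed, this print could instead be
        # `for i in tqdm(range(0, total, chunk_size)): ...`.
        print("progress: %d/%d lines" % (done, total))
    return words


result = cut_with_progress(["a b", "c d", "e f"])
```

Each chunk is still large enough for the parallel workers to share, while the outer loop gives you a progress hook.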
Below is my test code
The results are as follows: