
count-in-parallel

Parallel word-frequency counting and tokenization.

Parallel word-frequency counting

When the dataset is small, we typically build a token frequency table with code like the following:

import itertools
import collections

# X holds the corpus: a list of token lists (or raw strings,
# in which case individual characters are counted)
X = [text1, text2, ..., textn]
words = collections.Counter(itertools.chain(*X))
print(words.most_common(20))

When the dataset is very large, this single-process approach becomes painfully slow. This repository provides a way to build the frequency table in parallel over a large corpus:

import jieba
from count_in_parallel import *  # assumed module name for this repository

path = "THUCNews/**/*.txt"
tokens = count_in_parrallel(
    tokenize=jieba.lcut,
    batch_generator=load_batch_texts(path, limit=10000),
    processes=6,
    maxsize=300,
    preprocess=lambda x: x.lower()
)
print(len(tokens))
print(tokens.most_common(200))
# Counter.most_common ignores negative n (it returns an empty list),
# so slice from the end to get the 200 least common tokens
print(tokens.most_common()[:-201:-1])
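The README does not show how count_in_parrallel and load_batch_texts are implemented. A minimal sketch of the same idea, using only the standard library's multiprocessing.Pool, might look like the following; the names and signatures mirror the usage above, but the bodies (including the batch_size parameter) are reconstructions, not the repository's actual code:

import glob
import collections
from multiprocessing import Pool

def load_batch_texts(path, limit=10000, batch_size=100):
    # Yield lists of file contents matched by a glob pattern,
    # stopping after `limit` files (batch_size is an assumption)
    batch = []
    for i, name in enumerate(glob.iglob(path, recursive=True)):
        if i >= limit:
            break
        with open(name, encoding="utf-8") as fp:
            batch.append(fp.read())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

_tokenize = None
_preprocess = None

def _init_worker(tokenize, preprocess):
    # Runs once per worker; keeping the callables here means only
    # the text batches travel through the task queue
    global _tokenize, _preprocess
    _tokenize = tokenize
    _preprocess = preprocess

def _count_batch(batch):
    counter = collections.Counter()
    for text in batch:
        counter.update(_tokenize(_preprocess(text)))
    return counter

def count_in_parrallel(tokenize, batch_generator, processes=4,
                       maxsize=300, preprocess=lambda x: x):
    # maxsize is accepted for interface parity; in the repository it
    # presumably bounds an internal queue
    tokens = collections.Counter()
    with Pool(processes, initializer=_init_worker,
              initargs=(tokenize, preprocess)) as pool:
        for counter in pool.imap_unordered(_count_batch, batch_generator):
            tokens.update(counter)
    return tokens

Passing the callables through a Pool initializer keeps them out of the task queue; under the default fork start method on Linux the lambda preprocess works as-is, while under spawn it would need to be a picklable module-level function.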

Parallel tokenization

An example of parallel tokenization. Results are returned in the original sentence order, which makes downstream Tokenizer processing, model training, and similar steps straightforward:

import jieba
from tokenize_in_parallel import *

file = "THUCNews-title-label.txt"

def gen(file):
    # Yield the file line by line, dropping the trailing empty
    # entry left by the final newline
    with open(file, encoding="utf-8") as fp:
        text = fp.read()
    lines = text.split("\n")[:-1]
    for line in lines:
        yield line

# results come back in the original input order
tokens = tokenize_in_parallel(
    tokenize=jieba.lcut,
    generator=gen(file),
    processes=7,
    maxsize=300,
    preprocess=lambda x: x.lower()
)

print(len(tokens))
for i in range(10):
    print(tokens[i])
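tokenize_in_parallel is the repository's helper; its defining property is that results come back in input order. A minimal sketch of that behavior, assuming the same worker-initializer pattern as above, is to build on Pool.imap, which, unlike imap_unordered, yields results in submission order:

from multiprocessing import Pool

_tokenize = None
_preprocess = None

def _init_worker(tokenize, preprocess):
    global _tokenize, _preprocess
    _tokenize = tokenize
    _preprocess = preprocess

def _cut(line):
    return _tokenize(_preprocess(line))

def tokenize_in_parallel(tokenize, generator, processes=4,
                         maxsize=300, preprocess=lambda x: x):
    # imap preserves submission order, so tokens[i] corresponds to
    # the i-th input sentence; chunksize trades latency for throughput
    with Pool(processes, initializer=_init_worker,
              initargs=(tokenize, preprocess)) as pool:
        return list(pool.imap(_cut, generator, chunksize=100))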

Benchmarks on the THUCNews corpus, with jieba as the tokenizer:

CPUs    Files      Time
6       10,000     8 s
4       10,000     11 s
2       10,000     16 s
1       10,000     32 s
1       800,000    ~40 min
6       800,000    350 s

The scaling is solid: with 6 CPUs, 10,000 files drop from 32 s to 8 s (a 4x speedup), and 800,000 files from roughly 40 minutes to 350 s (close to 7x).
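To reproduce numbers in this range, a simple timing harness such as the following can be used (hypothetical: it reuses the assumed import from above, and the THUCNews glob pattern must point at your local copy of the corpus):

import time
import jieba
from count_in_parallel import *  # assumed module name, as above

for processes in (1, 2, 4, 6):
    start = time.perf_counter()
    tokens = count_in_parrallel(
        tokenize=jieba.lcut,
        batch_generator=load_batch_texts("THUCNews/**/*.txt", limit=10000),
        processes=processes,
        maxsize=300,
        preprocess=lambda x: x.lower(),
    )
    print(f"{processes} CPUs: {time.perf_counter() - start:.1f} s")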
