JioNLP: A Python Library for Chinese NLP Preprocessing & Parsing

Installation: `pip install jionlp`

JioNLP online is provided for a quick trial of some functions.

Chinese version: README.md

Working on an NLP task and need to clean and filter a corpus? Use JioNLP.
Working on an NLP task and need to extract key information? Use JioNLP.
Working on an NLP task and need text augmentation? Use JioNLP.
Working on an NLP task and need the radical, pinyin, or traditional form of Chinese characters? Use JioNLP.

In short, JioNLP offers a bundle of preprocessing and parsing tools for NLP tasks that are accurate, efficient, and easy to use.

Main functions include: cleaning text; deleting HTML tags, exceptional characters, and redundant characters; converting full-width (full-angle) characters to half-width; extracting emails, QQ numbers, phone numbers, parenthesized text, ID card numbers, IP addresses, URLs, money amounts and currency units, and numbers; parsing time text; extracting keyphrases; loading Chinese dictionaries; and augmenting Chinese text.
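The full-width to half-width conversion mentioned above can be illustrated with a minimal sketch. This is not JioNLP's implementation; the function name `full_to_half` is hypothetical, and the sketch relies only on the Unicode layout, where full-width ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their half-width counterparts and U+3000 is the full-width space.

```python
def full_to_half(text: str) -> str:
    """Convert full-width (full-angle) characters to half-width ones."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic (full-width) space
            converted.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:  # full-width variants of printable ASCII
            converted.append(chr(code - 0xFEE0))
        else:                           # everything else is left untouched
            converted.append(ch)
    return ''.join(converted)


print(full_to_half('ＪｉｏＮＬＰ　２０２２！'))
```

Chinese characters and Chinese punctuation such as 。 fall outside the converted range, so they pass through unchanged.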

Update 2022-05-26

Update Keyphrase extraction

jio.keyphrase.extract_keyphrase: extract keyphrases from a Chinese text

>>> import jionlp as jio
>>> text = '浑水创始人：七月开始调查贝壳，因为“好得难以置信” 2021年12月16日，做空机构浑水在社交媒体上公开表示，正在做空美股上市公司贝壳...'

>>> keyphrases = jio.keyphrase.extract_keyphrase(text)
>>> print(keyphrases)
# ['浑水创始人', '开始调查贝壳', '做空机构浑水', '美股上市公司贝壳', '美国证监会']

>>> print(jio.keyphrase.extract_keyphrase.__doc__)  # print the docstring for parameter details

Update 2021-10-25

Update money text parser

jio.parse_money: parse a money text to get its amount ('num'), its currency unit ('case'), and whether the amount is exact or approximate ('definition')

import jionlp as jio
text_list = ['约4.287亿美元', '两个亿卢布', '六十四万零一百四十三元一角七分', '3000多欧元', '三五佰块钱', '七百到九百亿泰铢'] 
moneys = [jio.parse_money(text) for text in text_list]

# 约4.287亿美元: {'num': '428700000.00', 'case': '美元', 'definition': 'blur'}
# 两个亿卢布: {'num': '200000000.00', 'case': '卢布', 'definition': 'accurate'}
# 六十四万零一百四十三元一角七分: {'num': '640143.17', 'case': '元', 'definition': 'accurate'}
# 3000多欧元: {'num': ['3000.00', '4000.00'], 'case': '欧元', 'definition': 'blur'}
# 三五佰块钱: {'num': ['300.00', '500.00'], 'case': '元', 'definition': 'blur'}
# 七百到九百亿泰铢: {'num': ['70000000000.00', '90000000000.00'], 'case': '泰铢', 'definition': 'blur'}
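As the output above shows, 'num' is a string for a single amount and a list of two strings for a range. A minimal sketch of normalizing both shapes into numeric bounds follows. The helper `money_bounds` is hypothetical and not part of JioNLP, and the dictionaries are copied from the example output rather than produced by a live call.

```python
from decimal import Decimal


def money_bounds(parsed: dict) -> tuple:
    """Return (low, high) Decimal bounds for a parse_money-style result."""
    num = parsed['num']
    if isinstance(num, list):   # a range such as ['3000.00', '4000.00']
        low, high = (Decimal(n) for n in num)
    else:                       # a single amount such as '640143.17'
        low = high = Decimal(num)
    return low, high


# Result dictionaries copied from the example output above.
exact = {'num': '640143.17', 'case': '元', 'definition': 'accurate'}
fuzzy = {'num': ['3000.00', '4000.00'], 'case': '欧元', 'definition': 'blur'}

print(money_bounds(exact))
print(money_bounds(fuzzy))
```

Decimal is used instead of float so that amounts such as 640143.17 keep their exact two-digit precision.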

Update 2022-03-07

Update Time semantic parser

jio.parse_time: parse a given time string

import time
import jionlp as jio
# Collect every result so that each one is printed, not only the last.
results = [
    jio.parse_time('今年9月', time_base={'year': 2021}),
    jio.parse_time('零三年元宵节晚上8点半', time_base=time.time()),
    jio.parse_time('一万个小时'),
    jio.parse_time('100天之后', time.time()),
    jio.parse_time('四月十三', lunar_date=False),
    jio.parse_time('每周五下午4点', time.time(), period_results_num=2),
]
for res in results:
    print(res)

# {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-09-01 00:00:00', '2021-09-30 23:59:59']}
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2003-02-15 20:30:00', '2003-02-15 20:30:59']}
# {'type': 'time_delta', 'definition': 'accurate', 'time': {'hour': 10000.0}}
# {'type': 'time_span', 'definition': 'blur', 'time': ['2021-10-22 00:00:00', 'inf']}
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-13 00:00:00', '2022-04-13 23:59:59']}
# {'type': 'time_period', 'definition': 'accurate', 'time': {'delta': {'day': 7},
#  'point': {'time': [['2021-07-16 16:00:00', '2021-07-16 16:59:59'],
#                     ['2021-07-23 16:00:00', '2021-07-23 16:59:59']], 'string': '周五下午4点'}}}
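The 'time' values of time_span and time_point results are plain strings, so downstream code usually needs to convert them. A minimal sketch using only the standard library follows. The helper `to_datetimes` is hypothetical, and the dictionaries are copied from the example output above. Mapping the open end 'inf' to None is a choice made for this sketch.

```python
from datetime import datetime

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def to_datetimes(parsed: dict) -> tuple:
    """Convert the 'time' pair of a time_span/time_point result to datetimes."""
    if parsed['type'] not in ('time_span', 'time_point'):
        raise ValueError('expected a time_span or time_point result')
    return tuple(
        # An open end is written as 'inf' in the example output; map it to None.
        None if value == 'inf' else datetime.strptime(value, TIME_FORMAT)
        for value in parsed['time']
    )


# Result dictionaries copied from the example output above.
september = {'type': 'time_span', 'definition': 'accurate',
             'time': ['2021-09-01 00:00:00', '2021-09-30 23:59:59']}
open_ended = {'type': 'time_span', 'definition': 'blur',
              'time': ['2021-10-22 00:00:00', 'inf']}

start, end = to_datetimes(september)
print(end - start)               # length of the span as a timedelta
print(to_datetimes(open_ended))  # second element is None for the open end
```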

About the time semantic parser
All test cases

Installation

Python >= 3.6 is required. Install from GitHub:

$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .

Or install with pip:

$ pip install jionlp

Features

Import jionlp and check the main functions and their docstrings

>>> import jionlp as jio
>>> jio.help()  # input the keywords, such as “回译”, which means back translation
>>> dir(jio)
>>> print(jio.extract_parentheses.__doc__)

On Linux, the following command can be used in place of jio.help().

$ jio_help

A star (⭐) marks a particularly useful feature.

1.Gadgets

Features	Function name	Description	Star
help search tool	help	search JioNLP features by keyword when you are unsure which tool to use
time semantic parser	parse_time	get the timestamp and span of a given time text	⭐
keyphrase extraction	extract_keyphrase	extract the keyphrases of a given text	⭐
extractive summary	extract_summary	extract the summary of a given text
stopwords filter	remove_stopwords	remove stopwords from a word list generated from a text	⭐
sentence splitter	split_sentence	split a text into sentences	⭐
location parser	parse_location	get the province, city, county, town, and village name of a location text	⭐
telephone number parser	phone_location cell_phone_location landline_phone_location	get the province, city, and telecom operator of a telephone number
news location recognizer	recognize_location	get the country, province, city, county name of a news text	⭐
solar/lunar date conversion	lunar2solar solar2lunar	convert a lunar (solar) date to the corresponding solar (lunar) date
ID cards parser	parse_id_card	get the province, city, county, birthday, gender, and checking code of a given Chinese ID card number	⭐
idiom solitaire	idiom_solitaire	a word game that chains Chinese idioms so that the first character of each idiom has the same pronunciation as the last character of the previous one
traditional chars to simplified chars	tra2sim	convert traditional characters to their simplified forms
simplified chars to traditional chars	sim2tra	convert simplified characters to their traditional forms
characters to pinyin	pinyin	get the pinyin of Chinese chars to add pronunciation info to the NLP model input	⭐
characters to radical	char_radical	get the radical info of Chinese chars to add to the NLP model input	⭐
money numbers to chars	money_num2char	convert a numeric money amount to Chinese characters
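The ID card parser row above mentions the birthday, gender, and checking code of an ID number. A minimal sketch of how those three fields can be derived from an 18-character number follows, using the standard check-character scheme of GB 11643-1999 (ISO 7064 MOD 11-2). This is not JioNLP's code. The function names and the returned keys are hypothetical, and the sample number is a commonly used illustrative one rather than anything from this README.

```python
# Position weights and check characters defined by GB 11643-1999.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = '10X98765432'


def id_card_check_char(first17: str) -> str:
    """Compute the 18th (check) character from the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def describe_id_card(number: str) -> dict:
    """Derive birthday, gender, and check-code validity from an ID number."""
    number = number.upper()
    if len(number) != 18 or not number[:17].isdigit():
        raise ValueError('expected an 18-character ID number')
    return {
        'birthday': f'{number[6:10]}-{number[10:12]}-{number[12:14]}',
        # The 17th digit is odd for male and even for female.
        'gender': 'male' if int(number[16]) % 2 else 'female',
        'check_code_valid': id_card_check_char(number[:17]) == number[17],
    }


print(describe_id_card('11010519491231002X'))
```

Resolving the first six digits to a province, city, and county requires an administrative-division table, which this sketch leaves out.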

2.Text Augmentation

Description of all text augmentation methods

Features	Function name	Description	Star
back translation	BackTranslation	get augmented text via back translation	⭐
swap char position	swap_char_position	get augmented text via swapping the position of adjacent chars
homophone substitution	homophone_substitution	replace chars with the same pronunciation to get augmented text	⭐
randomly add & delete chars	random_add_delete	add and delete chars randomly in the text to get augmented text
NER entity replacement	replace_entity	replace the entity of the text via dictionary to get augmented text	⭐
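To make the idea behind adjacent-character swapping concrete, here is a minimal sketch. It is not JioNLP's implementation. The function name `swap_adjacent_chars` and the parameters `swap_ratio` and `seed` are assumptions chosen for illustration.

```python
import random


def swap_adjacent_chars(text: str, swap_ratio: float = 0.1, seed=None) -> str:
    """Swap randomly chosen pairs of adjacent characters."""
    rng = random.Random(seed)   # local generator, so results are reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_ratio:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2              # skip the pair: a char moves at most one position
        else:
            i += 1
    return ''.join(chars)


original = '做空机构浑水在社交媒体上公开表示'
augmented = swap_adjacent_chars(original, swap_ratio=0.3, seed=0)
print(augmented)
```

The augmented text always contains the same characters as the original, so labels attached to the whole sentence (for example, a class label) remain valid. Span-level labels would need to be remapped.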

3.Key info extraction and parsing with regular expressions

Features	Function name	Description	Star
clean text	clean_text	delete exceptional and redundant chars, HTML tags, parenthesized text, URLs, emails, and phone numbers	⭐
extract E-mail	extract_email	extract email info from text
parse money text	extract_money	parse money text	⭐
extract phone number	extract_phone_number	extract landline and cell phone numbers
extract Chinese ID card	extract_id_card	extract Chinese ID card info and parse it with jio.parse_id_card
extract QQ	extract_qq	extract Tencent QQ numbers
extract URL	extract_url	extract URL info
extract IP	extract_ip_address	extract IPv4 address
extract parenthesis info	extract_parentheses	extract text enclosed in {}「」[]【】()（）<>《》	⭐
delete E-mail	remove_email	delete E-mail info from the given text
delete URL	remove_url	delete URL info
delete phone num	remove_phone_number	delete telephone numbers
delete IP	remove_ip_address	delete IP address
delete Chinese ID card	remove_id_card	delete Chinese ID card info
delete QQ	remove_qq	delete QQ numbers
delete HTML tags	remove_html_tag	delete HTML tags
delete parenthesis info	remove_parentheses	delete text enclosed in {}「」[]【】()（）<>《》
delete exceptional chars	remove_exception_char	delete exceptional chars
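The extractors in this table are regular-expression based. As an illustration of the general approach, and not of JioNLP's actual patterns, the following sketch extracts IPv4 addresses with the standard `re` module. The function name `extract_ipv4` and the returned 'text'/'offset' layout are assumptions made for this example.

```python
import re

# Four dot-separated groups of 1-3 digits, not glued to other digits or dots.
IPV4_PATTERN = re.compile(r'(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])')


def extract_ipv4(text: str) -> list:
    """Return every IPv4 address in the text with its character offsets."""
    results = []
    for match in IPV4_PATTERN.finditer(text):
        # The regex only checks the shape; each part must also be <= 255.
        if all(int(part) <= 255 for part in match.group().split('.')):
            results.append({'text': match.group(), 'offset': list(match.span())})
    return results


sample = '服务器 192.168.1.10 已下线，备用地址为 10.0.0.256 和 8.8.8.8。'
for item in extract_ipv4(sample):
    print(item)
```

One limitation of this sketch: an address directly followed by an ASCII period (as at the end of an English sentence) is rejected by the trailing lookahead.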

4.File reader and writer

Features	Function name	Description	Star
read file by iteration	read_file_by_iter	read a file by iteration to get a JSON list
read file by line	read_file_by_line	read a file to get a JSON list	⭐
write file by line	write_file_by_line	write a list of text to a file	⭐
get the time consumption	TimeIt	measure the time, in seconds, consumed by a piece of code
jionlp logger	set_logger	the logger used by jionlp
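The table describes readers that return a JSON list from a file. Assuming a layout of one JSON value per line, a standard-library sketch of writing and lazily reading such a file looks like this. The helper names `write_json_lines` and `iter_json_lines` are hypothetical and are not the JioNLP functions listed above.

```python
import json
import os
import tempfile


def write_json_lines(items, path):
    """Write one JSON value per line, keeping Chinese text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


def iter_json_lines(path):
    """Yield one parsed JSON value per non-empty line, without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


records = [{'text': '浑水做空贝壳', 'label': 'finance'},
           {'text': '今年9月', 'label': 'time'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'samples.jsonl')
    write_json_lines(records, path)
    loaded = list(iter_json_lines(path))

print(loaded)
```

Reading through a generator keeps memory use flat for large corpora. Wrapping it in `list(...)` gives the read-everything behaviour when the file is small.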

5.Dictionaries

Features	Function name	Description	Star
Chinese idiom dict	chinese_idiom_loader	load Chinese idiom dictionary	⭐
xiehouyu dict	xiehouyu_loader	load xiehouyu dictionary	⭐
Chinese location dict	china_location_loader	load Chinese location dictionary including province, city, county	⭐
Chinese location replacement dict	china_location_change_loader	load the record of changes to Chinese locations since 2018	⭐
worldwide location dict	world_location_loader	load worldwide locations
Chinese character dict	chinese_char_dictionary_loader	load Chinese character dictionary
Chinese word dict	chinese_word_dictionary_loader	load Chinese word dictionary

6.Named Entity Recognition(NER) auxiliary tools

NER dataset format description

Features	Function name	Description	Star
extract money entity	extract_money	extract money entity text from the given text	⭐
extract time entity	extract_time	extract time entity text from the given text	⭐
Lexicon NER	LexiconNER	get entities from the text via dictionary	⭐
entity to tag	entity2tag	convert the entities info to tags for sequence labeling
tag to entity	tag2entity	convert the tags of sequence labeling to entities
char token to word token	char2word	convert char token data to word token data
word token to char token	word2char	convert word token data to char token data
entity compare	entity_compare	compare the predicted entities with the golden entities	⭐
NER prediction acceleration	TokenSplitSentence TokenBreakLongSentence TokenBatchBucket	speed up NER prediction	⭐
split dataset	analyse_dataset	split a dataset into training, validation, and test parts and analyse their KL divergence	⭐
entity collector	collect_dataset_entities	collect all entities from labeled dataset to get a dictionary
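The entity-to-tag and tag-to-entity rows describe a round trip between entity spans and sequence labels. A minimal character-level sketch with the BIO scheme follows. It is not JioNLP's implementation: the function names are hypothetical, and the entity layout (a dictionary with 'text', 'type', and a half-open 'offset' pair) is an assumption made for this example.

```python
def entities_to_bio(text: str, entities: list) -> list:
    """Turn entity spans into one BIO tag per character."""
    tags = ['O'] * len(text)
    for entity in entities:
        start, end = entity['offset']   # half-open span [start, end)
        tags[start] = 'B-' + entity['type']
        for i in range(start + 1, end):
            tags[i] = 'I-' + entity['type']
    return tags


def bio_to_entities(text: str, tags: list) -> list:
    """Recover entity spans from per-character BIO tags."""
    entities = []
    start, label = None, None
    for i, tag in enumerate(tags + ['O']):   # sentinel closes a trailing entity
        continues = tag.startswith('I-') and label == tag[2:]
        if start is not None and not continues:
            entities.append({'text': text[start:i], 'type': label,
                             'offset': [start, i]})
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    return entities


sentence = '浑水做空贝壳'
gold = [{'text': '浑水', 'type': 'Company', 'offset': [0, 2]},
        {'text': '贝壳', 'type': 'Company', 'offset': [4, 6]}]

bio_tags = entities_to_bio(sentence, gold)
print(bio_tags)
print(bio_to_entities(sentence, bio_tags))
```

An I- tag that does not follow a matching B- tag is ignored here. Other decoders treat it as the start of a new entity, so the choice should match how the model was trained.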

7.Text Classification

Features	Function name	Description	Star
Naive Bayes words analysis	analyse_freq_words	analyse the word frequency of different classes with Naive Bayes	⭐
split dataset	analyse_dataset	split a dataset into training, validation, and test parts and analyse their KL divergence	⭐

8.Sentiment Analysis

Features	Function name	Description	Star
sentiment analysis based on dictionary	LexiconSentiment	compute the sentiment value (0~1) of a given text

9.Chinese Word Segmentation(CWS)

Features	Function name	Description
word to tag	cws.word2tag	convert the words list to a list of tags for CWS
tag to word	cws.tag2word	convert the list of tags to a words list for CWS
compute F1	cws.f1	compute F1 value of the CWS models
CWS dataset corrector	cws.CWSDCWithStandardWords	correct the CWS datasets with dictionaries
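The word/tag conversions and the F1 measure in this table can be sketched in a few lines. The code below is an illustration under assumptions, not JioNLP's implementation: it uses a two-tag scheme ('B' for the first character of a word, 'I' for the rest), computes F1 over exact word spans, and all function names are hypothetical.

```python
def words_to_tags(words: list) -> tuple:
    """Convert a list of non-empty words to (characters, B/I tags)."""
    chars, tags = [], []
    for word in words:
        chars.extend(word)
        tags.extend(['B'] + ['I'] * (len(word) - 1))
    return chars, tags


def tags_to_words(chars: list, tags: list) -> list:
    """Rebuild the word list from characters and B/I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == 'B' or not words:
            words.append(ch)        # start a new word
        else:
            words[-1] += ch         # extend the current word
    return words


def word_spans(words: list) -> set:
    """Represent a segmentation as a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def cws_f1(gold_words: list, pred_words: list) -> float:
    """F1 over word spans that match exactly in both segmentations."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_words = ['做空', '机构', '浑水']
pred_words = ['做空', '机构浑水']

chars, tags = words_to_tags(gold_words)
print(tags)
print(tags_to_words(chars, tags))
print(cws_f1(gold_words, pred_words))
```

In this example only one predicted word matches a gold span, giving a precision of 1/2 and a recall of 1/3.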

My Initial Intention

NLP preprocessing and parsing are important but time-consuming, especially for Chinese. This library offers a bundle of features to handle these tedious jobs so that you can focus on training models.
If you have suggestions or run into bugs, please raise an issue on GitHub.
