   ——JioNLP: A Python Library for Chinese NLP Preprocessing & Parsing

   ——Installation: pip install jionlp

   ——JioNLP online offers a quick trial of some functions

   ——Chinese version: README.md

  • Doing NLP tasks and need to clean and filter a corpus? Use JioNLP
  • Doing NLP tasks and need to extract key info? Use JioNLP
  • Doing NLP tasks and need text augmentation? Use JioNLP
  • Doing NLP tasks and need the radical, pinyin, or traditional form of Chinese characters? Use JioNLP

In short, JioNLP offers a bundle of preprocessing and parsing tools for NLP tasks that are accurate, efficient, and easy to use.

Main functions include: cleaning text; removing HTML tags, exceptional chars, and redundant chars; converting full-angle chars to half-angle; extracting emails, QQ numbers, phone numbers, parenthesized text, ID card numbers, IP addresses, URLs, money amounts with their units (case), and numbers; parsing time text; extracting keyphrases; loading Chinese dictionaries; and Chinese text augmentation.
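
As an illustration of the full-angle to half-angle conversion listed above, here is a minimal standalone sketch. It is not JioNLP's implementation, and the name `to_half_width` is hypothetical. It relies only on the Unicode layout: the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and the ideographic space U+3000 corresponds to an ordinary space.

```python
def to_half_width(text: str) -> str:
    """Convert full-width (full-angle) characters to half-width ASCII.

    Illustrative sketch only; `to_half_width` is a hypothetical helper,
    not a JioNLP function.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:  # full-width variants of ASCII 0x21-0x7E
            code -= 0xFEE0
        converted.append(chr(code))
    return ''.join(converted)


print(to_half_width('ＪｉｏＮＬＰ　２０２２！'))  # JioNLP 2022!
```

Note that this also converts full-width punctuation such as `，`, which may or may not be wanted depending on the corpus.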

Update 2022-05-26

jio.keyphrase.extract_keyphrase: extract keyphrases from a Chinese text

>>> import jionlp as jio
>>> text = '浑水创始人:七月开始调查贝壳,因为“好得难以置信” 2021年12月16日,做空机构浑水在社交媒体上公开表示,正在做空美股上市公司贝壳...'

>>> keyphrases = jio.keyphrase.extract_keyphrase(text)
>>> print(keyphrases)
# ['浑水创始人', '开始调查贝壳', '做空机构浑水', '美股上市公司贝壳', '美国证监会']

>>> print(jio.keyphrase.extract_keyphrase.__doc__)  # show the function's docstring

Update 2021-10-25

jio.parse_money: parse a money string into its amount (num), currency unit (case), and precision (definition: accurate or blur)

import jionlp as jio
text_list = ['约4.287亿美元', '两个亿卢布', '六十四万零一百四十三元一角七分', '3000多欧元', '三五佰块钱', '七百到九百亿泰铢'] 
moneys = [jio.parse_money(text) for text in text_list]
for text, money in zip(text_list, moneys):
    print(f'{text}: {money}')

# 约4.287亿美元: {'num': '428700000.00', 'case': '美元', 'definition': 'blur'}
# 两个亿卢布: {'num': '200000000.00', 'case': '卢布', 'definition': 'accurate'}
# 六十四万零一百四十三元一角七分: {'num': '640143.17', 'case': '元', 'definition': 'accurate'}
# 3000多欧元: {'num': ['3000.00', '4000.00'], 'case': '欧元', 'definition': 'blur'}
# 三五佰块钱: {'num': ['300.00', '500.00'], 'case': '元', 'definition': 'blur'}
# 七百到九百亿泰铢: {'num': ['70000000000.00', '90000000000.00'], 'case': '泰铢', 'definition': 'blur'}
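
In the results above, `num` is a single string for an exact amount and a two-item list for a blurred range. A downstream consumer may want a uniform numeric form. The following is a minimal sketch of one way to do that; the helper `money_range` is hypothetical and works on plain dicts shaped like the output shown above, so it does not call JioNLP itself.

```python
from typing import Tuple


def money_range(parsed: dict) -> Tuple[float, float]:
    """Return (low, high) as floats for a parse_money-style result dict.

    Hypothetical helper: it assumes only the dict shape shown in the
    README output ('num' is either a string or a two-item list).
    """
    num = parsed['num']
    if isinstance(num, list):   # blurred range, e.g. '3000多欧元'
        low, high = (float(n) for n in num)
    else:                       # single amount
        low = high = float(num)
    return low, high


# Result dicts copied from the output listed above.
exact = {'num': '640143.17', 'case': '元', 'definition': 'accurate'}
fuzzy = {'num': ['3000.00', '4000.00'], 'case': '欧元', 'definition': 'blur'}

print(money_range(exact))  # (640143.17, 640143.17)
print(money_range(fuzzy))  # (3000.0, 4000.0)
```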

Update 2022-03-07

jio.parse_time: parse a given time string

import time
import jionlp as jio
res = jio.parse_time('今年9月', time_base={'year': 2021})
res = jio.parse_time('零三年元宵节晚上8点半', time_base=time.time())
res = jio.parse_time('一万个小时')
res = jio.parse_time('100天之后', time.time())
res = jio.parse_time('四月十三', lunar_date=False)
res = jio.parse_time('每周五下午4点', time.time(), period_results_num=2)
print(res)

# Results of the six calls above, in order:
# {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-09-01 00:00:00', '2021-09-30 23:59:59']}
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2003-02-15 20:30:00', '2003-02-15 20:30:59']}
# {'type': 'time_delta', 'definition': 'accurate', 'time': {'hour': 10000.0}}
# {'type': 'time_span', 'definition': 'blur', 'time': ['2021-10-22 00:00:00', 'inf']}
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-13 00:00:00', '2022-04-13 23:59:59']}
# {'type': 'time_period', 'definition': 'accurate', 'time': {'delta': {'day': 7},
#  'point': {'time': [['2021-07-16 16:00:00', '2021-07-16 16:59:59'],
#                     ['2021-07-23 16:00:00', '2021-07-23 16:59:59']], 'string': '周五下午4点'}}}
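
For `time_point` and `time_span` results, the `time` field is a pair of strings, and an open-ended span uses `'inf'` as its upper bound. The sketch below shows one possible way to turn such a pair into `datetime` objects. The helper `to_datetimes` is hypothetical and operates on dicts copied from the output above; mapping `'inf'` to `None` is a choice made for this example, not library behaviour.

```python
from datetime import datetime
from typing import List, Optional

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def to_datetimes(parsed: dict) -> List[Optional[datetime]]:
    """Convert the 'time' pair of a time_point/time_span result to datetimes.

    Hypothetical helper: an open bound ('inf') is represented as None.
    """
    if parsed['type'] not in ('time_point', 'time_span'):
        raise ValueError(f"unsupported result type: {parsed['type']}")
    return [
        None if t == 'inf' else datetime.strptime(t, TIME_FORMAT)
        for t in parsed['time']
    ]


# Result dicts copied from the output listed above.
span = {'type': 'time_span', 'definition': 'accurate',
        'time': ['2021-09-01 00:00:00', '2021-09-30 23:59:59']}
open_span = {'type': 'time_span', 'definition': 'blur',
             'time': ['2021-10-22 00:00:00', 'inf']}

start, end = to_datetimes(span)
print(end - start)              # length of the span as a timedelta
print(to_datetimes(open_span))  # second item is None for the open bound
```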

Installation

  • Install from source via GitHub (requires Python >= 3.6)
$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .
  • Install from PyPI via pip
$ pip install jionlp

Features

  • Import jionlp and check the main functions and their docstrings
>>> import jionlp as jio
>>> jio.help()  # enter a keyword, such as “回译” (back translation)
>>> dir(jio)
>>> print(jio.extract_parentheses.__doc__)
  • On Linux, the following command can be used in place of jio.help().
$ jio_help
  • Star⭐ represents excellent features

1. Gadgets

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| help search tool | help | search JioNLP features by keyword when you do not know which function you need | |
| time semantic parser | parse_time | get the timestamp and span of a given time text | |
| keyphrase extraction | extract_keyphrase | extract the keyphrases of a given text | |
| extractive summary | extract_summary | extract the summary of a given text | |
| stopwords filter | remove_stopwords | delete the stopwords from a word list generated from a text | |
| sentence splitter | split_sentence | split a text into sentences | |
| location parser | parse_location | get the province, city, county, town, and village name of a location text | |
| telephone number parser | phone_location, cell_phone_location, landline_phone_location | get the province, city, and telecom operator of a telephone number | |
| news location recognizer | recognize_location | get the country, province, city, and county name of a news text | |
| solar and lunar date conversion | lunar2solar, solar2lunar | translate a lunar (solar) date to the solar (lunar) date | |
| ID card parser | parse_id_card | get the province, city, county, birthday, gender, and check code of a given Chinese ID card number | |
| idiom solitaire | idiom_solitaire | a word game that chains Chinese idioms so that the first char of each idiom has the same pronunciation as the last char of the previous one | |
| traditional chars to simplified chars | tra2sim | translate traditional characters to their simplified version | |
| simplified chars to traditional chars | sim2tra | translate simplified characters to their traditional version | |
| characters to pinyin | pinyin | get the pinyin of Chinese chars, e.g. to add pronunciation info to the NLP model input | |
| characters to radical | char_radical | get the radical info of Chinese chars, e.g. to add to the NLP model input | |
| money numbers to chars | money_num2char | convert a given money number to Chinese characters | |
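
To show what parsing an ID card number involves, here is a standalone sketch of the birthday, gender, and check-code steps for 18-digit Chinese ID numbers (the ISO 7064 MOD 11-2 scheme used by GB 11643-1999). It is not the code behind `parse_id_card`: the function names and the returned keys are hypothetical, and no area-code lookup is attempted.

```python
import re
from datetime import date

# Position weights and check characters defined by GB 11643-1999.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = '10X98765432'


def check_code(first_17: str) -> str:
    """Compute the 18th character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first_17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def parse_id_sketch(id_number: str) -> dict:
    """Hypothetical helper; the result keys are chosen for this example."""
    if not re.fullmatch(r'[0-9]{17}[0-9Xx]', id_number):
        raise ValueError('expected 17 digits followed by a digit or X')
    body = id_number[:17]
    return {
        'area_code': id_number[:6],
        # date() raises ValueError if digits 7-14 are not a real calendar date
        'birthday': date(int(id_number[6:10]), int(id_number[10:12]),
                         int(id_number[12:14])),
        # the 17th digit is odd for male, even for female
        'gender': 'male' if int(id_number[16]) % 2 else 'female',
        'check_code_valid': check_code(body) == id_number[17].upper(),
    }


# Sample number from the GB 11643-1999 standard, not a real person's ID.
print(parse_id_sketch('11010519491231002X'))
```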

2. Text Augmentation

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| back translation | BackTranslation | get augmented text via back translation | |
| swap char position | swap_char_position | get augmented text via swapping the position of adjacent chars | |
| homophone substitution | homophone_substitution | replace chars with the same pronunciation to get augmented text | |
| randomly add & delete chars | random_add_delete | add and delete chars randomly in the text to get augmented text | |
| NER entity replacement | replace_entity | replace the entity of the text via dictionary to get augmented text | |
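
The idea behind adjacent-character swapping can be sketched in a few lines. This is an illustrative stand-in, not the `swap_char_position` implementation: the function name, the `swap_ratio` parameter, and the seeding approach are all assumptions made for the example.

```python
import random
from typing import Optional


def swap_adjacent_chars(text: str, swap_ratio: float = 0.1,
                        seed: Optional[int] = None) -> str:
    """Swap neighbouring characters with probability `swap_ratio`.

    Hypothetical sketch: each position is considered once, and a swapped
    pair is skipped so that no character moves more than one place.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_ratio:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2   # do not move the same character twice
        else:
            i += 1
    return ''.join(chars)


original = '今天的天气非常好,适合出去散步'
print(swap_adjacent_chars(original, swap_ratio=0.2, seed=42))
```

Keeping the swap local means the augmented text has the same characters as the original, so labels attached to the whole sentence normally stay valid.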

3. Key info extraction and parsing with regular expressions

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| clean text | clean_text | delete exceptional and redundant chars, HTML tags, parentheses, URLs, emails, and phone numbers | |
| extract E-mail | extract_email | extract email info from text | |
| parse money text | extract_money | parse money text | |
| extract phone number | extract_phone_number | extract landline and cell phone numbers | |
| extract Chinese ID card | extract_id_card | extract Chinese ID card info and parse it with jio.parse_id_card | |
| extract QQ | extract_qq | extract Tencent QQ numbers | |
| extract URL | extract_url | extract URL info | |
| extract IP | extract_ip_address | extract IPv4 addresses | |
| extract parenthesis info | extract_parentheses | extract parenthesis info wrapped by {}「」[]【】()（）<>《》 | |
| delete E-mail | remove_email | delete E-mail info from the given text | |
| delete URL | remove_url | delete URL info | |
| delete phone num | remove_phone_number | delete telephone numbers | |
| delete IP | remove_ip_address | delete IP addresses | |
| delete Chinese ID card | remove_id_card | delete Chinese ID card info | |
| delete QQ | remove_qq | delete QQ numbers | |
| delete HTML tags | remove_html_tag | delete HTML tags | |
| delete parenthesis info | remove_parentheses | delete parenthesis info wrapped by {}「」[]【】()（）<>《》 | |
| delete exceptional chars | remove_exception_char | delete exceptional chars | |
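
The extract and remove pairs in this section follow one pattern: a regular expression finds the spans, which are then either returned or deleted. As a hedged illustration, the sketch below does this for IPv4 addresses using only the standard `re` module. The pattern and the function names are assumptions for this example and are likely to differ from the rules JioNLP actually uses.

```python
import re
from typing import List

# One octet: 0-255 without leading zeros.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# The lookarounds reject matches embedded in longer digit/dot runs
# (e.g. '999.1.1.1'). As a side effect, an address directly followed
# by a sentence-ending '.' is not matched.
IPV4_PATTERN = re.compile(
    r'(?<![0-9.])' + OCTET + r'(?:\.' + OCTET + r'){3}(?![0-9.])'
)


def extract_ipv4(text: str) -> List[str]:
    """Return all IPv4 addresses found in `text` (illustrative sketch)."""
    return IPV4_PATTERN.findall(text)


def remove_ipv4(text: str) -> str:
    """Delete all IPv4 addresses from `text` (illustrative sketch)."""
    return IPV4_PATTERN.sub('', text)


log = 'ping 192.168.1.1 failed, fallback 10.0.0.254 ok, 999.1.1.1 is invalid'
print(extract_ipv4(log))  # ['192.168.1.1', '10.0.0.254']
print(remove_ipv4(log))
```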

4. File reader and writer

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| read file by iteration | read_file_by_iter | read a file by iteration to get a JSON list | |
| read file by line | read_file_by_line | read a file to get a JSON list | |
| write file by line | write_file_by_line | write a list of text to a file | |
| get the time consumption | TimeIt | measure the time (in seconds) consumed by a piece of code | |
| jionlp logger | set_logger | the logger used by jionlp | |
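
Reading "by iteration" usually means yielding one record at a time instead of loading the whole file. The sketch below shows that pattern with the standard library, assuming a file that stores one JSON value per line. That format, and the names `write_json_lines` and `iter_json_lines`, are assumptions for this example rather than a description of JioNLP's readers.

```python
import json
from typing import Any, Iterable, Iterator


def write_json_lines(items: Iterable[Any], path: str) -> None:
    """Write each item as one JSON document per line (UTF-8)."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


def iter_json_lines(path: str) -> Iterator[Any]:
    """Lazily yield one parsed JSON value per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because `iter_json_lines` is a generator, a large corpus can be processed with `for record in iter_json_lines(path): ...` while holding only one record in memory at a time.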

5. Dictionaries

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| Chinese idiom dict | chinese_idiom_loader | load Chinese idiom dictionary | |
| xiehouyu dict | xiehouyu_loader | load xiehouyu dictionary | |
| Chinese location dict | china_location_loader | load Chinese location dictionary including province, city, county | |
| Chinese location replacement dict | china_location_change_loader | load replacement info of Chinese location dictionary from 2018 | |
| world wide location dict | world_location_loader | load world wide location | |
| Chinese character dict | chinese_char_dictionary_loader | load Chinese character dictionary | |
| Chinese word dict | chinese_word_dictionary_loader | load Chinese word dictionary | |

6. Named Entity Recognition (NER) auxiliary tools

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| extract money entity | extract_money | extract money entity text from the given text | |
| extract time entity | extract_time | extract time entity text from the given text | |
| Lexicon NER | LexiconNER | get entities from the text via dictionary | |
| entity to tag | entity2tag | convert the entities info to tags for sequence labeling | |
| tag to entity | tag2entity | convert the tags of sequence labeling to entities | |
| char token to word token | char2word | convert char token data to word token data | |
| word token to char token | word2char | convert word token data to char token data | |
| entity compare | entity_compare | compare the predicted entities with the golden entities | |
| NER acceleration of prediction | TokenSplitSentence, TokenBreakLongSentence, TokenBatchBucket | speed up NER prediction | |
| split dataset | analyse_dataset | split a dataset into training, validation, and test parts and analyse the KL divergence info | |
| entity collector | collect_dataset_entities | collect all entities from a labeled dataset to get a dictionary | |
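
Converting between entity spans and sequence-labeling tags is the core of the entity2tag and tag2entity pair. The following is a minimal sketch using the BIO scheme. The entity dict layout (`text`, `type`, `offset`), the tag scheme, and the function names are assumptions for this example; the formats JioNLP actually uses may differ.

```python
from typing import Dict, List


def entities_to_tags(text: str, entities: List[Dict]) -> List[str]:
    """Char-level BIO tags; 'offset' is [start, end) (assumed layout)."""
    tags = ['O'] * len(text)
    for ent in entities:
        start, end = ent['offset']
        tags[start] = 'B-' + ent['type']
        for i in range(start + 1, end):
            tags[i] = 'I-' + ent['type']
    return tags


def tags_to_entities(text: str, tags: List[str]) -> List[Dict]:
    """Inverse of entities_to_tags; stray I- tags without a B- are ignored."""
    entities: List[Dict] = []
    start, ent_type = None, ''
    for i, tag in enumerate(list(tags) + ['O']):   # sentinel closes the last entity
        continues = tag.startswith('I-') and tag[2:] == ent_type
        if start is not None and not continues:
            entities.append({'text': text[start:i], 'type': ent_type,
                             'offset': [start, i]})
            start = None
        if tag.startswith('B-'):
            start, ent_type = i, tag[2:]
    return entities


sentence = '张三在北京工作'   # "Zhang San works in Beijing"
gold = [{'text': '张三', 'type': 'Person', 'offset': [0, 2]},
        {'text': '北京', 'type': 'Location', 'offset': [3, 5]}]

tags = entities_to_tags(sentence, gold)
print(tags)
print(tags_to_entities(sentence, tags) == gold)  # True
```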

7. Text Classification

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| Naive Bayes words analysis | analyse_freq_words | analyse the word frequency of different classes by naive Bayes | |
| split dataset | analyse_dataset | split a dataset into training, validation, and test parts and analyse the KL divergence info | |
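
KL divergence is a natural check after splitting a dataset: if the label distribution of a split drifts far from that of the training part, the divergence grows. The sketch below computes it for two label lists with the standard library. It illustrates the measure itself, not `analyse_dataset`; the function names and the `eps` smoothing term are assumptions for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def label_distribution(labels: Sequence[str], classes: Sequence[str]) -> List[float]:
    """Relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; `eps` avoids division by zero when q has empty classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


classes = ['sports', 'finance', 'tech']
train = ['sports'] * 50 + ['finance'] * 30 + ['tech'] * 20
test = ['sports'] * 5 + ['finance'] * 3 + ['tech'] * 2      # same proportions
skewed = ['sports'] * 8 + ['finance'] * 1 + ['tech'] * 1    # drifted split

p_train = label_distribution(train, classes)
print(kl_divergence(label_distribution(test, classes), p_train))    # close to 0
print(kl_divergence(label_distribution(skewed, classes), p_train))  # clearly above 0
```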

8. Sentiment Analysis

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| sentiment analysis based on dictionary | LexiconSentiment | compute the sentiment value (0~1) of a given text | |

9. Chinese Word Segmentation (CWS)

| Features | Function name | Description | Star |
| --- | --- | --- | --- |
| word to tag | cws.word2tag | convert the words list to a list of tags for CWS | |
| tag to word | cws.tag2word | convert the list of tags to a words list for CWS | |
| compute F1 | cws.f1 | compute F1 value of the CWS models | |
| CWS dataset corrector | cws.CWSDCWithStandardWords | correct the CWS datasets with dictionaries | |
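
The three core CWS utilities (words to tags, tags to words, and F1) can be sketched as follows. This uses the common B/M/E/S tagging scheme and a span-based F1; both choices, as well as the function names, are assumptions for this example, and the `cws` module may use a different tag set or scoring detail.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """One tag per character: S for single-char words, else B, M..., E."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue                      # skip empty strings
        if len(word) == 1:
            tags.append('S')
        else:
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return tags


def tags_to_words(text: str, tags: List[str]) -> List[str]:
    """Cut `text` after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ('E', 'S'):
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):                 # unterminated trailing word
        words.append(text[start:])
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character offsets [start, end) of each word."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def cws_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over words whose boundaries match exactly on both sides."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold_words = ['我', '喜欢', '北京']        # "I / like / Beijing"
pred_words = ['我', '喜', '欢', '北京']

print(words_to_tags(gold_words))                   # ['S', 'B', 'E', 'B', 'E']
print(tags_to_words('我喜欢北京', words_to_tags(gold_words)))
print(round(cws_f1(gold_words, pred_words), 4))    # 0.5714
```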

My Initial Intention

  • NLP preprocessing and parsing are important and time-consuming, especially for Chinese. This library offers a bundle of features to handle these tedious jobs so that you can focus on training models.
  • If you have suggestions or run into bugs, please raise an issue on GitHub.

You are welcome to join the WeChat group on NLP techniques

Please scan the QR code below and send 【进群】 ("join the group")

image

If this tool is useful for your development, please give it a star ⭐ on GitHub

Or scan the PayPal or WeChat QR code to donate (●'◡'●) Thanks ~~

image