nGram-Python

nGram: New Word Recognition (Chinese news texts).

This repo contains dataCollector (which crawls data from the Sina news website), dataCleaner (which tidies and aggregates the crawled data), nGram (which counts n-grams and produces the nGram data), and wordRecognition (which uses the nGram data to recognize Chinese words).

Code Constitution Specifics

dataCollector

Spiders to crawl Sina news texts (with titles), using Python's Requests + BeautifulSoup.

Data source: the daily news list at "http://news.sina.com.cn/old1000/news1000_YYYYMMDD.shtml", where YYYY, MM, and DD are the year, month, and day respectively.
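As a rough illustration of the URL pattern above, the daily list addresses for a date range could be generated as follows. This is only a sketch: the names `URL_TEMPLATE`, `daily_list_url`, and `daily_list_urls` are hypothetical and are not taken from dataCollector itself.

```python
from datetime import date, timedelta

# URL pattern from the README; the date is formatted as YYYYMMDD.
URL_TEMPLATE = "http://news.sina.com.cn/old1000/news1000_{:%Y%m%d}.shtml"


def daily_list_url(day):
    """Return the daily news list URL for a single date (hypothetical helper)."""
    return URL_TEMPLATE.format(day)


def daily_list_urls(start, end):
    """Yield one daily list URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield daily_list_url(current)
        current += timedelta(days=1)


if __name__ == "__main__":
    for url in daily_list_urls(date(2006, 1, 1), date(2006, 1, 3)):
        print(url)
```

Each generated URL could then be fetched with Requests and parsed with BeautifulSoup, as the description above suggests.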

dataCleaner

Python script that aggregates a year's crawled texts into a single file, e.g. 2006_all.
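The aggregation step might look roughly like the sketch below. It assumes, purely for illustration, that the crawled texts are stored as UTF-8 files named by date (for example `20060105.txt`) in one directory. Both that layout and the function name `aggregate_year` are hypothetical and may differ from what dataCleaner actually does.

```python
from pathlib import Path


def aggregate_year(crawl_dir, year, out_path):
    """Concatenate every file whose name starts with `year` into out_path.

    Assumes a hypothetical layout of date-named .txt files in crawl_dir.
    Returns the number of source files that were merged.
    """
    # Collect the source list first, sorted so the output is in date order.
    sources = sorted(Path(crawl_dir).glob(f"{year}*.txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for src in sources:
            out.write(src.read_text(encoding="utf-8"))
            out.write("\n")  # keep texts from different days on separate lines
    return len(sources)
```

For example, `aggregate_year("crawled", 2006, "2006_all")` would merge all 2006 files from a `crawled` directory into `2006_all`.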

nGram

Scan texts to generate statistical nGram files.
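A minimal sketch of n-gram counting is shown below. It assumes character-level n-grams, which is a common choice for unsegmented Chinese text. The function `count_ngrams` is hypothetical, and the real nGram.py may count, filter, or store its results differently.

```python
from collections import Counter


def count_ngrams(text, n_min, n_max):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    for gram, freq in count_ngrams("新闻新闻", 2, 3).most_common():
        print(gram, freq)
```

Holding all counts in memory is fine for small inputs. For a corpus of the size described under Data Specifics, a real implementation would likely need to process the file in chunks or prune rare n-grams.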

wordRecognition

Use nGram data to recognize words.

Data Specifics

The raw data is acquired by dataCollector and grouped by date. It contains Sina news texts with tags such as <p></p>; dataCleaner produces the Chinese-text-only result as data.txt (built from texts of January to April 2006, approximately 1.9 GB in total, precisely 1,990,660,672 bytes).
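One way to reduce tagged text to Chinese-only text is sketched below. It assumes "Chinese text" means characters in the CJK Unified Ideographs range U+4E00 to U+9FFF and replaces everything else, including punctuation, with a space. The name `chinese_only` and the exact character range are assumptions, not dataCleaner's confirmed behavior.

```python
import re

# Matches HTML-like tags such as <p> and </p>.
TAG_RE = re.compile(r"<[^>]+>")
# Matches runs of characters outside the basic CJK Unified Ideographs block.
NON_CJK_RE = re.compile(r"[^\u4e00-\u9fff]+")


def chinese_only(raw):
    """Strip tags, then collapse every non-Chinese run into a single space."""
    without_tags = TAG_RE.sub(" ", raw)
    return NON_CJK_RE.sub(" ", without_tags).strip()


if __name__ == "__main__":
    print(chinese_only("<p>新浪新闻 2006 news</p>"))
```

Keeping a space where non-Chinese characters were removed prevents n-grams from spanning text that was not adjacent in the original.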

`nGram.py` utility

nGram.py is a command line utility. Usage:

$ src/nGram/nGram.py
nGram.py: count nGram words in Chinese Texts
Usage:
         ./nGram.py inputFileName nMin nMax [outputFileName]
Explanation:
         inputFileName -- Name of the input data text
         nMin, nMax -- the range of N for n-gram
         outputFileName -- (optional) Name of the output nGram text
Example:
         ./nGram.py data.txt 2 5
$

bsnsk/nGram-Python
