Commit v1.0.0

george-pogosyan committed Dec 7, 2020
2 parents 924a273 + ba35667 commit 12694d9

Showing 17 changed files with 503 additions and 42 deletions.
25 changes: 25 additions & 0 deletions .github/workflows/create-release.yml
@@ -0,0 +1,25 @@
on:
push:
# Sequence of patterns matched against refs/tags
tags:
- 'v*' # Push events to matching v*, i.e. v1.0, v20.15.10

name: Create Release

jobs:
build:
name: Create Release
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
- name: Create Release
id: create_release
uses: actions/create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: ${{ github.ref }}
release_name: Release ${{ github.ref }}
draft: false
prerelease: false
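Once this workflow is in place, pushing a tag that matches the `v*` pattern triggers the release job. A typical invocation (the tag name `v1.0.0` here is just an example) looks like:

```shell
# Tag the current commit and push the tag; the push event triggers the
# Create Release workflow because the tag matches the 'v*' pattern.
git tag v1.0.0
git push origin v1.0.0
```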
95 changes: 94 additions & 1 deletion README.md
@@ -1 +1,94 @@
![Tests](https://github.com/asar-studio/natural-abh/workflows/Tests/badge.svg?branch=develop) ![Release Package to npm](https://github.com/asar-studio/natural-abh/workflows/Release%20Package%20to%20npm/badge.svg)
# natural-abh


![Tests](https://github.com/asar-studio/natural-abh/workflows/Tests/badge.svg?branch=develop)
![Release Package to npm](https://github.com/asar-studio/natural-abh/workflows/Release%20Package%20to%20npm/badge.svg)
[![NPM version](https://img.shields.io/npm/v/natural-abh.svg)](https://www.npmjs.com/package/natural-abh)

"natural-abh" is a general natural language facility for Node.js. Tokenizing, normalizing, and N-grams are currently supported.

It's still in the early stages, so we're very interested in bug reports, contributions and the like.

### TABLE OF CONTENTS

- [Installation](#installation)
- [Tokenizers](#tokenizers)
- [Normalizer](#normalizer)
- [N-Grams](#n-grams)

## Installation

You can install natural-abh via NPM like so:

npm install natural-abh

or using yarn:

yarn add natural-abh

If you're interested in contributing to natural, or just hacking on it, then by all means fork away!

## Tokenizers

Word and RegExp tokenizers are provided for breaking text up into arrays of tokens:

```javascript
const nabh = require('natural-abh');
const tokenizer = new nabh.WordTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]
```

The other tokenizers follow a similar pattern:

```javascript
tokenizer = new nabh.AggressiveTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]

tokenizer = new nabh.RegexpTokenizer({ pattern: /\s+/ });
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]

tokenizer = new nabh.WordPunctTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]
```
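For intuition, the core of a word tokenizer can be sketched in a few lines, independent of the library. This is an illustrative sketch, not natural-abh's actual implementation; `simpleTokenize` is a hypothetical name and the real tokenizers' character classes may differ:

```javascript
// Illustrative sketch only: split on any run of characters that is neither
// a Unicode letter nor a digit (requires Node.js 10+ for \p escapes).
function simpleTokenize(text) {
  return text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

console.log(simpleTokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]
```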

## Normalizer

Replaces obsolete characters in a string with modern counterparts:

```javascript
const { normalize } = require('natural-abh');
console.log(normalize('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// "Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла"
```
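The idea behind the normalizer can be sketched as a character-map replacement. This sketch covers only the Ҧ/ҧ → Ԥ/ԥ pair visible in the example above; the library's full replacement table is assumed to be larger, and `normalizeSketch` is a hypothetical name:

```javascript
// Illustrative sketch: replace obsolete letters via a lookup table.
// Only one obsolete/modern pair is included here.
const REPLACEMENTS = { 'Ҧ': 'Ԥ', 'ҧ': 'ԥ' };

function normalizeSketch(text) {
  return text.replace(/[Ҧҧ]/g, (ch) => REPLACEMENTS[ch]);
}

console.log(normalizeSketch('Аҧсны')); // "Аԥсны"
```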

## N-Grams

n-grams can be obtained for strings (which will be tokenized for you):

```javascript
const { bigrams, trigrams, ngrams } = nabh;
```

### bigrams

```javascript
console.log(bigrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
console.log(ngrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла', 2));
// [ [ 'Аҧсны', 'Аҳәынҭқарра' ], [ 'Аҳәынҭқарра', 'Ашьаустә' ], [ 'Ашьаустә', 'закәанеидкыла' ] ]
```

### trigrams

```javascript
console.log(trigrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
console.log(ngrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла', 3));
// [ [ 'Аҧсны', 'Аҳәынҭқарра', 'Ашьаустә' ], [ 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ] ]
```
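The n-gram computation itself is a sliding window over the token list. A minimal sketch, using a naive whitespace split instead of the library's tokenizer (`ngramsSketch` is a hypothetical name):

```javascript
// Illustrative sketch: collect every window of n consecutive tokens.
function ngramsSketch(text, n) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

console.log(ngramsSketch('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла', 2));
// [ [ 'Аҧсны', 'Аҳәынҭқарра' ], [ 'Аҳәынҭқарра', 'Ашьаустә' ], [ 'Ашьаустә', 'закәанеидкыла' ] ]
```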

More usage examples can be found in the tests.

94 changes: 94 additions & 0 deletions README.ru.md
@@ -0,0 +1,94 @@
# natural-abh


[![NPM version](https://img.shields.io/npm/v/natural-abh.svg)](https://www.npmjs.com/package/natural-abh)
![Tests](https://github.com/asar-studio/natural-abh/workflows/Tests/badge.svg?branch=develop)
![Release Package to npm](https://github.com/asar-studio/natural-abh/workflows/Release%20Package%20to%20npm/badge.svg)

"Natural" is a general natural language facility for Node.js. Currently supported: tokenization, normalization, and N-gram counting (bigrams, trigrams, and multigrams).

The library is still in its early stages, so we are very interested in bug reports, help implementing features, and the like.

### Contents

- [Installation](#installation)
- [Tokenizers](#tokenizers)
- [Normalizer](#normalizer)
- [N-grams](#n-grams)

## Installation

You can install natural-abh via NPM like so:

npm install natural-abh

or using yarn:

yarn add natural-abh

If you're interested in contributing to natural-abh, fork the repository, add your feature, and open a pull request for discussion!

## Tokenizers

Word and RegExp tokenizers are provided for breaking text up into arrays of tokens:

```javascript
const nabh = require('natural-abh');
const tokenizer = new nabh.WordTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]
```

The other tokenizers follow a similar pattern:

```javascript
tokenizer = new nabh.AggressiveTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]

tokenizer = new nabh.RegexpTokenizer({ pattern: /\s+/ });
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]

tokenizer = new nabh.WordPunctTokenizer();
console.log(tokenizer.tokenize('Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// [ 'Аԥсны', 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ]
```

## Normalizer

Replaces obsolete characters in a string with modern counterparts:

```javascript
const { normalize } = require('natural-abh');
console.log(normalize('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
// "Аԥсны Аҳәынҭқарра Ашьаустә закәанеидкыла"
```

## N-grams

n-grams can be obtained for strings (which will be tokenized for you):

```javascript
const { bigrams, trigrams, ngrams } = nabh;
```

### bigrams

```javascript
console.log(bigrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
console.log(ngrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла', 2));
// [ [ 'Аҧсны', 'Аҳәынҭқарра' ], [ 'Аҳәынҭқарра', 'Ашьаустә' ], [ 'Ашьаустә', 'закәанеидкыла' ] ]
```

### trigrams

```javascript
console.log(trigrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла'));
console.log(ngrams('Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла', 3));
// [ [ 'Аҧсны', 'Аҳәынҭқарра', 'Ашьаустә' ], [ 'Аҳәынҭқарра', 'Ашьаустә', 'закәанеидкыла' ] ]
```


More usage examples can be found by reading the tests.
10 changes: 5 additions & 5 deletions package.json
@@ -1,12 +1,12 @@
{
"name": "natural-abh",
"version": "0.1.1",
"version": "1.0.0",
"main": "dist/index.js",
"scripts": {
"build": "tsc",
"lint": "eslint --ext .ts, --ignore-path .eslintignore .",
"lintfix": "eslint --fix --ext .ts, --ignore-path .eslintignore .",
"test": "jest",
"test": "jest --collectCoverage",
"test:watch": "jest --watch --detectOpenHandles"
},
"husky": {
@@ -21,8 +21,6 @@
]
},
"dependencies": {
"eslint-config-airbnb-typescript": "^12.0.0",
"typescript": "^4.1.0",
"underscore": "^1.12.0"
},
"devDependencies": {
@@ -31,6 +29,7 @@
"@types/underscore": "^1.10.24",
"@typescript-eslint/eslint-plugin": "^4.8.2",
"eslint": "^7.13.0",
"eslint-config-airbnb-typescript": "^12.0.0",
"eslint-config-airbnb-base": "^14.2.1",
"eslint-config-prettier": "^6.15.0",
"eslint-plugin-import": "^2.22.1",
@@ -41,7 +40,8 @@
"nodemon": "^2.0.6",
"prettier": "^2.1.2",
"ts-jest": "^26.4.4",
"ts-node": "^9.0.0"
"ts-node": "^9.0.0",
"typescript": "^4.1.0"
},
"jest": {
"moduleFileExtensions": [
6 changes: 4 additions & 2 deletions src/index.ts
@@ -1,15 +1,17 @@
import { AggressiveTokenizer } from './tokenizers/aggressive_tokenizer';
import { Matchers } from './tokenizers/orthography_matchers';
import { RegexpTokenizer } from './tokenizers/regexp_tokenizer';
import { WordTokenizer } from './tokenizers/word_tokenizer';
import { WordPunctTokenizer } from './tokenizers/word_punct_tokenizer';
import { PunctTokenizer } from './tokenizers/punct_tokenizer';
import { normalize } from './normalizers/normalizer';
import { ngrams, bigrams, trigrams, multrigrams } from './ngrams/ngrams';

export {
AggressiveTokenizer,
Matchers,
RegexpTokenizer,
WordPunctTokenizer,
PunctTokenizer,
WordTokenizer,
normalize,
ngrams,
bigrams,
2 changes: 1 addition & 1 deletion tests/ngrams/ngrams.spec.ts → src/ngrams/ngrams.spec.ts
@@ -5,7 +5,7 @@ import {
bigrams,
trigrams,
multrigrams
} from '../../src/ngrams/ngrams';
} from './ngrams';

const text = readFileSync(process.cwd() + '/tests/data/text.txt', 'utf8');
const unogramsJSON = JSON.parse(
7 changes: 1 addition & 6 deletions src/ngrams/ngrams.ts
@@ -1,12 +1,7 @@
import * as _ from 'underscore';
import { AggressiveTokenizer as Tokenizer } from '../tokenizers/aggressive_tokenizer';

let tokenizer = new Tokenizer();

export const setTokenizer = (t: Tokenizer) => {
if (!_.isFunction(t.tokenize)) throw new Error('Expected a valid Tokenizer');
tokenizer = t;
};
const tokenizer = new Tokenizer();

export const ngrams = (
_sequence: string,
@@ -1,5 +1,5 @@
import { readFileSync } from 'fs';
import { normalize } from '../../src/normalizers/normalizer';
import { normalize } from './normalizer';

const wrongString = readFileSync(process.cwd() + '/tests/data/text.txt', 'utf8');
const rightString = readFileSync(process.cwd() + '/tests/data/normalized.txt', 'utf8');
@@ -1,6 +1,6 @@
import { AggressiveTokenizer } from '../../src/tokenizers/aggressive_tokenizer';
import { AggressiveTokenizer } from './aggressive_tokenizer';

const aggressiveTokenizer = new AggressiveTokenizer();
const tokenizer = new AggressiveTokenizer();

const string = 'Аҧсны Аҳәынҭқарра Ашьаустә закәанеидкыла» 10.01.2007 шықәсазтәи N 1555-с-XIV (иаднакылт Аҧсны Жәлар Реизара – Апарламент 2006 ш, ԥхынҷкәынмза 28 рзы) (аредакциа 29.06.2016)';

@@ -31,9 +31,9 @@ const tokens = [
'2016'
];
test('должен быть инстансом класса AggressiveTokenizer', () => {
expect(aggressiveTokenizer).toBeInstanceOf(AggressiveTokenizer);
expect(tokenizer).toBeInstanceOf(AggressiveTokenizer);
});

test('должен правильно токенизировать строку', () => {
expect(aggressiveTokenizer.tokenize(string)).toStrictEqual(tokens);
expect(tokenizer.tokenize(string)).toStrictEqual(tokens);
});
8 changes: 0 additions & 8 deletions src/tokenizers/orthography_matchers.ts

This file was deleted.
