Ruby Natural Language Processing Resources

A collection of Natural Language Processing (NLP) Ruby libraries, tools and software. Suggestions and contributions are welcome.

APIs

3rd party NLP services

Client libraries for various third-party NLP API services.

alchemy_api - provides a client API library for AlchemyAPI's NLP services
aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API
biffbot - Ruby gem for Diffbot's APIs that extract Articles, Products, Images, Videos, and Discussions from any web page
gengo-ruby - a Ruby library to interface with the Gengo API for translation
monkeylearn-ruby - build and consume machine learning models for language processing from your Ruby apps
poliqarpr - Ruby client for Poliqarp text corpus server
wlapi - Ruby based API for the project Wortschatz Leipzig

Instant Messaging Bots

Client/server libraries for various third-party instant messenger chatbot APIs.

Facebook Messenger

botstack - rapid Facebook chatbot development with Ruby on Rails
facebook-messenger - Definitely the best Ruby client for Bots on Messenger
messenger-ruby - A simple library for supporting implementation of Facebook Messenger Bot in Ruby on Rails

Kik

kik - Build www.Kik.com bots in Ruby

Microsoft Bot Framework (Skype)

BotBuilder - REST APIs (for Skype and other instant messaging apps)
botframework-ruby - Microsoft Bot Framework Ruby client

Slack

slack-bot-server - A Grape API serving a Slack bot to multiple teams
slack-ruby-bot - The easiest way to write a Slack bot in Ruby
slack-ruby-client - A Ruby and command-line client for the Slack Web and Real Time Messaging APIs
slack-ruby-gem - A Ruby wrapper for the Slack API

Telegram Messenger

BOTServer - Telegram Bot API Webhooks Framework, for Rubyists
TelegramBot - a charismatic Ruby client for Telegram's Bot API
TelegramBotRuby - yet another client for Telegram's Bot API
telegram-bot-ruby - Ruby wrapper for Telegram's Bot API

Wechat

wechat - API, command and message handling for WeChat in Rails
wechat-api - handles WeChat API calls (not server-side pushed messages)

Natural Language Understanding Tools

dialogflow-ruby-client - A Ruby SDK to the https://dialogflow.com natural language processing service
expando - A translation language for defining user utterance examples in conversational interfaces (for Dialogflow and similar services)
wit-ruby - Easy interface for wit.ai natural language parsing

Voice-based devices bots

Client/server libraries for various third-party voice-based device APIs.

Amazon Echo Alexa skills

alexa-home - Using Amazon Echo to control the home!
Alexa-Hue - Control Hue Lights with Alexa
alexa-ruby - Ruby toolkit for Amazon Alexa service
alexa-rubykit - Amazon Echo Alexa's App Kit Ruby Implementation
alexa-skill - A Ruby based DSL to create new Alexa Skills
alexa_skills_ruby - Simple library to interface with the Alexa Skills Kit

Books

Mastering Regular Expressions - by Jeffrey E. F. Friedl
Regular Expressions Cookbook - by Jan Goyvaerts, Steven Levithan
Regular Expression Pocket Reference - by Tony Stubblebine
Text Processing with Ruby - by Rob Miller
Thoughtful Machine Learning: A Test-Driven Approach - by Matthew Kirk
Understanding Computation - by Tom Stuart

Bitext Alignment

Bitext alignment is the process of aligning two parallel documents on a segment by segment basis. In other words, if you have one document in English and its translation in Spanish, bitext alignment is the process of matching each segment from document A with its corresponding translation in document B.

alignment - alignment functions for corpus linguistics (Gale-Church implementation)

Case

active_support - the Rails active_support gem has various string extensions that can handle case (e.g. .mb_chars.upcase.to_s or #transliterate)
string_pl - additional support for Polish encodings in Ruby 1.9
twitter-cldr-rb - casefolding
u - U extends Ruby’s Unicode support
unicode - Unicode normalization library
unicode_utils - Unicode algorithms for Ruby 1.9

Chatbot

chatterbot - A straightforward Ruby-based Twitter bot framework, using OAuth to authenticate
JeffBot - (Yet another) comical and extensible chat bot
Lita - Lita is a chat bot written in Ruby with persistent storage provided by Redis
MegaHAL - MegaHAL is a learning chatterbot
Markov-chain-bot-module - A chat bot utilizing Markov chains. It speaks Russian and English
stealth - An open source Ruby framework for conversational voice and text chatbots

Classification

Classification aims to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort.

Classifier - a general module to allow Bayesian and other types of classifications
classifier-reborn - (a fork of cardmagic/classifier) a general classifier module to allow Bayesian and other types of classifications
fastText Ruby - efficient text classification and representation learning - for Ruby
Latent Dirichlet Allocation - used to automatically cluster documents into topics
liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
linnaeus - a redis-backed Bayesian classifier
maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
Naive-Bayes - simple Naive Bayes classifier
nbayes - a full-featured, Ruby implementation of Naive Bayes
omnicat - a generalized rack framework for text classifications
omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy
stuff-classifier - a library for classifying text into multiple categories
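
To make the idea of assigning text to categories concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing in plain Ruby. The class name `TinyBayes`, its methods, and the training sentences are assumptions made up for illustration; none of the gems listed above is claimed to work this way.

```ruby
# Minimal multinomial Naive Bayes text classifier with add-one smoothing.
# TinyBayes and its method names are hypothetical, for illustration only.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = {}
  end

  # Record one labelled training document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log-probability for the text.
  def classify(text)
    words = tokenize(text)
    total_docs = @doc_counts.values.sum
    @doc_counts.keys.max_by do |label|
      label_total = @word_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label].to_f / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[label][word] + 1.0) / (label_total + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end

  private

  # Lowercase the text and keep runs of letters and apostrophes as tokens.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyBayes.new
classifier.train(:spam, "win cash now, claim your free prize")
classifier.train(:spam, "free money, click now")
classifier.train(:ham, "meeting notes attached for tomorrow")
classifier.train(:ham, "lunch tomorrow with the project team")

puts classifier.classify("claim your free cash")
puts classifier.classify("notes for the team meeting")
```

Working in log space avoids floating-point underflow on longer texts, and the smoothing term keeps unseen words from zeroing out a class.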

Date and Time

Chronic - a pure Ruby natural language date parser
Chronic Between - a simple Ruby natural language parser for date and time ranges
Chronic Duration - a simple Ruby natural language parser for elapsed time
dotiw - Better distance of time in words for Rails http://ryanbigg.com
Kronic - a dirt simple library for parsing and formatting human readable dates
Nickel - extracts date, time, and message information from naturally worded text
Tickle - a natural language parser for recurring events
time_ago_in_words - Humanize elapsed time from some Time instance to Time.now
time-lord - adds extra functionality to the Time class

Emoji

active_emoji - A collection of emoji aliases for core Ruby methods
emoji - A gem. For Emoji. For everyone. ❤
gemoji - Emoji images and names
gemoji-parser - The missing helper methods for GitHub's gemoji gem
rumoji - Encode and decode emoji Unicode characters into emoji-cheat-sheet form [article]

Error Correction

Chat Correct - shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence
gingerice - Ruby wrapper for correcting spelling and grammar mistakes based on the context of complete sentences

Full-Text Search

ferret - an information retrieval library in the same vein as Apache Lucene
ranguba - a project to provide a full-text search system built on Groonga
Thinking Sphinx - Sphinx plugin for ActiveRecord/Rails

Keyword Ranking

graph-rank - Ruby implementation of the PageRank and TextRank algorithms
highscore - find and rank keywords in text

Language Detection

Compact Language Detection - blazing-fast language detection for Ruby provided by Google Chrome's Compact Language Detector
Detect Language API Client - detects language of given text and returns detected language codes and scores
whatlanguage - a language detection library for Ruby that uses bloom filters for speed

Language Localization

fast_gettext - Ruby GetText, but 3.5x faster + 560x less memory + simple + clean namespace + thread-safe + extendable + multiple backends + Rails3 ready
ruby-gettext - pure Ruby Localization(L10n) library and tool which is modeled after the GNU gettext package

Lexical Databases and Ontologies

Lexical databases, common-sense knowledge bases, multilingual lexicalized semantic networks, and ontologies.

BabelNet

BabelNet API client - API (with Ruby examples) for BabelNet, multilingual lexicalized semantic network and ontology

ConceptNet

ConceptNet API - REST API for ConceptNet

Mediawiki, Wikipedia

mediawiki-ruby-api - GitHub mirror of "mediawiki/ruby/api" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)
wikipedia-client - Ruby client for the Wikipedia API http://github.com/kenpratt/wikipedia-client

Wordnet

ruby-wordnet - A Ruby interface to the WordNet® Lexical Database. http://deveiate.org/projects/Ruby-WordNet
rwordnet - a pure Ruby interface to the WordNet lexical/semantic database

Machine Learning

Decision Tree - a Ruby library that implements the ID3 (information gain) algorithm for decision tree learning
rb-libsvm - implementation of SVM, a machine learning and classification algorithm
RubyFann - a Ruby gem that binds to FANN (Fast Artificial Neural Network) from within a Ruby/Rails environment
tensorflow.rb - TensorFlow for Ruby
tensor_stream - A ground-up and standalone reimplementation of TensorFlow for Ruby

Machine Translation

Google API Client - Google API Ruby Client
microsoft_translator - Ruby client for the Microsoft Translator API
termit - Google Translate with speech synthesis in your terminal as a Ruby gem

Miscellaneous

Abbrev - Calculates the set of unique abbreviations for a given set of strings
calyx - A Ruby library for generating text with declarative recursive grammars
dialable - A Ruby gem that provides parsing and output of North American Numbering Plan (NANP) phone numbers, and includes location & time zones
gibber - Gibber replaces text with nonsensical Latin with a maximum size difference of +/- 30%
hiatus - a localization QA tool
language_filter - a Ruby gem to detect and optionally filter multiple categories of language
Naturally - Natural (version number) sorting with support for legal document numbering, college course codes, and Unicode
RLTK - The Ruby Language Toolkit http://chriswailes.github.io/RLTK/
ruby-spacy - A wrapper module for using spaCy natural language processing library from the Ruby programming language via PyCall
Shellwords - Manipulates strings like the UNIX Bourne shell
sort_alphabetical - sort UTF-8 strings alphabetically via an Enumerable extension
spintax_parser - A mixin to parse "spintax", a text format used for automated article generation. Can handle nested spintax.
stringex - some [hopefully] useful extensions to Ruby’s String class
twitter-text - gem that provides text processing routines for Twitter Tweets
nameable - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching

Multipurpose Tools

The following are libraries that integrate multiple NLP tools or functionality.

nlp - NLP tools for the Polish language
NlpToolz - Basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer and POS tagger, plus the Berkeley Parser
Open NLP (Ruby bindings)
Stanford Core NLP (Ruby bindings)
Treat - natural language processing framework for Ruby
twitter-cldr-rb - TwitterCldr uses Unicode's Common Locale Data Repository (CLDR) to format certain types of text into their localized equivalents
ve - a linguistic framework that's easy to use
zipf - a collection of various NLP tools and libraries

Named Entity Recognition

Confidential Info Redactor - a Ruby gem to semi-automatically redact confidential information from a text
ruby-ner - named entity recognition with Stanford NER and Ruby
ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer

Ngrams

N-Gram - N-Gram generator in Ruby
ngram - break words and phrases into ngrams
raingrams - a flexible and general-purpose ngrams library written in Ruby

Numbers

humanize - Takes your numbers and makes them fancy
numbers_and_words - convert numbers to words using I18N
numbers_in_words - to convert numbers into English words and vice versa

Parsers

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.

linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
Parslet - A small PEG based parser library
rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
Treetop - a Ruby-based parsing DSL based on parsing expression grammars

Part-of-Speech Taggers

engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
rbtagger - a simple Ruby rule-based part-of-speech tagger
TreeTagger for Ruby - Ruby based wrapper for the TreeTagger by Helmut Schmid
treetagger-ruby - The Ruby based wrapper for the TreeTagger by Helmut Schmid

Readability

lingua - Lingua::EN::Readability is a Ruby module which calculates statistics on English text

Regular Expressions

CommonRegexRuby - find many kinds of common information in a string
regexp-examples - generate strings that match a given regular expression
verbal_expressions - make difficult regular expressions easy

Online resources

Explain Regular Expression - breakdown and explanation of each part of your regular expression
Rubular - a Ruby regular expression editor

Ruby NLP Presentations

Quickly Create a Telegram Bot in Ruby [tutorial] - Ardian Haxha (2016)
N-gram Analysis for Fun and Profit [tutorial] - Jesus Castello (2015)
Machine Learning made simple with Ruby [tutorial] - Lorenzo Masini (2015)
Using Ruby Machine Learning to Find Paris Hilton Quotes [tutorial] - Rick Carlino (2015)
Exploring Natural Language Processing in Ruby [slides] - Kevin Dias (2015)
Natural Language Parsing with Ruby [tutorial] - Glauco Custódio (2014)
Demystifying Data Science (Analyzing Conference Talks with Rails and Ngrams) [video RailsConf 2014 | Repo from the Video] - Todd Schneider (2014)
Natural Language Processing with Ruby [video ArrrrCamp 2014 | video Ruby Conf India] - Konstantin Tennhard (2014)
How to parse 'go' - Natural Language Processing in Ruby [slides] - Tom Cartwright (2013)
Natural Language Processing in Ruby [slides | video] - Brandon Black (2013)
Natural Language Processing with Ruby: n-grams [tutorial] - Nathan Kleyn (2013)
A Tour Through Random Ruby [tutorial] - Robert Qualls (2013)

Sentence Generation

gabbler - Gab-bler (noun) - rapid, unintelligible talk
faker - A library for generating fake data such as names, addresses, and phone numbers
kusari - Japanese random sentence generator based on Markov chain
literate_randomizer - generates near-English prose using Markov chains
markov-sentence-generator - Generates a random, locally-correct sentence using textual input and a Markov model
marky_markov - Markov Chain Generator
poem-generator - A generator for gothic poems
poetry - poetry generator
pwqgen.rb - Ruby implementation of passwdqc's pwqgen, a random pronounceable password generator
ramble - library for generating sentences from a yacc grammar
token_phrase - A token phrase generator

Sentence Segmentation

Sentence segmentation (aka sentence boundary disambiguation, sentence boundary detection) is the problem in natural language processing of deciding where sentences begin and end. Sentence segmentation is the foundation of many common NLP tasks (machine translation, bitext alignment, summarization, etc.).
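
As a rough illustration of why boundary detection is harder than splitting on periods, the following is a naive rule-based sketch in plain Ruby. The names `ABBREVIATIONS` and `split_sentences` and the abbreviation list are assumptions for this example; a production segmenter handles many more cases (initials, decimals, ellipses, other languages).

```ruby
# Naive rule-based sentence splitter.
# ABBREVIATIONS and split_sentences are hypothetical names for this sketch.
ABBREVIATIONS = %w[mr mrs ms dr prof etc vs e.g i.e].freeze

def split_sentences(text)
  sentences = []
  current = []
  text.split.each do |token|
    current << token
    # A candidate boundary is terminal punctuation, optionally followed by a closing quote or bracket.
    next unless token.match?(/[.!?]["')\]]*\z/)

    # Strip trailing punctuation and compare the bare word with known abbreviations.
    bare = token.downcase.sub(/[.!?"')\]]+\z/, "")
    next if token.end_with?(".") && ABBREVIATIONS.include?(bare)

    sentences << current.join(" ")
    current = []
  end
  sentences << current.join(" ") unless current.empty?
  sentences
end

text = "Dr. Smith met Mr. Jones on Monday. Was the meeting useful? It was!"
split_sentences(text).each { |sentence| puts sentence }
```

Splitting on every period would cut this text after "Dr." and "Mr."; the abbreviation check is the simplest possible disambiguation step.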

Speech-to-Text

att_speech - A Ruby library for consuming the AT&T Speech API for speech to text
pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx
Speech2Text - uses the Google Speech-to-Text API to provide a simple interface for converting audio files

Stemmers

Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form.

Greek stemmer - a Greek stemmer
Ruby-Stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby
Sastrawi - Ruby bindings for Sastrawi, a library which allows you to stem words in Bahasa Indonesia
Turkish stemmer - a Turkish stemmer
uea-stemmer - a conservative stemmer for search and indexing
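
The reduction of inflected words to a stem can be sketched with simple suffix stripping in plain Ruby. This is a deliberately crude illustration, not the Porter or Snowball algorithm used by real stemmers; `SUFFIXES` and `naive_stem` are hypothetical names, and the suffix list is an assumption that covers only a few English endings.

```ruby
# Naive suffix-stripping stemmer for English.
# SUFFIXES and naive_stem are hypothetical names for this sketch.
SUFFIXES = %w[ingly edly ing ed ly es s].freeze

def naive_stem(word)
  word = word.downcase
  SUFFIXES.each do |suffix|
    # Strip the first matching suffix, but only if at least three characters remain.
    next unless word.end_with?(suffix) && word.length - suffix.length >= 3

    return word.delete_suffix(suffix)
  end
  word
end

%w[jumping jumped jumps quickly is].each do |word|
  puts "#{word} -> #{naive_stem(word)}"
end
```

A stemmer this simple makes obvious mistakes (for example, it turns "running" into "runn"), which is why the libraries above rely on language-specific rule sets.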

Stop Words

clarifier
stopwords - really just a list of stopwords with some helpers
Stopwords Filter - a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence

Summarization

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

Epitome - A small gem to make your text shorter; an implementation of the Lexrank algorithm
ots - Ruby bindings to Open Text Summarizer
summarize - Ruby C wrapper for Open Text Summarizer
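
One simple way to retain "the most important points" is extractive summarization: score each sentence and keep the best ones. The sketch below scores sentences by word frequency in plain Ruby (2.7 or later, for `tally`). It is not LexRank and not how Open Text Summarizer works; `STOP_WORDS`, `summarize_text`, and the sample text are assumptions for illustration.

```ruby
# Frequency-based extractive summarizer.
# STOP_WORDS and summarize_text are hypothetical names for this sketch.
STOP_WORDS = %w[a an and are as at be by for in is it of on that the to was with].freeze

def summarize_text(text, sentence_count = 1)
  sentences = text.split(/(?<=[.!?])\s+/)
  frequency = (text.downcase.scan(/[a-z]+/) - STOP_WORDS).tally

  # Score each sentence by the summed document frequency of its content words.
  scored = sentences.each_with_index.map do |sentence, index|
    tokens = sentence.downcase.scan(/[a-z]+/) - STOP_WORDS
    [tokens.sum { |token| frequency[token] }, index, sentence]
  end

  # Keep the top-scoring sentences (earlier ones win ties), then restore document order.
  scored.max_by(sentence_count) { |score, index, _| [score, -index] }
        .sort_by { |_, index, _| index }
        .map { |_, _, sentence| sentence }
        .join(" ")
end

text = "Ruby is a dynamic language. Ruby gems make text processing in Ruby simple. Cats sleep a lot."
puts summarize_text(text)
```

Summing frequencies favors long sentences; dividing by sentence length is a common adjustment if that bias matters.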

Text Extraction

docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
rtesseract - Ruby library for working with the Tesseract OCR
Ruby Readability - a tool for extracting the primary readable content of a webpage
ruby-tesseract - This wrapper binds the TessBaseAPI object through ffi-inline (which means it will work on JRuby too) and then wraps that API in a more Ruby-esque Engine class
Yomu - a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit

Text Similarity

amatch - collection of five types of distance between strings (including Levenshtein, Sellers, Jaro-Winkler, and 'pair distance'; the last seems to work well for finding similarity in long phrases)
damerau-levenshtein - calculates edit distance using the Damerau-Levenshtein algorithm
FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
fuzzy-string-match - fuzzy string matching library for Ruby
FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
Going the Distance - contains scripts that do various distance calculations
hotwater - Fast Ruby FFI string edit distance algorithms
levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
soundex - A soundex function coded in Ruby
text - Collection of text algorithms
TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
tf-idf-similarity - calculate the similarity between texts using tf*idf

Text-to-Speech

espeak-ruby - small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files
Isabella - a voice-computing assistant built in Ruby
tts - a Ruby gem for converting text to speech using the Google Translate service

Tokenizers

Jieba - Chinese tokenizer and segmenter (JRuby)
MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
Pragmatic Tokenizer - a multilingual tokenizer to split a string into tokens
rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
Textoken - Simple and customizable text tokenization gem
thailang4r - Thai tokenizer
tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
tokenizer - a simple multilingual tokenizer

Word Count

wc - a Ruby gem to count word occurrences in a given text
word_count - a word counter for String and Hash in Ruby
Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
WordsCounted - a highly customisable Ruby text analyser
