
Awesome NLP with Ruby

Useful resources for text processing in Ruby

This curated list comprises awesome resources, libraries, and information sources about the computational processing of texts in human languages with the Ruby programming language. That field is often referred to as NLP, Computational Linguistics, or HLT (Human Language Technology), and it is closely related to Artificial Intelligence, Machine Learning, Information Retrieval, Text Mining, Knowledge Extraction, and other disciplines.

This list comes from our day-to-day work on Language Models and NLP Tools. Read why this list is awesome. Our FAQ documents the important decisions and answers questions you may be interested in.

Our main goal is to promote Ruby as a tool for NLP-related tasks. Your help, suggestions, and contributions are welcome! We kindly ask you to study the Contribution section. Follow us on Twitter and please spread the word using the #RubyNLP hashtag!


NLP Pipeline Subtasks

Pipeline Generation

Multipurpose Engines

On-line APIs

Language Identification

Language Identification is one of the first crucial steps in every NLP Pipeline.

  • scylla - Language Categorization and Identification.


Segmentation

Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation.

  • tokenizer - Simple multilingual tokenizer. [tutorial]
  • pragmatic_tokenizer - Multilingual tokenizer to split a string into tokens.
  • nlp-pure - Natural language processing algorithms implemented in pure Ruby with minimal dependencies.
  • textoken - Simple and customizable text tokenization library.
  • pragmatic_segmenter - Rule-based Sentence Boundary Disambiguation that works across many languages.
  • punkt-segmenter - Pure Ruby implementation of the Punkt Segmenter.
  • Tactful_Tokenizer - RegExp based tokenizer for different languages.
  • scalpel - Sentence Boundary Disambiguation tool.
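
To illustrate why dedicated segmentation tools exist, here is a minimal sketch of regex-based tokenization and sentence splitting in plain Ruby. The method names `simple_tokenize` and `simple_sentences` are hypothetical and do not belong to any of the gems listed above; the rules are deliberately naive.

```ruby
# Naive regex-based segmentation. Real tools handle abbreviations,
# URLs, clitics, and language-specific rules far more carefully.

# Split text into word, number, and punctuation tokens.
def simple_tokenize(text)
  text.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)?|\d+(?:\.\d+)?|[^\s[:alnum:]]/)
end

# Split after '.', '!' or '?' when followed by whitespace.
def simple_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

text = "Dr. Smith isn't here. He left at 9.30!"

p simple_tokenize(text)
p simple_sentences(text)
```

The sentence splitter breaks after "Dr.", which is exactly the kind of error that boundary disambiguation is meant to prevent.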

Lexical Processing


Stemming

Stemming is the term used in information retrieval for the process of reducing word forms to some base representation. Stemming should be distinguished from Lemmatization, since stems do not necessarily have a linguistic motivation.

  • ruby-stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby.
  • uea-stemmer - Conservative stemmer for search and indexing.
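
As a rough illustration of the idea, the sketch below strips a few English suffixes in plain Ruby. The suffix list and the `naive_stem` method are hypothetical and far cruder than the Snowball or UEA rules; they only show that a stem need not be a real word.

```ruby
# Deliberately crude suffix list (an assumption for illustration only).
SUFFIXES = %w[ing ed es s].freeze

# Strip the first matching suffix, keeping at least three characters.
def naive_stem(word)
  w = word.downcase
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...-suffix.length] : w
end

%w[walked studies running bus].each do |word|
  puts "#{word} -> #{naive_stem(word)}"
end
```

Forms such as "studi" and "runn" are useful as index keys for search even though they are not dictionary words.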


Lemmatization

Lemmatization is the process of finding the base (dictionary) form of a word. Lemmas are often collected in dictionaries.

  • lemmatizer - WordNet based Lemmatizer for English texts.
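
In its simplest form, lemmatization can be sketched as a dictionary lookup. The `LEMMAS` table and the `toy_lemma` method below are hypothetical toy data, not the API of any gem listed above; real lemmatizers rely on resources such as WordNet and usually take the part of speech into account.

```ruby
# Toy lemma dictionary (hypothetical data for illustration only).
LEMMAS = {
  "went"    => "go",
  "mice"    => "mouse",
  "better"  => "good",
  "studies" => "study"
}.freeze

# Return the dictionary form if known, otherwise the lowercased word.
def toy_lemma(word)
  w = word.downcase
  LEMMAS.fetch(w, w)
end

%w[Went mice studies ruby].each do |word|
  puts "#{word} -> #{toy_lemma(word)}"
end
```

Unlike a stemmer, the lookup maps irregular forms such as "went" to a linguistically valid base form.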

Counting Types and Tokens

  • wc - Facilities to count word occurrences in a text.
  • word_count - Word counter for String and Hash objects.
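
For readers new to the terminology: tokens are running words, types are distinct words. A minimal sketch in plain Ruby, assuming a naive letters-only tokenizer (the `type_token_stats` method is hypothetical and not part of the gems above):

```ruby
# Count tokens (running words) and types (distinct words) in a text.
def type_token_stats(text)
  tokens = text.downcase.scan(/[[:alpha:]]+/)
  counts = Hash.new(0)
  tokens.each { |token| counts[token] += 1 }
  {
    counts: counts,
    tokens: tokens.size,
    types: counts.size,
    # Type-token ratio, a simple measure of lexical variety.
    ttr: tokens.empty? ? 0.0 : counts.size.to_f / tokens.size
  }
end

stats = type_token_stats("the cat sat on the mat and the dog sat too")
puts "tokens: #{stats[:tokens]}, types: #{stats[:types]}"
puts "most frequent: #{stats[:counts].max_by { |_, n| n }.inspect}"
```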

Phrasal Level Processing

  • N-Gram - N-Gram generator.
  • ruby-ngram - Break words and phrases into ngrams.
  • raingrams - Flexible and general-purpose ngrams library written in pure Ruby.
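
N-grams are contiguous sequences of n items, either words or characters. Ruby's standard `Enumerable#each_cons` is enough for a minimal sketch; the helper names `ngrams` and `char_ngrams` are hypothetical and not taken from the gems above.

```ruby
# Word-level n-grams from an array of tokens.
def ngrams(tokens, n)
  tokens.each_cons(n).to_a
end

# Character-level n-grams from a single word.
def char_ngrams(word, n)
  word.chars.each_cons(n).map(&:join)
end

p ngrams(%w[to be or not to be], 2)
p char_ngrams("ruby", 2)
```

If the input is shorter than n, both helpers return an empty array.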

Syntactic Processing

Constituency Parsing

Semantic Analysis

  • amatch - Set of five distance types between strings (including Levenshtein, Sellers, Jaro-Winkler, 'pair distance').
  • damerau-levenshtein - Calculates edit distance using the Damerau-Levenshtein algorithm.
  • FuzzyTools - In-memory TF/IDF fuzzy document finding with a fancy default tokenizer.
  • Going the Distance - Contains scripts that do various distance calculations.
  • hotwater - Fast Ruby FFI string edit distance algorithms.
  • levenshtein-ffi - Fast string edit distance computation, using the Damerau-Levenshtein algorithm.
  • TF-IDF - Term Frequency / Inverse Document Frequency in pure Ruby.
  • tf-idf-similarity - Calculate the similarity between texts using TF/IDF.
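
Several of the libraries above compute edit distances. As a reference point, here is a minimal pure-Ruby sketch of the plain Levenshtein distance (insertions, deletions, substitutions) using two rows of the dynamic programming table. It does not handle transpositions, so it is not the Damerau-Levenshtein variant, and the `levenshtein` method is a hypothetical helper rather than the API of any listed gem.

```ruby
# Levenshtein distance between two strings, keeping only two table rows.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [
        prev[j] + 1,        # deletion
        curr[j - 1] + 1,    # insertion
        prev[j - 1] + cost  # substitution or match
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("kitten", "sitting")
```

The FFI-based gems in the list exist because a pure-Ruby loop like this becomes slow on long strings or large candidate sets.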

Pragmatical Analysis

High Level Tasks

Spelling and Error Correction

Text Alignment

  • alignment - Alignment routines for bilingual texts (Gale-Church implementation).

Machine Translation

Dialog Systems

  • chatterbot - Straightforward Ruby-based Twitter Bot Framework, using OAuth to authenticate.
  • Lita - Chat bot written in Ruby with persistent storage provided by Redis.

Sentiment Analysis

Date and Time Parsing

  • Chronic - Pure Ruby natural language date parser.
  • Chronic Between - Simple Ruby natural language parser for date and time ranges.
  • chronic_duration - Pure Ruby parser for elapsed time.
  • Kronic - Methods for parsing and formatting human readable dates.
  • Nickel - Extracts date, time, and message information from naturally worded text.
  • Tickle - Parser for recurring and repeating events.

Named Entity Recognition

  • ruby-ner - Named Entity Recognition with Stanford NER and Ruby.
  • ruby-nlp - Ruby Binding for the Stanford POS Tagger and Named Entity Recognizer.


Text-to-Speech and Speech-to-Text

  • espeak-ruby - Small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files.
  • Isabella - Voice-computing assistant built in Ruby.
  • tts - Text-to-Speech conversion using the Google translate service.
  • att_speech - Ruby wrapper over the AT&T Speech API for speech to text.
  • pocketsphinx-ruby - Pocketsphinx bindings.

Linguistic Resources

Machine Learning Libraries

Machine Learning Algorithms in pure Ruby or written in other programming languages with appropriate bindings for Ruby.

  • rb-libsvm - Support Vector Machines with Ruby.
  • weka-jruby - JRuby bindings for Weka, different ML algorithms implemented through Weka.
  • decisiontree - Decision Tree ID3 Algorithm in pure Ruby.
  • rtimbl - Memory based learners from the Timbl framework.
  • classifier-reborn - General classifier module to allow Bayesian and other types of classifications.
  • lda-ruby - Ruby implementation of the LDA (Latent Dirichlet Allocation) for automatic Topic Modelling and Document Clustering.
  • liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification).
  • linnaeus - Redis-backed Bayesian classifier.
  • maxent_string_classifier - JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework.
  • Naive-Bayes - Simple Naive Bayes classifier.
  • nbayes - Full-featured, Ruby implementation of Naive Bayes.
  • omnicat - Generalized rack framework for text classifications.
  • omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy.

Full Text Search, Information Retrieval, Indexing

Language Aware String Manipulation

Libraries for language aware string manipulation, i.e. search, pattern matching, case conversion, transcoding, regular expressions which need information about the underlying language.

  • FuzzyMatch - Find a needle in a haystack based on string similarity and regular expression rules.
  • fuzzy-string-match - Fuzzy string matching library for Ruby.
  • active_support - The Ruby on Rails ActiveSupport gem has various string extensions that can handle case.
  • u - U extends Ruby’s Unicode support.
  • unicode - Unicode normalization library.
  • CommonRegexRuby - Find many kinds of common information in a string.
  • regexp-examples - Generate strings that match a given regular expression.
  • verbal_expressions - Make difficult regular expressions easy.

Articles, Posts, Talks, and Presentations


Books

  • Miller, Rob. Text Processing with Ruby: Extract Value from the Data That Surrounds You. Pragmatic Programmers, 2015. [link]
  • Watson, Mark. Scripting Intelligence: Web 3.0 Information Gathering and Processing. Apress, 2010. [link]
  • Watson, Mark. Practical Semantic Web and Linked Data Applications. Lulu, 2010. [link]


Related Resources


Contributing

We are very glad to see you in this section and highly appreciate any help!

We also care about the quality of this list. If you want to contribute, please:

  • agree that your work will be published under the terms of the CC0 license;
  • carefully read the Contribution Guidelines.

Some of the open tasks for contributors are listed in the todo file. You may want to start there.


License

Creative Commons Zero 1.0: Awesome NLP with Ruby by Andrei Beliankou and Contributors.

To the extent possible under law, the person who associated CC0 with Awesome NLP with Ruby has waived all copyright and related or neighboring rights to Awesome NLP with Ruby.

You should have received a copy of the CC0 legalcode along with this work. If not, see