Gibran is an Elixir natural language processor, and a port of WordsCounted.
Latest commit: 872fa9b (Feb 14, 2017) by @abitdodgy, "Merge pull request #10 from GeoffreyPS/soundex".


Yesterday is but today's memory, and tomorrow is today's dream.


Gibran is an Elixir port of WordsCounted, a Ruby natural language processor. I have lofty goals for Gibran, such as:

  • Metaphone phonetic coding system
  • Soundex algorithm
  • Porter Stemming algorithm
  • String similarity as described by Simon White

For now, you'll have to be content with a powerful tokeniser and a utility counter. Gibran currently provides:

  • Token count, unique token count, and character count.
  • Average characters per token.
  • HashDicts of tokens and their frequencies, lengths, and densities.
  • The longest token(s) and their length.
  • The most frequent token(s) and their frequency.
  • Unique tokens.
  • Levenshtein distance algorithm.
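
To make the counter statistics listed above concrete, here is a minimal sketch of how they could be computed over a list of tokens. `CounterSketch` and all of its function names are hypothetical and only illustrate the arithmetic; this is not Gibran's implementation. Plain maps are used here rather than HashDicts, density is assumed to mean frequency divided by token count, and the token list is assumed to be non-empty.

```elixir
# Hypothetical module: illustrates the statistics, not Gibran's own code.
defmodule CounterSketch do
  # Number of tokens, duplicates included.
  def token_count(tokens), do: length(tokens)

  # Number of distinct tokens.
  def uniq_token_count(tokens), do: tokens |> Enum.uniq() |> length()

  # Total number of characters across all tokens.
  def char_count(tokens), do: tokens |> Enum.map(&String.length/1) |> Enum.sum()

  # Mean token length as a float.
  def average_chars_per_token(tokens), do: char_count(tokens) / token_count(tokens)

  # Map of token => number of occurrences.
  def frequencies(tokens) do
    Enum.reduce(tokens, %{}, fn token, acc -> Map.update(acc, token, 1, &(&1 + 1)) end)
  end

  # Map of token => share of all tokens (assumed definition of density).
  def densities(tokens) do
    total = token_count(tokens)
    Map.new(frequencies(tokens), fn {token, n} -> {token, n / total} end)
  end

  # Every distinct token tied for the greatest length, paired with that length.
  def longest_tokens(tokens) do
    max = tokens |> Enum.map(&String.length/1) |> Enum.max()
    longest = tokens |> Enum.uniq() |> Enum.filter(&(String.length(&1) == max))
    {longest, max}
  end

  # Every token tied for the highest frequency, paired with that frequency.
  def most_frequent_tokens(tokens) do
    freqs = frequencies(tokens)
    max = freqs |> Map.values() |> Enum.max()
    most = for {token, n} <- freqs, n == max, do: token
    {Enum.sort(most), max}
  end
end

tokens = ["yesterday", "is", "but", "today's", "memory",
          "and", "tomorrow", "is", "today's", "dream"]

IO.inspect(CounterSketch.token_count(tokens))             # => 10
IO.inspect(CounterSketch.uniq_token_count(tokens))        # => 8
IO.inspect(CounterSketch.average_chars_per_token(tokens)) # => 5.2
IO.inspect(CounterSketch.longest_tokens(tokens))          # => {["yesterday"], 9}
IO.inspect(CounterSketch.most_frequent_tokens(tokens))    # => {["is", "today's"], 2}
```

Returning the tied tokens as a list keeps the "token(s)" cases uniform: a single winner is simply a one-element list.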


Let's start with something simple.

alias Gibran.Tokeniser
alias Gibran.Counter

str = "Yesterday is but today's memory, and tomorrow is today's dream."
Tokeniser.tokenise(str)
# => ["yesterday", "is", "but", "today's", "memory", "and", "tomorrow", "is", "today's", "dream"]

Tokeniser.tokenise(str) |> Counter.uniq_token_count
# => 8

By default Gibran uses the following regular expression to tokenise strings: ~r/[^\p{L}'-]/u. However, you can provide your own regular expression through the pattern option. You can also combine pattern with exclude to create sophisticated tokenisation strategies.

Tokeniser.tokenise(str, exclude: &(String.length(&1) < 4)) |> Counter.token_count
# => 6
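
To show what the default pattern does, here is a minimal sketch of a tokeniser that treats the pattern as a separator. `TokeniserSketch` is a hypothetical stand-in, not Gibran's code. It assumes that input is downcased first and that a custom `pattern` also describes separators, as the negated character class in the default suggests.

```elixir
# Hypothetical module; Gibran's real tokeniser may differ in detail.
defmodule TokeniserSketch do
  # The default pattern quoted above: any character that is not a letter,
  # an apostrophe, or a hyphen is treated as a separator.
  defp default_pattern, do: ~r/[^\p{L}'-]/u

  def tokenise(input, opts \\ []) do
    pattern = Keyword.get(opts, :pattern, default_pattern())

    input
    |> String.downcase()
    # trim: true drops the empty strings left between adjacent separators.
    |> String.split(pattern, trim: true)
  end
end

str = "Yesterday is but today's memory, and tomorrow is today's dream."

IO.inspect(TokeniserSketch.tokenise(str))
# => ["yesterday", "is", "but", "today's", "memory", "and", "tomorrow", "is", "today's", "dream"]

# The default keeps hyphenated words whole; a stricter pattern splits them.
IO.inspect(TokeniserSketch.tokenise("Sand-and-Foam"))
# => ["sand-and-foam"]
IO.inspect(TokeniserSketch.tokenise("Sand-and-Foam", pattern: ~r/[^\p{L}]/u))
# => ["sand", "and", "foam"]
```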

The exclude option accepts a string, a function, a regular expression, or a list combining any one or more of those types.

# Using `exclude` with a function.
Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))

# Using `exclude` with a regular expression.
Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)

# Using `exclude` with a string.
Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
# => ["the", "prophet"]

# Using `exclude` with a list of a combination of types.
Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
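
One way to picture how a single `exclude` option can accept all of these types is a filter that dispatches on the type of its argument. The sketch below is an assumption about the general shape, not Gibran's source. `ExcludeSketch.reject/2` is a hypothetical name, a string is assumed to be a space-separated list of words to drop, and a regular expression is assumed to drop any token it matches anywhere.

```elixir
# Hypothetical module: a sketch of type-based dispatch for an exclude filter.
defmodule ExcludeSketch do
  # Removes every token that the filter excludes.
  def reject(tokens, filter) do
    Enum.reject(tokens, &excluded?(&1, filter))
  end

  # A list excludes a token if any one of its members does.
  defp excluded?(token, filters) when is_list(filters) do
    Enum.any?(filters, &excluded?(token, &1))
  end

  # A function excludes the tokens for which it returns true.
  defp excluded?(token, fun) when is_function(fun, 1), do: fun.(token)

  # A string is read as space-separated words (assumption), compared in lowercase.
  defp excluded?(token, words) when is_binary(words) do
    token in (words |> String.downcase() |> String.split())
  end

  # A regular expression excludes any token it matches (assumption).
  defp excluded?(token, %Regex{} = regex), do: Regex.match?(regex, token)
end

tokens = ["eye", "of", "the", "prophet"]

IO.inspect(ExcludeSketch.reject(tokens, "eye of"))
# => ["the", "prophet"]

IO.inspect(ExcludeSketch.reject(tokens, ["eye", &(String.ends_with?(&1, "he")), ~r/of/]))
# => ["prophet"]
```

The outputs above are those of the sketch under its stated assumptions. Because the list clause recurses into the other clauses, nested combinations need no extra code.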

Gibran has a shortcut to work with strings directly instead of running them through the tokeniser first.

Gibran.from_string(str, :token_count, opts: [exclude: &String.length(&1) < 4])
# => 6

Gibran normalises input before applying transformations to avoid inconsistencies that can arise from character-casing.

Levenshtein distance

Ordinary use:

iex(1)> Gibran.Levenshtein.distance("kitten", "sitting")

The Levenshtein distance for the same string is 0.

iex(2)> Gibran.Levenshtein.distance("snail", "snail")

The Levenshtein distance is case-sensitive.

iex(3)> Gibran.Levenshtein.distance("HOUSEBOAT", "houseboat")

The function can accept charlists as well as strings.

iex(4)> Gibran.Levenshtein.distance('jogging', 'logger')
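
For readers who want to see what these calls compute, here is a self-contained sketch of the classic row-by-row dynamic programming formulation of Levenshtein distance. `LevenshteinSketch` is a hypothetical module and is not taken from Gibran's source; the values in the comments are what this sketch returns for the inputs used above.

```elixir
# Hypothetical module: a textbook Levenshtein distance, not Gibran's code.
defmodule LevenshteinSketch do
  # Accepts strings or charlists and returns the number of single-character
  # insertions, deletions, and substitutions needed to turn `a` into `b`.
  def distance(a, b) do
    a = to_graphemes(a)
    b = to_graphemes(b)

    # Row 0: turning the empty prefix of `a` into each prefix of `b`.
    first_row = Enum.to_list(0..length(b))

    a
    |> Enum.reduce(first_row, fn a_char, prev_row -> next_row(a_char, b, prev_row) end)
    |> List.last()
  end

  # Builds one row of the table from the previous row.
  defp next_row(a_char, b, [first | rest] = prev_row) do
    [b, prev_row, rest]
    |> Enum.zip()
    |> Enum.reduce([first + 1], fn {b_char, diagonal, above}, [left | _] = acc ->
      cost = if a_char == b_char, do: 0, else: 1
      # deletion, insertion, substitution (or match when cost is 0)
      [Enum.min([above + 1, left + 1, diagonal + cost]) | acc]
    end)
    |> Enum.reverse()
  end

  defp to_graphemes(s) when is_binary(s), do: String.graphemes(s)
  defp to_graphemes(s) when is_list(s), do: s |> List.to_string() |> String.graphemes()
end

IO.inspect(LevenshteinSketch.distance("kitten", "sitting"))        # => 3
IO.inspect(LevenshteinSketch.distance("snail", "snail"))           # => 0
IO.inspect(LevenshteinSketch.distance("HOUSEBOAT", "houseboat"))   # => 9
IO.inspect(LevenshteinSketch.distance(~c"jogging", ~c"logger"))    # => 4
```

Comparing graphemes without downcasing is what makes the distance case-sensitive: every letter of "HOUSEBOAT" counts as a substitution.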

The doctests contain extensive usage examples. Please take a look there for more details.