Gibran is an Elixir natural language processor, and a port of WordsCounted.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
config initial commit Sep 29, 2015
lib Add documentation and type specification Aug 3, 2016
test Include test for handling charlists Aug 3, 2016
.gitignore adds mix docs Oct 12, 2015
.travis.yml adds travis version Nov 13, 2015
README.md Updated README.md Apr 23, 2017
mix.exs Bump minor version Jun 30, 2016
mix.lock adds mix docs Oct 12, 2015

README.md

Gibran

Yesterday is but today's memory, and tomorrow is today's dream.

Gibran

Gibran is an Elixir natural language processor. Lofty goals for Gibran include:

  • Metaphone phonetic coding system
  • Soundex algorithm
  • Porter Stemming algorithm
  • String similarity as described by Simon White

Currently, Gibran ships with the following features:

  • Token count, unique token count, and character count
  • Average characters per token
  • HashDicts of tokens and their frequencies, lengths, and densities
  • The longest token(s) and its length
  • The most frequent token(s) and its frequency
  • Unique tokens
  • Levenshtein distance algorithm

Usage

Let's start with something simple.

alias Gibran.Tokeniser
alias Gibran.Counter

str = "Yesterday is but today's memory, and tomorrow is today's dream."
Tokeniser.tokenise(str)
# => ["yesterday", "is", "but", "today's", "memory", "and", "tomorrow", "is", "today's", "dream"]

Tokeniser.tokenise(str) |> Counter.uniq_token_count
# => 8

By default Gibran uses the following regular expression to tokenise strings: ~r/[^\p{L}'-]/u. You can provide your own regular expression through the pattern option. You can combine pattern with exclude to create sophisticated tokenisation strategies.

Tokeniser.tokenise(string, exclude: &String.length(&1) < 4) |> Counter.token_count
# => 6

The exclude option accepts a string, a function, a regular expression, or a list combining any one or more of those types.

# Using `exclude` with a function.
Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]

# Using `exclude` with a regular expression.
Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]

# Using `exclude` with a string.
Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]

# Using `exclude` with a list of a combination of types.
Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
["prophet"]

Gibran provides a shortcut for working with strings directly (instead of running them through the tokeniser first).

Gibran.from_string(str, :token_count, opts: [exclude: &String.length(&1) < 4])
# => 6

To avoid inconsistencies that arise from character-casing, Gibran normalises input before applying transformations.

Levenshtein distance

Ordinary use:

iex(1)> Gibran.Levenshtein.distance("kitten", "sitting")
3

The Levenshtein distance for the same string is 0.

iex(2)> Gibran.Levenshtein.distance("snail", "snail")
0

The Levenshtein distance is case-sensitive.

iex(3)> Gibran.Levenshtein.distance("HOUSEBOAT", "houseboat")
9

The function can accept charlists as well as strings.

 iex(4)> Gibran.Levenshtein.distance('jogging', 'logger')
 4

The doctests contain extensive usage examples. Please take a look there for more details.