/
README.Rmd
154 lines (110 loc) · 4.46 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library(simpletrie)
```
# simpletrie
<!-- badges: start -->
![](https://img.shields.io/badge/cool-useless-green.svg)
![](https://img.shields.io/badge/dependencies-zero-blue.svg)
<!-- badges: end -->
`simpletrie` is a simple [trie](https://en.wikipedia.org/wiki/Trie) implementation
in plain R code.
A trie is a structure for representing certain data as a tree of nodes. This
data representation allows for fast queries of certain types.
`simpletrie` can build a trie from 175k words in about 3s.
## What's in the box
* `trie_create()` - create a trie, and optionally initialise with a dictionary of words
* `trie_insert()` - insert words into a trie
* `trie_find_subwords()` - find all sub-words in the trie using the characters in the given word
## Installation
You can install from [GitHub](https://github.com/coolbutuseless/simpletrie) with:
``` r
# install.package('remotes')
remotes::install_github('coolbutuseless/simpletrie)
```
## Example: Finding sub-words from a given word
My main use-case for `simpletrie` is for looking up scrabble words i.e. to find all possible sub-words of a given word with:
* the characters in any order
* being able to specify a *wildcard* to indicate that any letter can be used
i.e. the `.` character
```{r example}
library(simpletrie)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Read the ENABLE word list
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
words <- tolower(readLines("working/enable1.txt"))
length(words)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create trie strcture
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
trie <- trie_create(words)
})
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find all words which can be made with 'hi.rstats'
# '.' is a wildcard, and can match any letter. (Like a blank tile in scrabble)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
system.time({
subwords <- trie_find_subwords(trie, 'hirstats.')
})
length(subwords)
head(subwords, 100)
```
## What's a trie?
A *trie* is an ordered data structure represented as a tree of nodes.
The image below shows a very small trie which represents the words: 'cat', 'cog', 'dog' and 'dote.'
This representation allows for very fast queries of certain types.
```{dot echo=FALSE, out.width = "50%"}
digraph D {
root [label="root node"]
c
a
t
o1 [label="o"]
g1 [label="g"]
d1 [label="d"]
o2 [label="o"]
g2 [label="g"]
t3 [label="t"]
e4 [label="e"]
root -> c;
root -> d1;
c -> a -> t;
c -> o1 -> g1;
d1 -> o2 -> g2;
o2 -> t3 -> e4;
}
```
## Technical bits
The trie is implemented here as nested environments. Environments are
created with `new.env(parent = emptyenv(), hash = FALSE)`.
Using the `emptyenv()` as parent saves some memory, as the parent
environment is never needed in this process.
`hash` is set to `FALSE` as using a hash table expands the memory
usage by quite a bit - even if `size` is set to limit the initial hash
table size.
Using a hash table on the environment doesn't
appreciably change the speed of trie creation or preforming look-ups.
Environments are used instead of lists as I make use of the reference semantics
of environment objects when building the trie.
## Related Software + Resources
* [Trie on Wikipedia](https://en.wikipedia.org/wiki/Trie)
* [iptrie](https://github.com/hrbrmstr/iptrie) - Efficiently Store and Query ‘IPv4’ Internet Addresses with Associated Data
* [triebeard](https://cran.r-project.org/package=triebeard)
* [rtrie](https://cran.r-project.org/web/packages/rtrie/index.html) - was previously on cran but has now been archived
* [AhoCorasickTrie](https://cran.r-project.org/web/packages/AhoCorasickTrie/index.html)
* Some raw wordlists e.g. ENABLE1 and SOWPODS can be found [here](http://norvig.com/ngrams/)
* [Bare bones trie creation in 10 lines of R](http://stackoverflow.com/questions/27060453/how-to-build-an-alphabetical-tree-from-a-list-of-words-in-r)
## Acknowledgements
* R Core for developing and maintaining such a wonderful language.
* CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining
the repository