Added the word regexp to README
ruddzw committed Mar 13, 2009
1 parent cefed77 commit 9262fff
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README
@@ -11,3 +11,7 @@
P(word1 word2) = count[word1][word2] / count[word1]
Note that unigram counts needed for bigram Probabilities.

Regular Expression for matching "words" including punctuation and stuff like 's:
('?\w+|[\`\~\!\@\#\$\%\^\&\*\(\)\-\_\=\+\[\{\]\}\\\|\;\:\'\"\,\<\.\>\/\?])
Java-ized for use in a String literal:
('?\\w+|[\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\-\\_\\=\\+\\[\\{\\]\\}\\\\\\|\\;\\:\\'\\\"\\,\\<\\.\\>\\/\\?])
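
A minimal sketch of how the Java-ized pattern above might be used to split text into tokens. The class name `WordTokenizer` and the method `tokenize` are hypothetical and not part of this commit; only the pattern string is taken from the README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository.
public class WordTokenizer {
    // The Java-ized pattern from the README, copied verbatim.
    private static final Pattern WORD = Pattern.compile(
        "('?\\w+|[\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\-\\_\\=\\+\\[\\{\\]\\}\\\\\\|\\;\\:\\'\\\"\\,\\<\\.\\>\\/\\?])");

    // Collects every match in order; whitespace matches nothing and is skipped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [the, dog, 's, bone, .]
        System.out.println(tokenize("the dog's bone."));
    }
}
```

The optional leading apostrophe is what lets a suffix such as 's come out as its own token, while every other punctuation character becomes a one-character token.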

esbie Mar 13, 2009

Owner

This line is hard to read :)

Could you use a Java POSIX character class instead?
\p{Punct}, i.e. one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

1 comment on commit 9262fff

Contributor Author

ruddzw commented on 9262fff Mar 13, 2009

Never heard of \p{Punct}. Will change to:
'?\w+|\p{Punct}
and java-ized:
'?\\w+|\\p{Punct}
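
A quick sketch of the shorter pattern in use, assuming a plain ASCII apostrophe and backslashes doubled for a Java string literal. The class name `PunctTokenizer` and its methods are hypothetical. Without the `UNICODE_CHARACTER_CLASS` flag, `\p{Punct}` covers only the 32 ASCII punctuation characters, which is the same set the hand-written character class in the README lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository.
public class PunctTokenizer {
    // Shorter pattern from the comment thread, written as a Java string literal.
    private static final Pattern WORD = Pattern.compile("'?\\w+|\\p{Punct}");
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // Collects every match in order; whitespace matches nothing and is skipped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Counts how many ASCII characters (0..127) \p{Punct} accepts.
    public static int countAsciiPunct() {
        int count = 0;
        for (char c = 0; c < 128; c++) {
            if (PUNCT.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // prints [the, dog, 's, bone, .]
        System.out.println(tokenize("the dog's bone."));
        // prints 32
        System.out.println(countAsciiPunct());
    }
}
```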
