add cltk_normalize docs
kylepjohnson committed Jul 11, 2016
1 parent 0c90ce2 commit 04c2967
Showing 2 changed files with 60 additions and 0 deletions.
27 changes: 27 additions & 0 deletions docs/greek.rst
@@ -208,6 +208,33 @@ If for any reason you want to go from oxia to tonos, just add the ``reverse=True
Out[8]: True
Another approach to normalization is to use ``normalize()`` from the ``unicodedata`` module of the Python standard
library. The CLTK provides a wrapper for this, as a convenience. Here's an example of its use in "compatibility"
mode (``NFKC``):

.. code-block:: python

   In [1]: from cltk.corpus.utils.formatter import cltk_normalize

   In [2]: tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS

   In [3]: oxia = "\u1f71"  # GREEK SMALL LETTER ALPHA WITH OXIA

   In [4]: tonos == oxia
   Out[4]: False

   In [5]: tonos == cltk_normalize(oxia)
   Out[5]: True

One can turn off compatibility with:

.. code-block:: python

   In [6]: tonos == cltk_normalize(oxia, compatibility=False)
   Out[6]: True

For more on ``normalize()`` see the `Python Unicode docs <https://docs.python.org/3.5/library/unicodedata.html#unicodedata.normalize>`_.
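
The wrapper's behavior can be approximated with the standard library alone. The following is a minimal sketch, not
the CLTK source: the name ``normalize_sketch`` is hypothetical, and mapping ``compatibility=False`` to ``NFC`` is an
assumption inferred from the example above.

.. code-block:: python

   from unicodedata import normalize


   def normalize_sketch(text, compatibility=True):
       """Hypothetical stand-in for cltk_normalize, built on unicodedata.normalize()."""
       # NFKC: compatibility decomposition followed by canonical composition.
       # NFC (assumed here for compatibility=False): canonical decomposition
       # followed by canonical composition.
       form = "NFKC" if compatibility else "NFC"
       return normalize(form, text)


   tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
   oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA

   print(tonos == oxia)                                         # False
   print(tonos == normalize_sketch(oxia))                       # True
   print(tonos == normalize_sketch(oxia, compatibility=False))  # True

Both forms map the oxia code point to the tonos one because the two are canonically equivalent, which is why the
``compatibility`` flag makes no difference in this particular comparison.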


POS tagging
===========
33 changes: 33 additions & 0 deletions docs/multilingual.rst
@@ -227,6 +227,39 @@ N–grams
…]
Normalization
=============

If you are working with texts from different sources, it is likely a good idea to normalize them before
further processing (such as string comparison). The CLTK provides a wrapper for ``normalize()`` from the
``unicodedata`` module of the Python standard library. Here's an example of its use in "compatibility" mode (``NFKC``):

.. code-block:: python

   In [1]: from cltk.corpus.utils.formatter import cltk_normalize

   In [2]: tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS

   In [3]: oxia = "\u1f71"  # GREEK SMALL LETTER ALPHA WITH OXIA

   In [4]: tonos == oxia
   Out[4]: False

   In [5]: tonos == cltk_normalize(oxia)
   Out[5]: True

One can turn off compatibility with:

.. code-block:: python

   In [6]: tonos == cltk_normalize(oxia, compatibility=False)
   Out[6]: True

For more on ``normalize()`` see the `Python Unicode docs <https://docs.python.org/3.5/library/unicodedata.html#unicodedata.normalize>`_.
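
The two modes only differ for compatibility characters such as ligatures. The short sketch below calls
``unicodedata.normalize()`` directly to show this; the sample strings are illustrative and do not come from the
CLTK, and it assumes that non-compatibility mode corresponds to ``NFC``.

.. code-block:: python

   from unicodedata import normalize

   ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a compatibility character

   # NFC leaves compatibility characters untouched ...
   print(normalize("NFC", ligature) == ligature)  # True
   # ... while NFKC replaces the ligature with the two letters "fi".
   print(normalize("NFKC", ligature) == "fi")     # True

   # Canonically equivalent sequences are unified by both forms.
   decomposed = "\u03b1\u0301"  # alpha followed by COMBINING ACUTE ACCENT
   composed = "\u03ac"          # GREEK SMALL LETTER ALPHA WITH TONOS
   print(normalize("NFC", decomposed) == composed)   # True
   print(normalize("NFKC", decomposed) == composed)  # True

When texts from different sources are compared, compatibility mode therefore also folds presentation variants
together, at the cost of discarding distinctions that some corpora may encode deliberately.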



Skipgrams
=========
The NLTK has a handy `skipgram <https://en.wikipedia.org/wiki/N-gram#Skip-gram>`_ function. Use it like this: