add cltk_normalize docs

cltk · Jul 11, 2016 · 04c2967
1 parent 0c90ce2
commit 04c2967

Showing 2 changed files with 60 additions and 0 deletions.
diff --git a/docs/greek.rst b/docs/greek.rst
@@ -208,6 +208,33 @@ If for any reason you want to go from oxia to tonos, just add the ``reverse=True
    Out[8]: True
 
 
+Another approach to normalization is to use ``normalize()`` from the ``unicodedata`` module in the Python standard library. The CLTK provides a wrapper \
+for this, as a convenience. Here's an example of its use in "compatibility" mode (``NFKC``):
+
+.. code-block:: python
+
+   In [1]: from cltk.corpus.utils.formatter import cltk_normalize
+
+   In [2]: tonos = "ά"
+
+   In [3]: oxia = "ά"
+
+   In [4]: tonos == oxia
+   Out[4]: False
+
+   In [5]: tonos == cltk_normalize(oxia)
+   Out[5]: True
+
+
+One can turn off compatibility with:
+
+.. code-block:: python
+
+   In [6]: tonos == cltk_normalize(oxia, compatibility=False)
+   Out[6]: True
+
+For more on ``normalize()`` see the `Python Unicode docs <https://docs.python.org/3.5/library/unicodedata.html#unicodedata.normalize>`_.
+
 
 POS tagging
 ===========

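The behaviour documented in the hunk above can be reproduced with the standard library alone. What follows is a minimal sketch, assuming that ``cltk_normalize`` forwards to ``unicodedata.normalize`` with ``NFKC`` when ``compatibility=True`` and ``NFC`` otherwise. The helper name ``normalize_sketch`` is hypothetical and is not the CLTK source. The code points are written as escapes so the tonos/oxia distinction survives copy and paste.

```python
import unicodedata

TONOS = "\u03ac"     # GREEK SMALL LETTER ALPHA WITH TONOS
OXIA = "\u1f71"      # GREEK SMALL LETTER ALPHA WITH OXIA
LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI (has a compatibility mapping only)


def normalize_sketch(text, compatibility=True):
    """Hypothetical stand-in for cltk_normalize (assumed behaviour)."""
    form = "NFKC" if compatibility else "NFC"
    return unicodedata.normalize(form, text)


# The two accented alphas look alike but are distinct code points.
print(TONOS == OXIA)

# Both forms map oxia to tonos, because the mapping is canonical.
print(TONOS == normalize_sketch(OXIA))
print(TONOS == normalize_sketch(OXIA, compatibility=False))

# The forms differ only for compatibility characters such as ligatures.
print(normalize_sketch(LIGATURE) == "fi")
print(normalize_sketch(LIGATURE, compatibility=False) == LIGATURE)
```

Under these assumptions the first comparison is false and the remaining four are true, which matches the documented outputs: turning compatibility off does not change the tonos/oxia result, since that equivalence is canonical rather than a compatibility mapping.
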
diff --git a/docs/multilingual.rst b/docs/multilingual.rst
@@ -227,6 +227,39 @@ N–grams
     …]
 
 
+Normalization
+=============
+
+If you are working with texts from different sources, it is likely a good idea to normalize them before
+further processing (such as string comparison). The CLTK provides a wrapper for ``normalize()`` from the \
+``unicodedata`` module in the Python standard library. Here's an example of its use in "compatibility" mode (``NFKC``):
+
+.. code-block:: python
+
+   In [1]: from cltk.corpus.utils.formatter import cltk_normalize
+
+   In [2]: tonos = "ά"
+
+   In [3]: oxia = "ά"
+
+   In [4]: tonos == oxia
+   Out[4]: False
+
+   In [5]: tonos == cltk_normalize(oxia)
+   Out[5]: True
+
+
+One can turn off compatibility with:
+
+.. code-block:: python
+
+   In [6]: tonos == cltk_normalize(oxia, compatibility=False)
+   Out[6]: True
+
+For more on ``normalize()`` see the `Python Unicode docs <https://docs.python.org/3.5/library/unicodedata.html#unicodedata.normalize>`_.
+
+
+
 Skipgrams
 =========
 The NLTK has a handy `skipgram <https://en.wikipedia.org/wiki/N-gram#Skip-gram>`_ function. Use it like this: