Merge branch 'develop'

amaiya · Jun 25, 2020 · cbd42f9
2 parents e9aafad + 970e15b

Showing 4 changed files with 54 additions and 8 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,17 @@ Most recent releases are shown at the top. Each release shows:
 - **Fixed**: Bug fixes that don't change documented behaviour
 
 
+## 0.17.2 (2020-06-25)
+
+### New:
+- Added support for Russian in `text.EnglishTranslator`
+
+### Changed
+- N/A
+
+### Fixed:
+- N/A
+
 ## 0.17.1 (2020-06-24)
 
 ### New:

diff --git a/examples/text/language_translation_example.ipynb b/examples/text/language_translation_example.ipynb
@@ -27,6 +27,7 @@
     "\n",
     "- `zh` : Chinese (both Simplified and Traditional)\n",
     "- `ar` : Arabic\n",
+    "- `ru` : Russian\n",
     "- `de` : German\n",
     "- `af` : Afrikaans\n",
     "- `fr` : French\n",
@@ -72,7 +73,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -95,7 +96,9 @@
    "metadata": {},
    "source": [
     "#### Some comments about translations:\n",
-    "Notice in the example above that we supplied a document of **two** sentences as input.  The `translate` method can accept single sentences, paragraphs, or entire documents.  However, if the document is large (e.g., a book), we recommend that you break it up into smaller chunks (e.g., pages or paragraphs).  This is because *ktrain* tokenizes your document into individual sentences, which are joined together and fed to the model as a single batch when making a prediction. If the batch is too large for memory, the prediction will fail.\n"
+    "Notice in the example above that we supplied a document of **two** sentences as input.  The `translate` method can accept single sentences, paragraphs, or entire documents.  However, if the document is large (e.g., a book), we recommend that you break it up into smaller chunks (e.g., pages or paragraphs).  This is because *ktrain* tokenizes your document into individual sentences, which are joined together and fed to the model as a single batch when making a prediction. If the batch is too large for memory, the prediction will fail.\n",
+    "\n",
+    "When instantiating the `EnglishTranslator`, pretrained models are automatically loaded, which may take a few seconds. Once instantiated, the `translate` method can be repeatedly invoked on different documents or sentences. Next, let us reinstantiate an `EnglishTranslator` object to translate Arabic.\n"
    ]
   },
   {
@@ -107,7 +110,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -127,6 +130,35 @@
     "print(translator.translate(src_text))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Russian to English"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The pandemic has damaged the world economy.\n",
+      "However, as of June 2020, the US stock market continues to grow.\n"
+     ]
+    }
+   ],
+   "source": [
+    "translator = text.EnglishTranslator(src_lang='ru')\n",
+    "src_text = '''Пандемия нанесла ущерб мировой экономике.\n",
+    "Однако по состоянию на июнь 2020 года фондовый рынок США продолжает расти.\n",
+    "'''\n",
+    "print(translator.translate(src_text))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -219,7 +251,7 @@
    "source": [
     "## The `Translator` Class for Translating to and from Many Languages\n",
     "\n",
-    "For translations **from** and **to** other languages, `text.Translator` instances can be used. `Translator` instances accept as input a pretrained model from [Helsinki-NLP](https://huggingface.co/Helsinki-NLP).  For instance, to translate Chinese to German, one can use the [Helsinki-NLP/opus-mt-ZH-de](https://huggingface.co/Helsinki-NLP/opus-mt-ZH-de) model:"
+    "For translations **from** and **to** other languages, `text.Translator` instances can be used. `Translator` instances accept as input a pretrained model from [Helsinki-NLP](https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt).  For instance, to translate Chinese to German, one can use the [Helsinki-NLP/opus-mt-ZH-de](https://huggingface.co/Helsinki-NLP/opus-mt-ZH-de) model:"
    ]
   },
   {

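Note on the notebook text above: it recommends breaking a large document into smaller chunks (pages or paragraphs) before calling `translate`, but does not show how. The following is a minimal sketch of one way to do that, not part of this commit. The helper names `split_into_chunks` and `translate_in_chunks`, the blank-line paragraph convention, and the `max_chars` limit are all assumptions made for illustration; the `translate` argument stands for any callable that maps a string to a string, such as the `translate` method of an `EnglishTranslator` instance.

```python
from typing import Callable, List


def split_into_chunks(document: str, max_chars: int = 1000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    Hypothetical helper: a single paragraph longer than max_chars is kept
    whole rather than being cut in the middle of a sentence.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_in_chunks(
    document: str,
    translate: Callable[[str], str],
    max_chars: int = 1000,
) -> str:
    """Translate each chunk separately and rejoin the results with blank lines."""
    chunks = split_into_chunks(document, max_chars)
    return "\n\n".join(translate(chunk) for chunk in chunks)


if __name__ == "__main__":
    # Stand-in for a real translator so the sketch runs without any models.
    # With ktrain one would pass translator.translate instead, where
    # translator = text.EnglishTranslator(src_lang='ru') as in the notebook.
    demo = "first paragraph\n\nsecond paragraph\n\nthird paragraph"
    print(translate_in_chunks(demo, str.upper, max_chars=35))
```

Because the translator is passed in as a plain callable, the same instance can be reused for every chunk, which matches the notebook's remark that `translate` can be invoked repeatedly once the models are loaded.
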
diff --git a/ktrain/text/translation/core.py b/ktrain/text/translation/core.py
@@ -2,7 +2,7 @@
 from ... import utils as U
 from .. import textutils as TU
 
-SUPPORTED_SRC_LANGS = ['zh', 'ar', 'de', 'af', 'es', 'fr', 'it', 'pt']
+SUPPORTED_SRC_LANGS = ['zh', 'ar', 'ru', 'de', 'af', 'es', 'fr', 'it', 'pt']
 
 class Translator():
     """
@@ -18,11 +18,11 @@ def __init__(self, model_name=None, device=None):
           device(str): device to use (e.g., 'cuda', 'cpu')
         """
         if 'Helsinki-NLP' not in model_name:
-            raise ValueError('BasicTranslator requires a Helsinki-NLP model: https://huggingface.co/Helsinki-NLP')
+            raise ValueError('Translator requires a Helsinki-NLP model: https://huggingface.co/Helsinki-NLP')
         try:
             import torch
         except ImportError:
-            raise Exception('BasicTranslator requires PyTorch to be installed.')
+            raise Exception('Translator requires PyTorch to be installed.')
         self.torch_device = device
         if self.torch_device is None: self.torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
         from transformers import MarianMTModel, MarianTokenizer
@@ -67,6 +67,7 @@ def __init__(self, src_lang=None, device=None):
                          Must be one of SUPPORTED_SRC_LANGS:
                            'zh': Chinese (either traditional or simplified)
                            'ar': Arabic
+                           'ru' : Russian
                            'de': German
                            'af': Afrikaans
                            'es': Spanish
@@ -82,6 +83,8 @@ def __init__(self, src_lang=None, device=None):
         self.translators = []
         if src_lang == 'ar':
             self.translators.append(Translator(model_name='Helsinki-NLP/opus-mt-ar-en', device=device))
+        elif src_lang == 'ru':
+            self.translators.append(Translator(model_name='Helsinki-NLP/opus-mt-ru-en', device=device))
         elif src_lang == 'de':
             self.translators.append(Translator(model_name='Helsinki-NLP/opus-mt-de-en', device=device))
         elif src_lang == 'af':

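Note on the `elif` chain above: each newly supported language adds another branch that differs only in the model name. One possible alternative, shown purely as a sketch and not as what ktrain does, is a lookup table. The names `MODEL_FOR_SRC_LANG` and `model_name_for` are hypothetical, and only the three model names visible in this diff are listed; the remaining supported languages are left out because their branches are not shown here.

```python
# Hypothetical lookup table: source-language code -> Helsinki-NLP model name.
# Only the model names that appear in the diff above are included.
MODEL_FOR_SRC_LANG = {
    "ar": "Helsinki-NLP/opus-mt-ar-en",
    "ru": "Helsinki-NLP/opus-mt-ru-en",
    "de": "Helsinki-NLP/opus-mt-de-en",
}


def model_name_for(src_lang: str) -> str:
    """Return the model name for src_lang, or raise ValueError if unknown."""
    try:
        return MODEL_FOR_SRC_LANG[src_lang]
    except KeyError:
        supported = ", ".join(sorted(MODEL_FOR_SRC_LANG))
        raise ValueError(
            f"Unsupported src_lang {src_lang!r}; expected one of: {supported}"
        ) from None


if __name__ == "__main__":
    print(model_name_for("ru"))
```

With a table like this, adding a language becomes a one-line change, and the error message for an unsupported code lists the valid options automatically.
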
diff --git a/ktrain/version.py b/ktrain/version.py
@@ -1,2 +1,2 @@
 __all__ = ['__version__']
-__version__ = '0.17.1'
+__version__ = '0.17.2'