Merge branch 'develop'
amaiya committed Feb 1, 2020
2 parents 760ce6e + beb8363 commit 995fcdd
Showing 6 changed files with 23 additions and 6 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -7,6 +7,19 @@ Most recent releases are shown at the top. Each release shows:
- **Fixed**: Bug fixes that don't change documented behaviour


## 0.9.1 (2020-02-01)

### New:
- N/A

### Changed:
- `text.TextPreprocessor` prints sequence length statistics

### Fixed:
- fixed `utils.nclasses_from_data` for `ktrain.Dataset` instances
- prevent `detect_lang` failing when Pandas Series is supplied


## 0.9.0 (2020-01-31)

### New:
2 changes: 1 addition & 1 deletion examples/text/text_regression_example.ipynb
@@ -330,7 +330,7 @@
"output_type": "stream",
"text": [
"fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]\n",
-"linreg: linear text ression using a trainable Embedding layer\n",
+"linreg: linear text regression using a trainable Embedding layer\n",
"bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]\n",
"standard_gru: simple 2-layer GRU with randomly initialized embeddings\n",
"bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]\n",
6 changes: 5 additions & 1 deletion ktrain/text/preprocessor.py
@@ -326,7 +326,11 @@ def detect_lang(texts, sample_size=32):
"""
detect language
"""
-    if not isinstance(texts, (list, np.ndarray)): texts = [texts]
+    if isinstance(texts, (pd.Series, pd.DataFrame)):
+        texts = texts.values
+    if isinstance(texts, str): texts = [texts]
+    if not isinstance(texts, (list, np.ndarray)):
+        raise ValueError('texts must be a list or NumPy array of strings')
lst = []
for doc in texts[:sample_size]:
try:
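The patched `detect_lang` now normalizes its input before sampling documents. A standalone sketch of that normalization step (assuming only pandas and NumPy; `normalize_texts` is an illustrative helper, not part of ktrain's API):

```python
import numpy as np
import pandas as pd

def normalize_texts(texts):
    """Coerce supported input types to a list or NumPy array of strings,
    mirroring the normalization added to detect_lang in this commit."""
    # unwrap pandas objects to their underlying NumPy array
    if isinstance(texts, (pd.Series, pd.DataFrame)):
        texts = texts.values
    # a single string becomes a one-element list
    if isinstance(texts, str):
        texts = [texts]
    if not isinstance(texts, (list, np.ndarray)):
        raise ValueError('texts must be a list or NumPy array of strings')
    return texts
```

With this in place, passing a `pd.Series` no longer falls through to the language-detection loop as an unsupported type, which is the failure mode the changelog entry describes.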
2 changes: 1 addition & 1 deletion ktrain/utils.py
@@ -217,7 +217,7 @@ def nsamples_from_data(data):

def nclasses_from_data(data):
if is_iter(data):
-    if isinstance(data, Dataset): return data.nsamples()
+    if isinstance(data, Dataset): return data.nclasses()
elif is_ner(data=data): return len(data.p._label_vocab._id2token) # NERSequence
elif is_huggingface(data=data): # Hugging Face Transformer
return data.y.shape[1]
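The bug here was a single wrong method call: `nclasses_from_data` asked a `ktrain.Dataset` for its sample count instead of its class count. A toy illustration of why the two differ (`ToyDataset` is a hypothetical stand-in, not ktrain's actual class):

```python
import numpy as np

class ToyDataset:
    """Hypothetical stand-in for a ktrain.Dataset with one-hot labels."""
    def __init__(self, x, y):
        self.x = x          # features: (n_samples, n_features)
        self.y = y          # one-hot labels: (n_samples, n_classes)

    def nsamples(self):
        # number of rows -- what the buggy code returned
        return self.x.shape[0]

    def nclasses(self):
        # width of the one-hot label matrix -- what callers actually need
        return self.y.shape[1]

ds = ToyDataset(np.zeros((100, 8)), np.zeros((100, 3)))
```

For this dataset, the pre-fix code would have reported 100 classes instead of 3, so anything sized from `nclasses_from_data` (such as a model's output layer) would have been wrong.
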
2 changes: 1 addition & 1 deletion ktrain/version.py
@@ -1,2 +1,2 @@
__all__ = ['__version__']
-__version__ = '0.9.0'
+__version__ = '0.9.1'
@@ -430,7 +430,7 @@
"## Wrapping our Data in an Instance of `ktrain.Dataset`\n",
"To use this custom data format of two inputs in *ktrain*, we will wrap it in a `ktrain.Dataset` instance, which is simply a `tf.keras` Sequence wrapper. We must be sure to override and implement the required methods (e.g., `def nsamples` and `def get_y`). The `ktrain.Dataset` class is a subclass of `tf.keras.utils.Sequence`. See the TensorFlow documentation on the [Sequence class](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) for more information on how Sequence wrappers work.\n",
"\n",
-"Note that, in the implementation below, we have made `MyCustomDataset` more general such that it can wrap lists of "
+"Note that, in the implementation below, we have made `MyCustomDataset` more general such that it can wrap lists containing an arbitrary number of inputs instead of just the two needed in our example. "
]
},
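The cell above describes a Sequence-style wrapper with overridable `nsamples` and `get_y` methods and an arbitrary list of inputs. A plain-Python sketch of that shape (the real `MyCustomDataset` subclasses `ktrain.Dataset`, itself a `tf.keras.utils.Sequence`; the names and batching details below are illustrative assumptions, not ktrain's exact code):

```python
import math
import numpy as np

class MyCustomDataset:
    """Sketch of a Sequence-like wrapper over a list of input arrays."""
    def __init__(self, inputs, y, batch_size=32):
        self.inputs = inputs            # list of N input arrays, not just two
        self.y = y                      # targets, aligned with each input array
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.y) / self.batch_size)

    def __getitem__(self, idx):
        # one batch: (list of input slices, target slice)
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return [x[sl] for x in self.inputs], self.y[sl]

    def nsamples(self):
        # required override: total number of examples
        return len(self.y)

    def get_y(self):
        # required override: the full target array
        return self.y
```

Because batching is driven by `len(self.y)` and a list comprehension over `self.inputs`, the same wrapper works for two inputs or twenty.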
{
@@ -864,7 +864,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Let's look at our most expensive prediction. Our most expensive prediction (`$404`) is associated with an expensive wine priced at `$800`, which is good. However, we are `~$400` off. Again, our model has trouble with expensive wines. This is somewhat understandable since our model only looks at short textual descriptions and the winer - neither of which contain clear indicators of their exorbitant prices."
+"Let's look at our most expensive prediction. Our most expensive prediction (`$404`) is associated with an expensive wine priced at `$800`, which is good. However, we are `~$400` off. Again, our model has trouble with expensive wines. This is somewhat understandable since our model only looks at short textual descriptions and the winery - neither of which contain clear indicators of their exorbitant prices."
]
},
{
