Merge branch 'develop'

amaiya · Mar 4, 2020 · 0038b38
2 parents 33f9378 + f32c0cb

Showing 5 changed files with 102 additions and 14 deletions.
diff --git a/examples/text/shallownlp-examples.ipynb b/examples/text/shallownlp-examples.ipynb
@@ -303,7 +303,7 @@
     "Discovered entities with English translations:\n",
     "- Катерина Тихонова = Katerina Tikhonova (PER)\n",
     "- России = Russia (LOC)\n",
-    "- Vladimir Putin = Vladimir Putin (PER)\n",
+    "- Владимира Путина = Vladimir Putin (PER)\n",
     "- МГУ = Moscow State University (ORG)"
    ]
   },

diff --git a/ktrain/text/preprocessor.py b/ktrain/text/preprocessor.py
@@ -432,6 +432,7 @@ def _transform_y(self, y_data):
             if self.label_encoder is None:
                 self.label_encoder = LabelEncoder()
                 self.label_encoder.fit(y_data)
+                #if self.get_classes(): warnings.warn('class_names argument is being overridden by string labels from data')
                 self.set_classes(self.label_encoder.classes_)
             y_data = self.label_encoder.transform(y_data)
 

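The `_transform_y` hunk above fits a `LabelEncoder` on string labels and then replaces the stored class names with `label_encoder.classes_`, which is what the commented-out warning refers to. A minimal pure-Python sketch of that behaviour follows, assuming the sorted-unique class ordering that scikit-learn's `LabelEncoder` uses; `encode_labels` is a hypothetical helper for illustration, not part of *ktrain*:

```python
def encode_labels(y_data):
    """Hypothetical stand-in for LabelEncoder.fit followed by transform."""
    # Classes are the sorted unique labels, so their order comes from the
    # data and not from any class_names the caller may have passed in.
    classes = sorted(set(y_data))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in y_data]


classes, encoded = encode_labels(['positive', 'negative', 'positive'])
print(classes)   # class names derived from the data
print(encoded)   # integer IDs aligned with `classes`
```

Because the class list is derived from the data, a caller-supplied ordering such as `['positive', 'negative']` would not survive this step.
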
diff --git a/tutorials/tutorial-03-image-classification.ipynb b/tutorials/tutorial-03-image-classification.ipynb
@@ -25,6 +25,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "\n",
+    "\n",
     "We will begin our image classification example by importing some required modules."
    ]
   },
@@ -50,7 +52,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next, we use the ```images_from_folder``` function to load the data as a generator (i.e., DirectoryIterator object).  This function assumes the following directory structure:\n",
+    "Next, we will load and preprocess the image data for training and validation.  *ktrain* can load images and associated labels from a variety of sources:\n",
+    "\n",
+    "\n",
+    "- `images_from_folder`:  labels are represented as subfolders containing images [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb) ]\n",
+    "- `images_from_csv`: labels are mapped to images in a CSV file [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/planet-ResNet50.ipynb) ]\n",
+    "- `images_from_fname`: labels are included as part of the filename and must be extracted using a regular expression [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/pets-ResNet50.ipynb) ]\n",
+    "- `images_from_array`: images and labels are stored in arrays [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/cifar10-WRN22.ipynb) ]\n",
+    "\n",
+    "\n",
+    "Here, we use the ```images_from_folder``` function to load the data as a generator (i.e., DirectoryIterator object).  This function assumes the following directory structure:\n",
     "```\n",
     "  ├── datadir\n",
     "    │   ├── train\n",
@@ -64,6 +75,8 @@
     "    │       ├── class2       # folder containing documents of class 2\n",
     "    │       └── classN       # folder containing documents of class N\n",
     "```\n",
+    "\n",
+    "\n",
     "The *train_test_names* argument can be used if the train and test subfolders are named differently (e.g., *test* folder is called *valid*).  Here, we load a dataset of cat and dog images, which can be obtained from [here](https://www.kaggle.com/c/dogs-vs-cats/data).  The DATADIR variable should be set to the path to the extracted folder.  The **data_aug** parameter can be used to employ [data augmentation](https://arxiv.org/abs/1712.04621). We set this parameter using the ```get_data_aug``` function, which returns a default data augmentation with ```horizontal_flip=True``` as the only change to the defaults.  See [Keras documentation](https://keras.io/preprocessing/image/#imagedatagenerator-class) for a full set of augmentation parameters.  Finally, we pass the requested target size (224,224) and color mode (rgb, which is a 3-channel image). The image will be resized or converted appropriately based on the values supplied.  A target size of 224 by 224 is typically used when using a network pretrained on ImageNet, which we do next.  The ```images_from_folder``` function returns generators for both the training and validation data in addition to an instance of ```ktrain.vision.ImagePreprocessor```, which can be used to preprocess raw data when making predictions for new examples.  This will be demonstrated later."
    ]
   },

diff --git a/tutorials/tutorial-04-text-classification.ipynb b/tutorials/tutorial-04-text-classification.ipynb
@@ -45,8 +45,45 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "Next, we will load and preprocess the text data for training and validation.  *ktrain* can load texts and associated labels from a variety of sources:\n",
     "\n",
-    "Next, we use the ```texts_from_folder``` function to load documents as fixed-length sequences of word IDs from a folder of raw documents.  This function assumes a directory structure like the following:\n",
+    "- `texts_from_folder`:  labels are represented as subfolders containing text files [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb) ]\n",
+    "- `texts_from_csv`: texts and associated labels are stored in columns in a CSV file [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/toxic_comments-fasttext.ipynb) ]\n",
+    "- `texts_from_df`: texts and associated labels are stored in columns in a *pandas* DataFrame [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb) ]\n",
+    "- `texts_from_array`: texts and labels are loaded and preprocessed from an array [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/20newsgroup-distilbert.ipynb) ]\n",
+    "\n",
+    "For `texts_from_csv` and `texts_from_df`, labels can either be multi or one-hot-encoded with one column per class or can be a single column storing integers or strings like this:\n",
+    "```python\n",
+    "# my_training_data.csv\n",
+    "TEXT,LABEL\n",
+    "I like this movie,positive\n",
+    "I hate this movie,negative\n",
+    "```\n",
+    "\n",
+    "For `texts_from_array`, the labels are arrays in one of the following forms:\n",
+    "```python\n",
+    "# string labels\n",
+    "y_train = ['negative', 'positive']\n",
+    "# integer labels\n",
+    "y_train = [0, 1]\n",
+    "# multi or one-hot encoded labels (used for multi-label problems)\n",
+    "y_train = [[1,0], [0,1]]\n",
+    "```\n",
+    "\n",
+    "In the latter two cases, you must supply a `class_names` argument to `texts_from_array`, which tells *ktrain* how indices map to class names.  In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n",
+    "\n",
+    "Sample arrays for `texts_from_array` might look like this:\n",
+    "```python\n",
+    "x_train = ['I hate this movie.', 'I like this movie.']\n",
+    "y_train = ['negative', 'positive']\n",
+    "x_test = ['I despise this movie.', 'I love this movie.']\n",
+    "y_test = ['negative', 'positive']\n",
+    "```\n",
+    "\n",
+    "All of the above methods transform the texts into a sequence of word IDs in one way or another, as expected by neural network models.\n",
+    "\n",
+    "\n",
+    "In this first example problem, we use the ```texts_from_folder``` function to load documents as fixed-length sequences of word IDs from a folder of raw documents.  This function assumes a directory structure like the following:\n",
     "\n",
     "```\n",
     "    ├── datadir\n",
@@ -98,11 +135,11 @@
    "source": [
     "# load training and validation data from a folder\n",
     "DATADIR = 'data/aclImdb'\n",
-    "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(DATADIR, \n",
-    "                                                                         max_features=80000, maxlen=2000, \n",
-    "                                                                         ngram_range=3, \n",
-    "                                                                         preprocess_mode='standard',\n",
-    "                                                                         classes=['pos', 'neg'])"
+    "trn, val, preproc = text.texts_from_folder(DATADIR, \n",
+    "                                           max_features=80000, maxlen=2000, \n",
+    "                                           ngram_range=3, \n",
+    "                                           preprocess_mode='standard',\n",
+    "                                           classes=['pos', 'neg'])"
    ]
   },
   {
@@ -155,7 +192,7 @@
    ],
    "source": [
     "# load an NBSVM model\n",
-    "model = text.text_classifier('nbsvm', (x_train, y_train), preproc=preproc)"
+    "model = text.text_classifier('nbsvm', trn, preproc=preproc)"
    ]
   },
   {
@@ -171,7 +208,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))"
+    "learner = ktrain.get_learner(model, train_data=trn, val_data=val)"
    ]
   },
   {
@@ -378,6 +415,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "For text classifiers, there is also `predictor.predict_proba`, which simply calls `predict` with `return_proba=True`.\n",
     "\n",
     "Our movie review sentiment predictor can be saved to disk and reloaded/re-used later as part of an application.  This is illustrated below:"
    ]
@@ -475,8 +513,8 @@
     "DATA_PATH = 'data/toxic-comments/train.csv'\n",
     "NUM_WORDS = 50000\n",
     "MAXLEN = 150\n",
-    "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,\n",
-    "                      'comment_text',\n",
+    "trn, val, preproc = text.texts_from_csv(DATA_PATH,\n",
+    "                     'comment_text',\n",
     "                      label_columns = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"],\n",
     "                      val_filepath=None, # if None, 10% of data will be used for validation\n",
     "                      max_features=NUM_WORDS, maxlen=MAXLEN,\n",
@@ -525,8 +563,8 @@
     }
    ],
    "source": [
-    "model = text.text_classifier('fasttext', (x_train, y_train), preproc=preproc)\n",
-    "learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))"
+    "model = text.text_classifier('fasttext', trn, preproc=preproc)\n",
+    "learner = ktrain.get_learner(model, train_data=trn, val_data=val)"
    ]
   },
   {
@@ -808,6 +846,22 @@
     "learner.fit_onecycle(3e-5, 1)\n",
     "```\n",
     "\n",
+    "Note that `x_train` and `x_test` are the raw texts here:\n",
+    "```python\n",
+    "x_train = ['I hate this movie.', 'I like this movie.']\n",
+    "```\n",
+    "Similar to `texts_from_array`, the labels are arrays in one of the following forms:\n",
+    "```python\n",
+    "# string labels\n",
+    "y_train = ['negative', 'positive']\n",
+    "# integer labels\n",
+    "y_train = [0, 1]\n",
+    "# multi or one-hot encoded labels\n",
+    "y_train = [[1,0], [0,1]]\n",
+    "```\n",
+    "In the latter two cases, you must supply a `class_names` argument to the `Transformer` constructor, which tells *ktrain* how indices map to class names.  In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n",
+    "\n",
+    "\n",
     "For more information, see our tutorial on [text classification with Hugging Face Transformers](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb).\n",
     "\n",
     "You may also be interested in some of our blog posts on text classification:\n",

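The notebook text added above says that integer and multi/one-hot labels need a `class_names` argument so that indices can be mapped back to names. The sketch below illustrates that index-to-name mapping in plain Python. It is only an illustration of the convention described in the diff: `names_from_int` and `names_from_onehot` are hypothetical helpers, not *ktrain* functions.

```python
class_names = ['negative', 'positive']  # index 0 = negative, index 1 = positive


def names_from_int(y):
    """Map integer labels to class names by position in class_names."""
    return [class_names[i] for i in y]


def names_from_onehot(y):
    """Map multi/one-hot rows to the names of every class that is switched on."""
    return [[class_names[i] for i, flag in enumerate(row) if flag] for row in y]


print(names_from_int([0, 1]))
print(names_from_onehot([[1, 0], [0, 1]]))
```

String labels need no such list because the names are already in the data.
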
diff --git a/tutorials/tutorial-A3-hugging_face_transformers.ipynb b/tutorials/tutorial-A3-hugging_face_transformers.ipynb
@@ -127,6 +127,26 @@
     "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `x_train` and `x_test` are the raw texts that look like this:\n",
+    "```python\n",
+    "x_train = ['I hate this movie.', 'I like this movie.']\n",
+    "```\n",
+    "The labels are arrays in one of the following forms:\n",
+    "```python\n",
+    "# string labels\n",
+    "y_train = ['negative', 'positive']\n",
+    "# integer labels\n",
+    "y_train = [0, 1]\n",
+    "# multi or one-hot encoded labels\n",
+    "y_train = [[1,0], [0,1]]\n",
+    "```\n",
+    "In the latter two cases, you must supply a `class_names` argument to the `Transformer` constructor, which tells *ktrain* how indices map to class names.  In this case, `class_names=['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']` because 0=alt.atheism, 1=comp.graphics, etc."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},