Commit

Merge branch 'develop'
amaiya committed Mar 4, 2020
2 parents 33f9378 + f32c0cb commit 0038b38
Showing 5 changed files with 102 additions and 14 deletions.
2 changes: 1 addition & 1 deletion examples/text/shallownlp-examples.ipynb
@@ -303,7 +303,7 @@
"Discovered entities with English translations:\n",
"- Катерина Тихонова = Katerina Tikhonova (PER)\n",
"- России = Russia (LOC)\n",
"- Vladimir Putin = Vladimir Putin (PER)\n",
"- Владимира Путина = Vladimir Putin (PER)\n",
"- МГУ = Moscow State University (ORG)"
]
},
1 change: 1 addition & 0 deletions ktrain/text/preprocessor.py
@@ -432,6 +432,7 @@ def _transform_y(self, y_data):
if self.label_encoder is None:
self.label_encoder = LabelEncoder()
self.label_encoder.fit(y_data)
#if self.get_classes(): warnings.warn('class_names argument is being overridden by string labels from data')
self.set_classes(self.label_encoder.classes_)
y_data = self.label_encoder.transform(y_data)
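
A minimal sketch of what this encoding step does to string labels, assuming the encoder behaves like scikit-learn's `LabelEncoder`, whose `classes_` are the sorted unique labels. `encode_labels` is a hypothetical pure-Python stand-in written for illustration; it is not ktrain code.

```python
def encode_labels(y_data):
    """Hypothetical stand-in for LabelEncoder.fit() followed by transform()."""
    # Sorted unique labels, mirroring what LabelEncoder.classes_ would hold.
    classes = sorted(set(y_data))
    index = {name: i for i, name in enumerate(classes)}
    # Each string label becomes its position in the sorted class list.
    return classes, [index[label] for label in y_data]


classes, encoded = encode_labels(['positive', 'negative', 'positive'])
print(classes)
print(encoded)
```

Because the class list is derived from the data itself, any class names supplied separately would be replaced by it, which is what the commented-out warning in this hunk refers to.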

15 changes: 14 additions & 1 deletion tutorials/tutorial-03-image-classification.ipynb
@@ -25,6 +25,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"We will begin our image classification example by importing some required modules."
]
},
@@ -50,7 +52,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we use the ```images_from_folder``` function to load the data as a generator (i.e., DirectoryIterator object). This function assumes the following directory structure:\n",
"Next, we will load and preprocess the image data for training and validation. *ktrain* can load images and associated labels from a variety of source:\n",
"\n",
"\n",
"- `images_from_folder`: labels are represented as subfolders containing images [ [example notebook] ](https://github.com/amaiya/ktrain/blob/master/examples/vision/dogs_vs_cats-ResNet50.ipynb)\n",
"- `images_from_csv`: labels are mapped to images in a CSV file [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/planet-ResNet50.ipynb) ]\n",
"- `images_from_fname`: labels are included as part of the filename and must be extracted using a regular expression [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/pets-ResNet50.ipynb) ]\n",
"- `images_from_array`: images and labels are stored in array [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/cifar10-WRN22.ipynb) ]\n",
"\n",
"\n",
"Here, we use the ```images_from_folder``` function to load the data as a generator (i.e., DirectoryIterator object). This function assumes the following directory structure:\n",
"```\n",
" ├── datadir\n",
" │ ├── train\n",
@@ -64,6 +75,8 @@
" │ ├── class2 # folder containing documents of class 2\n",
" │ └── classN # folder containing documents of class N\n",
"```\n",
"\n",
"\n",
"The *train_test_names* argument can be used, if the train and test subfolders are named differently (e.g., *test* folder is called *valid*). Here, we load a dataset of cat and dog images, which can be obtained from [here](https://www.kaggle.com/c/dogs-vs-cats/data). The DATADIR variale should be set to the path to the extracted folder. The **data_aug** parameter can be used to employ [data augmentation](https://arxiv.org/abs/1712.04621). We set this parameter using the ```get_data_aug``` function, which returns a default data augmentation with ```horizontal_flip=True``` as the only change to the defaults. See [Keras documentation](https://keras.io/preprocessing/image/#imagedatagenerator-class) for a full set of agumentation parameters. Finally, we pass the requested target size (224,224) and color mode (rgb, which is a 3-channel image). The image will be resized or converted appropriately based on the values supplied. A target size of 224 by 224 is typically used when using a network pretrained on ImageNet, which we do next. The ```images_from_folder``` function returns generators for both the training and validation data in addition an instance of ```ktrain.vision.ImagePreprocessor```, which can be used to preprocess raw data when making predictions for new examples. This will be demonstrated later."
]
},
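
The folder convention described in this cell (one subfolder per class under each split) can be sketched with the standard library alone. This is an illustration of the convention only, not how ktrain or Keras implements it; `labels_from_subfolders`, the sorted class order, and the file names are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def labels_from_subfolders(split_dir):
    """Return (class_names, [(file_path, class_index), ...]) for one split folder."""
    split_dir = Path(split_dir)
    # Each immediate subfolder is treated as one class (sorted order is an assumption).
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    samples = []
    for idx, name in enumerate(class_names):
        for f in sorted((split_dir / name).iterdir()):
            if f.is_file():
                samples.append((f, idx))
    return class_names, samples


# Build a tiny throwaway tree: datadir/train/{cat,dog}/<image files>
with tempfile.TemporaryDirectory() as datadir:
    for cls, fname in [('cat', 'a.jpg'), ('cat', 'b.jpg'), ('dog', 'c.jpg')]:
        folder = Path(datadir) / 'train' / cls
        folder.mkdir(parents=True, exist_ok=True)
        (folder / fname).touch()  # empty placeholder files, not real images
    class_names, samples = labels_from_subfolders(Path(datadir) / 'train')
    labels = [idx for _, idx in samples]

print(class_names, labels)
```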
78 changes: 66 additions & 12 deletions tutorials/tutorial-04-text-classification.ipynb
@@ -45,8 +45,45 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will load and preprocess the text data for training and validation. *ktrain* can load texts and associated labels from a variety of source:\n",
"\n",
"Next, we use the ```texts_from_folder``` function to load documents as fixed-length sequences of word IDs from a folder of raw documents. This function assumes a directory structure like the following:\n",
"- `texts_from_folder`: labels are represented as subfolders containing text files [ [example notebook] ](https://github.com/amaiya/ktrain/blob/master/examples/text/IMDb-BERT.ipynb)\n",
"- `texts_from_csv`: texts and associated labels are stored in columns in a CSV file [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/toxic_comments-fasttext.ipynb) ]\n",
"- `texts_from_df`: texts and associated labels are stored in columns in a *pandas* DataFrame [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/ArabicHotelReviews-nbsvm.ipynb) ]\n",
"- `texts_from_array`: texts and labels are loaded and preprocessed from an array [ [example notebook](https://github.com/amaiya/ktrain/blob/master/examples/text/20newsgroup-distilbert.ipynb) ]\n",
"\n",
"For `texts_from_csv` and `texts_from_df`, labels can either be multi or one-hot-encoded with one column per class or can be a single column storing integers or strings like this:\n",
"```python\n",
"# my_training_data.csv\n",
"TEXT,LABEL\n",
"I like this movie,positive\n",
"I hate this movie,negative\n",
"```\n",
"\n",
"For `texts_from_array`, the labels are arrays in one of the following forms:\n",
"```python\n",
"# string labels\n",
"y_train = ['negative', 'positive']\n",
"# integer labels\n",
"y_train = [0, 1]\n",
"# multi or one-hot encoded labels (used for multi-label problems)\n",
"y_train = [[1,0], [0,1]]\n",
"```\n",
"\n",
"In the latter two cases, you must supply a `class_names` argument to the `texts_from_array`, which tells *ktrain* how indices map to class names. In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n",
"\n",
"Sample arrays for `texts_from_array` might look like this:\n",
"```python\n",
"x_train = ['I hate this movie.', 'I like this movie.']\n",
"y_train = ['negative', 'positive']\n",
"x_test = ['I despise this movie.', 'I love this movie.']\n",
"y_test = ['negative', 'positive']\n",
"```\n",
"\n",
"All of the above methods transform the texts into a sequence of word IDs in one way or another, as expected by neural network models.\n",
"\n",
"\n",
"In this first example problem, we use the ```texts_from_folder``` function to load documents as fixed-length sequences of word IDs from a folder of raw documents. This function assumes a directory structure like the following:\n",
"\n",
"```\n",
" ├── datadir\n",
@@ -98,11 +135,11 @@
"source": [
"# load training and validation data from a folder\n",
"DATADIR = 'data/aclImdb'\n",
"(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(DATADIR, \n",
" max_features=80000, maxlen=2000, \n",
" ngram_range=3, \n",
" preprocess_mode='standard',\n",
" classes=['pos', 'neg'])"
"trn, val, preproc = text.texts_from_folder(DATADIR, \n",
" max_features=80000, maxlen=2000, \n",
" ngram_range=3, \n",
" preprocess_mode='standard',\n",
" classes=['pos', 'neg'])"
]
},
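
The single-label-column CSV layout described earlier in this file (a `TEXT` column and a `LABEL` column) can be read with Python's `csv` module and converted to the one-column-per-class form. This is a sketch under stated assumptions: `read_text_label_csv` is a hypothetical helper, and it does not reproduce what `texts_from_csv` does internally.

```python
import csv
import io

# Same toy content as the my_training_data.csv example in the tutorial text.
CSV_TEXT = """TEXT,LABEL
I like this movie,positive
I hate this movie,negative
"""


def read_text_label_csv(handle, text_column='TEXT', label_column='LABEL'):
    """Hypothetical helper: split a CSV into parallel lists of texts and labels."""
    rows = list(csv.DictReader(handle))
    texts = [row[text_column] for row in rows]
    labels = [row[label_column] for row in rows]
    return texts, labels


texts, labels = read_text_label_csv(io.StringIO(CSV_TEXT))

# Convert the single label column into one-hot rows, one column per class.
class_names = sorted(set(labels))
one_hot = [[int(label == name) for name in class_names] for label in labels]

print(class_names)
print(one_hot)
```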
{
@@ -155,7 +192,7 @@
],
"source": [
"# load an NBSVM model\n",
"model = text.text_classifier('nbsvm', (x_train, y_train), preproc=preproc)"
"model = text.text_classifier('nbsvm', trn, preproc=preproc)"
]
},
{
@@ -171,7 +208,7 @@
"metadata": {},
"outputs": [],
"source": [
"learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))"
"learner = ktrain.get_learner(model, train_data=trn, val_data=val)"
]
},
{
@@ -378,6 +415,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For text classifiers, there is also `predictor.predict_proba`, which is simply calls `predict` with `return_proba=True`.\n",
"\n",
"Our movie review sentiment predictor can be saved to disk and reloaded/re-used later as part of an application. This is illustrated below:"
]
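
The relationship stated in this cell, that `predict_proba` is a thin wrapper around `predict` with `return_proba=True`, can be sketched as follows. `SketchPredictor` and `toy_scores` are hypothetical names used only for illustration; this is not ktrain's predictor class.

```python
class SketchPredictor:
    """Hypothetical predictor illustrating the predict / predict_proba relationship."""

    def __init__(self, class_names, score_fn):
        self.class_names = class_names
        self.score_fn = score_fn  # maps a text to a list of class probabilities

    def predict(self, text, return_proba=False):
        probs = self.score_fn(text)
        if return_proba:
            return probs
        # Otherwise return the name of the highest-scoring class.
        best = max(range(len(probs)), key=probs.__getitem__)
        return self.class_names[best]

    def predict_proba(self, text):
        # Thin wrapper, as the tutorial text describes.
        return self.predict(text, return_proba=True)


def toy_scores(text):
    """Made-up scoring rule standing in for a trained model."""
    return [0.9, 0.1] if 'hate' in text else [0.2, 0.8]


predictor = SketchPredictor(['neg', 'pos'], toy_scores)
print(predictor.predict('I hate this movie'))
print(predictor.predict_proba('I love this movie'))
```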
@@ -475,8 +513,8 @@
"DATA_PATH = 'data/toxic-comments/train.csv'\n",
"NUM_WORDS = 50000\n",
"MAXLEN = 150\n",
"(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,\n",
" 'comment_text',\n",
"trn, val, preproc = text.texts_from_csv(DATA_PATH,\n",
" 'comment_text',\n",
" label_columns = [\"toxic\", \"severe_toxic\", \"obscene\", \"threat\", \"insult\", \"identity_hate\"],\n",
" val_filepath=None, # if None, 10% of data will be used for validation\n",
" max_features=NUM_WORDS, maxlen=MAXLEN,\n",
@@ -525,8 +563,8 @@
}
],
"source": [
"model = text.text_classifier('fasttext', (x_train, y_train), preproc=preproc)\n",
"learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))"
"model = text.text_classifier('fasttext', trn, preproc=preproc)\n",
"learner = ktrain.get_learner(model, train_data=trn, val_data=val)"
]
},
{
@@ -808,6 +846,22 @@
"learner.fit_onecycle(3e-5, 1)\n",
"```\n",
"\n",
"Note that `x_train` and `x_test` are the raw texts here:\n",
"```python\n",
"x_train = ['I hate this movie.', 'I like this movie.']\n",
"```\n",
"Similar to `texts_from_array`, the labels are arrays in one of the following forms:\n",
"```python\n",
"# string labels\n",
"y_train = ['negative', 'positive']\n",
"# integer labels\n",
"y_train = [0, 1]\n",
"# multi or one-hot encoded labels\n",
"y_train = [[1,0], [0,1]]\n",
"```\n",
"In the latter two cases, you must supply a `class_names` argument to the `Transformer` constructor, which tells *ktrain* how indices map to class names. In this case, `class_names=['negative', 'positive']` because 0=negative and 1=positive.\n",
"\n",
"\n",
"For more information, see our tutorial on [text classification with Hugging Face Transformers](https://github.com/amaiya/ktrain/blob/master/tutorials/tutorial-A3-hugging_face_transformers.ipynb).\n",
"\n",
"You may be also interested in some of our blog posts on text classification:\n",
20 changes: 20 additions & 0 deletions tutorials/tutorial-A3-hugging_face_transformers.ipynb
@@ -127,6 +127,26 @@
"learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `x_train` and `x_test` are the raw texts that look like this:\n",
"```python\n",
"x_train = ['I hate this movie.', 'I like this movie.']\n",
"```\n",
"The labels are arrays in one of the following forms:\n",
"```python\n",
"# string labels\n",
"y_train = ['negative', 'positive']\n",
"# integer labels\n",
"y_train = [0, 1]\n",
"# multi or one-hot encoded labels\n",
"y_train = [[1,0], [0,1]]\n",
"```\n",
"In the latter two cases, you must supply a `class_names` argument to the `Transformer` constructor, which tells *ktrain* how indices map to class names. In this case, `class_names=['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']` because 0=alt.atheism, 1=comp.graphics, etc."
]
},
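
The role of `class_names` described in this cell, mapping label indices back to readable names, can be sketched for both integer labels and one-hot or multi-hot rows. `to_names` is a hypothetical helper for illustration and is not part of ktrain; only the class list is taken from the cell above.

```python
CLASS_NAMES = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


def to_names(y, class_names):
    """Hypothetical helper: map integer or one-hot/multi-hot labels to class names."""
    named = []
    for label in y:
        if isinstance(label, int):
            # Integer label: the value is an index into class_names.
            named.append(class_names[label])
        else:
            # One-hot or multi-hot row: every position set to 1 names a class.
            named.append([class_names[i] for i, flag in enumerate(label) if flag])
    return named


print(to_names([0, 2], CLASS_NAMES))
print(to_names([[0, 1, 0, 0], [1, 0, 0, 1]], CLASS_NAMES))
```

String labels need no such mapping, since the names are already present in the data.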
{
"cell_type": "markdown",
"metadata": {},
