Most recent releases are shown at the top. Each release shows:
- New: New classes, methods, functions, etc
- Changed: Additional parameters, changes to inputs or outputs, etc
- Fixed: Bug fixes that don't change documented behaviour
- The `text.ner.models.sequence_tagger` function now supports word embeddings from non-BERT transformer models (e.g., `roberta-base`, `openai-gpt`). Thanks to @Niekvdplas.
- Custom tokenization can now be used in sequence-tagging even when using transformer word embeddings. See the `custom_tokenizer` argument to `NERPredictor.predict`.
- [breaking change] In the `text.ner.models.sequence_tagger` function, the `bilstm-bert` model is now called `bilstm-transformer` and the `bert_model` parameter has been renamed to `transformer_model`.
- [breaking change] The `syntok` package is now used as the default tokenizer for `NERPredictor` (sequence-tagging prediction). To use the tokenization scheme from older versions of ktrain, you can import the `re` and `string` packages and supply this function to the `custom_tokenizer` argument: `lambda s: re.compile(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])").sub(r" \1 ", s).split()`.
- Code base was reformatted using black
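The legacy tokenization scheme from the `syntok` entry can be tried on its own before wiring it into a predictor. The sketch below only wraps the lambda from that entry in a named function (`legacy_tokenize` is a name chosen here for illustration); supplying it as `custom_tokenizer` is left as a comment because that step needs a trained predictor.

```python
import re
import string

# Pattern taken from the changelog entry: isolate each punctuation mark
# (ASCII punctuation plus a few typographic symbols) as its own token.
_PUNCT = re.compile(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])")


def legacy_tokenize(s):
    """Pad punctuation with spaces, then split on whitespace."""
    return _PUNCT.sub(r" \1 ", s).split()


tokens = legacy_tokenize("Paul lives in Chicago, Illinois.")
print(tokens)

# With a trained predictor, this would be supplied as (not run here):
# predictor.predict("Paul lives in Chicago, Illinois.", custom_tokenizer=legacy_tokenize)
```

Naming the function rather than passing the lambda inline also compiles the pattern once instead of on every call.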
- ktrain now supports TIKA for text extraction in the `text.textractor.TextExtractor` package with the `use_tika=True` argument as default. To use the old-style text extraction based on the `textract` package, you can supply `use_tika=False` to `TextExtractor`.
- removed warning about sentence pair classification to avoid confusion
- N/A
- ktrain now supports simple, fast, and robust keyphrase extraction with the `ktrain.text.kw.KeywordExtractor` module
- ktrain now only issues a warning if TensorFlow is not installed, instead of halting and preventing further use. This means that pre-trained PyTorch models (e.g., `text.zsl.ZeroShotClassifier`) and sklearn models (e.g., `text.eda.TopicModel`) in ktrain can now be used without having TensorFlow installed.
- `text.qa.SimpleQA` and `text.qa.AnswerExtractor` now both support PyTorch with optional quantization (use `framework='pt'` for PyTorch version)
- `text.qa.SimpleQA` and `text.qa.AnswerExtractor` now both support a `quantize` argument that can speed up
- `text.zsl.ZeroShotClassifier`, `text.translation.Translator`, and `text.translation.EnglishTranslator` all support a `quantize` argument.
- pretrained image-captioning and object-detection via `transformers` is now supported
- reorganized imports
- localized seqeval
- The `half` parameter to `text.translation.Translator` and `text.translation.EnglishTranslator` was changed to `quantize` and now supports both CPU and GPU.
- N/A
- `NERPredictor.predict` now includes a `return_offsets` parameter. If True, the results will include character offsets of predicted entities.
- In `eda.TopicModel`, changed `lda_max_iter` to `max_iter` and `nmf_alpha` to `alpha`
- Added `show_counts` parameter to `TopicModel.get_topics` method
- Changed `qa.core._process_question` to `qa.core.process_question`
- In `qa.core`, added `remove_english_stopwords` and `and_np` parameters to `process_question`
- The `valley` learning rate suggestion is now returned in `learner.lr_estimate` and `learner.lr_plot` (when `suggest=True` supplied to `learner.lr_plot`)
- save `TransformerEmbedding` model, tokenizer, and configuration when saving `NERPredictor` and reset `te_model` to facilitate loading NERPredictors with BERT embeddings offline (#423)
- switched from `keras2onnx` to `tf2onnx`, which supports newer versions of TensorFlow
- N/A
- N/A
- added `get_tokenizer` call to `TransformersPreprocessor._load_pretrained` to address issue #416
- N/A
- pin to `sklearn==0.24.2` due to breaking changes. `eli5` fork for tf.keras updated for 0.24.2. To use `scikit-learn==0.24.2`, users must uninstall and re-install the `eli5` fork with: `pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip`
- N/A
- New vision models: added MobileNetV3-Small and EfficientNet. Thanks to @ilos-vigil.
- `core.Learner.plot` now supports plotting of any value that exists in the training `History` object (e.g., `mae` if previously specified as metric). Thanks to @ilos-vigil.
- added `raw_confidence` parameter to `QA.ask` method to return raw confidence scores. Thanks to @ilos-vigil.
- pin to `transformers==4.10.3` due to Issue #398
- pin to `syntok==1.3.3` due to bug with `syntok==1.4.1` causing paragraph tokenization in `qa` module to break
- properly suppress TF/CUDA warnings by default
- ensure document fed to `keras_bert` tokenizer to avoid this issue
- speech transcription support
- N/A
- N/A
- N/A
- minor fix to installation due to pypi
- N/A
- N/A
- added `extra_requirements` to `setup.py`
- changed imports for summarization, translation, qa, and zsl in notebooks and tests
- N/A
- `text.AnswerExtractor` is a universal information extractor powered by a Question-Answering module and capable of extracting user-specified information from texts.
- `text.TextExtractor` is a text extraction pipeline (e.g., convert PDFs to plain text)
- changed transformers pin to `transformers>=4.0.0,<=4.10.3`
- N/A
- N/A
- N/A
- `SimpleQA` now can load PyTorch question-answering checkpoints
- change API call to support newest `causalnlp`
- N/A
- N/A
- check for `logits` attribute when predicting using `transformers`
- change raised Exception to warning for longer sequence lengths for `transformers`
- N/A
- Added `method` parameter to `tabular.causal_inference_model`.
- N/A
- Added `tabular.causal_inference_model` function for causal inference support.
- N/A
- N/A
- N/A
- added `query` parameter to `SimpleQA.ask` so that an alternative query can be used to retrieve contexts from corpus
- added `chardet` as dependency for `stellargraph`
- fixed issue with `TopicModel.build` when `threshold=None`
- API documentation index
- Added warning when a TensorFlow version of selected `transformers` model is not available and the PyTorch version is being downloaded and converted instead using `from_pt=True`.
- Fixed `utils.metrics_from_model` to support alternative metrics
- Check for AUC `ktrain.utils` "inspect" function
- N/A
- `shallownlp.ner.NER.predict` processes lists of sentences in batches resulting in faster predictions
- `batch_size` argument added to `shallownlp.ner.NER.predict`
- added `verbose` parameter to `ktrain.text.textutils.extract_copy` to optionally see why each skipped document was skipped
- Changed `TextPredictor.save` to save Hugging Face tokenizer files locally to ensure they can be easily reloaded when `text.Transformer` is supplied with local path.
- For `transformers` models, the `predictor.preproc.model_name` variable is automatically updated to be new `Predictor` folder to avoid having users manually update `model_name`. Applies when a local path is supplied to `text.Transformer` and resultant `Predictor` is moved to new machine.
- N/A
- `NERPredictor.predict` now optionally accepts lists of sentences to make sequence-labeling predictions in batches (as all other `Predictor` instances already do).
- N/A
- N/A
- expose errors from `transformers` in `_load_pretrained`
- Changed `TextPreprocessor.check_trained` to be a warning instead of Exception
- N/A
- Support for transformers 4.0 and above.
- added `set_tokenizer` to `TransformerPreprocessor`
- show error message when original weights cannot be saved (for `reset_weights` method)
- cast filename to string before concatenating with suffix in `images_from_csv` and `images_from_df` (addresses issue #330)
- resolved import error for `sklearn>=0.24.0`, but `eli5` still requires `sklearn<0.24.0`.
- N/A
- N/A
- fixed problem with `LabelEncoder` not properly being stored when `texts_from_df` is invoked
- refrain from invoking `max` on empty sequence (#307)
- corrected issue with `return_proba=True` in NER predictions (#316)
- N/A
- A `steps_per_epoch` argument has been added to all `*fit*` methods that operate on generators
- Added `get_tokenizer` methods to all instances of `TextPreprocessor`
- propagate custom metrics to model when `distilbert` is chosen in `text_classifier` and `text_regression_model` functions
- pin `scikit-learn` to 0.24.0 due to breaking change
- N/A
- N/A
- Added `custom_objects` argument to `load_predictor` to load models with custom loss functions, etc.
- Fixed bug #286 related to length computation when `use_dynamic_shape=True`
- N/A
- Added `use_dynamic_shape` parameter to `text.preprocessor.hf_convert_examples` which is set to `True` when running predictions. This reduces the input length when making predictions, if possible.
- Added warnings to some imports in `imports.py` to allow for slightly lighter-weight deployments
- Temporarily pinning to `transformers>=3.1,<4.0` due to breaking changes in v4.0.
- Suppress progress bar in `predictor.predict` for `keras_bert` models
- Fixed typo causing problems when loading predictor for Inception models
- Fixes to address documented/undocumented breaking changes in `transformers>=4.0`. But, temporarily pinning to `transformers>=3.1,<4.0` for backwards compatibility.
- The `SimpleQA.index_from_folder` method now supports text extraction from many file types including PDFs, MS Word documents, and MS PowerPoint files.
- The default in `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` has been changed to `breakup_docs=True`.
- N/A
- N/A
- `ktrain.text.textutils.extract_copy` now uses `textract` to extract text from many file types (e.g., PDF, DOC, PPT) instead of just PDFs.
- N/A
- N/A
- N/A
- Change exception in model ID check in `Translator` to warning to better allow offline language translations
- `Predictor` instances now provide built-in support for exporting to TensorFlow Lite and ONNX.
- N/A
- N/A
- N/A
- Use fast tokenizers for the following Hugging Face transformers models: BERT, DistilBERT, and RoBERTa models. This change affects models created with either `text.Transformer(...` or `text.text_classifier('distilbert',..`. BERT models created with `text_classifier('bert',..`, which uses `keras_bert` instead of `transformers`, are not affected by this change.
- N/A
- N/A
- N/A
- Resolved issue in `qa.ask` method occurring with embedding computations when full answer sentences exceed 512 tokens.
- Support for upcoming release of TensorFlow 2.4 such as removal of references to obsolete `multi_gpu_model`
- [breaking change] `TopicModel.get_docs` now returns a list of dicts instead of a list of tuples. Each dict has keys: `text`, `doc_id`, `topic_proba`, `topic_id`.
- added `TopicModel.get_document_topic_distribution`
- added `TopicModel.get_sorted_docs` method to return all documents sorted by relevance to a given `topic_id`
- Changed version check warning in `lr_find` to a raised Exception to avoid confusion when warnings from ktrain are suppressed
- Pass `verbose` parameter to `hf_convert_examples`
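Because `TopicModel.get_docs` now yields dicts rather than tuples, downstream code can address fields by name. The following sketch uses hand-written records in place of real `get_docs` output: the sample values and the helper `top_docs_for_topic` are invented for illustration, and only the four key names come from the entry above.

```python
# Stand-in for the return value of TopicModel.get_docs(); values are made up.
docs = [
    {"text": "GPU drivers crash on boot", "doc_id": 0, "topic_proba": 0.91, "topic_id": 3},
    {"text": "Best pizza in town", "doc_id": 1, "topic_proba": 0.55, "topic_id": 7},
    {"text": "New graphics card review", "doc_id": 2, "topic_proba": 0.78, "topic_id": 3},
]


def top_docs_for_topic(records, topic_id, k=2):
    """Return up to k records for topic_id, highest topic_proba first."""
    matches = [r for r in records if r["topic_id"] == topic_id]
    return sorted(matches, key=lambda r: r["topic_proba"], reverse=True)[:k]


for rec in top_docs_for_topic(docs, topic_id=3):
    print(rec["doc_id"], rec["topic_proba"], rec["text"])
```

Code written against the older tuple form would have unpacked fields by position, so any such unpacking needs to be updated when upgrading.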
- N/A
- changed `qa.core.display_answers` to make URLs open in new tab
- pin to `seqeval==0.0.19` due to `numpy` version incompatibility with latest TensorFlow and to suppress errors during installation
- N/A
- N/A
- fixed issue with missing noun phrase at end of sentence in `extract_noun_phrases`
- fixed TensorFlow versioning issues with `utils.metrics_from_model`
- added `extract_noun_phrases` to `textutils`
- `SimpleQA.ask` now includes an `include_np` parameter. When True, noun phrases will be used to retrieve documents containing candidate answers.
- N/A
- N/A
- added optional `references` argument to `SimpleQA.index_from_list`
- added `min_words` argument to `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` to prune small documents or paragraphs that are unlikely to include good answers
- `qa.display_answers` now supports hyperlinks for document references
- N/A
- added `breakup_docs` argument to `index_from_list` and `index_from_folder` that potentially speeds up `ask` method substantially
- added `batch_size` argument to `ask` and set default at 8 for faster answer-retrieval
- refactored `QA` and `SimpleQA` for better extensibility
- Ensure `save_path` is correctly processed in `Learner.evaluate`
- N/A
- Changed installation instructions in `README.md` to reflect that using ktrain with TensorFlow 2.1 will require downgrading `transformers` to 3.1.0.
- updated requirements with `keras_bert>=0.86.0` due to TensorFlow 2.3 error with older versions of `keras_bert`
- In `lr_find` and `lr_plot`, check for TF 2.2 or 2.3 and make necessary adjustments due to TF bug 41174.
- fixed typos in `__all__` in `text` and `graph` modules (PR #250)
- fixed Chinese language translation based on name-changes of models with `zh` as source language
- N/A
- added `TopicModel.get_word_weights` method to retrieve the word weights for a given topic
- added `return_fig` option to `Learner.lr_plot` and `Learner.plot`, which allows the matplotlib `Figure` to be returned to user
- N/A
- N/A
- `SUPPRESS_KTRAIN_WARNINGS` environment variable changed to `SUPPRESS_DEP_WARNINGS`
- N/A
- N/A
- added `num_beams` and `early_stopping` arguments to `translate` methods in `translation` module that can be set to improve translation speed
- added `half` parameter to `Translator` constructor
- N/A
- Added `translate_sentences` method to `Translator` class that translates list of sentences, where list is fed to model as single batch
- Removed TensorFlow dependency from `setup.py` to allow users to use ktrain with any version of TensorFlow 2 they choose.
- Added `truncation=True` to tokenization in `summarization` module
- Require `transformers>=3.1.0` due to breaking changes
- `SUPPRESS_TF_WARNINGS` environment variable changed to `SUPPRESS_KTRAIN_WARNINGS`
- Use `prepare_seq2seq_batch` instead of `prepare_translation_batch` in `translation` module due to breaking change in `transformers==3.1.0`
- N/A
- N/A
- Always use `*Auto*` classes to load `transformers` models to prevent loading errors
- N/A
- N/A
- Added missing `torch.no_grad()` scope in `text.translation` and `text.summarization` modules
- added `nli_template` parameter to `ZeroShotClassifier.predict` to allow versatility in the kinds of labels that can be predicted
- efficiency improvements to `ZeroShotClassifier.predict` that allow faster predictions on large sequences of documents and a large number of labels to predict
- added `multilabel` parameter to `ZeroShotClassifier.predict`
- added `labels` parameter to `ZeroShotClassifier.predict`, an alias to `topic_strings` parameter
- N/A
- Allow variations on `accuracy` metric such as `binary_accuracy` when inspecting model in `is_classifier`
- N/A
- N/A
- In `texts_from_array`, check `class_names` only after preprocessing before printing classification vs. regression status.
- N/A
- N/A
- In `TextPreprocessor` instances, correctly reset `class_names` when targets are in string format.
- N/A
- added `class_weight` parameter to `lr_find` for imbalanced datasets
- removed pins for `cchardet` and `scikitlearn` from `setup.py`
- added version check for `eli5` fork
- removed `scipy` pin from `setup.py`
- Allow TensorFlow 2.3 for Python 3.8
- Request manual installation of `shap` in `TabularPredictor.explain` instead of inclusion in `setup.py`
- N/A
- N/A
- N/A
- include metrics check in `is_classifier` function to support non-standard loss functions
- N/A
- N/A
- Ensure transition to `YTransform` is backwards compatible for `StandardTextPreprocessor` and `BertPreprocessor`
- N/A
- `TextPreprocessor` instances now use `YTransform` class to transform targets
- `texts_from_df`, `texts_from_csv`, and `texts_from_array` employ the use of either `YTransformDataFrame` or `YTransform`
- `images_from_df`, `images_from_fname`, `images_from_csv`, and `images_from_array` use `YTransformDataFrame` or `YTransform`
- Extra imports removed from PyTorch-based `zsl.core.ZeroShotClassifier` and `summarization.core.TransformerSummarizer`. If necessary, both can now be used without having TensorFlow installed by installing ktrain using `--no-deps` and importing these modules using a method like this.
- N/A
- N/A
- `NERPredictor.predict` was changed to accept an optional `custom_tokenizer` argument
- N/A
- N/A
- N/A
- added missing `num_classes` argument to `to_categorical`
- N/A
- Adjusted `no_grad` scope in `ZeroShotClassifier.predict`
- N/A
- support for `tabular` data including explainable AI for tabular predictions
- `learner.validate` and `learner.evaluate` now support regression models
- added `restore_weights_only` flag to `lr_find`. When True, only the model weights will be restored after simulating training, not the optimizer weights. In at least a few observed cases, this "warm up" seems to improve performance when actual training begins. Further investigation is needed, so it is False by default.
- N/A
- added `save_path` argument to `Learner.validate` and `Learner.evaluate`. If `print_report=False`, classification report will be saved as CSV to `save_path`.
- Use `torch.no_grad` with `ZeroShotClassifier.predict` to prevent OOM
- Added `max_length` parameter to `ZeroShotClassifier.predict` to prevent errors on long documents
- Added type check to `TransformersPreprocessor.preprocess_train`
- N/A
- N/A
- Changed `qa` module to use 'Auto' when loading `QuestionAnswering` models and tokenizer
- try `from_pt=True` for `qa` module if initial model-loading fails
- use `get_hf_model_name` in `qa` module
- N/A
- N/A
- return gracefully if no documents match question in `qa` module
- tokenize question in `qa` module to ensure all candidate documents are returned
- Added error in `text.preprocessor` when training set has incomplete integer labels
- added `batch_size` argument to `ZeroShotClassifier.predict` that can be increased to speed up predictions. This is especially useful if `len(topic_strings)` is large.
- N/A
- fixed typo in `load_predictor` error message
- N/A
- updated doc comments in core module
- removed unused `nosave` parameter from `reset_weights`
- added warning about obsolete `show_wd` parameter in `print_layers` method
- pin to `scipy==1.4.1` due to TensorFlow requirement
- N/A
- N/A
- Use `tensorflow==2.1.0` if Python 3.6/3.7 and use `tensorflow==2.2.0` only if on Python 3.8 due to TensorFlow v2.2.0 issues
- N/A
- N/A
- Fixes to address changes or issues in TensorFlow 2.2.0:
  - created `metrics_from_model` function due to changes in the way metrics are extracted from compiled model
  - use `loss_fn_from_model` function due to changes in the way loss functions are extracted from compiled model
  - added `**kwargs` to `AdamWeightDecay` based on this issue
  - changed `TransformerTextClassLearner.predict` and `TextPredictor.predict` to deal with tuples being returned by `predict` in TensorFlow 2.2.0
  - changed multilabel test to use loss instead of accuracy due to TF 2.2.0 issue
  - changed `Learner.lr_find` to use `save_model` and `load_model` to restore weights due to this TF issue and added `TransformersPreprocessor.load_model_and_configure_from_data` to support this
- N/A
- N/A
- N/A
- Explicitly supply `truncate='longest_first'` to prevent sentence pair classification from breaking in `transformers==3.0.0`
- Fixed typo in `encode_plus` invocation
- N/A
- N/A
- Explicitly supply `truncate='longest_first'` to prevent sentence pair classification from breaking in `transformers==3.0.0`
- N/A
- N/A
- Changed `setup.py` to open README file using `encoding="utf-8"` to prevent installation problems on Windows machines with `cp1252` encoding
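The underlying issue is that `open()` without an explicit encoding falls back to the platform's locale encoding, which can be `cp1252` on Windows. A minimal sketch of the pattern follows, using a temporary file in place of the real README; the file contents are invented and this is not the actual `setup.py` code.

```python
import os
import tempfile

# Write a stand-in README containing non-ASCII characters as UTF-8 bytes.
tmp_dir = tempfile.mkdtemp()
readme_path = os.path.join(tmp_dir, "README.md")
with open(readme_path, "wb") as f:
    f.write("ktrain: “low-code” machine learning".encode("utf-8"))

# Passing encoding="utf-8" makes the read independent of the locale default.
with open(readme_path, encoding="utf-8") as f:
    long_description = f.read()

# ascii() keeps the printout safe on consoles that cannot render the quotes.
print(ascii(long_description))
```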
- Added support for Russian in `text.EnglishTranslator`
- N/A
- N/A
- N/A
- N/A
- Properly set device in `text.Translator` and use cuda when available
- support for language translation using pretrained `MarianMT` models
- added `core.evaluate` as alias to `core.validate`
- `Learner.estimate_lr` method will return numerical estimates of learning rate using two different methods. Should only be called after running `Learner.lr_find`.
- `text.zsl.ZeroShotClassifier` changed to use `AutoModel*` and `AutoTokenizer` in order to load any `mnli` model
- remove external modules from `ktrain.__init__.py` so that they do not appear when pressing TAB in notebook
- added `Transformer.save_tokenizer` and `Transformer.get_tokenizer` methods to facilitate training on machines with no internet
- explicitly call `plt.show()` in `LRFinder.plot_loss` to resolve issues with plot not displaying in certain cases (PR #170)
- suppress warning about text regression when making text regression predictions
- allow `xnli` models for `zsl` module
- added `metrics` parameter to `text.text_classifier` and `text.text_regression_model` functions
- added `metrics` parameter to `Transformer.get_classifier` and `Transformer.get_regression_model` methods
- `metric` parameter in `vision.image_classifier` and `vision.image_regression_model` functions changed to `metrics`
- N/A
- N/A
- default model for summarization changed to `facebook/bart-large-cnn` due to breaking change in v2.11
- added `device` argument to `TransformerSummarizer` constructor to control PyTorch device
- require `transformers>=2.11.0` due to breaking changes in v2.11 related to `BART` models
- N/A
- N/A
- prevent `transformer` tokenizers from being pickled during `predictor.save`, as it causes problems for some community-uploaded models like `bert-base-japanese-whole-word-masking`.
- support for Zero-Shot Topic Classification via the `text.ZeroShotClassifier`.
- N/A
- N/A
- N/A
- N/A
- Added the `procs`, `limitmb`, and `multisegment` arguments to the `index_from_list` and `index_from_folder` methods in `text.SimpleQA` to speed up indexing when necessary. Supplying `multisegment=True` speeds things up significantly, for example. Defaults, however, are the same as before. Users must explicitly change values if desiring a speedup.
- Load `xlm-roberta*` as `jplu/tf-xlm-roberta*` to bypass error from `transformers`
- N/A
- [breaking change] The `multilabel` argument in `text.Transformer` constructor was moved to `Transformer.get_classifier` and now correctly allows users to forcibly configure model for multilabel task regardless as to what data suggests. However, it is recommended to leave this value as `None`.
- The methods `predictor.save`, `ktrain.load_predictor`, `learner.save_model`, `learner.load_model` all now accept a path to folder where all files (e.g., model file, `.preproc` file) will be saved. If path does not exist, it will be created. This should not be a breaking change as the `load*` methods will still look for files in the old location if model or predictor was saved using an older version of ktrain.
- N/A
- N/A
- Added `n_samples` argument to `TextPredictor.explain` to address slowness of `explain` on Google Colab
- Lock to version 0.21.3 of `scikit-learn` to ensure old-style explanations are generated from `TextPredictor.explain`
- added missing `import pickle` to ensure saved topic models can be loaded
- N/A
- Changed `Transformer.preprocess*` methods to accept sentence pairs for sentence pair classification
- N/A
- Out-of-the-box support for image regression
- `vision.images_from_df` function to load image data from pandas DataFrames
- references to `fit_generator` and `predict_generator` converted to `fit` and `predict`
- Resolved issue with multilabel detection returning `False` for valid multilabel problems when data is in form of generator
- Added `TFDataset` class for use as wrapper around arbitrary `tf.data.Dataset` objects for use in ktrain
- Added `NERPreprocessor.preprocess_train_from_conll2003`
- Removed extraneous imports from `text.__init__.py` and `vision.__init__.py`
- `classes` argument in `images_from_array` changed to `class_names`
- ensure NER data is properly prepared in `text.ner.learner.validate`
- fixed typo with `df` reference in `images_from_fname`
- If no validation data is supplied to `images_from_array`, training data is split to generate validation data
- issue warning if Learner cannot save original weights
- `images_from_array` accepts labels in the form of integer class IDs
- fix pandas `SettingWithCopyWarning` from `images_from_csv`
- fixed issue with `return_proba=True` including class labels for multilabel image classification
- resolved issue with class labels not being set correctly in `images_from_array`
- lock to `cchardet==2.1.5` due to this issue
- fixed `y_from_data` from NumpyArrayIterators in image classification
- N/A
- N/A
- fixed issue with MobileNet model due to typo and added MobileNet example notebook
- N/A
- added `merge_tokens` and `return_proba` options to `NERPredictor.predict`
- N/A
- N/A
- added `textutils` to `text` namespace and added note about `sent_tokenize` to sequence-tagging tutorial
- cast dependent variable to `tf.float32` instead of `tf.int64` for text regression problems using `transformers` library
- N/A
- added `suggest` option to `core.Learner.lr_plot`
- set interactive mode for matplotlib so plots show automatically from Python console and PyCharm
- run prepare for NER sequence predictor to avoid matrix mismatch
- N/A
- N/A
- ensure `text.eda.TopicModel.visualize_documents` works with `bokeh` v2.0.x
- support for building Question-Answering systems
- `textutils` now contains `paragraph_tokenize` function
function
- N/A
- resolved import issue with `textutils.sent_tokenize`
- N/A
- `TransformerSummarizer` accepts BART `model_name` as parameter
- N/A
- support for link prediction with graph neural networks
- text summarization with pretrained BART (included in 0.13.1 but not in 0.13.0)
- `bigru` method now selects pretrained word vectors based on detected language
- instead of throwing error, default to English if `detect_lang` could not detect language from batch of texts
- `layers` argument moved to `TransformerEmbedding` constructor
- enforce specific version of TensorFlow due to undocumented breaking changes in newer TF versions
- `AdamWeightDecay` optimizer is now used to support global weight decay. Used when user explicitly sets a weight decay
- force re-instantiation of `TransformerEmbedding` object when `sequence_tagger` function is re-invoked
- Added `max_momentum` and `min_momentum` parameters to `autofit` and `fit_onecycle` to control cyclical momentum
- Prevent loading errors of previously saved NERPreprocessor objects
- N/A
- N/A
- Require at least TensorFlow 2.1.0 is installed in `setup.py` due to TF 2.0.0 bug with `lr_find`
- Added lower bounds to scikit-learn and networkx versions
- N/A
- N/A
- N/A
- check and ensure AllenNLP is installed when Elmo embeddings are selected for NER
- BERT and Elmo embeddings for NER and other downstream tasks
- `wv_path_or_url` parameter moved from `entities_from*` to `sequence_taggers`
- Added `use_char` parameter and ensure it is not used unless `DISABLE_V2_BEHAVIOR` is enabled:
- `batch_size` argument added to `get_predictor` and `load_predictor`
- `eval_batch_size` argument added to `get_learner`
- added `val_pct` argument to `entities_from_array`
- properly set threshold in `text.eda` (PR #99)
- fixed error when no validation data is supplied to `entities_from_array`
- N/A
- N/A
- prevent errors with reading word vector files on Windows by specifying `encoding='utf-8'`
- N/A
- N/A
- `ktrain.text.eda.visualize_documents` now properly processes filepath argument
- `entities_from_txt`, `entities_from_gmb`, and `entities_from_conll2003` functions now discover the encoding of the file automatically when `encoding=None` (which is the default now)
- N/A
- N/A
- sequence-tagging (e.g., NER) now supports ELMo embeddings with `use_elmo=True` argument to data-loading functions like `entities_from_array` and `entities_from_txt`
- pretrained word embeddings (i.e., fasttext word2vec embeddings) can be specified by providing the URL to a `.vec.gz` file from here. The URL (or path) is supplied as `wv_path_or_url` argument to data-loading functions like `entities_from_array` and `entities_from_txt`
- `show_random_images`: show random images from folder in Jupyter notebook
- `NERPreprocessor` now includes a `preprocess_test` method for easier evaluation of test sets in datasets that contain a training, validation, and test set
- ensure `DISABLE_V2_BEHAVIOR=True` when `ImagePredictor.explain` is invoked
- added `SUPPRESS_TF_WARNINGS` environment variable. Default is '1'. If set to '0', TF warnings will be displayed.
- `merge_entities` method of `ktrain.text.shallownlp.ner.NER` changed to `merge_tokens`
- moved `load_predictor` to constructor in `ktrain.text.shallownlp.ner.NER`
- `ktrain.text.shallownlp.ner.NER` now supports `predictor_path` argument
- convert `class_names` to strings in `core.validate` to prevent error from scikit-learn
- fixed error arising when no data augmentation scheme is provided to the `images_from*` functions
- fixed bug in `images_from_fname` to ensure supplied `pattern` is used
- added `val_folder` argument to `images_from_fname`
- raise Exception when `preproc` is not found in `load_predictor`
- check for existence of `preproc` in `text_classifier` and `text_regression_model`
- fixed `text.eda` so that `detect_lang` is called correctly after being moved to `textutils`
- N/A
- `shallownlp.Classifier.texts_from_folder` changed to `shallownlp.Classifier.load_texts_from_folder`
- `shallownlp.Classifier.texts_from_csv` changed to `shallownlp.Classifier.load_texts_from_csv`
- In `text.preprocessor`, added warning that `class_names` is being ignored when `class_names` were supplied and `y_train` and `y_test` contain string labels
- N/A
- `Transformer` API in ktrain now supports using community-uploaded transformer models
- added `shallownlp` module with out-of-the-box NER for English, Russian, and Chinese
- `text.eda` module now supports NMF in addition to LDA
- `texts_from_csv` and `texts_from_df` now accept a single column of labels in string format and will 1-hot-encode labels automatically for classification or multi-class classification problems.
- reorganized language-handling to `text.textutils`
- more suppression of warnings due to spurious warnings from TF2 causing confusion in output
- `classes` argument to `Transformer` constructor has been changed to `class_names` for consistency with `texts_from_array`
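To make the automatic 1-hot-encoding of string labels concrete, here is a small pure-Python sketch. It illustrates the idea only and is not ktrain's implementation; the function name `one_hot_encode` and the alphabetical class ordering are assumptions.

```python
def one_hot_encode(labels):
    """Map string labels to 1-hot rows; classes are ordered alphabetically."""
    class_names = sorted(set(labels))
    index = {name: i for i, name in enumerate(class_names)}
    rows = []
    for label in labels:
        row = [0] * len(class_names)
        row[index[label]] = 1
        rows.append(row)
    return class_names, rows


class_names, y = one_hot_encode(["pos", "neg", "neutral", "pos"])
print(class_names)  # ['neg', 'neutral', 'pos']
print(y)            # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```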
- N/A
- N/A
- changed pandas dependency to `>=1.0.1` due to bug in pandas 1.0
- N/A
- N/A
- Transformed data containers for transformers, NER, and graph-node classification to be instances of `ktrain.data.Dataset`.
- fixed `images_from_array` so that y labels are correctly 1-hot-encoded when necessary
- correct tokenization for `bert-base-japanese` Transformer models from PR 57
- N/A
- Removed Exception when `distilbert` is selected in `text_classifier` for non-English language after Hugging Face fixed the reported bug.
- XLNet models like `xlnet-base-cased` now work after casting input arrays to `int32`
- modified `TextPredictor.explain` to propagate correct error message from `eli5` for multilabel text classification.
- N/A
- N/A
- fixed `utils.nclasses_from_data` for `ktrain.Dataset` instances
- prevent `detect_lang` failing when Pandas Series is supplied
- support for out-of-the-box text regression in both the `Transformer` API and conventional API (i.e., `text.text_regression_model`).
- `text.TextPreprocessor` prints sequence length statistics
- auto-detect language when using `Transformer` class to prevent printing `en` as default
- N/A
- `MultiArrayDataset` accepts list of Numpy arrays
- fixed incorrect activation in `TextPredictor` for multi-label Transformer models
- fixed `top_losses` for regression tasks
- initial base `ktrain.Dataset` class for use as a Sequence wrapper to better support custom datasets/models
- N/A
- N/A
- N/A
- N/A
- fix to support multilabel text classification in `Transformers`
- `_prepare_dataset` no longer breaks when validation dataset has not been supplied
- availability of a new, simplified interface to Hugging Face transformer models
- added 'distilbert' as an available model in `text.text_classifier` function
- `preproc` argument is required for `text.text_classifier`
- `core._load_model` calls `_make_predict_function` before returning model
- added warning when non-adam optimizer is used with `cycle_momentum=True`
- N/A
- N/A
- Fixed error when using ktrain with v0.2.x of `fastprogress`. ktrain can now be used with both v0.1.x and v0.2.x of `fastprogress`
- All data-loading functions (e.g., `texts_from_csv`) accept a `random_state` argument that will enable consistent reproduction of the train-test split.
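The effect of a fixed `random_state` can be illustrated without ktrain: seeding the shuffle makes the train-test split identical across runs. The helper below (`split_train_test`, its `val_pct` default, and the shuffling strategy) is a hypothetical stand-in, not the library's splitting code.

```python
import random


def split_train_test(items, val_pct=0.2, random_state=None):
    """Shuffle a copy of items with a seeded RNG and split off a validation set."""
    rng = random.Random(random_state)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_pct)
    return shuffled[n_val:], shuffled[:n_val]


data = list(range(10))
train_a, val_a = split_train_test(data, random_state=42)
train_b, val_b = split_train_test(data, random_state=42)
print(train_a == train_b and val_a == val_b)  # True: same seed, same split
```

Leaving `random_state=None` seeds the generator from system entropy, so repeated calls would generally produce different splits.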
- perform local checks for `stellargraph` where needed.
- removed `stellargraph` as dependency due to issues with it overwriting `tensorflow-gpu`
- change `setup.py` to skip navigation links for pypi page
- N/A
- All data-loading functions (e.g., `texts_from_csv`) accept a `random_state` argument that will enable consistent reproduction of the train-test split.
- perform local checks for `stellargraph` where needed.
- removed `stellargraph` as dependency due to issues with it overwriting `tensorflow-gpu`
- N/A
- ktrain now uses tf.keras (`tensorflow>=1.14,<=2.0`) instead of stand-alone Keras.
- N/A
- N/A
- N/A
- added encoding argument when reading in word vectors to bypass error on Windows systems (PR #31)
- Change preprocessing defaults and apply special preprocessing in `text.eda.get_topic_model` when non-English is detected.
- N/A
- N/A
- `TextPredictor.explain` now correctly supports non-English languages.
- Parameter `activation` is no longer ignored in `_build_bert` function
- support for learning from unlabeled or partially-labeled text data
- unsupervised topic modeling with LDA
- one-class text classification to score documents based on similarity to a set of positive examples
- document recommendation engine
- N/A
- Removed dangling reference to external 'stellargraph' dependency from `_load_model`, so that we rely solely on local version of stellargraph
- N/A
- N/A
- Removed dangling reference to external 'stellargraph' dependency so that we rely solely on local version of stellargraph
- N/A
- N/A
- store a local version of `stellargraph` to prevent it from installing `tensorflow-cpu` and overriding existing `tensorflow-gpu` installation
- Support for node classification in graphs with `ktrain.graph` module
- N/A
- N/A
- N/A
- N/A
- Call `reset` before `predict_generator` for consistent ordering of `view_top_losses` results
- Fixed incorrect reference to `train_df` instead of `val_df` in `texts_from_df`
- All `fit` methods in ktrain now accept `class_weight` parameter to handle imbalanced datasets.
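As a rough illustration of what a `class_weight` value might look like, the sketch below builds a Keras-style dictionary that maps class indices to weights. The helper `balanced_class_weights` is hypothetical, and the inverse-frequency heuristic it uses is an assumption chosen for illustration, not something this changelog prescribes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes receive proportionally larger weights.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Imbalanced toy dataset: 90 examples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
class_weight = balanced_class_weights(labels)
```

A dictionary of this shape is what the `class_weight` parameter of a `fit` method would presumably receive, so that errors on the minority class contribute more to the loss.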
- N/A
- Resolved problem with `text_classifier` incorrectly using `uncased_L-12_H-768_A-12` to build BERT model instead of `multi_cased_L-12_H-768_A-12` when non-English language was detected.
- Fixed error messages related to preproc requirement in `text_classifier`
- Fixed test script for multilingual text classification
- Fixed rendering of Chinese in `view_top_losses`
- N/A
- N/A
- Fix problem with `text_classifier` incorrectly using `uncased_L-12_H-768_A-12` to build BERT model instead of `multi_cased_L-12_H-768_A-12` when non-English language was detected.
- Added multilingual support for text classification.
- Added experimental support for tf.keras. By default, ktrain will use standalone Keras. If `os.environ['TF_KERAS']` is set, ktrain will attempt to use tf.keras. Some capabilities (e.g., `predictor.explain` for images) are not yet supported for tf.keras
- When BERT is selected, check to make sure dataset is correctly preprocessed for BERT
- Fixed `utils.bert_data_type` and ensured that it performs more checks to validate BERT-style data
- N/A
- globally import tensorflow
- suppress tensorflow deprecation warnings from TF 1.14.0
- Resolved issue with `text_classifier` failing when BERT is selected and Preprocessor is supplied.
- Support for sequence tagging with Bidirectional LSTM-CRF. Word embeddings can currently be either random or word2vec (cbow). If the latter is chosen, word vectors will be downloaded automatically from the Facebook fasttext site.
- Added `ktrain.text.texts_from_df` function
- Added FutureWarning in `text.text_classifier` stating that `preproc` will be a required argument in the future.
- In `text.text_classifier`, when `preproc=None`, use the maximum feature ID to populate max_features.
- Fixed construction of custom_objects dictionary for BERT to ensure load_model works for custom BERT models
- Resolved issue with pretrained bigru models failing when max_features is greater than or equal to the total word count.
- `explain` methods have been added to `TextPredictor` and `ImagePredictor` objects.
- `TextPredictor.predict_proba` and `ImagePredictor.predict_proba_*` convenience methods have been added.
- Added `utils.is_classifier` utility function
- `TextPredictor.predict` method can now accept a single document as input instead of always requiring a list.
- Output of `core.view_top_losses` now includes the ground truth label of examples
- Fixed test of data loading
- added additional tests of ktrain
- Added `classes` argument to `vision.images_from_folder`. Only classes/subfolders matching a name in the `classes` list will be considered.
- Resolved issue with using `learner.view_top_losses` with BERT models.
- N/A
- Added `classes` argument to `vision.images_from_folder`. Only classes/subfolders matching a name in the `classes` list will be considered.
- Fixed issue with `learner.validate` and `learner.predict` failing when validation data is in the form of an Iterator (e.g., DirectoryIterator).
- N/A
- Added check in `ktrain.lroptimize.lrfinder` to stop training if the learning rate exceeds a fixed maximum, which may happen when a bad/dysfunctional model is supplied to the learning rate finder.
- In `ktrain.text.data.texts_from_folder` function, only subfolders specified in classes argument are read in as training and validation data.
- N/A
- N/A
- Fixed error related to validation_steps=None in call to fit_generator in `ktrain.core` on Google Colab.
- Support for pretrained BERT Text Classification
- For `Learner.lr_find`, added optional `max_epochs` argument.
- Changed `Learner.confusion_matrix` to `Learner.validate` and added optional `val_data` argument. The `use_valid` argument has been removed.
- Removed `pretrained_fpath` argument to `text.text_classifier`. Pretrained word vectors are now downloaded automatically when 'bigru' is selected as model.
- Further cleanup of `utils.is_iter` function to use type check.
- N/A
- For `Learner.lr_find`, removed epochs and max_lr arguments and added lr_mult argument. Default lr_mult is 1.01, but can be changed to control size of sample being used to estimate learning rate.
- Changed structure of examples folder
- Resolved issue with `utils.y_from_data` not working correctly with DataFrameIterator objects.
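To see why `lr_mult` controls the size of the sample used by `Learner.lr_find`, consider the sketch below. It assumes the learning rate is multiplied by `lr_mult` after every batch, a common design for learning rate finders that this changelog does not spell out. The function name and the `start_lr` and `max_lr` values are hypothetical and are not ktrain's actual defaults.

```python
import math


def lr_finder_schedule(start_lr=1e-7, lr_mult=1.01, max_lr=10.0):
    """Hypothetical sketch: learning rates visited by a multiplicative LR sweep."""
    lrs = []
    lr = start_lr
    # Stop once the learning rate exceeds the fixed maximum.
    while lr <= max_lr:
        lrs.append(lr)
        lr *= lr_mult  # one multiplication per batch
    return lrs


schedule = lr_finder_schedule()
# Closed-form estimate of the number of batches in the sweep.
expected_steps = math.log(10.0 / 1e-7) / math.log(1.01)
# A larger multiplier covers the same range in fewer batches.
faster = lr_finder_schedule(lr_mult=1.05)
```

Under these assumptions, a multiplier closer to 1 samples more batches and gives a smoother loss curve, while a larger multiplier finishes sooner at the cost of a coarser estimate.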
- N/A
- Use class check in utils.is_iter as temporary fix
- revert to epochs=5 for `Learner.lr_find`
- N/A
- N/A
- N/A
- `Learner.set_weight_decay` now works correctly
- BIGRU text classifier: Bidirectional GRU using pretrained word embeddings
- Epochs are calculated automatically in `LRFinder`
- Number of epochs that `Learner.lr_find` runs can be explicitly set again
- relocated calls to tensorflow
- installation instructions and reformatted examples
- cycle_momentum argument for both `autofit` and `fit_onecycle` method that will cycle momentum between 0.95 and 0.85 as described in this paper
- `Learner.plot` method that will plot training-validation loss, LR schedule, or momentum schedule
- added `set_weight_decay` and `get_weight_decay` methods to get/set "global" weight decay in Keras
- `vision.data.preview_data_aug` now displays images in rows by default
- added multigpu flag to `core.get_learner` with comment that it is only supported by `vision.model.image_classifier`
- added `he_normal` initialization to FastText model
- Bug in `vision.data.images_from_fname` that prevented relative paths for directory argument
- Bug in `utils.y_from_data` that returned incorrect information for array-based training/validation data
- Bug in `core.autofit` with callback failure when validation data is not set
- Bug in `core.autofit` and `core.fit_onecycle` with learning rate setting at end of cycle
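The `cycle_momentum` entry in this release states only that momentum cycles between 0.95 and 0.85. The sketch below assumes a triangular schedule in which momentum falls during the first half of the cycle (while the learning rate rises) and recovers in the second half, as in the 1cycle policy. The function name and its parameters are hypothetical and do not reflect ktrain's implementation.

```python
def cyclical_momentum(step, total_steps, max_momentum=0.95, min_momentum=0.85):
    """Hypothetical sketch: triangular momentum schedule over one cycle."""
    half = total_steps / 2
    # frac is 0 at both ends of the cycle and 1 at its midpoint.
    frac = 1 - abs(step - half) / half
    return max_momentum - frac * (max_momentum - min_momentum)


# Momentum at each of 11 evenly spaced points in a 10-step cycle.
momenta = [cyclical_momentum(step, 10) for step in range(11)]
```

In this sketch momentum is lowest at the midpoint of the cycle, which is where the learning rate would peak under the same assumption.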
- Last release without CHANGELOG updates