This directory contains various example notebooks using ktrain. The directory currently has four folders:
text
- text classification: examples using various text classification models and datasets
- text regression: example for predicting continuous value purely from text
- text sequence labeling: sequence tagging models
- sentence pair classification: sentence pair classification for tasks such as paraphrase or sarcasm detection
- topic modeling: unsupervised learning from unlabeled text data
- document similarity with one-class learning: given a sample of interesting documents, find and score new documents that are semantically similar to them using one-class text classification
- document recommender system: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus
- Shallow NLP: a small collection of miscellaneous text utilities amenable to being used on machines with only a CPU available (no GPU required)
- Text Summarization: an example of text summarization using a pretrained BART model
- Open-Domain Question-Answering: ask questions to a large text corpus and receive exact candidate answers
- Zero-Shot Learning: classify documents by user-supplied topics without any training examples
- Language Translation: an example of language translation using pretrained MarianMT models
- Text Extraction: extract text from PDFs, Word documents, etc.
- Speech Transcription: extract text from an audio file
- Universal Information Extraction: an example of using a Question-Answering model for information extraction
- Keyphrase Extraction: an example of keyphrase extraction in ktrain
- Indonesian Text Examples: examples such as zero-shot text classification and question-answering on Indonesian text by Sandy Khosasi
vision
- image classification: image classification examples using various models and datasets
- image regression: example of predicting numerical values purely from images/photos
- image captioning: example of captioning images with a pretrained model
- object detection: example of object detection in images with a pretrained model
graphs
- node classification: node classification in graphs or networks
- link prediction: link prediction in graphs or networks
tabular
- classification: classification for tabular data
- regression: regression for tabular data
- causal inference: causal inference using meta-learners
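All of the notebooks listed above follow the same basic ktrain workflow: load and preprocess data, build a model, wrap both in a Learner object, then tune and train. A minimal sketch of that workflow (the dataset path below is a placeholder; a text classifier is used as the representative example):

```python
import ktrain
from ktrain import text

# 1. load and preprocess data (here, a folder of labeled text documents; path is hypothetical)
trn, val, preproc = text.texts_from_folder('data/my_dataset', maxlen=400, preprocess_mode='standard')

# 2. build a model and wrap model and data in a Learner object
model = text.text_classifier('fasttext', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val)

# 3. estimate a good learning rate, then train
learner.lr_find(show_plot=True)
learner.autofit(1e-3)

# 4. obtain a predictor that accepts raw text for inference
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('Some new document to classify.')
```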
IMDb: Binary Classification
IMDb is a dataset containing 50K movie reviews labeled as positive or negative. The corpus is split evenly between training and validation. The dataset is in the form of folders of text files.
- IMDb-fasttext.ipynb: A simple and fast "custom" fasttext model.
- IMDb-BERT.ipynb: BERT text classification to predict sentiment of movie reviews.
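The BERT workflow in ktrain looks roughly like the following, a sketch adapted from the ktrain documentation (assumes the extracted aclImdb folder from the IMDb dataset):

```python
import ktrain
from ktrain import text

# load the IMDb reviews with BERT-specific preprocessing
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('aclImdb',
                                                                       maxlen=500,
                                                                       preprocess_mode='bert',
                                                                       train_test_names=['train', 'test'],
                                                                       classes=['pos', 'neg'])
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test), batch_size=6)
learner.fit_onecycle(2e-5, 1)   # one epoch at a BERT-friendly learning rate
```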
Chinese Sentiment Analysis: Binary Classification
This dataset consists of roughly 6000 hotel reviews in Chinese. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.
- ChineseHotelReviews-nbsvm.ipynb: Training a simple and fast NBSVM model on this dataset with bigram/trigram features can achieve a validation accuracy of 92% with only 7 seconds of training.
- ChineseHotelReviews-fasttext.ipynb: Using a fast and simple fasttext-like model to predict sentiment of Chinese-language hotel reviews.
- ChineseHotelReviews-BERT.ipynb: BERT text classification to predict sentiment of Chinese-language hotel reviews.
Arabic Sentiment Analysis: Binary Classification
This dataset contains hotel reviews in Arabic. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.
- ArabicHotelReviews-nbsvm.ipynb: Training a simple and fast NBSVM model on this dataset with bigram/trigram features can achieve a validation accuracy of 94% with only seconds of training.
- ArabicHotelReviews-BERT.ipynb: BERT text classification to predict sentiment of Arabic-language hotel reviews.
20 News Groups: Multiclass Classification
This is a small sample of the 20newsgroups dataset, using the 4 newsgroups considered in the Working with Text Data scikit-learn tutorial. Data are in the form of arrays fetched via the scikit-learn library. These examples show the results of training on a relatively small training set.
- 20newsgroups-NBSVM.ipynb: NBSVM model using unigram features only.
- 20newsgroups-BERT.ipynb: BERT text classification in a multiclass setting.
- 20newsgroups-distilbert.ipynb: a faster, smaller version of BERT called DistilBERT for multiclass text classification (see the sketch after this list)
- ktrain-ONNX-TFLite-examples.ipynb: text classification using ONNX and TensorFlow Lite
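The DistilBERT notebook uses ktrain's Transformer API. A sketch of the setup (hyperparameters are illustrative):

```python
import ktrain
from ktrain import text
from sklearn.datasets import fetch_20newsgroups

# fetch the 4-newsgroup subset used in the scikit-learn tutorial
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)

t = text.Transformer('distilbert-base-uncased', maxlen=500, class_names=train.target_names)
trn = t.preprocess_train(train.data, train.target)
val = t.preprocess_test(test.data, test.target)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
```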
Toxic Comments: Multi-Label Text Classification
In multi-label classification, a single document can belong to multiple classes. The objective here is to categorize each text comment into one or more categories of toxic online behavior. The dataset is in the form of a CSV file.
- toxic_comments-fasttext.ipynb: A fasttext-like model applied in a multi-label setting.
- toxic_comments-bigru.ipynb: A bidirectional GRU using pretrained GloVe vectors. This example shows how to use pretrained word vectors with ktrain.
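A hedged sketch of the multi-label setup (the CSV path is a placeholder; ktrain infers the multi-label setting from the label columns):

```python
import ktrain
from ktrain import text

# the six toxicity labels from the Kaggle Toxic Comments dataset
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv('train.csv', 'comment_text',
                                                                    label_columns=LABELS,
                                                                    maxlen=400, max_features=20000,
                                                                    val_pct=0.1)
model = text.text_classifier('fasttext', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))
learner.autofit(1e-3)   # train with early stopping on the validation set
```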
Wine Prices: Text Regression
This dataset consists of prices of various wines along with textual descriptions of the wine. We will attempt to predict the price of a wine based purely on its text description.
- text_regression_example.ipynb: Using an Embedding-based implementation of linear regression to predict wine prices from the text description.
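A rough sketch of the setup (file and column names are assumptions based on the wine dataset; ktrain treats the task as regression when the label column contains continuous values):

```python
import ktrain
from ktrain import text

# load wine descriptions and prices; path and column names are hypothetical
trn, val, preproc = text.texts_from_csv('wines.csv', 'description',
                                        label_columns=['price'],
                                        maxlen=150, val_pct=0.1)
model = text.text_regression_model('linreg', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val)
learner.autofit(1e-2)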
CoNLL2003 NER Task: Named Entity Recognition
The objective of the CoNLL2003 task is to classify sequences of words as belonging to one of several categories of concepts such as Persons or Locations. See the original paper for more information on the format.
- CoNLL2003-BiLSTM_CRF.ipynb: A simple and fast Bidirectional LSTM-CRF model with randomly initialized word embeddings.
- CoNLL2003-BiLSTM.ipynb: A Bidirectional LSTM model with pretrained BERT embeddings
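A sketch of the sequence-tagging workflow (file paths are placeholders for CoNLL-2003-formatted data):

```python
import ktrain
from ktrain import text

# load CoNLL-2003-formatted training and validation files (paths are hypothetical)
(trn, val, preproc) = text.entities_from_conll2003('data/conll2003/train.txt',
                                                   val_filepath='data/conll2003/valid.txt')
model = text.sequence_tagger('bilstm-crf', preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit(1e-2, 1)   # one epoch at a flat learning rate
```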
CoNLL2002 NER Task (Dutch): Named Entity Recognition for Dutch
- CoNLL2002_Dutch-BiLSTM.ipynb: A Bidirectional LSTM model that uses pretrained BERT embeddings along with pretrained fasttext word embeddings - both for Dutch.
Microsoft Research Paraphrase Corpus (MRPC): Paraphrase Detection
- MRPC-BERT.ipynb: Using BERT for sentence pair classification on MRPC dataset
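For sentence pair classification, ktrain accepts a list of (sentence1, sentence2) tuples as input to the Transformer API. A toy sketch (the data and hyperparameters below are illustrative, not from the MRPC notebook):

```python
import ktrain
from ktrain import text

# toy sentence pairs; 1 = paraphrase, 0 = not a paraphrase
x_train = [('He said the food was good.', 'The food was delicious, he said.')]
y_train = [1]

t = text.Transformer('bert-base-uncased', maxlen=128,
                     class_names=['not_paraphrase', 'paraphrase'])
trn = t.preprocess_train(x_train, y_train)   # tuples trigger sentence pair preprocessing
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, batch_size=8)
learner.fit_onecycle(5e-5, 3)
```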
20 News Groups: unsupervised learning on 20newsgroups corpus
- 20newsgroups-topic_modeling.ipynb: Discover latent topics and themes in the 20 newsgroups corpus
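Topic modeling in ktrain is accessed via get_topic_model. A minimal sketch (rawtexts is assumed to be a list of document strings):

```python
from ktrain import text

tm = text.get_topic_model(rawtexts, n_features=10000)   # fit an LDA-based topic model
tm.build(rawtexts, threshold=0.25)    # keep docs with sufficient topic probability
tm.print_topics()                     # show the discovered topics
docs = tm.get_docs(topic_ids=[0], rank=True)   # docs most representative of topic 0
```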
20 News Groups: select a set of positive examples from 20newsgroups dataset
- 20newsgroups-document_similarity_scorer.ipynb: Given a selected seed set of documents from 20newsgroup corpus, find and score new documents that are semantically similar to it.
20 News Groups: recommend posts from 20newsgroups
- 20newsgroups-recommendation_engine.ipynb: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus
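A sketch of the recommender workflow built on the same topic model (rawtexts is assumed to be the larger corpus as a list of strings):

```python
from ktrain import text

tm = text.get_topic_model(rawtexts)
tm.build(rawtexts, threshold=0.25)
tm.train_recommender()
results = tm.recommend(text='text from a sample document', n=5)   # top-5 similar documents
```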
Shallow NLP is a submodule in ktrain that provides a small collection of miscellaneous text utilities for analyzing text on machines with no GPU and minimal computational resources. It includes:
- text classification: a non-neural version of NBSVM that can be trained in a single line of code
- named-entity-recognition (NER): out-of-the box, ready-to-use NER for English, Russian, and Chinese that just works with no training
- multilingual text searches: searching text in multiple languages
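For example, the ready-to-use NER can be invoked in a couple of lines (a minimal sketch):

```python
from ktrain.text import shallownlp as snlp

ner = snlp.NER('en')   # 'zh' (Chinese) and 'ru' (Russian) are also supported
ner.predict('Paul Newman is my favorite actor.')   # returns extracted entities with types
```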
Text Summarization with pretrained BART: text_summarization_with_bart.ipynb
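Usage is roughly as follows (a sketch; rawtext is assumed to be the document string to summarize):

```python
from ktrain.text.summarization import TransformerSummarizer

ts = TransformerSummarizer()      # downloads a pretrained BART model on first use
summary = ts.summarize(rawtext)
```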
Open-Domain Question-Answering: question_answering_with_bert.ipynb
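A sketch of the question-answering workflow (the index directory is a placeholder and docs is assumed to be a list of document strings):

```python
from ktrain.text.qa import SimpleQA

INDEXDIR = '/tmp/myindex'
SimpleQA.initialize_index(INDEXDIR)                                # create a search index
SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs))   # index the corpus
qa = SimpleQA(INDEXDIR)
answers = qa.ask('What causes computer images to be slow?')
qa.display_answers(answers[:5])   # show top candidate answers
```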
Zero-Shot Learning: zero_shot_learning_with_nli.ipynb
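A minimal sketch (the example document and candidate labels are illustrative):

```python
from ktrain.text.zsl import ZeroShotClassifier

zsl = ZeroShotClassifier()   # uses a pretrained NLI model under the hood
labels = ['politics', 'elections', 'sports', 'films', 'television']
zsl.predict('I am extremely dissatisfied with the President.',
            labels=labels, include_labels=True)   # returns a score per label
```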
Language Translation: language_translation_example.ipynb
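A minimal sketch (the model name selects the source and target languages):

```python
from ktrain.text.translation import Translator

translator = Translator(model_name='Helsinki-NLP/opus-mt-de-en')   # German to English
translator.translate('Wie geht es Ihnen heute?')
```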
Text Extraction: text_extraction_example.ipynb
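A minimal sketch (the file path is a placeholder; Word documents and other formats are also supported):

```python
from ktrain.text import TextExtractor

te = TextExtractor()
rawtext = te.extract('/path/to/document.pdf')   # hypothetical path
```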
Speech Transcription: speech_transcription_example.ipynb
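A minimal sketch (the import path follows the ktrain speech transcription example; the audio path is a placeholder):

```python
from ktrain.text import Transcriber

transcriber = Transcriber()   # wav2vec2-based; downloads a pretrained model on first use
transcribed_text = transcriber.transcribe('/path/to/audio.wav')   # hypothetical path
```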
Universal Information Extraction: qa_information_extraction.ipynb
Keyphrase Extraction: keyword_extraction_example.ipynb
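A minimal sketch (the candidate_generator argument is an assumption based on the ktrain documentation; rawtext is the document string):

```python
from ktrain.text.kw import KeywordExtractor

kwe = KeywordExtractor()
kwe.extract_keywords(rawtext, candidate_generator='noun_phrases')
```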
Indonesian NLP examples by Sandy Khosasi including Indonesian question-answering, emotion recognition, and document similarity
Dogs vs. Cats: Binary Classification
- dogs_vs_cats-ResNet50.ipynb: ResNet50 pretrained on ImageNet.
- dogs_vs_cats-MobileNet.ipynb: MobileNet pretrained on ImageNet, applied to a filtered version of the Dogs vs. Cats dataset.
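Both notebooks follow the standard ktrain vision workflow. A sketch (the data directory and subfolder names are placeholders):

```python
import ktrain
from ktrain import vision as vis

# assumes extracted data in train/ and valid/ subfolders of a hypothetical directory
(train_data, val_data, preproc) = vis.images_from_folder('data/dogscats',
                                                         data_aug=vis.get_data_aug(horizontal_flip=True),
                                                         train_test_names=['train', 'valid'])
model = vis.image_classifier('pretrained_resnet50', train_data, val_data)
learner = ktrain.get_learner(model, train_data=train_data, val_data=val_data, batch_size=64)
learner.fit_onecycle(1e-4, 3)
```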
MNIST: Multiclass Classification
- mnist-WRN22.ipynb: A randomly-initialized Wide Residual Network applied to MNIST.
- mnist-image_from_array_example.ipynb: Build an MNIST model using images_from_array.
- mnist-tf_workflow.ipynb: Illustrates how ktrain can be used in a minimally-invasive way within a normal TensorFlow workflow.
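The images_from_array entry point accepts raw NumPy arrays. A sketch using the Keras MNIST loader (model choice and hyperparameters are illustrative):

```python
import numpy as np
import ktrain
from ktrain import vision as vis
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)   # add a channel dimension: (N, 28, 28, 1)
x_test = np.expand_dims(x_test, axis=-1)

(trn, val, preproc) = vis.images_from_array(x_train, y_train,
                                            validation_data=(x_test, y_test),
                                            class_names=[str(i) for i in range(10)])
model = vis.image_classifier('default_cnn', trn, val)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit_onecycle(1e-3, 1)
```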
CIFAR10: Multiclass Classification
- cifar10-WRN22.ipynb: A randomly-initialized Wide Residual Network applied to CIFAR10
Pets: Multiclass Classification
- pets-ResNet50.ipynb: Categorizing dogs and cats by breed using a pretrained ResNet50. Uses the images_from_fname function, as class labels are embedded in the file names of images.
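A sketch of images_from_fname usage (the folder is a placeholder; the regex maps a file name like Bengal_101.jpg to the class Bengal):

```python
import ktrain
from ktrain import vision as vis

# extract the class label from each file name via a regex capture group
(trn, val, preproc) = vis.images_from_fname('data/pets/images',
                                            pattern=r'([^/]+)_\d+.jpg$',
                                            val_pct=0.1, random_state=42,
                                            data_aug=vis.get_data_aug(horizontal_flip=True))
model = vis.image_classifier('pretrained_resnet50', trn, val)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)
learner.fit_onecycle(1e-4, 3)
```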
Planet: Multilabel Classification
The Kaggle Planet dataset consists of satellite images - each of which are categorized into multiple categories. Image labels are in the form of a CSV containing paths to images.
- planet-ResNet50.ipynb: Using a pretrained ResNet50 model for multi-label classification.
Age Prediction: Image Regression
- utk_faces_age_prediction-resnet50.ipynb: ResNet50 pretrained on ImageNet for age prediction using UTK Face dataset
- image_captioning_example.ipynb is a notebook illustrating pretrained image-captioning in ktrain
- object_detection_example.ipynb is a notebook illustrating pretrained object-detection in ktrain
PubMed-Diabetes: Node Classification
In the PubMed graph, each node represents a paper pertaining to one of three topics: Diabetes Mellitus - Experimental, Diabetes Mellitus - Type 1, and Diabetes Mellitus - Type 2. Links represent citations between papers. The attributes or features assigned to each node are in the form of a vector of words in each paper and their corresponding TF-IDF scores.
- pubmed_node_classification-GraphSAGE.ipynb: GraphSAGE model for transductive and inductive inference.
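A sketch of the node-classification workflow (file names are placeholders; the CSV formats follow the ktrain graph examples):

```python
import ktrain
from ktrain import graph as gr

# nodes.csv holds node features and labels; edges.csv holds citation links (hypothetical names)
(trn, val, preproc) = gr.graph_nodes_from_csv('nodes.csv', 'edges.csv',
                                              sample_size=10, holdout_pct=0.2)
model = gr.graph_node_classifier('graphsage', trn)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)
learner.fit_onecycle(0.01, 5)
```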
Cora Citation Graph: Node Classification
In the Cora citation graph, each node represents a paper pertaining to one of several topic categories. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.
- cora_node_classification-GraphSAGE.ipynb: GraphSAGE model for transductive inference on validation and test set of nodes in graph.
Hateful Twitter Users: Node Classification
Dataset of Twitter users and their attributes. A small portion of the user accounts are annotated as hateful
or normal
. The goal is to predict hateful accounts based on user features and graph structure.
- hateful_twitter_users-GraphSAGE.ipynb: GraphSAGE model to predict hateful Twitter users using transductive inference.
Cora Citation Graph: Link Prediction
In the Cora citation graph, each node represents a paper. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.
- cora_link_prediction-GraphSAGE.ipynb: GraphSAGE model to predict missing links in the citation network.
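A hedged sketch of the link-prediction workflow (file names are placeholders and the exact loader arguments may vary):

```python
import ktrain
from ktrain import graph as gr

# node features and citation links in CSV form (hypothetical file names)
(trn, val, preproc) = gr.graph_links_from_csv('nodes.csv', 'edges.csv')
model = gr.graph_link_predictor('graphsage', trn, preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val)
learner.fit_onecycle(0.01, 5)
```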
Titanic Survival Prediction: Tabular Classification
This is the well-studied Titanic dataset from Kaggle. The goal is to predict which passengers survived the Titanic disaster based on their attributes.
- tabular_classification_and_regression_example.ipynb: MLP for tabular classification
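A sketch of the tabular workflow (assumes the Kaggle Titanic train.csv has been downloaded):

```python
import pandas as pd
import ktrain
from ktrain import tabular

train_df = pd.read_csv('train.csv', index_col=0)
trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], val_pct=0.1)
model = tabular.tabular_classifier('mlp', trn)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit_onecycle(1e-3, 10)
```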
Income Prediction from Census Data: Tabular Classification
This is the same dataset used in the AutoGluon classification example. The goal is to predict which individuals make over $50K per year.
- IncomePrediction-MLP.ipynb: MLP for tabular classification
Adults Census Dataset: Tabular Regression
The original goal of this Census dataset is to predict which individuals make over $50K. We change the task to a regression problem and predict the Age attribute for each individual. This is the same dataset used in the AutoGluon regression example.
- tabular_classification_and_regression_example.ipynb: MLP for tabular regression
House Price Prediction: Tabular Regression
- HousePricePrediction-MLP.ipynb: MLP for tabular regression
Adults Census Dataset: Tabular Causal Inference
The original goal of this Census dataset is to predict which individuals make over $50K. Here, we use causal inference to estimate the causal impact of having a PhD on earning over $50K.
- causal_inference_example.ipynb: use meta-learners for causal inference and uplift modeling
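A hedged sketch of the causal inference API (the column names below are placeholders; the treatment and outcome columns are assumed to be binary):

```python
import pandas as pd
from ktrain.tabular import causal_inference_model

df = pd.read_csv('adults.csv')   # hypothetical file name for the Census data
cm = causal_inference_model(df,
                            treatment_col='has_phd',            # hypothetical binary treatment
                            outcome_col='income_over_50k').fit()  # hypothetical binary outcome
ate = cm.estimate_ate()   # average treatment effect of the treatment on the outcome
```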