Example Notebooks

This directory contains various example notebooks using ktrain. The directory currently has four folders:

text:
- text classification: examples using various text classification models and datasets
- text regression: an example of predicting a continuous value purely from text
- text sequence labeling: sequence tagging models
- sentence pair classification: sentence pair classification for tasks such as paraphrase or sarcasm detection
- topic modeling: unsupervised learning from unlabeled text data
- document similarity with one-class learning: given a sample of interesting documents, find and score new documents that are semantically similar to it using One-Class text classification
- document recommender system: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus
- Shallow NLP: a small collection of miscellaneous text utilities amenable to being used on machines with only a CPU available (no GPU required)
- Text Summarization: an example of text summarization using a pretrained BART model
- Open-Domain Question-Answering: ask questions to a large text corpus and receive exact candidate answers
- Zero-Shot Learning: classify documents by user-supplied topics without any training examples
- Language Translation: an example of language translation using pretrained MarianMT models
- Text Extraction: extract text from PDFs, Word documents, etc.
- Speech Transcription: extract text from audio files
- Universal Information Extraction: an example of using a Question-Answering model for information extraction
- Keyphrase Extraction: an example of keyphrase extraction in ktrain
- Indonesian Text Examples: examples such as zero-shot text classification and question-answering on Indonesian text by Sandy Khosasi
vision:
- image classification: image classification examples using various models and datasets
- image regression: example of predicting numerical values purely from images/photos
- image captioning: example of captioning images with a pretrained model
- object detection: example of object detection in images with a pretrained model
graphs:
- node classification: node classification in graphs or networks
- link prediction: link prediction in graphs or networks
tabular:
- classification: classification for tabular data
- regression: regression for tabular data
- causal inference: causal inference using meta-learners

Text Data

Text Classification

IMDb: Binary Classification

IMDb is a dataset containing 50K movie reviews labeled as positive or negative. The corpus is split evenly between training and validation. The dataset is in the form of folders of text files.

IMDb-fasttext.ipynb: A simple and fast "custom" fasttext model.
IMDb-BERT.ipynb: BERT text classification to predict sentiment of movie reviews.
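
The IMDb corpus is commonly distributed as one text file per review, grouped into per-class subfolders. The following is a minimal standard-library sketch of reading such a layout into parallel lists of texts and labels. The function name `load_texts_from_folder`, the `pos`/`neg` folder names, and the tiny temporary dataset are assumptions made for illustration; this is not ktrain's own loader.

```python
import tempfile
from pathlib import Path


def load_texts_from_folder(root, classes=("neg", "pos")):
    """Read one document per file from root/<class>/*.txt.

    Returns (texts, labels), where each label is an index into `classes`.
    """
    texts, labels = [], []
    for label, cls in enumerate(classes):
        for path in sorted((Path(root) / cls).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Build a tiny stand-in for a train/ folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    for cls, review in [("pos", "A wonderful film."), ("neg", "A dull film.")]:
        (train / cls).mkdir(parents=True)
        (train / cls / "0_1.txt").write_text(review, encoding="utf-8")
    texts, labels = load_texts_from_folder(train)

print(texts, labels)
```

Labels follow the order of `classes`, so here `neg` maps to 0 and `pos` maps to 1.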

Chinese Sentiment Analysis: Binary Classification

This dataset consists of roughly 6000 hotel reviews in Chinese. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.

ChineseHotelReviews-nbsvm.ipynb: Training a simple and fast NBSVM model on this dataset with bigram/trigram features can achieve a validation accuracy of 92% with only 7 seconds of training.
ChineseHotelReviews-fasttext.ipynb: Using a fast and simple fasttext-like model to predict sentiment of Chinese-language hotel reviews.
ChineseHotelReviews-BERT.ipynb: BERT text classification to predict sentiment of Chinese-language hotel reviews.

Arabic Sentiment Analysis: Binary Classification

This dataset contains hotel reviews in Arabic. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.

ArabicHotelReviews-nbsvm.ipynb: Training a simple and fast NBSVM model on this dataset with bigram/trigram features can achieve a validation accuracy of 94% with only seconds of training.
ArabicHotelReviews-BERT.ipynb: BERT text classification to predict sentiment of Arabic-language hotel reviews.
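
The NBSVM notebooks above mention bigram/trigram features. As a rough illustration of the feature side of NBSVM, the sketch below extracts word n-grams and computes smoothed Naive Bayes log-count ratios for a binary task. The helper names, the whitespace tokenization, the smoothing value `alpha`, and the toy English reviews are all assumptions; this is not ktrain's implementation, and a full NBSVM would go on to train a linear classifier on features scaled by these ratios.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def log_count_ratios(docs, labels, n_max=3, alpha=1.0):
    """Compute r = log((p / |p|_1) / (q / |q|_1)) over binarized n-gram counts.

    p and q are the smoothed per-feature document counts for the
    positive (label 1) and negative (label 0) classes.
    """
    pos, neg = Counter(), Counter()
    for doc, label in zip(docs, labels):
        feats = set(ngrams(doc.split(), n_max))  # binarized: presence only
        (pos if label == 1 else neg).update(feats)
    vocab = set(pos) | set(neg)
    p_total = sum(pos[f] + alpha for f in vocab)
    q_total = sum(neg[f] + alpha for f in vocab)
    return {f: math.log(((pos[f] + alpha) / p_total) /
                        ((neg[f] + alpha) / q_total))
            for f in vocab}


# Toy data, made up for illustration only.
docs = ["great hotel", "great staff", "dirty hotel", "dirty room"]
labels = [1, 1, 0, 0]
r = log_count_ratios(docs, labels, n_max=2)
print(sorted(r.items(), key=lambda item: item[1]))
```

Features seen mostly in positive documents get a positive ratio, features seen mostly in negative documents get a negative one, and features shared evenly between the classes sit at zero.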

20 News Groups: Multiclass Classification

This is a small sample of the 20newsgroups dataset restricted to 4 newsgroups, similar to what was done in the Working with Text Data scikit-learn tutorial. The data are in the form of arrays fetched via the scikit-learn library. These examples show the results of training on a relatively small training set.

20newsgroups-NBVSM.ipynb: NBSVM model using unigram features only.
20newsgroups-BERT.ipynb: BERT text classification in a multiclass setting.
20newsgroups-distilbert.ipynb: a faster, smaller version of BERT called DistilBert for multiclass text classification
ktrain-ONNX-TFLite-examples.ipynb: text classification using ONNX and TensorFlow Lite

Toxic Comments: Multi-Label Text Classification

In multi-label classification, a single document can belong to multiple classes. The objective here is to categorize each text comment into one or more categories of toxic online behavior. Dataset is in the form of a CSV file.

toxic_comments-fasttext.ipynb: A fasttext-like model applied in a multi-label setting.
toxic_comments-bigru.ipynb: A bidirectional GRU using pretrained GloVe vectors. This example shows how to use pretrained word vectors using ktrain.
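
To make the multi-label setup concrete, here is a small sketch of reading a CSV in which each label is a separate 0/1 column, so that every comment ends up with a multi-hot target vector. The column names are assumed to match the Kaggle training file, the two rows are made up, and `read_multilabel_csv` is a hypothetical helper rather than a ktrain function.

```python
import csv
import io

# Assumed label columns; adjust to match the actual CSV header.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene",
                 "threat", "insult", "identity_hate"]

# Two made-up rows in the assumed layout of the training file.
SAMPLE_CSV = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
1,"Thanks for the helpful edit.",0,0,0,0,0,0
2,"You are an idiot.",1,0,0,0,1,0
"""


def read_multilabel_csv(handle, text_column="comment_text",
                        label_columns=LABEL_COLUMNS):
    """Return (texts, targets); each target is a list of 0/1 flags."""
    texts, targets = [], []
    for row in csv.DictReader(handle):
        texts.append(row[text_column])
        targets.append([int(row[c]) for c in label_columns])
    return texts, targets


texts, targets = read_multilabel_csv(io.StringIO(SAMPLE_CSV))
print(targets)
```

Unlike multiclass targets, a row may contain several ones or none at all, which is why multi-label models typically use an independent sigmoid output per class.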

Text Regression

Wine Prices: Text Regression

This dataset consists of prices of various wines along with textual descriptions of the wine. We will attempt to predict the price of a wine based purely on its text description.

text_regression_example.ipynb: Using an Embedding-based implementation of linear regression to predict wine prices from a text column.

Sequence Labeling

CoNLL2003 NER Task: Named Entity Recognition

The objective of the CoNLL2003 task is to classify sequences of words as belonging to one of several categories of concepts such as Persons or Locations. See the original paper for more information on the format.

CoNLL2003-BiLSTM_CRF.ipynb: A simple and fast Bidirectional LSTM-CRF model with randomly initialized word embeddings.
CoNLL2003-BiLSTM.ipynb: A Bidirectional LSTM model with pretrained BERT embeddings
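
As a rough guide to the file format, CoNLL-style NER data places one token per line with whitespace-separated columns, the entity tag in the last column, and a blank line between sentences. The sketch below parses that layout. The sample text is illustrative, the exact tag scheme varies between distributions of the data, and `read_conll` is a hypothetical helper, not part of ktrain.

```python
# Illustrative sample: token, POS tag, chunk tag, entity tag.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the word
        tags.append(columns[-1])    # last column: the entity tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sentences = read_conll(SAMPLE)
for tokens, tags in sentences:
    print(list(zip(tokens, tags)))
```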

CoNLL2002 NER Task (Dutch): Named Entity Recognition for Dutch

CoNLL2002_Dutch-BiLSTM.ipynb: A Bidirectional LSTM model that uses pretrained BERT embeddings along with pretrained fasttext word embeddings - both for Dutch.

Sentence Pair Classification

Microsoft Research Paraphrase Corpus (MRPC): Paraphrase Detection

MRPC-BERT.ipynb: Using BERT for sentence pair classification on MRPC dataset

Topic Modeling

20 News Groups: unsupervised learning on 20newsgroups corpus

20newsgroups-topic_modeling.ipynb: Discover latent topics and themes in the 20 newsgroups corpus

One-Class Text Classification

20 News Groups: select a set of positive examples from 20newsgroups dataset

20newsgroups-document_similarity_scorer.ipynb: Given a selected seed set of documents from 20newsgroup corpus, find and score new documents that are semantically similar to it.

Text Recommender System

20 News Groups: recommend posts from 20newsgroups

20newsgroups-recommendation_engine.ipynb: given text from a sample document, recommend documents that are semantically similar to it from a larger corpus

Shallow NLP

shallownlp-examples.ipynb

Shallow NLP is a submodule in ktrain that provides a small collection of miscellaneous text utilities to analyze text on machines with no GPU and minimal computational resources. Includes:
- text classification: a non-neural version of NBSVM that can be trained in a single line of code
- named-entity-recognition (NER): out-of-the-box, ready-to-use NER for English, Russian, and Chinese that just works with no training
- multilingual text searches: searching text in multiple languages

Text Summarization with pretrained BART: text_summarization_with_bart.ipynb

Open-Domain Question-Answering: question_answering_with_bert.ipynb

Zero-Shot Learning: zero_shot_learning_with_nli.ipynb

Language Translation: language_translation_example.ipynb

Text Extraction: text_extraction_example.ipynb

Speech Transcription: speech_transcription_example.ipynb

Universal Information Extraction: qa_information_extraction.ipynb

Keyphrase Extraction: keyword_extraction_example.ipynb

Indonesian NLP examples by Sandy Khosasi including Indonesian question-answering, emotion recognition, and document similarity

Vision Data

Image Classification

Dogs vs. Cats: Binary Classification

dogs_vs_cats-ResNet50.ipynb: ResNet50 pretrained on ImageNet.

Dogs vs. Cats: Binary Classification

dogs_vs_cats-MobileNet.ipynb: MobileNet pretrained on ImageNet, applied to a filtered version of the Dogs vs. Cats dataset

MNIST: Multiclass Classification

mnist-WRN22.ipynb: A randomly-initialized Wide Residual Network applied to MNIST

MNIST: Multiclass Classification

mnist-image_from_array_example.ipynb: Build an MNIST model using images_from_array

MNIST: Multiclass Classification

mnist-tf_workflow.ipynb: Illustrates how ktrain can be used in a minimally invasive way with a normal TensorFlow workflow

CIFAR10: Multiclass Classification

cifar10-WRN22.ipynb: A randomly-initialized Wide Residual Network applied to CIFAR10

Pets: Multiclass Classification

pets-ResNet50.ipynb: Categorizing dogs and cats by breed using a pretrained ResNet50. Uses the images_from_fname function, as class labels are embedded in the file names of images.
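
When class labels are embedded in file names, a regular expression with one capture group is usually enough to recover them. The sketch below assumes file names of the form `<breed>_<number>.jpg`; the pattern, the example paths, and the helper `label_from_filename` are assumptions for illustration and are not taken from the notebook.

```python
import re

# Capture everything after the last "/" and before "_<digits>.jpg".
PATTERN = re.compile(r"([^/]+)_\d+\.jpg$")


def label_from_filename(path):
    """Return the class label embedded in an image file name."""
    match = PATTERN.search(path)
    if match is None:
        raise ValueError(f"no class label found in {path!r}")
    return match.group(1)


# Hypothetical paths in the assumed naming scheme.
paths = ["images/Abyssinian_12.jpg", "images/american_bulldog_100.jpg"]
labels = [label_from_filename(p) for p in paths]
print(labels)
```

Because the capture group is greedy, breed names that themselves contain underscores are kept whole and only the trailing number is stripped.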

Planet: Multilabel Classification

The Kaggle Planet dataset consists of satellite images, each of which is categorized into multiple categories. Image labels are in the form of a CSV containing paths to images.

planet-ResNet50.ipynb: Using a pretrained ResNet50 model for multi-label classification.

Image Regression

Age Prediction: Image Regression

utk_faces_age_prediction-resnet50.ipynb: ResNet50 pretrained on ImageNet for age prediction using UTK Face dataset

Image Captioning

image_captioning_example.ipynb is a notebook illustrating pretrained image-captioning in ktrain

Object Detection

object_detection_example.ipynb is a notebook illustrating pretrained object-detection in ktrain

Graph Data

Graph Node Classification Datasets

PubMed-Diabetes: Node Classification

In the PubMed graph, each node represents a paper pertaining to one of three topics: Diabetes Mellitus - Experimental, Diabetes Mellitus - Type 1, and Diabetes Mellitus - Type 2. Links represent citations between papers. The attributes or features assigned to each node are in the form of a vector of words in each paper and their corresponding TF-IDF scores.

pubmed_node_classification-GraphSAGE.ipynb: GraphSAGE model for transductive and inductive inference.
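
For readers unfamiliar with TF-IDF node features, the sketch below computes one common variant (relative term frequency times the log of inverse document frequency) for a toy set of papers. The exact weighting used to build the PubMed features is not stated here, so the formula, the function `tfidf`, and the toy vocabulary are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: score} dict per document.

    score = (count / document length) * log(N / document frequency)
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
                        for word, count in counts.items()})
    return vectors


# Toy "papers", each already tokenized into words.
papers = [["insulin", "glucose", "insulin"],
          ["glucose", "retina"],
          ["glucose", "kidney"]]
vectors = tfidf(papers)
print(vectors[0])
```

A word that occurs in every paper, such as `glucose` here, receives a score of zero, while a word concentrated in a single paper is weighted up.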

Cora Citation Graph: Node Classification

In the Cora citation graph, each node represents a paper pertaining to one of several topic categories. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.

cora_node_classification-GraphSAGE.ipynb: GraphSAGE model for transductive inference on validation and test set of nodes in graph.
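
A multi-hot-encoded word vector has one position per vocabulary term, set to 1 if the term occurs in the paper and 0 otherwise. A minimal sketch, using a made-up four-word vocabulary and a hypothetical helper `multi_hot`:

```python
def multi_hot(words, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in `words`."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


# Made-up vocabulary and paper; the real vocabulary is far larger.
vocabulary = ["bayesian", "genetic", "neural", "reinforcement"]
paper = ["neural", "network", "neural", "bayesian"]
vector = multi_hot(paper, vocabulary)
print(vector)
```

Repeated words and words outside the vocabulary do not change the vector; only presence or absence is recorded.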

Hateful Twitter Users: Node Classification

Dataset of Twitter users and their attributes. A small portion of the user accounts are annotated as hateful or normal. The goal is to predict hateful accounts based on user features and graph structure.

hateful_twitter_users-GraphSAGE.ipynb: GraphSAGE model to predict hateful Twitter users using transductive inference.

Graph Link Prediction Datasets

Cora Citation Graph: Link Prediction

In the Cora citation graph, each node represents a paper. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.

cora_link_prediction-GraphSAGE.ipynb: GraphSAGE model to predict missing links in the citation network.
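
One common way to set up link prediction as a supervised task is to hide a fraction of the existing edges as positive examples and to sample an equal number of unconnected node pairs as negative examples. The sketch below illustrates that idea on a toy undirected graph. The function `split_edges`, its parameters, and the graph are assumptions; the notebook's own data preparation may differ.

```python
import random
from itertools import combinations


def split_edges(nodes, edges, holdout_fraction=0.2, seed=0):
    """Split an undirected graph into training edges and link examples.

    Returns (remaining, held_out, negatives):
      remaining : edges kept in the training graph
      held_out  : hidden true edges (positive examples)
      negatives : sampled node pairs with no edge (negative examples)
    """
    rng = random.Random(seed)
    edges = [tuple(sorted(edge)) for edge in edges]  # canonical order
    n_holdout = max(1, int(len(edges) * holdout_fraction))
    held_out = rng.sample(edges, n_holdout)
    remaining = [edge for edge in edges if edge not in held_out]
    edge_set = set(edges)
    non_edges = [pair for pair in combinations(sorted(nodes), 2)
                 if pair not in edge_set]
    negatives = rng.sample(non_edges, n_holdout)
    return remaining, held_out, negatives


# Toy citation graph: a ring of five papers.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "e")]
remaining, held_out, negatives = split_edges(nodes, edges, holdout_fraction=0.4)
print(remaining, held_out, negatives)
```

A model is then trained on the remaining graph to tell the hidden edges apart from the sampled non-edges.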

Tabular Data

Tabular Classification Datasets

Titanic Survival Prediction: Tabular Classification

This is the well-studied Titanic dataset from Kaggle. The goal is to predict which passengers survived the Titanic disaster based on their attributes.

tabular_classification_and_regression_example.ipynb: MLP for tabular classification

Income Prediction from Census Data: Tabular Classification

This is the same dataset used in the AutoGluon classification example. The goal is to predict which individuals make over $50K per year.

IncomePrediction-MLP.ipynb: MLP for tabular classification

Tabular Regression Datasets

Adults Census Dataset: Tabular Regression

The original goal of this dataset is to predict the individuals that make over $50K in this Census dataset. We change the task to a regression problem and predict the Age attribute for each individual. This is the same dataset used in the AutoGluon regression example.

tabular_classification_and_regression_example.ipynb: MLP for tabular regression

House Price Prediction: Tabular Regression

HousePricePrediction-MLP.ipynb: MLP for tabular regression

Tabular Causal Inference

Adults Census Dataset: Tabular Causal Inference

The original goal of this dataset is to predict the individuals that make over $50K in this Census dataset. Here, we will use causal inference to estimate the causal impact of having a PhD on earning over $50K.

causal_inference_example.ipynb: use meta-learners for causal inference and uplift modeling
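
As a rough illustration of what a meta-learner does, the sketch below implements a toy T-learner: it fits one outcome model on treated units and another on control units, predicts both outcomes for every unit, and averages the difference. The "models" here are just per-stratum means over a single discrete covariate, and the data are made up; the notebook's actual learners, features, and estimates are not reproduced.

```python
from collections import defaultdict


def fit_stratum_means(rows):
    """Toy outcome model: mean outcome for each covariate value."""
    totals = defaultdict(lambda: [0.0, 0])
    for x, y in rows:
        totals[x][0] += y
        totals[x][1] += 1
    return {x: total / count for x, (total, count) in totals.items()}


def t_learner_ate(data):
    """Estimate the average treatment effect with a T-learner.

    data: list of (covariate, treated, outcome) tuples. Every covariate
    value must appear in both the treated and the control group.
    """
    treated_model = fit_stratum_means((x, y) for x, t, y in data if t == 1)
    control_model = fit_stratum_means((x, y) for x, t, y in data if t == 0)
    # Predicted effect per unit: treated prediction minus control prediction.
    effects = [treated_model[x] - control_model[x] for x, _, _ in data]
    return sum(effects) / len(effects)


# Made-up rows: (work schedule, has PhD, earns over $50K).
data = [
    ("full", 1, 1), ("full", 1, 1), ("full", 0, 1), ("full", 0, 0),
    ("part", 1, 1), ("part", 1, 0), ("part", 0, 0), ("part", 0, 0),
]
ate = t_learner_ate(data)
print(ate)
```

Replacing the stratum means with any regression model gives the usual form of a T-learner; other meta-learners differ in how they combine the outcome models.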
