Example Notebooks

This directory contains various example notebooks using ktrain. The directory currently has four folders:

Text Data

Text Classification

IMDb: Binary Classification

IMDb is a dataset containing 50K movie reviews labeled as positive or negative. The corpus is split evenly between training and validation. The dataset is in the form of folders of text files.

Chinese Sentiment Analysis: Binary Classification

This dataset consists of roughly 6000 hotel reviews in Chinese. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.

Arabic Sentiment Analysis: Binary Classification

This dataset contains hotel reviews in Arabic. The objective is to predict the positive or negative sentiment of each review. This notebook shows an example of using ktrain with non-English text.

  • ArabicHotelReviews-nbsvm.ipynb: Training a simple and fast NBSVM model on this dataset with bigram/trigram features can achieve a validation accuracy of 94% with only seconds of training.

  • ArabicHotelReviews-BERT.ipynb: BERT text classification to predict sentiment of Arabic-language hotel reviews.

20 News Groups: Multiclass Classification

This is a small sample of the 20newsgroups dataset restricted to 4 newsgroups, similar to what was done in the Working with Text Data scikit-learn tutorial. The data are in the form of arrays fetched via the scikit-learn library. These examples show the results of training on a relatively small training set.

Toxic Comments: Multi-Label Text Classification

In multi-label classification, a single document can belong to multiple classes. The objective here is to categorize each text comment into one or more categories of toxic online behavior. The dataset is in the form of a CSV file.
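
To make the multi-label setup concrete, here is a minimal sketch in plain Python (not ktrain's own loader) that reads a CSV with one text column and several 0/1 label columns and builds one binary target vector per comment. The column names and rows are illustrative assumptions, not taken from the actual dataset.

```python
import csv
import io

# Hypothetical CSV contents: one text column plus one 0/1 column per label.
SAMPLE_CSV = """comment_text,toxic,obscene,insult
"You are a wonderful person.",0,0,0
"Illustrative rude comment.",1,0,1
"Illustrative rude and obscene comment.",1,1,1
"""

LABEL_COLUMNS = ["toxic", "obscene", "insult"]  # assumed label names


def load_multilabel(csv_file, text_column, label_columns):
    """Return (texts, targets); each target is a list of 0/1 flags."""
    texts, targets = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_column])
        targets.append([int(row[col]) for col in label_columns])
    return texts, targets


texts, targets = load_multilabel(io.StringIO(SAMPLE_CSV), "comment_text", LABEL_COLUMNS)
for text, target in zip(texts, targets):
    # A comment may carry zero, one, or several labels at once.
    active = [name for name, flag in zip(LABEL_COLUMNS, target) if flag]
    print(text, "->", active or ["(none)"])
```

Unlike multiclass classification, the flags in each target vector are independent, so a row may contain several 1s or none at all.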

Text Regression

Wine Prices: Text Regression

This dataset consists of prices of various wines along with textual descriptions of the wine. We will attempt to predict the price of a wine based purely on its text description.

Sequence Labeling

CoNLL2003 NER Task: Named Entity Recognition

The objective of the CoNLL2003 task is to classify sequences of words as belonging to one of several categories of concepts such as Persons or Locations. See the original paper for more information on the format.
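
For orientation, the sketch below parses a few lines in the CoNLL2003 column layout (one token per line followed by its part-of-speech, chunk, and entity tags, with blank lines separating sentences) using only the standard library. The sample sentences and the B-/I- tag prefixes are made up for illustration, and this is not how ktrain itself loads the data.

```python
# Invented sample in the four-column layout: token, POS tag, chunk tag, entity tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

Paris NNP B-NP B-LOC
is VBZ B-VP O
lovely JJ B-ADJP O

Ada NNP B-NP B-PER
Lovelace NNP I-NP I-PER
wrote VBD B-VP O
"""


def read_conll(lines):
    """Group (token, entity_tag) pairs into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))  # first = token, last = entity tag
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE.splitlines())
for sentence in sentences:
    print(sentence)
```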

CoNLL2002 NER Task (Dutch): Named Entity Recognition for Dutch

  • CoNLL2002_Dutch-BiLSTM.ipynb: A Bidirectional LSTM model that uses pretrained BERT embeddings along with pretrained fasttext word embeddings - both for Dutch.

Sentence Pair Classification

  • MRPC-BERT.ipynb: Using BERT for sentence pair classification on MRPC dataset

Topic Modeling

20 News Groups: unsupervised learning on 20newsgroups corpus

One-Class Text Classification

20 News Groups: select a set of positive examples from 20newsgroups dataset

Text Recommender System

20 News Groups: recommend posts from 20newsgroups

Shallow NLP

  • Shallow NLP is a submodule in ktrain that provides a small collection of miscellaneous text utilities to analyze text on machines with no GPU and minimal computational resources. Includes:
    • text classification: a non-neural version of NBSVM that can be trained in a single line of code
    • named-entity-recognition (NER): out-of-the-box, ready-to-use NER for English, Russian, and Chinese that just works with no training
    • multilingual text searches: searching text in multiple languages

Text Summarization with pretrained BART: text_summarization_with_bart.ipynb

Open-Domain Question-Answering: question_answering_with_bert.ipynb

Universal Information Extraction: qa_information_extraction.ipynb

Keyphrase Extraction: keyword_extraction_example.ipynb

Indonesian NLP examples by Sandy Khosasi including Indonesian question-answering, emotion recognition, and document similarity

Vision Data

Image Classification

Dogs vs. Cats: Binary Classification

MNIST: Multiclass Classification

CIFAR10: Multiclass Classification

Pets: Multiclass Classification

  • pets-ResNet50.ipynb: Categorizing dogs and cats by breed using a pretrained ResNet50. Uses the images_from_fname function, as class labels are embedded in the file names of images.
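
The idea of reading class labels out of file names can be sketched with a regular expression in plain Python. The file names and the pattern below are assumptions chosen for illustration (breed name, underscore, index, `.jpg`); consult the notebook for the pattern actually passed to images_from_fname.

```python
import re

# Assumed naming scheme: <breed>_<index>.jpg
PATTERN = re.compile(r"([^/]+)_\d+\.jpg$")

# Hypothetical file paths following that scheme.
filenames = [
    "images/Abyssinian_12.jpg",
    "images/american_bulldog_3.jpg",
    "images/Bengal_107.jpg",
]


def label_from_fname(path):
    """Return the class label embedded in a file name, or None if no match."""
    match = PATTERN.search(path)
    return match.group(1) if match else None


labels = [label_from_fname(f) for f in filenames]
print(labels)
```

The capture group stops at the last underscore before the index, so multi-word breed names that themselves contain underscores are kept whole.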

Planet: Multilabel Classification

The Kaggle Planet dataset consists of satellite images, each of which is assigned multiple categories. Image labels are in the form of a CSV containing paths to images.

Image Regression

Age Prediction: Image Regression

Image Captioning

Object Detection

Graph Data

Graph Node Classification Datasets

PubMed-Diabetes: Node Classification

In the PubMed graph, each node represents a paper pertaining to one of three topics: Diabetes Mellitus - Experimental, Diabetes Mellitus - Type 1, and Diabetes Mellitus - Type 2. Links represent citations between papers. The attributes or features assigned to each node are in the form of a vector of words in each paper and their corresponding TF-IDF scores.
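
As a rough illustration of such node features, the sketch below computes TF-IDF vectors for three toy documents in plain Python. The documents are invented, and the weighting used here (term frequency times log(N / document frequency)) is only one common variant; the dataset may have been built with a different one.

```python
import math
from collections import Counter

# Toy "papers"; in the real graph each node would be one paper.
docs = [
    "insulin therapy for type one diabetes".split(),
    "insulin resistance in type two diabetes".split(),
    "experimental diabetes in mice".split(),
]

vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)
# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf_vector(doc):
    """Return one TF-IDF score per vocabulary word for a single document."""
    counts = Counter(doc)
    return [
        (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]


features = [tfidf_vector(doc) for doc in docs]
for word, score in zip(vocab, features[0]):
    print(f"{word:12s} {score:.3f}")
```

A word that occurs in every document (here "diabetes") gets a score of zero under this variant, while words specific to one document receive the highest weights.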

Cora Citation Graph: Node Classification

In the Cora citation graph, each node represents a paper pertaining to one of several topic categories. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.
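
A multi-hot-encoded vector has one position per vocabulary word, set to 1 if the word occurs in the paper and 0 otherwise. A minimal sketch in plain Python, with an invented vocabulary and paper:

```python
# Hypothetical fixed vocabulary shared by all nodes (papers).
vocab = ["bayesian", "genetic", "learning", "network", "neural", "rule"]


def multi_hot(words, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in `words`."""
    present = set(words)
    return [1 if word in present else 0 for word in vocab]


# Invented paper text; "neural" appears twice but is still encoded as a single 1.
paper_words = "a neural network for learning with a neural prior".split()
vector = multi_hot(paper_words, vocab)
print(vector)
```

Word counts and word order are discarded; only presence or absence of each vocabulary word is kept.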

Hateful Twitter Users: Node Classification

Dataset of Twitter users and their attributes. A small portion of the user accounts are annotated as hateful or normal. The goal is to predict hateful accounts based on user features and graph structure.

Graph Link Prediction Datasets

Cora Citation Graph: Link Prediction

In the Cora citation graph, each node represents a paper. Links represent citations between papers. The attributes or features assigned to each node are in the form of a multi-hot-encoded vector of words in each paper.

Tabular Data

Tabular Classification Datasets

Titanic Survival Prediction: Tabular Classification

This is the well-studied Titanic dataset from Kaggle. The goal is to predict which passengers survived the Titanic disaster based on their attributes.

Income Prediction from Census Data: Tabular Classification

This is the same dataset used in the AutoGluon classification example. The goal is to predict which individuals make over $50K per year.

Tabular Regression Datasets

Adults Census Dataset: Tabular Regression

The original goal of this Census dataset is to predict which individuals make over $50K. We change the task to a regression problem and predict the Age attribute for each individual. This is the same dataset used in the AutoGluon regression example.

House Price Prediction: Tabular Regression

Tabular Causal Inference

Adults Census Dataset: Tabular Causal Inference

The original goal of this Census dataset is to predict which individuals make over $50K. Here, we will use causal inference to estimate the causal impact of having a PhD on earning over $50K.