Changelog

The changelog records what content was changed (e.g. rewording an existing paragraph, re-running a notebook with an updated version of a package, or introducing new content to an existing notebook) and what was added (e.g. a completely new Jupyter notebook).

[2024-03]

Added

Multilingual Sentence Embedding with LLM and PEFT LoRA (PyTorch Lightning) [nbviewer][html]

[2023-11]

Changed

Introduction to CLIP (Contrastive Language-Image Pre-training), LiT, ViT [nbviewer][html]
- Massive overhaul of the content using the latest version of PyTorch 2.
- Switched to using huggingface transformer ViT image encoder, instead of timm's ResNet.
- Added quantitative evaluation with retrieval recall@k.
- Added additional introduction to LiT, ViT.

[2023-10]

Added

BERT CTR. [nbviewer][html]

[2023-09]

Added

Deep Learning - Learning to Rank 101 (RankNet, ListNet). [nbviewer][html]

[2023-08]

Changed

Finetuning Pre-trained BERT Model on Text Classification Task And Inferencing with ONNX Runtime. [nbviewer][html]
- Overhauled the content using the latest version of PyTorch 2.
- Switched to huggingface trainer for model training/evaluation instead of a custom implementation.
- Benchmarked ONNX versus PyTorch on both GPU and CPU.
Deep Learning for Tabular Data - PyTorch. [nbviewer][html]
- Overhauled the content using the latest version of PyTorch 2.
- Switched to huggingface trainer for model training/evaluation instead of a custom implementation.
- Removed outdated content around ONNX that is not relevant to this particular topic.

[2023-07]

Added

Self Supervised (SIMCLR) versus Supervised Contrastive Learning. [nbviewer][html]

[2023-04]

Changed

Training Bi-Encoder Models with Contrastive Learning Notes. [nbviewer][html]
- Updated to include a section on data augmentation, along with various wording clarifications.
Response Knowledge Distillation for Training Student Model. [nbviewer][html]
- Re-ran with huggingface dataset logging disabled to prevent log messages from flooding the main content.
Sentence Transformer: Training Bi-Encoder via Contrastive Loss. [nbviewer][html]
- Re-ran with huggingface dataset logging disabled to prevent log messages from flooding the main content.

[2023-03]

Added

Uploading and downloading files from s3. [nbviewer][html]

[2023-02]

Added

Training Bi-Encoder Models with Contrastive Learning Notes. [nbviewer][html]
Introduction to CLIP (Contrastive Language-Image Pre-training) [nbviewer][html]

[2023-01]

Changed

Machine Translation with Huggingface Transformers mT5. [nbviewer][html]
- This is a complete overhaul of the original Machine Translation with Huggingface Transformers article, which was a bit obsolete.
Response Knowledge Distillation for Training Student Model. [nbviewer][html]
- Added a final notes section on distilbert, well read students read well.

[2022-12]

Added

Fine Tuning Pre-trained Encoder on Question Answer Task. [nbviewer][html]

[2022-11]

Added

Sentence Transformer: Training Bi-Encoder via Contrastive Loss. [nbviewer][html]

[2022-10]

Added

Quick Introduction to Graph Neural Network Node Classification Task (DGL, GraphSAGE). [nbviewer][html]

[2022-09]

Added

Response Knowledge Distillation for Training Student Model. [nbviewer][html]

[2022-07]

Added

HyperParameter Tuning with Ray Tune and Hyperband. [nbviewer][html]

[2022-06]

Added

Quick introduction to difference in difference. [nbviewer][html]

Changed

Quick Intro to Gradient Boosted Tree Inferencing. [nbviewer][html]
- Added content around ONNX.

[2022-04]

Added

Quick introduction to generalized second price auction. [nbviewer][html]

[2021-10]

Added

Operation Research Quick Intro Via Ortools. [nbviewer][html]

[2021-09]

Added

Probability Calibration for deep learning classification models with Temperature Scaling. [nbviewer][html]

[2021-06]

Added

Finetuning Pre-trained BERT Model on Text Classification Task And Inferencing with ONNX Runtime. [nbviewer][html]

[2021-05]

Added

Machine Translation with Huggingface Transformers. [nbviewer][html]

[2021-02]

Added

Quick Intro to Gradient Boosted Tree Inferencing. [nbviewer][html]

[2021-01]

Added

Transformer, Attention is All you Need - PyTorch, Huggingface Datasets. [nbviewer][html]

[2020-11]

Added

Inverse Propensity Weighting. [nbviewer][html]

[2020-10]

Added

Deep Learning for Tabular Data - PyTorch, PyTorch Lightning, ONNX Runtime. [nbviewer][html]

Changed

Removed mlutils: Machine learning utility function package. Much of its content was not well maintained and, as a result, was already outdated.
LightGBM API walkthrough and a discussion about categorical features in tree-based models. [nbviewer][html]
- Upgrade LightGBM to 3.0.0, and deprecate out-dated content.
Xgboost API walkthrough (includes hyperparameter tuning via scikit-learn like API). [nbviewer][html]
- Upgrade XGBoost to 1.2.1, and deprecate out-dated content.

[2020-09]

Changed

Probability Calibration for classification models. [nbviewer][html]
- Massive overhaul of the content, e.g. introducing two additional calibration methods, histogram binning and Platt Scaling Binning, and bundling all helper utility functions into a package structure for ease of reuse.
Multi-Label Text Classification with Fasttext and Huggingface Tokenizers. [nbviewer][html]
- Update Huggingface Tokenizers to 0.8.1 API.

[2020-06]

Added

Approximate Nearest Neighborhood Search with Navigable Small World. [nbviewer][html]

[2020-05]

Added

Product Quantization for Model Compression. [nbviewer][html]
Maximum Inner Product for Speeding Up Generating Recommendations. [nbviewer][html]

[2020-04]

Added

Extremely Quick Guide to Unicode. [markdown]
MultiLabel Text Classification with Fasttext and Huggingface Tokenizers. [nbviewer][html]

Changed

FastAPI & Azure Kubernetes Cluster. End to end example of training a model and hosting it as a service. [folder]
- Added application load testing with Apache Jmeter.

[2020-03]

Changed

FastAPI & Azure Kubernetes Cluster. End to end example of training a model and hosting it as a service. [folder]
- Added more best practices when specifying a deployment.

[2020-02]

Added

FastAPI & Azure Kubernetes Cluster. End to end example of training a model and hosting it as a service. [folder]

Changed

Parallel programming with Python (threading, multiprocessing, concurrent.futures, joblib). [nbviewer][html]
- Added a short section on asynchronous programming.
Monotonic Constraint with Boosted Tree. [nbviewer][html]
- The original notebook uses xgboost to demonstrate the feature. Added lightgbm example.
Logging module. [nbviewer][html]
- Added a section that emphasizes the importance of logging the full stack trace of an exception.

[2020-01]

Added

Kaggle: Quora Insincere Questions Classification. Predicting insincere questions. [folder]

Changed

Seq2Seq for German to English Machine Translation - PyTorch. Includes quick intro to torchtext [nbviewer][html]
- Added more introduction to torchtext.

[2019-12]

Added

Byte Pair Encoding (BPE) from scratch and quick walkthrough of sentencepiece. [nbviewer][html]
Sentencepiece Subword tokenization for Text Classification. [nbviewer][html]

Changed

Gaussian Mixture Model from scratch; AIC and BIC for choosing the number of Gaussians. [nbviewer][html]
- Fix erroneous log likelihood calculation.
- Update deprecated function for plotting contour plots.

[2019-11]

Added

Leveraging Pre-trained Word Embedding for Text Classification. [nbviewer][html]
Monotonic Constraint with Boosted Tree. [nbviewer][html]
Probability Calibration for classification models. [nbviewer][html]

[2019-10]

Added

Seq2Seq with Attention for German to English Machine Translation - PyTorch. [nbviewer][html]

[2019-09]

Added

Seq2Seq with PyTorch for German to English Machine Translation. [nbviewer][html]

[2019-08]

Added

Kaggle: Rossmann Store Sales. Predicting daily store sales. Also introduces deep learning for tabular data. [folder]

Changed

Optimizing Pandas (e.g. reduce memory usage using category type). [nbviewer][html]
- Added helper function to automatically determine optimal data type.
Framing time series problem as supervised-learning. [nbviewer][html]
- Added window-based features.

[2019-06]

Added

Word2vec for Text Classification. [nbviewer][html]

Changed

Word2vec (skipgram + negative sampling) using Gensim. [nbviewer][html]
- Update to the more efficient file-based training.

[2019-04]

Added

Propensity Score Matching. [nbviewer][html]

[2019-03]

Added

Short Walkthrough of PageRank. [nbviewer][html]

[2019-02]

Added

Quick Example of Factory Design Pattern. [nbviewer][html]
Introduction to Multi-armed Bandits. [nbviewer][html]

[2019-01]

Added

Quantile Regression and its application in A/B testing.
- Quick Introduction to Quantile Regression. [nbviewer][html]
- Quantile Regression's application in A/B testing. [nbviewer][html]

[2018-12]

Added

First Foray Into Discrete/Fast Fourier Transformation. [nbviewer][html]

[2018-11]

Added

Introduction to BM25 (Best Match). [nbviewer][html]

[2018-10]

Added

Kullback-Leibler (KL) Divergence. [nbviewer][html]
Calibrated Recommendation for reducing bias/increasing diversity in recommendation. [nbviewer][html]
Influence Maximization from scratch. Includes discussion on Independent Cascade (IC), Submodular Optimization algorithms including Greedy and Lazy Greedy, a.k.a Cost Efficient Lazy Forward (CELF) [nbviewer][html]

[2018-09]

Added

Introduction to Residual Networks (ResNets) and Class Activation Maps (CAM). [nbviewer][html]

Changed

Hosted HTML versions of all Jupyter notebooks on GitHub Pages.

[2018-08]

Added

(Text) Content-Based Recommenders. Introducing Approximate Nearest Neighborhood (ANN) - Locality Sensitive Hashing (LSH) for cosine distance from scratch. [nbviewer]
Benchmarking ANN implementations (nmslib). [nbviewer]

[2018-07]

Added

Getting started with time series analysis with Exponential Smoothing (Holt-Winters). [nbviewer]
Framing time series problem as supervised-learning. [nbviewer]
Tuning Spark Partitions. [nbviewer]

[2018-06]

Added

Evaluation metrics for imbalanced dataset. [nbviewer]

Changed

H2O API walkthrough (using GBM as an example). [nbviewer]
- Moved H2O notebook to its own sub-folder.
- Added model interpretation using partial dependence plot.

[2018-05]

Added

RNN, LSTM - PyTorch hello world. [nbviewer]
Recurrent Neural Network (RNN) - language modeling basics. [nbviewer]

[2018-04]

Added

Long Short Term Memory (LSTM) - Tensorflow. [nbviewer]
Vanilla RNN - Tensorflow. [nbviewer]
WARP (Weighted Approximate-Rank Pairwise) Loss using lightfm. [nbviewer]

[2018-03]

Added

Local Hadoop cluster installation on Mac. [markdown]
Spark MLlib Binary Classification (using GBM as an example). [raw zeppelin notebook][Zepl]

[2018-02]

Added

H2O API walkthrough (using GBM as an example). [nbviewer]
Factorization Machine from scratch. [nbviewer]

Changed

The spark folder has been renamed to big_data to incorporate other big data tools.

[2018-01]

Added

Partial Dependence Plot (PDP), model-agnostic approach for directional feature influence. [nbviewer]
Parallel programming with Python (threading, multiprocessing, concurrent.futures, joblib). [nbviewer]

[2017-12]

Added

LightGBM API walkthrough and a discussion about categorical features in tree-based models. [nbviewer]
Curated tips and tricks for technical and soft skills. [nbviewer]
Detecting collinearity amongst features (Variance Inflation Factor for numeric features and Cramer's V statistics for categorical features), also introduces Linear Regression from a Maximum Likelihood perspective and the R-squared evaluation metric. [nbviewer]

Changed

Random Forest from scratch and Extra Trees. [nbviewer]
- Refactored code for visualizing the tree's feature importance.
Building intuition on Ridge and Lasso regularization using scikit-learn. [nbviewer]
- Included a section on what happens when there are collinear features in the dataset.
mlutils: Machine learning utility function package [folder]
- Refer to its changelog for details.
data_science_is_software. [nbviewer]
- Mention notebook extension, a project containing various features that make Jupyter notebook even more pleasant to work with.

[2017-11]

Added

Introduction to Singular Value Decomposition (SVD), also known as Latent Semantic Analysis/Indexing (LSA/LSI). [nbviewer]

[2017-10]

Added

mlutils: Machine learning utility function package [folder]

Changed

Bernoulli and Multinomial Naive Bayes from scratch. [nbviewer]
- Fixed various typos and added a more efficient implementation of Multinomial Naive Bayes.
TF-IDF (term frequency - inverse document frequency) from scratch. [nbviewer]
- Moved to its own tfidf folder.
- Included the full tfidf implementation from scratch.

Changed

Using built-in data structure and algorithm. [nbviewer]
- Merged the content from two notebooks, "namedtuple and defaultdict" and "sorting with itemgetter and attrgetter", into this one and improved the section on priority queue.

[2017-08]

Added

Understanding iterables, iterator and generators. [nbviewer]
Word2vec (skipgram + negative sampling) using Gensim (includes text preprocessing with spaCy). [nbviewer]
Frequentist A/B testing (includes a quick review of concepts such as p-value, confidence interval). [nbviewer]
AUC (Area under the ROC, precision/recall curve) from scratch (includes building a custom scikit-learn transformer). [nbviewer]

Changed

Optimizing Pandas (e.g. reduce memory usage using category type). [nbviewer]
- This is a revamp of the old content on pandas's category type.

[2017-07]

Added

cohort: Cohort analysis. Visualizes user retention by cohort with seaborn's heatmap and illustrates pandas's unstack. [nbviewer]

Changed

Bayesian Personalized Ranking (BPR) from scratch & AUC evaluation. [nbviewer]
- A more efficient matrix operation using Hadamard product.
Cython and Numba quickstart for high performance python. [nbviewer]
- Added Numba parallel prange.
ALS-WR for implicit feedback data from scratch & mean average precision at k (mapk) and normalized discounted cumulative gain (ndcg) evaluation. [nbviewer]
- Included normalized discounted cumulative gain (ndcg) evaluation.
Gradient Boosting Machine (GBM) from scratch. [nbviewer]
- Added a worked example with made-up numbers to show how GBM works.
data_science_is_software. [nbviewer]
- Mention nbdime, a tool that makes checking changes in jupyter notebook on github a lot easier.
- Mention semantic versioning (what each number in the package version usually represents).
- Mention configparser, a handy library for storing and loading configuration files.
K-fold cross validation, grid/random search from scratch. [nbviewer]
- Minor change in Kfolds educational implementation (original was passing redundant arguments to a method).
- Minor change in random search educational implementation (did not realize scipy's .rvs method for generating random numbers returns a single element array instead of a number when you pass in size = 1).
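
For reference, the scipy behavior mentioned in the random search note above can be sketched as follows. This is a minimal illustration, not the notebook's code: the `randint` distribution and the variable names are assumptions picked for the example.

```python
import numpy as np
from scipy import stats

# A frozen distribution, chosen only for illustration; random search
# would typically sample hyperparameter values from something like this.
dist = stats.randint(1, 10)

# With size=1, .rvs returns a one-element NumPy array, not a plain number.
sample_array = dist.rvs(size=1, random_state=0)

# Index into the array (or call .item()) to get the scalar itself.
sample_scalar = sample_array[0]

print(type(sample_array), sample_array.shape)
print(type(sample_scalar))
```

Code that expects a single number therefore needs to unpack the array before using the sampled value.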

[2017-06]

This is the first time the changelog file has been added, so every existing notebook falls under the added category. Going forward, the log will be grouped by month (one or two at a time). Note that this repo is geared towards Python 3. Hence, even though the repo contains some R-related content, it is not that well maintained and will most likely be translated to Python 3. As always, any feedback is welcome.

Added

Others (Genetic Algorithm)
Regression (Linear, Ridge/Lasso)
Market Basket Analysis (Apriori)
Clustering (K-means++, Gaussian Mixture Model)
Deep Learning (Feedforward, Convolutional Neural Nets)
Model Selection (Cross Validation, Grid/Random Search)
Dimensionality Reduction (Principal Component Analysis)
Classification (Logistic, Bernoulli and Multinomial Naive Bayes)
Text Analysis (TF-IDF, Chi-square feature selection, Latent Dirichlet Allocation)
Tree Models (Decision Tree, Random Forest, Extra Trees, Gradient Boosting Machine)
Recommendation System (Alternating Least Squares with Weighted Regularization, Bayesian Personalized Ranking)
Python Programming (e.g. logging, unittest, decorators, pandas category type)
