Introduction

Applying Machine Learning (ML) to tabular data is one of the most common tasks in the ML community, because many fields (e.g., finance, medicine) store their information in tabular form. The process of building an ML model for tabular data is often messy and poorly organized. This may intimidate newcomers, as the entire workflow can seem vague and unintuitive. Nevertheless, there are just two things that must be understood in order to build a good ML model for tabular data:

How to perform a specific step?
Which step should be performed first?

The first question concerns the techniques used to carry out a specific step (for example, how to handle skewed features with a log transformation), while the second concerns the entire workflow of building an ML model for a specific dataset. The second point is crucial, because performing steps in the wrong order can cause serious issues (e.g., information leakage or degraded performance). However, existing learning resources often show a different workflow for each task or dataset, even though they mostly repeat a similar pattern. Therefore, in this project, I attempt to establish a single workflow in a Jupyter notebook that can be used to solve various predictive tasks, including classification and regression. I also demonstrate that the notebook achieves decent metric scores across different datasets.
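As a small illustration of the "how to perform a specific step" question, here is a minimal sketch of a log transformation applied to a skewed feature. The `prices` values and the `skewness` helper are made up for this example and are not taken from the notebooks; `log1p` is used instead of a plain logarithm on the assumption that the feature may contain zeros.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical right-skewed feature: most values are small, a few are very large.
prices = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p computes log(1 + x), so it stays defined when a value is 0.
log_prices = [math.log1p(p) for p in prices]

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")
```

The transformed feature is still right-skewed, but noticeably less so, which is usually the goal of this step.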

Knowledge Requirement

This template assumes you have knowledge of:

basic Python
basic statistics

Tasks and Dataset

Binary Classification
- Insomnia
  - dataset
  - colab
- Titanic
  - dataset
  - colab
Multi-class Classification
- Body Performance
  - dataset
  - colab
- Ghouls, Goblins, and Ghosts... Boo!
  - dataset
  - colab
Single-output Regression
- Concrete Compressive Strength
  - dataset
  - colab
- Car Price
  - dataset
  - colab

Machine Learning Workflow

This is the established workflow, which remains the same across problems and datasets:

step 1 : install all dependencies
step 2 : import all libraries
step 3 : store any utility function here
step 4 : load the dataset
step 5 : take a peek at a subset of the data
step 6 : drop useless features
step 7 : show data description
step 8 : categorical features encoding
step 9 : split train data
step 10 : feature visualization
step 11 : data modelling
step 12 : visualize feature importance
step 13 : prepare for submission
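The steps above can be sketched in a few lines. This is only an illustrative outline, not the notebook itself: the tiny table, its column names, and the choice of pandas and scikit-learn with a random forest are all assumptions made for the example. Plotting (step 10) and submission (step 13) are left out to keep it short.

```python
# step 2: import all libraries (assumes pandas and scikit-learn are installed)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# step 4: load the dataset (a made-up table stands in for a real CSV file)
df = pd.DataFrame({
    "id": range(12),
    "age": [22, 38, 26, 35, 54, 2, 27, 14, 58, 20, 39, 31],
    "sex": ["m", "f", "f", "f", "m", "m", "f", "f", "f", "m", "m", "f"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})

# step 5: take a peek at a subset of the data
print(df.head())

# step 6: drop useless features (a row identifier carries no signal)
df = df.drop(columns=["id"])

# step 7: show data description
print(df.describe(include="all"))

# step 8: categorical feature encoding (one-hot, dropping the first level)
df = pd.get_dummies(df, columns=["sex"], drop_first=True)

# step 9: split train data into training and validation parts
X = df.drop(columns=["survived"])
y = df["survived"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# step 11: data modelling
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
valid_accuracy = accuracy_score(y_valid, model.predict(X_valid))
print(f"validation accuracy: {valid_accuracy:.2f}")

# step 12: feature importance (printed here rather than plotted)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

For a regression task, the same outline would apply with a regressor and a regression metric swapped in at step 11.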

The boring questions

A list of questions that are often asked and debated in ML forums and community groups:

Why do we perform categorical feature encoding before data splitting?

Why do we perform feature transformation after data splitting?

Why do we deal with outliers first before imputing missing values?

Why do we perform oversampling and undersampling after feature transformation and feature encoding?
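For the question about performing feature transformation after data splitting, one commonly cited reason is information leakage: if the statistics of a transformation are computed on the full dataset, the held-out rows influence what the model sees during training. The sketch below illustrates the fit-on-train-only pattern with plain standardization. The data is randomly generated and the `standardize` helper is hypothetical; it is not code from the notebooks.

```python
import random
import statistics

# Made-up numeric feature, split 80/20 into train and test parts.
random.seed(0)
values = [random.gauss(50, 10) for _ in range(100)]
train, test = values[:80], values[80:]

# Fit the transformation on the training part only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)


def standardize(xs, mean, std):
    """Scale values using statistics that were computed elsewhere."""
    return [(x - mean) / std for x in xs]


# Apply the same training statistics to both parts.
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

# The leaky alternative would compute the mean over train and test together,
# so the test rows would shift the statistics used on the training rows.
leaky_mean = statistics.fmean(values)
print(f"train-only mean: {train_mean:.3f}, full-data mean: {leaky_mean:.3f}")
```

With scikit-learn, the same pattern is to call `fit` on the training split and `transform` on both splits.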

Authors Info

----------------------------------------
Author  : Alvin Setiadi
Email   : alvinsetiadi22@gmail.com
Website : alvinwatner.github.io/about
License : MIT
----------------------------------------

License

This project is licensed under the MIT License. See the LICENSE.md file for details.
