# **Machine Learning**
---

In machine learning, “learning” is the process by which a model finds mathematical patterns in data by reducing the error between its predictions and the observed values.

Types of Machine Learning
---

**Supervised Learning**

Models learn from data that already has labels (the correct answers).

Example:

| Feature (Input)    | Label (Output)     |
| ------------------ | ------------------ |
| House area: 100 m² | Price: 800 million |
| House area: 150 m² | Price: 1 billion   |
| House area: 200 m² | Price: 1.3 billion |

The model learns the relationship between the features and the labels, so it can predict the label for a new input.
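
As a minimal sketch of this idea, the three rows of the table above can be fitted with a straight line using ordinary least squares. The choice of a linear model and the names `fit_line` and `predict_price` are assumptions made for illustration; prices are expressed in millions.

```python
# Training data from the table above: house area (m^2) -> price (millions).
areas = [100.0, 150.0, 200.0]
prices = [800.0, 1000.0, 1300.0]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(areas, prices)


def predict_price(area):
    """Predict the label (price) for a new feature value (area)."""
    return slope * area + intercept


print(f"price = {slope:.2f} * area + {intercept:.2f}")
print(f"predicted price for 120 m^2: {predict_price(120):.1f} million")
```

With only three points this is purely illustrative; a real model would need far more data and a held-out set to check the prediction.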


**Types of Supervised Learning**
* Regression

The output is a continuous numeric value. Examples include house price prediction and blood sugar level prediction.

Some commonly used algorithms are linear regression, decision trees, random forests, and SVR.
* Classification

The output is in the form of categories. For example, whether an email is spam or not, or whether an image is a dog or a cat.

Commonly used algorithms are logistic regression, k-nearest neighbors (KNN), and naive Bayes.
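
To make classification concrete, here is a small sketch of a k-nearest neighbors classifier for the spam example. The two features (number of links, number of exclamation marks) and all data points are hypothetical, chosen only to illustrate the voting step.

```python
from collections import Counter

# Hypothetical training set: (links, exclamation marks) -> label.
training = [
    ((0, 1), "not spam"),
    ((1, 0), "not spam"),
    ((1, 2), "not spam"),
    ((6, 8), "spam"),
    ((7, 5), "spam"),
    ((8, 9), "spam"),
]


def knn_predict(point, data, k=3):
    """Return the majority label among the k training points nearest to `point`."""
    by_distance = sorted(
        data,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict((7, 7), training))  # spam
print(knn_predict((1, 1), training))  # not spam
```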

**Unsupervised Learning**

The data is unlabeled, so the model must discover patterns on its own.

**Types of Unsupervised Learning**

* Clustering

Grouping similar data points, for example segmentation, or grouping music types by their sound patterns.

Popular algorithms are K-means, hierarchical clustering, and DBSCAN.

* Dimensionality reduction

Reducing the number of features while retaining important information.

Commonly used algorithms include PCA, t-SNE, and autoencoders.
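
The clustering idea can be sketched with a one-dimensional K-means. The data (song tempos in beats per minute) and the starting centroids are hypothetical; real implementations work in many dimensions and choose starting points more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """K-means (Lloyd's algorithm) on 1-D data with given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


tempos = [60, 62, 65, 120, 125, 128]  # hypothetical, unlabeled
centers, groups = kmeans_1d(tempos, centroids=[60, 128])
print(groups)  # [[60, 62, 65], [120, 125, 128]]
```

No labels are used anywhere: the two groups emerge only from distances between the values.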

**Reinforcement Learning**

The model learns by taking actions in an environment and receiving rewards or penalties.

Main components

* Agent: the one who learns or makes decisions
* Environment: the world where the agent acts
* Actions (A): what the agent does
* State (S): the current environmental condition
* Reward (R): the value obtained after the action

Commonly used algorithms include Q-learning and deep Q-networks (DQN).
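
The components above can be seen together in a small tabular Q-learning sketch. The environment (a corridor of five states with a reward at the right end) and the values of `ALPHA`, `GAMMA`, and `EPSILON` are assumptions chosen for illustration.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (terminal)
ACTIONS = (-1, +1)    # action 0 = move left, action 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Environment: move along the corridor; reward 1 when the goal is reached."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def choose_action(q_row, rng):
    """Epsilon-greedy: explore at random (also on ties), otherwise act greedily."""
    if rng.random() < EPSILON or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(q[state], rng)
            next_state, reward = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

Here the agent is the Q-table plus `choose_action`, the environment is `step`, and the reward signal alone is enough for the agent to learn to walk right.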

Bias-Variance Trade-off
---

| Risk         | Meaning                                                            | Tendency     |
| ------------ | ------------------------------------------------------------------ | ------------ |
| **Bias**     | The model is too *simple* to capture the data pattern.             | Underfitting |
| **Variance** | The model is too *complex* and too sensitive to the training data. | Overfitting  |

The goal of the trade-off is to balance the two, so that the model is complex enough to learn the important patterns but not so complex that it fits the noise.

*`Total Error = Bias^2 + Variance + Irreducible Error`*

* Bias^2 = the squared distance between the model's average prediction (averaged over different training sets) and the true value.
* Variance = how much the model's results change if the training data is slightly changed.
* Irreducible Error = inherent noise in the data.
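
The terms of the decomposition can be estimated by simulation. The sketch below is only an illustration under stated assumptions: the true function `x**2`, the noise level, and the two toy models (a constant mean predictor and a 1-nearest-neighbor predictor) are all hypothetical choices. It retrains both models on many noisy training sets and measures bias^2 and variance of the prediction at one point.

```python
import random
import statistics


def true_f(x):
    return x ** 2


def simulate(n_trials=2000, noise_sd=0.3, seed=0):
    """Estimate (bias^2, variance) at x0 = 1.0 for two toy models."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed inputs 0.0, 0.1, ..., 1.0
    x0 = 1.0
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    preds = {"constant": [], "1-NN": []}
    for _ in range(n_trials):
        # A fresh training set: same inputs, new noise.
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds["constant"].append(sum(ys) / len(ys))  # too simple: ignores x
        preds["1-NN"].append(ys[nearest])            # too flexible: copies one noisy label
    results = {}
    for name, values in preds.items():
        bias_sq = (statistics.mean(values) - true_f(x0)) ** 2
        variance = statistics.pvariance(values)
        results[name] = (bias_sq, variance)
    return results


for name, (bias_sq, variance) in simulate().items():
    print(f"{name:>8}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The constant model shows high bias and low variance, while the 1-NN model shows the opposite, which matches the table above.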

Example (decision tree; the depth values are illustrative):

| Tree Depth | Bias     | Variance | Condition |
| ---------- | -------- | -------- | --------- |
| 2          | High     | Low      | Underfit  |
| 30         | Low      | High     | Overfit   |
| 8–12       | Balanced | Balanced | Optimal   |

Controlling Bias and Variance

| Approach                                           | Reduce Bias | Reduce Variance       |
| ---------------------------------------------------- | --------------- | ------------------------- |
| Add features | Yes | No |
| Reduce features (feature selection) | No | Yes |
| Increase model complexity (larger neural network) | Yes | No |
| Regularization (L1/L2, dropout) | No | Yes |
| Add training data | Yes (sometimes) | Yes (stabilize the model) |
| Bagging (Random Forest) | No | Yes |
| Boosting (XGBoost) | Yes | No |
| Cross-validation | No | Indirectly (it measures generalization, so overfitting can be detected) |

**Overfitting and Underfitting**

| Condition       | Training Error | Testing Error     |
| ------------- | -------------- | ----------------- |
| Underfitting  | High         | High            |
| Overfitting   | Low         | High            |
| Good fit      | Low          | Low (close to the training error) |
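
The table can be turned into a rough diagnostic rule. This is only a sketch: the function name `diagnose` and the thresholds `high` and `gap` are hypothetical and would need to be chosen per problem and per metric.

```python
def diagnose(train_error, test_error, high=0.2, gap=0.1):
    """Rule of thumb based on the table; `high` and `gap` are hypothetical thresholds."""
    if train_error >= high:
        return "underfitting"   # the model cannot even fit the training data
    if test_error - train_error > gap:
        return "overfitting"    # fits training data but does not generalize
    return "good fit"


print(diagnose(0.35, 0.38))  # underfitting
print(diagnose(0.02, 0.30))  # overfitting
print(diagnose(0.05, 0.07))  # good fit
```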


Causes

| Underfitting | Overfitting |
| -------------------------------------------------------------- | --------------------------------------------------------- |
| Model too simple (Linear Regression for nonlinear data) | Model too complex (too many parameters/layers) |
| Features not informative | Too many features (no regularization) |
| Too little data | Too little data, but large model |
| Training too short | Training too long (memorizing data) |
| Regularization too strong | Regularization too weak |

How to fix underfitting

| Strategy | Explanation |
| ---------------------------- | --------------------------------------------------- |
| Increase model complexity | For example, move from a linear model to a random forest or neural network |
| Add relevant features | Improve feature engineering |
| Reduce regularization | For example, lower the λ value in Lasso/Ridge |
| Train longer | Try more epochs (neural network) |
| Try non-linear transformations | Polynomial, kernel, log transform |

How to fix overfitting

| Strategy | Explanation |
| ------------------------------------ | ------------------------------------------------------------------- |
| Add data | More examples → model learns general patterns |
| Regularization | L1 (Lasso), L2 (Ridge), Dropout, Early Stopping |
| Reduce model complexity | Reduce depth in decision trees, layers in NN |
| Feature selection | Remove irrelevant features |
| Cross-validation | Detect overfitting early |
| Data augmentation | Add variety to data (images, text, etc.) |
| Ensemble | Use Random Forest, Bagging, Averaging to reduce variance |
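
As one illustration of the regularization row, the sketch below shows how an L2 (Ridge) penalty shrinks the weight of a one-feature model without intercept. It minimizes `sum((y - w*x)**2) + lam * w**2`, whose closed-form solution is `w = sum(x*y) / (sum(x*x) + lam)`. The data and the `lam` values are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form L2-regularized slope for a one-feature model without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical data, roughly y = 2x

slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, w in slopes.items():
    print(f"lambda={lam:>5}: slope={w:.3f}")
```

A larger `lam` pulls the slope toward zero: this reduces variance, but if the penalty is too strong the model underfits, which is why regularization appears in both the "Causes" table and the fixes.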

Simple Machine Learning Pipeline
---

1. Problem Definition

* Objective: Define the task (classification, regression, clustering, etc.), the success metrics (precision, recall, F1 score, etc.), the constraints (latency, privacy, budget), and the scope.

* Clear output: for example, “predict house price (regression), MAPE < 10%, inference < 100ms.”

* Stakeholders, label sources, business impact.
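
A success criterion such as "MAPE < 10%" can be checked with a few lines of code. In this sketch the actual prices reuse the values from the supervised learning table (in millions), while the predicted values are hypothetical model outputs.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


actual = [800, 1000, 1300]       # prices in millions
predicted = [760, 1050, 1240]    # hypothetical model outputs

score = mape(actual, predicted)
print(f"MAPE = {score:.2f}%, meets target (< 10%): {score < 10}")
```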

2. Data Collection & Ingestion

* Objective: Collect relevant data from all sources.

* Sources: databases, CSV, APIs, sensors, scraping, open datasets.

* Record metadata: schema, timestamp, owner, refresh rate.

* Store the raw data immutably, together with logs. Use a structured format (Parquet/CSV/SQL).

3. Data Understanding / Exploratory Data Analysis (EDA)

* Objective: Understand distribution, missingness, noise, correlation, and quality issues.

* Summary statistics (mean, median, standard deviation), histogram, boxplot.

* Correlation (Pearson/Spearman), heatmap.

* Per-class visualization for classification.

* Find bias, class imbalance, potential drift.
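
A few of these checks can be sketched with the standard library only. All values below are hypothetical; in practice a data-frame library and plotting tools would normally be used.

```python
import statistics
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


areas = [100, 150, 200, 120, 180]        # hypothetical feature
prices = [800, 1000, 1300, 850, 1200]    # hypothetical target
labels = ["spam", "ham", "ham", "ham", "spam", "ham"]  # hypothetical classes

summary = {
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": statistics.stdev(prices),
}
r = pearson(areas, prices)
class_counts = Counter(labels)

print(summary)
print(f"Pearson r(area, price) = {r:.3f}")
print(f"class balance: {dict(class_counts)}")
```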

4. Data Cleaning & Preprocessing

* Goal: Improve quality so the model learns from signal, not noise.

* Handle missing data: drop/impute (mean/median/KNN/iterative).

* Outliers: Detect (IQR, z-score) → decide remove/clip/transform.

* Type consistency, normalize/standardize, categorical encoding (one-hot, ordinal, target encoding).

* Timestamp handling: Extract features (day, hour, lag features).
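
The cleaning steps above can be sketched as small functions. This is an illustration under assumptions: missing values are represented as `None`, outliers are clipped rather than removed, and the sample column is hypothetical.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def clip_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def one_hot(categories):
    """One-hot encode; columns follow the sorted category names."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


raw_areas = [100, 150, None, 200, 120, 5000]  # hypothetical: one gap, one outlier
filled = impute_median(raw_areas)
clipped = clip_iqr(filled)
scaled = standardize(clipped)
encoded = one_hot(["house", "apartment", "house"])

print(filled)
print(clipped)
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```

To avoid data leakage, statistics such as the median, the quartiles, and the mean should be computed on the training split only and then reused on validation and test data.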

5. Feature Engineering

* Goal: Create data representations that facilitate model learning.

* Domain features: ratio, aggregation, lag features (time series), interaction terms.

* Encoding: embeddings for high-cardinality categorical features.

* Dimensionality reduction: PCA / UMAP / feature selection.

* Create pipelines so feature transformations are reproducible.
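
Some of the domain features listed above (lag, aggregation, ratio) can be sketched as plain functions, which also makes the transformations reproducible. The column names and values are hypothetical.

```python
def add_lag(values, lag=1):
    """Lag feature: the value `lag` steps earlier (None where unavailable). lag >= 1."""
    return [None] * lag + values[:-lag]


def rolling_mean(values, window):
    """Aggregation: mean of the current value and the previous window-1 values."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


sales = [10, 12, 15, 11]  # hypothetical daily series
print(add_lag(sales))           # [None, 10, 12, 15]
print(rolling_mean(sales, 2))   # [None, 11.0, 13.5, 13.0]

# Ratio feature: price per square meter.
houses = [{"price": 800, "area": 100}, {"price": 1300, "area": 200}]
for house in houses:
    house["price_per_m2"] = house["price"] / house["area"]
print(houses)
```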

6. Train/Validation/Test Splitting

* Goal: estimate how well the model generalizes.

* Random split (i.i.d. data) — train/val/test (80/10/10 or as appropriate).

* Time series: use time-based split (walk-forward, expanding window).

* Cross-validation: K-fold, stratified K-fold (classification), grouped CV (data leakage prevention).

Important: keep the test set strictly held out and use it only for the final evaluation.
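
The three splitting strategies can be sketched as index generators. The function names and default fractions are assumptions for illustration; stratified and grouped variants are not shown.

```python
import random


def train_val_test_split(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Random split of indices 0..n-1 (suitable for i.i.d. data)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test, n_val = round(n * test_frac), round(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def k_fold(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds."""
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        val = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        yield train, val
        start += size


def expanding_window(n, initial, horizon):
    """Time-series split: train on [0, t), validate on [t, t + horizon)."""
    t = initial
    while t + horizon <= n:
        yield list(range(t)), list(range(t, t + horizon))
        t += horizon


train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10

for train, val in expanding_window(10, initial=4, horizon=2):
    print(train, "->", val)
```

In the time-based split the validation indices always come after the training indices, so the model is never evaluated on the past using information from the future.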

7. Model Selection

* Goal: select candidate models & baselines.

* Baselines: mean predictor, logistic regression, decision tree — always start from the baseline.

* Candidates: linear models, tree ensembles (RandomForest, XGBoost, LightGBM), SVM, neural networks.

* For image/text: CNN/transformer/embeddings.

* Consider complexity vs latency vs interpretability.

8. Training & Hyperparameter Tuning

* Goal: optimize parameters and hyperparameters for the best performance.

* Loss function according to task (MSE, cross-entropy, ranking loss).

* Optimizers (SGD/Adam) for NN; early stopping.

* Hyperparameter tuning: grid search, random search, Bayesian (Optuna), Hyperopt.

* Use cross-validation, and keep the validation data separate from both the training data and the final test set.
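
Grid search can be sketched as a loop over all hyperparameter combinations, scoring each one on validation data. The model here (a one-feature k-nearest neighbors regressor), the grid, and the data are hypothetical; libraries such as Optuna or Hyperopt automate smarter searches.

```python
from itertools import product


def knn_regress(x, train, k, weighted):
    """Predict y at x as the (optionally distance-weighted) mean of the k nearest points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    if weighted:
        weights = [1 / (abs(px - x) + 1e-9) for px, _ in neighbors]
    else:
        weights = [1.0] * len(neighbors)
    return sum(w * py for w, (_, py) in zip(weights, neighbors)) / sum(weights)


# Hypothetical data, roughly y = x.
train = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1), (5, 5.2), (6, 5.9), (7, 7.1)]
val = [(0.5, 0.5), (2.5, 2.5), (4.5, 4.5), (6.5, 6.5)]

grid = {"k": [1, 2, 4, 8], "weighted": [False, True]}

scores = {}
for k, weighted in product(grid["k"], grid["weighted"]):
    errors = [(knn_regress(x, train, k, weighted) - y) ** 2 for x, y in val]
    scores[(k, weighted)] = sum(errors) / len(errors)  # validation MSE

best_params = min(scores, key=scores.get)
print(f"best (k, weighted) = {best_params}, validation MSE = {scores[best_params]:.4f}")
```

The test set plays no role in this loop; it is touched only once, after the best configuration has been chosen.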

9. Evaluation & Metrics

* Objective: assess performance and conformity to requirements.

* Regression: MSE, RMSE, MAE, MAPE, R².

* Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC (imbalanced).

* Calibration: reliability diagrams, Brier score.

* Confusion matrix, per-class metrics, error analysis (case study).

* Fairness & bias checks (demographic parity, equalized odds) where relevant.
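
The basic classification metrics follow directly from the confusion matrix counts. The labels and predictions below are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predictions

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics equal 0.75.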

10. Model Interpretation & Explainability

* Goal: understand what the model learns.

* Feature importance (Tree SHAP, permutation importance).

* SHAP / LIME for local & global explanations.

* Partial dependence plots, ICE plots.

* Document feature drift risk & spurious correlations.
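
Permutation importance can be sketched without any library: shuffle one feature column and measure how much the error grows. The "fitted" model below is a hypothetical stand-in that uses only feature 0, so feature 1 should come out as unimportant.

```python
import random


def model(row):
    """Hypothetical fitted model: depends on feature 0 only."""
    return 3.0 * row[0]


def mse(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, feature, rng):
    """Increase in MSE after shuffling one feature column."""
    baseline = mse(rows, targets)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)]
    return mse(permuted, targets) - baseline


rows = [(float(i), float((i * 7) % 5)) for i in range(10)]  # hypothetical data
targets = [3.0 * r[0] for r in rows]

rng = random.Random(0)
importance = {f: permutation_importance(rows, targets, f, rng) for f in (0, 1)}
print(importance)
```

Shuffling feature 0 breaks its link to the target and raises the error, while shuffling feature 1 changes nothing because the model never reads it.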