[__<< Feature Engineering__](./04_XXX_feature_engineering.ipynb) | [__Home__](../README.md)



# [PROJECT NAME]
## Modeling

__Dataset:__ [DATASET](URL) \
__Author:__ [AUTHOR NAME](URL) \
__Version:__ 1.N.0\
__Date:__ [YYYY-MM-DD]

### Notebooks <a class="anchor" name='notebooks'></a>

+ [Initial Data Exploration](./01_XXX_data_exploration.ipynb)
+ [Data Cleaning](./02_XXX_data_cleaning.ipynb)
+ [Exploratory Data Analysis](./03_XXX_exploratory_data_analysis.ipynb)
+ [Feature Engineering](./04_XXX_feature_engineering.ipynb)
+ __[Modeling & Validation](./05_XXX_modeling.ipynb)__

### Import Libraries <a name='import-libraries'></a>

In [1]:
import datetime
import sys
import re
import pickle

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sys.path.append('../02_scripts/')

%matplotlib inline


### Notebook Setup <a name='notebook-setup'></a>

In [2]:
# Pandas settings
pd.options.display.max_rows = 20
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 60
pd.options.display.float_format = '{:,.4f}'.format

# Visualization settings
from matplotlib import rcParams
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = (16, 5)   
rcParams['axes.spines.right'] = False
rcParams['axes.spines.top'] = False
rcParams['font.size'] = 12
rcParams['savefig.dpi'] = 300
plt.rc('xtick', labelsize=11)
plt.rc('ytick', labelsize=11)
%config InlineBackend.figure_format = 'retina'

In [2]:
from IPython.display import Markdown
from IPython.core.magic import register_cell_magic


@register_cell_magic
def markdown(line, cell):
    return Markdown(cell.format(**globals()))

### ToDo's <a name='todos'></a>

In [3]:
# get all tasks from the previous phase

"""
sys.path.append('../02_scripts/')
from todo_list import extract_todo_patterns

print(f"{'-'*5} TASKS FROM PREVIOUS PHASE {'-'*5}")
for todo in extract_todo_patterns('./02_XXX_data_cleaning.ipynb'):
    print(f'TODO: {todo}')
"""

### Loading Data <a name='loading-data'></a>

In [None]:
# %store -r myvar # to load dataset in the next phase

In [4]:
# loading data
# filename = '../00_data/02_processed/used_cars_data_processed_final.pkl'

# with open(filename, 'rb') as file:
#     data = pickle.load(file)

In [5]:
# data.info()

### Building Model <a name='building-model'></a>

1. Supervised Learning
Supervised learning algorithms are trained using labeled data, where the input-output pairs are known.

    - Classification
        - **Logistic Regression**
        - **Support Vector Machines (SVM)**
        - **k-Nearest Neighbors (k-NN)**
        - **Decision Trees**
        - **Random Forest**
        - **Gradient Boosting Machines (e.g., XGBoost, LightGBM)**
        - **Neural Networks (Multi-Layer Perceptron, Convolutional Neural Networks)**
        - **Naive Bayes**

    - Regression
        - **Linear Regression**
        - **Ridge Regression**
        - **Lasso Regression**
        - **Support Vector Regression (SVR)**
        - **Decision Trees for Regression**
        - **Random Forest for Regression**
        - **Gradient Boosting for Regression**
        - **Neural Networks for Regression**

2. Unsupervised Learning
Unsupervised learning algorithms are used on data without labeled responses, finding hidden patterns or intrinsic structures.

    - Clustering
        - **k-Means Clustering**
        - **Hierarchical Clustering**
        - **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**
        - **Gaussian Mixture Models (GMM)**

    - Dimensionality Reduction
        - **Principal Component Analysis (PCA)**
        - **t-Distributed Stochastic Neighbor Embedding (t-SNE)**
        - **Linear Discriminant Analysis (LDA)** (a supervised technique: it requires class labels)
        - **Autoencoders**

    - Association Rule Learning
        - **Apriori Algorithm**
        - **Eclat Algorithm**

3. Semi-Supervised Learning
Semi-supervised learning algorithms use both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.

    - **Self-Training**
    - **Co-Training**
    - **Semi-Supervised Support Vector Machines (S3VM)**
    - **Generative Adversarial Networks (GANs) for Semi-Supervised Learning**

4. Reinforcement Learning
Reinforcement learning algorithms learn by interacting with an environment, receiving rewards or penalties.

    - **Q-Learning**
    - **Deep Q-Networks (DQN)**
    - **Policy Gradient Methods**
    - **Actor-Critic Methods**
    - **Monte Carlo Methods**

5. Ensemble Learning
Ensemble learning methods combine multiple learning algorithms to improve performance.

    - **Bagging (Bootstrap Aggregating)**
    - **Boosting**
    - **Stacking**
    - **Voting**

6. Neural Networks and Deep Learning
A subset of machine learning focusing on neural networks with many layers.

    - **Convolutional Neural Networks (CNNs)**
    - **Recurrent Neural Networks (RNNs)**
    - **Long Short-Term Memory Networks (LSTMs)**
    - **Generative Adversarial Networks (GANs)**
    - **Transformer Networks**

7. Anomaly Detection
Algorithms designed to identify rare items, events, or observations.

    - **Isolation Forest**
    - **One-Class SVM**
    - **Autoencoders for Anomaly Detection**

#### Model Validation

1. Cross-Validation

    - K-Fold Cross-Validation
        - **Definition**: The dataset is divided into K equal-sized folds. The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times, with each fold serving as the validation set once.
        - **Purpose**: Provides a comprehensive evaluation of the model by using different subsets of the data for training and validation.
        - **Advantages**: Every observation is used for both training and validation, so the performance estimate depends less on one particular random split and is more stable than a single holdout estimate.

    - Stratified K-Fold Cross-Validation
        - **Definition**: A variant of K-fold cross-validation where the folds are created in such a way that each fold has approximately the same proportion of class labels as the original dataset.
        - **Purpose**: Ensures that each fold is representative of the entire dataset, especially useful for imbalanced datasets.

    - Leave-One-Out Cross-Validation (LOOCV)
        - **Definition**: A special case of K-fold cross-validation where K equals the number of observations in the dataset. Each observation serves as a validation set exactly once.
        - **Purpose**: Provides a thorough validation but can be computationally expensive.

    - Leave-P-Out Cross-Validation
        - **Definition**: A generalization of LOOCV in which P observations are held out in each iteration. The model is trained on the remaining observations and validated on the P held-out ones, and this is repeated for every possible subset of size P.
        - **Purpose**: Provides an exhaustive evaluation over all validation sets of size P. Because the number of subsets grows combinatorially, it is practical only for small datasets.

2. Train-Test Split

    - Holdout Method
        - **Definition**: The dataset is randomly split into two parts: a training set and a testing set. The model is trained on the training set and evaluated on the testing set.
        - **Purpose**: Simple and quick way to validate a model’s performance on unseen data.
        - **Considerations**: The choice of the split ratio (e.g., 70/30, 80/20) can affect the evaluation results. The randomness of the split can lead to variability in the results.

3. Nested Cross-Validation

    - Nested Cross-Validation
        - **Definition**: An extension of K-fold cross-validation used for hyperparameter tuning and model selection. It involves two loops: an outer loop for evaluating model performance and an inner loop for hyperparameter tuning.
        - **Purpose**: Provides an unbiased estimate of model performance by accounting for the model selection process.

4. Bootstrapping

    - Bootstrapping
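        - **Definition**: Repeatedly draws samples with replacement from the dataset, trains the model on each bootstrap sample, and typically evaluates it on the observations left out of that sample (the out-of-bag observations). The performance metrics are averaged over all samples.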
        - **Purpose**: Provides an estimate of the variability of the model’s performance and is useful for small datasets.

5. Validation Techniques for Time Series Data

    - Time Series Split
        - **Definition**: The data is split based on time, keeping earlier data points for training and later data points for validation.
        - **Purpose**: Maintains the temporal order of observations, which is crucial in time series forecasting.

    - Rolling Forecast Origin
        - **Definition**: A specific type of time series split where the training set is expanded with each iteration, and the model is validated on a fixed-size validation set.
        - **Purpose**: Simulates a real-world scenario where new data becomes available over time.

6. Performance Metrics for Model Validation

    - Confusion Matrix
        - **Definition**: A table showing the performance of a classification model by displaying the true positives, true negatives, false positives, and false negatives.

    - Accuracy, Precision, Recall, F1 Score
        - **Purpose**: Evaluate different aspects of classification model performance.

    - ROC-AUC, Precision-Recall Curve
        - **Purpose**: Evaluate the model’s performance across different thresholds.

    - Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared
        - **Purpose**: Evaluate regression model performance.

    - Log-Loss
        - **Purpose**: Measures the accuracy of probabilistic predictions in classification.
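
The K-fold procedure from item 1 can be sketched in plain Python as follows. This is a minimal illustration, not the project's method: `k_fold_indices`, the toy target values in `y`, and the "predict the training mean" baseline are all hypothetical stand-ins. In a real notebook a library splitter such as scikit-learn's `KFold` would normally be used instead.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so the folds are random
    # Spread samples as evenly as possible: the first n_samples % k folds get one extra
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Toy target values (assumption: made up for illustration only)
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 7.5]

fold_maes = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    # "Fit" the hypothetical baseline on K-1 folds: it always predicts the training mean
    prediction = mean(y[i] for i in train_idx)
    # Score it on the held-out fold with the mean absolute error
    fold_maes.append(mean(abs(y[i] - prediction) for i in val_idx))

cv_mae = mean(fold_maes)
print(f'MAE per fold: {[round(m, 3) for m in fold_maes]}')
print(f'Mean CV MAE: {cv_mae:.3f}')
```

Each observation lands in exactly one validation fold, so the averaged score uses all of the data without ever scoring a model on rows it was trained on.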

These model validation techniques and strategies help ensure that the model's performance is reliable and generalizes well to new data. The choice of validation method depends on the dataset characteristics, the model's intended use, and computational resources.
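
For time-ordered data, the rolling forecast origin described in item 5 might look like the sketch below. The function name `expanding_window_splits` and its parameters are assumptions made for this example; scikit-learn's `TimeSeriesSplit` offers a comparable, more complete splitter.

```python
def expanding_window_splits(n_samples, n_splits, val_size):
    """Yield (train_idx, val_idx) with a growing training window
    and a fixed-size validation window that always lies in the future."""
    first_train_end = n_samples - n_splits * val_size
    if first_train_end <= 0:
        raise ValueError('Not enough samples for the requested number of splits')
    for split in range(n_splits):
        train_end = first_train_end + split * val_size
        train_idx = list(range(train_end))                      # everything seen so far
        val_idx = list(range(train_end, train_end + val_size))  # the next val_size points
        yield train_idx, val_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3, val_size=2))
for train_idx, val_idx in splits:
    print(f'train: 0..{train_idx[-1]}  validate: {val_idx}')
```

Unlike K-fold, nothing is shuffled: every validation index is later than every training index, which is what keeps future information out of the training set.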

#### Model Evaluation

1. Classification Metrics

    - Accuracy
        - **Definition**: The ratio of correctly predicted instances to the total instances.
        - **Formula**: `(TP + TN) / (TP + TN + FP + FN)`

    - Precision
        - **Definition**: The ratio of true positive predictions to the total predicted positives.
        - **Formula**: `TP / (TP + FP)`

    - Recall (Sensitivity or True Positive Rate)
        - **Definition**: The ratio of true positive predictions to the total actual positives.
        - **Formula**: `TP / (TP + FN)`

    - F1 Score
        - **Definition**: The harmonic mean of precision and recall.
        - **Formula**: `2 * (Precision * Recall) / (Precision + Recall)`

    - Specificity (True Negative Rate)
        - **Definition**: The ratio of true negative predictions to the total actual negatives.
        - **Formula**: `TN / (TN + FP)`

    - AUC-ROC Curve
        - **Definition**: AUC (Area Under the Curve) measures the ability of the classifier to distinguish between classes. ROC (Receiver Operating Characteristic) curve is a graphical plot of the true positive rate against the false positive rate.
        - **Purpose**: Evaluates the model's performance across all classification thresholds.

    - Confusion Matrix
        - **Definition**: A table used to describe the performance of a classification model. It shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

2. Regression Metrics

    - Mean Absolute Error (MAE)
        - **Definition**: The average of the absolute differences between predicted and actual values.
        - **Formula**: `(1/n) * Σ|y_i - ŷ_i|`

    - Mean Squared Error (MSE)
        - **Definition**: The average of the squared differences between predicted and actual values.
        - **Formula**: `(1/n) * Σ(y_i - ŷ_i)²`

    - Root Mean Squared Error (RMSE)
        - **Definition**: The square root of the mean of the squared differences between predicted and actual values.
        - **Formula**: `sqrt((1/n) * Σ(y_i - ŷ_i)²)`

    - R-squared (Coefficient of Determination)
        - **Definition**: The proportion of variance in the dependent variable that is predictable from the independent variables.
        - **Formula**: `1 - (Σ(y_i - ŷ_i)² / Σ(y_i - ȳ)²)`

    - Adjusted R-squared
        - **Definition**: A modified version of R-squared that adjusts for the number of predictors in the model.
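        - **Purpose**: Penalizes predictors that do not improve the fit, which makes models with different numbers of predictors easier to compare.
        - **Formula**: `1 - (1 - R²) * (n - 1) / (n - p - 1)`, where `n` is the number of observations and `p` the number of predictors.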

3. Clustering Metrics

    - Silhouette Score
        - **Definition**: Measures how similar an object is to its own cluster compared to other clusters.
        - **Range**: -1 to 1, where 1 indicates that objects are well clustered.

    - Davies-Bouldin Index
        - **Definition**: Measures the average similarity ratio of each cluster with the cluster that is most similar to it.
        - **Range**: Lower values indicate better clustering.

    - Dunn Index
        - **Definition**: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
        - **Purpose**: Higher values indicate better clustering.

4. Cross-Validation

    - K-Fold Cross-Validation
        - **Definition**: The dataset is divided into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, and the results are averaged.
        - **Purpose**: Provides a robust measure of model performance that does not depend on a single train-test split, which makes overfitting easier to detect.

    - Leave-One-Out Cross-Validation (LOOCV)
        - **Definition**: A special case of K-fold cross-validation where K equals the number of instances in the dataset.
        - **Purpose**: Evaluates the model's performance by using each instance as a test set.

5. Other Evaluation Techniques

    - Holdout Method
        - **Definition**: The dataset is split into training and testing sets. The model is trained on the training set and evaluated on the testing set.
        - **Purpose**: Simple and quick evaluation method.

    - Precision-Recall Curve
        - **Definition**: A plot of the precision against recall for different threshold values.
        - **Purpose**: Useful for evaluating models when the classes are imbalanced.

    - Cost-Sensitive Metrics
        - **Definition**: Metrics that consider the cost of false positives and false negatives, useful in cases where these errors have different consequences.
        - **Example**: Weighted Accuracy, Cost-Adjusted ROC
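
The classification formulas in item 1 can be checked directly from the four confusion-matrix counts. The sketch below is illustrative only: `classification_metrics` is a hypothetical helper, and the counts passed to it are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the basic classification metrics from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        'accuracy': safe_div(tp + tn, tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'specificity': safe_div(tn, tn + fp),
        'f1': safe_div(2 * precision * recall, precision + recall),
    }


# Made-up counts for a binary classifier evaluated on 100 observations
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```

With these counts accuracy is 0.85, while recall (0.80) is lower than precision (about 0.89) and specificity (0.90), which shows why a single number rarely tells the whole story.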

These evaluation techniques and metrics help in understanding the performance and reliability of different machine learning models. The choice of metric depends on the specific problem, the type of data, and the consequences of different types of errors.
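
The regression metrics from item 2 can be computed by hand in the same way. This is a sketch under stated assumptions: `regression_metrics` is a hypothetical helper, the values are toy data, and adjusted R-squared uses the standard formula `1 - (1 - R²) * (n - 1) / (n - p - 1)` with `p` predictors.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Compute MAE, MSE, RMSE, R-squared and adjusted R-squared."""
    n = len(y_true)
    residuals = [actual - predicted for actual, predicted in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r ** 2 for r in residuals) / n
    y_mean = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)              # residual sum of squares
    ss_tot = sum((actual - y_mean) ** 2 for actual in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {'mae': mae, 'mse': mse, 'rmse': math.sqrt(mse),
            'r2': r2, 'adjusted_r2': adjusted_r2}


# Toy values (assumption: made up for illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

scores = regression_metrics(y_true, y_pred, n_predictors=1)
for name, value in scores.items():
    print(f'{name}: {value:.4f}')
```

MAE and RMSE are in the units of the target, whereas R-squared is unitless, so they answer different questions about the same residuals.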


#### Save Final Model <a name='save-model'></a>

In [17]:
# import pickle

# final_model_filename = '../03_models/XXX_model.pkl'
# with open(final_model_filename, 'wb') as file:
#     pickle.dump(final_model, file)
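
Before relying on a saved model, it can help to confirm that it loads back unchanged. The round trip below is a minimal sketch: the dictionary is a placeholder for a fitted estimator, and the temporary directory and the file name `example_model.pkl` are assumptions chosen so the example does not touch the project folders.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder standing in for a fitted model (hypothetical; any picklable object works)
final_model = {'coefficients': [0.5, -1.2], 'intercept': 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'example_model.pkl'

    # Save: the context manager closes the file even if dumping fails
    with open(model_path, 'wb') as file:
        pickle.dump(final_model, file)

    # Load it back the same way the next notebook or script would
    with open(model_path, 'rb') as file:
        restored_model = pickle.load(file)

print(restored_model == final_model)
```

Pickle files can execute arbitrary code when loaded, so only unpickle models from sources you trust.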

### Conclusions

- **Data Dictionary**: Document the meaning and type of each column in the dataset.
- **Exploration Report**: Summarize key findings and visualizations in a report.
- **Next Steps**: Outline potential areas for deeper analysis or data collection.


---
\
[__<< Feature Engineering__](./04_XXX_feature_engineering.ipynb) | [__Home__](../README.md)

\
\
[PROJECT_NAME], _[MMMM YYYY]_