# Intro

A step-by-step outline of how to prepare data and build a machine learning model.

**General Workflow Order in Scikit-Learn:**

1.  **Data Preprocessing**

    *   Handling missing values (e.g., `SimpleImputer`)
    *   Encoding categorical variables (e.g., `OneHotEncoder` or `OrdinalEncoder` for features; `LabelEncoder` is meant for target labels)
    *   Scaling numerical features (e.g., `StandardScaler`, `MinMaxScaler`)
    *   Handling outliers, transformations, etc.

2.  **Feature Selection**

    *   Computing the correlation between each pair of predictors with pandas's `df.corr()`
    *   Computing p-values for those correlations with `scipy.stats.pearsonr()`
    *   Computing the correlation between each predictor and the target with pandas's `df.corrwith()`
    *   Removing low-variance features (`VarianceThreshold`) when `df.corrwith()` shows they have little predictive power
    *   Selecting important features (`SelectKBest` as a filter method, recursive feature elimination with `RFE` or `RFECV` as a wrapper method)
    *   Using model-based feature selection, i.e., embedded methods (e.g., `SelectFromModel` with Lasso or RandomForest)
    
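3.  **Train-Test Split** (`train_test_split`). Fit imputers, scalers, and feature selectors on the training split only (for example, by wrapping them in a `Pipeline`); fitting them on the full dataset before splitting leaks test information into the model.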

4.  **Model Training & Evaluation**

    *   Choosing a machine learning algorithm
    *   Hyperparameter tuning
    *   Performance evaluation (accuracy, precision, recall, F1-score, etc.)
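
The split, tuning, and evaluation steps can be sketched as one pipeline. This is a minimal sketch on synthetic data from `make_classification`; the step names (`scale`, `select`, `clf`), the parameter grid, and the choice of `LogisticRegression` are assumptions made for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Hold out a test set first; everything else is fitted on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and feature selection live inside the pipeline, so each
# cross-validation fold refits them on its own training portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning over an illustrative grid.
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)
# Precision, recall, and F1 on the held-out test set.
print(classification_report(y_test, search.predict(X_test)))
```

Because the scaler and selector are pipeline steps, `GridSearchCV` never lets them see the validation fold or the test set during fitting.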

# Preprocessing
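
As a rough illustration of the preprocessing steps listed in the intro (imputation, encoding, scaling), the sketch below combines them with a `ColumnTransformer`. The toy DataFrame and its column names (`age`, `income`, `city`) are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0],
    "income": [40_000.0, 52_000.0, 61_000.0, np.nan, 58_000.0],
    "city": ["Paris", "Lyon", "Paris", np.nan, "Nice"],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age", "income"]),
    ("cat", categorical_steps, ["city"]),
])

X_ready = preprocess.fit_transform(df)
print(X_ready.shape)  # rows x (2 scaled numeric columns + one column per city)
```

In a real project, `fit_transform` would be called on the training split only and `transform` on the test split.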

# Feature Selection
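
The filter and wrapper tools named in the intro can be tried on a small synthetic dataset. This is only a sketch: the columns (`x1`, `x2`, `noise`, `constant`, `x1_copy`) and the target formula are invented so that the expected behavior is easy to see. The zero-variance column is dropped first here so that the correlation tables contain no NaN entries.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import (
    RFECV, SelectKBest, VarianceThreshold, f_regression
)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),   # unrelated to the target
    "constant": np.ones(n),        # zero variance
})
X["x1_copy"] = X["x1"] + rng.normal(scale=0.01, size=n)  # near-duplicate of x1
y = 3 * X["x1"] - 2 * X["x2"] + rng.normal(scale=0.5, size=n)

# Drop zero-variance features.
vt = VarianceThreshold(threshold=0.0).fit(X)
X_var = X.loc[:, vt.get_support()]

# Predictor/predictor correlations: reveals redundant pairs such as x1 and x1_copy.
pairwise = X_var.corr()

# Predictor/target correlations, plus a p-value for one of them.
target_corr = X_var.corrwith(y)
r, p_value = pearsonr(X_var["x1"], y)

# Filter method: keep the 3 features with the highest univariate F-score.
kbest = SelectKBest(score_func=f_regression, k=3).fit(X_var, y)
kbest_cols = list(X_var.columns[kbest.get_support()])

# Wrapper method: recursive feature elimination with cross-validation.
rfecv = RFECV(LinearRegression(), cv=5).fit(X_var, y)
rfecv_cols = list(X_var.columns[rfecv.get_support()])

print(target_corr.round(2))
print("SelectKBest:", kbest_cols)
print("RFECV:", rfecv_cols)
```

Note that a univariate filter such as `SelectKBest` scores each feature on its own, so it has no way to notice that `x1` and `x1_copy` carry the same information; the pairwise correlation table is what exposes that redundancy.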

## Wrapper Methods vs Embedded Methods

| Feature                     | Wrapper Methods                                         | Embedded Methods                                      |
| :-------------------------- | :------------------------------------------------------ | :---------------------------------------------------- |
| **Accuracy**                | High (can optimize feature set)                         | Moderate (depends on model regularization)             |
| **Speed**                   | Slow (requires multiple model fits)                     | Fast (feature selection during training)                |
| **Overfitting Risk**        | Higher                                                  | Lower (regularization helps)                           |
| **Scalability**             | Poor for large datasets                                  | Works well on large datasets                           |
| **Model Dependence**        | Works with any model                                     | Tied to a specific model                               |

**When to Use What?**

- Use Wrapper Methods when accuracy is critical and computational cost is not a concern (e.g., small datasets).
- Use Embedded Methods when you need a faster, scalable, and regularized approach (e.g., large datasets).

**Hybrid Approach?** You can combine both by using an embedded method (like Lasso) for initial feature filtering and then applying a wrapper method (like RFE) for fine-tuning. 🚀
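
One possible shape for such a hybrid is sketched below, assuming synthetic regression data, an arbitrary `alpha=1.0` for the Lasso, and an arbitrary target of 3 features for RFE. None of these values come from the text above; they would need tuning on real data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 of which drive the target (assumption).
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0, random_state=0
)

hybrid = Pipeline([
    ("scale", StandardScaler()),
    # Embedded step: keep features whose Lasso coefficient is non-zero.
    ("lasso_filter", SelectFromModel(Lasso(alpha=1.0))),
    # Wrapper step: recursively drop the weakest remaining features.
    ("rfe", RFE(LinearRegression(), n_features_to_select=3)),
    ("model", LinearRegression()),
])
hybrid.fit(X, y)

kept_by_lasso = int(hybrid.named_steps["lasso_filter"].get_support().sum())
kept_by_rfe = int(hybrid.named_steps["rfe"].get_support().sum())
print(f"Lasso kept {kept_by_lasso} of {X.shape[1]} features; RFE kept {kept_by_rfe}.")
```

The cheap embedded step shrinks the search space first, so the more expensive wrapper step only has to refit the model on the features that survive.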

# Model Selection

## Cross Validation

### KFold vs StratifiedKFold

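When `cv` is an integer (or left at its default), helpers such as `cross_val_score` and `GridSearchCV` use `StratifiedKFold` for classifiers with binary or multiclass targets and `KFold` otherwise. The table below summarizes the difference.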

| Feature         | KFold                               | StratifiedKFold                        |
| :-------------- | :---------------------------------- | :-------------------------------------- |
| Class Balance   | Not maintained                      | Maintained                              |
| Use Case        | Regression, balanced classification | Imbalanced classification                 |
| Distribution    | Ignores labels; consecutive folds unless `shuffle=True` | Preserves class proportions             |
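
The difference is easiest to see on a deliberately extreme case. The sketch below uses a made-up label vector (90 negatives followed by 10 positives) and counts how many positives land in each test fold; the helper `positives_per_test_fold` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 negatives followed by 10 positives: imbalanced and sorted by class.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not affect how folds are built


def positives_per_test_fold(splitter):
    """Count the positive samples in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


kfold_counts = positives_per_test_fold(KFold(n_splits=5))
stratified_counts = positives_per_test_fold(StratifiedKFold(n_splits=5))

print(kfold_counts)       # [0, 0, 0, 0, 10]
print(stratified_counts)  # [2, 2, 2, 2, 2]
```

With plain `KFold` and no shuffling, four of the five test folds contain no positive sample at all, so metrics such as recall cannot be computed meaningfully on them. `StratifiedKFold` keeps the 10% positive rate in every fold.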

# Predictors

## Classifiers

**Best Classifiers (for Classification Problems)**

| Algorithm                                      | Pros                                                                                        | Cons                                                                 | Best For                                                    |
| :--------------------------------------------- | :------------------------------------------------------------------------------------------ | :------------------------------------------------------------------- | :---------------------------------------------------------- |
| Random Forest (RF)                             | Robust, less prone to overfitting than a single tree, exposes feature importances; missing-value support depends on the implementation and version | Slower for large datasets                                             | General-purpose, tabular data                               |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | High accuracy, handles non-linearity well                                                | Can overfit, requires tuning                                        | Structured/tabular data, small-to-medium datasets           |
| Support Vector Machine (SVM)                   | Works well in high-dimensional space                                                      | Slow on large datasets                                               | Text classification, image recognition                     |
| Logistic Regression                            | Simple, interpretable, works well for linear problems                                       | Poor performance for complex, non-linear data                     | Binary classification, medical studies                      |
| Neural Networks (MLP, CNN, Transformers, etc.) | Powerful for complex data, scalable                                                          | Requires lots of data and tuning                                    | Deep learning tasks (images, NLP, time-series)               |
| k-Nearest Neighbors (k-NN)                     | Simple, no training phase                                                                   | Slow predictions on large datasets                                    | Small datasets, intuitive cases                              |
| Naïve Bayes                                    | Fast, works well with text/NLP                                                               | Assumes feature independence                                         | Spam detection, text classification                          |

**Best Overall?**

*   Random Forest or XGBoost for most structured/tabular data.
*   Neural Networks (e.g., CNNs, Transformers) for deep learning (images, NLP).
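
Which classifier is "best" depends on the dataset, so it is usually worth checking with cross-validation. The sketch below compares several of the classifiers from the table on scikit-learn's bundled breast cancer dataset. As an assumption to keep the example dependency-free, scikit-learn's own `GradientBoostingClassifier` stands in for XGBoost, LightGBM, and CatBoost, and all hyperparameters are left at their defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models (linear, SVM, k-NN) get a StandardScaler in front.
candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
}

# Mean F1 over 5 stratified folds for each candidate.
scores = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="f1").mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} F1 = {score:.3f}")
```

The ranking from a run like this applies only to this dataset and these default settings; it is a starting point for tuning, not a general verdict.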

## Regressors

**Best Regressors (for Regression Problems)**

| Algorithm                                      | Pros                                                                    | Cons                                                 | Best For                                              |
| :--------------------------------------------- | :----------------------------------------------------------------------- | :--------------------------------------------------- | :---------------------------------------------------- |
| Linear Regression                              | Simple, interpretable                                                     | Poor fit for complex data                            | Basic regression, finance                            |
| Ridge/Lasso Regression                         | Handles multicollinearity, reduces overfitting                           | Assumes linearity                                     | Feature selection, reducing overfitting                |
| Random Forest Regressor                        | Handles non-linearity, robust                                            | Slower on large datasets                                 | General-purpose tabular data                           |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | Highly accurate, works well for complex data                             | Sensitive to tuning                                  | Most regression tasks                                    |
| Support Vector Regression (SVR)                | Works well in high-dimensional spaces                                    | Computationally expensive                             | High-dimensional data                                  |
| Neural Networks (Deep Learning)                 | Captures complex relationships                                            | Needs lots of data                                   | Time series, large-scale problems, image-based predictions |
| k-Nearest Neighbors (k-NN) Regression          | Simple, non-parametric                                                    | Slow for large datasets                                 | Small datasets                                          |

**Best Overall?**

*   XGBoost or Random Forest for structured data.
*   Neural Networks for deep learning-based regression (e.g., stock prices, image-based predictions).
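
A comparable check for regressors is sketched below, using `cross_validate` to report both R² and RMSE on scikit-learn's bundled diabetes dataset. Again as an assumption, `GradientBoostingRegressor` stands in for XGBoost and similar libraries, and the `alpha` values for Ridge and Lasso are arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsRegressor()),
}

results = {}
for name, estimator in candidates.items():
    cv = cross_validate(
        estimator, X, y, cv=5,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    results[name] = {
        "r2": cv["test_r2"].mean(),
        # scikit-learn reports errors as negated scores; flip the sign back.
        "rmse": -cv["test_neg_root_mean_squared_error"].mean(),
    }

for name, metrics in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name:18s} R2 = {metrics['r2']:.3f}  RMSE = {metrics['rmse']:.1f}")
```

On a small, mostly linear dataset like this one, the simple linear models can match or beat the tree ensembles, which is a reminder that the "best overall" picks above are defaults to try first rather than guarantees.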