If we don't fix `random_state` when splitting the dataset into `X_train`, `X_test`, `y_train`, `y_test`, the R-squared score changes on every run, because each run evaluates the model on a different random split.

Running many experiments and averaging the metrics increases confidence in the model.
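
To make this concrete, here is a minimal sketch using scikit-learn. The synthetic dataset from `make_regression` and the choice of `LinearRegression` are assumptions made for illustration; any `X`, `y`, and estimator would behave similarly.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=25.0, random_state=0)

scores = []
for seed in range(5):
    # Each random_state value produces a different train/test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R-squared on the held-out part

print("R^2 per split:", [round(s, 3) for s in scores])
print("Mean R^2:", round(sum(scores) / len(scores), 3))
```

The per-split scores differ even though the data and the algorithm are the same, and the mean is a steadier summary than any single value.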

**Cross-Validation**: Experimenting with different arrangements of the same data to build different models with the same algorithm.

<img src="cross-validation.png" width="350">

Cross-validation is a fundamental technique in machine learning used to evaluate the performance and generalizability of models. It helps ensure that the model's performance metrics are reliable and not overly optimistic by splitting the data into training and testing sets multiple times in different ways. Here are the key types of cross-validation and their uses:

### 1. **K-Fold Cross-Validation**
In k-fold cross-validation, the dataset is divided into k subsets (or "folds"). The model is trained k times, each time using k-1 folds for training and the remaining fold for testing. The performance metrics are then averaged over the k runs. This method helps ensure that every data point gets to be in the test set exactly once, providing a comprehensive evaluation.

**Steps:**
1. Shuffle the dataset randomly.
2. Split the dataset into k folds.
3. For each fold:
   - Use k-1 folds to train the model.
   - Use the remaining fold to test the model.
4. Compute the performance metric (e.g., accuracy, RMSE) for each iteration.
5. Average the performance metrics to get a final estimate.

**Advantages:**
- Reduces the risk of an overly optimistic estimate, since the model is validated on several different subsets instead of one favorable split.
- Provides a more reliable estimate of model performance than a single train-test split.

**Disadvantages:**
- Computationally expensive, especially for large datasets and complex models.
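
The steps above can be sketched with scikit-learn's `KFold` and `cross_val_score`. The synthetic data, the `LinearRegression` model, and k=5 are illustrative assumptions, not requirements of the method.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=25.0, random_state=0)

# shuffle=True covers step 1; n_splits is k (step 2).
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each sample appears in a test fold.
test_counts = [0] * len(X)
for train_idx, test_idx in kf.split(X):
    for i in test_idx:
        test_counts[i] += 1

# Steps 3 to 5: fit on k-1 folds, score on the remaining fold, then average.
fold_scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print("R^2 per fold:", fold_scores.round(3))
print("Mean R^2:", fold_scores.mean().round(3))
print("Each sample tested exactly once:", set(test_counts) == {1})
```

Passing the `KFold` object as `cv` keeps the shuffling reproducible; passing a bare integer such as `cv=5` would split without shuffling.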


<img src="k-fold.jpg" width="550">

### 2. **Stratified K-Fold Cross-Validation**
Stratified k-fold cross-validation is a variation of k-fold cross-validation used for classification problems where the folds are created in such a way that each fold maintains the same proportion of each class as in the original dataset. This ensures that each fold is representative of the entire dataset.

**Steps:**
- Similar to k-fold cross-validation, but with stratified sampling to maintain class distribution in each fold.

**Advantages:**
- More reliable performance metrics for ***imbalanced datasets***.
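
As a rough illustration, the sketch below checks that `StratifiedKFold` keeps the minority-class share nearly constant across folds. The imbalanced dataset (about 90% / 10%) comes from `make_classification` and is purely an assumption for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary data, roughly 90% class 0 and 10% class 1 (synthetic).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Because labels are 0/1, the mean of y is the share of class 1.
overall_rate = y.mean()
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

print("Minority share overall:", round(overall_rate, 3))
print("Minority share per fold:", [round(r, 3) for r in fold_rates])
```

Note that `split` needs `y` here, because the labels determine how samples are assigned to folds.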

<img src="stratified-k-fold-cross-validation.avif" width="650">

### 3. **Leave-One-Out Cross-Validation (LOOCV)**
LOOCV is an extreme case of k-fold cross-validation where k is equal to the number of data points. In each iteration, one data point is used as the test set, and the rest as the training set.

**Advantages:**
- Uses the maximum amount of data for training, which can be beneficial for small datasets.
- Provides a nearly unbiased estimate of model performance, since each model is trained on all but one data point.

**Disadvantages:**
- Very computationally intensive, especially for large datasets.
- High variance in performance estimates, since each fold contains only one test instance. Choosing a model based on such a noisy estimate can favor one that overfits.

The validation data, which is carved out of the training dataset, acts as test data for checking the model's performance.
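
A small LOOCV sketch with scikit-learn's `LeaveOneOut` follows. The tiny synthetic dataset is an assumption, and mean squared error is used instead of R-squared because R-squared is not meaningful on a single test point.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption); LOOCV fits one model per sample.
X, y = make_regression(n_samples=40, n_features=3, noise=10.0, random_state=0)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # equals the number of samples

# scikit-learn reports errors as negative scores, so flip the sign.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_squared_error"
)
loo_mse = -neg_mse.mean()

print("Number of fits:", n_splits)
print("LOOCV mean squared error:", round(loo_mse, 3))
```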

<img src="leave-one-out.png" width="450">


### 4. **Leave-P-Out Cross-Validation**
Leave-p-out cross-validation is a generalization of LOOCV where p data points are left out for testing, and the remaining data points are used for training. This method is rarely used in practice due to its computational intensity.
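
The cost comes from the number of splits: for n samples there are "n choose p" distinct test sets. The sketch below, using a toy array of 10 samples as an assumption, compares scikit-learn's `LeavePOut` count against `math.comb`.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Ten toy samples (an assumption, only the sample count matters here).
X = np.arange(10).reshape(-1, 1)

lpo = LeavePOut(p=2)
n_splits = lpo.get_n_splits(X)  # number of distinct test sets of size p

print("Splits for n=10, p=2:", n_splits, "expected:", comb(10, 2))
# The count grows very quickly with n and p.
print("Splits for n=100, p=5:", comb(100, 5))
```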

<img src="leave-p-out.png">

### 5. **Repeated Random Subsampling (Monte Carlo Cross-Validation)**
In this method, the dataset is randomly split into training and test sets multiple times. Each split is independent of the others.

**Advantages:**
- More flexibility in the size of the training and test sets.
- Can provide more varied estimates of model performance.

**Disadvantages:**
- Test sets can overlap across iterations, so some data points are tested more than once.
- Some data points may never be used in testing.
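
One way to try this is scikit-learn's `ShuffleSplit`. In the sketch below, the synthetic data, the 10 iterations, and the 20% test size are all illustrative assumptions. It also counts how often each sample is tested, which shows the uneven coverage.

```python
from collections import Counter

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=100, n_features=5, noise=25.0, random_state=0)

# 10 independent random splits, each holding out 20% of the data.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Count how many times each sample index lands in a test set.
test_counts = Counter()
for _, test_idx in ss.split(X):
    test_counts.update(test_idx.tolist())

never_tested = len(X) - len(test_counts)
scores = cross_val_score(LinearRegression(), X, y, cv=ss, scoring="r2")

print("Mean R^2 over splits:", scores.mean().round(3))
print("Most-tested sample count:", max(test_counts.values()))
print("Samples never tested:", never_tested)
```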

### Practical Considerations
- **Choice of k**: Common choices are k=5 or k=10. Larger k values reduce the bias of the performance estimate, because each training set is closer to the full dataset, but they increase computational time and can increase the variance of the estimate.
- **Computational Resources**: Ensure that cross-validation does not become prohibitively expensive for large datasets or complex models.
- **Model Selection and Hyperparameter Tuning**: Cross-validation can also be used for model selection and hyperparameter tuning by evaluating different models or parameter combinations systematically.
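
For hyperparameter tuning, a minimal sketch with scikit-learn's `GridSearchCV` could look like this. The `Ridge` model, the `alpha` grid, and the synthetic data are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=10, noise=25.0, random_state=0)

# Candidate regularization strengths (hypothetical values for the demo).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# Every candidate is scored with the same 5-fold splits.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(Ridge(), param_grid, cv=cv, scoring="r2")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV R^2:", round(search.best_score_, 3))
```

The best cross-validated score is itself selected from several candidates, so a separate held-out test set is still advisable for the final performance estimate.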

Cross-validation is crucial for building robust machine learning models: it provides a more reliable evaluation than a single train-test split, helps detect overfitting, and supports model selection.