-----------
    Train Test Split
-------------

<img src="https://media.licdn.com/dms/image/D4D12AQGDGMjsrzigDg/article-cover_image-shrink_600_2000/0/1674057624735?e=2147483647&v=beta&t=tVSJ36hIjz7sFWVEDRQuvBZxAtFu5JnlCiDpvIuDGI4" width="650">

In machine learning, the "train-test split" is a common technique used to evaluate the performance of a model. Here's how it works:

1. **Dataset**: Start with a dataset containing your input features (X) and corresponding target labels (y). This dataset should represent the problem you're trying to solve, such as classifying images or predicting house prices.

2. **Splitting**: Split the dataset into two subsets: the training set and the test set. The training set is used to train the model, while the test set is used to evaluate its performance. Typically, the training set contains a larger portion of the data (e.g., 70-80%) while the test set contains the remaining portion.

3. **Training**: Train your machine learning model using the training set. This involves feeding the input features and corresponding labels into the model and adjusting its parameters to minimize the error between the predicted and actual labels.

4. **Testing**: Once the model is trained, use it to make predictions on the test set. Compare these predictions to the actual labels in the test set to evaluate how well the model generalizes to new, unseen data.

5. **Evaluation**: Finally, evaluate the performance of the model using metrics such as accuracy, precision, recall, or mean squared error, depending on the type of problem you're solving. These metrics provide insight into how well the model is performing and can help identify areas for improvement.

The train-test split is crucial for detecting overfitting, where the model performs well on the training data but fails to generalize to new data. The split does not prevent overfitting by itself; it exposes it. By evaluating the model on a separate test set that it never saw during training, you get a more realistic estimate of its performance in real-world scenarios.
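
The five steps above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the synthetic data (`y = 3x + 2` plus noise), the helper name `train_test_split_indices`, the 80/20 ratio, and the seeds are all assumptions made for the example. In practice, libraries such as scikit-learn provide a ready-made `train_test_split` helper for the splitting step.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# 1. Dataset: synthetic regression data (made up for this sketch).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# 2. Splitting: 80% of the rows for training, 20% held out for testing.
train_idx, test_idx = train_test_split_indices(len(X), test_fraction=0.2, seed=0)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 3. Training: least-squares fit of slope and intercept, using the training set only.
A_train = np.column_stack([X_train[:, 0], np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# 4. Testing: predict on the held-out test set.
A_test = np.column_stack([X_test[:, 0], np.ones(len(X_test))])
y_pred = A_test @ coef

# 5. Evaluation: mean squared error on the test set.
test_mse = float(np.mean((y_test - y_pred) ** 2))
print(f"slope={coef[0]:.2f}, intercept={coef[1]:.2f}, test MSE={test_mse:.2f}")
```

Shuffling before cutting matters when the rows are stored in some order (for example sorted by label or by date); without it the test set may not resemble the training set.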

<img src="https://dhavalpatel2101992.wordpress.com/wp-content/uploads/2021/05/image-29.png?w=500">

In addition to the train-test split, the train-test-validation split is another common technique used in machine learning. Here's how it works:

1. **Dataset**: Start with your dataset containing input features (X) and corresponding target labels (y), just like in the train-test split.

2. **Splitting**: Split the dataset into three subsets: the training set, the validation set, and the test set. The training set is used to train the model, the validation set is used to tune hyperparameters and assess model performance during training, and the test set is used to evaluate the final model performance.

   - Training set: This subset typically contains the majority of the data (e.g., 60-80%). It's used to train the model's parameters.
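   - Validation set: This subset (usually smaller than the training set) is used to tune hyperparameters and assess model performance during training. It helps detect overfitting because it measures performance on data the model was not fit on. Since hyperparameters are chosen against it, this estimate becomes slightly optimistic, which is why a separate test set is kept.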
   - Test set: This subset (also separate from the training and validation sets) is used to evaluate the final model performance after it has been trained and tuned. It provides an estimate of how well the model will perform on unseen data.

3. **Training**: Train your machine learning model using the training set. During training, you'll periodically evaluate the model's performance on the validation set and adjust hyperparameters accordingly to improve performance.

4. **Validation**: Use the validation set to tune hyperparameters, such as learning rate, regularization strength, or the number of hidden units in a neural network. This process helps to optimize the model's performance without overfitting to the training data.

5. **Testing**: Once the model is trained and tuned using the training and validation sets, use the test set to evaluate its final performance. This gives you an unbiased estimate of how well the model generalizes to new, unseen data.

6. **Evaluation**: Evaluate the model's performance on the test set using appropriate metrics, similar to the train-test split.

By using a train-test-validation split, you can better assess your model's performance, tune hyperparameters effectively, and ensure it generalizes well to new data.
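
A minimal sketch of the three-way workflow described above, assuming a 60/20/20 split and using the polynomial degree as the hyperparameter to tune. The synthetic data (`sin(x)` plus noise), the candidate degrees, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset: noisy samples of sin(x) (made up for this sketch).
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(0, 0.2, size=200)

# Splitting: shuffle once, then cut into 60% train, 20% validation, 20% test.
idx = rng.permutation(len(x))
train_idx, val_idx, test_idx = idx[:120], idx[120:160], idx[160:]

def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial (given by its coefficients) on xs, ys."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Training + validation: fit each candidate degree on the training set
# and score it on the validation set.
val_scores = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    val_scores[degree] = mse(coeffs, x[val_idx], y[val_idx])

# Hyperparameter choice: the degree with the lowest validation error.
best_degree = min(val_scores, key=val_scores.get)

# Testing: evaluate the chosen model once on the untouched test set.
best_coeffs = np.polyfit(x[train_idx], y[train_idx], best_degree)
test_mse = mse(best_coeffs, x[test_idx], y[test_idx])
print(f"best degree={best_degree}, "
      f"validation MSE={val_scores[best_degree]:.3f}, test MSE={test_mse:.3f}")
```

The test set is touched only once, after the degree has been fixed. If you went back and changed the degree based on the test score, the test set would effectively become a second validation set and its estimate would no longer be unbiased.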

----------
    Overfitting and underfitting
----------

Overfitting and underfitting are two common problems encountered when training machine learning models:

1. **Overfitting**:
   - **Definition**: Overfitting occurs when a model learns to fit the training data too closely, capturing noise or random fluctuations in the data rather than the underlying pattern. As a result, the model performs well on the training data but poorly on unseen data.
   - **Signs**:
     - High accuracy on the training data but low accuracy on the test/validation data.
     - The model may perform well on specific examples in the training data but fail on new examples.
   - **Causes**:
     - Using a model that is too complex relative to the amount of training data available.
     - Insufficient regularization (e.g., not using dropout or L1/L2 regularization).
     - Training for too many epochs, allowing the model to memorize the training data.
   - **Mitigation**:
     - Use simpler models or reduce the model's capacity.
     - Increase the amount of training data if possible.
     - Apply regularization techniques such as dropout or L1/L2 regularization.
     - Use early stopping to prevent training for too many epochs.

2. **Underfitting**:
   - **Definition**: Underfitting occurs when a model is too simple to capture the underlying structure of the data. As a result, it performs poorly on both the training and test/validation data.
   - **Signs**:
     - Low accuracy on both the training and test/validation data.
     - The model fails to capture important patterns or trends in the data.
   - **Causes**:
     - Using a model that is too simple or has insufficient capacity to represent the data.
     - Insufficient training, such as stopping after too few epochs or using a learning rate too low for the model to converge.
   - **Mitigation**:
     - Use more complex models with higher capacity.
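     - Reduce regularization if it is too strong. Note that adding more training data alone rarely fixes underfitting, because the model already cannot fit the data it has.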
     - Train for more epochs or adjust the learning rate to allow the model to learn better representations.
     - Feature engineering: Adding more relevant features to the dataset.
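
The signs listed above can be reproduced on a toy problem. The sketch below fits polynomials of three different degrees to a small noisy sample and compares training and test error. The data (`sin(2πx)` plus noise), the sample sizes, and the degrees 1, 3, and 12 are assumptions chosen for illustration; the exact numbers will vary with the seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# Degree 1 is too simple, degree 3 is a reasonable match,
# degree 12 has almost as many parameters as training points.
results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.4f}  "
          f"test MSE={results[degree][1]:.4f}")
```

Reading the output: a model whose training and test errors are both high is underfitting; a model whose training error is very low while its test error is clearly higher is overfitting.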
     
To address both overfitting and underfitting, it's crucial to strike a balance between model complexity and the amount of available data. Regularization techniques, appropriate model selection, and hyperparameter tuning can help mitigate these issues and improve the generalization performance of the model.
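
As one concrete instance of regularization, here is a hedged sketch of L2 (ridge) regularization applied to a deliberately over-flexible degree-12 polynomial. The data, the `fit_ridge` helper, and the candidate strengths `lam` are assumptions made for this example. Increasing `lam` shrinks the weights and raises the training error; whether it lowers the test error depends on the data, which is why `lam` is normally tuned on a validation set.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

def features(x, degree=12):
    # Polynomial features of x rescaled to [-1, 1]; columns are t**degree, ..., t, 1.
    return np.vander(2.0 * x - 1.0, degree + 1)

def fit_ridge(A, y, lam):
    # Minimizes ||A w - y||^2 + lam * ||w||^2 by stacking sqrt(lam) * I under A.
    # For simplicity the intercept is penalized too.
    n_features = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    w, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    return w

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)
A_train, A_test = features(x_train), features(x_test)

ridge_results = {}
for lam in (0.0, 1e-3, 1e-1):
    w = fit_ridge(A_train, y_train, lam)
    train_mse = float(np.mean((A_train @ w - y_train) ** 2))
    test_mse = float(np.mean((A_test @ w - y_test) ** 2))
    ridge_results[lam] = (train_mse, test_mse, float(np.linalg.norm(w)))
    print(f"lam={lam:g}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}  "
          f"weight norm={np.linalg.norm(w):.2f}")
```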

-------------
    Bias and Variance in ML models
---------

<img src ="bias and variance.png" width="650">

-------------
    Bias-variance tradeoff
---------------

In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model.
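
To make the relationship concrete: for squared-error loss, the expected prediction error at a point can be decomposed into squared bias, variance, and irreducible noise. The sketch below estimates the first two terms by simulation, refitting a polynomial on many freshly drawn training sets and looking at how the predictions behave on a fixed grid. The true function (`sin(2πx)`), the noise level, the sample sizes, and the helper name `bias_variance` are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
NOISE_STD = 0.3
x_grid = np.linspace(0.05, 0.95, 50)      # fixed evaluation points
true_f = np.sin(2 * np.pi * x_grid)       # noise-free target at those points

def bias_variance(degree, n_train=30, n_repeats=200):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        # Draw a fresh training set from the same distribution each time.
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, NOISE_STD, size=n_train)
        preds[i] = Polynomial.fit(x, y, degree, domain=[0, 1])(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f) ** 2))   # systematic error
    variance = float(np.mean(preds.var(axis=0)))          # sensitivity to the sample
    return bias_sq, variance

bv = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in bv.items():
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

A low-degree model tends to show high bias and low variance, while a high-degree model tends to show the reverse; the tradeoff is about choosing a complexity where their sum is small.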

<img src="https://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png">

------------
    Bias-variance tradeoff for overfitting and underfitting
-----------------

<img src ="https://miro.medium.com/v2/resize:fit:500/1*WaZtPef1z_wxew9ONx1B2A.png">