## __1. Train-Test split__

A train-test split is a model validation procedure that simulates how a model would perform on new, unseen data by splitting a dataset into a training set and a testing set.

- The training set is data used to train the model, and the testing set data (which is new to the model) is used to test the model’s performance and accuracy.
- A train test split can also involve splitting data into a validation set, which is data used to fine-tune hyperparameters and optimize the model during the training process.

<img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_train-test-split_0.jpg" width=700 />

#### __Methods for Splitting Data in a Train Test Split__

Some common methods of splitting data in a train test split:

1. __Random Splitting:__ involves randomly shuffling data and splitting it into training and testing sets based on given percentages (like 75% training and 25% testing).
   
2. __Stratified Splitting:__ divides a dataset in a way that preserves its proportion of classes or categories. This creates training and testing sets with class proportions representative of the original dataset. Stratified splitting helps keep a rare class from being under-represented (or missing entirely) in either set, and is most useful for imbalanced datasets. In scikit-learn, pass the labels to the `stratify` parameter of the `train_test_split()` function.

3. __Time-Based Splitting:__ involves ordering the data by time, ensuring past data is in the training set and future or later data is in the testing set. Splitting data based on time simulates real-world scenarios (for example, predicting future financial or market trends) and suits time series datasets. However, one drawback of time-based splitting is that, for non-stationary data (data whose statistical properties change over time), patterns learned from the past may not carry over to the test period. In scikit-learn, the `TimeSeriesSplit` cross-validator generates successive train/test splits in which each test fold comes after its training data.
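
The three splitting methods above can be sketched with scikit-learn as follows. This is a minimal illustration on a made-up toy dataset (`X`, `y`, the 15-vs-5 class balance, and the chosen `test_size` and `random_state` values are all assumptions for the example, not part of the original notes).

```python
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# Toy data (hypothetical): 20 samples, 2 features, imbalanced labels (15 vs 5).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# 1. Random split: 75% train / 25% test, shuffled with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Stratified split: class proportions are preserved in both subsets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("stratified test class counts:", np.bincount(ys_test))

# 3. Time-based split: rows are assumed to be in chronological order,
#    so every training window ends before its test window starts.
tscv = TimeSeriesSplit(n_splits=3)
time_splits = list(tscv.split(X))
for train_idx, test_idx in time_splits:
    print("train up to row", train_idx.max(),
          "-> test rows", test_idx.min(), "to", test_idx.max())
```

Note that `TimeSeriesSplit` never shuffles, so it only makes sense when the rows are already sorted by time.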

## __2. Data Standardization__

Data standardization comes into the picture when features of the input data set have large differences between their ranges, or simply when they are measured in different units. It converts data into a standard, uniform format, making it consistent across different data sets and easier to understand for machine learning or statistical models. 

- Standardizing data can enhance data quality and accuracy, which helps users make reliable data-driven decisions.
- Z-score normalization, or standardization, is one of the most popular methods to standardize data.
- With this method, data is transformed to have a mean of 0 and a standard deviation of 1, giving all data points the same scale.
- We can use the `StandardScaler` class from scikit-learn.
- `StandardScaler` provides three main methods: `fit()`, `transform()`, and `fit_transform()`.
- `fit()` method - takes the dataset we aim to standardize as an argument and computes the mean and standard deviation of each feature.
- `transform()` method - applies the scaling to every feature value, using the mean and standard deviation computed by `fit()`.
- `fit_transform()` method - does both `fit()` and `transform()` in a single call on the same data, which is more convenient than calling the two methods separately.
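
As a small sketch of how these methods relate, the example below standardizes a made-up feature matrix (the age and income values are hypothetical) and checks the result against the z-score formula $z = (x - \mu) / \sigma$ computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age (years) and income.
X = np.array(
    [[25, 40_000], [35, 60_000], [45, 80_000], [55, 100_000]], dtype=float
)

scaler = StandardScaler()
scaler.fit(X)                   # learns the per-feature mean and standard deviation
X_scaled = scaler.transform(X)  # applies z = (x - mean) / std to every value

# fit_transform() gives the same result as fit() followed by transform().
X_scaled_2 = StandardScaler().fit_transform(X)

# Manual z-score for comparison (StandardScaler uses the population std, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("learned means:", scaler.mean_)
print("scaled means:", X_scaled.mean(axis=0))
print("scaled stds:", X_scaled.std(axis=0))
```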

### __Should we perform `.fit_transform()` before or after the split of training and test data?__

Normalization / standardization should be fitted after splitting the data into train and test sets, using the training set only. The reason is to avoid data leakage. The same applies to `CountVectorizer()`, which learns a vocabulary from the text (for example, text messages) and counts how often each word occurs.

__Data Leakage:__ Data leakage happens when information from outside the training set is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the model being constructed.

The testing data points represent real-world data that the model has not seen yet. Feature normalization (or data standardization) of the explanatory (or predictor) variables centers and rescales the data by subtracting the mean and dividing by the standard deviation. If you compute the mean and standard deviation on the whole dataset, information from the test set leaks into the training features through those statistics.

- We perform standardization with `fit_transform()` on the training set and `transform()` on the testing set.
- `fit_transform()` on the train data calculates the mean and standard deviation of the train data and standardizes it. `transform()` on the test data then applies the statistics calculated from the train set to the test set. We do so to prevent data leakage, i.e. learning anything from the test data, so that the selected model is tested fairly.
- Using `fit_transform()` on the entire dataset could make the model appear to perform better than it really does, because the preprocessing would already have seen the test data (its mean and standard deviation for a scaler, or its vocabulary for `CountVectorizer()`), leading to an unrealistic assessment of performance.
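
The leakage-free order of operations described above might look like the following sketch. The data is synthetic (drawn from a random generator purely for illustration), and the split ratio and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first, so the test set stays untouched by any fitting step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# The test set is scaled with the *training* mean, so its own mean
# will generally be close to, but not exactly, zero.
print("train means:", X_train_scaled.mean(axis=0))
print("test means: ", X_test_scaled.mean(axis=0))
```

Wrapping the scaler and the model in a scikit-learn `Pipeline` is a common way to get this behaviour automatically, including inside cross-validation.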

## __3. Missing data imputation__

Missing data imputation is a technique used to fill in missing values within a dataset, preventing potential issues with analysis or model training. It involves replacing missing values with estimated values based on the existing data, ensuring the dataset is complete and usable for further analysis.

- __Univariate Imputation:__ This method focuses on a single variable, using the mean, median, or mode of the non-missing values to fill in the missing values for that specific variable. 
- __Multivariate Imputation:__ This method considers multiple variables to estimate the missing values. It often involves using regression models or other statistical methods to predict the missing values based on the relationships between the variables. 
- __Multiple Imputation:__ This technique creates multiple imputed datasets by generating different estimates for the missing values. This allows for incorporating uncertainty about the true values, as analysis can be performed separately on each imputed dataset and results can be pooled. 

#### __Common Imputation Techniques:__

1. __Mean, Median, and Mode Imputation:__ Replacing missing values with the average, middle value, or most frequent value, respectively. 
2. __Constant Value Imputation:__ Replacing missing values with a predetermined constant, which can be a specific value or a value representing an "unknown" or "missing" category. 
3. __Regression Imputation:__ Using regression models to predict missing values based on other available variables. 
4. __K-Nearest Neighbors Imputation:__ Replacing missing values with the average of the values from the nearest neighbors in the dataset. 
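
Several of these techniques are available in scikit-learn's `sklearn.impute` module. The sketch below applies them to a tiny made-up matrix (the values, the `-1.0` fill value, and `n_neighbors=2` are assumptions chosen for illustration). Regression imputation is not shown here.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with two missing entries (np.nan).
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [3.0, 30.0, 300.0],
    [np.nan, 40.0, 400.0],
    [5.0, 50.0, 500.0],
])

# Mean / median imputation: fill each column with a statistic of its observed values.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Constant value imputation: fill with a predetermined placeholder.
constant_imputed = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X)

# K-nearest neighbors imputation: average the value over the 2 most similar rows.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean:\n", mean_imputed)
print("knn:\n", knn_imputed)
```

As with scaling, an imputer should be fitted on the training set only and then applied to the test set with `transform()`.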

## __4. Cross Validation Techniques__

Cross-validation is a technique that estimates how well a model performs on unseen data by dividing the data into multiple folds. In each iteration, one fold is used as a validation set and the remaining folds as training data. This process is repeated so that each fold serves as the validation set once. The results from all iterations are averaged to provide a robust estimate of model performance. Its main purpose is to give a more reliable estimate of generalization performance, which helps detect overfitting.

### __Types of Cross-Validation__

Some of the common cross-validation techniques are:

#### __4.1 K-Fold Cross-Validation__

In K-Fold cross-validation, the dataset is divided into `k` folds of (approximately) equal size. The model is trained on `k-1` folds and tested on the remaining fold. This process is repeated `k` times, with each fold used exactly once as the test set. The results are averaged to produce a single performance estimate.
* Pros: Provides a more reliable estimate of model performance than a single train-test split.
* Cons: Computationally intensive for large datasets, since the model is trained `k` times.
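
A minimal K-Fold sketch with scikit-learn is shown below. The dataset (Iris), the model (`LogisticRegression`), and `k = 5` are assumptions picked only to make the example runnable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds; shuffling matters here because the Iris rows are sorted by class.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold; the mean is the cross-validated estimate.
scores = cross_val_score(model, X, y, cv=kfold)
print("fold scores:", scores)
print("mean accuracy:", scores.mean())

# Check that every sample lands in a test fold exactly once.
test_counts = np.zeros(len(X), dtype=int)
for _, test_idx in kfold.split(X):
    test_counts[test_idx] += 1
```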

#### __4.2 Stratified K-Fold Cross-Validation__

Stratified K-Fold is a variant of K-Fold that ensures each fold maintains the same class distribution as the entire dataset. This is particularly important when dealing with imbalanced datasets where certain classes may be under-represented. In this method:
- The dataset is divided into k folds while maintaining the proportion of classes in each fold.
- During each iteration, one fold is used for testing and the remaining folds are used for training.
- The process is repeated k times with each fold serving as the test set exactly once.
  
* Pros: More reliable performance estimates for imbalanced datasets.
* Cons: Still computationally intensive.
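
To see the stratification at work, the sketch below counts the classes in each test fold of a made-up 90/10 imbalanced label vector (the labels, the placeholder features, and `n_splits=5` are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many samples of each class end up in every test fold.
fold_counts = [
    np.bincount(y[test_idx], minlength=2) for _, test_idx in skf.split(X, y)
]
for i, counts in enumerate(fold_counts):
    print(f"fold {i}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Each test fold keeps the 9:1 ratio of the full dataset, which a plain `KFold` does not guarantee.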

#### __4.3 Holdout Method__

In the holdout method, the dataset is divided into two sets, a training set and a test set. For example, we might train on 50% of the dataset and use the remaining 50% for testing (other ratios, such as 80/20, are also common). The model is trained on the training set and evaluated on the test set.

The major drawback of this method is that the model never sees the held-out portion during training. With a 50/50 split, the remaining 50% of the data may contain important information that the model misses, which can lead to higher bias. The performance estimate also depends on which samples happen to fall into the test set.

#### __4.4 Leave-One-Out Cross-Validation (LOOCV)__

A special case of k-fold cross-validation where k is equal to the number of data points in the dataset ($k = n$). Each observation is used once as a test set, and the model is trained on all remaining data points. So, in each iteration the model is trained on `n-1` data points and tested on the remaining one.

* Pros: Maximizes the amount of training data used.
* Cons: Extremely computationally expensive, especially for large datasets, because the model is trained `n` times. Another major drawback is that each test score is based on a single data point, which leads to high variance in the individual scores. If that data point is an outlier, the score for that iteration can be misleading.
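
A short LOOCV sketch with scikit-learn follows. The dataset (Iris) and the model (`LogisticRegression`) are assumptions chosen to keep the example small enough to run quickly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One split per sample: train on n-1 points, test on the single point left out.
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# Each per-split accuracy is 0 or 1, since only one point is tested at a time.
print("number of fits:", len(scores))
print("mean accuracy:", scores.mean())
```

Because every score is either 0 or 1, only the mean across all `n` splits is meaningful as a performance estimate.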