<a href="https://colab.research.google.com/github/danieleduardofajardof/DataSciencePrepMaterial/blob/main/Dataset_prep_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 4. Dataset Preparation and Annotation

# Index

- [1. Dataset annotation](#ds-annot)
- [2. Handling imbalanced data](#handling)
- [3. Dataset splitting techniques](#ds-split)

Proper dataset preparation and annotation are crucial steps in building high-quality machine learning models. This chapter covers best practices and techniques for dataset annotation, handling class imbalance, and splitting datasets.

---



### 1. Dataset Annotation  <a name="ds-annot"></a>

**Dataset annotation** is the process of labeling data so that it can be used for supervised learning tasks. The type of annotation depends on the task:

- **Image Classification**: Assigning a class label to an entire image.
- **Object Detection**: Drawing bounding boxes and assigning class labels to objects within an image.
- **Segmentation**: Annotating each pixel with a class.
- **Text Classification**: Tagging entire documents or sentences with categories.
- **Named Entity Recognition (NER)**: Labeling specific spans of text.
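
As a rough illustration of how these annotation types differ in shape, the sketch below stores two of them as plain Python dataclasses. The class names, field names (`x_min`, `start`, `end`, and so on), and example values are hypothetical; they are not the export format of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """One object-detection annotation: a labeled box in pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


@dataclass
class EntitySpan:
    """One NER annotation: a labeled character span, end-exclusive."""
    label: str
    start: int
    end: int


text = "Ada Lovelace was born in London."
spans = [EntitySpan("PERSON", 0, 12), EntitySpan("LOCATION", 25, 31)]
box = BoundingBox("cat", x_min=10, y_min=20, x_max=110, y_max=220)

# Each span can be mapped back to the text it labels.
for span in spans:
    print(span.label, "->", text[span.start:span.end])
print("box area in pixels:", box.area())
```

An image-classification or text-classification label needs only a single class per item, while detection and NER records also carry a location (a box or a character range).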

**Annotation Tools**:
- CVAT, Labelbox, Roboflow (images/videos)
- Prodigy, Doccano (NLP)
- VGG Image Annotator (VIA), LabelImg

**Best Practices**:
- Define clear annotation guidelines.
- Use multiple annotators and compute inter-annotator agreement.
- Conduct quality audits to ensure label consistency.
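
One common way to quantify inter-annotator agreement for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses scikit-learn's `cohen_kappa_score`; the two label lists are made-up data for illustration only.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators for the same ten items.
annotator_a = ["cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog", "cat", "cat"]
annotator_b = ["cat", "dog", "cat", "dog", "dog", "dog", "cat", "cat", "cat", "cat"]

# Raw agreement: fraction of items on which both annotators chose the same label.
raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa: 1.0 is perfect agreement, 0.0 is what chance alone would give.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {raw_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa is lower than raw agreement here because two annotators who each pick "cat" 60% of the time would already agree on about half of the items by chance.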

---

### 2. Handling Imbalanced Datasets  <a name="handling"></a>

In many real-world problems, classes are not evenly distributed (e.g., fraud detection, medical diagnosis). A model trained on such data can score well on accuracy while rarely predicting the minority class, so class imbalance needs to be handled explicitly.

#### 🔻 Under-sampling Techniques

These methods reduce the number of samples in the majority class.

- **Random Under-sampling**: Randomly removes examples from the majority class.
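- **Tomek Links**: Finds pairs of nearest-neighbor samples that belong to opposite classes (Tomek links) and removes the majority-class sample of each pair, which cleans up the class boundary.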
- **Cluster Centroids**: Replaces clusters of majority class samples with centroids.

**Pros**: Fast, reduces training time  
**Cons**: Risk of losing valuable information
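
Random under-sampling can be sketched with NumPy alone. The toy dataset below (90 majority samples, 10 minority samples) and the choice to shrink the majority class to exactly the minority size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 90 majority samples (class 0), 10 minority (class 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Keep a random subset of the majority class, the same size as the minority class.
kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Combine and shuffle so the classes are not grouped in blocks.
balanced_idx = rng.permutation(np.concatenate([kept_majority_idx, minority_idx]))
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

print(np.bincount(y), "->", np.bincount(y_balanced))
```

The imbalanced-learn library provides ready-made versions of these methods (for example `RandomUnderSampler` and `TomekLinks`), which are usually preferable to hand-written code in practice.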

#### 🔺 Over-sampling Techniques

These methods increase the number of samples in the minority class.

- **Random Over-sampling**: Duplicates random minority class examples.
- **SMOTE (Synthetic Minority Over-sampling Technique)**: Synthesizes new examples between existing minority samples.
- **ADASYN**: Similar to SMOTE, but focuses more on difficult-to-learn samples.

**Pros**: Keeps every original sample, so no information is discarded  
**Cons**: May lead to overfitting, because duplicated or interpolated points add no genuinely new information
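
The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference algorithm: the function name `smote_like_sample`, the toy data, and the parameter values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class samples in two dimensions.
X_minority = rng.normal(loc=2.0, scale=0.5, size=(10, 2))


def smote_like_sample(X_min, k, n_new, rng):
    """Create n_new synthetic points by interpolating toward minority-class neighbors."""
    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a random minority sample
        neighbor = rng.choice(neighbors[base])   # pick one of its k nearest neighbors
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic


X_new = smote_like_sample(X_minority, k=3, n_new=20, rng=rng)
print(X_new.shape)
```

Every synthetic point lies on a line segment between two existing minority samples. For real projects, the `SMOTE` and `ADASYN` classes in imbalanced-learn are the usual choice. Over-sampling should be applied to the training set only, so that synthetic points never leak into validation or test data.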

---

### 3. Dataset Splitting Techniques  <a name="ds-split"></a>

Splitting your data lets you measure how well your model generalizes to unseen examples, because performance is evaluated on data the model never saw during training.

#### 📂 Train-Validation-Test Split

A typical split:
- **Train Set (60–80%)**: Used to train the model.
- **Validation Set (10–20%)**: Used for hyperparameter tuning.
- **Test Set (10–20%)**: Used for final performance evaluation.

```python
from sklearn.model_selection import train_test_split

# `data` is assumed to be an array or DataFrame with one row per sample.
# Hold out 30% of the rows; the remaining 70% is the training set.
train, temp = train_test_split(data, test_size=0.3, random_state=42)

# Split the held-out 30% in half: 15% validation, 15% test.
val, test = train_test_split(temp, test_size=0.5, random_state=42)
```
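
When classes are imbalanced, a purely random split can leave the validation or test set with too few minority samples. A minimal sketch of a stratified 70/15/15 split, using the `stratify` parameter of `train_test_split` on a made-up dataset with 20% minority samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 160 samples of class 0 and 40 of class 1.
X = rng.normal(size=(200, 3))
y = np.array([0] * 160 + [1] * 40)

# 70% train, 30% held out; stratify=y preserves the class ratio in both parts.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Split the held-out 30% evenly into validation and test (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), f"minority share: {labels.mean():.2f}")
```

All three subsets keep the same 20% minority share as the full dataset.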

#### Cross-Validation
Cross-validation helps evaluate model performance more reliably, especially on small datasets.

#### K-Fold Cross-Validation:

- Splits data into k subsets (folds).

- Trains the model k times, each time using a different fold as the validation set.

- Averages the scores from all folds for the final evaluation.

#### Stratified K-Fold:

- Maintains class distribution in each fold (useful for classification).
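
A short sketch of stratified folds with scikit-learn's `StratifiedKFold`, on toy labels with a 20% minority class (the data is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 40 samples of class 0 and 10 of class 1 (20% minority).
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

minority_per_fold = []
# split() needs y here so that it can balance the classes across folds.
for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
    minority_per_fold.append(int(y[val_index].sum()))
    print(f"fold {fold}: {len(val_index)} validation samples, "
          f"{minority_per_fold[-1]} from the minority class")
```

Each validation fold receives the same number of minority samples, which plain K-Fold does not guarantee.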

#### Leave-One-Out Cross-Validation (LOOCV):

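- Uses a single sample as the validation set and all remaining samples as the training set, repeated once for each sample (equivalent to K-Fold with k equal to the number of samples).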
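
Leave-one-out splitting is available in scikit-learn as `LeaveOneOut`. The sketch below runs it on six toy samples to show that it produces one split per sample:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(12).reshape(6, 2)  # six toy samples

loo = LeaveOneOut()
splits = list(loo.split(X))

# Each split holds out exactly one sample for validation.
for train_index, val_index in splits:
    print("train:", train_index, "validation:", val_index)

print("number of splits:", loo.get_n_splits(X))
```

Because the model must be trained once per sample, this approach is generally practical only for small datasets.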

Example (K-Fold):

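```python
from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays with one row (or label) per sample.
# shuffle=True avoids folds that simply follow the original row order.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
```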
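
To train on each fold and average the scores in one step, scikit-learn offers `cross_val_score`. The sketch below is one possible setup: the Iris dataset and the `LogisticRegression` model are stand-ins chosen for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold; a fresh copy of the model is fit for each fold.
scores = cross_val_score(model, X, y, cv=kf)

print("per-fold accuracy:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a sense of how much the estimate varies from fold to fold.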


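**Advantages** of cross-validation over a single split:

- More robust estimate of performance, since it does not depend on one particular split.

- Better use of limited data, since every sample is used for both training and validation.

**Disadvantages**:

- Computationally expensive, since the model is trained once per fold.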
