# Emoji.
- 🎉
- 🤗

# Terminology.
- Logit : A raw output of the model, before post-processing such as an activation layer (e.g., softmax).
- Token : A piece of text that the tokenizer splits out of a sentence.
- Causal LM : Predicts the next token from the preceding ones.
- Masked LM : Predicts the masked tokens in a sentence.
- Encoding : raw text -> numbers.
- Decoding : numbers -> raw text.
- Dynamic Padding : After forming a batch, find its longest sample and pad the other samples to that length.
- Collate Function : Responsible for putting together samples inside a batch.
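
Dynamic padding and a collate function can be sketched together in plain Python. The function name `collate_with_dynamic_padding` and the `pad_id=0` default are assumptions made for illustration; in practice a library collator such as Hugging Face's `DataCollatorWithPadding` would normally do this job.

```python
from typing import Dict, List


def collate_with_dynamic_padding(
    samples: List[List[int]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every sample to the length of the longest sample in this batch."""
    max_len = max(len(sample) for sample in samples)
    input_ids = []
    attention_mask = []
    for sample in samples:
        n_pad = max_len - len(sample)
        input_ids.append(sample + [pad_id] * n_pad)  # Pad on the right.
        attention_mask.append([1] * len(sample) + [0] * n_pad)  # 0 marks padding.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = collate_with_dynamic_padding([[5, 6, 7], [8], [9, 10]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Because the padding length is computed per batch, a batch of short samples is not padded to the longest sample in the whole dataset.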

## Ignore warnings.

In [None]:
# Show only messages logged at ERROR level or above (hides logging warnings).
import logging
logging.basicConfig(level=logging.ERROR)

# Silence the HF Hub warning about symlinks.
import os
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'

## Measuring execution time.

In [None]:
%time print('1 line')

In [3]:
%%time 
print('whole cell')

whole cell
CPU times: total: 0 ns
Wall time: 0 ns


In [None]:
%%timeit
# %%timeit reruns the cell many times, so the print is left commented out.
# print('iterative')
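
Outside a notebook the magics above are not available. A rough equivalent using only the standard library might look like the sketch below; `slow_sum` is a hypothetical workload that exists only so there is something to time.

```python
import time
import timeit


def slow_sum(n: int) -> int:
    """Toy workload (hypothetical) used only to have something to time."""
    return sum(range(n))


# One-shot wall-clock timing, similar in spirit to %time.
start = time.perf_counter()
result = slow_sum(100_000)
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.6f} s")

# Repeated timing, similar in spirit to %%timeit: the callable runs many times,
# so it should be free of side effects such as printing.
total = timeit.timeit(lambda: slow_sum(100_000), number=20)
print(f"average over 20 runs: {total / 20:.6f} s")
```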

## Memory Usage.

In [2]:
import psutil
print(f'RAM used : {psutil.Process().memory_info().rss / (1024**2):.2f} MB')   # Resident set size.

RAM used : 65.48 MB


### General Guide to Choosing Batch Size for Model Training

1. **Understand the Memory Usage**:
   - Each sample consists of tokenized data, where each token is stored as an integer ID (4 bytes as `int32`; 8 bytes as `int64`, the default dtype for integer tensors in PyTorch).
   - Estimate the average number of tokens per sample by analyzing your dataset.

2. **Account for Model and Gradient Memory**:
   - Reserve VRAM for the model parameters, gradients, optimizer state, and activations.
   - Larger models need more memory for weights and gradients (this guide budgets ~2 GB for ALBERT `base`).

3. **Calculate Maximum Batch Size**:
   - Divide available memory for data by the estimated memory per sample.
   - Apply a safety margin to account for padding, intermediate computations, and possible memory spikes.

4. **Optimize for Efficiency**:
   - Choose a batch size as a power of 2 (e.g., 128, 256, 512) for better GPU utilization and compatibility with mixed precision training.
   - If not using such optimizations, any batch size close to the calculated limit is fine.

5. **Monitor and Adjust**:
   - Check GPU memory usage during training. If you encounter OOM (Out Of Memory) errors, reduce the batch size.
   - Use tools like NVIDIA's `nvidia-smi` or PyTorch's `torch.cuda.memory_allocated()` to monitor VRAM usage.
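
Steps 3 and 4 can be sketched as a small helper. The function name `estimate_batch_size` and the numbers in the usage line are hypothetical; `bytes_per_sample` is assumed to be a measured per-sample footprint that already includes activations, not just the token IDs.

```python
def estimate_batch_size(
    free_bytes: int, bytes_per_sample: int, safety_margin: float = 0.5
) -> int:
    """Largest power of 2 that fits in free_bytes after applying the margin."""
    if bytes_per_sample <= 0 or free_bytes <= 0:
        raise ValueError("free_bytes and bytes_per_sample must be positive")
    limit = int(free_bytes * safety_margin) // bytes_per_sample
    if limit < 1:
        return 0  # Not even one sample fits.
    return 1 << (limit.bit_length() - 1)  # Round down to a power of 2.


GIB = 1024**3
MIB = 1024**2
# Hypothetical numbers: 6 GiB free, 4 MiB per sample including activations.
print(estimate_batch_size(6 * GIB, 4 * MIB))  # 512
```

The result is only a starting point; step 5 (monitoring and adjusting) still applies.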

### Summary:
- Batch size depends on sample length (tokens per sample), model architecture, and available GPU VRAM.
- Use powers of 2 (e.g., 256, 512) for better efficiency.
- Start with a conservative estimate and adjust based on memory usage.

### Example (IMDb Dataset with ALBERT on RTX 3070):
1. **Memory Usage**:
   - Average character length: 1,325.
   - Approximate tokens per sample, assuming ~0.6 tokens per character (an assumption; measure the real ratio with your tokenizer): \(1,325 \times 0.6 \approx 795\).
   - Memory per sample: \(795 \times 4 \, \text{bytes} \approx 3.2 \, \text{KB}\).

2. **Available Memory**:
   - GPU VRAM: 8 GB.
   - Reserve ~2 GB for model and gradients, leaving ~6 GB for data.

3. **Batch Size Calculation**:
   - Upper bound from input IDs alone: \(6,000,000 \, \text{KB} / 3.2 \, \text{KB per sample} \approx 1,875,000\).
   - This bound is far too loose to use directly: activations and intermediate computations need much more memory per sample than the token IDs do.

4. **Final Recommendation**:
   - Start with a small power of 2, monitor VRAM, and double the batch size until it no longer fits.
   - Adjust further based on memory usage and performance.


## Random Seed.

In [None]:
from transformers import set_seed
import tensorflow as tf

set_seed(42)              # HF helper: seeds Python's random, NumPy, and PyTorch/TensorFlow if installed.
tf.random.set_seed(42)    # TensorFlow's global seed only.
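
What seeding buys you can be checked with the standard library alone. The sketch below uses a hypothetical helper `draw` and Python's `random` module as a stand-in for any seeded generator.

```python
import random


def draw(seed: int, n: int = 3) -> list:
    """Seed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(42)
second = draw(42)
print(first == second)  # Same seed, same sequence.
```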

## Stop Words.

In [None]:
# Exclude some frequent words that are rarely meaningful.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
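
Once a stop-word set exists, filtering is a membership test per token. The sketch below uses a tiny hand-written set, `DEMO_STOP_WORDS`, as a stand-in for the NLTK list so that it runs without a download; `remove_stop_words` is a hypothetical helper name.

```python
from typing import Iterable, List, Set

# Tiny hand-written stand-in for set(stopwords.words('english')).
DEMO_STOP_WORDS: Set[str] = {"the", "a", "is", "of", "and", "to"}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The plot of the movie is a delight".split()
print(remove_stop_words(tokens, DEMO_STOP_WORDS))  # ['plot', 'movie', 'delight']
```

Using a `set` keeps each lookup constant-time, which matters when filtering a large corpus.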

# Log in to HF Hub.

In [None]:
# Log in, then push an evaluation result to the Hub.
import evaluate
from huggingface_hub import login

hf_token = "your-huggingface-token"  # Placeholder; do not commit a real token.
login(token=hf_token)

evaluate.push_to_hub(
    model_id="your-username/your-repo-name",  # Your repository
    metric_value=0.5,
    metric_type="bleu",
    metric_name="BLEU",
    dataset_type="wikitext",
    dataset_name="WikiText",
    dataset_split="test",
    task_type="text-generation",
    task_name="Text Generation"
)