from typing import Set

# 1. Create Sets
set_A = {1, 2, 3}
set_B = {3, 4, 5}
set_C = {1, 4, 5}  # third set, needed by the associative and distributive checks

# 2. Cardinality
cardinality_A = len(set_A)

# 3. Union
union_set = set_A.union(set_B)

# 4. Intersection
intersection_set = set_A.intersection(set_B)

# 5. Complement (for demonstration, considering universal set)
universal_set = {1, 2, 3, 4, 5}
complement_A = universal_set - set_A

# 6. Subset
is_subset = set_A.issubset(set_B)

# 7. Proper Subset
is_proper_subset = set_A < set_B

# 8. Disjoint
is_disjoint = set_A.isdisjoint(set_B)

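# 9. Power Set
def power_set(s: Set) -> Set[frozenset]:
    """Return every subset of s as a frozenset (members of a set must be hashable)."""
    from itertools import chain, combinations
    subsets = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
    return {frozenset(subset) for subset in subsets}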

power_set_A = power_set(set_A)

# 10. Cartesian Product
cartesian_product = {(a, b) for a in set_A for b in set_B}

# 11. Symmetric Difference
symmetric_difference = set_A.symmetric_difference(set_B)

# 12. Set Difference
set_difference = set_A - set_B

# 13. Empty Set
empty_set = set()

# 14. Universal Set (defined earlier)

# 15. Commutative Property
commutative_union = set_A.union(set_B) == set_B.union(set_A)
commutative_intersection = set_A.intersection(set_B) == set_B.intersection(set_A)

# 16. Associative Property
associative_union = (set_A.union(set_B)).union(set_C) == set_A.union(set_B.union(set_C))
associative_intersection = (set_A.intersection(set_B)).intersection(set_C) == set_A.intersection(set_B.intersection(set_C))

# 17. Distributive Property
distributive_1 = set_A.intersection(set_B.union(set_C)) == (set_A.intersection(set_B)).union(set_A.intersection(set_C))
distributive_2 = set_A.union(set_B.intersection(set_C)) == (set_A.union(set_B)).intersection(set_A.union(set_C))

# 18. Idempotent Property (for demonstration, considering union)
idempotent_union = set_A.union(set_A) == set_A

# 19. Absorption Property
absorption_1 = set_A.intersection(set_A.union(set_B)) == set_A
absorption_2 = set_A.union(set_A.intersection(set_B)) == set_A

# 20. De Morgan's Laws (complements taken relative to universal_set)
demorgan_1 = universal_set - set_A.union(set_B) == (universal_set - set_A).intersection(universal_set - set_B)
demorgan_2 = universal_set - set_A.intersection(set_B) == (universal_set - set_A).union(universal_set - set_B)

# 21. Partition
# Define partition function

# 22. Binary Relation
# Define binary relation function

# 23. Equivalence Relation
# Define equivalence relation function

# 24. Equivalence Classes
# Define equivalence classes function

# 25. Cantor's Theorem: |P(A)| > |A|; for a finite set, |P(A)| = 2^|A|
cantor_theorem = len(power_set_A) == 2 ** cardinality_A > cardinality_A

# Print results for demonstration
print("Set A:", set_A)
print("Set B:", set_B)
print("Cardinality of Set A:", cardinality_A)
print("Union of Set A and Set B:", union_set)
print("Intersection of Set A and Set B:", intersection_set)
print("Complement of Set A:", complement_A)
print("Is Set A a subset of Set B:", is_subset)
print("Is Set A a proper subset of Set B:", is_proper_subset)
print("Are Set A and Set B disjoint:", is_disjoint)
print("Power set of Set A:", power_set_A)
print("Cartesian product of Set A and Set B:", cartesian_product)
print("Symmetric difference of Set A and Set B:", symmetric_difference)
print("Set difference of Set A and Set B:", set_difference)
print("Empty Set:", empty_set)
print("Is Union commutative:", commutative_union)
print("Is Intersection commutative:", commutative_intersection)
print("Is Union associative:", associative_union)
print("Is Intersection associative:", associative_intersection)
print("Distributive Property 1:", distributive_1)
print("Distributive Property 2:", distributive_2)
print("Is Union idempotent:", idempotent_union)
print("Absorption Property 1:", absorption_1)
print("Absorption Property 2:", absorption_2)
print("De Morgan's Laws 1:", demorgan_1)
print("De Morgan's Laws 2:", demorgan_2)
print("Cantor's Theorem:", cantor_theorem)
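
Items 21 to 24 above are left as placeholders. A minimal sketch of how they might be filled in follows. The function names and the same-parity relation are illustrative assumptions, not part of the original cell.

```python
from itertools import product

def is_partition(blocks, universe):
    """True if the blocks are non-empty, pairwise disjoint, and cover universe."""
    blocks = [set(b) for b in blocks]
    if any(not b for b in blocks):
        return False
    covered = set()
    for b in blocks:
        if covered & b:  # overlaps an earlier block
            return False
        covered |= b
    return covered == set(universe)

def is_equivalence_relation(relation, universe):
    """Check reflexivity, symmetry and transitivity of a set of (a, b) pairs."""
    reflexive = all((a, a) in relation for a in universe)
    symmetric = all((b, a) in relation for (a, b) in relation)
    transitive = all(
        (a, d) in relation
        for (a, b), (c, d) in product(relation, repeat=2)
        if b == c
    )
    return reflexive and symmetric and transitive

def equivalence_classes(relation, universe):
    """Group universe into classes; assumes relation is an equivalence relation."""
    return {frozenset(b for b in universe if (a, b) in relation) for a in universe}

universe = {1, 2, 3, 4, 5}
# Binary relation on universe: a ~ b when a and b have the same parity
same_parity = {(a, b) for a, b in product(universe, repeat=2) if a % 2 == b % 2}

classes = equivalence_classes(same_parity, universe)
print("Is same_parity an equivalence relation:", is_equivalence_relation(same_parity, universe))
print("Equivalence classes:", classes)
print("Do the classes partition the universe:", is_partition(classes, universe))
```

The equivalence classes of an equivalence relation always form a partition of the underlying set, which is what the last check illustrates.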


Given that you're still encountering out-of-memory errors even after attempting to mitigate them, we need to consider other potential solutions or identify if there are any deeper issues with your setup. Here are some additional steps to troubleshoot the problem:

1. **Reduce Model Size**: If your model is extremely large (which it seems to be, given the `Total no of parameters is : 2506172416`), consider using a smaller model. Large models have high memory requirements, which may not be feasible for your current hardware.

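2. **Check the `chars_per_token` Setting**: `chars_per_token` is an estimate of how many characters make up one token, not a property of the tokenizer itself. In typical `ConstantLengthDataset` implementations it only sizes the text buffer that is filled before tokenization, so a high value mainly increases host (CPU) memory use. Lowering it can shrink that buffer, but it is unlikely to change GPU memory consumption, which is driven by sequence length and batch size.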

3. **Disable Features Not Supported by Tokenizer**: Your output indicates that "FIM is not supported by tokenizer, disabling FIM". Make sure that any features or processing that are not compatible with your model or tokenizer are disabled to avoid unnecessary memory usage.

4. **Model Parallelism**: If your model is too large for a single GPU, you may need to use model parallelism, where different parts of the model are placed on different GPUs. This is more complex to set up and may require manual modification of your model.

5. **Use a Machine with More Memory**: At this point, if you've tried everything and are still running out of memory, you may need to use a machine with more GPU memory or distribute the training across multiple GPUs.

6. **Check Model Implementation for Memory Leaks**: Sometimes, custom implementations of models or layers can have memory leaks that cause excessive memory usage. Verify that your model implementation is efficient and does not inadvertently hold onto memory that should be freed.

7. **Check Data Pipeline**: Analyze your data pipeline to ensure that it's not unintentionally loading too much data into memory at once. Consider using lazy loading or on-the-fly data processing to minimize memory footprint.
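
To see why a model of this size strains a single GPU, here is a rough back-of-the-envelope estimate. It assumes full fine-tuning with an Adam-style optimizer and fp32 weights, gradients and optimizer states (16 bytes per parameter), and it ignores activations. Your actual setup may differ, so treat the function and its defaults as illustrative.

```python
def training_memory_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients and Adam states, excluding activations.

    The default of 16 bytes/param assumes fp32 weights (4) + fp32 gradients (4)
    + two fp32 Adam moment buffers (8).
    """
    return num_params * bytes_per_param / 2**30

num_params = 2_506_172_416  # parameter count reported in the training log
print(f"{training_memory_gib(num_params):.1f} GiB")  # about 37.3 GiB
```

Under these assumptions the model states alone take roughly as much memory as the GPU has, before any activations are stored.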

Since you are using a large model with over 2.5 billion parameters, you're at the edge of what's feasible with a single GPU, even with one that has nearly 40 GB of memory. If none of the above steps help, you might need…

Given the persistent nature of the CUDA out of memory error despite attempting various strategies to mitigate it, we need to consider some additional, more drastic measures to resolve the issue. It's evident that the model you are trying to train is exceptionally large, with over 2.5 billion trainable parameters. Training such a model requires careful planning around memory management, especially on a single GPU setup.

### Re-evaluate the Training Setup

1. **Model Size**: If possible, consider using a smaller model. Training models with billions of parameters is challenging without access to an infrastructure that supports model parallelism across multiple GPUs or nodes.

2. **Model Parallelism**: For models of this size, data parallelism (splitting data across multiple GPUs) might not be enough. Look into model parallelism libraries like DeepSpeed or FairScale which can split the model itself across multiple GPUs, thereby reducing the memory requirements on any single GPU.

3. **Offloading**: DeepSpeed and similar frameworks offer features like offloading parts of the model and/or the optimizer states to the CPU, which can drastically reduce GPU memory usage. This does come at the cost of increased data transfer between CPU and GPU, potentially slowing down your training, but it can make training feasible when it otherwise wouldn't be.

4. **Activation Checkpointing**: While you've already enabled gradient checkpointing, ensure you're implementing it as efficiently as possible. Libraries like DeepSpeed offer more advanced checkpointing strategies that significantly reduce memory usage at the cost of additional compute.

5. **Use a High-Memory GPU**: If you're limited by the hardware, consider training on a GPU with more memory. For instance, NVIDIA's A100 GPUs come with up to 80 GB of memory, which could accommodate larger models or allow for larger batch sizes.

6. **Distributed Training**: If you have access to multiple GPUs across several nodes, consider setting up distributed training. This can be complex to configure but allows you to leverage collective hardware resources more effectively.
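
As an illustration of the offloading option, here is a minimal sketch of a DeepSpeed configuration that uses ZeRO stage 2 and moves optimizer states to the CPU. The chosen stage, the file name `ds_config.json` and the helper `write_config` are assumptions for this example. The `"auto"` values rely on the Hugging Face `Trainer` integration filling them in from `TrainingArguments`.

```python
import json

# Hypothetical DeepSpeed configuration: ZeRO stage 2 with optimizer state offloaded to CPU.
# "auto" values are resolved by the Hugging Face Trainer integration from TrainingArguments.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "fp16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

def write_config(path: str = "ds_config.json") -> str:
    """Write the config to disk and return the path."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```

The resulting file path (or the dictionary itself) can then be passed through the `deepspeed` argument of `TrainingArguments`. Expect slower steps, because optimizer updates then run on the CPU.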

### Technical Considerations

1. **Pre-Training**: For very large models, it might be more feasible to fine-tune from a pre-trained checkpoint rather than training from scratch. If you're not already doing so, consider using a pre-trained version…

To fix the error in your code, you will need to address the CUDA out of memory issue. Here's what you can do to mitigate it:

1. **Correct the Training Dataset**: Ensure that you are using the correct dataset for training. Pass `train_datasets` rather than `eval_datasets` as the `train_dataset` argument in the `Trainer` setup:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets,  # Use the correct training dataset here
    eval_dataset=eval_datasets,
    data_collator=default_data_collator,
)
```

2. **Reduce Memory Usage**: Since you are already using the smallest batch size, consider other methods to reduce memory usage.

   - **Reduce Sequence Length**: If `args.max_seq_length` is too large, consider reducing it to a smaller number that still captures the necessary context for your task.
   - **Enable Mixed Precision Training**: If your GPU supports it, enable mixed precision training by setting `fp16=True` in your `TrainingArguments`.
   - **Check Gradient Checkpointing**: Gradient checkpointing lowers memory use by recomputing activations during the backward pass, so disabling it normally increases memory consumption. Set `gradient_checkpointing=False` only as a temporary diagnostic if you suspect it is incompatible with your model or not properly implemented, and re-enable it afterwards.

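3. **Clear CUDA Cache**: Before starting your training, you can clear the CUDA cache to release memory left over from previous operations. This only returns cached blocks that PyTorch no longer uses; tensors that are still referenced stay allocated: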

```python
import torch

torch.cuda.empty_cache()
```

4. **Monitor GPU Usage**: Continuously monitor your GPU memory usage using `nvidia-smi` to ensure that no other processes are using the GPU while you are trying to train your model.
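
If you prefer to check GPU memory from Python rather than watching the terminal, a small wrapper around `nvidia-smi` is one option. The sketch below assumes `nvidia-smi` is on your `PATH` and reports numeric values. The helper names `parse_gpu_memory` and `gpu_memory` are made up for this example.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse lines such as '0, 1024, 40960' into dicts with values in MiB."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "total_mib": total})
    return rows

def gpu_memory() -> list[dict]:
    """Return per-GPU memory use, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)

for gpu in gpu_memory():
    print(f"GPU {gpu['gpu']}: {gpu['used_mib']} / {gpu['total_mib']} MiB in use")
```

Running this just before `trainer.train()` shows whether another process already holds a large share of the GPU memory.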

After applying these changes, your modified code should look something like this:

```python
import torch
from transformers import Trainer, TrainingArguments, default_data_collator
from datasets import load_dataset

# ... (your previous setup code) ...

# Clear CUDA cache
torch.cuda.empty_cache()

# Update the training arguments to enable mixed precision training
training_args = TrainingArguments(
    # ... (other arguments)
    fp16=True,  # Enable mixed precision training
    gradient_checkpointing=False,  # Diagnostic only: this raises memory use; re-enable once compatibility is confirmed
    # ... (other arguments)
)

# Initialize Trainer with the correct datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets,  # Corrected to use the training dataset
    eval_dataset=eval_datasets,
    data_collator=default_data_collator,
)

# Start training
trainer.train()
```

Try running the training process after making these changes. If you still encounter out-of-memory errors, you may need to further reduce the `max_seq_length` or consider using a machine with a GPU that has more memory.

It looks like you are setting up a training and evaluation pipeline using a custom `ConstantLengthDataset` and the `transformers.Trainer` API. However, there's an issue with your `trainer` setup: you are using the same dataset for both training and evaluation (`eval_datasets`). You should use `train_datasets` for training:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets,  # This should be your training dataset
    eval_dataset=eval_datasets,
    data_collator=default_data_collator,
)
```

Additionally, there are several other potential issues and improvements you can address to prevent the CUDA out of memory error:

1. **Reduce Sequence Length**: If `args.max_seq_length` is too high, consider reducing it. Longer sequences require more memory.

2. **Gradient Checkpointing**: You've enabled gradient checkpointing with `gradient_checkpointing=True`, which should help reduce memory usage by trading off computation. However, keep in mind that gradient checkpointing can be less effective if your model doesn't have many layers or if it's not implemented properly in the model you are using.

3. **Remove Unnecessary Features**: You have `remove_unused_columns=False`. If your dataset contains columns that are not used during training, setting this to `True` might save memory.

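4. **Mixed Precision Training**: You currently have `bf16=False`. If your hardware supports it, enable automatic mixed precision (AMP) by setting `fp16=True`, or `bf16=True` on GPUs with bfloat16 support (NVIDIA Ampere or newer). AMP can reduce activation memory, although the savings on weights and optimizer states are limited because a full-precision copy of the weights is kept.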

5. **Training Schedule**: You have both `max_steps=1000` and `num_train_epochs=50`. When `max_steps` is positive it overrides `num_train_epochs`, so training stops after 1000 steps. Neither setting affects peak GPU memory, which is determined per step, so shortening the schedule reduces total training time but will not prevent an out-of-memory error.

6. **Per Device Batch Size**: You have `per_device_train_batch_size=1`, which is already the smallest batch size. If you're still running out of memory, it's likely not due to the batch size (unless your sequences are extremely long).

7. **Warmup Steps**: The `warmup_steps=100` might not be an issue, but ensure this is an appropriate number for your training regime.

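8. **Learning Rate**: The initial `learning_rate=0.001` is well above the `TrainingArguments` default of `5e-5` and may be too high for fine-tuning a model of this size. It does not affect memory, but make sure it's suitable for your model and training data.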

9. **Save Total Limit**: The `save_total_limit=5` should also be fine, but make sure you have enough disk space for the model checkpoints.

10. **Push to Hub**: You have `push_to_hub=True`, which means after training, the model will be pushed to the Hugging Face Model Hub. Ensure this is intended, and you have the necessary authentication set up.
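
To make the sequence-length point from items 1 and 6 concrete, the sketch below estimates the size of one layer's attention score matrix, which grows with the square of the sequence length. It assumes standard attention that materializes the full score matrix in 16-bit precision; fused attention kernels avoid storing it. `num_heads=32` is a hypothetical value, not taken from your model.

```python
def attention_scores_mib(batch_size: int, num_heads: int, seq_len: int,
                         bytes_per_element: int = 2) -> float:
    """Memory of one layer's attention score matrix (batch x heads x seq x seq) in MiB."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element / 2**20

# num_heads=32 is a hypothetical value; read the real one from your model config.
for seq_len in (1024, 2048, 4096):
    mib = attention_scores_mib(batch_size=1, num_heads=32, seq_len=seq_len)
    print(f"seq_len={seq_len}: {mib:.0f} MiB per layer")
```

Doubling the sequence length quadruples this term, which is why reducing `max_seq_length` can help even when the batch size is already 1.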

After you've made these adjustments, try running `trainer.train()` again and monitor your GPU memory usage to ensure it doesn't run out of memory. If you still encounter memory issues, you might need to consider more aggressive changes or using a different setup with more GPU memory.