# Training a KGE model on KINSHIP with torchDDP (128 CPUs, two NVIDIA GeForce RTX 3090 GPUs)
```torchrun --standalone --nproc_per_node=gpu main.py --model 'ComplEx' --embedding_dim 32 --num_epochs 500 --path_dataset_folder 'KGs/KINSHIP' --trainer torchDDP --eval_mode 'test'```

### Here are the last few lines of the log file:
```
Global1 | Local1 | Epoch:500 | Batch:2 | Loss:0.036598145961761475 |ForwardBackwardUpdate:0.00sec | BatchConst.:0.01sec
Global0 | Local0 | Epoch:500 | Batch:2 | Loss:0.03864430636167526 |ForwardBackwardUpdate:0.00sec | BatchConst.:0.01sec
Done ! It took 12.274 seconds.
Done ! It took 12.276 seconds.

*** Save Trained Model ***
Took 0.0007 seconds | Current Memory Usage  3189.0 in MB
Total computation time: 12.364 seconds
*** Save Trained Model ***
Took 0.0008 seconds | Current Memory Usage  2645.0 in MB
Total computation time: 12.364 seconds
Evaluate ComplEx on Test set: Evaluate ComplEx on Test set: Evaluate ComplEx on Test set
{'H@1': 0.6191806331471136, 'H@3': 0.8514897579143389, 'H@10': 0.9706703910614525, 'MRR': 0.7481822177321609}
Evaluate ComplEx on Test set
{'H@1': 0.6191806331471136, 'H@3': 0.8514897579143389, 'H@10': 0.9706703910614525, 'MRR': 0.7481822177321609}
```
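
For orientation, the `H@1`, `H@3`, `H@10` and `MRR` entries in the dictionaries above are the usual link-prediction metrics: `H@k` is the fraction of test queries whose correct entity is ranked in the top k, and `MRR` is the mean of the reciprocal ranks. The sketch below computes both from a list of ranks. The function name `summarize_ranks` and the sample ranks are illustrative assumptions and are not taken from this repository.

```python
from typing import Dict, Iterable, Sequence


def summarize_ranks(ranks: Sequence[int], ks: Iterable[int] = (1, 3, 10)) -> Dict[str, float]:
    """Compute Hits@k and MRR from 1-based ranks of the correct entities."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    # Hits@k: share of queries whose correct entity is ranked at position k or better.
    results = {f"H@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # MRR: average of 1/rank over all queries.
    results["MRR"] = sum(1.0 / r for r in ranks) / n
    return results


# Made-up ranks for five test queries (not from the KINSHIP run).
example = summarize_ranks([1, 2, 1, 12, 4])
print(example)
```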

### We see two evaluation results because we have two GPUs.
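
If a single report is preferred, one common pattern is to let only the worker with global rank 0 print. `torchrun` exports the `RANK` environment variable to every worker it starts. The helpers `is_main_process` and `report` below are hypothetical and not part of this repository; they only sketch the idea.

```python
import os
from typing import Dict, Mapping, Optional


def is_main_process(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True for global rank 0, or when RANK is unset (plain `python` run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def report(results: Dict[str, float], env: Optional[Mapping[str, str]] = None) -> bool:
    """Print results only on the main process; return whether anything was printed."""
    if is_main_process(env):
        print(results)
        return True
    return False


# Simulate the two workers of the run above by passing explicit environments.
scores = {"H@1": 0.6191806331471136, "MRR": 0.7481822177321609}
printed = [report(scores, {"RANK": str(rank)}) for rank in range(2)]
```

This only suppresses the duplicate output. Every worker would still run the evaluation unless that step is guarded as well.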

### Training a KGE model with CPU
```torchrun --standalone --nproc_per_node=gpu main.py --model 'ComplEx' --embedding_dim 32 --num_epochs 500 --path_dataset_folder 'KGs/KINSHIP' --trainer torchCPU --eval_mode 'test'```

```
Epoch:500 | Batch:1 | Loss:0.05302952229976654 |ForwardBackwardUpdate:0.00secs | Mem. Usage  497.07MB
Epoch:500 | Batch:2 | Loss:0.0575931221 |ForwardBackwardUpdate:0.00sec | BatchConst.:0.01sec | Mem. Usage  497.07MB  avail. 1.5 %
Epoch:500 | Batch:3 | Loss:0.0585726425 |ForwardBackwardUpdate:0.00sec | BatchConst.:0.01sec | Mem. Usage  497.07MB  avail. 1.5 %
Epoch:500 | Batch:4 | Loss:0.0551908500 |ForwardBackwardUpdate:0.00sec | BatchConst.:0.00sec | Mem. Usage  497.07MB  avail. 1.5 %
Done ! It took 23.230 seconds.
*** Save Trained Model ***
Took 0.0006 seconds | Current Memory Usage  497.25 in MB
Total computation time: 23.308 seconds
Evaluate ComplEx on Test set: Evaluate ComplEx on Test set
{'H@1': 0.6014897579143389, 'H@3': 0.8212290502793296, 'H@10': 0.9599627560521415, 'MRR': 0.7271467645821516}
```
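
To put the two KINSHIP runs side by side, the short calculation below uses only the numbers reported in the logs above. It is plain arithmetic, and the variable names are illustrative.

```python
# Training times reported by "Done ! It took ..." in the two logs (seconds).
ddp_seconds = 12.276  # the slower of the two torchDDP workers
cpu_seconds = 23.230

# Test-set metrics copied from the evaluation dictionaries in the logs.
ddp_metrics = {"H@1": 0.6191806331471136, "H@3": 0.8514897579143389,
               "H@10": 0.9706703910614525, "MRR": 0.7481822177321609}
cpu_metrics = {"H@1": 0.6014897579143389, "H@3": 0.8212290502793296,
               "H@10": 0.9599627560521415, "MRR": 0.7271467645821516}

# Ratio of training times, and per-metric difference (torchDDP minus torchCPU).
speedup = cpu_seconds / ddp_seconds
deltas = {name: ddp_metrics[name] - cpu_metrics[name] for name in ddp_metrics}

print(f"torchDDP vs torchCPU training time ratio: {speedup:.2f}x")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Each configuration was run once here, so the metric differences may reflect run-to-run variation rather than an effect of the trainer.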

# Multi-node GPU training
Execute the following command on node 1:
```torchrun --nnodes 2 --nproc_per_node=gpu  --node_rank 0 --rdzv_id 456 --rdzv_backend c10d --rdzv_endpoint=nebula main.py --model 'ComplEx' --embedding_dim 32 --num_epochs 100 --path_dataset_folder 'KGs/WN18RR' --trainer torchDDP```
Execute the following command on node 2:
```torchrun --nnodes 2 --nproc_per_node=gpu  --node_rank 1 --rdzv_id 456 --rdzv_backend c10d --rdzv_endpoint=nebula main.py --model 'ComplEx' --embedding_dim 32 --num_epochs 100 --path_dataset_folder 'KGs/WN18RR' --trainer torchDDP```
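
The two commands differ only in `--node_rank`. A minimal sketch that builds both from one template, so the shared flags cannot drift apart, could look as follows. `build_command` is a hypothetical helper, and all flag values are copied from the commands above.

```python
import shlex
from typing import List


def build_command(node_rank: int, nnodes: int = 2, endpoint: str = "nebula") -> List[str]:
    """Return the torchrun argument list for one node of the job above."""
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc_per_node=gpu",
        "--node_rank", str(node_rank),
        "--rdzv_id", "456",
        "--rdzv_backend", "c10d",
        f"--rdzv_endpoint={endpoint}",
        "main.py",
        "--model", "ComplEx",
        "--embedding_dim", "32",
        "--num_epochs", "100",
        "--path_dataset_folder", "KGs/WN18RR",
        "--trainer", "torchDDP",
    ]


for rank in range(2):
    # shlex.join quotes arguments where needed, so each line can be pasted into a shell.
    print(shlex.join(build_command(rank)))
```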

### Node 1
```
Global:3 | Local:1 | Epoch:100 | Loss:0.00011253 | Runtime:0.042mins
Global:2 | Local:0 | Epoch:100 | Batch:26 | Loss:0.00011469019227661192 |ForwardBackwardUpdate:0.01sec | BatchConst.:0.09sec
Global:2 | Local:0 | Epoch:100 | Loss:0.00011178 | Runtime:0.042mins
Done ! It took 4.440 minutes.
Done ! It took 4.441 minutes.
```

### Node 2
```
Global:1 | Local:1 | Epoch:100 | Batch:25 | Loss:0.00011904298298759386 |ForwardBackwardUpdate:0.01sec | BatchConst.:0.09sec
Global:0 | Local:0 | Epoch:100 | Batch:25 | Loss:0.00011089341569459066 |ForwardBackwardUpdate:0.01sec | BatchConst.:0.09sec
Global:1 | Local:1 | Epoch:100 | Batch:26 | Loss:0.00011964481382165104 |ForwardBackwardUpdate:0.01sec | BatchConst.:0.08sec
Epoch:100 | Loss:0.00011271 | Runtime:0.042mins
Global:0 | Local:0 | Epoch:100 | Batch:26 | Loss:9.990083344746381e-05 |ForwardBackwardUpdate:0.01sec | BatchConst.:0.05sec
Epoch:100 | Loss:0.00010982 | Runtime:0.042mins
Done ! It took 4.421 minutes.
Done ! It took 4.419 minutes.
```
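
The `Global` and `Local` fields in these logs are the global rank and the per-node rank of each worker. Assuming every node starts the same number of workers, the global rank can be reconstructed from the node's group rank and the local rank, which `torchrun` exposes as the `GROUP_RANK`, `LOCAL_RANK` and `LOCAL_WORLD_SIZE` environment variables. The function `global_rank` below is an illustrative helper, not code from this repository.

```python
from typing import Dict, Tuple


def global_rank(group_rank: int, local_rank: int, local_world_size: int) -> int:
    """Global rank of a worker, assuming the same number of workers on every node."""
    if not 0 <= local_rank < local_world_size:
        raise ValueError("local_rank must lie in [0, local_world_size)")
    return group_rank * local_world_size + local_rank


# Two nodes with two GPUs each, as in the logs above.
layout: Dict[Tuple[int, int], int] = {
    (group, local): global_rank(group, local, local_world_size=2)
    for group in range(2)
    for local in range(2)
}
print(layout)
```

In the logs above, the machine labelled Node 1 reports global ranks 2 and 3, which corresponds to group rank 1 in this scheme. With the `c10d` rendezvous backend, ranks appear to be assigned during rendezvous rather than taken directly from `--node_rank`, so this is not necessarily a misconfiguration.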

# TODO: PyTorch Lightning Trainer

Setting `--trainer PL` trains the knowledge graph embedding model with PyTorch Lightning.

### Training with two GPUs
```python main.py --model 'DistMult' --trainer PL --num_epochs 10 --gpus 2```

### Using two GPUs with DDP
```python main.py --model 'DistMult' --trainer PL --num_epochs 10 --accelerator gpu```


### Using two GPUs with DDP and low precision
```python main.py --model 'DistMult' --trainer PL --num_epochs 50 --accelerator gpu --precision 16```

```python main.py --model 'DistMult' --trainer PL --num_epochs 50 --accelerator gpu --precision bf16```
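
The two commands differ in the 16-bit format. Assuming the flag is passed through to PyTorch Lightning unchanged, `--precision 16` selects IEEE half precision (float16) and `bf16` selects bfloat16, which keeps the 8 exponent bits of float32 but only 7 mantissa bits. The sketch below illustrates the resulting trade-off using only the standard library. `to_bf16` emulates bfloat16 by truncating a float32, which is a simplification (real conversions typically round to nearest); both helper names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


# float16 has more mantissa bits, so it represents 0.1 more closely than bfloat16.
print(to_fp16(0.1), to_bf16(0.1))

try:
    to_fp16(70000.0)  # larger than the float16 maximum of 65504
except (OverflowError, struct.error) as err:
    print("float16 overflow:", err)

# bfloat16 shares float32's exponent range, so the value stays finite but coarse.
print(to_bf16(70000.0))
```

In short, float16 offers finer resolution for small values such as losses near zero, while bfloat16 is less prone to overflow.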


### Using two GPUs with model parallelism ([DeepSpeed ZeRO Stage 3](https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html#deepspeed-zero-stage-3-offload)) and low precision

```python main.py --path_dataset_folder 'KGs/YAGO3-10' --trainer PL --accelerator gpu --strategy deepspeed_stage_3 --precision 16```


### Using two GPUs with multi-node training
```torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 0 --rdzv_id 456 --rdzv_backend c10d --rdzv_endpoint=nebula main.py --path_dataset_folder 'KGs/YAGO3-10' --trainer torchDDP```


