Chunking over frequency instead of time #39

Merged: 8 commits into desh2608:master on Nov 20, 2023

Conversation

@popcornell (Contributor) commented Nov 14, 2023

Most of the code comes from @boeddeker, who also raised the underlying issue in #33.

I am re-running the code on CHiME-7 to see if it will match the previous version.
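
Roughly, the pattern is the following: GSS estimates its spatial mixture model independently per frequency bin, so the STFT tensor can be split along the frequency axis instead of the time axis. A minimal numpy sketch with illustrative names and shapes (not the repo's actual API):

import numpy as np

def process_in_freq_chunks(Obs, num_chunks, process_fn):
    # Obs: complex STFT tensor of shape (D, T, F) = (mics, frames, freq bins).
    # Splitting along F bounds peak GPU memory; chunks are processed one by one.
    chunks = np.array_split(Obs, num_chunks, axis=-1)
    return np.concatenate([process_fn(c) for c in chunks], axis=-1)

# Toy usage with an identity "processor":
Obs = np.random.randn(4, 100, 513) + 1j * np.random.randn(4, 100, 513)
out = process_in_freq_chunks(Obs, num_chunks=3, process_fn=lambda c: c)
assert np.allclose(out, Obs)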

-    def __call__(self, Obs, acitivity_freq):
-        initialization = cp.asarray(acitivity_freq, dtype=cp.float64)
+    def __call__(self, Obs, activity_freq):
@popcornell (Contributor, Author):

I've just corrected the spelling here.

@desh2608 (Owner):

Awesome! I'll merge once you can verify that the performance remains unchanged (which I believe it should) :)

@popcornell (Contributor, Author):

There seems to be some inconsistency when I change the number of GPUs (I don't think this depends on this PR, however).
It seems that the more GPUs I use, the higher the memory occupation?!

With 3 GPUs:

2023-11-15:20:42:15,574 INFO     [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:42:20,890 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:32,172 INFO     [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:42:32,706 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:32,863 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:42:46,44 INFO     [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:42:46,350 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:46,589 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:42:46,810 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 4 chunks.
2023-11-15:20:42:47,21 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 5 chunks.
2023-11-15:20:43:04,114 INFO     [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)

With 2 GPUs:

2023-11-15:20:46:13,89 INFO     [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:46:20,761 INFO     [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:46:30,84 INFO     [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:46:30,625 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:46:30,859 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:46:31,79 WARNING  [enhancer.py:247] Out of memory error while processing the batch. Trying again with 4 chunks.
2023-11-15:20:46:42,373 INFO     [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)

With 1 GPU:

2023-11-15:20:44:19,322 INFO     [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:44:26,218 INFO     [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:44:30,689 INFO     [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:44:38,79 INFO     [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)
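
For reference, the "Trying again with N chunks" fallback in these logs follows roughly this pattern (a simplified sketch, not the exact enhancer.py code):

import logging

import cupy as cp

def enhance_with_fallback(batch, process_fn, max_chunks=8):
    # On GPU out-of-memory, retry the batch split into more (smaller) chunks.
    num_chunks = 1
    while True:
        try:
            return process_fn(batch, num_chunks)
        except cp.cuda.memory.OutOfMemoryError:
            if num_chunks >= max_chunks:
                raise
            num_chunks += 1
            logging.warning(
                "Out of memory error while processing the batch. "
                "Trying again with %d chunks.", num_chunks
            )
            # Return cached blocks to the device before retrying.
            cp.get_default_memory_pool().free_all_blocks()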

@desh2608 (Owner):

> There seems to be some inconsistency when I change the number of GPUs (I don't think this depends on this PR, however). It seems that the more GPUs I use, the higher the memory occupation?! [...]

That's strange. I have never seen this happen before. Can you check if your GPUs are configured to not share memory?

@popcornell (Contributor, Author):

> That's strange. I have never seen this happen before. Can you check if your GPUs are configured to not share memory?

Yep, they were in DEFAULT mode. I changed them to exclusive mode and it does not happen anymore.
Maybe I should put a check for the GPUs' compute mode into the code?

@boeddeker (Contributor):

> Maybe I should put a check for the GPUs' compute mode into the code?

Something to prevent it would be great; I had this issue in CHiME-7.
It would be great to have the check work so that the user doesn't have to change the mode by hand.

@desh2608 (Owner):

I think it should be sufficient to add this to the README (perhaps as an FAQ) instead of restricting certain modes in the processing. OOM issues can happen for a variety of reasons, such as GPU memory not being cleared by a previous process or misconfigured nodes, and we cannot expect to solve all such problems.

@desh2608 (Owner) left a review comment:

The changes look good to me. I'll wait in case you want to add something to the README about the GPU configuration. Let me know when it's ready to merge.

     initialization = cp.where(initialization == 0, 1e-10, initialization)
     initialization = initialization / cp.sum(initialization, keepdims=True, axis=0)
-    initialization = cp.repeat(initialization[None, ...], 513, axis=0)
+    initialization = cp.repeat(initialization[None, ...], F, axis=0)
@desh2608 (Owner):

Good catch!
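
Presumably the reason this matters: 513 is the bin count of a 1024-point STFT (1024 // 2 + 1) and only holds when a chunk spans the full spectrum; with frequency chunking, each chunk carries just F bins. A toy numpy illustration of the corrected broadcast (shapes are hypothetical):

import numpy as np

num_spk, T, F = 3, 200, 171          # e.g. one third of 513 bins in this chunk
activity = np.random.rand(num_spk, T)
init = np.where(activity == 0, 1e-10, activity)
init = init / init.sum(axis=0, keepdims=True)   # normalize across speakers
init = np.repeat(init[None, ...], F, axis=0)    # shape (F, num_spk, T), not (513, ...)
assert init.shape == (F, num_spk, T)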

@popcornell (Contributor, Author) commented Nov 16, 2023

I think I can grep the compute mode from nvidia-smi -q, but I'm not sure it will work on all clusters out there.
I can, however, add an additional arg to disable this check, with a big warning.
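
Concretely, the check could look something like this (a sketch only; parsing nvidia-smi -q output is exactly the portability concern above, and in Default compute mode several processes can share one GPU, which seems to be what compounded the memory use here):

import subprocess

def default_mode_gpus():
    # `nvidia-smi -q` prints one "Compute Mode" field per GPU
    # (e.g. Default, Exclusive_Process, Prohibited).
    out = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    ).stdout
    modes = [
        line.split(":", 1)[1].strip()
        for line in out.splitlines()
        if "Compute Mode" in line
    ]
    # Indices of GPUs still in Default (shared) compute mode.
    return [i for i, m in enumerate(modes) if m == "Default"]

if __name__ == "__main__":
    shared = default_mode_gpus()
    if shared:
        raise RuntimeError(
            f"GPUs {shared} are in Default compute mode; consider "
            "Exclusive_Process to keep jobs from sharing a device."
        )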

@desh2608 (Owner):

> I think I can grep the compute mode from nvidia-smi -q, but I'm not sure it will work on all clusters out there. I can, however, add an additional arg to disable this check, with a big warning.

You could put these instructions in the README, so that users running the code can check. No need to add it in the code, IMO.

@popcornell (Contributor, Author):

After discussing offline with @boeddeker, we added a new utility called gpu_check.
The idea is to use it like this (e.g. in the CHiME-7 asr1 recipe):

$cmd JOB=1:$nj  ${exp_dir}/${dset_name}/${dset_part}/log/enhance.JOB.log \
    gss utils gpu_check $nj $cmd \& gss enhance cuts \
      ${exp_dir}/${dset_name}/${dset_part}/cuts.jsonl.gz ${exp_dir}/${dset_name}/${dset_part}/split$nj/cuts_per_segment.JOB.jsonl.gz \
       ${exp_dir}/${dset_name}/${dset_part}/enhanced \
      --bss-iterations $gss_iterations \
      --context-duration 15.0 \
      --use-garbage-class \
      --min-segment-length 0.0 \
      --max-segment-length $max_segment_length \
      --max-batch-duration $max_batch_duration \
      --max-batch-cuts 1 \
      --num-buckets 4 \
      --num-workers 4 \
      --force-overwrite \
      --duration-tolerance 3.0 \
       ${affix} || exit 1

However, when used like this, it will not exit when gpu_check raises an exception.
My bash is bad; do you know how to make it exit?

@desh2608 (Owner):

> After discussing offline with @boeddeker, we added a new utility called gpu_check. [...] However, when used like this, it will not exit when gpu_check raises an exception. My bash is bad; do you know how to make it exit?

The exit should work if any of the jobs fails. But I think this whole GPU check thing is overkill. GPU memory issues can happen in any program, and I don't see why they need to be handled in this repo specifically. I can add it if you are using it in ESPNet, but I personally think this is not the right place to solve this issue.

@popcornell (Contributor, Author):

In the meantime, I confirm I get the same results as with the old version on CHiME-7:

###################################################
### Metrics for all Scenarios ###
###################################################
+----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
|    | scenario   |   num spk hyp |   num spk ref |   tot utterances hyp |   tot utterances ref |   hits |   substitutions |   deletions |   insertions |      wer |
|----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------|
|  0 | chime6     |             8 |             8 |                 6644 |                 6644 |  42884 |           11672 |        4325 |         3107 | 0.324451 |
|  0 | dipco      |            20 |            20 |                 3673 |                 3673 |  22175 |            5817 |        1974 |         2210 | 0.333745 |
|  0 | mixer6     |           118 |           118 |                14804 |                14804 | 126632 |           15991 |        6358 |         7815 | 0.202469 |
+----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
####################################################################
### Macro-Averaged Metrics across all Scenarios (Ranking Metric) ###
####################################################################
+----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
|    | scenario      |   num spk hyp |   num spk ref |   tot utterances hyp |   tot utterances ref |   hits |   substitutions |   deletions |   insertions |      wer |
|----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------|
|  0 | macro-average |       48.6667 |       48.6667 |              8373.67 |              8373.67 |  63897 |           11160 |        4219 |      4377.33 | 0.286888 |
+----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+

@popcornell (Contributor, Author):

Added some lines to the README.md.

@desh2608 (Owner) left a review comment:

Thanks for the contribution!

@desh2608 merged commit e74d9f4 into desh2608:master on Nov 20, 2023
1 check passed
Successfully merging this pull request may close these issues.

Chunking along time frames to save GPU Ram? Why not along frequency dim?