
Out of Memory #12
Closed · 0three opened this issue Apr 29, 2019 · 8 comments


0three commented Apr 29, 2019

I ran this command on my lab server:
th main.lua --develop --name test-run --type float
and got an error like this:
{
maxPoolStride : 2
noProgress : false
name : "test-run"
learningRate : 0.001
transmissionJPEGU_yc : 5
batchSize : 12
develop : true
optimType : "adam"
adversaryFeatureDepth : 64
messageLength : 30
transmissionCropout : 0.4
transmissionDropout : 0.4
transmissionJPEGQuality : 50
type : "float"
transmissionCropSize : 0.5
decoderConvolutions : 6
loadCheckpoint : ""
fixImage : false
encoderPreMessageConvolution : 3
noSave : false
seed : 1234
maxPoolWindowSize : 4
transmissionGaussianSigma : 2
small : false
encoderFeatureDepth : 64
confusionPer : 20
imageSize : 128
savePer : 20
imagePenaltyCoef : 1
testPer : 1
save : "checkpoints"
transmissionJPEGU_yd : 0
fixMessage : false
epochs : 200
decoderFeatureDepth : 64
transmissionNoiseType : "identity"
thin : false
transmissionJPEGCutoff : 5
transmissionJPEGU_uvd : 0
transmissionJPEGU_uvc : 3
small16 : false
transmissionOutsize : 128
transmissionCombinedRecipe : ""
adversary_gradient_scale : 0.1
adversaryConvolutions : 2
messagePenaltyCoef : 1
grayscale : false
transmissionConcatenatedRecipe : ""
encoderPostMessageConvolution : 1
randomImage : false
}
{
beta1 : 0.9
epsilon : 1e-08
learningRateDecay : 0
learningRate : 0.001
beta2 : 0.999
}
Loading training dataset
Accepting non-grayscale input
test-run: starting to train

epoch: 1

slurmstepd: error: Detected 1 oom-kill event(s) in step 10160.1 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: wmc-slave-g6: task 0: Out Of Memory

It looks like I don't have enough memory to run this. I just cloned the code and ran the test command.

Could you please share the requirements for running this? Thank you!

By the way, I hope the pretrained models can be released for research.

Thank you!
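Note: the slurmstepd message above reports a cgroup out-of-memory kill, i.e. the step exceeded the memory limit Slurm granted it, not necessarily that the node ran out of physical RAM. Assuming a standard Slurm setup, the usual first things to try are requesting more memory for the step (for example `srun --mem=32G th main.lua ...`, with the value adjusted to what the cluster allows; the 32G figure is only a suggestion) or lowering `batchSize` / `imageSize` in the options printed above.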


0three commented Apr 29, 2019

My memory configuration:

cat /proc/meminfo
MemTotal: 131663888 kB
MemFree: 51571416 kB
MemAvailable: 95205204 kB
Buffers: 796904 kB
Cached: 44341812 kB
SwapCached: 2428 kB
Active: 62199800 kB
Inactive: 14218780 kB
Active(anon): 30307824 kB
Inactive(anon): 3407284 kB
Active(file): 31891976 kB
Inactive(file): 10811496 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 999420 kB
SwapFree: 949476 kB
Dirty: 148 kB
Writeback: 0 kB
AnonPages: 31277964 kB
Mapped: 2979180 kB
Shmem: 2435292 kB
Slab: 1860776 kB
SReclaimable: 1556396 kB
SUnreclaim: 304380 kB
KernelStack: 14688 kB
PageTables: 94932 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 66831364 kB
Committed_AS: 36091084 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 34013280 kB
DirectMap2M: 96712704 kB
DirectMap1G: 5242880 kB


0three commented Apr 29, 2019

srun nvidia-smi
Mon Apr 29 21:08:18 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:B2:00.0 Off | N/A |
| 23% 26C P8 15W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+


0three commented Apr 29, 2019

Thank you!

ando-khachatryan (Owner) commented

Hi,
I'm a bit confused; it seems you are running the Lua version. This repo is the PyTorch implementation. The Lua implementation is here: https://github.com/jirenz/HiDDeN


0three commented Apr 29, 2019

I'm terribly sorry, I mixed up the two repos.

But I also have a question about the PyTorch implementation.
In the noise layer, the PyTorch implementation uses apply_conv() to simulate JPEG compression in the DCT domain. The original source code, however, does it differently: it requires Image and uses Image.JPGCompression. That seems strange, and I can't find any DCT mask in the original source code (https://github.com/jirenz/HiDDeN).

Do you have any comments on this point?


ando-khachatryan commented Apr 29, 2019

OK, I'm not the author of the paper, but I'll do my best.
The jpegn.lua file in their repo is non-differentiable (see the very first comment line in the source file). That layer is not used for training, but for verifying that the differentiable approximation of JPEG is actually a good approximation. See Figure 5 and the related explanations in their paper.
The differentiable approximations of JPEG compression are defined in DCT_layer.lua. There may still be differences in implementation (theirs vs. mine), but they seem to be doing the same thing.
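
To make the idea concrete, here is a minimal sketch of a "JPEG-Mask"-style approximation: the 8×8 blockwise DCT is expressed as a fixed convolution, high-frequency coefficients are zeroed with a mask, and a transposed convolution with the same (orthonormal) filters maps back to pixel space. This is an illustration only, assuming a single-channel input whose height and width are divisible by 8; the helper names (`dct_basis`, `jpeg_mask`) are made up and it is not the exact code of either repository.

```python
# Sketch of a differentiable JPEG-Mask approximation (illustration, not repo code).
import math
import torch
import torch.nn.functional as F

def dct_basis():
    # 64 filters of size 8x8; filter (u, v) is the 2-D DCT-II basis function.
    basis = torch.zeros(64, 1, 8, 8)
    for u in range(8):
        for v in range(8):
            cu = math.sqrt(1 / 8) if u == 0 else math.sqrt(2 / 8)
            cv = math.sqrt(1 / 8) if v == 0 else math.sqrt(2 / 8)
            for x in range(8):
                for y in range(8):
                    basis[u * 8 + v, 0, x, y] = (
                        cu * cv
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                    )
    return basis

def jpeg_mask(image, keep=5):
    # image: (N, 1, H, W) with H and W divisible by 8.
    basis = dct_basis().to(image.device)
    # Forward blockwise DCT as a stride-8 convolution: one map per coefficient.
    coeffs = F.conv2d(image, basis, stride=8)  # (N, 64, H/8, W/8)
    # Zero the high-frequency coefficients; exactly which ones are kept is a
    # detail of the particular recipe (here: those with u + v < keep).
    mask = torch.zeros(64)
    for u in range(8):
        for v in range(8):
            if u + v < keep:
                mask[u * 8 + v] = 1.0
    coeffs = coeffs * mask.view(1, 64, 1, 1).to(image.device)
    # Inverse DCT: the filters are orthonormal, so a transposed convolution
    # with the same weights maps the (masked) coefficients back to pixels.
    return F.conv_transpose2d(coeffs, basis, stride=8)

if __name__ == "__main__":
    x = torch.rand(1, 1, 128, 128)   # same spatial size as imageSize above
    print(jpeg_mask(x).shape)        # torch.Size([1, 1, 128, 128])
```

Because every step is a fixed linear operation plus an elementwise mask, gradients flow through the whole thing during training, whereas Image.JPGCompression in jpegn.lua is a black-box encode/decode and is therefore only useful for evaluation.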


0three commented Apr 29, 2019

Thank you very much!

That clears up an important point for me.

Thanks again!

0three closed this as completed Apr 29, 2019
ando-khachatryan (Owner) commented

You're welcome!
