CUBLAS_STATUS_NOT_INITIALIZED #3181

Closed
ebremer opened this issue May 13, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@ebremer

ebremer commented May 13, 2024

I am running the following code from the footwear_classification demo:

public final class Training {
    private static final int BATCH_SIZE = 32;
    private static final int EPOCHS = 2;

    public static void main(String[] args) throws IOException, TranslateException {        
        Path modelDir = Paths.get("models");
        ImageFolder dataset = initDataset("ut-zap50k-images-square");
        RandomAccessDataset[] datasets = dataset.randomSplit(8, 2);
        Loss loss = Loss.softmaxCrossEntropyLoss();
        TrainingConfig config = setupTrainingConfig(loss);             
        try (Model model = Models.getModel(); // empty model instance to hold patterns
            Trainer trainer = model.newTrainer(config)) {
            trainer.setMetrics(new Metrics());            
            Shape inputShape = new Shape(1, 3, Models.IMAGE_HEIGHT, Models.IMAGE_HEIGHT);
            trainer.initialize(inputShape);
            EasyTrain.fit(trainer, EPOCHS, datasets[0], datasets[1]);
            TrainingResult result = trainer.getTrainingResult();
            model.setProperty("Epoch", String.valueOf(EPOCHS));
            model.setProperty("Accuracy", String.format("%.5f", result.getValidateEvaluation("Accuracy")));
            model.setProperty("Loss", String.format("%.5f", result.getValidateLoss()));
            model.save(modelDir, Models.MODEL_NAME);
            Models.saveSynset(modelDir, dataset.getSynset());
        }
    }

    private static ImageFolder initDataset(String datasetRoot) throws IOException, TranslateException {
        ImageFolder dataset =
                ImageFolder.builder()
                        .setRepositoryPath(Paths.get(datasetRoot))
                        .optMaxDepth(10)                        
                        .addTransform(new Resize(Models.IMAGE_WIDTH, Models.IMAGE_HEIGHT))
                        .addTransform(new ToTensor())
                        .setSampling(BATCH_SIZE, true)
                        .build();

        dataset.prepare();
        return dataset;
    }

    private static TrainingConfig setupTrainingConfig(Loss loss) {
        return new DefaultTrainingConfig(loss)
                .addEvaluator(new Accuracy())
                .addTrainingListeners(TrainingListener.Defaults.logging());
    }
}

The following error is thrown:


--- exec:3.1.0:exec (default-cli) @ DJL ---
[main] INFO ai.djl.util.Platform - Found matching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu121/2.1.1/pytorch-native-cu121-2.1.1-win-x86_64.jar!/native/lib/pytorch.properties
[main] INFO ai.djl.pytorch.engine.PtEngine - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 16
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 16
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Training on: 1 GPUs.
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Load PyTorch Engine Version 2.1.1 in 0.069 ms.
[main] INFO ai.djl.training.listener.LoggingTrainingListener - forward P50: 208.791 ms, P90: 208.791 ms
Exception in thread "main" ai.djl.engine.EngineException: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
	at ai.djl.pytorch.jni.PyTorchLibrary.torchNNLinear(Native Method)
	at ai.djl.pytorch.jni.JniUtils.linear(JniUtils.java:1376)
	at ai.djl.pytorch.engine.PtNDArrayEx.linear(PtNDArrayEx.java:397)
	at ai.djl.nn.core.Linear.linear(Linear.java:192)
	at ai.djl.nn.core.Linear.forwardInternal(Linear.java:96)
	at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:128)
	at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
	at ai.djl.nn.SequentialBlock.forwardInternal(SequentialBlock.java:211)
	at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
	at ai.djl.training.Trainer.forward(Trainer.java:188)
	at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
	at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
	at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
	at com.examples.Training.main(Training.java:38)

The same error occurs for every lower BATCH_SIZE value I tried. The only exception is BATCH_SIZE = 1, where it works and the network trains.

@ebremer ebremer added the bug Something isn't working label May 13, 2024
@frankfliu
Contributor

This seems to be a PyTorch bug: pytorch/pytorch#121293

@ebremer
Author

ebremer commented May 13, 2024

My POM:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ebremer</groupId>
    <artifactId>DJL</artifactId>
    <version>0.0.0</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
        <exec.mainClass>com.ebremer.djl.DJL</exec.mainClass>
    </properties>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>ai.djl</groupId>
                <artifactId>bom</artifactId>
                <version>0.27.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>api</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>basicdataset</artifactId>
            <type>jar</type>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.1.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.1.1-0.27.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.30</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>

@frankfliu
Contributor

Can you try:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.27.0</version>
            <scope>runtime</scope>
        </dependency>

@ebremer
Author

ebremer commented May 14, 2024

With those dependencies, it fails to load the PyTorch native library:

[main] INFO ai.djl.util.Platform - Found matching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu117/1.13.1/pytorch-native-cu117-1.13.1-win-x86_64.jar!/native/lib/pytorch.properties
Exception in thread "main" ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:90)
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41)
	at ai.djl.engine.Engine.getEngine(Engine.java:190)
	at ai.djl.engine.Engine.getInstance(Engine.java:145)
	at ai.djl.Model.newInstance(Model.java:72)
	at ai.djl.Model.newInstance(Model.java:61)
	at com.examples.Models.getModel(Models.java:43)
	at com.examples.Training.main(Training.java:33)
Caused by: java.lang.UnsatisfiedLinkError: C:\Users\erich\.djl.ai\pytorch\1.13.1-20221220-cu117-win-x86_64\torch_cuda_cpp.dll: Can't find dependent libraries
	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2418)
	at java.base/java.lang.Runtime.load0(Runtime.java:852)
	at java.base/java.lang.System.load(System.java:2025)
	at ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(LibUtils.java:379)
	at ai.djl.pytorch.jni.LibUtils.loadLibTorch(LibUtils.java:195)
	at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:82)
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53)
	... 7 more

@ebremer
Author

ebremer commented May 16, 2024

I saw the new release and tried the dependencies below, but I get the same error. As before, training only starts without the error when I run with BATCH_SIZE = 1.

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.2.2</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.3.0-0.28.0</version>
            <scope>runtime</scope>
        </dependency>
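
In the dependency snippets in this thread, the pytorch-jni version has the form `<PyTorch version>-<DJL version>` (for example `2.1.1-0.27.0`), and its PyTorch part otherwise equals the pytorch-native version. The snippet above pairs native `2.2.2` with JNI `2.3.0-0.28.0`, which departs from that pattern. Whether that mismatch contributes to the error is not established here. A minimal sketch of a check for it, assuming that naming convention; the class and method names are hypothetical and not part of DJL:

```java
/** Hypothetical helper, not part of DJL: checks a pytorch-native / pytorch-jni version pair. */
final class JniVersionCheck {

    /** Returns the PyTorch part of a JNI version such as "2.1.1-0.27.0" (here "2.1.1"). */
    static String pytorchPart(String jniVersion) {
        int dash = jniVersion.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("Expected <pytorch>-<djl>, got: " + jniVersion);
        }
        return jniVersion.substring(0, dash);
    }

    /** True when the native library version equals the PyTorch part of the JNI version. */
    static boolean matches(String nativeVersion, String jniVersion) {
        return nativeVersion.equals(pytorchPart(jniVersion));
    }

    public static void main(String[] args) {
        // Version pairs taken from the snippets in this thread.
        System.out.println(matches("2.1.1", "2.1.1-0.27.0"));   // true
        System.out.println(matches("1.13.1", "1.13.1-0.28.0")); // true
        System.out.println(matches("2.2.2", "2.3.0-0.28.0"));   // false
    }
}
```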

@mdxd44

mdxd44 commented May 18, 2024

I got the same issue with
ai.djl.pytorch:pytorch-native-cu121:2.3.0, ai.djl.pytorch:pytorch-jni:2.3.0-0.28.0, and CUDA 12.4.1.
Rolling back to
ai.djl.pytorch:pytorch-native-cu117:1.13.1, ai.djl.pytorch:pytorch-jni:1.13.1-0.28.0, and CUDA 11.7.0
fixed the issue.

@ebremer
Author

ebremer commented May 20, 2024

@mdxd44 I tried your rollback and got this:

[main] INFO ai.djl.util.Platform - Found matching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu117/1.13.1/pytorch-native-cu117-1.13.1-win-x86_64.jar!/native/lib/pytorch.properties
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/asmjit.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/c10.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/c10_cuda.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/caffe2_nvrtc.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_adv_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_adv_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_cnn_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_cnn_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_ops_infer64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/cudnn_ops_train64_8.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/fbgemm.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/libiomp5md.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/nvToolsExt64_1.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/nvrtc64_112_0.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cpu.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda_cpp.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/torch_cuda_cu.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/uv.dll to cache ...
[main] INFO ai.djl.pytorch.jni.LibUtils - Extracting pytorch/cu117/win-x86_64/zlibwapi.dll to cache ...
Exception in thread "main" ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:90)
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41)
	at ai.djl.engine.Engine.getEngine(Engine.java:190)
	at ai.djl.engine.Engine.getInstance(Engine.java:145)
	at ai.djl.Model.newInstance(Model.java:72)
	at ai.djl.Model.newInstance(Model.java:61)
	at com.examples.Models.getModel(Models.java:43)
	at com.examples.Training.main(Training.java:33)
Caused by: java.lang.UnsatisfiedLinkError: C:\Users\erich\.djl.ai\pytorch\1.13.1-20221220-cu117-win-x86_64\torch_cuda_cpp.dll: Can't find dependent libraries
	at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
	at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
	at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
	at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2418)
	at java.base/java.lang.Runtime.load0(Runtime.java:852)
	at java.base/java.lang.System.load(System.java:2025)
	at ai.djl.pytorch.jni.LibUtils.loadNativeLibrary(LibUtils.java:379)
	at ai.djl.pytorch.jni.LibUtils.loadLibTorch(LibUtils.java:195)
	at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:82)
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53)
	... 7 more
Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:328)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:316)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:174)
    at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:75)
    at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:162)
    at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:159)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:105)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:73)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:53)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:118)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:261)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:173)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:101)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:906)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:283)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:206)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke (Method.java:580)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:283)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:226)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:407)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:348)
------------------------------------------------------------------------
BUILD FAILURE
------------------------------------------------------------------------
Total time:  16.182 s
Finished at: 2024-05-20T08:43:46-04:00
------------------------------------------------------------------------
Failed to execute goal org.codehaus.mojo:exec-maven-plugin:3.1.0:exec (default-cli) on project DJL: Command execution failed.: Process exited with an error: 1 (Exit value: 1) -> [Help 1]

To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.

For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

This was with the following POM dependencies:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

@mdxd44

mdxd44 commented May 24, 2024

@ebremer you can try to use this tool to identify which dependency it can't find
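
A rough way to narrow down a "Can't find dependent libraries" error without an external tool is to check whether the DLLs you expect are present in any directory on PATH. The sketch below does only that. It does not read the import table of torch_cuda_cpp.dll, so the list of names to look for is an assumption you have to supply; the names in `main` are placeholders, not a confirmed dependency list, and the class name is hypothetical.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which library file names are absent from a search path. */
final class DllPathCheck {

    /** Returns the names from {@code wanted} that exist in no directory of {@code searchPath}. */
    static List<String> missing(List<String> wanted, String searchPath) {
        List<String> notFound = new ArrayList<>();
        for (String name : wanted) {
            boolean found = false;
            for (String dir : searchPath.split(File.pathSeparator)) {
                if (dir.isEmpty()) {
                    continue;
                }
                try {
                    if (Files.isRegularFile(Paths.get(dir, name))) {
                        found = true;
                        break;
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed search path entries instead of failing the whole scan.
                }
            }
            if (!found) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        // Placeholder names only; replace with the libraries you actually suspect.
        List<String> wanted = List.of("cudart64_110.dll", "cublas64_11.dll");
        System.out.println("Not on PATH: " + missing(wanted, path == null ? "" : path));
    }
}
```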

@ebremer
Author

ebremer commented May 28, 2024

Some measure of success...
I removed all CUDA libraries I had on my system (there were several) and installed 11.7 only. Everything worked with:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu117</artifactId>
            <classifier>win-x86_64</classifier>
            <version>1.13.1</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.13.1-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

I then removed 11.7, installed only 12.1, and updated the POM to:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-model-zoo</artifactId>
        </dependency>        
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cu121</artifactId>
            <classifier>win-x86_64</classifier>
            <version>2.2.2</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>2.2.2-0.28.0</version>
            <scope>runtime</scope>
        </dependency>

but it could not find the CUDA environment and trained on the CPU instead:

[main] WARN ai.djl.util.Platform - The bundled library: cu121-win-x86_64:2.2.2-20240505} doesn't match system: cu065-win-x86_64:2.2.2
[main] INFO ai.djl.util.Platform - Ignore mismatching platform from: jar:file:/C:/Users/erich/.m2/repository/ai/djl/pytorch/pytorch-native-cu121/2.2.2/pytorch-native-cu121-2.2.2-win-x86_64.jar!/native/lib/pytorch.properties
[main] WARN ai.djl.pytorch.jni.LibUtils - No matching cuda flavor for win-x86_64 found: cu065.
[main] INFO ai.djl.pytorch.engine.PtEngine - PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of inter-op threads is 32
[main] INFO ai.djl.pytorch.engine.PtEngine - Number of intra-op threads is 16
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Training on: cpu().
[main] INFO ai.djl.training.listener.LoggingTrainingListener - Load PyTorch Engine Version 2.2.2 in 0.012 ms.

Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training:      0% |=                                       | Accuracy: 0.47, SoftmaxCrossEntropyLoss: 2.46
Training:      0% |=                                       | Accuracy: 0.47, SoftmaxCrossEntropyLoss: 2.46
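
The `cu065` in the warning above is consistent with a CUDA 6.5 runtime being detected instead of 12.1. The CUDA runtime reports its version as an integer equal to 1000 * major + 10 * minor, so 12.1 is 12010. A minimal sketch of how such an integer could map to the flavor strings seen in the log; the two-digit zero padding is inferred from the log output rather than taken from DJL's source, and the class name is hypothetical:

```java
import java.util.Locale;

/** Hypothetical helper: formats a CUDA runtime version integer as a "cuXXY" flavor string. */
final class CudaFlavor {

    /** Converts e.g. 12010 (CUDA 12.1) to "cu121", padding the major version to two digits. */
    static String flavor(int runtimeVersion) {
        int major = runtimeVersion / 1000;
        int minor = (runtimeVersion % 1000) / 10;
        return String.format(Locale.ROOT, "cu%02d%d", major, minor);
    }

    public static void main(String[] args) {
        System.out.println(flavor(12010)); // cu121
        System.out.println(flavor(11070)); // cu117
        System.out.println(flavor(6050));  // cu065
    }
}
```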

@ebremer
Author

ebremer commented May 28, 2024

Success!
I did some debugging on the DJL code and found out what was and was not happening.
Tracing through ai.djl.util.cuda.CudaUtils.loadLibrary(), the code found C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll because CUDA_PATH was not defined; the installer had defined CUDA_PATH_v12.1 instead. In addition, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin was not on the PATH.

I renamed CUDA_PATH_v12.1 to CUDA_PATH, since that is the environment variable looked up at line 241. At that point it still failed to load: line 253 reduces C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\cudart64_12.dll to just the file name without the directory, so the load at line 255 could not find it. Once I added C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin to the PATH variable, DJL was able to load CUDA 12.1, and GPU training finally ran.
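
The lookup order described above can be sketched roughly as follows. This is a simplified reconstruction from the description in this comment, not DJL's actual source: the class and method names are hypothetical, and the directory listing is passed in as a function so the example runs without a CUDA install.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Hypothetical sketch of a first-match search for the CUDA runtime DLL on Windows. */
final class CudartLookup {
    private static final Pattern CUDART = Pattern.compile("cudart64_\\d+\\.dll");

    /**
     * Returns the first cudart file name found, searching CUDA_PATH\bin first (when set)
     * and then each PATH entry in order; returns null when nothing matches.
     * Only the file name is returned, so its directory must also be on PATH
     * for a later load by name to succeed.
     */
    static String findCudart(String cudaPath, String path, Function<String, List<String>> listDir) {
        List<String> search = new ArrayList<>();
        if (cudaPath != null) {
            search.add(cudaPath + "\\bin");
        }
        search.addAll(Arrays.asList(path.split(";")));
        for (String dir : search) {
            for (String file : listDir.apply(dir)) {
                if (CUDART.matcher(file).matches()) {
                    return file;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String physx = "C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common";
        String cuda = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1";
        // Simulated directory contents, using the paths named in this comment.
        Map<String, List<String>> disk = Map.of(
                physx, List.of("cudart64_65.dll"),
                cuda + "\\bin", List.of("cudart64_12.dll"));
        Function<String, List<String>> ls = d -> disk.getOrDefault(d, List.of());

        // CUDA_PATH unset and only PhysX on PATH: the old runtime is picked up.
        System.out.println(findCudart(null, physx, ls)); // cudart64_65.dll
        // CUDA_PATH set: the 12.1 runtime is found first.
        System.out.println(findCudart(cuda, physx, ls)); // cudart64_12.dll
    }
}
```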

@ebremer ebremer closed this as completed May 30, 2024