A script that parses XML takes only 0.00718 seconds to execute through Python, but it takes 17 seconds to execute through Python4J. Why is there such a difference in performance? #10044

Open
aizhimin opened this issue Nov 19, 2023 · 6 comments

Comments

@aizhimin

Issue Description

A script that parses XML takes only 0.00718 seconds to execute through Python, but it takes 17 seconds to execute through Python4J. Why is there such a difference in performance?

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version: the latest version
  • Platform information: Windows

The Java code:

`public static void parserOverlay() {
    // Acquire the GIL before calling into Python.
    try (PythonGIL pythonGIL = PythonGIL.lock()) {
        try (PythonGC gc = PythonGC.watch()) {
            // inputs (backslashes in Windows paths must be escaped in Java string literals)
            byte[] xml_data_bytes = FileUtils.readFileToByteArray(new File("D:\\software\\BaiduNetdisk\\download\\231108\\231108\\MGT\\TE01214\\KTOVLRAW_TE01214_OL.xml"));
            List<PythonVariable> inputs = new ArrayList<>();
            inputs.add(new PythonVariable<>("xml_data_bytes", PythonTypes.BYTES, xml_data_bytes));

            // outputs
            List<PythonVariable> outputs = new ArrayList<>();
            outputs.add(new PythonVariable<>("lotRunJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("machineRcpJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("imageJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("waferRunJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("targetJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("measurementJson", PythonTypes.DICT));
            outputs.add(new PythonVariable<>("measurementResultJson", PythonTypes.DICT));

            String code = FileUtils.readFileToString(new File("D:\\resource\\LithoTuner-Python\\DataParser_Overlay.py"),
                    StandardCharsets.UTF_8);

            // Time only the exec call; the file reads above are excluded.
            long startTime = System.currentTimeMillis();
            PythonExecutioner.exec(code, inputs, outputs);
            long endTime = System.currentTimeMillis();
            System.out.println("cost time:" + (endTime - startTime) / 1000.0);

            for (PythonVariable output : outputs) {
                System.out.println(output.getName() + " : " + output.getValue());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}`

What's wrong?

@agibsonccc
Contributor

@aizhimin thanks for posting! Can you post the script so I can look? Python version would help as well.

@aizhimin
Author

aizhimin commented Nov 19, 2023

@agibsonccc The Python version is 3.10.2.
The script looks like this:
`
xml_data_str = xml_data_bytes.decode(encoding='utf-8')

root = etree.fromstring(xml_data_str)
lotRunJson = generate_lotrunjson(root)
machineRcpJson = generate_machinercpjson(root)
imageJson = generate_imagejson(root, lotRunJson)
waferRunJson = generate_waferrunjson(root)
targetJson = generate_targetjson(root, lotRunJson)
measurementJson = generate_measurementjson(root)
measurementResultJson = generate_measurementresultjson(root)
`

This script takes only 0.00718 seconds to execute through Python, but 17 seconds through Python4J.

Is it slower to pass the binary data of a file as input in Java? Do I need to pass the file path in and let Python read the file?

@agibsonccc
Contributor

agibsonccc commented Nov 19, 2023

@aizhimin can you give me something I can run standalone? If I'm going to benchmark something you and I need a common baseline to work with.

@aizhimin
Author

@agibsonccc Sorry, my file is confidential. My question is this: the XML parsing code runs very fast when executed directly in the Python environment, but the same parsing is very slow when executed through Python4J. Is this because the Python parser has to be loaded? Or is it because file data streams cannot be passed as input parameters?
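
One way to check the first hypothesis is to time the one-off module import separately from repeated parses. The following is a minimal sketch, not a diagnosis: it assumes the standard-library `xml.etree.ElementTree` as a stand-in for `lxml`, and the `timed` helper and the payload are hypothetical.

```python
import importlib
import time


def timed(fn):
    """Call fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Assumption: the stdlib parser replaces lxml so the sketch has no dependencies.
# The import is only truly "cold" in a fresh interpreter.
etree, import_s = timed(lambda: importlib.import_module("xml.etree.ElementTree"))

# Hypothetical payload; substitute the real xml_data_bytes.
xml_data_bytes = b"<root><a>1</a><b>2</b></root>"

# Parse twice so a slow first call can be told apart from steady-state cost.
root, first_parse_s = timed(lambda: etree.fromstring(xml_data_bytes))
_, second_parse_s = timed(lambda: etree.fromstring(xml_data_bytes))

print(f"import:       {import_s:.6f} s")
print(f"first parse:  {first_parse_s:.6f} s")
print(f"second parse: {second_parse_s:.6f} s")
```

If the import or the first parse dominates, start-up cost is a plausible contributor. Calling `PythonExecutioner.exec` twice on the Java side and timing each call separately would give the comparable measurement there.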

@agibsonccc
Contributor

agibsonccc commented Nov 19, 2023

@aizhimin I don't care about your secrets. A vague description I can't directly run isn't something I'm inclined to spend time on. I believe you, but putting up barriers to me reproducing the issue isn't going to help get this fixed. Meet me halfway and set up a trivial example you can show me, and I'll be more likely to take a look at this when I get time. The goal is to have a common "language" we can speak here (in this case code) that allows us both to run the same environment and baseline so we can both agree the issue is resolved.
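
A shareable baseline does not need the confidential file. As a rough sketch, a synthetic XML document of adjustable size can be generated and written to disk for both the Python and the Java benchmark to read. The function name `make_synthetic_xml`, the element names, and the output file name are all hypothetical, and the standard-library `xml.etree.ElementTree` is assumed here rather than `lxml`.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def make_synthetic_xml(n_children: int = 1000) -> bytes:
    """Build a non-confidential XML document with n_children leaf elements."""
    root = ET.Element("root")
    for i in range(n_children):
        child = ET.SubElement(root, f"item{i}")
        child.text = str(i)
    # encoding="utf-8" makes tostring return bytes rather than str.
    return ET.tostring(root, encoding="utf-8")


# Write the payload somewhere both benchmarks can read it.
out_path = Path(tempfile.mkdtemp()) / "synthetic.xml"
out_path.write_bytes(make_synthetic_xml(1000))
print(f"wrote {out_path.stat().st_size} bytes to {out_path}")
```

Adjusting `n_children` until the file is about the size of the real one keeps the comparison meaningful without exposing its contents.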

@aizhimin
Author

aizhimin commented Nov 19, 2023

@agibsonccc

The XML looks like this:
1700406062789

The Python script looks like this:

import json

from lxml import etree

def dumps_json(obj):
    # Serialize obj to JSON; fall back to an empty JSON string when obj is empty.
    if obj:
        return json.dumps(obj)
    return json.dumps("")

xml_data = xml_data_bytes.decode(encoding='utf-8')
root = etree.fromstring(xml_data)
result = {}
for child in root:
    result[child.tag] = child.text

resultJson = dumps_json(result)

The Java code looks like this:

try (PythonGIL gil = PythonGIL.lock()) {
    try (PythonGC gc = PythonGC.watch()) {
        List<PythonVariable> inputs = new ArrayList<>();
        byte[] xml_data_bytes = FileUtils.readFileToByteArray(new File("/Users/aizhimin/Documents/test.xml"));
        inputs.add(new PythonVariable<>("xml_data_bytes", PythonTypes.BYTES, xml_data_bytes));

        List<PythonVariable> outputs = new ArrayList<>();
        outputs.add(new PythonVariable<>("resultJson", PythonTypes.STR));

        // Pass the charset explicitly; the overload without one is deprecated.
        String code = FileUtils.readFileToString(new File("/Users/aizhimin/Documents/testxml.py"), StandardCharsets.UTF_8);
        // Time only the exec call; the file reads above are excluded.
        long startTime = System.currentTimeMillis();
        PythonExecutioner.exec(code, inputs, outputs);
        long endTime = System.currentTimeMillis();
        System.out.println("Cost time:" + (endTime - startTime) / 1000.0);
        for (PythonVariable out : outputs) {
            System.out.println(out.getName() + ":" + out.getValue());
        }
    }
} catch (Throwable e) {
    e.printStackTrace();
}

Through Java, it takes 2.193 s.

Through Python alone, it takes only 0.0119 s.
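
To see where the gap between the two numbers comes from, the script could time its own steps and return those timings as one more string output. This is only a sketch under assumptions: `timingJson` is a hypothetical extra variable that would need to be declared on the Java side with `PythonTypes.STR`, the standard-library parser stands in for `lxml`, and the fallback payload exists only so the snippet runs outside Python4J.

```python
import json
import time
import xml.etree.ElementTree as etree  # assumption: stdlib stand-in for lxml

# Python4J would inject xml_data_bytes; this hypothetical fallback is for standalone runs.
if "xml_data_bytes" not in globals():
    xml_data_bytes = b"<root><value>1</value></root>"

t0 = time.perf_counter()
xml_data = xml_data_bytes.decode(encoding="utf-8")
t1 = time.perf_counter()
root = etree.fromstring(xml_data)
t2 = time.perf_counter()
result = {child.tag: child.text for child in root}
resultJson = json.dumps(result)
t3 = time.perf_counter()

# Per-step timings measured inside the interpreter, serialized as a string output.
timingJson = json.dumps({
    "decode_s": t1 - t0,
    "parse_s": t2 - t1,
    "dump_s": t3 - t2,
})
print(timingJson)
```

If the in-script timings stay in the millisecond range while the Java-side measurement is seconds, the extra time is being spent outside the script body, for example in interpreter start-up, imports, or variable conversion, rather than in the parsing itself.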


2 participants