
Different results after calling an IP several times; internal data seems to persist between calls. #796

Open
KokyoK opened this issue Apr 8, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@KokyoK

KokyoK commented Apr 8, 2023

Hello, we are facing a strange problem.
What we have done:

  • Created an .onnx model file.
  • Created an IP with no warnings.
  • Ran the C simulation and RTL simulation; every test passed.

What we are currently doing:
We are trying to use the IP on a Pynq-Z2 board, but when we call the IP several times with the same input, it outputs different results each time:
[Screenshot 2023-04-08 17:09:17: repeated calls on the same input yield different outputs]
As you can see, the inputs to this IP are exactly the same, yet the outputs are similar but different on every call; the first output is exactly correct, but the following ones are not. It seems the IP reuses some internal data left over from the previous input. However, if we reinitialize the Overlay every time before using it, it works correctly but is much slower, as shown in the following screenshot:
[Screenshot 2023-04-08 17:18:40: reinitializing the Overlay before each call gives correct but slower results]
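For reference, the two call patterns look roughly like this (a minimal sketch, not the exact notebook code; the bitfile name, DMA instance name, dtypes, and output shape are assumptions):

    import numpy as np
    from pynq import Overlay, allocate

    def run_once(ol, x):
        # Stream one input through the accelerator's DMA and read the result back.
        in_buf = allocate(shape=x.shape, dtype=np.uint8)   # dtype assumed
        out_buf = allocate(shape=(1, 10), dtype=np.uint8)  # output shape assumed
        in_buf[:] = x
        ol.axi_dma_0.sendchannel.transfer(in_buf)          # DMA name assumed
        ol.axi_dma_0.recvchannel.transfer(out_buf)
        ol.axi_dma_0.sendchannel.wait()
        ol.axi_dma_0.recvchannel.wait()
        return np.array(out_buf)

    x = np.load("input_permute.npy")

    # Fast but wrong: reuse one Overlay; outputs drift after the first call.
    ol = Overlay("design.bit")                             # bitfile name assumed
    repeated = [run_once(ol, x) for _ in range(5)]

    # Slow but correct: reprogram the FPGA before every call.
    reloaded = [run_once(Overlay("design.bit"), x) for _ in range(5)]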

Here are the ONNX file and the final test notebook we used:
ipynb&onnx.zip

KokyoK added the bug label Apr 8, 2023
@auphelia
Collaborator

Hi @KokyoK , could you please provide more information?

  • Which FINN version and which tool versions (Vivado/Vitis) are you using?
  • Is that one of our example networks or one you created/trained yourself?

@KokyoK
Author

KokyoK commented Apr 13, 2023

Hello @auphelia ,

  • The versions are:
    PYNQ: v2.6.1
    FINN: latest (main branch)
    FINN-examples: v0.0.5
    Vivado: 2022.1

  • It is a network we created and trained ourselves.

Thanks

@auphelia
Collaborator

Hi @KokyoK,
While we check on our side whether we can reproduce your issue, could you please update to PYNQ 3.0.1 and see if the error persists?
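A one-line check on the board confirms the installed release (this assumes the pynq package exposes __version__, which recent releases do):

    import pynq
    print(pynq.__version__)  # expect '3.0.1' after the update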

@KokyoK
Author

KokyoK commented Apr 24, 2023

Hi @auphelia ,
I have tried to use PYNQ 3.0.1 and the same issue still occurs.

@fionnodonohoe-xlnx
Collaborator

Hi @KokyoK,
I am unable to build the bitstream from the provided ONNX file; could you please provide the original trained model?
Would it also be possible to share the 'input_permute.npy' file that is used by the notebook? Thanks.

@KokyoK
Author

KokyoK commented May 8, 2023

Hi @fionnodonohoe-xlnx , here are the files:
model.py contains the model structure, an ordinary convolutional network. weight.pt contains the trained weights that the model loads. input_permute.npy is also provided.

Since we were able to build a bitstream from the provided ONNX, I guess something may have gone wrong during the bitstream build.

model.zip

Thanks for your effort!

@fionnodonohoe-xlnx
Collaborator

Hi @KokyoK,

I tried creating the ONNX file from model.py. When adding model.save(is_onnx=1) after model.eval(), I get the following error:
RuntimeError: Given groups=1, weight of size [16, 40, 1, 3], expected input[16, 30, 1, 101] to have 40 channels, but got 30 channels instead

I then changed the expected input to have 40 channels instead of 30, only to get this error:
RuntimeError: input_shape.size() > 0 || reshape.size() > 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/passes/onnx/shape_type_inference.cpp":448, please report a bug to PyTorch. Reshape node should have at least one input size > 0 when constant folding.

Are you also seeing this error? Perhaps you could send your TCResNet8.onnx file created from your script; I can then try putting that ONNX file through the bitstream generation stage.
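For reference, a dummy forward pass pins down this kind of channel mismatch before any export (a minimal sketch, assuming a PyTorch model; 'model' stands for the instantiated network, and the shape follows the error message above):

    import torch

    # The first conv's weight is [16, 40, 1, 3], so it expects 40 input
    # channels: the dummy input must be (N, 40, H, W), not (N, 30, H, W).
    dummy = torch.randn(1, 40, 1, 101)
    with torch.no_grad():
        out = model(dummy)  # 'model' is the network under test (hypothetical)
    print(out.shape)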

@KokyoK
Author

KokyoK commented May 22, 2023

Hello @fionnodonohoe-xlnx ,

  1. The correct way to save the .onnx file is to put the following lines in the main function:

    model = QuantizedTCResNet8(1, 40, 10)
    model.load("weight.pt")
    model.eval()

    import brevitas.onnx as bo
    export_onnx_path = "8b_weight_act_bias_net.onnx"
    input_shape = (1, 40, 1, 101)
    bo.export_finn_onnx(model, input_shape, export_onnx_path)

    and run the main function. The resulting ONNX file should be identical to the one I uploaded before; sorry for the previous confusion. (A quick load check for the exported file is sketched after this list.)

  2. We built the bitstream with the attached modified build_dataflow_steps.py; essentially, we added line 327 and commented out line 338.

build_dataflow_steps.py.zip
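As referenced above, a quick sanity check before running the build flow is to load the exported file with FINN's qonnx ModelWrapper (a minimal sketch; the expected shape is the input_shape used in the export):

    from qonnx.core.modelwrapper import ModelWrapper

    # Load the exported model and confirm its top-level input shape.
    wrapped = ModelWrapper("8b_weight_act_bias_net.onnx")
    in_name = wrapped.graph.input[0].name
    print(wrapped.get_tensor_shape(in_name))  # expect [1, 40, 1, 101]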

@fionnodonohoe-xlnx
Collaborator

Hi @KokyoK,
Thank you for that. Unfortunately, I hit another bitstream generation error. It turns out the TLastMarker insertion added in the provided code causes it:
+ model = model.transform(InsertTLastMarker(both=True))
... because the TLastMarker class has no get_input_datatype() method. Do you have edits elsewhere in your local clone that circumvent this issue?

Here is what I see on the command line when I use the provided build_dataflow_steps.py:

Running step: step_qonnx_to_finn [1/17]
Running step: step_tidy_up [2/17]
Running step: step_streamline [3/17]
Running step: step_convert_to_hls [4/17]
Running step: step_create_dataflow_partition [5/17]
Running step: step_target_fps_parallelization [6/17]
Running step: step_apply_folding_config [7/17]
Running step: step_generate_estimate_reports [8/17]
Running step: step_hls_codegen [9/17]
Running step: step_hls_ipgen [10/17]
Running step: step_set_fifo_depths [11/17]
Running step: step_create_stitched_ip [12/17]
Running step: step_measure_rtlsim_performance [13/17]
Running step: step_out_of_context_synthesis [14/17]
Running step: step_synthesize_bitfile [15/17]
Traceback (most recent call last):
  File "~/workspace/src/finn/builder/build_dataflow.py", line 168, in build_dataflow_cfg
    model = transform_step(model, cfg)
  File "~/workspace/src/finn/builder/build_dataflow_steps.py", line 772, in step_synthesize_bitfile
    model = model.transform(
  File "~/workspace/deps/qonnx/src/qonnx/core/modelwrapper.py", line 140, in transform
    (transformed_model, model_was_changed) = transformation.apply(transformed_model)
  File "~/workspace/src/finn/transformation/fpgadataflow/make_zynq_proj.py", line 350, in apply
    kernel_model = kernel_model.transform(InsertFIFO())
  File "~/workspace/deps/qonnx/src/qonnx/core/modelwrapper.py", line 140, in transform
    (transformed_model, model_was_changed) = transformation.apply(transformed_model)
  File "~/workspace/src/finn/transformation/fpgadataflow/insert_fifo.py", line 199, in apply
    dtype = n0.get_input_datatype(inp_ind)
  File "~/workspace/src/finn/custom_op/fpgadataflow/hlscustomop.py", line 711, in get_input_datatype
    raise Exception("get_input_datatype not implemented for this op")
Exception: get_input_datatype not implemented for this op
> ~/workspace/src/finn/custom_op/fpgadataflow/hlscustomop.py(711)get_input_datatype()
-> raise Exception("get_input_datatype not implemented for this op")

@KokyoK
Author

KokyoK commented May 23, 2023

Hi @fionnodonohoe-xlnx ,
You can simply remove all TLAST-related code. Delete this line:
model = model.transform(InsertTLastMarker(both=True))
It is not related to this issue; we have tried removing it and the problem still occurs.

@fionnodonohoe-xlnx
Collaborator

Hi @KokyoK ,
I went ahead and removed the TLastMarker insertion. This time the bitstream failed to generate due to insufficient resources on the Pynq FPGA part. I then removed all changes from the modified build_dataflow_steps.py and retried the build, but to no avail. I have attached the DRC report.
As you were able to generate a bitstream for this model, how did you get around this particular resource issue? Thanks.
top_wrapper_drc_opted.txt
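For what it's worth, resource overflow on a part this small usually traces back to the parallelization settings. A minimal sketch of the builder knobs that control this, assuming the standard FINN builder API (file and directory names are placeholders):

    import finn.builder.build_dataflow as build
    import finn.builder.build_dataflow_config as build_cfg

    cfg = build_cfg.DataflowBuildConfig(
        output_dir="output_tcresnet8",
        synth_clk_period_ns=10.0,
        board="Pynq-Z2",
        shell_flow_type=build_cfg.ShellFlowType.VIVADO_ZYNQ,
        target_fps=1000,  # lower target -> less parallelism -> fewer LUTs/BRAMs
        generate_outputs=[build_cfg.DataflowOutputType.BITFILE],
    )
    build.build_dataflow_cfg("8b_weight_act_bias_net.onnx", cfg)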

@KokyoK
Copy link
Author

KokyoK commented May 25, 2023

Hi @fionnodonohoe-xlnx ,
We tried again and did not encounter your problem. I have attached the build_customize folder; we simply ran this build and saw no errors.
This is the attachment: https://drive.google.com/file/d/1yfMkpSVOmBp5GrzpWn62daF1Um2d_-RO/view?usp=sharing
