Joining and sorting ranks fails to load serialized double array holding page rank scores #5

Closed
sebastian-nagel opened this issue Sep 5, 2022 · 3 comments
@sebastian-nagel
Contributor

[main] INFO org.commoncrawl.webgraph.JoinSortRanks - Loading page rank values from host/cc-main-2021-22-oct-nov-jan-host-pagerank.ranks
Exception in thread "main" java.lang.IllegalArgumentException: newLimit < 0: (-1216172000 < 0)
        at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
        at java.base/java.nio.Buffer.limit(Buffer.java:346)
        at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
        at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
        at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6431)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6452)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6520)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7006)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7018)
        at org.commoncrawl.webgraph.JoinSortRanks.loadPageRank(JoinSortRanks.java:50)
        at org.commoncrawl.webgraph.JoinSortRanks.main(JoinSortRanks.java:319)
  • Worked around for now (no time at this point for a deeper analysis) by downgrading: the recent commits were reverted to f33704b, the revision used to build the preceding graph.
sebastian-nagel added a commit that referenced this issue Sep 5, 2022
@sebastian-nagel
Contributor Author

sebastian-nagel commented Sep 5, 2022

The failure to load a double array is reproducible with any array whose serialization file is 2 GiB or larger, see 49174af. Downgrading fastutil-core to 8.5.7 fixes the issue. The unit test also passes with the current head of fastutil (vigna/fastutil@b813824). See also the question about the release of fastutil 8.5.9.
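
The 2 GiB threshold and the negative `newLimit` in the stack trace are both consistent with a byte count being narrowed to a signed 32-bit `int` somewhere on the loading path. That is only one plausible reading, and the thread does not confirm where the narrowing happens. The sketch below uses the JDK alone (no fastutil) to show the arithmetic. The class and method names are hypothetical, and the element count 384,849,412 is back-computed from the value in the trace, not stated anywhere in this issue.

```java
import java.nio.ByteBuffer;

final class LimitOverflowSketch {

    /** Size in bytes of n doubles, computed in 64-bit arithmetic. */
    static long byteSize(long numDoubles) {
        return numDoubles * Double.BYTES;
    }

    /** The same size after narrowing to int; wraps around for 2 GiB or more. */
    static int narrowedByteSize(long numDoubles) {
        return (int) byteSize(numDoubles);
    }

    /** Returns true if Buffer.limit rejects the given value. */
    static boolean limitRejected(int newLimit) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        try {
            buf.limit(newLimit);
            return false;
        } catch (IllegalArgumentException e) {
            // Same exception type as in the stack trace above.
            return true;
        }
    }

    public static void main(String[] args) {
        long twoGiB = 1L << 28; // 2^28 doubles * 8 bytes = 2^31 bytes
        System.out.println(byteSize(twoGiB));
        System.out.println(narrowedByteSize(twoGiB));

        // Assumed element count, chosen so that the wrapped size equals
        // the -1216172000 reported in the trace.
        long assumedCount = 384_849_412L;
        System.out.println(byteSize(assumedCount));
        System.out.println(narrowedByteSize(assumedCount));
        System.out.println(limitRejected(narrowedByteSize(assumedCount)));
    }
}
```

Any file below 2 GiB keeps the narrowed value non-negative, which would match the observation that only large serialization files trigger the failure.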

@sebastian-nagel
Contributor Author

Fixed by downgrading to fastutil-core 8.5.7

@sebastian-nagel sebastian-nagel added this to the 0.1 milestone Sep 15, 2022
sebastian-nagel added a commit that referenced this issue Sep 18, 2022
- rename existing option `--no-strict-domain-validate` into
  `--multipart-suffixes-as-domains`, complete documentation
- add unit tests

Log progress when processing large graphs

HostToDomainGraph / JoinSortRanks: input/output is always UTF-8
(fixes #5): write output always as UTF-8
sebastian-nagel added a commit that referenced this issue Sep 20, 2022
- rename existing option `--no-strict-domain-validate` into
  `--multipart-suffixes-as-domains`, complete documentation
- add unit tests

Log progress when processing large graphs

HostToDomainGraph / JoinSortRanks: input/output is always UTF-8
(fixes #5): write output always as UTF-8
sebastian-nagel added a commit that referenced this issue Sep 20, 2022
space to execute unit test using large arrays
@vigna

vigna commented Sep 26, 2022

Thank you for reporting this. Fixed in fastutil 8.5.9.

sebastian-nagel added a commit that referenced this issue Mar 15, 2023
(loading of a serialized double array in a 2 GiB file fails)
- reset array after storing it and call garbage collector to free memory
- more verbose logging after steps in unit test
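
The store-then-reload check described in the commit message above can be sketched at a small scale with the JDK only. This is not the project's unit test and it does not call fastutil's `BinIO`. It swaps in a hand-written chunked reader that sizes the file in `long` arithmetic and never sets a buffer limit larger than one chunk. The class name, method names, and the big-endian 8-byte layout (as written by `DataOutputStream.writeDouble`) are assumptions made for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ChunkedDoubleRoundTrip {

    /** Writes the values as consecutive big-endian 8-byte doubles. */
    static void store(double[] values, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
    }

    /** Reads the whole file back, at most chunkDoubles values per read window. */
    static double[] load(Path file, int chunkDoubles) throws IOException {
        if (chunkDoubles < 1 || chunkDoubles > (1 << 24)) {
            throw new IllegalArgumentException("chunkDoubles out of range");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long numDoubles = ch.size() / Double.BYTES; // 64-bit, no wrap-around
            if (numDoubles > Integer.MAX_VALUE - 8) {
                throw new IOException("too many values for a single Java array");
            }
            double[] result = new double[(int) numDoubles];
            // ByteBuffer is big-endian by default, matching writeDouble.
            ByteBuffer buf = ByteBuffer.allocate(chunkDoubles * Double.BYTES);
            int filled = 0;
            while (filled < result.length) {
                int want = Math.min(chunkDoubles, result.length - filled);
                buf.clear();
                buf.limit(want * Double.BYTES); // bounded by the chunk size
                while (buf.hasRemaining()) {
                    if (ch.read(buf) < 0) {
                        throw new EOFException("file shorter than expected");
                    }
                }
                buf.flip();
                buf.asDoubleBuffer().get(result, filled, want);
                filled += want;
            }
            return result;
        }
    }

    /** Stores n values in a temporary file, reloads them, and compares. */
    static boolean roundTrip(int n, int chunkDoubles) {
        try {
            double[] expected = new double[n];
            for (int i = 0; i < n; i++) {
                expected[i] = 1.0 / (i + 1);
            }
            Path tmp = Files.createTempFile("ranks-sketch", ".bin");
            try {
                store(expected, tmp);
                return Arrays.equals(expected, load(tmp, chunkDoubles));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(10, 3));
    }
}
```

At the sizes used here there is no need to release the source array before reloading. For a 2 GiB array, holding both copies at once roughly doubles the heap requirement, which is presumably why the commit resets the array and invokes the garbage collector between the two steps.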