Closed
Description
Describe the Enhancement
AUT uses hash values to create unique IDs, which can leave us with duplicates of the same URL in a network graph when hashes collide.
To Reproduce
Steps to reproduce the behavior (e.g.):
Run the Domain Graph Extractor on a collection with a large number of network nodes (websites).
Open the resulting graph in Gephi.
Observe duplicate websites in the graph.
Expected behavior
All network nodes should be unique.
Screenshots
N/A
Additional context
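The .zipWithIndex() method in Apache Spark would be a better approach, since it assigns each element a consecutive index instead of relying on a hash: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
.zipWithUniqueId() does not trigger an additional Spark job, so it could be faster. Its IDs are unique but not consecutive.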
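To make the difference between the two methods concrete, here is a minimal pure-Python sketch that mimics the ID schemes described in the Spark documentation. It does not use Spark or AUT itself; the sample domains, the `partitions` layout, and the helper names `zip_with_index` and `zip_with_unique_id` are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# Hypothetical sample data: distinct domains, already split across three partitions.
partitions: List[List[str]] = [
    ["archive.org", "example.com"],
    ["gephi.org"],
    ["spark.apache.org", "github.com", "wikipedia.org"],
]


def zip_with_index(parts: List[List[str]]) -> Dict[str, int]:
    """Mimic RDD.zipWithIndex: consecutive ids 0..N-1 in partition order."""
    ids: Dict[str, int] = {}
    offset = 0
    for part in parts:
        for i, item in enumerate(part):
            ids[item] = offset + i
        # The next partition's offset depends on the sizes of all earlier
        # partitions. Spark has to compute those sizes first, which is why
        # zipWithIndex triggers a job on multi-partition RDDs.
        offset += len(part)
    return ids


def zip_with_unique_id(parts: List[List[str]]) -> Dict[str, int]:
    """Mimic RDD.zipWithUniqueId: item i of partition k gets id i * n + k."""
    n = len(parts)
    # Each partition only needs its own index k and the partition count n,
    # so no partition sizes have to be computed up front.
    return {
        item: i * n + k
        for k, part in enumerate(parts)
        for i, item in enumerate(part)
    }


index_ids = zip_with_index(partitions)
unique_ids = zip_with_unique_id(partitions)
print(index_ids)
print(unique_ids)
```

Both schemes hand out one ID per input element, so they only guarantee unique nodes if the domain list has been deduplicated first (for example with `distinct()` in Spark). The IDs from the second scheme have gaps, which should not matter for a graph file as long as edges reference the same IDs.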
See also #228