
Replace hashing of unique ids with .zipWithUniqueId() #243

@greebie

Describe the Enhancement

AUT uses hash values to create unique ids. When hashes collide, this can leave duplicates of the same URL in a network graph.

To Reproduce
Steps to reproduce the behavior (e.g.):
1. Run a Domain Graph Extractor on a collection with a large number of network nodes (websites).
2. Open the resulting graph in Gephi.
3. Find duplicate websites in the graph.

Expected behavior
All network nodes should be unique.

Screenshots
N/A

Additional context
Apache Spark's .zipWithIndex() would be a better way to assign ids: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

Alternatively, .zipWithUniqueId() does not trigger an additional Spark job (which .zipWithIndex() does when the RDD has more than one partition), so it could be faster.
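
To illustrate the difference between the two methods, here is a minimal pure-Python sketch that emulates their id schemes as described in the Spark documentation: with n partitions, .zipWithUniqueId() gives the items of partition k the ids k, n+k, 2n+k, and so on, while .zipWithIndex() needs the size of every earlier partition first. It does not call Spark or AUT. The function names and the sample domains are hypothetical, and the nested lists stand in for RDD partitions.

```python
from typing import List, Tuple


def zip_with_unique_id(partitions: List[List[str]]) -> List[Tuple[str, int]]:
    """Emulate the id scheme documented for RDD.zipWithUniqueId().

    With n partitions, the i-th item of partition k gets id i * n + k.
    Each partition can compute its ids without knowing the others' sizes.
    """
    n = len(partitions)
    pairs = []
    for k, part in enumerate(partitions):
        for i, item in enumerate(part):
            pairs.append((item, i * n + k))
    return pairs


def zip_with_index(partitions: List[List[str]]) -> List[Tuple[str, int]]:
    """Emulate RDD.zipWithIndex(): consecutive ids across partitions.

    The offset of a partition depends on the sizes of all earlier ones,
    which is why Spark has to run an extra job to count them.
    """
    pairs = []
    offset = 0
    for part in partitions:
        for i, item in enumerate(part):
            pairs.append((item, offset + i))
        offset += len(part)
    return pairs


# Hypothetical, already de-duplicated domains split over three partitions.
partitions = [
    ["example.com", "archive.org"],
    ["spark.apache.org"],
    ["gephi.org", "github.com", "wikipedia.org"],
]

unique_ids = zip_with_unique_id(partitions)
index_ids = zip_with_index(partitions)

for domain, node_id in unique_ids:
    print(node_id, domain)
```

Both schemes give every element a distinct id, but the ids from the .zipWithUniqueId() scheme may have gaps. Note that either method numbers elements, not values, so the domains would presumably need to be de-duplicated (for example with .distinct()) before the ids are assigned.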

See also #228
