Include script to write to GEXF. (#103) #137

Merged
merged 2 commits into from Dec 5, 2017

greebie
Contributor

@greebie greebie commented Dec 5, 2017


Similar to WriteGDF, this script creates GEXF output for link structures.


GitHub issue(s):

#103

What does this Pull Request do?

Adds WriteGEXF(RDD[((String, String, String), Int)], path), which writes a link-structure RDD to a GEXF file.
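
To make the record shape concrete, here is a minimal sketch of how ((crawlDate, source, target), count) tuples could be turned into GEXF 1.2 markup. It is not the WriteGEXF code from this pull request: `GexfSketch`, `toGexf`, and `escape` are hypothetical names, it works on a plain `Seq` rather than an RDD, and storing the crawl date as an edge attribute is an assumed layout.

```scala
// Illustrative sketch only: GexfSketch, toGexf and escape are hypothetical names,
// not the toolkit's WriteGEXF implementation.
object GexfSketch {
  // Same record shape as in the description: ((crawlDate, source, target), count).
  type LinkCount = ((String, String, String), Int)

  // Escape XML special characters so a domain string cannot break the markup.
  def escape(s: String): String =
    s.replace("&", "&amp;")
      .replace("<", "&lt;")
      .replace(">", "&gt;")
      .replace("\"", "&quot;")

  def toGexf(links: Seq[LinkCount]): String = {
    // Every distinct source or target domain becomes one node.
    val nodes = links.flatMap { case ((_, src, dst), _) => Seq(src, dst) }.distinct

    val nodeLines = nodes.map { n =>
      val e = escape(n)
      s"""      <node id="$e" label="$e" />"""
    }

    // One directed edge per record: the count is the weight and the crawl
    // date is stored as an edge attribute (an assumed layout).
    val edgeLines = links.zipWithIndex.flatMap { case (((date, src, dst), count), i) =>
      Seq(
        s"""      <edge id="$i" source="${escape(src)}" target="${escape(dst)}" weight="$count">""",
        s"""        <attvalues><attvalue for="0" value="${escape(date)}" /></attvalues>""",
        "      </edge>"
      )
    }

    val header = Seq(
      """<?xml version="1.0" encoding="UTF-8"?>""",
      """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">""",
      """  <graph mode="static" defaultedgetype="directed">""",
      """    <attributes class="edge">""",
      """      <attribute id="0" title="crawlDate" type="string" />""",
      """    </attributes>""",
      "    <nodes>"
    )
    val middle = Seq("    </nodes>", "    <edges>")
    val footer = Seq("    </edges>", "  </graph>", "</gexf>")

    (header ++ nodeLines ++ middle ++ edgeLines ++ footer).mkString("\n")
  }
}
```

Calling `GexfSketch.toGexf(Seq((("20080430", "example.com", "archive.org"), 6)))` returns a document with two nodes and one weighted, directed edge.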

How should this be tested?

The following test script can be used to produce the file:

import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader, WriteGEXF}
import io.archivesunleashed.spark.rdd.RecordRDD._

val links = RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "linksTest.gexf")

Then open linksTest.gexf in a network graph tool such as Sigma.js or Gephi.
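
For a quick check without a graph viewer, the output can also be parsed with the JDK's XML parser to confirm that it is well-formed and to count its elements. This is a sketch under two assumptions: that the output is a single XML file at the given path, and that its elements are unprefixed `node` and `edge` tags. `GexfCheck` and `counts` are hypothetical names, not part of the toolkit.

```scala
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory

// Hypothetical helper, not part of the toolkit.
object GexfCheck {
  // Parses the file (throwing if the XML is malformed) and returns
  // (number of <node> elements, number of <edge> elements).
  def counts(path: String): (Int, Int) = {
    val doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder()
      .parse(new File(path))
    (doc.getElementsByTagName("node").getLength,
     doc.getElementsByTagName("edge").getLength)
  }
}
```

Running `GexfCheck.counts("linksTest.gexf")` in the same shell session gives counts that can be compared against the node and edge totals a viewer reports.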

Additional Notes:

A unit test will accompany the script in December, as part of my next unit test push.

Interested parties

Tag (@ianmilligan1)

Thanks in advance for your help with the Archives Unleashed Toolkit!

@codecov

codecov bot commented Dec 5, 2017

Codecov Report

Merging #137 into master will decrease coverage by 1.91%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master     #137      +/-   ##
==========================================
- Coverage   65.66%   63.74%   -1.92%     
==========================================
  Files          36       37       +1     
  Lines         731      753      +22     
  Branches      142      143       +1     
==========================================
  Hits          480      480              
- Misses        201      223      +22     
  Partials       50       50
Impacted Files | Coverage Δ
...o/archivesunleashed/spark/matchbox/WriteGEXF.scala | 0% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3eb093a...a5724e7.

@greebie
Contributor Author

greebie commented Dec 5, 2017

Tested with example.arc.gz, and the two UFT files from aut-resources (one .arc, one .warc). Both files work swimmingly in Gephi.

@ianmilligan1
Member

Excellent! I'll test this afternoon when I have a cycle.

@ianmilligan1
Member

OK. Tested in Gephi and it looks good:

[Image: screen shot 2017-12-05 at 3 34 30 pm]

I also compared it with the other export formats, and it all lines up!

[Image: screen shot 2017-12-05 at 3 34 20 pm]

Member

@ianmilligan1 ianmilligan1 left a comment

As noted in the ticket, tested and it all works well. Fantastic work, Ryan!

@ianmilligan1 ianmilligan1 merged commit 6b761b9 into master Dec 5, 2017
@ianmilligan1 ianmilligan1 deleted the Issue-103 branch December 5, 2017 20:40
@greebie
Contributor Author

greebie commented Dec 5, 2017

Great!

This was referenced Dec 5, 2017
@ruebot ruebot added this to Done in 1.0.0 Release of AUT Dec 6, 2017