GetCrawlYear to accompany GetCrawlMonth #104

Closed
ianmilligan1 opened this issue Oct 25, 2017 · 2 comments
@ianmilligan1
Member

I'm researching a web archive right now, and it covers a long enough period that year-over-year data is preferable to more granular day-by-day or month-by-month data. We should have GetCrawlYear built into aut, alongside GetCrawlMonth.
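For illustration, here is a minimal sketch of how a year could be derived from a crawl date, assuming the date is available as a string that begins with a four-digit year (for example `yyyyMMdd`). The object `CrawlDateSketch` and its methods are hypothetical names used only for this example; this is not the aut implementation.

```scala
// Hypothetical helpers, not part of aut: they assume the crawl date is a
// string whose leading characters are digits in yyyyMMdd order.
object CrawlDateSketch {
  // Returns the first `n` characters if the string is long enough and
  // those characters are all digits; otherwise None.
  private def leadingDigits(crawlDate: String, n: Int): Option[String] = {
    val prefix = crawlDate.take(n)
    if (prefix.length == n && prefix.forall(_.isDigit)) Some(prefix) else None
  }

  // Month granularity: "20171025" becomes Some("201710").
  def crawlMonth(crawlDate: String): Option[String] = leadingDigits(crawlDate, 6)

  // Year granularity: "20171025" becomes Some("2017").
  def crawlYear(crawlDate: String): Option[String] = leadingDigits(crawlDate, 4)
}
```

Returning an `Option` is a design choice for the sketch only, so that a malformed or empty date does not throw.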

@ianmilligan1 ianmilligan1 self-assigned this Oct 25, 2017
ianmilligan1 added a commit that referenced this issue Oct 25, 2017
Will test locally to see if this fits use case.
@ianmilligan1
Member Author

Testing in the wild with:

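import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

// Count domain-to-domain links per crawl year across the collection.
RecordLoader.loadArchives("/mnt/vol1/data_sets/labour/*.gz", sc)
  .keepValidPages()
  // Pair each page's crawl year with the links extracted from it.
  .map(r => (r.getCrawlYear, ExtractLinks(r.getUrl, r.getContentString)))
  // Emit (year, source domain, target domain), stripping a leading "www.".
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  // Drop rows where either domain is empty.
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  // Keep only combinations seen more than five times.
  .filter(r => r._2 > 5)
  .saveAsTextFile("/mnt/vol1/derivative_data/labour/sitelinks-year")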
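
To show the shape of the data this job works with, here is a rough sketch of the same steps on an in-memory `Seq` instead of an RDD. Everything in it is an assumption made for illustration: the object name `SiteLinksByYearSketch`, the sample tuples, and the local `countItems`, which only mimics what aut's `countItems()` is assumed to do (count identical items, most frequent first).

```scala
// Plain-collections sketch; no Spark or aut required. All names and
// sample values here are hypothetical.
object SiteLinksByYearSketch {
  // Same regex as in the job above: strip a leading "www." from a domain.
  def stripWww(domain: String): String = domain.replaceAll("^\\s*www\\.", "")

  // Count identical items and sort by descending count.
  def countItems[T](items: Seq[T]): Seq[(T, Int)] =
    items.groupBy(identity).map { case (k, v) => (k, v.size) }.toSeq.sortBy(-_._2)

  // Made-up (crawl year, source domain, target domain) tuples.
  val links: Seq[(String, String, String)] = Seq(
    ("2016", "www.example.org", "example.com"),
    ("2016", "example.org", "www.example.com"),
    ("2017", "example.org", "")
  )

  // Normalise domains, drop rows with an empty domain, then count.
  val counted: Seq[((String, String, String), Int)] = countItems(
    links
      .map { case (year, src, dst) => (year, stripWww(src), stripWww(dst)) }
      .filter { case (_, src, dst) => src != "" && dst != "" }
  )
}
```

Each element of `counted` is a `((year, source, target), count)` pair, which is the shape the final `.filter(r => r._2 > 5)` in the job above operates on.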

ruebot pushed a commit that referenced this issue Oct 26, 2017
Will test locally to see if this fits use case.
@ruebot
Member

ruebot commented Oct 26, 2017

Resolved with: 010fe24

@ruebot ruebot closed this as completed Oct 26, 2017
@ruebot ruebot added this to Done in 1.0.0 Release of AUT Oct 26, 2017