Update Documentation for 0.18.0 #77

Open
wants to merge 7 commits into
base: master
from

4 participants
@ianmilligan1
Member

commented Nov 28, 2018

Do not merge until 0.18.0 is released

This PR adds documentation for:

  • getHttpStatus (#74)
  • getArchiveFilename (#74)
  • WriteGraph (including GraphML option) (#71)
  • Removing all Twitter utilities (AUT #323)

I've tested them all using the example.arc.gz file, and we have also tested each PR.

@ianmilligan1 ianmilligan1 requested review from greebie and SamFritz Nov 28, 2018

@ianmilligan1

Member Author

commented Nov 28, 2018

I requested reviews from @greebie, as author of those PRs (is there anything we should add to the docs beyond what I did in this first round?), and from @SamFritz for her general insights into making usable content. 😄

@ruebot
Member

left a comment

We should probably tweak it a bit here to provide two different examples.


### Location of the Resource in ARCs and WARCs

Finally, you may want to know what WARC file the different resources are located in! The following command will list the WARC file that each URL is found in.

@ruebot

ruebot Nov 28, 2018

Member
The following command will provide the full path and filename of the ARC/WARC that each URL is found in.
```scala
.map(r => (r.getUrl, r.getArchiveFilename))
.take(10)
```

@ruebot

ruebot Nov 28, 2018

Member

Or, if you just want the filename rather than the full path and filename:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.commons.io.FilenameUtils

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getUrl, FilenameUtils.getName(r.getArchiveFilename)))
  .saveAsTextFile("/path/to/output")
```
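
To make the difference between the two suggested examples concrete, here is a minimal plain-Scala sketch that needs neither Spark nor Commons IO. The helper `baseName` and the sample paths are hypothetical and only illustrate what trimming a full path down to a filename looks like; they are not part of the toolkit.

```scala
// Hypothetical helper (not a toolkit function): keeps only the text after
// the last '/' in a path. If there is no '/', the input is returned as-is.
def baseName(path: String): String =
  path.substring(path.lastIndexOf('/') + 1)

// Made-up sample pairs standing in for (r.getUrl, r.getArchiveFilename).
val records = Seq(
  ("http://www.archive.org/", "/data/warcs/example.arc.gz"),
  ("http://www.archive.org/index.php", "/data/warcs/example.arc.gz")
)

// First example: URL with full path and filename.
records.foreach(println)

// Second example: URL with the filename only.
val withNames = records.map { case (url, path) => (url, baseName(path)) }
withNames.foreach(println)
```

The snippet can be pasted into a Scala REPL or spark-shell session. In the real pipeline, `FilenameUtils.getName` plays the role of `baseName`.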
@SamFritz

Member

commented Nov 29, 2018

The additions are great and look good locally. Nice work!!

import io.archivesunleashed.matchbox._
val r = RecordLoader.loadArchives("example.arc.gz", sc)
.keepValidPages()

@greebie

greebie Dec 13, 2018

Collaborator

I think the results will mostly be 200 if you include .keepValidPages(), so it might be fine to leave that out here.
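
The point about status codes can be illustrated with a small plain-Scala sketch. Everything in it is an assumption made for illustration: the sample records are made up, the status is treated as a `String`, `tally` is a hypothetical helper, and filtering on "200" is only a rough stand-in for the situation the comment describes, not a statement of what `.keepValidPages()` actually does.

```scala
// Made-up (URL, status) pairs standing in for (r.getUrl, r.getHttpStatus);
// treating the status as a String is an assumption.
val records = Seq(
  ("http://example.com/", "200"),
  ("http://example.com/about", "200"),
  ("http://example.com/old", "404"),
  ("http://example.com/moved", "301")
)

// Hypothetical helper: tally how often each status code occurs.
def tally(rows: Seq[(String, String)]): Map[String, Int] =
  rows.groupBy(_._2).map { case (status, group) => (status, group.size) }

// Every record: several distinct status codes show up.
val counts = tally(records)

// Only the 200 responses: the tally collapses to a single entry,
// which is why a status example is more informative without a page filter.
val validOnly = tally(records.filter(_._2 == "200"))

println(counts)
println(validOnly)
```

If the documented example is meant to show the range of status codes in a collection, running it on unfiltered records would be the more informative choice.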
