New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292
Conversation
- add .httpStatus to potential outputs - add tests for .httpStatus calls - improve ArchiveRecord testing overall.
- add filename to trait - add filename for ArchiveRecordImpl - add tests for filename.
Codecov Report
@@ Coverage Diff @@
## master #292 +/- ##
==========================================
+ Coverage 73.33% 73.39% +0.06%
==========================================
Files 42 42
Lines 1170 1184 +14
Branches 205 210 +5
==========================================
+ Hits 858 869 +11
Misses 244 244
- Partials 68 71 +3
Continue to review full report at Codecov.
|
Thanks @greebie – I can test. Is this PR complete or is there anything else you're hoping to add to it before we start the review process? |
Thanks. I'm pretty sure I learned from the last PR and did a very good job testing. fingers crossed I think it's ready to go. |
OK sg. I'll take first round testing, realistically probably tomorrow morning. |
@@ -37,17 +38,76 @@ class ArchiveRecordTest extends FunSuite with BeforeAndAfter { | |||
.setAppName(appName) | |||
conf.set("spark.driver.allowMultipleContexts", "true"); | |||
sc = new SparkContext(conf) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line.
|
||
/** Trait for a record in a web archive. */ | ||
trait ArchiveRecord extends Serializable { | ||
/** Returns the filename containing the Archive Records */ | ||
def getFilename: String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use Resourcename
instead of Filename
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Just to confirm Resourcename (to mimic Filename) and not ResourceName (because resource name are two separate words)? I'm okay with either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current commit uses Resourcename. I think that's fine - I'm probably just overthinking it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Resourcename
works FWIW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resourcename
is fine.
|
||
/** Trait for a record in a web archive. */ | ||
trait ArchiveRecord extends Serializable { | ||
/** Returns the filename containing the Archive Records */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing fullstop.
@@ -64,6 +74,7 @@ class ArchiveRecordImpl(r: SerializableWritable[ArchiveRecordWritable]) extends | |||
var arcRecord: ARCRecord = null | |||
var warcRecord: WARCRecord = null | |||
// scalastyle:on null | |||
var headerResponseFormat: String = "US-ASCII" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for US-ASCII
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
I suppose it this? https://tools.ietf.org/html/rfc7230#section-3.2.4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - this part was a mimic of @dportabella 's suggestion. Seemed weird to me at first as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with variations on
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.keepValidPages()
.map(r => (r.getFilename, r.getHttpStatus, r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/tuna1/scratch/i2milligan/results/plain-text-http")
WARC file and response codes all check out, and work in a variety of scripts. Good to go from my POV (and @ruebot thanks for your review too!).
- change .getFilename to .getResourcename - Other code style fixes.
I would like to run one more test on this minus the |
So just i.e. import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.map(r => (r.getFilename, r.getHttpStatus, r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/tuna1/scratch/i2milligan/results/plain-text-http") I have a large body of WARCs to test on, so if that's what you mean I can test it here too @greebie |
Yes. Would like to see some statuses besides 200. :) Also if the header is munged, I'd like to be able to cover for it. Really appreciate it if you could. |
OK. Let me run it at scale minus the plain text, so we can easily look around at status codes too. |
Tested it without Am seeing other response codes - i.e. 404s, 301s, etc. i.e.
|
Awesome! I think doing a status code visualisation would make a decent Medium post. That might be my december project. |
More relevant to #260 but related to why coverage shrunk in this PR - I think we need test cases for ArchiveRecords that are neither WARC or ARC Format. The java takes care of the error handling I think, but codecov notices that we did not test for that here in ArchiveRecord for some reason. I think this is something for a different PR, however. |
I think we have different interpretations of what Resource name is. This appears to be putting out the filename of the ARC/WARC. My understanding was that it was the resource name of the resource being parsed. i.e. |
We also need to decide if it is just the ARC/WARC filename name itself that the method returns, or the full path. |
assert(textSampleWarc.deep == Array("20080430", "20080430", "20080430").deep) | ||
} | ||
|
||
test("Domains") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few extra test methods here. Are they scope for the issues posted in the original comment? Or do they cover other tickets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No other tickets. Basically, last PR I had good test coverage, but failed to test particular cases and my code passed Travis with bugs. I decided to include these additional tests in case that happened again (it was unlikely, but I wanted this PR to go more smoothly). Since ArchiveRecord is used widely across the tests, I did not expect the tests to improve coverage (as per #260).
Other than the above comments, functionality-wise, looks good to me:
Nice work @greebie 😄 |
I think archiveFilename works. The issue is that because the filename comes from the |
I suggest we keep the fullpaths but provide information accessing the filename in the docs. There's an example to extract just the filename in the tests. Another option is to include a util. |
Ok, let's go with |
Ok. Update this ticket as necessary after this gets merged. |
- include changes to tests.
Good to go! Script:
Sample output:
|
GitHub issue(s):
#198, #164
What does this Pull Request do?
This PR includes three changes to the ArchiveRecord class:
.getHttpStatus
could be useful for studies that would like to compare status calls on crawls.Example:
Should produce something like
It might be more interesting to test this script without
.keepValidPages()
since most valid pages would likely return a200
. Where an ArchiveRecord fails to get a status code (via null or empty string) the record will return000
. This default value can be changed..getFilename
returns the fullpath (which may be a url) of the Warc that was consumed.should produce something like
To get just the filename, you could use
FilenameUtils.getName(x.getFilename)
(I have included the FilenameUtils import in the code above). I have not looked into whether FilenameUtils will get the filename from a url, but that is why I stuck with the fullpath. Also, I am open to feedback on whether .getFilename is the right call for this. IIPC calls it.getReaderIdentifier()
.I can think of a number of use cases for this, but it would be great as a way to detect problems with warcs (e.g. with wonky data). It might also be helpful for error responses down the road.
How should this be tested?
The above scripts should have expected outputs.
Both functions should work for both arc and warc formats.
Travis should pass.
Additional Notes:
I used some of @dportabella 's ideas to produce the code here, in particular for getting the HttpStatus from a Warc file. For Arc and the Warc filename, I tried to use the WARCRecord and ARCRecord classes instead.
I did not include the full header responses in this PR. It seems like the ArcRecord and WarcRecord responses are quite different and I haven't been able to produce an effective testing mechanism.
Interested parties
@ianmilligan1 @ruebot
Thanks in advance for your help with the Archives Unleashed Toolkit!