Exception handling for IA #42
I just left some quick notes to provide a little guidance early. I appreciate that this is still a work in progress!
web_monitoring/internetarchive.py
except StopIteration:
    print("Internet archive does not have archived versions of this url.")
I appreciate the clarification here of what is happening with `StopIteration`. To refine it a bit, I would make this sentence a code comment instead of a call to `print`. Library code should almost never print because it might "spam" the screen in the way that the application calling it does not want, and the application has no easy way of silencing the printing.
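One common alternative to printing, sketched here as an illustration rather than as what this PR does, is to leave a code comment and optionally emit a message through the standard `logging` module, which the calling application can configure or silence. The function name `count_versions` and the logger name are hypothetical.

```python
import logging

# Hypothetical logger name; applications can configure or silence it.
logger = logging.getLogger("example_archive_library")


def count_versions(lines):
    """Count the non-empty lines, without ever printing to the screen."""
    versions = [line for line in lines if line.strip()]
    if not versions:
        # Internet Archive does not have archived versions of this URL.
        # With default logging settings this message is not displayed.
        logger.info("no archived versions found")
    return len(versions)
```

An application that wants to see these messages can opt in with `logging.basicConfig(level=logging.INFO)`; one that does not simply does nothing.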
Alright, I'll change it to a comment. I guess I'll learn more about good practices for library code with experience.
web_monitoring/internetarchive.py
    dt = datetime.strptime(dt_str, DATE_FMT)
    yield dt, uri
else:
    yield None, None
I would prefer this to fail more loudly than yielding `(None, None)`. A downstream caller of this is generally expecting to get back a timestamp and a URI, and it could error out in a confusing way. Better to fail sooner if someone tries to crawl a nonexistent URL: I think if `check_exists` fails, we can raise `ValueError` or a subclass thereof.
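A minimal sketch of the "fail loudly" idea, assuming a hypothetical `ValueError` subclass and helper name (neither is part of the PR): the caller gets either real `(timestamp, uri)` pairs or an explicit error, never a placeholder.

```python
class NoArchivedVersionsError(ValueError):
    """Hypothetical error for a URL that has no archived versions."""


def require_versions(pairs, url):
    """Return the (timestamp, uri) pairs as a list, or raise if there are none."""
    pairs = list(pairs)
    if not pairs:
        raise NoArchivedVersionsError(
            "No archived versions found for {}".format(url))
    return pairs
```

Because the error subclasses `ValueError`, existing callers that catch `ValueError` keep working, while newer callers can catch the more specific type.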
I agree with you. As I mentioned earlier, this is just a simple solution I tried in order to get it running. I'm trying to move the URL checking a layer up entirely and will push that code soon.
Sounds good.
@danielballan - I've updated the exception handling in the script. Take a look and let me know what you think.
I originally suggested that you should make As implemented in this PR, What can we do?
When you had suggested creating The idea behind introducing I'll try to incorporate all the functions into a single function and check out |
Apologies if I am missing some context, but

def iterate_version_lines(response_lines):
    """
    Iterate through all the version lines from an iterator over lines of a raw
    HTTP response from the Memento API
    """
    try:
        # The first three lines contain no information we need.
        for _ in range(3):
            next(response_lines)
    except StopIteration:
        # If you wanted to be safer, you could raise a custom error type here
        raise ValueError('Line iterator contained no version lines')
    yield from response_lines

def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    res = requests.get(first_page_url)
    try:
        yield from list_versions(
            iterate_version_lines(
                res.iter_lines()))
    except ValueError:
        # Reformat the error with info we have available at this level
        raise ValueError('Internet archive does not have archived versions of {}'.format(url))

(I might also rename
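The catch-and-re-raise pattern in that suggestion can be tried without network access. The sketch below is an assumption-laden variant, not the PR's code: `fetch_lines` is a hypothetical stand-in for the HTTP request, and `raise ... from` is added to keep the low-level error attached as the cause.

```python
def iterate_version_lines(response_lines, header_length=3):
    """Skip the header lines, then yield every remaining line."""
    try:
        for _ in range(header_length):
            next(response_lines)
    except StopIteration:
        raise ValueError("Line iterator contained no version lines")
    yield from response_lines


def get_versions(url, fetch_lines):
    """Yield version lines for `url`.

    `fetch_lines` is a hypothetical callable returning the response lines.
    """
    try:
        yield from iterate_version_lines(iter(fetch_lines(url)))
    except ValueError as error:
        # Add the URL, which only this level knows about.
        raise ValueError(
            "No archived versions of {}".format(url)) from error
```

The inner function knows only about lines, and the outer one knows about URLs, so each error message is written where the relevant information is available.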
You could also factor out “iterating through the response to a Memento API URL” from “iterating through the versions of a public URL,” which would also simplify what

def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    yield from versions_from_memento(first_page_url)

def versions_from_memento(memento_url):
    # this does the rest of what `get_versions` used to do

def list_versions(lines):
    # No need for the `while True` anymore!
    for line in lines:
        # same logic as before except...
        if 'timemap' in rel_chunk:
            next_page_url, = URL_CHUNK_PATTERN.match(url_chunk).groups()
            yield from versions_from_memento(next_page_url)
        # ...and back to the same logic as before

It’s 🐢
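The recursive shape of that suggestion can be illustrated with in-memory data. Everything here is hypothetical: the `PAGES` dictionary and the `next:` marker stand in for real Memento responses and their `timemap` links, whose actual format is not reproduced.

```python
# Hypothetical paged data: a line starting with "next:" names the next page.
PAGES = {
    "page-1": ["version-a", "version-b", "next:page-2"],
    "page-2": ["version-c"],
}


def versions_from_page(page_id, pages=PAGES):
    """Yield versions from a page, following links to later pages."""
    for line in pages[page_id]:
        if line.startswith("next:"):
            # Delegate to the same generator for the linked page.
            yield from versions_from_page(line[len("next:"):], pages)
        else:
            yield line
```

With this structure, no explicit `while True` loop or mutable "current page" variable is needed; `list(versions_from_page("page-1"))` walks both pages in order.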
@Mr0grog I tried the first approach and while it does enforce a style which reduces the possibility of mutation, it does not immediately tell us if the archives don't exist. As this will be passed to the PageFreezer module, I think it's better if we get the error when
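The timing concern raised here follows from how Python generators work: the body of a generator function does not run until iteration starts, so a check inside it is deferred. A small sketch with hypothetical function names shows the difference between a lazy and an eager check.

```python
def lazy_versions(lines):
    """Generator: the emptiness check runs only when iteration begins."""
    lines = list(lines)
    if not lines:
        raise ValueError("no archived versions")
    yield from lines


def eager_versions(lines):
    """Regular function: the check runs at call time."""
    lines = list(lines)
    if not lines:
        raise ValueError("no archived versions")
    return iter(lines)
```

Calling `lazy_versions([])` succeeds and returns a generator object; the `ValueError` appears only on the first `next()`. Calling `eager_versions([])` raises immediately, which is the behavior wanted if the error should surface before the result is handed to another module.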
… new script(internetarchive_alt) which uses a single function to do what internetarchive does
…version as my fork was not updated earlier
@danielballan I've added another version of the script which has all the code in a single function. This is what I had in mind earlier but now I'm more inclined towards using the code with
Let's go with the 'alt' version since the change is minimally disruptive, and we want to move on to new topics.
I think the `get_versions` approach is fine in general but there are some things I'd want to revise before merging, and instead we should move our attention to PageFreezer. Sound OK?
    # The first three lines contain no information we need.
    for _ in range(3):
        next(lines)
except:
It is best to be as specific as possible when catching errors. We expect `StopIteration` here. If we got some other kind of error, we would want that error to continue to propagate so it could be debugged.
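To make the difference concrete, here is a small sketch (the helper name `skip_header` is hypothetical) that catches only `StopIteration`. A different mistake, such as passing a list instead of an iterator, still surfaces as its own `TypeError` instead of being swallowed.

```python
def skip_header(lines, count=3):
    """Advance an iterator past `count` header lines."""
    try:
        for _ in range(count):
            next(lines)
    except StopIteration:
        # Only the expected failure is translated; anything else propagates.
        raise ValueError(
            "Fewer than {} header lines".format(count)) from None
```

With a bare `except:`, the `TypeError` from `skip_header(["a", "b", "c", "d"])` would have been reported as a missing-header problem, hiding the real bug.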
Yes, that's true. I should've noticed this earlier.
I've deleted the one with
Thanks for your patience on this one, @janakrajchadha. I did some 'git surgery' to clean up the history, applied a minor fix to make this work correctly on multi-page results, and merged it into master as a9f15e1, retaining your authorship credit. Your first commit is in!
🎉 🌮 🎉 amazing! Congrats @janakrajchadha
Thank you @danielballan! It wouldn't have been possible without your guidance and the time you spent on the 'git surgery'. Looking forward to bigger and better commits 😄 Thank you @dcwalk 🎉
WIP (WORK IN PROGRESS)
DO NOT MERGE
Added exception handling for IA script.
Ran into issues when testing with rare cases.
Working on solving these issues.