
Exception handling for IA #42 (Closed)

Conversation

@janakrajchadha (Contributor) commented May 18, 2017

WIP (WORK IN PROGRESS)
DO NOT MERGE

Added exception handling for IA script.

Ran into issues when I tried it on some rare cases. Working on solving these issues.

@danielballan (Contributor) left a comment:

I just left some quick notes to provide a little guidance early. I appreciate that this is still a work in progress!



    except StopIteration:
        print("Internet archive does not have archived versions of this url.")
@danielballan (Contributor):

I appreciate the clarification here of what is happening with StopIteration. To refine it a bit, I would make this sentence a code comment instead of a call to print. Library code should almost never print, because it might "spam" the screen in a way that the calling application does not want, and the application has no easy way of silencing the printing.
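If the message should still be able to surface, the standard logging module is the usual middle ground: the library emits the message, and the calling application decides whether to show, redirect, or silence it. A minimal sketch (the function name and the three-line header assumption are illustrative, not from the PR):

```python
import logging

# Module-level logger: applications control its handlers and level.
logger = logging.getLogger(__name__)


def skip_header(lines):
    """Advance past the first three (header) lines of a Memento response."""
    try:
        # The first three lines contain no information we need.
        for _ in range(3):
            next(lines)
    except StopIteration:
        # The stream ended before any version lines appeared.
        logger.warning("Internet Archive has no archived versions of this URL.")
```

Unlike print, a warning logged this way can be suppressed with a single `logging.getLogger(...).setLevel(logging.ERROR)` call in the application.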

@janakrajchadha (Contributor, Author):

Alright, I'll change it to a comment. I guess I'll learn more about good practices for library code with experience.

        dt = datetime.strptime(dt_str, DATE_FMT)
        yield dt, uri
    else:
        yield None, None
@danielballan (Contributor):

I would prefer this to fail more loudly than yielding (None, None). A downstream caller of this is generally expecting to get back a timestamp and a URI, and it could error out in a confusing way. Better to fail sooner if someone tries to crawl a nonexistent URL: I think if check_exists fails, we can raise ValueError or a subclass thereof.
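The fail-fast alternative might look like the sketch below. The function name, the `DATE_FMT` value, and the line representation are assumptions for illustration; the point is replacing the `(None, None)` sentinel with a raised `ValueError`:

```python
from datetime import datetime

DATE_FMT = '%a, %d %b %Y %H:%M:%S GMT'  # assumed Memento date format


def parse_memento_line(line, url):
    """Parse one (dt_str, uri) pair into (datetime, uri), failing loudly."""
    if line is None:
        # Fail fast: downstream callers expect (timestamp, uri) pairs,
        # so a missing archive should be an error, not a (None, None)
        # sentinel that errors out confusingly later.
        raise ValueError(
            'Internet Archive has no archived versions of {}'.format(url))
    dt_str, uri = line
    return datetime.strptime(dt_str, DATE_FMT), uri
```

A caller that wants to treat "no archives" as a non-fatal case can still catch the `ValueError` explicitly, which makes the handling visible at the call site.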

@janakrajchadha (Contributor, Author):

I agree with you. Like I had mentioned earlier, this is just a simple solution I've tried to get it running. I'm trying to completely move the url checking a layer up and will push that code soon.

@danielballan (Contributor):

Sounds good.

@janakrajchadha (Contributor, Author) commented May 24, 2017

@danielballan - I've updated the exception handling in the script. Take a look and let me know what you think.

@danielballan (Contributor):

I originally suggested that you should make check_exists a separate function from list_version. I had in mind check_exists(url) and list_versions(url) (though to be clear I did not specify this at the time). You implemented check_exists(lines), where lines is an intermediate result. I'm glad you did, because this gives us an opportunity to ponder some interesting design considerations.

As implemented in this PR, list_versions(lines) and check_exists(lines) are examples of functions that mutate their inputs. You give them a generator, lines, and through their action they mutate that generator in an irreversible way. Mutating inputs is usually something to avoid: separate functions give the impression that they can be used and/or tested independently, when in fact they are tightly coupled by the mutation of lines. For example, it is not safe to call check_exists twice on the same lines.
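A tiny demonstration of the coupling (the `check_exists` body here is hypothetical, shown only to make the effect concrete):

```python
def check_exists(lines):
    """Hypothetical check: 'peek' at the stream by consuming its first item."""
    return next(lines, None) is not None


lines = iter(['header', 'version-1', 'version-2'])
check_exists(lines)      # consumes 'header' as a side effect
remaining = list(lines)  # ['version-1', 'version-2'] -- 'header' is gone
```

The second consumer silently sees a shorter stream, and a second check_exists call on an exhausted generator would report that nothing exists.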

What can we do?

  1. My original idea was to have check_exists(url) do its own request. This is slightly wasteful because it involves pinging IA twice -- once in check_exists(url) and again in list_versions(url) -- but in practice it's probably OK because we will add new URLs (Pages) much more rarely than we will add new Versions. And, this way, the lines objects would stay internal to each function.
  2. But your way is indeed more efficient. To avoid the "mutating inputs" problem, perhaps we should just put all of this into one function, list_versions(url) containing all the code from the other functions. I think that was your original plan, and in retrospect it looks pretty smart!
  3. There is a third way. We could use the built-in function itertools.tee to "peek" ahead into lines without actually mutating it. That's probably more trouble than we should take in this instance, but it's a nice utility to know about.
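Option 3 might look like this sketch: itertools.tee produces two independent iterators over the same underlying stream, so one copy can be consumed to peek while a fresh copy is handed back to the caller (the function name and return shape are assumptions):

```python
import itertools


def check_exists(lines):
    """Peek at a line stream without consuming the copy we hand back.

    Returns (exists, lines2), where lines2 is a fresh iterator positioned
    where the caller's iterator was before the peek.
    """
    peek, rest = itertools.tee(lines)
    exists = next(peek, None) is not None
    return exists, rest
```

One caveat: after calling tee, the original iterator must not be used again; the caller should switch to the returned copy, which is why the sketch returns it explicitly.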

@janakrajchadha (Contributor, Author) commented May 25, 2017

When you suggested creating check_exists(), I did start off with check_exists(url) and list_versions(url) separately pinging IA, but, as you mentioned here, that felt unnecessary, so I tried to ping the site just once.

The idea behind introducing get_versions(url) was to ping IA just once and lead directly to the list_versions() generator object.
I think that if users are directed to use only get_versions(url), the mutation issue is avoided. I may be wrong here, though: will users actually follow the instructions? You can't predict if and when curiosity will take over a person's mind.

I'll try to incorporate all the functions into a single function and check out itertools.tee as well.

@Mr0grog (Member) commented May 25, 2017

Apologies if I am missing some context, but check_exists seems like a good candidate for a generator, which might simplify the code and enforce a style where we need to worry less about mutation:

def iterate_version_lines(response_lines):
    """
    Iterate through all the version lines from an iterator over lines of a raw
    HTTP response from the Memento API
    """
    
    try:
        # The first three lines contain no information we need.
        for _ in range(3):
            next(response_lines)
    except StopIteration:
        # If you wanted to be safer, you could raise a custom error type here
        raise ValueError('Line iterator contained no version lines')

    yield from response_lines
    
    
def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    res = requests.get(first_page_url)
    
    try:
        yield from list_versions(
            iterate_version_lines(
                res.iter_lines()))
    except ValueError:
        # Reformat the error with info we have available at this level
        raise ValueError('Internet archive does not have archived versions of {}'.format(url))

(I might also rename list_versions to parse_version_lines or similar to clarify how it should be used.)

@Mr0grog (Member) commented May 25, 2017

You could also factor out “iterating through the response to a Memento API URL” from “iterating through the versions of a public URL,” which would also simplify what list_versions does:

def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    yield from versions_from_memento(first_page_url)


def versions_from_memento(memento_url):
    # this does the rest of what `get_versions` used to do


def list_versions(lines):
    # No need for the `while True` anymore!
    for line in lines:
        # same logic as before except...
        if 'timemap' in rel_chunk:
            next_page_url, = URL_CHUNK_PATTERN.match(url_chunk).groups()
            yield from versions_from_memento(next_page_url)
            
        # ...and back to the same logic as before

It’s 🐢
  🐢
  🐢
  🐢
  🐢
  🐢 all the way down

@janakrajchadha (Contributor, Author):

@Mr0grog I tried the first approach and while it does enforce a style which reduces the possibility of mutation, it does not immediately tell us if the archives don't exist. As this will be passed to the PageFreezer module, I think it's better if we get the error when get_versions is called.
I did not quite get what you mean by the second comment. I'll spend some more time on it and see what works best. Thanks for the suggestions!

… new script (internetarchive_alt) which uses a single function to do what internetarchive does
@janakrajchadha (Contributor, Author):

@danielballan I've added another version of the script which has all the code in a single function. This is what I had in mind earlier but now I'm more inclined towards using the code with get_versions(url). I'll leave it up to you to decide which one we should use. I'll update the test file once this is done.

@danielballan (Contributor) left a comment:

Let's go with the 'alt' version, since that change is minimally disruptive, and we want to move on to new topics.

I think the get_versions approach is fine in general but there are some things I'd want to revise before merging, and instead we should move our attention to PageFreezer. Sound OK?

    # The first three lines contain no information we need.
    for _ in range(3):
        next(lines)
except:
@danielballan (Contributor):

It is best to be as specific as possible when catching errors. We expect StopIteration here. If we got some other kind of error, we would want that error to continue to propagate so it could be debugged.
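Concretely, the fix is to name the expected exception so that anything else propagates with a full traceback. A sketch under the same three-header-line assumption as the PR (the function name is illustrative):

```python
def skip_header_lines(lines, count=3):
    """Skip header lines, turning a too-short stream into a ValueError."""
    try:
        # The first `count` lines contain no information we need.
        for _ in range(count):
            next(lines)
    except StopIteration:
        # Only "iterator exhausted" is expected here. A TypeError, a network
        # error, or any other bug still propagates unchanged for debugging.
        raise ValueError('Response contained no version lines')
```

A bare `except:` would also swallow things like KeyboardInterrupt, which is another reason to keep the clause as narrow as possible.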

@janakrajchadha (Contributor, Author):

Yes, that's true. I should've noticed this earlier.

@janakrajchadha (Contributor, Author):

I've deleted the one with get_versions and renamed internetarchive_alt to internetarchive. I've also made the except clause catch StopIteration specifically, as suggested.
I'll focus on PF now.

@danielballan (Contributor):

Thanks for your patience on this one, @janakrajchadha. I did some 'git surgery' to clean up the history, applied a minor fix to make this work correctly on multi-page results, and merged it into master as a9f15e1, retaining your authorship credit. Your first commit is in!

@dcwalk (Contributor) commented Jun 11, 2017

🎉 🌮 🎉 amazing! Congrats @janakrajchadha

@janakrajchadha (Contributor, Author):

Thank you @danielballan! It wouldn't have been possible without your guidance and the time you spent on the 'git surgery'. Looking forward to bigger and better commits 😄

Thank you @dcwalk 🎉
