
Exception handling for IA #42 (Closed)

Conversation

@janakrajchadha (Contributor) commented May 18, 2017

WIP (WORK IN PROGRESS)
DO NOT MERGE

Added exception handling for IA script.

Ran into issues when I tried it on some rare cases. Working on solving these issues.

@danielballan (Contributor) left a comment:

I just left some quick notes to provide a little guidance early. I appreciate that this is still a work in progress!



    except StopIteration:
        print("Internet archive does not have archived versions of this url.")
@danielballan (Contributor):

I appreciate the clarification here of what is happening with StopIteration. To refine it a bit, I would make this sentence a code comment instead of a call to print. Library code should almost never print, because it might "spam" the screen in a way that the calling application does not want, and the application has no easy way of silencing the printing.
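If the message should still be able to surface, the standard logging module is the usual middle ground: the library emits the message, and the calling application decides whether to show, redirect, or silence it. A minimal sketch (the function name and the three-line header assumption are illustrative, not from the PR):

```python
import logging

# Module-level logger: applications control its handlers and level.
logger = logging.getLogger(__name__)


def skip_header(lines):
    """Advance past the first three (header) lines of a Memento response."""
    try:
        # The first three lines contain no information we need.
        for _ in range(3):
            next(lines)
    except StopIteration:
        # The stream ended before any version lines appeared.
        logger.warning("Internet Archive has no archived versions of this URL.")
```

Unlike print, a warning logged this way can be suppressed with a single `logging.getLogger(...).setLevel(logging.ERROR)` call in the application.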

@janakrajchadha (Contributor, Author):

Alright, I'll change it to a comment. I guess I'll learn more about good practices for library code with experience.

        dt = datetime.strptime(dt_str, DATE_FMT)
        yield dt, uri
    else:
        yield None, None
@danielballan (Contributor):

I would prefer this to fail more loudly than yielding (None, None). A downstream caller of this is generally expecting to get back a timestamp and a URI, and it could error out in a confusing way. Better to fail sooner if someone tries to crawl a nonexistent URL: I think if check_exists fails, we can raise ValueError or a subclass thereof.
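The fail-fast alternative might look like the sketch below. The function name, the `DATE_FMT` value, and the line representation are assumptions for illustration; the point is replacing the `(None, None)` sentinel with a raised `ValueError`:

```python
from datetime import datetime

DATE_FMT = '%a, %d %b %Y %H:%M:%S GMT'  # assumed Memento date format


def parse_memento_line(line, url):
    """Parse one (dt_str, uri) pair into (datetime, uri), failing loudly."""
    if line is None:
        # Fail fast: downstream callers expect (timestamp, uri) pairs,
        # so a missing archive should be an error, not a (None, None)
        # sentinel that errors out confusingly later.
        raise ValueError(
            'Internet Archive has no archived versions of {}'.format(url))
    dt_str, uri = line
    return datetime.strptime(dt_str, DATE_FMT), uri
```

A caller that wants to treat "no archives" as a non-fatal case can still catch the `ValueError` explicitly, which makes the handling visible at the call site.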

@janakrajchadha (Contributor, Author):

I agree with you. Like I had mentioned earlier, this is just a simple solution I've tried to get it running. I'm trying to completely move the url checking a layer up and will push that code soon.

@danielballan (Contributor):

Sounds good.

@janakrajchadha (Contributor, Author) commented May 24, 2017

@danielballan - I've updated the exception handling in the script. Take a look and let me know what you think.

@danielballan (Contributor):

I originally suggested that you should make check_exists a separate function from list_version. I had in mind check_exists(url) and list_versions(url) (though to be clear I did not specify this at the time). You implemented check_exists(lines), where lines is an intermediate result. I'm glad you did, because this gives us an opportunity to ponder some interesting design considerations.

As implemented in this PR, list_versions(lines) and check_exists(lines) are examples of functions that mutate their inputs. You give them a generator, lines, and through their action they mutate that generator in an irreversible way. Mutating inputs is usually something to avoid: separate functions give the impression that they can be used and/or tested independently, when in fact they are tightly coupled by the mutation of lines. For example, it is not safe to call check_exists twice on the same lines.
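A tiny demonstration of the coupling (the `check_exists` body here is hypothetical, shown only to make the effect concrete):

```python
def check_exists(lines):
    """Hypothetical check: 'peek' at the stream by consuming its first item."""
    return next(lines, None) is not None


lines = iter(['header', 'version-1', 'version-2'])
check_exists(lines)      # consumes 'header' as a side effect
remaining = list(lines)  # ['version-1', 'version-2'] -- 'header' is gone
```

The second consumer silently sees a shorter stream, and a second check_exists call on an exhausted generator would report that nothing exists.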

What can we do?

  1. My original idea was to have check_exists(url) do its own request. This is slightly wasteful because it involves pinging IA twice -- once in check_exists(url) and again in list_versions(url) -- but in practice it's probably OK because we will add new URLs (Pages) much more rarely than we will add new Versions. And, this way, the lines objects would stay internal to each function.
  2. But your way is indeed more efficient. To avoid the "mutating inputs" problem, perhaps we should just put all of this into one function, list_versions(url) containing all the code from the other functions. I think that was your original plan, and in retrospect it looks pretty smart!
  3. There is a third way. We could use the built-in function itertools.tee to "peek" ahead into lines without actually mutating it. That's probably more trouble than we should take in this instance, but it's a nice utility to know about.
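Option 3 might look like this sketch: itertools.tee produces two independent iterators over the same underlying stream, so one copy can be consumed to peek while a fresh copy is handed back to the caller (the function name and return shape are assumptions):

```python
import itertools


def check_exists(lines):
    """Peek at a line stream without consuming the copy we hand back.

    Returns (exists, lines2), where lines2 is a fresh iterator positioned
    where the caller's iterator was before the peek.
    """
    peek, rest = itertools.tee(lines)
    exists = next(peek, None) is not None
    return exists, rest
```

One caveat: after calling tee, the original iterator must not be used again; the caller should switch to the returned copy, which is why the sketch returns it explicitly.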

@janakrajchadha (Contributor, Author) commented May 25, 2017

When you suggested creating check_exists(), I did start off with check_exists(url) and list_versions(url) separately pinging IA, but, as you mentioned here, that felt unnecessary, so I tried to ping the site just once.

The idea behind introducing get_versions(url) was to ping IA just once and lead directly to the list_versions() generator object.
I think that if users are directed to use only get_versions(url), the mutation issue is avoided. I may be wrong here, though: will users actually follow the instructions? You can't predict if and when curiosity will take over a person's mind.

I'll try to incorporate all the functions into a single function and check out itertools.tee as well.

@Mr0grog (Member) commented May 25, 2017

Apologies if I am missing some context, but check_exists seems like a good candidate for a generator, which might simplify the code and enforce a style where we need to worry less about mutation:

def iterate_version_lines(response_lines):
    """
    Iterate through all the version lines from an iterator over lines of a raw
    HTTP response from the Memento API
    """
    
    try:
        # The first three lines contain no information we need.
        for _ in range(3):
            next(response_lines)
    except StopIteration:
        # If you wanted to be safer, you could raise a custom error type here
        raise ValueError('Line iterator contained no version lines')

    yield from response_lines
    
    
def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    res = requests.get(first_page_url)
    
    try:
        yield from list_versions(
            iterate_version_lines(
                res.iter_lines()))
    except ValueError:
        # Reformat the error with info we have available at this level
        raise ValueError('Internet archive does not have archived versions of {}'.format(url))

(I might also rename list_versions to parse_version_lines or similar to clarify how it should be used.)

@Mr0grog (Member) commented May 25, 2017

You could also factor out “iterating through the response to a Memento API URL” from “iterating through the versions of a public URL,” which would also simplify what list_versions does:

def get_versions(url):
    first_page_url = TIMEMAP_URL_TEMPLATE.format(url)
    yield from versions_from_memento(first_page_url)


def versions_from_memento(memento_url):
    # this does the rest of what `get_versions` used to do


def list_versions(lines):
    # No need for the `while True` anymore!
    for line in lines:
        # same logic as before except...
        if 'timemap' in rel_chunk:
            next_page_url, = URL_CHUNK_PATTERN.match(url_chunk).groups()
            yield from versions_from_memento(next_page_url)
            
        # ...and back to the same logic as before

It’s 🐢
  🐢
  🐢
  🐢
  🐢
  🐢 all the way down

@janakrajchadha (Contributor, Author):

@Mr0grog I tried the first approach and while it does enforce a style which reduces the possibility of mutation, it does not immediately tell us if the archives don't exist. As this will be passed to the PageFreezer module, I think it's better if we get the error when get_versions is called.
I did not quite get what you mean by the second comment. I'll spend some more time on it and see what works best. Thanks for the suggestions!

… new script (internetarchive_alt) which uses a single function to do what internetarchive does
@janakrajchadha (Contributor, Author):

@danielballan I've added another version of the script which has all the code in a single function. This is what I had in mind earlier but now I'm more inclined towards using the code with get_versions(url). I'll leave it up to you to decide which one we should use. I'll update the test file once this is done.

@danielballan (Contributor) left a comment:

Let's go with the 'alt' version, since that change is minimally disruptive, and we want to move on to new topics.

I think the get_versions approach is fine in general but there are some things I'd want to revise before merging, and instead we should move our attention to PageFreezer. Sound OK?

    # The first three lines contain no information we need.
    for _ in range(3):
        next(lines)
except:
@danielballan (Contributor):

It is best to be as specific as possible when catching errors. We expect StopIteration here. If we got some other kind of error, we would want that error to continue to propagate so it could be debugged.
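Concretely, the fix is to name the expected exception so that anything else propagates with a full traceback. A sketch under the same three-header-line assumption as the PR (the function name is illustrative):

```python
def skip_header_lines(lines, count=3):
    """Skip header lines, turning a too-short stream into a ValueError."""
    try:
        # The first `count` lines contain no information we need.
        for _ in range(count):
            next(lines)
    except StopIteration:
        # Only "iterator exhausted" is expected here. A TypeError, a network
        # error, or any other bug still propagates unchanged for debugging.
        raise ValueError('Response contained no version lines')
```

A bare `except:` would also swallow things like KeyboardInterrupt, which is another reason to keep the clause as narrow as possible.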

@janakrajchadha (Contributor, Author):

Yes, that's true. I should've noticed this earlier.

@janakrajchadha (Contributor, Author):

I've deleted the one with get_versions and renamed internetarchive_alt to internetarchive. I've also made the except clause catch StopIteration specifically, as suggested.
I'll focus on PF now.

@danielballan (Contributor):

Thanks for your patience on this one, @janakrajchadha. I did some 'git surgery' to clean up the history, applied a minor fix to make this work correctly on multi-page results, and merged it into master as a9f15e1, retaining your authorship credit. Your first commit is in!

@dcwalk (Contributor) commented Jun 11, 2017

🎉 🌮 🎉 amazing! Congrats @janakrajchadha

@janakrajchadha (Contributor, Author):

Thank you @danielballan! It wouldn't have been possible without your guidance and the time you spent on the 'git surgery'. Looking forward to bigger and better commits 😄

Thank you @dcwalk 🎉
