
Gain Improvements - Ludaro #46

Open · kwuite opened this issue Sep 15, 2018 · 5 comments

kwuite (Collaborator) commented Sep 15, 2018

re.findall issue

I reviewed the tests in this project after running into issues where my regex was also catching some html as part of the process.

So I reviewed this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and captured the response of abstract_urls.

Version 0.1.4 of this project returns this as the response:

URLS we found: ['/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/']

re.findall returns every occurrence the regex asks for anywhere in the raw html, not the links that are actually matched on the page!
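
A minimal illustration of the problem (the html snippet is made up, and the rule /page/\d+ is only my guess at what the test actually uses):

    import re

    # Markup resembling quotes.toscrape.com: the tag links contain "/page/1/" as a substring
    html = (
        '<a href="/tag/change/page/1/">change</a>'
        '<a href="/tag/deep-thoughts/page/1/">deep thoughts</a>'
        '<a href="/page/2/">Next</a>'
    )

    # findall scans the raw html and reports every occurrence of the pattern,
    # including the ones buried inside the longer /tag/.../page/1/ hrefs
    print(re.findall(r'/page/\d+', html))  # ['/page/1', '/page/1', '/page/2']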

Test incorrect

The base url http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page, and if you look at the html you will only find a reference to "/page/2", not to "/page/1". For this reason the test seems to work, but it was actually flawed from the start.


re.match

I rewrote the abstract_urls function to:

    # Module-level requirements: re, lxml.html, etree from lxml,
    # urllib.parse (imported as urlparse) and the project's logger.

    def abstract_urls(self, html, base_url):
        _urls = []

        try:
            document = lxml.html.fromstring(html)
            document_domain = urlparse.urlparse(base_url).netloc  # currently unused

            # iterlinks() yields (element, attribute, link, pos) for every link in the document
            for (element, attribute, link, pos) in document.iterlinks():
                # Strip fragment identifiers such as "#top" and skip empty links
                link = re.sub("#.*", "", link or "")

                if not link:
                    continue

                _urls.append(link)
        except (etree.XMLSyntaxError, etree.ParserError) as e:
            logger.error("While parsing the html for {} we received the following error {}.".format(base_url, e))

        # Cleanup urls: keep only the links that match the parser's rule from the start (re.match)
        r = re.compile(self.rule)
        urls = list(filter(r.match, _urls))

        return urls

and this is now the result of abstract_urls:

['/static/bootstrap.min.css', '/static/main.css', '/', '/login', '/author/Albert-Einstein', '/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/author/J-K-Rowling', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/author/Albert-Einstein', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/author/Jane-Austen', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/author/Marilyn-Monroe', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/author/Albert-Einstein', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/author/Andre-Gide', '/tag/life/page/1/', '/tag/love/page/1/', '/author/Thomas-A-Edison', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/author/Eleanor-Roosevelt', '/tag/misattributed-eleanor-roosevelt/page/1/', '/author/Steve-Martin', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/page/2/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/', 'https://www.goodreads.com/quotes', 'https://scrapinghub.com']

The test tests/test_parse_multiple_items.py now fails, as it should.
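
Filtering the extracted list with re.match explains why: the pattern is anchored at the start of each url, so the tag pages that merely contain /page/1/ no longer count (again assuming a rule like /page/\d+; the test's real rule may differ):

    import re

    # Assumed rule; the actual rule in tests/test_parse_multiple_items.py may differ
    rule = re.compile(r'/page/\d+')

    extracted = ['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/page/2/', '/login']

    # re.match only succeeds from the start of the string,
    # so only the real pagination link survives
    print(list(filter(rule.match, extracted)))  # ['/page/2/']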

kwuite (Collaborator, Author) commented Sep 15, 2018

@gaojiuli, I see we cannot install this project from PyPI because that version is flawed. Can I expect this project to accept PRs, or should I continue with my own fork?

kwuite (Collaborator, Author) commented Sep 18, 2018

For people reading this message, here is my fork of the project.

Changes

  • Bug fixes
  • Improved default url extraction
  • The proxy can now be a generator, so every fetch can yield a fresh proxy (a rough sketch follows this list)
  • Implemented aiocache, with Redis as default (but you can choose), so you can cache every single request and define which urls shouldn't be cached with cache_disabled_urls, in case you are scraping a feed or blog daily and are only interested in the new urls. Just set cache_enabled=True and run the docker instance that I have mentioned in the README.md file.
  • There is now a spider test flag so you can test a single parsed page to improve your css or figure out issues with your code. Together with the cache feature this makes for a very quick and reliable trial-and-error cycle. Increase the number of tests to be run with max_requests.
  • With the limit_requests flag you can adjust the max_requests value to actually limit the maximum number of external requests.
  • Replaced the Pyquery code with the lxml library and made it really easy to extract data with css selectors. It's written with backwards compatibility in mind, but if you use jquery-style selectors you can still use the pyquery code by changing your Css class in Item to the Pyq class.
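
For the proxy change, a rough sketch of the idea (the names and the pool here are illustrative, not the fork's actual API):

    import itertools

    def proxy_generator():
        # Hypothetical pool; in practice this could read from a file or a proxy service
        pool = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']
        for proxy in itertools.cycle(pool):
            yield proxy

    proxies = proxy_generator()
    next(proxies)  # proxy used for the first fetch
    next(proxies)  # a different proxy for the next fetch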

Extraction

With the new Css class the following can be done (a rough lxml-based illustration follows the list):

  • Get the dict of an element
  • Extract all elements into a list
  • HTML tables to dict
  • Control the index
  • Iterate through texts
  • Request text and text content
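
As a rough illustration of the idea with plain lxml (this is only a sketch, not the fork's actual Css class API; cssselect() also needs the cssselect package installed):

    import lxml.html

    html = (
        '<div><div class="quote">'
        '<span class="text">A quote</span>'
        '<small class="author">Someone</small>'
        '</div></div>'
    )
    document = lxml.html.fromstring(html)

    # Extract all matching elements into a list of texts
    texts = [el.text_content() for el in document.cssselect('.quote .text')]

    # Get a dict out of a single element's children
    quote = document.cssselect('.quote')[0]
    record = {
        'text': quote.cssselect('.text')[0].text_content(),
        'author': quote.cssselect('.author')[0].text_content(),
    }
    print(texts, record)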

In case you need to clean up data or extract things like a phone number or email address, the following manipulate options are available:

  • clean_string; will become clean in the future, so it cleans lists, dicts, and strings to give much more control
  • extract_email
  • extract_phone
  • extract_website
  • to_date_iso

The order in which you supply manipulate options is the order of execution, so you can actually combine these manipulations (an illustration follows).
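
As an illustration of that ordering (clean_string and extract_email here are simplified stand-ins, not the fork's exact implementations):

    import re

    def clean_string(value):
        # Collapse whitespace and trim the ends
        return re.sub(r'\s+', ' ', value).strip()

    def extract_email(value):
        match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', value)
        return match.group(0) if match else None

    raw = '  Contact:   info@example.com \n'
    # Manipulations run in the order supplied: clean first, then extract
    print(extract_email(clean_string(raw)))  # info@example.com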

And many more will be added in the future. I have written tests for all features so take a look at this file if you are interested.

With the current version on my dev branch you can:

  1. First focus on parsing pages with the right xpath or regex (go for this one, since it's very reliable in my version)
  2. Set cache_enabled to true and go through all pages to cache them
  3. Use the test feature to start writing your extraction code
  4. Once done, re-run your code on the Redis cache and push it to any data source like csv, postgres, etc.

I hope @gaojiuli is interested in the way I have moved forward with this project, so we can merge our code once I am satisfied with a production version. I kept the philosophy of creating a scraper for everyone, and with that in mind I changed the way we extract data.

kwuite changed the title from "Abstract_url is deeply flawed" to "Gain Improvements - Ludaro" on Sep 18, 2018
elliotgao2 (Owner) commented

Accept pr.

kwuite (Collaborator, Author) commented Sep 24, 2018

@gaojiuli, great news; I'm happy that I can share my code.

Before the PR, I have to do the following:

  1. Update the clean code
  2. Ensure users can get their url back
  3. Go through all the code once more to make sure we can publish to PyPI

For item.py I have a question regarding this code:

        # Fall back to a placeholder file url when the item defines no save_url
        if hasattr(self, 'save_url'):
            self.url = getattr(self, 'save_url')
        else:
            self.url = "file:///tmp.data"

Are you using this code, or is this junk code that can be removed?

elliotgao2 (Owner) commented

  1. Just open a pull request.
  2. I will review your code.
  3. I will release a new version after I merge your pull request.

(Any kind of optimization is welcome.)
