Scrape and gui sources #2

Closed · wants to merge 15 commits into from
Conversation

@alecxe (Contributor) commented Feb 28, 2013

No description provided.

econchick pushed a commit that referenced this pull request Mar 1, 2013
Minor typo (you're -> your).
@econchick (Owner)

Could you just give me a brief description of how you approached the Scrape project? Besides researching the libraries and installing everything and such, how did you go about coding this?

Just trying to figure out the best way to go about guiding this portion of the tutorial.

Thanks!

@alecxe (Contributor, Author) commented Mar 1, 2013

Sure!

  1. The whole project was created after going through the scrapy tutorial: http://doc.scrapy.org/en/0.16/intro/tutorial.html. So, the starting point was to create a scrapy project.
  2. I looked for a regularly updated page to parse. Groupon didn't work because the data for the deals is loaded by an XHR call, so scrapy couldn't get it. There are workarounds for dealing with that, but they would be too complicated for the tutorial. So, I chose the livingsocial New York page.
  3. I wrote the spider. It's the most important and maybe the most complicated part of the project, because you have to be familiar with xpath to successfully find what you're looking for. I used the Chrome dev tools to see where the deals were located and what the structure of each deal was, then defined xpaths for every field I wanted to parse. The process of writing the spider went like this: write a draft of the spider class (define name, allowed_domains and start_urls), study the target page html, work out which block contains the deals, and then, step by step, write xpaths for every field in a deal, dumping to json and checking that each field is extracted correctly.
    Then I added processors for the scraped item data - they just trim the field values and join multi-line text into one line.
    At the end of this step, I had the Spider and the Item implemented.
  4. The sqlalchemy model. At this step I had already written the Item class for the scraped data (basically, I looked at the deals on livingsocial and saw which fields were there), so the sqlalchemy model was easy to write (for the sake of simplicity, I just used the 'string' type for all fields).
  5. Pipelining to the database. I didn't find any clear examples of pipelining to a database from scrapy, so I just tried to make it easy to understand: instantiate the db connection in __init__ and insert the data in process_item - scrapy makes it really easy to implement. Then I added the pipeline to the settings.
    Settings are something like the glue between the different components: spiders, items and pipelines (similar to django). A condensed sketch of how these pieces fit together follows this list.
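To make that concrete, here is a condensed, illustrative sketch of the spider, item and processors from steps 3-4, written against the Scrapy 0.16 API from the tutorial link above. The field names and xpaths are placeholders, not the exact code in this pull request:

    # Illustrative sketch only -- field names and xpaths are placeholders.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose, Join


    class DealItem(Item):
        # all fields kept as plain strings for simplicity (see step 4)
        title = Field()
        link = Field()
        location = Field()
        price = Field()


    class LivingSocialSpider(BaseSpider):
        name = "livingsocial"
        allowed_domains = ["livingsocial.com"]
        start_urls = ["http://www.livingsocial.com/cities/1719-newyork-citywide"]

        # one xpath per field, worked out with the Chrome dev tools (step 3)
        deals_xpath = '//li[@dealid]'
        item_fields = {
            'title': './/h2/text()',
            'link': './/a/@href',
            'location': './/p[@class="location"]/text()',
            'price': './/div[@class="deal-price"]/text()',
        }

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for deal in hxs.select(self.deals_xpath):
                loader = XPathItemLoader(DealItem(), selector=deal)
                # processors trim the field values and join multi-line text
                loader.default_input_processor = MapCompose(unicode.strip)
                loader.default_output_processor = Join()
                for field, xpath in self.item_fields.iteritems():
                    loader.add_xpath(field, xpath)
                yield loader.load_item()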

Hope that helps. Let me know if you have any questions, thanks.

@econchick (Owner)

Oh actually another question - using scrapy requires X-Code command line tools for the GCC compiler, right?

@econchick (Owner)

Oh actually - another question.

But first - I really like the scraper code you wrote - very clear and to-the-point. You've made it very easy to write out the tutorial for this. The approach you wrote out is great - I mean, I understand why you went that route and I will mimic that in the tutorial.

I was thinking - these deals have expiration dates. Do you think you could add that to the models - perhaps a way to get both the "saved" date (not necessarily when the deal became public, but when we grabbed it and/or saved it to the db) as well as the expiration date? I'm thinking that I'm just going to query all deals within this category that are good right now.

@econchick (Owner)

OH, 2 more questions (I hope) - what sort of wisdom would you impart to new coders? Any tidbits?

And may I attribute you for these tutorials?

    Scrapy spider part - it actually performs scraping.
    """

    import re
@econchick (Owner):

Hmm - is re used at all here?

@alecxe (Contributor, Author):

Yeah, unused import, surprise! 👍

@alecxe (Contributor, Author) commented Mar 2, 2013

> Oh actually another question - using scrapy requires X-Code command line tools for the GCC compiler, right?

GCC itself is required, and they say X-Code is required on a mac too. I'm on ubuntu and haven't had any major problems with the scrapy installation (maybe because I already had all the system-wide requirements installed). I can repeat the scrapy installation on another ubuntu or windows instance if you want.

> I was thinking - these deals have expiration dates. Do you think you could add that to the models - perhaps a way to get both the "saved" date (not necessarily when the deal became public, but when we grabbed it and/or saved it to the db) as well as the expiration date? I'm thinking that I'm just going to query all deals within this category that are good right now.

Sure, good point, I'll add these dates.
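For context, adding those columns could look roughly like this - a hypothetical sketch assuming a plain SQLAlchemy declarative model; the class and column names are placeholders, not the exact code in this branch:

    from datetime import datetime

    from sqlalchemy import Column, DateTime, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()


    class Deal(Base):
        # hypothetical sketch -- the real model in this branch may differ
        __tablename__ = "deals"

        id = Column(Integer, primary_key=True)
        title = Column(String)
        # when we scraped and saved the row, filled in automatically on insert
        saved = Column(DateTime, default=datetime.utcnow)
        # when the deal stops being valid; nullable because the listing page
        # does not expose it directly (see the follow-up comments below)
        expires = Column(DateTime, nullable=True)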

> OH, 2 more questions (I hope) - what sort of wisdom would you impart to new coders? Any tidbits?

Wisdom? Maybe this note: HTML is often broken and sometimes difficult to parse. Browsers are forgiving - they are not very strict about the page content. So, don't try to parse html with regex.

> And may I attribute you for these tutorials?

Sure, if you want - do it.

@alecxe (Contributor, Author) commented Mar 2, 2013

> I was thinking - these deals have expiration dates.

Well, there are no expiration dates on the page with the list of deals: http://www.livingsocial.com/cities/1719-newyork-citywide. On the deal page there is a note about how many days are remaining... should I parse it and calculate the expiration date from that? It would mean going deeper via the deal link and maybe making the code more complicated - but I will do it if needed.
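For reference, the "go deeper" variant might look roughly like the sketch below - the xpaths, the "N days remaining" wording and the extra item are assumptions about the deal page, not verified code:

    import re
    from datetime import datetime, timedelta

    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider


    class DeepDealItem(Item):
        link = Field()
        expires = Field()


    class DeepLivingSocialSpider(BaseSpider):
        # hypothetical extension, not part of this pull request
        name = "livingsocial_deep"
        allowed_domains = ["livingsocial.com"]
        start_urls = ["http://www.livingsocial.com/cities/1719-newyork-citywide"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for link in hxs.select('//li[@dealid]//a/@href').extract():
                item = DeepDealItem(link=link)
                # follow the deal link and finish the item on the detail page
                yield Request('http://www.livingsocial.com' + link,
                              callback=self.parse_deal_page,
                              meta={'item': item})

        def parse_deal_page(self, response):
            item = response.meta['item']
            hxs = HtmlXPathSelector(response)
            text = hxs.select('//*[contains(text(), "remaining")]/text()').extract()
            match = re.search(r'(\d+)\s+day', text[0]) if text else None
            if match:
                item['expires'] = datetime.utcnow() + timedelta(days=int(match.group(1)))
            return item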

@econchick (Owner)

Ah, hmm, that's a good point; I don't want to make it any more complicated than it is. It's plenty fine right now, and maybe I can use the "go deeper to get the expiry dates" idea as an "if you want to continue with this project" exercise.

Much appreciated! I may comment on the GUI this weekend after I get this tutorial squared away.

Would you be available just for reviewing the tutorial(s) after I've written the first draft?

@alecxe (Contributor, Author) commented Mar 2, 2013

Sure! Will do my best to help you.

@econchick (Owner)

Question: in settings.py, what's the difference between SPIDER_MODULES and NEWSPIDER_MODULE?

    create_deals_table(engine)
    self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
@econchick (Owner):

Where is the spider parameter used?

@alecxe (Contributor, Author):

According to the scrapy docs:

> Each item pipeline component is a single Python class that must implement the following method:
> process_item(item, spider) ...

So, it's just a contract; the spider param is not used in the tutorial.
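For completeness, the pipeline around the snippet above follows roughly this shape - a minimal sketch assuming SQLAlchemy helpers like those hinted at in the diff; db_connect, the module path, and the Deals model name are placeholders, not verified against this branch:

    from sqlalchemy.orm import sessionmaker

    # placeholder import path -- wherever the model module actually lives
    from models import Deals, create_deals_table, db_connect


    class LivingSocialPipeline(object):
        def __init__(self):
            engine = db_connect()               # engine built from settings.py
            create_deals_table(engine)          # create the table if missing
            self.Session = sessionmaker(bind=engine)

        def process_item(self, item, spider):
            # 'spider' is required by the pipeline contract but unused here
            session = self.Session()
            try:
                session.add(Deals(**item))
                session.commit()
            except Exception:
                session.rollback()
                raise
            finally:
                session.close()
            return item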

@alecxe (Contributor, Author) commented Mar 3, 2013

> Question: in settings.py, what's the difference between SPIDER_MODULES and NEWSPIDER_MODULE?

SPIDER_MODULES is the list of modules where scrapy will look for spiders; this one is required for the tutorial.
NEWSPIDER_MODULE is only used by the genspider command, so it is not actually used in the tutorial.
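Concretely, the relevant part of settings.py looks something like this - the "scraper_app" package name is a placeholder, and in the Scrapy version used here ITEM_PIPELINES is a plain list of class paths:

    # settings.py sketch -- the "scraper_app" package name is a placeholder
    BOT_NAME = 'scraper_app'

    # where Scrapy looks for existing spiders (this is what `scrapy crawl` needs)
    SPIDER_MODULES = ['scraper_app.spiders']

    # only consulted by `scrapy genspider` when generating a new spider
    NEWSPIDER_MODULE = 'scraper_app.spiders'

    # the "glue": register the database pipeline from step 5
    ITEM_PIPELINES = ['scraper_app.pipelines.LivingSocialPipeline']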

@alecxe (Contributor, Author) commented Mar 3, 2013

> Also - when I run scrapy crawl livingsocial then query the database, I don't get any records in return. Do you get anything when you query postgres?

Yeah, it works here.
Does it actually get the data? Try running scrapy crawl livingsocial -o output.json and see if there is something in output.json.
Do you see the "deals" table in the postgres database?

@econchick (Owner)

@alecxe - Here is the draft of the tutorial I wrote up with your code: tutorial

This does not take into account the added comments that you just provided re NEWSPIDER_MODULE and the returning of item.

In regards to getting the data - running that command with the -o output.json flag, I do see a bunch of deals, like so:

    {'category': u'families',
     'description': u'Car Seat Installation',
     'link': u'/cities/1719-newyork-citywide/deals/584328-car-seat-installation',
     'location': u'Nassau County',
     'original_price': u'40',
     'price': u' 20',
     'title': u'Precious Cargo Installers'}

fly by in the console.

OH WAIT - does scrapy crawl livingsocial not save to the database? Do I have to run something else? It's not clear in the README that you provided.

@alecxe (Contributor, Author) commented Mar 3, 2013

No no, scrapy crawl livingsocial does save to the database. The only thing you have to do beforehand is create the database and set the database settings in settings.py.
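As an illustration only (the exact key names depend on how the engine is built in the model module), the database settings could be a dict in settings.py that gets fed into SQLAlchemy's URL helper:

    # hypothetical example -- key names and values are placeholders; the
    # database itself must already exist before crawling
    DATABASE = {
        'drivername': 'postgres',
        'host': 'localhost',
        'port': '5432',
        'username': 'postgres',
        'password': 'CHANGE_ME',
        'database': 'livingsocial',
    }

    # elsewhere, the engine would be built from it along these lines:
    #   from sqlalchemy import create_engine
    #   from sqlalchemy.engine.url import URL
    #   engine = create_engine(URL(**DATABASE))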

@econchick (Owner)

Manually merged - still working on fine-tuning the scrape tutorial; will get started on the GUI one soon (perhaps after PyCon :-/)

@econchick closed this Mar 9, 2013