Scrape and gui sources #2
Conversation
Minor typo (you're -> your).
Could you give me a brief description of how you approached the Scrape project? Besides researching the libraries and installing everything, how did you go about coding this? Just trying to figure out the best way to guide this portion of the tutorial. Thanks!
Sure!
Hope that helps. Let me know if you have any questions, thanks.
Oh, actually, another question - using Scrapy requires the Xcode command line tools for the GCC compiler, right?
Oh actually - another question. But first - I really like the scraper code you wrote - very clear and to the point. You've made it very easy to write up the tutorial for this. The approach you took is great - I understand why you went that route, and I will mimic it in the tutorial. I was thinking - these deals have expiration dates. Do you think you could add that to the models - perhaps a way to get both the "saved" date (not necessarily when the deal went public, but when we grabbed it and/or saved it to the db) as well as the expiration date? I'm thinking that I'll just query all deals within a given category that are good right now.
Oh, two more questions (I hope) - what sort of wisdom would you impart on new coders? Any tidbits? And may I credit you for these tutorials?
"""
Scrapy spider part - it actually performs scraping.
"""

import re
Hmm - is `re` used at all here?
Yeah, unused import, surprise!
GCC itself is required, and they say that Xcode is required on Mac as well. I'm on Ubuntu and haven't had any major problems with the Scrapy installation (maybe because I had all the system-wide requirements pre-installed). I can repeat the Scrapy installation on another Ubuntu or Windows instance if you want.
Sure, good point, I'll add these dates.
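For reference, a minimal sketch of what those two columns could look like on a SQLAlchemy declarative model (the trimmed-down Deals class and the column names here are assumptions for illustration, not the tutorial's actual code):

```python
from sqlalchemy import Column, DateTime, Integer, String, func
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Deals(Base):
    """Trimmed-down deals model; the two date columns are the point here."""
    __tablename__ = 'deals'
    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    # when we grabbed/saved the deal - filled in automatically on insert
    date_saved = Column('date_saved', DateTime, default=func.now())
    # when the deal stops being valid; stays NULL until we scrape it
    date_expires = Column('date_expires', DateTime, nullable=True)
```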
Wisdom? Maybe this note: HTML is often broken and sometimes difficult to parse. Browsers are forgiving - they are not very strict about page content. So, don't try to parse HTML with regex.
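As a toy illustration of that advice, here is how Scrapy's own selectors cope with sloppy markup (this assumes a recent Scrapy, where `.get()` is available; older versions use `.extract_first()`, and the HTML snippet is made up):

```python
from scrapy.selector import Selector

# Note the unclosed <b> tag - the selector still extracts the text
# cleanly, where a hand-rolled regex would be brittle
html = '<div class="deal"><a href="/d/1">50% off <b>pizza</a></div>'
title = Selector(text=html).xpath('string(//a)').get()
print(title)  # -> '50% off pizza'
```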
Sure, if you want - do it.
Well, there are no expiration dates on the page with the list of deals: http://www.livingsocial.com/cities/1719-newyork-citywide. On the deal page itself there is a note about how many days remain. Should I parse it and calculate the expiration date? That would mean going deeper via the deal link and maybe making the code more complicated, but I will do it if needed.
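If that route were taken, the "days remaining" note is plain text rather than HTML, so a small helper could turn it into an expiration date (a sketch only - the note's exact wording and the helper name are assumptions):

```python
import re
from datetime import datetime, timedelta

def expiration_from_note(note, saved=None):
    """Turn an 'N days remaining'-style note into an expiration datetime."""
    # regex is fine here: this is a short text note, not HTML
    saved = saved or datetime.utcnow()
    match = re.search(r'(\d+)\s+day', note)
    if match:
        return saved + timedelta(days=int(match.group(1)))
    return None  # wording didn't match; leave the expiration unknown

print(expiration_from_note('5 days remaining'))
```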
Ah, hmm, that's a good point; I don't want to make it any more complicated than it is. It's plenty fine right now, and maybe I can use "go deeper to get the expiry dates" as an "if you want to continue with this project" exercise. Much appreciated! I may comment on the GUI this weekend after I get this tutorial squared away. Would you be available to review the tutorial(s) after I've written the first draft?
Sure! Will do my best to help you. |
Question: in settings.py, what's the difference between SPIDER_MODULES and NEWSPIDER_MODULE?
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
Where is the `spider` parameter used?
According to the Scrapy docs:

"Each item pipeline component is a single Python class that must implement the following method: process_item(item, spider) ..."

So it's just a contract; the spider param is not used in the tutorial.
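Putting that together with the diff snippet above, the whole pipeline looks roughly like this (a sketch: db_connect, create_deals_table, and Deals are assumed to live in the project's models module):

```python
from sqlalchemy.orm import sessionmaker

from .models import Deals, db_connect, create_deals_table  # assumed project helpers

class LivingSocialPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # `spider` is accepted to satisfy Scrapy's pipeline contract,
        # but nothing in this method needs it
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```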
SPIDER_MODULES is the list of modules where Scrapy will look for spiders. This is required for the tutorial.
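For completeness: per the Scrapy docs, NEWSPIDER_MODULE is the single module where the `scrapy genspider` command creates new spiders, while SPIDER_MODULES is the list of modules Scrapy searches for spider classes. A typical settings.py pairing (the project/module name here is illustrative):

```python
SPIDER_MODULES = ['scraper_app.spiders']   # where Scrapy looks for spider classes
NEWSPIDER_MODULE = 'scraper_app.spiders'   # where `scrapy genspider` puts new ones
```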
Yeah, it works here.
@alecxe - Here is the draft of the tutorial I wrote up with your code: tutorial. It does not take into account the added comments that you just provided. In regards to getting the data - running that command (scrapy crawl livingsocial), I just see the scraped items fly by in the console. OH WAIT - does it actually save them to the database?
No no - scrapy crawl livingsocial does save to the database. The only thing you need to do beforehand is create the database and set the database settings in settings.py.
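Those database settings might look like this (a sketch: the DATABASE dict is the project's own convention, presumably read by a db_connect() helper rather than a built-in Scrapy setting, and every value below is a placeholder):

```python
# settings.py
DATABASE = {
    'drivername': 'postgres',
    'host': 'localhost',
    'port': '5432',
    'username': 'your_username',
    'password': 'your_password',
    'database': 'scrape',
}
```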
Manually merged - still working on fine-tuning the scrape tutorial; will get started on the GUI one soon (perhaps after PyCon :-/)