Scrape and gui sources #2

Closed
wants to merge 15 commits into
from


@alecxe
Contributor
alecxe commented Feb 28, 2013

No description provided.

@econchick
Owner

Could you just give me a brief description of how you approached the Scrape project? Besides researching the libraries and installing everything, how did you go about coding it?

Just trying to figure out the best way to guide this portion of the tutorial.

Thanks!

@alecxe
Contributor
alecxe commented Mar 1, 2013

Sure!

  1. The whole project was created after going through the Scrapy tutorial: http://doc.scrapy.org/en/0.16/intro/tutorial.html. So the starting point was to create a Scrapy project.
  2. I picked a regularly updated page to parse. Groupon didn't work because the deal data is loaded by an XHR call, so Scrapy couldn't get to it. There are workarounds for that, but they would be too complicated for the tutorial. So I chose the LivingSocial New York page.
  3. I wrote the spider. It's the most important and probably the most complicated part of the project, because you have to be familiar with XPath to find what you're looking for. I used the Chrome dev tools to see where the deals were located and what the structure of each deal was, then defined XPaths for every field I wanted to parse. The process of writing the spider went like this: write a draft of the spider class (define name, allowed_domains, start_urls), study the target page HTML, find the block that contains the deals, and then, step by step, write an XPath for every field in a deal, dumping to JSON and checking that each field comes out correctly (see the sketch after this list).
    Then I added processors for the scraped item data - they just trim field values and join multi-line values into one line.
    At the end of this step, I had the Spider and the Item implemented.
  4. The SQLAlchemy model. At this point I had already written the Item class for the scraped data (basically, I looked at the deals on LivingSocial and saw what fields were there), so the SQLAlchemy model was easy to write (for simplicity, I just used the 'string' type for all fields).
  5. Pipelining to the database. I didn't find any clear examples of pipelining from Scrapy to a database, so I just tried to make it easy to understand: instantiate the db connection in __init__ and insert data in process_item - Scrapy makes this really easy to implement. Then I added the pipeline to the settings.
    Settings are something like the glue between the different components: spiders, items and pipelines (similar to Django).
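
To make step 3 a bit more concrete, here is a rough sketch of the spider draft in the Scrapy 0.16 style used by the tutorial linked above. The XPath expressions and the DealItem name are placeholders, not the exact code in this pull request; the field names match the ones that show up in the scraped output later in this thread.

    # Rough sketch only - the real XPaths come from inspecting the LivingSocial
    # page in the Chrome dev tools; DealItem is a placeholder item class name.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from tutorial.items import DealItem  # hypothetical name

    class LivingSocialSpider(BaseSpider):
        name = "livingsocial"
        allowed_domains = ["livingsocial.com"]
        start_urls = ["http://www.livingsocial.com/cities/1719-newyork-citywide"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # placeholder XPath for the block that contains a single deal
            for deal in hxs.select('//li[contains(@class, "deal")]'):
                item = DealItem()
                # one placeholder XPath per field; the other fields work the same way
                item['title'] = deal.select('.//h2/text()').extract()
                item['link'] = deal.select('.//a/@href').extract()
                item['price'] = deal.select('.//*[@class="price"]/text()').extract()
                yield item

Dumping to JSON while iterating on the XPaths is just scrapy crawl livingsocial -o output.json, the same command that comes up later in this thread.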

Hope that helps. Let me know if you have any questions, thanks.

@econchick
Owner

Oh actually another question - using scrapy requires X-Code command line tools for the GCC compiler, right?

@econchick
Owner

Oh actually - another question.

But first - I really like the scraper code you wrote - very clear and to-the-point. You've made it very easy to write out the tutorial for this. The approach you wrote out is great - I mean, I understand why you went that route and I will mimic that in the tutorial.

I was thinking - these deals have expiration dates. Do you think you could add that to the models - perhaps a way to get both the "saved" date (not necessarily when the deal came public, but when we grabbed it and/or saved it to the db) as well as the expiration date. I'm thinking that I'm just going to query all deals within this category that are good right now.

@econchick
Owner

OH 2 more questions (I hope) - what sort of wisdom would you impart on new coders? any tidbit?

And may I attribute you for these tutorials?

@econchick econchick and 1 other commented on an outdated diff Mar 2, 2013
...urce/tutorial/tutorial/spiders/livingsocial_spider.py
@@ -0,0 +1,58 @@
+#! -*- coding: utf-8 -*-
+
+"""
+Web Scraper Project
+
+Scrape data from a regularly updated website livingsocial.com and
+save to a database (postgres).
+
+Scrapy spider part - it actually performs scraping.
+"""
+
+import re
@econchick
econchick Mar 2, 2013 Owner

Hmm - is re used at all here?

@alecxe
alecxe Mar 2, 2013 Contributor

Yeah, unused import, surprise! 👍

@alecxe
Contributor
alecxe commented Mar 2, 2013

Oh actually another question - using scrapy requires X-Code command line tools for the GCC compiler, right?

GCC itself is required, and they say X-Code is required on a Mac too. I'm on Ubuntu and haven't had any major problems with the Scrapy installation (maybe because I had all the system-wide requirements pre-installed). I can repeat the Scrapy installation on another Ubuntu or Windows instance if you want.

I was thinking - these deals have expiration dates. Do you think you could add that to the models - perhaps a way to get both the "saved" date (not necessarily when the deal came public, but when we grabbed it and/or saved it to the db) as well as the expiration date. I'm thinking that I'm just going to query all deals within this category that are good right now.

Sure, good point, I'll add these dates.
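
Something along these lines (just a sketch - the id column and the saved_date / end_date names aren't final):

    # Sketch of adding a "saved" date and an expiration date to the Deals model.
    from datetime import datetime

    from sqlalchemy import Column, DateTime, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Deals(Base):
        __tablename__ = 'deals'

        id = Column(Integer, primary_key=True)
        title = Column(String)
        link = Column(String)
        price = Column(String)
        original_price = Column(String)
        # when we grabbed/saved the deal, not necessarily when it went public
        saved_date = Column(DateTime, default=datetime.utcnow)
        # left empty unless we parse the expiration from the deal page
        end_date = Column(DateTime, nullable=True)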

OH 2 more questions (I hope) - what sort of wisdom would you impart on new coders? any tidbit?

Wisdom? Maybe this note: HTML is often broken and sometimes difficult to parse. Browsers are forgiving; they are not very strict about the page content. So don't try to parse HTML with regex.

And may I attribute you for these tutorials?

Sure, go ahead if you'd like.

@alecxe
Contributor
alecxe commented Mar 2, 2013

I was thinking - these deals have expiration dates.

Well, there are no expiration dates on the page with the list of deals: http://www.livingsocial.com/cities/1719-newyork-citywide. On the deal page there is a note about how many days remain. Should I parse that and calculate the expiration date? That means going deeper via the deal link and maybe making the code more complicated, but I'll do it if needed (a rough sketch of what that would look like is below).
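
Just to illustrate what "go deeper" would mean (in case it becomes an optional exercise), roughly something like this - the XPaths and the "X days remaining" handling are placeholders, not tested against the real pages:

    # Hypothetical sketch: follow each deal link, read the "days remaining" note
    # on the deal page, and turn it into an expiration date.
    from datetime import datetime, timedelta
    from urlparse import urljoin

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class LivingSocialSpider(BaseSpider):
        name = "livingsocial"
        allowed_domains = ["livingsocial.com"]
        start_urls = ["http://www.livingsocial.com/cities/1719-newyork-citywide"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for deal in hxs.select('//li[contains(@class, "deal")]'):  # placeholder
                link = deal.select('.//a/@href').extract()[0]
                # follow the deal link and finish building the item there
                yield Request(urljoin(response.url, link), callback=self.parse_deal)

        def parse_deal(self, response):
            hxs = HtmlXPathSelector(response)
            note = hxs.select('//*[@class="countdown"]/text()').extract()  # placeholder
            days_left = int(note[0].split()[0]) if note else 0
            expiration = datetime.utcnow() + timedelta(days=days_left)
            # build the item here (title, price, etc.) and include the expiration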

@econchick
Owner

Ah hmmm, that's a good point; I don't want to make it any more complicated than it is. It's plenty fine right now, and maybe I can use the "go deeper to get the expiry dates" idea as an "if you want to continue on this project" exercise.

Much appreciated! I may comment on the GUI this weekend after I get this tutorial squared away.

Would you be available just for reviewing the tutorial(s) after I've written the first draft?

@alecxe
Contributor
alecxe commented Mar 2, 2013

Sure! Will do my best to help you.

@econchick
Owner

Question: in settings.py, what's the difference between SPIDER_MODULES and NEWSPIDER_MODULE?

@econchick econchick commented on the diff Mar 2, 2013
scrape/lib/full_source/tutorial/tutorial/pipelines.py
+from sqlalchemy.orm import sessionmaker
+from models import Deals, db_connect, create_deals_table
+
+
+class LivingSocialPipeline(object):
+ """Livingsocial pipeline for storing scraped items in the database"""
+ def __init__(self):
+ """
+ Initializes database connection and sessionmaker,
+ creates deals table
+ """
+ engine = db_connect()
+ create_deals_table(engine)
+ self.Session = sessionmaker(bind=engine)
+
+ def process_item(self, item, spider):
@econchick
econchick Mar 2, 2013 Owner

Where is the spider parameter used?

@alecxe
alecxe Mar 3, 2013 Contributor

According to scrapy docs:

Each item pipeline component is a single Python class that must implement the following method:
process_item(item, spider) ...

So, it's just a contract; the spider param is not used in the tutorial.

@econchick econchick and 1 other commented on an outdated diff Mar 2, 2013
scrape/lib/full_source/tutorial/tutorial/pipelines.py
+ creates deals table
+ """
+ engine = db_connect()
+ create_deals_table(engine)
+ self.Session = sessionmaker(bind=engine)
+
+ def process_item(self, item, spider):
+ """
+ This method is called for every item pipeline component.
+ Saves deals in the database
+ """
+ session = self.Session()
+ deal = Deals(**item)
+ session.add(deal)
+ session.commit()
+ return item
@econchick
econchick Mar 2, 2013 Owner

What's the purpose of returning item?

@econchick
econchick Mar 3, 2013 Owner

Also - when I run scrapy crawl livingsocial then query the database, I don't get any records in return. Do you get anything when you query postgres?

@alecxe
alecxe Mar 3, 2013 Contributor

What's the purpose of returning item?

According to scrapy docs:

process_item(item, spider)
This method is called for every item pipeline component and must either return a Item (or any descendant class) object or raise a DropItem exception.

It must return an item.

See, for example, JsonWriterPipeline - it writes lines to an open file, but still returns the item.
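
For reference, the JsonWriterPipeline example from the docs is roughly this (paraphrased, not copied exactly):

    import json

    class JsonWriterPipeline(object):
        """Docs example: write each item as a JSON line, then return the item."""

        def __init__(self):
            self.file = open('items.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item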

@econchick econchick and 1 other commented on an outdated diff Mar 3, 2013
scrape/lib/full_source/tutorial/tutorial/pipelines.py
+ Initializes database connection and sessionmaker,
+ creates deals table
+ """
+ engine = db_connect()
+ create_deals_table(engine)
+ self.Session = sessionmaker(bind=engine)
+
+ def process_item(self, item, spider):
+ """
+ This method is called for every item pipeline component.
+ Saves deals in the database
+ """
+ session = self.Session()
+ deal = Deals(**item)
+ session.add(deal)
+ session.commit()
@econchick
econchick Mar 3, 2013 Owner

If we have a commit() - is this SQLAlchemy's way of saying "commit if nothing went wrong when writing to the database, or else rollback?"

@alecxe
alecxe Mar 3, 2013 Contributor

Nope, that was a mistake. According to the SQLAlchemy docs, we should handle rollbacks manually here. Fixed.
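
The fix follows the usual SQLAlchemy session pattern - roughly this: commit on success, roll back on failure, and always close the session:

    def process_item(self, item, spider):
        """Save deals in the database, rolling back on any error."""
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item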

@alecxe
Contributor
alecxe commented Mar 3, 2013

Question: in settings.py, what's the difference between SPIDER_MODULES and NEWSPIDER_MODULE?

SPIDER_MODULES is the list of modules where Scrapy will look for spiders. This is required for the tutorial.
NEWSPIDER_MODULE is used by the genspider command, so it is not actually used in the tutorial.
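
In settings.py that part looks roughly like this (the module paths assume the tutorial/ project layout, and the pipeline line is the "added the pipeline to the settings" step from earlier):

    BOT_NAME = 'tutorial'

    SPIDER_MODULES = ['tutorial.spiders']   # where "scrapy crawl" looks for spiders
    NEWSPIDER_MODULE = 'tutorial.spiders'   # only used by the "scrapy genspider" command

    # the database pipeline from pipelines.py
    ITEM_PIPELINES = ['tutorial.pipelines.LivingSocialPipeline']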

@alecxe
Contributor
alecxe commented Mar 3, 2013

Also - when I run scrapy crawl livingsocial then query the database, I don't get any records in return. Do you get anything when you query postgres?

Yeah, it works here.
Does it actually get the data? Try running scrapy crawl livingsocial -o output.json and see if there is anything in output.json.
Do you see the "deals" table in the Postgres database?

@econchick
Owner

@alecxe - Here is the draft of the tutorial I wrote up with your code: tutorial

This does not take into account the added comments that you just provided re NEWSPIDER_MODULE and the returning of item.

In regards to getting the data - running that command with the -o output.json flag, I do see a bunch of deals, like so:

    {'category': u'families',
     'description': u'Car Seat Installation',
     'link': u'/cities/1719-newyork-citywide/deals/584328-car-seat-installation',
     'location': u'Nassau County',
     'original_price': u'40',
     'price': u' 20',
     'title': u'Precious Cargo Installers'}

fly by in the console.

OH WAIT, does scrapy crawl livingsocial not save to the database? Do I have to run something else? It's not clear in the README that you provided.

@alecxe
Contributor
alecxe commented Mar 3, 2013

No no, scrapy crawl livingsocial does save to the database. The only thing you need to do beforehand is create the database and set the database settings in settings.py.
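
The database settings live in settings.py and db_connect() in models.py builds the connection from them. The exact keys are whatever models.py expects, but it's roughly a dict like this:

    # Illustrative only - check models.py for the exact keys db_connect() reads.
    DATABASE = {
        'drivername': 'postgres',
        'host': 'localhost',
        'port': '5432',
        'username': 'your_username',
        'password': 'your_password',
        'database': 'scrape',
    }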

@econchick
Owner

Manually merged - still working on fine-tuning the scrape tutorial; will get started on the GUI one soon (perhaps after PyCon :-/)

@econchick econchick closed this Mar 9, 2013