
add warnings on deprecated pipeline classes
funkatron committed Apr 15, 2013
Commit f0f7acb (0 parents)
Showing 57 changed files with 8,202 additions and 0 deletions.
57 changes: 57 additions & 0 deletions .gitignore
@@ -0,0 +1,57 @@
# scrapy stuff
.scrapy
scrapy_proj/setup.py
dbs/
settings.py

.DS_Store
.AppleDouble
.LSOverride
Icon

# Thumbnails
._*

# Files that might appear on external disk
.Spotlight-V100
.Trashes

# virtualenvs
venv

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject
14 changes: 14 additions & 0 deletions CONTRIBUTORS
@@ -0,0 +1,14 @@
Here are people who have contributed code to this project.

Adam M Dutko <https://github.com/StylusEater>
Bedrich Rios <https://github.com/bedrich>
Chris Shiflett <https://github.com/shiflett>
Dan McGowan <https://github.com/dansmcgowan>
Ed Finkler <https://github.com/funkatron>
Eric Leclerc <https://github.com/eleclerc>
Evan Haas <https://github.com/ehaas>
Jonathan Suh <https://github.com/jonsuh>
josefeg <https://github.com/josefeg>
Justin Duke <https://github.com/dukerson>
mickaobrien <https://github.com/mickaobrien>
Tyler Mincey <https://github.com/tmincey>
13 changes: 13 additions & 0 deletions LICENSE
@@ -0,0 +1,13 @@
Copyright 2013 Fictive Kin, LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
125 changes: 125 additions & 0 deletions README.md
@@ -0,0 +1,125 @@
# Open Recipes

## About

Open Recipes is an open database of recipe bookmarks.

Our goals are simple:

1. Help publishers make their recipes as discoverable and consumable (get it?) as possible.
2. Prevent good recipes from disappearing when a publisher goes away.

That's pretty much it. We're not trying to save the world. We're just trying to save some recipes.

## Recipe Bookmarks?

The recipes in Open Recipes do not include preparation instructions. This is why we like to think of Open Recipes as a database of recipe bookmarks. We think this database should provide everything you need to *find* a great recipe, but not everything you need to *prepare* a great recipe. For preparation instructions, please link to the source.

## The Database

Regular snapshots of the database will be provided as JSON. The format will mirror the [schema.org Recipe format](http://schema.org/Recipe). We've [posted an example dump of data](http://openrecipes.s3.amazonaws.com/openrecipes.txt) so you can get a feel for it.
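Here's a rough sketch of how you might read such a dump, assuming each line of the example file is a single JSON object whose keys mirror schema.org Recipe properties like `name`, `url`, and `ingredients` (this is illustrative, not a spec):

<pre>
import json

# Illustrative only: assumes the dump is newline-delimited JSON, one recipe
# bookmark per line, with schema.org-style keys (Python 2, like the rest of
# the project).
with open('openrecipes.txt') as f:
    for line in f:
        recipe = json.loads(line)
        print recipe['name'], recipe['url']
</pre>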

## The Story

We're not a bunch of chefs. We're not even good cooks.

When we read about the [acquisition and shutdown of Punchfork](http://punchfork.com/pinterest), we just shook our heads. It was the same ol' story:

> We're excited to share the news that we're gonna be rich! To celebrate, we're shutting down the site and taking all your data down with it. So long, suckers!

This part of the story isn't unique, but it continues. When one of our Studiomates spoke up about her disappointment, we listened. Then, [we acted](https://hugspoon.com/punchfork). What happened next surprised us. The CEO of Punchfork [took issue](https://twitter.com/JeffMiller/status/314899821351821312) with our good deed and demanded that we not save any data, even the data (likes) of users who asked us to save their data.

Here's the thing. None of the recipes belonged to Punchfork. They were scraped from various [publishers](https://github.com/fictivekin/openrecipes/wiki/Publishers) to begin with. But, we don't wanna ruffle any feathers, so we're starting over.

Use the force; seek the source?

## The Work

Wanna help? Fantastic. We knew we liked you.

We're gonna be using [the wiki](https://github.com/fictivekin/openrecipes/wiki) to help organize this effort. Right now, there are two simple ways to help:

1. Add a [publisher](https://github.com/fictivekin/openrecipes/wiki/Publishers). We wanna have the most complete list of recipe publishers. This is the easiest way to contribute. Please also add [an issue](https://github.com/fictivekin/openrecipes/issues) and tag it `publisher`. If you don't have a GitHub account, you can also email suggestions to [openrecipes@fictivekin.com](mailto:openrecipes@fictivekin.com).
2. Claim a publisher.

Claiming a publisher means you are taking responsibility for writing a simple parser for the recipes from this particular publisher. Our tech ([see below](#the-tech)) will store this in an object type based on the [schema.org Recipe format](http://schema.org/Recipe), and can convert it into other formats for easy storage and discovery.
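
For reference, here is a rough sketch of what that object type might look like as a Scrapy Item. The field names are the ones our spider templates fill in; the canonical definition is the project's `openrecipes.items` module, so treat this as illustration only:

<pre>
from scrapy.item import Item, Field

# Sketch only: fields mirror the schema.org/Recipe properties used by the spiders.
class RecipeItem(Item):
    name = Field()
    url = Field()
    source = Field()
    description = Field()
    image = Field()
    ingredients = Field()
    prepTime = Field()
    cookTime = Field()
    recipeYield = Field()
    datePublished = Field()
</pre>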

Each publisher is a [GitHub issue](https://github.com/fictivekin/openrecipes/issues), so you can claim a publisher by claiming an issue. Just like a bug, and just as delicious. Just leave a comment on the issue claiming it, and it's all yours.

When you have a working parser (what we call a "spider" below), you contribute it to this project by submitting a [GitHub pull request](https://help.github.com/articles/using-pull-requests). We'll use it to periodically bring recipe data into our database. The database will initially be available as data dumps.

## The Tech

To gather data for Open Recipes, we are building spiders based on [Scrapy](http://scrapy.org), a web scraping framework written in Python. We are using [Scrapy v0.16](http://doc.scrapy.org/en/0.16/) at the moment. To contribute spiders for sites, you should have basic familiarity with:

* Python
* Git
* HTML and/or XML

### Setting up a dev environment

> Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.

To get things going, you will need the following tools:

1. Python 2.7 (including headers)
1. Git
1. `pip`
1. `virtualenv`

You will probably already have the first two, although you may need to install Python headers on Linux with something like `apt-get install python-dev`.

If you don't have `pip`, follow [the installation instructions in the pip docs](http://www.pip-installer.org/en/latest/installing.html). Then you can [install `virtualenv` using pip](http://www.virtualenv.org/en/latest/#installation).

Once you have `pip` and `virtualenv`, you can clone our repo and install requirements with the following steps:

1. Open a terminal and `cd` to the directory that will contain your repo clone. For these instructions, we'll assume you `cd ~/src`.
2. `git clone https://github.com/fictivekin/openrecipes.git` to clone the repo. This will make a `~/src/openrecipes` directory that contains your local repo.
3. `cd ./openrecipes` to move into the newly-cloned repo.
4. `virtualenv --no-site-packages venv` to create a Python virtual environment inside `~/src/openrecipes/venv`.
5. `source venv/bin/activate` to activate your new Python virtual environment.
6. `pip install -r requirements.txt` to install the required Python libraries, including Scrapy.
7. `scrapy -h` to confirm that the `scrapy` command was installed. You should get a dump of the help docs.
8. `cd scrapy_proj/openrecipes` to move into the Scrapy project directory.
9. `cp settings.py.default settings.py` to set up a working settings module for the project.
10. `scrapy crawl thepioneerwoman.feed` to test the feed spider written for [thepioneerwoman.com](http://thepioneerwoman.com). You should get output like the following:

<pre>
2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET http://feeds.feedburner.com/pwcooks> (referer: None)
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) <GET http://thepioneerwoman.com/cooking/2013/03/beef-fajitas/> (referer: http://feeds.feedburner.com/pwcooks)
...
</pre>

If you do, [*baby you got a stew going!*](http://www.youtube.com/watch?v=5lFZAyZPjV0)

### Writing your own spiders

For now, we recommend looking at the following spider definitions to get a feel for writing them:

* [spiders/thepioneerwoman_spider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_spider.py)
* [spiders/thepioneerwoman_feedspider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_feedspider.py)

Both files are extensively documented, and should give you an idea of what's involved. If you have questions, check the [Feedback section](#feedback) and hit us up.

To generate your own spider, use the included `generate.py` program. From the `scrapy_proj` directory, run the following (make sure you are in the correct virtualenv):

`python generate.py SPIDER_NAME START_URL`

This will generate a basic spider for you named SPIDER_NAME that starts crawling at START_URL. All that remains for you to do is fill in the correct info for scraping the name, image, etc. See `python generate.py --help` for other command-line options.
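
As a sketch of what "filling in the correct info" means, the generated spider stubs out XPath variables (`base_path`, `name_path`, `image_path`, and so on) that you replace with site-specific expressions. The selectors below are hypothetical and depend entirely on the publisher's markup:

<pre>
# Hypothetical XPath values for a generated spider's parse_item();
# adjust these to match the publisher's actual HTML.
base_path = '//div[@itemtype="http://schema.org/Recipe"]'

name_path = './/*[@itemprop="name"]/text()'
image_path = './/img[@itemprop="image"]/@src'
description_path = './/*[@itemprop="description"]/text()'
prepTime_path = './/time[@itemprop="prepTime"]/@datetime'
cookTime_path = './/time[@itemprop="cookTime"]/@datetime'
recipeYield_path = './/*[@itemprop="recipeYield"]/text()'
ingredients_path = './/*[@itemprop="ingredients"]'
</pre>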

We'll use the ["fork & pull" development model](https://help.github.com/articles/fork-a-repo) for collaboration, so if you plan to contribute, make sure to fork your own repo off of ours. Then you can send us a pull request when you have something to contribute. Please follow ["PEP 8 - Style Guide for Python Code"](http://www.python.org/dev/peps/pep-0008/) for code you write.

## Feedback?

We're just trying to do the right thing, so we value your feedback as we go. You can ping [Ed](https://github.com/funkatron), [Chris](https://github.com/shiflett), [Andreas](https://github.com/andbirkebaek), or anyone from [Fictive Kin](https://github.com/fictivekin). General suggestions and feedback to [openrecipes@fictivekin.com](mailto:openrecipes@fictivekin.com) are welcome, too.

We're also gonna be on IRC, so please feel free to join us if you have any questions or comments. We'll be hanging out in #openrecipes on Freenode. See you there!
14 changes: 14 additions & 0 deletions requirements.txt
@@ -0,0 +1,14 @@
Scrapy==0.16.4
Twisted==12.3.0
bleach==1.2.1
cssselect==0.8
html5lib==0.95
isodate==0.4.9
lxml==3.1.0
nose==1.3.0
pyOpenSSL==0.13
pymongo==2.5
python-dateutil==2.1
w3lib==1.2
wsgiref==0.1.2
zope.interface==4.0.5
159 changes: 159 additions & 0 deletions scrapy_proj/generate.py
@@ -0,0 +1,159 @@
import argparse
from urlparse import urlparse
import os
import sys

script_dir = os.path.dirname(os.path.realpath(__file__))

SpiderTemplate = """from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from openrecipes.items import RecipeItem, RecipeItemLoader


class %(crawler_name)sMixin(object):

    source = '%(source)s'

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        base_path = 'TODO'
        recipes_scopes = hxs.select(base_path)

        name_path = 'TODO'
        description_path = 'TODO'
        image_path = 'TODO'
        prepTime_path = 'TODO'
        cookTime_path = 'TODO'
        recipeYield_path = 'TODO'
        ingredients_path = 'TODO'
        datePublished = 'TODO'

        recipes = []

        for r_scope in recipes_scopes:
            il = RecipeItemLoader(item=RecipeItem())
            il.add_value('source', self.source)
            il.add_value('name', r_scope.select(name_path).extract())
            il.add_value('image', r_scope.select(image_path).extract())
            il.add_value('url', response.url)
            il.add_value('description', r_scope.select(description_path).extract())

            il.add_value('prepTime', r_scope.select(prepTime_path).extract())
            il.add_value('cookTime', r_scope.select(cookTime_path).extract())
            il.add_value('recipeYield', r_scope.select(recipeYield_path).extract())

            ingredient_scopes = r_scope.select(ingredients_path)
            ingredients = []
            for i_scope in ingredient_scopes:
                pass
            il.add_value('ingredients', ingredients)

            il.add_value('datePublished', r_scope.select(datePublished).extract())

            recipes.append(il.load_item())

        return recipes


class %(crawler_name)scrawlSpider(CrawlSpider, %(crawler_name)sMixin):

    name = "%(domain)s"
    allowed_domains = ["%(domain)s"]
    start_urls = [
        "%(start_url)s",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('TODO'))),

        Rule(SgmlLinkExtractor(allow=('TODO')),
             callback='parse_item'),
    )
"""

FeedSpiderTemplate = """from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
from openrecipes.spiders.%(source)s_spider import %(crawler_name)sMixin


class %(crawler_name)sfeedSpider(BaseSpider, %(crawler_name)sMixin):

    name = "%(name)s.feed"

    allowed_domains = [
        "%(feed_domains)s",
        "feeds.feedburner.com",
        "feedproxy.google.com",
    ]

    start_urls = [
        "%(feed_url)s",
    ]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        links = xxs.select("TODO").extract()
        return [Request(x, callback=self.parse_item) for x in links]
"""


def parse_url(url):
    if url.startswith('http://') or url.startswith('https://'):
        return urlparse(url)
    else:
        return urlparse('http://' + url)


def generate_crawlers(args):
    parsed_url = parse_url(args.start_url)

    domain = parsed_url.netloc
    name = args.name.lower()

    values = {
        'crawler_name': name.capitalize(),
        'source': name,
        'name': domain,
        'domain': domain,
        'start_url': args.start_url,
    }

    spider_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_spider.py' % name)
    with open(spider_filename, 'w') as f:
        f.write(SpiderTemplate % values)

    if args.with_feed:
        feed_url = args.with_feed[0]
        feed_domain = parse_url(feed_url).netloc
        values['feed_url'] = feed_url
        values['name'] = name
        if feed_domain == domain:
            values['feed_domains'] = domain
        else:
            values['feed_domains'] = '%s",\n "%s' % (domain, feed_domain)
        feed_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_feedspider.py' % name)
        with open(feed_filename, 'w') as f:
            f.write(FeedSpiderTemplate % values)


epilog = """
Example usage: python generate.py epicurious http://www.epicurious.com/
"""
parser = argparse.ArgumentParser(description='Generate a scrapy spider', epilog=epilog)
parser.add_argument('name', help='Spider name. This will be used to generate the filename')
parser.add_argument('start_url', help='Start URL for crawling')
parser.add_argument('--with-feed', required=False, nargs=1, metavar='feed-url', help='RSS Feed URL')

if len(sys.argv) == 1:
    parser.print_help(sys.stderr)
else:
    args = parser.parse_args()
    generate_crawlers(args)