This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Commit f0f7acb: add warnings on deprecated pipeline classes

Showing 57 changed files with 8,202 additions and 0 deletions.

# scrapy stuff
.scrapy
scrapy_proj/setup.py
dbs/
settings.py

.DS_Store
.AppleDouble
.LSOverride
Icon

# Thumbnails
._*

# Files that might appear on external disk
.Spotlight-V100
.Trashes

# virtualenvs
venv

*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

Here are people who have contributed code to this project.

Adam M Dutko <https://github.com/StylusEater>
Bedrich Rios <https://github.com/bedrich>
Chris Shiflett <https://github.com/shiflett>
Dan McGowan <https://github.com/dansmcgowan>
Ed Finkler <https://github.com/funkatron>
Eric Leclerc <https://github.com/eleclerc>
Evan Haas <https://github.com/ehaas>
Jonathan Suh <https://github.com/jonsuh>
josefeg <https://github.com/josefeg>
Justin Duke <https://github.com/dukerson>
mickaobrien <https://github.com/mickaobrien>
Tyler Mincey <https://github.com/tmincey>

Copyright 2013 Fictive Kin, LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Open Recipes

## About

Open Recipes is an open database of recipe bookmarks.

Our goals are simple:

1. Help publishers make their recipes as discoverable and consumable (get it?) as possible.
2. Prevent good recipes from disappearing when a publisher goes away.

That's pretty much it. We're not trying to save the world. We're just trying to save some recipes.

## Recipe Bookmarks?

The recipes in Open Recipes do not include preparation instructions. This is why we like to think of Open Recipes as a database of recipe bookmarks. We think this database should provide everything you need to *find* a great recipe, but not everything you need to *prepare* a great recipe. For preparation instructions, please link to the source.

## The Database

Regular snapshots of the database will be provided as JSON. The format will mirror the [schema.org Recipe format](http://schema.org/Recipe). We've [posted an example dump of data](http://openrecipes.s3.amazonaws.com/openrecipes.txt) so you can get a feel for it.
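To give a feel for the shape, here is a rough sketch of what a single recipe bookmark might look like, with field names following the schema.org Recipe format (the values and image URL are invented for illustration, not taken from the real dump):

```python
import json

# A hypothetical recipe bookmark, shaped after the schema.org Recipe
# format. All values below are made up for illustration.
bookmark = {
    "name": "Beef Fajitas",
    "source": "thepioneerwoman",
    "url": "http://thepioneerwoman.com/cooking/2013/03/beef-fajitas/",
    "image": "http://example.com/fajitas.jpg",  # hypothetical image URL
    "description": "Skirt steak fajitas with peppers and onions.",
    "prepTime": "PT20M",   # schema.org uses ISO 8601 durations
    "cookTime": "PT15M",
    "recipeYield": "Serves 6",
    "ingredients": "skirt steak\nbell peppers\nonions",
    "datePublished": "2013-03-25",
}

print(json.dumps(bookmark, indent=2))
```

Note that there are no preparation instructions in there; the `url` field is how you get back to the source for those.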

## The Story

We're not a bunch of chefs. We're not even good cooks.

When we read about the [acquisition shutdown of Punchfork](http://punchfork.com/pinterest), we just shook our heads. It was the same ol' story:

> We're excited to share the news that we're gonna be rich! To celebrate, we're shutting down the site and taking all your data down with it. So long, suckers!

This part of the story isn't unique, but it continues. When one of our Studiomates spoke up about her disappointment, we listened. Then, [we acted](https://hugspoon.com/punchfork). What happened next surprised us. The CEO of Punchfork [took issue](https://twitter.com/JeffMiller/status/314899821351821312) with our good deed and demanded that we not save any data, even the data (likes) of users who had asked us to save it.

Here's the thing: none of the recipes belonged to Punchfork. They were scraped from various [publishers](https://github.com/fictivekin/openrecipes/wiki/Publishers) to begin with. But we don't wanna ruffle any feathers, so we're starting over.

Use the force; seek the source?

## The Work

Wanna help? Fantastic. We knew we liked you.

We're gonna be using [the wiki](https://github.com/fictivekin/openrecipes/wiki) to help organize this effort. Right now, there are two simple ways to help:

1. Add a [publisher](https://github.com/fictivekin/openrecipes/wiki/Publishers). We wanna have the most complete list of recipe publishers. This is the easiest way to contribute. Please also add [an issue](https://github.com/fictivekin/openrecipes/issues) and tag it `publisher`. If you don't have a GitHub account, you can also email us suggestions at openrecipes@fictivekin.com.
2. Claim a publisher.

Claiming a publisher means you are taking responsibility for writing a simple parser for the recipes from that particular publisher. Our tech ([see below](#the-tech)) will store each recipe in an object type based on the [schema.org Recipe format](http://schema.org/Recipe), and can convert it into other formats for easy storage and discovery.

Each publisher is a [GitHub issue](https://github.com/fictivekin/openrecipes/issues), so you can claim a publisher by claiming an issue. Just like a bug, and just as delicious. Leave a comment on the issue claiming it, and it's all yours.

When you have a working parser (what we call "spiders" below), you contribute it to this project by submitting a [GitHub pull request](https://help.github.com/articles/using-pull-requests). We'll use it to periodically bring recipe data into our database. The database will be available initially as data dumps.

## The Tech

To gather data for Open Recipes, we are building spiders based on [Scrapy](http://scrapy.org), a web scraping framework written in Python. We are using [Scrapy v0.16](http://doc.scrapy.org/en/0.16/) at the moment. To contribute spiders for sites, you should have basic familiarity with:

* Python
* Git
* HTML and/or XML

### Setting up a dev environment

> Note: this is strongly biased towards OS X. Feel free to contribute instructions for other operating systems.

To get things going, you will need the following tools:

1. Python 2.7 (including headers)
1. Git
1. `pip`
1. `virtualenv`

You will probably already have the first two, although you may need to install the Python headers on Linux with something like `apt-get install python-dev`.

If you don't have `pip`, follow [the installation instructions in the pip docs](http://www.pip-installer.org/en/latest/installing.html). Then you can [install `virtualenv` using pip](http://www.virtualenv.org/en/latest/#installation).

Once you have `pip` and `virtualenv`, you can clone our repo and install requirements with the following steps:

1. Open a terminal and `cd` to the directory that will contain your repo clone. For these instructions, we'll assume you `cd ~/src`.
2. `git clone https://github.com/fictivekin/openrecipes.git` to clone the repo. This will make a `~/src/openrecipes` directory that contains your local repo.
3. `cd ./openrecipes` to move into the newly-cloned repo.
4. `virtualenv --no-site-packages venv` to create a Python virtual environment inside `~/src/openrecipes/venv`.
5. `source venv/bin/activate` to activate your new Python virtual environment.
6. `pip install -r requirements.txt` to install the required Python libraries, including Scrapy.
7. `scrapy -h` to confirm that the `scrapy` command was installed. You should get a dump of the help docs.
8. `cd scrapy_proj/openrecipes` to move into the Scrapy project directory.
9. `cp settings.py.default settings.py` to set up a working settings module for the project.
10. `scrapy crawl thepioneerwoman.feed` to test the feed spider written for [thepioneerwoman.com](http://thepioneerwoman.com). You should get output like the following:

<pre>
2013-03-30 14:35:37-0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: openrecipes)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Enabled item pipelines: MakestringsPipeline, DuplicaterecipePipeline
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Spider opened
2013-03-30 14:35:37-0400 [thepioneerwoman.feed] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-30 14:35:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) &lt;GET http://feeds.feedburner.com/pwcooks&gt; (referer: None)
2013-03-30 14:35:38-0400 [thepioneerwoman.feed] DEBUG: Crawled (200) &lt;GET http://thepioneerwoman.com/cooking/2013/03/beef-fajitas/&gt; (referer: http://feeds.feedburner.com/pwcooks)
...
</pre>

If you do, [*baby you got a stew going!*](http://www.youtube.com/watch?v=5lFZAyZPjV0)

### Writing your own spiders

For now, we recommend looking at the following spider definitions to get a feel for writing them:

* [spiders/thepioneerwoman_spider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_spider.py)
* [spiders/thepioneerwoman_feedspider.py](scrapy_proj/openrecipes/spiders/thepioneerwoman_feedspider.py)

Both files are extensively documented and should give you an idea of what's involved. If you have questions, check the [Feedback section](#feedback) and hit us up.
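The heart of any spider is pulling fields out of markup with XPath-style selectors. As a rough, Scrapy-free illustration of that idea (using the standard library's `xml.etree.ElementTree` on a made-up snippet, not the project's actual selectors or a real publisher's markup):

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet standing in for a publisher's recipe page.
html = """
<div class="recipe">
  <h1>Beef Fajitas</h1>
  <ul class="ingredients">
    <li>skirt steak</li>
    <li>bell peppers</li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports only a limited XPath subset; Scrapy's
# HtmlXPathSelector is far more capable on real-world HTML.
name = root.find("./h1").text
ingredients = [li.text for li in root.findall("./ul[@class='ingredients']/li")]

print(name)
print(ingredients)
```

A real spider does the same thing: pick a scope for each recipe on the page, then extract each schema.org field from that scope with its own path.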

To generate your own spider, use the included `generate.py` program. From the `scrapy_proj` directory, run the following (make sure you are in the correct virtualenv):

`python generate.py SPIDER_NAME START_URL`

This will generate a basic spider for you named SPIDER_NAME that starts crawling at START_URL. All that remains for you to do is fill in the correct info for scraping the name, image, etc. See `python generate.py --help` for other command-line options.

We'll use the ["fork & pull" development model](https://help.github.com/articles/fork-a-repo) for collaboration, so if you plan to contribute, make sure to fork your own repo off of ours. Then you can send us a pull request when you have something to contribute. Please follow ["PEP 8 - Style Guide for Python Code"](http://www.python.org/dev/peps/pep-0008/) for code you write.

## Feedback?

We're just trying to do the right thing, so we value your feedback as we go. You can ping [Ed](https://github.com/funkatron), [Chris](https://github.com/shiflett), [Andreas](https://github.com/andbirkebaek), or anyone from [Fictive Kin](https://github.com/fictivekin). General suggestions and feedback to [openrecipes@fictivekin.com](mailto:openrecipes@fictivekin.com) are welcome, too.

We're also gonna be on IRC, so please feel free to join us if you have any questions or comments. We'll be hanging out in #openrecipes on Freenode. See you there!

Scrapy==0.16.4
Twisted==12.3.0
bleach==1.2.1
cssselect==0.8
html5lib==0.95
isodate==0.4.9
lxml==3.1.0
nose==1.3.0
pyOpenSSL==0.13
pymongo==2.5
python-dateutil==2.1
w3lib==1.2
wsgiref==0.1.2
zope.interface==4.0.5

import argparse
from urlparse import urlparse  # Python 2; this project targets Python 2.7
import os
import sys

script_dir = os.path.dirname(os.path.realpath(__file__))

SpiderTemplate = """from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from openrecipes.items import RecipeItem, RecipeItemLoader


class %(crawler_name)sMixin(object):

    source = '%(source)s'

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)

        base_path = 'TODO'

        recipes_scopes = hxs.select(base_path)

        name_path = 'TODO'
        description_path = 'TODO'
        image_path = 'TODO'
        prepTime_path = 'TODO'
        cookTime_path = 'TODO'
        recipeYield_path = 'TODO'
        ingredients_path = 'TODO'
        datePublished = 'TODO'

        recipes = []

        for r_scope in recipes_scopes:
            il = RecipeItemLoader(item=RecipeItem())

            il.add_value('source', self.source)

            il.add_value('name', r_scope.select(name_path).extract())
            il.add_value('image', r_scope.select(image_path).extract())
            il.add_value('url', response.url)
            il.add_value('description', r_scope.select(description_path).extract())

            il.add_value('prepTime', r_scope.select(prepTime_path).extract())
            il.add_value('cookTime', r_scope.select(cookTime_path).extract())
            il.add_value('recipeYield', r_scope.select(recipeYield_path).extract())

            ingredient_scopes = r_scope.select(ingredients_path)
            ingredients = []
            for i_scope in ingredient_scopes:
                # TODO: extract each ingredient string from i_scope
                pass
            il.add_value('ingredients', ingredients)

            il.add_value('datePublished', r_scope.select(datePublished).extract())

            recipes.append(il.load_item())

        return recipes


class %(crawler_name)scrawlSpider(CrawlSpider, %(crawler_name)sMixin):

    name = "%(domain)s"

    allowed_domains = ["%(domain)s"]

    start_urls = [
        "%(start_url)s",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('TODO'))),

        Rule(SgmlLinkExtractor(allow=('TODO')),
             callback='parse_item'),
    )
"""

FeedSpiderTemplate = """from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

from openrecipes.spiders.%(source)s_spider import %(crawler_name)sMixin


class %(crawler_name)sfeedSpider(BaseSpider, %(crawler_name)sMixin):

    name = "%(name)s.feed"

    allowed_domains = [
        "%(feed_domains)s",
        "feeds.feedburner.com",
        "feedproxy.google.com",
    ]

    start_urls = [
        "%(feed_url)s",
    ]

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        links = xxs.select("TODO").extract()

        return [Request(x, callback=self.parse_item) for x in links]
"""


def parse_url(url):
    if url.startswith('http://') or url.startswith('https://'):
        return urlparse(url)
    else:
        return urlparse('http://' + url)


def generate_crawlers(args):
    parsed_url = parse_url(args.start_url)

    domain = parsed_url.netloc
    name = args.name.lower()

    values = {
        'crawler_name': name.capitalize(),
        'source': name,
        'name': domain,
        'domain': domain,
        'start_url': args.start_url,
    }

    spider_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_spider.py' % name)
    with open(spider_filename, 'w') as f:
        f.write(SpiderTemplate % values)

    if args.with_feed:
        feed_url = args.with_feed[0]
        feed_domain = parse_url(feed_url).netloc
        values['feed_url'] = feed_url
        values['name'] = name
        if feed_domain == domain:
            values['feed_domains'] = domain
        else:
            values['feed_domains'] = '%s",\n    "%s' % (domain, feed_domain)
        feed_filename = os.path.join(script_dir, 'openrecipes', 'spiders', '%s_feedspider.py' % name)
        with open(feed_filename, 'w') as f:
            f.write(FeedSpiderTemplate % values)


epilog = """
Example usage: python generate.py epicurious http://www.epicurious.com/
"""

parser = argparse.ArgumentParser(description='Generate a scrapy spider', epilog=epilog)
parser.add_argument('name', help='Spider name. This will be used to generate the filename')
parser.add_argument('start_url', help='Start URL for crawling')
parser.add_argument('--with-feed', required=False, nargs=1, metavar='feed-url', help='RSS Feed URL')

if len(sys.argv) == 1:
    parser.print_help(sys.stderr)
else:
    args = parser.parse_args()
    generate_crawlers(args)
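
The generator fills those spider templates with Python's %-style mapping substitution. A minimal, self-contained sketch of that mechanism (the template and values here are toy stand-ins, not the real spider templates):

```python
# Toy template using the same %(key)s mapping substitution the
# generator relies on; the names and values are invented examples.
template = """class %(crawler_name)sMixin(object):
    source = '%(source)s'
"""

values = {"crawler_name": "Epicurious", "source": "epicurious"}

rendered = template % values
print(rendered)
```

This is why every literal `%` in the real templates must stay out of `%(...)s` form: anything matching a conversion specifier gets substituted from the values dict.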