Mocy is a simple web crawling framework that is flexible and easy to use.
## Features

- Concurrent downloads
- Decorators like `@before_download`, `@after_download`, and `@pipe`
- Rate limiting and retry mechanism
- Session keeping
- More...
## Installation

```bash
$ pip install mocy
```
## A Quick Example

The following is a simple spider that extracts upcoming Python events:
```python
from mocy import Spider, Request, pipe


class SimpleSpider(Spider):
    entry = 'https://www.python.org/'

    def parse(self, res):
        for link in res.select('.event-widget li a'):
            yield Request(
                link['href'],
                state={'name': link.text},
                callback=self.parse_detail_page
            )

    def parse_detail_page(self, res):
        date = ' '.join(res.select('.single-event-date')[0].stripped_strings)
        yield res.state['name'], date

    @pipe
    def output(self, item):
        print(f'{item[0]} will be held on "{item[1]}"')


SimpleSpider().start()
```
The result is:
```
[2021-06-08 00:59:22] INFO : Spider is running...
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/" 200 0.27s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1094/" 200 0.61s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/964/" 200 0.63s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1036/" 200 0.69s
[2021-06-08 00:59:23] INFO : "GET https://www.python.org/events/python-events/1085/" 200 0.69s
[2021-06-08 00:59:24] INFO : "GET https://www.python.org/events/python-events/833/" 200 0.79s
[2021-06-08 00:59:24] INFO : Spider exited; total running time 1.12s.

PyFest will be held on "From 16 June through 18 June, 2021"
EuroPython 2021 will be held on "From 26 July through 01 Aug., 2021"
PyCon Namibia 2021 will be held on "From 18 June through 19 June, 2021"
PyOhio 2021 will be held on "31 July, 2021"
SciPy 2021 will be held on "From 12 July through 18 July, 2021"
```
There are some detailed examples in the /examples directory.
## API

### Request

```python
class Request(url: str,
              method: str = 'GET',
              callback: Optional[Callable] = None,
              session: Union[bool, dict] = False,
              state: Optional[dict] = None,
              headers: Optional[dict] = None,
              cookies: Optional[dict] = None,
              params: Optional[dict] = None,
              data: Optional[dict] = None,
              json: Optional[dict] = None,
              files: Optional[dict] = None,
              proxies: Optional[dict] = None,
              verify: bool = True,
              timeout: Optional[Union[Tuple[Number, Number], Number]] = None,
              **kwargs)
```
The popular HTTP library requests is used under the hood. Please refer to its documentation: requests.request.
It accepts some extra parameters:
Parameters:

- `callback`: used to handle the response to this request. The default value is `self.parse`.
- `session`: provides cookie persistence, connection pooling, and configuration. It can be a `bool` or a `dict`. The default value is `False`, which means no new `requests.Session` will be created.
  - If set to `True`, a `requests.Session` object is created, and subsequent requests will be sent within the same session.
  - If set to a `dict`, then in addition to the above, its value is used to provide default data for the `Request`, e.g. `session={'auth': ('user', 'pass'), 'headers': {'x-test': 'true'}}`.
- `state`: an object shared between a request and its corresponding response.
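For instance, a spider could log in once and let later requests ride on the same session. The sketch below uses a hypothetical login endpoint, form fields, and header; only the `session` and `state` semantics come from the documentation above:

```python
from mocy import Spider, Request


class LoginSpider(Spider):
    # Hypothetical login endpoint and credentials, for illustration only.
    entry = Request(
        'https://example.com/login',
        method='POST',
        data={'user': 'alice', 'password': 'secret'},
        # Create a requests.Session seeded with a default header; cookies
        # set by the server persist for subsequent requests.
        session={'headers': {'x-test': 'true'}},
    )

    def parse(self, res):
        # Sent under the same session, so the login cookies travel along.
        yield Request(
            'https://example.com/profile',
            state={'page': 'profile'},
            callback=self.parse_profile,
        )

    def parse_profile(self, res):
        # The state dict set on the request is available on the response.
        yield res.state['page'], res.status_code
```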
### Response

This object contains a server's response to an HTTP request. It is actually the same object as `requests.Response`, with several attributes and methods attached:

Attributes:

- `req`: the `Request` object.
- `state`: the same object that was passed by the `Request`.

Methods:

- `select(self, selector: str, **kw) -> List[bs4.element.Tag]`

  Performs a CSS selection on the HTML document. The powerful HTML parser Beautiful Soup is used under the hood.
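Since `select` returns Beautiful Soup tags, the usual `bs4` accessors apply. A small sketch (the selector is hypothetical):

```python
from mocy import Spider


class LinkSpider(Spider):
    entry = 'https://www.python.org/'

    def parse(self, res):
        # Each match is a bs4.element.Tag: attributes via subscripting or
        # .get(), text via .text or .stripped_strings, and so on.
        for a in res.select('.menu li a'):
            yield a.text.strip(), a.get('href')
```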
### Spider

Base class for spiders. All spiders must inherit from this class.
Class attributes:

- `WORKERS`

  Default: `os.cpu_count() * 2`

  The number of concurrent requests that will be performed by the downloader.

- `TIMEOUT`

  Default: `30`

  The amount of time (in seconds) that the downloader will wait before timing out.

- `DOWNLOAD_DELAY`

  Default: `0`

  The amount of time (in seconds) that the downloader should wait before each download.

- `RANDOM_DOWNLOAD_DELAY`

  Default: `True`

  If enabled, the downloader will wait a random time (0.5 * delay to 1.5 * delay by default) before downloading the next page.

- `RETRY_TIMES`

  Default: `3`

  Maximum number of times to retry when encountering connection issues or unexpected status codes.

- `RETRY_CODES`

  Default: `(500, 502, 503, 504, 408, 429)`

  HTTP response status codes to retry. Other errors (DNS or connection issues) are always retried. 500: Internal Server Error, 502: Bad Gateway, 503: Service Unavailable, 504: Gateway Timeout, 408: Request Timeout, 429: Too Many Requests.

- `RETRY_DELAY`

  Default: `1`

  The amount of time (in seconds) that the downloader will wait before retrying a failed request.

- `DEFAULT_HEADERS`

  Default: `{'User-Agent': 'mocy/0.1'}`
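Since these are plain class attributes, a subclass can override them directly. A minimal sketch with made-up values:

```python
from mocy import Spider


class PoliteSpider(Spider):
    # Slow the crawl down and identify ourselves (values are examples).
    WORKERS = 4
    TIMEOUT = 10
    DOWNLOAD_DELAY = 2
    RANDOM_DOWNLOAD_DELAY = False
    RETRY_TIMES = 5
    DEFAULT_HEADERS = {'User-Agent': 'my-crawler/1.0'}

    entry = 'https://example.com/'

    def parse(self, res):
        yield res.url
```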
Attributes:

- `entry: Union[str, Request, Iterable[Union[str, Request]], Callable] = []`

  The URL(s) or request(s) the spider starts with.

Methods:

- `entry(self) -> Union[str, Request, Iterable[Union[str, Request]]]`

  The entry can also be defined as a method that returns the initial URL(s) or request(s).

- `on_start(self) -> None`

  Called when the spider starts up.

- `on_finish(self) -> None`

  Called when the spider exits.

- `on_error(self, reason: SpiderError) -> None`

  Called when the spider encounters an error while downloading or parsing. It may be called multiple times.

- `parse(self, res: Response) -> Any`

  Parses a response and generates data items or new requests.

- `collect(self, item: Any) -> Any`

  Called when the spider outputs a result. It is usually called multiple times.

- `collect(self, item: Any, res: Response) -> Any`

  Same as above, but also receives the `Response` from which the item was generated.

- `start(self) -> None`

  Starts the spider. It keeps running until all requests have been processed.
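The lifecycle hooks are regular methods, so overriding them is enough. For example, timing a crawl (a sketch using only the hooks listed above):

```python
import time

from mocy import Spider


class TimedSpider(Spider):
    entry = 'https://example.com/'

    def on_start(self):
        # Runs once, before the first request is made.
        self.started_at = time.time()

    def parse(self, res):
        yield res.url

    def on_error(self, reason):
        # May run several times, once per failed download or parse.
        print(f'error: {reason.msg}')

    def on_finish(self):
        # Runs once, after the request queue has been drained.
        print(f'finished in {time.time() - self.started_at:.2f}s')
```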
### Decorators

The decorators can be applied to multiple methods of the `Spider` class. The decorated methods are called in the same order as they were defined.

- `before_download`

  The decorated method is used to modify request objects. If it does not return the same or a new `Request` object, the request will be ignored.

- `after_download`

  The decorated method is used to modify response objects. If it does not return the same or a new `Response` object, the response will be ignored. If it returns a `Request` object, that object will be added to the request queue.

- `pipe`

  The decorated method is used to process yielded items. If it returns `None`, the item won't be passed to the next pipe. `Spider.collect` is a default pipe.
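Putting the three together (a sketch; it assumes `before_download` and `after_download` are importable from `mocy` just like `pipe` in the quick example):

```python
from mocy import Spider, pipe, before_download, after_download


class DecoratedSpider(Spider):
    entry = 'https://example.com/'

    @before_download
    def tag_request(self, req):
        # Modify the outgoing request; it must be returned, otherwise
        # the request is ignored.
        req.headers = {'x-trace': 'demo'}
        return req

    @after_download
    def skip_empty(self, res):
        # Returning None ignores the response; returning a Request
        # would enqueue it instead.
        return res if res.text else None

    def parse(self, res):
        yield res.url

    @pipe
    def tidy(self, item):
        # Items returned here flow on to the next pipe (Spider.collect);
        # returning None would stop the item.
        return item.rstrip('/')
```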
### Exceptions

- `class SpiderError(msg: str, cause: Optional[Exception] = None)`

  Base class for spider errors. The following exceptions inherit from it.

  Attributes:

  - `msg`: a brief text that explains what happened.
  - `cause`: the underlying exception that raised this error.
  - `req`: the `Request` object.
  - `res`: the `Response` object; it may be `None`.

- `class RequestIgnored(url: str, cause: Optional[Exception] = None)`

  Indicates that a decision was made not to process a request.

- `class ResponseIgnored(url: str, cause: Optional[Exception] = None)`

  Indicates that a decision was made not to process a response.

- `class DownLoadError(url: str, cause: Optional[Exception] = None)`

  Indicates an error while downloading.

- `class ParseError(url: str, cause: Optional[Exception] = None)`

  Indicates an error while parsing.
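These errors surface in `Spider.on_error`, where the attributes above can be inspected (a minimal sketch):

```python
from mocy import Spider


class RobustSpider(Spider):
    entry = 'https://example.com/'

    def parse(self, res):
        yield res.url

    def on_error(self, reason):
        # reason is a SpiderError subclass: msg and cause describe the
        # failure, req is the originating Request, and res may be None
        # (e.g. when the download itself failed).
        print(f'{reason.msg}: {reason.cause!r}')
```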
## Tests

```bash
$ pytest
```
## License

MIT