Swifind is a web scraping function builder: a toolset that makes web scraping functions simpler and more modular. It comes with its own scripting language (`swipl`) for planning web scraping and crawling strategies. A `swipl` script is interpreted into a sequence of Python functions, which makes it easy to recreate, reuse, or modify a web scraping script. It can run as a standalone script or be attached to an existing project.
Please check out these amazing libraries that I used to develop this project:
- BeautifulSoup to parse HTML pages.
- Requests to retrieve website pages.
- lxml to enable lxml parsing with BeautifulSoup.
- Python >= 3.6
Swifind works in three simple phases:
`Catfish` is initiated with a `swipl` script path as an argument. `Catfish` will interpret, validate, and extract information from the `swipl` script. It stores that information as a sequence of functions held in a `Strategy` in the form of `Plan` objects. `Catfish` uses a `Bag` as the container for extracted or scraped data.
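As a rough sketch of this entry point (the full, working example appears at the end of this README; the script name below is only a placeholder):

```python
from swifind.catfish import Catfish

# Constructing Catfish with a swipl script path interprets and validates
# the script and assembles the Strategy (a sequence of Plans) to execute.
cf = Catfish('strategy.swipl')  # 'strategy.swipl' is a hypothetical path
```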
How does it work?
- The `swipl` script is validated by `Validator`. `Validator` checks the syntax validity of each line or block of components; if there is an error, an exception is raised. Every component that passes the check becomes a validated component.
- Validated components are used to generate the plan blueprint with `Extractor`. `Extractor` returns a function and an initiated `Plan`.
- Each `Plan` is assembled into a linked list of `Plan` objects. This sequence of `Plan` objects is assigned to the `Strategy` attached to the existing `Catfish`.
- `Catfish` utilizes its `Strategy` to do the scraping and crawling activity.
`Catfish` executes every function assigned to its `Strategy`. Each `Plan` in the `Strategy` is executed in order, starting from the `Strategy` origin. For data collection activities, each piece of scraped information is stored in the `Bag`.
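These classes (`Validator`, `Extractor`, `Plan`, `Strategy`, `Bag`) are Swifind internals rather than its public API, but the core idea of walking a linked list of plans from an origin can be sketched as a toy, with hypothetical names that are not Swifind's actual implementation:

```python
from typing import Callable, Optional

class Plan:
    """One step of a strategy: a function to run plus a link to the next step."""
    def __init__(self, func: Callable[[dict], None]) -> None:
        self.func = func
        self.next: Optional["Plan"] = None

def run_strategy(origin: Plan) -> dict:
    """Walk the linked list of plans from the origin, collecting data in a bag-like dict."""
    bag = {"items": {}}
    plan: Optional[Plan] = origin
    while plan is not None:
        plan.func(bag)   # each plan stores whatever it scraped into the bag
        plan = plan.next
    return bag
```

In the example at the end of this README, calling `cf.swim()` triggers this kind of execution and `cf.unpack()` returns the collected data.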
Swipl Activity
Currently, there are two activities available in `swipl`:
- ORIGIN: defines the starting point of `Catfish` (the first page).
- PICK: defines an information extraction activity.
For more info about `swipl` activity definitions and usage, read this doc.
`Catfish` returns all collected items inside its `Bag`. The `Bag` also contains activity or journey logs that can be retrieved with the `Catfish` unpack method.
For example, imagine there is a website (http://example.com) with the following HTML structure:
```html
<body>
    <div class="container">
        <h1>Title Example</h1>
        <a href="/link">Example Link</a>
        <ul>
            <li>First Item</li>
            <li>Second Item</li>
            <li>Third Item</li>
        </ul>
    </div>
</body>
```
We then plan to extract several things:
- The title of the page, which we name `title`.
- The URL of the example link, which we name `link`.
- The second element of the unordered list, which we name `second_elm`.
Below is the `swipl` script to extract those things; we name it `example.swipl`:
```
ORIGIN http://example.com
PICK title 'h1*'
PICK link 'div a' href
PICK second_elm 'ul* li[1]'
```
To use this script, we define a Python script as follows:
```python
from swifind.catfish import Catfish

cf = Catfish('example.swipl')  # interpret and validate the swipl script
cf.swim()                      # execute the strategy (scrape and crawl)
result = cf.unpack()           # retrieve the collected items and logs
```
*The above example assumes the `swipl` script and the Python script are in the same directory.
The result will contain the extracted information as follows:
```json
{
    "items": {
        "title": "Title Example",
        "link": "/link",
        "second_elm": "Second Item"
    }
}
```
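Assuming `result` is the plain nested dictionary shown above (an assumption based only on this example output), the individual items can be read with ordinary dictionary access:

```python
# Read the scraped values back from the unpacked result.
title = result["items"]["title"]            # "Title Example"
link = result["items"]["link"]              # "/link"
second_elm = result["items"]["second_elm"]  # "Second Item"
print(title, link, second_elm)
```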