🚜 Simple desktop scraper app.

Scraper

Simple desktop scraper app. You can download the application here.

With Scraper you can either load a JSON file with a specific (x-ray-like) structure for automated scraping (see the examples folder), or simply enter a URL and a selector directly in the app.

[Screenshot: Scraper app]

The screenshot shows the input JSON configuration at the top and the scraping output below it.

Usage

Scraper takes an input JSON file with two main parts:

  1. header: with the information you want to make visible in the output json,
  2. websites: a list of website configurations used for scraping.

In each website configuration you can pass custom parameters (such as name), and you must provide the following configuration fields:

  • url of the website,
  • scope of the selectors (optional),
  • selectors of the page,
  • additional options (currently):
    • pagination: selector for the link to the next page,
    • limit: maximum number of pages to scrape.
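
Putting these fields together, an input file might be shaped like the object below. This is only a minimal sketch in JavaScript: the `title` field in the header, the value of `name`, and the `checkConfig` helper are illustrative assumptions and are not part of the app.

```javascript
// Hypothetical input configuration, following the structure described above.
const config = {
  header: {
    title: "GitHub repository data" // assumed field, shown in the output
  },
  websites: [
    {
      name: "scraper-repo", // custom parameter (assumed value)
      url: "https://github.com/cedoor/scraper",
      scope: "body", // optional
      selectors: {
        description: ".repository-meta-content .text-gray-dark | trim"
      },
      options: {}
    }
  ]
};

// Illustrative helper (not part of Scraper): lists the required fields
// that are missing from each website configuration.
function checkConfig(cfg) {
  const problems = [];
  if (!cfg || !Array.isArray(cfg.websites)) {
    return ["websites must be a list"];
  }
  cfg.websites.forEach((site, i) => {
    if (typeof site.url !== "string") problems.push(`websites[${i}]: missing url`);
    if (site.selectors === undefined) problems.push(`websites[${i}]: missing selectors`);
  });
  return problems;
}

console.log(checkConfig(config)); // prints []
```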

The selector rules are the same as in x-ray. Some use cases follow:

Scrape a single selector:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": ".repository-meta-content .text-gray-dark | trim"
}

Scrape an attribute:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": {
        "linkOfCommits": ".commits > a@href"
    }
}

Scrape innerHTML:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": {
        "body": "body@html"
    }
}

Scrape with a scope:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "scope": "body",
    "selectors": {
        "description": ".repository-meta-content .text-gray-dark | trim"
    }
}

or

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": {
        "description": ".repository-meta-content .text-gray-dark | trim",
        "files": {
            "scope": ".files .js-navigation-item",
            "selectors": [{
                "name": ".js-navigation-open",
                "lastCommit": ".message | trim"
            }]
        }
    }
}

Scrape with collections:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper/commits/master",
    "selectors": [
        ".commit-group > li .commit-title a"
    ]
}

Scrape with pagination:

{
    "options": {
        "pagination": "li.next a@href",
        "limit": 78
    },
    "url": "https://icobazaar.com/v2/ico-list",
    "scope": ".icos-list div.ico",
    "selectors": [{
        "name": "h5"
    }]
}
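
Conceptually, the pagination and limit options describe a bounded loop over "next" links. The sketch below illustrates that idea only and is not the app's implementation: `scrapePages` and `loadPage` are hypothetical names, and the pages are stubbed in memory instead of being fetched.

```javascript
// Illustrative only: follow "next" links until none is left or `limit` pages
// have been visited. `loadPage(url)` is a hypothetical function returning
// { items: [...], next: <url of the next page, or null> }.
function scrapePages(startUrl, limit, loadPage) {
  const results = [];
  let url = startUrl;
  let visited = 0;
  while (url && visited < limit) {
    const page = loadPage(url);
    results.push(...page.items);
    url = page.next;
    visited += 1;
  }
  return results;
}

// Stubbed pages standing in for real HTTP responses.
const fakePages = {
  "/list?page=1": { items: [{ name: "A" }], next: "/list?page=2" },
  "/list?page=2": { items: [{ name: "B" }], next: "/list?page=3" },
  "/list?page=3": { items: [{ name: "C" }], next: null }
};

const names = scrapePages("/list?page=1", 2, (url) => fakePages[url]);
console.log(names.map((item) => item.name)); // prints [ 'A', 'B' ]
```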

Crawling to another site:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": {
        ...
        "lastCommits": {
            "url": ".commits > a@href",
            "selectors": [
                ".commit-group > li .commit-title a"
            ]   
        }
    }
}

It is possible to nest sub-sites.

Crawling to another site with scope:

{
    "options": {},
    "url": "https://github.com/cedoor/scraper",
    "selectors": {
        ...
        "lastCommits": {
            "url": ".commits > a@href",
            "scope": ".commit-group > li",
            "selectors": [{
                "message": ".message",
                "author": ".commit-author",
                "time": "relative-time"
            }]
        }
    }
}
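
The selector strings in the use cases above combine a CSS selector, an optional `@attribute`, and optional `| filter` parts. As a rough illustration of how such a string breaks down, here is a small parser sketch. `parseSelector` is a hypothetical helper written for this README, not x-ray's or Scraper's own parser, and it makes the simplifying assumptions noted in the comments.

```javascript
// Illustrative parser for strings such as ".commits > a@href | slice:2,3".
// Simplifying assumptions: "|" and "@" never appear inside the CSS selector
// itself, and filter arguments are comma-separated plain strings.
function parseSelector(input) {
  const [target, ...filterParts] = input.split("|").map((part) => part.trim());
  const at = target.lastIndexOf("@");
  const selector = at === -1 ? target : target.slice(0, at).trim();
  const attribute = at === -1 ? null : target.slice(at + 1).trim();
  const filters = filterParts.map((part) => {
    const [name, args = ""] = part.split(":");
    return {
      name: name.trim(),
      args: args ? args.split(",").map((arg) => arg.trim()) : []
    };
  });
  return { selector, attribute, filters };
}

const parsed = parseSelector(".commits > a@href | slice:2,3");
console.log(parsed.selector);        // prints .commits > a
console.log(parsed.attribute);       // prints href
console.log(parsed.filters[0].name); // prints slice
```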

Filters

In addition, you can use filters to post-process the values returned by the selectors:

  • trim: "selector": ".class | trim" -> Removes whitespace from both ends of a string,
  • reverse: "selector": ".class | reverse" -> Reverses a string,
  • slice: "selector": ".class | slice:2,3" -> Extracts a section of a string and returns it as a new string,
  • oneSpace: "selector": ".class | oneSpace" -> Replaces each run of consecutive spaces with a single space,
  • toNumber: "selector": ".class | toNumber" -> Parses a string and returns a float or an integer,
  • getNumber: "selector": ".class | getNumber" -> Finds the numbers in a string and returns the i-th.

Other filters will be added later.
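
To make the behavior of the filters more concrete, the sketch below shows how they could be written as plain JavaScript functions that take the selected value first, followed by any filter arguments. These are possible implementations under stated assumptions, not the app's actual source. In particular, the zero-based `index` argument of `getNumber` and the treatment of all whitespace in `oneSpace` are guesses.

```javascript
// Possible implementations of the filters listed above (sketches only).
const filters = {
  trim: (value) => value.trim(),
  reverse: (value) => value.split("").reverse().join(""),
  slice: (value, start, end) => value.slice(start, end),
  // Assumption: any run of whitespace collapses to a single space.
  oneSpace: (value) => value.replace(/\s+/g, " "),
  toNumber: (value) => Number(value),
  // Assumption: `index` is zero-based and defaults to the first number found.
  getNumber: (value, index = 0) => {
    const matches = value.match(/-?\d+(\.\d+)?/g) || [];
    return index < matches.length ? Number(matches[index]) : null;
  }
};

console.log(filters.trim("  scraper  "));                 // prints scraper
console.log(filters.reverse("abc"));                      // prints cba
console.log(filters.slice("scraper", 2, 3));              // prints r
console.log(filters.oneSpace("a   b    c"));              // prints a b c
console.log(filters.toNumber("3.14"));                    // prints 3.14
console.log(filters.getNumber("12 files, 7 commits", 1)); // prints 7
```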

Development

Clone and install dependencies

> git clone https://github.com/cedoor/scraper.git
> cd scraper
> npm i

Run electron

> npm start

Rules

Commits

  • Use this commit message format (angular style):

    [<type>] <subject>
    <BLANK LINE>
    <body>

    where type must be one of the following:

    • feat: A new feature
    • fix: A bug fix
    • docs: Documentation only changes
    • style: Changes that do not affect the meaning of the code
    • refactor: A code change that neither fixes a bug nor adds a feature
    • test: Adding missing or correcting existing tests
    • chore: Changes to the build process or auxiliary tools and libraries such as documentation generation
    • update: Update of the library version or of the dependencies

and body should include the motivation for the change and contrast it with the previous behavior (omit the body if the commit is trivial).

  • Use the imperative, present tense: "change" not "changed" nor "changes".
  • Don't capitalize the first letter.
  • No dot (.) at the end.
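
Most of these commit rules can be checked mechanically. The following is a minimal sketch of such a check; `checkCommitMessage` is a hypothetical helper that does not exist in the repository, and it cannot verify the imperative-tense rule.

```javascript
const TYPES = ["feat", "fix", "docs", "style", "refactor", "test", "chore", "update"];

// Illustrative check (hypothetical helper): returns a list of rule
// violations for a commit message, or an empty list if none are found.
function checkCommitMessage(message) {
  const errors = [];
  const [firstLine, secondLine] = message.split("\n");
  const match = /^\[([a-z]+)\] (.+)$/.exec(firstLine);
  if (!match) return ["first line must look like: [<type>] <subject>"];
  const [, type, subject] = match;
  if (!TYPES.includes(type)) errors.push(`unknown type: ${type}`);
  if (/^[A-Z]/.test(subject)) errors.push("subject must not start with a capital letter");
  if (subject.endsWith(".")) errors.push("subject must not end with a dot");
  if (secondLine !== undefined && secondLine !== "") {
    errors.push("a blank line must separate subject and body");
  }
  return errors;
}

console.log(checkCommitMessage("[feat] add slice filter"));          // prints []
console.log(checkCommitMessage("[fix] Handle empty scope.").length); // prints 2
```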

Branches

  • There is a master branch, used only for release.
  • There is a dev branch, into which all development sub-branches are merged.
  • Avoid long descriptive names for long-lived branches.
  • No CamelCase.
  • Use grouping tokens (words) at the beginning of your branch names (in a similar way to the type of commit).
  • Define and use short lead tokens to differentiate branches in a way that is meaningful to your workflow.
  • Use slashes to separate parts of your branch names.
  • Remove a branch after merging it if it is no longer needed.

Examples:

git checkout -b docs/README
git checkout -b test/one-function
git checkout -b feat/side-bar
git checkout -b style/header
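
The naming rules for branches can also be expressed as a small check. The sketch below assumes that the lead tokens are the same words as the commit types, which the rules only suggest; `isValidBranchName` is a hypothetical helper and not part of the repository.

```javascript
// Assumption: lead tokens mirror the commit types.
const TOKENS = ["feat", "fix", "docs", "style", "refactor", "test", "chore", "update"];

// Illustrative check: "<token>/<name>", slash-separated, no CamelCase.
function isValidBranchName(name) {
  const [token, ...rest] = name.split("/");
  if (!TOKENS.includes(token) || rest.length === 0) return false;
  if (rest.some((part) => part === "")) return false;
  // Reject CamelCase: a lowercase letter directly followed by an uppercase one.
  return !/[a-z][A-Z]/.test(name);
}

console.log(isValidBranchName("feat/side-bar")); // prints true
console.log(isValidBranchName("feat/sideBar"));  // prints false
```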

License

Contact

Developer