albingeorge/goscraper


Go Scraper

A Go web scraping framework that uses JSON configuration and other customizations to easily scrape websites.

Input

{
    "levels": [
        {
            "source": {
                "type": "default",
                "content": "https://mangapill.com/manga/2/one-piece"
            },
            "label": "chapter",
            "objects": {
                "chapter": {
                    "parser": {
                        "selector": "custom",
                        "struct": "mangapill",
                        "value": "chapter_parser"
                    },
                    "sort": {
                        "by": "name",
                        "order": "asc"
                    },
                    "save": {
                        "type": "directory",
                        "path": {
                            "type": "resolve",
                            "content": "OnePiece/%current.name%"
                        },
                        "skipIfExists": true
                    },
                    "levels": [
                        {
                            "source": {
                                "type": "resolve",
                                "content": "https://mangapill.com%parent.url%"
                            },
                            "label": "page",
                            "objects": {
                                "page": {
                                    "parser": {
                                        "selector": "custom",
                                        "struct": "mangapill",
                                        "value": "page_parser"
                                    },
                                    "sort": {
                                        "by": "page_number",
                                        "order": "asc"
                                    },
                                    "save": {
                                        "type": "file",
                                        "name": {
                                            "type": "resolve",
                                            "content": "%current.name%.jpg"
                                        },
                                        "path": {
                                            "type": "resolve",
                                            "content": "OnePiece/%parent.name%/"
                                        },
                                        "content": {
                                            "type": "resolve",
                                            "content": "%current.src%"
                                        }
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }
    ]
}

Crawling happens via levels instead of traversing every link on the root page; only the required links are visited.

For example, if you want to fetch a single chapter of a manga, there would be one level, which contains all the pages of that chapter. If you want to fetch all the chapters of a manga, you'd have two levels: one for fetching all the chapters and another for fetching all the pages in each chapter.

How it works

For each level, do the following:

  1. Fetch data from the level's source
  2. Parse the configured objects from the fetched data
  3. For each parsed object, apply the save rules and process any nested levels
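The per-level flow can be sketched in Go. Note that the types and function names below (`Level`, `fetch`, `parseObjects`, `save`) are hypothetical simplifications for illustration, not the framework's actual API:

```go
package main

import "fmt"

// Object is a simplified parsed result (e.g. a chapter or a page).
type Object struct {
	Name string
	URL  string
}

// Level mirrors the shape of a "levels" entry in the JSON config.
type Level struct {
	Source string
	Levels []Level // nested levels, as in the example config
}

// processLevel sketches the recursive flow: fetch, parse, save,
// then descend into nested levels for each parsed object.
func processLevel(l Level, parent *Object) {
	// 1. Fetch data from the level's source.
	content := fetch(l.Source)

	// 2. Parse objects from the fetched content using the configured parser.
	objects := parseObjects(content)

	for _, obj := range objects {
		// 3. Save the object (directory or file) per the "save" config.
		save(obj)

		// 4. Recurse into nested levels, with obj as the new parent.
		for _, sub := range l.Levels {
			processLevel(sub, &obj)
		}
	}
}

// Stub implementations so the sketch compiles and runs.
func fetch(source string) string { return "<html>...</html>" }

func parseObjects(content string) []Object {
	return []Object{{Name: "Chapter 1", URL: "/chapters/1"}}
}

func save(o Object) { fmt.Println("saving", o.Name) }

func main() {
	processLevel(Level{Source: "https://example.com"}, nil)
}
```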

The sort attribute is applied after all objects have been parsed from the source.
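For instance, a config block like `"sort": {"by": "name", "order": "asc"}` could be applied to parsed chapters with a sketch like this (the `Chapter` type and `sortObjects` helper are illustrative, not the framework's API):

```go
package main

import (
	"fmt"
	"sort"
)

// Chapter is a parsed object with a field the config can sort on.
type Chapter struct {
	Name string
}

// sortObjects applies the "sort" attribute after parsing:
// here, by the Name field, ascending or descending.
func sortObjects(chapters []Chapter, order string) {
	sort.Slice(chapters, func(i, j int) bool {
		if order == "desc" {
			return chapters[i].Name > chapters[j].Name
		}
		return chapters[i].Name < chapters[j].Name
	})
}

func main() {
	chapters := []Chapter{{Name: "Chapter 3"}, {Name: "Chapter 1"}, {Name: "Chapter 2"}}
	sortObjects(chapters, "asc")
	fmt.Println(chapters[0].Name) // first chapter after ascending sort
}
```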

The save attribute in each level specifies what needs to be stored for each object in that level. The sub-attributes are self-explanatory.
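The `resolve` type in the example config substitutes placeholders such as `%current.name%` and `%parent.name%` into paths and file names. A minimal sketch of that substitution (the `resolve` function and its signature are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// resolve replaces %key% placeholders in a template with values from vars,
// mirroring config entries like "OnePiece/%parent.name%/".
func resolve(template string, vars map[string]string) string {
	out := template
	for key, val := range vars {
		out = strings.ReplaceAll(out, "%"+key+"%", val)
	}
	return out
}

func main() {
	vars := map[string]string{
		"parent.name":  "Chapter 1",
		"current.name": "3",
	}
	fmt.Println(resolve("OnePiece/%parent.name%/", vars)) // OnePiece/Chapter 1/
	fmt.Println(resolve("%current.name%.jpg", vars))      // 3.jpg
}
```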

FAQ

Why is levels an array in the input format?

One level can contain multiple types of data. For example, say you're fetching multiple mangas from a website. For each manga (the root level) you'd need:

  • All the chapters (which can be saved as directories)
  • The manga cover, which would be an image file

As you can see from the above example, it's possible that for each level, you'd need multiple types of data to be fetched. Hence, we define levels as an array.
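A hypothetical level along those lines, with two objects fetched side by side, could look like this (the `cover_parser` value and field contents are illustrative, following the format of the example above):

```json
{
    "label": "manga",
    "objects": {
        "chapter": {
            "parser": {
                "selector": "custom",
                "struct": "mangapill",
                "value": "chapter_parser"
            },
            "save": {
                "type": "directory",
                "path": {
                    "type": "resolve",
                    "content": "%current.name%"
                }
            }
        },
        "cover": {
            "parser": {
                "selector": "custom",
                "struct": "mangapill",
                "value": "cover_parser"
            },
            "save": {
                "type": "file",
                "name": {
                    "type": "resolve",
                    "content": "cover.jpg"
                }
            }
        }
    }
}
```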

TODO

  • Create a global thread limit
