albingeorge/goscraper


Go Scraper

A Go web scraping framework that uses JSON configuration and other customizations to easily scrape websites.

Input

{
    "levels": [
        {
            "source": {
                "type": "default",
                "content": "https://mangapill.com/manga/2/one-piece"
            },
            "label": "chapter",
            "objects": {
                "chapter": {
                    "parser": {
                        "selector": "custom",
                        "struct": "mangapill",
                        "value": "chapter_parser"
                    },
                    "sort": {
                        "by": "name",
                        "order": "asc"
                    },
                    "save": {
                        "type": "directory",
                        "path": {
                            "type": "resolve",
                            "content": "OnePiece/%current.name%"
                        },
                        "skipIfExists": true
                    },
                    "levels": [
                        {
                            "source": {
                                "type": "resolve",
                                "content": "https://mangapill.com%parent.url%"
                            },
                            "label": "page",
                            "objects": {
                                "page": {
                                    "parser": {
                                        "selector": "custom",
                                        "struct": "mangapill",
                                        "value": "page_parser"
                                    },
                                    "sort": {
                                        "by": "page_number",
                                        "order": "asc"
                                    },
                                    "save": {
                                        "type": "file",
                                        "name": {
                                            "type": "resolve",
                                            "content": "%current.name%.jpg"
                                        },
                                        "path": {
                                            "type": "resolve",
                                            "content": "OnePiece/%parent.name%/"
                                        },
                                        "content": {
                                            "type": "resolve",
                                            "content": "%current.src%"
                                        }
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }
    ]
}

Crawling happens via levels instead of traversing every link on the root page; only the required links are visited.

For example, if you want to fetch a single chapter of a manga, there would be one level, which contains all the pages of that chapter. If you want to fetch all the chapters of a manga, you'd have two levels: one for fetching all the chapters and another for fetching all the pages in each chapter.

How it works

For each level, do the following:

  1. Fetch data from the level's source
  2. Parse the configured objects from the fetched data
  3. For each parsed object, apply the save rules and process any nested levels
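The per-level flow can be sketched in Go. Note that the types and function names below (`Level`, `fetch`, `parseObjects`, `save`) are hypothetical simplifications for illustration, not the framework's actual API:

```go
package main

import "fmt"

// Object is a simplified parsed result (e.g. a chapter or a page).
type Object struct {
	Name string
	URL  string
}

// Level mirrors the shape of a "levels" entry in the JSON config.
type Level struct {
	Source string
	Levels []Level // nested levels, as in the example config
}

// processLevel sketches the recursive flow: fetch, parse, save,
// then descend into nested levels for each parsed object.
func processLevel(l Level, parent *Object) {
	// 1. Fetch data from the level's source.
	content := fetch(l.Source)

	// 2. Parse objects from the fetched content using the configured parser.
	objects := parseObjects(content)

	for _, obj := range objects {
		// 3. Save the object (directory or file) per the "save" config.
		save(obj)

		// 4. Recurse into nested levels, with obj as the new parent.
		for _, sub := range l.Levels {
			processLevel(sub, &obj)
		}
	}
}

// Stub implementations so the sketch compiles and runs.
func fetch(source string) string { return "<html>...</html>" }

func parseObjects(content string) []Object {
	return []Object{{Name: "Chapter 1", URL: "/chapters/1"}}
}

func save(o Object) { fmt.Println("saving", o.Name) }

func main() {
	processLevel(Level{Source: "https://example.com"}, nil)
}
```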

The sort attribute is applied after all objects have been parsed from the source.
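For instance, a config block like `"sort": {"by": "name", "order": "asc"}` could be applied to parsed chapters with a sketch like this (the `Chapter` type and `sortObjects` helper are illustrative, not the framework's API):

```go
package main

import (
	"fmt"
	"sort"
)

// Chapter is a parsed object with a field the config can sort on.
type Chapter struct {
	Name string
}

// sortObjects applies the "sort" attribute after parsing:
// here, by the Name field, ascending or descending.
func sortObjects(chapters []Chapter, order string) {
	sort.Slice(chapters, func(i, j int) bool {
		if order == "desc" {
			return chapters[i].Name > chapters[j].Name
		}
		return chapters[i].Name < chapters[j].Name
	})
}

func main() {
	chapters := []Chapter{{Name: "Chapter 3"}, {Name: "Chapter 1"}, {Name: "Chapter 2"}}
	sortObjects(chapters, "asc")
	fmt.Println(chapters[0].Name) // first chapter after ascending sort
}
```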

The save attribute in each level specifies what needs to be stored for each object in that level. The sub-attributes are self-explanatory.
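The `resolve` type in the example config substitutes placeholders such as `%current.name%` and `%parent.name%` into paths and file names. A minimal sketch of that substitution (the `resolve` function and its signature are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// resolve replaces %key% placeholders in a template with values from vars,
// mirroring config entries like "OnePiece/%parent.name%/".
func resolve(template string, vars map[string]string) string {
	out := template
	for key, val := range vars {
		out = strings.ReplaceAll(out, "%"+key+"%", val)
	}
	return out
}

func main() {
	vars := map[string]string{
		"parent.name":  "Chapter 1",
		"current.name": "3",
	}
	fmt.Println(resolve("OnePiece/%parent.name%/", vars)) // OnePiece/Chapter 1/
	fmt.Println(resolve("%current.name%.jpg", vars))      // 3.jpg
}
```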

FAQ

Why is levels an array in the input format?

One level can contain multiple types of data. For example, say you're fetching multiple mangas from a website. For each manga (the root level) you'd need:

  • All the chapters (which can be saved as directories)
  • The manga cover, which would be an image file

As you can see from the above example, it's possible that for each level, you'd need multiple types of data to be fetched. Hence, we define levels as an array.
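A hypothetical level along those lines, with two objects fetched side by side, could look like this (the `cover_parser` value and field contents are illustrative, following the format of the example above):

```json
{
    "label": "manga",
    "objects": {
        "chapter": {
            "parser": {
                "selector": "custom",
                "struct": "mangapill",
                "value": "chapter_parser"
            },
            "save": {
                "type": "directory",
                "path": {
                    "type": "resolve",
                    "content": "%current.name%"
                }
            }
        },
        "cover": {
            "parser": {
                "selector": "custom",
                "struct": "mangapill",
                "value": "cover_parser"
            },
            "save": {
                "type": "file",
                "name": {
                    "type": "resolve",
                    "content": "cover.jpg"
                }
            }
        }
    }
}
```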

TODO

  • Create a global thread limit
