This web crawler project captures resources such as links, images, HTML meta tags, and the sitemap from a given website URL and shows them in a web interface.
It may not work properly on domains other than http://wiprodigital.com
- Spring Boot - 1.5.8
- Jsoup - HTML parsing
- Jackson
- Gradle
- Swagger
- jQuery
- Bootstrap
- FancyGrid
- Bootstrap TreeView plugin
Execute 'gradlew bootRun' to start the server manually.
Open a browser at http://localhost:8080/index.html and enter any URL for analysis, e.g. http://wiprodigital.com
To access the Swagger REST client, open http://localhost:8080/api
Execute 'gradlew test' to run unit and integration tests.
1. Multithread support
A page can contain a large number of links, and processing them one after another can make the result slow to return. This can be addressed by processing the links in parallel with Java's executor framework (java.util.concurrent).
2. Data extraction
The logic that retrieves data from the given URL can be improved to work on other domains by supporting common sitemap patterns.
3. User Interface
The look and feel of the user interface can be improved.
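As a rough illustration of item 1 (multithread support), the sketch below shows one way the executor framework could process links in parallel. It is not taken from the project: the class name ParallelLinkAnalyzer, the method analyzeAll, and the thread count are hypothetical, and the per-link work is passed in as a function so that the example runs without network access. In the real crawler that function would presumably be the existing per-link fetch and parse step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper (not part of the project): analyzes links in parallel. */
public class ParallelLinkAnalyzer {

    /**
     * Applies the analyzer to every link on a fixed-size thread pool and
     * returns the results keyed by link, in the original link order.
     * Duplicate links collapse into a single map entry.
     */
    public static <R> Map<String, R> analyzeAll(List<String> links,
                                                Function<String, R> analyzer,
                                                int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String link : links) {
                tasks.add(() -> analyzer.apply(link));
            }
            // invokeAll blocks until every task has completed and returns
            // the futures in the same order as the submitted tasks.
            List<Future<R>> futures = pool.invokeAll(tasks);
            Map<String, R> results = new LinkedHashMap<>();
            for (int i = 0; i < links.size(); i++) {
                results.put(links.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while analyzing links", e);
        } catch (ExecutionException e) {
            // One of the per-link tasks threw; surface its cause.
            throw new IllegalStateException("Link analysis failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://wiprodigital.com",
                "http://localhost:8080/index.html");
        // String::length is only a stand-in for the real per-link work.
        Map<String, Integer> lengths = analyzeAll(links, String::length, 4);
        System.out.println(lengths);
    }
}
```

A fixed-size pool is used here so that a page with many links cannot start an unbounded number of threads. A suitable pool size depends on how much of the per-link work is network-bound and would need to be tuned.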