# NHS Crawler

Crawls the NHS syndication website, loading conditions into an embedded Elasticsearch instance, and provides a REST service to query the data and get URLs to the relevant NHS pages.

## Instructions to run

  1. `mvn clean package`
  2. `java -jar target/nhs-conditions-service-1.0.0.jar`

You can access the REST service like this: http://localhost:8080/nhs/conditions/search?q=What about headache?
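
From the command line, the same query can be sent with curl, which takes care of URL-encoding the spaces (a usage sketch, assuming the service is running locally on the default port shown above):

```sh
# Query the search endpoint; -G sends the data as query parameters
# and --data-urlencode encodes the spaces in the question.
curl -G "http://localhost:8080/nhs/conditions/search" --data-urlencode "q=What about headache?"
```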

The crawler starts automatically when the application launches. The crawler log is currently in DEBUG mode to show visited/saved pages; debug logging can be suppressed in the main/resources/logback.xml file.

## Configuration

Application properties can be found in the `application.yml` file (see the example sketch after the lists below):

crawler:

  - `apiKey` - NHS API key
  - `agent` - browser agent string for the crawler
  - `numOfThreads` - number of threads to use for crawling
  - `crawlDelay` - delay in milliseconds between crawl requests
  - `workFolder` - working folder for the crawler
  - `baseUrl` - the crawler only handles URLs that start with this value
  - `startUrl` - the URL the crawler starts from (the A to Z conditions catalogue)
  - `enabled` - enables/disables the crawler (e.g. `true`)

elasticsearch:

  - `dbPath` - path where the Elasticsearch data is stored
  - `nodeName` - Elasticsearch node name
  - `indexName` - Elasticsearch index name
  - `typeName` - Elasticsearch type name used to store NHS pages
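
A minimal `application.yml` sketch using the properties above; all values are illustrative placeholders, not the repository's actual defaults:

```yaml
crawler:
  apiKey: <your-nhs-api-key>              # NHS API key (placeholder)
  agent: "nhs-crawler"                     # illustrative agent string
  numOfThreads: 4                          # example thread count
  crawlDelay: 1000                         # milliseconds between crawl requests (example)
  workFolder: /tmp/nhs-crawler             # example working folder
  baseUrl: https://www.nhs.uk/conditions/  # assumption: only URLs starting with this are crawled
  startUrl: https://www.nhs.uk/conditions/ # assumption: the A to Z conditions catalogue
  enabled: true

elasticsearch:
  dbPath: /tmp/nhs-es                      # example path for the embedded Elasticsearch data
  nodeName: nhs-node                       # example node name
  indexName: nhs-conditions                # example index name
  typeName: page                           # example type name for NHS pages
```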
