
Geo names ingest pipeline

Click Here to access the presentation.

Prerequisite

JAVA Version 1.8

SBT Version 1.1.5

SCALA Version 2.11.12

SPARK Version 2.3.1

SOLR Version 7.3.1

SPARK Solr Connector Version 3.4.0

Description

  1. This project is a proof of concept (POC) that ingests and indexes GeoNames data so that it can be searched easily.
  2. It supports geospatial search (Solr Spatial Search) for nearest neighbors, as well as full-text search by name.
  3. Apache Spark provides the distributed in-memory compute, transform, and ingest steps of the pipeline.
  4. Apache Solr is used for storage and indexing. It can be configured in cloud mode (multiple Solr servers) and scaled out by adding server nodes.
  5. The Apache Solr collection can be configured with shards (the number of partitions) and replicas (for fault tolerance).
  6. Schema evolution can be handled through Solr's managed schema configuration.
  7. The id attribute, which is derived from geonameid, ensures that future updates overwrite the matching documents in the collection instead of duplicating them, and it keeps working under the schema evolution described above.
  8. Binary data, such as shapefiles, can be stored in a Solr document (binary data store).
  9. A shapefile can also be converted into GeoJSON format and then ingested into Solr for future processing and updates.
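
To make items 3 and 7 more concrete, here is a minimal sketch of how one row of a GeoNames dump could be mapped onto the fields used in this project. It is written in shell with awk and is not the project's Spark job. It assumes the standard tab-separated GeoNames column order (geonameid, name, asciiname, alternatenames, latitude, longitude, feature class, feature code, country code, cc2, admin1 code, admin2 code, ...). The function name `geoname_row_to_doc` and the mapping of the admin1/admin2 codes to administrativeLevel1/administrativeLevel2 are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): convert tab-separated
# GeoNames rows on stdin into one JSON document per line, using the field
# names from the schema step of this README.
# Assumption: names contain no double quotes or backslashes, so no JSON
# escaping is performed.
geoname_row_to_doc() {
  awk -F '\t' '{
    # $1 geonameid, $2 name, $5 latitude, $6 longitude,
    # $9 country code, $11 admin1 code, $12 admin2 code
    printf "{\"id\":\"%s\",\"name\":\"%s\",\"latitude\":%s,\"longitude\":%s,", $1, $2, $5, $6
    printf "\"location\":\"%s,%s\",\"countryCode\":\"%s\",", $5, $6, $9
    printf "\"administrativeLevel1\":\"%s\",\"administrativeLevel2\":\"%s\"}\n", $11, $12
  }'
}

# Made-up sample row (not real GeoNames data); columns after admin2 omitted.
printf '42\tExampleville\tExampleville\t\t47.10247\t5.26556\tP\tPPL\tFR\t\tA1\tB2\n' \
  | geoname_row_to_doc
```

Because the id comes from geonameid, indexing a newer dump with the same ids replaces the existing documents rather than adding duplicates.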

Setup

  1. Download the Apache Solr version listed in the prerequisites above.

  2. Unzip the archive and copy the extracted folder to a location on disk.

  3. Change to the Solr home directory

       cd solr-7.3.1 
  4. Start Solr Server in cloud mode

       bin/solr start -cloud
  5. Create collection for storage and indexing

       bin/solr create -c geo_collection
  6. Create schema

       curl -X POST -H 'Content-type:application/json' --data-binary '{
         "add-field":[
            {
             "name":"administrativeLevel1",
             "type":"string",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"administrativeLevel2",
             "type":"string",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"countryCode",
             "type":"string",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"latitude",
             "type":"pfloat",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"location",
             "type":"location",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"longitude",
             "type":"pfloat",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true},
           {
             "name":"name",
             "type":"string",
             "docValues":true,
             "multiValued":false,
             "indexed":true,
             "stored":true
             }]
       }' http://localhost:8983/solr/geo_collection/schema
  7. Build the project

       sbt clean assembly
  8. Index geo locations

       spark-submit \
       --master "local[*]" \
       --class com.geoname.IndexGeoData \
       --driver-memory "1g" \
       target/geoname-pipeline-assembly-0.1.jar \
       /Users/shona/IdeaProjects/geoname-pipeline/data/cities1000.txt
  9. Search By Name

        curl "http://localhost:8983/solr/geo_collection/select?q=name:Saint-*"
  10. Search Nearest Neighbors Within a Great-Circle Radius of 10 km (geofilt)

       curl 'http://localhost:8983/solr/geo_collection/select?d=10&fq=%7B!geofilt%20sfield=location%7D&pt=47.10247,5.26556&q=*:*&sfield=location'
  11. Search Nearest Neighbors Within the Bounding Box of a 5 km Radius (bbox)

       curl 'http://localhost:8983/solr/geo_collection/select?d=5&fq=%7B!bbox%20sfield=location%7D&pt=47.10247,5.26556&q=*:*&sfield=location'
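
The two spatial queries above differ only in the filter name and the radius, so they can be wrapped in a small helper. The following is a sketch under a few assumptions: the function name `spatial_query_url` is hypothetical, Solr is assumed to listen on localhost:8983 with the geo_collection created earlier, and the radius is taken to be in kilometers (Solr's default distance unit). The braces of the local-params syntax are percent-encoded so that neither the shell nor curl's URL globbing interprets them.

```shell
#!/bin/sh
# Hypothetical helper: print a Solr select URL with a spatial filter.
# Usage: spatial_query_url <geofilt|bbox> <lat> <lon> <radius_km>
spatial_query_url() {
  filter=$1 lat=$2 lon=$3 radius=$4
  case $filter in
    geofilt|bbox) ;;
    *) echo "unknown filter: $filter" >&2; return 1 ;;
  esac
  base='http://localhost:8983/solr/geo_collection/select'
  # %7B and %7D are the percent-encoded braces of {!geofilt} / {!bbox}
  printf '%s?q=*:*&fq=%%7B!%s%%7D&sfield=location&pt=%s,%s&d=%s\n' \
    "$base" "$filter" "$lat" "$lon" "$radius"
}

# Print the URL for the geofilt query from step 10.
spatial_query_url geofilt 47.10247 5.26556 10
# To run a query against a live Solr instead of printing the URL:
#   curl "$(spatial_query_url bbox 47.10247 5.26556 5)"
```

Here sfield, pt, and d are passed as top-level request parameters, which Solr accepts as an alternative to placing them inside the braces.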

NOTE: Since this is a POC, I am ignoring shapefiles, which could be ingested into a Solr document either as-is in binary form or after conversion to GeoJSON.