<font size="3">
    
# Elasticsearch Analyzers

## General
For an Elasticsearch index, if no mapping is defined, fields are mapped dynamically: each string field is mapped as "text" (with a "keyword" sub-field).

Each field mapped as "text" is "analyzed", i.e. it is processed by an Elasticsearch analyzer.

## Default analyzer
If no custom analyzer is set for an index, Elasticsearch applies a "default analyzer", which is the "standard" analyzer.

## Elasticsearch analyzer chain
Each Elasticsearch analyzer is a processing chain that can include the following components, applied in this order:

1. **Char filters** ⇒ substitution of character sequences before tokenization<br>
ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html<br>
The standard analyzer defines no char filter<br><br>

2. **Tokenizer** ⇒ splitting of the text into tokens<br>
The standard tokenizer uses the Unicode Text Segmentation algorithm<br>
ref: http://unicode.org/reports/tr29/<br><br>

3. **Token filters** ⇒ transformation, removal or addition of tokens (e.g. lowercasing, stop words)<br>
ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html<br><br>
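
The chain described above can be sketched in plain Java, without any Elasticsearch dependency. This is only an illustration under simplifying assumptions: the class `AnalyzerChainSketch` and its methods are hypothetical names, the char filter maps Arabic-Indic digits to ASCII digits, the tokenizer splits on anything that is not a letter or a digit (a rough stand-in for the Unicode segmentation rules), and the only token filter is a lowercase filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified model of an analyzer chain (not the Elasticsearch implementation).
public class AnalyzerChainSketch {

    // Step 1, char filter: rewrite characters before tokenization.
    // Here Arabic-Indic digits (U+0660 to U+0669) become ASCII digits.
    static String charFilter(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= '\u0660' && c <= '\u0669') {
                sb.append((char) ('0' + (c - '\u0660')));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Step 2, tokenizer: split the filtered text into tokens.
    // Assumption: any run of characters that are neither letters nor digits is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3, token filter: transform each token (here, lowercase it).
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // The analyzer is the composition of the three steps, in this order.
    public static List<String> analyze(String text) {
        return lowercase(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // The digits are written as Unicode escapes for the Arabic-Indic digits 2, 5, 0, 1, 5
        System.out.println(analyze("My license plate is \u0662\u0665\u0660\u0661\u0665"));
        // prints [my, license, plate, is, 25015]
    }
}
```

The order matters: because the char filter runs first, the tokenizer already sees ASCII digits.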

## Multiple analyzers
Analyzers are defined in the "settings" section of the index. To apply a given analyzer to a field, set the "analyzer" parameter of that field in the "mappings" section of the index.

ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer.html

<br><br>
</font>

In [1]:
%%bash
# Start a local Elasticsearch server
$HOME/SCRIPTS/start_es.sh

sudo: no tty present and no askpass program specified



In [2]:
%%bash
# check if the server is running
curl -X GET -i http://localhost:9200

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   493  100   493    0     0   4522      0 --:--:-- --:--:-- --:--:--  4522
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 493

{
  "name" : "-zuCF-z",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "rZX5yELHQfyX2bMjZr6wyw",
  "version" : {
    "number" : "6.6.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "1fd8f69",
    "build_date" : "2019-02-13T17:10:04.160291Z",
    "build_snapshot" : false,
    "lucene_version" : "7.6.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}



# Creating custom analyzers

In [3]:
%%bash
# delete the index (and its settings) if it already exists
curl -X DELETE -i http://localhost:9200/my_index

# create 3 analyzers
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/my_index --data '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer1": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter1"
          ]
        },
        "my_analyzer2": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter1"
          ],
          "filter": ["lowercase"]
        },
        "my_analyzer3": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter2"
          ]
        }
      },
      "char_filter": {
        "my_char_filter1": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        },
        "my_char_filter2": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      }
    }
  }
}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    21  100    21    0     0    141      0 --:--:-- --:--:-- --:--:--   142
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 21

{"acknowledged":true}  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1123  100    67  100  1056    191   3025 --:--:-- --:--:-- --:--:--  3217
HTTP/1.1 100 Continue

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 67

{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}



# What does the standard analyzer do?

In [4]:
%%bash
curl -X POST -H 'Content-Type: application/json' \
    http://localhost:9200/my_index/_analyze \
    -d '{"analyzer": "standard","text":"It even appeared recently in graphic-novel form"}'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   768  100   687  100    81  10253   1208 --:--:-- --:--:-- --:--:-- 11462
"it"
"even"
"appeared"
"recently"
"in"
"graphic"
"novel"
"form"



# Testing my_analyzer1

In [5]:
%%bash

curl -X POST -H 'Content-Type: application/json' \
    http://localhost:9200/my_index/_analyze \
    -d '{"analyzer": "my_analyzer1", "text": "My license plate is ٢٥٠١٥" }'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   181  100   110  100    71   5238   3380 --:--:-- --:--:-- --:--:--  8619
"My license plate is 25015"



# Testing my_analyzer2

In [6]:
%%bash

curl -X POST -H 'Content-Type: application/json' \
    http://localhost:9200/my_index/_analyze \
    -d '{"analyzer": "my_analyzer2", "text": "My license plate is ٢٥٠١٥" }'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   496  100   425  100    71  28333   4733 --:--:-- --:--:-- --:--:-- 33066
"my"
"license"
"plate"
"is"
"25015"



# Testing my_analyzer3

In [7]:
%%bash

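# Note: inside a single-quoted bash string, '' closes and reopens the quote,
# so the text actually sent is "Im delighted about it :(" (the apostrophe is dropped).
curl -X POST -H 'Content-Type: application/json' \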
    http://localhost:9200/my_index/_analyze \
    -d '{"analyzer": "my_analyzer3", "text": "I''m delighted about it :(" }'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   497  100   432  100    65  22736   3421 --:--:-- --:--:-- --:--:-- 26157
"Im"
"delighted"
"about"
"it"
"_sad_"



<font size="3">
    
# Elasticsearch ngrams and shingles

## Elasticsearch ngrams

In Elasticsearch, "ngram" is a kind of tokenizer that supports full-text indexing with tokens built from sequences of characters rather than sequences of words.

To know more about Elasticsearch "ngram" tokenization, read:<br>
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
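
As a rough illustration of character n-grams, here is a minimal plain-Java sketch. The class `CharNGramSketch` and the parameter names `minGram` and `maxGram` are hypothetical, and the code works on Java `char` units rather than Unicode code points, so it should be read as a model of the idea and not as the Elasticsearch tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: character n-grams of a string, for lengths minGram..maxGram.
public class CharNGramSketch {

    public static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // For each start offset, emit every gram length that still fits in the text
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("novel", 2, 3));
        // prints [no, nov, ov, ove, ve, vel, el]
    }
}
```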

## Elasticsearch shingles

It is also possible to build tokens from sequences of words, provided that the analyzer includes a "shingle" token filter.
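
To make the idea of word sequences concrete, the following plain-Java sketch builds word shingles from a list of tokens. The class `WordShingleSketch` and its parameters are hypothetical names, and the sketch assumes `minSize >= 2` with single tokens always emitted as well; it ignores stop words and token positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: word shingles (word n-grams) over an already tokenized text.
public class WordShingleSketch {

    public static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // The single token itself (unigram)
            out.add(tokens.get(start));
            // Then every shingle of minSize..maxSize words starting at this token
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("graphic", "novel", "form"), 2, 5));
        // prints [graphic, graphic novel, graphic novel form, novel, novel form, form]
    }
}
```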

</font>

In [8]:
%%bash

# delete the index (and its settings) if it already exists
curl -X DELETE -i http://localhost:9200/my_index

# Example of an analyzer with a shingle filter
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/my_index --data '{
  "settings": {
    "index" : {
      "number_of_shards" : "1",
      "number_of_replicas" : "0"
    },
    "analysis": {
      "analyzer": {
        "shingle5": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop", "shingle5_filter"]
        }
      },
      "filter": {
        "shingle5_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5
        },
        "my_stop": {
            "type":       "stop",
            "stopwords":  "_english_"
        }
      }
    }
  },
  "mappings":{
    "my_mapping": {
      "properties":{
         "doc_txt": {
            "type":"text",
            "analyzer":"shingle5"
         }
      }
    }
  }
}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    21  100    21    0     0    304      0 --:--:-- --:--:-- --:--:--   304
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 21

{"acknowledged":true}  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   822  100    67  100   755    435   4902 --:--:-- --:--:-- --:--:--  5337
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 67

{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}



In [9]:
%%bash

# Apply the shingle5 analyzer:
curl -X POST -H 'Content-Type: application/json' \
    http://localhost:9200/my_index/_analyze -d '{"analyzer": "shingle5","text":"It even appeared recently in graphic-novel form"}'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3129  100  3048  100    81   156k   4263 --:--:-- --:--:-- --:--:--  160k
"_ even"
"_ even appeared"
"_ even appeared recently"
"_ even appeared recently _"
"even"
"even appeared"
"even appeared recently"
"even appeared recently _"
"even appeared recently _ graphic"
"appeared"
"appeared recently"
"appeared recently _"
"appeared recently _ graphic"
"appeared recently _ graphic novel"
"recently"
"recently _"
"recently _ graphic"
"recently _ graphic novel"
"recently _ graphic novel form"
"_ graphic"
"_ graphic novel"
"_ graphic novel form"
"graphic"
"graphic novel"
"graphic novel form"
"novel"
"novel form"
"form"



In [10]:
%%bash

# Apply the shingle5 analyzer:
curl -X POST -H 'Content-Type: application/json' \
    http://localhost:9200/my_index/_analyze -d '{"analyzer": "shingle5","text":"A graphic novel is a book made up of comics content"}'|jq '.tokens[].token'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4294  100  4209  100    85   456k   9444 --:--:-- --:--:-- --:--:--  465k
"_ graphic"
"_ graphic novel"
"_ graphic novel _"
"_ graphic novel _ _"
"graphic"
"graphic novel"
"graphic novel _"
"graphic novel _ _"
"graphic novel _ _ book"
"novel"
"novel _"
"novel _ _"
"novel _ _ book"
"novel _ _ book made"
"_ _ book"
"_ _ book made"
"_ _ book made up"
"_ book"
"_ book made"
"_ book made up"
"_ book made up _"
"book"
"book made"
"book made up"
"book made up _"
"book made up _ comics"
"made"
"made up"
"made up _"
"made up _ comics"
"made up _ comics content"
"up"
"up _"
"up _ comics"
"up _ comics content"
"_ comics"
"_ comics content"
"comics"
"comics content"
"content"
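
The `_` tokens in the two outputs above appear where the stop filter removed a word: the removed word leaves a position gap, and the shingle filter fills gaps with its filler token, which defaults to `_`. The plain-Java sketch below is a simplified model of that behaviour, not the Lucene implementation. The class `ShingleFillerSketch` is a hypothetical name, the stop word set is only the subset needed for the two sentences, and the rules (no single-token output for a filler, no shingle made only of fillers) are assumptions inferred from the outputs shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified model of lowercase + stop + shingle filters.
public class ShingleFillerSketch {

    static final String FILLER = "_";
    // Assumption: small subset of the English stop words, enough for the examples
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "in", "is", "it", "of"));

    public static List<String> analyze(String text, int minSize, int maxSize) {
        // Lowercase, split on non letters/digits, and replace stop words by the filler
        List<String> slots = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                slots.add(STOP.contains(t) ? FILLER : t);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < slots.size(); start++) {
            // A filler is never emitted as a single token
            if (!slots.get(start).equals(FILLER)) {
                out.add(slots.get(start));
            }
            for (int size = minSize; size <= maxSize && start + size <= slots.size(); size++) {
                List<String> window = slots.subList(start, start + size);
                // Skip shingles that contain nothing but fillers
                if (!window.stream().allMatch(FILLER::equals)) {
                    out.add(String.join(" ", window));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        analyze("It even appeared recently in graphic-novel form", 2, 5)
            .forEach(token -> System.out.println("\"" + token + "\""));
    }
}
```

For the two example sentences this model yields the same token lists as the `_analyze` calls above; other inputs (for example a sentence ending with a stop word) may differ from what Elasticsearch returns.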



In [11]:
%%bash

# insert first document

curl -X PUT -H 'Content-Type: application/json' http://localhost:9200/my_index/my_mapping/1 --data '{
  "doc_txt": "It even appeared recently in graphic-novel form"
}'

#  insert second document

curl -X PUT -H 'Content-Type: application/json' http://localhost:9200/my_index/my_mapping/2 --data '{
  "doc_txt": "A graphic novel is a book made up of comics content"
}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   228  100   162  100    66   1396    568 --:--:-- --:--:-- --:--:--  1965
{"_index":"my_index","_type":"my_mapping","_id":"1","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   232  100   162  100    70   7363   3181 --:--:-- --:--:-- --:--:-- 10545
{"_index":"my_index","_type":"my_mapping","_id":"2","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":1,"_primary_term":1}



In [12]:
%%bash

# check if data has been inserted
curl -X POST -H 'Content-Type: application/json' http://localhost:9200/my_index/my_mapping/_search --data '{
    "query": {
        "match_all": {}
    }
}'|jq '.'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   471  100   423  100    48   5794    657 --:--:-- --:--:-- --:--:--  6452
{
  "took": 57,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_mapping",
        "_id": "1",
        "_score": 1,
        "_source": {
          "doc_txt": "It even appeared recently in graphic-novel form"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_mapping",
        "_id": "2",
        "_score": 1,
        "_source": {
          "doc_txt": "A graphic novel is a book made up of comics content"
        }
      }
    ]
  }
}



In [13]:
%%classpath add mvn
org.apache.lucene lucene-core 7.6.0
org.elasticsearch.client elasticsearch-rest-high-level-client 6.6.1
org.elasticsearch elasticsearch 6.6.1
org.apache.logging.log4j log4j-core 2.11.2
com.google.code.gson gson 2.8.5

In [14]:
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.RestClient;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.action.admin.indices.get.GetIndexRequest;
import org.elasticsearch.action.admin.indices.get.GetIndexResponse;
import org.elasticsearch.action.admin.indices.settings.get.GetSettingsRequest;
import org.elasticsearch.action.admin.indices.settings.get.GetSettingsResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonParser;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import java.io.IOException;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.commons.lang3.tuple.MutablePair;
import java.io.InputStream;

public class ElasticClient {
    private String esServer;
    private Integer esPort;
    private RestHighLevelClient client = null;
    private static final Gson GSONPRETTY = new GsonBuilder().setPrettyPrinting().create();
    private static final JsonParser JP = new JsonParser();
    
    ElasticClient(String esServer, Integer esPort) {
        this.esServer = esServer;
        this.esPort = esPort;
    }
    
    public RestHighLevelClient getClient() {
        if (client == null) {
            client = new RestHighLevelClient(RestClient.builder(new HttpHost(esServer, esPort, "http")));
        }
        return client;
    }
    
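    // Runs a match_all search on the given index and pretty-prints the response.
    // Note: the "query" parameter is currently unused.
    public void searchAll(String index, String query) throws IOException {
        SearchRequest searchRequest = new SearchRequest(index);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchAllQuery());
        searchRequest.source(searchSourceBuilder);
        // Use the non-deprecated overload that takes RequestOptions
        SearchResponse response = getClient().search(searchRequest, RequestOptions.DEFAULT);
        jsonPrettyPrint(response.toString());
    }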
    
    public static void jsonPrettyPrint(String json) {
        JsonElement je = JP.parse(json);
        System.out.println(GSONPRETTY.toJson(je));
    }
    
    public String[] getAllIndices() throws IOException {
        GetIndexRequest request = new GetIndexRequest().indices("*");
        // getClient() creates the client lazily; the "client" field may still be null here
        GetIndexResponse response = getClient().indices().get(request, RequestOptions.DEFAULT);
        return response.getIndices();
    }
    
    // Returns the pair (index uuid, number of shards) read from the index settings
    public Pair<String, Integer> getIndexProperties(String index) throws IOException {
        GetSettingsRequest request = new GetSettingsRequest().indices(index);
        GetSettingsResponse response = getClient().indices().getSettings(request, RequestOptions.DEFAULT);
        JsonElement je = JP.parse(response.toString());
        JsonObject indexObj = je.getAsJsonObject().get(index).getAsJsonObject().get("settings").getAsJsonObject().get("index").getAsJsonObject();
        return new MutablePair<>(indexObj.get("uuid").getAsString(), indexObj.get("number_of_shards").getAsInt());
    }
}

com.twosigma.beaker.javash.bkr451fa51c.ElasticClient

In [15]:
import java.util.Arrays;
import org.apache.commons.lang3.tuple.Pair;

ElasticClient esCli = new ElasticClient("localhost", 9200);
esCli.searchAll("my_index", "");
System.out.println();
System.out.println(Arrays.toString(esCli.getAllIndices()));
System.out.println();
Pair<String, Integer> myIndexSettings = esCli.getIndexProperties("my_index");
NamespaceClient.getBeakerX().set("my_index_uuid", myIndexSettings.getKey());
NamespaceClient.getBeakerX().set("my_index_shards", myIndexSettings.getValue());
System.out.println(myIndexSettings);

ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2


{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_mapping",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "doc_txt": "It even appeared recently in graphic-novel form"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_mapping",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "doc_txt": "A graphic novel is a book made up of comics content"
        }
      }
    ]
  }
}

[.kibana_1, my_index, nutch]

(YtgaGpRpTCyDhEtcY6SonQ,1)


null

In [16]:
/*******************************************************************************
 * Copyright (c) 2010, 2012 Institute for Dutch Lexicology
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *******************************************************************************/
import java.text.Normalizer;
import java.util.regex.Pattern;

import org.apache.commons.lang3.StringUtils;

/**
 * A collection of String-related utility methods and regular expression patterns.
 */
public class StringUtil {
    
    final static Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");


    private StringUtil() {}

    /**
     * <p>Removes diacritics (~= accents) from a string. The case will not be altered.</p>
     * <p>For instance, '&agrave;' will be replaced by 'a'.</p>
     * <p>Note that ligatures will be left as is.</p>
     *
     * <pre>
     * StringUtils.stripAccents(null)                = null
     * StringUtils.stripAccents("")                  = ""
     * StringUtils.stripAccents("control")           = "control"
     * StringUtils.stripAccents("&eacute;clair")     = "eclair"
     * </pre>
     *
     * NOTE: this method was copied from Apache StringUtils. The only change is precompiling
     * the regular expression for efficiency.
     *
     * @param input String to be stripped
     * @return input text with diacritics removed
     *
     * @since 3.0
     */
    // See also Lucene's ASCIIFoldingFilter (Lucene 2.9) that replaces accented characters by their unaccented equivalent (and uncommitted bug fix: https://issues.apache.org/jira/browse/LUCENE-1343?focusedCommentId=12858907&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12858907).
    public static String stripAccents(final String input) {
        if(input == null) {
            return null;
        }
        final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
        convertRemainingAccentCharacters(decomposed);
        // Note that this doesn't correctly remove ligatures...
        return pattern.matcher(decomposed).replaceAll(StringUtils.EMPTY);
    }

    private static void convertRemainingAccentCharacters(StringBuilder decomposed) {
        for (int i = 0; i < decomposed.length(); i++) {
            if (decomposed.charAt(i) == '\u0141') {
                decomposed.deleteCharAt(i);
                decomposed.insert(i, 'L');
            } else if (decomposed.charAt(i) == '\u0142') {
                decomposed.deleteCharAt(i);
                decomposed.insert(i, 'l');
            }
        }
    }

}


com.twosigma.beaker.javash.bkr451fa51c.StringUtil
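
The accent-stripping approach used by `StringUtil` can be tried on its own with the JDK only. The sketch below is a hypothetical stand-alone variant (`StripAccentsSketch` is not part of the notebook) that follows the same steps: NFD decomposition, explicit handling of `Ł` and `ł`, then removal of combining diacritical marks.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical stand-alone variant of the accent-stripping logic, JDK only.
public class StripAccentsSketch {

    // Precompiled pattern matching runs of combining diacritical marks
    private static final Pattern MARKS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits an accented letter into a base letter followed by combining marks
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // L with stroke (U+0141, U+0142) has no canonical decomposition, so it is replaced explicitly
        decomposed = decomposed.replace('\u0141', 'L').replace('\u0142', 'l');
        // Dropping the combining marks leaves only the base letters
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u00e9clair"));        // prints eclair
        System.out.println(stripAccents("\u0141\u00f3d\u017a")); // prints Lodz
    }
}
```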

In [17]:
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.ArrayList;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.BytesRef;
import java.nio.file.Paths;
import java.io.IOException;
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.LogManager;
import java.nio.charset.Charset;

public class LuceneExplorer {
    
    public static String USR_HOME = (String)System.getProperty("user.home");
    public static String ES_INSTALL_PATH = String.format("%s/TOOLS/BigData/es-661", USR_HOME);
    private static final Logger logger = LogManager.getLogger(LuceneExplorer.class);
    static final Charset LUCENE_DEFAULT_CHARSET = Charset.forName("utf-8");
    
    private String indexUuid;
    private Integer indexNbrOfShards;
    
    public static void setEsInstallPath(String path) {
        ES_INSTALL_PATH = path;
    }
    
    LuceneExplorer(String uuid, Integer nbrShards) {
        indexUuid = uuid;
        indexNbrOfShards = nbrShards;
    }
    
    public List<String> getLucenePaths() {
        List<String> result = new ArrayList<String>();
        for (int i=0; i<indexNbrOfShards; i++) {
            result.add(String.format("%s/data/nodes/0/indices/%s/%d/index", ES_INSTALL_PATH, indexUuid, i));
        }
        return result;
    }
    
    // A term is kept unless it consists only of digits and spaces.
    // The document frequency is passed in so that a threshold (e.g. freq > 5) can be added here.
    static boolean isTermCandidate(String term, int freq) {
        return !StringUtils.isNumericSpace(term);
    }

    /**
     * Find terms in the index based on a prefix. Useful for autocomplete.
     * @param index the index
     * @param fieldName the field
     * @param prefix the prefix we're looking for (null or empty string for all terms)
     * @param sensitive match case-sensitively or not?
     * @param maxResults max. number of results to return (or -1 for all)
     * @return the matching terms
     */
    public static Map<String, Integer> findTermsByPrefix(IndexReader index, String fieldName,
        String prefix, boolean sensitive, int maxResults) {
        boolean allTerms = prefix == null || prefix.length() == 0;
        if (allTerms) {
            prefix = "";
            sensitive = true; // don't do unnecessary work in this case
        }
        try {
            if (!sensitive)
                prefix = StringUtil.stripAccents(prefix).toLowerCase();
            Map<String, Integer> results = new TreeMap<String, Integer>();
            for (LeafReaderContext leafReader: index.leaves()) {
                Terms terms = leafReader.reader().terms(fieldName);
                if (terms == null) {
                    if (logger.isDebugEnabled()) logger.debug("no terms for field " + fieldName + " in leafReader, skipping");
                    continue;
                }
                TermsEnum termsEnum = terms.iterator();
                BytesRef brPrefix = new BytesRef(prefix.getBytes(LUCENE_DEFAULT_CHARSET));
                TermsEnum.SeekStatus seekStatus = termsEnum.seekCeil(brPrefix);

                if (seekStatus == TermsEnum.SeekStatus.END) {
                    continue;
                }
                for (BytesRef term = termsEnum.term(); term != null; term = termsEnum.next()) {
                    if (maxResults >= 0 && results.size() >= maxResults) {
                        break; // enough results collected
                    }
                    String termText = term.utf8ToString();
                    // Insensitive matching compares an accent-stripped, lowercased form of the term
                    String comparable = sensitive ? termText : StringUtil.stripAccents(termText).toLowerCase();
                    if (!allTerms && !comparable.startsWith(prefix)) {
                        // Past the prefix range of the sorted term dictionary: stop scanning this leaf
                        break;
                    }
                    try {
                        int freq = termsEnum.docFreq();
                        if (isTermCandidate(termText, freq)) {
                            // Match, add term (keep the first frequency seen for a term)
                            results.putIfAbsent(termText, freq);
                        }
                    } catch (Exception e) {
                        // Skip this term but keep a trace instead of silently swallowing the error
                        logger.warn("skipping term " + termText, e);
                    }
                }
            }
            return results;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void exploreLuceneIndex(String indexPath) throws IOException {
        // try-with-resources closes the reader and the directory when done
        try (SimpleFSDirectory dir = new SimpleFSDirectory(Paths.get(indexPath));
             IndexReader reader = DirectoryReader.open(dir)) {
            // Number of documents that have at least one term in the "doc_txt" field
            int shardCardinal = reader.getDocCount("doc_txt");
            System.out.println(String.format("lucene index: %s\ntotal docs:%d", indexPath, shardCardinal));
            Map<String, Integer> terms = findTermsByPrefix(reader, "doc_txt", "", true, -1);
            terms.forEach((term, freq) -> System.out.println("\"" + StringEscapeUtils.escapeJava(term) + "\"," + freq));
        }
    }
}

com.twosigma.beaker.javash.bkr451fa51c.LuceneExplorer
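
The prefix lookup in `findTermsByPrefix` relies on the term dictionary being sorted: it seeks to the first term greater than or equal to the prefix, then reads terms in order until one no longer starts with the prefix. The same idea can be sketched with a JDK `TreeMap`, as below. The class `PrefixScanSketch` is hypothetical, and a `TreeMap<String, Integer>` is only a stand-in for the Lucene term dictionary (it sorts by UTF-16 code units, whereas Lucene sorts terms by their UTF-8 bytes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: prefix scan over a sorted term dictionary (term -> document frequency).
public class PrefixScanSketch {

    public static Map<String, Integer> findByPrefix(TreeMap<String, Integer> dict, String prefix, int maxResults) {
        Map<String, Integer> results = new LinkedHashMap<>();
        // tailMap(prefix, true) starts at the first key >= prefix, similar in spirit to seekCeil
        for (Map.Entry<String, Integer> e : dict.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // keys are sorted, so the prefix range is over
            }
            if (maxResults >= 0 && results.size() >= maxResults) {
                break; // enough results (a negative maxResults means no limit)
            }
            results.put(e.getKey(), e.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dict = new TreeMap<>();
        dict.put("book", 1);
        dict.put("book made", 1);
        dict.put("book made up", 1);
        dict.put("comics", 1);
        dict.put("even", 1);
        dict.put("graphic", 2);
        System.out.println(findByPrefix(dict, "book", -1));
        // prints {book=1, book made=1, book made up=1}
    }
}
```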

In [18]:
%%bash
# Stop Elasticsearch server
$HOME/SCRIPTS/stop_es.sh

ElasticSearch has just been stopped



In [19]:
//
// Exchange data between Java and Python kernels
//
import java.util.List;

String idxUuid = (String)NamespaceClient.getBeakerX().get("my_index_uuid");
Integer nbrShards = (Integer) NamespaceClient.getBeakerX().get("my_index_shards");
System.out.println(idxUuid);
System.out.println(nbrShards);

System.out.println();

LuceneExplorer luceneExplorer = new LuceneExplorer(idxUuid, nbrShards);
List<String> luceneIndices = luceneExplorer.getLucenePaths();
System.out.println(luceneIndices);

System.out.println();
luceneIndices.stream().forEach(idx -> {
    try {
        luceneExplorer.exploreLuceneIndex(idx);
    } catch (Exception e) {
        e.printStackTrace();
    }
});

YtgaGpRpTCyDhEtcY6SonQ
1



ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2


[/home/weborama.office/cverdier/TOOLS/BigData/es-661/data/nodes/0/indices/YtgaGpRpTCyDhEtcY6SonQ/0/index]

lucene index: /home/weborama.office/cverdier/TOOLS/BigData/es-661/data/nodes/0/indices/YtgaGpRpTCyDhEtcY6SonQ/0/index
total docs:2
"_ _ book",1
"_ _ book made",1
"_ _ book made up",1
"_ book",1
"_ book made",1
"_ book made up",1
"_ book made up _",1
"_ comics",1
"_ comics content",1
"_ even",1
"_ even appeared",1
"_ even appeared recently",1
"_ even appeared recently _",1
"_ graphic",2
"_ graphic novel",2
"_ graphic novel _",1
"_ graphic novel _ _",1
"_ graphic novel form",1
"appeared",1
"appeared recently",1
"appeared recently _",1
"appeared recently _ graphic",1
"appeared recently _ graphic novel",1
"book",1
"book made",1
"book made up",1
"book made up _",1
"book made up _ comics",1
"comics",1
"comics content",1
"content",1
"even",1
"even appeared",1
"even appeared recently",1
"even appeared recently _",1
"even appeared recently _ graphic",1
"form",1
"graphic",2
"graphic novel",

null

In [20]:
%%python
# Retrieve stored data from python kernel
from beakerx.object import beakerx

beakerx.my_index_uuid, beakerx.my_index_shards

('YtgaGpRpTCyDhEtcY6SonQ', 1)