Merge pull request #136 from duydo/feature/upgrade-to-es-8.x
Feature/upgrade to es 8.x
duydo committed May 8, 2023
2 parents 487f870 + 5ca2dfa commit d00e220
Showing 14 changed files with 235 additions and 231 deletions.
3 changes: 2 additions & 1 deletion .env.sample
@@ -1 +1,2 @@
-ES_VERSION=7.5.1
+ES_VERSION=8.7.0
+ELASTIC_PASSWORD=changeme
12 changes: 0 additions & 12 deletions .github/FUNDING.yml

This file was deleted.

3 changes: 1 addition & 2 deletions .github/workflows/test.yml
@@ -8,7 +8,6 @@ jobs:
     strategy:
       matrix:
         entry:
-          - { version: 11, distribution: 'adopt' }
           - { version: 17, distribution: 'adopt' }
     steps:
       - name: Checkout analysis-vietnamese
@@ -33,4 +32,4 @@ jobs:
       - name: Build and Test
         run: |
           export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
-          mvn --batch-mode test
+          mvn --batch-mode test
44 changes: 44 additions & 0 deletions Dockerfile
@@ -0,0 +1,44 @@
ARG ES_VERSION
FROM docker.elastic.co/elasticsearch/elasticsearch:$ES_VERSION as builder

USER root
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update -y && apt-get install -y software-properties-common build-essential
RUN gcc --version
RUN apt-get update -y && apt-get install -y make cmake pkg-config wget git

ENV JAVA_HOME=/usr/share/elasticsearch/jdk
ENV PATH=$JAVA_HOME/bin:$PATH

# Build coccoc-tokenizer
RUN echo "Build coccoc-tokenizer..."
WORKDIR /tmp
RUN git clone https://github.com/duydo/coccoc-tokenizer.git
RUN mkdir /tmp/coccoc-tokenizer/build
WORKDIR /tmp/coccoc-tokenizer/build
RUN cmake -DBUILD_JAVA=1 ..
RUN make install

# Build analysis-vietnamese
RUN echo "analysis-vietnamese..."
WORKDIR /tmp
RUN wget https://dlcdn.apache.org/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz \
&& tar xvf apache-maven-3.8.8-bin.tar.gz
ENV MVN_HOME=/tmp/apache-maven-3.8.8
ENV PATH=$MVN_HOME/bin:$PATH

COPY . /tmp/elasticsearch-analysis-vietnamese
WORKDIR /tmp/elasticsearch-analysis-vietnamese
RUN mvn verify clean --fail-never
RUN mvn --batch-mode -Dmaven.test.skip -e package

FROM docker.elastic.co/elasticsearch/elasticsearch:$ES_VERSION
ARG ES_VERSION
ARG COCCOC_INSTALL_PATH=/usr/local
ARG COCCOC_DICT_PATH=$COCCOC_INSTALL_PATH/share/tokenizer/dicts

COPY --from=builder $COCCOC_INSTALL_PATH/lib/libcoccoc_tokenizer_jni.so /usr/lib
COPY --from=builder $COCCOC_DICT_PATH $COCCOC_DICT_PATH
COPY --from=builder /tmp/elasticsearch-analysis-vietnamese/target/releases/elasticsearch-analysis-vietnamese-$ES_VERSION.zip /
RUN echo "Y" | /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch file:///elasticsearch-analysis-vietnamese-$ES_VERSION.zip
53 changes: 44 additions & 9 deletions README.md
@@ -108,20 +108,52 @@ The above example produces the following terms:
```

+## Use Docker
+
+Make sure you have installed both Docker & docker-compose
+
+### Build the image with Docker Compose
+
+```sh
+# Copy, edit ES version and password for user elastic in file .env. Default password: changeme
+cp .env.sample .env
+docker compose build
+docker compose up
+```
+### Verify
+```sh
+curl -k http://elastic:changeme@localhost:9200/_analyze -H 'Content-Type: application/json' -d '
+{
+"analyzer": "vi_analyzer",
+"text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"
+}'
+
+# Output
+{"tokens":[{"token":"cộng hòa","start_offset":0,"end_offset":8,"type":"<WORD>","position":0},{"token":"xã hội","start_offset":9,"end_offset":15,"type":"<WORD>","position":1},{"token":"chủ nghĩa","start_offset":16,"end_offset":25,"type":"<WORD>","position":2},{"token":"việt nam","start_offset":26,"end_offset":34,"type":"<WORD>","position":3}]}
+```
+
## Build from Source
### Step 1: Build C++ tokenizer for Vietnamese library
```sh
-git clone https://github.com/coccoc/coccoc-tokenizer.git
+git clone https://github.com/duydo/coccoc-tokenizer.git
cd coccoc-tokenizer && mkdir build && cd build
cmake -DBUILD_JAVA=1 ..
make install
+# Link the coccoc shared lib to /usr/lib
+sudo ln -sf /usr/local/lib/libcoccoc_tokenizer_jni.* /usr/lib/
```
By default, the `make install` installs:
-- the lib commands (`tokenizer`, `dict_compiler` and `vn_lang_tool`) under `/usr/local/bin`
-- the dynamic lib (`libcoccoc_tokenizer_jni.so`) under `/usr/local/lib/`. The plugin uses this lib directly.
-- the dictionary files under `/usr/local/share/tokenizer/dicts`. The plugin uses this path for `dict_path` by default.
+- The lib commands `tokenizer`, `dict_compiler` and `vn_lang_tool` under `/usr/local/bin`
+- The dynamic lib `libcoccoc_tokenizer_jni.so` under `/usr/local/lib/`. The plugin uses this lib directly.
+- The dictionary files under `/usr/local/share/tokenizer/dicts`. The plugin uses this path for `dict_path` by default.

+Verify
+```sh
+/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"
+# cộng hòa xã hội chủ nghĩa việt nam
+```
+
-Refer [the repo](https://github.com/coccoc/coccoc-tokenizer) for more information to build the library.
+Refer [the repo](https://github.com/duydo/coccoc-tokenizer) for more information to build the library.


### Step 2: Build the plugin
@@ -136,7 +168,7 @@ Optionally, edit the `elasticsearch-analysis-vietnamese/pom.xml` to change the v

```xml
...
-<version>7.17.1</version>
+<version>8.7.0</version>
...
```

@@ -149,16 +181,19 @@ mvn package
### Step 3: Installation the plugin on Elasticsearch

```sh
-bin/elasticsearch-plugin install file://target/releases/elasticsearch-analysis-vietnamese-7.17.1.zip
+bin/elasticsearch-plugin install file://target/releases/elasticsearch-analysis-vietnamese-8.7.0.zip
```

## Compatible Versions
From v7.12.11, the plugin uses CocCoc C++ tokenizer instead of the VnTokenizer by Lê Hồng Phương,
I don't maintain the plugin with the VnTokenizer anymore, if you want to continue developing with it, refer [the branch vntokenizer](https://github.com/duydo/elasticsearch-analysis-vietnamese/tree/vntokenizer).

| Vietnamese Analysis Plugin | Elasticsearch |
-| -------------------------- |-----------------|
-| master | 7.16 ~ 7.17.1 |
+|----------------------------|-----------------|
+| master | 8.7.0 |
+| develop | 8.7.0 |
+| 8.7.0 | 8.7.0 |
+| 7.16.1 | 7.16 ~ 7.17.1 |
| 7.12.1 | 7.12.1 ~ 7.15.x |
| 7.3.1 | 7.3.1 |
| 5.6.5 | 5.6.5 |
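
The Verify step in the README diff above checks the `_analyze` response by eye. A minimal sketch of scripting that check in POSIX shell follows. `extract_tokens` is a hypothetical helper that is not part of the plugin or this commit, and it assumes the compact JSON shape shown in the README output (no whitespace after the colon).

```sh
# Hypothetical helper (not part of the plugin or this commit): print one token
# per line from a compact _analyze JSON response read on stdin.
extract_tokens() {
  grep -o '"token":"[^"]*"' | sed -e 's/^"token":"//' -e 's/"$//'
}

# Sample response: the first two tokens of the README output shown in the diff.
response='{"tokens":[{"token":"cộng hòa","start_offset":0,"end_offset":8,"type":"<WORD>","position":0},{"token":"xã hội","start_offset":9,"end_offset":15,"type":"<WORD>","position":1}]}'

# Prints each token on its own line.
printf '%s\n' "$response" | extract_tokens
```

Against a running container, the `printf` could be replaced by the `curl` command from the README; a pretty-printed response would need a real JSON parser rather than this pattern match.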
32 changes: 24 additions & 8 deletions docker-compose.yaml
@@ -1,15 +1,31 @@
 version: '3.4'

 services:
   elasticsearch:
-    image: docker.elastic.co/elasticsearch/elasticsearch:${ES_VERSION}
+    build:
+      context: .
+      args:
+        ES_VERSION: ${ES_VERSION}
     restart: on-failure
     ports:
       - "9200:9200"
-    volumes:
-      - ./target/releases/elasticsearch-analysis-vietnamese-${ES_VERSION}.zip:/usr/share/elasticsearch/plugin/elasticsearch-analysis-vietnamese-${ES_VERSION}.zip
-      - ./install-es-plugin.sh:/apps/install-es-plugin.sh
+    ulimits:
+      nofile:
+        soft: 65536
+        hard: 65536
+      memlock:
+        hard: -1
+        soft: -1
     environment:
-      - "ES_VERSION=${ES_VERSION}"
-      - "discovery.type=single-node"
-    entrypoint:
-      - /apps/install-es-plugin.sh
+      ES_JAVA_OPTS: "-Xmx2g -Xms2g"
+      ELASTIC_USERNAME: "elastic"
+      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
+      bootstrap.memory_lock: "true"
+      discovery.type: "single-node"
+      xpack.security.enabled: "true"
+    networks:
+      - elastic
+
+networks:
+  elastic:
+    driver: bridge
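
The compose file above reads `${ES_VERSION}` and `${ELASTIC_PASSWORD}` from `.env`, so a missing entry only surfaces at build or startup time. A minimal pre-flight sketch, assuming a POSIX shell; `check_env` is a hypothetical helper that is not part of this commit.

```sh
# Hypothetical pre-flight check (not part of this commit): succeed only if the
# given env file assigns a non-empty value to every variable the compose file reads.
check_env() {
  env_file="$1"
  for key in ES_VERSION ELASTIC_PASSWORD; do
    if ! grep -q "^${key}=." "$env_file"; then
      echo "missing ${key} in ${env_file}" >&2
      return 1
    fi
  done
}

# Demo against a temporary file holding the values from .env.sample.
sample="$(mktemp)"
printf 'ES_VERSION=8.7.0\nELASTIC_PASSWORD=changeme\n' > "$sample"
check_env "$sample" && echo "env ok"
```

In practice this would be run as `check_env .env` before `docker compose build`.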
8 changes: 0 additions & 8 deletions install-es-plugin.sh

This file was deleted.

6 changes: 3 additions & 3 deletions pom.xml
@@ -3,7 +3,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-analysis-vietnamese</artifactId>
-<version>7.17.1</version>
+<version>8.7.0</version>
<packaging>jar</packaging>
<name>elasticsearch-analysis-vietnamese</name>
<url>https://github.com/duydo/elasticsearch-analysis-vietnamese/</url>
@@ -83,8 +83,8 @@
<artifactId>maven-compiler-plugin</artifactId>
<version>3.3</version>
<configuration>
-<source>11</source>
-<target>11</target>
+<source>1.8</source>
+<target>1.8</target>
</configuration>
</plugin>
<plugin>