(Image: a rice cake grilled over a shichirin charcoal grill)

Omochi 😊

A full-text search engine written from scratch in Go ʕ◔ϖ◔ʔ (just a toy)

✨ Features

  • Omochi is an inverted-index-based search engine written in Go.
  • Any document can be searched once it has been indexed correctly.
  • You can search documents through a RESTful API.
  • Supported languages: English and Japanese.
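
To illustrate the idea behind an inverted index, here is a minimal sketch. It is not Omochi's actual implementation: the `InvertedIndex` type, its methods, and the in-memory map are assumptions made for illustration, and the tokenizer only lowercases and splits on whitespace (no stop-word removal, no punctuation stripping, no Japanese morphological analysis).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// InvertedIndex maps each token to the set of document IDs containing it.
// Hypothetical type for illustration; Omochi stores its index in MariaDB.
type InvertedIndex map[string]map[int]struct{}

// Add tokenizes content naively (lowercase, split on whitespace)
// and records docID under every token.
func (idx InvertedIndex) Add(docID int, content string) {
	for _, tok := range strings.Fields(strings.ToLower(content)) {
		if idx[tok] == nil {
			idx[tok] = make(map[int]struct{})
		}
		idx[tok][docID] = struct{}{}
	}
}

// Search returns the sorted IDs of matching documents.
// Mode "And" requires every keyword to match; any other mode behaves like "Or".
func (idx InvertedIndex) Search(keywords []string, mode string) []int {
	counts := make(map[int]int) // docID -> number of distinct keywords matched
	seen := make(map[string]bool)
	for _, kw := range keywords {
		kw = strings.ToLower(kw)
		if seen[kw] {
			continue // count each distinct keyword once
		}
		seen[kw] = true
		for id := range idx[kw] {
			counts[id]++
		}
	}
	ids := []int{}
	for id, n := range counts {
		if mode != "And" || n == len(seen) {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

func main() {
	idx := InvertedIndex{}
	idx.Add(1, "Toy Story")
	idx.Add(2, "Toy Soldiers")
	idx.Add(3, "The NeverEnding Story")

	fmt.Println(idx.Search([]string{"toy", "story"}, "And")) // [1]
	fmt.Println(idx.Search([]string{"toy", "story"}, "Or"))  // [1 2 3]
}
```

Because each keyword lookup is a single map access, search cost depends on the length of the posting lists rather than on the total number of documents.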

📍 Requirements

📦 Setup

Create network

Create the Docker network (omochi_network) with:

$ docker network create omochi_network

Database migration

Omochi stores its inverted indexes and documents in MariaDB and uses Ent as the ORM.

To migrate the database, open a shell in the Docker container with:

$ docker-compose run api bash

Then run the database migration with:

$ go run ./cmd/migrate/migrate.go 

Seed data

To let you try the search engine, this project provides two sample datasets in TSV format.

The English dataset is a movie title dataset, and the Japanese dataset is a Doraemon comic title dataset.

First, open a shell in the Docker container with:

$ docker-compose run api bash

Then seed the data with:

$ go run {path to seed.go}

For the Japanese dataset, {path to seed.go} is ./cmd/seeds/ja/seed.go; for the English dataset, it is ./cmd/seeds/eng/seed.go.

🏇 Start Application

After completing the setup, you can start the application by running:

$ docker-compose up

The application starts a RESTful API that listens for connections on port 8081.

🌎 How to use & Demo

After seeding the data, you can search documents by sending a GET request to /v1/document/search.

The query parameters are as follows:

  • "keywords": The keywords to search for. To search for multiple terms, separate them with commas, like "hoge,fuga,piyo".
  • "mode": The search mode, either "And" or "Or".
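
The request URL can also be built programmatically. The following is a minimal sketch: the `searchURL` helper is hypothetical, and it treats "mode" as optional only because the first demo request below omits it. `url.Values` percent-encodes the values, which matters for Japanese keywords.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// searchURL builds a search request URL for the given base address,
// e.g. "http://localhost:8081". Hypothetical helper for illustration.
func searchURL(base string, keywords []string, mode string) string {
	q := url.Values{}
	q.Set("keywords", strings.Join(keywords, ",")) // comma-separated terms
	if mode != "" {
		q.Set("mode", mode) // "And" or "Or"
	}
	// Encode sorts the keys and percent-encodes the values.
	return base + "/v1/document/search?" + q.Encode()
}

func main() {
	fmt.Println(searchURL("http://localhost:8081", []string{"toy", "story"}, "And"))
	// http://localhost:8081/v1/document/search?keywords=toy%2Cstory&mode=And
}
```

The comma is sent as `%2C`, which a standard query-string parser decodes back to "," before the keywords are split.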

Demo

  • Doraemon comic title dataset

After seeding the Doraemon comic title dataset, you can search for documents that include "ドラえもん" with:

$ curl "http://localhost:8081/v1/document/search?keywords=ドラえもん" | jq . 
{
  "documents": [
    {
      "id": 12054,
      "content": "ドラえもんの歌",
      "tokenized_content": [
        "ドラえもん",
        "歌"
      ],
      "created_at": "2022-07-08T12:59:49+09:00",
      "updated_at": "2022-07-08T12:59:49+09:00"
    },
    {
      "id": 11992,
      "content": "恋するドラえもん",
      "tokenized_content": [
        "恋する",
        "ドラえもん"
      ],
      "created_at": "2022-07-08T12:59:48+09:00",
      "updated_at": "2022-07-08T12:59:48+09:00"
    },
    {
      "id": 11230,
      "content": "ドラえもん登場!",
      "tokenized_content": [
        "ドラえもん",
        "登場"
      ],
      "created_at": "2022-07-08T12:59:44+09:00",
      "updated_at": "2022-07-08T12:59:44+09:00"
    },
    ... 
  • Movie title dataset

After seeding the movie title dataset, you can search for documents that include both "toy" and "story" with:

$ curl "http://localhost:8081/v1/document/search?keywords=toy,story&mode=And" | jq .
{
  "documents": [
    {
      "id": 1,
      "content": "Toy Story",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:49:24+09:00",
      "updated_at": "2022-07-08T13:49:24+09:00"
    },
    {
      "id": 39,
      "content": "Toy Story of Terror!",
      "tokenized_content": [
        "toy",
        "story",
        "terror"
      ],
      "created_at": "2022-07-08T13:49:34+09:00",
      "updated_at": "2022-07-08T13:49:34+09:00"
    },
    {
      "id": 83,
      "content": "Toy Story That Time Forgot",
      "tokenized_content": [
        "toy",
        "story",
        "time",
        "forgot"
      ],
      "created_at": "2022-07-08T13:49:53+09:00",
      "updated_at": "2022-07-08T13:49:53+09:00"
    },
    {
      "id": 213,
      "content": "Toy Story 2",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:50:35+09:00",
      "updated_at": "2022-07-08T13:50:35+09:00"
    },
    {
      "id": 352,
      "content": "Toy Story 3",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:51:23+09:00",
      "updated_at": "2022-07-08T13:51:23+09:00"
    }
  ]
}
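
A Go client could decode this response along the following lines. This is a sketch under assumptions: the `Document` and `SearchResponse` types and the `decodeSearchResponse` helper are hypothetical names, with fields inferred only from the JSON shown above, not from Omochi's source code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Document mirrors one entry of the "documents" array in the demo output.
// Hypothetical type; field names follow the JSON keys shown above.
type Document struct {
	ID               int       `json:"id"`
	Content          string    `json:"content"`
	TokenizedContent []string  `json:"tokenized_content"`
	CreatedAt        time.Time `json:"created_at"` // RFC 3339 timestamp
	UpdatedAt        time.Time `json:"updated_at"`
}

// SearchResponse mirrors the top-level JSON object of a search result.
type SearchResponse struct {
	Documents []Document `json:"documents"`
}

// decodeSearchResponse parses a raw response body into a SearchResponse.
func decodeSearchResponse(body []byte) (SearchResponse, error) {
	var res SearchResponse
	err := json.Unmarshal(body, &res)
	return res, err
}

func main() {
	// One document copied from the demo output above.
	sample := []byte(`{"documents":[{"id":1,"content":"Toy Story",
		"tokenized_content":["toy","story"],
		"created_at":"2022-07-08T13:49:24+09:00",
		"updated_at":"2022-07-08T13:49:24+09:00"}]}`)

	res, err := decodeSearchResponse(sample)
	if err != nil {
		panic(err)
	}
	for _, d := range res.Documents {
		fmt.Println(d.ID, d.Content, d.TokenizedContent)
	}
}
```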

📚 Reference

Dataset

Book

🧑‍💻 License

MIT