Omochi 😊

A full-text search engine written from scratch in Go ʕ◔ϖ◔ʔ (just a toy)

✨ Features

Omochi is a search engine written in Go and built on an inverted index.
Any document can be searched once it has been indexed correctly.
Documents are searched through a RESTful API.
Supported languages: English and Japanese.
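
To illustrate the idea behind an inverted index, here is a minimal in-memory sketch. It is not Omochi's implementation: the `Index` type, the `Tokenize`, `Add`, and `Search` names, and the whitespace tokenizer are all assumptions made for this example, and the sample titles are chosen only for illustration. Omochi itself stores its index in MariaDB, and a whitespace tokenizer would not work for Japanese text.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Index maps each token to the set of document IDs that contain it.
// (Hypothetical type for illustration, not taken from Omochi.)
type Index map[string]map[int]struct{}

// Tokenize lowercases the text and splits it on whitespace.
// This is a stand-in tokenizer; it is too naive for Japanese.
func Tokenize(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// Add records every token of the document under the given ID.
func (idx Index) Add(id int, text string) {
	for _, tok := range Tokenize(text) {
		if idx[tok] == nil {
			idx[tok] = make(map[int]struct{})
		}
		idx[tok][id] = struct{}{}
	}
}

// Search returns the sorted IDs of documents matching the keywords.
// If and is true, every keyword must match; otherwise any keyword may match.
func (idx Index) Search(keywords []string, and bool) []int {
	counts := make(map[int]int) // document ID -> number of distinct keywords matched
	seen := make(map[string]bool)
	unique := 0
	for _, kw := range keywords {
		kw = strings.ToLower(kw)
		if seen[kw] {
			continue // count repeated keywords only once
		}
		seen[kw] = true
		unique++
		for id := range idx[kw] {
			counts[id]++
		}
	}
	ids := []int{}
	for id, n := range counts {
		if !and || n == unique {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

func main() {
	idx := Index{}
	idx.Add(1, "Toy Story")
	idx.Add(2, "Toy Soldiers")
	idx.Add(3, "The NeverEnding Story")

	fmt.Println(idx.Search([]string{"toy", "story"}, true))  // [1]
	fmt.Println(idx.Search([]string{"toy", "story"}, false)) // [1 2 3]
}
```

Looking up a keyword costs one map access instead of a scan over every document, which is the main reason to build the index ahead of time.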

📍 Requirements

Golang 1.18+
Docker 20.10+

📦 Setup

Create network

Create a Docker network (omochi_network) with:

$ docker network create omochi_network

Database migration

Omochi uses MariaDB to store inverted indexes and documents, and Ent as the ORM.

To migrate the database, open a shell in the Docker container with:

$ docker-compose run api bash

Then run the database migration with:

$ go run ./cmd/migrate/migrate.go

Seed data

To try out the search engine, this project provides two sample datasets in TSV format.

The English dataset is a movie title dataset, and the Japanese dataset is a Doraemon comic title dataset.

First, open a shell in the Docker container with:

$ docker-compose run api bash

Then seed the data with:

$ go run {path to seed.go}

To initialize with the Japanese dataset, set {path to seed.go} to ./cmd/seeds/ja/seed.go. For the English dataset, use ./cmd/seeds/eng/seed.go.

🏇 Start Application

After completing the setup, you can start the application by running:

$ docker-compose up

The app starts a RESTful API that listens for connections on port 8081.

🌎 How to use & Demo

After seeding the data, you can search documents by sending a GET request to /v1/document/search.

The query parameters are as follows:

"keywords": Keywords to search for. Separate multiple search terms with commas, e.g. "hoge,fuga,piyo".
"mode": Search mode. The supported modes are "And" and "Or".
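
For callers that prefer Go over curl, the request URL can be assembled as in the sketch below. `BuildSearchURL` is a hypothetical helper written for this example and is not part of Omochi. The base URL assumes the default port 8081, and the sketch assumes the server decodes percent-encoded query strings in the usual way, since `url.Values.Encode` escapes the comma separator as `%2C`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// BuildSearchURL builds the search endpoint URL from a base URL,
// a list of keywords, and an optional mode ("And" or "Or").
// (Hypothetical helper for illustration, not part of Omochi.)
func BuildSearchURL(base string, keywords []string, mode string) string {
	q := url.Values{}
	// Multiple keywords are sent as one comma-separated value.
	q.Set("keywords", strings.Join(keywords, ","))
	if mode != "" {
		q.Set("mode", mode)
	}
	// Encode percent-escapes the values, including non-ASCII keywords.
	return base + "/v1/document/search?" + q.Encode()
}

func main() {
	u := BuildSearchURL("http://localhost:8081", []string{"toy", "story"}, "And")
	fmt.Println(u)
	// http://localhost:8081/v1/document/search?keywords=toy%2Cstory&mode=And
}
```

The resulting string can be passed to `http.Get` from the standard `net/http` package once the API is running.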

Demo

Doraemon comic title dataset

After seeding the Doraemon comic title dataset, you can search for documents that contain "ドラえもん" with:

$ curl "http://localhost:8081/v1/document/search?keywords=ドラえもん" | jq . 
{
  "documents": [
    {
      "id": 12054,
      "content": "ドラえもんの歌",
      "tokenized_content": [
        "ドラえもん",
        "歌"
      ],
      "created_at": "2022-07-08T12:59:49+09:00",
      "updated_at": "2022-07-08T12:59:49+09:00"
    },
    {
      "id": 11992,
      "content": "恋するドラえもん",
      "tokenized_content": [
        "恋する",
        "ドラえもん"
      ],
      "created_at": "2022-07-08T12:59:48+09:00",
      "updated_at": "2022-07-08T12:59:48+09:00"
    },
    {
      "id": 11230,
      "content": "ドラえもん登場！",
      "tokenized_content": [
        "ドラえもん",
        "登場"
      ],
      "created_at": "2022-07-08T12:59:44+09:00",
      "updated_at": "2022-07-08T12:59:44+09:00"
    },
    ...

Movie title dataset

After seeding the Movie title dataset, you can search for documents that contain both "toy" and "story" with:

$ curl "http://localhost:8081/v1/document/search?keywords=toy,story&mode=And" | jq .
{
  "documents": [
    {
      "id": 1,
      "content": "Toy Story",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:49:24+09:00",
      "updated_at": "2022-07-08T13:49:24+09:00"
    },
    {
      "id": 39,
      "content": "Toy Story of Terror!",
      "tokenized_content": [
        "toy",
        "story",
        "terror"
      ],
      "created_at": "2022-07-08T13:49:34+09:00",
      "updated_at": "2022-07-08T13:49:34+09:00"
    },
    {
      "id": 83,
      "content": "Toy Story That Time Forgot",
      "tokenized_content": [
        "toy",
        "story",
        "time",
        "forgot"
      ],
      "created_at": "2022-07-08T13:49:53+09:00",
      "updated_at": "2022-07-08T13:49:53+09:00"
    },
    {
      "id": 213,
      "content": "Toy Story 2",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:50:35+09:00",
      "updated_at": "2022-07-08T13:50:35+09:00"
    },
    {
      "id": 352,
      "content": "Toy Story 3",
      "tokenized_content": [
        "toy",
        "story"
      ],
      "created_at": "2022-07-08T13:51:23+09:00",
      "updated_at": "2022-07-08T13:51:23+09:00"
    }
  ]
}
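
The responses shown above can be decoded in Go roughly as follows. The field names and JSON tags mirror the sample output, but the `Document` and `SearchResponse` types and the `ParseSearchResponse` function are assumptions made for this sketch, not types exported by Omochi.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Document mirrors one entry of the "documents" array in the sample output.
// (Hypothetical type for illustration.)
type Document struct {
	ID               int       `json:"id"`
	Content          string    `json:"content"`
	TokenizedContent []string  `json:"tokenized_content"`
	CreatedAt        time.Time `json:"created_at"`
	UpdatedAt        time.Time `json:"updated_at"`
}

// SearchResponse mirrors the top-level JSON object of the sample output.
type SearchResponse struct {
	Documents []Document `json:"documents"`
}

// ParseSearchResponse decodes a response body into a SearchResponse.
func ParseSearchResponse(body []byte) (SearchResponse, error) {
	var res SearchResponse
	err := json.Unmarshal(body, &res)
	return res, err
}

func main() {
	// One document copied from the sample output above.
	body := []byte(`{"documents":[{"id":1,"content":"Toy Story",
		"tokenized_content":["toy","story"],
		"created_at":"2022-07-08T13:49:24+09:00",
		"updated_at":"2022-07-08T13:49:24+09:00"}]}`)

	res, err := ParseSearchResponse(body)
	if err != nil {
		panic(err)
	}
	for _, d := range res.Documents {
		fmt.Println(d.ID, d.Content, d.TokenizedContent)
	}
	// 1 Toy Story [toy story]
}
```

The timestamps in the sample are RFC 3339 strings, so they decode directly into `time.Time`.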

📚 Reference

Dataset

Fujiko F. Fujio, Doraemon (Tentomushi Comics) 1-45, Shogakukan, 1974-1996.
Rounak Banik, "The Movies Dataset", Kaggle, https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. Accessed on 07/08.

Book

Information Retrieval: Implementing and Evaluating Search Engines
情報検索アルゴリズム
Pythonではじめる情報検索プログラミング
WEB+DB PRESS Vol.126. 特集 Goで作って学ぶ検索エンジン
検索エンジン自作入門 ~手を動かしながら見渡す検索の舞台裏

🧑‍💻 License

MIT
