## Build and package

In [None]:
!sh init.sh

## Upload the wheel to PyPI


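Possible errors:
1. The upload fails with: "Binary wheel 'pykgraph-0.0.1-cp310-cp310-linux_x86_64.whl' has an unsupported platform tag 'linux_x86_64'."
    Fix: PyPI rejects the generic `linux_x86_64` tag, so rebuild the wheel inside a manylinux Docker image and upload that wheel instead.
    ```Dockerfile
    FROM quay.io/pypa/manylinux2014_x86_64

    # Install the dependencies needed for the build
    RUN yum install -y <your-dependencies>

    # Set the working directory
    WORKDIR /app

    # Copy the project files into the container
    COPY . /app

    # Build the wheel with the image's CPython 3.10 (matches the cp310 tag above),
    # then let auditwheel retag it as manylinux; upload the file from wheelhouse/
    RUN /opt/python/cp310-cp310/bin/python setup.py bdist_wheel \
        && auditwheel repair dist/*.whl -w wheelhouse/
    ```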

In [5]:
!pip install twine

# After installation, upload the wheel to PyPI with:
# twine upload dist/*
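
The "unsupported platform tag" error described earlier is determined by the last component of the wheel filename. As a rough pre-upload check, the sketch below extracts that component and compares it against a short list of prefixes. The prefix list and both function names are assumptions made for illustration; PyPI's real rules are more detailed, so a pass here is only a hint.

```python
# Rough approximation of the platform-tag families PyPI accepts (an assumption,
# not the official rule set).
ACCEPTED_PREFIXES = ("manylinux", "musllinux", "macosx", "win")


def platform_tags(wheel_filename):
    """Return the platform tags encoded in a wheel filename."""
    stem = wheel_filename[:-4] if wheel_filename.endswith(".whl") else wheel_filename
    # Layout: name-version[-build]-python-abi-platform; several platform tags
    # can be joined with dots.
    return stem.split("-")[-1].split(".")


def looks_uploadable(wheel_filename):
    """True if every platform tag is 'any' or starts with an accepted prefix."""
    return all(
        tag == "any" or tag.startswith(ACCEPTED_PREFIXES)
        for tag in platform_tags(wheel_filename)
    )


print(looks_uploadable("pykgraph-0.0.1-cp310-cp310-linux_x86_64.whl"))  # False
```

Running this over the files in `dist/` before `twine upload` catches the generic `linux_x86_64` tag locally instead of at upload time.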



## Test the kgraph Python API

In [1]:
from numpy import random
import kgraph

dataset = random.rand(100000, 16)
query = random.rand(1000, 16)

index = kgraph.KGraph(dataset, 'euclidean')  # another option is 'angular'
index.build(reverse=-1)
index.save("index_file.kgraph")
# load with index.load("index_file.kgraph")

# knn = index.search(query, K=10)                       # this uses all CPU threads
# knn = index.search(query, K=10, threads=1)            # one thread, slower
# knn = index.search(query, K=1000, P=100)              # search for 1000-nn, no need to recompute index.

Generating control...
Initializing...
iteration: 1 recall: 0.0032 accuracy: 1.54893 cost: 0.00380004 M: 10 delta: 1 time: 1.1136 one-recall: 0 one-ratio: 2.47247
iteration: 2 recall: 0.0276 accuracy: 0.911731 cost: 0.00596146 M: 10 delta: 0.767592 time: 1.76081 one-recall: 0.03 one-ratio: 1.80445
iteration: 3 recall: 0.1264 accuracy: 0.44105 cost: 0.00983024 M: 12.5905 delta: 0.763926 time: 2.54798 one-recall: 0.18 one-ratio: 1.35835
iteration: 4 recall: 0.41 accuracy: 0.164649 cost: 0.0139892 M: 12.8554 delta: 0.679023 time: 3.27701 one-recall: 0.41 one-ratio: 1.13389
iteration: 5 recall: 0.7292 accuracy: 0.0434543 cost: 0.0184702 M: 14.5543 delta: 0.493892 time: 4.00186 one-recall: 0.84 one-ratio: 1.02509
iteration: 6 recall: 0.924 accuracy: 0.00736505 cost: 0.024376 M: 20.1104 delta: 0.236886 time: 4.79448 one-recall: 0.98 one-ratio: 1.00175
iteration: 7 recall: 0.9772 accuracy: 0.00151333 cost: 0.0319479 M: 27.8816 delta: 0.0731787 time: 5.67472 one-recall: 1 one-ratio: 1
iteration
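
The commented-out `index.search` calls return approximate neighbors. One way to sanity-check them is to compare against exact neighbors computed by brute force. The sketch below is not part of kgraph: `brute_force_knn` and `recall_at_k` are hypothetical helpers, and the comparison assumes the search result is an integer array of neighbor ids with shape `(n_queries, K)`.

```python
import numpy as np


def brute_force_knn(dataset, query, k):
    """Exact k nearest neighbors under Euclidean distance, closest first."""
    # Squared distances via ||q||^2 - 2 q.x + ||x||^2, shape (n_query, n_data)
    d2 = (
        (query ** 2).sum(axis=1)[:, None]
        - 2.0 * query @ dataset.T
        + (dataset ** 2).sum(axis=1)[None, :]
    )
    # Select the k smallest per row, then order those k by distance
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)


def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbors that also appear in the approximate result."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size


rng = np.random.default_rng(0)
data = rng.random((2000, 16))
queries = rng.random((50, 16))
exact = brute_force_knn(data, queries, k=10)
# Stand-in for `knn = index.search(queries, K=10)`: comparing the exact ids
# with themselves trivially gives a recall of 1.0.
print(recall_at_k(exact, exact))
```

The distance matrix holds `n_queries * n_data` values, about 800 MB in float64 for 1000 queries against 100000 points, so evaluate on a sample of queries or in batches at that size.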

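## Build indexes for the datasets

For each dataset, build kgraph indexes at three scales: the first 10K vectors, the first 1M vectors, and the full dataset.

Dataset overview:

| Dataset | Vectors | Dim | Test set size |
|---------|---------|-----|---------------|
| glove   | 50K     | 50  | 10000         |
| deep1B  | 10M     |     |               |
| sift1M  | 1M      | 128 | 1000000       |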


In [4]:
# Download the datasets
# !git clone https://huggingface.co/datasets/qbo-odp/deep1B
!git clone https://huggingface.co/datasets/qbo-odp/sift1m

Cloning into 'sift1m'...


remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 14 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (14/14), 2.03 KiB | 1.01 MiB/s, done.
Filtering content: 100% (4/4), 550.08 MiB | 28.06 MiB/s, done.


### glove

```python
import os
from gensim.models import KeyedVectors
# DATA_PATH: directory holding the dataset files (not defined in this notebook)
model = KeyedVectors.load_word2vec_format(os.path.join(DATA_PATH, 'glove_50k_50.txt'))
```
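
kgraph takes a plain NumPy array, so the embeddings can also be read without gensim. The sketch below assumes the file is plain text with one word followed by its space-separated components per line, optionally preceded by a word2vec-style `<count> <dim>` header. `load_glove_txt` is a hypothetical helper, and the header test is only a heuristic.

```python
import numpy as np


def load_glove_txt(path):
    """Parse a text embedding file into (words, float32 matrix)."""
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if not line.strip():
                continue  # ignore blank lines
            parts = line.rstrip().split(" ")
            # Heuristic: a first line with exactly two fields is a
            # "<count> <dim>" header, not a vector
            if line_no == 0 and len(parts) == 2:
                continue
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    return words, np.asarray(rows, dtype=np.float32)
```

The returned matrix has shape `(n_words, dim)` and can be passed to `kgraph.KGraph` in the same way as the random data in the API test.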

### deep1B

In [1]:
import numpy as np


def read_fbin(filename, start_idx=0, chunk_size=None):
    """Read a *.fbin file that contains float32 vectors.

    Args:
        filename (str): path to the *.fbin file
        start_idx (int): index of the first vector to read
        chunk_size (int): number of vectors to read;
                          if None, read all vectors from start_idx onward
    Returns:
        numpy.ndarray of float32 vectors, shape (nvecs, dim)
    """
    with open(filename, "rb") as f:
        nvecs, dim = np.fromfile(f, count=2, dtype=np.int32)
        nvecs = (nvecs - start_idx) if chunk_size is None else chunk_size
        arr = np.fromfile(f, count=nvecs * dim, dtype=np.float32,
                          offset=start_idx * 4 * dim)
    return arr.reshape(nvecs, dim)


# Read the data
# chunk_size = 50000
# dataset = read_fbin('deep1B/base.10M.fbin', chunk_size=chunk_size)
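
`read_fbin` relies on a simple layout: an 8-byte header holding two int32 values (vector count and dimension), followed by the vectors as row-major float32. The sketch below writes a small file in that layout and reads a chunk back, which is a cheap way to test the offset arithmetic without downloading deep1B. `write_fbin` and `read_fbin_chunk` are hypothetical helpers written for this illustration.

```python
import numpy as np


def write_fbin(filename, vectors):
    """Write vectors as: int32 count, int32 dim, then row-major float32 data."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    with open(filename, "wb") as f:
        np.array(vectors.shape, dtype=np.int32).tofile(f)
        vectors.tofile(f)


def read_fbin_chunk(filename, start_idx, chunk_size):
    """Read chunk_size vectors starting at vector start_idx."""
    nvecs, dim = (int(x) for x in np.fromfile(filename, dtype=np.int32, count=2))
    # Skip the 8-byte header plus start_idx vectors of dim 4-byte floats
    arr = np.fromfile(filename, dtype=np.float32, count=chunk_size * dim,
                      offset=8 + start_idx * 4 * dim)
    return arr.reshape(chunk_size, dim)
```

For example, after `write_fbin("tiny.fbin", np.arange(12).reshape(4, 3))`, the call `read_fbin_chunk("tiny.fbin", 1, 2)` returns rows 1 and 2 of the original array.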

### sift1M

In [2]:
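def ivecs_read(fname):
    """Read a .ivecs file: each record is an int32 dimension d followed by d int32 values."""
    a = np.fromfile(fname, dtype='int32')
    d = a[0]
    # Drop the leading dimension column from every record
    return a.reshape(-1, d + 1)[:, 1:].copy()


def fvecs_read(fname):
    """Read a .fvecs file by reinterpreting each 4-byte value as float32."""
    return ivecs_read(fname).view('float32')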

# Read the data
# sift = fvecs_read(os.path.join(DATA_PATH, 'sift_base.fvecs'))
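
The readers above work because every `.fvecs` record is an int32 dimension followed by that many float32 values, so a record always spans `d + 1` four-byte words. A round trip through a small file makes this concrete. `fvecs_write` is a hypothetical helper written for this illustration, and `fvecs_read` is restated in one function so the block runs on its own.

```python
import numpy as np


def fvecs_write(fname, vectors):
    """Write float32 vectors as records of [int32 dim, dim float32 values]."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    n, d = vectors.shape
    out = np.empty((n, d + 1), dtype=np.int32)
    out[:, 0] = d
    # Reinterpret the float bits as int32; no numeric conversion happens
    out[:, 1:] = vectors.view(np.int32)
    out.tofile(fname)


def fvecs_read(fname):
    """Same logic as the notebook's reader, folded into one function."""
    a = np.fromfile(fname, dtype=np.int32)
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)
```

Writing a random `(5, 128)` array with `fvecs_write` and reading it back with `fvecs_read` should reproduce the array exactly, since only the bit pattern is copied.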

### Build the indexes

In [3]:
import numpy as np
import kgraph

# Index scales; -1 means the full dataset
scale = [10_000, 1_000_000, -1]
# data = ['glove', 'deep1b', 'sift1m']
data = ['sift1m']

for ins in data:
    for sca in scale:
        if 'sift1m' == ins:
            dataset = fvecs_read('sift1m/sift_base.fvecs')
        elif 'deep1b' == ins:
            dataset = read_fbin('deep1B/base.10M.fbin', chunk_size=10000)

        if sca != -1:
            dataset = dataset[:sca, :]
        print(f"dataset info: shape={dataset.shape}")
        index = kgraph.KGraph(dataset, 'euclidean')  # another option is 'angular'
        index.build(reverse=-1)
        # Use %s for the scale label: it is the string 'all' for the full dataset
        index.save("index/%s-%s.idx" % ('all' if sca == -1 else sca, ins))

dataset info: shape=(10000, 128)


Generating control...
Initializing...
iteration: 1 recall: 0.0348 accuracy: 0.982831 cost: 0.0380038 M: 10 delta: 1 time: 0.106101 one-recall: 0.01 one-ratio: 2.33575
iteration: 2 recall: 0.2876 accuracy: 0.29036 cost: 0.0641303 M: 10 delta: 0.839039 time: 0.188901 one-recall: 0.37 one-ratio: 1.49633
iteration: 3 recall: 0.7856 accuracy: 0.0317521 cost: 0.10938 M: 11.7611 delta: 0.660866 time: 0.281586 one-recall: 0.85 one-ratio: 1.09834
iteration: 4 recall: 0.9576 accuracy: 0.00237081 cost: 0.171997 M: 15.1766 delta: 0.329421 time: 0.393512 one-recall: 1 one-ratio: 1
iteration: 5 recall: 0.9876 accuracy: 0.000566841 cost: 0.244029 M: 21.4741 delta: 0.123291 time: 0.493915 one-recall: 1 one-ratio: 1
iteration: 6 recall: 0.9928 accuracy: 0.000352397 cost: 0.279635 M: 23.8449 delta: 0.0947159 time: 0.536759 one-recall: 1 one-ratio: 1
Graph completion with reverse edges...
Reranking edges...


dataset info: shape=(1000000, 128)


Generating control...
Initializing...
iteration: 1 recall: 0.0016 accuracy: 1.68825 cost: 0.00038 M: 10 delta: 1 time: 13.5139 one-recall: 0 one-ratio: 2.90427
iteration: 2 recall: 0.004 accuracy: 0.987917 cost: 0.000637314 M: 10 delta: 0.855938 time: 22.7597 one-recall: 0.01 one-ratio: 2.4079
iteration: 3 recall: 0.0296 accuracy: 0.557211 cost: 0.00109502 M: 11.5298 delta: 0.835078 time: 33.6624 one-recall: 0.02 one-ratio: 1.97366
iteration: 4 recall: 0.156 accuracy: 0.290958 cost: 0.00163026 M: 11.8382 delta: 0.783594 time: 44.7212 one-recall: 0.23 one-ratio: 1.61438
iteration: 5 recall: 0.446 accuracy: 0.11768 cost: 0.00223542 M: 12.5962 delta: 0.664658 time: 55.6918 one-recall: 0.49 one-ratio: 1.31
iteration: 6 recall: 0.7172 accuracy: 0.035269 cost: 0.00297849 M: 15.1132 delta: 0.432613 time: 67.6699 one-recall: 0.84 one-ratio: 1.07438
iteration: 7 recall: 0.862 accuracy: 0.0114052 cost: 0.00395442 M: 21.1329 delta: 0.196602 time: 83.7692 one-recall: 0.94 one-ratio: 1.01487
iterat

dataset info: shape=(1000000, 128)


Generating control...
Initializing...
iteration: 1 recall: 0 accuracy: 2.00847 cost: 0.00038 M: 10 delta: 1 time: 18.4602 one-recall: 0 one-ratio: 3.78706
iteration: 2 recall: 0.0016 accuracy: 1.20006 cost: 0.000637314 M: 10 delta: 0.855938 time: 32.3354 one-recall: 0 one-ratio: 3.01121
iteration: 3 recall: 0.0388 accuracy: 0.701686 cost: 0.00109502 M: 11.5298 delta: 0.835085 time: 45.8561 one-recall: 0.03 one-ratio: 2.35646
iteration: 4 recall: 0.17 accuracy: 0.37186 cost: 0.00163027 M: 11.8382 delta: 0.783593 time: 59.83 one-recall: 0.24 one-ratio: 1.84663
iteration: 5 recall: 0.492 accuracy: 0.122261 cost: 0.00223552 M: 12.5964 delta: 0.664666 time: 71.4503 one-recall: 0.65 one-ratio: 1.29555
iteration: 6 recall: 0.7608 accuracy: 0.0304291 cost: 0.0029785 M: 15.1123 delta: 0.432582 time: 85.864 one-recall: 0.87 one-ratio: 1.06974
iteration: 7 recall: 0.8952 accuracy: 0.0111428 cost: 0.00395428 M: 21.1326 delta: 0.196567 time: 99.7804 one-recall: 0.94 one-ratio: 1.03465
iteration: 8 
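
The build log prints one `key: value` line per iteration, which makes it easy to compare convergence across scales. The sketch below turns such lines into dictionaries. The parser is a hypothetical helper based only on the line format visible in the output above; it skips truncated or non-iteration lines rather than guessing their contents.

```python
import re

# One "key: number" pair, e.g. "recall: 0.0348" or "one-ratio: 2.33575"
_PAIR = re.compile(r"([\w-]+):\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_iteration_line(line):
    """Return the metrics of one iteration line as floats, or None."""
    if not line.startswith("iteration:"):
        return None
    record = {key: float(value) for key, value in _PAIR.findall(line)}
    # Lines cut off before the recall field are treated as unusable
    return record if "recall" in record else None


def parse_build_log(text):
    """Parse every complete iteration line in a captured log."""
    return [r for r in map(parse_iteration_line, text.splitlines()) if r is not None]


log = """Initializing...
iteration: 4 recall: 0.9576 accuracy: 0.00237081 cost: 0.171997 M: 15.1766 delta: 0.329421 time: 0.393512 one-recall: 1 one-ratio: 1
iterat"""
for rec in parse_build_log(log):
    print(rec["iteration"], rec["recall"], rec["time"])
```

With the records in hand, recall against elapsed time can be tabulated or plotted for the 10K and 1M builds side by side.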

In [5]:
!ls -alh index

total 1.4G
drwxrwxrwx+  2 codespace codespace 4.0K Aug 25 09:17 .
drwxrwxrwx+ 15 codespace root      4.0K Aug 25 09:14 ..
-rw-rw-rw-   1 codespace codespace 4.4M Aug 25 09:10 10000-sift1m.idx
-rw-rw-rw-   1 codespace codespace 679M Aug 25 09:13 1000000-sift1m.idx
-rw-rw-rw-   1 codespace codespace 679M Aug 25 09:15 all-sift1m.idx
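
In the run above, the 1,000,000-vector index and the full index were both built from the same `(1000000, 128)` array, and the listing shows two files of equal size (679M). A small helper can collapse scales that resolve to the same number of vectors before any index is built. `effective_scales` is a hypothetical function sketched for this purpose, and labelling a scale `'all'` whenever it covers the whole dataset is a design choice made here, not something the notebook prescribes.

```python
def effective_scales(n_vectors, scales):
    """Map requested scales to unique (label, count) pairs for one dataset.

    A scale of -1, or any scale at least as large as the dataset, means
    "all vectors". Duplicates are dropped, keeping the first occurrence.
    """
    seen = set()
    result = []
    for s in scales:
        count = n_vectors if s == -1 or s >= n_vectors else s
        if count in seen:
            continue
        seen.add(count)
        label = "all" if count == n_vectors else str(count)
        result.append((label, count))
    return result


print(effective_scales(1_000_000, [10_000, 1_000_000, -1]))
# [('10000', 10000), ('all', 1000000)]
```

For sift1m this builds two indexes instead of three, and for a 50K-vector dataset the 1M scale collapses into `'all'` as well.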
