
How to use image search #67

Closed
zhenzi0322 opened this issue May 18, 2020 · 21 comments
@zhenzi0322 commented May 18, 2020

When I used image search, all documents in Elasticsearch were returned. How does the elastiknn plugin generate image feature vectors? The language I'm using is Python.

@alexklibisz (Owner)

Generating image feature vectors is up to you. You can do it a few ways:

  • Unroll the image into a vector (e.g. an image with height 28 and width 28 becomes a vector of length 784).
  • Use an algorithm/library like phash to generate a feature vector that is more robust than the raw pixel values.
  • Use a convolutional network to process the image, but extract the values at the next-to-last layer instead of the classification layer.
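A minimal sketch of the first option, unrolling raw pixels into a vector with numpy (the random array here just stands in for a real 28×28 grayscale image):

```python
import numpy as np

# Stand-in for a real 28x28 grayscale image.
image = np.random.rand(28, 28)

# Unroll (flatten) the 2D pixel grid into a 1D vector of length 28 * 28 = 784.
vec = image.flatten()
print(vec.shape)  # (784,)
```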

@alexklibisz (Owner)

I have briefly considered adding some functionality to the plugin to ingest images but there are many other things to solve first. It might be implemented as an ingest processor with a small handful of common algos for mapping images to vectors, e.g. phash, sift, a few convnets, etc..

@zhenzi0322 (Author) commented May 19, 2020


{'_index': 'long', '_type': '_doc', '_id': 'ids-1485050', '_score': 0.0625227}
{'_index': 'long', '_type': '_doc', '_id': 'ids-1485146', '_score': 0.06249257}
{'_index': 'long', '_type': '_doc', '_id': 'ids-1485177', '_score': 0.06245229}

I used VGG16 in Keras to obtain image feature vectors, and saved them in Elasticsearch version 7.4.0. However, when I query the image data, all documents are returned. How can I obtain the similar images I want from a query? How similar does the _score attribute have to be? I'm using the L1 function here.

@alexklibisz (Owner)

I guess you mean they all had roughly the same score? L1 might not be a good similarity function for those vectors. I would try L2.

@alexklibisz (Owner)

Here is an example of using L2 (on raw image pixels, not feature vectors): http://demo.elastiknn.klibisz.com/dataset/cifar-l2

You can see the exact mapping and query for each set of results by clicking on the Mapping and Query tabs.

@zhenzi0322 (Author)

So that means I should save the original image pixels in Elasticsearch instead of the feature vectors, right? How does the elastiknn library create search queries?

{
  "query" : {
    "elastiknn_nearest_neighbors" : {
      "field" : "vec",
      "vec" : {
        "index" : "cifar-l2-lsh-2",
        "id" : "15231",
        "field" : "vec"
      },
      "candidates" : 20,
      "similarity" : "l2",
      "model" : "lsh"
    }
  },
  "size" : 10,
  "_source" : true
}
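The JSON query above can be built from Python as a plain dict; the index, id, and field names here come straight from the example, so adapt them to your own data. With the official elasticsearch-py client, the dict would be passed as the request body, e.g. `es.search(index="cifar-l2-lsh-2", body=query)`:

```python
# elastiknn_nearest_neighbors query, expressed as a Python dict.
# Index/field/id values are copied from the example above.
query = {
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "vec",
            "vec": {
                "index": "cifar-l2-lsh-2",
                "id": "15231",
                "field": "vec",
            },
            "candidates": 20,
            "similarity": "l2",
            "model": "lsh",
        }
    },
    "size": 10,
    "_source": True,
}
print(query["query"]["elastiknn_nearest_neighbors"]["similarity"])  # l2
```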

@zhenzi0322 (Author)

I created the Elasticsearch index mapping with Python as follows:

from elastiknn.client import ElastiKnnClient
from elastiknn.api import Mapping


# create the index
eknn = ElastiKnnClient()
dim = 512
index = "long"
field = "long"
mapping = Mapping.DenseFloat(dims=dim) 
eknn.es.indices.refresh()
eknn.es.indices.create(index=index)
eknn.es.indices.refresh()
m = eknn.put_mapping(index, field, mapping)

print(m)  # {'acknowledged': True}

@alexklibisz (Owner)

You can use that same mapping with L1, L2, and angular. If you want to use an approximate method you'll have to modify the line mapping = Mapping.DenseFloat to another mapping. Unfortunately it looks like I forgot to add a mapping dataclass for the L2 LSH method. That would go here: https://github.com/alexklibisz/elastiknn/blob/master/client-python/elastiknn/api.py#L64 But you can also just create a dict matching the JSON and submit a PUT request. You can see how the mapping is submitted here: https://github.com/alexklibisz/elastiknn/blob/master/client-python/elastiknn/client.py#L49-L54

You can save either the original pixels or the feature vector. I was just pointing to an example where L2 seems to work well on the original pixels. Most papers I've read also use L2 on feature vectors, or they normalize the feature vectors to unit norm and use angular. I don't think I've seen L1 used for images.

For exact queries, the plugin creates a FunctionScoreQuery that scores every vector in the index against the query vector. So that's obviously not very efficient. For approximate queries it hashes the stored vectors, indexes the hashes (just like words), uses the same hash function to hash the query vector, and runs a boolean match query to lookup stored vectors which share the most hash values with the query vector. There's a lot more info here: http://elastiknn.klibisz.com/api/
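A toy sketch of that approximate-search idea, using random-hyperplane hashing. This is not the plugin's actual hash function, just an illustration of the mechanism described above: reduce each vector to a small set of hash tokens, then rank stored vectors by how many tokens they share with the query vector's tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 4))  # 8 random hyperplanes for 4-dim vectors

def hashes(v):
    # One token per hyperplane: which side of the plane the vector falls on.
    return {(i, int(np.dot(p, v) > 0)) for i, p in enumerate(planes)}

stored = {
    "a": np.array([1.0, 0.0, 0.0, 0.0]),
    "b": np.array([0.9, 0.1, 0.0, 0.0]),
    "c": np.array([-1.0, 0.0, 0.0, 0.0]),
}
q = np.array([1.0, 0.05, 0.0, 0.0])
qh = hashes(q)

# Rank stored ids by how many hash tokens they share with the query.
ranked = sorted(stored, key=lambda k: -len(qh & hashes(stored[k])))
print(ranked)
```

In the plugin, the tokens are indexed like words, so this ranking step becomes an ordinary boolean match query.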

@zhenzi0322 (Author)

Thank you. I'll try it first

@zhenzi0322 (Author)

I got the following error while creating the Elasticsearch index:

Traceback (most recent call last):
  File "D:/zhenzi/es7.4.0/main_create.py", line 14, in <module>
    m = eknn.put_mapping(index, field, mapping)
  File "D:\zhenzi\es7.4.0\elastiknn\client.py", line 56, in put_mapping
    return self.es.transport.perform_request("PUT", f"/{index}/_mapping", body=body)
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\transport.py", line 358, in perform_request
    timeout=timeout,
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 257, in perform_request
    self._raise_error(response.status, raw_data)
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\connection\base.py", line 182, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.TransportError: TransportError(500, '', 'Incompatible type [elastiknn_dense_float_vector], model [Some(lsh)], similarity [None]')

I added the following in the file (https://github.com/alexklibisz/elastiknn/blob/master/client-python/elastiknn/api.py):

@dataclass(frozen=True)
class DenseFloatLong(Base):
    dims: int

    def to_dict(self):
        return {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "model": "lsh",
                "dims": self.dims,
                "similarity": "12",
                "bands": 100,
                "rows": 1,
                "width": 3
            }
        }

@alexklibisz (Owner)

Try "similarity": "l2", not "similarity": "12".
The error isn't particularly helpful, but similarity [None] means it wasn't able to match "12" to a known similarity.

@zhenzi0322 (Author)

I removed "similarity": "12", and I still get the following error message:

Traceback (most recent call last):
  File "D:/zhenzi/es7.4.0/main_create.py", line 14, in <module>
    m = eknn.put_mapping(index, field, mapping)
  File "D:\zhenzi\es7.4.0\elastiknn\client.py", line 56, in put_mapping
    return self.es.transport.perform_request("PUT", f"/{index}/_mapping", body=body)
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\transport.py", line 358, in perform_request
    timeout=timeout,
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 257, in perform_request
    self._raise_error(response.status, raw_data)
  File "F:\py368\Envs\knn\lib\site-packages\elasticsearch\connection\base.py", line 182, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.TransportError: TransportError(500, '', 'Incompatible type [elastiknn_dense_float_vector], model [Some(lsh)], similarity [None]')

The file that creates the Elasticsearch index is as follows:

from elastiknn.client import ElastiKnnClient
from elastiknn.api import Mapping

# create the index
eknn = ElastiKnnClient()
dim = 3072
index = "test"
field = "test"

mapping = Mapping.DenseFloatLong(dims=dim)
eknn.es.indices.refresh()
eknn.es.indices.create(index=index)
eknn.es.indices.refresh()
m = eknn.put_mapping(index, field, mapping)

print(m)  # {'acknowledged': True}

api.py

@dataclass(frozen=True)
class DenseFloatLong(Base):
    dims: int

    def to_dict(self):
        return {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "model": "lsh",
                "dims": self.dims,
                # "similarity": 12,
                "bands": 100,
                "rows": 1,
                "width": 3
            }
        }

I'm using Elasticsearch version 7.4.0.

@alexklibisz (Owner)

You need to specify the similarity as l2.

    @dataclass(frozen=True)
    class DenseFloatLong(Base):
        dims: int

        def to_dict(self):
            return {
                "type": "elastiknn_dense_float_vector",
                "elastiknn": {
                    "model": "lsh",
                    "dims": self.dims,
                    "similarity": "l2",
                    "bands": 100,
                    "rows": 1,
                    "width": 3
                }
            }

The similarity field is required when using the lsh model.
It's a very subtle character difference. l2 is the lowercase of L2. 12 is the number twelve.

@zhenzi0322 (Author)

Thank you. I mixed up the number 1 and the letter l. It creates successfully now.

@zhenzi0322 (Author) commented May 19, 2020

Why does the query result differ from what I expected?

The query results are as follows:

{'_index': 'test', '_type': '_doc', '_id': 'ids-1485205', '_score': 1000000.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485185', '_score': 1.2629013}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485238', '_score': 1.2498195}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485149', '_score': 1.2451644}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485198', '_score': 1.2327285}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485219', '_score': 1.2177316}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485212', '_score': 1.1902684}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485229', '_score': 1.1901888}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485152', '_score': 1.1610229}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1488300', '_score': 0.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485209', '_score': 0.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485208', '_score': 0.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1488289', '_score': 0.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485203', '_score': 0.0}
{'_index': 'test', '_type': '_doc', '_id': 'ids-1485202', '_score': 0.0}

How are the score values computed?

All the pictures in Elasticsearch share some similarities. For example, all my pictures contain the words "children's day".

Here are my three pictures: [images attached to the original issue]

@alexklibisz (Owner)

I'm not sure what you are expecting. :)
You can read some more about the scoring method here: http://elastiknn.klibisz.com/api/#similarity-scoring
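A hedged sketch of one common convention for distance-based scoring (Elasticsearch requires non-negative scores, so distances are typically mapped into (0, 1] as 1 / (1 + distance); identical vectors score 1.0 and distant vectors approach 0). Check the linked docs for the exact formula your plugin version uses:

```python
import math

def l2_score(a, b):
    # Map an L2 (Euclidean) distance into (0, 1]: smaller distance, higher score.
    dist = math.dist(a, b)
    return 1.0 / (1.0 + dist)

print(l2_score([1.0, 2.0], [1.0, 2.0]))  # 1.0 (identical vectors)
print(l2_score([0.0, 0.0], [3.0, 4.0]))  # distance 5.0 -> score 1/6
```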

@alexklibisz (Owner)

@yu258 I added the missing mappings and queries in this PR #68
Docs are here: http://elastiknn.klibisz.com/python-client/

@alexklibisz (Owner)

One thing to consider when doing image search with L2 is that the floating point operations might overflow if your vectors have large values. You might try scaling your vector values so they are between 0 and 1.
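A minimal sketch of that scaling suggestion: min-max scale each feature vector into [0, 1] with numpy before indexing it (the helper name is just illustrative):

```python
import numpy as np

def scale_unit_interval(v):
    # Min-max scale a vector so its values lie in [0, 1].
    v = np.asarray(v, dtype=float)
    lo, hi = v.min(), v.max()
    if hi == lo:  # constant vector: avoid division by zero
        return np.zeros_like(v)
    return (v - lo) / (hi - lo)

print(scale_unit_interval([0.0, 5.0, 10.0]))  # [0.  0.5 1. ]
```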

@zhenzi0322 (Author)

What's a good Python library for generating image feature vectors? Currently my feature vectors are in a format like [0.0, 0.2, ...].

@alexklibisz (Owner)

I've always used the pretrained models from Keras: https://keras.io/api/applications/
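For example, VGG16 with global average pooling produces a 512-dimensional feature vector per image, matching the dim = 512 used in the mapping earlier in this thread. A sketch (weights="imagenet" downloads pretrained weights on first use; the random array stands in for a real RGB image):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# include_top=False drops the classification layers; pooling="avg" global-average-
# pools the last conv block into a single 512-dim vector per image.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

image = np.random.rand(1, 224, 224, 3) * 255  # stand-in for a real RGB image batch
features = model.predict(preprocess_input(image))
print(features.shape)  # (1, 512)
```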

@alexklibisz (Owner)

Closing this. Let me know if there are any other questions and we can open it again if needed.
