Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for artificial documents #7530

Closed
wants to merge 8 commits into from

Conversation

Projects
None yet
3 participants
@alexksikes
Copy link
Contributor

alexksikes commented Sep 1, 2014

This adds the ability to the Term Vector API to generate term vectors for
artifical documents, that is for documents not present in the index. Following
a similar syntax to the Percolator API, a new 'doc' parameter is used, instead
of '_id', that specifies the document of interest. The parameters '_index' and
'_type' determine the mapping and therefore analyzers to apply to each value
field.

Term Vectors: Support for artificial documents
This adds the ability to the Term Vector API to generate term vectors for
artifical documents, that is for documents not present in the index. Following
a similar syntax to the Percolator API, a new 'doc' parameter is used, instead
of '_id', that specifies the document of interest. The parameters '_index' and
'_type' determine the mapping and therefore analyzers to apply to each value
field.
numbers have no meaning in this context.
numbers have no meaning in this context. By default, when requesting
term vectors of artificial documents, a shard to get the statistics from
is randomly selected. Use `routing` only to hit a particular shard.

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Doesn't the term vectors API currently return statistics that are aggregated across all shards? Documentation suggests so?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

Nope it does not.

"The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in."

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

ok

@@ -303,6 +329,9 @@ public void readFrom(StreamInput in) throws IOException {
}
type = in.readString();
id = in.readString();
if (in.readBoolean()) {

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

I think you need to check the stream version?


public void setArtificial(boolean artificial) {
this.artificial = artificial;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

I don't think this method should be public?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

I think it should be, just like isExists() and setExists, no?

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Hmm, I had not noticed the other fields had the same issue. I would rather like these setters to not exist on TermVectorsResponse like GetResponse. Maybe we can just worry about setArtificial here and remove the other setters in a different PR?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

Shouldn't we have something similar to a GetResult for term vectors? In this case it'd require a bit more work, and all of it should go to a different PR?

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 4, 2014

Contributor

ok

* @param name the name of the field
* @return a <code>String[]</code> of field values
*/
public final String[] getValues(String name) {

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Can you call it getStringValues to make clear it doesn't care about other types?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

It would be nicer but it is called getValues in the original Lucene Document class, and since this is a fork, perhaps we should be keeping the same naming?

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Good point, let's keep it consistent


if (result.size() == 0) {
return new String[0];
}

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

This looks unnecessary? (already covered by the conversion from a list to an array?)

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

Yep! It was just cut and paste of that, but I agree.

private String routing;

protected String preference;

private static AtomicInteger randomInt = new AtomicInteger(0);

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Can you make it final?

request.doc(xContent);
return this;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

Can you add javadocs since there are user-facing APIs?

// skip fields not in the mapping
if (topLevelFields != null && topLevelFields.terms(field.name()) == null) {
continue;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

This looks unnecessary? If a field is not in the mapping then isValidField(fieldMapper) will return false since fieldMapper is null? Additionally, It seems to me that this check will create bugs if you have shards that have some indexed documents (so that topLevelFields is not null) but don't contain some of the fields that exist in the mappings?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

I think the issue is about dynamic mappings. Do we allow it or not? Right now fieldMapper could be not null because the mappings are dynamically created when parsing the document ...

return MultiFields.getFields(index.createSearcher().getIndexReader());
}

private Fields generateTermVectorsFromDoc(TermVectorRequest request, Fields topLevelFields) throws IOException {
DocumentMapper docMapper = indexShard.mapperService().documentMapper(request.type());
ParseContext.Document doc = docMapper.parse(source(request.doc()).type(request.type()).flyweight(true)).rootDoc();

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

I think it can be confusing if it creates dynamic fields?

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 3, 2014

Contributor

I just looked at how the percolator deals with it, and it also creates dynamic mappings in such a case. However, it makes sure that they are replicated to the master cluster state, see PercolatorService.parseRequest. I think this should do the same?

This comment has been minimized.

Copy link
@alexksikes

alexksikes Sep 3, 2014

Author Contributor

Well this is exactly what I tried to avoid, that is to create these mappings. Do we really want to allow the user to create new mappings when providing a document where the fields are not in the original mapping? Those fields could have been provided by mistake. I wanted to just ignore these fields, and not dynamically create a mapping for them, but there does not seem to be an easy way to do so, or is there?

@jpountz

This comment has been minimized.

Copy link
Contributor

jpountz commented Sep 3, 2014

@alexksikes I left some comments

@jpountz jpountz removed the review label Sep 3, 2014

@alexksikes

This comment has been minimized.

Copy link
Contributor Author

alexksikes commented Sep 3, 2014

@jpountz Thanks for comments. We should decide on allowing dynamic mappings or not, and if not what would be the easiest way to implement it? I'd be in favor of disabling dynamic mapping and only returning the TVs from the fields found in the original mapping. That because there is just too much room for mistakes and unintended behaviors. Maybe @clintongormley has some ideas?

that is for documents not present in the index. The syntax is similar to the
<<search-percolate,percolator>> API. For example, the following request would
return the same results as in example 1. The mapping used is determined by the
`index` and `type`.

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 4, 2014

Contributor

Can you leave a note about the fact that it can introduce new mappings?

@@ -303,6 +329,12 @@ public void readFrom(StreamInput in) throws IOException {
}
type = in.readString();
id = in.readString();

if (in.getVersion().after(Version.V_1_4_0)) {

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 4, 2014

Contributor

it should be onOrAfter instead of after?

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 4, 2014

Contributor

Should'nt it have made the tests fail?

@@ -331,6 +363,13 @@ public void writeTo(StreamOutput out) throws IOException {
}
out.writeString(type);
out.writeString(id);

if (out.getVersion().after(Version.V_1_4_0)) {

This comment has been minimized.

Copy link
@jpountz

jpountz Sep 4, 2014

Contributor

onOrAfter too here?

@jpountz

This comment has been minimized.

Copy link
Contributor

jpountz commented Sep 4, 2014

I left some comments but I think it is close

@alexksikes alexksikes added the review label Sep 4, 2014

@jpountz

This comment has been minimized.

Copy link
Contributor

jpountz commented Sep 4, 2014

LGTM

@jpountz jpountz removed the review label Sep 4, 2014

@alexksikes alexksikes closed this in 07d741c Sep 5, 2014

@alexksikes alexksikes deleted the alexksikes:feature/termvector-docs branch Sep 5, 2014

alexksikes added a commit that referenced this pull request Sep 5, 2014

Term Vectors: Support for artificial documents
This adds the ability to the Term Vector API to generate term vectors for
artifical documents, that is for documents not present in the index. Following
a similar syntax to the Percolator API, a new 'doc' parameter is used, instead
of '_id', that specifies the document of interest. The parameters '_index' and
'_type' determine the mapping and therefore analyzers to apply to each value
field.

Closes #7530

alexksikes added a commit that referenced this pull request Sep 8, 2014

Term Vectors: Support for artificial documents
This adds the ability to the Term Vector API to generate term vectors for
artifical documents, that is for documents not present in the index. Following
a similar syntax to the Percolator API, a new 'doc' parameter is used, instead
of '_id', that specifies the document of interest. The parameters '_index' and
'_type' determine the mapping and therefore analyzers to apply to each value
field.

Closes #7530

Mpdreamz added a commit that referenced this pull request Dec 12, 2014

@clintongormley clintongormley changed the title Term Vectors: Support for artificial documents Support for artificial documents Jun 6, 2015

@clintongormley clintongormley removed the >feature label Jun 7, 2015

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.