Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/full text search #1136

Merged
merged 13 commits into from
Feb 20, 2017
Merged

Feature/full text search #1136

merged 13 commits into from
Feb 20, 2017

Conversation

rslota
Copy link
Contributor

@rslota rslota commented Dec 28, 2016

This PR introduces MVP for MAM full text search. Currently it works only with ODBC and Riak backends and its very limited when comes to text matching. This is NOT an extension to the standard. Current MAM XEP-0313 allows for queries using "forms" that may contain any server-defined fields, therefore simple field named 'full-text-search' was introduced in order to provide this feature.

The following stanza shows a example MAM stanza that uses full text search:

<iq type='set' id='query1'>
  <query xmlns='urn:xmpp:mam:1'>
    <x xmlns='jabber:x:data' type='submit'>
      <field type='hidden' var='FORM_TYPE'>
        <value>urn:xmpp:mam:1</value>
      </field>
      <field type='text-single' var='full-text-search'>
        <value>cat and dog</value>
      </field>
    </x>
  </query>
</iq>

Full Text Search is an optional server-side feature. To find out whether its supported, client may query server about active form fields for MAM stanza:

<iq type='get' id='form1'>
  <query xmlns='urn:xmpp:mam:1'/>
</iq>

The response shows which form fields may be used to filter MAM messages (that includes required fields i.e. with, start and end):

<iq type='result' id='form1'>
  <query xmlns='urn:xmpp:mam:1'>
    <x xmlns='jabber:x:data' type='form'>
      <field type='hidden' var='FORM_TYPE'>
        <value>urn:xmpp:mam:1</value>
      </field>
      <field type='jid-single' var='with'/>
      <field type='text-single' var='start'/>
      <field type='text-single' var='end'/>
      <field type='text-single' var='full-text-search'/>
    </x>
  </query>
</iq>

While reading PR you may want to skip "Make Elvis happy" commit to get rid of code quality changes.

@rslota rslota force-pushed the feature/free-text-search branch 5 times, most recently from c15389d to 251d09c Compare December 29, 2016 14:51
Copy link
Contributor

@michalwski michalwski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good job, thanks! I really appreciate you also added the search feature to Riak's backend. Here are my general comments also:

  • The PR description is not up-to-date (doesn't say about Riak backend)
  • It would be very valuable if you could add an example stanza in the PR desc so it's easy for client devs to see and understand how to use this feature.
  • It would be perfect if the search feature were optional. There maybe some installations where the packet format is in fact sth encrypted (that's why the packet's format is configurable). In such situation we don't want to keep the body in plain text. Also for setups where the search feature is not needed, disabling it will safe some storage space.

_Start, _End, _Now, _WithJID, <<_SearchText/binary>>,
_PageSize, _LimitPassed, _MaxResultLimit,
_IsSimple) ->
omg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really interesting return value :) however I think {error, omg} is better and can still be improved ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, sorry, that was a placeholder :)

_Start, _End, _Now, _WithJID, <<_SearchText/binary>>,
_PageSize, _LimitPassed, _MaxResultLimit,
_IsSimple) ->
{error, 'not-supported'};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks far more better then omg from prev comment :)

@@ -133,7 +143,7 @@ stop_pm(Host) ->
%% ----------------------------------------------------------------------
%% Add hooks for mod_mam_muc

-spec start_muc(ejabberd:server(),_) -> 'ok'.
-spec start_muc(ejabberd:server(), _) -> 'ok'.
start_muc(Host, _Opts) ->
case gen_mod:get_module_opt(Host, ?MODULE, no_writer, false) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not needed here, am I right?

@@ -141,13 +149,13 @@ microseconds_to_now(MicroSeconds) when is_integer(MicroSeconds) ->

%% @doc Returns time in `now()' format.
-spec iso8601_datetime_binary_to_timestamp(iso8601_datetime_binary())
-> erlang:timestamp() | undefined.
-> erlang:timestamp() | undefined.
iso8601_datetime_binary_to_timestamp(DateTime) when is_binary(DateTime) ->
jlib:datetime_binary_to_timestamp(DateTime).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A word (or more) of an explanation will makes things easier to understand. A simple example may also help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this refers to, considering that i didn't even touch this function ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your IDE seems to have touched it.

replace_x_user_element(FromJID, Role, Affiliation, Packet) ->
append_x_user_element(FromJID, Role, Affiliation,
delete_x_user_element(Packet)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's clever :) I like it!

elvis.config Outdated
@@ -27,7 +27,7 @@
%% Allow usage of ?assertMatch and similar macros, since after expanding they
%% result in use of variables like __X and __V that wouldn't be permitted by
%% default regex.
{elvis_style, variable_naming_convention, #{regex => "^(_?[A-Z][0-9a-zA-Z_]*)$"}}
{elvis_style, variable_naming_convention, #{regex => "^(_?[A-Z][0-9a-zA-Z]*)$"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do need _ in this regexp. Otherwise Elvis may complain about macros like ?A_MACRO_WITH_WORDS, am I right? We have such macros in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna revert it. That was an merge issue.

@rslota rslota changed the title Feature/free text search Feature/full text search Jan 9, 2017
@rslota rslota force-pushed the feature/free-text-search branch 2 times, most recently from 8643221 to 7548e5f Compare January 10, 2017 08:10
@@ -228,29 +239,32 @@ do_archive_message(_Result, Host, MessID, UserID,
SDir = encode_direction(Dir),
SRemLResource = ejabberd_odbc:escape(RemLResource),
Data = packet_to_stored_binary(Packet),
TextBody = packet_to_search_body(Host, Packet),
STextBody = ejabberd_odbc:escape(TextBody),
string:to_lower(STextBody),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not needed here, am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's redundant. I'm gonna remove it.

-spec normalize_search_text(binary() | string() | undefined, string()) -> string() | undefined.
normalize_search_text(undefined, _WordSeparator) ->
undefined;
normalize_search_text(Text, WordSeparator) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A word (or more) of an explanation will makes things easier to understand. A simple example may also help.

true ->
false %% full page but not the last one in the result set
end;
TotalCount =:= PagedCount; %% false means full page but not the last one in the result set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's clever :) I like it!

@@ -917,6 +978,153 @@ simple_archive_request(Config) ->
end,
escalus_fresh:story(Config, [{alice, 1}, {bob, 1}], F).

text_search_is_not_available(Config) ->

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to Elvis:

The code in the following (LINE, COL) locations has the same structure: (981, 1), (1008, 1).

@rslota rslota force-pushed the feature/free-text-search branch 3 times, most recently from 509a6b4 to 89c8cea Compare January 19, 2017 17:43
Copy link
Contributor

@kzemek kzemek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still documentation missing, with focus on what happens when full-text search is turned on (and a note that turning it on won't make old messages searchable).

ArchiveID :: mod_mam:archive_id(), ArchiveJID :: ejabberd:jid())
-> integer().
ArchiveID :: mod_mam:archive_id(), ArchiveJID :: ejabberd:jid())
-> integer().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the formatter doesn't like lines beginning with an arrow.

-> purge_single_message_result().
MessID :: mod_mam:message_id(), ArchiveID :: mod_mam:archive_id(),
ArchiveJID :: ejabberd:jid(), Now :: mod_mam:unix_timestamp())
-> purge_single_message_result().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more mis-formatted arrow.

@@ -186,7 +186,7 @@ send_message(From, To, Body) ->
ejabberd_hooks:run(user_send_packet,
F#jid.lserver,
[F, T, Packet]),
% privacy check is missing, but is it needed?
% privacy check is missing, but is it needed?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misformatted comment.

@@ -65,7 +67,8 @@

-import(mod_mam_utils,
[maybe_microseconds/1,
microseconds_to_now/1]).
microseconds_to_now/1,
normalize_search_text/2]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would make sense to avoid importing new things into this module and call them fully qualified with mod_mam_utils:? The end goal would be to eliminate those imports alltogether.

@@ -112,7 +116,8 @@
-include_lib("ejabberd/include/amp.hrl").
-include_lib("exml/include/exml.hrl").

%% ----------------------------------------------------------------------
%% ---------------------------------------------------
%% -------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened :<

children = [ResultSetEl]}.
name = <<"fin">>,
attrs = [{<<"xmlns">>, MamNs}]
++ [{<<"complete">>, <<"true">>} || IsComplete]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like formatting in this case makes the code less readable.

children = [ResultSetEl]}.
name = <<"fin">>,
attrs = [{<<"xmlns">>, MamNs}]
++ [{<<"complete">>, <<"true">>} || IsComplete]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like formatting in this case makes the code less readable.

[form_type_field(MamNs),
form_field(<<"jid-single">>, <<"with">>),
form_field(<<"text-single">>, <<"start">>),
form_field(<<"text-single">>, <<"end">>)].
form_field(<<"text-single">>, <<"end">>)] ++ TextSearch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| TextSearch]? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the same :) I want lists concatenation, not insertion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not the same in general, but isn't it more idiomatic (and marginally faster) in this case?

LowerBody = string:to_lower(BodyString),
ReOpts = [{return, list}, global, unicode, ucp],
Re0 = re:replace(LowerBody, "[, .:;-?!]+", " ", ReOpts),
Re1 = re:replace(Re0, "([^\\w\\d ]+)|(^\\s+)|(\\s+$)", "", ReOpts),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \d is redundant here as \w group already includes digits.

-spec normalize_search_text(binary() | string() | undefined, string()) -> string() | undefined.
normalize_search_text(undefined, _WordSeparator) ->
undefined;
normalize_search_text(Text, WordSeparator) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has to do some heavy lifting on texts, have you measured the performance impact of ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It adds like 100 μs to processing time per row on my slow CPU. Simple string:to_lower is only 10 times faster then this whole function. Also, I've checked that precompiling regexes only speed up this function by ~5%. Removing only one of those 3 regexes could give also no more then 5-10%, but I'm not sure if its easy to do.

@rslota rslota force-pushed the feature/free-text-search branch 3 times, most recently from fbc1746 to 456e7e3 Compare February 16, 2017 14:28
@rslota rslota force-pushed the feature/free-text-search branch 2 times, most recently from 9a698c7 to 30dbec3 Compare February 17, 2017 13:53
@michalwski michalwski merged commit 0ddc72a into master Feb 20, 2017
@michalwski michalwski deleted the feature/free-text-search branch February 20, 2017 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants