natural search for type #472

Closed
fommil opened this Issue Jul 1, 2014 · 56 comments

Contributor

fommil commented Jul 1, 2014

Imagine having https://github.com/ornicar/scalex in your editor and project-specific Scaladoc browser :-)

Not just using its web-interface, but having it work for everything on your classpath, with completion of types.

If you'd like this feature, you can help out by giving us some example queries and the expected responses that you'd want to see (if they are not obvious already). We want a simple, natural query language, not a strict DSL.

Member

aemoncannon commented Jul 1, 2014

+1
I assume you are aware of
https://github.com/ensime/ensime-emacs/blob/master/ensime-scalex.el

Contributor

fommil commented Jul 1, 2014

you assumed incorrectly :-)

Contributor

fommil commented Jul 2, 2014

wow, scalex is really out of date and unmaintained.

This feature is more about having scalex-like syntax work in a special search mode, which means we need to index the entire classpath correctly and then parse the query.

Contributor

fommil commented Aug 11, 2014

some ideas for parsing and domain objects to use: https://gist.github.com/milessabin/61860a59a93184c0c5df#file-gistfile1-scala-L18

(use blackbox instead of whitebox... requires a compiler instance)

ShaneDelmore commented Jan 19, 2015

I think this is an incredibly useful idea if it could find functions by signature in the current project or its libraries, instead of just in open buffers. On my team we are starting to see a lot of duplicated methods rewritten by multiple developers, because it is easier to write yet another max function than it is to find out whether a coworker has already written one you can re-use.

Contributor

fommil commented Jan 19, 2015

😄 I know, it'd be awesome. We're currently working towards stabilising for 1.0 but if you were able to help out by coming up with a way of indexing (in lucene) and persisting (in SQL/H2) the domain objects that we produce during classfile/depickle parsing, then we'd love to hear/chat more about this.

ShaneDelmore commented Jan 19, 2015

I just discovered the project, and am pretty new to Scala in general, but if I am capable I would love to help out, as I would like to do a little more open source work. Baby steps: first I have to get ensime actually working (I am currently an intellij user and find it doesn't "just work" on multi-project sbt projects that intellij works with). I would just end up adding noise right now, I think, but if I can get myself working in some fashion then I was planning on picking through your low hanging fruit to see if I could be useful.

Contributor

fommil commented Jan 19, 2015

hmm, that's a shame. I thought it did "just work". Are you using the sbt plugin to generate the .ensime file?

We could modularise the task of working on this ticket if that helps:

  1. create some standalone tests (also define user input/expected response)
  2. think about how to index/persist the project's information for searching
  3. implement it in our existing lucene/h2 persistence
  4. add protocol support
  5. add emacs GUI support

Contributor

rorygraves commented Jan 19, 2015

@ShaneDelmore Dive in - any help you give (low hanging or otherwise) is all appreciated 😄

Member

aemoncannon commented Jan 19, 2015

+1 : )

adelbertc commented Feb 19, 2015

Hey folks,

Just had a brief chat with @fommil - I have entertained the idea of (attempting to) write a Hoogle-ish tool for Scala. I'm currently playing around with scraping what I need from the compiler via a compiler plugin, but it looks like there's some similar discussion/work being done here.

My vision for such a thing would be very similar to that of Hoogle - being used via command line, via an interpreter (or something like multibot), and perhaps most importantly, from an editor (editor agnostic!).

The brief chat involved discussing writing the tool "in ENSIME" (will need some more guidance on what this means.. any resources would be helpful since I'm fairly new to ENSIME and Emacs) and having the other capabilities query ENSIME externally.

@rorygraves I was told you were interested in pushing ENSIME to be more of an "as-a-service" thing?

Member

aemoncannon commented Feb 19, 2015

@adelbertc
For maximum flexibility, I'd suggest following scala-refactoring's example and depend only on the compiler. As in: trait { self: Global => your code here }. That way it's easy to drop into ensime and other tools. scala-refactoring is also nice in the way that it provides tools for building an index (required for things like global rename), but is agnostic to the threading model of the host program.
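
A minimal sketch of the self-type pattern suggested above. To keep it compilable without the compiler jar, a hypothetical `CompilerLike` trait stands in for `scala.tools.nsc.Global`; `SignatureSearch`, `definedMethods` and `find` are made-up names for illustration only and are not part of scala-refactoring or ENSIME:

```scala
object SelfTypeSketch {
  // Hypothetical stand-in for scala.tools.nsc.Global, so that this sketch
  // compiles with only the standard library on the classpath.
  trait CompilerLike {
    def definedMethods: Seq[String]
  }

  // The tool declares what it needs through a self-type and nothing else.
  // Any host that owns a CompilerLike instance can mix this trait in.
  trait SignatureSearch { self: CompilerLike =>
    def find(fragment: String): Seq[String] =
      definedMethods.filter(_.contains(fragment))
  }
}
```

A host would mix the trait into the instance it already owns, for example `new CompilerLike with SignatureSearch { ... }`, which is what keeps the tool independent of the host's threading model.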

Contributor

rorygraves commented Feb 19, 2015

@adelbertc Yes I am - I've been slowly but surely teasing apart the expectation that emacs is sitting on the other end. I should put together an outline 'analysis' project that you can play with.

Contributor

fommil commented Feb 19, 2015

For this particular feature, you don't need to go near Global because the classfile indexer and scalap will give you everything you need. This all kicks in during the startup phase, so if you wrote a scalex / hoogle database schema / lucene model then you'd start up ensime, wait for "index complete" and make as many queries as you like from whatever tool you want 😄

Contributor

rorygraves commented Feb 19, 2015

I agree with @fommil - all of the information you require should be available from the Ensime API, and Ensime protects you from the internal compiler differences. I have a plan - give me 24 hours for a test project, but it will be a bit of a hack job in the first instance, as until @fommil's refactors are complete we don't have a clean (non-lisp) remote API.

adelbertc commented Feb 19, 2015

Woah what, ENSIME already has the type signatures of the methods/functions scraped?? Or am I misunderstanding?

Contributor

rorygraves commented Feb 19, 2015

@adelbertc I believe so, and in a Scala version independent way as well. And if it doesn't have it fully exposed we should fix it so it is.

adelbertc commented Feb 19, 2015

Oh come on, I just spent 2 hours of my night messing around with scala.tools.nsc._ :-)

That's awesome!! Pesky part is done then, on to the fun part! Looking forward to your test project @rorygraves

Contributor

fommil commented Feb 19, 2015

@adelbertc have a read of https://github.com/ensime/ensime-server/blob/master/src/main/scala/org/ensime/indexer/SearchService.scala and think about where your stuff would plug in.

Our scraping backends are currently:

Although we don't actually use the classfiledepickler, because the indexer has given us everything we have needed so far (but we expect to need more).

You can persist into

Make sure you don't try to use Lucene as a database (that's what we have H2 for).

When I spent some brain cycles on this, I couldn't figure out how to index the data. I think the important thing is to come up with a spec of how you would want the queries to look and then think about how to parse that into a query that would work in SQL, or use the advanced indexing features of Lucene if SQL isn't expressive enough.

Contributor

rorygraves commented Feb 19, 2015

@adelbertc Interestingly, @fommil and I interpreted the question in different ways.
I was trying to supply an API that you can use during your search/indexing phase - i.e. using Ensime as an API to iterate over the classes/types/methods and discover the available info, which you would then index/store elsewhere (so I could see a global database fed by different projects/runs and exposed on a website - the same as Hoogle).

@fommil was thinking more along the lines of building Hoogle-type calls into ensime for your project, so that they are available to be embedded into editors etc. for your project.

Contributor

fommil commented Feb 19, 2015

😄 yup, add Hoogle/Scalex to the ENSIME server. If you can have it working up to unit tests, we can do the wiring into the protocol layer for you.

adelbertc commented Feb 19, 2015

@rorygraves So the test project you were talking about is to provide an API to allow me to get access to the information I need, e.g. all functions/methods defined along with their type signatures? Will I be able to get at these in a structured format, like some sort of ADT or case class or whatnot (as opposed to a string-y dump that I have to parse)? I'm currently mostly interested in getting what I need out of the compiler so I can start playing with queries, ranking, etc, so this is very interesting to me. If your test project can do this.. yes please! :-)

@fommil So the idea is to build it into ENSIME, which can be easily hooked into editors, and then folks who want to use it, say, from a command line tool or a web interface will just treat ENSIME as a sort of server that's doing the indexing and whatnot for them? That's certainly an interesting approach, and I'm on board so long as we can get such a thing working more or less editor agnostically :-)

Contributor

fommil commented Feb 19, 2015

@adelbertc I think you'll find it easier to get it set up within ENSIME. You'll have access to the project definition and H2/Lucene is already there for you to use, with the data you need being streamed at you (instead of pull), and it can be a lot of data.

I think the best way for you to start would be to write up some TDD unit tests with user input and expected responses. It makes sense for there to be a method on SearchService{Spec} making this functionality available but the actual functionality could be provided by a dependency of SearchService to allow for independent testing. Perhaps we ought to refactor some of these classes to reflect the fact that there will be various indexing backends. Providing hierarchy lookup is something that will be coming relatively soon as well (no fancy indexing needed).

BTW, the ensime-server is editor agnostic. The problem is that nobody has written a (maintained) front-end for anything other than Emacs.
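
A minimal sketch of the "dependency of SearchService" idea, so that the search logic can be unit tested on its own with user input and expected responses. Every name here (`SignatureIndex`, `InMemoryIndex`, `SearchFacade`) is hypothetical, and the whitespace-insensitive exact match is only a placeholder for real signature matching:

```scala
object SearchSketch {
  // Hypothetical backend interface; ENSIME's real SearchService may look
  // quite different.
  trait SignatureIndex {
    def search(query: String): Seq[String]
  }

  // In-memory stand-in used for tests: a method matches when its stored
  // signature equals the query after whitespace is removed.
  final class InMemoryIndex(entries: Map[String, String]) extends SignatureIndex {
    private def norm(s: String): String = s.filterNot(_.isWhitespace)

    def search(query: String): Seq[String] =
      entries.collect { case (name, sig) if norm(sig) == norm(query) => name }.toSeq.sorted
  }

  // The facade only delegates, so a Lucene- or H2-backed index could be
  // swapped in without touching the tests written against InMemoryIndex.
  final class SearchFacade(index: SignatureIndex) {
    def searchBySignature(query: String): Seq[String] = index.search(query)
  }
}
```

A test then reads as a table of inputs and expected responses, e.g. the query `Int=>String` against an index holding `show: Int => String` should return `show`.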

Contributor

rorygraves commented Feb 19, 2015

@adelbertc Yes, that's exactly what I'm aiming for.

Contributor

fommil commented Feb 19, 2015

@adelbertc the case classes that are passed to you are generated in the two files I referenced above (classfileindexer and classfiledepickler), look in https://github.com/ensime/ensime-server/blob/master/src/main/scala/org/ensime/indexer/domain.scala for the classfile parsing (Java form) and the depickler returns objects from the Scala Compiler API (which is less pretty, but contains all the type information you'd ever need).

adelbertc commented Feb 19, 2015

@fommil Just took a look, looks like everything I need :-) I'll play around with it tonight - will probably just work with the indexer first since that's all you folks are using. I'll create a topic branch on my fork (will link when I push some code) to share progress.

To scrape the methods it looks like I just need to:

  1. Use ClassfileIndexer#indexClassFile to get RawClassFiles
  2. Poke into RawClassFile#methods to get the methods for the class

I'm guessing I'll need to clean up the names of the objects since I'm assuming the names given to me will be the $-y names?

Actually if it's Java form I wonder how messy stuff like higher-kinds and type classes/context bounds will be.. maybe I'll need to look into ClassfileDepickler ?
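
A minimal sketch of cleaning up the $-y names, assuming the names arrive as JVM-encoded strings. `scala.reflect.NameTransformer.decode` is part of the Scala standard library and turns encodings such as `$plus` back into operators; stripping a trailing `$` (the marker for an object's module class) is an extra assumption of this sketch, and `NameCleanup` is a made-up name:

```scala
import scala.reflect.NameTransformer

object NameCleanup {
  // Assumption: a trailing '$' marks the module class of a Scala object, so
  // it is dropped before decoding operator encodings such as $plus or $colon.
  // Nested and anonymous class names are not handled here.
  def clean(encodedName: String): String =
    NameTransformer.decode(encodedName.stripSuffix("$"))
}
```

This only addresses names; erased type parameters, higher kinds and context bounds are not recoverable this way, which is why the depickler question above still stands.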

Contributor

fommil commented Feb 19, 2015

@adelbertc there is a naming clash with "indexer". For legacy reasons the whole search service is referred to as "indexing", but that's really made up out of Lucene Indexing, and H2 Database. We use both.

There is loads of good stuff that could happen in SearchService. I'm really hoping to implement hierarchy browsing in the next few weeks, e.g. "what implements this?"

Member

jvican commented Jan 10, 2016

Cool, I have enough info to start the research. I will upload a doc with naive examples of the queries any time soon!

Contributor

fommil commented Feb 11, 2016

@Luegg just presented the answer at scalasphere: scala-search.org (needs relicensing to GPL)

Luegg commented Feb 15, 2016

I would love to see Scaps in Ensime 👍 Though, besides relicensing, this requires some additional rework of the API. I've sketched out my plans for the new interface in https://github.com/scala-search/scaps/blob/master/scapsApi.md. @fommil, can you have a glance at the API specs? I hope the transformation from your internal representation of types to the proposed format will be straightforward.

Contributor

fommil commented Feb 15, 2016

@Luegg cool, thanks! The API is probably still a little too high level for use in ENSIME. The ideal API for use would be something like

def index(sig: ScalaSig): Document // for scala functions
def index(fqn: FullyQualifiedName): Document // for java methods

where Document is the lucene thing that we can put into our index https://github.com/ensime/ensime-server/tree/master/core/src/main/scala/org/ensime/indexer/lucene

Note that for a lot of this stuff, my dev branch is more appropriate https://github.com/fommil/ensime-server/commits/index_method_descriptors because I've refactored our lucene layer and also added descriptors to the method FQNs.

and then on the query side, this kind of thing would be ideal

def scalaSearch(query: String): Query

so that we can run the query against our existing managed Lucene instance. We'd probably do some simple rule like only call this query if => is in the user's query.
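
A minimal sketch of the "only call this query if => is in the user's query" rule. `QueryRouting`, `SearchKind` and `classify` are hypothetical names; nothing here comes from ENSIME or Scaps, and a real implementation would hand the query on to Lucene rather than return a tag:

```scala
object QueryRouting {
  // Hypothetical tags for the two kinds of search discussed in this thread.
  sealed trait SearchKind
  case object SignatureSearch extends SearchKind
  case object NameSearch extends SearchKind

  // Treat the input as a type-signature query only when it contains "=>";
  // everything else falls through to the existing name-based search.
  def classify(query: String): SearchKind =
    if (query.contains("=>")) SignatureSearch else NameSearch
}
```

For example, `classify("Int => String")` yields `SignatureSearch`, while `classify("SearchService")` yields `NameSearch`.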

Contributor

fommil commented Feb 15, 2016

@Luegg btw, I noticed you're using scalaz https://github.com/scala-search/scaps/blob/master/project/dependencies.scala#L9-L10

is there any chance you could remove that dependency from the core? We're very keen to avoid introducing scala bloat in our dependency chain as we're a part of the community build, and dependencies like this really slow us down from being able to support more recent versions of scala. Ideally we'd like to get rid of all our scala dependencies.

@Luegg

This comment has been minimized.

Show comment
Hide comment
@Luegg

Luegg Feb 16, 2016

@fommil There are some difficulties in delegating control of the index/DB to the users of Scaps:

  • Scaps requires some persistent state to store frequency statistics and subtype relation which is later used to calculate statistics and analyze type queries. This could be solved, if ensime provides Scaps with a key/value store.
  • Furthermore, the frequency statistics are an aggregation over all indexed entities. This is a relatively costly operation and is best executed after an initial index of the project and after larger modification in the code base. Thus, ensime needs to eventually call a finalization method which updates the stats.
  • Additionally, to calculate the frequency stats, Scaps needs access to the index. More precisely, I need to know how many documents contain certain combinations of terms.

Dropping the Scalaz dependency should be possible without too much pain.
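
A minimal sketch of the integration surface these three points imply, assuming the host supplies the storage and decides when to aggregate. Every name here (`KeyValueStore`, `InMemoryStore`, `TypeIndexer`, `finalizeIndex`, the `freq:` key prefix) is hypothetical and exists in neither Scaps nor ensime.

```scala
import scala.collection.mutable

// Hypothetical key/value store that the host (e.g. ensime) would hand to the search library.
trait KeyValueStore {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
}

// In-memory stand-in, only meant for experiments and tests.
final class InMemoryStore extends KeyValueStore {
  private val data = mutable.Map.empty[String, String]
  def get(key: String): Option[String] = data.get(key)
  def put(key: String, value: String): Unit = data.update(key, value)
}

// Hypothetical indexer: counts are buffered and only persisted on finalizeIndex().
final class TypeIndexer(store: KeyValueStore) {
  private val pending = mutable.Map.empty[String, Int]

  // Record the fingerprint terms of one indexed definition.
  def index(terms: Set[String]): Unit =
    terms.foreach(t => pending.update(t, pending.getOrElse(t, 0) + 1))

  // The costly aggregation step: merge buffered counts into the store.
  // The host would call this after an initial index or a larger change.
  def finalizeIndex(): Unit = {
    pending.foreach { case (term, n) =>
      val old = store.get(s"freq:$term").map(_.toInt).getOrElse(0)
      store.put(s"freq:$term", (old + n).toString)
    }
    pending.clear()
  }
}
```

With this shape the host owns persistence and scheduling, while the library only needs `get`, `put`, and one explicit finalization call.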

@fommil

fommil Feb 16, 2016

Contributor

We have some well-defined places where we can call a batch method. I'm a bit confused about the frequency stats; can you please elaborate?

@Luegg

Luegg Feb 17, 2016

Frequency stats capture how likely a fingerprint term is to occur in a query. More likely terms (like -Any or +Nothing) are considered to be of less relevance. The current implementation iterates over all indexed signatures and converts each one to a query expression. Thus a signature Int => String may be converted to something like (-Int | -Any) & (+String | +Nothing) etc. The frequency of a term is then the number of query expressions containing this term.
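
The counting described above could be sketched roughly as follows. This is an illustration under assumptions, not the Scaps implementation: signatures are reduced to sets of term strings, and the `alternatives` table is a hand-written stand-in for the real subtype data.

```scala
object FrequencyStats {
  // Hypothetical subtype data: for each term, the extra terms its query expression also accepts.
  // "-" marks a parameter position, "+" a result position, as in the examples above.
  val alternatives: Map[String, Set[String]] = Map(
    "-Int"    -> Set("-Any"),
    "-String" -> Set("-Any"),
    "+Int"    -> Set("+Nothing"),
    "+String" -> Set("+Nothing")
  )

  // A signature reduced to its fingerprint terms, e.g. Int => String is Set("-Int", "+String").
  type Signature = Set[String]

  // All terms of the query expression built from a signature,
  // e.g. (-Int | -Any) & (+String | +Nothing) yields four terms.
  def queryTerms(sig: Signature): Set[String] =
    sig.flatMap(t => alternatives.getOrElse(t, Set.empty[String]) + t)

  // Frequency of a term = number of query expressions that contain it.
  def frequencies(index: Seq[Signature]): Map[String, Int] =
    index
      .flatMap(sig => queryTerms(sig).toSeq)
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.size }
}
```

In this toy model `-Any` and `+Nothing` appear in every expanded expression, which is why they end up with the highest frequency and hence the lowest relevance.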

Another, more efficient, approach would be to iterate over all type views (e.g. -Int |> -Any) and query the number of documents containing the left hand side of the type view. This number can then be added to the frequency of the right hand side of the view.
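
A rough sketch of this view-based variant, assuming a view is just a pair of terms and that the index can answer "how many documents contain this term". `View`, `lhs`, `rhs`, and `docCount` are hypothetical names, and whether identity views such as -Int |> -Int exist is also an assumption of the example.

```scala
object ViewFrequencies {
  // Hypothetical type view lhs |> rhs, e.g. View("-Int", "-Any").
  final case class View(lhs: String, rhs: String)

  // docCount stands in for an index lookup: number of documents containing the term.
  // For every view, the document count of the left side is added to the right side.
  def frequencies(views: Seq[View], docCount: String => Int): Map[String, Int] =
    views.foldLeft(Map.empty[String, Int]) { (acc, v) =>
      acc.updated(v.rhs, acc.getOrElse(v.rhs, 0) + docCount(v.lhs))
    }
}
```

In this sketch a document that contains two different left-hand terms of the same target contributes twice to that target, so the numbers need not match a per-signature count exactly.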

Most likely, there is also an online algorithm that can aggregate the frequencies while indexing value and type definitions, but I have not yet had the time to implement it.

@fommil

fommil Feb 17, 2016

Contributor

Hmm. That could be tricky, we definitely don't want two phases. However, maybe this could be encoded into query boosts?

@Luegg

Luegg Feb 17, 2016

In the end, frequencies are used to calculate query boosts. Altogether, they are one of the most central aspects of the approach and cannot easily be replaced (comparable to IDF in TF/IDF).

What exactly do you mean by "two phases"? First indexing and then aggregating statistics before the user can issue queries? This could be addressed by the online algorithm mentioned.
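
To make the IDF comparison concrete, here is one way a frequency could be turned into a query boost. The formula is a generic smoothed IDF chosen for illustration; the thread does not state the formula Scaps actually uses, and `QueryBoost` and `boost` are hypothetical names.

```scala
object QueryBoost {
  // Generic smoothed IDF, used here only as an illustration.
  // totalDocs: number of indexed signatures; freq: frequency of the term.
  def boost(totalDocs: Int, freq: Int): Double =
    math.log((totalDocs + 1).toDouble / (freq + 1).toDouble)
}
```

A term that occurs in every query expression (such as -Any) gets a boost of zero under this formula, while rarer terms get progressively larger boosts.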

@fommil

fommil Feb 17, 2016

Contributor

Are you applying the boosts to the index or the query? If it's the query, then we're ok.

@Luegg

Luegg Feb 17, 2016

Only to the query. Actually, it is a bit more complex because querying
consists of two steps. First, Lucene is used to find documents with
fingerprints containing terms with a high relevancy (and documents
including matching full-text keywords that may be present in the user’s
query). The second step is to reorder the retrieved fingerprints by the
similarity to the query.

But I think this does not make too much of a difference. All we need is
infrastructure to store some metadata collected during indexing that can
later be used to build the queries.

I’m currently working on extracting the core functionality of Scaps without
any dependencies on the index infrastructure. This discussion has already
provided some good input for a proper design (not too different from the
methods you described).
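
The two query steps described above can be sketched with plain collections. This is a simplified stand-in, assuming a query is a map from term to boost; a real setup would delegate the first step to Lucene, and `Doc`, `candidates`, `rank`, and `search` are hypothetical names.

```scala
object TwoStepQuery {
  // Hypothetical document: a value definition reduced to its fingerprint terms.
  final case class Doc(name: String, fingerprint: Set[String])

  // Step 1: cheap retrieval. Keep documents that share at least one of the
  // highest-boosted query terms (the part Lucene would handle).
  def candidates(docs: Seq[Doc], query: Map[String, Double], topTerms: Int): Seq[Doc] = {
    val relevant = query.toSeq
      .sortBy { case (_, boost) => -boost }
      .take(topTerms)
      .map { case (term, _) => term }
      .toSet
    docs.filter(d => d.fingerprint.exists(relevant))
  }

  // Step 2: reorder the candidates by similarity to the query,
  // here simply the summed boost of all matching terms.
  def rank(found: Seq[Doc], query: Map[String, Double]): Seq[Doc] =
    found.sortBy(d => -d.fingerprint.toSeq.map(t => query.getOrElse(t, 0.0)).sum)

  def search(docs: Seq[Doc], query: Map[String, Double], topTerms: Int = 2): Seq[Doc] =
    rank(candidates(docs, query, topTerms), query)
}
```

Both steps only read the index, so the boosts stay a query-time concern and index creation does not need an extra pass for them.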

@Luegg

Luegg Feb 17, 2016

Also, I think the online algorithm is crucial for integration into IDEs. First, it will certainly improve the user experience because the index is ready sooner after adding new sources/dependencies. Second, integrating it into an asynchronous environment will be much simpler. Thus, I'll also direct my efforts in this direction.
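
An online variant could look roughly like the sketch below, which keeps the counts current on every add or remove instead of recomputing them in a separate pass. It is an assumption-laden illustration: `OnlineFrequencies` is a hypothetical name, and the sketch presumes the subtype data (`alternatives`) is already known when a signature arrives, which may not hold in practice.

```scala
import scala.collection.mutable

// Hypothetical online aggregation: no finalization pass before querying.
final class OnlineFrequencies(alternatives: Map[String, Set[String]]) {
  private val counts = mutable.Map.empty[String, Int]

  // All terms of the query expression built from a signature.
  private def expand(sig: Set[String]): Set[String] =
    sig.flatMap(t => alternatives.getOrElse(t, Set.empty[String]) + t)

  // Called whenever a definition is indexed.
  def add(sig: Set[String]): Unit =
    expand(sig).foreach(t => counts.update(t, counts.getOrElse(t, 0) + 1))

  // Called whenever a definition is dropped from the index.
  def remove(sig: Set[String]): Unit =
    expand(sig).foreach { t =>
      val n = counts.getOrElse(t, 0) - 1
      if (n > 0) counts.update(t, n) else counts.remove(t)
    }

  def frequency(term: String): Int = counts.getOrElse(term, 0)
}
```

Because each update touches only the terms of one signature, the statistics are usable immediately after every change, which suits an asynchronous indexer.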

@fommil

fommil Feb 17, 2016

Contributor

oh that's fine... I thought it could mean extra steps during index creation, which would be a lot more pain. Multi step queries are fine.

@fommil fommil added Analyser and removed Analyser labels Mar 31, 2016

@fommil fommil modified the milestones: Graphpocalypse, TNG 2.0 May 1, 2016

@fommil fommil removed this from the Graphpocalypse milestone Aug 8, 2016

@ShaneDelmore

ShaneDelmore Mar 9, 2017

I had forgotten about this issue. Regarding a search syntax, I actually wasn't thinking of a separate search syntax (not sure it would get much use). I was thinking that in actual usage you would either autocomplete and filter the completions by type incrementally, or just declare the signature, leave a hole (???, maybe), and then get suggestions to fill the hole based on its type.
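
As a toy illustration of the hole idea: in Scala, `???` is the standard `Predef` placeholder, so a declared signature with a `???` body compiles and gives tooling an expected type to search for. The matching below compares signature strings only and is purely hypothetical (`TypedHole`, `Candidate`, and `suggest` are made-up names); real tooling would ask the compiler about type conformance.

```scala
object TypedHole {
  // Hypothetical candidate: a name plus its signature rendered as text.
  final case class Candidate(name: String, signature: String)

  // The hole: this compiles, but throws NotImplementedError if it is ever called.
  def parsePort(s: String): Option[Int] = ???

  // Keep the candidates whose signature equals the declared one.
  def suggest(declared: String, scope: Seq[Candidate]): Seq[String] =
    scope.filter(_.signature == declared).map(_.name)
}
```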

@fommil

fommil Mar 9, 2017

Contributor

@ShaneDelmore that's #1730
