natural search for type #472

Open
fommil opened this Issue Jul 1, 2014 · 54 comments

@fommil
Member
fommil commented Jul 1, 2014 edited

Imagine having https://github.com/ornicar/scalex in your editor and project-specific Scaladoc browser :-)

Not just using its web-interface, but having it work for everything on your classpath, with completion of types.

If you'd like this feature, you can help out by giving us some example queries and the expected responses that you'd want to see (if they are not obvious already). We want a simple, natural, query language not a strict DSL.

@aemoncannon
Member

+1
I assume you are aware of
https://github.com/ensime/ensime-emacs/blob/master/ensime-scalex.el


@fommil
Member
fommil commented Jul 1, 2014

you assumed incorrectly :-)

@fommil
Member
fommil commented Jul 2, 2014

wow, scalex is really out of date and unmaintained.

This feature is more about having scalex-like syntax work in a special search mode, which means we need to index the entire classpath correctly and then parse the query.

@fommil fommil referenced this issue in ornicar/scalex Aug 3, 2014
Closed

help with ENSIME migration/integration #47

@fommil
Member
fommil commented Aug 11, 2014

some ideas for parsing and domain objects to use: https://gist.github.com/milessabin/61860a59a93184c0c5df#file-gistfile1-scala-L18

(use blackbox instead of whitebox... requires a compiler instance)

@fommil fommil referenced this issue in typelevel/scala Sep 19, 2014
Closed

Add way to get all function signatures #53

@ShaneDelmore

I think this is an incredibly useful idea, if it could find functions by signature available in the current project or libraries, instead of just open buffers. On my team we are starting to see a lot of duplicated methods rewritten by multiple developers because it is easier to write yet another max function than it is to find out if there is already one written by your coworkers you can re-use.

@fommil
Member
fommil commented Jan 19, 2015

😄 I know, it'd be awesome. We're currently working towards stabilising for 1.0 but if you were able to help out by coming up with a way of indexing (in lucene) and persisting (in SQL/H2) the domain objects that we produce during classfile/depickle parsing, then we'd love to hear/chat more about this.

@ShaneDelmore

I just discovered the project and am pretty new to Scala in general, but if I am capable I would love to help out, as I would like to do a little more open source work. Baby steps: first I have to get ensime actually working (I am currently an IntelliJ user and find that ensime doesn't "just work" on the multi-project sbt projects that IntelliJ handles). I think I would just end up adding noise right now, but if I can get myself working in some fashion I plan to pick through your low-hanging fruit to see if I can be useful.

@fommil
Member
fommil commented Jan 19, 2015

hmm, that's a shame. I thought it did "just work". Are you using the sbt plugin to generate the .ensime file?

We could modularise the task of working on this ticket if that helps:

  1. create some standalone tests (also define user input/expected response)
  2. think about how to index/persist the project's information for searching
  3. implement it in our existing lucene/h2 persistence
  4. add protocol support
  5. add emacs GUI support
@rorygraves
Contributor

@ShaneDelmore Dive in - any help you give (low hanging or otherwise) is all appreciated 😄

@aemoncannon
Member

+1 : )

@adelbertc

Hey folks,

Just had a brief chat with @fommil - I have entertained the idea of (attempting to) write a Hoogle-ish tool for Scala. I'm currently playing around with scraping what I need from the compiler via a compiler plugin, but it looks like there's some similar discussion/work being done here.

My vision for such a thing would be very similar to that of Hoogle - being used via command line, via an interpreter (or something like multibot), and perhaps most importantly, from an editor (editor agnostic!).

The brief chat involved discussing writing the tool "in ENSIME" (will need some more guidance on what this means.. any resources would be helpful since I'm fairly new to ENSIME and Emacs) and having the other capabilities query ENSIME externally.

@rorygraves I was told you were interested in pushing ENSIME to be more of an "as-a-service" thing?

@aemoncannon
Member

@adelbertc
For maximum flexibility, I'd suggest following scala-refactoring's example and depend only on the compiler. As in: trait { self: Global => your code here }. That way it's easy to drop into ensime and other tools. scala-refactoring is also nice in the way that it provides tools for building an index (required for things like global rename), but is agnostic to the threading model of the host program.
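
A minimal sketch of the self-type pattern suggested above. To keep it runnable without the compiler on the classpath, a hypothetical `Host` trait stands in for `Global`; `SignatureSearch`, `Tool` and the sample signatures are likewise invented for illustration.

```scala
object SelfTypeSketch {
  // Hypothetical stand-in for the host. In the suggestion above this role
  // is played by the compiler's Global.
  trait Host {
    def allSignatures: Seq[String]
  }

  // The reusable component declares what it needs through a self-type and
  // makes no assumptions about threading or the surrounding tool.
  trait SignatureSearch { self: Host =>
    def search(fragment: String): Seq[String] =
      allSignatures.filter(_.contains(fragment))
  }

  // Any tool can drop the component into its own host implementation.
  object Tool extends Host with SignatureSearch {
    val allSignatures: Seq[String] =
      Seq("max: (Int, Int) => Int", "show: Int => String")
  }
}
```

Because the component only names its requirement, the same trait could in principle be mixed into ensime or into a command-line tool without change.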

@rorygraves
Contributor

@adelbertc Yes I am - I've been slowly but surely teasing apart the expectation that emacs is sitting on the other end. I should put together an outline 'analysis' project that you can play with.

@fommil
Member
fommil commented Feb 19, 2015

For this particular feature, you don't need to go near Global because the classfile indexer and scalap will give you everything you need. This all kicks in during the startup phase, so if you wrote a scalex / hoogle database schema / lucene model then you'd start up ensime, wait for "index complete" and make as many queries as you like from whatever tool you want 😄

@rorygraves
Contributor

I agree with @fommil - all of the information you require should be available from the Ensime API, and Ensime protects you from the internal compiler differences. I have a plan - give me 24 hours for a test project, but it will be a bit of a hack job in the first instance, as until @fommil's refactors are complete we don't have a clean (non-lisp) remote API.

@adelbertc

Woah what, ENSIME already has the type signatures of the methods/functions scraped?? Or am I misunderstanding?

@rorygraves
Contributor

@adelbertc I believe so, and in a Scala version independent way as well. And if it doesn't have it fully exposed we should fix it so it is.

@adelbertc

Oh come on, I just spent 2 hours of my night messing around with scala.tools.nsc._ :-)

That's awesome!! Pesky part is done then, on to the fun part! Looking forward to your test project @rorygraves

@fommil
Member
fommil commented Feb 19, 2015

@adelbertc have a read of https://github.com/ensime/ensime-server/blob/master/src/main/scala/org/ensime/indexer/SearchService.scala and think about where your stuff would plug in.

Our scraping backends are currently the classfileindexer and the classfiledepickler, although we don't actually use the classfiledepickler because the indexer gives us everything we have needed so far (but we have expected to need more).

You can persist into Lucene and H2.

Make sure you don't try to use Lucene as a database (that's what we have H2 for).

When I spent some brain cycles on this, I couldn't figure out how to index the data. I think the important thing is to come up with a spec of how you would want the queries to look and then think about how to parse that into a query that would work in SQL, or use the advanced indexing features of Lucene if SQL isn't expressive enough.

@rorygraves
Contributor

@adelbertc Interestingly, @fommil and I interpreted the question in different ways.
I was trying to supply an API that you can use during your search/indexing phase, i.e. using Ensime as an API to iterate over the classes/types/methods to discover the available info that you would index/store elsewhere (so I could see a global database fed by different projects/runs and exposed on a website, the same as Hoogle).

@fommil was thinking more along the lines of building Hoogle-type calls into ensime for your project, so that it is available and can be embedded into editors etc.

@fommil
Member
fommil commented Feb 19, 2015

😄 yup, add Hoogle/Scalex to the ENSIME server. If you can have it working up to unit tests, we can do the wiring into the protocol layer for you.

@adelbertc

@rorygraves So the test project you were talking about is to provide an API to allow me to get access to the information I need, e.g. all functions/methods defined along with their type signatures? Will I be able to get at these in a structured format, like some sort of ADT or case class or whatnot (as opposed to a string-y dump that I have to parse)? I'm currently mostly interested in getting what I need out of the compiler so I can start playing with queries, ranking, etc, so this is very interesting to me. If your test project can do this.. yes please! :-)

@fommil So the idea is to build it into ENSIME which can be easily hooked into editors, and then folks who want to use it, say, from a command line tool or a web interface will just treat ENSIME as a sort of server that's doing the indexing and whatnot for them? That's certainly an interesting approach, and I'm on board so long as we can get such a thing working more or less editor agnostically :-)

@fommil
Member
fommil commented Feb 19, 2015

@adelbertc I think you'll find it easier to get it set up within ENSIME. You'll have access to the project definition and H2/Lucene is already there for you to use, with the data you need being streamed at you (instead of pull), and it can be a lot of data.

I think the best way for you to start would be to write up some TDD unit tests with user input and expected responses. It makes sense for there to be a method on SearchService{Spec} making this functionality available but the actual functionality could be provided by a dependency of SearchService to allow for independent testing. Perhaps we ought to refactor some of these classes to reflect the fact that there will be various indexing backends. Providing hierarchy lookup is something that will be coming relatively soon as well (no fancy indexing needed).

BTW, the ensime-server is editor agnostic. The problem is that nobody has written a (maintained) front-end for anything other than Emacs.

@rorygraves
Contributor

@adelbertc Yes, that's exactly what I'm aiming for.

@fommil
Member
fommil commented Feb 19, 2015

@adelbertc the case classes that are passed to you are generated in the two files I referenced above (classfileindexer and classfiledepickler), look in https://github.com/ensime/ensime-server/blob/master/src/main/scala/org/ensime/indexer/domain.scala for the classfile parsing (Java form) and the depickler returns objects from the Scala Compiler API (which is less pretty, but contains all the type information you'd ever need).

@adelbertc

@fommil Just took a look, looks like everything I need :-) I'll play around with it tonight - will probably just work with the indexer first since that's all you folks are using. I'll create a topic branch on my fork (will link when I push some code) to share progress.

To scrape the methods it looks like I just need to:

  1. Use ClassfileIndexer#indexClassFile to get RawClassFiles
  2. Poke into RawClassFile#methods to get the methods for the class

I'm guessing I'll need to clean up the names of the objects since I'm assuming the names given to me will be the $-y names?
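
One possible way to tidy such names, sketched under assumptions: `scala.reflect.NameTransformer.decode` from the standard library reverses the operator encoding, and stripping a trailing `$` is a guess at how module classes should be presented. `NameCleanup` and `clean` are hypothetical names, not part of ENSIME.

```scala
import scala.reflect.NameTransformer

object NameCleanup {
  // Decode operator encodings (for example "$plus" back to "+") and drop
  // the trailing "$" that marks the class generated for an `object`.
  // Whether the trailing "$" should be dropped is an assumption here.
  def clean(jvmName: String): String =
    NameTransformer.decode(jvmName).stripSuffix("$")
}
```

This only covers the naming question; it says nothing about recovering higher-kinded types or context bounds.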

Actually if it's Java form I wonder how messy stuff like higher-kinds and type classes/context bounds will be.. maybe I'll need to look into ClassfileDepickler ?

@fommil
Member
fommil commented Feb 19, 2015

@adelbertc there is a naming clash with "indexer". For legacy reasons the whole search service is referred to as "indexing", but that's really made up out of Lucene Indexing, and H2 Database. We use both.

There is loads of good stuff that could happen in SearchService. I'm really hoping to implement hierarchy browsing in the next few weeks, e.g. "what implements this?"

@fommil
Member
fommil commented Feb 19, 2015

@adelbertc you don't need to use ClassFileIndexer. If you hook into the right part of SearchService, its output will be handed to you. You'll probably definitely want to add a ClassfileDepickler stage to the SearchService around the same place (there is a TODO), as it will give you all the type information you need. You'd probably pass the output of that to your service and get back some objects that can be indexed/persisted, then implement the search side (keep that bit trivial) in a single method on SearchService.

My philosophy with search is:

  1. do all the work on the indexing side, so the search is trivial (and super fast)
  2. only return database row numbers from a search on Lucene
  3. put everything you want to retrieve in the SQL database

not only is that a lot cleaner to debug/understand, but it's actually faster. Lucene is an index so it works pretty much in reverse to how you imagine databases normally working.
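
The three points above can be sketched with in-memory stand-ins. Nothing here touches Lucene or H2: a `Map` keyed by row number plays the database, a term-to-row-numbers `Map` plays the index, and `Row`, `terms` and the sample data are all invented for illustration.

```scala
object IndexThenLookup {
  // Stand-in for a database row: everything we want to retrieve lives here.
  final case class Row(fqn: String, signature: String)

  // Stand-in for the SQL table, keyed by row number.
  val table: Map[Int, Row] = Map(
    1 -> Row("scala.math.max", "(Int, Int) => Int"),
    2 -> Row("scala.Predef.identity", "A => A"),
    3 -> Row("java.lang.Integer.toString", "Int => String")
  )

  // Indexing side: do the work up front by tokenising each signature.
  def terms(signature: String): Set[String] =
    signature.split("[^A-Za-z0-9]+").filter(_.nonEmpty).map(_.toLowerCase).toSet

  // Stand-in for the index: it maps a term to row numbers and nothing else.
  val index: Map[String, Set[Int]] =
    table.toSeq
      .flatMap { case (id, row) => terms(row.signature).map(_ -> id) }
      .groupBy(_._1)
      .map { case (term, hits) => term -> hits.map(_._2).toSet }

  // Search side: a trivial lookup for row numbers, then a fetch by number.
  def search(term: String): Seq[Row] =
    index.getOrElse(term.toLowerCase, Set.empty[Int]).toSeq.sorted.map(table)
}
```

The search method stays a single lookup plus a fetch, which is the property the philosophy is after.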

If you can come up with a data structure that you think makes sense to persist, and how you want to query it, I'll advise on how to Lucene-ify it.

@adelbertc

Yep taking a look, I (think I) see what needs to be done and where :-)

@rorygraves
Contributor

In the alternate universe thread - here is the test project I promised - https://github.com/rorygraves/ensime-analyser It's very rough and ready and I have already discovered several things that need tweaking in Ensime to support this better.

@MasseGuillaume
Contributor

Hey,
As pointed out, Scalex is out of date. @ornicar wants to put his efforts into http://en.lichess.org/ and he does an excellent job!

I'm starting metadoc. The idea is to have a way to zoom from all available Scala projects to a class' def. Mac users may know Dash documentation tool, I'm solving a similar problem.

What I have achieved so far:

  1. I collected all Scala projects poms on Bintray/Sonatype. This will allow us to list tons of projects (5000 artifacts for Scala 2.11, 10 000 artifacts for 2.10), their dependencies and meta information like website url, etc.

  2. Right now a compiler plugin can print trees from @non / cats. This is challenging because we need to make sure Scalahost can compile any project. @xeno-by is making sure everything works.

There is still a bunch of stuff to find out. I will make sure everybody has free access to this information.

@fommil fommil changed the title from scalex to Duck Duck Type Jul 15, 2015
@fommil
Member
fommil commented Jul 15, 2015

renaming in honour of @viktorklang // @dickwall

@viktorklang

Thanks @fommil! :)

@fommil fommil changed the title from Duck Duck Type to Type search in your editor Jul 28, 2015
@fommil fommil added the OMG! Ponies! label Jul 28, 2015
@fommil fommil added this to the TNG 2.0 milestone Jul 28, 2015
@fommil
Member
fommil commented Jul 28, 2015

The more I think about this, the more I believe it is a strategic move for ENSIME. I'm personally putting it as my no. 2 priority after Java support but would love somebody to get there first. Step one is getting people who are excited by this to give us a bunch of examples of the kinds of queries they'd naturally like to make then we can think about how to parse that into a form that can match against the serialised form of the type signature coming from scalap.
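
As a hedged illustration of the parsing step, here is one possible first pass over a query, assuming users write arrows between parameter and result types. It only splits on top-level `=>`; `QuerySketch` and `topLevelParts` are hypothetical names, and matching against scalap output is not attempted.

```scala
object QuerySketch {
  // Split a query such as "List[A] => (A => B) => List[B]" on the arrows
  // that are not nested inside brackets or parentheses.
  def topLevelParts(query: String): List[String] = {
    val parts = List.newBuilder[String]
    val current = new StringBuilder
    var depth = 0
    var i = 0
    while (i < query.length) {
      val c = query.charAt(i)
      if (c == '[' || c == '(') depth += 1
      if (c == ']' || c == ')') depth -= 1
      if (depth == 0 && query.startsWith("=>", i)) {
        // A top-level arrow ends the current part.
        parts += current.toString.trim
        current.clear()
        i += 2
      } else {
        current.append(c)
        i += 1
      }
    }
    parts += current.toString.trim
    parts.result()
  }
}
```

A truly natural query language would need to tolerate much looser input than this; the sketch only shows the shape of the problem.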

@fommil fommil changed the title from Type search in your editor to natural search for type Jul 28, 2015
@epost
epost commented Oct 31, 2015

There's also Scaps, which I recently came across.

As a side note, and FWIW, I'm trying to come up with a useful model for querying software projects in the form of something like Datalog (which gives you automated reasoning), currently with a focus on Haskell and PureScript, but with a view to abstracting away a lot of language-specific details, see this gist and the nascent psc-query, which aims to convert PureScript projects to queryable knowledge bases. I think something like that, or an alternative involving graph queries, would be pretty cool to have. See also this thread on the emerging haskell-ide-engine project. (Apologies if this seems too far OT.)

@fommil
Member
fommil commented Oct 31, 2015

@epost cool! have you seen the references in #1136 ?

@epost
epost commented Oct 31, 2015

@fommil Wow, I hadn't, somehow... Thanks!

@jvican
Member
jvican commented Jan 10, 2016

This sounds awesome. Is anyone working on this (perhaps @epost)? I would like to pick it up.

@fommil
Member
fommil commented Jan 10, 2016

@jvican excellent! First thing to do is research on what the natural queries would look like. Maybe study scalex a bit? I can help with Lucene implementation.

@jvican
Member
jvican commented Jan 10, 2016

Cool, I have enough info to start the research. I will upload a doc with naive examples of the queries any time soon!

@fommil
Member
fommil commented Feb 11, 2016

@luegg just presented the answer at scalasphere (needs relicence to GPL) scala-search.org

@Luegg
Luegg commented Feb 15, 2016

I would love to see Scaps in Ensime 👍 Though, besides relicensing, this requires some additional rework of the API. I've sketched out my plans for the new interface in https://github.com/scala-search/scaps/blob/master/scapsApi.md. @fommil, can you have a glance at the API specs? I hope the transformation from your internal representation of types to the format proposed should be straightforward.

@fommil
Member
fommil commented Feb 15, 2016

@Luegg cool, thanks! The API is probably still a little too high level for use in ENSIME. The ideal API for use would be something like

def index(sig: ScalaSig): Document // for scala functions
def index(fqn: FullyQualifiedName): Document // for java methods

where Document is the lucene thing that we can put into our index https://github.com/ensime/ensime-server/tree/master/core/src/main/scala/org/ensime/indexer/lucene

Note that for a lot of this stuff, my dev branch is more appropriate https://github.com/fommil/ensime-server/commits/index_method_descriptors because I've refactored our lucene layer and also added descriptors to the method FQNs.

and then on the query side, this kind of thing would be ideal

def scalaSearch(query: String): Query

so that we can run the query against our existing managed Lucene instance. We'd probably do some simple rule like only call this query if => is in the user's query.
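
The routing rule mentioned above might look like the following sketch. `QueryRouting`, `SearchKind` and `classify` are hypothetical names; the real decision would sit in front of the existing Lucene query path.

```scala
object QueryRouting {
  sealed trait SearchKind
  case object TypeSearch extends SearchKind
  case object KeywordSearch extends SearchKind

  // The simple rule from the discussion: treat the input as a type query
  // only when the user typed an arrow.
  def classify(userQuery: String): SearchKind =
    if (userQuery.contains("=>")) TypeSearch else KeywordSearch
}
```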

@Luegg Luegg referenced this issue in scala-search/scaps Feb 15, 2016
Open

Extract Scaps Core #23

@fommil
Member
fommil commented Feb 15, 2016

@Luegg btw, I noticed you're using scalaz https://github.com/scala-search/scaps/blob/master/project/dependencies.scala#L9-L10

is there any chance you could remove that dependency from the core? We're very keen to avoid introducing scala bloat in our dependency chain as we're a part of the community build, and dependencies like this really slow us down from being able to support more recent versions of scala. Ideally we'd like to get rid of all our scala dependencies.

@Luegg
Luegg commented Feb 16, 2016

@fommil There are some difficulties in delegating control of the index/DB to the users of Scaps:

  • Scaps requires some persistent state to store frequency statistics and subtype relation which is later used to calculate statistics and analyze type queries. This could be solved, if ensime provides Scaps with a key/value store.
  • Furthermore, the frequency statistics are an aggregation over all indexed entities. This is a relatively costly operation and is best executed after an initial index of the project and after larger modifications to the code base. Thus, ensime needs to eventually call a finalization method which updates the stats.
  • Additionally, to calculate the frequency stats, Scaps needs access to the index. More precisely, I need to know how many documents contain certain combinations of terms.

Dropping the Scalaz dependency should be possible without too much pain.

@fommil
Member
fommil commented Feb 16, 2016

We have some well defined places where we can call a batch method. I'm a bit confused about the frequency stats, can you please elaborate?

@Luegg
Luegg commented Feb 17, 2016

Frequency stats capture how likely a fingerprint term is to occur in a query. More likely terms (like -Any or +Nothing) are considered to be of less relevance. The current implementation iterates over all indexed signatures and converts them to a query expression. Thus a signature Int => String may be converted to something like (-Int | -Any) & (+String | +Nothing) etc. The frequency of a term is then the number of query expressions containing this term.
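
A toy version of this counting, assuming single-argument signatures and a hand-written table of alternatives. It is not the Scaps implementation; `FrequencySketch`, `alternatives` and the sample types are invented for illustration.

```scala
object FrequencySketch {
  // Toy table, assumed for illustration: each fingerprint term together
  // with the alternatives it expands to in a query expression.
  val alternatives: Map[String, Set[String]] = Map(
    "-Int"    -> Set("-Int", "-Any"),
    "-String" -> Set("-String", "-Any"),
    "+Int"    -> Set("+Int", "+Nothing"),
    "+String" -> Set("+String", "+Nothing")
  )

  // A signature A => B contributes -A for the argument and +B for the result.
  def fingerprint(arg: String, result: String): Set[String] =
    Set("-" + arg, "+" + result)

  // Every term that appears in the query expression built from a signature.
  def expressionTerms(arg: String, result: String): Set[String] =
    fingerprint(arg, result).flatMap(t => alternatives.getOrElse(t, Set(t)))

  // Frequency of a term = number of query expressions containing it.
  def frequencies(signatures: Seq[(String, String)]): Map[String, Int] =
    signatures
      .flatMap { case (a, r) => expressionTerms(a, r).toSeq }
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.size }
}
```

With a handful of signatures, terms such as -Any and +Nothing end up in every expression, which is why they carry little relevance.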

Another, more efficient, approach would be to iterate over all type views (e.g. -Int |> -Any) and query the number of documents containing the left hand side of the type view. This number can then be added to the frequency of the right hand side of the view.
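
The view-based alternative could be sketched as follows. The documents, the views and the inclusion of identity views (a term viewed as itself) are assumptions made so the example is self-contained; `TypeViewSketch` is a hypothetical name.

```scala
object TypeViewSketch {
  // Indexed documents, each reduced to its set of fingerprint terms.
  val documents: Seq[Set[String]] = Seq(
    Set("-Int", "+String"),
    Set("-String", "+String"),
    Set("-Int", "+Int")
  )

  // Type views as (left hand side, right hand side) pairs, e.g. -Int |> -Any.
  // Identity views are included here as an assumption.
  val views: Seq[(String, String)] = Seq(
    "-Int" -> "-Int",
    "-Int" -> "-Any",
    "-String" -> "-String",
    "-String" -> "-Any"
  )

  // For each view, count the documents containing the left hand side and
  // add that count to the frequency of the right hand side.
  def frequencies: Map[String, Int] =
    views.foldLeft(Map.empty[String, Int]) { case (acc, (lhs, rhs)) =>
      val n = documents.count(_.contains(lhs))
      acc.updated(rhs, acc.getOrElse(rhs, 0) + n)
    }
}
```

The work is proportional to the number of views rather than the number of indexed signatures, which is the efficiency argument made above.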

Most likely, there is also an online algorithm that can aggregate the frequencies while indexing value and type definitions. But, I did not yet have the time to implement it.

@fommil
Member
fommil commented Feb 17, 2016

Hmm. That could be tricky, we definitely don't want two phases. However, maybe this could be encoded into query boosts?

@Luegg
Luegg commented Feb 17, 2016

In the end, frequencies are used to calculate query boosts. Altogether, they are one of the most central aspects of the approach and cannot easily be replaced (comparable to IDF in TF/IDF).

What exactly do you mean by "two phases"? First indexing and then aggregating statistics before the user can issue queries? This could be addressed by the online algorithm mentioned.

@fommil
Member
fommil commented Feb 17, 2016

are you applying the boosts to the index or the query? If it's the query, then we're ok.

@Luegg
Luegg commented Feb 17, 2016

Only to the query. Actually, it is a bit more complex because querying consists of two steps. First, Lucene is used to find documents with fingerprints containing terms with a high relevancy (and documents including matching full-text keywords that may be present in the user's query). The second step is to reorder the retrieved fingerprints by the similarity to the query.
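
The two steps can be illustrated with plain collections. Jaccard similarity is swapped in here as an assumed stand-in for the real similarity measure, and `TwoStepQuery`, `Doc` and the sample fingerprints are hypothetical.

```scala
object TwoStepQuery {
  final case class Doc(name: String, fingerprint: Set[String])

  val docs: Seq[Doc] = Seq(
    Doc("toString", Set("-Int", "+String")),
    Doc("length", Set("-String", "+Int")),
    Doc("show", Set("-Any", "+String"))
  )

  // Step 1 (the role played by Lucene): keep the documents that share at
  // least one term with the query.
  def candidates(query: Set[String]): Seq[Doc] =
    docs.filter(d => (d.fingerprint & query).nonEmpty)

  // Assumed similarity measure: size of the overlap over size of the union.
  def similarity(a: Set[String], b: Set[String]): Double =
    (a & b).size.toDouble / (a | b).size

  // Step 2: reorder the retrieved candidates by similarity to the query.
  def search(query: Set[String]): Seq[Doc] =
    candidates(query).sortBy(d => -similarity(d.fingerprint, query))
}
```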

But I think this does not make too much of a difference. All we need is infrastructure to store some metadata collected during indexing that can later be used to build the queries.

I'm currently working on extracting the core functionality of Scaps without any dependencies to the index infrastructure. This discussion already provided some good inputs for a proper design (not too different from the methods you described).


@Luegg
Luegg commented Feb 17, 2016

Also, I think the online algorithm is crucial for the integration into IDEs. First, it will certainly improve the user experience because the index is ready sooner after adding new sources/dependencies. And second, integrating it into an asynchronous environment will be much simpler. Thus, I'll also direct my efforts in this direction.

@fommil
Member
fommil commented Feb 17, 2016

oh that's fine... I thought it could mean extra steps during index creation, which would be a lot more pain. Multi step queries are fine.

@Luegg Luegg referenced this issue in sschaef/amora Feb 20, 2016
Closed

Semantic search engine #13

@fommil fommil added Analyser Indexer Blocked and removed Analyser labels Mar 31, 2016
@fommil fommil modified the milestone: Graphpocalypse, TNG 2.0 May 1, 2016
@fommil fommil removed this from the Graphpocalypse milestone Aug 8, 2016