Umbrella issue for ML team #231

Closed
19 of 36 tasks
EgorBu opened this issue Jan 3, 2019 · 10 comments

@EgorBu

EgorBu commented Jan 3, 2019

Hello,
As we discussed in Slack, this is the umbrella issue for the ML team. I will try to aggregate our wishes and problems here.

Issues

Let's start with issues:

bblfshd issues

Go & Python client issues

Drivers

SDK

Web

Miscellaneous

I believe that some of these issues can be closed already.

Wishlist

Let's start the second part, about wishes (it's probably a bit late for New Year's resolutions, but anyway).
It should show our priorities.

Feature extraction for source code

This part is very important for everybody in the MLonCode area, and it is still quite complicated to do.
In summary:

What we need (and not only we) is code -> UAST -> code. That is not possible right now, so we use several tricks to work around the problem.

We wrote a tokenizer for source code (right now only JavaScript is supported) that does code -> {UAST, parent info for each node, tokens}, where the tokens satisfy "".join(tokens) == code.

To do this we had to find the reserved keywords of the particular language and write logic that splits everything between the token-bearing UAST nodes into reserved keywords and spaces/quotes/newlines.
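
A minimal sketch of the gap-splitting idea described above, not our actual tokenizer. It assumes the token-bearing nodes are given as non-overlapping (start_offset, end_offset) pairs; the names `tokenize`, `split_gap`, `keywords_in` and the tiny `RESERVED` set are hypothetical and only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately tiny subset of JavaScript reserved words.
RESERVED = {"var", "let", "const", "function", "return", "if", "else"}

# A gap is split into identifier-like words, whitespace runs,
# and any other single character (quotes, punctuation, ...).
GAP_PATTERN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\s+|.", re.DOTALL)


def split_gap(gap: str) -> List[str]:
    """Split the text between two UAST tokens without losing characters."""
    return GAP_PATTERN.findall(gap)


def tokenize(code: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return tokens such that "".join(tokens) == code.

    `spans` are the offsets of UAST nodes that carry a token.
    """
    tokens: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        tokens.extend(split_gap(code[cursor:start]))  # text before the node
        tokens.append(code[start:end])                # the node's own token
        cursor = end
    tokens.extend(split_gap(code[cursor:]))           # trailing text
    return tokens


def keywords_in(tokens: List[str]) -> List[str]:
    """Pick out the tokens that are reserved keywords."""
    return [t for t in tokens if t in RESERVED]
```

For example, `tokenize("var x = 1;", [(4, 5), (8, 9)])` keeps `x` and `1` as node tokens and splits the rest into `var`, spaces, `=` and `;`.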

It is also important to know which roles are used in a particular language. Only a subset of all roles is used for any given language, and this can be used to reduce the dimensionality of features, like we did here.
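
To illustrate the dimensionality point, here is a rough sketch that builds a feature index only from the roles actually observed in a corpus and encodes a node's roles as a multi-hot vector. The function names and the role strings in the usage note are illustrative assumptions, not our real feature extractor.

```python
from typing import Dict, Iterable, List


def build_role_index(nodes_roles: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Map every role observed in the corpus to a column index."""
    observed = sorted({role for roles in nodes_roles for role in roles})
    return {role: i for i, role in enumerate(observed)}


def encode_roles(roles: Iterable[str], index: Dict[str, int]) -> List[int]:
    """Multi-hot encode one node's roles; unseen roles are ignored."""
    vec = [0] * len(index)
    for role in roles:
        if role in index:
            vec[index[role]] = 1
    return vec
```

With a corpus that only ever uses `Expression`, `Identifier` and `Literal`, the vector has three columns instead of one column per role in the full role list.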

Position information

We rely heavily on position information for identifiers and literals. When it is not available, it makes our life much harder.

So we have to make workarounds like this for "[bug] Missed token for regular expressions".

How it can be checked: run long tests for each driver on the top 10-20-30 repositories for each language and check that node.token == code[node.start_offset:node.end_offset].
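
The check above could be sketched roughly like this. `Node` is a hypothetical stand-in for a UAST node with a token and offsets, not the real client class, and the sketch assumes the offsets index into the same string as `code` (if a driver reports byte offsets, the comparison would have to be done on the encoded bytes instead).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Node:
    # Hypothetical flattened view of a UAST node.
    token: str
    start_offset: int
    end_offset: int


def find_token_mismatches(code: str, nodes: Iterable[Node]) -> List[Node]:
    """Return nodes whose token differs from the source slice they point at."""
    return [
        node
        for node in nodes
        if node.token != code[node.start_offset:node.end_offset]
    ]
```

A node with an empty token that covers a regular expression literal, as in the bug mentioned above, would show up in the returned list.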

Integration/regression tests on popular repositories

Maybe process the top 10-20-30 repositories for each supported language.

Make long-running tests to check that everything is fine with the drivers, SDK, etc. This would help avoid many problems like

and so on.

Language extensions

Follow-up to this discussion: https://src-d.slack.com/archives/C7UDG9VNY/p1533802242000193. In summary: https://github.com/babel/babel/ is excluded from our list of repos to check because it is more an issue of the native driver not covering all node types in this repo. Detecting this kind of language extension is probably a driver/enry issue. (Another example: https://github.com/samgozman/YoptaScript#%D0%9F%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D1%8B)

Partial parsing

It could help to check that a code change did not break UAST parsing. Right now the full content of the file is parsed again even if only one space is added.

@creachadair
Contributor

Thank you very much for writing this up, @EgorBu. There's a lot to consider here, but this will be really helpful for our planning.

@creachadair creachadair self-assigned this Jan 3, 2019
@creachadair creachadair added the tracking (☂) Umbrella or tracking issues with multiple subparts label Jan 3, 2019
@zurk

zurk commented Feb 5, 2019

+ https://github.com/bblfsh/bblfshd/issues/236 (JS driver related)

@dennwc
Member

dennwc commented Feb 5, 2019

2) was fixed in bblfsh/javascript-driver#50

@dennwc
Member

dennwc commented Feb 5, 2019

I've checked the issues list and some of them are already resolved.

@creachadair Maybe we can make a separate project for this to track the progress?

@creachadair
Contributor

creachadair commented Feb 5, 2019

This would make a good milestone. I'll set that up.

Edit: I started doing this, but it wound up opening up various design questions I didn't want to pre-answer, so I went with the simpler approach @bzz suggested below for now.

@bzz
Contributor

bzz commented Feb 7, 2019

Short-term, and as a first step, we could convert the PR description to use checkboxes and mark the issues that are solved. Happy to do that if nobody has any objections or better short-term suggestions.

@creachadair
Contributor

> Short-term, and as a first step, we could convert the PR description to use checkboxes and mark the issues that are solved. Happy to do that if nobody has any objections or better short-term suggestions.

It's a good idea, and I started working on this. So far I have mostly just rearranged things cosmetically, checking off items that are complete and separating out the FRs from the bugs.

@bzz
Contributor

bzz commented Apr 9, 2019

Did a pass over checkboxes in Description and updated resolved ones.

@dennwc
Member

dennwc commented Jul 8, 2019

I think we can close this one, assuming all feature requests have their own issue filed.

@creachadair
Contributor

> I think we can close this one, assuming all feature requests have their own issue filed.

It looks that way. If I'm wrong, let me know and I'll take care of it.
