language c is unknown #87

Closed
quasarea opened this issue Mar 20, 2017 · 11 comments

@quasarea

I have a bunch of files with nothing special in the name and "c" as the extension.

All are reported as unknown, even though the classifier has four records for this extension.

Btw, are extension definitions inclusive? That is, can we assign a single extension to multiple languages so that it is covered by every language filter that meets this condition?

@boyter
Owner

boyter commented Mar 20, 2017

I would have to see an example file itself to know why this is. Is it possible to post one here? Or at least create a file which replicates this?

At the moment it is inclusive. If you look in the classifier database.json file, however, you can see an array called keywords, which I plan to fill with the most common keywords as a way of guessing the language when several languages share an extension. The logic should live in here

public String languageGuesser(String fileName, List<String> codeLines) {
when implemented.
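
To make "inclusive" concrete, here is a minimal sketch of an extension lookup that returns every language listing a given extension. It is not the project's code: the class `ExtensionLookupSketch` and its methods `register` and `candidatesFor` are hypothetical names, and the only thing taken from the thread is the "language"/"extensions" shape of the database entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; class and method names are not from the project.
final class ExtensionLookupSketch {

    // Language name -> extensions, mirroring the "language"/"extensions"
    // fields of the classifier database entries.
    private final Map<String, List<String>> extensionsByLanguage = new LinkedHashMap<>();

    void register(String language, String... extensions) {
        extensionsByLanguage.put(language, Arrays.asList(extensions));
    }

    // Returns every language that lists the file's extension, in
    // registration order. An empty list means no language claims it.
    List<String> candidatesFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : extensionsByLanguage.entrySet()) {
            if (entry.getValue().contains(extension)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

With "C" registered for `c` and "C++" registered for `cpp`, `cc` and `c`, a file ending in `.c` yields two candidates, which is the conflict a keyword-based guess would then have to settle.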

@boyter
Owner

boyter commented Mar 29, 2017

Any word on this?

@quasarea
Author

I'm out of office till the 19th of April and can't deliver anything meaningful before that date, sorry.

@boyter
Owner

boyter commented Mar 31, 2017

No problem.

@boyter
Owner

boyter commented Apr 5, 2017

Managed to replicate using the Golang repository

https://github.com/golang/go.git

search for

http://localhost:8080/?q=assert&repo=golang

/misc/cgo/life/c-life.c

is reported as unknown.

@boyter
Owner

boyter commented Apr 9, 2017

Replicated. The root cause is that there are multiple languages which match the c extension. When this is true the logic is meant to try and guess which one is correct based on the contents of the files. However the keywords are missing from the classifier database. For example,

  {
    "language": "C",
    "extensions": [
      "c"
    ],
    "keywords": []
  }

To resolve this I need to fix the database ./include/classifier/database.json so that it has keywords in there, as well as clean up the file types.

To do so I will run against the searchcode.com database to pull back the top 100 or so keywords for each language and use that to populate the database to resolve the issue.
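
A rough sketch of what "pull back the top 100 or so keywords" could look like for a single language follows. It is an illustration only, not the process actually run against searchcode.com: the class `TopKeywordsSketch`, the tokenizer (lower-cased runs of letters and digits) and the `minLength` cut-off are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; the tokenization rules are assumptions.
final class TopKeywordsSketch {

    // Counts lower-cased alphanumeric tokens of at least minLength characters
    // and returns the `limit` most frequent ones, most frequent first.
    // Ties are broken alphabetically so the output is deterministic.
    static List<String> topKeywords(List<String> codeLines, int limit, int minLength) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : codeLines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (token.length() >= minLength) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Running this per language over a large corpus and writing each result into that language's keywords array would populate the database. Frequent words from comments and licence headers end up in such a list as well unless they are filtered out.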

boyter added a commit that referenced this issue Apr 10, 2017
@boyter
Owner

boyter commented Apr 10, 2017

Proof of concept fix added. Works as expected. Need to build out the database correctly to resolve this properly.

@quasarea
Author

Great 👍 I wonder if it is worth keeping keywords that appear in all flavours, or just the flavour-specific ones for a performance gain.

@boyter
Owner

boyter commented Apr 10, 2017

It only falls back to that logic where it finds a conflict. So it shouldn't be that much of a problem in terms of performance.

What I need to do is decide what to do when there are multiple matches and the keyword isn't able to split them. Probably will have it default to the first option.
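
The fallback described above could be sketched roughly as follows. This is not the committed fix: `KeywordTieBreakSketch` and `guess` are hypothetical names, and scoring by the number of keywords that occur in the file is an assumption about how "guessing" might work. The one behaviour it is meant to show is that a tie goes to the first candidate.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the project's languageGuesser implementation.
final class KeywordTieBreakSketch {

    // candidates maps language -> keywords and must iterate in database
    // order (for example a LinkedHashMap). Returns the language with the
    // most keyword hits. On a tie the earliest candidate wins, because
    // only a strictly higher score replaces the current best.
    static String guess(Map<String, List<String>> candidates, List<String> codeLines) {
        Set<String> tokens = new HashSet<>();
        for (String line : codeLines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9_]+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, List<String>> entry : candidates.entrySet()) {
            int score = 0;
            for (String keyword : entry.getValue()) {
                if (tokens.contains(keyword)) {
                    score++;
                }
            }
            if (score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best == null ? "Unknown" : best;
    }
}
```

Keywords shared by every candidate, such as `include` for C and C++, add the same amount to each score, so they never change the outcome. Only the language-specific keywords separate the candidates.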

Have tried out with the following keywords for C and C++ which work for my sample Hello World test.

  {
    "language": "C",
    "extensions": [
      "c"
    ],
    "keywords": ["the", "return", "int", "void", "null", "int32", "0x0000", "struct", "for", "char", "static", "else", "type", "case", "error", "and", "size", "this", "break", "include", "name", "data", "get", "const", "file", "len", "set", "endif", "info", "not", "sizeof", "value", "define", "buf", "flags", "list", "free", "that", "state", "code", "ptr", "next", "dev", "read", "offset", "with", "mode", "from", "buffer", "tree", "new", "table", "err", "unsigned", "write", "reg", "addr", "are", "out", "max", "end", "node", "count", "string", "status", "goto", "use", "result", "length", "device", "any", "ret", "printf", "start", "ifdef", "line", "mng", "have", "log", "long", "init", "index", "while", "check", "png", "args", "num", "block", "val", "lock", "all", "sys", "key", "default", "false", "function", "bfd", "isc", "true", "entry", "software", "add", "first", "arg", "debug", "number", "res", "mask", "port", "bus", "can", "frame", "last", "stream", "version", "pos", "base", "0x00", "msg", "str", "pyobject", "config", "bits", "map", "current", "copyright", "object", "but", "header", "time", "handle", "assert", "register", "switch", "hash", "output", "flag", "cmd", "used", "byte", "insn", "eif", "src", "event", "dns", "bit", "target", "defined", "will", "memory", "address", "softc", "link", "must", "priv", "copy", "uint32", "class", "field", "bytes", "path", "self", "license", "tag", "context", "only", "width", "func", "input", "cur", "one", "source", "call", "ctx", "without", "symbol", "alloc", "print", "hal", "pointer", "exit", "decl", "format", "fprintf", "acpi", "has", "section", "mem", "command", "ifp", "create", "text", "dst", "face", "rtx", "make", "argv", "page", "xfs", "should"]
  },
  {
    "language": "C++",
    "extensions": [
      "cpp",
      "cc",
      "c"
    ],
    "keywords": ["the", "0000000000000000", "return", "const", "void", "5000000000000000", "this", "int", "2500000000000000", "for", "7500000000000000", "else", "case", "ace", "name", "type", "string", "include", "false", "char", "true", "null", "size", "bool", "std", "and", "break", "data", "value", "new", "50000000000000000", "result", "static", "25000000000000000", "file", "get", "unsigned", "not", "set", "end", "0x00", "str", "75000000000000000", "item", "class", "that", "error", "object", "code", "list", "player", "public", "with", "count", "spell", "out", "handle", "target", "from", "uint32", "license", "info", "index", "cast", "check", "0x20", "s32", "map", "soap", "key", "function", "node", "test", "endif", "add", "max", "iter", "swig", "printf", "buffer", "iterator", "any", "length", "unit", "delete", "event", "begin", "use", "free", "message", "start", "ptr", "have", "can", "state", "text", "offset", "default", "width", "are", "msg", "assert", "has", "software", "llvm", "define", "time", "next", "pos", "all", "append", "first", "bwapi", "err", "base", "float", "args", "line", "context", "source", "creature", "while", "sizeof", "path", "back", "version", "entry", "will", "gnu", "f32", "block", "copy", "number", "push", "len", "mask", "height", "virtual", "current", "log", "you", "flags", "input", "double", "read", "color", "general", "only", "but", "see", "without", "0x65", "num", "buf", "empty", "template", "defined", "struct", "mode", "vector", "llsd", "expr", "switch", "handler", "second", "should", "flag", "array", "instance", "register", "arg", "getattr", "val", "debug", "0x64", "clear", "status", "write", "init", "group", "endl", "find", "field", "long", "call", "copyright", "timer", "create", "continue", "isolate", "src", "itr", "left", "output", "module", "format", "namespace", "one", "0x74", "rect"]
  },

@boyter
Owner

boyter commented Apr 10, 2017

Ok, not totally resolved yet, but the following commit resolves it for C and C++

ef39f8e

I will continue to build out the database (it literally takes days to build the keywords due to how much data is being crunched) and move it into resolved when done.

@boyter
Owner

boyter commented Apr 11, 2017

Confirmed fixed against Go. Going to close this one out.

@boyter boyter closed this as completed Apr 11, 2017