Adding files with no extension to be searched #105

NathanaelA · 2017-05-02T02:00:50Z

Is there a way to tag "Podfile" as a Cocoapod file? It has no extension, so I'm not sure how to add it to the include/classifier/database.json file... I'd rather have it come up as a known file...

boyter · 2017-05-02T02:06:23Z

Open the ./include/classifier/database.json file in your editor of choice.

Add a new entry like the following (changing pod to whatever extension you need):

{
    "language": "Cocoapod",
    "extensions": [
      "pod"
    ],
    "keywords": []
}

You can leave keywords empty. Be sure to check that the file is still valid JSON. You can use a command line tool like jq to validate it like so:

jq . database.json
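If jq is not installed, the same validity check can be sketched with Python's standard json module. The helper names below (validate_json_text, validate_json_file) are made up for illustration and are not part of searchcode server.

```python
import json


def validate_json_text(text):
    """Return (True, None) if text parses as JSON, else (False, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the place where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


def validate_json_file(path):
    """Read a file such as database.json and check that it is valid JSON."""
    with open(path, encoding="utf-8") as handle:
        return validate_json_text(handle.read())
```

Calling validate_json_file("database.json") from the classifier directory reports where parsing failed, which helps spot a stray trailing comma after a hand edit.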

If it's a common format please post it back here and I will include it in the default list.

boyter · 2017-05-02T02:08:46Z

Just realised I misread this. You want to add a file without an extension.

Currently there is no way to do this. I will need to look at fixing this.

TODO - Add support for files without extension to be classified.

quasarea · 2017-05-02T08:14:08Z

Have some fortran without extensions here too ;)

boyter · 2017-05-02T11:13:09Z

Hmmm, for that situation you would need to rely totally on the keyword checks. This specific issue can be solved by just looking for an explicit filename.

Keyword checks are something that I have been playing around with locally. The idea is that if a file doesn't match anything based on extension, keywords are used to guess what filetype it is. It will slow things down considerably when indexing due to the additional processing overhead. Assuming it's a minority of files though, it shouldn't be a huge issue.
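A rough sketch of that fallback order, for illustration only: the two-entry database, the scoring rule (count how many of an entry's keywords appear in the file) and the classify function are assumptions, not the code searchcode server uses.

```python
import re

# Hypothetical miniature database in the same shape as database.json
DATABASE = [
    {"language": "Fortran", "extensions": ["f90"],
     "keywords": ["subroutine", "implicit", "program", "end"]},
    {"language": "Python", "extensions": ["py"],
     "keywords": ["def", "import", "self", "elif"]},
]


def classify(filename, content, database=DATABASE):
    """Classify by extension first, then fall back to keyword counting."""
    # Step 1: cheap lookup on the file extension
    if "." in filename:
        extension = filename.rsplit(".", 1)[1].lower()
        for entry in database:
            if extension in entry["extensions"]:
                return entry["language"]

    # Step 2: nothing matched, so scan the content (the slower path)
    words = set(re.findall(r"[A-Za-z_]+", content.lower()))
    best_language, best_score = "Unknown", 0
    for entry in database:
        score = sum(1 for keyword in entry["keywords"] if keyword in words)
        if score > best_score:
            best_language, best_score = entry["language"], score
    return best_language
```

Only files that miss the extension lookup pay for the content scan, which is why the overhead stays small when such files are a minority.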

NathanaelA · 2017-05-02T16:15:26Z

Boyter, another way at least to deal with some of these would be something like this:

{
    "language": "Cocoapod",
    "keywords": [], 
    "fixedName":  "Podfile"
},
{
    "language": "License",
    "keywords": [], 
    "fixedName": "License",
    "ignore": true
}

Two additional JSON attributes; fixedName and ignore...

This would not fix the Fortran files of quasarea, but it would solve both the "License" and "Podfile" issues and allow people to easily ignore files in the classifier file... ;-) And rather than adding gitignore and npmignore to the "binary" file types I could also add them to the classifier and put the "ignore" = true flag on them... Might be a more universal, cleaner fix for most issues, as everything is together then.

Then your keyword checks could go into effect after this point to cover things like Fortran files w/o an extension...
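To make the proposal concrete, here is a sketch of how such entries might be interpreted. The fixedName and ignore attributes are only the ones proposed in this comment, and match_fixed_name is a hypothetical helper, not an existing classifier feature.

```python
# Entries in the shape proposed above; "fixedName" and "ignore" are
# proposed attributes, not ones the classifier is known to support.
ENTRIES = [
    {"language": "Cocoapod", "keywords": [], "fixedName": "Podfile"},
    {"language": "License", "keywords": [], "fixedName": "License", "ignore": True},
]


def match_fixed_name(filename, entries=ENTRIES):
    """Return (language, ignored) for an exact filename match, else None."""
    for entry in entries:
        if entry.get("fixedName") == filename:
            # A missing "ignore" attribute is treated as False
            return entry["language"], entry.get("ignore", False)
    return None
```

A None result would be the point at which the keyword checks take over.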

boyter · 2017-05-02T22:06:52Z

That was the plan for your specific case: the fixedName thing. I would probably make it an array though, just to cover things like COPYING and LICENSE both generally being license files.

I was going to keep the ignores inside the properties file though. I will have a think about it in this case though. It might make more sense for specific types for them to live in the database file.

NathanaelA · 2017-05-02T22:19:36Z

The only reason I suggest ignore be moved to the database; is it makes everything in the same file. Then people aren't having to go between places... If the cpu hit is minor I would actually move all the binary files into that same database with the ignore flag... Keeping things consistent makes it easier to configure and should simplify your code... ;-)

boyter · 2017-05-02T22:21:26Z

Valid reasons. The main issue is during upgrades. It's a little more painful to migrate your own changes into the database file.

The CPU hit should in theory be nothing, thankfully. I might do it though, as I can see it being a better solution in the long term.

boyter · 2017-05-03T22:42:13Z

So I was looking into this, and turns out some of it is already done. The problem is that I didn't make the database name "extensions" very descriptive. If you add the following,

{
    "language": "Cocoapod",
    "extensions": [
      "cocoapod"
    ],
    "keywords": []
}

to the database, the file with the name cocoapod will be classified correctly. I made it such that if the filename has no extension (no . in the name) then the filename itself is treated as the extension. An example of this already happening is Jenkins Buildfiles, whose entry looks like this:

{
    "language": "Jenkins Buildfile",
    "extensions": [
      "jenkinsfile"
    ],
    "keywords": []
}
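The lookup described above can be sketched as follows. This is an illustration rather than the server's code: the function names are hypothetical, and the lower-casing is an assumption suggested by the lower-case "jenkinsfile" entry.

```python
def lookup_key(filename):
    """Return the text to compare against an entry's "extensions" list."""
    name = filename.lower()  # assumption: matching is case-insensitive
    if "." in name:
        # Normal case: use whatever follows the last dot
        return name.rsplit(".", 1)[1]
    # No dot at all: the whole filename stands in for the extension
    return name


def find_language(filename, database):
    """Return the language of the first matching entry, or None."""
    key = lookup_key(filename)
    for entry in database:
        if key in entry["extensions"]:
            return entry["language"]
    return None
```

With the Jenkins entry above loaded into database, find_language("Jenkinsfile", database) would return "Jenkins Buildfile" under these assumptions.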

I will need to update the KB with this detail and probably add it as part of a readme in the directory itself.

I will however be adding a check which tries to guess the file type given that nothing else matches. This will not however be 100% accurate as it will be based on the most common keywords in the database.

The ignore functionality, however, is something I will be adding.

I have also added Cocoapod to the database to save you the effort of having to do this yourself in the future: b141810

boyter · 2017-05-04T08:01:54Z

Logic to guess file type given no matches added. Can be enabled by setting the property

deep_guess_files=true

in the searchcode.properties file.

boyter · 2017-05-18T21:08:13Z

Documentation for KB updated

https://searchcodeserver.com/knowledge-base/how-to-add-files-to-be-recognised.html

- BREAKING CHANGE Changed validation of repository names such that they must be alphanumeric, _ or - with client and server side validation
- BREAKING CHANGE Fix spelling of check_filerepo_chages to check_filerepo_changes for properties file
- Set follow symlinks to be configurable through properties file #99
- Clicking Remove will also clear the text box filters #98
- Improved stop/reset jobs logic, deleted jobs persist on searchcode restart #41
- Add logic to calculate project stats by lines not files and display next to existing #103
- Deep guess logic added to guess a file's type based on keyword heuristics #105
- Additional languages added to classifier database: F#, Mathematica, Parrot, Puppet, Rakefile, PKGBUILD, Cargo, Lock, License
- API auditing via logs added #57
- Search results now have RSS feed #114
- Can add custom HTML/CSS/JS to all pages #107
- Add average index time seconds to repo overview page #118
- Fix bug where unable to filter on html page #120

boyter added a commit that referenced this issue May 3, 2017

Add logic to check file type on keywords if no extention #105

516a2be

boyter mentioned this issue May 4, 2017

Any easy way to eliminate files from being indexed / counted #104

Closed

boyter added this to ToDo in Release 1.3.12 Oct 3, 2017
