Adding files with no extension to be searched #105

Open
NathanaelA opened this issue May 2, 2017 · 11 comments

@NathanaelA

Is there a way to tag "Podfile" as a Cocoapod file? It has no extension, so I'm not sure how to add it to the include/classifier/database.json file... I'd rather have it come up as a known file...

@boyter
Owner

boyter commented May 2, 2017

Open the ./include/classifier/database.json file in your editor of choice.

Add a new entry like the following (changing pod to whatever extension you need):

{
    "language": "Cocoapod",
    "extensions": [
      "pod"
    ],
    "keywords": []
}

You can leave keywords empty. Be sure to check that the file is still valid JSON. You can use a command-line tool like jq to validate it like so:

jq . database.json

If it's a common format, please post it back here and I will include it in the default list.
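For anyone without jq installed, the same check can be sketched in Python. This is only an illustration: the function name validate_database is made up here, and it assumes the database file is a JSON array of entries with the language, extensions and keywords fields shown above, which may not match the real file exactly.

```python
import json


def validate_database(text):
    """Return a list of problems found in a classifier database JSON string.

    Assumption (not confirmed by the project): the file is a JSON array of
    objects, each with "language", "extensions" and "keywords" fields.
    """
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        # Same failure jq would report: the file is not valid JSON at all.
        return [f"invalid JSON: {exc}"]

    if not isinstance(entries, list):
        return ["top level must be a JSON array of entries"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        for field in ("language", "extensions", "keywords"):
            if field not in entry:
                problems.append(f"entry {i}: missing '{field}'")
    return problems
```

An empty list means the text parsed and every entry had the expected fields. To check a real file, pass it the contents read from database.json.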

@boyter
Owner

boyter commented May 2, 2017

Just realised I misread this. You want to add a file without an extension.

Currently there is no way to do this. I will need to look at fixing this.

TODO - Add support for files without extension to be classified.

@quasarea

quasarea commented May 2, 2017

Have some fortran without extensions here too ;)

@boyter
Owner

boyter commented May 2, 2017

Hmmm, for that situation you would need to rely totally on the keyword checks. This specific issue can be solved by just looking for an explicit filename.

Keyword checks are something that I have been playing around with locally. The idea is that if a file doesn't match anything based on extension, keywords are used to guess what filetype it is. It will slow indexing down considerably due to the additional processing overhead. Assuming it's a minority of files though, it shouldn't be a huge issue.
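A rough sketch of that idea in Python, to make the flow concrete. This is not the actual searchcode implementation: the function name, the sample database entries and their keywords, and the scoring rule (count how many of a language's keywords appear in the file) are all assumptions made for illustration.

```python
import re

# Hypothetical sample entries; the keywords are invented for this sketch.
DATABASE = [
    {"language": "Python", "extensions": ["py"],
     "keywords": ["def", "import", "self"]},
    {"language": "Fortran", "extensions": ["f90"],
     "keywords": ["subroutine", "implicit", "end"]},
]


def classify_with_fallback(filename, content, database=DATABASE):
    """Classify by extension first; fall back to keyword counts if no match."""
    name = filename.lower()
    ext = name.rsplit(".", 1)[1] if "." in name else ""

    # Cheap path: an extension lookup, no need to read the content.
    for entry in database:
        if ext and ext in entry["extensions"]:
            return entry["language"]

    # Slow path: scan the content and score each language by how many of
    # its keywords occur. This is the extra per-file processing cost.
    words = set(re.findall(r"[a-z_]+", content.lower()))
    best, best_score = "Unknown", 0
    for entry in database:
        score = sum(1 for kw in entry["keywords"] if kw in words)
        if score > best_score:
            best, best_score = entry["language"], score
    return best
```

Only files that miss the extension lookup pay for the content scan, which is why the overhead should stay small if such files are a minority.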

@NathanaelA
Author

NathanaelA commented May 2, 2017

Boyter, another way at least to deal with some of these would be something like this:

{
    "language": "Cocoapod",
    "keywords": [],
    "fixedName": "Podfile"
},
{
    "language": "License",
    "keywords": [],
    "fixedName": "License",
    "ignore": true
}

Two additional JSON attributes; fixedName and ignore...

This would not fix quasarea's Fortran case, but it would solve both the "License" and "Podfile" issues and allow people to easily ignore files in the classifier file... ;-) And rather than adding gitignore and npmignore to the "binary" file types, I could also add them to the classifier and put the "ignore": true flag on them... Might be a more universal, cleaner fix for most issues, as everything is together then.

Then your keyword checks could go into effect after this point to cover things like Fortran files w/o an extension...
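To show how the two proposed attributes might be consumed, here is a minimal Python sketch. fixedName and ignore are only the suggestion above, not existing fields, and the function name and the exact, case-sensitive match are assumptions.

```python
# Entries mirror the proposal above; "fixedName" and "ignore" are
# suggested attributes, not fields the classifier is known to support.
DATABASE = [
    {"language": "Cocoapod", "keywords": [], "fixedName": "Podfile"},
    {"language": "License", "keywords": [], "fixedName": "License",
     "ignore": True},
]


def lookup_fixed_name(filename, database=DATABASE):
    """Return (language, ignore) for an exact filename match, else None."""
    for entry in database:
        # Assumption: the match is exact and case-sensitive.
        if entry.get("fixedName") == filename:
            # A missing "ignore" attribute is treated as False.
            return entry["language"], entry.get("ignore", False)
    return None
```

A caller would skip indexing when the second value is true, and move on to extension or keyword checks when the result is None.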

@boyter
Owner

boyter commented May 2, 2017

That was the plan for your specific case: the fixedName thing. I would probably make it an array though, just to cover things like COPYING and LICENSE both generally being license files.

I was going to keep the ignores inside the properties file. I will have a think about it in this case though. For specific types it might make more sense for them to live in the database file.

@NathanaelA
Author

The only reason I suggest moving ignore to the database is that it keeps everything in the same file. Then people don't have to go between places... If the CPU hit is minor I would actually move all the binary files into that same database with the ignore flag... Keeping things consistent makes it easier to configure and should simplify your code... ;-)

@boyter
Owner

boyter commented May 2, 2017

Valid reasons. The main issue is during upgrades: it's a little more painful to migrate your own changes into the database file.

The CPU hit should in theory be nothing, thankfully. I might do it though, as I can see it being a better solution in the long term.

@boyter
Owner

boyter commented May 3, 2017

So I was looking into this, and it turns out some of it is already done. The problem is that the database field name "extensions" is not very descriptive. If you add the following,

{
    "language": "Cocoapod",
    "extensions": [
      "cocoapod"
    ],
    "keywords": []
}

to the database, a file with the name cocoapod will be classified correctly. I made it so that if a file name has no extension (no .), the filename itself is treated as the extension. An example of this already happening is Jenkins Buildfiles, whose entry looks like this:

{
    "language": "Jenkins Buildfile",
    "extensions": [
      "jenkinsfile"
    ],
    "keywords": []
}
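The matching rule described above can be sketched in Python as follows. This illustrates the behaviour as described, not the real code: the function names are hypothetical, and lower-casing the filename before the lookup is an assumption (suggested by the lower-case "jenkinsfile" entry, but not confirmed).

```python
# Sample entries taken from the examples above, plus one ordinary
# extension-based entry added here for contrast.
DATABASE = [
    {"language": "Cocoapod", "extensions": ["cocoapod"], "keywords": []},
    {"language": "Jenkins Buildfile", "extensions": ["jenkinsfile"],
     "keywords": []},
    {"language": "Java", "extensions": ["java"], "keywords": []},
]


def lookup_key(filename):
    """Return the extension, or the whole filename if there is no '.'."""
    name = filename.lower()  # assumption: matching is case-insensitive
    if "." in name:
        return name.rsplit(".", 1)[1]
    return name


def classify_by_name(filename, database=DATABASE):
    """Match the lookup key against each entry's "extensions" list."""
    key = lookup_key(filename)
    for entry in database:
        if key in entry["extensions"]:
            return entry["language"]
    return None
```

Under these assumptions, Jenkinsfile and Main.java both resolve through the same "extensions" list, which is why a separate field for fixed filenames is not strictly needed.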

I will need to update the KB with this detail and probably add it as part of a readme in the directory itself.

I will however be adding a check which tries to guess the file type when nothing else matches. This will not be 100% accurate, however, as it will be based on the most common keywords in the database.

The ignore functionality, however, is something I will be adding.

I have also added Cocoapod to the database to save you the effort of having to do this yourself in the future: b141810

@boyter
Owner

boyter commented May 4, 2017

Logic to guess the file type when there are no matches has been added. It can be enabled by setting the property

deep_guess_files=true

in the searchcode.properties file.

@boyter
Owner

boyter commented May 18, 2017

boyter added a commit that referenced this issue Jun 14, 2017
 - BREAKING CHANGE Changed validation of repository names such that they must be alphanumeric, _ or - with client and server side validation
 - BREAKING CHANGE Fix spelling of check_filerepo_chages to check_filerepo_changes for properties file
 - Set follow symlinks to be configurable through properties file #99
 - Clicking Remove will also clear the text box filters #98
 - Improved stop/reset jobs logic, deleted jobs persist on searchcode restart #41
 - Add logic to calculate project stats by lines not files and display next to existing #103
 - Deep guess logic added to guess a files type based on keyword heuristic's #105
 - Additional languages added to classifier database, F#, Mathematica, Parrot, Puppet, Rakefile, PKGBUILD, Cargo, Lock, License
 - API auditing via logs added #57
 - Search results now have RSS feed #114
 - Can add custom HTML/CSS/JS to all pages #107
 - Add average index time seconds to repo overview page #118
 - Fix bug where unable to filter on html page #120
boyter added this to ToDo in Release 1.3.12 Oct 3, 2017