FiletypeDetection

You can read more about the project on its corresponding JIRA issue.

File Byte Histogram Machine Learning Classification

This code generates a model that Tika can use for content-based MIME type detection from byte frequency histograms.
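
As a rough illustration of the feature mentioned above, the sketch below counts how often each of the 256 possible byte values occurs in some data. It is written in Python rather than R, the names `byte_histogram` and `file_histogram` are hypothetical, and returning raw counts instead of normalized frequencies is an assumption. The project's R scripts may compute the histogram differently.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Return 256 raw counts, one per possible byte value (0..255)."""
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) for value in range(256)]


def file_histogram(path: str) -> list[int]:
    """Read a file in binary mode and return its byte histogram."""
    with open(path, "rb") as handle:
        return byte_histogram(handle.read())


if __name__ == "__main__":
    hist = byte_histogram(b"%PDF-1.4\n")
    # Number of bins, count of the '%' byte, total number of bytes seen.
    print(len(hist), hist[ord("%")], sum(hist))
```

Each histogram has the same length regardless of file size, which is what lets files of different sizes share one fixed-width feature vector.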

Steps

Identify the type of file you want Tika to detect using this method.
Collect files of that type, and also files that are not of that type.
Create three datasets:
- Training
- Testing
- Validation
Each dataset has dimensions m × (256 + 1), where m is the number of examples in that set (training, validation, or test), 256 is the number of features (the byte frequency histogram, not preprocessed with a companding function), and the extra column holds the labeled output.
The datasets can be generated as CSV files.
Run main.R.
The output of main.R is tika.model, which can be used in Tika.
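
The dataset layout from the steps above can be sketched as follows: each row is a 256-bin histogram plus one label, and the rows are divided into three sets. This is a minimal Python sketch under stated assumptions, not the project's own procedure. The helper names, the label in the last column, the 60/20/20 split, and the absence of a header line are all assumptions. The file names train.csv, val.csv, and test.csv match files in this project, but whether main.R expects exactly this layout should be checked against the detailed documentation.

```python
import csv
import random


def make_row(histogram, label):
    """Append the label to a 256-bin histogram, giving 257 columns."""
    if len(histogram) != 256:
        raise ValueError("expected 256 histogram bins")
    return list(histogram) + [label]  # label in the last column (assumed)


def split_rows(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, val, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def write_csv(path, rows):
    """Write rows to a CSV file with no header line (an assumption)."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


if __name__ == "__main__":
    # Toy data: ten all-zero histograms with alternating labels 1 and 0.
    rows = [make_row([0] * 256, i % 2) for i in range(10)]
    train, val, test = split_rows(rows)
    write_csv("train.csv", train)
    write_csv("val.csv", val)
    write_csv("test.csv", test)
```

In practice the rows would come from real files of the target type (label 1, say) and files of other types (label 0), with both classes represented in all three sets.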

For more detailed documentation, download Documenation_NNModelIntegrationWithTika.docx from this project.
