Uses Tika to detect languages for common document files.

benreillyss/TikaLanguageDetector

Tika Language Detector File Ingest Module

Autopsy File Ingest Module that uses Tika to detect the language of common document files.

File types supported

The module currently attempts to process files with the following extensions:

  • .doc
  • .docx
  • .xls
  • .xlsx
  • .ppt
  • .pptx
  • .pdf
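
The extension check described above can be sketched in plain Java. This is only an illustration, not the module's actual code: the class name `SupportedExtensions` and the method `isSupported` are hypothetical, and case-insensitive matching is an assumption that this README does not confirm.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; names are illustrative and not taken from the module source.
final class SupportedExtensions {
    // Extensions listed in this README, lowercase and without the leading dot.
    private static final Set<String> EXTENSIONS =
            Set.of("doc", "docx", "xls", "xlsx", "ppt", "pptx", "pdf");

    private SupportedExtensions() {}

    // Returns true when the file name ends in one of the listed extensions.
    // Assumption: matching ignores case, so "REPORT.PDF" is accepted.
    static boolean isSupported(String fileName) {
        if (fileName == null) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension, or the name ends with a dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }
}
```

With this sketch, `SupportedExtensions.isSupported("report.PDF")` is accepted, while `notes.txt` and `archive.docx.bak` are skipped because only the final extension is considered.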

Languages supported by Tika models:

Tika Supported Language Models
Belarusian Catalan Danish German
Esperanto Estonian Greek English
Spanish Finnish French Persian
Galician Hungarian Icelandic Italian
Lithuanian Dutch Norwegian Polish
Portuguese Romanian Russian Slovakian
Slovenian Swedish Thai Ukrainian

tika.language.properties
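
If the detected language comes back as a two-letter ISO 639-1 code, a lookup like the following could turn it into one of the display names listed above. This is a sketch under assumptions: the class `LanguageNames` and its methods are hypothetical, the code-to-name pairs are standard ISO 639-1 assignments rather than values read from the module, and the "Unknown" fallback is an illustrative choice.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table; not taken from the module source.
final class LanguageNames {
    // ISO 639-1 code -> display name, in the order the README lists the languages.
    private static final Map<String, String> NAMES = new LinkedHashMap<>();

    static {
        String[][] pairs = {
            {"be", "Belarusian"}, {"ca", "Catalan"},    {"da", "Danish"},     {"de", "German"},
            {"eo", "Esperanto"},  {"et", "Estonian"},   {"el", "Greek"},      {"en", "English"},
            {"es", "Spanish"},    {"fi", "Finnish"},    {"fr", "French"},     {"fa", "Persian"},
            {"gl", "Galician"},   {"hu", "Hungarian"},  {"is", "Icelandic"},  {"it", "Italian"},
            {"lt", "Lithuanian"}, {"nl", "Dutch"},      {"no", "Norwegian"},  {"pl", "Polish"},
            {"pt", "Portuguese"}, {"ro", "Romanian"},   {"ru", "Russian"},    {"sk", "Slovakian"},
            {"sl", "Slovenian"},  {"sv", "Swedish"},    {"th", "Thai"},       {"uk", "Ukrainian"}
        };
        for (String[] pair : pairs) {
            NAMES.put(pair[0], pair[1]);
        }
    }

    private LanguageNames() {}

    // Returns the display name for a code, or "Unknown" when the code is not in the table.
    static String displayName(String isoCode) {
        if (isoCode == null) {
            return "Unknown";
        }
        return NAMES.getOrDefault(isoCode.toLowerCase(Locale.ROOT), "Unknown");
    }

    // Number of languages in the table.
    static int count() {
        return NAMES.size();
    }
}
```

Keeping the table in one place means a result label can be built from whatever code the detector returns, with unlisted codes falling back to a neutral value instead of failing.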

Example Autopsy Results

Interesting File Hits

Interesting Files with Result

Results are displayed as Interesting Items with a sub-category of Language_Detected.

Ingest Summary Overview

Processing Metrics Overview

Tika Language Detector reports a processing summary similar to those of other plugins.

Tika Language Detector Results

Once ingestion is complete, the total processing time and the number of files processed are reported.

Example Results from nps-2008-jean.E01

Interesting_Files_w_Result_Jean.PNG

Proc_Metrics_Jean.PNG