Text Mining, Spring 2015 – Predictive OSTI Subject Classification of Technical Report Abstracts from SciTech Connect
I explored which classification models and feature selection methods worked best for automating the subject classification of technical reports in the SciTech Connect database. I exported metadata for technical report records with abstracts, each assigned one of three OSTI subjects. I processed these records with Java programs and XSL stylesheets that I wrote, then loaded them into Oracle. Features were selected from the Oracle tables using two feature selection algorithms: Information Gain and TFxIDF. Oracle Data Miner was used to create two types of classification models: Decision Tree and Support Vector Machine (SVM). Finally, these models were tested against new records pulled from SciTech Connect. The final report also includes a literature review on the value of grey literature.
Technologies: Oracle, SQL, Java, XML/XSL
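The two feature scores named above can be illustrated with a minimal Java sketch. It is not the project's code: the class name FeatureScores, its method names, and the toy inputs are hypothetical, documents are assumed to be already tokenized into term sets, and the TFxIDF variant shown (raw term frequency times the natural log of N/df) is only one of several common formulations.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class FeatureScores {

    // Shannon entropy (in bits) of a class distribution given as counts.
    static double entropy(Collection<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue;
            }
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of a term's presence/absence with respect to the labels:
    // H(labels) minus the weighted entropy of the two partitions.
    static double informationGain(String term, List<Set<String>> docs, List<String> labels) {
        int n = docs.size();
        if (n == 0) {
            return 0.0;
        }
        Map<String, Integer> all = new HashMap<>();
        Map<String, Integer> present = new HashMap<>();
        Map<String, Integer> absent = new HashMap<>();
        int nPresent = 0;
        for (int i = 0; i < n; i++) {
            String label = labels.get(i);
            all.merge(label, 1, Integer::sum);
            if (docs.get(i).contains(term)) {
                present.merge(label, 1, Integer::sum);
                nPresent++;
            } else {
                absent.merge(label, 1, Integer::sum);
            }
        }
        double weightPresent = (double) nPresent / n;
        double weightAbsent = 1.0 - weightPresent;
        return entropy(all.values())
                - weightPresent * entropy(present.values())
                - weightAbsent * entropy(absent.values());
    }

    // TFxIDF weight: term frequency in one document times log(N / document frequency).
    static double tfIdf(int termFrequency, int documentFrequency, int totalDocs) {
        if (documentFrequency == 0 || totalDocs == 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocs / documentFrequency);
    }
}
```

In this sketch, a term that appears in every document of one subject and in none of the others gets the highest information gain, while a term that appears in every document gets a gain of zero and a TFxIDF weight of zero.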