Spreadsheet Transcription Software
Clinical researchers, historians, educators and field researchers alike still regularly capture data on paper spreadsheets. In health care and education, the data often contains sensitive personal information, which further complicates the process of transcribing paper-based archives into digital form.
This software uses machine learning and crowd intelligence to automatically transcribe images of paper-based spreadsheets into electronic form while protecting sensitive personal information. Our algorithm consists of four high-level stages:
(1) the extraction of cell-level images from the spreadsheet grid, (2) machine recognition of digits within the cells, (3) human transcription of cell contents about which the machine was uncertain and (4) feedback of human transcription results to the machine to improve future classification performance.
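
The hand-off between stages (2), (3) and (4) can be illustrated with a minimal Python sketch. It is not taken from Images_to_spreadsheets_Public_Release.m. The names Prediction, route_cells, transcribe and ask_human, the use of a single confidence score per cell, and the 0.9 threshold are all assumptions made for illustration; the released code may measure uncertainty and collect feedback differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) position in the spreadsheet grid


@dataclass
class Prediction:
    """Hypothetical output of the digit recognizer for one cell."""
    digit: int
    confidence: float  # assumed to lie in [0, 1]


def route_cells(
    predictions: Dict[Cell, Prediction], threshold: float
) -> Tuple[Dict[Cell, int], List[Cell]]:
    """Split cells into machine-accepted values and cells needing a human."""
    accepted: Dict[Cell, int] = {}
    uncertain: List[Cell] = []
    for cell, pred in predictions.items():
        if pred.confidence >= threshold:
            accepted[cell] = pred.digit
        else:
            uncertain.append(cell)
    return accepted, uncertain


def transcribe(
    predictions: Dict[Cell, Prediction],
    ask_human: Callable[[Cell], int],
    threshold: float = 0.9,  # illustrative value, not from the source
) -> Tuple[Dict[Cell, int], List[Tuple[Cell, int]]]:
    """Fill every cell, and collect human labels as feedback for retraining."""
    accepted, uncertain = route_cells(predictions, threshold)
    feedback: List[Tuple[Cell, int]] = []
    for cell in uncertain:
        label = ask_human(cell)         # stage 3: human transcription
        accepted[cell] = label
        feedback.append((cell, label))  # stage 4: examples for the classifier
    return accepted, feedback


if __name__ == "__main__":
    preds = {(0, 0): Prediction(7, 0.98), (0, 1): Prediction(3, 0.55)}
    # Stand-in for a crowd worker who reads the low-confidence cell as 8.
    table, feedback = transcribe(preds, ask_human=lambda cell: 8)
    print(table)     # {(0, 0): 7, (0, 1): 8}
    print(feedback)  # [((0, 1), 8)]
```

In this sketch only the low-confidence cell reaches the human, and the returned feedback pairs are what a later training step would consume.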
See: Images_to_spreadsheets_Public_Release.m for the implementation of the algorithm. The code is heavily commented.
See: Supplemental_Materials.pdf for additional information on how to adjust the settings of the algorithm.
Also, please feel free to contact me personally with questions: ghassemi(at)mit(dot)edu