Helsinki Metropolia University of Applied Sciences
Information Technology
Bachelor Thesis
Ari Bajo Rouvinen
The Theseus open repository contains metadata about more than 100,000 thesis publications from the universities of applied sciences in Finland. In this project, data mining techniques were applied to the Theseus dataset to build a web application for exploring thesis topics and degree programmes, using several Python and JavaScript libraries. Thesis topics were extracted from keywords manually annotated by the authors and from subjects curated by librarians. During the project, how well the keywords and subjects represent the thesis topics was evaluated, and several data quality issues were identified. Data mining techniques were used to collect, explore, clean, analyse, model and visualize the data.
Special focus was placed on comparing the results of different dimensionality reduction and clustering techniques for visualizing degrees that are similar in terms of their topics. t-SNE proved to be a powerful method for visualizing degrees on a 2-dimensional interactive map, and hierarchical clustering was found to be the most flexible technique for obtaining multiple clusterings at different levels.
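To illustrate what obtaining multiple clusterings at different levels means, the following is a minimal sketch of agglomerative (hierarchical) clustering in plain Python. It is not the implementation used in the thesis: the toy topic counts, the function names, the cosine distance and the average linkage are all assumptions chosen for illustration, and t-SNE is not covered here.

```python
import math
from itertools import combinations

# Hypothetical toy data: topic counts per degree programme (not taken from Theseus).
DEGREE_TOPICS = {
    "Information Technology": {"programming": 9, "web": 6, "data mining": 3},
    "Media Engineering": {"web": 7, "programming": 4, "design": 5},
    "Nursing": {"patient care": 8, "elderly": 5},
    "Social Services": {"elderly": 6, "children": 7, "patient care": 2},
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse topic-count vectors."""
    dot = sum(count * b.get(topic, 0) for topic, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(cosine_distance(data[x], data[y]) for x, y in pairs) / len(pairs)


def cluster_at_levels(data, levels):
    """Merge the two closest clusters repeatedly, keeping a snapshot of the
    clustering whenever the number of clusters is one of the requested levels."""
    clusters = [frozenset([name]) for name in data]
    snapshots = {}
    while True:
        if len(clusters) in levels:
            snapshots[len(clusters)] = set(clusters)
        if len(clusters) == 1:
            break
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], data),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return snapshots


if __name__ == "__main__":
    for k, clustering in sorted(cluster_at_levels(DEGREE_TOPICS, {2, 3}).items()):
        print(k, [sorted(c) for c in clustering])
```

A single run of the merge loop yields the clusterings for every requested level, which is the flexibility referred to above: the same hierarchy can be cut into a few broad groups or into many narrow ones.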
The application allows users to discover popular topics for a degree or university and popular degrees for a set of topics, as well as to explore related topics and related degrees. The work presented also serves as a foundation for future study of how topic popularity evolves over time and of the detection of trending topics.
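The two kinds of query described above can be sketched with simple counting. This is only an illustration under stated assumptions: the record layout, the sample theses and the function names `popular_topics` and `related_topics` are hypothetical, and treating topics that co-occur in the same thesis as related is one plausible definition, not necessarily the one used in the application.

```python
from collections import Counter

# Hypothetical sample records; the real Theseus metadata schema may differ.
THESES = [
    {"degree": "Information Technology", "topics": ["data mining", "python", "visualization"]},
    {"degree": "Information Technology", "topics": ["python", "web applications"]},
    {"degree": "Information Technology", "topics": ["data mining", "python"]},
    {"degree": "Nursing", "topics": ["patient care", "elderly"]},
]


def popular_topics(theses, degree, n=3):
    """Count how many theses of the given degree mention each topic."""
    counts = Counter(
        topic
        for thesis in theses
        if thesis["degree"] == degree
        for topic in set(thesis["topics"])
    )
    return counts.most_common(n)


def related_topics(theses, topic, n=3):
    """Count topics that appear in the same thesis as the given topic."""
    co_occurrences = Counter()
    for thesis in theses:
        topics = set(thesis["topics"])
        if topic in topics:
            co_occurrences.update(topics - {topic})
    return co_occurrences.most_common(n)


if __name__ == "__main__":
    print(popular_topics(THESES, "Information Technology", n=1))
    print(related_topics(THESES, "data mining", n=1))
```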