deliciouscat/Blog_Text_Embedding_and_Clustering

Blog_Text_Embedding_and_Clustering

This repository summarizes my thesis, "User Analysis Based on the Triplet Loss for an Advanced Recommendation System" (개선된 추천 시스템을 위한 삼중항 손실 기반 사용자 분석).

To recommend products suited to each user, a service provider needs to know that user's individual characteristics, such as personality, social class, and hobbies. Knowing these characteristics helps providers improve features and build personalized recommendation systems. However, collecting them through direct question-and-answer only works when users place high trust in the party collecting the data, or when the service is irreplaceable to them. On the other hand, users have a desire for self-expression, and many record that expression as blog posts. Because posts are written as text and shared voluntarily, their authors are naturally open about the information they provide.

In this study, we want to featurize the user characteristics hidden in users' own writing. To do so, we use a large-scale language model (LLM) as the foundation model. An LLM is a model that has learned common knowledge, including grammar and general information. The study proceeds by gradually deriving specific knowledge from general knowledge: from linguistic characteristics, to human characteristics, to specific characteristics. We narrow broad, general information down to specific information through a multilayer transfer learning model. The specific information includes characteristics, tastes, and possessed knowledge.

First, we extract semantic information from the text with a fine-tuned Sentence-BERT, which is based on the popular language model Bidirectional Encoder Representations from Transformers (BERT). Second, the author characteristics that can be inferred from this information are embedded by a model based on a Recurrent Neural Network. The user-characteristic embedding is learned with triplet loss, a metric learning method. Training pulls the embedding of the anchor (reference) text toward the positive (text written by the same user as the anchor) and pushes it away from the negative (text written by a different user), so that the anchor-positive distance becomes small relative to the anchor-negative distance. Through this process, texts written by people with similar user characteristics end up close together in Euclidean distance in the embedding space. In contrast to methods that assume in advance which characteristics matter, such as social group or gender, this approach finds distinctions between users directly from the text. We can therefore expect embeddings that behave like principal components, and we can derive qualitative insights about the similarity between users by analyzing the distribution of labels in the embedding space.
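
The anchor/positive/negative relationship described above can be illustrated with a minimal sketch of a hinge-style triplet loss. This is not the thesis code: the function name `triplet_loss`, the `margin` value, the use of non-squared Euclidean distance, and the toy 2-D vectors are all assumptions made for illustration, since the summary does not state these details.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    `margin` is an assumed hyperparameter; the summary above does not
    state the value (or the exact loss variant) used in the thesis.
    """
    d_pos = math.dist(anchor, positive)  # anchor vs. text by the same user
    d_neg = math.dist(anchor, negative)  # anchor vs. text by a different user
    # The loss is zero once the negative is at least `margin` farther
    # from the anchor than the positive is.
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (made-up numbers, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # pretend: written by the same user as the anchor
negative = [3.0, 4.0]  # pretend: written by a different user

# Positive is close and negative is far, so the constraint is satisfied.
loss_satisfied = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the constraint and yields a positive loss.
loss_violated = triplet_loss(anchor, negative, positive)

print(loss_satisfied, loss_violated)
```

Minimizing this quantity over many triplets is what moves same-user texts together and different-user texts apart. In practice the embeddings would come from the trained model rather than being written by hand, and a deep learning framework's built-in triplet loss would normally be used so that gradients are available.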

Beyond the content of the thesis, I also experimented with embedding text using a Graph Neural Network.