Simulation study on SimilarityClassifier #1371
-
Update: Performed additional simulations
-
Only top performing doc2vec/similarity:
-
Summary statistics excluding Skeletal Muscle Relaxant Dataset:
-
With the Similarity Classifier, ASReview can be scaled to millions of records using an index such as FAISS, a local database such as PostgreSQL with the pgvector extension, or a cloud-based vector database such as Pinecone. This could address the issues mentioned in #1009. Selecting or developing robust stopping criteria will be crucial in such a use case, as will choosing a suitable feature extraction method. The system could possibly be extended to use full texts or sections of full texts. I have tried FAISS with the Similarity Classifier on a few of the benchmark datasets and it works. I will update the Asreview-SimilarityClassifier extension soon with important modifications and more variations.
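To make the scaling idea a bit more concrete, here is a minimal sketch in plain Python (not FAISS, pgvector, or Pinecone code) of the kind of exact top-k search that a flat vector index performs. It relies on the fact that for unit-length vectors the inner product equals the cosine similarity, so records can be normalized once when they are stored. The function names and the toy corpus are hypothetical and only for illustration.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def top_k(query, records, k):
    """Return the k (score, record_id) pairs with the highest inner product.

    `records` is an iterable of (record_id, unit_vector) pairs, so the
    inner product with the normalized query is the cosine similarity.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), record_id)
        for record_id, vec in records
    )
    # nlargest keeps only k candidates in memory while streaming the records
    return heapq.nlargest(k, scored)


# Toy corpus: normalize once at "indexing" time (hypothetical data)
raw_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
corpus = [(i, normalize(v)) for i, v in enumerate(raw_vectors)]

hits = top_k([2.0, 0.0], corpus, k=2)
```

A dedicated index does the same comparison far more efficiently (and optionally approximately), which is what would make millions of records feasible.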
-
@rohitgarud the new datasets are available at https://github.com/asreview/synergy-dataset
-
The following table presents the results of a simulation study in which a similarity-based classifier (here, cosine similarity of the feature vectors) was developed and its performance on the benchmark datasets was compared with that of the NB classifier with TFIDF features. For the similarity classifier, doc2vec features were used with different vector sizes. The default doc2vec from ASReview generates a 40-dimensional vector. The Wide Doc2Vec Extension by @jteijema, which can be downloaded from its repository, generates a 120-dimensional vector.
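As a rough illustration of the similarity-based idea (this is not the extension's actual code), the sketch below ranks unlabeled records by their cosine similarity to records already labeled relevant. It assumes, purely for illustration, that a record's score is its highest similarity to any relevant record; the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_unlabeled(relevant_vectors, unlabeled_vectors):
    """Return unlabeled record ids sorted from most to least similar.

    Assumption: score = highest cosine similarity to any relevant record.
    """
    scores = {
        record_id: max(cosine_similarity(vec, rel) for rel in relevant_vectors)
        for record_id, vec in unlabeled_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional "feature vectors" (real doc2vec vectors would have 40 or 120 dims)
relevant = [[1.0, 0.0, 0.0]]
unlabeled = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
ranking = rank_unlabeled(relevant, unlabeled)
```

In an active learning loop, the top-ranked record would be shown to the reviewer next, and the set of relevant vectors would grow as labels come in.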
The similarity classifier gives good results on some of the datasets compared with NB/TFIDF, but its performance is not consistent. Interestingly, the classifier with wide doc2vec features (vector size = 120) performs worse than with the default doc2vec features (vector size = 40) on all but two of the datasets.
The SimilarityClassifier is developed as an ASReview model extension and can be downloaded/installed from the Asreview-SimilarityClassifier Repository. Simulations were performed using the ASReview Makita Extension with a 'multiple model template'.
After installing the SimilarityClassifier extension and the asreview-makita extension, we can run the following commands to generate the jobs file (for Windows) and then run the jobs.bat command to run the simulations.
For the Naive Bayes classifier with TFIDF features:
For the Similarity Classifier with doc2vec and wide_doc2vec features:
I am experimenting with different feature extraction techniques such as SBERT and will present the results soon.