Simulation study on SimilarityClassifier #1371

rohitgarud · 2023-03-25T11:12:01Z

rohitgarud
Mar 25, 2023

The following table presents the results of a simulation study in which a similarity-based classifier (using cosine similarity between feature vectors in this case) was developed and its performance on benchmark datasets was compared with that of the NB classifier with TFIDF features. For the similarity classifier, doc2vec features were used with different vector sizes. The default doc2vec feature extractor in ASReview generates a 40-dimensional vector. The Wide Doc2Vec Extension by @jteijema, which can be downloaded from its repository, generates a 120-dimensional vector.

The similarity classifier gives good results on some of the datasets compared to NB/TFIDF, but its performance is not consistent. Interestingly, the classifier with wide doc2vec features (vector size = 120) performs worse than with the default doc2vec features (vector size = 40) on almost all datasets, with the exception of two.

The SimilarityClassifier is developed as an ASReview model extension and can be downloaded/installed from the Asreview-SimilarityClassifier repository. Simulations were performed using the ASReview Makita extension with the 'multiple model template'.
After installing the SimilarityClassifier extension and the asreview-makita extension, run the following command to generate the jobs file (for Windows), then run jobs.bat to start the simulations.

For the Naive Bayes classifier with TFIDF features:

asreview makita template multiple_models --classifiers nb --feature_extractors tfidf -f jobs.bat

For the Similarity Classifier with doc2vec and wide_doc2vec features:

asreview makita template multiple_models --classifiers similarity --feature_extractors doc2vec wide_doc2vec -f jobs.bat

I am experimenting with different feature extraction techniques such as SBERT and will present the results soon.

rohitgarud · 2023-03-25T11:12:10Z

rohitgarud
Mar 25, 2023
Author

Dataset	Classifier	Features	Feature size	Similarity Metric	recall_0.5	wss_0.95	erf_0.1	atd
ACEInhibitors	Naïve Bayes	TFIDF			0.950	0.782	0.750	190.325
	Similarity	Doc2Vec	120	Cosine	0.975	0.563	0.500	331.850
			120	Dot Product	0.975	0.710	0.675	227.100
			120	Euclidean Distance	0.975	0.563	0.500	330.250
			200	Cosine	0.975	0.489	0.275	445.850
			200	Dot Product	0.975	0.687	0.650	255.625
			200	Euclidean Distance	0.975	0.486	0.275	445.925
			40	Cosine	0.950	0.622	0.675	239.625
			40	Dot Product	0.975	0.703	0.650	244.075
			40	Euclidean Distance	0.950	0.619	0.675	240.175
ADHD	Naïve Bayes	TFIDF			0.947	0.446	0.737	96.789
	Similarity	Doc2Vec	120	Cosine	0.895	0.259	0.474	135.737
			120	Dot Product	0.947	0.449	0.579	125.737
			120	Euclidean Distance	0.895	0.258	0.474	136.947
			200	Cosine	0.895	0.271	0.526	128.579
			200	Dot Product	0.947	0.438	0.579	116.842
			200	Euclidean Distance	0.895	0.271	0.526	128.737
			40	Cosine	0.895	0.314	0.474	122.474
			40	Dot Product	0.895	0.420	0.526	134.421
			40	Euclidean Distance	0.895	0.316	0.474	123.684
Antihistamines	Naïve Bayes	TFIDF			0.867	0.091	0.267	72.333
	Similarity	Doc2Vec	120	Cosine	0.800	0.305	0.333	69.600
			120	Dot Product	0.733	0.192	0.133	100.467
			120	Euclidean Distance	0.800	0.305	0.333	69.400
			200	Cosine	0.800	0.351	0.333	69.133
			200	Dot Product	0.733	0.195	0.200	95.733
			200	Euclidean Distance	0.800	0.351	0.333	69.133
			40	Cosine	0.800	0.315	0.333	73.133
			40	Dot Product	0.733	0.140	0.133	99.200
			40	Euclidean Distance	0.800	0.312	0.333	73.267
Appenzeller-Herzog_2020	Naïve Bayes	TFIDF			0.964	0.723	0.714	263.464
	Similarity	Doc2Vec	120	Cosine	1.000	0.565	0.286	540.464
			120	Dot Product	0.893	0.188	0.750	420.250
			120	Euclidean Distance	1.000	0.563	0.321	542.071
			200	Cosine	0.964	0.588	0.000	815.143
			200	Dot Product	0.893	0.192	0.750	416.536
			200	Euclidean Distance	0.964	0.589	0.000	815.929
			40	Cosine	1.000	0.649	0.607	247.893
			40	Dot Product	0.893	0.306	0.786	381.536
			40	Euclidean Distance	1.000	0.650	0.607	248.750
AtypicalAntipsychotics	Naïve Bayes	TFIDF			0.869	0.168	0.255	275.124
	Similarity	Doc2Vec	120	Cosine	0.786	0.197	0.131	347.400
			120	Dot Product	0.841	0.212	0.179	297.228
			120	Euclidean Distance	0.786	0.200	0.131	348.655
			200	Cosine	0.759	0.154	0.103	369.255
			200	Dot Product	0.828	0.208	0.193	301.200
			200	Euclidean Distance	0.759	0.154	0.103	368.745
			40	Cosine	0.786	0.232	0.166	330.986
			40	Dot Product	0.828	0.257	0.152	306.586
			40	Euclidean Distance	0.786	0.231	0.166	329.938
Bannach-Brown_2019	Naïve Bayes	TFIDF			0.932	0.404	0.423	293.022
	Similarity	Doc2Vec	120	Cosine	0.957	0.474	0.176	422.082
			120	Dot Product	0.864	0.067	0.297	448.950
			120	Euclidean Distance	0.957	0.474	0.176	421.853
			200	Cosine	0.953	0.461	0.133	438.401
			200	Dot Product	0.860	0.055	0.297	453.642
			200	Euclidean Distance	0.953	0.463	0.136	438.806
			40	Cosine	0.975	0.522	0.376	294.276
			40	Dot Product	0.882	0.071	0.326	424.918
			40	Euclidean Distance	0.975	0.521	0.373	294.011
BetaBlockers	Naïve Bayes	TFIDF			0.927	0.669	0.610	241.878
	Similarity	Doc2Vec	120	Cosine	0.902	0.309	0.439	373.683
			120	Dot Product	0.951	0.544	0.512	298.073
			120	Euclidean Distance	0.902	0.310	0.439	373.854
			200	Cosine	0.878	0.309	0.317	437.634
			200	Dot Product	0.951	0.506	0.488	328.512
			200	Euclidean Distance	0.878	0.309	0.317	437.049
			40	Cosine	0.878	0.379	0.512	332.268
			40	Dot Product	0.927	0.540	0.512	293.585
			40	Euclidean Distance	0.878	0.379	0.512	326.049
Bos_2018	Naïve Bayes	TFIDF			1.000	0.834	0.900	76.000
	Similarity	Doc2Vec	120	Cosine	1.000	0.756	0.900	226.900
			120	Dot Product	1.000	0.788	0.800	200.100
			120	Euclidean Distance	1.000	0.757	0.900	225.000
			200	Cosine	1.000	0.728	0.700	300.200
			200	Dot Product	1.000	0.775	0.900	262.800
			200	Euclidean Distance	1.000	0.729	0.700	300.000
			40	Cosine	1.000	0.804	0.800	159.500
			40	Dot Product	1.000	0.782	0.900	227.100
			40	Euclidean Distance	1.000	0.802	0.800	161.600
CalciumChannelBlockers	Naïve Bayes	TFIDF			0.899	0.229	0.172	268.364
	Similarity	Doc2Vec	120	Cosine	0.737	0.068	0.071	453.869
			120	Dot Product	0.818	0.065	0.101	398.677
			120	Euclidean Distance	0.737	0.068	0.061	454.657
			200	Cosine	0.788	0.063	0.061	449.636
			200	Dot Product	0.838	0.057	0.101	366.202
			200	Euclidean Distance	0.788	0.066	0.061	449.475
			40	Cosine	0.778	0.057	0.121	441.051
			40	Dot Product	0.778	0.109	0.273	367.747
			40	Euclidean Distance	0.768	0.057	0.101	442.273
Estrogens	Naïve Bayes	TFIDF			0.911	0.249	0.177	89.418
	Similarity	Doc2Vec	120	Cosine	0.861	0.216	0.076	116.962
			120	Dot Product	0.823	0.167	0.089	122.405
			120	Euclidean Distance	0.861	0.216	0.076	117.392
			200	Cosine	0.848	0.202	0.063	118.203
			200	Dot Product	0.823	0.134	0.076	122.266
			200	Euclidean Distance	0.848	0.202	0.063	117.241
			40	Cosine	0.835	0.216	0.076	121.152
			40	Dot Product	0.785	0.153	0.063	126.468
			40	Euclidean Distance	0.848	0.199	0.076	121.532
Hall_2012	Naïve Bayes	TFIDF			1.000	0.911	0.893	134.272
	Similarity	Doc2Vec	120	Cosine	1.000	0.905	0.893	125.340
			120	Dot Product	1.000	0.884	0.883	190.233
			120	Euclidean Distance	1.000	0.905	0.893	125.631
			200	Cosine	1.000	0.901	0.893	144.680
			200	Dot Product	1.000	0.885	0.874	190.806
			200	Euclidean Distance	1.000	0.901	0.893	145.612
			40	Cosine	1.000	0.896	0.903	133.350
			40	Dot Product	1.000	0.868	0.893	216.738
			40	Euclidean Distance	1.000	0.895	0.903	133.932
Kitchenham_2010	Naïve Bayes	TFIDF			0.977	0.652	0.568	200.136
	Similarity	Doc2Vec	120	Cosine	0.955	0.472	0.545	198.636
			120	Dot Product	0.977	0.589	0.500	222.795
			120	Euclidean Distance	0.955	0.475	0.545	198.682
			200	Cosine	0.977	0.515	0.545	214.909
			200	Dot Product	1.000	0.562	0.455	227.977
			200	Euclidean Distance	0.977	0.516	0.545	214.477
			40	Cosine	1.000	0.680	0.477	199.273
			40	Dot Product	0.977	0.606	0.409	222.045
			40	Euclidean Distance	1.000	0.680	0.477	199.795
Kwok_2020	Naïve Bayes	TFIDF			0.992	0.675	0.588	227.092
	Similarity	Doc2Vec	120	Cosine	0.958	0.476	0.361	389.748
			120	Dot Product	0.908	0.296	0.387	471.025
			120	Euclidean Distance	0.958	0.476	0.361	390.025
			200	Cosine	0.950	0.451	0.235	444.613
			200	Dot Product	0.891	0.223	0.370	489.672
			200	Euclidean Distance	0.950	0.451	0.227	445.000
			40	Cosine	0.966	0.550	0.462	333.008
			40	Dot Product	0.916	0.238	0.345	474.555
			40	Euclidean Distance	0.966	0.549	0.462	332.025
Nagtegaal_2019	Naïve Bayes	TFIDF			0.990	0.529	0.540	237.190
	Similarity	Doc2Vec	120	Cosine	0.980	0.554	0.430	259.370
			120	Dot Product	0.980	0.568	0.550	231.720
			120	Euclidean Distance	0.980	0.555	0.410	259.580
			200	Cosine	0.970	0.588	0.280	286.620
			200	Dot Product	0.980	0.582	0.560	238.420
			200	Euclidean Distance	0.970	0.588	0.290	286.240
			40	Cosine	0.960	0.536	0.560	205.710
			40	Dot Product	0.990	0.628	0.530	216.210
			40	Euclidean Distance	0.960	0.538	0.550	206.140
NSAIDS	Naïve Bayes	TFIDF			0.975	0.698	0.500	46.775
	Similarity	Doc2Vec	120	Cosine	0.950	0.547	0.250	70.850
			120	Dot Product	0.950	0.488	0.150	77.600
			120	Euclidean Distance	0.950	0.545	0.250	70.875
			200	Cosine	0.925	0.427	0.325	70.950
			200	Dot Product	0.950	0.463	0.125	77.375
			200	Euclidean Distance	0.925	0.427	0.325	71.125
			40	Cosine	0.950	0.575	0.350	59.475
			40	Dot Product	0.950	0.504	0.100	76.925
			40	Euclidean Distance	0.950	0.570	0.375	58.950
Opiods	Naïve Bayes	TFIDF			0.929	0.467	-0.071	717.857
	Similarity	Doc2Vec	120	Cosine	0.857	0.299	-0.071	535.786
			120	Dot Product	1.000	0.548	0.714	179.857
			120	Euclidean Distance	0.929	0.410	0.643	246.929
			200	Cosine	0.929	0.442	-0.071	773.929
			200	Dot Product	0.929	0.589	0.786	166.214
			200	Euclidean Distance	0.929	0.476	0.000	716.786
			40	Cosine	0.929	0.497	0.714	193.929
			40	Dot Product	1.000	0.686	0.714	122.429
			40	Euclidean Distance	0.929	0.496	0.714	200.071
OralHypoglycemics	Naïve Bayes	TFIDF			0.770	0.190	0.141	155.444
	Similarity	Doc2Vec	120	Cosine	0.667	0.076	0.037	198.096
			120	Dot Product	0.741	0.100	0.081	180.770
			120	Euclidean Distance	0.667	0.076	0.037	197.652
			200	Cosine	0.681	0.068	0.089	189.296
			200	Dot Product	0.711	0.100	0.030	187.281
			200	Euclidean Distance	0.667	0.068	0.089	189.452
			40	Cosine	0.696	0.050	0.030	193.763
			40	Dot Product	0.726	0.116	0.037	188.622
			40	Euclidean Distance	0.696	0.050	0.030	193.504
ProtonPumpInhibitors	Naïve Bayes	TFIDF			0.940	0.481	0.560	191.960
	Similarity	Doc2Vec	120	Cosine	0.940	0.465	-0.040	315.400
			120	Dot Product	0.900	0.394	0.440	254.000
			120	Euclidean Distance	0.940	0.465	-0.040	316.320
			200	Cosine	0.920	0.340	-0.040	342.580
			200	Dot Product	0.880	0.374	0.460	257.760
			200	Euclidean Distance	0.920	0.340	-0.040	342.660
			40	Cosine	0.900	0.331	0.480	226.240
			40	Dot Product	0.920	0.314	0.000	304.760
			40	Euclidean Distance	0.900	0.331	0.480	226.320
Radjenovic_2013	Naïve Bayes	TFIDF			1.000	0.841	0.851	189.574
	Similarity	Doc2Vec	120	Cosine	1.000	0.827	0.830	234.170
			120	Dot Product	0.979	0.792	0.681	428.319
			120	Euclidean Distance	1.000	0.827	0.830	236.426
			200	Cosine	1.000	0.782	0.745	306.362
			200	Dot Product	0.979	0.772	0.681	511.426
			200	Euclidean Distance	1.000	0.782	0.745	307.809
			40	Cosine	1.000	0.826	0.830	240.723
			40	Dot Product	0.957	0.746	0.660	470.681
			40	Euclidean Distance	1.000	0.826	0.830	240.532
SkeletalMuscleRelaxants	Naïve Bayes	TFIDF			0.625	-0.015	-0.125	782.625
	Similarity	Doc2Vec	120	Cosine	0.125	-0.144	-0.125	1175.500
			120	Dot Product	0.125	-0.169	-0.125	976.750
			120	Euclidean Distance	0.125	-0.183	-0.125	1188.125
			200	Cosine	0.125	-0.180	-0.125	1221.000
			200	Dot Product	0.500	-0.178	-0.125	963.625
			200	Euclidean Distance	0.125	-0.172	-0.125	1217.875
			40	Cosine	0.250	-0.182	-0.125	1221.500
			40	Dot Product	0.250	-0.168	-0.125	983.125
			40	Euclidean Distance	0.250	-0.163	-0.125	1219.375
Statins	Naïve Bayes	TFIDF			0.940	0.471	0.631	445.036
	Similarity	Doc2Vec	120	Cosine	0.810	0.258	0.250	890.417
			120	Dot Product	0.893	0.341	0.536	554.536
			120	Euclidean Distance	0.810	0.258	0.250	891.131
			200	Cosine	0.786	0.238	0.202	1020.083
			200	Dot Product	0.893	0.290	0.524	595.321
			200	Euclidean Distance	0.786	0.237	0.202	1019.964
			40	Cosine	0.833	0.167	0.405	714.440
			40	Dot Product	0.881	0.326	0.524	628.798
			40	Euclidean Distance	0.833	0.166	0.417	712.190
Triptans	Naïve Bayes	TFIDF			0.957	0.520	0.478	98.565
	Similarity	Doc2Vec	120	Cosine	0.783	0.238	0.435	140.783
			120	Dot Product	0.957	0.435	0.522	99.478
			120	Euclidean Distance	0.783	0.238	0.435	141.304
			200	Cosine	0.783	0.227	0.435	147.739
			200	Dot Product	0.913	0.433	0.478	102.739
			200	Euclidean Distance	0.783	0.227	0.435	147.739
			40	Cosine	0.826	0.236	0.478	134.043
			40	Dot Product	0.870	0.386	0.435	107.391
			40	Euclidean Distance	0.826	0.236	0.478	134.304
UrinaryIncontinence	Naïve Bayes	TFIDF			0.872	0.382	0.359	60.923
	Similarity	Doc2Vec	120	Cosine	0.872	0.369	0.000	83.179
			120	Dot Product	0.923	0.437	0.000	79.205
			120	Euclidean Distance	0.846	0.369	0.051	81.077
			200	Cosine	0.846	0.317	-0.077	105.821
			200	Dot Product	0.949	0.440	0.051	73.692
			200	Euclidean Distance	0.846	0.317	-0.077	106.564
			40	Cosine	0.897	0.320	-0.077	104.256
			40	Dot Product	0.923	0.437	0.103	70.333
			40	Euclidean Distance	0.897	0.320	-0.077	104.282
van_de_Schoot_2017	Naïve Bayes	TFIDF			1.000	0.888	0.905	101.571
	Similarity	Doc2Vec	120	Cosine	1.000	0.796	0.667	398.571
			120	Dot Product	0.976	0.797	0.786	323.833
			120	Euclidean Distance	1.000	0.798	0.667	398.119
			200	Cosine	0.976	0.757	0.476	578.429
			200	Dot Product	0.976	0.807	0.810	315.976
			200	Euclidean Distance	0.976	0.757	0.476	578.524
			40	Cosine	1.000	0.846	0.881	174.738
			40	Dot Product	1.000	0.814	0.810	329.167
			40	Euclidean Distance	1.000	0.846	0.881	173.833
van_Dis_2020	Naïve Bayes	TFIDF			0.972	0.650	0.556	1191.264
	Similarity	Doc2Vec	120	Cosine	0.972	0.572	0.417	1371.111
			120	Dot Product	0.889	0.114	0.389	2219.417
			120	Euclidean Distance	0.972	0.572	0.389	1371.667
			200	Cosine	0.958	0.473	0.431	1402.361
			200	Dot Product	0.903	0.055	0.347	2476.819
			200	Euclidean Distance	0.958	0.474	0.431	1404.500
			40	Cosine	0.986	0.639	0.347	1347.986
			40	Dot Product	0.903	0.184	0.347	2393.514
			40	Euclidean Distance	0.986	0.640	0.347	1342.597
Wahono_2015	Naïve Bayes	TFIDF			1.000	0.863	0.869	197.803
	Similarity	Doc2Vec	120	Cosine	1.000	0.879	0.902	166.869
			120	Dot Product	1.000	0.839	0.885	211.131
			120	Euclidean Distance	1.000	0.879	0.902	167.607
			200	Cosine	1.000	0.858	0.885	187.902
			200	Dot Product	1.000	0.845	0.852	208.951
			200	Euclidean Distance	1.000	0.858	0.885	187.918
			40	Cosine	1.000	0.868	0.902	171.656
			40	Dot Product	1.000	0.828	0.836	256.033
			40	Euclidean Distance	1.000	0.868	0.902	171.311
Wolters_2018	Naïve Bayes	TFIDF			1.000	0.794	0.778	162.222
	Similarity	Doc2Vec	120	Cosine	1.000	0.643	0.611	577.556
			120	Dot Product	1.000	0.750	0.667	420.000
			120	Euclidean Distance	1.000	0.643	0.611	579.000
			200	Cosine	1.000	0.692	0.222	682.444
			200	Dot Product	1.000	0.777	0.667	345.944
			200	Euclidean Distance	1.000	0.691	0.222	680.889
			40	Cosine	1.000	0.787	0.778	241.389
			40	Dot Product	1.000	0.734	0.667	276.389
			40	Euclidean Distance	1.000	0.787	0.778	241.611

2 replies

J535D165 Mar 26, 2023
Maintainer

Thanks for this great table and simulation. Very interesting. It might be cool to add the average of datasets as well :)

rohitgarud Mar 26, 2023
Author

Thank you for your response, @J535D165. I wanted to add some summary statistics, but there are a few negative values as well, so I was not sure whether the statistics would be accurate and representative. Do you have any insights into the type of features I should try next for the similarity classifier, or is an empirical study the only way to find out what works and what doesn't?

Can you please also comment on the approach proposed in #1344?

rohitgarud · 2023-04-03T08:43:30Z

rohitgarud
Apr 3, 2023
Author

Update: Performed additional simulations

Added doc2vec with a feature vector dimension of 200
Added two more similarity metrics: the dot product of unnormalised vectors and the Euclidean distance between L2-normalised vectors
(Note: if we normalise the vectors and then take the dot product, it becomes the cosine similarity)
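
The note above can be checked numerically. The sketch below uses two arbitrary example vectors (not taken from the simulations) and also shows a related identity: for L2-normalised vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so ranking by smallest distance gives the same order as ranking by largest cosine similarity. That is consistent with the Cosine and Euclidean Distance rows in the tables being nearly identical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Arbitrary illustrative vectors, not real doc2vec features.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = l2_normalise(a), l2_normalise(b)

cos_ab = cosine(a, b)
dot_normalised = dot(a_hat, b_hat)        # equals cos_ab
dist_sq = euclidean(a_hat, b_hat) ** 2    # equals 2 - 2 * cos_ab

print(cos_ab, dot_normalised, dist_sq)
```

The dot product of the unnormalised vectors, by contrast, also depends on vector length, which is why it can rank records differently.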

rohitgarud · 2023-04-03T09:07:55Z

rohitgarud
Apr 3, 2023
Author

Only the top-performing doc2vec/similarity combinations:

Dataset	Classifier	Features	Feature size	Similarity Metric	recall_0.5	wss_0.95	erf_0.1	atd
ACEInhibitors	Naïve Bayes	TFIDF			0.950	0.782	0.750	190.325
	Similarity	Doc2Vec	120	Dot Product	0.975	0.710	0.675	227.100
			40	Dot Product	0.975	0.703	0.650	244.075
ADHD	Naïve Bayes	TFIDF			0.947	0.446	0.737	96.789
	Similarity	Doc2Vec	120	Dot Product	0.947	0.449	0.579	125.737
			200	Dot Product	0.947	0.438	0.579	116.842
Antihistamines	Naïve Bayes	TFIDF			0.867	0.091	0.267	72.333
	Similarity	Doc2Vec	200	Cosine	0.800	0.351	0.333	69.133
			200	Euclidean Distance	0.800	0.351	0.333	69.133
Appenzeller-Herzog_2020	Naïve Bayes	TFIDF			0.964	0.723	0.714	263.464
	Similarity	Doc2Vec	40	Cosine	1.000	0.649	0.607	247.893
			40	Euclidean Distance	1.000	0.650	0.607	248.750
AtypicalAntipsychotics	Naïve Bayes	TFIDF			0.869	0.168	0.255	275.124
	Similarity	Doc2Vec	40	Cosine	0.786	0.232	0.166	330.986
			40	Dot Product	0.828	0.257	0.152	306.586
Bannach-Brown_2019	Naïve Bayes	TFIDF			0.932	0.404	0.423	293.022
	Similarity	Doc2Vec	40	Cosine	0.975	0.522	0.376	294.276
			40	Euclidean Distance	0.975	0.521	0.373	294.011
BetaBlockers	Naïve Bayes	TFIDF			0.927	0.669	0.610	241.878
	Similarity	Doc2Vec	120	Dot Product	0.951	0.544	0.512	298.073
			40	Dot Product	0.927	0.540	0.512	293.585
Bos_2018	Naïve Bayes	TFIDF			1.000	0.834	0.900	76.000
	Similarity	Doc2Vec	40	Cosine	1.000	0.804	0.800	159.500
			40	Euclidean Distance	1.000	0.802	0.800	161.600
CalciumChannelBlockers	Naïve Bayes	TFIDF			0.899	0.229	0.172	268.364
	Similarity	Doc2Vec	120	Cosine	0.737	0.068	0.071	453.869
			40	Dot Product	0.778	0.109	0.273	367.747
Estrogens	Naïve Bayes	TFIDF			0.911	0.249	0.177	89.418
	Similarity	Doc2Vec	120	Cosine	0.861	0.216	0.076	116.962
			120	Euclidean Distance	0.861	0.216	0.076	117.392
			40	Cosine	0.835	0.216	0.076	121.152
Hall_2012	Naïve Bayes	TFIDF			1.000	0.911	0.893	134.272
	Similarity	Doc2Vec	120	Cosine	1.000	0.905	0.893	125.340
			120	Euclidean Distance	1.000	0.905	0.893	125.631
Kitchenham_2010	Naïve Bayes	TFIDF			0.977	0.652	0.568	200.136
	Similarity	Doc2Vec	40	Cosine	1.000	0.680	0.477	199.273
			40	Euclidean Distance	1.000	0.680	0.477	199.795
Kwok_2020	Naïve Bayes	TFIDF			0.992	0.675	0.588	227.092
	Similarity	Doc2Vec	40	Cosine	0.966	0.550	0.462	333.008
			40	Euclidean Distance	0.966	0.549	0.462	332.025
Nagtegaal_2019	Naïve Bayes	TFIDF			0.990	0.529	0.540	237.190
	Similarity	Doc2Vec	200	Cosine	0.970	0.588	0.280	286.620
			200	Euclidean Distance	0.970	0.588	0.290	286.240
			40	Dot Product	0.990	0.628	0.530	216.210
NSAIDS	Naïve Bayes	TFIDF			0.975	0.698	0.500	46.775
	Similarity	Doc2Vec	40	Cosine	0.950	0.575	0.350	59.475
			40	Euclidean Distance	0.950	0.570	0.375	58.950
Opiods	Naïve Bayes	TFIDF			0.929	0.467	-0.071	717.857
	Similarity	Doc2Vec	200	Dot Product	0.929	0.589	0.786	166.214
			40	Dot Product	1.000	0.686	0.714	122.429
OralHypoglycemics	Naïve Bayes	TFIDF			0.770	0.190	0.141	155.444
	Similarity	Doc2Vec	120	Dot Product	0.741	0.100	0.081	180.770
			200	Dot Product	0.711	0.100	0.030	187.281
			40	Dot Product	0.726	0.116	0.037	188.622
ProtonPumpInhibitors	Naïve Bayes	TFIDF			0.940	0.481	0.560	191.960
	Similarity	Doc2Vec	120	Cosine	0.940	0.465	-0.040	315.400
			120	Euclidean Distance	0.940	0.465	-0.040	316.320
Radjenovic_2013	Naïve Bayes	TFIDF			1.000	0.841	0.851	189.574
	Similarity	Doc2Vec	120	Cosine	1.000	0.827	0.830	234.170
			120	Euclidean Distance	1.000	0.827	0.830	236.426
			40	Cosine	1.000	0.826	0.830	240.723
			40	Euclidean Distance	1.000	0.826	0.830	240.532
SkeletalMuscleRelaxants	Naïve Bayes	TFIDF			0.625	-0.015	-0.125	782.625
	Similarity	Doc2Vec	120	Cosine	0.125	-0.144	-0.125	1175.500
			200	Dot Product	0.500	-0.178	-0.125	963.625
			40	Dot Product	0.250	-0.168	-0.125	983.125
Statins	Naïve Bayes	TFIDF			0.940	0.471	0.631	445.036
	Similarity	Doc2Vec	120	Dot Product	0.893	0.341	0.536	554.536
			40	Dot Product	0.881	0.326	0.524	628.798
Triptans	Naïve Bayes	TFIDF			0.957	0.520	0.478	98.565
	Similarity	Doc2Vec	120	Dot Product	0.957	0.435	0.522	99.478
			200	Dot Product	0.913	0.433	0.478	102.739
UrinaryIncontinence	Naïve Bayes	TFIDF			0.872	0.382	0.359	60.923
	Similarity	Doc2Vec	120	Dot Product	0.923	0.437	0.000	79.205
			200	Dot Product	0.949	0.440	0.051	73.692
			40	Dot Product	0.923	0.437	0.103	70.333
van_de_Schoot_2017	Naïve Bayes	TFIDF			1.000	0.888	0.905	101.571
	Similarity	Doc2Vec	40	Cosine	1.000	0.846	0.881	174.738
			40	Euclidean Distance	1.000	0.846	0.881	173.833
van_Dis_2020	Naïve Bayes	TFIDF			0.972	0.650	0.556	1191.264
	Similarity	Doc2Vec	40	Cosine	0.986	0.639	0.347	1347.986
			40	Euclidean Distance	0.986	0.640	0.347	1342.597
Wahono_2015	Naïve Bayes	TFIDF			1.000	0.863	0.869	197.803
	Similarity	Doc2Vec	120	Cosine	1.000	0.879	0.902	166.869
			120	Euclidean Distance	1.000	0.879	0.902	167.607
Wolters_2018	Naïve Bayes	TFIDF			1.000	0.794	0.778	162.222
	Similarity	Doc2Vec	40	Cosine	1.000	0.787	0.778	241.389
			40	Euclidean Distance	1.000	0.787	0.778	241.611

rohitgarud · 2023-04-05T07:07:35Z

rohitgarud
Apr 5, 2023
Author

Summary statistics, excluding the SkeletalMuscleRelaxants dataset:

Classifier	Features	Feature size	Similarity Metric	recall_0.5	wss_0.95	erf_0.1	atd
Naïve Bayes	TFIDF			0.945	0.562	0.544	239.400
Similarity	Doc2Vec	120	Cosine	0.910	0.465	0.381	345.170
		120	Dot Product	0.920	0.452	0.473	337.804
		120	Euclidean Distance	0.912	0.469	0.409	334.312
		200	Cosine	0.906	0.450	0.311	402.721
		200	Dot Product	0.915	0.440	0.473	353.297
		200	Euclidean Distance	0.906	0.451	0.314	400.627
		40	Cosine	0.917	0.497	0.487	270.628
		40	Dot Product	0.912	0.458	0.451	344.624
		40	Euclidean Distance	0.917	0.496	0.487	270.488
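
A minimal sketch of how excluding a dataset changes an average, assuming a plain arithmetic mean over datasets (how the summary values above were actually aggregated is not stated in this thread). The four wss_0.95 values are the NB/TFIDF rows for those datasets from the full results table; the helper name `mean_metric` is hypothetical.

```python
from statistics import mean

# (dataset, wss_0.95) for NB/TFIDF, copied from the full results table.
# Only four datasets are used here, so the means differ from the summary table.
rows = [
    ("ACEInhibitors", 0.782),
    ("ADHD", 0.446),
    ("Antihistamines", 0.091),
    ("SkeletalMuscleRelaxants", -0.015),
]


def mean_metric(rows, exclude=()):
    """Arithmetic mean of the metric, skipping any dataset named in `exclude`."""
    values = [value for name, value in rows if name not in exclude]
    return mean(values)


mean_all = mean_metric(rows)
mean_without = mean_metric(rows, exclude={"SkeletalMuscleRelaxants"})

print(f"all four datasets:      {mean_all:.3f}")
print(f"without excluded entry: {mean_without:.3f}")
```

Negative values do not break the arithmetic mean, but a single outlying dataset can pull it noticeably, so reporting the mean both with and without that dataset makes the effect visible.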

3 replies

jteijema Apr 17, 2023
Collaborator

Keep up the good work, and I'm looking forward to seeing any updates or insights you might have from further experimentation. In the meantime, do you advise taking any of these into account for a bigger simulation study?

rohitgarud Apr 17, 2023
Author

@jteijema Thank you. Sure, the Similarity classifier can be used with datasets of any size. I think vector size 40 with cosine similarity is the best-performing combination of parameters. This is what I observed in other simulations I have performed, which are not presented here, and it can also be seen in the aggregated results table.
I will be adding a few things to the classifier. Currently, the relevant features are aggregated only by summing them into a resultant vector; I am looking into other ways of aggregation. Also, the irrelevant features are not taken into consideration at the moment, which can also be explored. Please let me know if you have any suggestions.
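
The aggregation described above can be sketched roughly as follows. This is an illustration only, not the extension's actual code: the function names and the toy 2-dimensional vectors are assumptions, and irrelevant records are ignored, as described.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors (no zero-norm guard in this toy)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resultant(vectors):
    """Element-wise sum of the feature vectors of the relevant records."""
    return [sum(component) for component in zip(*vectors)]


def rank_unlabelled(relevant, unlabelled):
    """Indices of unlabelled records, most similar to the resultant first."""
    target = resultant(relevant)
    scores = [cosine(target, vec) for vec in unlabelled]
    return sorted(range(len(unlabelled)), key=lambda i: scores[i], reverse=True)


# Toy feature vectors standing in for doc2vec output.
relevant = [[1.0, 0.0], [0.8, 0.2]]
unlabelled = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

order = rank_unlabelled(relevant, unlabelled)
print(order)
```

Swapping `resultant` for another aggregate (for example a mean, or a maximum similarity over individual relevant records) would be one way to try the other aggregation options mentioned.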

rohitgarud Apr 21, 2023
Author

Using SimpleBalance (no balancing) gives better results because, when calculating the resultant, each relevant feature vector is counted only once. With DoubleBalance, the relevant features are duplicated by oversampling, which can skew the resultant of relevant features towards heavily oversampled records.
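
A toy illustration of that skew, under the assumption that oversampling simply repeats a record's feature vector before the summation; the duplication factor of 5 is arbitrary, and this does not reproduce how DoubleBalance actually resamples.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def resultant(vectors):
    """Element-wise sum of feature vectors."""
    return [sum(component) for component in zip(*vectors)]


# Two relevant records pointing in different directions (toy vectors).
record_a = [1.0, 0.0]
record_b = [0.0, 1.0]

# Each record counted once: the resultant sits midway between them.
once = resultant([record_a, record_b])

# record_a repeated five times, as an oversampler might do.
oversampled = resultant([record_a] * 5 + [record_b])

sim_once = cosine(once, record_a)
sim_over = cosine(oversampled, record_a)
print(sim_once, sim_over)
```

With the duplicates, the resultant points much closer to `record_a`, so records resembling `record_b` would be ranked lower than they would be without oversampling.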

rohitgarud · 2023-04-21T04:48:46Z

rohitgarud
Apr 21, 2023
Author

With the Similarity Classifier, ASReview can be scaled to millions of records using an index like FAISS, local databases like PostgreSQL with the pgvector extension, or cloud-based vector databases like Pinecone. This can potentially address the issues mentioned in #1009. The selection/development of robust stopping criteria will be crucial in such a use case. Also, selecting a proper feature extraction method will be important. The system could possibly be extended to use full texts or sections of the full texts.

I have tried FAISS with the Similarity Classifier on a few of the benchmark datasets, and it works. I will update the Asreview-SimilarityClassifier extension soon with important modifications and more variations.
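
For context, the query that a vector index or vector database serves here is a top-k similarity search against the resultant vector. The sketch below does that search by brute force in plain Python; it deliberately does not use the FAISS, pgvector, or Pinecone APIs, and the function name and toy vectors are assumptions. An index would answer the same query without scanning every record.

```python
import heapq


def top_k_inner_product(query, vectors, k):
    """Return the k (score, index) pairs with the largest inner product to `query`."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)


# Toy unit-length vectors; with unit vectors the inner product is the cosine similarity.
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
query = [1.0, 0.0]

hits = top_k_inner_product(query, vectors, k=2)
print(hits)
```

Because only the top-ranked records are needed at each screening step, the full ranking of all records never has to be materialised, which is what makes this approach attractive at scale.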

Rensvandeschoot · 2023-04-25T08:30:14Z

Rensvandeschoot
Apr 25, 2023
Maintainer

@rohitgarud the new datasets are available in https://github.com/asreview/synergy-dataset

1 reply

rohitgarud Apr 25, 2023
Author

Thank you @Rensvandeschoot. I will check it out and run simulations using the dataset. Will present the results here.
