Skip to content

Stemmer Similarity Metric (SSM)

Electronart edited this page Dec 12, 2018 · 7 revisions

Inverse Mean MHD (Inverse of the mean Modified Hamming Distance)

"The stemmer similarity metric M for a pair of stemming algorithms A1 and A2, given a word list W, is the inverse of the mean modified Hamming distance (d) for all words in the word list, M(A1,A2,W) = n/Σ d(xi,yi), for i ranging from 1 to n, where n is the size of W and for all words wi in W, xi is the result of the application of A1 to wi and yi is the is the result of the application of A2 to wi. The inverse of the mean is used so that more similar algorithms will have higher values of M.

For example, suppose W = {brittle, engineered, fairies} and that stemming algorithm A1 produces the stems {brit, engineer, fairy} from W, and stemming algorithm A2 produces {britt, engineered, fairi} from W. Then M(A1,A2,W) is computed by dividing the wordlist size (3) by the sum of the Modified Hamming Distances between the stems produced by each stemmer for each word (for example, the Modified Hamming Distance between brit and britt is 1). The result is 3/(1+2+1) = 0.75 as our measure of the similarity of algorithms A1 and A2." Frakes and Fox Ref: W1

Verification

Test files SSM1A and SSM2 are based on the above and using SSM1A as File A and SSM2 as File B will give an Inverse Mean MHD of 0.75 and will list MHD as 1,2,1.

Improved SSM (SSM*)

The Inverse Mean MHD method suggested above leaves the following problems: It would be better if the result could be normalised to give a value between 1 and 0, The above metric fails to take into consideration the overall lengths of the two strings. Thus, removing ‘s' from the word ‘dogs' to give ‘dog', is a removal of 1/4 which is 25% of the word, and would have a Hamming distance of 1. Removing the suffix ‘s' from the word ‘methylenedioxymethamphetamins' to give ‘methylenedioxymethamphetamin' is a removal of 1/29 which is 3.4% of the word, yet this also gives a Modified Hamming Distance (MHD) of 1.

A new metric was therefore developed which, though still based on the Hamming distance, also takes into account the lengths of the words being compared. It uses the idea that there is a maximum and minimum Modified Hamming Distance that can be obtained between two strings. The minimum distance is zero, when both strings are identical, whereas the maximum distance MD is the length of the longer string. The relative distance RD is obtained from the Modified Hamming Distance MHD thus:

RD(X,Y) = MHD (X,Y)/MD (X,Y)

RD is labelled Relative MHD in List Analyser.

Using test files SSM3 as File A and SSM4 as File B will give:

SSM3 words
Length
MHD
Relative MHD
dogs
4
1
0.25
methylenedioxymethamphetamins
29
1
0.0345
Engineered
10
0
0

Therefore, by calculating the sum of the relative distances (Relative MHD) for all pairs of stems, then dividing this by the number of words, an average distance metric is obtained. This was then further developed by subtracting the value from one, then multiplying it by 100 to give a percentage, so that 0% represents no similarity and 100% total similarity. Thus, the new stemmer similarity metric is given by the formula:

SSM(U,V) = 100 (( Σw(MHD(Uw,Vw)/MD(Uw,Vw))/N ) -1 ).

where U & V are the Stemmers being compared, N = Number of words in the sample MHD = Modified Hamming Distance MD = Maximum Distance

Verification

For the test files SSM3 as File A and SSM4 as File B the value of SSM* will therefore be:

100 * ((0.25 + 0.0345 + 0)/3 ) - 1) = 90.517%

The Inverse Mean MHD will be 3/(1+1+0) = 1.500

*Original Source was: http://www.comp.lancs.ac.uk/computing/research/stemming/Links/similarity.htm however as at 5 July 2018 this link is broken.