New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Programmatically filtering "false" hits #2

Open
caguillen214 opened this Issue Aug 14, 2014 · 7 comments

Comments

Projects
None yet
4 participants
@caguillen214

Saw this project on reddit and thought it was really awesome. Keep up the good work, its a great a idea. Anyway, I saw that you said you were manually rejecting "false" hits which included identical tweets or transpositions of few letters and I thought to myself that could be an easy fix.

I wrote a simple abstracted implementation to remove some of the "false" hits (in javascript) of this mainly because I was too lazy to look how to implement it in python or how to integrate it with your project and submit a PR haha.

Hopefully it helps. Maybe at least it keeps you from having to think of a solution haha. Awesome project!

Carlos

//tweets would have to be stripped of things you exclude mentioned in your readme like @'s
function tooSimilar(tweet1, tweet2) {
    var LD_THRESHOLD = 4, //max changes between the tweets (insertions, deletions, transpositions)
        TWEET_WORD_SIMILARITY = 3, //max number of words in common
        matchCount = 0;

    tweet1.split(' ').map(function(word) {
        //ignore words like pronouns, articles, etc.
        if(IGNORED_WORDS_HASH[word] === undefined && tweet2.indexOf(word) > -1) { 
            matchCount++;
        }
    });

   //lev_Dist can be expensive so we want to kick out early if possible
    if(matchCount > TWEET_WORD_SIMILARITY) {
        return true;
    }

    var lev_Dist = levenshtein(tweet1, tweet2);    
    return (lev_Dist < LD_THRESHOLD);
}   

//credit: http://stackoverflow.com/questions/11919065/sort-an-array-by-the-levenshtein-distance-with-best-performance-in-javascript
//info: http://en.wikipedia.org/wiki/Levenshtein_distance
function levenshtein(s, t) {
      var d = []; //2d matrix

    // Step 1
    var n = s.length;
    var m = t.length;

    if (n == 0) return m;
    if (m == 0) return n;

    //Create an array of arrays in javascript (a descending loop is quicker)
    for (var i = n; i >= 0; i--) d[i] = [];

    // Step 2
    for (var i = n; i >= 0; i--) d[i][0] = i;
    for (var j = m; j >= 0; j--) d[0][j] = j;

    // Step 3
    for (var i = 1; i <= n; i++) {
        var s_i = s.charAt(i - 1);

        // Step 4
        for (var j = 1; j <= m; j++) {

            //Check the jagged ld total so far
            if (i == j && d[i][j] > 4) return n;

            var t_j = t.charAt(j - 1);
            var cost = (s_i == t_j) ? 0 : 1; // Step 5

            //Calculate the minimum
            var mi = d[i - 1][j] + 1;
            var b = d[i][j - 1] + 1;
            var c = d[i - 1][j - 1] + cost;

            if (b < mi) mi = b;
            if (c < mi) mi = c;

            d[i][j] = mi; // Step 6

            //Damerau transposition (checks for transposition of letters e.g. haet and hate have a Damerau-Lev distance
            // of 1 instead of Lev distance of 2). Can be removed for optimization
            if (i > 1 && j > 1 && s_i == t.charAt(j - 2) && s.charAt(i - 2) == t_j) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
            }
        }
    }

    // Step 7
    return d[n][m];
}
@cmyr

This comment has been minimized.

Show comment
Hide comment
@cmyr

cmyr Sep 12, 2014

Owner

hey sorry I've been super busy, but thanks for posting this. I didn't know about levenshtein distance when I first wrote this, and I ended up writing a bunch of shitty comparison filters. I'll eventually implement this change, but it'll have to wait a little bit. Thanks again for the note though!

Owner

cmyr commented Sep 12, 2014

hey sorry I've been super busy, but thanks for posting this. I didn't know about levenshtein distance when I first wrote this, and I ended up writing a bunch of shitty comparison filters. I'll eventually implement this change, but it'll have to wait a little bit. Thanks again for the note though!

@cmyr cmyr closed this Sep 12, 2014

@cmyr cmyr reopened this Sep 12, 2014

@JoshKeegan

This comment has been minimized.

Show comment
Hide comment
@JoshKeegan

JoshKeegan Dec 1, 2014

This would be a really nice addition!

I don't think there's any need to consider addition or deletion of characters though, since for two messages to have gotten this far they must be equal in length. So you could use the Hamming distance instead of the Levenshtein distance which would make this trivial to implement.

The only thing worth nothing is that the implementation of the Levenshtein distance in the previous post is actually the Damerau–Levenshtein distance, which counts character transposition as a single operation. Since all operations required to convert between the two anagram strings are transpositions, the distance returned from using just the Levenshtein or Hamming distance will be half that of the Damerau–Levenshtein distance, so the threshold to decide whether or not the strings are too similar will need to be doubled.

This would be a really nice addition!

I don't think there's any need to consider addition or deletion of characters though, since for two messages to have gotten this far they must be equal in length. So you could use the Hamming distance instead of the Levenshtein distance which would make this trivial to implement.

The only thing worth nothing is that the implementation of the Levenshtein distance in the previous post is actually the Damerau–Levenshtein distance, which counts character transposition as a single operation. Since all operations required to convert between the two anagram strings are transpositions, the distance returned from using just the Levenshtein or Hamming distance will be half that of the Damerau–Levenshtein distance, so the threshold to decide whether or not the strings are too similar will need to be doubled.

@caguillen214

This comment has been minimized.

Show comment
Hide comment
@caguillen214

caguillen214 Dec 1, 2014

That's a great point, and thanks for the info! I hadn't heard of the hamming distance, I'll definitely keep that in my rolodex of nifty tricks

That's a great point, and thanks for the info! I hadn't heard of the hamming distance, I'll definitely keep that in my rolodex of nifty tricks

@cmyr

This comment has been minimized.

Show comment
Hide comment
@cmyr

cmyr Dec 1, 2014

Owner

Looks like I’m going to have to actually get around to doing this now, huh?

Owner

cmyr commented Dec 1, 2014

Looks like I’m going to have to actually get around to doing this now, huh?

@caguillen214

This comment has been minimized.

Show comment
Hide comment
@caguillen214

caguillen214 Dec 1, 2014

Yeah I believe two random people commenting about it means it must be done. Internet law stuff

Yeah I believe two random people commenting about it means it must be done. Internet law stuff

@bdrupieski

This comment has been minimized.

Show comment
Hide comment
@bdrupieski

bdrupieski Nov 26, 2017

Sorry for digging up an old issue but I wanted to share my own experience with using Damerau–Levenshtein distance on tweets in case it helps anyone here.

Like @caguillen214 I saw this project and thought Damerau–Levenshtein distance would be a great way to score anagrammed tweets. I went ahead and implemented it, including Damerau–Levenshtein distance as one metric in a normalized score of how interesting a candidate anagram match is. After processing the tweet text to normalize Unicode characters, strip spaces out spaces, strip out weird characters, and convert everything to the same case, this normalized score equally combines:

  1. Total length of the tweets in the match
  2. Length of the longest common substring between the two tweets
  3. Total number of English words across the two tweets
  4. Total number of words across the two tweets
  5. Damerau–Levenshtein distance between the two tweets

I review matches and post some to my own bot account. I've now posted and rejected quite a few anagram matches.

I ran some queries to see how the above 5 metrics affect which anagrams I'm likely to post to the account (meaning, how different/funny/witty/interesting the two tweets in the match seem to me). I've found that the above 5 metrics individually contribute to the likelihood of me posting a match ranked in the order shown above. In other words, length alone is the best indicator, and Damerau–Levenshtein distance is the worst part of my score. If you want to see some numbers and graphs, I included some in the "Scoring Metric Efficiency" section of this blog post. I would probably have a better score using only the first four metrics by completely removing Damerau–Levenshtein distance.

However, the judgment that Damerau–Levenshtein distance is not a good way to measure how interesting two anagrammed tweets are is biased by my own idea of what I should or should not post to my bot account. I've posted a bunch of crappy matches in the past that probably would not have met anagramatron's bar, so take this with a grain of salt.

Sorry for digging up an old issue but I wanted to share my own experience with using Damerau–Levenshtein distance on tweets in case it helps anyone here.

Like @caguillen214 I saw this project and thought Damerau–Levenshtein distance would be a great way to score anagrammed tweets. I went ahead and implemented it, including Damerau–Levenshtein distance as one metric in a normalized score of how interesting a candidate anagram match is. After processing the tweet text to normalize Unicode characters, strip spaces out spaces, strip out weird characters, and convert everything to the same case, this normalized score equally combines:

  1. Total length of the tweets in the match
  2. Length of the longest common substring between the two tweets
  3. Total number of English words across the two tweets
  4. Total number of words across the two tweets
  5. Damerau–Levenshtein distance between the two tweets

I review matches and post some to my own bot account. I've now posted and rejected quite a few anagram matches.

I ran some queries to see how the above 5 metrics affect which anagrams I'm likely to post to the account (meaning, how different/funny/witty/interesting the two tweets in the match seem to me). I've found that the above 5 metrics individually contribute to the likelihood of me posting a match ranked in the order shown above. In other words, length alone is the best indicator, and Damerau–Levenshtein distance is the worst part of my score. If you want to see some numbers and graphs, I included some in the "Scoring Metric Efficiency" section of this blog post. I would probably have a better score using only the first four metrics by completely removing Damerau–Levenshtein distance.

However, the judgment that Damerau–Levenshtein distance is not a good way to measure how interesting two anagrammed tweets are is biased by my own idea of what I should or should not post to my bot account. I've posted a bunch of crappy matches in the past that probably would not have met anagramatron's bar, so take this with a grain of salt.

@cmyr

This comment has been minimized.

Show comment
Hide comment
@cmyr

cmyr Nov 28, 2017

Owner

@bdrupieski I'm on the same page, and in general I think that while things like edit distance can be a useful screening step, in general the human conception of "what is interesting" is hard to model.

An approach I think would be interesting would be to use a neural network; using word2vec to encode each tweet, and then training a linear classifier on pairs of tweets, using manually approved/rejected pairs as the training/test data. I'm not sure exactly how many manually approved pairs I have, but it's quite a few.

This approach would require a reasonable outlay in engineering time, and I haven't been thinking about this stuff much lately (I was doing something similar to this for work a while ago, though, so I at least have a glimmer that this approach could work...) in any case, it's firmly in the 'maybe one day' pile. :)

Owner

cmyr commented Nov 28, 2017

@bdrupieski I'm on the same page, and in general I think that while things like edit distance can be a useful screening step, in general the human conception of "what is interesting" is hard to model.

An approach I think would be interesting would be to use a neural network; using word2vec to encode each tweet, and then training a linear classifier on pairs of tweets, using manually approved/rejected pairs as the training/test data. I'm not sure exactly how many manually approved pairs I have, but it's quite a few.

This approach would require a reasonable outlay in engineering time, and I haven't been thinking about this stuff much lately (I was doing something similar to this for work a while ago, though, so I at least have a glimmer that this approach could work...) in any case, it's firmly in the 'maybe one day' pile. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment