Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
Programmatically filtering "false" hits #2
Saw this project on reddit and thought it was really awesome. Keep up the good work, its a great a idea. Anyway, I saw that you said you were manually rejecting "false" hits which included identical tweets or transpositions of few letters and I thought to myself that could be an easy fix.
Hopefully it helps. Maybe at least it keeps you from having to think of a solution haha. Awesome project!
hey sorry I've been super busy, but thanks for posting this. I didn't know about levenshtein distance when I first wrote this, and I ended up writing a bunch of shitty comparison filters. I'll eventually implement this change, but it'll have to wait a little bit. Thanks again for the note though!
This would be a really nice addition!
I don't think there's any need to consider addition or deletion of characters though, since for two messages to have gotten this far they must be equal in length. So you could use the Hamming distance instead of the Levenshtein distance which would make this trivial to implement.
The only thing worth nothing is that the implementation of the Levenshtein distance in the previous post is actually the Damerau–Levenshtein distance, which counts character transposition as a single operation. Since all operations required to convert between the two anagram strings are transpositions, the distance returned from using just the Levenshtein or Hamming distance will be half that of the Damerau–Levenshtein distance, so the threshold to decide whether or not the strings are too similar will need to be doubled.
Sorry for digging up an old issue but I wanted to share my own experience with using Damerau–Levenshtein distance on tweets in case it helps anyone here.
Like @caguillen214 I saw this project and thought Damerau–Levenshtein distance would be a great way to score anagrammed tweets. I went ahead and implemented it, including Damerau–Levenshtein distance as one metric in a normalized score of how interesting a candidate anagram match is. After processing the tweet text to normalize Unicode characters, strip spaces out spaces, strip out weird characters, and convert everything to the same case, this normalized score equally combines:
I review matches and post some to my own bot account. I've now posted and rejected quite a few anagram matches.
I ran some queries to see how the above 5 metrics affect which anagrams I'm likely to post to the account (meaning, how different/funny/witty/interesting the two tweets in the match seem to me). I've found that the above 5 metrics individually contribute to the likelihood of me posting a match ranked in the order shown above. In other words, length alone is the best indicator, and Damerau–Levenshtein distance is the worst part of my score. If you want to see some numbers and graphs, I included some in the "Scoring Metric Efficiency" section of this blog post. I would probably have a better score using only the first four metrics by completely removing Damerau–Levenshtein distance.
However, the judgment that Damerau–Levenshtein distance is not a good way to measure how interesting two anagrammed tweets are is biased by my own idea of what I should or should not post to my bot account. I've posted a bunch of crappy matches in the past that probably would not have met anagramatron's bar, so take this with a grain of salt.
@bdrupieski I'm on the same page, and in general I think that while things like edit distance can be a useful screening step, in general the human conception of "what is interesting" is hard to model.
An approach I think would be interesting would be to use a neural network; using word2vec to encode each tweet, and then training a linear classifier on pairs of tweets, using manually approved/rejected pairs as the training/test data. I'm not sure exactly how many manually approved pairs I have, but it's quite a few.
This approach would require a reasonable outlay in engineering time, and I haven't been thinking about this stuff much lately (I was doing something similar to this for work a while ago, though, so I at least have a glimmer that this approach could work...) in any case, it's firmly in the 'maybe one day' pile. :)