I provided a short text box in a Mechanical Turk survey. Most respondents did not write multiple paragraphs, so a good first pass is simply to check for newlines (indicated by "/" in my own system) as well as odd Unicode characters that respondents are unlikely to have typed.
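
A minimal sketch of such a first pass, under some loud assumptions: the helper name `looks_pasted` is hypothetical, the "/" marker is specific to my system (and will also fire on ordinary text like "and/or"), and treating every non-ASCII character as suspicious will misfire on respondents who legitimately type accented characters.

```python
def looks_pasted(txt, newline_marker="/"):
    """Flag text that contains a newline marker or any non-ASCII character.

    Hypothetical helper: the marker and the ASCII cutoff are assumptions,
    not a tested rule.
    """
    has_newline = newline_marker in txt or "\n" in txt
    has_odd_chars = any(ord(ch) > 127 for ch in txt)
    return has_newline or has_odd_chars
```

Anything this flags is only a candidate for a closer look, not proof of copying.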

You can download Google results for all of our texts using the script `get_goog.py`, which should be in the same directory as this notebook.
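
The class below reads the top hit from each saved file as `json_data['items'][0]['snippet']`. A search with no hits may come back without an `items` list at all (an assumption here; check your own files), in which case that lookup raises an error. A minimal sketch of a more tolerant reader, using the hypothetical helper name `top_snippet`:

```python
import json


def top_snippet(raw_json):
    """Return the first result's snippet, or None when there are no results.

    Assumes the saved response has the shape {"items": [{"snippet": ...}, ...]},
    which is the shape the class below relies on.
    """
    data = json.loads(raw_json)
    items = data.get("items") or []
    if not items:
        return None
    return items[0].get("snippet")
```

Returning `None` lets you leave the corresponding cell as NaN rather than aborting the whole run.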

In [None]:
# This allows us to read a specific encoding
from codecs import open
import json

# If you're doing science, you probably have pandas installed. It's good and *fast* for reading CSVs.
import numpy as np
import pandas as pd

# These are two fairly similar metrics (or families of metrics), but let's see how they compare
from nltk.metrics import edit_distance
from fuzzywuzzy import fuzz

class PlagiDistance:
    def __init__(self, df):
        '''Compute the distance from the top google result for each cell in df

        Index and column names are used for guessing the filename in
        google_searches
        '''
        self.fuzz_ratio = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
        self.fuzz_partial = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
        self.edit = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

        # .items() replaces .iteritems(), which newer pandas versions no longer provide
        for colname, txts in df.items():
            for id, txt in txts.items():
                self.get_dists(colname, id, txt)

    def get_dists(self, colname, id, txt):
        '''Get the json file and compute dists'''
        fname = 'google_searches/%s_%s.json' % (id, colname)

        with open(fname, 'r', 'utf8') as json_file:
            json_data = json.load(json_file)

        # This is probably fine - a more careful approach would get rid of
        # ellipses, but this is probably "good enough"

        top_match = json_data['items'][0]['snippet']
        q_string = ' '.join(txt.split()[:32])

        self.fuzz_ratio.loc[id, colname] = fuzz.ratio(q_string, top_match)
        self.fuzz_partial.loc[id, colname] = fuzz.partial_ratio(q_string, top_match)
        self.edit.loc[id, colname] = edit_distance(q_string, top_match)

In [None]:
# An example of how you might grab your text columns
know_cols = [cn for cn in some_df.columns if cn.startswith('text')]
dists = PlagiDistance(some_df[know_cols])

In [None]:
# Grab our 10 worst (er, best) matches: the smallest edit distances
dists.edit.stack().sort_values().iloc[:10]

In [None]:
# fuzz.ratio seems to pick out almost exactly the same texts as edit_distance
dists.fuzz_ratio.stack().sort_values(ascending=False).iloc[:10]

In [None]:
# In my case, edit_distance and fuzz.ratio are highly correlated
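# Pool the three metrics into one frame (one row per cell) before correlating
dists.edit.stack().to_frame('edit').assign(
    fuzz_ratio=dists.fuzz_ratio.stack(),
    fuzz_partial=dists.fuzz_partial.stack(),
).corr()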

In [None]:
# The 10 highest partial_ratio scores
dists.fuzz_partial.stack().sort_values(ascending=False).iloc[:10]

For my data, the standard fuzz.ratio and edit_distance seem sensitive to approximately the same information (as advertised), and partial_ratio doesn't buy you much more. Moreover, in this sample, we could have caught the cheaters just by looking for unusual character codes and slashes.
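
To see why the two metrics track each other, it helps to compute both on a few toy pairs. The sketch below uses only the standard library: a plain dynamic-programming Levenshtein distance stands in for `edit_distance`, and `difflib.SequenceMatcher.ratio()` scaled to 0-100 stands in for `fuzz.ratio`. That second stand-in is an approximation (fuzzywuzzy's exact score can differ depending on which backend is installed), and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity on a 0-100 scale; higher means more alike."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


pairs = [("the cat sat", "the cat sat"),
         ("the cat sat", "the cat sits"),
         ("the cat sat", "a dog ran off")]
for a, b in pairs:
    # Print distance (lower = closer) next to similarity (higher = closer)
    print(levenshtein(a, b), similarity(a, b))
```

Identical strings give a distance of 0 and a similarity of 100, and the two numbers move in opposite directions as the strings diverge, so a strong negative correlation between them is what you should expect.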