# Plagiarism Detector
In this notebook, I examine text files to perform data classification. Each file is labeled as either plagiarized or not.
The notebook was created to be used in an **AWS SageMaker** environment.

## Download data and save locally

Source for the dataset:

Clough, P. and Stevenson, M. Developing A Corpus of Plagiarised Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, In Press.

In [None]:
!wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c4147f9_data/data.zip
!unzip data

In [None]:
# import libraries
import pandas as pd
import numpy as np
import os

This plagiarism dataset is made up of multiple text files; the characteristics of each file are summarized in a .csv file named file_information.csv, which we can read in using pandas.

In [None]:
csv_file = 'data/file_information.csv'
plagiarism_df = pd.read_csv(csv_file)

# print out the first few rows of data info
plagiarism_df.head(10)

(Text extracted from original source)
### Five task types, A-E
Each text file contains an answer to one short question; these questions are labeled as tasks A-E.

- Each task, A-E, is about a topic that might be included in the Computer Science curriculum that was created by the authors of this dataset.
- For example, Task A asks the question: "What is inheritance in object oriented programming?"

### Four categories of plagiarism
Each text file has an associated plagiarism label/category:

- `cut`: An answer is plagiarized; it is copy-pasted directly from the relevant Wikipedia source text.
- `light`: An answer is plagiarized; it is based on the Wikipedia source text and includes some copying and paraphrasing.
- `heavy`: An answer is plagiarized; it is based on the Wikipedia source text but expressed using different words and structure. Since this doesn't copy directly from a source text, this will likely be the most challenging kind of plagiarism to detect.
- `non`: An answer is not plagiarized; the Wikipedia source text is not used to create this answer.
- `orig`: This is a special category for the original Wikipedia source text. These files are used for comparison purposes only.

### Data visualization and analysis

In [None]:
# print out some stats about the data
print('Number of files: ', plagiarism_df.shape[0])  # .shape[0] gives the rows 
# .unique() gives unique items in a specified column
print('Number of unique tasks/question types (A-E): ', (len(plagiarism_df['Task'].unique())))
print('Unique plagiarism categories: ', (plagiarism_df['Category'].unique()))

In [None]:
# Show counts by different tasks and amounts of plagiarism

# group and count by task
counts_per_task = plagiarism_df.groupby(['Task']).size().reset_index(name="Counts")
print("\nTask:")
display(counts_per_task)

# group by plagiarism level
counts_per_category = plagiarism_df.groupby(['Category']).size().reset_index(name="Counts")
print("\nPlagiarism Levels:")
display(counts_per_category)

# group by task AND plagiarism level
counts_task_and_plagiarism = plagiarism_df.groupby(['Task', 'Category']).size().reset_index(name="Counts")
print("\nTask & Plagiarism Level Combos :")
display(counts_task_and_plagiarism)

In [None]:

import matplotlib.pyplot as plt
%matplotlib inline

# counts
group = ['Task', 'Category']
counts = plagiarism_df.groupby(group).size().reset_index(name="Counts")

plt.figure(figsize=(8,5))
plt.bar(range(len(counts)), counts['Counts'], color = 'blue')

## Feature Engineering
Tasks: 
- Clean and pre-process the data.
- Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
- Select "good" features, by analyzing the correlations between different features.
- Create train/test .csv files that hold the relevant features and class labels for train/test data points

In [None]:
# import extra library
from sklearn.feature_extraction.text import CountVectorizer

### Convert categorical to numerical data

Two columns will be created to provide a numerical value for each of the samples. 
They are:
- `Category`: the category labels are converted to numerical labels according to the following rules (a higher value indicates a higher degree of plagiarism):
    * 0 = non;
    * 1 = heavy;
    * 2 = light;
    * 3 = cut;
    * -1 = orig, this is a special value that indicates an original file.
- `Class`: Any answer text that is not plagiarized (non) has the class label 0, and any plagiarized answer text has the class label 1. Any orig text has the special label -1.

In [None]:
# Read in a csv file and return a transformed dataframe
def numerical_dataframe(csv_file='data/file_information.csv'):
    '''Reads in a csv file which is assumed to have `File`, `Category` and `Task` columns.
       This function does two things: 
       1) converts `Category` column values to numerical values 
       2) Adds a new, numerical `Class` label column.
       The `Class` column will label plagiarized answers as 1 and non-plagiarized as 0.
       Source texts have a special label, -1.
       :param csv_file: The directory for the file_information.csv file
       :return: A dataframe with numerical categories and a new `Class` label column'''
    
    # read the csv file and map each category name to its numerical label
    df = pd.read_csv(csv_file)
    category_conversion = {
        'non': 0,
        'heavy': 1,
        'light': 2,
        'cut': 3, 
        'orig': -1
    }

    df["Category"] = df["Category"].apply(lambda x: category_conversion[x])
    df["Class"] = df["Category"].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    
    return df

In [None]:
# informal testing, print out the results of a called function
# create new `transformed_df`
transformed_df = numerical_dataframe(csv_file ='data/file_information.csv')

# check work
# check that all categories of plagiarism have a class label = 1
transformed_df.head()

### Similarity features
One way to detect plagiarism is to compute similarity features that measure how similar a given answer text is to the original Wikipedia source text (for a specific task, A-E). The similarity features are informed by [this paper on plagiarism detection](https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c412841_developing-a-corpus-of-plagiarised-short-answers/developing-a-corpus-of-plagiarised-short-answers.pdf).
In this paper, researchers created features called __containment__ and __longest common subsequence__.

#### Containment calculation

The general steps to compute containment are as follows:

1. Get the processed answer and source texts for the given answer_filename.
2. Create an array of n-gram counts for these two texts, for example with a CountVectorizer.
3. Calculate the containment between the answer and source text according to the following equation.

$$ \frac{\sum{count(\text{ngram}_{A}) \cap count(\text{ngram}_{S})}}{\sum{count(\text{ngram}_{A})}} $$

4. Return that containment value.
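
To make the equation concrete, here is a minimal pure-Python sketch of containment on two toy sentences. It is an illustration only: it uses `collections.Counter` instead of a CountVectorizer, splits on whitespace, and the helper names `ngram_counts` and `containment` are made up for this example. Note that CountVectorizer's default tokenizer ignores single-character tokens, so values on real texts can differ slightly from this sketch.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams of size n in a text (hypothetical helper)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def containment(answer_text, source_text, n):
    """Shared n-gram count divided by the answer's total n-gram count."""
    answer_counts = ngram_counts(answer_text, n)
    source_counts = ngram_counts(source_text, n)
    # `&` on Counters keeps the minimum count of every shared n-gram
    shared = answer_counts & source_counts
    total = sum(answer_counts.values())
    return sum(shared.values()) / total if total else 0.0

answer = "this is an answer text"
source = "this is a source text"
print(containment(answer, source, 1))  # 3 shared unigrams out of 5 -> 0.6
print(containment(answer, source, 2))  # 1 shared bigram out of 4 -> 0.25
```

Larger values of n reward longer runs of copied words, which is why the containment drops from 0.6 to 0.25 in this toy case.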


In [None]:
# Calculate the ngram containment for one answer file/source file pair in a df
def calculate_containment(df, n, answer_filename):
    '''Calculates the containment between a given answer text and its associated source text.
       This function creates a count of ngrams (of a size, n) for each text file in our data.
       Then calculates the containment by finding the ngram count for a given answer text, 
       and its associated source text, and calculating the normalized intersection of those counts.
       :param df: A dataframe with columns,
           'File', 'Task', 'Category', 'Class', 'Text', and 'Datatype'
       :param n: An integer that defines the ngram size
       :param answer_filename: A filename for an answer text in the df, ex. 'g0pB_taskd.txt'
       :return: A single containment value that represents the similarity
           between an answer text and its source text.
    '''

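    # the source file for a task is named 'orig_<task>.txt', e.g. 'orig_taskd.txt'
    source_filename = 'orig_' + answer_filename.split('_')[1]
    answer_text = df[df['File'] == answer_filename].iloc[0]['Text']
    source_text = df[df['File'] == source_filename].iloc[0]['Text']
  
    # count n-grams of size n in both texts (row 0: answer, row 1: source)
    cv = CountVectorizer(ngram_range=(n,n))
    matrix = cv.fit_transform([answer_text, source_text]).toarray()
    
    # the element-wise minimum is the count of each n-gram shared by both texts
    intersection = np.min(matrix, 0)
    
    # normalize by the total number of n-grams in the answer text
    return sum(intersection)/sum(matrix[0])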

#### Test cells

In [None]:
# select a value for n
n = 3

# indices for first few files
test_indices = range(5)

# iterate through files and calculate containment
category_vals = []
containment_vals = []
for i in test_indices:
    # get level of plagiarism for a given file index
    category_vals.append(complete_df.loc[i, 'Category'])
    # calculate containment for given file and n
    filename = complete_df.loc[i, 'File']
    c = calculate_containment(complete_df, n, filename)
    containment_vals.append(c)

# print out result, does it make sense?
print('Original category values: \n', category_vals)
print()
print(str(n)+'-gram containment values: \n', containment_vals)

#### Longest Common Subsequence

It may be helpful to think of this in a concrete example. A Longest Common Subsequence (LCS) problem may look as follows:

* Given two texts: text A (answer text) of length n, and text S (original source text) of length m. Our goal is to produce their longest common subsequence of words: the longest sequence of words that appear left-to-right in both texts (though the words do not have to be contiguous).
* Consider:

    * A = "i think pagerank is a link analysis algorithm used by google that uses a system of weights attached to each element of a hyperlinked set of documents"
    * S = "pagerank is a link analysis algorithm used by the google internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents"
* In this case, we can see that the start of each sentence is fairly similar, with an overlap in the sequence of words, "pagerank is a link analysis algorithm used by", before diverging slightly. Then we continue moving left-to-right along both texts until we see the next common sequence; in this case it is only one word, "google". Next we find "that" and "a", and finally the same ending, "to each element of a hyperlinked set of documents". In total the LCS has 20 words, so the LCS normalized by the 27 words of A is 20/27.
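
The walk-through above can be reproduced with a small dynamic-programming sketch that returns the matched words themselves rather than a normalized score. The function name `lcs_words` is hypothetical, and the code uses plain Python lists instead of a NumPy matrix; it is meant as an illustration, not as the notebook's implementation.

```python
def lcs_words(answer_text, source_text):
    """Return one longest common subsequence of words as a list."""
    a = answer_text.split()
    s = source_text.split()

    # table[i][j] holds the LCS length of a[:i] and s[:j]
    table = [[0] * (len(s) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(s) + 1):
            if a[i - 1] == s[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # walk back from the bottom-right corner to recover the matched words
    words = []
    i, j = len(a), len(s)
    while i > 0 and j > 0:
        if a[i - 1] == s[j - 1]:
            words.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return words[::-1]

print(lcs_words("the cat sat on the mat", "the cat lay on a mat"))
# ['the', 'cat', 'on', 'mat']
```

Dividing the length of the returned list by the number of words in the answer text gives the normalized LCS value.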

In [2]:
# Compute the normalized LCS given an answer text and a source text
def lcs_norm_word(answer_text, source_text):
    '''Computes the longest common subsequence of words in two texts; returns a normalized value.
       :param answer_text: The pre-processed text for an answer text
       :param source_text: The pre-processed text for an answer's associated source text
       :return: A normalized LCS value'''
    
    answer_words = [''] + answer_text.split()
    source_words = [''] + source_text.split()
    
    # Prepare matrix for dynamic programming
    matrix = np.zeros((len(answer_words), len(source_words)))
    
    for i in range(1, len(answer_words)):
        for j in range(1, len(source_words)):
            matrix[i][j] = (matrix[i-1][j-1] + 1) if (source_words[j] == answer_words[i]) else max(matrix[i-1][j], matrix[i][j-1])
          
    return matrix[-1,-1] / (len(answer_words) - 1)

#### Test cells

In [None]:
# Run the test scenario from above
# does your function return the expected value?

A = "i think pagerank is a link analysis algorithm used by google that uses a system of weights attached to each element of a hyperlinked set of documents"
S = "pagerank is a link analysis algorithm used by the google internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents"

# calculate LCS
lcs = lcs_norm_word(A, S)
print('LCS = ', lcs)


# expected value test
assert lcs==20/27., "Incorrect LCS value, expected about 0.7407, got "+str(lcs)

print('Test passed!')

In [None]:
# test on your own
test_indices = range(5) # look at first few files

category_vals = []
lcs_norm_vals = []
# iterate through first few docs and calculate LCS
for i in test_indices:
    category_vals.append(complete_df.loc[i, 'Category'])
    # get texts to compare
    answer_text = complete_df.loc[i, 'Text'] 
    task = complete_df.loc[i, 'Task']
    # we know that source texts have Class = -1
    orig_rows = complete_df[(complete_df['Class'] == -1)]
    orig_row = orig_rows[(orig_rows['Task'] == task)]
    source_text = orig_row['Text'].values[0]
    
    # calculate lcs
    lcs_val = lcs_norm_word(answer_text, source_text)
    lcs_norm_vals.append(lcs_val)

# print out result, does it make sense?
print('Original category values: \n', category_vals)
print()
print('Normalized LCS values: \n', lcs_norm_vals)

## Create all features
### Multiple containment features
This function returns a list of containment features, calculated for a given n and for all files in a df (assumed to be the complete_df).

In [3]:
# Function returns a list of containment features, calculated for a given n 
# Should return a list of length 100 for all files in a complete_df
def create_containment_features(df, n, column_name=None):
    
    containment_values = []
    
    if column_name is None:
        column_name = 'c_'+str(n) # c_1, c_2, .. c_n
    
    # iterates through dataframe rows
    for i in df.index:
        file = df.loc[i, 'File']
        # Computes features using calculate_containment function
        if df.loc[i,'Category'] > -1:
            c = calculate_containment(df, n, file)
            containment_values.append(c)
        # Sets value to -1 for original tasks 
        else:
            containment_values.append(-1)
    
    print(str(n)+'-gram containment features created!')
    return containment_values

### LCS features


In [None]:

# Function returns a list of normalized LCS features for all files in a df
def create_lcs_features(df, column_name='lcs_word'):
    
    lcs_values = []
    
    # iterate through files in dataframe
    for i in df.index:
        # Computes LCS_norm words feature using function above for answer tasks
        if df.loc[i,'Category'] > -1:
            # get texts to compare
            answer_text = df.loc[i, 'Text'] 
            task = df.loc[i, 'Task']
            # we know that source texts have Class = -1
            orig_rows = df[(df['Class'] == -1)]
            orig_row = orig_rows[(orig_rows['Task'] == task)]
            source_text = orig_row['Text'].values[0]

            # calculate lcs
            lcs = lcs_norm_word(answer_text, source_text)
            lcs_values.append(lcs)
        # Sets to -1 for original tasks 
        else:
            lcs_values.append(-1)

    print('LCS features created!')
    return lcs_values

In the cell below I define an n-gram range; these are the values of n I use to create n-gram containment features.

In [None]:
# Define an ngram range
ngram_range = range(1,15)


# The following code may take a minute to run, depending on your ngram_range
features_list = []

# Create features in a features_df
all_features = np.zeros((len(ngram_range)+1, len(complete_df)))

# Calculate features for containment for ngrams in range
i=0
for n in ngram_range:
    column_name = 'c_'+str(n)
    features_list.append(column_name)
    # create containment features
    all_features[i]=np.squeeze(create_containment_features(complete_df, n))
    i+=1

# Calculate features for LCS_Norm Words 
features_list.append('lcs_word')
all_features[i]= np.squeeze(create_lcs_features(complete_df))

# create a features dataframe
features_df = pd.DataFrame(np.transpose(all_features), columns=features_list)

# Print all features/columns
print()
print('Features: ', features_list)
print()

In [None]:
# print some results 
features_df

## Correlated features
Many of these features are highly correlated with one another, so they carry largely redundant information. I keep only a few features with lower mutual correlation, which helps to avoid overfitting.
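
One possible way to pick such features is a greedy filter: go through the features in order and keep one only if its absolute correlation with every feature already kept stays below a threshold. The sketch below is a pure-Python illustration under stated assumptions: the helper names `pearson` and `drop_correlated`, the 0.9 threshold, and the toy values are all made up for this example and are not results from the dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# toy values, made up for illustration (not taken from the dataset)
toy_features = {
    'c_1': [0.1, 0.5, 0.9, 0.4],
    'c_2': [0.05, 0.25, 0.45, 0.2],   # exactly half of c_1, so perfectly correlated
    'lcs_word': [0.9, 0.1, 0.5, 0.3],
}
print(drop_correlated(toy_features))  # ['c_1', 'lcs_word']
```

The order of the features matters in a greedy filter like this, so it is a starting point for manual selection rather than a replacement for inspecting the correlation matrix.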

In [None]:
# Create correlation matrix for just Features to determine different models to test
corr_matrix = features_df.corr().abs().round(2)

# display shows all of a dataframe
display(corr_matrix)

The function below takes in dataframes and a list of selected features (column names) and returns (train_x, train_y), (test_x, test_y)

In [None]:

def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and 
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''
    
    # get the training features
    train_x = features_df[complete_df['Datatype'] == 'train'][selected_features].to_numpy()
    # And training class labels (0 or 1)
    train_y = complete_df[complete_df['Datatype'] == 'train']['Class'].to_numpy()
    
    # get the test features and labels
    test_x = features_df[complete_df['Datatype'] == 'test'][selected_features].to_numpy()
    test_y = complete_df[complete_df['Datatype'] == 'test']['Class'].to_numpy()
    
    return (train_x, train_y), (test_x, test_y)

#### Test cells

In [None]:
test_selection = list(features_df)[:2] # first couple columns as a test
# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)

## Select features
Select two features that are not strongly correlated with each other.

In [None]:
# Select your list of features, this should be column names from features_df
# ex. ['c_1', 'lcs_word']
selected_features = ['c_1', 'c_5']

(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, selected_features)

# check that division of samples seems correct
# these should add up to 95 (100 - 5 original files)
print('Training size: ', len(train_x))
print('Test size: ', len(test_x))
print()
print('Training df sample: \n', train_x[:10])

## Creating final data files

In this project, SageMaker will expect the following format for train/test data:

- Training and test data should be saved in one .csv file each, e.g. train.csv and test.csv
- These files should have class labels in the first column and features in the rest of the columns
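
As a small illustration of this layout, the sketch below writes two made-up rows with the standard-library `csv` module into an in-memory buffer: the label comes first, the features follow, and there is no header row. The values and variable names are assumptions for this example only.

```python
import csv
import io

# toy data, made up for illustration
features = [[0.398, 0.0001], [0.869, 0.4495]]
labels = [0, 1]

buffer = io.StringIO()
writer = csv.writer(buffer)
for label, row in zip(labels, features):
    # label in the first column, features in the remaining columns, no header
    writer.writerow([label] + row)

csv_text = buffer.getvalue()
print(csv_text)
```

Reading the text back with `csv.reader` yields one row per data point, with the label as the first field.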

#### Creating csv files

In [None]:
def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    # put labels in the first column and features after; drop rows with missing values
    pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1).dropna().to_csv(data_dir+'/'+filename, index=False, header=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

#### Test

In [None]:
fake_x = [ [0.39814815, 0.0001, 0.19178082], 
           [0.86936937, 0.44954128, 0.84649123], 
           [0.44086022, 0., 0.22395833] ]

fake_y = [0, 1, 1]

make_csv(fake_x, fake_y, filename='to_delete.csv', data_dir='test_csv')

# read in and test dimensions
fake_df = pd.read_csv('test_csv/to_delete.csv', header=None)

# check shape
assert fake_df.shape==(3, 4), \
      'The file should have as many rows as data_points and as many columns as features+1 (for the labels).'
# check that first column = labels
assert np.all(fake_df.iloc[:,0].values==fake_y), 'First column is not equal to the labels, fake_y.'
print('Tests passed!')

In [None]:
# delete the test csv file, generated above
! rm -rf test_csv

In [None]:

# create train.csv and test.csv files in a directory
# to be specified when uploading data to S3
data_dir = 'plagiarism_data'

make_csv(train_x, train_y, filename='train.csv', data_dir=data_dir)
make_csv(test_x, test_y, filename='test.csv', data_dir=data_dir)

# Training a model

In [None]:
# import libraries
import boto3
import sagemaker

## Load data to S3

In [None]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

# name of directory created to save features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_project'


In [None]:
# upload all data to S3
s3_path = sagemaker_session.upload_data(key_prefix=prefix, bucket=bucket, path=data_dir)

#### Test cell

In [None]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

### Modeling
Here I use the SKLearn estimator from the SageMaker Python SDK. The `train.py` script is the entry point for the training routine.
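
The `train.py` script itself is not shown in this notebook. As a rough sketch of how such a script could receive its inputs, the example below parses the arguments that SageMaker passes to a training script: hyperparameters arrive as command-line arguments, and the model and data locations are available through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables. The function name `parse_args`, the default of 5 neighbors, and the local fallback paths are assumptions for this illustration.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the arguments a SageMaker training script typically receives."""
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container;
    # the fallbacks are only there so the sketch also runs locally
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'plagiarism_data'))
    # hyperparameter passed from the estimator, e.g. `--neighbors 10`
    parser.add_argument('--neighbors', type=int, default=5)
    return parser.parse_args(argv)

args = parse_args(['--neighbors', '10'])
print(args.neighbors)  # 10
```

The channel name used in `estimator.fit({'train': ...})` determines the environment variable name, which is why the data directory is read from `SM_CHANNEL_TRAIN`.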

In [None]:
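from sagemaker.sklearn.estimator import SKLearn

# S3 location for the trained model artifacts
# (assumed here: the default bucket under the project prefix)
output_path = 's3://{}/{}'.format(bucket, prefix)

# Note: `train_instance_count` and `train_instance_type` are the SageMaker Python SDK v1
# names; in SDK v2 they are called `instance_count` and `instance_type`.
estimator = SKLearn(entry_point='train.py',
                    source_dir='source_sklearn',
                    output_path=output_path,
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    framework_version='0.23-1',
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'neighbors': 10
                    })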

### Train


In [None]:
%%time

# Train your estimator on S3 training data
estimator.fit({'train': s3_path})

### Deploy endpoint

In [None]:
%%time

# deploy your model to create a predictor
predictor = estimator.deploy(instance_type='ml.t2.medium', initial_instance_count=1)

### Evaluate model


In [None]:

import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

In [None]:

# First: generate predicted, class labels
test_y_preds = predictor.predict(test_x)

# test that model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

In [None]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)


## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

## Clean up resources

In [None]:
predictor.delete_endpoint()

In [None]:
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()