# Plagiarism Detection Model

Now that I've created training and test data, I'm ready to define and train a model. My goal in this notebook is to train a Linear Learner binary classification model that learns to label an answer file as either plagiarized or not, based on the features provided to the model.

This task will be broken down into a few discrete steps:

* Upload data to S3.
* Define a binary classification model and a training script.
* Train a Linear Learner binary classifier model and deploy it.
* Evaluate deployed classifier and analyze some questions about this approach.

---

## Load Data to S3

I have created two files, `train.csv` and `test.csv`, with the features and class labels for the given corpus of plagiarized/non-plagiarized text data, to which I applied PCA.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, I'll upload locally stored data to S3.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

### Upload training data to S3

I specify the `data_dir` where the `train.csv` file is saved and define a `prefix` under which the data will be uploaded in the default S3 bucket. Finally, I create a pointer to the training data by calling `sagemaker_session.upload_data` and passing in the required parameters.

In [3]:
data_dir = "plagiarism_data_pca"

# setting a prefix, a descriptive name for a directory
prefix = "plagiarism_data_pca"

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
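
When `upload_data` is given a directory, each file ends up under `key_prefix` with its relative path preserved. The following is a minimal sketch of that key layout, written with the standard library only so it runs without AWS credentials. The helper name `expected_s3_keys` and the throwaway directory are assumptions made for illustration; they are not part of the SageMaker SDK or of this notebook.

```python
import os
import tempfile


def expected_s3_keys(local_dir, key_prefix):
    """Return the object keys a recursive upload of local_dir would be expected to produce.

    Hypothetical helper: it only mirrors the "{key_prefix}/{relative path}" layout
    and never talks to S3.
    """
    keys = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), local_dir)
            # S3 keys use forward slashes regardless of the local operating system
            keys.append("/".join([key_prefix] + rel_path.split(os.sep)))
    return sorted(keys)


# Build a throwaway directory that mimics the local data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ("train.csv", "test.csv"):
        with open(os.path.join(tmp_dir, file_name), "w") as handle:
            handle.write("placeholder\n")
    example_keys = expected_s3_keys(tmp_dir, "plagiarism_data_pca")

print(example_keys)
```

Comparing such a list against the keys actually present in the bucket is one way to confirm that both CSV files were uploaded.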

---

# Modeling

Now that I've uploaded my training data, it's time to define and train a model!

In this notebook, for this binary classification task, I'll use a SageMaker built-in LinearLearner algorithm.

---
# Create a Linear Learner Estimator
Here I define a Linear Learner model to learn from the PCA-preprocessed features of the training set.

In [12]:
role = sagemaker.get_execution_role()

In [13]:
s3_ll_output_key_prefix = "ll_training_output"
s3_ll_output_location = 's3://{}/{}/{}/{}'.format(bucket, prefix, s3_ll_output_key_prefix, 'll_model')

In [26]:
# create a LinearLearner estimator
import boto3


linear_learner = sagemaker.LinearLearner(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    predictor_type='binary_classifier',
    epochs=30,
    output_path=s3_ll_output_location,
    num_models=10
    )

## Preparing input data

In [41]:
import os
import pandas as pd
import numpy as np

In [45]:
train_df = pd.read_csv(os.path.join('plagiarism_data_pca', 'train.csv')) 

In [46]:
test_df =  pd.read_csv(os.path.join('plagiarism_data_pca', 'test.csv')) 

In [57]:
train_features = train_df.values[:, :-1].astype('float32')

In [69]:
train_labels = np.squeeze(train_df.values[:, -1:].astype('float32'))

In [59]:
test_features = test_df.values[:, :-1].astype('float32')

In [63]:
test_labels = np.squeeze(test_df.values[:, -1:].astype('float32'))
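
The cells above rely on one convention: every column except the last holds a feature, and the last column holds the class label. Below is a small sketch of that split using only the standard library instead of pandas, so the slicing is easy to inspect. The CSV text, its column names, and the helper `split_features_labels` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical miniature CSV: three feature columns followed by the class label
csv_text = """c1,c2,c3,label
-1.64,-0.25,0.13,1
0.29,0.11,-0.04,0
"""


def split_features_labels(text):
    """Return (features, labels), taking the label from the last column of each row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    features, labels = [], []
    for row in reader:
        values = [float(cell) for cell in row]
        features.append(values[:-1])  # every column except the last
        labels.append(values[-1])     # last column only
    return features, labels


features, labels = split_features_labels(csv_text)
print(features)
print(labels)
```

The result has the same layout as `train_features` and `train_labels`: a two-dimensional collection of features and a flat sequence of labels.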

In [71]:
# wrap data in RecordSet objects
train_records = linear_learner.record_set(train_features, train_labels, channel='train')
test_records = linear_learner.record_set(test_features, test_labels, channel='test')

The cell above wraps the feature and label arrays in `RecordSet` objects, which are then passed to `linear_learner.fit`.

## Train the estimator

I now train my estimator on the training data stored in S3. This creates a training job that can be monitored in the SageMaker console.

In [72]:
%%time

# train the estimator on the S3 training data
linear_learner.fit(train_records)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-11-01 22:54:17 Starting - Starting the training job...
2020-11-01 22:54:20 Starting - Launching requested ML instances......
2020-11-01 22:55:41 Starting - Preparing the instances for training......
2020-11-01 22:56:36 Downloading - Downloading input data...
2020-11-01 22:57:05 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/01/2020 22:57:29 INFO 139752463005504] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_me

In [73]:
%%time 
# deploy and create a predictor
linear_predictor = linear_learner.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-------------------!
CPU times: user 341 ms, sys: 17.8 ms, total: 358 ms
Wall time: 9min 33s


---
# Evaluating Your Model

Once the model is deployed, you can see how it performs when applied to the test data.

The provided cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [92]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the last column; the iloc[1:] slice skips the first row
test_y = np.squeeze(test_data.iloc[1:,-1:])
test_x = np.squeeze(test_data.iloc[1:,:-1])

print(test_x)


           0         1         2
1  -1.639462 -0.246917  0.128120
2  -0.879668 -0.178596  0.047907
3  -0.225507  0.256465  0.055519
4   0.286421  0.107939 -0.038232
5  -0.694184  0.177860 -0.010238
6  -1.766031 -0.237912 -0.000274
7   0.514478 -0.037475 -0.127731
8   0.583113 -0.091273  0.016539
9   0.490602  0.020631  0.055853
10  0.486265  0.044117  0.021951
11  0.485346  0.019304  0.040157
12  0.454414 -0.031748 -0.022410
13  0.339447  0.085182  0.028119
14 -1.210377  0.011069 -0.064943
15 -1.700588 -0.172470 -0.010210
16  0.081825  0.296291  0.077079
17 -0.171464  0.042922 -0.078266
18 -1.767607 -0.223671  0.001471
19  0.587676 -0.146077 -0.021378
20 -1.683539 -0.257004  0.106298
21  0.477424  0.059141  0.035333
22 -1.517221 -0.127082  0.008374
23 -1.270118 -0.006002 -0.015576
24  0.581445 -0.111310  0.002219
25  0.451718 -0.021417  0.032656


## EXERCISE: Determine the accuracy of your model

Use your deployed `predictor` to generate predicted, class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that your model classified correctly. You may use [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for this calculation.

**To pass this project, your model should get at least 90% test accuracy.**
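
Accuracy as described here is simply the number of matching labels divided by the number of samples. A minimal sketch in plain Python follows; the function name `accuracy` and the example label lists are made up for illustration and are not the notebook's data. `sklearn.metrics.accuracy_score` computes the same fraction.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(1 for actual, predicted in zip(true_labels, predicted_labels)
                  if actual == predicted)
    return correct / len(true_labels)


# Hypothetical labels: four of the five predictions are correct
example_true = [1, 0, 1, 1, 0]
example_pred = [1, 0, 0, 1, 0]
example_accuracy = accuracy(example_true, example_pred)
print(example_accuracy)  # 4 correct out of 5 -> 0.8
```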

In [134]:
# First: generate predicted class labels
test_y_preds = linear_predictor.predict(test_features.astype('float32'))

In [135]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # The test set is small, so np.array_split(..., 1) sends it to the endpoint as a single batch;
    # increase the number of sections to split a larger test set into several batches
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 1)]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch
    # get the 'predicted_label' for every point in a batch
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}


In [136]:
print('Metrics for simple LinearLearner.\n')

# get metrics for linear predictor
metrics = evaluate(linear_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True) # verbose means we'll print out the metrics

Metrics for simple LinearLearner.

prediction (col)  0.0  1.0
actual (row)              
0.0                10    0
1.0                 0   15

Recall:     1.000
Precision:  1.000
Accuracy:   1.000
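
The confusion-matrix arithmetic inside `evaluate` can be sketched without NumPy or an endpoint. The version below is an illustration under stated assumptions rather than the notebook's implementation: labels are assumed to be encoded as 0 and 1, and the helper `safe_ratio` returns 0.0 for an empty denominator, which is one possible convention for the case where no positive predictions are made and `tp + fp` is zero.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(true_labels, predicted_labels):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


# Same class balance as the run above: 10 negatives and 15 positives, all predicted correctly
true_labels = [0] * 10 + [1] * 15
perfect_counts = confusion_counts(true_labels, true_labels)

# Hypothetical degenerate case: the classifier never predicts the positive class
tp, fp, tn, fn = confusion_counts(true_labels, [0] * 25)
degenerate_precision = safe_ratio(tp, tp + fp)  # 0 / 0, guarded
degenerate_recall = safe_ratio(tp, tp + fn)     # 0 / 15

print(perfect_counts)
print(degenerate_precision, degenerate_recall)
```

The guard matters mostly on small test sets, where a classifier that predicts a single class leaves one of the denominators empty.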



### Question 1: How many false positives and false negatives did your model produce, if any? And why do you think this is?

**Answer**:

As shown in the confusion matrix above: 15 true positives, 0 false positives, 0 false negatives, and 10 true negatives.

My guess: the LinearLearner was trained on PCA-preprocessed features and was also tuned by training 10 different models in parallel (`num_models=10`), so very good performance could be reached on this dataset. The test set is also small (25 samples), so a perfect score here is weaker evidence than it would be on a larger dataset.

----
## EXERCISE: Clean up Resources

After you're done evaluating your model, **delete your model endpoint**. You can do this with a call to `.delete_endpoint()`. You need to show, in this notebook, that the endpoint was deleted. You may delete any other resources from the AWS console; more instructions on cleaning up all your resources are given below.

In [138]:
# delete the deployed endpoint
linear_predictor.delete_endpoint()


### Deleting S3 bucket

When you are *completely* done with training and testing models, you can also delete your entire S3 bucket. If you do this before you are done training your model, you'll have to recreate your S3 bucket and upload your training data again.

In [139]:
# delete every object stored in the bucket

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'DA34701EB5AFFA10',
   'HostId': 'vaY5uyA4g0nTfBEqodBewrAqgueXNUYNczS8SEQnbxJVwAyFLqI/W6LU+zn6SlfIz0nqF8R2nGY=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'vaY5uyA4g0nTfBEqodBewrAqgueXNUYNczS8SEQnbxJVwAyFLqI/W6LU+zn6SlfIz0nqF8R2nGY=',
    'x-amz-request-id': 'DA34701EB5AFFA10',
    'date': 'Mon, 02 Nov 2020 00:05:27 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-record-sets/LinearLearner-2020-11-01-22-53-12-299/matrix_0.pbr'},
   {'Key': 'plagiarism_data_pca/ll_training_output/ll_model/linear-learner-2020-11-01-22-54-17-353/output/model.tar.gz'},
   {'Key': 'plagiarism_data_pca/.ipynb_checkpoints/test-checkpoint.csv'},
   {'Key': 'plagiarism_data/pca-2020-10-26-22-10-40-443/output/model.tar.gz'},
   {'Key': 'sagemaker-record-sets/LinearLearner-2020-11-01-22-53-13-451/matrix_0.pbr'},
   {'Key

### Deleting all your models and instances

When you are _completely_ done with this project and do **not** ever want to revisit this notebook, you can choose to delete all of your SageMaker notebook instances and models by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html). Before you delete this notebook instance, I recommend at least downloading a copy and saving it locally.

---
## Further Directions

There are many ways to improve or add on to this project to expand your learning or make this more of a unique project for you. A few ideas are listed below:
* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a lambda function to deploy your model to a web application.

These are all just options for extending your work. If you've completed all the exercises in this notebook, you've completed a real-world application, and can proceed to submit your project. Great job!