# Plagiarism Detection Model with PCA engineered data

Now that I've created training and test data, I'm ready to define and train a model.

My goal in this notebook is to train a Linear Learner binary classification model that learns to label an answer file as either plagiarized or not, based on the features provided to the model.

This task will be broken down into a few discrete steps:

* Upload data to S3.
* Define a binary classification model and a training script.
* Train a Linear Learner binary classifier model and deploy it.
* Evaluate the deployed classifier and answer some questions about this approach.

---

## Load Data to S3

I have created two files, `train.csv` and `test.csv`, with the features and class labels for the given corpus of plagiarized/non-plagiarized text data, to which I applied PCA.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, I'll upload the locally stored data to S3.

In [None]:
import pandas as pd
import boto3
import sagemaker

In [None]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

### Upload training data to S3

I specify the `data_dir` where the `train.csv` file is saved and define a `prefix` under which the data will be uploaded in the default S3 bucket. Finally, I create a pointer to the training data by calling `sagemaker_session.upload_data` with the required parameters.

In [None]:
# the name of directory created to save features data
data_dir = "plagiarism_data_pca"

# setting a prefix, a descriptive name for a directory
prefix = "plagiarism_data_pca"

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
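
To make the role of `prefix` concrete, here is a minimal sketch of the object keys I would expect a directory upload to produce. It assumes that each file ends up under `<prefix>/<path relative to data_dir>`; the helper `planned_s3_keys` is hypothetical (it is not part of the SageMaker SDK) and nothing is actually sent to S3.

```python
import os
import tempfile


def planned_s3_keys(local_dir, key_prefix):
    """List the object keys a recursive directory upload would produce.

    Hypothetical helper for illustration only: nothing is sent to S3.
    """
    keys = []
    for dirpath, _, filenames in os.walk(local_dir):
        for name in filenames:
            # path of the file relative to the uploaded directory
            rel_path = os.path.relpath(os.path.join(dirpath, name), start=local_dir)
            # S3 keys use forward slashes, whatever the local OS uses
            keys.append(key_prefix + "/" + rel_path.replace(os.sep, "/"))
    return sorted(keys)


# build a throwaway directory that mimics the local data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ("train.csv", "test.csv"):
        with open(os.path.join(tmp_dir, file_name), "w") as f:
            f.write("placeholder\n")
    example_keys = planned_s3_keys(tmp_dir, "plagiarism_data_pca")

print(example_keys)
```

Under that assumption, both CSV files sit side by side under the `plagiarism_data_pca/` prefix of the default bucket.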

---

# Modeling

Now that I've uploaded my training data, it's time to define and train a model!

In this notebook, for this binary classification task, I'll use a SageMaker built-in LinearLearner algorithm.

---
# Create a Linear Learner Estimator
Here I define a Linear Learner model to learn from the PCA-preprocessed features of the training set.

In [None]:
role = sagemaker.get_execution_role()

In [None]:
s3_ll_output_key_prefix = "ll_training_output"
s3_ll_output_location = 's3://{}/{}/{}/{}'.format(bucket, prefix, s3_ll_output_key_prefix, 'll_model')

In [None]:
# create a LinearLearner estimator
import boto3


linear_learner = sagemaker.LinearLearner(
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    predictor_type='binary_classifier',
    epochs=30,
    output_path=s3_ll_output_location,
    num_models=10
    )

## Preparing input data

In [None]:
import os
import pandas as pd
import numpy as np

In [None]:
train_df = pd.read_csv(os.path.join('plagiarism_data_pca', 'train.csv')) 

In [None]:
test_df =  pd.read_csv(os.path.join('plagiarism_data_pca', 'test.csv')) 

In [None]:
train_features = train_df.values[:, :-1].astype('float32')

In [None]:
train_labels = np.squeeze(train_df.values[:, -1:].astype('float32'))

In [None]:
test_features = test_df.values[:, :-1].astype('float32')

In [None]:
test_labels = np.squeeze(test_df.values[:, -1:].astype('float32'))

In [None]:
# wrap data in RecordSet objects
train_records = linear_learner.record_set(train_features, train_labels, channel='train')
test_records = linear_learner.record_set(test_features, test_labels, channel='test')

The cell above wraps the feature and label arrays into record sets, one for the `train` channel and one for the `test` channel.
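
The slicing in the cells above relies on one layout assumption: every column except the last holds a feature, and the last column holds the class label. The following dependency-free sketch shows the same split on a made-up miniature file; the column names `c_1`, `c_2` and `class` and all values are hypothetical.

```python
import csv
import io

# hypothetical miniature of train.csv: two PCA components plus the class label
toy_csv = io.StringIO(
    "c_1,c_2,class\n"
    "0.12,-0.40,1\n"
    "-0.85,0.33,0\n"
    "0.47,0.09,1\n"
)

rows = list(csv.reader(toy_csv))
header, body = rows[0], rows[1:]

# every column except the last is a feature
toy_features = [[float(value) for value in row[:-1]] for row in body]
# the last column is the class label
toy_labels = [float(row[-1]) for row in body]

print(header[:-1], toy_features)
print(header[-1], toy_labels)
```

If the label were stored in the first column instead, both slices would have to change (`row[1:]` and `row[0]`), so it is worth checking the column order of the CSV before training.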

## Train the estimator

Next, I train the estimator on the training data stored in S3. This should create a training job that can be monitored in the SageMaker console.

In [None]:
%%time

# Train estimator on S3 training data
linear_learner.fit(train_records)

In [None]:
%%time 
# deploy and create a predictor
linear_predictor = linear_learner.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

---
# Evaluating the model

Once the model is deployed, we can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [None]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the last column, features in all the others
test_y = np.squeeze(test_data.iloc[1:,-1:])
test_x = np.squeeze(test_data.iloc[1:,:-1])

print(test_x)


Use the deployed `linear_predictor` to generate predicted class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that the model classified correctly. [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) may be used for this calculation.

In [None]:
# First: generate predicted class labels
test_y_preds = linear_predictor.predict(test_features.astype('float32'))

In [None]:
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # np.array_split(test_features, 1) keeps the whole test set in a single batch;
    # raise the second argument to send the data to the endpoint in several batches
    prediction_batches = [predictor.predict(batch) for batch in np.array_split(test_features, 1)]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch
    # get the 'predicted_label' for every point in a batch
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}
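
To check my understanding of the four counts computed inside `evaluate`, here is a small sketch that derives them from plain Python lists instead of NumPy arrays. The function `confusion_counts` and the toy labels are made up for illustration; labels are assumed to be encoded as 0 and 1.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, TN and FN for binary labels encoded as 0 and 1."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # plagiarized, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # original, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # original, not flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # plagiarized, missed
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# made-up labels, chosen so that every cell of the matrix is non-zero
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]

toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)
```

The four counts always add up to the number of test points, which is a cheap consistency check on any confusion matrix.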


In [None]:
print('Metrics for simple LinearLearner.\n')

# get metrics for linear predictor
metrics = evaluate(linear_predictor, 
                   test_features.astype('float32'), 
                   test_labels, 
                   verbose=True) # verbose means we'll print out the metrics
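
Because `evaluate` passes `1` to `np.array_split`, the whole test set goes to the endpoint in one request, which is fine for a test set this small. For a larger test set, fixed-size batches might be preferable. A minimal sketch, where `iter_batches` is a hypothetical helper and 100 is an arbitrary batch size:

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# 250 dummy rows stand in for a larger test set
dummy_rows = list(range(250))
batch_sizes = [len(batch) for batch in iter_batches(dummy_rows, 100)]
print(batch_sizes)
```

The helper only uses `len` and slicing, so it should work on a NumPy feature array as well as on a list.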

### Question 1: How many false positives and false negatives did the model produce, if any? And why?

**Answer**:

As printed in the above confusion matrix: 15 true positives, 0 false positives, 0 false negatives and 10 true negatives.

My guess: the LinearLearner was applied to PCA-preprocessed features and was also tuned by training 10 different models in parallel (`num_models=10`), so very good performance could be reached on this dataset.
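
As a quick sanity check, the counts reported above can be plugged into the same formulas that `evaluate` uses:

```python
# counts reported in the confusion matrix above
tp, fp, fn, tn = 15, 0, 0, 10

recall = tp / (tp + fn)                      # 15 / 15
precision = tp / (tp + fp)                   # 15 / 15
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 25 / 25

print(recall, precision, accuracy)
```

All three metrics come out at 1.0 on the 25 test files. With a test set this small, a perfect score should be read with some caution.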

----
## Clean up Resources

Once the model evaluation is complete, it is better to **delete the model endpoint**. We can do this with a call to `.delete_endpoint()`.

In [None]:
# delete the deployed endpoint
linear_predictor.delete_endpoint()


### Deleting S3 bucket

When *completely* done with training and testing models, it is also possible to empty the entire S3 bucket; the cell below deletes every object in it. If this is done before training the model, the training data has to be uploaded again.

In [None]:
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()