# Plagiarism Detection Model

Now that I have created training and test data, I'm ready to define and train a model. In this notebook, I train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features provided to the model.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

---

## Load Data to S3

In the last notebook, I created two files, `train.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, I can upload the locally stored data to S3.


In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

`data_dir` points to the directory that holds the saved `train.csv` file. `prefix` defines where the data will be uploaded in the default S3 bucket. Finally, I create a pointer to the training data by calling `sagemaker_session.upload_data` and passing in the required parameters.

I upload the entire directory. Later, the training script will only access the `train.csv` file.

In [3]:
# should be the name of directory you created to save your features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'sagemaker/plagiarism'

# upload all data to S3
input_data = sagemaker_session.upload_data(data_dir, bucket=bucket, key_prefix=prefix)
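
As a rough illustration of what a directory upload is expected to leave in the bucket, the sketch below builds the object keys locally without touching AWS. The helper `expected_s3_keys` is hypothetical (it is not part of the SageMaker SDK), and the layout it assumes, `key_prefix` followed by each file's path relative to the uploaded directory, is my understanding of `upload_data` rather than something this notebook verifies.

```python
import os
import tempfile


def expected_s3_keys(local_dir, key_prefix):
    """Hypothetical helper: list the S3 keys a directory upload should produce."""
    keys = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), local_dir)
            # S3 keys use forward slashes regardless of the local OS
            keys.append(key_prefix + "/" + rel_path.replace(os.sep, "/"))
    return sorted(keys)


# Demo on a throwaway directory that mimics `plagiarism_data`
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ("train.csv", "test.csv"):
        with open(os.path.join(tmp_dir, file_name), "w") as f:
            f.write("1,0.5,0.2,0.9\n")  # placeholder row, not real features
    demo_keys = expected_s3_keys(tmp_dir, "sagemaker/plagiarism")

print(demo_keys)
```

Under that assumption, the training job later finds `train.csv` directly under the `sagemaker/plagiarism` prefix.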

---

# Modeling

Now that I uploaded my training data, it's time to define and train a model!

The type of model you create is up to you. For this binary classification task, I define a custom PyTorch neural network classifier.

---

## Complete a training script 

To implement a custom classifier, I developed a `train.py` script that is responsible for training the final model.

A typical training script:
* Loads training data from a specified directory
* Parses any training & model hyperparameters (ex. nodes in a neural network, training epochs, etc.)
* Instantiates a model of your design, with any specified hyperparams
* Trains that model 
* Finally, saves the model so that it can be hosted/deployed, later

### Defining and training a model
Almost all of the work is done in the `if __name__ == '__main__':` section. To complete a `train.py` file, you will:
* Import any extra libraries you need
* Define any additional model training hyperparameters using `parser.add_argument`
* Define a model in the `if __name__ == '__main__':` section
* Train the model in that same section
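
The hyperparameter step above can be sketched with `argparse` alone. This is a minimal sketch, not the actual `train.py`: the argument names `input_features`, `hidden_dim`, `output_dim`, and `epochs` match the hyperparameters this notebook passes to the estimator, while the default values, the `--model-dir` and `--data-dir` flag names, and the local fallback paths are assumptions chosen for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse SageMaker locations and model hyperparameters (illustrative defaults)."""
    parser = argparse.ArgumentParser()

    # Locations normally supplied by the SageMaker training container through
    # environment variables; the local fallbacks are assumptions for offline runs
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))

    # Model and training hyperparameters
    parser.add_argument("--input_features", type=int, default=3)
    parser.add_argument("--hidden_dim", type=int, default=20)
    parser.add_argument("--output_dim", type=int, default=1)
    parser.add_argument("--epochs", type=int, default=10)

    return parser.parse_args(argv)


# Simulate the command line the training job would receive
args = parse_args(["--hidden_dim", "20", "--epochs", "160"])
print(args.input_features, args.hidden_dim, args.output_dim, args.epochs)
```

Passing an explicit `argv` list keeps the parser easy to exercise locally; inside a training job, calling `parse_args()` with no arguments reads the real command line instead.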

Below, I use `!pygmentize` to display the existing `train.py` file.

In [1]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_pytorch/train.py

import argparse
import json
import os
import pandas as pd
import torch
import torch.optim as optim
import torch.utils.data
import torch.nn as nn


# imports the model in model.py by name
from model import BinaryClassifier

def model_fn(model_dir):
    """Load the PyTorch model from the `model_dir` directory."""
    print("Loading model.")

    # First, load the parameters used to create

### Provided code

If you read the code above, you can see that the starter code includes a few things:
* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name, `train.csv` and extracting the features and labels, `train_x`, and `train_y`

If you'd like to read more about model saving with [joblib for sklearn](https://scikit-learn.org/stable/modules/model_persistence.html) or with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), click on the provided links.

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, I construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training.
* **source_dir**: The path to the training script directory `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the training script as hyperparameters.

Note: For a PyTorch model, there is another optional argument, **framework_version**, which I set to PyTorch version `1.0`.


In [14]:
# import a PyTorch wrapper
from sagemaker.pytorch import PyTorch

# specify an output path
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate a pytorch estimator
estimator = PyTorch(entry_point='train.py',
                    source_dir='source_pytorch',
                    role=role,
                    framework_version='1.0',
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'input_features': 3,  # num of features
                        'hidden_dim': 20,
                        'output_dim': 1,
                        'epochs': 160 # could change to higher
                    })


## Train the estimator

Train the estimator on the training data stored in S3. This creates a training job that can be monitored in the SageMaker console.

In [15]:
%%time

# Train the estimator on S3 training data
estimator.fit({'train': input_data})


2019-11-11 13:59:03 Starting - Starting the training job...
2019-11-11 13:59:04 Starting - Launching requested ML instances.........
2019-11-11 14:00:38 Starting - Preparing the instances for training...
2019-11-11 14:01:18 Downloading - Downloading input data...
2019-11-11 14:01:58 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 14:02:00,452 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-11-11 14:02:00,455 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-11-11 14:02:00,467 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-11-11 14:02:01,106 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-11-11 14:02:02,973 sagemaker-containers INFO


2019-11-11 14:02:14 Uploading - Uploading generated training model
2019-11-11 14:02:14 Completed - Training job completed
Training seconds: 56
Billable seconds: 56
CPU times: user 537 ms, sys: 27.4 ms, total: 564 ms
Wall time: 3min 42s


## Deploy the trained model

After training, I deploy the model to create a predictor. First, I create a `PyTorchModel` that accepts the trained `<model>.model_data` as an input parameter and points to the provided `source_pytorch/predict.py` file as an entry point.

To deploy a trained model, I use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.


In [16]:
%%time

# import the PyTorchModel wrapper
from sagemaker.pytorch import PyTorchModel

# deploy your model to create a predictor
model = PyTorchModel(model_data=estimator.model_data,
                    role=role,
                    framework_version="1.0",
                    entry_point='predict.py',
                    source_dir='source_pytorch')

predict = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

--------------------------------------------------------------------------------------------------------------!
CPU times: user 687 ms, sys: 35.8 ms, total: 723 ms
Wall time: 9min 16s


---
# Evaluating the Model

Once the model is deployed, I can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [17]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of your model

I use the deployed predictor (the `predict` variable created above) to generate predicted class labels for the test data. I then compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that the model classified correctly.

In [18]:
import numpy as np

# First: generate predicted, class labels
test_y_preds = np.squeeze(np.round(predict.predict(test_x)))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [19]:
# Second: calculate the test accuracy
# calculate true positives, false positives, true negatives, false negatives
tp = np.logical_and(test_y, test_y_preds).sum()
fp = np.logical_and(1-test_y, test_y_preds).sum()
tn = np.logical_and(1-test_y, 1-test_y_preds).sum()
fn = np.logical_and(test_y, 1-test_y_preds).sum()

# calculate binary classification metrics
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

print("accuracy = {}".format(accuracy))
print("FP = {}".format(fp))
print("FN = {}".format(fn))

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

accuracy = 0.96
FP = 0
FN = 1

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
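
The cell above computes `recall` and `precision` but prints only the accuracy. As a cross-check that needs no deployed endpoint, the sketch below recomputes all three metrics in plain Python from the two label arrays printed above. The helper name `confusion_counts` is mine, and counting with list comprehensions stands in for the `np.logical_and` calls used in the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for two equal-length sequences of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


# Labels copied from the notebook output above
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

recall = tp / (tp + fn)        # share of plagiarized files that were caught
precision = tp / (tp + fp)     # share of flagged files that were truly plagiarized
accuracy = (tp + tn) / len(y_true)

print("tp={} fp={} tn={} fn={}".format(tp, fp, tn, fn))
print("recall={:.3f} precision={:.3f} accuracy={:.2f}".format(recall, precision, accuracy))
```

On these 25 test files, the single error is a false negative, so precision is perfect while recall falls slightly below 1.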


----
## Clean up Resources

Once the model has been evaluated, I **delete my model endpoint**.

In [20]:
# delete the deployed endpoint
predict.delete_endpoint()