# Plagiarism Detection Model

Now that we've created training and test data, we are ready to define and train a model. Our goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features we provide to the model.

This task will be broken down into a few discrete steps:

* Upload our data to S3.
* Define a binary classification model and a training script.
* Train our model and deploy it.
* Evaluate our deployed classifier.


## Load Data to S3

In the last notebook, we created two files, `train.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, we can upload our locally stored data to S3.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:

# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Uploading our training data to S3

Specify the `data_dir` where we've saved our `train.csv` file. Decide on a descriptive `prefix` that defines where our data will be uploaded in the default S3 bucket. Finally, create a pointer to our training data by calling `sagemaker_session.upload_data` and passing in the required parameters. It may help to look at the [Session documentation](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.Session.upload_data) or previous SageMaker code examples.

In [9]:
# should be the name of directory we created to save our features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
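
As a rough illustration of the value stored in `input_data`, the sketch below composes the URI that a directory upload is documented to return, `s3://{bucket}/{key_prefix}`, following the Session documentation linked above. The helper `expected_s3_uri` and the bucket name `example-bucket` are hypothetical and exist only for this example; no AWS call is made.

```python
def expected_s3_uri(bucket, key_prefix):
    """Compose the S3 URI format documented for a directory upload."""
    # strip stray slashes so the prefix joins cleanly (illustrative choice)
    return 's3://{}/{}'.format(bucket, key_prefix.strip('/'))


# hypothetical bucket name, for illustration only
print(expected_s3_uri('example-bucket', 'plagiarism'))
```

Comparing this string with the actual `input_data` value is a quick way to confirm that the data landed under the intended prefix.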

### Test cell

The cell below prints the items in our S3 bucket and raises an error if the bucket is empty. We should see the contents of our `data_dir` and perhaps some checkpoints. If any other files are listed, they are probably old model files that we can delete via the S3 console (additional files shouldn't affect the performance of the model developed in this notebook).

In [10]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism/.ipynb_checkpoints/test-checkpoint.csv
plagiarism/.ipynb_checkpoints/train-checkpoint.csv
plagiarism/test.csv
plagiarism/train.csv
Test passed!


---

# Modeling

Now that we've uploaded our training data, it's time to define and train a model!


 
---

## Completing a training script 

To implement a custom classifier, we'll need to complete a `train.py` script. We've been given the folders `source_sklearn` and `source_pytorch`, which hold starting code for a custom Scikit-learn model and a PyTorch model, respectively. Each directory has a `train.py` training script. To complete this project, **we only need to complete one of these scripts**: the script responsible for training our final model.

A typical training script:
* Loads training data from a specified directory
* Parses any training & model hyperparameters (e.g., nodes in a neural network, training epochs, etc.)
* Instantiates a model of our design, with any specified hyperparameters
* Trains that model
* Saves the model so that it can be hosted/deployed later

### Defining and training a model
Much of the training script code is provided. Almost all of our work will be done in the `if __name__ == '__main__':` section. To complete a `train.py` file, we will:
1. Import any extra libraries we need
2. Define any additional model training hyperparameters using `parser.add_argument`
3. Define a model in the `if __name__ == '__main__':` section
4. Train the model in that same section
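
The steps above can be sketched as a compact training script. This is a minimal sketch under stated assumptions, not the project's actual `train.py`: the file name `model.joblib`, the `--data-dir` and `--n-estimators` arguments, and the split into `parse_args`/`train` functions are all hypothetical choices made for illustration. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables that SageMaker sets inside the training container. The sketch imports the standalone `joblib` package, because `sklearn.externals.joblib` is not available in newer scikit-learn releases.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

MODEL_FILE = 'model.joblib'  # assumed file name, not taken from the starter code


def model_fn(model_dir):
    """Load the model that train() saved into model_dir."""
    return joblib.load(os.path.join(model_dir, MODEL_FILE))


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # SageMaker sets these environment variables inside the training container
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '.'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '.'))
    # an extra model hyperparameter (hypothetical choice)
    parser.add_argument('--n-estimators', type=int, default=100)
    return parser.parse_args(argv)


def train(args):
    # no header row: labels in the first column, features in the rest
    train_data = pd.read_csv(os.path.join(args.data_dir, 'train.csv'), header=None)
    train_y = train_data.iloc[:, 0].values
    train_x = train_data.iloc[:, 1:].values

    model = RandomForestClassifier(n_estimators=args.n_estimators, random_state=0)
    model.fit(train_x, train_y)

    # save the model where model_fn expects to find it
    joblib.dump(model, os.path.join(args.model_dir, MODEL_FILE))
    return model

# In a real train.py, the final call sits under `if __name__ == '__main__':`
# and reads: train(parse_args())
```

Keeping the loading, parsing, and training logic in small functions makes the script easy to try locally on a few rows before launching a training job.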


In [11]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_sklearn/train.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m print_function

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m

[34mfrom[39;49;00m [04m[36msklearn.externals[39;49;00m [34mimport[39;49;00m joblib

[37m## TODO: Import any additional libraries you need to define a model[39;49;00m
[34mfrom[39;49;00m [04m[36msklearn.ensemble[39;49;00m [34mimport[39;49;00m RandomForestClassifier

[37m# Provided model load function[39;49;00m
[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load model from the model_dir. This is the same model that is saved[39;49;00m
[33m    in the main if statement.[39;49;00m
[33m    """[39;49;00m
    [34mprint[39;49;00m([33m"[39;49;00m[33mLoading model.[39;49;00m[33m"[39;49;00m)
    
    [37m# load using joblib[39;49;00m
    model = joblib.load(os.pat

### Provided code

We can see that the starter code includes a few things:
* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name (`train.csv`) and extracting the features and labels, `train_x` and `train_y`

---
# Creating an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script we looked at above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument, **framework_version**, which we can set to the PyTorch version we want to use (`1.0` was the latest when this notebook was written).
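
To make the argument list above concrete, here is a minimal sketch that gathers the constructor arguments for a custom Scikit-learn estimator. The helper names `sklearn_estimator_kwargs` and `build_sklearn_estimator`, the instance type, and the example hyperparameter are hypothetical. The argument names follow the ones listed in this notebook; version 2 of the SageMaker Python SDK renamed `train_instance_count` and `train_instance_type` to `instance_count` and `instance_type`, so adjust them for the installed SDK.

```python
def sklearn_estimator_kwargs(role, sagemaker_session, hyperparameters=None):
    """Collect constructor arguments for a custom Scikit-learn estimator."""
    kwargs = {
        'entry_point': 'train.py',          # script run for training
        'source_dir': 'source_sklearn',     # directory that holds train.py
        'role': role,
        'train_instance_count': 1,
        'train_instance_type': 'ml.c4.xlarge',  # CPU instance (illustrative choice)
        'sagemaker_session': sagemaker_session,
    }
    if hyperparameters:
        kwargs['hyperparameters'] = hyperparameters
    return kwargs


def build_sklearn_estimator(role, sagemaker_session, hyperparameters=None):
    # imported here so the helper above can be inspected without the SDK installed
    from sagemaker.sklearn.estimator import SKLearn
    return SKLearn(**sklearn_estimator_kwargs(role, sagemaker_session, hyperparameters))
```

In the notebook, a call such as `build_sklearn_estimator(role, sagemaker_session, {'n_estimators': 50})` would create the estimator, assuming `train.py` defines a matching `--n_estimators` argument.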

## Defining an estimator

To import a custom Scikit-learn or PyTorch estimator, use one of the following lines:
```
from sagemaker.sklearn.estimator import SKLearn
```
```
from sagemaker.pytorch import PyTorch
```

The cell below takes a different route: it uses SageMaker's built-in `LinearLearner` algorithm as a binary classifier, which does not need a custom `train.py` script.

In [31]:
from sagemaker import LinearLearner

output_path = 's3://{}/{}'.format(bucket, prefix)

estimator = LinearLearner(role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    predictor_type="binary_classifier",
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    epochs=15) 

In [64]:
# Load training data (labels in column 0, features in columns 1-3)
import os
import numpy as np

train = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)

train_y = train.loc[:,0]
train_x = train.loc[:, 1:3]

In [65]:
# Convert to RecordSet
train_x_np = np.array(train_x).astype('float32')
train_y_np = np.array(train_y).astype('float32')

formatted_train_data = estimator.record_set(train_x_np, labels=train_y_np)
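
The slicing and type conversion in the two cells above can be checked on a tiny in-memory example. The two rows below are made up; only the layout (label first, then three features, no header) mirrors `train.csv`. The point of the sketch is that `.loc[:, 1:3]` is a label-based slice that includes column 3, and that the arrays handed to `record_set` end up as `float32`.

```python
import io

import numpy as np
import pandas as pd

# two made-up rows in the same layout: label first, then three features
csv_text = '1,0.40,0.95,0.82\n0,0.10,0.00,0.05\n'
train = pd.read_csv(io.StringIO(csv_text), header=None)

train_y = train.loc[:, 0]
train_x = train.loc[:, 1:3]   # label-based slice: columns 1, 2 and 3

train_x_np = np.array(train_x).astype('float32')
train_y_np = np.array(train_y).astype('float32')

print(train_x_np.shape, train_x_np.dtype)   # (2, 3) float32
print(train_y_np.shape, train_y_np.dtype)   # (2,) float32
```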

## Train the estimator

Train the estimator on the training data stored in S3. This should create a training job that we can monitor in our SageMaker console.

In [66]:
%%time

# Train our estimator on S3 training data
estimator.fit(formatted_train_data)

2019-07-22 19:20:42 Starting - Starting the training job...
2019-07-22 19:20:44 Starting - Launching requested ML instances......
2019-07-22 19:21:45 Starting - Preparing the instances for training......
2019-07-22 19:22:55 Downloading - Downloading input data...
2019-07-22 19:23:38 Training - Training image download completed. Training in progress.
2019-07-22 19:23:38 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
[07/22/2019 19:23:35 INFO 140105739908928] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', 

## Deploy the trained model

After training, deploy the model to create a `predictor`.

To deploy a trained model, we'll use `<model>.deploy`, which takes two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.


In [67]:
%%time

# deploy our model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------!
CPU times: user 503 ms, sys: 0 ns, total: 503 ms
Wall time: 7min 19s


---
# Evaluating our Model

Once our model is deployed, we can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [68]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determining the accuracy of our model

Use our deployed `predictor` to generate predicted class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that our model classified correctly. We may use [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for this calculation.

In [92]:
# First: generate predicted class labels
test_x = np.array(test_x).astype('float32')
test_y = np.array(test_y)
test_y_preds = predictor.predict(test_x)

# test that our model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [93]:
# Pull out predictions
test_y_preds = np.array([x.label['predicted_label'].float32_tensor.values[0] for x in test_y_preds])
test_y_preds

array([1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 0., 1., 0., 1., 1., 0., 0.])
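
The access pattern used in the cell above can be tried without a live endpoint. The sketch below builds stand-in objects with `SimpleNamespace` that mimic only the attribute layout used in that cell; they are not the real response type. The helpers `fake_record` and `extract_labels` are hypothetical, and the extra `score` entry is an assumption about what the response may also carry.

```python
from types import SimpleNamespace

import numpy as np


def fake_record(predicted_label, score):
    """Build a stand-in with the same attribute layout as a prediction record."""
    def tensor(value):
        return SimpleNamespace(float32_tensor=SimpleNamespace(values=[value]))
    # 'score' is included as an assumption; only 'predicted_label' is used here
    return SimpleNamespace(label={'predicted_label': tensor(predicted_label),
                                  'score': tensor(score)})


def extract_labels(records):
    """Pull the predicted label out of each record, as in the cell above."""
    return np.array([r.label['predicted_label'].float32_tensor.values[0]
                     for r in records])


records = [fake_record(1.0, 0.93), fake_record(0.0, 0.08)]
print(extract_labels(records))   # [1. 0.]
```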

In [99]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)


# optionally, print out the arrays of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y)

1.0

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


Accuracy was 100% on the 25 test samples, so there are no false positives or false negatives.
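
One way to double-check the claim about false positives and false negatives is a confusion matrix. The sketch below copies the true and predicted labels from the output above and counts each outcome with `sklearn.metrics.confusion_matrix`; the variable names are chosen to match the notebook, and the predictions are cast to integers for the comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# labels copied from the printed output above
test_y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
                   0, 1, 0, 1, 1, 0, 0])
test_y_preds = np.array([1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
                         1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0.])

preds_int = test_y_preds.astype(int)

# for binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(test_y, preds_int).ravel()

print('accuracy:', accuracy_score(test_y, preds_int))
print('tn={}, fp={}, fn={}, tp={}'.format(tn, fp, fn, tp))   # tn=10, fp=0, fn=0, tp=15
```

With only 25 test samples, a perfect score is encouraging but is a small sample on which to judge how the model generalizes.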


----
## Clean up Resources

Once we are done evaluating our model, we should **delete our model endpoint**. We can do this with a call to `.delete_endpoint()`.

In [100]:
# delete the endpoint that was created by estimator.deploy()
estimator.delete_endpoint()


### Deleting S3 bucket

When we are *completely* done with training and testing models, we can also delete our entire S3 bucket. If we do this before we are done training our model, we'll have to recreate our S3 bucket and upload our training data again.

In [101]:
# delete every object in the bucket (the emptied bucket itself remains)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '19B9D8B942C63F1A',
   'HostId': 'upUrHw4MicOCUIoFbc37Ji8Y5BLe16OTanw/XKZbHWh7ajh0jQdcF0sLZxFIKkgsl6/2IytCdW8=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'upUrHw4MicOCUIoFbc37Ji8Y5BLe16OTanw/XKZbHWh7ajh0jQdcF0sLZxFIKkgsl6/2IytCdW8=',
    'x-amz-request-id': '19B9D8B942C63F1A',
    'date': 'Mon, 22 Jul 2019 19:55:38 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-scikit-learn-2019-07-22-03-20-17-725/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-07-22-04-17-59-471/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-record-sets/LinearLearner-2019-07-22-19-20-28-678/matrix_0.pbr'},
   {'Key': 'plagiarism/.ipynb_checkpoints/train-checkpoint.csv'},
   {'Key': 'sagemaker-scikit-learn-2019-07-22-03-35-05-405/source/sourcedir.tar.gz'},
   {'Key': 'plagiarism/linear-learner-2019-07-2