# Plagiarism Detection Model

Now that we've created training and test data, we are ready to define and train a model. Our goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features we provide.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

---

## Load Data to S3

In the last notebook, we created two files, `train.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, we can upload our locally stored data to S3.

First, we save our train and test `.csv` feature files locally. To do this, we can run the second notebook, "2_Plagiarism_Feature_Engineering", in SageMaker, or we can manually upload the files to this notebook using the upload icon in Jupyter Lab. Then we can upload the local files to S3 by using `sagemaker_session.upload_data` and pointing directly to where the training data is saved.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

We specify the `data_dir` where we've saved our `train.csv` file. We decide on a descriptive `prefix` that defines where our data will be uploaded in the default S3 bucket. Finally, we create a pointer to our training data by calling `sagemaker_session.upload_data` and passing in the required parameters. It may help to look at the [Session documentation](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.Session.upload_data) or previous SageMaker code examples.

We are expected to upload our entire directory. Later, the training script will only access the `train.csv` file.
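To make the upload layout concrete, the sketch below lists the object keys one would expect after uploading a directory under a key prefix. It only mirrors the naming pattern visible in this notebook's bucket listing (for example `data-plagiarism/train.csv`); the helper `expected_s3_keys` is hypothetical, is not part of the SageMaker SDK, and never contacts AWS.

```python
import os
import tempfile


def expected_s3_keys(local_dir, key_prefix):
    """Return the object keys a directory upload is expected to produce.

    Assumption: each file keeps its path relative to `local_dir` and is
    placed under `key_prefix`, with '/' as the key separator.
    """
    keys = []
    for dirpath, _, filenames in os.walk(local_dir):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), start=local_dir)
            keys.append(key_prefix + "/" + rel_path.replace(os.sep, "/"))
    return sorted(keys)


# Build a throwaway directory that stands in for `plagiarism_data`.
with tempfile.TemporaryDirectory() as tmp_dir:
    for filename in ("train.csv", "test.csv"):
        with open(os.path.join(tmp_dir, filename), "w") as f:
            f.write("0,0.1,0.2\n")  # placeholder row, not real feature data
    example_keys = expected_s3_keys(tmp_dir, "data-plagiarism")

print(example_keys)
```

Comparing such a list against the keys printed by the test cell is one quick way to confirm that both CSV files landed under the intended prefix.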

In [3]:
# should be the name of directory you created to save your features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'data-plagiarism'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)


### Test cell

We test that our data has been successfully uploaded. The cell below prints the items in our S3 bucket and throws an error if the bucket is empty. We should see the contents of our `data_dir` and perhaps some checkpoints. If we see any other files listed, we may have some old model files that we can delete via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

data-plagiarism/sagemaker-pytorch-2020-04-11-22-38-12-952/debug-output/training_job_end.ts
data-plagiarism/sagemaker-pytorch-2020-04-11-22-38-12-952/output/model.tar.gz
data-plagiarism/sagemaker-pytorch-2020-04-11-22-42-49-740/debug-output/training_job_end.ts
data-plagiarism/sagemaker-pytorch-2020-04-11-22-42-49-740/output/model.tar.gz
data-plagiarism/test.csv
data-plagiarism/train.csv
plagiarism_data/test.csv
plagiarism_data/train.csv
sagemaker-pytorch-2020-04-11-19-20-47-802/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-19-21-39-129/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-19-26-44-041/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-19-36-01-946/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-19-36-42-981/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-19-58-05-051/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-22-25-42-283/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-22-31-16-545/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-11-22-

---

# Modeling

Now that we've uploaded our training data, it's time to define and train a model!

The type of model we create is up to us. For a binary classification task, we can choose to go one of three routes:
* Use a built-in classification algorithm, like LinearLearner.
* Define a custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* Define a custom PyTorch neural network classifier. 
 
---

## Create a training script 

To implement a custom classifier, we'll need to complete a `train.py` script.

A typical training script:
* Loads training data from a specified directory
* Parses any training & model hyperparameters (ex. nodes in a neural network, training epochs, etc.)
* Instantiates a model of your design, with any specified hyperparams
* Trains that model 
* Finally, saves the model so that it can be hosted/deployed, later
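The hyperparameter-parsing step might look like the sketch below. It assumes the SageMaker script-mode convention of exposing locations through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables, and it reuses the hyperparameter names that this notebook passes to its estimator. The default values and the function name `parse_args` are illustrative and are not taken from the project's `train.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse locations and hyperparameters for a training script (sketch)."""
    parser = argparse.ArgumentParser()

    # Locations: fall back to local folders when not running inside SageMaker.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))

    # Model and training hyperparameters (defaults are placeholders).
    parser.add_argument("--input_features", type=int, default=5)
    parser.add_argument("--hidden_dim", type=int, default=150)
    parser.add_argument("--output_dim", type=int, default=1)
    parser.add_argument("--epochs", type=int, default=10)

    return parser.parse_args(argv)


# Simulate the arguments a training job might pass to the script.
args = parse_args(["--hidden_dim", "20", "--epochs", "3"])
print(args.hidden_dim, args.epochs, args.input_features, args.output_dim)
```

Passing `argv` explicitly keeps the function easy to try in a notebook; inside a real training job, calling `parse_args()` with no arguments reads `sys.argv` instead.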

### Defining and training a model
To define and train the model, the `train.py`, `model.py` and `predict.py` files were written; they can be found in the `source_pytorch` directory.

In [5]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_pytorch/train.py

import argparse
import json
import os
import pandas as pd
import torch
import torch.optim as optim
import torch.utils.data

# imports the model in model.py by name
from model import BinaryClassifier

def model_fn(model_dir):
    """Load the PyTorch model from the `model_dir` directory."""
    print("Loading model.")

    # First, load the parameters used to create the model.
    model_info = {}
    model_info_path = os.path.join(model_dir, '

### Provided code

If we read the code above, we can see that it includes a few things:
* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name, `train.csv` and extracting the features and labels, `train_x`, and `train_y`

If we'd like to read more about model saving with [joblib for sklearn](https://scikit-learn.org/stable/modules/model_persistence.html) or with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), click on the provided links.
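The data-loading step can be pictured with the dependency-free sketch below. It assumes the same layout that the test-data cell in this notebook relies on: no header row, the class label in the first column, and the features in the remaining columns. The project's `train.py` imports pandas for this, so the `csv`-based helper `load_features_and_labels` and the two sample rows are stand-ins, not the author's code.

```python
import csv
import os
import tempfile


def load_features_and_labels(csv_path):
    """Read a header-less CSV whose first column is the class label."""
    features, labels = [], []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines
                continue
            labels.append(int(float(row[0])))
            features.append([float(value) for value in row[1:]])
    return features, labels


# Write a tiny made-up file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([[1, 0.9, 0.8], [0, 0.1, 0.2]])
    train_x, train_y = load_features_and_labels(path)

print(train_x, train_y)
```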

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument, **framework_version**, which selects the PyTorch version of the training container; this notebook uses `1.0`.
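In script mode, the entries of the `hyperparameters` dictionary reach the entry point as command-line arguments, which is why a training script can read them with `argparse`. The sketch below illustrates that flattening; the helper `hyperparameters_to_argv` is hypothetical and only approximates what the training container does.

```python
def hyperparameters_to_argv(hyperparameters):
    """Flatten a {'name': value} dictionary into ['--name', 'value', ...]."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend(["--" + name, str(value)])
    return argv


# The same hyperparameters this notebook passes to its estimator.
example_argv = hyperparameters_to_argv({
    "input_features": 5,
    "hidden_dim": 150,
    "output_dim": 1,
    "epochs": 500,
})
print(example_argv)
```

Because every value arrives as a string, the training script is responsible for converting each one back to the intended type, for example with `type=int` in `argparse`.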

## Define a Scikit-learn or PyTorch estimator

To import our desired estimator, use the line that matches the chosen framework. This notebook uses PyTorch:
```
from sagemaker.pytorch import PyTorch
```

In [8]:
from sagemaker.pytorch import PyTorch

output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate a pytorch estimator
estimator = PyTorch(entry_point='train.py',
                    source_dir='source_pytorch', # directory that contains train.py
                    role=role,
                    framework_version='1.0',
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'input_features': 5,  # num of features
                        'hidden_dim': 150,
                        'output_dim': 1,
                        'epochs': 500 # could change to higher
                    })

## Train the estimator

We train our estimator on the training data stored in S3. This should create a training job that we can monitor in our SageMaker console.

In [9]:
%%time

# Train your estimator on S3 training data

estimator.fit({'train': input_data})

2020-04-11 23:35:45 Starting - Starting the training job...
2020-04-11 23:35:47 Starting - Launching requested ML instances......
2020-04-11 23:36:46 Starting - Preparing the instances for training......
2020-04-11 23:38:06 Downloading - Downloading input data
2020-04-11 23:38:06 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-04-11 23:38:20,703 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-04-11 23:38:20,706 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-04-11 23:38:20,717 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-04-11 23:38:22,132 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-04-11 23:38:22,367 sagemaker-containers INFO     Module train does not provi


2020-04-11 23:38:37 Uploading - Uploading generated training model
2020-04-11 23:38:37 Completed - Training job completed
Training seconds: 45
Billable seconds: 45
CPU times: user 498 ms, sys: 30.3 ms, total: 529 ms
Wall time: 3min 11s


## Deploy the trained model

After training, we deploy our model to create a `predictor`. One option is to wrap the trained artifacts in a `PyTorchModel` that accepts `<model>.model_data` as an input parameter and points to the provided `source_pytorch/predict.py` file as an entry point; the cell below instead deploys directly from the trained estimator.

To deploy a trained model, we'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

Note: If we run into an instance error, it may be because we chose the wrong training or deployment `instance_type`. It may help to refer to our previous code to see which types of instances we used.

In [10]:
%%time

# uncomment, if needed
# from sagemaker.pytorch import PyTorchModel


# we deploy our model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')


-------------!
CPU times: user 240 ms, sys: 26.9 ms, total: 267 ms
Wall time: 6min 31s


---
# Evaluating the Model

Once our model is deployed, we can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [11]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of our model

We use our deployed `predictor` to generate predicted class labels for the test data. We then compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that our model classified correctly. We may use [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for this calculation.

**Our target is a test accuracy of at least 90%.**
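As a reference for what the accuracy figure means, here is a minimal, library-free sketch of the calculation: predicted probabilities are turned into class labels with a threshold, and accuracy is the fraction of labels that match. The function name, the `threshold` parameter and the toy values are made up for illustration.

```python
def accuracy_from_probabilities(true_labels, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the true label."""
    if len(true_labels) != len(probabilities):
        raise ValueError("true_labels and probabilities must have the same length")
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / len(true_labels)


# Toy values, not results from the deployed model.
toy_true = [1, 0, 1, 0]
toy_probs = [0.97, 0.12, 0.40, 0.08]  # the third sample is misclassified
toy_accuracy = accuracy_from_probabilities(toy_true, toy_probs)
print(toy_accuracy)  # 0.75
```

An explicit threshold also makes the tie case unambiguous: Python's `round` and NumPy's rounding both send exactly 0.5 to 0, whereas `p >= 0.5` labels it as 1.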

In [17]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# First: generate predicted, class labels
test_y_preds = predictor.predict(torch.from_numpy(test_x.values).float().to(device))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [18]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, round(pd.DataFrame(test_y_preds)))

print(accuracy)


## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

0.92

Predicted class labels: 
[[1.        ]
 [1.        ]
 [1.        ]
 [0.98929006]
 [1.        ]
 [1.        ]
 [0.12486755]
 [0.08934402]
 [0.15312222]
 [0.12483734]
 [0.16342236]
 [0.64334506]
 [0.9642041 ]
 [1.        ]
 [1.        ]
 [0.99990535]
 [1.        ]
 [1.        ]
 [0.11921723]
 [1.        ]
 [0.12640212]
 [1.        ]
 [1.        ]
 [0.11286246]
 [0.67459834]]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


In [45]:
import numpy as np
predicted = [int(round(i[0])) for i in test_y_preds]
actual = [i for i in test_y.values]
fp = np.count_nonzero(np.diff([actual,predicted], axis=0)==1)
fn = np.count_nonzero(np.diff([actual,predicted], axis=0)==-1)
print("actual:\t\t{}\npredicted:\t{}\nfalse positives: {}\nfalse negatives: {}".format(actual,predicted,fp,fn))

actual:		[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
predicted:	[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
false positives: 2
false negatives: 0


**How many false positives and false negatives did your model produce?**

- When the model predicts that an answer is plagiarized (positive) but it actually is not, the prediction is a **false positive**.
- When the model predicts that an answer is not plagiarized (negative) but it actually is, the prediction is a **false negative**.
- The model produced 2 false positives and 0 false negatives; see the computations in the cells above and below.
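The two definitions can also be checked by hand. The sketch below counts all four outcomes with plain Python, using the `actual` and `predicted` lists printed earlier in this notebook; the helper name `confusion_counts` is an illustrative choice.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = plagiarized)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1
        elif p == 0 and a == 0:
            counts["tn"] += 1
        elif p == 1 and a == 0:
            counts["fp"] += 1  # flagged as plagiarized, but it is not
        else:
            counts["fn"] += 1  # missed a plagiarized answer
    return counts


# Labels copied from the notebook output above.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)
```

The correct predictions (`tp` plus `tn`) divided by the 25 test samples give the same 0.92 accuracy reported earlier.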

In [47]:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(test_y, round(pd.DataFrame(test_y_preds))).ravel()

print('FP count is {} and FN count is {}'.format(fp,fn))

FP count is 2 and FN count is 0


**How do we decide on the type of model to use?**

- My intuition: a neural network can be seen as an extension of regression, so applying a sigmoid to the output layer and rounding the result to 0 or 1 turns its continuous output into a class label.
- Neural networks have also performed well on similar problems.
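The sigmoid-then-round idea can be sketched in a few lines of plain Python. This assumes the network's last layer produces a single raw score per answer; `model.py` is not shown in this notebook, so the snippet illustrates the principle and not the actual `BinaryClassifier`.

```python
import math


def sigmoid(z):
    """Map a raw score to the interval (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_class_label(z, threshold=0.5):
    """Convert a raw score to a class label: 1 (plagiarized) or 0 (not)."""
    return 1 if sigmoid(z) >= threshold else 0


# Made-up raw scores: strongly negative, neutral, strongly positive.
for raw_score in (-2.0, 0.0, 3.0):
    print(raw_score, round(sigmoid(raw_score), 4), to_class_label(raw_score))
```

Scores far below zero map to probabilities near 0 and scores far above zero map to probabilities near 1, which matches the shape of the predicted values printed earlier.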

----
## Clean up Resources

After we're done evaluating our model, **we delete our model endpoint**. We can do this with a call to `.delete_endpoint()`. Any other resources can be deleted from the AWS console; more instructions on cleaning up all our resources follow below.

In [48]:
predictor.endpoint

'sagemaker-pytorch-2020-04-11-23-35-45-262'

In [49]:
# delete the endpoint so it stops incurring charges
predictor.delete_endpoint()


### Deleting S3 bucket

When we are *completely* done with training and testing models, we can also delete the contents of our S3 bucket. The cell below removes every object; the empty bucket itself can then be deleted from the S3 console. If we do this before we are done training our model, we'll have to upload our training data again.

In [50]:
# empty the bucket: this deletes every object stored in it

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '805DB1E394FA062D',
   'HostId': 'twsK/7qKxrpAlBEJJVHdKUsEIRkN1WbqeTftJFhyeG7r+A2YLSnzmnI1DceiWlFXUVD7rPQf+po=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'twsK/7qKxrpAlBEJJVHdKUsEIRkN1WbqeTftJFhyeG7r+A2YLSnzmnI1DceiWlFXUVD7rPQf+po=',
    'x-amz-request-id': '805DB1E394FA062D',
    'date': 'Sun, 12 Apr 2020 00:46:06 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'data-plagiarism/sagemaker-pytorch-2020-04-11-22-38-12-952/debug-output/training_job_end.ts'},
   {'Key': 'data-plagiarism/sagemaker-pytorch-2020-04-11-23-35-45-262/output/model.tar.gz'},
   {'Key': 'data-plagiarism/sagemaker-pytorch-2020-04-11-23-35-45-262/debug-output/training_job_end.ts'},
   {'Key': 'sagemaker-pytorch-2020-04-11-22-25-42-283/source/sourcedir.tar.gz'},
   {'Key': 'plagiarism_data/train.csv'},
   {'Key': 'sagemaker-pytorch-20

### Deleting all our models and instances

When we are _completely_ done with this project and do **not** ever want to revisit this notebook, we can choose to delete all of our SageMaker notebook instances and models by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html). Before we delete this notebook instance, I recommend at least downloading a copy and saving it locally.

---
## Further Directions

There are many ways to improve or add on to this project to expand your learning or make this more of a unique project for you. A few ideas are listed below:
* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a lambda function to deploy your model to a web application.