# Plagiarism Detection Model with AdaBoost

Now that I've created training and test data, I'm ready to define and train a model. My goal in this notebook is to train an AdaBoost binary classification model that learns to label an answer file as either plagiarized or not, based on the features provided to the model.

This task will be broken down into a few discrete steps:

* Upload data to S3.
* Define a binary classification model and a training script.
* Train an AdaBoost binary classifier and deploy it.
* Evaluate the deployed classifier and answer some questions about this approach.

## Load Data to S3

I have created two files, `train.csv` and `test.csv`, with the features and class labels for the given corpus of plagiarized/non-plagiarized text data, to which I applied PCA.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, I'll upload locally stored data to S3.

In [None]:
import pandas as pd
import boto3
import sagemaker

In [None]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

### Upload training data to S3

The cell below specifies the `data_dir` where the `train.csv` file is saved, defines a `prefix` under which the data will be uploaded in the default S3 bucket, and finally creates a pointer to the training data by calling `sagemaker_session.upload_data` with the required parameters.

In [None]:
# the name of directory created to save features data
data_dir = "plagiarism_data"

# set prefix, a descriptive name for a directory  
prefix = "plagiarism_data"

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
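
As a rough illustration of where the uploaded files end up, the sketch below mimics the key layout that, as far as I understand it, `upload_data` uses for a directory: each file lands under `key_prefix`, and the returned pointer is `s3://<bucket>/<key_prefix>`. The bucket name and both helper functions are hypothetical, and nothing here talks to AWS.

```python
def expected_s3_keys(file_names, key_prefix):
    """Return the S3 object keys for files uploaded from the top level of a directory."""
    return ["{}/{}".format(key_prefix, name) for name in file_names]


def expected_input_uri(bucket_name, key_prefix):
    """Return the S3 URI that points at the uploaded directory."""
    return "s3://{}/{}".format(bucket_name, key_prefix)


# hypothetical bucket name; the real one comes from sagemaker_session.default_bucket()
example_bucket = "sagemaker-example-bucket"

keys = expected_s3_keys(["train.csv", "test.csv"], "plagiarism_data")
uri = expected_input_uri(example_bucket, "plagiarism_data")
print(keys)
print(uri)
```

These are the kind of keys the test cell should list if the upload worked.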

### Test cell

Test that your data has been successfully uploaded. The cell below prints out the items in your S3 bucket and will throw an error if it is empty. You should see the contents of your `data_dir` and perhaps some checkpoints. If you see any other files listed, then you may have some old model files that you can delete via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).

In [None]:

# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

---

# Modeling

Now that I've uploaded my training data, it's time to define and train a model!

In this notebook, for this binary classification task, I'll use AdaBoost, an ensemble learning meta-algorithm, through the sklearn implementation running on SageMaker.

In [None]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_sklearn/train.py

### Provided code

The train.py code includes a few things:
* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name, `train.csv` and extracting the features and labels, `train_x`, and `train_y`
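
Since the script itself is only printed by the cell above, here is a minimal sketch of how such a training script could be organised. It is not the actual `train.py`: the file name `model.joblib`, the helper names `parse_args` and `train`, and the direct `import joblib` are my assumptions, and I assume the label sits in the first column of `train.csv`, as it does in `test.csv`. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are the environment variables SageMaker sets for the model directory and the `train` channel.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the directories SageMaker exposes through environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))
    return parser.parse_args(argv)


def model_fn(model_dir):
    """Load the saved model; SageMaker calls this when the endpoint starts."""
    import joblib  # assumption: joblib is importable on its own in the container

    return joblib.load(os.path.join(model_dir, "model.joblib"))


def train(args):
    """Fit an AdaBoost classifier on train.csv and save it to the model directory."""
    import joblib
    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier

    train_data = pd.read_csv(os.path.join(args.data_dir, "train.csv"), header=None, names=None)

    # assumption: labels are in the first column, features in the remaining ones
    train_y = train_data.iloc[:, 0]
    train_x = train_data.iloc[:, 1:]

    model = AdaBoostClassifier()
    model.fit(train_x, train_y)

    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model
```

A real script would end by calling `train(parse_args())` under an `if __name__ == "__main__":` guard; I left that out so the sketch can be imported without starting a training run.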

To read more about model saving with [joblib for sklearn](https://scikit-learn.org/stable/modules/model_persistence.html) or with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), click on the provided links.

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that will be executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, construct an estimator, and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: The role ARN, which was specified above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.
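
To make the `hyperparameters` entry more concrete: as far as I know, SageMaker hands each entry of that dictionary to the training script as a command-line argument, so the script can read them with `argparse`. The sketch below imitates that locally. The helper `to_cli_args` is hypothetical, and the defaults are simply the sklearn `AdaBoostClassifier` defaults.

```python
import argparse

# the same dictionary that is passed to the estimator
hyperparameters = {"learning-rate": 0.005, "n-estimators": 200}


def to_cli_args(params):
    """Turn {'name': value} into ['--name', 'value', ...] (hypothetical helper)."""
    cli_args = []
    for name, value in params.items():
        cli_args += ["--" + name, str(value)]
    return cli_args


# the parsing a training script could do on its side
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=1.0)
parser.add_argument("--n-estimators", type=int, default=50)

args = parser.parse_args(to_cli_args(hyperparameters))

# argparse exposes '--learning-rate' as the attribute 'learning_rate'
print(args.learning_rate, args.n_estimators)
```

Note the hyphen in the dictionary keys: `argparse` converts it to an underscore in the attribute name, which is why the script side reads `args.learning_rate`.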

In [None]:
from sagemaker.sklearn.estimator import SKLearn
# your import and estimator code, here
estimator = SKLearn(entry_point='train.py',
                    source_dir='source_sklearn',
                    train_instance_count = 1,
                    train_instance_type='ml.m4.xlarge',
                    role=role,
                    sagemaker_session = sagemaker_session,
                    framework_version='0.20.0',                 
                    hyperparameters=
                    {
                        'learning-rate': 0.005, # learning rate
                        'n-estimators': 200,  # num of estimators
                    }
                   )

## Train the estimator

Train the estimator on the training data stored in S3. This should create a training job that can be monitored in the SageMaker console.

In [None]:
%%time

# Training estimator on S3 training data
estimator.fit({'train':input_data})


## Deploy the trained model

After training, we'll deploy the model to create a `predictor`.

To deploy a trained model, we'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

In [None]:
%%time

# uncomment, if needed
# from sagemaker.pytorch import PyTorchModel


# deploy your model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')


---
# Evaluating the Model

Once the model is deployed, we can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [None]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of the model

Here we'll use the deployed `predictor` to generate predicted class labels for the test data. Then we'll compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that the model classified correctly. [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) may be used for this calculation.
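
For reference, the accuracy described here can also be computed by hand. The function below is a small sketch of that definition on plain Python lists; the example labels are made up for illustration and are not the notebook's results.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels (between 0 and 1.0)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label sequences must have the same length.")
    if len(true_labels) == 0:
        raise ValueError("At least one label is required.")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# made-up labels for illustration only (1 = plagiarized, 0 = not plagiarized)
example_true = [1, 1, 0, 0, 1]
example_pred = [1, 0, 0, 0, 1]

example_accuracy = accuracy(example_true, example_pred)
print(example_accuracy)  # 4 of the 5 labels match
```

`sklearn.metrics.accuracy_score` computes the same fraction, which is what the evaluation cell uses.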

In [None]:
# First: generate predicted class labels
test_y_preds = predictor.predict(test_x)


# test that the model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

In [None]:
%matplotlib inline

from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# printing accuracy score
print("accuracy score = {}".format(accuracy_score(test_y.values, test_y_preds)))

# using matplotlib to show confusion matrix
fig, ax = plt.subplots(1, 1, figsize=(7, 4))

# predictions are passed first, so rows are predicted labels and columns are true labels
ConfusionMatrixDisplay(confusion_matrix(test_y_preds, test_y, labels=[1, 0]),
                       display_labels=['Plagiarized', 'Non Plagiarized']).plot(values_format=".0f", ax=ax)

# set the axis labels before showing the figure, otherwise they do not appear
ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()

### Question 1: How many false positives and false negatives did the model produce, if any? And why?

**Answer**:

As printed in the above confusion matrix: 15 true positives, 1 false positive, 0 false negatives and 9 true negatives.

My guess: AdaBoost is known to be sensitive to noise and outliers. I presume that the misclassified example is an outlier: perhaps a text that uses many terms that also appear in the reference task text, so it can have a high 1-gram containment value without being plagiarized.
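
To put the four counts from the answer above in perspective, the short calculation below derives accuracy, precision and recall from them. Only the counts come from the confusion matrix; the variable names are mine.

```python
# counts read from the confusion matrix in the answer above
true_positives = 15
false_positives = 1
false_negatives = 0
true_negatives = 9

total = true_positives + false_positives + false_negatives + true_negatives

# share of all test files that were labelled correctly
accuracy = (true_positives + true_negatives) / total
# share of files flagged as plagiarized that really were plagiarized
precision = true_positives / (true_positives + false_positives)
# share of plagiarized files that the model caught
recall = true_positives / (true_positives + false_negatives)

print("accuracy  = {:.4f}".format(accuracy))
print("precision = {:.4f}".format(precision))
print("recall    = {:.4f}".format(recall))
```

With a single false positive among 25 test files, the model misses no plagiarized file but flags one original answer by mistake.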

### Question 2: What are the highlights of this kind of model, and why can we pick this one?

**Answer**:

I used `AdaBoostClassifier` from sklearn with its default `DecisionTreeClassifier` base estimator. It is known to be a good classifier on complex classification tasks, and it is generally considered less prone to overfitting than many other algorithms.
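
To show the idea behind boosting weak tree learners, here is a minimal sketch of classic discrete AdaBoost with one-feature decision stumps. It is not sklearn's implementation (which works on 0/1 labels and adds a learning rate); it assumes labels coded as -1/+1, and the toy data and all function names are made up for illustration.

```python
import math


def stump_predict(x, threshold, polarity):
    """A decision stump: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def fit_adaboost(xs, ys, n_rounds):
    """Fit `n_rounds` stumps, reweighting the samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # misclassified samples gain weight, correctly classified ones lose weight
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# made-up one-dimensional data that no single stump can separate
toy_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = fit_adaboost(toy_x, toy_y, n_rounds=3)
toy_predictions = [predict(ensemble, x) for x in toy_x]
print(toy_predictions == toy_y)
```

Each stump on its own gets at least three of the ten toy points wrong, but the weighted vote of three stumps classifies all of them correctly. The reweighting step is also where the sensitivity to outliers comes from: a point that keeps being misclassified keeps gaining weight.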

----
## Clean up Resources

When done evaluating the model, **delete the model endpoint**. We can do this with a call to `.delete_endpoint()`.

In [2]:
predictor.delete_endpoint()



### Deleting S3 bucket

When *completely* done with training and testing models, it is also possible to delete the entire S3 bucket. If this is done before training the model, we have to recreate the S3 bucket and upload the training data again.

In [None]:
# delete every object in the bucket (these lines are active; comment them out to keep the data)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()