# BERT text classification on SageMaker using PyTorch

This notebook trains a BERT text classifier on the DBpedia dataset. Each row of the dataset CSV holds a class label, a title, and a description, for example:

```text
1,"E. D. Abbott Ltd"," Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972."
1,"Schwan-Stabilo"," Schwan-STABILO is a German maker of pens for writing colouring and cosmetics as well as markers and highlighters for office use. It is the world's largest manufacturer of highlighter pens Stabilo Boss."
1,"Q-workshop"," Q-workshop is a Polish company located in Poznań that specializes in designand production of polyhedral dice and dice accessories for use in various games (role-playing gamesboard games and tabletop wargames). They also run an online retail store and maintainan active forum community.Q-workshop was established in 2001 by Patryk Strzelewicz – a student from Poznań. Initiallythe company sold its products via online auction services but in 2005 a website and online store wereestablished."
```
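
To make the row layout concrete, here is a minimal sketch that parses one such row with Python's `csv` module. It assumes the three columns are label, title, and description, and that labels are 1-based indexes into `classes.txt` (an assumption consistent with the sample rows, where label 1 marks company entries). The row text is shortened and the `classes` list is abbreviated for illustration.

```python
import csv
import io

# First few class names, in the order they appear in classes.txt (abbreviated here).
classes = ["Company", "EducationalInstitution", "Artist"]

# A shortened version of one of the sample rows.
row_text = '1,"Schwan-Stabilo"," Schwan-STABILO is a German maker of pens."\n'

# csv.reader handles the quoted fields, including any commas inside quotes.
label, title, description = next(csv.reader(io.StringIO(row_text)))

# Assumption: labels are 1-based indexes into classes.txt.
class_name = classes[int(label) - 1]
print(class_name, "|", title, "|", description.strip())
```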



In [1]:
import sys, os
import logging

sys.path.append("src")

logging.basicConfig(level="INFO", handlers=[logging.StreamHandler(sys.stdout)],
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

## Bucket and role setup

In [2]:
import sagemaker
from sagemaker import get_execution_role
sm_session = sagemaker.session.Session()
role = get_execution_role()

In [3]:
data_bucket = sm_session.default_bucket()

data_bucket_prefix = "bert-demo"

s3_uri_data = "s3://{}/{}/data".format(data_bucket, data_bucket_prefix)
s3_uri_train = "{}/{}".format(s3_uri_data, "train.csv")
s3_uri_val = "{}/{}".format(s3_uri_data, "val.csv")

# Use a small dataset for local testing if required
s3_uri_mini_data = "s3://{}/{}/minidata".format(data_bucket, data_bucket_prefix)
s3_uri_mini_train = "{}/{}".format(s3_uri_mini_data, "train.csv")
s3_uri_mini_val = "{}/{}".format(s3_uri_mini_data, "val.csv")

s3_uri_classes = "{}/{}".format(s3_uri_data, "classes.txt")

s3_uri_test = "{}/{}".format(s3_uri_data, "test.csv")

s3_output_path = "s3://{}/{}/output".format(data_bucket, data_bucket_prefix)
s3_code_path = "s3://{}/{}/code".format(data_bucket, data_bucket_prefix)
s3_checkpoint = "s3://{}/{}/checkpoint".format(data_bucket, data_bucket_prefix)

In [4]:
prepare_dataset = True

## Prepare dataset

In [5]:
tmp = "tmp"

In [6]:
%%bash -s  "$prepare_dataset"  "$s3_uri_test" "$s3_uri_classes" "$tmp"
   
prepare_dataset=$1
s3_test=$2
s3_classes=$3
tmp=$4

if [ "$prepare_dataset" == "True" ]
then  
    echo "Downloading data.."
    wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz -P ${tmp}
    tar -xzvf ${tmp}/dbpedia_csv.tar.gz
    mv dbpedia_csv ${tmp}
    
    ls -l ${tmp}/dbpedia_csv/
    cat  ${tmp}/dbpedia_csv/classes.txt
    head -3 ${tmp}/dbpedia_csv/train.csv 
    
    echo aws s3 cp ${tmp}/dbpedia_csv/test.csv ${s3_test}
    aws s3 cp ${tmp}/dbpedia_csv/test.csv ${s3_test}
    
    aws s3 cp ${tmp}/dbpedia_csv/classes.txt ${s3_classes}
   
fi

Downloading data..
dbpedia_csv/
dbpedia_csv/test.csv
dbpedia_csv/classes.txt
dbpedia_csv/train.csv
dbpedia_csv/readme.txt
total 191344
-rw------- 1 ec2-user ec2-user       146 Mar 28  2015 classes.txt
-rw-rw-r-- 1 ec2-user ec2-user      1758 Mar 29  2015 readme.txt
-rw------- 1 ec2-user ec2-user  21775285 Mar 28  2015 test.csv
-rw------- 1 ec2-user ec2-user 174148970 Mar 28  2015 train.csv
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork
1,"E. D. Abbott Ltd"," Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972."
1,"Schwan-Stabilo"," Schwan-STABILO is a German maker of pens for writing colouring and cosmetics as well as markers and highlighters for office use. It is the world's largest manufacturer 

--2020-07-06 10:42:20--  https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/dbpedia_csv.tar.gz [following]
--2020-07-06 10:42:20--  https://raw.githubusercontent.com/saurabh3949/Text-Classification-Datasets/master/dbpedia_csv.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68431223 (65M) [application/octet-stream]
Saving to: ‘tmp/dbpedia_csv.tar.gz’


### Train/validation split

In [7]:
from sklearn.model_selection import train_test_split

def train_val_split(data_file, train_file_name=None, val_file_name=None, val_ratio=0.30, train_ratio=0.70):
    """Randomly split the lines of data_file into a train file and a validation file."""
    with open(data_file, "r") as f:
        lines = f.readlines()

    train, val = train_test_split(lines, test_size=val_ratio, train_size=train_ratio, random_state=42)

    # Default to train.csv and val.csv next to the input file.
    # Note: if data_file is itself named train.csv, the default overwrites it.
    train_file_name = train_file_name or os.path.join(os.path.dirname(data_file), "train.csv")
    val_file_name = val_file_name or os.path.join(os.path.dirname(data_file), "val.csv")

    with open(train_file_name, "w") as f:
        f.writelines(train)
    print("Wrote {} records to train".format(len(train)))
    
    with open(val_file_name, "w") as f:
        f.writelines(val)
    print("Wrote {} records to validation".format(len(val)))
    
    return train_file_name, val_file_name
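
Because the split above is random rather than stratified, it can be useful to check how the labels are distributed in each output file. The following is a minimal sketch, not part of the original notebook: `label_counts` is a hypothetical helper, and it assumes the label is the first CSV column.

```python
import csv
import io
from collections import Counter


def label_counts(lines):
    """Count rows per label, assuming the label is the first CSV column."""
    reader = csv.reader(lines)
    return Counter(row[0] for row in reader if row)


# Tiny in-memory example; an open file handle can be passed in the same way.
sample = io.StringIO('1,"A"," text"\n2,"B"," text"\n1,"C"," text"\n')
counts = label_counts(sample)
print(counts.most_common())
```

In the notebook, the same helper could be called on the files returned by `train_val_split`, for example inside `with open(l_train) as f:`.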


In [8]:
if prepare_dataset:
    from s3_util import S3Util
    
    s3util = S3Util()
    l_data_file = os.path.join(tmp, "dbpedia_csv", "train.csv")
    l_train, l_val = train_val_split(l_data_file)
    s3util.upload_file(l_train, s3_uri_train)
    s3util.upload_file(l_val, s3_uri_val)
    
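    # The first split overwrote train.csv with the 70% training split,
    # so the mini dataset below is sampled from those training records.
    l_mini_train = os.path.join(os.path.dirname(l_data_file), "mini_train.csv")
    l_mini_val = os.path.join(os.path.dirname(l_data_file), "mini_val.csv")
    l_train, l_val = train_val_split(l_data_file, l_mini_train, l_mini_val, val_ratio=0.001, train_ratio=0.01)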

    s3util.upload_file(l_mini_train, s3_uri_mini_train)
    s3util.upload_file(l_mini_val, s3_uri_mini_val)
    
    

Wrote 392000 records to train
Wrote 168000 records to validation
Uploading file tmp/dbpedia_csv/train.csv to s3://sagemaker-us-west-2-324346001917/bert-demo/data/train.csv in 2.056788 seconds
Uploading file tmp/dbpedia_csv/val.csv to s3://sagemaker-us-west-2-324346001917/bert-demo/data/val.csv in 0.998275 seconds
Wrote 3920 records to train
Wrote 392 records to validation
Uploading file tmp/dbpedia_csv/mini_train.csv to s3://sagemaker-us-west-2-324346001917/bert-demo/minidata/train.csv in 0.156894 seconds
Uploading file tmp/dbpedia_csv/mini_val.csv to s3://sagemaker-us-west-2-324346001917/bert-demo/minidata/val.csv in 0.140068 seconds


In [9]:
# Clean up the temporary directory
!rm -rf $tmp

## Train

In [10]:
inputs_full =  {
    "train" : s3_uri_train,
    "val" : s3_uri_val,
    "class" : s3_uri_classes
}

inputs_sample =  {
    "train" : s3_uri_mini_train,
    "val" : s3_uri_mini_val,
    "class" : s3_uri_classes
}

# Training on the full dataset can take 4-5 hours. To run a quick test, use inputs_sample instead.
inputs = inputs_full

In [11]:
sm_localcheckpoint_dir="/opt/ml/checkpoints/"

In [15]:
hp = {
    "epochs": 50,
    "earlystoppingpatience": 3,
    # Increasing the batch size can cause a CUDA out-of-memory error; increase gradient accumulation instead
    "batch": 8,
    "trainfile": s3_uri_train.split("/")[-1],
    "valfile": s3_uri_val.split("/")[-1],
    "classfile": s3_uri_classes.split("/")[-1],
    # The number of steps over which gradients are accumulated
    "gradaccumulation": 8,
    "log-level": "INFO",
    # Limited by the model's maximum position embedding size; large values can also cause a CUDA out-of-memory error
    "maxseqlen": 512,
    # Keep the learning rate small, because this is a pretrained model
    "lr": 0.00001,
    # Set to 1 to update only the weights of the final classification layer
    "finetune": 0,
    "checkpointdir": sm_localcheckpoint_dir,
    # Save a checkpoint once every n epochs
    "checkpointfreq": 2
}
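
With gradient accumulation, the optimizer typically steps once every `gradaccumulation` batches, so the effective batch size is the product of the two settings. The arithmetic below is a sketch that assumes the training script (`main.py`, not shown here) applies accumulation in this standard way and keeps any final partial batch. The record count is the training split size reported earlier.

```python
import math

batch = 8
grad_accumulation = 8
train_records = 392000  # size of the training split reported above

# Gradients from grad_accumulation batches are summed before each optimizer step.
effective_batch = batch * grad_accumulation

# Assumption: a final partial batch, if any, is kept rather than dropped.
batches_per_epoch = math.ceil(train_records / batch)
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / grad_accumulation)

print(effective_batch, batches_per_epoch, optimizer_steps_per_epoch)
```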



In [16]:
hp

{'epochs': 50,
 'earlystoppingpatience': 3,
 'batch': 4,
 'trainfile': 'train.csv',
 'valfile': 'val.csv',
 'classfile': 'classes.txt',
 'gradaccumulation': 8,
 'log-level': 'INFO',
 'maxseqlen': 512,
 'lr': 1e-05,
 'finetune': 0,
 'checkpointdir': '/opt/ml/checkpoints/',
 'checkpointfreq': 2}

In [17]:
inputs

{'train': 's3://sagemaker-us-west-2-324346001917/bert-demo/minidata/train.csv',
 'val': 's3://sagemaker-us-west-2-324346001917/bert-demo/minidata/val.csv',
 'class': 's3://sagemaker-us-west-2-324346001917/bert-demo/data/classes.txt'}

In [18]:
# Raw strings keep the regex backslashes (\d) from being treated as string escapes
metric_definitions = [
    {"Name": "TrainLoss", "Regex": r"###score: train_loss### (\d*[.]?\d*)"},
    {"Name": "ValidationLoss", "Regex": r"###score: val_loss### (\d*[.]?\d*)"},
    {"Name": "TrainScore", "Regex": r"###score: train_score### (\d*[.]?\d*)"},
    {"Name": "ValidationScore", "Regex": r"###score: val_score### (\d*[.]?\d*)"},
]
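
SageMaker applies each regex to the training job's log output and records the captured group as the metric value. The sketch below shows what these patterns match. The log line is a made-up example of the format the regexes imply, since the training script's actual output is not shown here.

```python
import re

# Same pattern as the TrainLoss metric definition, written as a raw string.
train_loss_regex = r"###score: train_loss### (\d*[.]?\d*)"

# Hypothetical log line in the format the regex expects.
log_line = "2020-07-06 10:50:00 - main - INFO - ###score: train_loss### 0.4375"

match = re.search(train_loss_regex, log_line)
value = float(match.group(1))
print(value)
```

Note that the pattern captures neither a sign nor an exponent, so a value logged as `1e-05` would be captured as `1`.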

In [19]:
instance_type = "ml.p3.2xlarge"

In [20]:
# Set to True to train on spot instances
use_spot = True
train_max_run_secs = 2 * 24 * 60 * 60
spot_wait_sec = 5 * 60
max_wait_time_secs = train_max_run_secs + spot_wait_sec

if not use_spot:
    max_wait_time_secs = None

# Spot instances are not used in local mode; also use the smaller dataset when running locally
if instance_type == 'local':
    use_spot = False
    max_wait_time_secs = 0
    inputs = inputs_sample
    wait = True
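
The relationship between the run and wait limits can be captured in a small helper. This is a minimal sketch: `stop_limits` is a hypothetical name, not part of the SageMaker SDK. For managed spot training, SageMaker requires the maximum wait time to be at least the maximum run time, which is why the wait limit is built by adding a spot-waiting allowance to the run limit.

```python
def stop_limits(use_spot, max_run_secs, spot_wait_secs):
    """Return (max_run, max_wait) in seconds; max_wait is None when spot is off."""
    if not use_spot:
        return max_run_secs, None
    # The wait limit covers the run time plus the time spent waiting for spot capacity.
    return max_run_secs, max_run_secs + spot_wait_secs


max_run, max_wait = stop_limits(True, 2 * 24 * 60 * 60, 5 * 60)
print(max_run, max_wait)
```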

In [21]:
job_type = "bert-classification"
base_name = "{}".format(job_type)

In [None]:
from sagemaker.pytorch import PyTorch

# The train_* argument names below follow version 1 of the SageMaker Python SDK
# (version 2 renamed them, for example train_instance_count became instance_count).
estimator = PyTorch(
    # entry_point='main_train_k_fold.py',
    entry_point='main.py',
    source_dir='src',
    role=role,
    framework_version="1.4.0",
    py_version='py3',
    train_instance_count=1,
    train_instance_type=instance_type,
    hyperparameters=hp,
    output_path=s3_output_path,
    metric_definitions=metric_definitions,
    train_volume_size=30,
    code_location=s3_code_path,
    debugger_hook_config=False,
    base_job_name=base_name,
    train_use_spot_instances=use_spot,
    train_max_run=train_max_run_secs,
    train_max_wait=max_wait_time_secs,
    checkpoint_s3_uri=s3_checkpoint,
    checkpoint_local_path=sm_localcheckpoint_dir)

estimator.fit(inputs, wait=True)

2020-07-06 10:43:13,274 - sagemaker - INFO - Creating training-job with name: bert-classification-2020-07-06-10-43-13-177
2020-07-06 10:43:13 Starting - Starting the training job...
2020-07-06 10:43:15 Starting - Launching requested ML instances......
2020-07-06 10:44:22 Starting - Preparing the instances for training.........
2020-07-06 10:46:03 Downloading - Downloading input data...
2020-07-06 10:46:17 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-06 10:47:20,529 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-06 10:47:20,553 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-06 10:47:22,011 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-06 10:47:22,320 sagemaker-containers INFO     Module

## Deploy BERT model

In [None]:
model_uri =  estimator.model_data

In [None]:
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(model_data=model_uri,
                     role=role,
                     framework_version='1.4.0',
                     entry_point='sagemaker_inference.py',
                     source_dir='src')

predictor = model.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge')