### Prerequisites

* Choose `Switch instance type` above
* Toggle `Fast launch only` to select more instance types
* Change instance type to `ml.m5.2xlarge`
* For Kernel, choose `Python 3 (Data Science)`

##### > Install dependencies 

In [1]:
%%capture 

!pip install sagemaker==2.100.0
!pip install scikit-learn==0.22.1
!pip install boto3==1.24.12
!pip install kaggle==1.5.12
!pip install pandas==1.0.1

**Note:** Restarting the kernel after installing the dependencies above is recommended.

### Imports 

In [2]:
from sklearn.preprocessing import LabelEncoder
from sagemaker import Session
from pandas import DataFrame
from time import sleep
import pandas as pd
import sagemaker
import sklearn
import logging
import pickle
import boto3
import os

##### > Setup logging

In [3]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

##### > Log versions of dependencies

In [4]:
logger.info(f'[Using SageMaker version: {sagemaker.__version__}]')
logger.info(f'[Using Sklearn version: {sklearn.__version__}]')
logger.info(f'[Using Boto3 version: {boto3.__version__}]')
logger.info(f'[Using Pandas version: {pd.__version__}]')

[Using SageMaker version: 2.100.0]
[Using Sklearn version: 0.22.1]
[Using Boto3 version: 1.24.12]
[Using Pandas version: 1.0.1]


### Essentials

In [5]:
session = Session()
s3 = boto3.resource('s3')

S3_BUCKET = session.default_bucket()
S3_DATA_FOLDER = 'data'

logger.info(f'S3 bucket = {S3_BUCKET}')

S3 bucket = sagemaker-us-east-1-119174016168


### Prepare data

##### > Follow the instructions below to download the COVID news articles dataset from Kaggle: https://www.kaggle.com/datasets/timmayer/covid-news-articles-2020-2022/

* Create a Kaggle account with your email address if you don't already have one.
* Once you have an account, under Account, click the `Create New API Token` button as shown below.<br>
![kaggle-credentials](./../img/kaggle-credentials.png)<br>
* This downloads a JSON file named `kaggle.json` containing your API credentials.
* Copy the `username` and `key` values from the downloaded JSON and assign them to the environment variables as shown below.

In [6]:
os.environ['KAGGLE_USERNAME'] = '<ENTER YOUR KAGGLE USERNAME>'
os.environ['KAGGLE_KEY'] = '<ENTER YOUR KAGGLE KEY>'
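
As an alternative to pasting the credentials into the notebook, the values could be read straight from the downloaded `kaggle.json`. The sketch below is only an illustration: the helper name `load_kaggle_credentials` is hypothetical, and it assumes the file contains the `username` and `key` fields mentioned in the steps above.

```python
import json
import os


def load_kaggle_credentials(path):
    """Copy `username` and `key` from a kaggle.json file into the environment.

    Hypothetical helper for illustration; it is not part of the Kaggle package.
    """
    with open(path, encoding='utf-8') as f:
        credentials = json.load(f)
    os.environ['KAGGLE_USERNAME'] = credentials['username']
    os.environ['KAGGLE_KEY'] = credentials['key']
    return credentials['username']
```

Calling `load_kaggle_credentials('kaggle.json')` before the download cell would then keep the key out of the notebook source, assuming the file was uploaded next to the notebook.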

##### > Download raw dataset from Kaggle to your local directory

In [7]:
!kaggle datasets download -d timmayer/covid-news-articles-2020-2022 
!unzip covid-news-articles-2020-2022.zip

Downloading covid-news-articles-2020-2022.zip to /root/train-bert-from-scratch-on-sagemaker/01-prepare
 98%|████████████████████████████████████████▎| 873M/889M [00:08<00:00, 104MB/s]
100%|████████████████████████████████████████| 889M/889M [00:13<00:00, 69.3MB/s]
Archive:  covid-news-articles-2020-2022.zip
  inflating: covid_articles_raw.csv  


##### > Upload raw dataset from local to S3

In [8]:
!aws s3 cp covid_articles_raw.csv s3://{S3_BUCKET}/data/covid_articles_raw.csv 

upload: ./covid_articles_raw.csv to s3://sagemaker-us-east-1-119174016168/data/covid_articles_raw.csv


In [9]:
RAW_INPUT_DATA_S3_LOCATION = f's3://{S3_BUCKET}/data/covid_articles_raw.csv'
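
The S3 location above hard-codes the `data` prefix, while later cells use `S3_DATA_FOLDER`. A minimal sketch of building the URI from that constant follows; `s3_uri` is a hypothetical helper and `example-bucket` is a placeholder, not a real bucket.

```python
import posixpath

S3_DATA_FOLDER = 'data'


def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI (hypothetical helper)."""
    return f's3://{bucket}/{posixpath.join(*parts)}'


# 'example-bucket' is a placeholder used only for this illustration.
uri = s3_uri('example-bucket', S3_DATA_FOLDER, 'covid_articles_raw.csv')
print(uri)
```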

##### > Read raw dataset into a pandas dataframe

In [10]:
%%time

df = pd.read_csv(RAW_INPUT_DATA_S3_LOCATION)
df.dropna(inplace=True)
df = df.apply(lambda x: x.str.lower())
df.head()

CPU times: user 50.5 s, sys: 10 s, total: 1min
Wall time: 1min 24s


Unnamed: 0,title,content,category
0,looking into the truth about modern workplace ...,"hi, what are you looking for?\nby\npublished\n...",general
1,hexo refiles financial statements,"new york reported a record 90,132 new covid-19...",general
2,"japan raid, turkey arrests in widening ghosn p...","hi, what are you looking for?\nby\npublished\n...",general
3,pope's bodyguards criticised over slapping inc...,"hi, what are you looking for?\nby\npublished\n...",general
4,lebanon denies president welcomed fugitive ghosn,"hi, what are you looking for?\nby\npublished\n...",general


In [11]:
df.shape

(477536, 3)

### Prepare dataset for BERT MLM training

In [12]:
mlm_df = df[['title', 'content']].copy()
mlm_df['content'] = mlm_df['content'].apply(lambda x: x.replace('hi, what are you looking for?\nby\npublished\n', ''))
mlm_df['content'] = mlm_df['content'].apply(lambda x: x.replace('\n', ' '))
mlm_df.head()

Unnamed: 0,title,content
0,looking into the truth about modern workplace ...,"workplaces are being transformed, according to..."
1,hexo refiles financial statements,"new york reported a record 90,132 new covid-19..."
2,"japan raid, turkey arrests in widening ghosn p...",officials on thursday raided the tokyo residen...
3,pope's bodyguards criticised over slapping inc...,pope francis's attempt to wrest himself from t...
4,lebanon denies president welcomed fugitive ghosn,the lebanese presidency on thursday denied rep...
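
The two `replace` calls above can be read as one cleaning step: drop the scraped page header, then flatten line breaks. The sketch below shows the same logic on a made-up string; `clean_content` is a hypothetical name and the sample text is not taken from the dataset.

```python
# Header that prefixes many scraped articles, as seen in the raw `content` column.
BOILERPLATE = 'hi, what are you looking for?\nby\npublished\n'


def clean_content(text):
    """Remove the scraped header, then replace newlines with spaces."""
    return text.replace(BOILERPLATE, '').replace('\n', ' ')


# Made-up sample for illustration only.
sample = BOILERPLATE + 'first line of the article\nsecond line'
cleaned = clean_content(sample)
print(cleaned)
```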


In [13]:
with open('.././data/covid_articles.txt', 'w', encoding='utf-8') as f:
    for title, content in zip(mlm_df.title.values, mlm_df.content.values):
        # End each record with a newline so the next title starts on its own line
        f.write('\n'.join([title, content]) + '\n')
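
A quick way to sanity-check the corpus layout is to write a couple of made-up records to a temporary file and read them back. This sketch assumes the intended layout is one title or one article body per line; note that `'\n'.join([title, content])` on its own does not end with a newline, so the sketch appends one explicitly.

```python
import os
import tempfile

# Made-up rows standing in for (title, content) pairs from `mlm_df`.
rows = [('title one', 'content one'), ('title two', 'content two')]

path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with open(path, 'w', encoding='utf-8') as f:
    for title, content in rows:
        # Terminate every record so the next title starts on a fresh line.
        f.write('\n'.join([title, content]) + '\n')

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()

print(len(lines))
```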

##### > Copy dataset from local to S3

In [14]:
%%time

s3.meta.client.upload_file('.././data/covid_articles.txt', S3_BUCKET, f'{S3_DATA_FOLDER}/covid_articles.txt')

CPU times: user 18.3 s, sys: 17.6 s, total: 35.8 s
Wall time: 13.6 s


### Prepare dataset for sequence classification

In [15]:
clf_df = df.copy()
clf_df.drop(['content'], axis=1, inplace=True) 
clf_df.head()

Unnamed: 0,title,category
0,looking into the truth about modern workplace ...,general
1,hexo refiles financial statements,general
2,"japan raid, turkey arrests in widening ghosn p...",general
3,pope's bodyguards criticised over slapping inc...,general
4,lebanon denies president welcomed fugitive ghosn,general


**Note:** The `esg` value in the `category` column stands for `Environmental, Social and Governance`.

In [16]:
clf_df.count()

title       477536
category    477536
dtype: int64

##### > Drop duplicate titles 

In [17]:
clf_df = clf_df.drop_duplicates(subset='title', keep='first')
clf_df.count()

title       453682
category    453682
dtype: int64

##### > Keep only articles whose titles are COVID-related

In [18]:
include_keywords = ['virus', 'covid', 'pandemic', 'variant']
# Keep rows whose title contains at least one of the keywords
clf_df = clf_df[clf_df['title'].str.contains('|'.join(include_keywords))]
clf_df.count()

title       140325
category    140325
dtype: int64
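
Joining the keywords with `|` produces a regular-expression alternation, so a title is kept when any keyword occurs anywhere in it, including inside a longer word such as `coronavirus`. The sketch below reproduces that matching with the standard `re` module on made-up titles; matching is case-sensitive, which works here because the text was lowercased earlier.

```python
import re

include_keywords = ['virus', 'covid', 'pandemic', 'variant']
pattern = re.compile('|'.join(include_keywords))

# Made-up, lowercased titles for illustration only.
titles = [
    'coronavirus cases rise',
    'new variant detected',
    'stock markets rally',
]
matches = [title for title in titles if pattern.search(title)]
print(matches)
```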

In [19]:
clf_df.head()

Unnamed: 0,title,category
22,mysterious respiratory virus strikes 44 people...,general
77,coronavirus impact on tech supply chains minim...,tech
96,"hackers imitating cdc, who with coronavirus ph...",tech
125,new virus identified as likely cause of myster...,science
142,"new sars related virus, wuhan pneumonia, ideni...",general


In [20]:
set(clf_df.category.unique())

{'business', 'esg', 'general', 'science', 'tech'}

##### > Label encode `category` column

In [21]:
label_encoder = LabelEncoder()
clf_df['category'] = label_encoder.fit_transform(clf_df['category'])

##### > Get label mapping

In [22]:
label_map = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
label_map = dict((k, str(v)) for k, v in label_map.items())
label_map

{'business': '0', 'esg': '1', 'general': '2', 'science': '3', 'tech': '4'}

##### > Save label mapping to be used during inference

In [23]:
with open('.././data/label_map.pkl', 'wb') as f:
    pickle.dump(label_map, f, protocol=pickle.HIGHEST_PROTOCOL)
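
At inference time the mapping is typically needed in the opposite direction, from predicted class id back to category name. The sketch below round-trips the mapping through pickle in memory and inverts it; the name `id_to_label` is an assumption for illustration, not something defined elsewhere in this notebook.

```python
import pickle

label_map = {'business': '0', 'esg': '1', 'general': '2', 'science': '3', 'tech': '4'}

# Round-trip through pickle in memory, mirroring what loading the file would return.
restored = pickle.loads(pickle.dumps(label_map, protocol=pickle.HIGHEST_PROTOCOL))

# Invert the mapping so a predicted class id maps back to its category name.
id_to_label = {int(v): k for k, v in restored.items()}
print(id_to_label[3])
```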

##### > Save dataset locally and copy it to S3

In [24]:
%%time 

clf_df.to_csv('.././data/covid_articles_clf_data.csv',  encoding='utf-8', index=False, header=False)

CPU times: user 312 ms, sys: 965 µs, total: 313 ms
Wall time: 463 ms


In [25]:
%%time 

s3.meta.client.upload_file('.././data/covid_articles_clf_data.csv', S3_BUCKET, f'{S3_DATA_FOLDER}/covid_articles_clf_data.csv')

CPU times: user 43 ms, sys: 22.5 ms, total: 65.5 ms
Wall time: 543 ms


##### > Copy evaluation dataset for fill mask task 

In [26]:
s3.meta.client.upload_file('.././data/eval_mlm.csv', S3_BUCKET, f'{S3_DATA_FOLDER}/eval/eval_mlm.csv')

##### > Copy label mapping from local to S3

In [27]:
s3.meta.client.upload_file('.././data/label_map.pkl', S3_BUCKET, f'{S3_DATA_FOLDER}/labels/label_map.pkl')

### Clean up local copies of data

In [28]:
! rm covid_articles_raw.csv

In [29]:
! rm covid-news-articles-2020-2022.zip

In [30]:
! rm .././data/covid_articles.txt

In [31]:
! rm .././data/covid_articles_clf_data.csv