### Prerequisites

* Choose `Switch instance type` above
* Toggle `Fast launch only` to select more instance types
* Change instance type to `ml.m5.2xlarge`
* For Kernel, choose `Python 3 (Data Science)`

##### > Install dependencies 

In [None]:
%%capture 

!pip install sagemaker==2.100.0
!pip install boto3==1.24.12
!pip install kaggle==1.5.12
!pip install pandas==1.0.1

**Note:** It is recommended to restart the kernel after installing the dependencies above.

### Imports 

In [None]:
from sagemaker import Session
from pandas import DataFrame
from time import sleep
import pandas as pd
import sagemaker
import logging
import pickle
import boto3
import os

##### > Setup logging

In [None]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

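Re-running a logging setup cell in a notebook attaches another `StreamHandler` each time, so every message is printed once per run of the cell. A minimal sketch of a guard against that, assuming a hypothetical helper named `get_notebook_logger` (not part of SageMaker or this notebook):

```python
import logging


def get_notebook_logger(name: str = 'sagemaker', level: int = logging.DEBUG) -> logging.Logger:
    """Return the named logger, attaching a plain StreamHandler only once."""
    log = logging.getLogger(name)
    log.setLevel(level)
    # Skip attaching if a plain StreamHandler is already present,
    # for example because this cell was executed before.
    if not any(type(h) is logging.StreamHandler for h in log.handlers):
        log.addHandler(logging.StreamHandler())
    return log


logger = get_notebook_logger()
logger = get_notebook_logger()  # a second call does not add a second handler
```

The check uses `type(h) is logging.StreamHandler` rather than `isinstance`, because `logging.FileHandler` subclasses `StreamHandler` and would otherwise count as a match.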
##### > Log versions of dependencies

In [None]:
logger.info(f'[Using SageMaker version: {sagemaker.__version__}]')
logger.info(f'[Using Boto3 version: {boto3.__version__}]')
logger.info(f'[Using Pandas version: {pd.__version__}]')

### Essentials

In [None]:
session = Session()
s3 = boto3.resource('s3')

S3_BUCKET = session.default_bucket()
S3_DATA_FOLDER = 'data'

logger.info(f'S3 bucket = {S3_BUCKET}')

### Prepare data

##### > Follow the instructions below to download the COVID news articles dataset from Kaggle: https://www.kaggle.com/datasets/timmayer/covid-news-articles-2020-2022/

* If you don't have a Kaggle account, create one using your email address.
* Once you have an account, under Account, click the `Create New API Token` button as shown below.<br>
![kaggle-credentials](./../img/kaggle-credentials.png)<br>
* This should download a JSON file named `kaggle.json` with your API credentials.
* Copy the `username` and `key` values from the downloaded JSON and assign them to the environment variables as shown below.

In [None]:
os.environ['KAGGLE_USERNAME'] = 'ENTER YOUR KAGGLE USERNAME'
os.environ['KAGGLE_KEY'] = 'ENTER YOUR KAGGLE KEY'

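As an alternative to pasting the credentials into the notebook, the token file can be read directly. A minimal sketch, assuming `kaggle.json` has been uploaded next to the notebook (the path and the helper name `load_kaggle_credentials` are assumptions) and contains the `username` and `key` fields described above:

```python
import json
import os


def load_kaggle_credentials(path: str) -> None:
    """Read a Kaggle API token file and export its values as environment variables."""
    with open(path, encoding='utf-8') as f:
        token = json.load(f)
    # The token file holds two fields: "username" and "key"
    os.environ['KAGGLE_USERNAME'] = token['username']
    os.environ['KAGGLE_KEY'] = token['key']


TOKEN_PATH = 'kaggle.json'  # assumed location of the uploaded token file
if os.path.exists(TOKEN_PATH):
    load_kaggle_credentials(TOKEN_PATH)
```

Reading the file keeps the key out of the notebook source, which matters if the notebook is later committed or shared.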
##### > Download raw dataset from Kaggle to your local directory

In [None]:
!kaggle datasets download -d timmayer/covid-news-articles-2020-2022 
!unzip covid-news-articles-2020-2022.zip

##### > Upload raw dataset from local to S3

In [None]:
!aws s3 cp covid_articles_raw.csv s3://{S3_BUCKET}/{S3_DATA_FOLDER}/covid_articles_raw.csv

In [None]:
RAW_INPUT_DATA_S3_LOCATION = f's3://{S3_BUCKET}/{S3_DATA_FOLDER}/covid_articles_raw.csv'

##### > Read raw dataset into a pandas dataframe

In [None]:
%%time

df = pd.read_csv(RAW_INPUT_DATA_S3_LOCATION)
df.dropna(inplace=True)  # drop rows with any missing field
df = df.apply(lambda x: x.str.lower())  # lowercase every column (assumes all columns hold strings)
df.head()

In [None]:
df.shape

### Prepare dataset for BERT MLM training

In [None]:
mlm_df = df[['title', 'content']].copy()
mlm_df['content'] = mlm_df['content'].apply(lambda x: x.replace('hi, what are you looking for?\nby\npublished\n', ''))
mlm_df['content'] = mlm_df['content'].apply(lambda x: x.replace('\n', ' '))
mlm_df.head()

In [None]:
mlm_df.shape

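The two `replace` calls in the cleaning cell depend on their order: the boilerplate banner contains newlines, so it has to be removed before line breaks are flattened into spaces. A small sketch on a made-up sample string (the helper name `clean_content` and the sample text are illustrative, not taken from the dataset):

```python
BOILERPLATE = 'hi, what are you looking for?\nby\npublished\n'


def clean_content(text: str) -> str:
    """Strip the boilerplate banner, then flatten line breaks into spaces."""
    return text.replace(BOILERPLATE, '').replace('\n', ' ')


sample = 'hi, what are you looking for?\nby\npublished\ncases rise\nacross europe'
print(clean_content(sample))  # cases rise across europe
```

Swapping the two steps would leave the banner in place, because its newlines would already have been turned into spaces and the pattern would no longer match.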
In [None]:
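os.makedirs('.././data', exist_ok=True)  # the upload and clean-up cells expect the file here
with open('.././data/covid_articles.txt', 'w', encoding='utf-8') as f:
    for title, content in zip(mlm_df.title.values, mlm_df.content.values):
        # Terminate each record so one article's content does not run into the next title
        f.write('\n'.join([title, content]) + '\n')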

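Line-oriented text corpora are easiest to inspect when every record ends with a newline. A minimal sketch, assuming a hypothetical helper `write_corpus`, that writes each title and each content on its own line and returns the number of lines written as a quick sanity check:

```python
from typing import Iterable, Tuple


def write_corpus(pairs: Iterable[Tuple[str, str]], path: str) -> int:
    """Write each title and each content on its own line; return lines written."""
    lines = 0
    with open(path, 'w', encoding='utf-8') as f:
        for title, content in pairs:
            f.write(title + '\n')
            f.write(content + '\n')
            lines += 2
    return lines
```

With the dataframe from this notebook, the call could look like `write_corpus(zip(mlm_df.title.values, mlm_df.content.values), 'covid_articles.txt')`, and the returned count should equal `2 * len(mlm_df)`.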
##### > Copy dataset from local to S3

In [None]:
%%time

s3.meta.client.upload_file('.././data/covid_articles.txt', S3_BUCKET, f'{S3_DATA_FOLDER}/covid_articles.txt')

##### > Clean up local copies of data

In [None]:
! rm covid_articles_raw.csv

In [None]:
! rm covid-news-articles-2020-2022.zip

In [None]:
! rm .././data/covid_articles.txt
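
`rm` reports an error when a file is already gone, which happens if the clean-up cells are run twice. A minimal Python sketch that tolerates missing files, assuming a hypothetical helper `remove_if_exists`:

```python
import os


def remove_if_exists(*paths: str) -> list:
    """Delete each path that exists; return the paths actually removed."""
    removed = []
    for path in paths:
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass  # already gone, e.g. the cell was run before
    return removed
```

For this notebook the call could be `remove_if_exists('covid_articles_raw.csv', 'covid-news-articles-2020-2022.zip', '.././data/covid_articles.txt')`.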