# End-to-end NLP: News Headline classifier

### Set up execution role and session

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a pre-defined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()
s3_bucket = sess.default_bucket()
# s3_bucket = '<bucket>'  # uncomment to use a custom bucket name instead
s3_prefix = 'news'

### Download the News Aggregator Dataset from the public UCI repository

We will download our dataset from the public UCI Machine Learning Repository. The dataset is the News Aggregator Dataset, and we will use the newsCorpora.csv file. It contains a table of news headlines and their corresponding classes.

In [None]:
import src.preprocessing

In [None]:
src.preprocessing.download_dataset()

### Let's visualize the dataset

We will load the newsCorpora.csv file to a Pandas dataframe for our data processing work.

In [None]:
import numpy as np
import pandas as pd
import re
import os

In [None]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
df = pd.read_csv('newsCorpora.csv', names=column_names, header=None, delimiter='\t')
df.head()

#### For this exercise we'll only use the title (headline) of the news story, with the category as our target variable

In [None]:
from collections import Counter
Counter(df['CATEGORY'])

The dataset has four categories: Business (b), Entertainment (e), Health & Medicine (m) and Science & Technology (t).

## Natural language preprocessing

We will do some basic processing of the text data to convert it into a numerical form that the algorithm can consume to create a model.
The typical preprocessing steps for NLP workloads used here are: dummy encoding the labels, tokenizing the documents, and padding the documents so that every input vector has the same fixed length.

#### Dummy encode the labels

In [None]:
encoded_y=src.preprocessing.dummy_encode_labels(df,'CATEGORY')

In [None]:
df['CATEGORY'][1]

In [None]:
encoded_y[0]
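
The `dummy_encode_labels` helper lives in `src.preprocessing` and is not shown in this notebook. As a rough sketch of what dummy (one-hot) encoding does, the hypothetical function below maps each label to a vector with a single 1. The alphabetical class order is an assumption and may not match the helper.

```python
# Hypothetical stand-in for dummy encoding; not the notebook's actual helper.
def one_hot_encode(labels):
    """Map each label to a one-hot row; classes are sorted alphabetically (assumed)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return classes, rows

classes, encoded = one_hot_encode(["b", "t", "e", "m", "b"])
print(classes)     # class order used for the columns
print(encoded[0])  # one-hot row for the first label, "b"
```

With four categories, every label becomes a vector of length four, which matches the four-unit output layer defined later.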

#### Tokenize documents and set fixed sequence lengths for input feature dimension.

In [None]:
padded_docs, tokenizer=src.preprocessing.tokenize_pad_docs(df,'TITLE')

In [None]:
df['TITLE'][1]

In [None]:
padded_docs[0]
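
The `tokenize_pad_docs` helper is also defined outside this notebook. A minimal sketch of the idea, assuming whitespace tokenization and zero-padding at the end, might look like the following. The function names are hypothetical, and real tokenizers (for example the Keras one) handle punctuation and truncation differently.

```python
# Hypothetical sketch of tokenizing and post-padding; not the notebook's helper.
def build_vocab(docs):
    """Assign each distinct lowercased word an integer id, starting at 1."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab) + 1)  # id 0 is reserved for padding
    return vocab

def encode_and_pad(docs, vocab, max_length):
    """Turn documents into fixed-length id lists, truncating or zero-padding at the end."""
    padded = []
    for doc in docs:
        ids = [vocab[w] for w in doc.lower().split() if w in vocab]
        ids = ids[:max_length]
        padded.append(ids + [0] * (max_length - len(ids)))
    return padded

docs = ["Fed raises rates", "Markets rally as Fed raises rates again"]
vocab = build_vocab(docs)
padded = encode_and_pad(docs, vocab, max_length=5)
print(padded)
```

The short headline is padded with zeros and the long one is cut off, so both rows have the same length. The model in this notebook expects sequences of length 40.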

### Import word embeddings

The vectors.txt file is the output of the blazingtext_word2vec_text8.ipynb notebook. It contains the vector representation of each word in our vocabulary.

In [None]:
embedding_matrix=src.preprocessing.get_word_embeddings(tokenizer)
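
As an illustration of what an embedding matrix is, the sketch below builds one from a few in-memory lines. It assumes the word2vec text layout (an optional header line, then one `word v1 v2 ...` line per word) and a word-to-id mapping like the tokenizer's. `build_embedding_matrix` is a hypothetical name, and the real `get_word_embeddings` helper may work differently.

```python
import numpy as np

# Hypothetical sketch; the real get_word_embeddings helper may differ.
def build_embedding_matrix(vector_lines, word_index, dim):
    """Return a (vocab_size, dim) matrix; row i holds the vector of the word with id i."""
    vectors = {}
    for line in vector_lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip a header or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Row 0 stays zero for the padding id; words without a vector also stay zero.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix

lines = ["2 3", "fed 0.1 0.2 0.3", "rates 0.4 0.5 0.6"]  # made-up vectors
word_index = {"fed": 1, "rates": 2, "merger": 3}
matrix = build_embedding_matrix(lines, word_index, dim=3)
print(matrix.shape)
```

The number of rows is what the notebook later reads as `vocab_size`, and the number of columns must match the embedding dimension passed to the `Embedding` layer.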

In [None]:
!mkdir -p ./data/embeddings/

### Save embedding matrix to push to S3 for SageMaker to use during training

In [None]:
#embedding_matrix.dump("ingredients-embedding-matrix.dat")
np.save(file="./data/embeddings/docs-embedding-matrix",
        arr=embedding_matrix,
        allow_pickle=False)
vocab_size=embedding_matrix.shape[0]
print(embedding_matrix.shape)

### Train, test split

In this section we will prepare the data for ingestion by the algorithm: split the dataset into train and test samples, then upload the data to S3.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_docs, encoded_y, test_size=0.2, random_state=42)

In [None]:
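# Create the output folders first; np.save does not create missing directories.
os.makedirs('./data/train', exist_ok=True)
os.makedirs('./data/test', exist_ok=True)

np.save('./data/train/train_X.npy', X_train)
np.save('./data/train/train_Y.npy', y_train)
np.save('./data/test/test_X.npy', X_test)
np.save('./data/test/test_Y.npy', y_test)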
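
The upload step mentioned at the start of this section has no cell of its own. One possible sketch uses `Session.upload_data` from the SageMaker Python SDK, which uploads a local file or folder and returns its S3 URI. The `upload_channels` function, the channel names, and the folder layout are assumptions for illustration.

```python
# Sketch only: uploads each local data folder and returns its S3 URI.
# The channel names and the ./data/<channel> layout are assumptions.
def upload_channels(session, bucket, prefix, local_dir="./data",
                    channels=("train", "test", "embeddings")):
    uris = {}
    for channel in channels:
        uris[channel] = session.upload_data(
            path=f"{local_dir}/{channel}",
            bucket=bucket,
            key_prefix=f"{prefix}/{channel}",
        )
    return uris

# In this notebook the call would look like:
# inputs = upload_channels(sess, s3_bucket, s3_prefix)
```

The returned dictionary could then be passed as the input channels of a training job.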

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.layers import Embedding

seed = 42
np.random.seed(seed)
num_classes=4

In [None]:

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 100,
                    weights=[embedding_matrix],
                    input_length=40,
                    trainable=False,
                    name="embed"))
model.add(Conv1D(filters=128,
                 kernel_size=3,
                 activation='relu',
                 name="conv_1"))
model.add(MaxPooling1D(pool_size=5,
                       name="maxpool_1"))
model.add(Flatten(name="flat_1"))
model.add(Dropout(0.3,
                  name="dropout_1"))
model.add(Dense(128,
                activation='relu',
                name="dense_1"))
model.add(Dense(num_classes,
                activation='softmax',
                name="out_1"))

# compile the model
# The labels are one-hot vectors over four classes and the output layer is a
# softmax, so categorical cross-entropy is the matching loss.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

model.summary()
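
To help read the summary, the arithmetic below derives the expected output sizes and parameter counts of the layers defined above. It assumes the Keras defaults for these layers ("valid" padding, stride 1 for `Conv1D`, stride equal to `pool_size` for `MaxPooling1D`).

```python
# Shape arithmetic for the layers above, assuming Keras default padding and strides.
seq_len, embed_dim, filters, kernel, pool = 40, 100, 128, 3, 5

conv_len = seq_len - kernel + 1            # positions left after conv_1
pool_len = (conv_len - pool) // pool + 1   # positions left after maxpool_1
flat_dim = pool_len * filters              # features after flat_1

conv_params = kernel * embed_dim * filters + filters  # weights + biases
dense_params = flat_dim * 128 + 128
out_params = 128 * 4 + 4

print(conv_len, pool_len, flat_dim)
print(conv_params, dense_params, out_params)
```

The embedding layer is frozen (`trainable=False`), so its weights are reported as non-trainable parameters.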

In [None]:
# fit the model
model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1)

# evaluate on the held-out test set
scores = model.evaluate(X_test, y_test, verbose=0)

print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

### Try the trained model on a new headline

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
example_doc=['The markets were bullish after news of the merger']
# integer encode the document
encoded_example = tokenizer.texts_to_sequences(example_doc)

# pad documents to a max length of 40 words, matching the model's input length
max_length = 40
padded_example = pad_sequences(encoded_example, maxlen=max_length, padding='post')

In [None]:
model.predict(padded_example)
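
The prediction is a row of four class probabilities. A small sketch of turning such a row into a readable category follows. It assumes the output columns follow alphabetical label order (b, e, m, t), which depends on how the labels were dummy encoded. The probabilities are made up for illustration, and `top_category` is a hypothetical helper.

```python
# Assumption: output columns follow alphabetical label order (b, e, m, t).
CATEGORY_NAMES = {"b": "Business", "e": "Entertainment",
                  "m": "Health & Medicine", "t": "Science & Technology"}
LABEL_ORDER = ["b", "e", "m", "t"]

def top_category(probabilities):
    """Return (category name, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CATEGORY_NAMES[LABEL_ORDER[best]], probabilities[best]

# Made-up probabilities standing in for one row of model.predict output.
name, prob = top_category([0.91, 0.03, 0.02, 0.04])
print(name, prob)
```

With the real model, the input would be `model.predict(padded_example)[0]`.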