# News Topic Popularity Forecasting

### This is the second notebook in a series that walks through an example of building an end-to-end machine learning model for forecasting news topics. The question we are asking is:

### Given historical trends on the popularity of a particular topic based on News articles and sentiment, how popular will this topic be in the future?

The example uses a sample dataset obtained from the UCI Machine Learning Repository. The raw dataset can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00432/Data/News_Final.csv.
Details of the data preparation process can be found here: https://arxiv.org/pdf/1801.07055.pdf <br/>

Please first run 1_preprocess.ipynb which runs a series of preprocessing steps on the raw data. This notebook will show you how to build a neural topic model to extract topic vectors from the processed dataset.

In a subsequent notebook (3_Forecast.ipynb) we will use Amazon Forecast's DeepAR+ Algorithm for time series forecasting to predict the topic's future popularity.

Note: This notebook works in any Python 3 kernel.

## Import the data for preprocessing

In [None]:
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sys
import time
import re
import boto3
import sagemaker
import nltk
# import the shared notebook utility module from the ../../common directory
sys.path.insert( 0, os.path.abspath("../../common"))
import util


session = sagemaker.session.Session()

In [None]:
text_widget_bucket = util.create_text_widget( "bucket_name", "input your S3 bucket name" )
PREFIX = 'ntm-deepar'
NUM_TOPICS = 20

In [None]:
BUCKET = text_widget_bucket.value
assert BUCKET, "BUCKET not set."

In [None]:
df = pd.read_csv('data/NewsRatingsdataset.csv')
df.head()

In [None]:
print("The Shape of the dataframe is {}".format(df.shape))


In [None]:
df['PublishDate'] = pd.to_datetime(df['PublishDate'], infer_datetime_format=True)
START_DATE = datetime(2015, 11, 1)
END_DATE = datetime(2016, 7, 6)


## Plot the Time Series for a given Topic

In [None]:
topics = list(set(df.Topic))
print("Available Topics = {}".format(topics))


### Topic 0

Most of the data from the early period is missing, so let's look only at data after START_DATE (2015-11-01).

In [None]:
topic = 0 # Change this to any of [0, 1, 2, 3]
subdf = df[(df['Topic']==topic)&(df['PublishDate']>START_DATE)]
subdf = subdf.reset_index().set_index('PublishDate')
subdf.index = pd.to_datetime(subdf.index)
subdf.head()
subdf[["LinkedIn", 'GooglePlus', 'Facebook']].resample("1D").mean().dropna().plot(figsize=(15, 4))
subdf[["SentimentTitle", 'SentimentHeadline']].resample("1D").mean().dropna().plot(figsize=(15, 4))

Since the data has some hourly information, there are multiple datapoints within a given day. However, there isn't enough data in a 24-hour period to create an hourly forecast. Instead, we will aggregate the data on a daily basis and use the daily values to create a daily forecast.

This is pretty reasonable, since here we are forecasting the topic popularity a couple of weeks into the future.
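
As a minimal sketch of the daily aggregation described above, the snippet below resamples a few irregularly timed readings to one mean value per day. The timestamps and values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical popularity readings with several timestamps per day
toy = pd.DataFrame(
    {"Facebook": [10.0, 20.0, 30.0, 50.0]},
    index=pd.to_datetime([
        "2015-11-02 08:00", "2015-11-02 17:30",
        "2015-11-03 09:15", "2015-11-03 21:45",
    ]),
)

# One row per calendar day, holding the mean of that day's readings
daily = toy.resample("1D").mean()
print(daily)
```

The plotting cells in this notebook apply the same `resample("1D").mean()` call to the real ratings and sentiment columns.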

### Topic 1

In [None]:
topic = 1 # Change this to any of [0, 1, 2, 3]
subdf = df[(df['Topic']==topic)&(df['PublishDate']>START_DATE)]
subdf = subdf.reset_index().set_index('PublishDate')
subdf.index = pd.to_datetime(subdf.index)
subdf.head()
subdf[["LinkedIn", 'GooglePlus', 'Facebook']].resample("1D").mean().dropna().plot(figsize=(15, 4))
subdf[["SentimentTitle", 'SentimentHeadline']].resample("1D").mean().dropna().plot(figsize=(15, 4))

In [None]:
# Keep only data after START_DATE, since there is very little data before then
df = df[(df['PublishDate']>START_DATE)].reset_index(drop=True)
df.head()


Some of the headlines are missing (NaN). Let's replace those headlines with empty strings. <br/>
<br/>
Also note that a number of the ratings are negative (-1), which may denote missing values. Since negative ratings are not physically meaningful, we convert these to 0s.

In [None]:
df['Headline'] = df['Headline'].fillna('')
df = df.replace({'Facebook': -1, 'GooglePlus': -1, 'LinkedIn':-1}, 0)
df.head()


Notice that there is a difference in scale for the popularity on Facebook versus LinkedIn versus GooglePlus. For this example, we will focus on forecasting popularity for Facebook only.

# Topic Modeling

Here we will use the Neural Topic model built-in algorithm within SageMaker for extracting topics from the news headlines. To do so, we need to do some preliminary cleaning and preprocessing of the data.

## Text Processing: Topic Modeling using NTM

In [None]:
#convert to lower case
df['Headline'] = df['Headline'].str.lower()
df['Title'] = df['Title'].str.lower()

#remove punctuation marks
punctuation = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

df['Headline'] = df['Headline'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))
df['Title'] = df['Title'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))

# remove numbers (regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching)
df['Headline'] = df['Headline'].str.replace("[0-9]", " ", regex=True)
df['Title'] = df['Title'].str.replace("[0-9]", " ", regex=True)

In [None]:
df.head()


The Headline column contains similar information to the Title column, but since the headlines are longer, we drop the titles and retain only the headlines. We store the titles separately, however, to use as a held-out set for sanity-checking our Neural Topic Model after training.

In [None]:
title_column = df.Title
df = df.drop(columns = ['Title'])


Next, let's write some code to convert the titles and headlines into tokens that are suitable for a Neural Topic Model.
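
The bag-of-words idea can be sketched without any libraries: build a vocabulary from one corpus, then count the words of a second corpus against that same vocabulary so that both sets of vectors share the same columns. The sentences and the helper `bag_of_words` below are hypothetical. The notebook itself relies on NLTK and scikit-learn's CountVectorizer, which additionally handle lemmatization, stop words, and frequency cutoffs.

```python
from collections import Counter

headlines = ["obama meets economy leaders", "microsoft economy report"]
titles = ["obama economy speech"]

# The vocabulary is learned from the headlines only (sorted for a stable column order)
vocab = sorted({word for doc in headlines for word in doc.split()})

def bag_of_words(doc, vocab):
    """Count occurrences of each vocabulary word; out-of-vocabulary words are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

headline_vectors = [bag_of_words(doc, vocab) for doc in headlines]
# Titles reuse the headline vocabulary so their columns line up with the training data
title_vectors = [bag_of_words(doc, vocab) for doc in titles]

print(vocab)
print(title_vectors)
```

Sharing one vocabulary matters because a model trained on the headline vectors can only interpret inputs whose columns mean the same thing.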

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
token_pattern = re.compile(r"(?u)\b\w\w+\b")
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if len(t) >= 2 and re.match("[a-z].*",t) 
                and re.match(token_pattern, t)]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vocab_size = 10000
print('Tokenizing and counting, this may take a few minutes...')
start_time = time.time()
vectorizer = CountVectorizer(input='content', analyzer='word', stop_words='english',
                             tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=2)
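vectors = vectorizer.fit_transform(df.Headline)
# Reuse the vocabulary fitted on the headlines so that the title vectors share the same columns
topic_vectors = vectorizer.transform(title_column)
# Note: scikit-learn 1.2 removed get_feature_names(); use get_feature_names_out() on newer versions
vocab_list = vectorizer.get_feature_names()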
print('vocab size:', len(vocab_list))

# Retain the index of the vectors
idx = np.arange(vectors.shape[0])
print('Done. Time elapsed: {:.2f}s'.format(time.time() - start_time))

In [None]:
# Type cast the vectors into a sparse array
import scipy.sparse as sparse
vectors = sparse.csr_matrix(vectors, dtype=np.float32)
topic_vectors = sparse.csr_matrix(topic_vectors, dtype=np.float32)
print(type(vectors), vectors.dtype)


## Training a Neural Topic Model

To extract "text vectors", we take the headlines and convert each one into a 20-dimensional (NUM_TOPICS) topic vector. This can be viewed as a lower-dimensional embedding of all the text in the corpus into a predefined number of topics. Each headline is represented as a vector of topic weights, and related headlines have similar vector representations. Assuming that there is some correlation between topics from one day to the next, i.e., that the top topics don't change very frequently on a daily basis, we can represent all the text in the dataset as a collection of 20 topics. Feel free to experiment with the number of topics to extract by modifying the NUM_TOPICS parameter.

Each topic can be viewed as distinct from every other topic, allowing us to then input it as a separate timeseries into DeepAR for model training.
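
To make this concrete, here is a small sketch of how per-headline topic weights could be turned into one daily series per topic. The column names mirror the `Headline_Topic_{i}` columns created later in this notebook. The dates and weights are made up, and the daily mean is an assumption chosen for illustration, not necessarily the aggregation used in the forecasting notebook.

```python
import pandas as pd

# Hypothetical topic weights for four headlines, using 2 topics for brevity
toy = pd.DataFrame({
    "PublishDate": pd.to_datetime(
        ["2015-11-02", "2015-11-02", "2015-11-03", "2015-11-03"]),
    "Headline_Topic_0": [0.9, 0.7, 0.2, 0.4],
    "Headline_Topic_1": [0.1, 0.3, 0.8, 0.6],
})

# Averaging the weights per day gives one time series per topic column
daily_topics = toy.set_index("PublishDate").resample("1D").mean()
print(daily_topics)
```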

In [None]:
# IAM roles: to train a model on SageMaker, SageMaker needs to assume an IAM role that can access data in S3.
# If you are running in a SageMaker notebook environment, a role was already created when you provisioned the notebook instance.

# To use that role, simply run this cell. If you get an exception, create a SageMaker IAM role first using this link: https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html
# Make sure that along with the AmazonSageMakerFullAccess policy you also attach a policy that provides access to your specific S3 bucket.
# This could easily be done by attaching AmazonS3FullAccess, but that is **not** a recommended security best practice.
# Then enter that role at the input prompt below.

try:
    from sagemaker import get_execution_role
    role = get_execution_role()
except Exception as e:
    print( "Enter the IAM role needed for SageMaker to access training containers" )
    role = input("IAM Role Name")

In [None]:
train_prefix = os.path.join(PREFIX, 'train').replace("\\","/")
output_prefix = os.path.join(PREFIX, 'ntm-output').replace("\\","/")

s3_train_data = 's3://{}/{}'.format(BUCKET, train_prefix)
output_path = 's3://{}/{}'.format(BUCKET, output_prefix)

print('Training set location', s3_train_data)
print('Trained model will be saved at', output_path)

In [None]:
def split_convert_upload(sparray, bucket, prefix, fname_template='data_part{}.pbr', n_parts=2):
    import io
    import sagemaker.amazon.common as smac
    
    chunk_size = sparray.shape[0]// n_parts
    for i in range(n_parts):

        # Calculate start and end indices
        start = i*chunk_size
        end = (i+1)*chunk_size
        if i+1 == n_parts:
            end = sparray.shape[0]
        
        # Convert to record protobuf
        buf = io.BytesIO()
        smac.write_spmatrix_to_sparse_tensor(array=sparray[start:end], file=buf, labels=None)
        buf.seek(0)
        
        # Upload to s3 location specified by bucket and prefix
        fname = os.path.join(prefix, fname_template.format(i)).replace("\\","/")
        boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
        print('Uploaded data to s3://{}'.format(os.path.join(bucket, fname).replace("\\","/")))
split_convert_upload(vectors, bucket=BUCKET, prefix=train_prefix, fname_template='train_part{}.pbr', n_parts=1)


#### Load the latest NTM container

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'ntm', 'latest')
print(container)

### Model Training

To train the model, you can use one or more instances (specified by train_instance_count) and choose a data distribution strategy: FullyReplicated copies the full dataset to every instance, while ShardedByS3Key places only some of the data shards on each instance, which speeds up training at the cost of each instance seeing only a fraction of the data.

To avoid overfitting, we use early stopping, configured here through num_patience_epochs and tolerance: training stops if the change in the loss stays below the specified tolerance for that number of epochs.
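
The patience rule can be sketched in plain Python as follows. This is only an illustration of the idea, with a hypothetical helper `should_stop` and made-up loss values. It is not the NTM algorithm's internal implementation, whose exact criterion may differ.

```python
def should_stop(losses, num_patience_epochs=5, tolerance=0.001):
    """Return True if the loss changed by less than `tolerance`
    in each of the last `num_patience_epochs` epochs."""
    if len(losses) <= num_patience_epochs:
        return False  # not enough history yet
    recent = losses[-(num_patience_epochs + 1):]
    changes = [abs(prev - curr) for prev, curr in zip(recent, recent[1:])]
    return all(change < tolerance for change in changes)

# Made-up loss curve that plateaus after a few epochs
losses = [5.0, 4.0, 3.5, 3.4, 3.3996, 3.3994, 3.3993, 3.3992, 3.3991]
for epoch in range(1, len(losses) + 1):
    if should_stop(losses[:epoch]):
        print("Stopping after epoch", epoch)
        break
```

A larger patience or a smaller tolerance makes the rule more conservative, so training runs longer before stopping.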

In [None]:
%%time
from sagemaker.session import s3_input

sess = sagemaker.Session()
ntm = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c4.xlarge',
                                    output_path=output_path,
                                    sagemaker_session=sess)
ntm.set_hyperparameters(num_topics=NUM_TOPICS, feature_dim=vocab_size, mini_batch_size=128, 
                        epochs=100, num_patience_epochs=5, tolerance=0.001)
s3_train = s3_input(s3_train_data, distribution='FullyReplicated') 
ntm.fit({'train': s3_train})

### Deploy the Model

To generate the "feature vectors" for the headlines, we deploy the model first, and run inferences on the entire training dataset to obtain the topic vectors.

**IMPORTANT:** Only run the next 2 cells and enter the NTM endpoint if you HAVE already deployed the NTM model, then move on to **Test the Model**.

Otherwise, don't enter anything into the widget box and skip to the cell with the heading "**New Deployment**."

If this is your first time running this notebook, you likely have not deployed the model yet.

In [None]:
#execute if you PREVIOUSLY deployed the NTM model for this notebook series and skip the cell under "New Deployment"

#Below you will need to input your endpoint name. To find this, navigate to the AWS Console and look for SageMaker.
#Under SageMaker go to Endpoints and find the endpoint starting with "ntm-". Note the name and enter it in the widget below.

from sagemaker.predictor import RealTimePredictor, csv_serializer, json_deserializer


endpoint_widget = util.create_text_widget( "endpoint_name", "ONLY enter if you have already (previously) deployed the NTM model. It should start with 'ntm-'" )

In [None]:
endpoint = endpoint_widget.value
assert endpoint, "Endpoint not set."

ntm_predictor = RealTimePredictor(endpoint, sagemaker_session=session, serializer=csv_serializer, deserializer=json_deserializer, content_type='text/csv')

### New Deployment

In [None]:
#execute if you have NOT deployed the NTM model for this notebook series and you skipped the previous two cells
#(didn't enter anything into the widget box)
from sagemaker.predictor import csv_serializer, json_deserializer


ntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')


### Test the Model

In [None]:
ntm_predictor.content_type = 'text/csv'
ntm_predictor.serializer = csv_serializer
ntm_predictor.deserializer = json_deserializer


To do a "sanity" check that our topic model is working as expected, we look at the extracted topic vectors from the titles and check if the topic distribution of the title is similar to that of the corresponding headline. Remember that our model has not seen the titles before. As a measure of "similarity", we compute the cosine similarity for a random title and associated headline. A high cosine similarity indicates that Titles and Headlines have a similar representation in this low dimensional embedding space. 

A cosine similarity of the Title-Headline can also be used as a feature: well written titles that correlate well with the actual Headline may obtain a higher popularity score. This could be used to check if titles and headlines represent the content of the document accurately, but we will not explore this further in this notebook.
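
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. Here is a minimal NumPy sketch, using made-up topic weights instead of model output and a hypothetical helper `cosine_sim`:

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-topic weight vectors
title_vec = [0.7, 0.1, 0.1, 0.1]
headline_vec = [0.6, 0.2, 0.1, 0.1]
unrelated_vec = [0.05, 0.05, 0.1, 0.8]

print(cosine_sim(title_vec, headline_vec))   # close to 1: similar topic mix
print(cosine_sim(title_vec, unrelated_vec))  # much lower: different topic mix
```

The cells below use scikit-learn's `cosine_similarity`, which computes the same quantity for whole matrices of row vectors.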

We also visualize the Headlines in a T-SNE plot to capture the number of distinct Topic clusters that appear.

In [None]:
%%time
print("Converted back to Dense Tensor")
print("Extracting Results ...")
headline_data_batch = [np.array(vectors[i:i+100].todense()) for i in range(0, vectors.shape[0], 100)]
pred_array = []
print("Data Batched and ready to go")
for i in range(len(headline_data_batch)):
    results = ntm_predictor.predict(headline_data_batch[i])
    predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
    pred_array.append(predictions)
    sys.stdout.write(".")
    sys.stdout.flush()
print('Success')

In [None]:
pred_array_cc = np.concatenate(pred_array, axis= 0)
print(pred_array_cc.shape)
print("Store back into dataframe")      
for i in range(NUM_TOPICS):
    df['Headline_Topic_{}'.format(i)] = pred_array_cc[:, i]


In [None]:
#Save the Dataframe back as a pre-processed csv for importing into the DeepAR Model
df.to_csv('data/preprocessed_data.csv', index =None)


#### Test Vector Similarity

In [None]:
topic_data = np.array(topic_vectors.tocsr()[:10].todense())
topic_vecs = []
for index in range(10):
    results = ntm_predictor.predict(topic_data[index])
    predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
    topic_vecs.append(predictions)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
comparisonvec = []
for i in range(10):
    comparisonvec.append([df.Headline[i], title_column[i], cosine_similarity(topic_vecs[i], [pred_array_cc[i]])[0][0]])
pd.DataFrame(comparisonvec, columns=['Headline', 'Title', 'CosineSimilarity'])

Also compare headlines with other headlines in the same topic and in different topics.

In [None]:
headlinecomparisonvec = []
for i in range(10):
    headlinecomparisonvec.append([df.Headline[10], df.Headline[i + 10], cosine_similarity([pred_array_cc[10]], [pred_array_cc[i+10]])[0][0]])
pd.DataFrame(headlinecomparisonvec, columns=['Headline', 'Nearby Headlines', 'CosineSimilarity'])

In [None]:
headlinecomparisonvec = []
for i in range(10):
    headlinecomparisonvec.append([df.Headline[10], df.Headline[i + 50000], cosine_similarity([pred_array_cc[10]], [pred_array_cc[i+50000]])[0][0]])
pd.DataFrame(headlinecomparisonvec, columns=['Headline', 'Far away Headlines', 'CosineSimilarity'])

Notice that, on average, the nearby headlines have a higher cosine similarity than the far away ones. By tweaking the vocab_size and NUM_TOPICS parameters, you can look for a better model.

For now, we choose to proceed with our current results.

#### T-SNE

Another way to visualize the results is to plot a T-SNE graph. T-SNE is a nonlinear embedding method: it describes pairwise neighbor relationships as a joint probability distribution in the high-dimensional space (in this case NUM_TOPICS dimensions) and looks for a lower-dimensional embedding (in this case 2 dimensions) whose corresponding joint distribution matches it, by minimizing a loss known as the Kullback-Leibler divergence.
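
As a small illustration of the loss mentioned above, the Kullback-Leibler divergence between two discrete distributions P and Q is the sum over i of P_i * log(P_i / Q_i). The distributions below are made up and the helper `kl_divergence` is hypothetical. scikit-learn's TSNE builds its own neighbor distributions internally.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]          # hypothetical high-dimensional neighbor probabilities
q_good = [0.45, 0.35, 0.20]  # low-dimensional embedding that matches well
q_bad = [0.1, 0.1, 0.8]      # low-dimensional embedding that matches poorly

print(kl_divergence(p, p))       # zero when the distributions are identical
print(kl_divergence(p, q_good))  # small
print(kl_divergence(p, q_bad))   # larger
```

The divergence is asymmetric, so KL(P || Q) generally differs from KL(Q || P).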

Computing the T-SNE can take quite some time especially for large datasets, so we shuffle the dataset and extract only 10K Headline embeddings for the T-SNE plot.

Refer to this excellent article which describes some of the advantages and pitfalls of using T-SNE in Topic Modeling. https://distill.pub/2016/misread-tsne/


In [None]:
smarray = np.random.permutation(pred_array_cc)[:10000]
from sklearn.manifold import TSNE
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=2000)
tsne_results = tsne.fit_transform(smarray)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))


In [None]:
topic_vec = [np.argmax(x) for x in smarray]


In [None]:
import seaborn as sns

tsnedf = pd.DataFrame()
tsnedf['tsne-2d-one'] = tsne_results[:,0]
tsnedf['tsne-2d-two'] = tsne_results[:,1]
tsnedf['Topic']=topic_vec
plt.figure(figsize=(25,25))
sns.lmplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue='Topic',
    palette=sns.color_palette("hls", NUM_TOPICS),
    data=tsnedf,
    legend="full",
    fit_reg=False
)
plt.axis('Off')
plt.show()

The T-SNE plot shows 4 large topics, which is consistent with the dataset containing 4 primary topics. But by expanding the dimensionality of the topic vectors to 20, we allow the NTM model to capture more semantics between the headlines than is captured by a single topic token.
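
The difference between a single topic label and a full topic vector can be illustrated with two made-up weight vectors that share the same dominant topic but differ in their secondary topics:

```python
import numpy as np

# Hypothetical 5-topic weight vectors for two headlines
headline_a = np.array([0.60, 0.30, 0.05, 0.03, 0.02])
headline_b = np.array([0.60, 0.02, 0.03, 0.05, 0.30])

# A single label (the dominant topic) cannot tell the two headlines apart
print(np.argmax(headline_a), np.argmax(headline_b))

# The full vectors retain the difference in the secondary topics
print(np.abs(headline_a - headline_b).sum())
```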

### Delete the endpoint

In [None]:
# With our topic modeling complete, and our data saved, we can delete the endpoint.
ntm_predictor.delete_endpoint()


## Conclusion

In this notebook we showed how to include semantic information from unstructured text data in time series forecasting. Metadata about items extracted from text captures high-level features, but doesn't necessarily include any of the semantic information associated with free-form text. Oftentimes, text data is captured only as an overall sentiment, which misses the rich information contained in the actual text itself. Furthermore, in addition to the sentiment, the semantic content of news articles can vary over time, which will affect the popularity of particular topics, how they are trending, and so on. This notebook shows one approach for incorporating unstructured text into time series modeling using SageMaker's built-in Neural Topic Model algorithm.

There are two main high-level steps:

We first build a topic model to convert text data into topic vectors. <br/>
We then load the corresponding topic vectors associated with our input text into the dataframe. <br/>
 

