# Introduction

### Download the raw data

This is the first of three notebooks on combining natural language processing with time series forecasting using Amazon SageMaker and Amazon Forecast. As a first step, we download the raw dataset from UCI. It consists of news articles on 4 major topics, each with a title, a headline, and a source. Sentiment scores for each article and its popularity ratings on Facebook, GooglePlus and LinkedIn are also provided.

The dataset can be viewed in 2 ways:

1) Regression: given an article, predict its popularity.

2) Forecasting: given a topic, forecast its popularity on various social media channels from historical data out into the future.

Since we want to leverage Amazon Forecast, we treat it as the latter problem. A major thrust of this workshop is to demonstrate how unstructured text data can be included in forecasting problems. That will be the topic of Notebooks 2 (2_NTM.ipynb) and 3 (3_Forecast.ipynb).

But first, we need to download and preprocess the dataset.

In [None]:
import os
import pandas as pd
import requests

In [None]:
# Create the data directory if it does not already exist
os.makedirs('data', exist_ok=True)

In [None]:
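url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00432/Data/News_Final.csv'
# NOTE: verify=False disables TLS certificate verification; remove it if the server's certificate validates in your environment.
r = requests.get(url, allow_redirects=True, verify=False)
r.raise_for_status()  # stop here if the download failed instead of saving an error page
with open('data/News_Final.csv', 'wb') as fd:
    fd.write(r.content)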

### Load the data

In [None]:
df = pd.read_csv('data/News_Final.csv')

In [None]:
df.head()

In [None]:
df.Source.value_counts()

This exercise is primarily focused on extracting content from the headlines and titles, so let's drop the Source column and the IDLink column, which only relates each article to an internal ID.

In [None]:
df = df.drop(columns = ['Source', 'IDLink'])

### Basic Data Exploration

In [None]:
# Take a small sample of the dataset for visualization
df_small = df.sample(frac = 0.2)

In [None]:
import matplotlib.pyplot as plt
# Histogram of article popularity: the x-axis is the popularity score, the y-axis the normalized frequency
plt.xlabel('Popularity')
plt.ylabel('Density')
plt.title('Facebook News Articles')
n, bins, patches = plt.hist(df_small['Facebook'], bins = 100, density=True, range = (0,600), alpha=0.75)

In [None]:
# Same histogram for GooglePlus: popularity on the x-axis, normalized frequency on the y-axis
plt.xlabel('Popularity')
plt.ylabel('Density')
plt.title('GooglePlus News Articles')
n, bins, patches = plt.hist(df_small['GooglePlus'], bins = 100, density=True, range = (0,600), alpha=0.75)
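
One way to put a number on the long tail visible in these histograms is to compare the mean with the median and to compute the sample skewness. The sketch below uses a small made-up series (`toy_counts` and `summarize_skew` are hypothetical names, and the values are not taken from the dataset) so that it runs on its own; with the real data you would pass `df['Facebook']` instead. The `log1p` transform is shown only as a common option for compressing such tails and is not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical popularity counts: most articles score low, a handful score very high.
toy_counts = pd.Series([0, 0, 1, 1, 2, 3, 5, 8, 40, 600])

def summarize_skew(counts: pd.Series) -> dict:
    """Return mean, median and sample skewness before and after log1p."""
    logged = np.log1p(counts)  # log(1 + x), which is defined at 0
    return {
        'mean': counts.mean(),
        'median': counts.median(),
        'skew_raw': counts.skew(),
        'skew_log1p': logged.skew(),
    }

summary = summarize_skew(toy_counts)
print(summary)
```

A mean far above the median, together with a large positive skewness, is the numerical counterpart of the shape seen in the plots.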

Notice that the popularity of articles is extremely skewed. For this exercise, we may choose to forecast popularity on just one of the platforms. To convert the data into a time series usable for machine learning, we need to aggregate the news articles. Since there are 4 topics, we aggregate the articles into 4 time series, one per topic.
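
As a minimal sketch of the aggregation described above, the example below sums article popularity per topic and per day, yielding one daily series per topic. The toy rows, the daily granularity, the helper name `aggregate_daily`, and the choice of `sum` as the aggregate are all assumptions made for illustration; the later notebooks may aggregate differently.

```python
import pandas as pd

# Hypothetical rows that mimic the Topic, PublishDate and Facebook columns.
toy = pd.DataFrame({
    'Topic': ['economy', 'economy', 'economy', 'obama', 'obama'],
    'PublishDate': pd.to_datetime([
        '2015-11-01 08:00', '2015-11-01 17:30', '2015-11-02 09:15',
        '2015-11-01 12:00', '2015-11-03 10:00',
    ]),
    'Facebook': [10, 5, 7, 3, 4],
})

def aggregate_daily(frame: pd.DataFrame, target: str = 'Facebook') -> pd.DataFrame:
    """Sum `target` per topic and calendar day (an assumed granularity)."""
    return (
        frame
        .assign(day=frame['PublishDate'].dt.floor('D'))  # truncate timestamps to the day
        .groupby(['Topic', 'day'], as_index=False)[target]
        .sum()
    )

daily = aggregate_daily(toy)
print(daily)
```

Days on which a topic has no articles are simply absent from the result; whether to fill them with zeros is a separate modeling decision.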

In [None]:
# First we replace the original topic names with a numerical "item_id"
df = df.replace({'Topic': {'economy': 0, 'obama': 1, 'microsoft': 2, 'palestine': 3}})
df.head()

In [None]:
df.Topic.value_counts()

### Preprocess the data

In [None]:
# First we convert the PublishDate column to a datetime column using the pandas to_datetime function.
# (infer_datetime_format is deprecated since pandas 2.0, where format inference is the default behavior.)
df['PublishDate'] = pd.to_datetime(df['PublishDate'])

In [None]:
df = df.sort_values(by = ['Topic', 'PublishDate'])

In [None]:
df.head()

In [None]:
df.to_csv('data/NewsRatingsdataset.csv', index=False)
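
CSV files do not store column types, so `PublishDate` comes back as plain text unless it is parsed again on load. The sketch below shows the round trip on a small hypothetical frame written to a temporary file; the `parse_dates` argument of `pd.read_csv` restores the datetime dtype. The toy data and the temporary path are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with the same kinds of columns as the saved dataset.
toy = pd.DataFrame({
    'Topic': [0, 1],
    'PublishDate': pd.to_datetime(['2015-11-01 08:00:00', '2015-11-02 09:15:00']),
    'Facebook': [10, 3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(path, index=False)

    as_text = pd.read_csv(path)                                # dates stay as text
    as_dates = pd.read_csv(path, parse_dates=['PublishDate'])  # dates are parsed

print(as_text['PublishDate'].dtype)
print(as_dates['PublishDate'].dtype)
```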

### End

In this notebook, we downloaded the dataset, did some basic preprocessing and cleaning, and produced a few simple visualizations.

Next, move on to the 2_NTM.ipynb notebook to preprocess the text data further and build a neural topic model that generates topic vectors from all the headlines. These vectors then become the input to the DeepAR+ forecasting algorithm of the Amazon Forecast service in the last notebook, 3_Forecast.ipynb.

Enjoy!