# Daily News & Stock Market Correlation-Prediction (1/4)

Read the news and profit from stocks?

Today's traders need every bit of info to help them make the right market call.

In this Jupyter notebook, we analyze the correlation between daily news headlines and their effect on the Dow Jones stock market index.

The notebook employs data analysis techniques, machine learning algorithms and Spark.

_Dataset courtesy of Sun, J. \[1\]._

[![Author - Andreja Nesic](https://andrejanesic.com/git-signature-sm.png)](https://github.com/andrejanesic)

## Notebook Setup

Let's install the required dependencies, configure our environment and download the data.

**NOTE:** a Kaggle API key is needed to download the data. To obtain one, you need a _(free)_ [Kaggle account](https://www.kaggle.com/account/login). Go to the [Account tab](https://www.kaggle.com/me/account) and click "Create New API Token". Open the downloaded `kaggle.json` file and **copy the entire line into the "Kaggle API token:" input.** The token will be cached for the following runs locally.

_Based on Barve, S. comment \[2\]._

In [0]:
# Set up Kaggle dependency and dir
!mkdir -p ~/.kaggle
!chmod -R 600 ~/.kaggle
!pip install -q kaggle

You should consider upgrading via the '/local_disk0/.ephemeral_nfs/envs/pythonEnv-0c81a3b2-1bdb-468e-b52b-1c2e95dfae51/bin/python -m pip install --upgrade pip' command.[0m


In [0]:
import os

# Fetch Kaggle API token
KAGGLE_API_TOKEN_PATH = os.path.expanduser("~/.kaggle/kaggle.json")

try:
    with open(KAGGLE_API_TOKEN_PATH, "r") as f:
        temp = f.read()
        if len(temp) > 0:
            print("Using cached Kaggle API token")
except:
    pass

try:
    KAGGLE_API_TOKEN
except:
    KAGGLE_API_TOKEN = input("Kaggle API token:")

with open(KAGGLE_API_TOKEN_PATH, "w") as f:
    f.write(KAGGLE_API_TOKEN)

Kaggle API token: {"username":"andrejanesic","key":"b736f5a8216013428fa36b20c2659286"}

In [0]:
# Download dataset with given API key
!mkdir -p ./data
!kaggle datasets download -d aaron7sun/stocknews -p ./data

Downloading stocknews.zip to ./data
  0%|                                               | 0.00/5.82M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 5.82M/5.82M [00:00<00:00, 79.6MB/s]


In [0]:
# Unzip dataset
!unzip -u ./data/*.zip -d ./data

Archive:  ./data/stocknews.zip
  inflating: ./data/Combined_News_DJIA.csv  
  inflating: ./data/RedditNews.csv   
  inflating: ./data/upload_DJIA_table.csv  


In [0]:
# Available dataset files
!ls -1 ./data | grep -v '\.zip$'

Combined_News_DJIA.csv
RedditNews.csv
upload_DJIA_table.csv


In [0]:
# Sample data to verify
!head -n 5 ./data/upload_DJIA_table.csv

Date,Open,High,Low,Close,Volume,Adj Close
2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703


In [0]:
# Sample data to verify
!head -n 5 ./data/RedditNews.csv

Date,News
2016-07-01,"A 117-year-old woman in Mexico City finally received her birth certificate, and died a few hours later. Trinidad Alvarez Lira had waited years for proof that she had been born in 1898."
2016-07-01,IMF chief backs Athens as permanent Olympic host
2016-07-01,"The president of France says if Brexit won, so can Donald Trump"
2016-07-01,British Man Who Must Give Police 24 Hours' Notice of Sex Threatens Hunger Strike: The man is the subject of a sexual risk order despite having never been convicted of a crime.


In [0]:
# Other dependencies
!pip install nltk
!pip install wordcloud
!pip install vaderSentiment

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
[?25l[K     |▏                               | 10 kB 27.9 MB/s eta 0:00:01[K     |▍                               | 20 kB 8.2 MB/s eta 0:00:01[K     |▋                               | 30 kB 11.6 MB/s eta 0:00:01[K     |▉                               | 40 kB 6.3 MB/s eta 0:00:01[K     |█                               | 51 kB 6.2 MB/s eta 0:00:01[K     |█▎                              | 61 kB 7.3 MB/s eta 0:00:01[K     |█▌                              | 71 kB 7.8 MB/s eta 0:00:01[K     |█▊                              | 81 kB 8.8 MB/s eta 0:00:01[K     |██                              | 92 kB 8.0 MB/s eta 0:00:01[K     |██▏                             | 102 kB 6.2 MB/s eta 0:00:01[K     |██▍                             | 112 kB 6.2 MB/s eta 0:00:01[K     |██▋                             | 122 kB 6.2 MB/s eta 0:00:01[K     |██▉                             | 133 kB 6.2 MB/s eta 0:00:01[K     

In [0]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[9]: True

## Import & Prepare Data

### Copy to DBFS

Our files are in local mode, but we want them on DBFS. Let's copy them.

In [0]:
# Copy to DBFS
import os

DBFS_DATA_DIR = f"/tmp/data/"

input_files = ["RedditNews.csv", "upload_DJIA_table.csv"]
for f in input_files:
    src = os.path.abspath(f"./data/{f}")
    dbutils.fs.cp(f"file:{src}", f"{DBFS_DATA_DIR}/in/{f}")

In [0]:
# Copied files
for f in dbutils.fs.ls(DBFS_DATA_DIR):
    print(f)

FileInfo(path='dbfs:/tmp/data/RedditNews.csv', name='RedditNews.csv', size=9099659, modificationTime=1674638016000)
FileInfo(path='dbfs:/tmp/data/in/', name='in/', size=0, modificationTime=0)
FileInfo(path='dbfs:/tmp/data/out/', name='out/', size=0, modificationTime=0)
FileInfo(path='dbfs:/tmp/data/upload_DJIA_table.csv', name='upload_DJIA_table.csv', size=167083, modificationTime=1674638017000)


### Import DJIA Data

Let's import DJIA data from DBFS with a custom schema, and cache it for faster access.

In [0]:
import os
import pyspark.sql.types as T

# Absolute path required by PySpark
pathDjia = os.path.abspath(f"{DBFS_DATA_DIR}/in/upload_DJIA_table.csv")

# Define custom schema for optimal performance
schemaDjia = T.StructType([
    T.StructField("Date", T.DateType(), True),
    T.StructField("Open", T.DoubleType(), True),
    T.StructField("High", T.DoubleType(), True),
    T.StructField("Low", T.DoubleType(), True),
    T.StructField("Close", T.DoubleType(), True),
    T.StructField("Volume", T.DoubleType(), True),
    T.StructField("Adj Close", T.DoubleType(), True),
])

# Load CSV with header
dfDjia = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schemaDjia) \
    .load(pathDjia)

In [0]:
# Check data
display(dfDjia)

Date,Open,High,Low,Close,Volume,Adj Close
2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000.0,17949.369141
2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000.0,17929.990234
2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000.0,17694.679688
2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000.0,17409.720703
2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000.0,17140.240234
2016-06-24,17946.630859,17946.630859,17356.339844,17400.75,239000000.0,17400.75
2016-06-23,17844.109375,18011.070312,17844.109375,18011.070312,98070000.0,18011.070312
2016-06-22,17832.669922,17920.160156,17770.359375,17780.830078,89440000.0,17780.830078
2016-06-21,17827.330078,17877.839844,17799.800781,17829.730469,85130000.0,17829.730469
2016-06-20,17736.869141,17946.359375,17736.869141,17804.869141,99380000.0,17804.869141


In [0]:
# Cache for faster access
dfDjia.cache()

Out[14]: DataFrame[Date: date, Open: double, High: double, Low: double, Close: double, Volume: double, Adj Close: double]

### Import News Data

Let's import news data from DBFS with a custom schema, and cache it for faster access.

In [0]:
import os
import pyspark.sql.types as T

# Absolute path required by PySpark
pathNews = os.path.abspath(f"{DBFS_DATA_DIR}/in/RedditNews.csv")

# Define custom schema for optimal performance
schemaNews = T.StructType([
    T.StructField("Date", T.DateType(), True),
    T.StructField("News", T.StringType(), True),
])

# Load CSV with header
dfNews = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schemaNews) \
    .load(pathNews)

In [0]:
# Check data
display(dfNews)

Date,News
2016-07-01,"A 117-year-old woman in Mexico City finally received her birth certificate, and died a few hours later. Trinidad Alvarez Lira had waited years for proof that she had been born in 1898."
2016-07-01,IMF chief backs Athens as permanent Olympic host
2016-07-01,"The president of France says if Brexit won, so can Donald Trump"
2016-07-01,British Man Who Must Give Police 24 Hours' Notice of Sex Threatens Hunger Strike: The man is the subject of a sexual risk order despite having never been convicted of a crime.
2016-07-01,100+ Nobel laureates urge Greenpeace to stop opposing GMOs
2016-07-01,Brazil: Huge spike in number of police killings in Rio ahead of Olympics
2016-07-01,Austria's highest court annuls presidential election narrowly lost by right-wing candidate.
2016-07-01,"Facebook wins privacy case, can track any Belgian it wants: Doesn't matter if Internet users are logged into Facebook or not"
2016-07-01,"Switzerland denies Muslim girls citizenship after they refuse to swim with boys at school: The 12- and 14-year-old will no longer be considered for naturalised citizenship because they have not complied with the school curriculum, authorities in Basel said"
2016-07-01,"China kills millions of innocent meditators for their organs, report finds"


In [0]:
# Cache for faster access
dfNews.cache()

Out[17]: DataFrame[Date: date, News: string]

## Data Cleanup & Transformation

Before jumping into analysis, we'll need to check our data for bad values and conduct some transformations to get more useful values.

### Analyze

We'll start by viewing info provided by Pandas' `describe()` function.

In [0]:
pdDjia = dfDjia.toPandas()
pdDjia.describe()

Unnamed: 0,Open,High,Low,Close,Volume,Adj Close
count,1989.0,1989.0,1989.0,1989.0,1989.0,1989.0
mean,13459.116048,13541.303173,13372.931728,13463.032255,162811000.0,13463.032255
std,3143.281634,3136.271725,3150.420934,3144.006996,93923430.0,3144.006996
min,6547.009766,6709.609863,6469.950195,6547.049805,8410000.0,6547.049805
25%,10907.339844,11000.980469,10824.759766,10913.379883,100000000.0,10913.379883
50%,13022.049805,13088.110352,12953.129883,13025.580078,135170000.0,13025.580078
75%,16477.699219,16550.070312,16392.769531,16478.410156,192600000.0,16478.410156
max,18315.060547,18351.359375,18272.560547,18312.390625,674920000.0,18312.390625


In [0]:
pdNews = dfNews.toPandas()
pdNews.describe()

Unnamed: 0,Date,News
count,73608,74092
unique,2943,73871
top,2008-10-26,\n
freq,50,130


### Date Validation

The `Date` column of both datasets will be extremely important for our analysis. Let's check to see the differences in the `Date` column of the respective datasets.

In [0]:
# Which dataset has more rows?
pdDjia.shape[0] - pdNews.shape[0]

Out[20]: -74114

In [0]:
pdDjia[['Date']].describe()

Unnamed: 0,Date
count,1989
unique,1989
top,2016-07-01
freq,1


In [0]:
pdNews[['Date']].describe()

Unnamed: 0,Date
count,73608
unique,2943
top,2008-10-26
freq,50


Here we can observe that there are significantly more News dataset rows than DJIA rows. We can also see that all of DJIA's `Date` keys are unique, which means one entry per day.

Let's see if there are any dates from DJIA missing in the News dataset:

In [0]:
dfDjia.join(dfNews, "Date", "leftanti").show()

+----+----+----+---+-----+------+---------+
|Date|Open|High|Low|Close|Volume|Adj Close|
+----+----+----+---+-----+------+---------+
+----+----+----+---+-----+------+---------+



In [0]:
dfDjia.join(dfNews, "Date", "leftanti").count()

Out[24]: 0

The result is empty, which means that there's at least 1 news entry for every DJIA row. Now let's reverse the join:

In [0]:
dfNews.join(dfDjia, "Date", "leftanti").show()

+----------+--------------------+
|      Date|                News|
+----------+--------------------+
|2016-06-26|Authorities raid ...|
|2016-06-26|Pope Francis uses...|
|2016-06-26|Vladimir Putin Sa...|
|2016-06-26|Poll puts support...|
|2016-06-26|Brexit: Expats de...|
|2016-06-26|The petition to r...|
|2016-06-26|Historian Gudni J...|
|2016-06-26|Nigel Farage has ...|
|2016-06-26|Iraq takes 'last ...|
|2016-06-26|Liberal Democrats...|
|2016-06-26|Nigerian Army ann...|
|2016-06-26|The United Nation...|
|2016-06-26|UK food prices se...|
|2016-06-26|Prices have incre...|
|2016-06-26|No chance of a se...|
|2016-06-26|Nicola Sturgeon s...|
|2016-06-26|"Labour MP David ...|
|2016-06-26|Peru's state ener...|
|2016-06-26|HSBC 'to move job...|
|2016-06-26|A coral reef plea...|
+----------+--------------------+
only showing top 20 rows



In [0]:
extra = dfNews.join(dfDjia, "Date", "leftanti").count()
extra

Out[26]: 26385

In [0]:
extra / dfNews.count()

Out[27]: 0.3467011812937729

Here we can see that about 34.7% of the News dataset consists of entries not in the DJIA set.

Let's examine the continuity of the dates. We want to find out if there are any skipped dates in either dataset:

In [0]:
# Shift pdNews by one and matrix subtract to get date difference
(pdNews[1:].reset_index(drop=True)["Date"] - pdNews[:-1].reset_index(drop=True)["Date"]).dt.days.describe()

Out[28]: count    71514.000000
mean        -0.039671
std          0.195400
min         -2.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          0.000000
Name: Date, dtype: float64

In [0]:
# Shift pdNews by one and matrix subtract to get date difference
(pdDjia[1:].reset_index(drop=True)["Date"] - pdDjia[:-1].reset_index(drop=True)["Date"]).dt.days.describe()

Out[29]: count    1988.000000
mean       -1.450704
std         0.878694
min        -5.000000
25%        -1.000000
50%        -1.000000
75%        -1.000000
max        -1.000000
Name: Date, dtype: float64

From the calculations above, we can see that the News dataset has multiple news for each day, and there may exist a 2-day difference between two dates. As for DJIA data, the maximum difference between two dates is 5 days, with the mean difference of 1.45 days.

This means we can't fully rely on DJIA set as it isn't fully continuous, but we'll try and make use of the data at hand. Values for missing days may be extrapolated by using average values of the days between.

### DJIA Transformation

There are a few transformations we can apply to the DJIA dataset to give us more information.

First, let's calculate the daily change in stock value:

In [0]:
# Sort by date###T###
dfDjia = dfDjia.orderBy("Date")

In [0]:
# DJIA change###T###
dfDjia = dfDjia.withColumn('Change', dfDjia['Close'] - dfDjia['Open'])

In [0]:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Simple moving average (close) over 5, 14 and 30 days###T###
n = 5
dfDjia = dfDjia.withColumn(f'SMA{n}', F.avg(dfDjia["Close"]).over(Window.rowsBetween(1 - n, 0)))
n = 14
dfDjia = dfDjia.withColumn(f'SMA{n}', F.avg(dfDjia["Close"]).over(Window.rowsBetween(1 - n, 0)))
n = 30
dfDjia = dfDjia.withColumn(f'SMA{n}', F.avg(dfDjia["Close"]).over(Window.rowsBetween(1 - n, 0)))

In [0]:
# We'll add an ID for easier manipulation of data###T###
dfDjia = dfDjia.withColumn("id", F.monotonically_increasing_id())

# On-balance value###T###
diff = dfDjia["Close"] - F.lag(dfDjia["Close"], 1).over(Window.partitionBy().orderBy("id"))
dfDjia = dfDjia.withColumn("OBV", (F.signum(diff) * dfDjia["Volume"]))
dfDjia = dfDjia.fillna(0, subset="OBV")
dfDjia = dfDjia.withColumn("OBV", F.sum("OBV").over(Window.partitionBy().orderBy("id")
                                                  .rangeBetween(Window.unboundedPreceding, 0)))

In [0]:
# Whether stock went up or down for the day###T###
dfDjia = dfDjia.withColumn("Increase", udf(lambda x: 1 if x > 0 else 0, T.IntegerType())(dfDjia["Change"]))

In [0]:
# Whether stock went up or down compared to -n days###T###
n = 1
dfDjia = dfDjia.withColumn(f"Change {n}D", udf(lambda x: 1 if x > 0 else 0, T.IntegerType())(
    dfDjia["Close"] - F.lag("Close", n, default=1).over(Window.partitionBy().orderBy("id"))))
n = 5
dfDjia = dfDjia.withColumn(f"Change {n}D", udf(lambda x: 1 if x > 0 else 0, T.IntegerType())(
    dfDjia["Close"] - F.lag("Close", n, default=1).over(Window.partitionBy().orderBy("id"))))
n = 14
dfDjia = dfDjia.withColumn(f"Change {n}D", udf(lambda x: 1 if x > 0 else 0, T.IntegerType())(
    dfDjia["Close"] - F.lag("Close", n, default=1).over(Window.partitionBy().orderBy("id"))))
n = 30
dfDjia = dfDjia.withColumn(f"Change {n}D", udf(lambda x: 1 if x > 0 else 0, T.IntegerType())(
    dfDjia["Close"] - F.lag("Close", n, default=1).over(Window.partitionBy().orderBy("id"))))

Let's visualize our new dataset with the new indicators:

In [0]:
display(dfDjia)

Date,Open,High,Low,Close,Volume,Adj Close,Change,SMA5,SMA14,SMA30,id,OBV,Increase,Change 1D,Change 5D,Change 14D,Change 30D
2008-08-08,11432.089844,11759.959961,11388.040039,11734.320312,212830000.0,11734.320312,302.23046799999975,11734.320312,11734.320312,11734.320312,0,0.0,1,1,1,1,1
2008-08-11,11729.669922,11867.110352,11675.530273,11782.349609,183190000.0,11782.349609,52.67968700000165,11758.3349605,11758.3349605,11758.3349605,1,183190000.0,1,1,1,1,1
2008-08-12,11781.700195,11782.349609,11601.519531,11642.469727,173590000.0,11642.469727,-139.23046799999975,11719.713215999998,11719.713215999998,11719.713215999998,2,9600000.0,0,0,1,1,1
2008-08-13,11632.80957,11633.780273,11453.339844,11532.959961,182550000.0,11532.959961,-99.84960899999896,11673.02490225,11673.02490225,11673.02490225,3,-172950000.0,0,0,1,1,1
2008-08-14,11532.070312,11718.280273,11450.889648,11615.929688,159790000.0,11615.929688,83.85937600000034,11661.6058594,11661.6058594,11661.6058594,4,-13160000.0,1,1,1,1,1
2008-08-15,11611.209961,11709.889648,11599.730469,11659.900391,215040000.0,11659.900391,48.69042999999874,11646.7218752,11661.321614666667,11661.321614666667,5,201880000.0,1,1,0,1,1
2008-08-18,11659.650391,11690.429688,11434.120117,11479.389648,156290000.0,11479.389648,-180.2607429999989,11586.129882999998,11635.331333714286,11635.331333714286,6,45590000.0,0,0,0,1,1
2008-08-19,11478.089844,11478.169922,11318.5,11348.549805,171580000.0,11348.549805,-129.54003899999952,11527.3458986,11599.483642625,11599.483642625,7,-125990000.0,0,0,0,1,1
2008-08-20,11345.94043,11454.150391,11290.580078,11417.429688,144880000.0,11417.429688,71.48925799999961,11504.239844000002,11579.255425444446,11579.255425444446,8,18890000.0,1,1,0,1,1
2008-08-21,11415.230469,11476.209961,11315.570312,11430.209961,130020000.0,11430.209961,14.979492000000391,11467.0958986,11564.350879,11564.350879,9,148910000.0,1,1,0,1,1


Output can only be rendered in Databricks

Output can only be rendered in Databricks

Output can only be rendered in Databricks

### News Transformation

Our News dataset doesn't contain a lot of info&mdash;besides the actual headlines. Let's apply some transformations to extract more useful data.

In [0]:
# Let's do some basic tokenization & lemmatization###T###
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stopwords_eng = set(stopwords.words("english"))
def udf_lemmatize(txt):
    txt = str(txt)
    txt = re.sub(r"[^a-zA-Z\s]", "", txt)
    txt = re.sub(r"^b(.*)", r"\g<1>", txt)
    txt = txt.lower()
    txt = re.sub(r"[^a-z\s]", "", txt)
    txt = txt.split()
    
    ps = PorterStemmer()
    return [ps.stem(x) for x in txt if x not in stopwords_eng and ps.stem(x) != ""]

dfNews = dfNews.withColumn("Words", udf(lambda x: udf_lemmatize(x), T.ArrayType(T.StringType()))(F.col("News")))

In [0]:
# Join the lemmatized col###T###
dfNews = dfNews.withColumn("Clean", F.udf(lambda x: " ".join(x), T.StringType())(dfNews["Words"]))

In [0]:
# Word count###T###
dfNews = dfNews.withColumn("Count", F.udf(lambda x: len(x), T.IntegerType())(dfNews["Words"]))

In [0]:
# Word frequency###T###
def udf_freq(words):
    d = dict()
    for w in words:
        if not (w in d):
            d[w] = 0
        d[w] += 1
    return d
dfNews = dfNews.withColumn("Freq", F.udf(lambda x: udf_freq(x), T.MapType(T.StringType(), T.IntegerType()))(dfNews["Words"]))

Sentiment analysis of the headlines may provide additional insights into the nature of current market events. We'll first drop all rows that are missing a value:

In [0]:
dfNews = dfNews.na.drop("any")

In [0]:
# TODO needs more optimization, otherwise too slow!
# Set up NLP with vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
def get_nlp(text):
    if not text:
        return 0
    if not (text in get_nlp.cache):
        get_nlp.cache[text] = analyzer.polarity_scores(text)["compound"]
    t = get_nlp.cache[text]
    if not t:
        return 0
    else:
        return t
get_nlp.cache = dict()

dfNews = dfNews.withColumn("Neutrality", F.udf((lambda x: get_nlp(x)), T.DoubleType())(dfNews["Clean"]).cast(T.DoubleType()))

def get_neutrality(x):
    if not x or not (type(x) == type(0.1)):
        return "Unknown"
    if x >= 0.05:
        return "Positive"
    if x <= -0.05:
        return "Negative"
    return "Neutral"
dfNews = dfNews.withColumn("Emotion", F.udf((lambda x: get_neutrality(x)), T.StringType())(dfNews["Neutrality"]).cast(T.StringType()))
dfNews = dfNews.fillna(0, subset="Neutrality")
dfNews = dfNews.fillna("Unknown", subset="Emotion")

In [0]:
# Only first n rows because this can be slow
t = dfNews.orderBy("Date").head(100)
display(t)

Date,News,Words,Clean,Count,Freq,Neutrality,Emotion
2008-06-08,b'Nim Chimpsky: The tragedy of the chimp who thought he was a boy (and proved that humans were not humane)',"List(nim, chimpski, tragedi, chimp, thought, boy, prove, human, human)",nim chimpski tragedi chimp thought boy prove human human,9,"Map(chimpski -> 1, prove -> 1, thought -> 1, human -> 2, tragedi -> 1, boy -> 1, nim -> 1, chimp -> 1)",0.0,Unknown
2008-06-08,"""b""""Canada: Beware slippery slope' to censorship","List(canada, bewar, slipperi, slope, censorship)",canada bewar slipperi slope censorship,5,"Map(bewar -> 1, canada -> 1, censorship -> 1, slope -> 1, slipperi -> 1)",0.0,Unknown
2008-06-08,"""b'EU Vice-President Luisa Morgantini and the Irish Nobel laureate, Mairead Corrigan, have been tear gased and injured by the IDF while attending the """"International Conference on Non-violent Resistance""""'""","List(eu, vicepresid, luisa, morgantini, irish, nobel, laureat, mairead, corrigan, tear, gase, injur, idf, attend, intern, confer, nonviol, resist)",eu vicepresid luisa morgantini irish nobel laureat mairead corrigan tear gase injur idf attend intern confer nonviol resist,18,"Map(nobel -> 1, eu -> 1, mairead -> 1, tear -> 1, luisa -> 1, confer -> 1, injur -> 1, vicepresid -> 1, resist -> 1, morgantini -> 1, corrigan -> 1, idf -> 1, nonviol -> 1, laureat -> 1, irish -> 1, attend -> 1, intern -> 1, gase -> 1)",0.0,Unknown
2008-06-08,"""b""""Israeli minister: Israel will attack Iran if it doesn't abandon its nuclear program""""""","List(isra, minist, israel, attack, iran, doesnt, abandon, nuclear, program)",isra minist israel attack iran doesnt abandon nuclear program,9,"Map(program -> 1, isra -> 1, israel -> 1, nuclear -> 1, iran -> 1, doesnt -> 1, abandon -> 1, attack -> 1, minist -> 1)",-0.1764,Negative
2008-06-08,"b'Albino Killings in Tanzania. At least 19 albinos, including several young children, have been killed in Tanzania in the past year. [video] '","List(albino, kill, tanzania, least, albino, includ, sever, young, children, kill, tanzania, past, year, video)",albino kill tanzania least albino includ sever young children kill tanzania past year video,14,"Map(children -> 1, young -> 1, albino -> 2, tanzania -> 2, year -> 1, least -> 1, includ -> 1, sever -> 1, video -> 1, kill -> 2, past -> 1)",-0.886,Negative
2008-06-08,"""b'Chiapas: army occupies Zapatista communities in """"anti-drug"""" ops'""","List(chiapa, armi, occupi, zapatista, commun, antidrug, op)",chiapa armi occupi zapatista commun antidrug op,7,"Map(chiapa -> 1, armi -> 1, commun -> 1, zapatista -> 1, antidrug -> 1, occupi -> 1, op -> 1)",0.0,Unknown
2008-06-08,"b'Polar bear swims 200 miles, is shot dead upon arrival'","List(polar, bear, swim, mile, shot, dead, upon, arriv)",polar bear swim mile shot dead upon arriv,8,"Map(polar -> 1, bear -> 1, dead -> 1, arriv -> 1, swim -> 1, upon -> 1, shot -> 1, mile -> 1)",-0.6486,Negative
2008-06-08,"b'News is a contraband item in Pakistan now, and it is being sold on the black market,'","List(news, contraband, item, pakistan, sold, black, market)",news contraband item pakistan sold black market,7,"Map(market -> 1, contraband -> 1, pakistan -> 1, black -> 1, news -> 1, sold -> 1, item -> 1)",0.0,Unknown
2008-06-08,"b'Albinos, Long Shunned, Face Threat in Tanzania where witch doctors are now marketing albino skin, bones and hair as ingredients in potions that are promised to make people rich.'","List(albino, long, shun, face, threat, tanzania, witch, doctor, market, albino, skin, bone, hair, ingredi, potion, promis, make, peopl, rich)",albino long shun face threat tanzania witch doctor market albino skin bone hair ingredi potion promis make peopl rich,19,"Map(market -> 1, albino -> 2, hair -> 1, potion -> 1, promis -> 1, tanzania -> 1, rich -> 1, ingredi -> 1, doctor -> 1, bone -> 1, shun -> 1, face -> 1, long -> 1, witch -> 1, make -> 1, threat -> 1, peopl -> 1, skin -> 1)",-0.3182,Negative
2008-06-08,b'Town in Britain Plans to Start its Own Currency',"List(town, britain, plan, start, currenc)",town britain plan start currenc,5,"Map(britain -> 1, currenc -> 1, town -> 1, start -> 1, plan -> 1)",0.0,Unknown


In [0]:
dfNews.cache()

Out[44]: DataFrame[Date: date, News: string, Words: array<string>, Clean: string, Count: int, Freq: map<string,int>, Neutrality: double, Emotion: string]

Finally, for purposes of machine learning, we'll want to join all the news for one day into one column. We'll also need to concatenate it with the results from the DJIA dataset (whether the stock went up or down).

There are several ways to do this. We can observe whether DJIA went up or down the day the news was published. For prediction, we can also observe whether DJIA went down a few days later (etc. +1, +5 or +10 days). In this case, a single news headline may not bear much significance in "affecting" the DJIA, but observing multiple headlines on the same day or over a range of days may give better prediction.

In [0]:
import re

# Group by date -> merge ###T###
dfNewsConcat = dfNews.groupBy("Date").agg(F.concat_ws("[STOP]", F.collect_list(dfNews["Clean"])).alias("Concat"))

# "b..." appearing, remove:
dfNewsConcat = dfNewsConcat.withColumn("Concat", F.udf(lambda x: re.sub(r"\[STOP\]b", "[STOP]", x[1:]), T.StringType())(dfNewsConcat["Concat"]))

# Split into cols###T###
for i in range(25):
    dfNewsConcat = dfNewsConcat.withColumn(f"News {i}", F.split(dfNewsConcat["Concat"], r"\[STOP\]").getItem(i))

dfNewsCols = dfNewsConcat.drop("Concat")
dfNewsCols.printSchema()

root
 |-- Date: date (nullable = true)
 |-- News 0: string (nullable = true)
 |-- News 1: string (nullable = true)
 |-- News 2: string (nullable = true)
 |-- News 3: string (nullable = true)
 |-- News 4: string (nullable = true)
 |-- News 5: string (nullable = true)
 |-- News 6: string (nullable = true)
 |-- News 7: string (nullable = true)
 |-- News 8: string (nullable = true)
 |-- News 9: string (nullable = true)
 |-- News 10: string (nullable = true)
 |-- News 11: string (nullable = true)
 |-- News 12: string (nullable = true)
 |-- News 13: string (nullable = true)
 |-- News 14: string (nullable = true)
 |-- News 15: string (nullable = true)
 |-- News 16: string (nullable = true)
 |-- News 17: string (nullable = true)
 |-- News 18: string (nullable = true)
 |-- News 19: string (nullable = true)
 |-- News 20: string (nullable = true)
 |-- News 21: string (nullable = true)
 |-- News 22: string (nullable = true)
 |-- News 23: string (nullable = true)
 |-- News 24: string (nullable = tr

In [0]:
# Add predictions from DJIA###T###
dfMl = dfNewsCols.join(dfDjia.select("Date", "Change 1D", "Change 5D", "Change 14D", "Change 30D"), "Date", "leftouter")

# Add ID for easier data splitting###T###
dfMl = dfMl.orderBy("Date").withColumn("id", F.monotonically_increasing_id())
dfMl.cache()
display(dfMl.head(100))

Date,News 0,News 1,News 2,News 3,News 4,News 5,News 6,News 7,News 8,News 9,News 10,News 11,News 12,News 13,News 14,News 15,News 16,News 17,News 18,News 19,News 20,News 21,News 22,News 23,News 24,Change 1D,Change 5D,Change 14D,Change 30D,id
2008-06-08,im chimpski tragedi chimp thought boy prove human human,canada bewar slipperi slope censorship,eu vicepresid luisa morgantini irish nobel laureat mairead corrigan tear gase injur idf attend intern confer nonviol resist,isra minist israel attack iran doesnt abandon nuclear program,albino kill tanzania least albino includ sever young children kill tanzania past year video,chiapa armi occupi zapatista commun antidrug op,polar bear swim mile shot dead upon arriv,news contraband item pakistan sold black market,albino long shun face threat tanzania witch doctor market albino skin bone hair ingredi potion promis make peopl rich,town britain plan start currenc,bc report among dead afghanistan,lebanes women still vulner violenc exist legal framework violat intern human right law fail protect women domest violenc,polic releas chill imag man left dead hitandrun driver,iceland open europ largest nation park squar kilomet squar mile vatnajkul nation park,citizen fight blackwat potrero locat chosen train cheap mexican soldier perform extraordinari rendit partner facil mexico,korean protest,oil reserv last decad bbc scotland investig told,camera design detect terrorist facial express,isra peac activist protest year occup,earthquak hit china southern qinghai provinc,man goe berzerk akihabara stab everyon nearbi dead injur,threat world aid pandem among heterosexu report admit,angst ankara turkey steer danger ident crisi,uk ident card could use spi peopl new children databas may use identifi like futur crimin covert surveil gone far,marriag said reduc statu commerci transact women could discard husband claim discov hidden defect,,,,,0
2008-06-09,nit state quit human right council,pentagon block cheney attack iran,j street,former ambassador joseph wilson us militari proselyt behalf constitut unit state rather behalf sort fanat view end time,eu leader anxious await irish verdict lisbon treati ireland countri allow referendum vote june th,hit stab,treati tension mount iraq tell us want troop back barrack,council paint street artist banksi graffiti mural worth fortun,finder keeper get complic half billion dollar lawyer involv,chew qat yemen per cent men per cent women chew qat two six hour day,one uk experienc decor special forc soldier quit armi criticis govern risk soldier live fail fund troop equip,itter struggl two power beltway terror analyst broken whether al qaeda still aliv well,jailer guantanamo urg destroy interrog note lawyer,uk law made brussel uk parliament power reject amend voter realiz mani power transfer elsewher,virgin media uk work record industri spi threaten download,ukrain miner trap underground threat rise water,ilderberg attende geithner call global bank framework,jo manuel barroso bulli irish say price pay ireland reject lisbon treati,dont worri everyon,ush attack iran way offic,futur unit state europ hand ireland,militari coup zimbabw mugab forc cede power gener,rise oil price spark strike spain saudi call meet,chvez farc ask end arm strugl guerrilla war histori time come free hostag,flier pain airlin pack,,,,,1
2008-06-10,il shortag myth say industri insid,israel launch iran command war,petit call investig read record canada parliament today,canadian hrc tribun forc pastor publicli renounc religi beleif,us eu threaten freez iranian asset unless uranium enrich program halt,thousand uk homeown face neg equiti,year old beat year old death worst case abus,white hous say senat doesnt get vote perman iraq base treati treati iraq us instal,man escap north korea year,yearold afghan suicid bomber,zimbabw run militari junta,former british prime minist fals link iraq misinform someth fear total loss privaci intrus state authoritarian tendenc,colleg drink heart problem,ten thousand protest sale us beef korea fear mad cow diseas,ush us eu must bond press iran nuke,shanghai composit tumbl percent,olmert hint us attack iran near,israel plan use nuke next war,uss liberti new revel attack american spi ship,world first church unearth jordan,corrupt ukrain allow epidem reach crisi point,ush discuss iran sanction eu leader,tibet olymp tradit,saddam tribe leader murder iraq,law creat underclass child crimin,,,,,2
2008-06-11,bc uncov lost iraq billion,war crimin georg w bush welcom uk,offens speech us say free,ushcheney plan demand us base iraq immun prosecut us troop privat militari contractor allow us troop remain countri indefinit,groceri refus sell jack daniel barbecu sauc alcohol yearold without id,farewel tour europ bush step threat iran,pakistan blame us coalit troop death,senior intellig offic leav top secret terrorist file train al qaeda truli danger lie leader paint would fuck,food schoolchildren order given backer presid robert mugab us ambassador said,cuba citi garden flourish veget herb organ,european commiss pledg support open standard promot interoper avoid vendor lockin european govern one technolog,china list olymp dont,log singapor hazard busi,iraqi denounc american demand militari base,africa world leader begin debat crop use biofuel food,iran st middl east stem cell research,set guin world record enjoy better web,franc block onlin child porn terror racism,race life come czech republ,theyr kill us darl realli sinc peopl run show found lie,opec support saudiback oil summit,major twelv pakistani soldier martyr,turkish pm court must explain headscarf rule,crazi fit tool,us lawmak say comput compromis chines,,,,,3
2008-06-12,s attack pakistan last night pakistan furi deadli us strike,us israel drop bomb iran shut gulf oil export launch barrag missil tanker strait hormuz arab oil facil,top secret document alqaeda left train senior intellig offici work cabinet offic document found passerbi hand bbc,seoul protest start allow possibl mad cow creutzfeldtjakob diseas beef threaten toppl govern dont care,jewish settler attack film,cult scientolog disabl man plymouth arrest stuart wyatt,one know longterm carri capac planet earth human absent cheap fossil fuel like lot fewer billion implic sober paralyz,robert mugab steal food aid hand children give polit support guy semblanc human left,huh american establish long histori incred dick enemi combat date back ww,marin mean motari get expel,american aid seiz zimbabw,us offer pledg bn afghanistan,great countri,unit airlin charg check bag,zimbabw militari activ involv run robert mugab reelect campaign,chines name children olymp game,pentagon releas video controversi air strike pakistan,dead hurt tornado hit boy scout camp,islam cleric issu fatwa condemn gambian presid gay genocid,footag deadli us airstrik,ritish mp resign seekrelect defenc civil liberti,john mccain believ would good thing longterm militari presenc iraq long casualti wouldnt make us conqueror occupi,china amp taiwan reopen talk,estonian parliament approv eu lisbon treati still wait ireland,ukranian women strip protest,,,,,4
2008-06-13,ook two plan attack germani persuad major popul support massiv dismantl civil right,uk mp resign launch oneman crusad govern plan suspect terrorist detain day without charg,german paper say mani lost faith america bush,jewish settler attack film footag video camera hand isra human right group appear show jewish settler beat palestinian west bank,us armi fraudul alter armor test result,zimbabw go shit even opposit leader arrest,fallujah babi suffer unpreced deform,sign thing come spain trucker strike crippl nation,comparison mani day held without charg seven differ countri way go uk,mexico spain presid reject embargo cuba say joint confer hasnt work,mugab strip us degre,insurg attack free hundr kandahar prison taliban roam street kandahar citi,hundr taliban milit escap prison afghanistan,ush admin insist abstinenceonli sex ed iran yep one call mullah distribut free condom amp needl fight aid year,mugab find difficult come term realiti longer popular threaten war,textbook privat islam school northern virginia teach student permiss muslim kill adulter convert islam accord feder investig,want help iran,us media blackout bilderberg,written bodi children realiti war caution disturb imag,drought ethiopia caus food shortag kill livestock doubl number peopl need urgent humanitarian aid million,lisbon treati european constituion reject ireland talli indic victori lisbon vote,isra settler throw stone palestinian boy steal donkey compass zionist style,gadhafi obama fear israel assassin like jfk,nuclear power includ unit state,eu new rule forc chang us firm requir compani demonstr chemic safe enter commerc opposit polici unit state,,,,,5
2008-06-14,ccain slam suprem court guantanamo rule call one worst histori,women find erot look nake man walk beach excit look landscap women find nake women arous,palestinian bar dead sea beach appeas isra settler,juri quebec acquit man firstdegre murder deadli shoot laval polic offic raid home last year,time make year newspap onlin avail,japanes requir waistlin measur must inch happen america,eu dictat may ram lisbon treati despit irish reject,tini teenag india smallest girl worldft tall,dont clue even may,german paper say mani lost faith america bush,much improv iraq militia leader alsadr call new offens us forc,ukrainian femal student strip protest hot water,eu referendum czech presid say lisbon treati project,canada govern take children hold unpopular belief,see eyewit testimoni vid,mugab vow opposit never lead zimbabw threaten civil war meanwhil rival presid rearrest,prostitut unheard taliban rule common afghanistan,taliban milit escap kandahar prison attack car bomb rocket,popular palm oil push orangutan toward extinct produc resist movement forc food manufactur list palm oil product label,eu leader vow press treati defeat,earthquak rattl japan,ill press iraq recogn israel,presid jewish nation fund florida zone arrest prostitut charg connect craigslist sting,new isra home unlaw palestinian leader criticis isra plan build,european lobbi transpar cyber action,,,,,6
2008-06-15,mpeach front page stori le mond bare cover top tier main stream media us great gore vidal articl excel comment,get osama bin laden leav offic order georg w bush,ill moyer journal nation intellig agenc issu contract spi player massiv multiplay onlin game like world warcraft second life,may us alli troop kill guerrilla afghanistan iraq,ireland punish eu constitut trick add bill right took get us constitut approv,congressman nadler senior american offici ought go jail,maliki call us demand unaccept,america prison terrorist often held wrong men,lock indefinit,pair report interview former guantanamo detaine,ritain militari say soldier wear uniform london gay pride march,nuclear weapon blueprint share axi evil pakistani scientist bush smoke gun,saudi arabia oil minist sunday address report world largest oilproduc countri set rais product,us professor speak israel proisrael lobbi us lead charg attack iran damag us interest,world expens real estat market london manhattan,in laden still aliv still threat bush win war terror,expert astound high rate imprison new zealandalmost twice western european countri,presid afghanistan threaten sunday send troop pakistan fight taliban prison oper liber hundr fighter,cover beij game expect censor,institut investig communist crime romania,afghanistan threaten invad pakistan,real peopl eurocrat extra time,drug cartel led woman turn mexican town shoot galleri,ush urg british pm brown set iraq pullout timet,karzai threaten send forc pakistan,,,,,7
2008-06-16,ran withdraw billion europ,lithuania reject idea send germani invoic occupi countri ww ii instead lithuania demand cash russia forc soviet union,israel press reveal armi kill cameraman,peopl understand annoy smash head countri request foreign secur servic,russia father identifi miss son neonazi behead video,guantanamo routin held wrong men,dont see cnn iof shoot live ammunit peac protest,repress govern arrest record number blogger speak onlin,weird bin laden son aspir peac activist,red alert taliban offens taken control eight villag march toward kandahar citi,lueprint nuclear warhead found smuggler comput,reason bicycl popular vehicl world today,happi food crisi produc lot food problem market wtf,rfk jr warn canadian corpor media,protest chant georg bush terrorist arrest georg bush bush dine rupert murdoch,satellit document war destruct outer space,leak detail ciamossad plot iran,iraq take turn toward tehran,russia wonder announc,american longer tallest peopl world,man gun stomp toddler death,want understand conflict central asia cute chibi countri help,panic grow muddl think us airlin industri hip,refuge forcibl deport australia china tortur commit suicid,austrian cellar wife return dungeon,,,,,8
2008-06-17,oll world trust ahmadinejad bush right thing regard world affair,egypt achiev truce israel hama leader go effect within three day,new amsterdam smoke ban doesnt appli marijuana break law smoke tobaccobas joint,swede take street fight domest spi,guantnamo prove use distract secret detent camp run us around world,london polic attack bush protest truncheon,fifth sever foot found canadian coast,hama say reach ceasefir truce israel,undercov bbc report zimbabw encount fear mugab support,report us gave green light taliban prison attack,west bank attack arrest,vibrant london demonstr georg bush attack polic,top world smallest car,much talk terrorist,wind power blow peru brighten futur,israel hama agre gaza truce,shame us refuge iraq,ahmadinejad opec dump weak dollar,fifth sever foot found british columbia shorelin,acklash brew beckham fellow band famou laker fan whose prime seat cost,time travel could possibl month,iran attack like bush pretend use diplomaci,reak car bomb kill iraqi crowd marketplac,iran withdraw b european bank prevent seizur,serbia reject kosovo constitut,,,,,9


### Save Data

Finally, we'll save our distributed datasets to DBFS.

In [0]:
# Persist to cache and disk
from pyspark import StorageLevel

dfDjia.persist(StorageLevel.MEMORY_AND_DISK)
dfNews.persist(StorageLevel.MEMORY_AND_DISK)
dfMl.persist(StorageLevel.MEMORY_AND_DISK)

Out[47]: DataFrame[Date: date, News 0: string, News 1: string, News 2: string, News 3: string, News 4: string, News 5: string, News 6: string, News 7: string, News 8: string, News 9: string, News 10: string, News 11: string, News 12: string, News 13: string, News 14: string, News 15: string, News 16: string, News 17: string, News 18: string, News 19: string, News 20: string, News 21: string, News 22: string, News 23: string, News 24: string, Change 1D: int, Change 5D: int, Change 14D: int, Change 30D: int, id: bigint]

In [0]:
# Save to DBFS
out_path = f"dbfs:{DBFS_DATA_DIR}/out/"

t = [
    (dfDjia, f"{out_path}djia.parquet"),
    (dfNews, f"{out_path}news.parquet"),
    (dfMl,   f"{out_path}ml.parquet"),
]

# Write
for (df, fp) in t:
    df.write.mode("overwrite").parquet(fp)

## Part 2

The second part of this notebook is available here:

**[Daily News & Stock Market Correlation-Prediction (2-4)](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1993205155917960/4235175522479872/6079964132923530/latest.html)**