<a href="https://colab.research.google.com/github/ejihoon6065/Project_TurnAround/blob/Hyundai/Bert_Price_Prediction_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro 

When we want to predict next day's (week's or month's even) prices of a certain stock, first thing we do is to get as much as information about a company and 'guess' what it will be likely. This was usually done by hands without much help from using computers in the past. Even if one was used, it did not help much because of limits on resources such as computing power. 

However, as technology is getting better and faster computers are manufactured every second, we began to start utilizing them to help us for predicetion. In this post, I am sharing what I did to predict DJIA's adjusted closing prices with news articles as input features. 

The data used is from [Kaggle Dataset](https://www.kaggle.com/aaron7sun/stocknews), uploaded by Aaron7sun. It has 25 news articles each day from 2008-06-08 to 2016-07-01, total of 1989 days of samples.

There are three csv files but I only used 'Combined_News_DJIA' because I made a model that predicts only with articles.

Common approaches before was just to use RNN, GRU, LSTM or ARIMA models that rely on past values. However, my approach was to use same day's news articles and try to get how much they affect the day's opening value. If it affects positively, the closing will result in higher value.

Since the data is in string format and not numeric, I used pre-trained BERT to convert them into vectors of floating values, which I got from [Mxnet's Model Zoo](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html).

### What is BERT?

BERT is an encoder that given sets of words (or phrases), converts them into appropriate floating values. Unlike word2vec which has fixed value for each word, it can capture significance of a word in a sentence. So for the same word in two different sentences, it can output different values if it has different meaning or impact on them.

As an example, we can look at two sentences.
1. I hate seeing you
2. I hate leaving you

If we are to predict my feeling about you with word2vec, we are forced to make a model only with 'seeing' and 'leaving' because they both contain 'I', 'hate', and 'you' in the same position that the model will not gain much from them. But if we use BERT, it's possible to capture that 'hate seeing' has negative feeling while 'hate leaving' has positive one because 'hate' will then have differert values.

Another example of is to predict a rating of a restaurant. With the sentence 'Bob hates this restaurant', word2vec might have following values.

1. Bob : 3
2. hates : -7
3. this : 0
4. restaurant : 3

If we make a (naive) model that just sums up values, with above numbers and predict if Bob's rating will be positive or negative, we will get a negative rating. But what happens if we change 'hates' to 'dislikes' which has the value of -5. Then the output of the model will be positive with the value of 1. 

If we define a model with word2vec, we would have to consider all kinds of possibility and many different combinations to correctly output a desired result.

This is where BERT differs from word2vec as it has the capability of capturing each word's impact. As the purpose of the post is not about BERT, I will skip the rest of the explanation and will have another in later post.

# Data Preprocess

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:

combined_news_path = '/content/drive/My Drive/Combined_News_DJIA.csv'

news_djia = pd.read_csv(combined_news_path)

In [4]:
news_djia.shape

(1989, 27)

In [5]:
news_djia.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,b'Did the U.S. Prep Georgia for War with Russia?',b'Rice Gives Green Light for Israel to Attack ...,b'Announcing:Class Action Lawsuit on Behalf of...,"b""So---Russia and Georgia are at war and the N...","b""China tells Bush to stay out of other countr...",b'Did World War III start today?',b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,"b""Georgia's move, a mistake of monumental prop...",b'Russia presses deeper into Georgia; U.S. say...,b'Abhinav Bindra wins first ever Individual Ol...,b' U.S. ship heads for Arctic to define territ...,b'Drivers in a Jerusalem taxi station threaten...,b'The French Team is Stunned by Phelps and the...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."


In [6]:
news_djia = news_djia.drop(labels='Label', axis=1)

# Some column values are not in string so convert them
news_djia = news_djia.apply(lambda x: x.map(lambda y: str(y)), axis=1)

# Remove starting b' and b" characters
news_djia = news_djia.apply(lambda x: x.map(lambda y: y.replace('b"', '').replace("b'", '').replace('"', '')), axis=1)

# Set each strings of articles to list of articles for bert_embedding
news_djia.iloc[:, 1:] = news_djia.iloc[:, 1:].apply(lambda x: x.map(lambda y: [y]), axis=1)

# Move Date to Index
news_djia = news_djia.set_index(news_djia.iloc[:, 0]).drop('Date', axis=1)

In [7]:
news_djia.head(2)

Unnamed: 0_level_0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2008-08-08,[Georgia 'downs two Russian warplanes' as coun...,[BREAKING: Musharraf to be impeached.'],[Russia Today: Columns of troops roll into Sou...,[Russian tanks are moving towards the capital ...,"[Afghan children raped with 'impunity,' U.N. o...",[150 Russian tanks have entered South Ossetia ...,"[Breaking: Georgia invades South Ossetia, Russ...",[The 'enemy combatent' trials are nothing but ...,[Georgian troops retreat from S. Osettain capi...,[Did the U.S. Prep Georgia for War with Russia?'],[Rice Gives Green Light for Israel to Attack I...,[Announcing:Class Action Lawsuit on Behalf of ...,[So---Russia and Georgia are at war and the NY...,[China tells Bush to stay out of other countri...,[Did World War III start today?'],[Georgia Invades South Ossetia - if Russia get...,[Al-Qaeda Faces Islamist Backlash'],[Condoleezza Rice: The US would not act to pre...,[This is a busy day: The European Union has a...,"[Georgia will withdraw 1,000 soldiers from Ira...",[Why the Pentagon Thinks Attacking Iran is a B...,[Caucasus in crisis: Georgia invades South Oss...,[Indian shoe manufactory - And again in a ser...,[Visitors Suffering from Mental Illnesses Bann...,[No Help for Mexico's Kidnapping Surge]
2008-08-11,[Why wont America and Nato help us? If they wo...,[Bush puts foot down on Georgian conflict'],[Jewish Georgian minister: Thanks to Israeli t...,[Georgian army flees in disarray as Russians a...,[Olympic opening ceremony fireworks 'faked'],[What were the Mossad with fraudulent New Zeal...,[Russia angered by Israeli military sale to Ge...,[An American citizen living in S.Ossetia blame...,[Welcome To World War IV! Now In High Definiti...,"[Georgia's move, a mistake of monumental propo...",[Russia presses deeper into Georgia; U.S. says...,[Abhinav Bindra wins first ever Individual Oly...,[ U.S. ship heads for Arctic to define territo...,[Drivers in a Jerusalem taxi station threaten ...,[The French Team is Stunned by Phelps and the ...,[Israel and the US behind the Georgian aggress...,"[Do not believe TV, neither Russian nor Georgi...",[Riots are still going on in Montreal (Canada)...,[China to overtake US as largest manufacturer'],[War in South Ossetia [PICS]'],[Israeli Physicians Group Condemns State Tortu...,[ Russia has just beaten the United States ove...,[Perhaps *the* question about the Georgia - Ru...,[Russia is so much better at war'],[So this is what it's come to: trading sex for...


I removed the label column and moved the date values to index. Then I removed starting b' or b" since it is not an actual word that I need.

The reason I converted a string to a list of words is so that BERT will output values for each word.

It is possible some news articles contain non-alphanumeric but I did not preprocess them but doing so will likely improve a model.

You can [download files](https://gluon-nlp.mxnet.io/_downloads/sentence_embedding.zip) necessary to run BERT from [Mxnet BERT page](https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html). Also to run it, you have to install mxnet with pip.

In [13]:
!pip install mxnet

Collecting mxnet
[?25l  Downloading https://files.pythonhosted.org/packages/29/bb/54cbabe428351c06d10903c658878d29ee7026efbe45133fd133598d6eb6/mxnet-1.7.0.post1-py2.py3-none-manylinux2014_x86_64.whl (55.0MB)
[K     |████████████████████████████████| 55.0MB 52kB/s 
Collecting graphviz<0.9.0,>=0.8.1
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: graphviz, mxnet
  Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.8.4 mxnet-1.7.0.post1


In [15]:
!pip install bert

Collecting bert
  Downloading https://files.pythonhosted.org/packages/e8/e6/55ed98ef52b168a38192da1aff7265c640f214009790220664ee3b4cb52a/bert-2.2.0.tar.gz
Collecting erlastic
  Downloading https://files.pythonhosted.org/packages/f3/30/f40d99fe35c38c2e0415b1e746c89569f2483e64ef65d054b9f0f382f234/erlastic-2.0.0.tar.gz
Building wheels for collected packages: bert, erlastic
  Building wheel for bert (setup.py) ... [?25l[?25hdone
  Created wheel for bert: filename=bert-2.2.0-cp36-none-any.whl size=3756 sha256=961fda593a0dc0663ba5e9449d35a2e1c75d664f26d083124fe9f7cd26461582
  Stored in directory: /root/.cache/pip/wheels/fe/71/b7/941459453bd38e5d97a8c886361dee19325e9933c9cf88ad46
  Building wheel for erlastic (setup.py) ... [?25l[?25hdone
  Created wheel for erlastic: filename=erlastic-2.0.0-cp36-none-any.whl size=6789 sha256=3795262de60812756169c5628cfa6bb85640a55ca4342df5f56358fc1d0f5506
  Stored in directory: /root/.cache/pip/wheels/02/62/46/93c713a5f061aeeb4f16eb6bf5ee798816e6ddda70fa

In [17]:
!pip install bert.embedding

Collecting bert.embedding
  Downloading https://files.pythonhosted.org/packages/62/85/e0d56e29a055d8b3ba6da6e52afe404f209453057de95b90c01475c3ff75/bert_embedding-1.0.1-py3-none-any.whl
Collecting typing==3.6.6
  Downloading https://files.pythonhosted.org/packages/4a/bd/eee1157fc2d8514970b345d69cb9975dcd1e42cd7e61146ed841f6e68309/typing-3.6.6-py3-none-any.whl
Collecting gluonnlp==0.6.0
[?25l  Downloading https://files.pythonhosted.org/packages/e2/07/037585c23bccec19ce333b402997d98b09e43cc8d2d86dc810d57249c5ff/gluonnlp-0.6.0.tar.gz (209kB)
[K     |████████████████████████████████| 215kB 10.7MB/s 
[?25hCollecting numpy==1.14.6
[?25l  Downloading https://files.pythonhosted.org/packages/e5/c4/395ebb218053ba44d64935b3729bc88241ec279915e72100c5979db10945/numpy-1.14.6-cp36-cp36m-manylinux1_x86_64.whl (13.8MB)
[K     |████████████████████████████████| 13.8MB 240kB/s 
[?25hCollecting mxnet==1.4.0
[?25l  Downloading https://files.pythonhosted.org/packages/c0/e9/241aadccc4522f99adee5b6043f7

In [11]:
!pip install https://github.com/dmlc/gluon-nlp/tarball/master

Collecting https://github.com/dmlc/gluon-nlp/tarball/master
[?25l  Downloading https://github.com/dmlc/gluon-nlp/tarball/master (892kB)
[K     |████████████████████████████████| 901kB 5.4MB/s 
Collecting sacremoses>=0.0.38
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 5.2MB/s 
[?25hCollecting yacs>=0.1.6
  Downloading https://files.pythonhosted.org/packages/38/4f/fe9a4d472aa867878ce3bb7efb16654c5d63672b86dc0e6e953a67018433/yacs-0.1.8-py3-none-any.whl
Collecting sacrebleu
[?25l  Downloading https://files.pythonhosted.org/packages/a3/c4/8e948f601a4f9609e8b2b58f31966cb13cf17b940b82aa3e767f01c42c52/sacrebleu-1.4.14-py3-none-any.whl (64kB)
[K     |████████████████████████████████| 71kB 9.4MB/s 
[?25hCollecting flake8
[?25l  Downloading https://files.pythonhosted.org/packages/6c/20/6326a9a0c6f0527612bae748c4c03df5cd69cf06dfb

In [32]:
import mxnet as mx

from mxnet import gluon
from bert_embedding import BertEmbedding

# Get GPU
ctx = mx.gpu(0)

# Define a model in GPU for faster training
bert_embedding = BertEmbedding(model='bert_12_768_12', dataset_name='book_corpus_wiki_en_cased')

You can change the model to another and [this page](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html) has parameters for that. Additionally you can change the dataset to a different one. The model I loaded outputs an embedding in the shape of 768, as it can be seen in the name of model. Bigger number will generate bigger features which might boost accuracy of the model so feel free to try different models as well.

Next is the result of passing first two samples into BERT.

In [33]:
example_embedding = news_djia.iloc[:2, :].apply(lambda x: x.map(lambda y: bert_embedding(y)))
example_embedding

Unnamed: 0_level_0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2008-08-08,"[([Georgia, ', downs, two, Russian, warplanes,...","[([BREAKING, :, Musharraf, to, be, impeached, ...","[([Russia, Today, :, Columns, of, troops, roll...","[([Russian, tanks, are, moving, towards, the, ...","[([Afghan, children, raped, with, ', impunity,...","[([150, Russian, tanks, have, entered, South, ...","[([Breaking, :, Georgia, invades, South, Osset...","[([The, ', enemy, combatent, ', trials, are, n...","[([Georgian, troops, retreat, from, S, ., Oset...","[([Did, the, U, ., S, ., Prep, Georgia, for, W...","[([Rice, Gives, Green, Light, for, Israel, to,...","[([Announcing, :, Class, Action, Lawsuit, on, ...","[([So, -, -, -, Russia, and, Georgia, are, at,...","[([China, tells, Bush, to, stay, out, of, othe...","[([Did, World, War, III, start, today, ?, '], ...","[([Georgia, Invades, South, Ossetia, -, if, Ru...","[([Al, -, Qaeda, Faces, Islamist, Backlash, ']...","[([Condoleezza, Rice, :, The, US, would, not, ...","[([This, is, a, busy, day, :, The, European, U...","[([Georgia, will, withdraw, 1, ,, 000, soldier...","[([Why, the, Pentagon, Thinks, Attacking, Iran...","[([Caucasus, in, crisis, :, Georgia, invades, ...","[([Indian, shoe, manufactory, -, And, again, i...","[([Visitors, Suffering, from, Mental, Illnesse...","[([No, Help, for, Mexico, ', s, Kidnapping, Su..."
2008-08-11,"[([Why, wont, America, and, Nato, help, us, ?,...","[([Bush, puts, foot, down, on, Georgian, confl...","[([Jewish, Georgian, minister, :, Thanks, to, ...","[([Georgian, army, flees, in, disarray, as, Ru...","[([Olympic, opening, ceremony, fireworks, ', f...","[([What, were, the, Mossad, with, fraudulent, ...","[([Russia, angered, by, Israeli, military, sal...","[([An, American, citizen, living, in, S, ., Os...","[([Welcome, To, World, War, IV, !, Now, In, Hi...","[([Georgia, ', s, move, ,, a, mistake, of, mon...","[([Russia, presses, deeper, into, Georgia, ;, ...","[([Abhinav, Bindra, wins, first, ever, Individ...","[([U, ., S, ., ship, heads, for, Arctic, to, d...","[([Drivers, in, a, Jerusalem, taxi, station, t...","[([The, French, Team, is, Stunned, by, Phelps,...","[([Israel, and, the, US, behind, the, Georgian...","[([Do, not, believe, TV, ,, neither, Russian, ...","[([Riots, are, still, going, on, in, Montreal,...","[([China, to, overtake, US, as, largest, manuf...","[([War, in, South, Ossetia, [, PICS, ], '], [[...","[([Israeli, Physicians, Group, Condemns, State...","[([Russia, has, just, beaten, the, United, Sta...","[([Perhaps, *, the, *, question, about, the, G...","[([Russia, is, so, much, better, at, war, '], ...","[([So, this, is, what, it, ', s, come, to, :, ..."


The output of bert_embedding is a tuple whose first entry is words and second is the floating values corresponding to each of them. Since I did not need any string values, I extracted numeric values by doing next.

In [34]:
import numpy as np

In [35]:
def extract_features(x):
    
    # Compact code
    # return np.array(x[0][1]).sum(axis=0)
    
    features = np.array(x[0][1])
    features = features.sum(axis=0)
    
    return features

In [36]:
example_embedding = example_embedding.apply(lambda x: x.map(extract_features))
example_embedding

Unnamed: 0_level_0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2008-08-08,"[3.4343593, 0.89766926, -2.2731159, 2.6207273,...","[1.5841044, 0.3651707, -2.7459157, -0.7592665,...","[1.6664425, 0.78698075, -1.8210568, 0.19420537...","[1.2475847, -1.6762671, -4.9052677, -1.2343652...","[3.7802272, -0.58494264, -4.4265456, -2.084756...","[6.2831354, -2.1698499, -4.3652363, -0.2826851...","[2.5234256, -4.296561, -1.4806514, -0.82675654...","[4.6181684, 1.6290157, -2.4697719, 1.2765439, ...","[3.6747978, -4.586723, -3.1463275, -2.662077, ...","[2.273276, -6.1183367, -3.834559, -3.4150221, ...","[2.498811, -4.485126, -4.4073887, -2.6550674, ...","[-1.7069225, -1.0635066, -3.5754585, 2.8024619...","[-1.2736461, -2.8189285, -4.58098, 4.570773, 7...","[-0.5036616, -0.7411705, -1.5258787, -1.235189...","[0.7533729, -0.8578297, 0.083378404, 0.7738523...","[4.8692017, -3.098872, -3.3744936, 1.1512774, ...","[0.67757106, 0.40796664, -1.2753848, -0.430121...","[3.9853826, -1.947959, -5.3694806, -2.9952905,...","[3.5076091, 2.4318266, -0.8598722, 1.729223, 2...","[3.2756999, -2.0283217, -3.9403336, -5.527244,...","[2.0910888, 1.1763185, -6.940914, -1.0599109, ...","[1.0531394, -1.4839482, -0.2102995, -0.0948405...","[5.1034346, -1.7355354, -7.810959, 1.496744, 6...","[0.9602088, 3.2546334, -2.083721, -0.19298653,...","[0.9675671, -0.0868776, -1.7552404, 1.6409774,..."
2008-08-11,"[4.3726134, -2.7928586, -5.8791585, 6.665765, ...","[1.5314145, 0.33340392, -0.26014385, 0.1285088...","[1.4045095, -3.7549505, -3.26506, -0.55188835,...","[1.3028452, -7.201991, -1.9817094, -1.7650374,...","[0.38248762, 1.9304385, -0.9879122, -0.0849634...","[2.7923086, -1.367171, -0.4131552, -2.2261844,...","[2.8151383, -2.6929421, -3.1940496, -0.9433340...","[4.82897, -7.0465093, 0.57416075, -2.9042096, ...","[-0.70016146, 4.2033854, 0.78425217, 3.064478,...","[1.7868799, 0.4861898, -0.59401107, 1.1086818,...","[3.6118486, -0.64954025, 1.7564191, -1.8382015...","[-1.2037139, -2.2116253, -5.0030403, 4.700337,...","[2.426506, -3.475122, -0.7171306, -2.72811, 1....","[1.2772789, -1.779817, 1.5488563, 0.042267412,...","[1.131243, -0.8605994, -1.6347471, -1.7167449,...","[2.896379, -1.5717525, -2.8346763, -1.608649, ...","[7.5717945, -2.2538946, 0.51045245, 0.16108486...","[3.0750847, 1.9874792, -1.6867796, 4.9326515, ...","[1.0883623, 1.6837845, -1.1653569, -0.7572401,...","[1.1617029, -0.7679175, -1.4622602, -1.3730854...","[0.30732304, -0.46207255, -1.5561316, -0.21302...","[-1.5455484, -0.837101, -1.9839878, 0.3174597,...","[1.9832282, -0.8678921, -6.023197, 1.49993, -1...","[0.65908015, -1.0287197, -2.9924257, 0.0656182...","[1.1136361, 3.6621566, -0.060658894, 4.421232,..."


With the function above, now I have a dataframe with 25 columns of numeric values. Same thing was applied to the whole dataset. 

Using BERT model on CPU took more than an hour so I had to use on Google Clout Platform with one Tesla v4 which still took about 30 minutes.

In [None]:
news_embedding = news_djia.apply(lambda x: x.map(lambda y: bert_embedding(y)))

# Remove word and only keep numeric vectors
news_embedding = news_embedding.apply(lambda x: x.map(extract_features))

After that, I aggregated all columns into one.

In [None]:
news_embedding['combined'] = news_embedding.values.tolist()

news_embedding = news_embedding[['combined']]

news_embedding.head()

Each article differs in the number of words that the shape of each embedding is also different. So I cannot just put them into a model because then it will have to have flexible input size. 

Instead, by using min, max, sum and mean over each data sample's embedding element-wise, I extracted extreme values. For example by using max, it will take the strongest features among others. 

In [None]:
min_embedding = news_embedding['combined'].map(lambda x: np.min(x, axis=0)).to_frame()
max_embedding = news_embedding['combined'].map(lambda x: np.max(x, axis=0)).to_frame()
sum_embedding = news_embedding['combined'].map(lambda x: np.sum(x, axis=0)).to_frame()
mean_embedding = news_embedding['combined'].map(lambda x: np.mean(x, axis=0)).to_frame()

In [None]:
mean_embedding.head(2)

In [None]:
# Save them for easier access later
path = 'embedding_files/'

min_embedding.to_json(path+'min_embedding.json')
max_embedding.to_json(path+'max_embedding.json')
sum_embedding.to_json(path+'sum_embedding.json')
mean_embedding.to_json(path+'mean_embedding.json')

I had to make a different post for actual model implementation because putting all together was too long for one. You can find it [here]()