# Introduction

Predicting financial indicators is a holy grail for our society at its present stage. There is a vast literature on how to do this, and the general approach is a time-series one: predict the future of a quantity based on that quantity's own past.

We want to see whether this approach can be complemented with data coming from news sources, reasoning that world news should weigh, directly or indirectly, on the performance of indicators such as stocks, the employment rate, or inflation.

Please keep in mind that we do not expect to make any significant improvement over state-of-the-art financial analyses (which involve much more complex and refined models). Rather, we are interested in building a scalable and dynamic pipeline that in the future might supplement those already-existing models or give interesting insights.

### This notebook

This is a walkthrough illustrating the typical usage of our package. We will try to predict future S&P500 closing values based on past S&P500 values along with NLP features extracted from the daily-updated GDELT 1.0 (http://www.gdeltproject.org/) event database.

In particular, to scope the analysis down to a minimally viable scalable pipeline, I extract features from the source urls contained in the database (one associated with each event).

For each day, all urls get parsed, tokenized, and stemmed, and then merged into a single bag of words, which constitutes one document. After that I may apply a tf-idf or word2vec vectorization (the latter being much preferred).
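The url-to-document step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `STOP_WORDS`, `crude_stem`, `url_to_tokens`, and `day_document` are hypothetical names, and the suffix stripper is only a stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, tiny stop-word list; the real one is much larger.
STOP_WORDS = {"the", "a", "to", "of", "and", "in", "on", "with", "html", "index", "php"}

def crude_stem(token):
    """Stand-in for a real stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def url_to_tokens(url):
    """Split the path of a url into lowercase, stemmed word tokens."""
    path = urlparse(url).path.lower()
    raw = re.findall(r"[a-z]+", path)  # letters only: digits and punctuation are dropped
    return [crude_stem(t) for t in raw if len(t) > 1 and t not in STOP_WORDS]

def day_document(urls):
    """Merge all urls of one day into a single bag of words (token -> count)."""
    bag = Counter()
    for url in urls:
        bag.update(url_to_tokens(url))
    return bag

urls = [
    "http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale",
    "http://www.whio.com/news/national-govt--politics/clinton-reaches-out-women-while-trump-defends-taxes/xnmN5QugmLzeGkEBR64y9I/",
]
doc = day_document(urls)
print(doc.most_common(5))
```

Note that opaque path fragments (article ids, hashes) survive as junk tokens; a frequency cutoff in the vectorizer is one way to keep them from mattering.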

I use the extracted features (plus the same day's closing S&P500) to fit various regression models that predict the next day's S&P500, and I compare them to a benchmark: a simple naive model that predicts tomorrow's value as today's value plus the average increase or decrease over the last few days.
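As a rough illustration of the benchmark just described, here is a minimal sketch of a naive forecast with drift. The function name `naive_drift_forecast` and the `window` parameter are assumptions made for this example; the package may compute its benchmark differently.

```python
import numpy as np

def naive_drift_forecast(closes, window=5):
    """Predict tomorrow's close as today's close plus the mean daily change
    over the last `window` days."""
    closes = np.asarray(closes, dtype=float)
    if len(closes) < 2:
        return float(closes[-1])  # no history to estimate a drift from
    changes = np.diff(closes)[-window:]  # most recent daily changes
    return float(closes[-1] + changes.mean())

closes = [2100.0, 2104.0, 2102.0, 2108.0, 2110.0]
forecast = naive_drift_forecast(closes, window=3)
print(forecast)  # 2110 plus the mean of (-2, 6, 2), i.e. 2112.0
```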

I also try to predict if tomorrow's index value will rise or fall, given today's news.

For both tasks, random forest regressors/classifiers seem to be promising approaches.

In [1]:
import importlib
import sys
import os
sourcedir=os.getcwd()+"/../source"
if sourcedir not in sys.path:
    sys.path.append(sourcedir)
import numpy as np

In [12]:
#importing our nlp preprocessing module, the reload command is for development
import nlp_preprocessing as nlpp
importlib.reload(nlpp)
#importing our model training module, the reload command is for development
import model_training as mdlt
importlib.reload(mdlt)

<module 'model_training' from '/Users/Maxos/Desktop/Insight_stuff/bigsnippyrepo/maqro/notebooks/../source/model_training.py'>

## The nlp-preprocessing module

The module has two classes for now: one deals with the nlp preprocessing of Google News articles, which is discussed in much more depth in another notebook; the other is its analog for GDELT url data.

Let's explore these classes and their contents.

### The CorpusGoogleNews class

In [13]:
#del datagnews
datagnews=nlpp.CorpusGoogleNews() # nlpp.CorpusGoogleNews('some/data/directory') 

These are the attributes of the initialized class

In [4]:
datagnews.raw_articles

{}

In [5]:
datagnews.datadirectory

'../data/'

There is one public method for now: it loads files from the data folder

In [6]:
datagnews.data_directory_crawl('AAPL',verbose=1)

Apple Inc
Apple Inc 1-26-17
Apple Inc 1-27-17
Apple Inc 1-30-17
Apple Inc 1-31-17
Apple Inc 2-1-17


which populates datagnews.raw_articles with dataframes like this:

In [7]:
datagnews.raw_articles['Apple Inc 1-30-17'].head()

Unnamed: 0,body,category,title
0,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
1,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
2,"The smart home market continues to heat up, an...",Apple Inc,Alphabet Inc (GOOGL) Steals AI Expert Back Fro...
3,"Reportedly, Apple Inc.’s AAPL management is sc...",Apple Inc,Apple (AAPL) Set to Meet Government Officials ...
4,Apple Inc. (AAPL) executives were in India tod...,Apple Inc,Apple Close to Signing Deal With Indian Govern...


### The CorpusGDELT class

Let's initialize the class

In [14]:
#del datagdelt
datagdelt=nlpp.CorpusGDELT(min_ment=500) # min_ment defaults to 1 and cuts off events that have a low number of mentions

Let's have a look at the several attributes that the class contains.

In [4]:
#minimum number of mentions for one event to be used
print('Minimum number of mentions:',datagdelt.minimum_ment)
print('Current directory:',datagdelt.currentdir) # current directory
print('Dates loaded so far:',datagdelt.dates) # dates for which data has been loaded so far
print('Corpus of raw urls',datagdelt.url_corpus)
print('Corpus of tfidf-vectorized docs:')
print(datagdelt.vect_corpus_tfidf)

Minimum number of mentions: 500
Current directory: ../data/GDELT_1.0/
Dates loaded so far: []
Corpus of raw urls []
Corpus of tfidf-vectorized docs:
Empty DataFrame
Columns: []
Index: []


In [10]:
#vowels and consonants
print('Vowels:',datagdelt.vowels)
print('Consonants:',datagdelt.consonants,end=' ')
print()
print('Stemmer:',datagdelt.porter) #stemmer of choice
print('Punctuation:',datagdelt.punctuation) #punctuation regular expression
print('Tokenizer:',datagdelt.re_tokenizer) 
print('Filter for spurious url beginnings:',datagdelt.spurious_beginnings)
print('Filter for stop words:',datagdelt.stop_words)

Vowels: ['a', 'e', 'i', 'o', 'u', 'y']
Consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'] 
Stemmer: <PorterStemmer>
Punctuation: re.compile('[-.?!,":;()|0-9]')
Tokenizer: RegexpTokenizer(pattern='\\w+', gaps=False, discard_empty=True, flags=56)
Filter for spurious url beginnings: re.compile('idind.|idus.|iduk.')
Filter for stop words: {'', 'had', 'what', 'these', 'why', 'y', 'over', 'shouldn', 'is', 'himself', 'him', 'won', 'll', 'itself', 'should', 'now', 'herself', 'am', 'too', 'while', 'ourselves', 'here', 'that', 'there', 'some', 'mustn', 'being', 'yourself', 'been', 'by', 'themselves', 'how', 'weren', 'nor', 'each', 'aren', 'between', 'isn', 'yourselves', 'me', 'i', 'once', 'doesn', 'a', 'with', 'your', 'does', 'up', 'or', 'has', 're', 'whom', 'couldn', 'you', 'yours', 'mightn', 'and', 'myself', 'd', 'from', 'more', 'through', 'again', 'if', 'then', 'its', 'other', 'ma', 'for', 'but', 'so', 'hadn', 'do', 'such', 'only',

In [11]:
print(datagdelt.header,end=' ') #GDELT csv files header, notice the last field has the urls

['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_FullName', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_FullName', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_Lat', 

Now let's see what methods are available and what the pipeline is like.

First we load the urls.

In [None]:
datagdelt.load_urls('20161001','20170217') #the earliest available date is April 1st 2013 = 20130401

 loading news for 20161001 ...

Now let's look at what the url_corpus attribute looks like

In [10]:
day=5 #select one day
print('There are',len(datagdelt.url_corpus),'elements in it, because we loaded',len(datagdelt.dates),'days!')
print('The loaded day n.',day,'had',len(datagdelt.url_corpus[day-1]) ,'events in it that were mentioned more than',datagdelt.minimum_ment,'times:\n', datagdelt.url_corpus[day-1][:10],'\n etc...')
print('The first event was mentioned',datagdelt.url_corpus[day-1][0][0],'times, the second',datagdelt.url_corpus[day-1][1][0],'times, etc...')

There are 140 elements in it, because we loaded 140 days!
The loaded day n. 5 had 653 events in it that were mentioned more than 500 times:
 [[1972, 'http://www.philippinetimes.com/index.php/sid/248243461'], [1115, 'http://www.capradio.org/news/npr/story?storyid=496552413'], [970, 'http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale'], [660, 'http://www.princegeorgecitizen.com/celebrity-chef-jamie-oliver-hopes-to-discuss-child-health-issues-with-trudeau-1.2358050'], [748, 'http://1045snx.iheart.com/articles/trending-104650/spencer-pratt-mocks-kim-kardashian-robbery-15169059/'], [746, 'https://in.news.yahoo.com/may-woo-labour-voters-pitch-britains-center-ground-230645420.html'], [754, 'http://www.stuff.co.nz/entertainment/84945059/Veteran-broadcaster-Mark-Sainsbury-takes-Rocky-Horror-Stage-in-Hamilton'], [1951, 'http://www.whio.com/news/national-govt--politics/clinton-reaches-out-women-while-trump-defends-taxes/xnmN5QugmLzeGkEBR64y9I/'], [1965, 'http://wgno.com/2016/10/0

We see that many of those urls contain wording that can be very informative about what's happening in the world, and therefore might tell us something about the near future of the markets!

Now, let's process these messy raw urls! Let's use word2vec:

In [11]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=8)

Using word2vec vectorization procedure


2017-02-20 02:35:42,272 : INFO : collecting all words and their counts
2017-02-20 02:35:42,273 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-02-20 02:35:42,366 : INFO : collected 20347 word types from a corpus of 482406 raw words and 140 sentences
2017-02-20 02:35:42,367 : INFO : Loading a fresh vocabulary
2017-02-20 02:35:42,450 : INFO : min_count=1 retains 20347 unique words (100% of original 20347, drops 0)
2017-02-20 02:35:42,451 : INFO : min_count=1 leaves 482406 word corpus (100% of original 482406, drops 0)
2017-02-20 02:35:42,550 : INFO : deleting the raw counts dictionary of 20347 items
2017-02-20 02:35:42,551 : INFO : sample=0.001 downsamples 24 most-common words
2017-02-20 02:35:42,552 : INFO : downsampling leaves estimated 458000 word corpus (94.9% of prior 482406)
2017-02-20 02:35:42,553 : INFO : estimated required memory for 20347 words and 8 dimensions: 11475708 bytes
2017-02-20 02:35:42,635 : INFO : resetting layer weights
2017-02-20 02

TypeError: sort_index() got an unexpected keyword argument 'nplace'

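The TypeError above comes from a misspelled keyword: sort_index() takes 'inplace', not 'nplace'. When the preprocessing completes, it gives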

In [86]:
datagdelt.word2vec_corpus.head(10)

news_date,w2v_1,w2v_10,w2v_11,w2v_12,w2v_13,w2v_14,w2v_15,w2v_16,w2v_17,w2v_18,...,w2v_44,w2v_45,w2v_46,w2v_47,w2v_48,w2v_5,w2v_6,w2v_7,w2v_8,w2v_9
20161001,-0.106196,0.107095,0.107668,-0.033903,-0.024161,0.115558,-0.060396,0.13537,-0.011432,0.0789,...,-0.261782,0.156517,-0.246079,0.161685,-0.111064,0.155389,-0.168112,0.115281,0.019051,-0.047699
20161002,-0.113771,0.100264,0.105713,-0.024083,-0.024964,0.131052,-0.062632,0.127284,-0.023864,0.073769,...,-0.260302,0.149023,-0.250064,0.147783,-0.113185,0.15727,-0.172353,0.103838,0.018536,-0.048057
20161003,-0.11054,0.100292,0.114567,-0.018727,-0.024265,0.122546,-0.057136,0.129817,-0.025815,0.086951,...,-0.255653,0.15575,-0.249969,0.144981,-0.115265,0.157248,-0.168569,0.1052,0.029849,-0.042455
20161004,-0.11505,0.094899,0.108515,-0.012831,-0.017209,0.125177,-0.057109,0.124576,-0.02378,0.073699,...,-0.259164,0.146644,-0.257893,0.140503,-0.113885,0.158432,-0.172649,0.108129,0.024723,-0.047769
20161005,-0.108108,0.099156,0.115526,-0.013976,-0.012861,0.104003,-0.046854,0.130387,-0.009976,0.076836,...,-0.267575,0.144755,-0.258512,0.158576,-0.107521,0.154613,-0.162068,0.125842,0.022172,-0.051734
20161006,-0.112574,0.104572,0.097706,-0.036413,-0.039222,0.141862,-0.073521,0.124061,-0.025292,0.072915,...,-0.249048,0.158949,-0.245712,0.142401,-0.117084,0.156357,-0.183324,0.093742,0.020404,-0.044839
20161007,-0.10872,0.099964,0.110017,-0.021059,-0.025717,0.119847,-0.05755,0.123723,-0.017548,0.076016,...,-0.259163,0.149232,-0.255364,0.146495,-0.111227,0.156205,-0.17269,0.110451,0.024106,-0.048494
20161008,-0.132082,0.089863,0.082826,-0.019553,-0.003057,0.159752,-0.072429,0.12146,-0.032843,0.040385,...,-0.251844,0.141894,-0.255263,0.121864,-0.115007,0.155791,-0.190899,0.097681,0.011721,-0.046504
20161009,-0.096938,0.110626,0.128477,-0.029676,-0.035534,0.100564,-0.051367,0.138368,-0.01816,0.103708,...,-0.256257,0.159699,-0.238598,0.160712,-0.112247,0.154521,-0.158433,0.111544,0.035158,-0.038087
20161010,-0.137194,0.081542,0.075596,-0.013007,0.013085,0.170506,-0.074553,0.115688,-0.036748,0.022193,...,-0.252187,0.13041,-0.256965,0.108889,-0.111273,0.153195,-0.194263,0.097542,0.00348,-0.04916


BOOM! Now we have all of our datapoints with their nlp features neatly arranged in a pandas dataframe, ready for processing. Mission accomplished!

If we try to run this expensive preprocessing again on the same exact data...

In [16]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=24)

Using word2vec vectorization procedure
Nothing to be done, dataframes are up to date


Yay for savings!

Now we initialize the model training class, feeding it the dataframe from the nlp processing

## The model training module
This section covers model training, validation, and testing with our model_training module.

We initialize a class instance by loading into it a list of three lists: names of your choosing, the corresponding dataframes (in this case the output from the previous module above, datagdelt.word2vec_corpus), and the corresponding word2vec models (datagdelt.w2vec_model).

In [110]:
import model_training as mdlt
importlib.reload(mdlt)
tet=mdlt.StockPrediction([['word2vec'],[datagdelt.word2vec_corpus],[datagdelt.w2vec_model]],update=True)

Let's try an L1-regularized linear regressor (lasso) which is trying to predict the increase/decrease of tomorrow's S&P index over today's. With the arguments below, we test on the last 15 days out of 60 (n_folds_test=15, past_depth=60) and validate/tune, for every testing case, over the previous 15 days (n_folds_val=15). As for the hyperparameters, we are letting our regularization parameter alpha be searched for in the 0.001-7000 range and we allow for 40 iterations of the optimal parameter search (parm_search_iter=40).
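The test-then-tune scheme described above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the package's implementation: `walk_forward_errors`, the fixed alpha grid, and the synthetic data are all made up for illustration, and the package's own parameter search may work differently.

```python
import numpy as np
from sklearn.linear_model import Lasso

def walk_forward_errors(X, y, alphas, n_test=5, n_val=5):
    """For each of the last n_test days: tune alpha on the n_val days before it,
    refit on all earlier days, and record the absolute error on that day."""
    errors = []
    for t in range(len(y) - n_test, len(y)):
        best_alpha, best_rmse = None, np.inf
        for alpha in alphas:
            sq_err = []
            for v in range(t - n_val, t):  # one-step-ahead validation days
                model = Lasso(alpha=alpha, max_iter=10000).fit(X[:v], y[:v])
                sq_err.append((model.predict(X[v:v + 1])[0] - y[v]) ** 2)
            rmse = np.sqrt(np.mean(sq_err))
            if rmse < best_rmse:
                best_alpha, best_rmse = alpha, rmse
        # Only data strictly before day t is used to predict day t.
        model = Lasso(alpha=best_alpha, max_iter=10000).fit(X[:t], y[:t])
        errors.append(abs(model.predict(X[t:t + 1])[0] - y[t]))
    return float(np.mean(errors)), float(np.std(errors))

# Synthetic stand-in for the daily feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=60)

mean_err, std_err = walk_forward_errors(X, y, alphas=[0.01, 0.1, 1.0])
print(mean_err, std_err)
```

The point of the nested loops is that no test day ever influences the alpha chosen for it, which matters a lot with time-ordered data.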

In [112]:
tet.auto_ts_val_test_reg('word2vec','lasso',[['alpha',[0.001,7000.,60.]]],parm_search_iter=40,n_folds_val=15,
                         past_depth=60,n_folds_test=15,scaling=False,differential=False,notest=False,verbose=False)



best parameter choices: (72.01745061728414,)
model_test_rmse: 12.977 benchmark_test_rmse: 15.351
best parameter choices: (129.63061111111085,)
model_test_rmse: 0.160 benchmark_test_rmse: 3.403
best parameter choices: (110.4262242798356,)
model_test_rmse: 2.406 benchmark_test_rmse: 0.653
best parameter choices: (129.63061111111085,)
model_test_rmse: 3.403 benchmark_test_rmse: 0.025
best parameter choices: (129.63061111111085,)
model_test_rmse: 18.666 benchmark_test_rmse: 15.245
best parameter choices: (95.87970837261244,)
model_test_rmse: 3.402 benchmark_test_rmse: 6.362
best parameter choices: (110.4262242798356,)
model_test_rmse: 2.292 benchmark_test_rmse: 0.909
best parameter choices: (95.87970837261244,)
model_test_rmse: 3.009 benchmark_test_rmse: 0.172
best parameter choices: (33.60867695473269,)
model_test_rmse: 13.207 benchmark_test_rmse: 11.780
best parameter choices: (33.60867695473269,)
model_test_rmse: 8.015 benchmark_test_rmse: 6.679
best parameter choices: (33.6086769547326

[(6.5635096275580205, 5.310320686182215),
 (6.3036606676414824, 5.1252792112507803)]

The performance is not too bad. The coefficients are:

In [113]:
feat_imp=tet.models['word2vec'].coef_
feat_imp
#these are the coefficients of the lasso regressor

array([  -0.        ,    0.        ,    0.        ,    0.        ,
          0.        ,    0.        ,    0.        ,   -0.        ,
         -0.        ,    0.        ,    0.        ,    0.        ,
          0.        ,   -0.        , -382.3833822 ,   -0.        ,
        486.55262121,   -0.        ,  431.09244711,  -55.86947521,
          0.        ,    0.        ,   -0.        ,   -0.        ,
        129.62534241,   -0.        ,    0.        , -354.77565051,
         -0.        ,    0.        ,   -0.        ,  395.14775662,
        144.82587492,  500.85960297,    0.        ,   -0.        ,
          0.        ,    0.        ,    0.        ,    0.        ,
       -755.93900631,    0.        , -154.62738738,   -0.        ,
         -0.        ,  -83.94125043,    0.        ,   -0.        ,
          1.70947899,    0.98668189])

...which isn't surprising. As we said at the beginning, the most important feature should have been today's closing, and it was, entirely overshadowing everything else.

If one had gotten a reasonable result, one might want to play with the feature importances to see which stems were actually the most significant. It can be done as follows:
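To make the idea concrete, here is a minimal sketch of ranking stems by cosine similarity to a coefficient vector. It assumes the nlp features live in the same space as the word vectors, which is an assumption about the pipeline; `most_similar`, the toy vocabulary, and the toy embeddings are made up for illustration.

```python
import numpy as np

def most_similar(vector, vocab, embeddings, topn=3):
    """Rank vocabulary entries by cosine similarity between their embedding
    and the given vector."""
    vector = np.asarray(vector, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vector)
    sims = embeddings @ vector / norms
    order = np.argsort(-sims)[:topn]  # indices of the highest similarities
    return [(vocab[i], float(sims[i])) for i in order]

# Toy 2-dimensional embeddings for three stems.
vocab = ["stock", "holiday", "factori"]
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

coef = np.array([2.0, 0.1])  # stand-in for the nlp part of the coefficients
ranked = most_similar(coef, vocab, embeddings)
print(ranked)
```

Similarities close to zero for every stem would suggest the coefficient vector does not point toward any particular word.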

In [114]:
model=tet.w2v_models['word2vec']
model.similar_by_word('appl')

[('buri', 0.999561607837677),
 ('light', 0.9994657635688782),
 ('ford', 0.9994151592254639),
 ('factori', 0.9993614554405212),
 ('let', 0.9993431568145752),
 ('holiday', 0.9993090629577637),
 ('draw', 0.9993025064468384),
 ('stock', 0.9992968440055847),
 ('seri', 0.9992875456809998),
 ('might', 0.9991116523742676)]

In [117]:
model.wv.similar_by_vector(feat_imp[:-2])

[('fazl', 0.36830952763557434),
 ('oahu', 0.3590974807739258),
 ('tyrrel', 0.3559659421443939),
 ('barnstorm', 0.3421967625617981),
 ('predecessor', 0.3185845613479614),
 ('fleme', 0.31192827224731445),
 ('silli', 0.30939996242523193),
 ('gsiqdp', 0.30907684564590454),
 ('ruddock', 0.3047294020652771),
 ('tumult', 0.3030698299407959)]

Hmmm... Not sure what to make of this

How about we try a random forest regressor instead? We are letting our tuning select any combination among 3 values for the number of estimators (5 to 7), 7 values for the maximum number of features used for splitting, and we allow a maximum depth from 4 to 9.
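Picking hyperparameter combinations from small sets of values can be sketched with scikit-learn's `ParameterSampler`. This is only a sketch: whether the package samples combinations at random or in some other way is not stated here, the mapping of `n_estim`, `max_feat`, and `max_depth` onto scikit-learn's `n_estimators`, `max_features`, and `max_depth` is an assumption, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterSampler

# Candidate values; max_features is kept small because the toy data has 4 columns.
param_space = {
    "n_estimators": [5, 6, 7],
    "max_features": [2, 3, 4],
    "max_depth": [4, 5, 6, 7, 8, 9],
}

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=80)
# Time-ordered split: earlier rows for training, later rows for validation.
X_train, y_train, X_val, y_val = X[:60], y[:60], X[60:], y[60:]

best_params, best_rmse = None, np.inf
for params in ParameterSampler(param_space, n_iter=5, random_state=0):
    model = RandomForestRegressor(random_state=0, **params).fit(X_train, y_train)
    rmse = float(np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2)))
    if rmse < best_rmse:
        best_params, best_rmse = params, rmse

print(best_params, best_rmse)
```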

In [101]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{5,6,7}],['max_feat',{26,27,35,37,40,45,48}],
                                             ['max_depth',{4,5,6,7,8,9}]],parm_search_iter=1,n_folds_val=15,
                         past_depth=6,n_folds_test=20,scaling=True,differential=True,verbose=False,notest=False)

model_test_rmse: 7.928 benchmark_test_rmse: 7.466
model_test_rmse: 13.311 benchmark_test_rmse: 13.611
model_test_rmse: 20.861 benchmark_test_rmse: 16.867
model_test_rmse: 4.195 benchmark_test_rmse: 3.337
model_test_rmse: 0.348 benchmark_test_rmse: 3.595
model_test_rmse: 12.230 benchmark_test_rmse: 15.351
model_test_rmse: 1.193 benchmark_test_rmse: 3.403
model_test_rmse: 4.395 benchmark_test_rmse: 0.653
model_test_rmse: 2.959 benchmark_test_rmse: 0.025
model_test_rmse: 18.708 benchmark_test_rmse: 15.245
model_test_rmse: 8.365 benchmark_test_rmse: 6.362
model_test_rmse: 3.341 benchmark_test_rmse: 0.909
model_test_rmse: 0.266 benchmark_test_rmse: 0.172
model_test_rmse: 11.297 benchmark_test_rmse: 11.780
model_test_rmse: 3.998 benchmark_test_rmse: 6.679
model_test_rmse: 10.709 benchmark_test_rmse: 10.525
model_test_rmse: 8.110 benchmark_test_rmse: 7.591
model_test_rmse: 5.400 benchmark_test_rmse: 9.849
model_test_rmse: 9.583 benchmark_test_rmse: 3.955
model_test_rmse: 0.906 benchmark_test_

[(7.4050806437339549, 5.71111290802365),
 (6.9715473175452187, 5.3271630939674264)]

Now, if we want to get a prediction for today, we set the 'notest' argument to True:

In [119]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{5,6,7}],['max_feat',{26,27,35,37,40,45,48}],
                                             ['max_depth',{4,5,6,7,8,9}]],parm_search_iter=1,n_folds_val=15,
                         past_depth=6,n_folds_test=20,scaling=True,differential=True,verbose=False,notest=True)

I can't make a prediction: either tomorrow the stock market will be closed or I don't yet have the news of today. Come back after 6am EST


In [103]:
feat_imp=tet.models['word2vec'].feature_importances_
feat_imp
#these are the feature importances for the random forest regressor

array([  1.86560304e-02,   6.47340680e-02,   5.04366059e-03,
         4.82489746e-02,   2.60912845e-02,   6.21046423e-03,
         1.83707279e-03,   1.73020912e-02,   5.95717567e-02,
         2.61237858e-02,   7.94298900e-04,   5.88512680e-02,
         2.66449214e-02,   3.90272604e-02,   5.99184367e-02,
         3.76074582e-05,   5.71304649e-03,   2.77368604e-02,
         4.56659497e-03,   2.76483038e-02,   6.08056934e-03,
         9.11762961e-03,   4.30781150e-02,   1.25883151e-04,
         3.42150078e-03,   1.04673323e-03,   7.51815745e-03,
         1.65745414e-02,   1.89179524e-02,   8.84814693e-03,
         4.26913387e-03,   1.41589770e-02,   3.70210519e-03,
         7.26372796e-03,   5.41131341e-03,   2.14004773e-02,
         1.06000923e-02,   4.32772542e-03,   4.30099701e-03,
         2.96269348e-02,   2.25382335e-02,   2.06082933e-03,
         8.70850127e-03,   3.85983915e-02,   0.00000000e+00,
         6.05505838e-04,   1.66542643e-02,   9.59456167e-03,
         3.41739488e-04,

In [82]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{1,2,3,5,7}],['max_feat',{21,22,23,24}],['max_depth',{5,6,7}]],
                         parm_search_iter=1,n_folds_val=6,n_folds_test=20,scaling=True,differential=True,verbose=False,
                         notest=True)

I can't predict for tomorrow, because the stock market will be closed


In [88]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{1,2,3,4}]],parm_search_iter=4,n_folds_val=6,n_folds_test=10,
                         scaling=True,differential=True,notest=False,verbose=False)

model_test_rmse: 3.019775 flat_test_rmse: 7.40724752174
model_test_rmse: 8.410034 flat_test_rmse: 1.71873020833
model_test_rmse: 6.6198125 flat_test_rmse: 0.580156
model_test_rmse: 17.930257 flat_test_rmse: 11.0525086923
model_test_rmse: 11.505005 flat_test_rmse: 5.67294251852
model_test_rmse: 8.880004 flat_test_rmse: 9.39025842857
model_test_rmse: 3.10017866667 flat_test_rmse: 6.24663241379
model_test_rmse: 0.929932 flat_test_rmse: 8.37825533333
model_test_rmse: 12.530029 flat_test_rmse: 5.59196196774
model_test_rmse: 3.84002725 flat_test_rmse: 0.55275684375


[(7.6765054416666727, 4.993188530339169),
 (5.6591449928088533, 3.4814802333665931)]

In [83]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{1,2,3,4}]],parm_search_iter=4,n_folds_val=6,n_folds_test=10,
                         scaling=True,differential=True,notest=True,verbose=False)

I can't predict for tomorrow, because the stock market will be closed


In [46]:
tet.auto_ts_val_test_class('word2vec','logreg',[['l1orl2?',{'l1',}],
                                                ['C',[0.0000000000001,0.001,0.000001]]],
                           parm_search_iter=30,n_folds_val=10,n_folds_test=10,past_depth=15,scaling=False,notest=False,
                           verbose=False)

test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]


[(array([[ 0.2,  1. ,  0.2]]), array([[ 0.4,  0. ,  0.4]])),
 (array([[ 0.8,  0.9,  0.7]]),
  array([[ 0.4       ,  0.3       ,  0.45825757]]))]

Let's see if classifying tomorrow's value as going up or down will do us any better...
N.B. We need to specify a decision threshold, which I recommend leaving at 0.5 for now.
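For reference, here is a minimal sketch of how a decision threshold turns predicted probabilities into up/down calls, and how recall, precision, and F1 follow from them. The function `up_down_metrics` and the toy numbers are made up for illustration; returning 0.0 when a ratio is undefined is my own convention and may differ from what the package reports.

```python
import numpy as np

def up_down_metrics(prob_up, actual_up, threshold=0.5):
    """Threshold the probabilities of an 'up' day, then compute
    recall, precision, and F1 for the 'up' class."""
    pred = np.asarray(prob_up, dtype=float) >= threshold
    actual = np.asarray(actual_up, dtype=bool)
    tp = int(np.sum(pred & actual))    # predicted up, went up
    fp = int(np.sum(pred & ~actual))   # predicted up, went down
    fn = int(np.sum(~pred & actual))   # predicted down, went up
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1

prob_up = [0.9, 0.6, 0.4, 0.2, 0.7]
actual_up = [1, 0, 1, 0, 1]
rec, prec, f1 = up_down_metrics(prob_up, actual_up)
print(rec, prec, f1)
```

Raising the threshold above 0.5 trades recall for precision, which is why it is worth fixing it before comparing models.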

In [None]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3}],['max_feat',{3,4,5,6,7,8,9}],
                                                 ['max_depth',{2,3,4,5}]],parm_search_iter=5,n_folds_val=10,past_depth=15,
                           n_folds_test=20,scaling=True,notest=False,verbose=True)

(1, 3, 2)
avg_train_rec,prec,F1: [0.51698502168329752, 0.6972104808646391, 0.55046606123530339] avg_validation_rec,prec,F1: [0.48333333333333323, 0.65000000000000002, 0.46095238095238089]
(1, 3, 3)
avg_train_rec,prec,F1: [0.60821671862620141, 0.77178665324445372, 0.66446874166701764] avg_validation_rec,prec,F1: [0.49999999999999989, 0.50166666666666671, 0.41857142857142859]
(1, 3, 4)
avg_train_rec,prec,F1: [0.68559060671129646, 0.74329406260520814, 0.70387021371930614] avg_validation_rec,prec,F1: [0.6333333333333333, 0.46833333333333327, 0.4821428571428571]
(1, 3, 5)
avg_train_rec,prec,F1: [0.73401195308522893, 0.76752760999630076, 0.73986529412042346] avg_validation_rec,prec,F1: [0.52499999999999991, 0.4416666666666666, 0.41857142857142859]
(1, 4, 2)
avg_train_rec,prec,F1: [0.60916833406057536, 0.67387799132956949, 0.61739426423249966] avg_validation_rec,prec,F1: [0.39166666666666666, 0.39999999999999997, 0.3261904761904762]
(1, 4, 3)
avg_train_rec,prec,F1: [0.61549311081638669, 0.770

In [117]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5,6,7}],['max_feat',{8,10,12,14,18,22,23,24}],
                                                 ['max_depth',{4,5,6}]],parm_search_iter=1,n_folds_val=10,
                           n_folds_test=20,scaling=True,notest=False,verbose=False)

test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,pre

[(array([[ 0.9,  0.8,  0.7]]),
  array([[ 0.3       ,  0.4       ,  0.45825757]])),
 (array([[ 0.75,  0.75,  0.5 ]]),
  array([[ 0.4330127,  0.4330127,  0.5      ]]))]

Good! Our model outperforms the benchmark in terms of accuracy (=F1 in this case): 0.7 vs 0.5.

Now let's predict whether today's closing is going to be up or down!

In [41]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5,6,7}],['max_feat',{8,12}],['max_depth',{4,6}]],
                           parm_search_iter=1,n_folds_val=10,n_folds_test=25,scaling=False,one_shot=True,notest=True,
                           verbose=False)

array([ 1.])

It will go up, apparently!

In [108]:
feat_imp=tet.models['word2vec'].feature_importances_ #a random forest has no coef_ attribute, use feature_importances_ instead
feat_imp
#remember: only the first n-2 features are nlp, the (n-1)-th is being after a weekend and the n-th is today's closing

In [None]:
tet.auto_ts_val_test_class('word2vec','svmclass',[['C',[0.000001,1.,0.001]],['kernel',{'poly',}]],
                           parm_search_iter=15,n_folds_val=10,n_folds_test=15,scaling=True,notest=False,
                           verbose=False)

In [67]:
predictorgdelt.dataset_names

['some_name_you_choose']

And now, for the real deal: k-fold training and validation!
The following method performs that in a very general manner. It lets you choose the regression model as well as the values of the hyperparameters (please see the module documentation in model_training.py for details on how to pass the hyperparameters). You also need to supply the number of folds you want your data split into, and a seed, for reproducibility. There is also an option to scale and normalize the features, but in general it doesn't perform well.

The method returns the model's average performance over the k training iterations. In short, tuning consists of choosing the hyperparameter values that optimize avg_validation_rmse (that is, minimize the average root mean squared error on the validation datasets).
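The tuning loop described above can be sketched with scikit-learn's `KFold`. This is a minimal sketch under assumptions: the function name mirrors the `avg_validation_rmse` quantity mentioned in the text but is not the package's method, and the lasso model, alpha grid, and synthetic data are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def avg_validation_rmse(X, y, alpha, n_folds=5, seed=0):
    """Average RMSE on the held-out fold over a seeded k-fold split."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X):
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[train_idx], y[train_idx])
        resid = model.predict(X[val_idx]) - y[val_idx]
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)

alphas = [0.01, 0.1, 1.0, 10.0]
scores = {alpha: avg_validation_rmse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)  # value with the lowest average RMSE
print(scores, best_alpha)
```

Keep in mind that shuffled folds mix past and future days, so for time-ordered data this kind of score tends to be more optimistic than a walk-forward evaluation.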

By chance, in this one case we outperform the benchmark model with a lower rmse, but this procedure should be performed a couple of times and an average final performance should be quoted instead.

Out of curiosity, let's see what the most important features were.

As you can see, the method again returns average validation performances, which are now measured in terms of recall, precision, and F1 score. Lacking a specific metric we want to optimize, we are going to use the F1 score for tuning.

The performance plateaus and is optimal for alpha ~1.0

Bingo! Our model predicts all 1's. Not much gained...

Incidentally, that's how you pull the predictions vector for a specific dataset.
In the future I'll add the option to save a specific model run instead of overwriting it, which is good for free exploration.

# Scratch from now on, please ignore!!

## Trying out different regressors on the data, no luck so far :(

In [896]:
from sklearn.svm import SVC #needed for the SVC model fitted below
#model=LogisticRegression(penalty='l2', C=1.) #, dual=False, tol=0.0001,, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
#model=RandomForestClassifier(n_estimators=10,max_features=4000,max_depth=5)#,max_features=7000)
#model=AdaBoostClassifier(n_estimators=20)
#model=MLPClassifier(activation='logistic',hidden_layer_sizes=(100,10,5))
#model=KNeighborsClassifier(n_neighbors=15)
model=SVC(C=.8,kernel='poly')
model.fit(x_tfidf_class_trainval,y_tfidf_class_trainval)
#model.fit(x_bow_class_trainval,y_bow_class_trainval)

SVC(C=0.8, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)