# Introduction

Predicting financial indicators is something of a holy grail for our society at its present stage. There is a vast literature on how to do this, and the general approach is a time-series one: predict the future of a quantity based on that quantity's past.

We are trying to see if it's possible to complement this approach with data coming from news sources, reasoning that news from the world should directly and indirectly weigh on the performance of such indicators as stocks, employment rate, or inflation.

Please keep in mind that we do not expect to make any significant improvement over state-of-the-art financial analyses (which involve much more complex and refined models). Rather, we are interested in building a scalable and dynamic pipeline that in the future might supplement those already-existing models or give interesting insights.

### This notebook

This is a walkthrough illustrating the typical usage of our package. We will try to predict future S&P500 closing values based on past S&P500 values along with NLP features extracted from the daily-updated GDELT 1.0 (http://www.gdeltproject.org/) event database.

In particular, to scope the analysis down to a minimally viable scalable pipeline, I extract features from the source urls contained in the database (one associated with each event).

For each day, all urls get parsed, tokenized, and stemmed, and then merged into a single bag of words. This constitutes one document. After that I may apply a tf-idf or word2vec vectorization (the latter being strongly favored).
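To make the url-to-document step concrete, here is a minimal sketch using only the standard library. The function names and the tiny stop-word list are hypothetical, and stemming is left out (the package uses a Porter stemmer); this is not the package's actual implementation.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately tiny stop-word list (the real one is much larger).
STOP_WORDS = {"www", "com", "html", "index", "php", "news", "the", "to", "of", "and"}

def url_to_tokens(url):
    """Split a url's host and path into lowercase alphabetic tokens."""
    parsed = urlparse(url)
    text = (parsed.netloc + " " + parsed.path).lower()
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are dropped
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

def day_to_document(urls):
    """Merge the tokens of all urls seen on one day into a single bag of words."""
    document = []
    for url in urls:
        document.extend(url_to_tokens(url))
    return document

urls = [
    "http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale",
    "https://in.news.yahoo.com/may-woo-labour-voters-pitch-britains-center-ground-230645420.html",
]
doc = day_to_document(urls)
print(doc)
```

Each day then contributes exactly one such document to the corpus that gets vectorized.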

I use the extracted features (plus the same day's closing S&P500) to try and fit various regression models to predict the next day's S&P500 and compare them to a benchmark model (a simple naive model predicting the same for tomorrow as today, plus the average increase or decrease over the last few days).
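The benchmark described above fits in a few lines. This is a sketch under my own assumptions: the name `naive_benchmark` and the default `window` of 3 days are illustrative, not taken from the package.

```python
def naive_benchmark(closes, window=3):
    """Predict tomorrow's close as today's close plus the mean daily change
    over the last `window` days."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closing values")
    recent = closes[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    return closes[-1] + sum(changes) / window

# Illustrative values only, not real S&P500 data.
closes = [2100.0, 2104.0, 2101.0, 2107.0]
prediction = naive_benchmark(closes, window=3)  # 2107 + (4 - 3 + 6) / 3
print(prediction)
```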

I also try to predict if tomorrow's index value will rise or fall, given today's news.
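For the rise/fall task, the targets can be derived from the closing series as sketched below. Treating an unchanged close as "not a rise" is my assumption; the package may label ties differently.

```python
def rise_fall_labels(closes):
    """Label day t with 1 if the close on day t+1 is higher than on day t, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]

# Illustrative values only.
labels = rise_fall_labels([2100.0, 2104.0, 2101.0, 2107.0, 2107.0])
print(labels)  # [1, 0, 1, 0]
```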

For both tasks, random forests (regressors and classifiers, respectively) seem to be promising approaches.

In [1]:
import importlib
import sys
import os
sourcedir=os.getcwd()+"/../source"
if sourcedir not in sys.path:
    sys.path.append(sourcedir)
import numpy as np

In [118]:
#importing our nlp preprocessing module, the reload command is for development
import nlp_preprocessing as nlpp
importlib.reload(nlpp)
#importing our model training module, the reload command is for development
import model_training as mdlt
importlib.reload(mdlt)

<module 'model_training' from '/Users/Maxos/Desktop/Insight_stuff/bigsnippyrepo/maqro/notebooks/../source/model_training.py'>

## The nlp-preprocessing module

The module has two classes for now: one deals with the nlp preprocessing of Google News articles, which is discussed in much more depth in another notebook; the other is its analog for GDELT url data.

Let's explore these classes and their contents.

### The CorpusGoogleNews class

In [13]:
#del datagnews
datagnews=nlpp.CorpusGoogleNews() # nlpp.CorpusGoogleNews('some/data/directory') 

These are the attributes of the initialized instance:

In [4]:
datagnews.raw_articles

{}

In [5]:
datagnews.datadirectory

'../data/'

There is one public method for now: it loads files from the data folder

In [6]:
datagnews.data_directory_crawl('AAPL',verbose=1)

Apple Inc
Apple Inc 1-26-17
Apple Inc 1-27-17
Apple Inc 1-30-17
Apple Inc 1-31-17
Apple Inc 2-1-17


which populates datagnews.raw_articles with dataframes like this:

In [7]:
datagnews.raw_articles['Apple Inc 1-30-17'].head()

Unnamed: 0,body,category,title
0,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
1,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
2,"The smart home market continues to heat up, an...",Apple Inc,Alphabet Inc (GOOGL) Steals AI Expert Back Fro...
3,"Reportedly, Apple Inc.’s AAPL management is sc...",Apple Inc,Apple (AAPL) Set to Meet Government Officials ...
4,Apple Inc. (AAPL) executives were in India tod...,Apple Inc,Apple Close to Signing Deal With Indian Govern...


### The CorpusGDELT class

Let's initialize the class

In [3]:
#del datagdelt
datagdelt=nlpp.CorpusGDELT(min_ment=500) # min_ment defaults to 1 and cuts off events that have a low number of mentions

Let's have a look at the several attributes that the class contains.

In [4]:
#minimum number of mentions for one event to be used
print('Minimum number of mentions:',datagdelt.minimum_ment)
print('Current directory:',datagdelt.currentdir) # current directory
print('Dates loaded so far:',datagdelt.dates) # dates for which data has been loaded so far
print('Corpus of raw urls',datagdelt.url_corpus)
print('Corpus of tfidf-vectorized docs:')
print(datagdelt.vect_corpus_tfidf)

Minimum number of mentions: 500
Current directory: ../data/GDELT_1.0/
Dates loaded so far: []
Corpus of raw urls []
Corpus of tfidf-vectorized docs:
Empty DataFrame
Columns: []
Index: []


In [10]:
#vowels and consonants
print('Vowels:',datagdelt.vowels)
print('Consonants:',datagdelt.consonants,end=' ')
print()
print('Stemmer:',datagdelt.porter) #stemmer of choice
print('Punctuation:',datagdelt.punctuation) #punctuation regular expression
print('Tokenizer:',datagdelt.re_tokenizer) 
print('Filter for spurious url beginnings:',datagdelt.spurious_beginnings)
print('Filter for stop words:',datagdelt.stop_words)

Vowels: ['a', 'e', 'i', 'o', 'u', 'y']
Consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'] 
Stemmer: <PorterStemmer>
Punctuation: re.compile('[-.?!,":;()|0-9]')
Tokenizer: RegexpTokenizer(pattern='\\w+', gaps=False, discard_empty=True, flags=56)
Filter for spurious url beginnings: re.compile('idind.|idus.|iduk.')
Filter for stop words: {'', 'had', 'what', 'these', 'why', 'y', 'over', 'shouldn', 'is', 'himself', 'him', 'won', 'll', 'itself', 'should', 'now', 'herself', 'am', 'too', 'while', 'ourselves', 'here', 'that', 'there', 'some', 'mustn', 'being', 'yourself', 'been', 'by', 'themselves', 'how', 'weren', 'nor', 'each', 'aren', 'between', 'isn', 'yourselves', 'me', 'i', 'once', 'doesn', 'a', 'with', 'your', 'does', 'up', 'or', 'has', 're', 'whom', 'couldn', 'you', 'yours', 'mightn', 'and', 'myself', 'd', 'from', 'more', 'through', 'again', 'if', 'then', 'its', 'other', 'ma', 'for', 'but', 'so', 'hadn', 'do', 'such', 'only',

In [11]:
print(datagdelt.header,end=' ') #GDELT csv files header, notice the last field has the urls

['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_FullName', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_FullName', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_Lat', 

Now let's see what methods are available and what the pipeline is like.

First we load the urls.

In [5]:
datagdelt.load_urls('20161001','20170219') #the earliest available date is April 1st 2013 = 20130401

 Done!

Now let's look at what the url_corpus attribute looks like

In [10]:
day=5 #select one day
print('There are',len(datagdelt.url_corpus),'elements in it, because we loaded',len(datagdelt.dates),'days!')
print('The loaded day n.',day,'had',len(datagdelt.url_corpus[day-1]) ,'events in it that were mentioned more than',datagdelt.minimum_ment,'times:\n', datagdelt.url_corpus[day-1][:10],'\n etc...')
print('The first event was mentioned',datagdelt.url_corpus[day-1][0][0],'times, the second',datagdelt.url_corpus[day-1][1][0],'times, etc...')

There are 140 elements in it, because we loaded 140 days!
The loaded day n. 5 had 653 events in it that were mentioned more than 500 times:
 [[1972, 'http://www.philippinetimes.com/index.php/sid/248243461'], [1115, 'http://www.capradio.org/news/npr/story?storyid=496552413'], [970, 'http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale'], [660, 'http://www.princegeorgecitizen.com/celebrity-chef-jamie-oliver-hopes-to-discuss-child-health-issues-with-trudeau-1.2358050'], [748, 'http://1045snx.iheart.com/articles/trending-104650/spencer-pratt-mocks-kim-kardashian-robbery-15169059/'], [746, 'https://in.news.yahoo.com/may-woo-labour-voters-pitch-britains-center-ground-230645420.html'], [754, 'http://www.stuff.co.nz/entertainment/84945059/Veteran-broadcaster-Mark-Sainsbury-takes-Rocky-Horror-Stage-in-Hamilton'], [1951, 'http://www.whio.com/news/national-govt--politics/clinton-reaches-out-women-while-trump-defends-taxes/xnmN5QugmLzeGkEBR64y9I/'], [1965, 'http://wgno.com/2016/10/0

We see that many of those urls contain wordings that can be very informative on what's happening in the world and therefore might tell us something about the near future of the markets!!

Now, let's process these messy raw urls! Let's use word2vec:

In [6]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=8)

Using word2vec vectorization procedure


2017-02-20 05:02:52,161 : INFO : collecting all words and their counts
2017-02-20 05:02:52,162 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-02-20 05:02:52,260 : INFO : collected 20461 word types from a corpus of 487933 raw words and 142 sentences
2017-02-20 05:02:52,260 : INFO : Loading a fresh vocabulary
2017-02-20 05:02:52,332 : INFO : min_count=1 retains 20461 unique words (100% of original 20461, drops 0)
2017-02-20 05:02:52,333 : INFO : min_count=1 leaves 487933 word corpus (100% of original 487933, drops 0)
2017-02-20 05:02:52,429 : INFO : deleting the raw counts dictionary of 20461 items
2017-02-20 05:02:52,430 : INFO : sample=0.001 downsamples 24 most-common words
2017-02-20 05:02:52,431 : INFO : downsampling leaves estimated 463332 word corpus (95.0% of prior 487933)
2017-02-20 05:02:52,432 : INFO : estimated required memory for 20461 words and 8 dimensions: 11540004 bytes
2017-02-20 05:02:52,511 : INFO : resetting layer weights
2017-02-20 05

which gives

In [7]:
datagdelt.word2vec_corpus.head(10)

news_date,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,w2v_8
20161001,-0.17814,-0.295545,0.241965,-0.507727,-0.646261,0.36763,0.067822,0.084742
20161002,-0.189584,-0.286524,0.220864,-0.53422,-0.642891,0.348691,-0.060732,0.095956
20161003,-0.209694,-0.279247,0.23714,-0.498854,-0.646216,0.376962,-0.015636,0.114103
20161004,-0.207614,-0.265522,0.239953,-0.487596,-0.669452,0.34623,0.013481,0.151145
20161005,-0.165409,-0.258032,0.254543,-0.476076,-0.676383,0.369045,0.009359,0.144369
20161006,-0.222821,-0.320958,0.221421,-0.497328,-0.634485,0.371804,-0.032747,0.095357
20161007,-0.195355,-0.296706,0.226934,-0.470243,-0.663831,0.372546,-0.017702,0.146284
20161008,-0.176664,-0.248983,0.161096,-0.53751,-0.637041,0.390565,-0.082044,0.163815
20161009,-0.259883,-0.281645,0.171616,-0.457675,-0.687972,0.360261,0.063804,0.083991
20161010,-0.062354,-0.174777,0.166919,-0.518312,-0.657563,0.3824,-0.149366,0.261012


BOOM! Now we have all of our datapoints with their nlp features neatly arranged in a pandas dataframe, ready for processing. Mission accomplished!
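How the per-word vectors turn into one row per day isn't spelled out here. One common approach, and purely an assumption about what `gdelt_preprocess` does, is to average the vectors of all the words in the day's document. A toy sketch with made-up 2-dimensional embeddings:

```python
def average_vector(tokens, embeddings):
    """Average the embedding vectors of the tokens that have an embedding.
    Returns None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Made-up embeddings, for illustration only.
toy_embeddings = {"appl": [1.0, 0.0], "stock": [0.0, 1.0], "tax": [1.0, 1.0]}
day_vector = average_vector(["appl", "stock", "unknown"], toy_embeddings)
print(day_vector)  # [0.5, 0.5]
```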

If we try to run this expensive preprocessing again on the same exact data...

In [8]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=8)

Using word2vec vectorization procedure
Nothing to be done, dataframes are up to date


Yay for savings!
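The "nothing to be done" behavior can be obtained by remembering which inputs the last run used. Here is a minimal sketch of that idea; the class `CachedPreprocessor` and its internals are hypothetical, not the package's code.

```python
class CachedPreprocessor:
    """Skip the expensive step when called again with identical inputs."""

    def __init__(self):
        self._last_key = None
        self.result = None
        self.runs = 0  # how many times the expensive step actually ran

    def preprocess(self, dates, vectrz="word2vec", size_w2v=8):
        key = (tuple(dates), vectrz, size_w2v)
        if key == self._last_key:
            print("Nothing to be done, dataframes are up to date")
            return self.result
        self.runs += 1
        # Placeholder for the expensive vectorization step.
        self.result = {d: [0.0] * size_w2v for d in dates}
        self._last_key = key
        return self.result

prep = CachedPreprocessor()
prep.preprocess(["20161001", "20161002"])
prep.preprocess(["20161001", "20161002"])  # served from the cache
print(prep.runs)  # 1
```

Changing the dates or any vectorization parameter changes the key, so the expensive step runs again.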

Now we initialize the model training class, feeding it the dataframe from the nlp processing

## The model training module
This section covers model training, validation, and testing, from our model_training module

We initialize a class instance by loading into it three lists: one of names of your choosing, one of dataframes (in this case the output from the previous module, datagdelt.word2vec_corpus), and one of the corresponding word2vec models (datagdelt.w2vec_model).

In [179]:
import model_training as mdlt
importlib.reload(mdlt)
tet=mdlt.StockPrediction([['word2vec'],[datagdelt.word2vec_corpus],[datagdelt.w2vec_model]],update=True)

Let's try an L1-regularized linear regressor (lasso) that tries to predict the increase/decrease of tomorrow's S&P index over today's. We test on the last 20 days out of 50 and, for every test case, validate/tune over the previous 20 days. As for the hyperparameters, the regularization parameter is searched for in the 0.001-7000 range, and we allow 50 iterations of the optimal parameter search.
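To picture the time-series testing scheme, here is one plausible reading of `past_depth` and `n_folds_test` as a walk-forward split: each of the last `n_folds_test` days is tested with a model fit on the `past_depth` days right before it. The function below is my own sketch, not the internals of `auto_ts_val_test_reg`.

```python
def walk_forward_splits(n_samples, past_depth, n_folds_test):
    """Yield (train_indices, test_index) for each of the last n_folds_test samples.
    Every test sample is preceded by a training window of past_depth samples,
    so the model never sees data from the future."""
    first_test = n_samples - n_folds_test
    if first_test < past_depth:
        raise ValueError("not enough history for the requested window")
    for test_idx in range(first_test, n_samples):
        train = list(range(test_idx - past_depth, test_idx))
        yield train, test_idx

splits = list(walk_forward_splits(n_samples=100, past_depth=50, n_folds_test=20))
print(len(splits))                        # 20
print(splits[0][0][0], splits[0][1])      # 30 80
```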

In [87]:
tet.auto_ts_val_test_reg('word2vec','lasso',[['alpha',[0.001,7000.,60.]]],parm_search_iter=50,n_folds_val=20,
                         past_depth=50,n_folds_test=20,scaling=True,differential=False,notest=False,verbose=False,
                         eqdiff=False)



best parameter choices: (80.55273365340669,)
model_test_rmse: 4.911 benchmark_test_rmse: 7.466
best parameter choices: (77.7076393080325,)
model_test_rmse: 15.977 benchmark_test_rmse: 13.611
best parameter choices: (46.47802028916847,)
model_test_rmse: 18.671 benchmark_test_rmse: 16.867
best parameter choices: (46.47802028916847,)
model_test_rmse: 1.674 benchmark_test_rmse: 3.337
best parameter choices: (33.13449456383699,)
model_test_rmse: 2.280 benchmark_test_rmse: 3.595
best parameter choices: (32.08075591740211,)
model_test_rmse: 14.027 benchmark_test_rmse: 15.351
best parameter choices: (32.08075591740211,)
model_test_rmse: 1.953 benchmark_test_rmse: 3.403
best parameter choices: (32.08075591740211,)
model_test_rmse: 0.787 benchmark_test_rmse: 0.653
best parameter choices: (32.08075591740211,)
model_test_rmse: 1.405 benchmark_test_rmse: 0.025
best parameter choices: (32.08075591740211,)
model_test_rmse: 16.667 benchmark_test_rmse: 15.245
best parameter choices: (31.2377650002542,)

[(6.9915571716431488, 5.7916109207521451),
 (6.9715473175452187, 5.3271630939674157)]

Well, the model clearly falls back on the benchmark. The coefficients are

In [88]:
feat_imp=tet.models['word2vec'].coef_
feat_imp
#these are the coefficients of the lasso regressor

array([-0.        , -0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.        ,  1.41475612,  1.00397025])

...which isn't surprising. As we said at the beginning, the most important feature should have been today's closing, and it was, entirely obscuring everything else (except for the engineered after_weekend feature).
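For reference, the last two coefficients correspond to the two non-nlp features appended to each row: the after_weekend flag and today's closing. A sketch of how such a row could be assembled follows; reading "after a weekend" as "the date is a Monday" is my assumption (it ignores holidays), and `build_row` is a hypothetical helper.

```python
from datetime import datetime

def after_weekend(date_str):
    """Return 1 if the date (YYYYMMDD) falls on a Monday, else 0."""
    return 1 if datetime.strptime(date_str, "%Y%m%d").weekday() == 0 else 0

def build_row(nlp_features, date_str, todays_close):
    """Feature order: nlp features, then after_weekend, then today's close."""
    return list(nlp_features) + [after_weekend(date_str), todays_close]

# Illustrative values only.
row = build_row([0.1, 0.2], "20161003", 2000.0)
print(row)  # [0.1, 0.2, 1, 2000.0]
```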

Had we gotten a reasonable result, we might want to play with the feature importances to see which stems were actually the most significant. It can be done as follows:

In [78]:
model=tet.w2v_models['word2vec']
model.similar_by_word('appl')

2017-02-20 06:46:15,943 : INFO : precomputing L2-norms of word weight vectors


[('salli', 0.9973966479301453),
 ('hiv', 0.997178316116333),
 ('elig', 0.9961667060852051),
 ('diabet', 0.9960787296295166),
 ('globe', 0.9957556128501892),
 ('beyonc', 0.9957464933395386),
 ('diana', 0.995519757270813),
 ('anxieti', 0.9954898953437805),
 ('payment', 0.9949756860733032),
 ('bump', 0.9947608709335327)]

In [79]:
model.wv.similar_by_vector(feat_imp[:-2])

[('redlin', 0.0),
 ('aep', 0.0),
 ('curt', 0.0),
 ('disd', 0.0),
 ('oussid', 0.0),
 ('astrazeneca', 0.0),
 ('gsfqo', 0.0),
 ('uzel', 0.0),
 ('layton', 0.0),
 ('idinkbntiz', 0.0)]

In [129]:
tet.auto_ts_val_test_reg('word2vec','ridge',[['alpha',[0.001,7000.,60.]]],parm_search_iter=50,n_folds_val=20,
                         past_depth=50,n_folds_test=20,scaling=True,differential=True,notest=False,verbose=False,
                         eqdiff=False)

best parameter choices: (178499.97550000023,)
model_test_rmse: 7.466 benchmark_test_rmse: 7.466
best parameter choices: (178499.97550000023,)
model_test_rmse: 13.611 benchmark_test_rmse: 13.611
best parameter choices: (178499.97550000023,)
model_test_rmse: 16.867 benchmark_test_rmse: 16.867
best parameter choices: (178499.97550000023,)
model_test_rmse: 3.337 benchmark_test_rmse: 3.337
best parameter choices: (178499.97550000023,)
model_test_rmse: 3.595 benchmark_test_rmse: 3.595
best parameter choices: (178499.97550000023,)
model_test_rmse: 15.351 benchmark_test_rmse: 15.351


KeyboardInterrupt: 

In [144]:
tet.auto_ts_val_test_reg('word2vec','ridge',[['alpha',[0.001,7000.,60.]]],parm_search_iter=50,n_folds_val=20,
                         past_depth=50,n_folds_test=20,scaling=False,differential=False,notest=True,verbose=False,
                         eqdiff=False)

array([ 2356.00999525])

Hmmm... Not sure what to make of this

How about we try a random forest regressor instead? We let the tuning select any combination among 3 values for the number of estimators and 6 for the maximum number of features used for splitting, and we allow a maximum depth from 4 to 9.
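To get a feel for the size of the search space, the candidate sets passed in the next cell can be enumerated with `itertools.product`. How `auto_ts_val_test_reg` actually walks this space isn't shown here, so the exhaustive enumeration below is just an illustration.

```python
from itertools import product

# Candidate values copied from the call in the next cell.
grid = {
    "n_estim": [5, 6, 7],
    "max_feat": [5, 6, 7, 8, 9, 10],
    "max_depth": [4, 5, 6, 7, 8, 9],
}

# Each combination is a tuple ordered as (n_estim, max_feat, max_depth).
combos = list(product(*grid.values()))
print(len(combos))  # 108
```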

In [68]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{5,6,7}],['max_feat',{5,6,7,8,9,10}],['max_depth',{4,5,6,7,8,9}]],
                         parm_search_iter=1,n_folds_val=25,past_depth=20,n_folds_test=20,scaling=True,differential=True,
                         verbose=False,notest=False,eqdiff=False)

best parameter choices: (6, 6, 8)
model_test_rmse: 7.792 benchmark_test_rmse: 7.466
best parameter choices: (5, 6, 4)
model_test_rmse: 19.177 benchmark_test_rmse: 13.611
best parameter choices: (5, 6, 7)
model_test_rmse: 23.201 benchmark_test_rmse: 16.867
best parameter choices: (5, 10, 8)
model_test_rmse: 0.070 benchmark_test_rmse: 3.337
best parameter choices: (5, 7, 9)
model_test_rmse: 3.530 benchmark_test_rmse: 3.595
best parameter choices: (5, 9, 8)
model_test_rmse: 29.664 benchmark_test_rmse: 15.351
best parameter choices: (7, 5, 4)
model_test_rmse: 8.364 benchmark_test_rmse: 3.403
best parameter choices: (6, 8, 5)
model_test_rmse: 2.911 benchmark_test_rmse: 0.653
best parameter choices: (6, 5, 6)
model_test_rmse: 4.993 benchmark_test_rmse: 0.025
best parameter choices: (5, 8, 8)
model_test_rmse: 18.863 benchmark_test_rmse: 15.245
best parameter choices: (5, 10, 5)
model_test_rmse: 9.581 benchmark_test_rmse: 6.362
best parameter choices: (7, 10, 5)
model_test_rmse: 6.436 benchmar

KeyboardInterrupt: 

Now, if we want to get a prediction for today, we set the 'notest' argument to True

In [139]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{5,6,7}],['max_feat',{5,6,7,8,9,10}],['max_depth',{4,5,6,7,8,9}]],
                         parm_search_iter=1,n_folds_val=25,past_depth=20,n_folds_test=20,scaling=True,differential=True,
                         verbose=False,notest=True,eqdiff=True)

array([ 2230.36855433])

In [65]:
feat_imp=tet.models['word2vec'].feature_importances_
feat_imp
#these are the feature importances for the random forest regressor

array([ 0.1341678 ,  0.15907709,  0.07990877,  0.06699914,  0.03191157,
        0.09950022,  0.13786384,  0.11339288,  0.01543178,  0.16174691])

In [82]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{1,2,3,5,7}],['max_feat',{21,22,23,24}],['max_depth',{5,6,7}]],
                         parm_search_iter=1,n_folds_val=6,n_folds_test=20,scaling=True,differential=True,verbose=False,
                         notest=True)

I can't predict for tomorrow, because the stock market will be closed


In [149]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{3,4,5,6}]],parm_search_iter=10,n_folds_val=15,
                         n_folds_test=15,past_depth=30,scaling=True,differential=True,notest=False,verbose=False)

best parameter choices: (4,)
model_test_rmse: 14.150 benchmark_test_rmse: 15.351
best parameter choices: (4,)
model_test_rmse: 2.390 benchmark_test_rmse: 3.403
best parameter choices: (4,)
model_test_rmse: 0.320 benchmark_test_rmse: 0.653
best parameter choices: (4,)
model_test_rmse: 0.940 benchmark_test_rmse: 0.025
best parameter choices: (5,)
model_test_rmse: 13.330 benchmark_test_rmse: 15.245
best parameter choices: (4,)
model_test_rmse: 5.220 benchmark_test_rmse: 6.362
best parameter choices: (4,)
model_test_rmse: 0.160 benchmark_test_rmse: 0.909
best parameter choices: (4,)
model_test_rmse: 1.230 benchmark_test_rmse: 0.172
best parameter choices: (4,)
model_test_rmse: 11.695 benchmark_test_rmse: 11.780
best parameter choices: (6,)
model_test_rmse: 3.307 benchmark_test_rmse: 6.679
best parameter choices: (6,)
model_test_rmse: 7.227 benchmark_test_rmse: 10.525
best parameter choices: (6,)
model_test_rmse: 4.407 benchmark_test_rmse: 7.591
best parameter choices: (6,)
model_test_rmse:

[(5.2705113333334035, 4.5526692805490301),
 (6.3036606676414824, 5.1252792112507803)]

In [151]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{3,4,5,6}]],parm_search_iter=10,n_folds_val=15,
                         n_folds_test=15,past_depth=30,scaling=True,differential=True,notest=True,verbose=False)

array([ 2360.38659633])

In [180]:
tet.auto_ts_val_test_class('word2vec','logreg',[['l1orl2?',{'l1',}],
                                                ['C',[0.1,1.,0.3]]],
                           parm_search_iter=30,n_folds_val=15,n_folds_test=15,past_depth=50,scaling=False,notest=False,
                           verbose=False)

best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: ('l1', 0.1)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choi

[(array([[ 0.73333333,  0.86666667,  0.6       ]]),
  array([[ 0.44221664,  0.33993463,  0.48989795]])),
 (array([[ 0.66666667,  0.93333333,  0.6       ]]),
  array([[ 0.47140452,  0.24944383,  0.48989795]]))]

Let's see if classifying tomorrow's value as going up or down will do us any better...
N.B. We need to specify a decision threshold, which I recommend leaving at 0.5 for now.
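The decision threshold simply turns a predicted probability of "up" into a 0/1 label. A minimal sketch, assuming that a probability equal to the threshold counts as "up" (the package may break ties the other way):

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities of a rise to 1 (up) or 0 (down)."""
    return [1 if p >= threshold else 0 for p in probabilities]

print(to_labels([0.2, 0.5, 0.9]))                 # [0, 1, 1]
print(to_labels([0.2, 0.5, 0.9], threshold=0.6))  # [0, 0, 1]
```

Raising the threshold makes the classifier more conservative about calling a rise.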

In [171]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5}],['max_feat',{8,9,10}],
                                                 ['max_depth',{2,3,4,5}]],parm_search_iter=1,n_folds_val=20,
                           past_depth=40,n_folds_test=15,scaling=False,notest=False,verbose=False)

best parameter choices: (1, 9, 3)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: (3, 10, 5)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: (1, 9, 3)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: (1, 9, 3)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: (2, 9, 5)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: (3, 8, 5)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: (2, 8, 5)
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: (5, 9, 4)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
best parameter choices: (3, 9, 4)
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
best parameter choices: (3, 8, 3)
te

[(array([[ 0.66666667,  1.        ,  0.66666667]]),
  array([[ 0.47140452,  0.        ,  0.47140452]])),
 (array([[ 0.66666667,  0.93333333,  0.6       ]]),
  array([[ 0.47140452,  0.24944383,  0.48989795]]))]

Good! Our model outperforms the benchmark in terms of accuracy (=F1 in this case): 0.67 vs 0.60.

Now let's predict what today's closing is going to be!

In [172]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5}],['max_feat',{8,9,10}],
                                                 ['max_depth',{2,3,4,5}]],parm_search_iter=1,n_folds_val=20,
                           past_depth=40,n_folds_test=15,scaling=False,notest=True,verbose=False)

array([ 1.])

It will go up, apparently!

In [175]:
feat_imp=tet.models['word2vec'].feature_importances_
feat_imp
#remember: only the first n-2 features are nlp, the (n-1)-th is being after a weekend and the n-th is today's closing

array([ 0.01150188,  0.        ,  0.        ,  0.04061601,  0.        ,
        0.16298408,  0.32183145,  0.06389972,  0.        ,  0.06375228,
        0.12311446,  0.21230012])

In [None]:
tet.auto_ts_val_test_class('word2vec','svmclass',[['C',[0.000001,1.,0.001]],['kernel',{'poly','linear'}]],
                           parm_search_iter=15,past_depth=30,n_folds_val=15,n_folds_test=15,scaling=False,notest=False,
                           verbose=False)

And now, for the real deal: k-fold training and validation!
The following method performs that in a very general manner. It lets you choose the regression model as well as the values of the hyperparameters (please see the module documentation in model_training.py for details on how to pass them). You also need to supply the number of folds you want your data split into, and a seed for reproducibility. There is also an option to scale and normalize the features, but it doesn't perform well in general.

The method returns the model's average performance over the k training iterations. In short, tuning consists of choosing the hyperparameter values that optimize avg_validation_rmse (that is, minimize the average root mean squared error on the validation datasets).
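Here is a bare-bones sketch of what averaging the validation rmse over k seeded folds means. The helper names are mine, and the "model" is a stand-in that always predicts the training mean; the real method fits the chosen regressor instead.

```python
import math
import random

def k_fold_indices(n_samples, n_folds, seed):
    """Shuffle the indices reproducibly and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def avg_validation_rmse(y, n_folds, seed):
    """Average the validation rmse over the folds, using a mean predictor."""
    folds = k_fold_indices(len(y), n_folds, seed)
    scores = []
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        mean_train = sum(y[i] for i in train_idx) / len(train_idx)  # stand-in model
        preds = [mean_train] * len(val_idx)
        scores.append(rmse([y[i] for i in val_idx], preds))
    return sum(scores) / n_folds

y = [float(v) for v in range(1, 11)]  # illustrative targets
score = avg_validation_rmse(y, n_folds=5, seed=0)
print(score)
```

Tuning then amounts to repeating this for each hyperparameter choice and keeping the one with the lowest average.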

By chance, in this one case we outperform the benchmark model with a lower rmse, but this procedure should be performed a few times and an average final performance should be quoted instead.

Out of curiosity, let's see what the most important features were.

As you can see, the method again returns average validation performances, which are now measured in terms of recall, precision, and F1 score. Lacking a specific metric we want to optimize, we are going to use the F1 score for tuning.
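As a reminder of what these three numbers are, here is a small sketch with the standard definitions, returned in the same order as the printed output (recall, precision, F1). Returning 0.0 when a denominator is zero is my own convention here; conventions vary, and the package may handle that case differently.

```python
def recall_precision_f1(y_true, y_pred):
    """Compute recall, precision, and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1

# Two true positives, one false positive, one false negative.
rec, prec, f1 = recall_precision_f1([1, 0, 1, 1], [1, 1, 0, 1])
print(rec, prec, f1)
```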

The performance plateaus and is optimal for alpha ~1.0

Bingo! Our model predicts all 1's. Not much gained...

Incidentally, that's how you pull the predictions vector for a specific dataset.
In the future I'll add the option to save a specific model run instead of overwriting it, which is good for free exploration.

# Scratch from now on, please ignore!!

## Trying out different regressors on the data, no luck so far :(

In [896]:
#model=LogisticRegression(penalty='l2', C=1.) #, dual=False, tol=0.0001,, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
#model=RandomForestClassifier(n_estimators=10,max_features=4000,max_depth=5)#,max_features=7000)
#model=AdaBoostClassifier(n_estimators=20)
#model=MLPClassifier(activation='logistic',hidden_layer_sizes=(100,10,5))
#model=KNeighborsClassifier(n_neighbors=15)
model=SVC(C=.8,kernel='poly')
model.fit(x_tfidf_class_trainval,y_tfidf_class_trainval)
#model.fit(x_bow_class_trainval,y_bow_class_trainval)

SVC(C=0.8, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)