# Introduction

Predicting financial indicators is something of a holy grail at present. There is a vast literature on how to do this, and the general approach is a time-series one: predict the future of a quantity based on that quantity's past.

We are trying to see if it's possible to complement this approach with data coming from news sources, reasoning that news from the world should directly and indirectly weigh on the performance of such indicators as stocks, employment rate, or inflation.

Please keep in mind that we do not expect to make any significant improvement over state-of-the-art financial analyses (which involve much more complex and refined models). Rather, we are interested in building a scalable and dynamic pipeline that in the future might supplement those already-existing models or give interesting insights.

### This notebook

This is a walkthrough illustrating the typical usage of our package. We will try to predict future S&P500 closing values based on past S&P500 values along with NLP features extracted from the daily-updated GDELT 1.0 event database (http://www.gdeltproject.org/).

In particular, to scope the analysis down to a minimally viable scalable pipeline, I extract features from the source urls contained in the database (one associated with each event).

For each day, all urls get parsed, tokenized, and stemmed, and then conflated together into a single bag of words. This constitutes one document. After that I may apply a tf-idf or word2vec vectorization (the latter being much preferred).
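The per-day document construction described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function names `url_to_tokens` and `day_document`, the tiny stop-word list, and the minimum token length are all hypothetical, and stemming is left out (the package itself relies on a Porter stemmer and a regular-expression tokenizer).

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Tiny illustrative stop-word list (an assumption; the package uses a larger set).
STOP_WORDS = {"the", "and", "for", "html", "htm", "php", "index", "www"}


def url_to_tokens(url):
    """Split the path of a url into lowercase alphabetic tokens."""
    path = urlparse(url).path
    tokens = re.findall(r"[a-z]+", path.lower())
    # Drop very short fragments and stop words; no stemming in this sketch.
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]


def day_document(urls):
    """Conflate all urls of one day into a single bag of words."""
    bag = Counter()
    for url in urls:
        bag.update(url_to_tokens(url))
    return bag


urls = [
    "http://wgno.com/2016/10/04/this-robot-is-so-realistic-that-it-helps-train-first-responders/",
    "http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale",
]
doc = day_document(urls)
print(doc.most_common(5))
```

Only the url path is tokenized here, so host names such as `wgno` do not end up in the bag; whether the package does the same is not something this sketch claims.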

I use the extracted features (plus the same day's closing S&P500) to fit various regression models that predict the next day's S&P500, and I compare them to a benchmark model: a simple naive model that predicts for tomorrow the same value as today, plus the average increase or decrease over the last few days.
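As a rough illustration of the benchmark just described, here is a minimal sketch. The function name `naive_benchmark`, the default `window`, and the closing values are made up for the example; the benchmark implemented in the package may differ in its details.

```python
def naive_benchmark(closes, window=3):
    """Predict tomorrow's close as today's close plus the mean daily
    change over the last `window` days."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closing values")
    recent = closes[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    return closes[-1] + sum(changes) / window


closes = [2100.0, 2104.0, 2102.0, 2108.0]  # made-up closing values
prediction = naive_benchmark(closes, window=3)
print(prediction)
```

Any model that uses news features has to beat this kind of prediction to be worth its extra cost.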

I also try to predict if tomorrow's index value will rise or fall, given today's news.

For both tasks random forest regressors/classifiers seem promising approaches.

In [1]:
import importlib
import sys
import os
sourcedir=os.getcwd()+"/../source"
if sourcedir not in sys.path:
    sys.path.append(sourcedir)
import numpy as np

In [2]:
#importing our nlp preprocessing module, the reload command is for development
import nlp_preprocessing as nlpp
importlib.reload(nlpp)
#importing our model training module, the reload command is for development
import model_training as mdlt
importlib.reload(mdlt)

<module 'model_training' from '/Users/Maxos/Desktop/Insight_stuff/bigsnippyrepo/maqro/notebooks/../source/model_training.py'>

## The nlp-preprocessing module

The module has two classes for now: one deals with the nlp preprocessing of Google News articles, which is discussed in much more depth in another notebook; the other is the analog for GDELT url data.

Let's explore these classes and their contents.

### The CorpusGoogleNews class

In [3]:
#del datagnews
datagnews=nlpp.CorpusGoogleNews() # nlpp.CorpusGoogleNews('some/data/directory') 

These are the attributes of the initialized class

In [4]:
datagnews.raw_articles

{}

In [5]:
datagnews.datadirectory

'../data/'

There is one public method for now: it loads files from the data folder

In [6]:
datagnews.data_directory_crawl('AAPL',verbose=1)

Apple Inc
Apple Inc 1-26-17
Apple Inc 1-27-17
Apple Inc 1-30-17
Apple Inc 1-31-17
Apple Inc 2-1-17


which populates datagnews.raw_articles with dataframes like this:

In [7]:
datagnews.raw_articles['Apple Inc 1-30-17'].head()

Unnamed: 0,body,category,title
0,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
1,The first day of public trading with President...,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...
2,"The smart home market continues to heat up, an...",Apple Inc,Alphabet Inc (GOOGL) Steals AI Expert Back Fro...
3,"Reportedly, Apple Inc.’s AAPL management is sc...",Apple Inc,Apple (AAPL) Set to Meet Government Officials ...
4,Apple Inc. (AAPL) executives were in India tod...,Apple Inc,Apple Close to Signing Deal With Indian Govern...


### The CorpusGDELT class

Let's initialize the class

In [8]:
#del datagdelt
datagdelt=nlpp.CorpusGDELT(min_ment=800) # min_ment defaults to 1 and cuts off events that have a low number of mentions

Let's have a look at the several attributes that the class contains.

In [9]:
#minimum number of mentions for one event to be used
print('Minimum number of mentions:',datagdelt.minimum_ment)
print('Current directory:',datagdelt.currentdir) # current directory
print('Dates loaded so far:',datagdelt.dates) # dates for which data has been loaded so far
print('Corpus of raw urls',datagdelt.url_corpus)
print('Corpus of tfidf-vectorized docs:')
print(datagdelt.vect_corpus_tfidf)

Minimum number of mentions: 800
Current directory: ../data/GDELT_1.0/
Dates loaded so far: []
Corpus of raw urls []
Corpus of tfidf-vectorized docs:
Empty DataFrame
Columns: []
Index: []


In [10]:
#vowels and consonants
print('Vowels:',datagdelt.vowels)
print('Consonants:',datagdelt.consonants,end=' ')
print()
print('Stemmer:',datagdelt.porter) #stemmer of choice
print('Punctuation:',datagdelt.punctuation) #punctuation regular expression
print('Tokenizer:',datagdelt.re_tokenizer) 
print('Filter for spurious url beginnings:',datagdelt.spurious_beginnings)
print('Filter for stop words:',datagdelt.stop_words)

Vowels: ['a', 'e', 'i', 'o', 'u', 'y']
Consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'] 
Stemmer: <PorterStemmer>
Punctuation: re.compile('[-.?!,":;()|0-9]')
Tokenizer: RegexpTokenizer(pattern='\\w+', gaps=False, discard_empty=True, flags=56)
Filter for spurious url beginnings: re.compile('idind.|idus.|iduk.')
Filter for stop words: {'', 'had', 'what', 'these', 'why', 'y', 'over', 'shouldn', 'is', 'himself', 'him', 'won', 'll', 'itself', 'should', 'now', 'herself', 'am', 'too', 'while', 'ourselves', 'here', 'that', 'there', 'some', 'mustn', 'being', 'yourself', 'been', 'by', 'themselves', 'how', 'weren', 'nor', 'each', 'aren', 'between', 'isn', 'yourselves', 'me', 'i', 'once', 'doesn', 'a', 'with', 'your', 'does', 'up', 'or', 'has', 're', 'whom', 'couldn', 'you', 'yours', 'mightn', 'and', 'myself', 'd', 'from', 'more', 'through', 'again', 'if', 'then', 'its', 'other', 'ma', 'for', 'but', 'so', 'hadn', 'do', 'such', 'only',

In [11]:
print(datagdelt.header,end=' ') #GDELT csv files header, notice the last field has the urls

['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_FullName', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_FullName', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_Lat', 

Now let's see what methods are available and what the pipeline is like.

First we load the urls.

In [12]:
datagdelt.load_urls('20161001','20170217') #the earliest available date is April 1st 2013 = 20130401

 Done!

Now let's look at what the url_corpus attribute looks like

In [13]:
day=5 #select one day
print('There are',len(datagdelt.url_corpus),'elements in it, because we loaded',len(datagdelt.dates),'days!')
print('The loaded day n.',day,'had',len(datagdelt.url_corpus[day-1]) ,'events in it that were mentioned more than',datagdelt.minimum_ment,'times:\n', datagdelt.url_corpus[day-1][:10],'\n etc...')
print('The first event was mentioned',datagdelt.url_corpus[day-1][0][0],'times, the second',datagdelt.url_corpus[day-1][1][0],'times, etc...')

There are 140 elements in it, because we loaded 140 days!
The loaded day n. 5 had 308 events in it that were mentioned more than 800 times:
 [[1972, 'http://www.philippinetimes.com/index.php/sid/248243461'], [1115, 'http://www.capradio.org/news/npr/story?storyid=496552413'], [970, 'http://thecabin.net/news/2016-10-04/dazzle-daze-raffle-tickets-sale'], [1951, 'http://www.whio.com/news/national-govt--politics/clinton-reaches-out-women-while-trump-defends-taxes/xnmN5QugmLzeGkEBR64y9I/'], [1965, 'http://wgno.com/2016/10/04/this-robot-is-so-realistic-that-it-helps-train-first-responders/'], [1012, 'https://www.stgeorgeutah.com/news/archive/2016/10/04/bureau-of-indian-affairs-discussion-kicks-off-free-brown-bag-lecture-series-full-schedule/'], [925, 'https://www.stgeorgeutah.com/news/archive/2016/10/04/yesco-co-owner-featured-on-undercover-boss-speaks-at-chamber-luncheon/'], [1544, 'http://www.nbcnews.com/politics/2016-election/lid-live-blogging-vice-presidential-debate-n659606?cid=public-rs

We see that many of those urls contain wording that can be very informative about what's happening in the world and therefore might tell us something about the near future of the markets!
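The structure of `url_corpus` shown above (for each day, a list of `[NumMentions, url]` pairs) can be reproduced from raw rows roughly as follows. This is a sketch under assumptions, not the package's loader: it assumes tab-separated rows without a header line, takes `NumMentions` at position 31 as in the header printed earlier, takes the url as the last field, and keeps events with at least `min_ment` mentions. The helper `fake_row` and its 58-field width exist only to build sample data.

```python
import csv
import io

NUM_MENTIONS_COL = 31  # position of 'NumMentions' in the header printed earlier
URL_COL = -1           # the source url is the last field


def events_above_threshold(tsv_text, min_ment):
    """Return [[mentions, url], ...] for rows with at least `min_ment` mentions."""
    events = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= NUM_MENTIONS_COL:
            continue  # skip short or malformed rows
        mentions = int(row[NUM_MENTIONS_COL])
        if mentions >= min_ment:
            events.append([mentions, row[URL_COL]])
    return events


def fake_row(mentions, url, n_fields=58):
    """Build a synthetic row (hypothetical helper, for the example only)."""
    fields = [""] * n_fields
    fields[NUM_MENTIONS_COL] = str(mentions)
    fields[URL_COL] = url
    return "\t".join(fields)


sample = "\n".join([
    fake_row(1972, "http://example.com/a"),
    fake_row(12, "http://example.com/b"),
])
kept = events_above_threshold(sample, min_ment=800)
print(kept)
```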

Now, let's process these messy raw urls! Let's use word2vec:

In [48]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=48)

Using word2vec vectorization procedure


2017-02-19 11:44:04,442 : INFO : collecting all words and their counts
2017-02-19 11:44:04,443 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-02-19 11:44:04,490 : INFO : collected 15000 word types from a corpus of 236610 raw words and 140 sentences
2017-02-19 11:44:04,490 : INFO : Loading a fresh vocabulary
2017-02-19 11:44:04,549 : INFO : min_count=1 retains 15000 unique words (100% of original 15000, drops 0)
2017-02-19 11:44:04,550 : INFO : min_count=1 leaves 236610 word corpus (100% of original 236610, drops 0)
2017-02-19 11:44:04,624 : INFO : deleting the raw counts dictionary of 15000 items
2017-02-19 11:44:04,625 : INFO : sample=0.001 downsamples 26 most-common words
2017-02-19 11:44:04,626 : INFO : downsampling leaves estimated 223514 word corpus (94.5% of prior 236610)
2017-02-19 11:44:04,626 : INFO : estimated required memory for 15000 words and 48 dimensions: 13260000 bytes
2017-02-19 11:44:04,683 : INFO : resetting layer weights
2017-02-19 1

which gives

In [15]:
datagdelt.word2vec_corpus.head(10)

news_date,w2v_1,w2v_10,w2v_11,w2v_12,w2v_13,w2v_14,w2v_15,w2v_16,w2v_17,w2v_18,...,w2v_22,w2v_23,w2v_24,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,w2v_8,w2v_9
20161001,-0.135579,0.14571,0.150963,-0.049898,-0.034872,0.151127,-0.080067,0.182112,-0.014674,0.103979,...,0.149336,-0.0056,-0.009186,-0.658091,-0.154392,0.203319,-0.223801,0.156598,0.02693,-0.062948
20161002,-0.148192,0.134067,0.146997,-0.033446,-0.036487,0.176244,-0.082453,0.166917,-0.034782,0.094391,...,0.136275,0.018574,0.01111,-0.66483,-0.173978,0.204555,-0.228685,0.137403,0.024689,-0.063401
20161003,-0.142121,0.134035,0.15881,-0.02574,-0.035427,0.163111,-0.074129,0.171442,-0.037753,0.115952,...,0.144167,0.006728,-0.012171,-0.658845,-0.168502,0.203247,-0.222505,0.13855,0.043704,-0.053548
20161004,-0.149176,0.124418,0.149711,-0.014482,-0.023597,0.165278,-0.072817,0.160967,-0.033454,0.093653,...,0.134668,0.024593,0.003607,-0.669298,-0.182983,0.205226,-0.228417,0.143929,0.035397,-0.062658
20161005,-0.140583,0.132568,0.160435,-0.018698,-0.019044,0.137025,-0.060016,0.171353,-0.013063,0.098221,...,0.146103,0.013903,-0.018853,-0.669843,-0.169948,0.201957,-0.21639,0.170279,0.031264,-0.069058
20161006,-0.143939,0.141425,0.136822,-0.051011,-0.05869,0.188056,-0.09693,0.162577,-0.034926,0.095348,...,0.130045,0.020371,0.02107,-0.654903,-0.166612,0.202456,-0.244021,0.125311,0.029238,-0.059137
20161007,-0.140061,0.133492,0.153134,-0.027963,-0.03749,0.158093,-0.074293,0.16045,-0.023274,0.096862,...,0.133177,0.021111,-0.002762,-0.667467,-0.170896,0.203211,-0.230707,0.149247,0.034228,-0.064662
20161008,-0.170171,0.115314,0.114239,-0.023972,-0.00252,0.212334,-0.092182,0.1532,-0.048958,0.046495,...,0.135059,0.036627,0.066802,-0.661742,-0.210109,0.196828,-0.250369,0.123787,0.018437,-0.057738
20161009,-0.123227,0.151773,0.181047,-0.044145,-0.052562,0.130993,-0.067063,0.188574,-0.0267,0.141019,...,0.156417,-0.025875,-0.046736,-0.645434,-0.139736,0.201139,-0.209978,0.150015,0.051672,-0.047011
20161010,-0.175676,0.100656,0.104136,-0.013532,0.023032,0.228106,-0.09456,0.140867,-0.056298,0.018654,...,0.134139,0.058744,0.091294,-0.664773,-0.232322,0.190728,-0.253318,0.120793,0.005628,-0.06049


BOOM! Now we have all of our datapoints with their nlp features neatly arranged in a pandas dataframe, ready for processing. Mission accomplished!
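How the per-word embeddings are turned into one row per day is not spelled out in this walkthrough. A common choice, assumed here purely for illustration and not confirmed to be what `gdelt_preprocess` does, is to average the vectors of all the tokens in the day's document. The name `document_vector` and the toy two-dimensional vectors are hypothetical.

```python
def document_vector(tokens, word_vectors, size):
    """Average the vectors of the tokens found in the vocabulary.

    Returns a zero vector when none of the tokens is known.
    """
    total = [0.0] * size
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary token
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


toy_vectors = {"tax": [1.0, 0.0], "deal": [0.0, 1.0]}  # made-up embeddings
doc_vec = document_vector(["tax", "deal", "unknownword"], toy_vectors, size=2)
print(doc_vec)
```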

If we try to run this expensive preprocessing again on the same exact data...

In [16]:
datagdelt.gdelt_preprocess(vectrz='word2vec',size_w2v=24)

Using word2vec vectorization procedure
Nothing to be done, dataframes are up to date


Yay for savings!

Now we initialize the model training class, feeding it the dataframe from the nlp processing

## The model training module (work in progress, please be patient)
This section covers model training, validation, and testing, from our model_training module

In [59]:
import model_training as mdlt
importlib.reload(mdlt)
tet=mdlt.StockPrediction([['word2vec'],[datagdelt.word2vec_corpus],[datagdelt.w2vec_model]],update=True)

Let's try an L1-regularized linear regressor (lasso) that tries to predict the increase/decrease of tomorrow's S&P index over today's. With the arguments below, we test on the last 15 days out of 60 and validate/tune, for every testing case, over the previous 15 days. As for the hyperparameters, we let our regularization parameter alpha be searched for in the 0.001-7000 range and we allow for 40 iterations of the optimal parameter search.

In [60]:
tet.auto_ts_val_test_reg('word2vec','lasso',[['alpha',[0.001,7000.,60.]]],parm_search_iter=40,n_folds_val=15,
                         past_depth=60,n_folds_test=15,scaling=True,differential=True,notest=False,verbose=False)



model_test_rmse: 15.351 benchmark_test_rmse: 15.351
model_test_rmse: 3.403 benchmark_test_rmse: 3.403
model_test_rmse: 0.653 benchmark_test_rmse: 0.653
model_test_rmse: 0.025 benchmark_test_rmse: 0.025
model_test_rmse: 15.245 benchmark_test_rmse: 15.245
model_test_rmse: 6.362 benchmark_test_rmse: 6.362
model_test_rmse: 0.909 benchmark_test_rmse: 0.909
model_test_rmse: 0.172 benchmark_test_rmse: 0.172
model_test_rmse: 11.780 benchmark_test_rmse: 11.780
model_test_rmse: 6.679 benchmark_test_rmse: 6.679
model_test_rmse: 10.525 benchmark_test_rmse: 10.525
model_test_rmse: 7.591 benchmark_test_rmse: 7.591
model_test_rmse: 9.849 benchmark_test_rmse: 9.849
model_test_rmse: 3.955 benchmark_test_rmse: 3.955
model_test_rmse: 2.056 benchmark_test_rmse: 2.056


[(6.3036606676414557, 5.1252792112508061),
 (6.3036606676414557, 5.1252792112508061)]
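The walk-forward evaluation used by `auto_ts_val_test_reg` can be pictured with a small index generator. This is a sketch of the general technique only: reading `past_depth` as the number of preceding days used for training is an assumption, and `walk_forward_splits` is a hypothetical name, not a function of the package.

```python
def walk_forward_splits(n_samples, n_test, past_depth):
    """Yield (train_indices, test_index) pairs for the last `n_test` samples.

    Each test sample is predicted from the `past_depth` samples that
    immediately precede it, so no future information leaks into training.
    """
    first_test = n_samples - n_test
    if first_test < past_depth:
        raise ValueError("not enough history for the requested past_depth")
    for test_idx in range(first_test, n_samples):
        yield list(range(test_idx - past_depth, test_idx)), test_idx


splits = list(walk_forward_splits(n_samples=100, n_test=15, past_depth=60))
for train_idx, test_idx in splits[:2]:
    print(train_idx[0], train_idx[-1], test_idx)
```

Unlike a shuffled split, every training window here ends the day before its test day, which is the property that matters for time-series data.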

The performance is not particularly promising: the model's test RMSE coincides with the benchmark's on every test day. How about we try a random forest regressor instead? We are letting our tuning select any combination among 3 values for the number of estimators, 7 for the maximum number of features used for splitting, and we allow a maximum depth from 4 to 9.

In [None]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{5,6,7}],['max_feat',{26,27,35,37,40,45,48}],
                                             ['max_depth',{4,5,6,7,8,9}]],parm_search_iter=1,n_folds_val=15,
                         past_depth=20,n_folds_test=20,scaling=True,differential=True,verbose=False,notest=False)

model_test_rmse: 6.430 benchmark_test_rmse: 7.466


In [28]:
feat_imp=tet.models['word2vec'].coef_#.feature_importances_
feat_imp
#these are the coefficients of a fitted linear model (for a random forest, use .feature_importances_ instead)

array([ 0.        ,  0.        ,  0.        , -0.        , -0.        ,
       -0.        , -0.        ,  0.        ,  0.        ,  0.        ,
       -0.        ,  0.        ,  0.        , -0.        , -0.        ,
       -0.        , -0.        ,  0.        ,  0.        ,  0.        ,
        0.        , -0.        ,  0.        ,  0.        ,  0.        ,
        1.00350041])

In [59]:
model=tet.w2v_models['word2vec']

In [131]:
model.similar_by_word('tax')

[('list', 0.9997903108596802),
 ('pay', 0.9997895956039429),
 ('media', 0.9997743368148804),
 ('american', 0.9997565746307373),
 ('job', 0.9997377991676331),
 ('governor', 0.9997320175170898),
 ('war', 0.9997308254241943),
 ('deal', 0.9997145533561707),
 ('lawmak', 0.9997134208679199),
 ('week', 0.9997121095657349)]

In [119]:
#flip the sign of selected components of the first 24 coefficients
signs=np.array([1,1,-1,1,-1,1,-1,1,1,-1,-1,-1,1,-1,1,1,1,1,-1,-1,-1,-1,-1,-1])
aa=signs*feat_imp[:24]
model.wv.similar_by_vector(aa)#,model.wv.similar_by_vector(-aa)

[('irctc', 0.8196725845336914),
 ('underdog', 0.8064718246459961),
 ('bernhard', 0.8056542873382568),
 ('trailguid', 0.7977608442306519),
 ('cadeb', 0.7947107553482056),
 ('bigli', 0.7910323739051819),
 ('mulhal', 0.7737561464309692),
 ('dickinson', 0.7713722586631775),
 ('restless', 0.768715500831604),
 ('marque', 0.7666662931442261)]

In [86]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{1,2,3,5,7}],['max_feat',{21,22,23,24}],['max_depth',{5,6,7}]],
                         parm_search_iter=1,n_folds_val=6,n_folds_test=20,scaling=True,differential=True,verbose=False,
                         notest=False)

model_test_rmse: 4.89339233333 flat_test_rmse: 8.60856807692
model_test_rmse: 8.960082 flat_test_rmse: 12.9865546429
model_test_rmse: 16.7501225 flat_test_rmse: 15.5507163333
model_test_rmse: 3.189942 flat_test_rmse: 5.4114374375
model_test_rmse: 12.386638 flat_test_rmse: 5.39292358824
model_test_rmse: 15.8485800714 flat_test_rmse: 16.8933647222
model_test_rmse: 0.0397940000003 flat_test_rmse: 4.24398626316
model_test_rmse: 7.4552 flat_test_rmse: 1.32206995
model_test_rmse: 3.16670766667 flat_test_rmse: 0.638997238095
model_test_rmse: 8.20413614286 flat_test_rmse: 14.6598230909
model_test_rmse: 5.20131128571 flat_test_rmse: 7.40724752174
model_test_rmse: 3.630006 flat_test_rmse: 1.71873020833
model_test_rmse: 6.07581386667 flat_test_rmse: 0.580156
model_test_rmse: 9.91398693571 flat_test_rmse: 11.0525086923
model_test_rmse: 7.266972 flat_test_rmse: 5.67294251852
model_test_rmse: 2.48998975 flat_test_rmse: 9.39025842857
model_test_rmse: 3.64998333333 flat_test_rmse: 6.24663241379
model_

[(7.0019043842857176, 4.5739688827417986),
 (7.1149945635660883, 4.9619411960085946)]

We see that, averaged over the test days, our model performs slightly better than the benchmark in this specific run (root mean squared error of 7.0 vs 7.1).

Now, if we want to get a prediction for tomorrow, we set the `notest` argument to True:

In [87]:
tet.auto_ts_val_test_reg('word2vec','rfreg',[['n_estim',{1,2,3,5,7}],['max_feat',{21,22,23,24}],['max_depth',{5,6,7}]],
                         parm_search_iter=1,n_folds_val=6,n_folds_test=20,scaling=True,differential=True,verbose=False,
                         notest=True)

I can't predict for tomorrow, because the stock market is closed


Exception: 

In [88]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{1,2,3,4}]],parm_search_iter=4,n_folds_val=6,n_folds_test=10,
                         scaling=True,differential=True,notest=False,verbose=False)

model_test_rmse: 3.019775 flat_test_rmse: 7.40724752174
model_test_rmse: 8.410034 flat_test_rmse: 1.71873020833
model_test_rmse: 6.6198125 flat_test_rmse: 0.580156
model_test_rmse: 17.930257 flat_test_rmse: 11.0525086923
model_test_rmse: 11.505005 flat_test_rmse: 5.67294251852
model_test_rmse: 8.880004 flat_test_rmse: 9.39025842857
model_test_rmse: 3.10017866667 flat_test_rmse: 6.24663241379
model_test_rmse: 0.929932 flat_test_rmse: 8.37825533333
model_test_rmse: 12.530029 flat_test_rmse: 5.59196196774
model_test_rmse: 3.84002725 flat_test_rmse: 0.55275684375


[(7.6765054416666727, 4.993188530339169),
 (5.6591449928088533, 3.4814802333665931)]

In [89]:
tet.auto_ts_val_test_reg('word2vec','knnreg',[['numb_nn',{1,2,3,4}]],parm_search_iter=4,n_folds_val=6,n_folds_test=10,
                         scaling=True,differential=True,notest=True,verbose=False)

I can't predict for tomorrow, because the stock market is closed


Exception: 

In [46]:
tet.auto_ts_val_test_class('word2vec','logreg',[['l1orl2?',{'l1',}],
                                                ['C',[0.0000000000001,0.001,0.000001]]],
                           parm_search_iter=30,n_folds_val=10,n_folds_test=10,past_depth=15,scaling=False,notest=False,
                           verbose=False)

test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]


[(array([[ 0.2,  1. ,  0.2]]), array([[ 0.4,  0. ,  0.4]])),
 (array([[ 0.8,  0.9,  0.7]]),
  array([[ 0.4       ,  0.3       ,  0.45825757]]))]

In [47]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5,6,7}],['max_feat',{8,10,12,14,18,22,23,24}],
                                                 ['max_depth',{4,5,6}]],parm_search_iter=1,n_folds_val=10,past_depth=15,
                           n_folds_test=20,scaling=True,notest=False,verbose=False)

test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,pre

[(array([[ 0.5,  0.9,  0.4]]),
  array([[ 0.5       ,  0.3       ,  0.48989795]])),
 (array([[ 0.65,  0.9 ,  0.55]]),
  array([[ 0.4769696 ,  0.3       ,  0.49749372]]))]

In [117]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5,6,7}],['max_feat',{8,10,12,14,18,22,23,24}],
                                                 ['max_depth',{4,5,6}]],parm_search_iter=1,n_folds_val=10,
                           n_folds_test=20,scaling=True,notest=False,verbose=False)

test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,pre

[(array([[ 0.9,  0.8,  0.7]]),
  array([[ 0.3       ,  0.4       ,  0.45825757]])),
 (array([[ 0.75,  0.75,  0.5 ]]),
  array([[ 0.4330127,  0.4330127,  0.5      ]]))]

Good! In this run our model outperforms the benchmark in terms of accuracy (=F1 in this case): 0.7 vs 0.5.

Now let's predict which way today's closing is going to go!

In [41]:
tet.auto_ts_val_test_class('word2vec','rfclass',[['n_estim',{1,2,3,4,5,6,7}],['max_feat',{8,12}],['max_depth',{4,6}]],
                           parm_search_iter=1,n_folds_val=10,n_folds_test=25,scaling=False,one_shot=True,notest=True,
                           verbose=False)

array([ 1.])

In [40]:
feat_imp=tet.models['word2vec'].feature_importances_ #random forests expose feature_importances_, not coef_
feat_imp
#remember: only the first n-2 features are nlp, the (n-1)-th is being after a weekend and the n-th is today's closing

It will go up, apparently!

In [127]:
tet.auto_ts_val_test_class('word2vec','knnclass',[['n_neighb',{1,2,3,4}]],parm_search_iter=1,
                           n_folds_val=6,n_folds_test=10,scaling=True,notest=False,verbose=False)

test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [0.0, 1.0, 0.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 1.0, 1.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]
test_rec,prec,F1: [1.0, 0.0, 0.0] benchmark_rec,prec,F1: [1.0, 0.0, 0.0]
test_rec,prec,F1: [0.0, 1.0, 0.0] benchmark_rec,prec,F1: [1.0, 1.0, 1.0]


[(array([[ 0.6,  0.9,  0.5]]),
  array([[ 0.48989795,  0.3       ,  0.5       ]])),
 (array([[ 0.9,  0.8,  0.7]]),
  array([[ 0.3       ,  0.4       ,  0.45825757]]))]

In [None]:
tet.auto_ts_val_test_class('word2vec','svmclass',[['C',[0.000001,1.,0.001]],['kernel',{'poly',}]],
                           parm_search_iter=15,n_folds_val=10,n_folds_test=15,scaling=True,notest=False,
                           verbose=False)

In [488]:
tet.yhat_reg['word2vec']

array([ 1843.12623653])

In [29]:
aa=np.zeros(32)
aa[30]=1
model.wv.similar_by_vector(res[:24])

[('mercenari', 0.5851517915725708),
 ('albuquerqu', 0.5756868124008179),
 ('camper', 0.4917409121990204),
 ('honeywel', 0.48540031909942627),
 ('consequenti', 0.4711318612098694),
 ('surinam', 0.4206582307815552),
 ('yankton', 0.41216057538986206),
 ('kat', 0.40529105067253113),
 ('lyme', 0.40325677394866943),
 ('mornington', 0.390191912651062)]

In [489]:
tet.ydata['word2vec'][1][:,0]

array([ 1828.459961])

In [358]:
tet.kfold_val_reg(10,'word2vec','svmreg',[0.000000000000000000001,'poly'],100,scaling=False,differential=True)

avg_train_rmse: 12.0795343059 avg_validation_rmse: 11.9635335428


In [92]:
tet.auto_ts_val_test_reg('word2vec','ridge',[['alpha',[0.5,5000.,6.]]],parm_search_iter=30,n_folds_val=10,
                         n_folds_test=20,scaling=True,differential=True,notest=False,verbose=False)



model_test_rmse: 2.63499875554 flat_test_rmse: 8.60856807692
model_test_rmse: 9.14430435726 flat_test_rmse: 12.9865546429
model_test_rmse: 23.9881516511 flat_test_rmse: 15.5507163333
model_test_rmse: 11.4025187908 flat_test_rmse: 5.4114374375
model_test_rmse: 5.21515508858 flat_test_rmse: 5.39292358824
model_test_rmse: 2.66571579527 flat_test_rmse: 16.8933647222
model_test_rmse: 2.77283964558 flat_test_rmse: 4.24398626316
model_test_rmse: 1.13300009011 flat_test_rmse: 1.32206995
model_test_rmse: 0.0744177398856 flat_test_rmse: 0.638997238095
model_test_rmse: 15.9667312097 flat_test_rmse: 14.6598230909
model_test_rmse: 8.5881427328 flat_test_rmse: 7.40724752174
model_test_rmse: 2.84202594862 flat_test_rmse: 1.71873020833
model_test_rmse: 3.95979868099 flat_test_rmse: 0.580156
model_test_rmse: 15.7528918614 flat_test_rmse: 11.0525086923
model_test_rmse: 13.0841439217 flat_test_rmse: 5.67294251852
model_test_rmse: 23.4026069352 flat_test_rmse: 9.39025842857
model_test_rmse: 7.63390407267 

[(8.3539267956979426, 6.9235143583625938),
 (7.1149945635660883, 4.9619411960085946)]

We initialize a class instance by loading into it two lists: one of names of your choosing and one of dataframes, which in this case is the output from the nlp-preprocessing module above, datagdelt.vect_corpus_tfidf.

In [67]:
predictorgdelt.dataset_names

['some_name_you_choose']

And now, for the real deal: k-fold training and validation!
The following method performs that in a very general manner. It lets you decide what regression model to use, as well as the values of the hyperparameters (please see the module documentation in model_training.py for details on how to pass the hyperparameters). You also need to supply the number of folds you want your data split into, and a seed for reproducibility. There is also an option to scale and normalize the features, but it doesn't perform well in general.

The method returns the model's average performance over the k training iterations. In short, tuning consists of choosing the hyperparameter values that minimize avg_validation_rmse, that is, the average root mean squared error on the validation datasets.
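To make the two returned numbers concrete, here is a minimal k-fold sketch that reports the average training and validation RMSE. It is not the package's `kfold_val_reg`: the function names are hypothetical, and a trivial "predict the training mean" model stands in for lasso or a random forest so that the example stays self-contained.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def kfold_indices(n_samples, k, seed):
    """Shuffle the indices reproducibly and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def kfold_val_mean_model(y, k, seed):
    """Return (avg_train_rmse, avg_validation_rmse) for a mean predictor."""
    folds = kfold_indices(len(y), k, seed)
    train_scores, val_scores = [], []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        mean = sum(y[j] for j in train_idx) / len(train_idx)  # the "fit" step
        train_scores.append(rmse([y[j] for j in train_idx], [mean] * len(train_idx)))
        val_scores.append(rmse([y[j] for j in val_idx], [mean] * len(val_idx)))
    return sum(train_scores) / k, sum(val_scores) / k


y = [float(v) for v in range(20)]  # made-up target values
avg_train, avg_val = kfold_val_mean_model(y, k=5, seed=100)
print("avg_train_rmse:", avg_train, "avg_validation_rmse:", avg_val)
```

Tuning then amounts to repeating this for each candidate hyperparameter value and keeping the one with the lowest average validation RMSE.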

In [91]:
#10-fold validated lasso linear regression with sliding hyperparameter alpha, seed=100, no scaling, 
#for dataset 'some_name_you_choose'.
for alpha in [12.+0.1*i for i in range(-8,8)]:
    print('alpha =',alpha)
    predictorgdelt.kfold_val_reg(10,'some_name_you_choose','lasso',alpha,100,scaling=False)

alpha = 11.2
avg_train_rmse: 11.6470100241 avg_validation_rmse: 11.8295047334
alpha = 11.3
avg_train_rmse: 11.6470935177 avg_validation_rmse: 11.8295020586
alpha = 11.4
avg_train_rmse: 11.6471777528 avg_validation_rmse: 11.8295001005
alpha = 11.5
avg_train_rmse: 11.6472627295 avg_validation_rmse: 11.8294988592
alpha = 11.6
avg_train_rmse: 11.6473484476 avg_validation_rmse: 11.8294983351
alpha = 11.7
avg_train_rmse: 11.6474349073 avg_validation_rmse: 11.8294985283
alpha = 11.8
avg_train_rmse: 11.6475221085 avg_validation_rmse: 11.829499439
alpha = 11.9
avg_train_rmse: 11.6476100511 avg_validation_rmse: 11.8295010676
alpha = 12.0
avg_train_rmse: 11.6476987351 avg_validation_rmse: 11.8295034142
alpha = 12.1
avg_train_rmse: 11.6477881607 avg_validation_rmse: 11.8295064791
alpha = 12.2
avg_train_rmse: 11.6478783276 avg_validation_rmse: 11.8295102625
alpha = 12.3
avg_train_rmse: 11.6479692359 avg_validation_rmse: 11.8295147646
alpha = 12.4
avg_train_rmse: 11.6480608856 avg_validation_rmse: 1

So we see that the minimum of avg_validation_rmse is reached for alpha = 11.6 (you'll probably get different values). We now move on to testing and use this value.

The following method, very similar to the previous one, retrains the model on the full train+validation dataset with the desired hyperparameters. If the model defines feature importances, these are returned by the method.

Importantly, the method also prints out the performance of a benchmark model (just a trivial flat prediction from today to tomorrow).

By chance, in this one case we outperform the benchmark model with a lower rmse, but this procedure should be repeated a few times and the average final performance quoted instead.
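For reference, the flat benchmark mentioned above can be sketched in a few lines. This is only an illustration under the assumption that "flat" means tomorrow's forecast equals today's close; `flat_rmse` is a hypothetical helper, not a function of the package, and the closing values are made up.

```python
import math


def flat_rmse(closes):
    """RMSE of predicting each day's close with the previous day's close."""
    predictions = closes[:-1]  # today's value, used as tomorrow's forecast
    targets = closes[1:]       # the values actually observed the next day
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up closing values, for illustration only.
closes = [1650.0, 1655.0, 1648.0, 1660.0]
print("flat benchmark rmse:", flat_rmse(closes))
```

A model is only interesting if its test rmse is consistently below this number on the same days.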

Out of curiosity, let's see what the most important features were.

...which isn't surprising. As we said at the beginning, the most important feature should have been today's closing, and it was, entirely overshadowing everything else.
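Ranking features by the magnitude of their weight or importance can be sketched as follows. The helper `rank_features`, the feature names, and the weights are all made up for illustration; the actual values come from whatever the test method returns.

```python
def rank_features(names, weights):
    """Pair each feature name with its weight, largest magnitude first."""
    return sorted(zip(names, weights), key=lambda pair: abs(pair[1]), reverse=True)


# Made-up names and weights, for illustration only.
names = ["word_a", "word_b", "todays_close"]
weights = [0.02, -0.05, 0.93]
for name, weight in rank_features(names, weights):
    print(name, weight)
```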

Let's see if classifying tomorrow's value as going up or down will do us any better...
N.B. We need to specify a decision threshold, which I recommend leaving at 0.5 for now.

As you can see, the method again returns average validation performances, which are now measured in terms of recall, precision, and F1 score. Lacking a specific metric to optimize, we are going to use the F1 score for tuning.
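To make the three metrics and the role of the decision threshold concrete, here is a small self-contained sketch. `classification_scores` is a hypothetical helper, the labels and probabilities are made up, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
def classification_scores(y_true, y_prob, threshold=0.5):
    """Return (recall, precision, F1) for binary labels at a given threshold."""
    # Assumption: probabilities at or above the threshold count as class 1.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1


# Made-up labels (1 = index goes up) and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2]
print(classification_scores(y_true, y_prob, threshold=0.5))
```

Lowering the threshold trades precision for recall; F1 is the harmonic mean of the two, so it penalizes a model that does well on only one of them.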

The performance plateaus and is optimal for alpha ~1.0.

Bingo! Our model predicts all 1's. Not much gained...

Incidentally, that's how you pull the predictions vector for a specific dataset.
In the future I'll add the option to save a specific model run instead of overwriting it, which is good for free exploration.
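Coming back to the all-ones prediction noted above: a classifier that always predicts 1 gets recall 1 by construction, and its precision is simply the fraction of up days, so its F1 score can look respectable without the model having learned anything. A minimal sketch, with a hypothetical helper and made-up labels:

```python
def all_ones_scores(y_true):
    """Recall, precision, and F1 of a classifier that always predicts 1."""
    positives = sum(1 for y in y_true if y == 1)
    if positives == 0:
        return 0.0, 0.0, 0.0
    recall = 1.0                         # every true 1 is predicted as 1
    precision = positives / len(y_true)  # share of predictions that are right
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Made-up labels with half of the days going up.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
print(all_ones_scores(y_true))
```

Any tuned classifier should be compared against these numbers before its F1 score is taken at face value.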

# Scratch from now on, please ignore!!

In [612]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','lasso',1.3,10,scaling=True)

avg_train_rmse: 9.20135417438 avg_validation_rmse: 23.8508310037


In [629]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','rfreg',[50,4500,10],10,scaling=True)

avg_train_rmse: 5.91587243572 avg_validation_rmse: 24.4121821352


In [643]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','adabreg',15,10,scaling=False)

avg_train_rmse: 11.1816716619 avg_validation_rmse: 15.3459786376


In [670]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','knnreg',7,10,scaling=False)

avg_train_rmse: 10.342872738 avg_validation_rmse: 11.4892361364


In [671]:
aa=predictorgdelt.kfold_test_reg('apriltodectfidf','knnreg',7)

model_test_rmse: 16.0651087792 flat_test_rmse: 14.4209762157


In [727]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','knnclass',7,10,[0.5])

avg_train_rec,prec,F1: [0.81837529044943147, 0.70479197132136429, 0.75720674859733283] avg_validation_rec,prec,F1: [0.76335497835497834, 0.64537684537684537, 0.6926744610887835]


In [724]:
predictorgdelt.kfold_test_class('apriltodectfidf','knnclass',7,[0.5])

test_rec,prec,F1: [0.5, 0.41666666666666669, 0.45454545454545453]


In [738]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','rfclass',[40,5,10],10,[0.5])

avg_train_rec,prec,F1: [1.0, 0.71820616787952551, 0.83593883914424061] avg_validation_rec,prec,F1: [1.0, 0.59458333333333324, 0.73921100638491943]


In [739]:
predictorgdelt.kfold_test_class('apriltodectfidf','rfclass',[40,5,10],[0.5])

test_rec,prec,F1: [1.0, 0.51282051282051277, 0.67796610169491522]


In [744]:
sum(predictorgdelt.ydata['apriltodectfidf'][1][:,1])/len(predictorgdelt.ydata['apriltodectfidf'][1][:,1])

0.51282051282051277

In [778]:
mdlt.scores(predictorgdelt.ydata['apriltodectfidf'][1][:,1],np.ones(len(predictorgdelt.ydata['apriltodectfidf'][1])),[0.5])

(1.0, 0.5128205128205128, 0.6779661016949152, 0.5128205128205128)

In [779]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','logreg',['l1',1.5],10,[0.5])

avg_train_rec,prec,F1: [1.0, 0.59213718334048937, 0.74374608177131407] avg_validation_rec,prec,F1: [1.0, 0.59458333333333324, 0.73921100638491943]


In [768]:
aa=predictorgdelt.kfold_test_class('apriltodectfidf','logreg',['l1',2.5],[0.5])

test_rec,prec,F1: [1.0, 0.51282051282051277, 0.67796610169491522]


In [773]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','svmclass',[1.,'poly'],10,[0.5])

avg_train_rec,prec,F1: [0.76262732475581552, 0.77500920878424662, 0.7669871458718045] avg_validation_rec,prec,F1: [0.66152958152958141, 0.61717171717171726, 0.62368359527432093]


In [774]:
predictorgdelt.kfold_test_class('apriltodectfidf','svmclass',[1.,'poly'],[0.5])

test_rec,prec,F1: [0.65000000000000002, 0.59090909090909094, 0.61904761904761907]


In [775]:
predictorgdelt.yhat_class

{'apriltodectfidf': array([ 0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
         0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,
         0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.])}

In [772]:
predictorgdelt.ydata['apriltodectfidf'][1][:,1]

array([ 1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  1.])

In [660]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','svmreg',[15,'poly'],10,scaling=False)

KeyboardInterrupt: 

In [649]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','mlpreg',['relu',(100,)],10,scaling=False)

avg_train_rmse: 12.6222333907 avg_validation_rmse: 12.1568874732


In [657]:
aa=predictorgdelt.kfold_test_reg('apriltodectfidf','mlpreg',['relu',(100,)])

model_test_rmse: 14.8924855598 flat_test_rmse: 14.4209762157


In [561]:
predictorgdelt.yhat_reg

{'apriltodectfidf': array([ 1744.52278741,  1803.29915885,  1694.4451712 ,  1693.34940803,
         1837.50763223,  1661.97637678,  1576.88280783,  1647.17413376,
         1653.98888935,  1660.05401055,  1630.59372504,  1760.84365842,
         1594.12637618,  1693.15485608,  1657.5242832 ,  1559.35083745,
         1693.66655617,  1770.69578534,  1678.18873354,  1618.16557548,
         1798.57971612,  1595.75929349,  1790.80377837,  1762.44887651,
         1569.11649056,  1588.40738375,  1769.48474365,  1773.99257709,
         1789.57343515,  1688.86454837,  1637.86982931,  1793.05291322,
         1655.27695931,  1687.8802146 ,  1631.17041088,  1657.8406513 ,
         1655.79591841,  1562.46506112,  1711.34272805])}

In [562]:
predictorgdelt.ydata['apriltodectfidf'][1][:,0]

array([ 1754.670044,  1800.900024,  1703.199951,  1689.469971,
        1841.069946,  1650.469971,  1553.689941,  1642.810059,
        1667.469971,  1630.47998 ,  1612.52002 ,  1767.930054,
        1592.430054,  1685.72998 ,  1655.079956,  1541.609985,
        1681.550049,  1767.689941,  1655.449951,  1606.280029,
        1795.150024,  1573.089966,  1785.030029,  1756.540039,
        1570.25    ,  1593.609985,  1747.150024,  1786.540039,
        1787.869995,  1689.130005,  1633.77002 ,  1792.810059,
        1628.930054,  1706.869995,  1639.040039,  1652.619995,
        1642.800049,  1562.5     ,  1698.060059])

In [620]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','ridge',0.001,10,scaling=True)

avg_train_rmse: 6.33518759404 avg_validation_rmse: 24.5460444404


In [385]:
aa=predictorgdelt.kfold_test_reg('7daystfidf','lasso',37.5)

model_test_rmse: 14.2964522937 flat_test_rmse: 13.5413782105


In [417]:
predictorgdelt.kfold_val_reg(15,'7daystfidf','ridge',8400.,10)

avg_train_rmse: 11.1986058145 avg_validation_rmse: 10.800359651


In [418]:
aa=predictorgdelt.kfold_test_reg('7daystfidf','ridge',8400.)

model_test_rmse: 14.3814887421 flat_test_rmse: 13.5413782105


In [263]:
predictorgdelt.kfold_val_reg(10,'7daystfidf','svreg',[0.01,'poly'],10)

avg_train_rmse: 11.0071954777 avg_validation_rmse: 12.3585904621


In [264]:
predictorgdelt.kfold_test_reg('7daystfidf','svreg',[0.01,'poly'])

model_test_rmse: 10.3370215097 flat_test_rmse: 10.8353511909


In [161]:
key_cols=list(datagdelt.vect_corpus_tfidf.columns)+['*weekend?','*yesterdayS&P']

In [769]:
ab=aa[0]#model.coef_[0]
[[key_cols[i],ab[i]] for i in np.argsort(abs(ab))[::-1]]

[['north', 2.1730974001773933],
 ['train', 1.5123459726524862],
 ['snowden', 1.264379084303753],
 ['crew', -1.130072887205146],
 ['big', 1.0974619202530456],
 ['ae', -0.84521265295771053],
 ['milit', -0.26812945924067383],
 ['*yesterdayS&P', 1.8144839374748277e-05],
 ['juli', 0.0],
 ['exam', 0.0],
 ['everi', 0.0],
 ['everybodi', 0.0],
 ['evict', 0.0],
 ['evid', 0.0],
 ['evolut', 0.0],
 ['ex', 0.0],
 ['exactli', 0.0],
 ['examin', 0.0],
 ['egyptian', 0.0],
 ['exce', 0.0],
 ['except', 0.0],
 ['exception', 0.0],
 ['exchang', 0.0],
 ['exclus', 0.0],
 ['exec', 0.0],
 ['execut', 0.0],
 ['juror', 0.0],
 ['ever', 0.0],
 ['eventu', 0.0],
 ['event', 0.0],
 ['etern', 0.0],
 ['ethanol', 0.0],
 ['ethnic', 0.0],
 ['etx', 0.0],
 ['eu', 0.0],
 ['eurasian', 0.0],
 ['eurobank', 0.0],
 ['europ', 0.0],
 ['european', 0.0],
 ['eurozon', 0.0],
 ['evacu', 0.0],
 ['evad', 0.0],
 ['evalu', 0.0],
 ['eve', 0.0],
 ['even', 0.0],
 ['exelon', 0.0],
 ['exercis', 0.0],
 ['exhaust', 0.0],
 ['express', 0.0],
 ['extens', 

In [572]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','logreg',['l1',0.095],10,0.5)

avg_train_rec,prec,F1: [1.0, 0.77187932464248254, 0.87119529697081077] avg_validation_rec,prec,F1: [0.97070707070707074, 0.62297619047619035, 0.75461295226512615]


In [573]:
aa=predictorgdelt.kfold_test_class('apriltodectfidf','logreg',['l1',0.095],0.5)

test_rec,prec,F1: [0.93333333333333335, 0.3888888888888889, 0.5490196078431373]


In [583]:
len(predictorgdelt.ydata['apriltodectfidf'][1][:,1])

39

In [574]:
predictorgdelt.yhat_class['apriltodectfidf']

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.])

In [247]:
predictorgdelt.kfold_val_class(10,'7daystfidf','svclass',[0.01,'poly'],10,0.5)

avg_train_rec,prec,F1: [0.99316945840049764, 0.87472075977840691, 0.93013643870901608] avg_validation_rec,prec,F1: [0.78934065934065933, 0.60476190476190483, 0.65097784615687426]


In [248]:
predictorgdelt.kfold_test_class('7daystfidf','svclass',[0.01,'poly'],0.5)

test_rec,prec,F1: [0.69230769230769229, 0.75, 0.71999999999999986]


In [249]:
predictorgdelt.yhat_class['7daystfidf']

array([ 0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  0.,  0.])

In [252]:
predictorgdelt.ydata['7daystfidf'][1][:,1]

array([ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.])

In [1248]:
#downloading and unzipping, run at your own risk, contains dreadful shell commands
for date in range(20131001,20131032):
    os.system('wget http://data.gdeltproject.org/events/'+str(date)+'.export.CSV.zip')
    os.system('unzip '+str(date)+'.export.CSV.zip')
    os.system('mv '+str(date)+'.export.CSV data/GDELT_1.0')
    os.system('rm '+str(date)+'.export.CSV.zip')

In [11]:
!ls -hl data/GDELT_1.0/20130401.export.CSV

-rw-r--r--  1 Maxos  staff    10M May 20  2013 data/GDELT_1.0/20130401.export.CSV


In [872]:
header_daily=pd.read_csv('../data/GDELT_1.0/CSV.header.dailyupdates.txt',delimiter='\t')

In [908]:
import pandas as pd
#this is just to show what the GDELT files look like
sample_df=pd.read_csv('../data/GDELT_1.0/20130401.export.CSV',delimiter='\t')
sample_df.columns=list(header_daily)
sample_df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,253461012,20030404,200304,2003,2003.2575,AUS,AUSTRALIA,AUS,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.4199,-117.122,NV,20130401,http://www.startribune.com/nation/200818961.html


In [906]:
droppers=[row_ind for row_ind in range(len(sample_df)) if sample_df.iloc[row_ind,7]=='AUS']
droppers

[0,
 52,
 639,
 640,
 641,
 642,
 4840,
 4841,
 4842,
 4843,
 4844,
 4845,
 4846,
 4847,
 4848,
 4849,
 4850,
 4851,
 4852,
 4853,
 4854,
 4855,
 4856,
 4857,
 4858,
 4859,
 4860,
 4861,
 4862,
 4863,
 4864,
 4865,
 4866,
 4867,
 4868,
 4869,
 4870,
 4871,
 4872,
 4873,
 4874,
 4875,
 4876,
 4877,
 4878,
 4879,
 4880,
 4881,
 4882,
 4883,
 4884,
 4885,
 4886,
 4887,
 4888,
 4889,
 4890,
 4891,
 4892,
 4893,
 4894,
 4895,
 4896,
 4897,
 4898,
 4899,
 4900,
 4901,
 4902,
 4903,
 4904,
 4905,
 4906,
 4907,
 4908,
 4909,
 4910,
 4911,
 4912,
 4913,
 4914,
 4915,
 4916,
 4917,
 4918,
 4919,
 4920,
 4921,
 4922,
 4923,
 4924,
 4925,
 4926,
 4927,
 4928,
 4929,
 4930,
 4931,
 4932,
 4933,
 4934,
 4935,
 4936,
 4937,
 4938,
 4939,
 4940,
 4941,
 4942,
 4943,
 4944,
 4945,
 4946,
 4947,
 4948,
 4949,
 4950,
 4951,
 4952,
 4953,
 4954,
 4955,
 4956,
 4957,
 4958,
 4959,
 4960,
 4961,
 4962,
 4963,
 4964,
 4965,
 4966,
 4967,
 4968,
 4969,
 4970,
 4971,
 4972,
 4973,
 4974,
 4975,
 4976,
 4977,
 

In [909]:
sample_df.drop(droppers)

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.46670,114.15000,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.46670,114.15000,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.00000,133.00000,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.41990,-117.12200,NV,20130401,http://www.startribune.com/nation/200818961.html
5,253461017,20030404,200304,2003,2003.2575,UIS,THE INTERNATIONAL COMMUNITY,,,,...,AG,1,Algeria,AG,AG,28.00000,3.00000,AG,20130401,BBC Monitoring
6,253461018,20030404,200304,2003,2003.2575,USA,NEW YORK,USA,,,...,,2,"Hawaii, United States",US,USHI,21.10980,-157.53100,HI,20130401,http://www.philippinetimes.com/index.php/sid/2...
7,253461019,20030404,200304,2003,2003.2575,USA,NEW YORK,USA,,,...,,2,"Nevada, United States",US,USNV,38.41990,-117.12200,NV,20130401,http://www.theglobeandmail.com/life/health-and...
8,253461020,20030404,200304,2003,2003.2575,USA,NEW YORK,USA,,,...,,2,"New York, United States",US,USNY,42.14970,-74.93840,NY,20130401,http://www.theglobeandmail.com/life/health-and...
9,253461021,20120401,201204,2012,2012.2493,,,,,,...,-1955538,4,"Brussels, Bruxelles-Capitale, Belgium",BE,BE11,50.83330,4.33333,-1955538,20130401,http://www.channelnewsasia.com/news/world/us-u...
10,253461022,20120401,201204,2012,2012.2493,,,,,,...,,0,,,,,,,20130401,http://www.miamiherald.com/2013/04/01/3317988/...


In [1206]:
corpus_url[0][1][1]

'http://www.rte.ie/news/2013/0401/379281-india-drug-patent-novartis/'

## Trying out different regressors on the data, no luck so far :(

In [896]:
#model=LogisticRegression(penalty='l2', C=1.) #, dual=False, tol=0.0001,, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
#model=RandomForestClassifier(n_estimators=10,max_features=4000,max_depth=5)#,max_features=7000)
#model=AdaBoostClassifier(n_estimators=20)
#model=MLPClassifier(activation='logistic',hidden_layer_sizes=(100,10,5))
#model=KNeighborsClassifier(n_neighbors=15)
model=SVC(C=.8,kernel='poly')
model.fit(x_tfidf_class_trainval,y_tfidf_class_trainval)
#model.fit(x_bow_class_trainval,y_bow_class_trainval)

SVC(C=0.8, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)