# Introduction

Predicting financial indicators is something of a holy grail for our society at its present stage. There is a vast literature on how to do this, and the general approach is a time-series one: predict the future of a quantity from that quantity's past.

We are trying to see whether this approach can be complemented with data from news sources, reasoning that world news should weigh, directly or indirectly, on the performance of stocks, employment rates, or inflation.

Please keep in mind that we do not expect to make any significant improvement over state-of-the-art financial analyses (which involve much more complex and refined models). Rather, we are interested in building a scalable and dynamic pipeline that in the future might supplement those already-existing models or give interesting insights.

### This notebook

This notebook tries to predict future S&P500 closing values based on past S&P500 values along with NLP features extracted from the daily-updated GDELT 1.0 (http://www.gdeltproject.org/) event database.

In particular, to scope the analysis down to a minimally viable scalable pipeline, I extract features from the urls contained in the database (one associated with each event).

For each day, all urls are parsed, tokenized, stemmed, and merged into a single bag of words, weighted by the number of mentions of the event associated with each url. This constitutes one document. After that, I either apply a tf-idf vectorization or stick with the plain bag of words.
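To make the per-day document concrete, here is a minimal sketch of the idea using only the standard library. It is not the code in nlp_preprocessing: the helper names `url_tokens` and `daily_bag_of_words`, the tiny stop-word list, the choice to tokenize only the url path, and the linear weighting by mentions are all assumptions, and stemming is left out. The two urls are taken from the GDELT sample printed later in this notebook.

```python
import re
from collections import Counter
from urllib.parse import urlparse

TOKEN_RE = re.compile(r"[a-z]+")  # letters only, so digits and punctuation are dropped
STOP_WORDS = {"the", "at", "has", "was", "us"}  # tiny illustrative list (assumption)

def url_tokens(url):
    """Split the path of a url into lowercase word tokens (hypothetical helper)."""
    path = urlparse(url).path.lower()
    return [t for t in TOKEN_RE.findall(path) if t not in STOP_WORDS]

def daily_bag_of_words(events):
    """events: iterable of (mentions, url). Returns a mention-weighted Counter."""
    bag = Counter()
    for mentions, url in events:
        for token in url_tokens(url):
            bag[token] += mentions  # linear weighting by mentions (assumption)
    return bag

events = [
    (556, "http://www.10news.com/money/consumer/ford-recalls-370000-cars-09012013"),
    (410, "http://www.todayonline.com/world/americas/kerry-us-has-evidence-sarin-gas-was-used-syria"),
]
bag = daily_bag_of_words(events)
print(bag.most_common(5))
```

Each day then becomes one such weighted bag, and the collection of days is what a bag-of-words or tf-idf vectorizer would turn into a feature matrix.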

I use the extracted features (plus the same day's closing S&P500) to try and fit various regression models to predict the next day's S&P500 and compare them to the flat model, i.e. predicting the same for tomorrow as today. Unfortunately there is no clear benefit so far.
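As a reference point, the flat model needs no fitting at all. The sketch below computes its RMSE on a handful of closing values, rounded from the arrays printed later in this notebook; the `rmse` helper is written here for illustration and is not taken from model_training.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Five consecutive S&P500 closes (rounded to two decimals).
closes = [1632.97, 1639.77, 1653.08, 1655.08, 1655.17]

today = closes[:-1]     # what the flat model sees
tomorrow = closes[1:]   # what it has to predict

# The flat model simply predicts tomorrow's close to equal today's close.
flat_rmse = rmse(tomorrow, today)
print("flat model RMSE:", round(flat_rmse, 2))
```

Any regression model is only interesting if its test RMSE falls below this number on the same days.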

I also try to predict if tomorrow's index value will rise or fall, given today's news. This approach seems more promising but as of now a successful example of this is not yet included in this notebook.

In [4]:
import importlib
import sys
import os
sourcedir=os.path.split(os.getcwd())[0]+"/source"
if sourcedir not in sys.path:
    sys.path.append(sourcedir)
import numpy as np

In [5]:
#importing our nlp preprocessing module, the reload command is for development
import nlp_preprocessing as nlpp
importlib.reload(nlpp)
#importing our model training module, the reload command is for development
import model_training as mdlt
importlib.reload(mdlt)

<module 'model_training' from '/Users/Maxos/Desktop/Insight_stuff/bigsnippyrepo/maqro/source/model_training.py'>

## The nlp-preprocessing module

The module has two classes for now: one handles the NLP preprocessing of Google News articles, which are discussed in much more depth in another notebook; the other is its analog for GDELT url data.

Let's explore these classes and their contents.

### The CorpusGoogleNews class

In [3]:
#del datagnews
datagnews=nlpp.CorpusGoogleNews() # nlpp.CorpusGoogleNews('some/data/directory') 

These are the attributes of the initialized class

In [4]:
datagnews.raw_articles

{}

In [5]:
datagnews.datadirectory

'../data/'

There is one public method for now: it loads files from the data folder

In [57]:
datagnews.data_directory_crawl('AAPL',verbose=1)

Apple Inc
Apple Inc 1-26-17
Apple Inc 1-27-17
Apple Inc 1-30-17
Apple Inc 1-31-17
Apple Inc 2-1-17


which populates datagnews.raw_articles with dataframes like this:

In [58]:
datagnews.raw_articles['Apple Inc 1-30-17'].head()

Unnamed: 0,category,title,body
0,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...,The first day of public trading with President...
1,Apple Inc,3 Stocks to Watch on Tuesday: Apple Inc. (AAPL...,The first day of public trading with President...
2,Apple Inc,Alphabet Inc (GOOGL) Steals AI Expert Back Fro...,"The smart home market continues to heat up, an..."
3,Apple Inc,Apple (AAPL) Set to Meet Government Officials ...,"Reportedly, Apple Inc.’s AAPL management is sc..."
4,Apple Inc,Apple Close to Signing Deal With Indian Govern...,Apple Inc. (AAPL) executives were in India tod...


### The CorpusGDELT class

In [154]:
#del datagdelt
datagdelt=nlpp.CorpusGDELT(min_ment=400) # min_ment defaults to 1 and cuts off events that have a 
#low number of mentions

In [13]:
#minimum number of mentions for one event to be used
print('Minimum number of mentions:',datagdelt.minimum_ment)
print('Current directory:',datagdelt.currentdir) # current directory
print('Dates loaded so far:',datagdelt.dates) # dates for which data has been loaded so far
print('Corpus of raw urls',datagdelt.url_corpus)
print('Corpus of tfidf-vectorized docs:')
print(datagdelt.vect_corpus_tfidf)

Minimum number of mentions: 400
Current directory: ../data/GDELT_1.0/
Dates loaded so far: []
Corpus of raw urls []
Corpus of tfidf-vectorized docs:
Empty DataFrame
Columns: []
Index: []


In [8]:
#vowels and consonants
print('Vowels:',datagdelt.vowels)
print('Consonants:',datagdelt.consonants,end=' ')
print()
print('Stemmer:',datagdelt.porter) #stemmer of choice
print('Punctuation:',datagdelt.punctuation) #punctuation regular expression
print('Tokenizer:',datagdelt.re_tokenizer) 
print('Filter for spurious url beginnings:',datagdelt.spurious_beginnings)
print('Filter for stop words:',datagdelt.stop_words)

Vowels: ['a', 'e', 'i', 'o', 'u', 'y']
Consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z'] 
Stemmer: <PorterStemmer>
Punctuation: re.compile('[-.?!,":;()|0-9]')
Tokenizer: RegexpTokenizer(pattern='\\w+', gaps=False, discard_empty=True, flags=56)
Filter for spurious url beginnings: re.compile('idind.|idus.|iduk.')
Filter for stop words: {'', 'my', 'do', 'the', 'am', 'ourselves', 'were', 'shouldn', 'ain', 'all', 'himself', 'during', 'needn', 'didn', 'itself', 're', 'd', 'about', 'their', 'once', 'while', 'both', 'herself', 'over', 'no', 'aren', 'this', 'own', 'any', 'by', 'ma', 'should', 'where', 'same', 'out', 'ours', 'he', 'why', 'nor', 'its', 'y', 'because', 'as', 'under', 'her', 'between', 'too', 'can', 'or', 'had', 'me', 'we', 'has', 'to', 'hadn', 'myself', 'll', 'through', 'hers', 'in', 'are', 'm', 'which', 'haven', 'those', 'what', 'only', 'few', 'up', 'if', 'most', 'his', 'don', 'have', 'into', 'is', 'when', 'your', 'fu

In [9]:
print(datagdelt.header,end=' ') #GDELT csv files header, notice the last field has the urls

['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_FullName', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_FullName', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_FullName', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_Lat', 

Now let's see what methods are available and what the pipeline is like.

First we load the urls.

In [155]:
datagdelt.load_urls('20130901','20131031') #the earliest available date is April 1st 2013 = 20130401

loading news for 20130901
loading news for 20130902
loading news for 20130903
loading news for 20130904
loading news for 20130905
loading news for 20130906
loading news for 20130907
loading news for 20130908
loading news for 20130909
loading news for 20130910
loading news for 20130911
loading news for 20130912
loading news for 20130913
loading news for 20130914
loading news for 20130915
loading news for 20130916
loading news for 20130917
loading news for 20130918
loading news for 20130919
loading news for 20130920
loading news for 20130921
loading news for 20130922
loading news for 20130923
loading news for 20130924
loading news for 20130925
loading news for 20130926
loading news for 20130927
loading news for 20130928
loading news for 20130929
loading news for 20130930
loading news for 20131001
loading news for 20131002
loading news for 20131003
loading news for 20131004
loading news for 20131005
loading news for 20131006
loading news for 20131007
loading news for 20131008
loading news

Now let's look at what the url_corpus attribute looks like

In [49]:
day=1 #please use 3 or 4 here
print('There are',len(datagdelt.url_corpus),'elements in it, because we loaded',len(datagdelt.dates),'days!')
print('The loaded day n.',day,'had',len(datagdelt.url_corpus[day-1]) ,'events in it that were mentioned more than',datagdelt.minimum_ment,'times:', datagdelt.url_corpus[day-1][:10],'\n etc...')
print('The first event was mentioned',datagdelt.url_corpus[day-1][0][0],'times, the second',datagdelt.url_corpus[day-1][1][0],'times, etc...')

There are 60 elements in it, because we loaded 60 days!
The loaded day n. 1 had 28 events in it that were mentioned more than 400 times: [[441, 'http://economictimes.indiatimes.com/news/news-by-industry/cons-products/food/creambell-to-attain-rs-500-crore-sales-by-end-2014-devyani-food/articleshow/22201890.cms'], [572, 'http://economictimes.indiatimes.com/news/international-business/zain-saudi-appoints-telecoms-veteran-hassan-kabbani-as-ceo/articleshow/22210051.cms'], [556, 'http://www.10news.com/money/consumer/ford-recalls-370000-cars-09012013'], [617, 'http://www.10news.com/money/consumer/ford-recalls-370000-cars-09012013'], [406, 'http://www.firstpost.com/world/radiation-readings-spike-at-water-tank-at-japans-ruined-nuclear-plant-1077395.html'], [496, 'http://romenews-tribune.com/bookmark/23500185'], [496, 'http://romenews-tribune.com/bookmark/23500185'], [410, 'http://www.todayonline.com/world/americas/kerry-us-has-evidence-sarin-gas-was-used-syria'], [474, 'http://www.therecord.com

We see that at least one of those urls contains wordings that can be very informative on what's happening in the world and therefore might tell us something about the near future of the markets!!
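The nested structure printed above (one list per day, each holding `[mentions, url]` pairs) is easy to mimic. The sketch below only illustrates the effect of the mentions cutoff; `filter_by_mentions` is a hypothetical helper rather than a method of CorpusGDELT, the inclusive comparison is an assumption, and the third event is made up for the example.

```python
def filter_by_mentions(events, min_ment=1):
    """Keep [mentions, url] pairs with at least min_ment mentions (inclusive cutoff assumed)."""
    return [[mentions, url] for mentions, url in events if mentions >= min_ment]

day_events = [
    [556, "http://www.10news.com/money/consumer/ford-recalls-370000-cars-09012013"],
    [406, "http://www.firstpost.com/world/radiation-readings-spike-at-water-tank-at-japans-ruined-nuclear-plant-1077395.html"],
    [12, "http://example.com/local-news/minor-story"],  # hypothetical low-mention event
]

kept = filter_by_mentions(day_events, min_ment=400)
print(len(kept), "of", len(day_events), "events survive the cutoff")
```

A high cutoff keeps the vocabulary small and the preprocessing fast, at the price of discarding most of the day's events.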

Now, let's process these messy raw urls!

In [50]:
datagdelt.gdelt_preprocess(tfidf=True)

And let's see what happened to the vect_corpus_tfidf attribute...

In [51]:
datagdelt.vect_corpus_tfidf.head(20)

Unnamed: 0_level_0,aab,aacbec,aacd,aaff,aaron,ab,abandon,abc,abcf,abcfa,...,your,youth,yulia,yyg,zagoklj,zealand,zeidan,zidan,zient,zoo
news_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
20130901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.310006,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130903,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135148,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036081,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054006
20130906,0.0,0.0,0.0,0.0,0.0,0.0,0.025471,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130908,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130910,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


BOOM! Now we have all of our datapoints with their NLP features neatly arranged in a pandas dataframe. This is ready for processing. Mission accomplished!

Notice that the dataframe is extremely sparse: almost every entry is zero, so each individual feature carries very little signal. But again, this is just a proof of principle.
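To see where the sparsity comes from, here is a small sketch that builds a tf-idf matrix by hand and measures the fraction of zero entries. It uses the textbook weighting tf * log(N / df); the vectorizer actually used by the module is not shown in this notebook, and library implementations typically smooth and normalize differently, so treat the numbers as illustrative only. The three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows of tf * log(N / df))."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each stem once per document
    vocab = sorted(doc_freq)
    rows = []
    for doc in docs:
        term_freq = Counter(doc)   # missing stems count as 0
        rows.append([term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows

# Three toy "days", each already tokenized and stemmed (made-up data).
docs = [
    ["ford", "recal", "car"],
    ["syria", "attack", "kerri"],
    ["ford", "sale"],
]
vocab, rows = tfidf_matrix(docs)

n_cells = len(rows) * len(vocab)
n_zero = sum(1 for row in rows for value in row if value == 0.0)
sparsity = n_zero / n_cells
print(vocab)
print("fraction of zero entries:", round(sparsity, 3))
```

Because each day mentions only a small slice of the whole vocabulary, the fraction of zeros grows quickly as more days (and therefore more distinct stems) are loaded.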

If we try to run this expensive preprocessing again on the same exact data...

In [52]:
datagdelt.gdelt_preprocess(tfidf=True)

Nothing to be done, dataframes are up to date


In [57]:
print("Btw, if you're curious, the extracted words (stems, actually, the endings are absent) are:", list(datagdelt.vect_corpus_tfidf.columns)[:30],'...')

Btw, if you're curious, the extracted words (stems, actually, the endings are absent) are: ['aab', 'aacbec', 'aacd', 'aaff', 'aaron', 'ab', 'abandon', 'abc', 'abcf', 'abcfa', 'abduct', 'abe', 'abid', 'abil', 'aboard', 'abort', 'abrio', 'abroad', 'ac', 'aca', 'acabd', 'acapulco', 'acbecefb', 'acceler', 'access', 'accid', 'accolad', 'accord', 'account', 'accredit'] ...


## The model training module (work in progress, please be patient)
This section covers model training, validation, and testing with our model_training module.

We initialize a class instance by loading into it two lists: one of names of your choosing and one of dataframes, which in this case is the output from the previous module, datagdelt.vect_corpus_tfidf.

In [61]:
#del predictorgdelt
predictorgdelt=mdlt.StockPrediction([['some_name_you_choose'],[datagdelt.vect_corpus_tfidf]])

In [66]:
predictorgdelt.dataset_list[0].head()

Unnamed: 0_level_0,aab,aacbec,aacd,aaff,aaron,ab,abandon,abc,abcf,abcfa,...,your,youth,yulia,yyg,zagoklj,zealand,zeidan,zidan,zient,zoo
news_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
20130901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.310006,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130903,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135148,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036081,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20130905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054006


In [67]:
predictorgdelt.dataset_names

['some_name_you_choose']

Next we convert these dataframes into numpy arrays formatted to feed into scikit-learn. Notice that from now on you use the name you chose before to process that specific dataset.

In [62]:
predictorgdelt.prepare_data('some_name_you_choose')

And, as you see below, we have now populated dictionaries (the key is the dataset name you chose) containing x datasets...

In [68]:
predictorgdelt.xdata

{'some_name_you_choose': array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   1.00000000e+00,   1.63296997e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.63977002e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.65307996e+03],
        ..., 
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.76210999e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.77194995e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.76331006e+03]])}

... and three different prediction options for the y set: tomorrow's S&P, tomorrow up or down?, tomorrow-today, respectively.

In [69]:
predictorgdelt.ydata

{'some_name_you_choose': array([[  1.63977002e+03,   1.00000000e+00,   6.80004900e+00],
        [  1.65307996e+03,   1.00000000e+00,   1.33099360e+01],
        [  1.65507996e+03,   1.00000000e+00,   2.00000000e+00],
        [  1.65517004e+03,   1.00000000e+00,   9.00880000e-02],
        [  1.67170996e+03,   1.00000000e+00,   1.65399170e+01],
        [  1.68398999e+03,   1.00000000e+00,   1.22800290e+01],
        [  1.68913000e+03,   1.00000000e+00,   5.14001500e+00],
        [  1.68342004e+03,   0.00000000e+00,  -5.70996100e+00],
        [  1.68798999e+03,   1.00000000e+00,   4.56994600e+00],
        [  1.69759998e+03,   1.00000000e+00,   9.60998600e+00],
        [  1.70476001e+03,   1.00000000e+00,   7.16003400e+00],
        [  1.72552002e+03,   1.00000000e+00,   2.07600100e+01],
        [  1.72233997e+03,   0.00000000e+00,  -3.18005400e+00],
        [  1.70991003e+03,   0.00000000e+00,  -1.24299320e+01],
        [  1.70183997e+03,   0.00000000e+00,  -8.07006800e+00],
        [  1.697
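The three columns can be rebuilt from a plain list of closing values. The sketch below is an illustration, not the module's prepare_data: `build_targets` is a hypothetical helper, and labeling an unchanged close as 0 is an assumption. The closes are rounded from the output above.

```python
def build_targets(closes):
    """Return (tomorrow_close, went_up, change) for each day that has a next day."""
    targets = []
    for today, tomorrow in zip(closes, closes[1:]):
        change = tomorrow - today
        went_up = 1.0 if change > 0 else 0.0  # ties count as "not up" (assumption)
        targets.append((tomorrow, went_up, change))
    return targets

# Four consecutive closes, rounded to two decimals.
closes = [1689.13, 1683.42, 1687.99, 1697.60]
targets = build_targets(closes)
for tomorrow, went_up, change in targets:
    print(tomorrow, went_up, round(change, 2))
```

The first column feeds the regression models, the second the classifiers, and the third is an alternative regression target that removes the level of the index.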

Now, how about we split the dataset into train+validation and test?

In [71]:
predictorgdelt.trval_test_split('some_name_you_choose',0.2) #0.2 here is the fraction #testset/#total

Ha! Now the x/ydata dictionary entry has been split into a 2-tuple of (training+validation,test) datapoints

In [72]:
predictorgdelt.xdata['some_name_you_choose']

(array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.74466003e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.67170996e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.67866003e+03],
        ..., 
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.68913000e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   1.00000000e+00,   1.74450000e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           5.40056182e-02,   0.00000000e+00,   1.65507996e+03]]),
 array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
           0.00000000e+00,   0.00000000e+00,   1.72154004e+03],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+0

In [73]:
predictorgdelt.ydata['some_name_you_choose']

(array([[  1.75467004e+03,   1.00000000e+00,   1.00100100e+01],
        [  1.68398999e+03,   1.00000000e+00,   1.22800290e+01],
        [  1.69050000e+03,   1.00000000e+00,   1.18399660e+01],
        [  1.77194995e+03,   1.00000000e+00,   9.83996600e+00],
        [  1.67612000e+03,   0.00000000e+00,  -1.43800050e+01],
        [  1.68913000e+03,   1.00000000e+00,   5.14001500e+00],
        [  1.69867004e+03,   1.00000000e+00,   5.90002400e+00],
        [  1.69175000e+03,   0.00000000e+00,  -6.92004400e+00],
        [  1.70476001e+03,   1.00000000e+00,   7.16003400e+00],
        [  1.76210999e+03,   1.00000000e+00,   2.33996500e+00],
        [  1.68155005e+03,   0.00000000e+00,  -1.01999510e+01],
        [  1.70319995e+03,   1.00000000e+00,   1.06398920e+01],
        [  1.68798999e+03,   1.00000000e+00,   4.56994600e+00],
        [  1.67170996e+03,   1.00000000e+00,   1.65399170e+01],
        [  1.65507996e+03,   1.00000000e+00,   2.00000000e+00],
        [  1.65544995e+03,   0.00000000e
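The rows above are no longer in chronological order, which suggests a shuffled split. A minimal sketch of such a split follows; `shuffled_split` is a hypothetical stand-in for trval_test_split, and both the rounding of the test-set size and the use of a seed are assumptions. The 43 stand-in datapoints are chosen to match the 34 training+validation labels and 9 test labels that appear further down in this notebook.

```python
import random

def shuffled_split(rows, test_fraction, seed=0):
    """Shuffle rows and return (train_plus_validation, test)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)          # same seed, same permutation
    n_test = round(len(rows) * test_fraction)  # rounding rule is an assumption
    test_idx, trval_idx = idx[:n_test], idx[n_test:]
    return [rows[i] for i in trval_idx], [rows[i] for i in test_idx]

rows = list(range(43))  # 43 stand-in datapoints
trval, test = shuffled_split(rows, 0.2, seed=100)
print(len(trval), "train+validation rows,", len(test), "test rows")
```

Calling the function with the same seed on the x rows and on the y rows applies the same permutation to both, so features and targets stay aligned.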

And now, for the real deal: k-fold training and validation!
The following method performs that in a very general manner. It lets you choose the regression model and the values of its hyperparameters (please see the module documentation in model_training.py for details on how to pass them). You also need to supply the number of folds to split your data into and a seed, for reproducibility. There is also an option to scale and normalize the features, but in general it does not perform well.

The method returns the model's average performance over the k training iterations. In short, tuning consists of choosing the hyperparameter values that optimize avg_validation_rmse, that is, minimize the average root mean squared error on the validation sets.
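For readers who want to see the mechanics, here is a rough standard-library sketch of k-fold validation that reports the same two averages. It is not the implementation of kfold_val_reg: `kfold_validate`, `fit_offset`, and `predict_offset` are hypothetical names, the way folds are formed is an assumption, and the stand-in model (tomorrow equals today plus a learned constant) replaces the real regressors. The data is synthetic.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def kfold_validate(x, y, k, seed, fit, predict):
    """Return (avg_train_rmse, avg_validation_rmse) over k shuffled folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)       # the seed makes the folds reproducible
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of roughly equal size
    train_scores, val_scores = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        train_scores.append(rmse([y[i] for i in train],
                                 predict(model, [x[i] for i in train])))
        val_scores.append(rmse([y[i] for i in held_out],
                               predict(model, [x[i] for i in held_out])))
    return sum(train_scores) / k, sum(val_scores) / k

# Stand-in model: tomorrow = today + a constant offset learned on the training folds.
def fit_offset(x_train, y_train):
    return sum(t - s for s, t in zip(x_train, y_train)) / len(x_train)

def predict_offset(offset, x_rows):
    return [s + offset for s in x_rows]

x = [1600.0 + 3.0 * i for i in range(20)]  # synthetic "today" closes
y = [value + 3.0 for value in x]           # synthetic "tomorrow" closes
avg_train, avg_val = kfold_validate(x, y, k=5, seed=100,
                                    fit=fit_offset, predict=predict_offset)
print("avg_train_rmse:", avg_train, "avg_validation_rmse:", avg_val)
```

A large gap between the two averages is the usual sign of overfitting; on this synthetic data the stand-in model is exact, so both are zero.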

In [91]:
#10-fold validated lasso linear regression with sliding hyperparameter alpha, seed=100, no scaling, 
#for dataset 'some_name_you_choose'.
for alpha in [12.+0.1*i for i in range(-8,8)]:
    print('alpha =',alpha)
    predictorgdelt.kfold_val_reg(10,'some_name_you_choose','lasso',alpha,100,scaling=False)

alpha = 11.2
avg_train_rmse: 11.6470100241 avg_validation_rmse: 11.8295047334
alpha = 11.3
avg_train_rmse: 11.6470935177 avg_validation_rmse: 11.8295020586
alpha = 11.4
avg_train_rmse: 11.6471777528 avg_validation_rmse: 11.8295001005
alpha = 11.5
avg_train_rmse: 11.6472627295 avg_validation_rmse: 11.8294988592
alpha = 11.6
avg_train_rmse: 11.6473484476 avg_validation_rmse: 11.8294983351
alpha = 11.7
avg_train_rmse: 11.6474349073 avg_validation_rmse: 11.8294985283
alpha = 11.8
avg_train_rmse: 11.6475221085 avg_validation_rmse: 11.829499439
alpha = 11.9
avg_train_rmse: 11.6476100511 avg_validation_rmse: 11.8295010676
alpha = 12.0
avg_train_rmse: 11.6476987351 avg_validation_rmse: 11.8295034142
alpha = 12.1
avg_train_rmse: 11.6477881607 avg_validation_rmse: 11.8295064791
alpha = 12.2
avg_train_rmse: 11.6478783276 avg_validation_rmse: 11.8295102625
alpha = 12.3
avg_train_rmse: 11.6479692359 avg_validation_rmse: 11.8295147646
alpha = 12.4
avg_train_rmse: 11.6480608856 avg_validation_rmse: 1

So we see that the minimum is reached for alpha = 11.6 (you'll probably get different values). So now we go into testing and use this parameter.
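Picking the best value by eye is error-prone when the numbers differ only in the sixth decimal, so here is a small sketch that does it programmatically. The dictionary is copied from the output above (the truncated last entry is left out); collecting the scores this way is my own illustration, since the method as used here only prints them.

```python
# avg_validation_rmse for each alpha, copied from the printed output.
validation_rmse = {
    11.2: 11.8295047334,
    11.3: 11.8295020586,
    11.4: 11.8295001005,
    11.5: 11.8294988592,
    11.6: 11.8294983351,
    11.7: 11.8294985283,
    11.8: 11.829499439,
    11.9: 11.8295010676,
    12.0: 11.8295034142,
    12.1: 11.8295064791,
    12.2: 11.8295102625,
    12.3: 11.8295147646,
}

best_alpha = min(validation_rmse, key=validation_rmse.get)
spread = max(validation_rmse.values()) - min(validation_rmse.values())
print("best alpha:", best_alpha)
print("spread of validation RMSE over the scan:", spread)
```

The spread over the whole scan is of order 1e-5, so the validation curve is essentially flat in this range and the exact choice of alpha hardly matters here.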

The following method, very similar to the previous one, retrains the model on the full train+validation dataset with the desired hyperparameters. If the model defines feature importances, these are returned by the method.

Importantly, the method also prints out the performance of a benchmark model (just a trivial flat prediction from today to tomorrow).

In [93]:
feature_importances=predictorgdelt.kfold_test_reg('some_name_you_choose','lasso',11.6,scaling=False)

model_test_rmse: 8.20685552719 flat_test_rmse: 8.53742476458


By chance, in this one case we outperform the benchmark model with a lower RMSE, but this procedure should be repeated several times and an average final performance quoted instead.

Out of curiosity, let's see what the most important features were.

In [116]:
key_cols=list(predictorgdelt.dataset_list[0].columns)+['*weekend?','*yesterdayS&P']
print([[key_cols[i],feature_importances[i]] for i in np.argsort(abs(feature_importances))[::-1]][:10],'...')

[['*yesterdayS&P', 0.91139353353349029], ['errand', -0.0], ['everybodi', 0.0], ['everi', -0.0], ['ever', -0.0], ['evacu', 0.0], ['europ', 0.0], ['eurasian', 0.0], ['eu', 0.0], ['etx', 0.0]] ...


...which isn't surprising. As we said at the beginning, the most important feature should have been today's closing value, and it was, entirely overshadowing everything else.

Let's see if classifying tomorrow's value as going up or down will do any better...
N.B. We need to specify a decision threshold, which I recommend leaving at 0.5 for now.

In [124]:
#10-fold validated lasso logistic regression with sliding hyperparameter alpha, seed=100, no scaling, 
#for dataset 'some_name_you_choose'. Scaling not yet supported
for alpha in [1.+0.1*i for i in range(-5,5)]:
    print('alpha =',alpha)
    predictorgdelt.kfold_val_class(10,'some_name_you_choose','logreg',['l1',alpha],100,thres=0.5)

alpha = 0.5
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 0.6
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 0.7
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 0.8
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 0.9
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 1.0
avg_train_rec,prec,F1: [1.0, 0.6470967741935485, 0.78559676178163562] avg_validation_rec,prec,F1: [1.0, 0.65000000000000002, 0.77476190476190465]
alpha = 1.1
avg_train_rec,prec,F1: [1.0, 0.647096774

As you can see, the method again returns average validation performance, now measured in terms of recall, precision, and F1 score. Lacking a specific metric to optimize, we are going to use the F1 score for tuning.
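For reference, the three metrics can be computed from scratch as below. This is a sketch with the usual definitions, treating label 1 ("up") as the positive class; `recall_precision_f1` is a hypothetical helper rather than the `scores` function of model_training, and returning 0.0 when a denominator vanishes is an assumption. The labels are made up.

```python
def recall_precision_f1(y_true, y_pred):
    """Binary recall, precision and F1, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0        # share of real "up" days found
    precision = tp / (tp + fp) if tp + fp else 0.0     # share of predicted "up" days that were right
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

y_true = [1, 0, 1, 1, 0]  # made-up labels
y_pred = [1, 1, 1, 0, 0]  # made-up predictions
recall, precision, f1 = recall_precision_f1(y_true, y_pred)
print("rec,prec,F1:", [recall, precision, f1])
```

Note that recall alone is easy to maximize: a recall of exactly 1.0 only says that no "up" day was missed, not that the "down" days were recognized.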

The performance plateaus and is optimal for alpha ~1.0

In [129]:
feature_importances_class=predictorgdelt.kfold_test_class('some_name_you_choose','logreg',['l1',1.0],thres=0.5)

test_rec,prec,F1: [1.0, 0.55555555555555558, 0.7142857142857143]


I haven't yet implemented a benchmark model for classification. These are the training+validation labels, though:

In [146]:
print('the training+validation data labels were',predictorgdelt.ydata['some_name_you_choose'][0][:,1])

the training+validation data labels were [ 1.  1.  1.  1.  0.  1.  1.  0.  1.  1.  0.  1.  1.  1.  1.  0.  1.  1.
  0.  0.  1.  0.  0.  1.  1.  1.  0.  0.  1.  1.  0.  0.  1.  1.]


so I'm gonna compare my model to one that predicts only 1's (majority class)

In [151]:
print("the benchmark model's recall, precision, and F1 scores are", mdlt.scores(predictorgdelt.ydata['some_name_you_choose'][1][:,1],np.ones(9),[0.5])[:3])

the benchmark model's recall, precision, and F1 scores are (1.0, 0.5555555555555556, 0.7142857142857143)


oh bummer, this model seems identical to mine... which is confirmed by feature importances:

In [152]:
key_cols=list(predictorgdelt.dataset_list[0].columns)+['*weekend?','*yesterdayS&P']
print([[key_cols[i],feature_importances_class[0][i]] for i in np.argsort(abs(feature_importances_class[0]))[::-1]][:10],'...')

[['*yesterdayS&P', 0.00035189982423960872], ['errand', 0.0], ['everybodi', 0.0], ['everi', 0.0], ['ever', 0.0], ['evacu', 0.0], ['europ', 0.0], ['eurasian', 0.0], ['eu', 0.0], ['etx', 0.0]] ...


... the model is essentially all bias, since every news feature has zero weight... and by direct visual inspection:

In [153]:
print('and the predicted test labels are',predictorgdelt.yhat_class['some_name_you_choose'])

and the predicted test labels are [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]


Bingo! Our model predicts all 1's. Not much gained...

Incidentally anyway, that's how you pull the predictions vector for a specific dataset.
In the future I'll give the option to save a specific model run instead of overwriting. Good for free exploration.

# Scratch from now on, please ignore!!

In [612]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','lasso',1.3,10,scaling=True)

avg_train_rmse: 9.20135417438 avg_validation_rmse: 23.8508310037


In [629]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','rfreg',[50,4500,10],10,scaling=True)

avg_train_rmse: 5.91587243572 avg_validation_rmse: 24.4121821352


In [643]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','adabreg',15,10,scaling=False)

avg_train_rmse: 11.1816716619 avg_validation_rmse: 15.3459786376


In [670]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','knnreg',7,10,scaling=False)

avg_train_rmse: 10.342872738 avg_validation_rmse: 11.4892361364


In [671]:
aa=predictorgdelt.kfold_test_reg('apriltodectfidf','knnreg',7)

model_test_rmse: 16.0651087792 flat_test_rmse: 14.4209762157


In [727]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','knnclass',7,10,[0.5])

avg_train_rec,prec,F1: [0.81837529044943147, 0.70479197132136429, 0.75720674859733283] avg_validation_rec,prec,F1: [0.76335497835497834, 0.64537684537684537, 0.6926744610887835]


In [724]:
predictorgdelt.kfold_test_class('apriltodectfidf','knnclass',7,[0.5])

test_rec,prec,F1: [0.5, 0.41666666666666669, 0.45454545454545453]


In [738]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','rfclass',[40,5,10],10,[0.5])

avg_train_rec,prec,F1: [1.0, 0.71820616787952551, 0.83593883914424061] avg_validation_rec,prec,F1: [1.0, 0.59458333333333324, 0.73921100638491943]


In [739]:
predictorgdelt.kfold_test_class('apriltodectfidf','rfclass',[40,5,10],[0.5])

test_rec,prec,F1: [1.0, 0.51282051282051277, 0.67796610169491522]


In [744]:
sum(predictorgdelt.ydata['apriltodectfidf'][1][:,1])/len(predictorgdelt.ydata['apriltodectfidf'][1][:,1])

0.51282051282051277

In [778]:
mdlt.scores(predictorgdelt.ydata['apriltodectfidf'][1][:,1],np.ones(len(predictorgdelt.ydata['apriltodectfidf'][1])),[0.5])

(1.0, 0.5128205128205128, 0.6779661016949152, 0.5128205128205128)

In [779]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','logreg',['l1',1.5],10,[0.5])

avg_train_rec,prec,F1: [1.0, 0.59213718334048937, 0.74374608177131407] avg_validation_rec,prec,F1: [1.0, 0.59458333333333324, 0.73921100638491943]


In [768]:
aa=predictorgdelt.kfold_test_class('apriltodectfidf','logreg',['l1',2.5],[0.5])

test_rec,prec,F1: [1.0, 0.51282051282051277, 0.67796610169491522]


In [773]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','svmclass',[1.,'poly'],10,[0.5])

avg_train_rec,prec,F1: [0.76262732475581552, 0.77500920878424662, 0.7669871458718045] avg_validation_rec,prec,F1: [0.66152958152958141, 0.61717171717171726, 0.62368359527432093]


In [774]:
predictorgdelt.kfold_test_class('apriltodectfidf','svmclass',[1.,'poly'],[0.5])

test_rec,prec,F1: [0.65000000000000002, 0.59090909090909094, 0.61904761904761907]


In [775]:
predictorgdelt.yhat_class

{'apriltodectfidf': array([ 0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
         0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  0.,
         0.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.])}

In [772]:
predictorgdelt.ydata['apriltodectfidf'][1][:,1]

array([ 1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  1.])

In [660]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','svmreg',[15,'poly'],10,scaling=False)

KeyboardInterrupt: 

In [649]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','mlpreg',['relu',(100,)],10,scaling=False)

avg_train_rmse: 12.6222333907 avg_validation_rmse: 12.1568874732


In [657]:
aa=predictorgdelt.kfold_test_reg('apriltodectfidf','mlpreg',['relu',(100,)])

model_test_rmse: 14.8924855598 flat_test_rmse: 14.4209762157


In [561]:
predictorgdelt.yhat_reg

{'apriltodectfidf': array([ 1744.52278741,  1803.29915885,  1694.4451712 ,  1693.34940803,
         1837.50763223,  1661.97637678,  1576.88280783,  1647.17413376,
         1653.98888935,  1660.05401055,  1630.59372504,  1760.84365842,
         1594.12637618,  1693.15485608,  1657.5242832 ,  1559.35083745,
         1693.66655617,  1770.69578534,  1678.18873354,  1618.16557548,
         1798.57971612,  1595.75929349,  1790.80377837,  1762.44887651,
         1569.11649056,  1588.40738375,  1769.48474365,  1773.99257709,
         1789.57343515,  1688.86454837,  1637.86982931,  1793.05291322,
         1655.27695931,  1687.8802146 ,  1631.17041088,  1657.8406513 ,
         1655.79591841,  1562.46506112,  1711.34272805])}

In [562]:
predictorgdelt.ydata['apriltodectfidf'][1][:,0]

array([ 1754.670044,  1800.900024,  1703.199951,  1689.469971,
        1841.069946,  1650.469971,  1553.689941,  1642.810059,
        1667.469971,  1630.47998 ,  1612.52002 ,  1767.930054,
        1592.430054,  1685.72998 ,  1655.079956,  1541.609985,
        1681.550049,  1767.689941,  1655.449951,  1606.280029,
        1795.150024,  1573.089966,  1785.030029,  1756.540039,
        1570.25    ,  1593.609985,  1747.150024,  1786.540039,
        1787.869995,  1689.130005,  1633.77002 ,  1792.810059,
        1628.930054,  1706.869995,  1639.040039,  1652.619995,
        1642.800049,  1562.5     ,  1698.060059])

In [620]:
predictorgdelt.kfold_val_reg(10,'apriltodectfidf','ridge',0.001,10,scaling=True)

avg_train_rmse: 6.33518759404 avg_validation_rmse: 24.5460444404


In [385]:
aa=predictorgdelt.kfold_test_reg('7daystfidf','lasso',37.5)

model_test_rmse: 14.2964522937 flat_test_rmse: 13.5413782105


In [417]:
predictorgdelt.kfold_val_reg(15,'7daystfidf','ridge',8400.,10)

avg_train_rmse: 11.1986058145 avg_validation_rmse: 10.800359651


In [418]:
aa=predictorgdelt.kfold_test_reg('7daystfidf','ridge',8400.)

model_test_rmse: 14.3814887421 flat_test_rmse: 13.5413782105


In [263]:
predictorgdelt.kfold_val_reg(10,'7daystfidf','svreg',[0.01,'poly'],10)

avg_train_rmse: 11.0071954777 avg_validation_rmse: 12.3585904621


In [264]:
predictorgdelt.kfold_test_reg('7daystfidf','svreg',[0.01,'poly'])

model_test_rmse: 10.3370215097 flat_test_rmse: 10.8353511909


In [161]:
key_cols=list(datagdelt.vect_corpus_tfidf.columns)+['*weekend?','*yesterdayS&P']

In [769]:
ab=aa[0]#model.coef_[0]
[[key_cols[i],ab[i]] for i in np.argsort(abs(ab))[::-1]]

[['north', 2.1730974001773933],
 ['train', 1.5123459726524862],
 ['snowden', 1.264379084303753],
 ['crew', -1.130072887205146],
 ['big', 1.0974619202530456],
 ['ae', -0.84521265295771053],
 ['milit', -0.26812945924067383],
 ['*yesterdayS&P', 1.8144839374748277e-05],
 ['juli', 0.0],
 ['exam', 0.0],
 ['everi', 0.0],
 ['everybodi', 0.0],
 ['evict', 0.0],
 ['evid', 0.0],
 ['evolut', 0.0],
 ['ex', 0.0],
 ['exactli', 0.0],
 ['examin', 0.0],
 ['egyptian', 0.0],
 ['exce', 0.0],
 ['except', 0.0],
 ['exception', 0.0],
 ['exchang', 0.0],
 ['exclus', 0.0],
 ['exec', 0.0],
 ['execut', 0.0],
 ['juror', 0.0],
 ['ever', 0.0],
 ['eventu', 0.0],
 ['event', 0.0],
 ['etern', 0.0],
 ['ethanol', 0.0],
 ['ethnic', 0.0],
 ['etx', 0.0],
 ['eu', 0.0],
 ['eurasian', 0.0],
 ['eurobank', 0.0],
 ['europ', 0.0],
 ['european', 0.0],
 ['eurozon', 0.0],
 ['evacu', 0.0],
 ['evad', 0.0],
 ['evalu', 0.0],
 ['eve', 0.0],
 ['even', 0.0],
 ['exelon', 0.0],
 ['exercis', 0.0],
 ['exhaust', 0.0],
 ['express', 0.0],
 ['extens', 

In [572]:
predictorgdelt.kfold_val_class(10,'apriltodectfidf','logreg',['l1',0.095],10,0.5)

avg_train_rec,prec,F1: [1.0, 0.77187932464248254, 0.87119529697081077] avg_validation_rec,prec,F1: [0.97070707070707074, 0.62297619047619035, 0.75461295226512615]


In [573]:
aa=predictorgdelt.kfold_test_class('apriltodectfidf','logreg',['l1',0.095],0.5)

test_rec,prec,F1: [0.93333333333333335, 0.3888888888888889, 0.5490196078431373]


In [583]:
len(predictorgdelt.ydata['apriltodectfidf'][1][:,1])

39

In [574]:
predictorgdelt.yhat_class['apriltodectfidf']

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.])

In [247]:
predictorgdelt.kfold_val_class(10,'7daystfidf','svclass',[0.01,'poly'],10,0.5)

avg_train_rec,prec,F1: [0.99316945840049764, 0.87472075977840691, 0.93013643870901608] avg_validation_rec,prec,F1: [0.78934065934065933, 0.60476190476190483, 0.65097784615687426]


In [248]:
predictorgdelt.kfold_test_class('7daystfidf','svclass',[0.01,'poly'],0.5)

test_rec,prec,F1: [0.69230769230769229, 0.75, 0.71999999999999986]


In [249]:
predictorgdelt.yhat_class['7daystfidf']

array([ 0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  0.,  0.])

In [252]:
predictorgdelt.ydata['7daystfidf'][1][:,1]

array([ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.])
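
As a sanity check, the recall, precision and F1 printed for the `7daystfidf` SVC test run can be recomputed from the two arrays above. This is a minimal sketch, not the notebook's own `scores` function: `confusion_counts` and `rec_prec_f1` are illustrative names, and label 1 is assumed to be the positive ("index rises") class.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def rec_prec_f1(y_true, y_pred):
    """Recall, precision and F1; recall/precision default to 1 when fn/fp is 0,
    which mirrors the convention used elsewhere in this notebook."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    rec = 1.0 if fn == 0 else tp / (tp + fn)
    prec = 1.0 if fp == 0 else tp / (tp + fp)
    f1 = 0.0 if rec == 0 and prec == 0 else 2 * rec * prec / (rec + prec)
    return rec, prec, f1

# Copied from the two outputs above (predictions, then true labels).
yhat = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0]
y    = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0]

rec, prec, f1 = rec_prec_f1(y, yhat)
print(rec, prec, f1)  # recall 9/13, precision 0.75, F1 0.72 (up to float rounding)
```

The counts come out as 9 true positives, 3 false positives and 4 false negatives, which reproduces the printed test scores.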

In [1248]:
#downloading and unzipping, run at your own risk, contains dreadful shell commands
for date in range(20131001,20131032):
    os.system('wget http://data.gdeltproject.org/events/'+str(date)+'.export.CSV.zip')
    os.system('unzip '+str(date)+'.export.CSV.zip')
    os.system('mv '+str(date)+'.export.CSV data/GDELT_1.0')
    os.system('rm '+str(date)+'.export.CSV.zip')
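
The loop above shells out to `wget`, `unzip`, `mv` and `rm`, and its integer range is only valid because October has 31 days. A sketch of the same download using just the standard library follows. `gdelt_daily_urls` and `download_day` are hypothetical helper names; the URL pattern and the target directory are the ones used in the cell above.

```python
import io
import zipfile
from datetime import date, timedelta
from urllib.request import urlopen

BASE_URL = 'http://data.gdeltproject.org/events/'

def gdelt_daily_urls(start, end):
    """Yield (yyyymmdd, url) for every calendar day from start to end inclusive."""
    day = start
    while day <= end:
        stamp = day.strftime('%Y%m%d')
        yield stamp, BASE_URL + stamp + '.export.CSV.zip'
        day += timedelta(days=1)

def download_day(url, target_dir='data/GDELT_1.0'):
    """Fetch one daily zip into memory and extract its contents into target_dir."""
    with urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall(target_dir)

october = list(gdelt_daily_urls(date(2013, 10, 1), date(2013, 10, 31)))
print(len(october), october[0][1])

# Uncomment to actually download (needs network access):
# for stamp, url in october:
#     download_day(url)
```

Iterating over real dates means a month with 30 days never produces a request for a non-existent file.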

In [11]:
!ls -hl data/GDELT_1.0/20130401.export.CSV

-rw-r--r--  1 Maxos  staff    10M May 20  2013 data/GDELT_1.0/20130401.export.CSV


In [38]:
header_daily=pd.read_csv('data/GDELT_1.0/CSV.header.dailyupdates.txt',delimiter='\t')

In [39]:
#this is just to show what the GDELT files look like
sample_df=pd.read_csv('data/GDELT_1.0/20130401.export.CSV',delimiter='\t')
sample_df.columns=list(header_daily)
sample_df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,253461012,20030404,200304,2003,2003.2575,AUS,AUSTRALIA,AUS,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.4199,-117.122,NV,20130401,http://www.startribune.com/nation/200818961.html


In [1206]:
corpus_url[0][1][1]

'http://www.rte.ie/news/2013/0401/379281-india-drug-patent-novartis/'

In [1211]:
#from urllib.request import urlopen
#from bs4 import BeautifulSoup

#with urlopen("https://www.google.com") as response:
        #html = response.read()
#        soup = BeautifulSoup(response)



#soup = BeautifulSoup(urlopen("https://github.com/jbwhit/jupyter-tips-and-tricks/tree/master/deliver"))    
    
#for url in corpus_url[0]:
 #   print(url[1])
    #soup = BeautifulSoup(urlopen(url[1]))#, "lxml"))#"https://www.google.com"), "lxml")
#print(soup.title.string)

In [20]:
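import re
from urllib.parse import urlparse

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

re_tokenizer = RegexpTokenizer(r'\w+')
punctuation = re.compile(r'[-.?!,":;()|0-9]')
stop_words = set(stopwords.words('english')+[""])
porter = PorterStemmer()
vowels = list("aeiouy")
consonants = list("bcdfghjklmnpqrstvwxz")
#prefixes of spurious (non-word) stems that get discarded
spurious_beginnings = re.compile(r'idind.|idus.|iduk.')

def url_tokenizer(url):
    #turns the last path component of a url into a list of filtered, stemmed words
    cleaned,kept,stems=[],[],[]
    if url!='BBC Monitoring':
        #last path component, cut at the first '.' of the path (drops extensions such as .html)
        slug=urlparse(url)[2].split('.')[0].split('/')[-1]
        tokens = re_tokenizer.tokenize(slug.lower())
        #strip punctuation and digits, then drop stop words and empty strings
        for word in tokens:
            cleaned+=[punctuation.sub("", word)]
        for word in cleaned:
            if word not in stop_words:
                kept+=[word]
        #skip urls that leave at most one word
        if len(kept)<=1:
            return []
        for word in kept:
            stemtemp=porter.stem(word)
            length=len(stemtemp)
            unique=len(set(stemtemp))
            num_vow=sum(stemtemp.count(ch) for ch in vowels)
            num_cons=sum(stemtemp.count(ch) for ch in consonants)
            #heuristics meant to discard hashes and other strings that do not look like words
            if length<15 and (num_cons-num_vow)<7 and unique>1 and num_vow>0 and (length-unique)<5 and not spurious_beginnings.match(stemtemp) and '_' not in stemtemp:
                stems+=[stemtemp]
    return stems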

def wrapper_tokenizer(url_doc):
    wordlist=[]
    for url in url_doc:
        for mentions in range(url[0]):
            wordlist+=url_tokenizer(url[1])
    return wordlist

In [996]:
url_tokenizer('http://iosdevelopertips.com/bash/bash-trick-file-sizes-byte-kilobyte-megabyte-gigabyte.html'),url_tokenizer('http://alexgude.com/blog/software-testing-for-data-science')

(['bash', 'trick', 'file', 'size', 'byte', 'kilobyt', 'megabyt', 'gigabyt'],
 ['softwar', 'test', 'data', 'scienc'])
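
To see which part of a URL the tokenizer actually reads, the slug extraction can be isolated from the stop-word and stemming steps. The sketch below uses only the standard library; `url_slug` and `slug_words` are hypothetical names, and `re.findall(r'\w+', ...)` is a stand-in for the NLTK `RegexpTokenizer` used above.

```python
import re
from urllib.parse import urlparse

def url_slug(url):
    """Return the last path component, cut at the first '.' of the path."""
    return urlparse(url)[2].split('.')[0].split('/')[-1]

def slug_words(url):
    """Lower-case the slug and split it into runs of word characters."""
    return re.findall(r'\w+', url_slug(url).lower())

no_trailing_slash = 'http://alexgude.com/blog/software-testing-for-data-science'
trailing_slash = 'http://www.rte.ie/news/2013/0401/379281-india-drug-patent-novartis/'

print(slug_words(no_trailing_slash))  # ['software', 'testing', 'for', 'data', 'science']
print(slug_words(trailing_slash))     # []
```

The second URL is one that appears in the GDELT data: because it ends with a slash, the last path component is empty and the URL contributes no words at all.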

In [22]:
 wrapper_tokenizer([[3,'http://iosdevelopertips.com/bash/bash-trick-file-sizes-byte-kilobyte-megabyte-gigabyte.html']
                    ,[1,'http://alexgude.com/blog/software-testing-for-data-science']])

['bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'softwar',
 'test',
 'data',
 'scienc']
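
`wrapper_tokenizer` implements the mention weighting by tokenizing the same URL once per mention. A possible alternative, sketched here under assumptions, tokenizes each URL once and multiplies the counts. `weighted_counts` is a hypothetical helper and `toy_tokenizer` is only a stand-in for `url_tokenizer` so that the example runs without NLTK.

```python
from collections import Counter

def toy_tokenizer(url):
    """Stand-in for url_tokenizer: split the last path component on '-'."""
    return url.rstrip('/').split('/')[-1].split('-')

def weighted_counts(url_doc, tokenizer):
    """Sum token counts over [mentions, url] pairs, weighting each url by its mentions."""
    counts = Counter()
    for mentions, url in url_doc:
        for token, n in Counter(tokenizer(url)).items():
            counts[token] += n * mentions
    return counts

doc = [[3, 'http://example.com/a/bash-trick'],
       [1, 'http://example.com/b/data-trick']]
print(weighted_counts(doc, toy_tokenizer))
```

The result is a mapping from token to weighted count rather than the flat token list returned by `wrapper_tokenizer`, so it would need a different vectorization step downstream.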

In [1161]:
def vocabularycreator(date1,date2,cutoff_numb,save=False):
    word_corpus=set([])
    header=list(header_daily)
    for date in range(date1,date2):
        df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
        df.columns=header
        df=df.sort_values('NumMentions', ascending=False)
        for i in range(cutoff_numb):
            word_corpus=word_corpus.union(set(url_tokenizer(df.iloc[i,-1])))
        del df
    if save:
        print("Sorry, I haven't implemented this feature yet")
    return word_corpus

def corpuscreator_url(date1,date2,cutoff_numb,save=False):
    url_corpus=[]
    header=list(header_daily)
    for date in range(date1,date2):
        df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
        df.columns=header
        df=df.sort_values('NumMentions', ascending=False)
        url_doc=[]
        for i in range(cutoff_numb):
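            #positional lookup for both values: after sorting, df['NumMentions'][i] would select by the original row label instead
            url_doc+=[[df['NumMentions'].iloc[i],df.iloc[i,-1]]]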
        url_corpus+=[url_doc]
    if save:
        print("Sorry, I haven't implemented a saving feature yet")
    return url_corpus
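
Both functions above keep the `cutoff_numb` most-mentioned events of each day. The same selection can be sketched on in-memory data with the standard library only. `top_mentioned` is a hypothetical helper, and the three-column toy layout is an assumption made for illustration, not the GDELT schema.

```python
import csv
import heapq
import io

# Toy tab-separated data: event id, number of mentions, source url.
TOY_TSV = (
    "1\t4\thttp://example.com/first-story\n"
    "2\t10\thttp://example.com/second-story\n"
    "3\t7\thttp://example.com/third-story\n"
)

def top_mentioned(tsv_text, cutoff_numb):
    """Return [mentions, url] pairs for the cutoff_numb most-mentioned rows."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter='\t')
    pairs = [[int(mentions), url] for _, mentions, url in rows]
    return heapq.nlargest(cutoff_numb, pairs, key=lambda pair: pair[0])

print(top_mentioned(TOY_TSV, 2))
# [[10, 'http://example.com/second-story'], [7, 'http://example.com/third-story']]
```

Reading the count and the URL from the same row keeps every mention count paired with its own URL.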

In [52]:
#creating the corpus of urls (and number of mentions) by reading over all csv files, takes a while, not very efficient
corpus_url=corpuscreator_url(20130401,20130431,100)+corpuscreator_url(20130501,20130532,100)+corpuscreator_url(20130601,20130631,100)+corpuscreator_url(20130701,20130732,100)+corpuscreator_url(20130801,20130832,100)+corpuscreator_url(20130901,20130931,100)

  if self.run_code(code, result):

In [357]:
corpus_url+=corpuscreator_url(20131001,20131032,100)

  if self.run_code(code, result):


In [18]:
#other features that I intend to use; right now I'm only using URLs
feat_columns=['FractionDate','Actor1Code','Actor1Name','Actor1CountryCode','Actor1Type1Code','Actor2Code',
              'Actor2Name','Actor2CountryCode','Actor2Type1Code','EventCode','QuadClass','GoldsteinScale',
              'NumMentions','AvgTone']
#out of which, categorical are
cat_columns=['Actor1Code','Actor1Name','Actor1CountryCode','Actor1Type1Code','Actor2Code','Actor2Name',
             'Actor2CountryCode','Actor2Type1Code','EventCode','QuadClass']


#def preprocess(date,corp,cutoff_numb,fcol=feat_columns,ccol=cat_columns,tfidf=False):
#    df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
#    df.columns=list(header_daily)
#    df=(df.sort_values('NumMentions', ascending=False))[0:cutoff_numb]
#    df_with_dummies = pd.get_dummies(df[fcol],columns=ccol)
#    if tfidf:
#        vectorizer = TfidfVectorizer(min_df=1,tokenizer=url_tokenizer)
#    else:
#        vectorizer = CountVectorizer(min_df=1,tokenizer=url_tokenizer)
#    X = vectorizer.fit_transform(corp)
#    Y=X.toarray()
#    for i,col in enumerate(vectorizer.get_feature_names()):
#        df_with_dummies[col]=pd.DataFrame(Y[:,i])
#    return df_with_dummies

#this preprocesses the lists of words and vectorizes them, optionally applying tf-idf

def preprocess_red(corp,tfidf=False):
    if tfidf:
        vectorizer = TfidfVectorizer(min_df=1,tokenizer=wrapper_tokenizer,lowercase=False)
    else:
        vectorizer = CountVectorizer(min_df=1,tokenizer=wrapper_tokenizer,lowercase=False)
    X = vectorizer.fit_transform(corp)
    Y=X.toarray()
    dictionary={col:Y[:,i] for i,col in enumerate(vectorizer.get_feature_names())}
    return pd.DataFrame(dictionary)
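
To make concrete what the tf-idf option changes relative to raw counts, the sketch below recomputes tf-idf by hand on toy token lists. It assumes the weighting that scikit-learn's `TfidfVectorizer` uses by default, to my knowledge a smoothed idf of `ln((1+n)/(1+df))+1` followed by L2 normalization of each row. `tfidf_rows` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized tf-idf row per token list."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows

docs = [['train', 'crew', 'train'], ['train', 'north'], ['north', 'snowden']]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Words that occur on many days get a smaller idf, so a word that is rare across days but frequent within one day dominates that day's row.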

In [17]:
len(datagdelt.url_corpus[6])

15509

In [21]:
tfidf_dataset_df=preprocess_red(datagdelt.url_corpus,tfidf=True)

KeyboardInterrupt: 

In [1103]:
#these are the feature dataframes: they contain the bag-of-words or tf-idf vectorization of every single document
#(i.e. one full day of news)
bow_dataset_df=preprocess_red(corpus_url)
tfidf_dataset_df=preprocess_red(corpus_url,tfidf=True)

In [1013]:
bow_dataset_df.head()

Unnamed: 0,aab,aacbec,aacd,aad,aada,aadb,aae,aaf,aafeefc,aaff,...,zimvwodawr,zipwir,zmsm,zone,zoo,zookeep,zs,zuckerberg,zuma,zzg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,10,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [1020]:
print('word','#char','#uniquechar','#vowels','#conson')
for ii in list(bow_dataset_df.columns):
    length=len(ii)
    unique=len(set(ii))
    num_vow=sum(ii.count(c) for c in vowels)
    num_cons=sum(ii.count(c) for c in consonants)
    if True:
        print(ii,length,unique,num_vow,num_cons)

#[[ii,len(ii)-len(set(ii)),len(set(ii)),sum(ii.count(c) for c in vowels),sum(ii.count(c) for c in consonants)] # if len(ii)-len(set(ii))>=4 and sum(ii.count(c) for c in vowels)==6 and sum(ii.count(c) for c in consonants)==3]

word #char #uniquechar #vowels #conson
aab 3 2 2 1
aacbec 6 4 3 3
aacd 4 3 2 2
aad 3 2 2 1
aada 4 2 3 1
aadb 4 3 2 2
aae 3 2 3 0
aaf 3 2 2 1
aafeefc 7 4 4 3
aaff 4 2 2 2
aaron 5 4 3 2
aarriv 6 4 3 3
ab 2 2 1 1
abandon 7 5 3 4
abb 3 2 1 2
abba 4 2 2 2
abc 3 3 1 2
abccb 5 3 1 4
abcf 4 4 1 3
abcfa 5 4 2 3
abd 3 3 1 2
abdic 5 5 2 3
abdoul 6 6 3 3
abduct 6 6 2 4
abductor 8 8 3 5
abe 3 3 2 1
abeb 4 3 2 2
abedf 5 5 2 3
abf 3 3 1 2
abfbd 5 4 1 4
abfff 5 3 1 4
abid 4 4 2 2
abil 4 4 2 2
abl 3 3 1 2
ablyazov 8 7 4 4
aboard 6 5 3 3
abort 5 5 2 3
abound 6 6 3 3
abramson 8 7 3 5
abrio 5 5 3 2
abroad 6 5 3 3
abu 3 3 2 1
abus 4 4 2 2
abuzz 5 4 2 3
abyei 5 5 4 1
ac 2 2 1 1
aca 3 2 2 1
acabd 5 4 2 3
academi 7 6 4 3
acapulco 8 6 4 4
acb 3 3 1 2
acbb 4 3 1 3
acbd 4 4 1 3
acbecefb 8 5 3 5
acbff 5 4 1 4
acc 3 2 1 2
accbcdd 7 4 1 6
acceler 7 5 3 4
accept 6 5 2 4
access 6 4 2 4
accid 5 4 2 3
accident 8 7 3 5
accolad 7 5 3 4
accord 6 5 2 4
account 7 6 3 4
accredit 8 7 3 5
accus 5 4 2 3
ace 3 3 2 1
acea 4 3 3 1
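
The listing above shows that hex-like fragments such as `aacbec` still make it into the vocabulary. To experiment with the thresholds, the filter inside `url_tokenizer` can be pulled out as a standalone predicate. `looks_like_word` is a hypothetical name; the thresholds are copied from `url_tokenizer`, and two of the sample strings are made up.

```python
import re

VOWELS = set('aeiouy')
CONSONANTS = set('bcdfghjklmnpqrstvwxz')
SPURIOUS = re.compile(r'idind.|idus.|iduk.')

def looks_like_word(stem):
    """Apply the same thresholds url_tokenizer uses to keep or drop a stem."""
    length = len(stem)
    unique = len(set(stem))
    num_vow = sum(ch in VOWELS for ch in stem)
    num_cons = sum(ch in CONSONANTS for ch in stem)
    return (length < 15 and num_cons - num_vow < 7 and unique > 1
            and num_vow > 0 and length - unique < 5
            and not SPURIOUS.match(stem) and '_' not in stem)

for stem in ['abandon', 'aacbec', 'xkcdzzrt', 'aaaa', 'idusbre']:
    print(stem, looks_like_word(stem))
```

Here `abandon` and `aacbec` pass, while the vowel-free string, the single-letter string and the `idus` prefix are rejected, so tightening the vowel/consonant thresholds alone would not remove the hex-like tokens.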

In [47]:
#this is loading the data for the S&P500 index which we'll be trying to predict
sp500=[]
with open('data/SP500am.csv','r') as mycsvfile:
    reader=csv.reader(mycsvfile)
    for row in reader:
        row[0]=re.sub('-','',row[0])
        sp500+=[row]

sp500red=sp500[:963][::-1]

In [48]:
sp500[1]

['20170123',
 '2267.780029',
 '2271.780029',
 '2257.02002',
 '2265.199951',
 '3152710000',
 '2265.199951']

In [1022]:
days=list(range(20130401,20130431))+list(range(20130501,20130532))+list(range(20130601,20130631))+list(range(20130701,20130732))+list(range(20130801,20130832))+list(range(20130901,20130931))+list(range(20131001,20131032))
#keep dates as 'YYYYMMDD' strings so they compare equal to the first column of sp500red (hyphens are stripped on load)
days=[str(date) for date in days]+['20131101']
#prev_days=['2013-03-31']+days

In [51]:
isinstance(3,int)

True

In [1104]:
#dataset preparation (tf-idf features, regression target = next trading day's close)
x_tfidf=[]
y_tfidf=[]
j=0
for i,tomorrow in enumerate(days[1:]):
    next_bizday=sp500red[j+1][0]
    if tomorrow==next_bizday:
        after_we=0.
        latest_bizday=sp500red[j][0]
        today=days[i]
        if latest_bizday!=today:
            after_we=1.
        x_tfidf+=[list(tfidf_dataset_df.iloc[i])+[after_we,float(sp500red[j][-1])]]
        y_tfidf+=[float(sp500red[j+1][-1])]
        j+=1

x_tfidf=np.array(x_tfidf)
y_tfidf=np.array(y_tfidf)
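
The alignment between calendar days (news) and trading days (prices) done by the loop above can be sketched as a small function. `align_days` is a hypothetical helper and the prices are made up; the dates only need to use the same string format in both lists.

```python
def align_days(days, closes):
    """Pair each calendar day that directly precedes a trading day with closing prices.

    days:   consecutive calendar dates as strings
    closes: (date, close) pairs for trading days in ascending order, starting at
            the last trading day on or before days[0]
    Returns (day_index, after_weekend, last_close, next_close) tuples.
    """
    samples = []
    j = 0
    for i, tomorrow in enumerate(days[1:]):
        if j + 1 >= len(closes):
            break  # ran out of price data
        if tomorrow == closes[j + 1][0]:
            # Flag days whose latest close is older than the day itself.
            after_weekend = 0.0 if closes[j][0] == days[i] else 1.0
            samples.append((i, after_weekend, closes[j][1], closes[j + 1][1]))
            j += 1
    return samples

days = ['20130405', '20130406', '20130407', '20130408', '20130409']  # Friday to Tuesday
closes = [('20130405', 100.0), ('20130408', 101.0), ('20130409', 103.0)]  # made-up prices
print(align_days(days, closes))
# [(2, 1.0, 100.0, 101.0), (3, 0.0, 101.0, 103.0)]
```

Only Sunday's and Monday's news end up in a sample here: the news from Friday and Saturday is dropped because the following calendar day is not a trading day.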

In [1105]:
#dataset preparation (bullish/bearish classifier: target is 1 if the next close is higher, 0 if lower)
x_tfidf_class=[]
y_tfidf_class=[]
j=0
for i,tomorrow in enumerate(days[1:]):
    next_bizday=sp500red[j+1][0]
    if tomorrow==next_bizday:
        after_we=0.
        latest_bizday=sp500red[j][0]
        today=days[i]
        if latest_bizday!=today:
            after_we=1.
        x_tfidf_class+=[list(tfidf_dataset_df.iloc[i])+[after_we,float(sp500red[j][-1])]]
        y_tfidf_class+=[(np.sign(float(sp500red[j+1][-1])-float(sp500red[j][-1]))+1)/2.]
        j+=1

x_tfidf_class=np.array(x_tfidf_class)
y_tfidf_class=np.array(y_tfidf_class)
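
The label formula `(sign(diff)+1)/2` used above maps a rise to 1 and a fall to 0. A plain-Python sketch of the same mapping makes the one edge case visible; `direction_label` is a hypothetical name and the prices are made up.

```python
def direction_label(today_close, tomorrow_close):
    """Return 1.0 for a rise, 0.0 for a fall, and 0.5 when the close is unchanged."""
    diff = tomorrow_close - today_close
    sign = (diff > 0) - (diff < 0)  # same values as np.sign for finite inputs
    return (sign + 1) / 2.0

print(direction_label(1650.47, 1667.47))  # 1.0
print(direction_label(1667.47, 1650.47))  # 0.0
print(direction_label(1650.47, 1650.47))  # 0.5
```

An unchanged close would therefore create a third label, 0.5. I have not checked whether that ever happens in this date range, so treat it as a caveat rather than a known problem.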

In [1106]:
#dataset preparation (bag-of-words features, regression target = next trading day's close)
x_bow=[]
y_bow=[]
j=0
for i,tomorrow in enumerate(days[1:]):
    next_bizday=sp500red[j+1][0]
    if tomorrow==next_bizday:
        after_we=0.
        latest_bizday=sp500red[j][0]
        today=days[i]
        if latest_bizday!=today:
            after_we=1.
        x_bow+=[list(bow_dataset_df.iloc[i])+[after_we,float(sp500red[j][-1])]]
        y_bow+=[float(sp500red[j+1][-1])]
        j+=1

x_bow=np.array(x_bow)
y_bow=np.array(y_bow)

In [1125]:
key_cols=list(tfidf_dataset_df.columns)+['*weekend?','*yesterdayS&P']

In [1126]:
len(key_cols)

6607

In [1127]:
len(y_tfidf_class),len(x_tfidf_class[0])

(151, 6607)

In [1110]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from sklearn.linear_model import Lasso, Ridge, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPRegressor, MLPClassifier

## Trying out different regressors on the data, no luck so far :(

In [1111]:
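#scores compares thresholded predictions with the true labels and returns recall, precision, F1
#and the Jaccard index; thres is passed as a one-element list, and each (prediction, label) pair
#is encoded as 2*prediction-label, so that 1 = true positive, 2 = false positive,
#-1 = false negative (stored in the last slot of score), 0 = true negative

def scores(y,yhat,thres):
    score=[0,0,0,0]
    for ii in zip(yhat,y):
        temp=2*np.ceil(ii[0]-thres)-ii[1]
        for jj in temp:
            score[int(jj)]+=1
    fp=score[2]
    fn=score[-1]
    tp=score[1]
    tn=score[0]
    #print(fp,fn,tp,tn)
    if fn==0:
        rec=1.
    else:
        rec=tp/(tp+fn)
    if fp==0:
        prec=1.
    else:
        prec=tp/(tp+fp)
    if prec==0. and rec==0.:
        f1=0.
    else:
        f1=2*rec*prec/(rec+prec)
    if fp+fn==0:
        ji=1.
    else:
        ji=tp/(fp+fn+tp)
    return rec,prec,f1,ji


#kfold_val executes k-fold validation on a particular regressor model chosen by the user and
#prints the k-fold average rmse for the training and validation sets; kfold_test below refits on
#the whole train/validation set and prints the test rmse next to a benchmark "flat" model,
#which consists of predicting the same for tomorrow as today's closing

def kfold_val(n_folds_val,x_trainval,y_trainval,regressor,parm,seed):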
    kf_val = KFold(n_splits=n_folds_val,shuffle=True,random_state=seed)
    avg_rms_mod_val=0
    avg_rms_mod_train=0
    for train_index, val_index in kf_val.split(x_trainval):
        x_train, x_val = x_trainval[train_index], x_trainval[val_index]
        y_train, y_val = y_trainval[train_index], y_trainval[val_index]
        if regressor in {Lasso,Ridge}:
            model=regressor(alpha=parm)
        elif regressor in {RandomForestRegressor,}:
            model=regressor(n_estimators=parm[0],max_features=parm[1])
        elif regressor in {MLPRegressor,}:
            model=regressor(activation=parm[0],hidden_layer_sizes=parm[1])
        elif regressor in {AdaBoostRegressor,}:
            model=regressor(n_estimators=parm)
        elif regressor in {KNeighborsRegressor,}:
            model=regressor(n_neighbors=parm)  
        elif regressor in {SVR,}:
            model=regressor(C=parm[0],kernel=parm[1])  
        else:
            print('houston, we have an unknown model problem')
            return
        model.fit(x_train,y_train)
        avg_rms_mod_val+=np.sqrt((sum((model.predict(x_val)-y_val)**2)/len(y_val)))
        avg_rms_mod_train+=np.sqrt((sum((model.predict(x_train)-y_train)**2)/len(y_train)))
    avg_rms_mod_val=avg_rms_mod_val/n_folds_val
    avg_rms_mod_train=avg_rms_mod_train/n_folds_val
    print('avg_train_rmse:',avg_rms_mod_train,'avg_validation_rmse:',avg_rms_mod_val)
    return

def kfold_test(x_trainval,y_trainval,x_test,y_test,regressor,parm):
    #x_trainval, x_test, y_trainval, y_test = train_test_split(x, y,test_size=test_fraction)
    coeff=True
    if regressor in {Lasso,Ridge}:
            model=regressor(alpha=parm)
    elif regressor in {RandomForestRegressor,}:
            model=regressor(n_estimators=parm[0],max_features=parm[1],max_depth=parm[2])
            coeff=False
    elif regressor in {MLPRegressor,}:
            model=regressor(activation=parm[0],hidden_layer_sizes=parm[1])
            coeff=False
    elif regressor in {AdaBoostRegressor,}:
            model=regressor(n_estimators=parm)
            coeff=False
    elif regressor in {KNeighborsRegressor,}:
            model=regressor(n_neighbors=parm)  
            coeff=False
    elif regressor in {SVR,}:
            model=regressor(C=parm[0],kernel=parm[1])  
            coeff=False
    else:
            print('houston, we have an unknown model problem')
            return
    model.fit(x_trainval,y_trainval)

    rms_mod_test=np.sqrt((sum((model.predict(x_test)-y_test)**2)/len(y_test)))
    rms_rand_test=np.sqrt((sum((x_test[:,-1]-y_test)**2)/len(y_test)))
    print('model_test_rmse:',rms_mod_test,'flat_test_rmse:',rms_rand_test)
   
    if coeff:
        return model.coef_
    return

def kfold_val_class(n_folds_val,x_trainval,y_trainval,classifier,parm,seed,thres):
    #random_int=random.rand_int(1,1000)
    kf_val = KFold(n_splits=n_folds_val,shuffle=True,random_state=seed)
    avg_scores_train=np.zeros(3)
    avg_scores_val=np.zeros(3)
    for train_index, val_index in kf_val.split(x_trainval):
        x_train, x_val = x_trainval[train_index], x_trainval[val_index]
        y_train, y_val = y_trainval[train_index], y_trainval[val_index]
        if classifier in {LogisticRegression,}:
            model=classifier(penalty=parm[0], C=parm[1])
        elif classifier in {RandomForestClassifier,}:
            model=classifier(n_estimators=parm[0],max_features=parm[1],max_depth=parm[2])
        elif classifier in {MLPClassifier,}:
            model=classifier(activation=parm[0],hidden_layer_sizes=parm[1])
        elif classifier in {AdaBoostClassifier,}:
            model=classifier(n_estimators=parm)
        elif classifier in {KNeighborsClassifier,}:
            model=classifier(n_neighbors=parm)  
        elif classifier in {SVC,}:
            model=classifier(C=parm[0],kernel=parm[1])  
        else:
            print('houston, we have an unknown model problem')
            return
        
        model.fit(x_train,y_train)
        avg_scores_train+=np.array(scores(y_train,model.predict(x_train),[thres])[:3])
        avg_scores_val+=np.array(scores(y_val,model.predict(x_val),[thres])[:3])

    avg_scores_train=avg_scores_train/n_folds_val    
    avg_scores_val=avg_scores_val/n_folds_val
    print('avg_train_rec,prec,F1:',list(avg_scores_train),'avg_validation_rec,prec,F1:',list(avg_scores_val))
    return

def kfold_test_class(x_trainval,y_trainval,x_test,y_test,classifier,parm,thres):
    #x_trainval, x_test, y_trainval, y_test = train_test_split(x, y,test_size=test_fraction)
    coeff=True
    if classifier in {LogisticRegression,}:
        model=classifier(penalty=parm[0], C=parm[1])
    elif classifier in {RandomForestClassifier,}:
        model=classifier(n_estimators=parm[0],max_features=parm[1],max_depth=parm[2])
        coeff=False
    elif classifier in {MLPClassifier,}:
        model=classifier(activation=parm[0],hidden_layer_sizes=parm[1])
        coeff=False
    elif classifier in {AdaBoostClassifier,}:
        model=classifier(n_estimators=parm)
        coeff=False
    elif classifier in {KNeighborsClassifier,}:
        model=classifier(n_neighbors=parm)  
        coeff=False
    elif classifier in {SVC,}:
        model=classifier(C=parm[0],kernel=parm[1])  
        coeff=False
    else:
        print('houston, we have an unknown model problem')
        return
            
    model.fit(x_trainval,y_trainval)

    scores_test=np.array(scores(y_test,model.predict(x_test),[thres])[:3])
    #avg_scores_val+=np.array(scores(y_val,model.predict(x_val),[thres])[:3])
    print('test_rec,prec,F1:',list(scores_test))   
    if coeff:
        return model.coef_
    return
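
The "flat" benchmark printed by `kfold_test` predicts tomorrow's close to be today's close, which it reads from the last feature column. A self-contained sketch of that comparison, with made-up numbers and a hypothetical `rmse` helper:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Made-up closing values, for illustration only.
today_close = [100.0, 102.0, 101.0, 105.0]
tomorrow_close = [102.0, 101.0, 105.0, 104.0]
model_pred = [101.0, 102.5, 102.0, 104.5]  # output of some hypothetical model

flat_rmse = rmse(today_close, tomorrow_close)   # benchmark: tomorrow = today
model_rmse = rmse(model_pred, tomorrow_close)
print('model_test_rmse:', model_rmse, 'flat_test_rmse:', flat_rmse)
```

A model is only interesting if its test rmse is below the flat one on the same test set, which is the comparison reported throughout this notebook.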

In [1162]:
y_tfidf_regclass=np.array(list(zip(y_tfidf,y_tfidf_class)))
x_tfidf_trainval, x_tfidf_test, y_tfidf_regclass_trainval, y_tfidf_regclass_test = train_test_split(x_tfidf, y_tfidf_regclass,test_size=0.2)
y_tfidf_trainval=y_tfidf_regclass_trainval[:,0]
y_tfidf_test=y_tfidf_regclass_test[:,0]
y_tfidf_class_trainval=y_tfidf_regclass_trainval[:,1]
y_tfidf_class_test=y_tfidf_regclass_test[:,1]
print(len(x_tfidf_test[0]))

6607
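
`train_test_split` shuffles the rows by default, so test days are interleaved with training days and every re-run draws a different test set (which may be why `flat_test_rmse` differs between the runs in this notebook). One alternative, sketched here under the assumption that the rows are in chronological order, is to hold out the most recent days. `chronological_split` is a hypothetical helper.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows into (earlier, later) parts without shuffling."""
    n_test = max(1, int(round(len(rows) * test_fraction)))
    return rows[:-n_test], rows[-n_test:]

rows = list(range(151))  # stand-in for 151 time-ordered samples
trainval, test = chronological_split(rows)
print(len(trainval), len(test))  # 121 30
```

The same slicing works on numpy arrays, so it could be applied to `x_tfidf` and `y_tfidf_regclass` together as long as both are cut at the same index.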


In [1131]:
seed=random.randint(1,10000)
for n_neigh in range(15,30):
    print('number of neighbors used =',n_neigh)
    kfold_val_class(10,x_tfidf_trainval,y_tfidf_class_trainval,KNeighborsClassifier,n_neigh,seed,0.5)
#for C in [1.0+0.1*i for i in range(-5,5)]:
#    print('C=',C)
#    kfold_val_class(11,x_tfidf_class_trainval,y_tfidf_class_trainval,SVC,[C,'poly'],0.5)

number of neighbors used = 15
avg_train_rec,prec,F1: [0.8926857734013135, 0.68977732715017093, 0.77741402257869652] avg_validation_rec,prec,F1: [0.78452380952380951, 0.60476911976911985, 0.66895572263993308]
number of neighbors used = 16
avg_train_rec,prec,F1: [0.84211418373841551, 0.70555993737259715, 0.76651441976819235] avg_validation_rec,prec,F1: [0.77023809523809528, 0.67176767676767679, 0.70201359356158111]
number of neighbors used = 17
avg_train_rec,prec,F1: [0.8941309948465348, 0.68426420811252431, 0.77435015298160526] avg_validation_rec,prec,F1: [0.8320238095238095, 0.60725108225108226, 0.69186299081035929]
number of neighbors used = 18
avg_train_rec,prec,F1: [0.8645346678489787, 0.69705512138400816, 0.77088738445239946] avg_validation_rec,prec,F1: [0.79345238095238091, 0.61661616161616162, 0.6817143159558019]
number of neighbors used = 19
avg_train_rec,prec,F1: [0.91040411739270388, 0.66973235481814553, 0.77083834364492021] avg_validation_rec,prec,F1: [0.86246031746031748, 0.

In [1132]:
kfold_test_class(x_tfidf_trainval,y_tfidf_class_trainval,x_tfidf_test,y_tfidf_class_test,KNeighborsClassifier,27,0.5)
#kfold_test_class(x_tfidf_class_trainval,y_tfidf_class_trainval,x_tfidf_class_test,y_tfidf_class_test,SVC,[0.8,'poly'],0.5)

test_rec,prec,F1: [1.0, 0.51724137931034486, 0.68181818181818188]


In [1138]:
seed=random.randint(1,10000)
for C in [2.35+.01*i for i in range(-10,10)]:
    print('penalty = l1, C =',C)
    kfold_val_class(10,x_tfidf_trainval,y_tfidf_class_trainval,LogisticRegression,['l1',C],seed,0.5)

penalty = l1, C = 2.25
avg_train_rec,prec,F1: [0.9970149253731343, 0.63318221342405268, 0.77442941554212519] avg_validation_rec,prec,F1: [0.98750000000000004, 0.62196969696969695, 0.75867315347191489]
penalty = l1, C = 2.2600000000000002
avg_train_rec,prec,F1: [0.9970149253731343, 0.63382286572457691, 0.77489433523781404] avg_validation_rec,prec,F1: [0.98750000000000004, 0.62196969696969695, 0.75867315347191489]
penalty = l1, C = 2.27
avg_train_rec,prec,F1: [0.9970149253731343, 0.63382286572457691, 0.77489433523781404] avg_validation_rec,prec,F1: [0.97638888888888897, 0.61969696969696975, 0.75295886775762921]
penalty = l1, C = 2.2800000000000002
avg_train_rec,prec,F1: [0.9970149253731343, 0.63382286572457691, 0.77489433523781404] avg_validation_rec,prec,F1: [0.97638888888888897, 0.61969696969696975, 0.75295886775762921]
penalty = l1, C = 2.29
avg_train_rec,prec,F1: [0.9970149253731343, 0.63382286572457691, 0.77489433523781404] avg_validation_rec,prec,F1: [0.97638888888888897, 0.6196969

In [1144]:
aa=kfold_test_class(x_tfidf_trainval,y_tfidf_class_trainval,x_tfidf_test,y_tfidf_class_test,LogisticRegression,['l1',2.35],0.5)

test_rec,prec,F1: [1.0, 0.4838709677419355, 0.65217391304347827]


In [896]:
#model=LogisticRegression(penalty='l2', C=1.) #, dual=False, tol=0.0001,, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
#model=RandomForestClassifier(n_estimators=10,max_features=4000,max_depth=5)#,max_features=7000)
#model=AdaBoostClassifier(n_estimators=20)
#model=MLPClassifier(activation='logistic',hidden_layer_sizes=(100,10,5))
#model=KNeighborsClassifier(n_neighbors=15)
model=SVC(C=.8,kernel='poly')
model.fit(x_tfidf_class_trainval,y_tfidf_class_trainval)
#model.fit(x_bow_class_trainval,y_bow_class_trainval)

SVC(C=0.8, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [1045]:
model.predict(x_tfidf_class_test),y_tfidf_class_test
#model.predict(x_bow_class_test),y_bow_class_test

(array([ 1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,
         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,
         1.,  1.,  1.,  1.,  1.]),
 array([ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,
         1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,
         1.,  0.,  0.,  0.,  0.]))

In [1122]:
len(ab)

7250

In [1102]:
len(key_cols)

6607

In [1175]:
#seed=random.randint(1,10000)
seed=1
for alpha in [0.085+0.0001*i for i in range(-10,10)]:
    print('alpha =',alpha)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,Lasso,alpha,seed)

alpha = 0.084
avg_train_rmse: 10.3480002029 avg_validation_rmse: 11.5653321874
alpha = 0.08410000000000001
avg_train_rmse: 10.3506782103 avg_validation_rmse: 11.5654101171
alpha = 0.08420000000000001
avg_train_rmse: 10.3533588021 avg_validation_rmse: 11.5654879158
alpha = 0.0843
avg_train_rmse: 10.3560419379 avg_validation_rmse: 11.5655655445
alpha = 0.0844
avg_train_rmse: 10.3587274889 avg_validation_rmse: 11.5656447032
alpha = 0.0845
avg_train_rmse: 10.3614157333 avg_validation_rmse: 11.5657240254
alpha = 0.08460000000000001
avg_train_rmse: 10.3641062877 avg_validation_rmse: 11.5658045491
alpha = 0.08470000000000001
avg_train_rmse: 10.3667992053 avg_validation_rmse: 11.5658861134
alpha = 0.0848
avg_train_rmse: 10.3694950168 avg_validation_rmse: 11.5659672315
alpha = 0.0849
avg_train_rmse: 10.3721929615 avg_validation_rmse: 11.5660490418
alpha = 0.085
avg_train_rmse: 10.3748932686 avg_validation_rmse: 11.566132381
alpha = 0.08510000000000001
avg_train_rmse: 10.3775961322 avg_validatio

In [1173]:
aa=kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,Lasso,0.085)

model_test_rmse: 13.86624214 flat_test_rmse: 13.3308619043


In [536]:
for alpha in [2986.+1.*i for i in range(-10,10)]:
    print(alpha)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,Ridge,alpha,seed)

2976.0
avg_train_rmse: 12.3144325135 avg_validation_rms: 11.8707710762
2977.0
avg_train_rmse: 12.3144413631 avg_validation_rms: 11.8707710511
2978.0
avg_train_rmse: 12.3144502141 avg_validation_rms: 11.8707710287
2979.0
avg_train_rmse: 12.3144590666 avg_validation_rms: 11.870771009
2980.0
avg_train_rmse: 12.3144679206 avg_validation_rms: 11.8707709922
2981.0
avg_train_rmse: 12.314476776 avg_validation_rms: 11.870770978
2982.0
avg_train_rmse: 12.3144856329 avg_validation_rms: 11.8707709667
2983.0
avg_train_rmse: 12.3144944912 avg_validation_rms: 11.870770958
2984.0
avg_train_rmse: 12.3145033511 avg_validation_rms: 11.8707709522
2985.0
avg_train_rmse: 12.3145122123 avg_validation_rms: 11.8707709491
2986.0
avg_train_rmse: 12.3145210751 avg_validation_rms: 11.8707709488
2987.0
avg_train_rmse: 12.3145299393 avg_validation_rms: 11.8707709512
2988.0
avg_train_rmse: 12.3145388049 avg_validation_rms: 11.8707709563
2989.0
avg_train_rmse: 12.3145476721 avg_validation_rms: 11.8707709643
2990.0
avg

In [1147]:
aa=kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,Ridge,2986.)

model_test_rmse: 11.8566884105 flat_test_rmse: 11.5372027387


In [1158]:
seed=random.randint(1,10000)
for n_neigh in range(1,16):
    print('number of neighbors used =',n_neigh)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,KNeighborsRegressor,n_neigh,seed)

number of neighbors used = 1
avg_train_rmse: 0.0 avg_validation_rmse: 18.7346395695
number of neighbors used = 2
avg_train_rmse: 9.21099271338 avg_validation_rmse: 15.9749167616
number of neighbors used = 3
avg_train_rmse: 10.8866941836 avg_validation_rmse: 14.7884607199
number of neighbors used = 4
avg_train_rmse: 11.5524183245 avg_validation_rmse: 14.2840436635
number of neighbors used = 5
avg_train_rmse: 11.6823533715 avg_validation_rmse: 13.5676823895
number of neighbors used = 6
avg_train_rmse: 11.7221060156 avg_validation_rmse: 13.1512017149
number of neighbors used = 7
avg_train_rmse: 11.7996893153 avg_validation_rmse: 13.2590311689
number of neighbors used = 8
avg_train_rmse: 11.9868059969 avg_validation_rmse: 13.2673248028
number of neighbors used = 9
avg_train_rmse: 12.2079140861 avg_validation_rmse: 13.5468349003
number of neighbors used = 10
avg_train_rmse: 12.4582087268 avg_validation_rmse: 13.4878606957
number of neighbors used = 11
avg_train_rmse: 12.6908831997 avg_valid

In [1159]:
aa=kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,KNeighborsRegressor,6)

model_test_rmse: 12.4388299317 flat_test_rmse: 11.5372027387


In [552]:
for n_max in range(7341,7342):
    for n_estim in range(10,25):
        print(n_max)
        print(n_estim)
        kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,RandomForestRegressor,[n_estim,n_max])

7341
10
avg_train_rmse: 6.56125169698 avg_validation_rms: 14.7165505646
7341
11
avg_train_rmse: 6.64534840047 avg_validation_rms: 14.1072301852
7341
12
avg_train_rmse: 6.33431200117 avg_validation_rms: 14.4841635548
7341
13
avg_train_rmse: 6.32376295083 avg_validation_rms: 15.016477313
7341
14
avg_train_rmse: 6.24080044741 avg_validation_rms: 14.668779199
7341
15
avg_train_rmse: 6.33693202241 avg_validation_rms: 13.7060000257
7341
16
avg_train_rmse: 6.08062693388 avg_validation_rms: 14.1038339637
7341
17
avg_train_rmse: 6.156247695 avg_validation_rms: 14.0256138444
7341
18
avg_train_rmse: 6.07984624958 avg_validation_rms: 14.3214477327
7341
19
avg_train_rmse: 6.07170529101 avg_validation_rms: 13.9314306097
7341
20
avg_train_rmse: 6.09830695159 avg_validation_rms: 13.6172442248
7341
21
avg_train_rmse: 5.87636029423 avg_validation_rms: 13.7587788516
7341
22
avg_train_rmse: 5.96883876264 avg_validation_rms: 14.3537764736
7341
23
avg_train_rmse: 6.03990780636 avg_validation_rms: 13.7969236

In [546]:
kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,RandomForestRegressor,[20,7341])

model_test_rms: 11.350971682 flat_test_rms: 11.2206389071


In [520]:
kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,MLPRegressor,['relu',(180,)])

avg_train_rmse: 15.3008899908 avg_validation_rms: 14.4453931395


In [558]:
for n_trees in range(1,35):
    print(n_trees)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,AdaBoostRegressor,n_trees)

1
avg_train_rmse: 15.3641291893 avg_validation_rmse: 19.3329761201
2
avg_train_rmse: 15.9220985727 avg_validation_rmse: 20.6843288173
3
avg_train_rmse: 11.3721752657 avg_validation_rmse: 16.3089140335
4
avg_train_rmse: 11.5384519305 avg_validation_rmse: 17.0213275843
5
avg_train_rmse: 9.8236857101 avg_validation_rmse: 16.562585481
6
avg_train_rmse: 10.1176905109 avg_validation_rmse: 15.7393770592
7
avg_train_rmse: 9.36440456592 avg_validation_rmse: 14.9201571742
8
avg_train_rmse: 9.3013977557 avg_validation_rmse: 15.5831706796
9
avg_train_rmse: 8.78863932644 avg_validation_rmse: 15.4780987087
10
avg_train_rmse: 8.86877017565 avg_validation_rmse: 15.3959011871
11
avg_train_rmse: 8.61072739689 avg_validation_rmse: 15.9081479263
12
avg_train_rmse: 8.54047018929 avg_validation_rmse: 15.4030816243
13
avg_train_rmse: 8.42922454503 avg_validation_rmse: 14.8967134826
14
avg_train_rmse: 8.31132330845 avg_validation_rmse: 14.3268475408
15
avg_train_rmse: 8.22950766821 avg_validation_rmse: 14.665

In [559]:
kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,AdaBoostRegressor,23)

model_test_rmse: 14.1840667021 flat_test_rmse: 11.2206389071


In [424]:
from keras.layers import Convolution2D, MaxPooling2D, Input,ZeroPadding2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.models import Sequential, Model, model_from_json
from keras.layers.advanced_activations import LeakyReLU
from keras.regularizers import l1,l2,l1l2
from keras.optimizers import Nadam, Adagrad

#linear regressor
inputsred=Input(shape=(len(x[0]),))

#xo=Dense(100,activation='relu',W_regularizer=l1(0.005))(inputsred)
#xo=LeakyReLU()(xo)
#xo=Dropout(0.1)(xo)
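# NOTE: the relu output clips predictions at zero; a strictly linear regressor would use activation='linear'
predsred=Dense(1, activation='relu',W_regularizer=l1(0.005))(inputsred)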

modelDred = Model(input=inputsred, output=predsred)

nadam=Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08, schedule_decay=0.004)
adagrad=Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)

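# NOTE: 'accuracy' is not meaningful for a continuous target, which is why acc stays at 0 in the training log
modelDred.compile(optimizer=adagrad,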
              loss='mean_squared_error',
              metrics=['accuracy'])

In [417]:
x_train,x_val,y_train,y_val=train_test_split(x,y,test_size=0.1)

In [473]:
# validation_data takes precedence over validation_split, so only the explicit hold-out set is used
modelDred.fit(x_train,y_train, batch_size=57, nb_epoch=500, verbose=2,
          callbacks=[], validation_data=(x_val,y_val),
              shuffle=True, class_weight=None)

Train on 97 samples, validate on 11 samples
Epoch 1/500
0s - loss: 172.5491 - acc: 0.0000e+00 - val_loss: 206.9940 - val_acc: 0.0000e+00
Epoch 2/500
0s - loss: 172.5445 - acc: 0.0000e+00 - val_loss: 206.9854 - val_acc: 0.0000e+00
Epoch 3/500
0s - loss: 172.5419 - acc: 0.0000e+00 - val_loss: 206.9785 - val_acc: 0.0000e+00
Epoch 4/500
0s - loss: 172.5372 - acc: 0.0000e+00 - val_loss: 206.9696 - val_acc: 0.0000e+00
Epoch 5/500
0s - loss: 172.5332 - acc: 0.0000e+00 - val_loss: 206.9611 - val_acc: 0.0000e+00
Epoch 6/500
0s - loss: 172.5293 - acc: 0.0000e+00 - val_loss: 206.9532 - val_acc: 0.0000e+00
Epoch 7/500
0s - loss: 172.5258 - acc: 0.0000e+00 - val_loss: 206.9429 - val_acc: 0.0000e+00
Epoch 8/500
0s - loss: 172.5216 - acc: 0.0000e+00 - val_loss: 206.9315 - val_acc: 0.0000e+00
Epoch 9/500
0s - loss: 172.5155 - acc: 0.0000e+00 - val_loss: 206.9224 - val_acc: 0.0000e+00
Epoch 10/500
0s - loss: 172.5101 - acc: 0.0000e+00 - val_loss: 206.9132 - val_acc: 0.0000e+00
Epoch 11/500
0s - loss: 1

<keras.callbacks.History at 0x11c261588>

In [376]:
#descriptive names
df=pd.read_csv('20130401.export.CSV',delimiter='\t')

In [377]:
header1=pd.read_csv('CSV.header.dailyupdates.txt',delimiter='\t')

In [378]:
df.columns=list(header1)

In [382]:
df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,253461012,20030404,200304,2003,2003.2575,AUS,AUSTRALIA,AUS,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.4199,-117.122,NV,20130401,http://www.startribune.com/nation/200818961.html


In [379]:
df_with_dummies = pd.get_dummies(df[feat_columns], columns = cat_columns )
df_with_dummies.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,EventCode_1723,EventCode_1724,EventCode_1821,EventCode_1822,EventCode_1823,EventCode_1831,QuadClass_1,QuadClass_2,QuadClass_3,QuadClass_4
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [380]:
vectorizer = CountVectorizer(min_df=1,tokenizer=my_tokenizer)
corpus=[df.iloc[i,-1] for i in range(len(df))]
X = vectorizer.fit_transform(corpus)
Y=X.toarray()
for i,col in enumerate(vectorizer.get_feature_names()):
    df_with_dummies[col]=pd.DataFrame(Y[:,i])

In [381]:
vectorizertfidf = TfidfVectorizer(min_df=1,tokenizer=my_tokenizer)
Xtfidf = vectorizertfidf.fit_transform(corpus)
Ytfidf=Xtfidf.toarray()
for i,col in enumerate(vectorizertfidf.get_feature_names()):
    df_with_dummies['tfidf'+col]=pd.DataFrame(Ytfidf[:,i])

In [383]:
feat_df=df_with_dummies.iloc[:,0:10401]

In [385]:
feat_df.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,zealand,zealotri,zeidan,zelda,zhiggkoea,ziivhmez_uxlgpnlo,zikir,zipwir,zmnmbcosjccynudfnuig,zoo
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [384]:
feattfidf_df=df_with_dummies.iloc[:,list(range(0,5345))+list(range(10401,15457))]

In [386]:
feattfidf_df.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,tfidfzealand,tfidfzealotri,tfidfzeidan,tfidfzelda,tfidfzhiggkoea,tfidfziivhmez_uxlgpnlo,tfidfzikir,tfidfzipwir,tfidfzmnmbcosjccynudfnuig,tfidfzoo
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [436]:
10401*len(feattfidfRED_df)

1040100

In [434]:
feattfidfRED_df=(feattfidf_df.sort_values('NumMentions', ascending=False))[0:100]

In [435]:
feattfidfRED_df

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,tfidfzealand,tfidfzealotri,tfidfzeidan,tfidfzelda,tfidfzhiggkoea,tfidfziivhmez_uxlgpnlo,tfidfzikir,tfidfzipwir,tfidfzmnmbcosjccynudfnuig,tfidfzoo
3196,2013.2493,0.0,643,3.227707,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18436,2013.2493,3.0,384,4.111273,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12787,2013.2493,-4.0,372,3.989697,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6969,2013.2493,3.0,299,1.849065,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2468,2013.2493,-7.2,290,1.656668,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6971,2013.2493,-0.3,280,1.674708,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24163,2013.2493,-10.0,275,1.254327,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1489,2013.2493,-10.0,270,3.960111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15388,2013.2493,0.0,256,1.609864,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6882,2013.2493,-10.0,246,3.980200,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


[(0, 'Adj Close'), (961, '1570.25')]

In [427]:
float(sp500[961][-1])

1570.25

In [428]:
float(sp500[960][-1])

1553.689941

In [293]:
# stem each token extracted from the URL of row 3
for word in my_tokenizer(df.iloc[3,-1]):
    print(porter.stem(word))

australia
peopl
smuggl
rise


In [280]:
ddf.head()

Unnamed: 0,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1Type1Code,Actor2Code,Actor2Name,Actor2CountryCode,Actor2Type1Code,EventCode,QuadClass,GoldsteinScale,NumMentions,AvgTone,url
0,2003.2575,AUS,AUSTRALIA,AUS,,CVL,MIGRANT,,CVL,43,1,2.8,10,2.222222,"[australia, people, smuggling, rising]"
1,2003.2575,BUS,SHOP OWNER,,BUS,CVL,NEIGHBORHOOD,,CVL,172,4,-5.0,8,2.167369,"[hong, kong, businesses, vanish, rents, soar, ..."
2,2003.2575,BUS,SHOP OWNER,,BUS,CVL,NEIGHBORHOOD,,CVL,172,4,-5.0,2,2.167369,"[hong, kong, businesses, vanish, rents, soar, ..."
3,2003.2575,CVL,MIGRANT,,CVL,AUS,AUSTRALIA,AUS,,42,1,1.9,10,2.222222,"[australia, people, smuggling, rising]"
10,2012.2493,,,,,BUS,COMPANY,,BUS,20,1,3.0,10,3.521127,"[pakistans, ambitious, program, educate, milit..."
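
Each row above pairs a token list (`url`) with a mention count (`NumMentions`). The introduction describes merging a day's events into one bag of words weighted by mentions. The sketch below shows one way that weighting could look. The function name `weighted_bag_of_words` is hypothetical, and the third event is a made-up row added only to show differing weights.

```python
from collections import Counter


def weighted_bag_of_words(events):
    """Sum token counts over events, weighting each event by its mention count."""
    bag = Counter()
    for tokens, num_mentions in events:
        for token in tokens:
            bag[token] += num_mentions
    return bag


events = [
    (["australia", "people", "smuggling", "rising"], 10),  # row 0 above
    (["australia", "people", "smuggling", "rising"], 10),  # row 3 above
    (["australia", "election"], 2),                         # made-up row
]
bag = weighted_bag_of_words(events)
print(bag["australia"], bag["election"])
```

The resulting counter is one "document" per day and could then be passed to a bag-of-words or tf-idf step.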


In [68]:
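# in practice this splits on '/' only: the '"."' alternative never matches a URL, so dots are left intact
a=re.split(r'"."|/',df.iloc[i,-1])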

In [69]:
a

['http:',
 '',
 'www.channelnewsasia.com',
 'news',
 'world',
 'us-urges-serbia-kosovo-to-reach-agreemen',
 '624136.html']

In [266]:
df.loc[[1,3,4],['FractionDate','Actor1Code']]

Unnamed: 0,FractionDate,Actor1Code
1,2003.2575,BUS
3,2003.2575,CVL
4,2003.2575,HLH


In [270]:
pd.Series([[2,3],[1],[1,2]])

0    [2, 3]
1       [1]
2    [1, 2]
dtype: object

In [279]:
ddf['url']=my_ser

In [216]:
from sklearn.preprocessing import OneHotEncoder

In [206]:
en=OneHotEncoder()
en.fit([[0,1],[3,np.nan]])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [203]:
en.transform([[2,0]]).toarray()

array([[ 0.,  0.,  1.,  0.]])

In [275]:
mask=[]
ser=[]
for i in range(len(df)):
    url=df.iloc[i,-1]
    #print(url)
    c=[]
    d=[]
    if url!='BBC Monitoring':
        #print(str(i)+'=========')
        a=urlparse(url)[2].split('.')[0].split('/')[-1]
        b = re_tokenizer.tokenize(a.lower())
        for word in b:
            c+=[punctuation.sub("", word)]
        for word in c:
            if word not in stop_words:
                d+=[word]
        if len(d)>1:
            mask+=[i]
            ser+=[d]
            #print(d)
#print(mask)
my_ser=pd.Series(ser)
#print(my_ser)
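
The loop above keeps the last path segment of each URL, lowercases it, splits it into words, and drops stop words. A compact standalone version of the same idea might look like the sketch below. It is an approximation and not the notebook's `my_tokenizer`: the function name `url_slug_tokens` is hypothetical, `STOP_WORDS` is a tiny placeholder set, and a plain regular expression stands in for `re_tokenizer` and `punctuation`.

```python
import re
from urllib.parse import urlparse

# Placeholder stop-word set, assumed for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def url_slug_tokens(url):
    """Return lowercase word tokens from the last path segment of a URL."""
    path = urlparse(url).path
    # Drop everything after the first dot (e.g. ".html"), then keep the last segment.
    slug = path.split(".")[0].split("/")[-1]
    words = re.findall(r"[a-z0-9]+", slug.lower())
    return [w for w in words if w not in STOP_WORDS]


url = "http://www.bangkokpost.com/breakingnews/343522/australia-people-smuggling-rising"
print(url_slug_tokens(url))
```

As in the loop above, a caller would likely discard URLs that yield fewer than two tokens, since numeric slugs such as `200818961.html` carry no usable words.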

In [88]:
for i in range(1):
    stro=df.iloc[i,-1].split('.')[-2:]
    stri=""
    if len(stro)==2:
        if len(stro[0]) > len(stro[1]):
            stri=stro[0]
        elif len(stro[0])<len(stro[1]):
            stri=stro[1]
        else:
            stri="**"+stro[0]+stro[1]
    print(stri)
    print(stro)

com/breakingnews/343522/australia-people-smuggling-rising
['bangkokpost', 'com/breakingnews/343522/australia-people-smuggling-rising']


In [3]:
isinstance([3,2].all,int)

AttributeError: 'list' object has no attribute 'all'

In [7]:
gde=nlpp.CorpusGDELT()

In [14]:
gde.url_tokenizer('www.fakesite.net/hey-dude/poo')

[]

In [19]:
from urllib.parse import urlparse


In [33]:
urlparse('http://www.globalsecurity.org/breakingnews/343522/australia-people-smuggling-rising')

ParseResult(scheme='http', netloc='www.globalsecurity.org', path='/breakingnews/343522/australia-people-smuggling-rising', params='', query='', fragment='')

In [26]:
gde.load_urls('20130401','20130401')

loading news for 20130401


In [34]:
11/38

0.2894736842105263

In [35]:
27/38

0.7105263157894737