# Preamble

This notebook is trying to predict future S&P500 based on past S&P500 values along with NLP features extracted from the daily updted GDELT 1.0 event database.

I extract features from the urls contained in the database.

For each day, all urls get parsed, tokenized, and stemmed and conflated together into a single bag of words (this is one document), weighted on the number of mentions of the event related to each specific url.

After that I may or may not apply a tdf-idf vectorization or stick with bag of words.

I use the extracted features (plus the same day's closing S&P500) to try and fit various regression models to predict the next day's S&P500 and compare them to the flat model, i.e. predicting the same for tomorrow as today.

The flat model is still the best performing, unfortunately.

In [1]:
import os
import csv
import pandas as pd
import nltk
import re
import numpy as np
from nltk.stem.porter import PorterStemmer
from urllib.parse import urlparse
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [356]:
#downloading and unzipping, run at your own risk, contains dreadful shell commands
for date in range(20131001,20131032):
    os.system('wget http://data.gdeltproject.org/events/'+str(date)+'.export.CSV.zip')
    os.system('unzip '+str(date)+'.export.CSV.zip')
    os.system('mv '+str(date)+'.export.CSV data/GDELT_1.0')
    os.system('rm '+str(date)+'.export.CSV.zip')

In [11]:
!ls -hl data/GDELT_1.0/20130401.export.CSV

-rw-r--r--  1 Maxos  staff    10M May 20  2013 data/GDELT_1.0/20130401.export.CSV


In [38]:
header_daily=pd.read_csv('data/GDELT_1.0/CSV.header.dailyupdates.txt',delimiter='\t')

In [39]:
#this is just to show what the GDELT files look like
sample_df=pd.read_csv('data/GDELT_1.0/20130401.export.CSV',delimiter='\t')
sample_df.columns=list(header_daily)
sample_df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,253461012,20030404,200304,2003,2003.2575,AUS,AUSTRALIA,AUS,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.4199,-117.122,NV,20130401,http://www.startribune.com/nation/200818961.html


In [116]:
re_tokenizer = RegexpTokenizer(r'\w+')
punctuation = re.compile(r'[-.?!,":;()|0-9]')
stop_words = set(stopwords.words('english')+[""])
porter = PorterStemmer()


def url_tokenizer(url):
    c,d,e=[],[],[]
    if url!='BBC Monitoring':
        a=urlparse(url)[2].split('.')[0].split('/')[-1]
        b = re_tokenizer.tokenize(a.lower())
        for word in b:
            c+=[punctuation.sub("", word)]
        for word in c:
            if word not in stop_words:
                d+=[word]
        if len(d)<=1:
            return []
        for word in d:
            stemtemp=porter.stem(word)
            if len(stemtemp)>1 and "_" not in stemtemp and len(stemtemp)<20 and len(set(stemtemp))>1 and len(stemtemp)-len(set(stemtemp))<5:
                e+=[stemtemp]
    return e

def wrapper_tokenizer(url_doc):
    wordlist=[]
    for url in url_doc:
        for mentions in range(url[0]):
            wordlist+=url_tokenizer(url[1])
    return wordlist

In [117]:
url_tokenizer('http://iosdevelopertips.com/bash/bash-trick-file-sizes-byte-kilobyte-megabyte-gigabyte.html'),url_tokenizer('http://alexgude.com/blog/software-testing-for-data-science')

(['bash', 'trick', 'file', 'size', 'byte', 'kilobyt', 'megabyt', 'gigabyt'],
 ['softwar', 'test', 'data', 'scienc'])

In [22]:
 wrapper_tokenizer([[3,'http://iosdevelopertips.com/bash/bash-trick-file-sizes-byte-kilobyte-megabyte-gigabyte.html']
                    ,[1,'http://alexgude.com/blog/software-testing-for-data-science']])

['bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'bash',
 'trick',
 'file',
 'size',
 'byte',
 'kilobyt',
 'megabyt',
 'gigabyt',
 'softwar',
 'test',
 'data',
 'scienc']

In [58]:
def vocabularycreator(date1,date2,cutoff_numb,save=False):
    word_corpus=set([])
    for date in range(date1,date2):
        df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
        df.columns=list(header_daily)
        df=df.sort_values('NumMentions', ascending=False)
        for i in range(cutoff_numb):
            word_corpus=word_corpus.union(set(url_tokenizer(df.iloc[i,-1])))
        del df
    if save:
        print("Sorry, I haven't implemented this feature yet")
    return word_corpus

def corpuscreator_url(date1,date2,cutoff_numb,save=False):
    url_corpus=[]
    for date in range(date1,date2):
        df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
        df.columns=list(header_daily)
        df=df.sort_values('NumMentions', ascending=False)
        url_doc=[]
        for i in range(cutoff_numb):
            url_doc+=[[df['NumMentions'][i],df.iloc[i,-1]]]
        url_corpus+=[url_doc]
    if save:
        print("Sorry, I haven't implemented a saving feature yet")
    return url_corpus

In [52]:
#creating the corpus of urls (and number of mentions) by reading over all csv files, takes a while, not very efficient
corpus_url=corpuscreator_url(20130401,20130431,100)+corpuscreator_url(20130501,20130532,100)+corpuscreator_url(20130601,20130631,100)+corpuscreator_url(20130701,20130732,100)+corpuscreator_url(20130801,20130832,100)+corpuscreator_url(20130901,20130931,100)

  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_

In [357]:
corpus_url+=corpuscreator_url(20131001,20131032,100)

  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):
  if self.run_code(code, result):


In [59]:
#other features that I intend to use, but right now I'm just only using URLs
feat_columns=['FractionDate','Actor1Code','Actor1Name','Actor1CountryCode','Actor1Type1Code','Actor2Code',
              'Actor2Name','Actor2CountryCode','Actor2Type1Code','EventCode','QuadClass','GoldsteinScale',
              'NumMentions','AvgTone']
#out of which, categorical are
cat_columns=['Actor1Code','Actor1Name','Actor1CountryCode','Actor1Type1Code','Actor2Code','Actor2Name',
             'Actor2CountryCode','Actor2Type1Code','EventCode','QuadClass']


#def preprocess(date,corp,cutoff_numb,fcol=feat_columns,ccol=cat_columns,tfidf=False):
#    df=pd.read_csv('data/GDELT_1.0/'+str(date)+'.export.CSV',delimiter='\t')
#    df.columns=list(header_daily)
#    df=(df.sort_values('NumMentions', ascending=False))[0:cutoff_numb]
#    df_with_dummies = pd.get_dummies(df[fcol],columns=ccol)
#    if tfidf:
#        vectorizer = TfidfVectorizer(min_df=1,tokenizer=url_tokenizer)
#    else:
#        vectorizer = CountVectorizer(min_df=1,tokenizer=url_tokenizer)
#    X = vectorizer.fit_transform(corp)
#    Y=X.toarray()
#    for i,col in enumerate(vectorizer.get_feature_names()):
#        df_with_dummies[col]=pd.DataFrame(Y[:,i])
#    return df_with_dummies

#this is all about preprocessing the lists of words and vectorize them, possibly applying tfidf

def preprocess_red(corp,tfidf=False):
    if tfidf:
        vectorizer = TfidfVectorizer(min_df=1,tokenizer=wrapper_tokenizer,lowercase=False)
    else:
        vectorizer = CountVectorizer(min_df=1,tokenizer=wrapper_tokenizer,lowercase=False)
    X = vectorizer.fit_transform(corp)
    Y=X.toarray()
    dictionary={col:Y[:,i] for i,col in enumerate(vectorizer.get_feature_names())}
    return pd.DataFrame(dictionary)

In [358]:
#these are the feature dataframes, they contain bag of words or tf-idf vectorization of every single document
#(e.g. one full day of news)
bow_dataset_df=preprocess_red(corpus_url)
tfidf_dataset_df=preprocess_red(corpus_url,tfidf=True)

In [119]:
bow_dataset_df.head()

Unnamed: 0,aab,aad,aada,aadb,aae,aaf,aafeefc,aaron,aarriv,ab,...,zimvwodawr,zipwir,zmsm,zone,zoo,zookeep,zs,zuckerberg,zuma,zzg
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,10,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [84]:
list(bow_dataset_df.columns)

['aab',
 'aad',
 'aada',
 'aadb',
 'aae',
 'aaeaeafba',
 'aaf',
 'aafeefc',
 'aaron',
 'aarriv',
 'ab',
 'abandon',
 'abb',
 'abba',
 'abc',
 'abccb',
 'abcf',
 'abcfa',
 'abd',
 'abdic',
 'abdoul',
 'abduct',
 'abductor',
 'abe',
 'abeb',
 'abedf',
 'abf',
 'abfbd',
 'abfff',
 'abid',
 'abil',
 'abl',
 'ablyazov',
 'aboard',
 'abort',
 'abound',
 'abramson',
 'abrio',
 'abroad',
 'abu',
 'abus',
 'abuzz',
 'abyei',
 'ac',
 'aca',
 'acabd',
 'acapulco',
 'acb',
 'acbb',
 'acbd',
 'acbeccacfa',
 'acbecefb',
 'acbff',
 'acc',
 'accbcdd',
 'acceler',
 'accept',
 'access',
 'accid',
 'accident',
 'account',
 'accus',
 'ace',
 'acea',
 'acetaminophen',
 'acf',
 'acfba',
 'acfcb',
 'ach',
 'acid',
 'acknowledg',
 'acquisit',
 'acquit',
 'acquitt',
 'across',
 'act',
 'action',
 'activ',
 'activist',
 'actress',
 'ad',
 'ada',
 'adafebc',
 'adapt',
 'adawiya',
 'adb',
 'adc',
 'adca',
 'add',
 'addc',
 'addec',
 'address',
 'ade',
 'adebolajo',
 'adeedcee',
 'adf',
 'adfcdac',
 'adjourn',
 'a

In [79]:
[[ii,len(ii)-len(set(ii)),len(set(ii))] for ii in list(bow_dataset_df.columns) if len(ii)-len(set(ii))>=4]

[['aaeaeafba', 5, 4],
 ['abaabfdfbfafb', 9, 4],
 ['abeddabfcefb', 6, 6],
 ['acbeccacfa', 5, 5],
 ['accfbcaedfafc', 7, 6],
 ['adafeafbbeedbebc', 10, 6],
 ['addcddcadcd', 8, 3],
 ['adeedcee', 4, 4],
 ['aleqmhiewnkjpnkuwl', 5, 13],
 ['alexanderbrep', 4, 9],
 ['alfhiigupfzueuhvinq', 6, 13],
 ['almfylcsslzmrz', 5, 9],
 ['americanairlin', 6, 8],
 ['anniversari', 4, 7],
 ['assassin', 4, 4],
 ['autonomouscar', 4, 9],
 ['awwmypzwaz', 4, 6],
 ['baaebeedbfec', 6, 6],
 ['babeebdddcaa', 7, 5],
 ['bbaedeaccbdacc', 9, 5],
 ['bbdaecdec', 4, 5],
 ['bcfdcfbdfa', 5, 5],
 ['bittersweet', 4, 7],
 ['blackonblack', 5, 7],
 ['bnejwhznoavogobca', 5, 12],
 ['brotherhood', 4, 7],
 ['bureaucraci', 4, 7],
 ['cadefeefccfa', 7, 5],
 ['cardiovascular', 4, 10],
 ['cbbbcbc', 5, 2],
 ['cbcefbacfebaca', 9, 5],
 ['ccadcabfdddbecfabd', 12, 6],
 ['ccdebcafadebab', 8, 6],
 ['cceeeffbdbcbb', 8, 5],
 ['cdcceeadcccb', 7, 5],
 ['centralafrica', 4, 9],
 ['cfbcfefaeedaf', 7, 6],
 ['cfbddbbce', 4, 5],
 ['cffecaeecddf', 7, 5],
 ['ch

In [87]:
#this is loading the data for the S&P500 index which we'll be trying to predict
sp500=[]
with open('data/SP500am.csv','r') as mycsvfile:
    reader=csv.reader(mycsvfile)
    for row in reader:
        sp500+=[row]

In [88]:
#just finding where april 1st, 2013 is in the array
[(i,ii[-1]) for i,ii in enumerate(sp500) if ii[0]=='2013-04-01']

[(962, '1562.170044')]

In [359]:
days=list(range(20130401,20130431))+list(range(20130501,20130532))+list(range(20130601,20130631))+list(range(20130701,20130732))+list(range(20130801,20130832))+list(range(20130901,20130931))+list(range(20131001,20131032))
days=[str(date)[:4]+'-'+str(date)[4:6]+'-'+str(date)[6:] for date in days]
prev_days=['2013-03-31']+days[:-1]

In [266]:
days

['2013-04-01',
 '2013-04-02',
 '2013-04-03',
 '2013-04-04',
 '2013-04-05',
 '2013-04-06',
 '2013-04-07',
 '2013-04-08',
 '2013-04-09',
 '2013-04-10',
 '2013-04-11',
 '2013-04-12',
 '2013-04-13',
 '2013-04-14',
 '2013-04-15',
 '2013-04-16',
 '2013-04-17',
 '2013-04-18',
 '2013-04-19',
 '2013-04-20',
 '2013-04-21',
 '2013-04-22',
 '2013-04-23',
 '2013-04-24',
 '2013-04-25',
 '2013-04-26',
 '2013-04-27',
 '2013-04-28',
 '2013-04-29',
 '2013-04-30',
 '2013-05-01',
 '2013-05-02',
 '2013-05-03',
 '2013-05-04',
 '2013-05-05',
 '2013-05-06',
 '2013-05-07',
 '2013-05-08',
 '2013-05-09',
 '2013-05-10',
 '2013-05-11',
 '2013-05-12',
 '2013-05-13',
 '2013-05-14',
 '2013-05-15',
 '2013-05-16',
 '2013-05-17',
 '2013-05-18',
 '2013-05-19',
 '2013-05-20',
 '2013-05-21',
 '2013-05-22',
 '2013-05-23',
 '2013-05-24',
 '2013-05-25',
 '2013-05-26',
 '2013-05-27',
 '2013-05-28',
 '2013-05-29',
 '2013-05-30',
 '2013-05-31',
 '2013-06-01',
 '2013-06-02',
 '2013-06-03',
 '2013-06-04',
 '2013-06-05',
 '2013-06-

In [360]:
#dataset preparation
x_tfidf=[]
y_tfidf=[]
j=962
for i,date in enumerate(days):
    if date ==sp500[j][0]:
        after_we=0.
        if sp500[j+1][0]!=prev_days[i]:
            after_we=1.
        x_tfidf+=[list(tfidf_dataset_df.iloc[i])+[float(sp500[j][-1])]+[after_we]]
        y_tfidf+=[float(sp500[j-1][-1])]
        j-=1

x_tfidf=np.array(x_tfidf)
y_tfidf=np.array(y_tfidf)

In [200]:
days[0],sp500[962+1][0]

('2013-04-01', '2013-03-28')

In [361]:
#classification dataset preparation
x_tfidf_class=[]
y_tfidf_class=[]
j=962
for i,date in enumerate(days):
    if date ==sp500[j][0]:
        after_we=0.
        if sp500[j+1][0]!=prev_days[i]:
            after_we=1.
        x_tfidf_class+=[list(tfidf_dataset_df.iloc[i])+[float(sp500[j][-1])]+[after_we]]
        y_tfidf_class+=[np.sign(float(sp500[j-1][-1])-float(sp500[j][-1]))]
        j-=1

x_tfidf_class=np.array(x_tfidf_class)
y_tfidf_class=np.array(y_tfidf_class)

In [362]:
#dataset preparation
x_bow=[]
y_bow=[]
j=962
for i,date in enumerate(days):
    if date ==sp500[j][0]:
        after_we=0.
        if sp500[j+1][0]!=prev_days[i]:
            after_we=1.
        x_bow+=[list(bow_dataset_df.iloc[i])+[float(sp500[j][-1])]+[after_we]]
        y_bow+=[float(sp500[j-1][-1])]
        j-=1

x_bow=np.array(x_bow)
y_bow=np.array(y_bow)

In [363]:
len(y_tfidf)

151

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,roc_auc_score, precision_recall_curve

In [95]:
from sklearn.linear_model import Lasso,Ridge
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

## Trying out different regressors on the data, no luck so far :(

In [401]:
#this function executes k-fold validation on a particular regressor model chosen by the user, 
#it outputs k-fold average rmse for the training set, validation set, and a benchmark "flat" 
#model which consists of predicting the same for tomorrow as today's closing

def kfold_val(n_folds_val,x_trainval,y_trainval,regressor,parm):
    kf_val = KFold(n_splits=n_folds_val)
    avg_rms_mod_val=0
    avg_rms_mod_train=0
    for train_index, val_index in kf_val.split(x_trainval):
        x_train, x_val = x_trainval[train_index], x_trainval[val_index]
        y_train, y_val = y_trainval[train_index], y_trainval[val_index]
        if regressor in {Lasso,Ridge}:
            model=regressor(alpha=parm)
        elif regressor in {RandomForestRegressor,}:
            model=regressor(n_estimators=parm[0],max_features=parm[1])
        elif regressor in {MLPRegressor,}:
            model=regressor(activation=parm[0],hidden_layer_sizes=parm[1])
        elif regressor in {AdaBoostRegressor,}:
            model=regressor(n_estimators=parm)
        else:
            print('houston, we have a unknown model problem')
        model.fit(x_train,y_train)
        avg_rms_mod_val+=np.sqrt((sum((model.predict(x_val)-y_val)**2)/len(y_val)))
        avg_rms_mod_train+=np.sqrt((sum((model.predict(x_train)-y_train)**2)/len(y_train)))
    avg_rms_mod_val=avg_rms_mod_val/n_folds_val
    avg_rms_mod_train=avg_rms_mod_train/n_folds_val
    print('avg_train_rmse:',avg_rms_mod_train,'avg_validation_rms:',avg_rms_mod_val)
    return

def kfold_test(x_trainval,y_trainval,x_test,y_test,regressor,parm):
    #x_trainval, x_test, y_trainval, y_test = train_test_split(x, y,test_size=test_fraction)
    coeff=True
    if regressor in {Lasso,Ridge}:
            model=regressor(alpha=parm)
    elif regressor in {RandomForestRegressor,}:
            model=regressor(n_estimators=parm[0],max_features=parm[1])
            coeff=False
    elif regressor in {MLPRegressor,}:
            model=regressor(activation=parm[0],hidden_layer_sizes=parm[1])
            coeff=False
    elif regressor in {AdaBoostRegressor,}:
            model=regressor(n_estimators=parm)
            coeff=False
    model.fit(x_trainval,y_trainval)
    
    if coeff:
        model_coeff=model.coef_
    rms_mod_test=np.sqrt((sum((model.predict(x_test)-y_test)**2)/len(y_test)))
    rms_rand_test=np.sqrt((sum((x_test[:,-2]-y_test)**2)/len(y_test)))
    print('model_test_rms:',rms_mod_test,'flat_test_rms:',rms_rand_test)
    return model_coeff

#def kfold_tuning(n_folds_val,x_trainval,y_trainval,regressor):
    

In [365]:
x_tfidf_trainval, x_tfidf_test, y_tfidf_trainval, y_tfidf_test = train_test_split(x_tfidf, y_tfidf,test_size=0.2)
x_bow_trainval, x_bow_test, y_bow_trainval, y_bow_test = train_test_split(x_bow, y_bow,test_size=0.2)

In [375]:
for alpha in [8.6+0.01*i for i in range(-10,10)]:
    print(alpha)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,Lasso,alpha)

8.5
avg_train_rmse: 12.2866745465 avg_validation_rms: 12.0339518065
8.51
avg_train_rmse: 12.2866771039 avg_validation_rms: 12.0339517782
8.52
avg_train_rmse: 12.2866796643 avg_validation_rms: 12.033951753
8.53
avg_train_rmse: 12.2866822277 avg_validation_rms: 12.0339517311
8.54
avg_train_rmse: 12.2866847941 avg_validation_rms: 12.0339517124
8.549999999999999
avg_train_rmse: 12.2866873634 avg_validation_rms: 12.0339516968
8.56
avg_train_rmse: 12.2866899358 avg_validation_rms: 12.0339516845
8.57
avg_train_rmse: 12.2866925113 avg_validation_rms: 12.0339516754
8.58
avg_train_rmse: 12.2866950897 avg_validation_rms: 12.0339516696
8.59
avg_train_rmse: 12.2866976711 avg_validation_rms: 12.0339516669
8.6
avg_train_rmse: 12.2867002555 avg_validation_rms: 12.0339516674
8.61
avg_train_rmse: 12.286702843 avg_validation_rms: 12.0339516711
8.62
avg_train_rmse: 12.2867054334 avg_validation_rms: 12.0339516781
8.629999999999999
avg_train_rmse: 12.2867080268 avg_validation_rms: 12.0339516883
8.6399999999

In [376]:
kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,Lasso,8.6)

model_test_rms: 11.4370876123 flat_test_rms: 11.8504929168


In [400]:
print (model.coef_)

NameError: name 'model' is not defined

In [397]:
for alpha in [1407.+0.1*i for i in range(-10,10)]:
    print(alpha)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,Ridge,alpha)

1406.0
avg_train_rmse: 12.2789457944 avg_validation_rms: 12.0351292112
1406.1
avg_train_rmse: 12.2789467865 avg_validation_rms: 12.0351292107
1406.2
avg_train_rmse: 12.2789477785 avg_validation_rms: 12.0351292102
1406.3
avg_train_rmse: 12.2789487705 avg_validation_rms: 12.0351292098
1406.4
avg_train_rmse: 12.2789497625 avg_validation_rms: 12.0351292095
1406.5
avg_train_rmse: 12.2789507543 avg_validation_rms: 12.0351292092
1406.6
avg_train_rmse: 12.2789517461 avg_validation_rms: 12.0351292089
1406.7
avg_train_rmse: 12.2789527378 avg_validation_rms: 12.0351292086
1406.8
avg_train_rmse: 12.2789537295 avg_validation_rms: 12.0351292084
1406.9
avg_train_rmse: 12.2789547211 avg_validation_rms: 12.0351292083
1407.0
avg_train_rmse: 12.2789557126 avg_validation_rms: 12.0351292081
1407.1
avg_train_rmse: 12.278956704 avg_validation_rms: 12.0351292081
1407.2
avg_train_rmse: 12.2789576954 avg_validation_rms: 12.035129208
1407.3
avg_train_rmse: 12.2789586867 avg_validation_rms: 12.035129208
1407.4
av

In [403]:
aa=kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,Ridge,1407.)

model_test_rms: 11.4261678588 flat_test_rms: 11.8504929168


In [412]:
aa[653]

-0.011090891822691353

In [444]:
key_cols=list(tfidf_dataset_df.columns)+['*yesterdayS&P','*weekend?']

In [445]:
[[key_cols[i],aa[i]] for i in np.argsort(abs(aa))[::-1]]

[['*yesterdayS&P', 0.98010943508012516],
 ['*weekend?', 0.019201223625657862],
 ['berlin', -0.011090891822691353],
 ['bold', -0.0098386943588391045],
 ['disarma', -0.0096837655064310173],
 ['taliban', -0.0096760591560714586],
 ['idukbrekaj', -0.0092001972730545376],
 ['pleas', -0.0091793956794847045],
 ['idukbrez', -0.0090768192062852734],
 ['uk', -0.0088036406590706381],
 ['fnday', 0.0083045035324715646],
 ['recogn', -0.008281394146376039],
 ['charit', 0.0082234030647761157],
 ['karzai', -0.0080033928573729624],
 ['korea', -0.0077083505155676598],
 ['north', -0.0075607548773270828],
 ['snowden', 0.0074256158780388075],
 ['benefit', 0.0073957860703819525],
 ['idukbreeq', 0.0073396619009473028],
 ['somalia', -0.0073063209779075117],
 ['chemic', -0.0069391457061913483],
 ['pick', 0.0068962273431370479],
 ['foundat', 0.0068962273431370479],
 ['bfff', 0.0067980395571941412],
 ['nuclear', -0.0067721845350085597],
 ['blackwat', 0.0066581804142140542],
 ['mubarak', 0.0065221147928019533],
 ['

In [352]:
len(x_tfidf[0])

6718

In [354]:
for n_max in range(6715,6718):
    print(n_max)
    kfold_val(10,x_tfidf_trainval,y_tfidf_trainval,RandomForestRegressor,[20,n_max])

6715
avg_train_rmse: 5.43629615831 avg_validation_rms: 13.5879183802
6716
avg_train_rmse: 5.50030269181 avg_validation_rms: 13.4911977642
6717
avg_train_rmse: 5.29555184707 avg_validation_rms: 13.7219463198


In [355]:
kfold_test(x_tfidf_trainval,y_tfidf_trainval,x_tfidf_test,y_tfidf_test,RandomForestRegressor,[20,6715])

model_test_rms: 16.1250170206 flat_test_rms: 14.5191980625


In [504]:
kfold_val(10,x,y,MLPRegressor,['relu',(180,)])

train_model: 15.9318931335 validation_model: 14.4322007665 flat model: 12.0795328172


In [249]:
kfold_val(10,x_bow,y_bow,MLPRegressor,['relu',(130,)])



train_model: 25.0379574512 validation_model: 272.288593758 flat model: 11.5934907331


In [250]:
kfold_val(11,x_tfidf,y_tfidf,AdaBoostRegressor,8),kfold_val(11,x_bow,y_bow,AdaBoostRegressor,8)

train_model: 8.43500702475 validation_model: 15.1585124175 flat model: 11.4145366544
train_model: 8.51410224946 validation_model: 14.7140553877 flat model: 11.4145366544


(None, None)

# Scratch!!

In [424]:
from keras.layers import Convolution2D, MaxPooling2D, Input,ZeroPadding2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.models import Sequential, Model, model_from_json
from keras.layers.advanced_activations import LeakyReLU
from keras.regularizers import l1,l2,l1l2
from keras.optimizers import Nadam, Adagrad

#linear regressor
inputsred=Input(shape=(len(x[0]),))

#xo=Dense(100,activation='relu',W_regularizer=l1(0.005))(inputsred)
#xo=LeakyReLU()(xo)
#xo=Dropout(0.1)(xo)
predsred=Dense(1, activation='relu',W_regularizer=l1(0.005))(inputsred)

modelDred = Model(input=inputsred, output=predsred)

nadam=Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08, schedule_decay=0.004)
adagrad=Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)

modelDred.compile(optimizer=adagrad,
              loss='mean_squared_error',
              metrics=['accuracy'])

In [417]:
x_train,x_val,y_train,y_val=train_test_split(x,y,test_size=0.1)

In [473]:
modelDred.fit(x_train,y_train, batch_size=57, nb_epoch=500, verbose=2, 
          callbacks=[], validation_split=0.1, validation_data=[x_val,y_val],
              shuffle=True, class_weight=None)#cl_w_ing, sample_weight=None)

Train on 97 samples, validate on 11 samples
Epoch 1/500
0s - loss: 172.5491 - acc: 0.0000e+00 - val_loss: 206.9940 - val_acc: 0.0000e+00
Epoch 2/500
0s - loss: 172.5445 - acc: 0.0000e+00 - val_loss: 206.9854 - val_acc: 0.0000e+00
Epoch 3/500
0s - loss: 172.5419 - acc: 0.0000e+00 - val_loss: 206.9785 - val_acc: 0.0000e+00
Epoch 4/500
0s - loss: 172.5372 - acc: 0.0000e+00 - val_loss: 206.9696 - val_acc: 0.0000e+00
Epoch 5/500
0s - loss: 172.5332 - acc: 0.0000e+00 - val_loss: 206.9611 - val_acc: 0.0000e+00
Epoch 6/500
0s - loss: 172.5293 - acc: 0.0000e+00 - val_loss: 206.9532 - val_acc: 0.0000e+00
Epoch 7/500
0s - loss: 172.5258 - acc: 0.0000e+00 - val_loss: 206.9429 - val_acc: 0.0000e+00
Epoch 8/500
0s - loss: 172.5216 - acc: 0.0000e+00 - val_loss: 206.9315 - val_acc: 0.0000e+00
Epoch 9/500
0s - loss: 172.5155 - acc: 0.0000e+00 - val_loss: 206.9224 - val_acc: 0.0000e+00
Epoch 10/500
0s - loss: 172.5101 - acc: 0.0000e+00 - val_loss: 206.9132 - val_acc: 0.0000e+00
Epoch 11/500
0s - loss: 1

<keras.callbacks.History at 0x11c261588>

In [376]:
#descriptive names
df=pd.read_csv('20130401.export.CSV',delimiter='\t')

In [377]:
header1=pd.read_csv('CSV.header.dailyupdates.txt',delimiter='\t')

In [378]:
df.columns=list(header1)

In [382]:
df.head()

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,Actor2Geo_FeatureID,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,253461012,20030404,200304,2003,2003.2575,AUS,AUSTRALIA,AUS,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
1,253461013,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354145,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
2,253461014,20030404,200304,2003,2003.2575,BUS,SHOP OWNER,,,,...,-1354454,4,"Tai Hang, Hong Kong (general), Hong Kong",HK,HK00,22.4667,114.15,-1354145,20130401,http://www.bloomberg.com/news/2013-04-01/hong-...
3,253461015,20030404,200304,2003,2003.2575,CVL,MIGRANT,,,,...,AS,1,Australia,AS,AS,-27.0,133.0,AS,20130401,http://www.bangkokpost.com/breakingnews/343522...
4,253461016,20030404,200304,2003,2003.2575,HLH,DOCTOR,,,,...,,2,"Nevada, United States",US,USNV,38.4199,-117.122,NV,20130401,http://www.startribune.com/nation/200818961.html


In [379]:
df_with_dummies = pd.get_dummies(df[feat_columns], columns = cat_columns )
df_with_dummies.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,EventCode_1723,EventCode_1724,EventCode_1821,EventCode_1822,EventCode_1823,EventCode_1831,QuadClass_1,QuadClass_2,QuadClass_3,QuadClass_4
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [380]:
vectorizer = CountVectorizer(min_df=1,tokenizer=my_tokenizer)
corpus=[df.iloc[i,-1] for i in range(len(df))]
X = vectorizer.fit_transform(corpus)
Y=X.toarray()
for i,col in enumerate(vectorizer.get_feature_names()):
    df_with_dummies[col]=pd.DataFrame(Y[:,i])

In [381]:
vectorizertfidf = TfidfVectorizer(min_df=1,tokenizer=my_tokenizer)
Xtfidf = vectorizertfidf.fit_transform(corpus)
Ytfidf=Xtfidf.toarray()
for i,col in enumerate(vectorizertfidf.get_feature_names()):
    df_with_dummies['tfidf'+col]=pd.DataFrame(Ytfidf[:,i])

In [383]:
feat_df=df_with_dummies.iloc[:,0:10401]

In [385]:
feat_df.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,zealand,zealotri,zeidan,zelda,zhiggkoea,ziivhmez_uxlgpnlo,zikir,zipwir,zmnmbcosjccynudfnuig,zoo
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [384]:
feattfidf_df=df_with_dummies.iloc[:,list(range(0,5345))+list(range(10401,15457))]

In [386]:
feattfidf_df.head()

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,tfidfzealand,tfidfzealotri,tfidfzeidan,tfidfzelda,tfidfzhiggkoea,tfidfziivhmez_uxlgpnlo,tfidfzikir,tfidfzipwir,tfidfzmnmbcosjccynudfnuig,tfidfzoo
0,2003.2575,2.8,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2003.2575,-5.0,8,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2003.2575,-5.0,2,2.167369,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2003.2575,1.9,10,2.222222,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2003.2575,-0.4,10,1.843318,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [436]:
10401*len(feattfidfRED_df)

1040100

In [434]:
feattfidfRED_df=(feattfidf_df.sort_values('NumMentions', ascending=False))[0:100]

In [435]:
feattfidfRED_df

Unnamed: 0,FractionDate,GoldsteinScale,NumMentions,AvgTone,Actor1Code_AFG,Actor1Code_AFGBUS,Actor1Code_AFGCOP,Actor1Code_AFGCVL,Actor1Code_AFGGOV,Actor1Code_AFGGOVEDU,...,tfidfzealand,tfidfzealotri,tfidfzeidan,tfidfzelda,tfidfzhiggkoea,tfidfziivhmez_uxlgpnlo,tfidfzikir,tfidfzipwir,tfidfzmnmbcosjccynudfnuig,tfidfzoo
3196,2013.2493,0.0,643,3.227707,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18436,2013.2493,3.0,384,4.111273,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12787,2013.2493,-4.0,372,3.989697,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6969,2013.2493,3.0,299,1.849065,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2468,2013.2493,-7.2,290,1.656668,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6971,2013.2493,-0.3,280,1.674708,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24163,2013.2493,-10.0,275,1.254327,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1489,2013.2493,-10.0,270,3.960111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15388,2013.2493,0.0,256,1.609864,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6882,2013.2493,-10.0,246,3.980200,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


[(0, 'Adj Close'), (961, '1570.25')]

In [427]:
float(sp500[961][-1])

1570.25

In [428]:
float(sp500[960][-1])

1553.689941

In [293]:
my_tokenizer(df.iloc[3,-1]):
    print(porter.stem(word))

australia
peopl
smuggl
rise


In [280]:
ddf.head()

Unnamed: 0,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1Type1Code,Actor2Code,Actor2Name,Actor2CountryCode,Actor2Type1Code,EventCode,QuadClass,GoldsteinScale,NumMentions,AvgTone,url
0,2003.2575,AUS,AUSTRALIA,AUS,,CVL,MIGRANT,,CVL,43,1,2.8,10,2.222222,"[australia, people, smuggling, rising]"
1,2003.2575,BUS,SHOP OWNER,,BUS,CVL,NEIGHBORHOOD,,CVL,172,4,-5.0,8,2.167369,"[hong, kong, businesses, vanish, rents, soar, ..."
2,2003.2575,BUS,SHOP OWNER,,BUS,CVL,NEIGHBORHOOD,,CVL,172,4,-5.0,2,2.167369,"[hong, kong, businesses, vanish, rents, soar, ..."
3,2003.2575,CVL,MIGRANT,,CVL,AUS,AUSTRALIA,AUS,,42,1,1.9,10,2.222222,"[australia, people, smuggling, rising]"
10,2012.2493,,,,,BUS,COMPANY,,BUS,20,1,3.0,10,3.521127,"[pakistans, ambitious, program, educate, milit..."


In [68]:
a=re.split(r'"."|/',df.iloc[i,-1])

In [69]:
a

['http:',
 '',
 'www.channelnewsasia.com',
 'news',
 'world',
 'us-urges-serbia-kosovo-to-reach-agreemen',
 '624136.html']

In [266]:
df.loc[[1,3,4],['FractionDate','Actor1Code']]

Unnamed: 0,FractionDate,Actor1Code
1,2003.2575,BUS
3,2003.2575,CVL
4,2003.2575,HLH


In [270]:
pd.Series([[2,3],[1],[1,2]])

0    [2, 3]
1       [1]
2    [1, 2]
dtype: object

In [279]:
ddf['url']=my_ser

In [216]:
from sklearn.preprocessing import OneHotEncoder

In [206]:
en=OneHotEncoder()
en.fit([[0,1],[3,np.nan]])

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [203]:
en.transform([[2,0]]).toarray()

array([[ 0.,  0.,  1.,  0.]])

In [275]:
mask=[]
ser=[]
for i in range(len(df)):
    url=df.iloc[i,-1]
    #print(url)
    c=[]
    d=[]
    if url!='BBC Monitoring':
        #print(str(i)+'=========')
        a=urlparse(url)[2].split('.')[0].split('/')[-1]
        b = re_tokenizer.tokenize(a.lower())
        for word in b:
            c+=[punctuation.sub("", word)]
        for word in c:
            if word not in stop_words:
                d+=[word]
        if len(d)>1:
            mask+=[i]
            ser+=[d]
            #print(d)
#print(mask)
my_ser=pd.Series(ser)
#print(my_ser)

In [88]:
for i in range(1):
    stro=df.iloc[i,-1].split('.')[-2:]
    stri=""
    if len(stro)==2:
        if len(stro[0]) > len(stro[1]):
            stri=stro[0]
        elif len(stro[0])<len(stro[1]):
            stri=stro[1]
        else:
            stri="**"+stro[0]+stro[1]
    print(stri)
    print(stro)

com/breakingnews/343522/australia-people-smuggling-rising
['bangkokpost', 'com/breakingnews/343522/australia-people-smuggling-rising']


In [38]:
for i in range(10):
    print(df.iloc[i,-1].split('.'))

['http://www', 'bangkokpost', 'com/breakingnews/343522/australia-people-smuggling-rising']
['http://www', 'bloomberg', 'com/news/2013-04-01/hong-kong-businesses-vanish-as-rents-soar-real-estate', 'html']
['http://www', 'bloomberg', 'com/news/2013-04-01/hong-kong-businesses-vanish-as-rents-soar-real-estate', 'html']
['http://www', 'bangkokpost', 'com/breakingnews/343522/australia-people-smuggling-rising']
['http://www', 'startribune', 'com/nation/200818961', 'html']
['BBC Monitoring']
['http://www', 'philippinetimes', 'com/index', 'php/sid/213539349/scat/2411cd3571b4f088']
['http://www', 'theglobeandmail', 'com/life/health-and-fitness/health/number-of-us-adhd-diagnoses-astronomical/article10606200/?cmpid=rss1']
['http://www', 'theglobeandmail', 'com/life/health-and-fitness/health/number-of-us-adhd-diagnoses-astronomical/article10606200/?cmpid=rss1']
['http://www', 'channelnewsasia', 'com/news/world/us-urges-serbia-kosovo-to-reach-agreemen/624136', 'html']


In [3]:
with open('GDELT.MASTERREDUCEDV2.csv') as csvfile:
    reader=csv.reader(csvfile,delimiter='\t')
    row1=next(reader)
    print(row1)

['Date', 'Source', 'Target', 'CAMEOCode', 'NumEvents', 'NumArts', 'QuadClass', 'Goldstein', 'SourceGeoType', 'SourceGeoLat', 'SourceGeoLong', 'TargetGeoType', 'TargetGeoLat', 'TargetGeoLong', 'ActionGeoType', 'ActionGeoLat', 'ActionGeoLong']


In [4]:
with open('20130401.export.csv') as csvfile:
    reader=csv.reader(csvfile,delimiter='\t')
    for i in range(3):
        row1=next(reader)
        print(row1)

['253461011', '20030404', '200304', '2003', '2003.2575', 'AFG', 'AFGHANISTAN', 'AFG', '', '', '', '', '', '', '', ' UIS', 'THE INTERNATIONAL COMMUNITY', '', '', '', '', '', 'UIS', '', '', '0', '043', '043', '04', '1', '2.8', '6', '1', '6', '0', '1', 'Algeria', 'AG', 'AG', '28', '3', 'AG', '1', 'Algeria', 'AG', 'AG', '28', '3', 'AG', '1', 'Algeria', 'AG', 'AG', '28', '3', 'AG', '20130401', 'BBC Monitoring']
['253461012', '20030404', '200304', '2003', '2003.2575', 'AUS', 'AUSTRALIA', 'AUS', '', '', '', '', '', '', '', ' CVL', 'MIGRANT', '', '', '', '', '', 'CVL', '', '', '1', '043', '043', '04', '1', '2.8', '10', '1', '10', '2.22222222222222', '1', 'Australia', 'AS', 'AS', '-27', '133', 'AS', '1', 'Australia', 'AS', 'AS', '-27', '133', 'AS', '1', 'Australia', 'AS', 'AS', '-27', '133', 'AS', '20130401', 'http://www.bangkokpost.com/breakingnews/343522/australia-people-smuggling-rising']
['253461013', '20030404', '200304', '2003', '2003.2575', 'BUS', 'SHOP OWNER', '', '', '', '', '', 'BUS',

In [None]:
df=pd.read_csv('GDELT.MASTERREDUCEDV2.csv',delimiter='\t')

In [243]:
header2=pd.read_csv('CSV.header.historical.txt',delimiter='\t')

In [36]:
icom=0
iBBC=0
iit=0
iru=0
inet=0
for i in range(len(df)):
    url=df.iloc[i,-1]
    if 'BBC Monitoring'==url:
        iBBC+=1
    elif '.com/' in url:
        icom+=1
    elif '.it/' in url:
        iit+=1
    elif '.ru/' in url:
        iru+=1
    elif '.net/' in url:
        inet+=1

In [37]:
len(df),icom,iBBC,iit,iru,inet

(27757, 17961, 2300, 13, 135, 502)

In [14]:
import re

http
www
bloomberg
com
news



women
tourists
avoid
india
following
sexual
assaults
study
says
html


In [23]:
nyt_re_tokens

['http',
 'www',
 'bloomberg',
 'com',
 'news',
 '2013',
 '04',
 '01',
 'women',
 'tourists',
 'avoid',
 'india',
 'following',
 'sexual',
 'assaults',
 'study',
 'says',
 'html']