# Predicting Stock Market Performance Based on Daily Headlines 

Using this Kaggle dataset (https://www.kaggle.com/aaron7sun/stocknews), I tried to see whether I could predict how the Dow Jones (and, in a different notebook, the S&P 500 and Nasdaq) would perform based on the daily headlines.

Broadly speaking, I used spaCy to generate the noun and verb columns. From there, I used logistic regression, Naive Bayes, and support vector machines to predict the label. Unfortunately, I was not able to make much headway. The most accurate approach was Method 12, the highest-scoring solution on Kaggle, at 56.7% accuracy.

I'd love feedback - what else could I incorporate?

In [479]:
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only works on spaCy 2.x and earlier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB


In [358]:
df = pd.read_csv("./data/Combined_News_DJIA.csv")

In [359]:
df.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top19,Top20,Top21,Top22,Top23,Top24,Top25,Adjusted Label,Nasdaq,S&P500
0,8/8/2008,0,Georgia 'downs two Russian warplanes' as count...,BREAKING: Musharraf to be impeached.',Russia Today: Columns of troops roll into Sout...,Russian tanks are moving towards the capital o...,"Afghan children raped with 'impunity,' U.N. of...",150 Russian tanks have entered South Ossetia w...,"Breaking: Georgia invades South Ossetia, Russi...",The 'enemy combatent' trials are nothing but a...,...,This is a busy day: The European Union has ap...,"Georgia will withdraw 1,000 soldiers from Iraq...",Why the Pentagon Thinks Attacking Iran is a Ba...,Caucasus in crisis: Georgia invades South Osse...,Indian shoe manufactory - And again in a seri...,Visitors Suffering from Mental Illnesses Banne...,"No Help for Mexico's Kidnapping Surge""",1,1,1
1,8/11/2008,1,Why wont America and Nato help us? If they won...,Bush puts foot down on Georgian conflict',Jewish Georgian minister: Thanks to Israeli tr...,Georgian army flees in disarray as Russians ad...,"Olympic opening ceremony fireworks 'faked'""",What were the Mossad with fraudulent New Zeala...,Russia angered by Israeli military sale to Geo...,An American citizen living in S.Ossetia blames...,...,China to overtake US as largest manufacturer',War in South Ossetia [PICS]',Israeli Physicians Group Condemns State Torture',Russia has just beaten the United States over...,Perhaps *the* question about the Georgia - Rus...,Russia is so much better at war',So this is what it's come to: trading sex for ...,1,1,1


This is a pretty standard dataset. On the left, we have the Date and the Label: did the stock market go up (1) or down (0)?

On the right, we have the top 25 headlines for that day.

In [365]:
df["combined"] = df["Top1"].astype(str)+' '+df['Top2']+' '+df['Top3']+' '+df['Top4']+' '+df['Top5']+' '+df['Top6']+' '+df['Top7']+' '+df['Top8']+' '+df['Top9']+' '+df['Top10']+' '+df['Top11']+' '+df['Top12']+' '+df['Top13']+' '+df['Top14']+' '+df['Top15']+' '+df['Top16']+' '+df['Top17']+' '+df['Top18']+' '+df['Top19']+' '+df['Top20']+' '+df['Top21']+' '+df['Top22']+' '+df['Top23']+' '+df['Top24']+' '+df['Top25']

I'm merging the 25 headlines into one column for ease of use.
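
The same merge can be written more compactly. The sketch below uses a made-up three-column frame (`toy`, not the real CSV) and treats missing headlines as empty strings. That is an assumption on my part: as far as I can tell, the long `+` expression above would instead turn the whole row into NaN when any single headline is missing.

```python
import pandas as pd

# Toy frame standing in for the Kaggle CSV (only three headline columns here).
toy = pd.DataFrame({
    "Top1": ["Georgia invades", "Bush speaks"],
    "Top2": ["Tanks roll in", None],   # a missing headline
    "Top3": ["UN responds", "Markets react"],
})

headline_cols = [f"Top{i}" for i in range(1, 4)]  # range(1, 26) for the real data

# Treat missing headlines as empty strings, then join each row with spaces.
toy["combined"] = toy[headline_cols].fillna("").astype(str).agg(" ".join, axis=1)
```

Swapping this in for the real data could change the downstream numbers slightly, since rows with a missing headline would keep their other 24 headlines.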

In [371]:
df.head(2)

Unnamed: 0,Date,Label,Adjusted Label,Nasdaq,S&P500,combined
0,8/8/2008,0,1,1,1,Georgia 'downs two Russian warplanes' as count...
1,8/11/2008,1,1,1,1,Why wont America and Nato help us? If they won...


In [367]:
df.columns

Index(['Date', 'Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25', 'Adjusted Label', 'Nasdaq', 'S&P500', 'combined'],
      dtype='object')

In [368]:
df.drop([ 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25'],inplace=True,axis=1)

In [372]:
def subjFunc(data, indexValue):
    # Tag the combined headlines and collect tokens by part of speech.
    doc = nlp(data)

    nounList = []
    verbList = []
    numberList = []
    for token in doc:
        if token.pos_ == "PROPN":  # proper nouns only; common nouns are tagged "NOUN"
            nounList.append(token.text)
        elif token.pos_ == "VERB":
            verbList.append(token.text)
        elif token.pos_ == "NUM":
            numberList.append(token.text)

    # Store each list as one comma-separated string in its own column.
    # DataFrame.set_value was removed in pandas 1.0, so .loc is used instead.
    df.loc[indexValue, "noun"] = ', '.join(nounList)
    df.loc[indexValue, "verb"] = ', '.join(verbList)
    df.loc[indexValue, "number"] = ', '.join(numberList)

In [373]:
df['combined'] = df['combined'].astype(str)

In [374]:
for row in range(0,len(df.index)):
    subjFunc(df["combined"][row],row)



In the for loop above, I go through each row of "combined" and pull out all the nouns (proper nouns, per the "PROPN" tag), verbs, and numbers, storing each group in its own column. The goal is to understand whether nouns and verbs alone correlate with the label more strongly than the full text does.
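
An alternative to writing into `df` one cell at a time is to return the three strings and let `apply` build the columns. This is only a sketch: `extract_pos` and `fake_tagger` are hypothetical names, and the tagger is passed in as a plain function so the example runs without a spaCy model installed.

```python
import pandas as pd

def extract_pos(text, tagger):
    """Return proper nouns, verbs and numbers in `text` as comma-joined strings.

    `tagger` is a hypothetical callable mapping a string to (token, pos) pairs.
    """
    wanted = {"PROPN": "noun", "VERB": "verb", "NUM": "number"}
    buckets = {name: [] for name in wanted.values()}
    for token, pos in tagger(text):
        if pos in wanted:
            buckets[wanted[pos]].append(token)
    return pd.Series({name: ", ".join(tokens) for name, tokens in buckets.items()})

# Stand-in tagger for illustration only; with spaCy one could pass
# lambda text: [(t.text, t.pos_) for t in nlp(text)]
def fake_tagger(text):
    lookup = {"Russia": "PROPN", "invades": "VERB", "150": "NUM"}
    return [(word, lookup.get(word, "X")) for word in text.split()]

toy = pd.DataFrame({"combined": ["Russia invades with 150 tanks", "nothing here"]})
pos_columns = toy["combined"].apply(extract_pos, tagger=fake_tagger)
toy = pd.concat([toy, pos_columns], axis=1)
```

Because the function no longer touches a global DataFrame, it can be checked on a couple of hand-written sentences before running it over all 1,989 rows.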

In [375]:
df.tail()

Unnamed: 0,Date,Label,Adjusted Label,Nasdaq,S&P500,combined,noun,verb,number
1984,6/27/2016,0,0,0,0,Barclays and RBS shares suspended from trading...,"Barclays, RBS, Pope, Church, Poland, Poles, UK...","suspended, trading, tanking, says, should, ask...","8, 31-year, 50, 52, 48, 48, two, five, 43"
1985,6/28/2016,1,1,1,1,"2,500 Scientists To Australia: If You Want To ...","Australia, Barrier, Reef, Coal, Google, Drive,...","Want, Save, Stop, Supporting, have, been, uplo...","2,500, 112,000, two, 2, trillion, one, 2016, 2..."
1986,6/29/2016,1,1,1,1,Explosion At Airport In Istanbul Yemeni former...,"Airport, Istanbul, Wahhabism, Al, Saud, UK, EU...","is, must, accept, access, Devastated, captive,...","5, 99-Million, 160,000, 38, four, 40"
1987,6/30/2016,1,1,1,1,Jamaica proposes marijuana dispensers for tour...,"Jamaica, Stephen, Hawking, Boris, Johnson, Tor...","proposes, following, would, give, purchase, us...","2, Six, 1, billion, 40, 250, 50,000, 100, 85, ..."
1988,7/1/2016,1,1,1,1,A 117-year-old woman in Mexico City finally re...,"Mexico, City, Trinidad, Alvarez, Lira, IMF, At...","received, died, had, waited, had, been, born, ...","117-year, 1898, 24, 100, 12-, 14-year, 13-Year..."


Below, I shift the Label column up by one row, in case the impact of the headlines is delayed by one day. Each row is then paired with the next trading day's label.

In [376]:
df["laggedByOne"] = df["Label"].shift(-1)

In [377]:
# The last row has no following day, so shift(-1) leaves NaN there; 0 is an arbitrary placeholder.
# .loc avoids the chained-assignment SettingWithCopyWarning.
df.loc[1988, "laggedByOne"] = 0


In [378]:
df["laggedByTwo"] = df["Label"].shift(-2)

In [379]:
df.tail()

Unnamed: 0,Date,Label,Adjusted Label,Nasdaq,S&P500,combined,noun,verb,number,laggedByOne,laggedByTwo
1984,6/27/2016,0,0,0,0,Barclays and RBS shares suspended from trading...,"Barclays, RBS, Pope, Church, Poland, Poles, UK...","suspended, trading, tanking, says, should, ask...","8, 31-year, 50, 52, 48, 48, two, five, 43",1.0,1.0
1985,6/28/2016,1,1,1,1,"2,500 Scientists To Australia: If You Want To ...","Australia, Barrier, Reef, Coal, Google, Drive,...","Want, Save, Stop, Supporting, have, been, uplo...","2,500, 112,000, two, 2, trillion, one, 2016, 2...",1.0,1.0
1986,6/29/2016,1,1,1,1,Explosion At Airport In Istanbul Yemeni former...,"Airport, Istanbul, Wahhabism, Al, Saud, UK, EU...","is, must, accept, access, Devastated, captive,...","5, 99-Million, 160,000, 38, four, 40",1.0,1.0
1987,6/30/2016,1,1,1,1,Jamaica proposes marijuana dispensers for tour...,"Jamaica, Stephen, Hawking, Boris, Johnson, Tor...","proposes, following, would, give, purchase, us...","2, Six, 1, billion, 40, 250, 50,000, 100, 85, ...",1.0,
1988,7/1/2016,1,1,1,1,A 117-year-old woman in Mexico City finally re...,"Mexico, City, Trinidad, Alvarez, Lira, IMF, At...","received, died, had, waited, had, been, born, ...","117-year, 1898, 24, 100, 12-, 14-year, 13-Year...",0.0,


In [380]:
# Arbitrary placeholders for the two rows that have no label two days ahead.
df.loc[1987, "laggedByTwo"] = 0
df.loc[1988, "laggedByTwo"] = 1


In [381]:
df["laggedByThree"] = df["Label"].shift(-3)

In [382]:
df.tail()

Unnamed: 0,Date,Label,Adjusted Label,Nasdaq,S&P500,combined,noun,verb,number,laggedByOne,laggedByTwo,laggedByThree
1984,6/27/2016,0,0,0,0,Barclays and RBS shares suspended from trading...,"Barclays, RBS, Pope, Church, Poland, Poles, UK...","suspended, trading, tanking, says, should, ask...","8, 31-year, 50, 52, 48, 48, two, five, 43",1.0,1.0,1.0
1985,6/28/2016,1,1,1,1,"2,500 Scientists To Australia: If You Want To ...","Australia, Barrier, Reef, Coal, Google, Drive,...","Want, Save, Stop, Supporting, have, been, uplo...","2,500, 112,000, two, 2, trillion, one, 2016, 2...",1.0,1.0,1.0
1986,6/29/2016,1,1,1,1,Explosion At Airport In Istanbul Yemeni former...,"Airport, Istanbul, Wahhabism, Al, Saud, UK, EU...","is, must, accept, access, Devastated, captive,...","5, 99-Million, 160,000, 38, four, 40",1.0,1.0,
1987,6/30/2016,1,1,1,1,Jamaica proposes marijuana dispensers for tour...,"Jamaica, Stephen, Hawking, Boris, Johnson, Tor...","proposes, following, would, give, purchase, us...","2, Six, 1, billion, 40, 250, 50,000, 100, 85, ...",1.0,0.0,
1988,7/1/2016,1,1,1,1,A 117-year-old woman in Mexico City finally re...,"Mexico, City, Trinidad, Alvarez, Lira, IMF, At...","received, died, had, waited, had, been, born, ...","117-year, 1898, 24, 100, 12-, 14-year, 13-Year...",0.0,1.0,


In [383]:
# Arbitrary placeholders for the three rows that have no label three days ahead.
df.loc[1986, "laggedByThree"] = 0
df.loc[1987, "laggedByThree"] = 1
df.loc[1988, "laggedByThree"] = 1


In [384]:
df.tail()

Unnamed: 0,Date,Label,Adjusted Label,Nasdaq,S&P500,combined,noun,verb,number,laggedByOne,laggedByTwo,laggedByThree
1984,6/27/2016,0,0,0,0,Barclays and RBS shares suspended from trading...,"Barclays, RBS, Pope, Church, Poland, Poles, UK...","suspended, trading, tanking, says, should, ask...","8, 31-year, 50, 52, 48, 48, two, five, 43",1.0,1.0,1.0
1985,6/28/2016,1,1,1,1,"2,500 Scientists To Australia: If You Want To ...","Australia, Barrier, Reef, Coal, Google, Drive,...","Want, Save, Stop, Supporting, have, been, uplo...","2,500, 112,000, two, 2, trillion, one, 2016, 2...",1.0,1.0,1.0
1986,6/29/2016,1,1,1,1,Explosion At Airport In Istanbul Yemeni former...,"Airport, Istanbul, Wahhabism, Al, Saud, UK, EU...","is, must, accept, access, Devastated, captive,...","5, 99-Million, 160,000, 38, four, 40",1.0,1.0,0.0
1987,6/30/2016,1,1,1,1,Jamaica proposes marijuana dispensers for tour...,"Jamaica, Stephen, Hawking, Boris, Johnson, Tor...","proposes, following, would, give, purchase, us...","2, Six, 1, billion, 40, 250, 50,000, 100, 85, ...",1.0,0.0,1.0
1988,7/1/2016,1,1,1,1,A 117-year-old woman in Mexico City finally re...,"Mexico, City, Trinidad, Alvarez, Lira, IMF, At...","received, died, had, waited, had, been, born, ...","117-year, 1898, 24, 100, 12-, 14-year, 13-Year...",0.0,1.0,1.0
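
The trailing rows above are filled with hand-picked placeholder labels. A different option, sketched here on made-up data, is to drop the rows that have no future label. The column names `lag1` to `lag3` are illustrative, not the ones used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Label": [0, 1, 1, 0, 1]})

# shift(-k) pulls the label from k rows later onto the current row,
# so row i is paired with the market move k trading days after its headlines.
for k in (1, 2, 3):
    toy[f"lag{k}"] = toy["Label"].shift(-k)

# The last k rows have no future label; drop them rather than fill them in.
lag2 = toy.dropna(subset=["lag2"]).copy()
lag2["lag2"] = lag2["lag2"].astype(int)
```

Dropping loses at most three rows here, and it keeps invented labels out of the test window.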


### Method 1 - Kaggle, 0 Lag, Noun

In [385]:
nounTraining = []
nounTest = []
for row in range(0,1391):  
    nounTraining.append((df["noun"][row].lower()))

for row in range(1392,1988):
    nounTest.append((df["noun"][row].lower()))

In [386]:
# training df["Label"][0:1391]
# test df["Label"][1392:1988]

In [387]:
advancedvectorizerMeth1 = CountVectorizer(ngram_range=(3,3))
advancedtrainMeth1 = advancedvectorizerMeth1.fit_transform(nounTraining)

In [388]:
advancedmodelMeth1 = LogisticRegression()
advancedmodelMeth1 = advancedmodelMeth1.fit(advancedtrainMeth1, df["Label"][0:1391])

In [389]:
advancedtestMeth1 = advancedvectorizerMeth1.transform(nounTest)
advpredictionsMeth1 = advancedmodelMeth1.predict(advancedtestMeth1)

In [390]:
print (metrics.accuracy_score(df["Label"][1392:1988], advpredictionsMeth1))

0.5218120805369127
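
Methods 1 through 11 repeat the same vectorize, fit, predict, score steps with a different text column, target, and n-gram size. A small helper could cut that repetition. This is a sketch with a hypothetical function name and a made-up corpus, not the code that produced the numbers in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def ngram_logreg_accuracy(train_text, train_y, test_text, test_y, n):
    """Fit n-gram counts + logistic regression on train, score on test."""
    vectorizer = CountVectorizer(ngram_range=(n, n))
    X_train = vectorizer.fit_transform(train_text)  # vocabulary from training text only
    X_test = vectorizer.transform(test_text)
    model = LogisticRegression().fit(X_train, train_y)
    return accuracy_score(test_y, model.predict(X_test))

# Tiny made-up corpus, purely to show the call shape.
train_text = ["stocks rally hard", "stocks rally again",
              "markets crash badly", "markets crash today"]
train_y = [1, 1, 0, 0]
test_text = ["stocks rally now", "markets crash now"]
test_y = [1, 0]

acc = ngram_logreg_accuracy(train_text, train_y, test_text, test_y, n=2)
```

With the notebook's variables, a call such as `ngram_logreg_accuracy(nounTraining, df["Label"][0:1391], nounTest, df["Label"][1392:1988], n=3)` should correspond to Method 1.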


### Method 2 - Kaggle, 1 Lag, Noun

In [391]:
advancedvectorizerMeth2 = CountVectorizer(ngram_range=(2,2))
advancedtrainMeth2 = advancedvectorizerMeth2.fit_transform(nounTraining)

In [392]:
advancedmodelMeth2 = LogisticRegression()
advancedmodelMeth2 = advancedmodelMeth2.fit(advancedtrainMeth2, df["laggedByOne"][0:1391])

In [393]:
advancedtestMeth2 = advancedvectorizerMeth2.transform(nounTest)
advpredictionsMeth2 = advancedmodelMeth2.predict(advancedtestMeth2)

In [394]:
print (metrics.accuracy_score(df["laggedByOne"][1392:1988], advpredictionsMeth2))

0.5268456375838926


### Method 3 - Kaggle, 2 Lag, Noun

In [395]:
advancedvectorizerMeth3 = CountVectorizer(ngram_range=(3,3))
advancedtrainMeth3 = advancedvectorizerMeth3.fit_transform(nounTraining)

advancedmodelMeth3 = LogisticRegression()
advancedmodelMeth3 = advancedmodelMeth3.fit(advancedtrainMeth3, df["laggedByTwo"][0:1391])

advancedtestMeth3 = advancedvectorizerMeth3.transform(nounTest)
advpredictionsMeth3 = advancedmodelMeth3.predict(advancedtestMeth3)

print (metrics.accuracy_score(df["laggedByTwo"][1392:1988], advpredictionsMeth3))

0.552013422818792


### Method 4 - Kaggle, 3 Lag, Noun

In [396]:
advancedvectorizerMeth4 = CountVectorizer(ngram_range=(3,3))
advancedtrainMeth4= advancedvectorizerMeth4.fit_transform(nounTraining)

advancedmodelMeth4= LogisticRegression()
advancedmodelMeth4= advancedmodelMeth4.fit(advancedtrainMeth4, df["laggedByThree"][0:1391])

advancedtestMeth4 = advancedvectorizerMeth4.transform(nounTest)
advpredictionsMeth4 = advancedmodelMeth4.predict(advancedtestMeth4)

print (metrics.accuracy_score(df["laggedByThree"][1392:1988], advpredictionsMeth4))

0.5201342281879194


### Method 5 - Kaggle, 0 Lag, Verb

In [397]:
verbTraining = []
verbTest = []
for row in range(0,1391):  
    verbTraining.append((df["verb"][row].lower()))

for row in range(1392,1988):
    verbTest.append((df["verb"][row].lower()))

In [398]:
advancedvectorizerMeth5 = CountVectorizer(ngram_range=(4,4))
advancedtrainMeth5= advancedvectorizerMeth5.fit_transform(verbTraining)

advancedmodelMeth5= LogisticRegression()
advancedmodelMeth5= advancedmodelMeth5.fit(advancedtrainMeth5, df["Label"][0:1391])

advancedtestMeth5 = advancedvectorizerMeth5.transform(verbTest)
advpredictionsMeth5 = advancedmodelMeth5.predict(advancedtestMeth5)

print (metrics.accuracy_score(df["Label"][1392:1988], advpredictionsMeth5))

0.5302013422818792


### Method 6 - Kaggle, 1 Lag, Verb

In [399]:
advancedvectorizerMeth6 = CountVectorizer(ngram_range=(4,4))
advancedtrainMeth6 = advancedvectorizerMeth6.fit_transform(verbTraining)

advancedmodelMeth6 = LogisticRegression()
advancedmodelMeth6 = advancedmodelMeth6.fit(advancedtrainMeth6, df["laggedByOne"][0:1391])

advancedtestMeth6 = advancedvectorizerMeth6.transform(verbTest)
advpredictionsMeth6 = advancedmodelMeth6.predict(advancedtestMeth6)

print (metrics.accuracy_score(df["laggedByOne"][1392:1988], advpredictionsMeth6))

0.5302013422818792


### Method 7 - Kaggle, 2 Lag, Verb

In [400]:
advancedvectorizerMeth7 = CountVectorizer(ngram_range=(3,3))
advancedtrainMeth7 = advancedvectorizerMeth7.fit_transform(verbTraining)

advancedmodelMeth7 = LogisticRegression()
advancedmodelMeth7 = advancedmodelMeth7.fit(advancedtrainMeth7, df["laggedByTwo"][0:1391])

advancedtestMeth7 = advancedvectorizerMeth7.transform(verbTest)
advpredictionsMeth7 = advancedmodelMeth7.predict(advancedtestMeth7)

print (metrics.accuracy_score(df["laggedByTwo"][1392:1988], advpredictionsMeth7))

0.5419463087248322


### Method 8 - Kaggle, 3 Lag, Verb

In [401]:
advancedvectorizerMeth8 = CountVectorizer(ngram_range=(3,3))
advancedtrainMeth8 = advancedvectorizerMeth8.fit_transform(verbTraining)

advancedmodelMeth8 = LogisticRegression()
advancedmodelMeth8 = advancedmodelMeth8.fit(advancedtrainMeth8, df["laggedByThree"][0:1391])

advancedtestMeth8 = advancedvectorizerMeth8.transform(verbTest)
advpredictionsMeth8 = advancedmodelMeth8.predict(advancedtestMeth8)

print (metrics.accuracy_score(df["laggedByThree"][1392:1988], advpredictionsMeth8))

0.5369127516778524


### Method 9 - Kaggle, 0 Lag, Verb + Noun

In [402]:
df["nounAndVerb"] = df["noun"].map(str) + df["verb"]

In [403]:
nounAndVerbTraining = []
nounAndVerbTest = []
for row in range(0,1391):  
    nounAndVerbTraining.append((df["nounAndVerb"][row].lower()))

for row in range(1392,1988):
    nounAndVerbTest.append((df["nounAndVerb"][row].lower()))

In [404]:
advancedvectorizerMeth9 = CountVectorizer(ngram_range=(4,4))
advancedtrainMeth9 = advancedvectorizerMeth9.fit_transform(nounAndVerbTraining)

advancedmodelMeth9 = LogisticRegression()
advancedmodelMeth9 = advancedmodelMeth9.fit(advancedtrainMeth9, df["Label"][0:1391])

advancedtestMeth9 = advancedvectorizerMeth9.transform(nounAndVerbTest)
advpredictionsMeth9 = advancedmodelMeth9.predict(advancedtestMeth9)

print (metrics.accuracy_score(df["Label"][1392:1988], advpredictionsMeth9))

0.5302013422818792


### Method 10 - Kaggle, 1 Lag, Verb + Noun

In [405]:
advancedvectorizerMeth10 = CountVectorizer(ngram_range=(4,4))
advancedtrainMeth10 = advancedvectorizerMeth10.fit_transform(nounAndVerbTraining)

advancedmodelMeth10 = LogisticRegression()
advancedmodelMeth10 = advancedmodelMeth10.fit(advancedtrainMeth10, df["laggedByOne"][0:1391])

advancedtestMeth10 = advancedvectorizerMeth10.transform(nounAndVerbTest)
advpredictionsMeth10 = advancedmodelMeth10.predict(advancedtestMeth10)

print (metrics.accuracy_score(df["laggedByOne"][1392:1988], advpredictionsMeth10))

0.5302013422818792


### Method 11 - Kaggle, 2 Lag, Verb + Noun

In [406]:
advancedvectorizerMeth11 = CountVectorizer(ngram_range=(4,4))
advancedtrainMeth11 = advancedvectorizerMeth11.fit_transform(nounAndVerbTraining)

advancedmodelMeth11 = LogisticRegression()
advancedmodelMeth11 = advancedmodelMeth11.fit(advancedtrainMeth11, df["laggedByTwo"][0:1391])

advancedtestMeth11 = advancedvectorizerMeth11.transform(nounAndVerbTest)
advpredictionsMeth11 = advancedmodelMeth11.predict(advancedtestMeth11)

print (metrics.accuracy_score(df["laggedByTwo"][1392:1988], advpredictionsMeth11))

0.5335570469798657


### Method 12 - Kaggle Solution

In [407]:
kaggleTraining = []
kaggleTest = []
for row in range(0,1391):  
    kaggleTraining.append((df["combined"][row].lower()))

for row in range(1392,1988):
    kaggleTest.append((df["combined"][row].lower()))

In [408]:
advancedvectorizer12 = CountVectorizer(ngram_range=(2,2))
advancedtrain12 = advancedvectorizer12.fit_transform(kaggleTraining)

In [409]:
advancedmodel12 = LogisticRegression()
advancedmodel12 = advancedmodel12.fit(advancedtrain12, df["Label"][0:1391])

In [410]:
advancedtest12 = advancedvectorizer12.transform(kaggleTest)
advpredictions12 = advancedmodel12.predict(advancedtest12)

In [411]:
print (metrics.accuracy_score(df["Label"][1392:1988], advpredictions12))

0.5671140939597316
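
Accuracies this close to 50% are easier to judge next to a trivial baseline: always predicting the most common training label. Several methods above report exactly 0.5302, which would be consistent with a model that predicts one class for every test day, though nothing in this notebook confirms that. The helper below is a sketch with a hypothetical name and made-up labels.

```python
import numpy as np

def majority_baseline_accuracy(train_y, test_y):
    """Accuracy of always predicting the most common training label."""
    train_y = np.asarray(train_y)
    test_y = np.asarray(test_y)
    values, counts = np.unique(train_y, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(test_y == majority))

# Made-up labels for illustration; substitute df["Label"][0:1391] and
# df["Label"][1392:1988] to get the baseline for this notebook's split.
baseline = majority_baseline_accuracy([1, 1, 0, 1], [1, 0, 1, 1, 0])
```

If the baseline on the real split lands near the scores reported here, the models are adding little beyond the class balance.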


In [480]:
advwords = advancedvectorizer12.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
advcoeffs = advancedmodel12.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words' : advwords, 
                        'Coefficient' : advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[False, True])
advcoeffdf.head(10)

Unnamed: 0,Coefficient,Words
280178,0.257959,the first
139843,0.250185,in china
21741,0.243038,and other
241488,0.232606,right to
121721,0.223348,government has
278255,0.209556,that they
12207,0.20861,after the
128741,0.208278,have to
286150,0.207743,this is
169132,0.204027,likely to


In [481]:
advcoeffdf.tail(10)

Unnamed: 0,Coefficient,Words
284978,-0.188269,there is
8408,-0.188353,accused of
253541,-0.194755,sexual abuse
137652,-0.196025,if he
195264,-0.197356,nuclear weapons
279526,-0.218109,the country
43204,-0.222907,bin laden
302009,-0.224339,up in
318301,-0.225215,with iran
290249,-0.232685,to kill


### Method 13 - Naive Bayes, Noun

In [412]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Build the stopword set once; calling stopwords.words() for every word is slow
    stop_set = set(stopwords.words('english'))
    return [word for word in nopunc.split() if word.lower() not in stop_set]

In [413]:
def predictFunc(X_train, X_test, y_train, y_test):

    bow_transformer = CountVectorizer(analyzer=text_process).fit(X_train)
    messages_bow = bow_transformer.transform(X_train)
    tfidf_transformer = TfidfTransformer().fit(messages_bow)
    messages_tfidf = tfidf_transformer.transform(messages_bow)
    
    stock_model = MultinomialNB().fit(messages_tfidf, y_train)
    
    messages_bow_xtest = bow_transformer.transform(X_test)
    messages_tfidf_xtest = tfidf_transformer.transform(messages_bow_xtest)
    
    predictions = stock_model.predict(messages_tfidf_xtest)
    print (metrics.accuracy_score(y_test, predictions))

In [414]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df["noun"], df["Label"], random_state=1)

In [415]:
predictFunc(X_train, X_test, y_train, y_test)

0.4919678714859438
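
Note that `train_test_split` shuffles rows by default, so the Naive Bayes methods train on days that fall after some of the test days, while Methods 1 through 12 use a chronological split. The two sets of accuracies are therefore not directly comparable. A sketch of a time-ordered split on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "noun": [f"headline {i}" for i in range(8)],
    "Label": [0, 1, 1, 0, 1, 0, 1, 1],
})

# shuffle=False keeps row order: the first 75% of days train, the last 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    toy["noun"], toy["Label"], test_size=0.25, shuffle=False
)
```

With `shuffle=False` the `random_state` argument is not needed, since the split is deterministic.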


### Method 14 - Naive Bayes, 1 lag, Noun

In [416]:
X_trainNoun1Lag, X_testNoun1Lag, y_trainNoun1Lag, y_testNoun1Lag = train_test_split(df["noun"], df["laggedByOne"], random_state=1)

In [417]:
predictFunc(X_trainNoun1Lag, X_testNoun1Lag, y_trainNoun1Lag, y_testNoun1Lag)

0.5522088353413654


### Method 15 - Naive Bayes, 2 lag, Noun

In [418]:
X_trainNoun2Lag, X_testNoun2Lag, y_trainNoun2Lag, y_testNoun2Lag = train_test_split(df["noun"], df["laggedByTwo"], random_state=1)

In [419]:
predictFunc(X_trainNoun2Lag, X_testNoun2Lag, y_trainNoun2Lag, y_testNoun2Lag)

0.536144578313253


### Method 16 - Naive Bayes, 3 lag, Noun

In [420]:
X_trainNoun3Lag, X_testNoun3Lag, y_trainNoun3Lag, y_testNoun3Lag = train_test_split(df["noun"], df["laggedByThree"], random_state=1)

In [421]:
predictFunc(X_trainNoun3Lag, X_testNoun3Lag, y_trainNoun3Lag, y_testNoun3Lag)

0.5220883534136547


### Method 17 - Naive Bayes, 0 lag, Verb

In [202]:
X_trainVerb0Lag, X_testVerb0Lag, y_trainVerb0Lag, y_testVerb0Lag = train_test_split(df["verb"], df["Label"], random_state=1)

In [203]:
predictFunc(X_trainVerb0Lag, X_testVerb0Lag, y_trainVerb0Lag, y_testVerb0Lag)

0.4879518072289157


### Method 18 - Naive Bayes, 1 lag, Verb


In [422]:
X_trainVerb1Lag, X_testVerb1Lag, y_trainVerb1Lag, y_testVerb1Lag = train_test_split(df["verb"], df["laggedByOne"], random_state=1)

In [423]:
predictFunc(X_trainVerb1Lag, X_testVerb1Lag, y_trainVerb1Lag, y_testVerb1Lag)

0.5522088353413654


### Method 19 - Naive Bayes, 2 lag, Verb


In [424]:
X_trainVerb2Lag, X_testVerb2Lag, y_trainVerb2Lag, y_testVerb2Lag = train_test_split(df["verb"], df["laggedByTwo"], random_state=1)

In [425]:
predictFunc(X_trainVerb2Lag, X_testVerb2Lag, y_trainVerb2Lag, y_testVerb2Lag)

0.5481927710843374


### Method 20 - Naive Bayes, 0 lag, Verb+Noun


In [426]:
X_trainNounAndVerb0Lag, X_testNounAndVerb0Lag, y_trainNounAndVerb0Lag, y_testNounAndVerb0Lag = train_test_split(df["nounAndVerb"], df["Label"], random_state=1)

In [427]:
predictFunc(X_trainNounAndVerb0Lag, X_testNounAndVerb0Lag, y_trainNounAndVerb0Lag, y_testNounAndVerb0Lag)

0.4919678714859438


### Method 22 - Naive Bayes, 1 lag, Verb+Noun


In [428]:
X_trainNounAndVerb1Lag, X_testNounAndVerb1Lag, y_trainNounAndVerb1Lag, y_testNounAndVerb1Lag = train_test_split(df["nounAndVerb"], df["laggedByOne"], random_state=1)

In [429]:
predictFunc(X_trainNounAndVerb1Lag, X_testNounAndVerb1Lag, y_trainNounAndVerb1Lag, y_testNounAndVerb1Lag)

0.5542168674698795


### Method 23 - Naive Bayes, 1 lag, All


In [430]:
X_trainAll1Lag, X_testAll1Lag, y_trainAll1Lag, y_testAll1Lag = train_test_split(df["combined"], df["laggedByOne"], random_state=1)

In [431]:
predictFunc(X_trainAll1Lag, X_testAll1Lag, y_trainAll1Lag, y_testAll1Lag)

0.5522088353413654


### Method 24 - H2O AutoML

In [None]:
df.to_csv("./data/h20.csv")

In [452]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O cluster uptime:,7 days 2 hours 41 mins
H2O cluster timezone:,America/Chicago
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.2
H2O cluster version age:,1 month and 13 days
H2O cluster name:,H2O_from_python_AK_lom12y
H2O cluster total nodes:,1
H2O cluster free memory:,1.926 Gb
H2O cluster total cores:,12
H2O cluster allowed cores:,12


In [453]:
dfh2o = h2o.import_file("./data/h20.csv",header=1)



In [454]:
dfh2o.shape

(1989, 3)

In [455]:
dfh2o.head(2)

Label,noun,verb
DOWN,"Georgia, impeached._Russia, South, Ossetia, South, Ossetia, U.N., nothing_150, South, Ossetia, Georgia, Georgia, South, Ossetia, Russia, SO, Salim, Haman, S., Osettain, U.S., Prep, Georgia, War, Green, Light, Israel, Attack, Iran, U.S., Class, Action, Lawsuit, Behalf, American, Public, FBI_So, Russia, Georgia, NYT, Olympics, journalism._China, Bush, World, War, III, Invades, South, Ossetia, Russia, NATO, Georgia, Islamist, Backlash_Condoleezza, Rice, US, Iran, Defense, Minister, Ehud, Barak, Israel, European, Union, Iran, Iraq, Georgia, South, Pentagon, Iran, Idea, US, News, World, Report_Caucasus, Georgia, South, Mental, Illnesses, Olympics_No, Help, Mexico, Kidnapping, Surge","move, brink, be, roll, fighting, are, moving, has, been, destroyed, raped, says, is, was, raped, do, have, entered, shoots, invades, warned, would, intervene, are, has, been, sentenced, will, be, kept, feel, retreat, leaving, killed, VIDEO]_Did, Gives, Says, has, are, is, opening, tells, stay, affairs_Did, gets, involved, will, absorb, unleash, Faces, would, act, prevent, is, prepared, is, has, approved, will, withdraw, help, fight, Ossetia_Why, Thinks, Attacking, is, invades, do, like, Suffering, Banned"
UP,"America, Nato, Iraq?_Bush, Russia, _, Russians, Gori, Russia, Mossad, New, Zealand, Passports, Georgia_An, S.Ossetia, U.S., World, War, IV, High, Definition!_Georgia, Russia, Georgia, U.S., goal_Abhinav, Bindra, Olympic, Gold, Medal, India, _, U.S., Arctic, Jerusalem, Ara_The, French, Team, Phelps, Relay, Team_Israel, US, Georgian, Montreal, Canada, Saturday._China, US, manufacturer_War, South, Ossetia, PICS]_Israeli, Physicians, Group, Condemns, State, Torture, Russia, United, States, Peak, Georgia, Russia, Russia","wo, help, wo, help, did, help, puts, re, fending, abandoned, were, doing, angered, living, blames, presses, says, is, wins, define, threaten, quit, is, Stunned, believe, are, are, going, murdered, overtake, has, beaten, is, is, s, come"




In [456]:
dfh2o.describe()

Rows:1989
Cols:3




Unnamed: 0,Label,noun,verb
type,enum,string,string
mins,,,
mean,,,
maxs,,,
sigma,,,
zeros,,0,0
missing,0,0,3
0,DOWN,"Georgia, impeached._Russia, South, Ossetia, South, Ossetia, U.N., nothing_150, South, Ossetia, Georgia, Georgia, South, Ossetia, Russia, SO, Salim, Haman, S., Osettain, U.S., Prep, Georgia, War, Green, Light, Israel, Attack, Iran, U.S., Class, Action, Lawsuit, Behalf, American, Public, FBI_So, Russia, Georgia, NYT, Olympics, journalism._China, Bush, World, War, III, Invades, South, Ossetia, Russia, NATO, Georgia, Islamist, Backlash_Condoleezza, Rice, US, Iran, Defense, Minister, Ehud, Barak, Israel, European, Union, Iran, Iraq, Georgia, South, Pentagon, Iran, Idea, US, News, World, Report_Caucasus, Georgia, South, Mental, Illnesses, Olympics_No, Help, Mexico, Kidnapping, Surge","move, brink, be, roll, fighting, are, moving, has, been, destroyed, raped, says, is, was, raped, do, have, entered, shoots, invades, warned, would, intervene, are, has, been, sentenced, will, be, kept, feel, retreat, leaving, killed, VIDEO]_Did, Gives, Says, has, are, is, opening, tells, stay, affairs_Did, gets, involved, will, absorb, unleash, Faces, would, act, prevent, is, prepared, is, has, approved, will, withdraw, help, fight, Ossetia_Why, Thinks, Attacking, is, invades, do, like, Suffering, Banned"
1,UP,"America, Nato, Iraq?_Bush, Russia, _, Russians, Gori, Russia, Mossad, New, Zealand, Passports, Georgia_An, S.Ossetia, U.S., World, War, IV, High, Definition!_Georgia, Russia, Georgia, U.S., goal_Abhinav, Bindra, Olympic, Gold, Medal, India, _, U.S., Arctic, Jerusalem, Ara_The, French, Team, Phelps, Relay, Team_Israel, US, Georgian, Montreal, Canada, Saturday._China, US, manufacturer_War, South, Ossetia, PICS]_Israeli, Physicians, Group, Condemns, State, Torture, Russia, United, States, Peak, Georgia, Russia, Russia","wo, help, wo, help, did, help, puts, re, fending, abandoned, were, doing, angered, living, blames, presses, says, is, wins, define, threaten, quit, is, Stunned, believe, are, are, going, murdered, overtake, has, beaten, is, is, s, come"
2,DOWN,"Georgia, Iraq, Islamic, Georgia, Putin, Outmaneuvers, Microsoft, Intel, Russo, Georgian, War, Balance, Power, Whole, Georgia, Russia, War, Georgia, Russia, Did_The, US, South, Ossetia, US, Monday_U.S., Beats, War, Drum, Iran, Dollar_Gorbachev, South, Tskhinvali, Tskhinvali, Olympics, Games, IOC, Russia._55, Luxor, Tokyo, Bay_The, Top, Party, Cities, World_U.S., Georgia, Georgia, Russias, Georgia, right_Gorbachev, U.S., Caucasus, region_Russia, Georgia, NATO, Cold, War, Two_Remember, Georgia, connection_All, US, Georgia, South, Ossetia, Goddamnit, Bush._Christopher, King, US, NATO, South, Ossetia, America, New, Mexico?_BBC, NEWS, |, Asia, Pacific, |, Extinction","Remember, sang, was, ends, had, would, have, is, losing, regards, including, buying, tried, kill, _, m, Trying, Get, Vote, Think, Started, Think, was, surprised, is, trying, sort, happened, said, Dumps, attacked, designed, devastate, ruins, cover, VIDEO]_Beginning, were, opening, violates, could, respond, taking, stacked, did, know, were, was, accuses, making, pursuing, led, based, was, too._War, encouraging, invade, argues, are, have, misjudged, _, climate"


In [457]:
y = "Label"
x = dfh2o.columns
x.remove(y)
# x.remove("laggedByOne")
# x.remove("laggedByTwo")
# x.remove("laggedByThree")

In [458]:
x

['noun', 'verb']

In [462]:
y

'Label'

In [463]:
len(nounAndVerbTraining)

1391

In [None]:
aml = H2OAutoML(max_models = 10, seed = 1)
aml.train(x = x, y = y, training_frame = dfh2o)

In [334]:
lb = aml.leaderboard

In [335]:
lb.head()

model_id,auc,logloss
GBM_grid_0_AutoML_20180401_195533_model_4,0.519296,0.692541
GBM_grid_0_AutoML_20180401_195533_model_3,0.517824,0.721868
GBM_grid_0_AutoML_20180401_195533_model_2,0.51151,0.715895
GBM_grid_0_AutoML_20180401_195533_model_1,0.50992,0.714119
GBM_grid_0_AutoML_20180401_195533_model_0,0.50345,0.715074
DeepLearning_0_AutoML_20180401_195533,0.495189,0.714597
StackedEnsemble_AllModels_0_AutoML_20180401_195533,0.49355,0.689716
StackedEnsemble_BestOfFamily_0_AutoML_20180401_195533,0.49355,0.689716
DRF_0_AutoML_20180401_195533,0.486147,2.21771
XRT_0_AutoML_20180401_195533,0.48532,1.60749




### Method 25 - SVM

In [469]:
from sklearn.pipeline import Pipeline
import numpy as np

In [470]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

In [471]:
text_clf = text_clf.fit(kaggleTraining, df["Label"][0:1391])
predicted = text_clf.predict(kaggleTest)
np.mean(predicted == df["Label"][1392:1988])

0.5302013422818792

In [472]:
from sklearn.linear_model import SGDClassifier

In [473]:
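text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words="english")),
                      ('tfidf', TfidfTransformer()),
                      # n_iter was replaced by max_iter in later scikit-learn releases;
                      # tol=None keeps the fixed five passes over the data
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])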

In [474]:
_ = text_clf_svm.fit(kaggleTraining, df["Nasdaq"][0:1391])



In [475]:
predicted_svm = text_clf_svm.predict(kaggleTest)
np.mean(predicted_svm == df["Nasdaq"][1392:1988])

0.5318791946308725

### Method 26 - Stemming

In [477]:
import nltk
# nltk.download()
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                      ('tfidf', TfidfTransformer()),
                      ('mnb', MultinomialNB(fit_prior=False)),
])



In [478]:
text_mnb_stemmed = text_mnb_stemmed.fit(kaggleTraining,df["Label"][0:1391])
predicted_mnb_stemmed = text_mnb_stemmed.predict(kaggleTest)
np.mean(predicted_mnb_stemmed == df["Label"][1392:1988])

0.5302013422818792