## Final Project Draft

The purpose of this project is to build a model that reads news headlines and predicts the direction of the stock market's daily move.  I will use a binary indicator for the market movement, with 1 indicating a flat or up day and 0 indicating a down day.  Given the binary target, I will start with logistic regression and expand to additional models after that.  A successful model of this kind could be extremely useful in that:
- It can be used to provide investors, such as large portfolio managers, an automated way of analyzing a large volume of news information more quickly than an individual can process such information
- The investors can then use the prediction to overweight or underweight their portfolios 
- Even a small improvement beyond a 50/50 chance for an up or down day could be extremely valuable, as a small improvement in investor positioning over time could accumulate to a large improvement in investment returns

After the first step of predicting up or down movement, additional steps may include:
- Predicting magnitude of up or down movement
- Predicting more than 1 day of movement
- Predicting patterns of movement, such as up/down/up, up/up/down, etc.
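
As a rough illustration of how the multi-day and pattern targets listed above could be derived from the daily binary label, here is a minimal sketch. The label values and the helper names `label_ahead` and `pattern_ahead` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical daily labels: 1 = flat or up day, 0 = down day
labels = pd.Series([1, 0, 1, 1, 0, 0, 1], name="Label")

def label_ahead(series, k):
    """Label observed k days later; the last k rows have no future value (NaN)."""
    return series.shift(-k)

def pattern_ahead(series, days=3):
    """Encode the next `days` labels as a string such as '101' (up/down/up)."""
    parts = [series.shift(-k) for k in range(1, days + 1)]
    frame = pd.concat(parts, axis=1, keys=range(1, days + 1))
    # Drop rows that run past the end of the data, then turn 1.0/0.0 into '1'/'0'
    frame = frame.dropna().astype(int).astype(str)
    return frame.apply(lambda row: "".join(row), axis=1)

next_day = label_ahead(labels, 1)
patterns = pattern_ahead(labels, 3)
print(patterns)
```

A pattern target with three days has eight possible classes, so it would need a multi-class model rather than the binary setup used below.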

The initial dataset I will use is from Kaggle, titled Daily News For Stock Market Prediction.  This data includes the following:
- A file of news headline data from the range 2008-06-08 to 2016-07-01.  The headlines are taken from the Reddit WorldNews channel.  The top 25 headlines for each date, ranked by Reddit users' votes, are included.
- A binary variable showing, for each date, a 1 if the Dow Jones Industrial Average (DJIA) was flat or positive for the day, and a 0 if the DJIA was negative for the day
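
To make the label definition concrete, here is a minimal sketch of how such an indicator could be computed from a price series. It assumes the label compares each day's close with the previous day's close, which the description above does not spell out, and the prices are invented for illustration.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 101.5, 101.5, 99.0, 99.8])

# Day-over-day change; the first day has no prior close, so it is dropped
change = close.diff().dropna()

# 1 = flat or up day, 0 = down day
label = (change >= 0).astype(int)
print(label.tolist())
```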


In [148]:
#Step 1
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import datetime as dt
from dateutil.parser import parse
import numpy as np
import copy

In [149]:
#Step 2
#Read in the data file and keep an independent copy to facilitate some operations below
data = pd.read_csv('~/Downloads/Combined_News_DJIA.csv')
infile = data.copy()  # a plain assignment would only alias the same DataFrame
infile.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


Inspecting the data, we see the Date column, followed by the up/down indicator, which is titled Label. After that are 25 columns containing the day's top 25 ranked headlines.

In order to facilitate processing of the headlines data, we will create a new column titled "Combined", which will concatenate all the headlines into a single string. 

Additional analysis may be done later which includes only certain headline columns to verify if the ranking has an impact on the results.
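
One way the ranking analysis mentioned above could be set up is to build the combined string from only the k highest-ranked headline columns. The helper `combine_top_k` and the toy frame are hypothetical; the sketch only assumes the Top1..Top25 column naming shown in the data preview.

```python
import pandas as pd

def combine_top_k(frame, k):
    """Join the k highest-ranked headline columns (Top1..Topk) into one string per row."""
    cols = ["Top%d" % i for i in range(1, k + 1)]
    # fillna guards against missing headlines before the strings are joined
    return frame[cols].fillna("").astype(str).apply(lambda row: " ".join(row), axis=1)

# Toy frame with the same column naming as the dataset
toy = pd.DataFrame({
    "Date": ["2008-08-08"],
    "Label": [0],
    "Top1": ["first headline"],
    "Top2": ["second headline"],
    "Top3": [None],
})

combined_top2 = combine_top_k(toy, 2)
print(combined_top2.iloc[0])
```

Running the model once per value of k would then show whether the lower-ranked headlines add or remove signal.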

In [150]:
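# Join the 25 headline columns (Top1..Top25) into one space-separated string per row
infile['Combined'] = data.iloc[:, 2:27].apply(lambda row: ' '.join(str(x) for x in row.values), axis=1)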


In order to create a train/test split, I needed to replace the date strings with datetime objects, as the string format of the dates was difficult to use in a comparison.

In [151]:
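def run_date_convert(xfile):
    # Parse the whole Date column into datetime values in one assignment on the
    # frame itself. Assigning element by element through xfile.Date[cnt] is a
    # chained assignment, which pandas may apply to a temporary copy.
    xfile['Date'] = pd.to_datetime(xfile['Date'])

    return xfile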
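
For reference, here is a minimal sketch of the date conversion and the cutoff comparison on a toy frame. The rows are invented. The point is that converting the whole column with `pd.to_datetime` in one assignment avoids writing through a chained selection such as `frame.Date[i] = ...`.

```python
import datetime as dt
import pandas as pd

# Toy frame with ISO-formatted date strings, for illustration only
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02"],
    "Label": [1, 0, 1],
})

# Convert the whole column in one assignment on the frame itself
toy["Date"] = pd.to_datetime(toy["Date"])

# Rows on or before the cutoff go to training, later rows to testing
cutoff = dt.datetime(2014, 12, 31)
train_rows = toy.loc[toy["Date"] <= cutoff]
test_rows = toy.loc[toy["Date"] > cutoff]
print(len(train_rows), len(test_rows))
```

Splitting by date rather than at random keeps every test day strictly later than every training day, which matches how the model would be used in practice.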

In [152]:
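# Now create the train and test split function:
def create_test_train(fileuse):
    # Convert the Date strings to datetime values so they can be compared
    fileuse = run_date_convert(fileuse)

    # Build the Combined column if the frame passed in does not have it yet
    if 'Combined' not in fileuse.columns:
        fileuse['Combined'] = fileuse.filter(like='Top').apply(lambda row: ' '.join(str(x) for x in row.values), axis=1)

    # Select rows from the frame passed in, not from the global infile, so that
    # a frame with different labels produces different train and test sets
    cutoff = dt.datetime(2014, 12, 31)
    xtrain = fileuse.loc[fileuse['Date'] <= cutoff, ['Label', 'Combined']]
    xtest = fileuse.loc[fileuse['Date'] > cutoff, ['Label', 'Combined']]

    return xtrain, xtest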


In [153]:
# Fit a logistic regression on n-gram counts for each n-gram range in the global
# list choices, and print a confusion matrix for the global train and test sets
def run_model1():
    # Column 0 is Label, so column 1 onward holds the headline text
    trainheadlines = []
    for row in range(0, len(train.index)):
        trainheadlines.append(' '.join(str(x) for x in train.iloc[row, 1:]))

    testheadlines = []
    for row in range(0, len(test.index)):
        testheadlines.append(' '.join(str(x) for x in test.iloc[row, 1:]))

    # Then, cycle through the combinations of ngrams, to see which ones give the best results
    for twords in choices:
        # Fit the vocabulary and the model on the training headlines only
        bvectorizer = CountVectorizer(ngram_range=twords)
        btrain = bvectorizer.fit_transform(trainheadlines)
        bmodel = LogisticRegression()
        bmodel = bmodel.fit(btrain, train["Label"])

        # Apply the fitted vocabulary to the test headlines and predict
        btest = bvectorizer.transform(testheadlines)
        pred = bmodel.predict(btest)

        q = pd.crosstab(test["Label"], pred, rownames=["Actual"], colnames=["Predicted"])

        print('\n', twords, ': \n')
        print(q)

        
# Ranges of n-gram sizes to try, as (min_n, max_n) tuples for CountVectorizer
choices=[(1,1),(1,2),(1,3),(2,2),(3,3)]

train, test = create_test_train(infile)

run_model1()


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



(1, 1) : 

Predicted   0    1
Actual            
0          61  125
1          91  101

(1, 2) : 

Predicted   0    1
Actual            
0          65  121
1          75  117

(1, 3) : 

Predicted   0    1
Actual            
0          58  128
1          57  135

(2, 2) : 

Predicted   0    1
Actual            
0          65  121
1          46  146

(3, 3) : 

Predicted   0    1
Actual            
0          12  174
1          12  180
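
To clarify what the tuples in the output above mean, here is a minimal sketch of how `ngram_range` changes the vocabulary that `CountVectorizer` builds. The single toy document and the helper `vocabulary_for` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One toy headline, for illustration only
docs = ["russia ends georgia operation"]

def vocabulary_for(ngram_range):
    """Return the sorted n-grams CountVectorizer extracts for the given range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

unigrams = vocabulary_for((1, 1))    # single words only
bigrams = vocabulary_for((2, 2))     # two-word sequences only
uni_and_bi = vocabulary_for((1, 2))  # single words plus two-word sequences

print(unigrams)
print(bigrams)
```

Longer n-grams are much sparser than single words: most three-word sequences in the test headlines never occur in the training headlines, so a (3,3) model has little to work with on unseen days.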


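These results are interesting. Overall accuracy on the 378 test days ranges from 43% for the (1,1) set, i.e. single-word n-grams, up to 56% for the (2,2) set, i.e. two-word n-grams. For comparison, 192 of the 378 test days (51%) were up days, so always predicting "up" would already score 51%.
The success rate for predicting an up day looks more striking: the (1,1) set correctly predicted 53% of the up days, whereas the (3,3) set, i.e. three-word n-grams, correctly predicted 94% of them. However, the (3,3) model predicted "up" on 354 of the 378 days, including 174 of the 186 down days, so its high hit rate on up days mostly reflects a bias toward predicting "up", and its overall accuracy is only 51%.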
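
The percentages quoted above can be recomputed directly from the printed confusion matrices. The sketch below is not part of the original notebook: it copies the cell counts by hand, and the helper `summarize` is hypothetical. It also adds the accuracy of always predicting "up" as a baseline for comparison.

```python
# Cell counts copied from the confusion matrices above, as (TN, FP, FN, TP):
# rows are actual 0/1, columns are predicted 0/1
matrices = {
    (1, 1): (61, 125, 91, 101),
    (1, 2): (65, 121, 75, 117),
    (1, 3): (58, 128, 57, 135),
    (2, 2): (65, 121, 46, 146),
    (3, 3): (12, 174, 12, 180),
}

def summarize(tn, fp, fn, tp):
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,            # share of all days predicted correctly
        "up_recall": tp / (tp + fn),              # share of actual up days predicted up
        "up_precision": tp / (tp + fp),           # share of predicted up days that were up
        "always_up_baseline": (tp + fn) / total,  # accuracy of predicting up every day
    }

summary = {ngram: summarize(*cells) for ngram, cells in matrices.items()}
for ngram, stats in summary.items():
    print(ngram, {name: round(value, 3) for name, value in stats.items()})
```

Reading recall together with precision and the baseline shows whether a high hit rate on up days comes from real signal or from predicting "up" almost every day.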

Whatever the success rate, there is a timing problem: the headlines and the price movements were contemporaneous, so by the time the headline data was available, the price movement had already occurred.  As an actual investment signal, the prediction therefore comes too late.  The next step to address that is to offset the price movement data by 1 day.  That way, in practice, the model could be run each day before the market closes.  Based on the prediction for the next day's price movement, an investment position can be taken market-on-close.  This may be too much pressure to execute each day, as you must wait for the news headlines to be published, run the model, and then execute a trade, all within a few minutes prior to the market closing.  If you run the model too early, you risk missing headlines.  If you run the model too late, the market has closed.

A further refinement will therefore be to use the headline predictors to predict the market's next day change, from the open to the close.  That way the model could be run overnight and provide a signal to be executed market-on-open.
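
A minimal sketch of how such an open-to-close target might be built is given below. The column names `Open` and `Close`, the new label columns and the prices themselves are assumptions made for illustration; the notebook has not loaded any price data at this point.

```python
import pandas as pd

# Hypothetical open and close prices, for illustration only
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2016-06-27", "2016-06-28", "2016-06-29"]),
    "Open": [100.0, 98.0, 99.5],
    "Close": [98.5, 99.0, 99.5],
})

# 1 if the session closed at or above its open, else 0
prices["IntradayLabel"] = (prices["Close"] >= prices["Open"]).astype(int)

# Align each day's headlines with the NEXT session's open-to-close label
prices["NextIntradayLabel"] = prices["IntradayLabel"].shift(-1)

# The last day has no following session, so it is dropped
aligned = prices.dropna(subset=["NextIntradayLabel"]).copy()
aligned["NextIntradayLabel"] = aligned["NextIntradayLabel"].astype(int)
print(aligned[["Date", "NextIntradayLabel"]])
```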

In [154]:
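# Create a new frame, and offset the price movement data by 1 day:

data = pd.read_csv('~/Downloads/Combined_News_DJIA.csv')
infile2 = data.copy()

# Pair each day's headlines with the next day's label. shift(-1) avoids the
# chained assignment infile2['Label'].iloc[x] = ..., which pandas may apply to a
# temporary copy. The final row has no next day, so it keeps its own label.
infile2['Label'] = data['Label'].shift(-1).fillna(data['Label']).astype(int)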



In [155]:
#Check offset
infile2.Label.head(5)


0    1
1    0
2    0
3    1
4    1
Name: Label, dtype: int64

In [133]:
#recreate the test and train sets

In [156]:
train, test = create_test_train(infile2)

run_model1()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



(1, 1) : 

Predicted   0    1
Actual            
0          61  125
1          91  101

(1, 2) : 

Predicted   0    1
Actual            
0          65  121
1          75  117

(1, 3) : 

Predicted   0    1
Actual            
0          58  128
1          57  135

(2, 2) : 

Predicted   0    1
Actual            
0          65  121
1          46  146

(3, 3) : 

Predicted   0    1
Actual            
0          12  174
1          12  180


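The tables above match the first run cell for cell. That is not evidence that the headlines predict the next day's move: identical output indicates that the shifted labels never reached the model. The output shown was produced with training and test rows drawn from the original infile rather than from infile2, so the comparison between same-day and next-day prediction has to be rerun before any conclusion can be drawn.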

Further steps will now include: 
- Using a price move of the next day's Open vs. Close
- Comparing the results using subsets of each day's 25 headlines
- Using additional models beyond Logistic regression
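
As one example of an additional model, the sketch below swaps logistic regression for a multinomial naive Bayes classifier inside a scikit-learn pipeline. The choice of naive Bayes is an assumption rather than something planned in the text above, and the headlines and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy headlines and labels, invented for illustration only
texts = [
    "markets rally on strong jobs report",
    "stocks slump as crisis deepens",
    "rally continues on strong earnings",
    "crisis fears trigger slump",
]
labels = [1, 0, 1, 0]

# The pipeline keeps the vectorizer and the classifier together, so the
# vocabulary is always fitted on the training text only
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["strong rally on earnings"])
print(prediction)
```

Because the pipeline exposes the same fit and predict methods as a single estimator, other classifiers can be substituted in the second step without changing the rest of the code.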


