<h1>Market Prediction Based on Daily News from The New York Times</h1>

<h2>Capstone Project 2</h2>

<h3>August 8, 2019</h3>
<h3>David Gray</h3>

<h1>Overview</h1>

The goal of this project is to train a machine learning model to predict the movement of the stock market from that day's news. First, daily news is analyzed with a random forest model. Then, models are trained on news pooled over roughly a week (five trading days); the longer window was chosen to try to improve on the performance seen with daily news. Finally, bigram data was analyzed with only one model, because the other models required lengthy training: the combined unigram and bigram table exceeded half a million columns.

<b>Note:</b> Some cells appear to have been run out of order because execution of some cells could take over a day. The results from prior executions are included.

In [4]:
import sys
#reload(sys)
#sys.setdefaultencoding('utf8')
import time
import json
import datetime
from datetime import timedelta
from datetime import datetime
import requests
import numpy as np
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn import model_selection, naive_bayes, svm
import pandas as pd
from pprint import pprint
#import matplotlib.pyplot as plt
from dateutil import parser
apikey = 'MYAPIKEY'

<h1>Import Financial Data and Extract Dates</h1>

The dates in the financial data are important because they determine which dates of news to analyze. For instance, news from weekends and holidays is not included.

In [5]:
financial_data = pd.read_csv('SP500from2009to2019.csv')

In [6]:
financial_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,7/20/09,942.070007,951.619995,940.98999,951.130005,951.130005,4853150000
1,7/21/09,951.969971,956.530029,943.219971,954.580017,954.580017,5309300000
2,7/22/09,953.400024,959.830017,947.75,954.070007,954.070007,4634100000
3,7/23/09,954.070007,979.419983,953.27002,976.289978,976.289978,5761650000
4,7/24/09,972.159973,979.789978,965.950012,979.26001,979.26001,4458300000


Several columns from the financial table are not needed and are dropped.

In [7]:
cropped_data = financial_data.drop(["High", "Low", "Close", "Volume"], axis=1)

In [8]:
cropped_data.head()

Unnamed: 0,Date,Open,Adj Close
0,7/20/09,942.070007,951.130005
1,7/21/09,951.969971,954.580017
2,7/22/09,953.400024,954.070007
3,7/23/09,954.070007,976.289978
4,7/24/09,972.159973,979.26001


Find out whether the market rose or fell and assign a "1" or "0" accordingly

In [9]:
cropped_data['Change'] = cropped_data['Adj Close'] - cropped_data['Open']

In [10]:
cropped_data.head()

Unnamed: 0,Date,Open,Adj Close,Change
0,7/20/09,942.070007,951.130005,9.059998
1,7/21/09,951.969971,954.580017,2.610046
2,7/22/09,953.400024,954.070007,0.669983
3,7/23/09,954.070007,976.289978,22.219971
4,7/24/09,972.159973,979.26001,7.100037


In [11]:
cropped_data['Change'] = (cropped_data['Change'] > 0)

In [12]:
cropped_data['Change'] = cropped_data['Change'] * 1

In [13]:
cropped_data.head()

Unnamed: 0,Date,Open,Adj Close,Change
0,7/20/09,942.070007,951.130005,1
1,7/21/09,951.969971,954.580017,1
2,7/22/09,953.400024,954.070007,1
3,7/23/09,954.070007,976.289978,1
4,7/24/09,972.159973,979.26001,1
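The three labelling steps above (subtract, compare with zero, multiply by one) can also be written as a single comparison. The following is a minimal sketch on a toy frame; the name `prices` is hypothetical, the first three rows reuse values from the table above, and the last row is made up to show a falling day.

```python
import pandas as pd

# Toy frame: first three rows copied from the table above,
# fourth row invented to illustrate a day on which the market fell.
prices = pd.DataFrame({
    "Open": [942.070007, 951.969971, 953.400024, 100.0],
    "Adj Close": [951.130005, 954.580017, 954.070007, 99.0],
})

# 1 when the adjusted close is above the open, 0 otherwise.
prices["Change"] = (prices["Adj Close"] > prices["Open"]).astype(int)
print(prices["Change"].tolist())
```

A day on which the adjusted close equals the open is labelled 0 here, which matches the `> 0` test used in the notebook.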


We can drop the "Open" and "Adj Close" columns because the rise or fall has been determined.

In [14]:
change_data = cropped_data.drop(["Open", "Adj Close"], axis=1)

In [15]:
change_data.head(8)

Unnamed: 0,Date,Change
0,7/20/09,1
1,7/21/09,1
2,7/22/09,1
3,7/23/09,1
4,7/24/09,1
5,7/27/09,1
6,7/28/09,0
7,7/29/09,0


The dates are reformatted as YYYYMMDD for use in calls to the NY Times API.

In [16]:
change_data['Date'] = change_data['Date'].map(lambda date: datetime.strptime(date,'%m/%d/%y').strftime('%Y%m%d'))

In [17]:
change_data.head()

Unnamed: 0,Date,Change
0,20090720,1
1,20090721,1
2,20090722,1
3,20090723,1
4,20090724,1
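The same date conversion can be done on the whole column at once instead of mapping a lambda over each value. This is a small sketch with a hypothetical `dates` series holding a few of the date strings shown above.

```python
import pandas as pd

# A few date strings in the month/day/two-digit-year layout of the CSV.
dates = pd.Series(["7/20/09", "7/21/09", "7/24/09"])

# Parse with an explicit format, then render as YYYYMMDD strings.
formatted = pd.to_datetime(dates, format="%m/%d/%y").dt.strftime("%Y%m%d")
print(formatted.tolist())
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first parsing.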


In [18]:
formatted_dates = change_data["Date"]

In [19]:
formatted_dates.head()

0    20090720
1    20090721
2    20090722
3    20090723
4    20090724
Name: Date, dtype: object

<h1>Functions for Processing Data from JSON and to Tokenize</h1>

These functions help collect the news initially in JSON format, and process the text for later vectorization.

In [20]:
def collect(json_data):
    # Concatenate the abstracts of all returned articles, separated by spaces
    collection = ''
    for doc in json_data["response"]["docs"]:
        collection += doc["abstract"] + ' '
    return collection

In [21]:
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Tokenize, drop punctuation and stop words, lowercase, and lemmatize
    final_text = []
    stop_words = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        no_punc_words = [w for w in words if w not in string.punctuation]
        # Stop words are matched before lowercasing, so capitalized ones (e.g. "The") are kept
        filtered_words = [w.lower() for w in no_punc_words if w not in stop_words]
        final_text += [lemmatizer.lemmatize(word) for word in filtered_words]
    return final_text
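The order of lowercasing and stop-word filtering matters for capitalized words. The sketch below uses a tiny made-up stop list and two hypothetical helper functions (no NLTK required) to show the difference; it is an illustration, not the function used to produce the results in this notebook.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "of"}

def filter_then_lower(tokens, stop_words):
    # Compare the original token, then lowercase what survives.
    return [t.lower() for t in tokens if t not in stop_words]

def lower_then_filter(tokens, stop_words):
    # Lowercase first, so "The" and "the" are treated alike.
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]

tokens = ["The", "judge", "ordered", "the", "release"]
kept_capitalized = filter_then_lower(tokens, STOP_WORDS)
dropped_all = lower_then_filter(tokens, STOP_WORDS)
print(kept_capitalized)
print(dropped_all)
```

With the first ordering, a sentence-initial "The" ends up in the vocabulary as "the"; with the second it is removed.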

<h1>Downloading Articles and Storing Processed Text</h1>

The New York Times API enforces a call limit, so a twenty-three second pause follows each request when downloading the abstracts. National and foreign news is collected for every date with financial data. The text is also converted to ASCII, dropping any characters that cannot be encoded.

In [None]:
filling_df = pd.DataFrame()
news_df = pd.DataFrame()
news_collection_df = pd.DataFrame()

for date in formatted_dates:

    begin_date = date
    end_date = begin_date
    news_desk = 'National'
    url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?fq=news_desk:(' + news_desk + ')&print_page="1"&q= &begin_date=' + begin_date + '&end_date=' + end_date + '&api-key=' + apikey
    r = requests.get(url)
    national_json_data = r.json()
    #pprint(national_json_data)
    
    time.sleep(23)
    
    news_desk = 'Foreign'
    url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?fq=news_desk:(' + news_desk + ')&print_page="1"&q= &begin_date=' + begin_date + '&end_date=' + end_date + '&api-key=' + apikey
    r = requests.get(url)
    foreign_json_data = r.json()
    #pprint(foreign_json_data)
    
    time.sleep(23)
    
    national_collection = collect(national_json_data)
    foreign_collection = collect(foreign_json_data)
    
    one_days_collection = national_collection + foreign_collection
    #print(one_days_collection)
    
    one_days_collection = one_days_collection.encode('ascii', 'ignore')
    
    print(date, one_days_collection)
    
    news_df["Text"] = [one_days_collection]
    news_df.index = [str(date)]
    news_df.index.name = 'Date'
    news_collection_df = news_collection_df.append(news_df)
    
    news_collection_df.to_csv('20190722NYTimesFullTextANSII.csv')
    
    one_days_collection = [one_days_collection]
    
    count_vect = CountVectorizer(analyzer=clean_text)
    X_counts = count_vect.fit_transform(one_days_collection)
    #print(count_vect.get_feature_names())
    
    X_counts_df = pd.DataFrame(X_counts.toarray())
    X_counts_df.columns = count_vect.get_feature_names()
    X_counts_df.index = [str(date)]
    X_counts_df.index.name = 'Date'
    #print(X_counts_df)
    
    filling_df = filling_df.append(X_counts_df)
    #print(filling_df)
    filling_df = filling_df.fillna(0)
    
    filling_df.to_csv('20190722NYTimesDataANSII.csv')
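The request URL in the loop above is assembled by string concatenation, which leaves characters such as spaces, quotes, and parentheses unescaped. One alternative is to let the standard library encode the query string. This is a sketch for Python 3 that reuses the same parameter names and values as the loop; `build_url` is a hypothetical helper, and no request is sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_url(news_desk, date, api_key):
    # Same query parameters as the concatenated URL in the loop above.
    params = {
        "fq": "news_desk:(" + news_desk + ")",
        "print_page": '"1"',
        "q": " ",
        "begin_date": date,
        "end_date": date,
        "api-key": api_key,
    }
    # urlencode percent-escapes reserved characters in each value.
    return BASE_URL + "?" + urlencode(params)

url = build_url("National", "20090720", "MYAPIKEY")
print(url)
```

The resulting string can be passed to `requests.get` in place of the hand-built URL.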

A copy of the data was saved in case a later operation changed the original data.

In [None]:
filling_df.to_csv('20190722NYTimesDataANSII_copy.csv')

In [None]:
news_collection_df.to_csv('20190722NYTimesFullTextANSII_copy.csv')

<h2>Full Analysis with All Daily Data</h2>

Again, a copy of the data was created after loading from file

In [22]:
filling_df = pd.read_csv('20190722NYTimesDataANSII_copy.csv')

In [23]:
copy_filling_df = filling_df.copy()
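Binding a DataFrame to a second name does not duplicate it; both names refer to the same object, so only `.copy()` protects the original from later changes. A short sketch with made-up values and hypothetical names shows the difference.

```python
import pandas as pd

original = pd.DataFrame({"count": [1.0, 2.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias.loc[0, "count"] = 99.0        # also changes `original`
independent.loc[1, "count"] = -1.0  # leaves `original` untouched

print(original)
```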

In [24]:
copy_filling_df.head(10)

Unnamed: 0,Date,'','91,'affluenza,'d,'n,'re,'s,'stand,*an,...,zulu,zuma,zumas,zuni,zuoren,zurich,zurita,zvonko,zvyagintsev,zygier
0,20090720,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,20090721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,20090722,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20090723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,20090724,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,20090727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,20090728,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,20090729,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,20090730,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,20090731,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, two simple tables are created: one from the text data containing only the count columns, and one from the financial data containing only the rise-or-fall column.

In [25]:
reset_copy_filling_df = copy_filling_df.reset_index()
no_date_copy_filling_df = reset_copy_filling_df.drop(["Date"], axis=1)

In [26]:
no_date_copy_filling_df.head(10)

Unnamed: 0,level_0,'','91,'affluenza,'d,'n,'re,'s,'stand,*an,...,zulu,zuma,zumas,zuni,zuoren,zurich,zurita,zvonko,zvyagintsev,zygier
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
copy_change_data = change_data.copy()

In [28]:
copy_change_data.head(10)

Unnamed: 0,Date,Change
0,20090720,1
1,20090721,1
2,20090722,1
3,20090723,1
4,20090724,1
5,20090727,1
6,20090728,0
7,20090729,0
8,20090730,1
9,20090731,1


In [29]:
no_date_copy_change_data = copy_change_data.drop(["Date"], axis=1)

In [30]:
no_date_copy_change_data.head(10)

Unnamed: 0,Change
0,1
1,1
2,1
3,1
4,1
5,1
6,0
7,0
8,1
9,1


<h3>Test prediction with Random Forest</h3>

In [31]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, no_date_copy_filling_df, no_date_copy_change_data["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.51190476, 0.50595238, 0.48809524, 0.49502982, 0.51093439])

In [32]:
sum([0.51190476, 0.50198413, 0.50396825, 0.48508946, 0.52485089]) / 5.0

0.505559498

Calculate the percentage of days that the market increased.

In [33]:
total = 0
for i in range(len(no_date_copy_change_data)):
    if no_date_copy_change_data["Change"][i] == 1:
        total += 1.0
print(total)
print(total / len(no_date_copy_change_data))

1373.0
0.545274027006


Thus, the random forest model performed slightly worse than always guessing that the market increased (about 50.6% versus 54.5%). TF-IDF is tried next in an attempt to improve the prediction.
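The comparison being made here is against a majority-class baseline: the accuracy obtained by always predicting the more common label. The function below is a hypothetical helper written for illustration, applied to the first ten daily labels listed earlier.

```python
def majority_baseline_accuracy(labels):
    # Share of 1s (market rose), then the accuracy of always
    # predicting whichever label is more common.
    ups = sum(1 for label in labels if label == 1)
    share_up = ups / float(len(labels))
    return max(share_up, 1.0 - share_up)

# First ten daily rise/fall labels from the table shown earlier.
labels = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

On the full set of 2518 days, 1373 rising days give a baseline of roughly 0.545, which is the figure the cross-validated accuracy should be compared against.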

<h2>TF-IDF</h2>

Transform the counts data to TF-IDF.

In [34]:
transformer = TfidfTransformer()
tdidf = transformer.fit_transform(no_date_copy_filling_df)
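For reference, the weighting can be written out by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfTransformer` (smoothed IDF, L2 row normalization, no sublinear term frequency) and applies them to a small made-up count matrix; `tfidf_smooth_l2` is a hypothetical name.

```python
import numpy as np

def tfidf_smooth_l2(counts):
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Number of documents in which each term appears.
    doc_freq = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    weighted = counts * idf
    # Scale each row to unit Euclidean length (all-zero rows stay zero).
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms

# Made-up counts: 3 documents, 3 terms; the last term occurs everywhere.
counts = [[2, 0, 1], [0, 1, 1], [1, 0, 1]]
weights = tfidf_smooth_l2(counts)
print(weights.round(3))
```

A term that appears in every document gets the minimum IDF of 1, so rarer terms receive relatively more weight within each row.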

<h3>Test for prediction with Random Forest</h3>

In [35]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, tdidf, no_date_copy_change_data["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.46825397, 0.49007937, 0.5       , 0.49701789, 0.48707753])

In [36]:
sum([0.5515873 , 0.4702381 , 0.4781746 , 0.48111332, 0.47316103]) / 5.0

0.49085487

The model performed slightly worse with TF-IDF than with raw counts (about 49.1% versus 50.6%). The next step is to widen the window of analysis.

<h1>Stock Market Predictions for Five Days</h1>

To try to improve the accuracy of the model, the news from five stock market trading days was pooled, and the change in the S&P 500 was calculated with this new window to train the model.

First, read in the count vectorized table for comparison

In [37]:
data = pd.read_csv('20190722NYTimesDataANSII.csv', delimiter=',')

In [38]:
data.head()

Unnamed: 0,Date,'','91,'affluenza,'d,'n,'re,'s,'stand,*an,...,zulu,zuma,zumas,zuni,zuoren,zurich,zurita,zvonko,zvyagintsev,zygier
0,20090720,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,20090721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,20090722,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20090723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,20090724,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, read in the full text of the abstracts from file. There is one row of abstracts for every trading day.

In [39]:
text_data = pd.read_csv('20190722NYTimesFullTextANSII_copy.csv', delimiter=',')

In [40]:
text_data.head()

Unnamed: 0,Date,Text
0,20090720,A broken toilet pump that was contaminating pa...
1,20090721,"Drug makers, advocacy groups, biotechnology fi..."
2,20090722,A federal judge has ordered the Obama administ...
3,20090723,"Two years after the massacre at Virginia Tech,..."
4,20090724,After once spending his Sundays in football st...


In [41]:
text_data['Date'] = text_data['Date'].map(lambda date: parser.parse(str(date)))

In [42]:
text_data.head()

Unnamed: 0,Date,Text
0,2009-07-20,A broken toilet pump that was contaminating pa...
1,2009-07-21,"Drug makers, advocacy groups, biotechnology fi..."
2,2009-07-22,A federal judge has ordered the Obama administ...
3,2009-07-23,"Two years after the massacre at Virginia Tech,..."
4,2009-07-24,After once spending his Sundays in football st...


There are 2518 days of news, the same as the number of days of financial data. Grouping into five-day periods reduces both the text and the financial data to 503 entries (the last three days are left over).

In [43]:
len(text_data)

2518

In [44]:
weekly_text_data = pd.DataFrame()

In [45]:
for i in range(0, 2514, 5):
    week_of_news = ""
    for j in range(5):
        week_of_news += text_data.loc[i + j, "Text"]
    # Integer division keeps the row label an integer in Python 2 and 3
    weekly_text_data.loc[i // 5, "Text"] = week_of_news
    weekly_text_data.loc[i // 5, "Date"] = text_data.loc[i, "Date"]

This shows the news grouped on an approximately weekly basis.

In [46]:
weekly_text_data.head()

Unnamed: 0,Text,Date
0,A broken toilet pump that was contaminating pa...,2009-07-20
1,The president will be in Virginia to talk abou...,2009-07-27
2,Camp David offers more than respite; its bowli...,2009-08-03
3,The Los Angeles County coroners office said it...,2009-08-10
4,On President Obamas daybook: A meeting with hi...,2009-08-17


In [47]:
weekly_text_data.tail()

Unnamed: 0,Text,Date
498,A newly disclosed investigation suggests that ...,2019-06-11
499,Were interested in hearing from students who h...,2019-06-18
500,Some 500 Wayfair employees signed a letter ask...,2019-06-25
501,The chief was convicted of a single charge for...,2019-07-02
502,"Julia Keleher, Puerto Ricos former education s...",2019-07-10
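The five-day pooling can also be expressed as a group-by on the row position divided by the window length. The following is a sketch on a made-up frame of twelve business days; the frame and variable names are hypothetical, and incomplete trailing windows are dropped as in the loop above.

```python
import numpy as np
import pandas as pd

# Made-up daily frame: 12 business days with placeholder text.
daily = pd.DataFrame({
    "Date": pd.date_range("2009-07-20", periods=12, freq="B"),
    "Text": ["day%d " % k for k in range(12)],
})

window = 5
n_full = (len(daily) // window) * window   # ignore leftover days
full = daily.iloc[:n_full]

# Rows 0-4 form block 0, rows 5-9 form block 1, and so on.
block_id = np.arange(n_full) // window
weekly = full.groupby(block_id).agg({"Text": "".join, "Date": "first"})
print(weekly)
```

Each output row carries the concatenated text of its window and the date of the window's first trading day.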


Here is an example of abstracts collected for approximately one week

In [48]:
weekly_text_data.loc[0, "Text"]



Read in the financial data from Yahoo Finance, remove unnecessary columns, and calculate the rise or fall.

In [49]:
financial_data = pd.read_csv('SP500from2009to2019.csv')

In [50]:
len(financial_data)

2518

In [51]:
financial_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,7/20/09,942.070007,951.619995,940.98999,951.130005,951.130005,4853150000
1,7/21/09,951.969971,956.530029,943.219971,954.580017,954.580017,5309300000
2,7/22/09,953.400024,959.830017,947.75,954.070007,954.070007,4634100000
3,7/23/09,954.070007,979.419983,953.27002,976.289978,976.289978,5761650000
4,7/24/09,972.159973,979.789978,965.950012,979.26001,979.26001,4458300000


In [52]:
financial_data['Date'] = financial_data['Date'].map(lambda date: parser.parse(str(date)))

In [53]:
financial_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2009-07-20,942.070007,951.619995,940.98999,951.130005,951.130005,4853150000
1,2009-07-21,951.969971,956.530029,943.219971,954.580017,954.580017,5309300000
2,2009-07-22,953.400024,959.830017,947.75,954.070007,954.070007,4634100000
3,2009-07-23,954.070007,979.419983,953.27002,976.289978,976.289978,5761650000
4,2009-07-24,972.159973,979.789978,965.950012,979.26001,979.26001,4458300000


Remove unneeded columns

In [54]:
cropped_data = financial_data.drop(["High", "Low", "Close", "Volume"], axis=1)

Create a new table for the weekly change of the S&P 500. For each window, the change is the adjusted close at row i + 5 minus the open at row i.

In [55]:
weekly_change = pd.DataFrame()

In [56]:
for i in range(0, 2514, 5):
    # Integer division keeps the row label an integer in Python 2 and 3
    weekly_change.loc[(i + 5) // 5, "Change"] = cropped_data.loc[i + 5, 'Adj Close'] - cropped_data.loc[i, 'Open']
    weekly_change.loc[(i + 5) // 5, "Date"] = cropped_data.loc[i, 'Date']

In [57]:
weekly_change.tail()

Unnamed: 0,Change,Date
499,14.47998,2019-06-11
500,10.669922,2019-06-18
501,27.229981,2019-06-25
502,28.410156,2019-07-02
503,-4.880127,2019-07-10
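The loop above subtracts the open on row i from the adjusted close on row i + 5, which is the first row of the following five-day window. Whether that alignment or the close on row i + 4 (the fifth day of the same window) is intended is not stated, so the sketch below simply computes both on made-up prices for comparison. All names and values here are hypothetical.

```python
import pandas as pd

# Made-up prices: open rises by 1 per day, close is 0.5 above the open.
prices = pd.DataFrame({
    "Open": [100.0 + k for k in range(11)],
    "Adj Close": [100.5 + k for k in range(11)],
})

window = 5
rows = []
# Stop early enough that row start + window still exists.
for start in range(0, len(prices) - window, window):
    first_open = prices.loc[start, "Open"]
    rows.append({
        "start_row": start,
        # Close on the fifth day of the same window (row start + 4).
        "close_of_fifth_day": prices.loc[start + window - 1, "Adj Close"] - first_open,
        # Close on the first day of the next window (row start + 5).
        "close_of_sixth_day": prices.loc[start + window, "Adj Close"] - first_open,
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

The two alignments differ by one trading day of price movement, which can flip the sign of the label for weeks with small changes.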


Convert to a binary increase or decrease change

In [58]:
weekly_change['Change'] = (weekly_change['Change'] > 0)

In [59]:
weekly_change['Change'] = weekly_change['Change'] * 1

In [60]:
weekly_change.head()

Unnamed: 0,Change,Date
1,1,2009-07-20
2,1,2009-07-27
3,1,2009-08-03
4,0,2009-08-10
5,1,2009-08-17


Define "clean_text" for processing the weekly text.

In [61]:
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    final_text = []
    stop_words=set(stopwords.words("english"))
    filtered_words = []
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        #words = tokenizer.tokenize(encode_sentence)
        no_punc_words = [w for w in words if w not in string.punctuation]
        filtered_words = [w.lower() for w in no_punc_words if w not in stop_words]
        final_text += [lemmatizer.lemmatize(word) for word in filtered_words]
    return final_text

Create a table with the counts of each word for each week

In [62]:
week_filling_df = pd.DataFrame()

In [29]:
for i in range(len(weekly_text_data)):
    one_weeks_collection = weekly_text_data.loc[i, "Text"]
    one_weeks_collection = [one_weeks_collection]
    count_vect = CountVectorizer(analyzer=clean_text)
    X_counts = count_vect.fit_transform(one_weeks_collection)
    
    X_counts_df = pd.DataFrame(X_counts.toarray())
    X_counts_df.columns = count_vect.get_feature_names()
    X_counts_df.index = [weekly_text_data.loc[i, "Date"]]
    X_counts_df.index.name = "Date"
    
    week_filling_df = week_filling_df.append(X_counts_df)


Fill in NA with zero

In [30]:
week_filling_df = week_filling_df.fillna(0)
week_filling_df.head()

Unnamed: 0_level_0,'','91,'affluenza,'d,'n,'re,'s,'stand,*an,--,...,zulu,zuma,zumas,zuni,zuoren,zurich,zurita,zvonko,zvyagintsev,zygier
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2009-07-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-07-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-08-03,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-08-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2009-08-17,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
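Fitting a separate vectorizer for each week and appending the resulting frames is one way to build the table. An alternative is to fit a single `CountVectorizer` on all weeks at once, which yields one sparse matrix with a shared vocabulary and no missing values to fill. This sketch uses three made-up sentences and the default analyzer; a custom analyzer such as `clean_text` could be passed instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for three weeks of pooled abstracts.
weeks = [
    "markets rose after the vote",
    "the vote was delayed",
    "markets fell",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(weeks)   # sparse matrix, one row per week

# Column names in column order, recovered from the fitted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(counts.shape)
print(vocabulary)
```

The sparse result can be passed directly to the scikit-learn estimators used in this notebook without converting it to a dense DataFrame.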


Reset index and drop date for training of model

In [31]:
reset_week_filling_df = week_filling_df.reset_index()
no_date_week_filling_df = reset_week_filling_df.drop(["Date"], axis=1)

In [32]:
no_date_week_filling_df.head()

Unnamed: 0,'','91,'affluenza,'d,'n,'re,'s,'stand,*an,--,...,zulu,zuma,zumas,zuni,zuoren,zurich,zurita,zvonko,zvyagintsev,zygier
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Drop date from weekly market change table for processing model

In [33]:
no_date_weekly_change = weekly_change.drop(["Date"], axis=1)

In [34]:
no_date_weekly_change.head()

Unnamed: 0,Change
1,1
2,1
3,1
4,0
5,1


Convert the labels to float, as that seemed necessary for the model.

In [51]:
type(no_date_weekly_change.loc[1, "Change"])

numpy.int64

In [60]:
float_no_date_weekly_change = no_date_weekly_change.copy()

In [61]:
float_no_date_weekly_change["Change"] = float_no_date_weekly_change["Change"].astype(float)

In [63]:
type(float_no_date_weekly_change.loc[1, "Change"])

numpy.float64

In [65]:
float_no_date_weekly_change.head(5)

Unnamed: 0,Change
1,1.0
2,1.0
3,1.0
4,0.0
5,1.0


<h1>Three machine learning methods are tested here</h1>

<h3>Naive Bayes</h3>

In [77]:
naive = naive_bayes.MultinomialNB()
k_fold = KFold(n_splits=5)
cross_val_score(naive, no_date_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.62376238, 0.61386139, 0.56435644, 0.65      , 0.59      ])

In [38]:
np.average([0.62376238, 0.61386139, 0.56435644, 0.65, 0.59])

0.608396042

<h3>SVM</h3>

In [41]:
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
k_fold = KFold(n_splits=5)
cross_val_score(SVM, no_date_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.46534653, 0.62376238, 0.53465347, 0.4       , 0.52      ])

In [43]:
np.average([0.46534653, 0.62376238, 0.53465347, 0.4, 0.52])

0.508752476

<h3>Grid Search with SVM</h3>

In [66]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1.0, 10.0]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(no_date_week_filling_df, float_no_date_weekly_change.loc[1:,"Change"])

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [1.0, 10.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

<h4>Insight</h4>
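One of the grid search configurations (RBF kernel, C = 1.0) yielded a mean accuracy of 0.60835, essentially identical to the actual fraction of "weeks" in which the market increased (see below). Together with the very small standard deviation, this may indicate that the classifier predicts an increase for nearly every week.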

In [74]:
grid_scores = clf.grid_scores_
grid_scores



[mean: 0.51690, std: 0.05330, params: {'kernel': 'linear', 'C': 1.0},
 mean: 0.60835, std: 0.00235, params: {'kernel': 'rbf', 'C': 1.0},
 mean: 0.51690, std: 0.05330, params: {'kernel': 'linear', 'C': 10.0},
 mean: 0.59642, std: 0.01011, params: {'kernel': 'rbf', 'C': 10.0}]
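The `grid_scores_` attribute is only available in older scikit-learn releases; later versions expose the same information through `cv_results_`. The sketch below runs the same parameter grid on small random data (made up for illustration, so the scores are meaningless) and prints one line per configuration.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Random toy data, only to make the example runnable.
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.array([0, 1] * 20)

parameters = {"kernel": ("linear", "rbf"), "C": [1.0, 10.0]}
search = GridSearchCV(svm.SVC(), parameters, cv=5)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter setting.
results = search.cv_results_
for mean, std, params in zip(results["mean_test_score"],
                             results["std_test_score"],
                             results["params"]):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, params))
```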

<h3>Random Forest Classifier</h3>

In [89]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, no_date_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.65346535, 0.61386139, 0.56435644, 0.61      , 0.58      ])

In [92]:
np.average([0.65346535, 0.61386139, 0.56435644, 0.61, 0.58])

0.604336636

<h3>Fraction of Weeks with a Market Increase for Comparison</h3>

In [90]:
week_increase = 0
for i in range(len(weekly_change)):
    if weekly_change.loc[i + 1, "Change"]  == 1:
        week_increase += 1

In [91]:
week_increase / float(len(weekly_change))

0.6083499005964215

<h2>Summary with Weekly Change</h2>

<table>
    <tr>
        <td> Actual Increase Percentage </td>
        <td> Random Forest Classifier</td>
        <td> SVM with Grid Search</td>
        <td> Naive Bayes</td>
    </tr>
    <tr>
        <td> 60.83% </td>
        <td> 60.43% </td>
        <td> 60.84% </td>
        <td> 60.84% </td>
    </tr>
</table>

<h1>Using Bigrams</h1>

Vectorize the weekly news with an ngram range of (1, 2). Because of the long processing time, the loop index is printed every 10 iterations to track progress.

In [82]:
bigram_week_filling_df = pd.DataFrame()

In [83]:
for i in range(len(weekly_text_data)):
    one_weeks_collection = weekly_text_data.loc[i, "Text"]
    one_weeks_collection = [one_weeks_collection]
    count_vect = CountVectorizer(ngram_range=(1, 2), analyzer='word', stop_words='english')
    X_counts = count_vect.fit_transform(one_weeks_collection)
    X_counts_df = pd.DataFrame(X_counts.toarray())
    X_counts_df.columns = count_vect.get_feature_names()
    X_counts_df.index = [weekly_text_data.loc[i, "Date"]]
    X_counts_df.index.name = "Date"
    if i % 10 == 0:
        print(i)  # keep track of progress
    
    bigram_week_filling_df = bigram_week_filling_df.append(X_counts_df)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500


In [84]:
bigram_week_filling_df.head()

Unnamed: 0_level_0,000,000 000,000 10,000 13,000 170,000 1948,000 1970,000 2004,000 2007,000 2010,...,zurich offices,zurich police,zurita,zurita working,zvonko,zvonko veselinovic,zvyagintsev,zvyagintsev loveless,zygier,zygier unintentionally
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2009-07-20,,,,,,,,,,,...,,,,,,,,,,
2009-07-27,2.0,,,,,,,,,,...,,,,,,,,,,
2009-08-03,2.0,,,,,,,,,,...,,,,,,,,,,
2009-08-10,3.0,,,,,,,,,,...,,,,,,,,,,
2009-08-17,1.0,,,,,,,,,,...,,,,,,,,,,


In [85]:
bigram_week_filling_df.to_csv('20190806NYTimeBigramTable.csv')

Loading the CSV file with the line below did not complete in approximately a day, which limited the testing of additional models.

In [None]:
bigram_week_filling_df = pd.read_csv('20190806NYTimeBigramTable.csv')
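A table with several hundred rows and more than half a million mostly-zero columns is expensive to write and parse as dense CSV. One possible alternative is to keep the counts in a SciPy sparse matrix and store them in the `.npz` format. The sketch below uses a tiny made-up matrix and a temporary file; the file name is hypothetical, and the column names would need to be saved separately.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny made-up count matrix standing in for the bigram table.
dense = np.array([[0.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
matrix = sparse.csr_matrix(dense)   # stores only the non-zero entries

path = os.path.join(tempfile.mkdtemp(), "bigram_counts.npz")
sparse.save_npz(path, matrix)
restored = sparse.load_npz(path)

print(restored.shape, restored.nnz)
```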

Fill in NaN with zeros

In [86]:
bigram_nonan_week_filling_df = bigram_week_filling_df.fillna(0)

In [87]:
bigram_nonan_week_filling_df.head(5)

Unnamed: 0_level_0,000,000 000,000 10,000 13,000 170,000 1948,000 1970,000 2004,000 2007,000 2010,...,zurich offices,zurich police,zurita,zurita working,zvonko,zvonko veselinovic,zvyagintsev,zvyagintsev loveless,zygier,zygier unintentionally
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2009-07-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-07-27,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-08-03,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-08-10,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2009-08-17,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The accuracy of the Random Forest Classifier on the combined unigram and bigram data did not improve over the unigram data alone (about 59.6% versus 60.4%).

In [88]:
rf = RandomForestClassifier(n_jobs=-1)
k_fold = KFold(n_splits=5)
cross_val_score(rf, bigram_nonan_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

array([0.58415842, 0.55445545, 0.62376238, 0.67      , 0.55      ])

In [89]:
np.average([0.58415842, 0.55445545, 0.62376238, 0.67, 0.55])

0.5964752499999999

<b>Note:</b> The code in the two cells below ran for up to a day before I terminated it because of the extremely long processing time.

In [None]:
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
k_fold = KFold(n_splits=5)
cross_val_score(SVM, bigram_nonan_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)

In [None]:
naive = naive_bayes.MultinomialNB()
k_fold = KFold(n_splits=5)
cross_val_score(naive, bigram_nonan_week_filling_df, no_date_weekly_change["Change"], cv=k_fold, scoring="accuracy", n_jobs=-1)