## Introduction

Stock prices and the overall health of the market are notoriously fickle: they react to shifts in world politics, economic downturns, and even the results of sports championships. For example, stocks fluctuated wildly after the 2016 presidential election, because the unexpected result made investors anxious.

![Viz](http://www.moneychoice.org/wp-content/uploads/2014/11/stocks.jpg)

We wanted to explore the relationship between stock prices (for the handful of stocks listed below) and the news published at the same time. Mainly, we want to see whether news is a good predictor of whether a stock price will go up or down. We chose this topic because the influence of news and media has been under the magnifying glass for a long time, and we wanted to use key concepts from our data science class to quantitatively assess whether the news has as much of an effect on the financial markets as people say it does.


## Outline of Report - APIs Used

We will focus on the stock prices of leading companies in different sectors in order to get good variety (our methodology could just as easily be applied to entire market indices, or to the stock markets of entire nations). For news, we focus on four news sources: three financial and one general.

The stocks we will be focusing on in this report are:

- Google (GOOG)
- Exxon-Mobil (XOM)
- Citigroup (C)

As you can see, these stocks come from different sectors of the economy (tech, energy, and finance). We used **Google Finance** to get the stock data.

The news sources whose tweets we will be pulling from are:

- Wall Street Journal (@WSJ)
- Bloomberg (@business)
- Financial Times (@FinancialTimes)
- New York Times (@nytimes)

We chose a mix of financial and general news sources in order to get news of different varieties. To get data from these sources, we will be using the **Twitter API**.


## Outline of Report - Methodology

Finally, how do these two areas come together? To measure the correlation and (hopefully) make a prediction about stock prices, we need to do the following:

1. Gather data and clean it up
2. Visualize!
3. Classify tweets as either having a positive effect or negative effect
4. Use a learning method to predict stock prices

Many of these steps use methods we learned in class. As mentioned before, we use Google Finance and the Twitter API to get the financial and news data.

Then, we will use matplotlib to see whether there is any high-level correlation between the stocks we selected and the news about those stocks that we received through tweets in a given time frame.

Finally, we will classify tweets as either "good news" or "bad news" using NLTK's text classification capabilities. We will then compare different learning methods, such as support vector machines, random forest classifiers, and neural networks, to predict whether a stock price will increase or decrease based on the given news. We will likely use data from the last few months for training and data from this month for testing.
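The train/test plan described above amounts to a split by date rather than a random split. The following is only a minimal sketch of that idea in plain Python; the function name `chronological_split`, the cutoff date, and the sample records are made up for illustration and are not part of our pipeline.

```python
from datetime import date

def chronological_split(records, cutoff):
    """Split (day, label) records into those before the cutoff and those on or after it."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

# Made-up (date, label) pairs standing in for labelled news days.
records = [
    (date(2016, 9, 15), "up"),
    (date(2016, 10, 3), "down"),
    (date(2016, 11, 9), "down"),
    (date(2016, 11, 21), "up"),
]

# Everything before November is used for training, the rest for testing.
train, test = chronological_split(records, cutoff=date(2016, 11, 1))
print(len(train), len(test))
```

Splitting by date keeps the test data strictly later than the training data, which matches how a prediction would be used in practice.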

Let's get started with the first part of the process, which is gathering the data.

In [None]:
import twitter #allows us to use the twitter api
import json
import requests
import pandas as pd
import os

import matplotlib
matplotlib.use("svg")
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("ggplot")

## Part 1: Gathering of Data - Tweets

We will use the Twitter API to get the news from the aforementioned sources. We'll start by initializing the API.

In [None]:
#API initialization (you need a consumer_key and secret - to read more about that look at the Twitter API documentation)
with open("secret.json", "rt") as fp:
    params = json.load(fp)

api = twitter.Api(consumer_key=params["consumer_key"],
                  consumer_secret=params["consumer_secret"],
                  access_token_key=params["access_token"],
                  access_token_secret=params["access_token_secret"])


with open("wsj.json", "wt") as dj:
    dj.write("[]")
with open("business.json", "wt") as dj:
    dj.write("[]")
with open("financialtimes.json", "wt") as dj:
    dj.write("[]")
with open("nytimes.json", "wt") as dj:
    dj.write("[]")

The first lines of code initialize the API and allow you to make calls. The second part creates the buffer files, each holding an empty JSON list, into which we dump the data from each news source. Note: you can get up to 200 tweets with a single call to the API.
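To see how a buffer file that starts as an empty JSON list can grow batch by batch, here is a small offline sketch. It assumes the same trick of removing the closing bracket before appending; the helper name `append_batch`, the temporary file, and the tiny records are hypothetical and only illustrate the mechanism.

```python
import json
import os
import tempfile

def append_batch(path, batch):
    """Append one list of records to a JSON array stored on disk."""
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)      # position on the closing "]"
        f.truncate()                 # drop it so the array is open again
        needs_comma = f.tell() > 1   # more than just "[" means a batch is already there
    with open(path, "a") as f:
        if needs_comma:
            f.write(",")
        json.dump(batch, f)
        f.write("]")                 # close the outer array again

# Start from an empty JSON list, as the buffer files do.
path = os.path.join(tempfile.mkdtemp(), "buffer.json")
with open(path, "w") as f:
    f.write("[]")

append_batch(path, [{"id": 3}, {"id": 2}])
append_batch(path, [{"id": 1}])

with open(path) as f:
    loaded = json.load(f)
print(loaded)
```

Each appended batch becomes one inner list, so the file always stays valid JSON between writes.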

Now, we will get tweets for each Twitter handle. Because of the 200-tweet limit per call, we need to call the API repeatedly in a loop, each time asking for tweets older than the ones we already have.
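The paging idea can be tried out without network access. The sketch below replaces the real API with a hypothetical `fake_timeline` function that returns ids newest first; `fetch_pages` and the small numbers are also made up for illustration. The only point is how `max_id` moves backwards through the timeline.

```python
def fake_timeline(all_ids, count, max_id=None):
    """Hypothetical stand-in for a timeline call: newest first, only ids <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]

def fetch_pages(all_ids, pages, count):
    """Collect up to pages * count ids by repeatedly lowering max_id."""
    collected = []
    max_id = None
    for _ in range(pages):
        batch = fake_timeline(all_ids, count, max_id)
        if not batch:            # nothing older is left
            break
        collected.extend(batch)
        max_id = batch[-1] - 1   # next page: strictly older than the oldest id seen
    return collected

all_ids = list(range(10, 0, -1))   # ids 10 down to 1, newest first
tweets = fetch_pages(all_ids, pages=3, count=3)
print(tweets)
```

Subtracting one from the oldest id avoids fetching the same tweet twice on consecutive pages.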

In [None]:
def del_last_char(filename):
    # Remove the final byte (the closing "]") so more JSON can be appended
    with open(filename, 'rb+') as f:
        f.seek(-1, os.SEEK_END)
        f.truncate()

def get_lots_of_tweets(handle, jsonfile, mxid=None):
    # First page: up to 200 of the most recent tweets
    statuses = api.GetUserTimeline(screen_name=handle, count=200, max_id=mxid)
    del_last_char(jsonfile)
    with open(jsonfile, "a+") as dj:
        json.dump([json.loads(str(s)) for s in statuses], dj)
        dj.write("]")
    # Four more pages, each strictly older than the oldest tweet seen so far
    for _ in range(4):
        if not statuses:  # stop when the timeline has no older tweets
            break
        mxid = statuses[-1].id - 1
        statuses = api.GetUserTimeline(screen_name=handle, count=200, max_id=mxid)
        del_last_char(jsonfile)
        with open(jsonfile, "a+") as dj:
            dj.write(",")
            json.dump([json.loads(str(s)) for s in statuses], dj)
            dj.write("]")

get_lots_of_tweets("@WSJ",             "wsj.json"           )
get_lots_of_tweets("@business",        "business.json"      )
get_lots_of_tweets("@FinancialTimes",  "financialtimes.json")
get_lots_of_tweets("@nytimes",         "nytimes.json"       )
            
# BEWARE: creates a json which is a list of lists of tweets

Now that the cache files hold up to 1,000 of the most recent tweets for each handle (five calls of up to 200 tweets each), we can load them back and flatten each list of lists into a single list of tweets.
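The nested structure is easy to see on a toy example. As a hedged alternative to explicit loops, the standard library's `itertools.chain.from_iterable` flattens one level of nesting; the `pages` data below is made up.

```python
from itertools import chain

# Made-up stand-in for one cache file: one inner list per API call.
pages = [[{"id": 3}, {"id": 2}], [{"id": 1}]]

# Flatten exactly one level: a list of lists becomes a single list.
flat = list(chain.from_iterable(pages))
print(flat)
```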

In [None]:
with open("wsj.json", "rt") as dj:
    wt = json.load(dj)
with open("business.json", "rt") as dj:
    bt = json.load(dj)
with open("financialtimes.json", "rt") as dj:
    ft = json.load(dj)
with open("nytimes.json", "rt") as dj:
    nt = json.load(dj)
    
wsj_tweets = []
for lst in wt:
    wsj_tweets.extend(lst)
business_tweets = []
for lst in bt:
    business_tweets.extend(lst)
financialtimes_tweets = []
for lst in ft:
    financialtimes_tweets.extend(lst)
nytimes_tweets = []
for lst in nt:
    nytimes_tweets.extend(lst)

Finally, we will put the tweets into dataframes, so that we can very easily perform data analysis on them later.

In [None]:
wsj = pd.DataFrame(wsj_tweets)
wsj["created_at"] = pd.to_datetime(wsj["created_at"]) #convert to datetime for easier time series analysis
business = pd.DataFrame(business_tweets)
business["created_at"] = pd.to_datetime(business["created_at"])
financialtimes = pd.DataFrame(financialtimes_tweets)
financialtimes["created_at"] = pd.to_datetime(financialtimes["created_at"])
nytimes = pd.DataFrame(nytimes_tweets)
nytimes["created_at"] = pd.to_datetime(nytimes["created_at"])

Here is a partial view of the wsj dataframe that was created (we will be doing more with this later).

![Viz](https://s22.postimg.org/e0sxs3dfl/image.png)
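For reference, the `created_at` strings that come back from Twitter look like `Wed Nov 09 07:30:00 +0000 2016`. The sketch below parses one such string with the standard library so the conversion is explicit. The sample timestamp is made up, and the format string is our reading of that layout rather than something taken from the Twitter documentation.

```python
from datetime import datetime

# Assumed layout: weekday, month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(text):
    """Parse a Twitter-style timestamp into a timezone-aware datetime."""
    return datetime.strptime(text, TWITTER_FORMAT)

stamp = parse_created_at("Wed Nov 09 07:30:00 +0000 2016")
print(stamp.date(), stamp.utcoffset())
```

Keeping the timezone matters later, because tweets are stamped in UTC while trading days follow local market hours.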


## Part 1: Gathering of Data - Stocks

Next, we need to gather data for the stocks mentioned above. Google Finance requires you to export the data manually, as described at https://support.google.com/finance/answer/71913?hl=en. We did that to obtain three CSVs and made a dataframe from each.

In [None]:
goog = pd.read_csv("goog.csv")
goog.columns = ["Date", "Open", "High", "Low", "Close", "Volume"]
goog["Date"] = pd.to_datetime(goog["Date"])

xom = pd.read_csv("xom.csv")
xom.columns = ["Date", "Open", "High", "Low", "Close", "Volume"]
xom["Date"] = pd.to_datetime(xom["Date"])

c = pd.read_csv("c.csv")
c.columns = ["Date", "Open", "High", "Low", "Close", "Volume"]
c["Date"] = pd.to_datetime(c["Date"])

In [None]:
print(goog.head())
print(xom.head())
print(c.head())

![Viz](https://s12.postimg.org/q93xbhmvx/image.png)



Running this cell shows that we have successfully created a dataframe for each of the stocks that we analyze in this report. Each one contains the relevant data: date, prices, and volume. Now, it's time to do some visualizations with the data.
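The three loading cells above repeat the same steps, so they could be folded into one helper. This is only a sketch: `load_stock` is a hypothetical name, the in-memory sample stands in for a real export, and sorting by date is an extra step we add here on the assumption that an export may not be in chronological order.

```python
import io
import pandas as pd

COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume"]

def load_stock(source):
    """Read a price CSV (path or file-like object) into a date-sorted dataframe."""
    frame = pd.read_csv(source)
    frame.columns = COLUMNS
    frame["Date"] = pd.to_datetime(frame["Date"])
    # Extra step (assumption): put rows in chronological order.
    return frame.sort_values("Date").reset_index(drop=True)

# Made-up two-row sample in place of goog.csv, xom.csv, or c.csv.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume\n"
    "2016-11-02,10.0,11.0,9.5,10.5,1000\n"
    "2016-11-01,9.0,10.2,8.8,10.0,1200\n"
)
prices = load_stock(sample)
print(prices)
```

With a helper like this, each stock would need a single call such as `load_stock("goog.csv")`.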

## Part 2: Visualization 

As mentioned earlier, we will be using matplotlib to visualize the stock prices, in the hope of correlating dips and rises in price with the relevant news items extracted above (the tweet data comes with dates, which is key to determining correlation). The following code creates a graph of historical data for each stock from around May to November 2016.

In [None]:
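plt.plot(goog["Date"], goog["Close"], label="GOOG")
plt.plot(xom["Date"], xom["Close"], label="XOM")
plt.plot(c["Date"], c["Close"], label="C")
plt.xlabel("Date")
plt.ylabel("Closing price")
plt.legend()  # tell the three lines apart
plt.show()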

![Viz](https://s17.postimg.org/x2v6dn6gf/untitled.png)
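To make "dips and rises" concrete, one option is to compute the day-over-day percent change in the closing price and look at the largest moves. The sketch below uses plain Python and made-up prices; `daily_changes` is a hypothetical helper, not something used elsewhere in this report.

```python
def daily_changes(closes):
    """Return (day, percent change from the previous close) for each day after the first."""
    changes = []
    for (_, prev), (day, curr) in zip(closes, closes[1:]):
        changes.append((day, 100.0 * (curr - prev) / prev))
    return changes

# Made-up closing prices, oldest first.
closes = [
    ("2016-11-07", 100.0),
    ("2016-11-08", 102.0),
    ("2016-11-09", 96.9),
    ("2016-11-10", 98.0),
]

changes = daily_changes(closes)
# The day with the largest move in either direction.
biggest = max(changes, key=lambda item: abs(item[1]))
print(biggest[0], round(biggest[1], 2))
```

Days that stand out this way are natural candidates to check against the tweets from the same date.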

This shows the closing price of each stock over time. The next step is to look through the Twitter data manually for critical news about the companies listed above (something like the Google Pixel release announcement, or a critical scientific discovery related to oil) that could be classified as a turning point for the stock.
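A simple keyword filter can narrow down which tweets to read by hand. This is a rough sketch under the assumption that each tweet is a dictionary with a `text` field; the function name `tweets_mentioning`, the keywords, and the sample tweets are all made up.

```python
def tweets_mentioning(tweets, keywords):
    """Return the tweets whose text contains any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [t for t in tweets if any(k in t["text"].lower() for k in lowered)]

# Made-up tweets for illustration only.
sample_tweets = [
    {"id": 1, "text": "Google unveils its Pixel phone"},
    {"id": 2, "text": "Oil prices slide as supply grows"},
    {"id": 3, "text": "Citigroup reports quarterly earnings"},
]

google_news = tweets_mentioning(sample_tweets, ["google", "pixel", "alphabet"])
oil_news = tweets_mentioning(sample_tweets, ["exxon", "oil"])
print([t["id"] for t in google_news], [t["id"] for t in oil_news])
```

Plain substring matching is crude: short keywords match inside unrelated words (a ticker like "C" would match almost everything), so the results still need a manual look.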
