# Predicting the Trump Election: An Introduction to TensorFlow
This notebook introduces TensorFlow by predicting the Trump victory in the 2016 
presidential election from stock market and third-party data. In this guide you will use 
[tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras), a library in TensorFlow that lets you 
quickly build and train a fully connected neural network. 
Our input vectors will be a form of time-series data.

The following steps are performed:

1. Files
2. Preprocess timeseries features
3. Play/Visualize the data
4. Model: Predicting Trump Election
5. Model: Predicting the market returns

### Overview 
The goal is to predict the winner of the 2016 presidential election using data that was publicly available at the time. Instead of predicting which person wins, we phrase the problem as a binary classification task: predicting the political party that wins the election (Republican or Democratic). This gives us more data to sample from, which should improve model performance. After that, we will use similar features to train another neural network that predicts the market return after the election date. 

Because we have only a small amount of data, overfitting and bias are major problems. In practice you will have much more data, and using a GPU to accelerate training will be beneficial. A common GPU is the [Tesla K80 GPU](https://www.nvidia.com/en-us/data-center/tesla-k80/), an old but powerful and expensive GPU. 

You can use Google Colab if you want more power at your fingertips; for brevity, this is not covered here.

In [1]:
# Imports
import itertools
import pandas as pd
import tensorflow as tf
import datetime
import os

### 1. Files
Here you will find two files: data.csv and djw.csv.

* **data.csv** contains historical data about presidential elections dating back to 1900.
* **djw.csv** contains historical daily closing prices for the [Dow Jones Industrial Average](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average). 
Although this is an index, it closely approximates the DIA ETF and other ETFs that track "the market". It was chosen because of its large dataset size. 

### 2. Preprocess Data
In this section we preprocess the data. We have to be careful not to introduce forward-looking bias. Therefore, we use the ```truncate``` method to get the nearest trading day on or before a given date. The helper function below shows how to get the percentage return of the market given a dataframe of prices. 

In [2]:
def compute_td_pct(djw, index, days):
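    """Compute the market's fractional change around a given day.

    Args:
        djw (pandas.DataFrame): prices indexed by date, with a "Closing Value" column
        index (datetime): reference day (the election date)
        days (int): window in calendar days; positive looks forward, negative looks back
    Returns:
        (float, int): fractional change, and direction (1 if positive, 0 otherwise)
    """
    pct = None
    # Last close on or before `index`, so no later price leaks into the baseline
    ntd = djw.truncate(after=index).iloc[-1]["Closing Value"]
    if days > 0:
        # As written, the numerator spans one day after `index`; only the
        # denominator uses the close at `index + days`
        pct = (djw[index:index + datetime.timedelta(days=1)].iloc[-1]["Closing Value"] - ntd) / \
              djw[index:index + datetime.timedelta(days=days)].iloc[-1]["Closing Value"]
    else:
        # Backward window: change from the first close in the window up to `ntd`
        pct = (ntd - djw[index + datetime.timedelta(days=days):index].iloc[0]["Closing Value"]) / ntd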
    if pct > 0.0:
        return pct, 1
    else:
        return pct, 0
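
To make the look-ahead concern concrete, here is a minimal sketch on made-up prices. `window_return` is a hypothetical helper, not part of the notebook, and the toy dates and values are assumptions chosen for illustration. It uses ```truncate``` to find the last close on or before a date and measures every window against the close at the start of that window.

```python
import datetime

import pandas as pd


def window_return(prices, day, days):
    """Hypothetical helper: fractional change over a window of `days` calendar days.

    Both directions divide by the close at the start of the window.
    """
    # Last close on or before `day`; nothing after `day` is visible here
    base = prices.truncate(after=day).iloc[-1]["Closing Value"]
    # Last close on or before the other end of the window
    other_day = day + datetime.timedelta(days=days)
    other = prices.truncate(after=other_day).iloc[-1]["Closing Value"]
    if days > 0:
        return (other - base) / base  # forward: from `day` to `day + days`
    return (base - other) / other  # backward: from `day - |days|` to `day`


# Toy prices (made up); the market is closed on 2016-11-08 in this example
toy = pd.DataFrame(
    {"Closing Value": [100.0, 110.0, 121.0, 99.0]},
    index=pd.to_datetime(["2016-11-01", "2016-11-07", "2016-11-09", "2016-11-15"]),
)
election_day = pd.Timestamp("2016-11-08")

last_close = toy.truncate(after=election_day).iloc[-1]["Closing Value"]
forward_1 = window_return(toy, election_day, 1)
backward_7 = window_return(toy, election_day, -7)
```

Because there is no row for the election day itself, `last_close` falls back to the previous trading day's close rather than reaching forward to the next one.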


We need to convert the date columns to datetimes for easier processing. Pandas has great built-in tools for quick data parsing, including a helper function called ```pd.to_datetime()``` that infers the format and converts the values for you.

In [3]:
djw = pd.read_csv("djw.csv")  # Dow Jones Industrial Average Prices by Day
djw = djw.set_index(pd.to_datetime(djw["Date"]))  # Set the Datetime as index
data = pd.read_csv("data.csv")  # Read in 3rd party handlabeled data
data = data.set_index(pd.to_datetime(data["date_elected"]))  # Set the datetime as the index
data = data[1:]  # We remove the first index to make sure we have enough data to look backwards
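
As a small illustration of the datetime index, the sketch below parses an in-memory CSV instead of the real files. The rows are made up, and the ISO date format is an assumption about how dates are stored. It also shows `parse_dates` with `index_col` in `pd.read_csv`, which should give the same index in one step.

```python
import io

import pandas as pd

# Made-up rows; only the column names mirror djw.csv
csv_text = "Date,Closing Value\n2016-11-07,100.0\n2016-11-08,101.5\n"

# Same two-step pattern as the cell above: read, then index by parsed dates
toy = pd.read_csv(io.StringIO(csv_text))
toy = toy.set_index(pd.to_datetime(toy["Date"]))

# Alternative: let read_csv parse the column and use it as the index
alt = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# With a DatetimeIndex, date-based selection works directly
up_to_monday = toy.truncate(after=pd.Timestamp("2016-11-07"))
```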

Next, build the features to sample. We believe that the market, or some combination of market features, may predict the election. That is, smart money might know where the election is heading and invest accordingly. 

In [4]:
# This could have been done in a list of lists but was made explicit for demonstration purposes
day_before_1 = []  # 1 day before the election
day_before_7 = []  # 7 days before the election
day_before_30 = []  # 30 days before the election
day_before_60 = []  # 60 days before the election
day_before_180 = []  # 180 days before the election
day_before_365 = []  # 365 days before the election
day_before_730 = []  # 730 days before the election
day_after_1 = []  # 1 day after the election
day_after_7 = []  # 7 days after the election
day_after_30 = []  # 30 days after the election
day_after_60 = []  # 60 days after the election
day_after_180 = []  # 180 days after the election
day_after_365 = []  # 365 days after the election
for index, row in data.iterrows():
    day_after_1.append(
        compute_td_pct(djw, index, 1)[1])  # Note here we are just getting the direction instead of the market change
    day_after_7.append(compute_td_pct(djw, index, 7)[0])
    day_after_30.append(compute_td_pct(djw, index, 30)[0])
    day_after_60.append(compute_td_pct(djw, index, 60)[0])
    day_after_180.append(compute_td_pct(djw, index, 180)[0])
    day_after_365.append(compute_td_pct(djw, index, 365)[0])
    day_before_1.append(compute_td_pct(djw, index, -1)[0])
    day_before_7.append(compute_td_pct(djw, index, -7)[0])
    day_before_30.append(compute_td_pct(djw, index, -30)[0])
    day_before_60.append(compute_td_pct(djw, index, -60)[0])
    day_before_180.append(compute_td_pct(djw, index, -180)[0])
    day_before_365.append(compute_td_pct(djw, index, -365)[0])
    day_before_730.append(compute_td_pct(djw, index, -730)[0])

# Finally construct a DataFrame containing all of the data and add column labels and concat
# the market data to the third party data
market_data_cols = [day_before_1, day_before_7, day_before_30, day_before_60, day_before_180, day_before_365,
                    day_before_730, day_after_1, day_after_7, day_after_30, day_after_60, day_after_180, day_after_365]
market_data_col_names = ["day_before_1", "day_before_7", "day_before_30", "day_before_60", "day_before_180",
                         "day_before_365", "day_before_730", "day_after_1", "day_after_7", "day_after_30",
                         "day_after_60", "day_after_180", "day_after_365"]
market_data = pd.DataFrame(market_data_cols).transpose()
market_data.columns = market_data_col_names
market_data = market_data.set_index(data.index)  # this operation is not inplace, use existing dataframe's index
frames = [data, market_data]  # Pandas has some quirks unlike sql when concatenating
combined_df = pd.concat(frames, axis=1)  # Axis 0 is after, 1 is next-to
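
The explicit lists above can also be generated in a loop. The following is a rough sketch of that idea, not the notebook's code: `build_market_features` and `fake_compute` are hypothetical names, and `fake_compute` is a stand-in for `compute_td_pct` so the example runs on its own. It also ignores the special case where `day_after_1` stores the direction instead of the percentage.

```python
import pandas as pd

BEFORE = [1, 7, 30, 60, 180, 365, 730]  # same horizons as the lists above
AFTER = [1, 7, 30, 60, 180, 365]


def build_market_features(dates, compute):
    """Return one row per date with day_before_n / day_after_n columns.

    `compute(day, days)` is any callable returning a number for a signed window.
    """
    rows = []
    for day in dates:
        row = {f"day_before_{n}": compute(day, -n) for n in BEFORE}
        row.update({f"day_after_{n}": compute(day, n) for n in AFTER})
        rows.append(row)
    return pd.DataFrame(rows, index=dates)


def fake_compute(day, days):
    # Stand-in so the sketch is self-contained; these are not real returns
    return days / 1000.0


dates = pd.to_datetime(["2012-11-06", "2016-11-08"])
features = build_market_features(dates, fake_compute)
```

Building each row as a dictionary keeps the column names and values together, so there is no separate list of names to keep in sync.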


### 3. Play/Visualize Data
Now that we have preprocessed the data, take a look at it and get a feel for how it is structured. Note that there is not much data, as reliable stock data from the early 1900s is hard to find. 

Examine the features to get a sense of what they mean. 
* Party - 1 if Republican, 0 if Democratic
* Previously Held Office - 1 if true
* Previous Party - the party that was previously in power (goes back 2 terms), 1 if Republican, 0 if Democratic
* Was VP or VP Runner - 1 if held the position of VP before the current election
* day_before_n - percentage or direction of the market for a given number of days before the current election cycle but not including the day
* day_after_n - percentage or direction of the market for a given number of days after the current election cycle

When I actually made the prediction, I had much more data than the above. I used [Google Trends](https://trends.google.com) to add more feature data. I also added a "sentiment analysis" feature by looking through social media and other documents to get a feel for the expected outcome. I strongly recommend that you include more features and more data than the 20+ elements we have here. The more data the better, and high-quality data is important. 

In [5]:
combined_df.head() # gives the top 5, can use tail to give the last 5

| date_elected (index) | date_elected | party | prev_held_office | previous_party_1 | previous_party_2 | was_vp_or_vp_runner | day_before_1 | day_before_7 | day_before_30 | day_before_60 | day_before_180 | day_before_365 | day_before_730 | day_after_1 | day_after_7 | day_after_30 | day_after_60 | day_after_180 | day_after_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1904-11-08 | 1904-11-08 | 1 | 0 | 1 | 1 | 1 | 0.0 | 0.037526 | 0.112577 | 0.145979 | 0.276082 | 0.363299 | 0.058144 | 1.0 | 0.012433 | 0.012648 | 0.012421 | 0.01129 | 0.010467 |
| 1908-11-03 | 1908-11-03 | 1 | 0 | 1 | 1 | 0 | 0.0 | -0.007904 | 0.028157 | -0.007904 | 0.144574 | 0.294583 | -0.141116 | 1.0 | 0.022454 | 0.022702 | 0.022785 | 0.022257 | 0.019613 |
| 1912-11-05 | 1912-11-05 | 0 | 0 | 1 | 1 | 0 | 0.0 | -0.000756 | -0.040369 | -0.009979 | 0.020714 | 0.125038 | 0.050801 | 1.0 | 0.018333 | 0.018795 | 0.018821 | 0.020826 | 0.021146 |
| 1916-11-07 | 1916-11-07 | 0 | 1 | 0 | 1 | 0 | 0.0 | 0.024251 | 0.065106 | 0.11734 | 0.15978 | 0.134409 | 0.490533 | 0.0 | -0.003652 | -0.00357 | -0.003952 | -0.004201 | -0.005312 |
| 1920-11-02 | 1920-11-02 | 1 | 0 | 0 | 0 | 0 | 0.0 | -0.001521 | 0.002691 | -0.030066 | -0.101661 | -0.399392 | -0.003042 | 0.0 | -0.00613 | -0.006339 | -0.00681 | -0.006236 | -0.006665 |


In [6]:
combined_df.describe() # statistics about the dataframe

| | party | prev_held_office | previous_party_1 | previous_party_2 | was_vp_or_vp_runner | day_before_1 | day_before_7 | day_before_30 | day_before_60 | day_before_180 | day_before_365 | day_before_730 | day_after_1 | day_after_7 | day_after_30 | day_after_60 | day_after_180 | day_after_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 |
| mean | 0.517241 | 0.344828 | 0.517241 | 0.551724 | 0.37931 | 0.002016 | 0.014532 | 0.016543 | 0.005033 | 0.043755 | 0.040059 | 0.08952 | 0.448276 | -0.002882 | -0.003152 | -0.002937 | -0.002924 | -0.002029 |
| std | 0.508548 | 0.483725 | 0.508548 | 0.50612 | 0.493804 | 0.006578 | 0.017648 | 0.037084 | 0.064675 | 0.110046 | 0.232439 | 0.376834 | 0.50612 | 0.019684 | 0.020203 | 0.01946 | 0.019455 | 0.018122 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.002982 | -0.011032 | -0.040369 | -0.179777 | -0.336769 | -0.808455 | -1.657169 | 0.0 | -0.055902 | -0.058022 | -0.053794 | -0.05918 | -0.049582 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.011044 | -0.022993 | -0.012496 | 0.023558 | 0.052363 | 0.0 | -0.009059 | -0.009004 | -0.008936 | -0.008607 | -0.008056 |
| 50% | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.013861 | 0.012609 | 0.013489 | 0.042738 | 0.096108 | 0.145892 | 0.0 | -0.001941 | -0.001941 | -0.001933 | -0.001832 | -0.001758 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.020732 | 0.035039 | 0.029554 | 0.114063 | 0.144247 | 0.230675 | 1.0 | 0.011486 | 0.01108 | 0.01025 | 0.010074 | 0.010467 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.031734 | 0.067513 | 0.112577 | 0.145979 | 0.276082 | 0.363299 | 0.490533 | 1.0 | 0.022454 | 0.022702 | 0.022785 | 0.022876 | 0.030659 |


### 4. Model: Predicting the Trump Election
Here we will train a DNN that aims to predict the 2016 presidential election. The features are the ones explored above (except for the forward-looking ones). You do not need to fully understand how a [neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network) works; it can be thought of as mapping inputs to outputs, with the network figuring out everything in between. The aim is not to have the best possible network architecture, but to leverage a neural network's ability to find patterns in data that would otherwise be difficult or time-consuming to find by inspection. 

We are using Deep Learning to figure out the useful features and generate a model based upon those useful features to predict upon. 

TensorFlow is the selected deep learning framework, as it is widely used in industry. There are many others, each with a different purpose and use. Use whatever gets the job done best.
* CNTK (Microsoft Cognitive Toolkit)
* Keras - a high-level API that can run on top of other frameworks
* Theano
* Torch
* Caffe/Caffe2
* scikit-learn

In [7]:
# our goal is to predict the party that will win the election
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop, Adam, SGD
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Softmax
from sklearn.model_selection import train_test_split

Select the relevant features and the labels we are trying to predict: the market prices and the presidential data. This is a MIMO (Multi-Input Multi-Output) problem in which we predict not only the expected winner but also the expected market direction.

    X - Our features
    y - our labels (what we want to predict)

Ensure that we are not encoding data that may have forward looking bias. Thank you @Justin Jiang <jbjiang@g.hmc.edu> for catching this bug.

In [8]:
combined_df[["prev_held_office", "was_vp_or_vp_runner"]] = combined_df[["prev_held_office", "was_vp_or_vp_runner"]].shift(1).fillna(0)

X = combined_df[['prev_held_office', 'previous_party_1',
       'previous_party_2', 'was_vp_or_vp_runner', 'day_before_1',
       'day_before_7', 'day_before_30', 'day_before_60', 'day_before_180',
       'day_before_365', 'day_before_730']]


y = combined_df[["party", "day_after_1", "day_after_7", "day_after_30",
                         "day_after_60", "day_after_180", "day_after_365"]]
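
To see what the `shift(1).fillna(0)` step does, here is a tiny sketch on made-up values (the column name is reused from the notebook, the numbers are not real data). Each row ends up with the value from the previous election, so the features for a given election no longer contain information about that election's own winner.

```python
import pandas as pd

# Made-up flags, one per election, in chronological order
toy = pd.DataFrame(
    {"prev_held_office": [1, 0, 1, 0]},
    index=pd.to_datetime(["2004-11-02", "2008-11-04", "2012-11-06", "2016-11-08"]),
)

# Move every value down one row; the first row has no predecessor, so fill with 0
shifted = toy["prev_held_office"].shift(1).fillna(0)
```

After the shift, the last row carries the flag from the 2012 row rather than its own.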

Separate the data into a training set and a test set. Note that we hold out only one value for testing, as this is what we want to predict. 

In [9]:
X_train = X.iloc[:-1]
y_train = y.iloc[:-1]
X_test  = X.iloc[-1:]
y_test  = y.iloc[-1:]

Finally we can build our model. Here we use the `Sequential` model API; later you should explore the `Functional` API. We are simply using a fully connected feedforward network with ReLU activations. This shows that a simple network can be quite powerful with high-quality data. 

In [10]:
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # `lr` is the deprecated name for `learning_rate`
model = Sequential()
model.add(Dense(10, input_dim=len(X.columns)))
model.add(Activation("relu"))
model.add(Dense(len(y.columns)))
model.add(Flatten())
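
For intuition, the arithmetic of this small network can be written out in NumPy. This is only a sketch of the forward pass, assuming random placeholder weights; `forward` is a hypothetical name, and Keras handles initialization and training itself. The layer sizes (11 inputs, 10 hidden units, 7 outputs) match the columns of `X` and `y` above.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """Dense -> ReLU -> Dense, the same layer stack as the model above."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # first Dense layer followed by ReLU
    return hidden @ W2 + b2  # second Dense layer, no activation


rng = np.random.default_rng(0)
n_features, n_hidden, n_outputs = 11, 10, 7  # sizes used in the notebook

# Placeholder weights; a trained model would have learned values here
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_outputs)), np.zeros(n_outputs)

one_election = np.ones((1, n_features))  # a single made-up input row
prediction = forward(one_election, W1, b1, W2, b2)

# Tiny hand-checkable case: ReLU zeroes the negative input before the sum
small = forward(np.array([[1.0, -2.0]]), np.eye(2), np.zeros(2),
                np.array([[1.0], [1.0]]), np.array([0.5]))
```

Because the last layer has no activation, the outputs are unbounded real numbers, which is why the `party` prediction is later thresholded at 0.5.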

Now compile the model and see how we do! Note: during fitting we hold out the last 10% of the training data (`validation_split=0.1`) to check that we are not overfitting. We want performance on the held-out data to be about the same as on the training set. Production models with an extremely limited amount of training data are often retrained on the whole dataset to get a better model. This is done after we have decided upon a model.

In [11]:
model.compile(loss="mean_squared_error",optimizer=sgd,metrics=["mae"])
hist = model.fit(X_train, y_train, epochs=200, verbose=0, validation_split=0.1)
scores = model.evaluate(X_test, y_test)
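
Keras documents that `validation_split` takes the validation rows from the end of the arrays, before any shuffling. The sketch below illustrates which rows that would be for the 28 training rows here. `split_tail` is a hypothetical helper, and the exact rounding is an assumption about Keras internals rather than something taken from this notebook.

```python
import math


def split_tail(n_samples, validation_split):
    """Hypothetical helper: hold out the last fraction of rows for validation."""
    # Assumed rounding: the training part is floored, the remainder is validation
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    train_idx = list(range(split_at))
    val_idx = list(range(split_at, n_samples))
    return train_idx, val_idx


# 29 elections in the dataframe, minus the one kept for testing
train_idx, val_idx = split_tail(28, 0.1)
```

With time-ordered rows, this means the validation set consists of the most recent elections before 2016, which is a sensible check for a forecasting problem.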



In [12]:
out = model.predict(X_test)
results = pd.DataFrame([y_test.values.flatten(), out.flatten()], columns=y_test.keys().values.tolist()).transpose()
results.columns = ["actual", "predicted"]
results




| | actual | predicted |
|---|---|---|
| party | 1.0 | 0.582241 |
| day_after_1 | 1.0 | 0.507118 |
| day_after_7 | 0.013579 | 0.048224 |
| day_after_30 | 0.0131 | -0.178125 |
| day_after_60 | 0.012871 | -0.092291 |
| day_after_180 | 0.012232 | -0.021258 |
| day_after_365 | 0.010905 | 0.106571 |


In [13]:
if results["predicted"].party > 0.5:
    print("Predicting a Trump Victory")
else:
    print("Predicting a Clinton Victory")

if results["predicted"]["day_after_1"] > 0.5:
    print("Predicting the market will be up.")
else:
    print("Predicting the market will be down.")

Predicting a Trump Victory
Predicting the market will be up.


And we were correct! We correctly predicted the party of the elected candidate and the direction of the market return. Try adjusting some hyperparameters to improve the model, but be careful of overfitting given the tiny dataset.

Also try running this notebook multiple times; you may notice that you get different results each time. This is due to the inherent randomness of neural networks at initialization. Why is this?
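
As a hint, the sketch below mimics random weight initialization in NumPy. `init_weights` is a hypothetical helper and the uniform range is an arbitrary assumption, not what Keras uses. Different seeds give different starting weights, so training starts from a different point each run, while a fixed seed reproduces the same start. In TensorFlow, `tf.random.set_seed` plays a similar role.

```python
import numpy as np


def init_weights(seed, shape=(11, 10)):
    """Hypothetical helper: draw small random starting weights from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.1, 0.1, size=shape)


run_a = init_weights(seed=1)
run_b = init_weights(seed=2)        # a different seed gives a different start
run_a_again = init_weights(seed=1)  # the same seed reproduces the first run
```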


For next steps, try predicting what will happen in the 2020 election.