# Finance using a crude Neural Network and NLTK by [eestra2](https://github.com/eestra2)

### An introductory walk-through of using code and machine learning modules for financial programming.

#### Note: 
#### This walk-through is not geared towards a general audience. It assumes some basic knowledge of Python, finance principles, and general statistics. The data and the model are toys, far from perfect in every sense. The goal is to help people with a finance or economics background and amateur coding skills get their feet wet with Python programming. The aim is to walk through the code line by line (the part where most business majors lack experience) to show where everything comes from, so that code does not seem like magic. This is not a showcase of an optimal, fine-tuned model that predicts stock prices with high accuracy, and it is not the ideal starting point for someone who is programming for the first time. Think of it as a good place to step beyond the basics of Python and start something at the "beginner to mid-level transition". 

**WARNING:**

**Please do not use this content to invest! As mentioned above, this is for learning purposes ONLY!**

## Table of Contents

1. <a href="#1.-Natural-Language-Processing-&-Data-Prep">Natural Language Processing & Data Prep</a>
2. <a href="#2.-DJIA-Prediction-with-Machine-Learning-Perceptron">DJIA Prediction with Machine Learning Perceptron</a>
3. <a href="#3.-Results-and-Conclusion">Results and Conclusion</a>

# 1. Natural Language Processing & Data Prep

In this tutorial I will use a neural network model to predict the Dow Jones Industrial Average (DJIA). Besides historical prices, I also want to include sentiment analysis features that can contribute to the way a stock price might move. For example, if there is negative news about a company not meeting its expected earnings, the stock price for that company will likely dip. On the other hand, if good news is reported to the public about a company surpassing expected earnings, the price of that company will most likely increase as investors scramble to bid for more shares. Market sentiment plays a huge role in equity markets in the short term, dictating the way prices move. Thus, I find it crucial to attempt to map media sentiment to stock price movement; however, I will use the Dow Jones Industrial Average as a proxy for how the stock market moves in response to news. The Dow is a price-weighted index of 30 large American companies that is often used as a rough gauge of the American economy. One quick takeaway is what price-weighted means: companies in the index with the highest stock prices influence the overall index more than companies with lower stock prices. The Dow is one of the oldest and most widely quoted indexes in the world.  
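
To illustrate what "price-weighted" means, here is a minimal sketch with made-up numbers. The three company names, their prices, and the divisor of 3.0 are all hypothetical; the real Dow uses 30 components and a divisor that is adjusted over time.

```python
# Hypothetical share prices for a three-stock, price-weighted index.
prices = {"HighCo": 300.0, "MidCo": 100.0, "LowCo": 50.0}
divisor = 3.0  # assumed divisor, chosen only to keep the arithmetic simple


def price_weighted_index(prices, divisor):
    """Sum of the component prices divided by the index divisor."""
    return sum(prices.values()) / divisor


base = price_weighted_index(prices, divisor)

# Apply the same 10% gain to one stock at a time and compare the effect.
moves = {}
for name in prices:
    bumped = dict(prices)
    bumped[name] *= 1.10
    moves[name] = price_weighted_index(bumped, divisor) - base
    print(f"{name} +10% moves the index by {moves[name]:.2f} points")
```

The same percentage gain moves the index most when it happens in the highest-priced stock, regardless of how large the company actually is.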

In [2]:
# Here we import the modules that do the heavy lifting for us.
# If a module is not installed, run "pip install module" in your terminal
# If using the Anaconda distribution, you can either use the GUI navigator or run "conda install module" in the conda terminal
# VADER also needs its lexicon; if it is missing, run nltk.download('vader_lexicon') once (after "import nltk")
import numpy as np
import pandas as pd
import unicodedata
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# To concentrate on the modeling process rather than spending too much time on data acquisition through APIs, which tend to be a hassle at times, 
# I have included a pickle file that contains the clean data of the Dow Jones Industrial Average index and news articles for the sentiment analysis part. 
# The file contains prices and articles on a daily basis, from 2007-01-01 to 2016-12-31 (a span of 10 years of data)
df_dow = pd.read_pickle('cleaned_data.pkl')


# Extract the "adj close" column, convert the values to integers, and assign them to a new "prices" column
# Then keep only the two columns we need, overwriting df_dow so that it holds just prices and articles
df_dow['prices'] = df_dow['adj close'].apply(np.int64)
df_dow = df_dow[['prices', 'articles']]

# Article text comes with unwanted leading characters, so we strip any '.' and '-' from the start of each article.
# Then we create a new dataframe that holds only "prices" 
df_dow['articles'] = df_dow['articles'].map(lambda x: x.lstrip('.-'))
df = df_dow[['prices']].copy()

# Creating labeled empty columns where values computed by the sentiment analyzer will be placed
df["compound"] = ''
df["neg"] = ''
df["neu"] = ''
df["pos"] = ''

# Instantiate the sentiment analyzer so it can score the article text
sia = SentimentIntensityAnalyzer()


# Iterate over each date: normalize that day's article text > score it with sia.polarity_scores > write the scores to the matching column and date (row) 
# Note: DataFrame.iteritems() was removed in pandas 2.0, so we loop over the index directly
for date in df_dow.index:
    try:
        sentence = unicodedata.normalize('NFKD', df_dow.loc[date, 'articles'])
        ss = sia.polarity_scores(sentence)
        df.at[date, 'compound'] = ss['compound']
        df.at[date, 'neg'] = ss['neg']
        df.at[date, 'neu'] = ss['neu']
        df.at[date, 'pos'] = ss['pos']
    except TypeError:
        print(df_dow.loc[date, 'articles'])
        print(date)

In [3]:
df.head()

date,prices,compound,neg,neu,pos
2007-01-01,12469,-0.9735,0.153,0.748,0.099
2007-01-02,12472,-0.9702,0.122,0.786,0.092
2007-01-03,12474,-0.9994,0.203,0.736,0.06
2007-01-04,12480,-0.9982,0.131,0.806,0.062
2007-01-05,12398,-0.9901,0.124,0.794,0.082
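
The two text-cleaning steps used in the cell above, `lstrip('.-')` and `unicodedata.normalize('NFKD', ...)`, can be tried on their own. This is a small sketch on a made-up headline (the string `raw` is hypothetical, not taken from the dataset), using only the standard library.

```python
import unicodedata

# Hypothetical raw article text with leading debris, an "fi" ligature (U+FB01),
# and a non-breaking space (U+00A0).
raw = ".--. Dow climbs as \ufb01nancials rally\u00a0today"

# lstrip('.-') removes any run of '.' and '-' characters from the left edge only.
# It stops at the first character that is neither, here the space before "Dow".
stripped = raw.lstrip(".-")

# NFKD replaces compatibility characters with their plain equivalents,
# so the ligature becomes "fi" and the non-breaking space becomes a normal space.
normalized = unicodedata.normalize("NFKD", stripped)

print(repr(stripped))
print(repr(normalized))
```

Note that `lstrip` leaves the leading space in place and does not touch anything in the middle or at the end of the text.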


[<a href="#Finance-using-Neural-Network-and-NLTK-by-eestra2">Back to top</a>]

# 2. DJIA Prediction with Machine Learning Perceptron

Since the aim is to train on the first ten months of each year and predict the DJIA for the last two months (November and December), I loop the code over every year. As a result, I cannot break the rest of the code into smaller cells to show how each line works. Instead, I organized the code into "related blocks" to make the process clearer. I hope the additional comments do not make the code hard to read or overwhelming to the eyes.

The neural network will have 3 hidden layers of 100, 200, and 100 neurons respectively. Each neuron in a hidden layer works like a perceptron, the basic unit where the inputs given to a neural network get computed. Inputs are multiplied by weights, summed, and passed into an activation function that "scales" or "maps" the result onto an interval that depends on the type of function used. I chose ReLU (Rectified Linear Unit) because prices can range from 0 to infinity, which is exactly ReLU's range. Keep in mind, though, that in scikit-learn's MLPRegressor the activation setting applies only to the hidden layers and the output layer is linear, so ReLU by itself does not restrict the predicted prices to that range. If the network has more hidden layers, the outputs of one layer become the inputs of the next and the process repeats. A network with no hidden layer can only model linear relationships. Prices and sentiment do not have such a simple relationship, so more layers are needed to map the non-linear connection. The other parameters are just as important, but for now I will not go into much detail. One major parameter to take note of is the initial learning rate (learning_rate_init). It dictates "how big of a step" to take while optimizing the weights in search of the lowest possible error. In scikit-learn it is only used by the "sgd" and "adam" solvers, so with the "lbfgs" solver used below it has no effect. 
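
The "multiply by weights, then apply an activation" step can be sketched in a few lines of plain Python. The layer sizes and all weights below are made up for illustration (a real fit learns them), and the helper names `relu` and `dense` are my own, not part of scikit-learn. The input is the first row of sentiment scores shown in section 1.

```python
def relu(x):
    """Rectified Linear Unit: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: a weighted sum per neuron, then an optional activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


# One day's sentiment features: compound, neg, neu, pos.
features = [-0.9735, 0.153, 0.748, 0.099]

# Hypothetical weights: a hidden layer with two neurons, then one output neuron.
hidden_w = [[0.5, -1.0, 0.2, 0.3], [-0.4, 0.8, 0.1, -0.6]]
hidden_b = [0.1, 0.0]
output_w = [[1.5, -2.0]]
output_b = [0.25]

hidden = dense(features, hidden_w, hidden_b, activation=relu)  # ReLU in the hidden layer
output = dense(hidden, output_w, output_b)[0]                  # no activation on the output

print("hidden layer:", hidden)
print("output:", output)
```

With these made-up weights the first hidden neuron is clipped to zero by ReLU, while the final output comes out negative, which shows that a linear output layer is not limited to ReLU's range.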


Neural networks are among the most powerful algorithms out there. Surprisingly, the idea is not new: the perceptron dates back to 1958 and was originally formulated to model and understand how the human brain processes information. It slowly caught the attention of many as they saw it could be used to solve problems that regular algorithms and standard statistical computations handle poorly because they do not cope well with noise. Noise typically means data points that do not follow an organized pattern and therefore look like the result of random chance.  
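
To make the "how big of a step" idea behind the learning rate concrete, here is a minimal sketch of plain gradient descent on a made-up error function with a single weight. It is an illustration only and not what the model in this tutorial does internally: with solver='lbfgs', scikit-learn ignores learning_rate_init. The function name `gradient_descent` and the toy error are assumptions of mine.

```python
def gradient_descent(learning_rate, steps=20, start=10.0):
    """Minimize the toy error f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient  # the learning rate scales each step
    return w


# The best weight is w = 3. Compare a tiny, a moderate, and an oversized step.
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

A very small rate barely moves away from the starting point in 20 steps, a moderate rate lands close to 3, and an oversized rate overshoots further on every step and diverges.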

In [4]:
from sklearn.neural_network import MLPRegressor

In [5]:
# In my other tutorial, "Random Forest for Dow Jones", I trained on years '07 to '14 and the last two years of data were reserved for test dataset. 
# In this tutorial I will be training/testing on each year. Train dataset is from Jan. to Oct. and test dataset is from Nov. to Dec.  

years = [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]


for year in years:
    
    # Splitting the data from df into training and testing subsets
    # Jan-Oct = train set , Nov-Dec = test set
    train_start_date = str(year) + '-01-01'
    train_end_date = str(year) + '-10-31'
    test_start_date = str(year) + '-11-01'
    test_end_date = str(year) + '-12-31'
    train = df.loc[train_start_date : train_end_date]
    test = df.loc[test_start_date:test_end_date]
    
    # Array Conversion: train features
    # For each date in the train dataset we extract the values of "compound", "neg", "neu", and "pos"
    # into a numpy array, then collect those arrays inside one outer numpy array
    # Essentially it is a matrix. Matrices are the magic of how most models compute. 
    sentiment_score_list = []
    for date in train.index:
        sentiment_score = np.asarray([df.loc[date, 'compound'],df.loc[date, 'neg'],df.loc[date, 'neu'],df.loc[date, 'pos']])
        sentiment_score_list.append(sentiment_score)
    # This takes the list containing arrays and places it into an array
    numpy_df_train = np.asarray(sentiment_score_list)
    
    # Array Conversion: test features to be used to predict
    # Same procedure as above
    sentiment_score_list = []
    for date in test.index:
        sentiment_score = np.asarray([df.loc[date, 'compound'],df.loc[date, 'neg'],df.loc[date, 'neu'],df.loc[date, 'pos']])
        sentiment_score_list.append(sentiment_score)
    numpy_df_test = np.asarray(sentiment_score_list)
    
###############################################################################################################################
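    # Instantiate the model, train it, and make it predict
    # Note: learning_rate_init and shuffle only apply to the 'sgd' and 'adam' solvers; 'lbfgs' ignores them
    mlpc = MLPRegressor(hidden_layer_sizes=(100, 200, 100), 
                        activation='relu', 
                        solver='lbfgs', 
                        alpha=0.005, 
                        learning_rate_init=0.001, 
                        shuffle=False) 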
   
    mlpc.fit(numpy_df_train, train['prices'])   
    
    prediction = mlpc.predict(numpy_df_test)
    
    
###############################################################################################################################
    # This block of code organizes the predictions into a dataframe as well as the actual real values
    # I used a condition to start out the dataframes for actual and predicted prices and then the else condition serves to
    # just keep adding columns containing the values from each loop (year) to the dataframes. 
    
    if year == 2007:
        idx = pd.date_range(test_start_date, test_end_date)
        col_name = '{}_est_prices'.format(year)
    
        predictions_df = pd.DataFrame(data=prediction, index= idx, columns=[col_name])
        
        actual_df = pd.DataFrame(test['prices'].reset_index(drop=True))
        actual_df.rename(columns={'prices':'2007_actual_prices'}, inplace=True)
    else:
        predictions_df['{}_est_prices'.format(year)] = prediction
        actual_df['{}_actual_prices'.format(year)] = test['prices'].reset_index(drop=True)
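
The loop above relies on every year's November to December test set having the same number of rows, so that each new column fits the dataframe started in 2007. Here is a quick standard-library check of the split sizes, assuming the data has one row per calendar day (which the outputs in this notebook suggest). The helper `split_year` is a hypothetical name used only for this sketch.

```python
from datetime import date, timedelta


def split_year(year):
    """Return (train_days, test_days) for a calendar year: Jan-Oct vs Nov-Dec."""
    day = date(year, 1, 1)
    train_days, test_days = [], []
    while day.year == year:
        (train_days if day.month <= 10 else test_days).append(day)
        day += timedelta(days=1)
    return train_days, test_days


# 2007 is a regular year and 2008 is a leap year.
for year in (2007, 2008):
    train_days, test_days = split_year(year)
    print(year, "train days:", len(train_days), "test days:", len(test_days))
```

Leap years add one day to the training period only, so the test period stays at 61 days every year.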
    
    


In [6]:
# This variable assignment contains the prediction of the last year (2016)
# I like to use this as a test sample to check against the dataframe below to see if it matches
prediction

array([ 17739.99390317,  17588.88564029,  17556.43476319,  17621.64193656,
        17829.78348665,  17736.76025956,  17656.96873428,  17726.91487988,
        17855.06009717,  17577.2761266 ,  17972.57682123,  17586.61051732,
        17911.13567119,  17474.68499789,  17552.02036692,  17568.61523807,
        17498.73652923,  17483.08875351,  17705.94757496,  17744.40755994,
        17771.93846079,  17885.45323572,  17958.79030764,  17776.53750171,
        17799.28627345,  17574.17320287,  17800.73990393,  17494.03767643,
        17913.14226329,  17485.91799094,  17533.64655717,  17629.97748873,
        17552.81464004,  17768.1571643 ,  17368.70503413,  17591.54421535,
        17465.52105484,  17651.76449187,  17602.42282986,  17703.62703542,
        18017.96427337,  17441.51428871,  17792.97383346,  17536.03122704,
        17706.89461733,  17572.36194855,  17493.69566154,  17970.80574156,
        17622.75000212,  17636.78971201,  17685.6607323 ,  17852.17610899,
        17520.94841833,  

In [7]:
# Predicted prices of DJIA per year, ranging from Nov-1 to Dec-31
# Ignore the year in the index (it was built from the 2007 date range). Only pay attention to the month and day
predictions_df

date,2007_est_prices,2008_est_prices,2009_est_prices,2010_est_prices,2011_est_prices,2012_est_prices,2013_est_prices,2014_est_prices,2015_est_prices,2016_est_prices
2007-11-01,13113.838310,11697.907647,8536.363194,10529.154566,11927.820786,12961.602571,14809.220135,16651.547256,17564.366816,17739.993903
2007-11-02,13464.297340,11572.074129,8610.889241,10564.745826,11899.146827,12967.634563,14842.488442,16618.810879,17514.292059,17588.885640
2007-11-03,13115.647510,11844.443093,8668.036783,10573.396730,11966.876062,12952.948087,14782.283733,16539.137733,17583.244453,17556.434763
2007-11-04,13097.799410,11899.939431,8644.719646,10522.390300,12039.206685,12979.027296,14815.040911,16577.614153,17562.666491,17621.641937
2007-11-05,13210.365145,11605.316586,8602.726708,10562.303137,12057.296227,12876.713481,14822.789834,16549.600601,17535.477934,17829.783487
2007-11-06,13127.301696,11625.786796,8628.541592,10494.394108,11893.197159,12993.387032,14829.991089,16671.970496,17432.574917,17736.760260
2007-11-07,13203.587883,11826.945670,8455.033426,10484.950309,11914.819864,12895.435889,14802.247454,16565.949572,17543.185178,17656.968734
2007-11-08,13149.745843,11828.385380,8461.039941,10497.035522,11984.148361,12923.523789,14847.822205,16601.496822,17566.194578,17726.914880
2007-11-09,13264.408819,11680.793308,8761.380963,10559.416016,11938.129748,12965.106339,14779.179489,16695.692342,17613.476478,17855.060097
2007-11-10,13228.341888,11778.394710,8621.840933,10551.215857,11930.389707,12936.344622,14830.816770,16567.895318,17608.259711,17577.276127


In [8]:
# These are the actual closing prices that the DJIA had during those dates

actual_df

row,2007_actual_prices,2008_actual_prices,2009_actual_prices,2010_actual_prices,2011_actual_prices,2012_actual_prices,2013_actual_prices,2014_actual_prices,2015_actual_prices,2016_actual_prices
0,13567,9323,9763,11124,11657,13232,15615,17382,17773,18037
1,13595,9321,9789,11188,11836,13093,15623,17374,17828,17959
2,13577,9319,9771,11215,12044,13099,15631,17366,17918,17930
3,13560,9625,9802,11434,11983,13106,15639,17383,17867,17888
4,13543,9139,10005,11444,12011,13112,15618,17484,17863,18012
5,13660,8695,10023,11431,12040,13245,15746,17554,17910,18135
6,13300,8943,10091,11419,12068,12932,15593,17573,17850,18259
7,13266,8919,10159,11406,12170,12811,15761,17587,17790,18332
8,13042,8894,10226,11346,11780,12815,15768,17600,17730,18589
9,13024,8870,10246,11357,11893,12815,15775,17613,17758,18807


[<a href="#Finance-using-Neural-Network-and-NLTK-by-eestra2">Back to top</a>]

# 3. Results and Conclusion

The predictions are still far off from the actual values. The model is still largely a work in progress that needs more fine-tuning, feature selection, and perhaps a change in the volume of data. We must keep in mind that time-series trends are short-lived. They can be seasonal, cyclical, or tied to some other periodic time frame. That being said, much older data might not contribute much to predicting a future trend, much less so an "immediate future". It is worthwhile to think carefully about what other features might also influence price movements. Do a company's financials influence price movements? Economic indicators? Industry changes or new industry regulation? The prices of a company's inputs? Think of the transportation industry. If oil prices rise, the cost of doing business for those companies will likely rise, which may hurt their profit margins and, in turn, their stock prices. One last thing: always remember that quality data is far better than fancy analysis. If you give algorithms poor data, the results will not correctly represent the thing being studied. All these things and more are worth thinking about and eventually incorporating into the model to see whether it improves and by how much.  
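
One way to put a number on "far off" is to compute simple error metrics. The sketch below uses only the first ten 2016 rows shown in the two output tables of section 2, and it assumes that row 0 of the actual prices corresponds to November 1 in the predictions. It is a partial check on a small sample, not a full evaluation of the model.

```python
# First ten 2016 rows copied from the predicted and actual price tables.
predicted = [17739.993903, 17588.885640, 17556.434763, 17621.641937, 17829.783487,
             17736.760260, 17656.968734, 17726.914880, 17855.060097, 17577.276127]
actual = [18037, 17959, 17930, 17888, 18012, 18135, 18259, 18332, 18589, 18807]

# Signed error per day: negative means the model predicted too low.
errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mape = 100.0 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)  # mean absolute % error
bias = sum(errors) / len(errors)  # mean signed error

print(f"MAE:  {mae:.1f} index points")
print(f"MAPE: {mape:.2f}%")
print(f"Mean signed error: {bias:.1f} index points")
```

On this sample every prediction is below the actual value, so the mean signed error is negative: the model underestimated the index on all ten days.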

# Coming soon: Recurrent Neural Network! =)