# Part 1: Feature Engineering

Indicators are tools that help an investor or trader decide whether to buy or sell a stock.
Technical indicators (which can be called features in this context) are constructed from stock data, such as `price` or `volume`.
In this part we will create the following features: `Bollinger Bands`, `RSI`, `MACD`, `Moving Average`, `Return`, `Momentum`, `Change` and `Volatility`.

`Return` will serve as a **target** or dependent variable. Other features will serve as independent variables.
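
As a quick illustration of the target, the sketch below applies the open-to-close return that the feature-generation cell later in this notebook computes (`Close / Open - 1`, rounded to three decimals) to the first two Tesla rows shown further down. The `sample` frame and the `intraday_return` helper are illustrative names and are not part of the project code.

```python
import pandas as pd

# Two sample rows copied from the Tesla data shown in this notebook.
sample = pd.DataFrame(
    {"Open": [3.8, 5.158], "Close": [4.778, 4.766]},
    index=pd.to_datetime(["2010-06-29", "2010-06-30"]),
)


def intraday_return(df: pd.DataFrame) -> pd.Series:
    """Open-to-close return for each row, rounded to 3 decimals (hypothetical helper)."""
    return (df["Close"] / df["Open"] - 1).round(3)


sample["Return"] = intraday_return(sample)
print(sample)
```

A positive value means the stock closed above its opening price on that day; a negative value means it closed below.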

## Importing Libraries

In [10]:
import pandas_datareader as pdr
import os
import time  # used by time.sleep in the headline-scraping loop
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from importlib import reload
from features_engineering import ma7, ma21, rsi, macd, bollinger_bands, momentum, get_tesla_headlines
import importlib
import features_engineering
from ta.momentum import StochasticOscillator

from bs4 import BeautifulSoup
import requests
from nltk.sentiment.vader import SentimentIntensityAnalyzer
warnings.filterwarnings('ignore')
plt.rcParams['figure.dpi'] = 227 # native screen dpi for my computer

# Original Data

In [3]:
tsla_df = pdr.get_data_yahoo('tsla', '1980')
tsla_df.to_csv('data/raw_stocks.csv')

Let's take a look at the historical data of **Tesla**.

In [4]:
tsla_df.head()

Date,High,Low,Open,Close,Volume,Adj Close
2010-06-29,5.0,3.508,3.8,4.778,93831500.0,4.778
2010-06-30,6.084,4.66,5.158,4.766,85935500.0,4.766
2010-07-01,5.184,4.054,5.0,4.392,41094000.0,4.392
2010-07-02,4.62,3.742,4.6,3.84,25699000.0,3.84
2010-07-06,4.0,3.166,4.0,3.222,34334500.0,3.222


In [5]:
tesla_df = pd.read_csv('data/raw_stocks.csv')

In [6]:
tesla_df.describe()

,High,Low,Open,Close,Volume,Adj Close
count,2894.0,2894.0,2894.0,2894.0,2894.0,2894.0
mean,124.193092,118.862103,121.604026,121.709061,31413200.0,121.709061
std,227.803314,217.676477,222.916185,223.170847,28251150.0,223.170847
min,3.326,2.996,3.228,3.16,592500.0,3.16
25%,17.6995,15.93,16.5105,17.088,12808000.0,17.088
50%,46.947001,45.407,46.136,46.099001,24956500.0,46.099001
75%,67.824503,65.405499,66.534502,66.6675,40017750.0,66.6675
max,1243.48999,1217.0,1234.410034,1229.910034,304694000.0,1229.910034


### Checking for missing data

In [7]:
print('No missing data') if sum(tesla_df.isna().sum()) == 0 else tesla_df.isna().sum()

No missing data


# Generating Features

In [17]:
importlib.reload(features_engineering)  # reload the module to pick up edits without restarting the kernel

#del stocks
files = os.listdir('data/raw_stocks')
stocks = {}
for file in files:
    name = file.lower().split('.')[0]
    stocks[name] = pd.read_csv('data/raw_stocks/'+file)    
    
    # Return Feature
    stocks[name]['Return'] = round(stocks[name]['Close'] / stocks[name]['Open'] - 1, 3)
    
    # Change Feature
    # Signed price change from the previous day's close, in price units (not a percentage)
    stocks[name]['Change'] = (stocks[name].Close - stocks[name].Close.shift(1)).fillna(0)
    
    
    # Date Feature
    stocks[name]['Date'] = pd.to_datetime(stocks[name]['Date'])
    stocks[name].set_index('Date', inplace=True)
    
    
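    # Volatility Feature: exponentially weighted standard deviation of Close
    # (the positional argument of ewm is `com`, so this is com=21)
    stocks[name]['Volatility'] = stocks[name].Close.ewm(21).std()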
    # Moving Average, 7 days
    stocks[name]['MA7'] = ma7(stocks[name])
    # Moving Average, 21 days
    stocks[name]['MA21'] = ma21(stocks[name])
    # MA 50
    stocks[name]['MA50'] = stocks[name]['Close'].rolling(50).mean()
    # MA 200
    stocks[name]['MA200'] = stocks[name]['Close'].rolling(200).mean()
    # EMA50 (the positional argument of ewm is `com`, so this is com=50, not span=50)
    stocks[name]['EMA50'] = stocks[name]['Close'].ewm(50).mean()

    
    # Momentum
    stocks[name]['Momentum'] = momentum(stocks[name].Close, 3)
    # RSI (Relative Strength Index)
    stocks[name]['RSI'] = rsi(stocks[name])
    # MACD - (Moving Average Convergence/Divergence)
    stocks[name]['MACD'], stocks[name]['Signal'],stocks[name]['MACD Singal Diff']  = macd(stocks[name])
    # Stochastic Oscillator
    stoch = StochasticOscillator(high=stocks[name]['High'],
                             close=stocks[name]['Close'],
                             low=stocks[name]['Low'],
                             window=14, 
                             smooth_window=3)
    stocks[name]['Stoch'] = stoch.stoch()
    stocks[name]['Stoch Signal'] = stoch.stoch_signal()


    # Upper Band and Lower Band for Bollinger Bands
    stocks[name]['Upper_band'], stocks[name]['Lower_band'] = bollinger_bands(stocks[name])
    stocks[name].dropna(inplace=True)
    # Saving
    stocks[name].to_csv('data/stocks/'+name+'.csv')

stocks['tsla'].head()

Date,Open,High,Low,Close,Adj Close,Volume,Return,Change,Volatility,MA7,...,EMA50,Momentum,RSI,MACD,Signal,MACD Singal Diff,Stoch,Stoch Signal,Upper_band,Lower_band
2011-04-12,25.08,25.209999,24.299999,24.65,24.65,1357400,-0.017,-0.62,2.026496,26.095714,...,24.824358,24.24,49.263021,0.612514,0.527633,0.08488,39.673111,52.705644,28.161076,20.57321
2011-04-13,25.129999,25.690001,24.809999,24.93,24.93,1211500,-0.008,0.28,1.98068,25.967143,...,24.826468,23.49,50.769363,0.528057,0.527718,0.000339,40.095094,43.400163,28.20553,20.717327
2011-04-14,24.870001,25.280001,24.200001,25.139999,25.139999,983400,0.011,0.209999,1.937584,25.744286,...,24.832731,22.27,51.922251,0.472621,0.516699,-0.044078,42.045455,40.604553,28.248896,20.894913
2011-04-15,25.65,26.18,25.41,25.58,25.58,943500,-0.003,0.440001,1.902114,25.614286,...,24.847651,21.65,54.335232,0.458902,0.505139,-0.046237,45.087727,42.409425,28.31346,21.094158
2011-04-18,25.129999,25.620001,24.360001,25.030001,25.030001,1033900,-0.004,-0.549999,1.859389,25.298571,...,24.85129,21.93,50.896557,0.39905,0.483921,-0.084872,35.43862,40.857267,28.324011,21.280751
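
The moving-average columns above rely on two kinds of pandas windows. The sketch below, on a small synthetic price series, contrasts `rolling(n).mean()` with `ewm(...)`. One detail worth noting: the first positional argument of `ewm` is `com` (center of mass), so `ewm(50)` corresponds to a smoothing factor of `alpha = 1 / (1 + 50)`, whereas a span-based 50-period EMA would use `alpha = 2 / (50 + 1)`. The `close` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices, for illustration only.
close = pd.Series(np.linspace(10.0, 20.0, 60))

# Simple moving average: NaN until a full window of 7 values is available.
sma7 = close.rolling(7).mean()

# ewm(50) passes 50 as `com`, which is equivalent to alpha = 1 / (1 + 50).
ema_com = close.ewm(50).mean()
ema_alpha = close.ewm(alpha=1 / 51).mean()

# A span-based EMA uses alpha = 2 / (span + 1), so it weights recent prices more.
ema_span = close.ewm(span=50).mean()

print("leading NaNs in SMA7:", sma7.isna().sum())
print("ewm(50) equals alpha=1/51:", np.allclose(ema_com, ema_alpha))
print("ewm(50) equals span=50:", np.allclose(ema_com, ema_span))
```

The leading NaN values produced by the rolling windows are the reason the feature-generation cell ends with `dropna`, which is also why the output above starts well after the first trading day.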


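The helpers `bollinger_bands`, `rsi` and `macd` are imported from the project's `features_engineering` module, whose source is not shown in this notebook. The sketch below gives common textbook definitions under assumed parameters (20-period bands at two standard deviations, a 14-period RSI with Wilder-style smoothing, and a 12/26/9 MACD). The functions carry a `_sketch` suffix to make clear that they are illustrations and may differ from the project's implementation.

```python
import numpy as np
import pandas as pd


def bollinger_bands_sketch(close: pd.Series, window: int = 20, n_std: float = 2.0):
    """Upper and lower bands: rolling mean plus/minus n_std rolling standard deviations."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std()
    return mid + n_std * std, mid - n_std * std


def rsi_sketch(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + RS), where RS = average gain / average loss."""
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    # Wilder's smoothing is an exponential average with alpha = 1 / window.
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window).mean()
    return 100 - 100 / (1 + avg_gain / avg_loss)


def macd_sketch(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """MACD line, signal line, and their difference."""
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    macd_line = fast_ema - slow_ema
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line, macd_line - signal_line


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 250).cumsum())

upper, lower = bollinger_bands_sketch(close)
rsi_values = rsi_sketch(close)
macd_line, signal_line, macd_diff = macd_sketch(close)
print(pd.DataFrame({"Upper": upper, "Lower": lower, "RSI": rsi_values, "MACD": macd_line}).tail())
```
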
In [18]:
stocks['aapl'].head()

Date,Open,High,Low,Close,Adj Close,Volume,Return,Change,Volatility,MA7,...,EMA50,Momentum,RSI,MACD,Signal,MACD Singal Diff,Stoch,Stoch Signal,Upper_band,Lower_band
1982-05-25,0.256696,0.258929,0.256696,0.256696,0.204486,12891200.0,0.0,0.0,0.02562,0.254145,...,0.302859,-2.747768,39.02419,-0.009016,-0.008375,-0.00064,15.788357,14.034095,0.291781,0.242445
1982-05-26,0.256696,0.256696,0.254464,0.254464,0.202707,10819200.0,-0.009,-0.002232,0.02556,0.253826,...,0.301909,-2.745536,37.686379,-0.008862,-0.008473,-0.000389,10.525571,14.034095,0.291404,0.241122
1982-05-27,0.252232,0.252232,0.25,0.25,0.199151,7812000.0,-0.009,-0.004464,0.025655,0.253507,...,0.30089,-2.743304,35.095056,-0.008996,-0.008577,-0.000419,0.0,8.771309,0.291781,0.239682
1982-05-28,0.25,0.252232,0.25,0.25,0.199151,4799200.0,0.0,0.0,0.025686,0.253507,...,0.299891,-2.743304,35.095056,-0.008999,-0.008662,-0.000338,0.0,3.508524,0.292083,0.238317
1982-06-01,0.247768,0.247768,0.245536,0.245536,0.195595,11900000.0,-0.009,-0.004464,0.025874,0.252551,...,0.298825,-2.745536,32.50308,-0.009255,-0.008781,-0.000475,0.0,0.0,0.292569,0.23613
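
The `Stoch` and `Stoch Signal` columns come from `ta`'s `StochasticOscillator` with `window=14` and `smooth_window=3`. A pandas sketch of the usual definitions follows: %K compares the close with the highest high and lowest low of the last 14 periods, and the signal line is a 3-period simple average of %K. The function name `stochastic_sketch` is hypothetical. The last lines check the averaging step against the three `Stoch` values shown for AAPL above, which appears consistent with the printed `Stoch Signal`.

```python
import pandas as pd


def stochastic_sketch(high, low, close, window=14, smooth_window=3):
    """%K = 100 * (close - lowest low) / (highest high - lowest low); %D = SMA of %K."""
    lowest = low.rolling(window).min()
    highest = high.rolling(window).max()
    k = 100 * (close - lowest) / (highest - lowest)
    d = k.rolling(smooth_window).mean()
    return k, d


# Made-up prices: close rises by 1 each day, high/low sit 1 above/below it.
close = pd.Series(range(20), dtype=float)
k, d = stochastic_sketch(close + 1, close - 1, close)
print(k.iloc[13], d.iloc[15])

# Stoch values for 1982-05-25 .. 1982-05-27 copied from the AAPL output above.
aapl_k = pd.Series([15.788357, 10.525571, 0.0])
aapl_signal = round(aapl_k.rolling(3).mean().iloc[-1], 6)
print(aapl_signal)
```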


We will rely mostly on historical data and technical indicators. In addition, we will use Tesla news headlines to test the hypothesis that news affects price movement.

## Tesla News Headlines

As the news source we will use the <a href="https://www.nasdaq.com">NASDAQ</a> website.
At the time of parsing, there were 120 pages of news covering `2019-01-10` through `2019-09-05`.

In [11]:
headlines_list, dates_list = [], []
for i in range(1, 121):  # pages 1 through 120 inclusive
    headlines, dates = get_tesla_headlines("https://www.nasdaq.com/symbol/tsla/news-headlines?page={}".format(i))
    headlines_list.append(headlines)
    dates_list.append(dates)
    time.sleep(1)

In [None]:
tesla_headlines = pd.DataFrame({'Title': [i for sub in headlines_list for i in sub], 'Date': [i for sub in dates_list for i in sub[:10]]})

## Unsupervised sentiment prediction

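Once the headlines are parsed, we assign a sentiment score to each one with VADER (NLTK's `SentimentIntensityAnalyzer`), a lexicon- and rule-based analyzer that requires no labeled training data. We keep its `compound` score, which ranges from -1 (most negative) to +1 (most positive).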

In [15]:
sid = SentimentIntensityAnalyzer()

In [57]:
tesla_headlines['Sentiment'] = tesla_headlines['Title'].map(lambda x: sid.polarity_scores(x)['compound'])
tesla_headlines.Date = pd.to_datetime(tesla_headlines.Date)
tesla_headlines.to_csv('data/tesla_headlines.csv')

In [24]:
tesla_headlines.head()

,Title,Date,Sentiment
0,Tesla's use of individual driver data for insu...,2019-09-05,0.0
1,U.S. safety agency cites Tesla Autopilot desig...,2019-09-04,0.0258
2,"U.S. safety agency cites driver error, Tesla A...",2019-09-04,-0.3818
3,"U.S. safety regulator cites driver error, Tesl...",2019-09-04,-0.3818
4,"U.S. NTSB cites driver error, Tesla Autopilot ...",2019-09-04,-0.6597
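
To compare headlines with daily price data, the per-headline scores need to be collapsed to one value per day. A minimal sketch follows, assuming a simple daily mean is an acceptable aggregate (the notebook does not state which aggregation is used later). It uses only the dates and scores of the five rows shown above; the column names `MeanSentiment` and `Headlines` are illustrative.

```python
import pandas as pd

# Dates and compound scores copied from the five headline rows shown above.
headlines = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-09-05", "2019-09-04", "2019-09-04", "2019-09-04", "2019-09-04"]
    ),
    "Sentiment": [0.0, 0.0258, -0.3818, -0.3818, -0.6597],
})

# One row per day: the average score and the number of headlines behind it.
daily = (
    headlines.groupby("Date")["Sentiment"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "MeanSentiment", "count": "Headlines"})
)
print(daily)
```

Because the result is indexed by date, it could be joined to a stock frame with something like `stocks['tsla'].join(daily)`; days without headlines would then hold `NaN` in the sentiment columns.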


# Conclusion

The exploratory analysis, the machine learning algorithms, and Q-learning in the following parts all rely on the features generated here.