# Part 1: Features Engineering

An important part of stock market analysis is the use of indicators.
Indicators are tools that help an investor or trader decide whether to buy or sell a stock.
Technical indicators are constructed from stock data, such as `price` or `volume`.
Some of the most important indicators are `Bollinger Bands`, `RSI`, `MACD`, and `Moving Average`.

## Importing Libraries

In [12]:
import pandas_datareader as pdr
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from importlib import reload
from features_engineering import ma7, ma21, rsi, macd, bollinger_bands, momentum, get_tesla_headlines

from bs4 import BeautifulSoup
import requests
from nltk.sentiment.vader import SentimentIntensityAnalyzer
warnings.filterwarnings('ignore')
plt.rcParams['figure.dpi'] = 227 # native screen dpi for my computer

# Original Data

In [17]:
tesla_df = pdr.get_data_yahoo('tsla', '1980')
tesla_df.to_csv('data/raw_stocks.csv')

Let's take a look at the historical data of **Tesla**.

In [18]:
tesla_df.head()

Date,High,Low,Open,Close,Volume,Adj Close
2010-06-29,25.0,17.540001,19.0,23.889999,18766300,23.889999
2010-06-30,30.42,23.299999,25.790001,23.83,17187100,23.83
2010-07-01,25.92,20.27,25.0,21.959999,8218800,21.959999
2010-07-02,23.1,18.709999,23.0,19.200001,5139800,19.200001
2010-07-06,20.0,15.83,20.0,16.110001,6866900,16.110001


In [19]:
tesla_df.describe()

,High,Low,Open,Close,Volume,Adj Close
count,2351.0,2351.0,2351.0,2351.0,2351.0,2351.0
mean,183.098252,176.670595,179.967074,179.983777,5372103.0,179.983777
std,114.724786,111.103138,112.968696,112.991827,4713082.0,112.991827
min,16.629999,14.98,16.139999,15.8,118500.0,15.8
25%,34.5,33.385,33.949999,33.969999,1799750.0,33.969999
50%,214.020004,206.699997,210.25,209.970001,4477300.0,209.970001
75%,264.414993,255.709999,260.639999,260.669998,7168850.0,260.669998
max,389.609985,379.350006,386.690002,385.0,37163900.0,385.0


### Checking for missing data

In [22]:
print('No missing data') if sum(tesla_df.isna().sum()) == 0 else tesla_df.isna().sum()

No missing data


# Generating Features

In addition to the original data, we will generate the following features:
 - Return
 - Change
 - Volatility
 - MA21 and MA7 (Moving Average)
 - Momentum
 - RSI
 - MACD
 - Bollinger Bands
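
The helpers imported from `features_engineering` are not shown in this notebook, so the following is only a minimal sketch of how the moving-average and Bollinger Bands features could be computed. It assumes a simple rolling mean and bands placed two rolling standard deviations around the 21-day average. The names `moving_average` and `bollinger_bands_sketch`, the `window` and `n_std` parameters, and the toy price series are hypothetical and not part of the project code.

```python
import pandas as pd


def moving_average(close: pd.Series, window: int) -> pd.Series:
    """Simple moving average over the last `window` rows (hypothetical helper)."""
    return close.rolling(window=window).mean()


def bollinger_bands_sketch(close: pd.Series, window: int = 21, n_std: float = 2.0):
    """Return (upper, lower) bands: rolling mean +/- n_std rolling standard deviations.

    Assumption: the real `bollinger_bands` helper may use different parameters.
    """
    mid = close.rolling(window=window).mean()
    std = close.rolling(window=window).std()
    return mid + n_std * std, mid - n_std * std


# Toy closing prices 1.0 .. 30.0, for illustration only (not Tesla data)
close = pd.Series([float(x) for x in range(1, 31)])
ma7_values = moving_average(close, 7)
upper, lower = bollinger_bands_sketch(close)
```

The first `window - 1` rows of each rolling feature are `NaN`, which is why a `dropna` step is needed after the features are generated.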

In [6]:
#del stocks
files = os.listdir('data/raw_stocks')
stocks = {}
for file in files:
    name = file.lower().split('.')[0]
    stocks[name] = pd.read_csv('data/raw_stocks/'+file)    
    
    # Return Feature
    stocks[name]['Return'] = round(stocks[name]['Close'] / stocks[name]['Open'] - 1, 3)
    # Change Feature
    # Day-over-day change in the closing price, in price units rather than percent
    stocks[name]['Change'] = (stocks[name].Close - stocks[name].Close.shift(1)).fillna(0)
    # Date Feature
    stocks[name]['Date'] = pd.to_datetime(stocks[name]['Date'])
    stocks[name].set_index('Date', inplace=True)
    # Volatility Feature: exponentially weighted standard deviation of Close
    stocks[name]['Volatility'] = stocks[name].Close.ewm(com=21).std()
    # Moving Average, 7 days
    stocks[name]['MA7'] = ma7(stocks[name])
    # Moving Average, 21 days
    stocks[name]['MA21'] = ma21(stocks[name])
    # Momentum
    stocks[name]['Momentum'] = momentum(stocks[name].Close, 3)
    # RSI (Relative Strength Index)
    stocks[name]['RSI'] = rsi(stocks[name])
    # MACD - (Moving Average Convergence/Divergence)
    stocks[name]['MACD'], stocks[name]['Signal'] = macd(stocks[name])
    # Upper Band and Lower Band for Bollinger Bands
    stocks[name]['Upper_band'], stocks[name]['Lower_band'] = bollinger_bands(stocks[name])
    stocks[name].dropna(inplace=True)
    # Saving
    stocks[name].to_csv('data/stocks/'+name+'.csv')
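
The `rsi` and `macd` helpers used above live in `features_engineering` and are not reproduced here. As a rough sketch of what such functions commonly compute, the block below assumes a 14-period RSI based on simple rolling averages of gains and losses, and a MACD built from 12- and 26-period exponential moving averages with a 9-period signal line. The function names `rsi_sketch` and `macd_sketch` and all parameter defaults are assumptions, and the project's own helpers may differ.

```python
import numpy as np
import pandas as pd


def rsi_sketch(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index using simple rolling means (hypothetical helper)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()     # average upward move
    loss = (-delta).clip(lower=0).rolling(window).mean()  # average downward move
    rs = gain / loss
    return 100 - 100 / (1 + rs)


def macd_sketch(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """Return (macd_line, signal_line) from exponential moving averages."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line


# Toy series for illustration only (not Tesla data)
rising = pd.Series(np.arange(1.0, 41.0))
falling = rising[::-1].reset_index(drop=True)

rsi_up = rsi_sketch(rising)
rsi_down = rsi_sketch(falling)
macd_line, signal_line = macd_sketch(rising)
```

On a steadily rising series the RSI sits at the top of its 0 to 100 range and the MACD line stays above its signal line; on a steadily falling series the RSI drops to the bottom of the range.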

In [7]:
stocks['tsla'].head()

Date,Open,High,Low,Close,Adj Close,Volume,Return,Change,Volatility,MA7,MA21,Momentum,RSI,MACD,Signal,Upper_band,Lower_band
2010-07-28,20.549999,20.9,20.51,20.719999,20.719999,467200,0.008,0.17,1.979836,20.718571,19.911904,18.290001,41.752948,-0.350607,-0.658177,24.403824,15.419985
2010-07-29,20.77,20.879999,20.0,20.35,20.35,616000,-0.02,-0.369999,1.908327,20.725714,19.743333,17.950001,40.449222,-0.337599,-0.594062,23.858103,15.628563
2010-07-30,20.200001,20.440001,19.549999,19.940001,19.940001,426900,-0.013,-0.409999,1.839567,20.685714,19.558095,17.549999,38.996148,-0.356267,-0.546503,23.226171,15.890019
2010-08-02,20.5,20.969999,20.33,20.92,20.92,718100,0.02,0.979999,1.789358,20.674286,19.508571,17.719999,44.159747,-0.288656,-0.494933,23.066887,15.950255
2010-08-03,21.0,21.950001,20.82,21.950001,21.950001,1230500,0.045,1.030001,1.791139,20.768572,19.639524,17.35,49.041837,-0.15023,-0.425993,23.34933,15.929718


We will rely mostly on historical data and technical indicators. Additionally, we will use Tesla news headlines to test the hypothesis that news affects price movement.

## Tesla News Headlines

As the news source we will use the <a href="nasdaq.com">NASDAQ</a> website.
At the time of parsing there were 120 pages of news, covering `2019-01-10` through `2019-09-05`.

In [None]:
headlines_list, dates_list = [], []
for i in range(1, 121):  # pages 1 through 120
    headlines, dates = get_tesla_headlines("https://www.nasdaq.com/symbol/tsla/news-headlines?page={}".format(i))
    headlines_list.append(headlines)
    dates_list.append(dates)
    time.sleep(1)

In [None]:
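tesla_headlines = pd.DataFrame({
    # Flatten the per-page lists into one list per column
    'Title': [i for sub in headlines_list for i in sub],
    # Only the first 10 dates of each page are used
    'Date': [i for sub in dates_list for i in sub[:10]],
})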

## Unsupervised sentiment prediction

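Once the news is parsed, we will assign a sentiment score to each headline without any labeled training data, using VADER (`SentimentIntensityAnalyzer`), a lexicon- and rule-based sentiment analyzer.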

In [15]:
sid = SentimentIntensityAnalyzer()

In [57]:
tesla_headlines['Sentiment'] = tesla_headlines['Title'].map(lambda x: sid.polarity_scores(x)['compound'])
tesla_headlines.Date = pd.to_datetime(tesla_headlines.Date)
tesla_headlines.to_csv('data/tesla_headlines.csv')

In [24]:
tesla_headlines.head()

,Title,Date,Sentiment
0,Tesla's use of individual driver data for insu...,2019-09-05,0.0
1,U.S. safety agency cites Tesla Autopilot desig...,2019-09-04,0.0258
2,"U.S. safety agency cites driver error, Tesla A...",2019-09-04,-0.3818
3,"U.S. safety regulator cites driver error, Tesl...",2019-09-04,-0.3818
4,"U.S. NTSB cites driver error, Tesla Autopilot ...",2019-09-04,-0.6597
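
Several headlines can share one date, so testing whether news affects price movement would likely require one sentiment value per trading day. The sketch below shows one possible way to do that, assuming the daily mean of the `Sentiment` column is a reasonable aggregate. The frames `headlines` and `prices`, the column name `Mean_sentiment`, and all numbers are illustrative stand-ins, not real data from this notebook.

```python
import pandas as pd

# Toy stand-in for tesla_headlines (illustrative values only)
headlines = pd.DataFrame({
    'Date': pd.to_datetime(['2019-09-04', '2019-09-04', '2019-09-05']),
    'Sentiment': [-0.4, 0.2, 0.0],
})

# Toy stand-in for a stock frame indexed by Date (illustrative values only)
prices = pd.DataFrame(
    {'Return': [0.01, -0.02]},
    index=pd.to_datetime(['2019-09-04', '2019-09-05']),
)
prices.index.name = 'Date'

# One sentiment value per day: the mean compound score of that day's headlines
daily_sentiment = (
    headlines.groupby('Date')['Sentiment'].mean().rename('Mean_sentiment')
)

# A left join keeps every trading day, leaving NaN where no headline exists
merged = prices.join(daily_sentiment, how='left')
```

Other aggregates (for example the minimum score or the headline count per day) could be joined the same way if the mean turns out to hide too much.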
