# Stock Analysis With SciKit Learn
In this lesson, we will use the Alpha Vantage API to pull stock price data for Microsoft, along with its Money Flow Index (MFI) indicator.

***
## Section One: Pull the Stock Price Data
The first thing we need to do is pull the stock price data. We will use the Alpha Vantage API because it is a rich API that offers a wide variety of data for free. The request breaks into a few parts. First, we define our API key, the base URL, and the endpoint (the API function we want to call). Next, we define the parameters of the pull: here I want the weekly prices for Microsoft stock, returned as JSON. Finally, we make the request, decode the JSON, and split the response into the metadata and the actual stock data.

In [3]:
import requests
import pprint
import json
import pandas as pd

# Define my API key, my base URL, and my endpoint (the API function to call)
DEVELOPER_KEY = ''
BASE_URL = 'https://www.alphavantage.co/query'
ENDPOINT = 'TIME_SERIES_WEEKLY'

# Define my parameters of the search
PARAMETERS = {'function':ENDPOINT,
             'symbol':'MSFT',
             'datatype':'json',  #CSV
             'apikey': DEVELOPER_KEY}

# Make a request to the ALPHA VANTAGE API
response = requests.get(url = BASE_URL,
                        params = PARAMETERS)

# Decode the response
encoded_response = response.json()

# break the data into two components, meta data & actual data
meta_data = encoded_response['Meta Data']
time_series_data = encoded_response['Weekly Time Series']

<div class="alert alert-block alert-info">
<b>Tip:</b> If you would like more info on the Alpha Vantage API, I encourage you to visit their website. https://www.alphavantage.co/documentation/
</div>
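The cell above indexes straight into `encoded_response`, so a rejected request surfaces as a bare `KeyError`. Below is a minimal sketch of a more defensive lookup. It assumes that a failed call comes back as a JSON object carrying an `Error Message`, `Note`, or `Information` key instead of the expected sections, which is worth confirming against the documentation. The helper name `extract_section` and both sample payloads are hypothetical.

```python
def extract_section(payload, section_key):
    """Return payload[section_key], or raise with the API's own message."""
    if section_key in payload:
        return payload[section_key]
    # Assumption: failures arrive as a JSON object with one of these keys.
    for key in ("Error Message", "Note", "Information"):
        if key in payload:
            raise ValueError(f"Alpha Vantage returned {key!r}: {payload[key]}")
    raise KeyError(f"{section_key!r} not found; got keys {sorted(payload)}")


# Hypothetical payloads, used only to exercise the helper.
ok_payload = {
    "Meta Data": {"2. Symbol": "MSFT"},
    "Weekly Time Series": {"2019-03-08": {"4. close": "1.00"}},
}
bad_payload = {"Error Message": "Invalid API call."}

weekly = extract_section(ok_payload, "Weekly Time Series")
print(weekly)
```

With the real response, the same helper could be called as `extract_section(encoded_response, 'Weekly Time Series')`.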

***
## Convert the Data Frame Data Types
The data we have is good, but we need it in a format that is easier to work with. Let's load it into a pandas DataFrame using the `from_dict()` method, which lets us specify the orientation of the keys. In this example I want the data oriented on the index; in other words, the date of each week becomes a row label and the fields reported for that week become the columns.

We now have a data frame, but the column names and data types aren't right, so we should change them. The approach is the same for both: create a dictionary whose keys are the current column names and whose values are the new column name or the data type we want for that column. Then use the `rename()` method to rename the columns and the `astype()` method to convert the data types.

In [4]:
# create a dataframe with the data.
stock_dataframe = pd.DataFrame.from_dict(time_series_data, orient='index')

# redefine new column names, we will use a dictionary where the key is the old name and the value is the new name.
column_keys = {'1. open':'open','2. high':'high','3. low':'low','4. close':'close','5. volume':'volume'}

# rename the columns using the rename() method and we need to make sure that we assign this new data frame to a variable
stock_dataframe = stock_dataframe.rename(columns=column_keys)

# get the data types of each column, they're all objects. Let's fix that.
display('This was the old data type.')
display(stock_dataframe.dtypes)

# define dictionary of data type keys, the key is the column name, the value is the data type we want it to be.
data_keys = {'open': float,'high': float,'low': float,'close': float,'volume': int}

# call the as type method to convert the columns
stock_dataframe = stock_dataframe.astype(data_keys)

display('This is the new data type.')
display(stock_dataframe.dtypes)

'This was the old data type.'

open      object
high      object
low       object
close     object
volume    object
dtype: object

'This is the new data type.'

open      float64
high      float64
low       float64
close     float64
volume      int32
dtype: object
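To see what the `orient` argument changes, here is a small sketch on a hypothetical two-week sample shaped like the API's time series (the values are made up for illustration). It builds the frame both ways and then applies the same `rename()` and `astype()` steps.

```python
import pandas as pd

# Hypothetical two-week sample in the same shape as the API's time series.
sample = {
    "2019-03-01": {"1. open": "1.0", "5. volume": "100"},
    "2019-03-08": {"1. open": "2.0", "5. volume": "200"},
}

by_index = pd.DataFrame.from_dict(sample, orient="index")      # dates become rows
by_columns = pd.DataFrame.from_dict(sample, orient="columns")  # dates become columns

# Rename the columns, then convert the string values to numeric types.
cleaned = (
    by_index
    .rename(columns={"1. open": "open", "5. volume": "volume"})
    .astype({"open": float, "volume": "int64"})
)
print(cleaned.dtypes)
```

Spelling the integer type as `'int64'` is a design choice in this sketch: plain `int` can map to a platform-dependent width, which is likely why the output above shows `int32`.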

***
## Change the Index to Datetime
By default the index has the `object` data type. Converting it to a datetime data type makes it much easier to manipulate. Here is how we do that:

In [5]:
# let's work with the index next, this is still considered an object but it is clearly a date.
display(stock_dataframe.index)

# Converting the index as date
stock_dataframe.index = pd.to_datetime(stock_dataframe.index)

display(stock_dataframe.index)

Index(['1998-01-09', '1998-01-16', '1998-01-23', '1998-01-30', '1998-02-06',
       '1998-02-13', '1998-02-20', '1998-02-27', '1998-03-06', '1998-03-13',
       ...
       '2019-01-04', '2019-01-11', '2019-01-18', '2019-01-25', '2019-02-01',
       '2019-02-08', '2019-02-15', '2019-02-22', '2019-03-01', '2019-03-08'],
      dtype='object', length=1105)

DatetimeIndex(['1998-01-09', '1998-01-16', '1998-01-23', '1998-01-30',
               '1998-02-06', '1998-02-13', '1998-02-20', '1998-02-27',
               '1998-03-06', '1998-03-13',
               ...
               '2019-01-04', '2019-01-11', '2019-01-18', '2019-01-25',
               '2019-02-01', '2019-02-08', '2019-02-15', '2019-02-22',
               '2019-03-01', '2019-03-08'],
              dtype='datetime64[ns]', length=1105, freq=None)
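As a quick illustration of what the conversion buys us, the sketch below uses a hypothetical four-row frame (made-up values) to show date-based selection and grouping, both of which rely on the index being a `DatetimeIndex`.

```python
import pandas as pd

# Hypothetical closing prices keyed by date strings, as the API delivers them.
frame = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0, 4.0]},
    index=["1998-01-09", "1998-01-16", "1998-02-06", "1999-01-08"],
)
frame.index = pd.to_datetime(frame.index)

january_1998 = frame.loc["1998-01"]   # every row in January 1998
through_feb = frame.loc[:"1998-02"]   # everything up to the end of February 1998
yearly_mean = frame["close"].groupby(frame.index.year).mean()

print(january_1998)
print(yearly_mean)
```

None of these selections would work on the original string index without extra parsing.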

***
## Adding New Date Columns
This is more for demonstration purposes, but now that we have the proper data type for our index we can create new columns that are segments of our index column.

In [7]:
# let's create some new columns that contain broken down dates.
stock_dataframe['year'] = stock_dataframe.index.year
stock_dataframe['month'] = stock_dataframe.index.month
stock_dataframe['day'] = stock_dataframe.index.day

print(stock_dataframe)

               open     high       low   close     volume  year  month  day
1998-01-09  131.250  133.630  125.8700  127.00   46857300  1998      1    9
1998-01-16  124.620  135.380  124.3700  135.25   40459900  1998      1   16
1998-01-23  134.130  139.880  134.0000  138.25   46621800  1998      1   23
1998-01-30  139.880  150.130  138.4500  149.19   46856000  1998      1   30
1998-02-06  151.750  158.750  150.5000  158.13   42349700  1998      2    6
1998-02-13  158.750  160.060  155.6300  157.50   37262800  1998      2   13
1998-02-20  158.500  158.500  152.8800  155.13   40268500  1998      2   20
1998-02-27   80.940   86.000   79.3700   84.75   95642150  1998      2   27
1998-03-06   85.870   85.870   79.2500   82.75   79005300  1998      3    6
1998-03-13   82.500   83.000   79.5000   82.37   55001400  1998      3   13
1998-03-20   82.440   83.000   79.6900   81.81   50579700  1998      3   20
1998-03-27   81.190   90.940   81.0600   87.81   73985400  1998      3   27
1998-04-03  

***
## Getting the MFI Data
With the stock data cleaned, we can get the MFI data and then go through a similar cleaning process. Keep in mind that the request is the same; we just need to change a few parameters. Here is the second request, using the 'MFI' endpoint.

In [5]:
# Define my API key, my base URL, and my endpoint (the API function to call)
DEVELOPER_KEY = ''
BASE_URL = 'https://www.alphavantage.co/query'
ENDPOINT = 'MFI'

# Define my parameters of the search
PARAMETERS = {'function':ENDPOINT,
             'symbol':'MSFT',
             'interval':'weekly',
             'time_period':'10',
             'datatype':'json',  #CSV
             'apikey': DEVELOPER_KEY}

# Make a request to the ALPHA VANTAGE API
response = requests.get(url = BASE_URL,
                        params = PARAMETERS)

# Decode the response
encoded_response = response.json()

# break the data into two components, meta data & actual data
meta_data = encoded_response['Meta Data']
mfi_data = encoded_response['Technical Analysis: MFI']
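For readers unfamiliar with the indicator, here is a minimal sketch of the textbook Money Flow Index calculation in pandas. It is not necessarily how Alpha Vantage computes its values, and the function name `money_flow_index` and the sample numbers are hypothetical. The idea: take the typical price (the mean of high, low, and close), multiply it by volume to get the raw money flow, sum the flows on rising and falling periods separately, and scale the ratio to the 0 to 100 range.

```python
import pandas as pd


def money_flow_index(high, low, close, volume, period=10):
    """Textbook MFI over `period` rows; inputs are aligned pandas Series."""
    typical_price = (high + low + close) / 3
    raw_flow = typical_price * volume
    change = typical_price.diff()  # NaN on the first row

    # Flow counts as positive when the typical price rose, negative when it fell.
    # The first row has no previous price, so it is left as NaN.
    positive = raw_flow.where(change > 0, 0.0).where(change.notna())
    negative = raw_flow.where(change < 0, 0.0).where(change.notna())

    pos_sum = positive.rolling(period).sum()
    neg_sum = negative.rolling(period).sum()

    # Same as 100 - 100 / (1 + pos_sum / neg_sum), without dividing by zero
    # when there is no negative flow in the window.
    return 100 * pos_sum / (pos_sum + neg_sum)


# Hypothetical prices (high = low = close) with constant volume.
prices = pd.Series([10.0, 11.0, 10.0, 12.0])
ones = pd.Series([1.0, 1.0, 1.0, 1.0])
mfi = money_flow_index(prices, prices, prices, ones, period=3)
print(mfi)
```

Because each value needs `period` price changes, the first `period` rows come out as NaN. That is consistent with the MFI series in this lesson starting later than the price series.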

***
## Clean the New Data
The data looks good, so let's apply the same transformations to this data set: the data type conversion and the index conversion.

In [6]:
# create a dataframe with the data.
mfi_dataframe = pd.DataFrame.from_dict(mfi_data, orient='index')

# define dictionary of data type keys, the key is the column name, the value is the data type we want it to be.
data_keys = {'MFI': float}

# call the as type method to convert the columns
mfi_dataframe = mfi_dataframe.astype(data_keys)

# convert the index to a datetime.
mfi_dataframe.index = pd.to_datetime(mfi_dataframe.index)

display(mfi_dataframe.index)

DatetimeIndex(['1998-03-20', '1998-03-27', '1998-04-03', '1998-04-09',
               '1998-04-17', '1998-04-24', '1998-05-01', '1998-05-08',
               '1998-05-15', '1998-05-22',
               ...
               '2019-01-04', '2019-01-11', '2019-01-18', '2019-01-25',
               '2019-02-01', '2019-02-08', '2019-02-15', '2019-02-22',
               '2019-03-01', '2019-03-07'],
              dtype='datetime64[ns]', length=1095, freq=None)

***
## Merge the Two Datasets
With the two data frames cleaned, we need to merge them. The simplest way is the `join()` method: we join `mfi_dataframe` onto `stock_dataframe` using an outer join. Afterwards, some rows will have NaN in the MFI column because no MFI value exists for those dates, so we drop them with the `dropna()` method. I also chose to filter the data so that we only keep rows from 2014 onward (years greater than 2013).

In [7]:
# we need to merge the two data frames, we will do an outer join.
price_mfi_data = stock_dataframe.join(mfi_dataframe, how='outer')

# next let's filter it so it only has data points beyond 2013
the_filter = (price_mfi_data['year'] > 2013)

# create a new data frame that is the filtered version of the old one.
stock_dataframe = price_mfi_data.loc[the_filter]
stock_dataframe = stock_dataframe.dropna()
stock_dataframe.head()

             open   high    low   close     volume  year  month  day      MFI
2014-01-03  37.22  37.58  36.60  36.910   95561000  2014      1    3  62.2169
2014-01-10  36.85  36.89  35.40  36.040  216443300  2014      1   10  51.5640
2014-01-17  35.99  37.00  34.63  36.380  216624000  2014      1   17  38.4886
2014-01-24  36.82  37.55  35.52  36.805  173821100  2014      1   24  37.4011
2014-01-31  36.87  37.89  35.75  37.840  261570800  2014      1   31  49.8434
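The outer join followed by `dropna()` can be checked on a miniature example. The two frames below are hypothetical stand-ins with made-up values. Their last dates deliberately differ by one day, mirroring the `2019-03-08` and `2019-03-07` entries visible in the two index printouts earlier.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
prices = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2019-02-22", "2019-03-01", "2019-03-08"]),
)
mfi = pd.DataFrame(
    {"MFI": [50.0, 60.0]},
    index=pd.to_datetime(["2019-03-01", "2019-03-07"]),
)

outer = prices.join(mfi, how="outer")  # keeps every date from both frames
inner = prices.join(mfi, how="inner")  # keeps only dates present in both

print(outer)
print(outer.dropna())
```

In this sketch, dropping the incomplete rows from the outer join leaves the same dates as an inner join. It also shows that a week whose dates do not line up exactly across the two sources is discarded.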


In [20]:
stock_dataframe[['open','close','volume']].corr()

            open     close    volume
open    1.000000  0.996703 -0.183588
close   0.996703  1.000000 -0.196329
volume -0.183588 -0.196329  1.000000


In [8]:
# we need to define our x (input variable) & y (output variable)

# the input will be everything minus our close price column
X = stock_dataframe.drop('close', axis = 1)
Y = stock_dataframe[['close']]

In [9]:
# now we will split the data into a train & test set.
from sklearn.model_selection import train_test_split

# Split X and Y into training and test sets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)

In [10]:
# let's build our model.
from sklearn.linear_model import LinearRegression

# create a Linear Regression model object.
regression_model = LinearRegression()

# pass through the X_train & y_train data set.
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [11]:
# let's take a look at the coefficients for each column
for idx, col_name in enumerate(X_train.columns):
    print("Coefficient for {}: {}".format(col_name, regression_model.coef_[0][idx]))

Coefficient for open: -0.4756567145167192
Coefficient for high: 0.8733142398708361
Coefficient for low: 0.6060052209412231
Coefficient for volume: -2.7477528897872318e-09
Coefficient for year: -0.07053372221122134
Coefficient for month: -0.012995965324162074
Coefficient for day: -0.0029014077962713834
Coefficient for MFI: 0.000561933953040316


In [12]:
# now that we have our coefficients let's look at the intercept
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))

The intercept for our model is 142.28150272046634


In [13]:
# how much of the variability in y is explained by X (the R^2 score); in this case it's extremely high, over 99%
regression_model.score(X_test, y_test)

0.9984991774906313
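The `score()` method of `LinearRegression` returns the coefficient of determination, R². As a sketch of what that number means, the helper below (the name `r_squared` is hypothetical, as are the sample values) computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1.0 - ss_res / ss_tot


# Hypothetical values: three perfect predictions and one that is off by 1.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
score = r_squared(actual, predicted)
print(score)
```

A score near 1 means the predictions leave very little of the variance in the target unexplained.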

In [14]:
from sklearn.metrics import mean_squared_error
import math 

y_predict = regression_model.predict(X_test)
regression_model_mse = mean_squared_error(y_test, y_predict)
math.sqrt(regression_model_mse)

0.9103074062190152
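The last cell computes the root mean squared error (RMSE): the square root of the average squared difference between the actual and predicted values. Below is a minimal hand-rolled sketch of the same calculation; the helper name `rmse` and the sample values are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


# Hypothetical values: three perfect predictions and one that is off by 2.
value = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(value)
```

Because RMSE is expressed in the same units as the target, the result above can be read as a typical error of roughly 0.91 in the closing price on the test set.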