# Obtain Dataset from Yahoo Finance

**All of the datasets used for this website were downloaded from https://finance.yahoo.com/. The yfinance module lets us pull stock data into our notebooks from Python code, and the pandas_datareader module lets us read the extracted data in as a dataframe. yfinance only works with pandas version 0.24.2 and newer, so we need to upgrade pandas to the newest version before running this notebook. Neither yfinance nor pandas_datareader is part of the base installation on the server, so the modules must be reinstalled every time the server restarts. The following cell uses the %%bash cell magic to run the terminal commands that install the needed modules.**

In [6]:
%%bash
pip install --upgrade pandas;
pip install pandas_datareader;
pip install yfinance;

Requirement already up-to-date: pandas in /opt/conda/lib/python3.6/site-packages (0.24.2)


In [7]:
# Modules needed to extract data 
import pandas as pd
from pandas_datareader import data as pdr
import yfinance as yf 

## Store stock data as CSV

*The zipline and yfinance modules are not compatible, since zipline only works with an older version of pandas. To work around this problem, we use this notebook only to obtain the dataset: we save each dataset to a CSV file and then open the CSV files in the main notebook. The functions for saving and reading CSV files are the same in both versions of pandas. For the project itself, we work with pandas version 0.22.0, since that is the version compatible with zipline.*
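
The round trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the main notebook: the price values are made up, `EXAMPLE.csv` is a hypothetical file name, and the file is written to a temporary directory so the sketch does not touch the project files.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a downloaded price table (values are made up for illustration).
prices = pd.DataFrame(
    {"open": [100.0, 101.5, 102.0], "close": [101.0, 102.5, 101.0]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)
prices.index.name = "Date"

# Step 1 (this notebook): write the dataframe to a CSV file.
csv_path = os.path.join(tempfile.mkdtemp(), "EXAMPLE.csv")  # hypothetical file name
prices.to_csv(csv_path)

# Step 2 (main notebook): read the CSV back.
# index_col=0 restores the date column as the index,
# and parse_dates=True converts it back into datetimes.
restored = pd.read_csv(csv_path, index_col=0, parse_dates=True)
print(restored)
```

Passing `index_col=0` and `parse_dates=True` matters here because a CSV file stores the dates as plain text; without them the dates would come back as an ordinary string column.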

## Decision for stock interval

**The zipline algorithm iterates in either daily or minute intervals. We decided to work with daily data because we are not interested in day trading; we are interested in trading every couple of days. Working with daily data also simplifies the project, since it avoids the high volatility of intraday stock data. We restrict ourselves to the US stock market, simply because we are studying at a US university. The New York Stock Exchange was formally organized in 1817, but we will not use all of the data available from the stock exchange, because we are interested in creating a portfolio that uses recent data. A major economic event in modern US history was the recession of 2008, which collapsed the market. To simplify the project, we chose a year after 2008 as the starting year for the downloaded stock data (2013 in the cell below). The stocks chosen are a subset of the holdings of a well-established mutual fund. SPY is the benchmark for our portfolio, which we will use to compare how well different models work. SPY is an exchange-traded fund that tracks the S&P 500, a stock index of roughly 500 of the largest publicly traded US companies. We hope to outperform SPY. If time permits, we will select stocks that are uncorrelated, since that allows for a well-balanced portfolio. Finally, the end date for the downloaded stock data is 2018, the most recent complete year of stock data at the time of writing.**
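
One possible way to check how correlated a set of stocks is, sketched here under stated assumptions: the tickers `AAA`, `BBB`, and `CCC` and their prices are made up, and the Pearson correlation of daily returns is only one of several measures that could be used. This is not a method the project has settled on.

```python
import pandas as pd

# Made-up daily closing prices for three hypothetical tickers.
close = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 104.0, 103.0],
        "BBB": [50.0, 51.2, 50.4, 52.1, 51.4],
        "CCC": [20.0, 19.5, 20.5, 19.0, 21.0],
    },
    index=pd.to_datetime(
        ["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07", "2013-01-08"]
    ),
)

# Daily percentage returns; the first row has no previous day, so it is dropped.
returns = close.pct_change().dropna()

# Pairwise Pearson correlation of the daily returns.
# Values near 1 mean two stocks tend to move together;
# values near 0 or below suggest they diversify each other.
corr = returns.corr()
print(corr.round(2))
```

Correlations are computed on returns rather than raw prices because prices that merely trend upward over the same period can look strongly correlated even when their day-to-day movements are unrelated.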

In [8]:
# Route pandas_datareader's get_data_yahoo through yfinance
yf.pdr_override()

# Conditions for our stock data
startDate = '2013-01-01'
endDate = '2018-12-31'
intervalCycle = '1d'

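# Small helper class that downloads price data for one ticker
class StockDownLoad:
    def __init__(self, stock, start, end, interval):
        self.stock = stock        # ticker symbol, e.g. 'SPY'
        self.start = start        # start date of the download (YYYY-MM-DD)
        self.end = end            # end date of the download (YYYY-MM-DD)
        self.interval = interval  # bar size, e.g. '1d' for daily data
    
    def GetData(self): # Extract the data from Yahoo Finance as a dataframe
        return pdr.get_data_yahoo(self.stock, self.start, self.end, interval=self.interval)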
    
    
SPYStock = StockDownLoad('SPY',startDate, endDate, intervalCycle).GetData()
AAPLStock = StockDownLoad('AAPL',startDate, endDate, intervalCycle).GetData()
AMZNStock = StockDownLoad('AMZN',startDate, endDate, intervalCycle).GetData()
BAStock = StockDownLoad('BA',startDate, endDate, intervalCycle).GetData()
FBStock = StockDownLoad('FB',startDate, endDate, intervalCycle).GetData()
GOOGStock = StockDownLoad('GOOG',startDate, endDate, intervalCycle).GetData()
MAStock = StockDownLoad('MA',startDate, endDate, intervalCycle).GetData()
MSFTStock = StockDownLoad('MSFT',startDate, endDate, intervalCycle).GetData()
NVDAStock = StockDownLoad('NVDA',startDate, endDate, intervalCycle).GetData()
UNHStock = StockDownLoad('UNH',startDate, endDate, intervalCycle).GetData()
VStock = StockDownLoad('V',startDate, endDate, intervalCycle).GetData()

[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded


## Initial Data Wrangling

**The datasets obtained through the yfinance module contain some duplicate rows. Each duplicate row contains only the date, with missing values in the remaining columns. We decided to remove these rows since they provide no additional information. Removing them does not affect the data, because the corresponding complete row for the same date has no missing values.**
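
To illustrate the date-only duplicate rows described above, here is a minimal sketch on a toy dataframe. The values are made up and only mimic the pattern; they are not taken from the downloaded data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 2013-01-03 appears twice,
# once as a complete row and once with only the date filled in.
raw = pd.DataFrame(
    {
        "open": [100.0, 101.5, np.nan, 102.0],
        "close": [101.0, 102.5, np.nan, 101.0],
    },
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-03", "2013-01-04"]),
)

# Inspect every row whose date occurs more than once.
repeated = raw[raw.index.duplicated(keep=False)]
print(repeated)

# Dropping rows with missing values removes the empty duplicate
# and keeps the complete row for the same date.
clean = raw.dropna()
print(clean)
```

Inspecting the repeated dates first is a cheap way to confirm that each empty row really has a complete counterpart before it is dropped.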

In [9]:
# Zipline only works with lowercase column names, so convert the columns of every dataframe to lowercase
# All the stock dataframes
StockData = [SPYStock, AAPLStock, AMZNStock, BAStock, FBStock, GOOGStock, MAStock, MSFTStock,
             NVDAStock, UNHStock, VStock]
# Lowercase the column names and drop the date-only rows that contain missing values
for stock in StockData:
    stock.columns = stock.columns.str.lower()
    stock.dropna(inplace=True)

In [10]:
# Save each dataframe as <ticker>.csv
stocks = ["SPY", "AAPL", "AMZN", "BA", "FB", "GOOG", "MA", "MSFT", "NVDA", "UNH", "V"]

for data, stock in zip(StockData,stocks):
    data.to_csv('{}.csv'.format(stock))