# Obtain Dataset from Yahoo Finance

We obtain our dataset from Yahoo Finance. The yfinance module lets us download stock data from Python code directly into our notebooks, and the pandas_datareader module reads the downloaded data in as a dataframe. The yfinance module only works with pandas version 0.24.2 and newer, so pandas must be upgraded before this notebook is run. None of these packages is included with the base modules of our Jupyter notebook; the following cell installs them.

In [1]:
%%bash
pip install --upgrade pandas;
pip install pandas_datareader;
pip install yfinance;

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/19/74/e50234bc82c553fecdbd566d8650801e3fe2d6d8c8d940638e3d8a7c5522/pandas-0.24.2-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: pandas
  Found existing installation: pandas 0.22.0
    Uninstalling pandas-0.22.0:
      Successfully uninstalled pandas-0.22.0
Successfully installed pandas-0.24.2
Collecting yfinance
Collecting requests>=2.20 (from yfinance)
  Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting multitasking>=0.0.7 (from yfinance)
Installing collected packages: requests, multitasking, yfinance
  Found existing installation: requests 2.12.4
    Uninstalling requests-2.12.4:
      Successfully uninstalled requests-2.12.4
Successfully installed multitasking-0.0.9 requests-2.22.0 yfinance-0.1.42


ERROR: zipline 1.3.0 has requirement pandas<=0.22,>=0.18.1, but you'll have pandas 0.24.2 which is incompatible.
ERROR: zipline 1.3.0 has requirement pandas<=0.22,>=0.18.1, but you'll have pandas 0.24.2 which is incompatible.
ERROR: okpy 1.14.15 has requirement requests==2.12.4, but you'll have requests 2.22.0 which is incompatible.
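
Since the rest of this notebook depends on the upgraded pandas, it can help to confirm the installed version before continuing. The following is a minimal sketch and not part of the original notebook: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing assumes plain dotted version strings such as the ones in the log above.

```python
import re


def version_tuple(version):
    """Turn a dotted version string such as '0.24.2' into (0, 24, 2)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="0.24.2"):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("0.22.0"))  # the version that pip replaced above
print(meets_minimum("0.24.2"))  # the version that pip installed above
```

In the notebook itself, the same check could be applied to `pd.__version__` after importing pandas.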


In [2]:
# Modules needed to extract data 
import pandas as pd
from pandas_datareader import data as pdr
import yfinance as yf 

## Store stock data as CSV

Zipline and yfinance are not compatible because zipline requires an older version of pandas (the install log above shows zipline 1.3.0 requiring pandas<=0.22). To work around this problem, we use this separate notebook only to obtain the dataset: we save the data to CSV files and then open those files in the main notebook. The functions for saving and reading CSV files are the same in both versions of pandas. For the rest of the project, we work with pandas version 0.22.0, since that is the version compatible with zipline.
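
The CSV hand-off between the two notebooks can be illustrated with a small round trip. This is only a sketch under assumptions: the prices are made up, the index name `Date` is assumed to match what the download produces, and an in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Made-up prices, purely for illustration.
frame = pd.DataFrame(
    {"open": [10.0, 10.5], "close": [10.25, 10.75]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03"]),
)
frame.index.name = "Date"

# Write the dataframe as CSV text, as this notebook does with files.
buffer = io.StringIO()
frame.to_csv(buffer)

# Read it back, as the main notebook would, restoring the date index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)
print(restored)
```

Passing `index_col` and `parse_dates` when reading is what turns the first CSV column back into a date index; without them the dates come back as an ordinary string column.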

### Decision for stock interval

Zipline runs its algorithms at either daily or minute frequency. To make our data compatible with zipline, we extract daily data for the year 2018, the most recent year with complete data.

In [3]:
# Route pandas_datareader's get_data_yahoo through yfinance
yf.pdr_override()

# Conditions for our stock data: daily bars for the year 2018
startDate = '2018-01-01'
endDate = '2018-12-31'
intervalCycle = '1d'

# Small helper class that downloads the price history of one ticker
class StockDownLoad:
    def __init__(self, stock, start, end, interval):
        self.stock = stock        # ticker symbol, e.g. 'SPY'
        self.start = start        # start date as 'YYYY-MM-DD'
        self.end = end            # end date as 'YYYY-MM-DD'
        self.interval = interval  # bar size, e.g. '1d'
    
    def GetData(self):
        # Extract the data from Yahoo Finance as a dataframe
        return pdr.get_data_yahoo(self.stock, self.start, self.end, interval=self.interval)
    
    
SPYStock = StockDownLoad('SPY',startDate, endDate, intervalCycle).GetData()
AAPLStock = StockDownLoad('AAPL',startDate, endDate, intervalCycle).GetData()
AMZNStock = StockDownLoad('AMZN',startDate, endDate, intervalCycle).GetData()
BAStock = StockDownLoad('BA',startDate, endDate, intervalCycle).GetData()
FBStock = StockDownLoad('FB',startDate, endDate, intervalCycle).GetData()
GOOGStock = StockDownLoad('GOOG',startDate, endDate, intervalCycle).GetData()
MAStock = StockDownLoad('MA',startDate, endDate, intervalCycle).GetData()
MSFTStock = StockDownLoad('MSFT',startDate, endDate, intervalCycle).GetData()
NVDAStock = StockDownLoad('NVDA',startDate, endDate, intervalCycle).GetData()
UNHStock = StockDownLoad('UNH',startDate, endDate, intervalCycle).GetData()
VStock = StockDownLoad('V',startDate, endDate, intervalCycle).GetData()

[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
[*********************100%***********************]  1 of 1 downloaded
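
The eleven near-identical calls above could also be written as a loop over ticker symbols. The sketch below is an assumption about how that might look, not the notebook's actual code: `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real download so the example runs offline.

```python
def download_all(tickers, fetch):
    """Call fetch(ticker) once per ticker and collect the results by symbol."""
    return {ticker: fetch(ticker) for ticker in tickers}


def fake_fetch(ticker):
    """Stand-in for a real download; returns a placeholder string."""
    return "data for " + ticker


tickers = ["SPY", "AAPL", "AMZN"]
frames = download_all(tickers, fake_fetch)
print(frames["AAPL"])
```

In this notebook, `fetch` could be a small function that wraps `StockDownLoad(ticker, startDate, endDate, intervalCycle).GetData()`, which would keep each dataframe next to its ticker symbol in one dictionary.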


## Initial Data Wrangling

*The dataset obtained through these modules contains some duplicate rows. Each duplicate row contains only the date, with missing values in all the remaining variables. We decided to remove these rows since they provide no additional information. Removing them does not affect the data, because the corresponding complete row for the same date has no missing values.*
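
The effect of dropping these rows can be seen on a toy dataframe. This is a sketch with made-up values, assuming the duplicates look as described: a repeated date whose other columns are all missing.

```python
import numpy as np
import pandas as pd

# Made-up data: 2018-01-02 appears twice, once complete and once empty.
dates = pd.to_datetime(["2018-01-02", "2018-01-02", "2018-01-03"])
frame = pd.DataFrame(
    {"open": [10.0, np.nan, 10.5], "close": [10.25, np.nan, 10.75]},
    index=dates,
)

# Dropping rows with missing values removes the date-only duplicate.
cleaned = frame.dropna()

print(len(frame), len(cleaned))  # row counts before and after
print(cleaned.index.is_unique)   # whether every date now appears once
```

Because the duplicate carries no values, `dropna` is enough here; no separate de-duplication step is needed as long as the complete row has no missing values.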

In [4]:
# Zipline only works with lowercase column names, so convert the columns of each dataframe to lowercase
# All the stock dataframes
StockData = [SPYStock, AAPLStock, AMZNStock, BAStock, FBStock, GOOGStock, MAStock, MSFTStock,
             NVDAStock, UNHStock, VStock]
# Lowercase the column names and drop the date-only rows with missing values
for stock in StockData:
    stock.columns = [column.lower() for column in stock.columns]
    stock.dropna(inplace=True)

In [5]:
# Save each dataframe as a CSV file named after its ticker
stocks = ["SPY", "AAPL", "AMZN", "BA", "FB", "GOOG", "MA", "MSFT", "NVDA", "UNH", "V"]

for data, stock in zip(StockData,stocks):
    data.to_csv('{}.csv'.format(stock))
