# Data Management

Here we will discuss how we can set up the system to retrieve, store, load, and prepare the data for analysis.

## Setting Up Directories

We need to set up three directories:

- Root directory, which holds these tutorials
- Data directory, which stores the downloaded stock data
- Function directory, which stores reusable functions

In [2]:
%ls

Backtesting Basics.ipynb        Intro to Strategies and Considerations.ipynb
Data_management.ipynb           README.md
Definitions and Formulas.ipynb  stockdata/
functions/


In [4]:
%load_ext rpy2.ipython

In [53]:
%%R
rootdir <- "/home/ck1/Documents/Projects/Python/QuantTrade/"
datadir <- "/home/ck1/Documents/Projects/Python/QuantTrade/stockdata/"
functiondir <- "/home/ck1/Documents/Projects/Python/QuantTrade/functions/"
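
The paths above are hard-coded to one machine. As a minimal sketch, assuming you would rather derive the sub-directories from a single root and create them when they are missing, the setup could look like the following. The `tempdir()` root is a placeholder chosen so the example does not touch real folders; it is not the notebook's actual location.

```r
# Hypothetical project root; tempdir() keeps the sketch away from real folders.
rootdir     <- file.path(tempdir(), "QuantTrade")
datadir     <- file.path(rootdir, "stockdata")
functiondir <- file.path(rootdir, "functions")

for (d in c(rootdir, datadir, functiondir)) {
  # recursive = TRUE also creates missing parent directories;
  # showWarnings = FALSE silences the warning if the directory already exists.
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that all three directories now exist.
print(dir.exists(c(rootdir, datadir, functiondir)))
```

Note that `file.path()` does not append a trailing slash, so with this variant later paths would be built with `file.path(functiondir, "quandlfunc.R")` rather than `paste0()`.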

## URL Query Building

It seems that the Yahoo stock service has been closed, so we will use the Quandl R package to extract some data and see how it works. We need to set up a function that downloads all historical data between a start date and an end date.

In [33]:
%%R
library(Quandl) # load the library
Quandl.api_key("NcNrBJn7i4MR4k3u8D1t") # set the API key that Quandl issues to your account
GOOGL <- Quandl("WIKI/GOOGL")
head(GOOGL) # rows run from the most recent date back to the oldest

        Date    Open    High     Low   Close  Volume Ex-Dividend Split Ratio
1 2018-01-17 1136.36 1139.32 1123.49 1139.10 1353097           0           1
2 2018-01-16 1140.31 1148.88 1126.66 1130.70 1783881           0           1
3 2018-01-12 1110.10 1131.30 1108.01 1130.65 1914460           0           1
4 2018-01-11 1112.31 1114.85 1106.48 1111.88 1102461           0           1
5 2018-01-10 1107.00 1112.78 1103.98 1110.14 1027781           0           1
6 2018-01-09 1118.44 1118.44 1108.20 1112.79 1335995           0           1
  Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
1   1136.36   1139.32  1123.49    1139.10     1353097
2   1140.31   1148.88  1126.66    1130.70     1783881
3   1110.10   1131.30  1108.01    1130.65     1914460
4   1112.31   1114.85  1106.48    1111.88     1102461
5   1107.00   1112.78  1103.98    1110.14     1027781
6   1118.44   1118.44  1108.20    1112.79     1335995


## Data Acquisition

Now we want to fetch a list of desired stocks. We will start with the S&P 500.

In [34]:
%%R

url <- "http://trading.chrisconlan.com/SPstocks.csv"
S <- as.character(read.csv(url,header=FALSE)[,1])
setwd(rootdir)
dump(list="S","S.R")

Now we create a workflow that downloads the data for every ticker in `S` into the data directory. Symbols that fail to download are tracked in `invalid`, which is loaded from `invalid.R` in the root directory if that file exists.

In [47]:
%%R

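# create a function that downloads the full WIKI price table for one symbol
# as a data frame, returning NULL if the request fails
quandlfunc <- function(sym, start="2000-01-01"){
    library(data.table)
    tryCatch(
        suppressWarnings(
            # Quandl expects the start of the date range as `start_date`
            Quandl(paste0("WIKI/",sym), start_date = start)),
        error = function(e) NULL
    )
}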
setwd(functiondir)
dump(list=c("quandlfunc"), "quandlfunc.R")

In [48]:
%%R

GOOGL <- quandlfunc("GOOGL")
head(GOOGL)

        Date    Open    High     Low   Close  Volume Ex-Dividend Split Ratio
1 2018-01-17 1136.36 1139.32 1123.49 1139.10 1353097           0           1
2 2018-01-16 1140.31 1148.88 1126.66 1130.70 1783881           0           1
3 2018-01-12 1110.10 1131.30 1108.01 1130.65 1914460           0           1
4 2018-01-11 1112.31 1114.85 1106.48 1111.88 1102461           0           1
5 2018-01-10 1107.00 1112.78 1103.98 1110.14 1027781           0           1
6 2018-01-09 1118.44 1118.44 1108.20 1112.79 1335995           0           1
  Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
1   1136.36   1139.32  1123.49    1139.10     1353097
2   1140.31   1148.88  1126.66    1130.70     1783881
3   1110.10   1131.30  1108.01    1130.65     1914460
4   1112.31   1114.85  1106.48    1111.88     1102461
5   1107.00   1112.78  1103.98    1110.14     1027781
6   1118.44   1118.44  1108.20    1112.79     1335995


In [50]:
%%R

# load the "invalid.R" file if available
invalid <- character(0)
setwd(rootdir)
if("invalid.R" %in% list.files()) source("invalid.R")

# find all symbols that are neither in the data directory nor marked invalid
setwd(datadir)
toload <- setdiff(S[!paste0(S, ".csv") %in% list.files()], invalid) # only these symbols still need to be downloaded
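
To make the filtering step concrete, here is a small worked example with toy values. The ticker names, the file listing, and the `invalid` entry are all made up for illustration; `on_disk` stands in for the result of `list.files()` in the data directory.

```r
S       <- c("AAPL", "GOOGL", "MSFT", "XYZ")  # toy ticker list
on_disk <- c("AAPL.csv", "notes.txt")         # stands in for list.files()
invalid <- "XYZ"                              # symbol that failed previously

# Step 1: keep tickers whose "<ticker>.csv" file is not on disk yet.
not_downloaded <- S[!paste0(S, ".csv") %in% on_disk]

# Step 2: drop tickers already known to be invalid.
toload <- setdiff(not_downloaded, invalid)

print(toload)  # "GOOGL" "MSFT"
```

Because the list is rebuilt from the directory contents each time, rerunning the download cell after an interruption only fetches what is still missing.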



In [61]:
%%R

# fetch symbols with quandlfunc; save each as .csv or record it as invalid
source(paste0(functiondir, "quandlfunc.R"))
if(length(toload)!=0){
    for(i in 1:length(toload)){

        df <- quandlfunc(toload[i])

        if(!is.null(df)){
            # reverse the rows so the file runs from oldest to newest
            write.csv(df[nrow(df):1,], file = paste0(toload[i], ".csv"),row.names = FALSE)
        } else {
            invalid <- c(invalid, toload[i])
        }
    }
}
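
The loop collects failed symbols in `invalid`, but nothing shown in this notebook writes that vector back to disk. As a sketch, assuming the intent is to persist it as `invalid.R` in the root directory (the file the earlier cell looks for), `dump()` and `source()` give a simple round trip. The `rootdir` and `invalid` values below are stand-ins so the example runs on its own.

```r
# Stand-ins for the notebook's variables so the sketch is self-contained.
rootdir <- tempdir()
invalid <- c("FAKE1", "FAKE2")  # hypothetical symbols that failed to download

# Write `invalid` as R source code so a later session can restore it.
dump(list = "invalid", file = file.path(rootdir, "invalid.R"))

# Round trip: remove the object, then reload it from the file.
rm(invalid)
source(file.path(rootdir, "invalid.R"))
print(invalid)
```

With the file in place, known-bad symbols are skipped on the next run instead of being requested again.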

In [63]:
%%R

# Here is a check on which function is faster in loading data
library(quantmod)
library(microbenchmark)
microbenchmark(
  getSymbols("GOOGL"),
  Quandl("WIKI/GOOGL")
)

# a faster approach would be to vectorize the download and save steps




Unit: milliseconds
                 expr      min        lq      mean    median        uq      max
  getSymbols("GOOGL") 191.7464  273.1798  430.4343  337.5031  429.7114 2666.447
 Quandl("WIKI/GOOGL") 931.3664 1134.4128 1302.7230 1272.2823 1426.8681 2652.056
 neval cld
   100  a 
   100   b


Based on the `microbenchmark` results, `getSymbols` is roughly 3x faster than `Quandl` on average (a mean of about 430 ms versus 1,303 ms over 100 runs). So let's edit the above code to download with `getSymbols` and write the CSV files quickly using the `fwrite` function from `library(data.table)`.
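
One way this revision might look is sketched below. The names `save_symbol`, `fetch_quantmod`, and `fake_fetch` are hypothetical helpers, not part of the original notebook. The sketch assumes `getSymbols(..., auto.assign = FALSE)` returns an xts object that can be flattened into a data frame. The demonstration at the bottom uses a stand-in fetcher so that it runs without a network connection.

```r
library(data.table)

# Hypothetical helper: fetch one symbol and write it to <dir>/<sym>.csv with
# fwrite(). `fetch` is any function that takes a symbol and returns a data
# frame; an error inside it is treated as a failed download.
save_symbol <- function(sym, dir, fetch) {
  df <- tryCatch(fetch(sym), error = function(e) NULL)
  if (is.null(df)) return(FALSE)
  fwrite(df, file.path(dir, paste0(sym, ".csv")))
  TRUE
}

# Assumed quantmod-based fetcher: auto.assign = FALSE makes getSymbols()
# return the xts object, which is flattened here with a Date column.
fetch_quantmod <- function(sym) {
  x <- quantmod::getSymbols(sym, auto.assign = FALSE)
  data.frame(Date = zoo::index(x), zoo::coredata(x))
}

# Offline demonstration with a stand-in fetcher (no network call is made).
fake_fetch <- function(sym) {
  if (sym == "BAD") stop("unknown symbol")
  data.frame(Date = as.Date("2018-01-17"), Close = 1139.10)
}

outdir <- tempdir()
ok <- vapply(c("GOOGL", "BAD"), save_symbol, logical(1),
             dir = outdir, fetch = fake_fetch)
print(ok)
```

For a real run, pass `fetch = fetch_quantmod` and the notebook's `datadir`. The symbols that failed are then available as `names(ok)[!ok]`.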

We've successfully saved all the stock data of interest into our data folder. Now we need to remove every object from the environment except the path variables and functions.

In [None]:
%%R
rm(list = setdiff(ls(), c("rootdir", "functiondir", "datadir", "quandlfunc")))
gc()