Add Alpaca pricing source and update documentation

alpacahq · Jun 26, 2019 · 874654f
1 parent cd2cb74
commit 874654f
Showing 18 changed files with 373 additions and 70 deletions.
diff --git a/README.md b/README.md
@@ -15,11 +15,11 @@ If you are looking to use this library for your Quantopian algorithm,
 check out the [migration document](./migration.md).
 
 ## Data Sources
-This library predominantly relies on the [IEX public data API](https://iextrading.com/developer/docs/) for daily
-prices and fundamentals, but plans to connect to other data sources in
-the future. Currently supported data sources include the following.
+This library predominantly relies on the [Alpaca Data API](https://docs.alpaca.markets/api-documentation/api-v2/market-data/) for daily
+price data. For users with funded Alpaca brokerage accounts, several [Polygon](https://polygon.io/) fundamental
+data endpoints are supported. [IEX Cloud](https://iexcloud.io/docs/api/) data is also supported, but free IEX accounts
+have a limited monthly message allotment. (See the note in the IEX section below.)
 
-- [Alpaca/Polygon](https://docs.alpaca.markets/)
 
 ## Install
 
@@ -75,8 +75,47 @@ and returns a DataFrame with the data for the current date (US/Eastern time).
 Its constructor accepts `list_symbol` function that is supposed to return the full set of
 symbols as a string list, which is used as the maximum universe inside the engine.
 
+## Alpaca Data API
+The [Alpaca Data API](https://docs.alpaca.markets/api-documentation/api-v2/market-data/) is currently the least-limited source of pricing data
+supported by pipeline-live. In order to use the Alpaca Data API, you'll need to
+register for an Alpaca account [here](https://app.alpaca.markets/signup) and generate API key information with
+the dashboard. Once you have your keys generated, you need to store them in
+the following environment variables:
+
+```
+APCA_API_BASE_URL
+APCA_API_KEY_ID
+APCA_API_SECRET_KEY
+```
+
+### pipeline_live.data.alpaca.pricing.USEquityPricing
+This class provides the basic price information retrieved from
+[Alpaca Data API](https://docs.alpaca.markets/api-documentation/api-v2/market-data/bars/).
+
+## Polygon Data Source API
+You will need to set an [Alpaca](https://alpaca.markets/) API key as `APCA_API_KEY_ID` to use this API.
+
+### pipeline_live.data.polygon.fundamentals.PolygonCompany
+This class provides the DataSet interface using
+[Polygon Symbol Details API](https://polygon.io/docs/#!/Meta-Data/get_v1_meta_symbols_symbol_company)
+
+### pipeline_live.data.polygon.filters.IsPrimaryShareEmulation
+Experimental. This class filters symbols by the following
+rules to return something close to
+[IsPrimaryShare()](https://www.quantopian.com/help#quantopian_pipeline_filters_fundamentals_IsPrimaryShare) in Quantopian.
+
+- must be a US company
+- must have valid financial data
+
 ## IEX Data Source API
-You don't have to configure anything to use these API
+To use IEX-source data, you need to sign up for an IEX Cloud account and save
+your IEX token as an environment variable called `IEX_TOKEN`.
+
+IMPORTANT NOTE: IEX data is now limited for free accounts. In order to
+avoid using more messages than you are allotted each month, please
+be sure that you are not using IEX-sourced factors too frequently
+or on too many securities. For more information about how many messages
+each method will cost, please refer to [this part](https://iexcloud.io/docs/api/#data-weighting) of the IEX Cloud documentation.
 
 ### pipeline_live.data.iex.pricing.USEquityPricing
 This class provides the basic price information retrieved from
@@ -102,18 +141,3 @@ A shortcut for `IEXCompany.sector.latest`
 
 ### pipeline_live.data.iex.classifiers.Industry()
 A shortcut for `IEXCompany.industry.latest`
-
-## Alpaca/Polygon Data Source API
-You will need to set [Alpaca](https://alpaca.markets/) API key to use these API.
-
-### pipeline_live.data.polygon.fundamentals.PolygonCompany
-This class provides the DataSet interface using
-[Polygon Symbol Details API](https://polygon.io/docs/#!/Meta-Data/get_v1_meta_symbols_symbol_company)
-
-### pipeline_live.data.polygon.filters.IsPrimaryShareEmulation
-Experimental. This class filteres symbols by the following
-rule to return something close to
-[IsPrimaryShare()](https://www.quantopian.com/help#quantopian_pipeline_filters_fundamentals_IsPrimaryShare) in Quantopian.
-
-- must be a US company
-- must have a valid financial data
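The README changes above ask users to export several environment variables before using the Alpaca or IEX sources. The following is a minimal sketch of a pre-flight check; the variable names come from the README text, while `missing_env_vars` is a hypothetical helper that is not part of pipeline-live.

```python
import os

# Variable names as listed in the README hunks of this commit.
ALPACA_ENV_VARS = ("APCA_API_BASE_URL", "APCA_API_KEY_ID", "APCA_API_SECRET_KEY")
IEX_ENV_VARS = ("IEX_TOKEN",)


def missing_env_vars(names, env=None):
    """Return the entries of `names` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars(ALPACA_ENV_VARS)` before building a pipeline would list whichever settings still need to be exported.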
diff --git a/migration.md b/migration.md
@@ -12,20 +12,22 @@ pylivetrader can run the pipeline object from this package.
 ## USEquityPricing
 The most important class to think about first is the USEquityPricing class
 and it is well covered by
-`pipeline_live.data.iex.pricing.USEquityPricing` class.
+`pipeline_live.data.alpaca.pricing.USEquityPricing` class.
 This class gets the market-wide daily price data (OHLCV) up to the
-previous day from [IEX chart API](https://iextrading.com/developer/docs/#chart).
-Depending on the requested window length from its upstream pipeline, it
-fetches different size of the data range (e.g. 3m, 1y). Again, the volume of
-this data is market-wide size, so it's safe to use this with factors such
-as AverageDollarVolume.
+previous day from [Alpaca data API](https://docs.alpaca.markets/api-documentation/api-v2/market-data/bars/).
 
 ## Factors
 In order to use many of the builtin factors with this price data loader,
-you need to use `pipeline_live.data.iex.factors` package which has
-all the builtin factor classes ported from zipline.  
+you need to use `pipeline_live.data.alpaca.factors` package which has
+all the builtin factor classes ported from zipline. Use of the Alpaca data API
+requires an Alpaca account, which you can sign up for [here](https://app.alpaca.markets/signup).
 
-For example, if you have these lines,
+Once you have an Alpaca account, you will need to store your account info
+from their dashboard as environment variables. You can find information about
+how to do so on [this documentation page](https://docs.alpaca.markets/api-documentation/how-to/).
+
+To use the Alpaca factors, import them from `pipeline_live.data.alpaca.factors`.
+For example, if you have these lines on Quantopian,
 
 ```py
 from quantopian.pipeline.factors import (
@@ -37,10 +39,10 @@ from quantopian.pipeline.data.builtin import USEquityPricing
 you can rewrite it to something like this.
 
 ```py
-from pipeline_live.data.iex.factors import (
+from pipeline_live.data.alpaca.factors import (
     AverageDollarVolume, SimpleMovingAverage,
 )
-from pipeline_live.data.iex.pricing import USEquityPricing
+from pipeline_live.data.alpaca.pricing import USEquityPricing
 ```
 
 Of course, the builtin factor classes in the original zipline are mostly
@@ -49,27 +51,35 @@ ones, they also work with this `USEquityPricing`.
 
 ```py
 from zipline.pipeline.factors import AverageDollarVolume
-from pipeline_live.data.iex.pricing import USEquityPricing
+from pipeline_live.data.alpaca.pricing import USEquityPricing
 
 dollar_volume = AverageDollarVolume(
     inputs=[USEquityPricing.close, USEquityPricing.volume],
     window_length=20,
 )
 ```
 
-The only difference in the factor classes in `pipeline_live.data.iex.factors`
-is that some of the classes have IEX's USEquityPricing as the default
+The only difference in the factor classes in `pipeline_live.data.alpaca.factors`
+is that some of the classes have Alpaca's USEquityPricing as the default
 inputs, so you don't need to explicitly specify it.
 
 ## Fundamentals
 The Quantopian platform allows you to retrieve various proprietary data
 sources through pipeline, including Morningstar fundamentals. While the
 intention of this pipline-live library is to add more such proprietary
-data sources, the free alternative at the moment is IEX. There are two
+data sources, the alternative at the moment is IEX. Two
 main dataset classes are builtin in this library, `IEXCompany` and
 `IEXKeyStats`. Those both belong to the `pipeline_live.data.iex.fundamentals`
 package.
 
+Please note that, in order to use the IEX API data, you will need to sign up
+for an IEX Cloud account [here](https://iexcloud.io/cloud-login#/register/) and set your IEX Cloud token in the
+`IEX_TOKEN` environment variable. IEX limits your API messages per month. In
+order to avoid running over your message quota, please make sure that you
+filter your stock universe as much as possible before using IEX API data.
+If you wish to use IEX data to frequently filter a larger set of symbols, you
+may need to upgrade your IEX Cloud account.
+
 ### IEXCompany
 This dataset class maps the basic stock information from the
 [Company API](https://iextrading.com/developer/docs/#company).

diff --git a/pipeline_live/data/alpaca/__init__.py b/pipeline_live/data/alpaca/__init__.py
diff --git a/pipeline_live/data/alpaca/factors.py b/pipeline_live/data/alpaca/factors.py
@@ -0,0 +1,35 @@
+'''
+Duplicate builtin factor classes in zipline with Alpaca's USEquityPricing
+'''
+
+from zipline.pipeline.data import USEquityPricing as z_pricing
+from zipline.pipeline import factors as z_factors
+
+from .pricing import USEquityPricing as alpaca_pricing
+
+
+def _replace_inputs(inputs):
+    replacements = {
+        z_pricing.open: alpaca_pricing.open,
+        z_pricing.high: alpaca_pricing.high,
+        z_pricing.low: alpaca_pricing.low,
+        z_pricing.close: alpaca_pricing.close,
+        z_pricing.volume: alpaca_pricing.volume,
+    }
+
+    if type(inputs) not in (list, tuple, set):
+        return inputs
+    return tuple([
+        replacements.get(inp, inp) for inp in inputs
+    ])
+
+
+for name in dir(z_factors):
+    factor = getattr(z_factors, name)
+    if factor != z_factors.Factor and hasattr(
+            factor, 'inputs') and issubclass(
+            factor, z_factors.Factor):
+        new_factor = type(factor.__name__, (factor,), {
+            'inputs': _replace_inputs(factor.inputs)
+        })
+        locals()[factor.__name__] = new_factor
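The module above rebinds the default `inputs` of each zipline factor by creating a subclass with the same name. The sketch below shows the same pattern in isolation so it can run without zipline; `Column`, `Factor`, `AverageDollarVolume`, and `rebind` here are hypothetical stand-ins, and `isinstance` is used in place of the exact type check in the diff.

```python
class Column:
    """Hypothetical stand-in for a dataset column object."""

    def __init__(self, name):
        self.name = name


# Stand-ins for the zipline columns and their live replacements.
z_close, z_volume = Column("zipline.close"), Column("zipline.volume")
live_close, live_volume = Column("alpaca.close"), Column("alpaca.volume")

REPLACEMENTS = {z_close: live_close, z_volume: live_volume}


def replace_inputs(inputs):
    """Swap known columns for their replacements; leave anything else as is."""
    if not isinstance(inputs, (list, tuple, set)):
        return inputs
    return tuple(REPLACEMENTS.get(inp, inp) for inp in inputs)


class Factor:
    """Hypothetical stand-in for the factor base class."""


class AverageDollarVolume(Factor):
    inputs = (z_close, z_volume)


def rebind(factor):
    """Create a same-named subclass whose default inputs are replaced."""
    return type(factor.__name__, (factor,), {"inputs": replace_inputs(factor.inputs)})


LiveAverageDollarVolume = rebind(AverageDollarVolume)
```

Because the new class subclasses the original, the compute logic is inherited unchanged and only the class-level default inputs differ.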
diff --git a/pipeline_live/data/alpaca/pricing.py b/pipeline_live/data/alpaca/pricing.py
@@ -0,0 +1,23 @@
+from zipline.pipeline.data.dataset import Column, DataSet
+from zipline.utils.numpy_utils import float64_dtype
+
+from .pricing_loader import USEquityPricingLoader
+
+
+# In order to use it as a cache key, we have to make it a singleton
+_loader = USEquityPricingLoader()
+
+
+class USEquityPricing(DataSet):
+    """
+    Dataset representing daily trading prices and volumes.
+    """
+    open = Column(float64_dtype)
+    high = Column(float64_dtype)
+    low = Column(float64_dtype)
+    close = Column(float64_dtype)
+    volume = Column(float64_dtype)
+
+    @staticmethod
+    def get_loader():
+        return _loader
diff --git a/pipeline_live/data/alpaca/pricing_loader.py b/pipeline_live/data/alpaca/pricing_loader.py
@@ -0,0 +1,115 @@
+import numpy as np
+import logbook
+import pandas as pd
+
+from zipline.lib.adjusted_array import AdjustedArray
+from zipline.pipeline.loaders.base import PipelineLoader
+from zipline.utils.calendars import get_calendar
+from zipline.errors import NoFurtherDataError
+
+from pipeline_live.data.sources import alpaca
+
+
+log = logbook.Logger(__name__)
+
+
+class USEquityPricingLoader(PipelineLoader):
+    """
+    PipelineLoader for US Equity Pricing data
+    """
+
+    def __init__(self):
+        cal = get_calendar('NYSE')
+
+        self._all_sessions = cal.all_sessions
+
+    def load_adjusted_array(self, columns, dates, symbols, mask):
+        # load_adjusted_array is called with dates on which the user's algo
+        # will be shown data, which means we need to return the data that would
+        # be known at the start of each date.  We assume that the latest data
+        # known on day N is the data from day (N - 1), so we shift all query
+        # dates back by a day.
+        start_date, end_date = _shift_dates(
+            self._all_sessions, dates[0], dates[-1], shift=1,
+        )
+
+        sessions = self._all_sessions
+        sessions = sessions[(sessions >= start_date) & (sessions <= end_date)]
+
+        timedelta = pd.Timestamp.utcnow() - start_date
+        chart_range = timedelta.days + 1
+        log.info('chart_range={}'.format(chart_range))
+        prices = alpaca.get_stockprices(chart_range)
+
+        dfs = []
+        for symbol in symbols:
+            if symbol not in prices:
+                df = pd.DataFrame(
+                    {c.name: c.missing_value for c in columns},
+                    index=sessions
+                )
+            else:
+                df = prices[symbol]
+                df = df.reindex(sessions, method='ffill')
+            dfs.append(df)
+
+        raw_arrays = {}
+        for c in columns:
+            colname = c.name
+            raw_arrays[colname] = np.stack([
+                df[colname].values for df in dfs
+            ], axis=-1)
+        out = {}
+        for c in columns:
+            c_raw = raw_arrays[c.name]
+            out[c] = AdjustedArray(
+                c_raw.astype(c.dtype),
+                {},
+                c.missing_value
+            )
+        return out
+
+
+def _shift_dates(dates, start_date, end_date, shift):
+    try:
+        start = dates.get_loc(start_date)
+    except KeyError:
+        if start_date < dates[0]:
+            raise NoFurtherDataError(
+                msg=(
+                    "Pipeline Query requested data starting on {query_start}, "
+                    "but first known date is {calendar_start}"
+                ).format(
+                    query_start=str(start_date),
+                    calendar_start=str(dates[0]),
+                )
+            )
+        else:
+            raise ValueError("Query start %s not in calendar" % start_date)
+
+    # Make sure that shifting doesn't push us out of the calendar.
+    if start < shift:
+        raise NoFurtherDataError(
+            msg=(
+                "Pipeline Query requested data from {shift}"
+                " days before {query_start}, but first known date is only "
+                "{start} days earlier."
+            ).format(shift=shift, query_start=start_date, start=start),
+        )
+
+    try:
+        end = dates.get_loc(end_date)
+    except KeyError:
+        if end_date > dates[-1]:
+            raise NoFurtherDataError(
+                msg=(
+                    "Pipeline Query requesting data up to {query_end}, "
+                    "but last known date is {calendar_end}"
+                ).format(
+                    query_end=end_date,
+                    calendar_end=dates[-1],
+                )
+            )
+        else:
+            raise ValueError("Query end %s not in calendar" % end_date)
+    return dates[start - shift], dates[end - shift]
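The loader above shifts the queried date range back by one session and then derives how many days of bars to request. The following simplified sketch illustrates both steps with plain `datetime.date` values instead of a pandas trading calendar; `SESSIONS`, `shift_sessions`, and `chart_range` are hypothetical names, and the calendar is a hand-written list that ignores holidays.

```python
from datetime import date

# Hypothetical session calendar: five consecutive weekdays, no holidays.
SESSIONS = [
    date(2019, 6, 20), date(2019, 6, 21),                      # Thu, Fri
    date(2019, 6, 24), date(2019, 6, 25), date(2019, 6, 26),   # Mon to Wed
]


def shift_sessions(sessions, start_date, end_date, shift=1):
    """Move a [start_date, end_date] query back by `shift` sessions."""
    start = sessions.index(start_date)  # raises ValueError if not a session
    end = sessions.index(end_date)
    if start < shift:
        raise LookupError("not enough history before {}".format(start_date))
    return sessions[start - shift], sessions[end - shift]


def chart_range(start_date, today):
    """Number of calendar days to request so the window reaches start_date."""
    return (today - start_date).days + 1
```

A query for Monday through Wednesday is therefore served with data from the preceding Friday through Tuesday, which matches the comment in `load_adjusted_array` about returning what was known at the start of each date.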
diff --git a/pipeline_live/data/polygon/filters.py b/pipeline_live/data/polygon/filters.py
@@ -22,6 +22,7 @@ def compute(self, today, symbols, out, *inputs):
         ], dtype=bool)
         out[:] = ary
 
+
 class StaticSymbols(CustomFilter):
     inputs = ()
     window_length = 1

diff --git a/pipeline_live/data/sources/alpaca.py b/pipeline_live/data/sources/alpaca.py
@@ -0,0 +1,41 @@
+import alpaca_trade_api as tradeapi
+
+from .util import (
+    daily_cache, parallelize
+)
+
+
+def list_symbols():
+    return [
+        a.symbol for a in tradeapi.REST().list_assets()
+        if a.tradable and a.status == 'active'
+    ]
+
+
+def get_stockprices(limit=365, timespan='day'):
+    all_symbols = list_symbols()
+
+    @daily_cache(filename='alpaca_chart_{}'.format(limit))
+    def get_stockprices_cached(all_symbols):
+        return _get_stockprices(all_symbols, limit, timespan)
+
+    return get_stockprices_cached(all_symbols)
+
+
+def _get_stockprices(symbols, limit=365, timespan='day'):
+    '''Get bar data for the given symbols from Alpaca, splitting the
+    request to stay under Alpaca's limit of 200 symbols per request.
+    '''
+
+    def fetch(symbols):
+        barset = tradeapi.REST().get_barset(symbols, timespan, limit)
+        data = {}
+        for symbol in barset:
+            df = barset[symbol].df
+            # Update the index format for comparison with the trading calendar
+            df.index = df.index.tz_convert('UTC').normalize()
+            data[symbol] = df.asfreq('C')
+
+        return data
+
+    return parallelize(fetch, splitlen=199)(symbols)
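The body of `parallelize` from `.util` is not part of this diff, so the following is only a sketch of what a helper with that call shape might do: split the symbol list into chunks of at most `splitlen`, call the fetch function on each chunk concurrently, and merge the resulting dicts. `chunked`, `fetch_in_chunks`, and `fake_fetch` are hypothetical names, and the thread pool is an assumption rather than the library's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_in_chunks(fetch, symbols, splitlen=199, workers=4):
    """Call `fetch` on each chunk of symbols and merge the returned dicts."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(fetch, chunked(symbols, splitlen)):
            merged.update(part)
    return merged


def fake_fetch(symbols):
    # Hypothetical stand-in for the bar request; enforces the per-request cap.
    assert len(symbols) <= 200
    return {symbol: len(symbol) for symbol in symbols}
```

With 450 symbols and `splitlen=199`, this issues three requests of 199, 199, and 52 symbols.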