# Feature Engineering and Labeling

We'll use price-volume data to generate features that we can feed into a model. We'll use this notebook for all the coding exercises of this lesson, so please open it in a separate browser tab.

Please run the following code up to and including "Make Factors."  Then continue on with the lesson.

requirements.txt contains:

alphalens==0.3.2
colour==0.1.5
cycler==0.10.0
numpy==1.14.5
pandas==0.18.1
plotly==2.2.3
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.3
requests==2.18.4
scipy==1.0.0
scikit-learn==0.19.1
six==1.11.0
zipline===1.2.0
graphviz==0.9
shap==0.25.2

In [1]:
import sys
!{sys.executable} -m pip install --quiet -r requirements.txt

Command "python setup.py egg_info" failed with error code 1 in C:\Users\inves\AppData\Local\Temp\pip-install-dtpj__ni\zipline\


In [17]:
import numpy as np
import pandas as pd
import time

import matplotlib.pyplot as plt
%matplotlib inline


In [24]:
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)


#### Registering data

In [None]:
import os
import project_helper
from zipline.data import bundles

os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '..', '..', 'data', 'module_4_quizzes_eod')

ingest_func = bundles.csvdir.csvdir_equities(['daily'], project_helper.EOD_BUNDLE_NAME)
bundles.register(project_helper.EOD_BUNDLE_NAME, ingest_func)

print('Data Registered')

In [None]:
from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AverageDollarVolume
from zipline.utils.calendars import get_calendar


universe = AverageDollarVolume(window_length=120).top(500) 
trading_calendar = get_calendar('NYSE') 
bundle_data = bundles.load(project_helper.EOD_BUNDLE_NAME)
engine = project_helper.build_pipeline_engine(bundle_data, trading_calendar)

In [None]:
universe_end_date = pd.Timestamp('2016-01-05', tz='UTC')

universe_tickers = engine\
    .run_pipeline(
        Pipeline(screen=universe),
        universe_end_date,
        universe_end_date)\
    .index.get_level_values(1)\
    .values.tolist()

In [None]:
from zipline.data.data_portal import DataPortal

data_portal = DataPortal(
    bundle_data.asset_finder,
    trading_calendar=trading_calendar,
    first_trading_day=bundle_data.equity_daily_bar_reader.first_trading_day,
    equity_minute_reader=None,
    equity_daily_reader=bundle_data.equity_daily_bar_reader,
    adjustment_reader=bundle_data.adjustment_reader)

def get_pricing(data_portal, trading_calendar, assets, start_date, end_date, field='close'):
    end_dt = pd.Timestamp(end_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')
    start_dt = pd.Timestamp(start_date.strftime('%Y-%m-%d'), tz='UTC', offset='C')

    end_loc = trading_calendar.closes.index.get_loc(end_dt)
    start_loc = trading_calendar.closes.index.get_loc(start_dt)

    return data_portal.get_history_window(
        assets=assets,
        end_dt=end_dt,
        bar_count=end_loc - start_loc,
        frequency='1d',
        field=field,
        data_frequency='daily')

# Make Factors

- We'll use the same factors we have been using in the lessons about alpha factor research.  Factors can be features that we feed into the model.


In [None]:
from zipline.pipeline.factors import CustomFactor, DailyReturns, Returns, SimpleMovingAverage
from zipline.pipeline.data import USEquityPricing

factor_start_date = universe_end_date - pd.DateOffset(years=3, days=2)
sector = project_helper.Sector()

def momentum_1yr(window_length, universe, sector):
    return Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()

def mean_reversion_5day_sector_neutral(window_length, universe, sector):
    return -Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()

def mean_reversion_5day_sector_neutral_smoothed(window_length, universe, sector):
    unsmoothed_factor = mean_reversion_5day_sector_neutral(window_length, universe, sector)
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=window_length) \
        .rank() \
        .zscore()

class CTO(Returns):
    """
    Computes the overnight return, per hypothesis from
    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2554010
    """
    inputs = [USEquityPricing.open, USEquityPricing.close]
    
    def compute(self, today, assets, out, opens, closes):
        """
        The opens and closes matrix is 2 rows x N assets, with the most recent at the bottom.
        As such, opens[-1] is the most recent open, and closes[0] is the earlier close
        """
        out[:] = (opens[-1] - closes[0]) / closes[0]

        
class TrailingOvernightReturns(Returns):
    """
    Sum of trailing 1m O/N returns
    """
    window_safe = True
    
    def compute(self, today, asset_ids, out, cto):
        out[:] = np.nansum(cto, axis=0)

        
def overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe):
    cto_out = CTO(mask=universe, window_length=cto_window_length)
    return TrailingOvernightReturns(inputs=[cto_out], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()

def overnight_sentiment_smoothed(cto_window_length, trail_overnight_returns_window_length, universe):
    unsmoothed_factor = overnight_sentiment(cto_window_length, trail_overnight_returns_window_length, universe)
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()

universe = AverageDollarVolume(window_length=120).top(500)
sector = project_helper.Sector()

pipeline = Pipeline(screen=universe)
pipeline.add(
    momentum_1yr(252, universe, sector),
    'Momentum_1YR')
pipeline.add(
    mean_reversion_5day_sector_neutral_smoothed(20, universe, sector),
    'Mean_Reversion_Sector_Neutral_Smoothed')
pipeline.add(
    overnight_sentiment_smoothed(2, 10, universe),
    'Overnight_Sentiment_Smoothed')

all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)

all_factors.head()


# Universal Quant Features

* stock volatility: zipline has a custom factor called AnnualizedVolatility.  The [source code is here](https://github.com/quantopian/zipline/blob/master/zipline/pipeline/factors/basic.py) and also pasted below:

```python
class AnnualizedVolatility(CustomFactor):
    """
    Volatility. The degree of variation of a series over time as measured by
    the standard deviation of daily returns.
    https://en.wikipedia.org/wiki/Volatility_(finance)
    **Default Inputs:** :data:`zipline.pipeline.factors.Returns(window_length=2)`  # noqa
    Parameters
    ----------
    annualization_factor : float, optional
        The number of time units per year. Defaults is 252, the number of NYSE
        trading days in a normal year.
    """
    inputs = [Returns(window_length=2)]
    params = {'annualization_factor': 252.0}
    window_length = 252

    def compute(self, today, assets, out, returns, annualization_factor):
        out[:] = nanstd(returns, axis=0) * (annualization_factor ** .5)
```
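
To make the `compute` step concrete, here is a minimal NumPy sketch of the same arithmetic outside of zipline. The return values are randomly generated for illustration only, and the helper name `annualized_volatility` is hypothetical (it is not part of zipline).

```python
import numpy as np

# Hypothetical daily returns: 20 trading days (rows) x 3 assets (columns).
rng = np.random.RandomState(0)
daily_returns = rng.normal(loc=0.0, scale=0.01, size=(20, 3))
daily_returns[0, 2] = np.nan  # a missing observation; nanstd ignores it


def annualized_volatility(returns, annualization_factor=252.0):
    """Per-asset std of daily returns, scaled by sqrt(annualization_factor)."""
    return np.nanstd(returns, axis=0) * annualization_factor ** 0.5


# The window is 20 days, but the result is still expressed in annual terms.
vol_20d = annualized_volatility(daily_returns)
print(vol_20d)
```

The window length only controls how many rows of returns go into the standard deviation; the `annualization_factor` scaling stays at 252 regardless of the window.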

In [None]:
from zipline.pipeline.factors import AnnualizedVolatility
AnnualizedVolatility()

#### Quiz
We can see that the `Returns` `window_length` is 2 because daily returns are calculated as the percent change from one day to the next (2 days). The `AnnualizedVolatility` `window_length` is 252 by default because it measures one-year volatility. Try adjusting the call to the `AnnualizedVolatility` constructor so that it represents one-month volatility (still annualized, but calculated over a window of 20 trading days).

#### Answer

In [None]:
# One-month (20 trading day) window; the result is still annualized
AnnualizedVolatility(window_length=20)

#### Quiz: Create one-month and six-month annualized volatility.
Create `AnnualizedVolatility` objects for 20-day and 120-day (one-month and six-month) time windows. Remember to set the `mask` parameter to the `universe` object created earlier (this filters the stocks to match the list in the `universe`). Convert these to ranks, and then convert the ranks to z-scores.

In [None]:
# One-month and six-month annualized volatility, ranked and z-scored
volatility_20d = AnnualizedVolatility(window_length=20, mask=universe).rank().zscore()
volatility_120d = AnnualizedVolatility(window_length=120, mask=universe).rank().zscore()
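
As a rough illustration of what chaining `.rank().zscore()` does to one day's cross section, the following NumPy sketch ranks a handful of made-up volatility values and standardizes the ranks. It assumes ordinal ranks (1 = smallest) and a population standard deviation; the helper `rank_then_zscore` is hypothetical and is not zipline's implementation.

```python
import numpy as np


def rank_then_zscore(values):
    """Convert values to ordinal ranks, then standardize; NaNs stay NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    valid = ~np.isnan(values)
    # argsort of argsort gives each element's 0-based position in sorted order
    ranks = values[valid].argsort(kind='mergesort').argsort() + 1.0
    out[valid] = (ranks - ranks.mean()) / ranks.std()
    return out


# Hypothetical one-day cross section of volatilities; one stock has no data.
raw_volatility = [0.35, 0.12, np.nan, 0.80, 0.20]
scores = rank_then_zscore(raw_volatility)
print(scores)
```

Ranking first makes the feature insensitive to outliers (0.80 is only one rank above 0.35), and the z-score puts features with different units on a comparable scale.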

#### Add to the pipeline

In [None]:
pipeline.add(volatility_20d, 'volatility_20d')
pipeline.add(volatility_120d, 'volatility_120d')

#### Quiz: Average Dollar Volume feature
We've been using [AverageDollarVolume](http://www.zipline.io/appendix.html#zipline.pipeline.factors.AverageDollarVolume) to choose the stock universe by selecting the stocks with the highest dollar volume.  We can also use it as a feature that is input into a predictive model.  
Use 20-day and 120-day values of `window_length` for average dollar volume.  Then rank each one and convert it to a z-score.

In [None]:
# Already imported earlier; shown here for reference.
# from zipline.pipeline.factors import AverageDollarVolume

# 20-day and 120-day average dollar volume, ranked and z-scored
adv_20d = AverageDollarVolume(window_length=20, mask=universe).rank().zscore()
adv_120d = AverageDollarVolume(window_length=120, mask=universe).rank().zscore()
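
For intuition about the quantity being ranked, here is a small NumPy sketch of average dollar volume over a window, taken as the mean of close price times shares traded. The prices and volumes are invented for illustration, and `average_dollar_volume` is a hypothetical helper rather than the zipline factor itself.

```python
import numpy as np

# Hypothetical 3-day window for 2 assets (rows = days, columns = assets).
close = np.array([[10.0, 50.0],
                  [11.0, 49.0],
                  [12.0, 51.0]])
volume = np.array([[1000.0, 200.0],
                   [1500.0, 100.0],
                   [500.0, 300.0]])


def average_dollar_volume(close, volume):
    """Mean of (price * shares traded) over the window, per asset."""
    return np.nansum(close * volume, axis=0) / len(close)


adv = average_dollar_volume(close, volume)
print(adv)
```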

#### Add average dollar volume features to pipeline

In [None]:
pipeline.add(adv_20d, 'adv_20d')
pipeline.add(adv_120d, 'adv_120d')

### Market Regime Features
We are going to try to capture market-wide regimes. "Market-wide" means we'll look at the aggregate movement of the whole universe of stocks.

High and low dispersion: dispersion is the standard deviation of the cross section of all stock returns at each point in time (on each day).  We'll inherit from [CustomFactor](http://www.zipline.io/appendix.html?highlight=customfactor#zipline.pipeline.CustomFactor) and feed in [DailyReturns](http://www.zipline.io/appendix.html?highlight=dailyreturns#zipline.pipeline.factors.DailyReturns) as the `inputs`.
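
Before wrapping this in a `CustomFactor`, it may help to see the calculation on a plain array. The sketch below computes one dispersion value per day as the standard deviation across stocks; the returns are made up, and `cross_sectional_dispersion` is a hypothetical helper, not the factor built in this lesson.

```python
import numpy as np

# Hypothetical daily returns: rows = days, columns = stocks.
returns = np.array([[0.01, 0.01, 0.01],      # all stocks move together
                    [0.03, -0.03, 0.00],     # stocks diverge
                    [0.02, np.nan, -0.02]])  # one stock has no data that day


def cross_sectional_dispersion(returns):
    """Std of returns across stocks (axis=1), one value per day."""
    return np.nanstd(returns, axis=1)


dispersion = cross_sectional_dispersion(returns)
print(dispersion)
```

Note that the standard deviation is taken across columns (stocks) for each row (day), which is the opposite axis from the volatility calculation, where it is taken across days for each stock.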