Regression: the outputs are continuous values

# Feature engineering

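When it comes to a machine learning algorithm, the first question to ask is which features are available, that is, what the predictive variables are. The driving factors for predicting future DJIA prices (here, the close price) obviously include historical and current open prices as well as historical performance (high, low, and volume).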

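Note that current or same-day performance (high, low, and volume) should not be included, because before the market closes we cannot foresee the highest or lowest price traded that day, or the total number of shares traded.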
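The rule above can be sketched on a toy table: the same-day open is used directly, while high, low, and volume are lagged by one row with `shift(1)`. The numbers and the column layout below are made up for illustration and are not taken from the DJIA data.

```python
import pandas as pd

# Toy daily data; values are invented for illustration only
prices = pd.DataFrame({
    "Open":   [100.0, 102.0, 101.0, 105.0],
    "High":   [103.0, 104.0, 106.0, 107.0],
    "Low":    [99.0, 100.0, 100.0, 104.0],
    "Volume": [1000, 1200, 900, 1500],
    "Close":  [102.0, 101.0, 105.0, 106.0],
})

features = pd.DataFrame({
    "open": prices["Open"],                # same-day open is known before the close
    "high_1": prices["High"].shift(1),     # previous day's high
    "low_1": prices["Low"].shift(1),       # previous day's low
    "volume_1": prices["Volume"].shift(1)  # previous day's volume
})
target = prices["Close"]                   # what we want to predict

print(features)
```

The first row has no previous day, so its lagged columns are `NaN`; such rows have to be dropped before training.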

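Predicting the close price with only these four indicators does not seem promising and may lead to underfitting. We therefore need to think about how to add features and increase predictive power. In machine learning, feature engineering is the process of creating domain-specific features from existing ones in order to improve the performance of an algorithm.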

Feature engineering requires sufficient domain knowledge, and it can be difficult and time-consuming.

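In practice, the features used to solve a machine learning problem are usually not directly available and have to be specially designed and constructed, for example term frequency and tf-idf features in spam email detection and news classification. Feature engineering is therefore a cornerstone of machine learning, and it is usually where much of the effort goes when solving a practical problem.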

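When making an investment decision, investors usually look at historical prices over a period of time rather than only the previous day's price. In our stock price prediction, we can therefore compute the average close price over the past week (five trading days), over the past month, and over the past year as three new features. We can also customize the time window, for example the past quarter or the past six months.

On top of these three averaged price features, we can generate new features that capture the price trend by computing the ratio between each pair of average prices over the three time frames, for example the ratio of the average price over the past week to that over the past year. Besides prices, volume is another important factor that investors analyze. Similarly, we can generate new volume-based features by computing the average volume over several time frames and the ratio between each pair of averaged values.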
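As a minimal sketch of the averaged-price and ratio features described above, the following uses a short made-up series and window sizes of 2 and 4 rows (chosen only so the example fits on screen; they stand in for the week, month, and year windows).

```python
import pandas as pd

# Made-up close prices for illustration
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])

# Rolling mean over the window, then shift(1) so the current day is excluded
avg_short = close.rolling(window=2).mean().shift(1)
avg_long = close.rolling(window=4).mean().shift(1)

# Ratio of the short-term average to the long-term average (a trend indicator)
ratio_short_long = avg_short / avg_long

print(pd.DataFrame({
    "close": close,
    "avg_short": avg_short,
    "avg_long": avg_long,
    "ratio_short_long": ratio_short_long,
}))
```

A ratio above 1 means recent prices sit above the longer-term average. The same pattern applies to volume by swapping the input series.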

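Besides historical averages over a time window, investors also pay close attention to volatility. Volatility describes the degree to which the price of a given stock or index varies over time. In statistical terms, it is essentially the standard deviation of the close prices. We can generate a new set of features by computing the standard deviation of close prices, and of volumes, over a particular time frame. In the same manner, the ratio between each pair of standard deviation values can be added to our pool of engineered features.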
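A small sketch of a volatility feature, assuming a made-up price series and a 3-row window. Note that pandas' rolling `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up close prices for illustration
close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0])

# Standard deviation of the previous 3 closes; shift(1) excludes the current day
std_3 = close.rolling(window=3).std().shift(1)

print(std_3)
```

For the row at position 3, the window is `[10, 12, 11]`, whose sample standard deviation is 1.0.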

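Return is a key financial metric that investors watch closely. It is the percentage change in the close price of a stock or index over a particular period. For example, daily return and annual return are financial terms we hear often. They are calculated as follows: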
![return](return.png)

Moving average: ![movingAvg](movingAVG.png)
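The return and moving-average definitions can be sketched in pandas as follows. The prices and the 2-row window are made up for illustration; the return is computed as (current price minus earlier price) divided by earlier price, which matches how `generate_features` computes it further down.

```python
import pandas as pd

# Made-up close prices for illustration
close = pd.Series([100.0, 110.0, 99.0, 108.9])

# One-day return: (price_t - price_{t-1}) / price_{t-1}
return_1 = (close - close.shift(1)) / close.shift(1)

# Moving average of the one-day returns over a 2-row window
moving_avg_2 = return_1.rolling(window=2).mean()

print(pd.DataFrame({"close": close, "return_1": return_1, "moving_avg_2": moving_avg_2}))
```

When these values are used as features for predicting the same day's close, they still need one more `shift(1)`, because the return for day t depends on the close of day t.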

# Data acquisition and feature generation
The Python library quandl (https://www.quandl.com/tools/python) is used here.

In [1]:
import quandl

In [None]:
# The original dataset is no longer available; download the data manually from https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI
#mydata = quandl.get("YAHOO/INDEX_DJI", start_date="2005-12-01", end_date="2005-12-05")

In [None]:
# authtoken = ''

#def get_data_quandl(symbol, start_date,end_date):
#    data = quandl.get(symbol, start_date=start_date, end_date=end_date, authtoken=authtoken)
#    return data

In [1]:
import pandas as pd

In [2]:
mydata = pd.read_csv('^DJI.csv')

In [3]:
mydata.head()

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|------|------|------|-----|-------|-----------|--------|
| 0 | 1987-12-31 | 1927.390015 | 1951.26001 | 1912.630005 | 1938.829956 | 1938.829956 | 15360000 |
| 1 | 1988-01-04 | 1952.589966 | 2030.01001 | 1950.76001 | 2015.25 | 2015.25 | 20880000 |
| 2 | 1988-01-05 | 2056.370117 | 2075.27002 | 2021.390015 | 2031.5 | 2031.5 | 27200000 |
| 3 | 1988-01-06 | 2036.469971 | 2058.189941 | 2012.77002 | 2037.800049 | 2037.800049 | 18800000 |
| 4 | 1988-01-07 | 2019.890015 | 2061.51001 | 2004.640015 | 2051.889893 | 2051.889893 | 21370000 |


### Getting the data

In [4]:
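def get_data_from(data, start_date, end_date):
    # Keep rows whose Date falls within [start_date, end_date]; filter on the
    # DataFrame that was passed in, not on the global mydata
    data_raw = data[(data.Date >= start_date) & (data.Date <= end_date)]
    return data_raw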
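The filter above compares date strings directly, which works because `YYYY-MM-DD` strings sort in chronological order. A variant that parses the dates explicitly might look like the sketch below; the function name `get_data_between` and the toy frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def get_data_between(data, start_date, end_date):
    """Hypothetical variant: parse the Date column before comparing."""
    dates = pd.to_datetime(data["Date"])
    mask = (dates >= pd.Timestamp(start_date)) & (dates <= pd.Timestamp(end_date))
    return data[mask]

# Toy frame with a few rows in the same shape as the CSV (values rounded)
toy = pd.DataFrame({
    "Date": ["1987-12-31", "1988-01-04", "1988-01-05"],
    "Close": [1938.83, 2015.25, 2031.50],
})

subset = get_data_between(toy, "1988-01-01", "1988-12-31")
print(subset)
```

Parsing first avoids surprises if the date column ever arrives in a format that does not sort lexicographically.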

## Feature generation function

### Methods involved

In [16]:
head=mydata.head()

In [21]:
head['Open']

0    1927.390015
1    1952.589966
2    2056.370117
3    2036.469971
4    2019.890015
Name: Open, dtype: float64

In [22]:
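head.Open.shift(1)  # shift values down by one row, i.e. the previous day's value; shift(5) gives the value 5 days earlier; shift(5).shift(1) gives the value one day before that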

0            NaN
1    1927.390015
2    1952.589966
3    2056.370117
4    2036.469971
Name: Open, dtype: float64

In [32]:
# Moving mean: 5-day rolling average of Close, shifted by one row so that each value uses only earlier days.
# The legacy call pd.rolling_mean(tail['Close'], window=5) is deprecated and was later removed from pandas;
# Series.rolling(window=5).mean() is its replacement.
tail = mydata.tail(10)
tail.Close.rolling(window=5).mean().shift(1)

7474             NaN
7475             NaN
7476             NaN
7477             NaN
7478             NaN
7479    21865.593750
7480    21900.371875
7481    21889.353906
7482    21877.808203
7483    21856.278125
Name: Close, dtype: float64

### The generation function

In [8]:
def generate_features(df):
    """ Generate features for a stock/index based on historical price and performance
    Args:
        df (dataframe with columns 'Open', 'Close', 'High', 'Low', 'Volume', 'Adj Close')
    Returns:
        dataframe, data set with new features
    """
    df_new = pd.DataFrame()
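    # 6 original features
    df_new['open'] = df['Open']  # same-day open price (known before the close)
    df_new['open_1'] = df['Open'].shift(1)  # previous day's open price
    # same-day close, high, low, and volume are not known in advance, so use the previous day's values
    df_new['close_1'] = df['Close'].shift(1)  # previous day's close price
    df_new['high_1'] = df['High'].shift(1)  # previous day's high
    df_new['low_1'] = df['Low'].shift(1)    # previous day's low
    df_new['volume_1'] = df['Volume'].shift(1)  # previous day's volume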
    
    # 31 additional features generated from the 6 original ones
    # average price
    df_new['avg_price_5'] = df.Close.rolling(window=5).mean().shift(1)  # past week (5 trading days)
    df_new['avg_price_30'] = df.Close.rolling(window=21).mean().shift(1)  # past month (21 trading days)
    df_new['avg_price_365'] = df.Close.rolling(window=252).mean().shift(1)  # past year (252 trading days)
    df_new['ratio_avg_price_5_30'] = df_new['avg_price_5'] / df_new['avg_price_30']
    df_new['ratio_avg_price_5_365'] = df_new['avg_price_5'] / df_new['avg_price_365']
    df_new['ratio_avg_price_30_365'] = df_new['avg_price_30'] / df_new['avg_price_365']
    # average volume
    df_new['avg_volume_5'] = df.Volume.rolling(window=5).mean().shift(1)
    df_new['avg_volume_30'] = df.Volume.rolling(window=21).mean().shift(1)
    df_new['avg_volume_365'] = df.Volume.rolling(window=252).mean().shift(1)
    df_new['ratio_avg_volume_5_30'] = df_new['avg_volume_5'] / df_new['avg_volume_30']
    df_new['ratio_avg_volume_5_365'] = df_new['avg_volume_5'] / df_new['avg_volume_365']
    df_new['ratio_avg_volume_30_365'] = df_new['avg_volume_30'] / df_new['avg_volume_365']
    #standard deviation of prices
    df_new['std_price_5'] = df.Close.rolling(window=5).std().shift(1)
    df_new['std_price_30'] = df.Close.rolling(window=21).std().shift(1)
    df_new['std_price_365'] = df.Close.rolling(window=252).std().shift(1)
    df_new['ratio_std_price_5_30'] = df_new['std_price_5'] / df_new['std_price_30']
    df_new['ratio_std_price_5_365'] = df_new['std_price_5'] / df_new['std_price_365']
    df_new['ratio_std_price_30_365'] = df_new['std_price_30'] / df_new['std_price_365']
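    # standard deviation of volumes
    df_new['std_volume_5'] = df.Volume.rolling(window=5).std().shift(1)
    # NOTE: the next two windows span 30 and 365 rows, unlike the 21/252 trading-day windows used above;
    # the sample output shown after this function starts at row 366 because of the 365-row window
    df_new['std_volume_30'] = df.Volume.rolling(window=30).std().shift(1)
    df_new['std_volume_365'] = df.Volume.rolling(window=365).std().shift(1)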
    df_new['ratio_std_volume_5_30'] = df_new['std_volume_5'] / df_new['std_volume_30']
    df_new['ratio_std_volume_5_365'] = df_new['std_volume_5'] / df_new['std_volume_365']
    df_new['ratio_std_volume_30_365'] = df_new['std_volume_30'] / df_new['std_volume_365']
    
    # return = (current price - earlier price) / earlier price; this gives the same-day value, so apply shift(1) once more
    df_new['return_1'] = ((df.Close - df.Close.shift(1)) / df.Close.shift(1)).shift(1)
    df_new['return_5'] = ((df.Close - df.Close.shift(5)) / df.Close.shift(5)).shift(1)
    df_new['return_30'] = ((df.Close - df.Close.shift(21)) / df.Close.shift(21)).shift(1)
    df_new['return_365'] = ((df.Close - df.Close.shift(252)) / df.Close.shift(252)).shift(1)
    df_new['moving_avg_5'] = df_new['return_1'].rolling(window=5).mean()
    df_new['moving_avg_30'] = df_new['return_1'].rolling(window=21).mean()
    df_new['moving_avg_365'] = df_new['return_1'].rolling(window=252).mean()
    # the target
    df_new['close'] = df['Close']
    df_new = df_new.dropna(axis=0)
    return df_new

# Data from 1988 to 2015

In [13]:
data_raw =  get_data_from(mydata,'1988-01-01','2015-12-31')

In [25]:
data_raw.shape

(7058, 7)

In [14]:
data=generate_features(data_raw)

In [15]:
data.round(decimals=3).head(3)

|   | open | open_1 | close_1 | high_1 | low_1 | volume_1 | avg_price_5 | avg_price_30 | avg_price_365 | ratio_avg_price_5_30 | ... | ratio_std_volume_5_365 | ratio_std_volume_30_365 | return_1 | return_5 | return_30 | return_365 | moving_avg_5 | moving_avg_30 | moving_avg_365 | close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 366 | 2501.68 | 2505.78 | 2518.84 | 2528.17 | 2484.7 | 16340000.0 | 2511.562 | 2487.315 | 2207.95 | 1.01 | ... | 0.47 | 1.58 | 0.002 | 0.015 | 0.057 | 0.2 | 0.003 | 0.003 | 0.001 | 2503.54 |
| 367 | 2506.9 | 2501.68 | 2503.54 | 2520.71 | 2484.7 | 19810000.0 | 2513.006 | 2490.355 | 2209.454 | 1.009 | ... | 0.472 | 1.562 | -0.006 | 0.003 | 0.026 | 0.178 | 0.001 | 0.001 | 0.001 | 2503.36 |
| 368 | 2479.95 | 2506.9 | 2503.36 | 2520.71 | 2486.01 | 16390000.0 | 2511.214 | 2492.235 | 2210.93 | 1.008 | ... | 0.541 | 1.569 | -0.0 | -0.004 | 0.016 | 0.175 | -0.001 | 0.001 | 0.001 | 2475.0 |
