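### Stock Prediction Project
This project uses a stock's historical price data to predict whether the stock will rise or fall in the future, which makes it a binary classification problem. The data was retrieved with the ```tushare``` library, and the archive already includes the data for one stock. The assignment has four steps:
1. Build sample data from the given data. This requires feature engineering, which here means extracting technical indicators.
2. After extracting the technical indicators, apply some simple data processing.
3. Build the training and test data.
4. Train a binary classifier with a random forest.

The focus of the project is technical-indicator extraction. For convenience the indicators are already implemented, but it is worth reading how each one is defined.

Estimated time to complete: 2 hours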

In [1]:
# Import the required libraries
import pandas as pd
import datetime
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt



In [2]:
import logging
import pandas as pd
import numpy as np

log = logging.getLogger(__name__)


def moving_average(df, n):
    """Compute the simple moving average of the closing price.
    Reference: https://blog.csdn.net/FrankieHello/article/details/85938381

    :param df: pandas.DataFrame
    :param n: window length in rows (trading days)
    :return: pandas.DataFrame
    """
    MA = pd.Series(df['close'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    df = df.join(MA)
    return df


def exponential_moving_average(df, n):
    """
    Compute the exponential moving average of the closing price.
    Reference: https://www.cnblogs.com/wuliytTaotao/p/9479958.html
    :param df: pandas.DataFrame
    :param n: EMA span in rows (trading days)
    :return: pandas.DataFrame
    """
    EMA = pd.Series(df['close'].ewm(span=n, min_periods=n).mean(), name='EMA_' + str(n))
    df = df.join(EMA)
    return df


def momentum(df, n):
    """
    Compute momentum: the close minus the close n rows earlier.
    Reference: http://www.waihuibang.com/fxschool/technical/54505.html
    :param df: pandas.DataFrame
    :param n: lag in rows (trading days)
    :return: pandas.DataFrame
    """
    M = pd.Series(df['close'].diff(n), name='Momentum_' + str(n))
    df = df.join(M)
    return df


def rate_of_change(df, n):
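    """
    Compute the rate of change: the relative change of the close
    versus the close n - 1 rows earlier.
    Reference: https://www.tradingview.com/wiki/Rate_of_Change_(ROC)
    :param df: pandas.DataFrame
    :param n: window length in rows; the comparison row is n - 1 rows back
    :return: pandas.DataFrame
    """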
    M = df['close'].diff(n - 1)
    N = df['close'].shift(n - 1)
    ROC = pd.Series(M / N, name='ROC_' + str(n))
    df = df.join(ROC)
    return df


def average_true_range(df, n):
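    """Compute the average true range (ATR) as an EMA of the true range.

    Rows are assumed to be in chronological order. The computation is
    positional, so it does not depend on the index labels.
    :param df: pandas.DataFrame
    :param n: EMA span in rows (trading days)
    :return: pandas.DataFrame
    """
    prev_close = df['close'].shift(1)
    # True range: today's high/low range extended to include the previous close.
    TR = np.maximum(df['high'], prev_close) - np.minimum(df['low'], prev_close)
    TR = TR.fillna(0)  # the first row has no previous close; keep the original 0 placeholder
    ATR = pd.Series(TR.ewm(span=n, min_periods=n).mean(), name='ATR_' + str(n))
    df = df.join(ATR)
    return df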


def bollinger_bands(df, n):
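    """
    Compute two Bollinger Bands features: the relative band width
    (4 * std / MA) and %b, the position of the close within the bands.
    :param df: pandas.DataFrame
    :param n: window length in rows (trading days)
    :return: pandas.DataFrame
    """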
    MA = pd.Series(df['close'].rolling(n, min_periods=n).mean())
    MSD = pd.Series(df['close'].rolling(n, min_periods=n).std())
    b1 = 4 * MSD / MA
    B1 = pd.Series(b1, name='BollingerB_' + str(n))
    df = df.join(B1)
    b2 = (df['close'] - MA + 2 * MSD) / (4 * MSD)
    B2 = pd.Series(b2, name='Bollinger%b_' + str(n))
    df = df.join(B2)
    return df



def stochastic_oscillator_k(df):
    """
    Compute %K as coded here: the position of the close within the
    same day's high-low range.
    :param df: pandas.DataFrame
    :return: pandas.DataFrame
    """
    SOk = pd.Series((df['close'] - df['low']) / (df['high'] - df['low']), name='SO%k')
    df = df.join(SOk)
    return df


def stochastic_oscillator_d(df, n):
    """
    Compute %D: an exponential moving average of %K.
    :param df: pandas.DataFrame
    :param n: EMA span in rows (trading days)
    :return: pandas.DataFrame
    """
    SOk = pd.Series((df['close'] - df['low']) / (df['high'] - df['low']), name='SO%k')
    SOd = pd.Series(SOk.ewm(span=n, min_periods=n).mean(), name='SO%d_' + str(n))
    df = df.join(SOd)
    return df



def macd(df, n_fast, n_slow):
    """Calculate MACD, MACD Signal and MACD difference
    
    :param df: pandas.DataFrame
    :param n_fast: 
    :param n_slow: 
    :return: pandas.DataFrame
    """
    EMAfast = pd.Series(df['close'].ewm(span=n_fast, min_periods=n_slow).mean())
    EMAslow = pd.Series(df['close'].ewm(span=n_slow, min_periods=n_slow).mean())
    MACD = pd.Series(EMAfast - EMAslow, name='MACD_' + str(n_fast) + '_' + str(n_slow))
    MACDsign = pd.Series(MACD.ewm(span=9, min_periods=9).mean(), name='MACDsign_' + str(n_fast) + '_' + str(n_slow))
    MACDdiff = pd.Series(MACD - MACDsign, name='MACDdiff_' + str(n_fast) + '_' + str(n_slow))
    df = df.join(MACD)
    df = df.join(MACDsign)
    df = df.join(MACDdiff)
    return df


def ease_of_movement(df, n):
    """Ease of Movement for given data.
    
    :param df: pandas.DataFrame
    :param n: 
    :return: pandas.DataFrame
    """
    EoM = (df['high'].diff(1) + df['low'].diff(1)) * (df['high'] - df['low']) / (2 * df['volume'])
    Eom_ma = pd.Series(EoM.rolling(n, min_periods=n).mean(), name='EoM_' + str(n))
    df = df.join(Eom_ma)
    return df



def standard_deviation(df, n):
    """Compute the rolling standard deviation of the closing price.

    :param df: pandas.DataFrame
    :param n: window length in rows (trading days)
    :return: pandas.DataFrame
    """
    df = df.join(pd.Series(df['close'].rolling(n, min_periods=n).std(), name='STD_' + str(n)))
    return df

In [4]:
# Load the stock data; the data below was obtained with the tushare library
stock = pd.read_csv("./600519.csv")
stock.head()

Unnamed: 0,date,open,high,close,low,volume,price_change,p_change,ma5,ma10,ma20,v_ma5,v_ma10,v_ma20
0,2019-09-12,1066.0,1109.98,1099.0,1066.0,41211.33,29.48,2.76,1114.276,1126.115,1108.345,40942.17,37385.21,37563.02
1,2019-09-11,1119.22,1119.97,1069.52,1068.0,81716.54,-54.33,-4.83,1123.276,1127.525,1105.64,39286.08,36197.12,36864.33
2,2019-09-10,1134.3,1135.0,1123.85,1120.01,26227.07,-12.67,-1.11,1134.374,1130.584,1104.33,29662.81,32726.43,34849.05
3,2019-09-09,1145.0,1148.0,1136.52,1135.0,29379.34,-5.97,-0.52,1137.604,1129.099,1099.035,30314.42,35320.15,35054.18
4,2019-09-06,1144.5,1146.15,1142.49,1131.0,26176.59,-1.51,-0.13,1138.052,1125.742,1093.141,30085.41,37232.0,37660.16


In [5]:
stock.sort_values("date",inplace=True)

In [6]:
stock.head()

Unnamed: 0,date,open,high,close,low,volume,price_change,p_change,ma5,ma10,ma20,v_ma5,v_ma10,v_ma20
613,2017-03-14,371.55,373.85,369.5,368.34,20416.49,-2.05,-0.55,369.5,369.5,369.5,20416.49,20416.49,20416.49
612,2017-03-15,369.5,375.15,374.68,369.01,25155.26,5.18,1.4,372.09,372.09,372.09,22785.88,22785.88,22785.88
611,2017-03-16,376.56,378.3,374.77,372.8,25022.6,0.09,0.02,372.983,372.983,372.983,23531.45,23531.45,23531.45
610,2017-03-17,373.1,384.45,378.48,373.1,34700.26,3.71,0.99,374.358,374.358,374.358,26323.65,26323.65,26323.65
609,2017-03-20,380.5,386.71,386.41,378.88,31545.94,7.93,2.1,376.768,376.768,376.768,27368.11,27368.11,27368.11


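### 1. Extracting Technical Indicators from the Stock Data
Call the provided technical-indicator functions directly to obtain these features. It is still worth skimming how each indicator is computed: you do not have to master them, but the general calculation logic is useful to know. To dig deeper into a particular indicator, search Baidu for "technical indicator" plus the indicator's name, for example "technical indicator" + "rate of change"; plenty of reference material is available.

> ```TODO1```: Extract the technical indicators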
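
As a small worked example of the calculation logic, the sketch below reproduces the arithmetic of `rate_of_change` by hand on the first five closing prices of the sample data. The standalone `close` series is an illustration only and is not part of the assignment code.

```python
import pandas as pd

# First five closing prices of the 600519 sample, oldest first.
close = pd.Series([369.50, 374.68, 374.77, 378.48, 386.41])

n = 5
# Same arithmetic as rate_of_change(): compare with the close n - 1 rows earlier.
roc = close.diff(n - 1) / close.shift(n - 1)

# Only the fifth row has a value: (386.41 - 369.50) / 369.50
print(roc.round(6).tolist())
```

The last value agrees with the `ROC_5` entry for 2017-03-20 in the notebook output, and the first four rows are NaN because they have no row four steps back to compare with.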

In [7]:
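# TODO: Extract technical indicators. You can call every indicator once, pick only a few, or add other indicators if you are interested.
#       Parameters differ between indicators, but most take one or two. The most common one is a number of days (n in the functions); some take two
#       (MACD, for example, needs separate day counts for the fast line and the slow line). Because each indicator is parameterized, one kind of indicator can yield many different features.

# Example: stock = average_directional_movement_index(stock, 12, 26)  # extract an indicator and store the result in a new DataFrame
#          stock = moving_average(stock, 5)
#          stock = moving_average(stock, 15)
# Note: average_directional_movement_index is not among the functions defined above.

# For simplicity, every single-parameter function is called once with n = 5.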

stock = exponential_moving_average(stock, 5)
stock = momentum(stock, 5)
stock = rate_of_change(stock, 5)
stock = bollinger_bands(stock, 5)
stock = stochastic_oscillator_k(stock)
stock = stochastic_oscillator_d(stock, 5)
stock = ease_of_movement(stock, 5)
stock = standard_deviation(stock, 5)
stock = macd(stock, 12, 26)



stock.head()

Unnamed: 0,date,open,high,close,low,volume,price_change,p_change,ma5,ma10,...,ROC_5,BollingerB_5,Bollinger%b_5,SO%k,SO%d_5,EoM_5,STD_5,MACD_12_26,MACDsign_12_26,MACDdiff_12_26
613,2017-03-14,371.55,373.85,369.5,368.34,20416.49,-2.05,-0.55,369.5,369.5,...,,,,0.210526,,,,,,
612,2017-03-15,369.5,375.15,374.68,369.01,25155.26,5.18,1.4,372.09,372.09,...,,,,0.923453,,,,,,
611,2017-03-16,376.56,378.3,374.77,372.8,25022.6,0.09,0.02,372.983,372.983,...,,,,0.358182,,,,,,
610,2017-03-17,373.1,384.45,378.48,373.1,34700.26,3.71,0.99,374.358,374.358,...,,,,0.474009,,,,,,
609,2017-03-20,380.5,386.71,386.41,378.88,31545.94,7.93,2.1,376.768,376.768,...,0.045765,0.066529,0.884666,0.961686,0.672601,,6.266472,,,


In [8]:
stock

Unnamed: 0,date,open,high,close,low,volume,price_change,p_change,ma5,ma10,...,ROC_5,BollingerB_5,Bollinger%b_5,SO%k,SO%d_5,EoM_5,STD_5,MACD_12_26,MACDsign_12_26,MACDdiff_12_26
613,2017-03-14,371.55,373.85,369.50,368.34,20416.49,-2.05,-0.55,369.500,369.500,...,,,,0.210526,,,,,,
612,2017-03-15,369.50,375.15,374.68,369.01,25155.26,5.18,1.40,372.090,372.090,...,,,,0.923453,,,,,,
611,2017-03-16,376.56,378.30,374.77,372.80,25022.60,0.09,0.02,372.983,372.983,...,,,,0.358182,,,,,,
610,2017-03-17,373.10,384.45,378.48,373.10,34700.26,3.71,0.99,374.358,374.358,...,,,,0.474009,,,,,,
609,2017-03-20,380.50,386.71,386.41,378.88,31545.94,7.93,2.10,376.768,376.768,...,0.045765,0.066529,0.884666,0.961686,0.672601,,6.266472,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,2019-09-06,1144.50,1146.15,1142.49,1131.00,26176.59,-1.51,-0.13,1138.052,1125.742,...,0.003275,0.026621,0.646487,0.758416,0.668328,0.000338,7.574039,39.671312,38.128774,1.542538
3,2019-09-09,1145.00,1148.00,1136.52,1135.00,29379.34,-5.97,-0.52,1137.604,1129.099,...,-0.003053,0.026680,0.464285,0.116923,0.484526,0.000315,7.587927,38.280271,38.159074,0.121197
2,2019-09-10,1134.30,1135.00,1123.85,1120.01,26227.07,-12.67,-1.11,1134.374,1130.584,...,-0.001031,0.033525,0.223272,0.256171,0.408408,-0.001712,9.507546,35.743470,37.675953,-1.932483
1,2019-09-11,1119.22,1119.97,1069.52,1068.00,81716.54,-54.33,-4.83,1123.276,1127.525,...,-0.065105,0.110680,0.067613,0.029248,0.282021,-0.004851,31.080953,29.014597,35.943682,-6.929085


In [9]:
stock.isna().sum(axis=0)

date               0
open               0
high               0
close              0
low                0
volume             0
price_change       0
p_change           0
ma5                0
ma10               0
ma20               0
v_ma5              0
v_ma10             0
v_ma20             0
EMA_5              4
Momentum_5         5
ROC_5              4
BollingerB_5       4
Bollinger%b_5      4
SO%k               1
SO%d_5             4
EoM_5              5
STD_5              4
MACD_12_26        25
MACDsign_12_26    33
MACDdiff_12_26    33
dtype: int64

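### 2. Data Processing and Building the Training and Test Samples
The technical indicators we need are now extracted. The next step is to build training and test data from them. The method for constructing supervised-learning data was covered in this chapter's video lecture and can be followed here.
Note: the data contains NaN values. Think about why they appear; to trace their source, look at how pandas' rolling().mean() works. In this project we predict future movement from a window of past data, so future data must never be used to predict the future; only historical data may be used.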
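
To see where the NaN values come from, here is a minimal sketch of `rolling().mean()` on a made-up price series. The numbers are illustrative and unrelated to the stock data.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# With min_periods=3 the first two rows lack a full 3-row window, so they are NaN.
ma3 = close.rolling(3, min_periods=3).mean()
print(ma3.tolist())  # [nan, nan, 11.0, 12.0, 13.0, 14.0]

# Each value uses only the current row and the rows before it, never later rows.
assert ma3.iloc[2] == close.iloc[:3].mean()
```

The same reasoning explains the NaN counts reported by `stock.isna().sum()`: a window of length n leaves the first n - 1 rows empty.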

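> ```TODO2```: Do the necessary preprocessing and build the sample data. The label to predict is the next day's movement: if the next day's ```close``` is greater than the current day's ```close```, the sample is positive (1); if it is lower, the sample is negative (0). After building the samples, split them into training and test sets with ```train_test_split```.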
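
One possible way to build the next-day label is sketched below, assuming the rows are sorted from oldest to newest. The `toy` frame is hypothetical, and treating a flat day as 0 is an assumption, since the task statement does not say how to handle equal prices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, sorted oldest to newest like `stock` after sort_values.
toy = pd.DataFrame({"close": [10.0, 11.0, 10.5, 10.5, 12.0]})

# Align tomorrow's close with today's row.
next_close = toy["close"].shift(-1)

# 1 if tomorrow closes higher than today, else 0 (a flat day counts as 0 here).
toy["y"] = (next_close > toy["close"]).astype(float)

# The last row has no "tomorrow", so its label is unknown and the row is dropped.
toy.loc[next_close.isna(), "y"] = np.nan
toy = toy.dropna(subset=["y"])

print(toy["y"].tolist())  # [1.0, 0.0, 0.0, 1.0]
```

With this alignment, every feature in a row is known at the close of day t, while the label describes day t + 1.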

In [10]:
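# TODO 2: Build the sample data and randomly split it into training and test sets

### Build the label
# NOTE: as written, row t is labeled with its own day's move (close[t] > close[t-1]),
# not the next day's move that TODO2 describes. The price_change and p_change columns
# kept as features encode that same same-day move.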
lastday = stock['close'].values[:-1]
today = stock['close'].values[1:]
up = today-lastday
label = []
label.append(np.nan)
for val in up:
    if val>0:
        label.append(1)
    else:
        label.append(0)

stock['y'] = label

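### As the previous output shows, a few rows have missing values, so we drop them. The NaNs come from the windowed indicators: with n = 5, for example, the first 4 rows do not have 5 rows of history behind them.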

stock= stock.dropna(axis=0) 

from sklearn.model_selection  import train_test_split
### Split the dataset

y = stock['y'].values
stock.drop(['y','date'],axis=1,inplace=True)

X = stock.values

X_train,X_test,y_train,y_test =  train_test_split(X, y, test_size=0.2, random_state=1)

print (X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(464, 25) (464,) (116, 25) (116,)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


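### 3. Training a Random Forest Model
Model training is much the same as before. Try training with cross-validation and see how the results turn out.
> ```TODO3```: Train the model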
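
For time-ordered data, one option is to cross-validate with folds that always train on earlier rows and validate on later ones. The sketch below swaps scikit-learn's `TimeSeriesSplit` in for the default K-fold splitter. The synthetic `X_demo` and `y_demo` arrays and the small parameter grid are assumptions for illustration, not the assignment's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data: 200 rows in time order, 4 features, noisy binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

param_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=4),  # each fold trains on the past, validates on later rows
    scoring="f1",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Whether this splitter suits the assignment depends on how the samples are built; it is shown here only as an alternative to shuffled folds.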

In [12]:
# TODO: Train a random forest model, try different parameters, and report the best parameters before evaluating on the test set

from sklearn.ensemble import RandomForestClassifier
#from sklearn.grid_search import GridSearchCV  # the grid_search module was removed in sklearn 0.20
from sklearn.model_selection import GridSearchCV

params = {'n_estimators': [10,50,100],'max_depth':[1,2,5,10],'min_samples_split':[2,5,10],'min_samples_leaf':[1,2,5]}
grid=GridSearchCV(RandomForestClassifier(),params,n_jobs = -1)
grid.fit(X_train,y_train)
print(grid.best_params_,grid.best_score_)



{'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50} 0.9935344827586207




In [13]:
predictions = grid.predict(X_test)
from sklearn.metrics import classification_report

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        61
         1.0       1.00      1.00      1.00        55

    accuracy                           1.00       116
   macro avg       1.00      1.00      1.00       116
weighted avg       1.00      1.00      1.00       116



> ```TODO4```: Question: How good is the result? Does it meet expectations? What could improve the model's accuracy?
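
When judging a near-perfect score, it helps to check whether any feature already encodes the label. The sketch below uses a hypothetical `toy` frame in which `price_change` is assumed to mirror the CSV column of that name (close minus previous close, as the sample rows suggest). It compares a one-split rule on that column against a same-day label, as built in cell `In [10]`, and against a next-day label.

```python
import pandas as pd

toy = pd.DataFrame({"close": [10.0, 11.0, 10.5, 12.0, 11.8]})
toy["price_change"] = toy["close"].diff()  # close minus previous close

# Label built from the same day's move (what the In [10] loop produces).
same_day = (toy["close"].diff() > 0).astype(int)
# Label built from the next day's move (what TODO2 describes).
next_day = (toy["close"].shift(-1) > toy["close"]).astype(int)

# A one-split "model": predict 1 whenever price_change is positive.
rule = (toy["price_change"] > 0).astype(int)

print((rule == same_day).mean())  # 1.0: the feature reproduces the same-day label
print((rule == next_day).mean())  # agreement with the next-day label
```

A single split on `price_change` fits the same-day label exactly, which is consistent with the grid search selecting `max_depth` of 1 at about 0.99 accuracy.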

In [None]:
Answer:


In general, if accuracy is too low, do more feature engineering and add features, for example with simple combinations such as x+y or x*y.
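
A minimal sketch of such combined features follows. The `toy` frame and the new column names `open_plus_close` and `body_x_volume` are made up for illustration; only `open`, `close` and `volume` come from the stock data.

```python
import pandas as pd

# Hypothetical toy rows using column names from the stock DataFrame.
toy = pd.DataFrame({
    "open":   [10.0, 11.0, 10.5],
    "close":  [11.0, 10.5, 12.0],
    "volume": [100.0, 150.0, 120.0],
})

# x + y style feature: sum of two existing columns.
toy["open_plus_close"] = toy["open"] + toy["close"]
# x * y style feature: intraday move scaled by traded volume.
toy["body_x_volume"] = (toy["close"] - toy["open"]) * toy["volume"]

print(toy[["open_plus_close", "body_x_volume"]])
```

Whether such features help has to be checked on held-out data; adding columns does not guarantee better accuracy.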

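Further reading: This project shows that the core of the work is the individual indicators, each constructed from a great deal of experience. More complex indicators are hard to come up with by hand. Question: could a computer learn useful indicators on its own? For example, it might learn an indicator = (close - open) * volume - close * close - open. Such an indicator is hard to interpret, but it might be effective. Could AI do this? If you are interested, see the link below: https://www.baidu.com/link?url=WmpaRS35js8T8gAUzaF6_rvdepe0OqpgmeU0fTxhXzMZnKCUXIECQeUFB6VTpFjg&wd=&eqid=b04a03b600117ba2000000035d88bea9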