# 【計程期末報告-LSTM預測大盤漲跌】
## 【組員相關資訊】
#### 105207325 統計四 謝佳叡-資料抓取與尋找
#### 108306089 資管一 許柏穎-資料處理與模型建立
#### 108208071 經濟一 王聖棨-資料處理與抓取
#### 107405171 傳一丙 陳芷萱-參數優化與報告編輯

### 【使用套件】
#### talib-
##### 用來生成需要的技術指標
#### StringIO-
##### 將讀取到的文字檔以文件的方式餵給read_csv

### 【資料來源】
#### 原本打算直接從台灣證券交易所抓取資料，但是因為抓取該網站資料時，好像無法太頻繁的抓取，所以改用已經有人抓下來的資料，下方程式碼有直接從網站讀取的方法，但資料使用上還是用現有的資料進行。

### 【動機】
#### 原本是想預測哪些股票會漲，但是因為資料方面可能無法一下收集到一千多支股票的歷史資料，所以改變方向先以大盤的資料做預測。主要是以每三十分鐘的資料去預測第三十分鐘資料再隔天同一時間的漲跌。如果return超過1表示長，小於1表示跌。

### 【實做流程】
#### 1.到台灣證券交易所抓取資料，但因為該網站有限制無法過於頻繁的訪問加上資料量有點大，所以使用他人整理好的pickle檔
#### 2.資料處理方面，因為讀到的資料為每分鐘的指數資料，可能有將近一百萬筆，所以將資料切成每十五分鐘一筆。
#### 3.技術指標方面，因為有talib這個套件所以可以快速生成很多指標
#### 4.以LSTM訓練模型


## 【資料爬蟲】

In [None]:
import tqdm
# 時間物件
import datetime

# 下載網頁用的
import requests

# 資料處理
import pandas as pd

# 檔案串流
from io import StringIO

import time

In [None]:
from datetime import timedelta, date

#做date的loop
def daterange(start_date, end_date):
    for n in range(int ((end_date - start_date).days)):
        yield start_date + timedelta(n)

# for single_date in daterange(start_date, end_date):
#     print(single_date.strftime("%Y%m%d"))

In [None]:
#讀取網站資料
def requests_get(*args1, **args2):
    i = 3
#     time.sleep(1)
    while i >= 0:
        try:
            return requests.get(*args1, **args2)
        except (ConnectionError, ReadTimeout) as error:
            print(error)
            print('retry one more time after 60s', i, 'times left')
            time.sleep(60)
        i -= 1
    return pd.DataFrame()

In [None]:
#整理抓到的資料
def crawl_benchmark(date):

    date_str = date.strftime('%Y%m%d')
    res = requests_get("https://www.twse.com.tw/exchangeReport/MI_5MINS_INDEX?response=csv&date=" +
                       date_str + "&_=1544020420045")

    # 利用 pandas 將資料整理成表格

    if len(res.text) < 10:
        return pd.DataFrame()

    df = pd.read_csv(StringIO(res.text.replace("=","")), header=1, index_col='時間')

    # 資料處理

    df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
    df.index = pd.to_datetime(date.strftime('%Y %m %d ') + pd.Series(df.index))
    df = df.apply(lambda s: s.astype(str).str.replace(",", "").astype(float))
    df = df.reset_index().rename(columns={'時間':'date'})
    df['stock_id'] = '台股指數'
    return df.set_index(['stock_id', 'date'])

In [None]:
#設定日期並抓取
start_date = datetime.date(2006, 1, 1)
end_date = datetime.date(2020, 1, 1)
df = pd.DataFrame()
for single_date in tqdm.tqdm_notebook(daterange(start_date, end_date)):
#     print(single_date.strftime("%Y%m%d"))
    try:
        df = df.append(crawl_benchmark(single_date))
    except:
        pass

## 【主要程式】

In [2]:
import pandas as pd
import numpy as np

### benchmark.pkl檔案連結--->https://drive.google.com/file/d/18if_0-UIzJDRa9BEz3Qo8RALk6UAkKeH/view?usp=sharing

In [3]:
twii = pd.read_pickle('final_project/benchmark.pkl')

In [4]:
twii.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,發行量加權股價指數,未含金融保險股指數,未含電子股指數,未含金融電子股指數,水泥類指數,食品類指數,塑膠類指數,紡織纖維類指數,電機機械類指數,電器電纜類指數,...,資訊服務類指數,其他電子類指數,建材營造類指數,航運類指數,觀光類指數,金融保險類指數,貿易百貨類指數,油電燃氣類指數,其他類指數,百貨貿易類指數
stock_id,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
台股指數,2006-01-02 09:00:00,6548.34,5386.78,8267.61,6181.48,64.75,261.24,142.63,227.19,75.93,39.75,...,,,153.06,79.83,83.86,891.39,,,92.46,82.55
台股指數,2006-01-02 09:01:00,6457.61,5308.85,8161.84,6094.52,63.6,257.25,142.1,223.31,74.85,39.09,...,,,147.73,76.56,78.09,881.77,,,91.25,80.5
台股指數,2006-01-02 09:02:00,6452.82,5304.25,8154.32,6085.76,63.52,256.05,142.05,222.7,74.76,39.34,...,,,147.91,76.29,78.09,881.67,,,90.98,80.39
台股指數,2006-01-02 09:03:00,6452.39,5305.04,8154.18,6090.04,63.68,258.07,142.1,223.37,74.9,39.29,...,,,147.15,76.32,78.09,880.66,,,90.87,80.99
台股指數,2006-01-02 09:04:00,6451.61,5305.22,8146.6,6084.24,63.47,256.16,141.94,223.42,74.88,39.2,...,,,147.18,76.32,78.09,879.87,,,90.8,80.86


In [5]:
#只要發行量加權股價指數的部分
twii = pd.DataFrame(twii['發行量加權股價指數'])

## 得到只有時間跟指數的Series

In [6]:
twii = twii.reset_index()

In [7]:
twii = twii.set_index('date')

In [8]:
twii = twii.drop(columns='stock_id',axis=0)

In [9]:
twii_s = twii['發行量加權股價指數']

In [10]:
#將每分鐘改成每15分鐘
twii_s=twii_s.resample("15T").first().dropna()

In [11]:
twii_s

date
2006-01-02 09:00:00     6548.34
2006-01-02 09:15:00     6478.09
2006-01-02 09:30:00     6474.88
2006-01-02 09:45:00     6471.12
2006-01-02 10:00:00     6480.50
2006-01-02 10:15:00     6484.66
2006-01-02 10:30:00     6455.34
2006-01-02 10:45:00     6445.31
2006-01-02 11:00:00     6431.08
2006-01-02 11:15:00     6441.96
2006-01-02 11:30:00     6444.52
2006-01-02 11:45:00     6450.95
2006-01-02 12:00:00     6460.38
2006-01-02 12:15:00     6458.52
2006-01-02 12:30:00     6450.34
2006-01-02 12:45:00     6445.74
2006-01-02 13:00:00     6449.78
2006-01-02 13:15:00     6460.24
2006-01-02 13:30:00     6462.06
2006-01-03 09:00:00     6462.06
2006-01-03 09:15:00     6458.16
2006-01-03 09:30:00     6484.40
2006-01-03 09:45:00     6487.84
2006-01-03 10:00:00     6496.69
2006-01-03 10:15:00     6503.62
2006-01-03 10:30:00     6514.61
2006-01-03 10:45:00     6505.29
2006-01-03 11:00:00     6511.18
2006-01-03 11:15:00     6514.83
2006-01-03 11:30:00     6540.67
                         ...   
201

## 製作training data

In [12]:
import talib

#每30hr均值
sma = talib.SMA(twii_s, timeperiod=120)

k, d = talib.STOCH  (twii_s, twii_s, twii_s, fastk_period=120, slowk_period=60, slowd_period=60)
k2, d2 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=240, slowk_period=120, slowd_period=120)
k3, d3 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=360, slowk_period=180, slowd_period=180)
k4, d4 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=480, slowk_period=240, slowd_period=240)
k5, d5 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=640, slowk_period=320, slowd_period=320)
k6, d6 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=720, slowk_period=360, slowd_period=360)
k7, d7 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=840, slowk_period=420, slowd_period=420)
k8, d8 = talib.STOCH(twii_s, twii_s, twii_s, fastk_period=960, slowk_period=480, slowd_period=480)

rsi = talib.RSI (twii_s, timeperiod=120)
rsi2 = talib.RSI(twii_s, timeperiod=240)
rsi3 = talib.RSI(twii_s, timeperiod=480)
rsi4 = talib.RSI(twii_s, timeperiod=640)
rsi5 = talib.RSI(twii_s, timeperiod=720)
rsi6 = talib.RSI(twii_s, timeperiod=840)

dataset = pd.DataFrame({
#     RSI是以某段時間，股價”平均漲幅”與”平均跌幅”所計算出來的數值，可看出股價觀測時間內股票價格強勢或弱勢的指標。
    'RSIb': rsi / 50,
    'RSIb2': rsi2 / 50,
    'RSIb3': rsi3 / 50,
    'RSIb4': rsi4 / 50,
    'RSIb5': rsi5 / 50,
    'RSIb6': rsi6 / 50,
    
#     K值反應市場價格的速度較D值來的快，波動亦較D值來的大，所以當K值與D值交叉時被稱為黃金交叉或死亡交叉，是買進或賣出的指標訊號。
    'KDb': k - d,
    'KDb2': k2 - d2,
    'KDb3': k3 - d3,
    'KDb4': k4 - d4,
    'KDb5': k5 - d5,
    'KDb6': k6 - d6,
    'KDb7': k7 - d7,
    'KDb8': k8 - d8,
    
    'a5':   (twii_s.rolling(5).mean()   / twii_s),
    'a10':  (twii_s.rolling(10).mean()  / twii_s),
    'a20':  (twii_s.rolling(20).mean()  / twii_s),
    'a40':  (twii_s.rolling(40).mean()  / twii_s),
    'a80':  (twii_s.rolling(80).mean()  / twii_s),
    'a160': (twii_s.rolling(160).mean() / twii_s),
    'a320': (twii_s.rolling(320).mean() / twii_s),
    'a640': (twii_s.rolling(640).mean() / twii_s),
    'a720': (twii_s.rolling(720).mean() / twii_s),
    'a840': (twii_s.rolling(840).mean() / twii_s),
    'a960': (twii_s.rolling(960).mean() / twii_s),
    'a1024':(twii_s.rolling(1024).mean() / twii_s),

#     adxr半段市場趨勢，當adxr值越高標是市場會有上漲會是下跌的趨勢出現，這時再配合其他指標判斷趨勢
    'ADXR0': talib.ADXR(twii_s, twii_s, twii_s, 60),
    'ADXR1': talib.ADXR(twii_s, twii_s, twii_s, 120),
    'ADXR2': talib.ADXR(twii_s, twii_s, twii_s, 240),
    'ADXR3': talib.ADXR(twii_s, twii_s, twii_s, 360),
    'ADXR4': talib.ADXR(twii_s, twii_s, twii_s, 480),
    'ADXR5': talib.ADXR(twii_s, twii_s, twii_s, 640),
    
    #隔天同一時間的漲跌
    'return': twii_s.shift(-19) / twii_s,
})


#traing資料為return前面的features，return是要預測的結果
feature_names = list(dataset.columns[:-1])

In [13]:
dataset

Unnamed: 0_level_0,RSIb,RSIb2,RSIb3,RSIb4,RSIb5,RSIb6,KDb,KDb2,KDb3,KDb4,...,a840,a960,a1024,ADXR0,ADXR1,ADXR2,ADXR3,ADXR4,ADXR5,return
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2006-01-02 09:00:00,,,,,,,,,,,...,,,,,,,,,,0.986824
2006-01-02 09:15:00,,,,,,,,,,,...,,,,,,,,,,0.996923
2006-01-02 09:30:00,,,,,,,,,,,...,,,,,,,,,,1.001470
2006-01-02 09:45:00,,,,,,,,,,,...,,,,,,,,,,1.002584
2006-01-02 10:00:00,,,,,,,,,,,...,,,,,,,,,,1.002498
2006-01-02 10:15:00,,,,,,,,,,,...,,,,,,,,,,1.002924
2006-01-02 10:30:00,,,,,,,,,,,...,,,,,,,,,,1.009182
2006-01-02 10:45:00,,,,,,,,,,,...,,,,,,,,,,1.009306
2006-01-02 11:00:00,,,,,,,,,,,...,,,,,,,,,,1.012455
2006-01-02 11:15:00,,,,,,,,,,,...,,,,,,,,,,1.011312


In [14]:
feature_names

['RSIb',
 'RSIb2',
 'RSIb3',
 'RSIb4',
 'RSIb5',
 'RSIb6',
 'KDb',
 'KDb2',
 'KDb3',
 'KDb4',
 'KDb5',
 'KDb6',
 'KDb7',
 'KDb8',
 'a5',
 'a10',
 'a20',
 'a40',
 'a80',
 'a160',
 'a320',
 'a640',
 'a720',
 'a840',
 'a960',
 'a1024',
 'ADXR0',
 'ADXR1',
 'ADXR2',
 'ADXR3',
 'ADXR4',
 'ADXR5']

## 將nan去掉

In [15]:
dataset = dataset.dropna()

In [16]:
dataset

Unnamed: 0_level_0,RSIb,RSIb2,RSIb3,RSIb4,RSIb5,RSIb6,KDb,KDb2,KDb3,KDb4,...,a840,a960,a1024,ADXR0,ADXR1,ADXR2,ADXR3,ADXR4,ADXR5,return
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2006-06-05 13:30:00,0.864716,0.914157,0.961617,0.976392,0.981303,0.986445,12.844539,7.388285,-7.738919,-31.812805,...,1.045242,1.034321,1.030166,7.829569,9.211663,7.882150,6.566537,4.302512,2.499070,1.002234
2006-06-06 09:00:00,0.864716,0.914157,0.961617,0.976392,0.981303,0.986445,11.648423,7.191792,-7.594827,-31.730343,...,1.045273,1.034368,1.030197,7.909202,9.237814,7.871600,6.574175,4.307650,2.502219,1.002234
2006-06-06 09:15:00,0.842621,0.901533,0.954257,0.970556,0.976010,0.981798,10.615518,7.005822,-7.425890,-31.644605,...,1.052497,1.041535,1.037321,8.046780,9.276159,7.862430,6.583442,4.313284,2.506139,0.999177
2006-06-06 09:30:00,0.862835,0.911942,0.959761,0.974790,0.979811,0.985100,9.994657,6.895210,-7.225996,-31.552207,...,1.047549,1.036653,1.032444,8.148676,9.303652,7.850631,6.591503,4.317835,2.509662,0.996675
2006-06-06 09:45:00,0.875228,0.918381,0.963182,0.977425,0.982178,0.987158,9.469552,6.810684,-7.009980,-31.453328,...,1.044490,1.033643,1.029433,8.196932,9.320128,7.836315,6.598789,4.321293,2.513129,0.994166
2006-06-06 10:00:00,0.871331,0.916165,0.961892,0.976402,0.981250,0.986343,8.653479,6.686504,-6.798808,-31.367755,...,1.045775,1.034933,1.030703,8.250235,9.342005,7.823425,6.606835,4.324878,2.516551,0.996174
2006-06-06 10:15:00,0.855721,0.907235,0.956675,0.972261,0.977494,0.983042,7.637697,6.526318,-6.639958,-31.292043,...,1.050925,1.040055,1.035791,8.316409,9.370204,7.812171,6.615822,4.328929,2.520140,1.000731
2006-06-06 10:30:00,0.852850,0.905583,0.955707,0.971493,0.976796,0.982429,6.426085,6.355877,-6.477467,-31.221424,...,1.051903,1.041050,1.036766,8.360815,9.374061,7.800044,6.625302,4.333483,2.523824,0.997240
2006-06-06 10:45:00,0.859550,0.909047,0.957542,0.972905,0.978064,0.983532,5.048696,6.086491,-6.294426,-31.154689,...,1.050249,1.039442,1.035147,8.394145,9.374905,7.786848,6.634922,4.337511,2.527477,0.994460
2006-06-06 11:00:00,0.857327,0.907777,0.956801,0.972317,0.977530,0.983062,3.566401,5.805777,-6.117607,-31.098527,...,1.051000,1.040212,1.035897,8.429741,9.379003,7.774126,6.644063,4.341603,2.531271,0.995575


In [17]:
dataset.shape

(61295, 33)

## 將資料標準化

In [18]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

dataset_scaled = ss.fit_transform(dataset)
dataset_scaled = pd.DataFrame(dataset_scaled,columns = dataset.columns,index=dataset.index)

#return 不用標準化
dataset_scaled['return'] = dataset['return']
dataset_scaled.head()

Unnamed: 0_level_0,RSIb,RSIb2,RSIb3,RSIb4,RSIb5,RSIb6,KDb,KDb2,KDb3,KDb4,...,a840,a960,a1024,ADXR0,ADXR1,ADXR2,ADXR3,ADXR4,ADXR5,return
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2006-06-05 13:30:00,-1.398341,-1.289554,-0.921599,-0.74102,-0.671093,-0.593106,0.806784,0.470827,-0.514256,-2.091728,...,0.928817,0.656515,0.558504,-1.257293,-0.01171,0.594211,0.667941,-0.329303,-1.386946,1.002234
2006-06-06 09:00:00,-1.398341,-1.289554,-0.921599,-0.74102,-0.671093,-0.593106,0.731594,0.45819,-0.50477,-2.086303,...,0.929441,0.657373,0.559056,-1.237131,-0.002267,0.589065,0.67259,-0.325552,-1.384276,1.002234
2006-06-06 09:15:00,-1.598725,-1.44961,-1.052721,-0.861078,-0.786625,-0.702745,0.666664,0.446229,-0.493648,-2.080663,...,1.074511,0.790006,0.68577,-1.2023,0.011578,0.584593,0.67823,-0.321439,-1.380952,0.999177
2006-06-06 09:30:00,-1.415401,-1.317635,-0.954662,-0.773975,-0.703648,-0.624833,0.627635,0.439115,-0.480488,-2.074584,...,0.975157,0.699671,0.599027,-1.176502,0.021505,0.578838,0.683136,-0.318116,-1.377966,0.996675
2006-06-06 09:45:00,-1.303016,-1.236005,-0.893718,-0.719767,-0.651983,-0.576293,0.594626,0.433679,-0.466266,-2.068079,...,0.913717,0.643972,0.545467,-1.164285,0.027455,0.571855,0.687571,-0.315592,-1.375027,0.994166


In [19]:
len(dataset_scaled)

61295

In [20]:
#顯示進度條
import tqdm

n = 3 #每三個時間點預設一個return
x = [] #放入每三個時間點為一組的features
y = [] #放入最後一個時間點的return
indexes = [] #計入當前index
dataset_scaled_x = dataset_scaled[feature_names]

for i in tqdm.tqdm_notebook(range(0,len(dataset_scaled)-n)):
    x.append(dataset_scaled_x.iloc[i:i+n].values)
    y.append(dataset_scaled['return'].iloc[i+n-1])
    indexes.append(dataset_scaled.index[i+n-1])

HBox(children=(IntProgress(value=0, max=61292), HTML(value='')))




In [21]:
x

[array([[-1.39834114e+00, -1.28955381e+00, -9.21598614e-01,
         -7.41020323e-01, -6.71093074e-01, -5.93105847e-01,
          8.06784093e-01,  4.70827334e-01, -5.14256338e-01,
         -2.09172838e+00, -2.16137635e+00, -1.82966831e+00,
         -1.33665392e+00, -9.59500415e-01,  2.67885919e+00,
          3.92776583e+00,  3.67791610e+00,  2.84806451e+00,
          1.89953066e+00,  1.33349338e+00,  1.58140959e+00,
          1.39136193e+00,  1.19654303e+00,  9.28817232e-01,
          6.56515459e-01,  5.58504382e-01, -1.25729279e+00,
         -1.17097235e-02,  5.94210946e-01,  6.67941057e-01,
         -3.29302830e-01, -1.38694590e+00],
        [-1.39834114e+00, -1.28955381e+00, -9.21598614e-01,
         -7.41020323e-01, -6.71093074e-01, -5.93105847e-01,
          7.31594058e-01,  4.58189984e-01, -5.04769966e-01,
         -2.08630348e+00, -2.16742827e+00, -1.83954980e+00,
         -1.34508482e+00, -9.68232256e-01,  1.38004134e+00,
          3.29684764e+00,  3.40055148e+00,  2.78493082e+

In [22]:
x[0].shape
# <datetime.datetime(2016,1,1)

(3, 32)

In [23]:
import numpy as np
x = np.array(x)
y = np.array(y)

In [24]:
indexes = np.array(indexes)

In [25]:
import datetime
indexes<datetime.datetime(2016,1,1)

array([ True,  True,  True, ..., False, False, False])

## 建構神經網絡

In [37]:
import keras

model = keras.Sequential()
model.add(keras.layers.LSTM(100,return_sequences=True,input_shape=x[0].shape))
model.add(keras.layers.LSTM(100))
model.add(keras.layers.Dense(8))
model.add(keras.layers.Dense(1,kernel_initializer='uniform',activation='linear'))
adam = keras.optimizers.Adam(0.0006)
model.compile(optimizer=adam,loss = "binary_crossentropy",metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 3, 100)            53200     
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_7 (Dense)              (None, 8)                 808       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 9         
Total params: 134,417
Trainable params: 134,417
Non-trainable params: 0
_________________________________________________________________


## 神經網路訓練

In [38]:
import datetime
#2016年以前的data作為訓練資料
x_train = x[indexes < datetime.datetime(2016, 1 ,1)]
y_train = y[indexes < datetime.datetime(2016, 1, 1)]
#2016年以後的data做為測試資料
x_test = x[indexes >= datetime.datetime(2016, 1 ,1)]
y_test = y[indexes >= datetime.datetime(2016, 1, 1)]

#以val_acc,val_loss為依據，將訓練最好的model存起來。
get_best_model = keras.callbacks.ModelCheckpoint("lstm.mdl", monitor="val_acc")
get_best_model2 = keras.callbacks.ModelCheckpoint("lstm2.mdl", monitor="val_loss")
history = model.fit(
    x_train,  
    y_train > 1, 
    batch_size=3000, 
    epochs=300, 
    validation_split=0.3, 
    callbacks=[get_best_model,get_best_model2])

Train on 31653 samples, validate on 13566 samples
Epoch 1/300
Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/3

## 叫回剛剛覺得最好的兩個model

In [44]:
model.load_weights("lstm.mdl")
# model_loss = model.load_weights("lstm2.mdl")

## 回測:當model預測數字超過0.5時就當作會上漲

In [67]:
x_test = x[indexes >= datetime.datetime(2016, 1 ,1)]
y_test = y[indexes >= datetime.datetime(2016, 1, 1)]

### 回測第一個使用val_acc作為best model標準的最佳模型做預測，準確率為:25.45%

In [68]:
Y1 = model.predict(x_test)
# Y2 = model_loss.predict(x_test)

In [96]:
%matplotlib inline
import matplotlib.pyplot as plt
indexs_test=indexes[indexes >= datetime.datetime(2016, 1, 1)]
Y1[Y1>0.5]=1

y_test.shape
Y1=Y1.reshape(16073,)

In [101]:
y_test[(y_test-Y1)>0].size/y_test.size

0.2545262241025322

### 回測第一個使用val_loss作為best model標準的最佳模型做預測，準確率為:77.68%

In [102]:
model_loss = model.load_weights("lstm2.mdl")

In [103]:
x_test = x[indexes >= datetime.datetime(2016, 1 ,1)]
y_test = y[indexes >= datetime.datetime(2016, 1, 1)]
Y2 = model.predict(x_test)

In [104]:
%matplotlib inline
import matplotlib.pyplot as plt
indexs_test=indexes[indexes >= datetime.datetime(2016, 1, 1)]
Y2[Y2>0.5]=1

y_test.shape
Y2=Y2.reshape(16073,)
y_test[(y_test-Y2)>0].size/y_test.size

0.7768307098861444

## 結論與心得

<p text-size=100px>在做最後一次的training前，前面還有做過許多次的training，像是adam的learning rate由原本的預設值改為較小的0.0006，透過這樣的改變可以使loss function在監測loss時更能找到適合的函數。再來還有batch_size由原本的5000降到3000，也使得訓練出來的模型更好。最後就是在訓練完後，我們發現好像不是最後訓練出來的模型就是最好的，因此設了callback.modelcheckpoint的方法，透過給予特定參數讓訓練最好的模型被挑出來，而不是用最後的訓練出的那個。透過結果，我們也發現，以這個案例來看如果只是單看accuracy，得到的結果並不是非常理想，但是以loss來挑出最佳模型，得到的結過在此案例來看遠好於單看accuracy。</p>

整體來說，我們覺得這次專案較困難的部分為資料的爬取跟處理，因為在爬取方面，網站有一定的訪問限制，因此還要設間隔時間，但是我們需要的資料量蠻大的所以即便開一個禮拜可能也跑不完。所以就找了別人處理好的資料來用，在資料處理部分，一開始因為不知道要訓練甚麼，想了一段時間，最後才決定用隔天同一時間的指數漲跌作為訓練的資料，決定好後還要對資料做許多處理最後才能送去訓練，也因為是以date作為Index所以處理起來也蠻複雜的，但經過這次的經驗，下次在處理相關問題時也會較上手。

# 1.請詳細列出專案的參考資料，包含資料來源、參考程式等

## 專案參考資料:
     -https://www.finlab.tw/%E8%B6%85%E7%B0%A1%E5%96%AE%E5%8F%B0%E8%82%A1%E6%AF%8F%E6%97%A5%E7%88%AC%E8%9F%B2%E6%95%99%E5%AD%B8/
     -https://www.finlab.tw/Python-%E8%B2%A1%E5%A0%B1%E7%88%AC%E8%9F%B2-1-%E7%B6%9C%E5%90%88%E6%90%8D%E7%9B%8A%E8%A1%A8/
## 資料來源:
     -https://www.twse.com.tw/zh/page/trading/exchange/MI_5MINS_INDEX.html

# 2.請詳細解釋主要是以每三十分鐘的資料去預測第三十分鐘資料再隔天同一時間的漲跌這個策略的理由及好處

之所以選擇以三十分鐘的資料去預測是因為，每十五分鐘會有一筆資料，每筆資料裡面會有32個features要做訓練，所以以訓練資料來說我們認為是足夠的。再來對於為甚麼要用隔天的同一時間來預測漲跌，不要用下一個時段或同一天的其他時段做預測，是因為我們認為如果只是去預測同一天的資料，那麼可能會因為同一天資料的差距太小而無法有效的訓練，再加上股市可能會在隔一天因為法人的買賣超有巨大的變化，如果希望模型將這個情況考慮進去用隔天的漲跌去預測會有比較好的效果。當然因為LSTM會有記憶的效果，所以不一定只能用隔天來判斷漲跌，可以一個禮拜後、一個月、或是當天的其他時段都可以，之後還可以試試看用拿一種方法會有比較好的效果。

# 3.請解釋在動機時你們有說如果return超過1表示長，小於1表示跌這個動機與最後的模型輸出不太一樣

輸出的是資料是模型認為會上漲的機率，我們將模型認為上漲機率超過0.5的情況視為上漲，上漲的意思就是return>1。