# Data Science / Machine Learning Meetup #1 Deep Learning Hands-on
# Alternative Data and Natural Language Processing

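## Introduction

The exercise is outlined below.
1. [Environment setup](#environment-setup)
1. [Web scraping](#web-scraping)
1. [Sentiment analysis](#sentiment-analysis)
    1. Preprocessing
    1. Building the neural network
    1. Training
    1. Prediction

Please note the following.
- Some of the code you run contains parts that you need to change to match your own user name.

## Environment setup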

### Installing and importing packages

In [1]:
!pip3 install ipython-sql==0.3.9
!pip3 install PyHive==0.6.1
!pip3 install SQLAlchemy==1.3.13
!pip3 install thrift==0.13.0
!pip3 install sasl==0.2.1
!pip3 install thrift_sasl==0.3.0


!pip3 install nltk==3.4.5
!pip3 install torch==1.4.0

Collecting ipython-sql==0.3.9
  Downloading https://files.pythonhosted.org/packages/ab/df/427e7cf05ffc67e78672ad57dce2436c1e825129033effe6fcaf804d0c60/ipython_sql-0.3.9-py2.py3-none-any.whl
Collecting prettytable (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/ef/30/4b0746848746ed5941f052479e7c23d2b56d174b82f4fd34a25e389831f5/prettytable-0.7.2.tar.bz2
Collecting sqlparse (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/ef/53/900f7d2a54557c6a37886585a91336520e5539e3ae2423ff1102daf4f3a7/sqlparse-0.3.0-py2.py3-none-any.whl
Collecting sqlalchemy>=0.6.7 (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/af/47/35edeb0f86c0b44934c05d961c893e223ef27e79e1f53b5e6f14820ff553/SQLAlchemy-1.3.13.tar.gz (6.0MB)
Building wheels for collected packages: prettytable, sqlalchemy
  Building wheel for prettytable (setup.py) ... done

  Building wheel for nltk (setup.py) ... done
  Stored in directory: /home/cdsw/.cache/pip/wheels/96/86/f6/68ab24c23f207c0077381a5e3904b2815136b879538a24b483
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting torch==1.4.0
  Downloading https://files.pythonhosted.org/packages/24/19/4804aea17cd136f1705a5e98a00618cb8f6ccc375ad8bfa437408e09d058/torch-1.4.0-cp36-cp36m-manylinux1_x86_64.whl (753.4MB)
Installing collected packages: torch
Successfully installed torch-1.4.0
You should consider upgrading via t

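PyHive, installed above, is needed not because the Python code imports it directly but because it provides the dialect named in the Hive connection string: `sqlalchemy.create_engine('hive://<host>:<port>')`. For that reason, a new process has to be started before the package can be used. **After installing, restart the kernel once.** In the process that performed the installation, connecting fails with an error like the following:
`NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hive`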
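
Before restarting the kernel, it can help to confirm that the installed packages are visible to Python at all. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list of module names is an assumption about how the packages above are exposed (for example, ipython-sql is loaded as `sql`). It only checks that each top-level module can be found, not that the `hive` dialect is registered with SQLAlchemy.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed top-level module names for the packages installed above.
required = ["sql", "pyhive", "sqlalchemy", "thrift", "sasl", "thrift_sasl", "nltk", "torch"]
print("Missing modules:", missing_packages(required))
```

An empty list suggests the installation succeeded; the kernel restart is still required before connecting to Hive.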

In [1]:
# Standard library
import glob
import json
import os
import random
import re
import subprocess
import sys
import traceback
from datetime import datetime
from operator import add

# Hive connectivity
from pyhive import hive
import sqlalchemy

# Spark
from pyspark.sql import SparkSession

# Text processing and deep learning
import nltk
import torch
from torch import nn, optim
import torch.nn.functional as F

## Web scraping

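This exercise uses an API that is available free of charge, so please keep in mind that certain usage limits apply.
For example, depending on how much you have used it, you may receive an error message like the following.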

```
{"response":{"status":429},"errors":[{"message":"Rate limit exceeded. Client may not make more than 200 requests an hour."}]}
```
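
Since the status code is part of the response body, a script can tell a rate-limited response from a successful one before saving it. A minimal sketch, assuming the body has the shape shown above; `response_status` is a hypothetical helper, not part of the API.

```python
import json


def response_status(payload_text):
    """Return the status code embedded in a response body, or None if it cannot be read."""
    try:
        payload = json.loads(payload_text)
    except ValueError:
        return None  # not valid JSON (for example, an empty body)
    if not isinstance(payload, dict):
        return None
    response = payload.get("response")
    if not isinstance(response, dict):
        return None
    return response.get("status")


rate_limited = ('{"response":{"status":429},"errors":[{"message":'
                '"Rate limit exceeded. Client may not make more than 200 requests an hour."}]}')
print(response_status(rate_limited))  # 429
```
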
First, save the data retrieved through the API as files inside the CDSW project.

The candidate ticker symbols to retrieve are defined in `ticker.txt`.

In [2]:
# Read one ticker symbol per line; the with-block closes the file automatically.
with open("ticker.txt") as ticker_file:
    data = ticker_file.readlines()

ticker_list = [i.rstrip('\n') for i in data]

print(len(ticker_list))
print(ticker_list)

2882
['A', 'AA', 'AAL', 'AAN', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAWW', 'AAXN', 'ABBV', 'ABC', 'ABCB', 'ABEO', 'ABG', 'ABM', 'ABMD', 'ABT', 'ABTX', 'ACA', 'ACAD', 'ACCO', 'ACEL', 'ACGL', 'ACHC', 'ACHN', 'ACHV', 'ACIA', 'ACIW', 'ACLS', 'ACM', 'ACN', 'ACNB', 'ACOR', 'ACRS', 'ACRX', 'ACTG', 'ADBE', 'ADES', 'ADI', 'ADM', 'ADMA', 'ADMP', 'ADMS', 'ADP', 'ADPT', 'ADRO', 'ADS', 'ADSK', 'ADSW', 'ADT', 'ADTN', 'ADUS', 'ADVM', 'ADXS', 'AE', 'AEE', 'AEGN', 'AEIS', 'AEL', 'AEM', 'AEMD', 'AEO', 'AEP', 'AERI', 'AES', 'AFG', 'AFI', 'AFL', 'AG', 'AGCO', 'AGEN', 'AGFS', 'AGI', 'AGIO', 'AGLE', 'AGM', 'AGN', 'AGO', 'AGR', 'AGRX', 'AGS', 'AGTC', 'AGX', 'AGYS', 'AHC', 'AHCO', 'AIG', 'AIMC', 'AIMT', 'AIN', 'AIR', 'AIRG', 'AIRT', 'AIT', 'AIZ', 'AJG', 'AJRD', 'AKAM', 'AKBA', 'AKCA', 'AKRO', 'AKRX', 'AKS', 'AL', 'ALB', 'ALCO', 'ALDX', 'ALE', 'ALEC', 'ALG', 'ALGN', 'ALGT', 'ALIM', 'ALK', 'ALKS', 'ALL', 'ALLK', 'ALLO', 'ALLY', 'ALNY', 'ALOT', 'ALPN', 'ALRM', 'ALRN', 'ALSK', 'ALSN', 'ALT', 'ALTR', 'ALV', 'ALXN', 'AM

In [3]:
!mkdir -p ./data

In [4]:
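# Start from a fixed set of symbols.
symbols = ['BBRY', 'AAPL', 'AMZN', 'BABA', 'YHOO', 'FB', 'GOOG', 'BBBY', 'JNUG', 'SBUX', 'MU']

# The API allows 200 requests an hour, so fill the rest of that budget with random tickers.
NUM_REQUEST = 200 - len(symbols)

random.seed(12345)  # fixed seed so that the sample is reproducible
symbols.extend(random.sample(ticker_list, NUM_REQUEST))

args = ['curl', '-X', 'GET', '']  # the last element is replaced by the request URL
URL = "https://api.stocktwits.com/api/2/streams/symbol/"

FILE_PATH = "./data/"

# One timestamp for the whole run, used in every output file name.
start_datetime = datetime.now().strftime("%Y%m%d_%H%M")
for symbol in symbols:
    try:
        args[3] = URL + symbol + ".json"
        print(args[3])
        proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        path = FILE_PATH + symbol + "_" + start_datetime + ".json"
        print(path)
        with open(path, mode='w') as f:
            f.write(proc.stdout.decode("utf8"))
    except Exception:
        # Report the failure and continue with the next symbol.
        traceback.print_exc()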

https://api.stocktwits.com/api/2/streams/symbol/BBRY.json
./data/BBRY_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/AAPL.json
./data/AAPL_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/AMZN.json
./data/AMZN_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/BABA.json
./data/BABA_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/YHOO.json
./data/YHOO_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/FB.json
./data/FB_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/GOOG.json
./data/GOOG_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/BBBY.json
./data/BBBY_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/JNUG.json
./data/JNUG_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/SBUX.json
./data/SBUX_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/MU.json
./data/MU_20200203_0609.json
https://api.stocktwits.com/ap

./data/IMMR_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/ADUS.json
./data/ADUS_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/AR.json
./data/AR_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/ATO.json
./data/ATO_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/NRC.json
./data/NRC_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/BCC.json
./data/BCC_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/MATX.json
./data/MATX_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/CZZ.json
./data/CZZ_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/ADS.json
./data/ADS_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/LFUS.json
./data/LFUS_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/ENVA.json
./data/ENVA_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/WIRE.json
./data/WIRE_20200203_0609.json
http

./data/XNCR_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/WAFD.json
./data/WAFD_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/ATH.json
./data/ATH_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/FTR.json
./data/FTR_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/MYOK.json
./data/MYOK_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/AOS.json
./data/AOS_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/LBY.json
./data/LBY_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/PZZA.json
./data/PZZA_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/RLI.json
./data/RLI_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/SMED.json
./data/SMED_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/CAG.json
./data/CAG_20200203_0609.json
https://api.stocktwits.com/api/2/streams/symbol/TRU.json
./data/TRU_20200203_0609.json
http

Remove the files that do not contain a successful response status.

In [6]:
!grep -rlv '{"response":{"status":200}' data
!grep -rlv '{"response":{"status":200}' data | xargs rm

data/CSWI_20200203_0609.json
data/INBK_20200203_0609.json
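
The same filtering can be sketched in Python by parsing each file rather than matching text. Note that `grep -lv` lists files containing at least one non-matching line, which coincides with "no successful status" only while each file is a single line. The helper `files_without_ok_status` below is hypothetical and assumes each file holds one JSON document with the status at `response.status`.

```python
import json
from pathlib import Path


def files_without_ok_status(directory):
    """Return the sorted paths of *.json files whose response status is not 200."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            payload = json.loads(path.read_text(encoding="utf8"))
            status = payload["response"]["status"]
        except (ValueError, KeyError, TypeError):
            status = None  # empty, truncated, or unexpected content
        if status != 200:
            bad.append(path)
    return bad


for path in files_without_ok_status("data"):
    print(path)
    # path.unlink()  # uncomment to delete the file, as `xargs rm` does above
```

Printing before deleting makes it possible to review the list first.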


In [8]:
!cat data/*.json > all_data.json

Next, copy the saved file to HDFS so that it can be processed in the distributed processing environment (the cluster).

In [11]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -mkdir -p ./twits; hdfs dfs -put all_data.json ./twits/



In [13]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -ls ./twits

Found 1 items
-rw-r--r--   3 user5 supergroup    9306615 2020-02-03 06:30 twits/all_data.json


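### Data transformation

Transform the data on the cluster. On CDSW each user worked in a separate project, but in the cluster environment you need to stay aware of which user you are acting as and which data you are handling.


Your user name (for accessing the Hadoop cluster) can be checked as follows.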

In [5]:
!echo $HADOOP_USER_NAME

user5
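
Several of the cells that follow embed the user name: the Hive connection URL, the database name, and HDFS paths. One way to keep them consistent is to derive them all from a single value. This is a minimal sketch; `user_settings` is a hypothetical helper, the `/user/<name>` home directory layout is an assumption based on the paths used in this notebook, and the default host and port are those of the workshop cluster.

```python
import os


def user_settings(user, host="master.ykono.work", port=10000):
    """Build the user-specific strings used in this notebook from one user name."""
    home = "/user/" + user  # assumed HDFS home directory layout
    return {
        "hive_url": "hive://{}@{}:{}".format(user, host, port),
        "database": user,
        "hdfs_home": home,
        "twits_location": home + "/twits",
    }


# Fall back to the example user when the environment variable is not set.
settings = user_settings(os.environ.get("HADOOP_USER_NAME", "user5"))
for key, value in settings.items():
    print(key, "=", value)
```
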


### Preparing the database



**Replace the user name and URL (Hive server) in the cell below with the appropriate values.**

In [10]:
sqlalchemy.create_engine('hive://user5@master.ykono.work:10000')

Engine(hive://user5@master.ykono.work:10000)

In [14]:
%load_ext sql

**Replace the user name and URL (Hive server) in the cell below with the appropriate values.**

In [16]:
%sql hive://user5@master.ykono.work:10000

'Connected: user5@None'

**Create and use a database named after your own user name.**

In [17]:
%sql CREATE DATABASE IF NOT EXISTS user5
%sql USE user5
%sql SHOW TABLES

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


tab_name


### Copying and registering the library files

Register the libraries that allow Hive queries to handle JSON files.
The library files are included in the GitHub repository (see `/lib/README.jar` for details about the libraries).
First copy them from CDSW to HDFS, then register the files on HDFS with Hive.

The repository contains the following precompiled library files.
- json-1.3.7.3.jar
- json-serde-cdh5-shim-1.3.7.3.jar
- json-serde-1.3.7.3.jar

- brickhouse-0.7.1-SNAPSHOT.jar

In [19]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -put `ls -1 ./lib/*.jar` .; hdfs dfs -ls .

put: `brickhouse-0.7.1-SNAPSHOT.jar': File exists
put: `json-1.3.7.3.jar': File exists
put: `json-serde-1.3.7.3.jar': File exists
put: `json-serde-cdh5-shim-1.3.7.3.jar': File exists
Found 6 items
-rw-r--r--   3 user5 supergroup    9306615 2020-02-03 06:23 all_data.json
-rw-r--r--   3 user5 supergroup     308146 2020-02-03 06:39 brickhouse-0.7.1-SNAPSHOT.jar
-rw-r--r--   3 user5 supergroup      44477 2020-02-03 06:39 json-1.3.7.3.jar
-rw-r--r--   3 user5 supergroup      36653 2020-02-03 06:39 json-serde-1.3.7.3.jar
-rw-r--r--   3 user5 supergroup       5110 2020-02-03 06:39 json-serde-cdh5-shim-1.3.7.3.jar
drwxr-xr-x   - user5 supergroup          0 2020-02-03 06:30 twits


In [20]:
%sql add jar hdfs:/user/user5/json-1.3.7.3.jar
%sql add jar hdfs:/user/user5/json-serde-1.3.7.3.jar
%sql add jar hdfs:/user/user5/json-serde-cdh5-shim-1.3.7.3.jar

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]
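
The `add jar` statements above follow a regular pattern, so they can be generated from the list of file names, which avoids mistyping the user name in one of them. A sketch assuming the jars were copied to the user's HDFS home directory under `/user/<name>`; `add_jar_statements` is a hypothetical helper.

```python
def add_jar_statements(user, jar_names):
    """Return one Hive ADD JAR statement per jar in the user's HDFS home directory."""
    return ["add jar hdfs:/user/{}/{}".format(user, name) for name in jar_names]


json_jars = [
    "json-1.3.7.3.jar",
    "json-serde-1.3.7.3.jar",
    "json-serde-cdh5-shim-1.3.7.3.jar",
]
for statement in add_jar_statements("user5", json_jars):
    print(statement)
```
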

In [21]:
%sql DROP TABLE IF EXISTS twits
%sql DROP TABLE IF EXISTS message_extracted
%sql DROP TABLE IF EXISTS message_filtered
%sql DROP TABLE IF EXISTS message_exploded
%sql DROP TABLE IF EXISTS sentiment_data

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

Create a table, specifying the location where the social media message files are stored.

**For `LOCATION`, specify the path to which you uploaded the file.**

In [22]:
%%sql
CREATE EXTERNAL TABLE twits (
    messages ARRAY<
        STRUCT<
            body: STRING,
            symbols: ARRAY<STRUCT<symbol: STRING>>,
            entities: STRUCT<sentiment: STRUCT<basic: STRING>>
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/user5/twits'

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [24]:
%%sql
select count(*) from twits

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


_c0
197


In [23]:
%%sql
select * from twits limit 3

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


messages
"[{""body"":""$AAPL heng is about to go negative get ready for tomorrow."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:\n\nhttps://www.netflix.com/title/81026143\n\nGod bless these doctors and researchers."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMRN""},{""symbol"":""STML""},{""symbol"":""SCYX""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AMZN of course not just tesla.. $AAPL &amp; $MSFT basically did nothing on the cash open after ER"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""}],""entities"":{""sentiment"":null}},{""body"":""Ripping $AAPL $MSFT $SPCE $SPY"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""MSFT""},{""symbol"":""SPY""},{""symbol"":""SPCE""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL $BABA virus bears at this point are gonna get steam rolled tomorrow....Asia way up"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""BABA""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL This should easily gap up to 325 tomorrow"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$SPY $TVIX $AAPL \n\nSPY current behavior has shown that -1% - 1.5% drop is just a consolidating respectable pullback. 
Investors that bought in at the beginning of 2019 only bought the dip of fear to compound over &amp; over...Can easily be a 2-3 year rally with bears calling a crash at 410."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""},{""symbol"":""TVIX""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL\n\n$330 tomorrow!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL futures up"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL gap up incoming"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL Apple PT Raised to $375.00 \n\nhttps://newsfilter.io/a/c44f08aed41cfc4109cfd90ca575dc5e"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL Apple Upgraded to Hold by Maxim Group \n\nhttps://newsfilter.io/a/3cc9600276d07833855007b0b68aeba3"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL Hey kids- remember when the bitch was a big deal? 😂😂😂😂"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AMZN literally amazon is the one us company doing things masterfully. They employs tons of US workers unlike $AAPL and contribute greatly wherever they go. Tons of\nBezos haters look like dumbasses when they post today. He is self made entrepreneur i remember in 97 people put them down then for selling books online $SPY . Permabears never learn"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AMZN Asia market recovering expect big day tomorrow $AAPL $MSFT $TSLA ✌️"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$SPY ALL TIME HIGHS BABY!!! BEEN CALLING THE DIP SINCE MONDAY!!! 
LETS GO BULLS!!! SUPER BOWL WEEK!!! MONEY REPO TEAM BACK ONCE AGAIN SIPPIN HENN MIXED WITH JUICE AND GIN! JOIN THEM!!! $$$$$ $BA $AAPL $AMZN $TSLA"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""BA""},{""symbol"":""SPY""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Peak profit for the last 6 expired option alerts for $AAPL -18.08 | 529.17 | 3.02 | 354.20 | 402.11 | 294.90 |"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$YM_F if you want to win come trade with me if you want to lose go somewhere else $AAPL $SPY"",""symbols"":[{""symbol"":""YM_F""},{""symbol"":""AAPL""},{""symbol"":""SPY""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL $TSLA $AMZN $MSFT are earnings winners. Investors let your winners run. Pizza anyone?"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL just leaving this here as future turn green https://www.cnbc.com/2020/01/31/china-economy-beijing-announces-official-manufacturing-pmi-for-january.html"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL $BABA so futures are up, and Asia, so maybe those numbers aren’t that bad like it said....."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""BABA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$NFLX $SPY $AAPL $FB \n\n After several successful trades in Netflix both ways it’s shown more obvious bullish strength then bearish. I have been trading Netflix swings above 340 primarily taking the outlook 80% Bull 20% bear which is a healthy practice I do personally let me not get ahead myself to be a perma-bear. 
Currently Netflix is 4% away from it’s all time high and has struggled several times to breakout above it, but as long as Netflix is trading well above 340 and vegans to build a base above 344 within the next couple of weeks Netflix should be kissing 362.12 - 373.40."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""NFLX""},{""symbol"":""SPY""},{""symbol"":""FB""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AA Alcoa’s fundamentals horrible but it may have a bounce as I explained earlier. Moreover Alcoa Day Candle Daily Chart is a Dragonfly Doji 👍\nwhich is bullish \nwhen it occurs in a downtrend \nDaily technicals of RSI being extremely oversold for days and it is END OF MONTH\n $AAPL $AMZN $WDC $SPY"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""AA""},{""symbol"":""SPY""},{""symbol"":""WDC""}],""entities"":{""sentiment"":null}},{""body"":""$SPY we ripping again, so let’s make sure we stick to chicken 🐔 if we gonna be making any soup and leave them 🦇 alone, #corona $AAPL $AMZN $TSLA"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""SPY""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL Watching $317 support. Daily chart."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""AA Alcoa’s fundamentals horrible but it may have a bounce as I explained earlier. 
Moreover Alcoa Day Candle Daily Chart is a Dragonfly Doji 👍\nwhich bullish \nwhen it occurs in a downtrend \nDaily technicals of RSI being extremely oversold for days and it is END OF MONTH\n $AAPL $AMZN $WDC $SPY"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""SPY""},{""symbol"":""WDC""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL JPMorgan Chase &amp; Boosts Apple Price Target to $350.00 \n\nhttps://newsfilter.io/a/9fdc74006956fbe2cb5c99471f5bf803"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL coronavirus goes to India and Philippines google closed all offices Tesla closed https://m.youtube.com/watch?v=6DBFwIlT4fg"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AMZN $SPY $AAPL $MCD"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MCD""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL never short a market when you have trump has president this guy won’t let the market go down"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}}]"
"[{""body"":""$ABEO Stochastic is turning. Tomorrow will likely bring the stochastic buy signal, as the slow crosses above the fast, and that Olivergarden dimwit can go pound sand"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO 2.30s ouch... can I get 2.10s?"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$ABEO added bigly here...insider owns $500k for a reason here"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO $CLSN two biggest positions and two biggest drawdowns right now"",""symbols"":[{""symbol"":""CLSN""},{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO if it doesn’t hold here, it’s going to get really cheap."",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO Their CSO stepped down at the beginning of the year. 🤦Big news into cell-therapy, reported earnings date of 3-21-20.👍 Look to get in around $2.20 or lower. 💵🤞"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Daily short volume at a whopping 75%, same as yesterday. That can&#39;t be sustainable. MMs have two days to make good on all the non-existent shares they sold today, and tomorrow is the settlement date for the assloads of non-existent shares they sold yesterday. They can&#39;t settle counterfeit shares using more non-existent shares. That doesn&#39;t work. Let&#39;s see what the short-bag-holding dimwits do tomorrow. The clock is ticking. I&#39;d like to see the fraudsters stick around and get torched"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO ⏫✔ Wait on the right price point to get in. 
Looking around $2.20 🤘🐂"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO will go to 2 soon"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""I wish I shorted $ABEO a year ago; the -61.46% change sure looks sweet now https://wallmine.com/nasdaq/abeo?utm_source=stocktwits"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO vol just not there , I’m out GL all"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO here we go, bids stacking at 2.49, 2.5 mini wall down"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO long hold 👍🏻"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO need to knock down the 2.5 mini wall and we will move"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO ↗ Charts say buy✔ I&#39;m holding off for another drop before conformation on positive price movement. 🤘"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO back to 3 please. any news in the near future?!"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Daily short volume was 75%, and those parasites still lost ground"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO another head and shoulders setup on the intraday pattern. @Reformed_Trader"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO what a tin stock I tried to add 15,000 share and pop like 5 cent -- come on"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO really, no cheerleading or pumping? 
for shame"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO looking good"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO pretty clear there’s good support here, if your short why wouldn’t you lock in profits and ride it back to $3 at this point ?"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO some volume coming now"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO lol these two posts juxtaposed 😂 @jsp4423"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO add again."",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO i am betting a dilution very soon..."",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO insiders added recently. Dilution Done"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO adding more, Wellington adding as well as other institutions is all I need to see, close to 4K shares now"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Wellington Management Group LLP has filed an amended 13G/A, reporting 8.04% ownership in $ABEO - https://fintel.io/so/us/abeo?utm_source=stocktwits.com&amp;utm_medium=social&amp;utm_campaign=owner"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO bot more down here, Wellington increase"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}}]"
"[{""body"":""Departure of Directors or Certain http://www.conferencecalltranscripts.org/8/summary2/?id=7351264 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Thinking about investing in $ABG #AsburyAutomotive? The 8-K filing touching on departure of directors or certain officers among other topics might be what you&#39;re looking for https://wallmine.com/filing/redirect/12084309?utm_source=stocktwits"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG just filed a Event for Officers https://last10k.com/sec-filings/abg/0001144980-20-000008.htm?utm_source=stocktwits&amp;utm_medium=forum&amp;utm_campaign=8K&amp;utm_term=abg"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG / Asbury Automotive Group files form 8-K - Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers https://fintel.io/s/us/abg?utm_source=stocktwits.com&amp;utm_medium=Social&amp;utm_campaign=filing"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed form 8-K on January 30, 17:31:01: Item5.02: Departure of Election 0f Officers or Compensatory Arrangements https://s.flashalert.me/oUEIw"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Form 8-K: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers.Asbury Automotive Group’s has announced.. 
\n\nhttps://newsfilter.io/a/db75603965c2c7ba2de12f8e4613ad92"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Victory Capital Management Inc has filed an amended 13G/A, reporting 3.17% ownership in $ABG - https://fintel.io/so/us/abg?utm_source=stocktwits.com&amp;utm_medium=social&amp;utm_campaign=owner"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed form SC 13G/A on January 30, 10:08:33 https://s.flashalert.me/ctDqz5"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Victory Capital Management Inc. just provided an update on share ownership of Asbury Automotive http://www.conferencecalltranscripts.org/13G/summary/?id=7348696 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Form SC 13G/A (statement of acquisition of beneficial ownership by individuals) filed with the SEC \n\nhttps://newsfilter.io/a/5596c4ca5f31b12c5d5994aa21748dfe"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""BULLISH NEWS FOR $ABG\n\nhttps://www.nasdaq.com/articles/asbury-automotive-group-abg-earnings-expected-to-grow%3A-what-to-know-ahead-of-next-weeks"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABG $2.34 Earnings Per Share Expected for Asbury Automotive Group This Quarter \n\nhttps://newsfilter.io/a/6362dd8c82a409e658cd91c1f1f571a7"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Asbury Automotive Group Price Target Cut to $104.00 \n\nhttps://newsfilter.io/a/9539ffb509fcedb246b9bf036f5b143c"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Asbury Automotive Group&#39;s PT cut by Morgan Stanley to $104.00. equal weight rating. 
https://www.marketbeat.com/r/1333835 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Morgan Stanley Maintains to Equal-Weight : PT $104.00 https://stockhoot.com/ExtSymbol.aspx?from=AnalystRatingTweet&amp;symbol=ABG&amp;t=593&amp;Social=StockTwits"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Asbury (ABG) to release earnings before the market closes on Monday, February 3. Expected EPS: 2.34. $ABG https://www.tipranks.com/stocks/ABG/earnings-calendar?ref=TREarnings"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Asbury Automotive Group Schedules Release of Fourth Quarter and Full Year 2019 Financial Results \n\nhttps://newsfilter.io/a/f62d14cb5a8c54a2a4c53a2891ff572f"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Undervalued Signal Alert: $ABG. More insights: https://stockinvest.us/technical-analysis/ABG?utm_source=stocktwits&amp;utm_medium=autopost"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""How does this make you feel? $ABG RSI Indicator left the oversold zone. View odds of uptrend. 
https://tickeron.com/go/1133340"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""AsburyAutomotiveGroup $ABG BidaskScore is Downgraded to Held https://bidaskclub.com/news/company-news/company-news-company-news/2020/01/asbury-automotive-group-abg-bidaskscore-is-downgraded-to-held/"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$GPI - Hold Group 1 and Asbury Automotive ($ABG) in the same… http://dlvr.it/RMsv0z #portfolio_prospective #better_portfolio #diversify"",""symbols"":[{""symbol"":""ABG""},{""symbol"":""GPI""}],""entities"":{""sentiment"":null}},{""body"":""Departure of Directors or Certain http://www.conferencecalltranscripts.org/8/summary/?id=7272462 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG / Asbury Automotive Group files form 8-K - Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers https://fintel.io/s/us/abg?utm_source=stocktwits.com&amp;utm_medium=Social&amp;utm_campaign=filing"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG just filed a Event for Officers https://last10k.com/sec-filings/abg/0001144980-20-000005.htm?utm_source=stocktwits&amp;utm_medium=forum&amp;utm_campaign=8K&amp;utm_term=abg"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Fresh - #AsburyAutomotive released a current report talking about election of directors and other topics. 
See what others think $ABG https://wallmine.com/filing/redirect/12031426?utm_source=stocktwits"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed form 8-K on January 09, 17:29:18: Item5.02: Departure of Election 0f Officers or Compensatory Arrangements https://s.flashalert.me/6seZQ"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Form 8-K: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers.As previously reported by Asbury Automo.. https://newsfilter.io/a/38a217f41dadf750c697717dc4a83b50"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Should Value Investors Pick Asbury Automotive Stock?: Value investing is easily one of the most popular ways to find great stocks in any market environment. After all, who wouldn’t want to find s.. https://newsfilter.io/a/2523652df6f32201b94582f805d556f8"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""BULLISH NEWS FOR $ABG\n\nhttps://simplywall.st/stocks/us/retail/nyse-abg/asbury-automotive-group/news/heres-why-i-think-asbury-automotive-group-nyseabg-is-an-interesting-stock/"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$LAD $ABG $GOLF $MRC $CMC 5 Growth Stocks to Buy as Middle-East Tensions Subside: When the markets were expecting Middle-East tensions to intensify following Tehran’s retaliatory attack, President Don.. https://newsfilter.io/a/bd29d2fae3108f0c0ea80b2e20467ba7"",""symbols"":[{""symbol"":""GOLF""},{""symbol"":""ABG""},{""symbol"":""CMC""},{""symbol"":""LAD""},{""symbol"":""MRC""}],""entities"":{""sentiment"":null}}]"


Create the tables used for the data transformation.

In [25]:
%sql create table message_extracted (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_filtered (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_exploded (symbol string, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table sentiment_data (sentiment int, body STRING) STORED AS TEXTFILE

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

Extract only the required fields from the original data.

In [26]:
%%sql
insert overwrite table message_extracted 
select message.symbols, message.entities.sentiment, message.body from twits 
lateral view explode(messages) messages as message

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [27]:
%%sql
select * from message_extracted limit 5

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""}]",Bearish,$AAPL heng is about to go negative get ready for tomorrow.
"[{""symbol"":""AAPL""},{""symbol"":""AMRN""},{""symbol"":""STML""},{""symbol"":""SCYX""}]",Bullish,"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"
[],,
"[{""symbol"":""https://www.netflix.com/title/81026143""}]",,
[],,


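Keep only the records that contain a message body. At the same time, convert the sentiment for each symbol from a string to a number: `Bearish` becomes -2, `Bullish` becomes 2, and anything else becomes 0.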

In [28]:
%%sql
insert overwrite table message_filtered 
select symbols, 
    case sentiment when 'Bearish' then -2 when 'Bullish' then 2 ELSE 0 END as sentiment, 
    body from message_extracted 
    where body is not null

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]
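
The filtering and label mapping performed by the query above can be sketched in plain Python. This is only an illustration: the `records` list and the `LABELS` dictionary are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical in-memory records mirroring the (symbols, sentiment, body) columns.
records = [
    {"symbols": ["AAPL"], "sentiment": "Bearish", "body": "$AAPL about to go negative"},
    {"symbols": ["AAPL", "MSFT"], "sentiment": None, "body": "$AAPL $MSFT did nothing"},
    {"symbols": [], "sentiment": None, "body": None},
]

# Same mapping as the CASE expression: Bearish -> -2, Bullish -> 2, anything else -> 0.
LABELS = {"Bearish": -2, "Bullish": 2}

filtered_records = [
    {**rec, "sentiment": LABELS.get(rec["sentiment"], 0)}
    for rec in records
    if rec["body"] is not None  # same role as "where body is not null"
]

print(filtered_records)
```

Records without a `Bearish` or `Bullish` tag end up with the label 0, so "neutral" here also covers posts whose author simply did not tag a sentiment.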

In [29]:
%%sql
select * from message_filtered limit 3

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""}]",-2,$AAPL heng is about to go negative get ready for tomorrow.
"[{""symbol"":""AAPL""},{""symbol"":""AMRN""},{""symbol"":""STML""},{""symbol"":""SCYX""}]",2,"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"
"[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""}]",0,$AMZN of course not just tesla.. $AAPL &amp; $MSFT basically did nothing on the cash open after ER


A single message can be linked to several symbols. To normalize the data, transform it so that each row holds exactly one symbol (this produces multiple rows with the same message).

In [30]:
%%sql
insert overwrite table message_exploded 
select symbol.symbol, sentiment, body from message_filtered lateral view explode(symbols) symbols as symbol

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [31]:
%%sql
select * from message_exploded limit 3

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


symbol,sentiment,body
AAPL,-2,$AAPL heng is about to go negative get ready for tomorrow.
AAPL,2,"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"
AMRN,2,"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"


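The steps so far have converted the original, deeply nested data into a format in which each record holds a symbol, a sentiment, and a message body.
Use this table for analyses such as counting sentiments per symbol.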
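
The reshaping into one row per symbol, and the per-symbol counting it enables, can be sketched in plain Python. The rows below are hypothetical and only mimic the shape of `message_filtered`; the notebook itself does this step in Hive.

```python
from collections import Counter

# Hypothetical rows in the shape of message_filtered: (symbols, sentiment, body).
message_rows = [
    (["AAPL"], -2, "heng is about to go negative"),
    (["AAPL", "AMRN"], 2, "if you watch only one documentary"),
]

# One output row per (symbol, message) pair, like "lateral view explode(symbols)".
exploded = [
    (symbol, sentiment, body)
    for symbols, sentiment, body in message_rows
    for symbol in symbols
]

# Count how often each sentiment value occurs for each symbol.
counts_per_symbol = Counter((symbol, sentiment) for symbol, sentiment, _ in exploded)

print(exploded)
print(counts_per_symbol)
```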

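The sentiment analysis that follows builds a predictive model that determines the sentiment from the text of the message body. Symbol information is not needed for that, so we extract only the sentiment and the message body.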

In [32]:
%%sql
insert overwrite table sentiment_data 
select sentiment, body from message_filtered

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [33]:
%%sql
select * from sentiment_data limit 10

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


sentiment,body
-2,$AAPL heng is about to go negative get ready for tomorrow.
2,"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"
0,$AMZN of course not just tesla.. $AAPL &amp; $MSFT basically did nothing on the cash open after ER
2,Ripping $AAPL $MSFT $SPCE $SPY
0,$AAPL $BABA virus bears at this point are gonna get steam rolled tomorrow....Asia way up
2,$AAPL This should easily gap up to 325 tomorrow
2,$SPY $TVIX $AAPL
2,$AAPL
2,$AAPL futures up
0,$AAPL gap up incoming


### Creating the JSON file

Export the processed data as a JSON file.

The data scientists and machine learning engineers responsible for the sentiment analysis work from this JSON file.

In [34]:
%sql add jar hdfs:/tmp/brickhouse-0.7.1-SNAPSHOT.jar
%sql CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF'

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [35]:
%sql DROP TABLE IF EXISTS json_message
%sql create table json_message (message STRING) STORED AS TEXTFILE

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.
   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [36]:
%%sql
insert overwrite table json_message
select to_json(named_struct('message_body', body, 'sentiment', sentiment)) from sentiment_data

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]

In [37]:
%%sql
select * from json_message limit 5

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


message
"{""message_body"":""$AAPL heng is about to go negative get ready for tomorrow."",""sentiment"":-2}"
"{""message_body"":""$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:"",""sentiment"":2}"
"{""message_body"":""$AMZN of course not just tesla.. $AAPL &amp; $MSFT basically did nothing on the cash open after ER"",""sentiment"":0}"
"{""message_body"":""Ripping $AAPL $MSFT $SPCE $SPY"",""sentiment"":2}"
"{""message_body"":""$AAPL $BABA virus bears at this point are gonna get steam rolled tomorrow....Asia way up"",""sentiment"":0}"


**Set `HQL_SELECT_MESSAGE` so that it points to the database you created.**

In [46]:
from pyspark.sql import SparkSession

HQL_SELECT_MESSAGE = "select * from user5.json_message"

spark = SparkSession\
    .builder\
    .appName("JsonGen")\
    .getOrCreate()
    
spark.sparkContext.setLogLevel("ERROR")

json_list = spark.sql(HQL_SELECT_MESSAGE)

path = "./output.json"

with open(path, mode='w') as f:
    f.write('{"data":[')
    bool_first_line = True
    for row in json_list.rdd.collect():
        if bool_first_line:
            bool_first_line = False
            f.write(row.message)
        else:
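            # Not very elegant, but to simulate deep learning on a reasonably large dataset,
            # we inflate the data by writing each remaining record 100 times.
            # Without the API limits and the time constraints of this exercise,
            # the web scraping step above could collect a large amount of training data instead.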
            for i in range(100): 
                f.write(",\n")
                f.write(row.message)
    
    f.write("]}")
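
The cell above writes the brackets and commas by hand. As a minimal sketch of an alternative, and assuming the rows fit in memory, the `json` module can assemble the same `{"data": [...]}` structure. The `message_strings` list is a hypothetical stand-in for the `message` column, and the 100x inflation is left out.

```python
import json

# Hypothetical stand-ins for the `message` column of json_message:
# each element is already a JSON-encoded string.
message_strings = [
    '{"message_body":"$AAPL futures up","sentiment":2}',
    '{"message_body":"$AAPL gap up incoming","sentiment":0}',
]

# Decode every row, then let json.dumps emit the brackets and commas.
document = {"data": [json.loads(message) for message in message_strings]}
encoded = json.dumps(document)

# Reading the text back gives the structure the later cells expect.
decoded = json.loads(encoded)
print(len(decoded["data"]), decoded["data"][0]["sentiment"])
```

Decoding each row first has the side effect of failing early if a row is not valid JSON, at the cost of holding all records in memory at once.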

In [48]:
!ls -l

total 198992
-rw-r--r-- 1 cdsw cdsw  9306615 Feb  3 06:22 all_data.json
-rwxr-xr-x 1 cdsw cdsw       56 Feb  3 06:06 cdsw-build.sh
drwxr-xr-x 2 cdsw cdsw    12288 Feb  3 06:13 data
-rwxr-xr-x 1 cdsw cdsw      305 Feb  3 06:06 git_amend.sh
-rw-r--r-- 1 cdsw cdsw      278 Feb  3 06:06 git_env.sh
-rw-r--r-- 1 cdsw cdsw      122 Feb  3 06:06 hadoop_env.sh
-rwxr-xr-x 1 cdsw cdsw      774 Feb  3 06:06 json_get.py
drwxr-xr-x 2 cdsw cdsw     4096 Feb  3 06:06 lib
drwxr-xr-x 2 cdsw cdsw     4096 Feb  3 06:06 misc
-rwxr-xr-x 1 cdsw cdsw     3865 Feb  3 06:06 model_api.py
-rw-r--r-- 1 cdsw cdsw    51856 Feb  3 06:06 nlp_handson.ipynb
-rw-r--r-- 1 cdsw cdsw   199209 Feb  3 06:54 nlp_solution.ipynb
-rw-r--r-- 1 cdsw cdsw 97060803 Feb  3 06:52 output5.json
-rw-r--r-- 1 cdsw cdsw 97060803 Feb  3 06:56 output.json
-rw-r--r-- 1 cdsw cdsw      230 Feb  3 06:06 README.md
-rw-r--r-- 1 cdsw cdsw     1243 Feb  3 06:06 requirements.txt
drwxr-xr-x 2 cdsw cdsw     4096 Feb  3 06:06 sentiment


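## Sentiment Analysis

As an approach to assessing a company's value for investment decisions, we consider using a wide range of information beyond the conventional framework (alternative data).

We build a predictive model that takes information capable of swaying investors' judgment as input and converts it into a quantitative signal for investment decisions.
Many kinds of data can serve as input. One example:

- News (product recalls, natural disasters, and so on)

With deep learning based on neural networks, a predictive model can be built for a wide variety of input formats.

Here we use posts from the social media site StockTwits.
The StockTwits community is used by investors, traders, and entrepreneurs.

We build a model around these twits that produces a sentiment score.

Training the model requires a label for each input. Label accuracy is a very important factor in training.

To capture the degree of sentiment, we use a five-level scale: very negative, negative, neutral, positive, and very positive, corresponding to the values -2 through 2. (The data prepared above contains only -2, 0, and 2, because the labels are derived from the `Bearish`/`Bullish` tags.)

Using this labeled data, we train a model that takes natural language as input and predicts the sentiment behind the text.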


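### Inspecting the data
Check what the data looks like.

Meaning of each field:

* `'message_body'`: the text of the message
* `'sentiment'`: the sentiment score, on a five-level scale from -2 to 2. 0 is neutral.

The content should look like this: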
```
{'data': [
  {'message_body': '............................',
   'sentiment': 2},
  {'message_body': '............................',
   'sentiment': -2},
   ...
]}
```

In [49]:
!head output.json
!tail output.json

{"data":[{"message_body":"$AAPL heng is about to go negative get ready for tomorrow.","sentiment":-2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:","sentiment":2},
{"message_body":"$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should

Load the data.

In [50]:
with open('./output.json', 'r') as f:
    twits = json.load(f)

print(twits['data'][:10])

[{'message_body': '$AAPL heng is about to go negative get ready for tomorrow.', 'sentiment': -2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one documentary, it should be this one:', 'sentiment': 2}, {'message_body': '$AMRN $AAPL $STML $SCYX If you watch only one docum

Check the number of records.

In [51]:
print(len(twits['data']))

590901
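
Beyond the total count, it can help to look at how the labels are distributed, since class balance matters later on. A small sketch, using a hypothetical `sample_twits` dictionary with the same structure as `twits`:

```python
from collections import Counter

# Hypothetical sample with the same structure as `twits` loaded from output.json.
sample_twits = {
    "data": [
        {"message_body": "$AAPL futures up", "sentiment": 2},
        {"message_body": "$AAPL gap up incoming", "sentiment": 0},
        {"message_body": "$AAPL about to go negative", "sentiment": -2},
        {"message_body": "$SPY $TVIX $AAPL", "sentiment": 2},
    ]
}

label_counts = Counter(twit["sentiment"] for twit in sample_twits["data"])
total = len(sample_twits["data"])

# Print each label with its count and its share of the data.
for label in sorted(label_counts):
    print(label, label_counts[label], label_counts[label] / total)
```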


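### Data preprocessing

Preprocess the text.

Ticker symbols in the message body (written as `$SYMBOL`) carry no sentiment information, so we remove them.
User mentions written as `@username` likewise carry no sentiment information, so we remove them as well.
URLs are removed too.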
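
The removals described above can be illustrated with regular expressions alone. The sketch below uses deliberately simplified patterns and a hypothetical helper name, `strip_noise`. It is not the notebook's `preprocess` function, which uses a stricter URL pattern and also lemmatizes.

```python
import re

def strip_noise(message):
    """Simplified cleaning: lowercase, then drop URLs, $tickers and @usernames."""
    text = message.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs (simplified pattern)
    text = re.sub(r"\$\S+", " ", text)         # ticker symbols such as $aapl
    text = re.sub(r"@\S+", " ", text)          # user mentions such as @trader1
    text = re.sub(r"[^a-z]", " ", text)        # everything that is not a letter
    return text.split()

sample = "$AAPL is ripping! @trader1 see https://example.com/a?b=1 now"
print(strip_noise(sample))
```

The order matters: URLs, tickers, and mentions are removed before the generic non-letter replacement, which would otherwise leave fragments such as `aapl` or `https` behind as tokens.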

### Building lists of message bodies and sentiment labels

In [52]:
messages = [twit['message_body'] for twit in twits['data']]
# The sentiment scores are discrete values from -2 to 2; shift them to 0..4 so they can be used as class indices in the network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]

### Defining the preprocessing function

In [53]:
nltk.download('wordnet')

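def preprocess(message):
    """
    Take a string as input and perform the following operations:
        - Convert all letters to lowercase
        - Remove URLs
        - Remove ticker symbols
        - Remove usernames
        - Remove punctuation and other non-letter characters
        - Tokenize the string by splitting on whitespace
        - Drop single-character tokens and lemmatize the rest
    
    Parameters
    ----------
        message : The text message to preprocess.
        
    Returns
    -------
        tokens : The list of tokens after preprocessing.
    """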
    
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message (raw strings keep the regex escapes intact)
    text = re.sub(r"http(s)?://([\w\-]+\.)+[\w-]+(/[\w\- ./?%&=]*)?", ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r"\$[^ \t\n\r\f]+", ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r"@[^ \t\n\r\f]+", ' ', text)

    # Replace everything not a letter with a space
    text = re.sub("[^a-z]", ' ', text)
    
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(w, pos='v') for w in tokens if len(w) > 1]
    
    return tokens

[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### Preprocessing the twit messages
Apply the `preprocess` function defined above to all of the StockTwits message data.

Note: depending on the size of the data, this step can take some time.

In [54]:
tokenized = list(map(preprocess, messages))

print(tokenized[:3])
print(len(tokenized))

[['heng', 'be', 'about', 'to', 'go', 'negative', 'get', 'ready', 'for', 'tomorrow'], ['if', 'you', 'watch', 'only', 'one', 'documentary', 'it', 'should', 'be', 'this', 'one'], ['if', 'you', 'watch', 'only', 'one', 'documentary', 'it', 'should', 'be', 'this', 'one']]
590901


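### Bag of Words

Now that every message has been tokenized, we build the vocabulary.
While doing so, we count how often each word appears across the whole corpus
(using the [`Counter`](https://docs.python.org/3.1/library/collections.html#collections.Counter) class).

Note: depending on the size of the data, this step can take some time.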

In [55]:
from collections import Counter

#words = []
#for tokens in tokenized:
#    for token in tokens:
#        words.append(token)
out_list = tokenized
words = [element for in_list in out_list for element in in_list]

print(words[:13])
print(len(words))

"""
Create a vocabulary by using Bag of words
"""

word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word:ii for ii, word in int_to_vocab.items()}

bow = []
for tokens in tokenized:
    bow.append([vocab_to_int[token] for token in tokens])

print(len(bow))
print(bow[:3])

# This BOW will not be used because it is not filtered to eliminate common words.

['heng', 'be', 'about', 'to', 'go', 'negative', 'get', 'ready', 'for', 'tomorrow', 'if', 'you', 'watch']
7680210
590901
[[5949, 3, 83, 2, 37, 702, 51, 674, 9, 151], [100, 56, 208, 204, 106, 3350, 30, 216, 3, 17, 106], [100, 56, 208, 204, 106, 3350, 30, 216, 3, 17, 106]]
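
The same vocabulary construction can be followed on a toy corpus. The tokens below are hypothetical, and `Counter.most_common()` is used in place of the explicit `sorted(...)` call from the cell above; both order words from most to least frequent.

```python
from collections import Counter

# Hypothetical tokenized messages.
toy_tokenized = [["buy", "the", "dip"], ["sell", "the", "rally"], ["the", "dip"]]

# Flatten and count every token in the corpus.
toy_counts = Counter(token for tokens in toy_tokenized for token in tokens)

# The most frequent word receives id 0, the next one id 1, and so on.
toy_vocab_to_int = {word: idx for idx, (word, _) in enumerate(toy_counts.most_common())}

# Replace each token by its id.
toy_bow = [[toy_vocab_to_int[token] for token in tokens] for tokens in toy_tokenized]

print(toy_vocab_to_int)
print(toy_bow)
```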


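### Adjusting for word importance (how frequently a word appears in messages)

Using the vocabulary, remove some of the most common words, such as "the", "and", and "it".
These words are so common that they do not help identify sentiment and only add noise to the network's input. Excluding them can shorten the network's training time.

We also remove words that are used only very rarely.
To do this, each word's count is divided by the total number of tokens in the corpus to obtain its relative frequency.

Words whose frequency is at or below the cutoff are then removed.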

In [56]:
"""
Set the following variables:
    freqs
    low_cutoff
    high_cutoff
    K_most_common
"""

print("len(sorted_vocab):",len(sorted_vocab))
print("sorted_vocab - top:", sorted_vocab[:3])
print("sorted_vocab - least:", sorted_vocab[-15:])

# Dictionary that contains the frequency of words appearing in messages.
# The key is the token and the value is the frequency of that word in the corpus.
total_count = len(words)
freqs = {word: count/total_count for word, count in word_counts.items()}

#print("freqs[supplication]:",freqs["supplication"] )
print("freqs[the]:",freqs["the"] )

"""
This was the post by Ricardo:

there's no exact value for low_cutoff and high_cutoff, 
however I'd recommend you to use 
a low_cutoff that's between 0.000002 and 0.000007 
(This depends on the values you get from your freqs calculations) and 
a high_cutoff from 5 to 20 (this depends on the most_common values from the bow).
"""

# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 0.000002

# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
"""
example_count = []
example_count.append(sorted_vocab.index("the"))
example_count.append(sorted_vocab.index("for"))
example_count.append(sorted_vocab.index("of"))
print(example_count)
high_cutoff = min(example_count)
"""
high_cutoff = 20
print("high_cutoff:",high_cutoff)
print("low_cutoff:",low_cutoff)

# The k most common words in the corpus. Use `high_cutoff` as the k.
#K_most_common = [word for word in sorted_vocab[:high_cutoff]]
K_most_common = sorted_vocab[:high_cutoff]

print("K_most_common:",K_most_common)


filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]

print("len(filtered_words):",len(filtered_words)) 

len(sorted_vocab): 5950
sorted_vocab - top: ['the', 'of', 'to']
sorted_vocab - least: ['desktop', 'verge', 'catalysts', 'nyseamerican', 'driver', 'astronomically', 'edit', 'theme', 'ignore', 'particular', 'explode', 'autoscalp', 'yaawn', 'salamis', 'heng']
freqs[the]: 0.027121654225600603
high_cutoff: 20
low_cutoff: 2e-06
K_most_common: ['the', 'of', 'to', 'be', 'amp', 'utm', 'and', 'on', 'in', 'for', 'file', 'form', 'share', 'stock', 'by', 'sec', 'report', 'this', 'earn', 'at']
len(filtered_words): 5929
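
How the two cutoffs interact is easier to see on a toy corpus. The words and cutoff values below are hypothetical and chosen only so that both filters remove something.

```python
from collections import Counter

toy_words = ["the", "the", "the", "dip", "dip", "buy", "sell", "rally"]
toy_counts = Counter(toy_words)

# Relative frequency of each word in the corpus.
toy_freqs = {word: count / len(toy_words) for word, count in toy_counts.items()}

toy_low_cutoff = 0.125   # drop words with frequency <= 0.125 (hypothetical value)
toy_high_cutoff = 1      # drop the 1 most common word (hypothetical value)

most_common = {word for word, _ in toy_counts.most_common(toy_high_cutoff)}
kept = [w for w in toy_freqs if toy_freqs[w] > toy_low_cutoff and w not in most_common]

print(kept)
```

In the notebook's own output, only 21 of 5950 words are dropped (the 20 most common plus one rare word), so with this small, duplicated corpus the low cutoff has almost no effect.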


### Updating the vocabulary by removing the filtered words
Create three variables that support working with the vocabulary.

In [57]:
"""
Set the following variables:
    vocab
    id2vocab
    filtered
"""

# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word. 
# Ids start at 1 so that 0 stays reserved for the zero padding used when batching.
vocab =  {word:ii for ii, word in enumerate(filtered_words, 1)}
# Reverse of the `vocab` dictionary. The key is word id and value is the word. 
id2vocab = {ii:word for word, ii in vocab.items()}
# tokenized with the words not in `filtered_words` removed.

print("len(tokenized):", len(tokenized))

filtered = [[token for token in tokens if token in vocab] for tokens in tokenized]
print("len(filtered):", len(filtered))
print("tokenized[:1]", tokenized[:1])
print("filtered[:1]",filtered[:1])

len(tokenized): 590901
len(filtered): 590901
tokenized[:1] [['heng', 'be', 'about', 'to', 'go', 'negative', 'get', 'ready', 'for', 'tomorrow']]
filtered[:1] [['about', 'go', 'negative', 'get', 'ready', 'tomorrow']]


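### Balancing the classes

Labels in training data are often skewed (unusual examples are rare).
For example, if 50% of the data is neutral, the network can reach 50% accuracy simply by predicting 0 (neutral) every time.

The classes need to be balanced so that the network can learn properly. In other words, each sentiment score should ideally appear in the data with roughly the same frequency.

Here we randomly drop neutral examples so that they make up about 20% of the data.

The probability of keeping a neutral example is computed from the current share of neutral data and the share we expect after dropping.

We also remove messages of length zero.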

In [58]:
balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)

keep_prob = (N_examples - n_neutral)/4/n_neutral

print("keep prob:", keep_prob)

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 

keep prob: 0.0562932821895086
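
The formula for `keep_prob` follows from requiring that the kept neutral examples make up a target share of the balanced data. The sketch below generalizes it with a hypothetical helper, `neutral_keep_prob`; for a target share of 0.2 it reduces to `(N_examples - n_neutral)/4/n_neutral` as used above. The counts are made up for illustration.

```python
def neutral_keep_prob(n_total, n_neutral, target_share=0.2):
    """Probability of keeping a neutral example so that neutral examples
    make up `target_share` of the balanced data (in expectation)."""
    n_other = n_total - n_neutral
    # Solve  p * n_neutral / (n_other + p * n_neutral) = target_share  for p.
    return target_share * n_other / ((1 - target_share) * n_neutral)

# Hypothetical counts: 1000 examples, 800 of them neutral.
p = neutral_keep_prob(1000, 800)
expected_neutral = p * 800
share = expected_neutral / (200 + expected_neutral)
print(p, share)
```

Because the drop is random and zero-length messages are removed in the same pass, the share measured on the real data will usually land near, not exactly on, the target.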


Check the share of examples in the balanced data whose sentiment is neutral.

In [59]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.2128921529496288

Finally, convert the tokens into integer ids. This is required to use the messages as input to the neural network.

In [60]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

Save the vocabulary to a file. It is needed at prediction time to convert the input.

In [61]:
import pickle

with open('vocab.pickle', 'wb') as f:
    pickle.dump(vocab, f)
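
At prediction time the saved vocabulary has to turn new tokens into ids, and new text may contain words the vocabulary has never seen. A minimal sketch of one way to handle this, assuming unknown tokens are simply skipped; `toy_vocab` and `tokens_to_ids` are hypothetical names, and the round trip happens in memory rather than through `vocab.pickle`.

```python
import pickle

# Hypothetical small vocabulary standing in for the saved `vocab` dictionary.
toy_vocab = {"go": 1, "negative": 2, "tomorrow": 3}

# Round-trip through pickle in memory (the notebook writes to vocab.pickle instead).
restored_vocab = pickle.loads(pickle.dumps(toy_vocab))

def tokens_to_ids(tokens, vocab):
    """Map tokens to ids, skipping tokens that are not in the vocabulary."""
    return [vocab[token] for token in tokens if token in vocab]

print(tokens_to_ids(["heng", "go", "negative", "tomorrow"], restored_vocab))
```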

### Neural network
Now that we have a vocabulary, tokens can be converted to ids and passed to a network. Next we define the network.

An overview of the network:

#### Embed -> RNN -> Dense -> Softmax

### Implementing the SentimentClassifier

The class consists of three main parts:

1. init function `__init__` 
2. forward pass `forward`  
3. hidden state `init_hidden`. 

The output layer uses softmax. The choice of output layer depends on the output format.

(For example, a sigmoid function suits a binary output.)

In this network the sentiment score has five classes, so softmax is appropriate.

In [62]:
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        
        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout_prob = dropout
        

        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        # nn.LSTM applies its own dropout only between stacked layers (lstm_layers > 1)
        self.lstm = nn.LSTM(self.embed_size, self.lstm_size, self.lstm_layers, dropout=self.dropout_prob)
        
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        
        self.softmax = nn.LogSoftmax(dim=1)


    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        
        Parameters
        ----------
            batch_size : The size of batches.
        
        Returns
        -------
            hidden_state
            
        """
        
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.lstm_layers, batch_size,self.lstm_size).zero_(),
                         weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden


    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.
        
        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.

        """
        
        # nn_input has the shape (sequence_length, batch_size) because batch_first is not set
        
        # embed
        embeds = self.embedding(nn_input)
        
        # LSTM
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        
        # The LSTM is created without batch_first=True, so lstm_out has the shape
        # (sequence_length, batch_size, lstm_size). Taking the last time step below
        # yields one vector per sequence, so no reshape with .view is needed.
        lstm_out = lstm_out[-1,:,:]
        
        # dropout
        out = self.dropout(lstm_out)
        
        # Dense layer (nn.Linear): predict the class scores from the last LSTM output
        out = self.fc(out)
        
        # Log-softmax over the classes
        logps = self.softmax(out)
        
        
        return logps, hidden_state

### Checking the model

In [63]:
model = SentimentClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
# Random token ids with the shape (sequence_length=5, batch_size=4)
sample_input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
batch_size = 4
hidden = model.init_hidden(batch_size)

logps, _ = model.forward(sample_input, hidden)
print(logps)

tensor([[-1.7733, -2.0665, -1.3319, -1.5904, -1.4448],
        [-1.7649, -2.0427, -1.3049, -1.6599, -1.4365],
        [-1.7764, -2.0752, -1.3315, -1.5847, -1.4433],
        [-1.7762, -2.0626, -1.3176, -1.6049, -1.4486]],
       grad_fn=<LogSoftmaxBackward>)
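
The values printed above are log-probabilities, so exponentiating one row should give five class probabilities that sum to roughly 1. The plain-Python sketch below checks this for the first printed row and shows a numerically stable log-softmax. The function is an illustration written for this purpose, not PyTorch's implementation.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]

# First row of the tensor printed above (rounded to four decimals).
printed_row = [-1.7733, -2.0665, -1.3319, -1.5904, -1.4448]
probabilities = [math.exp(value) for value in printed_row]

print(sum(probabilities))                       # close to 1, up to rounding
print(probabilities.index(max(probabilities)))  # index of the most likely class

# Log-softmax of arbitrary scores also exponentiates to a distribution.
print(sum(math.exp(v) for v in log_softmax([2.0, 0.5, -1.0, 0.0, 1.0])))
```

Because the model returns log-probabilities, it is paired with `nn.NLLLoss` during training.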


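### Training
### DataLoader and batching
Here we build a generator that can be used to loop over the data.

For efficiency, sequences are passed in batches.

The input tensor has the shape (sequence_length, batch_size).

So if the sequence length is 40 tokens and we pass 25 sequences, the input size is (40, 25).

With a sequence length of 40, messages with more or fewer than 40 tokens are handled as follows:
- Messages with fewer than 40 tokens are padded with zeros in the empty positions.
   - The padding must go on the left, so that the RNN sees nothing before the actual data begins.
   - If a message has 20 tokens, the first 20 positions are 0.
- Messages with more than 40 tokens keep only the first 40 tokens.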
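
The padding and truncation rules above can be sketched for a single message in plain Python. `left_pad` is a hypothetical helper; the notebook's `dataloader` applies the same idea to whole batches of tensors.

```python
def left_pad(token_ids, sequence_length, pad_id=0):
    """Return exactly `sequence_length` ids: zeros on the left, tokens on the right."""
    kept = token_ids[:sequence_length]            # keep only the first tokens
    padding = [pad_id] * (sequence_length - len(kept))
    return padding + kept

print(left_pad([5, 6, 7], 5))
print(left_pad([1, 2, 3, 4, 5, 6, 7], 5))
```

Left padding keeps the real tokens at the end of the sequence, which matters here because the classifier reads only the LSTM output of the last time step.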

In [64]:
#def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
def dataloader(messages, labels, sequence_length=20, batch_size=32, shuffle=False):
    """ 
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)

    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        
        yield batch, label_tensor

### Splitting the data (training and validation)

In [65]:
"""
Split data into training and validation datasets. Use an appropriate split size.
The features are the `token_ids` and the labels are the `sentiments`.
"""   

split_frac = 0.98 # for small data
#split_frac = 0.8 # for big data

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(token_ids)*split_frac)
train_features, remaining_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, remaining_labels = sentiments[:split_idx], sentiments[split_idx:]

test_idx = int(len(remaining_features)*0.5)
valid_features, test_features = remaining_features[:test_idx], remaining_features[test_idx:]
valid_labels, test_labels = remaining_labels[:test_idx], remaining_labels[test_idx:]
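
The cell above splits the data in its stored order. The sketch below shows the same three-way split on toy data with one addition that is an assumption, not part of the notebook: the examples are shuffled with a fixed seed before splitting. `split_dataset` and the toy lists are hypothetical.

```python
import random

def split_dataset(features, labels, split_frac=0.8, seed=0):
    """Shuffle, then split into train / validation / test lists."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # shuffling is an added assumption
    features = [features[i] for i in indices]
    labels = [labels[i] for i in indices]

    split_idx = int(len(features) * split_frac)
    half = split_idx + (len(features) - split_idx) // 2
    return ((features[:split_idx], labels[:split_idx]),
            (features[split_idx:half], labels[split_idx:half]),
            (features[half:], labels[half:]))

toy_features = [[i] for i in range(20)]
toy_labels = [i % 5 for i in range(20)]
train_set, valid_set, test_set = split_dataset(toy_features, toy_labels)
print(len(train_set[0]), len(valid_set[0]), len(test_set[0]))
```

Note that the exercise data was inflated by repeating records, so identical messages can appear in both the training and the held-out sets regardless of how the split is made.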

In [65]:
# Scratch area for trying things out (probably)
#text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
#print(text_batch)
#hidden = model.init_hidden(64)
#logps, hidden = model.forward(text_batch, hidden)
#print(hidden)

tensor([[ 0,  0,  0,  ...,  0,  0,  0],
        [ 0,  0,  0,  ...,  0,  0,  0],
        [ 0,  0,  0,  ...,  0,  0,  0],
        ...,
        [ 3, 12, 12,  ..., 12, 12, 12],
        [ 4, 13, 13,  ..., 13, 13, 13],
        [ 5, 10, 10,  ..., 10, 10, 10]])
(tensor([[[ 0.2675, -0.0482,  0.0554, -0.0824, -0.2360, -0.0629],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0.0563, -0.1774,  0.0205, -0.1005],
         [ 0.0143, -0.0622, -0

### Preparing for training

Check which device is available (CUDA/GPU or CPU).

In [66]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [67]:
#model = SentimentClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
model = SentimentClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)

SentimentClassifier(
  (embedding): Embedding(5930, 1024)
  (lstm): LSTM(1024, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=5, bias=True)
  (softmax): LogSoftmax()
)

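### Running the training

Run the training. To track how training is progressing, the loss is printed periodically.

Note: depending on the size of the data, this step needs a considerable amount of time.

When running in an environment with a GPU, you can confirm that the GPU is being used by running the following command in a terminal (while the GPU is busy, the Volatile GPU-Util percentage shown in the upper right of the table increases):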
```
$ watch nvidia-smi
```

In [70]:
import numpy as np

epochs = 5
#batch_size =  64
batch_size =  512
learning_rate = 0.001

print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

#val_losses = []
total_losses = []
#accuracy = []

for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    
    steps = 0
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        hidden = model.init_hidden(labels.shape[0]) 
        
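        # Move the batch and the hidden state to the device (CPU or GPU).
        # Tensor.to() returns a new tensor, so the result has to be assigned.
        text_batch, labels = text_batch.to(device), labels.to(device)
        hidden = tuple(each.to(device) for each in hidden)
        
        # Train the model
        hidden = tuple([each.data for each in hidden])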
        model.zero_grad()
        output, hidden = model(text_batch, hidden)
        loss = criterion(output.squeeze(), labels)
        loss.backward()
        clip = 5
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        # Accumulate loss
        #val_losses.append(loss.item())
        total_losses.append(loss.item())
        
        correct_count = 0.0
        if steps % print_every == 0:
            model.eval()
            
            # Calculate accuracy
            ps = torch.exp(output)
            top_p, top_class = ps.topk(1, dim=1)
            #?top_class = top_class.to(device)
            #?labels = labels.to(device)

            correct_count += torch.sum(top_class.squeeze()== labels)
            #accuracy.append(100*correct_count/len(labels))
            
            # TODO Implement: Print metrics
            print("Epoch: {}/{}...".format(epoch+1, epochs),
                 "Step: {}...".format(steps),
                 "Loss: {:.6f}...".format(loss.item()),
                 "Total Loss: {:.6f}".format(np.mean(total_losses)),
                 #"Collect Count: {}".format(correct_count),
                 #"Accuracy: {:.2f}".format((100*correct_count/len(labels))),
                 # AttributeError: 'torch.dtype' object has no attribute 'type'
                 #"Accuracy Avg: {:.2f}".format(np.mean(accuracy))
                 )
            
            model.train()

Starting epoch 1
Epoch: 1/5... Step: 100... Loss: 0.017694... Total Loss: 0.041833
Epoch: 1/5... Step: 200... Loss: 0.000830... Total Loss: 0.024216
Starting epoch 2
Epoch: 2/5... Step: 100... Loss: 0.011427... Total Loss: 0.015538
Epoch: 2/5... Step: 200... Loss: 0.000126... Total Loss: 0.012561
Starting epoch 3
Epoch: 3/5... Step: 100... Loss: 0.001006... Total Loss: 0.009969
Epoch: 3/5... Step: 200... Loss: 0.000395... Total Loss: 0.008784
Starting epoch 4
Epoch: 4/5... Step: 100... Loss: 0.000173... Total Loss: 0.007565
Epoch: 4/5... Step: 200... Loss: 0.000247... Total Loss: 0.006909
Starting epoch 5
Epoch: 5/5... Step: 100... Loss: 0.000030... Total Loss: 0.006198
Epoch: 5/5... Step: 200... Loss: 0.000130... Total Loss: 0.005765
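
The commented-out accuracy lines in the cell above hint at a metric that was never printed. As a rough sketch of what that metric is, the snippet below computes batch accuracy from log-probabilities, using plain Python lists as a stand-in for tensors; `batch_accuracy` and the sample values are hypothetical. Inside the training loop, assuming `output` has shape `(batch, 5)`, an equivalent would be something like `(output.argmax(dim=1) == labels).float().mean().item()`, where `.item()` converts the result to a Python float so that it can be averaged with NumPy.

```python
import math


def batch_accuracy(log_probs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    correct = 0
    for row, label in zip(log_probs, labels):
        # argmax over the class dimension; exp() is monotonic, so the
        # argmax of log-probabilities equals the argmax of probabilities.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical batch of three samples over five classes.
log_probs = [
    [math.log(p) for p in (0.05, 0.05, 0.10, 0.10, 0.70)],
    [math.log(p) for p in (0.60, 0.10, 0.10, 0.10, 0.10)],
    [math.log(p) for p in (0.10, 0.10, 0.50, 0.20, 0.10)],
]
labels = [4, 0, 3]

print("Accuracy: {:.2f}%".format(100 * batch_accuracy(log_probs, labels)))
```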


In [71]:
torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

## Creating the prediction function

Implement a predict function that uses the trained model to generate a prediction from input text.

The text must be preprocessed before it is passed to the network.

In [72]:
import glob
import pickle
import re
import nltk
import numpy as np
import os
import sys

import torch

nltk.download('wordnet')

cur_dir = os.path.dirname(os.path.abspath('__file__'))
print(cur_dir)
sys.path.append(cur_dir)

vocab_filename = 'vocab.pickle'
vocab_path = cur_dir + "/" + vocab_filename
with open(vocab_path, 'rb') as f:
    vocab_l = pickle.load(f)

#model_path = cur_dir + "/" + "model.torch"
#model_l = torch.load(model_path, map_location='cpu')

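model_l = SentimentClassifier(len(vocab_l)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
# Load the weights onto the CPU so the checkpoint also works without a GPU
checkpoint = torch.load('./checkpoint.pth.tar', map_location='cpu')
model_l.load_state_dict(checkpoint['state_dict'])
# Switch to evaluation mode so that dropout is disabled during prediction
model_l.eval()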

class UnknownWordsError(Exception):
  "Only unknown words are included in text"


def predict_func(text, model, vocab):
    """ 
    Make a prediction on a single sentence.
    Parameters
    ----------
        text : The string to make a prediction on.
        model : The model to use for making the prediction.
        vocab : Dictionary for word to word ids. The key is the word and the value is the word id.
    Returns
    -------
        pred : Predicted probability for each class (torch tensor).
    """

    tokens = preprocess(text)    

    # Filter non-vocab words
    tokens = [token for token in tokens if token in vocab] #pass
    # Convert words to ids
    tokens = [vocab[token] for token in tokens] #pass

    if len(tokens) == 0:
        raise UnknownWordsError

    # Adding a batch dimension
    text_input = torch.LongTensor(tokens).view(-1, 1)

    # Get the NN output       
    batch_size = 1
    hidden = model.init_hidden(batch_size) #pass
    
    logps, _ = model(text_input, hidden) #pass
    # The network outputs log-probabilities; take the exponent to get a
    # probability between 0 and 1 for each label.
    pred = torch.exp(logps)
    
    return pred


def predict_api(args):
    """ 
    Make a prediction on a single sentence.
    Parameters
    ----------
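        args : Input (Python dictionary with a 'text' key)
    Returns
    -------
        pred : Predicted probabilities (NumPy array, or a Python list if the text contains only unknown words)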
    """
    text = args.get('text')
    try:
        result = predict_func(text, model_l, vocab_l)
        return result.detach().numpy()[0]
    except UnknownWordsError:
        return [0,0,1,0,0]

/home/cdsw


[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
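
The filtering step inside `predict_func` is what makes the `UnknownWordsError` path possible: tokens missing from the vocabulary are dropped before the id lookup, so a sentence made up entirely of unknown words yields an empty list. A minimal sketch of that step follows, assuming a hypothetical three-word vocabulary and a whitespace split in place of the notebook's `preprocess` function (which is defined earlier and not reproduced here).

```python
def encode(text, vocab):
    """Map a sentence to word ids, dropping words that are missing from vocab."""
    # Stand-in tokenizer (assumption): lowercase and split on whitespace.
    tokens = text.lower().split()
    return [vocab[token] for token in tokens if token in vocab]


# Hypothetical vocabulary for illustration; the ids are arbitrary.
toy_vocab = {"bullish": 1, "bearish": 2, "on": 3}

print(encode("I am bullish on goog", toy_vocab))
print(encode("kono yoshiyuki", toy_vocab))
```

The second call returns an empty list, which is the condition under which `predict_func` raises `UnknownWordsError`.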


Predict on a sentence that suggests positive sentiment (feel free to change the sentence and rerun the cell).

In [73]:
args = {"text": "I'm bullish on $goog"}
result = predict_api(args)
print(result)

[ 0.03567122  0.01573995  0.05834575  0.01576803  0.874475  ]


Predict on a sentence that suggests negative sentiment (feel free to change the sentence and rerun the cell).

In [74]:
args = {"text": "I'm bearish on $goog"}
result = predict_api(args)
print(result)

[ 0.23630378  0.20161127  0.31348726  0.19983846  0.04875924]
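
To compare the two outputs above at a glance, each probability vector can be reduced to its most likely class and to a probability-weighted average class index. The sketch below is illustrative only: the helper `summarize` is hypothetical, and it assumes the five classes are ordered from most negative (index 0) to most positive (index 4), which this section does not state explicitly.

```python
def summarize(probs):
    """Return (most likely class index, probability-weighted mean index)."""
    top_class = max(range(len(probs)), key=lambda i: probs[i])
    expected = sum(i * p for i, p in enumerate(probs))
    return top_class, expected


# Probability vectors copied from the two outputs above.
bullish = [0.03567122, 0.01573995, 0.05834575, 0.01576803, 0.874475]
bearish = [0.23630378, 0.20161127, 0.31348726, 0.19983846, 0.04875924]

for name, probs in (("bullish", bullish), ("bearish", bearish)):
    top_class, expected = summarize(probs)
    print("{}: top class {}, expected index {:.2f}".format(name, top_class, expected))
```

Under that ordering assumption, the bullish sentence concentrates its probability mass on index 4, while the bearish sentence spreads it across indices 0 to 3, with index 2 the largest.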


Predict on a sentence made up only of words that are not in the vocabulary dictionary (feel free to change the sentence and rerun the cell).

In [75]:
args = {"text": "kono yoshiyuki"}
result = predict_api(args)
print(result)

[0, 0, 1, 0, 0]


### Wrapping up

To delete the database, run the following **after changing the database name appropriately**.

In [77]:
%sql DROP DATABASE IF EXISTS user5 CASCADE

   hive://user4@master.ykono.work:10000
 * hive://user5@master.ykono.work:10000
Done.


[]