# Data Science / Machine Learning Meetup #1 Deep Learning Hands-on
# Alternative Data and Natural Language Processing

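## Introduction

The exercise proceeds as follows.
1. Environment setup
1. Web scraping
1. Data transformation
1. Sentiment analysis
    1. Preprocessing
    1. Building the neural network
    1. Training
    1. Prediction

Please note the following.
- Some of the code you run must be changed to match the user name you are using.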

## 1. Environment Setup

### Installing and Importing Packages

In [1]:
!pip3 install ipython-sql==0.3.9
!pip3 install PyHive==0.6.1
!pip3 install SQLAlchemy==1.3.13
!pip3 install thrift==0.13.0
!pip3 install sasl==0.2.1
!pip3 install thrift_sasl==0.3.0

!pip3 install nltk==3.4.5
!pip3 install torch==1.4.0

Collecting nltk==3.4.5
  Downloading https://files.pythonhosted.org/packages/f6/1d/d925cfb4f324ede997f6d47bea4d9babba51b49e87a767c170b77005889d/nltk-3.4.5.zip (1.5MB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Stored in directory: /home/cdsw/.cache/pip/wheels/96/86/f6/68ab24c23f207c0077381a5e3904b2815136b879538a24b483
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting torch==1.4.0
  Downloading https://files.pythonhosted.org/packages/24/19/4804aea17cd136f1705a5e98a00618cb8f6ccc375ad8bfa437408e09d058/torch-1.4.0-cp36-cp36m-manylinux1_x86_64.whl (753.4MB)

Successfully built sasl
Installing collected packages: sasl
Successfully installed sasl-0.2.1
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting thrift_sasl==0.3.0
  Downloading https://files.pythonhosted.org/packages/50/fe/89cbc910809e3757c762f56ee190ca39e0f28b7ea451835232c0c988d706/thrift_sasl-0.3.0.tar.gz
Building wheels for collected packages: thrift-sasl
  Building wheel for thrift-sasl (setup.py) ... done
  Stored in directory: /home/cdsw/.cache/pip/wheels/c8/3a/34/1d82df3d652788fc211c245d51dde857a58e603695ea41d93d
Successfully built thrift-sasl
Installing collected packages: thrift-sasl
Successfully installed thrift-sasl-0.3.0
You should consider upgrading via the 'pip install --upgrade pip' command.


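PyHive, installed above, is used here less through an explicit `import` in the Python code than as a dialect: it is required when `hive` is named in the connection string `sqlalchemy.create_engine('hive://<host>:<port>')`. For that reason, a new process must be started before the freshly installed package can be used. **After installing, restart the kernel once.** In the process that performed the installation, the following error occurs at connection time: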
`NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hive`
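
One way to confirm that the install registered the dialect is to list the entry points in the `sqlalchemy.dialects` group. The sketch below uses only the standard library module `importlib.metadata`, which needs Python 3.8 or later (newer than the `cp36` wheels in the log above, so treat the interpreter version as an assumption). The helper name `registered_dialects` is hypothetical. If `hive` is listed but the connection still fails with `NoSuchModuleError`, the running kernel most likely predates the install and should be restarted.

```python
from importlib import metadata


def registered_dialects(group="sqlalchemy.dialects"):
    """Return the sorted names of installed entry points in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


# Prints whatever dialect plugins are installed; with PyHive present the
# list is expected to include "hive".
print(registered_dialects())
```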

In [36]:
import json
import os
import random
import re
import subprocess
import glob
import traceback
from datetime import datetime

from pyhive import hive
import sqlalchemy

import sys
#from random import random
from operator import add
from pyspark.sql import SparkSession

import torch
import nltk
from torch import nn, optim
import torch.nn.functional as F

## 2. Web Scraping

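This exercise uses an API that is available free of charge, so please keep in mind that certain usage limits apply.
For example, depending on your usage, you may receive an error message like the following.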

```
{"response":{"status":429},"errors":[{"message":"Rate limit exceeded. Client may not make more than 200 requests an hour."}]}
```
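
A response like the one above can be recognized programmatically before it is saved. The following is a minimal sketch that assumes the error layout shown in the message (`response.status` holding the HTTP status); the function name `is_rate_limited` is hypothetical and not part of the notebook.

```python
import json


def is_rate_limited(response_text):
    """Return True if the response body reports status 429."""
    try:
        payload = json.loads(response_text)
    except ValueError:
        # Not JSON at all (for example an empty body).
        return False
    if not isinstance(payload, dict):
        return False
    response = payload.get("response")
    if not isinstance(response, dict):
        return False
    return response.get("status") == 429


sample = ('{"response":{"status":429},"errors":[{"message":"Rate limit exceeded. '
          'Client may not make more than 200 requests an hour."}]}')
print(is_rate_limited(sample))  # True
```

A download loop could stop, or wait until the next hour, as soon as this check returns `True`.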
First, save the data retrieved from the API as files inside the CDSW project.

In [6]:
!mkdir -p ./data



In [7]:
# ticker.txt holds one ticker symbol per line
with open("ticker.txt") as myfile:
    data = myfile.readlines()
#print(data)
ticker_list = [i.rstrip('\n') for i in data]

print(len(ticker_list))
print(ticker_list)

2882
['A', 'AA', 'AAL', 'AAN', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAWW', 'AAXN', 'ABBV', 'ABC', 'ABCB', 'ABEO', 'ABG', 'ABM', 'ABMD', 'ABT', 'ABTX', 'ACA', 'ACAD', 'ACCO', 'ACEL', 'ACGL', 'ACHC', 'ACHN', 'ACHV', 'ACIA', 'ACIW', 'ACLS', 'ACM', 'ACN', 'ACNB', 'ACOR', 'ACRS', 'ACRX', 'ACTG', 'ADBE', 'ADES', 'ADI', 'ADM', 'ADMA', 'ADMP', 'ADMS', 'ADP', 'ADPT', 'ADRO', 'ADS', 'ADSK', 'ADSW', 'ADT', 'ADTN', 'ADUS', 'ADVM', 'ADXS', 'AE', 'AEE', 'AEGN', 'AEIS', 'AEL', 'AEM', 'AEMD', 'AEO', 'AEP', 'AERI', 'AES', 'AFG', 'AFI', 'AFL', 'AG', 'AGCO', 'AGEN', 'AGFS', 'AGI', 'AGIO', 'AGLE', 'AGM', 'AGN', 'AGO', 'AGR', 'AGRX', 'AGS', 'AGTC', 'AGX', 'AGYS', 'AHC', 'AHCO', 'AIG', 'AIMC', 'AIMT', 'AIN', 'AIR', 'AIRG', 'AIRT', 'AIT', 'AIZ', 'AJG', 'AJRD', 'AKAM', 'AKBA', 'AKCA', 'AKRO', 'AKRX', 'AKS', 'AL', 'ALB', 'ALCO', 'ALDX', 'ALE', 'ALEC', 'ALG', 'ALGN', 'ALGT', 'ALIM', 'ALK', 'ALKS', 'ALL', 'ALLK', 'ALLO', 'ALLY', 'ALNY', 'ALOT', 'ALPN', 'ALRM', 'ALRN', 'ALSK', 'ALSN', 'ALT', 'ALTR', 'ALV', 'ALXN', 'AM

In [9]:
symbols = ['BBRY', 'AAPL', 'AMZN', 'BABA', 'YHOO', 'LQMT', 'FB', 'GOOG', 'BBBY', 'JNUG', 'SBUX', 'MU']

# The free API allows at most 200 requests per hour, so keep the total
# (fixed symbols plus random picks) within NUM_REQUEST.
NUM_REQUEST = 200
#symbols.extend(ticker_list[0:50])
symbols.extend(random.sample(ticker_list, NUM_REQUEST - len(symbols)))

args = ['curl', '-X', 'GET', '']
URL = "https://api.stocktwits.com/api/2/streams/symbol/"

FILE_PATH = "./data/"

# One timestamp for the whole run, so that all files share the same suffix
start_datetime = datetime.now().strftime("%Y%m%d_%H%M")
for symbol in symbols:
    try:
        args[3] = URL + symbol + ".json"
        print(args[3])
        proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        # Save the raw response body, whether or not the request succeeded
        path = FILE_PATH + symbol + "_" + start_datetime + ".json"
        print(path)
        with open(path, mode='w') as f:
            f.write(proc.stdout.decode("utf8"))
    except Exception:
        traceback.print_exc()

https://api.stocktwits.com/api/2/streams/symbol/BBRY.json
./data/BBRY_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/AAPL.json
./data/AAPL_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/AMZN.json
./data/AMZN_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/BABA.json
./data/BABA_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/YHOO.json
./data/YHOO_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/LQMT.json
./data/LQMT_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/FB.json
./data/FB_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/GOOG.json
./data/GOOG_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/BBBY.json
./data/BBBY_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/JNUG.json
./data/JNUG_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/SBUX.json
./data/SBUX_20200129_0214.json
https://api.stocktwits.co

./data/ARNC_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/GORO.json
./data/GORO_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/SLGN.json
./data/SLGN_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/AMRS.json
./data/AMRS_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/MDB.json
./data/MDB_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/PGNY.json
./data/PGNY_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/ACOR.json
./data/ACOR_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/ANIX.json
./data/ANIX_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/FOXF.json
./data/FOXF_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/FIT.json
./data/FIT_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/FE.json
./data/FE_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/IMAX.json
./data/IMAX_20200129_0214.jso

./data/OBNK_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/NTAP.json
./data/NTAP_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/SBSI.json
./data/SBSI_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/VRS.json
./data/VRS_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/MUX.json
./data/MUX_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/TCX.json
./data/TCX_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/MGPI.json
./data/MGPI_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/SVMK.json
./data/SVMK_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/PRPL.json
./data/PRPL_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/CHE.json
./data/CHE_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/MDLZ.json
./data/MDLZ_20200129_0214.json
https://api.stocktwits.com/api/2/streams/symbol/NVST.json
./data/NVST_20200129_0214.jso

In [10]:
!grep -rl error data | xargs -r rm
!grep -rL '{"response":{"status":200}' data | xargs -r rm
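
The same cleanup can be sketched in Python, which keeps the deletion confined to one explicitly named directory. This is an illustrative alternative rather than the notebook's method: the function name `remove_failed_responses` is hypothetical, and it assumes every saved file is a single JSON document with the status under `response.status`.

```python
import json
import os


def remove_failed_responses(directory):
    """Delete files in `directory` whose JSON status is not 200; return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, encoding="utf8") as f:
                status = json.load(f)["response"]["status"]
        except (ValueError, KeyError, TypeError):
            # Unparsable content or an unexpected layout counts as a failure.
            status = None
        if status != 200:
            os.remove(path)
            removed.append(name)
    return removed
```

Calling `remove_failed_responses("./data")` would remove rate-limited and otherwise failed downloads and return the list of deleted file names.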

Next, copy the saved files to HDFS so that they can be processed in the distributed environment (the cluster).

In [11]:
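os.environ['HADOOP_CONF_DIR'] = "/etc/spark/conf/yarn-conf"

# A relative HDFS path resolves against the user's HDFS home directory
HDFS_PATH_DIR = './twits/'

args = ['hdfs', 'dfs', '-put', '', HDFS_PATH_DIR]

# Create the target directory (-p: no error if it already exists)
try:
    args_mkdir = ['hdfs', 'dfs', '-mkdir', '-p', HDFS_PATH_DIR]
    proc = subprocess.run(args_mkdir, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
except Exception:
    traceback.print_exc()

file_list = glob.glob("./data/*")

# Upload each downloaded JSON file
for file in file_list:
    try:
        args[3] = file
        print(args)

        proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # subprocess.run does not raise on a non-zero exit status, so report failures explicitly
        if proc.returncode != 0:
            print(proc.stderr.decode("utf8"))
    except Exception:
        traceback.print_exc()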

['hdfs', 'dfs', '-put', './data/UTHR_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/ICUI_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/HRI_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/MAR_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/NVST_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/GTN_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/V_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/ANIP_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/HFC_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/MMSI_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/SWKS_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/NWFL_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/OC_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/HTLF_20200129_0214.json', './twits/']

['hdfs', 'dfs',


['hdfs', 'dfs', '-put', './data/OBNK_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/AMZN_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/CEIX_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/NWN_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/MDLZ_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/SNPS_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/RAMP_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/SSP_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/CDNA_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/NHC_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/PFGC_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/RCM_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/CGNX_20200129_0214.json', './twits/']

['hdfs', 'dfs', '-put', './data/CACC_20200129_0214.json', './twits/']

['hdfs', 
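
Because `subprocess.run` returns normally even when the command exits with an error, a `try`/`except` around it only catches problems such as a missing executable. A small wrapper that surfaces the exit status is sketched below. The name `run_checked` is hypothetical, and a Python one-liner stands in for the `hdfs` client so that the sketch runs without Hadoop.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command and return (succeeded, stderr_text)."""
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode == 0, proc.stderr.decode("utf8", errors="replace")


# Stand-in command that exits with status 3. In the notebook the argument
# list would be ['hdfs', 'dfs', '-put', <local file>, './twits/'].
ok, err = run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok)  # False
```

With this in place, an upload loop can collect the files for which `ok` is `False` and retry or report them.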

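## 3. Data Transformation

The data is transformed on the cluster. On CDSW each user worked in a separate project, but in the cluster environment you need to stay aware of which user you are acting as and which data belongs to you.


You can check your user name in the connection string below.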

In [10]:
sqlalchemy.create_engine('hive://user2@master.ykono.work:10000')

Engine(hive://user2@master.ykono.work:10000)

In [None]:
#$ beeline -u 'jdbc:hive2://10.0.0.55:10000' -f tables.hql

In [15]:
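#HDFS_PATH_DIR = '/tmp/'
HDFS_PATH_DIR = './'  # a relative HDFS path resolves against the user's HDFS home directory

args = ['hdfs', 'dfs', '-put', '', HDFS_PATH_DIR]

file_list = glob.glob("./lib/*")

# Upload every file under ./lib to HDFS
for file in file_list:
    try:
        args[3] = file
        print(args)

        proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print("stdout:", proc.stdout.decode("utf8"))
    except Exception:
        traceback.print_exc()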

['hdfs', 'dfs', '-put', './lib/json-1.3.7.3.jar', './']
stdout: 
['hdfs', 'dfs', '-put', './lib/README.md', './']
stdout: 
['hdfs', 'dfs', '-put', './lib/brickhouse-0.7.1-SNAPSHOT.jar', './']
stdout: 
['hdfs', 'dfs', '-put', './lib/json-serde-cdh5-shim-1.3.7.3.jar', './']
stdout: 
['hdfs', 'dfs', '-put', './lib/json-serde-1.3.7.3.jar', './']
stdout: 


In [11]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


**In the cell below, replace the user name and the URL (Hive server) with the appropriate values.**

In [12]:
%sql hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000

'Connected: user2@None'
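
The user name recurs in several places in this section: in the connection URL, as the database name, and in the table `LOCATION`. Deriving all three from one value helps keep them consistent. The helper below is a hypothetical sketch, not part of the notebook; it assumes the default HiveServer2 port 10000 seen above and the conventional HDFS home directory `/user/<user name>`.

```python
def hive_settings(user, host, port=10000):
    """Derive the per-user connection URL, database name and HDFS location."""
    return {
        "url": "hive://{}@{}:{}".format(user, host, port),
        "database": user,
        # Assumption: files were uploaded to ./twits/ under the HDFS home directory.
        "location": "/user/{}/twits".format(user),
    }


settings = hive_settings("user2", "ip-10-0-0-55.ap-northeast-1.compute.internal")
print(settings["url"])
# hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
```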

**Create and use a database named after your user name.**

In [14]:
%sql CREATE DATABASE IF NOT EXISTS user2
%sql USE user2
%sql SHOW TABLES

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


tab_name


In [17]:
%sql add jar hdfs:/tmp/json-1.3.7.3.jar
%sql add jar hdfs:/tmp/json-serde-1.3.7.3.jar
%sql add jar hdfs:/tmp/json-serde-cdh5-shim-1.3.7.3.jar

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [18]:
%sql DROP TABLE IF EXISTS twits
%sql DROP TABLE IF EXISTS message_extracted
%sql DROP TABLE IF EXISTS message_filtered
%sql DROP TABLE IF EXISTS message_exploded
%sql DROP TABLE IF EXISTS sentiment_data

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

**For `LOCATION`, specify the path to which you uploaded the files.**

In [20]:
%%sql
CREATE EXTERNAL TABLE twits (
	messages 
	ARRAY<
	    STRUCT<body: STRING,
	        symbols:ARRAY<STRUCT<symbol:STRING>>,
	        entities:STRUCT<sentiment:STRUCT<basic:STRING>>
	    >
	>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS TEXTFILE
LOCATION '/user/user2/twits'

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]
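
The table definition maps three nested fields of each response: `messages[].body`, `messages[].symbols[].symbol`, and `messages[].entities.sentiment.basic`. To make that layout concrete, the sketch below walks the same structure in plain Python and keeps only the messages that carry a sentiment label. It illustrates the JSON shape and is not the notebook's Hive pipeline. The function name `labeled_messages` is hypothetical, and the sample is abbreviated from the kind of StockTwits response this notebook collects.

```python
import json


def labeled_messages(response_text):
    """Yield (body, sentiment) for messages that carry a sentiment label."""
    payload = json.loads(response_text)
    for message in payload.get("messages", []):
        sentiment = (message.get("entities") or {}).get("sentiment")
        if sentiment:  # sentiment is null for unlabeled messages
            yield message["body"], sentiment["basic"]


sample = json.dumps({
    "response": {"status": 200},
    "messages": [
        {"body": "$AAPL 350 tomorrow!",
         "symbols": [{"symbol": "AAPL"}],
         "entities": {"sentiment": None}},
        {"body": "$AAPL Blow off top inbound",
         "symbols": [{"symbol": "AAPL"}],
         "entities": {"sentiment": {"basic": "Bearish"}}},
    ],
})
print(list(labeled_messages(sample)))
# [('$AAPL Blow off top inbound', 'Bearish')]
```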

In [21]:
%%sql
select * from twits limit 3

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


messages
"[{""body"":""$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""},{""symbol"":""SPY""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \n\nApple Reports 1Q 2020 Results: $22.2B Profit on $91.8B Revenue, Best Quarter Ever.\n\nHow is it even fathomable that some of you are still rationalizing.\nSome are saying 300 tomorrow.\nEither you’re a bear, placed puts, want to annoy some of the winners, or plain stupid!\nAt least 330 is a given.\nProbably between 335 - 360 if had to bet.\n400 is definitely a possibility.\nNo one knows for sure!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL Congrats, longs. It&#39;s been hard not booking some gains on looooong held shares. I felt a beat coming on, and strong guidance, but you never really know. Are we back to sandbagging guidance? Def. back to growth."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL ambulance for the bears. \nTik tok:westtt80"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL goes up 4 dollars and bulls Cole out of the cave"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$AAPL Lackluster services # wasn&#39;t that supposed to be a primary growth driver."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$SPY $AAPL $AMZN $TSLA still early but why not have some fun 🤷‍♂️"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""SPY""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL futures going very green right now with the info that the epidemic is possibly slowing down. Less new cases today than yesterday. 
Gonna help push Apple higher."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$SPY when I think of bears I always see them still using them flip phones 😂😂🤦‍♂️ $AAPL"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL can’t wait to more\nOpen flush"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$BABA 9988-HK, BABA HK chart, perfect double zig-zag correction, a breach of 215.2 in HK will confirm the long side, we may have to wait till earnings for that. $AAPL just confirmed new ATH on earnings which tells me the lows are in and next leg up started. Happy trading."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""BABA""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL 350 tomorrow!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$SPCE i think a lot of money is going to go to aapl and tsla tomorrow, i think my baby space dips tomorrow and possibly until friday, i would ride the $AAPL and $TSLA wave and get back into spce on friday around noon... imho"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""TSLA""},{""symbol"":""SPCE""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \nTHIS ARTICLE WILL MAKE YOU MONEY!!\nhttps://stocktalks.substack.com/p/earning-report-plays-jan-27-31"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Apple announces earnings. $4.99 EPS. Beats estimates. $91.80b revenue. https://www.marketbeat.com/s/440450 $AAPL"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""Apple announces earnings. $4.99 EPS. Beats estimates. $91.80b revenue. https://www.marketbeat.com/s/440450 $AAPL"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL what happened to all the bears? 
They get the corona virus?"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL Blow off top inbound"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$AAPL The dummy parade of gloating bears looks silly, in hindsight."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$FB $MSFT futures showing green. 👀 $AAPL ER blew it out of the water. Might be a big money day for bulls. More ERs to come"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""MSFT""},{""symbol"":""FB""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL So we opening at 320 or 330??"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL i live in a retirement comunity in boca raton. I had s friend that held 400 shares for a few years. He rode it to 700 and it split 7 form1 Then he had 2800 shares at 100. Now he has 2800 shares at 300 and it is closer to going to to 700 again than going back to 100. I believe in this. I see it at 700 in a year or so. These earnings will just keep growing. Then it will split again. And it pays dividends and there a stock buy backs."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL \nThe company made more money than the combined Money of ALL OF EASTERN CANADA in 1 q \n \nand these bear morons are drawing lines and think AH “sell off” is going to bring us to 300 \n \nYOU are going to get burned like Tim Cook taking it up the ass as he celebrates the amazing numbers !! 
\n \ndeal with it STOP POSTING \nyou are looking moronic !"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$SPY $AAPL Friends."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL CREAM! GET THE DIVIDEND, DOLLAR DOLLAR BILL YALL \nWU TANG AINT NOTHING TO INVEST WITH."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL NEW ARTICLE : Apple&#39;s fastest-growing business segment, which includes AirPods and Watch, is now bigger than the Mac https://dashboard.stck.pro/news.php?ticker=AAPL&amp;rowid=3423747 Get all the latest $AAPL related news here : https://dashboard.stck.pro/news.php?ticker=AAPL"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$SPY bears will be chocking on them 🍏s tmrw $AAPL"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$TSLA $AAPL $UBER"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""TSLA""},{""symbol"":""UBER""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL right now estimates say earnings grow 27% end of 2021... and PE is around 27X... seems like PE multiple expansion based on more confidence on future growth with a two year window $SPY $QQQ $DIA"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""DIA""},{""symbol"":""SPY""},{""symbol"":""QQQ""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL Apple earnings and sales surge to record, sending stock toward new highs\n\nhttps://www.marketwatch.com/story/apple-stock-gains-after-record-earnings-upbeat-forecast-2020-01-28?mod=newsviewer_click"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}}]"
"[{""body"":""Piper Sandler Lowers Ameris Bancorp FY2020 Earnings Estimates to $4.10 EPS (Previously $4.18). https://www.marketbeat.com/x/776220 $ABCB"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""SunTrust Banks Lowers Ameris Bancorp Q1 2020 Earnings Estimates to $0.99 EPS (Previously $1.03). https://www.marketbeat.com/x/776221 $ABCB"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Ameris Bancorp Expected to Earn Q1 2020 Earnings of $0.99 Per Share \n\nhttps://newsfilter.io/a/77ad8ac1c1b5a3f81991e1453d4fa887"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Be aware! A lot of indicators predict a fall for Ameris Bancorp ($ABCB) during next few days. Full story HERE https://stockinvest.us/l/zCSsJmOe9v"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$ABCB Ameris Bancorp Issues Earnings Results \n\nhttps://newsfilter.io/a/d9bcba462224c2e882a89dbcfe37d9d9"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Ameris Bancorp (ABCB) announces earnings. $0.96 EPS. Meets estimates. 66.81M earnings. $ABCB https://www.tipranks.com/stocks/ABCB/earnings-calendar?ref=TREarnings"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Ameris Bancorp just filed its Current report, items 2.02, 7.01, and 9. 
http://www.conferencecalltranscripts.org/include?location=http://www.sec.gov/Archives/edgar/data/351569/000119312520013278/0001193125-20-013278-index.htm $ABCB"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB / Ameris Bancorp files form 8-K - Regulation FD Disclosure, Financial Statements and Exhibits, Results of Operations and Financial Condition https://fintel.io/s/us/abcb?utm_source=stocktwits.com&amp;utm_medium=Social&amp;utm_campaign=filing"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB just filed a Earnings Release, a Regulated Disclosure and a Financial Exhibit https://last10k.com/sec-filings/abcb/0001193125-20-013278.htm?utm_source=stocktwits&amp;utm_medium=forum&amp;utm_campaign=8K&amp;utm_term=abcb"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Ameris Bancorp ($ABCB) 4Q19 Investor Call On 24th January 2020 At 9:30 AM Eastern Time\n\nhttps://www.stockmarketintellects.com/ameris-bancorp-nasdaqabcb-4q19-investor-call-on-24th-january-2020-at-930-am-eastern-time/"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Earnings Call Tomorrow: 9:30 AM EST \nAnalyst Rating: Strong Buy \nPhone: 412-317-0088 \nPIN: 10138227 \nWebcast: http://mmm.wallstreethorizon.com/u.asp?u=121865"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB beats on revenue too as per earningswhisper"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABCB: Issued Press Release on January 23, 17:13:00: Ameris Bancorp Announces Fourth Quarter And Full Year 2019 Financial Results https://s.flashalert.me/qB9L8y"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB reported earnings of $0.96, consensus was $0.96, Earnings Whisper was $0.95 via @eWhispers #whisperbeat 
http://eps.sh/d/abcb"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB NEW ARTICLE : Ameris Bancorp EPS in-line, misses on revenue https://dashboard.stck.pro/news.php?ticker=ABCB&amp;rowid=3365091 Get all the latest $ABCB related news here : https://dashboard.stck.pro/news.php?ticker=ABCB"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Ameris Bancorp EPS in-line, misses on revenue \n\nhttps://newsfilter.io/a/adc7d73ce6a8e088e301d23681b67e3a"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Ameris Bancorp Announces Fourth Quarter And Full Year 2019 Financial Results \n\nhttps://newsfilter.io/a/4baf4f255b0bbc07599b072b70bd0e79"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB is scheduled to report #earnings after the market closes today via @eWhispers http://eps.sh/s/abcb"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB NEW ARTICLE : Notable earnings after Thursday&#39;s close https://dashboard.stck.pro/news.php?ticker=ABCB&amp;rowid=3355428"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Ameris Bancorp to release earnings after the market closes on Thursday. Analysts expect 0.96 EPS. $ABCB https://www.marketbeat.com/p/44115"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Peregrine Capital Management Decreases Holdings in Ameris Bancorp \n\nhttps://newsfilter.io/a/adfb732351c913358e2f2aaecabdf7e7"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Wow this is a big change! $ABCB MACD Histogram just turned positive. View odds of uptrend. https://tickeron.com/go/1133393"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Piper Sandler Sets Ameris Bancorp FY2020 Earnings Estimates at $4.18 EPS. 
https://www.marketbeat.com/x/770045 $ABCB"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""Summit Creek Advisors LLC,has filed Form 13F for Q4 2019.Opened NEW positions in $ABCB $GO $TREX $WEX"",""symbols"":[{""symbol"":""WEX""},{""symbol"":""ABCB""},{""symbol"":""TREX""},{""symbol"":""GO""}],""entities"":{""sentiment"":null}},{""body"":""Ameris Bancorp (ABCB) to release earnings after the market closes on Thursday, January 23. Expected EPS: 0.96. $ABCB https://www.tipranks.com/stocks/ABCB/earnings-calendar?ref=TREarnings"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB: Issued Press Release on January 10, 13:30:00: Ameris Bancorp Announces Date Of Fourth Quarter 2019 Earnings Release And Conference https://s.flashalert.me/mvHR2"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB Ameris Bancorp Announces Date Of Fourth Quarter 2019 Earnings Release And Conference Call: Ameris Bancorp announced that it intends to release its fourth quarter and full year 2019 financial res.. https://newsfilter.io/a/9541b6c3c7855bb46ded4ac930a1b94e"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB 8 sec ago: Ameris Bancorp Lowered to Strong Sell at BidaskClub: BidaskClub downgraded shares of Ameris Bancorp from a sell rating to a strong sell rating in a research note issued to investors on Saturday .. https://newsfilter.io/articles/ameris-bancorp-nasdaqabcb-lowered-to-strong-sell-at-bidaskclub-a3661e0916a89b867d46b8c8b7886d4e"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB - Should I keep both Ameris Bancorp and Eagle Bancorp… http://dlvr.it/RMYD3F #portfolio_prospective #better_portfolio #diversify"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}},{""body"":""$ABCB - Earnings call in two weeks. 
81% chance to finish above $42 in February http://dlvr.it/RMQrr7"",""symbols"":[{""symbol"":""ABCB""}],""entities"":{""sentiment"":null}}]"
"[{""body"":""Acorda Therapeutics $ACOR Trading Report http://tinyurl.com/s7e9gw4"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ADAP \nHere&#39;s why I bought in today:\n1. Bull flag holding a strong level of support on the daily AND weekly\n2. Overall trend is upward on the daily and weekly\n3. RSI is curling upwards on the daily and weekly\n4. Expecting the 20/50 sma cross on the daily and weekly\n5. MACD curling on the weekly\n6. Bullish engulfing on the daily\nWe can see on the daily it somewhat broke out of the bull flag with but pulled back to that resistance. On the weekly, it used to be in a downward trend but had a nice gap up and pulled back to support where we can see it was touched twice before. Right now the candlestick is a doji which imo is a good sign because the week before it was a red candle and this is telling us the buyers are trying to stop the bears. My SL for this is 3.5 and my PT are 4.5, 5.0, 5.5 and 6.38(200 sma on the weekly). I also found out there was some insider buys today AH which means we might see a move up soon just like $ACOR. I&#39;ll post the weekly chart below"",""symbols"":[{""symbol"":""ACOR""},{""symbol"":""ADAP""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Don’t forget the Bill and Melinda Gates Foundation has had an association with the ARCUS technology"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR lots of churning in the T/R .... but ... 
needs to see weekly chart break above 2.75 to get me in ..."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR very low volume"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR adding here 2.09"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""When I compare the Beta value, $ACOR does not outperform its peers. Check for yourself https://wallmine.com/nasdaq/acor/peers/beta?utm_source=stocktwits"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR and with that dip, I’m in!"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Wedbush Sets Acorda Therapeutics FY2024 Earnings Estimates at ($0.99) EPS. https://www.marketbeat.com/x/776227 $ACOR"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Watching for entry below 2.15 for a bounce. Target 2.50."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR I expected this below $2 today and what a surprise.... sombody clearly accumulating....."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Crazy 2 trading days in a row I could have sold on top end and rebought low to accumulate BUT I don’t want to sell and suddenly it keeps going up and out of range. Thursday high of $2.5, Friday $2.44 and today $2.37. Oh well I can’t predict the future"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR bullish tomorrow gap up"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$ACOR that was a technical bounce off the technical bounce which just took place @ 2.55. 
This one simply under the active trigger 2.40-2.50 we hit 2.54 1st last week then 2.55. Meaning 1 deviation and now we hit 2.37 3 of the 2.40 bottom trigger generating polarity to attempt to balance out and eat the spread that took place from the 2.54/2.55 manoeuvre"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$acor pullback near VWAP"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$acor holy grail - volume build"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Stock 2 shares at $2.29 does this mean halt coming on pending news?"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR has one of two drugs for Parkinson’s off periods . Inbrija’s sales will continue to grow.I expect $3 by early summer."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Is Biogen still a partner with Acorda?"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Bought 50% more at 2.20. High risk reward setting"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR Was dabbling with Tradestation&#39;s Option set up and accidentally picked up $3 Calls. Looks like I&#39;m doing a few options too now..."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Undervalued and ignored by the market ... 
but cooking something strong, we will soon see a rise to more than $3"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR I think it’s funny the other day this goes up and everyone is saying this is on scanners but today no one has said anything like that."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR Looks like go time is getting close."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR - Will Acorda Therapeutics price increase in February 2020? Upcoming quarterly earning… https://www.macroaxis.com/invest/market/ACOR--valuation--Acorda-Therapeutics #stocks #earnings"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR IMO only reason shares are being sold is because someone shorted them and trying to give appearance of something that is not."",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ACOR $2.6 high of the day?"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}},{""body"":""$ACOR With securities sold at $2.42/share this is a bargain right?"",""symbols"":[{""symbol"":""ACOR""}],""entities"":{""sentiment"":null}}]"


In [22]:
%sql create table message_extracted (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_filtered (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_exploded (symbol string, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table sentiment_data (sentiment int, body STRING) STORED AS TEXTFILE

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [23]:
%%sql
insert overwrite table message_extracted 
select message.symbols, message.entities.sentiment, message.body from twits 
lateral view explode(messages) messages as message

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [24]:
%%sql
select * from message_extracted limit 5

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""},{""symbol"":""SPY""},{""symbol"":""TSLA""}]",,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?
"[{""symbol"":""AAPL""}]",Bullish,$AAPL
[],,
"[{""symbol"":""Apple Reports 1Q 2020 Results: $22.2B Profit on $91.8B Revenue, Best Quarter Ever.""}]",,
[],,


In [25]:
%%sql
insert overwrite table message_filtered 
select symbols, 
    case sentiment when 'Bearish' then -2 when 'Bullish' then 2 ELSE 0 END as sentiment, 
    body from message_extracted 
    where body is not null

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [26]:
%%sql
select * from message_filtered limit 3

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""MSFT""},{""symbol"":""SPY""},{""symbol"":""TSLA""}]",0,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?
"[{""symbol"":""AAPL""}]",2,$AAPL
"[{""symbol"":""AAPL""}]",2,"$AAPL Congrats, longs. It&#39;s been hard not booking some gains on looooong held shares. I felt a beat coming on, and strong guidance, but you never really know. Are we back to sandbagging guidance? Def. back to growth."


In [27]:
%%sql
insert overwrite table message_exploded 
select symbol.symbol, sentiment, body from message_filtered lateral view explode(symbols) symbols as symbol

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [28]:
%%sql
select * from message_exploded limit 3

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


symbol,sentiment,body
AAPL,0,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?
AMZN,0,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?
MSFT,0,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?


In [29]:
%%sql
insert overwrite table sentiment_data 
select sentiment, body from message_filtered

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [30]:
%%sql
select * from sentiment_data limit 10

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


sentiment,body
0,$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?
2,$AAPL
2,"$AAPL Congrats, longs. It&#39;s been hard not booking some gains on looooong held shares. I felt a beat coming on, and strong guidance, but you never really know. Are we back to sandbagging guidance? Def. back to growth."
0,$AAPL ambulance for the bears.
-2,$AAPL goes up 4 dollars and bulls Cole out of the cave
-2,$AAPL Lackluster services # wasn&#39;t that supposed to be a primary growth driver.
0,$SPY $AAPL $AMZN $TSLA still early but why not have some fun 🤷‍♂️
0,$AAPL futures going very green right now with the info that the epidemic is possibly slowing down. Less new cases today than yesterday. Gonna help push Apple higher.
2,$SPY when I think of bears I always see them still using them flip phones 😂😂🤦‍♂️ $AAPL
-2,$AAPL can’t wait to more


### Creating the JSON file

The processed data is written out as a JSON file.

The data scientists and machine learning engineers responsible for sentiment analysis work from this JSON file.

add jar hdfs:/tmp/brickhouse-0.7.1-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF';

create table json_message (message STRING) STORED AS TEXTFILE;

insert overwrite table json_message
select to_json(named_struct('message_body', body, 'sentiment', sentiment)) from sentiment_data;

select * from json_message;
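
For readers less familiar with Hive, the effect of `to_json(named_struct(...))` on a single row can be imitated in plain Python. This is only an illustrative sketch, not the Brickhouse implementation; the helper name `row_to_json` is hypothetical.

```python
import json

def row_to_json(body, sentiment):
    """Serialize one (body, sentiment) row into the compact layout shown by the Hive query."""
    record = {"message_body": body, "sentiment": sentiment}
    # Compact separators mimic the output format seen in the json_message table.
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))

example = row_to_json("$AAPL", 2)
print(example)
```

Each row of `json_message` is one such string, which is why the later export step only has to join the rows with commas.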

In [31]:
%sql add jar hdfs:/tmp/brickhouse-0.7.1-SNAPSHOT.jar
%sql CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF'

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [32]:
%sql DROP TABLE IF EXISTS json_message
%sql create table json_message (message STRING) STORED AS TEXTFILE

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.
 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [33]:
%%sql
insert overwrite table json_message
select to_json(named_struct('message_body', body, 'sentiment', sentiment)) from sentiment_data

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


[]

In [35]:
%%sql
select * from json_message limit 5

 * hive://user2@ip-10-0-0-55.ap-northeast-1.compute.internal:10000
Done.


message
"{""message_body"":""$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?"",""sentiment"":0}"
"{""message_body"":""$AAPL "",""sentiment"":2}"
"{""message_body"":""$AAPL Congrats, longs. It&#39;s been hard not booking some gains on looooong held shares. I felt a beat coming on, and strong guidance, but you never really know. Are we back to sandbagging guidance? Def. back to growth."",""sentiment"":2}"
"{""message_body"":""$AAPL ambulance for the bears. "",""sentiment"":0}"
"{""message_body"":""$AAPL goes up 4 dollars and bulls Cole out of the cave"",""sentiment"":-2}"


In [38]:
#from __future__ import print_function
from pyspark.sql import SparkSession  # needed for SparkSession.builder below


spark = SparkSession\
    .builder\
    .appName("JsonGen")\
    .getOrCreate()
    
spark.sparkContext.setLogLevel("ERROR")

#json_list = spark.read.table("json_message")
json_list = spark.sql("select * from user2.json_message")

#json_list.show(5)

path = "./output.json"

with open(path, mode='w') as f:
    f.write('{"data":[')
    bool_first_line = True
    for row in json_list.rdd.collect():
        if bool_first_line:
            bool_first_line = False
            f.write(row.message)
        else:
            #print(row.message)
            #f.write(row.message.encode("utf-8"))
            # NOTE: every row after the first is written 100 times, which inflates the data set roughly 100-fold
            for i in range(100):
                f.write(",\n")
                f.write(row.message)
    
    f.write("]}")
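
Assembling the JSON text by hand works, but a misplaced comma or bracket is easy to introduce. As a hedged alternative sketch, the rows can be parsed and written back with the standard `json` module. The list `rows` below is a hypothetical stand-in for the `message` column collected from Spark, and `write_twits` is an illustrative helper, not part of the notebook.

```python
import json
import os
import tempfile

def write_twits(json_rows, path):
    """Parse each JSON string and write them all under a single 'data' key."""
    data = [json.loads(row) for row in json_rows]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"data": data}, f, ensure_ascii=False)
    return len(data)

# Hypothetical stand-in for the rows returned by spark.sql(...).collect()
rows = [
    '{"message_body":"$AAPL ambulance for the bears. ","sentiment":0}',
    '{"message_body":"$AAPL ","sentiment":2}',
]
out_path = os.path.join(tempfile.mkdtemp(), "output.json")
n_written = write_twits(rows, out_path)

# Read the file back to confirm it is valid JSON.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(n_written, loaded["data"][1])
```

Because `json.dump` produces the whole document in one call, the file is valid JSON by construction.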

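## 4. Sentiment Analysis
When deciding the value of a company, it is important to follow the news: for example, a product recall or a natural disaster somewhere in the company's product chain. We want to be able to turn this information into a signal. Currently, the best tool for that job is a neural network.

This project uses posts from the social media site StockTwits. The StockTwits community is used by investors, traders, and entrepreneurs. Each posted message is called a twit, much like a tweet on Twitter. We will build a model around these twits that generates a sentiment score.

A large number of twits were collected and each was labeled with its sentiment. To capture the degree of sentiment, a five-point scale is used: very negative, negative, neutral, positive, and very positive. Each twit is labeled from -2 to 2 in steps of 1, from very negative to very positive. Note that in the data prepared in the previous section the labels come from the Hive `case` expression, so only -2 (Bearish), 0 (no tag), and 2 (Bullish) actually occur. With this labeled data, we will build a sentiment analysis model that learns to assign sentiment to twits on its own.

The first thing to do is load the data.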

### Inspecting the data

The JSON file contains a list of objects, one per twit, in the `'data'` field:

```
{'data': [
  {'message_body': 'Neutral twit body text here',
   'sentiment': 0},
  {'message_body': 'Happy twit body text here',
   'sentiment': 1},
   ...
]}
```

The fields represent the following:

* `'message_body'`: The text of the twit.
* `'sentiment'`: The sentiment score, ranging from -2 to 2 in steps of 1, with 0 being neutral.


Let's see what the data looks like.

In [39]:
#with open(os.path.join('..', '..', 'data', 'project_6_stocktwits', 'twits.json'), 'r') as f:
#with open('./twits_dumped.json', 'r') as f:
with open('./output.json', 'r') as f:
    twits = json.load(f)

print(twits['data'][:10])

[{'message_body': '$SPY $AMZN $MSFT $AAPL $TSLA currently using RH but wanting to switch to either think or swim or Webull. Anyone have preference?', 'sentiment': 0}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}, {'message_body': '$AAPL ', 'sentiment': 2}]


### Checking the data size
Now let's look at the number of twits in the dataset. Print the number of twits below.

In [40]:
"""print out the number of twits"""

# TODO Implement 

print(len(twits['data']))

575901


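### Preprocessing the data
Once the data is in hand, the text needs to be preprocessed. These twits were collected by filtering on ticker symbols, which appear in the twit itself with a leading $ sign. For example,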

{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG',
 'sentiment': 0}

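Ticker symbols carry no sentiment information and appear in every twit, so they should be removed. This twit also contains the username @google, which likewise carries no sentiment information, so it should be removed too. There is also a URL, http://t.co/sptHOAh8. Let's remove those as well.

The easiest way to remove specific words or phrases is with regular expressions from the re module. A given pattern can be substituted with a space: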

re.sub(pattern, ' ', text)
This replaces every match of the pattern in the text with a space. When the text is tokenized later, it splits cleanly on those spaces.
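
As a worked illustration, the substitutions can be applied step by step to the sample twit above. The patterns here are simplified assumptions chosen for readability (`\S+` means "up to the next whitespace"); they are not the exact patterns used in the `preprocess` function of this notebook.

```python
import re

sample = ("RT @google Our annual look at the year in Google blogging "
          "(and beyond) http://t.co/sptHOAh8 $GOOG")

text = sample.lower()
text = re.sub(r"https?://\S+", " ", text)  # URLs
text = re.sub(r"\$\S+", " ", text)         # ticker symbols such as $goog
text = re.sub(r"@\S+", " ", text)          # usernames such as @google
text = re.sub(r"[^a-z]", " ", text)        # anything that is not a lowercase letter
tokens = text.split()
print(tokens)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together before `split()` runs.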

### Split Message Body and Sentiment Score

In [41]:
messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]

### Pre-Processing

In [42]:
nltk.download('wordnet')

def preprocess(message):
    """
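    Take a string as input and perform the following operations:
        - convert all letters to lowercase
        - remove URLs
        - remove ticker symbols
        - remove usernames
        - remove punctuation
        - tokenize by splitting the string on whitespace
        - remove single-character tokens and lemmatize the rest
    
    Parameters
    ----------
        message : The text message to be preprocessed.
        
    Returns
    -------
        tokens: The list of tokens after preprocessing.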
    """ 
    #TODO: Implement 
    
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message (raw strings keep the regex escapes intact)
    text = re.sub(r"http(s)?://([\w\-]+\.)+[\w-]+(/[\w\- ./?%&=]*)?", ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r"\$[^ \t\n\r\f]+", ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r"@[^ \t\n\r\f]+", ' ', text)

    # Replace everything not a letter with a space
    text = re.sub("[^a-z]", ' ', text)
    
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(w, pos='v') for w in tokens if len(w) > 1]
    
    return tokens

[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### Preprocessing the twit messages
Now we can preprocess each of the twits in our dataset. Apply the function `preprocess` to all the twit messages.

Note: this step takes some time, depending on the size of the data.

In [43]:
tokenized = list(map(preprocess, messages))

print(tokenized[:3])
print(len(tokenized))

[['currently', 'use', 'rh', 'but', 'want', 'to', 'switch', 'to', 'either', 'think', 'or', 'swim', 'or', 'webull', 'anyone', 'have', 'preference'], [], []]
575901


### Bag of Words

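Now that all the messages are tokenized, we create a vocabulary and count how often each word appears across the whole corpus. Use [`Counter`](https://docs.python.org/3.1/library/collections.html#collections.Counter) to count up all the tokens.

Note: this step takes some time, depending on the size of the data.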
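
A minimal sketch of the counting step on a toy corpus may help before running it on the full data. The three token lists below are made up for illustration.

```python
from collections import Counter

# Toy stand-in for the `tokenized` list of token lists
toy_tokenized = [["buy", "the", "dip"], ["sell", "the", "rally"], ["buy", "buy"]]

# Flatten the per-message token lists into one list of words
toy_words = [token for tokens in toy_tokenized for token in tokens]
toy_counts = Counter(toy_words)

# Most frequent words first; ids are assigned in that order
toy_sorted = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_sorted)}
top_two = toy_counts.most_common(2)
print(top_two, toy_vocab_to_int)
```

`Counter.most_common(k)` returns the `k` most frequent words with their counts, which is convenient when choosing which very common words to drop.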

In [44]:
from collections import Counter

#words = []
#for tokens in tokenized:
#    for token in tokens:
#        words.append(token)
out_list = tokenized
words = [element for in_list in out_list for element in in_list]

print(words[:13])
print(len(words))

"""
Create a vocabulary by using Bag of words
"""

# TODO: Implement 

word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word:ii for ii, word in int_to_vocab.items()}

bow = []
for tokens in tokenized:
    bow.append([vocab_to_int[token] for token in tokens])

print(len(bow))
print(bow[:3])

# This BOW will not be used because it is not filtered to eliminate common words.

['currently', 'use', 'rh', 'but', 'want', 'to', 'switch', 'to', 'either', 'think', 'or', 'swim', 'or']
7449017
575901
[[712, 561, 3243, 89, 295, 3, 2268, 3, 791, 135, 67, 1801, 67, 3244, 675, 25, 3245], [], []]


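### Frequency of words appearing in messages

Using the vocabulary, we remove some of the most common words, such as "the", "and", and "it".
These words do not help identify sentiment and are so common that they act as noise in the network's input. Excluding them can shorten the network's training time.

We also remove very rare words that are used only a handful of times. Here, each word's count needs to be normalized (the text suggests dividing by the number of messages; the cell below divides by the total number of tokens instead). Words that appear in only a tiny fraction of the corpus are then removed.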
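
The filtering rule can be sketched on a toy corpus as follows. The helper `filter_vocab` and the cutoff values are hypothetical and chosen only so that the effect is visible on ten tokens; like the notebook's cell, the sketch normalizes counts by the total number of tokens.

```python
from collections import Counter

def filter_vocab(tokenized, low_cutoff, high_cutoff):
    """Keep words that are neither too rare nor among the `high_cutoff` most common."""
    words = [token for tokens in tokenized for token in tokens]
    counts = Counter(words)
    total = len(words)
    freqs = {word: count / total for word, count in counts.items()}
    most_common = {word for word, _ in counts.most_common(high_cutoff)}
    return [word for word in freqs
            if freqs[word] > low_cutoff and word not in most_common]

toy = [["the", "stock", "is", "up"],
       ["the", "stock", "is", "down"],
       ["the", "moon"]]
kept = filter_vocab(toy, low_cutoff=0.15, high_cutoff=1)
print(kept)
```

Here "the" is dropped as the single most common word, and "up", "down", and "moon" are dropped because each makes up only 10% of the tokens.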

In [45]:
"""
Set the following variables:
    freqs
    low_cutoff
    high_cutoff
    K_most_common
"""

# TODO Implement 

print("len(sorted_vocab):",len(sorted_vocab))
print("sorted_vocab - top:", sorted_vocab[:3])
print("sorted_vocab - least:", sorted_vocab[-15:])

# Dictionary that contains the frequency of words appearing in messages.
# The key is the token and the value is the frequency of that word in the corpus.
total_count = len(words)
freqs = {word: count/total_count for word, count in word_counts.items()}

#print("freqs[supplication]:",freqs["supplication"] )
print("freqs[the]:",freqs["the"] )

"""
This was the post by Ricardo:

there's no exact value for low_cutoff and high_cutoff, 
however I'd recommend you to use 
a low_cutoff that's around 0.000002 and 0.000007 
(This depends on the values you get from your freqs calculations) and 
a high_cutoff from 5 to 20 (this depends on the most_common values from the bow).
"""

# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 0.000002

# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
"""
example_count = []
example_count.append(sorted_vocab.index("the"))
example_count.append(sorted_vocab.index("for"))
example_count.append(sorted_vocab.index("of"))
print(example_count)
high_cutoff = min(example_count)
"""
high_cutoff = 20
print("high_cutoff:",high_cutoff)
print("low_cutoff:",low_cutoff)

# The k most common words in the corpus. Use `high_cutoff` as the k.
#K_most_common = [word for word in sorted_vocab[:high_cutoff]]
K_most_common = sorted_vocab[:high_cutoff]

print("K_most_common:",K_most_common)


##  END of TODO Implement

filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]

print("len(filtered_words):",len(filtered_words)) 

len(sorted_vocab): 5849
sorted_vocab - top: ['the', 'be', 'of']
sorted_vocab - least: ['reek', 'hahahahaha', 'god', 'rumor', 'otl', 'unless', 'spec', 'mouse', 'supplement', 'issuance', 'institutions', 'ponied', 'emotionally', 'knee', 'jerk']
freqs[the]: 0.027184794987043258
high_cutoff: 20
low_cutoff: 2e-06
K_most_common: ['the', 'be', 'of', 'to', 'amp', 'utm', 'in', 'for', 'and', 'on', 'file', 'form', 'stock', 'share', 'by', 'sec', 'at', 'earn', 'report', 'this']
len(filtered_words): 5829


### Updating the vocabulary by removing filtered words
Create three variables that will help with the vocabulary.

In [46]:
"""
Set the following variables:
    vocab
    id2vocab
    filtered
"""

#TODO Implement

# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word. 
# Ids start at 1 so that 0 stays free for the zero padding added when batching.
vocab = {word: ii for ii, word in enumerate(filtered_words, 1)}
# Reverse of the `vocab` dictionary. The key is word id and value is the word. 
id2vocab = {ii:word for word, ii in vocab.items()}
# tokenized with the words not in `filtered_words` removed.

print("len(tokenized):", len(tokenized))

filtered = [[token for token in tokens if token in vocab] for tokens in tokenized]
print("len(filtered):", len(filtered))
print("tokenized[:1]", tokenized[:1])
print("filtered[:1]",filtered[:1])

len(tokenized): 575901
len(filtered): 575901
tokenized[:1] [['currently', 'use', 'rh', 'but', 'want', 'to', 'switch', 'to', 'either', 'think', 'or', 'swim', 'or', 'webull', 'anyone', 'have', 'preference']]
filtered[:1] [['currently', 'use', 'rh', 'but', 'want', 'switch', 'either', 'think', 'or', 'swim', 'or', 'webull', 'anyone', 'have', 'preference']]


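### Balancing the classes
Let's do a few last preprocessing steps. Looking at how the twits are labeled, about 50% of them are neutral. This means the network would reach 50% accuracy just by guessing 0 every time. To help the network learn properly, the classes need to be balanced; that is, each sentiment score should appear in the data at roughly the same rate.

What we can do here is go through each example and randomly drop twits with neutral sentiment. If we want to go from 50% neutral twits to 20% neutral twits, what should the probability of dropping them be? This is also a good opportunity to remove messages of length 0.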
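
One way to reason about the question: if each neutral twit is kept with probability `p`, the expected neutral share afterwards is `p * n_neutral / (p * n_neutral + n_other)`. Setting this equal to the target share and solving for `p` gives the formula below; for a 20% target it reduces to `n_other / (4 * n_neutral)`. The function name and the counts are hypothetical, used only to check the arithmetic.

```python
def neutral_keep_prob(n_neutral, n_other, target_share=0.2):
    """Probability of keeping a neutral twit so that neutrals end up at `target_share`."""
    return target_share * n_other / ((1.0 - target_share) * n_neutral)

# Toy counts: half of 1000 twits are neutral
p = neutral_keep_prob(n_neutral=500, n_other=500)

# Expected neutral share after dropping neutrals with probability 1 - p
expected_share = p * 500 / (p * 500 + 500)
print(p, expected_share)
```

The drop probability is then `1 - p`. On real data the observed share will differ slightly from the target because the sampling is random and empty messages are removed as well.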

In [47]:
import random

balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 

If you did it correctly, you should see the following result 

In [48]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.21319965473740124

Finally let's convert our tokens into integer ids which we can pass to the network.

In [49]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

In [50]:
import pickle
#from singer import Singer

#singer = Singer('Shanranran')

with open('vocab.pickle', 'wb') as f:
    pickle.dump(vocab, f)

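### Neural network
Now that we have the vocabulary, we can convert the tokens to ids and pass them to the network. Let's define the network.

Here is an overview of the network:

#### Embed -> RNN -> Dense -> Softmax
### Implementing the text classifier
Before building the text classifier, recall the network built in the "Sentiment Analysis with an RNN" exercise (called `SentimentRNN` there and `TextClassifier` here). It consists of three main parts: 1) the init function `__init__`, 2) the forward pass `forward`, and 3) the hidden state initializer `init_hidden`.

This network is very similar to the one built in that exercise, except that the forward pass uses softmax instead of sigmoid. Sigmoid is not used because the output of the network is not binary: the sentiment score has five possible outcomes. Since we are looking for the outcome with the highest probability, softmax is the better fit.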
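
To make the role of the final layer concrete, here is a small pure-Python sketch of log-softmax over five class scores. It is an illustration of the math only, not PyTorch's implementation, and the scores in `logits` are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the log of the summed exponentials."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

# Hypothetical raw scores for the five sentiment classes 0..4
logits = [0.1, -0.4, 0.0, 1.2, 0.3]
logps = log_softmax(logits)

# Exponentiating the log-probabilities gives a distribution that sums to 1
probs = [math.exp(lp) for lp in logps]
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

Unlike five independent sigmoids, the softmax outputs compete with each other and sum to 1, so the largest one can be read directly as the predicted class.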

In [51]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        
        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout = dropout
        
        # TODO Implement

        # Setup embedding layer
        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        
        # Setup additional layers
        self.lstm = nn.LSTM(self.embed_size, self.lstm_size, self.lstm_layers, dropout=self.dropout)
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        
        self.softmax = nn.LogSoftmax(dim=1)


    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        
        Parameters
        ----------
            batch_size : The size of batches.
        
        Returns
        -------
            hidden_state
            
        """
        
        # TODO Implement 
        
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.lstm_layers, batch_size,self.lstm_size).zero_(),
                         weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden


    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.
        
        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.

        """
        
        # TODO Implement 
        batch_size = nn_input.size(1)  # nn_input has shape (sequence_length, batch_size)
        
        embeds = self.embedding(nn_input)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        
        #lstm_out = lstm_out.contiguous().view(-1, self.lstm_size)    
        """
        remember here you do not have batch_first=True, 
        so accordingly shape your input. 
        Moreover, since now input is seq_length x batch you just need to transform lstm_out = lstm_out[-1,:,:].
        you don't have to use batch_first=True in this case, 
        nor reshape the outputs with .view just transform your lstm_out as advised and you should be good to go.
        """
        lstm_out = lstm_out[-1,:,:]
        
        out = self.dropout(lstm_out)
        out = self.fc(out)
        
        logps = self.softmax(out)
        
        
        return logps, hidden_state

### View Model

In [52]:
model = TextClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
hidden = model.init_hidden(4)

logps, _ = model.forward(input, hidden)
print(logps)

tensor([[-1.5954, -1.7832, -1.7476, -1.4568, -1.5055],
        [-1.5747, -1.8081, -1.7448, -1.4557, -1.5095],
        [-1.5760, -1.8013, -1.7465, -1.4589, -1.5085],
        [-1.6398, -1.8044, -1.7004, -1.4428, -1.5027]],
       grad_fn=<LogSoftmaxBackward>)


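### Training
### DataLoader and batching
Next we need to build a generator that can be used to loop over the data. Passing sequences in batches is more efficient. The input tensor has shape (sequence_length, batch_size). So if the sequences are 40 tokens long and 25 sequences are passed at once, the input size is (40, 25).

If the sequence length is set to 40, what happens to messages with more or fewer than 40 tokens? Messages with fewer than 40 tokens are padded with zeros in the empty positions. Be sure to pad on the left so that the RNN sees nothing but padding before it starts processing the data. If a message has 20 tokens, the first 20 positions of the 40-long sequence are 0. If a message has more than 40 tokens, keep the first 40 tokens.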
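
The padding and truncation rule can be sketched for a single message with plain lists. The helper `left_pad` is hypothetical and only mirrors the rule described above; the notebook's dataloader does the same thing on tensors.

```python
def left_pad(token_ids, sequence_length, pad_id=0):
    """Truncate to the first `sequence_length` ids, then pad with `pad_id` on the left."""
    kept = token_ids[:sequence_length]
    return [pad_id] * (sequence_length - len(kept)) + kept

short = left_pad([5, 6, 7], sequence_length=5)
long = left_pad([1, 2, 3, 4, 5, 6], sequence_length=4)
print(short, long)
```

Left padding puts the real tokens at the end of the sequence, so the last LSTM output, which the classifier reads, is computed right after the final real token.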

In [53]:
#def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
def dataloader(messages, labels, sequence_length=20, batch_size=32, shuffle=False):
    """ 
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)

    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        
        yield batch, label_tensor

### Training and Validation
With our data in nice shape, we'll split it into training, validation, and test sets.

In [54]:
"""
Split data into training and validation datasets. Use an appropriate split size.
The features are the `token_ids` and the labels are the `sentiments`.
"""   

# TODO Implement 

split_frac = 0.98 # for small data
#split_frac = 0.8 # for big data

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(token_ids)*split_frac)
train_features, remaining_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, remaining_labels = sentiments[:split_idx], sentiments[split_idx:]

test_idx = int(len(remaining_features)*0.5)
valid_features, test_features = remaining_features[:test_idx], remaining_features[test_idx:]
valid_labels, test_labels = remaining_labels[:test_idx], remaining_labels[test_idx:]
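
The cell above slices the data in its existing order. If the twits happen to be ordered (for example by ticker symbol), shuffling before the split may give more representative validation and test sets. The following is a minimal sketch under that assumption; `shuffled_split` and the toy data are hypothetical and not part of the notebook.

```python
import random

def shuffled_split(features, labels, train_frac=0.8, seed=0):
    """Shuffle indices once, then cut them into train / validation / test parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_train = int(len(indices) * train_frac)
    n_valid = (len(indices) - n_train) // 2
    parts = (indices[:n_train],
             indices[n_train:n_train + n_valid],
             indices[n_train + n_valid:])
    # Features and labels are picked with the same indices so they stay aligned
    return [([features[i] for i in part], [labels[i] for i in part]) for part in parts]

toy_features = [[i, i + 1] for i in range(10)]
toy_labels = [i % 5 for i in range(10)]
train, valid, test = shuffled_split(toy_features, toy_labels, train_frac=0.8)
print(len(train[0]), len(valid[0]), len(test[0]))
```

Fixing the seed keeps the split reproducible between runs.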

In [55]:
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)

### Training
It's time to train the neural network!

In [56]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)

TextClassifier(
  (embedding): Embedding(5830, 1024)
  (lstm): LSTM(1024, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=5, bias=True)
  (softmax): LogSoftmax()
)

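### Running the training

Note: this step needs a considerable amount of time, depending on the size of the data.

When running in a GPU-equipped environment, you can confirm that the GPU is being used by running the following command in a terminal (while the GPU is busy, the Volatile GPU-Util percentage at the top right of the table printed by the command goes up):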
```
$ watch nvidia-smi
```

In [None]:
"""
Train your model with dropout. Make sure to clip your gradients.
Print the training loss, validation loss, and validation accuracy for every 100 steps.
"""
import numpy as np

epochs = 4
batch_size = 512  # an earlier setting of 64 was immediately overridden by this value
learning_rate = 0.001

print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

val_losses = []
accuracy = []

for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    
    steps = 0
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        hidden = model.init_hidden(labels.shape[0])
        
        # Set Device
        text_batch, labels = text_batch.to(device), labels.to(device)
        # Tensor.to() returns a new tensor instead of moving it in place, so rebuild the tuple
        hidden = tuple(each.to(device) for each in hidden)
        
        # TODO Implement: Train Model
        hidden = tuple([each.data for each in hidden])
        model.zero_grad()
        output, hidden = model(text_batch, hidden)
        loss = criterion(output.squeeze(), labels)
        loss.backward()
        clip = 5
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        # Accumulate loss
        val_losses.append(loss.item())
        
        correct_count = 0.0
        if steps % print_every == 0:
            model.eval()
            
            # Calculate accuracy
            ps = torch.exp(output)
            top_p, top_class = ps.topk(1, dim=1)
            #?top_class = top_class.to(device)
            #?labels = labels.to(device)

            correct_count += torch.sum(top_class.squeeze()== labels)
            accuracy.append(100*correct_count/len(labels))
            
            # TODO Implement: Print metrics
            # Note: val_losses accumulates the training loss of every step so far
            print("Epoch: {}/{}...".format(epoch+1, epochs),
                 "Step: {}...".format(steps),
                 "Loss: {:.6f}...".format(loss.item()),
                 "Avg Train Loss: {:.6f}".format(np.mean(val_losses)),
                 "Correct Count: {}".format(correct_count),
                 "Accuracy: {:.2f}".format((100*correct_count/len(labels))),
                 # AttributeError: 'torch.dtype' object has no attribute 'type'
                 #"Accuracy Avg: {:.2f}".format(np.mean(accuracy))
                 )
            
            model.train()

Starting epoch 1


In [24]:
torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

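## Building the Prediction Function
### Prediction 
Now that you have a trained model, try it on new twits and see whether it works properly. Remember that new text must be preprocessed before it is passed to the network. Implement a prediction function (`predict_func` below) that generates a prediction vector from a message.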

In [31]:
import glob
print(glob.glob("/home/cdsw/*"))

import pickle
import re
import nltk
import numpy as np

nltk.download('wordnet')

import torch

#from sentiment import TextClassifier

import os
import sys
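# __file__ is not defined in a notebook, so use the current working directory
cur_dir = os.getcwd()
print(cur_dir)
sys.path.append(cur_dir)

vocab_filename = 'vocab.pickle'
vocab_path = os.path.join(cur_dir, vocab_filename)
# Use a context manager so the file handle is closed after loading
with open(vocab_path, 'rb') as f:
    vocab_l = pickle.load(f)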

#model_path = cur_dir + "/" + "model.torch"
#model_l = torch.load(model_path, map_location='cpu')

model_l = TextClassifier(len(vocab_l)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
checkpoint = torch.load('./checkpoint.pth.tar')
model_l.load_state_dict(checkpoint['state_dict'])


def preprocess(message):
    """
    This function takes a string as input, then performs these operations: 
        - lowercase
        - remove URLs
        - remove ticker symbols 
        - remove StockTwits usernames
        - remove punctuation and any other non-letter characters
        - tokenize by splitting the string on whitespace 
        - drop single-character tokens and lemmatize the rest
    
    Parameters
    ----------
        message : The text message to be preprocessed.
        
    Returns
    -------
        tokens: The preprocessed text into tokens.
    """ 
    
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message
    # (raw strings keep the regex backslashes from being read as string escapes)
    text = re.sub(r"http(s)?://([\w\-]+\.)+[\w-]+(/[\w\- ./?%&=]*)?", ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r"\$[^ \t\n\r\f]+", ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r"@[^ \t\n\r\f]+", ' ', text)

    # Replace everything not a letter with a space
    text = re.sub(r"[^a-z]", ' ', text)
    
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(w, pos='v') for w in tokens if len(w) > 1]
    
    return tokens


class UnknownWordsError(Exception):
    """Raised when a message contains no words that are in the vocabulary."""


def predict_func(text, model, vocab):
    """ 
    Make a prediction on a single sentence.
    Parameters
    ----------
        text : The string to make a prediction on.
        model : The model to use for making the prediction.
        vocab : Dictionary for word to word ids. The key is the word and the value is the word id.
    Returns
    -------
        pred : Prediction vector
    """

    tokens = preprocess(text)    

    # Filter non-vocab words
    tokens = [token for token in tokens if token in vocab]
    # Convert words to ids
    tokens = [vocab[token] for token in tokens]

    # Nothing is left to feed to the network if every word was unknown
    if len(tokens) == 0:
        raise UnknownWordsError

    # Adding a batch dimension: shape (sequence length, 1)
    text_input = torch.LongTensor(tokens).view(-1, 1)

    # Get the NN output       
    batch_size = 1
    hidden = model.init_hidden(batch_size)
    
    logps, _ = model(text_input, hidden)
    # Take the exponent of the NN output (log probabilities) to get a range of 0 to 1 for each label.
    pred = torch.exp(logps) 
    
    return pred





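def predict_api(args):
    text = args.get('text')
    try:
        result = predict_func(text, model_l, vocab_l)
        # Drop the batch dimension and return a plain NumPy vector
        return result.detach().numpy()[0]
    except UnknownWordsError:
        # No known words in the message: fall back to a one-hot "neutral" vector
        return [0,0,1,0,0]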
    

#args = {"text": "Google is working on self driving cars, I'm bullish on $goog"}
#args = {"text": "I'm bullish on $goog"}
args = {"text": "I'll strongly recommend to buy on $goog"}
#args = {"text": "elyoq baoq pquq $goog"}
result = predict_api(args)
print(result)

['/home/cdsw/checkpoint.pth.tar', '/home/cdsw/nlp_handson.ipynb', '/home/cdsw/twits_dumped.json', '/home/cdsw/test.py', '/home/cdsw/data', '/home/cdsw/lib', '/home/cdsw/nlp_solution.ipynb', '/home/cdsw/README.md', '/home/cdsw/model.torch', '/home/cdsw/nltk_data', '/home/cdsw/tables.hql', '/home/cdsw/ticker.txt', '/home/cdsw/vocab.pickle', '/home/cdsw/init.sh', '/home/cdsw/output.json']
/home/cdsw
[ 0.00204649  0.02024106  0.08254681  0.09021453  0.80495113]


[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
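
Inside `predict_func`, tokens that are not in the vocabulary are dropped before the remaining ones are mapped to ids, so a message can end up with no tokens at all. The sketch below shows that step in isolation. The three-word `toy_vocab` and the helper name `tokens_to_ids` are made up for illustration and are not part of the notebook.

```python
def tokens_to_ids(tokens, vocab):
    """Drop out-of-vocabulary tokens and map the rest to their integer ids."""
    return [vocab[token] for token in tokens if token in vocab]

# Made-up vocabulary for illustration only
toy_vocab = {"bullish": 1, "buy": 2, "sell": 3}

# Only "buy" is known, so a single id survives
print(tokens_to_ids(["strongly", "recommend", "buy"], toy_vocab))

# No known words: the result is an empty list
print(tokens_to_ids(["elyoq", "baoq", "pquq"], toy_vocab))
```

The empty list is the case in which `predict_func` raises `UnknownWordsError` and `predict_api` answers with the neutral vector instead of calling the model.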


In [44]:
def predict(text, model, vocab):
    """ 
    Make a prediction on a single sentence.

    Parameters
    ----------
        text : The string to make a prediction on.
        model : The model to use for making the prediction.
        vocab : Dictionary for word to word ids. The key is the word and the value is the word id.

    Returns
    -------
        pred : Prediction vector
    """
    
    # TODO Implement
    tokens = preprocess(text)

    # Filter non-vocab words
    tokens = [token for token in tokens if token in vocab]
    # Convert words to ids
    tokens = [vocab[token] for token in tokens]

    # Adding a batch dimension: shape (sequence length, 1)
    text_input = torch.LongTensor(tokens).view(-1, 1)

    # Get the NN output       
    batch_size = 1
    hidden = model.init_hidden(batch_size)
    
    logps, _ = model(text_input, hidden)
    # Take the exponent of the NN output (log probabilities) to get a range of 0 to 1 for each label.
    pred = torch.exp(logps) 
    
    return pred

In [45]:
text = "Good good good wonderful"
model.eval()
model.to("cpu")
predict(text, model, vocab)

tensor([[0.1989, 0.1649, 0.1976, 0.1587, 0.2799]], grad_fn=<ExpBackward>)

In [46]:
text = "Bad bad bad worst"
model.eval()
model.to("cpu")
predict(text, model, vocab)

tensor([[0.1995, 0.1990, 0.2000, 0.1916, 0.2100]], grad_fn=<ExpBackward>)

In [47]:
text = "Google is working on self driving cars, I'm bullish on $goog"
model.eval()
model.to("cpu")
predict(text, model, vocab)

tensor([[0.2138, 0.1600, 0.2078, 0.1529, 0.2655]], grad_fn=<ExpBackward>)

### Questions: What is the prediction of the model? What is the uncertainty of the prediction?
**TODO: Answer Question**

#### What is the prediction of the model?
The prediction for the text above is very positive: the highest value in the probability list (0.2655) is the last entry, and the list is ordered very negative, negative, neutral, positive, very positive.
#### What is the uncertainty of the prediction?
The uncertainty is high. The top class receives only about 27% of the probability, not far above the 20% that a uniform guess over five classes would give. Adding positive and very positive together gives about 42%, against about 37% for negative and very negative, so the prediction leans positive only weakly.
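
One way to put a number on the uncertainty asked about here is the normalized entropy of the probability vector. The sketch below is a minimal illustration, not part of the original exercise: `normalized_entropy` is a hypothetical helper, and the probabilities are copied from the `In [47]` output above.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a probability vector divided by its maximum, log(len(probs)).

    0 means all probability sits on one class; 1 means a uniform distribution.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

# Probabilities printed by the In [47] cell, ordered
# very negative, negative, neutral, positive, very positive
probs = [0.2138, 0.1600, 0.2078, 0.1529, 0.2655]

# Index of the most likely class
top_index = max(range(len(probs)), key=lambda i: probs[i])
print("top class index:", top_index)
print("normalized entropy: {:.3f}".format(normalized_entropy(probs)))
```

A value close to 1 means the distribution is nearly uniform, so the top class is only weakly preferred over the others.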

Now we have a trained model and we can make predictions. We can use this model to track the sentiments of various stocks by predicting the sentiments of twits as they come in. Suppose we have a stream of twits. For each twit, pull out the stocks mentioned in it and keep track of the sentiments. Remember that in the twits, ticker symbols are written as a dollar sign followed by 2-4 capital letters, like $AAPL. Ideally, you'd want to track the sentiments of the stocks in your universe and use this as a signal in your larger model(s).
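
The ticker convention described here (a dollar sign followed by 2-4 capital letters) can be sketched as a regular expression. This is a minimal illustration: `extract_tickers` is a hypothetical helper, and the trailing `\b` is an added assumption so that a longer symbol is skipped rather than cut down to its first four letters.

```python
import re

# Dollar sign, then 2-4 capital letters, then a word boundary
TICKER_PATTERN = re.compile(r"\$[A-Z]{2,4}\b")

def extract_tickers(message):
    """Return the ticker symbols found in a twit, in order of appearance."""
    return TICKER_PATTERN.findall(message)

print(extract_tickers("$AAPL looks strong, $FB not so much"))
```

Lowercase symbols such as `$goog` do not match this pattern, so a message would need to be scanned before it is lowercased by `preprocess`.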

## Testing
### Load the Data 

In [24]:
with open(os.path.join('..', '..', 'data', 'project_6_stocktwits', 'test_twits.json'), 'r') as f:
    test_data = json.load(f)

### Twit Stream

In [25]:
def twit_stream():
    for twit in test_data['data']:
        yield twit

next(twit_stream())

{'message_body': '$JWN has moved -1.69% on 10-31. Check out the movement and peers at  https://dividendbot.com?s=JWN',
 'timestamp': '2018-11-01T00:00:05Z'}

Let's apply the `predict` function to a stream of twits.

In [26]:
def score_twits(stream, model, vocab, universe):
    """ 
    Given a stream of twits and a universe of tickers, return sentiment scores for tickers in the universe.
    """
    for twit in stream:

        # Get the message text
        text = twit['message_body']
        # Raw string so that \$ is passed to the regex engine unchanged
        symbols = re.findall(r'\$[A-Z]{2,4}', text)
        score = predict(text, model, vocab)

        for symbol in symbols:
            if symbol in universe:
                yield {'symbol': symbol, 'score': score, 'timestamp': twit['timestamp']}

In [27]:
universe = {'$BBRY', '$AAPL', '$AMZN', '$BABA', '$YHOO', '$LQMT', '$FB', '$GOOG', '$BBBY', '$JNUG', '$SBUX', '$MU'}
score_stream = score_twits(twit_stream(), model, vocab, universe)

next(score_stream)

{'symbol': '$AAPL',
 'score': tensor([[ 0.1006,  0.1506,  0.2158,  0.2898,  0.2432]]),
 'timestamp': '2018-11-01T00:00:18Z'}
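
To use these per-twit scores as a signal, they need to be aggregated per symbol. The sketch below shows one possible aggregation under stated assumptions: each probability vector is collapsed into an expected sentiment on a scale from -2 (very negative) to 2 (very positive), and the values are averaged per ticker. The helper names, the weighting scheme, and the sample records are all illustrative and not part of the notebook.

```python
from collections import defaultdict

def expected_sentiment(probs):
    """Collapse a 5-class probability vector into one number in [-2, 2]."""
    weights = [-2, -1, 0, 1, 2]   # very negative ... very positive (assumed scale)
    return sum(w * p for w, p in zip(weights, probs))

def average_by_symbol(scored):
    """Average the expected sentiment of every scored twit per ticker symbol."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in scored:
        totals[item["symbol"]] += expected_sentiment(item["score"])
        counts[item["symbol"]] += 1
    return {symbol: totals[symbol] / counts[symbol] for symbol in totals}

# Hand-written records in the shape yielded by score_twits, with plain lists
# standing in for the tensors and placeholder timestamps
scored = [
    {"symbol": "$AAPL", "score": [0.1, 0.1, 0.2, 0.3, 0.3], "timestamp": "t0"},
    {"symbol": "$AAPL", "score": [0.3, 0.3, 0.2, 0.1, 0.1], "timestamp": "t1"},
    {"symbol": "$FB",   "score": [0.0, 0.0, 0.0, 0.0, 1.0], "timestamp": "t2"},
]
result = average_by_symbol(scored)
print(result)
```

With the real stream, each `score` is a tensor of shape `(1, 5)`, so something like `score[0].tolist()` would be needed to obtain the plain list used here.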

In [34]:
!pip3 freeze > requirements.txt

In [None]:
%sql DROP DATABASE IF EXISTS user1 CASCADE;

That's it. You have successfully built a model for sentiment analysis! 