# Data Science / Machine Learning Meetup #1 Deep Learning Hands-on
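# Alternative Data and Natural Language Processing

## Introduction

The exercise consists of the following steps.
1. [Environment setup](#environment-setup)
1. [Web scraping](#web-scraping)
1. [Sentiment analysis](#sentiment-analysis)
    1. Preprocessing
    1. Building the neural network
    1. Training
    1. Prediction

Please note the following.
- Some of the code to be run contains parts that you need to change to match your own user name.

## Environment setup

### Installing and importing packages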

In [1]:
!pip3 install ipython-sql==0.3.9
!pip3 install PyHive==0.6.1
!pip3 install SQLAlchemy==1.3.13
!pip3 install thrift==0.13.0
!pip3 install sasl==0.2.1
!pip3 install thrift_sasl==0.3.0


!pip3 install nltk==3.4.5
!pip3 install torch==1.4.0

Collecting ipython-sql==0.3.9
  Downloading https://files.pythonhosted.org/packages/ab/df/427e7cf05ffc67e78672ad57dce2436c1e825129033effe6fcaf804d0c60/ipython_sql-0.3.9-py2.py3-none-any.whl
Collecting prettytable (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/ef/30/4b0746848746ed5941f052479e7c23d2b56d174b82f4fd34a25e389831f5/prettytable-0.7.2.tar.bz2
Collecting sqlalchemy>=0.6.7 (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/af/47/35edeb0f86c0b44934c05d961c893e223ef27e79e1f53b5e6f14820ff553/SQLAlchemy-1.3.13.tar.gz (6.0MB)
Collecting sqlparse (from ipython-sql==0.3.9)
  Downloading https://files.pythonhosted.org/packages/ef/53/900f7d2a54557c6a37886585a91336520e5539e3ae2423ff1102daf4f3a7/sqlparse-0.3.0-py2.py3-none-any.whl
Building wheels for collected packages: prettytable, sqlalchemy
  Building wheel for prettytable (setup.py) ...

  Building wheel for nltk (setup.py) ... done
  Stored in directory: /home/cdsw/.cache/pip/wheels/96/86/f6/68ab24c23f207c0077381a5e3904b2815136b879538a24b483
Successfully built nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting torch==1.4.0
  Downloading https://files.pythonhosted.org/packages/24/19/4804aea17cd136f1705a5e98a00618cb8f6ccc375ad8bfa437408e09d058/torch-1.4.0-cp36-cp36m-manylinux1_x86_64.whl (753.4MB)

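PyHive, installed above, is needed not because the Python code imports it directly, but because it is selected as the dialect in the Hive connection string, as in `sqlalchemy.create_engine('hive://<host>:<port>')`. For that reason, a new process has to be started before the package can be used. **After installing, restart the kernel once.** In the process that performed the installation, connecting fails with an error like the following.
`NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hive`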

In [1]:
import json
import os
import random
import re
import subprocess
import glob
import traceback
from datetime import datetime

from pyhive import hive
import sqlalchemy

import sys
#from random import random
from operator import add
from pyspark.sql import SparkSession

import torch
import nltk
from torch import nn, optim
import torch.nn.functional as F

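## Web scraping

This exercise uses an API that is available free of charge, so please be aware that certain usage limits apply.
For example, depending on your usage, you may receive an error message like the following.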

```
{"response":{"status":429},"errors":[{"message":"Rate limit exceeded. Client may not make more than 200 requests an hour."}]}
```
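
Because the API reports a failure inside the JSON body, a saved response can be checked before it is processed any further. The following is a minimal sketch, assuming every response carries a `response.status` field as in the message above; `response_status` and `is_rate_limited` are hypothetical helper names, not part of the API.

```python
import json

# Error body copied from the message shown above.
RATE_LIMIT_BODY = (
    '{"response":{"status":429},"errors":[{"message":"Rate limit exceeded. '
    'Client may not make more than 200 requests an hour."}]}'
)


def response_status(body):
    """Return the status code embedded in a response body, or None.

    Assumes the body is JSON with a top-level "response" object, as in the
    error message above; anything else yields None.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    response = payload.get("response")
    if not isinstance(response, dict):
        return None
    return response.get("status")


def is_rate_limited(body):
    """True when the body reports status 429 (too many requests)."""
    return response_status(body) == 429


print(response_status(RATE_LIMIT_BODY))
print(is_rate_limited(RATE_LIMIT_BODY))
```

A check like this could be used to stop a fetch loop early instead of saving further error responses.
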
First, save the data retrieved from the API as files inside the CDSW project.

The candidate ticker symbols to fetch are defined in `ticker.txt`.

In [1]:
# Read one ticker symbol per line; the context manager closes the file.
with open("ticker.txt") as ticker_file:
    data = ticker_file.readlines()

ticker_list = [i.rstrip('\n') for i in data]

print(len(ticker_list))
print(ticker_list)

2882
['A', 'AA', 'AAL', 'AAN', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAWW', 'AAXN', 'ABBV', 'ABC', 'ABCB', 'ABEO', 'ABG', 'ABM', 'ABMD', 'ABT', 'ABTX', 'ACA', 'ACAD', 'ACCO', 'ACEL', 'ACGL', 'ACHC', 'ACHN', 'ACHV', 'ACIA', 'ACIW', 'ACLS', 'ACM', 'ACN', 'ACNB', 'ACOR', 'ACRS', 'ACRX', 'ACTG', 'ADBE', 'ADES', 'ADI', 'ADM', 'ADMA', 'ADMP', 'ADMS', 'ADP', 'ADPT', 'ADRO', 'ADS', 'ADSK', 'ADSW', 'ADT', 'ADTN', 'ADUS', 'ADVM', 'ADXS', 'AE', 'AEE', 'AEGN', 'AEIS', 'AEL', 'AEM', 'AEMD', 'AEO', 'AEP', 'AERI', 'AES', 'AFG', 'AFI', 'AFL', 'AG', 'AGCO', 'AGEN', 'AGFS', 'AGI', 'AGIO', 'AGLE', 'AGM', 'AGN', 'AGO', 'AGR', 'AGRX', 'AGS', 'AGTC', 'AGX', 'AGYS', 'AHC', 'AHCO', 'AIG', 'AIMC', 'AIMT', 'AIN', 'AIR', 'AIRG', 'AIRT', 'AIT', 'AIZ', 'AJG', 'AJRD', 'AKAM', 'AKBA', 'AKCA', 'AKRO', 'AKRX', 'AKS', 'AL', 'ALB', 'ALCO', 'ALDX', 'ALE', 'ALEC', 'ALG', 'ALGN', 'ALGT', 'ALIM', 'ALK', 'ALKS', 'ALL', 'ALLK', 'ALLO', 'ALLY', 'ALNY', 'ALOT', 'ALPN', 'ALRM', 'ALRN', 'ALSK', 'ALSN', 'ALT', 'ALTR', 'ALV', 'ALXN', 'AM

In [2]:
!mkdir -p ./data

In [5]:
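# Symbols that are always fetched.
symbols = ['BBRY', 'AAPL', 'AMZN', 'BABA', 'YHOO', 'FB', 'GOOG', 'BBBY', 'JNUG', 'SBUX', 'MU']

# Fill the rest of the 200-request budget (the hourly limit quoted above)
# with randomly chosen tickers.
NUM_REQUEST = 200 - len(symbols)

random.seed(12345)
symbols.extend(random.sample(ticker_list, NUM_REQUEST))

# args[3] is a placeholder that receives the request URL for each symbol.
args = ['curl', '-X', 'GET', '']
URL = "https://api.stocktwits.com/api/2/streams/symbol/"

FILE_PATH = "./data/"

# One timestamp for the whole run, so all files share the same suffix.
start_datetime = datetime.now().strftime("%Y%m%d_%H%M")
for symbol in symbols:
    try:
        args[3] = URL + symbol + ".json"
        print(args[3])
        proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

        # Save the raw response body; invalid responses are removed later.
        path = FILE_PATH + symbol + "_" + start_datetime + ".json"
        print(path)
        with open(path, mode='w') as f:
            f.write(proc.stdout.decode("utf8"))
    except Exception:
        # Report the failure and continue with the next symbol.
        traceback.print_exc()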

https://api.stocktwits.com/api/2/streams/symbol/BBRY.json
./data/BBRY_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/AAPL.json
./data/AAPL_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/AMZN.json
./data/AMZN_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/BABA.json
./data/BABA_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/YHOO.json
./data/YHOO_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/FB.json
./data/FB_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/GOOG.json
./data/GOOG_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/BBBY.json
./data/BBBY_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/JNUG.json
./data/JNUG_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/SBUX.json
./data/SBUX_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/MU.json
./data/MU_20200207_0459.json
https://api.stocktwits.com/ap

./data/IMMR_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/ADUS.json
./data/ADUS_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/AR.json
./data/AR_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/ATO.json
./data/ATO_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/NRC.json
./data/NRC_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/BCC.json
./data/BCC_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/MATX.json
./data/MATX_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/CZZ.json
./data/CZZ_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/ADS.json
./data/ADS_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/LFUS.json
./data/LFUS_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/ENVA.json
./data/ENVA_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/WIRE.json
./data/WIRE_20200207_0459.json
http

./data/XNCR_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/WAFD.json
./data/WAFD_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/ATH.json
./data/ATH_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/FTR.json
./data/FTR_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/MYOK.json
./data/MYOK_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/AOS.json
./data/AOS_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/LBY.json
./data/LBY_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/PZZA.json
./data/PZZA_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/RLI.json
./data/RLI_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/SMED.json
./data/SMED_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/CAG.json
./data/CAG_20200207_0459.json
https://api.stocktwits.com/api/2/streams/symbol/TRU.json
./data/TRU_20200207_0459.json
http
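
Because each symbol maps to exactly one output file, a ticker that appears both in the fixed list and in the random sample would be requested twice and would overwrite its own file. The following is a minimal sketch of one way to avoid that while keeping the same request budget. `pick_symbols` and `request_target` are hypothetical helpers, and the short candidate list is made up for illustration.

```python
import random

URL = "https://api.stocktwits.com/api/2/streams/symbol/"
FILE_PATH = "./data/"


def pick_symbols(fixed, candidates, total, seed=12345):
    """Return `fixed` plus random candidates, `total` symbols in all.

    Candidates already present in `fixed` are skipped so that no symbol is
    requested twice. A private Random instance leaves the global state alone.
    """
    seen = set(fixed)
    pool = []
    for symbol in candidates:
        if symbol not in seen:
            seen.add(symbol)
            pool.append(symbol)
    count = max(0, min(total - len(fixed), len(pool)))
    rng = random.Random(seed)
    return list(fixed) + rng.sample(pool, count)


def request_target(symbol, stamp):
    """Build the request URL and the local file path for one symbol."""
    return URL + symbol + ".json", FILE_PATH + symbol + "_" + stamp + ".json"


# Illustrative candidates only; the notebook reads them from ticker.txt.
chosen = pick_symbols(["AAPL", "FB"], ["A", "AA", "AAPL", "ABT", "ACN"], 5)
print(len(chosen), len(set(chosen)))
print(request_target("AAPL", "20200207_0459"))
```

Note that this draws a different sample than the cell above, so the resulting file list would not match the output shown there.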

Remove the files that do not contain a successful response status.

In [6]:
!grep -rlv '{"response":{"status":200}' data
!grep -rlv '{"response":{"status":200}' data | xargs rm

data/INBK_20200207_0459.json
data/CSWI_20200207_0459.json
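
The `grep -lv` filter lists every file that contains at least one line without the pattern, so it relies on each response being stored as a single line. The following sketch performs the same check by parsing the JSON instead, assuming each file holds one response object. `bad_response_files` is a hypothetical helper; it only reports the files and leaves deletion to the caller.

```python
import glob
import json
import os
import tempfile


def bad_response_files(directory):
    """Return paths of *.json files whose response status is not 200.

    Files that are empty or not valid JSON are reported as well.
    """
    bad = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        try:
            with open(path, encoding="utf8") as f:
                payload = json.load(f)
            status = payload["response"]["status"]
        except (ValueError, KeyError, TypeError):
            status = None
        if status != 200:
            bad.append(path)
    return bad


# Demonstration on throwaway files rather than the real ./data directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "OK.json": '{"response":{"status":200},"messages":[]}',
        "LIMIT.json": '{"response":{"status":429},"errors":[]}',
        "EMPTY.json": "",
    }
    for name, body in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
            f.write(body)
    flagged = [os.path.basename(p) for p in bad_response_files(tmp)]

print(flagged)
```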


Next, copy the saved files to HDFS so that they can be processed on the distributed processing environment (the cluster).

In [10]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -mkdir ./twits/

In [35]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -put ./data/* ./twits/

In [38]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -ls ./twits

Found 198 items
-rw-r--r--   3 admin supergroup      40135 2020-02-07 05:31 twits/AAPL_20200207_0459.json
-rw-r--r--   3 admin supergroup      34640 2020-02-07 05:31 twits/ABEO_20200207_0459.json
-rw-r--r--   3 admin supergroup      56001 2020-02-07 05:31 twits/ABG_20200207_0459.json
-rw-r--r--   3 admin supergroup      48370 2020-02-07 05:31 twits/ACIA_20200207_0459.json
-rw-r--r--   3 admin supergroup      53569 2020-02-07 05:31 twits/ADES_20200207_0459.json
-rw-r--r--   3 admin supergroup      34594 2020-02-07 05:31 twits/ADMA_20200207_0459.json
-rw-r--r--   3 admin supergroup      48297 2020-02-07 05:31 twits/ADS_20200207_0459.json
-rw-r--r--   3 admin supergroup      50407 2020-02-07 05:31 twits/ADUS_20200207_0459.json
-rw-r--r--   3 admin supergroup      51851 2020-02-07 05:31 twits/AEE_20200207_0459.json
-rw-r--r--   3 admin supergroup      50550 2020-02-07 05:31 twits/AJG_20200207_0459.json
-rw-r--r--   3 admin supergroup      42060 2020-02-07 05:31 twits/ALGN_202002

-rw-r--r--   3 admin supergroup      44154 2020-02-07 05:31 twits/REZI_20200207_0459.json
-rw-r--r--   3 admin supergroup      48815 2020-02-07 05:31 twits/RLI_20200207_0459.json
-rw-r--r--   3 admin supergroup      47491 2020-02-07 05:31 twits/RMD_20200207_0459.json
-rw-r--r--   3 admin supergroup      48705 2020-02-07 05:31 twits/RNET_20200207_0459.json
-rw-r--r--   3 admin supergroup      42964 2020-02-07 05:31 twits/RNWK_20200207_0459.json
-rw-r--r--   3 admin supergroup      51350 2020-02-07 05:31 twits/SAIC_20200207_0459.json
-rw-r--r--   3 admin supergroup      47672 2020-02-07 05:31 twits/SBSI_20200207_0459.json
-rw-r--r--   3 admin supergroup      37341 2020-02-07 05:31 twits/SBUX_20200207_0459.json
-rw-r--r--   3 admin supergroup      46350 2020-02-07 05:31 twits/SCI_20200207_0459.json
-rw-r--r--   3 admin supergroup      49072 2020-02-07 05:31 twits/SCSC_20200207_0459.json
-rw-r--r--   3 admin supergroup      31558 2020-02-07 05:31 twits/SGBX_20200207_0459.json
-r

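### Data transformation

The data is transformed on the cluster. In CDSW, each user worked in a separate project, but in the cluster environment you need to stay aware of which user you are acting as and which data belongs to you.


The user name with which you access the Hadoop cluster can be checked as follows.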

In [13]:
!echo $HADOOP_USER_NAME

admin
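
Several later cells hard-code `admin`. The following is a small sketch of deriving the per-user values in Python instead, assuming `HADOOP_USER_NAME` is set as in the cell above and that the home directory follows the `/user/<name>` layout used later in this notebook. `hdfs_paths` is a hypothetical helper.

```python
import os


def hdfs_paths(user):
    """Return the per-user HDFS locations used in this exercise."""
    home = "/user/" + user
    return {"home": home, "twits": home + "/twits"}


# Fall back to "admin" (the value shown above) when the variable is unset.
user = os.environ.get("HADOOP_USER_NAME", "admin")
print(hdfs_paths(user))
```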


### Preparing the database



Load the ipython-sql extension so that the `%sql` and `%%sql` magics become available.

In [27]:
%load_ext sql

**In the cell below, replace the user name and the URL (Hive server) with the appropriate values.**

In [29]:
%sql hive://admin@master.ykono.work:10000

'Connected: admin@None'
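
The connection string has the form `hive://<user>@<host>:<port>`. The following sketch assembles it from its parts so that only the user name and host need to change. `hive_url` is a hypothetical helper, and the default port is simply the one used in the cell above.

```python
def hive_url(user, host, port=10000):
    """Assemble a connection string of the form hive://<user>@<host>:<port>."""
    return "hive://{}@{}:{}".format(user, host, port)


print(hive_url("admin", "master.ykono.work"))
```

The resulting string is the same kind of value that `sqlalchemy.create_engine` expects for the Hive dialect.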

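**Create and use a database named after your user name.** If the database already exists, `CREATE DATABASE` fails with an error like the one shown below, which can be ignored here; `CREATE DATABASE IF NOT EXISTS` avoids it.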

In [16]:
%sql CREATE DATABASE admin
%sql USE admin
%sql SHOW TABLES

 * hive://admin@master.ykono.work:10000
(pyhive.exc.OperationalError) TExecuteStatementResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database admin already exists:29:28', 'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329', 'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:258', 'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:293', 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:260', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:505', 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:480', 'sun.reflect.NativeMethodAccessorImpl:invoke0:NativeMethodAccessorImpl.java:-2', 'sun.reflect.NativeMethodAccessorImpl:invoke:Na

tab_name


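### Copying and registering the library files

Register the libraries that make it possible to handle JSON files in Hive queries.
The library files are included in the GitHub repository (see `/lib/README.jar` for details about the libraries).
First copy them from CDSW to HDFS, then register the files on HDFS with Hive.

The repository includes the following precompiled library files.
- json-1.3.7.3.jar
- json-serde-cdh5-shim-1.3.7.3.jar
- json-serde-1.3.7.3.jar
- brickhouse-0.7.1-SNAPSHOT.jar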

In [19]:
!export HADOOP_CONF_DIR=/etc/hadoop/conf; hdfs dfs -put `ls -1 ./lib/*.jar` .; hdfs dfs -ls .

put: `brickhouse-0.7.1-SNAPSHOT.jar': File exists
put: `json-1.3.7.3.jar': File exists
put: `json-serde-1.3.7.3.jar': File exists
put: `json-serde-cdh5-shim-1.3.7.3.jar': File exists
Found 5 items
-rw-r--r--   3 admin supergroup     308146 2020-02-07 05:14 brickhouse-0.7.1-SNAPSHOT.jar
-rw-r--r--   3 admin supergroup      44477 2020-02-07 05:14 json-1.3.7.3.jar
-rw-r--r--   3 admin supergroup      36653 2020-02-07 05:14 json-serde-1.3.7.3.jar
-rw-r--r--   3 admin supergroup       5110 2020-02-07 05:14 json-serde-cdh5-shim-1.3.7.3.jar
drwxr-xr-x   - admin supergroup          0 2020-02-07 05:13 twits


**Replace the user name in the paths below with the appropriate one.**

In [54]:
%sql add jar hdfs:/user/admin/json-1.3.7.3.jar
%sql add jar hdfs:/user/admin/json-serde-1.3.7.3.jar
%sql add jar hdfs:/user/admin/json-serde-cdh5-shim-1.3.7.3.jar
%sql add jar hdfs:/user/admin/brickhouse-0.7.1-SNAPSHOT.jar
%sql CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF'

 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.


[]
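
Since only the user name changes between participants, the `add jar` statements can be generated rather than edited by hand. The following is a minimal sketch, assuming the jars were copied to the user's HDFS home directory as above; `add_jar_statements` is a hypothetical helper that only builds the statement text and does not run it.

```python
JARS = [
    "json-1.3.7.3.jar",
    "json-serde-1.3.7.3.jar",
    "json-serde-cdh5-shim-1.3.7.3.jar",
    "brickhouse-0.7.1-SNAPSHOT.jar",
]


def add_jar_statements(user, jars=JARS):
    """Return one `add jar` statement per library for the given user."""
    return ["add jar hdfs:/user/{}/{}".format(user, jar) for jar in jars]


for statement in add_jar_statements("admin"):
    print(statement)
```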

In [39]:
%sql DROP TABLE IF EXISTS twits
%sql DROP TABLE IF EXISTS message_extracted
%sql DROP TABLE IF EXISTS message_filtered
%sql DROP TABLE IF EXISTS message_exploded
%sql DROP TABLE IF EXISTS sentiment_data

 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.


[]

Create a table by specifying the location where the social media message files are stored.

**For `LOCATION`, specify the path to which you uploaded the files.**

In [40]:
%%sql
CREATE EXTERNAL TABLE twits (
	messages 
	ARRAY<
	    STRUCT<body: STRING,
	        symbols:ARRAY<STRUCT<symbol:STRING>>,
	        entities:STRUCT<sentiment:STRUCT<basic:STRING>>
	    >
	>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS TEXTFILE
LOCATION '/user/admin/twits'

 * hive://admin@master.ykono.work:10000
Done.


[]
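
The table definition keeps only three fields of each message: `body`, the `symbol` of each entry in `symbols`, and `entities.sentiment.basic`. The following sketch shows the same mapping in plain Python, assuming the response layout seen in the saved files. `extract` is a hypothetical helper, and the two sample messages are abridged from the API output.

```python
import json

# Two messages abridged from the API output; only the fields named in the
# table definition are kept.
SAMPLE = json.dumps({
    "response": {"status": 200},
    "messages": [
        {
            "body": "$AAPL Less than three dollars from making new ATH",
            "symbols": [{"symbol": "AAPL"}],
            "entities": {"sentiment": {"basic": "Bullish"}},
        },
        {
            "body": "$AAPL the 24 provinces not working account for 80% of national GDP",
            "symbols": [{"symbol": "AAPL"}],
            "entities": {"sentiment": None},
        },
    ],
})


def extract(record):
    """Yield (body, [symbols], sentiment) for each message in one record.

    Mirrors the three fields declared in the table. A missing or null
    sentiment becomes None.
    """
    for message in record.get("messages") or []:
        entities = message.get("entities") or {}
        sentiment = (entities.get("sentiment") or {}).get("basic")
        symbols = [s["symbol"] for s in message.get("symbols") or []]
        yield message.get("body"), symbols, sentiment


rows = list(extract(json.loads(SAMPLE)))
for row in rows:
    print(row)
```

Messages without a sentiment label come through as `None`, which corresponds to the `null` values in the raw data.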

In [41]:
%%sql
select count(*) from twits

 * hive://admin@master.ykono.work:10000
Done.


_c0
198


In [42]:
%%sql
select * from twits limit 3

 * hive://admin@master.ykono.work:10000
Done.


messages
"[{""body"":""$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$EROS SEEING IS BELIEVING - EROSNOW IS NOW ON APPLE TV+ AS OF TODAY! \n \nThanks @InvesThor / Waleed. You beat me to this find &amp; I thank you. For those of you that are new to Eros or just may not understand the significance of this then let me tell you. \n \nI&#39;ve been checking the Apple plus app every few days for the past few months based on information disclosed by Eros in the last quarterly conference call. They had said that something was cooking with Apple $AAPL but there has been no press release yet. \n \nEros mentioned again in las Vegas last month that an Apple business deal was still in the works. \n \nAs we all know in life, people say things but it does not always happen. Well it has happened. It is here. Expect pre paid revenues for Eros off the charts immediately. \n \nExpect a press release by tomorrow or Monday am. \n \nExpect mainstream mutual funds to jump in. \n \nExpect Eros SP to gap back up to at least $8 &amp; then some. \n \nBuckle up folks, this is HUUGE! \nGLTL!!! 
Go $EROS"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""EROS""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""How I Plan To Trade Tesla Stock https://youtu.be/RLXL4V6J6RE $TSLA 📈 \n\nHolding $AAPL and $BYND"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""TSLA""},{""symbol"":""BYND""}],""entities"":{""sentiment"":null}},{""body"":""$GOOG $GOOGL $MSFT $AAPL $AMZN \nT-Birds Performance!"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""GOOG""},{""symbol"":""MSFT""},{""symbol"":""GOOGL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL Does anyone know what date you need to own shares by in order to get the dividend?"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \nWE ALL HAVE A RIGHT TO SAY WHAT WE THINK ABOUT A SPECIFIC STOCK!\nNO ONE IS DECLARING OR EVEN ASSUMING!\nIT JUST WOULD NOT SHOCK ME IF SKYROCKETS TOMORROW!!!\nSIMPLE LOGIC \nIT BUGS ME THAT SIMPLE LOGIC BUGS PEOPLE"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL $400 by August... seems about right! 
Undervalued at these levels and at the $400 price target!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL \nFOR THE RECORD THERE IS SOMETHING CALLED PRICE TARGET \nNOT SURE WHAT THAT IS???"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL Less than three dollars from making new ATH"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL \nPEOPLE WHO HEAR BOMBSHELLS LIKE THIS AND THINK APPLE WILL GO BEARISH IS A LOT MORE DELUSIONAL THAN ONE WHO HAS HIGH HOPES FOR THE STOCK!\nGOOD TRY!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL $AMZN $ENPH $TSLA \nAvailable on Amazon"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""TSLA""},{""symbol"":""ENPH""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$AAPL \n\nARE YOU FREAKIN KIDDING?\n💣💣💣💣💣💣💣💣💣💣💣💣💣💣🐻🐻🐻🐻🐻🐻🐻🐻🐻🐻🐻🐻🐻🐻🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥\n\nAPPLE STOCK IS NOT WORTH $325.50\nWITH THIS 2 BOMBSHELLS APPLE MAY SEE ITS MOST GLORIOUS DAY TOMORROW.\nBOMBSHELL 1-I PHONE 12 LEAKED \nBOMBSHELL 2-AND CHINA CUTTING TARIFFS IS OFFICIAL!\n\n340- 400 would not surprise me at all!!"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL [Feb-07 310.00 Calls] Option volume Up +113.96 % | Volume: 21,118 vs 9,870 https://www.sleekoptions.com/sleekscan.aspx?sub1=dsc"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""Peak profit for the last 6 expired option alerts for $AAPL 11.86 | -18.08 | 529.17 | 3.02 | 354.20 | 402.11 |"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL we gold for tomorrow I need another property 
🐋🧐🌎💰"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$SPY $TSLA $AAPL $AMZN \n\nI’m sorry, this just really bugs me. \n\nAnyone who says that a stock “will be at (stated price)” is out of their minds and delusional. \nYou cannot, I repeat, YOU CANNOT say where a security on the short term will be. You’re just telling everyone what you hope it will be based on your position."",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""AMZN""},{""symbol"":""SPY""},{""symbol"":""TSLA""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL - Exit Apple and check out… http://dlvr.it/RPY5vL #portfolio_prospective #better_portfolio #diversify"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$SPY $MCD $AAPL Why stonks only up since 1993\n\nEveryone knows this"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""MCD""},{""symbol"":""SPY""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL half of China is shut down. Schools are shut to mid February. I wonder why schools are closed longer than work?"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL the 24 provinces not working account for 80% of national GDP"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$SPY China in a pandemic? Will they be able to buy the new iPhone? $AAPL so ex dividend is tomorrow? Long term holders shouldn&#39;t worry but option traders will have some fun tomorrow 😊"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \nREAD HEADLINE OF FORBES CAREFULLY \n“ACCIDENTALLY LEAKED” (HA, GOOD DECOY)\nTHIS A ROCKET 🚀 TO THE MOON \n\nEDITORS&#39; PICK|9,012 views|Feb 6, 2020,8:00 pm\nApple “Accidentally Leaks” Radical iPhone Upgrade"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \n\nAnd the potential here is mind boggling. 
Not just for personal car ownership (where you can already find similar technology in apps from Tesla and others) but as a universal Apple car key across multiple brands for car hire, car sharing and more."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \n\nStrings of code inside iOS 13.4 explain that CarKey will work just like Apple Pay with a user authenticating via biometrics then holding their iPhone / Apple Watch to a reader in the car"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \n\nPicked up by 9to5Mac, Apple accidentally left code in its newly released iOS 13.4 beta for ‘CarKey’, an unannounced all-new service which has the potential to transform the automotive landscape by enabling iPhone and Apple Watch owners to use their devices as digital car keys."",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":null}},{""body"":""$AAPL \nWOW WOW WOW 🤩 \nCALL ME CRAZY, OR OUT OF MY MIND \n\nAPPLE SHOULD GO UP 5-15% Tomorrow \nMAJOR NEWS COMING\nAPPLE LEAKED INFO ON I PHONE 12!!!\nTHIS INFO IS MIND BLOWING ON WHAT TYPE OF PHONE THIS WILL BE:\nFORBES FIRST TO REPORT LEAK!\nALL THE NEWS WILL BE TALKING ABOUT TOMORROW IS THE I PHONE 12!\n\nhttps://www.forbes.com/sites/gordonkelly/2020/02/06/apple-iphone-2020-ios-134-carkey-upgrade-iphone-11-pro-max-update/amp/"",""symbols"":[{""symbol"":""AAPL""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$AAPL $SPY $ITOT Where is my boy ShortyMcShortDick at tonight? Markets red in looking for some emoji overkill"",""symbols"":[{""symbol"":""AAPL""},{""symbol"":""SPY""},{""symbol"":""ITOT""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}}]"
"[{""body"":""$ABEO wow not that much activity here.. usually means rocket lovers are selling 👍"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""Integrated Core Strategies (us) Llc has filed an amended 13G/A, reporting 4.1% ownership in $ABEO - https://fintel.io/so/us/abeo?utm_source=stocktwits.com&amp;utm_medium=social&amp;utm_campaign=owner"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO Form SC 13G/A (statement of acquisition of beneficial ownership by individuals) filed with the SEC \n\nhttps://newsfilter.io/a/55b43bc8126ab9934aad553997957c67"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO cranking back up"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO power our run will be fun"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Halted?"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Next week will be interesting!"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Short sale volume (not short interest) for $ABEO on 2020-02-05 is 50%. http://shortvolumes.com/?t=ABEO via @shortvolumes"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$VBIV $abeo"",""symbols"":[{""symbol"":""ABEO""},{""symbol"":""VBIV""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Stochastic shows plenty of headroom. It&#39;ll keep climbing. Pretty much a foregone conclusion, as you can see"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO we drinking la croix tonight...had. A pretty big position here. 
Big daddy gains this morning"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Great volume and nice end of the day close"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO run eod, lets go"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO It&#39;s just getting started. The stochastic buy signal first appeared yesterday. Someone must&#39;ve forgotten to tell that Olivergarden bashtard, so he&#39;s back for another ass whooping"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Oliverrr? Oliver where are you"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO pT?"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO lets break 260 and its good to go"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO Welcome Back Oliver, I hope you changed your undies 🤣"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO 2.20s to 2.70s, those guys wont hold this. Just here to scam and swing\n\nDont jump this as no news is expected for a long time\n\nProtect your cost basis 💥"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$ABEO oh no I see scammers here 💩"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bearish""}}},{""body"":""$ABEO Shix, why the huge pullback? 
Break 2.80, you pig!!"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$MGEN \n$SVRA \n$ABEO \n$ABUS"",""symbols"":[{""symbol"":""SVRA""},{""symbol"":""ABUS""},{""symbol"":""MGEN""},{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO nice morning runs on both ABEO and $CETX. Profit in the morning buy back mid day. Best part is you won’t have to use a daytrade!"",""symbols"":[{""symbol"":""ABEO""},{""symbol"":""CETX""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO this stock is between 2-3 bucks for months.. if u swing here, it js better than holding for months...\n\nI am long so i hold but if u want to swing here, u want to sell on every pop..."",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO let’s move this past 3"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO daily reminder than I’m just a pumper 👇👇👇. Definitely don’t buy $ACHV $XBIO $ACST or any of my other bio picks as I have no clue what I’m doing."",""symbols"":[{""symbol"":""ACHV""},{""symbol"":""ACST""},{""symbol"":""ABEO""},{""symbol"":""XBIO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO on the move today!"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO i even gave the price watches ;) cant do more . congrats to all my friends who banked on that call with me. 2.17 - 2.75 so far ;)."",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}},{""body"":""$ABEO absolutely beautiful. I am behind baby 100%"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""$ABEO 😬👇🏼"",""symbols"":[{""symbol"":""ABEO""}],""entities"":{""sentiment"":null}}]"
"[{""body"":""$ABG reported 9 new insider trades to the SEC in the last 2 minutes.\n\n2,222 shares acquired by Stax William Frederick (CAO &amp; Interim PFO) https://newsfilter.io/articles/4-form-2bd9ea66af5dffa97255d109ffdd8e86\n$55,387.14 of shares sold by James Juanita T (Director) https://newsfilter.io/articles/4-form-60e7ba95315d3ae158458e8efa1e9863\n$250,169.01 of shares sold by Hult David W (President &amp; CEO) https://newsfilter.io/articles/4-form-3704a40c36d2a6df040a89b09fb87ecc\n$55,387.14 of shares sold by Ryan Berman Bridget (Director) https://newsfilter.io/articles/4-form-371b6d2f56d167110fa5ce899a0b76fa\n$58,065.62 of shares sold by Alsfine Joel (Director) https://newsfilter.io/articles/4-form-12ae0276c06abb1d01d247617120dd11\n4,286 shares acquired by Milstein Jed (SVP &amp; CHRO) https://newsfilter.io/articles/4-form-fd08e9c74c01038f1bbdf2048637a6b3\n1,411 shares acquired by Deloach Thomas C Jr (Director) https://newsfilter.io/articles/4-form-7956b6f97d2d28c78017277dd2ab195f\n1,411 shares acquired by Maritz Philip F (Director) https://newsfilter.io/articles/4-form-c51c5e658b77854a6f9b9f5ca6a0ce5d\n$65,986.05 of shares sold by Villasana George A (SVP, GC &amp; Secretary) https://newsfilter.io/articles/4-form-a130909e55462da17528d0a3a4ea0aff"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: SVP, GC &amp; Secretary Villasana George A: \nDelivered securities 685 of Common Stock at price $96.33 and Acquired 7,16 https://s.flashalert.me/fQwirI"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: CAO &amp; Interim PFO Stax William Frederick: \nGranted 2,222 of Common Stock at price $0 on 2020-02-04, increased holdi https://s.flashalert.me/Ek4nfD"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: SVP &amp; CHRO Milstein Jed: \nGranted 4,286 of Common Stock at price $0 on 2020-02-04, increased 
holding by 41% to 14,6 https://s.flashalert.me/rXRnu"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: SVP, Operations Clara Daniel: \nGranted 2,091 of Common Stock at price $0 on 2020-02-04, increased holding by 12% to https://s.flashalert.me/OHtFO"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: President &amp; CEO Hult David W: \nDelivered securities 2,597 of Common Stock at price $96.33 and Acquired 20,072 of Co https://s.flashalert.me/WsaB0P"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director Morrison Maureen F: \nGranted 1,411 of Common Stock at price $0 on 2020-02-04, increased holding by 116% to https://s.flashalert.me/FEdTfa"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director RYAN BERMAN BRIDGET: \nDelivered securities 579 of Common Stock at price $95.66 and Granted 1,411 of Common https://s.flashalert.me/kdafDP"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""SVP of Asbury Automotive Group Inc just declared owning 16,904 shares of Asbury Automotive Grou http://www.conferencecalltranscripts.org/5/summary2/?id=7386787 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director MARITZ PHILIP F: \nGranted 1,411 of Common Stock at price $0 on 2020-02-04, increased holding by 16% to 10, https://s.flashalert.me/XE3zHl"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director Katz Eugene S: \nDelivered securities 494 of Common Stock at price $95.66 and Granted 1,411 of Common Stock https://s.flashalert.me/JkG1jk"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director JAMES JUANITA T: \nDelivered securities 579 of 
Common Stock at price $95.66 and Granted 1,411 of Common Sto https://s.flashalert.me/uKU5E"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director DELOACH THOMAS C JR: \nGranted 1,411 of Common Stock at price $0 on 2020-02-04, increased holding by 9% to https://s.flashalert.me/fLD4TG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director ALSFINE JOEL: \nDelivered securities 607 of Common Stock at price $95.66 and Granted 1,411 of Common Stock https://s.flashalert.me/epHIa0"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 4: Director REDDIN THOMAS: \nGranted 1,411 of Common Stock at price $0 on 2020-02-04, increased holding by 22% to 7,777 https://s.flashalert.me/T1iEF3"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG 1,411 shares acquired by Reddin Thomas (Director), reported in a new form 4 filed with the SEC \n\nhttps://newsfilter.io/a/4ffbd7ee2bb2d297ec4c0e39cbc52fd7"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG filed SEC form 3: SVP, Operations Clara Daniel: \n https://s.flashalert.me/RmaSRk"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Form 3 (initial statement of beneficial ownership of securities) filed with the SEC \n\nhttps://newsfilter.io/a/ef55703cb4d3295b0d15ca6101387ae7"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG: Issued Press Release on February 05, 18:26:00: Asbury Automotive Group Announces Pricing Of Its Private Offering Of Senior Notes Due https://s.flashalert.me/l92TH"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG: Asbury Automotive Group Announces Pricing Of Its Private Offering Of Senior Notes Due 2028 And Senior Notes ... 
https://www.chartmill.com/news/ABG/prnews-2020-2-5-asbury-automotive-group-announces-pricing-of-its-private-offering-of-senior-notes-due-2028-and-senior-notes-due-2030?utm_source=stocktwits&amp;utm_medium=pressRelease&amp;utm_content=ABG&amp;utm_campaign=social_tracking"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Asbury Automotive Group Announces Pricing Of Its Private Offering Of Senior Notes Due 2028 And Senior Notes Due 2030 \n\nhttps://newsfilter.io/a/400da2b9f9d1b113f980346903bcb4ef"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Asbury Automotive Group&#39;s PT cut by SunTrust Banks, Inc. to . hold rating. https://www.marketbeat.com/r/1341215 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""BULLISH NEWS FOR $ABG\n\nhttps://finance.yahoo.com/news/know-asbury-automotive-abg-rating-170005382.html"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""A review of $ABG ’s past year earnings was reported yesterday.The earnings reported by a firm often differ from its cash flows. This distinction, referred to as &quot;accruals&quot;, speaks to the firm’s estimation of profits not yet received. see http://www.financial-education-hub.com/1/ABG.However, managers may make mistakes intentionally or unintentionally in the estimation, making the realization of accruals inconsistent with cash flows.Based on this, whereas two firms report the same earnings, the one with higher accruals would perform worse next period."",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""A new $ABG report about financial statements and exhibits and other topics just got released by #SEC. Read it now https://wallmine.com/filing/redirect/12101180?utm_source=stocktwits"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Asbury (ABG) announces earnings. $2.53 EPS. 
Beats estimates. 45.37M earnings. $ABG https://www.tipranks.com/stocks/ABG/earnings-calendar?ref=TREarnings"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""$ABG Craig-Hallum Downgrades to Hold : PT $110.00 https://stockhoot.com/ExtSymbol.aspx?from=AnalystRatingTweet&amp;symbol=ABG&amp;t=66&amp;Social=StockTwits"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""Asbury Automotive Group downgraded by Craig Hallum to hold. https://www.marketbeat.com/r/1341017 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}},{""body"":""What do you think of this? $ABG RSI Indicator left the oversold zone. View odds of uptrend. https://tickeron.com/go/1204531"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":{""basic"":""Bullish""}}},{""body"":""Asbury Automotive Group announces earnings. $2.53 EPS. Beats estimates. $1.89b revenue. https://www.marketbeat.com/s/441917 $ABG"",""symbols"":[{""symbol"":""ABG""}],""entities"":{""sentiment"":null}}]"


Create the tables used for the data transformation steps.

In [43]:
%sql create table message_extracted (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_filtered (symbols array<struct<symbol:string>>, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table message_exploded (symbol string, sentiment STRING, body STRING) STORED AS TEXTFILE
%sql create table sentiment_data (sentiment int, body STRING) STORED AS TEXTFILE

 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.


[]

Extract only the fields we need from the raw data.

In [44]:
%%sql
insert overwrite table message_extracted 
select message.symbols, message.entities.sentiment, message.body from twits 
lateral view explode(messages) messages as message

 * hive://admin@master.ykono.work:10000
Done.


[]
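
The `lateral view explode(messages)` query above turns each element of the `messages` array into a row of its own. A rough plain-Python equivalent is sketched below, assuming a document shaped like the raw JSON shown earlier. The `raw` sample and the `extract_rows` helper are illustrative names, not part of the notebook.

```python
import json

# Hypothetical sample with the same shape as the raw StockTwits JSON above.
raw = json.loads("""
{"messages": [
  {"body": "BULLISH NEWS FOR $ABG",
   "symbols": [{"symbol": "ABG"}],
   "entities": {"sentiment": {"basic": "Bullish"}}},
  {"body": "$ABG filed SEC form 3",
   "symbols": [{"symbol": "ABG"}],
   "entities": {"sentiment": null}}
]}
""")

def extract_rows(doc):
    """Yield one (symbols, sentiment, body) tuple per message."""
    for message in doc["messages"]:
        sentiment = message["entities"]["sentiment"]
        # Unlabeled messages carry a null sentiment.
        label = sentiment["basic"] if sentiment else None
        yield message["symbols"], label, message["body"]

rows = list(extract_rows(raw))
```

Each tuple corresponds to one row of `message_extracted`; messages without a label keep `None` as their sentiment.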

In [45]:
%%sql
select * from message_extracted limit 5

 * hive://admin@master.ykono.work:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""}]",,$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL
"[{""symbol"":""AAPL""}]",Bullish,$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow
"[{""symbol"":""AAPL""}]",,"In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"
"[{""symbol"":""AAPL""},{""symbol"":""EROS""}]",Bullish,$EROS SEEING IS BELIEVING - EROSNOW IS NOW ON APPLE TV+ AS OF TODAY!
"[{""symbol"":"" ""}]",,


In [46]:
%%sql
select count(*) from message_extracted

 * hive://admin@master.ykono.work:10000
Done.


_c0
9036


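Keep only the records that contain a message body. At the same time, convert the sentiment label from a string to a number (`Bearish` becomes -2, `Bullish` becomes 2, and anything else becomes 0).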

In [47]:
%%sql
insert overwrite table message_filtered 
select symbols, 
    case sentiment when 'Bearish' then -2 when 'Bullish' then 2 ELSE 0 END as sentiment, 
    body from message_extracted 
    where body is not null

 * hive://admin@master.ykono.work:10000
Done.


[]
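
The `case ... when` mapping and the `where body is not null` filter in the query above can be mirrored in a few lines of Python. This is only a sketch on made-up rows; `SENTIMENT_TO_SCORE` and `filter_and_score` are hypothetical names.

```python
# Same mapping as the CASE expression: unknown or missing labels become 0.
SENTIMENT_TO_SCORE = {"Bearish": -2, "Bullish": 2}

def filter_and_score(rows):
    """Keep rows that have a body and replace the label with a number."""
    return [
        (symbols, SENTIMENT_TO_SCORE.get(sentiment, 0), body)
        for symbols, sentiment, body in rows
        if body is not None
    ]

sample = [
    (["AAPL"], "Bullish", "nice squeeze on the daily"),
    (["AAPL"], None, "trading in a wide range"),
    (["TSLA"], "Bearish", "overbought"),
    ([" "], None, None),  # no body: dropped by the filter
]
scored = filter_and_score(sample)
```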

In [48]:
%%sql
select * from message_filtered limit 3

 * hive://admin@master.ykono.work:10000
Done.


symbols,sentiment,body
"[{""symbol"":""AAPL""}]",0,$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL
"[{""symbol"":""AAPL""}]",2,$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow
"[{""symbol"":""AAPL""}]",0,"In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"


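A single message can be tagged with multiple ticker symbols. To normalize the data, transform it so that each row carries exactly one symbol (this produces multiple rows with the same message body).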

In [49]:
%%sql
insert overwrite table message_exploded 
select symbol.symbol, sentiment, body from message_filtered lateral view explode(symbols) symbols as symbol

 * hive://admin@master.ykono.work:10000
Done.


[]

In [50]:
%%sql
select * from message_exploded limit 3

 * hive://admin@master.ykono.work:10000
Done.


symbol,sentiment,body
AAPL,0,$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL
AAPL,2,$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow
AAPL,0,"In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"
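
As a sketch of what `explode(symbols)` does, the following plain-Python function emits one row per symbol attached to a message. The sample rows are made up for illustration and `explode_symbols` is a hypothetical helper.

```python
def explode_symbols(rows):
    """Emit one (symbol, sentiment, body) row per symbol attached to a message."""
    return [
        (entry["symbol"], sentiment, body)
        for symbols, sentiment, body in rows
        for entry in symbols
    ]

sample = [
    ([{"symbol": "AAPL"}, {"symbol": "EROS"}], 2, "EROSNOW IS NOW ON APPLE TV+"),
    ([{"symbol": "AAPL"}], 0, "trading in a wide range"),
]
exploded = explode_symbols(sample)
```

The first message carries two symbols, so it appears twice in `exploded`, once for each symbol.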


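At this point, the original nested data has been transformed into a format in which each record holds a symbol, a sentiment, and a message body.
Use this table for analyses such as counting sentiments per symbol.

The sentiment analysis that follows builds a predictive model that infers the sentiment from the message text alone. Because the symbol is not needed for that, extract only the sentiment and the message body.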

In [51]:
%%sql
insert overwrite table sentiment_data 
select sentiment, body from message_filtered

 * hive://admin@master.ykono.work:10000
Done.


[]

In [52]:
%%sql
select * from sentiment_data limit 10

 * hive://admin@master.ykono.work:10000
Done.


sentiment,body
0,$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL
2,$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow
0,"In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"
2,$EROS SEEING IS BELIEVING - EROSNOW IS NOW ON APPLE TV+ AS OF TODAY!
0,How I Plan To Trade Tesla Stock https://youtu.be/RLXL4V6J6RE $TSLA 📈
2,$GOOG $GOOGL $MSFT $AAPL $AMZN
0,$AAPL Does anyone know what date you need to own shares by in order to get the dividend?
2,$AAPL
2,$AAPL $400 by August... seems about right! Undervalued at these levels and at the $400 price target!
0,$AAPL


### Creating the JSON file

Export the processed data as a JSON file.

The data scientists and machine learning engineers responsible for the sentiment analysis work from this JSON file.

In [55]:
%sql DROP TABLE IF EXISTS json_message
%sql create table json_message (message STRING) STORED AS TEXTFILE

 * hive://admin@master.ykono.work:10000
Done.
 * hive://admin@master.ykono.work:10000
Done.


[]

In [56]:
%%sql
insert overwrite table json_message
select to_json(named_struct('message_body', body, 'sentiment', sentiment)) from sentiment_data

 * hive://admin@master.ykono.work:10000
Done.


[]
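
For comparison, the record format produced by the query above can be reproduced with Python's standard `json` module. This is a minimal sketch; `to_json_message` is a hypothetical helper, and the compact separators are chosen only to match the output shown in this notebook.

```python
import json

def to_json_message(sentiment, body):
    """Serialize one record with the same two keys the Hive query emits."""
    return json.dumps({"message_body": body, "sentiment": sentiment},
                      separators=(",", ":"))

line = to_json_message(2, "$AAPL nice squeeze on the daily.")
```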

In [57]:
%%sql
select * from json_message limit 5

 * hive://admin@master.ykono.work:10000
Done.


message
"{""message_body"":""$AAPL had approximately 2395M USD go to the short side at 52 pct short Bears and Bulls are fighting close https://www.algowins.com/?wdt_column_filter[1]=AAPL"",""sentiment"":0}"
"{""message_body"":""$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow"",""sentiment"":2}"
"{""message_body"":""In the last month $AAPL has a been trading in the 302.22 - 327.85 range, which is quite wide. https://www.chartmill.com/stock/quote/AAPL/technical-analysis?key=bb853040-a4ac-41c6-b549-d218d2f21b32&amp;utm_source=stocktwits&amp;utm_medium=TA&amp;utm_content=AAPL&amp;utm_campaign=social_tracking"",""sentiment"":0}"
"{""message_body"":""$EROS SEEING IS BELIEVING - EROSNOW IS NOW ON APPLE TV+ AS OF TODAY! "",""sentiment"":2}"
"{""message_body"":""How I Plan To Trade Tesla Stock https://youtu.be/RLXL4V6J6RE $TSLA 📈 "",""sentiment"":0}"


**Set `HQL_SELECT_MESSAGE` so that it points at the database you created.**

In [58]:
#from __future__ import print_function

HQL_SELECT_MESSAGE = "select * from admin.json_message"

spark = SparkSession\
    .builder\
    .appName("JsonGen")\
    .getOrCreate()
    
spark.sparkContext.setLogLevel("ERROR")

json_list = spark.sql(HQL_SELECT_MESSAGE)

path = "./output.json"

with open(path, mode='w') as f:
    f.write('{"data":[')
    bool_first_line = True
    for row in json_list.rdd.collect():
        if bool_first_line:
            bool_first_line = False
            f.write(row.message)
        else:
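            # This is not very elegant, but to simulate deep learning on a reasonably
            # large dataset, the same records are repeated to inflate the data.
            # (The first row is written once; every later row is written 100 times.)
            # Without the API limits and the time limits of this exercise,
            # the web scraping step above could collect a large training set instead.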
            for i in range(100): 
                f.write(",\n")
                f.write(row.message)
    
    f.write("]}")

In [60]:
!ls -l | grep output.json

-rw-r--r-- 1 cdsw cdsw 96174003 Feb  7 05:39 output.json


In [61]:
!head output.json
!tail output.json

{"data":[{"message_body":"$AAPL had approximately 2395M USD go to the short side at 52 pct short  Bears and Bulls are fighting close  https://www.algowins.com/?wdt_column_filter[1]=AAPL","sentiment":0},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow","sentiment":2},
{"message_body":"$AAPL nice squeeze on the daily. If jobs repo

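## Sentiment Analysis

When assessing a company's value for investment decisions, one approach is to draw on a wide range of information outside the traditional framework (alternative data).

We build a predictive model that takes as input the various kinds of information that can sway investors' decisions and converts it into a quantitative signal for investment decisions.
Many kinds of data can serve as input. One example:

- News (product recalls, natural disasters, and so on)

Deep learning with neural networks makes it possible to build predictive models for many different forms of input data.

Here we use posts from the social media site StockTwits.
The StockTwits community is used by investors, traders, and entrepreneurs.

We build a model around these twits that produces a sentiment score.

Training the model requires a label for each input. The accuracy of the labels is a critical factor in training.

To capture the degree of sentiment, we use a five-point scale: very negative, negative, neutral, positive, and very positive, corresponding to the values -2 through 2. In the data prepared above, only -2 (Bearish), 0, and 2 (Bullish) actually occur.

Using this labeled data, we train a model that takes natural-language text as input and predicts the sentiment behind it.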


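### Inspecting the data
Check what the data looks like.

Meaning of each field:

* `'message_body'`: the text of the message
* `'sentiment'`: the sentiment score, on a five-point scale from -2 to 2. 0 is neutral.

The contents should look like this: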
```
{'data': [
  {'message_body': '............................',
   'sentiment': 2},
  {'message_body': '............................',
   'sentiment': -2},
   ...
]}
```
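
A quick way to sanity-check a file with this layout is to count how often each label occurs. The sketch below uses a tiny inline stand-in for `output.json`; the sample string and variable names are illustrative only.

```python
import json
from collections import Counter

# Hypothetical miniature of output.json with the same structure.
sample = ('{"data":[{"message_body":"a","sentiment":2},'
          '{"message_body":"b","sentiment":0},'
          '{"message_body":"c","sentiment":2}]}')

twits_sample = json.loads(sample)
label_counts = Counter(t["sentiment"] for t in twits_sample["data"])
```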

Load the data.

In [2]:
with open('./output.json', 'r') as f:
    twits = json.load(f)

print(twits['data'][:10])

[{'message_body': '$AAPL had approximately 2395M USD go to the short side at 52 pct short  Bears and Bulls are fighting close  https://www.algowins.com/?wdt_column_filter[1]=AAPL', 'sentiment': 0}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the daily. If jobs report goes well 330 tomorrow', 'sentiment': 2}, {'message_body': '$AAPL nice squeeze on the da

Check the number of records.

In [3]:
print(len(twits['data']))

593901


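### Preprocessing the data

Preprocess the text.

Ticker symbols in the message body (written as `$SYMBOL`) carry no sentiment information, so we remove them.
Mentions of the form `@username` identify users but likewise carry no sentiment information, so we remove them as well.
URLs are also removed.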

### Building lists of message bodies and sentiment labels

In [4]:
messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]

### Defining the preprocessing function

In [5]:
nltk.download('wordnet')

def preprocess(message):
    """
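    Take a string as input and perform the following operations:
        - Convert all letters to lowercase
        - Remove URLs
        - Remove ticker symbols
        - Remove StockTwits usernames
        - Remove punctuation and other non-letter characters
        - Tokenize by splitting the string on whitespace
        - Lemmatize each token as a verb and drop single-character tokens
    
    Parameters
    ----------
        message : The text message to preprocess.
        
    Returns
    -------
        tokens: The list of preprocessed tokens.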
    """ 
    #TODO: Implement 
    
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message
    text = re.sub(r"http(s)?://([\w\-]+\.)+[\w-]+(/[\w\- ./?%&=]*)?", ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r"\$[^ \t\n\r\f]+", ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r"@[^ \t\n\r\f]+", ' ', text)

    # Replace everything not a letter with a space
    text = re.sub("[^a-z]", ' ', text)
    
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(w, pos='v') for w in tokens if len(w) > 1]
    
    return tokens

[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
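
To see the cleaning steps in isolation, here is a simplified sketch that needs only the standard library. It is not the `preprocess` function above: the patterns are deliberately simpler (anything up to the next whitespace), lemmatization is omitted, and `strip_noise` is a hypothetical name.

```python
import re

# Simplified patterns: each matches up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")
TICKER_RE = re.compile(r"\$\S+")
USER_RE = re.compile(r"@\S+")

def strip_noise(message):
    """Lowercase, drop URLs/tickers/usernames, keep only letter tokens longer than 1."""
    text = message.lower()
    for pattern in (URL_RE, TICKER_RE, USER_RE):
        text = pattern.sub(" ", text)
    # Everything that is not a lowercase letter becomes a separator.
    text = re.sub(r"[^a-z]", " ", text)
    return [token for token in text.split() if len(token) > 1]

tokens = strip_noise("$AAPL nice squeeze, says @trader1 https://example.com/x?y=1")
```

Removing URLs first matters: if punctuation were stripped first, the URL would fall apart into ordinary-looking word fragments.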


### Preprocessing the twit messages
Apply the `preprocess` function defined above to every StockTwits message.

Note: depending on the size of the data, this step can take some time.

In [6]:
tokenized = list(map(preprocess, messages))

print(tokenized[:3])
print(len(tokenized))

[['have', 'approximately', 'usd', 'go', 'to', 'the', 'short', 'side', 'at', 'pct', 'short', 'bear', 'and', 'bull', 'be', 'fight', 'close', 'aapl'], ['nice', 'squeeze', 'on', 'the', 'daily', 'if', 'job', 'report', 'go', 'well', 'tomorrow'], ['nice', 'squeeze', 'on', 'the', 'daily', 'if', 'job', 'report', 'go', 'well', 'tomorrow']]
593901


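### Bag of Words

Now that every message has been tokenized, build the vocabulary.
While doing so, count how often each word occurs across the entire corpus
(using the [`Counter`](https://docs.python.org/3.1/library/collections.html#collections.Counter) class).

Note: depending on the size of the data, this step can take some time.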

In [7]:
from collections import Counter

#words = []
#for tokens in tokenized:
#    for token in tokens:
#        words.append(token)
out_list = tokenized
words = [element for in_list in out_list for element in in_list]

print(words[:13])
print(len(words))

"""
Create a vocabulary by using Bag of words
"""

word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word:ii for ii, word in int_to_vocab.items()}

bow = []
for tokens in tokenized:
    bow.append([vocab_to_int[token] for token in tokens])

print(len(bow))
print(bow[:3])

# This BOW will not be used because it is not filtered to eliminate common words.

['have', 'approximately', 'usd', 'go', 'to', 'the', 'short', 'side', 'at', 'pct', 'short', 'bear', 'and']
7600018
593901
[[16, 115, 111, 35, 2, 0, 19, 98, 23, 116, 19, 101, 7, 144, 3, 240, 86, 2310], [216, 485, 9, 0, 159, 120, 751, 12, 35, 227, 188], [216, 485, 9, 0, 159, 120, 751, 12, 35, 227, 188]]
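
The same vocabulary construction is easier to follow on a toy corpus. The sketch below repeats the steps of the cell above on three made-up messages; the most frequent word receives id 0, the next one id 1, and so on.

```python
from collections import Counter

# Made-up tokenized messages for illustration.
toy_tokenized = [
    ["buy", "the", "dip"],
    ["sell", "the", "rally"],
    ["buy", "the", "rumor"],
]

toy_words = [token for tokens in toy_tokenized for token in tokens]
toy_counts = Counter(toy_words)

# Sort by descending count; words with equal counts keep first-seen order.
toy_sorted = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_sorted)}

toy_encoded = [[toy_vocab_to_int[t] for t in tokens] for tokens in toy_tokenized]
```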


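### Adjusting for word importance (frequency in the messages)

Using the vocabulary, remove some of the most common words, such as "the", "and", and "it".
These words are so common that they do not help identify sentiment and act as noise in the input to the neural network. Excluding them can also shorten training time.

Words that occur only very rarely are removed as well.
To do this, compute each word's frequency; the code below divides each word's count by the total number of tokens in the corpus.

Words whose frequency is at or below the low cutoff are then dropped.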

In [8]:
"""
Set the following variables:
    freqs
    low_cutoff
    high_cutoff
    K_most_common
"""

print("len(sorted_vocab):",len(sorted_vocab))
print("sorted_vocab - top:", sorted_vocab[:3])
print("sorted_vocab - least:", sorted_vocab[-15:])

# Dictionary that contains the frequency of words appearing in messages.
# The key is the token and the value is the frequency of that word in the corpus.
total_count = len(words)
freqs = {word: count/total_count for word, count in word_counts.items()}

#print("freqs[supplication]:",freqs["supplication"] )
print("freqs[the]:",freqs["the"] )

"""
This was the post by Ricardo:

there's no exact value for low_cutoff and high_cutoff, 
however I'd recommend you to use 
a low_cutoff that's around 0.000002 and 0.000007 
(This depends on the values you get from your freqs calculations) and 
a high_cutoff from 5 to 20 (this depends on the most_common values from the bow).
"""

# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 0.000002

# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
"""
example_count = []
example_count.append(sorted_vocab.index("the"))
example_count.append(sorted_vocab.index("for"))
example_count.append(sorted_vocab.index("of"))
print(example_count)
high_cutoff = min(example_count)
"""
high_cutoff = 20
print("high_cutoff:",high_cutoff)
print("low_cutoff:",low_cutoff)

# The k most common words in the corpus. Use `high_cutoff` as the k.
#K_most_common = [word for word in sorted_vocab[:high_cutoff]]
K_most_common = sorted_vocab[:high_cutoff]

print("K_most_common:",K_most_common)


filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]

print("len(filtered_words):",len(filtered_words)) 

len(sorted_vocab): 5664
sorted_vocab - top: ['the', 'of', 'to']
sorted_vocab - least: ['php', 'ref', 'nqjktv', 'png', 'developments', 'sayin', 'geologist', 'specialization', 'fossil', 'deposit', 'convince', 'merit', 'laughable', 'terrify', 'verge']
freqs[the]: 0.02588164922767288
high_cutoff: 20
low_cutoff: 2e-06
K_most_common: ['the', 'of', 'to', 'be', 'amp', 'utm', 'file', 'and', 'in', 'on', 'form', 'for', 'report', 'share', 'sec', 'medium', 'have', 'by', 'campaign', 'short']
len(filtered_words): 5644
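
The two cutoffs are easier to reason about on a small example. The following sketch applies the same rule as the cell above (drop the `high_cutoff` most common words and every word whose frequency is not above `low_cutoff`) to a made-up word list; `filter_vocabulary` is a hypothetical helper and the cutoff values are chosen for the toy data only.

```python
from collections import Counter

def filter_vocabulary(words, low_cutoff, high_cutoff):
    """Return words that are neither too rare nor among the most common."""
    counts = Counter(words)
    total = len(words)
    freqs = {word: count / total for word, count in counts.items()}
    most_common = {word for word, _ in counts.most_common(high_cutoff)}
    return [w for w in freqs if freqs[w] > low_cutoff and w not in most_common]

# 16 tokens in total: "the" x6, "of" x4, "bull" x3, "bear" x2, "typo" x1
toy_words = ["the"] * 6 + ["of"] * 4 + ["bull"] * 3 + ["bear"] * 2 + ["typo"]
kept = filter_vocabulary(toy_words, low_cutoff=0.1, high_cutoff=2)
```

Here "the" and "of" are removed as the two most common words, and "typo" (frequency 1/16) falls below the low cutoff.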


### Updating the vocabulary after removing filtered words
Create three variables that will be useful for working with the vocabulary.

In [9]:
"""
Set the following variables:
    vocab
    id2vocab
    filtered
"""

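# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word. 
# Ids start at 1 so that 0 stays reserved for the zero padding added by the dataloader
# (the model below is created with len(vocab)+1 embeddings for the same reason).
vocab =  {word:ii for ii, word in enumerate(filtered_words, 1)}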
# Reverse of the `vocab` dictionary. The key is word id and value is the word. 
id2vocab = {ii:word for word, ii in vocab.items()}
# tokenized with the words not in `filtered_words` removed.

print("len(tokenized):", len(tokenized))

filtered = [[token for token in tokens if token in vocab] for tokens in tokenized]
print("len(filtered):", len(filtered))
print("tokenized[:1]", tokenized[:1])
print("filtered[:1]",filtered[:1])

len(tokenized): 593901
len(filtered): 593901
tokenized[:1] [['have', 'approximately', 'usd', 'go', 'to', 'the', 'short', 'side', 'at', 'pct', 'short', 'bear', 'and', 'bull', 'be', 'fight', 'close', 'aapl']]
filtered[:1] [['approximately', 'usd', 'go', 'side', 'at', 'pct', 'bear', 'bull', 'fight', 'close', 'aapl']]


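### Balancing the classes

Labels in training data are often skewed (unusual examples are rare).
For example, if 50% of the data is neutral, a network that predicts 0 (neutral) every time would reach 50% accuracy.

To let the network learn properly, the classes need to be balanced. Ideally, each sentiment score appears in the data with roughly the same frequency.

Here we randomly drop messages with neutral sentiment so that they make up about 20% of the data.

The probability of keeping a neutral message is computed from the share of neutral messages in the data and the share we want after dropping.

At the same time, messages of length 0 are removed.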

In [10]:
balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)

keep_prob = (N_examples - n_neutral)/4/n_neutral

print("keep prob:", keep_prob)

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 

keep prob: 0.0504957488448718


Check the share of messages with neutral sentiment in the balanced data.

In [11]:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.21339761481857397
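
The expression `(N_examples - n_neutral)/4/n_neutral` follows from requiring that the kept neutral messages make up 20% of the balanced data: p * n / (p * n + m) = 0.2, where n is the number of neutral messages and m the number of all others, which gives p = m / (4 * n). The sketch below generalizes this to any target share; `neutral_keep_probability` is a hypothetical helper and the counts are made up.

```python
def neutral_keep_probability(n_total, n_neutral, target_share=0.2):
    """Probability of keeping a neutral example so neutrals end up at target_share."""
    n_other = n_total - n_neutral
    return target_share * n_other / ((1 - target_share) * n_neutral)

# Made-up counts: 1000 messages, 800 of them neutral.
p = neutral_keep_probability(1000, 800)

# Expected neutral share after dropping neutrals with probability 1 - p.
expected_share = p * 800 / (p * 800 + 200)
```

The share measured in the notebook (about 21%) differs slightly from the 20% target, which is plausible because the sampling is random and empty messages are removed in the same pass.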

Finally, convert the tokens into integer ids. This is required before the messages can be passed to the neural network as input.

In [12]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

Save the vocabulary to a file. It is needed at prediction time to convert new input in the same way.

In [13]:
import pickle

with open('vocab.pickle', 'wb') as f:
    pickle.dump(vocab, f)
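
At prediction time the saved vocabulary is loaded again and used to encode new text. The sketch below shows that round trip in memory with a made-up three-word vocabulary; `encode` is a hypothetical helper, and skipping unknown words is an assumption that mirrors the filtering step above.

```python
import pickle

# Made-up vocabulary; the real one comes from `filtered_words`.
toy_vocab = {"bull": 1, "bear": 2, "squeeze": 3}

# Serialize and restore in memory instead of going through a file.
blob = pickle.dumps(toy_vocab)
restored = pickle.loads(blob)

def encode(tokens, vocab):
    """Map tokens to ids, skipping words the vocabulary does not contain."""
    return [vocab[token] for token in tokens if token in vocab]

ids = encode(["nice", "squeeze", "bull"], restored)
```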

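### Neural network
Now that we have a vocabulary, tokens can be converted to ids and passed to the network. Next, define the network.

The overall architecture is:

#### Embed -> RNN -> Dense -> Softmax

### Implementing SentimentClassifier

The class consists of three main parts:

1. init function `__init__` 
2. forward pass `forward`  
3. hidden state `init_hidden`. 

The output layer uses softmax (implemented below as `nn.LogSoftmax`). The choice of output layer depends on the format of the output.

(For example, a sigmoid function would suit a binary output.)

In this network the sentiment score has five classes, so softmax is the appropriate choice.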

In [14]:
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        
        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout = dropout
        

        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        self.lstm = nn.LSTM(self.embed_size, self.lstm_size, self.lstm_layers, dropout=self.dropout)
        
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        
        self.softmax = nn.LogSoftmax(dim=1)


    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        
        Parameters
        ----------
            batch_size : The size of batches.
        
        Returns
        -------
            hidden_state
            
        """
        
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.lstm_layers, batch_size,self.lstm_size).zero_(),
                         weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden


    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.
        
        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.

        """
        
        batch_size = nn_input.size(1)  # nn_input has shape (sequence_length, batch_size)
        
        # embed
        embeds = self.embedding(nn_input)
        
        # LSTM
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        
        """
        remember here you do not have batch_first=True, 
        so accordingly shape your input. 
        Moreover, since now input is seq_length x batch you just need to transform lstm_out = lstm_out[-1,:,:].
        you don't have to use batch_first=True in this case, 
        nor reshape the outputs with .view just transform your lstm_out as advised and you should be good to go.
        """
        #lstm_out = lstm_out.contiguous().view(-1, self.lstm_size)    
        lstm_out = lstm_out[-1,:,:]
        
        # dropout
        out = self.dropout(lstm_out)
        
        # Dense layer (nn.Linear): map the final LSTM output to class scores
        out = self.fc(out)
        
        # Log-softmax over the classes
        logps = self.softmax(out)
        
        
        return logps, hidden_state

### Checking the model

In [15]:
model = SentimentClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
batch_size = 4
hidden = model.init_hidden(4)

logps, _ = model.forward(input, hidden)
print(logps)

tensor([[-1.4577, -1.7675, -1.4326, -1.6824, -1.7611],
        [-1.4568, -1.7760, -1.4235, -1.6869, -1.7618],
        [-1.4543, -1.7657, -1.4373, -1.6784, -1.7653],
        [-1.4654, -1.7506, -1.4481, -1.6645, -1.7659]],
       grad_fn=<LogSoftmaxBackward>)
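
Each row printed above holds log-probabilities, so exponentiating a row should give values that sum to roughly 1. The sketch below checks this for the first row and shows a plain-Python log-softmax for reference. It illustrates the math only and is not how PyTorch implements it; the toy scores are made up.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]

# First row of the output above: the probabilities should sum to about 1.
first_row = [-1.4577, -1.7675, -1.4326, -1.6824, -1.7611]
row_total = sum(math.exp(value) for value in first_row)

# Made-up raw scores for five classes; the predicted class is the argmax.
logps_toy = log_softmax([0.5, -1.0, 2.0, 0.0, 1.0])
predicted_class = max(range(len(logps_toy)), key=logps_toy.__getitem__)
```

Working in log space pairs naturally with `nn.NLLLoss`, which expects log-probabilities as input.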


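### Training
### DataLoader and batching
Here we build a generator that can be used to loop over the data.

For efficiency, sequences are passed in batches.

The input tensor has the shape (sequence_length, batch_size).

So if the sequence length is 40 tokens and 25 sequences are passed, the input size is (40, 25).

With the sequence length set to 40, messages with more or fewer than 40 tokens are handled as follows:
- Messages with fewer than 40 tokens are padded with zeros in the empty positions.
   - Be sure to left-pad, so that the RNN starts from nothing before it processes the data.
   - If a message has 20 tokens, the first 20 positions are 0.
- Messages with more than 40 tokens keep only the first 40 tokens.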

In [16]:
#def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
def dataloader(messages, labels, sequence_length=20, batch_size=32, shuffle=False):
    """ 
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)

    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        
        yield batch, label_tensor
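
The left-padding and truncation rule used by `dataloader` can be seen in isolation with plain lists. This is a sketch only; `left_pad` is a hypothetical helper and is not called by the notebook.

```python
def left_pad(tokens, sequence_length, pad_id=0):
    """Truncate to the first `sequence_length` ids, then pad on the left with `pad_id`."""
    tokens = tokens[:sequence_length]
    return [pad_id] * (sequence_length - len(tokens)) + tokens

short = left_pad([7, 8, 9], 5)               # padded at the front
long = left_pad([1, 2, 3, 4, 5, 6, 7], 5)    # truncated to the first five ids
```

Padding on the left means the real tokens sit at the end of the sequence, which is the position the model reads with `lstm_out[-1,:,:]`.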

###  Splitting the data (training, validation, and test)

In [17]:
"""
Split data into training and validation datasets. Use an appropriate split size.
The features are the `token_ids` and the labels are the `sentiments`.
"""   

split_frac = 0.98 # for small data
#split_frac = 0.8 # for big data

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(token_ids)*split_frac)
train_features, remaining_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, remaining_labels = sentiments[:split_idx], sentiments[split_idx:]

test_idx = int(len(remaining_features)*0.5)
valid_features, test_features = remaining_features[:test_idx], remaining_features[test_idx:]
valid_labels, test_labels = remaining_labels[:test_idx], remaining_labels[test_idx:]
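
The slicing above can be wrapped in a small function to check the resulting sizes. In this sketch, `three_way_split` is a hypothetical helper, and the ten-item list and the 0.8 fraction are chosen only to keep the numbers easy to follow.

```python
def three_way_split(items, train_frac):
    """Split into train / validation / test; the remainder is halved."""
    split_idx = int(len(items) * train_frac)
    train, remaining = items[:split_idx], items[split_idx:]
    half = int(len(remaining) * 0.5)
    return train, remaining[:half], remaining[half:]

train_part, valid_part, test_part = three_way_split(list(range(10)), 0.8)
```

Because the split is purely positional, the inflated dataset used here (each message repeated 100 times) may place copies of the same message in more than one partition, so validation scores should be read with that in mind.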

### Preparing for training

Check which device is available (CUDA/GPU or CPU).

In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [19]:
#model = SentimentClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
model = SentimentClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)

SentimentClassifier(
  (embedding): Embedding(5645, 1024)
  (lstm): LSTM(1024, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=5, bias=True)
  (softmax): LogSoftmax()
)
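
As a rough check on the model size, the number of trainable parameters can be worked out from the layer shapes printed above. This is a back-of-the-envelope sketch, assuming a unidirectional LSTM with the default bias settings (PyTorch's `nn.LSTM` keeps two bias vectors per layer). The variable and function names are illustrative.

```python
vocab_size, embed_size, hidden_size, num_classes = 5645, 1024, 512, 5


def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


embedding = vocab_size * embed_size
lstm = lstm_layer_params(embed_size, hidden_size) + lstm_layer_params(hidden_size, hidden_size)
fc = hidden_size * num_classes + num_classes
total = embedding + lstm + fc

print(embedding, lstm, fc, total)  # 5780480 5251072 2565 11034117
```

Under these assumptions the embedding table accounts for slightly more than half of the roughly 11 million parameters.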

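### Running the training

Run the training. To monitor how training is progressing, the loss is printed periodically.

Note: depending on the size of the data, this step can take a considerable amount of time.

When running in a GPU-equipped environment, you can confirm that the GPU is being used by running the following command in a terminal (while the GPU is in use, the Volatile GPU-Util percentage at the top right of the displayed table rises):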
```
$ watch nvidia-smi
```

In [20]:
import numpy as np

epochs = 5
#batch_size = 64
batch_size = 512
learning_rate = 0.001

print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

#val_losses = []
total_losses = []
#accuracy = []

for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    
    steps = 0
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        hidden = model.init_hidden(labels.shape[0]) 
        
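        # Move the batch and the hidden state to the device (CPU or GPU).
        # Tensor.to() returns a new tensor, so the result must be reassigned.
        text_batch, labels = text_batch.to(device), labels.to(device)
        hidden = tuple(each.to(device) for each in hidden)
        
        # Train the model
        # Detach the hidden state from the graph of the previous step
        hidden = tuple(each.data for each in hidden)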
        model.zero_grad()
        output, hidden = model(text_batch, hidden)
        loss = criterion(output.squeeze(), labels)
        loss.backward()
        clip = 5
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        # Accumulate loss
        #val_losses.append(loss.item())
        total_losses.append(loss.item())
        
        correct_count = 0.0
        if steps % print_every == 0:
            model.eval()
            
            # Calculate accuracy
            ps = torch.exp(output)
            top_p, top_class = ps.topk(1, dim=1)
            #?top_class = top_class.to(device)
            #?labels = labels.to(device)

            correct_count += torch.sum(top_class.squeeze()== labels)
            #accuracy.append(100*correct_count/len(labels))
            
            # Print metrics
            print("Epoch: {}/{}...".format(epoch+1, epochs),
                 "Step: {}...".format(steps),
                 "Loss: {:.6f}...".format(loss.item()),
                 "Total Loss: {:.6f}".format(np.mean(total_losses)),
                 #"Collect Count: {}".format(correct_count),
                 #"Accuracy: {:.2f}".format((100*correct_count/len(labels))),
                 # AttributeError: 'torch.dtype' object has no attribute 'type'
                 #"Accuracy Avg: {:.2f}".format(np.mean(accuracy))
                 )
            
            model.train()

Starting epoch 1
Epoch: 1/5... Step: 100... Loss: 0.040382... Total Loss: 0.219460
Epoch: 1/5... Step: 200... Loss: 0.000827... Total Loss: 0.118474
Starting epoch 2
Epoch: 2/5... Step: 100... Loss: 0.002647... Total Loss: 0.076874
Epoch: 2/5... Step: 200... Loss: 0.001187... Total Loss: 0.059721
Starting epoch 3
Epoch: 3/5... Step: 100... Loss: 0.000218... Total Loss: 0.047139
Epoch: 3/5... Step: 200... Loss: 0.000292... Total Loss: 0.040001
Starting epoch 4
Epoch: 4/5... Step: 100... Loss: 0.000775... Total Loss: 0.034031
Epoch: 4/5... Step: 200... Loss: 0.000757... Total Loss: 0.030247
Starting epoch 5
Epoch: 5/5... Step: 100... Loss: 0.000326... Total Loss: 0.026750
Epoch: 5/5... Step: 200... Loss: 0.001380... Total Loss: 0.024398
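
The accuracy lines in the training loop are commented out. As a sketch of what they are meant to compute, the example below derives batch accuracy from log-probabilities using plain Python lists. The helper `batch_accuracy` and the toy values are hypothetical. Because `exp` is monotonic, the class with the largest log-probability is also the class with the largest probability, so no exponentiation is needed.

```python
def batch_accuracy(log_probs, labels):
    """Percentage of rows whose highest-scoring class matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in log_probs]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


log_probs = [
    [-0.1, -2.5, -4.0],  # predicts class 0
    [-3.0, -0.2, -2.0],  # predicts class 1
    [-1.5, -0.4, -2.2],  # predicts class 1
    [-2.0, -2.5, -0.3],  # predicts class 2
]
labels = [0, 1, 2, 2]
print(batch_accuracy(log_probs, labels))  # 75.0
```

In the PyTorch loop, converting the tensor count to a Python number with `.item()` before handing it to NumPy may avoid the `AttributeError` noted in the comments.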


In [21]:
torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

## Creating the prediction function

Implement a predict function that uses the trained model to generate a prediction from input text.

The text must be preprocessed before it is passed to the network.

In [22]:
import glob
import pickle
import re
import nltk
import numpy as np
import os
import sys

import torch

nltk.download('wordnet')

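# __file__ is not defined in a notebook, so use the current working directory
cur_dir = os.getcwd()
print(cur_dir)
sys.path.append(cur_dir)

vocab_filename = 'vocab.pickle'
vocab_path = os.path.join(cur_dir, vocab_filename)
with open(vocab_path, 'rb') as f:
    vocab_l = pickle.load(f)

#model_path = cur_dir + "/" + "model.torch"
#model_l = torch.load(model_path, map_location='cpu')

model_l = SentimentClassifier(len(vocab_l)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine
checkpoint = torch.load('./checkpoint.pth.tar', map_location='cpu')
model_l.load_state_dict(checkpoint['state_dict'])
# Switch to evaluation mode so that dropout is disabled during prediction
model_l.eval()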

class UnknownWordsError(Exception):
    """Raised when the text contains only unknown (out-of-vocabulary) words."""


def predict_func(text, model, vocab):
    """ 
    Make a prediction on a single sentence.
    Parameters
    ----------
        text : The string to make a prediction on.
        model : The model to use for making the prediction.
        vocab : Dictionary for word to word ids. The key is the word and the value is the word id.
    Returns
    -------
        pred : Predicted class probabilities (torch tensor)
    """

    tokens = preprocess(text)    

    # Filter non-vocab words
    tokens = [token for token in tokens if token in vocab]
    # Convert words to ids
    tokens = [vocab[token] for token in tokens]

    if len(tokens) == 0:
        raise UnknownWordsError

    # Build the input tensor and add a batch dimension: shape (sequence_length, 1)
    text_input = torch.tensor(tokens, dtype=torch.int64).view(-1, 1)

    # Get the NN output
    batch_size = 1
    hidden = model.init_hidden(batch_size)

    logps, _ = model(text_input, hidden)
    # Take the exponent of the NN output to get a range of 0 to 1 for each label.
    pred = torch.exp(logps)
    
    return pred


def predict_api(args):
    """ 
    Make a prediction on a single sentence.
    Parameters
    ----------
        args : Input (Python dictionary with a 'text' key)
    Returns
    -------
        pred : Predicted probabilities for each class (array-like)
    """
    text = args.get('text')
    try:
        result = predict_func(text, model_l, vocab_l)
        return result.detach().numpy()[0]
    except UnknownWordsError:
        # No known words: fall back to the middle class (index 2)
        return [0,0,1,0,0]

/home/cdsw


[nltk_data] Downloading package wordnet to /home/cdsw/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
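
The vocabulary filtering inside `predict_func` can be tried in isolation. The following sketch uses a hypothetical three-word vocabulary and a hypothetical helper, `encode`, and it redefines `UnknownWordsError` locally so that the snippet runs on its own.

```python
class UnknownWordsError(Exception):
    """Raised when no token of the input is in the vocabulary."""


def encode(tokens, vocab):
    """Drop out-of-vocabulary tokens and map the rest to their ids."""
    ids = [vocab[token] for token in tokens if token in vocab]
    if not ids:
        raise UnknownWordsError
    return ids


toy_vocab = {"bullish": 1, "bearish": 2, "on": 3}  # hypothetical vocabulary

print(encode(["i", "am", "bullish", "on", "goog"], toy_vocab))  # [1, 3]

try:
    encode(["kono", "yoshiyuki"], toy_vocab)
except UnknownWordsError:
    print("no known words")
```

Unknown tokens are silently dropped, so a sentence is rejected only when none of its tokens appear in the vocabulary.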


Predict using a sentence that suggests positive sentiment as input (feel free to change the sentence and rerun).

In [23]:
args = {"text": "I'm bullish on $goog"}
result = predict_api(args)
print(result)

[ 0.02395445  0.02243885  0.03738181  0.02565586  0.89056903]


Predict using a sentence that suggests negative sentiment as input (feel free to change the sentence and rerun).

In [24]:
args = {"text": "I'm bearish on $goog"}
result = predict_api(args)
print(result)

[ 0.76727909  0.07537365  0.05268526  0.08244011  0.02222182]


Predict using a sentence made up only of words that are not in the vocabulary dictionary (feel free to change the sentence and rerun).

In [25]:
args = {"text": "kono yoshiyuki"}
result = predict_api(args)
print(result)

[0, 0, 1, 0, 0]
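
Each prediction is a vector of five class probabilities. One way to read it is to pick the most likely class, as in the sketch below. The class names in `LABELS` are assumptions made for illustration, since the notebook only defines the indices 0 to 4. The input vectors are the three outputs printed above.

```python
# Hypothetical class names; the notebook only defines the indices 0 to 4.
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def top_class(probabilities):
    """Return (index, label, probability) of the most likely class."""
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index, LABELS[index], probabilities[index]


bullish = [0.02395445, 0.02243885, 0.03738181, 0.02565586, 0.89056903]
bearish = [0.76727909, 0.07537365, 0.05268526, 0.08244011, 0.02222182]
unknown = [0, 0, 1, 0, 0]

for name, probs in [("bullish", bullish), ("bearish", bearish), ("unknown", unknown)]:
    print(name, top_class(probs))
```

The bullish sentence peaks at index 4 and the bearish one at index 0, while the fallback for unknown words selects the middle index 2.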


### Wrapping up

To delete the database, run the following **after changing the database name appropriately**.

In [30]:
%sql DROP DATABASE IF EXISTS admin CASCADE

 * hive://admin@master.ykono.work:10000
Done.


[]