Table of Contents

- What is ELMo
- Problem Definition
- Data
- Implementing ELMo in Python

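### What is ELMo

ELMo is a word embedding method whose vectors are computed by a two-layer bidirectional language model.

ELMo differs from other word embeddings in the following way.

Traditional word embedding methods give each word a single vector, regardless of which meaning is intended.

The ELMo vector assigned to a word, by contrast, is derived from the entire sentence.

As a result, the same word gets different ELMo vectors in different sentences.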
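To make the contrast concrete, here is a toy sketch. It does not reproduce ELMo's internals (which rely on a bidirectional LSTM language model); the vectors are made up, and `toy_contextual_embed` is a hypothetical helper that only mimics the property that a word's vector depends on its sentence.

```python
import numpy as np

# Toy static embedding table: one fixed vector per word (values are made up).
static_table = {
    "river": np.array([1.0, 0.0]),
    "bank": np.array([0.5, 0.5]),
    "money": np.array([0.0, 1.0]),
}


def static_embed(sentence):
    """Look up each word independently; context is ignored."""
    return [static_table[word] for word in sentence]


def toy_contextual_embed(sentence):
    """Mix each word's vector with the sentence average.

    This is NOT how ELMo works; it only imitates context dependence.
    """
    vectors = static_embed(sentence)
    sentence_mean = np.mean(vectors, axis=0)
    return [0.5 * v + 0.5 * sentence_mean for v in vectors]


s1 = ["river", "bank"]
s2 = ["money", "bank"]

# Static lookup: "bank" has the same vector in both sentences.
static_same = np.allclose(static_embed(s1)[1], static_embed(s2)[1])

# Toy contextual version: "bank" gets a different vector in each sentence.
contextual_same = np.allclose(toy_contextual_embed(s1)[1],
                              toy_contextual_embed(s2)[1])

print(static_same, contextual_same)  # True False
```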

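### Problem Definition

The goal of this article is to use NLP for sentiment analysis: <br>
given customers' tweets about technology companies, decide whether each tweet contains negative sentiment. <br>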
### Data

The data comes from https://datahack.analyticsvidhya.com/contest/linguipedia-codefest-natural-language-processing-1/#data_dictionary

There are two files, the training data and the test data: <br>
- train data: 7920 rows, each containing an ID, a label, and the tweet text
- test data: 1953 rows, each containing only an ID and the tweet text

### Implementing ELMo in Python

Before starting, install the required libraries.

In [1]:
import pandas as pd
import numpy as np
import spacy
import re
import time
import pickle

**step1 : Read Data**

In [19]:
train = pd.read_csv("train_2kmZucJ.csv")
# test = pd.read_csv("test_oJQbWVk.csv")

train.shape

(7920, 3)

In [20]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


With the training data loaded, let's look at the proportion of positive and negative tweets.

In [21]:
train['label'].value_counts(normalize = True)

0    0.744192
1    0.255808
Name: label, dtype: float64
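Because roughly 74% of the tweets carry label 0, a classifier that always predicts 0 would already reach about 74% accuracy while flagging nothing, which is one reason to report an F1 score alongside accuracy. The sketch below makes that baseline explicit on hypothetical toy labels (75 zeros and 25 ones, not the real data); `majority_baseline_accuracy` is an illustrative helper, not part of the original notebook.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels with roughly the same 3:1 imbalance as above.
toy_labels = [0] * 75 + [1] * 25
baseline = majority_baseline_accuracy(toy_labels)

print(baseline)  # 0.75
```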

**step2 : Data Preprocessing**

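The following 6 preprocessing steps are needed: <br>
- Remove URLs
- Remove punctuation
- Convert letters to lowercase
- Remove numbers
- Collapse extra whitespace
- Reduce inflected words to their base form (lemmatization)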
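As a rough sketch, the first five steps can be combined into one function that works on a single string. The function name `clean_text` and the example tweet are assumptions made for illustration; lemmatization is left out because it needs a spaCy model.

```python
import re

# Punctuation characters stripped in this walkthrough (it omits ' , and .).
PUNCTUATION = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')


def clean_text(text):
    """Apply steps 1-5 from the list above to a single string."""
    text = re.sub(r'http\S+', '', text)                          # 1. drop URLs
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)   # 2. drop punctuation
    text = text.lower()                                          # 3. lowercase
    text = re.sub(r'[0-9]', ' ', text)                           # 4. digits -> space
    return ' '.join(text.split())                                # 5. collapse whitespace


# Made-up example tweet.
example = "Check THIS out!! https://example.com/abc #iPhone7  rocks"
print(clean_text(example))  # check this out iphone rocks
```

Running the steps in this order matters: the URL has to be removed before punctuation, otherwise the `:` and `/` characters are stripped first and `http\S+` would leave fragments behind.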

In [22]:
# remove URL's from train and test
train['clean_tweet'] = train['tweet'].apply(lambda x: re.sub(r'http\S+', '', x)) # \S matches any non-whitespace character
# test['clean_tweet'] = test['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))

In [23]:
# remove punctuation marks
punctuation = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')  # build the set once instead of once per character

train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))
# test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in punctuation))

In [24]:
# convert text to lowercase
train['clean_tweet'] = train['clean_tweet'].str.lower()
# test['clean_tweet'] = test['clean_tweet'].str.lower()

In [25]:
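# remove numbers (regex=True is needed on recent pandas, where str.replace defaults to literal matching)
train['clean_tweet'] = train['clean_tweet'].str.replace("[0-9]", " ", regex=True)
# test['clean_tweet'] = test['clean_tweet'].str.replace("[0-9]", " ", regex=True)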

In [26]:
# remove multiple whitespaces
train['clean_tweet'] = train['clean_tweet'].apply(lambda x:' '.join(x.split()))
# test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ' '.join(x.split()))

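**Reduce inflected words to their base form**

We will use a language model provided by spaCy. <br>
Before loading the "en" model, download it first (the steps below are for Anaconda):
1. Open Anaconda as an administrator
2. Open the Anaconda terminal and enter the following command <br>
    `python -m spacy download en` (on Ubuntu, just run this command directly)
    
Once the model has downloaded successfully, it can be loaded. Note that the `en` shortcut only works on spaCy 2.x; spaCy 3 and later dropped shortcuts in favor of full model names such as `en_core_web_sm`.

First, let's look at how to use the spaCy model.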

In [9]:
nlp = spacy.load('en')

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_)

Apple apple
is be
looking look
at at
buying buy
U.K. u.k.
startup startup
for for
$ $
1 1
billion billion


In [28]:
# lemmatize text

# import spaCy's language model
nlp = spacy.load('en', disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output

train['clean_tweet'] = lemmatization(train['clean_tweet'])
# test['clean_tweet'] = lemmatization(test['clean_tweet'])

In [29]:
train.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,fingerprint pregnancy test android app beautif...
1,2,0,Finally a transparant silicon case ^^ Thanks t...,finally a transparant silicon case thank to -P...
2,3,0,We love this! Would you go? #talk #makememorie...,-PRON- love this would -PRON- go talk makememo...
3,4,0,I'm wired I know I'm George I was made that wa...,-PRON- be wired i know -PRON- be george i be m...
4,5,1,What amazing service! Apple won't even talk to...,what amazing service apple will not even talk ...


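**step3 : Prepare ELMo vectors**

We will use TensorFlow for this.

First install tensorflow (version 1.7.0 or later) and tensorflow-hub. The code below uses the TensorFlow 1.x API (`hub.Module`, `tf.Session`), so the version is capped below 2.0. <br>

`pip install "tensorflow>=1.7.0,<2"` <br>
`pip install tensorflow-hub` <br>

Next, import the pre-trained ELMo model.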

In [30]:
import tensorflow_hub as hub
import tensorflow as tf

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

The ELMo model is now loaded.

Next, let's use a simple example to see how to generate ELMo vectors.

In [31]:
x = ["Roasted ants are a popular snack in Columbia"]

embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

embeddings.shape

TensorShape([Dimension(1), Dimension(8), Dimension(1024)])

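The output is three-dimensional. The dimensions are: <br>
- the number of input sentences
- the number of tokens in the longest sentence
- the length of each token's ELMo vector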
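To make the three axes concrete, here is a small NumPy sketch in which random numbers stand in for real ELMo output (an assumption made purely to illustrate shapes). It shows how averaging over the token axis collapses a `(1, 8, 1024)` tensor into one 1024-dimensional vector per sentence.

```python
import numpy as np

# Stand-in for an ELMo output: 1 sentence, 8 tokens, 1024 features per token.
# Random values are used here only to illustrate the shapes.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1, 8, 1024))

# Averaging over axis 1 (the token axis) leaves one vector per sentence.
sentence_vectors = fake_embeddings.mean(axis=1)

print(fake_embeddings.shape)   # (1, 8, 1024)
print(sentence_vectors.shape)  # (1, 1024)
```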

At this point `embeddings` is still a symbolic tensor, so we run it in a tf session to obtain the actual ELMo vectors.

In [32]:
with tf.Session() as sess: # create a TensorFlow session
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # return average of ELMo features
    test_sess = sess.run(tf.reduce_mean(embeddings, 1)) 

test_sess

array([[-0.0823871 ,  0.07898234,  0.08993164, ..., -0.2251663 ,
         0.05554263,  0.29672712]], dtype=float32)

Now that we know how to use the ELMo model, we can convert the training data. <br>

To avoid exhausting compute resources, we convert the data to ELMo vectors in batches.

In [33]:
list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
# list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]

list_train

[     id  label                                              tweet  \
 0     1      0  #fingerprint #Pregnancy Test https://goo.gl/h1...   
 1     2      0  Finally a transparant silicon case ^^ Thanks t...   
 2     3      0  We love this! Would you go? #talk #makememorie...   
 3     4      0  I'm wired I know I'm George I was made that wa...   
 4     5      1  What amazing service! Apple won't even talk to...   
 5     6      1  iPhone software update fucked up my phone big ...   
 6     7      0  Happy for us .. #instapic #instadaily #us #son...   
 7     8      0  New Type C charger cable #UK http://www.ebay.c...   
 8     9      0  Bout to go shopping again listening to music #...   
 9    10      0  Photo: #fun #selfie #pool #water #sony #camera...   
 10   11      1  hey #apple when you make a new ipod dont make ...   
 11   12      1  Ha! Not heavy machinery but it does what I nee...   
 12   13      1  Contemplating giving in to the iPhone bandwago...   
 13   14      0  I j
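The slicing pattern used for the batches can be checked in isolation. The sketch below uses a plain list of 7920 placeholder items in place of the DataFrame; `make_batches` is a hypothetical helper name.

```python
def make_batches(items, batch_size):
    """Split a sequence into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 7920 placeholder rows, matching the size of the training set.
rows = list(range(7920))
batches = make_batches(rows, 100)

print(len(batches))      # 80
print(len(batches[-1]))  # 20
```

Since 7920 is not a multiple of 100, the last batch is shorter than the others; slicing past the end of a sequence simply returns the remaining items.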

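With the data split into batches, we pass each batch through the ELMo model loaded earlier.

A tf session then evaluates the ELMo vectors.

Finally, the results are converted back into an array.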

In [34]:
def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess: # create a TensorFlow session
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings, 1))  # average over axis 1 (the token axis): one vector per tweet

elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
# elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]

In [36]:
elmo_train_new = np.concatenate(elmo_train, axis = 0)
# elmo_test_new = np.concatenate(elmo_test, axis = 0)
elmo_train_new

array([[-0.01367856,  0.0471907 ,  0.04993869, ...,  0.00911632,
         0.02073942,  0.06375235],
       [-0.04931365, -0.01975196,  0.02596117, ...,  0.03484079,
         0.05203516, -0.01941845],
       [-0.07244584, -0.04607351,  0.06534164, ..., -0.00022881,
        -0.08720941,  0.03318136],
       ...,
       [-0.13460152,  0.00312679, -0.16876186, ...,  0.09333538,
         0.17530221,  0.0798686 ],
       [ 0.06581538,  0.17325026,  0.06232745, ...,  0.08818179,
        -0.1882427 ,  0.01674475],
       [ 0.02453895, -0.10790104,  0.19223242, ..., -0.04737334,
         0.10106892,  0.13966535]], dtype=float32)
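The concatenation step can be sketched with placeholder arrays. Zeros stand in for real features here (an assumption made only to show shapes), and the batch sizes are illustrative.

```python
import numpy as np

# Placeholder batches: two full batches of 100 rows and a final one of 20,
# each row being a 1024-dimensional vector (zeros instead of real features).
batch_outputs = [
    np.zeros((100, 1024), dtype=np.float32),
    np.zeros((100, 1024), dtype=np.float32),
    np.zeros((20, 1024), dtype=np.float32),
]

# axis=0 stacks the batches row-wise, so row order matches the input order.
stacked = np.concatenate(batch_outputs, axis=0)

print(stacked.shape)  # (220, 1024)
```

Because the row order is preserved, row `i` of the stacked matrix still corresponds to row `i` of the original data, which is what allows the features to be paired with the `label` column afterwards.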

The ELMo conversion took a long time, so we save the result first. <br>

The library used is pickle.

In [37]:
# save elmo_train_new
with open("elmo_train_03032019.pickle", "wb") as pickle_out:
    pickle.dump(elmo_train_new, pickle_out)

# save elmo_test_new
# with open("elmo_test_03032019.pickle", "wb") as pickle_out:
#     pickle.dump(elmo_test_new, pickle_out)

Then load the ELMo vectors we just saved.

In [38]:
# load elmo_train_new
with open("elmo_train_03032019.pickle", "rb") as pickle_in:
    elmo_train_new = pickle.load(pickle_in)

# load elmo_test_new
# with open("elmo_test_03032019.pickle", "rb") as pickle_in:
#     elmo_test_new = pickle.load(pickle_in)

Once the ELMo vectors are loaded, we can start training the model.

**step4 : Model Building**

Before training, split the data into a training set and a validation set.

In [39]:
from sklearn.model_selection import train_test_split

xtrain, xvalid, ytrain, yvalid = train_test_split(elmo_train_new, 
                                                  train['label'],  
                                                  random_state=42, 
                                                  test_size=0.2)

Next, we build a logistic regression model.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

With the classifier trained, we can predict on the validation data.

In [44]:
preds_valid = lreg.predict(xvalid)

In [45]:
from sklearn.metrics import accuracy_score

accuracy_score(yvalid, preds_valid)

0.8819444444444444

In [46]:
from sklearn.metrics import f1_score

f1_score(yvalid, preds_valid)

0.7776456599286562
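The F1 score is lower than the accuracy because it only credits the positive class (label 1), which is the minority here. As a minimal sketch of how the two metrics are computed, the example below derives both from confusion-matrix counts. The counts are hypothetical and are not taken from the run above; `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts, NOT taken from the run above.
tp, fp, fn, tn = 30, 10, 10, 150

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_from_counts(tp, fp, fn)

print(accuracy)  # 0.9
print(f1)        # 0.75
```

The many true negatives raise accuracy but do not enter the F1 formula at all, so the gap between the two numbers widens as the classes become more imbalanced.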