# Project 1: Training an LSTM to Write Lyrics Automatically

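## Project Goal
- Goal: train an LSTM model on Mayday lyrics so that it can generate new lyrics automatically
- About the mayday_lyrics.txt data:
    - Each line holds the lyrics of one song
    - Punctuation has been removed, and spaces mark the breaks between phrases
- Build training sequences of lyrics from mayday_lyrics.txt
- Train an LSTM model on those sequences
- Given an opening fragment such as "給我一首歌" ("Give me a song"), the LSTM predicts the next character; repeating this step fills in the lyrics automatically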
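
The "predict the next character, then repeat" loop from the last bullet can be sketched without any neural network. The example below is only an illustration: it swaps the LSTM for a simple bigram count table, and `build_bigram_table`, `generate`, and the toy string are hypothetical names and data, not part of this project.

```python
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count, for each character, which characters follow it."""
    table = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        table[current_char][next_char] += 1
    return table

def generate(table, pre_text, generate_len):
    """Repeatedly append the most frequent follower of the last character."""
    result = list(pre_text)
    for _ in range(generate_len):
        followers = table.get(result[-1])
        if not followers:  # no known continuation: stop early
            break
        result.append(followers.most_common(1)[0][0])
    return ''.join(result)

# Toy stand-in for the lyrics corpus (an assumption for illustration)
toy_table = build_bigram_table('abcabcabd')
print(generate(toy_table, 'a', 5))  # prints "abcabc"
```

The LSTM plays the same role as the lookup table here, except that it conditions on the whole preceding window rather than on a single character.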

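## Implementation Hints
- STEP1: Read the lyrics from mayday_lyrics.txt
- STEP2: Build an index for every character
- STEP3: Build LyricsDataset with a rolling (sliding) window
- STEP4: Wrap LyricsDataset in a DataLoader
- STEP5: Build the LSTM model: inputs > nn.Embedding > nn.LSTM > nn.Dropout > take the last state > nn.Linear > softmax
- STEP6: Train the model and tune the hyperparameters
- STEP7: Demo: given pre_text, use the model to predict the next character iteratively and generate lyrics
- (Advanced) STEP8: In the demo, sample the next character at random according to the softmax probabilities, which adds randomness and gives more varied lyrics; a probability threshold can additionally keep very unlikely characters from appearing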
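
The thresholded sampling described in STEP8 can be tried out in plain Python before wiring it into the model. This is a minimal sketch under stated assumptions: `softmax` and `sample_with_threshold` are hypothetical helper names, the logits are made up, and the fallback to the unfiltered distribution is one possible way to handle the case where every probability falls below the threshold.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_threshold(logits, prob_threshold, rng=random):
    """Sample an index, ignoring entries whose probability is below the threshold."""
    probs = softmax(logits)
    kept = [p if p > prob_threshold else 0.0 for p in probs]
    if sum(kept) <= 0:  # everything was filtered out: fall back to the full distribution
        kept = probs
    return rng.choices(range(len(kept)), weights=kept, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 1.0, -5.0]  # made-up scores for a 3-character vocabulary
samples = [sample_with_threshold(logits, 0.01, rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With these logits the third entry has a probability well below 0.01, so it is filtered out and never drawn, while the first two are sampled in proportion to their probabilities.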

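## Key Takeaways: what you will learn from this project
- How to read and process sequence data that needs a rolling window
- How to build an LSTM model with PyTorch
- How to train a language model
- How to sample at random from a softmax distribution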

In [1]:
import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F

In [2]:
# One song per line; the file is assumed to be UTF-8 encoded
with open('mayday_lyrics.txt', encoding='utf-8') as f:
    lyrics_list = [line.strip() for line in f]

In [6]:
lyrics_list[1]

'如果你眼神能夠為我 片刻的降臨 如果你能聽到 心碎的聲音 沈默的守護著你 沈默的等奇跡 沈默的讓自己 像是空氣 大家都吃著聊著笑著 今晚多開心 最角落里的我 笑得多合群 盤底的洋蔥像我 永遠是調味品 偷偷地看著你 偷偷地隱藏著自己 如果你願意一層一層一層的剝開我的心 你會發現你會訝異 你是我最壓抑最深處的秘密 如果你願意一層一層一層的剝開我的心 你會鼻酸你會流淚 只要你能聽到我看到我的全心全意 聽你說你和你的他們 曖昧的空氣 我和我的絕望 裝得很風趣 我就像一顆洋蔥 永遠是配角戲 多希望能與你 有一秒專屬的劇情 如果你願意一層一層一層的剝開我的心 你會發現你會訝異 你是我最壓抑最深處的秘密 如果你願意一層一層一層的剝開我的心 你會鼻酸你會流淚 只要你能聽到我看到我的全心全意 如果你願意一層一層一層的剝開我的心 你會發現你會訝異 你是我最壓抑最深處的秘密 如果你願意一層一層一層的剝開我的心 你會鼻酸你會流淚 只要你能看到我聽到我的全心全意 你會鼻酸你會流淚 只要你能聽到我看到我的全心全意'

In [3]:
# Build the character <-> index lookup tables
word2index = {}
index2word = {}

i = 0
for words in lyrics_list:
    for word in words:
        if word not in word2index:
            word2index[word] = i
            index2word[i] = word
            i += 1
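
To see what the two lookup tables contain, the same loop can be run on a couple of toy lines. This is a sketch for illustration only: `build_vocab` and `toy_lines` are hypothetical and are not used elsewhere in the notebook.

```python
def build_vocab(lines):
    """Assign each distinct character the next free integer index."""
    word2index, index2word = {}, {}
    for line in lines:
        for ch in line:
            if ch not in word2index:
                index = len(word2index)
                word2index[ch] = index
                index2word[index] = ch
    return word2index, index2word

toy_lines = ['給我一首歌', '一首歌 給我']  # made-up input for illustration
w2i, i2w = build_vocab(toy_lines)

encoded = [w2i[ch] for ch in toy_lines[1]]
decoded = ''.join(i2w[i] for i in encoded)
print(encoded)  # prints [2, 3, 4, 5, 0, 1]
print(decoded == toy_lines[1])  # prints True
```

Because the loop walks over every character, the space that separates phrases also receives its own index (5 in this toy run), so the model can learn where phrases end.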

In [4]:
len(word2index)

2101

In [11]:
# Build the dataset
class LyricsDataset(Dataset):
    def __init__(self, lyrics_list, word2index, num_unrollings=10):
        self.word2index = word2index
        self.lyrics_list = lyrics_list
        self.sample = []
        
        for lyrics in lyrics_list:
            for i in range(len(lyrics) - num_unrollings + 1):
                sample = lyrics[i:i+num_unrollings]
                self.sample.append(sample)

    def __getitem__(self, idx):
        source = self.sample[idx]
        
        data_X = [self.word2index[w] for w in source[:-1]]
        target_Y = self.word2index[source[-1]]
        
        data_X = torch.tensor(data_X, dtype=torch.long)
        target_Y = torch.tensor(target_Y, dtype=torch.long)
        
        return data_X, target_Y
        

    def __len__(self):
        return len(self.sample)
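
The rolling window used in `LyricsDataset` is easier to picture on a short string. The sketch below mirrors the slicing logic in plain Python; `rolling_windows` is a hypothetical helper and the input string is made up.

```python
def rolling_windows(text, num_unrollings=10):
    """Return (input, target) pairs: num_unrollings - 1 characters and the one that follows."""
    pairs = []
    for i in range(len(text) - num_unrollings + 1):
        window = text[i:i + num_unrollings]
        pairs.append((window[:-1], window[-1]))
    return pairs

pairs = rolling_windows('abcdefg', num_unrollings=4)
for source, target in pairs:
    print(source, '->', target)
# prints:
# abc -> d
# bcd -> e
# cde -> f
# def -> g
```

A line of length n yields n - num_unrollings + 1 samples, and a line shorter than the window yields none.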

In [12]:
batch_size = 128

dataset = LyricsDataset(lyrics_list, word2index)

train_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)


In [13]:
len(dataset)

57223

In [14]:
dataset[2]

(tensor([ 2,  3,  4,  5,  6,  7,  8,  9, 10]), tensor(11))

In [15]:
# Build the model: inputs > nn.Embedding > nn.LSTM > nn.Dropout > take the last state > nn.Linear > softmax
class LM_LSTM(nn.Module):
    def __init__(self, n_hidden, vocab_size, num_layers, dropout_ratio):
        super(LM_LSTM, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, n_hidden)
        
        self.Lstm = nn.LSTM(input_size=n_hidden, hidden_size=n_hidden*2, 
                            num_layers=num_layers, dropout=dropout_ratio)
        
        self.fc = nn.Linear(in_features=n_hidden*2, out_features=vocab_size)
        self.dropout = nn.Dropout(dropout_ratio)

    def forward(self, inputs):
        
        embedded = self.embedding(inputs)
        embedded = embedded.transpose(0, 1)  # [n_step, batch_size, n_hidden]; nn.LSTM is time-major by default
        
        outputs, (hidden, cell) = self.Lstm(embedded)
        
        outputs = self.dropout(outputs)  # [n_step, batch_size, n_hidden * 2]
        output = outputs[-1]  # last time step: [batch_size, n_hidden * 2]
        logits = self.fc(output)

        return logits

In [16]:
def train_batch(model, data, criterion, optimizer, device):
    model.train()
    inputs, targets = [d.to(device) for d in data]

    outputs = model(inputs)

    loss = criterion(outputs, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In [18]:
# Train the model
epochs = 100
lr = 0.001

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LM_LSTM(128, len(word2index), 2, 0.2)
model.to(device)

criterion = nn.CrossEntropyLoss(reduction='sum')  # sum over the batch; size_average=False is deprecated
criterion.to(device)

optimizer = optim.Adam(model.parameters(), lr=lr)


for epoch in range(1, 1 + epochs):
    tot_train_loss = 0
    tot_train_count = 0

    for train_data in train_loader:
        loss = train_batch(model, train_data, criterion, optimizer, device)

        tot_train_loss += loss
        tot_train_count += train_data[0].size(0)

    print('epoch ', epoch, 'train_loss: ', tot_train_loss / tot_train_count)

    if epoch % 10 == 0:
        model.eval()  # turn off dropout for the sample predictions
        with torch.no_grad():
            for idx in [0, 50, 99]:
                input_batch = dataset[idx][0].unsqueeze(0).to(device)
                predict = model(input_batch).argmax(dim=-1).item()
                print('Example: "{}"+"{}"'.format(dataset.sample[idx][:-1], index2word[predict]))

epoch  1 train_loss:  5.553820373473559
epoch  2 train_loss:  4.990153993702863
epoch  3 train_loss:  4.475222526804815
epoch  4 train_loss:  4.005738793288062
epoch  5 train_loss:  3.5782345234196757
epoch  6 train_loss:  3.183004314175352
epoch  7 train_loss:  2.823365365296653
epoch  8 train_loss:  2.4952560051687924
epoch  9 train_loss:  2.203695993543623
epoch  10 train_loss:  1.9366773782638806
Example: "摸不到的顏色 是否"+"就"
Example: " 只留下結果 時間"+"遺"
Example: "麼多的燦爛的夢 以"+"為"
epoch  11 train_loss:  1.710828801960338
epoch  12 train_loss:  1.5125062583120843
epoch  13 train_loss:  1.3371879967686064
epoch  14 train_loss:  1.1865977422253153
epoch  15 train_loss:  1.0542544796126718
epoch  16 train_loss:  0.9410625085084389
epoch  17 train_loss:  0.8456063230153976
epoch  18 train_loss:  0.7546933578132027
epoch  19 train_loss:  0.680472565837215
epoch  20 train_loss:  0.6125455360378599
Example: "摸不到的顏色 是否"+"叫"
Example: " 只留下結果 時間"+"偷"
Example: "麼多的燦爛的夢 以"+"為"
epoch  21 train_loss:  0.554

KeyboardInterrupt: 
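
The `train_loss` printed above is a per-sample value: the loss is summed within each batch, accumulated over the epoch, and divided by the number of samples seen. The plain-Python sketch below checks that this bookkeeping matches a simple mean even when the last batch is smaller. The `cross_entropy` helper and the toy batches are assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

# Two hypothetical mini-batches of (logits, target) pairs with different sizes
batches = [
    [([2.0, 0.5, 0.1], 0), ([0.2, 1.5, 0.3], 1)],
    [([0.0, 0.0, 3.0], 2)],
]

total_loss, total_count = 0.0, 0
for batch in batches:
    # Summed (not averaged) loss for this batch
    total_loss += sum(cross_entropy(logits, target) for logits, target in batch)
    total_count += len(batch)

per_sample_loss = total_loss / total_count
print(per_sample_loss)
```

As a rough reference point, a model that guessed uniformly over the 2101-character vocabulary would score about ln(2101) ≈ 7.65 on this scale.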

In [46]:
# Model inference: generate lyrics from a seed text
pre_text = '給我一首歌'
generate_len = 2
prob_threshold = 0.01

model.eval()  # disable dropout during generation
result = [word2index[c] for c in pre_text]
with torch.no_grad():
    for _ in range(generate_len):
        input_example = torch.tensor([result], dtype=torch.long, device=device)
        logit = model(input_example)

        probs = F.softmax(logit, dim=-1)

        # Zero out characters whose probability is below the threshold
        filtered = torch.where(probs > prob_threshold, probs, torch.zeros_like(probs))
        # torch.multinomial raises "invalid multinomial distribution (sum of probabilities <= 0)"
        # when every weight is zero, so fall back to the unfiltered distribution in that case
        if filtered.sum() > 0:
            probs = filtered
        predict = torch.multinomial(probs, 1).item()
        result.append(predict)

print(''.join([index2word[i] for i in result]))

In [45]:
# torch.multinomial needs at least one positive weight; an all-zero tensor reproduces the error
a = torch.tensor([[0., 0., 0., 0.]], dtype=torch.float)
torch.multinomial(a, 1)

RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)