# Project 1: Training an LSTM Automatic Lyric Generator

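## Project goals
- Goal: train an LSTM model on Mayday (五月天) lyrics so that it can generate lyrics automatically
- About the mayday_lyrics.txt data:
    - Each line contains the lyrics of one song
    - Punctuation has been removed, and spaces mark the breaks between phrases
- Use mayday_lyrics.txt to build sequences of lyrics
- Train an LSTM model on those sequences
- Given an opening snippet of lyrics, for example "給我一首歌", the LSTM predicts the next character; repeating this step generates lyrics automatically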

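## Implementation hints
- STEP1: Read the lyrics from mayday_lyrics.txt
- STEP2: Build an index for every character
- STEP3: Build LyricsDataset using a rolling window
- STEP4: Wrap LyricsDataset in a DataLoader
- STEP5: Build the LSTM model: inputs > nn.Embedding > nn.LSTM > nn.Dropout > take the last state > nn.Linear > softmax
- STEP6: Train the model and tune the hyperparameters
- STEP7: Run the demo: given pre_text, use the model to predict the next character iteratively and generate lyrics
- (Advanced) STEP8: In the demo, sample the next character at random according to the softmax probabilities. This adds randomness and gives the lyrics more variety. You can also apply a probability threshold to keep very unlikely characters from appearing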
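
The rolling approach in STEP3 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's solution: the helper name `rolling_windows` and the toy text are assumptions. Each window holds `num_unrollings` characters, and the last character of a window serves as the prediction target.

```python
def rolling_windows(text, num_unrollings=10):
    """Slide a window of `num_unrollings` characters over `text`, one step at a time."""
    return [text[i:i + num_unrollings]
            for i in range(len(text) - num_unrollings + 1)]


# Toy text (hypothetical, not taken from mayday_lyrics.txt)
windows = rolling_windows("abcdefg", num_unrollings=4)
for w in windows:
    # The leading characters are the model input, the last one is the target
    print(repr(w[:-1]), "->", repr(w[-1]))
```

A text shorter than the window yields no samples, so very short songs would simply be skipped under this scheme.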

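## Key takeaways: what you will learn from this project
- How to read and process sequence data that requires a rolling window
- How to build an LSTM model with PyTorch
- How to train a language model
- How to sample randomly from a softmax distribution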

In [None]:
import pandas as pd
import numpy as np

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.data import random_split
import torch.nn as nn
import torch.nn.functional as F
from tqdm.notebook import tqdm

In [None]:
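# from: https://github.com/gaussic/Chinese-Lyric-Corpus

# Each line holds the lyrics of one song (assuming the file is UTF-8 encoded)
with open('mayday_lyrics.txt', encoding='utf-8') as f:
    lyrics_list = [line.strip() for line in f]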

In [None]:
# Build the vocabulary lookup tables (each "word" here is a single character)
word2index = {}
index2word = {}

i = 0
for words in lyrics_list:
    for word in words:
        if word not in word2index:
            word2index[word] = i
            index2word[i] = word
            i += 1

In [None]:
len(word2index)

2101

In [None]:
# Build the dataset
class LyricsDataset(Dataset):
    def __init__(self, lyrics_list, word2index, num_unrollings=10):
        ## Code Here

    def __getitem__(self, idx):
        ## Code Here

    def __len__(self):
        return len(self.samples)
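
One possible way to fill in the dataset is sketched below. It is an assumption inferred from how the training loop uses `dataset[idx][0]` and `dataset.samples[idx][:-1]`, not the notebook's reference answer: the class name `RollingLyricsDataset` and the toy data are hypothetical. Each sample is a window of `num_unrollings` characters; the first characters become the input indices and the last one becomes the target.

```python
import torch
from torch.utils.data import Dataset


class RollingLyricsDataset(Dataset):
    """Hypothetical sketch: every sample is a window of `num_unrollings` characters."""

    def __init__(self, lyrics_list, word2index, num_unrollings=10):
        self.word2index = word2index
        self.samples = []
        for lyrics in lyrics_list:
            # Slide over each song one character at a time
            for i in range(len(lyrics) - num_unrollings + 1):
                self.samples.append(lyrics[i:i + num_unrollings])

    def __getitem__(self, idx):
        indices = [self.word2index[ch] for ch in self.samples[idx]]
        inputs = torch.tensor(indices[:-1], dtype=torch.long)  # all but the last character
        target = torch.tensor(indices[-1], dtype=torch.long)   # the last character
        return inputs, target

    def __len__(self):
        return len(self.samples)


# Toy data (assumed for illustration only)
toy_lyrics = ["abcdefgh"]
toy_word2index = {ch: i for i, ch in enumerate("abcdefgh")}
toy_dataset = RollingLyricsDataset(toy_lyrics, toy_word2index, num_unrollings=5)
inputs, target = toy_dataset[0]
print(len(toy_dataset), inputs.tolist(), target.item())
```

Because every window has the same length, the default DataLoader collation can stack the samples into a batch without padding.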

In [None]:
batch_size = 128

dataset = LyricsDataset(lyrics_list, word2index)

train_loader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True)

In [None]:
# Build the model
class LM_LSTM(nn.Module):
    def __init__(self, n_hidden, vocab_size, num_layers, dropout_ratio):
        super(LM_LSTM, self).__init__()
        ## Code Here

    def forward(self, inputs):
        ## Code Here

        return logits
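
The pipeline from STEP5 could be wired up roughly as below. This is a sketch under assumptions, not the notebook's reference solution: the class name `SketchLM` is hypothetical, the embedding size is set equal to `n_hidden`, and `batch_first=True` is chosen so that inputs have shape (batch, seq_len). The final softmax is left out of `forward` because `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally.

```python
import torch
import torch.nn as nn


class SketchLM(nn.Module):
    """Hypothetical sketch: Embedding > LSTM > Dropout > last state > Linear."""

    def __init__(self, n_hidden, vocab_size, num_layers, dropout_ratio):
        super().__init__()
        # Assumption: the embedding dimension equals the hidden size
        self.embedding = nn.Embedding(vocab_size, n_hidden)
        self.lstm = nn.LSTM(n_hidden, n_hidden, num_layers=num_layers, batch_first=True)
        self.dropout = nn.Dropout(dropout_ratio)
        self.linear = nn.Linear(n_hidden, vocab_size)

    def forward(self, inputs):
        embedded = self.embedding(inputs)          # (batch, seq_len, n_hidden)
        outputs, _ = self.lstm(embedded)           # (batch, seq_len, n_hidden)
        last_state = self.dropout(outputs)[:, -1]  # (batch, n_hidden)
        logits = self.linear(last_state)           # (batch, vocab_size)
        return logits


# Toy sizes (assumed for illustration only)
sketch_model = SketchLM(n_hidden=16, vocab_size=30, num_layers=2, dropout_ratio=0.2)
dummy_inputs = torch.randint(0, 30, (4, 9))  # batch of 4 sequences, 9 indices each
sketch_logits = sketch_model(dummy_inputs)
print(sketch_logits.shape)  # torch.Size([4, 30])
```

Since only the last time step feeds the linear layer, the model accepts input sequences of any length, which the demo relies on when the generated text keeps growing.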

In [None]:
def train_batch(model, data, criterion, optimizer, device):
    model.train()
    inputs, targets = [d.to(device) for d in data]

    outputs = model(inputs)

    loss = criterion(outputs, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In [None]:
# Train the model
epochs = 100
lr = 0.001

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LM_LSTM(128, len(word2index), 2, 0.2)
model.to(device)

# size_average=False is deprecated; reduction='sum' is the equivalent setting
criterion = nn.CrossEntropyLoss(reduction='sum')
criterion.to(device)

optimizer = optim.Adam(model.parameters(), lr=lr)


for epoch in range(1, 1 + epochs):
    tot_train_loss = 0
    tot_train_count = 0

    for train_data in train_loader:
        loss = train_batch(model, train_data, criterion, optimizer, device)

        tot_train_loss += loss
        tot_train_count += train_data[0].size(0)

    print('epoch ', epoch, 'train_loss: ', tot_train_loss / tot_train_count)

    if epoch % 10 == 0:
        model.eval()  # disable dropout for the example predictions
        with torch.no_grad():
            for idx in [0, 50, 99]:
                input_batch = dataset[idx][0].unsqueeze(0).to(device)
                predict = model(input_batch).argmax(dim=-1).item()
                print('Example: "{}"+"{}"'.format(dataset.samples[idx][:-1], index2word[predict]))



epoch  1 train_loss:  5.638955304613609
epoch  2 train_loss:  5.200973100893721
epoch  3 train_loss:  4.884452973550726
epoch  4 train_loss:  4.619896265843904
epoch  5 train_loss:  4.386603825503506
epoch  6 train_loss:  4.173271787438875
epoch  7 train_loss:  3.9685163989069663
epoch  8 train_loss:  3.7773809852131452
epoch  9 train_loss:  3.5979248391393033
epoch  10 train_loss:  3.430616509478376
Example: "摸不到的顏色 是否"+"唱"
Example: " 只留下結果 時間"+"有"
Example: "麼多的燦爛的夢 以"+"為"
epoch  11 train_loss:  3.273799685966239
epoch  12 train_loss:  3.122795784916941
epoch  13 train_loss:  2.9793283374761645
epoch  14 train_loss:  2.845751559415817
epoch  15 train_loss:  2.7248610255581696
epoch  16 train_loss:  2.6080863793743774
epoch  17 train_loss:  2.4989614634187785
epoch  18 train_loss:  2.391229770826993
epoch  19 train_loss:  2.2937631835545518
epoch  20 train_loss:  2.202239961635645
Example: "摸不到的顏色 是否"+"叫"
Example: " 只留下結果 時間"+"有"
Example: "麼多的燦爛的夢 以"+"為"
epoch  21 train_loss:  2.124041
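
The train_loss values printed above are per-sample averages: the criterion sums the loss over each batch (`size_average=False` is the legacy spelling of `reduction='sum'`), and the loop divides the running total by the number of samples seen. The small check below, with toy tensors assumed purely for illustration, shows that the summed loss divided by the sample count matches the mean loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_logits = torch.randn(6, 5)            # 6 samples, 5 classes (toy sizes)
toy_targets = torch.randint(0, 5, (6,))

sum_loss = nn.CrossEntropyLoss(reduction='sum')(toy_logits, toy_targets)
mean_loss = nn.CrossEntropyLoss(reduction='mean')(toy_logits, toy_targets)

# Summed loss divided by the number of samples equals the mean loss
print(torch.allclose(sum_loss / toy_logits.size(0), mean_loss))
```

Summing per batch and dividing once at the end keeps the epoch average exact even when the last batch is smaller than `batch_size`.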

In [None]:
# Model inference
pre_text = '給我一首歌'
generate_len = 50
prob_threshold = 0.01

model.eval()  # disable dropout while generating
result = [word2index[c] for c in pre_text]
for _ in range(generate_len):
    input_example = torch.tensor([result], dtype=torch.long, device=device)
    with torch.no_grad():
        logit = model(input_example)

    ## Code Here

    ## End
    result.append(predict)
print(''.join(index2word[i] for i in result))

給我一首歌的火 勇果我不斷不想要給你 一千個小快 每一分攏都樂的了在心風 遺憾的感動 是你的永然在幽上 天使像
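
The sampling step from STEP8 might look like the sketch below. It is one possible approach, not the notebook's answer: the helper name `sample_next_index` and the toy logits are assumptions. Probabilities below `prob_threshold` are zeroed out, and `torch.multinomial` then draws from the remaining weights, which it does not require to sum to one.

```python
import torch
import torch.nn.functional as F


def sample_next_index(logit, prob_threshold=0.01):
    """Sample one index from softmax(logit), ignoring entries below the threshold."""
    probs = F.softmax(logit.squeeze(0), dim=-1)
    # Zero out characters whose probability falls below the threshold
    filtered = torch.where(probs >= prob_threshold, probs, torch.zeros_like(probs))
    if filtered.sum() == 0:
        # Nothing passed the threshold: fall back to the most likely index
        return int(probs.argmax())
    # torch.multinomial accepts unnormalized, non-negative weights
    return int(torch.multinomial(filtered, num_samples=1))


# Toy logits with shape (1, vocab_size), assumed for illustration only
toy_logit = torch.tensor([[5.0, 4.0, -10.0, -10.0]])
sampled = {sample_next_index(toy_logit, prob_threshold=0.01) for _ in range(200)}
print(sampled)  # only indices 0 and 1 can appear
```

Inside the generation loop, the placeholder could then become `predict = sample_next_index(logit, prob_threshold)`. A higher threshold makes the output more conservative, while a threshold of 0 reduces to plain softmax sampling.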
