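# Project 1: Training an LSTM Lyric Auto-Completer

## Project Goals
- Goal: train an LSTM model on the lyrics of Mayday (五月天) so that it can continue a lyric automatically
- About the data file mayday_lyrics.txt:
    - Each line holds the lyrics of one song
    - Punctuation has been removed, and spaces mark the breaks between phrases
- Generate lyric sequences from mayday_lyrics.txt
- Train an LSTM model on those sequences
- Given an opening fragment such as "給我一首歌" ("give me a song"), the LSTM predicts the next character; repeating this step fills in the lyrics automatically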

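## Implementation Hints
- STEP1: Read the lyrics from mayday_lyrics.txt
- STEP2: Build an index for every character
- STEP3: Build LyricsDataset with a rolling (sliding) window
- STEP4: Wrap LyricsDataset in a DataLoader
- STEP5: Build the LSTM model: inputs > nn.Embedding > nn.LSTM > nn.Dropout > take the output at the last time step > nn.Linear > softmax (the model itself returns logits; softmax is applied inside nn.CrossEntropyLoss during training and explicitly at inference)
- STEP6: Train the model and tune the hyperparameters
- STEP7: Run the demo: given pre_text, call the model repeatedly to predict the next character and generate lyrics
- (Advanced) STEP8: During the demo, sample the next character from the softmax probabilities instead of always taking the most likely one. This adds randomness and variety to the lyrics; a probability threshold can additionally filter out very unlikely characters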
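
The rolling window in STEP3 can be sketched in plain Python. The helper name `rolling_windows` and the toy line are assumptions made for illustration only; the last character of each window is the prediction target and everything before it is the input.

```python
def rolling_windows(text, window):
    """Return every contiguous substring of length `window` (hypothetical helper)."""
    return [text[i:i + window] for i in range(len(text) - window + 1)]

# Toy line, not taken from the corpus; the space counts as a regular character.
line = "abc def"
pairs = [(w[:-1], w[-1]) for w in rolling_windows(line, window=4)]
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A line shorter than the window yields no windows at all, so very short songs would contribute nothing to the dataset.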

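## Key Takeaways: What You Will Learn from This Project
- How to read and preprocess sequence data with a rolling window
- How to build an LSTM model with PyTorch
- How to train a language model
- How to sample randomly from a softmax distribution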
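
Sampling from a softmax distribution with a probability threshold can be illustrated without any deep-learning library. This is a minimal sketch: the function names, the made-up logits, and the fallback to the most likely index when the threshold removes every candidate are all assumptions of the sketch, not part of the notebook.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_threshold(logits, threshold, rng):
    """Sample an index, ignoring entries whose probability is <= threshold.

    Falls back to the most likely index if the threshold removes everything
    (an assumption of this sketch).
    """
    probs = softmax(logits)
    weights = [p if p > threshold else 0.0 for p in probs]
    if sum(weights) == 0.0:
        return max(range(len(probs)), key=probs.__getitem__)
    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [2.0, 1.0, 0.5, -4.0]  # made-up scores for four candidate characters
draws = [sample_with_threshold(logits, threshold=0.01, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(logits))})
```

The last candidate has a probability well below 0.01, so it is never drawn, while the other three appear roughly in proportion to their probabilities.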

In [1]:
import torch
import torch.nn.functional as F
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter

In [2]:
# data from: https://github.com/gaussic/Chinese-Lyric-Corpus
with open('mayday_lyrics.txt', encoding='utf-8') as f:
    lyrics_list = [line.strip() for line in f.readlines()]

In [3]:
# Build the vocabulary lookup tables (character <-> index)
cnt = Counter(''.join(lyrics_list))
word2index = {word: idx for idx, word in enumerate(cnt)}
index2word = {idx: word for word, idx in word2index.items()}

In [4]:
len(word2index)

2101
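
Encoding with `word2index` and decoding with `index2word` should be exact inverses. The toy example below uses made-up lines in place of the corpus to show the round trip, and also how one might check a seed text for characters that never occur in the corpus; a direct lookup such as `word2index[c]` raises `KeyError` for those.

```python
from collections import Counter

# Made-up stand-in for lyrics_list; the notebook itself reads mayday_lyrics.txt.
toy_lyrics = ["la la land", "do re mi"]

cnt = Counter(''.join(toy_lyrics))
word2index = {word: idx for idx, word in enumerate(cnt)}
index2word = {idx: word for word, idx in word2index.items()}

# Round trip: characters -> indices -> characters.
encoded = [word2index[c] for c in "la la"]
decoded = ''.join(index2word[i] for i in encoded)
print(encoded, decoded)

# Characters that never occur in the corpus have no index.
seed = "la la xo"
unknown = sorted({c for c in seed if c not in word2index})
print(unknown)
```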

In [5]:
# Build the dataset
class LyricsDataset(Dataset):
    def __init__(self, lyrics_list, word2index, num_unrollings=10):
        self.word2index = word2index
        self.samples = []
        # Slide a window of num_unrollings characters over each song;
        # songs shorter than the window contribute no samples.
        for lyrics in lyrics_list:
            for idx in range(len(lyrics) - num_unrollings + 1):
                self.samples.append(lyrics[idx:idx + num_unrollings])

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # The first num_unrollings - 1 characters are the input; the last one is the target.
        input_lyric = torch.LongTensor([self.word2index[w] for w in sample[:-1]])
        output_lyric = torch.LongTensor([self.word2index[sample[-1]]])

        return input_lyric, output_lyric

    def __len__(self):
        return len(self.samples)

In [6]:
batch_size = 128

dataset = LyricsDataset(lyrics_list, word2index)
train_loader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True
)

In [7]:
# Build the model
class LSTM_LM(nn.Module):
    def __init__(self, vocab_size, n_hidden, num_layers, dropout_ratio):
        super(LSTM_LM, self).__init__()
        self.embed = nn.Embedding(vocab_size, n_hidden)
        self.lstm = nn.LSTM(input_size=n_hidden,
                            hidden_size=n_hidden,
                            num_layers=num_layers,
                            batch_first=True,
                            dropout=dropout_ratio)
        self.fc = nn.Linear(n_hidden, vocab_size)
        self.dropout = nn.Dropout(dropout_ratio)

    def forward(self, inputs):
        embed = self.embed(inputs)  # [batch_size, num_unrollings - 1, n_hidden]
        outputs, _ = self.lstm(embed)
        outputs = self.dropout(outputs)
        output = outputs[:,-1]  # [batch_size, n_hidden]
        logits = self.fc(output)

        return logits

In [8]:
def train_batch(model, data, criterion, optimizer, device):
    model.train()
    inputs, targets = [d.to(device) for d in data]
    outputs = model(inputs)
    loss = criterion(outputs, targets.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

In [9]:
# Train the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

epochs = 100
lr = 0.001

model = LSTM_LM(len(word2index), 128, 2, 0.2)
model.to(device)

criterion = nn.CrossEntropyLoss(reduction='sum')
criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)

for epoch in range(1, 1 + epochs):
    tot_train_loss = 0
    tot_train_count = 0

    for train_data in train_loader:
        loss = train_batch(model, train_data, criterion, optimizer, device)
        tot_train_loss += loss
        tot_train_count += train_data[0].size(0)
    print(f"epoch {epoch}, train_loss: {tot_train_loss / tot_train_count}")

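    if epoch % 10 == 0:
        # Switch off dropout and gradient tracking while printing sample predictions;
        # train_batch() calls model.train() again on the next batch.
        model.eval()
        with torch.no_grad():
            for idx in [0, 50, 99]:
                input_batch = dataset[idx][0].unsqueeze(0).to(device)
                predict = model(input_batch).argmax(dim=-1).item()
                print(f"Example: {dataset.samples[idx][:-1]}+{index2word[predict]}")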

epoch 1, train_loss: 5.665972850566171
epoch 2, train_loss: 5.25386716693558
epoch 3, train_loss: 4.980346700753231
epoch 4, train_loss: 4.73194461736047
epoch 5, train_loss: 4.503753290925858
epoch 6, train_loss: 4.296388218150082
epoch 7, train_loss: 4.0994421856577885
epoch 8, train_loss: 3.9187318455128612
epoch 9, train_loss: 3.746693789387104
epoch 10, train_loss: 3.5832426848032495
Example: 摸不到的顏色 是否+是
Example:  只留下結果 時間+世
Example: 麼多的燦爛的夢 以+為
epoch 11, train_loss: 3.4247585837473506
epoch 12, train_loss: 3.280980289677152
epoch 13, train_loss: 3.139299197975509
epoch 14, train_loss: 3.006912438536116
epoch 15, train_loss: 2.886356460821678
epoch 16, train_loss: 2.7607782600314437
epoch 17, train_loss: 2.6641395160132473
epoch 18, train_loss: 2.55301278853636
epoch 19, train_loss: 2.4545082605395705
epoch 20, train_loss: 2.372064666350731
Example: 摸不到的顏色 是否+叫
Example:  只留下結果 時間+無
Example: 麼多的燦爛的夢 以+為
epoch 21, train_loss: 2.2735611999401324
epoch 22, train_loss: 2.20059407402534
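
One way to read these numbers: the criterion sums the cross-entropy over each batch and the loop divides by the number of samples, so the printed value is the mean cross-entropy per predicted character, in nats. The sketch below compares it with the loss of a uniform guess over the 2101-character vocabulary and converts it to perplexity. The helper name `perplexity` is an assumption; the loss values are copied from the log above.

```python
import math

vocab_size = 2101                    # len(word2index) reported earlier
loss_epoch_1 = 5.665972850566171     # from the training log
loss_epoch_22 = 2.20059407402534     # from the training log

def perplexity(mean_nll):
    """Perplexity for a mean negative log-likelihood given in nats."""
    return math.exp(mean_nll)

uniform_loss = math.log(vocab_size)  # loss of guessing every character equally
print(f"uniform baseline: {uniform_loss:.3f} nats")
print(f"epoch 1 perplexity:  {perplexity(loss_epoch_1):.1f}")
print(f"epoch 22 perplexity: {perplexity(loss_epoch_22):.1f}")
```

These are training losses on data the model has already seen; the notebook holds out no validation set, so they do not measure how well the model generalizes.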

In [10]:
# Model inference
pre_text = '給我一首歌'
generate_len = 50
prob_threshold = 0.01

result = [word2index[c] for c in pre_text]
model.eval()  # disable dropout during generation
with torch.no_grad():
    for _ in range(generate_len):
        input_example = torch.LongTensor([result]).to(device)
        logit = model(input_example)

        prob = F.softmax(logit, dim=-1)
        # Zero out characters below the threshold; torch.multinomial accepts unnormalized weights
        probs = torch.where(prob > prob_threshold, prob, torch.zeros_like(prob))
        predict = torch.multinomial(probs, 1).item()
        result.append(predict)

print(''.join([index2word[i] for i in result]))

給我一首歌路上這裡面夢 我想要對對你 知道 突然 氣情的空衣 雨家最後還是我的的永言 愛我 始嘛的手 我的處我
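
The generation cell passes the entire text produced so far to the model, whereas training only ever showed it contexts of num_unrollings - 1 = 9 characters. Whether limiting the context improves the output is an empirical question. The sketch below shows the loop structure with an explicit context limit; it deliberately replaces the LSTM with a stand-in predictor that picks a random follower of the last character, and every name in it is hypothetical.

```python
import random

def generate(seed, next_char, length, context_len):
    """Append `length` characters, feeding only the last `context_len` to the predictor."""
    text = list(seed)
    for _ in range(length):
        context = text[-context_len:]
        text.append(next_char(context))
    return ''.join(text)

# Stand-in predictor (not the LSTM): remembers which characters follow which.
corpus = "la la land la la"  # made-up text
followers = {}
for a, b in zip(corpus, corpus[1:]):
    followers.setdefault(a, []).append(b)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def next_char(context):
    # Falls back to a space for characters with no recorded follower.
    return rng.choice(followers.get(context[-1], [' ']))

print(generate("la", next_char, length=10, context_len=9))
```

With the real model, `next_char` would encode the context, run the forward pass, and sample from the softmax output.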
