# HW 4: Character-level RNN Language Modeling 
#### COSC 410: Spring 2024, Colgate University

In this homework you will be working with language data.

## Task

Your task is to build a character-level RNN language model. For debugging, a small plain-text file, `red_riding_hood.txt`, is included. Once your pipeline works, move on to `train.txt`, `valid.txt`, and `test.txt`.

In [None]:
# Load some helpful packages 
import pandas as pd
import torch
import numpy as np
import time 

## Part 0: Thinking through the task

How will this model differ from the model in lab? For example, what are you predicting, and what has to change in your model?

[**Written Answer**]

## Part 1: Load and Preprocess Data

[**Code Answer**] In this section, load your data, chunk it into sequence lengths, and one-hot encode it.
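
As a point of reference, here is a minimal sketch of one-hot encoding on a toy example. It uses `torch.nn.functional.one_hot`, which is one possible approach and not necessarily the one expected here; `toy_vocab` and `toy_chunks` are hypothetical names used only for illustration.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and chunks; hypothetical names, not part of the assignment.
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_chunks = ["abc", "cab"]

# Map characters to integer indices: shape (num_chunks, seq_len).
indices = torch.tensor([[toy_vocab[ch] for ch in chunk] for chunk in toy_chunks])

# One-hot encode: shape (num_chunks, seq_len, vocab_size).
one_hot = F.one_hot(indices, num_classes=len(toy_vocab)).float()

# Taking the argmax over the last dimension recovers the original indices.
recovered = one_hot.argmax(dim=-1)

print(one_hot.shape)
print(torch.equal(recovered, indices))
```

The round trip through `argmax` is the same idea the asserts in the cells below check for `oneHot` and `unHot`.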

In [None]:
def load_data(filename: str): 
    with open(filename, 'r') as f:
        text = f.read().replace('\n', ' ').lower()
    return text

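def get_vocab(text: str) -> dict:
    """Map each unique character to an integer index."""
    # Sort the characters so the mapping is identical across runs.
    # Iteration order over a set of strings can change between Python
    # sessions, which would scramble the vocabulary of a reloaded model.
    return {char: idx for idx, char in enumerate(sorted(set(text)))}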

def sequence(text: str, seqLen: int) -> np.ndarray:
    """Split text into contiguous, non-overlapping chunks of length seqLen."""
    data = []
    for i in range(0, len(text) - seqLen + 1, seqLen):  # +1 keeps a final full-length chunk
        data.append(text[i:i+seqLen])
    return np.array(data)

def encode_data(data, mapping: dict): 
    """ You implement """
    pass

def decode_data(data, mapping: dict):
    """ You implement """ 
    # Reverse mapping 
    pass

text = load_data('train.txt')
mapping = get_vocab(text)
data = sequence(text, 40)
# This assert should work if you've done things correctly
assert (data == decode_data(encode_data(data, mapping), mapping)).all()

In [None]:
# Make tensors and one-hot encode
def oneHot(data: np.ndarray, mapping: dict) -> torch.Tensor:
    """ You implement """
    pass

def unHot(data: torch.Tensor, mapping: dict) -> np.ndarray:
    """ You implement """
    pass

# This assert should work
assert (data == decode_data(unHot(oneHot(encode_data(data, mapping), mapping), mapping), mapping)).all()

## Part 2: Build an RNN

[**Code Answer**] In this section, build an RNN class for a character-level RNN.
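
The `generate` function in Part 5 calls `model.init_hidden(1)` and indexes `logits[:, -1, :]`, which suggests a batch-first layout of (batch, sequence, vocabulary). The sketch below only traces tensor shapes through `torch.nn.RNN` and a linear output layer under that assumption. It is not a model class, and the sizes are illustrative and not the assignment's hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
batch_size, seq_len, vocab_size = 4, 10, 27
hidden_size, num_layers = 16, 2

# batch_first=True makes inputs and outputs (batch, seq, features).
rnn = nn.RNN(vocab_size, hidden_size, num_layers, batch_first=True)
to_logits = nn.Linear(hidden_size, vocab_size)

x = torch.zeros(batch_size, seq_len, vocab_size)       # one-hot-shaped input
h0 = torch.zeros(num_layers, batch_size, hidden_size)  # initial hidden state

out, hn = rnn(x, h0)     # out: one hidden vector per time step
logits = to_logits(out)  # one score per vocabulary entry per time step

print(out.shape)
print(hn.shape)
print(logits.shape)
```

Note that the hidden state is always (layers, batch, hidden), even with `batch_first=True`.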

## Part 3: Train

[**Code Answer**] In this section, implement a train function.
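
One detail that often causes trouble in a training loop is the shape that the loss function expects. The sketch below assumes cross-entropy is the loss (the assignment does not prescribe one) and uses random stand-in tensors. `torch.nn.functional.cross_entropy` takes logits of shape (N, C) and class indices of shape (N,), so per-time-step logits are flattened and one-hot targets are converted back to indices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 4

# Stand-ins for model output and one-hot targets (illustrative only).
logits = torch.randn(batch_size, seq_len, vocab_size)
target_idx = torch.randint(0, vocab_size, (batch_size, seq_len))
one_hot_targets = F.one_hot(target_idx, num_classes=vocab_size).float()

# Flatten (batch, seq, vocab) -> (batch * seq, vocab) and
# (batch, seq, vocab) one-hot -> (batch * seq,) class indices.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    one_hot_targets.argmax(dim=-1).reshape(-1),
)

print(loss.item())
```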

In [None]:
# Encode data and set up y
fname = 'train.txt'
sequenceLength = 50
text = load_data(fname)
mapping = get_vocab(text)
data = oneHot(encode_data(sequence(text, sequenceLength), mapping), mapping)

# Find input/output 
# YOUR CODE HERE

In [None]:
nInput = X.shape[-1]
nHidden = 512
nLayers = 3
batchSize = 50
nEpochs = 2
lr = 0.1
model = RNNModel(nInput, nHidden, nLayers)
train(X, Y, model, nEpochs, batchSize, lr)

## Save and Load Model

In [None]:
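def save(model, outname):
    # Pickles the entire model object, so the model class must be
    # defined (or importable) when the file is loaded again.
    torch.save(model, outname)

def load(filename):
    # Recent PyTorch versions default torch.load to weights_only=True,
    # which rejects fully pickled models. Only load files you trust.
    return torch.load(filename, weights_only=False)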

modelName = 'draculaRNN.pt'
save(model, modelName)
model = load(modelName)

## Part 4: Evaluate your model

[**Code Answer**] Evaluate your model on the validation data, tuning the hyperparameters to get a better model. Once you have a tuned model, evaluate it on the test data.

[**Written Answer**] Look at the data split, and reflect (in your response) on what differentiates `test.txt` from `train.txt` and `valid.txt` (NOTE: `valid.txt` contains the final chapter of Dracula). 
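
The assignment does not name an evaluation metric. One common choice for language models is perplexity, the exponential of the average negative log-likelihood per character. The sketch below assumes you have already collected the probability the model assigned to each correct character; `perplexity` is a hypothetical helper, not part of the provided code.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities assigned to each correct character."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly over 4 characters vs. a more confident one.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(uniform, confident)
```

Lower is better: a uniform guess over a vocabulary of size V gives a perplexity of about V, so a trained model should land well below the vocabulary size.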

## Part 5: Generate

[**Code Answer**] Play around with your model's ability to generate language data using the `generate` function below.

[**Written Answer**] Explain in bullet points how `generate` works, experiment with different prefixes, and play with different temperatures so you can explain what temperature means (at a high level). 
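
To build intuition before experimenting, note that `generate` divides the logits by `temperature` before sampling. The sketch below applies the same scaling to a toy list of logits in plain Python and prints the resulting probabilities. `softmax_with_temperature` and `toy_logits` are hypothetical names used only for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits after dividing each one by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Compare how much probability the largest logit receives at each temperature, then check whether your generated text behaves the way that comparison suggests.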

In [None]:
@torch.no_grad()
def generate(model, mapping, prefix, num_chars, temperature):
    model.eval()
    result = prefix.lower()
    
    X = oneHot(encode_data([result], mapping), mapping)
    
    hidden = model.init_hidden(1)
    logits, hidden = model(X, hidden)

    for i in range(num_chars):
        dist = torch.distributions.Categorical(logits=logits[:,-1,:] / temperature)
        prediction = dist.sample()
        prediction = prediction[None, :]

        char = decode_data(prediction.numpy(), mapping)[0]
        result += char

        X = oneHot(prediction, mapping)
        logits, hidden = model(X, hidden)
    return result

print(generate(model, mapping, 'To stop and see people', 100, 1))