# Assignment 0, YuWei Hsu, 2024-06-20

## Question 1
Using TensorFlow, implement an Encoder-Decoder RNN model that converts a date string from one format (April 22, 2019) to another (2019-04-22). Generate the necessary train/test datasets.

### Step 1: Import Libraries

In [1]:
import tensorflow as tf
from datetime import date
import numpy as np

# Define constants
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

INPUT_CHARS = "".join(sorted(set("".join(MONTHS) + "0123456789, ")))
OUTPUT_CHARS = "0123456789-"
sos_id = len(OUTPUT_CHARS) + 1
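
The id scheme implied by these constants can be checked in isolation. The sketch below rebuilds the two vocabularies in plain Python. The helper names `encode` and `decode` and the constants `PAD_ID` and `SOS_ID` are hypothetical, introduced only for illustration; the sketch assumes that id 0 is reserved for padding and that character ids therefore start at 1.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
INPUT_CHARS = "".join(sorted(set("".join(MONTHS) + "0123456789, ")))
OUTPUT_CHARS = "0123456789-"
PAD_ID = 0                      # assumption: 0 is reserved for padding
SOS_ID = len(OUTPUT_CHARS) + 1  # one past the last output character id

def encode(text, chars):
    """Map each character to its 1-based position in the vocabulary."""
    return [chars.index(c) + 1 for c in text]

def decode(ids, chars):
    """Inverse of encode; the padding id is rendered as '?'."""
    return "".join(("?" + chars)[i] for i in ids)

input_ids = encode("April 22, 2019", INPUT_CHARS)
output_ids = encode("2019-04-22", OUTPUT_CHARS)

print(len(INPUT_CHARS), SOS_ID)          # vocabulary size and start-of-sequence id
print(output_ids)                        # ids for the target string
print(decode(output_ids, OUTPUT_CHARS))  # round trip back to text
```

Because the largest output character id is `len(OUTPUT_CHARS)`, the start-of-sequence id never collides with a real character or with the padding id.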


### Step 2: Data Generation

In [2]:
def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y

def date_str_to_ids(date_str, chars=INPUT_CHARS):
    return [chars.index(c) for c in date_str]

def prepare_date_strs(date_strs, chars=INPUT_CHARS):
    X_ids = [date_str_to_ids(dt, chars) for dt in date_strs]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
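    # Shift ids by 1 so that 0 is free to serve as the padding id, then
    # convert the ragged rows to a dense tensor (short rows are zero-padded).
    return (X + 1).to_tensor(shape=[None, None])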

def create_dataset(n_dates):
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, INPUT_CHARS), prepare_date_strs(y, OUTPUT_CHARS)

np.random.seed(42)
X_train, Y_train = create_dataset(10000)
X_valid, Y_valid = create_dataset(2000)
X_test, Y_test = create_dataset(2000)

def shifted_output_sequences(Y):
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)

X_train_decoder = shifted_output_sequences(Y_train)
X_valid_decoder = shifted_output_sequences(Y_valid)
X_test_decoder = shifted_output_sequences(Y_test)
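
To make the padding behaviour concrete, here is a minimal pure-Python sketch of what happens to two dates with month names of different lengths. The helpers `format_pair`, `encode`, and `pad_to_longest` are hypothetical stand-ins for the TensorFlow-based functions above, and right-padding with 0 is an assumption based on the `+ 1` id shift.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
INPUT_CHARS = "".join(sorted(set("".join(MONTHS) + "0123456789, ")))

def format_pair(dt):
    """Return the (input, target) strings for one date."""
    x = MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y")
    return x, dt.isoformat()

def encode(text, chars):
    """1-based character ids, leaving 0 free for padding."""
    return [chars.index(c) + 1 for c in text]

def pad_to_longest(rows, pad_id=0):
    """Right-pad every row with pad_id up to the longest row's length."""
    width = max(len(r) for r in rows)
    return [r + [pad_id] * (width - len(r)) for r in rows]

pairs = [format_pair(date(2019, 5, 25)), format_pair(date(2020, 9, 12))]
inputs = pad_to_longest([encode(x, INPUT_CHARS) for x, _ in pairs])

for (x, y), row in zip(pairs, inputs):
    print(f"{x!r} -> {y!r}, padded input length {len(row)}")
```

Input strings vary in length with the month name, so they need padding; the ISO targets always have 10 characters and need none.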

<div class="alert alert-block alert-info">
<b>Seq2Seq Model:</b> The encoder LSTM reads the input date string and summarizes it in its final hidden and cell states.
    The decoder LSTM is initialized with those states and generates the output date string one character at a time.
</div>

<div class="alert alert-block alert-info">
<b>Teacher Forcing:</b> Implemented by shifting the target sequences (shifted_output_sequences function) and using them as inputs to the decoder during training. This means that during training, the model learns to predict the next character in the sequence using the actual previous character, rather than its own previous predictions.
</div>
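
The shift described above can be illustrated without TensorFlow. This is a minimal sketch on a single sequence; `shift_right` is a hypothetical name, and the start-of-sequence id 12 assumes the 11-character output vocabulary with ids starting at 1.

```python
SOS_ID = 12  # assumption: len("0123456789-") + 1

def shift_right(target_ids, sos_id=SOS_ID):
    """Build the decoder input: SOS first, then the target minus its last id."""
    return [sos_id] + target_ids[:-1]

# Ids for "2019-04-22" under a 1-based mapping over "0123456789-"
target = [3, 1, 2, 10, 11, 1, 5, 11, 3, 3]
decoder_input = shift_right(target)

# At each step the decoder sees the true previous id and must predict the next.
for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder sees {seen:2d}, should predict {expected:2d}")
```

The decoder input and the target have the same length, so every position in the target has exactly one "previous character" to condition on.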

### Step 3: Model Building


In [3]:
encoder_embedding_size = 32
decoder_embedding_size = 32
lstm_units = 128

np.random.seed(42)
tf.random.set_seed(42)

encoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(INPUT_CHARS) + 1,
    output_dim=encoder_embedding_size)(encoder_input)
_, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(
    lstm_units, return_state=True)(encoder_embedding)
encoder_state = [encoder_state_h, encoder_state_c]

decoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(OUTPUT_CHARS) + 2,
    output_dim=decoder_embedding_size)(decoder_input)
decoder_lstm_output = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(
    decoder_embedding, initial_state=encoder_state)
decoder_output = tf.keras.layers.Dense(len(OUTPUT_CHARS) + 1,
                                       activation="softmax")(decoder_lstm_output)

model = tf.keras.Model(inputs=[encoder_input, decoder_input],
                       outputs=[decoder_output])

optimizer = tf.keras.optimizers.Nadam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

model.summary()


In [7]:
# Increase model complexity
encoder_embedding_size = 64
decoder_embedding_size = 64
lstm_units = 256

np.random.seed(2024)
tf.random.set_seed(2024)

encoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(INPUT_CHARS) + 1,
    output_dim=encoder_embedding_size)(encoder_input)
_, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(
    lstm_units, return_state=True)(encoder_embedding)
encoder_state = [encoder_state_h, encoder_state_c]

decoder_input = tf.keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = tf.keras.layers.Embedding(
    input_dim=len(OUTPUT_CHARS) + 2,
    output_dim=decoder_embedding_size)(decoder_input)
decoder_lstm_output = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(
    decoder_embedding, initial_state=encoder_state)
decoder_output = tf.keras.layers.Dense(len(OUTPUT_CHARS) + 1,
                                       activation="softmax")(decoder_lstm_output)

model = tf.keras.Model(inputs=[encoder_input, decoder_input],
                       outputs=[decoder_output])

optimizer = tf.keras.optimizers.Nadam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

model.summary()
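
As a sanity check on `model.summary()`, the trainable parameter counts can be worked out by hand. The sketch below assumes the default LSTM configuration (four gates, each with an input kernel, a recurrent kernel, and a bias) and a 38-character input vocabulary, which is what the `INPUT_CHARS` expression evaluates to. The function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, outputs):
    return input_dim * outputs + outputs

def seq2seq_params(n_input_chars, n_output_chars, embed_dim, units):
    """Per-layer parameter counts for the encoder-decoder defined above."""
    return {
        "encoder_embedding": embedding_params(n_input_chars + 1, embed_dim),
        "encoder_lstm": lstm_params(embed_dim, units),
        "decoder_embedding": embedding_params(n_output_chars + 2, embed_dim),
        "decoder_lstm": lstm_params(embed_dim, units),
        "dense": dense_params(units, n_output_chars + 1),
    }

small = seq2seq_params(38, 11, embed_dim=32, units=128)
large = seq2seq_params(38, 11, embed_dim=64, units=256)

for name, counts in [("32/128", small), ("64/256", large)]:
    print(name, counts, "total:", sum(counts.values()))
```

Under these assumptions, doubling the embedding size and the LSTM width roughly quadruples the total, because the recurrent kernels grow with the square of the number of units.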


### Step 4: Training the Model

In [5]:
history = model.fit([X_train, X_train_decoder], Y_train, epochs=20,
                    validation_data=([X_valid, X_valid_decoder], Y_valid))

Epoch 1/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.6148 - loss: 1.0657 - val_accuracy: 0.8136 - val_loss: 0.5483
Epoch 2/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.8570 - loss: 0.4369 - val_accuracy: 0.9659 - val_loss: 0.1704
Epoch 3/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.9736 - loss: 0.1416 - val_accuracy: 0.9925 - val_loss: 0.0671
Epoch 4/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.9967 - loss: 0.0497 - val_accuracy: 0.9977 - val_loss: 0.0340
Epoch 5/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.9993 - loss: 0.0248 - val_accuracy: 0.9997 - val_loss: 0.0190
Epoch 6/20
313/313 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.9859 - loss: 0.0734 - val_accuracy: 0.9995 - val_loss: 0.0187
Epoch 7/20
313/313

### Step 6: Inference and Prediction (Testing the Model)


In [8]:
def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    return ["".join([("?" + chars)[index] for index in sequence])
            for sequence in ids]

def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    max_input_length = X_train.shape[1]
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0], [0, max_input_length - X.shape[1]]])
    return X

max_output_length = Y_train.shape[1]

def predict_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    Y_pred = tf.fill(dims=(len(X), 1), value=sos_id)
    for index in range(max_output_length):
        pad_size = max_output_length - Y_pred.shape[1]
        X_decoder = tf.pad(Y_pred, [[0, 0], [0, pad_size]])
        Y_probas_next = model.predict([X, X_decoder])[:, index:index+1]
        Y_pred_next = tf.argmax(Y_probas_next, axis=-1, output_type=tf.int32)
        Y_pred = tf.concat([Y_pred, Y_pred_next], axis=1)
    return ids_to_date_strs(Y_pred[:, 1:])

test_dates = [
    "January 22, 2019", "February 13, 2020", "March 31, 2000",
    "April 16, 1978", "May 25, 1555", "June 07, 1943",
    "July 08, 1781", "August 04, 1234", "September 12, 1003",
    "October 17, 1996", "November 01, 2089", "December 30, 3001"
]
converted_dates = predict_date_strs(test_dates)
for original, converted in zip(test_dates, converted_dates):
    print(f"Original: {original} --> Converted: {converted}")


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step
Original: January 22, 2019 --> Converted: 2019-01-22
Original: February 13, 2020 --> Converted: 2020-02-13
Original: March 31, 2000 --> Converted: 2000-03-31
Original: April 16, 1978 --> Converted: 1978-04-16
Original: May 25, 1555 --> Converted: 1555-05-25
Original: June 07, 1943 --> Co

# Summary of Encoder-Decoder RNN for Date Conversion
In this project, we implemented an Encoder-Decoder Recurrent Neural Network (RNN) in TensorFlow that converts date strings from the format "Month DD, YYYY" to "YYYY-MM-DD". In the results shown, the sample dates are converted correctly, and per-character validation accuracy reaches 0.9997 by epoch 5.

### Key Steps Involved:

**1. Data Preparation:**

* **Random Date Generation:** We generated random dates between the years 1000 and 9999.
* **Date Format Conversion:** The generated dates were formatted into the required input and output string formats.
* **Character Encoding:** Input and output date strings were encoded into numerical representations using unique character indices.

**2. Dataset Creation:**

* Prepared training, validation, and test datasets of 10,000, 2,000, and 2,000 random dates, respectively.

**3. Model Construction:**

* **Encoder:** An LSTM-based encoder that reads the embedded input characters and returns its final hidden and cell states.
* **Decoder:** An LSTM-based decoder that starts from the encoder's final states and generates the output date string.
* **Embedding Layers:** Embedding layers were used for both encoder and decoder inputs to transform character indices into dense vectors.
* **Dense Output Layer:** A dense layer with softmax activation to output the probability distribution over the possible characters for each position in the sequence.

**4. Training:**

* Used the Nadam optimizer and the sparse categorical cross-entropy loss.
* Implemented teacher forcing by shifting the target sequences one step to the right and feeding them to the decoder.
* Trained the model for 20 epochs; validation accuracy already exceeded 0.99 by epoch 3.
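
For reference, sparse categorical cross-entropy at one output position is the negative log of the probability the model assigns to the correct character id. The sketch below is a plain-Python illustration with 12 classes (padding plus 11 output characters). The function names are hypothetical, and averaging over positions is a simplifying assumption rather than a description of Keras internals.

```python
import math

N_CLASSES = 12  # assumption: padding id plus the 11 output characters

def sparse_categorical_crossentropy(target_id, probs):
    """Loss for one position: -log of the probability given to the true id."""
    return -math.log(probs[target_id])

def sequence_loss(target_ids, probs_per_step):
    """Mean of the per-position losses over one sequence."""
    losses = [sparse_categorical_crossentropy(t, p)
              for t, p in zip(target_ids, probs_per_step)]
    return sum(losses) / len(losses)

# An untrained model that spreads probability evenly over all classes.
uniform = [1.0 / N_CLASSES] * N_CLASSES

# A confident model that puts 0.9 on the correct id (here id 3).
confident = [0.1 / (N_CLASSES - 1)] * N_CLASSES
confident[3] = 0.9

print(sparse_categorical_crossentropy(3, uniform))    # equals log(12)
print(sparse_categorical_crossentropy(3, confident))  # equals -log(0.9)
print(sequence_loss([3, 3], [uniform, confident]))
```

The targets are passed as integer ids rather than one-hot vectors, which is why the "sparse" variant of the loss fits the datasets prepared above.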

**5. Inference and Prediction:**

* Implemented a function that pads input sequences to the input length used during training.
* Used a loop that predicts one character per iteration for a fixed number of steps (the target length), feeding the characters predicted so far back into the decoder.
* Converted the predicted indices back to date strings.
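
The decoding loop can be sketched without a trained model. In the example below, `toy_predict` is a hypothetical stand-in for `model.predict`: it returns one probability row per decoder position and follows an arbitrary rule (next id = previous id mod 10, plus 1) so that the output visibly depends on what was generated so far. Only the structure of `greedy_decode` is meant to mirror the notebook's loop.

```python
SOS_ID = 12             # assumption: start-of-sequence id
PAD_ID = 0              # assumption: padding id
MAX_OUTPUT_LENGTH = 10  # length of "YYYY-MM-DD"
N_CLASSES = 12

def toy_predict(decoder_input):
    """Hypothetical model: one probability row per decoder position."""
    rows = []
    for prev in decoder_input:
        probs = [0.0] * N_CLASSES
        probs[prev % 10 + 1] = 1.0  # arbitrary rule, for illustration only
        rows.append(probs)
    return rows

def greedy_decode(predict, max_len=MAX_OUTPUT_LENGTH):
    """Generate max_len ids, always taking the most probable next id."""
    generated = [SOS_ID]
    for index in range(max_len):
        # Pad the partial output so the decoder input has a fixed length.
        padded = generated + [PAD_ID] * (max_len - len(generated))
        probs = predict(padded)[index]  # only position `index` is read
        next_id = max(range(len(probs)), key=probs.__getitem__)
        generated.append(next_id)
    return generated[1:]  # drop the SOS id

ids = greedy_decode(toy_predict)
print(ids)
print("".join(("?" + "0123456789-")[i] for i in ids))
```

Each iteration reruns the model on the whole partial sequence, which is simple but repeats work; that cost is acceptable here because the output is only 10 characters long.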

The implementation shows that a compact Encoder-Decoder RNN trained with teacher forcing can learn this date-format conversion from synthetically generated data.