# Predicting Disaster Tweets with a Simple LSTM Model
Anthony Lee 2024-11-26

[GitHub archive](https://github.com/anthropikos/kaggle_kernels/blob/40d7c483c8f0d57a5b4708e9a126f428c79cdda6/lstm-tweets.ipynb)

[Kaggle link](https://www.kaggle.com/code/anthonyylee/lstm-tweets)

## Abstract
This notebook documents a simple LSTM model that predicts whether each tweet in a dataset pertains to a disaster. Unlike a bag-of-words approach, an LSTM carries a hidden state from one token to the next and can therefore learn from the order of the words. Each tweet is read from a CSV file, then tokenized, lemmatized, and vectorized with SpaCy's `en_core_web_lg` pipeline before being used to train the LSTM model built in PyTorch. Training is unbatched and CPU-based. The model takes approximately 7.5 minutes to train on a quad-core Intel Xeon 2.20 GHz CPU and achieves an F1 score of 0.803 on the test dataset. The notebook also discusses lessons learned during development and suggests future improvements.

## Methods
### Train / Validation Split
To detect overfitting, the training dataset was split 90/10: 90% of the data is used for training and the remaining 10% is held out for validation. The split is stratified on the target so that the proportion of disaster to non-disaster tweets is the same in the training set and the validation set.
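
To make the effect of stratification concrete, here is a minimal pure-Python sketch. It is not the notebook's implementation (the notebook relies on scikit-learn's `train_test_split` with `stratify`); the function name `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(targets, validation_fraction=0.1, seed=7):
    """Return (train_indices, validation_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group the row indices by their class label.
    indices_by_class = defaultdict(list)
    for index, target in enumerate(targets):
        indices_by_class[target].append(index)

    train_indices, validation_indices = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        # Hold out the same fraction of every class.
        n_validation = round(len(class_indices) * validation_fraction)
        validation_indices.extend(class_indices[:n_validation])
        train_indices.extend(class_indices[n_validation:])
    return train_indices, validation_indices


# Hypothetical toy labels: 60 non-disaster (0) and 40 disaster (1) tweets.
toy_targets = [0] * 60 + [1] * 40
train_idx, validation_idx = stratified_split(toy_targets)

train_disaster_share = sum(toy_targets[i] for i in train_idx) / len(train_idx)
validation_disaster_share = sum(toy_targets[i] for i in validation_idx) / len(validation_idx)
print(train_disaster_share, validation_disaster_share)
```

Both shares come out equal because the held-out fraction is taken from each class separately rather than from the dataset as a whole.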

### NLP Pipeline
SpaCy was used for text processing. Using the tools and pretrained model in the `en_core_web_lg` pipeline, each tweet is tokenized, lemmatized, and vectorized for consumption by the deep learning model. Lemmatization is included to shrink the vocabulary, since words largely retain their meaning when reduced to their lemma form. The `en_core_web_lg` pipeline is approximately 382 MB, with 685,000 keys and 343,000 unique vectors of 300 dimensions. The output of the pipeline for each tweet is thus a two-dimensional N x D array, where N is the number of tokens in the tweet and D is the dimension of each token vector (300).

### LSTM Architecture in PyTorch
The LSTM model is built in PyTorch and consists of one LSTM layer, one fully connected (FC) layer, and a sigmoid activation function. The LSTM layer accepts a sequence of 300-dimensional vectors of arbitrary length and has an output (hidden) dimension of ten. The FC layer takes the ten-dimensional output at the last time step and produces a single scalar. Finally, the scalar is passed through a sigmoid function, which maps it to the interval (0, 1) so that it can be read as the probability that the tweet pertains to a disaster.
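
As a rough size check on this architecture, the trainable parameters can be counted by hand. The sketch below assumes the parameter layout that PyTorch documents for `torch.nn.LSTM` (four gates, each with an input weight matrix, a recurrent weight matrix, and two bias vectors) and for `torch.nn.Linear`; the helper names are hypothetical.

```python
def lstm_parameter_count(input_size, hidden_size):
    """Parameters of a single-layer, unidirectional LSTM with biases enabled."""
    gates = 4  # input, forget, cell, and output gates
    input_weights = gates * hidden_size * input_size
    recurrent_weights = gates * hidden_size * hidden_size
    biases = 2 * gates * hidden_size  # one input bias and one hidden bias per gate
    return input_weights + recurrent_weights + biases


def linear_parameter_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


lstm_params = lstm_parameter_count(input_size=300, hidden_size=10)
fc_params = linear_parameter_count(in_features=10, out_features=1)
total_params = lstm_params + fc_params
print(lstm_params, fc_params, total_params)
```

Under these assumptions the model has on the order of twelve thousand parameters, almost all of them in the LSTM's input weights.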

Mean squared error (MSE) loss was used as the criterion, and Adaptive Moment Estimation (Adam) (Kingma & Ba, 2017) was used as the optimizer.
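
For a single tweet, the loss is just the squared difference between the sigmoid output and the 0/1 target. The following pure-Python sketch illustrates that calculation and its gradient with respect to the pre-sigmoid scalar; the function names and example values are assumptions for illustration, not code from the notebook.

```python
import math


def sigmoid(x):
    """Map a real-valued scalar to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def squared_error(probability, target):
    """MSE for a single prediction, which reduces to one squared difference."""
    return (probability - target) ** 2


def squared_error_gradient(logit, target):
    """Derivative of the loss with respect to the pre-sigmoid scalar (chain rule)."""
    p = sigmoid(logit)
    return 2.0 * (p - target) * p * (1.0 - p)


# Hypothetical FC-layer outputs for a tweet whose true label is 1 (disaster).
losses = []
for logit in (-2.0, 0.0, 2.0):
    probability = sigmoid(logit)
    losses.append(squared_error(probability, target=1))
    print(f"logit={logit:+.1f}  p={probability:.3f}  loss={losses[-1]:.3f}")
```

The loss shrinks as the predicted probability moves toward the target, and the gradient is what backpropagation passes back into the FC and LSTM layers.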


## Discussion
### CPU vs GPU
To benefit from GPU acceleration, the data should be batched so that multiple data points (i.e., tweets) are processed in parallel. Unlike the convolutional models often used in computer vision, an LSTM is a sequential model and cannot be meaningfully parallelized across the time steps of a single data point. I opted not to use the GPU because 1) the SpaCy `en_core_web_lg` pipeline is optimized for CPU, 2) copying data between CPU and GPU adds too much overhead for a dataset this small, and 3) batching the tweets proved to be a challenge.

### Batching vs Unbatching
Using a GPU requires batching the data, and batching requires padding because tweets vary in length. I chose to pad with a vector of zeros, which maps to a space character. When the dataset is padded this way, the model does not treat the padding as null and instead tries to interpret it. As a result, the model parameters were zeroed out during backpropagation and the model constantly predicted a value of zero, losing its predictive power.
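
One possible way around the padding problem is to remember each tweet's true length and read the recurrent output at the last real token instead of at the last padded position. The sketch below illustrates the idea in pure Python with a toy recurrence standing in for the LSTM; `pad_sequences`, `toy_recurrent_states`, and the example data are all hypothetical. PyTorch ships utilities such as `torch.nn.utils.rnn.pad_sequence` and `pack_padded_sequence` that may serve the same purpose, but they were not tested in this notebook.

```python
def pad_sequences(sequences, vector_size):
    """Right-pad each sequence with zero vectors and return the true lengths."""
    max_length = max(len(sequence) for sequence in sequences)
    zero_vector = [0.0] * vector_size
    padded = [list(seq) + [zero_vector] * (max_length - len(seq)) for seq in sequences]
    lengths = [len(sequence) for sequence in sequences]
    return padded, lengths


def toy_recurrent_states(sequence, decay=0.5):
    """Stand-in for an LSTM: one scalar state per time step."""
    state, states = 0.0, []
    for vector in sequence:
        state = decay * state + sum(vector)
        states.append(state)
    return states


# Hypothetical batch: a 1-token tweet and a 3-token tweet with 2-dimensional vectors.
tweets = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
padded, lengths = pad_sequences(tweets, vector_size=2)

last_padded_outputs, last_real_outputs = [], []
for sequence, length in zip(padded, lengths):
    states = toy_recurrent_states(sequence)
    last_padded_outputs.append(states[-1])        # polluted by the padding steps
    last_real_outputs.append(states[length - 1])  # output at the last real token

print(last_padded_outputs, last_real_outputs)
```

For the short tweet the two readings differ, because the state keeps changing while the model steps through the padding. Indexing by the true length avoids feeding that drifted state to the FC layer.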

### Hyperparameter tuning the learning rate
The Adam optimizer has several adjustable hyperparameters, the main one being the learning rate. Without systematic hyperparameter tuning, I tested a comically large learning rate of 300 and a small learning rate of 0.001. Because training is unbatched, the parameters are updated after every data point, so a large learning rate constantly overshoots the optimal parameter values. The small learning rate performs much better, and it is large enough that signs of overfitting start to show after one or two epochs.
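
The overshooting can be reproduced on a toy problem. The sketch below implements the Adam update rule from Kingma & Ba in pure Python and applies it to the one-dimensional function f(w) = (w - 1)^2, whose optimum is w = 1. The toy objective and the function name `adam_trajectory` are assumptions for illustration; the moment coefficients match PyTorch's default values for `torch.optim.Adam`.

```python
import math


def adam_trajectory(learning_rate, steps, start=0.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize f(w) = (w - 1)^2 with Adam and return every visited value of w."""
    w, m, v = start, 0.0, 0.0
    trajectory = [w]
    for t in range(1, steps + 1):
        gradient = 2.0 * (w - 1.0)
        m = beta1 * m + (1.0 - beta1) * gradient        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * gradient ** 2   # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)                  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
        trajectory.append(w)
    return trajectory


large = adam_trajectory(learning_rate=300.0, steps=5)
small = adam_trajectory(learning_rate=0.001, steps=5)
print("lr=300  :", [round(w, 3) for w in large])
print("lr=0.001:", [round(w, 5) for w in small])
```

Because Adam normalizes the gradient, the size of the first step is roughly the learning rate itself. A learning rate of 300 therefore jumps far past the optimum on the very first update, while 0.001 creeps toward it.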

![Train Loss vs Validation Loss](https://www.kaggleusercontent.com/kf/209127182/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..7RGNCq2PeJo-FZ6mmDNNaw.2p09SjEiaYSXXpWWOt3ECZ67mLuL-DN98x-jFIRiNrP8__SvwzHQZuh5P6k5kEYl3xK32WakvgjiA3nf-jGO3HT6n9KINmAxqn5KiMakeAttzyfUWC_w_gY8b968ZL2B6sBrA5wpote0ehGa0p5Kmte0_kPuEY7MVL0fpZ7pZwLoKqXwTX7vjpP-DlLGI3pWWHaxbKhwnIuJAg02BYu-2C04TIERG9i5TBvoYTCyA_gcehPeBc_NjrQIJQA18O_8UQ9oy9DcQwrDq5mrfNQOyDoeGbyxcSMNKLfJU-3T-W_DofiUHFqGAiWoVG1IbyLNAP4hth5PwVN0ilx_T-m_CX2cbbLuTggAvgUwfuYskTojP5KvfMi-asoozN-6zQen61ah29en5Y8OWUL3Jt52fW0h6hs7TINndcZfYx6vxdWf4h49CATELSA_tGiKxM89vsE7ZJ2oNc9Vn__AtkkxIrm4jRZfMEJso9YXcVIBcZBhB8xFxMNnFDhhUCEPYMce3DHgZXUW8eEG-yi4XpGEuVEt5mo1FWsSXbEYVuBqQPicmUX4ZbpjVISapvsQC__5tnYC6kd5j9cSB7R5E_HJ8wx5Yykbkf2ZcBCFnfB19KjAcsfkSourqysIVkNW1kKi.zCL0r__229ksbNBmi7e9sQ/__results___files/__results___7_1.png)

## Conclusion
In conclusion, the sequential nature of the LSTM makes parallelization more challenging, and given the overhead introduced when training on a GPU, training and inference on the CPU was the better approach. Even though padding tweets of varying lengths proved challenging, there are still options for batch training, such as halting the sequence early when a token of all zeros is encountered. Using the Adam optimizer and a small learning rate, the model achieves an F1 score of 0.803 with a training time of approximately 7.5 minutes. A few improvements that could push the performance further are proposed below. Still, this simple model shows that a model with memory can extract information that a simpler model such as bag-of-words cannot.

## Future Improvements
- Explore other vectorization methods
    - The pipeline model used is 382 MB and takes some time to download and load. For a more portable smart device such as an IoT device, the storage requirement could make this model infeasible. It would be worth looking into SpaCy's other pretrained pipelines or another encoding method.
- Explore the use of transformers
    - Transformers have shown tremendous progress in the field of NLP, and comparing their performance with this simple LSTM model is a valuable avenue to explore. Additionally, various pretrained transformer models have been published and shared freely as open source. Fine-tuning one of these published models on this dataset could improve performance further.
- Explore how to batch train and utilize GPU
    - Batching tweets of varying lengths has proved challenging and could lead to errors if proper precautions are not taken. One suggested approach is to pad all tweets with a specialized sentinel vector and stop the LSTM's sequential processing of a tweet when the sentinel is encountered.
- Explore how to deploy the trained model in an application that can be used to demo
    - Publish the model using some sort of interactive applet such that the model can be tested with other texts. 
- Improve data cleaning and extraction
    - In this notebook, the NLP processing was essentially offloaded to the SpaCy library and its methodologies. Certain special characters remained in the tweets and were vectorized accordingly. It would be valuable to create an alternative dataset with special characters such as emojis and emoticons removed, and to train the LSTM model on that data. Through this comparison we could estimate how much of the information indicating whether a tweet pertains to a disaster is carried by these non-traditional characters.
- Systematically tune the hyperparameters
    - Use a tool such as Ray Tune to systematically tune the hyperparameter values and measure the impact of the learning rate on the model's predictive power.
    - Also systematically tune and discover whether additional LSTM layers or LSTM hidden output dimensions could improve the model's performance.
- Permute the order of the training dataset
    - The order of the training dataset is static in this model, and because the parameters are updated after every tweet, the order in which the training data are introduced could affect the final model parameters. Permute the order of the dataset and test whether the model retains its predictive power.
- Test removing lemmatizer
    - Tweets often lack proper English grammar, so there may be value in training on non-lemmatized tweets.
- Test training with increased LSTM layers and hidden output dimensions
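
The dataset-order experiment from the list above could start from a sketch like the following, which draws a fresh, reproducible permutation of the dataset indices for every epoch. The generator name `epoch_orders` and the sizes used are hypothetical; in the notebook, the indices would be used to look up items in the training dataset.

```python
import random


def epoch_orders(dataset_size, num_epochs, seed=7):
    """Yield one shuffled list of dataset indices per epoch."""
    rng = random.Random(seed)  # Seeded so the experiment can be repeated
    for _ in range(num_epochs):
        order = list(range(dataset_size))
        rng.shuffle(order)
        yield order


orders = list(epoch_orders(dataset_size=8, num_epochs=3))
for epoch, order in enumerate(orders):
    print(f"epoch {epoch}: {order}")
```

Every epoch still visits each tweet exactly once. Only the sequence of parameter updates changes, which is the effect the experiment is meant to isolate.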

## References
Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization (No. arXiv:1412.6980). arXiv. https://doi.org/10.48550/arXiv.1412.6980


## Setting up the environment

In [None]:
# Install the needed language package
!python -m spacy download en_core_web_lg

In [None]:
import gc
from typing import List, Union, Iterable, Tuple, Dict
from multiprocessing import set_start_method, cpu_count
from pathlib import Path
import cProfile, pstats
from collections import namedtuple

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import spacy
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from tqdm import tqdm


def read_in_csv(data_dir_path:Path=None) -> Tuple:
    """Read in the dataset CSVs and return a tuple of dataframes."""
    
    OutputTuple = namedtuple("read_in_csv", ["train", "test", "submission_example"])
    
    if data_dir_path is None: 
        data_dir_path = Path("/kaggle/input")
    
    ## Read in the data
    df_train = pd.read_csv(data_dir_path / Path("nlp-getting-started/train.csv"))
    df_test = pd.read_csv(data_dir_path / Path("nlp-getting-started/test.csv"))
    df_sample_submission = pd.read_csv(data_dir_path / Path("nlp-getting-started/sample_submission.csv"))

    return OutputTuple(train=df_train, test=df_test, submission_example=df_sample_submission)


def train_validation_split(df_train:pd.DataFrame, validation_percentage:float=None) -> Tuple:
    """Train/Validation split the train dataframe."""
    OutputTuple = namedtuple("train_validation_split", ["train_datas", "validation_datas", "train_targets", "validation_targets"])

    if validation_percentage is None: 
        validation_percentage = 0.1
        
    train_percentage = 1 - validation_percentage    
    
    ## Simple Train/Validate split
    train_datas, validation_datas, train_targets, validation_targets = train_test_split(
        df_train.text.to_list(), 
        df_train.target.to_list(),
        train_size = train_percentage, 
        random_state = 7,  # For consistency
        stratify = df_train.target.to_list(),
    )

    return OutputTuple(train_datas=train_datas, 
                       validation_datas=validation_datas, 
                       train_targets=train_targets, 
                       validation_targets=validation_targets
                      )


def text_to_vector(documents:Iterable, n_process=None) -> List:
    """Iterate through a list of documents and returns a list of vectorized documents.
    
    Uses spaCy's en_core_web_lg model to transform each tweet into a 2D ndarray
    of floats. Each ndarray is token_count by 300, where 300 is the vector length
    of each token used in the spaCy model.
    """
    if not isinstance(documents, Iterable):
        raise TypeError("`documents` has to be a list of strings.")
        
    nlp = spacy.load("en_core_web_lg")
    
    vec_length = 300  # Token vector length in SpaCy
    
    if n_process is None: 
        n_process = cpu_count()
        
    docs = nlp.pipe(texts=documents, n_process=n_process, batch_size=50)
    
    holder_all_tweets = []
    
    for doc in docs:
        tweet_length = len(doc)
        doc_ndarray = np.zeros(shape=(tweet_length, vec_length), dtype=np.float64)

        for idx, token in enumerate(doc):
            doc_ndarray[idx, :] = token.vector
        holder_all_tweets.append(doc_ndarray)
        
    return holder_all_tweets


class DisasterTweetDataset(Dataset):
    def __init__(self, vectorized_tweets:Iterable, targets:Iterable) -> None:
        self.vectorized_tweets = vectorized_tweets
        self.targets = targets
        
        self.__data_validation()

    def __len__(self) -> int: 
        return len(self.targets)
    
    def __getitem__(self, idx:int) -> Tuple[List, List]:
        ReturnedResult = namedtuple("disaster_tweet", ["target", "vectorized_tweet"])

        target = self.targets[idx]
        vectorized_tweet = self.vectorized_tweets[idx]
        
        return ReturnedResult(target=target, vectorized_tweet=vectorized_tweet)

    def __data_validation(self) -> None:
        if len(self.vectorized_tweets) != len(self.targets): 
            raise ValueError(f"The data counts do NOT match, got {len(self.vectorized_tweets)} and {len(self.targets)}")


def checkpoint_save(model, optimizer, epoch, training_loss, validation_loss, dir_path=None):
    """Save the model checkpoint to current working directory."""
    
    #from datetime import datetime, timezone
    #timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M_%S")
    #filename = f"torch_checkpoint_epoch_{epoch}_{timestamp}.checkpoint"
    filename = f"checkpoint_epoch_{epoch}.checkpoint"
    
    if dir_path is None: 
        dir_path = Path.cwd()
    else: 
        dir_path = Path(dir_path)
    
    dict_to_save = {
        "epoch": epoch,
        "training_loss": training_loss,
        "validation_loss": validation_loss,
        
        "model_class": model.__class__,
        #"model_class_name": model.__class__.__name__,
        "model_state_dict": model.state_dict(),
        
        "optimizer_class": optimizer.__class__,
        #"optimizer_class_name": optimizer.__class__.__name__,
        "optimizer_state_dict": optimizer.state_dict(),
    }
    
    torch.save(dict_to_save, dir_path/filename)
    
    return dict_to_save
    

def checkpoint_load(file_path):
    """Instantiates and loads the state of the model and optimizer from the checkpoint."""

    ## TODO: Need to fix this loading function such that when LSTM model have different 
    ##      architecture it would still work. For example, more than 1 LSTM layers and/or more
    ##      than 10 hidden output dimensions. Currently assumes default architecture.
    
    CheckpointLoad = namedtuple("checkpoint_load", ["model", "optimizer", "training_loss", "validation_loss"])
    file_path = Path(file_path)
    checkpoint = torch.load(file_path, weights_only=False)

    # Create and load model state dict
    model = checkpoint["model_class"]()  # Instantiate using the class name
    model.load_state_dict(checkpoint["model_state_dict"])

    # Create and load optimizer state dict
    optimizer = checkpoint["optimizer_class"](model.parameters())  # Instantiate using the class name
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

    # Load loss information
    training_loss = checkpoint["training_loss"]
    validation_loss = checkpoint["validation_loss"]

    return CheckpointLoad(model=model, optimizer=optimizer, training_loss=training_loss, validation_loss=validation_loss)

    
def plot_train_validation_loss(avg_training_loss:np.ndarray, avg_validation_loss:np.ndarray) -> mpl.axes.Axes:
    assert len(avg_training_loss) == len(avg_validation_loss), f"Training loss and validation loss arrays should have the same length, got {len(avg_training_loss)} and {len(avg_validation_loss)}"
    
    fig, ax = plt.subplots()
    marker_size = 10

    ax.set_title("Avg training and validation loss for each epoch")
    ax.set_ylabel("Avg [training | validation] loss")
    ax.set_xlabel("Epoch")

    num_of_epochs = len(avg_training_loss)
    
    ax.scatter(range(num_of_epochs), avg_training_loss, color="C1", s=marker_size, label="avg training loss")
    ax.scatter(range(num_of_epochs), avg_validation_loss, color="C2", s=marker_size ,label="avg validation loss")
    
    ax.plot(avg_training_loss, "--", alpha=0.3, color="C1")
    ax.plot(avg_validation_loss, "--", alpha=0.3, color="C2")

    ax.legend(loc="upper right")
    ax.grid(visible=True, which="both", axis="both", alpha=0.2)
    return ax


def predict_test_data_for_submission(model:torch.nn.Module, df_test:pd.DataFrame, save:bool=None) -> List:
    """Convenience function to predict for submission."""
    if save is None:
        save = False

    test_datas_vectorized = text_to_vector(df_test.text.to_list())
    
    model.train(False)
    holder = []
    
    for item in test_datas_vectorized: 
        prediction = model(torch.tensor(item)).detach()
    
        # Convert probability to categorical label
        if prediction > 0.5: 
            prediction = 1
        else:
            prediction = 0
        
        holder.append(prediction)

    if save is True:
        submission = pd.DataFrame({"id": df_test.id, "target": holder})
        submission.to_csv("/kaggle/working/submission.csv", index=False)

    return holder


class SimpleLSTM(torch.nn.Module):
    
    def __init__(self, input_size:int=None, lstm_num_layers:int=None, lstm_hidden_size:int=None) -> None:
        """Simple LSTM model.
        
        Structure:
            LSTM-Layer(s) > Dense-Layer > Sigmoid activation function

        - Unable to train in batches as each tweet (or document) has varied length.
        - Batch normalization is not really needed because outputs have a sigmoid activation function.
        """
        super().__init__()
        
        # Defaults
        if input_size is None: 
            input_size = 300  # SpaCy vector size are 300
        if lstm_num_layers is None: 
            lstm_num_layers = 1
        if lstm_hidden_size is None: 
            lstm_hidden_size = 10  # LSTM output to have 10 dimensions
        target_output_size = 1
        
        # Variables
        self.input_size = int(input_size)
        self.lstm_num_layers = int(lstm_num_layers)
        self.lstm_hidden_size = int(lstm_hidden_size)
        self.target_output_size = int(target_output_size)
        self.dtype = torch.float64
        
        # Layers
        self.layer_lstm = torch.nn.LSTM(
            input_size=self.input_size, 
            hidden_size=self.lstm_hidden_size,
            num_layers=self.lstm_num_layers,
            bias=True,
            batch_first=True,  # Batch first is more natural, but hidden and cell state outputs are not batch first (see PyTorch documentation)
            dropout=0.0,
            bidirectional=False,
            dtype=self.dtype,
        )
        
        # Fully connected layer
        self.layer_fc = torch.nn.Linear(
            in_features=self.lstm_hidden_size,
            out_features=self.target_output_size,
            bias=True,
            dtype=self.dtype,
        )
        
        # Final sigmoid layer
        self.layer_sigmoid = torch.nn.Sigmoid()  # transforms to probability space
        self.to(dtype=torch.float64)  # self.double() also works
        
        return

    
    def forward(self, input_data:torch.Tensor) -> torch.Tensor:
        """Forward calculation, call the module instance instead.
        
        Even though this method is defined, one should call the module instance instead to make
        sure that all the registered hooks are taken care of.
        Source: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.forward
        """
        
        # LSTM layer
        # - last_layer_output: the last layer output for each sequence of the input sequence (tokens of a sentence)
        # - Output vs h_n: The former only has output for the last LSTM layer, whereas the latter has output for all LSTM layers.
        last_layer_output, (h_n, c_n) = self.layer_lstm(input_data)
        if last_layer_output.dim() == 3: 
            last_layer_output = last_layer_output[:, -1, :]  # All batches; Last sequence output; hidden size
        elif last_layer_output.dim() == 2:
            last_layer_output = last_layer_output[-1, :]  # When no batches - Last seq output; hidden size
        else:
            raise ValueError(f"Output of LSTM layer expected to be either 3 or 2 dimensions (got {last_layer_output.dim()} dimensions.)")

        # Dense fully connected layer
        output = torch.squeeze(last_layer_output)     
        output = self.layer_fc(output)
        
        # Sigmoid activation function
        output = self.layer_sigmoid(output)
        
        return output


def validation_loss(model:torch.nn.Module, criterion:torch.nn.modules.loss._Loss, validation_dataset:torch.utils.data.Dataset):
    """Calculate validation loss and return last loss and running loss of the model."""
    
    BatchedValidationLoss = namedtuple("batched_validation_loss", ["last_loss", "running_loss"])
    
    last_validation_loss = 0
    running_validation_loss = 0

    model.train(False)  # Eval mode

    with torch.no_grad():
        for idx_data, (validation_target, validation_data) in enumerate(validation_dataset):

            # Inference
            prediction = model(torch.tensor(validation_data))
            
            # Calculate loss
            loss = criterion(prediction, torch.tensor(validation_target).double())
            
            # Keep track of the loss
            last_validation_loss = loss.detach()  # Solves memory leak - Or use loss.item()
            running_validation_loss += last_validation_loss
            
    return BatchedValidationLoss(last_loss=last_validation_loss, running_loss=running_validation_loss)


def train_SimpleLSTM(model:torch.nn.Module, optimizer:torch.optim.Optimizer, criterion:torch.nn.modules.loss._Loss, 
                    train_dataset:torch.utils.data.Dataset):
    """Model training function, and returns the last loss and running loss as a tuple."""
    
    BatchedTrainLoss = namedtuple("batched_train_loss", ["last_loss", "running_loss"])
    last_train_loss = 0
    running_train_loss = 0

    model.train(True)  # Training mode
    
    # Training loop
    for idx_data, (train_target, train_data) in enumerate(tqdm(train_dataset, desc="    Training...", unit="Tweet", miniters=100)):
        #if idx_data == 5: break  # DEBUG

        # Forward prop
        model.zero_grad()  # Zero out the gradient
        optimizer.zero_grad()
        prediction = model(torch.tensor(train_data))
        
        # Calculate loss
        loss = criterion(prediction, torch.tensor(train_target).double())
        
        # Backward prop
        loss.backward()  # Calculate gradients after the loss is aggregated with the reduction strategy
        optimizer.step() # Update parameter with gradients
        
        # Keep track of loss
        last_train_loss = loss.detach()  # Solves memory leak - Or loss.item()
        running_train_loss += last_train_loss
        
    return BatchedTrainLoss(last_loss=last_train_loss, running_loss=running_train_loss)


## Data Processing
The data are read in from CSV, split into training and validation sets, vectorized, and then wrapped as PyTorch Datasets.

In [None]:
## Data Processing Steps ##

# (1) Read in the CSVs
df_train, df_test, df_sample_submission = read_in_csv()

# (2) Train / Validation split
train_datas, validation_datas, train_targets, validation_targets = train_validation_split(
    df_train=df_train, 
    validation_percentage=0.1
)

# (3) Vectorize the text datas so can be processed by the model
train_datas_vectorized = text_to_vector(train_datas)
validation_datas_vectorized = text_to_vector(validation_datas)
test_datas_vectorized = text_to_vector(df_test.text.to_list())

# (4) Turn them into dataset for convenience
disaster_tweet_dataset = DisasterTweetDataset(train_datas_vectorized, train_targets)
disaster_tweet_dataset_validation = DisasterTweetDataset(validation_datas_vectorized, validation_targets)


## Understanding the Training dataset
The training dataset includes id, keyword, location, text, and target information. However, the keyword and location columns are sparsely populated and contain many characters in other encodings that do not seem to carry valuable information for this task.

In [None]:
df_train.head()

In [None]:
## Pivot table for location and target
df_train.pivot_table(values='id', columns='target', index='location', aggfunc='count', dropna=False, fill_value=0)

In [None]:
## Pivot keyword vs target
df_train.pivot_table(values='id', columns='target', index='keyword', aggfunc='count', dropna=False, fill_value=0)

## Dataset distribution
There are more non-disaster tweets in this dataset than disaster ones, hence the importance of stratifying the two targets (disaster and non-disaster) when splitting them into the train/validation sets.

In [None]:
## Plotting target balance - disaster vs non-disaster
def pie_chart_label(pct, all_data): 
    output = pct/100*np.sum(all_data)
    return f"{output}"
    
fig, ax = plt.subplots()
target_label_mapping = {0: "Non-disaster", 1:"Disaster"}
series_target_counts = df_train.target.value_counts()

ax.pie( 
    x=series_target_counts.values, 
    labels=list(map(lambda label: target_label_mapping[label], 
                    series_target_counts.index.values)), 
    autopct=lambda pct: f"{round(pct, 3)}%",
)

ax.set_title("Tweet Target distribution")
plt.show()

## Train the Model
Below we train the model for 10 epochs with a learning rate of 0.001.

In [None]:
## Training the model!

# Some parameters
num_of_epochs = 10
learning_rate = 0.001

# Instantiating the model, optimizer, and criterion
lstm = SimpleLSTM()
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)
criterion = torch.nn.MSELoss(reduction="none")  # Non-batched, thus don't need reduction strategy

holder_avg_train_loss = np.zeros(shape=num_of_epochs, dtype=np.float64)
holder_avg_validation_loss = np.zeros(shape=num_of_epochs, dtype=np.float64)

# Epoch training loop
for idx_epoch in tqdm(range(num_of_epochs), desc="Training epoch...", unit="Epoch"):
    train_result = train_SimpleLSTM(lstm, optimizer, criterion, disaster_tweet_dataset)
    validation_result = validation_loss(lstm, criterion, disaster_tweet_dataset_validation)

    # Calculate avg train loss and validation loss for this epoch
    running_train_loss = train_result.running_loss
    running_validation_loss = validation_result.running_loss

    avg_train_loss = running_train_loss / len(disaster_tweet_dataset)
    avg_validation_loss = running_validation_loss / len(disaster_tweet_dataset_validation)

    # Add to holder
    holder_avg_train_loss[idx_epoch] = avg_train_loss
    holder_avg_validation_loss[idx_epoch] = avg_validation_loss

    # Checkpoint save
    checkpoint_save(lstm, optimizer, epoch=idx_epoch, 
                    training_loss=avg_train_loss, 
                    validation_loss=avg_validation_loss
                   )

# Plot results
ax = plot_train_validation_loss(holder_avg_train_loss, holder_avg_validation_loss)
fig = ax.get_figure()
fig.savefig("/kaggle/working/avg_train_validation_loss_trend.svg")
plt.show()


## Load model snapshot and inference test dataset for competition

In [None]:
# Submit to competition
checkpoint = checkpoint_load("/kaggle/working/checkpoint_epoch_3.checkpoint")  # Overfitting shows after the 3rd epoch
lstm = checkpoint.model
prediction = predict_test_data_for_submission(model=lstm, df_test=df_test, save=True)