# Documentation: Generative Variational Autoencoder (VAE) for Molecular SMILES Generation

The notebook _generativeVAEV1_ implements a Variational Autoencoder (VAE) designed to learn a compressed representation of molecular structures (in SMILES format) and then use this learned representation to generate novel, chemically valid SMILES strings. 

## 1. Installing Dependencies: Why We Use RDKit

-   **RDKit**: A fundamental open-source cheminformatics toolkit. It is important for this project because it allows us to:
    -   Parse and interpret SMILES strings (Simplified Molecular Input Line Entry System), which are linear notations for molecular structures.
    -   Validate whether a generated SMILES string corresponds to a chemically plausible molecule.
    -   Perform various other molecular computations if needed (though primarily used for validation here).

If RDKit is already installed in your environment, the install command in this cell will simply confirm its presence.

## 2. Importing Libraries and the Role of Each

*(Corresponds to the cell with `import numpy as np`, etc.)*

This cell imports all the Python libraries and modules required for the notebook's functionality:

-   **`numpy` (as `np`)**: A cornerstone library for numerical computing in Python. It's used for efficient array operations, especially for handling the numerical data fed into and produced by the neural network.
-   **`pandas` (as `pd`)**: A powerful library for data manipulation and analysis. Here, it's primarily used to read the input dataset from a CSV file into a DataFrame.
-   **`tensorflow` (as `tf`)**: The core deep learning framework.
    -   **`keras` from `tensorflow`**: TensorFlow's high-level API for building and training neural networks.
    -   **`layers` from `tensorflow.keras`**: Contains the building blocks for neural networks.
    -   **`Model` from `tensorflow.keras`**: The class used to define a Keras model.
    -   **`Adam` from `tensorflow.keras.optimizers`**: An efficient gradient-based optimization algorithm.
    -   **`SparseCategoricalCrossentropy` from `tensorflow.keras.losses`**: A loss function suitable for multi-class classification problems where labels are integers.
    -   **`plot_model` from `tensorflow.keras.utils`**: A utility to create a visual plot of the Keras model architecture.
    -   **`ModelCheckpoint`, `TensorBoard` from `tensorflow.keras.callbacks`**: Callbacks that run at defined points during training. `ModelCheckpoint` saves the model, and `TensorBoard` writes logs for visualization.
-   **`Chem` from `rdkit`**: The core RDKit module for working with chemical structures.
-   **`os`**: For interacting with the operating system (e.g., creating directories).
-   **`pickle`**: For serializing and de-serializing Python objects (saving/loading the tokenizer).
-   **`matplotlib.pyplot` (as `plt`)**: A widely used plotting library.

## 3. Defining Constants and Hyperparameters

*(Corresponds to the cell defining `DATA_PATH`, `LATENT_DIM`, etc.)*

This section defines global constants for file paths and hyperparameters:

### File Paths:
-   **`DATA_PATH`**: Location of the input CSV file with SMILES strings.
-   **`TOKENIZER_PATH`**: Path to save/load the SMILES tokenizer.
-   **`MODEL_PATH`**: Base directory to save trained models.
-   **`LOGS_PATH`**: Directory for TensorBoard logs.

### Hyperparameters:
-   **`LATENT_DIM`**: Dimensionality of the VAE's latent space (e.g., 128). This determines the "bottleneck" size.
-   **`BATCH_SIZE`**: Number of samples processed before model weights are updated (e.g., 256).
-   **`EPOCHS`**: Number of full passes through the training dataset (e.g., 100).
-   **`MAX_LENGTH`**: Maximum length for SMILES strings after tokenization and padding (e.g., 120).

## 4. Creating Necessary Directories

*(Corresponds to the cell with `os.makedirs(...)`)*

This cell creates the directories for storing the tokenizer, models, logs, and generated molecules if they don't already exist. This prevents errors during file saving operations.
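
A minimal sketch of this step is shown below. The directory names are hypothetical (the notebook's actual paths are not reproduced here), and they are rooted in a temporary directory so the example is safe to run anywhere. The relevant detail is `exist_ok=True`, which makes the call a no-op when the directory already exists.

```python
import os
import tempfile

# Hypothetical layout; the notebook's real paths may differ.
base_dir = tempfile.mkdtemp()
directories = [
    os.path.join(base_dir, "tokenizer"),
    os.path.join(base_dir, "models"),
    os.path.join(base_dir, "logs"),
    os.path.join(base_dir, "generated"),
]

for directory in directories:
    # exist_ok=True suppresses the error raised when the directory exists.
    os.makedirs(directory, exist_ok=True)

# Running the same loop a second time is harmless.
for directory in directories:
    os.makedirs(directory, exist_ok=True)
```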

## 5. Loading the Dataset

*(Corresponds to the cell with `pd.read_csv(DATA_PATH)`)*

The molecular data (SMILES strings) is loaded from a CSV file using pandas. It's assumed the CSV has a column named 'smiles'.
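
The loading step might look like the sketch below. It reads from an in-memory string instead of `DATA_PATH` so that it runs without the dataset; the extra `name` column and the `dropna()` cleanup are assumptions added for illustration, not something the notebook is known to do.

```python
import io

import pandas as pd

# Stand-in for the file at DATA_PATH; only the 'smiles' column is assumed to exist.
csv_text = (
    "smiles,name\n"
    "CCO,ethanol\n"
    "c1ccccc1,benzene\n"
    ",missing\n"
    "CC(=O)O,acetic acid\n"
)

df = pd.read_csv(io.StringIO(csv_text))

if "smiles" not in df.columns:
    raise KeyError("expected a column named 'smiles'")

# Drop rows with a missing SMILES value and convert to a plain list of strings.
smiles_data = df["smiles"].dropna().astype(str).tolist()
```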

## 6. SMILES Tokenizer Class Definition

*(Corresponds to the `SMILESTokenizer` class definition)*

Neural networks require numerical input. This `SMILESTokenizer` class handles the conversion of textual SMILES strings into numerical sequences (tokenization) and back (detokenization).

### `__init__(self, max_length)`
-   Initializes dictionaries for character-to-integer (`char_to_int`) and integer-to-character (`int_to_char`) mappings.
-   Sets `max_length`.
-   Defines special tokens:
    -   `<pad>`: For padding shorter sequences.
    -   `<start>`: To mark the beginning of a sequence.
    -   `<end>`: To mark the end of a sequence.
    -   `<unk>`: For unknown characters not in the vocabulary.

### `fit(self, smiles_list)`
-   Builds the vocabulary from a list of SMILES strings.
-   Collects all unique characters and adds special tokens.
-   Creates `char_to_int` and `int_to_char` mappings.
-   Sets `vocab_size` (total number of unique tokens).

### `transform(self, smiles_list)`
-   Converts a list of SMILES strings into a NumPy array of padded integer sequences.
-   Each SMILES string is:
    1.  Prepended with `<start>` token.
    2.  Characters are mapped to integers (or `<unk>` if not in vocab).
    3.  Appended with `<end>` token.
    4.  Padded with `<pad>` token or truncated to `max_length`.

### `reverse_transform(self, sequences)`
-   Converts integer sequences back to SMILES strings.
-   For each sequence, integers are mapped back to characters, stopping at `<end>` or `<pad>`. `<start>` and `<unk>` tokens are not included in the final string.

### `save(self, filepath)` & `load(filepath)`
-   Methods to save and load the tokenizer object using `pickle`, allowing reuse without refitting.
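
The methods described above can be sketched as a character-level tokenizer. This is an illustration under stated assumptions, not the notebook's exact code: it assumes the special tokens are placed first in the vocabulary (so `<pad>` gets index 0) and that characters are sorted before indexing. The `save` and `load` methods are omitted.

```python
import numpy as np


class SMILESTokenizer:
    """Character-level tokenizer sketch following the description above."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.special_tokens = ["<pad>", "<start>", "<end>", "<unk>"]
        self.char_to_int = {}
        self.int_to_char = {}
        self.vocab_size = 0

    def fit(self, smiles_list):
        # Assumption: special tokens come first, so <pad> maps to index 0.
        chars = sorted(set("".join(smiles_list)))
        vocab = self.special_tokens + chars
        self.char_to_int = {tok: i for i, tok in enumerate(vocab)}
        self.int_to_char = {i: tok for tok, i in self.char_to_int.items()}
        self.vocab_size = len(vocab)

    def transform(self, smiles_list):
        pad = self.char_to_int["<pad>"]
        unk = self.char_to_int["<unk>"]
        out = np.full((len(smiles_list), self.max_length), pad, dtype=np.int32)
        for row, smiles in enumerate(smiles_list):
            ids = [self.char_to_int["<start>"]]
            ids += [self.char_to_int.get(ch, unk) for ch in smiles]
            ids.append(self.char_to_int["<end>"])
            # Truncation may cut off <end> for strings longer than max_length - 2.
            ids = ids[: self.max_length]
            out[row, : len(ids)] = ids
        return out

    def reverse_transform(self, sequences):
        stop = {"<end>", "<pad>"}
        skip = {"<start>", "<unk>"}
        results = []
        for seq in sequences:
            chars = []
            for idx in seq:
                tok = self.int_to_char.get(int(idx), "<unk>")
                if tok in stop:
                    break
                if tok not in skip:
                    chars.append(tok)
            results.append("".join(chars))
        return results


tokenizer = SMILESTokenizer(max_length=12)
tokenizer.fit(["CCO", "c1ccccc1"])
encoded = tokenizer.transform(["CCO", "CCN"])  # 'N' was never seen during fit
decoded = tokenizer.reverse_transform(encoded)
```

Because `N` is outside the fitted vocabulary, it is encoded as `<unk>` and then dropped on the way back, so the second string does not round-trip. A character-level vocabulary also splits two-letter element symbols such as `Cl` and `Br` into separate tokens.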

## 7. Initializing, Fitting, and Saving the Tokenizer

*(Corresponds to `tokenizer = SMILESTokenizer(...)`, `tokenizer.fit(...)`, `tokenizer.save(...)`)*

1.  An instance of `SMILESTokenizer` is created.
2.  The tokenizer is `fit` to the loaded `smiles_data` to build its vocabulary.
3.  The fitted tokenizer is saved to disk for later use.
4.  The vocabulary size is printed.

## 8. Transforming SMILES Data into Tokenized Sequences

*(Corresponds to `data_tokenized = tokenizer.transform(smiles_data)`)*

The `smiles_data` (list of SMILES strings) is converted into `data_tokenized` (a NumPy array of integer sequences) using the fitted tokenizer. This numerical data will be the input to the VAE.

## 9. Defining the Variational Autoencoder (VAE) Model

*(Corresponds to the VAE model definition, including Encoder, Sampling, and Decoder)*

A VAE learns a probabilistic mapping from input data to a lower-dimensional continuous latent space, and then back to the input space.

### 9.1. Sampling Layer (`Sampling` class)
The VAE encoder outputs parameters of a distribution (mean $z_{\mu}$ and log-variance $z_{\log \sigma^2}$) for each input. The `Sampling` layer draws a sample $z$ from this distribution using the **reparameterization trick**:
$$ z = z_{\mu} + \exp(0.5 \cdot z_{\log \sigma^2}) \cdot \epsilon $$
where $\epsilon$ is a random sample from a standard normal distribution $N(0, I)$. This allows gradients to flow through the stochastic sampling process.
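
The formula can be checked numerically with a small NumPy sketch. In the notebook this logic lives inside a Keras layer so that gradients can propagate; the NumPy version below has no gradients and only illustrates the arithmetic. The function name `sample_latent` is hypothetical.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng):
    """Return z = z_mean + exp(0.5 * z_log_var) * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps


rng = np.random.default_rng(0)

# Many copies of the same distribution: mean 3.0, variance 0.25 (std 0.5).
z_mean = np.full((10000, 2), 3.0)
z_log_var = np.full((10000, 2), np.log(0.25))

z = sample_latent(z_mean, z_log_var, rng)

# The empirical mean and standard deviation should be close to 3.0 and 0.5.
empirical_mean = z.mean()
empirical_std = z.std()
```

All randomness is isolated in `eps`, so `z` is a deterministic, differentiable function of `z_mean` and `z_log_var`. That is the property the trick relies on.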

### 9.2. Encoder Network
Maps an input tokenized SMILES sequence to the latent distribution parameters $z_{\mu}$ and $z_{\log \sigma^2}$.
-   **`Input`**: Defines input shape (`MAX_LENGTH`).
-   **`Embedding`**: Converts integer tokens into dense vectors of fixed size (e.g., 128 dimensions). `mask_zero=True` makes subsequent layers such as the LSTMs skip positions holding index 0, so it relies on the tokenizer assigning `<pad>` to index 0.
-   **`LSTM`**: Long Short-Term Memory layers process sequential data. Two LSTM layers are used here to capture features from the embedded sequences. The first LSTM has `return_sequences=True` to pass its full output sequence to the next LSTM. The second LSTM outputs only the final hidden state.
-   **`Dense` layers for $z_{\mu}$ and $z_{\log \sigma^2}$**: Fully connected layers that map the LSTM output to the `LATENT_DIM`-dimensional mean and log-variance vectors.

### 9.3. Decoder Network
Maps a sampled latent vector $z$ back to a sequence of tokens, aiming to reconstruct the input SMILES.
-   **`Input`**: Defines input shape (`LATENT_DIM` for the latent vector $z$).
-   **`RepeatVector(MAX_LENGTH)`**: Repeats the latent vector $z$ `MAX_LENGTH` times to provide an initial sequence input for the decoder's LSTMs.
-   **`LSTM`**: Two LSTM layers (with `return_sequences=True` for both) generate an output sequence from the repeated latent vector.
-   **`TimeDistributed(Dense)`**: Applies a `Dense` layer to every time step of the LSTM output. This layer has `tokenizer.vocab_size` units and `softmax` activation, producing a probability distribution over the vocabulary for each token in the output sequence.
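
To make the tensor shapes concrete, the sketch below walks through `RepeatVector` and the per-time-step softmax in plain NumPy. It is not the Keras implementation: the two LSTM layers are deliberately left out, the weights are random, and the sizes are small illustrative values.

```python
import numpy as np

LATENT_DIM, MAX_LENGTH, VOCAB_SIZE = 4, 6, 5  # small illustrative sizes
rng = np.random.default_rng(0)

# A batch of 2 latent vectors, shape (2, LATENT_DIM).
z = rng.standard_normal((2, LATENT_DIM))

# RepeatVector(MAX_LENGTH): copy each latent vector along a new time axis.
repeated = np.repeat(z[:, np.newaxis, :], MAX_LENGTH, axis=1)
# repeated has shape (2, MAX_LENGTH, LATENT_DIM)

# (The LSTM layers would transform `repeated` here; they are omitted.)

# TimeDistributed(Dense(VOCAB_SIZE, softmax)): one weight matrix shared by all steps.
W = rng.standard_normal((LATENT_DIM, VOCAB_SIZE))
b = np.zeros(VOCAB_SIZE)
logits = repeated @ W + b  # shape (2, MAX_LENGTH, VOCAB_SIZE)

# Numerically stable softmax over the vocabulary axis.
logits = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

Without the LSTMs, every time step receives the same input and therefore produces the same distribution. The recurrent layers are what let the output differ from position to position.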

### 9.4. VAE Model (Combined)
Connects the encoder, sampling layer, and decoder.
-   Input SMILES $\rightarrow$ Encoder $\rightarrow (z_{\mu}, z_{\log \sigma^2})$
-   $(z_{\mu}, z_{\log \sigma^2}) \rightarrow$ Sampling layer $\rightarrow z$
-   $z \rightarrow$ Decoder $\rightarrow$ Reconstructed SMILES (as probability distributions over tokens)

### 9.5. VAE Loss Function
The VAE is trained by minimizing a loss function comprising two terms:
1.  **Reconstruction Loss**: Measures how well the VAE reconstructs the input.
    -   Uses `SparseCategoricalCrossentropy` because the decoder outputs probability distributions for token classes, and the target is integer token indices.
    $$ L_{\text{reconstruction}} = -\sum_{t=1}^{T} \log p(x_t | z) $$
    (Typically implemented as cross-entropy over the sequence.)

2.  **KL Divergence (KLD) Loss**: A regularization term that encourages the learned latent distribution $q(z|x)$ (from the encoder) to be close to a prior distribution $p(z)$ (typically a standard normal $N(0, I)$). This helps create a smooth and continuous latent space useful for generation.
    The KL divergence between the learned distribution $q(z|x) = N(z | z_{\mu}(x), \text{diag}(\exp(z_{\log \sigma^2}(x))))$ and the prior $p(z) = N(z | 0, I)$ is given by:
    $$ D_{KL}(q(z|x) || p(z)) = -0.5 \sum_{j=1}^{\text{LATENT\_DIM}} (1 + z_{\log \sigma^2_j} - (z_{\mu_j})^2 - \exp(z_{\log \sigma^2_j})) $$
    This term is added to the total loss, often weighted by a factor $\beta$. The `vae.add_loss(kl_loss * 0.1)` line in the code implements this with $\beta = 0.1$, where `kl_loss` is the KL divergence above (that is, $-0.5$ times the sum), averaged over the batch.

The total loss function (which equals the negative of the Evidence Lower Bound, ELBO, when $\beta = 1$) is:
$$ L_{\text{VAE}} = L_{\text{reconstruction}} + \beta \cdot D_{KL}(q(z|x) || p(z)) $$
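
The two terms can be written out in NumPy as a sanity check on the formulas. This is a sketch, not the notebook's Keras code: the function names are hypothetical, and the reconstruction term sums over time steps as in the formula above, whereas Keras's default reduction for `SparseCategoricalCrossentropy` averages instead.

```python
import numpy as np


def reconstruction_loss(token_ids, probs):
    """Negative log-likelihood summed over time steps, averaged over the batch.

    token_ids: integer array of shape (batch, T)
    probs:     float array of shape (batch, T, vocab), rows summing to 1
    """
    batch_idx = np.arange(token_ids.shape[0])[:, None]
    time_idx = np.arange(token_ids.shape[1])[None, :]
    picked = probs[batch_idx, time_idx, token_ids]  # p(x_t | z) for each step
    return float(np.mean(-np.sum(np.log(picked + 1e-12), axis=1)))


def kl_divergence(z_mean, z_log_var):
    """KL(q(z|x) || N(0, I)), summed over latent dims, averaged over the batch."""
    per_sample = -0.5 * np.sum(
        1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1
    )
    return float(np.mean(per_sample))


def vae_loss(token_ids, probs, z_mean, z_log_var, beta=0.1):
    """Total loss: reconstruction + beta * KL (beta = 0.1 as in the text)."""
    return reconstruction_loss(token_ids, probs) + beta * kl_divergence(
        z_mean, z_log_var
    )


# Tiny worked example: 2 sequences, 2 time steps, 2 tokens, uniform predictions.
token_ids = np.array([[0, 1], [1, 0]])
probs = np.full((2, 2, 2), 0.5)
z_mean = np.array([[1.0, 0.0], [1.0, 0.0]])
z_log_var = np.zeros((2, 2))

total = vae_loss(token_ids, probs, z_mean, z_log_var)
```

In this example the reconstruction term is $2\ln 2$ (two steps at probability 0.5) and the KL term is 0.5, since only the first latent mean deviates from the prior.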

## 10. Compiling the VAE Model

*(Corresponds to `vae.compile(...)`)*

The VAE model is compiled, specifying:
-   **`optimizer=Adam(learning_rate=0.001)`**: The Adam optimization algorithm.
-   **`loss=reconstruction_loss_fn`**: The primary loss function (SparseCategoricalCrossentropy). The KL divergence was added via `vae.add_loss()` and is automatically included by Keras.

## 11. Defining Callbacks for Training

*(Corresponds to the `callbacks = [...]` list)*

Callbacks monitor and influence training:
-   **`ModelCheckpoint`**: Saves the model (or best version based on a monitored metric like `loss` or `val_loss`) during training.
-   **`TensorBoard`**: Logs training metrics (loss, etc.) for visualization with TensorBoard.

## 12. Training the VAE Model

*(Corresponds to `history = vae.fit(...)`)*

The VAE is trained using the `fit` method:
-   Input data: `data_tokenized`.
-   Target data: `data_tokenized` (since it's an autoencoder structure trying to reconstruct its input).
-   `epochs`, `batch_size`, and `callbacks` are passed as arguments.
-   An optional `validation_split` could be used to monitor performance on a held-out validation set.

## 13. Saving the Trained Model Components

*(Corresponds to `encoder.save(...)`, `decoder.save(...)`, `vae.save(...)`)*

After training, the encoder, decoder, and the full VAE model are saved to disk. This allows them to be loaded later for inference or further training.

## 14. Utility Functions for SMILES Generation and Validation

*(Corresponds to `generate_smiles_from_latent_space` and `is_valid_smiles` function definitions)*

### `generate_smiles_from_latent_space(...)`
-   Generates new SMILES strings:
    1.  Samples random vectors from the latent space (typically $N(0, I)$).
    2.  Feeds these vectors to the trained `decoder`.
    3.  The decoder outputs token probability distributions for each position in the sequence.
    4.  `np.argmax()` selects the most probable token at each position (greedy decoding).
    5.  The resulting token index sequences are converted back to SMILES strings using `tokenizer.reverse_transform()`.
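
The steps above can be sketched without a trained model by substituting a stand-in for the decoder. `stand_in_decoder` is hypothetical and returns random probabilities in place of the trained decoder's output; only the sampling and `np.argmax` steps mirror what the text describes.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def stand_in_decoder(z, max_length, vocab_size, rng):
    """Hypothetical stand-in for the trained decoder: random token probabilities."""
    logits = rng.standard_normal((z.shape[0], max_length, vocab_size))
    return softmax(logits)


def greedy_decode(probs):
    """Pick the most probable token index at every position."""
    return np.argmax(probs, axis=-1)


rng = np.random.default_rng(42)
n_samples, latent_dim, max_length, vocab_size = 3, 8, 10, 6

# Step 1: sample latent vectors from N(0, I).
z = rng.standard_normal((n_samples, latent_dim))

# Steps 2-3: the decoder maps each z to per-position token probabilities.
probs = stand_in_decoder(z, max_length, vocab_size, rng)

# Step 4: greedy decoding.
token_ids = greedy_decode(probs)

# Step 5 (in the notebook): tokenizer.reverse_transform(token_ids)
```

Greedy decoding is deterministic for a given latent vector, so all diversity in the generated strings comes from the sampled `z`.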

### `is_valid_smiles(smiles)`
-   Checks chemical validity of a SMILES string using RDKit's `Chem.MolFromSmiles(smiles)`.
-   Returns `True` if `Chem.MolFromSmiles` returns a molecule object, and `False` if it returns `None` because the string could not be parsed.

## 15. Generating and Filtering New SMILES Strings

*(Corresponds to the cell generating and filtering SMILES with `N_SAMPLES_TO_GENERATE`)*

-   A specified number of raw SMILES strings are generated using `generate_smiles_from_latent_space`.
-   Each generated SMILES string is then validated using `is_valid_smiles`.
-   Only valid SMILES strings are kept.
-   Duplicates are then removed so that only unique valid SMILES strings remain.
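
A sketch of the filter-then-deduplicate step follows. To stay runnable without RDKit, it takes the validator as an argument and demonstrates it with `toy_is_valid`, a hypothetical check that only tests for balanced parentheses. In the notebook, `is_valid_smiles` would be passed instead.

```python
def filter_valid_unique(smiles_list, is_valid):
    """Return (valid, unique_valid), preserving first-seen order."""
    # Empty strings are dropped explicitly: a decoder that emits <end> at once
    # yields "", which is not a useful molecule even if a parser accepts it.
    valid = [s for s in smiles_list if s and is_valid(s)]
    # dict.fromkeys removes duplicates while keeping insertion order.
    unique_valid = list(dict.fromkeys(valid))
    return valid, unique_valid


def toy_is_valid(smiles):
    """Hypothetical stand-in validator: balanced parentheses only."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


generated = ["CCO", "CC(=O)O", "CC(C", "CCO", ""]
valid, unique_valid = filter_valid_unique(generated, toy_is_valid)

# Two simple summary statistics for a generation run.
validity_rate = len(valid) / len(generated)
uniqueness_rate = len(unique_valid) / len(valid) if valid else 0.0
```

String-level deduplication treats two different spellings of the same molecule as distinct. Canonicalizing each SMILES before comparing would merge them.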

## 16. Displaying a Sample of Generated SMILES

*(Corresponds to the cell printing the first 100 generated SMILES)*

A sample of the generated unique and valid SMILES strings is printed to visually inspect the quality of the generated molecules.

## 17. Saving Generated Valid SMILES to a CSV File

*(Corresponds to the cell saving generated SMILES to `molecules_generated.csv`)*

The unique and valid generated SMILES strings are saved into a CSV file for later use, analysis, or evaluation.
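
A minimal sketch of the export, assuming a single column named `smiles` (the column name is an assumption) and writing into a temporary directory, not the notebook's output folder:

```python
import os
import tempfile

import pandas as pd

# Placeholder for the list produced by the filtering step.
unique_valid_smiles = ["CCO", "CC(=O)O"]

# Temporary directory used here so the example does not touch real outputs.
output_path = os.path.join(tempfile.mkdtemp(), "molecules_generated.csv")

# index=False keeps the pandas row index out of the file.
pd.DataFrame({"smiles": unique_valid_smiles}).to_csv(output_path, index=False)

# Reading the file back confirms the round trip.
reloaded = pd.read_csv(output_path)["smiles"].tolist()
```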