# Using RDKit to Build a Dataset

<div class="alert alert-block alert-warning"> 
<h2>Lab Activity</h2>

The file `amino_acids.txt` in the `data` folder contains SMILES strings for the 20 naturally occurring amino acids.  Your task is to combine the skills we have learned so far in class to build a dataset for these molecules and write it to a file. Your goal is to create a comma-separated values (CSV) file with the columns `SMILES`, `num_heavy` (number of heavy atoms), `molecular_weight`, and one other molecular descriptor of your choice.

For this task, you will need to:

1. Read SMILES from a text file.
2. Make an RDKit molecule for each SMILES string.  If you print the list of SMILES strings after reading them from the file, each item ends with a newline character (`\n`), because each SMILES string sits on its own line of the file.  Remove the newline before you make the RDKit molecule from each SMILES string: `smiles_string = smiles_string.strip()`.
3. Get the number of heavy atoms, molecular weight, and one other descriptor of your choice.
4. Write a file with the data.

The first few lines of your file should look like this:
```
SMILES,num_heavy,molecular_weight,tpsa
C[C@H](N)C(=O)O,6,89.09399999999998,63.32000000000001
NC(N)=NCCC[C@H](N)C(=O)O,12,174.20399999999998,127.72000000000001
```
   

Save your file as `amino_acids_processed.csv` and upload it to the Canvas assignment for this week.

</div>
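The newline handling in step 2 can be sketched without RDKit. This is a minimal illustration, assuming the list below stands in for what `readlines()` returns; the two SMILES strings are copied from the example output above, and the names `raw_lines` and `smiles_list` are made up for this sketch.

```python
# Stand-in for the list that filehandle.readlines() would return:
# every item still carries its trailing newline character.
raw_lines = ["C[C@H](N)C(=O)O\n", "NC(N)=NCCC[C@H](N)C(=O)O\n"]

# strip() returns a new string with leading and trailing whitespace
# (including the newline) removed; the original string is unchanged.
smiles_list = [line.strip() for line in raw_lines]

print(smiles_list)
```

Because `strip()` returns a new string rather than modifying the original, the result has to be assigned to a variable, as in the `smiles_string = smiles_string.strip()` pattern shown in step 2.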

In [None]:
# Don't forget to run this cell to import the relevant libraries
from rdkit import Chem
from rdkit.Chem import Descriptors
import numpy as np

In [None]:
# Your code goes here.  Add additional cells if needed

In [None]:
# This solution picks TPSA (polar surface area) as the additional descriptor

# Read one SMILES string per line; the with block closes the file automatically
with open('data/amino_acids.txt', 'r') as filehandle:
    amino_acid_data = filehandle.readlines()
# print(amino_acid_data)

# Prepare the lists to store all the descriptors
# The lists are not strictly necessary, because each row is written to the file as it is computed
molecules = []
molecular_weight = []
number_heavy = []
tpsa_list = []

# Open the output file for writing
output = open('amino_acids_processed.csv', 'w')

# Write the headers to the file
output.write("SMILES,num_heavy,molecular_weight,tpsa\n")

# Compute molecular descriptors
for smiles in amino_acid_data:
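    smiles = smiles.strip()
    # Skip blank lines, such as an empty line at the end of the file
    if not smiles:
        continue
    # Create the molecule object
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when it cannot parse the SMILES string
    if mol is None:
        continue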
    # Append it to the molecules list
    molecules.append(mol)
    # Get the molecular weight
    mol_weight = Descriptors.MolWt(mol)
    # Append it to the list
    molecular_weight.append(mol_weight)
    # Get the number of heavy atoms
    num_heavy = Descriptors.HeavyAtomCount(mol)
    # Append it to the list
    number_heavy.append(num_heavy)
    # Get the topological polar surface area (TPSA)
    tpsa = Descriptors.TPSA(mol)
    # Append it to the list
    tpsa_list.append(tpsa)

    # Write the data to the file for this amino acid
    output.write(F'{smiles},{num_heavy},{mol_weight},{tpsa}\n')
        
output.close()
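One way to check the result is to read the CSV back with Python's built-in `csv` module. The sketch below is only an illustration: it reads from an in-memory copy of the example lines shown in the activity rather than from `amino_acids_processed.csv`, and the names `sample`, `reader`, and `rows` are made up for this example.

```python
import csv
import io

# In-memory stand-in for the output file, copied from the example output
# in the activity. In the notebook, open 'amino_acids_processed.csv' instead.
sample = io.StringIO(
    "SMILES,num_heavy,molecular_weight,tpsa\n"
    "C[C@H](N)C(=O)O,6,89.09399999999998,63.32000000000001\n"
    "NC(N)=NCCC[C@H](N)C(=O)O,12,174.20399999999998,127.72000000000001\n"
)

# DictReader uses the header line as the keys for each row
reader = csv.DictReader(sample)
rows = list(reader)

print(reader.fieldnames)
for row in rows:
    # Every value is read as a string, so convert before doing arithmetic
    print(row["SMILES"], int(row["num_heavy"]), float(row["molecular_weight"]))
```

Reading the file back this way confirms that the header matches the required column names and that every row has a value in each column.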

In [None]:
# This code cell generates the student version of the notebook.
# 1. Go to Edit --> Clear Outputs of All Cells.
# 2. Save the notebook.
# 3. Set the input filename in the nbconvert line to this notebook's filename.
# 4. Set the --output filename to the student-version filename.
# 5. Run this cell.
# This cell is itself removed from the student notebook.
!jupyter nbconvert RDKit_dataset.ipynb  --to notebook \
    --TagRemovePreprocessor.enabled=True \
    --ClearOutputPreprocessor.enabled=True \
    --TagRemovePreprocessor.remove_cell_tags=\[\"solution\"\] \
    --output RDKit_dataset_student.ipynb