# DeepMatter Data Science/Cheminformatician Interview Notebook

This Jupyter notebook contains tasks to be completed prior to your second interview, and is expected to take
2-3 hours. Please complete as much of it as you can and be prepared to discuss it at the interview. Code should be
written in Python, unless you can justify a better alternative.

## Environmental Setup

Projects at DeepMatter are structured using Anaconda and Pipenv. As these questions
make use of RDKit, Anaconda is the preferred virtual environment setup. Before you continue,
please install Anaconda and create an environment named "dm_interview".

## Task 1 - Runtime Sensor Data

Clone the `dm_datascience` repository located on GitHub at
https://github.com/deepmatterltd/dm_datascience, and unzip the data located under
`02_PCML_and_PCRR/data/pcrr.zip`.

PCRR (Practical Chemistry Runtime Record) files are an XML-based method of capturing all data
associated with a chemistry run in our product DigitalGlassware. They capture things such as the
operations associated with the reaction, the reagents, the product, the final yield and other outcomes.
They also contain observations such as textual and photo notes, as well as timestamps associated
with when each operation was performed (e.g. *add 5mg of catalyst X to reactor vessel*).

### Download and Extract Data

import zipfile

# clone repo
dm_datascience_repo_dir = "../dm_datascience"
!git clone https://github.com/deepmatterltd/dm_datascience $dm_datascience_repo_dir

# extract zip file
zip_file_path = f"{dm_datascience_repo_dir}/02_PCML_and_PCRR/data/pcrr.zip"
zip_file_target_dir = "../pcrr/"

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(zip_file_target_dir)

Cloning into '../dm_datascience'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 78 (delta 12), reused 5 (delta 0), pack-reused 52
Unpacking objects: 100% (78/78), 81.68 MiB | 1.53 MiB/s, done.


1. For each of the PCRR files in the run, extract the name of the recipe, author and number of
operations in the recipe.

2. XML is great for explicitly structuring data, but not much fun to handle for signal processing.
Write a function which takes a PCRR filename and a sensor name (e.g. `irObjTempI` for immersed temperature, or
`uvaI` for immersed UVA level) and returns the data as a pandas Series of length N. Note that each sensor
reading in the XML has a timestamp associated with it. This timestamp should also be parsed and used
as the index of the returned Series.

3. Use the function to extract the immersed temperature (`irObjTempI`) and plot it against time.

4. Compute the rolling median of the immersed temperature, with a window of 30 seconds, and plot it.
Overlay the rolling median of the immersed UVA trace (`uvaI`) on the same plot.
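
As a rough illustration of items 2 and 4, the sketch below builds a time-indexed Series from timestamped readings and applies a 30-second rolling median. It deliberately skips the XML step, since the PCRR schema is not described here. The helper names `readings_to_series` and `rolling_median` and the toy readings are assumptions made for this example, not part of the repository.

```python
from typing import Iterable, Tuple

import pandas as pd


def readings_to_series(readings: Iterable[Tuple[str, float]], name: str) -> pd.Series:
    """Build a time-indexed Series from (timestamp string, value) pairs."""
    timestamps, values = zip(*readings)
    index = pd.to_datetime(list(timestamps))  # parse strings into a DatetimeIndex
    return pd.Series(list(values), index=index, name=name).sort_index()


def rolling_median(series: pd.Series, window: str = "30s") -> pd.Series:
    """Median over a trailing time window; needs a sorted DatetimeIndex."""
    return series.rolling(window).median()


# Made-up readings, one every 10 seconds (NOT taken from a PCRR file).
toy_readings = [
    ("2020-01-01T12:00:00", 20.0),
    ("2020-01-01T12:00:10", 21.0),
    ("2020-01-01T12:00:20", 35.0),  # a spike the median should suppress
    ("2020-01-01T12:00:30", 22.0),
    ("2020-01-01T12:00:40", 23.0),
    ("2020-01-01T12:00:50", 24.0),
]

temperature = readings_to_series(toy_readings, name="irObjTempI")
smoothed = rolling_median(temperature)
print(smoothed)
```

Passing the window as a time offset (`"30s"`) rather than an integer makes pandas select readings by timestamp, so the window still covers 30 seconds if the sampling interval is irregular. Two smoothed Series can be overlaid by calling `.plot(ax=ax)` on each with a shared Matplotlib axis.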

## Task 2 - Reaction Data

While PCRR is used as a means of encapsulating chemistry runs in DigitalGlassware, the concept
of using XML as a structuring mechanism is quite recent. There are many older datasets which
use formats such as RDF, wherein lists of reactions describe the structural change which
occurs during the chemistry. This section will focus on a small sample dataset which uses
this format.

To perform this task, install [RDKit](https://www.rdkit.org) to your Anaconda environment.

An RD file has been provided under the `data` directory for use in the following tasks.

1. Write a parser to split `data/spresi-100.rdf` into `$RXN` blocks, each of which
denotes an individual reaction. A valid block starts at a `$RXN` line and includes
every subsequent line up to, but not including, the next `$RXN` line or the
end of the file.
Everything before the first `$RXN` line can be ignored.

Alternatively, there is a Python library which will do this for you, but it's up to you
to find it.


2. Use RDKit to parse each of the `$RXN` text blocks you extracted above,
and print the SMILES for the reagents and products
of the first 5 reactions.


3. Generate [molecular fingerprints](https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints)
for all reagents and products and store them in a binary NumPy matrix (bonus: use a sparse matrix).
Use a radius of 2 and a fingerprint length of 1000 bits. Skip any molecules which raise an exception, so that they
do not appear in the final array.

4. Perform dimensionality reduction on the fingerprints to visualise the data in 2D.
Colour each point by whether the fingerprint came from a reactant or a product.

5. Perform clustering on the fingerprints and visualise the results in the same embedding used
in the previous question.
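
The block-splitting rule from item 1 can be sketched in plain Python as follows. This is only one possible reading of the rule: it assumes each block runs from a `$RXN` line up to, but not including, the next `$RXN` line. The function name `split_rxn_blocks` and the placeholder sample text are invented for this example and are not real RD file content.

```python
from typing import List, Optional


def split_rxn_blocks(text: str) -> List[str]:
    """Split RD-file text into blocks that each start at a `$RXN` line."""
    blocks: List[str] = []
    current: Optional[List[str]] = None  # None until the first $RXN is seen

    for line in text.splitlines():
        if line.startswith("$RXN"):
            if current is not None:
                blocks.append("\n".join(current))  # close the previous block
            current = [line]  # start a new block at this $RXN line
        elif current is not None:
            current.append(line)
        # lines before the first $RXN fall through and are ignored

    if current is not None:
        blocks.append("\n".join(current))  # last block ends at end of file
    return blocks


# Placeholder text standing in for a real RD file (assumption, not real data).
sample = "\n".join([
    "header line (ignored)",
    "another header line (ignored)",
    "$RXN",
    "first reaction body",
    "$RXN",
    "second reaction body",
    "trailing data for the second reaction",
])

blocks = split_rxn_blocks(sample)
for i, block in enumerate(blocks, start=1):
    print(f"--- block {i} ---")
    print(block)
```

For a real file, the text would come from reading `data/spresi-100.rdf`, and each block could then be handed to RDKit for item 2.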

## Wrap Up

Please bring the completed notebook, along with any other code used to generate results, to your interview.
The code should be executed live unless there is a good reason to avoid this. You will be
asked how you completed the tasks and why you chose your approach, and to discuss
other ways the tasks could have been performed.

