# How to Write Cheminformatics Blog Posts

As the YouTubers would say, "A lot of you have been asking me about how to write cheminformatics blog posts". Well, not a lot, but at least a couple!

This post is pitched towards chemists.

Throughout this post, I'll use the example from my last blog post, [Why some organic molecules have a color: Correlating optical absorption wavelength with conjugated bond chain length]({% post_url 2024-10-15-Color-from-Conjugation %}).

I realized I've written 16 blog posts.

## Find a jumping-off point

A jumping-off point can be any of the following that interests you:
- A tutorial. Examples: [Getting Started with the RDKit in Python](https://www.rdkit.org/docs/GettingStartedInPython.html), [RDKit Cookbook](https://rdkit.org/docs/Cookbook.html), or a cheminformatics blog by [Takayuki Serizawa](https://iwatobipen.wordpress.com/) or others.
- A talk or presentation. Examples: [RDKit User Group Meeting (UGM) recordings](https://www.youtube.com/playlist?list=PLugOo5eIVY3EHeBuSABISVok5-Q7kE0O1).
- A dataset. Examples: [Experimental database of optical properties of organic compounds](https://www.nature.com/articles/s41597-020-00634-8), datasets on [MoleculeNet](https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html).

Example: My background is in optical spectroscopy, so I searched for an optical spectroscopy dataset, and found one with thousands of chromophores and their absorption and emission wavelength maxima. Thanks to relatively new publications such as [Nature *Scientific Data*](https://www.nature.com/sdata/), there are a good number of datasets available.

## Ask yourself "What would a chemist want to do with this?"

This is where your chemistry background comes into play. Given the jumping-off point, think about what you as a chemist would want to do with it.

Your interests determine which jumping-off points appeal to you and what you want to do with them. That combination is why each cheminformatician comes up with different topics.

Example: When I considered the optical dataset, I remembered that there's a correlation between conjugated bond chain length and absorption wavelength. So I thought about verifying that using this dataset.

## Formulate an approach

### What chemical information do you want?

I think it's best to start with the question of what chemical information you want, then figure out a way to calculate it. This ensures that the topic is chemically relevant.

Example: I needed to determine the longest conjugated bond chain length in a molecule. Absorption wavelength was already provided, though I wanted to convert it to energy, so that took some work.
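The wavelength-to-energy conversion follows from E = hc/λ. Here is a minimal sketch, assuming the wavelength is in nanometers and the desired energy unit is electronvolts (the unit choice and the function name `wavelength_nm_to_ev` are my assumptions for illustration, not necessarily what the original post used):

```python
# Physical constants (exact values in the 2019 SI definition)
PLANCK_J_S = 6.62607015e-34            # Planck constant, J*s
LIGHT_SPEED_M_S = 299_792_458          # speed of light, m/s
JOULES_PER_EV = 1.602176634e-19        # elementary charge in C, i.e. J per eV


def wavelength_nm_to_ev(wavelength_nm: float) -> float:
    """Convert a photon wavelength in nm to an energy in eV via E = h*c / lambda."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    wavelength_m = wavelength_nm * 1e-9
    energy_j = PLANCK_J_S * LIGHT_SPEED_M_S / wavelength_m
    return energy_j / JOULES_PER_EV


print(round(wavelength_nm_to_ev(500.0), 2))  # 2.48
```

Because energy is inversely proportional to wavelength, a longer absorption wavelength corresponds to a lower transition energy.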

### How can you extract the chemical information using code?

This is where the coding part comes in: Translating the chemical information question into code.

Example: The optical dataset provided the molecular graph in the form of SMILES, so the task was to devise an algorithm that would traverse the conjugated bonds in the molecular graph and determine which were connected to a starting bond. Then find the longest such chain in the molecule.
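To make that traversal concrete, here is a simplified pure-Python sketch rather than the code from the post. It assumes each bond has already been reduced to a tuple `(begin_atom_idx, end_atom_idx, is_conjugated)`, and it treats a "chain" as a connected set of conjugated bonds, measured by bond count; both are my assumptions. With RDKit, such tuples could be built from `bond.GetBeginAtomIdx()`, `bond.GetEndAtomIdx()`, and `bond.GetIsConjugated()` for each bond in `mol.GetBonds()`. The function name and the toy bond list are hypothetical.

```python
from collections import defaultdict


def longest_conjugated_chain(bonds):
    """Return the bond count of the largest connected set of conjugated bonds.

    bonds: list of (begin_atom_idx, end_atom_idx, is_conjugated) tuples.
    """
    # Map each atom to the indices of the conjugated bonds that touch it
    bonds_at_atom = defaultdict(list)
    for bond_idx, (a, b, conjugated) in enumerate(bonds):
        if conjugated:
            bonds_at_atom[a].append(bond_idx)
            bonds_at_atom[b].append(bond_idx)

    visited = set()
    longest = 0
    for start_idx, (_, _, conjugated) in enumerate(bonds):
        if not conjugated or start_idx in visited:
            continue
        # Depth-first traversal over conjugated bonds that share an atom
        stack = [start_idx]
        visited.add(start_idx)
        size = 0
        while stack:
            bond_idx = stack.pop()
            size += 1
            a, b, _ = bonds[bond_idx]
            for atom in (a, b):
                for neighbor in bonds_at_atom[atom]:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        stack.append(neighbor)
        longest = max(longest, size)
    return longest


# Toy, hand-labelled bond list: a 3-bond conjugated system and a 5-bond one,
# separated by two non-conjugated bonds
toy_bonds = [
    (0, 1, True), (1, 2, True), (2, 3, True),
    (3, 4, False), (4, 5, False),
    (5, 6, True), (6, 7, True), (7, 8, True), (8, 9, True), (9, 10, True),
]
print(longest_conjugated_chain(toy_bonds))  # 5
```

The `visited` set means every bond is placed on the stack at most once, so the work grows with the number of bonds rather than with the number of possible starting bonds.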

#### Make the code more efficient

Often, there's a single cheminformatics operation which takes a significant amount of time on a sizeable dataset. This slow step is worth speeding up.

Example: In the optical properties post, the slow step was traversing each molecular graph to determine its longest conjugated bond chain. I sped this up by only traversing each bond once, rather than using each bond as a starting point and then traversing all connected bonds. I also sped up the operation at the dataframe level: Because it contained multiple rows for many chromophores, I cached the result for each chromophore so it wouldn't need to be recalculated.
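The caching idea can be sketched with Python's built-in `functools.lru_cache`. This is an illustration under assumptions, not the post's implementation: `slow_chain_length` is a hypothetical stand-in for the expensive per-molecule step (its body is placeholder work, not a real conjugation analysis), and the rows are made up, with a hypothetical second column so that the same chromophore appears more than once.

```python
from functools import lru_cache

call_count = 0  # counts how often the slow step actually runs


@lru_cache(maxsize=None)
def slow_chain_length(smiles: str) -> int:
    """Hypothetical stand-in for an expensive per-molecule calculation."""
    global call_count
    call_count += 1
    # Placeholder work: count "=" symbols; NOT a real conjugation analysis
    return smiles.count("=")


# Made-up rows: the same chromophore SMILES appears in several rows
rows = [
    ("C=CC=C", "hexane"),
    ("C=CC=C", "ethanol"),
    ("C=CC=CC=C", "hexane"),
    ("C=CC=C", "water"),
]

results = [slow_chain_length(smiles) for smiles, _solvent in rows]
print(results)     # [2, 2, 3, 2]
print(call_count)  # 2
```

Four rows trigger only two real calculations because the cache is keyed on the SMILES string. Note that two different SMILES strings for the same molecule would be cached separately unless they are canonicalized first.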

Note that speeding up the code is not the only goal: Because the blog post is intended to be educational, there's a balance between code speed and clarity. In many posts, I use Polars in eager mode rather than lazy mode because I want to show intermediate steps, such as the dataframe right after the data is loaded. If speed were the sole goal, I'd use lazy mode so the query is evaluated only once, to get the final results.

## Consult a journal article

It's often helpful to consult a journal article to get molecular structures, experimental details, etc. I've found it useful to access journal articles through an American Chemical Society membership benefit, [ACS Member Universal Access (50 article accesses per term)](https://www.acs.org/membership/member-benefits.html), though many articles are probably available on preprint servers now.

## Code your way through the project

## Write the narrative with the reader in mind

Find something compelling to you. Follow your curiosity. When you find something you want to work on and can't put down, you've found a good topic. That will be different for everyone.