# Introduction to Python

V.F. Scalfani, University of Alabama Libraries

Fall 2024

## Anticipated Length of Workshop

1 hour

## What is the purpose of this workshop?

This workshop is designed to review the basics of Python and ensure that participants in the computer-assisted retrosynthesis workshops have a working knowledge of Python. If you already write your own Python code, this will be a review. Here is what we will cover:

1. Setting up a Conda Python development environment
2. Getting help in Python
3. Python syntax and variables
4. Indexing and accessing data
5. Functions
6. Conditional statements
7. Loops
8. Data I/O
9. Multiprocessing

## Additional Resources

Content in this workshop was adapted from a previously offered Conda and 3-part Python workshop series from UA Libraries: https://github.com/ualibweb/UALIB_Workshops

In addition to the workshop content, we recommend starting with the following Python resources to learn more:

1. https://github.com/jakevdp/WhirlwindTourOfPython

2. http://swcarpentry.github.io/python-novice-gapminder/

3. https://docs.python.org/3/tutorial/index.html

4. Python Crash Course: a hands-on, project-based introduction to programming, by Eric Matthes: [Scout Link](http://libdata.lib.ua.edu/login?url=https://search.ebscohost.com/login.aspx?direct=true&db=cat00456a&AN=ua.8611906&site=eds-live&scope=site)


## 1. Setting up a Conda Python development environment

As you develop Python code, you will notice that there are various dependencies that need to be installed. It's often necessary or convenient to have different Python environments for different projects. There are several good solutions available to manage Python dependencies.

Our preferred method is to use the Conda package manager. See our previous workshop on "Introduction to Conda" for more information: https://github.com/ualibweb/UALIB_Workshops

**What is Conda?**

“Conda is an open source package management system and environment management system that runs
on Windows, macOS, Linux and z/OS. Conda quickly installs, runs and updates packages and their
dependencies. Conda easily creates, saves, loads and switches between environments on your local
computer. It was created for Python programs, but it can package and distribute software for any
language.” [1]

**Local Installation Instructions**

1. Install Miniforge, a community-packaged conda-forge distribution, by following the installation docs: https://github.com/conda-forge/miniforge

2. Set up a basic RDKit-based cheminformatics Python environment:

Enter the following commands in a terminal (on Windows, the Miniforge Prompt):

```
conda create --name retro_rdkit_env
conda activate retro_rdkit_env
conda install -c conda-forge rdkit jupyterlab numpy matplotlib pandas

```

3. Install VSCode and Python extensions (optional, as you could use Jupyter Lab or another IDE)

https://code.visualstudio.com/

### Check Installation

In [1]:
# First let's make sure everything is working with Python
import sys
print(sys.version)

3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0]


In [2]:
print("Hello, Cheminformatics World!")

Hello, Cheminformatics World!
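
Beyond checking the Python version, it can help to confirm that the packages installed into the environment are visible to the interpreter. The following is a minimal sketch using `importlib.util.find_spec` from the standard library. The helper name `check_packages` is hypothetical, and the package list simply mirrors the `conda install` command used earlier in this section.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Package names taken from the conda install command in this workshop
status = check_packages(["rdkit", "jupyterlab", "numpy", "matplotlib", "pandas"])
for name, found in status.items():
    print(name, "found" if found else "NOT found")
```

If a package is reported as not found, the most likely cause is that the environment was not activated before starting Python or Jupyter Lab.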


## 2. Getting help in Python

### Web Documentation

We recommend starting with the online Python documentation: https://docs.python.org/3/. See also the Library Reference for built-in functions and standard library modules: https://docs.python.org/3/library/index.html

### help() function 

For additional tips, see refs [2,3]

In [3]:
# If you know the name of the function, use the help() function 
# to display the docstring
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.

    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.



In [4]:
# Help also works on variables
a = [6,3,0,5]
help(a)

Help on list object:

class list(object)
 |  list(iterable=(), /)
 |
 |  Built-in mutable sequence.
 |
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, index, /)
 |      Return self[index].
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  __it

### dir() function [4]

In [5]:
# The Python dir() function is useful for exploring modules
# this prints a list of available functions and variables
# example with time module
import time  # In python, you will often need to import libraries to use them.
dir(time)

['CLOCK_BOOTTIME',
 'CLOCK_MONOTONIC',
 'CLOCK_MONOTONIC_RAW',
 'CLOCK_PROCESS_CPUTIME_ID',
 'CLOCK_REALTIME',
 'CLOCK_TAI',
 'CLOCK_THREAD_CPUTIME_ID',
 '_STRUCT_TM_ITEMS',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'altzone',
 'asctime',
 'clock_getres',
 'clock_gettime',
 'clock_gettime_ns',
 'clock_settime',
 'clock_settime_ns',
 'ctime',
 'daylight',
 'get_clock_info',
 'gmtime',
 'localtime',
 'mktime',
 'monotonic',
 'monotonic_ns',
 'perf_counter',
 'perf_counter_ns',
 'process_time',
 'process_time_ns',
 'pthread_getcpuclockid',
 'sleep',
 'strftime',
 'strptime',
 'struct_time',
 'thread_time',
 'thread_time_ns',
 'time',
 'time_ns',
 'timezone',
 'tzname',
 'tzset']
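
The `dir()` output includes internal names that start with an underscore (for example `__doc__`). As a small sketch, a list comprehension can filter these out so only the public names remain. The variable name `public_names` is just an illustration.

```python
import time

# Keep only the public names (those that do not start with an underscore)
public_names = [name for name in dir(time) if not name.startswith("_")]

print(len(public_names))
print(public_names[:5])
```

The exact names and their count depend on your operating system and Python version, so your output may differ from the list shown above.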

In [6]:
# now we can use help() to get more information about a particular function or method
help(time.sleep)

Help on built-in function sleep in module time:

sleep(...)
    sleep(seconds)

    Delay execution for a given number of seconds.  The argument may be
    a floating-point number for subsecond precision.



## 3. Python syntax and variables 

Some content in this section adapted and inspired from refs [5,6].

### Simple Variables

In [7]:
# An integer
a = 5
print(a)

5


In [8]:
type(a)

int

In [9]:
# Floating points
b = 1.825
print(b)

1.825


In [10]:
type(b)

float

In [11]:
# a string
s1 = "Thanks for coming to our workshop!"
print(s1)

Thanks for coming to our workshop!


In [12]:
type(s1)

str

In [13]:
s2 = 'single quotes work too'
print(s2)

single quotes work too


In [14]:
# Sometimes it is necessary to use double quotes
s3 = "Like when there is a single quote ' within the string"
print(s3)

Like when there is a single quote ' within the string


### Compound Variables

Two common compound variables are lists and dictionaries. For others, see: https://docs.python.org/3/library/index.html


In [15]:
# create a list with numbers
atom_nums = [9, 17, 35, 53]
print(atom_nums)

[9, 17, 35, 53]


In [16]:
type(atom_nums)

list

In [17]:
# create a list with strings
halogens = ["F", "Cl", "Br", "I"]
print(halogens)

['F', 'Cl', 'Br', 'I']


In [18]:
# Lists can mix types
mixed = [9, "F", 17, "Cl"]
print(mixed)

[9, 'F', 17, 'Cl']


In [19]:
# We can even do lists within lists
halogens_list = [[9, 17, 35, 53],["F", "Cl", "Br", "I"]]
print(halogens_list)

[[9, 17, 35, 53], ['F', 'Cl', 'Br', 'I']]


In [20]:
# We often prefer dictionaries over lists
# As it's easier to keep track of variables and idxs

halogens_dict = {
    "F": 9,
    "Cl": 17,
    "Br": 35,
    "I": 53
}

In [21]:
halogens_dict

{'F': 9, 'Cl': 17, 'Br': 35, 'I': 53}

In [22]:
# Or something like this works well too
# If you want more defined key-value pairs

halogens_entries_dict = {'Entry 1': {'halogen': 'F', 'num': 9},
 'Entry 2': {'halogen': 'Cl', 'num': 17},
 'Entry 3': {'halogen': 'Br', 'num': 35},
 'Entry 4': {'halogen': 'I', 'num': 53}}

halogens_entries_dict

{'Entry 1': {'halogen': 'F', 'num': 9},
 'Entry 2': {'halogen': 'Cl', 'num': 17},
 'Entry 3': {'halogen': 'Br', 'num': 35},
 'Entry 4': {'halogen': 'I', 'num': 53}}
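
Lists and dictionaries are mutable (the `help(a)` output earlier describes a list as a "Built-in mutable sequence"), so they can be changed after they are created. Here is a minimal sketch of adding items. The `_demo` variable names are hypothetical copies used so the workshop variables stay unchanged, and astatine (atomic number 85) is added purely for illustration.

```python
# Hypothetical copies of the workshop variables, so the originals are untouched
atom_nums_demo = [9, 17, 35, 53]
halogens_dict_demo = {"F": 9, "Cl": 17, "Br": 35, "I": 53}

# Add astatine (atomic number 85); this element is not in the original data
atom_nums_demo.append(85)        # lists grow with append()
halogens_dict_demo["At"] = 85    # dictionaries grow by assigning to a new key

print(atom_nums_demo)
print(halogens_dict_demo)
print(len(atom_nums_demo), len(halogens_dict_demo))
```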

## 4. Indexing and accessing data

Some content in this section adapted and inspired from refs [5,6].

In [23]:
# Python indexing starts at 0 from left to right. 
# When indexing from right to left, the indexing starts at -1.
print(atom_nums)
print(atom_nums[0])
print(atom_nums[-1])
print(atom_nums[0:2]) # a slice

[9, 17, 35, 53]
9
53
[9, 17]
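
Note that the end of a slice is exclusive: `atom_nums[0:2]` returns the items at indexes 0 and 1, but not 2. A few more slice forms are sketched below using the same list.

```python
atom_nums = [9, 17, 35, 53]

print(atom_nums[1:])    # from index 1 to the end -> [17, 35, 53]
print(atom_nums[:-1])   # everything except the last item -> [9, 17, 35]
print(atom_nums[-2:])   # the last two items -> [35, 53]
print(atom_nums[::2])   # every second item -> [9, 35]
print(atom_nums[::-1])  # a reversed copy -> [53, 35, 17, 9]
```

Slicing always returns a new list, so the original `atom_nums` is not modified.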


In [24]:
# When indexing a list of lists,
# we need to index 2 or more levels
print(halogens_list)
print(halogens_list[0])
print(halogens_list[1])
print(halogens_list[0][0])
print(halogens_list[1][3])

[[9, 17, 35, 53], ['F', 'Cl', 'Br', 'I']]
[9, 17, 35, 53]
['F', 'Cl', 'Br', 'I']
9
I


In [25]:
# Access dictionary data using keys
print(halogens_dict)
print(halogens_dict["Br"])

{'F': 9, 'Cl': 17, 'Br': 35, 'I': 53}
35


In [26]:
# We could also cast the dictionary into a list
print(list(halogens_dict.keys())[0])
print(list(halogens_dict.values())[0])

F
9


In [27]:
# Example with a nested dictionary (each entry holds 2 key-value pairs)
print(halogens_entries_dict["Entry 3"])
print(halogens_entries_dict["Entry 3"]["halogen"])
print(halogens_entries_dict["Entry 3"]["num"])

{'halogen': 'Br', 'num': 35}
Br
35


In [28]:
print(list(halogens_entries_dict.keys())[0])
print(list(halogens_entries_dict.values())[0])
print(list(halogens_entries_dict.values())[0]["halogen"])
print(list(halogens_entries_dict.values())[0]["num"])

Entry 1
{'halogen': 'F', 'num': 9}
F
9
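
Indexing a dictionary with a key that does not exist raises a `KeyError`. As a short sketch, the `in` operator and the dictionary `get()` method are two common ways to handle missing keys. The key `"At"` is used here only as an example of a key that is absent.

```python
halogens_dict = {"F": 9, "Cl": 17, "Br": 35, "I": 53}

# Check for a key before indexing
print("Br" in halogens_dict)   # True
print("At" in halogens_dict)   # False

# get() returns a default (None unless specified) instead of raising KeyError
print(halogens_dict.get("Br"))                       # 35
print(halogens_dict.get("At"))                       # None
print(halogens_dict.get("At", "not in dictionary"))  # not in dictionary
```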


## 5. Functions

Some content in this section adapted and inspired from ref [7].

### Using Existing Functions

In [29]:
# functions are called with parentheses
print("Hello Cheminformatics World!")

Hello Cheminformatics World!


In [30]:
# functions can be applied directly to objects: "methods"
myList = [23, 1, 45, 9]
myList.reverse() # () evaluates reverse function method with no arguments
print(myList)

[9, 45, 1, 23]


In [31]:
# use a function from within a module
import math
math.sqrt(9)

3.0

In [32]:
# or
from math import sqrt
sqrt(9)

3.0

### Define Custom Functions

Python functions are defined using the `def` statement. A general syntax for a function looks like this:

```python

def function_name():
    do something

```
or

```python

def function_name(param1, param2, ...):
    do something
```

In [33]:
# functions do not need inputs
def print_halogens():
    """ Prints a list of common halogen elements
        There are no inputs.
    """
    print("Fluorine")
    print("Chlorine")
    print("Bromine")
    print("Iodine")

In [34]:
# call the function
print_halogens()

Fluorine
Chlorine
Bromine
Iodine


In [35]:
# Use return to output a variable
def molecular_formula(InChI):
    """ returns extracted molecular formula from a standard InChI input"""
    L = InChI.split('/')
    return L[1]

In [36]:
my_inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
molecular_formula(my_inchi)

'C9H8O4'
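
Function parameters can also have default values, which callers may override by position or by keyword. The following sketch generalizes the `molecular_formula` idea above. The function name `inchi_layer` and its `layer` parameter are hypothetical and not part of the original workshop.

```python
def inchi_layer(inchi, layer=1):
    """Return one '/'-separated layer of an InChI string.

    With the default layer=1, this is the molecular formula of a
    standard InChI. Returns None if the requested layer does not exist.
    """
    layers = inchi.split("/")
    if layer < len(layers):
        return layers[layer]
    return None

my_inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
print(inchi_layer(my_inchi))           # default layer -> C9H8O4
print(inchi_layer(my_inchi, layer=0))  # keyword argument -> InChI=1S
print(inchi_layer(my_inchi, 10))       # layer that does not exist -> None
```

Returning `None` for a missing layer is one design choice among several; raising an error would be equally reasonable depending on how the function is used.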

## 6. Conditional statements

Some content in this section adapted and inspired from refs [8,9]


A simplified general Python syntax for conditional statements is as follows:

```
if expression1:
  do something1
elif expression2:
  do something2
else:
  do something3

```

### if


Use an `if` statement to make a choice and determine the direction of code execution. Start the line of code with `if` followed by the condition, then end with a colon, `:`. Conditional statements are often tested with comparison operators (e.g., >) or sequence operations (e.g., x in s):

https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not

https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range
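
It may help to see that a condition is simply an expression that evaluates to `True` or `False`. The sketch below stores a few conditions in variables before any `if` statement is involved. The variable names are only illustrative.

```python
element = "Boron"

is_long = len(element) > 5     # comparison operator -> False
has_five = len(element) == 5   # equality test -> True
has_n = "n" in element         # membership test on a sequence -> True
both = has_five and has_n      # combined with a boolean operator -> True

print(is_long, has_five, has_n, both)
```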

In [37]:
# if statement with condition met
element = 'Iodine'

if len(element) > 5:
  print(element, 'has more than 5 characters')

Iodine has more than 5 characters


In [38]:
# if statement with condition not met
element = 'Lead'

if len(element) > 5:
  print(element, 'has more than 5 characters')

### else

In the above example, the condition is not met, so nothing happens. We can add an `else` clause to run alternative code when the condition is not met.

In [39]:
# add an else
element = 'Lead'

if len(element) > 5:
  print(element, 'has more than 5 characters')
else:
  print(element, 'has less than 5 characters')

Lead has less than 5 characters


### elif

Additional conditional tests can be added before `else` with the `elif` statement (else if).

In [40]:
# for example, what if we want to test len(element) == 5
element = 'Boron'

if len(element) > 5:
  print(element, 'has more than 5 characters')
elif len(element) == 5:
   print(element, 'has 5 characters') 
else:
  print(element, 'has less than 5 characters')

Boron has 5 characters


In [41]:
# caution, an if-elif-else sequence stops at the first condition that is true
element = 'Boron'

if len(element) > 5:
  print(element, 'has more than 5 characters')
elif len(element) == 5:
   print(element, 'has 5 characters')
elif 'n' in element:
   print(element, 'contains the character n') 
else:
  print(element, 'has less than 5 characters')

Boron has 5 characters


In [42]:
# One solution with a boolean

element = 'Boron'

if len(element) > 5:
  print(element, 'has more than 5 characters')
elif len(element) == 5 and 'n' in element:
   print(element, 'has 5 characters and contains the character n')
else:
  print(element, 'has less than 5 characters')

Boron has 5 characters and contains the character n


In [43]:
# Alternative with all ifs
element = 'Boron'

if len(element) > 5:
  print(element, 'has more than 5 characters')
if 'n' in element:
   print(element, 'contains the character n')
if len(element) == 5:
   print(element, 'has 5 characters')  
if len(element) < 5:
  print(element, 'has less than 5 characters')

Boron contains the character n
Boron has 5 characters


## 7. Loops [10,11,12]

If we wanted to print a series of statements, we could do this one at a time, but it is very slow and inefficient:

In [44]:
print("Fluorine 9")
print("Chlorine 17")
print("Bromine 35")
print("Iodine 53")

Fluorine 9
Chlorine 17
Bromine 35
Iodine 53


`for` loops allow repeated execution of code on a known collection of values such as a range of numbers or a list. A general syntax example is as follows:

```python
for item in items:
  do something

```

`while` loops are another type of loop and are useful when you need to repeat code until a condition changes and/or do not know the number of iterations in advance.
We will not cover these today; see reference [11] for examples. (For what it's worth, I believe we will only be using `for` loops in these workshops.)

```python
while condition:
  do something

```

In [45]:
halogens = ["Fluorine", "Chlorine", "Bromine", "Iodine"]

In [46]:
# 1. Method one where we access list elements directly
for halogen in halogens:
    print(halogen)

Fluorine
Chlorine
Bromine
Iodine


In [47]:
# 2. Method two, use a range to access idxs
for idx in range(len(halogens)):
    print(idx, halogens[idx])

0 Fluorine
1 Chlorine
2 Bromine
3 Iodine


In [48]:
# 3. Method 3, use enumerate
for idx, halogen in enumerate(halogens):
    print(idx, halogen)

0 Fluorine
1 Chlorine
2 Bromine
3 Iodine


In [49]:
# We can also loop through lists of lists, like this:
halogens = [["Fluorine", 9], ["Chlorine", 17], ["Bromine", 35], ["Iodine", 53]]

In [50]:
# 1. Direct
for halogen, num in halogens:
    print(halogen, num)

Fluorine 9
Chlorine 17
Bromine 35
Iodine 53


In [51]:
# 2. range
for idx in range(len(halogens)):
    print(idx, halogens[idx][0], halogens[idx][1])

0 Fluorine 9
1 Chlorine 17
2 Bromine 35
3 Iodine 53


In [52]:
# 3. enumerate
for idx, (halogen, num) in enumerate(halogens):
    print(idx, halogen, num)


0 Fluorine 9
1 Chlorine 17
2 Bromine 35
3 Iodine 53


In [53]:
# It is sometimes necessary and useful to use more than one loop
# We will come back to this as we work with some of the reaction data
halogens = [["Fluorine", 9], ["Chlorine", 17], ["Bromine", 35], ["Iodine", 53]]

for halogen_data in halogens:
    print(halogen_data)
    for item in halogen_data:
        print(item)

['Fluorine', 9]
Fluorine
9
['Chlorine', 17]
Chlorine
17
['Bromine', 35]
Bromine
35
['Iodine', 53]
Iodine
53


In [54]:
# Let's look at how to loop through a dictionary:
halogens_dict = {
    "F": 9,
    "Cl": 17,
    "Br": 35,
    "I": 53
}

In [55]:
for key in halogens_dict.keys():
    print(key)

F
Cl
Br
I


In [56]:
for value in halogens_dict.values():
    print(value)

9
17
35
53


In [57]:
# Both at same time
for key,value in halogens_dict.items():
    print(key, value)

F 9
Cl 17
Br 35
I 53


In [58]:
# another example
halogens_entries_dict = {'Entry 1': {'halogen': 'F', 'num': 9},
 'Entry 2': {'halogen': 'Cl', 'num': 17},
 'Entry 3': {'halogen': 'Br', 'num': 35},
 'Entry 4': {'halogen': 'I', 'num': 53}}

for key, value in halogens_entries_dict.items():
    print(key, value)

Entry 1 {'halogen': 'F', 'num': 9}
Entry 2 {'halogen': 'Cl', 'num': 17}
Entry 3 {'halogen': 'Br', 'num': 35}
Entry 4 {'halogen': 'I', 'num': 53}


In [59]:
# If you want to access value elements
for key, value in halogens_entries_dict.items():
    print(key, value['halogen'], value['num'])

Entry 1 F 9
Entry 2 Cl 17
Entry 3 Br 35
Entry 4 I 53
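
Two related idioms often appear alongside `for` loops: `zip()` for looping over two lists in parallel, and comprehensions for building a new list or dictionary in one expression. The following is a brief sketch. The variable names `names`, `name_to_num`, and `heavy` are hypothetical.

```python
names = ["Fluorine", "Chlorine", "Bromine", "Iodine"]
atom_nums = [9, 17, 35, 53]

# zip() pairs up items from two lists of equal length
for name, num in zip(names, atom_nums):
    print(name, num)

# A dictionary comprehension builds a dictionary in a single expression
name_to_num = {name: num for name, num in zip(names, atom_nums)}
print(name_to_num)

# A list comprehension can also filter while looping
heavy = [name for name, num in name_to_num.items() if num > 17]
print(heavy)  # ['Bromine', 'Iodine']
```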


## 8. Data I/O

### Loading Data

Loading tabular data into Python lists or dictionaries will be needed for the workshops. Here are a few methods.

We will use the dataset rxns_10.rsmi, which is a random sample of 10 reactions from the 1976_Sep2016_USPTOgrants_smiles.rsmi dataset:
https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873

In [60]:
# import the data
import csv

rxn_data_list = []
with open('../X_Data/rxns_10.rsmi', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')

    for idx,row in enumerate(reader): # this lets us add an index
        rxn_data_list.append([idx] + row)

In [61]:
rxn_data_list[0:3]

[[0,
  '[NH2:1][C:2]1[C:7]([O:8]C)=[CH:6][C:5]([Cl:10])=[CH:4][C:3]=1[C:11](=[O:16])[C:12]([F:15])([F:14])[F:13].B(Br)(Br)Br.O>C(Cl)Cl>[NH2:1][C:2]1[C:7]([OH:8])=[CH:6][C:5]([Cl:10])=[CH:4][C:3]=1[C:11](=[O:16])[C:12]([F:15])([F:13])[F:14]',
  'US06140499',
  '',
  '2000',
  '100%',
  '102.1%'],
 [1,
  '[Cl:1][C:2]1[CH:10]=[C:9]2[C:5]([C:6]([C:11]3[N:12]=[C:13]4[C:19]([C:20]([NH:22][CH:23]([CH3:25])[CH3:24])=[O:21])=[CH:18][N:17](COCC[Si](C)(C)C)[C:14]4=[N:15][CH:16]=3)=[N:7][NH:8]2)=[CH:4][CH:3]=1.FC(F)(F)C(O)=O.C(N)CN>ClCCl>[Cl:1][C:2]1[CH:10]=[C:9]2[C:5]([C:6]([C:11]3[N:12]=[C:13]4[C:19]([C:20]([NH:22][CH:23]([CH3:25])[CH3:24])=[O:21])=[CH:18][NH:17][C:14]4=[N:15][CH:16]=3)=[N:7][NH:8]2)=[CH:4][CH:3]=1',
  'US08658646B2',
  '2169',
  '2014',
  '41%',
  '40.6%'],
 [2,
  'S(=O)(=O)(O)O.[N+:6]([O-:9])([O-])=[O:7].[Na+].[Br:11][C:12]1[CH:17]=[CH:16][CH:15]=[CH:14][C:13]=1[OH:18]>O.C(OCC)(=O)C>[Br:11][C:12]1[CH:17]=[CH:16][CH:15]=[C:14]([N+:6]([O-:9])=[O:7])[C:13]=1[OH:18] |f:1.2|',
  'U

In [62]:
# Alternatively let's use a dictionary
rxn_data_dict = {}
col_names = ['reaction_smiles', 'PatentNumber', 'ParagraphNum', 'Year', 'TextMinedYield', 'CalculatedYield']
with open('../X_Data/rxns_10.rsmi', 'r') as infile:
    reader = csv.DictReader(infile, delimiter='\t', fieldnames=col_names)

    for idx,row in enumerate(reader):
        rxn_data_dict[idx] = row

In [63]:
list(rxn_data_dict.items())[0:3]

[(0,
  {'reaction_smiles': '[NH2:1][C:2]1[C:7]([O:8]C)=[CH:6][C:5]([Cl:10])=[CH:4][C:3]=1[C:11](=[O:16])[C:12]([F:15])([F:14])[F:13].B(Br)(Br)Br.O>C(Cl)Cl>[NH2:1][C:2]1[C:7]([OH:8])=[CH:6][C:5]([Cl:10])=[CH:4][C:3]=1[C:11](=[O:16])[C:12]([F:15])([F:13])[F:14]',
   'PatentNumber': 'US06140499',
   'ParagraphNum': '',
   'Year': '2000',
   'TextMinedYield': '100%',
   'CalculatedYield': '102.1%'}),
 (1,
  {'reaction_smiles': '[Cl:1][C:2]1[CH:10]=[C:9]2[C:5]([C:6]([C:11]3[N:12]=[C:13]4[C:19]([C:20]([NH:22][CH:23]([CH3:25])[CH3:24])=[O:21])=[CH:18][N:17](COCC[Si](C)(C)C)[C:14]4=[N:15][CH:16]=3)=[N:7][NH:8]2)=[CH:4][CH:3]=1.FC(F)(F)C(O)=O.C(N)CN>ClCCl>[Cl:1][C:2]1[CH:10]=[C:9]2[C:5]([C:6]([C:11]3[N:12]=[C:13]4[C:19]([C:20]([NH:22][CH:23]([CH3:25])[CH3:24])=[O:21])=[CH:18][NH:17][C:14]4=[N:15][CH:16]=3)=[N:7][NH:8]2)=[CH:4][CH:3]=1',
   'PatentNumber': 'US08658646B2',
   'ParagraphNum': '2169',
   'Year': '2014',
   'TextMinedYield': '41%',
   'CalculatedYield': '40.6%'}),
 (2,
  {'reaction_

### Writing Data

Let's go ahead and write the data with the new idx values.

In [64]:
# For the rxn_data_list, which is a list of lists
with open('../X_Data/rxn_list.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')

    for row in rxn_data_list:
        writer.writerow(row)

In [65]:
# For the rxn_data_dict
with open('../X_Data/rxn_dict.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')

    # header
    header = ['idx'] # first one is idx
    for key in list(rxn_data_dict.values())[0]:
        header.append(key)
    writer.writerow(header)

    # write the data
    for key, sub_dict in rxn_data_dict.items():
        row = [key] + list(sub_dict.values())
        writer.writerow(row)
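
A file written with a header row can be read back without passing `fieldnames`, because `csv.DictReader` then takes the column names from the first row. The sketch below demonstrates this round trip on a temporary file so it runs on its own. The file name `rxn_demo.txt` and the reduced set of columns are assumptions for illustration; the two patent numbers and years are taken from the sample output above.

```python
import csv
import os
import tempfile

# A reduced, illustrative version of the data: idx plus two of the columns
header = ["idx", "PatentNumber", "Year"]
rows = [[0, "US06140499", "2000"], [1, "US08658646B2", "2014"]]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "rxn_demo.txt")  # hypothetical file name

    # Write a tab-delimited file with a header row
    with open(path, "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)

    # Without fieldnames, DictReader uses the first row as the column names
    with open(path, "r", newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        loaded = {int(row["idx"]): row for row in reader}

print(loaded[1]["PatentNumber"])
print(type(loaded[1]["Year"]))
```

Note that the `csv` module reads every field as a string, so numeric columns such as the year need to be converted explicitly (as done here for `idx`).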

## 9. Multiprocessing (Linux/Mac)

Like most programming languages, Python uses only one CPU core by default. When processing a large dataset, multiprocessing can be used to split the data into chunks and process those chunks on more than one CPU core. There are several ways to do this in Python; we will look at the `concurrent.futures` and `multiprocessing` `Pool` methods built into Python:

https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor

https://docs.python.org/3/library/multiprocessing.html#using-a-pool-of-workers
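
To make the idea of "splitting the data into chunks" concrete, here is a minimal pure-Python sketch. The helper `split_into_chunks` is hypothetical and is not used elsewhere in this workshop. In practice, the `map` methods of both libraries distribute the items across workers for you (and accept an optional `chunksize` argument), so this is for illustration only.

```python
def split_into_chunks(items, num_chunks):
    """Split a list into num_chunks contiguous pieces of nearly equal size."""
    chunk_size, remainder = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `remainder` chunks get one extra item
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

data = list(range(10))
for chunk in split_into_chunks(data, 4):
    print(chunk)
```

With 10 items and 4 chunks, the first two chunks hold three items and the last two hold two items, so every item is assigned exactly once.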


For this part, we will use a random sample of 100,000 rxns from Lowe, D. Chemical reactions from US patents (1976-Sep2016), CC0 license: https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873

Here is how I prepared the dataset using my terminal:

```
7z x 1976_Sep2016_USPTOgrants_smiles.7z
(head -n 1 1976_Sep2016_USPTOgrants_smiles.rsmi && tail -n +2 1976_Sep2016_USPTOgrants_smiles.rsmi | shuf -n 100000) > rxns_100k.rsmi
```


In [66]:
# unzip the sample rxn data if necessary
# !unzip ../X_Data/rxns_100k.zip -d ../X_Data/

import os
import subprocess

# Path to the expected unzipped file
unzipped_file_path = '../X_Data/rxns_100k.rsmi'

# Check if the file exists
if not os.path.exists(unzipped_file_path):
    # Unzip the file if the .rsmi file does not exist
    subprocess.run(["unzip", "../X_Data/rxns_100k.zip", "-d", "../X_Data/"])
else:
    print("File is already unzipped.")

Archive:  ../X_Data/rxns_100k.zip
  inflating: ../X_Data/rxns_100k.rsmi  


In [67]:
# First load the sample of rxns
import csv
rxn_data = {}
col_names = ['reaction_smiles', 'PatentNumber', 'ParagraphNum', 'Year', 'TextMinedYield', 'CalculatedYield']
with open('../X_Data/rxns_100k.rsmi', 'r') as infile:
    reader = csv.DictReader(infile, delimiter='\t', fieldnames=col_names)
    next(reader)
    for idx,row in enumerate(reader):
        rxn_data[idx] = row

In [68]:
# We will create a small RDKit demo function that canonicalizes the reactions
# We will talk more about this tomorrow! :)

from rdkit.Chem import AllChem, rdChemReactions
def canonical_reaction(rxn_smiles):
    try:
        # attempt to parse rxn
        rxn = rdChemReactions.ReactionFromSmarts(rxn_smiles, useSmiles=True)
        # attempt to sanitize rxn
        rdChemReactions.SanitizeRxn(rxn)
        # attempt to canonicalize
        canonical_rxn_smiles = rdChemReactions.ReactionToSmiles(rxn)
    except Exception as e:
        print("Error parsing or canonicalizing")
        return None
    return canonical_rxn_smiles

In [69]:
# make a deep copy of the dictionary so that the copy and the original are separate objects:
import copy
rxn_data1 = copy.deepcopy(rxn_data)

# Serial process all 100k reactions (1 worker/CPU)
for key,value in rxn_data1.items():
    rxn_smiles = value["reaction_smiles"]
    canonical_smiles = canonical_reaction(rxn_smiles)
    value["canonical_rxn_smiles"] = canonical_smiles

In [70]:
list(rxn_data1.items())[0:3]

[(0,
  {'reaction_smiles': '[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][CH2:8][C:9](=[O:11])[N:10]=1.[N:16]1[C:25]2[C:20](=[N:21][C:22]([CH:26]=O)=[CH:23][CH:24]=2)[CH:19]=[CH:18][CH:17]=1.C(O)(=O)C1C=CC=CC=1.N1CCCCC1>C1(C)C=CC=CC=1.CN(C=O)C.O>[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][C:8](=[CH:26][C:22]2[CH:23]=[CH:24][C:25]3[C:20](=[CH:19][CH:18]=[CH:17][N:16]=3)[N:21]=2)[C:9](=[O:11])[N:10]=1',
   'PatentNumber': 'US07268231B2',
   'ParagraphNum': '0109',
   'Year': '2007',
   'TextMinedYield': '37.4%',
   'CalculatedYield': '37.4%',
   'canonical_rxn_smiles': 'C1CCNCC1.O=C(O)c1ccccc1.O=[CH:26][c:22]1[n:21][c:20]2[cH:19][cH:18][cH:17][n:16][c:25]2[cH:24][cH:23]1.[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[CH2:8][S:7]2)[cH:12][cH:13][cH:14][cH:15]1>CC1=CC=CC=C1.CN(C)C=O.O>[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[C:8](=[CH:26][c:22]3[n:21][c:20]4[cH:19][cH:18][cH:17][n:16][c:25]4[cH:24][cH:2

In [71]:
# Using a list like this will make multiprocessing a bit easier,
# so we can pass in the key and dictionary values and keep track of them
rxn_data2 = copy.deepcopy(rxn_data)
items2 = list(rxn_data2.items())
items2[0:3]

[(0,
  {'reaction_smiles': '[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][CH2:8][C:9](=[O:11])[N:10]=1.[N:16]1[C:25]2[C:20](=[N:21][C:22]([CH:26]=O)=[CH:23][CH:24]=2)[CH:19]=[CH:18][CH:17]=1.C(O)(=O)C1C=CC=CC=1.N1CCCCC1>C1(C)C=CC=CC=1.CN(C=O)C.O>[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][C:8](=[CH:26][C:22]2[CH:23]=[CH:24][C:25]3[C:20](=[CH:19][CH:18]=[CH:17][N:16]=3)[N:21]=2)[C:9](=[O:11])[N:10]=1',
   'PatentNumber': 'US07268231B2',
   'ParagraphNum': '0109',
   'Year': '2007',
   'TextMinedYield': '37.4%',
   'CalculatedYield': '37.4%'}),
 (1,
  {'reaction_smiles': '[CH2:1]([C:15]1[CH:16]=[C:17]([OH:21])[CH:18]=[CH:19][CH:20]=1)[CH2:2][CH2:3][CH2:4][CH2:5][CH2:6][CH2:7][CH2:8][CH2:9][CH2:10][CH2:11][CH2:12][CH2:13][CH3:14].[C:22]([O-])(=[O:24])[CH3:23].[Cl-].[Al+3].[Cl-].[Cl-].Cl>CCOCC>[OH:21][C:17]1[CH:16]=[C:15]([CH2:1][CH2:2][CH2:3][CH2:4][CH2:5][CH2:6][CH2:7][CH2:8][CH2:9][CH2:10][CH2:11][CH2:12][CH2:13][CH3:14])[CH:20]=

In [72]:
# We will need to adjust our function a bit too
from rdkit.Chem import AllChem, rdChemReactions
def canonical_reaction(item):
    key,value = item
    try:
        # attempt to parse rxn
        rxn = rdChemReactions.ReactionFromSmarts(value["reaction_smiles"], useSmiles=True)
        # attempt to sanitize rxn
        rdChemReactions.SanitizeRxn(rxn)
        # attempt to canonicalize
        canonical_rxn_smiles = rdChemReactions.ReactionToSmiles(rxn)
    except Exception as e:
        return key, None
    return key, canonical_rxn_smiles

In [73]:
# Now let's do multiprocessing
# Decide how many cores you want to use
# You may need to experiment with finding the ideal number, as the max is not always best

import os
# check your max number of CPUs
max_cpus = int(os.cpu_count())
print(f"{max_cpus} CPUs")

24 CPUs


In [74]:
# This is adjusting the number of CPUs based on your hardware
num_cpus = 1
if max_cpus >= 16:
    num_cpus = 8
elif max_cpus >= 4:
    num_cpus = 4
elif max_cpus >= 2:
    num_cpus = 2
else:
    num_cpus = 1
print(num_cpus)

8


In [75]:
# Using ProcessPoolExecutor for multiprocessing
import concurrent.futures
import multiprocessing

# make a new copy
rxn_data2 = copy.deepcopy(rxn_data)

# I believe the if __name__ == "__main__": guard and forcing the fork start method are necessary to run on Mac (not tested)
# On Linux, you can usually omit this.
# This code won't work on Windows.

if __name__ == '__main__':
    multiprocessing.set_start_method('fork', force=True)

    # create a list
    items2 = list(rxn_data2.items())
    
    # We will use num_cpus
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_cpus) as executor:
        # Map each dictionary item to the canonical_reaction function
        results = list(executor.map(canonical_reaction, items2))

    # Add results to the dictionary
    for key, value in results:
        rxn_data2[key]["canonical_reaction"] = value
    

In [76]:
list(rxn_data2.items())[0:3]

[(0,
  {'reaction_smiles': '[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][CH2:8][C:9](=[O:11])[N:10]=1.[N:16]1[C:25]2[C:20](=[N:21][C:22]([CH:26]=O)=[CH:23][CH:24]=2)[CH:19]=[CH:18][CH:17]=1.C(O)(=O)C1C=CC=CC=1.N1CCCCC1>C1(C)C=CC=CC=1.CN(C=O)C.O>[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][C:8](=[CH:26][C:22]2[CH:23]=[CH:24][C:25]3[C:20](=[CH:19][CH:18]=[CH:17][N:16]=3)[N:21]=2)[C:9](=[O:11])[N:10]=1',
   'PatentNumber': 'US07268231B2',
   'ParagraphNum': '0109',
   'Year': '2007',
   'TextMinedYield': '37.4%',
   'CalculatedYield': '37.4%',
   'canonical_reaction': 'C1CCNCC1.O=C(O)c1ccccc1.O=[CH:26][c:22]1[n:21][c:20]2[cH:19][cH:18][cH:17][n:16][c:25]2[cH:24][cH:23]1.[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[CH2:8][S:7]2)[cH:12][cH:13][cH:14][cH:15]1>CC1=CC=CC=C1.CN(C)C=O.O>[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[C:8](=[CH:26][c:22]3[n:21][c:20]4[cH:19][cH:18][cH:17][n:16][c:25]4[cH:24][cH:23]

In [77]:
# Now with multiprocessing pool

import multiprocessing

# make a new copy
rxn_data3 = copy.deepcopy(rxn_data)

if __name__ == '__main__':
    multiprocessing.set_start_method('fork', force=True)

    # num_cpus cpus
    with multiprocessing.Pool(num_cpus) as pool:
    
        # create a list
        
        items3 = list(rxn_data3.items())

        # Use map to apply canonical_reaction function to each item (key, value of data)
        results = pool.map(canonical_reaction, items3)

    # Add results to the dictionary
    for key, value in results:
        rxn_data3[key]["canonical_reaction"] = value


In [78]:
list(rxn_data3.items())[0:3]

[(0,
  {'reaction_smiles': '[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][CH2:8][C:9](=[O:11])[N:10]=1.[N:16]1[C:25]2[C:20](=[N:21][C:22]([CH:26]=O)=[CH:23][CH:24]=2)[CH:19]=[CH:18][CH:17]=1.C(O)(=O)C1C=CC=CC=1.N1CCCCC1>C1(C)C=CC=CC=1.CN(C=O)C.O>[Cl:1][C:2]1[CH:15]=[CH:14][CH:13]=[CH:12][C:3]=1[CH2:4][NH:5][C:6]1[S:7][C:8](=[CH:26][C:22]2[CH:23]=[CH:24][C:25]3[C:20](=[CH:19][CH:18]=[CH:17][N:16]=3)[N:21]=2)[C:9](=[O:11])[N:10]=1',
   'PatentNumber': 'US07268231B2',
   'ParagraphNum': '0109',
   'Year': '2007',
   'TextMinedYield': '37.4%',
   'CalculatedYield': '37.4%',
   'canonical_reaction': 'C1CCNCC1.O=C(O)c1ccccc1.O=[CH:26][c:22]1[n:21][c:20]2[cH:19][cH:18][cH:17][n:16][c:25]2[cH:24][cH:23]1.[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[CH2:8][S:7]2)[cH:12][cH:13][cH:14][cH:15]1>CC1=CC=CC=C1.CN(C)C=O.O>[Cl:1][c:2]1[c:3]([CH2:4][NH:5][C:6]2=[N:10][C:9](=[O:11])[C:8](=[CH:26][c:22]3[n:21][c:20]4[cH:19][cH:18][cH:17][n:16][c:25]4[cH:24][cH:23]

Interestingly, using the multiprocessing Pool seems faster than the concurrent.futures method in this case. We mostly used the concurrent.futures method in some of the retrosynthesis tutorials, so a useful contribution might be to refactor them to use the multiprocessing Pool method if it proves to be substantially faster.
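
One way to check a speed comparison like this is to wrap each approach in a function and time it with `time.perf_counter`. The following is a minimal sketch. The `timed` helper and the `sum_of_squares` stand-in workload are hypothetical and are not the RDKit function used above.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload (an assumption for illustration, not the RDKit function)
def sum_of_squares(n):
    return sum(i * i for i in range(n))

result, elapsed = timed(sum_of_squares, 100_000)
print(f"result={result}, elapsed={elapsed:.4f} s")
```

Timings vary from run to run and between machines, so it is worth repeating each measurement a few times before drawing conclusions.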

## References

[1] https://docs.conda.io/en/latest/

[2] https://jakevdp.github.io/PythonDataScienceHandbook/01.01-help-and-documentation.html

[3] https://colab.research.google.com/notebooks/basic_features_overview.ipynb

[4] https://stackoverflow.com/questions/139180/how-to-list-all-functions-in-a-python-module

[5] http://swcarpentry.github.io/python-novice-gapminder/

[6] https://github.com/jakevdp/WhirlwindTourOfPython

[7] http://swcarpentry.github.io/python-novice-gapminder/16-writing-functions.html

[8] http://swcarpentry.github.io/python-novice-gapminder/13-conditionals.html

[9] https://github.com/vfscalfani/UALIB_Workshops/blob/master/01_MATLAB/06_MATLAB_Conditional_Statements.md

[10] http://swcarpentry.github.io/python-novice-gapminder/12-for-loops.html

[11] https://github.com/jakevdp/WhirlwindTourOfPython/blob/master/07-Control-Flow-Statements.ipynb

[12] https://github.com/vfscalfani/UALIB_Workshops/blob/master/01_MATLAB/05_MATLAB_Loops.md