# Data exploration

Suppose you've got an XML file and want to build something useful from it. Start by taking a look at it.

In [1]:
!head -n 25 ../data/001.xml

<?xml version="1.0"?>
<xf>
  <reactions>
    <reaction index="26">
      <RX>
        <RX.ID>9185275</RX.ID>
        <RX01>
          <RX.RXRN>1706223</RX.RXRN>
          <RX.RCT>L-leucine tert-butyl ester</RX.RCT>
        </RX01>
        <RX01>
          <RX.RXRN>9297179</RX.RXRN>
          <RX.RCT>(S)-2-[[[(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl)methyl]diphenylsilanyl]methyl]-4-methylpentanoic acid</RX.RCT>
        </RX01>
        <RX02>
          <RX.PXRN>9307186</RX.PXRN>
          <RX.PRO>(S)-2-[(S)-2-[[[(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl)methyl]diphenylsilanyl]methyl]-4-methylpentanoylamino]-4-methylpentanoic acid tert-butyl ester</RX.PRO>
        </RX02>
        <RX.BLB>1706223</RX.BLB>
        <RX.BLB>9297179</RX.BLB>
        <RX.BLB>9307186</RX.BLB>
        <RX.BLC>1706223</RX.BLC>
        <RX.BLC>9297179</RX.BLC>
        <RX.BLC>9307186</RX.BLC>
        <RX.NVAR>1</RX.NVAR>


lxml is a fast library for processing XML files in Python.

In [2]:
from lxml import etree as ElementTree

In [3]:
tree = ElementTree.parse("../data/001.xml")
root = tree.getroot()

In [4]:
root.getchildren()

[<Element reactions at 0x7f13184780c8>]

In [5]:
len(root.getchildren()[0].getchildren())

2975

In [6]:
root.getchildren()[0].getchildren()[0].getchildren()

[<Element RX at 0x7f13087a6688>,
 <Element RXD at 0x7f13087a6648>,
 <Element RY at 0x7f13087a6608>]

# RX

In [7]:
rx, rxd, ry = root.getchildren()[0].getchildren()[0].getchildren()

In [8]:
rx.getchildren()

[<Element RX.ID at 0x7f1318478cc8>,
 <Element RX01 at 0x7f13087a6ac8>,
 <Element RX01 at 0x7f13087a6a88>,
 <Element RX02 at 0x7f13087a6a48>,
 <Element RX.BLB at 0x7f13087a6bc8>,
 <Element RX.BLB at 0x7f13087a6c08>,
 <Element RX.BLB at 0x7f13087a6c48>,
 <Element RX.BLC at 0x7f13087a6c88>,
 <Element RX.BLC at 0x7f13087a6cc8>,
 <Element RX.BLC at 0x7f13087a6d08>,
 <Element RX.NVAR at 0x7f13087a6d48>,
 <Element RX03 at 0x7f13087a6d88>,
 <Element RX04 at 0x7f13087a6dc8>,
 <Element RX.RXNFILE at 0x7f13087a6e08>,
 <Element RX.REG at 0x7f13087a6e48>,
 <Element RX.RANK at 0x7f13087a6e88>,
 <Element RX.MYD at 0x7f13087a6ec8>,
 <Element RX.SKW at 0x7f13087a6f08>,
 <Element RX.RTYP at 0x7f13087a6f48>,
 <Element RX.RTYP at 0x7f13087a6f88>,
 <Element RX.RAVAIL at 0x7f13087a6fc8>,
 <Element RX.PAVAIL at 0x7f13087a7048>,
 <Element RX.MAXPUB at 0x7f13087a7088>,
 <Element RX.NUMREF at 0x7f13087a70c8>,
 <Element RX.MAXPMW at 0x7f13087a7108>,
 <Element RX.ED at 0x7f13087a7148>,
 <Element RX.UPD at 0x7f130

RX01 and RX02 elements describe reactants and products, respectively.

In [9]:
print("Reactant #1:", rx[1].getchildren()[1].text)
print("Reactant #2:", rx[2].getchildren()[1].text)
print("Product:", rx[3].getchildren()[1].text)

Reactant #1: L-leucine tert-butyl ester
Reactant #2: (S)-2-[[[(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl)methyl]diphenylsilanyl]methyl]-4-methylpentanoic acid
Product: (S)-2-[(S)-2-[[[(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl)methyl]diphenylsilanyl]methyl]-4-methylpentanoylamino]-4-methylpentanoic acid tert-butyl ester
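
Positional indexing such as `rx[1]` only works while every record lists its children in the same order. A minimal sketch of looking elements up by tag instead is given below. It uses the standard library's `xml.etree.ElementTree` on a shortened inline copy of the record above, so it runs without the data file; lxml's `etree` provides the same `findall` and `findtext` methods. The names `snippet`, `rx_demo`, `reactants` and `products` are illustrative, and the long compound names are abbreviated placeholders.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the RX element shown above; long names are abbreviated.
snippet = """
<RX>
  <RX.ID>9185275</RX.ID>
  <RX01>
    <RX.RXRN>1706223</RX.RXRN>
    <RX.RCT>L-leucine tert-butyl ester</RX.RCT>
  </RX01>
  <RX01>
    <RX.RXRN>9297179</RX.RXRN>
    <RX.RCT>reactant 2 (name shortened)</RX.RCT>
  </RX01>
  <RX02>
    <RX.PXRN>9307186</RX.PXRN>
    <RX.PRO>product (name shortened)</RX.PRO>
  </RX02>
</RX>
"""

rx_demo = ET.fromstring(snippet)

# findall returns every direct child with the given tag, in document order;
# findtext returns the text of the first matching child.
reactants = [e.findtext("RX.RCT") for e in rx_demo.findall("RX01")]
products = [e.findtext("RX.PRO") for e in rx_demo.findall("RX02")]

print("Reactants:", reactants)
print("Products:", products)
```

This also copes with records that have one reactant or several products, which fixed indices would not.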


What are RX03 and RX04?

In [10]:
print(ElementTree.tostring(rx[11]).decode("utf-8"))

<RX03>
          <RX.BCODE>261039242542204</RX.BCODE>
          <RX.MCODE>325399193666863</RX.MCODE>
          <RX.NCODE>334727620812468</RX.NCODE>
        </RX03>
        


In [11]:
print(ElementTree.tostring(rx[12]).decode("utf-8"))

<RX04>
          <RX.TRANS highlight="true"><hi>0/80F51(0205)|80F42(030406)|40F61()|01E62()|01E41(0708)|01E41(090A)|01D42(0C0D)|01D41(0E)|01D41(0B)|01D41(0F)|01CB1(101113)|01C62()|01C61(12)|01C41(1415)|01C41(1617)|01B45(191A)|01B45(1B1C)|01B41(1D1E1F)|01B41(18)|01B41()|01B41()|01B41()|01B41()|01A51(2425)|01A45(20)|01A45(21)|01A45(22)|01A45(23)|01A41()|01A41()|01A41()|01945(2A)|01945(2A)|01945(2B)|01945(2B)|01942(2628)|01942(2729)|01862()|01862()|01845(292C)|01845(2D)|01845()|01845()|01745(2E)|01745(2F)|01645(2F)|01645()|</hi></RX.TRANS>
          <RX.BIN>283141</RX.BIN>
          <RX.BFREQ>469</RX.BFREQ>
          <RX.BRANGE>282515-283231</RX.BRANGE>
          <RX.BNAME>NH2 + -(C=)-O- to -NH-C(=)-</RX.BNAME>
          <RX.QRY0>0/80F51(02*)|80F42(03*)|40F61()|*</RX.QRY0>
          <RX.QRY1>0/80F51(02*)|80F42(03*)|40F61()|01E62(*)|01E41(*)|01E41(*)|*</RX.QRY1>
          <RX.QRY2>0/80F51(0205)|80F42(030406)|40F61()|01E62(*)|01E41(*)|01E41(*)|*</RX.QRY2>
          <RX.QRY3>0/80F51(0205)|80

🤔

RTYP stands for 'reaction type'. What does it contain?

In [12]:
[c for c in rx.getchildren() if c.tag == "RX.RTYP"][0].text

'full reaction'

In [13]:
[c for c in rx.getchildren() if c.tag == "RX.RTYP"][1].text

'has preparation'

🤔🤔🤔

# RXD

In [14]:
rxd.getchildren()

[<Element RXD.L at 0x7f13087a68c8>,
 <Element RXD.CL at 0x7f13087a7d08>,
 <Element RXD.SCO at 0x7f13087a7cc8>,
 <Element RXD.STP at 0x7f13087a7c88>,
 <Element RXD01 at 0x7f13087a7e88>,
 <Element RXDS01 at 0x7f13087a7ec8>,
 <Element citations at 0x7f13087a7f08>]

In [16]:
l, cl, sco, stp, rxd01, rxds01, cit = rxd.getchildren()

In [17]:
l.text, cl.text, sco.text

('6375709', 'Preparation', '17')

In [18]:
print(ElementTree.tostring(cit).decode("utf-8"))

<citations>
          <citation index="93">
            <CNR>
              <CNR.CNR>6375709</CNR.CNR>
              <CNR.CED>2007/10/03</CNR.CED>
              <CNR.CUPD>2018/08/19</CNR.CUPD>
            </CNR>
            <CIT>
              <CIT.DT>Article</CIT.DT>
              <CIT.AU>Kim, Jaeseung; Glekas, Athanasios; Sieburth, Scott McN</CIT.AU>
              <CIT.ABPR>Y</CIT.ABPR>
              
              <CIT.PREPY>2002</CIT.PREPY>
              <CIT.PUI>35346502</CIT.PUI>
              <CIT01>
                <CIT.CO>BMCLE</CIT.CO>
                <CIT.JT>Bioorganic and Medicinal Chemistry Letters</CIT.JT>
                <CIT.JTS>Bioorg. Med. Chem. Lett.</CIT.JTS>
                <CIT.CC>gbr</CIT.CC>
                <CIT.LA>English</CIT.LA>
                <CIT.PUB>Elsevier Ltd</CIT.PUB>
                <CIT.VL>12</CIT.VL>
                <CIT.NB>24</CIT.NB>
                <CIT.PY>2002</CIT.PY>
                <CIT.PAG>3625 - 3627</CIT.PAG>
                <CIT.DOI>10.1

In [22]:
print(ElementTree.tostring(rxd01).decode("utf-8"))

<RXD01>
          <RXD.YXRN>9307186</RXD.YXRN>
          <RXD.YPRO>(S)-2-[(S)-2-[[[(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl)methyl]diphenylsilanyl]methyl]-4-methylpentanoylamino]-4-methylpentanoic acid tert-butyl ester</RXD.YPRO>
          <RXD.YD>89 percent</RXD.YD>
          <RXD.NYD>89</RXD.NYD>
        </RXD01>

        


Note the RXD.NYD field: it holds the experimental yield as a number. What is yield?  
Suppose we have the photosynthesis reaction:

$$
6CO_2 + 6H_2O \to C_6H_{12}O_6 + 6O_2
$$

If we take 600 molecules of carbon dioxide and 600 molecules of water, we should get 100 molecules of glucose and 600 molecules of oxygen.  
That is not what happens under real conditions: only part of the reactants undergoes the chemical transformation.  
Suppose only 480 molecules of $CO_2$ react, giving 80 molecules of glucose. The yield is the fraction of the expected amount that is actually obtained, here 80/100 = 80%.
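
The arithmetic of the glucose example can be written out as a small sketch. The function name `percent_yield` and the variable names are illustrative, not part of the dataset or of any library.

```python
def percent_yield(actual_amount: float, theoretical_amount: float) -> float:
    """Return the obtained product as a percentage of the theoretical amount."""
    if theoretical_amount <= 0:
        raise ValueError("theoretical amount must be positive")
    return 100.0 * actual_amount / theoretical_amount


co2_molecules = 600
# The balanced equation consumes 6 CO2 per glucose molecule.
theoretical_glucose = co2_molecules / 6
actual_glucose = 80

print(percent_yield(actual_glucose, theoretical_glucose))  # 80.0
```

Read this way, the RXD.NYD value of 89 in the record above means that 89% of the theoretically possible amount of product was obtained.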

In [25]:
rxds01.getchildren()

[<Element RXD03 at 0x7f130877fe08>, <Element RXD.DED at 0x7f13087a11c8>]

In [27]:
d3, ded = rxds01.getchildren()

In [28]:
d3.text

'\n            '

In [29]:
d3.getchildren()

[<Element RXD.RGTXRN at 0x7f130879d088>, <Element RXD.RGT at 0x7f13087a1e08>]

RGT stands for reagent, a compound that does not directly provide atoms to the product.  
SOL stands for solvent, a compound that provides the environment required for the molecules to interact.

In [30]:
# print([dd.text for dd in d1.getchildren()])
# print([dd.text for dd in d2.getchildren()])
print([dd.text for dd in d3.getchildren()])

['507429', 'N-(3-dimethylaminopropyl)-N-ethylcarbodiimide']


# RY

In [32]:
ry.getchildren()

[<Element RY.RCT at 0x7f1308723d88>,
 <Element RY.RCT at 0x7f1308723dc8>,
 <Element RY.PRO at 0x7f1308723e08>]

Two reactants and a product.

In [33]:
rct_1, rct_2, prd = ry.getchildren()

In [34]:
print(rct_1.text)




  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 14 13 0 0 1 REGNO=1706223
M  V30 BEGIN ATOM
M  V30 1 C -19.8374 0.8098 0 0
M  V30 2 O -21.0393 -0.1119 0 0
M  V30 3 C -18.6362 -0.1119 0 0
M  V30 4 C -18.8133 1.9261 0 0
M  V30 5 C -20.8609 1.9261 0 0
M  V30 6 C -22.4389 0.4679 0 0
M  V30 7 C -23.6395 -0.4545 0 0 CFG=2
M  V30 8 O -22.6353 1.9685 0 0
M  V30 9 C -24.842 0.4679 0 0
M  V30 10 N -24.6628 -1.5708 0 0
M  V30 11 C -26.2396 -0.1119 0 0
M  V30 12 C -27.4402 0.8098 0 0
M  V30 13 C -26.438 -1.6125 0 0
M  V30 14 H -22.616 -1.5708 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 1 1 3
M  V30 3 1 1 4
M  V30 4 1 1 5
M  V30 5 1 2 6
M  V30 6 1 6 7
M  V30 7 2 6 8
M  V30 8 1 7 9
M  V30 9 1 7 10
M  V30 10 1 7 14 CFG=3
M  V30 11 1 9 11
M  V30 12 1 11 12
M  V30 13 1 11 13
M  V30 END BOND
M  V30 END CTAB
M  END



This is the MOL (molfile) format, here in its V3000 variant. It contains the following information:  
- atom types (sulfur, nitrogen, carbon, ...)
- atom coordinates
- chemical bonds: begin/end atoms and single/double/triple order
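
To make the layout concrete, here is a minimal sketch of a reader for the `M  V30` lines. It is not a full molfile parser: it ignores the header, the COUNTS line, continuation lines and per-atom properties such as `CFG=2`, and in practice a chemistry toolkit would be the better choice. The function name `parse_v3000` is hypothetical, and the sample is an excerpt (atoms 6 to 8 and bonds 6 to 7) of the block printed above. In these bond lines the fields are bond index, bond order, first atom, second atom, so `7 2 6 8` is a double bond between atoms 6 and 8.

```python
def parse_v3000(molblock: str):
    """Collect (index, symbol, x, y, z) atoms and
    (index, order, first_atom, second_atom) bonds from 'M  V30' lines."""
    prefix = "M  V30 "
    atoms, bonds = [], []
    section = None
    for line in molblock.splitlines():
        if not line.startswith(prefix):
            continue  # header lines and the final 'M  END'
        fields = line[len(prefix):].split()
        if not fields:
            continue
        if fields[0] == "BEGIN":
            section = fields[1]
        elif fields[0] == "END":
            section = None
        elif section == "ATOM":
            index, symbol, x, y, z = fields[:5]
            atoms.append((int(index), symbol, float(x), float(y), float(z)))
        elif section == "BOND":
            index, order, first_atom, second_atom = fields[:4]
            bonds.append((int(index), int(order), int(first_atom), int(second_atom)))
    return atoms, bonds


# Excerpt of the molfile shown above.
excerpt = """M  V30 BEGIN CTAB
M  V30 BEGIN ATOM
M  V30 6 C -22.4389 0.4679 0 0
M  V30 7 C -23.6395 -0.4545 0 0 CFG=2
M  V30 8 O -22.6353 1.9685 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 6 1 6 7
M  V30 7 2 6 8
M  V30 END BOND
M  V30 END CTAB
M  END"""

atoms, bonds = parse_v3000(excerpt)
print(atoms)
print(bonds)
```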

# Finally
<span style="font-size:larger;">We have the following information:</span>
<ul>
<li>structures of the reactants and their names</li>
<li>structure of the product and its name</li>
<li>solvent used</li>
<li>reagents used</li>
<li>reaction type, which provides no useful information</li>
</ul>
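
As a sketch of how these fields could be gathered into one record per reaction, the function below walks the element paths seen during the exploration. It uses the standard library's `xml.etree.ElementTree` on a shortened inline copy of reaction 26; lxml's `etree` accepts the same paths. The function name `summarize_reaction`, the dictionary keys and the abbreviated product name are illustrative. It assumes one RXD block per reaction, which was true for the record inspected here but has not been checked for the whole file, and it leaves out the solvent because its tag does not appear in this record.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the reaction explored above.
reaction_xml = """
<reaction index="26">
  <RX>
    <RX.ID>9185275</RX.ID>
    <RX01><RX.RXRN>1706223</RX.RXRN><RX.RCT>L-leucine tert-butyl ester</RX.RCT></RX01>
    <RX02><RX.PXRN>9307186</RX.PXRN><RX.PRO>product (name shortened)</RX.PRO></RX02>
  </RX>
  <RXD>
    <RXD01><RXD.YXRN>9307186</RXD.YXRN><RXD.NYD>89</RXD.NYD></RXD01>
    <RXDS01>
      <RXD03>
        <RXD.RGTXRN>507429</RXD.RGTXRN>
        <RXD.RGT>N-(3-dimethylaminopropyl)-N-ethylcarbodiimide</RXD.RGT>
      </RXD03>
    </RXDS01>
  </RXD>
</reaction>
"""


def summarize_reaction(reaction):
    """Gather names, reagents and numeric yield of one <reaction> element."""
    nyd = reaction.findtext("RXD/RXD01/RXD.NYD")  # first RXD block only
    return {
        "id": reaction.findtext("RX/RX.ID"),
        "reactants": [e.text for e in reaction.findall("RX/RX01/RX.RCT")],
        "products": [e.text for e in reaction.findall("RX/RX02/RX.PRO")],
        "reagents": [e.text for e in reaction.findall("RXD/RXDS01/RXD03/RXD.RGT")],
        "yield_percent": float(nyd) if nyd is not None else None,
    }


record = summarize_reaction(ET.fromstring(reaction_xml))
print(record)
```

Returning `None` for a missing yield keeps incomplete records usable instead of failing on them.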

RXNFILE is potentially interesting because it describes reactions.  
Let's take a look at an example rxn file.

In [None]:
!cat ../data/example.rxn

Here we find information about three molecules: the first two are reactants and the last one is the product. The first block of each molecule description covers the atoms: symbols (S, N, C, O, ...), coordinates, and auxiliary information. The second block covers the chemical bonds. We are interested in its first four columns: bond_index, first_atom, second_atom, bond_order.