Quantum Machine 9 (QM9) dataset is now uploaded to Kaggle, [find here](https://www.kaggle.com/zaharch/quantum-machine-9-aka-qm9). 

Note that QM9 contains extra information both for the train and the test datasets of the competition. **This kernel extracts the features from the dataset and saves them into file data.covs.pickle for convenience of use**. I think you can even add this kernel to your pipeline of kernels to experiment with the features, or just download the output file (it takes a few hours to create it). The kernel also includes a simple LightGBM model on the extracted features with feature importance graph. Spoiler: mulliken partial charges for the two atoms are on top.

Note: I am not sure how to extract the information from the list of frequencies correctly, currently I have just taken min/max and mean values of the list.

Disclaimer: **the dataset is not allowed to use for your final submissions in this competition**. But we can still learn from it.

**Does QM9 contain the information from extra files given in the competition?**

1. dipole_moments.csv contains X,Y,Z values per molecule and I found that sqrt(X^2+Y^2+Z^2)=mu where mu is given in QM9
2. mulliken_charges.csv matches the mulliken charges from QM9
3. scalar_coupling_contributions.csv I can't find this info in QM9
4. magnetic_shielding_tensors.csv I can't find this info in QM9

QM9 contains the structure information, and additionally the following information both for the train and the test:

1. Mulliken partial charge for each atom
2. Frequencies for degrees of freedom
3. SMILES from GDB9 and for relaxed geometry
4. InChI for GDB9 and for relaxed geometry

and also the following 17 properties per molecule:

`
I. Property  Unit         Description
 1  tag       -            gdb9; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of Highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of Lowest occupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K
`

Example of QM9 data format (for dsgdb9nsd_000001.xyz):

`
5
gdb 1	157.7118	157.70997	157.70699	0.	13.21	-0.3877	0.1171	0.5048	35.3641	0.044749	-40.47893	-40.476062	-40.475117	-40.498597	6.469	
C	-0.0126981359	 1.0858041578	 0.0080009958	-0.535689
H	 0.002150416	-0.0060313176	 0.0019761204	 0.133921
H	 1.0117308433	 1.4637511618	 0.0002765748	 0.133922
H	-0.540815069	 1.4475266138	-0.8766437152	 0.133923
H	-0.5238136345	 1.4379326443	 0.9063972942	 0.133923
1341.307	1341.3284	1341.365	1562.6731	1562.7453	3038.3205	3151.6034	3151.6788	3151.7078
C	C	
InChI=1S/CH4/h1H4	InChI=1S/CH4/h1H4
`

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.io as sio
import os
from pathlib import Path
import csv
import pickle
from joblib import Parallel, delayed
from tqdm import tqdm_notebook as tqdm     # iteration tiem estimation tool
from time import sleep
import pdb
#import lightgbm as lgb
#import xgboost as xgb
import random
from sklearn.model_selection import GroupKFold, StratifiedKFold, KFold
import seaborn as sn


In [16]:
PATH_QM9 = Path('../data/dsgdb9nsd.xyz')
PATH_BASE = Path('../data/')
PATH_WORKING = Path('../working')

train = pd.read_csv(PATH_BASE/'train.csv')
test = pd.read_csv(PATH_BASE/'test.csv')

both = pd.concat([train, test], axis=0, sort=False)
both = both.set_index('molecule_name',drop=False)

both.sort_index(inplace=True)

In [17]:
both.head()

Unnamed: 0_level_0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant
molecule_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
dsgdb9nsd_000001,0,dsgdb9nsd_000001,1,0,1JHC,84.8076
dsgdb9nsd_000001,1,dsgdb9nsd_000001,1,2,2JHH,-11.257
dsgdb9nsd_000001,2,dsgdb9nsd_000001,1,3,2JHH,-11.2548
dsgdb9nsd_000001,3,dsgdb9nsd_000001,1,4,2JHH,-11.2543
dsgdb9nsd_000001,4,dsgdb9nsd_000001,2,0,1JHC,84.8074


In [22]:
## Somethings of tes set are empty --> No matching between both data and structure data

all_files = os.listdir(PATH_BASE/'dsgdb9nsd.xyz')
file_name = pd.DataFrame()
file_name['molecule_name'] = all_files

file_name.to_csv('../data/file_names.csv')

In [4]:
PATH_QM9 = Path('../data/dsgdb9nsd.xyz')
PATH_BASE = Path('../data/')
PATH_WORKING = Path('../working')

def processQM9_list(files):
    df = pd.DataFrame()
    for i,filename in enumerate(files):
        stats = processQM9_file(filename)
        df = pd.concat([df, stats], axis = 0)
    return df

def processQM9_file(filename):
    path = PATH_QM9/filename
    molecule_name = filename[:-4]
    
    row_count = sum(1 for row in csv.reader(open(path)))
    na = row_count-5
#    print("Number of rows = ", row_count)    
    # Vibration Frequency of Molecule
    freqs = pd.read_csv(path,sep=' |\t',engine='python',skiprows=row_count-3,nrows=1,header=None)
    sz = freqs.shape[1]
#    print("Number of normal mode vibration = ",sz)
    
    is_linear = np.nan
    if 3*na - 5 == sz:
        is_linear = False
    elif 3*na - 6 == sz:
        is_linear = True

#    print("The molecule is linear : ", is_linear)
    
    stats = pd.read_csv(path,sep=' |\t',engine='python',skiprows=1,nrows=1,header=None)
    stats = stats.loc[:,2:]
    stats.columns = ['rc_A','rc_B','rc_C','mu','alpha','homo','lumo','gap','r2','zpve','U0','U','H','G','Cv']
    
    stats['freqs_min'] = freqs.values[0].min()
    stats['freqs_max'] = freqs.values[0].max()
    stats['freqs_mean'] = freqs.values[0].mean()
    stats['linear'] = is_linear
    
    mm = pd.read_csv(path,sep='\t',engine='python', skiprows=2, skipfooter=3, names=range(5))[4]
    if mm.dtype == 'O':
        mm = mm.str.replace('*^','e',regex=False).astype(float)
    stats['mulliken_min'] = mm.min()
    stats['mulliken_max'] = mm.max()
    stats['mulliken_mean'] = mm.mean()
    
    stats['molecule_name'] = molecule_name
#    print(stats)
#    print(mm, '\n')

    data = pd.merge(both.loc[[molecule_name],:].reset_index(drop=True), stats, how='left', on='molecule_name')
    data['mulliken_atom_0'] = mm[data['atom_index_0'].values].values
    data['mulliken_atom_1'] = mm[data['atom_index_1'].values].values
    
#   print(data)
    return data

In [11]:
all_files = os.listdir(PATH_BASE/'structures')
all_files = all_files[:100]
print(all_files)

data = processQM9_list(all_files)

# %time result = Parallel(n_jobs=4, temp_folder=PATH_WORKING)(processQM9_list(all_files[idx:min(idx+1, len(all_files))]) for idx in tqdm(range(int(np.ceil(len(all_files))))))

# data = pd.concat(result)
# data = data.reset_index(drop=True)
# data.to_pickle(PATH_WORKING/'data.covs.pickle')

# print(data)

['dsgdb9nsd_000001.xyz', 'dsgdb9nsd_000002.xyz', 'dsgdb9nsd_000003.xyz', 'dsgdb9nsd_000004.xyz', 'dsgdb9nsd_000005.xyz', 'dsgdb9nsd_000007.xyz', 'dsgdb9nsd_000008.xyz', 'dsgdb9nsd_000009.xyz', 'dsgdb9nsd_000010.xyz', 'dsgdb9nsd_000011.xyz', 'dsgdb9nsd_000012.xyz', 'dsgdb9nsd_000013.xyz', 'dsgdb9nsd_000014.xyz', 'dsgdb9nsd_000015.xyz', 'dsgdb9nsd_000016.xyz', 'dsgdb9nsd_000017.xyz', 'dsgdb9nsd_000018.xyz', 'dsgdb9nsd_000019.xyz', 'dsgdb9nsd_000020.xyz', 'dsgdb9nsd_000021.xyz', 'dsgdb9nsd_000022.xyz', 'dsgdb9nsd_000023.xyz', 'dsgdb9nsd_000024.xyz', 'dsgdb9nsd_000026.xyz', 'dsgdb9nsd_000027.xyz', 'dsgdb9nsd_000028.xyz', 'dsgdb9nsd_000029.xyz', 'dsgdb9nsd_000030.xyz', 'dsgdb9nsd_000031.xyz', 'dsgdb9nsd_000032.xyz', 'dsgdb9nsd_000033.xyz', 'dsgdb9nsd_000034.xyz', 'dsgdb9nsd_000035.xyz', 'dsgdb9nsd_000036.xyz', 'dsgdb9nsd_000037.xyz', 'dsgdb9nsd_000038.xyz', 'dsgdb9nsd_000039.xyz', 'dsgdb9nsd_000040.xyz', 'dsgdb9nsd_000041.xyz', 'dsgdb9nsd_000042.xyz', 'dsgdb9nsd_000043.xyz', 'dsgdb9nsd_0000

KeyError: "None of [Index(['dsgdb9nsd_000058'], dtype='object', name='molecule_name')] are in the [index]"

In [10]:
# data = pd.read_pickle(PATH_WORKING/'data.covs.pickle')
data

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,rc_A,rc_B,rc_C,mu,...,Cv,freqs_min,freqs_max,freqs_mean,linear,mulliken_min,mulliken_max,mulliken_mean,mulliken_atom_0,mulliken_atom_1
0,0,dsgdb9nsd_000001,1,0,1JHC,84.807600,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133921,-0.535689
1,1,dsgdb9nsd_000001,1,2,2JHH,-11.257000,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133921,0.133922
2,2,dsgdb9nsd_000001,1,3,2JHH,-11.254800,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133921,0.133923
3,3,dsgdb9nsd_000001,1,4,2JHH,-11.254300,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133921,0.133923
4,4,dsgdb9nsd_000001,2,0,1JHC,84.807400,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133922,-0.535689
5,5,dsgdb9nsd_000001,2,3,2JHH,-11.254100,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133922,0.133923
6,6,dsgdb9nsd_000001,2,4,2JHH,-11.254800,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133922,0.133923
7,7,dsgdb9nsd_000001,3,0,1JHC,84.809300,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133923,-0.535689
8,8,dsgdb9nsd_000001,3,4,2JHH,-11.254300,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133923,0.133923
9,9,dsgdb9nsd_000001,4,0,1JHC,84.809500,157.71180,157.709970,157.706990,0.0000,...,6.469,1341.3070,3151.7078,2182.525478,True,-0.535689,0.133923,0.000000e+00,0.133923,-0.535689


In [26]:
all_files = os.listdir(PATH_BASE/'structures')

%time result = Parallel(n_jobs=4, temp_folder=PATH_WORKING)(delayed(processQM9_list)(all_files[idx:min(idx+1, len(all_files))]) for idx in tqdm(range(int(np.ceil(len(all_files))))))

data = pd.concat(result)
data = data.reset_index(drop=True)
data.to_pickle(PATH_WORKING/'data.covs.pickle')

print(data)

HBox(children=(IntProgress(value=0, max=99), HTML(value='')))


Wall time: 688 ms
Empty DataFrame
Columns: []
Index: []


In [7]:
def processQM9_file(filename):
    path = PATH_QM9/filename
    molecule_name = filename[:-4]
    
    row_count = sum(1 for row in csv.reader(open(path)))
    na = row_count-5
    freqs = pd.read_csv(path,sep=' |\t',engine='python',skiprows=row_count-3,nrows=1,header=None)
    sz = freqs.shape[1]
    is_linear = np.nan
    if 3*na - 5 == sz:
        is_linear = False
    elif 3*na - 6 == sz:
        is_linear = True
    
    stats = pd.read_csv(path,sep=' |\t',engine='python',skiprows=1,nrows=1,header=None)
    stats = stats.loc[:,2:]
    stats.columns = ['rc_A','rc_B','rc_C','mu','alpha','homo','lumo','gap','r2','zpve','U0','U','H','G','Cv']
    
    stats['freqs_min'] = freqs.values[0].min()
    stats['freqs_max'] = freqs.values[0].max()
    stats['freqs_mean'] = freqs.values[0].mean()
    stats['linear'] = is_linear
    
    mm = pd.read_csv(path,sep='\t',engine='python', skiprows=2, skipfooter=3, names=range(5))[4]
    if mm.dtype == 'O':
        mm = mm.str.replace('*^','e',regex=False).astype(float)
    stats['mulliken_min'] = mm.min()
    stats['mulliken_max'] = mm.max()
    stats['mulliken_mean'] = mm.mean()
    
    stats['molecule_name'] = molecule_name
    
    data = pd.merge(both.loc[[molecule_name],:].reset_index(drop=True), stats, how='left', on='molecule_name')
    data['mulliken_atom_0'] = mm[data['atom_index_0'].values].values
    data['mulliken_atom_1'] = mm[data['atom_index_1'].values].values
    
    return data

def processQM9_list(files):
    df = pd.DataFrame()
    for i,filename in enumerate(files):
        stats = processQM9_file(filename)
        df = pd.concat([df, stats], axis = 0)
    return df

In [59]:
for idx in tqdm(range(int(np.ceil(len(all_files)/10)))):
    result = all_files[10*idx:min(10*(idx+1), len(all_files))] 
    print(result)

HBox(children=(IntProgress(value=0, max=10), HTML(value='')))

['dsgdb9nsd_000001.xyz', 'dsgdb9nsd_000002.xyz', 'dsgdb9nsd_000003.xyz', 'dsgdb9nsd_000004.xyz', 'dsgdb9nsd_000005.xyz', 'dsgdb9nsd_000007.xyz', 'dsgdb9nsd_000008.xyz', 'dsgdb9nsd_000009.xyz', 'dsgdb9nsd_000010.xyz', 'dsgdb9nsd_000011.xyz']
['dsgdb9nsd_000012.xyz', 'dsgdb9nsd_000013.xyz', 'dsgdb9nsd_000014.xyz', 'dsgdb9nsd_000015.xyz', 'dsgdb9nsd_000016.xyz', 'dsgdb9nsd_000017.xyz', 'dsgdb9nsd_000018.xyz', 'dsgdb9nsd_000019.xyz', 'dsgdb9nsd_000020.xyz', 'dsgdb9nsd_000021.xyz']
['dsgdb9nsd_000022.xyz', 'dsgdb9nsd_000023.xyz', 'dsgdb9nsd_000024.xyz', 'dsgdb9nsd_000025.xyz', 'dsgdb9nsd_000026.xyz', 'dsgdb9nsd_000027.xyz', 'dsgdb9nsd_000028.xyz', 'dsgdb9nsd_000029.xyz', 'dsgdb9nsd_000030.xyz', 'dsgdb9nsd_000031.xyz']
['dsgdb9nsd_000032.xyz', 'dsgdb9nsd_000033.xyz', 'dsgdb9nsd_000034.xyz', 'dsgdb9nsd_000035.xyz', 'dsgdb9nsd_000036.xyz', 'dsgdb9nsd_000037.xyz', 'dsgdb9nsd_000038.xyz', 'dsgdb9nsd_000039.xyz', 'dsgdb9nsd_000040.xyz', 'dsgdb9nsd_000041.xyz']
['dsgdb9nsd_000042.xyz', 'dsgdb9nsd_

In [61]:
for idx in tqdm(range(int(np.ceil(len(all_files))))):
    result = processQM9_list(all_files[idx:min((idx+1), len(all_files))])
    print(result)
    
#print(data)

HBox(children=(IntProgress(value=0, max=99), HTML(value='')))

   id     molecule_name  atom_index_0  atom_index_1  type  \
0   0  dsgdb9nsd_000001             1             0  1JHC   
1   1  dsgdb9nsd_000001             1             2  2JHH   
2   2  dsgdb9nsd_000001             1             3  2JHH   
3   3  dsgdb9nsd_000001             1             4  2JHH   
4   4  dsgdb9nsd_000001             2             0  1JHC   
5   5  dsgdb9nsd_000001             2             3  2JHH   
6   6  dsgdb9nsd_000001             2             4  2JHH   
7   7  dsgdb9nsd_000001             3             0  1JHC   
8   8  dsgdb9nsd_000001             3             4  2JHH   
9   9  dsgdb9nsd_000001             4             0  1JHC   

   scalar_coupling_constant      rc_A       rc_B       rc_C   mu  ...     Cv  \
0                   84.8076  157.7118  157.70997  157.70699  0.0  ...  6.469   
1                  -11.2570  157.7118  157.70997  157.70699  0.0  ...  6.469   
2                  -11.2548  157.7118  157.70997  157.70699  0.0  ...  6.469   
3       

[27 rows x 30 columns]
   id     molecule_name  atom_index_0  atom_index_1  type  \
0  55  dsgdb9nsd_000008             5             0  2JHC   
1  54  dsgdb9nsd_000008             4             5  3JHH   
2  50  dsgdb9nsd_000008             3             0  1JHC   
3  52  dsgdb9nsd_000008             3             5  3JHH   
4  51  dsgdb9nsd_000008             3             4  2JHH   
5  49  dsgdb9nsd_000008             2             5  3JHH   
6  48  dsgdb9nsd_000008             2             4  2JHH   
7  47  dsgdb9nsd_000008             2             3  2JHH   
8  46  dsgdb9nsd_000008             2             0  1JHC   
9  53  dsgdb9nsd_000008             4             0  1JHC   

   scalar_coupling_constant       rc_A      rc_B      rc_C      mu  ...  \
0                  -1.36995  127.83497  24.85872  23.97872  1.5258  ...   
1                  13.78650  127.83497  24.85872  23.97872  1.5258  ...   
2                  87.62530  127.83497  24.85872  23.97872  1.5258  ...   
3    

[12 rows x 30 columns]
    id     molecule_name  atom_index_0  atom_index_1  type  \
0   94  dsgdb9nsd_000011             5             6  3JHH   
1   96  dsgdb9nsd_000011             6             1  1JHC   
2   95  dsgdb9nsd_000011             6             0  2JHC   
3   93  dsgdb9nsd_000011             5             1  2JHC   
4   84  dsgdb9nsd_000011             3             1  2JHC   
5   91  dsgdb9nsd_000011             4             6  3JHH   
6   90  dsgdb9nsd_000011             4             5  2JHH   
7   89  dsgdb9nsd_000011             4             1  2JHC   
8   88  dsgdb9nsd_000011             4             0  1JHC   
9   87  dsgdb9nsd_000011             3             6  3JHH   
10  86  dsgdb9nsd_000011             3             5  2JHH   
11  85  dsgdb9nsd_000011             3             4  2JHH   
12  83  dsgdb9nsd_000011             3             0  1JHC   
13  92  dsgdb9nsd_000011             5             0  1JHC   

    scalar_coupling_constant      rc_A      rc

[43 rows x 30 columns]
     id     molecule_name  atom_index_0  atom_index_1  type  \
0   165  dsgdb9nsd_000014             6             1  1JHC   
1   172  dsgdb9nsd_000014             8             1  2JHC   
2   171  dsgdb9nsd_000014             8             0  3JHC   
3   170  dsgdb9nsd_000014             7             8  3JHH   
4   169  dsgdb9nsd_000014             7             1  1JHC   
5   168  dsgdb9nsd_000014             7             0  2JHC   
6   167  dsgdb9nsd_000014             6             8  3JHH   
7   166  dsgdb9nsd_000014             6             7  2JHH   
8   164  dsgdb9nsd_000014             6             0  2JHC   
9   151  dsgdb9nsd_000014             3             4  2JHH   
10  162  dsgdb9nsd_000014             5             6  3JHH   
11  152  dsgdb9nsd_000014             3             5  2JHH   
12  150  dsgdb9nsd_000014             3             1  2JHC   
13  153  dsgdb9nsd_000014             3             6  3JHH   
14  154  dsgdb9nsd_000014       

[18 rows x 30 columns]
         id     molecule_name  atom_index_0  atom_index_1  type  \
0   4658182  dsgdb9nsd_000016             4             6  3JHH   
1   4658191  dsgdb9nsd_000016             6             0  2JHC   
2   4658190  dsgdb9nsd_000016             5             8  3JHH   
3   4658189  dsgdb9nsd_000016             5             7  3JHH   
4   4658188  dsgdb9nsd_000016             5             6  2JHH   
5   4658187  dsgdb9nsd_000016             5             2  2JHC   
6   4658186  dsgdb9nsd_000016             5             1  1JHC   
7   4658185  dsgdb9nsd_000016             5             0  2JHC   
8   4658184  dsgdb9nsd_000016             4             8  3JHH   
9   4658183  dsgdb9nsd_000016             4             7  3JHH   
10  4658181  dsgdb9nsd_000016             4             5  3JHH   
11  4658180  dsgdb9nsd_000016             4             2  2JHC   
12  4658179  dsgdb9nsd_000016             4             1  2JHC   
13  4658178  dsgdb9nsd_000016          

[33 rows x 30 columns]
     id     molecule_name  atom_index_0  atom_index_1  type  \
0   186  dsgdb9nsd_000017             6             1  1JHC   
1   185  dsgdb9nsd_000017             6             0  2JHC   
2   184  dsgdb9nsd_000017             5             6  2JHH   
3   183  dsgdb9nsd_000017             5             1  1JHC   
4   182  dsgdb9nsd_000017             5             0  2JHC   
5   181  dsgdb9nsd_000017             4             6  3JHH   
6   178  dsgdb9nsd_000017             4             0  1JHC   
7   179  dsgdb9nsd_000017             4             1  2JHC   
8   177  dsgdb9nsd_000017             3             6  3JHH   
9   176  dsgdb9nsd_000017             3             5  3JHH   
10  175  dsgdb9nsd_000017             3             4  2JHH   
11  174  dsgdb9nsd_000017             3             1  2JHC   
12  173  dsgdb9nsd_000017             3             0  1JHC   
13  180  dsgdb9nsd_000017             4             5  3JHH   

    scalar_coupling_constant   

[24 rows x 30 columns]
     id     molecule_name  atom_index_0  atom_index_1  type  \
0   224  dsgdb9nsd_000019             7             1  2JHC   
1   229  dsgdb9nsd_000019             8             2  1JHN   
2   228  dsgdb9nsd_000019             8             1  2JHC   
3   227  dsgdb9nsd_000019             8             0  3JHC   
4   226  dsgdb9nsd_000019             7             8  2JHH   
5   225  dsgdb9nsd_000019             7             2  1JHN   
6   223  dsgdb9nsd_000019             7             0  3JHC   
7   222  dsgdb9nsd_000019             6             2  3JHN   
8   220  dsgdb9nsd_000019             6             0  1JHC   
9   219  dsgdb9nsd_000019             5             6  2JHH   
10  218  dsgdb9nsd_000019             5             2  3JHN   
11  217  dsgdb9nsd_000019             5             1  2JHC   
12  216  dsgdb9nsd_000019             5             0  1JHC   
13  215  dsgdb9nsd_000019             4             6  2JHH   
14  214  dsgdb9nsd_000019       

[14 rows x 30 columns]
     id     molecule_name  atom_index_0  atom_index_1  type  \
0   246  dsgdb9nsd_000021             6             3  3JHC   
1   239  dsgdb9nsd_000021             5             2  3JHC   
2   245  dsgdb9nsd_000021             6             2  3JHC   
3   244  dsgdb9nsd_000021             6             1  2JHC   
4   243  dsgdb9nsd_000021             6             0  1JHC   
5   242  dsgdb9nsd_000021             5             7  3JHH   
6   241  dsgdb9nsd_000021             5             6  2JHH   
7   240  dsgdb9nsd_000021             5             3  3JHC   
8   238  dsgdb9nsd_000021             5             1  2JHC   
9   231  dsgdb9nsd_000021             4             1  2JHC   
10  236  dsgdb9nsd_000021             4             7  3JHH   
11  235  dsgdb9nsd_000021             4             6  2JHH   
12  234  dsgdb9nsd_000021             4             5  2JHH   
13  233  dsgdb9nsd_000021             4             3  3JHC   
14  232  dsgdb9nsd_000021       

[58 rows x 30 columns]
         id     molecule_name  atom_index_0  atom_index_1  type  \
0   4658231  dsgdb9nsd_000022             6             7  3JHH   
1   4658225  dsgdb9nsd_000022             5             2  3JHC   
2   4658226  dsgdb9nsd_000022             5             6  2JHH   
3   4658227  dsgdb9nsd_000022             5             7  3JHH   
4   4658228  dsgdb9nsd_000022             6             0  1JHC   
5   4658243  dsgdb9nsd_000022             8            10  2JHH   
6   4658229  dsgdb9nsd_000022             6             1  2JHC   
7   4658242  dsgdb9nsd_000022             8             9  2JHH   
8   4658241  dsgdb9nsd_000022             8             2  1JHC   
9   4658230  dsgdb9nsd_000022             6             2  3JHC   
10  4658240  dsgdb9nsd_000022             8             1  2JHC   
11  4658238  dsgdb9nsd_000022             7            11  3JHH   
12  4658237  dsgdb9nsd_000022             7            10  3JHH   
13  4658236  dsgdb9nsd_000022          

[37 rows x 30 columns]
    id     molecule_name  atom_index_0  atom_index_1  type  \
0  288  dsgdb9nsd_000023             4             0  3JHC   
1  290  dsgdb9nsd_000023             4             2  1JHC   
2  291  dsgdb9nsd_000023             5             0  2JHC   
3  292  dsgdb9nsd_000023             5             1  3JHC   
4  293  dsgdb9nsd_000023             5             3  1JHC   
5  289  dsgdb9nsd_000023             4             1  2JHC   

   scalar_coupling_constant  rc_A      rc_B      rc_C   mu  ...      Cv  \
0                   5.43401   0.0  4.425973  4.425973  0.0  ...  15.312   
1                 201.88400   0.0  4.425973  4.425973  0.0  ...  15.312   
2                  18.07090   0.0  4.425973  4.425973  0.0  ...  15.312   
3                   5.43401   0.0  4.425973  4.425973  0.0  ...  15.312   
4                 201.88400   0.0  4.425973  4.425973  0.0  ...  15.312   
5                  18.07090   0.0  4.425973  4.425973  0.0  ...  15.312   

   freqs_min  fr

KeyError: "None of [Index(['dsgdb9nsd_000025'], dtype='object', name='molecule_name')] are in the [index]"

In [14]:
all_files = os.listdir(PATH_BASE/'structures')

%time result = Parallel(n_jobs=2, temp_folder=PATH_WORKING)(delayed(processQM9_list)(all_files[100*idx:min(100*(idx+1), len(all_files))]) for idx in tqdm(range(int(np.ceil(len(all_files)/100)))))

data = pd.concat(result)
data = data.reset_index(drop=True)
data.to_pickle(PATH_WORKING/'data.covs.pickle')

HBox(children=(IntProgress(value=0, max=1), HTML(value='')))




KeyError: "None of [Index(['dsgdb9nsd_000025'], dtype='object', name='molecule_name')] are in the [index]"

NameError: name 'result' is not defined