* This notebook aims to predict the `scalar_coupling_constants` between atom pairs in molecules,given the two atom types (e.g., C and H), the coupling type (e.g., 2JHC).

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.chdir(r"D:\Random_Datasets\Kaggle\champs-scalar-coupling")
import xgboost as xgb
import pickle
from sklearn.model_selection import train_test_split

# 1. Importing the Datasets

In [2]:
# importing the dataset
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
dipole_df = pd.read_csv("dipole_moments.csv")
magnetic_shielding_tensors_df = pd.read_csv("magnetic_shielding_tensors.csv")
potential_energy_df = pd.read_csv("potential_energy.csv")
structures_df = pd.read_csv("structures.csv")

# printing the size of datasets
print("Length of train_df = {}".format(len(train_df)))
print("Length of test_df = {}".format(len(test_df)))
print("Length of dipole_df = {}".format(len(dipole_df)))
print("Length of magnetic_shielding_tensors_df = {}".format(len(magnetic_shielding_tensors_df)))
print("Length of potential_energy_df = {}".format(len(potential_energy_df)))
print("Length of structures_df = {}".format(len(structures_df)))

Length of train_df = 4658147
Length of test_df = 2505542
Length of dipole_df = 85003
Length of magnetic_shielding_tensors_df = 1533537
Length of potential_energy_df = 85003
Length of structures_df = 2358657


### 1.1 train_df
- *molecule_name* = name of the molecule where the coupling constant originates. The corresponding values of XYZ are there in **structures.df**.
- *atom_index_0*, *atom_index_1* = atom pair creating coupling constant.
- *type* = coupling type
- *scaler_coupling_constant* = **predictor**

In [3]:
train_df.head() 

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant
0,0,dsgdb9nsd_000001,1,0,1JHC,84.8076
1,1,dsgdb9nsd_000001,1,2,2JHH,-11.257
2,2,dsgdb9nsd_000001,1,3,2JHH,-11.2548
3,3,dsgdb9nsd_000001,1,4,2JHH,-11.2543
4,4,dsgdb9nsd_000001,2,0,1JHC,84.8074


In [4]:
test_df.head()

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type
0,4658147,dsgdb9nsd_000004,2,0,2JHC
1,4658148,dsgdb9nsd_000004,2,1,1JHC
2,4658149,dsgdb9nsd_000004,2,3,3JHH
3,4658150,dsgdb9nsd_000004,3,0,1JHC
4,4658151,dsgdb9nsd_000004,3,1,2JHC


In [5]:
def info_df(data):
    """
    returns the dataset of unique and missing values
    """
    return pd.DataFrame({
        "unique_values": data.nunique(),
        "null_values": data.isnull().sum()
    })

In [6]:
train_info = info_df(train_df)
train_info

Unnamed: 0,unique_values,null_values
id,4658147,0
molecule_name,85003,0
atom_index_0,29,0
atom_index_1,29,0
type,8,0
scalar_coupling_constant,2182935,0


In [7]:
test_info = info_df(test_df)
test_info

Unnamed: 0,unique_values,null_values
id,2505542,0
molecule_name,45772,0
atom_index_0,28,0
atom_index_1,29,0
type,8,0


### dipole_df
Contains the molecular electric dipole moments. These are three dimensional vectors that indicate the charge distribution in the molecule.
X, Y, Z are three components of dipole moment.

In [8]:
dipole_df.head()

Unnamed: 0,molecule_name,X,Y,Z
0,dsgdb9nsd_000001,0.0,0.0,0.0
1,dsgdb9nsd_000002,-0.0002,0.0,1.6256
2,dsgdb9nsd_000003,0.0,0.0,-1.8511
3,dsgdb9nsd_000005,0.0,0.0,-2.8937
4,dsgdb9nsd_000007,0.0,0.0,0.0


In [9]:
dipole_info = info_df(dipole_df)
dipole_info

Unnamed: 0,unique_values,null_values
molecule_name,85003,0
X,52494,0
Y,46825,0
Z,34847,0


The `dipole_df` contains 85003 values, unique for each molecule. This can merged with the main dataframe to increase the predictors.

## magnetic_shielding_tensors_df
contains the magnetic shielding tensors for all atoms in the molecules. The first column `(molecule_name)` contains the molecule name, the second column `(atom_index)` contains the index of the atom in the molecule, the third to eleventh columns contain the `XX`, `YX`, `ZX`, `XY`, `YY`, `ZY`, `XZ`, `YZ` and `ZZ` elements of the tensor/matrix respectively.

In [10]:
magnetic_shielding_tensors_df.head()

Unnamed: 0,molecule_name,atom_index,XX,YX,ZX,XY,YY,ZY,XZ,YZ,ZZ
0,dsgdb9nsd_000001,0,195.315,0.0,-0.0001,0.0,195.317,0.0007,-0.0001,0.0007,195.317
1,dsgdb9nsd_000001,1,31.341,-1.2317,4.0544,-1.2317,28.9546,-1.7173,4.0546,-1.7173,34.0861
2,dsgdb9nsd_000001,2,31.5814,1.2173,-4.1474,1.2173,28.9036,-1.6036,-4.1476,-1.6036,33.8967
3,dsgdb9nsd_000001,3,31.5172,4.1086,1.2723,4.1088,33.9068,1.695,1.2724,1.6951,28.9579
4,dsgdb9nsd_000001,4,31.4029,-4.0942,-1.1793,-4.0944,34.0776,1.6259,-1.1795,1.626,28.9013


In [11]:
mst_info_df = info_df(magnetic_shielding_tensors_df)
mst_info_df

Unnamed: 0,unique_values,null_values
molecule_name,85003,0
atom_index,29,0
XX,523980,0
YX,549223,0
ZX,457665,0
XY,546475,0
YY,532658,0
ZY,436576,0
XZ,463352,0
YZ,444112,0


## Note
How to use atom index in molecule name as only one index is given?

## potential_energy_df
Contains the potential energy of the molecules. The first column `(molecule_name)` contains the name of the molecule, the second column `(potential_energy)` contains the potential energy of the molecule.

In [12]:
potential_energy_df.head()

Unnamed: 0,molecule_name,potential_energy
0,dsgdb9nsd_000001,-40.52368
1,dsgdb9nsd_000002,-56.56025
2,dsgdb9nsd_000003,-76.42608
3,dsgdb9nsd_000005,-93.42849
4,dsgdb9nsd_000007,-79.83869


In [13]:
pe_info_df = info_df(potential_energy_df)
pe_info_df

Unnamed: 0,unique_values,null_values
molecule_name,85003,0
potential_energy,54596,0


## sturctures_df
X, Y and Z  = cartesian coordinates

In [14]:
structures_df.sample(5)

Unnamed: 0,molecule_name,atom_index,atom,x,y,z
222171,dsgdb9nsd_014089,0,O,0.020187,1.285568,-0.025301
2204854,dsgdb9nsd_123004,20,H,-3.217167,-4.115411,2.146514
1908217,dsgdb9nsd_108429,6,N,-2.405384,-1.83792,-1.261009
1679200,dsgdb9nsd_096680,9,H,1.030286,1.896786,-0.080596
1039951,dsgdb9nsd_062883,6,C,0.846766,-1.912934,0.2069


In [15]:
structures_info_df = info_df(structures_df)
structures_info_df

Unnamed: 0,unique_values,null_values
molecule_name,130775,0
atom_index,29,0
atom,5,0
x,2358441,0
y,2358364,0
z,2358421,0


---
# 2. Feature Engineering

## 2.1 Merging the datasets

In [16]:
def map_atom_info(df, atom_idx):
    """
    works in three steps :-
    1. merges train_df/test_df with structures_df
    2. drops the atom_index on which it is merged
    3. renames the columns
    """
    df = pd.merge(df, structures_df, how = 'left',
                  left_on  = ['molecule_name', f'atom_index_{atom_idx}'],
                  right_on = ['molecule_name',  'atom_index'])
    
    df = df.drop('atom_index', axis=1)
    df = df.rename(columns={'atom': f'atom_{atom_idx}',
                            'x': f'x_{atom_idx}',
                            'y': f'y_{atom_idx}',
                            'z': f'z_{atom_idx}'})
    return df


# implementing above function
train_df = map_atom_info(train_df, 0)
train_df = map_atom_info(train_df, 1)

test_df = map_atom_info(test_df, 0)
test_df = map_atom_info(test_df, 1)

In [17]:
train_df.sample(4)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,atom_0,x_0,y_0,z_0,atom_1,x_1,y_1,z_1
2575568,2575568,dsgdb9nsd_077929,19,7,2JHC,-1.65605,H,-1.315682,1.021588,3.211969,C,-0.489892,-0.386776,2.174721
3420336,3420336,dsgdb9nsd_098761,15,16,3JHH,0.092005,H,-0.031678,-3.052639,0.613046,H,-2.1763,-4.861805,-0.508995
1663609,1663609,dsgdb9nsd_053975,17,5,1JHC,84.4851,H,-2.023858,-1.490915,-0.960655,C,-1.458258,-0.967919,-1.735948
225039,225039,dsgdb9nsd_008176,18,6,1JHC,85.7334,H,-1.75951,-0.528723,-1.122144,C,-1.306304,-0.765165,-0.147735


In [18]:
test_df.sample(4)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,atom_0,x_0,y_0,z_0,atom_1,x_1,y_1,z_1
1549032,6207179,dsgdb9nsd_085495,9,2,3JHC,H,1.179027,1.308974,-0.754086,C,0.356169,-0.065225,1.530186
1100177,5758324,dsgdb9nsd_064209,9,3,3JHC,H,0.825489,2.00248,0.249167,C,1.142225,-0.456219,-0.93179
91946,4750093,dsgdb9nsd_006632,11,1,3JHC,H,0.700173,-3.073476,-0.399243,C,-0.024295,0.022245,0.002855
983686,5641833,dsgdb9nsd_058880,9,12,3JHH,H,0.879281,2.038315,0.299221,H,0.374108,-0.011234,-1.007299


### Columns after merging the datasets :-
- molecule name
- atom_index_0
- atom_index_1
- type = presumably scaling coupling types. `T` in evaluation formula
- scalar_coupling_constant = **Predictor**
- atom_0
- x_0, y_0, z_0 = cartesian places of atom_0
- atom_1
- x_1, y_1, z_1 = cartesian places of atom_1

## 2.2 Distances

In [19]:
####################################################
#### Magnitude of distance b/w two atoms ###########
####################################################

def dist_magnitude(df):
    x = df["x_0"] - df["x_1"]
    y = df["y_0"] - df["y_1"]
    z = df["z_0"] - df["z_1"]
    
    dist = (x**2 + y**2 + z**2)**0.5
    
    return dist

train_df["abs_distance"] = dist_magnitude(train_df)
test_df["abs_distance"] = dist_magnitude(test_df)

###################################################
### individual cartesian distance #################
###################################################

def dist_individual(df):
    df["x_dist"] = df["x_0"] - df["x_1"]
    df["y_dist"] = df["y_0"] - df["y_1"]
    df["z_dist"] = df["z_0"] - df["z_1"]
    
    return df

train_df = dist_individual(train_df)
test_df = dist_individual(test_df)

In [20]:
train_df.head()

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,atom_0,x_0,y_0,z_0,atom_1,x_1,y_1,z_1,abs_distance,x_dist,y_dist,z_dist
0,0,dsgdb9nsd_000001,1,0,1JHC,84.8076,H,0.00215,-0.006031,0.001976,C,-0.012698,1.085804,0.008001,1.091953,0.014849,-1.091835,-0.006025
1,1,dsgdb9nsd_000001,1,2,2JHH,-11.257,H,0.00215,-0.006031,0.001976,H,1.011731,1.463751,0.000277,1.78312,-1.00958,-1.469782,0.0017
2,2,dsgdb9nsd_000001,1,3,2JHH,-11.2548,H,0.00215,-0.006031,0.001976,H,-0.540815,1.447527,-0.876644,1.783147,0.542965,-1.453558,0.87862
3,3,dsgdb9nsd_000001,1,4,2JHH,-11.2543,H,0.00215,-0.006031,0.001976,H,-0.523814,1.437933,0.906397,1.783157,0.525964,-1.443964,-0.904421
4,4,dsgdb9nsd_000001,2,0,1JHC,84.8074,H,1.011731,1.463751,0.000277,C,-0.012698,1.085804,0.008001,1.091952,1.024429,0.377947,-0.007724


In [21]:
test_df.head()

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,atom_0,x_0,y_0,z_0,atom_1,x_1,y_1,z_1,abs_distance,x_dist,y_dist,z_dist
0,4658147,dsgdb9nsd_000004,2,0,2JHC,H,-1.661639,0.0,1.0,C,0.599539,0.0,1.0,2.261178,-2.261178,0.0,0.0
1,4658148,dsgdb9nsd_000004,2,1,1JHC,H,-1.661639,0.0,1.0,C,-0.599539,0.0,1.0,1.062099,-1.062099,0.0,0.0
2,4658149,dsgdb9nsd_000004,2,3,3JHH,H,-1.661639,0.0,1.0,H,1.661639,0.0,1.0,3.323277,-3.323277,0.0,0.0
3,4658150,dsgdb9nsd_000004,3,0,1JHC,H,1.661639,0.0,1.0,C,0.599539,0.0,1.0,1.062099,1.062099,0.0,0.0
4,4658151,dsgdb9nsd_000004,3,1,2JHC,H,1.661639,0.0,1.0,C,-0.599539,0.0,1.0,2.261178,2.261178,0.0,0.0


## 2.3 Descriptive Statistics features
These features will be decided on the `type` column. The features values created in train_df will used in the testset to make the model more robust. The following features will be created :-
- mean_value
- difference from the mean
- std_dev
- z_score

In [22]:
train_info = info_df(train_df)
train_info

Unnamed: 0,unique_values,null_values
id,4658147,0
molecule_name,85003,0
atom_index_0,29,0
atom_index_1,29,0
type,8,0
scalar_coupling_constant,2182935,0
atom_0,1,0
x_0,785811,0
y_0,785790,0
z_0,785809,0


In [23]:
features_df = train_df.groupby("type").agg({
    "x_0": ["mean", "median", "max", "min", "std"],
    "y_0": ["mean", "median", "max", "min", "std"],
    "z_0": ["mean", "median", "max", "min", "std"],
    "x_1": ["mean", "median", "max", "min", "std"],
    "y_1": ["mean", "median", "max", "min", "std"],
    "z_1": ["mean", "median", "max", "min", "std"],
    "x_dist": ["mean", "median", "max", "min", "std"],
    "y_dist": ["mean", "median", "max", "min", "std"],
    "z_dist": ["mean", "median", "max", "min", "std"],
    "abs_distance": ["mean", "median", "max", "min", "std"]
})

features_df.columns = ["x_0_mean", "x_0_median", "x_0_max", "x_0_min", "x_0_std", 
                       "y_0_mean", "y_0_median", "y_0_max", "y_0_min", "y_0_std",
                       "z_0_mean", "z_0_median", "z_0_max", "z_0_min", "z_0_std",
                       "abs_distance_mean", "abs_distance_median", "abs_distance_max", "abs_distance_min", "abs_distance_std",
                       "x_dist_mean", "x_dist_median", "x_dist_max", "x_dist_min", "x_dist_std",
                       "y_dist_mean", "y_dist_median", "y_dist_max", "y_dist_min", "y_dist_std",
                       "z_dist_mean", "z_dist_median", "z_dist_max", "z_dist_min", "z_dist_std",
                       "x_1_mean", "x_1_median", "x_1_max", "x_1_min", "x_1_std", 
                       "y_1_mean", "y_1_median", "y_1_max", "y_1_min", "y_1_std",
                       "z_1_mean", "z_1_median", "z_1_max", "z_1_min", "z_1_std"
                      ]

features_df = features_df.reset_index()

In [24]:
features_df

Unnamed: 0,type,x_0_mean,x_0_median,x_0_max,x_0_min,x_0_std,y_0_mean,y_0_median,y_0_max,y_0_min,...,y_1_mean,y_1_median,y_1_max,y_1_min,y_1_std,z_1_mean,z_1_median,z_1_max,z_1_min,z_1_std
0,1JHC,0.08493,0.058999,9.38224,-9.234889,1.771849,-0.186959,-0.249816,9.714469,-9.49416,...,-0.003067,-0.000335,1.11588,-1.115035,0.66248,1.0929,1.094027,1.247942,1.060901,0.006819
1,1JHN,0.091999,0.220282,6.608726,-7.942035,1.882975,-0.441424,-0.441583,6.925556,-8.038531,...,-0.019069,-0.008793,1.028183,-1.019802,0.539832,1.012865,1.01225,1.142078,1.002241,0.005905
2,2JHC,0.096261,0.12693,9.38224,-9.234889,1.801936,-0.373901,-0.415277,9.714469,-9.49416,...,-0.002155,-0.004854,2.443864,-2.452288,1.218074,2.190397,2.183838,2.521012,1.792659,0.085421
3,2JHH,0.244981,0.548284,8.22077,-9.21897,1.643658,0.202182,0.958404,8.118444,-8.917476,...,-0.232781,-0.373,1.852715,-1.848904,1.113075,1.774895,1.772448,1.96934,1.513358,0.0233
4,2JHN,0.097695,0.095681,8.159108,-7.933077,1.916708,-0.294241,-0.402484,8.050028,-7.350396,...,0.0036,0.002435,2.335901,-2.334142,1.159502,2.135921,2.134787,2.360334,1.83161,0.068134
5,3JHC,0.118323,0.135798,9.38224,-9.234889,1.748925,-0.270132,-0.33304,9.714469,-9.49416,...,-0.001876,-0.00778,3.830196,-3.85284,1.713129,3.079049,3.071731,3.924354,2.033387,0.312258
6,3JHH,0.048216,-0.00884,7.044957,-7.933077,1.710285,-0.070418,-0.204659,7.960801,-8.112643,...,-0.080573,-0.069856,3.14134,-3.138241,1.489444,2.702122,2.582409,3.177613,2.06459,0.264836
7,3JHN,0.046278,0.047638,8.072174,-7.735907,1.776123,-0.240304,-0.313551,7.633978,-7.316332,...,0.006778,-0.002464,3.715544,-3.735911,1.654663,3.050378,3.117424,3.861395,2.146172,0.301125


In [25]:
train_df = pd.merge(train_df, features_df, on = "type", how = "inner")
test_df = pd.merge(test_df, features_df, on = "type", how = "inner")

In [26]:
train_df.sample(4)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,atom_0,x_0,y_0,z_0,...,y_1_mean,y_1_median,y_1_max,y_1_min,y_1_std,z_1_mean,z_1_median,z_1_max,z_1_min,z_1_std
1426354,720685,dsgdb9nsd_023394,12,5,2JHC,2.64037,H,-2.304567,-1.382418,-2.006515,...,-0.002155,-0.004854,2.443864,-2.452288,1.218074,2.190397,2.183838,2.521012,1.792659,0.085421
60898,380145,dsgdb9nsd_012921,16,4,1JHC,84.0219,H,-0.091086,-1.462102,-2.323377,...,-0.003067,-0.000335,1.11588,-1.115035,0.66248,1.0929,1.094027,1.247942,1.060901,0.006819
1542806,1192118,dsgdb9nsd_039631,17,4,2JHC,-0.51264,H,-3.118122,0.089996,0.733107,...,-0.002155,-0.004854,2.443864,-2.452288,1.218074,2.190397,2.183838,2.521012,1.792659,0.085421
2138868,3582553,dsgdb9nsd_102938,11,1,2JHC,-0.883646,H,-0.53574,1.922864,0.99513,...,-0.002155,-0.004854,2.443864,-2.452288,1.218074,2.190397,2.183838,2.521012,1.792659,0.085421


---
## 3. Model 01 - XGBoost

In [27]:
train_df["type"] = train_df["type"].map({
    '1JHC': 0, '2JHH': 1, '1JHN': 2, '2JHN': 3, '2JHC': 4, '3JHH': 5, '3JHC': 6, '3JHN': 7
})

train_df["atom_0"] = train_df["atom_0"].map({
    "H": 0
})

train_df["atom_1"] = train_df["atom_1"].map({
    'C': 0, 'H': 1, 'N': 0
})

X = train_df.drop(columns = ["id", "molecule_name", "scalar_coupling_constant"])
y = train_df["scalar_coupling_constant"]

x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [28]:
# preprocessing test_df
id_ = test_df["id"]
test_df = test_df.drop(columns = ["id", "molecule_name"])

test_df["type"] = test_df["type"].map({
    '1JHC': 0, '2JHH': 1, '1JHN': 2, '2JHN': 3, '2JHC': 4, '3JHH': 5, '3JHC': 6, '3JHN': 7
})

test_df["atom_0"] = test_df["atom_0"].map({
    "H": 0
})

test_df["atom_1"] = test_df["atom_1"].map({
    'C': 0, 'H': 1, 'N': 0
})

In [29]:
train_df.to_csv("train_2.csv")
test_df.to_csv("test_2.csv")

In [None]:
print("x_train shape = {}".format(x_train.shape))
print("x_valid shape = {}".format(x_valid.shape))
print("y_train shape = {}".format(y_train.shape))
print("y_valid shape = {}".format(y_valid.shape))

In [8]:
# converting the dataset into DMatrices
dtrain = xgb.DMatrix(x_train, label = y_train)
dvalid = xgb.DMatrix(x_valid, label = y_valid)

In [9]:
# training the Model
params = {
    "eta":0.1,
    "alpha": 0.1,
    "max_leaves": 128,
    "max_depth": 9,
    "n_estimators": 100,
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 1,
    "gamma": 0,
    "eval_metric": "mae",
    "nthreads": 4,
    "objective": "reg:linear",
    "silent": 1
}

model_train = xgb.train(params, dtrain, 5000, evals = [(dvalid, "valid_set")], verbose_eval=10, early_stopping_rounds = 40)

[0]	valid_set-mae:16.4455
Will train until valid_set-mae hasn't improved in 40 rounds.
[10]	valid_set-mae:6.72256
[20]	valid_set-mae:3.49046
[30]	valid_set-mae:2.61448
[40]	valid_set-mae:2.47224
[50]	valid_set-mae:2.43631
[60]	valid_set-mae:2.42
[70]	valid_set-mae:2.40067
[80]	valid_set-mae:2.38743
[90]	valid_set-mae:2.37652
[100]	valid_set-mae:2.36598
[110]	valid_set-mae:2.35516
[120]	valid_set-mae:2.3465
[130]	valid_set-mae:2.33778
[140]	valid_set-mae:2.33101
[150]	valid_set-mae:2.32556
[160]	valid_set-mae:2.31953
[170]	valid_set-mae:2.31136
[180]	valid_set-mae:2.30554
[190]	valid_set-mae:2.30134
[200]	valid_set-mae:2.29491
[210]	valid_set-mae:2.29197
[220]	valid_set-mae:2.28643
[230]	valid_set-mae:2.27898
[240]	valid_set-mae:2.27518
[250]	valid_set-mae:2.27085
[260]	valid_set-mae:2.26546
[270]	valid_set-mae:2.26193
[280]	valid_set-mae:2.2585
[290]	valid_set-mae:2.25573
[300]	valid_set-mae:2.2521
[310]	valid_set-mae:2.24934
[320]	valid_set-mae:2.24698
[330]	valid_set-mae:2.24294
[340

[2860]	valid_set-mae:2.06821
[2870]	valid_set-mae:2.06795
[2880]	valid_set-mae:2.06772
[2890]	valid_set-mae:2.06759
[2900]	valid_set-mae:2.06746
[2910]	valid_set-mae:2.06734
[2920]	valid_set-mae:2.06712
[2930]	valid_set-mae:2.0669
[2940]	valid_set-mae:2.06684
[2950]	valid_set-mae:2.06661
[2960]	valid_set-mae:2.0664
[2970]	valid_set-mae:2.06611
[2980]	valid_set-mae:2.06592
[2990]	valid_set-mae:2.0657
[3000]	valid_set-mae:2.06551
[3010]	valid_set-mae:2.06537
[3020]	valid_set-mae:2.06526
[3030]	valid_set-mae:2.06506
[3040]	valid_set-mae:2.06463
[3050]	valid_set-mae:2.06455
[3060]	valid_set-mae:2.06436
[3070]	valid_set-mae:2.06415
[3080]	valid_set-mae:2.06398
[3090]	valid_set-mae:2.0639
[3100]	valid_set-mae:2.06371
[3110]	valid_set-mae:2.0636
[3120]	valid_set-mae:2.06339
[3130]	valid_set-mae:2.06324
[3140]	valid_set-mae:2.06316
[3150]	valid_set-mae:2.06297
[3160]	valid_set-mae:2.06276
[3170]	valid_set-mae:2.06248
[3180]	valid_set-mae:2.0622
[3190]	valid_set-mae:2.0621
[3200]	valid_set-mae:

In [15]:
pickle.dump(model_train, open('pmp_xgb.pickle.dat','wb'))

In [16]:
dtest = xgb.DMatrix(test_df)

In [17]:
y_pred = model_train.predict(dtest)

In [18]:
result_df = pd.DataFrame({
    "id": id_,
    "scalar_coupling_constant": y_pred
})

In [19]:
result_df.sample(10)

Unnamed: 0,id,scalar_coupling_constant
943354,6847461,83.750916
1235155,6629264,9.011694
1869455,6326065,5.472533
934023,6787176,83.76606
391883,6228808,-2.68089
595439,7075901,-4.706295
1388028,4908599,2.798285
215702,5516591,-0.257709
136647,5222099,-4.480159
720255,5368351,103.469635


In [20]:
result_df.to_csv("submission.csv", index = False)