# Predicting HOMO-LUMO Gap using Random Forest and Decision Tree Regression  
This notebook predicts the HOMO-LUMO energy gap for 5000 molecules from the QM9 dataset using Random Forest and Decision Tree regression models. The 18 features are molecular descriptors such as rotational constants (A, B, C), dipole moment, polarizability, orbital energies, and thermochemical properties.


# Importing all necessary libraries

In [32]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing our dataset "QM9_HOMO_LUMO_Gap.csv"

In [33]:
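dataset = pd.read_csv("QM9_HOMO_LUMO_Gap.csv")
x = dataset.iloc[:, :-1].values  # every column except the last: the features
y = dataset.iloc[:, -1].values  # last column ("gap"): the target
missing_data = dataset.isnull().sum()  # number of missing values per column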

## Printing the matrix of features (x)

In [34]:
print(x)

[[ 1.577e+02  1.577e+02  1.577e+02 ... -3.986e+02 -4.010e+02 -3.725e+02]
 [ 2.936e+02  2.935e+02  1.914e+02 ... -2.786e+02 -2.804e+02 -2.593e+02]
 [ 7.996e+02  4.379e+02  2.829e+02 ... -2.140e+02 -2.152e+02 -2.014e+02]
 ...
 [ 3.387e+00  2.360e+00  1.403e+00 ... -1.399e+03 -1.406e+03 -1.304e+03]
 [ 3.310e+00  2.383e+00  1.397e+00 ... -1.465e+03 -1.473e+03 -1.363e+03]
 [ 3.299e+00  2.423e+00  1.409e+00 ... -1.440e+03 -1.448e+03 -1.338e+03]]


## Printing the target variable (y)

In [35]:
print(y)

[0.505 0.34  0.361 ... 0.194 0.174 0.167]


## Checking for missing data in the dataset

In [36]:
print(missing_data)

A            0
B            0
C            0
mu           0
alpha        0
homo         0
lumo         0
r2           0
zpve         0
u0           0
u298         0
h298         0
g298         0
cv           0
u0_atom      0
u298_atom    0
h298_atom    0
g298_atom    0
gap          0
dtype: int64
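
The column list above shows that `homo` and `lumo` are among the features while `gap` is the target. Since the HOMO-LUMO gap is by definition the LUMO energy minus the HOMO energy, it may be worth checking how much of the target is recoverable from those two columns alone. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not taken from the dataset) and a hypothetical helper `gap_residual`; on the real data one would pass `dataset` instead.

```python
import pandas as pd


def gap_residual(df: pd.DataFrame) -> pd.Series:
    """Absolute difference between the stored gap and (lumo - homo), per row."""
    return (df["gap"] - (df["lumo"] - df["homo"])).abs()


# Illustrative values only (assumption): three rows using the dataset's column names.
toy = pd.DataFrame({
    "homo": [-0.30, -0.25, -0.20],
    "lumo": [0.10, 0.05, -0.02],
    "gap": [0.40, 0.30, 0.18],
})

residual = gap_residual(toy)
print(residual.max())  # near zero when gap equals lumo - homo
```

If the residual is near zero on the real data, the models can learn the target almost directly from two inputs, and dropping `homo` and `lumo` from `x` would likely give a more demanding benchmark.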


# Splitting the dataset into training and test sets

In [37]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
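
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a placeholder array with the same shape as the feature matrix (5000 rows, 18 columns). The array contents are synthetic and only the shapes matter; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): same shape as the notebook's x and y.
X_demo = np.arange(5000 * 18, dtype=float).reshape(5000, 18)
y_demo = np.arange(5000, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # 80% of rows for training, 20% for testing

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

Fixing `random_state` keeps the comparison between the two models fair, because both are evaluated on exactly the same held-out rows.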

In [38]:
print(x_train)
print(y_train)
print(x_test)
print(y_test)

[[ 7.716e+00  1.100e+00  1.001e+00 ... -1.435e+03 -1.443e+03 -1.336e+03]
 [ 7.500e+00  1.929e+00  1.568e+00 ... -1.287e+03 -1.294e+03 -1.194e+03]
 [ 3.878e+00  1.673e+00  1.258e+00 ... -1.880e+03 -1.892e+03 -1.731e+03]
 ...
 [ 5.988e+00  2.836e+00  1.925e+00 ... -1.161e+03 -1.167e+03 -1.081e+03]
 [ 6.496e+00  7.955e-01  7.118e-01 ... -1.316e+03 -1.323e+03 -1.225e+03]
 [ 9.680e+00  2.172e+00  1.814e+00 ... -1.127e+03 -1.133e+03 -1.053e+03]]
[0.247 0.245 0.34  ... 0.21  0.171 0.237]
[[ 3.722e+00  2.263e+00  1.610e+00 ... -1.625e+03 -1.634e+03 -1.503e+03]
 [ 4.604e+00  1.409e+00  1.100e+00 ... -1.571e+03 -1.580e+03 -1.455e+03]
 [ 1.027e+01  2.029e+00  1.694e+00 ... -8.923e+02 -8.964e+02 -8.349e+02]
 ...
 [ 1.999e+00  1.804e+00  9.539e-01 ... -1.477e+03 -1.485e+03 -1.366e+03]
 [ 6.606e+00  1.260e+00  1.118e+00 ... -1.515e+03 -1.524e+03 -1.402e+03]
 [ 4.801e+00  2.648e+00  2.159e+00 ... -1.501e+03 -1.510e+03 -1.385e+03]]
[0.223 0.242 0.212 0.258 0.213 0.188 0.31  0.297 0.197 0.254 0.215 0.2

# Building and training our Random Forest Regression model

In [39]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 200, random_state = 1)
regressor.fit(x_train, y_train)
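
A fitted `RandomForestRegressor` exposes a `feature_importances_` attribute, which can help show which descriptors the forest relies on. The sketch below is an illustration on synthetic data (four made-up features named `f0` to `f3`, with a target that depends mostly on `f0`), not a result from the QM9 file.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Synthetic target (assumption): strong dependence on column 0, weak on column 1.
y_demo = 3.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.05, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = model.feature_importances_
for name, score in zip(["f0", "f1", "f2", "f3"], importances):
    print(f"{name}: {score:.3f}")
```

On the notebook's own data, pairing `regressor.feature_importances_` with `dataset.columns[:-1]` would give the same kind of listing for the 18 descriptors.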

# Predicting test-set results and displaying predicted (left column) vs. actual (right column) values side by side

In [40]:
y_pred = regressor.predict(x_test)
np.set_printoptions(precision = 3)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[0.224 0.223]
 [0.243 0.242]
 [0.208 0.212]
 ...
 [0.164 0.17 ]
 [0.292 0.292]
 [0.336 0.337]]
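
A side-by-side printout is hard to judge by eye, so an aggregate error metric is a natural complement. The sketch below computes the mean absolute error and R² by hand with NumPy, using only the six (predicted, actual) rows visible in the output above; the helper names `mae` and `r2` are chosen for this illustration. scikit-learn offers the same metrics as `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.r2_score`, which could be applied to the full `y_test` and `y_pred` arrays.

```python
import numpy as np


def mae(y_true: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_hat)))


def r2(y_true: np.ndarray, y_hat: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# The six rows printed above: column 0 is predicted, column 1 is actual.
pairs = np.array([
    [0.224, 0.223],
    [0.243, 0.242],
    [0.208, 0.212],
    [0.164, 0.170],
    [0.292, 0.292],
    [0.336, 0.337],
])
predicted, actual = pairs[:, 0], pairs[:, 1]

rf_mae = mae(actual, predicted)
rf_r2 = r2(actual, predicted)
print(f"MAE: {rf_mae:.4f}  R^2: {rf_r2:.4f}")
```

Six rows are far too few for a reliable estimate; the point is only to show the calculation.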


# Making a single prediction

In [41]:
y_pred_new = regressor.predict([[145.5879, 55.78, 25.9658, 4, 25, -0.5628, 0.0928, 36, 0.045895, -104.282, -85.02, -78.002, -152.784, 13.02, -302.254, -458.213, -815.412, -658.214]])

In [42]:
print('HOMO-LUMO Gap is:', y_pred_new)

HOMO-LUMO Gap is: [0.492]
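
The double brackets in the call above are needed because scikit-learn's `predict` expects a 2D input of shape `(n_samples, n_features)`, even for a single sample. A minimal sketch on a toy two-feature model (synthetic data, not the QM9 features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data (assumption): four samples with two features each.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
y_demo = np.array([0.0, 1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)

sample = np.array([2.0, 1.0])  # 1D: one value per feature
batch = sample.reshape(1, -1)  # 2D: one row, n_features columns
prediction = model.predict(batch)  # 1D array with one value per input row
print(prediction.shape, prediction[0])
```

The values in the row must also follow the same column order as the training matrix, since the model was fitted on a plain array without column names.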


# Building and training our Decision Tree Regression model for comparison

In [43]:
from sklearn.tree import DecisionTreeRegressor
regressor_2 = DecisionTreeRegressor(random_state = 1)
regressor_2.fit(x_train, y_train)

# Predicting test-set results and displaying predicted (left column) vs. actual (right column) values side by side

In [44]:
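# Store the Decision Tree predictions under their own name so the
# Random Forest predictions in y_pred are not overwritten.
y_pred_2 = regressor_2.predict(x_test)
np.set_printoptions(precision = 3)
print(np.concatenate((y_pred_2.reshape(len(y_pred_2), 1), y_test.reshape(len(y_test), 1)), 1))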

[[0.224 0.223]
 [0.241 0.242]
 [0.184 0.212]
 ...
 [0.171 0.17 ]
 [0.297 0.292]
 [0.338 0.337]]


# Making a single prediction

In [45]:
y_pred_new = regressor_2.predict([[145.5879, 55.78, 25.9658, 4, 25, -0.5628, 0.0928, 36, 0.045895, -104.282, -85.02, -78.002, -152.784, 13.02, -302.254, -458.213, -815.412, -658.214]])

In [46]:
print('HOMO-LUMO Gap is:', y_pred_new)

HOMO-LUMO Gap is: [0.505]


### Comparison Summary  
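Both models (Decision Tree and Random Forest) were trained on the same 80/20 split to predict the HOMO-LUMO gap.  
In the six test rows displayed for each model, the Random Forest's largest error is 0.006, while the Decision Tree's is 0.028 (0.184 predicted against an actual 0.212). This is consistent with the variance reduction expected from averaging 200 trees: a single unpruned Decision Tree can capture sharp local patterns but is more prone to overfitting. The notebook computes no aggregate error metric, so the comparison rests on these sampled rows.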
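
One way to put the comparison on firmer ground is k-fold cross-validation with a common error metric. The sketch below is an assumption-laden illustration: it uses a synthetic noisy regression problem rather than the QM9 file, 5 folds, and mean absolute error via scikit-learn's `cross_val_score` with `scoring="neg_mean_absolute_error"`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.uniform(-1, 1, size=(400, 5))
# Synthetic noisy target (assumption), standing in for the real gap values.
y_demo = (
    np.sin(3 * X_demo[:, 0])
    + X_demo[:, 1] ** 2
    + rng.normal(scale=0.2, size=400)
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

scores = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign.
    neg_mae = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    scores[name] = -neg_mae.mean()
    print(f"{name}: cross-validated MAE = {scores[name]:.3f}")
```

Running the same loop with the notebook's `x` and `y` in place of the synthetic arrays would show whether the Random Forest's advantage holds across folds rather than on a single split.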



# Congratulations, we have successfully built our Random Forest and Decision Tree Regression models!