## HW5 Band Gap Prediction
### Darian Yang

You can start with basic models, and then try your best to optimize your predictions by using more sophisticated models, feature engineering, and fine-tuning the hyperparameters. The grade of this homework will be based on the score you get on Kaggle. 

There is no specific requirement for a written summary for this homework, but please leave some necessary notes/comments in your submission to help the grader understand your workflow.

#### TODO:
* Given the training data: X_train and Y_train
* Find the most suitable features
* You are allowed to build any type of regression model
* You are allowed to use any type of data processing
* Find the best performing model
* Use your model to score Y_test

This dataset provides quantitative measurements of the band gap (Egap) for a set of inorganic crystalline materials.

#### File descriptions
* X_train_kaggle.csv - the training set: descriptor file with a Material column
* y_train_kaggle.csv - the training set: Egap values for training, with an Id column
* X_test_kaggle.csv - the test set: descriptor file with Material and Id columns, for which you should predict Egap
* y_sample_submission.csv - a sample submission file in the correct format

#### Data fields
* Id - an id unique to a given material (please note that 'Id' is the last column)
* D1-D132 - chemical descriptors of a given material
* Egap - Egap values in y- files

The evaluation metric for this competition is mean absolute error (MAE).
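
As a reminder of what the metric measures, MAE is the average of the absolute differences between true and predicted values. The snippet below is a minimal sketch with made-up Egap values; the helper name `mean_absolute_error_np` is hypothetical and is not part of the competition code.

```python
import numpy as np

def mean_absolute_error_np(y_true, y_pred):
    """Average of |y_true - y_pred| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# made-up values, purely for illustration
y_true = [0.0, 1.5, 3.0]
y_pred = [0.5, 1.0, 3.0]
mae = mean_absolute_error_np(y_true, y_pred)
print(mae)
```

Unlike mean squared error, MAE weights every unit of error equally, so a few large misses are penalized less heavily.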

#### Submission Format
For every molecule in the dataset, submission files should contain two columns: Id and Egap.

The file should contain a header and have the following format:
```
Id,Egap
1,0.456
```
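
A minimal sketch of producing a file in this layout with pandas is shown below. The Ids and predicted values are placeholders, and the output is written to an in-memory buffer only so that the result can be inspected; a filename can be passed to `to_csv` instead.

```python
import io
import pandas as pd

# placeholder Ids and predictions, only to illustrate the layout
ids = [13062, 587, 2333]
egap_pred = [0.456, 1.234, 2.5]

submission = pd.DataFrame({"Id": ids, "Egap": egap_pred})

# index=False keeps the row index out of the file, leaving only Id and Egap
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```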

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [73]:
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm

In [95]:
X_train = pd.read_csv("X_train_kaggle.csv", index_col=133).to_numpy()
y_train = pd.read_csv("y_train_kaggle.csv", index_col=1).to_numpy()
X_test = pd.read_csv("X_test_kaggle.csv", index_col=133).to_numpy()

In [161]:
TEST_ID = pd.read_csv("X_test_kaggle.csv",index_col=133).index
TEST_ID

Int64Index([13062,   587,  2333,  7740,  9463,  5652,  8999, 11427,  3677,
             2200,
            ...
            13472,  2237,  3379,  1273,  5807,  4754,   108,  7641,  7194,
             3063],
           dtype='int64', name='Id', length=4132)

In [106]:
TEST_ID.shape

(4132,)

In [93]:
np.unique(TEST_ID).shape

(3852,)

Note that there are duplicate values in the test set.

In [94]:
TRAIN_ID = pd.read_csv("X_train_kaggle.csv",index_col=0).index
print(TRAIN_ID.shape)
print(np.unique(TRAIN_ID).shape)

(9640,)
(8462,)


Looks like there are duplicates in the training set as well.

Let's remove duplicate values from the test and training sets:

In [132]:
np.where(TEST_ID == "Eu1S1")[0][0]

148

In [133]:
def make_bool_filter(id_array):
    # create a filter from id array
    bool_filter = []
    for ix, idval in enumerate(id_array):
        # keep the first occurrence: only later repeats are marked False
        if np.count_nonzero(id_array == idval) > 1 and ix != np.where(id_array == idval)[0][0]:
            bool_filter.append(False)
        else:
            bool_filter.append(True)
    return bool_filter
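
The loop above rescans the whole array for every element, so its cost grows quadratically with the number of rows. A possible vectorized alternative, sketched here under the assumption that keeping the first occurrence of each identifier is the intended behavior, uses pandas `Index.duplicated`. The function name `make_bool_filter_fast` and the toy identifiers are hypothetical.

```python
import numpy as np
import pandas as pd

def make_bool_filter_fast(id_array):
    """True for the first occurrence of each value, False for later repeats."""
    return ~pd.Index(id_array).duplicated(keep="first")

# toy identifiers standing in for the Material column
ids = np.array(["Eu1S1", "Na1Cl1", "Eu1S1", "Si1", "Na1Cl1"])
mask = make_bool_filter_fast(ids)
print(mask)
print(ids[mask])
```

The returned boolean array can be used for indexing in the same way as the list built by `make_bool_filter`.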

In [134]:
# create a train and test filter
train_filter = make_bool_filter(TRAIN_ID)
test_filter = make_bool_filter(TEST_ID)

In [135]:
len(train_filter)

9640

In [136]:
X_train.shape

(9640, 133)

In [137]:
X_train[train_filter].shape

(8462, 133)

It seems to work, so let's filter the data arrays now.

In [138]:
X_train = X_train[train_filter]
y_train = y_train[train_filter]
X_test = X_test[test_filter]

In [139]:
y_train.shape

(8462, 1)

In [140]:
# reshape to 1d
y_train = y_train[:,0]
y_train.shape

(8462,)

In [141]:
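def worker(model, save_to=None, X_train=X_train, y_train=y_train, X_test=X_test):
    """
    Fit an sklearn model on the training data and predict the test set.
    If save_to is given, also write the predictions to a submission csv.
    """
    model = model.fit(X_train, y_train)
    y_test = model.predict(X_test)

    if save_to:
        # Id is written as a regular column (index=False), giving the Id,Egap layout
        ret = pd.DataFrame(data={"Id": TEST_ID, "Egap": y_test})
        ret.to_csv(save_to, index=False)

    return y_test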

In [27]:
from sklearn.metrics import mean_squared_error

def calc_score(model, X_train, y_train, X_test, y_test):
    """
    Fit an sklearn model and return the mean squared error (mse) on the held-out split.
    For a random forest fit with oob_score=True, return (oob_score_, mse) instead.
    """
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    # random forests fit with oob_score=True also expose an out-of-bag score
    if hasattr(model, "oob_score_"):
        return model.oob_score_, mse
    else:
        return mse

In [142]:
# first scale
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [156]:
# split training/test
X_train_b, X_test_b, y_train_b, y_test_b = \
    model_selection.train_test_split(X_train_scaled, y_train, test_size=0.6)

In [157]:
calc_score(linear_model.LinearRegression(), X_train_b, y_train_b, X_test_b, y_test_b)

1.1483690835143443

In [158]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True), X_train_b, y_train_b, X_test_b, y_test_b)

(0.6749313645377382, 0.8003077263645966)

In [None]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True, criterion="absolute_error"), X_train_b, y_train_b, X_test_b, y_test_b)

In [None]:
calc_score(RandomForestRegressor(max_depth=None, oob_score=True, criterion="poisson"), X_train_b, y_train_b, X_test_b, y_test_b)

In [None]:
calc_score(svm.LinearSVR(), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

For now, I will submit to Kaggle using the random forest.

In [83]:
worker(RandomForestRegressor(max_depth=None), "rf.csv")

array([3.049612, 2.575218, 2.8167  , ..., 1.675019, 1.675954, 3.391856])

So among linear regression, a linear SVM, and random forest, RF performs best. Note, however, that RF cannot extrapolate beyond the range of target values seen during training, so it might not be the best option here. I will try optimizing SVMs, since this dataset is smaller than the one in the last HW.
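
The extrapolation limit of RF can be illustrated on synthetic data. The sketch below is not taken from the band gap dataset: it fits a forest to a made-up linear relation and queries a point far outside the training range. Because each tree predicts an average of training targets, the prediction cannot exceed the largest target seen in training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic 1-D data: y = 2x for x in [0, 10]
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# query far outside the training range, where the true relation would give 40
pred_outside = rf.predict(np.array([[20.0]]))[0]
print(pred_outside, y_demo.max())
```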

In [62]:
calc_score(svm.LinearSVR(max_iter=10000, C=1), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.1962796001555847

In [61]:
calc_score(svm.LinearSVR(max_iter=10000, C=10), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.19948599580424

In [63]:
calc_score(svm.LinearSVR(max_iter=10000, C=100), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)



1.4804158560572647

In [65]:
calc_score(svm.SVR(kernel="linear", C=1, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

1.1848964057198201

In [66]:
calc_score(svm.SVR(kernel="rbf", C=1, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.9220149361378379

Note that the literature on a similar problem (https://pubs.acs.org/doi/full/10.1021/acs.jpclett.8b00124) reports:

> C and γ were optimized to 10 and 0.01 for SVR, respectively, while ϵ was set at 0.1

In [67]:
calc_score(svm.SVR(kernel="rbf", C=10, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8344598149031313

In [68]:
calc_score(svm.SVR(kernel="rbf", C=10, gamma=0.01, cache_size=1000), X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8376258904445875

In [69]:
calc_score(svm.SVR(kernel="rbf", C=10, gamma=0.01, epsilon=0.1, cache_size=1000), 
           X_train=X_train_b, y_train=y_train_b, X_test=X_test_b, y_test=y_test_b)

0.8376258904445875

Let's actually optimize this:

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

# log-spaced grids for C and gamma
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)
# ShuffleSplit instead of StratifiedShuffleSplit: stratification needs class labels,
# but Egap is a continuous regression target
cv = ShuffleSplit(n_splits=5, test_size=0.2)
grid = GridSearchCV(svm.SVR(), param_grid=param_grid, cv=cv)
grid.fit(X_train_scaled, y_train)

print(
    "The best parameters are %s with a score of %0.2f"
    % (grid.best_params_, grid.best_score_)
)
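
A hedged variant of this search is sketched below on synthetic data. It makes two assumptions that are not in the notebook: the scaler is placed inside a pipeline so that it is refit on each training fold, and the search is scored with `neg_mean_absolute_error` so that model selection matches the competition metric rather than the default R² score of `SVR`. The grid is kept deliberately small, and the data are random stand-ins for the descriptors.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# small synthetic regression problem standing in for the descriptors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# scaling inside the pipeline is refit on each CV training fold
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {"svr__C": [0.1, 1, 10], "svr__gamma": [0.01, 0.1, 1]}
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

search = GridSearchCV(pipe, param_grid=param_grid, cv=cv,
                      scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

# the scorer is negated MAE, so flip the sign to report MAE
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```

With this scoring, the printed best score is directly comparable to the Kaggle leaderboard metric, although the values here come from toy data only.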

In [79]:
C_2d_range = [1e-2, 1, 1e2]
gamma_2d_range = [1e-1, 1, 1e1]
classifiers = []
for C in C_2d_range:
    for gamma in gamma_2d_range:
        clf = svm.SVR(C=C, gamma=gamma)
        clf.fit(X_train_scaled, y_train)
        classifiers.append((C, gamma, clf))

In [84]:
classifiers

[(0.01, 0.1, SVR(C=0.01, gamma=0.1)),
 (0.01, 1, SVR(C=0.01, gamma=1)),
 (0.01, 10.0, SVR(C=0.01, gamma=10.0)),
 (1, 0.1, SVR(C=1, gamma=0.1)),
 (1, 1, SVR(C=1, gamma=1)),
 (1, 10.0, SVR(C=1, gamma=10.0)),
 (100.0, 0.1, SVR(C=100.0, gamma=0.1)),
 (100.0, 1, SVR(C=100.0, gamma=1)),
 (100.0, 10.0, SVR(C=100.0, gamma=10.0))]