1. (2x) Classification for electronic band gap type.

(a) Use AFLOW to download the electronic band gap types from AFLOW for all materials that do not contain lanthanides or actinides, that have the zincblende structure (AFLOW prototype: AB_cF8_216_c_a). (Hint: modify the code introduced in Lecture 14).

In [2]:
import json, sys, os
from urllib.request import urlopen

SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(!Lanthanides,!Pa,!U,!Pu,!Th),Egap_type(*),aflow_prototype_label_relax(AB_cF8_216_c_a)"
DIRECTIVE = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVE

# print(SERVER + API + SUMMONS)

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
# print(response)
print(len(response))

339


(b) Clean the data to remove duplicate entries.

In [3]:
response_clean = []
response_compounds = []
for entry in response:
    if entry["compound"] not in response_compounds:
        response_clean.append(entry)
        response_compounds.append(entry["compound"])
print(response_clean[0])
print(len(response_clean))

{'compound': 'Ag1C1', 'auid': 'aflow:159b79fdb767a2be', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1C1_ICSD_183176', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'C'], 'Egap_type': 'metal', 'aflow_prototype_label_relax': 'AB_cF8_216_c_a'}
85


(c) Read in the JSON file with the elemental properties, and use it to generate the features based on the differences in the electronegativities and ionization energies.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

f = open("/content/drive/MyDrive/Colab Notebooks/Chemical_element_data.json", "r+")
Chemical_element_data = json.load(f)
f.close()
drive.flush_and_unmount()
print(Chemical_element_data)

x_list = []
y_list = []

for datum in response_clean:
    species1 = datum["species"][0]
    species2 = datum["species"][1]
    en_diff = abs(Chemical_element_data[species1]["electronegativity"] - Chemical_element_data[species2]["electronegativity"])
    ie_diff = abs(Chemical_element_data[species1]["first_ionization_energy"] - Chemical_element_data[species2]["first_ionization_energy"])
    x_list.append([en_diff, ie_diff])
    if "metal" in datum["Egap_type"]:
        y_list.append(0)  #  if Egap_type is metal, assign to class 0
    else:
        y_list.append(1)  #  otheriwse, assign to class 1

print(x_list)
print(y_list)
print(len(x_list))
print(len(y_list))

Mounted at /content/drive
{'H': {'valence': 1.0, 'atomic_mass': 1.008, 'first_ionization_energy': 1310.0, 'electronegativity': 2.1}, 'Li': {'valence': 1.0, 'atomic_mass': 6.94, 'first_ionization_energy': 519.0, 'electronegativity': 1.0}, 'Be': {'valence': 2.0, 'atomic_mass': 9.013, 'first_ionization_energy': 900.0, 'electronegativity': 1.5}, 'B': {'valence': 3.0, 'atomic_mass': 10.82, 'first_ionization_energy': 799.0, 'electronegativity': 2.0}, 'C': {'valence': 4.0, 'atomic_mass': 12.01, 'first_ionization_energy': 1090.0, 'electronegativity': 2.5}, 'N': {'valence': 5.0, 'atomic_mass': 14.008, 'first_ionization_energy': 1400.0, 'electronegativity': 3.0}, 'O': {'valence': 6.0, 'atomic_mass': 16.0, 'first_ionization_energy': 1310.0, 'electronegativity': 3.5}, 'F': {'valence': 7.0, 'atomic_mass': 19.0, 'first_ionization_energy': 1680.0, 'electronegativity': 4.0}, 'Na': {'valence': 1.0, 'atomic_mass': 22.97, 'first_ionization_energy': 494.0, 'electronegativity': 0.9}, 'Mg': {'valence': 2.0,

(d) Fit k-nearest neighbors classification to the data sets. Use separate training, validation and test sets to optimize the number of nearest neighbors. Use the confusion matrix to evaluate the resulting model.

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
import numpy as np


x = np.array(x_list)
y = np.array(y_list)

#  Split x and y into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)

kvals = []
cv_scores = []

#  Generate average scores for k values 1 to 10
for k in range(1, 10 + 1):
    clf = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(clf, x_train, y_train)
    avg_scores = sum(scores) / len(scores)
    kvals.append(k)
    cv_scores.append(avg_scores)

print("K values tested: ", kvals)
print("Average scores for each K value: ", cv_scores)

#  Find optimal k value with highest average score
optimal_k = 0
highest_score = 0

for k, score in zip(kvals, cv_scores):
    if score > highest_score:
        optimal_k = k
        highest_score = score

print("Optimal K value: ", optimal_k)
print("Highest average score: ", highest_score)

#  Train final model using optimal K value
clf = KNeighborsClassifier(n_neighbors = optimal_k)
clf.fit(x_train, y_train)
print("Training score for k = " + str(optimal_k) + ": " + str(clf.score(x_train, y_train)))
print("Testing score for k = " + str(optimal_k) + ": " + str(clf.score(x_test, y_test)))

#  Confusion matrix for KNN
y_pred = clf.predict(x_test)
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", confusion)

K values tested:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Average scores for each K value:  [0.5576923076923077, 0.5397435897435897, 0.635897435897436, 0.5717948717948719, 0.44102564102564107, 0.46025641025641023, 0.5102564102564103, 0.458974358974359, 0.4935897435897436, 0.49230769230769234]
Optimal K value:  3
Highest average score:  0.635897435897436
Training score for k = 3: 0.8253968253968254
Testing score for k = 3: 0.5
Confusion Matrix: 
 [[5 3]
 [8 6]]


(e) Fit Gaussian Bayes Nearest Neighbor to the same data. Compare its confusion matrix to that for k-nearest neighbors.

In [6]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np

gnb = GaussianNB()
y_pred = gnb.fit(x_train, y_train).predict(x_test)
nummislabelled = (y_test != y_pred).sum()

print("Testing score: ", gnb.score(x_test, y_test))

print("Number of predicted points: ", x_test.shape[0])
print("Number of mislabeled points: ", nummislabelled)

confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", confusion)

Testing score:  0.5
Number of predicted points:  22
Number of mislabeled points:  11
Confusion Matrix: 
 [[6 2]
 [9 5]]


(f) Save the data in a file zincblende_EgaptypeClean_AFLOW.json so that it can be used for future exercises.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

fout = open("/content/drive/MyDrive/Colab Notebooks/zincblende_EgaptypeClean_AFLOW.json", "w")
fout.write(json.dumps(response_clean))  #  json.dumps() converts into json string (contains double quotes instead of single quotes)
fout.close()
drive.flush_and_unmount()

Mounted at /content/drive


2. (2x) Regression for electronic band gap.

(a) Use AFLOW to download the electronic band gap values from AFLOW for all materials that do not contain lanthanides or actinides, that have the zincblende structure (AFLOW prototype: AB_cF8_216_c_a). (Hint: modify the code introduced in Lecture 14).

In [8]:
import json, sys, os
from urllib.request import urlopen

SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(!Lanthanides,!Pa,!U,!Pu,!Th),Egap(*),Egap_type(*),aflow_prototype_label_relax(AB_cF8_216_c_a)"
DIRECTIVE = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVE

# print(SERVER + API + SUMMONS)

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
# print(response)
print(len(response))

339


(b) Clean the data to remove duplicates. Only include entries that are not metals (if "metal" not in entry["Egap_type"]).

In [9]:
response_clean = []
response_compounds = []
for entry in response:
    if entry["compound"] not in response_compounds and "metal" not in entry["Egap_type"]:
        response_clean.append(entry)
        response_compounds.append(entry["compound"])
print(response_clean[0])
print(len(response_clean))

{'compound': 'Ag1I1', 'auid': 'aflow:09ceeacc6ea1e806', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1I1_ICSD_164964', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'I'], 'Egap': 1.977, 'Egap_type': 'insulator-direct', 'aflow_prototype_label_relax': 'AB_cF8_216_c_a'}
44


(c) Read in the JSON file with the elemental properties, and use it to generate the features based on the differences in the electronegativities and ionization energies.

In [10]:
from google.colab import drive
drive.mount('/content/drive')

f = open("/content/drive/MyDrive/Colab Notebooks/Chemical_element_data.json", "r+")
Chemical_element_data = json.load(f)
f.close()
drive.flush_and_unmount()
print(Chemical_element_data)

for datum in response_clean:
    if "metal" not in datum["Egap_type"]:
        species1 = datum["species"][0]
        species2 = datum["species"][1]
        en_diff = abs(Chemical_element_data[species1]["electronegativity"] - Chemical_element_data[species2]["electronegativity"])
        ie_diff = abs(Chemical_element_data[species1]["first_ionization_energy"] - Chemical_element_data[species2]["first_ionization_energy"])
        x_list.append([en_diff, ie_diff])
        y_list.append(datum["Egap"])

Mounted at /content/drive
{'H': {'valence': 1.0, 'atomic_mass': 1.008, 'first_ionization_energy': 1310.0, 'electronegativity': 2.1}, 'Li': {'valence': 1.0, 'atomic_mass': 6.94, 'first_ionization_energy': 519.0, 'electronegativity': 1.0}, 'Be': {'valence': 2.0, 'atomic_mass': 9.013, 'first_ionization_energy': 900.0, 'electronegativity': 1.5}, 'B': {'valence': 3.0, 'atomic_mass': 10.82, 'first_ionization_energy': 799.0, 'electronegativity': 2.0}, 'C': {'valence': 4.0, 'atomic_mass': 12.01, 'first_ionization_energy': 1090.0, 'electronegativity': 2.5}, 'N': {'valence': 5.0, 'atomic_mass': 14.008, 'first_ionization_energy': 1400.0, 'electronegativity': 3.0}, 'O': {'valence': 6.0, 'atomic_mass': 16.0, 'first_ionization_energy': 1310.0, 'electronegativity': 3.5}, 'F': {'valence': 7.0, 'atomic_mass': 19.0, 'first_ionization_energy': 1680.0, 'electronegativity': 4.0}, 'Na': {'valence': 1.0, 'atomic_mass': 22.97, 'first_ionization_energy': 494.0, 'electronegativity': 0.9}, 'Mg': {'valence': 2.0,

(d) Split the data into testing and training sets. Fit linear regression to the training set. What is the fit score for the test set?

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

x = np.array(x_list)
y = np.array(y_list)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)
linreg = LinearRegression().fit(x_train, y_train)
print("Testing score: ", linreg.score(x_test, y_test))

Testing score:  0.13695454545521102


(e) Fit k-nearest neighbors regression to the same data. Use separate training, validation and test sets to optimize the number of nearest neighbors, and evaluate the resulting model. How does it compare to linear regression?

In [12]:
from sklearn.neighbors import KNeighborsRegressor  #  import KNN regression package
from sklearn.model_selection import train_test_split, cross_val_score  #  import method to split data into training & testing sets

kvals = []
cv_scores = []

x = np.array(x_list)
y = np.array(y_list)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)

for k in range(1, 10 + 1):
    reg = KNeighborsRegressor(n_neighbors = k)
    scores = cross_val_score(reg, x_train, y_train)
    sum_scores = sum(scores)
    ave_scores = sum_scores / len(scores)
    kvals.append(k)
    cv_scores.append(ave_scores)

print("K values tested: ", kvals)
print("Average scores for each K value: ", cv_scores)

optimal_k = 0
highest_score = -10.0

for k, score in zip(kvals, cv_scores):
    if score > highest_score:
        optimal_k = k
        highest_score = score

print("Optimal K value: ", optimal_k)
print("Highest average score: ", highest_score)

clf = KNeighborsRegressor(n_neighbors = optimal_k)
clf.fit(x_train, y_train)
print("Training score for k = " + str(optimal_k) + ": " + str(clf.score(x_train, y_train)))
print("Testing score for k = " + str(optimal_k) + ": " + str(clf.score(x_test, y_test)))

K values tested:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Average scores for each K value:  [-2.690423332078484, -1.2082309584745492, -0.641814620067146, -0.4602956615558333, -0.5180127108553962, -0.44148532362964144, -0.4008286195167023, -0.3844271719138898, -0.34182723999488906, -0.30624188799465696]
Optimal K value:  10
Highest average score:  -0.30624188799465696
Training score for k = 10: 0.07645028767262385
Testing score for k = 10: -0.3462131404360038


(f) Save the data in a file zincblende_EgapClean_AFLOW.json so that it can be used for future exercises.

In [13]:
from google.colab import drive
drive.mount('/content/drive')

fout = open("/content/drive/MyDrive/Colab Notebooks/zincblende_EgapClean_AFLOW.json", "w")
fout.write(json.dumps(response_clean))
fout.close()
drive.flush_and_unmount()

Mounted at /content/drive


3. (2x)  Classification with logistic regression for electronic band gap type.


(a) Read in the JSON file with the cleaned electronic band gap type data for rocksalt structure, and the JSON file with the elemental properties. Convert both to dictionary-like objects.

In [14]:
import json, sys, os
from google.colab import drive
drive.mount('/content/drive')

f = open("/content/drive/MyDrive/Colab Notebooks/zincblende_EgaptypeClean_AFLOW.json", "r+")
zincblende_EgaptypeClean = json.load(f)
f.close()
# print(zincblende_EgaptypeClean)  #  already in dictionary-like form

f2 = open("/content/drive/MyDrive/Colab Notebooks/Chemical_element_data.json", "r+")
Chemical_element_data = json.load(f2)
f2.close()
# print(Chemical_element_data)  #  already in dictionary-like form

drive.flush_and_unmount()

Mounted at /content/drive


(b) Generate the feature vectors based on the differences of the electronegativities, ionization energies, valences and atomic masses (4 features). Assign the metals to class zero and the insulators to class 1. Generate the training and test sets.

In [15]:
x_list = []
y_list = []

for datum in zincblende_EgaptypeClean:
    species1 = datum["species"][0]
    species2 = datum["species"][1]
    en_diff = abs(Chemical_element_data[species1]["electronegativity"] - Chemical_element_data[species2]["electronegativity"])
    ie_diff = abs(Chemical_element_data[species1]["first_ionization_energy"] - Chemical_element_data[species2]["first_ionization_energy"])
    val_diff = abs(Chemical_element_data[species1]["valence"] - Chemical_element_data[species2]["valence"])
    am_diff = abs(Chemical_element_data[species1]["atomic_mass"] - Chemical_element_data[species2]["atomic_mass"])
    x_list.append([en_diff, ie_diff, val_diff, am_diff])
    if "metal" in datum["Egap_type"]:
        y_list.append(0)  #  if Egap_type is metal, assign to class 0
    else:
        y_list.append(1)  #  otheriwse, assign to class 1

x = np.array(x_list)
y = np.array(y_list)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)

(c) Use cross-validation to optimize the value of C, using the parameter grid: [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]. What is the optimum value of C?

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

cvals = []
cv_scores = []

for i in range(-3, 2 + 1):
    cval = 10 ** i
    logreg = LogisticRegression(C = cval)
    scores = cross_val_score(logreg, x_train, y_train)
    avg_score = sum(scores) / len(scores)
    cvals.append(cval)
    cv_scores.append(avg_score)

print("C values tested: ", cvals)
print("Average scores for each C value: ", cv_scores)

optimal_c = 0
highest_score = -10.0

for c, score in zip(cvals, cv_scores):
    if score > highest_score:
        optimal_c = c
        highest_score = score

print("Optimal C value: ", optimal_c)
print("Highest average score: ", highest_score)

C values tested:  [0.001, 0.01, 0.1, 1, 10, 100]
Average scores for each C value:  [0.6833333333333333, 0.6833333333333333, 0.6512820512820514, 0.6987179487179487, 0.6987179487179487, 0.6833333333333333]
Optimal C value:  1
Highest average score:  0.6987179487179487


(d) Fit the model for the optimized value of C. What is the test score accuracy for this value? Generate the confusion matrix for this model. How many metals are misclassified as non-metals? How many non-metals are missclassified as metals?

In [17]:
logreg = LogisticRegression(C = optimal_c)
logreg.fit(x_train, y_train)
print("Training score for C = " + str(optimal_c) + ": " + str(logreg.score(x_train, y_train)))
print("Testing score for C = " + str(optimal_c) + ": " + str(logreg.score(x_test, y_test)))

y_pred = logreg.predict(x_test)
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: \n", confusion)
#  3 metals were mislabeled as non-metals
#  5 non-metals were mislabeled as metals

Training score for C = 1: 0.746031746031746
Testing score for C = 1: 0.6363636363636364
Confusion Matrix: 
 [[5 3]
 [5 9]]
