1. (2x) Downloading and parsing data from the PDB.

(a) Use the Protein Data Bank (https://www.rcsb.org/) search functionality to find and download the structure of the hemocyanin molecule (an oxygen transport molecule used instead of hemoglobin by some invertebrate animals, which uses Cu instead of Fe) in the PDBx/mmCIF format. You can select the first entry you find with your search: write the file name in the text box below.

Filename: 1hcy.cif

(b) If using Colab, upload the file to your Google Drive directory; otherwise, move it to a directory on your own computer. Modify the code from Lecture 10 to open and parse this file, extract the cell lengths and angles, and convert them to a set of lattice vectors. Save these lattice vectors as a list of numpy arrays inside a dictionary, with the key "lattice_vectors".

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import math

f = open("/content/drive/MyDrive/Colab Notebooks/1hcy.cif", "r")
cell_lengths = []
cell_angles = []
readcell = False

for line in f.readlines():
    if "_cell" in line:
        readcell = True

    if "_cell.length_" in line:
        cell_length_parts = line.split()

        try:
            cell_lengths.append(float(cell_length_parts[1]))
        except (ValueError, TypeError):
            print("Cell length is not a number: ", line)


    elif "_cell.angle_" in line:
        cell_angle_parts = line.split()

        try:
            cell_angles.append(float(cell_angle_parts[1]))

        except (ValueError, TypeError):
            print("Cell angle is not a number: ", line)

    if readcell and (not "_cell" in line):
        break

print("Cell lengths = ", cell_lengths)
print("Cell angles = ", cell_angles)

f.close()
drive.flush_and_unmount()

alpha = cell_angles[0] * np.pi / 180.0
beta = cell_angles[1] * np.pi / 180.0
gamma = cell_angles[2] * np.pi / 180.0

avec = np.array([0.0, 0.0, 0.0])
bvec = np.array([0.0, 0.0, 0.0])
cvec = np.array([0.0, 0.0, 0.0])

avec[0] = cell_lengths[0]
bvec[0] = cell_lengths[1] * math.cos(gamma)
bvec[1] = cell_lengths[1] * math.sin(gamma)
cvec[0] = cell_lengths[2] * math.cos(beta)
#  the whole difference cos(alpha) - cos(beta) * cos(gamma) is divided by sin(gamma)
cvec[1] = cell_lengths[2] * (math.cos(alpha) - math.cos(beta) * math.cos(gamma)) / math.sin(gamma)
cvec[2] = math.sqrt(cell_lengths[2]**2 - (cvec[0]**2 + cvec[1]**2))

print(avec)
print(bvec)
print(cvec)

chem_dict = {}
chem_dict["lattice_vectors"] = [avec, bvec, cvec]  #  key name requested in part (b)
print(chem_dict)

Mounted at /content/drive
Cell lengths =  [119.8, 193.1, 122.2]
Cell angles =  [90.0, 118.1, 90.0]
[119.8   0.    0. ]
[1.18239648e-14 1.93100000e+02 0.00000000e+00]
[-5.75576519e+01  1.10069817e-14  1.07795903e+02]
{'lattice_vectors': [array([119.8,   0. ,   0. ]), array([1.18239648e-14, 1.93100000e+02, 0.00000000e+00]), array([-5.75576519e+01,  1.10069817e-14,  1.07795903e+02])]}
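
The conversion in part (b) places the a vector along x and the b vector in the xy-plane. As a minimal sketch, the same steps can be wrapped in a reusable function. The name `cell_to_lattice_vectors` is an illustrative choice and is not taken from the Lecture 10 code.

```python
import numpy as np


def cell_to_lattice_vectors(a, b, c, alpha_deg, beta_deg, gamma_deg):
    """Return [avec, bvec, cvec] for cell lengths and angles given in degrees."""
    alpha, beta, gamma = np.radians([alpha_deg, beta_deg, gamma_deg])
    avec = np.array([a, 0.0, 0.0])                          # a along x
    bvec = np.array([b * np.cos(gamma), b * np.sin(gamma), 0.0])  # b in the xy-plane
    cx = c * np.cos(beta)
    cy = c * (np.cos(alpha) - np.cos(beta) * np.cos(gamma)) / np.sin(gamma)
    cz = np.sqrt(c**2 - cx**2 - cy**2)                      # fixes the length of c
    return [avec, bvec, np.array([cx, cy, cz])]


# Cell parameters printed above for 1hcy.cif
vectors = cell_to_lattice_vectors(119.8, 193.1, 122.2, 90.0, 118.1, 90.0)
for vector in vectors:
    print(vector)
```

A quick sanity check is that each vector has the length of the corresponding cell edge, for example `np.linalg.norm(vectors[2])` should return the c length.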


(c) Modify the code from Lecture 10 to parse this file and extract the atomic positions, and store the results as a list of dictionaries (as in the code from Lecture 10). Add this list to the dictionary from part (b), under the keyword "atom_coordinates". (Make sure to use the version in the notebook for Lecture 10 currently uploaded in eLearning).

In [None]:
from google.colab import drive
drive.mount("/content/drive")

f = open("/content/drive/MyDrive/Colab Notebooks/1hcy.cif", "r")
atomnum = 0
atomentries = []
readatom = False
loopstart = False

for line in f.readlines():
    if "loop_" in line:
        loopstart = True
    elif loopstart and ("_atom_site" in line) and (not "_atom_sites" in line):
        readatom = True
        loopstart = False
    else:
        loopstart = False

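    if readatom:
        #  Record the column index of each field. The full field name is compared so
        #  that e.g. "_atom_site.Cartn_x_esd" is not mistaken for "_atom_site.Cartn_x".
        field = line.strip()
        if field == "_atom_site.type_symbol":
            symbolpos = atomnum
        elif field == "_atom_site.label_comp_id":
            componentpos = atomnum
        elif field == "_atom_site.Cartn_x":
            xatompos = atomnum
        elif field == "_atom_site.Cartn_y":
            yatompos = atomnum
        elif field == "_atom_site.Cartn_z":
            zatompos = atomnum

        atomnum += 1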

    if readatom and ("ATOM" in line or "HETATM" in line):
        atomentry = {}
        atomlineparts = line.split()
        atomentry["species"] = atomlineparts[symbolpos]
        try:
            atomentry["xpos"] = float(atomlineparts[xatompos])
        except (ValueError, TypeError):
            print("X-coordinate is not a number: ", line)
        try:
            atomentry["ypos"] = float(atomlineparts[yatompos])
        except (ValueError, TypeError):
            print("Y-coordinate is not a number: ", line)
        try:
            atomentry["zpos"] = float(atomlineparts[zatompos])
        except (ValueError, TypeError):
            print("Z-coordinate is not a number: ", line)
        atomentries.append(atomentry)
    if readatom and (not ("ATOM" in line or "_atom_site" in line or "HETATM" in line)):
        print(line)
        break

f.close()
drive.flush_and_unmount()

print(atomentries[:3])  #  print out first three entries in atomentries
print(len(atomentries))

chem_dict["atom_coordinates"] = atomentries
print(chem_dict)

Output hidden; open in https://colab.research.google.com to view.

(d) Using the dictionary, extract and print the atomic positions of the Cu atoms (species value "CU").


In [None]:
for atom in chem_dict["atom_coordinates"]:  #  read the entries from the dictionary itself
    if atom["species"] == "CU":
        print("X: " + str(atom["xpos"]))
        print("Y: " + str(atom["ypos"]))
        print("Z: " + str(atom["zpos"]))
        print()

X: -25.799
Y: 55.987
Z: 91.714

X: -23.136
Y: 54.753
Z: 90.549
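
The column-lookup idea used in parts (c) and (d) can be tried without the full file. The sketch below parses a tiny hand-written `_atom_site` loop. The fragment is hypothetical: only the CU row reuses coordinates printed above, and the first row is made up. The function name `parse_atom_site` is also an assumption, not the Lecture 10 code. Splitting on whitespace is a simplification that would break on quoted values containing spaces.

```python
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM   1 N  ASP 1.000 2.000 3.000
HETATM 2 CU CU  -25.799 55.987 91.714
#
"""


def parse_atom_site(lines):
    """Return a list of {species, xpos, ypos, zpos} dictionaries."""
    columns = []  # field names in the order they are declared
    atoms = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])
        elif columns and line.startswith(("ATOM", "HETATM")):
            parts = line.split()
            atoms.append({
                "species": parts[columns.index("type_symbol")],
                "xpos": float(parts[columns.index("Cartn_x")]),
                "ypos": float(parts[columns.index("Cartn_y")]),
                "zpos": float(parts[columns.index("Cartn_z")]),
            })
        elif atoms:
            break  # first line after the atom records ends the loop
    return atoms


atoms = parse_atom_site(SAMPLE.splitlines())
copper = [atom for atom in atoms if atom["species"] == "CU"]
print(copper)
```

Looking up each field by its declared name keeps the parser independent of the column order in a particular file.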



2. (2x) Downloading data from the AFLOW database using AFLUX.

(a) Use AFLUX to download the band gaps for the alkali halides with the rocksalt structure from the AFLOW database. (Hint: use "AlkaliMetals" and "Halogens" as the species; the rocksalt structure prototype label is "AB_cF8_225_a_b"). Convert the resulting JSON file to a dictionary-like object.

In [None]:
import json, sys, os
from urllib.request import urlopen

SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(AlkaliMetals,Halogens),nspecies(2),aflow_prototype_label_relax(AB_cF8_225_a_b),Egap(*)"
DIRECTIVES = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVES

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
print(response)

[{'compound': 'Br1K1', 'auid': 'aflow:afe3eb563871153b', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Br1K1_ICSD_44282', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Br', 'K'], 'nspecies': 2, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b', 'Egap': 4.3393}, {'compound': 'Br1K1', 'auid': 'aflow:f43a65e2772ab1f5', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Br1K1_ICSD_22157', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Br', 'K'], 'nspecies': 2, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b', 'Egap': 4.3284}, {'compound': 'Br1Rb1', 'auid': 'aflow:9e643376d04eba9f', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Br1Rb1_ICSD_18017', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Br', 'Rb'], 'nspecies': 2, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b', 'Egap': 4.2245}, {'compound': 'Cl1K1', 'auid': 'aflow:8f1151d625186d0e', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Cl1K1_ICSD_187219', 'spacegroup_

(b) Use the results in the dictionary-like object to print the band gap and compound formula (key "compound") for each returned result.

In [None]:
response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
for datum in response:
    bandgap = [float(datum["Egap"])]
    compound = [str(x) for x in datum["compound"].split(",")]
    print("{}, {}".format(compound, bandgap))

['Br1K1'], [4.3393]
['Br1K1'], [4.3284]
['Br1Rb1'], [4.2245]
['Cl1K1'], [5.0588]
['Cl1Na1'], [5.0504]
['Cl1Na1'], [5.0544]
['Cl1Na1'], [5.0498]
['Cl1Na1'], [5.0569]
['Cl1Rb1'], [4.8347]
['Cs1F1'], [5.277]
['Cs4F4'], [5.2752]
['Cs1I1'], [3.8625]
['F1K1'], [5.9619]
['F1Li1'], [8.745]
['F1Na1'], [6.157]
['F1Na1'], [6.1624]
['F1Na1'], [6.1348]
['I1Li1'], [4.2448]
['I1Na1'], [3.6141]
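
Several compounds appear more than once in this list, with slightly different band gaps for each database entry. A minimal sketch of grouping the values by compound is shown below. The records are a hand-copied subset of the output above, not a fresh download, and the variable names are illustrative.

```python
from collections import defaultdict

# A few (compound, Egap) pairs copied from the printed results above.
records = [
    {"compound": "Br1K1", "Egap": 4.3393},
    {"compound": "Br1K1", "Egap": 4.3284},
    {"compound": "Cl1Na1", "Egap": 5.0504},
    {"compound": "Cl1Na1", "Egap": 5.0544},
    {"compound": "Cl1Na1", "Egap": 5.0498},
    {"compound": "Cl1Na1", "Egap": 5.0569},
    {"compound": "F1Li1", "Egap": 8.745},
]

# Collect every band gap reported for the same compound formula.
gaps_by_compound = defaultdict(list)
for record in records:
    gaps_by_compound[record["compound"]].append(float(record["Egap"]))

for compound, gaps in gaps_by_compound.items():
    print(f"{compound}: n = {len(gaps)}, min = {min(gaps)}, max = {max(gaps)}")
```

Seeing the spread per compound helps when deciding how to handle duplicate entries later, for example by keeping the first entry or by averaging.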


(c) Use AFLUX to download the data for materials that contain the element Sn but not Pb, and have a band gap of less than 5 eV. Use the paging(0) directive to download all of the available materials, and print the number of downloaded entries.

In [None]:
SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(Sn,!Pb),Egap(*5)"  #  Sn but not Pb; "*5" selects band gaps up to 5 eV
DIRECTIVES = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVES

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
print(response)
print(len(response))

[{'compound': 'Ag2Au1Sn1', 'auid': 'aflow:0b34171631525382', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB3_RAW/AgAuSn/T0001.A2BC', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Au', 'Sn'], 'Egap': 0}, {'compound': 'Ag1Au2Sn1', 'auid': 'aflow:21a8fcc9f957181d', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB3_RAW/AgAuSn/T0001.AB2C', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Au', 'Sn'], 'Egap': 0}, {'compound': 'Ag2Au1Sn1', 'auid': 'aflow:2780aad95ed2a75e', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB3_RAW/AgAuSn/T0002.A2BC', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Au', 'Sn'], 'Egap': 0}, {'compound': 'Ag2Au1Sn1', 'auid': 'aflow:2780aad95ed2a75e', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB3_RAW/AgAuSn/T0002.A2BC', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Au', 'Sn'], 'Egap': 0}, {'compound': 'Ag2Au1Sn1', 'auid': 'aflow:0b34171631525382', 'aurl': 'aflowlib.duke.edu:AFLOWDAT

(d) Use AFLUX to retrieve the band gaps for all materials that have a calculated bulk modulus of at least 200 GPa. The keyword for bulk modulus is "ael_bulk_modulus_vrh". Use the paging(0) directive to download all of the available materials, and print the number of downloaded entries.

In [None]:
SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "Egap(*),ael_bulk_modulus_vrh(200*)"  #  at least 200 GPa
DIRECTIVES = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVES

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
print(response)
print(len(response))

[{'compound': 'C1Er1Rh3', 'auid': 'aflow:02dec0b796759ab7', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/CUB/C1Er1Rh3_ICSD_108131', 'spacegroup_relax': 221, 'Pearson_symbol_relax': 'cP5', 'Egap': 0, 'ael_bulk_modulus_vrh': 202.65}, {'compound': 'Nb2Pt2', 'auid': 'aflow:02606407b3d5e7ab', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/ORC/Nb1Pt1_ICSD_645211', 'spacegroup_relax': 51, 'Pearson_symbol_relax': 'oP4', 'Egap': 0, 'ael_bulk_modulus_vrh': 216.745}, {'compound': 'Ni3P3W3', 'auid': 'aflow:033207703b32945f', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/HEX/Ni1P1W1_ICSD_646181', 'spacegroup_relax': 189, 'Pearson_symbol_relax': 'hP9', 'Egap': 0, 'ael_bulk_modulus_vrh': 253.719}, {'compound': 'Ir4Si4Ta4', 'auid': 'aflow:0333f4a56bc81637', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/ORC/Ir1Si1Ta1_ICSD_411884', 'spacegroup_relax': 62, 'Pearson_symbol_relax': 'oP12', 'Egap': 0, 'ael_bulk_modulus_vrh': 256.698}, {'compound': 'Al1Pt3', 'auid': 'aflow:03988d8bec1cfd41', 'aurl': 'aflowli

(e) Use AFLUX to retrieve the space group numbers of the relaxed structures of materials that contain Cu and Ti but not V (space group keyword: "spacegroup_relax"; do not include the paging directive). Print the space groups and the compound formula (keyword "compound").

In [None]:
SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(Cu,Ti,!V),spacegroup_relax(*)"

response = json.loads(urlopen(SERVER + API + MATCHBOOK).read().decode("utf-8"))
print(response)

for datum in response:
    compoundformula = datum["compound"]
    spacegroupnum = int(datum["spacegroup_relax"])  #  space group numbers are integers
    print("{}, {}".format(compoundformula, spacegroupnum))

[{'compound': 'Ag1Al1Cu1Ti1', 'auid': 'aflow:26dd7b7d87d8bb59', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB4_RAW/AgAlCu_pvTi_sv:PAW_PBE/ABCD_cF16_216_c_d_b_a.BCAD', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Al', 'Cu', 'Ti']}, {'compound': 'Ag1Al1Cu1Ti1', 'auid': 'aflow:c3550fdbea8c0e60', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB4_RAW/AgAlCu_pvTi_sv:PAW_PBE/ABCD_cF16_216_c_d_b_a.CABD', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Al', 'Cu', 'Ti']}, {'compound': 'Ag1Al1Cu1Ti1', 'auid': 'aflow:c76a1eadc1b5f622', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB4_RAW/AgAlCu_pvTi_sv:PAW_PBE/ABCD_cF16_216_c_d_b_a.ABCD', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['Ag', 'Al', 'Cu', 'Ti']}, {'compound': 'Ag1Au1Cu1Ti1', 'auid': 'aflow:6bd236656ab0b81b', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/LIB4_RAW/AgAuCu_pvTi_sv:PAW_PBE/ABCD_cF16_216_c_d_b_a.CABD', 'spacegroup_relax': 216, 'Pearson_symbol_relax': 'cF16', 'species': ['
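
All of the AFLUX requests in this problem are built the same way: matchbook terms joined by commas, optionally followed by directives. The sketch below collects that pattern in one helper. The function name `build_aflux_url` is hypothetical, and the helper only assembles the string; it does not contact the server.

```python
SERVER = "https://aflow.org"
API = "/API/aflux/?"


def build_aflux_url(matchbook_terms, directives=("$paging(0)",)):
    """Join matchbook terms and directives into one AFLUX request URL."""
    summons = ",".join(list(matchbook_terms) + list(directives))
    return SERVER + API + summons


# Part (a): rocksalt alkali halides, all pages
url_a = build_aflux_url([
    "species(AlkaliMetals,Halogens)",
    "nspecies(2)",
    "aflow_prototype_label_relax(AB_cF8_225_a_b)",
    "Egap(*)",
])

# Part (e): no paging directive
url_e = build_aflux_url(["species(Cu,Ti,!V)", "spacegroup_relax(*)"], directives=())

print(url_a)
print(url_e)
```

Printing the URL before calling `urlopen` makes it easy to paste the request into a browser and check the matchbook by eye.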

3. Using Optimade to download materials data.

(a) Use Optimade to retrieve ternary materials (nelements=3) in the AFLOW database that contain the element Hf. Print the resulting response.

In [None]:
SERVER = "https://aflow.org"
API = "/API/optimade/v1/structures?"
FILTER = "filter=elements%20HAS%20%22Hf%22%20AND%20nelements=3"
print(SERVER + API + FILTER)

response = json.loads(urlopen(SERVER + API + FILTER).read().decode("utf-8"))
print(response)

https://aflow.org/API/optimade/v1/structures?filter=elements%20HAS%20%22Hf%22%20AND%20nelements=3
{'data': [{'attributes': {'chemical_formula_descriptive': 'Hf2Nb1V1', 'nperiodic_dimensions': 3, 'elements': ['Hf', 'Nb', 'V'], 'nelements': 3, 'elements_ratios': [0.5, 0.25, 0.25], 'nsites': 4, 'last_modified': '2020-05-14T04:36:57Z', 'chemical_formula_anonymous': 'A2BC', 'cartesian_site_positions': [[0, 10.2726, 1.271], [0, 6.13972, 2.83193], [0, 9.02031, 4.32219], [0, 3.00677, 1.44073]], 'species_at_sites': ['Hf', 'Hf', 'Nb', 'V'], 'dimension_types': [1, 1, 1], 'species': [{'name': 'Hf', 'chemical_symbols': ['Hf'], 'concentration': [0.5]}, {'name': 'Nb', 'chemical_symbols': ['Nb'], 'concentration': [0.25]}, {'name': 'V', 'chemical_symbols': ['V'], 'concentration': [0.25]}], 'structure_features': [], 'chemical_formula_reduced': 'Hf2NbV'}, 'id': 'aflow:000004dec6c1b4d1', 'type': 'structures', 'relationships': {'references': {'data': [{'id': 'aflow++', 'type': 'references'}, {'id': 'aflow_

(b) Use Optimade to retrieve materials in the Nomad database that contain Cu and Ti but not V. Print the resulting response.

In [None]:
import json, sys, os
from urllib.request import urlopen

SERVER = "https://nomad-lab.eu"
API = "/prod/rae/optimade/structures?"
#  HAS ALL requires both Cu and Ti; HAS ANY would also match entries containing only one of them
FILTER = "filter=elements%20HAS%20ALL%20%22Cu%22,%22Ti%22%20AND%20NOT%20elements%20HAS%20%22V%22"
print(SERVER + API + FILTER)

response = json.loads(urlopen(SERVER + API + FILTER).read().decode("utf-8"))
print(response)

https://nomad-lab.eu/prod/rae/optimade/structures?filter=elements%20HAS%20ALL%20%22Cu%22,%22Ti%22%20AND%20NOT%20elements%20HAS%20%22V%22
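
Writing the percent-encoded filter by hand is error-prone. As a sketch, the filter can be written in readable form and encoded with `urllib.parse.quote` from the standard library. The helper name `optimade_filter` is an assumption for illustration. In the OPTIMADE filter language, `HAS ALL` requires every listed element to be present, whereas `HAS ANY` is satisfied by a single one, so "Cu and Ti" calls for `HAS ALL`.

```python
from urllib.parse import quote


def optimade_filter(expression):
    """Percent-encode a readable OPTIMADE filter expression for use in a URL."""
    # Keep "=" and "," unescaped so the result matches the hand-written strings.
    return "filter=" + quote(expression, safe="=,")


hf_ternaries = optimade_filter('elements HAS "Hf" AND nelements=3')
cu_ti_no_v = optimade_filter('elements HAS ALL "Cu","Ti" AND NOT elements HAS "V"')

print(hf_ternaries)
print(cu_ti_no_v)
```

The first string reproduces the filter used in part (a), which is a convenient check that the encoding behaves as expected.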

4. (2x) Fitting a linear regression model to a data set of alkaline-earth chalcogenides.

(a) Use AFLUX to download the band gaps for the alkaline-earth chalcogenides with the rocksalt structure from the AFLOW database. (Hint: use "AlkaliEarths" and "Chalcogens" as the species; the rocksalt structure prototype label is "AB_cF8_225_a_b"). Convert the resulting JSON file to a dictionary-like object. Use the directive $paging(0).

In [None]:
import json, sys, os
from urllib.request import urlopen

In [None]:
SERVER = "https://aflow.org"
API = "/API/aflux/?"
MATCHBOOK = "species(AlkaliEarths,Chalcogens),Egap(*),aflow_prototype_label_relax(AB_cF8_225_a_b)"
DIRECTIVES = "$paging(0)"
SUMMONS = MATCHBOOK + "," + DIRECTIVES

response = json.loads(urlopen(SERVER + API + SUMMONS).read().decode("utf-8"))
print(response)

[{'compound': 'Ba1O1', 'auid': 'aflow:0e8038c02ae8c17c', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1O1_ICSD_26961', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'O'], 'Egap': 2.0927, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1O1', 'auid': 'aflow:16381b6c1faa8de6', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1O1_ICSD_181199', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'O'], 'Egap': 2.0908, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1O1', 'auid': 'aflow:1c62ca1b1d49611f', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1O1_ICSD_58663', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'O'], 'Egap': 2.091, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1O1', 'auid': 'aflow:4b761225b2587de2', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1O1_ICSD_52278', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'spec

(b) Remove duplicate entries to generate a clean version of the data set.

In [None]:
response_clean = []
response_compounds = []
for entry in response:
    if entry["compound"] not in response_compounds:
        response_clean.append(entry)
        response_compounds.append(entry["compound"])
print(response_clean)
print(len(response_clean))

[{'compound': 'Ba1O1', 'auid': 'aflow:0e8038c02ae8c17c', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1O1_ICSD_26961', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'O'], 'Egap': 2.0927, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1S1', 'auid': 'aflow:0fc63077aece4a94', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1S1_ICSD_616053', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'S'], 'Egap': 2.1518, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1Se1', 'auid': 'aflow:0f9d6888d48157bf', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1Se1_ICSD_616124', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ba', 'Se'], 'Egap': 1.9489, 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ba1Te1', 'auid': 'aflow:1c9c8b13be804a36', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ba1Te1_ICSD_616165', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8

(c) Create the feature space using the ionization energy differences and electronegativity differences in the elements. The ionization energy and electronegativity values are in the python dictionary below.

In [None]:
Chemical_element_data={"Be":{"electronegativity":1.5,"first_ionization_energy":900},"Mg":{"electronegativity":1.2,"first_ionization_energy":736},"Ca":{"electronegativity":1.0,"first_ionization_energy":590},"Sr":{"electronegativity":1.0,"first_ionization_energy":548},"Ba":{"electronegativity":0.9,"first_ionization_energy":502},"O":{"electronegativity":3.5,"first_ionization_energy":1310},"S":{"electronegativity":2.5,"first_ionization_energy":1000},"Se":{"electronegativity":2.4,"first_ionization_energy":941},"Te":{"electronegativity":2.1,"first_ionization_energy":870}}

In [None]:
x_list = []
y_list = []

for datum in response_clean:
    species1 = datum["species"][0]
    species2 = datum["species"][1]
    en_diff = abs(Chemical_element_data[species1]["electronegativity"] - Chemical_element_data[species2]["electronegativity"])
    ie_diff = abs(Chemical_element_data[species1]["first_ionization_energy"] - Chemical_element_data[species2]["first_ionization_energy"])
    x_list.append([en_diff, ie_diff])
    y_list.append(datum["Egap"])

print(x_list)
print(y_list)

[[2.6, 808], [1.6, 498], [1.5, 439], [1.2000000000000002, 368], [2.0, 410], [1.0, 100], [0.8999999999999999, 41], [0.6000000000000001, 30], [2.5, 720], [1.5, 410], [1.4, 351], [1.1, 280], [2.3, 574], [1.3, 264], [1.2, 205], [0.9000000000000001, 134], [2.5, 762], [1.5, 452], [1.1, 322]]
[2.0927, 2.1518, 1.9489, 1.5908, 8.185, 0.9853, 0, 0, 3.6356, 2.381, 2.076, 1.538, 4.4686, 2.7703, 1.7714, 0.4374, 3.2754, 2.4956, 1.7635]
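
To make the feature construction concrete, the sketch below computes the two features for a single compound, BaO, using the values from the dictionary in part (c). The function name `difference_features` and the reduced dictionary `ELEMENT_DATA` are illustrative.

```python
# Subset of the element data given in part (c).
ELEMENT_DATA = {
    "Ba": {"electronegativity": 0.9, "first_ionization_energy": 502},
    "O": {"electronegativity": 3.5, "first_ionization_energy": 1310},
}


def difference_features(species, element_data):
    """Return [electronegativity difference, ionization energy difference]."""
    first, second = species
    en_diff = abs(element_data[first]["electronegativity"]
                  - element_data[second]["electronegativity"])
    ie_diff = abs(element_data[first]["first_ionization_energy"]
                  - element_data[second]["first_ionization_energy"])
    return [en_diff, ie_diff]


features = difference_features(["Ba", "O"], ELEMENT_DATA)
print(features)
```

The result corresponds to the first row of the feature list printed above. Note that the two features differ in scale by a factor of several hundred.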


(d) Fit a linear regression model to this data, splitting into training and test sets. Print the coefficients and intercept of the model, and the $R^2$ score for the training and test sets.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression  #  import lin reg package
from sklearn.model_selection import train_test_split  #  import method to split data into training & testing sets

x = np.array(x_list)
y = np.array(y_list)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)  #  random data split
linreg = LinearRegression().fit(x_train, y_train)  #  model, fit lin reg model

print("linreg.coef_:", linreg.coef_)  #  print model coefficients
print("linreg.intercept_:", linreg.intercept_)  #  print y-axis intercept
print("R^2 training score: ", linreg.score(x_train, y_train))
print("R^2 testing score: ", linreg.score(x_test, y_test))

linreg.coef_: [ 5.46789378 -0.00746215]
linreg.intercept_: -2.865173875368178
R^2 training score:  0.6408297323533231
R^2 testing score:  -2.7377308734424397
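
The test-set score above is negative, which can look surprising. The score is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, so it falls below zero whenever the model's squared error on that set is larger than the error of always predicting the set's mean. The pure-Python sketch below illustrates this with made-up numbers; `r_squared` is an illustrative name that follows the same definition scikit-learn documents for the `score` method of its regressors.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # error of the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # made-up target values
perfect = r_squared(y_true, [1.0, 2.0, 3.0])     # 1.0: exact predictions
mean_only = r_squared(y_true, [2.0, 2.0, 2.0])   # 0.0: always predicts the mean
reversed_ = r_squared(y_true, [3.0, 2.0, 1.0])   # -3.0: worse than the mean
print(perfect, mean_only, reversed_)
```

With only 19 compounds in the cleaned data set, the test split contains very few points, so a few poorly predicted values can dominate the score.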


5. (2x) Regression for electronic band gap.

(a) Use AFLUX to download the electronic band gap values from AFLOW for all materials with the rocksalt structure that do not contain lanthanides or actinides. Also download the band gap type (keyword: "Egap_type"). (Hint: modify the code introduced in Lecture 14).

In [None]:
import json, sys, os
from urllib.request import urlopen

SERVER = "https://aflow.org"
API = "/API/aflux/?"
#  only the actinides Th, Pa, U and Pu are excluded explicitly by this query
FILTER = "species(!Lanthanides,!Pa,!U,!Pu,!Th),Egap(*),Egap_type(*),aflow_prototype_label_relax(AB_cF8_225_a_b),$paging(0)"

response = json.loads(urlopen(SERVER + API + FILTER).read().decode("utf-8"))
print(response)
print(len(response))

[{'compound': 'Ag1Br1', 'auid': 'aflow:1ffe490975e7aeeb', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1Br1_ICSD_52246', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'Br'], 'Egap': 1.5727, 'Egap_type': 'insulator-indirect', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1Br1', 'auid': 'aflow:2ade73cf53aa6df5', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1Br1_ICSD_65061', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'Br'], 'Egap': 1.5728, 'Egap_type': 'insulator-indirect', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1Br1', 'auid': 'aflow:331f32c6e910ae2a', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1Br1_ICSD_56548', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'Br'], 'Egap': 1.5728, 'Egap_type': 'insulator-indirect', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1Br1', 'auid': 'aflow:acc935168652beb5', 'aurl': 'aflowl

(b) Remove duplicate entries to generate a clean version of the data set.

In [None]:
response_clean = []
response_compounds = []
for entry in response:
    if entry["compound"] not in response_compounds:
        response_clean.append(entry)
        response_compounds.append(entry["compound"])
print(response_clean)
print(len(response_clean))

[{'compound': 'Ag1Br1', 'auid': 'aflow:1ffe490975e7aeeb', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1Br1_ICSD_52246', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'Br'], 'Egap': 1.5727, 'Egap_type': 'insulator-indirect', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1C1', 'auid': 'aflow:7726f25d86a7ac89', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1C1_ICSD_183175', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'C'], 'Egap': 0, 'Egap_type': 'metal', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1Cl1', 'auid': 'aflow:1b9d91d05f2e4c8c', 'aurl': 'aflowlib.duke.edu:AFLOWDATA/ICSD_WEB/FCC/Ag1Cl1_ICSD_64734', 'spacegroup_relax': 225, 'Pearson_symbol_relax': 'cF8', 'species': ['Ag', 'Cl'], 'Egap': 1.9714, 'Egap_type': 'insulator-indirect', 'aflow_prototype_label_relax': 'AB_cF8_225_a_b'}, {'compound': 'Ag1F1', 'auid': 'aflow:bcee7bb7be81cfb8', 'aurl': 'aflowlib.duke.edu:AFLOWDATA

(c) Read in the JSON file with the elemental properties, and use it to generate the feature vectors based on the differences in the electronegativities and ionization energies. Only include entries that are not metals (if "metal" not in entry["Egap_type"]):

In [None]:
from google.colab import drive
drive.mount('/content/drive')

f = open("/content/drive/MyDrive/Colab Notebooks/Chemical_element_data.json", "r")  #  read-only access is sufficient
Chemical_element_data = json.load(f)
f.close()
drive.flush_and_unmount()
print(Chemical_element_data)

x_list = []
y_list = []

for datum in response_clean:
    if "metal" not in datum["Egap_type"]:
        species1 = datum["species"][0]
        species2 = datum["species"][1]
        en_diff = abs(Chemical_element_data[species1]["electronegativity"] - Chemical_element_data[species2]["electronegativity"])
        ie_diff = abs(Chemical_element_data[species1]["first_ionization_energy"] - Chemical_element_data[species2]["first_ionization_energy"])
        x_list.append([en_diff, ie_diff])
        y_list.append(datum["Egap"])

print(x_list)
print(y_list)

Mounted at /content/drive
{'H': {'valence': 1.0, 'atomic_mass': 1.008, 'first_ionization_energy': 1310.0, 'electronegativity': 2.1}, 'Li': {'valence': 1.0, 'atomic_mass': 6.94, 'first_ionization_energy': 519.0, 'electronegativity': 1.0}, 'Be': {'valence': 2.0, 'atomic_mass': 9.013, 'first_ionization_energy': 900.0, 'electronegativity': 1.5}, 'B': {'valence': 3.0, 'atomic_mass': 10.82, 'first_ionization_energy': 799.0, 'electronegativity': 2.0}, 'C': {'valence': 4.0, 'atomic_mass': 12.01, 'first_ionization_energy': 1090.0, 'electronegativity': 2.5}, 'N': {'valence': 5.0, 'atomic_mass': 14.008, 'first_ionization_energy': 1400.0, 'electronegativity': 3.0}, 'O': {'valence': 6.0, 'atomic_mass': 16.0, 'first_ionization_energy': 1310.0, 'electronegativity': 3.5}, 'F': {'valence': 7.0, 'atomic_mass': 19.0, 'first_ionization_energy': 1680.0, 'electronegativity': 4.0}, 'Na': {'valence': 1.0, 'atomic_mass': 22.97, 'first_ionization_energy': 494.0, 'electronegativity': 0.9}, 'Mg': {'valence': 2.0,

(d) Split the data into testing and training sets. Fit linear regression to the training set. What is the fit score for the test set?

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression  #  import lin reg package
from sklearn.model_selection import train_test_split  #  import method to split data into training & testing sets

x = np.array(x_list)
y = np.array(y_list)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)  #  random data split
linreg = LinearRegression().fit(x_train, y_train)  #  model, fit lin reg model

print("R^2 testing score: ", linreg.score(x_test, y_test))

R^2 testing score:  0.5353752660372736


(e) Fit k-nearest neighbors regression to the same data. Use separate training, validation and test sets to optimize the number of nearest neighbors (test up to a maximum of 10 neighbors), and evaluate the resulting model. How does it compare to linear regression?

In [None]:
from sklearn.neighbors import KNeighborsRegressor  #  import KNN regression package
from sklearn.model_selection import cross_val_score  #  import cross-validation scoring method

kvals = []
cv_scores = []

for k in range(1, 10 + 1):  #  test 1 to 10 nearest neighbors
    reg = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(reg, x, y)  #  R^2 score on each cross-validation fold
    ave_scores = sum(scores) / len(scores)  #  average score over the folds
    kvals.append(k)
    cv_scores.append(ave_scores)

print("k = ", kvals)
print("Average score = ", cv_scores)

k =  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Average score =  [-1.1181438617655401, -0.6554782721931158, -0.4665056842844221, -0.3404440216250265, -0.3352688824030756, -0.29185655854726505, -0.2549667131841688, -0.23478972761799297, -0.28641702012257186, -0.2653848063006333]
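
The cross-validation scores above are negative for every k (the least negative is about -0.23 at k = 8), whereas linear regression reached a test $R^2$ of about 0.54 in part (d). On this evidence k-nearest neighbors does worse, although the two numbers come from different procedures: `cross_val_score` averages over folds of the full data set, taken in stored order without shuffling, while part (d) used a single random split. The sketch below shows one way to set up the separate training, validation and test sets that the question asks for. It runs on synthetic stand-in data, not the AFLOW band gaps, and names such as `x_demo` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (NOT the AFLOW band gaps): two features, one noisy target.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 3.0, size=(120, 2))
y_demo = 2.0 * x_demo[:, 0] - x_demo[:, 1] + rng.normal(0.0, 0.1, size=120)

# Hold out a test set first, then split the remainder into training and validation.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x_demo, y_demo, random_state=0)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_trainval, y_trainval, random_state=1)

# Choose k using the validation set only.
best_k, best_score = None, -np.inf
for k in range(1, 10 + 1):
    reg = KNeighborsRegressor(n_neighbors=k).fit(x_train, y_train)
    score = reg.score(x_valid, y_valid)
    if score > best_score:
        best_k, best_score = k, score

# Refit with the chosen k on training + validation data; use the test set once.
final = KNeighborsRegressor(n_neighbors=best_k).fit(x_trainval, y_trainval)
test_score = final.score(x_test, y_test)

print("best k =", best_k)
print("validation R^2 =", best_score)
print("test R^2 =", test_score)
```

Because k-nearest neighbors relies on distances between feature vectors, features with very different scales, such as the two used here, may need rescaling before the comparison with linear regression is fair.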
