# Question 1: Structural features
**1.1 cif-cn-featurizer**

Our friend Anton Oliynyk has recently released a new Python script, `cif-cn-featurizer`.

Here is their description: "A Python script designed to process CIF (Crystallographic Information File) files and extract various features from them. These features include interatomic distances, atomic environment information, and coordination numbers. The script can handle binary and ternary compounds."

Let's test it out and see how it works! To do so we will need some cifs. Right now the featurizer only works with binaries made up of certain elements shown in this plot. 

![allowed elements](https://github.com/sp8rks/MaterialsInformatics/blob/main/HW/HW2/cif-cn-featurizer-allowed-elements.png?raw=true)

**<font color='teal'>a)</font>** Download the `cif-cn-featurizer` files and run it on the cif files in the `HW\cn-featurizer\cifs` folder. 

Note: in case you can't get it working, you'll also find a csv folder with all the extracted features for these cifs already complete, but try and get it working so you can use it in the future!

Byron's notes on running the featurizer code:

I used `conda install -c conda-forge` for all installs in my environment.

The cif-cn-featurizer GitHub prerequisites advise installing:
- Click
- Pandas
- Gemmi

`main.py` then failed to run because SciPy was missing, so I installed scipy. It failed again because sympy was missing, so I installed sympy.

The script then ran, but when I selected the folder `cif_test` from the cloned repository, it errored with a missing openpyxl dependency. After I installed openpyxl, the script ran on their `cif_test` set without issues. Their test set contained only one file, `AlSb.cif`.

**Open question:** does using `pip install` for the GitHub prerequisites install the additional packages automatically? I did not evaluate this.
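The install-run-fail cycle described in these notes can be shortened by checking up front which packages are importable. The following is a minimal sketch, assuming the import names match the package names listed in the notes (`click`, `pandas`, `gemmi`, `scipy`, `sympy`, `openpyxl`); the helper name `missing_modules` is made up for illustration.

```python
import importlib.util

# Import names of the packages that came up in the notes above (assumed to
# match the names used in `import` statements).
REQUIRED = ["click", "pandas", "gemmi", "scipy", "sympy", "openpyxl"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Install these before running main.py:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Running this inside the conda environment before `main.py` lists every missing package in one pass instead of one failure at a time.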

I modified the featurizer's `main.py` by setting `script_directory` to the HW2 folder that contains the `cifs` folder for this homework question, then ran it from the command line.

I also added a try-except block to `main.py` so that files with issues are skipped instead of stopping the run.


In [2]:
# The following code was either added to or commented out of the original main.py of the cif-cn-featurizer

# # script_directory = os.path.dirname(os.path.abspath(__file__))
# script_directory = r"C:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer"
# cif_folder_directory = folder.choose_CIF_directory(script_directory)

# # I used the line above, but my resulting file was not in the folder I wanted.
# # I believe the line below will fix that so I don't have to move it later, but I did not test it.
# cif_folder_directory = os.path.join(script_directory, "csv")

# # Generate an error file for any errors in processing CIF files
# error_log_file = os.path.join(script_directory, 'cif-cn-featurizer-errors.csv')

# # Added the try statement in this for loop
# for idx, filename in enumerate(files_lst, start=1):
#     try:
#         start_time = time.time()
#         # ... (rest of the original loop body)
#     except Exception as e:
#         error_message = f"Could not process {filename}. Error: {e}"
#         print(error_message)
#         # write the error to the file
#         with open(error_log_file, 'a') as file:
#             filename_base = os.path.basename(filename)
#             file.write(f'"{filename_base}","{error_message}"\n')
#         continue
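The skip-and-log pattern in the commented code above can be sketched as a standalone function. This is an illustration, not the featurizer's code: `process_files` and `fake_featurize` are hypothetical names. It uses the `csv` module so that quotes inside an error message are escaped properly, and it writes a header row, which the later `pd.read_csv` call relies on when it refers to the `Filename` and `Error Message` columns.

```python
import csv
import os
import tempfile


def process_files(files, process_one, error_log_file):
    """Run process_one on each file; log failures to a CSV and keep going."""
    failed = []
    with open(error_log_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Filename", "Error Message"])  # header row
        for filename in files:
            try:
                process_one(filename)
            except Exception as e:
                message = f"Could not process {filename}. Error: {e}"
                print(message)
                writer.writerow([os.path.basename(filename), message])
                failed.append(filename)
    return failed


def fake_featurize(path):
    # Hypothetical stand-in for the featurizer: fails on one file the way a
    # missing element in a property table would (a KeyError).
    if path.endswith("250065.cif"):
        raise KeyError("Mo")


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "cif-cn-featurizer-errors.csv")
    failed = process_files(["250065.cif", "250101.cif"], fake_featurize, log_path)
    with open(log_path, newline="") as fh:
        rows = list(csv.reader(fh))
```

Opening the log once in write mode also avoids appending to a stale log from an earlier run, which the `'a'` mode in the original snippet would do.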



In [3]:
# Read the error file

import pandas as pd
import os
csv_file_path = r"C:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\csv"
# os.path.join avoids the invalid "\c" escape sequence of a hand-built path
error_file = os.path.join(csv_file_path, "cif-cn-featurizer-errors.csv")
error_df = pd.read_csv(error_file)



In [4]:
import pandas as pd

# The error message has the full path name
# Truncate it and put it in 'Error' for easier reading
error_df['Error'] = error_df['Error Message'].apply(lambda x: x[-11:])

# Display all cif files and the error message
print(error_df[['Filename','Error']])

      Filename        Error
0   250022.cif  Error: 'Nb'
1   250065.cif  Error: 'Mo'
2   250125.cif  Error: 'Mo'
3   250186.cif  Error: 'Mo'
4   250191.cif  Error: 'Mo'
5   250223.cif  Error: 'Nb'
6   250225.cif  Error: 'Mo'
7   250236.cif  Error: 'Nb'
8   250390.cif  Error: 'Mo'
9   250399.cif   Error: 'V'
10  250476.cif  Error: 'Pb'
11  250477.cif  Error: 'Pb'
12  250525.cif  Error: 'Ti'
13  250527.cif  Error: 'Pb'
14  250530.cif  Error: 'Nb'
15  250561.cif   Error: 'W'
16  250562.cif   Error: 'W'
17  250563.cif  Error: 'Mo'
18  250564.cif  Error: 'Mo'
19  250628.cif  Error: 'Nb'
20  250679.cif  Error: 'Pb'
21  250733.cif  Error: 'Ta'
22  250735.cif  Error: 'Nb'
23  250737.cif  Error: 'Ta'
24  250740.cif   Error: 'V'
25  250742.cif   Error: 'V'
26  250751.cif  Error: 'Au'
27  250752.cif  Error: 'Au'
28  250753.cif  Error: 'Au'
29  250754.cif  Error: 'Au'
30  250755.cif  Error: 'Au'
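The fixed slice `x[-11:]` works here because every message ends in `Error: 'Xx'`, but for one-letter symbols such as V and W it also picks up the preceding space. A regular expression that captures only the element symbol is one alternative. This is a sketch that assumes every message ends with a quoted one- or two-letter symbol, as in the output above; `failing_element` is a made-up helper name.

```python
import re

# Matches a trailing "Error: 'Xx'" and captures the element symbol.
ELEMENT_IN_ERROR = re.compile(r"Error: '([A-Z][a-z]?)'\s*$")


def failing_element(message):
    """Return the element symbol at the end of an error message, or None."""
    match = ELEMENT_IN_ERROR.search(message)
    return match.group(1) if match else None


print(failing_element("Could not process 250399.cif. Error: 'V'"))
print(failing_element("Could not process 250065.cif. Error: 'Mo'"))
```

With the DataFrame above, this could be applied as `error_df['Element'] = error_df['Error Message'].apply(failing_element)`, after which `value_counts()` groups by element symbol directly.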


In [5]:
# I had ChatGPT 3.5 write me the code to 
# get a new df that has the unique Error with a count of how many records in my error_df have that Error

# Get counts of each unique error
error_counts = error_df["Error"].value_counts()

# Create a new DataFrame with unique errors and their counts
unique_errors_df = pd.DataFrame({"Error": error_counts.index, "Count": error_counts.values})

# Display the new DataFrame
print(unique_errors_df)


         Error  Count
0  Error: 'Mo'      8
1  Error: 'Nb'      6
2  Error: 'Au'      5
3  Error: 'Pb'      4
4   Error: 'V'      3
5   Error: 'W'      2
6  Error: 'Ta'      2
7  Error: 'Ti'      1


After looking through `main.py` and the file structure of the repository, it seems clear that the scripts pull data from the Excel file and use it to generate the features for binary and ternary systems.

The `_featurizer_log.csv` that was supplied with the CIF files records that most of the CIF files were processed successfully. When I run the featurizer, it fails to process 31 of the files, and each error names one of eight elements (Mo, Nb, Au, Pb, V, W, Ta, Ti).

It seems the supplied files were processed with a property file that was more complete than the one available in the repository. For example, all of the W and Ta files that failed for me are recorded in the log as successfully processed.

126 CIF files were received, but only 124 were processed according to the log. Which 2 files were not processed?
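The open question of which two files are missing from the log could be answered with a set comparison between the CIF files in the folder and the names recorded in the log. The sketch below assumes the log identifies each file either by file name or by its numeric id; which log column holds that value is not known here, so the function takes plain lists. `unprocessed_cifs` is a hypothetical helper name.

```python
import os


def unprocessed_cifs(cif_filenames, logged_names):
    """Return CIF files whose base name (without extension) is not in the log."""
    logged = {os.path.splitext(os.path.basename(str(n)))[0] for n in logged_names}
    return sorted(
        f for f in cif_filenames
        if os.path.splitext(os.path.basename(f))[0] not in logged
    )


# Tiny illustration with made-up log contents
print(unprocessed_cifs(["250022.cif", "250065.cif", "250101.cif"],
                       ["250022", "250101.cif"]))
```

In practice the first argument would come from `os.listdir` on the `cifs` folder and the second from the relevant column of `_featurizer_log.csv`.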

**1.2 Getting labeled data for the cifs**

**<font color='teal'>b)</font>** Now that you've got feature vectors in a series of .csv files, let's use them to build a model to predict a property. To get a property let's search for a materials project entry using the cif cards! If you've forgotten how, go back to the `legacy_MPRester_tutorial.ipynb` notebook where we did an example. Once you have the material project id, run a query to extract a property like bulk modulus (["elasticity"]["K_VRH"])

In [6]:
# Set up my MP Rester with my API key

from mp_api.client import MPRester
import os

#set path to text file with the MP API key
filename = r'C:\Users\byron\OneDrive\Documents\Byron School\Materials Informatics\MP-apikey.txt'

def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)

#set MP_APIKEY to the key
MP_APIKEY = get_file_contents(filename)

#set mpr as the MPRester to simplify later code
mpr = MPRester(MP_APIKEY)


No module named 'phonopy'
No module named 'phonopy'


In [7]:
#pull mpid files from the mp database for each cif file in the cifs folder
import os

# get a list of all cif files
script_directory = os.getcwd()
cif_path = os.path.join(script_directory, 'cn-featurizer', 'cifs')
all_files = os.listdir(cif_path)
cif_files = [os.path.join(cif_path, file) for file in all_files if file.endswith('.cif')]

# In HW1 I used the following list comprehension successfully:
# mpids = [mpr.find_structure(cif_file) for cif_file in cif_files]
# 2/18: I keep getting 504 time-out errors, so I expanded it into a for loop
# with a try-except that prints successes and failures.

# Property suggested in the prompt: bulk modulus (["elasticity"]["K_VRH"])

import pandas as pd
import os

# Collect one record per CIF file, then build the DataFrame once at the end
records = []

for cif_file in cif_files:
    try:
        mpid = mpr.find_structure(cif_file)
        records.append({'cif_file': os.path.basename(cif_file), 'mpid': mpid})
        print(f"Success processing {cif_file}.")
    except Exception as e:
        print(f"Error processing {cif_file}. Error is {e}")

# DataFrame with columns for cif_file and mpid
mpid_records = pd.DataFrame(records, columns=['cif_file', 'mpid'])


Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250022.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250065.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250101.cif.




Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250125.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250186.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250191.cif.




Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250223.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250225.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250236.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250328.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250329.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250330.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250331.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250332.cif.
Success processing c:\Us

'_atom_site_label'
No structure parsed for section 1 in CIF.
'_atom_site_label'


Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250628.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250646.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250647.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250648.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250649.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250650.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250651.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250652.cif.
Success processing c:\Us

'_atom_site_label'


Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250733.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250735.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250737.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250740.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250742.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250751.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250752.cif.
Success processing c:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\250753.cif.
Success processing c:\Us

The output above was saved to `mpr_log.txt` so I can go back through the errors and either find or fix the CIF files that returned them. The cause could be just a minor formatting issue.
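Because the lookups sometimes returned 504 time-out errors, one option is to retry a failed call a few times before logging it as an error. The following is a generic sketch using only the standard library; `call_with_retries` and `flaky_lookup` are hypothetical names, and the returned string is a placeholder, not a real Materials Project id. Note that it retries on any exception, so a CIF that fails to parse would also be retried before the error is raised.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=2.0, backoff=2.0, **kwargs):
    """Call func, retrying on any exception with an increasing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if attempt == attempts:
                raise  # out of attempts: let the caller log the error
            print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f} s")
            time.sleep(delay)
            delay *= backoff


calls = {"n": 0}


def flaky_lookup(name):
    # Hypothetical stand-in for a network call that times out twice.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("504 Gateway Time-out")
    return f"placeholder-id-for-{name}"


result = call_with_retries(flaky_lookup, "250022.cif", attempts=3, delay=0.0)
print(result)
```

Inside the loop above, the call would become `mpid = call_with_retries(mpr.find_structure, cif_file)`, with the existing try-except left in place to catch files that still fail.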


In [8]:
# Filter mpid_records into success and error
mpid_records_success = mpid_records[mpid_records['mpid'].astype(bool)]
mpid_records_error = mpid_records[~mpid_records['mpid'].astype(bool)]


In [9]:
# Pull the density of each mpid
mpids = mpid_records_success['mpid'].tolist()

docs = mpr.materials.summary.search(material_ids=mpids,
                                    fields=["material_id",
                                            "formula_pretty",
                                            "density"])

Retrieving SummaryDoc documents:   0%|          | 0/92 [00:00<?, ?it/s]

The query returned 92 records for the `mpids` list. The list holds 105 mpids, so it contains several duplicate values. Data Wrangler reports 86 distinct mpids, so it is unclear why the query returned 92 records for 86 unique ids.
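One way to look into the mismatch between these counts is to tally how often each id occurs, both in the list sent to the query and in the documents that come back. The sketch below uses `collections.Counter` on a few ids copied from the printed list; `duplicate_summary` is a made-up helper name. Ids are converted with `str()` so that `MPID` objects and plain strings compare equal, which also means a cell holding a list of ids would be counted as a single key.

```python
from collections import Counter


def duplicate_summary(ids):
    """Return (number of distinct ids, {id: count} for ids seen more than once)."""
    counts = Counter(str(i) for i in ids)
    duplicates = {key: n for key, n in counts.items() if n > 1}
    return len(counts), duplicates


# A few ids taken from the printed output, including repeats
demo_ids = ["mp-865411", "mp-16513", "mp-865411",
            "mp-16513", "mp-20309", "mp-865411"]
n_unique, duplicates = duplicate_summary(demo_ids)
print(n_unique, duplicates)
```

Running the same function on `mpid_records_success['mpid']` and on `[doc.material_id for doc in docs]` would show whether the extra records come from repeated ids in the response.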

In [10]:
mpids = [doc.material_id for doc in docs]
mpids

[MPID(mp-20131),
 MPID(mp-20729),
 MPID(mp-20369),
 MPID(mp-20903),
 MPID(mp-21197),
 MPID(mp-20258),
 MPID(mp-20920),
 MPID(mp-20236),
 MPID(mp-21431),
 MPID(mp-1291),
 MPID(mp-21177),
 MPID(mp-977),
 MPID(mp-801),
 MPID(mp-22568),
 MPID(mp-16513),
 MPID(mp-865411),
 MPID(mp-1080098),
 MPID(mp-1101986),
 MPID(mp-1102392),
 MPID(mp-2588),
 MPID(mp-569196),
 MPID(mp-19977),
 MPID(mp-1051),
 MPID(mp-20309),
 MPID(mp-2451),
 MPID(mp-865411),
 MPID(mp-16513),
 MPID(mp-12553),
 MPID(mp-567305),
 MPID(mp-959),
 MPID(mp-2134),
 MPID(mp-1409),
 MPID(mp-21432),
 MPID(mp-1451),
 MPID(mp-30745),
 MPID(mp-30866),
 MPID(mp-11482),
 MPID(mp-30787),
 MPID(mp-2351),
 MPID(mp-790),
 MPID(mp-481),
 MPID(mp-2092),
 MPID(mp-21427),
 MPID(mp-718),
 MPID(mp-640095),
 MPID(mp-20309),
 MPID(mp-369),
 MPID(mp-865411),
 MPID(mp-1451),
 MPID(mp-891),
 MPID(mp-1082),
 MPID(mp-2006),
 MPID(mp-1080590),
 MPID(mp-30386),
 MPID(mp-1080756),
 MPID(mp-980752),
 MPID(mp-1101053),
 MPID(mp-2465),
 MPID(mp-1549),
 MPID(mp

In [11]:
# Generate a df of the data pulled from the Materials Project
import pandas as pd

# Build the DataFrame in one step instead of concatenating row by row
mpdata_df = pd.DataFrame(
    [{'mpid': doc.material_id, 'formula': doc.formula_pretty, 'density': doc.density} for doc in docs],
    columns=['mpid', 'formula', 'density'])
        



In [12]:
# Remove duplicate mpids
mpdata_df = mpdata_df.drop_duplicates()
mpdata_df

| | mpid | formula | density |
|---|---|---|---|
| 0 | mp-20131 | YIn3 | 7.270778 |
| 1 | mp-20729 | LaIn3 | 7.376636 |
| 2 | mp-20369 | CeIn3 | 7.775450 |
| 3 | mp-20903 | PrIn3 | 7.656969 |
| 4 | mp-21197 | NdIn3 | 7.823092 |
| ... | ... | ... | ... |
| 87 | mp-376 | PrSn3 | 7.546256 |
| 88 | mp-1977 | NdSn3 | 7.691879 |
| 89 | mp-2484 | SmSn3 | 7.930209 |
| 90 | mp-20387 | EuSn3 | 7.801257 |


The MPID records I pulled for the CIF files need to be tied to the featurizer data in the CSV files. I need to query these CSV files with the formulae in my docs to relate the density I pulled with MPRester to the features generated by the cif featurizer.

The following CSV files were generated by the featurizer:
- cifs_atomic_environment_features_binary
- cifs_atomic_environment_wyckoff_multiplicity_features_binary
- cifs_atomic_environment_wyckoff_multiplicity_features_universal
- cifs_coordination_number_binary_all
- cifs_coordination_number_binary_avg
- cifs_coordination_number_binary_max
- cifs_coordination_number_binary_min
- cifs_interatomic_features_binary
- cifs_interatomic_features_universal

In [13]:
# Merge the mpid_record_success which has CIF files with MPIDs with the mpdata_df that has the formula and density

cif_data_df = pd.merge(mpid_records_success,mpdata_df[['mpid','formula','density']],on='mpid',how='left')


In [14]:
# cif_data_df now has a list of all cif_files where an mpid was found 
# looking through we can now see which records are missing formula and density data which
# tells us which records failed to pull from the Materials Project

# scrub the NaN records (copy so the new column is added to an independent DataFrame)
cif_data_scrubbed = cif_data_df.dropna().copy()
# record the CIF_id from each cif_file and convert it to an integer
cif_data_scrubbed['CIF_id'] = cif_data_scrubbed['cif_file'].str.replace('.cif', '', regex=False).astype(int)
cif_data_scrubbed


| | cif_file | mpid | formula | density | CIF_id |
|---|---|---|---|---|---|
| 0 | 250065.cif | mp-801 | Mo3Os | 12.883572 | 250065 |
| 1 | 250101.cif | mp-768 | TmAl3 | 5.632610 | 250101 |
| 2 | 250125.cif | mp-1139 | Co3Mo | 9.962997 | 250125 |
| 3 | 250191.cif | mp-1232 | Mo3Pt | 12.898261 | 250191 |
| 4 | 250225.cif | mp-11506 | Ni3Mo | 9.769614 | 250225 |
| ... | ... | ... | ... | ... | ... |
| 99 | 250920.cif | mp-30634 | HoFe3 | 8.915389 | 250920 |
| 101 | 250934.cif | mp-718 | SnPd3 | 11.303794 | 250934 |
| 102 | 250962.cif | mp-20971 | SnPt3 | 17.869663 | 250962 |
| 103 | 250963.cif | mp-20516 | InPt3 | 17.912752 | 250963 |


Now I have a dataframe, `cif_data_scrubbed`, which has all the information I need to merge with the features in the CSV files by `CIF_id`.

In the code block below, the features are pulled in from the CSV files. Initially I pulled in all the CSV files listed in `feature_files`, but the number of rows multiplied with every merge. This is because some of the CSV files contain multiple rows for each `CIF_id`, and a left merge repeats the left-hand row once for every matching row. I am not 100% sure how to reconcile that in an SVM model; I was under the impression that the featurizer would add many columns as features to a single row. The commented-out files below are those that had multiple rows per `CIF_id`.

I will use the other features in my model.
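One possible way to bring the multi-row files back into the model is to collapse them to a single row per `CIF_id` before merging, for example by averaging the numeric columns. This is a sketch under assumptions, not something the featurizer prescribes: the column names `site` and `coordination_number` are invented for the example, and averaging is only one choice (minimum, maximum, or several aggregates side by side would also give one row per CIF).

```python
import pandas as pd


def one_row_per_cif(df, key="CIF_id"):
    """Collapse repeated rows per CIF by averaging the numeric columns."""
    numeric = df.select_dtypes(include="number").columns.drop(key, errors="ignore")
    return df.groupby(key, as_index=False)[list(numeric)].mean()


# Hypothetical per-site rows: two rows for one CIF, one row for another
site_rows = pd.DataFrame({
    "CIF_id": [250065, 250065, 250101],
    "site": ["Mo1", "Os1", "Tm1"],              # invented label column
    "coordination_number": [14.0, 12.0, 12.0],  # invented values
})
collapsed = one_row_per_cif(site_rows)
print(collapsed)
```

After this step a left merge on `CIF_id` keeps the row count of the left-hand table unchanged, because each key appears at most once on the right.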

In [15]:

import pandas as pd
import os

# This was also set above, but it is included here again so this code is easier to reuse later
csv_file_path = r"C:\Users\byron\OneDrive\Documents\GitHub\MaterialsInformatics\byron\HW2\cn-featurizer\cifs\csv"

# list of the csv files with all of the structural features
feature_files = [
    "cifs_atomic_environment_features_binary.csv",                          # 0 *
    # "cifs_atomic_environment_wyckoff_multiplicity_features_binary.csv",     # 1 *
    # "cifs_atomic_environment_wyckoff_multiplicity_features_universal.csv",  # 2 *
    # "cifs_coordination_number_binary_all.csv",                              # 3
    # "cifs_coordination_number_binary_avg.csv",                              # 4
    # "cifs_coordination_number_binary_max.csv",                              # 5
    # "cifs_coordination_number_binary_min.csv",                              # 6
    "cifs_interatomic_features_binary.csv",                                 # 7 **
    "cifs_interatomic_features_universal.csv"                               # 8 **
]

features_df = cif_data_scrubbed.copy()
for file in feature_files:
    file_path = os.path.join(csv_file_path,file)
    file_df = pd.read_csv(file_path)
    features_df = pd.merge(features_df, file_df, on="CIF_id",how='left')




In [16]:
print(f"cif_data_scrubbed shape = {cif_data_scrubbed.shape}")
print(f"features_df shape = {features_df.shape}")

cif_data_scrubbed shape = (95, 5)
features_df shape = (95, 67)


At this point, my `cif_data_scrubbed` df has the cif_file, CIF_id, formula, mpid, and density for each CIF file where this data was found. My `features_df` has all those columns plus the structural features produced by the cif-cn-featurizer. The feature CSV files that had multiple rows per CIF file were excluded.

**1.3 Comparing structural features to compositional features**

**<font color='teal'>c)</font>** Now that you've got structural features and you can get compositional features (use CBFV), let's compare them! Build a Support vector machine regressor model with each feature set and determine which works better. 

In [17]:
# generate a new df to use for cbfv featuring
cbfv_unfeatured_df = cif_data_scrubbed[['formula','density']].copy().rename(columns={'density':'target'})

In [18]:
# generate features using CBFV

from CBFV import composition
feature_model = 'oliynyk'       # 'oliynyk' is the default featurizer model, but I'm setting it explicitly anyway
X, y, formulae, skipped = composition.generate_features(cbfv_unfeatured_df,elem_prop=feature_model)

Processing Input Data: 100%|██████████| 95/95 [00:00<00:00, 8550.25it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 95/95 [00:00<00:00, 6114.52it/s]

	Creating Pandas Objects...





In [19]:
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# scale and normalize the data
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize

scaler = StandardScaler()   # Set up the scaler as the StandardScaler

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)

X_train = scaler.fit_transform(X_train_unscaled)       # fit the scaler on the training split only
X_test = scaler.transform(X_test_unscaled)             # reuse the training statistics on the test split

# Normalize each sample (row) of the X datasets to unit L2 norm
X_train = normalize(X_train)
X_test = normalize(X_test)


svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)


the r2 score is 0.9948482146436889
the mean absolute error is 0.15708938688275234


In [24]:
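# Now split the cif-cn-featurizer data in features_df into features and target

features_df_dropna = features_df.dropna()
# Select density as a 1-D Series so SVR does not warn about a column-vector y
y_cn = features_df_dropna['density'].copy()
# Note: CIF_id is not in the drop list, so it stays in X_cn as a feature
X_cn = features_df_dropna.drop(columns=['density','cif_file','mpid','formula','Compound','Compound_x','A_x','B_x','Compound_y','A_y','B_y']).copy()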


from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X_train_cn, X_test_cn, y_train_cn, y_test_cn = train_test_split(X_cn, y_cn, test_size=0.33, random_state=RNG_SEED)
svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train_cn, y_train_cn)

y_pred_cn = svr.predict(X_test_cn)
r2 = r2_score(y_test_cn, y_pred_cn)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test_cn, y_pred_cn)
print('the mean absolute error is',mae)


the r2 score is 0.8338439171696592
the mean absolute error is 1.1292359041264897




In [33]:
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize

scaler = StandardScaler()   # Set up the scaler as the StandardScaler


# Split data into train and test sets
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train = scaler.fit_transform(X_train_unscaled)       # fit the scaler on the training split only
X_test = scaler.transform(X_test_unscaled)             # reuse the training statistics on the test split

# Normalize each sample (row) of the X datasets to unit L2 norm
X_train = normalize(X_train)
X_test = normalize(X_test)

# Feature selection with Lasso
lasso = Lasso(alpha=0.001)  # You can adjust the regularization strength
lasso.fit(X_train, y_train)
selected_features = np.where(lasso.coef_ != 0)[0]  # Indices of selected features

# Use selected features to train SVR model
svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train[:, selected_features], y_train)

# Predict target values using SVR model
y_pred = svr.predict(X_test[:, selected_features])

# Evaluate SVR model performance
r2 = r2_score(y_test, y_pred)
print('R2 score:', r2)

mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error:', mae)


R2 score: 0.9932120519360371
Mean Absolute Error: 0.21225614785733807




Results for different Lasso `alpha` values:

| alpha | R2 | MAE |
|---|---|---|
| 0.2 | 0.971 | 0.4584 |
| 0.1 | 0.971 | 0.4584 |
| 0.01 | 0.985 | 0.3032 |
| 0.001 | 0.993 | 0.2123 |

At alpha values of 0.3 and higher, the run fails because Lasso selects 0 features.
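The failure at large alpha happens because every Lasso coefficient is driven to zero, which leaves the SVR with an empty feature matrix. A small guard can turn that into a clear message before the SVR is fitted. This is a sketch; `selected_feature_indices` is a hypothetical helper that would wrap the `np.where(lasso.coef_ != 0)` step.

```python
import numpy as np


def selected_feature_indices(coef, tol=0.0):
    """Indices of coefficients with magnitude above tol; error if none survive."""
    idx = np.flatnonzero(np.abs(np.asarray(coef, dtype=float)) > tol)
    if idx.size == 0:
        raise ValueError("Lasso removed every feature; try a smaller alpha.")
    return idx


print(selected_feature_indices([0.0, 0.5, 0.0, -0.2]))
```

In the cell above it would be called as `selected_features = selected_feature_indices(lasso.coef_)`.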

It is unclear whether we are overfitting as alpha decreases.

The code should be modified to perform cross-validation to make sure we don't overfit. A search method should also be used to optimize the Lasso regularization strength and the SVR hyperparameters.
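The cross-validation and hyperparameter search suggested here could look roughly like the following, using scikit-learn's `Pipeline` and `GridSearchCV`. This is a sketch on synthetic stand-in data, not a result for the homework data: the grid values are arbitrary starting points, `SelectFromModel` is one of several ways to put Lasso selection inside a pipeline, and `X_demo` and `y_demo` would be replaced by the real `X` and `y`.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 60 samples, 5 features, 2 informative
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 5))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=60)

# Scaling, Lasso-based feature selection, and SVR are refit inside every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(max_iter=10000))),
    ("svr", SVR(kernel="rbf")),
])

# Arbitrary example grid; widen or refine it for real use
param_grid = {
    "select__estimator__alpha": [0.001, 0.01],
    "svr__C": [10, 100],
    "svr__gamma": [0.01, 0.1],
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=cv)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_  # scores are negated MAE, so flip the sign
print("best parameters:", search.best_params_)
print("cross-validated MAE:", best_mae)
```

Because the scaler and the Lasso selection sit inside the pipeline, they are fitted only on the training part of each fold, so the cross-validated score is not inflated by information from the held-out part. Running the same pipeline on both the CBFV features and the structural features would also put the two feature sets on equal footing.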