# Perovskite Material Analysis and Band Gap Prediction

This notebook documents the workflow for analyzing perovskite material compositions and predicting their band gaps using machine learning models. The steps include data cleaning, feature extraction, and model training with Ridge Regression and Random Forest Regressor.

## Data Cleaning and Preparation

In this section, we load a sample dataset and perform data cleaning and preparation. The steps include:
1. Loading the dataset into a pandas DataFrame.
2. Cleaning and normalizing molecule names.
3. Converting coefficients to floats.
4. Extracting ions and coefficients into dictionaries.
5. Identifying unique molecules and creating new columns for each unique molecule.
6. Calculating the proportions for each molecule and assigning them to the corresponding columns.
7. Dropping the original ion and coefficient columns as they are no longer needed.
8. Saving the cleaned and modified DataFrame to a new CSV file.
```

In [5]:


# Load the CSV file
file_path = 'perovsite_database_query.csv'  # Change this to your actual file path
data = pd.read_csv(file_path)

# Columns to keep from the dataset
columns_to_keep = [
    'JV_default_Voc', 'JV_default_Jsc', 'JV_default_FF', 'JV_default_PCE', 
    'Perovskite_band_gap', 'Perovskite_composition_a_ions', 'Perovskite_composition_a_ions_coefficients',
    'Perovskite_composition_b_ions', 'Perovskite_composition_b_ions_coefficients',
    'Perovskite_composition_c_ions', 'Perovskite_composition_c_ions_coefficients'
]
data_cleaned = data[columns_to_keep]

# Drop rows with NaN values in important columns
data_cleaned.dropna(subset=[
    'Perovskite_composition_a_ions', 'Perovskite_composition_a_ions_coefficients',
    'Perovskite_composition_b_ions', 'Perovskite_composition_b_ions_coefficients',
    'Perovskite_composition_c_ions', 'Perovskite_composition_c_ions_coefficients'
], inplace=True)

# Remove rows where 'Perovskite_band_gap' contains '|'
data_cleaned = data_cleaned[~data_cleaned['Perovskite_band_gap'].str.contains('|', na=False)]

# Reset index after dropping rows
data_cleaned.reset_index(drop=True, inplace=True)

# Function to clean and normalize molecule names
def clean_molecule_name(name):
    # Remove leading/trailing spaces and normalize space
    name = name.strip()
    
    # Allow alphanumeric characters and hyphens; also preserve names in parentheses
    name = re.sub(r'[^a-zA-Z0-9\s\-()]+', ' ', name)

    # Replace multiple spaces with a single space and trim again
    name = re.sub(r'\s+', ' ', name).strip()
    
    # Split the string into individual molecule components based on spaces
    elements = name.split()

    # Include valid molecule names, allowing for bracketed entries
    elements = [element for element in elements if element and 
                not element.replace('.', '', 1).isdigit()]

    return elements

# Function to clean and convert coefficients to floats
def clean_and_convert_coefficient(coefficient):
    try:
        # Remove non-numeric characters except periods and minus signs
        cleaned_coefficient = re.sub(r'[^0-9.-]', '', coefficient)
        return float(cleaned_coefficient)
    except ValueError:
        return 0.0  # Default to 0.0 if conversion fails

# Function to split ions and coefficients into dictionaries
def extract_ions_and_coefficients(ions_column, coefficients_column):
    # Split ions and coefficients by ';' and '|'
    ions = re.split(r'[;|]', ions_column)
    coefficients = re.split(r'[;|]', coefficients_column)
    
    # Clean and convert to floats
    coefficients = [clean_and_convert_coefficient(c) for c in coefficients]
    
    # Clean ion names and split into individual elements
    all_ions = []
    for ion in ions:
        cleaned_ions = clean_molecule_name(ion)  # Clean and split the ion names
        all_ions.extend(cleaned_ions)            # Add each individual element
    
    return all_ions, coefficients

# Step 1: Identify unique molecules
unique_molecules = set()

# Go through each row to identify unique ions
for index, row in data_cleaned.iterrows():
    for column_group in ['a', 'b', 'c']:
        ions_column = f'Perovskite_composition_{column_group}_ions'
        coefficients_column = f'Perovskite_composition_{column_group}_ions_coefficients'
        
        ions = str(row[ions_column]).split(';')
        all_ions = []
        for ion in ions:
            cleaned_ions = clean_molecule_name(ion)
            all_ions.extend(cleaned_ions)
        unique_molecules.update(all_ions)

# Remove any 'nan' from unique molecules set
unique_molecules.discard('nan')

# Step 2: Create new columns for each unique molecule
for molecule in unique_molecules:
    data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0

# Step 3: Calculate the proportions for each molecule
for index, row in data_cleaned.iterrows():
    # Iterate over the ion columns and their coefficients
    for column_group in ['a', 'b', 'c']:
        ions_column = f'Perovskite_composition_{column_group}_ions'
        coefficients_column = f'Perovskite_composition_{column_group}_ions_coefficients'
        
        ions, coefficients = extract_ions_and_coefficients(str(row[ions_column]), str(row[coefficients_column]))

        total_coeff = sum(coefficients) if sum(coefficients) != 0 else 1  # Avoid division by zero
        
        # Calculate proportion and assign to the corresponding molecule columns
        for ion, coeff in zip(ions, coefficients):
            data_cleaned.at[index, ion] += coeff / total_coeff  # Add the proportion to the column

# Step 4: Drop the original ion and coefficient columns as they are no longer needed
columns_to_drop = [
    'Perovskite_composition_a_ions', 'Perovskite_composition_a_ions_coefficients',
    'Perovskite_composition_b_ions', 'Perovskite_composition_b_ions_coefficients',
    'Perovskite_composition_c_ions', 'Perovskite_composition_c_ions_coefficients'
]
data_cleaned.drop(columns=columns_to_drop, inplace=True)
# Remove brackets from molecule names in the unique molecules set
unique_molecules_cleaned = {molecule.strip('()') for molecule in unique_molecules}

# Update the output to show the cleaned unique molecules
print("Cleaned Unique molecules identified:", unique_molecules_cleaned)
# Save the modified dataframe to a new CSV
output_file_path = 'modified_file.csv'
data_cleaned.to_csv(output_file_path, index=False)

print("CSV file modified and saved as:", output_file_path)
print("Unique molecules identified:", unique_molecules)


  data = pd.read_csv(file_path)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_cleaned.dropna(subset=[
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_cleaned[molecule] = 0.0  # Initialize columns for each molecule with 0.0
  data_clea

Cleaned Unique molecules identified: {'C8H17NH3', '5-AVAI', 'PA', 'IA', 'ALA', 'EPA', 'PGA', 'HDA', 'pF1PEA', 'Y', 'Cl-PEA', 'TFEA', 'NH4', 'PF6', 'ThMA', 'Cs', 'HTAB', 'Cl', 'Mn', 'MIC3', 'PBA', 'Sr', 'EU-pyP', 'CHMA', 'DI', 'Co', 'Sm', 'BU', 'iso-BA', 'f-PEA', 'I', 'PR', 'ODA', 'C6H4NH2', 'GABA', 'PyEA', '6-ACA', 'HdA', 'DA', 'mFPEA', 'n-C3H7NH3', 'PN', 'PMA', '4AMPY', '3-Pr(NH3)2', 'O', 'mF1PEA', 'MIC1', 'TEA', 'BI', 'Bn', 'DPA', 'oF1PEA', 'BdA', 'Li', 'EDA', 'TMA', 'C4H9N2H6', 'PTA', 'Au', 'EA', 'PDMA', 'DAT', 'PDA', 'HEA', 'DAP', '4ApyH', 'Fe', 'TA', 'Zn', 'Br', 'A43', 'GA', 'TN', 'Ag', 'APMim', 'F3EA', '3AMP', 'HAD', 'CH3ND3', 'Eu', 'NMA', 'MTEA', 'AN', 'FPEA', 'Mg', 'IM', '4FPEA', '1', 'MA', 'BF4', 'K', 'Bi', 'BZA', 'Sb', 'BzDA', 'Aa', 'PEI', 'TBA', 'Na', 'Ge', 'F-PEA', 'CA', 'iPA', 'F5PEA', 'BA', 'GU', 'DMA', 'FEA', 'Te', 'Ba', 'SCN', '5-AVA', 'CH3)3S', 'La', 'OA', 'oFPEA', 'Ti', 'Ace', '3AMPY', 'NEA', 'Pb', 'PEA', 'Ca', 'BEA', 'PPEA', 'CIEA', 'C6H13NH3', 'OdA', 'Anyl', 'H-PEA'

## Training and testing models

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('modified_file.csv')

# Remove rows with NaN values (including np.nan)
data = data.dropna()  # This drops rows with any NaN values

# Prepare features and targets
X = data.drop(['JV_default_Voc', 'JV_default_Jsc', 'JV_default_FF', 'JV_default_PCE', 'Perovskite_band_gap'], axis=1)
y_band_gap = data['Perovskite_band_gap']
y_pce = data['JV_default_PCE']
y_voc = data['JV_default_Voc']
y_jsc = data['JV_default_Jsc']
y_ff = data['JV_default_FF']

# Split the data
X_train, X_test, y_band_gap_train, y_band_gap_test, y_pce_train, y_pce_test, \
y_voc_train, y_voc_test, y_jsc_train, y_jsc_test, y_ff_train, y_ff_test = \
    train_test_split(X, y_band_gap, y_pce, y_voc, y_jsc, y_ff, test_size=0.2, random_state=42)

# Initialize models
models = {
    'LR': LinearRegression(),
    'RR': Ridge(),
    'KNN': KNeighborsRegressor(),
    'RF': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

# Function to evaluate models
def evaluate_model(y_true, y_pred):
    r_value = np.sqrt(r2_score(y_true, y_pred))
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return r_value, rmse

# Train and evaluate models
results = {model_name: {'Band gap prediction': [], 'PCE direct prediction': [], 'PCE calculated': []}
           for model_name in models.keys()}

for model_name, model in models.items():
    # Band gap prediction
    model.fit(X_train, y_band_gap_train)
    y_band_gap_pred = model.predict(X_test)
    r_value, rmse = evaluate_model(y_band_gap_test, y_band_gap_pred)
    results[model_name]['Band gap prediction'] = [r_value, rmse]

    # PCE direct prediction
    model.fit(X_train, y_pce_train)
    y_pce_pred = model.predict(X_test)
    r_value, rmse = evaluate_model(y_pce_test, y_pce_pred)
    results[model_name]['PCE direct prediction'] = [r_value, rmse]

    # Train models for Voc, Jsc, and FF
    model.fit(X_train, y_voc_train)
    y_voc_pred = model.predict(X_test)
    model.fit(X_train, y_jsc_train)
    y_jsc_pred = model.predict(X_test)
    model.fit(X_train, y_ff_train)
    y_ff_pred = model.predict(X_test)

    # Calculate PCE from predicted Voc, Jsc, and FF
    y_pce_calculated = y_voc_pred * y_jsc_pred * y_ff_pred / 100
    r_value, rmse = evaluate_model(y_pce_test, y_pce_calculated)
    results[model_name]['PCE calculated'] = [r_value, rmse]

# Create a DataFrame for the results
df_results = pd.DataFrame({
    'Models': list(models.keys()),
    'Band gap prediction r-Value': [results[model]['Band gap prediction'][0] for model in models],
    'Band gap prediction RMSE (eV)': [results[model]['Band gap prediction'][1] for model in models],
    'PCE direct prediction r-Value': [results[model]['PCE direct prediction'][0] for model in models],
    'PCE direct prediction RMSE (%)': [results[model]['PCE direct prediction'][1] for model in models],
    'PCE calculated r-Value': [results[model]['PCE calculated'][0] for model in models],
    'PCE calculated RMSE (%)': [results[model]['PCE calculated'][1] for model in models]
})

# Display the results
print(df_results)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))
ax.axis('off')

# Create the table
table = ax.table(cellText=df_results.values,
                 colLabels=df_results.columns,
                 cellLoc='center',
                 loc='center')

# Set table properties
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1.2, 1.5)

# Add title
plt.title("Table 1: ML model types with corresponding results", fontsize=12, fontweight='bold', pad=20)

# Save the figure
plt.tight_layout()
plt.savefig('table_1_results.png', dpi=300, bbox_inches='tight')
plt.close()

print("Table 1 with results has been generated and saved as 'table_1_results.png'")

  r_value = np.sqrt(r2_score(y_true, y_pred))
  r_value = np.sqrt(r2_score(y_true, y_pred))
  r_value = np.sqrt(r2_score(y_true, y_pred))
  r_value = np.sqrt(r2_score(y_true, y_pred))
  r_value = np.sqrt(r2_score(y_true, y_pred))


    Models  Band gap prediction r-Value  Band gap prediction RMSE (eV)  \
0       LR                     0.883697                       0.054775   
1       RR                     0.889709                       0.053425   
2      KNN                     0.912604                       0.047845   
3       RF                     0.929616                       0.043127   
4  XGBoost                     0.931828                       0.042468   

   PCE direct prediction r-Value  PCE direct prediction RMSE (%)  \
0                       0.354416                        4.673366   
1                       0.377099                        4.628812   
2                       0.344558                        4.691744   
3                       0.421541                        4.532036   
4                       0.421552                        4.532011   

   PCE calculated r-Value  PCE calculated RMSE (%)  
0                     NaN                12.532525  
1                     NaN               

## Feature importance

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('modified_file.csv')
# Remove rows with NaN values (including np.nan)
data = data.dropna()  # This drops rows with any NaN values

# Prepare features and target
X = data.drop(['JV_default_Voc', 'JV_default_Jsc', 'JV_default_FF', 'JV_default_PCE', 'Perovskite_band_gap'], axis=1)
y = data['Perovskite_band_gap']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Ridge Regression model
rr_model = Ridge()
rr_model.fit(X_train, y_train)

# Train Random Forest model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Get feature importance for Ridge Regression
rr_importance = np.abs(rr_model.coef_)
rr_importance = rr_importance / np.sum(rr_importance)
rr_feature_importance = pd.DataFrame({'feature': X.columns, 'importance': rr_importance})
rr_feature_importance = rr_feature_importance.sort_values('importance', ascending=False).head(8)

# Get feature importance for Random Forest
rf_importance = rf_model.feature_importances_
rf_feature_importance = pd.DataFrame({'feature': X.columns, 'importance': rf_importance})
rf_feature_importance = rf_feature_importance.sort_values('importance', ascending=False).head(8)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot for RR model
ax1.barh(rr_feature_importance['feature'], rr_feature_importance['importance'], color='steelblue', height=0.6)
ax1.set_xlabel('Impact')
ax1.set_title('(a) RR model')
ax1.invert_yaxis()  # Invert y-axis to match the image

# Plot for RF model
ax2.barh(rf_feature_importance['feature'], rf_feature_importance['importance'], color='steelblue', height=0.6)
ax2.set_xlabel('Impact')
ax2.set_title('(b) RF model')
ax2.invert_yaxis()  # Invert y-axis to match the image

# Add overall title
fig.suptitle('Fig. 4 Impact of different materials on band gap prediction', fontsize=12, fontweight='bold')

# Adjust layout and save
plt.tight_layout()
plt.savefig('figure_4_results.png', dpi=300, bbox_inches='tight')
plt.close()

print("Figure 4 with results has been generated and saved as 'figure_4_results.png'")

# Print feature importances
print("Ridge Regression Feature Importance:")
print(rr_feature_importance)
print("\nRandom Forest Feature Importance:")
print(rf_feature_importance)

Figure 4 with results has been generated and saved as 'figure_4_results.png'
Ridge Regression Feature Importance:
         feature  importance
75      (N-EtPy)    0.063773
68      (oF1PEA)    0.056431
21            Cl    0.046634
96            IM    0.040960
105           Sb    0.040661
122  (n-C3H7NH3)    0.038947
158           Ni    0.034778
116           GU    0.034506

Random Forest Feature Importance:
      feature  importance
32          I    0.559334
140        Sn    0.110543
132        Pb    0.096258
82         Br    0.066894
152        FA    0.027970
19         Cs    0.025341
149        HA    0.016671
68   (oF1PEA)    0.012233
