# Polymer Properties Model Refinement
Data and methodology taken from: Estimation and Prediction of the Polymers’ Physical Characteristics Using the Machine Learning Models Polymers 2024, 16(1), 115; https://doi.org/10.3390/polym16010115.

Github repository: https://github.com/catauggie/polymersML/tree/main

The goal of this notebook is to begin refining the linear regression model using different ML techniques.

In [None]:
# Import the data into a DataFrame and confirm it loaded properly by listing the first few rows
import pandas as pd
new_df = pd.read_excel('Tg_Data_Frame.xlsx')
new_df.head(3)

# Tuning the model

## Function Definitions

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score, median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression



def evaluate_model_performance(y_test, y_pred):
    """
    Evaluate the performance of a model using various metrics.
    
    Parameters:
    - y_test: array-like of shape (n_samples,) or (n_samples, n_outputs), 
              True target values.
    - y_pred: array-like of shape (n_samples,) or (n_samples, n_outputs),
              Estimated target values.
    """
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    variance_score = explained_variance_score(y_test, y_pred)
    medae = median_absolute_error(y_test, y_pred)

    print(f'Mean Squared Error: {mse:.3f}')
    print(f'R-squared: {r2:.3f}')
    print(f'Mean Absolute Error: {mae:.3f}')
    print(f'Explained Variance Score: {variance_score:.3f}')
    print(f'Median Absolute Error: {medae:.3f}')

# Example usage:
# y_test = [actual values]
# y_pred = [predicted values]
# evaluate_model_performance(y_test, y_pred)


def plot_actual_vs_predicted(y_test, y_pred):
    """
    Plot the actual vs. predicted values to evaluate a model's performance.
    
    Parameters:
    - y_test: array-like, True values.
    - y_pred: array-like, Predicted values.
    """
    plt.figure(figsize=(10, 6))  # Set the figure size for better readability

    # Scatter plot for actual vs. predicted values
    plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted vs. Actual')

    # Ideal line for perfect predictions
    max_val = max(max(y_test), max(y_pred))  # Find the maximum value for setting plot limits
    min_val = min(min(y_test), min(y_pred))  # Find the minimum value for setting plot limits
    plt.plot([min_val, max_val], [min_val, max_val], 'k--', lw=2, label='Ideal Fit')

    # Customization and labels
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Model Predictions vs. Actual Data')
    plt.legend()
    plt.grid(True)

    # Show plot
    plt.show()

# Example usage:
# y_test = [actual values]
# y_pred = [predicted values]
# plot_actual_vs_predicted(y_test, y_pred)

def bland_altman_plot(y_test, y_pred):
    """
    Generate a Bland-Altman plot to assess the agreement between two sets of measurements.
    
    Parameters:
    - y_test: array-like, true values.
    - y_pred: array-like, predicted values.
    """
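    # Convert to NumPy arrays so the element-wise arithmetic also works for plain lists
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    avg = (y_test + y_pred) / 2
    diff = y_test - y_pred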

    plt.figure(figsize=(10, 6))
    plt.scatter(avg, diff, alpha=0.5)
    plt.axhline(y=np.mean(diff), color='r', linestyle='--', label='Mean Difference')
    plt.axhline(y=np.mean(diff) + 1.96 * np.std(diff), color='g', linestyle='--', label='Upper Limit of Agreement')
    plt.axhline(y=np.mean(diff) - 1.96 * np.std(diff), color='g', linestyle='--', label='Lower Limit of Agreement')
    plt.xlabel('Average of Actual and Predicted Values')
    plt.ylabel('Difference Between Actual and Predicted Values')
    plt.title('Bland-Altman Plot')
    plt.legend()
    plt.show()

# Example usage:
# y_test = [actual values]
# y_pred = [predicted values]
# bland_altman_plot(y_test, y_pred)


## Random Forest Regressor

The RandomForestRegressor is a popular ensemble learning algorithm based on the random forest method. It constructs many decision trees at training time, by default each on a bootstrap sample of the data, and outputs the mean of the individual trees' predictions. It is used for regression tasks, where the goal is to predict a continuous outcome variable.
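The averaging step can be checked directly. The sketch below is a minimal illustration on synthetic data (the arrays `X_demo` and `y_demo` are made up for this example and are not the notebook's dataset). It compares the forest's prediction with the mean of the predictions of the fitted trees stored in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the notebook's Tg dataset)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 8))
y_demo = 50 * X_demo[:, 0] + 20 * X_demo[:, 1] ** 2 + rng.normal(0, 1, 200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with each fitted tree, then average across trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction should agree with the manual average
forest_pred = forest.predict(X_demo)
print(np.allclose(manual_mean, forest_pred))
```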

Below, I'll outline the pros and cons of using a RandomForestRegressor, its applications, and its potential use in predicting glass transition temperatures from molecular fingerprints.

### Pros of RandomForestRegressor
* **Accuracy**: RandomForestRegressor is known for providing high accuracy in many prediction tasks due to its ensemble approach, which reduces overfitting by averaging the results of numerous trees.
* **Robustness**: It can handle outliers and nonlinear data effectively, making it robust across various datasets.
* **Feature Importance**: It inherently provides insights into feature importance, which can be valuable for understanding the factors influencing the prediction.
* **Versatility**: Can be used for both regression and classification tasks, making it applicable to a wide range of problems.
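* **Handling Missing Values**: Tree-based methods can accommodate missing values with little pre-processing, although support depends on the implementation. In scikit-learn, `RandomForestRegressor` accepts `NaN` inputs only in recent releases (1.4 and later); older releases require imputation first.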
* **Parallelizable**: The algorithm can be easily parallelized across multiple CPUs for faster processing, which is beneficial for dealing with large datasets.
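Two of the points above, feature importance and parallel training, can be sketched in a few lines. The example assumes synthetic binary features standing in for fingerprint bits (`X_bits` and `y_bits` are illustrative names, not columns from the notebook's data), with a target that depends mostly on the first bit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Binary features standing in for fingerprint bits (illustrative only)
X_bits = rng.integers(0, 2, size=(300, 6))
# Target depends strongly on bit 0, weakly on bit 1, and not on the rest
y_bits = 100 * X_bits[:, 0] + 10 * X_bits[:, 1] + rng.normal(0, 1, 300)

# n_jobs=-1 asks scikit-learn to build the trees on all available CPU cores
forest = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=1)
forest.fit(X_bits, y_bits)

# Impurity-based importances are normalized to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f'bit {idx}: {importances[idx]:.3f}')
```

Impurity-based importances are computed from the training data, so they are best read as a rough ranking rather than a definitive measure of influence.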

### Cons of RandomForestRegressor
* **Complexity**: It creates numerous trees (which can be computationally intensive) and requires more memory and processing power, especially as the number of trees increases.
* **Interpretability**: While individual decision trees are interpretable, the ensemble nature of RandomForest makes it more challenging to interpret the model's predictions directly.
* **Long Training Time**: For large datasets or a large number of trees, the training time can be significantly longer compared to simpler models.
* **Overfitting with Noisy Data**: Despite its robustness to overfitting, in cases of extremely noisy data, the model can still overfit.
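The overfitting risk in the last point can be illustrated with a deliberately extreme case. In this sketch the target is pure noise (an assumption made only for illustration), so any apparent fit on the training split is memorization; comparing R-squared on the training and held-out splits exposes the gap.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_noise = rng.random((400, 10))
y_noise = rng.normal(0, 1, 400)  # pure noise: no relationship with the features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=2
)
forest = RandomForestRegressor(n_estimators=100, random_state=2).fit(X_tr, y_tr)

# A large gap between the two scores signals overfitting
train_r2 = r2_score(y_tr, forest.predict(X_tr))
test_r2 = r2_score(y_te, forest.predict(X_te))
print(f'train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}')
```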

### Applications
RandomForestRegressor is widely applicable in various domains, including:

* **Financial Market Analysis**: For predicting stock prices, market trends, and risk assessment.
* **Healthcare**: In predicting disease outbreak, patient prognosis, and treatment effectiveness.
* **Environmental Modeling**: For forecasting weather patterns, air quality, and climate change effects.
* **Real Estate**: In estimating property values based on numerous features.
* **Retail**: For demand forecasting and inventory management.

### Predicting Glass Transition Temperatures
RandomForestRegressor can be particularly suited for this task due to its ability to handle complex, non-linear relationships between features (in this case, molecular fingerprints) and the target variable (glass transition temperatures). The model can capture the intricate patterns and dependencies in the molecular structure that influence the glass transition temperature.

The molecular fingerprints, which represent the presence or absence of certain molecular structures and properties, serve as input features for the RandomForestRegressor. By training on these features along with known glass transition temperatures, the model can learn the relationship between the molecular structure and the transition temperature, thereby enabling it to predict the glass transition temperatures of new compounds.

This application leverages the RandomForestRegressor's strengths in handling complex, high-dimensional data and its robustness to variations in the data, making it a promising approach for predicting properties of materials based on their molecular composition. However, success in this specific application would also depend on the quality and representativeness of the training data, the selection of relevant features from the molecular fingerprints, and the tuning of the model's hyperparameters to optimize performance.
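One possible way to approach the hyperparameter tuning mentioned above is a cross-validated grid search. The sketch below is only an illustration: the data are synthetic binary features, and the grid values are arbitrary choices, not settings taken from the paper or the repository.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
# Synthetic fingerprint-like bits and target (illustrative only)
X_demo = rng.integers(0, 2, size=(150, 12))
y_demo = 80 * X_demo[:, 0] - 40 * X_demo[:, 3] + rng.normal(0, 5, 150)

# Hypothetical grid: the values are placeholders to be adapted to the data
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
    'max_features': ['sqrt', 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=3),
    param_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'best cross-validated MSE: {-search.best_score_:.2f}')
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by the training split, keeping the test split untouched until the final evaluation.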


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 'new_df' is the DataFrame loaded at the top of the notebook;
# the feature columns are those whose names contain 'col'
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
# (test_size=0.7 holds out 70% of the rows for testing, leaving 30% for training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=42)

# Initialize the Random Forest Regressor model
model = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## Gradient Boosting Regressor
The GradientBoostingRegressor is another powerful machine learning algorithm; it uses the boosting strategy to improve prediction accuracy. It builds an ensemble of weak prediction models, typically decision trees, sequentially, with each tree trying to correct the errors made by the previous ones. Over successive iterations this turns weak learners into a strong one while optimizing a loss function. Below are the pros and cons of using GradientBoostingRegressor, its applications, and its potential use in predicting glass transition temperatures from molecular fingerprints.
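The sequential error-correcting idea can be sketched by hand. This is a simplified illustration for squared-error loss, where the quantity each new tree fits is just the current residual; it is not scikit-learn's actual implementation. The data, `learning_rate`, and `n_rounds` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X_demo = rng.random((200, 5))
y_demo = np.sin(6 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(0, 0.1, 200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction          # errors of the current ensemble
    weak = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak.fit(X_demo, residuals)              # the next tree targets those errors
    prediction += learning_rate * weak.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f'training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}')
```

Each shallow tree is a weak learner on its own; the shrinking training error comes from adding many small corrections.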

### Pros of GradientBoostingRegressor
* **High Accuracy**: Gradient boosting is capable of producing highly accurate models by systematically addressing errors of previous models through optimization.
* **Flexibility**: It is adaptable to different loss functions and can be applied to a wide range of regression and classification tasks. Note that scikit-learn's `GradientBoostingRegressor` expects numerical input, so categorical features must be encoded first.
* **Handling Non-linear Data**: Due to its sequential approach in correcting errors, it can model complex non-linear relationships effectively.
* **Feature Importance**: Similar to RandomForest, gradient boosting provides insights into which features are most important for making predictions.
* **Overfitting Control**: Offers several hyperparameters (like the number of trees, depth of trees, learning rate) that can be fine-tuned to prevent overfitting.

### Cons of GradientBoostingRegressor
* **Training Time**: The sequential nature of boosting means it can be slower to train compared to models that allow parallelization, such as random forests.
* **Complexity and Tuning**: Requires careful tuning of hyperparameters to achieve the best performance without overfitting. This process can be time-consuming and complex.
* **Memory Usage**: Can consume more memory than simpler models, especially as the number of trees and depth increases.
* **Risk of Overfitting**: If not properly tuned, especially with too many trees or too deep trees, it can overfit the training data.
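One way to see how the number of trees affects overfitting is to track held-out error after each boosting stage. The sketch below uses `staged_predict` from scikit-learn's `GradientBoostingRegressor` on synthetic data; the data and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X_demo = rng.random((300, 6))
y_demo = 10 * X_demo[:, 0] + 5 * np.sin(8 * X_demo[:, 1]) + rng.normal(0, 1, 300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5
)

gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=5
)
gbr.fit(X_tr, y_tr)

# staged_predict yields the ensemble prediction after each boosting stage
held_out_mse = [
    mean_squared_error(y_te, stage_pred) for stage_pred in gbr.staged_predict(X_te)
]
best_n = int(np.argmin(held_out_mse)) + 1
print(f'lowest held-out MSE at {best_n} trees: {held_out_mse[best_n - 1]:.3f}')
```

In practice the stage count would be chosen on a separate validation split or by cross-validation, so that the final test set stays unbiased.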


### Applications
Gradient Boosting Regressor finds its application in diverse fields, including but not limited to:

* **Finance**: For credit scoring, risk management, and algorithmic trading strategies.
* **Healthcare**: In disease prediction, personalized medicine, and healthcare resource optimization.
* **Energy**: Forecasting electricity demand, renewable energy output predictions, and price forecasting.
* **Retail and E-commerce**: For customer lifetime value prediction, sales forecasting, and inventory management.
* **Real Estate**: Predicting house prices based on various features like location, size, and amenities.

### Predicting Glass Transition Temperatures
The prediction of glass transition temperatures from molecular fingerprints is a nuanced task that can benefit from the accuracy and flexibility of the Gradient Boosting Regressor. This method is useful for capturing the complex, non-linear relationships between molecular structure (encoded as fingerprints) and physical properties such as the glass transition temperature.

Molecular fingerprints, which encode the presence of particular molecular features, serve as the input to the model. The Gradient Boosting Regressor can iteratively learn from the subtle nuances in how these features correlate with the glass transition temperatures. By focusing on minimizing the prediction error in each step, it can effectively predict the glass transition temperatures for new, unseen molecules.

This application benefits from the model's ability to handle complex datasets and its robustness against overfitting (when properly tuned), making it a potent tool in the field of materials science and chemistry. The success of GradientBoostingRegressor in predicting glass transition temperatures will largely depend on the quality of the molecular fingerprint data, the representativeness of the training dataset, and the optimization of model parameters to the specific characteristics of the data.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 'new_df' is the DataFrame loaded at the top of the notebook
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
# (test_size=0.8 holds out 80% of the rows for testing, leaving 20% for training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)

# Initialize the Gradient Boosting Regressor model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## Support Vector Regressor

In [None]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the Support Vector Regressor model
model = SVR(kernel='linear')  # You can experiment with different kernels: 'linear', 'poly', 'rbf', etc.

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)
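The kernel comparison suggested in the comment above could be run with cross-validation rather than a single split. The following sketch uses synthetic data and default `SVR` settings (both assumptions made for illustration) and reports the mean R-squared for each kernel.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X_demo = rng.random((150, 5))
y_demo = np.sin(6 * X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, 150)

# Score each kernel with 5-fold cross-validation
kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    scores = cross_val_score(SVR(kernel=kernel), X_demo, y_demo, cv=5, scoring='r2')
    kernel_scores[kernel] = scores.mean()
    print(f'{kernel}: mean R^2 = {kernel_scores[kernel]:.3f}')
```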

## Lasso Regression

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the Lasso Regression model
model = Lasso(alpha=0.1)  # 'alpha' is the regularization strength, you can adjust it based on your data

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## Elastic Net

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the Elastic Net model
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
# 'alpha' is the regularization strength, and 'l1_ratio' controls the mix between L1 and L2 regularization.
# You can adjust these parameters based on your data.

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)


## K Neighbors Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the KNN Regressor model
model = KNeighborsRegressor(n_neighbors=5)  # 'n_neighbors' is the number of neighbors to consider, you can adjust it

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

In [None]:
def normalized_mse(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    var_y = np.var(y_true)
    nmse = mse / var_y
    return nmse

def mean_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    
    # Avoid division by zero
    non_zero_indices = y_true != 0
    y_true_non_zero = y_true[non_zero_indices]
    y_pred_non_zero = y_pred[non_zero_indices]
    
    # Calculate MPE, reported here as the absolute value of the mean signed percentage error
    mpe = abs(np.mean((y_true_non_zero - y_pred_non_zero) / y_true_non_zero) * 100)
    return mpe

def root_mean_squared_error(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    return rmse


# These metrics use y_test and y_pred from the most recently run model cell
rmse_result = root_mean_squared_error(y_test, y_pred)
nmse_result = normalized_mse(y_test, y_pred)
mae_result = mean_absolute_error(y_test, y_pred)
mpe_result = mean_percentage_error(y_test, y_pred)

print(f'Root Mean Squared Error: {rmse_result}')
print(f'Mean Absolute Error: {mae_result}')
print(f'Mean Percentage Error: {mpe_result}')
print(f'Normalized Mean Squared Error: {nmse_result}')

## Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the Decision Tree Regressor model
model = DecisionTreeRegressor(max_depth=5)  # 'max_depth' controls the maximum depth of the tree, you can adjust it

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## Decision Tree & Bagging Regressor

In [None]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the base Decision Tree Regressor model
base_model = DecisionTreeRegressor(max_depth=5)  # You can adjust 'max_depth' based on your data

# Initialize the Bagging Regressor model
model = BaggingRegressor(base_model, n_estimators=10, random_state=42)
# 'n_estimators' is the number of base models (Decision Trees) in the ensemble, you can adjust it

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)


evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## AdaBoost and Decision Tree

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the base Decision Tree Regressor model
base_model = DecisionTreeRegressor(max_depth=5)  # You can adjust 'max_depth' based on your data

# Initialize the AdaBoost Regressor model
model = AdaBoostRegressor(base_model, n_estimators=50, learning_rate=0.1, random_state=42)
# 'n_estimators' is the number of base models (Decision Trees) in the ensemble, and 'learning_rate' scales the contribution of each model

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)


evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)

## XGBoost

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# 'new_df' is the DataFrame loaded at the top of the notebook
# X holds the features and y the target variable
cols = [c for c in new_df.columns if 'col' in c]
X = new_df[cols]
y = new_df['Glass transition temperature_value_median']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Initialize the XGBoost Regressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
# 'n_estimators' is the number of boosting rounds, 'learning_rate' scales the contribution of each tree, and 'max_depth' controls the maximum depth of each tree

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)
plot_actual_vs_predicted(y_test, y_pred)
bland_altman_plot(y_test, y_pred)