<a href="https://colab.research.google.com/github/christophergaughan/Bioinformatics-Code/blob/main/GNN_PubChem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Dictionary for Antibiotic Compounds Dataset

| **Field**            | **Description**                                                                                                                                   |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| **cid**              | The compound identifier (CID), a unique number assigned to the compound in the PubChem database.                                                  |
| **cmpdname**         | The name of the compound.                                                                                                                         |
| **cmpdsynonym**      | Synonyms for the compound. This can include alternate names, trade names, or common names.                                                        |
| **mw**               | The molecular weight of the compound (in grams per mole).                                                                                         |
| **mf**               | The molecular formula of the compound, representing the number and types of atoms.                                                                |
| **polararea**        | The topological polar surface area (TPSA) of the compound, in Å².                                                                                 |
| **xlogp**            | A measure of the hydrophobicity (logarithm of the partition coefficient between octanol and water) of the compound.                                |
| **heavycnt**         | The number of heavy atoms in the compound (atoms that are not hydrogen).                                                                          |
| **hbonddonor**       | The number of hydrogen bond donors in the compound, i.e. groups (typically N-H or O-H) that can donate a hydrogen to a hydrogen bond.             |
| **hbondacc**         | The number of hydrogen bond acceptors in the compound, i.e. electronegative atoms (typically N or O) that can accept a hydrogen bond.             |
| **rotbonds**         | The number of rotatable bonds in the compound, which indicates the flexibility of the molecule.                                                   |
| **IUPAC Name**       | The International Union of Pure and Applied Chemistry (IUPAC) name of the compound, following standardized chemical nomenclature.                 |
| **Isomeric SMILES**  | The Simplified Molecular Input Line Entry System (SMILES) string, representing the structure of the compound, including stereochemistry information. |
| **InChIKey**         | A hashed version of the InChI string, providing a unique identifier for the compound.                                                             |
| **InChI**            | The International Chemical Identifier (InChI), a textual representation of the compound’s structure.                                              |
| **annotation**       | Additional annotation about the compound, including therapeutic uses and classifications, such as `Anti-Infective Agent`, `Antiprotozoal Agent`, or `Anti-Bacterial Agents`. |


In [None]:
from google.colab import drive
import os

# Mount Google Drive
# drive.mount('/content/drive')

# Check if the file exists
file_path = '/content/drive/MyDrive/Colab Notebooks/PubChem_compound_text_Antibiotics.csv'
if os.path.exists(file_path):
    print("File exists and ready to load!")
else:
    print("File not found. Check the file path.")


In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv(file_path)

# Display the first few rows to verify
df.head()


In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Get basic statistics about the dataset
print("\nBasic statistics:")
print(df.describe())

# Display data types to ensure consistency
print("\nData types:")
print(df.dtypes)

# Check the first few rows to ensure everything is loaded correctly
df.head()


### Imputing Missing xlogp Values Using Predictive Modeling

#### What:
We have found that 17 values in the xlogp column are missing, and the variability in the existing values is quite high. Simple imputation (e.g., mean or median) could introduce bias, as the distribution of xlogp is broad and not concentrated around a single value. Therefore, we will use predictive modeling to estimate these missing values based on other available features in the dataset.

#### Why:
Predictive imputation uses the relationships between the missing values and other features to provide more accurate estimations. In this case, we can build a model to predict xlogp using the rest of the columns as inputs. This approach is more robust than simply filling missing values with a constant, as it takes into account the actual distribution and relationships in the data.

#### How:
1. We will first prepare the dataset, ensuring that no columns used for prediction have missing values themselves.
2. We'll split the dataset into two parts:
    - One with known xlogp values (to train the model).
    - One with missing xlogp values (to impute).
3. A regression model (e.g., RandomForestRegressor) will be trained on the known values to predict xlogp.
4. We will then use this model to predict the missing values and fill them back into the dataset.
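
The four steps above can be sketched on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the column names `feat_a`, `feat_b`, and `target`, the data-generating formula, and the held-out comparison against a median baseline are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical columns, not the PubChem file).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feat_a": rng.normal(size=300), "feat_b": rng.normal(size=300)})
toy["target"] = 3 * toy["feat_a"] - 2 * toy["feat_b"] + rng.normal(scale=0.1, size=300)
toy.loc[toy.index[:17], "target"] = np.nan  # pretend 17 values are missing

features = ["feat_a", "feat_b"]
missing_mask = toy["target"].isna()
known = toy[~missing_mask]

# Hold out part of the known rows to estimate how good the imputation is.
X_tr, X_val, y_tr, y_val = train_test_split(
    known[features], known["target"], test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
model_mae = mean_absolute_error(y_val, model.predict(X_val))

# Baseline: the error median imputation would make on the same held-out rows.
baseline_mae = mean_absolute_error(y_val, np.full(len(y_val), y_tr.median()))

# Refit on all known rows, then fill only the missing entries.
model.fit(known[features], known["target"])
toy.loc[missing_mask, "target"] = model.predict(toy.loc[missing_mask, features])

print(f"model MAE: {model_mae:.3f}, median-baseline MAE: {baseline_mae:.3f}")
print("remaining missing values:", int(toy["target"].isna().sum()))
```

Holding out some of the known rows is one way to check whether the model actually beats a constant fill before trusting its imputed values.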


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Load your dataset
file_path = '/content/drive/MyDrive/Colab Notebooks/PubChem_compound_text_Antibiotics.csv'
data = pd.read_csv(file_path)

# Drop rows where xlogp is missing for training, and separate the rows with missing xlogp for prediction
data_with_xlogp = data.dropna(subset=['xlogp'])
data_without_xlogp = data[data['xlogp'].isnull()]

# Selecting features to use for predicting xlogp (excluding 'xlogp' itself)
features = ['mw', 'polararea', 'heavycnt', 'hbonddonor', 'hbondacc', 'rotbonds', 'exactmass', 'monoisotopicmass']

# Training data
X_train = data_with_xlogp[features]
y_train = data_with_xlogp['xlogp']

# The rows without xlogp, which need to be predicted
X_pred = data_without_xlogp[features]

# Train a Random Forest Regressor model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict the missing xlogp values
predicted_xlogp = rf.predict(X_pred)

# Impute the missing xlogp values with the predicted values
data.loc[data['xlogp'].isnull(), 'xlogp'] = predicted_xlogp

# Check the imputation
print(data[['xlogp']].isnull().sum())  # This should print 0, meaning all missing values have been filled.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Features for the rows whose xlogp was originally missing, in the training column order
X_missing = data_without_xlogp[X_train.columns]

# Predict the xlogp values for the missing data
predicted_xlogp_missing = rf.predict(X_missing)

# Plot actual vs predicted xlogp values and highlight the imputed values
plt.figure(figsize=(8, 6))

# Actual vs predicted xlogp for the training rows (in blue).
# This is an in-sample fit, so it looks more accurate than held-out data would.
sns.scatterplot(x=y_train, y=rf.predict(X_train), label="Predicted (Train)", color="blue")

# Imputed rows have no actual xlogp (data_without_xlogp['xlogp'] is all NaN and
# would not be drawn), so place them on the diagonal at their predicted value.
sns.scatterplot(x=predicted_xlogp_missing, y=predicted_xlogp_missing,
                label="Imputed xlogp (Red Circles)", color="red", marker="o", s=100)

# Plot the perfect prediction line
plt.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()],
         color='green', linestyle='--')

plt.title('Actual vs Predicted xlogp Values (with Imputed Highlighted)')
plt.xlabel('Actual xlogp')
plt.ylabel('Predicted xlogp')
plt.legend()
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select only numeric columns for correlation matrix
numeric_data_with_xlogp = data_with_xlogp.select_dtypes(include='number')

# Calculate the correlation matrix
correlation_matrix = numeric_data_with_xlogp.corr()

# Create a heatmap of the correlation matrix
plt.figure(figsize=(12, 8))
heatmap = sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
heatmap.set_title('Correlation Matrix of Key Variables', fontdict={'fontsize': 12}, pad=12)

# Display the plot
plt.show()


# Principal Component Analysis (PCA) and Correlation Simplification

## Why Highly Correlated Variables Provide Redundant Information

In any dataset, when two or more variables are highly correlated, it means that they provide similar information. Correlation measures the linear relationship between variables, and a correlation coefficient close to +1 or -1 indicates that one variable can be approximately predicted from the other. In simpler terms, if two variables are highly correlated (either positively or negatively), they are essentially measuring the same aspect of the data.

### Positive Correlation (+1)
A correlation of +1 indicates a perfect positive linear relationship: as one variable increases, the other increases proportionally. The two variables are redundant because they move together in the same direction.

### Negative Correlation (-1)
A correlation of -1 indicates a perfect negative linear relationship: as one variable increases, the other decreases proportionally. The variables are inversely related but still provide overlapping information.
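
Both extremes can be reproduced with a few made-up numbers. The arrays below are illustrative only; any exact linear relationship would give the same coefficients.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0     # rises proportionally with x
y_neg = -3.0 * x + 10.0   # falls proportionally with x

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the x-y coefficient.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(round(r_pos, 6), round(r_neg, 6))
```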

### Why Redundant Variables Are a Problem
When variables are highly correlated, they introduce **multicollinearity** into statistical models like regression. Multicollinearity can cause problems because:
- **Unstable Coefficients**: The model may struggle to determine which variable is more important, leading to large standard errors in estimated coefficients.
- **Overfitting**: Including too many highly correlated variables increases the risk of overfitting the model to the training data, which may reduce its ability to generalize to new data.

## Purpose of Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that helps us simplify a dataset with many features (variables) into fewer dimensions, while preserving the essential information. By transforming correlated variables into a new set of uncorrelated variables called **principal components**, PCA allows us to:
1. **Reduce Redundancy**: By combining correlated variables into fewer components, we remove redundant information.
2. **Improve Model Performance**: Simplifying the dataset can help reduce the risk of overfitting and improve the stability of models.
3. **Visualize Data**: PCA also allows us to visualize high-dimensional data in a lower-dimensional space, helping to understand its structure.

## Why We Are Performing This Analysis

In our dataset, we observed that some variables are highly correlated with each other, indicating redundancy. To simplify the analysis, reduce multicollinearity, and create a more efficient model, we are applying PCA. Specifically, we aim to:
- **Reduce the number of features**: By using PCA, we can combine the information from correlated features into fewer principal components without losing critical information.
- **Identify key patterns**: PCA helps us identify the underlying structure of the data and the relationships between variables.
- **Prepare data for modeling**: After applying PCA, we will use the principal components as features in our predictive models, which may improve their stability by removing redundant information.

## Steps for Applying PCA

1. **Standardize the Data**: Before performing PCA, we need to standardize the data so that each feature has a mean of 0 and a standard deviation of 1. This is important because PCA is driven by variance, so features measured on larger scales (such as molecular weight) would otherwise dominate the components.
2. **Fit PCA**: We will fit PCA to the standardized data and compute the principal components.
3. **Visualize Variance Explained**: PCA will return a set of principal components. We will examine how much variance is explained by each component, helping us decide how many components to keep.
4. **Transform Data**: Finally, we will use the principal components to transform the dataset, replacing the original variables with the new components.
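
The four steps can also be written out by hand to show what a PCA routine does internally. The sketch below uses NumPy's SVD on synthetic data instead of scikit-learn; the data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three features, the first two strongly correlated.
base = rng.normal(size=(500, 1))
X = np.hstack([
    base * 100.0 + rng.normal(scale=5.0, size=(500, 1)),  # large scale
    base * 2.0 + rng.normal(scale=0.2, size=(500, 1)),    # small scale, correlated
    rng.normal(size=(500, 1)),                            # independent
])

# Step 1: standardize each feature to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: fit PCA via the singular value decomposition of the standardized matrix.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt  # each row is one principal direction

# Step 3: variance explained by each component, as a fraction of the total.
explained_variance = S**2 / (len(Z) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Step 4: transform the data into principal component scores.
scores = Z @ components.T

print(np.round(explained_ratio, 3))
```

Because two of the three features carry nearly the same signal, most of the variance lands on the first component, and the resulting score columns are uncorrelated with each other.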


In [None]:
import numpy as np

# Set a correlation threshold
threshold = 0.8

# Keep only the upper triangle (k=1 excludes the diagonal) so that each pair
# appears once and self-correlations are dropped
upper_mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
high_corr_pairs = correlation_matrix.where(upper_mask).stack()

# Filter out only pairs whose absolute correlation exceeds the threshold
high_corr_pairs = high_corr_pairs[high_corr_pairs.abs() > threshold].sort_values()

# Display the high correlation pairs
print(high_corr_pairs)


# Key Observations:
1. Mass-related Variables (mw, exactmass, monoisotopicmass): These variables are essentially perfectly correlated with one another (correlation of approximately 1.0). In any subsequent analysis, we can keep just one of these and discard the others to reduce redundancy.

2. Polar Surface Area (polararea): This variable is highly correlated with several other features, including complexity (0.842), hbondacc (0.911), and mw (0.880). We can consider keeping only one of these highly correlated variables.

3. Hydrogen Bond Acceptors (hbondacc): This feature is strongly correlated with many other variables, including heavy atom count (0.899), molecular weight (0.913), and complexity (0.873). This suggests that hbondacc overlaps with a lot of the other variables, and we can consider dropping some.

4. Complexity: Complexity is highly correlated with molecular weight (0.983) and heavy atom count (0.986). Like the mass variables, complexity provides redundant information in combination with these features.

### Why Perform PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique. It is especially useful in cases where the dataset contains highly correlated features. Highly correlated variables (positive or negative) provide redundant information. PCA transforms the data into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first principal component (PC1), the second greatest variance on the second component (PC2), and so on.

This allows us to simplify the dataset while retaining most of its variability. By focusing on a reduced number of principal components, we can capture the essential patterns of the data without having to process all of the original features.

In this case, many of our features are highly correlated, and PCA will help us identify the most important features contributing to the variability in our data.


In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# Select the features for PCA (excluding 'cid', 'cmpdname', etc.)
selected_features = [
    'mw', 'polararea', 'complexity', 'heavycnt', 'hbonddonor', 'hbondacc',
    'rotbonds', 'exactmass', 'monoisotopicmass', 'charge',
    'covalentunitcnt', 'totalatomstereocnt', 'definedatomstereocnt',
    'undefinedatomstereocnt', 'totalbondstereocnt', 'definedbondstereocnt',
    'undefinedbondstereocnt', 'pclidcnt', 'gpidcnt', 'gpfamilycnt', 'annothitcnt'
]

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_with_xlogp[selected_features])

# Run PCA
pca = PCA()
pca_result = pca.fit_transform(scaled_data)

# Explained variance
explained_variance = pca.explained_variance_ratio_

# Display explained variance per component
for i, variance in enumerate(explained_variance, 1):
    print(f'PC{i}: {variance * 100:.2f}%')

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance.cumsum() * 100, marker='o', linestyle='--', color='b')
plt.title('Cumulative Explained Variance by Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance (%)')
plt.grid(True)
plt.show()

# Optional: Create a DataFrame to explore the PCA components
pca_df = pd.DataFrame(pca_result, columns=[f'PC{i+1}' for i in range(pca.n_components_)])
pca_df.head()


### How to Interpret PCA Results

When performing PCA, the output provides important insights into the structure of the dataset. Here’s how to interpret the results step by step:

#### 1. **Explained Variance Ratio**
   The **explained variance ratio** tells us how much of the total variance in the data is captured by each principal component (PC). Each PC is an orthogonal (uncorrelated) linear combination of the original variables, and the total variance in the dataset is distributed across these PCs.

   - **PC1** will explain the largest amount of variance in the data.
   - **PC2** will explain the second largest, and so on.

   For example, if PC1 has an explained variance ratio of **40%**, that means 40% of the variance in the original data is captured by this component alone.

   **Interpreting the first few PCs:**
   - The first few principal components (usually PC1, PC2, PC3, etc.) are the most important because they capture most of the variability in the data. Components with very low explained variance (e.g., PC10, PC11) are less important and can be ignored if we want to reduce the dimensionality.
   
   You can use a rule of thumb that **selects the number of components** that explain around **85% to 95%** of the variance to retain most of the information.

#### 2. **Cumulative Explained Variance Plot**
   This plot shows how much variance is cumulatively explained by increasing the number of principal components. Typically, the curve will show that most of the variance is captured by the first few components, and then it levels off.

   - If the curve flattens after a certain number of PCs (e.g., after PC4), it indicates that **adding more components beyond that point doesn't significantly increase the explained variance.**
   - For example, if the plot shows that **90% of the variance** is explained by the first **4 components**, you may choose to keep those 4 components and discard the rest, simplifying the dataset without losing too much information.
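
The cutoff can also be read off numerically instead of from the plot. The helper below is a minimal sketch: `components_for_threshold` is a hypothetical function name, and the ratios are made-up values chosen only to illustrate the rule.

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # side="left" gives the first index where cumulative >= threshold.
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # Guard against floating-point totals that fall just short of the threshold.
    return min(k, len(cumulative))

# Hypothetical ratios, not results from this dataset.
example_ratios = np.array([0.50, 0.25, 0.15, 0.06, 0.04])
n_keep = components_for_threshold(example_ratios, threshold=0.95)
print(n_keep)
```

In the notebook the same function could be applied to `pca.explained_variance_ratio_`. scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 to select components by explained variance directly.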

#### 3. **Principal Component Scores**
   The transformed dataset after applying PCA gives us the **principal component scores** for each observation. These scores represent the coordinates of each observation in the new space defined by the principal components. They can be used for further analysis or as input to machine learning models.

   - **PC1, PC2, etc.** form new axes where each observation is represented as a point in this new space. These axes capture the most significant patterns in the data.
   - **Higher variance along a component** means the observations are more spread out along that axis, so that component distinguishes between observations more than later components do.

#### 4. **Loadings (Contribution of Original Features to Each PC)**
   Each principal component is a linear combination of the original features. The coefficients (also called **loadings**) for each original feature in these combinations tell us how much each feature contributes to each principal component.

   - **Positive or negative loadings** show the direction of the relationship with that component. Large positive or negative values indicate a strong influence on the component.
   - Features with high loadings on the first few principal components are the most important in explaining the variance in the dataset.

   **Example Interpretation:**
   - If PC1 has high positive loadings for features like `mw` (molecular weight) and `hbonddonor` (hydrogen bond donors), it means that these features are strongly associated with the first principal component.
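
Ranking features by loading is easiest when done on absolute values, since a large negative loading is as influential as a large positive one. The sketch below uses a made-up loading matrix and a hypothetical helper `top_features`; the numbers are not results from this dataset.

```python
import numpy as np

feature_names = ["mw", "hbonddonor", "rotbonds", "charge"]
# Hypothetical loading matrix (rows = PCs, columns = features).
loadings = np.array([
    [0.62, 0.55, 0.50, -0.25],
    [-0.10, 0.30, -0.20, 0.93],
])

def top_features(loading_row, names, k=2):
    """Return the k features with the largest absolute loading, keeping the sign."""
    order = np.argsort(np.abs(loading_row))[::-1][:k]
    return [(names[i], float(loading_row[i])) for i in order]

for pc_index, row in enumerate(loadings, start=1):
    print(f"PC{pc_index}:", top_features(row, feature_names))
```

For a fitted scikit-learn model, the rows of `pca.components_` would take the place of `loadings`, with the feature names supplied in the same order as the columns PCA was fitted on.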

#### 5. **Dimensionality Reduction**
   After running PCA, you can choose to keep only the principal components that explain a high percentage of variance (e.g., the first 4 or 5). This reduced set of components can then be used for further analysis or modeling.

   - By reducing the dimensionality, we can simplify the data and reduce computational complexity, while still retaining the majority of the original information.

### Practical Steps

1. **Check how many components explain ~85-95% of the variance.** This will help decide how many PCs to retain.
2. **Inspect the loadings** for the first few principal components to understand which features are contributing the most.
3. **Use the reduced dataset** (with the selected number of principal components) for further analysis or as input to machine learning algorithms.


In [None]:
# Get the explained variance from the PCA
pca_explained_variance = pca.explained_variance_ratio_

# Plot the explained variance for the top PCA components
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(pca_explained_variance)+1), pca_explained_variance, color='blue', alpha=0.7)
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.xticks(range(1, len(pca_explained_variance)+1))
plt.grid(True)

# Annotate with variance percentages
for i, v in enumerate(pca_explained_variance):
    plt.text(i + 1, v + 0.01, f"{v*100:.2f}%", ha='center', va='bottom')

plt.show()


In [None]:
# Select only numerical columns
numerical_columns = data_with_xlogp.select_dtypes(include=[np.number])

# Variance of original features (only for numerical columns)
original_variance = numerical_columns.var()

# Plot comparison of variance in original features and PCA components
plt.figure(figsize=(14, 6))

# Plot variance of original features
plt.subplot(1, 2, 1)
original_variance.plot(kind='bar', color='green', alpha=0.7)
plt.title('Variance of Original Features (Numerical)')
plt.xlabel('Features')
plt.ylabel('Variance')
plt.xticks(rotation=90)
plt.grid(True)

# Plot explained variance of PCA components
plt.subplot(1, 2, 2)
plt.bar(range(1, len(pca_explained_variance)+1), pca_explained_variance, color='blue', alpha=0.7)
plt.title('Explained Variance of PCA Components')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, len(pca_explained_variance)+1))
plt.grid(True)

# Annotate with variance percentages for PCA
for i, v in enumerate(pca_explained_variance):
    plt.text(i + 1, v + 0.01, f"{v*100:.2f}%", ha='center', va='bottom')

plt.tight_layout()
plt.show()


### Comparing PCA Results with Original Numerical Features

To ensure an accurate comparison between PCA components and the original dataset, we are focusing on **numerical** features only. The `cmpdname`, `cmpdsynonym`, and other non-numerical columns have been excluded from this analysis.

- **Original Features**: The variance for each numerical feature is plotted to show how much variability each feature captures within the dataset.
- **PCA Components**: The explained variance ratio for each principal component is plotted to demonstrate how well the PCA components summarize the data.

This comparison helps us understand how much information is retained after reducing the dimensionality of the dataset with PCA.
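
One caution when reading the two panels side by side: raw variances depend on each feature's units, whereas the PCA here was fitted on standardized data. The sketch below illustrates the difference using two invented features on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales (values are made up).
mass_like = rng.normal(loc=400.0, scale=150.0, size=1000)  # e.g. a mass in g/mol
count_like = rng.normal(loc=5.0, scale=2.0, size=1000)     # e.g. a small count

raw = np.column_stack([mass_like, count_like])
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_var = raw.var(axis=0)
std_var = standardized.var(axis=0)
print("raw variances:         ", np.round(raw_var, 1))
print("standardized variances:", np.round(std_var, 3))
```

After standardization every feature has unit variance, so the explained variance ratios describe how that equalized variance is redistributed across components, not which raw feature had the largest spread.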


In [None]:
# Number of leading components to display (8, as retained in the results below)
num_components_to_keep = 8

# Loadings of each retained component, labeled with the features PCA was fitted on
pca_components = pd.DataFrame(pca.components_[:num_components_to_keep],
                              columns=selected_features,
                              index=[f'PC{i+1}' for i in range(num_components_to_keep)])

# Display the principal components DataFrame
print("Principal Components after PCA:")
pca_components


### Principal Components Analysis (PCA) - Results

After running PCA on the dataset, we retain **8 components** that together explain 95% of the variance in the data. Each of these components is a linear combination of the original features, and the amount each feature contributes to each component is given by the loadings.

- **Variance Explained**: The first few components capture the majority of the variance, allowing us to reduce the dimensionality of the data without losing significant information.
- **Loadings**: These indicate how much each original feature contributes to a principal component. High absolute values indicate stronger contributions from that feature.


In [None]:
# Select the number of principal components to keep
num_components_to_keep = 8

# Label the loadings with the same feature list that PCA was fitted on
used_columns = selected_features

# Extract the top principal components
pca_components_to_keep = pd.DataFrame(pca.components_[:num_components_to_keep],
                                      columns=used_columns,
                                      index=[f'PC{i+1}' for i in range(num_components_to_keep)])

# Only display the top contributing variables for each principal component:
# the five features with the largest absolute loadings
top_contributing_features = pca_components_to_keep.apply(
    lambda x: x.abs().nlargest(5).index.tolist(), axis=1)

# Print the components that we will keep
print("Top contributing features for each principal component:")
print(top_contributing_features)


## Top Contributing Features for PC1 and PC2

The first two principal components (PC1 and PC2) contribute significantly to explaining the variance. The five top-ranked features for each are listed below.

**PC1 captures variability primarily from features such as:**

* rotbonds
* hbondacc
* cid
* complexity
* polararea

**PC2 captures variability from features such as:**

* definedbondstereocnt
* undefinedbondstereocnt
* polararea
* gpidcnt
* gpfamilycnt

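Together these two lists name nine distinct features (`polararea` appears in both), which gives a broad picture of what influences the first two principal components the most. One caveat: `cid` is a database identifier and is not among the `selected_features` passed to PCA, so its presence under PC1 suggests the loadings were labeled with the wrong column names. The lists should be regenerated with labels taken from `selected_features` before drawing conclusions from them.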

In [None]:
# Import the necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Define the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Project the same standardized features that the PCA was fitted on.
# xlogp is the target, so it must not appear among the inputs, and the data
# must pass through the fitted scaler before pca.transform.
scaled_features = scaler.transform(data_with_xlogp[selected_features])

# Keep only the retained components as model inputs
pca_components = pca.transform(scaled_features)[:, :num_components_to_keep]

# Define the target variable (xlogp values)
target = data_with_xlogp['xlogp'].values

# Check the shape to ensure consistency
print(f"PCA Components shape: {pca_components.shape}")
print(f"Target shape: {target.shape}")

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(pca_components, target, test_size=0.2, random_state=42)

# Fit the Random Forest Regressor to the training data
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print out the model performance
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
