## **Statistical methods for classifying mass spectrometry database search results**

### **MSB72011**

### **Instructor:** 
#### Mr. Sserunjogi Richard

### **Authors:**
#### 1. Aroma Emmanuel 2023/HD07/3346U
#### 2. Mutesasira Edward 2023/HD07/3369U

### Generating Synthetic Data

In [1]:
import numpy as np

# Function to generate synthetic mass spectrometry database search results
def generate_synthetic_results(num_samples, num_compounds):
    X = np.random.randn(num_samples, num_compounds)  # Synthetic mass spectrometry data
    scores = np.random.rand(num_samples, num_compounds)  # Synthetic scores/probabilities
    return X, scores

# Generate synthetic data
num_samples = 1000
num_compounds = 20
X, scores = generate_synthetic_results(num_samples, num_compounds)
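The generator above draws from NumPy's global random state without a seed, so every run yields different data and different metrics. A minimal sketch of a seeded variant follows; the function name `generate_seeded_results` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def generate_seeded_results(num_samples, num_compounds, seed=42):
    """Seeded variant of generate_synthetic_results (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    X = rng.standard_normal((num_samples, num_compounds))  # synthetic features
    scores = rng.random((num_samples, num_compounds))      # synthetic scores in [0, 1)
    return X, scores

# Two calls with the same seed return identical arrays
X_a, scores_a = generate_seeded_results(1000, 20)
X_b, scores_b = generate_seeded_results(1000, 20)
print(np.array_equal(X_a, X_b), np.array_equal(scores_a, scores_b))
```

With a fixed seed, the reports printed later in the notebook would stay the same from run to run.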


### Converting Scores to Binary Labels

In [3]:
# Setting threshold for classification;
threshold = 0.5

# Converting scores to binary labels (incorrect: 0, correct: 1)
y = (scores.max(axis=1) > threshold).astype(int)
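Because each row holds 20 independent uniform scores, the row maximum falls at or below 0.5 only with probability 0.5^20, so this rule labels almost every sample as class 1. A short check of that arithmetic is sketched below; the seeded generator and the `_demo` names are assumptions made for repeatability (the notebook itself is unseeded).

```python
import numpy as np

num_compounds = 20
threshold = 0.5

# Probability that all 20 uniform scores in a row are <= threshold
p_class_0 = threshold ** num_compounds
print(f"P(label == 0) = {p_class_0:.2e}")

# Empirical check on freshly drawn scores (seed chosen only for repeatability)
rng = np.random.default_rng(0)
scores_demo = rng.random((1000, num_compounds))
labels_demo = (scores_demo.max(axis=1) > threshold).astype(int)
print("fraction labelled 1:", labels_demo.mean())
```

This is why the next cell has to create class 0 samples artificially.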


### Ensuring Both Classes are Represented

In [4]:
if np.unique(y).size == 1:
    # In case there's only one unique class, we can randomly flip some labels like this;
    flip_indices = np.random.choice(len(y), size=int(0.1 * len(y)), replace=False)  # Flipping 10% of the labels
    y[flip_indices] = 1 - y[flip_indices]  # To flip 0s to 1s and 1s to 0s
    
# Adding more class 0 samples;
num_class_0_samples = int(0.4 * len(y))  # 40% of total samples
y[:num_class_0_samples] = 0  # Assigning class 0 to the first 40% of samples

# Shuffling the labels;
np.random.shuffle(y)
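The relabelling above fixes the class balance by construction rather than by the scores. If all labels start as 1 (the typical case), 10% are flipped and the first 40% are then set to 0, so the expected share of class 0 is 0.4 + 0.1 × 0.6 = 0.46. The sketch below replays those steps on an all-ones vector and counts the classes; the seed and the `_demo` names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.ones(1000, dtype=int)  # assume thresholding produced only class 1

# Flip 10% of the labels, as in the single-class branch
flip_idx = rng.choice(len(y_demo), size=int(0.1 * len(y_demo)), replace=False)
y_demo[flip_idx] = 1 - y_demo[flip_idx]

# Force the first 40% to class 0, then shuffle the labels
y_demo[: int(0.4 * len(y_demo))] = 0
rng.shuffle(y_demo)

class_counts = np.bincount(y_demo, minlength=2)
print("class 0:", class_counts[0], "class 1:", class_counts[1])
```

Since the labels are shuffled independently of `X`, any remaining link between a row of features and its label is lost at this point.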

### Creating DataFrame and Saving to CSV

In [5]:
import pandas as pd

# Creating DataFrame for X and y;
df = pd.DataFrame(X)
df['target'] = y

# Saving DataFrame to CSV;
df.to_csv('synthetic_data.csv', index=False)

#### Reading DataFrame from CSV

In [6]:
# Read DataFrame from CSV
df = pd.read_csv('synthetic_data.csv')
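One side effect of the CSV round trip is worth knowing: `pd.DataFrame(X)` creates integer column labels, but `read_csv` returns them as strings. A small sketch with an in-memory buffer (an assumption, used here instead of `synthetic_data.csv` so that nothing is written to disk):

```python
import io
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.zeros((2, 3)))  # columns are the integers 0, 1, 2
df_demo["target"] = [0, 1]

# Write to and read back from an in-memory CSV buffer
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(list(df_demo.columns))  # integer labels plus 'target'
print(list(df_back.columns))  # every label is a string after reading
```

After reloading, a feature column is therefore selected with `df['0']` rather than `df[0]`.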

#### Exploring DataFrame

In [7]:
# Printing DataFrame information;
print("DataFrame information:")
df.info()

DataFrame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1000 non-null   float64
 1   1       1000 non-null   float64
 2   2       1000 non-null   float64
 3   3       1000 non-null   float64
 4   4       1000 non-null   float64
 5   5       1000 non-null   float64
 6   6       1000 non-null   float64
 7   7       1000 non-null   float64
 8   8       1000 non-null   float64
 9   9       1000 non-null   float64
 10  10      1000 non-null   float64
 11  11      1000 non-null   float64
 12  12      1000 non-null   float64
 13  13      1000 non-null   float64
 14  14      1000 non-null   float64
 15  15      1000 non-null   float64
 16  16      1000 non-null   float64
 17  17      1000 non-null   float64
 18  18      1000 non-null   float64
 19  19      1000 non-null   float64
 20  target  1000 non-null   int64  
dtypes: float64(20),

In [8]:
# Displaying first 10 rows of DataFrame;
print("\nFirst 10 rows of DataFrame:")
df.head(10)



First 10 rows of DataFrame:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,target
0,0.180551,-0.590916,0.635813,0.140767,-1.41992,-0.482936,0.480181,0.228977,0.088124,0.064522,...,-0.84279,0.184114,0.547274,0.496329,1.231079,0.125699,0.106198,0.547975,1.755091,1
1,1.681506,1.090018,0.162847,1.742854,-0.159414,0.345682,-0.457797,-0.483488,0.732228,0.334868,...,0.867774,-1.194611,-0.516639,0.498988,0.622944,-0.400175,-0.21955,-0.623239,1.181455,0
2,-0.778645,0.308639,-1.104495,-1.054881,1.669679,-0.244908,0.546972,-2.526027,-0.33406,-0.769247,...,-1.101024,0.170163,0.194607,0.101597,0.024509,1.086503,0.363216,0.474499,-0.617318,1
3,1.443136,-0.293879,-0.838552,1.374696,-0.743341,0.108639,-0.324354,-1.173382,0.10482,-0.613695,...,2.223093,0.789754,1.068266,1.671511,-1.403007,2.013293,-1.0165,0.929351,1.000684,1
4,-0.416401,-2.552749,-0.466438,2.497278,-1.155493,1.097357,0.697965,0.270998,0.104718,-0.897984,...,0.177377,-0.150224,1.520305,-1.248051,-0.664438,1.787128,-0.708873,-0.095383,-0.337916,0
5,-0.87441,-1.400708,0.239435,0.169224,2.464252,-1.178589,1.28592,-0.638513,1.235768,0.829877,...,-0.998597,0.556542,-1.128372,-0.188646,0.089196,-0.10331,0.182279,0.838761,-0.201489,1
6,-2.268073,-0.185117,-0.206252,-0.480224,-0.482385,0.723632,0.089358,-1.227721,-0.277233,0.781443,...,-1.18036,-0.018574,-0.963815,0.533911,0.679726,-0.105409,-0.574691,-0.13294,-0.19947,0
7,-1.054774,-0.97417,-0.044674,-0.54854,0.295761,0.799246,0.29465,1.154749,-0.328226,-0.204369,...,0.96164,0.327416,0.811634,-1.583563,-0.076668,0.949059,1.006342,-0.001859,1.399819,1
8,-0.101233,-1.751782,0.586891,-1.221501,0.490442,1.503872,-0.351244,-0.793609,0.043641,0.813477,...,0.23223,-0.760822,-1.417935,1.921707,-0.669319,-0.258651,-1.188035,0.631844,-1.639472,1
9,-0.037424,-1.723958,-1.403169,0.06158,0.95845,0.198741,-0.134427,0.330126,1.478338,0.242414,...,-1.217302,-1.050632,-1.301196,-0.221739,0.040116,-1.678293,1.513412,-0.159294,-0.225213,1


### Splitting Data and Initializing Classifiers

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Splitting the data into training and testing sets;
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # ensuring test set to be 20%

# Initializing classifiers;
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
svm_classifier = SVC(kernel='linear')
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)
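The split above is random with respect to the class labels. When the class proportions should be preserved in both subsets, `train_test_split` accepts a `stratify` argument. A sketch on stand-in data follows; the arrays are assumptions that mimic the shapes and the roughly 46/54 class balance of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((1000, 20))   # stand-in features
y_demo = np.array([0] * 460 + [1] * 540)   # stand-in labels

# stratify=y_demo keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("class 1 share overall:", y_demo.mean())
print("class 1 share in test:", y_te.mean())
```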


### Training and Evaluating Classifiers

In [10]:
from sklearn.metrics import classification_report, matthews_corrcoef, accuracy_score

# Fitting Random Forest classifier;
rf_classifier.fit(X_train, y_train)
# Predicting on the test set using Random Forest;
y_pred_rf = rf_classifier.predict(X_test)

# Fitting SVM classifier;
svm_classifier.fit(X_train, y_train)
# Predicting on the test set using SVM;
y_pred_svm = svm_classifier.predict(X_test)

# Fitting Gradient Boosting classifier;
gb_classifier.fit(X_train, y_train)
# Predicting on the test set using Gradient Boosting
y_pred_gb = gb_classifier.predict(X_test)

# Evaluating classifiers;
print("\nRandom Forest Classifier:")
rf_report = classification_report(y_test, y_pred_rf, zero_division=1)
rf_mcc = matthews_corrcoef(y_test, y_pred_rf)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(rf_report)
print("Matthews Correlation Coefficient (MCC):", rf_mcc)
print("Accuracy:", rf_accuracy)

print("\nSupport Vector Machine Classifier:")
svm_report = classification_report(y_test, y_pred_svm, zero_division=1)
svm_mcc = matthews_corrcoef(y_test, y_pred_svm)
svm_accuracy = accuracy_score(y_test, y_pred_svm)
print(svm_report)
print("Matthews Correlation Coefficient (MCC):", svm_mcc)
print("Accuracy:", svm_accuracy)

print("\nGradient Boosting Classifier:")
gb_report = classification_report(y_test, y_pred_gb, zero_division=1)
gb_mcc = matthews_corrcoef(y_test, y_pred_gb)
gb_accuracy = accuracy_score(y_test, y_pred_gb)
print(gb_report)
print("Matthews Correlation Coefficient (MCC):", gb_mcc)
print("Accuracy:", gb_accuracy)



Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.43      0.34      0.38        96
           1       0.49      0.58      0.53       104

    accuracy                           0.47       200
   macro avg       0.46      0.46      0.46       200
weighted avg       0.46      0.47      0.46       200

Matthews Correlation Coefficient (MCC): -0.08144697841017111
Accuracy: 0.465

Support Vector Machine Classifier:
              precision    recall  f1-score   support

           0       0.46      0.17      0.24        96
           1       0.52      0.82      0.63       104

    accuracy                           0.51       200
   macro avg       0.49      0.49      0.44       200
weighted avg       0.49      0.51      0.45       200

Matthews Correlation Coefficient (MCC): -0.02107131804136713
Accuracy: 0.505

Gradient Boosting Classifier:
              precision    recall  f1-score   support

           0       0.42      0.32      0.37

In the output above:

1. **Random Forest Classifier:**
   - Precision, recall, and F1-score for both classes (0 and 1) are displayed, along with support (number of true instances) for each class.
   - The `accuracy` for the Random Forest classifier is 0.465, indicating that it correctly predicts 46.5% of the test samples.
   - The Matthews Correlation Coefficient (MCC) is -0.081. The MCC measures the correlation between the observed and predicted classifications: 1 represents a perfect prediction, 0 represents no better than random prediction, and -1 represents total disagreement between prediction and observation. A value this close to 0 means the predictions are essentially uncorrelated with the true labels.

2. **Support Vector Machine (SVM) Classifier:**
   - The classifier favours class 1: recall is 0.82 for class 1 but only 0.17 for class 0, so most true class 0 samples are misclassified. The `zero_division=1` parameter only matters when a class receives no predictions at all (it then reports the otherwise undefined precision as 1 instead of raising the `precision is ill-defined` warning). That does not happen in this output, where class 0 has a precision of 0.46.
   - The accuracy for the SVM classifier is 0.505, indicating that it correctly predicts 50.5% of the samples, and the MCC is -0.021.

3. **Gradient Boosting Classifier:**
   - Similar to the Random Forest Classifier, precision, recall, and F1-score for both classes are displayed, along with support for each class.
   - The accuracy for the Gradient Boosting classifier is 0.51, indicating that it correctly predicts 51% of the samples.
   - The Matthews Correlation Coefficient (MCC) is 0.031, which is a measure of the quality of binary classifications, indicating the correlation between the observed and predicted classifications.
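The MCC can be reproduced by hand from a confusion matrix. The counts below are not printed by the notebook; they are inferred from the Random Forest report in the output above (support 96 and 104, recall 0.34 and 0.58, accuracy 0.465) and should be read as an assumption.

```python
import math

# Inferred confusion matrix for the Random Forest (class 1 = positive)
tn, fp = 33, 63   # true class 0: 33 predicted as 0, 63 predicted as 1
fn, tp = 44, 60   # true class 1: 44 predicted as 0, 60 predicted as 1

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"MCC = {mcc:.4f}, accuracy = {accuracy:.3f}")
```

These counts give an MCC of about -0.081 and an accuracy of 0.465, in line with the values printed for the Random Forest classifier.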

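Overall, none of the three classifiers performs meaningfully better than chance: in the printed output the accuracies all lie close to 0.5 and the MCC values close to 0. This is expected, because the features in `X` are random noise generated independently of `scores`, and the labels are partly overwritten and then shuffled, so the features carry no information about the target. The small differences between the classifiers should not be read as a ranking.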
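A useful reference point for these accuracies is a baseline that always predicts the majority class. With the test-set supports shown in the reports (96 samples of class 0 and 104 of class 1), that baseline can be sketched as follows; the `y_test_demo` array is a stand-in built from those counts, not the notebook's actual `y_test`.

```python
import numpy as np

# Stand-in test labels built from the supports in the printed reports
y_test_demo = np.array([0] * 96 + [1] * 104)

# Always predict the most frequent class and measure the accuracy
majority_class = np.bincount(y_test_demo).argmax()
baseline_accuracy = np.mean(y_test_demo == majority_class)
print("majority class:", majority_class)
print("baseline accuracy:", baseline_accuracy)
```

Always predicting class 1 would score 0.52 on this test set. scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same kind of baseline as an estimator.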

### Performing hyperparameter tuning using grid search and cross-validation for the Random Forest Classifier:

In [11]:
from sklearn.model_selection import GridSearchCV

# Defining the parameter grid to search;
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initializing the Random Forest Classifier;
rf_classifier = RandomForestClassifier(random_state=42)

# Performing grid search with cross-validation;
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Getting the best parameters and the best estimator;
best_params = grid_search.best_params_
best_rf_classifier = grid_search.best_estimator_

# GridSearchCV refits the best estimator on the full training set by default (refit=True),
# so best_rf_classifier is already fitted and needs no further call to fit().

# Predicting on the test set using the best classifier;
y_pred_rf = best_rf_classifier.predict(X_test)

# Evaluating the best classifier;
print("Random Forest Classifier (after hyperparameter tuning):")
print(classification_report(y_test, y_pred_rf))


Random Forest Classifier (after hyperparameter tuning):
              precision    recall  f1-score   support

           0       0.43      0.29      0.35        96
           1       0.50      0.64      0.56       104

    accuracy                           0.47       200
   macro avg       0.46      0.47      0.45       200
weighted avg       0.46      0.47      0.46       200



Here's a brief summary for the output above:

---

### Hyperparameter Tuning for Random Forest Classifier

In this code snippet, we performed hyperparameter tuning for the Random Forest Classifier using GridSearchCV. Here's what the output means:

- **Random Forest Classifier (after hyperparameter tuning):**
  - The following results are for the Random Forest Classifier after tuning its hyperparameters.

- **Precision, Recall, and F1-score:**
  - Precision, recall, and F1-score are computed for both classes (0 and 1).
  - Precision represents the proportion of true positive predictions among all positive predictions.
  - Recall represents the proportion of true positive predictions among all actual positive instances.
  - F1-score is the harmonic mean of precision and recall and provides a balance between the two metrics.
  - For class 0: Precision is 0.43, Recall is 0.29, and F1-score is 0.35.
  - For class 1: Precision is 0.50, Recall is 0.64, and F1-score is 0.56.

- **Accuracy:**
  - The overall accuracy of the classifier is 0.47, indicating that it correctly predicts about 47% of the test samples.

- **Macro Average and Weighted Average:**
  - The macro average computes the metrics independently for each class and then takes the average (unweighted) of the scores for all classes. It gives equal weight to each class.
  - The weighted average computes the metrics for each class and then takes the weighted average based on the number of true instances for each class. It provides more weight to the majority class.
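To make these definitions concrete, the sketch below recomputes the report from a confusion matrix. The counts are an assumption: they are inferred from the printed report after tuning (support 96 and 104, recall 0.29 and 0.64) and are not output by the notebook.

```python
# Inferred counts: outer key is the true class, inner key the predicted class
confusion = {0: {0: 28, 1: 68}, 1: {0: 37, 1: 67}}
classes = [0, 1]

def class_metrics(c):
    """Precision, recall, F1 and support when class c is treated as positive."""
    tp = confusion[c][c]
    fp = sum(confusion[o][c] for o in classes if o != c)
    fn = sum(confusion[c][o] for o in classes if o != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn

per_class = {c: class_metrics(c) for c in classes}
total = sum(m[3] for m in per_class.values())

# Macro: plain mean over classes; weighted: mean weighted by support
macro = [sum(per_class[c][i] for c in classes) / len(classes) for i in range(3)]
weighted = [sum(per_class[c][i] * per_class[c][3] for c in classes) / total
            for i in range(3)]

for c in classes:
    p, r, f, s = per_class[c]
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
print("macro avg:   ", [f"{v:.3f}" for v in macro])
print("weighted avg:", [f"{v:.3f}" for v in weighted])
```

The weighted-average recall (0.475) is the same quantity as the overall accuracy, which the report prints as 0.47.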

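Overall, the Random Forest Classifier after hyperparameter tuning reaches an accuracy of 0.47 in the printed report, essentially unchanged from the 0.465 of the default classifier, so tuning brings no real improvement here. This is expected: the features in `X` are random noise generated independently of the labels, so there is no signal for any hyperparameter setting to exploit.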
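The cost of this search is easy to estimate from the grid itself: four hyperparameters with three candidate values each give 3^4 = 81 combinations, and `cv=5` fits each one five times. A small sketch that enumerates the grid with `itertools.product` (a stand-in for the enumeration that `GridSearchCV` performs internally):

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print("combinations:", len(combinations))
print("cross-validation fits:", n_fits)
print("first candidate:", combinations[0])
```

That is 405 model fits during cross-validation, plus one final refit on the full training set when `refit=True` (the default).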

---