Problem Solving Task: Yeast Dataset
- Analyse feature importance in protein presence
- Predict for presence of protein using supervised learning models
- Predict for presence of protein using emsemble learning models



Yeast Dataset Information
1. mcg: McGeoch's method for signal sequence recognition. 
2. gvh: von Heijne's method for signal sequence recognition. 
3. alm: Score of the ALOM membrane spanning region prediction program. 
4. mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. 
5. erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute. 
6. pox: Peroxisomal targeting signal in the C-terminus. 
7. vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. 
8. nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. 
9. class: Presence or absence of protein {positive, negative}. 

**ANALYSE THE IMPORTANCE OF THE FEATURES IN PREDICTING THE PRESENCE OR ABSENCE OF THE PROTEIN USING TWO DIFFERENT APPROACHES**

In [87]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as skm

#import data as dataframe
yeastdata = pd.read_csv('yeast.csv') 

#set the names of columns
feature_cols = ['mcg', 'gvh', 'alm','mit','erl','pox','vac','nuc','class']

#print first 4 rows and head of yeast data table
yeastdata.head()


Unnamed: 0,mcg,gvh,alm,mit,erl,pox,vac,nuc,class
0,0.51,0.4,0.56,0.17,0.5,0.5,0.49,0.22,negative
1,0.4,0.39,0.6,0.15,0.5,0.0,0.58,0.3,negative
2,0.4,0.42,0.57,0.35,0.5,0.0,0.53,0.25,negative
3,0.46,0.44,0.52,0.11,0.5,0.0,0.5,0.22,negative
4,0.47,0.39,0.5,0.11,0.5,0.0,0.49,0.4,negative


In [88]:
#convert class data into binary so that it can be used for machine learning

#make dummy columns positive and negativae with binary values
posnegdummies = pd.get_dummies(yeastdata["class"])

#print head of binary values dummy table to check data
print(posnegdummies.head())

#drop the positive axis from the posnegdummies table
posdummies = posnegdummies.drop(["negative"], axis=1)

#drop the class column from the yeastdata table
yeast = yeastdata.drop(["class"], axis=1)

#concatenate yeast and posdummies to make a new table
yeastML = pd.concat((posdummies, yeast), axis=1)

#print head yeastML tbale to check it is ready for machine learning
print(yeastML.head())



   negative  positive
0      True     False
1      True     False
2      True     False
3      True     False
4      True     False
   positive   mcg   gvh   alm   mit  erl  pox   vac   nuc
0     False  0.51  0.40  0.56  0.17  0.5  0.5  0.49  0.22
1     False  0.40  0.39  0.60  0.15  0.5  0.0  0.58  0.30
2     False  0.40  0.42  0.57  0.35  0.5  0.0  0.53  0.25
3     False  0.46  0.44  0.52  0.11  0.5  0.0  0.50  0.22
4     False  0.47  0.39  0.50  0.11  0.5  0.0  0.49  0.40


In [89]:
#calculate spearman correlation between positive and each other feature, print as dataframe
correlation_df = pd.DataFrame({
    col: [yeastML[col].corr(yeastML['positive'], method='spearman')] 
    for col in yeastML.columns if col != 'positive'
}).T
correlation_df.columns = ['Spearman Correlation']
print(correlation_df)

     Spearman Correlation
mcg              0.389364
gvh              0.323893
alm             -0.401148
mit              0.151903
erl              0.033410
pox             -0.014653
vac              0.084956
nuc             -0.033274


Spearman calculated based on ranking. Spearman correlation suggests strongest relationship  to the presence of protein is features alm, mcg then gvh.

In [90]:
#calculate Pearson correlation between positive and each other feature, print as dataframe
correlation_df = pd.DataFrame({
    col: [yeastML[col].corr(yeastML['positive'], method='pearson')] 
    for col in yeastML.columns if col != 'positive'
}).T
correlation_df.columns = ['Pearson Correlation']
print(correlation_df)

     Pearson Correlation
mcg             0.535772
gvh             0.387440
alm            -0.480789
mit             0.141470
erl             0.033410
pox            -0.014653
vac             0.050715
nuc            -0.038384


Pearson correlation is strength of linear relationship. Pearson suggests strongest relationship to the presence of protein is features mcg, alm and then gvh.

**CREATE THREE SUPERVISED LEARNING MODELS FOR PREDICTING PRESENCE OR ABSENCE OF PROTEIN**


PART 1: Data Preparation

In [91]:
#split data into test, train, validation sets: 70% train, 15% test, 15% validate

#import test train split
from sklearn.model_selection import train_test_split


#split existing data into X variables and y variables
#set x variables as all columns except the positive column from the yeastML dataset
X = yeastML.drop(columns = ['positive'])
#set y variable as the positive column from the yeastML dataset
y = yeastML['positive']

#SPLIT 1 - split into training set (60%) and remainder of data (40%)
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.6)

#SPLIT 2 - split remaining data (data not used in training set) into testing set (50% of 40%) and validation set data (50% of 40%)
#0.5 means 50% of the remaining data in X and y will be a testing set, 20% of the original dataset
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)

#this leave 50% of the remaining data as in the validation set.

print(X_train.shape), print(y_train.shape)
#print the shape of the training data
print(X_valid.shape), print(y_valid.shape)
#print the shape of the validation data
print(X_test.shape), print(y_test.shape)
#print the shape of the test data

(308, 8)
(308,)
(103, 8)
(103,)
(103, 8)
(103,)


(None, None)


PART 2: Set Model Hyperparameters

In [92]:
#import gridsearch from sklearn for hyperparameter tuning with validation data
from sklearn.model_selection import GridSearchCV

#import sklearn metrics to determinemodel performance
from sklearn import metrics


In [93]:
#Model 1: Logistic Regression HYPERPARAMETERS

#import logistic regression model
from sklearn.linear_model import LogisticRegression 

#initiate logistic regression model and name it log
log = LogisticRegression() 

#define the parameter grid to search, best combination of C values, penalty types l1 and l2 and solver type of libinear (as supports l1 and l2 penalty types).
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2', 'l1'],
    'solver': ['liblinear'] 
}

#initiate gridsearch with log model and the parameters set in param_grid
grid = GridSearchCV(log, param_grid, cv=3)

#fit the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best knn hyperparameters from the grid search
print(grid_search.best_params_)

#print the best accuracy score from sklearn metrics with values from validation set as result of log method
print(grid_search.best_score_)


{'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
0.941736694677871


In [94]:
#Model 2: K Nearest Neighbours with gridsearch to determine best hyperparameters

#import nearest neighbours model
from sklearn.neighbors import KNeighborsClassifier

#initiate K nearest neighbours model and name it knn
knn = KNeighborsClassifier()

#define the parameter grid to search, best combination of 1-10 neighbours, weighting strategy and distance metrics
param_grid = {
    'n_neighbors': range(1, 10), 
    'weights': ['uniform', 'distance'], 
    'metric': ['euclidean', 'manhattan', 'minkowski']  
}

grid = GridSearchCV(knn, param_grid, cv=3)
#initiate gridsearch with knn model and the parameters set in param_grid

#fit the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best knn hyperparameers from the grid search
print(grid_search.best_params_)

#print the best accuracy score from sklearn metrics with values from validation set as result of knn method
print(grid_search.best_score_)


{'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
0.9613445378151261


In [95]:
#Model 3: SVM with gridsearch to determine best hyperparameters

#import svm 
from sklearn import svm

#initiate svm model and name it svm
svm = svm.SVC()

#set parameters for gridsearch - search best combinations of rbf and linear kernal, as well as best gamma value between 0.0001 and 0.1
param_grid = {
    'kernel':('rbf', 'linear'),
    'gamma':[0.1, 0.0001]
    }

#initiate gridsearch with svm model and the parameters set in param_grid
grid = GridSearchCV(svm, param_grid, cv=3)

# fitting the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best hyperparameers from the grid search
print(grid_search.best_params_)
#print the best accuracy score from sklearn metrics with values from validation set as result of svm method
print(grid_search.best_score_)


{'gamma': 0.1, 'kernel': 'rbf'}
0.8834733893557423


PART 3: Model Testing

In [96]:
# Function to print metrics
def print_model_metrics(y_true, y_pred, model_name):
    print(f"\n{model_name} Model Metrics:")
    print("Accuracy: {:.2f}".format(skm.accuracy_score(y_true, y_pred)))
    print("Precision: {:.2f}".format(skm.precision_score(y_true, y_pred, zero_division=0)))
    print("Recall: {:.2f}".format(skm.recall_score(y_true, y_pred)))
    print("F1 Score: {:.2f}".format(skm.f1_score(y_true, y_pred)))
    print("\nConfusion Matrix:")
    print(skm.confusion_matrix(y_true, y_pred))


In [107]:
#Model 1: Logistic Regression 

#initiate linear regression model with hyperparameters identified above
logtrain = LogisticRegression(C= 100, penalty = 'l2', solver ='liblinear') 

#fit the model using the training data
logtrain.fit(X_train, y_train)

#run the trained logistic model on the test dataset
logpredict = logtrain.predict(X_test)

#calculate the metrics fot the logistic regression model
print_model_metrics(y_test, logpredict, "Logistic Regression")

       


Logistic Regression Model Metrics:
Accuracy: 0.95
Precision: 0.71
Recall: 0.62
F1 Score: 0.67

Confusion Matrix:
[[93  2]
 [ 3  5]]


In [108]:
#Model 2: K Nearest Neighbours 

#initiate  K Nearest Neighbours model with hyperparameters identified above
knntrain = KNeighborsClassifier(metric='euclidean', n_neighbors= 3, weights='uniform')

#fit the model using the training data
knntrain.fit(X_train, y_train)

#run the trained KNN model on the test dataset
knnypredict = knntrain.predict(X_test)

#calculate the metrics fot the K nearest neighbours model
print_model_metrics(y_test, knnypredict, "K nearest neighbours")



K nearest neighbours Model Metrics:
Accuracy: 0.95
Precision: 0.71
Recall: 0.62
F1 Score: 0.67

Confusion Matrix:
[[93  2]
 [ 3  5]]


In [99]:
#Model 3: SVM 

#import svm 
from sklearn import svm 

#initiate svm model with hyperparameters identified above
svmtrain = svm.SVC(kernel='rbf', gamma=0.1)

#fit the model using the training data
svmtrain.fit(X_train, y_train)

#run the trained svm model on the test dataset
svmypredict = svmtrain.predict(X_test)

#calculate the metrics fot the SVM model
print_model_metrics(y_test, svmypredict, "SVM")

       


SVM Model Metrics:
Accuracy: 0.92
Precision: 0.00
Recall: 0.00
F1 Score: 0.00

Confusion Matrix:
[[95  0]
 [ 8  0]]


Zero values in SVM metrics are due to zero values recorded for false positives and true negatives in confusion matrix.
SVM had highest number of false negatives and was unable to find any negative results.
Logistic regression and K-nearest neighbours models have exactly the same results, performing better than SVM.


**BUILD THREE ENSEMBLE LEARNING MODELS FOR PREDICTING PRESENCE OR ABSENCE OF PROTEIN**


PART 1: Determine hyperparameters of ensemble models

In [100]:
#Model 1: Random Forest

#import random forest classifier
from sklearn.ensemble import RandomForestClassifier

#initiate random forest classifier
forest = RandomForestClassifier()

#set parameters for gridsearch - search best combinations of max features, number of estimators and max depth
param_grid = {
    'max_features':('sqrt', 'log2'),
    'n_estimators':[1, 50],
    'max_depth':[1, 50]
    }

#Initiate gridsearch with random forest model and the parameters set in param_grid2
grid = GridSearchCV(forest, param_grid,cv=3)

# fit the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best hyperparameers from the grid search
print(grid_search.best_params_)
#print the best accuracy score from sklearn metrics with values from validation set as result of random forest method
print(grid_search.best_score_)



{'max_depth': 1, 'max_features': 'log2', 'n_estimators': 50}
0.9417366946778712


In [101]:
#Model 2: Bagging Classifier with Decision Tree CLassifier as base

#import bagging emsemble classifier
from sklearn.ensemble import BaggingClassifier
#import decision tree classifier
from sklearn.tree import DecisionTreeClassifier

#initiate bagging classifier with decision tree
bagging = BaggingClassifier(estimator=DecisionTreeClassifier())

#set parameters for gridsearch - search best combinations of number of estimators, max-samples and max features
param_grid = {
    'n_estimators':[1, 50],
    'max_samples':[1, 50],
    'max_features':[1, 8]
    }

#Initiate gridsearch with bagging model and the parameters set in param_grid
grid = GridSearchCV(bagging, param_grid, cv=3)

# fitting the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best hyperparameers from the grid search
print(grid_search.best_params_)
#print the best accuracy score from sklearn metrics with values from validation set as result of bagging method
print(grid_search.best_score_)



{'max_features': 8, 'max_samples': 50, 'n_estimators': 50}
0.941736694677871


In [102]:
#Model 3: Boosting Classifier with Decision Tree CLassifier as base

#import adaboost emsemble classifier
from sklearn.ensemble import AdaBoostClassifier

#initiate adaboost classifier with decision tree
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier())

#set parameters for gridsearch - search best combinations of number of estimators, max-samples and max features
param_grid = {
    'n_estimators':[1, 50],
    'learning_rate':[0.1, 100000]}

#Initiate gridsearch with adaboost model and the parameters set in param_grid
grid = GridSearchCV(boost, param_grid, cv=3)

# fitting the model for grid search to the Validation data set
grid_search = grid.fit(X_valid, y_valid)

#print best hyperparameers from the grid search
print(grid_search.best_params_)
#print the best accuracy score from sklearn metrics with values from validation set as result of adaboost method
print(grid_search.best_score_)



{'learning_rate': 0.1, 'n_estimators': 1}
0.8829131652661064


PART 2: Testing Ensemble Models

In [109]:
#Model 1: Random Forest

#initiate random forest classifier with hyperparameters determined from the validation set
foresttrain = RandomForestClassifier(max_depth=1, max_features='log2', n_estimators=50)

#fit the model using the training data
foresttrain.fit(X_train, y_train)

#run the trained linear model on the test dataset
forestypredict = foresttrain.predict(X_test)

#calculate the metrics fot the Random Forest model
print_model_metrics(y_test, forestypredict, "Random Forest")




Random Forest Model Metrics:
Accuracy: 0.95
Precision: 1.00
Recall: 0.38
F1 Score: 0.55

Confusion Matrix:
[[95  0]
 [ 5  3]]


In [104]:
#Model 2 Bagging Classifier with Decision Tree Classifier as base

#initiate bagging classifier with decision tree base and other hyperparameters identified from grid search
baggingtrain = BaggingClassifier(estimator=DecisionTreeClassifier(),max_features=8,max_samples=50,n_estimators=50)

#fit the model using the training data
baggingtrain.fit(X_train, y_train)

#run the trained bagging model on the test dataset
baggingypredict = baggingtrain.predict(X_test)

#calculate the metrics fot the Bagging model
print_model_metrics(y_test, baggingypredict, "Bagging")


Bagging Model Metrics:
Accuracy: 0.95
Precision: 0.71
Recall: 0.62
F1 Score: 0.67

Confusion Matrix:
[[93  2]
 [ 3  5]]


In [110]:
#Model 3: Boosting Classifier with Decision Tree Classifier as base

#initiate adaboost classifier with decision tree base and other hyperparameters identified from grid search
boosttrain = AdaBoostClassifier(estimator=DecisionTreeClassifier(),learning_rate=0.1,n_estimators=1)

#fit the model using the training data
boosttrain.fit(X_train, y_train)

#run the trained adaboost model on the test dataset
boostypredict = baggingtrain.predict(X_test)

#calculate the metrics fot the Boosting model
print_model_metrics(y_test, boostypredict, "Boosting")


Boosting Model Metrics:
Accuracy: 0.95
Precision: 0.71
Recall: 0.62
F1 Score: 0.67

Confusion Matrix:
[[93  2]
 [ 3  5]]


PART 3: Is it possible to build ensemble models with classifiers other than decision tree? Explain with an example

In [106]:
#Ensemble bagging model with KNN classifier

#initiate bagging classifier with KNN base
baggingtrain2 = BaggingClassifier(estimator=KNeighborsClassifier(n_neighbors=1))
#initiate bagging classifier with KNN base

#fit the model using the training data
baggingtrain2.fit(X_train, y_train)

#run the trained bagging model on the test dataset
baggingypredict2 = baggingtrain2.predict(X_test)

#calculate the metrics fot the Boosting model
print_model_metrics(y_test, baggingypredict2, "Bagging with KNN estimator")



Bagging with KNN estimator Model Metrics:
Accuracy: 0.97
Precision: 0.86
Recall: 0.75
F1 Score: 0.80

Confusion Matrix:
[[94  1]
 [ 2  6]]


Random Forest Classifier had the highest precision rating (1), as no false positive results were made by the model. However, many false negatives made the recall value and thus also the F1 score low for this model.
Using a Bagging and Boosting emsemble method with Decision Tree Classifier gave exactly the same results metrics and confusion matrix results as with a Logistic Regression and  and K-nearest Neighbours methods.
The method that performed best (as a whole as well as the best ensemble method) was using a Bagging ensemble with a K-nearest Neighbours base. This method had the highest number of true results, with only 3 total false results from the  testing dataset (103 in size). These results led to highest accuracy, precision, recall and f1 scores for this method.
