Q4. Investigate the possibility of predicting WHO-based Covid-19 Severity
using the species-level gut microbiome profile and specific metadata: Age,
BMI, Gender and the total number of comorbidities. Compare the relative
classification accuracy obtained using the three machine-learning based
classifiers: Random Forest, Support Vector Machines and k-Nearest
Neighbors. Identify the top 50 predictive features for this scheme.
First investigate the classification performance on the three different
groups. Then investigate the classification performance only between
mild and critical_severe groups.
Marks: 6

In [59]:
  
from google.colab import drive
drive.mount('/content/drive')

#df = pd.read_excel('/content/drive/My Drive/path/to/excel_file.xlsx')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd
import numpy as np
from scipy.stats import kruskal, wilcoxon

# Load the metadata and clr transformed abundance profiles
metadata = pd.read_excel('Assignment1_Metadata.xlsx', index_col = 0)
clr_profiles = pd.read_excel('Assignment1_ClrTrans_Species.xlsx', index_col=0)
raw_counts = pd.read_excel('Assignment_RawCount_Species.xlsx', index_col=0)


In [3]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [5]:
metadata

SampleID,Sex,Age,BMI,HTN,Diabetes,Respiratory_disease,Heart_disease,Renal_Disease,Liver_Disease,Obesity,Malignancy,Immunosuppressive_Disease,Neurological_disease,Metabolic_Disease,Cardiovascular_Disease,comorbidities_total,WHO_severity
AIIDB0330,M,63,48.995023,n,y,n,y,y,n,y,n,n,n,y,y,6,critical_severe
AIIDM0042,M,63,30.322325,y,n,n,n,n,n,n,n,n,n,y,y,3,moderate
AIIDM0318,M,41,30.322325,n,n,n,n,n,n,n,n,n,n,n,n,0,critical_severe
AIIDV1015,F,42,57.795561,n,n,n,n,n,n,y,n,n,n,y,n,2,mild
AIIDV1406,F,85,23.728191,y,y,y,y,y,n,n,y,n,n,y,y,8,mild
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AIIDM0040,F,43,30.322325,n,n,n,n,n,n,n,n,n,n,n,n,0,moderate
COVIRL-201-015,F,87,25.562130,n,n,y,y,n,n,n,n,n,n,n,y,3,critical_severe
AIIDV1085,M,78,26.543210,n,y,n,y,n,n,n,y,n,n,y,y,5,critical_severe
AIIDM0020,M,74,30.322325,y,n,n,y,n,n,n,n,n,n,y,y,4,mild


In [6]:
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [7]:
# Take an explicit copy so the encodings applied later do not modify a view of metadata
new_metadata = metadata.loc[:, ['Age', 'BMI', 'Sex', 'comorbidities_total', 'WHO_severity']].copy()
new_metadata

SampleID,Age,BMI,Sex,comorbidities_total,WHO_severity
AIIDB0330,63,48.995023,M,6,critical_severe
AIIDM0042,63,30.322325,M,3,moderate
AIIDM0318,41,30.322325,M,0,critical_severe
AIIDV1015,42,57.795561,F,2,mild
AIIDV1406,85,23.728191,F,8,mild
...,...,...,...,...,...
AIIDM0040,43,30.322325,F,0,moderate
COVIRL-201-015,87,25.562130,F,3,critical_severe
AIIDV1085,78,26.543210,M,5,critical_severe
AIIDM0020,74,30.322325,M,4,mild


In [8]:
# Convert the severity values to numeric encodings
severity_mapping = {'mild': 1, 'moderate': 2, 'critical_severe': 3}
new_metadata['WHO_severity'] = new_metadata['WHO_severity'].map(severity_mapping)


In [9]:
# Convert the gender values to numeric encodings
gender_mapping = {'F': 1, 'M': 0}
new_metadata['Sex'] = new_metadata['Sex'].map(gender_mapping)

In [10]:
new_metadata

SampleID,Age,BMI,Sex,comorbidities_total,WHO_severity
AIIDB0330,63,48.995023,0,6,3
AIIDM0042,63,30.322325,0,3,2
AIIDM0318,41,30.322325,0,0,3
AIIDV1015,42,57.795561,1,2,1
AIIDV1406,85,23.728191,1,8,1
...,...,...,...,...,...
AIIDM0040,43,30.322325,1,0,2
COVIRL-201-015,87,25.562130,1,3,3
AIIDV1085,78,26.543210,0,5,3
AIIDM0020,74,30.322325,0,4,1


In [11]:
data = pd.concat([clr_profiles, new_metadata], axis=1)
#data = pd.concat([normalized_data, new_metadata], axis=1)
data= data.reset_index(drop=True)
data



Unnamed: 0,Bifidobacterium_dentium,Butyricimonas_synergistica,Parvibacter_caecicola,Clostridium_sartagoforme,Ruminococcus_gauvreauii,Bacteroides_stercoris,Bacteroides_plebeius,Streptococcus_parasanguinis,Enterococcus_rivorum,Clostridium_methylpentosum,...,Veillonella_atypica,Enterococcus_casseliflavus,Olsenella_profusa,Lactobacillus_reuteri,Peptococcus_niger,Age,BMI,Sex,comorbidities_total,WHO_severity
0,2.484907,2.197225,0.000000,0.000000,3.332205,7.753624,0.000000,4.442651,0.000000,4.919981,...,1.098612,0.000000,1.609438,4.418841,0.000000,63,48.995023,0,6,3
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.693147,0.693147,...,0.000000,0.000000,0.000000,0.000000,0.000000,63,30.322325,0,3,2
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.258097,...,3.761200,0.000000,0.000000,0.000000,0.000000,41,30.322325,0,0,3
3,0.693147,0.000000,0.000000,0.693147,1.609438,0.000000,0.000000,0.000000,0.000000,0.693147,...,0.000000,0.000000,0.693147,0.000000,0.000000,42,57.795561,1,2,1
4,0.000000,0.000000,0.000000,0.000000,2.564949,0.000000,0.000000,0.000000,0.000000,2.079442,...,0.000000,0.000000,0.000000,0.000000,0.000000,85,23.728191,1,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,1.609438,0.000000,0.000000,0.000000,1.609438,0.000000,0.000000,0.693147,0.000000,0.000000,...,2.833213,0.000000,0.000000,0.000000,0.000000,43,30.322325,1,0,2
78,7.774856,0.000000,0.000000,3.555348,1.098612,0.000000,6.023448,4.025352,0.693147,6.431331,...,0.000000,2.197225,0.000000,0.000000,0.000000,87,25.562130,1,3,3
79,0.000000,0.000000,0.000000,0.000000,0.000000,1.386294,0.000000,0.000000,0.000000,1.609438,...,0.000000,0.000000,0.693147,0.000000,4.779123,78,26.543210,0,5,3
80,0.000000,0.000000,0.000000,0.000000,3.044522,0.000000,0.000000,0.000000,0.000000,2.995732,...,0.000000,0.000000,0.000000,0.000000,0.000000,74,30.322325,0,4,1


In [12]:
nan_sum = data.isnull().sum()
print(nan_sum)

Bifidobacterium_dentium       0
Butyricimonas_synergistica    0
Parvibacter_caecicola         0
Clostridium_sartagoforme      0
Ruminococcus_gauvreauii       0
                             ..
Age                           0
BMI                           0
Sex                           0
comorbidities_total           0
WHO_severity                  0
Length: 280, dtype: int64


In [13]:
#data = data.dropna(subset=['Age'])

In [14]:
data

Unnamed: 0,Bifidobacterium_dentium,Butyricimonas_synergistica,Parvibacter_caecicola,Clostridium_sartagoforme,Ruminococcus_gauvreauii,Bacteroides_stercoris,Bacteroides_plebeius,Streptococcus_parasanguinis,Enterococcus_rivorum,Clostridium_methylpentosum,...,Veillonella_atypica,Enterococcus_casseliflavus,Olsenella_profusa,Lactobacillus_reuteri,Peptococcus_niger,Age,BMI,Sex,comorbidities_total,WHO_severity
0,2.484907,2.197225,0.000000,0.000000,3.332205,7.753624,0.000000,4.442651,0.000000,4.919981,...,1.098612,0.000000,1.609438,4.418841,0.000000,63,48.995023,0,6,3
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.693147,0.693147,...,0.000000,0.000000,0.000000,0.000000,0.000000,63,30.322325,0,3,2
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.258097,...,3.761200,0.000000,0.000000,0.000000,0.000000,41,30.322325,0,0,3
3,0.693147,0.000000,0.000000,0.693147,1.609438,0.000000,0.000000,0.000000,0.000000,0.693147,...,0.000000,0.000000,0.693147,0.000000,0.000000,42,57.795561,1,2,1
4,0.000000,0.000000,0.000000,0.000000,2.564949,0.000000,0.000000,0.000000,0.000000,2.079442,...,0.000000,0.000000,0.000000,0.000000,0.000000,85,23.728191,1,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,1.609438,0.000000,0.000000,0.000000,1.609438,0.000000,0.000000,0.693147,0.000000,0.000000,...,2.833213,0.000000,0.000000,0.000000,0.000000,43,30.322325,1,0,2
78,7.774856,0.000000,0.000000,3.555348,1.098612,0.000000,6.023448,4.025352,0.693147,6.431331,...,0.000000,2.197225,0.000000,0.000000,0.000000,87,25.562130,1,3,3
79,0.000000,0.000000,0.000000,0.000000,0.000000,1.386294,0.000000,0.000000,0.000000,1.609438,...,0.000000,0.000000,0.693147,0.000000,4.779123,78,26.543210,0,5,3
80,0.000000,0.000000,0.000000,0.000000,3.044522,0.000000,0.000000,0.000000,0.000000,2.995732,...,0.000000,0.000000,0.000000,0.000000,0.000000,74,30.322325,0,4,1


In [68]:
X = data.iloc[:, 0:-1] # features
y = data.iloc[:, -1] # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [69]:
X_train

Unnamed: 0,Bifidobacterium_dentium,Butyricimonas_synergistica,Parvibacter_caecicola,Clostridium_sartagoforme,Ruminococcus_gauvreauii,Bacteroides_stercoris,Bacteroides_plebeius,Streptococcus_parasanguinis,Enterococcus_rivorum,Clostridium_methylpentosum,...,Alistipes_massiliensis,Veillonella_atypica,Enterococcus_casseliflavus,Olsenella_profusa,Lactobacillus_reuteri,Peptococcus_niger,Age,BMI,Sex,comorbidities_total
62,4.343805,1.386294,0.693147,0.000000,2.397895,0.693147,8.843326,3.401197,0.000000,4.779123,...,1.098612,0.693147,0.000000,0.000000,0.693147,0.000000,72,32.658526,1,2
56,1.098612,0.000000,0.000000,0.000000,0.693147,1.098612,0.000000,0.000000,0.000000,3.044522,...,3.367296,0.000000,0.000000,0.693147,0.000000,0.000000,44,30.322325,1,3
40,5.068904,0.000000,0.000000,0.000000,5.176150,5.703782,6.799056,4.499810,0.000000,5.209486,...,0.693147,5.176150,0.000000,1.609438,0.000000,0.000000,60,21.907582,0,0
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.218876,1.609438,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,89,30.322325,1,7
78,7.774856,0.000000,0.000000,3.555348,1.098612,0.000000,6.023448,4.025352,0.693147,6.431331,...,0.000000,0.000000,2.197225,0.000000,0.000000,0.000000,87,25.562130,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20,0.693147,0.000000,2.708050,2.197225,3.555348,0.000000,1.791759,4.682131,3.496508,7.636752,...,0.000000,0.000000,1.945910,2.944439,0.000000,1.098612,69,30.491487,0,2
60,5.433722,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.382027,...,0.000000,0.000000,0.000000,1.098612,0.000000,0.000000,76,30.322325,0,6
71,0.000000,0.000000,0.000000,0.000000,1.945910,0.000000,0.000000,0.000000,0.000000,3.135494,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,51,27.681661,0,0
14,0.000000,0.000000,0.000000,0.000000,2.772589,0.000000,0.000000,0.000000,0.000000,4.158883,...,0.000000,0.000000,0.000000,0.000000,1.098612,0.000000,70,25.641873,1,1


In [70]:
rfc = RandomForestClassifier(random_state=4)
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)
rfc_accuracy = accuracy_score(y_test, rfc_y_pred)
rfc_cm = confusion_matrix(y_test, rfc_y_pred)
rfc_report = classification_report(y_test, rfc_y_pred)

In [71]:
rfc_accuracy

0.47058823529411764

In [76]:
# Fit and evaluate Support Vector Machine Classifier
svc = SVC(kernel='linear', C=0.5, random_state=4)
svc.fit(X_train, y_train)
svc_y_pred = svc.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_y_pred)
svc_cm = confusion_matrix(y_test, svc_y_pred)
svc_report = classification_report(y_test, svc_y_pred)
print("Support Vector Machine Classifier accuracy:", svc_accuracy)

Support Vector Machine Classifier accuracy: 0.4117647058823529


In [38]:
accuracy

0.52

In [75]:
# Three-class classification without top-feature selection.
# Compare the classification accuracy of the three machine-learning classifiers
# (Random Forest, Support Vector Machine, k-Nearest Neighbors) on the same 80/20 split.
# Fit and evaluate Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=150, random_state=4)
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)
rfc_accuracy = accuracy_score(y_test, rfc_y_pred)
rfc_cm = confusion_matrix(y_test, rfc_y_pred)
rfc_report = classification_report(y_test, rfc_y_pred)

# Fit and evaluate Support Vector Machine Classifier
svc = SVC(kernel='linear', C=0.5, random_state=4)
svc.fit(X_train, y_train)
svc_y_pred = svc.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_y_pred)
svc_cm = confusion_matrix(y_test, svc_y_pred)
svc_report = classification_report(y_test, svc_y_pred)

# Fit and evaluate k-Nearest Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
knn_y_pred = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_y_pred)
knn_cm = confusion_matrix(y_test, knn_y_pred)
knn_report = classification_report(y_test, knn_y_pred)


# Print the classification accuracies
print("Random Forest Classifier accuracy:", rfc_accuracy)
print("Support Vector Machine Classifier accuracy:", svc_accuracy)
print("k-Nearest Neighbors Classifier accuracy:", knn_accuracy)

# Print the confusion matrices
print("Random Forest Classifier confusion matrix:\n", rfc_cm)
print("Support Vector Machine Classifier confusion matrix:\n", svc_cm)
print("k-Nearest Neighbors Classifier confusion matrix:\n", knn_cm)

# Print the classification reports
print("Random Forest Classifier classification report:\n", rfc_report)
print("Support Vector Machine Classifier classification report:\n", svc_report)
print("k-Nearest Neighbors Classifier classification report:\n", knn_report)



Random Forest Classifier accuracy: 0.5294117647058824
Support Vector Machine Classifier accuracy: 0.4117647058823529
k-Nearest Neighbors Classifier accuracy: 0.35294117647058826
Random Forest Classifier confusion matrix:
 [[4 1 0]
 [1 1 1]
 [5 0 4]]
Support Vector Machine Classifier confusion matrix:
 [[2 1 2]
 [1 1 1]
 [5 0 4]]
k-Nearest Neighbors Classifier confusion matrix:
 [[4 1 0]
 [1 1 1]
 [5 3 1]]
Random Forest Classifier classification report:
               precision    recall  f1-score   support

           1       0.40      0.80      0.53         5
           2       0.50      0.33      0.40         3
           3       0.80      0.44      0.57         9

    accuracy                           0.53        17
   macro avg       0.57      0.53      0.50        17
weighted avg       0.63      0.53      0.53        17

Support Vector Machine Classifier classification report:
               precision    recall  f1-score   support

           1       0.25      0.40      0.31     

In [73]:
# Create a DataFrame with the actual and predicted severity statuses for Random Forest Classifier
rfc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': rfc_y_pred})

# Create a DataFrame with the actual and predicted severity statuses for Support Vector Machine Classifier
svc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': svc_y_pred})

# Create a DataFrame with the actual and predicted severity statuses for k-Nearest Neighbors Classifier
knn_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': knn_y_pred})

# Print the classification tables
print("Random Forest Classifier classification table:\n", rfc_table)
print("Support Vector Machine Classifier classification table:\n", svc_table)
print("k-Nearest Neighbors Classifier classification table:\n", knn_table)

Random Forest Classifier classification table:
     Actual Severity Status  Predicted Severity Status
30                       3                          1
0                        3                          3
22                       1                          1
31                       3                          1
18                       2                          1
28                       3                          1
10                       2                          3
53                       3                          3
4                        1                          2
12                       1                          1
49                       3                          3
33                       1                          1
68                       2                          2
35                       3                          1
69                       1                          1
45                       3                          1
75                       3        
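
A test split of 17 samples gives a noisy accuracy estimate, so a comparison of the three classifiers may be more stable under stratified cross-validation. The following is a minimal sketch under stated assumptions, not the method used in this notebook: `X_demo` and `y_demo` are synthetic stand-ins for `X` and `y`, and the `StandardScaler` step for the SVM and kNN models is an added choice that does not appear in the original cells.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 82 samples, 20 features, classes 1-3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(82, 20))
y_demo = rng.integers(1, 4, size=82)

# Same three model families as the notebook; scaling is an added assumption
# because SVM and kNN are sensitive to feature scale.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear", C=0.5)),
    "kNN (k=3)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would report a mean and spread per classifier instead of a single split-dependent number.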

In [75]:
# To identify the top 50 predictive features we tried SelectKBest, information gain,
# Mean Absolute Difference (MAD) and Kendall tau. The abundance data are sparse, and
# Kendall tau worked best, so its ranking is the one used downstream.
# Note: the ranking is computed on all samples (X, y), before the train/test split.
from scipy.stats import kendalltau
from sklearn.feature_selection import SelectKBest, chi2

# SelectKBest with the chi-squared test, kept for reference; X_new is not used downstream
skb = SelectKBest(chi2, k=50)
X_new = skb.fit_transform(X, y)

# Calculate the Kendall correlation between each feature and the target
correlations = []
for feature in X.columns:
    corr, pvalue = kendalltau(data[feature], data['WHO_severity'])
    correlations.append(corr)
    
# Sort the features by descending absolute correlation coefficient
feature_ranking = sorted(zip(X.columns, correlations), key=lambda x: abs(x[1]), reverse=True)
top_features = [f[0] for f in feature_ranking[:50]]
print("Top 50 predictive features:", top_features)


Top 50 predictive features: ['Methanobrevibacter_smithii', 'Eubacterium_contortum', 'Clostridium_innocuum', 'Actinomyces_naeslundii', 'Clostridium_clostridioforme', 'Hespellia_porcina', 'Bacteroides_acidifaciens', 'Bacteroides_plebeius', 'Blautia_stercoris', 'Clostridium_irregulare', 'Anaerorhabdus_furcosa', 'Clostridium_sartagoforme', 'Clostridium_methylpentosum', 'Finegoldia_magna', 'Eggerthella_lenta', 'Dialister_invisus', 'Ruminococcus_lactaris', 'Clostridium_disporicum', 'Alistipes_finegoldii', 'Solobacterium_moorei', 'Clostridium_oroticum', 'Pseudoflavonifractor_capillosus', 'Melissococcus_plutonius', 'Robinsoniella_peoriensis', 'Enterococcus_casseliflavus', 'Bacteroides_nordii', 'Clostridium_lavalense', 'Actinomyces_graevenitzii', 'Clostridium_hathewayi', 'Lactonifactor_longoviformis', 'Eubacterium_oxidoreducens', 'Clostridium_saccharogumia', 'Clostridium_sporosphaeroides', 'Natranaerovirga_pectinivora', 'Lactobacillus_fermentum', 'Anaerosporobacter_mobilis', 'Acetivibrio_ethano
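
Because the ranking above uses every sample, the held-out test samples also influence which features are chosen. One way to avoid that is to rank on the training split only and then apply the chosen columns to both splits. The following is a minimal sketch under stated assumptions: `rank_by_kendall`, `X_demo` and `y_demo` are hypothetical names, and the data are synthetic with one deliberately informative column.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau
from sklearn.model_selection import train_test_split


def rank_by_kendall(X_fit, y_fit, k=50):
    """Return the k column names with the largest |Kendall tau| against y_fit."""
    scores = {}
    for col in X_fit.columns:
        tau, _ = kendalltau(X_fit[col], y_fit)
        # kendalltau yields NaN for constant columns; treat that as no association.
        scores[col] = 0.0 if np.isnan(tau) else abs(tau)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Synthetic stand-in data (assumption): 82 samples, 30 features, classes 1-3.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(82, 30)),
                      columns=[f"species_{i}" for i in range(30)])
y_demo = pd.Series(rng.integers(1, 4, size=82), name="WHO_severity")
# Make one column track the label so the ranking has something to find.
X_demo["species_0"] = y_demo + rng.normal(scale=0.1, size=82)

# Split first, then rank on the training part only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
top_demo = rank_by_kendall(X_tr, y_tr, k=10)

# Apply the same selected columns to both splits.
X_tr_sel, X_te_sel = X_tr[top_demo], X_te[top_demo]
print(top_demo[:3])
```

With this ordering the test samples play no part in the feature choice, so the test accuracy is a fairer estimate of performance on unseen patients.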

In [76]:
y

0     3
1     2
2     3
3     1
4     1
     ..
77    2
78    3
79    3
80    1
81    3
Name: WHO_severity, Length: 82, dtype: int64

In [77]:
# Three-class classification with the top 50 features.
# Investigate the classification performance on the three groups and compare the accuracy of the three classifiers: Random Forest, Support Vector Machines and k-Nearest Neighbors (with feature selection).
# Split the data into training and test sets (70/30 here, versus 80/20 in the run without feature selection)
X_train, X_test, y_train, y_test = train_test_split(X[top_features], y, test_size=0.3, random_state=42)

In [78]:
y_test.shape

(25,)

In [79]:
# Train and evaluate a random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
rfc_acc = accuracy_score(y_test, rfc_pred)
rfc_cm = confusion_matrix(y_test, rfc_pred)
rfc_report = classification_report(y_test, rfc_pred)


# Train and evaluate a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_acc = accuracy_score(y_test, knn_pred)
knn_cm = confusion_matrix(y_test, knn_pred)
knn_report = classification_report(y_test, knn_pred)


# Train and evaluate an SVM classifier
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
svm_acc = accuracy_score(y_test, svm_pred)
svc_cm = confusion_matrix(y_test, svm_pred)
svc_report = classification_report(y_test, svm_pred)


print("Random Forest Accuracy:", rfc_acc)
print("KNN Accuracy:", knn_acc)
print("SVM Accuracy:", svm_acc)

# Print the confusion matrices
print("Random Forest Classifier confusion matrix:\n", rfc_cm)
print("Support Vector Machine Classifier confusion matrix:\n", svc_cm)
print("k-Nearest Neighbors Classifier confusion matrix:\n", knn_cm)

# Print the classification reports
print("Random Forest Classifier classification report:\n", rfc_report)
print("Support Vector Machine Classifier classification report:\n", svc_report)
print("k-Nearest Neighbors Classifier classification report:\n", knn_report)

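# Significance: accuracy is higher than in the run without feature selection (RF 0.64 vs 0.53, SVM 0.56 vs 0.41, kNN 0.40 vs 0.35).
# Caveat: the features were ranked on all samples, including the test samples, and the split size and some hyperparameters differ between the two runs, so the gain is likely optimistic.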

Random Forest Accuracy: 0.64
KNN Accuracy: 0.4
SVM Accuracy: 0.56
Random Forest Classifier confusion matrix:
 [[5 0 1]
 [1 2 2]
 [3 2 9]]
Support Vector Machine Classifier confusion matrix:
 [[5 0 1]
 [2 1 2]
 [3 3 8]]
k-Nearest Neighbors Classifier confusion matrix:
 [[5 1 0]
 [3 0 2]
 [9 0 5]]
Random Forest Classifier classification report:
               precision    recall  f1-score   support

           1       0.56      0.83      0.67         6
           2       0.50      0.40      0.44         5
           3       0.75      0.64      0.69        14

    accuracy                           0.64        25
   macro avg       0.60      0.63      0.60        25
weighted avg       0.65      0.64      0.64        25

Support Vector Machine Classifier classification report:
               precision    recall  f1-score   support

           1       0.50      0.83      0.62         6
           2       0.25      0.20      0.22         5
           3       0.73      0.57      0.64        1

In [80]:
# After feature selection the test accuracy improved, as shown above.
# Create a DataFrame with the actual and predicted severity statuses for Random Forest Classifier
rfc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': rfc_pred})

# Create a DataFrame with the actual and predicted severity statuses for Support Vector Machine Classifier
svc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': svm_pred})

# Create a DataFrame with the actual and predicted severity statuses for k-Nearest Neighbors Classifier
knn_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': knn_pred})

# Print the classification tables
print("Random Forest Classifier classification table:\n", rfc_table)
print("Support Vector Machine Classifier classification table:\n", svc_table)
print("k-Nearest Neighbors Classifier classification table:\n", knn_table)

Random Forest Classifier classification table:
     Actual Severity Status  Predicted Severity Status
30                       3                          2
0                        3                          3
22                       1                          1
31                       3                          2
18                       2                          1
28                       3                          1
10                       2                          3
53                       3                          3
4                        1                          1
12                       1                          1
49                       3                          3
33                       1                          1
68                       2                          2
35                       3                          1
69                       1                          3
45                       3                          3
75                       3        

In [81]:
# Investigate the classification performance only between the mild and critical_severe groups (binary classification)
top_features = ['Methanobrevibacter_smithii', 'Eubacterium_contortum', 'Clostridium_innocuum', 'Actinomyces_naeslundii', 'Clostridium_clostridioforme', 'Hespellia_porcina', 'Bacteroides_acidifaciens', 'Bacteroides_plebeius', 'Blautia_stercoris', 'Clostridium_irregulare', 'Anaerorhabdus_furcosa', 'Clostridium_sartagoforme', 'Clostridium_methylpentosum', 'Finegoldia_magna', 'Eggerthella_lenta', 'Dialister_invisus', 'Ruminococcus_lactaris', 'Clostridium_disporicum', 'Alistipes_finegoldii', 'Solobacterium_moorei', 'Clostridium_oroticum', 'Pseudoflavonifractor_capillosus', 'Melissococcus_plutonius', 'Robinsoniella_peoriensis', 'Enterococcus_casseliflavus', 'Bacteroides_nordii', 'Clostridium_lavalense', 'Actinomyces_graevenitzii', 'Clostridium_hathewayi', 'Lactonifactor_longoviformis', 'Eubacterium_oxidoreducens', 'Clostridium_saccharogumia', 'Clostridium_sporosphaeroides', 'Natranaerovirga_pectinivora', 'Lactobacillus_fermentum', 'Anaerosporobacter_mobilis', 'Acetivibrio_ethanolgignens', 'Escherichia.Shigella_flexneri', 'Megasphaera_micronuciformis', 'Eubacterium_cylindroides', 'Prevotella_stercorea', 'Clostridium_tertium', 'Clostridium_citroniae', 'Clostridium_leptum', 'Enterococcus_asini', 'Hydrogenoanaerobacterium_saccharovorans', 'Clostridium_asparagiforme', 'Peptococcus_niger', 'Marvinbryantia_formatexigens', 'Streptococcus_sanguinis']
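# Note: the data[top_features] selection on the next line is overwritten by the
# assignment after it, so the binary models below are trained on all species in
# clr_profiles (without the metadata columns), not on the top 50 features.
data_new = data[top_features]
data_new = pd.concat([clr_profiles, new_metadata['WHO_severity']], axis=1)
data_new = data_new.reset_index(drop=True)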



In [82]:
# Drop the rows with moderate severity (WHO_severity == 2)
data_new = data_new.drop(index=data_new[data_new['WHO_severity'] == 2].index)
X = data_new.iloc[:, 0:-1] # features
y = data_new.iloc[:, -1] # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [83]:
# Compare the classification accuracy of the three classifiers (Random Forest, Support Vector Machines, k-Nearest Neighbors) on the binary task, without feature selection.
# Fit and evaluate Random Forest Classifier
# (no random_state is set here, so the Random Forest result can vary between runs)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)
rfc_accuracy = accuracy_score(y_test, rfc_y_pred)
rfc_cm = confusion_matrix(y_test, rfc_y_pred)
rfc_report = classification_report(y_test, rfc_y_pred)


# Fit and evaluate Support Vector Machine Classifier
svc = SVC(kernel='linear', C=1.0, random_state=42)
svc.fit(X_train, y_train)
svc_y_pred = svc.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_y_pred)
svc_cm = confusion_matrix(y_test, svc_y_pred)
svc_report = classification_report(y_test, svc_y_pred)

# Fit and evaluate k-Nearest Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
knn_y_pred = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_y_pred)
knn_cm = confusion_matrix(y_test, knn_y_pred)
knn_report = classification_report(y_test, knn_y_pred)


# Print the classification accuracies
print("Random Forest Classifier accuracy:", rfc_accuracy)
print("Support Vector Machine Classifier accuracy:", svc_accuracy)
print("k-Nearest Neighbors Classifier accuracy:", knn_accuracy)

# Print the confusion matrices
print("Random Forest Classifier confusion matrix:\n", rfc_cm)
print("Support Vector Machine Classifier confusion matrix:\n", svc_cm)
print("k-Nearest Neighbors Classifier confusion matrix:\n", knn_cm)

# Print the classification reports
print("Random Forest Classifier classification report:\n", rfc_report)
print("Support Vector Machine Classifier classification report:\n", svc_report)
print("k-Nearest Neighbors Classifier classification report:\n", knn_report)




Random Forest Classifier accuracy: 0.7894736842105263
Support Vector Machine Classifier accuracy: 0.7894736842105263
k-Nearest Neighbors Classifier accuracy: 0.631578947368421
Random Forest Classifier confusion matrix:
 [[8 2]
 [2 7]]
Support Vector Machine Classifier confusion matrix:
 [[7 3]
 [1 8]]
k-Nearest Neighbors Classifier confusion matrix:
 [[5 5]
 [2 7]]
Random Forest Classifier classification report:
               precision    recall  f1-score   support

           1       0.80      0.80      0.80        10
           3       0.78      0.78      0.78         9

    accuracy                           0.79        19
   macro avg       0.79      0.79      0.79        19
weighted avg       0.79      0.79      0.79        19

Support Vector Machine Classifier classification report:
               precision    recall  f1-score   support

           1       0.88      0.70      0.78        10
           3       0.73      0.89      0.80         9

    accuracy                      

In [84]:

# Create a DataFrame with the actual and predicted severity statuses for Random Forest Classifier
rfc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': rfc_y_pred})

# Create a DataFrame with the actual and predicted severity statuses for Support Vector Machine Classifier
svc_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': svc_y_pred})

# Create a DataFrame with the actual and predicted severity statuses for k-Nearest Neighbors Classifier
knn_table = pd.DataFrame({'Actual Severity Status': y_test, 'Predicted Severity Status': knn_y_pred})

# Print the classification tables
print("Random Forest Classifier classification table:\n", rfc_table)
print("Support Vector Machine Classifier classification table:\n", svc_table)
print("k-Nearest Neighbors Classifier classification table:\n", knn_table)

Random Forest Classifier classification table:
     Actual Severity Status  Predicted Severity Status
80                       1                          1
75                       3                          3
0                        3                          3
57                       3                          3
6                        3                          1
49                       3                          3
20                       3                          3
15                       1                          1
33                       1                          1
79                       3                          1
72                       1                          1
12                       1                          1
53                       3                          3
16                       1                          3
47                       1                          3
64                       3                          3
4                        1        

In [85]:
# Binary classification (mild vs critical_severe) gave higher accuracy than the three-class runs (RF 0.79, SVM 0.79, kNN 0.63).
# The binary models were fitted on all species features, and a two-class problem has a higher chance level than a three-class one, so these accuracies are not directly comparable with the three-class results.
# Pipeline:
# data import -> three-class classification without feature selection -> three-class classification with Kendall tau filter feature selection -> binary classification (mild vs critical_severe) -> comparison of accuracy (classification reports and prediction tables).
# Note: overall accuracy is the proportion of correctly classified samples out of all samples. It is a basic summary of classifier performance.
# The classification report adds per-class precision, recall and F1-score, computed from the true/false positives and negatives of each class, so it shows how well the model performs on each class separately.
# Overall accuracy can be misleading when the classes are imbalanced: a model can score well by predicting the majority class for every sample. The per-class metrics expose this.
# Please therefore consider the classification reports, not only the overall accuracy, when evaluating this assignment.
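
The difference between overall accuracy and the per-class metrics can be worked through by hand from the Random Forest confusion matrix printed for the three-class run without feature selection. The sketch below recomputes both from that matrix in plain Python. The balanced-accuracy line (the mean of the per-class recalls) is an added summary that the notebook itself does not report.

```python
# Random Forest confusion matrix from the three-class run without feature
# selection (rows = actual class, columns = predicted class; classes 1, 2, 3).
cm = [[4, 1, 0],
      [1, 1, 1],
      [5, 0, 4]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))

# Overall accuracy: correct predictions over all predictions.
accuracy = correct / total

# Recall per class: correct predictions over the row total (actual members).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision per class: correct predictions over the column total (predicted members).
col_sums = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]
precision = [cm[i][i] / col_sums[i] for i in range(n_classes)]

# Balanced accuracy: unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall) / n_classes

print("accuracy:", round(accuracy, 2))
print("recall:", [round(r, 2) for r in recall])
print("precision:", [round(p, 2) for p in precision])
print("balanced accuracy:", round(balanced_accuracy, 2))
```

The recalls of 0.80, 0.33 and 0.44 show that the overall accuracy of 0.53 hides weak performance on the moderate and critical_severe classes.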