In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Load the dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/CrohnD.csv')

# Preprocess the dataset
df = df.drop(columns=['ID']).replace({'c1': 0, 'c2': 1, 'F': 0, 'M': 1})

# Separate features and target variable
X = df.drop(columns=['nrAdvE'])
y = df['nrAdvE']

# Define the columns that need to be one-hot encoded
categorical_features = ['treat']

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', StandardScaler(), X.drop(columns=categorical_features).columns)
    ])

# Apply the transformations and scale the data
X_preprocessed = preprocessor.fit_transform(X)

# Split the dataset into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed, y, test_size=0.2, random_state=42, stratify=y)

This code block prepares the CrohnD (Crohn's disease) dataset for modeling. It mounts Google Drive, loads the CSV, drops the ID column, and maps the two-level codes (c1/c2 and F/M) to 0/1. It then separates the features from the target (nrAdvE), one-hot encodes treat, and standardizes the remaining columns. Finally, it splits the data 80/20 into training and test sets, stratified on the target so that both sets have a similar class distribution. Because the preprocessor is fit on the full dataset before the split, the scaling statistics also reflect the test rows.
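One way to keep the test rows out of the preprocessing fit is to split the raw frame first and wrap the transformer and the classifier in a Pipeline. The sketch below is only an illustration under stated assumptions: it uses a small synthetic frame with hypothetical columns (treat, age, weight) and random labels, not CrohnD.csv, and the names demo, demo_preprocessor, and model are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (hypothetical columns, random labels).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "treat": rng.choice(["placebo", "d1", "d2"], size=60),
    "age": rng.normal(40, 10, size=60),
    "weight": rng.normal(70, 12, size=60),
})
demo_y = rng.integers(0, 2, size=60)

demo_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["treat"]),
    ("num", StandardScaler(), ["age", "weight"]),
])

# Split the raw frame first, so the test rows never influence the fit.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42, stratify=demo_y)

# The pipeline fits the encoder and scaler on the training rows only,
# then reuses those fitted statistics when scoring the test rows.
model = Pipeline(steps=[("prep", demo_preprocessor), ("svc", SVC(kernel="linear"))])
model.fit(Xd_train, yd_train)
test_accuracy = model.score(Xd_test, yd_test)
print("Test accuracy on synthetic data:", test_accuracy)
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the preprocessing is refit inside every cross-validation fold; the parameter names then need the step prefix (for example svc__C).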

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import numpy as np

# Define the parameter grid for linear kernel
param_grid_linear = {
    'C': np.logspace(-5, 5, 11),
    'kernel': ['linear']
}

# Create the SVC model
svc_linear = SVC(max_iter=10000)

# Perform the grid search
grid_search_linear = GridSearchCV(svc_linear, param_grid_linear, cv=5, verbose=2)
grid_search_linear.fit(X_train, y_train)

# Best parameters and score for linear kernel
print("Best parameters (linear kernel):", grid_search_linear.best_params_)
print("Best score (linear kernel):", grid_search_linear.best_score_)

This code block tunes an SVM with a linear kernel using grid search. The search tries 11 values of the regularization parameter C, logarithmically spaced from 1e-5 to 1e5, and scores each one with 5-fold cross-validation on the training set. Because no scoring argument is given, GridSearchCV uses the classifier's default metric, which for SVC is accuracy. Cross-validation does not prevent overfitting by itself; it gives a less optimistic estimate of how each C value generalizes than training accuracy would. With verbose=2, the output logs each fit along with the C value tested and its computation time. best_params_ holds the C value with the highest mean cross-validated accuracy, and best_score_ is that mean accuracy.
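To see how the score varies with C rather than only the winner, the cv_results_ dictionary can be viewed as a table. The following is a minimal sketch on synthetic data from make_classification (an assumption made so that the example runs on its own); the names Xs, ys, search, and results are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data so the sketch is self-contained (not the CrohnD data).
Xs, ys = make_classification(n_samples=100, n_features=5, random_state=0)

search = GridSearchCV(SVC(kernel="linear"), {"C": np.logspace(-2, 2, 5)}, cv=5)
search.fit(Xs, ys)

# One row per candidate: mean and spread of the validation accuracy.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

In the notebook the same table could be built from grid_search_linear.cv_results_; best_score_ is the largest value in the mean_test_score column.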

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

# Define the parameter grid for the polynomial kernel
param_grid_poly = {
    'C': np.logspace(-5, 5, 11),
    'kernel': ['poly'],
    'degree': [2, 3, 4, 5, 6],
    'coef0': np.linspace(0.8, 1.2, 5),
    'gamma': ['scale', 'auto']
}

# Create the SVC model for the polynomial kernel
svc_poly = SVC(max_iter=10000)

# Perform the grid search
grid_search_poly = GridSearchCV(svc_poly, param_grid_poly, cv=5, scoring='accuracy', verbose=2)
grid_search_poly.fit(X_train, y_train)

# Best parameters and score for the polynomial kernel
print("Best parameters (poly kernel):", grid_search_poly.best_params_)
print("Best score (poly kernel):", grid_search_poly.best_score_)

This block tunes an SVM with a polynomial kernel, searching over C, degree, coef0, and gamma. The grid contains 11 × 5 × 5 × 2 = 550 parameter combinations, so 5-fold cross-validation requires 2,750 fits. Each combination is scored by its mean accuracy across the folds. best_params_ reports the combination with the highest mean accuracy; degree and coef0 in particular control the complexity and shape of the decision boundary. An exhaustive search over a wide range improves the chance of finding a good setting within the grid, but it does not by itself guarantee that the selected model generalizes. That still has to be checked on the held-out test set.
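The size of a grid can be checked before running it. This short sketch uses scikit-learn's ParameterGrid to count the candidates in a grid that mirrors param_grid_poly; the variable names poly_grid, n_candidates, and n_fits are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as param_grid_poly in the cell above.
poly_grid = {
    "C": np.logspace(-5, 5, 11),
    "kernel": ["poly"],
    "degree": [2, 3, 4, 5, 6],
    "coef0": np.linspace(0.8, 1.2, 5),
    "gamma": ["scale", "auto"],
}

n_candidates = len(ParameterGrid(poly_grid))  # every combination of the lists
n_fits = n_candidates * 5                     # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```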

In [None]:
# Define the parameter grid for RBF kernel
param_grid_rbf = {
    'C': np.logspace(-5, 5, 11),
    'kernel': ['rbf'],
    'gamma': ['scale', 'auto']
}

# Create the SVC model
svc_rbf = SVC(max_iter=10000)

# Perform the grid search
grid_search_rbf = GridSearchCV(svc_rbf, param_grid_rbf, cv=5, verbose=2)
grid_search_rbf.fit(X_train, y_train)

# Best parameters and score for RBF kernel
print("Best parameters (RBF kernel):", grid_search_rbf.best_params_)
print("Best score (RBF kernel):", grid_search_rbf.best_score_)


This code block tunes a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. The grid search explores 11 values of C and two gamma settings (22 combinations) to maximize the mean cross-validated accuracy. The RBF kernel is a popular choice for SVMs because it can model non-linear decision boundaries. C controls the trade-off between low error on the training data and low model complexity. gamma controls how far the influence of a single training example reaches: larger values give a more tightly curved decision boundary. Here gamma is limited to scikit-learn's two built-in heuristics, 'scale' and 'auto', rather than a numeric range. The result is the best parameter combination and its mean cross-validated accuracy, which estimates how well the model generalizes to unseen data.
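The two gamma settings are shorthand for numeric values. According to the scikit-learn documentation, 'scale' uses 1 / (n_features * X.var()) and 'auto' uses 1 / n_features. The sketch below computes both on synthetic data (an assumption for self-containment) and checks that passing the computed 'scale' value gives the same fitted model; Xg, yg, and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data so the sketch runs on its own (not the CrohnD data).
Xg, yg = make_classification(n_samples=80, n_features=4, random_state=0)

gamma_scale = 1.0 / (Xg.shape[1] * Xg.var())  # documented meaning of gamma='scale'
gamma_auto = 1.0 / Xg.shape[1]                # documented meaning of gamma='auto'

by_name = SVC(kernel="rbf", gamma="scale").fit(Xg, yg)
by_value = SVC(kernel="rbf", gamma=gamma_scale).fit(Xg, yg)

# Both models should produce the same decision values.
same = np.allclose(by_name.decision_function(Xg), by_value.decision_function(Xg))
print(gamma_scale, gamma_auto, same)
```

When every column has been standardized to unit variance, X.var() is close to 1 and the two settings nearly coincide, so a numeric range such as np.logspace(-3, 1, 5) may explore more of the space.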

In [None]:
# Define the parameter grid for the sigmoid kernel
param_grid_sigmoid = {
    'C': np.logspace(-5, 5, 11),
    'kernel': ['sigmoid'],
    'coef0': np.linspace(0.8, 1.2, 5),
    'gamma': ['scale', 'auto']
}

# Create the SVC model for the sigmoid kernel
svc_sigmoid = SVC(max_iter=10000)

# Perform the grid search
grid_search_sigmoid = GridSearchCV(svc_sigmoid, param_grid_sigmoid, cv=5, scoring='accuracy', verbose=2)
grid_search_sigmoid.fit(X_train, y_train)

# Best parameters and score for the sigmoid kernel
print("Best parameters (sigmoid kernel):", grid_search_sigmoid.best_params_)
print("Best score (sigmoid kernel):", grid_search_sigmoid.best_score_)

This code block runs a grid search for an SVM with a sigmoid kernel over the regularization parameter C, the kernel's independent term coef0, and the kernel coefficient gamma. The sigmoid kernel computes tanh(gamma * <x, x'> + coef0) for each pair of samples. It does not make the decision boundary sigmoid-shaped, but it yields a non-linear boundary that can be useful for some problems. The grid search selects the parameter set with the highest mean cross-validated accuracy, and the output reports that combination and its score. The best sigmoid model achieved a slightly higher accuracy than the previous kernels, which could reflect a better fit for the data or simply overfitting to the validation folds; additional validation is needed to tell the two apart.
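The kernel formula can be checked numerically. This sketch compares a hand-written tanh(gamma * <x, x'> + coef0) with scikit-learn's sigmoid_kernel on a small random matrix; the matrix A and the chosen gamma and coef0 values are arbitrary illustrations.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # 4 samples, 3 features (arbitrary demo data)
gamma, coef0 = 0.5, 1.0       # arbitrary demo values

# Kernel matrix computed directly from the formula.
manual = np.tanh(gamma * (A @ A.T) + coef0)

# Same kernel computed by scikit-learn.
library = sigmoid_kernel(A, A, gamma=gamma, coef0=coef0)

print(np.allclose(manual, library))
```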

In [None]:
linear_search = grid_search_linear
poly_search = grid_search_poly
rbf_search = grid_search_rbf
sigmoid_search = grid_search_sigmoid

# Extract the index of the best performing model for each kernel type
best_index_linear = linear_search.best_index_
best_index_poly = poly_search.best_index_
best_index_rbf = rbf_search.best_index_
best_index_sigmoid = sigmoid_search.best_index_

# Extract the cross-validation scores for the best model of each kernel type.
# n_splits_ is the fitted number of folds; unlike .cv it is always an integer.
linear_fold_scores = [linear_search.cv_results_[f'split{i}_test_score'][best_index_linear] for i in range(linear_search.n_splits_)]
poly_fold_scores = [poly_search.cv_results_[f'split{i}_test_score'][best_index_poly] for i in range(poly_search.n_splits_)]
rbf_fold_scores = [rbf_search.cv_results_[f'split{i}_test_score'][best_index_rbf] for i in range(rbf_search.n_splits_)]
sigmoid_fold_scores = [sigmoid_search.cv_results_[f'split{i}_test_score'][best_index_sigmoid] for i in range(sigmoid_search.n_splits_)]

# Print out the scores for each fold
print("Linear kernel scores:", linear_fold_scores)
print("Poly kernel scores:", poly_fold_scores)
print("RBF kernel scores:", rbf_fold_scores)
print("Sigmoid kernel scores:", sigmoid_fold_scores)

In this code block, the best model from each SVM kernel type (linear, polynomial, RBF, and sigmoid) is examined more closely. For each kernel, the code takes the index of the best parameter combination (best_index_) and reads that combination's per-fold test scores from cv_results_. This shows how consistently the model performed across the validation folds.

The output lists the per-fold cross-validation scores of the best linear, polynomial, RBF, and sigmoid models. The linear, polynomial, and RBF kernels produce the same score in every fold (about 0.42 in three folds and 0.44 in two), while the sigmoid kernel differs in only one fold, where it reaches 0.5. Such variability can reflect differences in how well the model fits different parts of the data, or sensitivity to how the data happen to be split.
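Fold scores are easier to interpret next to a trivial baseline. The notebook does not report one, so the following is only a sketch of how it could be obtained: a DummyClassifier that always predicts the most frequent class, scored with the same 5-fold scheme. The data here are synthetic with a hypothetical 45/30/15 class split; Xb, yb, and baseline_scores are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem (hypothetical class sizes, not the CrohnD data).
rng = np.random.default_rng(0)
Xb = rng.normal(size=(90, 3))
yb = np.array([0] * 45 + [1] * 30 + [2] * 15)

# Always predicts the majority class; the features are ignored.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, Xb, yb, cv=5)
print("Baseline fold scores:", baseline_scores)
```

In the notebook, the equivalent call would use X_train and y_train. If the tuned models score no higher than this baseline, they may simply be predicting the majority class.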

In [None]:
from scipy.stats import ttest_rel

# Scores from the cross-validation
linear_scores = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
poly_scores = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
rbf_scores = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
sigmoid_scores = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.5, 0.4444444444444444]

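def perform_ttest(scores1, scores2, description):
    # Paired t-test on the per-fold scores of two models
    t_statistic, p_value = ttest_rel(scores1, scores2)
    print(f"{description} comparison:")
    print(f"T-statistic: {t_statistic}")
    print(f"P-value: {p_value}\n")
    if np.isnan(p_value):  # np was imported in an earlier cell
        # Identical fold scores give zero variance, so the test is undefined
        print(f"The test is undefined ({description}): the fold scores are identical.\n")
    elif p_value < 0.05:
        print(f"There is a statistically significant difference ({description}).\n")
    else:
        print(f"There is no statistically significant difference ({description}).\n")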

# Perform the comparisons
perform_ttest(linear_scores, poly_scores, "Linear vs. Poly")
perform_ttest(linear_scores, rbf_scores, "Linear vs. RBF")
perform_ttest(linear_scores, sigmoid_scores, "Linear vs. Sigmoid")
perform_ttest(poly_scores, rbf_scores, "Poly vs. RBF")
perform_ttest(poly_scores, sigmoid_scores, "Poly vs. Sigmoid")
perform_ttest(rbf_scores, sigmoid_scores, "RBF vs. Sigmoid")


In this code block, a statistical analysis is conducted to compare the performance of SVM models with different kernels using a paired t-test, which assesses whether the mean scores of two models are statistically different from each other. The analysis involves the linear, polynomial (poly), radial basis function (RBF), and sigmoid kernels.

The output reveals no statistically significant difference in any comparison. The 'nan' results arise where the two models have identical scores in every fold (linear, polynomial, and RBF): the paired differences are all zero, so the t statistic is 0/0 and the test is undefined. These cases should be read as "no difference observed" rather than as a test result. For the Linear vs. Sigmoid, Poly vs. Sigmoid, and RBF vs. Sigmoid comparisons, the p-values are greater than 0.05, so the differences in mean score are not statistically significant. Based on these cross-validation scores, no kernel can be said to outperform the others.
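The paired t statistic can be reproduced by hand, which also makes the 'nan' case concrete. The sketch below uses the linear and sigmoid fold scores listed in the cell above, computes t as the mean difference divided by its standard error, and compares it with ttest_rel; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel

linear = np.array([0.42105263157894735, 0.42105263157894735, 0.42105263157894735,
                   0.4444444444444444, 0.4444444444444444])
sigmoid = np.array([0.42105263157894735, 0.42105263157894735, 0.42105263157894735,
                    0.5, 0.4444444444444444])

# Paired test: work on the per-fold differences.
diff = linear - sigmoid
n = len(diff)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

t_scipy, p_scipy = ttest_rel(linear, sigmoid)
print(t_manual, t_scipy, p_scipy)

# Identical score lists have zero-variance differences, so t would be 0/0.
zero_variance = np.std(linear - linear, ddof=1) == 0.0
print("Zero variance for identical scores:", zero_variance)
```

Only one fold differs between the two lists, so the statistic comes out at -1 with 4 degrees of freedom, far from any conventional significance threshold.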

In [20]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Define a parameter grid
param_grid_mlp = {
    'hidden_layer_sizes': [(50,), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam'],
    'alpha': [0.001, 0.01],
    'learning_rate': ['constant'],
}

# Create the MLP model
mlp = MLPClassifier(max_iter=10000)

# Perform the grid search
grid_search_mlp = GridSearchCV(mlp, param_grid_mlp, cv=2, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search_mlp.fit(X_train, y_train)

# Output the best parameters and score
print("Best parameters (MLP):", grid_search_mlp.best_params_)
print("Best score (MLP):", grid_search_mlp.best_score_)

Fitting 2 folds for each of 8 candidates, totalling 16 fits
Best parameters (MLP): {'activation': 'relu', 'alpha': 0.01, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
Best score (MLP): 0.33325624421831634



This code block runs a grid search over the hyperparameters of an MLP model: the hidden layer size (hidden_layer_sizes), the activation function (activation), the weight optimizer (solver), the L2 regularization strength (alpha), and the learning rate schedule (learning_rate). solver and learning_rate each have a single value, so the search covers 2 × 2 × 2 = 8 combinations. Each combination is scored by its mean validation accuracy under 2-fold cross-validation on the training set. The SVM searches used 5 folds, so the MLP score is not estimated under the same conditions.

The output shows the best combination of parameters and its score. The best model uses the relu activation function, an alpha value of 0.01, 50 neurons in the hidden layer, a constant learning rate schedule, and the 'adam' solver, with a mean cross-validated accuracy of approximately 0.333. This is the mean over the two validation folds, not accuracy on the test set.

In [None]:
import numpy as np
from scipy import stats

# Cross-validation scores for each model
scores_linear = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
scores_poly = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
scores_rbf = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.4444444444444444, 0.4444444444444444]
scores_sigmoid = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735, 0.5, 0.4444444444444444]
scores_mlp = [0.3223866790009251, 0.3223866790009251, 0.3223866790009251, 0.3223866790009251, 0.3223866790009251]  # Assuming same score for simplicity

# Function to calculate confidence interval
def calculate_confidence_interval(scores):
    n = len(scores)
    mean = np.mean(scores)
    std = np.std(scores, ddof=1)
    se = std / np.sqrt(n)
    interval = 1.96 * se
    return mean, mean - interval, mean + interval

# Calculate and print confidence intervals for each model
for model_name, scores in zip(['Linear SVM', 'Poly SVM', 'RBF SVM', 'Sigmoid SVM', 'MLP'],
                              [scores_linear, scores_poly, scores_rbf, scores_sigmoid, scores_mlp]):
    mean, lower, upper = calculate_confidence_interval(scores)
    print(f"{model_name}: Mean = {mean:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")

This code block computes the mean accuracy and a 95% confidence interval for each of the five models from its cross-validation scores. The interval is the mean plus or minus 1.96 standard errors, where 1.96 is the normal (z) critical value. With only five scores per model, the normal approximation understates the uncertainty; a t critical value with 4 degrees of freedom (about 2.78) would give wider intervals.
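For small samples, the interval can be based on the t distribution instead. The following sketch is a variant of calculate_confidence_interval that takes its critical value from scipy.stats.t; the function name t_confidence_interval and its confidence parameter are illustrative and not part of the notebook.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(scores, confidence=0.95):
    """Mean and two-sided t-based confidence interval for a small sample."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(n)                     # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)     # about 2.776 for n = 5
    return mean, mean - t_crit * se, mean + t_crit * se

# Sigmoid SVM fold scores from the cell above.
sigmoid = [0.42105263157894735, 0.42105263157894735, 0.42105263157894735,
           0.5, 0.4444444444444444]
mean, lower, upper = t_confidence_interval(sigmoid)
print(f"Sigmoid SVM: Mean = {mean:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Either interval treats the fold scores as independent, which they are not strictly, because the training sets of different folds overlap. The result is best read as a rough indication of spread.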

The output shows the mean accuracy and 95% confidence interval for each model:

Linear SVM, Poly SVM, and RBF SVM have the same mean accuracy of 0.4304 and the same confidence interval, because their fold scores are identical.
The Sigmoid SVM has a slightly higher mean accuracy of 0.4415 but a wider confidence interval (0.4115 to 0.4715), driven by the single fold at 0.5. The interval contains the 0.4304 mean of the other SVMs, so the difference is not convincing.
The MLP model shows a lower mean accuracy of 0.3224 with a zero-width confidence interval. This is an artifact of the input: the code repeats a single assumed score five times rather than using measured per-fold scores, so the interval says nothing about the MLP's actual variability. The value also differs from the best score of about 0.333 printed by the MLP grid search.
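The assumed MLP scores could be replaced with measured ones by running 5-fold cross-validation on the selected configuration. The sketch below shows the idea on synthetic data from make_classification (an assumption for self-containment); the fixed random_state and the names Xm, ym, mlp_demo, and mlp_fold_scores are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic data so the sketch runs on its own (not the CrohnD data).
Xm, ym = make_classification(n_samples=100, n_features=6, random_state=0)

# Configuration modeled on the best parameters reported by the grid search;
# random_state is fixed here so that repeated runs give the same scores.
mlp_demo = MLPClassifier(hidden_layer_sizes=(50,), activation="relu", alpha=0.01,
                         solver="adam", max_iter=2000, random_state=0)

# One accuracy value per fold, comparable to the SVM fold scores.
mlp_fold_scores = cross_val_score(mlp_demo, Xm, ym, cv=5, scoring="accuracy")
print("MLP fold scores:", mlp_fold_scores)
```

In the notebook, the call would be cross_val_score(grid_search_mlp.best_estimator_, X_train, y_train, cv=5), and the resulting five values could replace scores_mlp.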