# Part 1. Equation of a Slime

How many late days are you using for this assignment? 0 Days

In [64]:
# Imports section
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import functions from Scikit-Learn for regression, classification, and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import PolynomialFeatures

## 1. Loading the dataset

In [65]:
# Using pandas load the dataset
data_slime = pd.read_csv('science_data_large.csv')
# Output the first 15 rows of the data
print("First 15 rows of the dataset:")
print(data_slime.head(15))
# Display a summary of the table information (data types, non-null counts, etc.)
print("\nDataset Summary:")
print(data_slime.info())

First 15 rows of the dataset:
    Temperature °C  Mols KCL     Size nm^3
0              469       647  6.244743e+05
1              403       694  5.779610e+05
2              302       975  6.196847e+05
3              779       916  1.460449e+06
4              901        18  4.325726e+04
5              545       637  7.124634e+05
6              660       519  7.006960e+05
7              143       869  2.718260e+05
8               89       461  8.919803e+04
9              294       776  4.770210e+05
10             991       117  2.441771e+05
11             307       781  5.006455e+05
12             206        70  3.145200e+04
13             437       599  5.390215e+05
14             566        75  9.185271e+04

Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        100

## 2. Splitting the dataset

In [66]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = data_slime.iloc[:, :-1]
y = data_slime.iloc[:, -1]
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
# For grading consistency use random_state=42 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

## 3. Perform a Linear Regression

In [67]:
# Use sklearn to train a model on the training set
model = LinearRegression()
model.fit(X_train, y_train)
# Create a sample datapoint and predict the output of that sample with the trained model
sample_datapoint_df = X_test.iloc[[0]]
sample_prediction = model.predict(sample_datapoint_df)
print("Sample datapoint (features):", sample_datapoint_df)
print("Prediction for the sample datapoint:", sample_prediction)
# Report the score for that model using the default score function property of the SKLearn model, 
# in your own words (markdown, not code) explain what the score means
print("Model Score (R^2):", model.score(X_test, y_test))
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
coefficients = model.coef_
intercept = model.intercept_
print("Coefficients:", coefficients)
print("Intercept:", intercept)

Sample datapoint (features):      Temperature °C  Mols KCL
521             100       541
Prediction for the sample datapoint: [235911.1927226]
Model Score (R^2): 0.8552472077276095
Coefficients: [ 866.14641337 1032.69506649]
Intercept: -409391.4795834075


Write the linear equation of a slime: $h(x) = -409391.47958 + 866.14641 \cdot x_1 + 1032.69507 \cdot x_2$

FOR REFERENCE: $x_1$ represents temperature (°C) and $x_2$ represents the amount of KCl (mol)
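As a quick sanity check, the equation can be evaluated by hand for the sample datapoint printed above (temperature 100, 541 mol KCl). This is a minimal sketch: the coefficients are copied from the printed output, and the function name `h_linear` is only illustrative.

```python
# Coefficients copied from the printed output of the fitted model.
INTERCEPT = -409391.4795834075
COEF_TEMPERATURE = 866.14641337  # multiplies x1 (Temperature °C)
COEF_KCL = 1032.69506649         # multiplies x2 (Mols KCL)


def h_linear(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis h(x) by hand (illustrative helper)."""
    return INTERCEPT + COEF_TEMPERATURE * temperature_c + COEF_KCL * mols_kcl


# Sample datapoint from the test set: Temperature = 100, Mols KCL = 541.
prediction = h_linear(100, 541)
print(f"h(100, 541) = {prediction:.2f}")  # prints: h(100, 541) = 235911.19
```

The result agrees with the value returned by `model.predict` for the same row, up to rounding of the printed coefficients.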

Report on score and explain meaning: The R² score of 0.85525 means that
approximately 85.52% of the variance in the target variable
(slime size, in nm³) on the held-out test set is explained by the
two features (temperature and KCl amount) in the model. The
remaining 14.48% of the variance is due to noise or to structure
that a purely linear relationship cannot capture.
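For reference, the default `score` of a scikit-learn regressor is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on a few made-up numbers (not the slime data); the helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen only for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)               # predictions match exactly
mean_only = r_squared(y_true, [2.5] * 4)          # always predict the mean
close = r_squared(y_true, [1.1, 1.9, 3.2, 3.8])   # small errors

print(perfect, mean_only, round(close, 2))
```

A score of 1 means the predictions are exact, 0 means the model does no better than predicting the mean, and values in between give the fraction of variance explained.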

## 4. Use Cross Validation

In [68]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
# For grading consistency use n_splits=5 and random_state=42
kf = KFold(n_splits=5, random_state=42, shuffle=True)
cv_scores = cross_val_score(model, X, y, cv=kf)
print("Cross-validation scores (R² for each fold):", cv_scores)
print("Mean CV Score (R²):", cv_scores.mean())
print("Standard Deviation:", cv_scores.std())
# Report on their finding and their significance

Cross-validation scores (R² for each fold): [0.86151889 0.82742341 0.87195173 0.88166206 0.85609101]
Mean CV Score (R²): 0.8597294202684646
Standard Deviation: 0.01838773713930639


Write findings here: The mean R² across the five folds is 0.85973, so on average the linear model explains about 85.97% of the variance in slime size. The fold scores range from 0.82742 to 0.88166 with a standard deviation of 0.01839, so performance is consistent across the data subsets and does not depend much on which rows land in the test fold. The mean is also close to the single-split score of 0.85525 from Section 3, which suggests that split was representative. Overall, the model generalizes consistently to held-out data, although roughly 14% of the variance remains unexplained by a linear fit.
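To make explicit what `cross_val_score` does with a `KFold` splitter, here is a minimal sketch that repeats the fit-and-score loop by hand and compares the result. It runs on synthetic stand-in data, an assumption made so the block is self-contained; the names `X_demo`, `y_demo`, and `kf_demo` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption for illustration; not the slime dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 50, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fit a fresh model on each training fold, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_model = LinearRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The helper performs the same splits and returns one R² per fold.
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo)

print("Manual:", manual_scores)
print("Helper:", auto_scores)
print("Mean and std:", manual_scores.mean(), manual_scores.std())
```

Because the splitter uses a fixed integer `random_state`, both approaches see the same folds, so the two score arrays should agree.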

## 5. Using Polynomial Regression

In [69]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model_poly = LinearRegression()
# Perform k-fold cross validation (as above)
kf_poly = KFold(n_splits=5, random_state=42, shuffle=True)
cv_scores_poly = cross_val_score(model_poly, X_poly, y, cv=kf_poly)

print("Cross-validation scores (R² for each fold, polynomial regression):", cv_scores_poly)
print("Mean CV Score (R², polynomial regression):", cv_scores_poly.mean())
print("Standard Deviation (polynomial regression):", cv_scores_poly.std())

model_poly.fit(X_poly, y)
coefficients_poly = model_poly.coef_
intercept_poly = model_poly.intercept_
print("Coefficients (polynomial):", coefficients_poly)
print("Intercept (polynomial):", intercept_poly)
# Report on the metrics and output the resultant equation as you did in Part 3.

Cross-validation scores (R² for each fold, polynomial regression): [1. 1. 1. 1. 1.]
Mean CV Score (R², polynomial regression): 1.0
Standard Deviation (polynomial regression): 0.0
Coefficients (polynomial): [ 0.00000000e+00  1.20000000e+01 -1.23111325e-07 -1.05668034e-11
  2.00000000e+00  2.85714287e-02]
Intercept (polynomial): 1.6572012100368738e-05


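Write the polynomial equation of a slime: $h(x) = 1.6572 \times 10^{-5} + 12x_1 - 1.2311 \times 10^{-7}x_2 - 1.05668 \times 10^{-11}x_1^2 + 2.0x_1x_2 + 2.8571 \times 10^{-2}x_2^2$

FOR REFERENCE: $x_1$ represents temperature (°C) and $x_2$ represents the amount of KCl (mol)

Report on the score and interpret: The R² score of 1.0 on every fold (standard deviation 0.0) means that, to the printed precision, the degree-2 model explains all of the variance in the target variable (Size nm³) using Temperature °C, Mols KCL, their squares, and their product. The intercept and the coefficients on $x_2$ and $x_1^2$ are on the order of $10^{-5}$ or smaller, which is consistent with floating-point noise, so the fit is effectively $h(x) \approx 12x_1 + 2x_1x_2 + 0.02857x_2^2$. A perfect score on held-out folds suggests the data were generated from a quadratic formula of this form with no added noise.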
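The polynomial fit can be spot-checked against rows from the `head(15)` output in Section 1. The sketch below keeps only the three coefficients that are not near zero, which is an assumption about which terms matter; `h_poly` is an illustrative name. For two input features at degree 2, `PolynomialFeatures` orders the columns as $1, x_1, x_2, x_1^2, x_1x_2, x_2^2$, which is how the printed coefficients are matched to terms.

```python
def h_poly(temperature_c, mols_kcl):
    """Dominant terms of the fitted degree-2 equation (near-zero terms dropped)."""
    return (12.0 * temperature_c
            + 2.0 * temperature_c * mols_kcl
            + 2.85714287e-2 * mols_kcl ** 2)


# (Temperature °C, Mols KCL, Size nm^3) copied from rows 0, 4, and 12 of the dataset preview.
rows = [
    (469, 647, 6.244743e+05),
    (901, 18, 4.325726e+04),
    (206, 70, 3.145200e+04),
]

for temperature_c, mols_kcl, size in rows:
    predicted = h_poly(temperature_c, mols_kcl)
    print(f"T={temperature_c}, KCl={mols_kcl}: predicted {predicted:.1f}, observed {size:.1f}")
```

All three predictions match the observed sizes to within the rounding of the printed table, which is what an R² of 1.0 implies.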

# Part 2. Chronic Kidney Disease Prediction via Classification

Create code and markdown cells as needed to perform classification and report on your results

## Code for Classification Experiments

In [70]:
# Load the dataset. Then train and evaluate the classification models.
df_ckd = pd.read_csv('ckd_feature_subset.csv')
X = df_ckd.drop('Target_ckd', axis=1)
y = df_ckd['Target_ckd']
kf = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Support Vector Machine": SVC(random_state=42),
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Neural Network (Default)": MLPClassifier(max_iter=1000, random_state=42)
}
results = []
for model_name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=kf)
    mean_score = np.mean(cv_scores)
    std_score = np.std(cv_scores)
    results.append([model_name, mean_score, std_score])

results_df = pd.DataFrame(results, columns=["Model", "Mean Accuracy", "Std Deviation"])
print("Classification Results (5-Fold Cross-Validation):")
print(results_df)

Classification Results (5-Fold Cross-Validation):
                      Model  Mean Accuracy  Std Deviation
0       Logistic Regression       0.856559       0.066269
1    Support Vector Machine       0.928172       0.047601
2       k-Nearest Neighbors       0.927957       0.052440
3  Neural Network (Default)       0.935054       0.040466


## Results and Conclusion for Classification Experiments

Results: Logistic regression achieved about 85.66% mean accuracy (±6.63%). The support vector machine and k-nearest neighbors models performed almost identically, at about 92.82% (±4.76%) and 92.80% (±5.24%) respectively. The default neural network had the highest mean accuracy, approximately 93.51%, with the lowest variability (±4.05%). The three non-linear models (SVC uses an RBF kernel by default) outperform logistic regression by about 7 percentage points, which suggests that a non-linear decision boundary suits this CKD dataset, although the gaps among the three are well within one standard deviation across folds.
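One rough way to judge whether these differences are meaningful is to compare each model's gap to the best mean with its own fold-to-fold standard deviation. The sketch below uses the numbers from the results table; the threshold is a heuristic chosen for illustration, not a formal significance test.

```python
# (mean accuracy, std deviation) copied from the 5-fold results table.
results = {
    "Logistic Regression": (0.856559, 0.066269),
    "Support Vector Machine": (0.928172, 0.047601),
    "k-Nearest Neighbors": (0.927957, 0.052440),
    "Neural Network (Default)": (0.935054, 0.040466),
}

best_name = max(results, key=lambda name: results[name][0])
best_mean = results[best_name][0]

# Heuristic: a model is "clearly behind" only if its gap to the best mean
# exceeds its own standard deviation across folds.
clearly_behind = {}
for name, (mean, std) in results.items():
    gap = best_mean - mean
    clearly_behind[name] = gap > std
    print(f"{name}: gap to best = {gap:.4f}, std = {std:.4f}, clearly behind = {gap > std}")
```

Under this heuristic only logistic regression falls clearly behind; the SVM, k-NN, and default neural network are not separated by more than their fold-to-fold spread.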

## Code for Neural Network Experiments

In [71]:
# Experiments with Neural Network.
nn_configurations = [
    {"hidden_layer_sizes": (10,), "activation": "relu"},
    {"hidden_layer_sizes": (50, 25), "activation": "tanh"},
    {"hidden_layer_sizes": (100, 50, 25), "activation": "relu"}
]

nn_results = []

for i, config in enumerate(nn_configurations, start=1):
    nn_model = MLPClassifier(
        hidden_layer_sizes=config["hidden_layer_sizes"],
        activation=config["activation"],
        max_iter=2000,
        random_state=42
    )
    cv_scores = cross_val_score(nn_model, X, y, cv=kf)
    mean_score = np.mean(cv_scores)
    std_score = np.std(cv_scores)
    label = f"NN Config {i}: {config['hidden_layer_sizes']} | {config['activation']}"
    nn_results.append([label, mean_score, std_score])
nn_results_df = pd.DataFrame(nn_results, columns=["Configuration", "Mean Accuracy", "Std Deviation"])
print("Neural Network Configuration Results (5-Fold Cross-Validation):")
print(nn_results_df)

Neural Network Configuration Results (5-Fold Cross-Validation):
                       Configuration  Mean Accuracy  Std Deviation
0          NN Config 1: (10,) | relu       0.921720       0.056307
1       NN Config 2: (50, 25) | tanh       0.954624       0.032815
2  NN Config 3: (100, 50, 25) | relu       0.967527       0.035340


## Results and Conclusion for Neural Network Experiments

Results: Larger networks achieved higher mean accuracy on the CKD dataset. The simplest configuration (NN Config 1, one hidden layer of 10 neurons with ReLU activation) achieved about 92.17% accuracy (±5.63%), slightly below the default network's 93.51%. NN Config 2 (two hidden layers of 50 and 25 neurons with tanh) improved accuracy to roughly 95.46% (±3.28%), and the most complex configuration (NN Config 3, three hidden layers of 100, 50, and 25 neurons with ReLU) reached approximately 96.75% (±3.53%). This suggests that deeper, wider networks capture the underlying patterns more effectively, although the gap between Configs 2 and 3 is smaller than the fold-to-fold standard deviation, and the configurations also differ in activation function, so depth alone cannot be isolated as the cause.
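When quoting figures like these in a write-up, generating the percentage strings directly from the results table avoids transcription slips. A minimal sketch, with the values copied from the printed output and a hypothetical helper `as_percent`:

```python
# (label, mean accuracy, std deviation) copied from the configuration results table.
nn_summary = [
    ("NN Config 1: (10,) | relu", 0.921720, 0.056307),
    ("NN Config 2: (50, 25) | tanh", 0.954624, 0.032815),
    ("NN Config 3: (100, 50, 25) | relu", 0.967527, 0.035340),
]


def as_percent(mean, std):
    """Format a mean and standard deviation as 'xx.xx% (±y.yy%)'."""
    return f"{mean * 100:.2f}% (±{std * 100:.2f}%)"


report_lines = [f"{label}: {as_percent(mean, std)}" for label, mean, std in nn_summary]
for line in report_lines:
    print(line)
```

In the notebook itself, the same formatting could be applied to the rows of `nn_results_df` instead of a hand-copied list.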