# Introduction to Machine Learning and Data Science: Machine Learning basics with Diabetes and Aquatic Toxicity datasets

This notebook contains exercises for an introductory course on Machine Learning and Data Science, focusing on a regression and a classification problem with the QSAR Aquatic Toxicity and the Pima Indians Diabetes datasets, respectively.

We will cover essential steps in understanding, cleaning, transforming, and analyzing the data using pandas and matplotlib/seaborn. The exercises are divided into basic, intermediate, and advanced levels.

**Estimated Time**: 2 Hours



## Setup

We already used the Diabetes dataset last week, but here are the names of the columns, since it comes without them:
1. Pregnancies
2. Glucose
3. BloodPressure
4. SkinThickness
5. Insulin
6. BMI
7. DiabetesPedigreeFunction
8. Age
9. Outcome (0 for no diabetes, 1 for diabetes)

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# URL for the Pima Indians Diabetes dataset
url_classification = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names_classification = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Load the dataset
diabetes_df = pd.read_csv(url_classification, names=names_classification)

# Display the first few rows to confirm loading
diabetes_df.head()


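| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |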


For the regression dataset, we will use the QSAR Aquatic Toxicity dataset, which was used to develop quantitative structure-activity relationship (QSAR) models that predict acute aquatic toxicity towards *Daphnia magna* (a water flea). The target is LC50, the concentration that causes death in 50% of the test *Daphnia magna* over a test duration of 48 hours.

In [5]:
url_regression = "https://raw.githubusercontent.com/readytensor/rt-datasets-regression/refs/heads/main/datasets/processed/aquatic_toxicity/aquatic_toxicity.csv"

aquatic_toxicity_df = pd.read_csv(url_regression)
aquatic_toxicity_df.head()

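| | Id | TPSA(Tot) | SAacc | H-050 | MLOGP | RDCHI | GATS1p | nN | C-040 | LC50 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0 | 2.419 | 1.225 | 0.667 | 0 | 0 | 3.74 |
| 1 | 1 | 0.0 | 0.0 | 0 | 2.638 | 1.401 | 0.632 | 0 | 0 | 4.33 |
| 2 | 2 | 9.23 | 11.0 | 0 | 5.799 | 2.93 | 0.486 | 0 | 0 | 7.019 |
| 3 | 3 | 9.23 | 11.0 | 0 | 5.453 | 2.887 | 0.495 | 0 | 0 | 6.723 |
| 4 | 4 | 9.23 | 11.0 | 0 | 4.068 | 2.758 | 0.695 | 0 | 0 | 5.979 |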


## Basic Exercises (Approx. 45-60 minutes)

These exercises focus on the most basic fundamentals of the machine learning pipeline.

### Exercise 1: Splitting the Data

1. Separate features (`X`) and target (`y`).
2. Split the data with `sklearn.model_selection.train_test_split` into training, validation and test sets with a 60/20/20 split. **Classification dataset only**: pass the target to the `stratify` argument (for example `stratify=y`), which keeps the class proportions approximately equal across the sets.
3. Print the shape of the resulting sets, to see how many samples and attributes there are in each.
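
A 60/20/20 split needs two calls to `train_test_split`, because the function only produces two parts at a time. The sketch below illustrates the idea on synthetic data from `make_classification`, which is a stand-in and not the Diabetes dataset; the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem used only as a stand-in (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.65, 0.35], random_state=0
)

# First split: 60% training, 40% held out. `stratify` receives the labels themselves.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42
)

# Second split: halve the held-out part into validation and test (20% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

for name, features, labels in [
    ("train", X_train, y_train),
    ("validation", X_val, y_val),
    ("test", X_test, y_test),
]:
    # The share of positives should be nearly identical in every part.
    print(name, features.shape, "positive share:", round(labels.mean(), 3))
```

For a regression target there are no classes to balance, so the same two calls are used without `stratify`.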

In [6]:
# Your code for Exercise 1.1
X_title_list = ["Id", "TPSA(Tot)", "SAacc", "H-050", "MLOGP", "RDCHI", "GATS1p", "nN", "C-040"]
X = aquatic_toxicity_df[X_title_list]
y = aquatic_toxicity_df["LC50"]

In [7]:
# Your code for Exercise 1.2
from sklearn.model_selection import train_test_split

train_dataset, validation_dataset = train_test_split(aquatic_toxicity_df, test_size=0.4, random_state=42)
validation_dataset, test_dataset = train_test_split(validation_dataset, test_size=0.5, random_state=42)

In [8]:
# Your code for Exercise 1.3

# .shape gives (rows, columns); .size would only give rows * columns
print(train_dataset.shape)
print(validation_dataset.shape)

(327, 10)
(109, 10)


### Exercise 2: Training your first models
1. Train `LinearRegression` on the regression dataset (Aquatic Toxicity).
2. Train `LogisticRegression` on the classification dataset (Diabetes).
3. Predict on the test set with each model.
4. Print the first 5 predictions.

In [9]:
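# Your code for Exercise 2.1
from sklearn.linear_model import LinearRegression

# Note: X and y are the full dataset from Exercise 1.1, so the model also sees
# the rows that end up in the test set. Fitting on train_dataset only avoids this leakage.
lin_reg = LinearRegression()
lin_reg.fit(X, y)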

In [10]:
# Predictions
predictions = lin_reg.predict(X)
predictions[:5].round(2)

array([4.14, 4.35, 6.71, 6.53, 5.73])

We have some predictions! Do they correspond to the true targets?

In [11]:
print(y[:5].round(2))  # Roughly, yes.
lin_reg.score(X, y)  # R² on the fitting data; not shown, since it is neither printed nor the cell's last expression
print(lin_reg.coef_)

0    3.74
1    4.33
2    7.02
3    6.72
4    5.98
Name: LC50, dtype: float64
[-3.74156163e-04  2.70072900e-02 -1.49854000e-02  3.89718211e-02
  4.40240119e-01  5.16928142e-01 -6.12111735e-01 -2.21767847e-01
  2.99695466e-03]


In [12]:
# Your code for Exercise 2.2
from sklearn.linear_model import LogisticRegression

# Load the diabetes dataset and apply the previous preprocessing steps
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
diabetes_df = pd.read_csv(url, names=names)
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df[cols_with_zeros] = diabetes_df[cols_with_zeros].replace(0, np.nan)
imputation_strategy = diabetes_df[cols_with_zeros].median()
diabetes_df[cols_with_zeros] = diabetes_df[cols_with_zeros].fillna(imputation_strategy)

training_dataset_diabetes, validation_dataset_diabetes = train_test_split(diabetes_df, test_size=0.4, random_state=42)
validation_dataset_diabetes, test_dataset_diabetes = train_test_split(validation_dataset_diabetes, test_size=0.5, random_state=42)

print("names[:-1]= ", names[:-1])
X = training_dataset_diabetes[names[:-1]]
y = training_dataset_diabetes['Outcome']

log_reg = LogisticRegression(random_state=0)
log_reg.fit(X, y)


# Predictions
predictions = log_reg.predict(X)
print("Predictions: ", predictions[:5])
print("True values: ", y[:5].values)

names[:-1]=  ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
Predictions:  [0 0 0 1 1]
True values:  [0 1 0 1 1]


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Your code for Exercise 2.3
print("Aquatic Toxicity dataset predictions for first 5 samples: ")
# Aquatic Toxicity dataset
predictions_AT = lin_reg.predict(test_dataset[X_title_list])
print("Predictions: ", predictions_AT[:5])
print("True values: ", test_dataset["LC50"][:5].values)

print("\nDiabetes dataset predictions for first 5 samples: ")
# Diabetes dataset
predictions_diabetes = log_reg.predict(test_dataset_diabetes[names[:-1]])
print("Predictions: ", predictions_diabetes[:5])
print("True values: ", test_dataset_diabetes["Outcome"][:5].values)

Aquatic Toxicity dataset predictions for first 5 samples: 
Predictions:  [5.41926481 3.73833731 6.29745185 5.66251766 3.75608386]
True values:  [4.071 3.48  4.838 6.756 2.778]

Diabetes dataset predictions for first 5 samples: 
Predictions:  [0 1 0 1 1]
True values:  [0 0 0 1 1]


In [14]:
# Your code for Exercise 2.4
# Done before.

### Exercise 3: Evaluate your models
1. For regression, use the following metrics to check how good your predictions were: `mean_squared_error`, `r2_score`.
2. Do the same for classification for these: `accuracy_score`, `confusion_matrix`
3. Interpret: Are predictions close? For classification, is there a clear tendency in the model (too many false positives, false negatives, etc.)?
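
As a small illustration of how the classification metrics are read, the sketch below applies them to only the five test predictions printed in Exercise 2.3. Five samples are far too few for a real evaluation; the point is just to show what each number means.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The five true labels and predictions printed for the Diabetes test set above.
y_true_5 = [0, 0, 0, 1, 1]
y_pred_5 = [0, 1, 0, 1, 1]

acc = accuracy_score(y_true_5, y_pred_5)
cm = confusion_matrix(y_true_5, y_pred_5)

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

On these five samples there is one false positive and no false negatives. On the full test set, comparing the FP and FN counts in the same way shows whether the model leans towards over- or under-predicting diabetes.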

In [None]:
# Your code for Exercise 3.1
from sklearn.metrics import mean_squared_error, r2_score
predictions_AT = pd.DataFrame(predictions_AT, columns=["Predicted_LC50"])

mse = mean_squared_error(test_dataset["LC50"], predictions_AT["Predicted_LC50"])
r2 = r2_score(test_dataset["LC50"], predictions_AT['Predicted_LC50'])

print("Mean Squared Error (MSE): ", mse)
print("R-squared (R2): ", r2)

# Modest fit: R² ≈ 0.57, so just over half of the variance in LC50 is explained.

Mean Squared Error (MSE):  1.333523878079223
R-squared (R2):  0.5723766041272595


In [16]:
# Your code for Exercise 3.2

## Intermediate Exercises (Approx. 45-60 minutes)

These exercises focus on comparing the performance of multiple models, performing cross-validation, and using visualization techniques to better understand the results.

### Exercise 4: Compare multiple models
1. For the regression problem: Train and compare `LinearRegression` (already done), `DecisionTreeRegressor`, and `KNeighborsRegressor`.
2. For the classification problem: Train and compare `LogisticRegression` (already done), `KNeighborsClassifier`, and `DecisionTreeClassifier`.
3. Evaluate the regression models with the R² score and the classification models with accuracy.
4. Which model performs best in each case? Why do you think that might be? Try to understand how the models work. Here is the documentation for each of them:
  - [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
  - [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
  - [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
  - [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
  - [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
  - [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
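
One convenient way to compare several models is to keep them in a dictionary and loop over it, so every model is trained and scored on exactly the same split. The sketch below does this for the three classifiers on synthetic stand-in data (an assumption, not the Diabetes dataset).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For classifiers, .score() returns accuracy; for regressors it returns R².
    scores[name] = model.score(X_val, y_val)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The same loop works for the regression models by swapping in `LinearRegression`, `DecisionTreeRegressor` and `KNeighborsRegressor`.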

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Your code for Exercise 4.1
X = train_dataset.drop(columns=["LC50"])
y = train_dataset["LC50"]

X_test = test_dataset.drop(columns=["LC50"])
y_test = test_dataset["LC50"]

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg_predictions = lin_reg.predict(X_test)
lin_reg_mse = mean_squared_error(y_test, lin_reg_predictions)
lin_reg_r2 = r2_score(y_test, lin_reg_predictions)

dec_tree_reg = DecisionTreeRegressor()
dec_tree_reg.fit(X, y)
dec_tree_predictions = dec_tree_reg.predict(X_test)
dec_tree_mse = mean_squared_error(y_test, dec_tree_predictions)
dec_tree_r2 = r2_score(y_test, dec_tree_predictions)

knn_reg = KNeighborsRegressor()
knn_reg.fit(X, y)
knn_predictions = knn_reg.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_predictions)
knn_r2 = r2_score(y_test, knn_predictions)

print("Linear Regression MSE: ", lin_reg_mse)
print("Decision Tree Regression MSE: ", dec_tree_mse)
print("KNN Regression MSE: ", knn_mse)

print("\nLinear Regression R2: ", lin_reg_r2)
print("Decision Tree Regression R2: ", dec_tree_r2)
print("KNN Regression R2: ", knn_r2)

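# None of the models fits well; LinearRegression is the least bad.
# X still contains the `Id` column, a row identifier rather than a molecular descriptor,
# and unscaled, large-range columns like it can dominate KNN distances.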

Linear Regression MSE:  1.4216313202031032
Decision Tree Regression MSE:  2.315176690909091
KNN Regression MSE:  3.611261768363637

Linear Regression R2:  0.5441230391015299
Decision Tree Regression R2:  0.25758830802642496
KNN Regression R2:  -0.15802952324886177


In [18]:
# Your code for Exercise 4.2

In [19]:
# Your code for Exercise 4.3

### Exercise 5: Cross-validation
1. Use `cross_val_score` for the best model in each case, with `cv=5`.
2. For classification, try both `accuracy` and `f1` as scoring methods and compare the results across folds.
3. For the regression, use both `r2` and `neg_mean_squared_error`.
4. Print the mean and standard deviation of the scores.
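
A minimal sketch of 5-fold cross-validation with two scoring methods follows, using synthetic regression data as a stand-in (an assumption). Note that scikit-learn reports the mean squared error negated, so that a higher score is always better; flipping the sign recovers the usual MSE.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = LinearRegression()

# Each call fits the model 5 times, each time holding out a different fold for scoring.
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
neg_mse_scores = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
mse_scores = -neg_mse_scores  # back to ordinary (non-negative) MSE

print(f"R2:  {r2_scores.mean():.3f} +/- {r2_scores.std():.3f}")
print(f"MSE: {mse_scores.mean():.3f} +/- {mse_scores.std():.3f}")
```

For a classifier the calls are the same, with `scoring="accuracy"` or `scoring="f1"`.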

In [20]:
# Your code for Exercise 5.1 and 5.2

In [21]:
# Your code for Exercise 5.1 and 5.3

In [22]:
# Your code for Exercise 5.4

### Exercise 6: Classification report and visualization

1. Use the `classification_report` function for the classification problem ([link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to documentation)
2. Calculate the confusion matrix and plot it using `ConfusionMatrixDisplay`.
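
The numbers in `classification_report` are derived from the confusion matrix. The sketch below, on synthetic stand-in data (an assumption), prints the report and recomputes the recall of the positive class by hand to make that link explicit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# Human-readable table with precision, recall, F1 and support per class.
print(classification_report(y_test, y_pred))

# The same report as a nested dictionary, keyed by the class label as a string.
report = classification_report(y_test, y_pred, output_dict=True)

# Recall of class 1 = true positives / all truly positive samples (second row of the matrix).
cm = confusion_matrix(y_test, y_pred)
recall_pos = cm[1, 1] / cm[1].sum()
print("Recall for class 1, from the confusion matrix:", round(recall_pos, 3))
```

The `cm` array can then be drawn with `ConfusionMatrixDisplay(confusion_matrix=cm).plot()`, which requires matplotlib.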

In [23]:
# Your code for Exercise 6.1

In [24]:
# Your code for Exercise 6.2

## Advanced Exercises (Approx. 45-60 min)

These exercises explore hyperparameter tuning methods and final model evaluation.



### Exercise 7: Hyperparameter tuning using Grid Search

1. Use `GridSearchCV` with `LogisticRegression`. Try this grid: `C=[0.01, 0.1, 1, 10]`, `penalty=['l1', 'l2']`, `solver='liblinear'`.
2. `GridSearchCV` cross-validates every parameter combination; use `cv=5`.
3. Print the best parameters and the best score.

`GridSearchCV` documentation: [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
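
A minimal sketch of the grid described above, run on synthetic stand-in data (an assumption). The solver is fixed on the estimator rather than listed in the grid, since it takes a single value; `liblinear` is one of the solvers that supports both the `l1` and `l2` penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 4 values of C times 2 penalties = 8 combinations, each evaluated with 5-fold CV.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

print("Best parameters:", grid.best_params_)
print("Best mean CV accuracy:", round(grid.best_score_, 3))
```

`best_score_` is the mean cross-validated score of the best combination, not a test-set score.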

In [25]:
# Your code for Exercise 7.1

In [26]:
# Your code for Exercise 7.2

In [27]:
# Your code for Exercise 7.3

### Exercise 8: Hyperparameter tuning using Random Search

1. Use `RandomizedSearchCV` on the [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) model (for regression). Sample only one parameter: `alpha` from `scipy.stats.uniform(0.01, 10)`.
2. Limit the random search to only 10 iterations.
3. Print the best alpha and score.
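
A minimal sketch of the random search on synthetic stand-in data (an assumption). One detail worth knowing: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, so `uniform(0.01, 10)` covers 0.01 to 10.01 rather than 0.01 to 10.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# alpha is drawn uniformly from [0.01, 10.01] (loc=0.01, scale=10).
param_distributions = {"alpha": uniform(0.01, 10)}

search = RandomizedSearchCV(
    Ridge(),
    param_distributions,
    n_iter=10,        # only 10 sampled values of alpha are tried
    cv=5,
    scoring="r2",
    random_state=42,  # makes the sampled alphas reproducible
)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print("Best mean CV R2:", round(search.best_score_, 3))
```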

In [28]:
# Your code for Exercise 8.1

In [29]:
# Your code for Exercise 8.2

In [30]:
# Your code for Exercise 8.3

### Exercise 9: Final model & test evaluation

1. Take the best model from Grid/Random search
2. Predict on test set
3. Print final metrics
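
The sketch below shows one way to arrange these final steps, on synthetic stand-in data with a small, illustrative `alpha` grid (both assumptions). The test set is split off before the search and used exactly once at the end, so the final metrics are not influenced by the tuning.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hold out the test set before any tuning takes place.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10]}, cv=5, scoring="r2")
search.fit(X_dev, y_dev)

# With the default refit=True, best_estimator_ is already refitted on all of X_dev.
final_model = search.best_estimator_
y_pred = final_model.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print("Chosen alpha:", final_model.alpha)
print("Test MSE:", round(test_mse, 3))
print("Test R2:", round(test_r2, 3))
```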


In [31]:
# Your code for Exercise 9.1

In [32]:
# Your code for Exercise 9.2

In [33]:
# Your code for Exercise 9.3