# Hands-on exercises

## 1. Let's fit two linear models, one for classification and one for regression.

1. Breast Cancer Wisconsin dataset (classification): https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

`from sklearn.datasets import load_breast_cancer
X_bc, y_bc = load_breast_cancer(return_X_y=True)`


2. California Housing dataset (regression): https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing

`from sklearn.datasets import fetch_california_housing
X_cal, y_cal = fetch_california_housing(return_X_y=True)`


## Tasks

* Use train_test_split to create two subsets of data, one for fitting the model and the other for testing the model (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) hint:
`from sklearn.model_selection import train_test_split`

* Fit one model for each dataset. Test two different parameter values in addition to the default. Check the score on the test set. Hint: `from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression`

* Hint: in Jupyter or IPython, append `?` to a name to view its documentation, e.g. `LogisticRegression?`
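
The `?` syntax only works in IPython and Jupyter. As a minimal sketch of a plain-Python alternative, `get_params()` lists an estimator's constructor parameters and their current values, which can help when deciding which parameters to vary. The variable names below are illustrative.

```python
from sklearn.linear_model import LogisticRegression, Ridge

# get_params() returns a dict of constructor parameters and their current values
logreg_defaults = LogisticRegression().get_params()
ridge_defaults = Ridge().get_params()

print("LogisticRegression C:", logreg_defaults["C"])
print("Ridge alpha:", ridge_defaults["alpha"])

# help(LogisticRegression) prints the full docstring outside IPython/Jupyter
```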

## 2. Let's fit two models, one tree-based and one SVC.

Use the wine dataset from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html). As in the previous exercise, split the data and fit each model with different parameter values to find the best score on the test set.

`from sklearn.datasets import load_wine
X_w, y_w = load_wine(return_X_y=True)`


In [1]:
# Import libraries and datasets
from sklearn.datasets import load_breast_cancer, fetch_california_housing, load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the Breast Cancer Wisconsin dataset
#df_bc = load_breast_cancer()
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Load the California Housing dataset
X_cal, y_cal = fetch_california_housing(return_X_y=True)

#print (X_bc, y_bc)
#print (X_cal, y_cal)

In [5]:
# Split the Breast Cancer dataset into training and testing sets for classification
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)

# Split the California Housing dataset into training and testing sets for regression
X_cal_train, X_cal_test, y_cal_train, y_cal_test = train_test_split(X_cal, y_cal, test_size=0.2, random_state=42)

print(X_bc_train.shape)

(455, 30)
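
The splits above are purely random. For classification data, a common variation (not part of the original exercise) is to pass `stratify` so that both subsets keep roughly the same class proportions. A minimal sketch, with illustrative variable names:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class ratio approximately equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_pos = y_train.mean()  # fraction of class 1 in the training set
test_pos = y_test.mean()    # fraction of class 1 in the test set
print(f"Class 1 fraction: train = {train_pos:.3f}, test = {test_pos:.3f}")
```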


In [7]:
# Classification with Logistic Regression
# Scale the features once; the scaler is fit on the training set only
scaler = StandardScaler()
X_bc_train_scaled = scaler.fit_transform(X_bc_train)
X_bc_test_scaled = scaler.transform(X_bc_test)

# Fit the model with different values of C (inverse of regularization strength)
C_values = [0.01, 1.0]
for C in C_values:
    # max_iter is raised above the default to give the saga solver room to converge
    clf = LogisticRegression(C=C, max_iter=2000, solver='saga', random_state=0)
    clf.fit(X_bc_train_scaled, y_bc_train)
    y_pred_bc = clf.predict(X_bc_test_scaled)
    accuracy = accuracy_score(y_bc_test, y_pred_bc)

    print(f"Logistic Regression (C={C}): Accuracy = {accuracy:.2f}")

Logistic Regression (C=0.01): Accuracy = 0.96
Logistic Regression (C=1.0): Accuracy = 0.97


Summary:

The logistic regression model with C=1.0 achieved an accuracy of 0.97, slightly higher than the 0.96 obtained with C=0.01. Because C is the inverse of the regularization strength, a lower C means stronger regularization, so here the more weakly regularized model (higher C) performed slightly better. The gap is one percentage point on a test set of 114 samples, so it should not be over-interpreted.
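
Because `C` is the inverse of the regularization strength, a smaller `C` shrinks the coefficients more strongly. The following sketch compares coefficient norms for the two values tested. It is an illustration rather than the solution code: it assumes the default solver and a `Pipeline`, and the helper name `fit_logreg` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_logreg(C):
    """Fit scaler + logistic regression; return (test accuracy, coefficient L2 norm)."""
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    model.fit(X_train, y_train)
    coef = model.named_steps["logisticregression"].coef_
    return model.score(X_test, y_test), float(np.linalg.norm(coef))

results = {C: fit_logreg(C) for C in [0.01, 1.0]}
for C, (acc, norm) in results.items():
    print(f"C={C}: accuracy = {acc:.2f}, ||coef|| = {norm:.2f}")
```

Putting the scaler inside the pipeline means it is refit together with the model, which avoids accidentally fitting it on test data.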

In [107]:
# Linear Regression
# LinearRegression has no regularization parameter, so it is fit once as a baseline
lr = LinearRegression()
lr.fit(X_cal_train, y_cal_train)
y_pred_lr = lr.predict(X_cal_test)
mse_lr = mean_squared_error(y_cal_test, y_pred_lr)
r2_lr = r2_score(y_cal_test, y_pred_lr)
print(f"Linear Regression: Mean Squared Error = {mse_lr:.2f}, R^2 Score = {r2_lr:.2f}")

# Ridge Regression
# Fit the Ridge Regression model with different values of alpha (regularization strength)
alpha_values = [10, 1.0]

for alpha in alpha_values:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_cal_train, y_cal_train)
    y_pred_ridge = ridge.predict(X_cal_test)
    mse_ridge = mean_squared_error(y_cal_test, y_pred_ridge)
    r2_ridge = r2_score(y_cal_test, y_pred_ridge)
    
    print(f"Ridge Regression (alpha={alpha}): Mean Squared Error = {mse_ridge:.2f}, R^2 Score = {r2_ridge:.2f}")

Linear Regression: Mean Squared Error = 0.56, R^2 Score = 0.58
Ridge Regression (alpha=10): Mean Squared Error = 0.56, R^2 Score = 0.58
Ridge Regression (alpha=1.0): Mean Squared Error = 0.56, R^2 Score = 0.58


Summary:

All three regression models (Linear Regression and Ridge Regression with alpha=10 and alpha=1.0) produce the same results to two decimal places: a Mean Squared Error (MSE) of 0.56 and an R-squared (R^2) score of 0.58, meaning each model explains about 58% of the variance in the target on the test set.

This suggests that, in this specific case, the tested alpha values do not noticeably affect Ridge Regression's performance, and plain Linear Regression performs equally well. Only two alpha values were tried, so a wider range could still behave differently.
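
One way to see why small alpha values change little is to sweep alpha over several orders of magnitude. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, chosen so it runs without downloading the California Housing data). The sample size, noise level, and alpha grid are illustrative and not taken from the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, like California Housing, but not the real data
X, y = make_regression(n_samples=2000, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unregularized baseline
baseline_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
print(f"LinearRegression: R^2 = {baseline_r2:.3f}")

ridge_r2 = {}
ridge_norm = {}
for alpha in [1.0, 10.0, 1e4]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    ridge_r2[alpha] = ridge.score(X_test, y_test)
    ridge_norm[alpha] = float(np.linalg.norm(ridge.coef_))
    print(f"Ridge(alpha={alpha:g}): R^2 = {ridge_r2[alpha]:.3f}, "
          f"||coef|| = {ridge_norm[alpha]:.1f}")
```

With many more samples than features, the penalty only starts to matter once alpha is large relative to the amount of data, which is consistent with the identical scores reported above.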

## 2

In [108]:
# Load the Wine dataset
X_w, y_w = load_wine(return_X_y=True)

# Split the Wine dataset into training and testing sets
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(X_w, y_w, test_size=0.2, random_state=42)

In [109]:
# Tree-based model - Decision Tree
# Fit the Decision Tree model with different values of max_depth
max_depth_values = [2, 4, 6, 8, 10]
best_accuracy_tree = 0
best_max_depth_tree = 0

for max_depth in max_depth_values:
    tree_classifier = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree_classifier.fit(X_w_train, y_w_train)
    y_pred_tree = tree_classifier.predict(X_w_test)
    accuracy_tree = accuracy_score(y_w_test, y_pred_tree)
    
    if accuracy_tree > best_accuracy_tree:
        best_accuracy_tree = accuracy_tree
        best_max_depth_tree = max_depth

print("Best Decision Tree model:")
print(f"Max Depth: {best_max_depth_tree}")
print(f"Accuracy: {best_accuracy_tree:.2f}")

Best Decision Tree model:
Max Depth: 4
Accuracy: 0.94


In [110]:
# Support Vector Classifier (SVC)
# Fit the SVC model with different values of C (regularization parameter)
C_values = [0.1, 1.0, 10.0]
best_accuracy_svc = 0
best_C_svc = 0

for C in C_values:
    svc_classifier = SVC(C=C, random_state=0)
    svc_classifier.fit(X_w_train, y_w_train)
    y_pred_svc = svc_classifier.predict(X_w_test)
    accuracy_svc = accuracy_score(y_w_test, y_pred_svc)
    
    if accuracy_svc > best_accuracy_svc:
        best_accuracy_svc = accuracy_svc
        best_C_svc = C

print("Best SVC model:")
print(f"C: {best_C_svc}")
print(f"Accuracy: {best_accuracy_svc:.2f}")

Best SVC model:
C: 1.0
Accuracy: 0.81


Summary:

The Decision Tree model with a maximum depth of 4 achieves the higher test accuracy, 0.94. The Support Vector Classifier (SVC) reaches 0.81 at its best tested setting, C=1.0.

In this comparison, the Decision Tree outperforms the SVC on accuracy. Two caveats apply. First, the SVC was fit on unscaled features, and an SVC with the default RBF kernel is sensitive to feature scale, so the ranking may change once the features are standardized. Second, both models were tuned against the same small test set, so the reported best scores are likely optimistic.
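
The SVC above was fit on raw features, whose scales differ widely in the wine data. As a hedged sketch (not part of the original solution), the comparison below puts `StandardScaler` in a pipeline and uses 5-fold cross-validation instead of a single test split. The variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Same estimator, with and without standardization
unscaled_svc = SVC(C=1.0)
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1.0))

# cross_val_score refits the whole pipeline per fold, so the scaler
# only ever sees the training portion of each fold
unscaled_acc = cross_val_score(unscaled_svc, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_svc, X, y, cv=5).mean()

print(f"SVC without scaling: mean CV accuracy = {unscaled_acc:.2f}")
print(f"SVC with scaling:    mean CV accuracy = {scaled_acc:.2f}")
```

Cross-validation is used here because a single 20% split of this dataset leaves only a few dozen test samples, which makes one accuracy figure noisy.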