### Exercises: Decision trees, random forests and K-means

1. We load the wine dataset for you in the next code cell. We provide you with the X and y (= target) variables.
 a. Count the different values of y.
 b. Fit a decision tree on this set. Use cross-validation and calculate the accuracy on the validation sets.
    What other metrics are important for classification problems like this one?
 c. Does it make any difference if we use a standard scaler?



In [35]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

# Load the wine dataset and take 'target' as the value you need to predict
wine = load_wine()

# Convert to a Pandas DataFrame for easier exploration
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X.head()


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [15]:
#SOLUTION_START
print(y.value_counts())  # number of samples per class
#SOLUTION_END

1    71
0    59
2    48
Name: count, dtype: int64


In [32]:
#SOLUTION_START
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8, test_size=0.2, stratify=y)

# Create a pipeline with scaling and Decision Tree
pipeline = Pipeline([('standard_scaler',StandardScaler()),
                     ('decision_tree', DecisionTreeClassifier(random_state=42))
])
# model = DecisionTreeClassifier()

# Perform cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print("\n\nScores:", scores)
# Fit the model to the training set
pipeline.fit(X_train, y_train)

# Evaluate on the test set
prediction = pipeline.predict(X_test)

# Accuracy on the held-out test set
print(f"Test Accuracy: {pipeline.score(X_test, y_test):.4f}")
# Other important metrics: precision, recall and F1 score per class
print("Classification Report:\n", classification_report(y_test, prediction))
# The confusion matrix shows which classes are confused with each other
print("Confusion Matrix:\n", confusion_matrix(y_test, prediction))
#SOLUTION_END



Scores: [0.91666667 0.80555556 0.83333333 0.91428571 0.85714286]
Test Accuracy: 0.9444
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.88      1.00      0.93        14
           2       1.00      0.90      0.95        10

    accuracy                           0.94        36
   macro avg       0.96      0.94      0.95        36
weighted avg       0.95      0.94      0.94        36

Confusion Matrix:
 [[11  1  0]
 [ 0 14  0]
 [ 0  1  9]]
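
For question 1c, one way to check the effect of scaling is to cross-validate the same tree with and without the `StandardScaler` on identical folds. The sketch below is only an illustration: the `StratifiedKFold` setup and the variable names `plain_tree` and `scaled_tree` are assumptions, not part of the original solution. Because a decision tree splits on per-feature thresholds, standardizing the features should change the scores little or not at all.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Use the same shuffled, stratified folds for both models (assumed setup)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Tree on the raw features
plain_tree = DecisionTreeClassifier(random_state=42)

# Same tree, but with standardized features
scaled_tree = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("decision_tree", DecisionTreeClassifier(random_state=42)),
])

plain_scores = cross_val_score(plain_tree, X, y, cv=cv, scoring="accuracy")
scaled_scores = cross_val_score(scaled_tree, X, y, cv=cv, scoring="accuracy")

print(f"Without scaling: {plain_scores.mean():.4f}")
print(f"With scaling:    {scaled_scores.mean():.4f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.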


In [None]:
#SOLUTION_START
# Split the dataset into training and testing sets


# Create a pipeline with scaling and Decision Tree



# Perform cross-validation




# Fit the model to the training set


# Evaluate on the test set


#other important metrics

#SOLUTION_END

2. We use the same wine dataset again in the next code cell. We again provide you with the X and y (= target) variables.
 a. Fit a random forest with 100 decision trees on this set. Use cross-validation and calculate the accuracy on the validation sets.
    What other metrics are important for classification problems like this one?
 b. Is the result better than for the decision tree?


In [36]:
#SOLUTION_START
# Rebuild the features and target (RandomForestClassifier was imported in the first cell)
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X.head()

# Create a pipeline with scaling and Random Forest
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('random_forest', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Perform cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

# Fit the model to the training set
pipeline.fit(X_train, y_train)

# Evaluate on the test set
pipeline.predict(X_test)
#SOLUTION_END

array([0, 2, 0, 1, 1, 0, 0, 1, 1, 2, 1, 2, 0, 2, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 2, 1, 2, 0, 2, 1, 2, 2, 2])
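
The cell above computes `scores` but only displays the test-set predictions, so question 2b is still open. A minimal sketch of the comparison, assuming both models are evaluated on the same stratified folds (the `models` dictionary and the fold setup are illustrative choices, not the original solution):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Same folds for both models so the scores are comparable (assumed setup)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Reporting the standard deviation next to the mean helps judge whether a difference between the two models is larger than the fold-to-fold variation.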

3. We use the same wine dataset again in the next code cell. We again provide you with the X and y (= target) variables.
 a. Fit K-means with 3 clusters on this set. Use cross-validation and calculate the accuracy on the validation sets.
    What other metrics are important for classification problems like this one?
 b. Is the result better than for the decision tree? Why?
 c. Visualize the clusters (using PCA to reduce to 2 dimensions for plotting).


In [37]:
#SOLUTION_START
#a) Import necessary libraries
from sklearn.cluster import KMeans

# Load the wine dataset
wine = load_wine()

# Convert to a Pandas DataFrame for easier exploration
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)

# Standardize the features (important for KMeans since it's distance-based)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Compare the K-Means clusters to the actual wine classes
print("Confusion Matrix:\n", confusion_matrix(y, clusters))
print("\nClassification Report:\n", classification_report(y, clusters))


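# b) K-means is unsupervised: it never sees y, and the cluster numbers it assigns are arbitrary.
#    In the confusion matrix below almost every class falls into a single cluster, but under a different number,
#    so accuracy and the classification report are not meaningful without first matching clusters to classes.
#SOLUTION_END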

Confusion Matrix:
 [[ 0  0 59]
 [65  3  3]
 [ 0 48  0]]

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        59
           1       0.06      0.04      0.05        71
           2       0.00      0.00      0.00        48

    accuracy                           0.02       178
   macro avg       0.02      0.01      0.02       178
weighted avg       0.02      0.02      0.02       178
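
The accuracy of 0.02 reflects the arbitrary numbering of the clusters, not the quality of the clustering: in the confusion matrix above, 59 + 65 + 48 = 172 of the 178 samples (roughly 97%) sit in one dominant cluster per class. A possible way to evaluate this fairly is sketched below. It is an assumption on our part, not the original solution: it uses the adjusted Rand index, which ignores label numbering, and the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) to find the best one-to-one mapping from clusters to classes. `n_init=10` is set explicitly, so the numbers may differ slightly from the output above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# The adjusted Rand index does not depend on how the clusters are numbered
ari = adjusted_rand_score(y, clusters)

# Rows are true classes, columns are cluster ids
cm = confusion_matrix(y, clusters)

# Hungarian algorithm: choose the class/cluster pairing with the most matches
class_idx, cluster_idx = linear_sum_assignment(-cm)
cluster_to_class = {cluster: cls for cls, cluster in zip(class_idx, cluster_idx)}

# Relabel every sample with the class matched to its cluster
aligned = np.array([cluster_to_class[c] for c in clusters])
aligned_accuracy = accuracy_score(y, aligned)

print(f"Adjusted Rand index: {ari:.3f}")
print(f"Accuracy after matching clusters to classes: {aligned_accuracy:.3f}")
print("Aligned confusion matrix:\n", confusion_matrix(y, aligned))
```

Even after matching, this is not a like-for-like comparison with the decision tree: the tree is scored on held-out data, while K-means is fitted and evaluated on the same samples.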



In [None]:
#SOLUTION_START
#3c) Visualize the clusters (using PCA to reduce to 2 dimensions for plotting)

#SOLUTION_END
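
For question 3c, one possible sketch (not the original solution) projects the standardized features onto the first two principal components and colours the points once by K-means cluster and once by true class. The side-by-side layout, the colour map and the explicit `n_init=10` are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Reduce the 13 standardized features to 2 dimensions for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
centers_2d = pca.transform(kmeans.cluster_centers_)

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharex=True, sharey=True)

# Left panel: points coloured by K-means cluster, centroids marked with a red X
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, cmap="viridis", s=25)
axes[0].scatter(centers_2d[:, 0], centers_2d[:, 1], c="red", marker="X", s=150)
axes[0].set_title("K-means clusters")

# Right panel: the same points coloured by the true wine class
axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=25)
axes[1].set_title("True wine classes")

for ax in axes:
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")

plt.tight_layout()
plt.show()
```

The colours in the two panels need not correspond, because cluster numbers are arbitrary; what matters is whether the groups of points coincide.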

4. Load the concrete_data dataset and split it into a training set and a test set (20% test set). What is the result of a decision tree regressor on the test set?

In [None]:
#SOLUTION_START
# Import necessary libraries


df = pd.read_csv('data/concrete_data.csv')
# Display the first few rows to understand the structure of the dataset


# Features (input) and target variable

# Split the data into training and testing sets (80% train, 20% test)

# Initialize and fit the DecisionTreeRegressor

# Make predictions on the test set

# Evaluate the model



# Visualize actual vs predicted

#SOLUTION_END

4. Load the concrete_data dataset and split it into a training set and a test set (20% test set). What is the result of a random forest regressor on the test set? Do we see an improvement?

In [None]:
#SOLUTION_START
# Import necessary libraries



# Features (input) and target variable


# Split the data into training and testing sets (80% train, 20% test)

# Initialize and fit the RandomForestRegressor


# Make predictions on the test set

# Evaluate the model



# Visualize actual vs predicted

#SOLUTION_END
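
The two regression exercises above follow the same pattern, which can be sketched as follows. Since `data/concrete_data.csv` and its column names are not shown here, the sketch uses synthetic data from `make_friedman1` as a stand-in (an assumption); with the real data, `X` and `y` would be the feature columns and the target column of the DataFrame.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption): replace with the features and target from concrete_data.csv
X, y = make_friedman1(n_samples=1000, n_features=8, noise=1.0, random_state=42)

# 80% training set, 20% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = [
    ("decision tree", DecisionTreeRegressor(random_state=42)),
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]

results = {}
for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": mean_squared_error(y_test, pred) ** 0.5,
    }
    print(f"{name}: R2={results[name]['r2']:.3f}, RMSE={results[name]['rmse']:.3f}")
```

Evaluating both models on the same split makes the "do we see an improvement?" question a direct comparison of the two R-squared values.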

4. Load the concrete_data dataset and use a random forest again. Use grid search to determine the optimal number of trees in the forest (test the following numbers of trees: [10, 20, 40, 50, 100, 150, 200, 250, 300, 350, 450, 500]).

In [None]:
#SOLUTION_START
# Import necessary libraries

# Features (input) and target variable


# Split the data into training and testing sets (80% train, 20% test)

# Define the RandomForestRegressor model

# Define the grid of parameters to search (different numbers of trees)


# Set up the GridSearchCV with cross-validation

# Fit GridSearchCV to the training data

# Get the best parameters and best score

# Train the RandomForestRegressor using the optimal number of trees

# Make predictions on the test set

# Evaluate the model




# Visualize actual vs predicted

#SOLUTION_END
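
A minimal sketch of the grid search described above, under stated assumptions: synthetic `make_friedman1` data stands in for the concrete dataset, and `cv=3` with `scoring="r2"` is an illustrative choice to keep the run short. Only the list of tree counts comes from the exercise text.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): replace with the features and target from concrete_data.csv
X, y = make_friedman1(n_samples=300, n_features=8, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Numbers of trees to try, as listed in the exercise
param_grid = {
    "n_estimators": [10, 20, 40, 50, 100, 150, 200, 250, 300, 350, 450, 500]
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,            # assumed number of folds
    scoring="r2",    # assumed metric
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated R2: {grid_search.best_score_:.3f}")

# GridSearchCV refits the best model on the full training set by default
test_r2 = grid_search.score(X_test, y_test)
print(f"Test R2: {test_r2:.3f}")
```

The search is fitted on the training set only, so the test set stays untouched until the final evaluation.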

5. Go to the scikit-learn documentation and look up RandomizedSearchCV. Use it to try 10 randomly chosen numbers of trees between 10 and 500. What is the resulting R-squared?

In [None]:
#SOLUTION_START
# Import necessary libraries



# Features (input) and target variable


# Split the data into training and testing sets (80% train, 20% test)



# Define the RandomForestRegressor model


# Define the distribution of parameters to sample from (number of trees between 10 and 500)



# Set up the RandomizedSearchCV with cross-validation (10 iterations)


# Fit RandomizedSearchCV to the training data


# Get the best parameters and best score



# Train the RandomForestRegressor using the optimal number of trees


# Make predictions on the test set


# Evaluate the model





# Visualize actual vs predicted


#SOLUTION_END
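
A possible sketch of the randomized search, again on stand-in data because the concrete dataset is not available here. The choices `scipy.stats.randint(10, 501)` (integers from 10 to 500 inclusive), `cv=3` and `scoring="r2"` are assumptions for illustration; `n_iter=10` corresponds to the 10 guesses in the exercise.

```python
from scipy.stats import randint
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in data (assumption): replace with the features and target from concrete_data.csv
X, y = make_friedman1(n_samples=300, n_features=8, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# randint's upper bound is exclusive, so this samples integers from 10 to 500
param_distributions = {"n_estimators": randint(10, 501)}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=10,        # 10 random guesses
    cv=3,             # assumed number of folds
    scoring="r2",     # assumed metric
    random_state=42,  # makes the sampled values reproducible
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print(f"Best cross-validated R2: {random_search.best_score_:.3f}")

test_r2 = random_search.score(X_test, y_test)
print(f"Test R2: {test_r2:.3f}")
```

Unlike the grid search, the number of model fits here is fixed by `n_iter` and does not grow with the size of the search range.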

6. Finally, apply K-means clustering to the concrete dataset.
   a. What is the influence of a standard scaler on the clustering?
   b. What is the optimal number of clusters (elbow method, mean silhouette score)?
   c. Show how the clusters are scattered for the cement and water features.

In [None]:
# Import necessary libraries





# Features for clustering


# Standardize the features



# Determine the optimal number of clusters using the elbow method





# Plot the elbow method


# Elbow method plot



# Silhouette score plot


# b) Judging from the plots, 6 clusters is probably the optimal choice
# Choose the optimal number of clusters (you can choose based on the plots)


# Fit the KMeans model with the optimal number of clusters



# Assign cluster labels to the original DataFrame


# c) Visualize the clustering result (plotting two features: cement and water)



# Display the DataFrame with cluster labels
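
The elbow method and the mean silhouette score from question 6b can be sketched as below. The data are synthetic blobs used as a stand-in (an assumption), and the range of 2 to 10 clusters is an illustrative choice; with the real data, `X` would hold the numeric columns of the concrete DataFrame.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace X with the numeric columns of concrete_data.csv
X, _ = make_blobs(n_samples=500, n_features=5, centers=4, random_state=42)

# K-means is distance-based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)

k_values = list(range(2, 11))
inertias = []
silhouettes = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    # Inertia: within-cluster sum of squared distances (used for the elbow plot)
    inertias.append(km.inertia_)
    # Mean silhouette score over all samples (higher is better)
    silhouettes.append(silhouette_score(X_scaled, km.labels_))

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"k={k:2d}  inertia={inertia:10.2f}  silhouette={sil:.3f}")

# Number of clusters with the highest mean silhouette score
best_k = k_values[int(np.argmax(silhouettes))]
print("Best k by silhouette score:", best_k)
```

Plotting `inertias` and `silhouettes` against `k_values` gives the two graphs the exercise refers to. Repeating the loop on `X` instead of `X_scaled` shows the influence of the scaler asked about in question 6a.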


