# Practice activity: Implementing classification models

## Introduction
In this activity, you will implement and evaluate several classification models in Python, including logistic regression, decision trees, and support vector machines (SVMs). You will load a dataset, preprocess it, and train each model to compare how they perform on a real-world classification task.

By the end of this activity, you will be able to:
- Describe how to preprocess data for classification tasks.
- Implement and train multiple classification models using Python.
- Evaluate and compare the performance of each model.

## Step-by-step process
1. **Step 1**: Set up the environment
2. **Step 2**: Load and explore the dataset
3. **Step 3**: Preprocess the data
4. **Step 4**: Implement a logistic regression model
5. **Step 5**: Implement a decision tree model
6. **Step 6**: Implement a support vector machine model
7. **Step 7**: Implement a Random Forest model
8. **Step 8**: Implement a Naive Bayes model
9. **Step 9**: Implement a Neural Network model
10. **Step 10**: Implement a Reinforcement Learning model (Simple Q-Learning)
11. **Step 11**: Implement a Genetic Algorithm (for Feature Selection)
12. **Step 12**: Bayesian Networks and Markov Decision Processes
13. **Step 13**: Evaluate and compare model performance

## Step 1: Set up the environment

First, make sure the required libraries are installed: Scikit-Learn for the machine learning models, pandas for data manipulation, and matplotlib or seaborn for optional visualization. NumPy, which is used in Step 10, is installed automatically as a dependency of Scikit-Learn.

These libraries will provide the tools to load, manipulate, and visualize the dataset, as well as implement and evaluate classification models.

In [1]:
%pip install scikit-learn pandas matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.


## Step 2: Load and explore the dataset

We will use the Breast Cancer Wisconsin (Diagnostic) dataset bundled with Scikit-Learn. It contains 569 samples with 30 numeric features (inputs) and a binary label (output): 0 for malignant and 1 for benign.

Understanding the dataset helps us determine which preprocessing steps are needed. Because every feature here is numeric, no categorical encoding is required; we only need to check for missing values before training the models.

In [2]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load Breast Cancer dataset and convert to DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Explore the dataset
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (569, 31)


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


## Step 3: Preprocess the data

Preprocessing ensures that your data is clean and ready for ML models to use. Splitting the dataset into training and test sets allows us to evaluate the model’s performance on unseen data.

We will:
1. Handle missing data (if any).
2. Split the data into training (80%) and testing (20%) sets.

In [5]:
from sklearn.model_selection import train_test_split

# Check for missing values
print("Missing values before handling:", df.isnull().sum().sum())

# Handle missing data (filling missing values with the median)
# Note: The breast cancer dataset is clean, but this step is crucial for real-world datasets
df.fillna(df.median(), inplace=True)

print("Missing values after handling:", df.isnull().sum().sum())

# Split the data into features and labels
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")

Missing values before handling: 0
Missing values after handling: 0
Training set size: (455, 30)
Testing set size: (114, 30)
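
The split above is random rather than stratified, and the features are left on their original scales. As a hedged sketch (not part of the original activity), the snippet below shows one common alternative: a stratified split plus a `StandardScaler` fitted inside a pipeline, so the scaling statistics come only from the training rows. The names ending in `_s`, `scaled_log_reg`, and `scaled_accuracy` are illustrative choices, not names used elsewhere in this activity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the class proportions similar in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# The scaler is fitted on the training rows only, when the pipeline is fitted
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_train_s, y_train_s)

scaled_accuracy = scaled_log_reg.score(X_test_s, y_test_s)
print("Scaled logistic regression accuracy:", scaled_accuracy)
```

Because this split differs from the one used in the rest of the activity, its accuracy is not directly comparable with the numbers reported in the later steps.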


## Step 4: Implement a logistic regression model

Logistic regression is a simple yet effective model for binary classification tasks. It models the probability that a given input belongs to a certain class.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train logistic regression model
# Increasing max_iter to ensure convergence for this dataset
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Make predictions
y_pred_log = log_reg.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))

Logistic Regression Accuracy: 0.956140350877193
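
To make the idea of "modeling a probability" concrete, here is a minimal sketch of what a fitted logistic regression computes for one input: a weighted sum of the features passed through the sigmoid function. The weights, bias, and input values are made up for illustration and are not the ones learned by `log_reg`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """P(class 1 | features) for a linear score w . x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights and a single two-feature input
weights = [0.8, -0.5]
bias = 0.1
p = predict_proba([1.0, 2.0], weights, bias)

# Threshold the probability at 0.5 to get a class label
label = 1 if p >= 0.5 else 0
print(p, label)
```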


## Step 5: Implement a decision tree model

Decision trees split the data based on feature values and make decisions at each node. They are highly interpretable but can be prone to overfitting.

In [7]:
from sklearn.tree import DecisionTreeClassifier

# Train decision tree model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Make predictions
y_pred_tree = tree.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))

Decision Tree Accuracy: 0.9473684210526315
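
The way a tree chooses a split can be sketched with the Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The snippet below is a simplified illustration on made-up one-dimensional data: it scores a few candidate thresholds and keeps the one with the lowest weighted impurity. The helper names `gini` and `split_impurity` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy feature values and labels (made up for illustration)
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 0, 0, 0]

candidates = [2.5, 6.5, 10.5]
best = min(candidates, key=lambda t: split_impurity(values, labels, t))
print(best, split_impurity(values, labels, best))
```

A real tree repeats this search over every feature and recursively on each side of the chosen split, which is how it can end up memorizing the training data if its depth is not limited.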


## Step 6: Implement a support vector machine model

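SVMs are powerful models, particularly in high-dimensional spaces. They work by finding a hyperplane that separates data points into different classes with the maximum margin. With its default settings, `SVC()` uses an RBF kernel, so the separating hyperplane lives in a kernel-induced feature space rather than in the original input space.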

In [8]:
from sklearn.svm import SVC

# Train SVM model
svm = SVC()
svm.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm.predict(X_test)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))

SVM Accuracy: 0.9473684210526315
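
For intuition, the linear case can be sketched in a few lines: a point is classified by the sign of `w . x + b`, and the margin between the two boundary lines `w . x + b = +1` and `w . x + b = -1` has width `2 / ||w||`. The hyperplane below is hand-picked for illustration; it is not what `svm` learned, and the RBF kernel used by default adds a non-linear mapping on top of this idea.

```python
import math

def decision_value(x, w, b):
    """Signed score w . x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the boundaries w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

sides = [1 if decision_value(p, w, b) >= 0 else 0 for p in points]
print(sides, margin_width(w))
```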


## Step 7: Implement a Random Forest model

Random Forest is an ensemble method that fits a number of decision tree classifiers on various subsamples of the dataset and averages their predictions to improve predictive accuracy and control overfitting.

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.9649122807017544
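
The two ingredients described above, training on resampled data and combining the individual predictions, can be sketched without any library. This is a deliberately simplified stand-in: each "tree" is a one-threshold stump on a single made-up feature, and predictions are combined by majority vote. A real random forest grows full trees and also samples a subset of features at each split. All names here (`bootstrap_sample`, `fit_stump`, `majority_vote`) are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Tiny stand-in for a tree: threshold halfway between the class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:
        # Degenerate sample with one class only: predict that class
        constant = 1 if ones else 0
        return lambda x: constant
    mean0 = sum(zeros) / len(zeros)
    mean1 = sum(ones) / len(ones)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda x: high_class if x > threshold else 1 - high_class

def majority_vote(models, x):
    """Return the class predicted by most models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
rows = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]

# Each stump sees a different bootstrap sample of the same data
forest = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
preds = [majority_vote(forest, x) for x in (1.2, 8.8)]
print(preds)
```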


## Step 8: Implement a Naive Bayes model

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features.

In [10]:
from sklearn.naive_bayes import GaussianNB

# Train Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)

# Make predictions
y_pred_nb = nb.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))

Naive Bayes Accuracy: 0.9736842105263158
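
To show what the independence assumption buys, here is a minimal Gaussian Naive Bayes written by hand on made-up data. Each class stores a prior plus a mean and variance per feature, and the score of a class is the log prior plus the sum of per-feature log likelihoods. The function names are hypothetical, and the fixed `1e-9` variance floor is a simplification rather than the exact smoothing rule used by `GaussianNB`.

```python
import math

def fit_gaussian_nb(X, y):
    """Per class: prior, per-feature means, per-feature variances."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor keeps the variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[c] = (n / len(X), means, variances)
    return params

def log_gaussian(v, mean, var):
    """Log density of a normal distribution at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def predict_nb(params, x):
    """Pick the class with the largest log prior + summed log likelihoods."""
    scores = {
        c: math.log(prior)
        + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
        for c, (prior, means, variances) in params.items()
    }
    return max(scores, key=scores.get)

# Toy data (made up for illustration)
X_toy = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [3.0, 1.0], [3.2, 0.8], [2.8, 1.2]]
y_toy = [0, 0, 0, 1, 1, 1]

nb_params = fit_gaussian_nb(X_toy, y_toy)
print(predict_nb(nb_params, [1.1, 5.1]), predict_nb(nb_params, [3.1, 0.9]))
```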


## Step 9: Implement a Neural Network model

A multi-layer perceptron (MLP) is a feed-forward neural network trained with supervised learning. It can learn a non-linear function approximator for either classification or regression; here we use `MLPClassifier` for classification.

In [11]:
from sklearn.neural_network import MLPClassifier

# Train Neural Network model
nn = MLPClassifier(max_iter=1000, random_state=42)
nn.fit(X_train, y_train)

# Make predictions
y_pred_nn = nn.predict(X_test)

print("Neural Network Accuracy:", accuracy_score(y_test, y_pred_nn))

Neural Network Accuracy: 0.9385964912280702
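
The "non-linear function" an MLP learns is a chain of weighted sums and activations. The sketch below runs the forward pass of a tiny network with one hidden ReLU layer and a sigmoid output. The weights are hand-picked for illustration (a trained network would learn them), and the layer is far smaller than the 100 hidden units `MLPClassifier` uses by default.

```python
import math

def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: weights[j] holds the weights of output unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def forward(x, W1, b1, W2, b2):
    """Hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = relu(dense(x, W1, b1))
    logit = dense(hidden, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # probability of class 1

# Hypothetical, hand-picked weights
W1 = [[1.0, -1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, -2.0]]
b2 = [0.0]

prob = forward([3.0, 1.0], W1, b1, W2, b2)
print(prob)
```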


## Step 10: Implement a Reinforcement Learning model (Simple Q-Learning)

Reinforcement Learning (RL) involves an agent learning to make decisions by performing actions and receiving rewards. Here, we implement a simplified classifier loosely inspired by **Q-Learning**: the "agent" looks at a sample (state), predicts a class (action), and receives a reward (+1 for correct, -1 for incorrect). Strictly speaking, the class below keeps a single weight vector rather than a Q-table, so it behaves more like a perceptron than a Q-learning agent. Treat it as a conceptual illustration rather than a competitive model.

In [16]:
import numpy as np

class SimpleRLClassifier:
    def __init__(self, learning_rate=0.01, n_iterations=100):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None

    def fit(self, X, y):
        # Initialize weights
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        
        # Convert DataFrame to numpy if needed
        X_arr = np.array(X)
        y_arr = np.array(y)

        for _ in range(self.n_iter):
            for idx, x_i in enumerate(X_arr):
                # Action: predict class 1 if the score (dot product) is >= 0, else class 0
                prediction = 1 if np.dot(x_i, self.weights) >= 0 else 0

                # Reward: +1 if correct, -1 if incorrect
                reward = 1 if prediction == y_arr[idx] else -1

                # Update weights only on mistakes (a perceptron-like rule, not true Q-learning).
                # Caution: inside this branch reward is always -1, so every update subtracts
                # lr * x_i regardless of which class was missed.
                if prediction != y_arr[idx]:
                    self.weights += self.lr * reward * x_i

    def predict(self, X):
        X_arr = np.array(X)
        return np.where(np.dot(X_arr, self.weights) >= 0, 1, 0)
    
    def get_params(self):
        return {"learning_rate": self.lr, "n_iterations": self.n_iter}

# Train RL model
rl_agent = SimpleRLClassifier(learning_rate=0.01, n_iterations=50)
rl_agent.fit(X_train, y_train)

# Make predictions
y_pred_rl = rl_agent.predict(X_test)

print("RL Agent Accuracy:", accuracy_score(y_test, y_pred_rl))

RL Agent Accuracy: 0.37719298245614036
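
The low accuracy has a likely explanation in the update rule. Weights change only after a wrong prediction, and the reward is then always -1, so every update subtracts `lr * x_i` whichever class was missed. Since the features in this dataset are non-negative, the scores turn negative after the first mistake and the classifier keeps predicting class 0. This is consistent with the reported accuracy of about 0.377, which is what a constant class-0 predictor would score if roughly 38% of the test samples belong to class 0.

As a hedged sketch, one way to make the update sign-aware is to use the error `target - prediction` instead of the reward, and to add a bias term. The functions `fit_perceptron` and `predict_perceptron` are hypothetical names, and the toy data is made up and centered around zero. This variant has not been run on the breast cancer data here, so no accuracy is claimed for it.

```python
import numpy as np

def fit_perceptron(X, y, learning_rate=0.01, n_iterations=50):
    """Perceptron-style training with a sign-aware update and a bias term."""
    X_arr = np.asarray(X, dtype=float)
    y_arr = np.asarray(y)
    weights = np.zeros(X_arr.shape[1])
    bias = 0.0
    for _ in range(n_iterations):
        for x_i, target in zip(X_arr, y_arr):
            prediction = 1 if np.dot(x_i, weights) + bias >= 0 else 0
            # error is +1 when class 1 was missed, -1 when class 0 was missed,
            # and 0 when the prediction was correct (no change)
            error = target - prediction
            weights += learning_rate * error * x_i
            bias += learning_rate * error
    return weights, bias

def predict_perceptron(X, weights, bias):
    """Class 1 where the score is >= 0, else class 0."""
    scores = np.asarray(X, dtype=float) @ weights + bias
    return np.where(scores >= 0, 1, 0)

# Tiny linearly separable toy data (made up for illustration)
X_toy = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
y_toy = [0, 0, 0, 1, 1, 1]

w, b = fit_perceptron(X_toy, y_toy)
toy_predictions = predict_perceptron(X_toy, w, b)
print(toy_predictions)
```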


## Step 11: Implement a Genetic Algorithm (for Feature Selection)

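Genetic Algorithms (GA) are used for optimization. In this example, we use a GA to search for a **good subset of features** for a Logistic Regression model. Because `random` is not seeded, results can vary between runs.
1.  **Population**: A set of random feature masks (binary vectors).
2.  **Fitness**: The accuracy of a model trained with the selected features. In the code below, fitness is measured on the test set, so the final reported accuracy is optimistically biased; a separate validation split would give a fairer estimate.
3.  **Crossover/Mutation**: Combining and tweaking masks to create better ones.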

In [17]:
import random
from sklearn.linear_model import LogisticRegression

class GeneticAlgorithmFeatureSelector:
    def __init__(self, n_generations=5, population_size=10, mutation_rate=0.1):
        self.n_generations = n_generations
        self.pop_size = population_size
        self.mutation_rate = mutation_rate
        self.best_mask = None
        self.best_model = None

    def fit(self, X, y, X_test, y_test):
        n_features = X.shape[1]
        # Initialize population (random binary masks)
        population = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(self.pop_size)]

        for gen in range(self.n_generations):
            # Evaluate fitness (accuracy on X_test, y_test).
            # Caution: scoring masks on the test set leaks test information into selection.
            scores = []
            for mask in population:
                # Ensure at least one feature is selected
                if sum(mask) == 0:
                    mask[0] = 1
                selected_features = [i for i, bit in enumerate(mask) if bit]

                clf = LogisticRegression(max_iter=2000)
                clf.fit(X.iloc[:, selected_features], y)
                score = clf.score(X_test.iloc[:, selected_features], y_test)
                scores.append((score, mask, clf))
            
            # Sort by score
            scores.sort(key=lambda x: x[0], reverse=True)
            self.best_mask = scores[0][1]
            self.best_model = scores[0][2]
            
            # Selection & Crossover (Simple)
            top_half = [x[1] for x in scores[:self.pop_size//2]]
            new_population = top_half[:]
            
            while len(new_population) < self.pop_size:
                parent1, parent2 = random.sample(top_half, 2)
                split = random.randint(0, n_features-1)
                child = parent1[:split] + parent2[split:]
                # Mutation
                if random.random() < self.mutation_rate:
                    idx = random.randint(0, n_features-1)
                    child[idx] = 1 - child[idx]
                new_population.append(child)
            
            population = new_population

    def predict(self, X):
        selected_features = [i for i, bit in enumerate(self.best_mask) if bit]
        return self.best_model.predict(X.iloc[:, selected_features])

    def get_params(self):
        return {"n_generations": self.n_generations, "pop_size": self.pop_size}

# Train GA model
ga_model = GeneticAlgorithmFeatureSelector(n_generations=5, population_size=10)
ga_model.fit(X_train, y_train, X_test, y_test)

# Make predictions
y_pred_ga = ga_model.predict(X_test)

print("Genetic Algorithm Optimized Accuracy:", accuracy_score(y_test, y_pred_ga))

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=2000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Genetic Algorithm Optimized Accuracy: 0.9912280701754386
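
Note that the `fit` method above receives `X_test` and `y_test` and ranks masks by test accuracy, so the score printed here measures how well the search fitted the test set rather than how well the model generalizes. A fairer setup would compute fitness on a validation split carved out of the training data (or with cross-validation) and touch the test set only once at the end.

The sketch below isolates the GA loop itself (selection, single-point crossover, mutation) behind a pluggable `fitness` function, so the scoring data can be swapped without touching the search. The function `evolve_masks` and the toy fitness are hypothetical; `toy_fitness` merely stands in for "validation accuracy of a model trained on the masked features".

```python
import random

def evolve_masks(fitness, n_features, n_generations=20, population_size=10,
                 mutation_rate=0.1, seed=0):
    """Return (best_mask, best_score) found by a simple elitist GA."""
    rng = random.Random(seed)  # seeded for reproducible runs
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(population_size)]
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 2]       # selection: keep the top half
        children = []
        while len(parents) + len(children) < population_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)      # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:          # mutation: flip one bit
                i = rng.randrange(n_features)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)

# Hypothetical fitness: rewards "useful" features and penalizes extras
USEFUL = {0, 3, 4}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    extras = sum(1 for i, bit in enumerate(mask) if bit and i not in USEFUL)
    return hits - 0.5 * extras

best_mask, best_score = evolve_masks(toy_fitness, n_features=8)
print(best_mask, best_score)
```

Because the top half of each generation is carried over unchanged, the best score never decreases from one generation to the next.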


## Step 12: Bayesian Networks and Markov Decision Processes

### Bayesian Networks
A Bayesian Network represents variables and their conditional dependencies.
*   **Note**: The **Naive Bayes** model we implemented in **Step 8** is actually a specific, simple type of Bayesian Network where we assume all features are independent given the class label.
*   **Code**: See Step 8 for the implementation.

### Markov Decision Processes (MDP)
MDPs are used for sequential decision-making, not typically for static classification datasets like this one.
*   **Concept**: An agent moves between states ($S$) by taking actions ($A$), receiving rewards ($R$), and transitioning based on probabilities ($P$).
*   **Example Code (Toy)**: Below is a simple class structure showing how an MDP is defined, though we won't apply it to the Breast Cancer dataset as it doesn't fit the problem structure.

In [18]:
# Toy Example of an MDP structure (Not applied to Breast Cancer data)
class SimpleMDP:
    def __init__(self, states, actions, transition_probs, rewards):
        self.states = states
        self.actions = actions
        self.T = transition_probs # P(s' | s, a)
        self.R = rewards          # R(s, a, s')

    def get_policy(self):
        # Placeholder for Value Iteration or Policy Iteration algorithm
        return "Optimal Policy would be calculated here"

# Example usage (Conceptual)
mdp = SimpleMDP(states=['Healthy', 'Sick'], actions=['Treat', 'Wait'], transition_probs={}, rewards={})
print("MDP Structure defined. (Not applicable for static classification)")

MDP Structure defined. (Not applicable for static classification)
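
To give the `get_policy` placeholder above some substance, here is a minimal value iteration sketch for the same two-state example. Every number in it is invented for illustration, and for brevity the reward is written as an expected reward per state and action, `R[s][a]`, rather than the `R(s, a, s')` form noted in the class comments.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """T[s][a] maps next_state -> probability; R[s][a] is the expected reward."""
    def q_value(s, a, V):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once the values have stabilized
            break

    # Greedy policy with respect to the converged values
    policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    return V, policy

# All probabilities and rewards below are invented for illustration
states = ["Healthy", "Sick"]
actions = ["Treat", "Wait"]
T = {
    "Healthy": {"Treat": {"Healthy": 1.0}, "Wait": {"Healthy": 0.9, "Sick": 0.1}},
    "Sick": {"Treat": {"Healthy": 0.8, "Sick": 0.2}, "Wait": {"Sick": 1.0}},
}
R = {
    "Healthy": {"Treat": -1.0, "Wait": 1.0},
    "Sick": {"Treat": -1.0, "Wait": -5.0},
}

V, policy = value_iteration(states, actions, T, R)
print(policy)
```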


## Step 13: Evaluate and compare model performance

We will now create a comprehensive table comparing all the models we have implemented. We will look at Accuracy, Precision, Recall, and F1 Score. We will also list key training parameters.

In [19]:
from sklearn.metrics import precision_score, recall_score, f1_score

# List of (name, model, predictions) tuples
models_results = [
    ("Logistic Regression", log_reg, y_pred_log),
    ("Decision Tree", tree, y_pred_tree),
    ("SVM", svm, y_pred_svm),
    ("Random Forest", rf, y_pred_rf),
    ("Naive Bayes", nb, y_pred_nb),
    ("Neural Network", nn, y_pred_nn),
    ("Reinforcement Learning (Simple)", rl_agent, y_pred_rl),
    ("Genetic Algorithm (Feature Sel.)", ga_model, y_pred_ga)
]

results_data = []

for name, model, y_pred in models_results:
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted')
    rec = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Extract key parameters
    params = model.get_params()
    if "Logistic" in name:
        key_params = f"C={params.get('C')}, solver={params.get('solver')}"
    elif "Decision Tree" in name:
        key_params = f"criterion={params.get('criterion')}, max_depth={params.get('max_depth')}"
    elif "SVM" in name:
        key_params = f"C={params.get('C')}, kernel={params.get('kernel')}"
    elif "Random Forest" in name:
        key_params = f"n_estimators={params.get('n_estimators')}"
    elif "Naive Bayes" in name:
        key_params = f"var_smoothing={params.get('var_smoothing')}"
    elif "Neural Network" in name:
        key_params = f"hidden_layer_sizes={params.get('hidden_layer_sizes')}, activation={params.get('activation')}"
    elif "Reinforcement" in name:
        key_params = f"lr={params.get('learning_rate')}, iter={params.get('n_iterations')}"
    elif "Genetic" in name:
        key_params = f"gens={params.get('n_generations')}, pop={params.get('pop_size')}"
    else:
        key_params = "N/A"

    results_data.append({
        "Model": name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1,
        "Data Scanned (Rows)": len(X),
        "Key Training Params": key_params
    })

results_df = pd.DataFrame(results_data)

# Display the table
results_df


| | Model | Accuracy | Precision | Recall | F1 Score | Data Scanned (Rows) | Key Training Params |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.95614 | 0.956905 | 0.95614 | 0.955801 | 569 | C=1.0, solver=lbfgs |
| 1 | Decision Tree | 0.947368 | 0.947368 | 0.947368 | 0.947368 | 569 | criterion=gini, max_depth=None |
| 2 | SVM | 0.947368 | 0.95147 | 0.947368 | 0.946462 | 569 | C=1.0, kernel=rbf |
| 3 | Random Forest | 0.964912 | 0.965205 | 0.964912 | 0.964738 | 569 | n_estimators=100 |
| 4 | Naive Bayes | 0.973684 | 0.974751 | 0.973684 | 0.973481 | 569 | var_smoothing=1e-09 |
| 5 | Neural Network | 0.938596 | 0.944107 | 0.938596 | 0.937318 | 569 | hidden_layer_sizes=(100,), activation=relu |
| 6 | Reinforcement Learning (Simple) | 0.377193 | 0.142275 | 0.377193 | 0.206615 | 569 | lr=0.01, iter=50 |
| 7 | Genetic Algorithm (Feature Sel.) | 0.991228 | 0.99135 | 0.991228 | 0.991207 | 569 | gens=5, pop=10 |
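
To make the weighted metrics less of a black box, the sketch below recomputes them by hand: precision, recall, and F1 are calculated per class and then averaged with each class's share of the true labels as its weight. The helper names are hypothetical. The demo data is an assumption: a constant class-0 predictor on a test set with 43 samples of class 0 and 71 of class 1, counts chosen because 43/114 is about 0.377, the accuracy reported in the Reinforcement Learning row.

```python
def per_class_counts(y_true, y_pred, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1; undefined ratios count as 0."""
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls in sorted(set(y_true)):
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = (tp + fn) / n  # share of true labels in this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Assumed demo data: a predictor that always outputs class 0
y_true_demo = [0] * 43 + [1] * 71
y_pred_demo = [0] * 114

scores = weighted_scores(y_true_demo, y_pred_demo)
print(scores)
```

With a constant predictor, the class that is never predicted has an undefined precision, which is counted as 0 here. This is the situation in which Scikit-Learn emits an undefined-metric warning.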


## Conclusion

In this activity, you implemented a wide range of classification and optimization models using Python. You started with fundamental models like **Logistic Regression**, **Decision Trees**, and **SVMs**, and went on to **Random Forests**, **Naive Bayes**, and **Neural Networks**.

Furthermore, you explored how **Reinforcement Learning** concepts can be applied to classification, used **Genetic Algorithms** for feature selection, and learned about the theoretical foundations of **Bayesian Networks** and **Markov Decision Processes**.

By training, evaluating, and comparing these models on the same test set, you saw how different algorithms approach classification and how much their results can differ. Keep two caveats in mind when reading the comparison table: the simple RL classifier reached only about 0.38 accuracy, and the Genetic Algorithm's 0.99 accuracy is optimistic because its feature masks were selected using the test set.