Student 1: karam, I.D.: 213611213, GitHub: https://github.com/dieselcode100-alt
Student 2: lamar, I.D.: 214636656, GitHub: https://github.com/l2amar
Student 3: kinda, I.D.: 214576498, GitHub: https://github.com/kabunasra-cmyk

1. Load breast cancer dataset (**structured data**)

For more details about the data: https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.load_breast_cancer.html

In [1]:


from sklearn.datasets import load_breast_cancer
my_data = load_breast_cancer()
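
Before splitting, it can help to look at what `load_breast_cancer()` returns. The sketch below only reads attributes of the returned object (`data`, `target`, `target_names`); the variable names `n_samples`, `n_features`, and `class_counts` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

my_data = load_breast_cancer()

# data is a 2-D array: one row per sample, one column per feature
n_samples, n_features = my_data.data.shape

# target holds integer labels; target_names maps label index -> class name
class_counts = dict(zip(my_data.target_names.tolist(),
                        np.bincount(my_data.target).tolist()))

print(n_samples, n_features)
print(class_counts)
```

The class counts are not equal, which is one reason the split in the next step is stratified.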



2. Split **my_data** into train and test sets:

- Define X_train, X_test, Y_train, Y_test
- Choose **test_size** for splitting **my_data**
- Use **train_test_split** (for details: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html)

In [2]:
from sklearn.model_selection import train_test_split

X = my_data.data
Y = my_data.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,
    random_state=42,
    stratify=Y
)
# Install MLflow so that it can be imported in the next cell
!pip install mlflow
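
Passing `stratify=Y` asks `train_test_split` to keep the class proportions roughly equal in both parts. A quick way to check this is sketched here, reusing the same `test_size` and `random_state` as the split in this notebook; the helper name `positive_rate` is made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

my_data = load_breast_cancer()
X, Y = my_data.data, my_data.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

def positive_rate(labels):
    """Fraction of samples labelled 1 (hypothetical helper)."""
    return float(np.mean(labels == 1))

print("train size:", len(Y_train), "test size:", len(Y_test))
print("share of class 1 overall: ", round(positive_rate(Y), 3))
print("share of class 1 in train:", round(positive_rate(Y_train), 3))
print("share of class 1 in test: ", round(positive_rate(Y_test), 3))
```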




3. Libraries

In [3]:
import itertools
import mlflow
import mlflow.sklearn

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


4. Define MLflow experiment

In [4]:
EXPERIMENT_NAME = "trees_hyperparam"
mlflow.set_experiment(EXPERIMENT_NAME)
# MLflow details: https://mlflow.org/docs/latest/ml/tracking/

2025/12/14 09:38:52 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/14 09:38:52 INFO mlflow.store.db.utils: Updating database tables
2025/12/14 09:38:52 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2025/12/14 09:38:52 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2025/12/14 09:38:52 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2025/12/14 09:38:52 INFO alembic.runtime.migration: Will assume non-transactional DDL.


<Experiment: artifact_location='/content/mlruns/1', creation_time=1765702828230, experiment_id='1', last_update_time=1765702828230, lifecycle_stage='active', name='trees_hyperparam', tags={}>

5. Train **model_decision_tree**

- Library: sklearn.tree.DecisionTreeClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize DecisionTreeClassifier options   

In [5]:


param_1_list = ["gini", "entropy", "log_loss"]  # criterion: split quality measure
param_2_list = [None, 2, 3, 4, 5, 6, 8, 10]     # max_depth: None means no depth limit
param_3_list = [2, 5, 10, 20]                   # min_samples_split

param_grid = list(itertools.product(param_1_list, param_2_list, param_3_list))

for criterion, max_depth, min_samples_split in param_grid:
    with mlflow.start_run():


        mlflow.log_param("model_type", "DecisionTree")
        mlflow.log_param("criterion", criterion)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)


        d_tree = DecisionTreeClassifier(
            criterion=criterion,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42
        )
        d_tree.fit(X_train, Y_train)


        y_pred = d_tree.predict(X_test)

        acc = accuracy_score(Y_test, y_pred)
        pre = precision_score(Y_test, y_pred, zero_division=0)
        rec = recall_score(Y_test, y_pred, zero_division=0)
        f1  = f1_score(Y_test, y_pred, zero_division=0)


        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision_score", pre)
        mlflow.log_metric("recall_score", rec)
        mlflow.log_metric("f1_score", f1)
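
The loop above scores each configuration on the test set. A common alternative, sketched here and not what this notebook does, is to compare configurations with cross-validation on the training data only and keep the test set for a final check. The sketch assumes scikit-learn's `GridSearchCV` with 5-fold cross-validation and F1 as the selection metric; both choices are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

my_data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    my_data.data, my_data.target,
    test_size=0.2, random_state=42, stratify=my_data.target,
)

# Same value lists as the manual decision-tree grid in this notebook
grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [None, 2, 3, 4, 5, 6, 8, 10],
    "min_samples_split": [2, 5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=grid,
    scoring="f1",  # assumed selection metric
    cv=5,          # assumed number of folds
)
search.fit(X_train, Y_train)

print("best params:", search.best_params_)
print("mean CV F1 of best params:", round(search.best_score_, 3))
# With scoring="f1", search.score also reports F1 (of the refit best model)
print("test F1 of refit model:", round(search.score(X_test, Y_test), 3))
```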


6. Train model_random_forest
- Library: sklearn.ensemble.RandomForestClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize RandomForestClassifier options

In [6]:


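param_1_list = [50, 100, 200]           # n_estimators: number of trees
param_2_list = [None, 3, 5, 8, 12]      # max_depth: None means no depth limit
param_3_list = ["sqrt", "log2", None]   # max_features: None means all features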

param_grid = list(itertools.product(param_1_list, param_2_list, param_3_list))

for n_estimators, max_depth, max_features in param_grid:
    with mlflow.start_run():


        mlflow.log_param("model_type", "RandomForest")
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("max_features", max_features)


        rf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            max_features=max_features,
            random_state=42,
            n_jobs=-1
        )
        rf.fit(X_train, Y_train)


        y_pred = rf.predict(X_test)

        acc = accuracy_score(Y_test, y_pred)
        pre = precision_score(Y_test, y_pred, zero_division=0)
        rec = recall_score(Y_test, y_pred, zero_division=0)
        f1  = f1_score(Y_test, y_pred, zero_division=0)


        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision_score", pre)
        mlflow.log_metric("recall_score", rec)
        mlflow.log_metric("f1_score", f1)
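
The four metric calls are repeated in every training loop. One way to avoid the repetition, sketched here with a hypothetical helper named `evaluate`, is to compute them once and return a dictionary keyed by the metric names used in this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Return the four metrics logged in this notebook as a dict (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_score": precision_score(y_true, y_pred, zero_division=0),
        "recall_score": recall_score(y_true, y_pred, zero_division=0),
        "f1_score": f1_score(y_true, y_pred, zero_division=0),
    }

# Tiny hand-made example: one positive sample is missed by the prediction
metrics = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
print(metrics)
```

Inside a run, a dictionary like this can be passed to `mlflow.log_metrics(metrics)` in place of four separate `mlflow.log_metric` calls.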


7. Train model_adaboost

- Library: sklearn.ensemble.AdaBoostClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize AdaBoostClassifier options

In [7]:


from sklearn.tree import DecisionTreeClassifier

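param_1_list = [50, 100, 200]        # n_estimators: maximum number of boosting rounds
param_2_list = [0.01, 0.1, 0.5, 1.0] # learning_rate
param_3_list = [1, 2, 3]             # max_depth of the base decision tree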

param_grid = list(itertools.product(param_1_list, param_2_list, param_3_list))

for n_estimators, learning_rate, est_max_depth in param_grid:
    with mlflow.start_run():


        mlflow.log_param("model_type", "AdaBoost")
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("estimator_max_depth", est_max_depth)


        base_est = DecisionTreeClassifier(max_depth=est_max_depth, random_state=42)


        ada = AdaBoostClassifier(
            estimator=base_est,
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            random_state=42
        )
        ada.fit(X_train, Y_train)


        y_pred = ada.predict(X_test)

        acc = accuracy_score(Y_test, y_pred)
        pre = precision_score(Y_test, y_pred, zero_division=0)
        rec = recall_score(Y_test, y_pred, zero_division=0)
        f1  = f1_score(Y_test, y_pred, zero_division=0)

        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision_score", pre)
        mlflow.log_metric("recall_score", rec)
        mlflow.log_metric("f1_score", f1)


8. Store the results

In [8]:
from google.colab import files

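# Collect every run of the experiment into a pandas DataFrame
df = mlflow.search_runs(experiment_names=[EXPERIMENT_NAME])

# Drop timestamp columns (e.g. start_time, end_time) before exporting
df = df.drop(columns=[col for col in df.columns if "time" in col.lower()], errors="ignore")
# Writing .xlsx files requires the openpyxl package
df.to_excel("karam_kabha_results.xlsx", index=False)

# files.download works only inside Google Colab
files.download("karam_kabha_results.xlsx")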
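
`mlflow.search_runs` returns a pandas DataFrame in which parameters and metrics appear as columns prefixed with `params.` and `metrics.`. The sketch below shows one way to pick the best run per model type by F1. The small DataFrame is made-up stand-in data with that column layout, not results from this experiment.

```python
import pandas as pd

# Stand-in for the DataFrame returned by mlflow.search_runs (all values are invented)
runs = pd.DataFrame({
    "params.model_type": ["DecisionTree", "DecisionTree", "RandomForest", "AdaBoost", "AdaBoost"],
    "params.max_depth": ["3", "5", "8", None, None],
    "metrics.f1_score": [0.91, 0.93, 0.96, 0.95, 0.97],
})

# Row index of the highest-F1 run within each model type
best_idx = runs.groupby("params.model_type")["metrics.f1_score"].idxmax()
best_runs = runs.loc[best_idx].sort_values("metrics.f1_score", ascending=False)

print(best_runs[["params.model_type", "metrics.f1_score"]].to_string(index=False))
```

The same grouping can be applied to the real `df`, as long as the `params.model_type` and `metrics.f1_score` columns are present.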

