# Example usage

To use `mds_2025_helper_functions` in a project:

## Imports

In [22]:
from mds_2025_helper_functions.scores import compare_model_scores
from sklearn.datasets import load_iris, load_diabetes
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')

## Compare CV scores of multiple models

`compare_model_scores()` is a wrapper around scikit-learn's `cross_validate()` that lets you compare mean cross-validation scores across multiple models. The only difference from calling `cross_validate()` is that it takes several model objects rather than one.

Note: The default scoring metric is R² for regression and accuracy for classification tasks.
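To make the wrapper idea concrete, here is a minimal sketch of the manual equivalent: run `cross_validate()` once per model and average the per-fold results. This is an illustration under assumptions, not the package's actual implementation; the names `models` and `manual_scores` are made up for the example, and `random_state=0` is added only to make the tree reproducible.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
models = [DummyRegressor(), DecisionTreeRegressor(random_state=0)]

names, records = [], []
for model in models:
    # cross_validate returns a dict of per-fold arrays
    # (fit_time, score_time, test_score with the default settings).
    cv_results = cross_validate(model, X, y)
    names.append(type(model).__name__)
    # Average each array over the folds to get one row per model.
    records.append({key: values.mean() for key, values in cv_results.items()})

# One row per model, indexed by the model's class name (hypothetical layout).
manual_scores = pd.DataFrame(records, index=pd.Index(names, name="model"))
print(manual_scores)
```

The loop is the boilerplate that a helper like `compare_model_scores()` saves you from writing each time.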

### Basic usage
To demonstrate, let's load a sample dataset and instantiate our models. We'll use the Diabetes dataset from scikit-learn, which contains 10 baseline variables and a measure of diabetes progression after one year. To learn more about this dataset, visit its documentation: https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset

In [23]:
X, y = load_diabetes(return_X_y=True)
dummy_regressor = DummyRegressor()
tree_regressor = DecisionTreeRegressor()

This is already enough for our basic use of the function. Simply pass these to `compare_model_scores()`.

Note: These regressors are scored with R², the default metric for regression. A negative R² means the model performs worse than always predicting the mean of the target.

In [24]:
compare_model_scores(dummy_regressor, tree_regressor, X=X, y=y)

| model | fit_time | score_time | test_score |
|---|---|---|---|
| DummyRegressor | 0.000311 | 0.000366 | -0.027506 |
| DecisionTreeRegressor | 0.003522 | 0.000706 | -0.128997 |


As you can see, the function returns a dataframe with the performance statistics for each model. The model names are used for the index.
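Both test scores above are negative, so it may help to see where a negative R² comes from. The sketch below computes R² by hand from its standard definition, 1 - SS_res / SS_tot, on a small made-up target; the helper `r2` and the toy values are illustrative and not part of the package.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]          # made-up targets
mean_prediction = [2.5] * 4            # always predict the mean of y_true
poor_prediction = [4.0, 3.0, 2.0, 1.0] # predictions in reverse order

r2_mean = r2(y_true, mean_prediction)  # 0.0: exactly as good as the mean
r2_poor = r2(y_true, poor_prediction)  # -3.0: worse than the mean
print(r2_mean, r2_poor)
```

A `DummyRegressor` predicts the mean of the training folds, which generally differs a little from the mean of the held-out fold, so its cross-validated R² tends to land slightly below zero rather than exactly at zero.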

### Using `cross_validate()` arguments
Like `cross_validate()`, the function also works for classification models, and you can pass arguments to return training scores or to use a different scoring metric.

For classification, we'll use the Iris dataset from scikit-learn, which contains measurements of iris flowers from three species. To learn more about this dataset, visit its documentation: https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset

In [25]:
X, y = load_iris(return_X_y=True)
dummy_classifier = DummyClassifier()
tree_classifier = DecisionTreeClassifier()
scoring_metric = "f1_macro"                 # A scoring metric for multiclass classification

compare_model_scores(dummy_classifier, tree_classifier, X=X, y=y, return_train_scores=True, scoring=scoring_metric)

| model | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| DummyClassifier | 0.000196 | 0.00113 | 0.166667 | 0.166667 |
| DecisionTreeClassifier | 0.000527 | 0.001102 | 0.9599 | 1.0 |
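The `DummyClassifier` score of 0.166667 can be reproduced by hand, which also shows what `f1_macro` measures: the unweighted mean of the per-class F1 scores. The sketch below assumes a fold with 10 samples of each of the three species (what a stratified 5-fold split of Iris would give) and a classifier that always predicts the same class; the helper `f1_for_label` is written for this illustration only.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    return 2 * tp / (2 * tp + fp + fn)

# Assumed fold: 10 samples per class, constant prediction of class 0.
y_true = [0] * 10 + [1] * 10 + [2] * 10
y_pred = [0] * 30

per_class = [f1_for_label(y_true, y_pred, label) for label in (0, 1, 2)]
macro_f1 = sum(per_class) / len(per_class)
print(per_class, macro_f1)  # [0.5, 0.0, 0.0] and 1/6 ≈ 0.1667
```

Only the predicted class gets a non-zero F1 (precision 1/3, recall 1), so averaging over three classes gives 1/6.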


### Passing multiple models of the same type

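When you compare several models of the same type, repeated names are distinguished in the output table by a numeric suffix based on the order in which the models were passed to `compare_model_scores()`. The first model keeps the plain class name and the second gets the suffix `_2`.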

In [26]:
second_tree_classifier = DecisionTreeClassifier(max_depth=3)

compare_model_scores(tree_classifier, second_tree_classifier, X=X, y=y)

| model | fit_time | score_time | test_score |
|---|---|---|---|
| DecisionTreeClassifier | 0.000597 | 0.000388 | 0.966667 |
| DecisionTreeClassifier_2 | 0.000505 | 0.000444 | 0.96 |
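One way such labels could be produced is sketched below. The function `label_models` is hypothetical and only reproduces the naming seen in the output above (plain class name first, then `_2`, and by extension `_3` and so on); the package may build its index differently.

```python
from collections import Counter

def label_models(class_names):
    """Return unique labels, adding _2, _3, ... to repeated names."""
    seen = Counter()
    labels = []
    for name in class_names:
        seen[name] += 1
        # The first occurrence keeps the plain name; later ones get a suffix.
        labels.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return labels

labels = label_models(
    ["DecisionTreeClassifier", "DummyClassifier", "DecisionTreeClassifier"]
)
print(labels)
# ['DecisionTreeClassifier', 'DummyClassifier', 'DecisionTreeClassifier_2']
```

Because the suffix depends only on argument order, it is worth passing models in a consistent order so that the rows stay comparable between runs.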


## Perform exploratory data analysis (EDA)

In [None]:
# Sam

## Summarize a dataset

In [None]:
# Karlygash

## Visualize hypothesis tests

In [27]:
#CiXu