### Getting the dataset from Kaggle

You can download any dataset from Kaggle by specifying its namespace in the form `<username>/<dataset_name>`.

Make sure you have a Kaggle account and have generated an API key for it. Both `KAGGLE_USERNAME` and `KAGGLE_KEY` must be set as environment variables.

By default the dataset is downloaded to the system cache location. To change this, set the `KAGGLEHUB_CACHE` environment variable to the desired path.
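Before calling the download helper, it can be useful to check that the credentials are in place and that the dataset name has the expected shape. The following is a minimal sketch; `missing_kaggle_vars` and `split_namespace` are hypothetical helpers written for this illustration and are not part of `mlutils`.

```python
import os

# Environment variables the text above says must be set
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_vars(env=None):
    """Return the names of required Kaggle variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def split_namespace(dataset_name):
    """Split '<username>/<dataset_name>' into its two parts."""
    username, sep, name = dataset_name.partition("/")
    if not sep or not username or not name or "/" in name:
        raise ValueError(
            f"expected '<username>/<dataset_name>', got {dataset_name!r}"
        )
    return username, name


if __name__ == "__main__":
    missing = missing_kaggle_vars()
    if missing:
        print("Set these environment variables first:", ", ".join(missing))
    # Optional: redirect downloads away from the default cache location.
    # os.environ["KAGGLEHUB_CACHE"] = "/path/to/cache"  # placeholder path
    print(split_namespace("denkuznetz/housing-prices-regression"))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.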

In [5]:
import os
from mlutils.utils.kaggle import fetch_kaggle_dataset

data_path = fetch_kaggle_dataset(
    dataset_name="denkuznetz/housing-prices-regression",
)

# List the files extracted into the download directory
os.listdir(data_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/denkuznetz/housing-prices-regression?dataset_version_number=1...


100%|██████████| 23.6k/23.6k [00:00<00:00, 445kB/s]

Extracting files...





['real_estate_dataset.csv']

### Reading any dataset

Dataset configuration is specified in the `config` directory.

Currently only tabular data (classification and regression) is supported. Support for non-tabular data is still pending.

Even for tabular datasets, the data is assumed to be provided as a single CSV file. Support for reading multiple files from Kaggle, and for other formats such as Parquet, is still pending.
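The single-CSV assumption can be made explicit with a small check on the download directory. This is only a sketch using the standard library; `find_single_csv` is a hypothetical helper, not a function from `mlutils`, and the temporary directory stands in for a real download location.

```python
import tempfile
from pathlib import Path


def find_single_csv(directory):
    """Return the only .csv file in `directory`; raise if there is not exactly one."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    if len(csv_files) != 1:
        raise ValueError(
            f"expected exactly one CSV file in {directory}, found {len(csv_files)}"
        )
    return csv_files[0]


# Demonstration on a throwaway directory containing one CSV file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "real_estate_dataset.csv").write_text("ID,Price\n1,100\n")
    found_name = find_single_csv(tmp).name

print(found_name)  # real_estate_dataset.csv
```

Failing early here gives a clearer message than a later read error when a dataset ships several files.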

Depending on the type of dataset, you can manually specify the dataset specs along with the hyperparameter configuration you want to test against.

This keeps settings that do not depend on code functionality, such as the dataset path, target column, `param_grid`, and dataset type, out of the code itself.
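To make the idea concrete, here is a rough sketch of what such a specification could look like. `DatasetSpec` and its fields are assumptions made for illustration, modelled on the settings named above; the actual objects in the `config` directory may be structured differently.

```python
from dataclasses import dataclass, field
from pathlib import Path

VALID_TYPES = ("classification", "regression")


@dataclass
class DatasetSpec:
    """Hypothetical stand-in for a dataset configuration object."""

    name: str
    type: str
    path: Path
    target_column: str
    # Maps a model name to {"model": ..., "param_grid": {...}}
    param_grid_map: dict = field(default_factory=dict)

    def __post_init__(self):
        # Only the two supported tabular task types are accepted
        if self.type not in VALID_TYPES:
            raise ValueError(f"type must be one of {VALID_TYPES}, got {self.type!r}")

    def basic_info(self):
        """Return a short, human-readable summary of the spec."""
        return "\n".join([
            f"Dataset name {self.name}",
            f"Dataset type {self.type}",
            f"Dataset path {self.path}",
            f"Target column {self.target_column}",
        ])


spec = DatasetSpec(
    name="regression_dataset",
    type="regression",
    path=Path("data/real_estate_dataset.csv"),  # placeholder path
    target_column="Price",
    param_grid_map={
        # The model is a plain string here; a real config would hold an estimator.
        "Ridge": {"model": "Ridge", "param_grid": {"alpha": [0.1, 1.0, 10.0]}},
    },
)
print(spec.basic_info())
```

Keeping these values in one object means a new dataset can be added by writing a new spec rather than editing the training code.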

In [1]:
import sys

from mlutils.utils.io import find_git_root

# Add the repository's config directory to the import path
_dir = find_git_root() / "config"
sys.path.append(str(_dir))

from config.classifier import dataset as classifier_dataset
from config.regression import dataset as regression_dataset



In [2]:
classifier_dataset.display_basic_info()

Dataset name classification_dataset
Dataset type classification
Dataset path C:\Users\dusad\Documents\Projects\agnei_consulting\mlutils\data\datasets\blastchar\telco-customer-churn\versions\1\WA_Fn-UseC_-Telco-Customer-Churn.csv
Target column Churn
Imbalanced: True
Binary: True


In [3]:
regression_dataset.display_basic_info()

Dataset name regression_dataset
Dataset type regression
Dataset path C:\Users\dusad\Documents\Projects\agnei_consulting\mlutils\data\datasets\denkuznetz\housing-prices-regression\versions\1\real_estate_dataset.csv
Target column Price


In [4]:
import pandas as pd

df = pd.read_csv("C:\\Users\\dusad\\Documents\\Projects\\Sigmoid-Case-Study\\train (6).csv")

# Fraction of missing values per column
null_percentage = df.isnull().mean()

# Drop every feature with more than 50% of its values missing
features_to_remove = null_percentage[null_percentage > 0.5].index
df.drop(columns=features_to_remove)

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var217,Var218,Var219,Var220,Var221,Var222,Var223,Var226,Var227,Var228
0,1526.0,7.0,184.0,464.0,580.0,14.0,128.0,166.56,0.0,3570.0,...,sH5Z,cJvF,FzaX,1YVfGrO,oslk,fXVEsaq,jySVZNlOJy,xb3V,RAYp,F2FyR07IdsN7I
1,525.0,0.0,0.0,168.0,210.0,2.0,24.0,353.52,0.0,4764966.0,...,,,FzaX,0AJo2f2,oslk,2Kb5FSF,LM8l689qOp,fKCe,RAYp,F2FyR07IdsN7I
2,5236.0,7.0,904.0,1212.0,1515.0,26.0,816.0,220.08,0.0,5883894.0,...,bHR7,UYBR,FzaX,JFM1BiF,Al6ZaUT,NKv4yOc,jySVZNlOJy,Qu4f,02N6s8f,ib5G6X1eUxUn6
3,,0.0,0.0,,0.0,,0.0,22.08,0.0,0.0,...,eKej,UYBR,FzaX,L91KIiz,oslk,CE7uk3u,LM8l689qOp,FSa2,RAYp,F2FyR07IdsN7I
4,1029.0,7.0,3216.0,64.0,80.0,4.0,64.0,200.00,0.0,0.0,...,H3p7,UYBR,FzaX,OrnLfvc,oslk,1J2cvxe,LM8l689qOp,FSa2,RAYp,F2FyR07IdsN7I
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,357.0,0.0,0.0,132.0,165.0,2.0,0.0,288.08,0.0,6042420.0,...,XXsx,cJvF,FzaX,3JmRJnY,oslk,EROH7Cg,LM8l689qOp,7FJQ,RAYp,F2FyR07IdsN7I
49996,1078.0,0.0,2736.0,380.0,475.0,2.0,88.0,166.56,0.0,0.0,...,4a9J,UYBR,FzaX,MMTv4zN,oslk,GfSQowC,LM8l689qOp,FSa2,RAYp,55YFVY9
49997,2807.0,7.0,1460.0,568.0,710.0,4.0,328.0,166.56,0.0,42210.0,...,DV70,UYBR,FzaX,FM28hdx,oslk,dh6qI2t,LM8l689qOp,fKCe,RAYp,TCU50_Yjmm6GIBZ0lL_
49998,,,,,,,,,,,...,8Mfr,UYBR,FzaX,BV9YlW4,oslk,2fF2Oqu,LM8l689qOp,FSa2,RAYp,F2FyR07IdsN7I


### Running AutoML

#### Classifier dataset

In [5]:
classifier_dataset.display_basic_info()

Dataset name classification_dataset
Dataset type classification
Dataset path C:\Users\dusad\Documents\Projects\agnei_consulting\mlutils\data\datasets\blastchar\telco-customer-churn\versions\1\WA_Fn-UseC_-Telco-Customer-Churn.csv
Target column Churn
Imbalanced: True
Binary: True


In [6]:
## Reading local dataset
from mlutils.utils.io import read_local_data

X, y = read_local_data(classifier_dataset)
X

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5
2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9
4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45
8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6


In [11]:
from mlutils.automl.train import model_tune

# Tuning settings come from the dataset configuration
param_grid_map = classifier_dataset.param_grid_map
search_algo = classifier_dataset.optimization_type
cross_val_scoring = classifier_dataset.cross_val_scoring
evaluation_type = classifier_dataset.type
mlflow_expt_name = 'test_classification'

imbalanced = classifier_dataset.imbalanced

# Tune every configured model; each run is logged to the MLflow experiment
for model_name, model_info in param_grid_map.items():
    model = model_info['model']
    param_grid = model_info['param_grid']

    model_tune(
        X, y, model_name, model, param_grid, search_algo,
        mlflow_expt_name, cross_val_scoring,
        evaluation_type,
    )

2025/06/09 12:08:54 INFO mlflow.tracking.fluent: Experiment with name 'test_classification' does not exist. Creating a new experiment.


---LogisticRegression----


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

🏃 View run LogisticRegression at: http://localhost:5000/#/experiments/3/runs/7089c0d26d3142beb0d6d72ca0e57989
🧪 View experiment at: http://localhost:5000/#/experiments/3
---GradientBoostingClassifier----


2025-06-09 12:14:18,159 - Eval - INFO - Accuracy: {_accuracy}
2025-06-09 12:14:18,171 - Eval - INFO - Precision: {_precision}
2025-06-09 12:14:18,184 - Eval - INFO - Recall: {_recall}
2025-06-09 12:14:18,189 - Eval - INFO - F1 Score: {_f1_score}
2025-06-09 12:14:18,192 - Eval - INFO - All metrics: {'accuracy': 0.7409510290986515, 'precision': 0.5067567567567568, 'recall': 0.8042895442359249, 'f1': 0.6217616580310881}


🏃 View run GradientBoostingClassifier at: http://localhost:5000/#/experiments/3/runs/b56160b6f9a04ac7904a3c72d7d857a2
🧪 View experiment at: http://localhost:5000/#/experiments/3
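
As a quick sanity check on the logged metrics, the F1 score can be recomputed from the precision and recall values in the `All metrics` line above, using the standard definition F1 = 2PR / (P + R). This is a worked check on the printed numbers only; it does not reproduce how `mlutils` computes them.

```python
# Values copied from the "All metrics" log line for GradientBoostingClassifier
precision = 0.5067567567567568
recall = 0.8042895442359249
logged_f1 = 0.6217616580310881

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.6218
```

The recomputed value agrees with the logged `f1`. Recall is well above precision here, so the model flags most churners at the cost of a fair number of false positives.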


#### Regression dataset

In [5]:
## Reading local dataset
from mlutils.utils.io import read_local_data

X, y = read_local_data(regression_dataset)
X

ID,Square_Feet,Num_Bedrooms,Num_Bathrooms,Num_Floors,Year_Built,Has_Garden,Has_Pool,Garage_Size,Location_Score,Distance_to_Center
1,143.635030,1,3,3,1967,1,1,48,8.297631,5.935734
2,287.678577,1,2,1,1949,0,1,37,6.061466,10.827392
3,232.998485,1,3,2,1923,1,0,14,2.911442,6.904599
4,199.664621,5,2,2,1918,0,0,17,2.070949,8.284019
5,89.004660,4,3,3,1999,1,0,34,1.523278,14.648277
...,...,...,...,...,...,...,...,...,...,...
496,138.338057,2,2,2,1967,1,0,16,4.296086,5.562583
497,195.914028,2,3,1,1977,0,1,45,7.406261,2.845105
498,69.433659,1,1,2,2004,0,0,18,8.629724,6.263264
499,293.598702,5,1,3,1940,1,0,41,5.318891,16.990684


In [None]:
from mlutils.automl.train import model_tune

# Tuning settings come from the dataset configuration
param_grid_map = regression_dataset.param_grid_map
search_algo = regression_dataset.optimization_type
cross_val_scoring = regression_dataset.cross_val_scoring
evaluation_type = regression_dataset.type
mlflow_expt_name = 'test_regression'

imbalanced = regression_dataset.imbalanced

# Tune every configured model; each run is logged to the MLflow experiment
for model_name, model_info in param_grid_map.items():
    model = model_info['model']
    param_grid = model_info['param_grid']

    model_tune(
        X, y, model_name, model, param_grid, search_algo,
        mlflow_expt_name, cross_val_scoring,
        evaluation_type, imbalanced,
    )

2025/06/09 12:17:29 INFO mlflow.tracking.fluent: Experiment with name 'test_regression' does not exist. Creating a new experiment.


---LinearRegression----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run LinearRegression at: http://localhost:5000/#/experiments/4/runs/f00a21e38c494f5b93e027509d330c3e
🧪 View experiment at: http://localhost:5000/#/experiments/4
---Ridge----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run Ridge at: http://localhost:5000/#/experiments/4/runs/2fa77786bd0a4e50962f167bec8c576b
🧪 View experiment at: http://localhost:5000/#/experiments/4
---Lasso----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run Lasso at: http://localhost:5000/#/experiments/4/runs/a35b934877fb42559df675ee16febeb8
🧪 View experiment at: http://localhost:5000/#/experiments/4
---GradientBoostingRegressor----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run GradientBoostingRegressor at: http://localhost:5000/#/experiments/4/runs/bf063e781d5047ff91eab7b56e39cda3
🧪 View experiment at: http://localhost:5000/#/experiments/4
---RandomForestRegressor----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run RandomForestRegressor at: http://localhost:5000/#/experiments/4/runs/16c3fe5323424fcf8f86cb64dc43b873
🧪 View experiment at: http://localhost:5000/#/experiments/4
---AdaBoostRegressor----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package

🏃 View run AdaBoostRegressor at: http://localhost:5000/#/experiments/4/runs/7eff051649d14b6fa52c53cd37f88ca3
🧪 View experiment at: http://localhost:5000/#/experiments/4
---BaggingRegressor----


Traceback (most recent call last):
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\model_selection\_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_scorer.py", line 388, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\utils\_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-packages\sklearn\metrics\_classification.py", line 2429, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "c:\Users\dusad\miniconda3\envs\agnei\lib\site-package
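
Every traceback above passes through `recall_score` in `sklearn\metrics\_classification.py`, which suggests that the regression runs are being scored with a classification metric. The configuration itself is not shown, so this is an assumption about the value of `cross_val_scoring`. A small guard like the one below could catch such a mismatch before tuning starts; `check_scoring` is a hypothetical helper, and the sets list only a subset of the scorer names scikit-learn accepts.

```python
# A subset of the string names accepted by scikit-learn's `scoring` parameter
REGRESSION_SCORERS = {
    "r2",
    "neg_mean_squared_error",
    "neg_root_mean_squared_error",
    "neg_mean_absolute_error",
}
CLASSIFICATION_SCORERS = {"accuracy", "precision", "recall", "f1", "roc_auc"}

ALLOWED = {
    "regression": REGRESSION_SCORERS,
    "classification": CLASSIFICATION_SCORERS,
}


def check_scoring(task_type, scoring):
    """Return `scoring` if it suits `task_type`; raise ValueError otherwise."""
    if task_type not in ALLOWED:
        raise ValueError(f"unknown task type: {task_type!r}")
    if scoring not in ALLOWED[task_type]:
        raise ValueError(
            f"scoring {scoring!r} is not in the allowed list for {task_type}; "
            f"choose one of {sorted(ALLOWED[task_type])}"
        )
    return scoring


print(check_scoring("regression", "neg_root_mean_squared_error"))
```

Calling `check_scoring("regression", "recall")` raises a `ValueError` that names the valid regression scorers, which is easier to act on than a traceback from inside cross-validation.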

### Cloud Interface

In [None]:
### AWS
from mlutils.cloud.aws import AWSStorage

filename = "test_dump/uploaded_file.txt"
local_file_path = "a.txt"

aws_storage = AWSStorage()
# aws_storage.upload_file(upload_file=local_file_path,
#                         filename=filename)

# aws_storage.generate_presigned_url_for_object(filename,
#                                               expiration=3600)  # URL valid for 1 hour

# aws_storage.create_bucket('crete-b')
# aws_storage.delete_bucket('crete-b')
# aws_storage.list_buckets()

# aws_storage.list_objects('expt')

In [None]:
### GCP
from mlutils.cloud.gcp import GCPStorage

# GCS bucket and file info
bucket_name = "expt-mandrakebio"
destination_blob_name = "folder/your_file.txt"  # GCS path
local_file_path = "a.txt"  # Local file to upload


# upload_blob(bucket_name, destination_blob_name, local_file_path)
gcp_storage = GCPStorage()
# gcp_storage.upload_file(upload_file=local_file_path,
#                         filename=destination_blob_name)

# gcp_storage.generate_presigned_url_for_object(destination_blob_name)

gcp_storage.list_buckets()

In [None]:
### Azure
from mlutils.cloud.azure import upload_file_to_blob

blob_name = "folder/uploaded_file.txt"
local_file_path = "a.txt"

upload_file_to_blob(
    blob_name=blob_name,
    local_file_path=local_file_path,
)