# Example
## Introduction

**Datalizer** is a Python package designed to simplify early-stage data analysis.

Its goal is to help users:
- Load and validate structured datasets
- Detect and resolve common data issues (e.g. missing values, duplicates)
- Prepare numerical data for modeling

## Install Datalizer Package

In [None]:
!pip install datalizer

## Imports

In [2]:
import datalizer as dl

## Loading in Data

The first step in any data project is to load your dataset.  
`Datalizer` provides a convenient function called `load_data()` that reads `.csv`, `.xlsx`, or `.json` files and ensures that the dataset is entirely numerical.

If non-numeric data is detected, `load_data()` raises an error, which helps you catch issues early in the pipeline.
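
To make the numeric-only check concrete, here is a minimal sketch of how such a loader could be written with pandas. The helper `load_numeric`, its `file_type` argument, and its error message are hypothetical and are not Datalizer's implementation; only the pandas calls are real.

```python
import io
import pandas as pd

def load_numeric(source, file_type="csv"):
    """Hypothetical stand-in for a numeric-only loader (not Datalizer's code)."""
    readers = {"csv": pd.read_csv, "json": pd.read_json, "xlsx": pd.read_excel}
    if file_type not in readers:
        raise ValueError(f"Unsupported file type: {file_type}")
    df = readers[file_type](source)
    # Collect every column whose dtype is not numeric
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"Non-numeric columns found: {non_numeric}")
    return df

# In-memory CSV text keeps the example self-contained
good = load_numeric(io.StringIO("age,weight\n25,68\n30,75\n"))

try:
    load_numeric(io.StringIO("age,name\n25,Ann\n"))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Failing at load time, rather than later during modeling, keeps the error close to the file that caused it.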

In [3]:
# Specify a file path
file_path = "sample_numerical.csv"

# Load a sample dataset
df = dl.load_data(file_path=file_path)

# Show the first few rows
df.head()

   age  weight  height
0   25      68   175.0
1   30      75   180.0
2   22      60   165.0
3   35      80   185.0
4   35      80   185.0


## Checking for Issues

Before cleaning a dataset, it's good practice to identify any existing problems.  
The `check_for_issues()` function quickly inspects your dataset and reports:

- Number of missing values
- Number of duplicate rows
- The problematic rows themselves, if there are any

This step helps you decide what kind of cleaning strategy to apply.
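
The same checks can be approximated directly in pandas. The sketch below is an assumption about what such a report involves, not Datalizer's code: `summarize_issues` is a hypothetical helper, and the small frame reuses the values of this notebook's sample dataset.

```python
import numpy as np
import pandas as pd

def summarize_issues(df):
    """Hypothetical helper: count missing cells and duplicate rows."""
    missing_rows = df[df.isna().any(axis=1)]
    # duplicated() flags each repeat of an earlier row, not the first occurrence
    duplicate_rows = df[df.duplicated()]
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "missing_rows": missing_rows,
        "duplicate_count": len(duplicate_rows),
        "duplicate_rows": duplicate_rows,
    }

sample = pd.DataFrame({
    "age": [25, 30, 22, 35, 35, 28],
    "weight": [68, 75, 60, 80, 80, 70],
    "height": [175.0, 180.0, 165.0, 185.0, 185.0, np.nan],
})
report = summarize_issues(sample)
print(report["missing_cells"], report["duplicate_count"])
```

Note that missing values are counted per cell while duplicates are counted per row, so the two numbers are not directly comparable.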

In [4]:
dl.check_for_issues(df)


Number of missing cells: 1

Rows with missing values:
   age  weight  height
5   28      70     NaN

Number of duplicate rows: 1

Duplicate rows:
   age  weight  height
4   35      80   185.0


## Cleaning the Dataset

Once issues are detected, you can clean the dataset using `clean_basic()`.

This function:
- Removes duplicate rows
- Handles missing values based on a selected strategy:
  - `"mean"` – fill missing values with column means
  - `"median"` – fill with column medians
  - `"mode"` – fill with most frequent values
  - `"drop"` – remove rows with missing values entirely

By default, `clean_basic()` returns a new cleaned DataFrame without modifying the original.
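
For intuition, the strategies above map onto a handful of pandas operations. This is a sketch under stated assumptions: `clean_sketch` is a hypothetical helper, and removing duplicates before filling is my choice here, not a documented detail of `clean_basic()`.

```python
import numpy as np
import pandas as pd

def clean_sketch(df, strategy="mean"):
    """Hypothetical equivalent of the steps listed above (not Datalizer's code)."""
    cleaned = df.drop_duplicates()  # returns a new frame; df is left untouched
    if strategy == "drop":
        return cleaned.dropna()
    if strategy == "mean":
        return cleaned.fillna(cleaned.mean())
    if strategy == "median":
        return cleaned.fillna(cleaned.median())
    if strategy == "mode":
        # mode() can return several rows when values tie; take the first
        return cleaned.fillna(cleaned.mode().iloc[0])
    raise ValueError(f"Unknown strategy: {strategy}")

sample = pd.DataFrame({
    "age": [25, 30, 22, 35, 35, 28],
    "weight": [68, 75, 60, 80, 80, 70],
    "height": [175.0, 180.0, 165.0, 185.0, 185.0, np.nan],
})
dropped = clean_sketch(sample, strategy="drop")
mean_filled = clean_sketch(sample, strategy="mean")
```

With `"drop"`, both the duplicate row and the row with the missing height disappear. With `"mean"`, the missing height is replaced by the average of the remaining heights.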

In [5]:
df = dl.clean_basic(df, strategy="drop")

print("\nData after cleaning:")
df.head()


Missing values detected. Cleaning with strategy: 'drop'.

Data after cleaning:


   age  weight  height
0   25      68   175.0
1   30      75   180.0
2   22      60   165.0
3   35      80   185.0


## Preprocessing the Dataset

### Acknowledgement

The dataset used in this notebook is sourced from **"Within-Project Defect Prediction for Ansible"** by **Elif Ceren Gok**. It is available on **OpenML** at the following link: [OpenML Dataset](https://www.openml.org/search?type=data&status=active&id=43357).

### Data Preprocessing
Using `preprocess_data()`, we perform the following steps:

- **Merging** the feature set (`X_train`) and target dataset (`y_train`) on the `id` column.
- **Removing correlated features**: Highly correlated features are identified and removed from the dataset to reduce multicollinearity.
- **Splitting the data**: The dataset is split into training and validation sets for model training and evaluation.
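
The three steps above can be sketched with pandas and NumPy on synthetic data. Everything here is illustrative: `drop_correlated`, the 0.9 threshold, the 80/20 split, and the column names are assumptions, since the thresholds and split ratio used by `preprocess_data()` are not stated in this notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"id": range(100), "a": rng.normal(size=100), "c": rng.normal(size=100)})
X_demo["b"] = 2 * X_demo["a"] + 1  # perfectly correlated with "a"
y_demo = pd.DataFrame({"id": range(100), "target": rng.integers(0, 2, size=100)})

# Step 1: merge features and target on the shared key
merged = X_demo.merge(y_demo, on="id")
# Step 2: remove highly correlated feature columns
features, dropped = drop_correlated(merged.drop(columns=["id", "target"]))
# Step 3: hold out a random 20% of rows for validation
val_idx = features.sample(frac=0.2, random_state=0).index
X_val, X_train = features.loc[val_idx], features.drop(index=val_idx)
y_val, y_train = merged.loc[val_idx, "target"], merged["target"].drop(index=val_idx)
```

Of each correlated pair, this sketch keeps the column that appears first, so the result depends on column order.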

In [6]:
# Acknowledgement
# The following dataset is sourced from "Within-Project Defect Prediction for Ansible" by user Elif Ceren Gok. https://www.openml.org/search?type=data&status=active&id=43357

# File paths for the dataset
X_file_path = "X_train.csv"
y_file_path = "y_train.csv"

# Load the data using the datalizer loader function
X = dl.load_data(file_path=X_file_path)
y = dl.load_data(file_path=y_file_path)

# Preprocess the data
# Merge the datasets on 'id', split into training and validation sets, remove correlated features, and return the result
X_train_split, X_val_split, y_train_split, y_val_split, dropped_corr_feats = dl.preprocess_data(
    X, y, merge_col="id", val=True, target_col="failure_prone", remove_corr=True
)

# Output the features dropped due to high correlation
print("Dropped correlated features:\n", dropped_corr_feats)

Dropped correlated features:
 ['additions_max', 'code_churn_max', 'num_tasks', 'delta_num_keys', 'delta_num_tasks', 'delta_num_tokens']


## Model Selection Recommendations

After preprocessing the data, we can use the `recommend_approach` function to analyze the dataset and get recommendations for:

1. **Task identification**: Determining if this is a classification or regression problem
2. **Data characteristics analysis**: Examining dataset size, class balance, and feature dimensions
3. **Model recommendations**: Suggesting appropriate models based on the dataset characteristics
4. **Overfitting prevention**: Recommending strategies to prevent overfitting
5. **Evaluation metrics**: Suggesting appropriate metrics for model evaluation

The function analyzes the preprocessed data and returns recommendations tailored to the dataset's characteristics.
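
As a rough illustration of the first two analysis steps, the sketch below infers the task type from the target and computes a class ratio. It is a hypothetical heuristic, not the logic of `recommend_approach`: the helper `describe_target`, the `max_classes` cutoff, and the minority-to-majority definition of the ratio are all assumptions.

```python
import pandas as pd

def describe_target(y, max_classes=20):
    """Hypothetical heuristic: a target with few distinct values is treated as classification."""
    counts = pd.Series(y).value_counts()
    if len(counts) > max_classes:
        return {"task": "regression"}
    task = "binary classification" if len(counts) == 2 else "multiclass classification"
    # Rarest class divided by most common class; 1.0 means perfectly balanced
    return {"task": task, "ratio": counts.min() / counts.max()}

labels = [0] * 90 + [1] * 10
summary = describe_target(labels)
continuous = describe_target(range(100))
```

A ratio far below 1.0, as in this toy example, is the kind of signal that would justify flagging class imbalance and preferring metrics other than plain accuracy.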

In [7]:
# Model Selection Recommendations
# ------------------------------
# Using the preprocessed data to get model recommendations

# Get recommendations based on dataset characteristics
recommendations = dl.recommend_approach(X_train_split, y_train_split)

# Extract and display key information
print("Dataset Analysis:")
print(f"Task type: {recommendations['task']}")
print(f"Dataset size: {recommendations['dataset_size']['status']}")

# Display class balance information if it's a classification task
if 'classification' in recommendations['task']:
    print(f"Class balance: {recommendations['imbalance']['status']}")
    print(f"Class ratio: {recommendations['imbalance']['ratio']:.4f}")

# Display top recommended models
print("\nTop 3 Recommended Models:")
for i, model in enumerate(recommendations['recommended_models'][:3], 1):
    print(f"{i}. {model['name']}")
    print(f"   Strengths: {', '.join(model['strengths'])}")
    print(f"   Key hyperparameters: {', '.join(model['hyperparameters'])}")
    print()

# Display overfitting prevention strategies
print("Recommended Strategies to Prevent Overfitting:")
for strategy_type, strategies in recommendations['overfitting_strategies'].items():
    print(f"\n{strategy_type.capitalize()} strategies:")
    for strategy in strategies:
        print(f"- {strategy}")

# Display suggested evaluation metrics
print("\nSuggested Evaluation Metrics:")
for metric in recommendations['suggested_metrics']:
    print(f"- {metric}")


Dataset Analysis:
Task type: binary classification
Dataset size: Large dataset
Class balance: Severe imbalance detected
Class ratio: 0.0651

Top 3 Recommended Models:
1. Logistic Regression
   Strengths: Good interpretability, Works well with linear decision boundaries, Fast to train, Provides probabilities
   Key hyperparameters: C (regularization strength), penalty type (L1/L2)

2. Random Forest Classifier
   Strengths: Handles non-linear relationships well, Good with high-dimensional data, Robust to outliers, Provides feature importance
   Key hyperparameters: n_estimators, max_depth, min_samples_leaf, max_features

3. Gradient Boosting Classifier
   Strengths: Often achieves state-of-the-art performance, Handles non-linear relationships well, Good with imbalanced data
   Key hyperparameters: learning_rate, n_estimators, max_depth, subsample

Recommended Strategies to Prevent Overfitting:

General strategies:
- Use cross-validation to estimate model performance
- Monitor training vs