
# Discretization (Binning) with Feature-engine

This notebook demonstrates various techniques for discretization (binning) of continuous variables using the `feature_engine` library in Python.

## 1. Import Libraries

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import (
    EqualWidthDiscretiser,
    EqualFrequencyDiscretiser,
    DecisionTreeDiscretiser,
    ArbitraryDiscretiser
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
```

## 2. Load Data

We'll use a small, hand-built dataset of 12 rows with three continuous variables: `Age`, `Income`, and `Years_of_Experience`.

```python
# Create a sample dataset (the values are fixed, so no random seed is needed)
data = {
    'Age': [22, 25, 47, 52, 46, 56, 24, 27, 30, 45, 57, 60],
    'Income': [25000, 27000, 55000, 60000, 52000, 70000, 28000, 30000, 35000, 50000, 72000, 75000],
    'Years_of_Experience': [1, 2, 5, 8, 7, 10, 2, 3, 4, 6, 9, 12]
}

df = pd.DataFrame(data)
print("Initial Dataset with Continuous Variables:")
df.head(10)
```

## 3. Split Data into Training and Testing Sets

```python
# Split the data into training and testing sets
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)
print("Training Set:")
X_train.head()
```

## 4. Understanding Discretization

Discretization, also known as binning, is the process of transforming continuous variables into discrete buckets or intervals. This process can help improve model interpretability, reduce the impact of outliers, and handle skewed distributions. Discretization can be especially useful when the relationship between a continuous variable and the target variable is non-linear or when certain ranges of the variable have distinct impacts.

## 5. Discretization Techniques

### 5.1 Equal-Width Discretization

**Explanation:**  
Equal-width discretization divides the range of the variable (from its minimum to its maximum) into intervals of equal width. This method is straightforward and easy to interpret, but it ignores how the observations are distributed: with skewed data, most observations can land in one or two bins while the remaining bins stay nearly empty.
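To make the arithmetic concrete, here is a minimal sketch in plain pandas and NumPy (not feature-engine itself) that computes four equal-width bins for the full `Age` column. The variable names `ages`, `edges`, and `counts` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same Age values as the sample dataset (all 12 rows, before any split)
ages = pd.Series([22, 25, 47, 52, 46, 56, 24, 27, 30, 45, 57, 60], name='Age')

n_bins = 4
# Every bin has the same width: (max - min) / n_bins = (60 - 22) / 4 = 9.5
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# labels=False returns the integer bin index instead of an interval label
age_bins = pd.cut(ages, bins=edges, include_lowest=True, labels=False)

# Count observations per bin, keeping empty bins in the result
counts = np.bincount(age_bins.to_numpy(), minlength=n_bins)

print("Bin edges:", edges)
print("Observations per bin:", counts)
```

On this data the second bin (31.5 to 41] receives no observations at all, which illustrates how equal-width bins can end up unevenly filled.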

**Implementation:**

```python
# Apply Equal-Width Discretization
ewd = EqualWidthDiscretiser(bins=4, variables=['Age', 'Income'])
X_train_ewd = ewd.fit_transform(X_train)
print("Dataset after Equal-Width Discretization:")
X_train_ewd.head()
```

### 5.2 Equal-Frequency Discretization

**Explanation:**  
Equal-frequency discretization divides the variable into intervals that contain approximately the same number of observations. This method is useful for capturing the distribution of the data, especially in skewed distributions, but it may result in intervals of varying widths.
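As a rough illustration of the idea, the sketch below uses `pandas.qcut` on the full `Age` column to place the bin edges at the quartiles. It is plain pandas rather than feature-engine, and the names `ages`, `age_bins`, and `edges` are illustrative.

```python
import pandas as pd

# Same Age values as the sample dataset (all 12 rows, before any split)
ages = pd.Series([22, 25, 47, 52, 46, 56, 24, 27, 30, 45, 57, 60], name='Age')

# q=4 places the edges at the 25th, 50th and 75th percentiles
age_bins, edges = pd.qcut(ages, q=4, labels=False, retbins=True)

# Number of observations that fall in each bin
counts = age_bins.value_counts().sort_index()

print("Bin edges:", edges)
print(counts)
```

Each of the four bins holds three observations here, while the bin widths differ (4.5, 19, 7.5, and 7 years), which is the trade-off described above.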

**Implementation:**

```python
# Apply Equal-Frequency Discretization
efd = EqualFrequencyDiscretiser(q=4, variables=['Age', 'Income'])
X_train_efd = efd.fit_transform(X_train)
print("Dataset after Equal-Frequency Discretization:")
X_train_efd.head()
```

### 5.3 Decision Tree Discretization

**Explanation:**  
Decision tree discretization uses a decision tree to determine the optimal split points based on the relationship between the continuous variable and the target variable. This method can effectively capture non-linear relationships and create meaningful bins, but it requires a target variable for supervision.
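The idea can be sketched with scikit-learn alone: fit a shallow tree on a single variable and let each leaf act as a bin. This is an illustration of the principle under simple assumptions (one variable, a fixed `max_depth=2`, no cross-validation), not a reproduction of what feature-engine does internally. The target values are the same illustrative labels used in the implementation cell, aligned with the full 12-row dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Age values and an illustrative binary target for all 12 rows
ages = np.array([22, 25, 47, 52, 46, 56, 24, 27, 30, 45, 57, 60]).reshape(-1, 1)
target = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1])

# A shallow tree keeps the number of bins small
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(ages, target)

# Each leaf is one bin; here a bin is represented by the predicted
# probability of class 1 for the observations that fall into that leaf
age_binned = tree.predict_proba(ages)[:, 1]

# The threshold of the root node is the first learned split point
split_point = tree.tree_.threshold[0]

print("Learned split point:", split_point)
print("Distinct bin values:", np.unique(age_binned))
```

Because the target in this toy data separates cleanly by age, a single split between 30 and 45 is enough and the tree produces only two bins.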

**Implementation:**

```python
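# Assume a binary target for the full dataset, then keep only the training rows
# so the labels stay aligned with X_train after the split
target = pd.Series([0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1], index=df.index)
y_train = target.loc[X_train.index]

# Apply Decision Tree Discretization (regression=False because the target is binary)
dtd = DecisionTreeDiscretiser(
    cv=3, scoring='accuracy', variables=['Age', 'Income'], regression=False
)
X_train_dtd = dtd.fit_transform(X_train, y_train)
print("Dataset after Decision Tree Discretization:")
X_train_dtd.head()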
```

### 5.4 Arbitrary Discretization

**Explanation:**  
Arbitrary discretization allows the user to manually specify the bin edges. This method is useful when domain knowledge suggests specific thresholds or when specific ranges of the variable are known to have distinct meanings.
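For intuition, user-defined edges behave like a call to `pandas.cut` with an explicit list of boundaries. The sketch below is plain pandas, and the labels in `age_labels` are hypothetical names chosen for readability; they are not something the notebook or feature-engine defines.

```python
import pandas as pd

# Same Age values as the sample dataset (all 12 rows, before any split)
ages = pd.Series([22, 25, 47, 52, 46, 56, 24, 27, 30, 45, 57, 60], name='Age')

# Hand-picked edges; intervals are right-closed: (20, 30], (30, 40], ...
age_edges = [20, 30, 40, 50, 60]
age_labels = ['21-30', '31-40', '41-50', '51-60']  # hypothetical labels

age_groups = pd.cut(ages, bins=age_edges, labels=age_labels)

# Categorical value counts include groups that received no observations
counts = age_groups.value_counts()

print(age_groups.tolist())
print(counts)
```

A value outside the outermost edges would become `NaN` with `pandas.cut`, so the edges should cover the full range expected in the data.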

**Implementation:**

```python
# Specify custom bin edges
arbitrary_binner = ArbitraryDiscretiser(binning_dict={
    'Age': [20, 30, 40, 50, 60],
    'Income': [25000, 35000, 50000, 70000, 80000]
})
X_train_arbitrary = arbitrary_binner.fit_transform(X_train)
print("Dataset after Arbitrary Discretization:")
X_train_arbitrary.head()
```

## 6. Conclusion

Discretization can significantly impact the performance and interpretability of machine learning models. The method you choose depends on the specific characteristics of your data and the problem you're trying to solve.

- **Equal-width discretization** is simple and interpretable but may not capture data distribution well.
- **Equal-frequency discretization** ensures that bins have similar numbers of observations, which can be beneficial for skewed data.
- **Decision tree discretization** is powerful for capturing complex relationships but requires a target variable.
- **Arbitrary discretization** offers flexibility when specific thresholds are meaningful based on domain knowledge.

```python
# Save the final dataset after discretization
X_train_arbitrary.to_csv('discretized_dataset.csv', index=False)
```

Run this notebook cell by cell to see how each discretization method changes the dataset. Binning is most useful when a variable's relationship with the target is non-linear or when interpretable value ranges matter more than fine-grained precision.