In [None]:
# Utilities
import warnings
warnings.filterwarnings('ignore')

# Mathematical Operations
import numpy as np

# Data Manipulation
import pandas as pd

# Statistics and scientific computing
import scipy.stats as stats
import statsmodels.api as sm

# Machine Learning - Preprocessing
from sklearn.preprocessing import (
    OneHotEncoder,
    LabelEncoder,
    MinMaxScaler,
    StandardScaler
)
from sklearn.impute import SimpleImputer

# Machine Learning - Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Machine Learning - Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    auc,
    roc_auc_score
)

### 3. Feature Transformation
- **Date features**: Convert `date_recorded` to datetime format and extract features such as year, month, or day of the week to capture temporal patterns.
- **Encoding categorical variables**: Transform categorical columns into numeric representations using techniques like one-hot encoding, label encoding, or target encoding, depending on the modeling approach.
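- **Scaling numerical features**: Apply normalization or standardization to numerical features when the model is sensitive to feature scale (for example, regularized logistic regression or distance-based algorithms). Tree-based models such as decision trees do not require it.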

### 4. Feature Engineering
- **Derived columns**: Create new features that could capture important information, such as:
  - Waterpoint age (year of `date_recorded` minus `construction_year`)
  - Basin-region interaction (`basin` combined with `region`)
  - Extraction type class simplification
- **Group rare categories**: Combine infrequent categories into an 'Other' class to reduce sparsity in categorical features.
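A possible sketch of these engineering steps on made-up rows is shown below. Several details are assumptions rather than facts from this notebook: that a `construction_year` of 0 marks a missing value, the `min_count` threshold, the helper name `group_rare`, and the new column names `waterpoint_age`, `basin_region`, and `region_grouped`.

```python
import pandas as pd

# Toy rows standing in for the real data; all values are made up.
df = pd.DataFrame({
    "date_recorded": pd.to_datetime(
        ["2011-03-14", "2013-02-04", "2012-10-01", "2011-07-13"]
    ),
    "construction_year": [1999, 2010, 0, 1986],
    "basin": ["Lake Nyasa", "Pangani", "Lake Nyasa", "Lake Victoria"],
    "region": ["Iringa", "Manyara", "Iringa", "Kagera"],
})

# Waterpoint age in years. Assumption: 0 means "unknown construction year",
# so those rows become NaN instead of producing an age of about 2000 years.
known_year = df["construction_year"].where(df["construction_year"] > 0)
df["waterpoint_age"] = df["date_recorded"].dt.year - known_year

# Basin-region interaction as a single combined category.
df["basin_region"] = df["basin"] + "_" + df["region"]


def group_rare(series, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")


# The threshold of 2 is arbitrary and only suits this tiny example.
df["region_grouped"] = group_rare(df["region"], min_count=2)
```

The set of rare categories is best determined from the training data alone and then reused for the test data, so both splits end up with the same category levels.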

### 5. Target Preparation
- Ensure the target column `status_group` is properly formatted and, if necessary, map it to numeric labels for modeling (`functional` = 0, `functional needs repair` = 1, `non functional` = 2).
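One way to apply this mapping is an explicit dictionary, sketched below on a few made-up labels. The variable names (`status_map`, `y_raw`, `y`) are illustrative, and the check for unmapped values is an added safeguard rather than something the plan requires.

```python
import pandas as pd

# Mapping taken from the plan above.
status_map = {
    "functional": 0,
    "functional needs repair": 1,
    "non functional": 2,
}

# Made-up labels standing in for the real status_group column.
y_raw = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

y = y_raw.map(status_map)

# Labels missing from the dictionary map to NaN; fail loudly if any appear.
unmapped = y_raw[y.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped status_group values: {list(unmapped)}")

y = y.astype(int)
```

An explicit dictionary keeps the label order visible in the notebook. `LabelEncoder` would assign codes in sorted order of the class names, which happens to give the same result for these three labels but is less obvious to a reader.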

### 6. Final Dataset Check
- Verify that all features are clean, properly encoded, and aligned with the modeling requirements.
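The final check could be automated with a small helper like the one below. `check_model_ready` is a hypothetical function, and the specific checks (matching lengths, numeric dtypes only, no missing values) are one reasonable reading of "aligned with the modeling requirements", not a definitive list.

```python
import pandas as pd


def check_model_ready(X, y):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    if len(X) != len(y):
        problems.append(f"X has {len(X)} rows but y has {len(y)}")

    # Columns that are neither numeric nor boolean still need encoding.
    non_numeric = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    if non_numeric:
        problems.append(f"Non-numeric columns: {non_numeric}")

    # Many scikit-learn estimators reject NaN values.
    with_missing = [col for col in X.columns if X[col].isna().any()]
    if with_missing:
        problems.append(f"Columns with missing values: {with_missing}")

    if y.isna().any():
        problems.append("Target contains missing values")

    return problems


# Made-up example: one clean feature, one text column, one with a gap.
X_example = pd.DataFrame({
    "waterpoint_age": [12.0, 3.0, None],
    "basin": ["Lake Nyasa", "Pangani", "Lake Nyasa"],
    "gps_height_scaled": [1.2, -0.1, -1.1],
})
y_example = pd.Series([0, 2, 1])

for problem in check_model_ready(X_example, y_example):
    print(problem)
```

Returning a list rather than raising on the first failure lets all outstanding issues be reviewed in one pass before modeling.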