<a href="https://colab.research.google.com/github/Zahab163/Income_Inequality_in_Developing_Nations/blob/main/Income_Inequality_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💵 Income Inequality Prediction

## Description:
Income inequality, where income is distributed unevenly across
a population, is a growing problem in developing nations across the world.
With the rapid rise of AI and worker automation, this problem could continue to
grow if steps are not taken to address it. This solution can potentially
reduce the cost and improve the accuracy of monitoring key population
indicators, such as income level, between census years. This information will
help policymakers better manage and reduce income inequality globally.

### Problem Statement:
The target feature is `income_above_limit`, a binary variable.
The objective of this challenge is to build a machine learning model that predicts
whether an individual earns above or below a certain amount. The evaluation
metric is the F1 score.

### Import necessary libraries

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score, classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# The scikit-plot import caused compatibility issues with scipy, so it is left out.
# SMOTE (from imbalanced-learn) is imported separately in the next cell.
import os
import warnings
warnings.filterwarnings('ignore')



In [4]:
from imblearn.over_sampling import SMOTE

In [5]:
sheet_id = "1NcmQLYZ10HHwSQEVY8CL4A9Wvhf2ecw8pkB7OluyYKY"
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

df = pd.read_csv(csv_url)
print("\nFirst few rows:")
df.head()


First few rows:


Unnamed: 0,ID,age,gender,education,class,education_institute,marital_status,race,is_hispanic,employment_commitment,...,country_of_birth_mother,migration_code_change_in_msa,migration_prev_sunbelt,migration_code_move_within_reg,migration_code_change_in_reg,residence_1_year_ago,old_residence_reg,old_residence_state,importance_of_record,income_above_limit
0,ID_TZ0000,79,Female,High school graduate,,,Widowed,White,All other,Not in labor force,...,US,?,?,?,?,,,,1779.74,Below limit
1,ID_TZ0001,65,Female,High school graduate,,,Widowed,White,All other,Children or Armed Forces,...,US,unchanged,,unchanged,unchanged,Same,,,2366.75,Below limit
2,ID_TZ0002,21,Male,12th grade no diploma,Federal government,,Never married,Black,All other,Children or Armed Forces,...,US,unchanged,,unchanged,unchanged,Same,,,1693.42,Below limit
3,ID_TZ0003,2,Female,Children,,,Never married,Asian or Pacific Islander,All other,Children or Armed Forces,...,India,unchanged,,unchanged,unchanged,Same,,,1380.27,Below limit
4,ID_TZ0004,70,Male,High school graduate,,,Married-civilian spouse present,White,All other,Not in labor force,...,US,?,?,?,?,,,,1580.79,Below limit


**What is going on here?**

Our data is stored on Google Drive alongside this Colab notebook, but as a Google Sheet rather than a `.csv` file, and `pd.read_csv` expects CSV. The two lines of code above handle this:
* They build the sheet's export URL with `format=csv`, so the sheet is served as a `.csv` file that pandas can read directly.
* Alternatively, we could download the sheet as a `.csv` file and load it locally, which is what I would do when working in VS Code on the Streamlit app.
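The URL construction described above can be wrapped in a small helper. This is only a sketch: `sheet_csv_url` is a hypothetical name, and the optional `gid` parameter (for selecting a tab other than the first) is an assumption about the Google Sheets export endpoint rather than something this notebook uses. The sheet also has to be shared so that the link is readable without logging in.

```python
def sheet_csv_url(sheet_id, gid=None):
    """Build the CSV export URL for a Google Sheet (hypothetical helper)."""
    url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
    # gid selects a specific tab; assumed optional, the first tab is used otherwise
    if gid is not None:
        url += f"&gid={gid}"
    return url


csv_url = sheet_csv_url("1NcmQLYZ10HHwSQEVY8CL4A9Wvhf2ecw8pkB7OluyYKY")
print(csv_url)
```

The resulting string can be passed straight to `pd.read_csv`, exactly as in the cell above.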

**What is going on in the data?**

As you can see, there are many `NaN` (Not a Number) values, which pandas uses to mark missing data, along with some unwanted columns. We will deal with both.
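To get a feel for how much is missing, the share of missing values per column can be computed in a couple of lines. The sketch below uses a made-up toy frame, not the real dataset. It also treats the `?` placeholder seen in the migration columns as missing, which is an assumption about what that placeholder means.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few columns of the real dataset (values are illustrative)
toy = pd.DataFrame({
    'class': [np.nan, 'Federal government', np.nan, 'Private'],
    'migration_prev_sunbelt': ['?', np.nan, '?', 'No'],
    'age': [79, 65, 21, 2],
})

# Assumption: "?" is a placeholder for an unknown value, so count it as missing
toy_na = toy.replace('?', np.nan)

# Fraction of missing entries per column, largest first
missing_share = toy_na.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share are candidates for dropping rather than imputing.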

In [6]:

# df.info() prints its report directly; wrapping it in print() would also print "None"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209499 entries, 0 to 209498
Data columns (total 43 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   ID                              209499 non-null  object 
 1   age                             209499 non-null  int64  
 2   gender                          209499 non-null  object 
 3   education                       209499 non-null  object 
 4   class                           104254 non-null  object 
 5   education_institute             13302 non-null   object 
 6   marital_status                  209499 non-null  object 
 7   race                            209499 non-null  object 
 8   is_hispanic                     208617 non-null  object 
 9   employment_commitment           209499 non-null  object 
 10  unemployment_reason             6520 non-null    object 
 11  employment_stat                 209499 non-null  int64  
 12  wage_per_hour   

As we can see, the dataset is large and many columns have dtype `object`, which we have to deal with.
* For machine learning modeling, we have to convert these columns to numerical values.
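Before converting anything, it can help to count how many distinct values each non-numeric column holds, since one-hot encoding creates one new column per distinct value. The following is a minimal sketch on a toy frame; the column names mirror the dataset, but the rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Female'],
    'education': ['High school graduate', 'Children',
                  '12th grade no diploma', 'High school graduate'],
    'age': [79, 65, 21, 2],
})

# Distinct values per non-numeric column, largest first
cardinality = toy.select_dtypes(exclude='number').nunique().sort_values(ascending=False)
print(cardinality)

# One-hot encoding would add this many columns in total
print("One-hot columns:", int(cardinality.sum()))
```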


In [7]:
# Shape of dataset
print('Rows: {} Columns: {}'.format(df.shape[0], df.shape[1]))

Rows: 209499 Columns: 43


## Exploratory Data Analysis (EDA)

In [8]:
# Statistical summary
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,209499.0,34.518728,22.306738,0.0,15.0,33.0,50.0,90.0
employment_stat,209499.0,0.17676,0.555562,0.0,0.0,0.0,0.0,2.0
wage_per_hour,209499.0,55.433487,276.757327,0.0,0.0,0.0,0.0,9999.0
working_week_per_year,209499.0,23.15885,24.397963,0.0,0.0,8.0,52.0,52.0
industry_code,209499.0,15.332398,18.049655,0.0,0.0,0.0,33.0,51.0
occupation_code,209499.0,11.321734,14.460839,0.0,0.0,0.0,26.0,46.0
total_employed,209499.0,1.956067,2.365154,0.0,0.0,1.0,4.0,6.0
vet_benefit,209499.0,1.515854,0.850853,0.0,2.0,2.0,2.0,2.0
gains,209499.0,435.926887,4696.3595,0.0,0.0,0.0,0.0,99999.0
losses,209499.0,36.881737,270.383302,0.0,0.0,0.0,0.0,4608.0


`.T` transposes the summary table (swaps its rows and columns), which often makes it easier to read, especially when there are many columns.
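A tiny illustration of the effect, using a toy frame with made-up values: `describe()` returns one row per statistic, and `.T` turns that into one row per column.

```python
import pandas as pd

toy = pd.DataFrame({'age': [20, 30, 40], 'gains': [0, 0, 300]})

summary = toy.describe()   # rows: count, mean, std, min, 25%, 50%, 75%, max
summary_t = summary.T      # rows: one per original column

print(summary.shape, summary_t.shape)
print(summary_t)
```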

In [9]:
# Display target distribution
target_dist = df['income_above_limit'].value_counts()
print(f"\n Target distribution:\n{target_dist}")
print(f"Target ratio: {target_dist['Above limit']/len(df)*100:.2f}% Above limit")


 Target distribution:
income_above_limit
Below limit    196501
Above limit     12998
Name: count, dtype: int64
Target ratio: 6.20% Above limit
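The strong imbalance printed above is the reason F1 is a more informative metric than accuracy here. As a quick sanity check, the sketch below rebuilds a label vector from the two counts in the output and scores a trivial baseline that always predicts "Below limit". The baseline is purely illustrative and is not one of the models trained in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the target distribution printed above
n_below, n_above = 196501, 12998

y_true = np.concatenate([np.zeros(n_below, dtype=int), np.ones(n_above, dtype=int)])
y_majority = np.zeros_like(y_true)  # always predict "Below limit" (class 0)

acc = accuracy_score(y_true, y_majority)
f1 = f1_score(y_true, y_majority, zero_division=0)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

The baseline reaches roughly 94% accuracy while never identifying a single "Above limit" individual, so its F1 score for the positive class is zero.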


In [10]:
# Data Preprocessing
print("Preprocessing data...")

Preprocessing data...


In [11]:
# Create a copy for preprocessing
df_clean = df.copy()

In [12]:
# Handle target variable - convert to binary
df_clean['income_above_limit'] = df_clean['income_above_limit'].map({'Below limit': 0, 'Above limit': 1})

In [13]:
# Handle missing values
print("Missing values before preprocessing:")
print(df_clean.isnull().sum().sort_values(ascending=False).head(10))

Missing values before preprocessing:
veterans_admin_questionnaire    207415
unemployment_reason             202979
education_institute             196197
old_residence_reg               193148
old_residence_state             193148
is_labor_union                  189420
under_18_family                 151654
residence_1_year_ago            106284
occupation_code_main            105694
class                           105245
dtype: int64


In [14]:
# Drop ID column as it's not useful for prediction
if 'ID' in df_clean.columns:
    df_clean = df_clean.drop('ID', axis=1)


In [16]:
# Handle missing values
print("\n Handling missing values...")
numerical_cols = df_clean.select_dtypes(include=[np.number]).columns
categorical_cols = df_clean.select_dtypes(include=['object']).columns


 Handling missing values...


In [17]:
# Advanced missing value handling
for col in numerical_cols:
    if df_clean[col].isnull().sum() > 0:
        df_clean.loc[:, col] = df_clean.loc[:, col].fillna(df_clean[col].median())

for col in categorical_cols:
    if df_clean[col].isnull().sum() > 0:
        mode_val = df_clean[col].mode()[0] if not df_clean[col].mode().empty else 'Unknown'
        df_clean.loc[:, col] = df_clean.loc[:, col].fillna(mode_val)

print(f" Missing values after preprocessing: {df_clean.isnull().sum().sum()}")

 Missing values after preprocessing: 0
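The `main()` function near the end of this notebook calls `load_and_preprocess_data`, which is not defined in the cells shown. One plausible reading is that it bundles the cleaning steps above into a single function. The sketch below is written under that assumption: the function body is a guess assembled from the preceding cells, and the in-memory CSV is toy data.

```python
import io

import numpy as np
import pandas as pd


def load_and_preprocess_data(csv_path):
    """Hypothetical helper: load the CSV and repeat the cleaning steps shown above."""
    df_clean = pd.read_csv(csv_path)

    # Convert the target to binary, as in the mapping cell
    df_clean['income_above_limit'] = df_clean['income_above_limit'].map(
        {'Below limit': 0, 'Above limit': 1}
    )

    # Drop the ID column, which is not useful for prediction
    if 'ID' in df_clean.columns:
        df_clean = df_clean.drop('ID', axis=1)

    # Median for numerical columns, mode for all other columns
    numerical_cols = df_clean.select_dtypes(include=[np.number]).columns
    other_cols = df_clean.columns.difference(numerical_cols)
    for col in numerical_cols:
        if df_clean[col].isnull().any():
            df_clean[col] = df_clean[col].fillna(df_clean[col].median())
    for col in other_cols:
        if df_clean[col].isnull().any():
            modes = df_clean[col].mode()
            fill_value = modes[0] if not modes.empty else 'Unknown'
            df_clean[col] = df_clean[col].fillna(fill_value)
    return df_clean


# Tiny in-memory CSV standing in for the real file
toy_csv = io.StringIO(
    "ID,age,class,income_above_limit\n"
    "ID_1,79,,Below limit\n"
    "ID_2,,Private,Above limit\n"
    "ID_3,21,Private,Below limit\n"
)
toy_clean = load_and_preprocess_data(toy_csv)
print(toy_clean)
```

Note that mode imputation on columns that are almost entirely missing (such as `veterans_admin_questionnaire`) mostly fills them with a single value, so dropping those columns may be a reasonable alternative.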


In [18]:
def create_advanced_preprocessor(X):
    """Create advanced preprocessing pipeline"""
    categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
        ])

    return preprocessor, categorical_cols, numerical_cols



In [20]:
# Prepare features and target
X = df_clean.drop('income_above_limit', axis=1)
y = df_clean['income_above_limit']

In [21]:
# Identify final column types
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()

print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")


Categorical columns (28): ['gender', 'education', 'class', 'education_institute', 'marital_status', 'race', 'is_hispanic', 'employment_commitment', 'unemployment_reason', 'is_labor_union', 'industry_code_main', 'occupation_code_main', 'household_stat', 'household_summary', 'under_18_family', 'veterans_admin_questionnaire', 'tax_status', 'citizenship', 'country_of_birth_own', 'country_of_birth_father', 'country_of_birth_mother', 'migration_code_change_in_msa', 'migration_prev_sunbelt', 'migration_code_move_within_reg', 'migration_code_change_in_reg', 'residence_1_year_ago', 'old_residence_reg', 'old_residence_state']
Numerical columns (13): ['age', 'employment_stat', 'wage_per_hour', 'working_week_per_year', 'industry_code', 'occupation_code', 'total_employed', 'vet_benefit', 'gains', 'losses', 'stocks_status', 'mig_year', 'importance_of_record']


In [26]:
def train_multiple_models(X_train, X_test, y_train, y_test, preprocessor):
    """Train and evaluate multiple models"""
    print("\n Training Multiple Models...")

    # Define models with their parameter grids
    models = {
        'Random Forest': {
            'model': RandomForestClassifier(random_state=42, class_weight='balanced'),
            'params': {
                'classifier__n_estimators': [100, 200],
                'classifier__max_depth': [10, 20, None],
                'classifier__min_samples_split': [2, 5]
            }
        },
        'XGBoost': {
            'model': XGBClassifier(random_state=42, eval_metric='logloss'),
            'params': {
                'classifier__n_estimators': [100, 200],
                'classifier__max_depth': [3, 6, 9],
                'classifier__learning_rate': [0.01, 0.1, 0.2]
            }
        },
        'LightGBM': {
            'model': LGBMClassifier(random_state=42, verbose=-1),
            'params': {
                'classifier__n_estimators': [100, 200],
                'classifier__max_depth': [5, 10, 15],
                'classifier__learning_rate': [0.01, 0.1]
            }
        },
        'Logistic Regression': {
            'model': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
            'params': {
                'classifier__C': [0.1, 1, 10],
                'classifier__solver': ['liblinear', 'saga']
            }
        },
        'SVM': {
            'model': SVC(random_state=42, class_weight='balanced', probability=True),
            'params': {
                'classifier__C': [0.1, 1, 10],
                'classifier__kernel': ['linear', 'rbf']
            }
        },
        'Gradient Boosting': {
            'model': GradientBoostingClassifier(random_state=42),
            'params': {
                'classifier__n_estimators': [100, 200],
                'classifier__learning_rate': [0.05, 0.1, 0.2],
                'classifier__max_depth': [3, 5, 7]
            }
        },
        'K-Nearest Neighbors': {
            'model': KNeighborsClassifier(),
            'params': {
                'classifier__n_neighbors': [3, 5, 7, 9],
                'classifier__weights': ['uniform', 'distance']
            }
        }
    }

    results = {}
    best_models = {}

    for name, config in models.items():
        print(f"\n Training {name}...")
        # Create pipeline
        pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('classifier', config['model'])
        ])

        # Hyperparameter tuning
        grid_search = GridSearchCV(
            pipeline,
            config['params'],
            cv=3,
            scoring='f1',
            n_jobs=-1,
            verbose=0
        )

        grid_search.fit(X_train, y_train)

        # Best model found by the grid search
        best_model = grid_search.best_estimator_
        best_models[name] = best_model

        # Predictions
        y_pred = best_model.predict(X_test)
        y_pred_proba = best_model.predict_proba(X_test)[:, 1]

        # Calculate metrics
        f1 = f1_score(y_test, y_pred)
        accuracy = best_model.score(X_test, y_test)
        roc_auc = roc_auc_score(y_test, y_pred_proba)

        # Store results (the visualization and saving functions read these keys)
        results[name] = {
            'model': best_model,
            'f1_score': f1,
            'accuracy': accuracy,
            'roc_auc': roc_auc,
            'best_params': grid_search.best_params_,
            'predictions': y_pred,
            'probabilities': y_pred_proba
        }

        print(f"{name} - F1: {f1:.4f}, Accuracy: {accuracy:.4f}, AUC: {roc_auc:.4f}")
        print(f"Best Params: {grid_search.best_params_}")
        print(classification_report(y_test, y_pred))
        print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

    return results, best_models
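Grid search cost grows with the product of the grid sizes, so it is worth counting the fits before launching a run on roughly 168,000 training rows. The sketch below copies the parameter grids from the function above and counts combinations with scikit-learn's `ParameterGrid`. The names `param_grids`, `combos` and `total_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grids copied from train_multiple_models
param_grids = {
    'Random Forest': {'classifier__n_estimators': [100, 200],
                      'classifier__max_depth': [10, 20, None],
                      'classifier__min_samples_split': [2, 5]},
    'XGBoost': {'classifier__n_estimators': [100, 200],
                'classifier__max_depth': [3, 6, 9],
                'classifier__learning_rate': [0.01, 0.1, 0.2]},
    'LightGBM': {'classifier__n_estimators': [100, 200],
                 'classifier__max_depth': [5, 10, 15],
                 'classifier__learning_rate': [0.01, 0.1]},
    'Logistic Regression': {'classifier__C': [0.1, 1, 10],
                            'classifier__solver': ['liblinear', 'saga']},
    'SVM': {'classifier__C': [0.1, 1, 10],
            'classifier__kernel': ['linear', 'rbf']},
    'Gradient Boosting': {'classifier__n_estimators': [100, 200],
                          'classifier__learning_rate': [0.05, 0.1, 0.2],
                          'classifier__max_depth': [3, 5, 7]},
    'K-Nearest Neighbors': {'classifier__n_neighbors': [3, 5, 7, 9],
                            'classifier__weights': ['uniform', 'distance']},
}

cv = 3  # same number of folds as in GridSearchCV above
combos = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}
total_fits = sum(combos.values()) * cv

for name, n in combos.items():
    print(f"{name}: {n} combinations, {n * cv} fits")
print(f"Total cross-validation fits: {total_fits}")
```

If the run is too slow, shrinking the grids, tuning on a stratified subsample, or dropping the slowest estimators are the usual ways to bring the cost down.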


In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create the preprocessor
preprocessor, categorical_cols_preprocessor, numerical_cols_preprocessor = create_advanced_preprocessor(X_train)

# Train and evaluate models
results, best_models = train_multiple_models(X_train, X_test, y_train, y_test, preprocessor)

print("\n--- Model Training Results ---")
for name, res in results.items():
    print(f"{name}: F1 Score = {res['f1_score']:.4f}")
    print(f"Best Params: {res['best_params']}")


 Training Multiple Models...

 Training Random Forest...


In [None]:
def create_comprehensive_visualizations(results, X_test, y_test, model_info):
    """Create comprehensive visualizations for model comparison"""
    print("\n📊 Creating comprehensive visualizations...")

    # 1. Model Comparison Bar Chart
    models = list(results.keys())
    f1_scores = [results[model]['f1_score'] for model in models]
    accuracies = [results[model]['accuracy'] for model in models]
    auc_scores = [results[model]['roc_auc'] for model in models]

    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 12))

    # F1 Scores
    bars1 = ax1.barh(models, f1_scores, color='skyblue')
    ax1.set_xlabel('F1 Score')
    ax1.set_title('Model Comparison - F1 Score')
    ax1.bar_label(bars1, fmt='%.3f')

    # Accuracies
    bars2 = ax2.barh(models, accuracies, color='lightgreen')
    ax2.set_xlabel('Accuracy')
    ax2.set_title('Model Comparison - Accuracy')
    ax2.bar_label(bars2, fmt='%.3f')

    # AUC Scores
    bars3 = ax3.barh(models, auc_scores, color='salmon')
    ax3.set_xlabel('ROC AUC Score')
    ax3.set_title('Model Comparison - ROC AUC')
    ax3.bar_label(bars3, fmt='%.3f')

    # Combined metrics
    x = np.arange(len(models))
    width = 0.25
    ax4.bar(x - width, f1_scores, width, label='F1 Score', color='skyblue')
    ax4.bar(x, accuracies, width, label='Accuracy', color='lightgreen')
    ax4.bar(x + width, auc_scores, width, label='AUC', color='salmon')
    ax4.set_xlabel('Models')
    ax4.set_ylabel('Scores')
    ax4.set_title('Combined Model Metrics')
    ax4.set_xticks(x)
    ax4.set_xticklabels(models, rotation=45)
    ax4.legend()

    plt.tight_layout()
    plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()


    # 2. Confusion Matrices for top 3 models
    top_models = sorted(results.items(), key=lambda x: x[1]['f1_score'], reverse=True)[:3]

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    for idx, (name, result) in enumerate(top_models):
        cm = confusion_matrix(y_test, result['predictions'])
        sns.heatmap(cm, annot=True, fmt='d', ax=axes[idx], cmap='Blues',
                   xticklabels=['Below', 'Above'],
                   yticklabels=['Below', 'Above'])
        axes[idx].set_title(f'{name}\nF1: {result["f1_score"]:.3f}')
        axes[idx].set_xlabel('Predicted')
        axes[idx].set_ylabel('Actual')

    plt.tight_layout()
    plt.savefig('confusion_matrices.png', dpi=300, bbox_inches='tight')
    plt.show()
    plt.close()

    # 3. ROC Curves for all models
    # scikit-plot (skplt) is never imported, so use scikit-learn's roc_curve instead
    from sklearn.metrics import roc_curve
    plt.figure(figsize=(10, 8))
    for name, result in results.items():
        fpr, tpr, _ = roc_curve(y_test, result['probabilities'])
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})', linewidth=2)

    plt.plot([0, 1], [0, 1], 'k--', linewidth=2)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curves - Model Comparison')
    plt.legend(loc="lower right")
    plt.grid(True, alpha=0.3)
    plt.savefig('roc_curves.png', dpi=300, bbox_inches='tight')
    plt.show()
    plt.close()

    # 4. Precision-Recall Curves
    plt.figure(figsize=(10, 8))
    for name, result in results.items():
        precision, recall, _ = precision_recall_curve(y_test, result['probabilities'])
        pr_auc = auc(recall, precision)
        plt.plot(recall, precision, label=f'{name} (AUC = {pr_auc:.3f})', linewidth=2)

    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curves - Model Comparison')
    plt.legend(loc="upper right")
    plt.grid(True, alpha=0.3)
    plt.savefig('precision_recall_curves.png', dpi=300, bbox_inches='tight')
    plt.show()
    plt.close()

    print(" Visualizations saved!")


In [None]:
def save_models_and_results(results, model_info, best_overall_model):
    """Save all models and results"""
    print("\n💾 Saving models and results...")

    # Save individual models
    for name, result in results.items():
        # Clean model name for filename
        clean_name = name.lower().replace(' ', '_')
        joblib.dump(result['model'], f'model_{clean_name}.pkl')

    # Save the best overall model separately
    joblib.dump(best_overall_model, 'best_model.pkl')

    # Enhanced model info
    enhanced_model_info = {
        'all_models': list(results.keys()),
        'model_performance': {name: {
            'f1_score': result['f1_score'],
            'accuracy': result['accuracy'],
            'roc_auc': result['roc_auc'],
            'best_params': result['best_params']
        } for name, result in results.items()},
        'best_model': max(results.items(), key=lambda x: x[1]['f1_score'])[0],
        'best_model_f1': max(results.items(), key=lambda x: x[1]['f1_score'])[1]['f1_score'],
        **model_info
    }

    joblib.dump(enhanced_model_info, 'enhanced_model_info.pkl')


    # Create performance summary
    performance_df = pd.DataFrame({
        'Model': list(results.keys()),
        'F1_Score': [results[model]['f1_score'] for model in results],
        'Accuracy': [results[model]['accuracy'] for model in results],
        'ROC_AUC': [results[model]['roc_auc'] for model in results]
    }).sort_values('F1_Score', ascending=False)

    performance_df.to_csv('model_performance_summary.csv', index=False)
    print(" Models and results saved!")


In [None]:
def main():
    # Update this path to your CSV file
    CSV_FILE_PATH = "income_dataset.csv"  # Change to your actual file name

    if not os.path.exists(CSV_FILE_PATH):
        print(f" File '{CSV_FILE_PATH}' not found!")
        print("Available files:")
        for file in os.listdir('.'):
            if file.endswith('.csv'):
                print(f"  - {file}")
        return

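    # Load and preprocess data
    # NOTE: load_and_preprocess_data is not defined in the cells shown; it is expected to
    # apply the cleaning steps from the preprocessing cells and return the cleaned DataFrame.
    df_processed = load_and_preprocess_data(CSV_FILE_PATH)

    # Prepare features and target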
    X = df_processed.drop('income_above_limit', axis=1)
    y = df_processed['income_above_limit']

    # Create preprocessor
    preprocessor, categorical_cols, numerical_cols = create_advanced_preprocessor(X)


    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"\n Data split:")
    print(f"Training set: {X_train.shape}")
    print(f"Test set: {X_test.shape}")
    print(f"Positive class in training: {y_train.sum()}/{len(y_train)} ({y_train.mean():.2%})")
    print(f"Positive class in test: {y_test.sum()}/{len(y_test)} ({y_test.mean():.2%})")

    # Handle class imbalance with SMOTE
    print("\n Applying SMOTE for class imbalance...")
    smote = SMOTE(random_state=42)
    X_train_resampled, y_train_resampled = smote.fit_resample(
        preprocessor.fit_transform(X_train),
        y_train
    )

    print(f"After SMOTE - Training set: {X_train_resampled.shape}")
    print(f"After SMOTE - Positive class: {y_train_resampled.sum()}/{len(y_train_resampled)} ({y_train_resampled.mean():.2%})")

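    # Train multiple models
    # Note: the models are trained on the original X_train / y_train. The SMOTE-resampled
    # arrays above are not passed on, because each pipeline re-applies the preprocessor
    # to the raw features.
    results, best_models = train_multiple_models(X_train, X_test, y_train, y_test, preprocessor)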

    # Find best model
    best_model_name = max(results.items(), key=lambda x: x[1]['f1_score'])[0]
    best_overall_model = best_models[best_model_name]

    print(f"\n Best Model: {best_model_name}")
    print(f" Best F1-Score: {results[best_model_name]['f1_score']:.4f}")

    # Model info for saving
    model_info = {
        'categorical_columns': categorical_cols,
        'numerical_columns': numerical_cols,
        'all_columns': X.columns.tolist(),
        'feature_names': preprocessor.get_feature_names_out().tolist(),
        'class_distribution_original': dict(y.value_counts()),
        'best_model_name': best_model_name
    }

    # Create visualizations
    create_comprehensive_visualizations(results, X_test, y_test, model_info)

    # Save models and results
    save_models_and_results(results, model_info, best_overall_model)

    print(f"\n Enhanced training completed!")
    print(f" Generated files:")
    print(f"   - 7 trained models (model_*.pkl)")
    print(f"   - Best model (best_model.pkl)")
    print(f"   - Enhanced model info (enhanced_model_info.pkl)")
    print(f"   - Model performance summary (model_performance_summary.csv)")
    print(f"   - 4 visualization files (*.png)")
    print(f"\n You can now run: streamlit run enhanced_app.py")

if __name__ == "__main__":
    main()