# PCOS - Exploring Predictive Health Factors
Polycystic Ovary Syndrome (PCOS) is a common hormonal disorder that affects women, often during their reproductive years. It is characterized by symptoms such as:

- Irregular periods or no periods at all

- Excess androgen levels, which can cause acne, excess facial or body hair (hirsutism), and sometimes hair thinning

- Polycystic ovaries, where the ovaries may be enlarged and contain multiple small follicles

Possible Causes:
- Hormonal imbalances, particularly elevated androgens (male hormones)

- Insulin resistance, which can lead to higher insulin levels and contribute to weight gain and difficulty losing weight

- Genetics, as PCOS often runs in families

Treatment Options:
While there’s no cure for PCOS, treatments focus on managing symptoms and may include:

- Lifestyle changes, such as diet modifications and exercise, to help manage weight and insulin resistance

- Birth control pills to regulate periods and reduce androgen levels

- Medications like Metformin to improve insulin sensitivity

- Fertility treatments for those struggling to conceive

## Load Dependencies and Data
### Load Dependencies and Install Other Requirements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel, RFE, mutual_info_classif
from sklearn.metrics import roc_auc_score
from scipy.stats import spearmanr, chi2_contingency, kendalltau
from scipy.cluster import hierarchy
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
import warnings
import re
from IPython.display import clear_output
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report
from tqdm.auto import tqdm
import optuna
import logging
import sys

# Suppress all warnings and logging
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=UserWarning, module="lightgbm")
logging.getLogger('lightgbm').setLevel(logging.ERROR)
optuna.logging.set_verbosity(optuna.logging.ERROR)


In [2]:
!pip install tabpfn
clear_output(wait=True)
from tabpfn import TabPFNClassifier

print('Installed tabpfn')

Installed tabpfn


### Load Data
Load the train, test, and sample submission files, then encode the target column as binary (Yes → 1, No → 0).

In [3]:
train = pd.read_csv('/kaggle/input/exploring-predictive-health-factors/train.csv')
test = pd.read_csv("/kaggle/input/exploring-predictive-health-factors/test.csv")
sample_submission = pd.read_csv("/kaggle/input/exploring-predictive-health-factors/sample_submission.csv")
TARGET = 'PCOS'
train[TARGET] = train[TARGET].map({'Yes': 1, 'No': 0})

train.head()


Unnamed: 0,ID,Age,Weight_kg,PCOS,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Type,Exercise_Duration,Sleep_Hours,Exercise_Benefit
0,0,20-25,64.0,0,No,No,No,No,No,Rarely,"Cardio (e.g., running, cycling, swimming)",30 minutes,Less than 6 hours,Somewhat
1,1,15-20,55.0,0,No,No,No,No,No,6-8 Times a Week,No Exercise,Less than 30 minutes,6-8 hours,Somewhat
2,2,15-20,91.0,0,No,No,Yes,No,No,Rarely,"Cardio (e.g., running, cycling, swimming)",Less than 30 minutes,6-8 hours,Somewhat
3,3,15-20,56.0,0,No,No,No,No,No,6-8 Times a Week,"Cardio (e.g., running, cycling, swimming)",45 minutes,6-8 hours,Not at All
4,4,15-20,47.0,0,Yes,No,No,No,No,Rarely,No Exercise,Not Applicable,6-8 hours,Not Much


In [4]:
def clean_age_column(age_column):
    cleaned_ages = []
    for age in age_column:
        if pd.isna(age):
            cleaned_ages.append(np.nan)
        elif 'Less than' in age:
            cleaned_ages.append('0-20')
        elif 'and above' in age:
            cleaned_ages.append('45-100')
        elif '-' in age:
            # Handle ranges and ensure they are in proper order
            parts = age.split('-')
            try:
                min_age, max_age = int(parts[0]), int(parts[1])
                if min_age > max_age:
                    min_age, max_age = max_age, min_age
                if max_age <= 20:  # Collapse any range ending at or below 20 into '0-20'
                    cleaned_ages.append('0-20')
                else:
                    cleaned_ages.append(f"{min_age}-{max_age}")
            except ValueError:
                # Handle cases like 'Less than 20-25'
                if 'Less than' in parts[0]:
                    cleaned_ages.append('20-25')
                else:
                    cleaned_ages.append(age)  # Leave as is if it's not fixable
        else:
            cleaned_ages.append(age)  # Leave other cases unchanged
    return cleaned_ages

def normalize_yes_no(column, visualize=False):
    normalized = []
    for value in column:
        if pd.isna(value):  # Handle NaN
            normalized.append(np.nan)
        elif 'Yes' in str(value):  # If 'Yes' is present in the string
            normalized.append('Yes' if visualize else '1')
        elif 'No' in str(value):  # If 'No' is present but not 'Yes'
            normalized.append('No' if visualize else '0')
        else:
            normalized.append(value)  # For any unexpected case (shouldn't occur here)
    return normalized

def clean_exercise_column(column):
    replacements = {
        'Rarely': 'Rarely',
        '6-8 Times a Week': '6-8 Times a Week',
        'Never': 'Never',
        '1-2 Times a Week': '1-2 Times a Week',
        '3-4 Times a Week': '3-4 Times a Week',
        '6-8 hours': '3-4 Times a Week',
        'Less than usual': 'Rarely',
        'Less than 6 hours': '1-2 Times a Week'
    }
    
    # Replace using the dictionary and handle NaN values
    return column.map(replacements).fillna('Unknown')


def process_exercise_data(df, column='Exercise_Type', replacements=None):
    """
    Process exercise data with standardized replacements and binary encoding
    
    Parameters:
    df (pandas.DataFrame): Input DataFrame
    column (str): Target column name
    replacements (dict): Mapping for standardizing exercise types
    
    Returns:
    tuple: (processed DataFrame, list of unique exercises)
    """
    df = df.copy()
    
    if replacements is None:
        replacements = {
            'Cardio': 'Cardio',
            'Strength': 'Strength Training',
            'Flexibility': 'Flexibility and Balance',
            'High-intensity': 'HIIT',
            'Sleep_Benefit': 'Misc',
            'Yes': 'Misc',
            'No': 'None',
            None: 'None',
            np.nan: 'None'
        }
    
    def standardize_and_split(x):
        if pd.isna(x) or x in ['No', 'Somewhat', 'None']:
            return ['None']
        
        x = str(x)
        cleaned = re.sub(r'\s*\([^)]*$', '', x)
        cleaned = re.sub(r'\s*\([^)]*\)', '', cleaned)
        cleaned = re.sub(r'[(),]$', '', cleaned)
        
        items = [item.strip() for item in cleaned.split(',') if item.strip()]
        items = [re.sub(r'\s*\(.*$', '', item).strip() for item in items]
        items = [re.sub(r'\s+$', '', item) for item in items]
        
        # Apply replacements
        standardized = []
        for item in items:
            for key, value in replacements.items():
                if str(key).lower() in item.lower():
                    item = value
                    break
            if item:
                standardized.append(item)
                
        return standardized if standardized else ['None']
    
    # Process exercise types
    exercise_lists = df[column].apply(standardize_and_split)
    
    # Get unique exercises (including 'None')
    all_exercises = sorted(set(
        exercise 
        for sublist in exercise_lists 
        for exercise in sublist
    ))
    
    # Create binary columns
    for exercise in all_exercises:
        df[exercise] = exercise_lists.apply(lambda x: 1 if exercise in x else 0)
    
    # Ensure 'None' is properly encoded
    none_conditions = (
        (df[column].isna()) |
        (df[column] == 'No') |
        (df[column] == 'Somewhat')
    )
    df.loc[none_conditions, 'None'] = 1
    
    return df.drop(columns=[column]), all_exercises

def process_data(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    df = df.fillna(df.median(numeric_only=True))  # Fill numeric NaNs with median
    df = df.fillna(df.mode().iloc[0])  # Fill categorical NaNs with mode    
    
    for col in categorical_cols:
        if col not in ['Exercise_Duration', 'Sleep_Hours', 'Exercise_Benefit', 'Exercise_Type']:  
            df[col] = normalize_yes_no(df[col])
        if col == 'Exercise_Frequency':
            df[col] = clean_exercise_column(df[col])
        if col == 'Age':
            df[col] = clean_age_column(df[col])
        if col == 'Exercise_Type':
            df, exercise_lists = process_exercise_data(df)
    df['Weight_kg'] = df['Weight_kg'].fillna(df['Weight_kg'].median())
    weight_bins = [20, 40, 50, 60, 70, 80, 90, 100, 120]  # Fixed bin ranges
    weight_bin_labels = [i for i in range(len(weight_bins)-1)]
    df['weight_bins'] = pd.cut(df['Weight_kg'], bins=weight_bins, labels=weight_bin_labels)    
    df.drop(columns=['Weight_kg'], inplace=True)
    for col in categorical_cols:
        if col != 'Exercise_Type':
            df[col] = df[col].astype('category').cat.codes 
    df = df.fillna(df.median(numeric_only=True))  # Fill numeric NaNs with median
    df = df.fillna(df.mode().iloc[0])  # Fill categorical NaNs with mode    
    for col in df.columns.tolist():
        df[col] = df[col].astype('int')
    return df, exercise_lists
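
The weight binning step in `process_data` can be checked in isolation. The sketch below reuses the same bin edges on a few sample weights (the first four come from the `train.head()` output above; the 130 kg value is a made-up out-of-range case). It shows that `pd.cut` uses right-closed intervals by default and returns `NaN` for values outside the edges, which the later fill steps in `process_data` would then have to impute.

```python
import pandas as pd

# Same fixed bin edges as in process_data
weight_bins = [20, 40, 50, 60, 70, 80, 90, 100, 120]
weight_bin_labels = list(range(len(weight_bins) - 1))

# Four weights from train.head() plus one hypothetical out-of-range value
weights = pd.Series([64.0, 55.0, 91.0, 47.0, 130.0])

# Intervals are right-closed: (20, 40], (40, 50], ..., (100, 120]
binned = pd.cut(weights, bins=weight_bins, labels=weight_bin_labels)

print(binned.tolist())  # the last entry is NaN because 130 > 120
```

If weights above 120 kg or at or below 20 kg are expected, widening the outer edges may be preferable to relying on imputation.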
   

In [5]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

class ModelOptimizer:
    def __init__(self, model_type="xgboost", n_trials=50, cv_folds=5, random_state=42):
        self.model_type = model_type
        self.n_trials = n_trials
        self.cv_folds = cv_folds
        self.random_state = random_state
        self.best_params = None
        self.best_score = None
        self.study = None

    def _get_param_space(self, trial):
        """Define hyperparameter search space based on model type."""
        if self.model_type == "xgboost":
            return {
                "n_estimators": trial.suggest_int("n_estimators", 100, 500, step=50),
                "max_depth": trial.suggest_int("max_depth", 3, 10),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
                "subsample": trial.suggest_float("subsample", 0.5, 1.0),
                "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
                "gamma": trial.suggest_float("gamma", 0, 5),
            }
        elif self.model_type == "lgbm":
            return {
                # Core Parameters
                'objective': 'binary',
                'metric': 'binary_logloss',
                'boosting_type': trial.suggest_categorical('boosting_type', ['gbdt', 'dart']),  # Removed GOSS
                'num_leaves': trial.suggest_int('num_leaves', 2, 256),
                'max_depth': trial.suggest_int('max_depth', 3, 12),
                'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
                'n_estimators': trial.suggest_int('n_estimators', 50, 300),
                
                # Regularization and Control Parameters
                'min_child_samples': trial.suggest_int('min_child_samples', 1, 100),
                'min_child_weight': trial.suggest_float('min_child_weight', 1e-3, 10.0),
                'subsample': trial.suggest_float('subsample', 0.5, 1.0),
                'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
                'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
                'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
                
                # Additional Parameters
                'min_split_gain': trial.suggest_float('min_split_gain', 0.0, 1.0),
                'max_bin': trial.suggest_int('max_bin', 200, 300),
                'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
                'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 1.0),
                'bagging_fraction': trial.suggest_float('bagging_fraction', 0.5, 1.0),
                'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
                
                # Fixed Parameters
                'random_state': self.random_state,
                'verbose': -1,
                'force_col_wise': True
            }
        elif self.model_type == "rf":
            return {
                "n_estimators": trial.suggest_int("n_estimators", 100, 500, step=50),
                "max_depth": trial.suggest_int("max_depth", 3, 20),
                "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
                "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
                "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
            }
        elif self.model_type == "catboost":
            return {
                "iterations": trial.suggest_int("iterations", 100, 500, step=5),
                "depth": trial.suggest_int("depth", 3, 10),
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
                "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1, 10),
            }
        else:
            raise ValueError(f"Unsupported model type: {self.model_type}")

    def _create_model(self, params):
        """Initialize the model with optimized hyperparameters."""
        if self.model_type == "xgboost":
            return xgb.XGBClassifier(**params, random_state=self.random_state, use_label_encoder=False, eval_metric="logloss")
        elif self.model_type == "lgbm":
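            # The lgbm search space already contains 'random_state'; merge it to avoid a duplicate keyword error
            return lgb.LGBMClassifier(**{**params, "random_state": self.random_state})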
        elif self.model_type == "rf":
            return RandomForestClassifier(**params, random_state=self.random_state)
        elif self.model_type == "catboost":
            return CatBoostClassifier(**params, random_state=self.random_state, verbose=0)
        else:
            raise ValueError(f"Unsupported model type: {self.model_type}")

    def _objective(self, trial, X, y):
        """Objective function for Optuna optimization."""
        params = self._get_param_space(trial)
        model = self._create_model(params)

        # Stratified k-fold cross-validation for AUROC score
        cv = StratifiedKFold(n_splits=self.cv_folds, shuffle=True, random_state=self.random_state)
        scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

        return np.mean(scores)


    def optimize(self, X, y):
        """Run Optuna optimization with progress tracking."""
        self.study = optuna.create_study(direction="maximize")
        
        # Wrap the optimization process in tqdm for visual progress
        with tqdm(total=self.n_trials, desc=f"Optimizing {self.model_type}") as pbar:
            def objective_with_progress(trial):
                score = self._objective(trial, X, y)
                pbar.update(1)  # Update progress bar after each trial
                return score
    
            self.study.optimize(objective_with_progress, n_trials=self.n_trials)
    
        self.best_params = self.study.best_params
        self.best_score = self.study.best_value
    
        print(f"Best AUROC Score: {self.best_score:.4f}")
        print(f"Best Parameters: {self.best_params}")
    
        return self.best_params, self.best_score


    def get_best_model(self):
        """Return the best model trained with optimized parameters."""
        if self.best_params is None:
            raise ValueError("Optimization has not been run yet!")
        return self._create_model(self.best_params)

# Example Usage
# optimizer = ModelOptimizer(model_type="xgboost", n_trials=50)
# best_params, best_score = optimizer.optimize(X_train, y_train)
# best_model = optimizer.get_best_model()


In [6]:
class PCOSDataProcessor:
    def __init__(self, target_correlation_threshold=0.05, correlation_threshold=0.8, 
                 categorical_threshold=0.7, n_features_to_select=20, model_type=None):
        self.target_correlation_threshold = target_correlation_threshold
        self.correlation_threshold = correlation_threshold
        self.categorical_threshold = categorical_threshold
        self.n_features_to_select = n_features_to_select
        self.model_type = model_type  
        self.feature_importances_ = None
        self.selected_features_ = None

    def _get_cramers_v(self, x, y):
        """Calculate Cramér's V statistic for categorical variables"""
        confusion_matrix = pd.crosstab(x, y)
        chi2 = chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum().sum()
        min_dim = min(confusion_matrix.shape) - 1
        
        # Handle division by zero
        if n * min_dim == 0:
            return 0
            
        return np.sqrt(chi2 / (n * min_dim))

    def _calculate_categorical_correlation(self, x, y):
        """Calculate correlation for categorical variables using Cramér's V"""
        return self._get_cramers_v(x, y)

    def _calculate_mixed_correlation(self, numeric_var, categorical_var):
        """
        Calculate correlation between numeric and categorical variables
        using ANOVA-based correlation ratio
        """
        categories = categorical_var.unique()
        
        # If only one category, correlation is 0
        if len(categories) < 2:
            return 0
            
        # Calculate means per category
        cat_means = {}
        cat_counts = {}
        
        for category in categories:
            mask = categorical_var == category
            cat_means[category] = numeric_var[mask].mean()
            cat_counts[category] = mask.sum()
            
        # Calculate overall mean
        overall_mean = numeric_var.mean()
        
        # Calculate correlation ratio
        numerator = sum(count * (cat_mean - overall_mean) ** 2 
                       for category, (cat_mean, count) in 
                       zip(cat_means.keys(), zip(cat_means.values(), cat_counts.values())))
        
        denominator = sum((x - overall_mean) ** 2 for x in numeric_var)
        
        # Handle division by zero
        if denominator == 0:
            return 0
            
        correlation_ratio = np.sqrt(numerator / denominator)
        return correlation_ratio

    def calculate_target_correlations(self, X, y):
        """Calculate correlations between features and target"""
        target_correlations = {}
        
        # Identify column types
        numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns
        
        # Calculate mutual information scores
        mi_scores = mutual_info_classif(X, y)
        mi_dict = dict(zip(X.columns, mi_scores))
        
        for column in X.columns:
            if column in numeric_cols:
                correlation = abs(spearmanr(X[column], y)[0])
            else:
                correlation = self._calculate_categorical_correlation(X[column], y)
                
            target_correlations[column] = {
                'correlation': correlation,
                'mutual_info': mi_dict[column]
            }
        
        return pd.DataFrame.from_dict(target_correlations, orient='index')

    def remove_highly_correlated(self, X, y):
        """Remove highly correlated features while preserving those most correlated with target"""
        # Calculate target correlations
        target_correlations = self.calculate_target_correlations(X, y)
        self.target_correlations_ = target_correlations
        
        # Identify numeric and non-numeric columns
        numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns
        non_numeric_cols = X.select_dtypes(exclude=['int64', 'float64']).columns
        
        # Initialize correlation matrix
        n_features = len(X.columns)
        corr_matrix = pd.DataFrame(np.nan, index=X.columns, columns=X.columns)
        
        # Calculate correlations between features
        for i in range(n_features):
            for j in range(i, n_features):
                col1, col2 = X.columns[i], X.columns[j]
                
                if col1 == col2:
                    corr_matrix.loc[col1, col2] = 1.0
                    continue
                
                # Both numeric
                if (col1 in numeric_cols) and (col2 in numeric_cols):
                    correlation = abs(spearmanr(X[col1], X[col2])[0])
                
                # Both non-numeric
                elif (col1 in non_numeric_cols) and (col2 in non_numeric_cols):
                    correlation = self._calculate_categorical_correlation(X[col1], X[col2])
                
                # Mixed types
                else:
                    numeric_var = X[col1] if col1 in numeric_cols else X[col2]
                    categorical_var = X[col2] if col1 in numeric_cols else X[col1]
                    correlation = self._calculate_mixed_correlation(numeric_var, categorical_var)
                
                corr_matrix.loc[col1, col2] = correlation
                corr_matrix.loc[col2, col1] = correlation
        
        # Find features to drop
        to_drop = set()
        
        # Remove features with very low correlation to target
        low_correlation_features = target_correlations[
            (target_correlations['correlation'] < self.target_correlation_threshold) & 
            (target_correlations['mutual_info'] < self.target_correlation_threshold)
        ].index
        to_drop.update(low_correlation_features)
        
        # Handle correlated features
        for i in range(n_features):
            for j in range(i + 1, n_features):
                col1, col2 = X.columns[i], X.columns[j]
                if col1 in to_drop or col2 in to_drop:
                    continue
                    
                correlation = abs(corr_matrix.loc[col1, col2])
                threshold = (self.categorical_threshold 
                           if (col1 in non_numeric_cols) and (col2 in non_numeric_cols)
                           else self.correlation_threshold)
                
                if correlation >= threshold:
                    # Keep the feature with higher target correlation
                    col1_score = (0.7 * target_correlations.loc[col1, 'correlation'] + 
                                0.3 * target_correlations.loc[col1, 'mutual_info'])
                    col2_score = (0.7 * target_correlations.loc[col2, 'correlation'] + 
                                0.3 * target_correlations.loc[col2, 'mutual_info'])
                    
                    to_drop.add(col1 if col1_score < col2_score else col2)
        
        # Store correlation matrices for inspection
        self.correlation_matrix_ = corr_matrix
        
        return X.drop(columns=list(to_drop))

    def get_feature_rankings(self):
        """Get feature rankings based on target correlation and mutual information"""
        if not hasattr(self, 'target_correlations_'):
            raise ValueError("Must run remove_highly_correlated first!")
            
        rankings = self.target_correlations_.copy()
        rankings['combined_score'] = (0.7 * rankings['correlation'] + 
                                    0.3 * rankings['mutual_info'])
        return rankings.sort_values('combined_score', ascending=False)

    def select_features(self, X, y, model_type='rf'):
        """Select best features based on model type"""
        if model_type == 'rf':
            selector = RandomForestClassifier(n_estimators=100, random_state=42)
        elif model_type == 'logistic':
            selector = LogisticRegression(random_state=42)
        else:
            selector = xgb.XGBClassifier(random_state=42)
            
        selector = SelectFromModel(selector, max_features=self.n_features_to_select)
        selector.fit(X, y)
        
        selected_features = X.columns[selector.get_support()].tolist()
        feature_importances = pd.Series(
            selector.estimator_.feature_importances_ if model_type != 'logistic' 
            else np.abs(selector.estimator_.coef_[0]),
            index=X.columns
        )
        
        return selected_features, feature_importances
    
    def process_for_model(self, X, model_type):
        """Process features based on model type"""
        if model_type in ['rf', 'xgboost', 'catboost']:
            # For tree-based models, keep categorical features as is
            return X
        elif model_type == 'tabpfn':
            return X.to_numpy()  # TabPFN requires NumPy input
        else:
            # For linear models, one-hot encode categorical features
            return pd.get_dummies(X, drop_first=True)
    
    def fit_transform(self, X, y, model_type='rf'):
        """Main method to process data"""
        # Remove highly correlated features
        X_processed = self.remove_highly_correlated(X, y)
        
        # Select best features
        self.selected_features_, self.feature_importances_ = self.select_features(
            X_processed, y, model_type
        )
        X_selected = X_processed[self.selected_features_]
        
        # Process features based on model type
        X_final = self.process_for_model(X_selected, model_type)
        
        return X_final
    
    def transform(self, X):
        """Transform new data using fitted processor"""
        if self.selected_features_ is None:
            raise ValueError("Processor has not been fitted yet!")
        
        X_selected = X[self.selected_features_]
        return self.process_for_model(X_selected, self.model_type)


class PCOSEnsemble:
    def __init__(self, tune=None, processors=None, models=None):
        self.tune = tune if tune else {
            'rf': False,
            'xgboost': False,
            'catboost': False,
            'lgbm': False
        }
        self.processors = processors if processors else {
            'rf': PCOSDataProcessor(n_features_to_select=num_features - 5),
            # 'logistic': PCOSDataProcessor(n_features_to_select=num_features - 5),
            # 'xgboost': PCOSDataProcessor(n_features_to_select=num_features - 5),
            # 'catboost': PCOSDataProcessor(n_features_to_select=num_features - 5),
            # 'lgbm': PCOSDataProcessor(n_features_to_select=num_features - 5),
            # 'tabpfn': PCOSDataProcessor(n_features_to_select=num_features - 5)
        }
        
        self.models = models if models else {
            'rf': RandomForestClassifier(random_state=42),
            # 'logistic': LogisticRegression(random_state=42),
            # 'xgboost': xgb.XGBClassifier(random_state=42),
            # 'catboost': CatBoostClassifier(verbose=0, random_state=42),
            # 'lgbm': lgb.LGBMClassifier(verbose=-1, random_state=42),
            # 'tabpfn': TabPFNClassifier()
        }
        
        self.feature_importances_ = {}
        
    def fit(self, X, y):
        """Fit the ensemble"""
        self.predictions_ = {}
        
        for name, processor in self.processors.items():
            print(f"Processing data for {name}")
            # Process data for specific model
            X_processed = processor.fit_transform(X, y, model_type=name)
            
            # Store feature importances
            self.feature_importances_[name] = processor.feature_importances_

            # Tune model
            if name not in ['logistic', 'tabpfn']:
                trials = {'rf': 50, 'xgboost': 50, 'lgbm': 50, 'catboost': 50}
                if self.tune.get(name, False):  # Models missing from the tune dict are not tuned
                    optimizer = ModelOptimizer(model_type=name, n_trials=trials[name])
                    best_params, best_score = optimizer.optimize(X_processed, y)
                    self.models[name] = optimizer.get_best_model()
            
            # Fit model
            print(f"Training {name} model on data")
            self.models[name].fit(X_processed, y)
            
            # Get predictions
            self.predictions_[name] = self.models[name].predict_proba(X_processed)[:, 1]
        
        return self
    
    def predict_proba(self, X):
        """Get probability predictions from ensemble"""
        predictions = {}
        
        for name, processor in self.processors.items():
            X_processed = processor.transform(X)
            predictions[name] = self.models[name].predict_proba(X_processed)[:, 1]
        
        # Average predictions from all models
        return np.mean(list(predictions.values()), axis=0)
    
    def predict(self, X, threshold=0.5):
        """Get class predictions from ensemble"""
        probas = self.predict_proba(X)
        return (probas >= threshold).astype(int)
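
`_get_cramers_v` above relies on `scipy.stats.chi2_contingency`, which applies Yates' continuity correction to 2x2 tables by default, so its values for pairs of binary features come out somewhat lower than the textbook statistic. As a point of comparison, here is a minimal sketch of the uncorrected Cramér's V using only NumPy and pandas. The function name `cramers_v_uncorrected` and the toy inputs are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def cramers_v_uncorrected(x, y):
    """Cramér's V from the plain Pearson chi-square (no continuity correction)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    min_dim = min(table.shape) - 1
    if n == 0 or min_dim == 0:
        return 0.0  # a constant variable carries no association
    # Expected counts under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min_dim)))

x = pd.Series(["a", "a", "b", "b"])
perfect = cramers_v_uncorrected(x, pd.Series([0, 0, 1, 1]))      # y determined by x
independent = cramers_v_uncorrected(x, pd.Series([0, 1, 0, 1]))  # y unrelated to x
print(perfect, independent)
```

Perfect association gives 1 and independence gives 0, which makes the statistic easy to sanity-check before using it as a drop threshold.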

In [7]:
# Example usage
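def run_pcos_ensemble(data, tune=None, processors=None, models=None):
    # tune is a dict such as {'rf': True, 'xgboost': False}; None falls back to PCOSEnsemble's defaults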
    # Split features and target
    X = data.drop('PCOS', axis=1)
    y = data['PCOS']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Initialize and fit ensemble
    ensemble = PCOSEnsemble(tune, processors, models)
    ensemble.fit(X_train, y_train)
    
    # Get predictions
    y_pred = ensemble.predict(X_test)
    y_pred_proba = ensemble.predict_proba(X_test)
    
    # Calculate metrics
    results = {
        'ROC-AUC': roc_auc_score(y_test, y_pred_proba),
        'Feature Importances': ensemble.feature_importances_
    }
    print('Ensemble training complete.')
    return results, ensemble
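
The `ROC-AUC` value reported by `run_pcos_ensemble` can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly with NumPy as a way to make the metric concrete. The helper name `auroc_pairwise` and the toy labels and scores are assumptions for illustration; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
import numpy as np

def auroc_pairwise(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy data: three of the four positive/negative pairs are ordered correctly
toy_auc = auroc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the metric depends only on ranking, it should be computed from probabilities rather than from thresholded 0/1 predictions.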

In [8]:
# for i in range(5,10):
#     processors = {
#         # 'rf': PCOSDataProcessor(n_features_to_select=num_features - 5),
#         # 'logistic': PCOSDataProcessor(n_features_to_select=num_features - 5),
#         # 'xgboost': PCOSDataProcessor(n_features_to_select=num_features - 5),
#         'catboost': PCOSDataProcessor(n_features_to_select=num_features - i),
#         # 'lgbm': PCOSDataProcessor(n_features_to_select=num_features - 5),
#         # 'tabpfn': PCOSDataProcessor(n_features_to_select=num_features - 5)
#     }
        
#     models = {
#         # 'rf': RandomForestClassifier(random_state=42),
#         # 'logistic': LogisticRegression(random_state=42),
#         # 'xgboost': xgb.XGBClassifier(random_state=42),
#         'catboost': CatBoostClassifier(verbose=0, random_state=42),
#         # 'lgbm': lgb.LGBMClassifier(verbose=-1, random_state=42),
#         # 'tabpfn': TabPFNClassifier()
#     }
train_processed, exercise_list = process_data(train)
num_features = len(train_processed.columns.tolist())

processors = {
    'rf': PCOSDataProcessor(n_features_to_select=num_features - 5),
    'logistic': PCOSDataProcessor(n_features_to_select=num_features - 5),
    'xgboost': PCOSDataProcessor(n_features_to_select=num_features - 5),
    # 'catboost': PCOSDataProcessor(n_features_to_select=num_features - 5),
    # 'lgbm': PCOSDataProcessor(n_features_to_select=num_features - 5),
    # 'tabpfn': PCOSDataProcessor(n_features_to_select=num_features - 5)
}
    
models = {
    'rf': RandomForestClassifier(random_state=42),
    'logistic': LogisticRegression(random_state=42),
    'xgboost': xgb.XGBClassifier(random_state=42),
    # 'catboost': CatBoostClassifier(verbose=0, random_state=42),
    # 'lgbm': lgb.LGBMClassifier(verbose=-1, random_state=42),
    # 'tabpfn': TabPFNClassifier()
}
tunes = {'rf': True, 'xgboost': True, 'catboost': True, 'lgbm': False}


results, trained_ensemble = run_pcos_ensemble(train_processed, tune=tunes, processors = processors, models = models)
print(f"AUROC results: {results['ROC-AUC'] :.4f}")

Processing data for rf

Best AUROC Score: 0.9163
Best Parameters: {'n_estimators': 350, 'max_depth': 3, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'log2'}
Training rf model on data
Processing data for logistic
Training logistic model on data
Processing data for xgboost

Best AUROC Score: 0.8771
Best Parameters: {'n_estimators': 500, 'max_depth': 6, 'learning_rate': 0.06949791734011004, 'subsample': 0.9553039313375912, 'colsample_bytree': 0.842512774825359, 'gamma': 0.02105251567458649}
Training xgboost model on data
Ensemble training complete.
AUROC results: 0.7812


In [9]:
test_processed, test_exercise_lists = process_data(test)
train_features = set(train_processed.drop(columns=[TARGET]).columns)
test_features = set(test_processed.columns)

# Identify missing columns
missing_cols = list(train_features - test_features)

# Add missing columns
if missing_cols:
    test_processed = pd.concat([test_processed, pd.DataFrame(0, index=test_processed.index, columns=missing_cols)], axis=1)

# Find columns not in training data
drop = []
for col in test.columns.tolist():
    if col not in train.columns.tolist():
        drop.append(col)
test_processed.drop(columns=drop, inplace=True)

test_processed.head()

Unnamed: 0,ID,Age,Hormonal_Imbalance,Hyperandrogenism,Hirsutism,Conception_Difficulty,Insulin_Resistance,Exercise_Frequency,Exercise_Duration,Sleep_Hours,Exercise_Benefit,Cardio,Flexibility and Balance,Misc,None,Strength Training,weight_bins,HIIT
0,0,2,0,0,0,0,0,4,8,4,2,0,0,0,1,0,2,0
1,1,2,1,0,0,0,0,1,10,4,2,0,0,0,1,0,3,0
2,2,2,1,0,0,0,0,2,10,4,2,1,0,0,0,0,3,0
3,3,0,1,0,1,0,1,4,6,4,2,0,0,0,1,0,2,0
4,4,0,1,0,1,0,0,4,3,4,2,1,0,0,0,0,2,0
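
The column alignment in the cell above (add missing columns filled with 0, drop extras) could also be done in one step with `DataFrame.reindex`. This is a sketch of an alternative, not the notebook's method; the column names and values below are made-up stand-ins for the processed train and test frames.

```python
import pandas as pd

# Hypothetical feature columns seen during training
train_cols = ["ID", "Cardio", "HIIT", "None"]

# Hypothetical processed test frame: lacks 'HIIT', has an unseen 'Extra' column
test_df = pd.DataFrame({
    "ID": [0, 1],
    "None": [1, 0],
    "Cardio": [0, 1],
    "Extra": [5, 6],
})

# reindex keeps only train_cols, in training order, and fills absent columns with 0
aligned = test_df.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

A side effect is that the test columns end up in the same order as in training, which some estimators expect when given a DataFrame.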


In [10]:
probs = trained_ensemble.predict_proba(test_processed)
# Prepare submission
submission = pd.DataFrame()
submission['ID'] = test_processed['ID']
submission[TARGET] = probs 
# submission[TARGET] = np.where(ensemble_preds >= 0.5, 1, 0)
submission.columns = sample_submission.columns 
submission.to_csv('submission.csv', index=False)
submission

Unnamed: 0,ID,PCOS
0,0,0.030582
1,1,0.198714
2,2,0.198714
3,3,0.625591
4,4,0.180324
...,...,...
140,140,0.186071
141,141,0.028870
142,142,0.614505
143,143,0.031442
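
The submitted probabilities come from `PCOSEnsemble.predict_proba`, which takes the unweighted mean of each model's positive-class probability. The sketch below reproduces that averaging step, and the thresholding used by `predict`, on made-up numbers; the per-model values are illustrative assumptions, not outputs of the trained models.

```python
import numpy as np

# Hypothetical positive-class probabilities from three models for three rows
per_model = {
    "rf": np.array([0.10, 0.60, 0.80]),
    "logistic": np.array([0.20, 0.40, 0.90]),
    "xgboost": np.array([0.00, 0.80, 1.00]),
}

# Same operation as in predict_proba: average across models, row by row
ensemble_proba = np.mean(list(per_model.values()), axis=0)

# Same operation as in predict with the default threshold of 0.5
ensemble_labels = (ensemble_proba >= 0.5).astype(int)

print(ensemble_proba, ensemble_labels)
```

Equal weighting treats every model as equally reliable; weights derived from cross-validated scores would be one possible variation.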
