## Pre-processing data
### 1. Split your data into categorical and numerical columns
### 2. One-Hot Encode Categorical Features: 
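from sklearn.preprocessing import OneHotEncoder
# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
data_onehot = onehot_encoder.fit_transform(data[categorical_features])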

### 3. Impute Missing Values with MICE:
#### Initialize MICE with the one-hot encoded data and numerical features:
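# fit_transform returns a NumPy array, so wrap it in a DataFrame to recover column names
onehot_df = pd.DataFrame(data_onehot, columns=onehot_encoder.get_feature_names_out(categorical_features), index=data.index)
missing_vars = list(onehot_df.columns) + numerical_features
# `mice(...)` and `.complete()` are schematic; substitute the calls of the MICE library you use
imputer = mice(pd.concat([onehot_df, data[numerical_features]], axis=1)[missing_vars], printflag=False)
data_imputed = imputer.complete()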
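
As one possible runnable version of this step, the sketch below uses scikit-learn's `IterativeImputer`, a chained-equations imputer, as a stand-in for the `mice` call. This is an assumption, not necessarily the library the text has in mind, and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data with gaps; column names are hypothetical.
toy = pd.DataFrame({
    "rooms": [2.0, 3.0, np.nan, 5.0, 4.0, 3.0],
    "area": [50.0, 75.0, 100.0, np.nan, 98.0, 70.0],
    "price": [100.0, 150.0, 200.0, 250.0, np.nan, 145.0],
})

# sample_posterior=True draws each imputation from the predictive distribution,
# so different seeds yield different completed datasets (multiple imputation).
completed = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    values = imputer.fit_transform(toy)
    completed.append(pd.DataFrame(values, columns=toy.columns, index=toy.index))

print(completed[0])
```

Each entry of `completed` is one fully imputed copy of the data. Observed values are left unchanged; only the missing cells differ between copies.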

### 4. Rescale Numerical Features (Optional):
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_imputed[numerical_features])
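
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The small check below, on made-up numbers, compares it against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
z = scaler.fit_transform(x)

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(z, manual))
```

When new data arrives later, calling `scaler.transform` (not `fit_transform`) reuses the mean and standard deviation learned here.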

### 5. Combine Preprocessed Data:
#### Create a new DataFrame combining the one-hot encoded categories, imputed numerical features, and optionally scaled numerical features:
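data_preprocessed = pd.concat([pd.DataFrame(data_onehot, columns=onehot_encoder.get_feature_names_out(categorical_features)),
                              pd.DataFrame(data_imputed[numerical_features], columns=numerical_features),
                              # Suffix the scaled copies so they stay distinguishable from the unscaled columns
                              pd.DataFrame(data_scaled, columns=[f"{col}_scaled" for col in numerical_features]) if "data_scaled" in locals() else pd.DataFrame()], axis=1)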

### 6. Create Reusable Pipeline:
#### Wrap the steps into a function like the previous example, ensuring the correct order:
def preprocess_data(data):
  """
  Preprocesses data for machine learning, considering MICE for missing values.

  Args:
    data: A pandas DataFrame containing the data to preprocess.

  Returns:
    A pandas DataFrame containing the preprocessed data, 
    a OneHotEncoder object, and a mice object.
  """

  # Identify feature groups by naming convention
  categorical_features = [col for col in data.columns if col.startswith("fl_")]
  numerical_features = [col for col in data.columns if not col.startswith("fl_") and not col.endswith("_sqm")]

  # One-hot encode categorical features (use sparse=False on scikit-learn < 1.2)
  onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
  data_onehot = pd.DataFrame(onehot_encoder.fit_transform(data[categorical_features]),
                             columns=onehot_encoder.get_feature_names_out(categorical_features),
                             index=data.index)

  # Impute missing values with MICE.
  # `mice(...)` and `.complete()` are schematic; substitute the calls of the MICE library you use.
  missing_vars = list(data_onehot.columns) + numerical_features
  imputer = mice(pd.concat([data_onehot, data[numerical_features]], axis=1)[missing_vars], printflag=False)
  data_imputed = imputer.complete()

  # Optionally scale numerical features; the suffix keeps scaled copies distinct
  scaler = StandardScaler()
  data_scaled = pd.DataFrame(scaler.fit_transform(data_imputed[numerical_features]),
                             columns=[f"{col}_scaled" for col in numerical_features],
                             index=data.index)

  # Combine preprocessed data
  data_preprocessed = pd.concat([data_onehot, data_imputed[numerical_features], data_scaled], axis=1)

  return data_preprocessed, onehot_encoder, imputer
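
An alternative way to make these steps reusable is a scikit-learn `Pipeline`, which keeps the fitted encoder, imputer, and scaler together so the same transformation can be applied to new data. The sketch below is only one possible arrangement: `build_preprocessor` and the toy columns are hypothetical, `IterativeImputer` stands in for MICE, and, as a simplification, the final scaler standardizes the dummy columns as well as the numerical ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(categorical_features, numerical_features):
    """Encode categoricals, impute all columns jointly, then standardize."""
    encode = ColumnTransformer(
        [
            ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_features),
            ("numeric", "passthrough", numerical_features),
        ],
        sparse_threshold=0.0,  # always return a dense array for the imputer
    )
    return Pipeline([
        ("encode", encode),
        ("impute", IterativeImputer(random_state=0)),
        ("scale", StandardScaler()),  # simplification: scales dummy columns too
    ])


# Hypothetical toy data following the "fl_" naming convention for categoricals.
train = pd.DataFrame({
    "fl_garden": ["yes", "no", "yes", "no"],
    "rooms": [2.0, np.nan, 4.0, 3.0],
    "area": [50.0, 80.0, np.nan, 75.0],
})
new = pd.DataFrame({"fl_garden": ["yes"], "rooms": [np.nan], "area": [60.0]})

preprocessor = build_preprocessor(["fl_garden"], ["rooms", "area"])
X_train = preprocessor.fit_transform(train)  # learn categories, imputation models, scaling
X_new = preprocessor.transform(new)          # reuse the fitted state on unseen rows
print(X_train.shape, X_new.shape)
```

The main design difference from a plain function is the split between `fit_transform` on training data and `transform` on later data, which avoids re-learning the preprocessing from rows the model should not see.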


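MICE-style imputation is available in several Python libraries, for example scikit-learn's `IterativeImputer`, `statsmodels.imputation.mice`, and `miceforest`. missForest is a related random-forest-based imputation method rather than a MICE implementation. The usual steps are: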
1. Import the library and initialize MICE
2. Impute missing values
3. Combine with remaining processing steps:
You can integrate the MICE imputation step into your existing pipeline by replacing the SimpleImputer step with MICE. Remember to adapt the missing_vars list based on the actual variables containing missing data in your dataset.

Important considerations:
1. MICE implementations that accept only numeric input need categorical variables encoded before imputation, so one-hot encode the relevant categorical features before applying MICE. Imputed dummy values may come back fractional and might need rounding to 0 or 1.
2. MICE generates multiple imputed datasets. Fit your model on each of them and combine the estimates with pooling techniques such as Rubin's rules, rather than interpreting the results from a single completed dataset.
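
As a rough illustration of the pooling step, Rubin's rules average the per-dataset estimates and add the between-imputation variance, inflated by a factor of (1 + 1/m), to the average within-imputation variance. The helper name `pool_rubin` and the input numbers below are made up for the example.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled_estimate = estimates.mean()      # average of the m point estimates
    within = variances.mean()               # average squared standard error
    between = estimates.var(ddof=1)         # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return pooled_estimate, total


# Made-up coefficient estimates and squared standard errors from m = 3 imputations.
estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_variance)
```

The square root of `total_variance` is the pooled standard error, which is larger than any single-dataset standard error when the imputations disagree.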

In [11]:
import pandas as pd

# Split your data into categorical and numerical columns:
data = pd.read_csv('data\\properties.csv')
# Separate object and numerical columns
object_cols = data.select_dtypes(include=['object'])
numeric_cols = data.select_dtypes(include=['int64', 'float64'])
# Count missing values in the 'equipped_kitchen' column
equipped_nan_count = data['equipped_kitchen'].isna().sum()
print(equipped_nan_count)
# One-hot encode categorical features using pd.get_dummies
encoded_object_cols = pd.get_dummies(object_cols, drop_first=True)
# Combine encoded object and numerical columns
combined_df = pd.concat([encoded_object_cols, numeric_cols], axis=1)


0
