# Random Forests Part 2: Missing data and clustering

In this notebook, we focus on handling missing data and on sample clustering in random forests. Both rely on the forest's own notion of which samples are similar. We will use a patient dataset in which some patients have missing values for certain features.

We will consider two types of missing data:
1. Missing data in the original dataset used to create the random forest.
2. Missing data in a new sample that we want to categorize.

Let's start by importing the necessary Python libraries and loading the dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import seaborn as sns

# Load dataset
data = pd.read_csv('patients_data.csv')
data.head()
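
Before filling anything in, it can help to see where the gaps are and to remember their positions, since a filled DataFrame no longer shows which cells were guesses. The following is a minimal sketch on a made-up frame; the column names and values are hypothetical and only stand in for the patient data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the patient data; column names are illustrative only
toy = pd.DataFrame({
    "blocked_arteries": ["yes", np.nan, "no", "yes"],
    "weight": [125.0, 180.0, np.nan, 167.0],
    "heart_disease": ["no", "yes", "no", "yes"],
})

# Boolean mask of the missing cells, recorded before any values are filled in
missing_mask = toy.isna()

# Number of missing values per column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Keeping such a mask around makes it possible to revisit exactly the guessed cells later.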

## Handling missing data in the original dataset

The general idea for dealing with missing data in the original dataset is to make an initial guess, which may be poor, and then refine it iteratively. As the initial guess, we use the most common value for categorical features and the median for numerical features.

Let's implement this approach.

In [None]:
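# Fill in missing values with an initial guess:
# the mode for categorical columns, the median for numerical ones.
# Assigning the result back avoids chained in-place modification,
# which recent pandas versions warn about.
for column in data.columns:
    if data[column].dtype == 'object':
        data[column] = data[column].fillna(data[column].mode()[0])
    else:
        data[column] = data[column].fillna(data[column].median())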

## Determining similarity

Now, we want to refine these initial guesses by determining which samples are similar to the ones with missing data. The steps are:
1. Build a random forest with the data.
2. Run all of the data down all of the trees.
3. Keep track of similar samples using a proximity matrix.

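We define two samples as similar if they end up in the same leaf node of a tree. Entry (i, j) of the proximity matrix is the fraction of trees in which samples i and j share a leaf. Let's calculate the proximity matrix.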
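
As a small illustration that can be checked by hand, the sketch below computes proximities from a made-up table of leaf indices (three samples, three trees). The numbers are hypothetical; in practice, a fitted forest's `apply` method produces this table.

```python
import numpy as np

# Hypothetical leaf indices: rows are samples, columns are trees
leaves = np.array([
    [1, 2, 3],   # sample 0
    [1, 2, 4],   # sample 1
    [5, 6, 4],   # sample 2
])

# Compare every pair of samples tree by tree, then average over the trees
same_leaf = leaves[:, None, :] == leaves[None, :, :]   # shape (n_samples, n_samples, n_trees)
proximity = same_leaf.mean(axis=2)

print(proximity.round(2))
```

Samples 0 and 1 share a leaf in two of the three trees, so their proximity is 2/3, while samples 0 and 2 never share a leaf and get a proximity of 0.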

In [None]:
def calculate_proximity_matrix(model, X):
    """
    Calculate the proximity matrix of a fitted random forest model.

    Entry (i, j) is the fraction of trees in which samples i and j
    fall into the same leaf node.
    """
    leaves = model.apply(X)
    n_samples = leaves.shape[0]
    proximity_matrix = np.zeros((n_samples, n_samples))
    for sample_i in range(n_samples):
        for sample_j in range(sample_i, n_samples):
            proximity_matrix[sample_i, sample_j] = (leaves[sample_i] == leaves[sample_j]).mean()
            proximity_matrix[sample_j, sample_i] = proximity_matrix[sample_i, sample_j]
    return proximity_matrix

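# Split features and target. scikit-learn forests need numeric inputs,
# so any categorical feature columns are one-hot encoded first.
X = pd.get_dummies(data.iloc[:, :-1])
y = data.iloc[:, -1]

# Fit a random forest model
model = RandomForestClassifier(random_state=0)
model.fit(X, y)

# Calculate proximity matrix
proximity_matrix = calculate_proximity_matrix(model, X)

# Plot proximity matrix
sns.heatmap(proximity_matrix, cmap='viridis')
plt.title('Random forest proximity matrix')
plt.show()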
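
The proximity matrix is also what makes sample clustering possible: `1 - proximity` behaves like a distance, which a standard clustering routine can use. The sketch below is one possible way to do this, assuming synthetic data from `make_blobs` and SciPy's hierarchical clustering; neither is part of the patient example, and the choice of average linkage and two clusters is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: two groups of samples
X_demo, y_demo = make_blobs(n_samples=40, centers=2, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Proximity: fraction of trees in which two samples share a leaf
leaves = forest.apply(X_demo)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Treat 1 - proximity as a distance and cluster hierarchically
distance = 1.0 - proximity
condensed = squareform(distance, checks=False)   # condensed form expected by linkage
clusters = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(clusters)
```

The same distance matrix could equally be passed to a heatmap or an MDS plot to visualize how the samples group together.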

## Refining the guess

Now we use the proximity values to make better guesses about the missing data. For categorical data, we calculate a weighted frequency of each category using proximity values as the weights. For numerical data, we use the proximities to calculate a weighted average.
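
To make the weighting concrete, here is a tiny worked example with made-up proximities between one incomplete sample and four other samples. All numbers are hypothetical. For the categorical case it uses the simplest reading, the total proximity per category; the notebook's function additionally scales this by each category's plain frequency.

```python
import numpy as np
import pandas as pd

# Made-up proximities between the incomplete sample and four other samples
proximities = np.array([0.1, 0.1, 0.8, 0.0])

# Numerical feature: proximity-weighted average of the other samples' values
values = np.array([125.0, 180.0, 210.0, 167.0])
weighted_average = np.average(values, weights=proximities)

# Categorical feature: total proximity per category, normalized to sum to 1
categories = pd.Series(["no", "yes", "yes", "no"])
weighted_freq = pd.Series(proximities).groupby(categories).sum()
weighted_freq /= weighted_freq.sum()

print(weighted_average)
print(weighted_freq)
```

The third sample dominates both results because it carries most of the proximity: the weighted average is pulled toward 210, and "yes" ends up with the larger weighted frequency.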

Let's implement this approach.

In [None]:
def refine_guesses(data, proximity_matrix, missing_mask):
    """
    Refine the guesses of missing values in data based on the proximity matrix.

    missing_mask is a boolean DataFrame marking the cells that were missing
    before the initial fill. Assumes data has a default RangeIndex, so that
    row labels match the rows of proximity_matrix.
    """
    refined_data = data.copy()
    for column in data.columns:
        observed = ~missing_mask[column].to_numpy()
        for i in np.flatnonzero(~observed):
            # Proximities to the samples whose value was actually observed
            weights = proximity_matrix[i] * observed
            if weights.sum() == 0:
                continue  # no similar observed samples: keep the current guess
            if data[column].dtype == 'object':
                # Weighted frequency: each category's frequency times its total proximity
                freq = data.loc[observed, column].value_counts(normalize=True)
                for category in freq.index:
                    freq[category] *= weights[(data[column] == category).to_numpy()].sum()
                # Assign the category with the highest weighted frequency
                refined_data.loc[i, column] = freq.idxmax()
            else:
                # Calculate the proximity-weighted average for numerical data
                refined_data.loc[i, column] = np.average(data[column], weights=weights)
    return refined_data

# The filled DataFrame no longer contains NaNs, so recover the positions
# of the originally missing cells from the raw file
missing_mask = pd.read_csv('patients_data.csv').isna()

# Refine the initial guesses
refined_data = refine_guesses(data, proximity_matrix, missing_mask)
refined_data.head()
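
A single refinement pass is only one step of the iterative idea: the forest can be refit on the refined data, the proximities recomputed, and the guesses refined again until they stop changing. The sketch below shows one way such a loop might look. It is a simplified illustration rather than the notebook's method: `iterative_forest_impute` is a hypothetical helper, it handles numerical features only, and it runs on synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def iterative_forest_impute(X, y, n_iter=6, random_state=0):
    """Hypothetical helper: proximity-based imputation for numeric features only."""
    missing = X.isna().to_numpy()
    filled = X.fillna(X.median())              # initial guess: column medians
    for _ in range(n_iter):
        forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
        forest.fit(filled, y)
        leaves = forest.apply(filled)
        proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
        previous = filled.to_numpy().copy()
        values = previous.copy()
        for i, j in zip(*np.nonzero(missing)):
            weights = proximity[i] * ~missing[:, j]   # only observed values contribute
            if weights.sum() > 0:
                values[i, j] = np.average(previous[:, j], weights=weights)
        filled = pd.DataFrame(values, index=X.index, columns=X.columns)
        if np.allclose(values, previous):      # guesses stopped changing
            break
    return filled

# Synthetic demo data with a few values knocked out
X_arr, y = make_classification(n_samples=60, n_features=4, random_state=0)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
X.iloc[[3, 17, 42], 1] = np.nan
imputed = iterative_forest_impute(X, y)
print(imputed.iloc[[3, 17, 42], 1])
```

The number of iterations and the stopping test are arbitrary choices here; a handful of passes is often enough for the guesses to settle.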

## Handling missing data in a new sample

Now consider the case where a new sample that we want to categorize has missing data. We make two copies of the new sample, one for each possible target class. For each copy, we use the iterative method described above to guess the missing values. Finally, we run both copies down the trees in the forest and keep the label of the copy that the forest labels correctly most often. Let's implement this approach.

In [None]:
def handle_missing_new_sample(model, sample, data, n_iter=10):
    """
    Classify a new sample with missing values using a fitted random forest.

    One copy of the sample is made per candidate class. Each copy is imputed
    under the assumption that it belongs to that class, and the class whose
    copy the forest agrees with most strongly is returned.
    Assumes data has no missing values left and a default RangeIndex.
    """
    features, labels = data.iloc[:, :-1], data.iloc[:, -1]
    columns = model.feature_names_in_  # columns the forest was trained on

    def encode(frame):
        # One-hot encode and align with the training columns
        return pd.get_dummies(frame).reindex(columns=columns, fill_value=0)

    train_leaves = model.apply(encode(features))
    missing = sample.columns[sample.iloc[0].isna().to_numpy()]
    scores = {}
    for k, label in enumerate(model.classes_):
        guess = sample.copy()
        same_class = features[labels == label]
        # Initial guess: mode or median among training samples with this label
        for column in missing:
            if features[column].dtype == 'object':
                guess[column] = same_class[column].mode()[0]
            else:
                guess[column] = same_class[column].median()
        # Refine the guess using proximities to the training samples
        for _ in range(n_iter):
            weights = (train_leaves == model.apply(encode(guess))).mean(axis=1)
            for column in missing:
                if features[column].dtype == 'object':
                    # Category with the largest total proximity
                    guess[column] = pd.Series(weights).groupby(features[column].to_numpy()).sum().idxmax()
                else:
                    guess[column] = np.average(features[column], weights=weights)
        # Forest's averaged probability for the label this copy assumes
        scores[label] = model.predict_proba(encode(guess))[0, k]
    return max(scores, key=scores.get)

# Missing data in a new sample (the values must match the feature columns of data)
new_sample = pd.DataFrame([[np.nan, 180, 'yes']], columns=data.columns[:-1])
target = handle_missing_new_sample(model, new_sample, data)
print(f'The predicted target for the new sample is: {target}')
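
The phrase "labeled correctly most often" refers to the votes of the individual trees. As a rough sketch of how those votes could be tallied, the example below counts them for a single, already complete sample on synthetic data. It assumes that, as in current scikit-learn versions, each tree in `estimators_` predicts an index into the forest's `classes_`; treat that as an implementation detail rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and forest
X_demo, y_demo = make_classification(n_samples=80, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# One fully imputed candidate sample (here simply the first training row)
candidate = X_demo[:1]

# Each tree predicts an index into forest.classes_; tally the votes per class
tree_votes = np.array([int(tree.predict(candidate)[0]) for tree in forest.estimators_])
vote_counts = np.bincount(tree_votes, minlength=len(forest.classes_))
votes = dict(zip(forest.classes_, vote_counts))
print(votes)
```

Comparing such vote counts for the two imputed copies, each against its own assumed label, gives the final decision.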

That's it! You have seen how to handle missing data in the training set and in a new sample, and how the proximity matrix captures which samples the forest considers similar, which is the basis for clustering them. Happy machine learning!