<a href="https://colab.research.google.com/github/chenweioh/GCP-Inspector-Toolkit/blob/main/1_to_1_Nearest_Neighbor_Matching_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1-to-1 Nearest Neighbor Matching Tutorial

This document provides an overview and demonstration of how to perform 1-to-1 nearest neighbor matching using the R package MatchIt. Matching is a common technique in observational studies to balance covariates between treatment and control groups, thereby reducing confounding and enabling more accurate causal inference.


Requirements
R
MatchIt package: Install if not already available.

In [10]:
# Install MatchIt package if not already installed
if (!requireNamespace("MatchIt", quietly = TRUE)) {
  install.packages("MatchIt")
}

# Load the MatchIt library
library(MatchIt)


Step 1: Create a Sample Dataset
We start by creating a synthetic dataset with 20 observations and the following variables:

age: Randomly sampled ages between 30 and 80.
disease_stage: Hypertension status, either "HPT" (with hypertension) or "without HPT" (no hypertension).
gender: Randomly assigned as "Male" or "Female".
treatment: Binary variable indicating whether an individual received the treatment (1 = treated, 0 = control).

We want to balance the covariates (age, disease_stage, and gender) between the treated and control groups so that differences in these variables do not bias the comparison. Because all four variables are sampled independently here, any imbalance in this dataset arises by chance alone.

In [11]:
# Set a seed for reproducibility
set.seed(42)

# Create a sample dataset
data <- data.frame(
  age = sample(30:80, 20, replace = TRUE),
  disease_stage = sample(c("HPT", "without HPT"), 20, replace = TRUE),
  gender = sample(c("Male", "Female"), 20, replace = TRUE),
  treatment = sample(c(0, 1), 20, replace = TRUE)
)

# Display the original data
cat("Original Data:\n")
print(data)


Original Data:
   age disease_stage gender treatment
1   78           HPT Female         1
2   66           HPT Female         1
3   30           HPT Female         1
4   54   without HPT Female         1
5   39           HPT Female         1
6   65           HPT Female         0
7   47           HPT Female         1
8   78           HPT   Male         0
9   76   without HPT Female         1
10  53   without HPT   Male         0
11  36   without HPT Female         1
12  65   without HPT Female         1
13  54           HPT Female         1
14  66   without HPT Female         1
15  75           HPT   Male         0
16  49   without HPT Female         0
17  55   without HPT   Male         0
18  79   without HPT   Male         0
19  76           HPT   Male         1
20  32           HPT Female         0
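
Before matching, it helps to count how many treated and control observations are available. The sketch below transcribes the printed table above into a data frame instead of calling sample() again, since the values drawn for a given seed can differ between R versions. The helper vectors hpt and female are illustrative names, not part of the original code.

```r
# Data transcribed from the printed "Original Data" table (rows 1 to 20)
hpt    <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
female <- c(1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1)
data <- data.frame(
  age = c(78, 66, 30, 54, 39, 65, 47, 78, 76, 53,
          36, 65, 54, 66, 75, 49, 55, 79, 76, 32),
  disease_stage = ifelse(hpt == 1, "HPT", "without HPT"),
  gender = ifelse(female == 1, "Female", "Male"),
  treatment = c(1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
)

# Number of control (0) and treated (1) observations
group_sizes <- table(data$treatment)
print(group_sizes)

# Cross-tabulate gender by treatment to spot obvious imbalance
print(table(data$gender, data$treatment))
```

The table contains 12 treated and 8 control observations, so a 1-to-1 match that uses each control only once can pair at most 8 treated individuals.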


Step 2: Perform 1-to-1 Nearest Neighbor Matching
We use MatchIt to perform 1-to-1 nearest neighbor matching based on a logistic regression model that estimates the probability of treatment (propensity score). MatchIt internally calculates the propensity score for each observation and uses it to match treated individuals with similar control individuals.
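
To make the propensity score step concrete, the sketch below fits the same logistic regression by hand with glm() from base R. It is an illustration of the idea, not MatchIt's internal code. The data frame is transcribed from the table printed in Step 1, and the names hpt, female, ps_model, and ps are illustrative.

```r
# Data transcribed from the printed table in Step 1
hpt    <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
female <- c(1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1)
data <- data.frame(
  age = c(78, 66, 30, 54, 39, 65, 47, 78, 76, 53,
          36, 65, 54, 66, 75, 49, 55, 79, 76, 32),
  disease_stage = ifelse(hpt == 1, "HPT", "without HPT"),
  gender = ifelse(female == 1, "Female", "Male"),
  treatment = c(1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
)

# Logistic regression of treatment on the covariates
ps_model <- glm(treatment ~ age + disease_stage + gender,
                data = data, family = binomial(link = "logit"))

# Fitted probabilities of treatment: one propensity score per observation
ps <- unname(fitted(ps_model))
print(round(ps, 4))

# Average propensity score in each group
print(tapply(ps, data$treatment, mean))
```

If the transcription is faithful, these fitted values should be close to the distance column that MatchIt reports for the matched observations.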

In [12]:
# Perform 1-to-1 matching with nearest neighbor method
match_out <- matchit(treatment ~ age + disease_stage + gender, data = data, method = "nearest")

# Get the matched dataset
matched_data <- match.data(match_out)


“Fewer control units than treated units; not all treated units will get a match.”


In this example:

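1-to-1 Nearest Neighbor Matching: Each treated observation is paired with the control whose propensity score is closest, and each control is used at most once (matching without replacement, the MatchIt default). The data contain 12 treated but only 8 control observations, so at most 8 treated observations can be matched. This is what the warning above reports.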
Propensity Score: Calculated using logistic regression (default method in MatchIt) to estimate the probability of treatment based on covariates. Matching is then performed on these scores.
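
The two ideas above can be sketched in base R as a greedy matching loop. This is a simplified illustration under stated assumptions, not MatchIt's implementation: it processes treated units from the highest propensity score downward (an assumption meant to mirror the package's default ordering), ignores ties and calipers, and uses data transcribed from the table in Step 1. The names pairs, available, and treated_order are illustrative.

```r
# Data transcribed from the printed table in Step 1
hpt    <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
female <- c(1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1)
data <- data.frame(
  age = c(78, 66, 30, 54, 39, 65, 47, 78, 76, 53,
          36, 65, 54, 66, 75, 49, 55, 79, 76, 32),
  disease_stage = ifelse(hpt == 1, "HPT", "without HPT"),
  gender = ifelse(female == 1, "Female", "Male"),
  treatment = c(1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
)

# Propensity scores from a logistic regression
ps_model <- glm(treatment ~ age + disease_stage + gender,
                data = data, family = binomial)
ps <- unname(fitted(ps_model))

treated_idx <- which(data$treatment == 1)
control_idx <- which(data$treatment == 0)

# Assumed processing order: highest propensity score first
treated_order <- treated_idx[order(ps[treated_idx], decreasing = TRUE)]

available <- control_idx  # controls not yet used
pairs <- data.frame(treated = integer(0), control = integer(0), gap = numeric(0))

for (i in treated_order) {
  if (length(available) == 0) break  # controls exhausted; the rest stay unmatched
  gaps <- abs(ps[available] - ps[i])
  best <- which.min(gaps)            # closest remaining control
  pairs <- rbind(pairs, data.frame(treated = i,
                                   control = available[best],
                                   gap = gaps[best]))
  available <- available[-best]      # without replacement: remove the used control
}

print(pairs)
```

The gap column shows how far apart the two propensity scores in each pair are. Later pairs tend to have larger gaps because the best-fitting controls have already been used.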

Step 3: Display the Matched Data
The matched_data object contains only the matched observations; unmatched observations are dropped. Matching is intended to make the treated and control groups more comparable, but whether it actually achieved balance has to be checked (Step 4).

In [13]:
# Display the matched data
cat("\nMatched Data (After 1-to-1 Nearest Neighbor Matching):\n")
print(matched_data)



Matched Data (After 1-to-1 Nearest Neighbor Matching):
   age disease_stage gender treatment   distance weights subclass
1   78           HPT Female         1 0.92240756       1        1
2   66           HPT Female         1 0.88207514       1        5
4   54   without HPT Female         1 0.76489591       1        6
6   65           HPT Female         0 0.87799965       1        1
7   47           HPT Female         1 0.78222685       1        7
8   78           HPT   Male         0 0.23340118       1        4
9   76   without HPT Female         1 0.88381582       1        8
10  53   without HPT   Male         0 0.07421897       1        6
12  65   without HPT Female         1 0.83263129       1        2
13  54           HPT Female         1 0.82476087       1        3
14  66   without HPT Female         1 0.83794260       1        4
15  75           HPT   Male         0 0.21332024       1        2
16  49   without HPT Female         0 0.72843065       1        8
17  55   without HPT

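Matched Data: Only matched observations are included in matched_data. The distance column is the estimated propensity score, weights is the matching weight (1 for every observation in this output), and subclass identifies the matched pair. Pairs are formed on the propensity score, so the two members of a pair need not have similar values of age, disease_stage, and gender individually. Pair quality also varies: in pair 1 the treated and control scores are 0.922 and 0.878, but in pair 6 they are 0.765 and 0.074, which is a poor match.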
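
One way to judge pair quality is to compute the propensity score gap within each subclass. The sketch below does this for the five pairs whose two members are both fully visible in the printed output above (rows 1 and 6, 12 and 15, 14 and 8, 4 and 10, 9 and 16); the remaining pairs are left out because the output is cut off. The function pair_gaps() and the data frame visible_pairs are hypothetical names introduced for illustration.

```r
# Rows transcribed from the printed matched data (only complete pairs)
visible_pairs <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  distance  = c(0.92240756, 0.87799965,   # subclass 1: rows 1 and 6
                0.83263129, 0.21332024,   # subclass 2: rows 12 and 15
                0.83794260, 0.23340118,   # subclass 4: rows 14 and 8
                0.76489591, 0.07421897,   # subclass 6: rows 4 and 10
                0.88381582, 0.72843065),  # subclass 8: rows 9 and 16
  subclass  = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8)
)

# Absolute propensity score difference within each matched pair,
# sorted so the worst-matched pairs come first
pair_gaps <- function(md) {
  treated <- md[md$treatment == 1, ]
  control <- md[md$treatment == 0, ]
  merged <- merge(treated, control, by = "subclass",
                  suffixes = c("_treated", "_control"))
  merged$gap <- abs(merged$distance_treated - merged$distance_control)
  merged[order(merged$gap, decreasing = TRUE),
         c("subclass", "distance_treated", "distance_control", "gap")]
}

gaps <- pair_gaps(visible_pairs)
print(gaps)
```

The same function should also work on the full matched_data object, since it relies only on the treatment, distance, and subclass columns.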

Step 4: Check the Balance of Covariates Before and After Matching
The summary(match_out) function provides a summary of covariate balance before and after matching, allowing us to evaluate how well the matching procedure worked in balancing the groups.

In [14]:
# Check balance of covariates before and after matching
cat("\nBalance Summary:\n")
print(summary(match_out))



Balance Summary:

Call:
matchit(formula = treatment ~ age + disease_stage + gender, data = data, 
    method = "nearest")

Summary of Balance for All Data:
                         Means Treated Means Control Std. Mean Diff. Var. Ratio
distance                        0.7454        0.3818          1.9190     0.3464
age                            57.2500       60.7500         -0.2119     0.9999
disease_stageHPT                0.5833        0.5000          0.1690          .
disease_stagewithout HPT        0.4167        0.5000         -0.1690          .
genderFemale                    0.9167        0.3750          1.9598          .
genderMale                      0.0833        0.6250         -1.9598          .
                         eCDF Mean eCDF Max
distance                    0.3229   0.5417
age                         0.0806   0.2083
disease_stageHPT            0.0833   0.0833
disease_stagewithout HPT    0.0833   0.0833
genderFemale                0.5417   0.5417
genderMale         

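Covariate Balance: This summary reports balance statistics for each covariate before matching ("All Data") and after matching. Good balance shows up as standardized mean differences close to 0 and variance ratios close to 1 in the matched sample; a common rule of thumb treats absolute standardized mean differences below 0.1 as acceptable. In the table above, gender is strongly imbalanced before matching (standardized mean difference of about 1.96).
Purpose of Checking Balance: Matching reduces confounding only to the extent that the matched groups end up similar on the covariates, so balance has to be verified rather than assumed. If the measured covariates are balanced, differences in outcomes are less likely to reflect pre-existing differences in those covariates. Matching does not address confounders that were not measured.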
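
To see where the standardized mean differences in the "All Data" table come from, the sketch below recomputes two of them by hand in base R from the data transcribed in Step 1. The formulas are inferred from the fact that they reproduce the printed values; they are offered as an illustration, not as a statement of MatchIt's exact internals.

```r
# Data transcribed from the printed table in Step 1
hpt    <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
female <- c(1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1)
data <- data.frame(
  age = c(78, 66, 30, 54, 39, 65, 47, 78, 76, 53,
          36, 65, 54, 66, 75, 49, 55, 79, 76, 32),
  disease_stage = ifelse(hpt == 1, "HPT", "without HPT"),
  gender = ifelse(female == 1, "Female", "Male"),
  treatment = c(1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
)

treated <- data[data$treatment == 1, ]
control <- data[data$treatment == 0, ]

# Continuous covariate: difference in means divided by the treated-group SD
smd_age <- (mean(treated$age) - mean(control$age)) / sd(treated$age)

# Binary covariate: difference in proportions divided by sqrt(p * (1 - p)),
# where p is the proportion in the treated group
p_t <- mean(treated$gender == "Female")
p_c <- mean(control$gender == "Female")
smd_female <- (p_t - p_c) / sqrt(p_t * (1 - p_t))

print(round(c(age = smd_age, genderFemale = smd_female), 4))
```

The results, -0.2119 for age and 1.9598 for genderFemale, agree with the Std. Mean Diff. column printed for all data.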