## 4.3 Matching

This notebook starts from a notebook by Matheus Facure, freely released on his GitHub account under the MIT license: [link](https://github.com/matheusfacure/python-causality-handbook/blob/master/causal-inference-for-the-brave-and-true). All credit goes to the author!

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import graphviz as gr
%matplotlib inline

In [None]:
url = "https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/refs/heads/master/causal-inference-for-the-brave-and-true/data/trainees.csv"
trainees_data = pd.read_csv(url)
trainees_data = trainees_data[trainees_data.columns[:9]]

trainees_data.head()

Unnamed: 0,unit,trainees,age,earnings
0,1,1,28,17700
1,2,1,34,10200
2,3,1,29,14400
3,4,1,25,20800
4,5,1,29,6100


In [3]:
trainees_data.query("trainees==1")

Unnamed: 0,unit,trainees,age,earnings
0,1,1,28,17700
1,2,1,34,10200
2,3,1,29,14400
3,4,1,25,20800
4,5,1,29,6100
5,6,1,23,28600
6,7,1,33,21900
7,8,1,27,28800
8,9,1,31,20300
9,10,1,26,28100


Let's compute the difference in means for the earnings of people who underwent the treatment vs people who did not.

In [4]:
trainees_data.query("trainees==1")["earnings"].mean() - trainees_data.query("trainees==0")["earnings"].mean()

np.float64(-4297.49373433584)

This number tells us that people who underwent the training earn, on average, about 4,297 less than people who did not. However, this comparison ignores other variables, such as age, that also influence earnings and may differ between the two groups. To compare like with like, we can match each trainee with a non-trainee of the same age.
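A quick way to check for this kind of imbalance is to compare the group means of the covariate. The sketch below uses a small made-up table: the column names mirror the dataset, but the numbers are illustrative assumptions and are not taken from `trainees.csv`.

```python
import pandas as pd

# Hypothetical toy data (NOT the trainees dataset)
toy = pd.DataFrame({
    "trainees": [1, 1, 1, 0, 0, 0],
    "age":      [25, 28, 30, 40, 45, 50],
    "earnings": [15000, 18000, 20000, 26000, 30000, 31000],
})

# Mean age and earnings per group; a large age gap between the
# groups hints that age may confound the raw earnings comparison
group_means = toy.groupby("trainees")[["age", "earnings"]].mean()
print(group_means)
```

Running the same `groupby` on `trainees_data` would show whether trainees and non-trainees differ in age.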

In [5]:
# keep one untreated unit per age (the first one encountered)
unique_on_age = (trainees_data
                 .query("trainees==0")
                 .drop_duplicates("age"))

matches = (trainees_data
           .query("trainees==1")
           .merge(unique_on_age, on="age", how="left", suffixes=("_t_1", "_t_0"))
           .assign(t1_minuts_t0 = lambda d: d["earnings_t_1"] - d["earnings_t_0"]))

matches.head(7)

Unnamed: 0,unit_t_1,trainees_t_1,age,earnings_t_1,unit_t_0,trainees_t_0,earnings_t_0,t1_minuts_t0
0,1,1,28,17700,27,0,8800,8900
1,2,1,34,10200,34,0,24200,-14000
2,3,1,29,14400,37,0,6200,8200
3,4,1,25,20800,35,0,23300,-2500
4,5,1,29,6100,37,0,6200,-100
5,6,1,23,28600,40,0,9500,19100
6,7,1,33,21900,29,0,15500,6400


Here, the last column contains the difference in earnings between the two matched units: the trainee and the non-trainee of the same age. The mean of this column is the average difference across matched pairs.

In [6]:
matches["t1_minuts_t0"].mean()

np.float64(2457.8947368421054)
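One detail worth checking with exact matching is whether every treated unit actually found a match. With `how="left"`, a trainee whose age has no counterpart among the untreated gets `NaN` in the untreated columns, and `mean()` silently skips those rows. The sketch below illustrates this on made-up numbers; the data and the names `n_unmatched` and `att_estimate` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (NOT the trainees dataset)
treated = pd.DataFrame({"age": [25, 30, 35],
                        "earnings": [20000, 22000, 24000]})
untreated = pd.DataFrame({"age": [25, 25, 30],
                          "earnings": [18000, 19000, 23000]})

# One untreated unit per age (drop_duplicates keeps the first row)
unique_on_age = untreated.drop_duplicates("age")

matches = (treated
           .merge(unique_on_age, on="age", how="left",
                  suffixes=("_t_1", "_t_0"))
           .assign(diff=lambda d: d["earnings_t_1"] - d["earnings_t_0"]))

# Treated units with no untreated unit of the same age
n_unmatched = int(matches["earnings_t_0"].isna().sum())

# Series.mean() skips NaN, so only matched pairs contribute
att_estimate = matches["diff"].mean()
print(n_unmatched, att_estimate)
```

Because only treated units are matched here, the resulting average is best read as an estimate of the effect on the treated, computed over the trainees for whom a match exists.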

This was a deliberately simple example; real-world data is rarely this tidy. There are usually several confounding factors, and exact matches are often unavailable. Next, we look at a more general approach to matching.

In [8]:
url = "https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/refs/heads/master/causal-inference-for-the-brave-and-true/data/medicine_impact_recovery.csv"
med_data = pd.read_csv(url)
med_data = med_data[med_data.columns[:9]]

med_data.head()

Unnamed: 0,sex,age,severity,medication,recovery
0,0,35.049134,0.887658,1,31
1,1,41.580323,0.899784,1,49
2,1,28.127491,0.486349,0,38
3,1,36.375033,0.323091,0,35
4,0,25.091717,0.209006,0,15


In [9]:
med_data.query("medication==1")["recovery"].mean() - med_data.query("medication==0")["recovery"].mean()

np.float64(16.895799546498726)

This tells us that treated patients take, on average, 16.9 more days to recover than untreated patients. Of course, a medicine is not supposed to slow recovery, so let's check whether confounding explains this.

First, we standardize the features so that variables with a larger numeric range, like age, do not dominate the distance computation over variables with a smaller range, like severity.

In [14]:
# scale features
X = ["severity", "age", "sex"]
y = "recovery"

med_data = med_data.assign(**{f: (med_data[f] - med_data[f].mean())/med_data[f].std() for f in X})
med_data.head()

Unnamed: 0,sex,age,severity,medication,recovery
0,-0.99698,0.280787,1.4598,1,31
1,1.002979,0.865375,1.502164,1,49
2,1.002979,-0.338749,0.057796,0,38
3,1.002979,0.399465,-0.512557,0,35
4,-0.99698,-0.610473,-0.911125,0,15


What the above code does:

1. Loops over every feature name (column name) in the list `X`.
2. Standardizes each column `f` (`(med_data[f] - med_data[f].mean()) / med_data[f].std()`).
3. Builds a dictionary that maps each column name to its standardized column and unpacks it (with `**`) as keyword arguments to `assign`, which returns a new DataFrame with those columns replaced.
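The same pattern can be tried on a tiny table to confirm what it does. This is a minimal sketch on made-up values; `toy`, `features`, and `scaled` are hypothetical names. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data (NOT the medicine dataset)
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0],
                    "severity": [0.1, 0.5, 0.9]})
features = ["age", "severity"]

# Dict comprehension builds {column name: standardized column};
# ** unpacks it into keyword arguments for assign
scaled = toy.assign(**{f: (toy[f] - toy[f].mean()) / toy[f].std()
                       for f in features})

# After scaling, each feature has mean ~0 and sample std ~1
print(scaled.mean().round(6))
print(scaled.std().round(6))
```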

Now, to the matching itself. Instead of coding a matching function, we will use the K-nearest-neighbours algorithm from scikit-learn. With `n_neighbors=1`, it predicts the outcome of a new point by returning the outcome of the closest point in the training set.

For matching, we need two of these models. The first, `mt0`, stores the untreated points and finds matches among the untreated when queried. The second, `mt1`, stores the treated points and finds matches among the treated. After fitting, we use these KNN models to make predictions, and those predictions are our matches.
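To see why a 1-nearest-neighbour regressor acts as a matcher, here is a minimal sketch on made-up one-dimensional data. The arrays and variable names are assumptions for illustration. `predict` returns the outcome of the closest stored point, and `kneighbors` reveals which stored point that was.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical untreated units: one feature and one outcome each
X_untreated = np.array([[0.0], [1.0], [2.0]])
y_untreated = np.array([10.0, 20.0, 30.0])

# Fitting a KNN model simply stores the untreated units
knn = KNeighborsRegressor(n_neighbors=1).fit(X_untreated, y_untreated)

# Hypothetical treated units looking for an untreated match
X_treated = np.array([[0.1], [1.8]])

# With one neighbour, the prediction is the matched unit's outcome
matched_outcomes = knn.predict(X_treated)

# kneighbors returns (distances, indices) of the matched stored units
distances, indices = knn.kneighbors(X_treated)
print(matched_outcomes, indices.ravel())
```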

In [12]:
from sklearn.neighbors import KNeighborsRegressor

treated = med_data.query("medication==1")
untreated = med_data.query("medication==0")

mt0 = KNeighborsRegressor(n_neighbors=1).fit(untreated[X], untreated[y])
mt1 = KNeighborsRegressor(n_neighbors=1).fit(treated[X], treated[y])

predicted = pd.concat([
    # find matches for the treated looking at the untreated knn model
    treated.assign(match=mt0.predict(treated[X])),
    
    # find matches for the untreated looking at the treated knn model
    untreated.assign(match=mt1.predict(untreated[X]))
])

predicted.head()

Unnamed: 0,sex,age,severity,medication,recovery,match
0,-0.99698,0.280787,1.4598,1,31,39.0
1,1.002979,0.865375,1.502164,1,49,52.0
7,-0.99698,1.495134,1.26854,1,38,46.0
10,1.002979,-0.106534,0.545911,1,34,45.0
16,-0.99698,0.043034,1.428732,1,30,39.0


In [13]:
np.mean((2*predicted["medication"] - 1)*(predicted["recovery"] - predicted["match"]))

np.float64(-0.9954)
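The expression in the cell above can be read as the following matching estimator. The notation is an assumption introduced here for illustration: $T_i$ is the `medication` indicator, $Y_i$ is the observed `recovery`, and $Y_{jm}(i)$ is the `match` column, i.e. the outcome of the nearest unit from the opposite treatment group.

$$
\widehat{ATE} = \frac{1}{N}\sum_{i=1}^{N}(2T_i - 1)\,\bigl(Y_i - Y_{jm}(i)\bigr)
$$

The factor $(2T_i - 1)$ equals $+1$ for treated units and $-1$ for untreated units. For a treated unit the term is $Y_i - Y_{jm}(i)$, and for an untreated unit it is $Y_{jm}(i) - Y_i$, so every term is a treated-minus-untreated difference.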

Conclusion: after matching, the medicine is estimated to reduce recovery time by about one day on average.