
Measurement Error

Our goal is to see how measurement error affects the observed estimates of a model.

Imports

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import bernoulli
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

Configuration

In [4]:
N_SAMPLES = 5000  # Uppercase by convention: the value is a fixed constant
N_FEATURES = 3
ERROR_P = 0.1  # Probability of being affected by measurement error
ERROR_PCT = 0.2  # Affected values (x3 and y) are multiplied by 1 + ERROR_PCT, i.e. scaled by 20%
RANDOM_STATE = 42  # Seed for reproducibility (analogous to set.seed in R)

Data Generating Process

Normally, we don't know the underlying data generating process (DGP) that created the data (if we did, there would be no point in estimating it at all!).

However, in this notebook, we will generate a synthetic dataset. This means we will know the form of the underlying DGP. We do this to be able to see how measurement error affects our estimates by comparing the population parameters against the observed ones.

In [6]:
# Generate IDs
ids = np.arange(N_SAMPLES) + 1

# Initialize a Bernoulli random variable and sample from it
brv = bernoulli(ERROR_P)
exclusions = brv.rvs(size=N_SAMPLES, random_state=RANDOM_STATE)

# Create a dataset of IDs and exclusion status
df_ids = pd.DataFrame({'id': ids, 'exclude': exclusions})

# View dataframe
print('Share of excluded observations:', df_ids['exclude'].mean())
df_ids.head(5)

Share of excluded observations: 0.0958


   id  exclude
0   1        0
1   2        1
2   3        0
3   4        0
4   5        0
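The printed share (0.0958) is not exactly ERROR_P because the flags are random draws. As a rough check, assuming independent Bernoulli draws, the standard error of a sample proportion is sqrt(p(1 - p) / n), so the observed share sits about one standard error below 0.1. The variable names `se`, `observed_share`, and `z` below are illustrative only.

```python
import numpy as np

N_SAMPLES = 5000
ERROR_P = 0.1

# Standard error of a sample proportion under independent Bernoulli(p) draws
se = np.sqrt(ERROR_P * (1 - ERROR_P) / N_SAMPLES)

# Share copied from the printed output of the cell above
observed_share = 0.0958

# Distance between observed and expected share, in standard errors
z = (observed_share - ERROR_P) / se
print(f'standard error: {se:.5f}, z-score: {z:.2f}')
```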


Next we generate a dataset with a known DGP. We then join the two datasets, both to practice how joins are performed and to inject synthetic measurement error into every row marked with exclude == 1.

In [9]:
# Generate data using sklearn
X, y, true_coef = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=N_FEATURES,
    n_targets=1,
    bias=50.0,
    noise=10.0,
    shuffle=False,
    coef=True,  # Return population coefficients
    random_state=RANDOM_STATE
)
print('Population parameters:', true_coef)

# Turn X and y into a single dataframe
df_obs = pd.concat(
    objs=[
        pd.DataFrame(ids, columns=['id']),
        pd.DataFrame(X, columns=['x1', 'x2', 'x3']),
        pd.DataFrame(y, columns=['y'])
    ],
    axis=1
)

# View second dataset
df_obs.head()

Population parameters: [38.95952484  1.51074456 89.82730651]


   id        x1         x2         x3            y
0   1  0.496714  -0.138264   0.647689    139.19152
1   2   1.52303  -0.234153  -0.234137    88.868465
2   3  1.579213   0.767435  -0.469474    63.743456
3   4   0.54256  -0.463418   -0.46573    14.226501
4   5  0.241962   -1.91328  -1.724918  -103.038928
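With these settings, the DGP produced by make_regression can be read as y = bias + X @ coef + noise, where the noise is Gaussian with standard deviation 10. The sketch below is one way to check that reading: it subtracts the known signal from y and inspects what is left. The names `signal` and `residual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same arguments as the cell above
X, y, coef = make_regression(
    n_samples=5000,
    n_features=3,
    n_informative=3,
    n_targets=1,
    bias=50.0,
    noise=10.0,
    shuffle=False,
    coef=True,
    random_state=42,
)

# Noise-free part of the DGP: intercept plus linear signal
signal = 50.0 + X @ coef

# Whatever remains should behave like zero-mean noise with std close to 10
residual = y - signal
print('residual mean:', residual.mean())
print('residual std:', residual.std())
```

If the residual standard deviation is close to the `noise` argument, the population parameters printed above are the right benchmark for the fitted models.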


Join both tables together on the key 'id'.

In [10]:
# Join the tables on the primary key 'id' (a key may span several columns, and there are several join types)
df = df_ids.merge(right=df_obs, how='inner', on=['id'])

# View data
df.head()

   id  exclude        x1         x2         x3            y
0   1        0  0.496714  -0.138264   0.647689    139.19152
1   2        1   1.52303  -0.234153  -0.234137    88.868465
2   3        0  1.579213   0.767435  -0.469474    63.743456
3   4        0   0.54256  -0.463418   -0.46573    14.226501
4   5        0  0.241962   -1.91328  -1.724918  -103.038928
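Here both tables contain exactly the same ids, so every join type returns the same rows. To see how the join types differ, the sketch below uses two small hypothetical tables (`left` and `right`, not the notebook's data) whose keys only partly overlap. The `validate` and `indicator` arguments of `DataFrame.merge` are optional safety aids.

```python
import pandas as pd

# Hypothetical toy tables with partially overlapping keys
left = pd.DataFrame({'id': [1, 2, 3], 'exclude': [0, 1, 0]})
right = pd.DataFrame({'id': [2, 3, 4], 'y': [10.0, 20.0, 30.0]})

# Inner join: only ids present in both tables; validate raises if a key repeats
inner = left.merge(right, how='inner', on='id', validate='one_to_one')

# Left join: every row of `left`; unmatched rows get NaN in the columns from `right`
# indicator=True adds a `_merge` column recording where each row came from
left_join = left.merge(right, how='left', on='id', indicator=True)

# Outer join: union of the ids in both tables
outer = left.merge(right, how='outer', on='id')

print(len(inner), len(left_join), len(outer))
print(left_join)
```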


We will now create two types of measurement error:

Error on a covariate
Error on the target

In [None]:
# Boolean mask that selects only the rows flagged with exclude == 1
mask = df['exclude'].eq(1)

# Distort x3 for entries that should be excluded
df['x3_bad'] = df['x3'].copy()
df.loc[mask, 'x3_bad'] = df.loc[mask, 'x3_bad'] * (1 + ERROR_PCT)

# Distort y for entries that should be excluded
df['y_bad'] = df['y'].copy()
df.loc[mask, 'y_bad'] = df.loc[mask, 'y_bad'] * (1 + ERROR_PCT)

# View some affected cases
df.loc[mask, ['id', 'x3', 'x3_bad', 'y', 'y_bad']].head()
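The distortion above is multiplicative, so it pushes affected values away from zero: positive values grow and negative values become more negative. The sketch below repeats the same masking pattern on a small hypothetical frame (`toy`, not the notebook's data) and checks that only the flagged rows change.

```python
import numpy as np
import pandas as pd

ERROR_PCT = 0.2

# Hypothetical toy data, not the notebook's dataframe
toy = pd.DataFrame({'exclude': [0, 1, 0, 1], 'y': [10.0, 20.0, -5.0, -40.0]})
mask = toy['exclude'].eq(1)

# Same pattern as above: copy the column, then scale the flagged rows in place
toy['y_bad'] = toy['y'].copy()
toy.loc[mask, 'y_bad'] = toy.loc[mask, 'y_bad'] * (1 + ERROR_PCT)

# Flagged rows are scaled by (1 + ERROR_PCT); the others are untouched
ratio = toy['y_bad'] / toy['y']
flagged_scaled = np.allclose(ratio[mask], 1 + ERROR_PCT)
others_unchanged = (toy.loc[~mask, 'y_bad'] == toy.loc[~mask, 'y']).all()
print(toy)
print('flagged rows scaled:', flagged_scaled)
print('other rows unchanged:', others_unchanged)
```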

Fitting models
1. Clean dataset

2. All observations and bad target

3. Filtered dataset and bad target

4. All observations and bad features

5. Filtered observations and bad features
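
The five specifications listed above could be fitted along the following lines. This is a sketch under assumptions, not the notebook's own code: it rebuilds the data so it runs on its own, takes "filtered" to mean dropping rows with exclude == 1, and uses a hypothetical helper `fit_coefs` plus illustrative labels for each model.

```python
import numpy as np
import pandas as pd
from scipy.stats import bernoulli
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

N_SAMPLES, N_FEATURES = 5000, 3
ERROR_P, ERROR_PCT, RANDOM_STATE = 0.1, 0.2, 42

# Rebuild the data with the same settings as above
X, y, true_coef = make_regression(
    n_samples=N_SAMPLES, n_features=N_FEATURES, n_informative=N_FEATURES,
    bias=50.0, noise=10.0, shuffle=False, coef=True, random_state=RANDOM_STATE,
)
df = pd.DataFrame(X, columns=['x1', 'x2', 'x3'])
df['y'] = y
df['exclude'] = bernoulli(ERROR_P).rvs(size=N_SAMPLES, random_state=RANDOM_STATE)

# Scale x3 and y by (1 + ERROR_PCT) on the flagged rows only
scale = np.where(df['exclude'].eq(1), 1 + ERROR_PCT, 1.0)
df['x3_bad'] = df['x3'] * scale
df['y_bad'] = df['y'] * scale

good = ['x1', 'x2', 'x3']
bad = ['x1', 'x2', 'x3_bad']
keep = df['exclude'].eq(0)  # Assumption: "filtered" drops the flagged rows

# (label, rows, feature columns, target column) for each specification
specs = [
    ('1. clean', df, good, 'y'),
    ('2. all obs, bad target', df, good, 'y_bad'),
    ('3. filtered, bad target', df[keep], good, 'y_bad'),
    ('4. all obs, bad features', df, bad, 'y'),
    ('5. filtered, bad features', df[keep], bad, 'y'),
]

def fit_coefs(data, features, target):
    """Hypothetical helper: fit OLS and return [intercept, b1, b2, b3]."""
    model = LinearRegression().fit(data[features], data[target])
    return [model.intercept_, *model.coef_]

results = pd.DataFrame(
    [fit_coefs(data, features, target) for _, data, features, target in specs],
    index=[label for label, *_ in specs],
    columns=['intercept', 'b1', 'b2', 'b3'],
)

# Add the known population parameters as a benchmark row
results.loc['population'] = [50.0, *true_coef]
print(results.round(3))
```

Comparing each row against the population row shows how far every specification drifts. Under the filtering assumption, models 3 and 5 see only undistorted rows, so they should produce identical estimates.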