**Summary Statistics and Data Construction from IPUMS USA**

*To replicate the original analysis, we begin by constructing our own analytical dataset using the 1980 IPUMS USA 5% Census microdata. This section documents the data cleaning process and presents key summary statistics that both validate the construction of the sample and provide initial insights into patterns of educational progress prior to the main empirical analysis.*

In [2]:
pip install ipumspy pandas


In [3]:
import pandas as pd
from ipumspy import readers

In [4]:
ddi = readers.read_ipums_ddi("usa_00007.xml")
df = readers.read_microdata(ddi, "usa_00007.dat")
# usa_00007.dat is the extract we created and downloaded from IPUMS USA

See the `ipums_conditions` attribute of this codebook for terms of use.
See the `ipums_citation` attribute of this codebook for the appropriate citation.


In [5]:
df.columns = df.columns.str.strip()
print("Raw df:", df.shape)

Raw df: (2569484, 27)


In [6]:
print("STEP 0:", df.shape)

# 1. Household population only
df = df[df['GQ'] == 1].copy()
print("STEP 1:", df.shape)

# 2. Restrict ages 6–15
df = df[(df['AGE'] >= 6) & (df['AGE'] <= 15)].copy()
print("STEP 2:", df.shape)

# 3. Same state as birth
df = df[df['STATEFIP'] == df['BPL']].copy()
print("STEP 3:", df.shape)

# 4. Drop missing counties
df = df[df['COUNTYFIP'].notna()].copy()
print("STEP 4:", df.shape)

STEP 0: (2569484, 27)
STEP 1: (2505729, 27)
STEP 2: (391053, 27)
STEP 3: (270601, 27)
STEP 4: (270601, 27)
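
Step 4 removes no rows, which is what one would expect if COUNTYFIP is never stored as a missing value. In IPUMS extracts, counties that cannot be identified are typically coded 0 rather than left blank; that is an assumption to confirm against the codebook for this extract. The sketch below, on a hypothetical toy frame (not the real data), counts both kinds of "missing" before filtering:

```python
import pandas as pd

# Hypothetical toy data; 0 is ASSUMED to mean "county not identifiable".
toy = pd.DataFrame({
    "SERIAL": [1, 2, 3, 4],
    "COUNTYFIP": pd.array([0, 10, 0, 30], dtype="Int64"),
})

n_missing = int(toy["COUNTYFIP"].isna().sum())        # true missing values
n_unidentified = int((toy["COUNTYFIP"] == 0).sum())   # zero-coded counties

# Keep only rows with an identified county under the assumption above.
identified = toy[toy["COUNTYFIP"] > 0].copy()
print(n_missing, n_unidentified, identified.shape)
```

If the zero-code count is large in the real extract, dropping those rows changes the sample and the later step counts, so the choice should be made deliberately.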


In [7]:
df['birthyear'] = 1980 - df['AGE']
print("STEP 5 complete")

STEP 5 complete


In [8]:
df['expected_grade'] = df['AGE'] - 5
print(df[['AGE', 'expected_grade']].head())
print("STEP 6 complete")

    AGE  expected_grade
4    15              10
9     7               2
10    6               1
23   15              10
24   12               7
STEP 6 complete


In [9]:
df['gfa'] = (df['GRADEATT'] == df['expected_grade']).astype(int)
print(df[['GRADEATT','expected_grade','gfa']].head())
print("STEP 7 complete")

    GRADEATT  expected_grade  gfa
4          5              10    0
9          3               2    0
10         3               1    0
23         5              10    0
24         4               7    0
STEP 7 complete
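
The rows printed above show GRADEATT values of 3 to 5 next to expected grades of 1 to 10, which suggests that GRADEATT is a banded category code rather than a single grade number. A minimal sketch of a band-based comparison follows. It assumes the codes 3 = grades 1 to 4, 4 = grades 5 to 8, and 5 = grades 9 to 12, which should be checked against the codebook; `expected_gradeatt` and the toy frame are hypothetical and not part of the original analysis.

```python
import pandas as pd

def expected_gradeatt(age: int) -> int:
    """Map age to the GRADEATT band containing grade (age - 5).

    ASSUMED codes: 3 = grades 1-4, 4 = grades 5-8, 5 = grades 9-12.
    """
    grade = age - 5
    if grade <= 4:
        return 3
    if grade <= 8:
        return 4
    return 5

# Hypothetical toy children, ages 6-15.
toy = pd.DataFrame({
    "AGE": [6, 8, 10, 12, 15],
    "GRADEATT": pd.array([3, 3, 4, 3, 5], dtype="Int64"),
})
toy["expected_band"] = toy["AGE"].map(expected_gradeatt)
toy["gfa_band"] = (toy["GRADEATT"] == toy["expected_band"]).astype(int)
print(toy)
```

A banded indicator is coarse: it cannot detect a child who is one grade behind within the same band. A variable with single-grade detail, if the extract contains one, would be closer to the usual grade-for-age definition.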


In [10]:
# Create mother dataset
mothers = df[['SERIAL', 'PERNUM', 'EDUC']].rename(
    columns={'PERNUM':'MOM_index', 'EDUC':'mother_educ'}
)

# Merge child with their mother using MOMLOC
df = df.merge(
    mothers,
    how='left',
    left_on=['SERIAL','MOMLOC'],
    right_on=['SERIAL','MOM_index']
)

# Drop the helper column
df = df.drop(columns=['MOM_index'])

print("STEP 8 complete — mother_educ merged")

STEP 8 complete — mother_educ merged


In [11]:
df['MOMLOC'].value_counts().head(10)

MOMLOC
2    195031
1     55729
0     16017
3      2680
4       636
5       227
6       156
7        66
8        21
9        18
Name: count, dtype: Int64

In [12]:
df['mother_educ'].isna().mean()

0.9999630452215623
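
Almost every child has a missing mother_educ because the `mothers` lookup is built from `df` after the age 6–15 restriction, so adult mothers are no longer in the table being merged. One possible fix, sketched on a hypothetical toy household file, is to build the lookup from the unfiltered person records and only then restrict to children. The toy values and the reading of MOMLOC = 0 as "no mother in household" are assumptions.

```python
import pandas as pd

# Hypothetical toy household file: one row per person, BEFORE any age filter.
raw = pd.DataFrame({
    "SERIAL": [1, 1, 1, 2, 2],
    "PERNUM": [1, 2, 3, 1, 2],
    "AGE":    [40, 38, 9, 35, 7],
    "EDUC":   [6, 7, 1, 8, 1],
    "MOMLOC": [0, 0, 2, 0, 0],   # 0 ASSUMED to mean "no mother in household"
})

# Build the lookup from ALL persons so that adult mothers are still present.
mothers = raw[["SERIAL", "PERNUM", "EDUC"]].rename(
    columns={"PERNUM": "MOMLOC", "EDUC": "mother_educ"}
)

# Restrict to children, then attach the mother's education by household
# serial number and the mother's person number.
children = raw[raw["AGE"].between(6, 15)].copy()
children = children.merge(
    mothers, how="left", on=["SERIAL", "MOMLOC"], validate="many_to_one"
)
children["mother_present"] = (children["MOMLOC"] > 0).astype(int)
print(children[["SERIAL", "PERNUM", "MOMLOC", "mother_educ", "mother_present"]])
```

In the notebook this would mean keeping a copy of the raw extract (or building `mothers`) before STEP 1. The `validate="many_to_one"` argument makes pandas raise an error if a household ever contains duplicate person numbers.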

In [13]:
df['EDUC'].value_counts(dropna=False).head(20)

EDUC
1    148027
2    105064
3     14784
0      1796
4       798
5        98
6        34
Name: count, dtype: Int64

In [14]:
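# Caveat: MOMLOC is 0 (not NaN) for children without a mother in the household, so this flag is 1 for every row
df['mother_present'] = df['MOMLOC'].notna().astype(int)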

In [15]:
# 1 if born in 1964 or later, i.e. age 5 or younger in 1969
df['post1969_preschool'] = (df['birthyear'] >= 1964).astype(int)

**Interpretation of Summary Statistics**

**Core Summary Statistics Table: Child Demographics, Schooling Outcomes, and Family Characteristics**

**Results:**
The summary statistics table reports the mean, median, standard deviation, interquartile range (IQR), minimum, and maximum of grade-for-age, age, sex, race, Hispanic status, grade attainment, maternal education, and maternal presence among children aged 6–15 (N = 270,601). Grade-for-age has a mean of 0.102, a median of 0, a standard deviation of 0.302, and an IQR of 0. Age has a mean of 10.542, a median of 10, a standard deviation of 2.862, and an IQR of 5, and ranges from 6 to 15. SEX, RACE, and HISPAN are categorical codes, so their means (1.489, 1.328, and 0.256) summarize the coding rather than a quantity; for example, a SEX mean of 1.489 implies that about 48.9% of the sample carries code 2. RACE ranges from 1 to 7 (median 1, standard deviation 0.851) and HISPAN from 0 to 4 (median 0, standard deviation 0.742). GRADEATT has a mean of 3.601, a median of 4, a standard deviation of 0.904, an IQR of 1, and ranges from 0 to 6. Maternal education has a mean of 3.4, a median of 3, a standard deviation of 1.506, an IQR of 3, and ranges from 2 to 5, but only 10 observations are non-missing. Maternal presence equals 1 for every child.

**Interpretation:**
The sample covers elementary and early secondary school ages. Only about ten percent of children are coded as exactly on grade, but this figure should be read with caution: gfa compares GRADEATT directly with age minus 5, and GRADEATT takes only the values 0 through 6 in this extract, which suggests a category code rather than a single grade number. Race and Hispanic status vary across the sample, although the means of these codes are not directly interpretable. Maternal education cannot be characterized from this table because only 10 children were matched to a mother: the lookup is built from the data frame after the age restriction, so almost no mothers remain in it. Maternal presence is 1 for every child because the indicator tests MOMLOC for missing values, while children without a mother in the household appear to be coded MOMLOC = 0 (16,017 cases). The table therefore validates the sample size and age range more than it describes family background. Because Table 4 focuses on demographic coverage rates and gfa, these summary statistics reproduce only half of the table; recreating the coverage rates requires the replication package provided by the authors.

In [16]:
import numpy as np
import pandas as pd

# variables for summary table
vars_for_table = [
    'gfa', 'AGE', 'SEX', 'RACE', 'HISPAN',
    'GRADEATT', 'mother_educ', 'mother_present'
]

summary = pd.DataFrame({
    'mean': df[vars_for_table].mean(),
    'median': df[vars_for_table].median(),
    'sd': df[vars_for_table].std(),
    'IQR': df[vars_for_table].quantile(0.75) - df[vars_for_table].quantile(0.25),
    'min': df[vars_for_table].min(),
    'max': df[vars_for_table].max()
})

summary.round(3)

                  mean  median     sd  IQR  min  max
gfa              0.102     0.0  0.302  0.0    0    1
AGE             10.542    10.0  2.862  5.0    6   15
SEX              1.489     1.0  0.500  1.0    1    2
RACE             1.328     1.0  0.851  0.0    1    7
HISPAN           0.256     0.0  0.742  0.0    0    4
GRADEATT         3.601     4.0  0.904  1.0    0    6
mother_educ      3.400     3.0  1.506  3.0    2    5
mother_present   1.000     1.0  0.000  0.0    1    1


**Weighted vs. Unweighted Means: Population-Weighted Descriptive Comparison**

**Results:**
Using person-level sampling weights, the weighted mean of grade-for-age is 0.1017, the weighted mean age is 10.542, and the weighted mean grade attainment is 3.601. These values are nearly identical to their unweighted counterparts reported in the summary statistics table. In contrast, the weighted mean of maternal education is approximately 0.0001.

**Interpretation:**
Applying sampling weights does not materially change the means of grade-for-age, age, or grade attainment, which suggests that the unweighted summary statistics are broadly representative for these variables. The near-zero weighted mean for maternal education is an artifact of the calculation rather than a population estimate: the numerator sums over the few children with non-missing maternal education, while the denominator sums the weights of all children, so the ratio is pulled toward zero.

In [17]:
def wmean(x, w):
    return (x * w).sum() / w.sum()

weighted_means = {
    'gfa': wmean(df['gfa'], df['PERWT']),
    'AGE': wmean(df['AGE'], df['PERWT']),
    'GRADEATT': wmean(df['GRADEATT'], df['PERWT']),
    'mother_educ': wmean(df['mother_educ'], df['PERWT'])
}

weighted_means

{'gfa': 0.10171802764956522,
 'AGE': 10.541620319215376,
 'GRADEATT': 3.6014501055058923,
 'mother_educ': 0.000125646246687928}
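
The mother_educ value above comes from dividing a numerator that skips missing rows by a denominator that includes every row's weight. A minimal sketch of a weighted mean restricted to observed values is shown below; `wmean_valid` is a hypothetical helper and the numbers are toy values, not estimates from the extract.

```python
import numpy as np
import pandas as pd

def wmean_valid(x: pd.Series, w: pd.Series) -> float:
    """Weighted mean over rows where x is observed (hypothetical helper)."""
    valid = x.notna()
    if not valid.any():
        return float("nan")
    return float((x[valid] * w[valid]).sum() / w[valid].sum())

# Toy values: two of the four observations are missing.
x = pd.Series([2.0, np.nan, 5.0, np.nan])
w = pd.Series([100.0, 100.0, 300.0, 500.0])

# Original formula: the denominator still counts weights of missing rows.
naive = (x * w).sum() / w.sum()
# Restricted formula: numerator and denominator use the same rows.
fixed = wmean_valid(x, w)
print(naive, fixed)
```

With only 10 non-missing mother_educ values in the current data frame, even the restricted version would rest on too few observations to be informative until the merge itself is repaired.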

**Share of Children Exactly On Grade vs. Not On Grade**

**Results:**
The mean of the grade-for-age indicator is approximately 0.1017, indicating that about 10.17% of children are exactly on grade for their age. The complement of this value is approximately 0.8983, meaning that about 89.83% of children are not exactly on grade.

**Interpretation:**
Taken at face value, about nine in ten children are not coded as exactly on grade, whether ahead of or behind their expected level. Because the indicator requires GRADEATT to equal age minus 5 exactly, this share is sensitive to how GRADEATT is coded and should be validated before grade-for-age is used to detect disparities in educational progress.

In [18]:
df['gfa'].mean()

0.10171802764956522

In [19]:
1 - df['gfa'].mean()

0.8982819723504347

**Distribution of Grade Deficits Relative to Expected Grade**

**Results:**
The grade deficit, defined as expected grade minus grade attainment, ranges from −2 to 10. The most frequent values are 1 (43,895), 4 (40,578), 2 (38,341), and 5 (38,265), followed by −1 (34,308); deficits of 0 and 3 account for 27,525 and 27,603 children respectively. Counts drop sharply beyond 5, with 4,577 children at deficits of 6 or more.

**Interpretation:**
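About 39% of children fall within one unit of their expected grade, while about 41% show a deficit of three or more. The distribution is therefore broad rather than concentrated at zero, and the second peak at deficits of 4 and 5 is consistent with GRADEATT being a coarse category code that falls further behind age minus 5 as children get older. The deficit, and the grade-for-age indicator derived from the same comparison, should be interpreted with that coding in mind.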

In [20]:
# Grade deficit: expected grade minus reported grade (positive = behind)
df['grade_deficit'] = df['expected_grade'] - df['GRADEATT']
df['grade_deficit'].value_counts().sort_index()

grade_deficit
-2    15509
-1    34308
0     27525
1     43895
2     38341
3     27603
4     40578
5     38265
6      2579
7       353
8       328
9       424
10      893
Name: count, dtype: Int64
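
Counts are easier to compare as shares. The sketch below re-enters the counts printed above into a Series and computes the share of children within one unit of the expected grade and the share with a deficit of three or more. On the real data frame, `df['grade_deficit'].value_counts(normalize=True).sort_index()` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {-2: 15509, -1: 34308, 0: 27525, 1: 43895, 2: 38341, 3: 27603,
     4: 40578, 5: 38265, 6: 2579, 7: 353, 8: 328, 9: 424, 10: 893},
    name="count",
)

shares = counts / counts.sum()

# Label-based slices on the sorted integer index (both endpoints inclusive).
share_within_one = shares.loc[-1:1].sum()
share_three_plus = shares.loc[3:].sum()

print(shares.round(3))
print(round(share_within_one, 3), round(share_three_plus, 3))
```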

**Grade-for-Age and Grade Attainment by Age**

**Results:**
Grade-for-age varies sharply across ages. At ages 6 and 7, the mean grade-for-age is very low at approximately 0.009 and 0.015, respectively. At age 8, it jumps to approximately 0.983, before falling to approximately 0.026 at age 9 and to zero for ages 10 through 15. In contrast, the mean GRADEATT code rises with age in a series of steps, from 2.54 at age 6 to approximately 4.76 at age 15.

**Interpretation:**
The spike at age 8 is mechanical: the expected grade at age 8 is 3, and about 98% of 8-year-olds have a GRADEATT code of 3. At ages 10 through 15 the expected grade is 5 or higher while mean GRADEATT stays below 5, so no child matches and grade-for-age is zero. These patterns reflect how the indicator is constructed rather than grade repetition, tracking, or delayed progression. They also show why raw grade-for-age levels cannot be compared across ages and why cohort-based comparisons and age controls matter in the empirical analysis.

In [23]:
df.groupby('AGE')[['gfa', 'GRADEATT']].mean()

          gfa  GRADEATT
AGE
6    0.009193  2.538522
7    0.014508  2.935268
8    0.983305  2.962774
9    0.025914  2.984901
10   0.000000  3.481316
11   0.000000  3.891507
12   0.000000  3.945780
13   0.000000  3.972583
14   0.000000  4.460769
15   0.000000  4.762020
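
Group means hide how GRADEATT codes are distributed within each age. A row-normalized cross-tabulation is one way to inspect that distribution; the sketch below uses a hypothetical toy frame, and on the real data the call would be `pd.crosstab(df['AGE'], df['GRADEATT'], normalize='index')`.

```python
import pandas as pd

# Hypothetical toy data: three children at each of two ages.
toy = pd.DataFrame({
    "AGE":      [8, 8, 8, 10, 10, 10],
    "GRADEATT": [3, 3, 4, 3, 4, 4],
})

# Each row sums to 1: the share of children at that age in each GRADEATT code.
tab = pd.crosstab(toy["AGE"], toy["GRADEATT"], normalize="index")
print(tab.round(3))
```

If most children of a given age sit in a single code, an exact-equality indicator will be close to 0 or 1 at that age regardless of actual school progress.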


**Grade-for-Age Variation Across States**

**Results:**
Average grade-for-age varies modestly across the 11 state codes present in the extract, ranging from approximately 0.091 (STATEFIP 10) to approximately 0.114 (STATEFIP 4). Most states cluster around the overall mean of roughly 0.10, and the full range spans about 2.3 percentage points.

**Interpretation:**
There is limited geographic variation in grade-for-age across states. Because grade-for-age depends so strongly on age, part of this variation may reflect differences in age composition rather than in schooling. The narrow range suggests that state-level differences are not large enough to dominate individual- or family-level factors, which is consistent with using geographic fixed effects in the main regression analysis.

In [26]:
df.groupby('STATEFIP')['gfa'].mean().sort_values()

STATEFIP
10    0.090852
9     0.094396
6     0.099582
5     0.099656
1     0.099951
2     0.101637
13    0.102138
8     0.102309
11    0.106095
12    0.110367
4     0.114031
Name: gfa, dtype: float64
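
Numeric FIPS codes are hard to read in a ranking. The sketch below re-enters the state means printed above and attaches state names using the standard FIPS state codes for the 11 values that appear; the dictionary is typed by hand for illustration and is not part of the original notebook.

```python
import pandas as pd

# Standard FIPS state codes for the 11 values that appear in the output.
fips_names = {
    1: "Alabama", 2: "Alaska", 4: "Arizona", 5: "Arkansas",
    6: "California", 8: "Colorado", 9: "Connecticut", 10: "Delaware",
    11: "District of Columbia", 12: "Florida", 13: "Georgia",
}

# State means copied from the groupby output above.
state_gfa = pd.Series(
    {10: 0.090852, 9: 0.094396, 6: 0.099582, 5: 0.099656, 1: 0.099951,
     2: 0.101637, 13: 0.102138, 8: 0.102309, 11: 0.106095, 12: 0.110367,
     4: 0.114031},
    name="gfa",
)

labeled = state_gfa.rename(index=fips_names).sort_values()
spread = labeled.max() - labeled.min()   # range across states
print(labeled)
print(round(spread, 4))
```

Reporting the group sizes next to the means, for example with `df.groupby('STATEFIP')['gfa'].agg(['mean', 'size'])`, would show whether the extremes come from small states.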

**Descriptive Statistics for Key Child and Family Characteristics**

In [29]:
df[['AGE','GRADEATT','mother_educ','gfa']].describe().round(3)

              AGE    GRADEATT  mother_educ         gfa
count  270601.000  270601.000       10.000  270601.000
mean       10.542       3.601        3.400       0.102
std         2.862       0.904        1.506       0.302
min         6.000       0.000        2.000       0.000
25%         8.000       3.000        2.000       0.000
50%        10.000       4.000        3.000       0.000
75%        13.000       4.000        5.000       0.000
max        15.000       6.000        5.000       1.000
