# Logistic Regression Analysis of Upward Career Mobility

This notebook fits a series of logistic regression models to examine determinants of upward career mobility during the first five years of college-educated workers’ careers. Using a large-scale dataset of job transition pairs, the analysis focuses on how gender, race, and early career moves predict the likelihood of experiencing upward mobility.

__Important Note For Reproducibility__

* This data analysis code was developed using a proprietary, licensed commercial dataset (Lightcast data). The raw data files are NOT included in this repository and cannot be publicly shared.
* The file names for the career trajectory data used in the code are placeholders. To run this notebook, users must substitute them with their own compatible raw data files.

## Overview

The notebook performs the following key steps:

1. Load structured longitudinal job transition data for early-career workers.
2. Estimate logistic regression models (GLM with Binomial family) using `statsmodels`.
3. Export regression summaries to CSV for downstream analysis or reporting.
4. Explore interaction effects between demographic factors (gender, race) and early career move patterns.
5. Run stratified regressions by gender to understand heterogeneity in the predictors of upward mobility.

All models use a consistent specification of individual and job-level covariates, with additional interaction terms included in Models 2–4.

## Dataset

* Source: Pre-processed Lightcast (LC) data covering job transitions for 229K workers.
* Population: College-educated workers, observed over their first five years in the labor market.
* Structure: Each row represents a job-to-job transition pair, with information on demographic attributes, job characteristics, and economic context.

Key variables include:

* Demographics: `gender`, `race`, `generation`, `max_edu_name`
* Job characteristics: `move_1_1`, `move_1_2`, `move_2_1` (Type-1, Type-2, and Type-3 job changes), `onet_major_x`, `nacis6_major_x`, `log_wage_x`
* Regional context: `state_gdp_decile_x`
* Outcome: `up_move` — indicator for whether the transition represents an upward career move.

## Data Table Description

| Column                                         | Description                                                               |
| ---------------------------------------------- | ------------------------------------------------------------------------- |
| `max_edu_name`                                 | Highest degree attained within the first 5-year timeframe                 |
| `onet_major_x`                                 | Major SOC code (2-digit) of first job                                     |
| `nacis6_major_x`                               | 2-digit NAICS sector of first job                                         |
| `company_x`                                    | First employer company name                                               |
| `state_x`                                      | First job state                                                           |
| `job_start_year_x`                             | Year of first job start                                                   |
| `num_job_changes`                              | Number of job changes in the 5-year window (top-coded at 95th percentile) |
| `gender`                                       | Gender (1 = Male, 2 = Female)                                             |
| `race`                                         | Race (1 = White, 2 = Black, 3 = Asian, 4 = Hispanic)                      |
| `generation`                                   | Estimated generation cohort                                               |
| `state_gdp_decile_x`                           | Decile of state GDP in first job year                                     |
| `annual_state_wage_x`                          | Occupational wage for the first job (state × 6-digit occupation × year)   |
| `log_wage_x`                                   | Log-transformed first job wage                                            |
| `move_1_1`, `move_1_2`, `move_2_1`, `move_2_2` | Job change type indicators for Type-1, Type-2, Type-3, respectively       |
| `up_move`                                      | Upward mobility indicator: last job wage ≥ 5% above first job wage        |
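
The derived columns in the table can be illustrated on a toy frame. This is a minimal sketch under stated assumptions, not the preprocessing actually applied to the Lightcast data: the last-job wage column `annual_state_wage_y` is a hypothetical name, and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy data; `annual_state_wage_y` (last-job wage) is a hypothetical column name.
toy = pd.DataFrame({
    "annual_state_wage_x": [40000.0, 50000.0, 60000.0, 80000.0],
    "annual_state_wage_y": [44000.0, 51000.0, 60000.0, 70000.0],
    "num_job_changes": [0, 1, 2, 9],
})

# Log-transformed first-job wage.
toy["log_wage_x"] = np.log(toy["annual_state_wage_x"])

# Upward move: last-job wage at least 5% above first-job wage.
toy["up_move"] = (
    toy["annual_state_wage_y"] >= 1.05 * toy["annual_state_wage_x"]
).astype(int)

# Top-code the number of job changes at its 95th percentile.
cap = toy["num_job_changes"].quantile(0.95)
toy["num_job_changes"] = toy["num_job_changes"].astype(float).clip(upper=cap)

print(toy)
```

Only the first row counts as an upward move here, because its last-job wage is 10% above the first-job wage; the other rows rise by less than 5% or fall.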


## Model Specifications

All models share the following setup:

* Logistic regression, fitted as a GLM with a Binomial family and logit link
* Newton's method for fitting, with a fallback to IRLS if the Newton fit raises an error
* Covariates specified through patsy-style formulas, with categorical encodings via `C()`

### Model 1 — Baseline Model

Estimates the effect of demographic, job, and contextual covariates on upward mobility.

### Model 2 — Gender Interactions

Adds interactions between `gender` and early career move patterns to assess gender differences in mobility pathways.

### Model 3 — Race Interactions

Adds interactions between `race` and early career move patterns to explore racial disparities in upward mobility.

### Model 4 — Stratified by Gender

Runs separate models for male and female subsamples, incorporating race × move interactions to highlight within-gender heterogeneity.

## Outputs

For each model, the notebook:

* Prints a full regression summary to the console.
* Saves the standard `statsmodels` summary table as a `.csv` file for external analysis.

Output files (e.g., `glm_m1.csv`, `glm_m2.csv`, etc.) contain coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals.

## Reproducibility

To rerun the analysis:

1. Place the pre-processed Parquet dataset in the `data/` directory.
2. Open the notebook in Jupyter or JupyterLab.
3. Execute all cells in sequence.
4. Check the `results/` folder for exported model summaries.
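
Because users must substitute their own data files, a quick column check before step 3 can catch mismatches early. The helper below is a hypothetical addition, not part of the notebook; the column list is taken from the model formulas, with names spelled as they appear there.

```python
import pandas as pd

# Columns referenced by the model formulas in this notebook.
REQUIRED_COLUMNS = [
    "up_move", "gender", "race", "move_1_1", "move_1_2", "move_2_1",
    "onet_major_x", "nacis6_major_x", "max_edu_name", "generation",
    "state_gdp_decile_x", "num_job_changes", "log_wage_x",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Example with an empty frame that has only three of the required columns.
example = pd.DataFrame(columns=["up_move", "gender", "race"])
print(missing_columns(example))
```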

## Dependencies

* Python 3.8+
* [pandas](https://pandas.pydata.org/)
* [numpy](https://numpy.org/)
* [statsmodels](https://www.statsmodels.org/)


In [20]:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import time

# Filename breakdown: Lightcast (LC) partitions 0 to 19 (20 partitions), covering the first five years of each worker's career
data_filename = '../data/lc0-19_job_pairs_y1-5_final.parquet'

def load_datatable(filename):
    return pd.read_parquet(filename)

# Logistic Regression Models

In [15]:
def run_logit(formula, output_filename, maxiter=100, gtol=1e-8):
    """
    Fit a logistic regression (GLM Binomial) to the module-level DataFrame `df`
    using the provided formula string, print the summary, and export the
    standard statsmodels summary table to CSV.

    Args:
        formula (str): patsy-style model formula.
        output_filename (str): path to write CSV results.
        maxiter (int): max iterations passed to fit().
        gtol (float): convergence tolerance, passed to fit() as `tol`.
    """
    # --- Fit model ---
    model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())
    try:
        start = time.time()
        results = model.fit(method='newton', maxiter=maxiter, tol=gtol)
        end = time.time()
        print(f"Done in {end-start:.2f} sec")
    except Exception as e:
        print(f"Newton failed ({e}), falling back to IRLS...")
        results = model.fit(maxiter=maxiter, tol=gtol)

    # --- Print to console ---
    print(results.summary())

    # --- Export standard summary table ---
    with open(output_filename, 'w') as fh:
        fh.write(results.summary().as_csv())

    print(f"Exported regression summary table to {output_filename}")


## Model 1: Main Effects

In [13]:
df = load_datatable(data_filename)
output_filename = '../results/glm_m1.csv'
formula = (
'up_move ~ C(gender) + C(race) + C(move_1_1) + C(move_1_2) + C(move_2_1)'  +
'+ C(onet_major_x) + C(nacis6_major_x) +  C(max_edu_name) + C(generation)' +
'+ state_gdp_decile_x + num_job_changes + log_wage_x'
)
run_logit(formula, output_filename)

Done in 8.66 sec
                 Generalized Linear Model Regression Results                  
Dep. Variable:                up_move   No. Observations:               228710
Model:                            GLM   Df Residuals:                   228652
Model Family:                Binomial   Df Model:                           57
Link Function:                  Logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:            -1.1688e+05
Date:                Mon, 13 Oct 2025   Deviance:                   2.3377e+05
Time:                        21:48:12   Pearson chi2:                 2.36e+05
No. Iterations:                     3   Pseudo R-squ. (CS):             0.3014
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------

## Model 2: Gender x Job Change Types

In [14]:
df = load_datatable(data_filename)
output_filename = '../results/glm_m2.csv'
formula = (
'up_move ~ C(gender) + C(race) + C(move_1_1) + C(move_1_2) + C(move_2_1)' +
'+ C(onet_major_x) + C(nacis6_major_x) + C(max_edu_name) + C(generation)' +
'+ state_gdp_decile_x + num_job_changes + log_wage_x' +
'+ C(gender):C(move_1_1) + C(gender):C(move_1_2) + C(gender):C(move_2_1)'
)
run_logit(formula, output_filename)

Done in 8.99 sec
                 Generalized Linear Model Regression Results                  
Dep. Variable:                up_move   No. Observations:               228710
Model:                            GLM   Df Residuals:                   228649
Model Family:                Binomial   Df Model:                           60
Link Function:                  Logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:            -1.1687e+05
Date:                Mon, 13 Oct 2025   Deviance:                   2.3374e+05
Time:                        21:50:51   Pearson chi2:                 2.36e+05
No. Iterations:                     3   Pseudo R-squ. (CS):             0.3015
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------

## Model 3: Race x Job Change Types

In [16]:
df = load_datatable(data_filename)
output_filename = '../results/glm_m3.csv'
formula = (
'up_move ~ C(gender) + C(race) + C(move_1_1) + C(move_1_2) + C(move_2_1)' +
'+ C(onet_major_x) + C(nacis6_major_x) + C(max_edu_name) + C(generation)' +
'+ state_gdp_decile_x + num_job_changes + log_wage_x' +
'+ C(race):C(move_1_1) + C(race):C(move_1_2) + C(race):C(move_2_1)'
)
run_logit(formula, output_filename)

Done in 6.31 sec
                 Generalized Linear Model Regression Results                  
Dep. Variable:                up_move   No. Observations:               228710
Model:                            GLM   Df Residuals:                   228643
Model Family:                Binomial   Df Model:                           66
Link Function:                  Logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:            -1.1688e+05
Date:                Mon, 13 Oct 2025   Deviance:                   2.3375e+05
Time:                        21:53:02   Pearson chi2:                 2.36e+05
No. Iterations:                     3   Pseudo R-squ. (CS):             0.3014
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------

## Model 4: Race x Job Change Types Stratified by Gender

In [17]:
df = load_datatable(data_filename)
df = df.query("gender==1")
output_filename = '../results/glm_m4-m.csv'
# C(gender) is constant within this subsample, so it adds no estimable term.
formula = (
'up_move ~ C(gender) + C(race) + C(move_1_1) + C(move_1_2) + C(move_2_1)' +
'+ C(onet_major_x) + C(nacis6_major_x) + C(max_edu_name) + C(generation)' +
'+ state_gdp_decile_x + num_job_changes + log_wage_x' +
'+ C(race):C(move_1_1) + C(race):C(move_1_2) + C(race):C(move_2_1)'
)
run_logit(formula, output_filename)

Done in 3.02 sec
                 Generalized Linear Model Regression Results                  
Dep. Variable:                up_move   No. Observations:               115555
Model:                            GLM   Df Residuals:                   115489
Model Family:                Binomial   Df Model:                           65
Link Function:                  Logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:                -59114.
Date:                Mon, 13 Oct 2025   Deviance:                   1.1823e+05
Time:                        21:55:57   Pearson chi2:                 1.19e+05
No. Iterations:                     3   Pseudo R-squ. (CS):             0.3025
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------

In [18]:
df = load_datatable(data_filename)
df = df.query("gender==2")
output_filename = '../results/glm_m4-f.csv'
# C(gender) is constant within this subsample, so it adds no estimable term.
formula = (
'up_move ~ C(gender) + C(race) + C(move_1_1) + C(move_1_2) + C(move_2_1)' +
'+ C(onet_major_x) + C(nacis6_major_x) + C(max_edu_name) + C(generation)' +
'+ state_gdp_decile_x + num_job_changes + log_wage_x' +
'+ C(race):C(move_1_1) + C(race):C(move_1_2) + C(race):C(move_2_1)'
)
run_logit(formula, output_filename)

Done in 2.68 sec
                 Generalized Linear Model Regression Results                  
Dep. Variable:                up_move   No. Observations:               113155
Model:                            GLM   Df Residuals:                   113089
Model Family:                Binomial   Df Model:                           65
Link Function:                  Logit   Scale:                          1.0000
Method:                        newton   Log-Likelihood:                -57653.
Date:                Mon, 13 Oct 2025   Deviance:                   1.1531e+05
Time:                        21:56:10   Pearson chi2:                 1.17e+05
No. Iterations:                     3   Pseudo R-squ. (CS):             0.3011
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------