A comprehensive guide to Python packages for applied economists, organized by functionality to support econometric analysis, data management, visualization, and specialized tasks.
- Core Libraries
- Econometric Methods and Research Designs
- Treatment Effect Estimation Tools
- Machine Learning
- Time Series Tools
- Bayesian Analysis Tools
- Data Management and Processing
- Visualization and Reporting
- Specialized Tools
- Development Tools
- Installation Summary
Before diving into specialized packages, ensure you have the foundational libraries installed:
- NumPy
- Description: Fundamental package for numerical computations.
- Installation:
pip install numpy
- Link: https://numpy.org/
- Pandas
- Description: Essential for data manipulation and analysis.
- Installation:
pip install pandas
- Link: https://pandas.pydata.org/
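For readers coming from Stata, the sketch below shows roughly how `collapse (mean)` and `merge` map onto pandas. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical worker-level data (values invented for illustration)
workers = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "wage": [30.0, 34.0, 28.0, 32.0, 25.0],
})
# Hypothetical state-level data
states = pd.DataFrame({
    "state": ["CA", "NY", "TX"],
    "min_wage": [16.0, 15.0, 7.25],
})

# Roughly `collapse (mean) wage, by(state)` in Stata
mean_wage = workers.groupby("state", as_index=False)["wage"].mean()

# Roughly `merge 1:1 state using states`; `indicator=True` adds a column
# that plays the role of Stata's `_merge`
merged = mean_wage.merge(
    states, on="state", how="left", validate="one_to_one", indicator=True
)
print(merged)
```

The `validate` argument makes pandas raise an error if the key is not unique on both sides, which is close in spirit to Stata's `1:1` check.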
- SciPy
- Description: Provides additional statistical functions and tools.
- Installation:
pip install scipy
- Link: https://www.scipy.org/
- Statsmodels
- Description: Provides classes and functions for estimating various statistical models, performing statistical tests, and data exploration.
- Capabilities:
- Linear Regression: Ordinary Least Squares (OLS)
- Generalized Linear Models (GLM)
- Discrete Choice Models: Logit, Probit
- Time Series Analysis: ARIMA, VAR, and state-space models
- Instrumental Variable Estimation: IV regression
- Installation:
pip install statsmodels
- Stata Equivalent: regress, logit, probit, arima, var, ivregress
- Link: https://www.statsmodels.org/
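A minimal sketch of an OLS regression with heteroskedasticity-robust standard errors, using the statsmodels formula interface on simulated data. The variable names and the data-generating process are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y = 1 + 2x + noise (invented for illustration)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=n)

# Roughly `regress y x, vce(robust)` in Stata; HC1 applies the same
# n/(n-k) small-sample correction
ols = smf.ols("y ~ x", data=df).fit(cov_type="HC1")
print(ols.params)
print(ols.bse)
```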
- Pingouin
- Description: Statistical package offering statistical tests and plotting functions.
- Capabilities:
- ANOVAs, t-tests, correlations
- Effect sizes, power analyses
- Installation:
pip install pingouin
- Link: https://pingouin-stats.org/
- Linearmodels
- Description: Specialized for panel data econometrics, including fixed effects, random effects, and instrumental variable models.
- Capabilities:
- Panel Data Analysis: Fixed effects, random effects, between estimators
- Instrumental Variables: IV estimators, Generalized Method of Moments (GMM)
- Seemingly Unrelated Regressions: System estimation
- Installation:
pip install linearmodels
- Stata Equivalent: xtreg, ivregress, sureg
- Link: https://bashtage.github.io/linearmodels/
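To make concrete what a fixed effects (within) estimator computes, here is a sketch that uses only NumPy and pandas rather than the linearmodels API. The simulated panel, with a regressor correlated with the unit effect, is an assumption chosen so that pooled OLS is visibly biased.

```python
import numpy as np
import pandas as pd

# Simulated panel (invented): 50 units observed for 8 periods
rng = np.random.default_rng(1)
n_units, n_periods = 50, 8
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)[unit]           # unit fixed effect
x = 0.8 * alpha + rng.normal(size=unit.size)     # regressor correlated with it
y = 1.5 * x + alpha + rng.normal(scale=0.5, size=unit.size)
df = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Within transformation: subtract each unit's mean from x and y
demeaned = df[["x", "y"]] - df.groupby("unit")[["x", "y"]].transform("mean")
beta_fe = (demeaned["x"] * demeaned["y"]).sum() / (demeaned["x"] ** 2).sum()

# Pooled OLS slope for comparison; it absorbs part of the unit effect
xc, yc = df["x"] - df["x"].mean(), df["y"] - df["y"].mean()
beta_pooled = (xc * yc).sum() / (xc ** 2).sum()

print(f"within estimate: {beta_fe:.3f}, pooled estimate: {beta_pooled:.3f}")
```

The dedicated packages add what this sketch leaves out, such as degrees-of-freedom corrections, clustered standard errors, and multiple sets of fixed effects.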
- PyFixest
- Description: Allows for fast estimation of linear models with multiple fixed effects, inspired by the R package fixest.
- Capabilities:
- High-dimensional fixed effects models
- Clustered and robust standard errors
- Support for instrumental variables and interaction terms
- Installation:
pip install pyfixest
- Stata Equivalent: reghdfe, areg
- Link: https://github.com/py-econometrics/pyfixest
- rdrobust
- Description: Implements local polynomial RD point estimators with robust bias-corrected confidence intervals and inference procedures.
- Capabilities:
- RD estimation and inference
- Automatic bandwidth selection
- Installation:
pip install rdrobust
- Stata Equivalent: rdrobust
- Link: https://pypi.org/project/rdrobust/
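The idea behind a local polynomial RD point estimate can be sketched with a local linear regression and a triangular kernel. This is not the rdrobust API: the bandwidth below is fixed by hand, and the sketch omits the bias correction and robust inference that rdrobust provides. The simulated data are an assumption.

```python
import numpy as np

# Simulated sharp RD (invented): cutoff at 0, true jump of 2.0
rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(-1, 1, size=n)          # running variable
d = (x >= 0).astype(float)              # treatment indicator
y = 0.5 * x + 2.0 * d + rng.normal(scale=0.3, size=n)

h = 0.3                                  # hand-picked bandwidth (assumption)
w = np.clip(1 - np.abs(x) / h, 0, None)  # triangular kernel weights
keep = w > 0

# Weighted least squares of y on [1, d, x, d*x] inside the window;
# the coefficient on d is the estimated jump at the cutoff
X = np.column_stack([np.ones(keep.sum()), d[keep], x[keep], d[keep] * x[keep]])
sw = np.sqrt(w[keep])
coef, *_ = np.linalg.lstsq(X * sw[:, None], y[keep] * sw, rcond=None)
rd_estimate = coef[1]
print(f"estimated jump at the cutoff: {rd_estimate:.3f}")
```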
- rdlocrand
- Description: Provides tools for local randomization methods in RD designs.
- Capabilities:
- Inference in RD designs using local randomization
- Installation:
pip install rdlocrand
- Stata Equivalent: rdlocrand
- Link: https://pypi.org/project/rdlocrand/
- rddensity
- Description: Provides manipulation testing based on density discontinuity.
- Capabilities:
- Density discontinuity tests at cutoff
- Installation:
pip install rddensity
- Stata Equivalent: rddensity
- Link: https://pypi.org/project/rddensity/
- rdmulti
- Description: Analysis of RD designs with multiple cutoffs or scores.
- Capabilities:
- RD estimation and inference with multiple cutoffs or multiple scores
- Installation:
pip install rdmulti
- Stata Equivalent: rdmulti
- Link: https://pypi.org/project/rdmulti/
- rdpower
- Description: Power calculations for RD designs.
- Capabilities:
- Computes power and sample size for RD designs
- Installation:
pip install rdpower
- Stata Equivalent: rdpower
- Link: https://pypi.org/project/rdpower/
- lpdensity
- Description: Implements local polynomial density estimation with robust bias-corrected confidence intervals.
- Capabilities:
- Kernel density estimation
- Local polynomial estimation
- Installation:
pip install lpdensity
- Stata Equivalent: lpdensity
- Link: https://pypi.org/project/lpdensity/
- CSDID
- Description: Implements the Callaway and Sant'Anna (2020) Difference-in-Differences estimator for staggered adoption designs with treatment effect heterogeneity.
- Capabilities:
- Estimation of group-time average treatment effects
- Handles multiple time periods and variation in treatment timing
- Allows for treatment effect heterogeneity
- Installation:
git clone https://github.com/d2cml-ai/csdid.git
cd csdid
pip install .
- Stata Equivalent: csdid (user-contributed command)
- Link: https://github.com/d2cml-ai/csdid
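As a rough illustration of a group-time average treatment effect, the sketch below computes an unconditional ATT(g, t) by hand with pandas, using never-treated units as the comparison group and period g-1 as the base period. It is not the csdid API and skips covariates, pre-treatment periods, and inference. The toy panel and the helper `att_gt` are hypothetical.

```python
import pandas as pd

# Hypothetical balanced panel: `g` is the first treated period (0 = never treated).
# Outcomes are unit effect + common trend + a treatment effect of 2.0, with no noise.
rows = []
for unit, g in [(1, 3), (2, 3), (3, 0), (4, 0)]:
    for t in range(1, 5):
        treated = g != 0 and t >= g
        y = 10.0 * unit + 1.0 * t + (2.0 if treated else 0.0)
        rows.append({"unit": unit, "t": t, "g": g, "y": y})
panel = pd.DataFrame(rows)

def att_gt(df, g, t):
    """Unconditional ATT(g, t) for a post-treatment period t >= g."""
    base = g - 1                                   # last pre-treatment period
    wide = df.pivot(index="unit", columns="t", values="y")
    change = wide[t] - wide[base]                  # outcome change per unit
    cohort = df.groupby("unit")["g"].first()
    # Change for cohort g minus change for never-treated units
    return change[cohort == g].mean() - change[cohort == 0].mean()

att = {t: att_gt(panel, g=3, t=t) for t in (3, 4)}
print(att)
```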
- synthdid
- Description: Implements synthetic difference-in-differences estimation with inference and graphing procedures.
- Capabilities:
- Synthetic DiD estimation
- Multiple inference methods (placebo, bootstrap, jackknife)
- Plotting tools for outcomes and weights
- Support for covariates
- Handles staggered adoption over multiple treatment periods
- Installation:
pip install synthdid
- Stata Equivalent: sdid
- Link: https://pypi.org/project/synthdid/
- SyntheticControlMethods
- Description: A Python package for causal inference using various Synthetic Control Methods.
- Capabilities:
- Synthetic Control estimation
- Placebo tests
- Support for panel data
- Installation:
pip install SyntheticControlMethods
- Stata Equivalent: synth
- Link: https://pypi.org/project/SyntheticControlMethods/
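The core of synthetic control estimation is choosing non-negative donor weights that sum to one and reproduce the treated unit's pre-treatment path. The sketch below solves that problem with SciPy on simulated data. It is not the SyntheticControlMethods API, and it matches on outcomes only, with no covariates.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated pre-treatment data (invented): 20 periods, 4 donor units
rng = np.random.default_rng(3)
T0, J = 20, 4
donors = rng.normal(size=(T0, J)).cumsum(axis=0)   # donor outcome paths
true_w = np.array([0.6, 0.4, 0.0, 0.0])
treated = donors @ true_w                           # treated unit's path

def loss(w):
    """Squared pre-treatment gap between the treated and synthetic unit."""
    return np.sum((treated - donors @ w) ** 2)

result = minimize(
    loss,
    x0=np.full(J, 1.0 / J),                         # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * J,                        # weights are non-negative
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    options={"ftol": 1e-12},
)
weights = result.x
print(np.round(weights, 3))
```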
- MarginalEffects
- Description: Provides methods for computing and interpreting marginal effects in statistical models.
- Capabilities:
- Calculates marginal effects for various models
- Supports models from scikit-learn, statsmodels, and others
- Installation:
pip install marginaleffects
- Link: https://pypi.org/project/marginaleffects/
- EconML
- Description: Developed by Microsoft, EconML provides methods for estimating causal effects with machine learning techniques.
- Capabilities:
- Double Machine Learning (DML)
- Treatment Effect Estimation: Heterogeneous effects, policy evaluation
- Support for Machine Learning Models: Integration with scikit-learn, LightGBM, and more
- Installation:
pip install econml
- Stata Equivalent: teffects, ddml
- Link: https://econml.azurewebsites.net/
- DoubleML
- Description: Implements the Double Machine Learning framework for causal inference in high-dimensional settings.
- Capabilities:
- Treatment effect estimation using DML
- Support for various machine learning algorithms
- Installation:
pip install doubleml
- Stata Equivalent: ddml
- Link: https://docs.doubleml.org/stable/index.html
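The partially linear Double Machine Learning estimator can be sketched with scikit-learn alone: predict the outcome and the treatment from the controls with cross-fitting, then regress residuals on residuals. This is not the DoubleML or EconML API. The simulated data and the use of `LinearRegression` as the nuisance learner are assumptions, and any scikit-learn regressor could be swapped in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Simulated data (invented): the true effect of d on y is 1.0
rng = np.random.default_rng(4)
n, p = 2000, 5
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)                     # treatment depends on controls
y = 1.0 * d + X[:, 0] - X[:, 1] + rng.normal(size=n)

# Cross-fitted nuisance predictions: each observation is predicted by a
# model trained on the other folds
y_hat = cross_val_predict(LinearRegression(), X, y, cv=5)
d_hat = cross_val_predict(LinearRegression(), X, d, cv=5)

# Residual-on-residual regression gives the treatment effect estimate
y_res, d_res = y - y_hat, d - d_hat
theta = (d_res @ y_res) / (d_res @ d_res)
print(f"estimated effect: {theta:.3f}")
```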
- PySensemakr
- Description: Sensitivity analysis toolkit for regression models.
- Capabilities:
- Quantify robustness of regression coefficients to unobserved confounding
- Implements methods similar to the sensemakr R package
- Installation:
pip install PySensemakr
- Link: https://github.com/Carloscinelli/PySensemakr
- scikit-learn
- Description: A comprehensive library for machine learning algorithms.
- Capabilities:
- Supervised Learning: Regression, classification
- Unsupervised Learning: Clustering, dimensionality reduction
- Model Selection and Evaluation: Cross-validation, grid search
- Installation:
pip install scikit-learn
- Stata Equivalent: Machine learning methods for predictive modeling
- Link: https://scikit-learn.org/
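A short sketch of cross-validated model selection with scikit-learn, tuning the penalty of a ridge regression by grid search. The simulated data and the grid of penalty values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Simulated data (invented): only the first two features matter
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# 5-fold cross-validation over a small grid of ridge penalties
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
best_r2 = search.best_score_
print(best_alpha, round(best_r2, 3))
```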
- XGBoost
- Description: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
- Capabilities:
- High-performance gradient boosting algorithms
- Support for regression, classification, and ranking problems
- Installation:
pip install xgboost
- Stata Equivalent: Advanced machine learning methods
- Link: https://xgboost.readthedocs.io/
- LightGBM
- Description: A fast, distributed, high-performance gradient boosting framework.
- Capabilities:
- Efficient gradient boosting algorithms
- Support for large-scale data
- Installation:
pip install lightgbm
- Link: https://github.com/microsoft/LightGBM
- Statsmodels Time Series
- Description: Provides extensive time series analysis capabilities.
- Capabilities:
- ARIMA Models: Autoregressive Integrated Moving Average
- SARIMAX Models: Seasonal components and exogenous variables
- Vector Autoregression (VAR): Multivariate time series
- State Space Models: Flexible modeling of time series
- Installation: Part of statsmodels
- Stata Equivalent: arima, var, dfuller, kpss
- Link: https://www.statsmodels.org/stable/tsa.html
- ARCH
- Description: Tools for analyzing financial time series, including volatility modeling.
- Capabilities:
- ARCH and GARCH models
- Volatility forecasting
- Installation:
pip install arch
- Link: https://arch.readthedocs.io/en/latest/
- Ruptures
- Description: A Python library for offline change point detection.
- Capabilities:
- Multiple change point detection methods
- Handling univariate and multivariate signals
- Installation:
pip install ruptures
- Link: https://centre-borelli.github.io/ruptures-docs/
- xarray
- Description: N-D labeled arrays and datasets in Python.
- Capabilities:
- Work with multi-dimensional arrays (similar to netCDF data)
- Convenient data structures for time series data
- Installation:
pip install xarray
- Link: https://xarray.pydata.org/en/stable/
- StatsForecast
- Description: A collection of statistical models for time series forecasting.
- Capabilities:
- Efficient implementation of forecasting models
- Support for large-scale time series data
- Installation:
pip install statsforecast
- Link: https://github.com/Nixtla/statsforecast
- NeuralForecast
- Description: Deep learning models for time series forecasting.
- Capabilities:
- State-of-the-art neural network architectures
- Handling of complex seasonality and trends
- Installation:
pip install neuralforecast
- Link: https://github.com/Nixtla/neuralforecast
- PyMC
- Description: Probabilistic programming library for Bayesian modeling and inference.
- Capabilities:
- Bayesian statistical models
- Markov Chain Monte Carlo (MCMC)
- Variational inference
- Installation:
pip install pymc
- Link: https://docs.pymc.io/
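To show what an MCMC sampler does under the hood, here is a hand-written random-walk Metropolis sampler in NumPy for the mean of normally distributed data. It is not the PyMC API, which uses more efficient samplers. The model (known unit variance, Normal(0, 10^2) prior) and the proposal scale are assumptions, chosen so the result can be checked against the closed-form posterior mean.

```python
import numpy as np

# Simulated observations with known standard deviation 1 (invented)
rng = np.random.default_rng(7)
data = rng.normal(loc=1.0, scale=1.0, size=50)

def log_posterior(mu):
    """Log posterior up to a constant: Normal(0, 10^2) prior, Normal(mu, 1) likelihood."""
    log_prior = -0.5 * mu ** 2 / 10.0 ** 2
    log_lik = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_lik

draws = np.empty(5000)
mu = 0.0
current_lp = log_posterior(mu)
for i in range(draws.size):
    proposal = mu + rng.normal(scale=0.3)            # random-walk proposal
    proposal_lp = log_posterior(proposal)
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        mu, current_lp = proposal, proposal_lp
    draws[i] = mu

posterior_mean = draws[1000:].mean()                  # drop burn-in draws
# Closed-form posterior mean for this conjugate model, for comparison
exact_mean = data.sum() / (data.size + 1.0 / 10.0 ** 2)
print(f"MCMC: {posterior_mean:.3f}, exact: {exact_mean:.3f}")
```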
- PyStan
- Description: Python interface to the Stan language for statistical modeling and high-performance statistical computation.
- Capabilities:
- Bayesian inference
- Customizable statistical models
- Installation:
pip install pystan
- Link: https://pystan.readthedocs.io/en/latest/
- Bambi
- Description: High-level Bayesian model-building interface in Python.
- Capabilities:
- Simplifies specification of Bayesian models using formulas
- Built on top of PyMC
- Installation:
pip install bambi
- Link: https://bambinos.org/
- Polars
- Description: Modern, high-performance DataFrame library optimized for performance and memory efficiency.
- Capabilities:
- Fast parallel execution of data operations
- Memory-efficient processing
- Syntax familiar to pandas and R's tidyverse users
- Strong integration with Apache Arrow
- Installation:
pip install polars
- Link: https://pola.rs/
- Datatable
- Description: High-performance library for processing large datasets (up to 100GB) on a single machine.
- Capabilities:
- Superior performance in sorting and grouping operations
- Efficient memory usage
- Seamless interoperability with pandas/NumPy
- Optimized for single-node processing
- Installation:
pip install datatable
- Link: https://github.com/h2oai/datatable
- Vaex
- Description: Out-of-core DataFrame library for large datasets with lazy evaluation.
- Capabilities:
- Memory-efficient handling of large datasets
- Lazy evaluation for optimized performance
- Built-in visualization capabilities
- Good for datasets that don't fit in memory
- Installation:
pip install vaex
- Link: https://vaex.io/
- DuckDB
- Description: SQL database engine with DataFrame-like functionality and exceptional performance for analytical queries.
- Capabilities:
- Top-tier performance for large-scale data operations
- SQL interface for data manipulation
- Efficient handling of large datasets (50GB+)
- Strong integration with pandas and Arrow
- Installation:
pip install duckdb
- Link: https://duckdb.org/
- Recordlinkage
- Description: Python toolkit for linking and deduplicating records.
- Capabilities:
- Preprocessing and data cleaning
- Index/blocking methods to reduce comparisons
- Various comparison methods
- Classification of record pairs
- Evaluation metrics
- Installation:
pip install recordlinkage
- Stata Equivalent: merge, reclink
- Link: https://recordlinkage.readthedocs.io/en/latest/
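The usual record linkage workflow (block, compare, classify) can be sketched with pandas and the standard library's `difflib`. This is not the recordlinkage API. The records, the choice of ZIP code as blocking key, and the 0.85 threshold are all assumptions.

```python
import pandas as pd
from difflib import SequenceMatcher

# Two hypothetical files to link (names and ZIP codes invented)
left = pd.DataFrame({
    "id_l": [1, 2, 3],
    "name": ["Jon Smith", "Maria Garcia", "Wei Chen"],
    "zip": ["02139", "10027", "94305"],
})
right = pd.DataFrame({
    "id_r": [10, 11, 12],
    "name": ["John Smith", "Maria Garcia", "Wei Chan"],
    "zip": ["02139", "10027", "60637"],
})

# Blocking: only compare records that share a ZIP code
pairs = left.merge(right, on="zip", suffixes=("_l", "_r"))

# Comparison: string similarity of the two names, between 0 and 1
pairs["name_score"] = [
    SequenceMatcher(None, a, b).ratio()
    for a, b in zip(pairs["name_l"], pairs["name_r"])
]

# Classification: a simple threshold rule (assumed cutoff)
matches = pairs[pairs["name_score"] >= 0.85]
print(matches[["id_l", "id_r", "name_l", "name_r", "name_score"]])
```

Blocking keeps the number of comparisons manageable, at the cost of missing true matches whose blocking key differs, as with the third record here.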
- Dedupe
- Description: Machine learning powered deduplication and entity resolution.
- Capabilities:
- Active learning approach to training
- Scalable blocking methods
- Automated matching decisions
- Installation:
pip install dedupe
- Link: https://github.com/dedupeio/dedupe
- Python-Levenshtein
- Description: Fast implementation of Levenshtein distance and string similarity metrics.
- Capabilities:
- Compute edit distances for fuzzy matching
- Installation:
pip install python-Levenshtein
- Link: https://pypi.org/project/python-Levenshtein/
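For reference, the Levenshtein distance itself is a short dynamic program. The pure-Python sketch below is for illustration only; the compiled libraries listed here are much faster. The helper names `levenshtein` and `similarity` are hypothetical and not taken from any package.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution, free on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("Acme Corp", "Acme Corp."))
```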
- Jellyfish
- Description: Library for approximate and phonetic matching of strings.
- Capabilities:
- Soundex, Metaphone, and other phonetic algorithms
- Damerau-Levenshtein distance
- Installation:
pip install jellyfish
- Link: https://pypi.org/project/jellyfish/
- PyStemmer
- Description: Snowball stemming algorithms for various languages.
- Capabilities:
- Stemming words to their root forms for better matching
- Installation:
pip install PyStemmer
- Link: https://pypi.org/project/PyStemmer/
- NameParser
- Description: Parser for human names.
- Capabilities:
- Splits names into components (first name, last name, etc.)
- Useful for matching records based on names
- Installation:
pip install nameparser
- Link: https://pypi.org/project/nameparser/
- Company-Matching
- Description: Toolkit for matching company names.
- Capabilities:
- Standardizes company names for accurate matching
- Handles common abbreviations and variations
- Installation:
pip install company-matching
- Link: https://github.com/IntelligentSoftwareSystems/Company-Matching
- py_stringmatching
- Description: Comprehensive toolkit for string matching.
- Capabilities:
- Multiple string similarity measures
- Phonetic encoding
- Token-based similarities
- Installation:
pip install py_stringmatching
- Link: https://github.com/J535D165/py_stringmatching
- pyjarowinkler
- Description: Implementation of Jaro-Winkler distance.
- Capabilities:
- Jaro similarity
- Jaro-Winkler similarity
- Installation:
pip install pyjarowinkler
- Link: https://pypi.org/project/pyjarowinkler/
- RapidFuzz
- Description: Fast string matching library.
- Capabilities:
- Quick fuzzy string matching
- Multiple distance metrics
- Optimized for performance
- Installation:
pip install rapidfuzz
- Link: https://github.com/rapidfuzz/RapidFuzz
- FuzzyWuzzy
- Description: Fuzzy string matching in Python.
- Capabilities:
- String similarity matching
- Partial and token-based ratios
- Installation:
pip install fuzzywuzzy
- Link: https://pypi.org/project/fuzzywuzzy/
- Matplotlib
- Description: The foundational plotting library in Python.
- Capabilities:
- Line plots, scatter plots, histograms, bar charts
- Highly customizable visualizations
- Support for LaTeX formatting in labels
- Installation:
pip install matplotlib
- Stata Equivalent: Basic plotting functions
- Link: https://matplotlib.org/
- Seaborn
- Description: A statistical data visualization library built on top of Matplotlib.
- Capabilities:
- Enhanced statistical graphics
- Regression plots, distribution plots, heatmaps
- Integration with pandas data structures
- Installation:
pip install seaborn
- Stata Equivalent: Enhanced plotting functions
- Link: https://seaborn.pydata.org/
- Plotnine
- Description: A grammar of graphics for Python, based on ggplot2 in R.
- Capabilities:
- Declarative syntax for creating complex plots
- Supports layering, scaling, and theming
- Ideal for creating publication-quality visualizations
- Installation:
pip install plotnine
- Link: https://plotnine.readthedocs.io/
- Binsreg
- Description: Provides binned regression methods for RD designs and data visualization.
- Capabilities:
- Binned scatter plots
- Regression discontinuity analysis
- Data-driven bin selection
- Installation:
pip install binsreg
- Stata Equivalent: binsreg, binscatter
- Link: https://pypi.org/project/binsreg/
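The data behind a basic binned scatter plot can be computed with pandas: cut the x variable into quantile bins and average x and y within each bin. This is not the binsreg API. The number of bins is fixed by hand here, whereas binsreg selects it from the data, and the simulated data are an assumption.

```python
import numpy as np
import pandas as pd

# Simulated data (invented): y rises linearly in x
rng = np.random.default_rng(8)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=1000)})
df["y"] = 3.0 + 0.5 * df["x"] + rng.normal(size=1000)

n_bins = 10                                   # hand-picked (assumption)
df["bin"] = pd.qcut(df["x"], q=n_bins, labels=False)

# One point per bin: the mean of x and the mean of y within the bin
binned = df.groupby("bin")[["x", "y"]].mean()
print(binned)
```

Plotting `binned["x"]` against `binned["y"]` with any of the plotting libraries in this section gives the binned scatter.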
- Plotly
- Description: An interactive, open-source plotting library.
- Capabilities:
- Interactive plots
- Support for web-based applications
- Wide range of chart types
- Installation:
pip install plotly
- Link: https://plotly.com/python/
- Altair
- Description: Declarative statistical visualization library for Python.
- Capabilities:
- Grammar of graphics approach
- Interactive visualizations
- Installation:
pip install altair
- Link: https://altair-viz.github.io/
- Bokeh
- Description: Interactive visualization library for modern web browsers.
- Capabilities:
- Interactive plots and dashboards
- Real-time streaming and data updates
- Installation:
pip install bokeh
- Link: https://bokeh.org/
- Stargazer
- Description: A Python package that emulates the R package stargazer, generating LaTeX code for regression tables.
- Capabilities:
- Formats regression results into LaTeX tables
- Supports models from statsmodels and linearmodels
- Installation:
pip install stargazer
- Link: https://pypi.org/project/stargazer/
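To give a sense of what these table packages automate, the sketch below builds a bare-bones LaTeX regression table by hand from two statsmodels fits. It is not the Stargazer API. The helper `latex_table` is hypothetical, the data are simulated, and variable names containing LaTeX special characters such as underscores would need escaping.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (invented for illustration)
rng = np.random.default_rng(9)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(size=200)

models = [
    smf.ols("y ~ x1", data=df).fit(),
    smf.ols("y ~ x1 + x2", data=df).fit(),
]

def latex_table(results):
    """One column per model: coefficients with standard errors underneath."""
    # Union of coefficient names, in order of first appearance
    names = list(dict.fromkeys(name for r in results for name in r.params.index))
    header = " & ".join(f"({i})" for i in range(1, len(results) + 1))
    lines = [
        r"\begin{tabular}{l" + "c" * len(results) + "}",
        r"\hline",
        " & " + header + r" \\",
        r"\hline",
    ]
    for name in names:
        coefs = [f"{r.params[name]:.3f}" if name in r.params.index else "" for r in results]
        ses = [f"({r.bse[name]:.3f})" if name in r.bse.index else "" for r in results]
        lines.append(name + " & " + " & ".join(coefs) + r" \\")
        lines.append(" & " + " & ".join(ses) + r" \\")
    nobs = " & ".join(str(int(r.nobs)) for r in results)
    lines += [r"\hline", "N & " + nobs + r" \\", r"\hline", r"\end{tabular}"]
    return "\n".join(lines)

table = latex_table(models)
print(table)
```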
- PyTableWriter
- Description: A library to write tabular data in various formats.
- Capabilities:
- Export data to formats like LaTeX, Markdown, Excel, CSV
- Supports styling and formatting options
- Installation:
pip install pytablewriter
- Link: https://pypi.org/project/pytablewriter/
- pystout
- Description: A package to create publication-quality LaTeX tables from Python regression output.
- Capabilities:
- Generates LaTeX tables from regression models
- Supports models from statsmodels and linearmodels
- Customizable table appearance and statistics
- Installation:
pip install pystout
- Link: https://pypi.org/project/pystout/
- tableone
- Description: Produces summary statistics for research papers.
- Capabilities:
- Generates descriptive statistics tables
- Supports grouping variables and statistical tests
- Exports tables to LaTeX and other formats
- Installation:
pip install tableone
- Link: https://pypi.org/project/tableone/
- GreatTables
- Description: A package for creating beautiful and complex tables in Python.
- Capabilities:
- Compose tables with headers, footers, stubs, and spanners
- Format cell values in various ways
- Integrates with pandas DataFrames
- Installation:
pip install great_tables
- Link: https://pypi.org/project/great-tables/
- tabulate
- Description: Formats tabular data in plain-text tables and can output in formats like LaTeX.
- Capabilities:
- Convert arrays or DataFrames into formatted tables
- Multiple output formats: plain text, GitHub-flavored Markdown, LaTeX, HTML, and more
- Installation:
pip install tabulate
- Link: https://pypi.org/project/tabulate/
- GeoPandas
- Description: Extends pandas to allow spatial operations on geometric types.
- Capabilities:
- Reading and writing spatial data
- Spatial joins and operations
- Handling geospatial data formats like Shapefiles and GeoJSON
- Installation:
pip install geopandas
- Stata Equivalent: Limited geospatial capabilities
- Link: https://geopandas.org/
- Geoplot
- Description: A high-level geospatial plotting library.
- Capabilities:
- Geospatial visualizations
- Choropleth maps, cartograms, kernel density plots
- Installation:
pip install geoplot
- Stata Equivalent: Basic mapping (with limited functionality)
- Link: https://github.com/ResidentMario/geoplot
- Geopy
- Description: A Python client for several popular geocoding web services.
- Capabilities:
- Geocoding addresses (converting addresses to coordinates)
- Reverse geocoding
- Calculating distances between points
- Installation:
pip install geopy
- Stata Equivalent: Not directly available
- Link: https://geopy.readthedocs.io/
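Distance calculations between coordinates often come down to the haversine (great-circle) formula, sketched below in plain Python. Geopy offers both great-circle and ellipsoidal (geodesic) distances; this sketch covers only the former and does not use geopy. The function name `haversine_km` is hypothetical, and the coordinates are approximate city centres.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates for Paris and London
paris_london = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
print(f"{paris_london:.1f} km")
```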
- Geocoder
- Description: Geocoding library supporting multiple services.
- Capabilities:
- Address standardization
- Geographic entity matching
- Multiple provider support
- Installation:
pip install geocoder
- Link: https://geocoder.readthedocs.io/
- libpysal
- Description: Core components of PySAL (Python Spatial Analysis Library).
- Capabilities:
- Spatial weights matrices
- Spatial graph analysis
- Computational geometry
- Installation:
pip install libpysal
- Stata Equivalent: spreg, spatial econometrics tools
- Link: https://pysal.org/libpysal/
- NLTK
- Description: Natural Language Toolkit, a leading platform for building Python programs to work with human language data.
- Capabilities:
- Tokenization, stemming, tagging, parsing
- Corpora and lexical resources
- Installation:
pip install nltk
- Link: https://www.nltk.org/install.html
- LangDetect
- Description: Port of Google's language-detection library.
- Capabilities:
- Detects language of a text
- Installation:
pip install langdetect
- Link: https://pypi.org/project/langdetect/
- LayoutParser
- Description: A unified toolkit for Deep Learning-based Document Image Analysis.
- Capabilities:
- Deep Learning Models: Perform layout detection in a few lines of code
- Layout Data Structures: Optimized APIs for document image analysis tasks
- OCR Integration: Perform OCR for each detected layout region
- Visualization Tools: Flexible APIs for visualizing the detected layouts
- Data Loading: Load layout data stored in JSON, CSV, and even PDFs
- Installation:
pip install layoutparser
# For deep learning layout models
pip install "layoutparser[layoutmodels]"
# For OCR toolkit
pip install "layoutparser[ocr]"
- Link: https://github.com/Layout-Parser/layout-parser
- PyTesseract
- Description: Python wrapper for Google's Tesseract-OCR Engine.
- Capabilities:
- Optical Character Recognition (OCR)
- Extract text from images and PDFs
- Installation:
pip install pytesseract
- Link: https://pypi.org/project/pytesseract/
- Tabula-py
- Description: Simple wrapper of tabula-java, which can read tables in PDF and convert them into pandas DataFrames.
- Capabilities:
- Extract tables from PDFs
- Installation:
pip install tabula-py
- Link: https://pypi.org/project/tabula-py/
- Python-PDFBox
- Description: Python interface to Apache PDFBox.
- Capabilities:
- PDF manipulation (extract text, merge, split)
- Installation:
pip install python-pdfbox
- Link: https://pypi.org/project/python-pdfbox/
- PDFMiner
- Description: Tool for extracting information from PDF documents.
- Capabilities:
- Text extraction
- Layout analysis
- Installation:
pip install pdfminer.six
- Link: https://pypi.org/project/pdfminer/
- BeautifulSoup
- Description: Library for pulling data out of HTML and XML files.
- Capabilities:
- Parse and navigate HTML/XML documents
- Installation:
pip install beautifulsoup4
- Link: https://pypi.org/project/beautifulsoup4/
- Requests
- Description: HTTP library for Python.
- Capabilities:
- Send HTTP requests
- Handle HTTP sessions and cookies
- Installation:
pip install requests
- Link: https://pypi.org/project/requests/
- Requests-HTML
- Description: HTML Parsing for Humans.
- Capabilities:
- Parse HTML with JavaScript support
- Simplify web scraping tasks
- Installation:
pip install requests-html
- Link: https://github.com/psf/requests-html
- StackPrinter
- Description: Debugging tool for printing informative tracebacks.
- Installation:
pip install stackprinter
- Link: https://github.com/cknd/stackprinter
- Pdb++
- Description: Drop-in replacement for pdb (Python debugger), with additional features.
- Installation:
pip install pdbpp
- Link: https://github.com/pdbpp/pdbpp
- tqdm
- Description: Fast, extensible progress bar for Python.
- Installation:
pip install tqdm
- Link: https://tqdm.github.io/
- RPy2
- Description: Interface to call R functions and use R packages directly from Python.
- Use Case: When specific R packages have no Python equivalent, especially for advanced econometric methods not yet available in Python.
- Example R Packages Accessible via RPy2:
- did: Implements the Callaway and Sant'Anna (2020) DiD estimator.
- bacondecomp: For the Goodman-Bacon decomposition in DiD settings.
- fixest: Used for estimation with multiple fixed effects.
- Installation:
pip install rpy2
- Link: https://rpy2.github.io/
You can install most of these packages using pip:
pip install numpy pandas scipy statsmodels pingouin pymc pystan bambi \
  linearmodels pyfixest econml doubleml marginaleffects pysensemakr \
  scikit-learn xgboost lightgbm matplotlib seaborn plotnine rpy2 \
  rdrobust rdlocrand rddensity rdmulti rdpower lpdensity synthdid SyntheticControlMethods \
  arch ruptures xarray statsforecast neuralforecast \
  recordlinkage dedupe py_stringmatching pyjarowinkler rapidfuzz fuzzywuzzy nameparser company-matching \
  python-Levenshtein jellyfish PyStemmer nltk langdetect \
  beautifulsoup4 requests requests-html pytesseract tabula-py python-pdfbox pdfminer.six \
  plotly altair bokeh prettytable tabulate stackprinter pdbpp tqdm \
  geopandas geoplot geopy geocoder libpysal binsreg prophet layoutparser \
  stargazer pytablewriter xtable pystout tableone great_tables
This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.