baristiran/statika

statika

Run statistical analyses in seconds — no license fees, no GUI required.
OLS, logit, survival analysis, panel data, and 260+ more commands, right from your terminal.

Note: statika is an independent, community-driven open-source project. It is not affiliated with, endorsed by, or connected to StataCorp LLC or any commercial statistical software vendor.


Installation

pip install statika
statika repl

That is all. No virtual environment required, no license server, no installer wizard.

Optional extras

pip install "statika[excel]"    # Excel (.xlsx) import/export
pip install "statika[stata]"    # Stata .dta import/export
pip install "statika[survival]" # Survival analysis (lifelines)
pip install "statika[all]"      # Everything above

30-Second Demo

$ statika repl
statika v1.0.0 — Open-source statistical analysis tool
Type help for commands, quit to exit.

statika> load examples/data.csv
Loaded 50 rows x 7 columns from examples/data.csv

statika> summarize age income score
┌──────────┬────┬─────────┬─────────┬───────┬─────────┬─────────┬─────────┬─────────┐
│ Variable │ N  │ Mean    │ SD      │ Min   │ P25     │ P50     │ P75     │ Max     │
├──────────┼────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┤
│ age      │ 50 │ 34.6600 │  8.7634 │ 21.00 │ 27.2500 │ 34.0000 │ 42.5000 │ 53.0000 │
│ income   │ 50 │ 49840.0 │ 17547.2 │ 26000 │ 34000.0 │ 47000.0 │ 66000.0 │ 88000.0 │
│ score    │ 50 │  7.4280 │  1.2844 │  4.90 │  6.4750 │  7.5000 │  8.5500 │  9.4000 │
└──────────┴────┴─────────┴─────────┴───────┴─────────┴─────────┴─────────┴─────────┘

statika> ols score ~ age + income --robust
┌──────────┬────────┬─────────┬───────┬────────┬────────────┬─────────────┐
│ Variable │ Coef   │ Std.Err │ t/z   │ P>|t|  │ [95% CI L] │ [95% CI H]  │
├──────────┼────────┼─────────┼───────┼────────┼────────────┼─────────────┤
│ _cons    │ 2.1435 │ 0.4521  │ 4.741 │ 0.0000 │ 1.2343     │ 3.0527      │
│ age      │ 0.0312 │ 0.0187  │ 1.668 │ 0.1018 │ -0.0066    │ 0.0690      │
│ income   │ 0.0001 │ 0.0000  │ 5.234 │ 0.0000 │ 0.0000     │ 0.0001      │
└──────────┴────────┴─────────┴───────┴────────┴────────────┴─────────────┘
N = 50  |  R² = 0.5481  |  Adj.R² = 0.5289  |  F(2, 47) = 28.52 (p=0.0000)

statika> margins
Average marginal effects computed.

statika> estimates table
Model comparison table generated.

statika> quit
Bye!

Run a script instead

statika run analysis.ost           # Run an .ost script
statika run analysis.ost --strict  # Stop on first error (useful in CI)

Why statika?

| Feature             | Stata   | R       | SPSS   | statika |
|---------------------|---------|---------|--------|---------|
| Price               | $595/yr | Free    | $99/mo | Free    |
| Familiar CLI syntax | Yes     | No      | No     | Yes     |
| Scripting           | Yes     | Yes     | No     | Yes     |
| Python ecosystem    | No      | No      | No     | Yes     |
| No eval / safe DSL  |         |         |        | Yes     |
| Interactive REPL    | No      | Partial | No     | Yes     |
| Polars backend      | No      | No      | No     | Yes     |

statika is designed for researchers and data scientists who want the muscle memory of a CLI statistics workflow without the license fees, and who want scripted, reproducible analyses that live in version-controlled projects.


Stable vs Experimental

statika distinguishes between a stable core and experimental modules:

  • Stable core: data loading, transformation, descriptive statistics, core regression models, hypothesis tests, plotting, reporting, scripting.
  • Experimental: panel data, survival analysis, survey-weighted estimation, SEM, network analysis, spatial statistics, and advanced ML commands.

Help and tab completion default to stable commands. To inspect the experimental surface:

statika> help --list --experimental

Quick Examples

1. Basic data exploration

statika> load survey.csv
statika> describe
statika> summarize age income education
statika> tabulate region
statika> crosstab gender employed
statika> corr age income score

2. OLS regression with post-estimation

statika> load data.csv
statika> ols income ~ age + education + experience --robust
statika> predict yhat
statika> residuals resid
statika> vif
statika> estat all
statika> latex results/model.tex

3. Logit with marginal effects and model comparison

statika> logit employed ~ age + income + education
statika> margins
statika> margins --at=means
statika> ols employed ~ age + income + education
statika> estimates table

4. Grouped analysis and hypothesis tests

statika> groupby region summarize mean(income) sd(income) count()
statika> ttest income by employed
statika> anova score by region
statika> chi2 region employed

5. Scripted reproducible analysis (.ost file)

Create analysis.ost:

# analysis.ost — reproducible wage regression
load data/wages.csv
describe
summarize wage age education experience

derive log_wage = log(wage)
encode region as region_code

ols log_wage ~ age + education + experience --robust
predict yhat
residuals resid
vif
estat all
bootstrap n=1000 ci=95

latex outputs/wage_table.tex
report outputs/wage_report.md
save outputs/wages_modeled.parquet

Run it:

statika run analysis.ost --strict

Command Reference

Data Management (8 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `load <path>` | Load CSV, Parquet, Stata (.dta), Excel (.xlsx) | `load survey.csv` |
| `save <path>` | Save data to any supported format | `save results.parquet` |
| `describe` | Show dataset structure (types, nulls) | `describe` |
| `head [N]` | Show first N rows (default: 10) | `head 20` |
| `tail [N]` | Show last N rows | `tail 5` |
| `count` | Row and column count | `count` |
| `merge <path> on <key> [how=...]` | Join with another file | `merge scores.csv on id how=left` |
| `undo` | Undo last data change (multi-level) | `undo` |

Data Transformation (18 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `filter <expr>` | Filter rows with expressions | `filter age > 30 and income < 50000` |
| `select <cols>` | Keep specific columns | `select age income score` |
| `derive <col> = <expr>` | Create new variables | `derive bmi = weight / (height ** 2)` |
| `dropna [cols]` | Drop missing values | `dropna age income` |
| `fillna <col> <strategy>` | Fill missing values | `fillna income median` |
| `sort <col> [--desc]` | Sort dataset | `sort income --desc` |
| `rename <old> <new>` | Rename a column | `rename income salary` |
| `cast <col> <type>` | Cast column type | `cast age float` |
| `encode <col> [as <new>]` | Label-encode strings | `encode region as region_code` |
| `recode <col> old=new ...` | Recode values | `recode region North=N South=S` |
| `replace <col> <old> <new>` | Replace values | `replace region North Norte` |
| `sample <N\|N%>` | Random sample | `sample 100` or `sample 10%` |
| `duplicates [drop] [cols]` | Find or drop duplicates | `duplicates drop` |
| `unique <col>` | List unique values | `unique region` |
| `lag <col> [N]` | Lag variable (shift down) | `lag price 2` |
| `lead <col> [N]` | Lead variable (shift up) | `lead price` |
| `pivot <val> by <col>` | Reshape to wide format | `pivot score by subject over name` |
| `melt <ids>, <vals>` | Reshape to long format | `melt name, math eng` |

Descriptive Statistics (5 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `summarize [cols]` | Summary statistics (N, Mean, SD, quartiles) | `summarize age income` |
| `tabulate <col>` | Frequency table (top 50 values) | `tabulate education` |
| `crosstab <row> <col>` | Two-way contingency table with row percentages | `crosstab gender status` |
| `corr [cols]` | Pearson correlation matrix | `corr age income score` |
| `groupby <cols> summarize <aggs>` | Group-by with aggregations | `groupby region summarize mean(income) count()` |

Statistical Models (6 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `ols y ~ x1 + x2` | OLS linear regression | `ols score ~ age + income --robust` |
| `logit y ~ x1 + x2` | Logistic regression (binary) | `logit employed ~ age + income` |
| `probit y ~ x1 + x2` | Probit regression (binary) | `probit employed ~ age + income` |
| `poisson y ~ x1 + x2` | Poisson regression (counts) | `poisson visits ~ age --exposure=time` |
| `negbin y ~ x1 + x2` | Negative Binomial (overdispersed) | `negbin claims ~ age + gender` |
| `quantreg y ~ x1 + x2` | Quantile regression | `quantreg wage ~ edu + exp tau=0.9` |

All models support:

  • --robust — heteroscedasticity-robust standard errors (HC1)
  • --cluster=col — cluster-robust standard errors
  • --weight=col — frequency/analytic weights
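The `--robust` flag is documented as HC1. For intuition, here is a minimal NumPy sketch of the HC1 sandwich estimator on simulated data; this illustrates the standard formula, not statika's actual implementation:

```python
import numpy as np

def hc1_se(X, y):
    """OLS coefficients plus HC1 (heteroscedasticity-robust) standard errors.

    X: (n, k) design matrix including an intercept column.
    y: (n,) response vector.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n/(n-k)
    meat = (X * resid[:, None] ** 2).T @ X
    cov = n / (n - k) * XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)
beta, se = hc1_se(X, y)
```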

Formula syntax:

  • y ~ x1 + x2 — standard predictors
  • y ~ x1*x2 — full factorial (expands to x1 + x2 + x1:x2)
  • y ~ x1:x2 — interaction term only
  • y ~ x1*x2*x3 — three-way interaction
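The `*` expansion rule above can be pictured as a small rewrite step. The helper below is a hypothetical sketch of that rule (full factorial from `*`-joined factors), not statika's formula parser:

```python
from itertools import combinations

def expand_factorial(term: str) -> list[str]:
    """Expand "x1*x2" into main effects plus all interactions,
    mirroring the rule that a*b = a + b + a:b."""
    factors = [f.strip() for f in term.split("*")]
    out = []
    for r in range(1, len(factors) + 1):
        for combo in combinations(factors, r):
            out.append(":".join(combo))  # interactions use the ':' notation
    return out

expand_factorial("x1*x2")  # ['x1', 'x2', 'x1:x2']
```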

Post-Estimation (9 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `predict [name]` | Predicted values from last model | `predict yhat` |
| `residuals [name]` | Residuals + diagnostic plots | `residuals resid` |
| `vif` | Variance Inflation Factor | `vif` |
| `margins [--at=means\|average]` | Marginal effects (dy/dx) | `margins --at=average` |
| `bootstrap [n=N] [ci=N]` | Bootstrap confidence intervals | `bootstrap n=1000 ci=95` |
| `estat <sub>` | Post-estimation diagnostics | `estat all` |
| `estimates table` | Side-by-side model comparison | `estimates table` |
| `stepwise y ~ x1 + ...` | Stepwise variable selection | `stepwise y ~ x1 + x2 --backward` |
| `latex [path.tex]` | Export model as LaTeX table | `latex results.tex` |

estat subcommands: hettest, ovtest, linktest, ic, all
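The percentile-bootstrap idea behind `bootstrap n=1000 ci=95` can be sketched generically. The helper below is a hypothetical pure-Python illustration (resampling a sample mean), not statika's parallelized implementation:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n=1000, ci=95, seed=42):
    """Percentile bootstrap CI: resample with replacement, recompute
    the statistic, then take the central `ci`% of the estimates."""
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n)
    )
    alpha = (100 - ci) / 200  # e.g. 0.025 on each tail for a 95% CI
    lo = reps[int(alpha * n)]
    hi = reps[int((1 - alpha) * n) - 1]
    return lo, hi

lo, hi = bootstrap_ci([4.9, 6.5, 7.4, 7.5, 8.6, 9.4])
```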

Hypothesis Tests (5 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `ttest <col>` | One-sample t-test | `ttest score mu=7` |
| `ttest <col> by <group>` | Two-sample Welch t-test | `ttest income by employed` |
| `ttest <col> paired <col2>` | Paired t-test | `ttest before paired after` |
| `chi2 <col1> <col2>` | Chi-square independence test | `chi2 region employed` |
| `anova <col> by <group>` | One-way ANOVA (F-test) | `anova score by region` |
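The two-sample form uses Welch's unequal-variance test. A stdlib sketch of the t statistic and Welch-Satterthwaite degrees of freedom follows (the p-value step, which needs the t distribution's CDF, is omitted; this is the textbook formula, not statika's code):

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    (Welch-Satterthwaite) for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                 # squared standard error of the difference
    t = (ma - mb) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```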

Visualization (7 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `plot hist <col>` | Histogram | `plot hist age` |
| `plot scatter <y> <x>` | Scatter plot | `plot scatter score income` |
| `plot line <y> <x>` | Line plot | `plot line score age` |
| `plot box <col> [by <g>]` | Box plot (optionally grouped) | `plot box income by region` |
| `plot bar <col> [by <g>]` | Bar chart | `plot bar income by region` |
| `plot heatmap [cols]` | Correlation heatmap | `plot heatmap age income score` |
| `plot diagnostics` | Residual diagnostic plots | `plot diagnostics` |

Reporting and Utilities (4 commands)

| Command | Description | Example |
|---------|-------------|---------|
| `report <path>` | Generate Markdown report | `report analysis.md` |
| `help [cmd]` | Show help (all or specific command) | `help ols` |
| `esttab` | Publication-style coefficient table | `esttab` |
| `quit` / `exit` / `q` | Exit REPL | `quit` |

Expression Language

The expression language used by filter and derive is a safe, recursive-descent parser. No Python eval() is used anywhere in statika.

# Arithmetic
statika> derive income_k = income / 1000
statika> derive bmi = weight / (height ** 2)

# Comparisons and boolean logic
statika> filter age > 30 and income < 50000
statika> filter not is_null(score) and region == "North"

# Functions
statika> derive log_income = log(income)
statika> derive name_upper = upper(name)
statika> derive score_clean = fill_null(score, 0)

| Category | Functions |
|----------|-----------|
| Math | `log(x)`, `log10(x)`, `sqrt(x)`, `abs(x)`, `exp(x)`, `round(x, n)` |
| String | `upper(x)`, `lower(x)`, `len_chars(x)`, `strip(x)`, `contains(x, "pat")` |
| Null | `is_null(x)`, `is_not_null(x)`, `fill_null(x, value)` |
| Type | `cast_float(x)`, `cast_int(x)`, `cast_str(x)` |
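The "no eval()" approach can be illustrated with a generic whitelist evaluator. The sketch below uses Python's `ast` module rather than a hand-written recursive-descent parser, so it is an analogy for the technique, not statika's actual code:

```python
import ast
import operator

# Whitelisted binary operators; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def safe_eval(expr: str, variables: dict) -> float:
    """Evaluate a small arithmetic expression without Python eval().
    Only numbers, named variables, and whitelisted operators pass."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        raise ValueError(f"disallowed syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

result = safe_eval("weight / (height ** 2)", {"weight": 70, "height": 1.75})
# result is about 22.86; function calls like __import__ raise ValueError
```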

Aggregation functions for groupby ... summarize:

| Function | Description |
|----------|-------------|
| `mean(col)` | Arithmetic mean |
| `sd(col)` | Standard deviation (sample) |
| `sum(col)` | Sum |
| `min(col)` | Minimum |
| `max(col)` | Maximum |
| `median(col)` | Median |
| `count()` | Row count per group |

Automatic Model Diagnostics

Every model automatically checks for common problems:

  • Multicollinearity — Condition number > 30 triggers a warning
  • Heteroscedasticity — Breusch-Pagan test; suggests --robust if p < 0.05
  • Autocorrelation — Durbin-Watson statistic far from 2.0
  • Convergence — Warns if logit/probit MLE did not converge
  • Missing values — Reports how many observations were dropped
  • Low sample size — Warns when the observation-to-predictor ratio is low
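Two of these checks are simple enough to state directly. The NumPy sketch below shows the standard Durbin-Watson statistic and the design-matrix condition number (the textbook formulas, not statika's diagnostics module):

```python
import numpy as np

def durbin_watson(resid):
    """DW statistic: ~2 means no first-order autocorrelation;
    values toward 0 or 4 indicate positive/negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def condition_number(X):
    """Ratio of largest to smallest singular value of the design
    matrix; > 30 is the warning threshold mentioned above."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return s.max() / s.min()

dw = durbin_watson([1.0, -1.0] * 10)  # strictly alternating residuals: 3.8
```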

File Formats

| Format | Import | Export | Notes |
|--------|--------|--------|-------|
| CSV | Yes | Yes | Built-in |
| Parquet | Yes | Yes | Built-in |
| Stata (.dta) | Yes | Yes | `pip install "statika[stata]"` |
| Excel (.xlsx) | Yes | Yes | `pip install "statika[excel]"` |

CLI Reference

statika repl                     # Interactive REPL
statika run script.ost           # Run an .ost script
statika run script.ost --strict  # Stop on first error (exit code 1)
statika --verbose repl           # Verbose logging
statika --debug repl             # Debug logging
statika --version                # Show version

Logs are written to ~/.statika/logs/openstat.log.


Configuration

Create ~/.statika/config.toml to customize defaults:

[data]
output_dir = "outputs"
csv_separator = ","

[display]
tabulate_limit = 50
head_default = 10

[undo]
max_undo_stack = 20
max_undo_memory_mb = 500

[plotting]
plot_dpi = 150
plot_figsize_w = 8.0
plot_figsize_h = 5.0

[model]
condition_threshold = 30
min_obs_per_predictor = 5
bootstrap_iterations = 1000

Technology Stack

| Component | Library | Notes |
|-----------|---------|-------|
| Data engine | Polars | Rust-powered, zero-copy, 10-100x faster than pandas |
| Statistics | statsmodels | OLS, GLM, quantile regression |
| Scientific | SciPy | Hypothesis tests, distributions |
| Plotting | matplotlib | Publication-quality figures |
| CLI | Typer | Type-annotated CLI |
| Terminal UI | Rich | Tables and formatted output |
| REPL | prompt-toolkit | Tab completion, history |

Contributing

Contributions are welcome. Whether you are fixing a typo, stabilizing an experimental module, or adding a new command, the process is the same:

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Write code and tests
  4. Confirm tests pass and lint is clean: pytest and ruff check src/
  5. Open a pull request with a clear description

What to contribute

  • Stable-core hardening — CLI/REPL behavior, error handling, command metadata
  • Experimental stabilization — panel, survival, survey, IV, mixed models
  • New commands — any useful data manipulation or analysis command
  • Expression language — new DSL functions
  • Plot types — new visualization types
  • File formats — SAS, SPSS, JSON, and others
  • Documentation — tutorials, examples, translations
  • Bug reports — open an issue on GitHub

New to open source? Look for issues labeled good first issue. See CONTRIBUTING.md for the full setup guide.


Roadmap

Completed

  • OLS, Logit, Probit, Poisson, Negative Binomial, Quantile regression
  • Interaction terms (x1*x2, x1:x2, three-way)
  • Robust and cluster-robust standard errors
  • Frequency and analytic weight support
  • Marginal effects (average, at-means, for OLS/logit/probit)
  • Bootstrap confidence intervals (parallelized)
  • Post-estimation diagnostics (estat, vif, residuals)
  • Model comparison tables (estimates table)
  • Stepwise variable selection (forward/backward)
  • Safe expression language (no eval)
  • Tab completion and multi-level undo in REPL
  • LaTeX and Markdown report export
  • CSV, Parquet, Stata .dta, Excel import/export
  • Configuration file support
  • CI/CD with GitHub Actions (1173 tests, 91% coverage)

Planned

  • Stabilize experimental estimators (panel, survival, survey, IV)
  • Replace remaining pandas paths in large-data workflows
  • Publish full documentation site
  • Improve backend abstraction (shared engine contract for load/query/model/export)
  • SAS and SPSS file format support

Acknowledgements

statika is built on top of excellent open-source libraries:

  • Polars — for reimagining what a DataFrame library can be
  • statsmodels — for bringing professional-grade statistics to Python
  • SciPy — for decades of scientific computing
  • Rich — for making terminal output readable
  • prompt-toolkit — for the interactive REPL foundation

License

MIT License. See LICENSE for the full text.



Not affiliated with StataCorp LLC, IBM SPSS, or SAS Institute Inc.
statika is an independent open-source project.
