dstat

Stata module to compute summary statistics and distribution functions including standard errors and optional covariate balancing

dstat unites a variety of methods to describe (univariate) statistical distributions. Covered are density estimation, histograms, cumulative distribution functions, probability distributions, quantile functions, Lorenz curves, percentile shares, and a large collection of summary statistics such as classical and robust measures of location, scale, skewness, and kurtosis, as well as inequality and poverty measures. Particular features of the command are that it provides consistent standard errors supporting complex sample designs for all covered statistics and that the simultaneous analysis of multiple variables across multiple subpopulations is possible. Furthermore, the command supports covariate balancing based on reweighting techniques (inverse probability weighting and entropy balancing), including appropriate correction of standard errors. Standard error estimation is implemented in terms of influence functions, which can be stored for further analysis, for example, using RIF regression.

To install dstat from the SSC Archive, type

. ssc install dstat, replace

in Stata. Stata version 14 or newer is required. Furthermore, moremata and coefplot are required. To install these packages from the SSC Archive, type

. ssc install moremata, replace
. ssc install coefplot, replace

Installation from GitHub:

. net install dstat, replace from(https://raw.githubusercontent.com/benjann/dstat/main/)
. net install moremata, replace from(https://raw.githubusercontent.com/benjann/moremata/master/)
. net install coefplot, replace from(https://raw.githubusercontent.com/benjann/coefplot/master/)
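
Once the packages are installed, a first session might look like the following minimal sketch. It assumes the `nlsw88` example dataset that ships with Stata; the variable choices (`wage`, `union`) are illustrative only, and the statistics and options used (`mean`, `sd`, `gini`, `over()`, `lnratio`, `graph`) are those named in the change log below. See `help dstat` for the authoritative syntax.

```stata
* Example data shipped with Stata (an assumption of this sketch)
sysuse nlsw88, clear

* Mean, standard deviation, and Gini coefficient of wages by union status,
* reported with standard errors
dstat (mean sd gini) wage, over(union)

* Log ratio of mean wages between the union groups
* (suboption lnratio implies contrast)
dstat (mean) wage, over(union, lnratio)

* Density estimates by union status, plotted via the graph option
* (graphing relies on coefplot)
dstat density wage, over(union) graph
```

If no subcommand is given, `dstat` falls back to `summarize`, so the first two calls are equivalent to `dstat summarize ...`.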

Main changes:

24mar2023 (version 1.4.4)
- generate() stored the influence functions of the raw statistics rather than the
  influence functions of the transformed statistics if suboption -lnratio- was
  specified in over(); this also implied that vce(svy) reported the standard errors
  of the raw statistics rather than standard errors of the transformed statistics
  if suboption -lnratio- was specified in over(); this is fixed

28dec2022 (version 1.4.3)
- command -dstat (somersd) Y, by(X)- computed D(X|Y) rather than D(Y|X); I now
  changed this so that D(Y|X) is computed, which is more intuitive (and more in
  line with how other asymmetric statistics are computed by dstat); thanks to
  Maurizio Pisati for pointing out this inconsistency

15dec2022 (version 1.4.2)
- modified dstat_svyr such that replication-based svy estimators no longer
  apply checks for omitted coefficients; this prevents the estimators from
  failing on results that have zero variance (e.g. a zero-frequency histogram
  bar)

14dec2022 (version 1.4.1)
- [no]cov is no longer a suboption within vce(); it is now a regular option
- dstat predict now has option scaling() to determine the scaling of the
  generated influence functions
- option nobwfixed added; code to obtain grid and bandwidth in case of
  replication estimators revised
- revised implementation of vce(svy)
- revised implementation of predict

12dec2022 (version 1.4.0)
- dstat pw did not work with vce() set to bootstrap, jackknife, or svy; this is
  fixed
- the returned information on sample and population size included observations
  that were excluded from estimation due to missing values if vce(svy) with
  replication-based variance estimation was specified; this is fixed
- the secondary variable (-by-) can now be string for inequality decomposition
  measures as well as for cohend, mindex, uc[l|r], cramersv, and dissim

05dec2022 (version 1.3.9)
- statistic -sdlog- added
- new methods in citype() for proportions: agresti, exact, jeffreys, wilson
- citype(normal) can now be abbreviated as citype(norm)
- reorganized code for computation of CIs
- dstat graph: overlay can now be specified as a synonym for merge
- r() from -dstat- is now preserved if option -graph- is specified; this ensures
  that r(table) will be available after running -dstat- with both the -graph-
  option and the -table- option; furthermore, r() from dstat is now also
  preserved if option -generate()- or -rif()- is applied
- the display routine is now executed even if -quietly- is applied to -dstat-,
  so that r(table) will be created even if -quietly- is applied
- the display routine will now clear preexisting r() even if -notable- is applied
- -dstat predict- no longer modifies r()
- an informative error message is now displayed if a string variable is
  specified in by(), pline(), or as an argument to a statistic

21nov2022 (version 1.3.8)
- dstat density: option [l|r]tight added; requires newest update of moremata

20oct2022 (version 1.3.7)
- dstat returned error if option -nose- was applied with statistics that set
  standard errors to zero (e.g. min and max); this is fixed

22sep2022 (version 1.3.6)
- dstat returned error if histogram method -scott- was specified; this is fixed
- now using errprintf() to display errors in Mata

11aug2022 (version 1.3.5)
- statistic -cohend- added
- statistic -freq- without argument can now be used to obtain
  overall frequency/sum of weights; can also type -count-

17feb2022 (version 1.3.4)
- dstat pw added (wrapper for dstat summarize to compute pairwise correlations
  and similar)
- informative error message is now displayed if factor variables are used in
  -dstat proportion- without option -nocategorical-

14feb2022 (version 1.3.3)
- additional statistics in dstat sum: -slope- or -b- (regression coefficient;
  may also be used to compute mean difference or risk difference),
  -or- (odds ratio in 2x2 table), -rr- (risk ratio in 2x2 table)
- version of moremata library is now checked

17jan2022 (version 1.3.2)
- option hdtrim() added (trimmed Harrell-Davis quantiles)
- grid size in _ds_mq_d_init() now 1024+1 because first point will be removed

11jan2022 (version 1.3.1)
- now using a properly derived expression for the influence function of 
  Harrell-Davis quantiles (rather than obtaining the IF by analogy to the
  jackknife approach proposed by Harrell and Davis 1982); the new formulas
  lead to slightly different results

07jan2022 (version 1.3.0)
- dstat sum: huber, biweight, mad[n], mae[n], mscale now take account of qdef()
- dstat sum: computation of IFs for winsor, qskew, qw, lqw, rqw revised so that
  qdef() is taken into account (only relevant if qdef=10 or qdef=11)

30dec2021 (version 1.2.9)
- system for managing selection of observations and temporary results rewritten
  (more systematic, cleaner code, less error prone, more efficient)
- dstat sum: harmonic mean (hmean) is now set to zero if at least one outcome
  value is equal to zero

22dec2021 (version 1.2.8)
- dstat sum: computation of taua was wrong in case of fweights; this is fixed
- dstat sum: renamed cdfm to mcdf, cdff to fcdf, ccdfm to mccdf, ccdff to fccdf
- system for parsing syntax of -dstat sum- rewritten (more general, cleaner
  code, easier to manage/expand, better error messages)

22dec2021 (version 1.2.7)
- support for qdef(11) added (mid-quantile); option -mquantile- is a synonym
  for qdef(11)
- dstat sum: mquantile, gw_vlog, w_vlog, b_vlog, ekurtosis, rsquared added
- dstat sum: now using quad precision when taking cross products in variance,
  sd, cv, md, gini, vlog, sen, sst, takayama, lvar, mse, spearman, skewness,
  kurtosis, gci, corr, cov 
- default for napprox() increased from 512 to 1024
- dstat histogram: in case of pweights or iweights, the effective sample size
  (sum(w)^2/sum(w^2)) is now used instead of the physical number of
  observations in the rules for selecting the number of bins
- default bandwidth selector for density estimation is now -dpi(2)-; -sjpi-
  can be erratic on data that contains heaping
- improved error messages and some code cleaning

05dec2021 (version 1.2.6)
- IF of b_gini assumes that the order of group means is stable; this is an
  assumption that is typically not very critical; comparison to the jackknife
  illustrates that the IF is quite accurate even in small samples; removed
  the corresponding disclaimer in the help file

05dec2021 (version 1.2.6)
- dstat_sum: b_gini added (IF not fully correct yet; may only serve as a rough
  approximation)
- dstat sum: gw_gini, gw_mld, gw_theil, gw_ge added
- dstat sum: mldwithin renamed to w_mld; mldbetween renamed to b_mld
- dstat sum: theilwithin renamed to w_theil; theilbetween renamed to b_theil
- dstat sum: gewithin renamed to w_ge; gebetween renamed to b_ge

04dec2021 (version 1.2.5)
- dstat sum: gewithin and gebetween added
- dstat sum: IF of dissim made more efficient

03dec2021 (version 1.2.4)
- dstat sum: mldwithin, mldbetween, theilwithin, theilbetween, dissim added
- dstat sum: now using more efficient approach to compute IFs of categorical
  measures (hhi, entropy, mindex, etc)
- option zvar() is now called by(); zvar() still supported but no longer documented

27nov2021 (version 1.2.3)
- -nocasewise- had a bug that could crash -dstat- in some cases; this is fixed

26nov2021 (version 1.2.2)
- new system to manage temporary results to improve efficiency of -dstat sum-
- due to a typo the values for gamma and tau_b could be somewhat off if weights
  were specified; this is fixed

25nov2021 (version 1.2.1)
- added association statistics: taua, taub, somersd, gamma; using a fast
  algorithm by R. Newson (2006. Efficient Calculation of Jackknife Confidence 
  Intervals for Rank Statistics. Journal of Statistical Software 15/1) to
  compute the difference in the sum of concordant and discordant pairs
- dstat automatically (and silently) recentered (all) influence functions if
  any IF had a relative error (i.e. deviation from zero relative to the value
  of the statistic) larger than 1e-14; a corresponding warning message was only
  displayed if any IF had a relative error larger than 1e-6; the former type
  of recentering is now discarded; that is, recentering is now only applied
  if at least one relative error is larger than 1e-6 (all IFs will be
  affected) and a warning message is always displayed if recentering is applied
- option -relax- could cause error in some situations; this is fixed
- dstat no longer enforces user version 14.2 when writing coefficient names to
  e(b) (enforcing user version 14.2 caused issues with bootstrap and similar
  commands); a consequence of this is that in Stata 15 (and in Stata 16 prior
  to the 30mar2021 update) the results table from -dstat summarize- might look
  slightly awkward if statistics with parameters in parentheses are specified;
  type -version 14: dstat summarize ...- for better output in these cases
- over-legend is no longer displayed if the coefficients table is suppressed
- subcmd is now always set to -summarize-, if no known subcmd is specified; for
  example, -dstat x1-x5- now works

20nov2021 (version 1.2.0)
- a bug in -nocasewise- led to erroneous selection of observations or crashed
  dstat in some situations; this is fixed
- added statistics for categorical variables: hhi, hhin, gimp, entropy, hill,
  renyi, mindex, uc, cramer

03aug2021 (version 1.1.9)
- fixed header layout in Stata 17, employing _coef_table_header options
  introduced in the 13jul2021 update of Stata 17

14jul2021 (version 1.1.8)
- option -discrete- now allowed in -dstat histogram-; -dstat histogram, discrete-
  is an alias for -dstat proportion, nocategorical-
- graphs after -dstat proportion- now use a continuous axis instead of a categorical
  axis if option -nocategorical- has been specified
- -dstat frequency- can now be used as alias for -dstat proportion, frequency-
- statistic hdquantile() now fully supports weights; computation of influence
  functions has been improved
- option -qdef(10)- can now be specified to use Harrell-Davis quantiles; option
  -hdquantile- is a synonym for -qdef(10)-

01jul2021 (version 1.1.7)
- statistic hdquantile() added
- SEs of quantile(0) and quantile(1) now set to 0
- -dstat pdf- now allowed as alias for -dstat density-
- better error message if an invalid subcommand is specified

30jun2021 (version 1.1.6)
- additional poverty measures: tip (TIP ordinate) and atip (absolute TIP ordinate)
- -dstat tip- failed if a variable was specified in -pline()- instead of a
  fixed value; this is fixed
- -dstat tip- no longer returns HCR and PGI in e()

29jun2021 (version 1.1.5)
- -dstat tip- (TIP curve) added
- option range() added to subcommands density, cdf, ccdf, quantile, lorenz, tip
- association measures added: corr (correlation), cov (covariance), spearman
  (Spearman's rank correlation)
- additional poverty measures: apgap (absolute poverty gap), apgi (absolute
  poverty gap index)
- contrast(lag) and contrast(lead) now allowed in over()
- can now specify custom p1 and p2 with -iqrn-
- observations with missing on variables specified in zvar() or pline() (or
  corresponding variables specified as arguments to individual statistics) are
  no longer excluded from the overall estimation sample if -nocasewise- is 
  specified
- number of obs and sum of weights now returned for each parameter in e(nobs)
  and e(sumw)

23jun2021 (version 1.1.4)
- additional inequality statistic: Hoover index (Robin Hood index)
- additional poverty statistics: hcr (head count ratio), pgap (poverty gap),
  pgi (poverty gap index), sen (Sen poverty index), sst (Sen-Shorrocks-Thon),
  takayama (Takayama poverty index), chu (Clark-Hemming-Ulph)
- new option -pstrong- to employ the "strong" poverty definition; -fgt- now uses
  the "weak" definition by default
- option -relax- of -dstat summarize- was not included in e() and was not passed 
  through to -predict-; this is fixed
- the routine computing -md- could break in some contexts; this is fixed

10jun2021 (version 1.1.2)
- -predict- could fail after -dstat proportion-; this is fixed
- contrast options -ratio- and -lnratio- now again supported for statistics
  that are not normalized by the sample size (frequencies, totals)
- fixed bug that could occur if nocasewise and unconditional were both specified

07jun2021 (version 1.1.1)
- option -nocasewise- added
- option -relax- added
- dstat now always uses scores for totals/frequencies instead of influence
  functions; (sub)option svy in -predict-, -vce(analytic)- and -vce(cluster)-
  is discontinued; option -unconditional(fixed)- is discontinued; treatment of
  totals/freqs now consistent with survey estimation by default (i.e. subpopulation
  sizes are assumed random; number of PSUs is assumed fixed); this is different
  from how official command -total- handles subpops if used without -svy-
  prefix
- contrast options -ratio- and -lnratio- are no longer supported for statistics
  that are not normalized by the sample size (frequencies, totals); -ratio- and
  -lnratio- now imply -contrast-
- option -compact- of -predict/generate()/rif()- no longer allowed with
  -over(, contrast/accumulate)- or with statistics that are not normalized by
  the sample size
- dstat summarize applied sorting even if not necessary; this is fixed
- omitted estimates are no longer flagged in the coefficient names; vector
  e(omit) is now returned
- density estimation settings are now returned in e() only if density estimation
  has, in fact, been employed; e(bwidth) now has better column names
- in some situations, dstat histogram computed wrong results for the first bin
  if option balance() was specified; this is fixed
- _makesymmetric() is now applied to e(V) to remove asymmetry due to possible
   roundoff-error

22dec2020 (version 1.1.0)
- results for statistics mad(0,0), madn(0,0), mae(0), and maen(0) were wrong
  in case of weights; this is fixed

16dec2020 (version 1.0.9)
- new suboptions -contrast()-, -ratio-, -lnratio-, and -accumulate- in -over()-
- new -common- option in -dstat density-, -dstat histogram-, and -dstat [c]cdf-
- new display options -cref- and -pvalue-
- citype() now sets CI to missing if value of coef is outside domain of
  transformation function
- option select() in -dstat graph- can now contain -reverse- instead of a
  numlist

11dec2020 (version 1.0.8)
- cluster variable in vce(cluster) can now be string
- over(..., rescale) now implemented as subcommand-specific option
  -unconditional-; -unconditional(fixed)- added to treat subpopulation
  sizes as fixed
- dstat cdf/ccdf: specifying -ipolate- together with -floor- returned error; this
  is fixed

10dec2020 (version 1.0.7)
- vce(analytic/cluster, svy)
  o svy was not taken into account if no clusters and no weights, iweights, or
    fweights were specified; this is fixed
  o revised code to preserve memory and avoid double work
- for reasons of consistency, in case of iweights, the sum of weights is now
  reported in e(N) instead of the physical number of observations

09dec2020 (version 1.0.6)
- new option select() in -dstat graph- to select and order subgraphs and plots
- new suboption select() in over(): select and order subpopulations to be included
  in results; total will still include obs from all groups
- new suboption -rescale- in over(): rescale results by the relative size of the
  subpopulation
- suboption -svy- in vce(analytic) and vce(cluster) to compute SEs for
  frequencies and totals like svy does 
- new statistics: min, max, range, midrange (IFs/SEs will be set to zero for 
  these statistics)
- vce(svy), vce(bootstrap), and vce(jackknife) now feature suboption [no]cov to
  decide whether to store full e(V) or only e(se); default is -cov- for 
  -dstat summarize- and -nocov- else; with vce(svy) option -nocov- also removes
  auxiliary covariance matrices such as e(V_srs)
- dstat density: standard errors were correct only in the first subpopulation 
  if -over()- was specified together with -exact-; this is fixed 

05dec2020 (version 1.0.5)
- new -dstat ccdf- command for complementary CDF (tail distribution, survival
  function)
- -dstat cdf- has new options -frequency-, -percent-, -floor-, and -ipolate-
- additional statistics: total(), cdff(), ccdf(), ccdfm(), ccdff()
- statistics trim(p1,p2) and winsor(p1,p2) now documented; furthermore, qdef()
  is now taken into account by trim() and winsor()
- option -sum- in -dstat lorenz- and -dstat share- now documented
- statistics tlorenz(), tshare(), tccurve(), tcshare() now documented
- option generate() has a new -svy- suboption to generate scores for survey
  estimation instead of influence functions; this only makes a difference for
  unnormalized statistics (frequencies, totals)
- VCE for unnormalized statistics (frequencies, totals) did not take account of
  the extra uncertainty induced by the variability of the sum of weights in the
  context of survey estimation; this is fixed
- confidence limits had wrong scale if -percent- was specified, citype() was not
  normal, and width of confidence interval was zero; this is fixed
- predict after survey estimation with subpop() returned missing in observations
  outside subpop(); the IFs for these observations are now set to 0
- revised code of some IFs to avoid double work; affected functions are
  dstat_density_IF(), dstat_cdf_IF(), dstat_sum_hist(), dstat_sum_cdf(),
  dstat_sum_cdfm(), dstat_sum_freq()
- now using pstyle(p#line) instead of pstyle(p#) in graphs if appropriate
- no longer using mm_repeat(); using J() instead

27nov2020 (version 1.0.4)
- "version, user" issue now finally fixed (hopefully); the issue was related
  to -set dp comma-

27nov2020 (version 1.0.3)
- yet another try to fix the "version, user" issue

27nov2020 (version 1.0.2)
- graph option -merge- added
- added code to circumvent the "version, user" error that appears to occur
  in some variants of Stata installations

24nov2020 (version 1.0.1)
- issues encountered with regexr() in Stata 14; no longer using regexr()
- fixed another awkward Stata 14 issue

24nov2020 (version 1.0.0)
- dstat released on GitHub
