Damon Snyder edited this page Jun 5, 2013 · 7 revisions

Chapter 4

4.11

Is the BRFSS data normal or lognormal? Here is the code to generate the data and the plots:

(four/brfss-4.11-gen-data "tmp/CDBRFS08.ASC.gz")
$ Rscript plots/is-distribution-qq.R tmp/weights.csv
$ mv plots/is-distribution-qq.png plots/4.11a.png
$ Rscript plots/is-distribution-qq.R tmp/weights-log.csv
$ mv plots/is-distribution-qq.png plots/4.11b.png

Here is the probability plot for the male data using mu = 88.64 and sigma = 16.26:

probability plot

And here is the lognormal probability plot for the same data, generated from tmp/weights-log.csv:

log normal

The lognormal plot follows the line y = x (in red) much more closely than the normal plot does. I would conclude that the lognormal distribution is a good model for the data.
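The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the R script used to produce the plots: the function names are hypothetical, the weights are a synthetic lognormal sample (the parameters 4.47 and 0.18 are assumptions chosen only to give values near the reported mean), and straightness is summarized with a correlation coefficient rather than judged by eye.

```python
import math
import random
from statistics import NormalDist, mean


def probability_plot_points(values):
    """Pair each sorted value with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist(0.0, 1.0)
    # (i + 0.5) / n keeps the plotting positions strictly inside (0, 1).
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))


def straightness(points):
    """Pearson correlation of the points; values near 1 mean a nearly straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-in for the weights (assumption): lognormal by construction.
rng = random.Random(0)
weights = [rng.lognormvariate(4.47, 0.18) for _ in range(5000)]

r_raw = straightness(probability_plot_points(weights))
r_log = straightness(probability_plot_points([math.log(w) for w in weights]))
print(f"raw weights: r = {r_raw:.4f}")
print(f"log weights: r = {r_log:.4f}")
```

If the data are lognormal, the log-transformed values should give the straighter plot, which is the same pattern seen in the two figures.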

4.12

I tried to determine whether the Pareto distribution was a good model for the city data. I generated a plot of log(CCDF) against log(x), and it didn't look like a good fit. I then read more about the log-normal distribution and found a reference describing the distribution of city sizes as log-normal and following Gibrat's law, which claims that the entire distribution of cities is log-normal.
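As a rough illustration of the Pareto check, the sketch below builds the log(CCDF) against log(x) points and measures how well a straight line fits them. For Pareto data the points should fall on a line whose slope is the negative of the shape parameter. The function names are hypothetical, and both samples are synthetic stand-ins (assumptions), not the city data.

```python
import math
import random


def log_ccdf_points(values):
    """Return (log x, log CCDF(x)) pairs for a sample of positive values."""
    ordered = sorted(values)
    n = len(ordered)
    # Skip the largest value: its empirical CCDF is 0, so the log is undefined.
    return [(math.log(x), math.log(1.0 - (i + 1) / n))
            for i, x in enumerate(ordered[:-1])]


def line_fit(points):
    """Least-squares slope and R^2 of y against x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


rng = random.Random(0)
# Synthetic samples (assumptions): one Pareto, one lognormal.
pareto_sample = [rng.paretovariate(1.7) for _ in range(5000)]
lognormal_sample = [rng.lognormvariate(7.33, 1.862) for _ in range(5000)]

pareto_slope, pareto_r2 = line_fit(log_ccdf_points(pareto_sample))
lognormal_slope, lognormal_r2 = line_fit(log_ccdf_points(lognormal_sample))
print(f"Pareto:    slope = {pareto_slope:.2f}, R^2 = {pareto_r2:.3f}")
print(f"lognormal: slope = {lognormal_slope:.2f}, R^2 = {lognormal_r2:.3f}")
```

A visibly curved plot, or a noticeably lower R^2, is the kind of evidence that would suggest the Pareto model is a poor fit.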

To test this, I generated a sample using random/lognormalvariate with a mu of 7.33 and a sigma of 1.862. These values were derived from the city data, as generated by Allen's populations.py, using stats/summary and stats/stddev.
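A minimal Python sketch of this step follows. It assumes that mu and sigma are the mean and standard deviation of the log of the populations, which is what Python's random.lognormvariate expects as parameters. The helper name is hypothetical, and the populations list is a synthetic stand-in (an assumption) rather than the output of populations.py.

```python
import math
import random
from statistics import mean, stdev


def lognormal_params(values):
    """Estimate mu and sigma as the mean and standard deviation of log(x)."""
    logs = [math.log(v) for v in values]
    return mean(logs), stdev(logs)


rng = random.Random(1)

# Synthetic stand-in for the city populations (assumption).
populations = [rng.lognormvariate(7.33, 1.862) for _ in range(10000)]

mu, sigma = lognormal_params(populations)

# Model sample of the same size, drawn with the estimated parameters.
model = [rng.lognormvariate(mu, sigma) for _ in range(len(populations))]

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```

Drawing the model sample at the same size as the data keeps the two empirical CDFs comparable when they are plotted together.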

I then computed the CDF of each and plotted them on the same graph with log(x) on the horizontal axis. Here is the plot with the empirical data set in blue and the model in red.

Population data and lognormal fit.

Both CDFs have the shape expected of a normal distribution once x is replaced by log(x). In hindsight, generating the model wasn't necessary: the CDF of the empirical data with log(x) substituted for x already has the shape of a normal CDF, which suggests that the data are log-normal.
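The visual comparison can also be put into a number. The sketch below, under the assumption that a simple maximum-gap measure is acceptable, compares the empirical CDF of log(x) directly with a normal CDF that has the same mean and standard deviation, so no model sample is needed. The function name is hypothetical and both samples are synthetic stand-ins (assumptions), not the city data.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def max_cdf_gap(values):
    """Largest vertical gap between the empirical CDF of log(x) and a fitted normal CDF."""
    logs = sorted(math.log(v) for v in values)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    gap = 0.0
    for i, x in enumerate(logs):
        f = fitted.cdf(x)
        # Check the empirical CDF just before and just after its step at x.
        gap = max(gap, abs((i + 1) / n - f), abs(i / n - f))
    return gap


rng = random.Random(2)
# Synthetic samples (assumptions): one lognormal, one Pareto for contrast.
lognormal_sample = [rng.lognormvariate(7.33, 1.862) for _ in range(5000)]
pareto_sample = [rng.paretovariate(1.7) for _ in range(5000)]

gap_lognormal = max_cdf_gap(lognormal_sample)
gap_pareto = max_cdf_gap(pareto_sample)
print(f"lognormal sample: max gap = {gap_lognormal:.3f}")
print(f"Pareto sample:    max gap = {gap_pareto:.3f}")
```

A small gap means the log-transformed data track the normal CDF closely, which is the numeric counterpart of the two curves overlapping in the plot.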
