In [1]:
using RDatasets, StatsBase

In [2]:
RDatasets.packages()

| Row | Package | Title |
|-----|---------|-------|
| 1 | COUNT | Functions, data and code for count data. |
| 2 | Ecdat | Data sets for econometrics |
| 3 | HSAUR | A Handbook of Statistical Analyses Using R (1st Edition) |
| 4 | HistData | Data sets from the history of statistics and data visualization |
| 5 | ISLR | Data for An Introduction to Statistical Learning with Applications in R |
| 6 | KMsurv | Data sets from Klein and Moeschberger (1997), Survival Analysis |
| 7 | MASS | Support Functions and Datasets for Venables and Ripley's MASS |
| 8 | SASmixed | Data sets from "SAS System for Mixed Models" |
| 9 | Zelig | Everyone's Statistical Software |
| 10 | adehabitatLT | Analysis of Animal Movements |


In [3]:
RDatasets.datasets("datasets")

| Row | Package | Dataset | Title | Rows | Columns |
|-----|---------|---------|-------|------|---------|
| 1 | datasets | BOD | Biochemical Oxygen Demand | 6 | 2 |
| 2 | datasets | CO2 | Carbon Dioxide Uptake in Grass Plants | 84 | 5 |
| 3 | datasets | Formaldehyde | Determination of Formaldehyde | 6 | 2 |
| 4 | datasets | HairEyeColor | Hair and Eye Color of Statistics Students | 32 | 4 |
| 5 | datasets | InsectSprays | Effectiveness of Insect Sprays | 72 | 2 |
| 6 | datasets | LifeCycleSavings | Intercountry Life-Cycle Savings Data | 50 | 6 |
| 7 | datasets | Loblolly | Growth of Loblolly pine trees | 84 | 4 |
| 8 | datasets | OrchardSprays | Potency of Orchard Sprays | 64 | 4 |
| 9 | datasets | PlantGrowth | Results from an Experiment on Plant Growth | 30 | 2 |
| 10 | datasets | Puromycin | Reaction Velocity of an Enzymatic Reaction | 23 | 3 |


In [4]:
mydf = dataset("datasets", "esoph")
head(mydf)

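| Row | AgeGp | AlcGp | TobGp | NCases | NControls |
|-----|-------|-------|-------|--------|-----------|
| 1 | 25-34 | 0-39g/day | 0-9g/day | 0 | 40 |
| 2 | 25-34 | 0-39g/day | 10-19 | 0 | 10 |
| 3 | 25-34 | 0-39g/day | 20-29 | 0 | 6 |
| 4 | 25-34 | 0-39g/day | 30+ | 0 | 5 |
| 5 | 25-34 | 40-79 | 0-9g/day | 0 | 27 |
| 6 | 25-34 | 40-79 | 10-19 | 0 | 7 |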


In [5]:
showcols(mydf)

88×5 DataFrames.DataFrame
│ Col # │ Name      │ Eltype                                     │ Missing │
├───────┼───────────┼────────────────────────────────────────────┼─────────┤
│ 1     │ AgeGp     │ CategoricalArrays.CategoricalString{UInt8} │ 0       │
│ 2     │ AlcGp     │ CategoricalArrays.CategoricalString{UInt8} │ 0       │
│ 3     │ TobGp     │ CategoricalArrays.CategoricalString{UInt8} │ 0       │
│ 4     │ NCases    │ Int32                                      │ 0       │
│ 5     │ NControls │ Int32                                      │ 0       │

│ Col # │ Values             │
├───────┼────────────────────┤
│ 1     │ 25-34  …  75+      │
│ 2     │ 0-39g/day  …  120+ │
│ 3     │ 0-9g/day  …  10-19 │
│ 4     │ 0  …  1            │
│ 5     │ 40  …  1           │

In [6]:
sample(mydf[:AgeGp])

CategoricalArrays.CategoricalString{UInt8} "45-54"

Draw a larger sample.

In [7]:
sample(mydf[:NCases], 8)

8-element Array{Int32,1}:
 8
 2
 3
 0
 6
 1
 0
 1

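Get ordered values. With `ordered = true`, the sampled elements keep the order in which they appear in the source array; they are not sorted by value.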

In [10]:
sample(mydf[:NCases], 8; replace = true, ordered = true)

8-element Array{Int32,1}:
 0
 0
 1
 3
 3
 4
 6
 9

In [11]:
sample(mydf[:NCases], 8; replace = false, ordered = true)

8-element Array{Int32,1}:
 0
 0
 0
 0
 0
 2
 2
 1
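A minimal sketch of what `ordered = true` means without replacement, assuming Julia 1.x and only the `Random` standard library: draw distinct positions, sort the positions (not the values), then index. The helper name `ordered_sample` is hypothetical and is not part of StatsBase.

```julia
using Random

# Hypothetical helper: sample k elements without replacement,
# keeping them in the order they appear in `x`.
function ordered_sample(rng::AbstractRNG, x::AbstractVector, k::Integer)
    idx = randperm(rng, length(x))[1:k]  # k distinct random positions
    sort!(idx)                           # restore population order
    return x[idx]
end

x = [5, 3, 9, 1, 7, 2]
s = ordered_sample(MersenneTwister(1), x, 4)
println(s)  # a subsequence of x, not necessarily sorted by value
```

This is why the last output above can end with `2, 2, 1`: the values follow their positions in the column, not their magnitude.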

## Weight vectors

`WeightVec` is deprecated in favor of `Weights`. A weight vector serves two purposes:

1. Distinguish weight vectors from ordinary data arrays programmatically.
2. Store the sum of the weights to avoid repeated computation of `sum(wv)`.
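The two purposes can be sketched with a small wrapper type, assuming Julia 1.x. `CachedWeights` and `weighted_mean` are hypothetical names for illustration only; StatsBase's `Weights` type is richer than this.

```julia
# Hypothetical type, for illustration only.
struct CachedWeights{T<:Real}
    values::Vector{T}
    total::T            # sum of the weights, computed once
end

# Convenience constructor that caches the sum at construction time.
CachedWeights(v::Vector{T}) where {T<:Real} = CachedWeights{T}(v, sum(v))

# Dispatch can now tell weights apart from ordinary data vectors.
weighted_mean(x::AbstractVector, w::CachedWeights) =
    sum(x .* w.values) / w.total

w = CachedWeights([2.0, 4.0, 5.0])
println(w.total)                           # 11.0
println(weighted_mean([1.0, 2.0, 3.0], w))
```

Because the weights have their own type, a method such as `weighted_mean(x, w)` cannot confuse them with a second data vector or a dimension argument.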

In [13]:
wv = Weights([2., 4., 5.], 11.)

3-element StatsBase.Weights{Float64,Float64,Array{Float64,1}}:
 2.0
 4.0
 5.0

In [15]:
eltype(wv)

Float64

In [16]:
values(wv)

3-element Array{Float64,1}:
 2.0
 4.0
 5.0

In [17]:
size(wv)

(3,)

In [18]:
typeof(wv)

StatsBase.Weights{Float64,Float64,Array{Float64,1}}

In [19]:
sum(wv)

11.0

## Basic summary and tricks

In [20]:
describe(mydf)

AgeGp
Summary Stats:
Length:         88
Type:           CategoricalArrays.CategoricalString{UInt8}
Number Unique:  6

AlcGp
Summary Stats:
Length:         88
Type:           CategoricalArrays.CategoricalString{UInt8}
Number Unique:  4

TobGp
Summary Stats:
Length:         88
Type:           CategoricalArrays.CategoricalString{UInt8}
Number Unique:  4

NCases
Summary Stats:
Mean:           2.272727
Minimum:        0.000000
1st Quartile:   0.000000
Median:         1.000000
3rd Quartile:   4.000000
Maximum:        17.000000
Length:         88
Type:           Int32

NControls
Summary Stats:
Mean:           11.079545
Minimum:        1.000000
1st Quartile:   3.000000
Median:         6.000000
3rd Quartile:   14.000000
Maximum:        60.000000
Length:         88
Type:           Int32



### Trimmed vector

Remove outliers (the top `α` and bottom `α` fractions) with `mean(trim(vec, prop = α))`, where $\alpha \in [0, 0.5)$. `trimmean()` is deprecated. See the [docs for `trim()`](http://juliastats.github.io/StatsBase.jl/stable/robust.html#StatsBase.trim) for details. `trim()` also accepts a keyword argument `count`, the number of elements to remove from each end. The cell below tests `mean(trim())` against a manual computation. `N * α` is chosen to be an even integer (3460) in an attempt to avoid numerical error.

In [48]:
N = 10000
vec = randn(N)
sort!(vec)
println(vec[1:10])
α = 0.346
trimBottomCount = trunc(Int, round(N * α))
trimTopCount = N - trimBottomCount
testtrimmean = mean(vec[trimBottomCount+1:trimTopCount])
println("predicted = $(testtrimmean)")
calculatedmean = mean(trim(vec, prop = α))
println("calculated = $(calculatedmean)")

[-4.07998, -3.96588, -3.92512, -3.6632, -3.38367, -3.25907, -3.24791, -3.21946, -3.19977, -3.18661]
predicted = -0.0003665493449321315
calculated = -0.00037287136724384524


The test failed: the two means differ. Finding the cause ...

In [49]:
trimBottomCount+1, trimTopCount

(3461, 6540)

In [52]:
mean(vec[3461:6540])

-0.0003665493449321315

In [54]:
trimvec = trim(vec, prop = α)

3082-element Array{Float64,1}:
 -0.404622
 -0.404197
 -0.403942
 -0.403936
 -0.403199
 -0.402794
 -0.402415
 -0.40227 
 -0.401893
 -0.401789
 -0.401323
 -0.401114
 -0.400594
  ⋮       
  0.382229
  0.382527
  0.3826  
  0.382795
  0.382962
  0.383058
  0.383589
  0.383957
  0.384121
  0.384309
  0.384402
  0.384404

In [55]:
find(vec .== trimvec[1])

1-element Array{Int64,1}:
 3460

In [56]:
find(vec .== trimvec[end])

1-element Array{Int64,1}:
 6541
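The indices 3460 and 6541 show that `trim` kept 3082 elements, so it removed 3459 from each end, one fewer than the 3460 used in the manual computation. A likely explanation is floating-point rounding: `0.346` has no exact binary representation, so `N * α` falls just below 3460. The sketch below assumes that `trim` derives its count as `floor(Int, N * prop)`; that is an assumption about the StatsBase implementation, consistent with the output but not proven by it.

```julia
N = 10000
α = 0.346

product = N * α
println(product)                 # slightly below 3460 in Float64 arithmetic
println(floor(Int, product))     # count under the assumed flooring rule
println(round(Int, product))     # count used in the manual check above

# Passing an explicit integer count avoids the ambiguity;
# `count` is the other keyword argument of `trim` mentioned above:
# mean(trim(vec, count = round(Int, N * α)))
```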

### Weighted mean

In [57]:
vec = rand(3)
wv = rand(3)
poids = weights(wv)

3-element StatsBase.Weights{Float64,Float64,Array{Float64,1}}:
 0.338022
 0.563825
 0.616274

In [58]:
moyenne = mean(vec, wv)

LoadError: ArgumentError: reduced dimension(s) must be integers

In [59]:
moyenne = mean(vec, poids)

0.25367502178027934

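To sum up, a plain vector passed as the second argument is interpreted as the dimension(s) to reduce over, hence the error. Wrap it first: `mean(vec, weights(wv))`.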
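For reference, the weighted mean is $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$. A minimal check by hand, assuming Julia 1.x and no packages (the values are illustrative, not the random draws above):

```julia
x = [1.0, 2.0, 4.0]
w = [0.5, 0.25, 0.25]

# Weighted mean: sum of w_i * x_i divided by the sum of the weights.
wmean = sum(w .* x) / sum(w)
println(wmean)   # 2.0
```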

In [64]:
a = reshape(collect(1:10), 2, :)'

5×2 Array{Int64,2}:
 1   2
 3   4
 5   6
 7   8
 9  10

In [65]:
var(a)

9.166666666666666

In [66]:
var(a,1)

1×2 Array{Float64,2}:
 10.0  10.0

In [67]:
var(a,2)

5×1 Array{Float64,2}:
 0.5
 0.5
 0.5
 0.5
 0.5

I skip `mean_and_var()` and `mean_and_std()`, which return the mean together with the variance or the standard deviation.

### Skewness and kurtosis

In [69]:
vec = rand(1000)
skewness(vec), kurtosis(vec)

(-0.08426243377088599, -1.2023956710962824)

In [71]:
vecn = randn(1000)
skewness(vecn), kurtosis(vecn)

(0.014621506824230007, -0.22845742852590467)

### Central moments

In [72]:
moment(vec, 3), moment(vecn, 3)

(-0.0020109181128410013, 0.01484123969017431)

In [73]:
moment(vec, 4), moment(vecn, 4)

(0.012351165888606163, 2.827215745952343)
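The outputs above are consistent with central moments normalised by $1/n$ and with `kurtosis` returning the excess kurtosis (about $-1.2$ for uniform data, about $0$ for normal data). A sketch of those relations by hand, assuming Julia 1.x with the `Statistics` standard library; `central_moment` is a hypothetical helper name:

```julia
using Statistics

# k-th central moment with the 1/n normalisation.
central_moment(x, k) = mean((x .- mean(x)) .^ k)

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # illustrative data with a right tail
m2, m3, m4 = central_moment(x, 2), central_moment(x, 3), central_moment(x, 4)

skew = m3 / m2^1.5          # skewness
exkurt = m4 / m2^2 - 3      # excess kurtosis (0 for a normal distribution)
println((skew, exkurt))
```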

### Variations

`span` returns `minimum(x):maximum(x)` of an *integer* array.

In [95]:
a = rand(1:100, 4)
span(a)

59:99

Coefficient of variation: $c_{v}={\sigma  \over \mu}$

In [96]:
variation(a)

0.23275541634948538

[Standard error of mean](https://www.statsdirect.com/help/basic_descriptive_statistics/standard_deviation.htm): $\text{SEM} = \frac{s}{\sqrt{n}}$

In [97]:
sem(a)

9.222933372848358
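Both formulas can be checked by hand. This sketch assumes Julia 1.x with the `Statistics` standard library. The vector `a` is not printed in the notebook; the values below are inferred from the `span`, `variation`, `sem` and `zscore` outputs, so treat them as an assumption.

```julia
using Statistics

a = [90, 69, 99, 59]             # inferred from the outputs (assumption)

cv = std(a) / mean(a)            # coefficient of variation, σ / μ
se = std(a) / sqrt(length(a))    # standard error of the mean, s / √n
println((cv, se))
```

Note that `std` here is the sample standard deviation (divisor $n - 1$).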

[MAD](http://datalearning.eu/wp-content/uploads/2016/02/Outliers-Robust-Statistics.pdf) is a robust measure of spread, that is, one that is barely affected by outliers. The book claims `mad()` to be the mean absolute deviation, but the [doc](http://juliastats.github.io/StatsBase.jl/latest/scalarstats.html#StatsBase.mad) says it is the median absolute deviation. Pass `center` as a keyword argument to avoid a `MethodError`; calling `mad(a)` without it triggers the deprecation warning shown below.

In [99]:
mad(a), mad(a, center=6)

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] #mad#32(::Void, ::Void, ::Function, ::Array{Int64,1}) at C:\Users\Owner\.julia\v0.6\StatsBase\src\scalarstats.jl:259
 [3] mad(::Array{Int64,1}) at C:\Users\Owner\.julia\v0.6\StatsBase\src\scalarstats.jl:253
 [4] include_string(::String, ::String) at .\loading.jl:522
 [5] include_string(::Module, ::String, ::String) at C:\Users\Owner\.julia\v0.6\Compat\src\Compat.jl:71
 [6] execute_request(::ZMQ.Socket, ::IJulia.Msg) at C:\Users\Owner\.julia\v0.6\IJulia\src\execute_request.jl:158
 [7] (::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})

(22.239033277584028, 108.97126306016173)
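Both numbers can be reproduced as a median absolute deviation scaled by the usual consistency constant $1/\Phi^{-1}(3/4) \approx 1.4826$, which supports the "median" reading of the docs. The sketch assumes Julia 1.x with the `Statistics` standard library; `raw_mad` is a hypothetical helper, and `a` is inferred from the outputs rather than printed in the notebook.

```julia
using Statistics

a = [90, 69, 99, 59]   # inferred from the outputs above (assumption)

# Median absolute deviation around a chosen center.
raw_mad(x, center = median(x)) = median(abs.(x .- center))

# Approximate consistency constant; makes the MAD comparable to the
# standard deviation for normally distributed data.
k = 1.4826
println((k * raw_mad(a), k * raw_mad(a, 6)))
```

With `center = 6` the deviations are measured from a point far outside the data, which is why the second value is so much larger.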

In [100]:
zscore(a)

4-element Array{Float64,1}:
  0.582786
 -0.55568 
  1.0707  
 -1.09781 
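The z-score is $z_i = (x_i - \bar{x}) / s$. A one-line equivalent, assuming Julia 1.x with the `Statistics` standard library and the same inferred vector `a` (an assumption, since the notebook does not print it):

```julia
using Statistics

a = [90, 69, 99, 59]           # inferred from the outputs above (assumption)
z = (a .- mean(a)) ./ std(a)   # subtract the mean, divide by the sample std
println(z)
```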

### Entropy

From [Wiki](https://en.wikipedia.org/wiki/Dirichlet_distribution), $\mathrm {H} (X)=\mathrm {E} [\mathrm {I} (X)]=\mathrm {E} [-\ln(\mathrm {P} (X))]$.

In [101]:
using Distributions

PDF of the Dirichlet distribution:
$$f(x_1,\dots, x_{K}; \alpha_1,\dots, \alpha_K) = \frac{1}{\mathrm{B}(\alpha)} \prod_{i=1}^K x_i^{\alpha_i - 1}$$
where $\mathrm{B}$ is the multivariate beta function
$$\mathrm{B}(\alpha) = \frac{\prod_{i=1}^{K}\Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)},\qquad \alpha = (\alpha_1,\dots,\alpha_K).$$
By assumption, $x_i \ge 0$ and $\|\boldsymbol{x}\|_{1}=1$.

In [103]:
loidedir = Dirichlet([2., 4., 6.])

Distributions.Dirichlet{Float64}(alpha=[2.0, 4.0, 6.0])

In [104]:
RVdir = rand(loidedir)

3-element Array{Float64,1}:
 0.162564
 0.295005
 0.542432

In [105]:
sum(RVdir)

1.0

`entropy(p)` uses the natural logarithm; the optional second parameter of `entropy()` is the logarithm base.

In [106]:
entropy(RVdir)

0.9872600709882985
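The value above can be reproduced with $H(p) = -\sum_i p_i \ln p_i$. A minimal sketch, assuming Julia 1.x and no packages; `shannon_entropy` is a hypothetical name, and the input is the rounded sample as printed, so the result only matches to a few decimals.

```julia
# Shannon entropy of a probability vector; `b` is the logarithm base.
# Zero entries are skipped, using the convention 0 * log(0) = 0.
shannon_entropy(p) = -sum(x * log(x) for x in p if x > 0)
shannon_entropy(p, b) = shannon_entropy(p) / log(b)

p = [0.162564, 0.295005, 0.542432]   # the rounded sample printed above
println(shannon_entropy(p))          # close to the 0.98726 shown above
println(shannon_entropy(p, 2))       # the same quantity in bits
```

Note that this is the entropy of the sampled vector, read as a discrete probability distribution over three outcomes. It is not the differential entropy of the Dirichlet distribution itself.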

Still to be understood: `crossentropy()`.
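As a starting point, the standard definition of cross-entropy is $H(p, q) = -\sum_i p_i \ln q_i$. The sketch below assumes that StatsBase's `crossentropy(p, q)` follows this definition; it uses plain Julia 1.x and a hypothetical helper name `cross_entropy`.

```julia
# Cross-entropy of q relative to p: H(p, q) = -Σ p_i log(q_i).
# Assumes all entries of q are strictly positive.
cross_entropy(p, q) = -sum(p .* log.(q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_pp = cross_entropy(p, p)      # equals the entropy of p
h_pq = cross_entropy(p, q)      # never smaller than H(p) (Gibbs' inequality)
println((h_pp, h_pq, h_pq - h_pp))   # the difference is the KL divergence
```

Intuitively, $H(p, q)$ measures the average cost of encoding outcomes drawn from `p` with a code optimised for `q`; it is smallest when `q` equals `p`.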