/
mclust.Rmd
213 lines (151 loc) · 5.37 KB
/
mclust.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: "Exploring model-based clustering with condvis2"
author: "Catherine B. Hurley"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Exploring model-based clustering with condvis2}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width=5, fig.height=5 ,fig.align="center"
)
fpath <- ""
```
`mclust` is a R package that offers
- model-based clustering: `Mclust`
- classification ( discriminant analysis): `MclustDA`
- density estimation `densityMclust`
all based on mixtures of Normals.
`condvis2` offers an interface to all of these.
We start with
## Classification: MclustDA
```{r}
library(condvis2)
library(mclust)
```
```{r}
data(banknote)
bankDA <- MclustDA(banknote[,-1], banknote[,1],verbose=F)
```
This fits an XXX=ellipsoidal multivariate normal for the genuine notes and
EVE2 = ellipsoidal, equal volume and equal shape with two components for
counterfeits.
Condvis uses a generic `CVpredict` to provide a uniform interface to `predict` methods.
We will use it to find the number of misclassifications:
```{r}
table(banknote$Status, CVpredict(bankDA, banknote))
```
```{r echo=F, eval=F}
bankDA$models$genuine$modelName
bankDA$models$genuine$G
bankDA$models$counterfeit$modelName
bankDA$models$counterfeit$G
bankDA$models$counterfeit
```
```{r eval=F}
svars <- c("Top", "Diagonal")
cvars <- setdiff(names(banknote)[-1], svars)
condvis(data = banknote, model = bankDA,
response="Status", sectionvars=svars,conditionvars=cvars,
pointColor="Status", showsim=TRUE
)
```
The `showsim=TRUE` setting means that the condition plots will mark points considered
sufficiently near the section with a dark outline. This is the default setting only
when $n <= 150$. In the condition plots, the green points are the genuine ones, pink
are counterfeit.
Here is a views of the result:
```{r echo=FALSE, out.width='100%'}
knitr::include_graphics(paste0(fpath, "mclustda1.png"))
```
There is one point that appears to be mis-classified (though it is not).
However, as it's size is small it is not very close
to the selected section.
Selecting `Show probs`: we see there is some uncertainty about the classification on the classification
boundaries.
A different, simpler fit is got using
```{r}
bankDAe <- MclustDA(banknote[,-1], banknote[,1], modelType="EDDA",verbose=F)
```
This uses an EVE1 for each class, and there is one mis-classification, at case 70.
We can use condvis to compare the two fits:
```{r eval=F}
condvis(data = banknote, model = list(bankDA=bankDA, bankDAe=bankDAe),
response="Status", sectionvars=svars,conditionvars=cvars,
pointColor="Status", showsim=T
)
```
If you go to Tour, `Diff fits`, you can move through views where the
conditioning points are those where the fits disagree. In this case there is
only one such point.
Here `Show probs` is selected so there is some uncertainty visible at the classification
boundaries.
The screenshot shows that in this view.
```{r echo=FALSE, out.width='100%'}
knitr::include_graphics(paste0(fpath, "mclustda6.png"))
```
The mis-classified point is a genuine note (green) which is just inside the
region classified as counterfeit by EDDA.
Here also the Similarity threshold is made small so only the point on the section is visible
in the section plot.
From the condition plots it is evident this note has a low Diagonal size, like the
counterfeits.
## Density estimation
First calculate a density estimate of two variables.
```{r}
data(banknote)
dens2 <- densityMclust(banknote[,c("Diagonal","Left")],verbose=F)
summary(dens2)
```
We can visualise the density as a surface or contour plot.
Alternatively, we can use condvis to show the conditional
density, fixing one of the variables.
```{r eval=F}
condvis(data = banknote, model = dens2, response=NULL,
sectionvars="Diagonal",conditionvars="Left",
density=T, showdata=T)
```
The density of Diagonal varies with the Left value. Click on left to check this.
```{r echo=FALSE, out.width='100%'}
knitr::include_graphics(paste0(fpath, "left1.png"))
```
Estimating the density of three variables:
```{r}
dens3 <- densityMclust(banknote[,c("Right", "Bottom", "Diagonal")],verbose=F)
summary(dens3)
```
By way of comparison, here is the kernel density estimate to compare with mclust:
```{r}
library(ks)
kdens3 <- kde(banknote[,c("Right", "Bottom", "Diagonal")])
```
```{r eval=F}
condvis(data = banknote, model = list(mclust=dens3, kde=kdens3), response=NULL,
sectionvars=c("Bottom", "Diagonal"),conditionvars="Right",
density=T, showdata=T)
```
As you vary the level of Right, both densities look quite similar.
```{r echo=FALSE, out.width='100%'}
knitr::include_graphics(paste0(fpath, "dens3Right2.png"))
```
<!-- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5096736/ -->
Click on "Show 3d surface" to get a wireframe plot.
# Clustering example
```{r eval=F}
bankC <- Mclust(banknote[,-1],verbose=F) # picks 3 clusters
banknote1 <- banknote
banknote1$cluster <- factor(CVpredict(bankC, banknote))
svars <- c("Top", "Diagonal")
cvars <- c("Left", "Right" , "Bottom")
condvis(data = banknote1, model = bankC,
response="cluster", sectionvars=svars,conditionvars=cvars,
pointColor="Status", showsim=TRUE
)
```