---
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
    number_sections: true
title: ""
author: "[User-764Q](https://github.com/User-764Q)"
date: "`r paste0('Last Run: ', format(Sys.time(), '%A %d-%B-%Y'))`"
params:
  param1: "Don't Forget about params"
---
<style>
#TOC {
  font-family: Calibri;
  font-size: 16px;
  border-color: #3D68DF;
  background: #3D68DF;
}
body {
  font-family: Garamond;
  font-size: 16px;
  border-color: #D0D0D0;
  background-color: #D0D0D0;
  color: #1A1A1A;
}
pre {
  color: #1A1A1A;
  background: #D0D0D0;
  background-color: #D0D0D0;
  font-family: Calibri;
}
</style>
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(collapse = TRUE)
knitr::opts_chunk$set(warning = TRUE)
knitr::opts_chunk$set(message = TRUE)
knitr::opts_chunk$set(include = TRUE)
custom_black <- '1A1A1A'
custom_white <- 'C0C0C0'
custom_grey_dark <- '6F6F6F'
custom_grey_light <- 'B2B2B2'
custom_accent_blue <- '3D6BFF'
```
```{r, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## K-Means Clustering
This page demonstrates K-Means Clustering with some imaginary data.
### Loading Libraries
```{r libraries, include = TRUE, message = FALSE, echo = TRUE}
library(tidyverse)
library(cluster)
library(factoextra)
library(meanShiftR)
library(ggthemes)
```
### Creating the data
Sampling from a series of single-digit integers, with some more common than others. Two columns of data are created, to be displayed in a 2D scatter plot.
#### Setting up data
```{r creating data, include = TRUE, echo = TRUE}
# size of test data as a power of ten: 2 = 100 rows, 3 = 1,000, 4 = 10,000
zeroes <- 4
# k means requires you to tell it how many clusters there are
# the sample data will create 4 clusters
clusters <- 4
# randomly select data from 1 - 6 with more 1's and 5's than other numbers
# this will give two clusters in this dimension
var_a <- sample(c(1,1,1,2,3,4,5,5,5,6), 1*10^zeroes, replace = TRUE)
# do the same thing again
# two clusters in this dimension
var_b <- sample(c(1,1,1,1,2,2,3,4,5,5,5,5,6), 1*10^zeroes, replace = TRUE)
# create a dataframe of the sample data
variables <- data.frame(var_a = var_a,
                        var_b = var_b)
```
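Repeating a value in the vector passed to `sample()` makes it proportionally more likely to be drawn, which is what makes the 1's and 5's more common here. A minimal standalone sketch to check this; the names `pool`, `draws`, `expected` and `observed`, and the seed, are made up for the illustration and are not used elsewhere on this page.

```r
# standalone sketch, not part of the analysis on this page
set.seed(42)  # arbitrary seed so the draws are repeatable

# same pool as var_a: 1 and 5 each appear three times out of ten
pool <- c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6)
draws <- sample(pool, 10000, replace = TRUE)

# share of each value in the pool, and in the draws
expected <- prop.table(table(pool))
observed <- prop.table(table(draws))

# the two rows should be close to each other
round(rbind(expected, observed), 2)
```

The same weighting could also be written with the `prob` argument of `sample()`, but repeating values keeps the code easy to read.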
#### Adding noise
Adding some random noise to the numbers so they are a bit more interesting and realistic.
```{r, include = TRUE, echo = TRUE}
# add some noise to the sample data
# add a random number from -1 to 1 to each variable
variables <- variables %>%
  mutate(var_a = var_a + runif(1*10^zeroes, min = -1, max = 1),
         var_b = var_b + runif(1*10^zeroes, min = -1, max = 1))
```
#### Plotting the data
Plotting the data to check for clusters. The clusters show up fairly clearly, and there are four. K-Means requires you to specify the number of clusters, and plotting is a quick way to check how many there are. In this case I knew there would be four because I made the data.
```{r, include = TRUE, echo = TRUE}
# plotting the pretend data to see how the clusters look
variables %>%
  ggplot(aes(x = var_a, y = var_b)) +
  geom_point() +
  theme_few()
# two clusters in each direction for a total of four
```
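When a plot is less clear cut, a common alternative is the elbow method: run K-Means for a range of cluster counts and look for the point where the total within-cluster sum of squares stops dropping quickly. A rough standalone sketch on freshly simulated data of the same shape; `pts`, `ks`, `wss`, the seed, the sample size and `nstart = 10` are all choices made for the illustration, not part of the analysis on this page.

```r
# standalone sketch of the elbow method, base R only
set.seed(1)  # arbitrary seed so the result is repeatable
n <- 2000    # smaller than the page's data, to keep it quick

# simulated points built the same way as var_a and var_b
pts <- data.frame(
  x = sample(c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1),
  y = sample(c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1)
)

# total within-cluster sum of squares for k = 1 to 8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

# look for a bend in the curve
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The bend is a judgement call rather than a hard rule, so it is worth reading alongside the scatter plot.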
### Running K-Means
Running the K-Means function and plotting the results, one colour per cluster.
```{r, running the k-means function, include = TRUE, echo = TRUE}
# running the k-means function
# one component of the result is a vector giving the cluster number each point
# belongs to
km <- kmeans(variables, clusters)
# adding a column to the test data with the cluster that each
# point falls in
variables <- variables %>%
  mutate(cluster = km$cluster) %>%
  # converting to factor, or ggplot will shade it on a continuous scale rather than
  # using discrete colours
  mutate(cluster = factor(cluster, levels = 1:clusters))
```
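The object returned by `kmeans()` holds more than the cluster numbers. A small standalone sketch of the other components worth a look, again on simulated data; `pts`, `fit`, the seed and `nstart = 25` are assumptions for the illustration. Because `kmeans()` starts from randomly chosen centres, setting a seed and using several starts makes the result more repeatable.

```r
# standalone sketch: what is inside a kmeans() result
set.seed(123)  # arbitrary seed, kmeans() picks random starting centres
n <- 1000

pts <- data.frame(
  var_a = sample(c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1),
  var_b = sample(c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1)
)

# nstart = 25 tries 25 random starts and keeps the best one
fit <- kmeans(pts, centers = 4, nstart = 25)

fit$centers        # one row per cluster, one column per variable
fit$size           # number of points in each cluster
fit$tot.withinss   # total within-cluster sum of squares
head(fit$cluster)  # cluster number for the first few points
```

The cluster numbers themselves are arbitrary labels, so cluster 1 in one run may be cluster 3 in another.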
### Plotting results
Scatter plot of all the points, coloured by the cluster K-Means assigned each one to.
```{r, plotting the results, include = TRUE, echo = TRUE}
# plotting the test data with the colours indicating the cluster
variables %>%
  ggplot(aes(x = var_a, y = var_b, col = cluster)) +
  geom_point() +
  theme_few()
```
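It can also help to mark the cluster centres on the same plot, to see where K-Means thinks the middle of each group is. A standalone sketch, assuming `ggplot2` is installed; `pts`, `fit`, `centres`, `p`, the seed and the marker styling are made up for the illustration.

```r
# standalone sketch: scatter plot with the cluster centres marked
library(ggplot2)

set.seed(7)  # arbitrary seed so the result is repeatable
n <- 1000

pts <- data.frame(
  var_a = sample(c(1, 1, 1, 2, 3, 4, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1),
  var_b = sample(c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5, 6), n, replace = TRUE) +
    runif(n, min = -1, max = 1)
)

# fit first, while pts still holds only the two numeric columns
fit <- kmeans(pts, centers = 4, nstart = 25)
pts$cluster <- factor(fit$cluster)

# the centres keep the column names of the data they were fitted on
centres <- as.data.frame(fit$centers)
centres$cluster <- factor(seq_len(nrow(centres)))

# points coloured by cluster, centres drawn as black crosses
p <- ggplot(pts, aes(x = var_a, y = var_b, colour = cluster)) +
  geom_point(alpha = 0.5) +
  geom_point(data = centres, colour = "black", shape = 4, size = 5, stroke = 2)
p
```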