-
Notifications
You must be signed in to change notification settings - Fork 13
/
FacebookNetwork.Rmd
290 lines (233 loc) · 12 KB
/
FacebookNetwork.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
title: "My Facebook network"
author: "Benjamin Chan (https://www.facebook.com/benjamin.ks.chan)"
output:
html_document:
keep_md: yes
toc: yes
---
Build off of the first assignment from the Coursera Social Network Analysis course. Below are part of the instructions from the first assignment from that course. After downloading my Facebook network data, I examine centrality and communities in the network using `r R.version$version.string` with the `igraph` package. I used the RStudio IDE with Knitr to generate this HTML file. This analysis was run on `r as.character(Sys.time())`.
Getting Facebook network data
-----------------------------
The following are instructions for downloading my own GML file is below. These instructions were taken from assignment 1 from the fall 2012 Coursera Social Network Analysis course.
> In order to get your own network, complete the following steps:
>
> * Go to [http://snacourse.com/getnet](http://snacourse.com/getnet)
> * Choose which user data (e.g. "wall posts count") you'd like to include, for this assignment no additional data is necessary, but whatever you do download, you can visualize/analyze (profile age rank: oldest profile = highest value, declarative intensity (length of text in fields like activities, books, etc.)
> * Save the .gml file and load it into Gephi using "File -> Open...".
Read the GML file.
```{r ReadGraph}
setwd("~/GitHub repositories/FacebookNetworkAnalysis")
require(igraph, quietly=TRUE)
G <- read.graph(file="ChanFacebook.gml", format="gml")
```
My GML file was last modified on `r as.character(file.info("ChanFacebook.gml")$mtime)`. As of that date, there are `r vcount(G)` nodes.
Create first name and initials vectors from names. I'll want to use these to label nodes when plotting the network.
```{r NodeLabels}
listName <- strsplit(V(G)$label, " ")
nameF <- sapply(listName, head, 1)
nameL <- sapply(listName, tail, 1)
initF <- substr(nameF, 1, 1)
initL <- substr(nameL, 1, 1)
initials <- paste0(initF, initL)
label <- nameF
head(label)
```
Analysis questions
------------------
I'm interested in two questions:
* What communities exist in my network?
* Who are the people who are "bridges" across communities?
Centrality
----------
Here's a link to [Wikipedia](http://en.wikipedia.org/wiki/Centrality) for some background information on centrality.
Calculate **degree** centrality. This ends up not being too useful. Degree is really just a measure of how connected someone is. In Facebook terms, it's who has the most friends within my network.
```{r Degree}
deg <- degree(G)
summary(deg)
```
The median degree, or number of friends, was `r median(deg)`. The highest number of friends a person in my network has was `r max(deg)`.
Even though I don't want to focus on degree centrality, let's see who are the most connected people in my network.
```{r HighDegree, results='asis'}
require(xtable, quietly=TRUE)
lim <- sort(deg, decreasing=TRUE)[round(vcount(G) * 0.05)]
top <- data.frame("Name"=V(G)$label, "Degree"=deg)
top <- subset(top, deg >= lim)
top <- top[order(top$Degree, decreasing=TRUE),]
print(xtable(top, digits=0), type="html", include.rownames=FALSE)
```
Calculate **closeness** centrality. Closeness is a measure of how many steps are required to access every other node. It's a measure of how close a node is to all the action. A person with high closeness, however, doesn't necessarily have to have very many friends or be in between relationship.
```{r Closeness}
close <- closeness(G)
summary(close)
```
Again, I don't want to focus on closeness centrality since it's not really what I'm after in this analysis, so I won't say anything more about it.
Calculate **betweenness** centrality. I'm going to focus on betweenness since it's going to get after one of my analysis questions, who are the "bridges" between people and communities? Betweenness is a measure of how often a node is in the pathway between two other nodes. I.e., a person with high betweenness can be a key player in introducing a large group of friends to another large group of friends. Such a person doesn't necessarily have to have a large number of friends themselves. But they could be in a unique position of influence in the network.
Plot a histogram of betweenness scores.
```{r Betweenness, fig.height=4}
require(ggplot2, quietly=TRUE)
btwn <- betweenness(G)
summary(btwn)
qplot(btwn, binwidth=1000)
```
The absolute value of the scores don't mean much. But their relative values tell the story. There are a number of people with extremely high betweenness scores. Who are these people? List the 25 of people with the highest betweeness scores.
```{r HighBetweenness, results='asis'}
rank <- length(btwn) - rank(btwn) + 1
top <- data.frame("Rank"=rank, "Name"=V(G)$label, "Betweenness"=btwn)
lim <- sort(btwn, decreasing=TRUE)[round(vcount(G) * 0.05)]
# top <- subset(top, btwn >= lim)
top <- subset(top, rank <= 25)
top <- top[order(top$Betweenness, decreasing=TRUE),]
print(xtable(top, digits=0), type="html")
```
Plot the association between degree centrality and betweenness centrality. See if there are any highly influential people (betweenness) who also have a high number of friends (degree).
```{r AssociationCentrality, fig.height=4, fig.width=4}
rsq <- format(cor(deg, btwn) ^2, digits=3)
cntrl <- data.frame(deg, btwn, close)
ggplot(cntrl, aes(x=deg, y=btwn)) +
geom_jitter(alpha=1/2) +
scale_y_log10() +
labs(x="Degree", y="Betweenness") +
annotate("text", label=paste("R-sq =", rsq), x=+Inf, y=1, hjust=1)
```
Communities
-----------
Find communities using the edge betweenness algorithm.
```{r FindCommunities}
C <- edge.betweenness.community(G)
sizesOrdered <- sizes(C)[order(sizes(C), decreasing=TRUE)]
sizesOrdered
maxComm <- 11
commThreshold <- sizesOrdered[maxComm]
```
There are `r maxComm` clearly large communities in my network. *Large* is defined as having more than `r commThreshold` people.
Set the `r maxComm + 1` community as a *junk* community.
```{r}
C$membership[!(C$membership %in% as.numeric(names(sizesOrdered[1:maxComm])))] <- 999
C$membership <- unclass(factor(C$membership))
table(C$membership)
```
Visualize the network
---------------------
Scale the size of a node's plotting symbol according to its betweenness centrality score. Scaling is by percentile (75%, 90%, 95%, 97.5%, 99%).
```{r ScaleSymbol, fig.height=4}
q <- quantile(btwn, probs=c(0, 0.75, 0.9, 0.95, 0.975, 0.99, 1))
size <- cut(btwn, breaks=q, include.lowest=TRUE, dig.lab=5)
summary(size)
V(G)$size <- unclass(size)
```
Label the nodes with the top betweenness centrality within each community. This will help make sense of who are the important links between each community.
```{r SubsetLabels}
labNode <- nameF
labNode[labNode == "Marc"] <- paste(labNode[labNode == "Marc"], initL[labNode == "Marc"]) # Since there's 2 Marcs, paste their last initial to the label
rankWithin <- ave(btwn, C$membership, FUN=function(x) length(x) - rank(x) + 1)
pctWithin <- 7
topWithin <- round(pctWithin / 100 * sizes(C))
message(sprintf("The top %3.1f%% within each community leads to labelling %2.0f nodes.", pctWithin, sum(topWithin)))
# Need to find a more elegant way to do this
isLabelled1 <- C$membership == 1 & rankWithin <= topWithin[ 1]
isLabelled2 <- C$membership == 2 & rankWithin <= topWithin[ 2]
isLabelled3 <- C$membership == 3 & rankWithin <= topWithin[ 3]
isLabelled4 <- C$membership == 4 & rankWithin <= topWithin[ 4]
isLabelled5 <- C$membership == 5 & rankWithin <= topWithin[ 5]
isLabelled6 <- C$membership == 6 & rankWithin <= topWithin[ 6]
isLabelled7 <- C$membership == 7 & rankWithin <= topWithin[ 7]
isLabelled8 <- C$membership == 8 & rankWithin <= topWithin[ 8]
isLabelled9 <- C$membership == 9 & rankWithin <= topWithin[ 9]
isLabelled10 <- C$membership == 10 & rankWithin <= topWithin[10]
isLabelled11 <- C$membership == 11 & rankWithin <= topWithin[11]
isLabelled12 <- C$membership == 12 & rankWithin <= topWithin[12]
isLabelled <- isLabelled1 | isLabelled2 | isLabelled3 | isLabelled4 | isLabelled5 | isLabelled6 | isLabelled7 | isLabelled8 | isLabelled9 | isLabelled10 | isLabelled11 | isLabelled12
# isLabelledOther <- grepl("Cat$|Jackie|Moi", nameF)
# isLabelled <- isLabelled | isLabelledOther
labNode[!isLabelled] <- NA
# labNode[C$membership != 5 & nameF != "Cat"] <- NA
```
Create a function for `plot.igraph`.
```{r FunctionPlot, tidy=FALSE}
P <- function(vertex.label.cex=1, vertex.label.color="#0000007F") {
set.seed(seed)
plot(C, G,
vertex.label=labNode,
vertex.label.color=vertex.label.color,
vertex.label.dist=1/4,
vertex.label.family="sans",
vertex.label.cex=vertex.label.cex,
vertex.frame.color=NA,
colbar=node.color,
mark.groups=NA,
mark.col=NA,
mark.border=palette,
edge.color=c("#7F7F7F3F", "#FF00003F")[crossing(C, G)+1],
edge.width=1/2
)
}
```
Show low resolution plot for HTML file. Set the random number seed so when a high resolution version is created, it will have the same layout as this version.
```{r NetworkVisualizationLowRes, dpi=72, fig.height=12, fig.width=12}
id <- communities(C)[1:maxComm]
require(RColorBrewer, quietly=TRUE)
palette <- c(brewer.pal(maxComm, "Spectral"), "gray")
node.color <- c(palette[1:maxComm], rep(palette[maxComm + 1], max(C$membership) - maxComm))
seed <- file.info("ChanFacebook.gml")$mtime
P(vertex.label.cex=1)
```
Create high resolution PNG file for importing into presentations.
```{r NetworkVisualizationHiRes, dpi=600, fig.height=12, fig.width=12, fig.show='hide'}
png(filename="NetworkVisualizationHiRes.png", width=12, height=12, units="in", res=600)
P(vertex.label.cex=1/2, vertex.label.color="#000000")
dev.off()
```
Create a data frame of the network.
```{r}
D <- data.frame(name=paste(nameF, nameL), comm=C$membership, isLabelled, btwn, close, deg)
D[order(D$comm, !D$isLabelled), ]
```
* Community 1 is family and relatives.
* Community 2 is [Marathon Maniacs](http://www.marathonmaniacs.com/).
* Community 3 is trail runners, including Trail Factor, Animal Athletics, and BananaSluggers.
* Community 4 is [Union High School](https://www.facebook.com/pages/Union-High-School/127917213038), mostly students with non-math teachers and staff. Some [Mountain View](https://www.facebook.com/GoThunder.org) students who are connected to Union people.
* Community 5 is the [Union High School](https://www.facebook.com/pages/Union-High-School/127917213038) math teachers.
* Community 6 is OHSU BICC.
* Community 7 is my [Concordia MATE](http://www.cu-portland.edu/coe/graduate/mat/) cohort.
* Community 8 is Cat's side of the family and her friends.
* Community 9 is a secondary group of Portland friends.
* Community 10 is a core group of Portland friends.
* Community 11 is a secondary group of Portland friends that has connections to Community 9.
* Community 12 is the *misfits* community of people that either
+ Don't fit in with a larger community within my network, or
+ Truly bridge the other 11 well-defined communities to an extent that they don't belong in a single community
* There are a number of key people that connect 2 or more communities.
Create interactive 3-D high resolution.
Open the `network.html` file in a browser.
```{r}
require(networkD3)
edges <- data.frame(get.edgelist(G))
names(edges) <- c("source", "target")
edges$source <- edges$source - 1
edges$target <- edges$target - 1
edges$value <- 1
vertices <- D[, c("name", "comm")]
names(vertices) <- c("name", "group")
vertices$name <- as.character(vertices$name)
N <- forceNetwork(Links=edges, Nodes=vertices,
Source="source", Target="target", Value = "value",
NodeID = "name", Group="group",
opacity=3/4)
saveNetwork(N, "index.html")
```
**Don't use `rglplot`.**
```{r NetworkVisualization3D, tidy=FALSE, eval=FALSE}
l <- layout.fruchterman.reingold(G, dim=3)
rglplot(G,
layout=l,
vertex.label=labNode,
vertex.label.color="black",
vertex.label.dist=1/4,
vertex.label.family="sans",
vertex.label.cex=1/2,
vertex.frame.color=NA,
edge.color=c("gray", "red")[crossing(C, G)+1],
edge.width=1/8
)
```