Apache Projects at Strata EU

Douglas Ashton
7 June 2016

At the beginning of June Mango made the trip to the Strata + Hadoop World big data conference at the ExCeL Centre in London. I've been to Strata twice now and find it a great way to catch up with how people are using different technologies to get their jobs done. It can be really hard to keep track of all the latest tools and how they complement one another, so an event like Strata really brings it all together.

One thing that's really noticeable is the dramatic increase in the number of projects from the Apache Software Foundation. Apache is possibly still most famous for its eponymous HTTP server but it is also a driving force in the open source community. Many of the most famous big data technologies have been donated to the Apache Software Foundation, such as Hadoop and Spark.

Counting project mentions

With so many new names each year I decided to use a bit of scraping to help keep track of it all. The data came from the Strata + Hadoop EU web pages. The information there is rendered dynamically rather than served as static HTML, so there were some manual steps to retrieve it. I extracted what I needed into a JSON file.

strata <- jsonlite::fromJSON("strata.json", simplifyDataFrame = FALSE)
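Before pulling anything out it's worth checking the shape of the data. A quick peek, assuming (as the code below does) that each year maps to a list of talks with title and description fields:

# Look at the first talk for one year to confirm the fields we rely on
str(strata[["2016"]][[1]], max.level = 1)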

Starting with just this year, what Apache projects were mentioned in all of the titles and abstracts? It turns out that everyone was talking about Kafka!

# Start with this year
year <- "2016"

# Extract the titles and descriptions from the web data
titles <- vapply(strata[[year]], `[[`, character(1), "title")
descriptions <- vapply(strata[[year]], `[[`, character(1), "description")

# Match anything that comes after the word "Apache"
allApache <- na.omit(stringi::stri_match(c(titles, descriptions), regex = "[Aa]pache (\\w+)"))[,2]
apache <- unique(allApache)
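
The counts in the table below come from tabulating allApache; one way to do that (a sketch, not necessarily the original code) is:

# Number of mentions per project, most mentioned first
sort(table(allApache), decreasing = TRUE)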
Project     Count  Link
Kafka           9  http://kafka.apache.org/
Hadoop          7  http://hadoop.apache.org/
Spark           7  http://spark.apache.org/
Beam            3  http://beam.incubator.apache.org/
NiFi            3  https://nifi.apache.org/
Cassandra       2  http://cassandra.apache.org/
Drill           2  https://drill.apache.org/
Eagle           2  https://eagle.incubator.apache.org/
Flink           2  https://flink.apache.org/
Kudu            1  http://getkudu.io/

Co-occurrence of projects

As well as what's being mentioned, I was interested in how the projects fit together. I decided to use co-occurrence as a proxy for related tech. The code below looks for co-occurrence of the project names within a title or description and forms an edge wherever it finds one. It gets a little long to account for the special cases that can come up.

library(dplyr) # for munging

# Default no edges
el <- data.frame(from=character(0), to = character(0))

# When edges/nodes are few things can break

if(length(apache) > 1) {
  # Logical matrix: one row per title/description, one column per project
  occur <- sapply(apache, grepl, c(titles, descriptions), ignore.case = TRUE)
  
  # Keep only texts that mention more than one project (drop = FALSE preserves the matrix)
  occur <- occur[rowSums(occur) > 1, , drop = FALSE]
  
  if(nrow(occur) > 0) {
    # If every remaining text mentions exactly two projects, apply() simplifies to a matrix
    if (all(rowSums(occur) == 2)) {
      el <- as.data.frame(t(apply(occur, 1, function(x) t(combn(apache[x], 2)))),
                          stringsAsFactors = FALSE)
    } else {
      # Mixed counts: apply() returns a list of pair matrices, so rbind them together
      el <- as.data.frame(do.call("rbind", apply(occur, 1, function(x) t(combn(apache[x], 2)))),
                          stringsAsFactors = FALSE)
    }
    names(el) <- c("from", "to")
  }
}

Once the hard part is done we can build a graph.

library(dplyr)  # for the counting
library(igraph) # for the graph

# I prefer to count edges before going to igraph
el <- el %>% group_by(from, to) %>% summarise(weight = n())
# Data frame of vertices
vl <- data.frame(name = apache, mentions = as.numeric(table(allApache)[apache]))

# Build the graph ---------------------------------------------------------
g <- graph_from_data_frame(el, vertices = vl, directed = FALSE)

plot(g, edge.width=E(g)$weight, vertex.size = 5*V(g)$mentions)
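
One quick way to read the graph numerically (not part of the original post) is igraph's strength(), which sums the edge weights at each vertex and so ranks the projects by how often they appear alongside others:

# Weighted degree: total co-occurrence weight attached to each project
sort(strength(g), decreasing = TRUE)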

Apache through the years

If you feel that the complexity of the Apache project ecosystem is increasing then you are not wrong! We can look at how this co-occurrence graph changes through the years. In 2012 Hadoop was the only project mentioned, and others were gradually added. I remember last year as being the year of Spark, and that seems to pan out in the numbers here. Interestingly, it seems that mentions of Hadoop itself are declining in relative terms. This could suggest that it's simply assumed the big data stack lives in a Hadoop ecosystem, or perhaps alternatives to HDFS mean it's no longer quite the core of big data.

Strata Apache Projects all years
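
The per-year numbers can be checked along these lines (a minimal sketch, assuming every year in strata.json has the same talk structure as 2016; apacheMentions is a helper defined here, not in the original code):

# Hypothetical helper: extract all Apache project mentions from one year's talks
apacheMentions <- function(talks) {
  titles <- vapply(talks, `[[`, character(1), "title")
  descriptions <- vapply(talks, `[[`, character(1), "description")
  na.omit(stringi::stri_match(c(titles, descriptions), regex = "[Aa]pache (\\w+)"))[, 2]
}

# Top three projects for each year in the data
mentionsByYear <- lapply(strata, apacheMentions)
lapply(mentionsByYear, function(x) head(sort(table(x), decreasing = TRUE), 3))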
