# TravisTorrent Analysis

* Data source: https://travistorrent.testroots.org/dumps/travistorrent_8_2_2017.csv.gz
* Data format: https://travistorrent.testroots.org/page_dataformat/


Beller M, Gousios G, Zaidman A. (2017). TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration. In: Proceedings of the 14th Working Conference on Mining Software Repositories.

@inproceedings{msr17challenge,
 title={TravisTorrent: Synthesizing Travis CI and GitHub for Full-Stack Research on Continuous Integration},
 author={Beller, Moritz and Gousios, Georgios and Zaidman, Andy},
 booktitle={Proceedings of the 14th working conference on mining software repositories},
 preprint={http://www.st.ewi.tudelft.nl/~mbeller/publications/2017_beller_gousios_zaidman_travistorrent_synthesizing_travis_ci_and_github_for_full-stack_research_on_continuous_integration.pdf},
 year={2017}
}

In [None]:
# Only needed when NOT using mybinder

install.packages('data.table')

# R.utils provides gunzip(), which is used below to decompress the dump
install.packages('R.utils')

In [None]:
# Download and decompress the dump on the first run only; later runs reuse the local CSV.
library('R.utils')
if (!file.exists("travistorrent_8_2_2017.csv")) {
  # mode="wb" keeps the gzip archive intact on Windows
  download.file("https://travistorrent.testroots.org/dumps/travistorrent_8_2_2017.csv.gz", "travistorrent_8_2_2017.csv.gz", mode="wb")
  gunzip("travistorrent_8_2_2017.csv.gz", remove=FALSE, overwrite=TRUE)
}

library('data.table')
tt <- fread("travistorrent_8_2_2017.csv")

We keep only the columns needed for the analysis and drop all rows that have no build duration. A build with several sub-jobs appears once per job in the dump, so we also keep just one row per `tr_build_id`.


In [None]:
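# Drop rows without a build duration (tr_duration is NA)
tt <- tt[!is.na(tr_duration)]
# Keep only the columns needed for the analysis
tt <- tt[, c("tr_build_id","git_branch","gh_project_name","gh_build_started_at","tr_duration")]
# Keep one row per build; builds with several sub-jobs are repeated in the dump
tt <- tt[!duplicated(tt$tr_build_id),]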
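
As a minimal illustration of these two filters, the sketch below applies them to a small hypothetical table in base R. The values are made up and only mimic the shape of the dump.

```r
# Hypothetical toy data: build 1 has two jobs, build 2 has no recorded duration
toy <- data.frame(
  tr_build_id = c(1, 1, 2, 3),
  tr_duration = c(120, 120, NA, 60)
)

# Drop rows without a duration, then keep the first row of each build
toy <- toy[!is.na(toy$tr_duration), ]
toy <- toy[!duplicated(toy$tr_build_id), ]

print(toy)  # builds 1 and 3 remain, one row each
```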

In [None]:
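# Assign start and duration to variables
tt_start <- tt$gh_build_started_at
tt_duration <- tt$tr_duration

# Parse the start times at minute resolution.
# tz="UTC" is an assumption; a fixed time zone avoids gaps caused by local daylight saving time.
tt_start_epoch <- as.POSIXct(tt_start, format="%Y-%m-%d %H:%M", tz="UTC")
# POSIXct arithmetic adds the duration as seconds, which gives the end times
tt_end_epoch <- tt_start_epoch + tt_duration
# Truncate the end times to whole minutes as well.
# trunc() avoids the round trip through strptime(), which can return NA for midnight timestamps in recent R versions.
tt_end_epoch <- as.POSIXct(trunc(tt_end_epoch, units="mins"))

# Find min and max time, ignoring timestamps that could not be parsed
time_min <- min(tt_start_epoch, na.rm=TRUE)
time_max <- max(tt_end_epoch, na.rm=TRUE)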
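
The time handling above can be checked on a single value. The following sketch uses a hypothetical timestamp and duration (both invented for illustration) and assumes the duration is given in seconds.

```r
# Hypothetical start time and duration in seconds
start <- as.POSIXct("2017-02-08 10:35:58", format = "%Y-%m-%d %H:%M", tz = "UTC")
duration <- 150

# The format string stops at minutes, so the seconds are dropped on parsing
end <- start + duration

# Truncate the end time to whole minutes
end_min <- as.POSIXct(trunc(end, units = "mins"))

format(start, "%H:%M:%S")    # "10:35:00"
format(end, "%H:%M:%S")      # "10:37:30"
format(end_min, "%H:%M:%S")  # "10:37:00"
```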

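The aggregation follows this Stack Overflow answer: https://stackoverflow.com/a/20426276/1779346. Each build contributes +1 at its start time and -1 at its end time. Sorting all events by time and taking the cumulative sum then yields the number of builds running concurrently at each point in time.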

In [None]:
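options(digits.secs=0)
# One event per build start (+1) and one per build end (-1)
queries.start <- data.frame(Time=tt_start_epoch, Value=1)
queries.end <- data.frame(Time=tt_end_epoch, Value=-1)

# Combine both event types and sort them chronologically
queries.both <- rbind(queries.start, queries.end)
queries.both <- queries.both[order(queries.both$Time), ]

# The running total of the events is the number of concurrent builds
queries.sum <- data.frame(Time=queries.both$Time, Queries=cumsum(queries.both$Value))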
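
The same event-counting idea can be traced by hand on a tiny example. This is a sketch with three hypothetical builds whose start and end times are plain minute offsets rather than real timestamps.

```r
# Hypothetical builds: start and end given as minute offsets
starts <- c(0, 1, 2)
ends   <- c(3, 4, 5)

# +1 for every start, -1 for every end
events <- rbind(
  data.frame(Time = starts, Value = 1),
  data.frame(Time = ends, Value = -1)
)

# Sort chronologically and accumulate
events <- events[order(events$Time), ]
events$Running <- cumsum(events$Value)

print(events$Running)  # 1 2 3 2 1 0
```

All three builds overlap between minute 2 and minute 3, which is where the running count peaks at 3.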

In [None]:
plot(queries.sum, type="l", ylab="Concurrent builds")

In [None]:
saveRDS(queries.sum, file="dataset.Rda")

# The file is written in RDS format (despite the .Rda extension) and can be loaded using
# queries.sum <- readRDS(file="dataset.Rda")