Data Science Box of Pandora Miscellaneous

Status

lines of R code: 90, lines of test code: 27

Version

0.4.0 ( 2020-10-15 11:30:17 )

Note

If you find any of the package’s functionality useful and have a package dedicated to a particular set of problems where one or two of these functions would find a more suitable home, please feel free to start a conversation by opening a GitHub issue.

Description

Tool collection for common and not so common data science use cases. This includes custom-made algorithms for data management as well as value calculations that are hard to find elsewhere because of their specificity, but that would be a waste to lose. Currently available functionality: find sub-graphs in an edge list data.frame, find the mode or modes in a vector of values, extract specific regular expression groups, generate ISO time stamps that play well with file names, and generate URL parameter lists by expanding value combinations.

License

GPL (>= 2)
Peter Meissner [aut, cre]

Citation

citation("dsmisc")

Meissner P (2020). dsmisc: Data Science Box of Pandora Miscellaneous. R package version 0.4.0.

BibTeX for citing

toBibtex(citation("dsmisc"))

@Manual{,
  title = {dsmisc: Data Science Box of Pandora Miscellaneous},
  author = {Peter Meissner},
  year = {2020},
  note = {R package version 0.4.0},
}

Installation

Stable version from CRAN:

install.packages("dsmisc")

Usage

starting up …

library("dsmisc")

Graph computations

find isolated graphs / networks

A graph, described by an edge list, that consists of two disconnected subgraphs.

edges_df <- 
  data.frame(
    node_1 = c(1:5, 10:8),
    node_2 = c(2:6, 7,7,7)
  )

edges_df

##   node_1 node_2
## 1      1      2
## 2      2      3
## 3      3      4
## 4      4      5
## 5      5      6
## 6     10      7
## 7      9      7
## 8      8      7

Finding subgraphs and grouping them together via subgraph id.

edges_df$subgraph_id <- 
  graphs_find_subgraphs(
    id_1    = edges_df$node_1,
    id_2    = edges_df$node_2,
    verbose = 0
  )

edges_df

##   node_1 node_2 subgraph_id
## 1      1      2           1
## 2      2      3           1
## 3      3      4           1
## 4      4      5           1
## 5      5      6           1
## 6     10      7           2
## 7      9      7           2
## 8      8      7           2
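The grouping above amounts to labelling the connected components of an undirected graph. The following is a minimal base R sketch of that idea using a simple union-find; `find_components` is a hypothetical helper written for illustration and is not the implementation used by `graphs_find_subgraphs()`.

```r
# Hypothetical helper, not part of dsmisc: label the connected components
# of an undirected graph given as two parallel vectors of node ids.
find_components <- function(id_1, id_2) {
  stopifnot(length(id_1) == length(id_2))
  nodes  <- unique(c(id_1, id_2))
  parent <- seq_along(nodes)            # every node starts as its own root

  # follow parent pointers up to the root of a node's set
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  a <- match(id_1, nodes)
  b <- match(id_2, nodes)
  for (k in seq_along(a)) {
    root_a <- find_root(a[k])
    root_b <- find_root(b[k])
    if (root_a != root_b) parent[root_b] <- root_a   # merge the two sets
  }

  roots <- vapply(a, find_root, integer(1))
  match(roots, unique(roots))           # relabel roots as 1, 2, 3, ...
}

edges <- data.frame(node_1 = c(1:5, 10:8), node_2 = c(2:6, 7, 7, 7))
edges$component <- find_components(edges$node_1, edges$node_2)
edges$component
```

The sketch favours readability over speed (no path compression), so it is not a substitute for the package function on large edge lists.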

Speed test for a large graph

edges_df <- 
  data.frame(
    node_1 = sample(x = 1:10000, size = 10^5, replace = TRUE),
    node_2 = sample(x = 1:10000, size = 10^5, replace = TRUE)
  )

system.time({
  edges_df$subgraph_id <- 
    graphs_find_subgraphs(
      id_1    = edges_df$node_1,
      id_2    = edges_df$node_2,
      verbose = 0
    )
})

##    user  system elapsed 
##    2.71    0.02    2.81

Stats Functions

Calculating the mode of a collection of values

# one mode only 
stats_mode(1:10)

## Warning in stats_mode(1:10): modus : multimodal but only one value returned (use warn=FALSE to turn this off)

## [1] 1

# all values if multiple modes are found
stats_mode_multi(1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10
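For readers who want to see what "all modes" means in plain R, here is a minimal sketch. `all_modes` is a hypothetical name used for illustration; it is not how `stats_mode_multi()` is implemented.

```r
# Hypothetical helper, not part of dsmisc: return every value that
# occurs with the highest frequency.
all_modes <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))   # frequency of each unique value
  values[counts == max(counts)]
}

all_modes(c(1, 2, 2, 3, 3))   # two modes: 2 and 3
all_modes(1:10)               # every value occurs once, so all are modes
```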

String Functions

The {stringr} / {stringi} packages are cool … but can they do this? (Actually they can, of course, but with a little more work and cognitive load, e.g.: stringr::str_match(strings, "([\\w])_(?:\\d+)")[, 2].)

Extract specific RegEx groups

strings <- paste(LETTERS, seq_along(LETTERS), sep = "_")

# whole pattern
str_group_extract(strings, "([\\w])_(\\d+)")

##  [1] "A_1"  "B_2"  "C_3"  "D_4"  "E_5"  "F_6"  "G_7"  "H_8"  "I_9"  "J_10" "K_11" "L_12" "M_13" "N_14" "O_15"
## [16] "P_16" "Q_17" "R_18" "S_19" "T_20" "U_21" "V_22" "W_23" "X_24" "Y_25" "Z_26"

# first group
str_group_extract(strings, "([\\w])_(\\d+)", 1)

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

# second group
str_group_extract(strings, "([\\w])_(\\d+)", 2)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
## [22] "22" "23" "24" "25" "26"
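The same kind of group extraction can be approximated with base R alone. The sketch below assumes Perl-compatible patterns and uses `regexec()` with `regmatches()`; `extract_group` is a hypothetical helper for illustration, not the code behind `str_group_extract()`.

```r
# Hypothetical helper, not part of dsmisc: extract one capture group
# (0 = whole match) from each string using base R only.
extract_group <- function(x, pattern, group = 0) {
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(
    matches,
    # element 1 is the whole match, element group + 1 is the requested group
    function(m) if (length(m) > group) m[group + 1] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

strings <- paste(LETTERS, seq_along(LETTERS), sep = "_")
extract_group(strings[1:3], "([\\w])_(\\d+)", 1)
extract_group(strings[1:3], "([\\w])_(\\d+)", 2)
```

In this sketch, strings that do not match yield `NA`.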

Data.Frame Manipulation

Transform factor columns in a data.frame to character vectors

df <- 
  data.frame(
    a = 1:2, 
    b = factor(c("a", "b")), 
    c = as.character(letters[3:4]), 
    stringsAsFactors = FALSE
  )
vapply(df, class, "")

##           a           b           c 
##   "integer"    "factor" "character"

df_df <- df_defactorize(df)
vapply(df_df, class, "")

##           a           b           c 
##   "integer" "character" "character"
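For comparison, the conversion can be sketched in base R as follows. `defactorize` is a hypothetical stand-in for illustration and may differ from `df_defactorize()` in details such as attribute handling.

```r
# Hypothetical helper, not part of dsmisc: convert every factor column
# of a data.frame to character and leave all other columns untouched.
defactorize <- function(df) {
  is_factor <- vapply(df, is.factor, logical(1))
  if (any(is_factor)) {
    df[is_factor] <- lapply(df[is_factor], as.character)
  }
  df
}

df <- data.frame(
  a = 1:2,
  b = factor(c("a", "b")),
  c = letters[3:4],
  stringsAsFactors = FALSE
)
vapply(defactorize(df), class, "")
```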

Time Manipulation

File name ready time stamps

# current time
time_stamp()

## [1] "2020-10-15_13_35_53"

time_stamp(
  ts  = as.POSIXct(c("2010-01-27 10:23:45", "2010-01-27 10:23:45")),
  sep = c("","_","")
)

## [1] "20100127_102345" "20100127_102345"

time_stamp(
  ts  = as.POSIXct(c("2010-01-27 10:23:45", "2010-01-27 10:23:45")),
  sep = c("")
)

## [1] "20100127102345" "20100127102345"
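A similar stamp can be produced with `format()` from base R. The sketch below uses hypothetical argument names (`date_sep`, `mid_sep`, `time_sep`) that do not correspond to the `sep` argument of `time_stamp()`; it only illustrates the idea of avoiding characters such as `:` and spaces in file names.

```r
# Hypothetical helper, not part of dsmisc: build a file-name-safe time stamp.
# Separators are pasted into a strftime format, so they must not contain "%".
file_stamp <- function(ts = Sys.time(), date_sep = "-", mid_sep = "_", time_sep = "_") {
  fmt <- paste0(
    "%Y", date_sep, "%m", date_sep, "%d",
    mid_sep,
    "%H", time_sep, "%M", time_sep, "%S"
  )
  format(ts, fmt)
}

ts <- as.POSIXct("2010-01-27 10:23:45", tz = "UTC")
file_stamp(ts)
file_stamp(ts, date_sep = "", time_sep = "")
```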

Web Scraping

prepare multiple URLs via query parameter grid expansion

web_gen_param_list_expand(id=1:3, lang=c("en", "de"))

## [1] "id=1&lang=en" "id=2&lang=en" "id=3&lang=en" "id=1&lang=de" "id=2&lang=de" "id=3&lang=de"
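The grid expansion itself can be sketched with `expand.grid()` from base R. `expand_params` is a hypothetical helper for illustration; unlike a production version it does not URL-encode values, and it is not the implementation of `web_gen_param_list_expand()`.

```r
# Hypothetical helper, not part of dsmisc: build one query string per
# combination of the supplied parameter values (values are NOT URL-encoded).
expand_params <- function(...) {
  # the first parameter varies fastest, as in expand.grid()
  grid  <- expand.grid(..., stringsAsFactors = FALSE)
  pairs <- Map(
    function(name, values) paste0(name, "=", values),
    names(grid), grid
  )
  do.call(paste, c(unname(pairs), sep = "&"))
}

urls <- paste0("https://example.com/search?", expand_params(id = 1:3, lang = c("en", "de")))
urls
```

The base URL `https://example.com/search` is a placeholder.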

Tools

tool_i_fit_index(i = -13:13, index = 7)

##  [1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6

tool_i_fit_obj(i = -13:13, obj = letters)

##  [1] "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
## [27] "m"
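The outputs above are consistent with 1-based modular wrapping: any integer index, including zero and negative values, is mapped into the range 1 to n. Assuming that is the intended behaviour, the arithmetic can be sketched in base R as follows; `wrap_index` and `wrap_pick` are hypothetical names, not package functions.

```r
# Hypothetical helpers, not part of dsmisc.
# Map any integer index onto 1..n by wrapping around (1-based modulo).
wrap_index <- function(i, n) ((i - 1) %% n) + 1

# Pick elements of obj using wrapped indices.
wrap_pick <- function(i, obj) obj[wrap_index(i, length(obj))]

wrap_index(c(-1, 0, 1, 7, 8), 7)   # 6 7 1 7 1
wrap_pick(c(0, 1, 27), letters)    # "z" "a" "a"
```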
