update readme
alastair rushworth committed Apr 3, 2019
1 parent 1db76a6 commit a149cbf
Showing 16 changed files with 330 additions and 108 deletions.
180 changes: 140 additions & 40 deletions README.Rmd
@@ -26,101 +26,201 @@ The package has three aims:
+ to support quick visualisation of data frames


Key functions
---

+ `report_types()` summary of column types
+ `report_mem()` summary of memory usage of columns
+ `report_na()` columnwise prevalence of missing values
+ `report_cor()` correlation coefficients of numeric columns
+ `report_imb()` feature imbalance of categorical columns
+ `report_num()` summaries of numeric columns
+ `report_cat()` summaries of categorical columns



Installation
---

To install the development version of the package, use the command
To install the development version of the package, use
```{r, eval = FALSE}
devtools::install_github("alastairrushworth/reporter")
# load the package
library(reporter)
```

Then load the package and the `starwars` data.
```{r}
# load reporter
```{r, echo = FALSE}
library(reporter)
```



Illustrative data: `starwars`
---

The examples below make use of the `starwars` data from the `dplyr` package

```{r}
# some example data
data(starwars, package = "dplyr")
```

For illustrating comparisons of dataframes, use the `starwars` data and produce two new dataframes `star_1` and `star_2` that randomly sample the rows of the original and drop a couple of columns.

```{r, message=FALSE, warning=FALSE}
library(dplyr)
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
```

## Single data frame summaries

#### Column types

To explore the column types in a data frame, use the function `report_types`. The command returns a tibble summarising the percentage of columns with a particular type. A barplot is also returned when `show_plot = TRUE`.
##### `report_types()` for a single df

To explore the column types in a data frame, use the function `report_types()`. The command returns a `tibble` summarising the counts and percentages of columns with particular types. A barplot is also returned when `show_plot = TRUE`.

```{r}
# return tibble and visualisation of columns types
report_types(starwars, show_plot = T)
report_types(starwars, show_plot = TRUE)
```
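
The counts and percentages described above can be approximated in base R. This is a minimal sketch on a toy data frame (hypothetical values standing in for `starwars`), not necessarily how `report_types()` is implemented; the output column names `type`, `cnt` and `pcnt` are assumptions chosen for illustration.

```r
# Toy data frame (hypothetical values) standing in for `starwars`
df <- data.frame(
  name       = c("a", "b", "c"),
  height     = c(172L, 167L, 96L),
  mass       = c(77, 75, 32),
  birth_year = c(19, 112, 33),
  stringsAsFactors = FALSE
)

# Take the first class of each column, then tabulate the types
col_types <- vapply(df, function(x) class(x)[1], character(1))
type_cnt  <- table(col_types)

# Count and percentage of columns for each type, most common first
type_summary <- data.frame(
  type = names(type_cnt),
  cnt  = as.integer(type_cnt),
  pcnt = 100 * as.integer(type_cnt) / ncol(df),
  stringsAsFactors = FALSE
)
type_summary <- type_summary[order(-type_summary$cnt), ]
print(type_summary)
```

Using only the first element of `class()` keeps one type per column even when a column carries several classes.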

##### `report_types()` for two dfs

When a second dataframe is provided, `report_types()` will create a dataframe comparing the count and percentage of each column type for each of the input dataframes. The summaries for the first and second dataframes are shown in columns with names appended with `_1` and `_2`, respectively.

```{r}
report_types(star_1, star_2, show_plot = TRUE)
```



#### Memory usage

To explore the memory usage of the columns in a data frame, use the function `report_mem`. The command returns a tibble containing the memory usage and percentage of total usage for each column in the data frame. A barplot is also returned when `show_plot = TRUE`.
##### `report_mem()` for a single df

To explore the memory usage of the columns in a data frame, use `report_mem()`. The command returns a `tibble` containing the size of each column in the dataframe. A barplot is also returned when `show_plot = TRUE`.

```{r}
report_mem(starwars, show_plot = TRUE)
```
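
For intuition, per-column sizes can be computed in base R with `object.size()`. The sketch below uses a toy data frame, and the column names `col_name`, `bytes` and `pcnt` are assumptions; `report_mem()` may measure and label things differently.

```r
# Toy data frame (hypothetical values)
df <- data.frame(
  id    = 1:1000,
  x     = rnorm(1000),
  label = rep("a", 1000),
  stringsAsFactors = FALSE
)

# Size of each column in bytes
col_bytes <- vapply(df, function(x) as.numeric(object.size(x)), numeric(1))

# Size and share of the total, largest column first
mem_summary <- data.frame(
  col_name = names(col_bytes),
  bytes    = unname(col_bytes),
  pcnt     = 100 * unname(col_bytes) / sum(col_bytes),
  stringsAsFactors = FALSE
)
mem_summary <- mem_summary[order(-mem_summary$bytes), ]
print(mem_summary)
```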

##### `report_mem()` for two dfs

When a second dataframe is provided, `report_mem()` will create a dataframe comparing the size of each column for both input dataframes. The summaries for the first and second dataframes are shown in columns with names appended with `_1` and `_2`, respectively.

```{r}
report_mem(starwars, show_plot = T)
report_mem(star_1, star_2, show_plot = TRUE)
```


#### Missing values

`report_na` is used to report the number and proportion of missing values contained within each column in a data frame. The command returns a tibble containing the count (`cnt_na`) and the overall percentage (`pcnt_na`) of missing values by column in the data frame. A barplot is also returned when `show_plot` is set to `TRUE`.
##### `report_na()` for a single df

`report_na()` summarises the prevalence of missing values by each column in a data frame. A tibble containing the count (`cnt`) and the overall percentage (`pcnt`) of missing values is returned. A barplot is also returned when `show_plot` is set to `TRUE`.

```{r}
report_na(starwars, show_plot = TRUE)
```
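
The `cnt` and `pcnt` summaries described above amount to a column-wise count of `NA` values. A minimal base R sketch on a toy data frame (hypothetical values; the package's own implementation may differ):

```r
# Toy data frame with some missing values (hypothetical)
df <- data.frame(
  height = c(172, NA, 96, NA),
  mass   = c(77, 75, NA, 32),
  name   = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
cnt  <- colSums(is.na(df))
pcnt <- 100 * cnt / nrow(df)

# Assemble the summary, columns with most missing values first
na_summary <- data.frame(
  col_name = names(cnt),
  cnt      = unname(cnt),
  pcnt     = unname(pcnt),
  stringsAsFactors = FALSE
)
na_summary <- na_summary[order(-na_summary$cnt), ]
print(na_summary)
```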

##### `report_na()` for two dfs

When a second dataframe is provided, `report_na()` returns a tibble containing counts and percentage missingness by column, with summaries for the first and second data frames shown in columns with names appended with `_1` and `_2`, respectively. In addition, a $p$-value is calculated, which provides a measure of evidence for whether the missingness rates of the two data frames differ.

```{r}
report_na(starwars, show_plot = T)
report_na(star_1, star_2, show_plot = TRUE)
```

Notes:

+ Smaller $p$-values indicate stronger evidence of a difference in the missingness rate for a single column
+ If a column appears in one data frame and not the other, for example `height` appears in `star_1` but not `star_2`, then the corresponding `pcnt_`, `cnt_` and `p_value` columns will contain `NA`
+ Where the missingness is identically 0, the `p_value` is `NA`.
+ The visualisation illustrates the significance of the difference using a coloured bar overlay. Orange bars indicate evidence of equality of missingness, while blue bars indicate inequality. If a `p_value` cannot be calculated, no coloured bar is shown.
+ The significance level can be specified using the `alpha` argument to `report_na()`. The default is `alpha = 0.05`.
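
The README does not say which test produces the `p_value`. As an illustration only, one common choice for comparing two missingness rates is a two-sample test of proportions via `prop.test()`; the counts below are hypothetical, and the use of `prop.test()` is an assumption rather than the package's documented method.

```r
# Hypothetical counts for one column present in both data frames
n_na  <- c(12, 25)  # number of missing values in data frame 1 and 2
n_row <- c(50, 50)  # number of rows in data frame 1 and 2

# Two-sample test of equal proportions (assumed test, see note above)
test  <- prop.test(x = n_na, n = n_row)

# Compare against the significance level, mirroring `alpha = 0.05`
alpha <- 0.05
p_value     <- test$p.value
significant <- p_value < alpha
print(c(p_value = p_value, significant = significant))
```

A small `p_value` here suggests the two missingness rates differ.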


#### Correlation

`report_cor` returns a tibble containing Pearson's correlation coefficient, confidence intervals and $p$-values between pairs of numeric columns in a data frame. The function combines the functionality of `cor()` and `cor.test()` into a more useable wrapper. A point and whiskers plot is also returned when `show_plot = TRUE`.
##### `report_cor()` for a single df

`report_cor()` returns a tibble containing Pearson's correlation coefficient, confidence intervals and $p$-values for pairs of numeric columns. The function combines the functionality of `cor()` and `cor.test()` in a more convenient wrapper. A point and whiskers plot is also returned when `show_plot = TRUE`.

```{r}
report_cor(starwars, show_plot = TRUE)
```

__Notes__
Notes

+ The tibble is sorted in descending order of the absolute coefficient.
+ The tibble is sorted in descending order of the absolute coefficient $|\rho|$.
+ `report_cor` drops missing values prior to calculation of each correlation coefficient.
+ The `p_value` column is associated with the null hypothesis $H_0: \rho = 0$.
+ The `p_value` is associated with the null hypothesis $H_0: \rho = 0$.
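
The quantities listed in these notes are what base R's `cor.test()` reports for a single pair of columns. The sketch below uses simulated data; the output column names (`col_1`, `col_2`, `corr`, `lower`, `upper`) are assumptions for illustration and may not match the tibble returned by `report_cor()`.

```r
# Simulated pair of numeric columns with a couple of missing values
set.seed(1)
df <- data.frame(x = rnorm(30))
df$y <- df$x + rnorm(30, sd = 0.5)
df$y[c(3, 7)] <- NA

# Pearson correlation test; incomplete pairs are dropped by cor.test()
ct <- cor.test(df$x, df$y)

# Coefficient, p-value for H0: rho = 0, and 95% confidence interval
res <- data.frame(
  col_1   = "x",
  col_2   = "y",
  corr    = unname(ct$estimate),
  p_value = ct$p.value,
  lower   = ct$conf.int[1],
  upper   = ct$conf.int[2]
)
print(res)
```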

##### `report_cor()` for two dfs

When a second dataframe is provided, `report_cor()` returns a tibble that compares correlation coefficients of the first dataframe to those in the second. The `p_value` column contains a measure of evidence for whether the two correlation coefficients are equal or not.

```{r}
report_cor(star_1, star_2, show_plot = TRUE)
```

Notes:

+ Smaller `p_value` indicates stronger evidence against the null hypothesis $H_0: \rho_1 = \rho_2$ and an indication that the true correlation coefficients differ.
+ The visualisation illustrates the significance of the difference using a coloured bar overlay. Orange bars indicate evidence of equality of correlations, while blue bars indicate inequality. If a `p_value` cannot be calculated, no coloured bar is shown.
+ The significance level can be specified using the `alpha` argument to `report_cor()`. The default is `alpha = 0.05`.
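
The README does not state how the `p_value` for $H_0: \rho_1 = \rho_2$ is computed. One standard approach for two independent samples is Fisher's $z$-transformation, sketched below; `compare_cor()` is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper: two-sided test of H0: rho_1 = rho_2 using
# Fisher's z-transformation, assuming two independent samples
compare_cor <- function(r1, n1, r2, n2) {
  z_diff <- atanh(r1) - atanh(r2)             # difference on the z scale
  se     <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3)) # standard error of the difference
  2 * pnorm(-abs(z_diff / se))                # two-sided p-value
}

# Identical coefficients give no evidence of a difference
p_same <- compare_cor(r1 = 0.5, n1 = 50, r2 = 0.5, n2 = 50)

# Very different coefficients give a small p-value
p_diff <- compare_cor(r1 = 0.8, n1 = 50, r2 = 0.1, n2 = 50)

print(c(p_same = p_same, p_diff = p_diff))
```

Note that `star_1` and `star_2` are sampled from the same data, so the independence assumption made in this sketch would only hold approximately for them.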



#### Feature imbalance

Categorical features where each element is identical (or nearly) are often removed or scrutinised more closely. The function `report_imbalance` helps to find categorical columns that are dominated by a single feature level and returns a tibble containing the columns: `col_name` the categorical column names; `value` the most frequently occurring categorical level in each column; `percent` the percentage frequency with which the value occurs. The tibble is sorted in descending order of the `percent`. A barplot is also returned when `show_plot` is set to `TRUE`.
##### `report_imb()` for a single df

Understanding categorical columns that are dominated by a single level can be useful. `report_imb()` returns a tibble containing categorical column names (`col_name`), the most frequently occurring categorical level in each column (`value`), and the percentage (`pcnt`) and count (`cnt`) with which that value occurs. The tibble is sorted in descending order of `pcnt`. A barplot is also returned when `show_plot` is set to `TRUE`.

```{r}
report_imb(starwars, show_plot = TRUE)
```
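
The imbalance summary described above can be sketched in base R with `table()`. The data are hypothetical, and counting `NA` as its own level (`useNA = "ifany"`) is an assumption made here; `report_imb()` may treat missing values differently.

```r
# Toy categorical data (hypothetical values)
df <- data.frame(
  gender  = c("male", "male", "male", "female", NA),
  species = c("Human", "Droid", "Human", "Human", "Human"),
  stringsAsFactors = FALSE
)

# For each column, find the most common level with its count and percentage
imb <- do.call(rbind, lapply(names(df), function(nm) {
  tab <- sort(table(df[[nm]], useNA = "ifany"), decreasing = TRUE)
  data.frame(
    col_name = nm,
    value    = names(tab)[1],
    pcnt     = 100 * as.integer(tab[1]) / nrow(df),
    cnt      = as.integer(tab[1]),
    stringsAsFactors = FALSE
  )
}))

# Most imbalanced columns first
imb <- imb[order(-imb$pcnt), ]
print(imb)
```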

##### `report_imb()` for two dfs

When a second dataframe is provided, `report_imb()` returns a tibble that compares the frequency of the most common categorical values of the first dataframe to those in the second. The `p_value` column contains a measure of evidence for whether the true frequencies are equal or not.

```{r}
report_imb(starwars, show_plot = T)
report_imb(star_1, star_2, show_plot = TRUE)
```

+ Smaller `p_value` indicates stronger evidence against the null hypothesis that the true frequency of the most common values is the same.
+ The visualisation illustrates the significance of the difference using a coloured bar overlay. Orange bars indicate evidence of equality of the imbalance, while blue bars indicate inequality. If a `p_value` cannot be calculated, no coloured bar is shown.
+ The significance level can be specified using the `alpha` argument to `report_imb()`. The default is `alpha = 0.05`.


#### Numeric summaries

`report_num` generates statistical summaries of numeric columns contained in a data frame, combining some of the functionality of `summary` and `hist`. The tibble returned contains standard numerical summaries (min, max, mean, median etc.), but also the percentage of missing entries (`percent_na`) and a simple histogram (`hist`). If `show_plot = TRUE` a histogram is generated for each numeric feature.
`report_num()` combines some of the functionality of `summary()` and `hist()` by returning summaries of numeric columns. It returns standard numerical summaries (`min`, `q1`, `mean`, `median`, `q3`, `max`, `sd`), as well as the percentage of missing entries (`pcnt_na`) and a simple histogram (`hist`). If `show_plot = TRUE`, a histogram is generated for each numeric feature.

```{r}
report_num(starwars, show_plot = T, breaks = 10)
report_num(starwars, show_plot = TRUE, breaks = 10)
```

The `hist` column is a list whose elements are tibbles each containing a simple histogram with the relative frequency of counts for each feature. These tibbles are used to generate the histograms shown when `show_plot = TRUE`. For example, the histogram for `starwars$birth_year` is
The `hist` column is a list whose elements are tibbles each containing the relative frequencies of bins for each feature. These tibbles are used to generate the histograms when `show_plot = TRUE`. For example, the histogram for `starwars$birth_year` is

```{r}
report_num(starwars)$hist$birth_year
```
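
A table of relative bin frequencies like the one stored in `hist` can be built with base R's `hist(..., plot = FALSE)`. This is a sketch on hypothetical values; the column names `lower`, `upper` and `prop` are assumptions and may not match the tibbles the package returns.

```r
# Hypothetical numeric column with a missing value
x <- c(19, 112, 33, 41.9, 52, 47, NA, 24, 57, 41.9)

# Compute bins without plotting; non-finite values are ignored by hist()
h <- hist(x, breaks = 10, plot = FALSE)

# Relative frequency of observations in each bin
hist_tbl <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  prop  = h$counts / sum(h$counts)
)
print(hist_tbl)
```

In base R, `breaks = 10` is treated as a suggestion, so the number of bins produced may differ from 10.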


#### Categorical levels

`report_cat` returns a tibble summarising categorical features in the data frame. This combines the functionality of `report_imbalance` and the `table` function. If `show_plot = TRUE` a barplot is generated showing the relative split. The tibble generated contains the columns
`report_cat()` returns a tibble summarising categorical features in a data frame, combining the functionality of the `report_imb()` and `table()` functions. If `show_plot = TRUE` a barplot is generated showing the relative split. The tibble generated contains the columns

+ `col_name` name of the column
+ `cnt` the number of unique levels in the feature
+ `common` the most common level (see also `report_imb`)
+ `col_name` name of each categorical column
+ `cnt` the number of unique levels in the feature
+ `common` the most common level (see also `report_imb()`)
+ `common_pcnt` the percentage occurrence of the most dominant level
+ `levels` a list of tibbles each containing frequency tabulations of all levels.
+ `levels` a list of tibbles each containing frequency tabulations of all levels

```{r}
report_cat(starwars, show_plot = TRUE)
@@ -135,28 +235,28 @@ report_cat(starwars)$levels$hair_color
Note that by default, if `NA` values are present, they are counted as a distinct categorical level.


## Comparing data frames

In addition to printing summaries for a single data frame, each `report_` function can accept two dataframe objects and produces statistics and plots comparing the two dataframes.

#### Example

To keep things simple, suppose we take the `starwars` data and produce two new dataframes `star_1` and `star_2` that randomly sample the rows of the original and drop a couple of columns.

```{r, message=FALSE, warning=FALSE}
library(dplyr)
star_1 <- starwars %>% sample_n(50)
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
```

#### Comparing column types

When a second dataframe is provided, `report_types` will create a dataframe comparing the count and percentage of each column type. To enable an easy comparison, the dataframe names are embedded into the column names.

```{r}
report_types(star_1, star_2, show_plot = T)
```

+ Comparison of column types between the two data frames
















