# Dfplyr: Data Wrangling for Humans

I have to admit something that is sometimes looked down upon in the data science community: I'm better at R than Python. And although R served me well in the past, I've since converted to Python for its wide range of libraries, robust ecosystem of developer tools, and battle-tested stability. But old habits die hard, and the one package I miss the most is dplyr.

Base R has a clunky interface for manipulating dataframes. There's lots of repetition, and it doesn't feel very natural or organic. Look at this example, which counts diamonds per color among the diamonds priced in the top quartile. We'll first do it with only base R functions:

```r
> diamonds_filtered <- diamonds[diamonds$price > quantile(diamonds$price, probs = 0.75),]
> aggregate(price ~ color, diamonds_filtered, NROW)
  color price
1     D  1152
2     E  1552
3     F  2112
4     G  3186
5     H  2525
6     I  1891
7     J  1067
```

Now compare the same data manipulation in dplyr:

```r
> diamonds %>% 
+     filter(price > quantile(price, probs = 0.75)) %>% 
+     group_by(color) %>% 
+     summarise(N = n())
# A tibble: 7 x 2
  color     N
  <ord> <int>
1 D      1152
2 E      1552
3 F      2112
4 G      3186
5 H      2525
6 I      1891
7 J      1067
```

So although the base R version is half the number of lines, it took me much longer to write. Maybe that's because I'm better versed in dplyr, but I also find dplyr more logical. The components fit together nicely, and the functions follow how I model these steps in my head. It probably also took you as a reader longer to decipher what each step of the base R version was doing, for the exact same reason.

Dplyr, along with a swathe of other packages, was created by Hadley Wickham, in collaboration with various developers, a number of years ago to make R a more elegant and productive language. This bundle is called the [tidyverse](https://www.tidyverse.org/). At this point these packages might as well be included in the default installation, and nowadays either `dplyr` or `data.table` is what the majority of R users reach for when doing data manipulation.

## Enter Pandas

The pandas library is the de facto data analysis library for Python. It also shares many similarities with how base R is put together.
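To make the resemblance concrete, here is a minimal sketch of the diamonds count from earlier, written in plain pandas. I'm assuming a small made-up table in place of the real diamonds data, so the names `diamonds`, `threshold`, and `counts` and all the values are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds dataset (values are made up).
diamonds = pd.DataFrame({
    "color": ["D", "D", "E", "E", "F", "F", "G", "G", "G", "H", "I", "J"],
    "price": [300, 5000, 6000, 500, 700, 900, 7000, 9000, 1200, 2000, 3000, 4000],
})

# Boolean-mask row selection, much like diamonds[diamonds$price > quantile(...), ] in base R.
threshold = diamonds["price"].quantile(0.75)
diamonds_filtered = diamonds[diamonds["price"] > threshold]

# Count rows per color, comparable to aggregate(price ~ color, ..., NROW).
counts = diamonds_filtered.groupby("color").size()
print(counts)
```

Note how `diamonds` has to be spelled out again inside the brackets, and how the filtered result lands in an intermediate variable before the grouping step: the same repetition as in the base R version.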