# Describing the same data with summary statistics

### Python and R Setup

This setup allows you to use *Python* and *R* in the same notebook.

To set up a similar notebook, see quickstart instructions here:

https://github.com/dmil/jupyter-quickstart



In [18]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


### Import packages in R

In [20]:
%%R

require('tidyverse')


Loading required package: tidyverse
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘tidyverse’


### Read data

In [22]:
%%R

# Read data
df <- read.csv('housing_data.csv')

### R syntax - getting a column
In R, the `$` operator extracts a single column from a data frame.

In Python (with pandas), the equivalent would be something like `df["pct_below_poverty"]`.

In [21]:
%%R

# Get a column
df$pct_below_poverty

  [1] 19.69 10.68 25.22 25.68 25.20 12.53 15.74 13.10  7.51 27.25 37.16 16.29
 [13] 14.27 28.50  9.13 34.96 34.92 34.92 19.92 34.02 10.81 21.34 13.64 37.77
 [25] 18.65 21.76 27.14 29.75 17.12 32.86 20.80 14.55 19.09 34.41 36.23 17.82
 [37] 37.14 27.36  8.43 18.51 15.60 15.60 16.71 16.71  7.09 19.42 31.16 10.97
 [49] 22.69 25.27 11.76  9.46 14.83 12.35 20.76 12.79  7.96  5.88 13.45 21.97
 [61] 26.48 35.61  6.01  6.01 17.62 23.54  7.40 16.98 17.25 11.89  9.20  9.24
 [73] 19.84  8.20  9.92 15.38 29.09 33.55 11.51 36.13 10.76 27.67  5.30 28.12
 [85] 12.80 21.04 12.24 12.07  5.21 12.73 13.69 11.31 12.32 18.21 38.14 13.26
 [97]  6.77  3.15 10.37 14.53 21.56  5.32 23.14  8.72 15.03 18.67  7.53 14.71
[109]  8.18 12.63 43.12 16.36  9.33 10.18  9.27  7.39 36.77 11.51 10.82  6.63
[121]  6.37 11.56 15.20 16.85 13.37  3.98 27.58 27.58  5.79  9.20 16.08  5.88
[133]  4.36 10.73 14.33  9.68  3.17 25.47  8.03 23.01  8.07 14.11 20.15 10.14
[145]  8.54 13.06 16.41  9.90 11.08 18.57  9.25  2.45 25.60 11.1

### R syntax - the pipe `|>`

The pipe (`|>`) takes the value on its left and passes it as the first argument to the function call on its right, so a chain of operations reads top to bottom instead of inside out.

This is the native pipe introduced in R 4.1.0. There's also an older pipe (`%>%`) from the magrittr package that works similarly.

https://www.r-bloggers.com/2021/05/the-new-r-pipe/
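
For readers coming from Python, the idea can be sketched with a small helper. Python has no built-in pipe operator, so `pipe` below is a hypothetical function written only for this illustration. The numbers are the first five values of `pct_below_poverty` printed above.

```python
from functools import reduce
from statistics import mean


def pipe(value, *funcs):
    """Feed `value` through each function in turn, left to right."""
    return reduce(lambda acc, f: f(acc), funcs, value)


# First five values of pct_below_poverty from the printed column above
poverty = [19.69, 10.68, 25.22, 25.68, 25.20]

piped = pipe(poverty, mean)    # roughly: poverty |> mean()
nested = mean(poverty)         # roughly: mean(poverty)

# Chaining two steps: sort, then take the first (smallest) value,
# roughly: poverty |> sort() |> head(1)
smallest = pipe(poverty, sorted, lambda xs: xs[0])

print(piped == nested, smallest)
```

Both forms compute the same thing; the piped form only changes the order in which the steps are written.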

In [23]:
%%R 

df$pct_below_poverty |> 
    mean()

# Equivalent to...
mean(df$pct_below_poverty)

[1] 15.87868


In [24]:
%%R

df$pct_below_poverty |> 
    median()

[1] 13.06


In [25]:
%%R 

df$pct_below_poverty |> 
    var()

[1] 108.451


In [26]:
%%R 

df$pct_below_poverty |> 
    sd()

[1] 10.41398
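
The four statistics above are related: the standard deviation is the square root of the variance (10.41398 squared is approximately 108.451), and R's `var()` and `sd()` both use the sample (n - 1) denominator. Here is a minimal Python sketch of the same calculations using only the standard library. It runs on the first twelve printed values rather than the full column, so its results will not match the R output above.

```python
import math
import statistics

# First twelve values of pct_below_poverty, copied from the printed column above
sample = [19.69, 10.68, 25.22, 25.68, 25.20, 12.53,
          15.74, 13.10, 7.51, 27.25, 37.16, 16.29]

avg = statistics.mean(sample)
mid = statistics.median(sample)
var = statistics.variance(sample)  # sample variance, n - 1 denominator (like R's var())
sd = statistics.stdev(sample)      # sample standard deviation (like R's sd())

# The same variance computed by hand: mean squared deviation with n - 1
manual_var = sum((x - avg) ** 2 for x in sample) / (len(sample) - 1)

print(f"mean={avg:.2f} median={mid:.2f} var={var:.2f} sd={sd:.2f}")
print("sd is sqrt(var):", math.isclose(sd, math.sqrt(var)))
```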


In [27]:
%%R

# Install dplyr from CRAN (only needed once per machine), then load it
install.packages('dplyr', repos='https://cloud.r-project.org')
library(dplyr)

df |> 
    group_by(borough) |>
    summarize(
        mean=mean(pct_below_poverty), 
        median=median(pct_below_poverty), 
        standard_deviation=sd(pct_below_poverty))



# A tibble: 5 × 4
  borough        mean median standard_deviation
  <chr>         <dbl>  <dbl>              <dbl>
1 BRONX          26.7  28.7               11.6 
2 BROOKLYN       18.6  17.2                8.20
3 MANHATTAN      13.8  11.0                9.41
4 QUEENS         12.0  10.7                9.06
5 STATEN ISLAND  12.0   9.24               6.58


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [28]:
%%R


df |> 
    group_by(borough) |>
    summarize(
        mean=mean(pct_below_poverty), 
        median=median(pct_below_poverty), 
        standard_deviation=sd(pct_below_poverty))

# A tibble: 5 × 4
  borough        mean median standard_deviation
  <chr>         <dbl>  <dbl>              <dbl>
1 BRONX          26.7  28.7               11.6 
2 BROOKLYN       18.6  17.2                8.20
3 MANHATTAN      13.8  11.0                9.41
4 QUEENS         12.0  10.7                9.06
5 STATEN ISLAND  12.0   9.24               6.58
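
The `group_by()` and `summarize()` pattern splits the rows by borough, computes each statistic within every group, and returns one row per group. A rough Python sketch of that logic, using only the standard library, might look like this. The rows are hypothetical values invented for illustration, not taken from `housing_data.csv`.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical rows for illustration only; NOT values from housing_data.csv
rows = [
    {"borough": "BRONX", "pct_below_poverty": 30.0},
    {"borough": "BRONX", "pct_below_poverty": 20.0},
    {"borough": "BRONX", "pct_below_poverty": 25.0},
    {"borough": "QUEENS", "pct_below_poverty": 10.0},
    {"borough": "QUEENS", "pct_below_poverty": 14.0},
    {"borough": "QUEENS", "pct_below_poverty": 12.0},
]

# Step 1, like group_by(borough): collect the values for each borough
groups = defaultdict(list)
for row in rows:
    groups[row["borough"]].append(row["pct_below_poverty"])

# Step 2, like summarize(...): one set of statistics per group
summary = {
    borough: {
        "mean": mean(values),
        "median": median(values),
        "standard_deviation": stdev(values),  # n - 1 denominator, like R's sd()
    }
    for borough, values in groups.items()
}

for borough, stats in sorted(summary.items()):
    print(borough, stats)
```

With pandas, the same summary is usually written in one line as `df.groupby("borough")["pct_below_poverty"].agg(["mean", "median", "std"])`.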


**👉 Try It**
Compare the summary statistics to the distributions in your previous assignment. What story do they tell? What stories do they obscure? Why was it important to plot the data in the case of this dataset? What did you gain from plotting the `pct_below_poverty` distribution in various different ways?

> Summary statistics tell us that the Bronx has the highest average poverty rate (mean 26.7) and that Queens and Staten Island have the lowest (mean 12.0 each), but they obscure the shape of the distributions: bimodal patterns, outliers, and whether the data clusters tightly or spreads widely. Plotting was essential because a borough's "average" could hide wildly different realities (e.g., half wealthy zip codes, half poor). Different plot types revealed different insights: histograms showed frequency, boxplots highlighted outliers, violin plots exposed distribution shapes, and beeswarms let us see every individual data point.
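
To make the point about hidden shapes concrete, here is a small Python sketch with two made-up sets of poverty rates (not from the dataset). They share the same mean and median, yet one is tightly clustered and the other is split between very low and very high values. Only the standard deviation hints at the difference, and even that does not reveal the two-cluster shape that a plot would show.

```python
from statistics import mean, median, stdev

# Made-up poverty rates for two imaginary boroughs (not from the dataset)
clustered = [14.0, 15.0, 15.0, 15.0, 16.0]  # every area close to 15%
split = [3.0, 4.0, 15.0, 26.0, 27.0]        # areas mostly very low or very high

for name, values in [("clustered", clustered), ("split", split)]:
    print(
        f"{name}: mean={mean(values):.1f} "
        f"median={median(values):.1f} sd={stdev(values):.1f}"
    )
```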