In [None]:
%load_ext rpy2.ipython

# R DataFrames

## Building DataFrames

<!-- label:dataframe_constructor -->

An R `data.frame` is not unlike a SQL table,
or a (well-structured) spreadsheet.

A data frame is a list of "columns": vectors that all have the same length.
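
To make the "collection of columns" idea concrete, here is a minimal sketch. The data frame `toy` is a made-up example, not part of the course data.

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(lower = letters[1:3], upper = LETTERS[1:3])

# A data frame is a list under the hood: one element per column
is.list(toy)
length(toy)      # number of columns

# Columns can be extracted like list elements
toy$lower
toy[["upper"]]
```

Both `$` and `[[` return the column itself as a vector, whereas `toy["upper"]` would return a one-column data frame.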

In [None]:

%%R

dataf <- data.frame(lower=letters, upper=LETTERS)

dataf


---

`data.frame` objects can also be built by reading data from CSV files.

**Note:** All data is loaded into memory.
This only works if the machine has enough memory to hold the data.

<!-- label:dataframe_read_csv -->

In [None]:
%%R

csv_filename <- 'location.csv'
dataf <- read.csv(csv_filename)

Column types are inferred. This is often acceptable for interactive work, but
can also lead to surprises.

In [None]:
%%R
str(dataf)
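
One kind of surprise that type inference can cause is sketched below, using made-up CSV content passed through the `text` argument of `read.csv` (the zip codes and the names `csv_text`, `inferred` and `declared` are assumptions for illustration; they are not in `location.csv`). When inference is not what you want, `colClasses` lets you declare a column type explicitly.

```r
# Hypothetical CSV content, passed as text instead of a file
csv_text <- "zip,city\n02139,Cambridge\n10001,New York"

# With inferred types the zip codes become integers and lose the leading zero
inferred <- read.csv(text = csv_text)
str(inferred)

# Declaring the column type keeps them as character strings
declared <- read.csv(text = csv_text, colClasses = c(zip = "character"))
str(declared)
```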

# Working with data frames

Visual inspection of a few rows of the table is a common first step when working interactively.
This is often why one wants to "see the data in a spreadsheet".

<!-- label:dataframe_head -->

In [None]:
%%R
head(dataf)

<!-- label:dataframe_tail -->

In [None]:
%%R
tail(dataf)

The size of the `data.frame` (number of rows and columns)
is also a common early check:

In [None]:
%%R
print(nrow(dataf))
print(ncol(dataf))


Column names.

In [None]:
%%R
colnames(dataf)

Summary statistics.

<!-- label:dataframe_summary -->

In [None]:
%%R
summary(dataf)

Filtering rows is a common operation when working with data. It corresponds to the `WHERE` clause
in SQL.

In [None]:
%%R
res <- subset(dataf, grepl("^M", state))

# cat() concatenates its arguments; print() would treat the count as `digits`
cat('Original number of rows:', nrow(dataf), '\n')
cat('After filter:', nrow(res), '\n')
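
The same row filter can also be written with a logical vector and bracket indexing. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions, not the course data):

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(city  = c("Boston", "Albany", "Detroit"),
                  state = c("MA", "NY", "MI"))

# Logical vector with one TRUE/FALSE per row
keep <- grepl("^M", toy$state)

# Row filter: the empty slot after the comma keeps all columns
toy[keep, ]
```

`subset()` is convenient interactively because column names can be used directly, while the bracket form makes it explicit that filtering is just indexing rows with a logical vector.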

# `dplyr`

<!-- label:dplyr_filter -->

In [None]:
%%R
suppressMessages(library(dplyr))
res <- dataf %>% filter(grepl("^M", state))
res %>% head()

---

Sorting:

In [None]:
%%R
res <- dataf[order(dataf$city), ]
res %>% head()

<!-- label:dplyr_arrange -->

In [None]:
%%R
res <- dataf %>% arrange(city)
res %>% head()

---

As with SQL, tables can be joined on a key (this is like SQL's `INNER JOIN`).

<!-- label:dplyr_inner_join -->

In [None]:
%%R
# DataFrame with counts in a column "count_cities"
res <- dataf %>%
       group_by(state) %>%
       summarise(count_cities = n())

# Join by state (since the counts are aggregates by state)
location_with_count <- dataf %>%
  inner_join(res, by='state')

location_with_count %>% head()
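
In the join above every row finds a match, because the counts were computed from the same table. The sketch below uses two made-up tables (`cities` and `counts` are assumptions for illustration) to show what an inner join does when some keys have no match on the other side:

```r
suppressMessages(library(dplyr))

# Hypothetical toy tables, for illustration only
cities <- data.frame(city  = c("Boston", "Albany", "Paris"),
                     state = c("MA", "NY", "FR"))
counts <- data.frame(state        = c("MA", "NY", "CA"),
                     count_cities = c(10L, 20L, 30L))

# Only rows whose key appears in both tables are kept:
# "Paris" (FR) and the "CA" count are dropped
joined <- cities %>% inner_join(counts, by = "state")
joined
```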

---

## SQL backends


<!-- label:dplyr_src -->

In [None]:
%%R
dbfilename <- "tycho.db"
datasrc  <- src_sqlite(dbfilename)
location_tbl <- tbl(datasrc, "location")


---

<!-- label:dplyr_pipe -->

In [None]:
%%R
res <- (location_tbl %>%
        # Column names are unquoted: a quoted 'state' would be a constant string
        group_by(state) %>%
        summarise(count_cities = n()) %>%
        arrange(desc(count_cities)))
res %>% head()

---

## Exercises:

The remaining two tables can also be mapped to dplyr objects.

In [None]:
%%R
casecount_tbl <- tbl(datasrc, "casecount")
disease_tbl <- tbl(datasrc, "disease")


Can you answer the following with dplyr? (Yes, these are the same questions as the ones
in the notebook about SQL / SQLAlchemy.)

- Count the number of cities in states with a name starting with 'N'

In [None]:
%%R

location_tbl %>%
  filter(sql("city LIKE 'N%'")) %>%  # single quotes: standard SQL string literal
  summarise(n = n())


- Count the number of cities in each state.

- Count the number of cities with a name starting with 'N', stratified by state.

- For each state,
  count the number of cities for which we have deadly cases
  for more than 5 distinct diseases. Oh, and sort the list of states in decreasing
  number of such diseases. In fact, only report the first 10 states. (Hint: this is
  pretty much the last example query above.)

- Count the total number of cases of flu in NYC
  (hint: flu is a short name, you may want the long name)

- Count the number of cases of flu in NYC each year