<center><h1>The DataFrame Object in R</h1></center>
<center><h3>Ellen Duong</h3></center>
<center><h3>Paul Stey</h3></center>
<center><h3>2023-10-05</h3></center>

# 1. What is a `data.frame`?

  - Tabular data structure (i.e., like an Excel spreadsheet)
  - Canonical data structure for data analysis
  - Capable of storing heterogeneous data

## 1.1 What does it look like?

In [None]:
idx <- 1:4
score <- rnorm(4)
vocals <- c(TRUE, TRUE, TRUE, FALSE)
firstname <- c("john", "george", "paul", "ringo")

dat <- data.frame(idx, firstname, score, vocals)

dat 
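To make the "heterogeneous data" point concrete, the sketch below checks the class of each column. It rebuilds a small data frame under the hypothetical name `beatles`, with fixed scores in place of `rnorm()` so the result is reproducible.

```r
# Hypothetical stand-in for `dat`, with fixed scores so results are reproducible
beatles <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.5, -1.2, 0.3, 2.1),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE        # keep strings as character on older R versions
)

col_classes <- sapply(beatles, class)   # one class name per column
col_classes
```

Each column is a vector of a single type, but different columns can have different types.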

## 1.2 Indexing and Slicing a `data.frame`

  - Similar to `vector`, `matrix`, and `array` objects

In [None]:
dat[1, 2]           # get element in first row, second column

In [None]:
dat[, 2]            # get all of second column

In [None]:
dat[3, ]            # get third row

### 1.2.1 Indexing using Column Names

In [None]:
dat[3, "score"]          # element from row 3 and "score" column

In [None]:
dat[2:4, "firstname"]    # get elements 2, 3, and 4 from "firstname" column
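One detail worth knowing: selecting a single column with `[ , ]` returns a plain vector, not a `data.frame`. The sketch below uses a hypothetical data frame `beatles` with made-up scores, and shows how `drop = FALSE` keeps the result as a one-column `data.frame`.

```r
# Hypothetical example data (scores are made up)
beatles <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.5, -1.2, 0.3, 2.1),
    stringsAsFactors = FALSE
)

as_vector <- beatles[, "firstname"]                 # plain character vector
as_frame  <- beatles[, "firstname", drop = FALSE]   # one-column data.frame

is.data.frame(as_vector)
is.data.frame(as_frame)
```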

## 1.3 The `$` Operator and `data.frame` Objects 

In [None]:
dat$firstname            # get the "firstname" column
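The `$` operator needs the column name typed literally. When the name is stored in a variable, `[[ ]]` is the usual alternative. A minimal sketch, using a hypothetical data frame `beatles` and a hypothetical variable `col_name`:

```r
# Hypothetical example data (scores are made up)
beatles <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.5, -1.2, 0.3, 2.1),
    stringsAsFactors = FALSE
)

col_name <- "score"                      # column name held in a variable

via_dollar   <- beatles$score            # literal column name
via_brackets <- beatles[[col_name]]      # column name looked up from the variable
not_found    <- beatles$col_name         # looks for a column literally called "col_name"

identical(via_dollar, via_brackets)
is.null(not_found)
```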

# 2. Filter `data.frame` using Logical Indexing

In [None]:
dat

In [None]:
idx_keep <- c(TRUE, TRUE, TRUE, FALSE)

dat[idx_keep, ]
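In practice the logical vector is usually computed from the data, not typed by hand. The sketch below assumes a hypothetical data frame `beatles` with made-up scores, and keeps only the rows where `vocals` is `TRUE` and `score` is positive.

```r
# Hypothetical example data (scores are made up)
beatles <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.5, -1.2, 0.3, 2.1),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

keep <- beatles$vocals & beatles$score > 0   # one TRUE/FALSE per row
keep

singers_positive <- beatles[keep, ]          # rows where `keep` is TRUE
singers_positive
```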

## 2.1 Create New `data.frame` from Another

In [None]:
dat2 <- dat[idx_keep, ]       # create new dataframe, from subset of original

head(dat2, n=3)

### 2.1.1 Take Subset of `data.frame` Columns

In [None]:
cols <- c("firstname", "score")    # columns we care about

dat_namescore <- dat2[, cols]      # create new dataframe

dat_namescore

# 3. Adding Columns to a `data.frame`

In [None]:
dat

In [None]:
dat$food <- c("steak", "chicken", "potato", "rice")

dat

## 3.1 Adding Columns (cont.)

In [None]:
dat[, "drink"] <- c("water", "milk", "beer", "scotch")

dat
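New columns can also be computed from existing ones, and a single value is recycled to fill every row. A rough sketch on a hypothetical data frame `beatles` (the column names `score_rounded` and `band` are made up for illustration); assigning `NULL` removes a column again.

```r
# Hypothetical example data (scores are made up)
beatles <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.5, -1.2, 0.3, 2.1),
    stringsAsFactors = FALSE
)

beatles$score_rounded <- round(beatles$score, 0)   # derived from an existing column
beatles$band <- "The Beatles"                      # single value recycled to all rows

beatles

beatles$score_rounded <- NULL                      # remove the column again
colnames(beatles)
```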


<center><h1>Challenge Questions</h1></center>

### Question 1.
Create a `data.frame` object called `state_df` with two columns, one called `state` and one called `population`. Each column should have five elements. For the `state` column, select the abbreviations for five US states (e.g., "OH", "RI", "NY", "MA", "CT"). For the `population` column, use the `sample()` function to create "populations" at random from the range `1` to `1000000`.

### Question 2.
Add a third column to the `state_df` dataframe called `size`. In particular, use boolean indexing to set each element of the new column to `"large"` if that row's `population` value is greater than or equal to `500000`, and to `"small"` if the row's `population` is less than `500000`.

In [None]:
state <- c("OH", "RI", "NY", "MA", "CT")

set.seed(1) # sets the state of the random number generator stored in .Random.seed
population <- sample(1:1000000, 5)

state_df <- data.frame(state, population)

In [None]:
is_small <- state_df$population < 500000   # create vector of booleans

is_small

state_df$size <- rep(NA, 5)                # create column of 5 NAs

state_df[is_small, "size"] <- "small"      # assigning "small" wherever `is_small` is TRUE
state_df[!is_small, "size"] <- "large"     # assigning "large" wherever !is_small

state_df

In [None]:
n <- nrow(state_df)                        # get number of rows in dataframe
state_df$size <- rep(NA, n)                # create empty column of NAs

for (i in seq_len(n)) {                    # seq_len() is safe even when n is 0
    if (state_df[i, "population"] < 500000) {
        state_df[i, "size"] <- "small"
    } else {
        state_df[i, "size"] <- "large"
    }
}

state_df

In [None]:
state_df$size <- ifelse(state_df$population < 500000, "small", "large")

state_df
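The three approaches above (boolean indexing, a `for` loop, and `ifelse()`) should give the same labels. The sketch below checks this on made-up population values instead of the `sample()` output, so the expected labels are easy to verify by eye.

```r
# Made-up populations chosen to include the 500000 boundary
state_df <- data.frame(
    state = c("OH", "RI", "NY", "MA", "CT"),
    population = c(250000, 750000, 500000, 499999, 10),
    stringsAsFactors = FALSE
)
n <- nrow(state_df)

# Approach 1: boolean indexing
is_small <- state_df$population < 500000
size_indexed <- rep(NA_character_, n)
size_indexed[is_small] <- "small"
size_indexed[!is_small] <- "large"

# Approach 2: explicit loop
size_loop <- rep(NA_character_, n)
for (i in seq_len(n)) {
    size_loop[i] <- if (state_df$population[i] < 500000) "small" else "large"
}

# Approach 3: vectorized ifelse()
size_ifelse <- ifelse(state_df$population < 500000, "small", "large")

identical(size_indexed, size_loop) && identical(size_indexed, size_ifelse)
```

The vectorized versions are shorter and are generally preferred over the loop in R.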

# 4. Reading Data from CSV File

  - CSV stands for "comma-separated values"
  - The `,` separator is conventional, but not mandatory
  - The `|` character is also common
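As a small illustration of a non-comma separator, the sketch below reads pipe-separated text with the `sep` argument. The data is made up and passed inline through the `text` argument, so no file is needed.

```r
# Made-up, pipe-separated data supplied inline instead of from a file
pipe_text <- "state|population
OH|100
RI|200"

pipe_df <- read.csv(text = pipe_text, sep = "|", stringsAsFactors = FALSE)

pipe_df
```

With a real file, the same `sep = "|"` argument would be used together with a file path.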

## 4.1 Providence Police Dept. Data

  - We will be looking at public data regarding arrests and cases

In [None]:
# The line below reads the CSV file and creates a dataframe 

arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")     

## 4.2 Exploring the Data

In [None]:
head(arrests_df)         # show first few lines of the dataframe

### 4.2.1 More Data Exploring 

In [None]:
dim(arrests_df)             # get dimensions of the dataframe

In [None]:
nrow(arrests_df)            # get number of rows

In [None]:
ncol(arrests_df)            # get the number of columns

In [None]:
colnames(arrests_df)        # get the column names

# 5. Summaries from `data.frame`

In [None]:
str(arrests_df)              # the str() function shows the structure of dataframe

## 5.1 Summarizing Numeric Data

In [None]:
summary(arrests_df)

### 5.1.1 Summarizing Numeric Data (cont.)

In [None]:
numeric_vars <- c("month", "year", "age", "year_of_birth", "counts")

In [None]:
summary(arrests_df[, numeric_vars])
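Instead of typing the numeric column names by hand, they can be detected with `is.numeric()`. A minimal sketch on a hypothetical `toy_df` with made-up values (not the arrests data):

```r
# Hypothetical toy data; values are made up
toy_df <- data.frame(
    age = c(21, 35, 48),
    counts = c(1L, 3L, 2L),
    gender = c("F", "M", "F"),
    stringsAsFactors = FALSE
)

is_num <- sapply(toy_df, is.numeric)      # TRUE for numeric and integer columns
numeric_names <- names(toy_df)[is_num]
numeric_names

summary(toy_df[, numeric_names])
```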

## 5.2 Summarizing String Variables

In [None]:
table(arrests_df$race)           # show summary of "race" column in `arrests_df`
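Counts from `table()` can be sorted or converted to proportions. The sketch below uses a made-up character vector `groups`, not the arrests data.

```r
# Made-up category labels
groups <- c("a", "b", "a", "c", "a", "b")

counts <- table(groups)                          # frequency of each value
sorted_counts <- sort(counts, decreasing = TRUE) # most common value first
shares <- prop.table(counts)                     # proportions that sum to 1

sorted_counts
shares
```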

# 6. Options when Reading CSV

  - The `read.csv()` function has many optional arguments
  - Critically, we can tell R the strings that ought to be considered missing

In [None]:
help(read.csv)

In [None]:
arrests_df2 <- read.csv("data/pvd_arrests_2021-10-03.csv", 
                        na.strings = c("NA", "", " ", "NULL", "Unknown"))

## 6.1 Effects of `na.strings`

In [None]:
table(arrests_df$race)              # explore `race` in original dataframe

In [None]:
table(arrests_df2$race)           # dataframe after setting `na.strings`
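The effect of `na.strings` can be seen on a tiny, made-up CSV. In the sketch below, the column name `group` and all values are hypothetical. Note that `table()` drops `NA` values unless `useNA` is set.

```r
# Made-up CSV text with several ways of writing "missing"
csv_text <- "id,group
1,A
2,Unknown
3,
4,B
5,NULL"

raw_df   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean_df <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     na.strings = c("NA", "", " ", "NULL", "Unknown"))

n_missing_raw   <- sum(is.na(raw_df$group))     # placeholders are kept as strings
n_missing_clean <- sum(is.na(clean_df$group))   # placeholders are now NA

table(raw_df$group)
table(clean_df$group, useNA = "ifany")          # include the NA count in the table
```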