<center><h1>The DataFrame Object in R</h1></center>

# 1. What is a `data.frame`?

  - Tabular data structure (similar to an Excel spreadsheet)
  - Canonical data structure for data analysis
  - Capable of storing heterogeneous data
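
To make "heterogeneous" concrete, here is a minimal sketch using made-up data (the name `df_demo` and its values are assumptions for illustration, not part of the lecture data). It shows that each column keeps its own type and that a `data.frame` is stored as a list of equal-length columns.

```r
# Hypothetical example data, for illustration only
df_demo <- data.frame(
  id     = 1:3,
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE      # keep strings as character on older R versions
)

# Each column is a vector with its own type
col_classes <- sapply(df_demo, class)
print(col_classes)

# Under the hood, a data.frame is a list of equal-length columns
print(is.list(df_demo))
print(length(df_demo))          # number of columns
```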

## 1.1 What does it look like?

In [2]:
idx <- 1:4
score <- rnorm(4)
vocals <- c(TRUE, TRUE, TRUE, FALSE)
firstname <- c("john", "george", "paul", "ringo")

dat <- data.frame(idx, firstname, score, vocals)

dat 

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.80531144,TRUE
2,george,0.18674302,TRUE
3,paul,-0.03858071,TRUE
4,ringo,0.06816614,FALSE


## 1.2 Indexing and Slicing a `data.frame`

  - Similar to `vector`, `matrix`, and `array` objects

In [3]:
dat[1, 2]           # get element in first row, second column

In [4]:
dat[, 2]            # get all of second column

In [5]:
dat[3, ]            # get third row

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
3,3,paul,-0.03858071,TRUE
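
One detail worth seeing in code: selecting a single column with `[, j]` returns a plain vector, while selecting a row returns a one-row `data.frame`. The sketch below uses a small stand-in dataframe (`dat_demo` is a hypothetical name) and the standard `drop = FALSE` argument to keep a one-column result as a `data.frame`.

```r
# Hypothetical stand-in for `dat`, with fixed values
dat_demo <- data.frame(
  idx       = 1:4,
  firstname = c("john", "george", "paul", "ringo"),
  stringsAsFactors = FALSE
)

col_vec <- dat_demo[, 2]                  # a character vector
col_df  <- dat_demo[, 2, drop = FALSE]    # still a data.frame with one column
row_df  <- dat_demo[3, ]                  # a one-row data.frame

print(class(col_vec))
print(class(col_df))
print(class(row_df))
```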


### 1.2.1 Indexing using Column Names

In [7]:
dat[3, "score"]          # element from row 3 and "score" column

In [9]:
dat[2:4, "firstname"]    # get elements 2, 3, and 4 from "firstname" column

## 1.3 The `$` Operator and `data.frame` Objects 

In [12]:
dat$firstname            # get the "firstname" column
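
As a rough sketch of how `$` relates to other ways of pulling out a column, the example below compares it with `[[ ]]`. The dataframe `dat_demo` and the variable `col_name` are hypothetical names used only for this illustration.

```r
# Hypothetical stand-in for `dat`
dat_demo <- data.frame(
  idx       = 1:4,
  firstname = c("john", "george", "paul", "ringo"),
  stringsAsFactors = FALSE
)

by_dollar  <- dat_demo$firstname          # column by literal name
by_bracket <- dat_demo[["firstname"]]     # same column, name given as a string

# `[[ ]]` also works when the column name is stored in a variable
col_name    <- "firstname"
by_variable <- dat_demo[[col_name]]

print(identical(by_dollar, by_bracket))

# `$` looks for a column literally called "col_name", which does not exist
print(is.null(dat_demo$col_name))
```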

# 2. Filter `data.frame` using Logical Indexing

In [13]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.80531144,TRUE
2,george,0.18674302,TRUE
3,paul,-0.03858071,TRUE
4,ringo,0.06816614,FALSE


In [15]:
idx_keep <- c(TRUE, TRUE, TRUE, FALSE)

dat[idx_keep, ]

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
1,1,john,0.80531144,TRUE
2,2,george,0.18674302,TRUE
3,3,paul,-0.03858071,TRUE
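
In practice the logical vector is usually computed from a condition rather than typed by hand. The following is a minimal sketch under assumed data: `dat_demo` and its fixed `score` values are made up here so the result is reproducible.

```r
# Hypothetical stand-in for `dat`, with fixed scores instead of rnorm()
dat_demo <- data.frame(
  firstname = c("john", "george", "paul", "ringo"),
  score     = c(0.8, 0.2, -0.1, 0.1),
  vocals    = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A comparison on a column yields one TRUE/FALSE per row
positive <- dat_demo$score > 0
print(positive)

# Keep only the rows where the condition holds
print(dat_demo[positive, ])

# Conditions can be combined with & (and) and | (or)
singers_pos <- dat_demo[dat_demo$vocals & dat_demo$score > 0, ]
print(singers_pos)
```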


## 2.1 Create New `data.frame` from Another

In [17]:
dat2 <- dat[idx_keep, ]       # create new dataframe, from subset of original

head(dat2)

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
1,1,john,0.80531144,TRUE
2,2,george,0.18674302,TRUE
3,3,paul,-0.03858071,TRUE


### 2.1.1 Take Subset of `data.frame` Columns

In [20]:
cols <- c("firstname", "score")    # columns we care about

dat_namescore <- dat2[, cols]      # create new dataframe

dat_namescore

,firstname,score
,<chr>,<dbl>
1,john,0.80531144
2,george,0.18674302
3,paul,-0.03858071


# 3. Adding Columns to a `data.frame`

In [21]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.80531144,TRUE
2,george,0.18674302,TRUE
3,paul,-0.03858071,TRUE
4,ringo,0.06816614,FALSE


In [27]:
dat$food <- c("steak", "chicken", "potato", "rice")

dat

idx,firstname,score,vocals,food
<int>,<chr>,<dbl>,<lgl>,<chr>
1,john,0.80531144,TRUE,steak
2,george,0.18674302,TRUE,chicken
3,paul,-0.03858071,TRUE,potato
4,ringo,0.06816614,FALSE,rice


## 3.1 Adding Columns (cont.)

In [31]:
dat[, "drink"] <- c("water", "milk", "beer", "scotch")

dat <- dat[, c("drink", "idx", "firstname", "score", "vocals", "food")]

dat

drink,idx,firstname,score,vocals,food
<chr>,<int>,<chr>,<dbl>,<lgl>,<chr>
water,1,john,0.80531144,TRUE,steak
milk,2,george,0.18674302,TRUE,chicken
beer,3,paul,-0.03858071,TRUE,potato
scotch,4,ringo,0.06816614,FALSE,rice
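
New columns do not have to be typed in by hand. As a small sketch (the dataframe `dat_demo` and the column names `score_pct` and `band` are hypothetical), a column can be computed from existing columns, filled with a single recycled value, or removed again by assigning `NULL`.

```r
# Hypothetical stand-in for `dat`, with fixed scores
dat_demo <- data.frame(
  firstname = c("john", "george", "paul", "ringo"),
  score     = c(0.8, 0.2, -0.1, 0.1),
  stringsAsFactors = FALSE
)

dat_demo$score_pct <- dat_demo$score * 100   # derived from an existing column
dat_demo$band      <- "beatles"              # single value recycled to every row

dat_demo$score_pct <- NULL                   # assigning NULL removes the column

print(colnames(dat_demo))
print(dat_demo)
```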



<center><h1>Challenge Questions</h1></center>

### Question 1.
Create a `data.frame` object called `state_df` with two columns, one called `state` and one called `population`. Each column should have five elements. For the `state` column, select the abbreviations for five US states (e.g., "OH", "RI", "NY", "MA", "CT"). For the `population` column, use the `sample()` function to create "populations" at random from the range `1` to `1000000`.

### Question 2.
Add a third column to the `state_df` dataframe called `size`. In particular, use boolean indexing to assign elements of the third column to be `"large"` if that row's `population` value is larger than or equal to `500000`, and be `"small"`  if the row's `population` is less than `500000`.

In [65]:
set.seed(1)

state <- c("OH", "RI", "NY", "MA", "CT")
population <- sample(1:1000000, 5)

state_df <- data.frame(state, population)

In [59]:
is_small <- state_df$population < 500000   # create vector of booleans

is_small

state_df$size <- rep(NA, 5)                # create column of 5 NAs

state_df[is_small, "size"] <- "small"      # assigning "small" wherever `is_small` is TRUE
state_df[!is_small, "size"] <- "large"     # assigning "large" wherever !is_small

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large


In [62]:
n <- nrow(state_df)                        # get number of rows in dataframe
state_df$size <- rep(NA, n)                # create empty column of NAs

for (i in seq_len(n)) {                    # loop over each row index
    if (state_df[i, "population"] < 500000) {
        state_df[i, "size"] <- "small"
    } else {
        state_df[i, "size"] <- "large"
    }
}

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large


In [66]:
state_df$size <- ifelse(state_df$population < 500000, "small", "large")

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large


# 4. Reading Data from CSV File

  - CSV stands for "comma-separated values"
  - The `,` separator is conventional, but not mandatory
  - The `|` character is also a common separator
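
To illustrate a non-comma separator, here is a minimal sketch that reads pipe-separated text with the `sep` argument of `read.csv()`. The inline text and the name `pipe_df` are made up for this example; the `text` argument is used only so the snippet runs without a file on disk.

```r
# Hypothetical pipe-separated data, supplied inline instead of from a file
pipe_text <- "state|population\nOH|100\nRI|200"

pipe_df <- read.csv(text = pipe_text, sep = "|", stringsAsFactors = FALSE)

print(pipe_df)
print(colnames(pipe_df))
```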

## 4.1 Providence Police Dept. Data

  - We will be looking at public data regarding arrests and cases

In [69]:
# The line below reads the CSV file and creates a dataframe 

arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")     

## 4.2 Exploring the Data

In [72]:
head(arrests_df)         # show first few lines of the dataframe

,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
,<int>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273
2,2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1.0,2019-00084127,NManfredi,pvd15166785558364246202
3,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
4,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
5,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599
6,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599


### 4.2.1 More Data Exploring 

In [73]:
dim(arrests_df)             # get dimensions of the dataframe

In [74]:
nrow(arrests_df)            # get number of rows

In [75]:
ncol(arrests_df)            # get the number of columns

In [78]:
colnames(arrests_df)        # get the column names

# 5. Summaries from `data.frame`

In [82]:
str(arrests_df)              # the str() function shows the structure of dataframe

'data.frame':	13012 obs. of  18 variables:
 $ arrest_date       : chr  "2019-08-24T02:23:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" ...
 $ year              : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ month             : int  8 8 8 8 8 8 8 8 8 8 ...
 $ gender            : chr  "Male" "" "Female" "Female" ...
 $ race              : chr  "White" "" "Black" "Black" ...
 $ ethnicity         : chr  "NonHispanic" "" "NonHispanic" "NonHispanic" ...
 $ year_of_birth     : int  1981 1994 1984 1984 2001 2001 2001 1991 1991 1991 ...
 $ age               : int  37 25 34 34 18 18 18 28 28 28 ...
 $ from_address      : chr  "No Permanent Address" "SUMMER AVE" "DOUGLAS AVE" "DOUGLAS AVE" ...
 $ from_city         : chr  "providence" "Cranston" "Providence" "Providence" ...
 $ from_state        : chr  "Rhode Island" "Rhode Island" "Rhode Island" "Rhode Island" ...
 $ statute_type      : chr  "" "RI Statute Violation" "RI Statute Violation" "RI Stat

## 5.1 Summarizing Numeric Data

In [83]:
summary(arrests_df)

 arrest_date             year          month           gender         
 Length:13012       Min.   :2019   Min.   : 1.000   Length:13012      
 Class :character   1st Qu.:2019   1st Qu.: 3.000   Class :character  
 Mode  :character   Median :2020   Median : 7.000   Mode  :character  
                    Mean   :2020   Mean   : 6.508                     
                    3rd Qu.:2021   3rd Qu.: 9.000                     
                    Max.   :2021   Max.   :12.000                     
                                                                      
     race            ethnicity         year_of_birth       age       
 Length:13012       Length:13012       Min.   :1938   Min.   :18.00  
 Class :character   Class :character   1st Qu.:1980   1st Qu.:24.00  
 Mode  :character   Mode  :character   Median :1989   Median :31.00  
                                       Mean   :1986   Mean   :33.07  
                                       3rd Qu.:1995   3rd Qu.:39.00  
            

### 5.1.1 Summarizing Numeric Data (cont.)

In [84]:
numeric_vars <- c("month", "year", "age", "year_of_birth", "counts")

In [85]:
summary(arrests_df[, numeric_vars])

     month             year           age        year_of_birth 
 Min.   : 1.000   Min.   :2019   Min.   :18.00   Min.   :1938  
 1st Qu.: 3.000   1st Qu.:2019   1st Qu.:24.00   1st Qu.:1980  
 Median : 7.000   Median :2020   Median :31.00   Median :1989  
 Mean   : 6.508   Mean   :2020   Mean   :33.07   Mean   :1986  
 3rd Qu.: 9.000   3rd Qu.:2021   3rd Qu.:39.00   3rd Qu.:1995  
 Max.   :12.000   Max.   :2021   Max.   :83.00   Max.   :2003  
                                                               
     counts      
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 1.087  
 3rd Qu.: 1.000  
 Max.   :15.000  
 NA's   :2983    
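
The `NA's` line in the summary matters when computing statistics by hand. The sketch below uses a short made-up vector (`counts_demo` is a hypothetical name, not the real `counts` column) to show how missing values can be counted and why `na.rm = TRUE` is needed.

```r
# Hypothetical values, standing in for a column that contains missing data
counts_demo <- c(1, 1, NA, 2, NA, 15)

n_missing <- sum(is.na(counts_demo))           # how many values are missing
print(n_missing)

mean_with_na <- mean(counts_demo)              # NA: missing values propagate
print(mean_with_na)

mean_no_na <- mean(counts_demo, na.rm = TRUE)  # drop NAs before averaging
print(mean_no_na)
```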

## 5.2 Summarizing String Variables

In [86]:
table(arrests_df$race)           # show summary of "race" column in `arrests_df`


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 
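
A frequency table is often easier to read when sorted or shown as proportions. This is a small sketch on made-up values (`race_demo` is hypothetical and far smaller than the real column) using `sort()` and `prop.table()`.

```r
# Hypothetical values, for illustration only
race_demo <- c("White", "Black", "White", "Unknown", "Black", "White")

counts_tbl <- table(race_demo)

sorted_tbl <- sort(counts_tbl, decreasing = TRUE)   # most frequent category first
print(sorted_tbl)

props <- prop.table(counts_tbl)                     # share of each category
print(round(props, 2))
```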

# 6. Options when Reading CSV

  - The `read.csv()` function has many optional arguments
  - Critically, we can tell R which strings should be treated as missing values (`NA`)

In [88]:
help(read.csv)

In [96]:
arrests_df2 <- read.csv("data/pvd_arrests_2021-10-03.csv", 
                        na.strings = c("NA", "", " ", "NULL", "Unknown"))
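
The effect of `na.strings` can be checked on a tiny example. In this sketch the inline CSV text and the names `raw_demo` and `clean_demo` are assumptions made for illustration; the same text is read twice, once with the defaults and once with extra missing-value markers.

```r
# Hypothetical CSV text with an empty string, "NULL", and "Unknown" in `race`
csv_text <- "race,age\nWhite,30\n,25\nNULL,40\nUnknown,22\nBlack,35"

# Default settings: only the string "NA" counts as missing
raw_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat "", "NULL", and "Unknown" as missing as well
clean_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                       na.strings = c("NA", "", "NULL", "Unknown"))

print(sum(is.na(raw_demo$race)))      # missing values with default settings
print(sum(is.na(clean_demo$race)))    # missing values after setting na.strings

# useNA = "ifany" makes table() report the NA count too
print(table(clean_demo$race, useNA = "ifany"))
```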

## 6.1 Effects of `na.strings`

In [97]:
table(arrests_df$race)            # explore `race` in original dataframe


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 

In [98]:
table(arrests_df2$race)           # dataframe after setting `na.strings`


American Indian/Alaskan Native         Asian/Pacific Islander 
                            25                            121 
                         Black                          White 
                          5721                           6483 
           ZHispanic (FD only) 
                             9 