<center><h1>The DataFrame Object in R</h1></center>
<center><h3>Ellen Duong</h3></center>
<center><h3>August Guang</h3></center>
<center><h3>Paul Stey</h3></center>

# 1. What is a `data.frame`?

  - Tabular data structure (i.e., like an Excel spreadsheet)
  - Canonical data structure for data analysis
  - Capable of storing heterogeneous data

## 1.1 What does it look like?

In [1]:
idx <- 1:4
score <- rnorm(4)
vocals <- c(TRUE, TRUE, TRUE, FALSE)
firstname <- c("john", "george", "paul", "ringo")

dat <- data.frame(idx, firstname, score, vocals)

dat 

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,-1.51900212,TRUE
2,george,0.59062028,TRUE
3,paul,-0.03782878,TRUE
4,ringo,0.22167893,FALSE
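The point about heterogeneous data can be checked directly. The sketch below rebuilds a similar dataframe under the hypothetical name `beatles`, with fixed scores in place of `rnorm()` so the result is reproducible, and asks R for the class of each column.

```r
# Hypothetical stand-in for `dat`; fixed scores keep the example reproducible
beatles <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    score = c(-1.5, 0.6, -0.04, 0.2),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE          # keep strings as character (the default since R 4.0)
)

# sapply() applies class() to every column and returns a named character vector
col_types <- sapply(beatles, class)
col_types
```

Each column is a vector of a single type, but different columns may have different types.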


## 1.2 Indexing and Slicing a `data.frame`

  - Similar to `vector`, `matrix`, and `array` objects

In [2]:
dat[1, 2]           # get element in first row, second column

In [3]:
dat[, 2]            # get all of second column

In [4]:
dat[3, ]            # get third row

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
3,3,paul,-0.03782878,TRUE


### 1.2.1 Indexing using Column Names

In [5]:
dat[3, "score"]          # element from row 3 and "score" column

In [6]:
dat[2:4, "firstname"]    # get elements 2, 3, and 4 from "firstname" column

## 1.3 The `$` Operator and `data.frame` Objects 

In [7]:
dat$firstname            # get the "firstname" column
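As a small illustration of how `$` relates to the bracket forms, the sketch below uses a hypothetical dataframe called `beatles`. It also shows `drop = FALSE`, which asks single-column indexing to return a dataframe instead of a plain vector.

```r
# Hypothetical two-column dataframe for illustration
beatles <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    stringsAsFactors = FALSE
)

by_dollar  <- beatles$firstname          # column as a vector
by_bracket <- beatles[["firstname"]]     # same column, name given as a string
identical(by_dollar, by_bracket)         # TRUE

# Single-column indexing normally simplifies to a vector;
# drop = FALSE keeps the result as a one-column dataframe
one_col_df <- beatles[, "firstname", drop = FALSE]
class(one_col_df)
```

The `[[` form is handy when the column name is stored in a variable, since `$` takes the name literally.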

# 2. Filter `data.frame` using Logical Indexing

In [8]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,-1.51900212,TRUE
2,george,0.59062028,TRUE
3,paul,-0.03782878,TRUE
4,ringo,0.22167893,FALSE


In [12]:
idx_keep <- c(TRUE, TRUE, TRUE, FALSE)

dat[idx_keep, ]

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
1,1,john,-1.51900212,TRUE
2,2,george,0.59062028,TRUE
3,3,paul,-0.03782878,TRUE
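In practice the logical vector is usually computed from the data instead of typed by hand. The following sketch assumes a hypothetical dataframe `beatles` with fixed scores and builds the filter from a comparison.

```r
# Hypothetical data with fixed scores so the result is reproducible
beatles <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(-1.5, 0.6, -0.04, 0.2),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

has_vocals     <- beatles$vocals         # already a logical vector
positive_score <- beatles$score > 0      # a comparison returns a logical vector

# Combine conditions element-wise with & and keep the matching rows
filtered <- beatles[has_vocals & positive_score, ]
filtered
```

Only rows where both conditions are `TRUE` are kept, which here is the single row for "george".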


## 2.1 Create New `data.frame` from Another

In [15]:
dat2 <- dat[idx_keep, ]       # create new dataframe, from subset of original

tail(dat2, n=2)

,idx,firstname,score,vocals
,<int>,<chr>,<dbl>,<lgl>
2,2,george,0.59062028,TRUE
3,3,paul,-0.03782878,TRUE


### 2.1.1 Take Subset of `data.frame` Columns

In [25]:
cols <- c("firstname", "score")    # columns we care about

dat_namescore <- dat2[, cols]      # create new dataframe

dat_namescore

,firstname,score
,<chr>,<dbl>
1,john,-1.51900212
2,george,0.59062028
3,paul,-0.03782878
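Row filtering and column selection can also be combined in a single indexing step. This is a minimal sketch on a hypothetical dataframe `beatles`, not a replacement for the two-step version shown here.

```r
# Hypothetical data for illustration
beatles <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    score = c(-1.5, 0.6, -0.04, 0.2),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

# Rows go before the comma, columns after it
vocal_scores <- beatles[beatles$vocals, c("firstname", "score")]
vocal_scores
```

Doing it in two steps, as in the cells above, can be easier to read when the row condition is long.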


# 3. Adding Columns to a `data.frame`

In [26]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,-1.51900212,TRUE
2,george,0.59062028,TRUE
3,paul,-0.03782878,TRUE
4,ringo,0.22167893,FALSE


In [27]:
dat$food <- c("steak", "chicken", "potato", "rice")

dat

idx,firstname,score,vocals,food
<int>,<chr>,<dbl>,<lgl>,<chr>
1,john,-1.51900212,TRUE,steak
2,george,0.59062028,TRUE,chicken
3,paul,-0.03782878,TRUE,potato
4,ringo,0.22167893,FALSE,rice


## 3.1 Adding Columns (cont.)

In [28]:
dat[, "drink"] <- c("water", "milk", "beer", "scotch")

dat

idx,firstname,score,vocals,food,drink
<int>,<chr>,<dbl>,<lgl>,<chr>,<chr>
1,john,-1.51900212,TRUE,steak,water
2,george,0.59062028,TRUE,chicken,milk
3,paul,-0.03782878,TRUE,potato,beer
4,ringo,0.22167893,FALSE,rice,scotch
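New columns can also be computed from existing ones, and a column can be removed by assigning `NULL` to it. The sketch below assumes a hypothetical dataframe `beatles`; the column name `score_doubled` is made up for the example.

```r
# Hypothetical data for illustration
beatles <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    score = c(-1.5, 0.6, -0.04, 0.2),
    stringsAsFactors = FALSE
)

# Derive a new column from an existing one (vectorized arithmetic)
beatles$score_doubled <- beatles$score * 2

# Assigning NULL removes a column
beatles$idx <- NULL

names(beatles)
```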



<center><h1>Challenge Questions</h1></center>

### Question 1.
Create a `data.frame` object called `state_df` with two columns, one called `state` and one called `population`. Each column should have five elements. For the `state` column, select the abbreviations for five US states (e.g., "OH", "RI", "NY", "MA", "CT"). For the `population` column, use the `sample()` function to create "populations" at random from the range `1` to `1000000`.

### Question 2.
Add a third column, called `size`, to the `state_df` dataframe. In particular, use boolean indexing to set `size` to `"large"` if that row's `population` value is greater than or equal to `500000`, and to `"small"` if the row's `population` is less than `500000`.

In [30]:
state <- c("OH", "RI", "NY", "MA", "CT")

set.seed(1) # sets the state of the random number generator stored in .Random.seed
population <- sample(1:1000000, 5)

state_df <- data.frame(state, population)

state_df

state,population
<chr>,<int>
OH,548676
RI,452737
NY,124413
MA,436523
CT,856018


In [32]:
is_small <- state_df$population < 500000   # create vector of booleans

is_small

state_df$size <- rep(NA, 5)                # create column of 5 NAs

state_df[is_small, "size"] <- "small"      # assigning "small" wherever `is_small` is TRUE
state_df[!is_small, "size"] <- "large"     # assigning "large" wherever !is_small

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large


In [33]:
n <- nrow(state_df)                        # get number of rows in dataframe
state_df$size <- rep(NA, n)                # create empty column of NAs

for (i in seq_len(n)) {                    # seq_len(n) also behaves correctly when n is 0
    if (state_df[i, "population"] < 500000) {
        state_df[i, "size"] <- "small"
    } else {
        state_df[i, "size"] <- "large"
    }
}

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large


In [34]:
state_df$size <- ifelse(state_df$population < 500000, "small", "large")

state_df

state,population,size
<chr>,<int>,<chr>
OH,548676,large
RI,452737,small
NY,124413,small
MA,436523,small
CT,856018,large
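The boolean-indexing and `ifelse()` solutions should produce the same column. The sketch below checks this, using the population values printed in the output above so that it does not depend on the random number generator; the names `size_indexing` and `size_ifelse` are introduced only for this comparison.

```r
# Populations copied from the output above
state_df <- data.frame(
    state = c("OH", "RI", "NY", "MA", "CT"),
    population = c(548676L, 452737L, 124413L, 436523L, 856018L),
    stringsAsFactors = FALSE
)

is_small <- state_df$population < 500000

# Approach 1: boolean indexing into a vector of NAs
size_indexing <- rep(NA_character_, nrow(state_df))
size_indexing[is_small]  <- "small"
size_indexing[!is_small] <- "large"

# Approach 2: vectorized ifelse()
size_ifelse <- ifelse(is_small, "small", "large")

identical(size_indexing, size_ifelse)    # TRUE
```

The `ifelse()` form is the most compact; the loop version does the same work one row at a time.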


# 4. Reading Data from CSV File

  - CSV stands for "comma-separated values"
  - The `,` separator is conventional, but not mandatory
  - The `|` character is also a common separator
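To illustrate a non-comma separator, the sketch below reads a small pipe-delimited table. The data is made up, and it is passed through the `text` argument so that no file is needed; with a real file, the path would go in the first argument instead.

```r
# Made-up pipe-delimited data, supplied inline instead of from a file
pipe_text <- "state|population\nRI|1000\nMA|7000"

# sep tells read.csv() which character separates the fields
pipe_df <- read.csv(text = pipe_text, sep = "|", stringsAsFactors = FALSE)
pipe_df

# With the default comma separator, each line is read as a single field
wrong_df <- read.csv(text = pipe_text, stringsAsFactors = FALSE)
ncol(wrong_df)
```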

## 4.1 Providence Police Dept. Data

  - We will be looking at public data regarding arrests and case

In [35]:
# The line below reads the CSV file and creates a dataframe 

arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")     

## 4.2 Exploring the Data

In [36]:
head(arrests_df)         # show first few lines of the dataframe

,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273
2,2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1.0,2019-00084127,NManfredi,pvd15166785558364246202
3,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
4,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
5,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599
6,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599


### 4.2.1 More Data Exploring 

In [37]:
dim(arrests_df)             # get dimensions of the dataframe

In [38]:
nrow(arrests_df)            # get number of rows

In [39]:
ncol(arrests_df)            # get the number of columns

In [40]:
colnames(arrests_df)        # get the column names

# 5. Summaries from `data.frame`

In [41]:
str(arrests_df)              # the str() function shows the structure of dataframe

'data.frame':	13012 obs. of  18 variables:
 $ arrest_date       : chr  "2019-08-24T02:23:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" ...
 $ year              : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ month             : int  8 8 8 8 8 8 8 8 8 8 ...
 $ gender            : chr  "Male" "" "Female" "Female" ...
 $ race              : chr  "White" "" "Black" "Black" ...
 $ ethnicity         : chr  "NonHispanic" "" "NonHispanic" "NonHispanic" ...
 $ year_of_birth     : int  1981 1994 1984 1984 2001 2001 2001 1991 1991 1991 ...
 $ age               : int  37 25 34 34 18 18 18 28 28 28 ...
 $ from_address      : chr  "No Permanent Address" "SUMMER AVE" "DOUGLAS AVE" "DOUGLAS AVE" ...
 $ from_city         : chr  "providence" "Cranston" "Providence" "Providence" ...
 $ from_state        : chr  "Rhode Island" "Rhode Island" "Rhode Island" "Rhode Island" ...
 $ statute_type      : chr  "" "RI Statute Violation" "RI Statute Violation" "RI Stat

## 5.1 Summarizing Numeric Data

In [42]:
summary(arrests_df)

 arrest_date             year          month           gender         
 Length:13012       Min.   :2019   Min.   : 1.000   Length:13012      
 Class :character   1st Qu.:2019   1st Qu.: 3.000   Class :character  
 Mode  :character   Median :2020   Median : 7.000   Mode  :character  
                    Mean   :2020   Mean   : 6.508                     
                    3rd Qu.:2021   3rd Qu.: 9.000                     
                    Max.   :2021   Max.   :12.000                     
                                                                      
     race            ethnicity         year_of_birth       age       
 Length:13012       Length:13012       Min.   :1938   Min.   :18.00  
 Class :character   Class :character   1st Qu.:1980   1st Qu.:24.00  
 Mode  :character   Mode  :character   Median :1989   Median :31.00  
                                       Mean   :1986   Mean   :33.07  
                                       3rd Qu.:1995   3rd Qu.:39.00  
            

### 5.1.1 Summarizing Numeric Data (cont.)

In [43]:
numeric_vars <- c("month", "year", "age", "year_of_birth", "counts")

In [44]:
summary(arrests_df[, numeric_vars])

     month             year           age        year_of_birth 
 Min.   : 1.000   Min.   :2019   Min.   :18.00   Min.   :1938  
 1st Qu.: 3.000   1st Qu.:2019   1st Qu.:24.00   1st Qu.:1980  
 Median : 7.000   Median :2020   Median :31.00   Median :1989  
 Mean   : 6.508   Mean   :2020   Mean   :33.07   Mean   :1986  
 3rd Qu.: 9.000   3rd Qu.:2021   3rd Qu.:39.00   3rd Qu.:1995  
 Max.   :12.000   Max.   :2021   Max.   :83.00   Max.   :2003  
                                                               
     counts      
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 1.087  
 3rd Qu.: 1.000  
 Max.   :15.000  
 NA's   :2983    
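The `NA's` line in the summary reports missing values for that column. A minimal sketch of counting missing values in every column at once is shown below on a made-up dataframe called `toy`; it is not the arrests data.

```r
# Made-up data with a few missing values
toy <- data.frame(
    age = c(37L, 25L, NA),
    counts = c(NA, 1L, 1L)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Many summary functions return NA unless told to skip missing values
mean_counts <- mean(toy$counts, na.rm = TRUE)
mean_counts
```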

## 5.2 Summarizing String Variables

In [45]:
table(arrests_df$race)           # show summary of "race" column in `arrests_df`


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 

# 6. Options when Reading CSV

  - The `read.csv()` function has many optional arguments
  - Critically, we can tell R which strings should be treated as missing values

In [46]:
help(read.csv)

read.table {utils},R Documentation

argument,description
file,"the name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd(). Tilde-expansion is performed where supported. This can be a compressed file (see file). Alternatively, file can be a readable text-mode connection (which will be opened for reading if necessary, and if so closed (and hence destroyed) at the end of the function call). (If stdin() is used, the prompts for lines may be somewhat confusing. Terminate input with a blank line or an EOF signal, Ctrl-D on Unix and Ctrl-Z on Windows. Any pushback on stdin() will be cleared before return.) file can also be a complete URL. (For the supported URL schemes, see the ‘URLs’ section of the help for url.)"
header,"a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns."
sep,"the field separator character. Values on each line of the file are separated by this character. If sep = """" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns."
quote,"the set of quoting characters. To disable quoting altogether, use quote = """". See scan for the behaviour on quotes embedded in quotes. Quoting is only considered for columns read as character, which is all of them unless colClasses is specified."
dec,the character used in the file for decimal points.
numerals,"string indicating how to convert numbers whose conversion to double precision would lose accuracy, see type.convert. Can be abbreviated. (Applies also to complex-number inputs.)"
row.names,"a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names. If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names. Otherwise if row.names is missing, the rows are numbered. Using row.names = NULL forces row numbering. Missing or NULL row.names generate row names that are considered to be ‘automatic’ (and not preserved by as.matrix)."
col.names,"a vector of optional names for the variables. The default is to use ""V"" followed by the column number."
as.is,"controls conversion of character variables (insofar as they are not converted to logical, numeric or complex) to factors, if not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors. Note: to suppress all conversions including those of numeric columns, set colClasses = ""character"". Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped."
na.strings,"a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance."


In [47]:
arrests_df2 <- read.csv("data/pvd_arrests_2021-10-03.csv", 
                        na.strings = c("NA", "", " ", "NULL", "Unknown"))
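The effect of `na.strings` can be seen on a tiny made-up table without touching the arrests file. The sketch below reads the same inline text twice, once with the defaults and once with the `na.strings` vector used above; the names `raw_df` and `clean_df` are only for illustration.

```r
# Made-up CSV text: "NULL", "Unknown", and an empty field stand in for missing values
csv_text <- "race,age\nWhite,37\nNULL,25\nUnknown,18\n,40"

# Default: in a character column these placeholders are kept as ordinary strings
raw_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings, the listed placeholders are converted to NA
clean_df <- read.csv(text = csv_text,
                     na.strings = c("NA", "", " ", "NULL", "Unknown"),
                     stringsAsFactors = FALSE)

sum(is.na(raw_df$race))      # 0
sum(is.na(clean_df$race))    # 3
```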

## 6.1 Effects of `na.strings`

In [49]:
table(arrests_df$race)             # explore `race` in original dataframe


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 

In [50]:
table(arrests_df2$race)           # dataframe after setting `na.strings`


American Indian/Alaskan Native         Asian/Pacific Islander 
                            25                            121 
                         Black                          White 
                          5721                           6483 
           ZHispanic (FD only) 
                             9 
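By default `table()` leaves `NA` values out of its counts, which is why the placeholder categories no longer appear. The sketch below, on a made-up vector named `race`, shows the `useNA` argument that brings the missing count back.

```r
# Made-up vector with two missing values
race <- c("White", NA, "Black", "White", NA)

counts_default <- table(race)                    # NA values are dropped
counts_with_na <- table(race, useNA = "ifany")   # adds an <NA> category when NAs exist

counts_default
counts_with_na
```

Comparing the two totals is a quick way to see how many rows were treated as missing.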

In [51]:
arrests_df[1, ]

,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273
