# Practice 4 - Working with Complex Data Types


Here we will explore the use of iteration, factors, and dates when analyzing data in R. Each of these presents its own challenges to the data scientist new to R. The more you work with these libraries and data types, the more fluent you will become. It's a little bit like data science push-ups.

**Task 1:** Load the [ggplot2](https://web.dsa.missouri.edu/static/PDF/R/ggplot2.pdf), [lubridate](https://web.dsa.missouri.edu/static/PDF/R/lubridate.pdf), and [dplyr](https://web.dsa.missouri.edu/static/PDF/R/dplyr.pdf) libraries.

In [1]:
## 1.  Develop your code for Task 1 here.
## ---------------------------------------
suppressMessages(library('dplyr'))
suppressMessages(library('lubridate'))
suppressMessages(library('ggplot2'))


**Task 2:** 
- Read the shooting victims CSV file "/dsa/data/all_datasets/Shooting_Victims.csv" into a new data frame, reading strings as characters. 
- Output the first 5 rows of the data frame to the screen.

In [2]:
## 2.  Develop your code for Task 2 here.
## ---------------------------------------

shooting.victims <- read.csv("/dsa/data/all_datasets/Shooting_Victims.csv", stringsAsFactors = FALSE)

head(shooting.victims, 5L)



DIST,DC_KEY,CODE,DATE_,TIME,RACE,SEX,LATINO,AGE,WOUND,INSIDE,OUTSIDE,FATAL,OFFICER_INVOLVED,OFFENDER_INJURED,OFFENDER_DECEASED,LOCATION,SHAPE,Police.Districts
<int>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
39,201539000000.0,411,05/16/2015 12:00:00 AM,23:15 pm,B,M,0,18,buttocks,0,-1,0,N,N,N,2800 Block of N. Bonsall St.,POINT (-75.169159 39.998284),21
39,201539000000.0,411,05/25/2015 12:00:00 AM,01:40 am,B,M,0,20,hand,0,-1,0,N,N,N,1700 Block of W. Courtland St.,POINT (-75.153064 40.023554),20
39,201539000000.0,411,06/03/2015 12:00:00 AM,23:30 pm,B,M,0,26,leg,0,-1,0,N,N,N,2200 Block of N. Lippencott St.,POINT (-75.166117 40.002397),21
39,201539000000.0,411,06/13/2015 12:00:00 AM,02:31 am,B,F,0,27,leg,0,-1,0,N,N,N,4200 Block of N. Carlisle St.,POINT (-75.15049 40.017063),21
39,201539000000.0,411,06/13/2015 12:00:00 AM,02:31 am,B,M,0,40,leg,0,-1,0,N,N,N,4200 Block of N. Carlisle St.,POINT (-75.15049 40.017063),21


**Task 3:** 
- Create a new date variable based on the information in the `DATE_` field. Name it `datevar`.
- Output the first 5 lines of the data frame to the screen. 

Hint: as_date(); parse_date_time().

In [3]:
## 3.  Develop your code for Task 3 here.
## ---------------------------------------

shooting.victims$datevar <- parse_date_time(shooting.victims$DATE_, orders='mdY HMS')
head(shooting.victims,n=5L)




DIST,DC_KEY,CODE,DATE_,TIME,RACE,SEX,LATINO,AGE,WOUND,INSIDE,OUTSIDE,FATAL,OFFICER_INVOLVED,OFFENDER_INJURED,OFFENDER_DECEASED,LOCATION,SHAPE,Police.Districts,datevar
<int>,<dbl>,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dttm>
39,201539000000.0,411,05/16/2015 12:00:00 AM,23:15 pm,B,M,0,18,buttocks,0,-1,0,N,N,N,2800 Block of N. Bonsall St.,POINT (-75.169159 39.998284),21,2015-05-16 12:00:00
39,201539000000.0,411,05/25/2015 12:00:00 AM,01:40 am,B,M,0,20,hand,0,-1,0,N,N,N,1700 Block of W. Courtland St.,POINT (-75.153064 40.023554),20,2015-05-25 12:00:00
39,201539000000.0,411,06/03/2015 12:00:00 AM,23:30 pm,B,M,0,26,leg,0,-1,0,N,N,N,2200 Block of N. Lippencott St.,POINT (-75.166117 40.002397),21,2015-06-03 12:00:00
39,201539000000.0,411,06/13/2015 12:00:00 AM,02:31 am,B,F,0,27,leg,0,-1,0,N,N,N,4200 Block of N. Carlisle St.,POINT (-75.15049 40.017063),21,2015-06-13 12:00:00
39,201539000000.0,411,06/13/2015 12:00:00 AM,02:31 am,B,M,0,40,leg,0,-1,0,N,N,N,4200 Block of N. Carlisle St.,POINT (-75.15049 40.017063),21,2015-06-13 12:00:00
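
The table above shows `12:00:00` in `datevar` for strings that end in `12:00:00 AM`, which suggests the AM/PM marker was not interpreted. The following is a minimal sketch, not a required solution, of parsing such strings on a 12-hour clock with lubridate's `I` (12-hour) and `p` (AM/PM) order codes. The vector `x` is made up for illustration, and the sketch assumes an English locale for the AM/PM marker.

```r
library(lubridate)

# Toy strings in the same layout as the DATE_ column (illustrative values)
x <- c("05/16/2015 12:00:00 AM", "06/03/2015 01:30:00 PM")

# I = hour on a 12-hour clock, p = AM/PM indicator
parsed <- parse_date_time(x, orders = "mdY IMS p")
parsed

# If only the calendar day matters, as_date() drops the time component
as_date(parsed)
```

Since every time in the sample rows is midnight, converting to a plain date with `as_date()` may be all that later tasks need.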


**Task 4:** 
- Create a list of all of the unique names in the `LOCATION` variable. 
- Output the first 5 lines to the screen. 

Hint: distinct().

In [4]:
## 4.  Develop your code for Task 4 here.
## ---------------------------------------

unique.names <- distinct(shooting.victims, LOCATION)
head(unique.names, 5L)



LOCATION
<chr>
2800 Block of N. Bonsall St.
1700 Block of W. Courtland St.
2200 Block of N. Lippencott St.
4200 Block of N. Carlisle St.
3700 Block of N. Germantown Ave.


**Task 5:** 
- Examine the list of unique location names that you generated more closely, looking at more lines this time. What if we needed a list of the unique street names, not just the unique location references? Do any patterns emerge that could help us parse out the street names? Do you see any instances where streets have different spellings or naming conventions?  
- List three or more issues below (as comments) so you can get a feel for the types of inconsistencies that may need to be remedied before this info can be used more reliably.

In [5]:
## 5.  Develop your code for Task 5 here.
## ---------------------------------------

# Pattern: so far, every entry has "Block of" between the block number and the street.
# Inconsistencies:
#   1. Some streets have a direction prefix (N., W.) and others do not; some use a lowercase "n.".
#   2. Some entries contain an "&", and some have no block number.
#   3. "Street" is written as "ST", "St." or "St", and in some entries "St" is
#      attached directly to the street name.
unique.names <- distinct(shooting.victims, LOCATION)
head(unique.names, 10L)


LOCATION
<chr>
2800 Block of N. Bonsall St.
1700 Block of W. Courtland St.
2200 Block of N. Lippencott St.
4200 Block of N. Carlisle St.
3700 Block of N. Germantown Ave.
3900 Block of Priscilla St.
2500 Block of W. Allegheny Ave.
2700 Block of N. 29th St.
2800 Block of Stillman St.
3700 Block of N. 21st St.
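
Observations like the ones in the comments above can be checked more systematically by counting how many locations match each pattern. The following is a rough sketch using base R's `grepl()`. The `locs` vector is invented for illustration and is not taken from the file, and the regular expressions are only one possible way to express each check.

```r
# Toy location strings (made up for illustration, not taken from the file)
locs <- c("2800 Block of N. Bonsall St.",
          "1700 Block of n. Courtland ST",
          "Broad St. & Erie Ave.",
          "3900 Block of Priscilla St")

# Count how many strings show each pattern
checks <- c(
  has_block_of  = sum(grepl(" Block of ", locs, fixed = TRUE)),
  has_ampersand = sum(grepl("&", locs, fixed = TRUE)),
  lowercase_dir = sum(grepl("\\b[nsew]\\.", locs, perl = TRUE)),
  st_no_period  = sum(grepl("\\bS[Tt]$", locs, perl = TRUE))
)
checks
```

On the real data, `locs` would be replaced with `unique.names$LOCATION`.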


**Task 6:** 
- Develop some code that separates the location into two new fields: `block`, holding the block info, and `street`, holding the street name. 
    - In this case, let's split on the presence of the word ` of`. Of course, the names with an `&` in them won't split properly, but we can deal with that later. 
- Output the first 5 lines of the data frame to the screen. 

Hint: library(); separate().

In [6]:
## 6.  Develop your code for Task 6 here.
## ---------------------------------------
suppressMessages(library('tidyr'))

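# Split on the whole word " of " so that an "of" inside a street name is not treated as a separator
block.street <- unique.names %>% separate(LOCATION, c('BLOCK', 'STREET'), sep = ' of ')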
head(block.street, 5L)



“Expected 2 pieces. Missing pieces filled with `NA` in 21 rows [214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 231, 1263, 1264, 1265, ...].”

BLOCK,STREET
<chr>,<chr>
2800 Block,N. Bonsall St.
1700 Block,W. Courtland St.
2200 Block,N. Lippencott St.
4200 Block,N. Carlisle St.
3700 Block,N. Germantown Ave.


**Task 7:** 
- Let's deal with the streets with `&` separating their names. Create a new column named `street2` and set it equal to NA. 
- Then, iterate over the data frame using a `for` loop, testing if the `street` variable you created earlier contains an NA value.  
    - In cases where this occurs, separate the names in `block` according to the `&` delimiter into the fields `street` and `street2` accordingly.  
- Output the first 5 lines of the data frame to the screen. 

Hint: mutate(); for; if; :; nrow(); is.na(); strsplit(); unlist().

In [14]:
## 7.  Develop your code for Task 7 here.
## ---------------------------------------

# Add an all-NA character column to hold the second street of an intersection
new.street <- block.street %>% mutate(STREET2 = NA_character_)
head(new.street, 5L)


BLOCK,STREET,STREET2
<chr>,<chr>,<chr>
2800 Block,N. Bonsall St.,NA
1700 Block,W. Courtland St.,NA
2200 Block,N. Lippencott St.,NA
4200 Block,N. Carlisle St.,NA
3700 Block,N. Germantown Ave.,NA


In [15]:
# Rows that separate() could not split have STREET == NA and the whole
# location (e.g. an "A & B" intersection) left in BLOCK
for (row in 1:nrow(new.street)) {
    if (is.na(new.street$STREET[row])) {
        parts <- unlist(strsplit(new.street$BLOCK[row], split = "&", fixed = TRUE))
        new.street$STREET[row] <- parts[1]
        # parts[2] is NA when the entry contains no "&"
        new.street$STREET2[row] <- parts[2]
    }
}

head(new.street, 5L)


**Task 8:** 
- Clean up the street names by trimming the whitespace from the `street` and `street2` variables. You'll need to look up the function `trimws()` and work out what it does.
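
As a starting point, here is a small sketch of what `trimws()` does. The `streets` data frame is a toy stand-in invented for illustration; its column names follow the `STREET` and `STREET2` columns used earlier in this notebook.

```r
# Toy data frame with the kind of padding that splitting on "of" or "&" can leave behind
streets <- data.frame(STREET  = c(" N. Bonsall St.", "Broad St. "),
                      STREET2 = c(NA, " Erie Ave."),
                      stringsAsFactors = FALSE)

# trimws() strips leading and trailing whitespace (which = "both" is the default)
# and leaves NA values as NA
streets$STREET  <- trimws(streets$STREET)
streets$STREET2 <- trimws(streets$STREET2)
streets
```

Whitespace inside a name, such as the space in "Bonsall St.", is left untouched.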

In [None]:
## 8.  Develop your code for Task 8 here.
## ---------------------------------------





**Task 9:** 
- Continue cleaning up the street names by developing some code that loops through the data frame and checks for some common problems in the street names and that standardizes the naming conventions a bit. 
    - For example, find instances of `N` and `N.` meaning north and change them so they are all `N.`.  
    - Make as many corrections as you need to feel comfortable with this task...no need to overdo it though. This is just practice. For a real analysis you would need to be thorough. 
    
Hint: for; if; :; nrow(); length(); regexpr(); sub().
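
One possible shape for such a loop is sketched below. It handles only two conventions (a leading compass letter and a trailing "St"), the `streets` vector is invented for illustration, and the regular expressions are assumptions about what the data looks like rather than a complete rule set.

```r
# Toy street names showing a few inconsistent conventions (illustrative values)
streets <- c("N Bonsall St.", "n. Courtland ST", "W. Allegheny Ave.", "Priscilla St")

for (i in seq_along(streets)) {
  s <- streets[i]
  # A leading compass letter, with or without a period, becomes e.g. "N."
  # (\\U upper-cases the captured letter; this needs perl = TRUE)
  s <- sub("^([NSEWnsew])\\.? ", "\\U\\1. ", s, perl = TRUE)
  # A trailing "ST", "St" or "St." becomes "St."
  s <- sub("\\b[Ss][Tt]\\.?$", "St.", s, perl = TRUE)
  streets[i] <- s
}
streets
```

Because `sub()` is vectorised, the same two calls would also work on the whole column at once; the loop is kept here only to match the hint.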

In [None]:
## 9.  Develop your code for Task 9 here.
## ---------------------------------------






    


**Task 10:** 
- Add a new column to the data frame named `dayofweek`. Set the value of this new variable to the name of the day of the week on which the shooting was reported (e.g., Monday). 
- Output the first 5 lines of the data frame to the screen. 

Hint: mutate(); wday().
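
A minimal sketch of the `mutate()` plus `wday()` combination follows. The `shootings` data frame is a toy stand-in with a `datevar` column like the one built in Task 3, and the day names assume an English locale.

```r
library(dplyr)
library(lubridate)

# Toy data frame standing in for the shooting data (illustrative dates)
shootings <- data.frame(datevar = ymd(c("2015-05-16", "2015-05-25", "2015-06-03")))

# label = TRUE returns the day name instead of a number;
# abbr = FALSE gives "Monday" rather than "Mon"
shootings <- shootings %>%
  mutate(dayofweek = wday(datevar, label = TRUE, abbr = FALSE))
shootings
```

Note that `wday(..., label = TRUE)` returns an ordered factor, which keeps the days in calendar order when plotting or tabulating.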

In [None]:
## 10.  Develop your code for Task 10 here.
## ---------------------------------------






**Task 11:** 
- For each shooting, compute a new variable `mindays` representing the number of days until the next shooting occurred. 
- Arrange the results such that `mindays` is reported in decreasing order. 
- Output the first 15 lines of the data frame to the screen.  

Hint: for; :; nrow(); filter(); min(); arrange(); select records with dates greater than the date of the current record.
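
The hint can be turned into a loop along the following lines. This is a sketch on a toy `shootings` data frame invented for illustration, not the full data. It follows the hint literally by looking only at dates strictly greater than the current one, so two shootings on the same day both get the gap to the next later date, and the latest records get `NA`.

```r
library(dplyr)

# Toy data frame standing in for the shooting data (illustrative dates)
shootings <- data.frame(datevar = as.Date(c("2015-05-16", "2015-05-25", "2015-06-03",
                                            "2015-06-13", "2015-06-13")))
shootings$mindays <- NA_real_

for (i in 1:nrow(shootings)) {
  current <- shootings$datevar[i]
  # Keep only records dated strictly after the current one
  later <- filter(shootings, datevar > current)
  if (nrow(later) > 0) {
    # Gap in days to the earliest later record
    shootings$mindays[i] <- as.numeric(min(later$datevar) - current, units = "days")
  }
}

# Largest gaps first; arrange() places NA values at the end
shootings %>% arrange(desc(mindays)) %>% head(15)
```

Filtering the whole data frame once per row is slow on large data, but it matches the iteration practice this task is aiming at.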

In [None]:
## 11.  Develop your code for Task 11 here.
## ---------------------------------------







**Task 12:** 
- Based on the primary street variable that was refined earlier, list the top 20 streets associated with the most shootings in this dataset. 

Hint: count(); arrange(); head().
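
The `count()`, `arrange()`, `head()` chain from the hint might look like the sketch below. The `streets` data frame is invented for illustration. One caveat: in this notebook `block.street` was built from the distinct locations, so counting rows there would count distinct blocks rather than shootings; the street column would need to exist on the full data frame for the counts to reflect shootings.

```r
library(dplyr)

# Toy data frame with one row per shooting (illustrative street names)
streets <- data.frame(STREET = c("N. Broad St.", "W. Allegheny Ave.", "N. Broad St.",
                                 "Germantown Ave.", "N. Broad St.", "W. Allegheny Ave."),
                      stringsAsFactors = FALSE)

top.streets <- streets %>%
  count(STREET) %>%      # one row per street, with the tally in column n
  arrange(desc(n)) %>%   # most frequent first
  head(20)               # keep at most the top 20
top.streets
```

`count(STREET, sort = TRUE)` is a shorter equivalent of the `count()` plus `arrange()` pair.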

In [None]:
## 12.  Develop your code for Task 12 here.
## ---------------------------------------






# SAVE YOUR NOTEBOOK, then `File > Close and Halt`

---