In [4]:
library(plyr)
library(tidyverse)
library(infer)
library(repr)
library(stringr)
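# Install janitor only if it is not already available
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")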
library(janitor)


theme_stat201 <- function (width = 12, height = 5) { 
    options(repr.plot.width = width, repr.plot.height = height)
    theme_bw(base_size = 14) %+replace% 
        theme(
            plot.title = element_text(hjust = 0.5) 
        )
}

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::failwith()  masks pl

In [5]:
# Download the file to wherever your Jupyter notebook is located
url <- "https://geodash.vpd.ca/opendata/crimedata_download/crimedata_csv_all_years.zip"
filename <- "crime_data.zip"
download.file(url, destfile = filename)

# Data comes in as a zip, so we'll need to extract it
unzip("crime_data.zip")

# Read in the desired file
crime_data <- read_csv("crimedata_csv_all_years.csv")

Parsed with column specification:
cols(
  TYPE = col_character(),
  YEAR = col_double(),
  MONTH = col_double(),
  DAY = col_double(),
  HOUR = col_double(),
  MINUTE = col_double(),
  HUNDRED_BLOCK = col_character(),
  NEIGHBOURHOOD = col_character(),
  X = col_double(),
  Y = col_double()
)



In [6]:
# Crime types counted as theft in this analysis
theft_crimes <- c("Other Theft", "Theft from Vehicle", 
                  "Theft of Bicycle", "Theft of Vehicle")



In [7]:
# Standardise column names to snake_case (e.g. NEIGHBOURHOOD becomes neighbourhood)
crime_data <- crime_data %>% clean_names()
head(crime_data)

total_rows <- crime_data %>% nrow()
print(sprintf("There are %d rows in the data frame", total_rows))

na_neighbourhoods <- sum(is.na(crime_data$neighbourhood))
print(sprintf("Originally, there were %d NA values in the neighbourhood column", na_neighbourhoods))

# Note: na.omit() drops rows with an NA in ANY column, not only neighbourhood
crime_data <- na.omit(crime_data)

na_neighbourhoods <- sum(is.na(crime_data$neighbourhood))
print(sprintf("Now, there are %d NA values in the neighbourhood column", na_neighbourhoods))

# Keep only the columns and the years (2017 to 2020) used in the analysis
crime_data <- crime_data %>% 
    select(type, year, neighbourhood) %>%
    filter(2017 <= year & year <= 2020)

# Store the categorical columns as factors
crime_data <- crime_data %>%
    mutate(type = as_factor(type),
           neighbourhood = as_factor(neighbourhood))




type,year,month,day,hour,minute,hundred_block,neighbourhood,x,y
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
Break and Enter Commercial,2012,12,14,8,52,,Oakridge,491285.0,5453433
Break and Enter Commercial,2019,3,7,2,6,10XX SITKA SQ,Fairview,490613.0,5457110
Break and Enter Commercial,2019,8,27,4,12,10XX ALBERNI ST,West End,491007.8,5459174
Break and Enter Commercial,2014,8,8,5,13,10XX ALBERNI ST,West End,491015.9,5459166
Break and Enter Commercial,2020,7,28,19,12,10XX ALBERNI ST,West End,491015.9,5459166
Break and Enter Commercial,2005,11,14,3,9,10XX ALBERNI ST,West End,491021.4,5459161


[1] "There are 668167 rows in the data frame"
[1] "Originally, there were 70135 NA values in the neighbourhood column"
[1] "Now, there are 0 NA values in the neighbourhood column"


# MERGE EVERYTHING BELOW

The items above should be duplicates.

Make sure to merge the cell directly below, as I needed to convert the neighbourhood values into strings so that the scripts below can parse them.

In [8]:
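# Neighbourhood names as character strings (one entry per row of crime_data)
neighbourhoods <- 
    crime_data %>%
    mutate(neighbourhood = as.character(neighbourhood)) %>%
    pull(neighbourhood)
    
# Distinct neighbourhood names, in order of first appearance
neighbourhoods_list <- unique(neighbourhoods)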


    

## Past Data and Calculating our P0

First, let's look at the proportion of reported crimes that were thefts in the three years before Covid (2017, 2018, 2019). Our null hypothesis states that this proportion did not change between those years and the Covid year (2020), so we use the past proportion of thefts as our null-hypothesis value, P0.

We will generate the proportion of reported thefts in each neighbourhood below by:
- First, creating a function that filters the data to the past years (2017, 2018, 2019) and calculates the theft proportion, the theft count, and the total number of reported crimes for one neighbourhood
- Then, passing every neighbourhood into the function
- Finally, consolidating the results for each neighbourhood into a tibble with the following columns:
    - neighbourhood, prop, count, total
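
The steps above can be sketched on a tiny made-up data set. This is only an illustration in base R (so it runs without any packages); the names `toy_crimes`, `toy_theft_types`, and `toy_summary` and all of the values are hypothetical and are not part of the Vancouver data.

```r
# Toy data standing in for crime_data (values are made up for illustration)
toy_crimes <- data.frame(
    neighbourhood = c("A", "A", "A", "B", "B", "B", "B"),
    year          = c(2017, 2018, 2019, 2017, 2018, 2019, 2020),
    type          = c("Other Theft", "Mischief", "Theft of Bicycle",
                      "Mischief", "Mischief", "Theft of Vehicle", "Other Theft"),
    stringsAsFactors = FALSE
)
toy_theft_types <- c("Other Theft", "Theft from Vehicle",
                     "Theft of Bicycle", "Theft of Vehicle")

# Step 1: keep only the baseline years and flag which reports are thefts
toy_past <- toy_crimes[toy_crimes$year %in% c(2017, 2018, 2019), ]
toy_past$is_theft <- toy_past$type %in% toy_theft_types

# Step 2: per neighbourhood, count the thefts and count all reports
counts <- tapply(toy_past$is_theft, toy_past$neighbourhood, sum)
totals <- tapply(toy_past$is_theft, toy_past$neighbourhood, length)

# Step 3: consolidate into one table with the same columns as the tibble
toy_summary <- data.frame(
    neighbourhood = names(counts),
    prop  = as.numeric(counts / totals),
    count = as.numeric(counts),
    total = as.numeric(totals),
    stringsAsFactors = FALSE
)
toy_summary
```

The 2020 row for neighbourhood "B" is excluded by the year filter, so both toy neighbourhoods end up with three baseline reports each.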

In [10]:
#Script to generate the Past Proportion for Theft in Vancouver (2017,2018,2019)

set.seed(12345)

# Returns a one-row tibble (stat, count, total) for one neighbourhood over 2017-2019
past_prop <- function(neigh) {
    crime_data_past <- 
        crime_data %>%
        filter(neighbourhood == neigh) %>%
        filter(year %in% c(2017, 2018, 2019))

    theft_past_prop <-
        crime_data_past %>%
        summarize(stat = mean(type %in% theft_crimes),      # proportion of reports that are thefts
                  count = sum(type %in% theft_crimes)) %>%  # number of theft reports
        mutate(total = nrow(crime_data_past))               # number of all reports

    return(theft_past_prop)
}

# Neighbourhood names as character strings (one entry per row of crime_data)
neighbourhoods <- 
    crime_data %>%
    mutate(neighbourhood = as.character(neighbourhood)) %>%
    pull(neighbourhood)
    
# Distinct neighbourhood names, in order of first appearance
neighbourhoods_list <- unique(neighbourhoods)

# Start with a placeholder row to fix the column types; it is removed after the loop
past_prop_neigh <- tibble(neighbourhood = "", prop = 0, count = 0, total = 0)

for (neigh in neighbourhoods_list) {
    stats <- past_prop(neigh)
    past_prop_neigh <- add_row(past_prop_neigh, neighbourhood = neigh,
                               prop = stats$stat, count = stats$count, total = stats$total)
}

# Drop the placeholder row
past_prop_neigh <- past_prop_neigh[-1, ]

past_prop_neigh



neighbourhood,prop,count,total
<chr>,<dbl>,<dbl>,<dbl>
Fairview,0.6891537,3933,5707
West End,0.7351588,6712,9130
Central Business District,0.7533982,24997,33179
Hastings-Sunrise,0.6451613,2500,3875
Strathcona,0.5577118,3305,5926
Grandview-Woodland,0.6237392,3463,5552
Mount Pleasant,0.6768553,4916,7263
Sunset,0.589672,1690,2866
Kensington-Cedar Cottage,0.6043016,2613,4324
Stanley Park,0.784,392,500


## Current (2020) Data and Calculating our P_hat

Next, we will look at the proportion of reported crimes that were thefts in the Covid year (2020). We will generate our sample proportion, P_hat, from the thefts reported in 2020.

The following function is similar to the one for the past data, but modified to produce the proportion of reported crimes that were thefts in 2020.

We will generate the proportion of reported thefts in each neighbourhood below by:
- First, creating a function that filters the data to the Covid year (2020) and calculates the theft proportion, the theft count, and the total number of reported crimes for one neighbourhood
- Then, passing every neighbourhood into the function
- Finally, consolidating the results for each neighbourhood into a tibble with the following columns:
    - neighbourhood, prop, count, total

In [13]:
#Script to generate the Current Proportion for Theft in Vancouver (2020)

set.seed(12345)

# Returns a one-row tibble (stat, count, total) for one neighbourhood in 2020
curr_prop <- function(neigh) {
    crime_data_curr <- 
        crime_data %>%
        filter(neighbourhood == neigh) %>%
        filter(year == 2020)

    theft_curr_prop <-
        crime_data_curr %>%
        summarize(stat = mean(type %in% theft_crimes),      # proportion of reports that are thefts
                  count = sum(type %in% theft_crimes)) %>%  # number of theft reports
        mutate(total = nrow(crime_data_curr))               # number of all reports

    return(theft_curr_prop)
}

# Start with a placeholder row to fix the column types; it is removed after the loop
curr_prop_neigh <- tibble(neighbourhood = "", prop = 0, count = 0, total = 0)

for (neigh in neighbourhoods_list) {
    stats <- curr_prop(neigh)
    curr_prop_neigh <- add_row(curr_prop_neigh, neighbourhood = neigh,
                               prop = stats$stat, count = stats$count, total = stats$total)
}

# Drop the placeholder row
curr_prop_neigh <- curr_prop_neigh[-1, ]
curr_prop_neigh

neighbourhood,prop,count,total
<chr>,<dbl>,<dbl>,<dbl>
Fairview,0.619474,1107,1787
West End,0.633131,1460,2306
Central Business District,0.6046934,4664,7713
Hastings-Sunrise,0.5693642,591,1038
Strathcona,0.4385246,749,1708
Grandview-Woodland,0.5506849,804,1460
Mount Pleasant,0.5719462,1105,1932
Sunset,0.5651163,486,860
Kensington-Cedar Cottage,0.6161616,793,1287
Stanley Park,0.5074627,34,67


From a quick glance at the data, there seems to be a difference in the proportion of thefts between 2020 and the past years. In fact, quite a few neighbourhoods seem to show a decrease in the proportion of thefts!

However, we should not assume that these changes are significant without first running a significance test using a Z-test and p-values.

## Z-Test of Significance for the Change in the Proportion of Thefts between the Past Years and the Covid Year

Let's look at whether the changes in the proportion of thefts between the past years and the Covid year are statistically significant.

The script below combines the Past and Current Proportion tibbles above, keeping the neighbourhood, the past proportion, the current proportion, and the 2020 sample size for each neighbourhood.

We then calculate the test statistic (a one-sample Z-score that treats the past proportion as the null value) and use pnorm to calculate the two-sided p-value for the difference between the proportion of reported thefts in the past years and in 2020.

Next, we test the significance of the difference at the 5% and 10% levels, as discussed in our methodology.
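
As a worked example of the calculation described above, the sketch below computes the statistic for Fairview alone, using the Fairview figures from the Past and Current Proportion tables. The variable names (`p0`, `x`, `n`, `p_hat`, `se_null`) are chosen here for illustration. It assumes, as the one-sample test does, that the past proportion is a fixed null value rather than an estimate with its own sampling error.

```r
# Fairview figures copied from the Past and Current Proportion tables
p0 <- 0.6891537   # proportion of thefts in 2017-2019, used as the null value
x  <- 1107        # reported thefts in 2020
n  <- 1787        # all reported crimes in 2020

p_hat <- x / n    # sample proportion for 2020

# Standard error under the null hypothesis, then the one-sample Z-score
se_null <- sqrt(p0 * (1 - p0) / n)
z <- (p_hat - p0) / se_null

# Two-sided p-value: twice the tail area beyond |z| under the standard normal
p_value <- 2 * pnorm(-abs(z))

c(p_hat = p_hat, z = z, p_value = p_value)
```

The Z-score comes out at about -6.36, so Fairview's 2020 theft proportion sits more than six null standard errors below its 2017-2019 value.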

In [17]:
#Combines the Current and Past and produce the Test Statistics and P Value, along with Rejection value

combine_past_curr_p_stat <- past_prop_neigh %>% 
    select(neighbourhood, past_prop = prop) %>%
    # Match rows by neighbourhood name instead of relying on identical row order
    inner_join(curr_prop_neigh %>%
                   select(neighbourhood, curr_prop = prop, curr_count_total = total),
               by = "neighbourhood") %>%
    mutate(
          # One-sample Z-score: past_prop is the null value, 2020 is the sample
          stat = (curr_prop - past_prop) / sqrt(past_prop * (1 - past_prop) / curr_count_total),
          # Two-sided p-value
          p_value = 2 * pnorm(-abs(stat)),
          reject5 = p_value < 0.05,
          reject10 = p_value < 0.10)

combine_past_curr_p_stat

neighbourhood,past_prop,curr_prop,curr_count_total,stat,p_value,reject5,reject10
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>
Fairview,0.6891537,0.619474,1787,-6.3641087,1.96427e-10,True,True
West End,0.7351588,0.633131,2306,-11.103632,1.204476e-28,True,True
Central Business District,0.7533982,0.6046934,7713,-30.2988702,1.186309e-201,True,True
Hastings-Sunrise,0.6451613,0.5693642,1038,-5.1038956,3.327319e-07,True,True
Strathcona,0.5577118,0.4385246,1708,-9.9178129,3.4830430000000004e-23,True,True
Grandview-Woodland,0.6237392,0.5506849,1460,-5.7620361,8.310522e-09,True,True
Mount Pleasant,0.6768553,0.5719462,1932,-9.8598517,6.214108e-23,True,True
Sunset,0.589672,0.5651163,860,-1.4639673,0.1432029,False,False
Kensington-Cedar Cottage,0.6043016,0.6161616,1287,0.8700947,0.3842486,False,False
Stanley Park,0.784,0.5074627,67,-5.5005515,3.786052e-08,True,True
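
As a cross-check of the Fairview row in the table above, base R's `prop.test()` can run the same one-sample test. This is a sketch, not part of the original analysis: with `correct = FALSE` (no continuity correction) the reported X-squared statistic is the square of the Z-score, so its square root should match the absolute value of `stat`, and the p-value should match `p_value`.

```r
# Fairview: 1107 thefts out of 1787 reports in 2020, null value from 2017-2019
fairview_test <- prop.test(x = 1107, n = 1787, p = 0.6891537,
                           alternative = "two.sided", correct = FALSE)

# X-squared on 1 degree of freedom equals z^2, so take the square root
abs_z <- sqrt(unname(fairview_test$statistic))

c(abs_z = abs_z, p_value = fairview_test$p.value)
```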


For quite a few neighbourhoods, we reject the null hypothesis that the proportion of reported thefts did not change, at both the 10% and 5% significance levels.

Let's have a closer look at the neighbourhoods where we fail to reject the null hypothesis at the 5% significance level, including those where we also fail to reject it at 10%:

In [29]:
# Neighbourhoods where we fail to reject the null at the 5% level.
# p_value < 0.05 implies p_value < 0.10, so filtering on reject5 alone is enough.
fail_to_reject_neighbourhoods <- combine_past_curr_p_stat %>%
    filter(!reject5)

fail_to_reject_neighbourhoods

print("Count of neighbourhoods where we fail to reject the null:") 
nrow(fail_to_reject_neighbourhoods)

print("Proportion of those neighbourhoods with a decrease in the proportion of thefts:") 
mean(fail_to_reject_neighbourhoods$stat < 0)

neighbourhood,past_prop,curr_prop,curr_count_total,stat,p_value,reject5,reject10
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>
Sunset,0.589672,0.5651163,860,-1.4639673,0.14320288,False,False
Kensington-Cedar Cottage,0.6043016,0.6161616,1287,0.8700947,0.38424864,False,False
Shaughnessy,0.4612766,0.4364641,362,-0.9470247,0.34362619,False,False
Marpole,0.5688531,0.5661664,733,-0.1468758,0.88323008,False,False
Oakridge,0.5432873,0.4968354,316,-1.6577176,0.09737447,False,True
Victoria-Fraserview,0.6036842,0.5797101,483,-1.0771834,0.28139833,False,False
Kerrisdale,0.4675528,0.5106952,374,1.6721938,0.09448611,False,True
West Point Grey,0.56745,0.5804878,410,0.5328634,0.59412814,False,False
Arbutus Ridge,0.5178026,0.4740061,327,-1.5849626,0.11297483,False,False
Killarney,0.6045561,0.5986395,441,-0.2541161,0.79940588,False,False


[1] "Count of neighbourhoods where we fail to reject the null:"


[1] "Proportion of those neighbourhoods with a decrease in the proportion of thefts:"


Out of 24 neighbourhoods, there are 11 for which we fail to reject the null hypothesis at the 5% significance level. Some of these, such as Oakridge and Kerrisdale, are still rejected at the 10% level.

One thing to note is that about 72% of those neighbourhoods show a decrease in the proportion of reported thefts. Let's see whether that is also true across all of the neighbourhoods.

In [31]:
print("Proportion of neighbourhoods with a decrease in the proportion of thefts:") 
mean(combine_past_curr_p_stat$stat < 0)

# Computed from the data instead of hard-coding (24-11)/24
print("Proportion of neighbourhoods where we reject the null hypothesis at the 5% level:") 
mean(combine_past_curr_p_stat$reject5)

[1] "Proportion of neighbourhoods with a decrease in the proportion of thefts:"


[1] "Proportion of neighbourhoods where we reject the null hypothesis at the 5% level:"


It seems that the proportion of reported thefts decreased in a majority of neighbourhoods in Vancouver.

Overall, we reject the null hypothesis at the 5% level for around 54% of the neighbourhoods in Vancouver (13 of 24).

### Conclusion of using the Z-Test to test the significance of the change in reported thefts in Vancouver

Looking at the data, we reject the null hypothesis for a majority of neighbourhoods (around 54%) at the 5% level. The null hypothesis is that the proportion of reported thefts did not change between the three years before Covid (2017, 2018, 2019) and the Covid year (2020).

One thing to note is that a large majority of the neighbourhoods, whether or not the null hypothesis was rejected, seem to show a decrease in the proportion of reported thefts in 2020 compared to the past years.

One question we can explore further is the relationship between Covid and the decrease in the proportion of thefts. Although only a small majority of neighbourhoods had a statistically significant change, we can look into why a large majority of neighbourhoods saw a decrease in the proportion of reported thefts in 2020.