# **Theft in West Point Grey and Dunbar-Southlands**

**Group 34:** Andy Hu, Wendi Ke, CC Liang, and Mridul Manas

<img src="https://raw.githubusercontent.com/fankayii/STAT201_34/main/images/theft.jpg"/>

Source: https://raw.githubusercontent.com/fankayii/STAT201_34/main/images/theft.jpg

# 1. Introduction
Crime is a concern in every community, and theft is the most common type of crime in Canada ([Government of Canada](https://www.justice.gc.ca/eng/cj-jp/state-etat/2019rpt-rap2019/p7.html)). Section 322 of the Canadian Criminal Code defines "theft" as taking or converting someone's property "fraudulently and without colour of right" ([Criminal Code](https://laws-lois.justice.gc.ca/eng/acts/c-46/section-322.html)). Understanding crime statistics is crucial to enhancing community relations, measuring prevention initiatives, and minimizing risks by making better decisions ([Vancouver Police Department](https://vpd.ca/crime-statistics/)). In this paper, we study the proportion of crimes that are thefts in Dunbar-Southlands and West Point Grey, the two neighbourhoods closest to the University of British Columbia Vancouver campus ([UBC Vantage College](https://vantagecollege.ubc.ca/blog/your-guide-neighborhoods-vancouver)).

### Research Question
Is the proportion of theft occurring in the neighbourhood of West Point Grey higher than in Dunbar-Southlands?

### Variables
The variable of interest is whether a recorded crime is a theft, compared between the neighbourhoods of Dunbar-Southlands and West Point Grey. The difference in theft proportions is our location parameter, and its standard error is our scale parameter.

### Hypotheses
- Null Hypothesis $H_0$: There is no difference between the proportion of theft in the neighbourhoods of Dunbar-Southlands and West Point Grey.
- Alternative Hypothesis $H_A$: The proportion of theft in West Point Grey is higher than in Dunbar-Southlands.

| Null Hypothesis $H_0$ | Alternative Hypothesis $H_A$ |
| --- | ----------- |
|  $$H_0: p_w - p_d = 0$$ | $$H_A: p_w - p_d > 0$$ |

### Dataset Description
To conduct our research, we use the [Vancouver Police Department (VPD) crime data](https://geodash.vpd.ca/opendata/), which includes information on the different types of crimes in Vancouver from 2003 to 2023. We focus on crimes within the last 5 complete years, 2018 to 2022, in the neighbourhoods of Dunbar-Southlands and West Point Grey so that our research rests on recent and relevant information.

# 2. Methods and Results

### Exploratory Data Analysis

The `tidyverse`, `infer`, and `broom` packages allow us to clean and wrangle data, create visualizations, and make statistical inferences. 

In [1]:
library(tidyverse)
library(infer)
library(broom)

options(repr.plot.width = 10, repr.plot.height = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


As we have uploaded the dataset to our GitHub repository, we can read the csv file from our GitHub link.

In [2]:
crime <- read.csv("https://raw.githubusercontent.com/fankayii/STAT201_34/main/data/crime.csv")
head(crime)

| | TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | Theft from Vehicle | 2008 | 12 | 7 | 18 | 0 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |
| 2 | Theft from Vehicle | 2009 | 8 | 28 | 19 | 0 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |
| 3 | Theft from Vehicle | 2012 | 7 | 25 | 12 | 0 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |
| 4 | Theft from Vehicle | 2014 | 5 | 8 | 12 | 49 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |
| 5 | Theft from Vehicle | 2014 | 10 | 19 | 18 | 0 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |
| 6 | Theft from Vehicle | 2015 | 2 | 18 | 18 | 30 | 11XX E HASTINGS ST | Strathcona | 494141.1 | 5458690 |


First, we check for any NA values in our dataset.

In [5]:
print(sum(is.na(crime)))
head(crime[!complete.cases(crime), ])

[1] 146


| | TYPE | YEAR | MONTH | DAY | HOUR | MINUTE | HUNDRED_BLOCK | NEIGHBOURHOOD | X | Y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 310264 | Vehicle Collision or Pedestrian Struck (with Injury) | 2003 | 6 | 22 | 18 | 58 | 0X TERMINAL AV / QUEBEC ST | | NA | NA |
| 311248 | Vehicle Collision or Pedestrian Struck (with Injury) | 2004 | 11 | 7 | 18 | 24 | 13XX PACIFIC BLVD / 198 DRAKE ST | | NA | NA |
| 311366 | Vehicle Collision or Pedestrian Struck (with Injury) | 2003 | 9 | 20 | 17 | 15 | 14XX BLOCK S E MARINE DR | | NA | NA |
| 311484 | Vehicle Collision or Pedestrian Struck (with Injury) | 2003 | 8 | 31 | 21 | 3 | 14XX W KING EDWARD AV / GRANVILLE ST | | NA | NA |
| 311596 | Vehicle Collision or Pedestrian Struck (with Injury) | 2004 | 10 | 5 | 17 | 56 | 15XX BLOCK W 70TH AV / 8600 GRANVILLE ST | | NA | NA |
| 311649 | Vehicle Collision or Pedestrian Struck (with Injury) | 2003 | 5 | 30 | 18 | 0 | 15XX W 66TH AV / 8298 GRANVILLE ST | | NA | NA |


We demonstrate below how to remove the rows containing the 146 NA values, which occur in the `X` and `Y` columns, but we leave these rows in for the analysis because the coordinates are not used.

In [None]:
na.omit(crime) %>% 
    head()
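
To see where missing values sit before deciding whether to drop them, a per-column count can help. The sketch below uses a small hypothetical data frame (`toy` and its values are made up for illustration, not taken from the VPD data) and also shows that blank strings are not counted as `NA`.

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
    TYPE = c("Mischief", "Other Theft", "Theft of Vehicle"),
    NEIGHBOURHOOD = c("Strathcona", "", "Kitsilano"),
    X = c(494141.1, NA, 490000.0),
    Y = c(5458690, NA, 5457000),
    stringsAsFactors = FALSE
)

colSums(is.na(toy))            # number of NA values in each column
nrow(na.omit(toy))             # rows remaining after dropping incomplete rows
sum(toy$NEIGHBOURHOOD == "")   # blank strings are a separate issue from NA
```

Applied to `crime`, the same `colSums(is.na(...))` call would show which columns account for the total NA count.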

Next, we filter for our years and neighbourhoods of interest.

In [None]:
crime_overall_recent <- crime %>% 
    filter(YEAR >= 2018 & YEAR <= 2022) %>%
    select(TYPE, NEIGHBOURHOOD)

colnames(crime_overall_recent) <- c('type', 'neighbourhood')
head(crime_overall_recent)

To study theft crime proportions, we will also be grouping all the types of crime into exclusively one of `theft` or `not theft` using the `mutate` and `case_when` functions.

In [None]:
crime_overall <- crime_overall_recent %>%
    mutate(type = case_when(
    type %in% c("Other Theft", "Theft from Vehicle", "Theft of Bicycle", "Theft of Vehicle") ~ "theft",
    TRUE ~ "not_theft"))

head(crime_overall)
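
As a small illustration of how `case_when` assigns the two labels, the sketch below recodes a hypothetical four-row data frame. The name `toy_crime` and its rows are assumptions made up for this example; only the four theft categories come from the cell above.

```r
library(dplyr)

# Hypothetical toy data, not drawn from the crime dataset
toy_crime <- data.frame(
    type = c("Other Theft", "Mischief", "Theft of Bicycle", "Break and Enter Commercial"),
    neighbourhood = c("West Point Grey", "West Point Grey",
                      "Dunbar-Southlands", "Dunbar-Southlands"),
    stringsAsFactors = FALSE
)

theft_types <- c("Other Theft", "Theft from Vehicle", "Theft of Bicycle", "Theft of Vehicle")

toy_recoded <- toy_crime %>%
    mutate(type = case_when(
        type %in% theft_types ~ "theft",   # any of the four theft categories
        TRUE ~ "not_theft"                 # every other crime type falls through here
    ))

toy_recoded
```

Conditions in `case_when` are evaluated in order, so the `TRUE` branch acts as a catch-all for anything not matched earlier.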

Let us visualize all our theft data below.

In [None]:
ggplot(crime_overall, aes(y = reorder(neighbourhood, -table(neighbourhood)[neighbourhood]), fill = type)) + 
    geom_bar() + 
    labs(x = "Crime Count",
         y = "Neighbourhood",
         fill = "Type of Crime",
         title = "Figure 1. Counts of Theft Crime in All of Vancouver") + 
    scale_fill_manual(labels = c('Not Theft', 'Theft'), values = c("#56B4E9", "#009E73")) + 
    theme(text = element_text(size = 12))

We quickly notice that the Central Business District is an outlier, with far more crime (both theft and non-theft) than any other neighbourhood in Vancouver. West Point Grey and Dunbar-Southlands have little crime in comparison, and their total crime counts and theft proportions are similar. Thus, it is reasonable to use inference on these two neighbourhoods to test for a statistical difference in the proportion of theft.

In [None]:
crime_stats <- crime_overall %>%
    group_by(neighbourhood, type) %>%
    summarize(count = n(), .groups = 'drop') %>%
    pivot_wider(names_from = type,
                values_from = count) %>%
    mutate(total_crime = not_theft + theft,
           prop = theft / total_crime)

# Drop the first row, assumed here to be the group with a blank neighbourhood label
crime_stats <- crime_stats[-1, ]

crime_stats %>%
    filter(total_crime > 1700 & total_crime < 2400)

With some comparisons, we can see that West Point Grey and Dunbar-Southlands have a similar proportion of theft. Let us zoom in by filtering for the two neighbourhoods.

In [None]:
crime_filtered <- crime_overall %>%
    filter(neighbourhood %in% c("West Point Grey","Dunbar-Southlands"))

head(crime_filtered)

We then compute some initial observations about the filtered data and tidy it.

In [None]:
crime_type_pivot <- crime_filtered %>%
    group_by(neighbourhood, type) %>%
    summarize(count = n(), .groups = 'drop') %>%
    pivot_wider(names_from = type,
                values_from = count) %>%
    mutate(total_crime = not_theft + theft,
           prop = theft / total_crime)
    
crime_type_pivot

**Table 1. Initial observations of the crime data**
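
The same group-wise proportions can be cross-checked in base R with `table()` and `prop.table()`. This is a minimal sketch on hypothetical data (the neighbourhood labels "A" and "B" and all rows are made up), not a replacement for the tidyverse pipeline above.

```r
# Hypothetical toy data with two neighbourhoods
toy <- data.frame(
    neighbourhood = c("A", "A", "A", "B", "B", "B", "B"),
    type = c("theft", "theft", "not_theft", "theft", "not_theft", "not_theft", "theft"),
    stringsAsFactors = FALSE
)

counts <- table(toy$neighbourhood, toy$type)   # rows: neighbourhood, columns: type
props <- prop.table(counts, margin = 1)        # proportions within each neighbourhood

props[, "theft"]                               # theft proportion per neighbourhood
```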

We can zoom in on the two bars pertaining to West Point Grey and Dunbar-Southlands as follows.

In [None]:
ggplot(crime_filtered, aes(x = neighbourhood, fill = type)) + 
    geom_bar() + 
    labs(x = "Neighbourhood",
         y = "Crime Count",
         fill = "Type of Crime",
         title = "Figure 2. Counts of Theft Crime in West Point Grey and Dunbar-Southlands") + 
    scale_fill_manual(labels = c('Not Theft', 'Theft'), values = c("#56B4E9", "#009E73")) + 
    theme(text = element_text(size = 12))

From the plot and table, we observe that the proportions of theft are similar for the two neighbourhoods. West Point Grey has a slightly higher proportion, although Dunbar-Southlands has a greater count. Also, in both neighbourhoods, theft makes up the majority of crime.

Let us trim the estimates down to only the required information and calculate the observed test statistic $\hat{p}_w - \hat{p}_d$, which is the proportion of theft in West Point Grey minus the proportion of theft in Dunbar-Southlands.

In [None]:
crime_estimates <- crime_type_pivot %>%
    select(neighbourhood, total_crime, prop) %>%
    pivot_wider(names_from = neighbourhood, values_from = c(total_crime, prop))

colnames(crime_estimates) <- c('n_ds', 'n_wpg', 'p_ds', 'p_wpg')

crime_estimates <- crime_estimates %>%
    mutate(prop_diff = p_wpg - p_ds)

crime_estimates

**Table 2. Crime estimates**

Based on the plot and the observed difference in proportions of 0.008251004, we cannot readily conclude whether the theft proportions of the two neighbourhoods differ significantly, so we turn to formal statistical analysis.

### Methods

We will be using both asymptotics and bootstrapping to conduct our research to discover whether West Point Grey has a higher proportion of theft compared to Dunbar-Southlands. 

For asymptotics, we rely on the Central Limit Theorem, under which the sampling distribution of a sample proportion is approximately normal when the sample is large enough. Therefore, we need to check the sample size condition `np >= 10` and `n(1-p) >= 10` for each neighbourhood. We also need to assume that the sample is random and the observations are independent. Carrying out a two-sample z-test for independent samples, we use the following test statistic, where $\hat{p}$ is the pooled proportion of theft across both neighbourhoods:

$$
Z = \frac{\hat{p}_w - \hat{p}_d}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_d}\right)}}.
$$
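
To make the formula concrete, here is a minimal sketch of the pooled z statistic as a function. The function name `two_prop_z` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the crime data.

```r
# Pooled two-sample z statistic for H_0: p_1 - p_2 = 0.
# x1, x2 are success counts; n1, n2 are the sample sizes.
two_prop_z <- function(x1, n1, x2, n2) {
    p1 <- x1 / n1
    p2 <- x2 / n2
    p_pooled <- (x1 + x2) / (n1 + n2)   # pooled proportion under H_0
    se_null <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    (p1 - p2) / se_null
}

# Hypothetical counts: 60 of 100 successes versus 50 of 100
z <- two_prop_z(x1 = 60, n1 = 100, x2 = 50, n2 = 100)
z
pnorm(z, lower.tail = FALSE)   # one-sided p-value for H_A: p_1 - p_2 > 0
```

Pooling is appropriate for the test because the null hypothesis states that both groups share one common proportion.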

For the bootstrapping approach, we will set the seed to make our analysis reproducible, since bootstrapping introduces randomness. Afterwards, we will take a large number of bootstrap samples from our crime sample dataset to create a bootstrap distribution for the difference in proportions.

For both approaches, we find the p-value and compare it against a significance level of $\alpha = 0.05$ to decide whether we can reject the null hypothesis. This corresponds to a confidence level of 95%, so we also construct 95% confidence intervals for the true difference in proportions using both methods.

### Results Using Asymptotics

We can conduct a two-sample z-test on the difference in proportions. Recall that we have already stored the counts and proportions of crime in both neighbourhoods in the data frame `crime_estimates`.



In [None]:
crime_estimates

**Table 2. Crime estimates**

Since we are using theory-based methods, we have to make sure that `np >= 10` and `n(1-p) >= 10` for both neighbourhoods. For Dunbar-Southlands, $2165 \times 0.5773672 \approx 1250 \geq 10$ and $2165 \times (1 - 0.5773672) \approx 915 \geq 10$. Likewise, for West Point Grey, $1933 \times 0.5856182 \approx 1132 \geq 10$ and $1933 \times (1 - 0.5856182) \approx 801 \geq 10$. Aside from satisfying this 'success-failure' condition for a large enough sample size, we assume that the sample is random and the observations are independent of each other, so we can use the Central Limit Theorem.
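
The success-failure check can also be done programmatically. The sketch below hard-codes the totals and proportions quoted in the paragraph above; the vector names are our own, and the threshold of 10 is the rule of thumb used in this report.

```r
# Totals and theft proportions as quoted in the text above
n <- c(dunbar_southlands = 2165, west_point_grey = 1933)
p <- c(dunbar_southlands = 0.5773672, west_point_grey = 0.5856182)

# Expected numbers of thefts ("successes") and non-thefts ("failures")
expected_counts <- rbind(successes = n * p, failures = n * (1 - p))
round(expected_counts)

# TRUE only if every expected count reaches the threshold of 10
all(expected_counts >= 10)
```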

We proceed to calculate the null distribution standard error by first calculating a pooled proportion. 

In [None]:
crime_asymptotics <- crime_estimates %>%
    mutate(pooled_proportion = (n_ds*p_ds + n_wpg*p_wpg) / (n_ds + n_wpg),
           null_std_error = sqrt(pooled_proportion * (1-pooled_proportion) * (1/n_ds + 1/n_wpg)))

crime_asymptotics

**Table 3. Parameters calculated from asymptotics**

We can now easily find a 95% confidence interval using asymptotics.

In [None]:
obs_prop_diff_asymptotics <- crime_asymptotics$prop_diff
null_std_error <- crime_asymptotics$null_std_error

prop_ci_asymptotics <- tibble(
    lower_ci = qnorm(0.025, obs_prop_diff_asymptotics, null_std_error),
    upper_ci = qnorm(0.975, obs_prop_diff_asymptotics, null_std_error))

prop_ci_asymptotics

**Table 4. Asymptotics Confidence Interval**

We note that 0 is included in the interval, which suggests that we will fail to reject the null hypothesis. We can also visualize this confidence interval.

In [None]:
x = seq(-0.1, 0.1, by = 0.001)
y <- dnorm(x, 0, null_std_error)
normal_data <- tibble(x, y)

normal_data %>%
    ggplot(aes(x, y)) + 
    geom_line(color = 'red', lwd = 2) + 
    shade_confidence_interval(endpoints = prop_ci_asymptotics) + 
    labs(x = "Difference in Proportions",
         y = "Density",
         title = "Figure 3. Visualizing the 95% Confidence Interval Using a Normal Distribution")+ 
    theme(text = element_text(size = 12))

We are 95% confident that the true difference in proportions is captured by the confidence interval created using asymptotics.
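
The interval in Table 4 is built with the pooled null standard error. A common alternative for confidence intervals, as opposed to tests, is the unpooled standard error, which does not assume the two proportions are equal. The sketch below is a cross-check under that assumption, using the totals and proportions quoted earlier in this section; the variable names are our own.

```r
# Totals and theft proportions as quoted in the success-failure check
n_w <- 1933; p_w <- 0.5856182   # West Point Grey
n_d <- 2165; p_d <- 0.5773672   # Dunbar-Southlands

# Unpooled standard error of the difference in proportions
se_unpooled <- sqrt(p_w * (1 - p_w) / n_w + p_d * (1 - p_d) / n_d)

diff_hat <- p_w - p_d
z_crit <- qnorm(0.975)          # critical value for a 95% interval

ci_unpooled <- c(lower = diff_hat - z_crit * se_unpooled,
                 upper = diff_hat + z_crit * se_unpooled)
ci_unpooled
```

Because the two sample proportions are so close, the pooled and unpooled standard errors should be nearly identical here, so the two intervals should differ very little.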

Finally, we will obtain the p-value.

In [None]:
p_value_asymptotics <- pnorm(obs_prop_diff_asymptotics, 0, null_std_error, lower.tail = FALSE)
p_value_asymptotics

Clearly, the p-value above is greater than the significance level we set at 0.05, indicating that we fail to reject the null hypothesis.
We can visualize this result with a plot.

In [None]:
normal_data %>%
    ggplot(aes(x, y)) + 
    geom_line(color = 'red', lwd = 2) + 
    # Shade the area under the null curve to the right of the observed difference
    geom_area(data = filter(normal_data, x >= obs_prop_diff_asymptotics),
              alpha = 0.2,
              fill = 'blue') + 
    geom_vline(xintercept = obs_prop_diff_asymptotics,
               lwd = 1, alpha = 0.5, color = 'blue', linetype = 'dashed') + 
    labs(x = "Difference in Proportions",
         y = "Density",
         title = "Figure 4. Visualizing the p-value Using a Normal Distribution") + 
    theme(text = element_text(size = 12))

In addition, we can use `prop.test` to check our answer.

In [None]:
c_wpg <- crime_estimates$n_wpg * crime_estimates$p_wpg
c_ds <- crime_estimates$n_ds * crime_estimates$p_ds

prop_test <- tidy(
    prop.test(x = c(c_wpg, c_ds),
              n = c(crime_estimates$n_wpg, crime_estimates$n_ds),
              correct = FALSE,
              alternative = "greater"))
prop_test

The p-values from our asymptotic calculation and from `prop.test` agree, though we expect the result from the infer package to differ slightly.
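
The agreement is expected: without continuity correction, the chi-squared statistic reported by `prop.test` for two groups equals the square of the pooled z statistic. The sketch below checks this numerically. The theft counts are recovered by rounding `n * p` from the reported totals and proportions, so treat them as an assumption rather than values read directly from the data.

```r
# Theft counts (n * p, rounded) and total crimes; recovered values, see note above
x <- c(west_point_grey = 1132, dunbar_southlands = 1250)
n <- c(west_point_grey = 1933, dunbar_southlands = 2165)

# Pooled z statistic computed by hand
p_pool <- sum(x) / sum(n)
z <- (x[[1]] / n[[1]] - x[[2]] / n[[2]]) /
    sqrt(p_pool * (1 - p_pool) * sum(1 / n))

pt <- prop.test(x = x, n = n, correct = FALSE, alternative = "greater")

c(z_squared = z^2, chi_squared = unname(pt$statistic))
c(p_from_z = pnorm(z, lower.tail = FALSE), p_from_prop_test = pt$p.value)
```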

### Results Using the Infer Package for Bootstrapping

We first obtain the null model using the infer package after setting the seed to make the analysis reproducible. We specify `type` as our response variable and `neighbourhood` as our explanatory variable. Assuming the samples are independent, we generate replicates of shuffled data with `type = "permute"`. Recall that we are calculating the difference in proportions as West Point Grey minus Dunbar-Southlands.

In [None]:
set.seed(1)

null_model <- crime_filtered %>%
    specify(type ~ neighbourhood, success = "theft") %>%
    hypothesise(null = "independence") %>%
    generate(reps = 2000, type = "permute") %>%
    calculate(stat = "diff in props", order = c("West Point Grey", "Dunbar-Southlands"))
head(null_model)
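
For intuition about what a permutation-based null distribution represents, here is a minimal base R sketch on hypothetical data. It is not the infer implementation; the labels "A" and "B", the counts, and the helper `diff_in_props` are all made up for illustration.

```r
set.seed(1)

# Hypothetical toy data: 1 = theft, 0 = not theft, 50 crimes per neighbourhood
type <- c(rep(1, 30), rep(0, 20), rep(1, 28), rep(0, 22))
hood <- rep(c("A", "B"), each = 50)

# Difference in theft proportions, A minus B
diff_in_props <- function(type, hood) {
    mean(type[hood == "A"]) - mean(type[hood == "B"])
}

obs <- diff_in_props(type, hood)

# Shuffle the neighbourhood labels to break any link between label and crime type
null_diffs <- replicate(2000, diff_in_props(type, sample(hood)))

# One-sided p-value: share of shuffled differences at least as large as observed
p_value <- mean(null_diffs >= obs)
p_value
```

Shuffling the labels simulates a world in which neighbourhood and crime type are independent, which is what the null hypothesis asserts.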

We can quickly calculate the observed difference in theft proportions using the infer package, which should be the same as the value from the `crime_estimates` data frame.

In [None]:
crime_estimates

**Table 2. Crime estimates**

In [None]:
obs_prop_diff <- crime_filtered %>%
    specify(type ~ neighbourhood, success = "theft") %>%
    calculate(stat = "diff in props", order = c("West Point Grey", "Dunbar-Southlands")) %>%
    pull()
obs_prop_diff

As expected, the observed differences in proportions match, so we can move on to visualizing the simulation-based null distribution.

In [None]:
theft_result_plot <- 
   null_model %>%
   visualize() + 
   shade_p_value(obs_stat = obs_prop_diff, direction = "right") +
   labs(x = "Difference in Proportions",
        y = "Count",
       title = "Figure 5. Simulation-Based Null Distribution") + 
    theme(text = element_text(size = 12))
theft_result_plot

We can compute the p-value using `get_p_value`.

In [None]:
p_value_infer <- null_model %>%
    get_p_value(obs_stat = obs_prop_diff, direction = "right") %>%
    pull()
p_value_infer 

Now, we move from the hypothesis test to calculating 95% confidence intervals using the percentile and standard error methods by bootstrapping from our crime sample.

In [None]:
bootstrap_distribution <- crime_filtered %>%
    specify(type ~ neighbourhood, success = "theft") %>%
    generate(reps = 2000, type = "bootstrap") %>%
    calculate(stat = "diff in props", order = c("West Point Grey", "Dunbar-Southlands"))

percentile_ci <- bootstrap_distribution %>%
    get_confidence_interval(level = 0.95, type = 'percentile')

se_ci <- bootstrap_distribution %>% 
  get_confidence_interval(level = 0.95, type = "se", 
                          point_estimate = obs_prop_diff)

visualize(bootstrap_distribution) + 
    shade_confidence_interval(endpoints = percentile_ci) + 
    labs(x = "Difference in Proportions",
         y = "Count",
         title = "Figure 6. Simulation-Based Bootstrap Distribution with 95% Confidence Interval Shaded") + 
    theme(text = element_text(size = 12)) + 
    geom_vline(xintercept = pull(se_ci[1]), linetype = 'dashed', lwd = 2, colour = 'red') + 
    geom_vline(xintercept = pull(se_ci[2]), linetype = 'dashed', lwd = 2, colour = 'red')

The shaded region represents the 95% confidence interval from the percentile method, while the dashed red lines mark the endpoints of the interval from the standard error method.
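
The two interval types can be sketched by hand in base R. The example below uses hypothetical data and a normal-quantile multiplier for the standard error interval, which is one common choice; it illustrates the idea and is not a description of infer's internals.

```r
set.seed(1)

# Hypothetical toy data: TRUE = theft, 50 crimes per neighbourhood
theft <- c(rep(TRUE, 30), rep(FALSE, 20), rep(TRUE, 28), rep(FALSE, 22))
hood  <- rep(c("A", "B"), each = 50)

# One bootstrap replicate: resample rows with replacement, then recompute A minus B
boot_diff <- function() {
    idx <- sample(seq_along(theft), replace = TRUE)
    mean(theft[idx][hood[idx] == "A"]) - mean(theft[idx][hood[idx] == "B"])
}
boot_diffs <- replicate(2000, boot_diff())

obs <- mean(theft[hood == "A"]) - mean(theft[hood == "B"])

# Percentile method: middle 95% of the bootstrap distribution
percentile_ci <- quantile(boot_diffs, c(0.025, 0.975))

# Standard error method: observed difference plus or minus a multiple of the bootstrap SD
se_ci <- obs + c(-1, 1) * qnorm(0.975) * sd(boot_diffs)

percentile_ci
se_ci
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals tend to be close, which is consistent with what the figure shows.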

Now that we have calculated the results for our research, let us summarize them in two tables. The first one shows the confidence intervals using both methods, while the latter shows the p-values of the different approaches.

In [None]:
types <- tibble(type = c("asymptotics", "bootstrap_percentile", "bootstrap_se"))
combined_ci <- rbind(prop_ci_asymptotics,
                  percentile_ci,
                  se_ci)

cbind(types, combined_ci)

**Table 5. Confidence interval comparison**

In [None]:
p_types <- tibble(type = c("asymptotics", "bootstrap"))
combined_p <- tibble(p_value = c(p_value_asymptotics, p_value_infer))

cbind(p_types, combined_p)

**Table 6. P-value comparison**

Doing a quick comparison, we see that each of the 95% confidence intervals captures 0, and that the intervals produced by the different methods deviate little from one another. Interpreting the confidence level, if we repeated this sampling and interval construction many times, we would expect about 95% of the resulting intervals to contain the true difference in theft proportions. Similarly, the p-values from asymptotics and bootstrapping are both approximately 0.3, well above the 5% significance level, so we fail to reject the null hypothesis. In other words, we do not have enough evidence to conclude that West Point Grey has a higher proportion of theft than Dunbar-Southlands. Because we fail to reject the null hypothesis, we may be committing a Type II error if a true difference exists.

Since both asymptotics and bootstrapping gave similar confidence intervals and p-values, either method works for us. Although bootstrapping is versatile and relies on fewer distributional assumptions, we prefer asymptotics here. With a large sample size of over 1000 observations in each neighbourhood, the theory-based method is not only accurate but also computationally inexpensive, requiring only a few mathematical calculations.

# 3. Discussion

### Summary
In our study of whether the proportion of theft is higher in West Point Grey than in Dunbar-Southlands, we fail to reject the null hypothesis that there is no difference between the neighbourhoods at a significance level of 5%. Originally, we expected West Point Grey to have a higher proportion of theft than Dunbar-Southlands, as population density would be higher closer to campus, leading to a higher theft rate. However, both the asymptotic analysis based upon the Central Limit Theorem and the bootstrap analysis fail to reject the null hypothesis. We obtained confidence intervals containing the value of 0, or no difference, and p-values of approximately 0.3. The theory-based and simulation-based approaches therefore gave us the same conclusion, so we can be relatively confident in our answer.

### Significance
Our findings could encourage residents of either neighbourhood, if not both, to be more proactive and aware of theft in their area and to adopt relevant safety measures. Since we specifically targeted West Point Grey and Dunbar-Southlands, two of the most popular neighbourhoods for off-campus housing among University of British Columbia students, these students can become more informed about the theft proportions in these places. However, since we found no statistically significant difference between the two neighbourhoods in terms of theft proportion, theft proportion alone should not lead individuals to prefer one neighbourhood over the other.

### Further Questions
One drawback of our study is that while aggregate data may explain differences in the danger of theft and support governmental measures, it cannot explain individual cases or provide a detailed plan for minimizing the risks. Furthermore, a challenge we face is how to extrapolate our analysis to predict future crime rates, since such forecasts would be especially useful for reducing crime. Despite not finding a statistically significant difference between the proportions of theft in West Point Grey and Dunbar-Southlands, we have further questions regarding the underlying motivations behind theft and which features cause theft to differ between neighbourhoods. Our research also raises further questions about the proportions of other crime categories beyond theft and not theft, such as homicides or residential break-ins.

# 4. References
Branch, Legislative Services. “Consolidated Federal Laws of Canada, Criminal Code.” Criminal Code, 27 July 2023, [laws-lois.justice.gc.ca/eng/acts/c-46/section-322.html](https://laws-lois.justice.gc.ca/eng/acts/c-46/section-322.html).

Government of Canada, Department of Justice. “State of the Criminal Justice System - 2019 Report.” Results by Outcome, 7 July 2021, [www.justice.gc.ca/eng/cj-jp/state-etat/2019rpt-rap2019/p7.html](https://www.justice.gc.ca/eng/cj-jp/state-etat/2019rpt-rap2019/p7.html).

UBC. “Your Guide to Neighborhoods in Vancouver.” UBC Vantage College, [vantagecollege.ubc.ca/blog/your-guide-neighborhoods-vancouver](https://vantagecollege.ubc.ca/blog/your-guide-neighborhoods-vancouver). Accessed 30 July 2023.

Vancouver Police Department. “Crime Statistics.” 19 July 2023, [vpd.ca/crime-statistics/](https://vpd.ca/crime-statistics/).

Vancouver Police Department. “Vancouver Police Department Crime Data.” [geodash.vpd.ca/opendata/](https://geodash.vpd.ca/opendata/). Accessed 30 July 2023.