In [1]:
library(tidyverse)     # Attaches dplyr, ggplot2, lubridate, tidyr, and more
library(ggmap)         # Mapping tools; also provides the `crime` dataset
library(plotly)        # Interactive plots
library(leaflet)       # Interactive maps
library(factoextra)    # Clustering helpers such as fviz_nbclust()
library(caret)         # Data splitting and model evaluation
library(randomForest)  # Random forest model

“running command 'timedatectl' had status 1”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
  Stadia Maps' Terms of Service:

In [2]:
head(crime)

,time,date,hour,premise,offense,beat,block,street,type,suffix,number,month,day,location,address,lon,lat
,<dttm>,<chr>,<int>,<chr>,<fct>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<ord>,<ord>,<chr>,<chr>,<dbl>,<dbl>
82729,2010-01-01 06:00:00,1/1/2010,0,18A,murder,15E30,9600-9699,marlive,ln,-,1,january,friday,apartment parking lot,9650 marlive ln,-95.43739,29.6779
82730,2010-01-01 06:00:00,1/1/2010,0,13R,robbery,13D10,4700-4799,telephone,rd,-,1,january,friday,road / street / sidewalk,4750 telephone rd,-95.29888,29.69171
82731,2010-01-01 06:00:00,1/1/2010,0,20R,aggravated assault,16E20,5000-5099,wickview,ln,-,1,january,friday,residence / house,5050 wickview ln,-95.45586,29.59922
82732,2010-01-01 06:00:00,1/1/2010,0,20R,aggravated assault,2A30,1000-1099,ashland,st,-,1,january,friday,residence / house,1050 ashland st,-95.40334,29.79024
82733,2010-01-01 06:00:00,1/1/2010,0,20A,aggravated assault,14D20,8300-8399,canyon,,-,1,january,friday,apartment,8350 canyon,-95.37791,29.67063
82734,2010-01-01 06:00:00,1/1/2010,0,20R,burglary,18F60,9300-9399,rowan,ln,-,1,january,friday,residence / house,9350 rowan ln,-95.5483,29.70223


**CLUSTERING**

In this analysis, we group crime records spatially and form clusters to identify areas of high crime density.

**Data Preparation:** Clean the data and keep only the features used for clustering (longitude and latitude).

**Choosing the Optimal Number of Clusters:** Select an appropriate number of clusters using the Elbow Method or Silhouette Analysis.

**Clustering with K-means:** Divide the crimes into clusters with the K-means algorithm.

**Visualization:** Plot the clusters on a scatter plot and on an interactive map.

In [3]:
# Step 1: Data Preparation
# Extract latitude and longitude for clustering
geo_data <- crime %>%
  select(lon, lat) %>%
  drop_na()  # Remove rows with missing values
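K-means uses Euclidean distance, and one degree of longitude covers less ground than one degree of latitude away from the equator. The sketch below shows one optional way to convert coordinates to approximate kilometre offsets before clustering. The helper `to_local_km` and the constant of roughly 111 km per degree are assumptions for illustration, not part of the original notebook, and the toy coordinates are copied from the `head(crime)` output.

```r
# Hypothetical helper (not part of the notebook): convert lon/lat in degrees
# to approximate kilometre offsets from the mean position, using an
# equirectangular approximation that is reasonable over a city-sized area.
to_local_km <- function(lon, lat) {
  km_per_degree <- 111.32  # rough length of one degree of latitude (assumed)
  lon0 <- mean(lon)
  lat0 <- mean(lat)
  data.frame(
    x_km = (lon - lon0) * cos(lat0 * pi / 180) * km_per_degree,
    y_km = (lat - lat0) * km_per_degree
  )
}

# Toy coordinates taken from the first rows of the crime data
toy <- data.frame(
  lon = c(-95.43739, -95.29888, -95.45586),
  lat = c(29.67790, 29.69171, 29.59922)
)
toy_km <- to_local_km(toy$lon, toy$lat)
print(toy_km)
```

If this were applied in the notebook, `to_local_km(geo_data$lon, geo_data$lat)` would replace `geo_data` as the input to `kmeans()`; over an area the size of one city the effect on the clusters is likely to be modest.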

In [None]:
# Step 2: Determine Optimal Number of Clusters using Elbow Method
set.seed(42)  # For reproducibility
fviz_nbclust(geo_data, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal Clusters", 
       x = "Number of Clusters", 
       y = "Total Within-Cluster Sum of Squares")
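Silhouette analysis, the other option named in the overview, can be sketched as follows. This is a minimal example on synthetic data, assuming the `cluster` package is installed; the object names (`toy`, `avg_sil`, `best_k`) are illustrative. Because `dist()` needs memory proportional to the square of the number of rows, the full crime data would likely need to be sampled first.

```r
library(cluster)  # provides silhouette(); assumed to be installed

set.seed(42)
# Synthetic stand-in for geo_data: three well-separated 2-D blobs
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)  # pairwise distances; O(n^2) memory, so sample large data first

# Average silhouette width for k = 2..6 (the measure is undefined for k = 1)
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the highest average silhouette width is the suggested choice
best_k <- k_values[which.max(avg_sil)]
print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
print(best_k)
```

`fviz_nbclust(geo_data, kmeans, method = "silhouette")` produces the corresponding plot directly, subject to the same memory caveat.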

In [None]:
# Step 3: Perform K-means Clustering
set.seed(42)
kmeans_result <- kmeans(geo_data, centers = 5, nstart = 20)

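# Add cluster results to the original dataset.
# geo_data excludes rows with missing coordinates, so assign labels only to
# the rows that were actually clustered; the remaining rows stay NA.
complete_rows <- complete.cases(crime[, c("lon", "lat")])
crime$cluster <- NA_integer_
crime$cluster[complete_rows] <- kmeans_result$cluster
crime$cluster <- factor(crime$cluster)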
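Before plotting, it can help to inspect what `kmeans()` returned. The sketch below runs on synthetic coordinates (an assumption, so that it runs on its own) and prints the components of the fitted object that are most useful for a quick sanity check.

```r
set.seed(42)
# Synthetic stand-in for geo_data: two groups of points with lon/lat columns
toy <- data.frame(
  lon = c(rnorm(50, -95.5, 0.02), rnorm(50, -95.3, 0.02)),
  lat = c(rnorm(50, 29.7, 0.02), rnorm(50, 29.8, 0.02))
)
km <- kmeans(toy, centers = 2, nstart = 20)

print(km$size)                  # number of points in each cluster
print(km$centers)               # cluster centroids, one row per cluster
print(km$tot.withinss)          # total within-cluster sum of squares
print(km$betweenss / km$totss)  # share of total variance between clusters
```

The same components are available on `kmeans_result`; very uneven values in `size` or a low between-cluster share may suggest trying a different number of clusters.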

In [None]:
# Step 4: Visualization - Scatter Plot of Clusters
ggplot(crime, aes(x = lon, y = lat, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "Crime Clusters (K-means)", 
       x = "Longitude", 
       y = "Latitude", 
       color = "Cluster") +
  theme_minimal()

In [None]:
# Step 5: Visualization - Interactive Map of Clusters
leaflet(crime) %>%
  addTiles() %>%  # Add base map
  addCircleMarkers(
    lng = ~lon, lat = ~lat,
    radius = 1, color = ~colorFactor(rainbow(5), cluster)(cluster),
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend("bottomright", colors = rainbow(5), labels = 1:5, title = "Clusters")

**CRIME PREDICTION**

We will predict the type of crime from location (longitude, latitude) and time (hour of day).

**Feature Selection:** Select the variables used by the prediction model.

**Data Splitting:** Divide the data into training and testing sets.

**Model Training:** Train a simple machine learning model (e.g. Logistic Regression or Random Forest); this notebook uses a Random Forest.

**Evaluation:** Evaluate the model's performance (e.g. accuracy and the confusion matrix).

**Visualization:** Visualize the model results.

In [None]:
# Step 1: Feature Selection and Data Preparation
# Select the target (offense) and the predictors (lon, lat, hour)
crime_prediction_data <- crime %>%
  select(offense, lon, lat, hour) %>%
  drop_na()  # Remove rows with missing values

# Convert 'offense' to a factor for classification
crime_prediction_data$offense <- as.factor(crime_prediction_data$offense)

In [None]:
# Step 2: Data Splitting
set.seed(42)
train_index <- createDataPartition(crime_prediction_data$offense, p = 0.7, list = FALSE)
train_data <- crime_prediction_data[train_index, ]
test_data <- crime_prediction_data[-train_index, ]
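`createDataPartition()` samples within each class so that the training and test sets keep roughly the same class mix. The base-R sketch below illustrates that idea on toy labels. The helper `stratified_indices` is hypothetical and is not caret's implementation.

```r
# Hypothetical helper (not caret's implementation): sample a fixed fraction
# of row indices within each class so class proportions are preserved.
stratified_indices <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  sampled <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(sampled, use.names = FALSE))
}

set.seed(42)
# Toy imbalanced labels standing in for crime_prediction_data$offense
y <- factor(rep(c("theft", "burglary", "murder"), times = c(700, 250, 50)))
train_idx <- stratified_indices(y, p = 0.7)

# Class shares in the full data and in each split should be close
print(round(prop.table(table(y)), 3))
print(round(prop.table(table(y[train_idx])), 3))
print(round(prop.table(table(y[-train_idx])), 3))
```

In the notebook, the same check can be run with `prop.table(table(train_data$offense))` and `prop.table(table(test_data$offense))`.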

In [None]:
# Step 3: Model Training - Random Forest
set.seed(42)
rf_model <- randomForest(offense ~ lon + lat + hour, data = train_data, ntree = 100)
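The overview also names logistic regression as a candidate model. A minimal sketch of a multinomial logistic regression baseline is shown below, assuming the `nnet` package is installed. It uses the built-in `iris` data as a stand-in so that it runs on its own; the names `toy_train`, `toy_test`, and `logit_model` are illustrative.

```r
library(nnet)  # provides multinom(); assumed to be installed

set.seed(42)
# iris stands in for train_data/test_data so the sketch is self-contained
n <- nrow(iris)
train_rows <- sample(n, size = round(0.7 * n))
toy_train <- iris[train_rows, ]
toy_test  <- iris[-train_rows, ]

# Multinomial logistic regression on two numeric predictors
logit_model <- multinom(Species ~ Sepal.Length + Petal.Length,
                        data = toy_train, trace = FALSE)

# predict() returns class labels by default
logit_pred <- predict(logit_model, newdata = toy_test)
logit_acc  <- mean(logit_pred == toy_test$Species)
print(logit_acc)
```

For the crime data, the analogous call would be `multinom(offense ~ lon + lat + hour, data = train_data)`, which gives a simpler model to compare the random forest against.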

In [None]:
# Step 4: Model Evaluation
# Predict on test data
predictions <- predict(rf_model, test_data)

# Confusion matrix and accuracy
conf_matrix <- confusionMatrix(predictions, test_data$offense)
print(conf_matrix)
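Accuracy is easier to interpret next to a majority-class baseline, which `confusionMatrix()` reports as the "No Information Rate". The base-R sketch below computes accuracy, that baseline, and per-class recall by hand on toy labels, which are invented for illustration and are not results from the crime data.

```r
# Toy actual/predicted labels standing in for test_data$offense and predictions
lvls <- c("burglary", "robbery", "theft")
actual <- factor(c("theft", "theft", "theft", "theft", "burglary",
                   "burglary", "robbery", "theft", "burglary", "theft"),
                 levels = lvls)
predicted <- factor(c("theft", "theft", "burglary", "theft", "burglary",
                      "theft", "theft", "theft", "burglary", "theft"),
                    levels = lvls)

# Rows are predicted classes, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)

accuracy <- sum(diag(cm)) / sum(cm)              # share of correct predictions
baseline <- max(table(actual)) / length(actual)  # always guess the most common class
recall   <- diag(cm) / colSums(cm)               # per-class recall (sensitivity)

print(cm)
print(c(accuracy = accuracy, baseline = baseline))
print(round(recall, 3))
```

With imbalanced offense types, a model can reach an accuracy close to the baseline while never predicting the rare classes, so the per-class figures are worth reading alongside the overall accuracy.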

In [None]:
# Step 5: Visualization - Feature Importance
varImpPlot(rf_model, main = "Feature Importance")

# Visualize the predicted offense for each test-set location
test_data$predicted <- predictions
ggplot(test_data, aes(x = lon, y = lat, color = predicted)) +
  geom_point(alpha = 0.6) +
  labs(title = "Predicted Crime Types", 
       x = "Longitude", 
       y = "Latitude", 
       color = "Predicted Offense") +
  theme_minimal()