In [75]:
# first, import all the libs we need
import pandas as pd
import matplotlib.pyplot as plt
import folium
from folium import plugins

df = pd.read_csv("crimes.csv")

# Crime Categories

We do not need all of the data, because some types of crime pose no immediate danger to another person. The crime alert should notify nearby citizens only when they could be physically hurt.

Let's list the crime categories so that we can filter out the ones that are not useful to us.

In [62]:
for c in df.Category.unique():
    print(c)

WARRANTS
OTHER OFFENSES
LARCENY/THEFT
VEHICLE THEFT
VANDALISM
NON-CRIMINAL
ROBBERY
ASSAULT
WEAPON LAWS
BURGLARY
SUSPICIOUS OCC
DRUNKENNESS
FORGERY/COUNTERFEITING
DRUG/NARCOTIC
STOLEN PROPERTY
SECONDARY CODES
TRESPASS
MISSING PERSON
FRAUD
KIDNAPPING
RUNAWAY
DRIVING UNDER THE INFLUENCE
SEX OFFENSES FORCIBLE
PROSTITUTION
DISORDERLY CONDUCT
ARSON
FAMILY OFFENSES
LIQUOR LAWS
BRIBERY
EMBEZZLEMENT
SUICIDE
LOITERING
SEX OFFENSES NON FORCIBLE
EXTORTION
GAMBLING
BAD CHECKS
TREA
RECOVERED VEHICLE
PORNOGRAPHY/OBSCENE MAT


In [69]:
# These are what I believe to be the relevant crime categories:
relevant_categories = [
  "LARCENY/THEFT",
  "VEHICLE THEFT",
  "ROBBERY",
  "ASSAULT",
  "WEAPON LAWS",
  "BURGLARY",
  "DRUNKENNESS",
  "DRUG/NARCOTIC",
  "KIDNAPPING",
  "DRIVING UNDER THE INFLUENCE",
  "SEX OFFENSES FORCIBLE",
  "ARSON",
  "SEX OFFENSES NON FORCIBLE",
  "EXTORTION"
]

In [72]:
df = df[df.Category.isin(relevant_categories)]
print(len(df))

188866
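
A hand-typed category list is easy to get subtly wrong, because `isin` silently ignores names that do not match. The sketch below shows one way to guard against that. The helper `check_categories` and the toy rows are hypothetical and only stand in for the real `crimes.csv`.

```python
import pandas as pd

def check_categories(df, wanted, column="Category"):
    """Return (missing, kept): wanted names absent from the data, and the filtered rows.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    present = set(df[column].unique())
    missing = sorted(set(wanted) - present)   # names that would match nothing
    kept = df[df[column].isin(wanted)]        # same filter as in the notebook
    return missing, kept

# Toy rows standing in for crimes.csv (made-up data)
toy = pd.DataFrame({"Category": ["ASSAULT", "FRAUD", "ROBBERY", "ASSAULT"]})
missing, kept = check_categories(toy, ["ASSAULT", "ROBBERY", "ARSON"])

print(missing)                                   # wanted categories not found in the data
print(kept.Category.value_counts().to_dict())    # rows retained per category
```

An empty `missing` list on the real data would suggest that every name in `relevant_categories` matches at least one row.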


# Time Relevance

Crimes that happened a long time ago are less relevant: the police may have taken measures to solve the problem, or the people who committed crimes at these places may no longer be around.

I am going to include only the crimes that happened between 2010 and 2015.
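
A minimal sketch of that cutoff follows. It assumes the data has a date column, hypothetically named `Date` here, that `pd.to_datetime` can parse. The helper `filter_by_year` and the toy rows are illustrative only, so adjust the column name to whatever `crimes.csv` actually uses.

```python
import pandas as pd

def filter_by_year(df, first_year=2010, last_year=2015, date_col="Date"):
    """Keep rows whose date falls in [first_year, last_year], both ends inclusive.

    `date_col` is an assumed column name, not one confirmed by the notebook.
    """
    # Unparseable dates become NaT and are dropped by the mask below
    dates = pd.to_datetime(df[date_col], errors="coerce")
    mask = dates.dt.year.between(first_year, last_year)
    return df[mask]

# Toy rows standing in for crimes.csv (made-up data)
toy = pd.DataFrame({
    "Date": ["2009-03-14", "2010-07-01", "2015-12-31", "2016-01-01", "not a date"],
    "Category": ["ASSAULT", "ROBBERY", "ARSON", "ASSAULT", "ROBBERY"],
})
recent = filter_by_year(toy)
print(recent.Category.tolist())
```

On the real data this would be applied as `df = filter_by_year(df)` before the neighborhood analysis.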

# Neighborhoods

Let's take a look at where the crimes happened. This will be part of the first crime alert component, which tells citizens which areas to avoid by showing them the neighborhoods that have serious problems with crime.

In [73]:
neighborhoods = df.PdDistrict.unique()
df.groupby("PdDistrict").PdDistrict.value_counts().nlargest(len(neighborhoods))

PdDistrict            
SOUTHERN    SOUTHERN      37299
NORTHERN    NORTHERN      24446
MISSION     MISSION       23856
CENTRAL     CENTRAL       20915
BAYVIEW     BAYVIEW       17009
INGLESIDE   INGLESIDE     16364
TENDERLOIN  TENDERLOIN    15854
TARAVAL     TARAVAL       12857
PARK        PARK          10270
RICHMOND    RICHMOND       9996
dtype: int64
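
The grouped count above can also be written more directly with a plain `value_counts()`, which avoids the duplicated index level. The sketch below adds each district's share of incidents as well. The helper `district_ranking` and the toy rows are hypothetical.

```python
import pandas as pd

def district_ranking(df, column="PdDistrict"):
    """Rank districts by incident count and add each one's share of the total.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    counts = df[column].value_counts()           # already sorted, largest first
    ranking = counts.to_frame("incidents")
    ranking["share"] = (counts / counts.sum()).round(3)
    return ranking

# Toy rows standing in for crimes.csv (made-up data)
toy = pd.DataFrame({
    "PdDistrict": ["SOUTHERN", "MISSION", "SOUTHERN", "PARK", "SOUTHERN", "MISSION"]
})
ranking = district_ranking(toy)
print(ranking)
```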

Much to my surprise, Tenderloin is not even close to the top of the list. This has an important implication: we cannot rely on what people say about certain neighborhoods, and we should look at the facts instead. Now let's look at the crime distribution by coordinates.

In [80]:
map_osm = folium.Map(location=[37.7749, -122.4194], zoom_start=12)

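# In this dataset X is the longitude and Y is the latitude
lats = df.Y
lngs = df.X
# HeatMap expects (lat, lng) pairs; list() keeps this working on Python 3, where zip is lazy
map_osm.add_child(plugins.HeatMap(list(zip(lats, lngs)), radius=10))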

# One cluster per district; a list comprehension makes each cluster a distinct object
# clusters = [plugins.MarkerCluster() for _ in neighborhoods]

# for _, row in df.tail(1000).iterrows():
#     popup_text = "{}: {}".format(row.Category, row.Descript)
#     marker = folium.Marker(
#         location=[row.Y, row.X],
#         popup=popup_text
#     )
#     cluster = list(neighborhoods).index(row.PdDistrict)
#     marker.add_to(clusters[cluster])

# for cluster in clusters:
#     map_osm.add_child(cluster)

map_osm.save("osm.html")
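
The heat map needs clean latitude/longitude pairs, so rows with missing coordinates are better dropped first. The sketch below prepares such a list. It assumes, as the marker code above suggests, that `Y` holds the latitude and `X` the longitude. The helper `heatmap_points` and the toy rows are hypothetical.

```python
import pandas as pd

def heatmap_points(df, lat_col="Y", lng_col="X"):
    """Return [lat, lng] pairs, skipping rows where either coordinate is missing.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    coords = df[[lat_col, lng_col]].dropna()
    return coords.values.tolist()

# Toy rows standing in for crimes.csv (made-up data)
toy = pd.DataFrame({
    "X": [-122.4194, -122.4313, None],   # longitude; last row has none
    "Y": [37.7749, 37.7599, 37.7800],    # latitude
})
points = heatmap_points(toy)
print(points)
```

The resulting list could then be passed to the plugin, for example `plugins.HeatMap(points, radius=10)`.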