Scenario to investigate:
Mass surveillance makes cities safer.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Cities around the world are spending billions on expanding surveillance networks, often justifying them as crime-fighting tools. Chicago in particular, with initiatives like "Operation Virtual Shield", has relied heavily on technological surveillance, whereas New York has maintained a larger physical police presence. Our analysis focuses on the years 2006-2024 (the reason will become apparent in the course of the analysis and our presentation). To compare the two cities, I first gather general information about each, then the population of each city for every year in that range. In addition, I compare the number of mapped surveillance devices and police locations in each city using OpenStreetMap data. This information is the basis for our later analysis.


In [None]:
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Operation Virtual Shield",
    "exsentences": 1,
    "exlimit": 1,
    "explaintext": 1,
    "redirects": 1,
    "format": "json",
}

headers = {"User-Agent": "UniversityProject/1.0 (anna.reidel@esmt.berlin)"}

r = requests.get(API, params=params, headers=headers)
r.raise_for_status()
page = next(iter(r.json()["query"]["pages"].values()))
print(page.get("extract", "").strip())

Operation Virtual Shield is a program implemented by Chicago, IL mayor Richard M. Daley, which created the most extensive video surveillance network in the United States by linking more than 3000 surveillance cameras to a centralized monitoring system, which captures and processes camera feeds in real time.


I start by printing a short Wikipedia summary for the two cities we aim to compare:

In [None]:
import requests, pandas as pd
from pathlib import Path

HEADERS = {"User-Agent": "UniversityProject/1.0 (anna.reidel@esmt.berlin)"}
def wiki_summary(title: str, max_chars=200):
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title.replace(' ', '_')}"
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()
    j = r.json()
    return {
        "city": j.get("title", title),
        "description": j.get("description", ""),
        "summary": (j.get("extract", "")[:max_chars]).strip()
    }

rows = [wiki_summary("Chicago"), wiki_summary("New York City")]
df_wiki = pd.DataFrame(rows)

TASK2 = Path("/content/drive/MyDrive/Group_Project/data")
TASK2.mkdir(parents=True, exist_ok=True)

df_wiki.to_csv(TASK2/"wiki_summary.csv", index=False)
df_wiki

Unnamed: 0,city,description,summary
0,Chicago,"Most populous city in Illinois, United States",Chicago is the most populous city in the U.S. ...
1,New York City,Most populous city in the United States,"New York, often called New York City (NYC), is..."


This is what got me interested in the question: Chicago (like many other big cities) makes its crime data openly accessible. Below is a small excerpt of Chicago's crime records:
[Data from the city of Chicago](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/data_preview)

In [None]:
import requests
requests.get("https://data.cityofchicago.org/resource/ijzp-q8t2.json?$limit=1").status_code


200

In [None]:
import requests
import pandas as pd

CHICAGO_CRIME_API = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"
params = {
    "$limit": 5000,
    "$where": "year >= 2019",
    "$select": "id,primary_type,date,year,latitude,longitude"
}
r = requests.get(CHICAGO_CRIME_API, params=params, timeout=30)
r.raise_for_status()  # stop on HTTP errors instead of parsing an error payload
crimes = r.json()
df = pd.DataFrame(crimes)
df.head()

Unnamed: 0,id,primary_type,date,year,latitude,longitude
0,11695116,BURGLARY,2019-05-21T08:20:00.000,2019,41.856547057,-87.695604526
1,11662417,ROBBERY,2019-04-21T12:30:00.000,2019,41.749500329,-87.6011574
2,12990873,OFFENSE INVOLVING CHILDREN,2019-08-17T13:14:00.000,2019,41.89621515,-87.728572048
3,11757563,ROBBERY,2019-07-13T23:30:00.000,2019,41.742454647,-87.543009205
4,11653698,MOTOR VEHICLE THEFT,2019-04-13T01:30:00.000,2019,41.903567876,-87.653197209


Next, I get the population of both cities for the years 2006-2024. Years without a data point are filled by linear interpolation:

[Data from the city of Chicago](https://data.cityofchicago.org/Health-Human-Services/Chicago-Population-Counts/85cm-7uqa/data_preview)

In [None]:
import requests, pandas as pd
import numpy as np
import certifi

params = {
    "$select": "year,population_total",
    "$where": "geography_type='Citywide' AND year between 2006 and 2024",
    "$order": "year",
    "$limit": 50000,
}
chicago_json = requests.get(
    "https://data.cityofchicago.org/resource/85cm-7uqa.json",
    params=params,
    verify=certifi.where(),
    timeout=30,
).json()

chicago = pd.DataFrame(chicago_json).rename(columns={"population_total": "population"})
chicago = chicago.astype({"year": "int64", "population": "int64"}).sort_values("year")

known = pd.DataFrame([
    {"year": 2010, "population": 2695598},
    {"year": 2020, "population": 2699347},
])
all_years = pd.DataFrame({"year": range(2006, 2025)})
chicago = pd.concat([chicago, known]).drop_duplicates("year", keep="last")
chicago = all_years.merge(chicago, on="year", how="left")
chicago["population"] = chicago["population"].interpolate(method="linear", limit_direction="both").round().astype(int)
chicago["city"] = "Chicago"
chicago = chicago[["year","city","population"]]

known_nyc = pd.DataFrame([
    {"year": 2010, "population": 8175133},
    {"year": 2020, "population": 8804190},
    {"year": 2021, "population": 8467513},
    {"year": 2022, "population": 8335897},
    {"year": 2023, "population": 8258000},
    {"year": 2024, "population": 8478000},
])
nyc = all_years.merge(known_nyc, on="year", how="left")
nyc["population"] = nyc["population"].interpolate(method="linear", limit_direction="both").round().astype(int)
nyc["city"] = "New York City"
nyc = nyc[["year","city","population"]]

out = pd.concat([chicago, nyc], ignore_index=True)
out

Unnamed: 0,year,city,population
0,2006,Chicago,2695598
1,2007,Chicago,2695598
2,2008,Chicago,2695598
3,2009,Chicago,2695598
4,2010,Chicago,2695598
5,2011,Chicago,2696897
6,2012,Chicago,2698196
7,2013,Chicago,2699494
8,2014,Chicago,2700793
9,2015,Chicago,2702092
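
To make the gap-filling step concrete, here is a minimal pure-Python sketch of what linear interpolation with `limit_direction="both"` does when only the two census anchors for Chicago (2010 and 2020) are known. The helper `fill_years` is hypothetical and not part of the notebook. The table above may also reflect values returned by the Chicago API, so its numbers need not match this sketch.

```python
def fill_years(known, first_year, last_year):
    """Linearly interpolate between known years; hold edge values flat outside them."""
    years = sorted(known)
    filled = {}
    for y in range(first_year, last_year + 1):
        if y <= years[0]:
            filled[y] = known[years[0]]    # before the first anchor: back-fill
        elif y >= years[-1]:
            filled[y] = known[years[-1]]   # after the last anchor: forward-fill
        else:
            lo = max(k for k in years if k <= y)   # nearest anchor at or before y
            hi = min(k for k in years if k >= y)   # nearest anchor at or after y
            if lo == hi:
                filled[y] = known[y]
            else:
                frac = (y - lo) / (hi - lo)
                filled[y] = round(known[lo] + frac * (known[hi] - known[lo]))
    return filled

# Census anchors taken from the cell above.
chicago_anchors = {2010: 2695598, 2020: 2699347}
chicago_filled = fill_years(chicago_anchors, 2006, 2024)
print(chicago_filled[2006], chicago_filled[2011], chicago_filled[2024])
```

Note that the years outside the anchors are not extrapolated: they simply repeat the nearest known value, which is why long flat runs can appear at either end of an interpolated series.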


In [None]:
from pathlib import Path
POP_NYC_CHI = Path("/content/drive/MyDrive/Group_Project/data")
POP_NYC_CHI.mkdir(parents=True, exist_ok=True)

out.to_csv(POP_NYC_CHI/"out.csv", index=False)

Now I get OpenStreetMap data for the two cities: how many mapped surveillance features there are (`man_made=surveillance`) and how many mapped police features there are (`amenity=police`).

In [None]:
import requests, pandas as pd, time
from pathlib import Path

UA = {"User-Agent": "UniversityProject/1.0 (anna.reidel@esmt.berlin)"}

def nominatim_bbox(city):
    url = "https://nominatim.openstreetmap.org/search"
    q = {"q": city, "format": "json", "limit": 1}
    r = requests.get(url, params=q, headers=UA, timeout=30)
    r.raise_for_status()
    j = r.json()
    if not j:
        raise ValueError(f"No bbox for {city}")
    b = j[0]
    return float(b["boundingbox"][0]), float(b["boundingbox"][2]), float(b["boundingbox"][1]), float(b["boundingbox"][3])

def overpass_count(bbox, overpass_query_body):
    south, west, north, east = bbox
    body = overpass_query_body.format(south=south, west=west, north=north, east=east)
    r = requests.post("https://overpass-api.de/api/interpreter", data={"data": body}, headers=UA, timeout=90)
    r.raise_for_status()
    j = r.json()
    return len(j.get("elements", []))

Q_SURV = """
[out:json][timeout:25];
(
  node["man_made"="surveillance"]({south},{west},{north},{east});
  way["man_made"="surveillance"]({south},{west},{north},{east});
  rel["man_made"="surveillance"]({south},{west},{north},{east});
);
out ids;
"""
Q_POLICE = """
[out:json][timeout:25];
(
  node["amenity"="police"]({south},{west},{north},{east});
  way["amenity"="police"]({south},{west},{north},{east});
  rel["amenity"="police"]({south},{west},{north},{east});
);
out ids;
"""

rows = []
TASK2 = Path("/content/drive/MyDrive/Group_Project/data")
TASK2.mkdir(parents=True, exist_ok=True)

for city in ["Chicago", "New York City"]:
    bbox = nominatim_bbox(city)
    surv = overpass_count(bbox, Q_SURV)
    time.sleep(1)
    police = overpass_count(bbox, Q_POLICE)
    rows.append({"city": city, "osm_cctv_count": surv, "osm_police_count": police})
    time.sleep(1)

osm_counts = pd.DataFrame(rows)
osm_counts.to_csv(TASK2/"osm_surveillance_police_counts.csv", index=False)
osm_counts

Unnamed: 0,city,osm_cctv_count,osm_police_count
0,Chicago,1332,103
1,New York City,677,208
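
Nominatim returns `boundingbox` as strings in the order [south, north, west, east], while an Overpass bounding-box filter expects (south, west, north, east). The sketch below isolates that reordering and the query templating so it can be checked without a network call. The name `to_overpass_bbox` and the sample coordinates are illustrative assumptions, not values returned for either city.

```python
def to_overpass_bbox(nominatim_bbox):
    """Reorder Nominatim's [south, north, west, east] strings into Overpass order."""
    south, north, west, east = (float(v) for v in nominatim_bbox)
    return south, west, north, east

QUERY = 'node["man_made"="surveillance"]({south},{west},{north},{east});'

# Made-up bounding box, used only to illustrate the ordering.
sample = ["41.0", "42.0", "-88.0", "-87.0"]
south, west, north, east = to_overpass_bbox(sample)
filter_line = QUERY.format(south=south, west=west, north=north, east=east)
print(filter_line)
```

Because a bounding box is a rectangle, counts obtained this way can include features that lie outside the municipal boundary, and they only reflect what volunteers have mapped.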


In [None]:
# here I am collecting from government websites
# need to get the crime counts for Chicago and New York
# also want to access data from the US Census: median_income, unemployment_rate, and education
# in later tasks I aim to normalize them by population (which was fetched in task 2)

Now I start collecting from government websites. I need the yearly crime counts for both New York City and Chicago, and I rely on official government data portals for this. In later tasks I aim to normalize the counts by each city's population (which I obtained in task 2).
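
Normalizing by population means expressing crime as a rate per 100,000 residents. A minimal sketch, using the 2010 figures that appear elsewhere in this notebook (Chicago: 370,557 recorded crimes and a population of 2,695,598; New York City: 510,398 and 8,175,133). The function name `rate_per_100k` is my own choice for illustration.

```python
def rate_per_100k(crime_count, population):
    """Crimes per 100,000 residents."""
    return crime_count / population * 100_000

# 2010 values copied from the crime-count and population cells of this notebook.
figures_2010 = {
    "Chicago": {"crime_count": 370557, "population": 2695598},
    "New York City": {"crime_count": 510398, "population": 8175133},
}
rates_2010 = {city: rate_per_100k(**v) for city, v in figures_2010.items()}
for city, rate in rates_2010.items():
    print(f"{city}: {rate:.0f} per 100,000")
```

Rates rather than raw counts are what make the two cities comparable, since New York City has roughly three times Chicago's population.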

In [None]:
# we start by creating the table crime_count for Chicago; the loop below queries 2001-2024, which includes our 2006-2024 window

import requests
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import time

# https://data.cityofchicago.org/api/v3/views/ijzp-q8t2/query.json


# Chicago crime API (Socrata)
CHICAGO = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"

def get_json(url, params=None, timeout=30):
    r = requests.get(url, params=params, timeout=timeout)
    r.raise_for_status()
    return r.json()

YEARS = list(range(2001, 2025))
rows = []
for y in YEARS:
    js = get_json(CHICAGO, params={"$select":"count(id)", "$where":f"year={y}"})
    count = int(js[0].get("count_id", 0)) if js else 0
    rows.append({"year": y, "crime_count": count})
crime_chicago = pd.DataFrame(rows)

crime_chicago

Unnamed: 0,year,crime_count
0,2001,485954
1,2002,486830
2,2003,475996
3,2004,469439
4,2005,453785
5,2006,448198
6,2007,437105
7,2008,427211
8,2009,392860
9,2010,370557


In [None]:
CRIMECHICAGO_20062024 = Path("/content/drive/MyDrive/Group_Project/data")
CRIMECHICAGO_20062024.mkdir(parents=True, exist_ok=True)

crime_chicago.to_csv(CRIMECHICAGO_20062024/"crime_chicago.csv", index=False)

In [None]:
# go on to getting crime_count for nyc from 2006-2024

import requests
import pandas as pd

URL = "https://data.cityofnewyork.us/resource/qgea-i56i.json"

def grouped_year_counts(start_year, end_year):
    params = {
        "$select": "date_extract_y(cmplnt_fr_dt) as year, count(1) as crime_count",
        "$where": f"cmplnt_fr_dt >= '{start_year}-01-01T00:00:00' AND cmplnt_fr_dt < '{end_year+1}-01-01T00:00:00'",
        "$group": "year",
        "$order": "year",
        "$limit": 50000,
    }
    r = requests.get(URL, params=params, headers={"Accept": "application/json"})
    r.raise_for_status()
    data = r.json()
    df = pd.DataFrame(data)
    if df.empty:
        return pd.DataFrame(columns=["year", "crime_count"])
    df["year"] = df["year"].astype(int)
    df["crime_count"] = df["crime_count"].astype(int)
    years = pd.DataFrame({"year": list(range(start_year, end_year + 1))})
    return years.merge(df, on="year", how="left").fillna({"crime_count": 0}).astype({"crime_count": int})

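def nyc_crime_2006_2024():
    # Try the whole range in one request; if the API rejects it,
    # fall back to two smaller requests and stitch them together.
    try:
        return grouped_year_counts(2006, 2024)
    except requests.HTTPError:
        df1 = grouped_year_counts(2006, 2014)
        df2 = grouped_year_counts(2015, 2024)
        return pd.concat([df1, df2], ignore_index=True)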

crime_newyorkcity = nyc_crime_2006_2024()
crime_newyorkcity

Unnamed: 0,year,crime_count
0,2006,530211
1,2007,535507
2,2008,528991
3,2009,511241
4,2010,510398
5,2011,498944
6,2012,505093
7,2013,496257
8,2014,492532
9,2015,479100


In [None]:
CRIMENYC_20062024 = Path("/content/drive/MyDrive/Group_Project/data")
CRIMENYC_20062024.mkdir(parents=True, exist_ok=True)

crime_newyorkcity.to_csv(CRIMENYC_20062024/"crime_newyorkcity.csv", index=False)

I now look at median household income for Chicago vs. New York City, 2006-2024, from US Census data:

In [None]:
import requests
import pandas as pd

CITIES = [
    {"city": "Chicago", "state": "17", "place": "14000"},
    {"city": "New York City", "state": "36", "place": "51000"},
]

def fetch_income(year: int, state: str, place: str) -> int:
    dataset = "acs5" if year == 2020 else "acs1"
    url = f"https://api.census.gov/data/{year}/acs/{dataset}"
    params = {"get": "NAME,B19013_001E", "for": f"place:{place}", "in": f"state:{state}"}
    r = requests.get(url, params=params)
    if not r.ok and dataset == "acs1":
        url = f"https://api.census.gov/data/{year}/acs/acs5"
        r = requests.get(url, params=params)
    r.raise_for_status()
    return int(r.json()[1][1])

rows = []
for year in range(2006, 2025):
    for c in CITIES:
        rows.append({
            "year": year,
            "city": c["city"],
            "median_income_usd": fetch_income(year, c["state"], c["place"])
        })

df_median_income = pd.DataFrame(rows).sort_values(["city", "year"]).reset_index(drop=True)

assert len(df_median_income) == len(CITIES) * (2025 - 2006)
assert df_median_income["median_income_usd"].isna().sum() == 0

df_median_income


Unnamed: 0,year,city,median_income_usd
0,2006,Chicago,43223
1,2007,Chicago,45505
2,2008,Chicago,46911
3,2009,Chicago,45734
4,2010,Chicago,44776
5,2011,Chicago,43628
6,2012,Chicago,45214
7,2013,Chicago,47099
8,2014,Chicago,48734
9,2015,Chicago,50702
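
The `fetch_income` cell queries variable `B19013_001E` (median household income) and switches to the 5-year dataset for 2020, the year for which the Census Bureau did not publish standard ACS 1-year estimates. The sketch below separates that choice into a small, network-free helper. `acs_endpoint` is a hypothetical name, not part of the notebook.

```python
def acs_endpoint(year, prefer_one_year=True):
    """Return the ACS API URL that would be tried first for a given year."""
    # 2020 has no standard 1-year release, so use the 5-year dataset there.
    dataset = "acs1" if prefer_one_year and year != 2020 else "acs5"
    return f"https://api.census.gov/data/{year}/acs/{dataset}"

endpoints = {year: acs_endpoint(year) for year in (2019, 2020, 2021)}
for year, url in endpoints.items():
    print(year, url)
```

One consequence worth keeping in mind: a 5-year estimate pools responses from 2016-2020, so the 2020 value is not strictly comparable to the 1-year values around it.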


In [None]:
MEDIANINCOME = Path("/content/drive/MyDrive/Group_Project/data")
MEDIANINCOME.mkdir(parents=True, exist_ok=True)

df_median_income.to_csv(MEDIANINCOME/"df_median_income.csv", index=False)

I now look at the unemployment rate as well as education (educated is defined here as holding a Bachelor's degree or higher) and again compare the cities year by year using US Census data:

In [None]:
import requests, pandas as pd
import numpy as np

CITIES = [
    {"city": "Chicago", "state": "17", "place": "14000"},
    {"city": "New York City", "state": "36", "place": "51000"},
]
VARS = ["B23025_003E","B23025_005E","B15003_001E","B15003_022E","B15003_023E","B15003_024E","B15003_025E"]

def fetch(year, state, place):
    dataset = "acs5" if year == 2020 else "acs1"
    url = f"https://api.census.gov/data/{year}/acs/{dataset}"
    params = {"get": "NAME," + ",".join(VARS), "for": f"place:{place}", "in": f"state:{state}"}
    r = requests.get(url, params=params, timeout=30)
    if not r.ok and dataset == "acs1":
        url = f"https://api.census.gov/data/{year}/acs/acs5"
        r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    row = r.json()[1]
    rec = dict(zip(["NAME"] + VARS + ["state","place"], row))
    civilian = int(rec["B23025_003E"]) if rec["B23025_003E"] not in ("", None) else 0
    unemployed = int(rec["B23025_005E"]) if rec["B23025_005E"] not in ("", None) else 0
    total25 = int(rec["B15003_001E"]) if rec["B15003_001E"] not in ("", None) else 0
    degs = sum(int(rec[v]) if rec[v] not in ("", None) else 0 for v in ["B15003_022E","B15003_023E","B15003_024E","B15003_025E"])
    unemp = (unemployed / civilian * 100.0) if civilian else None
    edu = (degs / total25 * 100.0) if total25  else None
    return unemp, edu

rows = []
for year in range(2006, 2025):
    for c in CITIES:
        try:
            u, e = fetch(year, c["state"], c["place"])
        except requests.HTTPError:
            u, e = np.nan, np.nan
        rows.append({"year": year, "city": c["city"], "unemployment_rate_pct": u, "bachelors_plus_pct": e})

df = pd.DataFrame(rows)

years = range(2006, 2025)
idx = pd.MultiIndex.from_product([df["city"].unique(), years], names=["city","year"])
df_full = df.set_index(["city","year"]).reindex(idx)

df_full[["unemployment_rate_pct", "bachelors_plus_pct"]] = df_full[["unemployment_rate_pct", "bachelors_plus_pct"]].apply(pd.to_numeric, errors='coerce')

interpolated_groups = []
for city, group in df_full.groupby(level=0):
    interpolated_group = group.interpolate(method="linear", limit_direction="both")
    interpolated_groups.append(interpolated_group)

df_full_interpolated = pd.concat(interpolated_groups)


df_full_interpolated = df_full_interpolated.reset_index()
df_full_interpolated["unemployment_rate_pct"] = df_full_interpolated["unemployment_rate_pct"].round(2)
df_full_interpolated["bachelors_plus_pct"] = df_full_interpolated["bachelors_plus_pct"].round(2)

df_full_interpolated

Unnamed: 0,city,year,unemployment_rate_pct,bachelors_plus_pct
0,Chicago,2006,14.1,33.46
1,Chicago,2007,14.1,33.46
2,Chicago,2008,14.1,33.46
3,Chicago,2009,14.1,33.46
4,Chicago,2010,14.1,33.46
5,Chicago,2011,14.1,33.46
6,Chicago,2012,13.66,34.47
7,Chicago,2013,12.67,35.13
8,Chicago,2014,10.91,35.96
9,Chicago,2015,9.5,36.64
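
The two percentages are simple ratios of ACS counts: unemployed (`B23025_005E`) over the civilian labor force (`B23025_003E`), and the sum of bachelor's, master's, professional, and doctorate holders (`B15003_022E` to `B15003_025E`) over the population aged 25 and over (`B15003_001E`). A minimal sketch with made-up counts (the numbers are placeholders, not Census figures, and `rates_from_record` is a hypothetical helper):

```python
DEGREE_VARS = ["B15003_022E", "B15003_023E", "B15003_024E", "B15003_025E"]

def rates_from_record(rec):
    """Compute (unemployment %, bachelor's-or-higher %) from one ACS record."""
    labor_force = int(rec["B23025_003E"])
    unemployed = int(rec["B23025_005E"])
    pop_25_plus = int(rec["B15003_001E"])
    degrees = sum(int(rec[v]) for v in DEGREE_VARS)
    # Guard against a zero denominator, as the notebook cell does.
    unemp_pct = unemployed / labor_force * 100 if labor_force else None
    edu_pct = degrees / pop_25_plus * 100 if pop_25_plus else None
    return unemp_pct, edu_pct

# Placeholder counts for illustration only; the API returns values as strings.
example = {
    "B23025_003E": "1000", "B23025_005E": "80",
    "B15003_001E": "2000",
    "B15003_022E": "400", "B15003_023E": "200",
    "B15003_024E": "60", "B15003_025E": "40",
}
unemp_pct, edu_pct = rates_from_record(example)
print(unemp_pct, edu_pct)
```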


In [None]:
UNEMPLOYMENT_EDU = Path("/content/drive/MyDrive/Group_Project/data")
UNEMPLOYMENT_EDU.mkdir(parents=True, exist_ok=True)

df_full_interpolated.to_csv(UNEMPLOYMENT_EDU/"df_full_interpolated.csv", index=False)