# Loading Mapping Police Violence (MPV) Data Into the NPDC Index

This notebook loads the [Mapping Police Violence](https://mappingpoliceviolence.org) dataset into the NPDC Index database.

Run it whenever the dataset changes to pull in the latest data. Rows that already exist in the database are skipped, so the notebook is safe to re-run.

## Setup

Install project dependencies and Jupyter:

```bash
pip3 install jupyter
pip3 install -r requirements/dev_unix.txt
```

The database should be running and its tables should be up to date. You can use Docker to reset the application to a clean state:

```bash
# Stop services and remove volumes, rebuild images, start the database, create tables, run seeds, and follow logs
docker-compose down -v && docker-compose up --build -d db api && docker-compose logs -f
```

Then open the notebook with either [VSCode](https://code.visualstudio.com/) or `jupyter notebook`.

You can run the notebook from the command line as well:

```bash
jupyter nbconvert --to notebook --execute backend/scraper/mpv.ipynb --output mpv
```

In [19]:
import os
if os.getcwd().endswith("scraper"):
    # Run this notebook from the repo root
    os.chdir("../..")
import sys
import math
import datetime
import numpy as np
import pandas as pd
from flask_sqlalchemy import SQLAlchemy
import psycopg2
from itertools import zip_longest
from typing import List
import requests
from IPython.display import display, HTML
from collections import namedtuple
from backend.database import db, Incident, Officer, Accusation, Victim

ModuleNotFoundError: No module named 'flask_sqlalchemy'

In [20]:
from backend.api import create_app
app = create_app("development")


ModuleNotFoundError: No module named 'flask_mail'

# Data Summary

- Detailed victim information, incident location, and the department responsible
- Records are victim-oriented: MPV IDs identify victims, and incidents are implied
- Few incidents (20%) have named officers
- Records reference WaPo and Fatal Encounters IDs. We have no object to model these yet, but we can update the scraper when we do
- Findings are less precise than in the CPDP data. CPDP contains investigation records for each accused officer, while MPV gives a single status for the whole incident. Should we model them differently?

So each MPV record creates one victim and one incident, and some records also create officers. We assume victims are 1-1 with incidents, so that we can use the MPV ID as the incident's source ID and store the victims unambiguously.
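
The 1-1 assumption is easy to check before loading. The sketch below, in plain Python, counts how often each ID appears. The `rows` list and the `reused_ids` helper are made-up stand-ins for the spreadsheet and are not real MPV data.

```python
from collections import Counter

# Hypothetical stand-in rows as (mpv_id, victim_name); not real MPV data
rows = [
    ("1001", "Victim A"),
    ("1002", "Victim B"),
    ("1002", "Victim B"),  # repeated row
    ("1003", "Victim C"),
]


def reused_ids(rows):
    """Return the IDs that appear on more than one row, in first-seen order."""
    counts = Counter(mpv_id for mpv_id, _ in rows)
    return [mpv_id for mpv_id, n in counts.items() if n > 1]


print(reused_ids(rows))  # ['1002']
```

An empty result would mean every ID maps to exactly one victim and one incident.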

In [3]:
dataset_url = "https://docs.google.com/spreadsheets/d/1g7CNEDnjk5dH412wmVTAG6XtgWyS2Vax10-BbfsBp0U/export?format=csv"
dataset_path = "backend/scraper/mpv.csv"

# TODO: Only fetch when the remote file has changed; a HEAD request and its headers could tell us
if True:  # not os.path.exists(dataset_path):
    print("Downloading dataset")
    r = requests.get(dataset_url, stream=True)
    r.raise_for_status()  # Fail loudly instead of saving an error page as the CSV
    with open(dataset_path, "wb") as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
else:
    print("Using existing dataset")


Downloading dataset


NameError: name 'requests' is not defined
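
One way to approach the TODO in the download cell is to compare the cached file's modification time with the `Last-Modified` header from a HEAD request. This is only a minimal sketch: `is_stale` is a hypothetical helper, and it assumes the server sends `Last-Modified`, which the Google Sheets export endpoint may not do. When the header is missing or cannot be parsed, it falls back to downloading.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def is_stale(local_mtime: Optional[float], headers: Mapping[str, str]) -> bool:
    """Return True when the dataset should be downloaded again.

    local_mtime: os.path.getmtime() of the cached file, or None if it is missing.
    headers: response headers from a HEAD request to the dataset URL.
    """
    if local_mtime is None:
        return True
    last_modified = headers.get("Last-Modified")
    if not last_modified:
        return True  # No header to compare against, so download to be safe
    try:
        remote = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # Unparseable date, so download to be safe
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote > local
```

In the notebook, the headers could come from `requests.head(dataset_url, allow_redirects=True).headers`, and the result of `is_stale` could replace the hard-coded `if True:`.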

In [105]:
def map_cols(df, m: dict):
    return df[list(m.keys())].rename(columns=m)


dataset = pd.read_csv(dataset_path, dtype={"Zipcode": str, "MPV ID": str}, skiprows=1)
dataset = map_cols(
    dataset,
    {
        "MPV ID": "source_id",
        "Victim's name": "victim_name",
        "Victim's gender": "victim_gender",
        "Victim's race": "victim_race",
        "Victim's age": "victim_age",
        "URL of image of victim": "victim_image_url",
        "Cause of death": "manner_of_injury",
        "Date of Incident (month/day/year)": "incident_date",
        "Street Address of Incident": "address",
        "City": "city",
        "State": "state",
        "Zipcode": "zip",
        "County": "county",
        "Latitude": "latitude",
        "Longitude": "longitude",
        "A brief description of the circumstances surrounding the death": "description",
        "Official disposition of death (justified or other), updated on 11/7/2021": "officer_outcomes",
        "Agency responsible for death": "department",
        "Criminal Charges?": "criminal_charges",
        "Link to news article or photo of official document": "source_link",
        "Off-Duty Killing?": "off_duty_killing",
        "Encounter Type (DRAFT)": "encounter_type_draft",
        "Initial Reported Reason for Encounter (DRAFT)": "encounter_reason_draft",
        "Names of Officers Involved (DRAFT)": "officer_names_draft",
        "Race of Officers Involved (DRAFT)": "officer_races_draft",
        "Known Past Shootings of Officer(s) (DRAFT)": "officer_known_past_shootings_draft",
        "Call for Service? (DRAFT)": "call_for_service_draft",
    },
).set_index("source_id", drop=False)
# Some rows are repeated and some IDs seem to be reused; keep the first row for each ID.
dataset = dataset[~dataset.index.duplicated(keep='first')]
assert not dataset.index.has_duplicates

print(dataset.columns)
print(dataset.dtypes)


Index(['source_id', 'victim_name', 'victim_gender', 'victim_race',
       'victim_age', 'victim_image_url', 'manner_of_injury', 'incident_date',
       'address', 'city', 'state', 'zip', 'county', 'latitude', 'longitude',
       'description', 'officer_outcomes', 'department', 'criminal_charges',
       'source_link', 'off_duty_killing', 'encounter_type_draft',
       'encounter_reason_draft', 'officer_names_draft', 'officer_races_draft',
       'officer_known_past_shootings_draft', 'call_for_service_draft'],
      dtype='object')
source_id                                     object
victim_name                                   object
victim_gender                                 object
victim_race                                   object
victim_age                                    object
victim_image_url                              object
manner_of_injury                              object
incident_date                         datetime64[ns]
address                                

In [98]:
def create_bulk(instances, chunk_size=1000):
    """Inserts ORM instances into the database"""
    with app.app_context():
        for chunk in range(0, len(instances), chunk_size):
            db.session.add_all(instances[chunk : chunk + chunk_size])
            db.session.flush()
        db.session.commit()


def isnan(x):
    return isinstance(x, float) and math.isnan(x)


def nan_to_none(x):
    return None if isnan(x) else x


def strip_nan(r):
    return r._make([nan_to_none(e) for e in r])


def map_df(df, mapper):
    return [mapper(strip_nan(r)) for r in df.itertuples(index=False)]


def parse_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError covers None, which strip_nan substitutes for NaN
        return None


def location(r: namedtuple):
    return " ".join(filter(None, [r.address, r.city, r.state, r.zip]))


def parse_parts(s: str):
    return [part.strip() for part in s.split(",")] if s else []


def parse_officers(r: namedtuple):
    names = parse_parts(r.officer_names_draft)
    races = parse_parts(r.officer_races_draft)

    return [
        Officer(last_name=name, race=race)
        for name, race in zip_longest(names, races)
    ]


def parse_accusations(r: namedtuple, officers: List[Officer]):
    outcomes = parse_parts(r.officer_outcomes)
    return [
        Accusation(outcome=outcome, officer=officer)
        for officer, outcome in zip_longest(officers, outcomes)
    ]


def create_orm(r: namedtuple):
    victim = Victim(
        name=r.victim_name,
        race=r.victim_race,
        gender=r.victim_gender,
        manner_of_injury=r.manner_of_injury,
        deceased=True,
    )
    officers = parse_officers(r)
    accusations = parse_accusations(r, officers)
    incident = Incident(
        source_id=r.source_id,
        source="mpv",
        time_of_incident=r.incident_date,
        location=location(r),
        description=r.description,
        department=r.department,
        # latitude=r.latitude,
        # longitude=r.longitude,
        victims=[victim],
        officers=officers,
        accusations=accusations,
    )
    return incident
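
Because `parse_officers` and `parse_accusations` pair their lists with `zip_longest`, a record with more names than races (or more officers than outcomes) yields `None` for the missing side rather than dropping entries. The standalone sketch below shows that behavior with plain tuples in place of the ORM models. The sample strings are invented.

```python
from itertools import zip_longest


def parse_parts(s):
    """Split a comma-separated cell into stripped parts; empty or None gives []."""
    return [part.strip() for part in s.split(",")] if s else []


# Invented sample cells: two officer names but only one race listed
names = parse_parts("Smith, Jones")
races = parse_parts("White")

pairs = list(zip_longest(names, races))
print(pairs)  # [('Smith', 'White'), ('Jones', None)]

# A missing names cell produces no officers at all
print(list(zip_longest(parse_parts(None), parse_parts(None))))  # []
```

The reverse case matters too: if a cell lists more races than names, `zip_longest` produces an officer whose name is `None`, so it may be worth checking the lengths before inserting.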


In [102]:
with app.app_context():
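    existing_source_ids = [
        s
        for (s,) in db.session.query(Incident.source_id).filter(
            # SQLAlchemy turns `!= None` into `IS NOT NULL`
            Incident.source == "mpv", Incident.source_id != None  # noqa: E711
        )
    ]

# Skip rows already loaded; ignore stored IDs that are no longer in the dataset
new_data = dataset.drop(existing_source_ids, errors="ignore")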
incidents = map_df(new_data, create_orm)
print(f"Found {len(existing_source_ids)} existing records. Creating {len(incidents)} new incidents")
create_bulk(incidents)

Found 9689 existing records. Creating 0 new incidents
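
The incremental load above comes down to two steps: drop the IDs that are already stored, then insert the rest in fixed-size chunks. A minimal plain-Python sketch of both steps follows. `select_new` and `chunked` are hypothetical helpers and the IDs are invented, with lists standing in for the DataFrame and the database.

```python
from typing import Iterator, List, Sequence


def select_new(dataset_ids: Sequence[str], existing_ids: Sequence[str]) -> List[str]:
    """Keep dataset IDs that are not stored yet, preserving dataset order."""
    existing = set(existing_ids)
    return [i for i in dataset_ids if i not in existing]


def chunked(items: Sequence, chunk_size: int = 1000) -> Iterator[Sequence]:
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start : start + chunk_size]


# Invented IDs: "9" is stored but no longer in the dataset, which is harmless here
new_ids = select_new(["1", "2", "3", "4", "5"], ["2", "4", "9"])
print(new_ids)  # ['1', '3', '5']
print(list(chunked(new_ids, chunk_size=2)))  # [['1', '3'], ['5']]
```

Running the load twice in a row selects nothing the second time, which matches the `Creating 0 new incidents` output above.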
