## Misclassified Crimes in LAPD Data (Los Angeles Times)

- [Times Investigation: LAPD misclassified nearly 1,200 violent crimes as minor offenses](https://www.latimes.com/local/la-me-crimestats-lapd-20140810-story.html)
- [LAPD underreported serious assaults, skewing crime stats for 8 years](https://www.latimes.com/local/cityhall/la-me-crime-stats-20151015-story.html)
- [How we reported this story](https://www.latimes.com/local/cityhall/la-me-crime-stats-side-20151015-story.html)

## Definitions

>**Aggravated Assault:** An unlawful attack by one person upon another for the purpose of inflicting severe or aggravated bodily injury. This type of assault usually is accompanied by the use of a weapon or by means likely to produce death or great bodily harm.


>**Other Assault:** Simple, Not Aggravated. Includes all assaults which do not involve the use of a firearm, knife, cutting instrument, or other dangerous weapon and in which the victim did not sustain serious or aggravated injuries. 

## Our Data Sample

The full dataset has hundreds of thousands of rows. For now, we will work with a sample of 100 of them: https://docs.google.com/spreadsheets/d/1LZ72b3cgVi7mhryMiromE3eT86DSnfna1cXjX-jLvGk/edit#gid=0
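
The notebook does not say how the 100 rows were drawn. As a rough sketch of one way to do it, the snippet below takes a reproducible random sample with pandas. The `full_df` frame and its contents are made up for illustration; only the `description` column name comes from this notebook.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical data, not the real records).
full_df = pd.DataFrame({
    "description": [f"incident {i}" for i in range(1000)],
})

# Fixing random_state means the same 100 rows come back on every run,
# so everyone working from the sample sees identical data.
sample_df = full_df.sample(n=100, random_state=42).reset_index(drop=True)
print(len(sample_df))
```

Sampling without replacement (the default) guarantees that no row appears twice in the sample.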

## Load the data

In [None]:
%matplotlib inline
import csv, requests, os
import pandas as pd
import numpy as np

In [None]:
def make_regular_gsheet_url(doc_id, sheet_id):
    return f"https://docs.google.com/spreadsheets/d/{doc_id}/edit#gid={sheet_id}"

def make_csv_gsheet_url(doc_id, sheet_id):
    return f"https://docs.google.com/spreadsheets/d/{doc_id}/export?format=csv&id={doc_id}&gid={sheet_id}"

GOOGLE_SHEET_ID = '1LZ72b3cgVi7mhryMiromE3eT86DSnfna1cXjX-jLvGk'
print("Querying Doc:", make_regular_gsheet_url(GOOGLE_SHEET_ID, "0"))
response = requests.get(make_csv_gsheet_url(GOOGLE_SHEET_ID, "0"))
response.raise_for_status()  # fail loudly if the sheet could not be downloaded
reader = csv.reader(response.text.splitlines())
header = next(reader)
df = pd.DataFrame(list(reader), columns=header)
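
As an alternative to building the frame by hand with `csv.reader`, the CSV text can be passed straight to pandas. This is a minimal offline sketch: `csv_text` stands in for `response.text`, and its two rows are invented for illustration. The `description` and `lapd` column names are the ones used later in this notebook.

```python
import io
import pandas as pd

# Invented sample standing in for response.text from the Google Sheets export.
csv_text = (
    "description,lapd\n"
    '"Victim struck with a bottle",Aggravated Assault\n'
    '"Suspect pushed victim",Other Assault\n'
)

# dtype=str keeps every column as text, matching what csv.reader produces.
df_demo = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df_demo.columns.tolist())
```

With the real response, `io.StringIO(response.text)` would take the place of `io.StringIO(csv_text)`.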


## You are the classifier 👈


Based on the definitions provided, categorize the data you have been assigned as `Other Assault` or `Aggravated Assault`.
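
Hand-typed labels tend to vary in case and spacing, which makes later comparisons unreliable. The helper below is a hypothetical sketch (neither `normalize_label` nor `ALLOWED_LABELS` is part of the original notebook) that maps such input onto the two canonical labels and rejects anything else.

```python
ALLOWED_LABELS = ("Aggravated Assault", "Other Assault")

def normalize_label(raw):
    """Return the canonical label for a hand-typed value, or raise ValueError."""
    # Collapse repeated whitespace and ignore case before comparing.
    cleaned = " ".join(raw.split()).casefold()
    for label in ALLOWED_LABELS:
        if cleaned == label.casefold():
            return label
    raise ValueError(f"Unknown label: {raw!r}")

hand_labels = ["other assault", " Aggravated  Assault "]
normalized = [normalize_label(value) for value in hand_labels]
print(normalized)
```

Misspelled labels raise an error rather than being silently guessed, so they can be corrected at the source.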

## ChatGPT as the classifier 🤖

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

In [None]:
import diskcache
cache = diskcache.Cache('./cache')  # stores in ./cache folder
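
The point of the cache is that a repeated call with the same argument returns the stored result instead of doing the work again. The sketch below illustrates that behavior with `functools.lru_cache` from the standard library as a stand-in. Unlike `diskcache`, it keeps results in memory only, so they do not survive a kernel restart. `slow_lookup` and `call_count` are hypothetical names used only for this demonstration.

```python
from functools import lru_cache

call_count = 0  # counts how many times the function body actually runs

@lru_cache(maxsize=None)
def slow_lookup(text):
    """Pretend to be an expensive API call."""
    global call_count
    call_count += 1
    return text.upper()

slow_lookup("a")
slow_lookup("a")  # served from the cache, body does not run again
slow_lookup("b")
print(call_count)
```

The body runs once per distinct argument, which is the same property that keeps the notebook from paying for the same API request twice.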

In [None]:
from pydantic import BaseModel

# This is a pydantic model. It defines what format I want the output to come back in
# It's for an OpenAI feature called "Structured Output", but also works with other LLM tools
class Classification(BaseModel):
    classification: bool  # True = Aggravated Assault, False = Other Assault
    reason: str
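
To see what the model enforces, here is a small sketch that validates a JSON string against it. It assumes pydantic v2 (for `model_validate_json`), and the JSON payloads are invented examples rather than real API responses.

```python
from pydantic import BaseModel, ValidationError

class Classification(BaseModel):
    classification: bool
    reason: str

# A well-formed payload parses into typed attributes.
raw = '{"classification": true, "reason": "A knife was used."}'
parsed = Classification.model_validate_json(raw)
label = "Aggravated Assault" if parsed.classification else "Other Assault"

# A payload missing a required field is rejected.
try:
    Classification.model_validate_json('{"classification": true}')
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True

print(label, missing_field_rejected)
```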

In [None]:
from openai import OpenAI
client = OpenAI()

MODEL = 'gpt-4o-2024-08-06'

@cache.memoize() # This is the diskcache! Now I will never hit the API twice with the same request!
def ask_chatgpt_to_classify(text_description, model=MODEL):
  response = client.beta.chat.completions.parse(
    model=model,
    messages=[
      {
        "role": "system",
        "content": "You are a classifier that helps to classify between two categories.\n\nAggravated Assault: An unlawful attack by one person upon another for the purpose of inflicting severe or aggravated bodily injury. This type of assault usually is accompanied by the use of a weapon or by means likely to produce death or great bodily harm.\n\nOther Assault: Simple, Not Aggravated. Includes all assaults which do not involve the use of a firearm, knife, cutting instrument, or other dangerous weapon and in which the victim did not sustain serious or aggravated injuries.\n\nI'll give you various snippets and I'd like you to categorize each as one or the other. Set `classification` to true for 'Aggravated Assault' and false for 'Other Assault', and give a brief justification in `reason`."
      },
      {
        "role": "user",
        "content": text_description
      },
    ],
    response_format=Classification,
    temperature=0
  )

  return response.choices[0].message.content

In [None]:
import json
from tqdm.notebook import tqdm
tqdm.pandas()

df[MODEL] = df['description'].progress_apply(ask_chatgpt_to_classify)
df['classification'] = df[MODEL].apply(lambda x: json.loads(x)['classification'])
# Map the boolean to labels: True -> 'Aggravated Assault', False -> 'Other Assault'
df['classification'] = df['classification'].replace({True: 'Aggravated Assault', False: 'Other Assault'})
df['reason'] = df[MODEL].apply(lambda x: json.loads(x)['reason'])
# Drop the raw JSON column now that it has been unpacked
del df[MODEL]

## Calculate precision and recall vs LAPD

In [None]:
pd.crosstab(df['lapd'], df['classification'])

In [None]:
# use sklearn to calculate precision, recall, f1 and accuracy
from sklearn.metrics import classification_report
print(classification_report(df['lapd'], df['classification']))
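
To make the report's numbers less opaque, the sketch below computes precision and recall for the `Aggravated Assault` class by hand. The two label lists are made up for illustration and are not results from this dataset. Treating the LAPD label as the reference is also an assumption of the comparison: given the misclassification the Times reported, a disagreement does not by itself mean the model is wrong.

```python
from collections import Counter

# Invented labels: reference (LAPD) vs. predicted (model).
lapd = ["Aggravated Assault", "Other Assault", "Other Assault",
        "Aggravated Assault", "Other Assault", "Other Assault"]
model = ["Aggravated Assault", "Aggravated Assault", "Other Assault",
         "Aggravated Assault", "Other Assault", "Aggravated Assault"]

POSITIVE = "Aggravated Assault"
counts = Counter(zip(lapd, model))  # (reference, predicted) -> count

tp = counts[(POSITIVE, POSITIVE)]
fp = sum(n for (ref, pred), n in counts.items() if ref != POSITIVE and pred == POSITIVE)
fn = sum(n for (ref, pred), n in counts.items() if ref == POSITIVE and pred != POSITIVE)

precision = tp / (tp + fp)  # of rows the model called aggravated, share LAPD agrees with
recall = tp / (tp + fn)     # of rows LAPD called aggravated, share the model also flagged
print(precision, recall)    # 0.5 1.0
```

Here the model flags every row LAPD labeled aggravated (recall 1.0) but also flags two rows LAPD labeled as other assaults (precision 0.5).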