# NYC's Mascot: Investigating Rodent Inspections Across New York City
## DTSC 2302 Final Project
### *By Arnav Sareen, Rohan Salwekar, Sindhu Gadiraju, and RJ Wright*

In [1]:
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import joblib

## Introduction

## Data Description

Our dataset comes with a [Data Dictionary](https://github.com/aesareen/2302-Final-Project/blob/main/data_dictionary.pdf), which we have attached for further reference. The dataset contains `2,750,046` rows, representing all rodent inspections through April 10th, 2025. There are 18 feature columns and 1 target column, though some feature columns carry largely redundant information that we did not use. Notable features are summarized below:

`INSPECTION_TYPE`: Specifies the type of inspection done (Initial, Compliance, Baiting, Clean Up, etc.) <br>
`JOB_ID`: Unique Job ID to identify a Job <br>
`BLOCK`: The block number for the inspected tax lot (not unique to every borough) <br>
`STREETNAME`: The street name portion of the address of the taxlot inspected <br>
`BOROUGH`: Name of the NYC Borough <br>
`LATITUDE`: Latitude in decimal degrees of the inspected taxlot <br>
`LONGITUDE`: Longitude in decimal degrees of inspected taxlot <br>
`INSPECTION_DATE`: Date of Inspection <br>
`RESULT`: Result of the inspection (showing of active rat signs or problem conditions that can promote rats) <br>

## Model Two: Random Forests
### *By Rohan Salwekar*

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

In [3]:
df = pl.read_csv("Rodent_Inspection_20250410.csv")

df = df.select(["BOROUGH", "INSPECTION_DATE", "RESULT"])

df.head()

BOROUGH,INSPECTION_DATE,RESULT
str,str,str
"""Bronx""","""08/30/2010 03:23:11 PM""","""Passed"""
"""Manhattan""","""08/18/2011 12:05:54 PM""","""Passed"""
"""Brooklyn""","""10/10/2018 12:57:02 PM""","""Passed"""
"""Manhattan""","""02/07/2019 12:48:34 PM""","""Passed"""
"""Bronx""","""10/16/2017 01:02:51 PM""","""Rat Activity"""


In [4]:
pdf = df.to_pandas()

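# Parse dates with an explicit format (e.g. "08/30/2010 03:23:11 PM") so pandas
# does not have to infer it; values that do not match become NaT
pdf["INSPECTION_DATE"] = pd.to_datetime(
    pdf["INSPECTION_DATE"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)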

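# Keep only inspections dated 2010 through 2025; rows with unparseable dates (NaT) are dropped too.
# .copy() makes the filtered frame independent so later column assignments are safe.
pdf = pdf[
    (pdf["INSPECTION_DATE"].dt.year >= 2010) &
    (pdf["INSPECTION_DATE"].dt.year <= 2025)
].copy()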

# Create target
pdf["target"] = (pdf["RESULT"] == "Rat Activity").astype(int)

# Create two separate datasets
pdf_month = pdf.copy()
pdf_year = pdf.copy()

# For monthly model: extract month
pdf_month["INSPECTION_MONTH"] = pdf_month["INSPECTION_DATE"].dt.month
pdf_month = pdf_month[["BOROUGH", "INSPECTION_MONTH", "target"]].dropna()

# For yearly model: extract year
pdf_year["INSPECTION_YEAR"] = pdf_year["INSPECTION_DATE"].dt.year
pdf_year = pdf_year[["BOROUGH", "INSPECTION_YEAR", "target"]].dropna()


In [5]:
# Monthly
X_month = pdf_month.drop(columns="target")
y_month = pdf_month["target"]

X_train_month, X_test_month, y_train_month, y_test_month = train_test_split(
    X_month, y_month, test_size=0.3, random_state=1, stratify=y_month
)

# Yearly
X_year = pdf_year.drop(columns="target")
y_year = pdf_year["target"]

X_train_year, X_test_year, y_train_year, y_test_year = train_test_split(
    X_year, y_year, test_size=0.3, random_state=1, stratify=y_year
)

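The `stratify` argument used in the splits above keeps the share of positive labels roughly equal in the training and test sets. The following is a minimal sketch on synthetic data, not the real inspection data: the toy frame, the 20% positive rate, and the `*_toy` names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pdf_month (hypothetical values, not the real dataset)
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "BOROUGH": rng.choice(["Bronx", "Brooklyn", "Manhattan"], size=n),
    "INSPECTION_MONTH": rng.integers(1, 13, size=n),
    "target": (rng.random(n) < 0.2).astype(int),  # assumed ~20% positives
})

X_toy = toy.drop(columns="target")
y_toy = toy["target"]

# Same split settings as the cell above: 30% test set, stratified on the target
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# The positive rate should be nearly identical across the full, train, and test sets
print(y_toy.mean(), y_train_toy.mean(), y_test_toy.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more when the positive class is rare.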

In [None]:
# Preprocessing for monthly model
preprocessor_month = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["BOROUGH", "INSPECTION_MONTH"])
])

# Preprocessing for yearly model
preprocessor_year = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["BOROUGH", "INSPECTION_YEAR"])
])

For simplicity, the following cells are commented out because the model took a long time to train during hyperparameter tuning. However, the code is provided in case you would like to use it yourself.