# Rodent Inspection:

a. To get the required columns, use this module:


1.   get_area_of_interest(df_spark, interested_columns)


b. Preprocessing pipeline: pass your data through these functions (if your columns fall into those categories).

1.   valid_date_check(date)
2.   reverse_geo_code_boros(df_spark, Latitude, Longitude, Boro, lat_index, long_index, master_path)
3.   valid_borough_check(borough)
4.   to_check_long(longitude)
5.   to_check_lat(latitude)
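
The borough, latitude and longitude checks are listed above but not defined in this notebook. A minimal sketch of what they might look like follows. The function bodies, the borough set and the bounding-box values are assumptions for illustration, not the original implementation.

```python
# Hypothetical validators: the names mirror the list above, but the bodies
# are assumptions, not the notebook's actual code.
NYC_BOROUGHS = {"MANHATTAN", "BROOKLYN", "QUEENS", "BRONX", "STATEN ISLAND"}

# Approximate bounding box for New York City (assumed values).
LAT_MIN, LAT_MAX = 40.47, 40.93
LONG_MIN, LONG_MAX = -74.27, -73.68


def valid_borough_check(borough):
    """Return True if the value names one of the five boroughs."""
    if borough is None:
        return False
    return str(borough).strip().upper() in NYC_BOROUGHS


def _in_range(value, low, high):
    """Return True if value converts to a float within [low, high]."""
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False


def to_check_lat(latitude):
    return _in_range(latitude, LAT_MIN, LAT_MAX)


def to_check_long(longitude):
    return _in_range(longitude, LONG_MIN, LONG_MAX)


print(valid_borough_check("Queens"), to_check_lat(40.706), to_check_long(-73.911))
```

Each check returns a boolean, so it can be used in an RDD `filter` in the same way as `valid_date_check`.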

In [1]:
!pip install pyspark
!pip install openclean


In [2]:
#importing packages required
# Standard library
import os
import sys
import re
import random
import time
import datetime
import warnings
# Third-party
import requests
import numpy as np
import pandas as pd
import scipy as sp
import sklearn
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import IPython
from IPython import display
# Spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType
# openclean
from openclean.pipeline import stream
from openclean.profiling.column import DefaultColumnProfiler
from openclean.data.source.socrata import Socrata
from openclean.function.eval.datatype import IsDatetime

In [None]:
from geopy.geocoders import ArcGIS
geocoder=ArcGIS()
#example:
geocoder.reverse('40.61157006600007, -73.74736517199995')

Location(11-64 Redfern Ave, Far Rockaway, New York 11691, USA, (40.61161616586613, -73.74738361194636, 0.0))

In [9]:
#Creating Spark Session
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [3]:
import os
import urllib.request

# Correct CSV download link for Rodent Inspection dataset
fn_src = 'https://data.cityofnewyork.us/resource/p937-wjvj.csv?$limit=10000'
fn_dst = '/content/Rodent_Inspection.csv'

if os.path.isfile(fn_dst):
    print('File has already been downloaded:', fn_dst)
else:
    print('Fetching file. This may take a while...', fn_dst)
    urllib.request.urlretrieve(fn_src, fn_dst)
    print('File %s has been downloaded' % fn_dst)

Fetching file. This may take a while... /content/Rodent_Inspection.csv
File /content/Rodent_Inspection.csv has been downloaded


In [5]:
# Keep only the columns of interest from the PySpark DataFrame
def get_area_of_interest(df_spark, interested_columns):
  df_spark=df_spark.select(interested_columns)
  return df_spark

# 2. Module for date related columns

The dataset is meant to cover 2006 to 2025, but the date values actually range from malformed entries such as "1010-05-14" up to the year 2025, so the column needs cleaning. Here we remove rows whose date is null, malformed, or outside 2006 to 2025.

In [6]:
# fileName='1010-05-14 00:00:00'
# # matches=re.search("([0-9]{4}\-[0-9]{2}\-[0-9]{2})", fileName)
# re.search(r'([0-9]{4}\-[0-9]{2}\-[0-9]{2})', fileName).group(0)

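def valid_date_check(date):
  # Reject null or blank values.
  if date is None or date.strip() == "":
    return False
  try:
    # Expected layout: "MM/DD/YYYY HH:MM:SS AM"; anything else raises and is rejected.
    date_part, _time_part, _meridiem = date.split(" ")
    month, day, year = (int(part) for part in date_part.split("/"))
    # Keep only the years covered by the original dataset.
    if not 2006 <= year <= 2025:
      return False
    # Raises ValueError for impossible dates such as 02/30.
    datetime.datetime(year, month, day)
    return True
  except (ValueError, AttributeError):
    return False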
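
As an alternative, the same check can be sketched with `datetime.strptime`, which validates the whole timestamp in one step. This is a sketch under assumptions: the function name `valid_date_check_strptime` is hypothetical, and the format string is inferred from the sample rows (`08/27/2021 08:52:43 AM`). It is stricter than the manual split, because it also validates the time portion.

```python
import datetime

# Assumed timestamp layout, inferred from the sample rows: "MM/DD/YYYY HH:MM:SS AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def valid_date_check_strptime(date, min_year=2006, max_year=2025):
    """Return True if `date` parses with DATE_FORMAT and its year is in range."""
    if date is None or not date.strip():
        return False
    try:
        parsed = datetime.datetime.strptime(date.strip(), DATE_FORMAT)
    except ValueError:
        # Wrong layout or an impossible calendar date.
        return False
    return min_year <= parsed.year <= max_year


for sample in ["08/27/2021 08:52:43 AM", "1010-05-14 00:00:00", "", None]:
    print(repr(sample), valid_date_check_strptime(sample))
```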

# 6.b Module for Reverse Geocoding the boroughs using latitudes and longitudes.

1. First, remove the rows where latitude, longitude and borough are null (around 450 tuples removed).
2. Then, where the borough is empty, take the latitude and longitude values and reverse geocode them using the function `reverseGeoCoder`.
3. Impute the retrieved borough name into the empty field.


### USING MASTER DATASET
When reverse geocoding, the geocoder gives us the ZIP code for a latitude and longitude pair. In turn, we can use the master dataset of ZIP codes to retrieve the borough names.



NOTE: The dataset can be downloaded from : https://data.beta.nyc/en/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5
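
The ZIP-to-borough lookup can be illustrated without calling the geocoder. The sketch below is a hedged example: `zip_lookup` is a tiny stand-in for the master dataset, `borough_from_address` is a hypothetical helper name, and the address is the one returned by the geocoder example earlier in this notebook.

```python
import re

# Tiny stand-in for the master ZIP lookup; the real entries come from the CSV.
zip_lookup = {11691: "Queens", 10020: "Manhattan", 11249: "Brooklyn"}


def borough_from_address(address, lookup):
    """Extract the ZIP code from an address string and map it to a borough."""
    # Take the last 5-digit group, since a house number could also have 5 digits.
    candidates = re.findall(r"\b\d{5}\b", address)
    if not candidates:
        return "UNKNOWN"
    return lookup.get(int(candidates[-1]), "UNKNOWN").upper()


address = "11-64 Redfern Ave, Far Rockaway, New York 11691, USA"
print(borough_from_address(address, zip_lookup))  # QUEENS
```

Using a regular expression rather than fixed comma positions is a design choice for this sketch; it tolerates addresses with more or fewer comma-separated parts.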

In [7]:
def reverseGeoCoder(latitude, longitude):
  # Relies on two module-level names: `geocoder` (created earlier) and
  # `zip_master` (published as a global by reverse_geo_code_boros).
  loc=geocoder.reverse(str(latitude)+', '+str(longitude), timeout=10)
  try:
    # The third comma-separated field of the address ends with the ZIP code.
    zipCode=int(str(loc).split(",")[2][-5:])
  except (IndexError, ValueError):
    return "UNKNOWN"
  # ZIP codes missing from the master dataset map to UNKNOWN.
  return str(zip_master.get(zipCode, "UNKNOWN")).upper()

def reverse_geo_code_boros(df_spark, Latitude, Longitude, Boro, lat_index, long_index, master_path):
  # reverseGeoCoder reads zip_master, so it must be visible outside this function.
  global zip_master

  #select data where we have to impute
  has_coords=(df_spark[Latitude].isNotNull()) & (df_spark[Longitude].isNotNull())
  needs_boro=(df_spark[Boro].isNull()) | (df_spark[Boro]=='NEW YORK')
  boro_cleaner=df_spark.filter(has_coords & needs_boro)
  print("We have "+ str(boro_cleaner.count())+ " points to impute")
  print("___intializing Zip Code Look up ____")

  #use your path for master dataset here.
  df_zips=pd.read_csv(master_path)
  zip_master=dict(zip(df_zips['zip'], df_zips['borough']))
  # ZIP codes added by hand on top of the master dataset.
  zip_master[10020]='Manhattan'
  zip_master[11249]='Brooklyn'

  print("____ imputing the points ____")
  #creating UD function
  ud_func= udf(reverseGeoCoder, StringType())
  boro_cleaned_dataframe = boro_cleaner.withColumn(Boro, ud_func(boro_cleaner[lat_index], boro_cleaner[long_index]))

  #joining the imputed dataset to the main dataset and returning.
  #Rows sent for imputation are excluded here so they are not duplicated in the union.
  joiner_dataset=df_spark.filter(has_coords & ~needs_boro)
  fin_df=joiner_dataset.union(boro_cleaned_dataframe)
  return fin_df

The dataset has about 24k tuples, so we need around 2,000 data points for a 95% confidence level with a 2% confidence interval. That sample is almost 10% of the dataset, so we can load it into our DataFrame now.
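
The sample-size estimate can be reproduced with a short calculation. The notebook does not say which formula was used; this sketch assumes Cochran's formula with a finite population correction and the most conservative proportion, p = 0.5. The function name `required_sample_size` is hypothetical.

```python
import math


def required_sample_size(population, z=1.96, margin=0.02, p=0.5):
    """Cochran's sample size with finite population correction (assumed method)."""
    # Sample size for an infinite population; z=1.96 corresponds to 95% confidence.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink the estimate for a finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(required_sample_size(24_000))  # 2183
```

Under these assumptions the result is a little over 2,000 rows, which is roughly 9% of 24k tuples and in line with the estimate above.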

In [10]:
df_spark=spark.read.option("header",True).csv(fn_dst,inferSchema=True)
df_spark=df_spark.sample(0.001)
df_spark.count()

24

In [None]:
df_spark.printSchema()

root
 |-- INSPECTION_TYPE: string (nullable = true)
 |-- JOB_TICKET_OR_WORK_ORDER_ID: integer (nullable = true)
 |-- JOB_ID: string (nullable = true)
 |-- JOB_PROGRESS: integer (nullable = true)
 |-- BBL: long (nullable = true)
 |-- BORO_CODE: integer (nullable = true)
 |-- BLOCK: integer (nullable = true)
 |-- LOT: integer (nullable = true)
 |-- HOUSE_NUMBER: string (nullable = true)
 |-- STREET_NAME: string (nullable = true)
 |-- ZIP_CODE: integer (nullable = true)
 |-- X_COORD: integer (nullable = true)
 |-- Y_COORD: integer (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)
 |-- BOROUGH: string (nullable = true)
 |-- INSPECTION_DATE: string (nullable = true)
 |-- RESULT: string (nullable = true)
 |-- APPROVED_DATE: string (nullable = true)
 |-- LOCATION: string (nullable = true)



## a. Select the columns that are common with the original dataset:
1. BOROUGH
2. LATITUDE
3. LONGITUDE
4. INSPECTION_DATE

We can also keep the columns that make up the primary key:

5. INSPECTION_TYPE
6. JOB_TICKET_OR_WORK_ORDER_ID

In [None]:
interested_columns_1=['INSPECTION_TYPE', 'JOB_TICKET_OR_WORK_ORDER_ID', 'INSPECTION_DATE', 'BOROUGH', 'LATITUDE', 'LONGITUDE']
df_spark=get_area_of_interest(df_spark, interested_columns_1)

In [None]:
df_spark.count()

2041

In [None]:
df_temp=df_spark.rdd

In [None]:
df_temp.take(2)

[Row(INSPECTION_TYPE='Initial', JOB_TICKET_OR_WORK_ORDER_ID=13282421, INSPECTION_DATE='08/27/2021 08:52:43 AM', BOROUGH='Queens', LATITUDE=40.70621577643, LONGITUDE=-73.911020313495),
 Row(INSPECTION_TYPE='Initial', JOB_TICKET_OR_WORK_ORDER_ID=13282992, INSPECTION_DATE='08/27/2021 09:55:00 AM', BOROUGH='Staten Island', LATITUDE=40.634265775352, LONGITUDE=-74.100311764687)]

1. Date and Time

In [None]:
df_temp_=df_temp.map(lambda x:(x, valid_date_check(x[2]))).filter(lambda x: x[1]==True)
df_temp=df_temp_.map(lambda x: x[0])

In [None]:
df_temp.count()

1941

In [None]:
# The geocoding step below requires a PySpark DataFrame (not an RDD)
df_temp=df_temp.toDF(schema=df_spark.schema)

3. Geocoding

In [None]:
df_spk=reverse_geo_code_boros(df_temp, 'LATITUDE', 'LONGITUDE', 'BOROUGH', -2, -1, dst)

We have 0 points to impute
___intializing Zip Code Look up ____
____ imputing the points ____


Let's profile the data now.

In [None]:
pandasDF = df_spk.toPandas()
ds=stream(pandasDF)

#Creating profile of our dataset
profiles = ds.profile(default_profiler=DefaultColumnProfiler)
profiles.stats()

| column | total | empty | distinct | uniqueness | entropy |
|---|---|---|---|---|---|
| INSPECTION_TYPE | 1938 | 0 | 3 | 0.001548 | 1.139824 |
| JOB_TICKET_OR_WORK_ORDER_ID | 1938 | 0 | 1938 | 1.0 | 10.920353 |
| INSPECTION_DATE | 1938 | 0 | 1938 | 1.0 | 10.920353 |
| BOROUGH | 1938 | 0 | 5 | 0.00258 | 1.987243 |
| LATITUDE | 1938 | 0 | 1897 | 0.978844 | 10.864434 |
| LONGITUDE | 1938 | 0 | 1897 | 0.978844 | 10.864434 |


# 2. Precision And Recall:

Precision is lower here because the inspection date can fall in any time period, while our original dataset restricts it to 2006 through 2025.

True positives = 1941
Selected elements = 2041
Relevant elements = 1941

precision = true positives / selected elements = 1941/2041 ≈ 0.951
recall = true positives / relevant elements = 1941/1941 = 1
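
The arithmetic can be checked in a few lines of Python. The counts are the ones listed above, and the variable names are illustrative. Precision divides the true positives by the selected elements, while recall divides them by the relevant elements.

```python
# Counts reported above.
true_positives = 1941
selected_elements = 2041
relevant_elements = 1941

# Precision: share of selected rows that are relevant.
precision = true_positives / selected_elements
# Recall: share of relevant rows that were selected.
recall = true_positives / relevant_elements

print(f"precision = {precision:.3f}")  # precision = 0.951
print(f"recall    = {recall:.3f}")     # recall    = 1.000
```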
