## Data Acquisition

First, I will use the `sodapy` package to pull the data from the Open Data API. The dataset is updated monthly, so querying the API keeps the analysis current. The cells below load a local CSV export.
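A minimal sketch of what pulling the data through `sodapy` could look like. `fetch_records` and `load_marijuana_summons` are illustrative helpers, not part of `sodapy`, and the dataset identifier has to be looked up on the dataset's Open Data page. The lowercase column name in the `where` clause is also an assumption. The sketch pages through results with the `limit`, `offset`, and `order` parameters that `Socrata.get` forwards to the API.

```python
import pandas as pd


def fetch_records(client, dataset_id, where=None, page_size=50000):
    """Page through a Socrata dataset and return all rows as a DataFrame.

    `client` is any object with a sodapy-style `get(dataset_id, **params)` method.
    """
    rows, offset = [], 0
    while True:
        # Ordering by the system field :id keeps paging stable between requests.
        params = {"limit": page_size, "offset": offset, "order": ":id"}
        if where:
            params["where"] = where
        batch = client.get(dataset_id, **params)
        rows.extend(batch)
        offset += len(batch)
        if len(batch) < page_size:  # short (or empty) page: nothing left to fetch
            break
    return pd.DataFrame.from_records(rows)


def load_marijuana_summons(dataset_id):
    """Hypothetical usage: needs network access and `pip install sodapy`."""
    from sodapy import Socrata

    # None means no app token, so requests are throttled.
    client = Socrata("data.cityofnewyork.us", None)
    # Assumption: the API column is the lowercase form of the CSV header.
    return fetch_records(client, dataset_id,
                         where="offense_description like '%MARIJUANA%'")
```

Filtering in the `where` clause keeps the download small compared with fetching all of the roughly 5.3 million rows and filtering locally.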

In [70]:
import pandas as pd
import datetime
import csv_to_geojson

In [82]:
import json

In [18]:
summons_raw = pd.read_csv("~/data608_final/docs/scratch/NYPD_Criminal_Court_Summons__Historic_.csv")

In [19]:
summons_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5280675 entries, 0 to 5280674
Data columns (total 17 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SUMMONS_KEY            int64  
 1   SUMMONS_DATE           object 
 2   OFFENSE_DESCRIPTION    object 
 3   LAW_SECTION_NUMBER     object 
 4   LAW_DESCRIPTION        object 
 5   SUMMONS_CATEGORY_TYPE  object 
 6   AGE_GROUP              object 
 7   SEX                    object 
 8   RACE                   object 
 9   JURISDICTION_CODE      int64  
 10  BORO                   object 
 11  PRECINCT_OF_OCCUR      int64  
 12  X_COORDINATE_CD        float64
 13  Y_COORDINATE_CD        float64
 14  Latitude               float64
 15  Longitude              float64
 16  Lon_Lat                object 
dtypes: float64(4), int64(3), object(10)
memory usage: 684.9+ MB


In [108]:
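# Columns kept for the analysis
cols = ['SUMMONS_KEY','PRECINCT_OF_OCCUR','OFFENSE_DESCRIPTION', 'RACE', 'AGE_GROUP', 'SEX','Latitude', 'Longitude']

# Parse the summons date, then replace missing values with ''.
# Note: fillna('') also turns Latitude/Longitude into object columns (see df.info() below).
df = summons_raw[cols].assign(
    SUMMONS_DATE = pd.to_datetime(summons_raw['SUMMONS_DATE'], infer_datetime_format=True)
).fillna('')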

In [109]:
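# Boolean mask for marijuana-related offenses; built now because OFFENSE_DESCRIPTION is dropped in the next cell
cond = df['OFFENSE_DESCRIPTION'].str.contains("MARIJUANA")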

In [110]:
df = df.assign(
    YEAR = df['SUMMONS_DATE'].dt.year,
    MONTH = df['SUMMONS_DATE'].dt.month
).drop(['SUMMONS_DATE','OFFENSE_DESCRIPTION'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5280675 entries, 0 to 5280674
Data columns (total 9 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   SUMMONS_KEY        int64 
 1   PRECINCT_OF_OCCUR  int64 
 2   RACE               object
 3   AGE_GROUP          object
 4   SEX                object
 5   Latitude           object
 6   Longitude          object
 7   YEAR               int64 
 8   MONTH              int64 
dtypes: int64(4), object(5)
memory usage: 362.6+ MB


In [111]:
mj_summons_2006_2019 = df[cond]
mj_summons_2006_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180724 entries, 32 to 5280641
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   SUMMONS_KEY        180724 non-null  int64 
 1   PRECINCT_OF_OCCUR  180724 non-null  int64 
 2   RACE               180724 non-null  object
 3   AGE_GROUP          180724 non-null  object
 4   SEX                180724 non-null  object
 5   Latitude           180724 non-null  object
 6   Longitude          180724 non-null  object
 7   YEAR               180724 non-null  int64 
 8   MONTH              180724 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 13.8+ MB


In [112]:
mj_summons_2006_2019[['RACE', 'YEAR', 'SUMMONS_KEY']].pivot_table(index="YEAR", columns="RACE", aggfunc="count")

Count of SUMMONS_KEY by YEAR (rows) and RACE (columns); the first RACE column is the blank value:

| YEAR | (blank) | AMERICAN INDIAN/ALASKAN NATIVE | ASIAN / PACIFIC ISLANDER | BLACK | BLACK HISPANIC | HISPANIC | OTHER | UNKNOWN | WHITE | WHITE HISPANIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | | | | 33.0 | | 5.0 | 2.0 | 9575.0 | | |
| 2007 | 2.0 | | | 1.0 | | | | 9347.0 | | |
| 2008 | | | | | | | | 8910.0 | | |
| 2009 | | | | | | | | 8776.0 | | |
| 2010 | 2.0 | | | | | | | 8420.0 | | |
| 2011 | | | | 2.0 | | | | 8706.0 | 2.0 | 2.0 |
| 2012 | 1.0 | | | 1.0 | | | | 10798.0 | | |
| 2013 | 3.0 | | | 1.0 | | | | 13314.0 | | |
| 2014 | 4.0 | | | | 1.0 | | | 13378.0 | 1.0 | |
| 2015 | 12934.0 | | | 1.0 | | | | 4964.0 | 4.0 | 1.0 |


Because race is coded almost entirely as 'UNKNOWN' or left blank before 2017, and to keep the map easy to read, I will limit the analysis to the most recent year, 2019.
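The share of summonses with an unknown or blank race can be computed per year to check this directly. Below is a minimal sketch on a tiny made-up frame; the rows are illustrative, not taken from the dataset, and `unknown_share_by_year` is a hypothetical helper name.

```python
import pandas as pd


def unknown_share_by_year(frame, race_col="RACE", year_col="YEAR"):
    """Return, per year, the fraction of rows whose race is 'UNKNOWN' or blank."""
    is_unknown = frame[race_col].fillna("").isin(["UNKNOWN", ""])
    # The mean of a boolean mask is the fraction of True values in each group.
    return is_unknown.groupby(frame[year_col]).mean()


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015, 2015, 2019, 2019],
    "RACE": ["UNKNOWN", "", "UNKNOWN", "BLACK", "WHITE", "UNKNOWN"],
})
share = unknown_share_by_year(toy)
print(share)
```

On the real data the call would be `unknown_share_by_year(mj_summons_2006_2019)`; years where the share is close to 1 carry almost no usable race information.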

In [118]:
mj_summons_2019 = mj_summons_2006_2019[mj_summons_2006_2019['YEAR'] == 2019 ].drop(['YEAR'], axis=1)

In [119]:
mj_summons_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14951 entries, 1687966 to 1773877
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   SUMMONS_KEY        14951 non-null  int64 
 1   PRECINCT_OF_OCCUR  14951 non-null  int64 
 2   RACE               14951 non-null  object
 3   AGE_GROUP          14951 non-null  object
 4   SEX                14951 non-null  object
 5   Latitude           14951 non-null  object
 6   Longitude          14951 non-null  object
 7   MONTH              14951 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 1.0+ MB


In [120]:
# Found here: https://gis.stackexchange.com/questions/220997/pandas-to-geojson-multiples-points-features-with-python
# Adapted from: https://geoffboeing.com/2015/10/exporting-python-data-geojson/
def df_to_geojson(df, properties, lat='Latitude', lon='Longitude'):
    """Convert each row of df into a GeoJSON Point feature carrying the given property columns."""
    geojson = {'type':'FeatureCollection', 'features':[]}
    for _, row in df.iterrows():
        feature = {'type':'Feature',
                   'properties':{},
                   'geometry':{'type':'Point','coordinates':[]}}
        # GeoJSON order is [longitude, latitude, altitude]; altitude is fixed at 0
        feature['geometry']['coordinates'] = [row[lon],row[lat],0]
        for prop in properties:
            feature['properties'][prop] = row[prop]
        geojson['features'].append(feature)
    return geojson
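Because missing values were filled with `''` earlier, any row without coordinates would be written as a point with empty-string coordinates. Below is a hedged sketch of a variant that skips such rows. It works on plain dictionaries rather than a DataFrame, omits the altitude value, and `records_to_geojson` is a hypothetical name; the sample rows are made up.

```python
import json
import math


def records_to_geojson(records, properties, lat="Latitude", lon="Longitude"):
    """Build a FeatureCollection, skipping records without usable coordinates."""
    features = []
    for rec in records:
        try:
            x, y = float(rec[lon]), float(rec[lat])
        except (TypeError, ValueError, KeyError):
            continue  # blank string, None, or missing coordinate field
        if math.isnan(x) or math.isnan(y):
            continue  # NaN coordinates are not valid positions either
        features.append({
            "type": "Feature",
            "properties": {p: rec[p] for p in properties},
            "geometry": {"type": "Point", "coordinates": [x, y]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up rows, for illustration only.
rows = [
    {"Latitude": 40.7, "Longitude": -73.9, "RACE": "BLACK", "MONTH": 3},
    {"Latitude": "", "Longitude": "", "RACE": "WHITE", "MONTH": 5},
]
collection = records_to_geojson(rows, ["MONTH", "RACE"])
print(json.dumps(collection))
```

With the notebook's data, the input could come from `mj_summons_2019.to_dict('records')`. Depending on the pandas version, integer properties may need converting with `int()` before `json.dump`.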

In [121]:
# Columns to carry into each feature's properties
cols = ['MONTH', 'RACE', 'AGE_GROUP', 'SEX']

In [122]:
with open('scratch/summons_mj.geojson', 'w') as outfile:
    json.dump(df_to_geojson(mj_summons_2019, cols), outfile)
    