# Ankur Patel
## 11/1/2019
## p.ankur.715@gmail.com

# Arrest Data from 2010 to Present

This dataset reflects arrest incidents in the City of Los Angeles dating back to 2010. This data is transcribed from original arrest reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0.0000°, 0.0000°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://ewscripps.brightspotcdn.com/dims4/default/2bf40c2/2147483647/strip/true/crop/1100x619+0+2/resize/1280x720!/quality/90/?url=https%3A%2F%2Fmediaassets.ksby.com%2Fcordillera-network%2Fwp-content%2Fuploads%2Fsites%2F2%2F2018%2F11%2F04180116%2Farrest.jpg", width=600, height=200)

In [2]:
import pandas as pd

In [3]:
# read csv and view sample
df = pd.read_csv("Arrest_Data_from_2010_to_Present.csv")
df.head(100)

Unnamed: 0,Report ID,Arrest Date,Time,Area ID,Area Name,Reporting District,Age,Sex Code,Descent Code,Charge Group Code,Charge Group Description,Arrest Type Code,Charge,Charge Description,Address,Cross Street,Location
0,4248313,02/24/2015,1310.0,20,Olympic,2022,37,M,H,5.0,Burglary,F,459PC,BURGLARY,5TH,WILTON,"(34.0653, -118.314)"
1,191811472,05/03/2019,1700.0,18,Southeast,1802,23,F,B,,,M,653.22 PC,,91ST,FIGUEROA,"(33.9543, -118.2827)"
2,4254777,02/26/2015,2010.0,19,Mission,1985,22,M,H,6.0,Larceny,M,459.5PC,SHOPLIFTING,8300 VAN NUYS BL,,"(34.2216, -118.4488)"
3,5614161,04/29/2019,1040.0,8,West LA,842,41,M,H,3.0,Robbery,F,211PC,ROBBERY,11600 WILSHIRE BL,,"(34.0508, -118.4592)"
4,5615197,04/30/2019,615.0,6,Hollywood,663,27,M,O,5.0,Burglary,F,459PC,BURGLARY,LA BREA,LEXINGTON,"(34.0907, -118.3384)"
5,5615701,04/30/2019,1100.0,9,Van Nuys,901,2,F,H,,,D,300(B)WIC,,RAYMER,SEPULVEDA BL,"(34.2149, -118.4674)"
6,4256466,02/28/2015,1430.0,18,Southeast,1824,22,M,B,5.0,Burglary,F,459PC,BURGLARY,103RD,HICKORY,"(33.947, -118.2594)"
7,4256564,02/28/2015,1715.0,10,West Valley,1039,16,M,H,3.0,Robbery,F,211PC,ROBBERY,VANOWEN,WOODLEY,"(34.1939, -118.4836)"
8,5616892,05/01/2019,1710.0,12,77th Street,1245,28,F,B,3.0,Robbery,F,211PC,ROBBERY,VERMONT,FLORENCE,"(33.9746, -118.2918)"
9,5617003,05/01/2019,1010.0,16,Foothill,1601,39,M,H,,,O,21 841A1US,,FOOTHILL BL,PAXTON,"(34.2868, -118.4081)"


### Section 1:
For this challenge, you will be asked to answer questions based on arrest incidents data of the city of Los Angeles. Information of the data set can be found here and the download link is here. Each row in the data represents the booking of an arrestee. Only consider data prior to January 1, 2019. For some questions, we specify a given date range to consider.

In [4]:
# How many bookings of arrestees were made in 2018?

import datetime
df['Arrest Date'] = pd.to_datetime(df['Arrest Date'], errors='coerce', format='%m/%d/%Y')
df['Arrest Date'].dt.year.value_counts()

# counted the number of reocrds in each year

2012    163313
2010    162416
2011    157638
2013    152673
2014    139380
2015    126154
2016    118125
2017    107651
2018    104277
2019     78500
Name: Arrest Date, dtype: int64

In [6]:
# How many bookings of arrestees were made in the area with the most arrests in 2018?

from collections import Counter
df_2018 = df[df['Arrest Date'].dt.year == 2018]
most_common,num_most_common = Counter(df_2018["Area ID"]).most_common()[0]
print("ID:", most_common)
print("How many?", num_most_common)

# in 2018, found the most common Area ID and its count

ID: 1
How many? 10951


In [8]:
# What is the 95% quantile of the age of arrestee in 2018? Only consider the following charge groups for your analysis:
# Vehicle Theft
# Robbery
# Burglary
# Receive Stolen Property

import numpy as np
# df.groupby(df["Charge Group Description"])["Age"].quantile(q=0.95)
print("Vehicle Theft:", df[df["Charge Group Description"] == "Vehicle Theft"]["Age"].quantile(q=0.95))
print("Robbery:", df[df["Charge Group Description"] == "Robbery"]["Age"].quantile(q=0.95))
print("Burglary:", df[df["Charge Group Description"] == "Burglary"]["Age"].quantile(q=0.95))
print("Receive Stolen Property:", df[df["Charge Group Description"] == "Receive Stolen Property"]["Age"].quantile(q=0.95))
print("\nmean of combined:", np.mean([49,51,53,52]))  #done in next code line but added it under this question

# calculated 95% quantile for the 4 charge groups and a mean of those

Vehicle Theft: 49.0
Robbery: 51.0
Burglary: 53.0
Receive Stolen Property: 52.0

mean of combined: 51.25


In [9]:
# There are differences between the average age of an arrestee for the various charge groups. 
# Are these differences statistically significant? 
# For this question, calculate the Z-score of the average age for each charge group. 
# Report the largest absolute value among the calculated Z-scores.

# Only consider data for 2018
# Do not consider "Pre-Delinquency" and "Non-Criminal Detention" as these charge groups are reserved for minors
# Exclude any arrests where the charge group description is not known

from scipy.stats import zscore
df1 = df[df["Charge Group Description"] != "Pre-Delinquency"]
df1 = df[df["Charge Group Description"] != "Non-Criminal Detention"]
Z = [x for x in zscore(df.groupby(df1["Charge Group Description"])["Age"].mean())]
CGD = [x for x in df1["Charge Group Description"].unique() if str(x) != 'nan']
Zscores = pd.DataFrame({"Charge Group Description":CGD, "Z-score":Z})
Zscores[Z == max(Z)]

# calculated zscores of age for each group and printed largest

Unnamed: 0,Charge Group Description,Z-score
10,Disturbing the Peace,1.7148


In [None]:
# Felony arrest incidents have been dropping over the years. 
# Using a trend line (linear estimation) for the data from 2010 and 2018 (inclusive), 
# what is the projected number of felony arrests in 2019? Round to the nearest integer. 
# Note, the data set includes arrests for misdemeanor, felonies, etc.

from sklearn.linear_model import LinearRegression
df.dtypes
X = df[df['Arrest Date'].dt.year != 2019].value_counts()
target = df[df['Arrest Date'].dt.year == 2019].value_counts()
arr_LR = LinearRegression().fit(X, target)
arr_LR.predict(target)

# using linear regression on count of arrests from 2010-2018, finding way to predict arrests count for 2019