# Final Project

Group members:
- 21127667 - Trương Công Gia Phát
- 21127743 - Trần Thái Toàn

(Last updated: 8/12/2023)

## 1. Collecting data

### What subject is your data about?

The dataset is a breakdown of every arrest effected in NYC by the NYPD during the current year. 

### What is the source of your data? Do the authors of this data allow it to be used this way?

Our group found the data on Kaggle, and it is licensed under U.S. Government Works.

### How did the authors collect the data?

This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning.

## 2. Exploring data

### Importing libraries

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/NYPD_Arrest_Data__Year_to_Date_.csv')

### How many rows and how many columns?

In [3]:
shape = df.shape
print(f"The data has {shape[0]} rows and {shape[1]} columns.")

The data has 112571 rows and 19 columns.


### What is the meaning of each row?

Each row represents one arrest effected in NYC by the NYPD and includes information about the type of crime and the location and time of enforcement.

### Are there duplicated rows?

In [4]:
# Note: this checks for repeated index labels only. It does not compare
# row contents; df.duplicated() would be needed for that.
index = df.index
detect_dup_series = index.duplicated(keep='first')
num_duplicated_labels = detect_dup_series.sum()

if num_duplicated_labels == 0:
    print("The data has no duplicated index labels.")
else:
    ext = "labels" if num_duplicated_labels > 1 else "label"
    print(f"The data has {num_duplicated_labels} duplicated index {ext}. Please deduplicate your raw data.")

The data has no duplicated index labels.
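Checking `df.index.duplicated()` only detects repeated index labels, which a default `RangeIndex` never has. A minimal sketch of how duplicated rows and duplicated arrest keys could be counted instead is shown below. The small `toy` frame and its values are hypothetical stand-ins, not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the arrest data (hypothetical values, not from the real file).
toy = pd.DataFrame({
    "ARREST_KEY": [1, 2, 2, 3],
    "ARREST_BORO": ["K", "M", "M", "Q"],
})

# Rows whose full contents repeat an earlier row (the first occurrence is not flagged).
num_duplicated_rows = int(toy.duplicated(keep="first").sum())

# Arrest keys that repeat, even if the other columns were to differ.
num_duplicated_keys = int(toy["ARREST_KEY"].duplicated().sum())

# For comparison: the default index labels 0..3 are unique by construction.
num_duplicated_index = int(toy.index.duplicated().sum())

print(num_duplicated_rows, num_duplicated_keys, num_duplicated_index)
```

On the real data, the same two calls on `df` would answer the question directly. Since ARREST_KEY is described as a persistent ID for each arrest, the key check is probably the more meaningful one.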


### What is the meaning of each column?

- ARREST_KEY : Randomly generated persistent ID for each arrest
- ARREST_DATE : Exact date of arrest for the reported event
- PD_CD : Three-digit internal classification code (more granular than Key Code)
- PD_DESC : Description of internal classification corresponding with PD code (more granular than Offense Description)
- KY_CD : Three-digit internal classification code (more general category than PD code)
- OFNS_DESC : Description of internal classification corresponding with KY code (more general category than PD description)
- LAW_CODE : Law code charges corresponding to the NYS Penal Law, VTL and other various local laws
- LAW_CAT_CD : Level of offense: felony, misdemeanor, violation
- ARREST_BORO : Borough of arrest. B(Bronx), S(Staten Island), K(Brooklyn), M(Manhattan), Q(Queens)
- ARREST_PRECINCT : Precinct where the arrest occurred
- JURISDICTION_CODE : Jurisdiction responsible for arrest. Jurisdiction codes 0(Patrol), 1(Transit) and 2(Housing) represent NYPD, whilst codes 3 and above represent non-NYPD jurisdictions
- AGE_GROUP : Perpetrator’s age within a category
- PERP_SEX : Perpetrator’s sex description
- PERP_RACE : Perpetrator’s race description
- X_COORD_CD : Midblock X-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
- Y_COORD_CD : Midblock Y-coordinate for New York State Plane Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)
- Latitude : Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
- Longitude : Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
- New Georeferenced Column : Point location stored as text in the form POINT (longitude latitude)
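The coded columns described above can be made easier to read during exploration. The sketch below maps the borough codes to names and flags NYPD jurisdictions, using only the code meanings listed in the column descriptions. The `toy` frame, the code "X", and the new column names `BORO_NAME` and `IS_NYPD` are hypothetical, chosen for illustration.

```python
import pandas as pd

# Code-to-name mapping taken from the ARREST_BORO description.
BORO_NAMES = {
    "B": "Bronx",
    "S": "Staten Island",
    "K": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
}

# Toy rows (hypothetical); "X" stands in for an undocumented borough code.
toy = pd.DataFrame({
    "ARREST_BORO": ["K", "S", "Q", "X"],
    "JURISDICTION_CODE": [0, 2, 3, 97],
})

# Codes missing from the mapping become NaN, which makes them easy to spot.
toy["BORO_NAME"] = toy["ARREST_BORO"].map(BORO_NAMES)

# Codes 0 (Patrol), 1 (Transit) and 2 (Housing) represent the NYPD.
toy["IS_NYPD"] = toy["JURISDICTION_CODE"] <= 2

print(toy)
```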

### What is the current data type of each column? Are there columns having inappropriate data types?

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112571 entries, 0 to 112570
Data columns (total 19 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ARREST_KEY                112571 non-null  int64  
 1   ARREST_DATE               112571 non-null  object 
 2   PD_CD                     112110 non-null  float64
 3   PD_DESC                   112571 non-null  object 
 4   KY_CD                     112105 non-null  float64
 5   OFNS_DESC                 112571 non-null  object 
 6   LAW_CODE                  112571 non-null  object 
 7   LAW_CAT_CD                111725 non-null  object 
 8   ARREST_BORO               112571 non-null  object 
 9   ARREST_PRECINCT           112571 non-null  int64  
 10  JURISDICTION_CODE         112571 non-null  int64  
 11  AGE_GROUP                 112571 non-null  object 
 12  PERP_SEX                  112571 non-null  object 
 13  PERP_RACE                 112571 non-null  object 

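Most data types are suitable for exploration, with two caveats. ARREST_DATE is stored as object (strings such as '02/08/2023') and needs to be converted to a datetime type before any time-based analysis. PD_CD and KY_CD are classification codes but are stored as float64, because pandas falls back to floats for integer columns that contain missing values.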
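A minimal sketch of how the date and code columns could be converted is shown below. It assumes the dates follow the month/day/year pattern seen in the sample values, and it uses pandas' nullable `Int64` type so that missing codes stay missing. The `toy` frame is a stand-in, and its PD_CD values are hypothetical.

```python
import pandas as pd

# Toy stand-in: the dates follow the MM/DD/YYYY pattern seen in the data,
# the PD_CD values are hypothetical.
toy = pd.DataFrame({
    "ARREST_DATE": ["02/08/2023", "03/24/2023", "03/28/2023"],
    "PD_CD": [397.0, None, 157.0],
})

# Parse the strings into real datetimes using an explicit format.
toy["ARREST_DATE"] = pd.to_datetime(toy["ARREST_DATE"], format="%m/%d/%Y")

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>.
toy["PD_CD"] = toy["PD_CD"].astype("Int64")

print(toy.dtypes)
```

With a datetime column, accessors such as `toy["ARREST_DATE"].dt.month` become available for grouping arrests by month.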

### With each numerical column, how are values distributed?

In [6]:
num_col_info_df = df.select_dtypes(exclude='object')
def missing_ratio(s):
    return (s.isna().mean() * 100).round(1)


num_col_info_df = num_col_info_df.agg([missing_ratio, "min", "max"])
num_col_info_df

| | ARREST_KEY | PD_CD | KY_CD | ARREST_PRECINCT | JURISDICTION_CODE | X_COORD_CD | Y_COORD_CD | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| missing_ratio | 0.0 | 0.4 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| min | 261180920.0 | 12.0 | 101.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -74.251844 |
| max | 270661337.0 | 997.0 | 995.0 | 123.0 | 97.0 | 1066940.0 | 271819.0 | 40.912714 | 0.0 |
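In the summary above, Latitude has a minimum of 0 and Longitude has a maximum of 0. Neither is a location in New York City, so these are probably placeholders for missing coordinates. This is an interpretation of the table, not something the data source states here. The sketch below shows one way such rows could be counted and marked as missing, using a hypothetical `toy` frame.

```python
import pandas as pd

# Toy stand-in with one placeholder row (values are illustrative).
toy = pd.DataFrame({
    "Latitude": [40.597407, 0.0, 40.743481],
    "Longitude": [-73.979638, 0.0, -73.874004],
})

# Flag rows where either coordinate is exactly zero.
zero_coords = (toy["Latitude"] == 0) | (toy["Longitude"] == 0)
num_zero = int(zero_coords.sum())
print(f"{num_zero} row(s) with zero coordinates")

# Treat the placeholders as missing so they do not distort min/max or maps.
toy.loc[zero_coords, ["Latitude", "Longitude"]] = float("nan")
print(toy.agg(["min", "max"]))
```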


### With each categorical column, how are values distributed?

In [7]:
cate_col_info_df = df.select_dtypes(include='object')
cate_col_info_df = cate_col_info_df.agg([missing_ratio])
cate_col_info_df

| | ARREST_DATE | PD_DESC | OFNS_DESC | LAW_CODE | LAW_CAT_CD | ARREST_BORO | AGE_GROUP | PERP_SEX | PERP_RACE | New Georeferenced Column |
|---|---|---|---|---|---|---|---|---|---|---|
| missing_ratio | 0.0 | 0.0 | 0.0 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [9]:
for col in cate_col_info_df.columns:
    print(f"3 unique values in column {col}:  {df[col].unique()[0:3]}")

3 unique values in column ARREST_DATE:  ['02/08/2023' '03/24/2023' '03/28/2023']
3 unique values in column PD_DESC:  ['ROBBERY,CAR JACKING' 'RAPE 2' 'RAPE 1']
3 unique values in column OFNS_DESC:  ['ROBBERY' 'RAPE' 'FELONY ASSAULT']
3 unique values in column LAW_CODE:  ['PL 1601003' 'PL 1303001' 'PL 1303501']
3 unique values in column LAW_CAT_CD:  ['F' '9' 'M']
3 unique values in column ARREST_BORO:  ['K' 'S' 'Q']
3 unique values in column AGE_GROUP:  ['25-44' '18-24' '<18']
3 unique values in column PERP_SEX:  ['F' 'M' 'U']
3 unique values in column PERP_RACE:  ['WHITE' 'BLACK' 'WHITE HISPANIC']
3 unique values in column New Georeferenced Column:  ['POINT (-73.979638 40.597407)'
 'POINT (-74.0770327198983 40.6447209438691)'
 'POINT (-73.8740035373971 40.7434812638841)']


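Most values in the categorical columns look normal. Two things stand out: LAW_CAT_CD contains the value '9', which is not one of the documented offense levels (felony, misdemeanor, violation), and PERP_SEX contains 'U' alongside 'F' and 'M', which presumably stands for unknown.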
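Looking at only three unique values per column gives a limited view of the distribution. A sketch of a fuller summary is shown below: the number of distinct values, the most common value, and any values outside an expected set. The `toy` frame is hypothetical, and the expected set `{"F", "M", "V"}` is an assumption based on the documented levels (felony, misdemeanor, violation); only 'F' and 'M' are confirmed by the output above.

```python
import pandas as pd

# Toy stand-in for two categorical columns (hypothetical rows).
toy = pd.DataFrame({
    "LAW_CAT_CD": ["F", "9", "M", "M", None],
    "PERP_SEX": ["F", "M", "U", "M", "M"],
})

# Per-column summary: distinct values (NaN excluded) and the most common value.
summary = pd.DataFrame({
    "num_unique": toy.nunique(),
    "most_common": toy.mode().iloc[0],
})
print(summary)

# Full frequency table for one column, keeping missing values visible.
law_cat_counts = toy["LAW_CAT_CD"].value_counts(dropna=False)
print(law_cat_counts)

# Values outside the assumed set of valid offense levels.
EXPECTED_LAW_CAT = {"F", "M", "V"}  # assumption, see lead-in
unexpected = set(toy["LAW_CAT_CD"].dropna()) - EXPECTED_LAW_CAT
print(unexpected)
```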

## 3. Data Exploration