<a href="https://colab.research.google.com/github/deybyr647/nyc-crime-data-across-the-years/blob/master/Analysis_of_NYC_Crime_Data_from_2010_to_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of NYC Crime Data from 2010 to 2019

## Project Description
---
As part of an independent project, I analyzed overall crime data from New York City. The analysis aims to answer questions such as the following:



*   *Does crime increase or decrease over time?*
*   *In what year of the given time scope were the most crimes committed?*
*   *And more...*





---

**NYC Crime Data**: [Source](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i)

**NYC Population Data**: [Source](https://data.cityofnewyork.us/City-Government/New-York-City-Population-by-Borough-1950-2040/xywu-7bv9)

The CSV for the NYC crime data is a very large file, since it includes records going back to 2006 and is updated regularly.

I included a population data set to help answer crime-population questions and to make certain calculations, such as crime rates.

The limitation of the population data set is that it only has one figure per decade. For the purposes of the project, I use the 2020 population figure for every year.
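
If a closer per-year estimate were wanted, one option would be to interpolate linearly between the two surrounding decade figures. The sketch below is only an illustration of that idea and is not what this project does: the helper `interpolate_population` is hypothetical, and the population numbers in the example are made up rather than taken from the data set.

```python
def interpolate_population(pop_by_decade, year):
    """Linearly interpolate a population between the two surrounding decade figures."""
    years = sorted(pop_by_decade)
    if year < years[0] or year > years[-1]:
        raise ValueError(f'{year} is outside the range {years[0]}-{years[-1]}')
    if year in pop_by_decade:
        return pop_by_decade[year]

    # Closest decade figures on either side of the requested year
    lower = max(y for y in years if y < year)
    upper = min(y for y in years if y > year)

    fraction = (year - lower) / (upper - lower)
    growth = pop_by_decade[upper] - pop_by_decade[lower]
    return round(pop_by_decade[lower] + fraction * growth)


# Made-up figures, for illustration only
example_population = {2010: 1_000_000, 2020: 1_100_000}
estimate_2015 = interpolate_population(example_population, 2015)
```

Using a single fixed population keeps the rate calculation simple, at the cost of making the rates for earlier years approximate.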

I put some emphasis on 2019 in this project, since that year had just ended and its data was the most recent available for inspection.

Huge thanks to **Coach Krista** & **Coach Shaquan**, for their support, learning material & suggestions.

---

## Initial Data Import, Environment Set Up & More
---

In [None]:
import pandas as pd

populationData = 'https://raw.githubusercontent.com/deybyr647/nyc-crime-data-across-the-years/master/New_York_City_Population_by_Borough__1950_-_2040.csv'

#Returns a DataFrame read from a CSV at the given path or URL
def getCSV(inputFile):
  #inputFile can be a URL or a local file path (string)
  return pd.read_csv(inputFile)


#Returns a DataFrame of NYC Crime Data for a given year. Credits to Coach Krista
def getCSV_byYear(year):
  #The data for each year is split across 12 CSV files, numbered 1 to 12
  parts = []
  for i in range(1, 13):
    src = f'https://raw.githubusercontent.com/kristakohler/code_next_data_science_club_2020/master/crimedata_{year}-{i}.csv'
    parts.append(pd.read_csv(src))
  #DataFrame.append() was removed in pandas 2.0, so combine the parts with pd.concat()
  return pd.concat(parts, ignore_index=True)


#getPopulation() returns the population for a given region for a given year by
#reading into a provided population dataset
def getPopulation(region, year):
  populationDF = getCSV(populationData)
  #This code assumes the regions appear in this order as rows 0-5 of the population dataset
  regionList = ['NYC Total', 'Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
  y = str(year)

  #Fail early on an unknown region instead of continuing with an undefined index
  if region not in regionList:
    raise ValueError(f'Enter a valid region/borough, one of: {regionList}')

  regionIndex = regionList.index(region)

  #Select a single cell (row label, column name) and convert it to an int
  return int(populationDF.loc[regionIndex, y])
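
Every function below calls `getCSV_byYear()` again, so the same files get downloaded many times. A minimal sketch of one way to avoid that is to wrap the loader in a cache. The helper `make_cached_loader` and the stand-in `fake_loader` are hypothetical names used only for illustration; in the notebook, the function passed in would be `getCSV_byYear`.

```python
from functools import lru_cache

import pandas as pd


def make_cached_loader(load_year):
    """Wrap a year -> DataFrame loader so each year is loaded at most once."""

    @lru_cache(maxsize=None)
    def _load(year):
        return load_year(year)

    def cached(year):
        # Hand back a copy so callers cannot change the cached DataFrame by accident
        return _load(year).copy()

    return cached


# Stand-in loader for illustration only (no network access)
calls = []

def fake_loader(year):
    calls.append(year)
    return pd.DataFrame({'CMPLNT_NUM': [1, 2, 3], 'YEAR': [year] * 3})


load = make_cached_loader(fake_loader)
first = load(2019)
second = load(2019)  # served from the cache, fake_loader is not called again
```

With something like `getCSV_byYear_cached = make_cached_loader(getCSV_byYear)`, repeated requests for the same year would reuse the first download, which should also reduce the chance of connection errors on long runs.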

### Data Preview
---

In [None]:
#Swap which of the last 2 lines is commented out to preview the other dataset

crimeDF = getCSV_byYear(2019)
populationDF = getCSV(populationData)

crimeDF.head()
#populationDF.head(6)

Unnamed: 0.1,Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,ADDR_PCT_CD,OFNS_DESC,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,PREM_TYP_DESC,JURIS_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
0,3019158,856239246,2019-01-01,00:01:00,107.0,CRIMINAL MISCHIEF & RELATED OF,"CRIMINAL MISCHIEF 4TH, GRAFFIT",COMPLETED,MISDEMEANOR,QUEENS,STREET,N.Y. POLICE DEPT,,,,UNKNOWN,UNKNOWN,E
1,3020207,747718390,2019-01-01,08:00:00,45.0,CRIMINAL MISCHIEF & RELATED OF,"CRIMINAL MISCHIEF 4TH, GRAFFIT",COMPLETED,MISDEMEANOR,BRONX,STREET,N.Y. POLICE DEPT,,,,UNKNOWN,UNKNOWN,E
2,3020309,964861740,2019-01-01,00:01:00,41.0,SEX CRIMES,"SEXUAL ABUSE 3,2",COMPLETED,MISDEMEANOR,BRONX,RESIDENCE - APT. HOUSE,N.Y. POLICE DEPT,<18,UNKNOWN,M,<18,WHITE HISPANIC,F
3,3022621,373699187,2019-01-01,00:00:00,83.0,CRIMINAL MISCHIEF & RELATED OF,"CRIMINAL MISCHIEF 4TH, GRAFFIT",COMPLETED,MISDEMEANOR,BROOKLYN,OTHER,N.Y. POLICE DEPT,,,,UNKNOWN,UNKNOWN,E
4,3022913,622676807,2019-01-01,13:00:00,101.0,ASSAULT 3 & RELATED OFFENSES,ASSAULT 3,COMPLETED,MISDEMEANOR,QUEENS,STREET,N.Y. POLICE DEPT,25-44,BLACK,M,25-44,BLACK,F


## Data Based Questions
---

### *In what year of the given time scope were the most crimes committed?*

In [None]:
#getAmt() should work with any combination of years in a list, whether the years are in order or not
#Returns yearly crime amounts as a list of dictionaries (e.g. [{'Year' : year, 'Crime Amount' : amount}])
#Also returns a statement which tells which year had the highest number of crimes

def getAmt(scopeArr):
  crimeAmts_perYear = list()
  crimeAmts_dictArr = list()

  #Empty string, to separate dictionary array and output text
  newline = ''

  for year in scopeArr:
    crimeDF = getCSV_byYear(year)
    crimeAmt = len(crimeDF.index)

    amtsDict = {'Year' : f'{year}',
                'Crime Amount' : f'{crimeAmt}'}

    crimeAmts_perYear.append(crimeAmt)
    crimeAmts_dictArr.append(amtsDict)

  #print(f'{crimeAmts_dictArr} \n')
  #Position of the largest yearly amount within scopeArr
  maxIndex = crimeAmts_perYear.index(max(crimeAmts_perYear))

  #Look the year up in scopeArr (not the global timeScope) so that any list of years works
  out = f'New York City had the most crimes in {scopeArr[maxIndex]}, with {max(crimeAmts_perYear)} crimes'

  return crimeAmts_dictArr, newline, out


timeScope = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

getAmt(timeScope)

([{'Crime Amount': '510236', 'Year': '2010'},
  {'Crime Amount': '498734', 'Year': '2011'},
  {'Crime Amount': '504840', 'Year': '2012'},
  {'Crime Amount': '495939', 'Year': '2013'},
  {'Crime Amount': '492150', 'Year': '2014'},
  {'Crime Amount': '478777', 'Year': '2015'},
  {'Crime Amount': '478364', 'Year': '2016'},
  {'Crime Amount': '468238', 'Year': '2017'},
  {'Crime Amount': '469212', 'Year': '2018'},
  {'Crime Amount': '788159', 'Year': '2019'}],
 '',
 'New York City had the most crimes in 2019, with 788159 crimes')
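
The yearly amounts in the output above can also be used to look at how crime changes from one year to the next, without downloading anything again. This is a small sketch that copies those amounts into a pandas Series; the variable names are my own and are not used elsewhere in the notebook.

```python
import pandas as pd

# Yearly crime amounts copied from the getAmt() output above
counts = pd.Series({
    2010: 510236, 2011: 498734, 2012: 504840, 2013: 495939, 2014: 492150,
    2015: 478777, 2016: 478364, 2017: 468238, 2018: 469212, 2019: 788159,
}, name='Crime Amount')

# Year with the largest amount
peak_year = counts.idxmax()

# Percent change compared with the previous year (the first year has no previous year, so it is NaN)
pct_change = (counts.pct_change() * 100).round(1)

print(f'Peak year: {peak_year} ({counts[peak_year]} crimes)')
print(pct_change)
```

A negative value means fewer crimes were recorded than in the previous year, and a positive value means more.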

---
### *During the given time scope, does crime rate increase or decrease?*
<br>
$$\text{Rate} = \frac{\text{Crimes}}{\text{Population}} \times 100{,}000$$

In [None]:
#getRates() returns yearly crime rates for NYC as a list of dictionaries (e.g. [{'Year' : year, 'Crime Rate' : rate}])
#Can work with any combination of years in a list, whether the years are in order or not

#Returned crime rates are rates per 100,000 people (e.g. in 2015, 5599.1 crimes were committed per 100,000 general population)
#For the project's purposes, rates are close estimates, as we're using the population of 2020 in all calculations

def getRates(scopeArr, population):
  crimeRates = list()

  for year in scopeArr:
    crimeDF = getCSV_byYear(year)
    crimeAmt = len(crimeDF.index)

    rawRate = (crimeAmt / population) * 100000
    cleanRate = round(rawRate, 1)

    rateDict = {'Year' : f'{year}',
                'Crime Rate' : f'{cleanRate}'}
    crimeRates.append(rateDict)

  return crimeRates

timeScope = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]  
p = getPopulation('NYC Total', 2020)

getRates(timeScope, p)

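
Since the yearly crime amounts were already computed in the previous question, the rates can also be worked out directly from those amounts with the formula above, which avoids downloading the data again. The sketch below is an illustration only: `rates_from_counts` is a hypothetical helper, and `assumed_population` is an assumed value for the 2020 NYC total that should be replaced by whatever `getPopulation('NYC Total', 2020)` returns.

```python
def rates_from_counts(counts_by_year, population, per=100_000):
    """Turn yearly crime amounts into rates per `per` people, rounded to 1 decimal."""
    return {year: round(count / population * per, 1)
            for year, count in counts_by_year.items()}


# Crime amounts for 2018 and 2019, copied from the getAmt() output earlier in the notebook
counts_by_year = {2018: 469212, 2019: 788159}

# ASSUMPTION: stand-in for the 2020 NYC total population, not read from the dataset here
assumed_population = 8_550_971

rates = rates_from_counts(counts_by_year, assumed_population)
print(rates)
```

Because the same population is used for every year, the rates rise and fall exactly in step with the raw amounts.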

---
### *Does region population play a role in crime rates & crime data?*
<br>
$$\text{Rate} = \frac{\text{Crimes}}{\text{Population}} \times 100{,}000$$

In [None]:
#getRate_byBorough() returns yearly crime rates for a specified borough as a list of dictionaries
#E.g. [{'Borough', 'Year', 'Crime Amount', 'Crime Rate'}]. Very similar to getRates()
#Can work with any combination of years in a list
#The population argument should be the population of the same borough

def getRate_byBorough(borough, scopeArr, population):
  boroughRates = list()
  boroughList = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']

  #Check the borough name before downloading any data
  if borough not in boroughList:
    print('Please try a valid borough name, with all capital letters')
    return boroughRates

  for year in scopeArr:
    crimeDF = getCSV_byYear(year)

    #Count the complaints recorded in the given borough
    crimeAmt = int(crimeDF.loc[crimeDF['BORO_NM'] == borough, 'CMPLNT_NUM'].count())

    rawRate = (crimeAmt / population) * 100000
    cleanRate = round(rawRate, 1)

    outputDict = {'Borough' : f'{borough}',
                  'Year' : f'{year}',
                  'Crime Amount' : f'{crimeAmt}',
                  'Crime Rate' : f'{cleanRate}'}

    boroughRates.append(outputDict)

  return boroughRates

p = getPopulation('Manhattan', 2020)
timeScope = [2019]

getRate_byBorough('MANHATTAN', timeScope, p)

[{'Borough': 'MANHATTAN',
  'Crime Amount': '198249',
  'Crime Rate': '12101.0',
  'Year': '2019'}]
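
To compare boroughs side by side, the same calculation could be done for all boroughs at once from a single year's DataFrame. This is a minimal sketch under stated assumptions: `rates_by_borough` is a hypothetical helper, the keys of the population dictionary must match the upper-case names in the `BORO_NM` column, and the toy data below is made up for illustration.

```python
import pandas as pd


def rates_by_borough(crime_df, population_by_borough, per=100_000):
    """Crime rate per `per` people for each borough, highest rate first."""
    counts = crime_df['BORO_NM'].value_counts()
    populations = pd.Series(population_by_borough)

    # Division lines up boroughs by name; boroughs missing on either side become NaN
    rates = (counts / populations * per).round(1)
    return rates.dropna().sort_values(ascending=False)


# Made-up data, for illustration only
toy_crimes = pd.DataFrame({'BORO_NM': ['BRONX', 'BRONX', 'BRONX', 'QUEENS', 'QUEENS']})
toy_population = {'BRONX': 1000, 'QUEENS': 2000}

toy_rates = rates_by_borough(toy_crimes, toy_population)
print(toy_rates)
```

In the notebook, `crime_df` would be the result of `getCSV_byYear(2019)` and the populations would come from `getPopulation()` for each borough.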

---
### *What have been the most common types of crimes committed in a given year?*

In [None]:
#No complex code here! Just change the year in the first line & data for that year should be given
#Code will show the top 10 types of crimes committed in the given year

crimeDF = getCSV_byYear(2019)

groupedData = crimeDF.groupby(['OFNS_DESC']).count()

groupedData.sort_values(by=['CMPLNT_NUM'], ascending=False).head(10)

OFNS_DESC,Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,ADDR_PCT_CD,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,PREM_TYP_DESC,JURIS_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
PETIT LARCENY,152056,152056,152056,152056,152056,152056,152056,152056,152056,151500,152056,99364,99364,99364,152053,152053,152053
HARRASSMENT 2,123982,123982,123982,123982,123982,123982,123982,123982,123981,123565,123982,117109,117109,117109,123982,123982,123982
ASSAULT 3 & RELATED OFFENSES,92619,92619,92619,92619,92619,92619,92619,92619,92618,92489,92619,86467,86467,86467,92619,92619,92619
CRIMINAL MISCHIEF & RELATED OF,81065,81065,81065,81065,81065,81065,81065,81065,81065,80851,81065,44374,44374,44374,81065,81065,81065
GRAND LARCENY,70572,70572,70572,70572,70572,70572,70572,70572,70568,70123,70572,39838,39838,39838,70572,70572,70572
FELONY ASSAULT,35993,35993,35993,35993,35993,35993,35993,35993,35992,35953,35993,32890,32890,32890,35993,35993,35993
OFF. AGNST PUB ORD SENSBLTY &,33261,33261,33261,33261,33261,33261,33261,33261,33261,32968,33261,30282,30282,30282,33261,33261,33261
MISCELLANEOUS PENAL LAW,24699,24699,24699,24699,24699,24699,24699,24699,24699,24601,24699,22444,22444,22444,24699,24699,24699
DANGEROUS DRUGS,23617,23617,23617,23617,23617,23617,23617,23617,23617,23603,23617,19127,19127,19127,23617,23617,23617
ROBBERY,22859,22859,22859,22859,22859,22859,22859,22859,22859,22814,22859,21906,21906,21906,22859,22859,22859
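
The table above repeats the same count in almost every column, because `groupby(...).count()` counts the non-empty values of each column separately. If only the number of crimes per type is needed, a shorter alternative is `value_counts()` on the `OFNS_DESC` column. The helper name `top_offenses` and the toy data are my own, used here only for illustration.

```python
import pandas as pd


def top_offenses(crime_df, n=5):
    """Return the n most common offense types with their counts, most common first."""
    return crime_df['OFNS_DESC'].value_counts().head(n)


# Made-up data, for illustration only
toy = pd.DataFrame({
    'OFNS_DESC': ['PETIT LARCENY'] * 3 + ['ROBBERY'] * 2 + ['GRAND LARCENY']
})

top2 = top_offenses(toy, n=2)
print(top2)
```

In the notebook, `top_offenses(crimeDF, 10)` would give one count per offense type instead of a full table.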


---
### *In what borough have the most crimes taken place in a given year?*

In [None]:
#Just like the last cell, there's no complex stuff here!
# Change the year & run this cell
# Result should be the boroughs ranked by amounts of crimes committed in them, from most to least

crimeDF = getCSV_byYear(2019)

groupedData = crimeDF.groupby(['BORO_NM']).count()

groupedData.sort_values(by=['CMPLNT_NUM'], ascending=False).head()

BORO_NM,Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,ADDR_PCT_CD,OFNS_DESC,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,PREM_TYP_DESC,JURIS_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
BROOKLYN,226484,226484,226484,226484,226484,226484,226484,226484,226484,225952,226484,167843,167843,167843,226484,226484,226484
MANHATTAN,198249,198249,198249,198249,198249,198241,198249,198249,198249,197030,198249,158126,158126,158126,198247,198247,198247
BRONX,173175,173175,173175,173175,173175,173171,173175,173175,173175,172798,173175,137019,137019,137019,173175,173175,173175
QUEENS,157292,157292,157292,157292,157292,157290,157292,157292,157292,156910,157292,118097,118097,118097,157291,157291,157291
STATEN ISLAND,32443,32443,32443,32443,32443,32443,32443,32443,32443,32344,32443,22379,22379,22379,32443,32443,32443


---
## Conclusion
---

After some thorough analysis of the data, I came to the following conclusions:



*   In New York City, the most crimes were committed in 2019, when 788,159 crimes were documented by the NYPD.

*   In the given time scope, the crime rate changes little through 2018, drifting slightly downward, and then jumps from 5487.2 crimes per 100,000 general population in 2018 to 9217.2 crimes per 100,000 general population in 2019. See graph.

*   Population is part of the crime rate calculation, and it also affects how the raw crime amounts should be read. As suggested by Coach Krista, where there are more people, there tend to be more crimes. Brooklyn is the most populated NYC borough, and it almost always comes out on top in terms of crimes committed. A common mistake would be to assume that Brooklyn therefore also has the highest crime rate. Once population is taken into account, Brooklyn's 2019 rate (8551.6 per 100,000) is well below that of Manhattan (12101.0) or the Bronx (11969.6).

*   Crime Rates per Borough (per 100,000 general population), 2019:

  * Bronx: 11969.6
  * Brooklyn: 8551.6
  * Manhattan: 12101.0
  * Queens: 6749.9
  * Staten Island: 6659.7

*   For the year of 2019, the most common crimes were as follows:

  *   Petit Larceny (152,056 crimes)
  *   Harrassment 2 (123,982 crimes)
  *   Assault 3 & Related Offenses (92,619 crimes)
  *   Criminal Mischief & Related Of (81,065 crimes)
  *   Grand Larceny (70,572 crimes)

*   For the year of 2019, the most crimes took place in the borough of Brooklyn, with 226,484 crimes documented by the NYPD.  
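
Because the 2019 total is so much larger than the totals for the other years, it may be worth checking whether the 2019 files contain repeated rows before reading too much into the jump. The sketch below shows one possible sanity check and is not a claim that duplicates exist. It assumes that `CMPLNT_NUM` is meant to identify each complaint uniquely, and the helper name `duplicate_summary` and the toy data are hypothetical.

```python
import pandas as pd


def duplicate_summary(crime_df, key='CMPLNT_NUM'):
    """Compare the number of rows with the number of distinct values in the key column."""
    total_rows = len(crime_df)
    unique_ids = crime_df[key].nunique()  # missing values are not counted
    return {'rows': total_rows,
            'unique_ids': unique_ids,
            'duplicate_rows': total_rows - unique_ids}


# Made-up data, for illustration only: complaint 102 appears twice
toy = pd.DataFrame({'CMPLNT_NUM': [101, 102, 102, 103]})

summary = duplicate_summary(toy)
print(summary)
```

Running `duplicate_summary(getCSV_byYear(2019))` and comparing it with the result for another year would show whether the two years differ in this respect.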



