<h1>Capstone Project - The Battle of the Neighborhoods</h1>

<h2>Problem Description and Background</h2>
In a globally connected world, people from all over the world move to its economic hubs. Chicago is the third-largest city in the US by population and a major destination for migrants seeking employment.

While big cities provide enormous opportunities for job seekers and migrants, they also harbour crime. A person planning to move to a city will want to compare the safety of its neighbourhoods before deciding where to live. The objective of this project is to analyse the available crime statistics for the city of Chicago for a particular year and to highlight the safety of its neighbourhoods relative to each other.


<h1>Data Description</h1>

We need two sets of data: one providing the crimes recorded per neighbourhood of Chicago, and another listing the district names and district codes of Chicago. The data is extracted from these sources:
1. Crime Data : https://data.cityofchicago.org/api/views/vwwp-7yr9/rows.csv?accessType=DOWNLOAD 
2. District Information: https://en.wikipedia.org/wiki/Community_areas_in_Chicago

<h1>Methodology</h1>
The following steps fetch, parse, and merge the data:

* Fetch the crime data for Chicago for a year (2015).
* Parse the data for rows and columns that are meaningful for the analysis
* Use the BeautifulSoup library
  * Scrape the source URL to extract district names and neighbourhoods.
* Create a pandas DataFrame for the district names and neighbourhoods
* Filter the crime data for very specific crime categories
  * We have filtered only the following categories of crimes:
    * ARSON
    * ASSAULT
    * BATTERY
    * BURGLARY
    * TRESPASS
    * SEX OFFENSE
* Pivot the crime data for summary results for each district
  * The summary is the number of major crimes as a percentage of the total crimes reported
* Merge the district crime results with the district names and neighbourhood dataframe
* Sort the report by crime % to get a list of the neighbourhoods with the lowest share of major crimes.

Once we have created the final data frame, sorted in ascending order of major crimes as a percentage of total crime, we get an overall picture of the safest localities of the City of Chicago. This helps a prospective home seeker rate a neighbourhood's safety relative to the other neighbourhoods.
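
The steps above can be sketched end to end on a few made-up rows. This is only an illustration under stated assumptions: `toy_crimes`, `toy_names`, and the "Area A"/"Area B" labels are hypothetical stand-ins for the real crime data and the scraped district table, and `groupby` is used for counting in place of the pivot table used later in the notebook.

```python
import pandas as pd

# Hypothetical toy records standing in for the 2015 crime data.
toy_crimes = pd.DataFrame({
    "District": [1, 1, 1, 1, 2, 2, 2, 2],
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "NARCOTICS",
                     "ASSAULT", "BURGLARY", "THEFT", "BATTERY"],
})

# Hypothetical lookup table standing in for the scraped district names.
toy_names = pd.DataFrame({
    "DistCode": [1, 2],
    "Community": ["Area A", "Area B"],
})

major = ["ARSON", "ASSAULT", "BATTERY", "BURGLARY", "TRESPASS", "SEX OFFENSE"]

# Count all crimes and major crimes per district.
total = toy_crimes.groupby("District").size().rename("TotalCrime")
is_major = toy_crimes["Primary Type"].str.contains("|".join(major))
crime = toy_crimes[is_major].groupby("District").size().rename("Crime")

# Combine the counts and compute major crimes as a percentage of the total.
summary = pd.concat([crime, total], axis=1).fillna(0).reset_index()
summary["crime_pct"] = (summary["Crime"] / summary["TotalCrime"] * 100).round(2)

# Attach the names and sort so the lowest share of major crimes comes first.
report = (toy_names
          .merge(summary, left_on="DistCode", right_on="District", how="left")
          .drop(columns="District")
          .sort_values("crime_pct"))
print(report)
```

In this toy data, one of four crimes in district 1 and three of four in district 2 fall into the major categories, so "Area A" is listed first.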


<h3>Import the libraries</h3>

In [1]:
import pandas as pd
import numpy as np


<h3>Fetch the crime data for the year 2015</h3>

In [2]:
crime2015_df = pd.read_csv('https://data.cityofchicago.org/api/views/vwwp-7yr9/rows.csv?accessType=DOWNLOAD')
crime2015_df = crime2015_df[crime2015_df['Latitude'].notna()]
crime2015_df = crime2015_df[crime2015_df['Longitude'].notna()]
crime2015_df.shape

(257893, 22)
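
The two `notna()` filters above can also be written as a single `dropna` call. A minimal sketch on hypothetical rows (the `toy` frame below is made up for illustration; the real data comes from the CSV download):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of them are missing one coordinate each.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Latitude": [41.93, np.nan, 41.78, 41.92],
    "Longitude": [-87.73, -87.71, np.nan, -87.70],
})

# Drop every row that is missing either coordinate in one call.
located = toy.dropna(subset=["Latitude", "Longitude"])
print(located.shape)
```

Only the rows with both coordinates present remain, which matches the effect of chaining the two filters.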

<h3>Drop the columns that are not important</h3>

In [3]:
crime2015_df = crime2015_df[['ID', 'Case Number', 'Block', 'Primary Type', 'Location Description', 
                             'Arrest', 'District', 'Ward', 'Latitude', 'Longitude']]
crime2015_df.shape

(257893, 10)

In [4]:
crime2015_df.head()


Unnamed: 0,ID,Case Number,Block,Primary Type,Location Description,Arrest,District,Ward,Latitude,Longitude
3,10310586,HY499294,044XX W DIVERSEY AVE,BATTERY,SIDEWALK,False,25,31.0,41.931638,-87.738583
8,10004094,HY193838,035XX W FULLERTON AVE,BATTERY,SIDEWALK,False,14,35.0,41.924629,-87.714759
11,10224250,HY410830,060XX S RICHMOND ST,BATTERY,ALLEY,True,8,16.0,41.783865,-87.697183
13,10361286,HY552885,023XX N SAWYER AVE,CRIM SEXUAL ASSAULT,APARTMENT,False,14,32.0,41.923971,-87.709136
15,10326578,HY517169,073XX S KEDZIE AVE,OFFENSE INVOLVING CHILDREN,APARTMENT,True,8,18.0,41.760184,-87.702505


In [5]:
crime2015_df['District'].value_counts()

11    18631
8     16936
6     15767
4     15556
7     15074
25    14747
3     12783
9     12410
12    12052
1     11762
10    11523
15    11467
19    11327
5     11153
18    11131
2     10485
16     9114
14     8774
22     8576
17     7576
24     6873
20     4167
31        9
Name: District, dtype: int64

<h3>Use Beautiful Soup to extract District names and neighbourhoods</h3>

In [6]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup
import urllib.request
url = "https://en.wikipedia.org/wiki/Community_areas_in_Chicago"
page = urllib.request.urlopen(url)

# parse the HTML page
soup = BeautifulSoup(page, "lxml")

DistCode = []
Community = []
Neighbourhood = []
for table in soup.find_all('table', {"class": "wikitable"}):

    # Cells cycle through three columns: code, community, neighbourhoods
    counter = 0
    # Extract individual cells and populate the lists
    for cell in table.find_all('td'):
        text = cell.text.strip()
        if counter == 0:
            DistCode.append(text)
        elif counter == 1:
            Community.append(text)
        else:
            Neighbourhood.append(text.replace('\n', ' ,'))
        # Move to the next column, wrapping around after the third
        counter = (counter + 1) % 3

# Create the directory of districts and neighbourhoods
chicagoDistricts = {'DistCode':DistCode, 'Community':Community, 'Neighbourhood':Neighbourhood}
df_chicagoDistricts = pd.DataFrame.from_dict(chicagoDistricts)
df_chicagoDistricts.head(20)

Unnamed: 0,DistCode,Community,Neighbourhood
0,8,Near North Side,"Cabrini–Green ,The Gold Coast ,Goose Island ,M..."
1,32,Loop,"Loop ,New Eastside ,South Loop ,West Loop Gate"
2,33,Near South Side,"Dearborn Park ,Printer's Row ,South Loop ,Prai..."
3,5,North Center,"Horner Park ,Roscoe Village"
4,6,Lake View,"Boystown ,Lake View East ,Graceland West ,Sout..."
5,7,Lincoln Park,"Old Town Triangle ,Park West ,Ranch Triangle ,..."
6,21,Avondale,"Belmont Gardens ,Chicago's Polish Village ,Kos..."
7,22,Logan Square,"Belmont Gardens ,Bucktown ,Kosciuszko Park ,Pa..."
8,1,Rogers Park,East Rogers Park
9,2,West Ridge,"Arcadia Terrace ,Peterson Park ,West Rogers Park"


In [7]:
df_chicagoDistricts.shape

(77, 3)

<h3>Filter only the major crime data</h3>

In [8]:
interest_crime_df = crime2015_df[crime2015_df['Primary Type'].str.contains("ARSON|ASSAULT|BATTERY|BURGLARY|TRESPASS|SEX OFFENSE")]
interest_crime_df.shape

(87893, 10)
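
Note that `str.contains` matches substrings, so the filter above also keeps types such as `CRIM SEXUAL ASSAULT` (visible in the earlier `head()` output) because they contain `ASSAULT`. The sketch below contrasts this with an exact match using `isin`; the `types` series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of 'Primary Type' values.
types = pd.Series(["ASSAULT", "CRIM SEXUAL ASSAULT", "CRIMINAL TRESPASS",
                   "THEFT", "BATTERY"])

pattern = "ARSON|ASSAULT|BATTERY|BURGLARY|TRESPASS|SEX OFFENSE"
categories = pattern.split("|")

# Substring (regex) match, as in the cell above.
by_substring = types[types.str.contains(pattern)]

# Exact match against the listed category names only.
by_exact = types[types.isin(categories)]

print(by_substring.tolist())
print(by_exact.tolist())
```

Which behaviour is wanted depends on whether related categories should count as major crimes; the notebook's results are based on the substring match.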

In [9]:
crimesData = pd.pivot_table(interest_crime_df,
                               index=['District'],
                               aggfunc=len,fill_value=0)
crimesData.head()

District,Arrest,Block,Case Number,ID,Latitude,Location Description,Longitude,Primary Type,Ward
1,2688,2688,2688,2688,2688,2688,2688,2688,2688
2,3744,3744,3744,3744,3744,3744,3744,3744,3744
3,5377,5377,5377,5377,5377,5377,5377,5377,5377
4,6221,6221,6221,6221,6221,6221,6221,6221,6221
5,4524,4524,4524,4524,4524,4524,4524,4524,4524


In [10]:
crimesData.reset_index(inplace = True)
# aggfunc=len gives the same row count in every column, so any one column will do
crimesData = crimesData[['District', 'Arrest']]
crimesData.columns = ['District', 'Crime']
crimesData.head()

Unnamed: 0,District,Crime
0,1,2688
1,2,3744
2,3,5377
3,4,6221
4,5,4524
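
Because the pivot table with `aggfunc=len` simply counts rows per district, the same result can be obtained more directly with `groupby(...).size()`. A minimal sketch on a hypothetical frame (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical records: two in district 1, one in district 2, three in district 3.
toy = pd.DataFrame({
    "District": [1, 1, 2, 3, 3, 3],
    "Arrest": [True, False, False, True, True, False],
})

# Count rows per district and name the count column in one step.
counts = toy.groupby("District").size().reset_index(name="Crime")
print(counts)
```

This avoids having to pick an arbitrary column such as `Arrest` and rename it afterwards.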


<h3>Data for total crimes</h3>

In [11]:
totalCrimes = pd.pivot_table(crime2015_df,
                               index=['District'],
                               aggfunc=len,fill_value=0)

totalCrimes.reset_index(inplace = True)
# As before, every column holds the row count, so keep one and rename it
totalCrimes = totalCrimes[['District', 'Arrest']]
totalCrimes.columns = ['District', 'TotalCrime']
totalCrimes.head()

Unnamed: 0,District,TotalCrime
0,1,11762
1,2,10485
2,3,12783
3,4,15556
4,5,11153


In [12]:
#Now we perform the merge
crimesData = crimesData.merge(totalCrimes, left_on='District', right_on='District')
crimesData.head()

Unnamed: 0,District,Crime,TotalCrime
0,1,2688,11762
1,2,3744,10485
2,3,5377,12783
3,4,6221,15556
4,5,4524,11153


<h3>Calculate the ratio of major crimes and total crimes</h3>

In [13]:
crimesData['crime_pct'] = crimesData.Crime / crimesData.TotalCrime
crimesData['crime_pct'] = crimesData['crime_pct'].round(4) * 100
crimesData.head()

Unnamed: 0,District,Crime,TotalCrime,crime_pct
0,1,2688,11762,22.85
1,2,3744,10485,35.71
2,3,5377,12783,42.06
3,4,6221,15556,39.99
4,5,4524,11153,40.56
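
As a quick check of the arithmetic, the percentages can be recomputed from the counts shown above. The sketch below uses the first three rows of the merged table and rounds after scaling to a percentage, which is a small variation on the cell above rather than the notebook's exact code.

```python
import pandas as pd

# First three rows of the merged table shown above.
sample = pd.DataFrame({
    "District": [1, 2, 3],
    "Crime": [2688, 3744, 5377],
    "TotalCrime": [11762, 10485, 12783],
})

# Major crimes as a percentage of all crimes, rounded to two decimals.
sample["crime_pct"] = (sample["Crime"] / sample["TotalCrime"] * 100).round(2)
print(sample)
```

For district 1 this gives 2688 / 11762 × 100 ≈ 22.85, in agreement with the table.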


<h3>Merge the crime data and District information</h3>

In [15]:
#Now we perform the full data merge
crimesData.dtypes

District        int64
Crime           int64
TotalCrime      int64
crime_pct     float64
dtype: object

In [16]:
# Convert the scraped codes from text to integers so they can be matched against District
df_chicagoDistricts['DistCode'] = df_chicagoDistricts['DistCode'].astype(str).astype(int)

In [17]:
df_chicagoDistricts.dtypes

DistCode          int64
Community        object
Neighbourhood    object
dtype: object

In [18]:
df_merged = df_chicagoDistricts.merge(crimesData, left_on='DistCode', right_on='District', how='left').drop(['District'], axis=1)
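
With `how='left'`, every row of the district table is kept even when its code has no counterpart in the crime data. One way to see which rows found no match is the `indicator` option of `merge`. A minimal sketch using three rows taken from the tables in this notebook (the frame names `areas` and `crimes` are chosen here for illustration):

```python
import pandas as pd

# Three rows from the scraped table.
areas = pd.DataFrame({
    "DistCode": [1, 8, 32],
    "Community": ["Rogers Park", "Near North Side", "Loop"],
})

# Crime counts for the two codes that exist in the crime data.
crimes = pd.DataFrame({
    "District": [1, 8],
    "Crime": [2688, 5688],
    "TotalCrime": [11762, 16936],
})

# indicator=True adds a '_merge' column describing where each row came from.
checked = areas.merge(crimes, left_on="DistCode", right_on="District",
                      how="left", indicator=True)

# Rows marked 'left_only' had no matching district in the crime data.
unmatched = checked.loc[checked["_merge"] == "left_only", "Community"].tolist()
print(unmatched)
```

Rows flagged this way carry missing values for the crime columns after the merge, which is why they need special handling before the counts can be converted back to integers.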


In [19]:
# Codes with no match in the crime data come back as NaN; treat them as zero
fill_cols = ['Crime', 'TotalCrime', 'crime_pct']
df_merged[fill_cols] = df_merged[fill_cols].fillna(0)

In [20]:
df_merged.dtypes

DistCode           int64
Community         object
Neighbourhood     object
Crime            float64
TotalCrime       float64
crime_pct        float64
dtype: object

In [21]:
df_merged['Crime'] = df_merged['Crime'].astype(int)
df_merged['TotalCrime'] = df_merged['TotalCrime'].astype(int)

df_merged

Unnamed: 0,DistCode,Community,Neighbourhood,Crime,TotalCrime,crime_pct
0,8,Near North Side,"Cabrini–Green ,The Gold Coast ,Goose Island ,M...",5688,16936,33.59
1,32,Loop,"Loop ,New Eastside ,South Loop ,West Loop Gate",0,0,0.00
2,33,Near South Side,"Dearborn Park ,Printer's Row ,South Loop ,Prai...",0,0,0.00
3,5,North Center,"Horner Park ,Roscoe Village",4524,11153,40.56
4,6,Lake View,"Boystown ,Lake View East ,Graceland West ,Sout...",6146,15767,38.98
5,7,Lincoln Park,"Old Town Triangle ,Park West ,Ranch Triangle ,...",5876,15074,38.98
6,21,Avondale,"Belmont Gardens ,Chicago's Polish Village ,Kos...",0,0,0.00
7,22,Logan Square,"Belmont Gardens ,Bucktown ,Kosciuszko Park ,Pa...",2969,8576,34.62
8,1,Rogers Park,East Rogers Park,2688,11762,22.85
9,2,West Ridge,"Arcadia Terrace ,Peterson Park ,West Rogers Park",3744,10485,35.71


In [22]:
df_merged.dtypes

DistCode           int64
Community         object
Neighbourhood     object
Crime              int64
TotalCrime         int64
crime_pct        float64
dtype: object

In [23]:

df_merged.sort_values(by=['crime_pct'])

Unnamed: 0,DistCode,Community,Neighbourhood,Crime,TotalCrime,crime_pct
38,37,Fuller Park,,0,0,0.00
54,64,Clearing,Chrysler Village,0,0,0.00
53,63,Gage Park,,0,0,0.00
52,62,West Elsdon,,0,0,0.00
51,61,New City,"Back of the Yards ,Canaryville",0,0,0.00
50,59,McKinley Park,,0,0,0.00
49,58,Brighton Park,,0,0,0.00
48,57,Archer Heights,,0,0,0.00
47,56,Garfield Ridge,"LeClaire Courts ,Sleepy Hollow ,Vittum Park",0,0,0.00
46,69,Greater Grand Crossing,"Grand Crossing ,Parkway Gardens ,Park Manor",0,0,0.00


<h1>Result and Conclusion</h1>
The tables above show, for each area, how many major crimes were reported and what share of all reported crimes they make up. The lower that share, the safer the neighbourhood is relative to the others. Rows with a crime_pct of 0.00 are areas whose code found no match in the crime data, so they indicate missing data rather than an absence of crime.
This information can be used from different perspectives.
For example, it gives a real estate agent insight into which places prospective clients would value highly, so that prices and rents can be set accordingly.
Similarly, law enforcement can use this information to decide which areas need more attention to improve public safety.
Finally, it helps a migrant choose a suitable place to make their home.