# CS418 Data Pirates


# PROJECT INTRODUCTION
Our team is analyzing the correlation between increasing economic polarization in Chicago and the changing crime landscape of the city. We want to identify the factors contributing to the disappearing middle class. To do so, we are looking into the socioeconomic trends that could help us understand each community area in Chicago.

We are using three datasets:

* **Chicago Crimes** (2001-present) 
	 https://catalog.data.gov/dataset/crimes-2001-to-present-398a4
Number of records: 6.8M crime incidents
IUCR is a four-digit code that law enforcement agencies use to classify criminal incidents when taking individual reports.
The crime location description gives geographic detail that helps identify the areas most at risk of violence.
* **Chicago Socioeconomic Trends** (2008-2012)
https://catalog.data.gov/dataset/census-data-selected-socioeconomic-indicators-in-chicago-2008-2012-36e55
Number of records: 77 entries (one per community area)
This dataset covers six socioeconomic indicators: (a) unemployment rate, (b) % of households below the poverty level, (c) % of individuals aged 25+ without a high school diploma, (d) % of the population outside working age (minors and senior citizens), (e) crowded housing, and (f) per capita income across Chicago community areas.
The crime and socioeconomic datasets will be joined on community area number (assigned by the City of Chicago).
The hardship index is a score that incorporates each of the six selected socioeconomic indicators according to the method described in "An Update on Urban Hardship".
* **Chicago Map Boundaries**
https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6
Number of records: 77 entries (one per community area)
This dataset provides the geometric specifications (such as perimeter, area, and shape) of each of the 77 Chicago community areas.



# CHANGES

There have been no real changes to the scope. We have been looking specifically at the
Chicago Crimes dataset and the socioeconomic indicators, and we are trying to figure out what is happening to the middle
class and why it is disappearing in many neighborhoods. We are analyzing the crimes and correlating them
with socioeconomic factors.

The scope is the same as in our check-in.

# DATA CLEANING

For data cleaning, we used Python to filter the crimes data, since the crime CSV file was too large to work with otherwise. The original data covers crimes from 2001 to the present. For this project we only need 2008 to 2012, because we correlate it with the socioeconomic dataset, which covers 2008 to 2012. We first removed all records from other years. The crime data also contained many unnecessary columns, so we kept only the columns useful to our project and dropped the rest to reduce the file size. The code for cleaning the data is in our GitHub directory. Finally, the socioeconomic dataset is small and did not need much cleaning.
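The filtering step described above could look roughly like the sketch below. It is not the script from our repository: the helper name `filter_crimes`, the `Year` column name, and the tiny inline sample are assumptions made for illustration, while `ID`, `Community Area`, and `Arrest` are the columns used later in this notebook. Reading in chunks keeps memory use low for a large file.

```python
import io
import pandas as pd

# Columns kept for the analysis. 'ID', 'Community Area' and 'Arrest' are used
# later in this notebook; 'Year' is an assumed name for the year column.
KEEP_COLUMNS = ["ID", "Year", "Community Area", "Arrest"]


def filter_crimes(source, first_year=2008, last_year=2012, chunksize=100_000):
    """Read a large crimes CSV in chunks, keeping only the wanted years and columns."""
    kept = []
    for chunk in pd.read_csv(source, usecols=KEEP_COLUMNS, chunksize=chunksize):
        # Keep rows whose year falls inside the inclusive range
        in_range = chunk["Year"].between(first_year, last_year)
        kept.append(chunk.loc[in_range])
    return pd.concat(kept, ignore_index=True)


# Tiny made-up sample standing in for the real file (illustration only)
sample_csv = io.StringIO(
    "ID,Case Number,Year,Community Area,Arrest\n"
    "1,A1,2005,25,true\n"
    "2,A2,2008,25,false\n"
    "3,A3,2012,8,true\n"
    "4,A4,2015,8,false\n"
)
filtered = filter_crimes(sample_csv, chunksize=2)
print(filtered)
```

On the real data, the result would then be written out with `filtered.to_csv("dataFiltered.csv", index=False)`, which is the file loaded in the exploratory analysis.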

# EXPLORATORY DATA ANALYSIS

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import geopandas as gpd
%matplotlib inline
import matplotlib.pyplot as plt


## Socioeconomic Indicators

In [None]:
socioecon = pd.read_csv("ChicagoSocioecon.csv")
# Rows with a missing community area number are assigned 0 so the column can be stored as integers
socioecon['Community Area Number'] = pd.to_numeric(socioecon['Community Area Number'])
socioecon['Community Area Number'] = socioecon['Community Area Number'].fillna(0).astype(np.int64)
# Index by community area number so the table can be joined to the map
socioecon = socioecon.set_index('Community Area Number')
socioecon.head()

## Chicago Map

In [None]:
chicago_map = gpd.read_file("geo_export_7740d8e1-a704-49b1-8276-e70c37a786a0.shp")
# The shapefile stores the area number as text; convert it to an integer key
chicago_map['Community Area Number'] = pd.to_numeric(chicago_map['area_num_1'], downcast='signed')
# Drop attribute columns that are not needed for plotting
chicago_map = chicago_map.drop(['comarea', 'comarea_id', 'perimeter', 'area', 'area_numbe', 'area_num_1'], axis=1)
chicago_map = chicago_map.set_index('Community Area Number').sort_index()
chicago_map.head()

## Chicago Crimes

In [None]:
crimes = pd.read_csv("dataFiltered.csv")
# Count the number of crime records in each community area
crime = crimes.groupby('Community Area').count()['ID'].reset_index(name="Crime Count")
crime.head()

## Chicago Socioeconomic Map

In [None]:
# Both frames are indexed by community area number, so join on the index
socioecon_map = chicago_map.join(socioecon)
# Index the crime counts by community area so rows are matched on the area number, not on row position
crime_map = chicago_map.join(crime.set_index('Community Area'))
crime_map.head()

# VISUALIZATIONS

## Hypothesis:

#### A community with a low high school graduation rate and a high unemployment rate will have a higher crime rate.

For instance, look at community area 25 in the unemployment map below. Its unemployment rate is roughly 23.5%, which is higher than in many other community areas between 2008 and 2012. The crime map below shows that community area 25 also had the highest crime count, with around 34,000 crimes over the same period. The high school diploma map shows that community area 25 had a very high share of residents without a diploma, at around 25%, much higher than in other communities. These factors go together with the high crime count for community area 25 over that period. Many other communities follow the same pattern. There are a few anomalies in the data, but the majority follow this pattern.

This was interesting to see because it suggests a chain effect: high school dropout rates go along with unemployment rates, which in turn go along with much higher crime counts over those years.
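One way to check the pattern described above beyond reading the maps is to compute the Pearson correlation between each indicator and the crime count per community area. The sketch below is a suggestion, not a result: the helper `indicator_crime_correlation` is hypothetical and the numbers in the toy frames are made up purely to show the mechanics. The column names match the ones used in this notebook.

```python
import pandas as pd

UNEMPLOYED = "PERCENT AGED 16+ UNEMPLOYED"
NO_DIPLOMA = "PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA"


def indicator_crime_correlation(socioecon, crime, indicators):
    """Pearson correlation of each socioeconomic indicator with the crime count."""
    # Match rows on the community area number, keeping areas present in both tables
    counts = crime.set_index("Community Area")["Crime Count"]
    merged = socioecon[indicators].join(counts, how="inner")
    return merged[indicators].corrwith(merged["Crime Count"])


# Made-up toy values (NOT real Chicago figures), indexed like `socioecon`
toy_socioecon = pd.DataFrame(
    {UNEMPLOYED: [5.0, 10.0, 15.0, 20.0], NO_DIPLOMA: [8.0, 12.0, 20.0, 30.0]},
    index=pd.Index([1, 2, 3, 4], name="Community Area Number"),
)
# Shaped like the `crime` frame built earlier in the notebook
toy_crime = pd.DataFrame(
    {"Community Area": [1, 2, 3, 4], "Crime Count": [1000, 2000, 3000, 4000]}
)

correlations = indicator_crime_correlation(
    toy_socioecon, toy_crime, [UNEMPLOYED, NO_DIPLOMA]
)
print(correlations)
```

Values close to +1 would support the hypothesis and values near 0 would not. Pearson correlation only measures linear association, so it complements the maps rather than replacing them.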

#### Middle-class communities are getting poorer and are disappearing into the lower class. According to Pew Research, the middle class is considered to have a per capita income between $42,000 and $135,000.

If you look closely at per capita income, the majority of communities are below $42,000, which classifies them as lower class. From that we can see that the middle class barely exists. More than 80% of the communities are in this trap, and the crime rates and unemployment rates are also very high in these areas.
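The share of community areas in each income band can be computed directly from the per capita income column. The following is a minimal sketch assuming the $42,000 and $135,000 thresholds quoted above. The helper `income_class_shares` and the toy incomes are hypothetical; on the real data the input would be `socioecon["PER CAPITA INCOME "]`.

```python
import pandas as pd

LOWER_BOUND = 42_000    # lower edge of the middle-class band quoted in the text
UPPER_BOUND = 135_000   # upper edge of the middle-class band quoted in the text
LABELS = ["lower", "middle", "upper"]


def income_class_shares(per_capita_income):
    """Return the percentage of community areas that fall in each income band."""
    bands = pd.cut(
        per_capita_income,
        bins=[0, LOWER_BOUND, UPPER_BOUND, float("inf")],
        labels=LABELS,
        right=False,  # intervals are [low, high): exactly 42,000 counts as middle
    )
    fractions = bands.value_counts(normalize=True)
    return pd.Series({label: 100 * float(fractions.get(label, 0.0)) for label in LABELS})


# Made-up incomes for five imaginary areas (illustration only)
toy_income = pd.Series([15_000, 23_000, 30_000, 41_000, 60_000])
shares = income_class_shares(toy_income)
print(shares)
```

With the toy values, four of the five areas fall below the lower threshold. Running the same function on the real column would give the figure to compare against the "more than 80%" estimate.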

This is interesting because our group considers itself middle class, and comparing against this dataset shows us where we actually stand. We need to figure out why this is happening and how it can be fixed, since we are directly affected by it. The trend has been going down every year, which is not encouraging.


All the visualizations that support these claims are below.


## Socioeconomic gaps
The "Selected socioeconomic indicators in Chicago, 2008 to 2012" dataset looks into
* Poverty level
* Crowded housing
* % of High school graduates
* Per capita income

## [VIZ1] Unemployment Rate on various Chicago community areas

In [None]:
## UNEMPLOYMENT RATE ##

variable = "PERCENT AGED 16+ UNEMPLOYED"
fig, ax = plt.subplots(1, figsize=(10, 6))
vmin, vmax = 1, 50

# Pass vmin/vmax to the plot so the map colors match the colorbar scale
socioecon_map.plot(column=variable, cmap="coolwarm", linewidth=0.8, ax=ax, edgecolor='1', vmin=vmin, vmax=vmax)
ax.set_title("Unemployment Rate on various Chicago community areas", fontdict={"fontsize": 25, "fontweight" : 5})
ax.axis('off')

# Legend: colorbar built from the same colormap and value range
sm = plt.cm.ScalarMappable(cmap="coolwarm", norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

## [VIZ2] Per Capita Income over Chicago Neighborhoods

In [None]:
variable = "PER CAPITA INCOME "
fig, ax = plt.subplots(1, figsize=(10, 6))
vmin, vmax = 8000, 90000

# Pass vmin/vmax to the plot so the map colors match the colorbar scale
socioecon_map.plot(column=variable, cmap="RdBu", linewidth=0.8, ax=ax, edgecolor='1', vmin=vmin, vmax=vmax)
ax.set_title("Per Capita Income over Chicago Neighborhoods", fontdict={"fontsize": 25, "fontweight" : 5})
ax.axis('off')

# Legend: colorbar built from the same colormap and value range
sm = plt.cm.ScalarMappable(cmap="RdBu", norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

## [VIZ3] Crime Count per Chicago Community Area

In [None]:
variable = "Crime Count"
fig, ax = plt.subplots(1, figsize=(10, 6))
vmin, vmax = 400, 40000

# Pass vmin/vmax to the plot so the map colors match the colorbar scale
crime_map.plot(column=variable, cmap="coolwarm", linewidth=0.8, ax=ax, edgecolor='1', vmin=vmin, vmax=vmax)
ax.set_title("Crime Count per Chicago Community Area", fontdict={"fontsize": 25, "fontweight" : 5})
ax.axis('off')

# Legend: colorbar built from the same colormap and value range
sm = plt.cm.ScalarMappable(cmap="coolwarm", norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

## [VIZ4] Percentage of Arrests made per Community

In [None]:
# Remove records assigned to the placeholder community area 0
crimes = crimes[crimes['Community Area'] != 0]
# Count the records that led to an arrest in each community area
arrest = crimes.loc[crimes['Arrest'] == True]
arrest = arrest.groupby('Community Area').count()['ID'].to_frame(name="Arrest Count")
# Align the total crime counts on the community area number, not on row position
arrest['Crime Count'] = crime.set_index('Community Area')['Crime Count']
arrest['% Arrests'] = (arrest['Arrest Count'] / arrest['Crime Count']) * 100
arrest = arrest.dropna()
arrest_map = chicago_map.join(arrest)

variable = "% Arrests"
fig, ax = plt.subplots(1, figsize=(10, 6))
vmin, vmax = 1, 100

# Pass vmin/vmax to the plot so the map colors match the colorbar scale
arrest_map.plot(column=variable, cmap="RdBu", linewidth=0.8, ax=ax, edgecolor='1', vmin=vmin, vmax=vmax)
ax.set_title("Percentage of Arrests made per Community", fontdict={"fontsize": 25, "fontweight" : 5})
ax.axis('off')

# Legend: colorbar built from the same colormap and value range
sm = plt.cm.ScalarMappable(cmap="RdBu", norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

## [VIZ5] % aged 25+ without HS diploma over Chicago Neighborhoods

In [None]:
variable = "PERCENT AGED 25+ WITHOUT HIGH SCHOOL DIPLOMA"
fig, ax = plt.subplots(1, figsize=(10, 6))
vmin, vmax = 1, 50

# Pass vmin/vmax to the plot so the map colors match the colorbar scale
socioecon_map.plot(column=variable, cmap="coolwarm", linewidth=0.8, ax=ax, edgecolor='1', vmin=vmin, vmax=vmax)
ax.set_title("% aged 25+ without HS diploma over Chicago Neighborhoods", fontdict={"fontsize": 25, "fontweight" : 5})
ax.axis('off')

# Legend: colorbar built from the same colormap and value range
sm = plt.cm.ScalarMappable(cmap="coolwarm", norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

# Machine Learning Analysis
At least one ML analysis on your dataset, along with a baseline comparison
and an interpretation of the result that you obtain.

# Reflection
Reflection: a discussion of the following:
* What is hardest part of the project that you’ve encountered so far?
* What are your initial insights?
* Are there any concrete results you can show at this point? If not, why not?
* Going forward, what are the current biggest problems you’re facing?
* Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?
* Given your initial exploration of the data, is it worth proceeding with your project, why? If not, how are you going to change your project and why do you think it’s better than your current results?
 
The hardest part of the project so far was cleaning the data. The crimes CSV file we downloaded was initially 423 MB, and none of us had any previous experience with databases. We therefore used Python to reduce the size of the file by filtering out only the information we needed for the analysis. This initial step took some research, and we believe it has been the hardest part of the project so far.

After some data analysis, a key insight is that the middle class appears to be disappearing. Looking at the per capita income visualization per community above, we noted that communities with low income have higher crime counts. We therefore believe that per capita income has a major effect on the crime rate in Chicago, which is an important insight for this project. Another insight is that not having a high school diploma is also related to crime counts. Comparing the "% aged 25+ without HS diploma over Chicago Neighborhoods" visualization against the "Crime Count per Chicago Community Area" visualization, we found that communities with a higher share of residents without a diploma have higher crime counts. The two rise together, which was another very interesting insight.


The biggest problem we currently face is coming up with a solution for reducing the crime rate in Chicago. We have the required data. We believe our task is to show Chicago communities, through proper data analysis, where things are going wrong and how significantly that affects the crime rate.

# Next Steps
What you plan to accomplish in the next month and how you plan to evaluate whether your project achieved the goals you set for it.

Our goal is to continue using machine learning techniques to further analyze the dataset and predict how other factors, such as per capita income, poverty level, and crowded housing, will affect the communities in the next couple of years. We will use linear regression to do this.

We also want to put the finishing touches on the project and make sure we have done a complete analysis of crime in Chicago and of how the middle class is disappearing.

We will know our goal is achieved when we have solid reasoning, backed by facts and data, showing that the middle class is disappearing in these communities.
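As a rough idea of how the planned linear regression could be evaluated, the sketch below fits ordinary least squares with NumPy and compares its error on held-out rows against a baseline that always predicts the training mean. Everything here is an assumption for illustration: the data is synthetic (77 rows only to mirror the number of community areas), and the helper names `fit_linear`, `predict`, and `rmse` are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Synthetic stand-in data: two made-up indicators and a made-up target
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(77, 2))
y = 500 + 300 * X[:, 0] + 150 * X[:, 1] + rng.normal(0, 200, size=77)

# Hold out the last 17 rows for evaluation
train, test = slice(0, 60), slice(60, 77)
coef = fit_linear(X[train], y[train])

model_rmse = rmse(y[test], predict(coef, X[test]))
# Baseline: always predict the mean of the training targets
baseline_rmse = rmse(y[test], np.full(len(y[test]), y[train].mean()))

print(f"model RMSE:    {model_rmse:.1f}")
print(f"baseline RMSE: {baseline_rmse:.1f}")
```

A model is only worth interpreting if it beats the mean baseline on rows it has not seen. With just 77 community areas, cross-validation may give a more stable estimate than a single split.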