# Capstone Analytic Report and Research Proposal

In [3]:
import numpy as np
import pandas as pd
import statistics

## 1. Describe and explore a dataset. 
### What does your data contain, and what is its background? Where does it come from? Why is it interesting or significant?

I will be exploring communities within the US using a dataset that combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. Two computer scientists from La Salle University (Philadelphia, PA) created and donated this dataset to [UC Irvine's Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime).

The data contains 1994 instances and 128 attributes that track information such as each community's location, population size, racial makeup, poverty and unemployment levels, and local police department budget. A full list of the attributes can be found [here](http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), under "Attribute Information".

I find this dataset interesting because critics of "Big Data" often warn that math and statistics, applied carelessly, can reinforce preexisting biases in our society. In the hands of an ill-steered or misinformed data scientist, for example, a model may recommend against loans for historically lower-income demographic groups, or recommend a higher insurance premium for applicants who live in a particular county. It is up to the data science practitioners to assess the conclusions the data surfaces and determine which of them are logically sound.

A dataset that combines socio-economic, race, and crime data seems especially prone to misinterpretation, and to reinforcing stereotypes, if analyzed incorrectly. I chose it because I wanted practice drawing sound conclusions from complex data. I first became interested in data science because I wanted to be literate in this new "language" and able to counter the many incorrect conclusions that inappropriate statistical practices will inevitably produce, especially in this increasingly data-driven world.

### Conduct summary statistics and produce visualizations for the variables you use.

First, I want to take a look at my raw data and see some summary statistics about its columns.

In [14]:
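# The attribute file's header row holds the column names for the data file.
raw_attrs = pd.read_csv('communities_crime_attrs.csv')
# Pass the column labels explicitly rather than the DataFrame itself.
raw_data = pd.read_csv('communities_crime.csv', names=raw_attrs.columns)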

raw_data



| | state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | ... | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | ... | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
| 1 | 53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | ... | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
| 2 | 24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | ... | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
| 3 | 34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | ... | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
| 4 | 42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | ... | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| 5 | 6 | ? | ? | SouthPasadenacity | 1 | 0.02 | 0.28 | 0.06 | 0.54 | 1.00 | ... | 0.01 | 0.58 | 0.10 | ? | ? | ? | ? | 0.00 | ? | 0.14 |
| 6 | 44 | 7 | 41500 | Lincolntown | 1 | 0.01 | 0.39 | 0.00 | 0.98 | 0.06 | ... | 0.05 | 0.08 | 0.06 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| 7 | 6 | ? | ? | Selmacity | 1 | 0.01 | 0.74 | 0.03 | 0.46 | 0.20 | ... | 0.01 | 0.33 | 0.00 | ? | ? | ? | ? | 0.00 | ? | 0.55 |
| 8 | 21 | ? | ? | Hendersoncity | 1 | 0.03 | 0.34 | 0.20 | 0.84 | 0.02 | ... | 0.04 | 0.17 | 0.04 | ? | ? | ? | ? | 0.00 | ? | 0.53 |
| 9 | 29 | ? | ? | Claytoncity | 1 | 0.01 | 0.40 | 0.06 | 0.87 | 0.30 | ... | 0.00 | 0.47 | 0.11 | ? | ? | ? | ? | 0.00 | ? | 0.15 |


In [15]:
raw_data.describe()

| | state | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | ... | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | LemasPctOfficDrugUn | ViolentCrimesPerPop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | ... | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 | 1994.0 |
| mean | 28.683551 | 5.493982 | 0.057593 | 0.463395 | 0.179629 | 0.753716 | 0.153681 | 0.144022 | 0.424218 | 0.493867 | ... | 0.215552 | 0.608892 | 0.53505 | 0.626424 | 0.65153 | 0.065231 | 0.232854 | 0.161685 | 0.094052 | 0.237979 |
| std | 16.397553 | 2.873694 | 0.126906 | 0.163717 | 0.253442 | 0.244039 | 0.208877 | 0.232492 | 0.155196 | 0.143564 | ... | 0.231134 | 0.204329 | 0.181352 | 0.200521 | 0.198221 | 0.109459 | 0.203092 | 0.229055 | 0.240328 | 0.232985 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12.0 | 3.0 | 0.01 | 0.35 | 0.02 | 0.63 | 0.04 | 0.01 | 0.34 | 0.41 | ... | 0.06 | 0.47 | 0.42 | 0.52 | 0.56 | 0.02 | 0.1 | 0.02 | 0.0 | 0.07 |
| 50% | 34.0 | 5.0 | 0.02 | 0.44 | 0.06 | 0.85 | 0.07 | 0.04 | 0.4 | 0.48 | ... | 0.13 | 0.63 | 0.54 | 0.67 | 0.7 | 0.04 | 0.17 | 0.07 | 0.0 | 0.15 |
| 75% | 42.0 | 8.0 | 0.05 | 0.54 | 0.23 | 0.94 | 0.17 | 0.16 | 0.47 | 0.54 | ... | 0.28 | 0.7775 | 0.66 | 0.77 | 0.79 | 0.07 | 0.28 | 0.19 | 0.0 | 0.33 |
| max | 56.0 | 10.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
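
Several law-enforcement columns that appear in the raw table, such as `PolicCars` and `PolicBudgPerPop`, are absent from this summary. A likely explanation is that their `?` placeholders cause pandas to read them as text, and `describe()` summarizes only numeric columns by default. The sketch below illustrates this on a tiny made-up CSV (`toy_text` is a hypothetical stand-in, not the real file) and shows one possible remedy, the `na_values` parameter of `pd.read_csv`. It is an illustration under those assumptions, not a step taken in this report.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV; the values are made up for illustration.
toy_text = (
    "state,PolicCars,ViolentCrimesPerPop\n"
    "8,0.06,0.20\n"
    "53,?,0.67\n"
    "24,?,0.43\n"
)

# Read as-is: the '?' entries prevent PolicCars from being parsed as numbers.
as_text = pd.read_csv(io.StringIO(toy_text))

# Read again, telling pandas that '?' marks a missing value.
as_numeric = pd.read_csv(io.StringIO(toy_text), na_values='?')

# describe() covers only numeric columns unless told otherwise.
print(as_text.describe().columns.tolist())
print(as_numeric.describe().columns.tolist())

# Number of missing entries per column after the conversion.
print(as_numeric.isna().sum())
```

With the missing values marked, `count` in the summary would also show how many communities actually report each law-enforcement attribute.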


In [16]:
len(raw_data['state'].unique())

46

I've found that this data contains 1994 entries, covering communities from 46 US states and/or territories.
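
As a small aside, `Series.nunique()` gives the same count more directly, and `Series.value_counts()` shows how many communities each state code contributes. The sketch below uses only the ten state codes visible in the raw table above, held in a hypothetical `toy_states` series, so its numbers describe that sample rather than the full dataset.

```python
import pandas as pd

# State codes from the ten rows displayed earlier (a sample, not the full column).
toy_states = pd.Series([8, 53, 24, 34, 42, 6, 44, 6, 21, 29], name='state')

# Equivalent to len(toy_states.unique()).
n_states = toy_states.nunique()

# Communities per state code, most frequent first.
per_state = toy_states.value_counts()

print(n_states)
print(per_state.head())
```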

## 2. Ask and answer 3 analytic questions with a combination of statistics and visualizations. These analytic questions can focus on individuals' behaviors or comparisons of the population.

For the purposes of this analysis, I will remove all non-predictive columns, which the dataset's documentation identifies as the first 5 columns (`state`, `county`, `community`, `communityname`, and `fold`).

In [None]:
# Drop the first 5 (non-predictive) columns, keeping column index 5 onward.
raw_data = raw_data.iloc[:, 5:]
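
Positional slicing is easy to get off by one, so an alternative is to drop the non-predictive columns by name. The sketch below assumes the five leading column names shown in the raw table (`state`, `county`, `community`, `communityname`, `fold`) and runs on a small made-up frame called `toy`, which is hypothetical and stands in for `raw_data`.

```python
import pandas as pd

# Toy frame with the five non-predictive columns plus two predictive ones.
toy = pd.DataFrame({
    'state': [8, 53],
    'county': ['?', '?'],
    'community': ['?', '?'],
    'communityname': ['Lakewoodcity', 'Tukwilacity'],
    'fold': [1, 1],
    'population': [0.19, 0.00],
    'ViolentCrimesPerPop': [0.20, 0.67],
})

# Names of the columns to discard, taken from the raw table's header.
non_predictive = ['state', 'county', 'community', 'communityname', 'fold']

# drop() raises a KeyError if any listed name is missing, which guards
# against silently removing the wrong columns.
predictive = toy.drop(columns=non_predictive)

print(predictive.columns.tolist())
```

Dropping by name keeps working even if the column order changes, at the cost of having to list the names explicitly.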

## 3. Make a proposal for a realistic future research project on this dataset that would use some data science techniques you'd like to learn in the bootcamp. It should present one or more clear questions, then describe the techniques needed to arrive at an answer.