In [81]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress pandas' SettingWithCopyWarning (a warning, not an error).
pd.options.mode.chained_assignment = None

## DRILL: Prepare the Data

[Download the Excel file here](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.xls) on crime data in New York State in 2013, provided by the FBI: UCR ([Thinkful mirror](https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv)).

Prepare this data to model with multivariable regression (including data cleaning if necessary) according to this specification:

$$ \text{Property crime} = \alpha + \text{Population} + \text{Population}^2 + \text{Murder} + \text{Robbery} $$

The 'population' variable is already set for you, but you will need to create the last three features. Robbery and Murder are currently continuous variables. For this model, please use these variables to create categorical features where values greater than 0 are coded 1, and values equal to 0 are coded 0. You'll use this data and model in a later assignment; for now, just write the code you need to get the data ready. Don't forget basic data cleaning procedures, either! Do some graphing to see if there are any anomalous cases, and decide how you want to deal with them.


In [82]:
path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/" + 
       "NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv")

df = pd.read_csv(path, header=4)

In [83]:
df.head()

Unnamed: 0,City,Population,Violent crime,Murder and nonnegligent manslaughter,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3
0,Adams Village,1861,0,0.0,,0,0,0,12,2,10,0,0.0
1,Addison Town and Village,2577,3,0.0,,0,0,3,24,3,20,1,0.0
2,Akron Village,2846,3,0.0,,0,0,3,16,1,15,0,0.0
3,Albany,97956,791,8.0,,30,227,526,4090,705,3243,142,
4,Albion Village,6388,23,0.0,,3,4,16,223,53,165,5,


In [84]:
# Take an explicit copy so later in-place edits do not operate on a view of df.
crime = df[['Property\ncrime', 'Population', 'Murder and\nnonnegligent\nmanslaughter', 'Robbery']].copy()
crime.columns = ['prop_crime', 'population', 'murder', 'robbery']

Let's see how many rows we have

In [85]:
len(crime)

351

In [86]:
# Pass axis by keyword: the positional form any(1) / all(1) is rejected by newer pandas.
print('rows with at least one nan value: {}'.format(len(crime[crime.isna().any(axis=1)])))
print('rows with all nan value: {}'.format(len(crime[crime.isna().all(axis=1)])))

rows with at least one nan value: 3
rows with all nan value: 3


Three rows in the DataFrame have missing values, and all three are missing every value, so we drop them.

In [87]:
crime.dropna(inplace=True)

Check that no column contains a value smaller than zero.

In [88]:
crime.min()

prop_crime        0
population    1,022
murder            0
robbery           0
dtype: object

In [89]:
crime.dtypes

prop_crime     object
population     object
murder        float64
robbery        object
dtype: object
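Because three of the four columns are `object` (string) columns, the `min()` call above compared text rather than numbers for them. The sketch below uses made-up toy values (not rows from the dataset) to illustrate why a string minimum such as `'1,022'` need not be the numeric minimum.

```python
import pandas as pd

# Toy values for illustration only, stored as strings like an object column.
as_text = pd.Series(['1,022', '97,956', '12', '9'])

# String comparison goes character by character: ',' sorts before any digit,
# so '1,022' comes out as the "smallest" value.
text_min = as_text.min()

# After removing the separators and converting, min() compares real numbers.
as_numbers = pd.to_numeric(as_text.str.replace(',', '', regex=False))
numeric_min = as_numbers.min()

print(text_min, numeric_min)
```

A negative-value check is therefore more trustworthy once the columns have been converted to a numeric dtype.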

Converting `robbery` to an integer type raises an error because the values are strings that contain commas as thousands separators, so we need to remove those first.

In [90]:
#example
crime.robbery[35]

'1,322'

In [91]:
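# First attempt: without regex=True, replace() only matches whole cell values,
# so a cell such as '1,322' is left unchanged.
crime.replace(',', '', inplace=True)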

In [92]:
crime.robbery[35]

'1,322'
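The value is unchanged because `DataFrame.replace` with a plain string looks for cells that are exactly equal to that string. A minimal sketch on a toy frame (the three values are illustrative) shows the difference that `regex=True` makes:

```python
import pandas as pd

toy = pd.DataFrame({'robbery': ['1,322', '0', '227']})

# Default behaviour: the pattern must equal the whole cell, so no cell matches ','.
whole_cell = toy.replace(',', '')

# With regex=True the pattern is matched inside each string.
substring = toy.replace(',', '', regex=True)

print(whole_cell['robbery'].tolist())
print(substring['robbery'].tolist())
```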

In [93]:
crime.murder.dtype

dtype('float64')

In [94]:
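# Strip the thousands separators from every string column
# (murder is already float64, so it is skipped).
for col in crime.columns:
    if crime[col].dtype != 'float64':
        crime[col] = crime[col].str.replace(',', '', regex=False)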
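As an alternative to stripping commas after loading, `pd.read_csv` accepts a `thousands` argument that parses such values as numbers at read time. The sketch below uses a small hypothetical in-memory CSV rather than the real file, and assumes the fields containing separators are quoted. Columns that still contain missing rows would likely come back as floats rather than integers.

```python
import io
import pandas as pd

# Hypothetical extract in the same shape as the FBI table; the second city
# and its figures are invented for illustration.
raw = io.StringIO(
    'City,Population,Robbery\n'
    'Albany,"97,956",227\n'
    'Example City,"1,234,567","1,322"\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators.
parsed = pd.read_csv(raw, thousands=',')
print(parsed.dtypes)
```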

In [95]:
categoricals = ['murder', 'robbery']
crime = crime.astype(int)
# Code counts greater than 0 as 1 and counts of 0 as 0
# (vectorized; applymap is deprecated in recent pandas).
crime[categoricals] = (crime[categoricals] > 0).astype(int)

We also need to add the squared population as a feature.

In [96]:
crime['population^2'] = crime.population**2
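One thing worth checking here is the integer width. On some platforms and NumPy versions, `astype(int)` may produce 32-bit integers, and squaring a population of roughly 46,341 or more would then wrap around silently. The sketch below uses Albany's population from the source table and forces the two dtypes explicitly; it illustrates the risk and is not a claim about what happened in this notebook.

```python
import numpy as np

# Albany's population, as listed in the source table.
population = 97956

# 64-bit integers hold the square comfortably.
wide = np.array([population], dtype=np.int64) ** 2

# 32-bit integers cannot represent it, and array arithmetic wraps without raising.
narrow = np.array([population], dtype=np.int32) ** 2

print(wide[0], narrow[0])
```

If the column turns out to be 32-bit, casting with `astype('int64')` before squaring avoids the problem.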

In [97]:
crime

Unnamed: 0,prop_crime,population,murder,robbery,population^2
0,12,1861,0,0,3463321
1,24,2577,0,0,6640929
2,16,2846,0,0,8099716
3,4090,97956,1,1,9595377936
4,223,6388,0,1,40806544
5,46,4089,0,1,16719921
6,10,1781,0,0,3171961
7,2118,118296,1,1,13993943616
8,210,9519,0,1,90611361
9,405,18182,0,1,330585124
