# Data Cleaning
Import the data and clean it for EDA. Drop columns that don't relate to our analysis, and drop rows that have unusable data or fall outside our time frame (2015-2019).

In [1]:
import pandas as pd
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

__Read in Files__ from CSV into pandas DataFrames.

In [2]:
property_2019_full    = pd.read_csv('data/property-assessment-fy2019.csv')
property_2018_full    = pd.read_csv('data/property-assessment-fy2018.csv')
property_2017_full    = pd.read_csv('data/property-assessment-fy2017.csv')
property_2016_full    = pd.read_csv('data/property-assessment-fy2016.csv')
property_2015_full    = pd.read_csv('data/property-assessment-fy2015.csv')
streetlights_full     = pd.read_csv('data/streetlight_locations.csv')
crime_incidents_full  = pd.read_csv('data/crime_incident_reports.csv')

__Read in 311__ separately because it takes longer to load, so you can skip this cell if the 311 data is not needed.

In [3]:
incident_reports_full = pd.read_csv('data/311.csv')

__Drop Columns__ after careful inspection of the data contained in each dataset: drop columns that will not help in our modeling. Columns were dropped if they had no effect on the outcome of interest (such as indices or the number of fireplaces in a property) or if they duplicated other information (such as a location column when longitude and latitude were already given).

1. from `streetlights` drop everything but `Long` and `Lat`
2. from `property_assessment` we only care where the property is and what it's valued at, so drop everything that doesn't relate to either
3. from `crime_incidents` drop `Location` and the index, since the location information was duplicating `Long` and `Lat` and the index was not useful for analysis

**For Property Values** `AV_TOTAL` is nonzero for all entries, and is the simplest single measure of a property's value.
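To make the keep-list approach concrete, here is a minimal sketch on a toy DataFrame. The toy rows and the `PID` and `FIREPLACES` columns are made up for illustration; only the keep-list itself mirrors the one used for the property files. Masking the column index with `isin` means a column that is missing from a given year's file is skipped rather than raising a `KeyError`.

```python
import pandas as pd

# Toy stand-in for one year of property assessment data (values are made up).
toy_property = pd.DataFrame({
    "PID": [1, 2],
    "ST_NUM": ["10", "22"],
    "ST_NAME": ["MAIN", "ELM"],
    "ZIPCODE": ["02118", "02130"],
    "AV_TOTAL": [450000, 612000],
    "FIREPLACES": [0, 1],
})

# Columns to keep; ST_NAME_SUF and UNIT_NUM are absent from the toy frame on purpose.
keep_cols = ["ST_NUM", "ST_NAME", "ST_NAME_SUF", "UNIT_NUM", "ZIPCODE", "AV_TOTAL"]

# Boolean mask over the column index: absent names are ignored instead of raising.
toy_trimmed = toy_property.loc[:, toy_property.columns.isin(keep_cols)]

print(list(toy_trimmed.columns))
```

The surviving columns keep the order they had in the original frame, not the order of the keep-list.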

In [3]:
# drop everything but lat and long
streetlights = streetlights_full.drop(['the_geom','TYPE','OBJECTID'],axis=1)

In [20]:
# list of columns to save for properties
property_cols = ['ST_NUM','ST_NAME','ST_NAME_SUF','UNIT_NUM','ZIPCODE','AV_TOTAL']

# drop all columns not in list (keep _ at end of name to show not fully clean yet)
property_2019_ = property_2019_full[property_2019_full.columns[property_2019_full.columns.isin(property_cols)]]
property_2018_ = property_2018_full[property_2018_full.columns[property_2018_full.columns.isin(property_cols)]]
property_2017_ = property_2017_full[property_2017_full.columns[property_2017_full.columns.isin(property_cols)]]
property_2016_ = property_2016_full[property_2016_full.columns[property_2016_full.columns.isin(property_cols)]]
property_2015_ = property_2015_full[property_2015_full.columns[property_2015_full.columns.isin(property_cols)]]

In [5]:
# list of columns to drop for crime incidents
crime_cols_drop = ['INCIDENT_NUMBER','UCR_PART','Location']

# drop columns and keep only descriptors of crime, date, and location
crime_incidents_ = crime_incidents_full.drop(crime_cols_drop,axis=1)

In [6]:
# list of columns to drop for civil incident reports
incident_cols_drop = ['case_enquiry_id','closure_reason','case_title','subject','reason',
                      'queue', 'department', 'submittedphoto', 'closedphoto', 'neighborhood', 
                      'neighborhood_services_district', 'ward', 'precinct', 'location_street_name',
                      'location_zipcode']

# drop redundant and unnecessary columns
incident_reports_ = incident_reports_full.drop(incident_cols_drop,axis=1)

NameError: name 'incident_reports_full' is not defined

__Drop Rows__ that we do not expect to be usable. This includes rows that have no predictor data or no response variable data, recorded as 'nan', 'none', or in some cases zeros. Careful inspection of each dataset led us to the following:
1. the `streetlights` dataset had no rows with immediately visible issues
2. from `property_assessment` we dropped all rows that had 0 in all four of the price variables (`AV_LAND`, `AV_BLDG`, `AV_TOTAL`, `GROSS_TAX`); no issues with location were immediately visible
3. from `crime_incidents` we dropped rows where `Lat` and `Long` did not have usable values, because that information is vital to our analysis and would be hard to recover from the street name alone
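The row-dropping rules above can be sketched on toy data as follows. All values are invented for illustration, and treating a `Lat` of 0 or a `Long` of -1 as a placeholder is an assumption taken from the filters used in this notebook, not a documented property of the dataset.

```python
import numpy as np
import pandas as pd

# Toy crime incidents: one good row, one NaN, and two placeholder coordinates.
toy_crime = pd.DataFrame({
    "OFFENSE": ["A", "B", "C", "D"],
    "Lat": [42.35, np.nan, 0.0, 42.30],
    "Long": [-71.06, -71.10, -71.05, -1.0],
})

# Drop rows with missing coordinates, then rows with placeholder values (assumed).
toy_crime_clean = toy_crime.dropna(subset=["Lat", "Long"])
toy_crime_clean = toy_crime_clean[
    (toy_crime_clean["Lat"] != 0) & (toy_crime_clean["Long"] != -1)
]

# Toy property rows: keep a row if at least one price variable is nonzero.
price_cols = ["AV_LAND", "AV_BLDG", "AV_TOTAL", "GROSS_TAX"]
toy_prices = pd.DataFrame({
    "AV_LAND": [0, 100, 0],
    "AV_BLDG": [0, 0, 0],
    "AV_TOTAL": [0, 100, 0],
    "GROSS_TAX": [0, 5, 12],
})
toy_prices_clean = toy_prices[(toy_prices[price_cols] != 0).any(axis=1)]

print(toy_crime_clean)
print(toy_prices_clean)
```

Only the first crime row survives, and only the all-zero property row is removed.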

In [7]:
# # drop row if all price values are 0
# def property_droprows(df):
#     df_new = df[(df.AV_LAND != 0)  | (df.AV_BLDG != 0) | (df.AV_TOTAL != 0) | (df.GROSS_TAX != 0)]
#     return(df_new)

In [8]:
# # drop property rows for all years
# property_2019 = property_droprows(property_2019_)
# property_2018 = property_droprows(property_2018_)
# property_2017 = property_droprows(property_2017_)
# property_2016 = property_droprows(property_2016_)
# property_2015 = property_droprows(property_2015_)

In [10]:
# drop rows with nan long and lat 
crime_incidents = crime_incidents_.dropna(subset=['Lat','Long'])

# drop rows with placeholder coordinates (Lat of 0 or Long of -1)
crime_incidents = crime_incidents[crime_incidents.Lat != 0]
crime_incidents = crime_incidents[crime_incidents.Long != -1]

In [None]:
# drop rows that are outside our timeframe (2015-2019) by removing 2011-2014 one year at a time
drop_years = incident_reports_[~incident_reports_.open_dt.str.contains("2011")]
drop_years = drop_years[~drop_years.open_dt.str.contains("2012")]
drop_years = drop_years[~drop_years.open_dt.str.contains("2013")]
drop_years = drop_years[~drop_years.open_dt.str.contains("2014")]

# save remaining incident reports 
incident_reports = drop_years
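As an alternative sketch, the same filter can be written against parsed dates instead of substring matches. This assumes `open_dt` holds timestamps in a form `pd.to_datetime` can parse (the `YYYY-MM-DD HH:MM:SS` strings below are an assumption), and the toy rows and the `type` column are made up. Unlike the sequential drops, it keeps 2015-2019 explicitly, so any rows dated after 2019 would also be excluded.

```python
import pandas as pd

# Toy 311 reports with an assumed timestamp format for open_dt.
toy_311 = pd.DataFrame({
    "open_dt": [
        "2011-07-01 01:32:33",
        "2014-12-31 23:59:59",
        "2015-01-01 00:00:00",
        "2019-06-15 12:30:00",
    ],
    "type": ["pothole", "graffiti", "streetlight", "pothole"],
})

# Parse once, then keep rows whose year falls in the inclusive 2015-2019 window.
opened = pd.to_datetime(toy_311["open_dt"])
in_window = opened.dt.year.between(2015, 2019)
toy_311_recent = toy_311[in_window]

print(toy_311_recent)
```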

In [13]:
len(property_2019_[property_2019_.GROSS_TAX != 0])

155537

In [14]:
len(property_2019[property_2019.AV_BLDG != 0])

152223

In [15]:
len(property_2019[property_2019.AV_LAND != 0])

91924

In [21]:
len(property_2019_[property_2019_.AV_TOTAL != 0])

164734

In [19]:
len(property_2019)

164734
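The per-column counts above can also be gathered in a single pass. The following is a small sketch on made-up numbers, not the actual assessment data: comparing the price columns to zero gives a boolean frame, and summing it counts the nonzero entries in each column.

```python
import pandas as pd

# Toy price columns (values are invented for illustration).
toy_values = pd.DataFrame({
    "AV_LAND": [0, 100, 200],
    "AV_BLDG": [0, 0, 300],
    "AV_TOTAL": [10, 100, 500],
    "GROSS_TAX": [0, 5, 12],
})

price_cols = ["AV_LAND", "AV_BLDG", "AV_TOTAL", "GROSS_TAX"]

# True where a value is nonzero; summing the booleans counts them per column.
nonzero_counts = (toy_values[price_cols] != 0).sum()

print(nonzero_counts)
```

A column whose count equals `len(toy_values)` is nonzero for every row, which is the check applied to `AV_TOTAL` above.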