## Prioritization for data cleaning

- Clean up the metric columns first, then do some feature engineering on them so they are in a form that is valuable to FuF:
  - Sidewalk damage: HARDSCAPE
  - Size of trees: DBH
  - Density of trees: Lat/Long - create additional metric
  - Neighborhood: PROPERTY
  - Health of tree: CONDITION - definitely needs cleaning, could simplify categories
  - Type of tree (genus and species): BOTANICAL; it may be useful to separate genus and species here. Common name: COMMON
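
One possible way to turn Lat/Long into a density metric is to bin points into a coarse grid and count the records in each cell. This is only a sketch: the function name `add_grid_density`, the `trees_in_cell` column, and the `cell_deg=0.002` cell size (about 220 m north to south) are assumptions for illustration, and the sample coordinates are toy values. Vacancies and stumps would need to be filtered out first if they should not count as trees.

```python
import numpy as np
import pandas as pd


def add_grid_density(df, lat_col="latitude", lon_col="longitude", cell_deg=0.002):
    """Attach the number of records sharing each row's lat/long grid cell."""
    out = df.copy()
    # floor() keeps negative longitudes in consistent cells
    out["cell_lat"] = np.floor(out[lat_col] / cell_deg).astype(int)
    out["cell_lon"] = np.floor(out[lon_col] / cell_deg).astype(int)
    # Count of non-null latitudes per cell, broadcast back to every row
    out["trees_in_cell"] = (
        out.groupby(["cell_lat", "cell_lon"])[lat_col].transform("count")
    )
    return out


# Toy coordinates: the first two points fall in the same cell
sample = pd.DataFrame({
    "latitude": [37.7596, 37.7597, 37.8011],
    "longitude": [-122.4758, -122.4757, -122.4011],
})
density = add_grid_density(sample)
print(density[["latitude", "longitude", "trees_in_cell"]])
```

A fixed grid is cheap and easy to aggregate to neighborhoods; a radius-based neighbor count would be smoother but needs a spatial index.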

## Known issues / areas to work on

- ON_ADR/ONSTREET and PROP_ADR/PROPSTREET are identical except in some cases; we think the differences may come from trees on street corners.
- ON_ADR/ONSTREET and PROP_ADR/PROPSTREET need to be merged to produce addresses
- BOTANICAL needs to be split into GENUS and SPECIES
- Many categorical variables, especially the metrics, have redundant/messy categorical values that need to be cleaned up.
- We can create binary indicators from many of these categorical variables and aggregate them to create proportions at the neighborhood level, which would be much easier to visualize on the map, and could also then be exported for decision-making purposes.
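
A rough sketch of the first three items, assuming the raw column names listed above (BOTANICAL, ON_ADR, ONSTREET, PROP_ADR, PROPSTREET). The helper names, the output columns (`ON_ADDRESS`, `PROP_ADDRESS`, `ADDRESS_MISMATCH`), and the sample addresses are made up for illustration. It also assumes the address columns have no missing values; those would need handling first. Non-tree entries such as "Stump" end up with a meaningless genus and should probably be filtered or mapped separately.

```python
import pandas as pd


def split_botanical(df, col="BOTANICAL"):
    """Split 'Genus species ...' into GENUS (first word) and SPECIES (the rest)."""
    out = df.copy()
    parts = out[col].str.strip().str.split(n=1, expand=True)
    out["GENUS"] = parts[0]
    # Single-word entries have no species part
    out["SPECIES"] = parts[1] if parts.shape[1] > 1 else None
    return out


def build_addresses(df):
    """Combine number and street columns, and flag rows where ON and PROP differ."""
    out = df.copy()
    out["ON_ADDRESS"] = (
        out["ON_ADR"].astype(str).str.strip() + " " + out["ONSTREET"].str.strip()
    )
    out["PROP_ADDRESS"] = (
        out["PROP_ADR"].astype(str).str.strip() + " " + out["PROPSTREET"].str.strip()
    )
    out["ADDRESS_MISMATCH"] = out["ON_ADDRESS"] != out["PROP_ADDRESS"]
    return out


# Made-up rows; the second one mimics a corner lot
sample = pd.DataFrame({
    "BOTANICAL": ["Tristaniopsis laurina", "Prunus x blireiana", "Stump"],
    "ON_ADR": [100, 200, 300],
    "ONSTREET": ["18TH AVE", "LAWTON ST", "GEARY BLVD"],
    "PROP_ADR": [100, 1500, 300],
    "PROPSTREET": ["18TH AVE", "17TH AVE", "GEARY BLVD"],
})
result = build_addresses(split_botanical(sample))
print(result[["GENUS", "SPECIES", "ON_ADDRESS", "PROP_ADDRESS", "ADDRESS_MISMATCH"]])
```

Counting `ADDRESS_MISMATCH` would show how often the two address pairs actually disagree, which is one way to check the street-corner hunch.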


Look at condition score by species and neighborhood, and look at condition disparity.
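
One way this could be approached, assuming a 0/1 `condition_metric` column like the one this notebook builds further down: pivot the mean of that indicator by neighborhood and species, then take the gap between the best and worst neighborhood for each species as a simple disparity measure. The function names and the toy data are hypothetical, and "max minus min" is just one possible definition of disparity.

```python
import pandas as pd


def condition_table(df, metric="condition_metric"):
    """Mean of a 0/1 condition indicator per neighborhood (rows) and species (columns)."""
    return df.pivot_table(index="neighborhood", columns="common_species_name",
                          values=metric, aggfunc="mean")


def species_disparity(table):
    """Gap between the best and worst neighborhood for each species."""
    return (table.max() - table.min()).sort_values(ascending=False)


# Toy data: two neighborhoods, two species
sample = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "A", "B"],
    "common_species_name": ["Water Gum", "Water Gum", "Water Gum", "Water Gum",
                            "Double-Flowering Plum", "Double-Flowering Plum"],
    "condition_metric": [1, 1, 0, 1, 1, 1],
})
table = condition_table(sample)
disparity = species_disparity(table)
print(table)
print(disparity)
```

On the real data, cells with only a handful of trees will produce noisy means, so a minimum count per cell is probably needed before reading much into the gaps.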

In [1]:
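import numpy as np
import pandas as pd

# The first row of the file holds the column names, so use header=0
# (header=False is not a valid value for read_csv)
fuf_data_updated = pd.read_csv('../data/combined_tree_data_with_header.csv', header=0)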


In [2]:
fuf_data_updated.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112331 entries, 0 to 112330
Data columns (total 18 columns):
tree_id                      112331 non-null object
neighborhood                 112331 non-null object
on_street_name               112320 non-null object
side                         107443 non-null object
scientific_species_name      112330 non-null object
common_species_name          112279 non-null object
diameter_at_breast_height    112331 non-null int64
condition                    108936 non-null object
parkway_space_type           112302 non-null object
parkway_largest_dimension    112331 non-null int64
trunks                       112331 non-null int64
maintenance_notes            112330 non-null object
status                       75304 non-null float64
hardscape_damage             106101 non-null object
observation_notes            100767 non-null object
clearance                    89077 non-null object
longitude                    112331 non-null float64
latitude

In [None]:
# Dead, Vacancy, Poor, Stump, Stump Removal, Unsuitable Site

In [30]:
fuf_data_updated.condition.value_counts()

Fair               28211
Poor               23342
Fair               19532
Good               12634
Good               11061
Vacancy             8666
Poor                1877
Dead                 741
Very Good            738
Very                 708
Dead                 369
Stump                313
Stump Removal        222
Excellent            175
Critical             120
Open                 106
Critical              85
Unsuitable Site       35
6/15/16                1
dtype: int64
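
Labels such as Fair, Poor, and Good appear more than once in the counts above. A plausible cause, though it is only an assumption until checked, is stray whitespace around otherwise identical labels. A minimal sketch that normalizes whitespace and then lists anything not in an expected set for manual review (the `KNOWN_CONDITIONS` set and the function name are hypothetical; values like "Very" and "Open" are left for a human to decide):

```python
import pandas as pd

# Hypothetical set of labels we would accept as-is
KNOWN_CONDITIONS = {
    "Excellent", "Very Good", "Good", "Fair", "Poor", "Critical",
    "Dead", "Vacancy", "Stump", "Stump Removal", "Unsuitable Site",
}


def normalize_labels(s):
    """Trim and collapse whitespace so visually identical labels compare equal."""
    return s.str.strip().str.replace(r"\s+", " ", regex=True)


# Toy series with the kinds of values seen in the counts above
raw = pd.Series(["Fair", "Fair ", " Poor", "Very", "6/15/16", None])
cleaned = normalize_labels(raw)

# Anything non-missing that is not a known label gets flagged, not silently mapped
unexpected = cleaned[cleaned.notna() & ~cleaned.isin(KNOWN_CONDITIONS)].unique()

print(cleaned.value_counts())
print(sorted(unexpected))
```

Running `value_counts()` again after normalizing shows whether whitespace really explains the duplicates; if it does not, comparing `repr()` of the raw labels is the next thing to try.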

In [29]:
fuf_data_updated[(fuf_data_updated.condition.str.contains('Vacancy')==True)]

Unnamed: 0,tree_id,neighborhood,on_street_name,side,scientific_species_name,common_species_name,diameter_at_breast_height,condition,parkway_space_type,parkway_largest_dimension,...,status,hardscape_damage,observation_notes,clearance,longitude,latitude,hardscape_metric,vacant_lot_metric,condition_metric,stump_metric
30,300040,Inner Sunset,18TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,3,...,3,,,,-122.475863,37.759646,0,1,1,0
51,300075,Inner Sunset,18TH AVE,Front,Vacant Planting Site - Medium,Vacant Planting Site - Medium,0,Vacancy,Planter,6,...,3,,,,-122.475536,37.757512,0,1,1,0
54,300078,Inner Sunset,18TH AVE,Front,Vacant Planting Site - Medium,Vacant Planting Site - Medium,0,Vacancy,Behind Parkway,99,...,3,,,,-122.475561,37.757835,0,1,1,0
60,300085,Inner Sunset,18TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,3,...,3,,,,-122.475711,37.759691,0,1,1,0
85,300124,Inner Sunset,17TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,2,...,3,,,,-122.474640,37.757408,0,1,1,0
90,300130,Inner Sunset,17TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,2,...,3,,,,-122.474595,37.756738,0,1,1,0
93,300133,Inner Sunset,17TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,2,...,3,,,,-122.474568,37.756473,0,1,1,0
101,300141,Inner Sunset,17TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,3,...,3,,Adjust Stakes / Ties,,-122.474455,37.756881,0,1,1,0
113,300154,Inner Sunset,LAWTON ST,Median,Vacant Planting Site - Medium,Vacant Planting Site - Medium,0,Vacancy,Median,9,...,3,,,,-122.474055,37.758091,0,1,1,0
128,300178,Inner Sunset,17TH AVE,Front,Vacant Planting Site - Small,Vacant Planting Site - Small,0,Vacancy,Well/Pit,3,...,3,,,,-122.474700,37.760513,0,1,1,0


In [3]:
fuf_data_updated.describe()

Unnamed: 0,diameter_at_breast_height,parkway_largest_dimension,trunks,status,longitude,latitude
count,112331.0,112331.0,112331.0,75304.0,112331.0,112331.0
mean,8.324879,5.796984,1.095503,1.153312,-122.447735,37.75854
std,9.512935,12.979581,0.942576,0.526511,0.031419,0.020606
min,0.0,0.0,0.0,1.0,-122.511005,37.708306
25%,3.0,3.0,1.0,1.0,-122.473802,37.742783
50%,7.0,3.0,1.0,1.0,-122.442289,37.758569
75%,12.0,4.0,1.0,1.0,-122.424428,37.775723
max,1920.0,99.0,27.0,6.0,-122.378641,37.80667


In [4]:
fuf_data_updated['hardscape_damage'].describe()

count     106101
unique        12
top         None
freq       50271
Name: hardscape_damage, dtype: object

In [5]:
fuf_data_updated['hardscape_damage'].value_counts()

None                  50271
No                    30438
Sidewalk/CG            9626
Sidewalk               7780
Yes                    6418
Curb/Gutter            1320
Temporary               148
Other                    41
Private                  30
Well Grate / Cover       27
Temporary/CG              1
0                         1
dtype: int64

In [6]:
fuf_data_updated['neighborhood'].value_counts()

Mission District                  7892
Outer Richmond                    7156
Parkside                          6435
West of Twin Peaks                6409
Inner Richmond                    6248
Central Sunset                    5518
Noe Valley                        5508
Potrero Hill                      5425
Bernal Heights                    5029
Outer Mission                     4819
Castro/Upper Market               4817
Outer Sunset                      3854
Excelsior                         3781
Inner Sunset                      3714
Western Addition                  3248
Outer Parkside                    3152
Pacific Heights                   2438
Presidio Heights                  2367
Glen Park                         2217
Lower Pacfic Heights              2184
Hayes Valley                      1652
Bayview District                  1630
Nob Hill                          1579
North Panhandle                   1472
Haight Ashbury                    1354
Haight-Ashbury           

In [8]:
fuf_data_updated['diameter_at_breast_height'].value_counts()

3       8553
4       8544
0       8387
5       8024
2       7681
6       7185
1       6888
7       6732
8       6444
9       5216
10      4997
11      4535
12      4443
13      3853
14      3501
15      2970
16      2560
17      1895
18      1582
19      1191
20       981
21       871
22       699
23       634
24       552
25       447
26       332
27       292
28       246
30       206
        ... 
63         9
80         9
85         9
71         8
64         8
70         7
62         7
73         7
67         6
68         6
65         5
77         5
72         5
87         4
69         4
66         4
90         4
75         4
81         4
74         3
79         3
76         3
78         2
95         2
58         2
88         1
94         1
105        1
1920       1
151        1
dtype: int64

In [32]:
# Dead, Vacancy, Poor, Stump, Stump Removal, Unsuitable Site

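# Binary indicators, averaged per neighborhood below to give proportions.
# Labels are matched exactly, so missing values and variants with stray
# whitespace are treated as "not in the list" by each rule.
no_damage = ['None', 'No', 'NA']
poor_condition = ['Poor', 'Dead', 'Critical', 'Stump', 'Stump Removal', 'Unsuitable Site']
stump = ['Stump', 'Stump Removal']

# 1 = any hardscape value other than None/No/NA
fuf_data_updated['hardscape_metric'] = (~fuf_data_updated['hardscape_damage'].isin(no_damage)).astype(int)
# 1 = vacant planting site
fuf_data_updated['vacant_lot_metric'] = (fuf_data_updated['condition'] == 'Vacancy').astype(int)
# 0 = label in poor_condition; everything else (including Vacancy) = 1
fuf_data_updated['condition_metric'] = (~fuf_data_updated['condition'].isin(poor_condition)).astype(int)
# 1 = stump or stump removal
fuf_data_updated['stump_metric'] = fuf_data_updated['condition'].isin(stump).astype(int)

# Neighborhood-level means; for the 0/1 indicators the mean is a proportion
metric_cols = ['hardscape_metric', 'diameter_at_breast_height',
               'vacant_lot_metric', 'condition_metric', 'stump_metric']
grouped = fuf_data_updated.groupby('neighborhood')[metric_cols].mean().reset_index()
# Per-tree columns get the '_binary' suffix, neighborhood means get '_perc'
joined = pd.merge(fuf_data_updated, grouped, on='neighborhood', how='left', suffixes=('_binary', '_perc'))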

In [33]:
grouped

Unnamed: 0,neighborhood,hardscape_metric,diameter_at_breast_height,vacant_lot_metric,condition_metric,stump_metric
0,Alamo Square,0.314371,10.082335,0.000000,1.000000,0.000000
1,Anza Vista,0.237991,8.866812,0.000000,1.000000,0.000000
2,Bayview District,0.275460,6.831902,0.000000,1.000000,0.000000
3,Bayview Heights,0.163158,4.905263,0.000000,1.000000,0.000000
4,Bernal Heights,0.455955,8.716246,0.048519,0.599324,0.005767
5,Bernal Heights North,0.379747,5.987342,0.000000,1.000000,0.000000
6,Bernal Heights South,0.381818,6.290909,0.000000,1.000000,0.000000
7,Buena Vista Park,0.289855,7.478261,0.000000,1.000000,0.000000
8,Castro/Upper Market,0.427029,9.818974,0.022836,0.373677,0.012456
9,Central Richmond,0.205036,7.075540,0.000000,1.000000,0.000000


- Some neighborhoods still don't have much data. Would it make sense to exclude them because of sparsity? Metrics for those neighborhoods are going to be much less reliable and useful.
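
A minimal sketch of one option: keep every neighborhood but flag those below a minimum tree count, so the decision to exclude can be made at plotting or export time. The cutoff `MIN_TREES = 500`, the function name, and the `n_trees` and `reliable` columns are hypothetical choices, not values from the analysis.

```python
import pandas as pd

MIN_TREES = 500  # hypothetical cutoff; tune after looking at the count distribution


def neighborhood_metrics(df, metric_cols, min_trees=MIN_TREES):
    """Per-neighborhood means plus a row count and a reliability flag."""
    grouped = df.groupby("neighborhood")[metric_cols].mean()
    grouped["n_trees"] = df.groupby("neighborhood").size()
    grouped["reliable"] = grouped["n_trees"] >= min_trees
    return grouped.reset_index()


# Toy data: neighborhood A has three records, B has one
sample = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B"],
    "hardscape_metric": [1, 0, 0, 1],
})
summary = neighborhood_metrics(sample, ["hardscape_metric"], min_trees=2)
print(summary)

# Keep only neighborhoods with enough data
print(summary[summary["reliable"]])
```

Flagging rather than dropping keeps the sparse neighborhoods visible, which matters if the sparsity itself is informative (for example, areas that have not been surveyed yet).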

In [11]:
grouped_with_count = fuf_data_updated[['hardscape_metric','diameter_at_breast_height','neighborhood',
          'vacant_lot_metric', 'condition_metric', 'stump_metric']].groupby('neighborhood').agg(['mean',
                                                                                                'count']).reset_index()
grouped_with_count

Unnamed: 0_level_0,neighborhood,hardscape_metric,hardscape_metric,diameter_at_breast_height,diameter_at_breast_height,vacant_lot_metric,vacant_lot_metric,condition_metric,condition_metric,stump_metric,stump_metric
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count,mean,count,mean,count,mean,count,mean,count
0,Alamo Square,0.314371,668,10.082335,668,0.000000,668,1.000000,668,0.000000,668
1,Anza Vista,0.237991,458,8.866812,458,0.000000,458,1.000000,458,0.000000,458
2,Bayview District,0.275460,1630,6.831902,1630,0.000000,1630,1.000000,1630,0.000000,1630
3,Bayview Heights,0.163158,190,4.905263,190,0.000000,190,1.000000,190,0.000000,190
4,Bernal Heights,0.455955,5029,8.716246,5029,0.048519,5029,0.605090,5029,0.005767,5029
5,Bernal Heights North,0.379747,79,5.987342,79,0.000000,79,1.000000,79,0.000000,79
6,Bernal Heights South,0.381818,55,6.290909,55,0.000000,55,1.000000,55,0.000000,55
7,Buena Vista Park,0.289855,69,7.478261,69,0.000000,69,1.000000,69,0.000000,69
8,Castro/Upper Market,0.427029,4817,9.818974,4817,0.022836,4817,0.386132,4817,0.012456,4817
9,Central Richmond,0.205036,556,7.075540,556,0.000000,556,1.000000,556,0.000000,556


In [12]:
joined.columns

Index([u'tree_id', u'neighborhood', u'on_street_name', u'side',
       u'scientific_species_name', u'common_species_name',
       u'diameter_at_breast_height_binary', u'condition',
       u'parkway_space_type', u'parkway_largest_dimension', u'trunks',
       u'maintenance_notes', u'status', u'hardscape_damage',
       u'observation_notes', u'clearance', u'longitude', u'latitude',
       u'hardscape_metric_binary', u'vacant_lot_metric_binary',
       u'condition_metric_binary', u'stump_metric_binary',
       u'hardscape_metric_perc', u'diameter_at_breast_height_perc',
       u'vacant_lot_metric_perc', u'condition_metric_perc',
       u'stump_metric_perc'],
      dtype='object')

In [13]:
joined.head()

Unnamed: 0,tree_id,neighborhood,on_street_name,side,scientific_species_name,common_species_name,diameter_at_breast_height_binary,condition,parkway_space_type,parkway_largest_dimension,...,latitude,hardscape_metric_binary,vacant_lot_metric_binary,condition_metric_binary,stump_metric_binary,hardscape_metric_perc,diameter_at_breast_height_perc,vacant_lot_metric_perc,condition_metric_perc,stump_metric_perc
0,200610,Presidio Heights,GEARY BLVD,Front,Tristaniopsis laurina,Water Gum,3,Good,Well/Pit,3,...,37.781515,0,0,1,0,0.16899,10.103929,0.084073,0.718631,0.00676
1,200611,Presidio Heights,GEARY BLVD,Front,Stump,Stump,7,Stump,Well/Pit,2,...,37.781501,0,0,1,1,0.16899,10.103929,0.084073,0.718631,0.00676
2,200612,Presidio Heights,GEARY BLVD,Front,Prunus x blireiana,Double-Flowering Plum,6,Fair,Well/Pit,2,...,37.781497,0,0,1,0,0.16899,10.103929,0.084073,0.718631,0.00676
3,200614,Presidio Heights,GEARY BLVD,Front,Metrosideros excelsa,New Zealand Christmas Tree,11,Fair,Well/Pit,4,...,37.781487,0,0,1,0,0.16899,10.103929,0.084073,0.718631,0.00676
4,200615,Presidio Heights,GEARY BLVD,Front,Metrosideros excelsa,New Zealand Christmas Tree,12,Good,Well/Pit,4,...,37.781484,0,0,1,0,0.16899,10.103929,0.084073,0.718631,0.00676


In [14]:
joined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112331 entries, 0 to 112330
Data columns (total 27 columns):
tree_id                             112331 non-null object
neighborhood                        112331 non-null object
on_street_name                      112320 non-null object
side                                107443 non-null object
scientific_species_name             112330 non-null object
common_species_name                 112279 non-null object
diameter_at_breast_height_binary    112331 non-null int64
condition                           108936 non-null object
parkway_space_type                  112302 non-null object
parkway_largest_dimension           112331 non-null int64
trunks                              112331 non-null int64
maintenance_notes                   112330 non-null object
status                              75304 non-null float64
hardscape_damage                    106101 non-null object
observation_notes                   100767 non-null object
clearan

In [15]:
joined.to_csv("fuf_with_metrics.csv")