# Risk Monitoring and Mitigation of the Urban Forest

>Identifying Problematic Areas


- toc:false
- branch: master
- badges: true
- comments: true
- author: Evgeny Khoroshukhin
- categories: [jupyter]
<!-- - image: images/student-performance.jpg -->

In [69]:
#hide
# !pip install geopandas
# !pip install geopy
# pip install plotly
# !pip install lxml

In [37]:
# hide
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import geopy
import plotly.express as px
from sklearn.cluster import DBSCAN
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from geopy.extra.rate_limiter import RateLimiter
from geopy import Nominatim
from geopy.distance import distance

Urban forest is more important than you think.

We all know some benefits of trees like cleaning the air from CO2 and aesthetic reasons.

But there are more to it than you think:
- Carbon sequestration/storage
- Promoting diverse flora
- Barrier to noisy traffic
- Each large front yard tree adds about 1% to sales price of the property
- Trees reduce stormwater runoff by capturing and storing rainfall in their canopy and promote the infiltration of rainwater into the soil

But what exactly goes into sustaining the life of the trees and what are the risks of not taking it seriously?
Here's where urban forest monitoring, prevention and mitigation comes in. Increasing public safety and examining conditions of the trees non-destructively to develop a plan of action. preventing sidewalk damage. Monitoring invasive norway maple that lessen diversity of other trees and living habitat as well as Ash trees (Fraxinus spp.) in preparation for an invasion of emerald ash borer. budget allocation for renewal and pruning.



This project is going to aim to help to identify, predict possible risk and areas of attention.
 
For this project I'll be working with few datasets:
 
New-York tree inventory data taken from [NYC Open Data](https://data.cityofnewyork.us/Environment/Forestry-Tree-Points/hn5i-inap/data).
 
The dataset is Urban street tree inventory data for Newburgh, New York in 2015.
[Urban street tree inventory data for Newburgh, New York in 2015](https://www.uvm.edu/femc/data/archive/project/Newburgh_new_york_street_tree_inventory/dataset/newburgh-new-york-street-tree-inventory/overview)
 
Finally, [Historic Tree Inventory - 2018/2019 within the City of Buffalo](https://data.buffalony.gov/Quality-of-Life/Historic-Tree-Inventory-2018-2019/vwyx-pawz)
 
Since NYC tree inventory has more features and observations It'll be the training dataset.
The other two I'll test predictions on.

### Feature Engineering
the data set doesn't have many useful of features to work with but there's some possibility for feature engineering.
like to combine address and street name to find out longitude and latitude of trees using geopy for ploting them using plotly on the map.

<!-- deriving distance from coordinates would help to engineer the area. -->

we can use a formula to make predictions of the age according to the species to create even more features.


### for visualization
make a year slider connected to a growing riskfactor
areas with high tree pollen

uploading dataset set of a different city


With a high ephasis on urban forest it's important to it healthy and hazard free.

to find out which trees corresponds to what degree of allergy severity, I used this [website](https://www.pollenlibrary.com/Local/Species/in/Orange%20County/NY/in/Winter/).
due to lack of time I have not scraped the data(which in the future I should), but I copied manually into spreadsheet.
The plan is to merge dataframes on `botanical name`

 After training a model to predict risk rating it would be possible to upload the data of tree inventory from another city(given its formatted accordingly) to see a forecast of the trees that need attention.

# Data Wrangling

NYC tree inventory contains close to a million observations. This dataset is updated regularly.
with this large dataset we wouldn worry about dropping missing values. Wrangling will be important for the dataset to contain necessary variables and be memory efficient.

In [276]:
df_ny_trees=pd.read_csv('Forestry_Tree_Points.csv')
df_ny_trees.columns=df_ny_trees.columns.str.lower()
df_ny_trees=df_ny_trees.drop(['objectid','plantingspaceglobalid',
                              'geometry','globalid','riskratingdate',
                              'planteddate','createddate','updateddate','stumpdiameter'],axis=1)
df_ny_trees=df_ny_trees.dropna()
df_ny_trees=df_ny_trees.rename(columns={'genusspecies':'botanical_name'})
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.lower()
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.split('-').str[0]
df_ny_trees['tpcondition']=df_ny_trees['tpcondition'].str.lower()
df_ny_trees['tpstructure']=df_ny_trees['tpstructure'].str.lower()
df_ny_trees['dbh']=df_ny_trees['dbh'].astype(int)
df_ny_trees['riskrating']=df_ny_trees['riskrating'].astype(int)
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.split('-').str[0]
df_ny_trees.drop(df_ny_trees[df_ny_trees['tpcondition']=='unknown'].index, inplace=True)
df_ny_trees['latitude']=(df_ny_trees['location']
                         .apply(lambda x: x.split('(')[-1].strip(')').split(','))
                         .apply(lambda x:x[0])).astype(float)
df_ny_trees['longitude']=(df_ny_trees['location']
                          .apply(lambda x: x.split('(')[-1].strip(')').split(','))
                          .apply(lambda x:x[1])).astype(float)
df_ny_trees=df_ny_trees.drop('location',axis=1)
df_ny_trees.loc[df_ny_trees['botanical_name']=="prunus serrulata 'green leaf' ",'botanical_name']='prunus serrulata'
df_ny_trees['botanical_name']=df_ny_trees['botanical_name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')


# df_ny_trees['botanical_name'].str.extract('(.+[A-za-z]*(?= var))')
df_ny_trees.shape

(303029, 7)

In [278]:
df_ny_trees.head()

Unnamed: 0,dbh,tpstructure,tpcondition,botanical_name,riskrating,latitude,longitude
0,26,full,good,quercus palustris,7,40.86335,-73.906594
2,30,full,fair,quercus palustris,8,40.71029,-73.833408
4,5,full,good,quercus phellos,3,40.727396,-74.00755
5,10,full,fair,acer platanoides,8,40.813687,-73.943579
6,36,full,fair,fraxinus americana,6,40.856531,-73.790451


After wrangling, this dataset reduced to around `300_000` observations. For ease of use I'll make a separate csv and load the data from there.

In [279]:
df_ny_trees.to_csv('df_ny_trees_wrangled.csv')

df_ny_trees_wrangled=pd.read_csv('df_ny_trees_wrangled.csv')

# Buffalo Dataset

In [374]:
df_buffalo=pd.read_csv('Historic_Tree_Inventory_-_2018_2019.csv')
df_buffalo.columns=df_buffalo.columns.str.lower()

df_buffalo=df_buffalo.drop(columns=[
        'editing','total yearly eco benefits ($)', 'stormwater benefits ($)',
       'stormwater gallons saved', 'greenhouse co2 benefits ($)',
       'co2 avoided (in lbs.)', 'co2 sequestered (in lbs.)',
       'energy benefits ($)', 'kwh saved', 'therms saved',
       'air quality benefits ($)', 'pollutants saved (in lbs.)',
       'property benefits ($)','address','leaf surface area (in sq. ft.)',
       'street', 'side', 'site', 'council district', 'park name', 'site id', 'location'])
df_buffalo['common name']=df_buffalo['common name'].str.lower()
df_buffalo['botanical name']=df_buffalo['botanical name'].str.lower()
df_buffalo=df_buffalo.dropna()
df_buffalo['dbh']=df_buffalo['dbh'].astype(int)
df_buffalo['botanical name']=df_buffalo['botanical name'].str.extract('(^[,. A-Za-z]*[A-Za-z])')
df_buffalo

Unnamed: 0,botanical name,common name,dbh,latitude,longitude
0,vacant,vacant,0,42.896317,-78.897606
1,vacant,vacant,0,42.899690,-78.892210
2,vacant,vacant,0,42.895772,-78.835606
3,vacant,vacant,0,42.910160,-78.850107
4,vacant,vacant,0,42.958505,-78.848124
...,...,...,...,...,...
132491,ailanthus altissima,ailanthus,4,42.917577,-78.852584
132492,fraxinus americana,"ash, white",22,42.940216,-78.843552
132494,malus,"crabapple, harvest gold",7,42.916829,-78.803502
132495,ulmus americana,"elm, american",8,42.916343,-78.826752


In [325]:
df_buffalo.shape

(132471, 6)

# Allergy dataset

In [462]:
allergy=pd.read_csv('pollen.csv',header=None,names=['trees','allergy'])
allergy=(allergy
         .drop_duplicates()
         .reset_index(drop=True)
         .fillna(0)
         .convert_dtypes(int))
allergy['trees']=allergy['trees'].str.lower()
allergy['botanical_name']=allergy['trees'].apply(lambda x: x.split('(')[-1].strip(')'))
allergy=allergy.drop(columns='trees')
allergy=allergy[['botanical_name','allergy']]

allergy

Unnamed: 0,botanical_name,allergy
0,fagus grandifolia,1
1,ulmus americana,2
2,carpinus caroliniana,2
3,platanus occidentalis,2
4,hamamelis virginiana,0
...,...,...
180,picea glauca,0
181,salix alba,3
182,ulmus glabra,2
183,baccharis halimifolia,3


Next up is the Newburgh dataset.

# Newburgh dataset

In [496]:
df_newburgh=pd.read_csv('Z1535_3079_DOVK2R.csv')
df_newburgh.columns=df_newburgh.columns.str.lower()
df_newburgh['botanical_name']=(df_newburgh['species']
                  .apply(lambda x: x.split('(')[-1].strip(')'))
                  .str.lower())
df_newburgh['species']=df_newburgh['species'].str.extract('([, A-Za-z]*(?![^(]*\)))')
df_newburgh=df_newburgh.drop(['suffix','cultivar'],axis=1)
# df.loc[df['dbh']>40,'dbh']/3.14.round(1)
df_newburgh=df_newburgh.drop(columns=['side', 'site', 'on_street', 'inventory_date','site_id','area', 'stems','species'])
df_newburgh.loc[df_newburgh['botanical_name'].str.contains('vacant'),'botanical_name']='vacant'
df_newburgh.loc[df_newburgh['dbh']>=40,'dbh']=(df_newburgh.loc[df_newburgh['dbh']>=40,'dbh']/3.14).round().astype(int)
df_newburgh['street']=df_newburgh['street'].str.lower()
df_newburgh.shape

(8037, 4)

In [393]:
df_newburgh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8037 entries, 0 to 8036
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   address         8037 non-null   int64 
 1   street          8037 non-null   object
 2   dbh             8037 non-null   int64 
 3   botanical_name  8037 non-null   object
dtypes: int64(2), object(2)
memory usage: 251.3+ KB


In [469]:
df_newburgh

Unnamed: 0,address,street,dbh,botanical_name
0,251,farrington st,6,quercus palustris
1,251,farrington st,0,vacant
2,251,farrington st,0,vacant
3,251,farrington st,10,pinus strobus
4,251,farrington st,6,pinus strobus
...,...,...,...,...
8032,81,water st,9,fraxinus pennsylvanica
8033,94,water st,15,fraxinus pennsylvanica
8034,94,water st,15,fraxinus pennsylvanica
8035,94,water st,13,fraxinus pennsylvanica


In [463]:
df['species'].value_counts(normalize=True).head()*100

vacant                     42.055493
maple, Norway              14.781635
stump                       4.777902
pear, Callery               4.479283
honeylocust, thornless      3.794948
Name: species, dtype: float64

As we can see the most popular tree is Norway mapple. after some research it considers to be invasive.

Here are few concerns:

Norway maple (Acer platanoides) is a large deciduous tree that can grow up to approximately 40-60 feet in height.  They are tolerant of many different growing environments and have been a popular tree to plant on lawns and along streets because of their hardiness.  Norway maples have very shallow roots and produce a great deal of shade which makes it difficult for grass and other plants to grow in the understory below. In urban environments, the root systems also destroy pavement, requiring expensive repairs.  Other species of flora and fauna, such as insects and birds, may indirectly be affected due to the change in resource diversity and availability. Additionally, they are prolific seed producers and are now invading forests and forest edges.


Collecting the data is the most time intensive task, since I'm not a domain expert and have to learn fast and the fact of not having enough features. A lot of research had to be done-detective work.

Next I've found a tree growth coefficets table with formulas to calculate tree age, height and etc.

In [497]:
df_newburgh['full_Address']=df_newburgh['address'].astype(str)+' '+df_newburgh['street'].astype(str)+', newburgh, ny'

In [475]:
df_newburgh

Unnamed: 0,address,street,dbh,botanical_name,full_Address
0,251,farrington st,6,quercus palustris,"251 farrington st, newburgh, ny"
1,251,farrington st,0,vacant,"251 farrington st, newburgh, ny"
2,251,farrington st,0,vacant,"251 farrington st, newburgh, ny"
3,251,farrington st,10,pinus strobus,"251 farrington st, newburgh, ny"
4,251,farrington st,6,pinus strobus,"251 farrington st, newburgh, ny"
...,...,...,...,...,...
8032,81,water st,9,fraxinus pennsylvanica,"81 water st, newburgh, ny"
8033,94,water st,15,fraxinus pennsylvanica,"94 water st, newburgh, ny"
8034,94,water st,15,fraxinus pennsylvanica,"94 water st, newburgh, ny"
8035,94,water st,13,fraxinus pennsylvanica,"94 water st, newburgh, ny"


### Getting longitude and latitude with Geopy

To find out location of each tree I'm combining `address` and `street` in to new column.

Having a full address and using `geopy` I'm obtaining longitude and latitude data. 

Depending on the dataset It'll take a while. Best suited for overnight task.

In [None]:
locator = Nominatim(user_agent='myGeocoder')
# delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
# creating location column
df_newburgh['location'] = df_newburgh['full_address'].apply(geocode)
# creating longitude, laatitude and altitude from location column (returns tuple) 
df_newburgh['point'] = df_newburgh['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# 4 - split point column into latitude, longitude and altitude columns
df_newburgh[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df_newburgh['point'].tolist(), index=df.index)

In [None]:
# df_newburgh=df_newburgh.drop(columns=['point','altitude'])

In [514]:
df_newburgh.to_csv('newburgh trees.csv',index=False)

In [515]:
df_newburgh=pd.read_csv('newburgh trees.csv')

In [516]:
df_newburgh=df_newburgh.drop(columns=['address','street','full_Address'])

In [518]:
df_newburgh

Unnamed: 0,dbh,botanical_name,latitude,longitude
0,6,quercus palustris,41.504998,-74.012407
1,0,vacant,41.504998,-74.012407
2,0,vacant,41.504998,-74.012407
3,10,pinus strobus,41.504998,-74.012407
4,6,pinus strobus,41.504998,-74.012407
...,...,...,...,...
8032,9,fraxinus pennsylvanica,41.500394,-74.006753
8033,15,fraxinus pennsylvanica,41.498959,-74.007473
8034,15,fraxinus pennsylvanica,41.498959,-74.007473
8035,13,fraxinus pennsylvanica,41.498959,-74.007473


In [308]:
fig=px.scatter_mapbox(data_frame=df_newburgh,lat='latitude',lon='longitude',
                  color_discrete_sequence=['MediumPurple'], zoom=14, height=900,opacity=1,size='dbh')
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# Limitations/Challenges

The main challenge is to getting insight from the data that is out of my expertise. This impacted on what questions I can ask and answer. Although there are way to make the data work by engineering features challenging part was to stay on course and don't complicate without a need.  

- the trees measured were located on a public property only.
- limited measured features. given a table of equations and table of cooefs some features can be engineered but they still wont reflect reality
