# Visualization of Public Trees in Vancouver

The Vancouver trees dataset contains a listing of public trees on boulevards in the City of Vancouver and provides data on tree coordinates, species and other related characteristics. 

For more information, see: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name. 

In this example, I investigate the top 10 trees present in the dataset, and look at their prevalence within the city (which neighbourhoods they can be found in) and how the distribution of these trees (ie. how many are being planted each year, of each species) has changed over time. In addition, I look at how tree properties (diameter and height) vary between the species and neighbourhoods.

I use a combination of the following plots:

- heat map
- bar chart
- line chart
- geographic map
- scatter plot


## Description and Review of Data

In [1]:
# Import libraries needed for this assignment
import altair as alt
import pandas as pd

In [2]:
# Read in the file. Let's immediately parse the "date_planted" column into DateTime dtype.
trees_df = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv', parse_dates=["date_planted"])
trees_df.head(10)

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,NaT,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,NaT,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958
5,17458,BUTE ST,BUTE ST,PERSICA,West End,2012-04-05,3.0,EVEN,PARROTIA,N,...,C,Y,233622,VANESSA PERSIAN IRONWOOD,1,1100,VANESSA,N,49.281906,-123.133076
6,1476,PRESTWICK DRIVE,NASSAU DRIVE,CAMPESTRE,Victoria-Fraserview,NaT,12.0,ODD,ACER,N,...,15,Y,105171,HEDGE MAPLE,3,1700,,N,49.217522,-123.071311
7,5120,FLEMING ST,FLEMING ST,OFFICINALIS,Kensington-Cedar Cottage,2001-04-02,3.0,EVEN,MAGNOLIA,N,...,N,Y,187792,CHINESE MAGNOLIA,2,3700,,N,49.251127,-123.071912
8,18338,W PENDER ST,W PENDER ST,PALUSTRIS,Downtown,1999-12-17,8.0,ODD,QUERCUS,N,...,C,Y,104016,PIN OAK,1,100,,N,49.281303,-123.108253
9,28279,MATAPAN CRESCENT,MATAPAN CRESCENT,ZUMI,Renfrew-Collingwood,2008-03-13,3.0,ODD,MALUS,N,...,12,Y,102612,REDBUD CRABAPPLE,1,3200,CALOCARPA,Y,49.257272,-123.030023


In [3]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       

There are 5000 entries within the data frame, of type int64, object and float64 (and I have changed the date_planted column to datetime64).
Columns "data_planted", "plant_area", and "cultivar_name" contain null or NaN values. Specifically "date_planted" and "cultivar_name" have very many values missing; it may therefore be better to drop these columns - but that, of course, depends on the questions of interest and what we want to explore in our data analysis. Given that I want to investigate how the number of trees of each species being planted each year has changed over time, I will NOT drop the date_planted column.

In [4]:
# Let's see some summary statistics
trees_df.describe()

Unnamed: 0.1,Unnamed: 0,date_planted,diameter,civic_number,tree_id,height_range_id,on_street_block,latitude,longitude
count,5000.0,2363,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,14861.9204,2003-09-06 04:03:08.912399488,12.340888,2975.7076,128682.5846,2.7344,2960.227,49.247349,-123.107128
min,2.0,1989-10-31 00:00:00,0.0,2.0,36.0,0.0,0.0,49.202783,-123.22056
25%,7192.75,1997-11-06 00:00:00,4.0,1300.5,61321.5,2.0,1300.0,49.230152,-123.144178
50%,14870.0,2003-02-12 00:00:00,10.0,2639.0,130130.5,2.0,2600.0,49.247981,-123.105861
75%,22366.75,2009-11-17 00:00:00,18.0,4123.0,191332.0,4.0,4100.0,49.263275,-123.063484
max,29992.0,2019-05-07 00:00:00,71.0,9113.0,270750.0,9.0,9100.0,49.29393,-123.023311
std,8680.023278,,9.2666,2078.580429,75412.260406,1.56957,2086.861052,0.021251,0.049137


In [5]:
# Finally, let's use value_counts() to see how many different "species_names" and "common_names" there are, and just to see what types of strings these columns contain.
top_trees_species_names = trees_df["species_name"].value_counts()
top_trees_species_names

species_name
SERRULATA      463
PLATANOIDES    444
CERASIFERA     396
RUBRUM         261
AMERICANA      182
              ... 
GRANDIFLORA      1
LAEVIS           1
LOEBNERI  X      1
SERRULA          1
LUTEA            1
Name: count, Length: 171, dtype: int64

In [6]:
top_trees_common_names = trees_df["common_name"].value_counts()
top_trees_common_names

common_name
KWANZAN FLOWERING CHERRY       383
PISSARD PLUM                   295
NORWAY MAPLE                   215
CRIMEAN LINDEN                 152
PYRAMIDAL EUROPEAN HORNBEAM    100
                              ... 
CHINESE WINGNUT                  1
ELM SPECIES                      1
UMBRELLA CATALPA                 1
MAGNOLIA 'MERRILL'               1
SWEETGUM SPECIES                 1
Name: count, Length: 361, dtype: int64

## Questions of Interest

I want to answer the following questions in my analysis:

1. What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?
2. How has the distribution of these trees (ie. how many are being planted each year, of each species) changed over time? 
3. Can we visualize the total tree counts per neighbourhood on a map?

In addition, I want to explore how tree properties (diameter and height) vary between the species and neighbourhoods.

### Question 1. What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?

As seen earlier in this assignment (and below), the following are the top ten species: SERRULATA, PLATANOIDES, CERASIFERA, RUBRUM, AMERICANA, SYLVATICA, BETULUS, EUCHLORA X, FREEMANI X, and CAMPESTRE.

Let's filter our dataframe to only look at these species.

In [7]:
top_trees_species_names.nlargest(10)

species_name
SERRULATA       463
PLATANOIDES     444
CERASIFERA      396
RUBRUM          261
AMERICANA       182
SYLVATICA       178
BETULUS         170
EUCHLORA   X    152
FREEMANI   X    127
CAMPESTRE       124
Name: count, dtype: int64

Just out of curiousity, I looked up these trees online. Serrulata is the "Japanese cherry", Platanoides the "Norway maple", Cerasifera the "Cherry plum", Rubrum the "Red maple", Americana the "Linden tree", Sylvatica the "Sour gum", Betulus the "European hornbeam", Euchlora the "Caucasian linden", Freemani the "Freeman maple", and Campestre the "Field maple." These are all decidous trees.

In [8]:
top10_trees = ["SERRULATA", "PLATANOIDES", "CERASIFERA", "RUBRUM", "AMERICANA", "SYLVATICA", "BETULUS", "EUCHLORA   X", "FREEMANI   X", "CAMPESTRE"]

In [9]:
# Creating a new dataframe to populate with the top 10 species data
trees_df_top10 = pd.DataFrame(columns=trees_df.columns)

# Let's use a for-loop to filter our trees_df dataframe, and add the top 10 species to our new trees_df_top10 dataframe.
for tree in top10_trees:
    trees_toadd = trees_df[trees_df["species_name"].str.contains(tree)]
    trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])

trees_df_top10 = trees_df_top10.reset_index()
trees_df_top10.head()

  trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])


Unnamed: 0.1,index,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,19,17945,W 12TH AV,W 12TH AV,SERRULATA,Kitsilano,2008-03-13,9.0,ODD,PRUNUS,...,20,Y,106587,SHIROTAE(MT FUJI) CHERRY,1,2600,SHIROTAE,N,49.261319,-123.164948
1,21,28441,ST. CATHERINES ST,E 49TH AV,SERRULATA,Sunset,NaT,14.0,ODD,PRUNUS,...,4,Y,44256,KWANZAN FLOWERING CHERRY,3,800,KWANZAN,N,49.225494,-123.0872
2,42,24476,W 35TH AV,W 35TH AV,SERRULATA,Shaughnessy,NaT,11.0,EVEN,PRUNUS,...,12,Y,33656,KWANZAN FLOWERING CHERRY,2,2000,KWANZAN,N,49.239992,-123.152677
3,44,16997,VENABLES ST,VERNON DRIVE,SERRULATA,Strathcona,NaT,22.0,ODD,PRUNUS,...,7,Y,115638,UKON JAPANESE CHERRY,3,800,UKON,N,49.277064,-123.079379
4,60,1292,CAMOSUN ST,CAMOSUN ST,SERRULATA,Dunbar-Southlands,NaT,16.0,ODD,PRUNUS,...,N,Y,204485,KWANZAN FLOWERING CHERRY,2,4400,KWANZAN,N,49.24643,-123.1969


In [10]:
# Let's plot a heat map to see which trees are present in each neighbourhood, and how many. 
# I've added a tooltip to help see how many trees exactly are denoted in the heat map. I've also added a select tool, to enable the selection of one of the neighbourhoods.

select_neighbourhood_click = alt.selection_point(encodings=["y"], on='click', nearest=True)
tree_plot = alt.Chart(trees_df_top10).mark_rect().encode(alt.X('species_name', title="Species name"), alt.Y('neighbourhood_name', title="Neighboorhood name"), color=('count()'), tooltip=[alt.Tooltip("count()", title="Number of trees")], opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))).properties(title="Count of trees within neighbourhoods")
tree_plot.add_params(select_neighbourhood_click)

In the EDA, I initially use a simple mark_rect plot to visualize this data. I quickly realized that using a heat map would be better, because it would allow me to not only see if a species is present in a neighbourhood, but how many trees of the species are present.

Although the above plot demonstrates that there are certain neighbourhoods with greater tree counts than others, it also shows that almost all of the neighbourhoods have at least one exemplar of each of the top 10 tree species. It seems as though these trees are pretty well distributed throughout the city!

### Question 2. How has the distribution of these trees (ie. how many are being planted each year, of each species) changed over time? 
### Has this been different over the different neighbourhoods?

In [11]:
# First, let's filter out the trees that do not have a "date_planted" value
trees_filtered_df = trees_df_top10[~pd.isnull(trees_df_top10["date_planted"])].reset_index()
trees_filtered_df.head()

Unnamed: 0.1,level_0,index,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,0,19,17945,W 12TH AV,W 12TH AV,SERRULATA,Kitsilano,2008-03-13,9.0,ODD,...,20,Y,106587,SHIROTAE(MT FUJI) CHERRY,1,2600,SHIROTAE,N,49.261319,-123.164948
1,8,114,1978,SLOCAN ST,SLOCAN ST,SERRULATA,Renfrew-Collingwood,2011-01-18,3.25,EVEN,...,B,Y,21236,KWANZAN FLOWERING CHERRY,1,3400,KWANZAN,N,49.253228,-123.049443
2,16,253,10562,CHALDECOTT ST,CHALDECOTT ST,SERRULATA,Dunbar-Southlands,2009-04-24,12.0,EVEN,...,N,Y,15443,KWANZAN FLOWERING CHERRY,2,4400,KWANZAN,N,49.247,-123.19218
3,18,263,15849,W 30TH AV,W 30TH AV,SERRULATA,Arbutus-Ridge,1989-11-08,24.0,ODD,...,7,Y,123108,KWANZAN FLOWERING CHERRY,4,2700,KWANZAN,N,49.24521,-123.16714
4,22,300,183,W 40TH AV,W 40TH AV,SERRULATA,Shaughnessy,1996-05-31,13.5,ODD,...,10,Y,168916,KWANZAN FLOWERING CHERRY,2,1600,KWANZAN,N,49.23575,-123.144273


Our initial trees_df_top10 contained 2497 trees. Now we have only 1053 trees in our dataframe.

In [12]:
# Let's add a column to our trees_filtered_df to extract the year a tree was planted from the "date_planted" column.
trees_filtered_df = trees_filtered_df.assign(year_planted = trees_filtered_df.date_planted.dt.year)
trees_filtered_df.head()

Unnamed: 0.1,level_0,index,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,...,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude,year_planted
0,0,19,17945,W 12TH AV,W 12TH AV,SERRULATA,Kitsilano,2008-03-13,9.0,ODD,...,Y,106587,SHIROTAE(MT FUJI) CHERRY,1,2600,SHIROTAE,N,49.261319,-123.164948,2008
1,8,114,1978,SLOCAN ST,SLOCAN ST,SERRULATA,Renfrew-Collingwood,2011-01-18,3.25,EVEN,...,Y,21236,KWANZAN FLOWERING CHERRY,1,3400,KWANZAN,N,49.253228,-123.049443,2011
2,16,253,10562,CHALDECOTT ST,CHALDECOTT ST,SERRULATA,Dunbar-Southlands,2009-04-24,12.0,EVEN,...,Y,15443,KWANZAN FLOWERING CHERRY,2,4400,KWANZAN,N,49.247,-123.19218,2009
3,18,263,15849,W 30TH AV,W 30TH AV,SERRULATA,Arbutus-Ridge,1989-11-08,24.0,ODD,...,Y,123108,KWANZAN FLOWERING CHERRY,4,2700,KWANZAN,N,49.24521,-123.16714,1989
4,22,300,183,W 40TH AV,W 40TH AV,SERRULATA,Shaughnessy,1996-05-31,13.5,ODD,...,Y,168916,KWANZAN FLOWERING CHERRY,2,1600,KWANZAN,N,49.23575,-123.144273,1996


In [13]:
# Let's take our trees_filtered dataframe and group by species_name and year.
trees_by_species_and_year = trees_filtered_df.groupby(["species_name", trees_filtered_df.date_planted.dt.year]).size().reset_index().rename(columns = {0: "tree_count"})
trees_by_species_and_year

Unnamed: 0,species_name,date_planted,tree_count
0,AMERICANA,1992,1
1,AMERICANA,1993,5
2,AMERICANA,1994,6
3,AMERICANA,1995,1
4,AMERICANA,1996,3
...,...,...,...
231,SYLVATICA,2014,7
232,SYLVATICA,2015,1
233,SYLVATICA,2017,1
234,SYLVATICA,2018,4


In [14]:
# When we check the dataframe info, we can see that during the above transformations, the year_planted column got changed to int64 dtype. 
trees_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   level_0             1053 non-null   int64         
 1   index               1053 non-null   int64         
 2   Unnamed: 0          1053 non-null   object        
 3   std_street          1053 non-null   object        
 4   on_street           1053 non-null   object        
 5   species_name        1053 non-null   object        
 6   neighbourhood_name  1053 non-null   object        
 7   date_planted        1053 non-null   datetime64[ns]
 8   diameter            1053 non-null   float64       
 9   street_side_name    1053 non-null   object        
 10  genus_name          1053 non-null   object        
 11  assigned            1053 non-null   object        
 12  civic_number        1053 non-null   object        
 13  plant_area          1044 non-null   object      

In [15]:
# Let's change it back to datetime, so that we don't have trouble plotting.
trees_filtered_df['year_planted'] = pd.to_datetime(trees_filtered_df['year_planted'], format='%Y')
trees_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   level_0             1053 non-null   int64         
 1   index               1053 non-null   int64         
 2   Unnamed: 0          1053 non-null   object        
 3   std_street          1053 non-null   object        
 4   on_street           1053 non-null   object        
 5   species_name        1053 non-null   object        
 6   neighbourhood_name  1053 non-null   object        
 7   date_planted        1053 non-null   datetime64[ns]
 8   diameter            1053 non-null   float64       
 9   street_side_name    1053 non-null   object        
 10  genus_name          1053 non-null   object        
 11  assigned            1053 non-null   object        
 12  civic_number        1053 non-null   object        
 13  plant_area          1044 non-null   object      

In [16]:
# Now let's re-create our tree_plot on this trees_filtered_df, since we decreased the amount of our data by over half.
# Also, we want to use this plot later in a dashboard with other plots made using this reduced/filtered dataframe.
tree_plot_filtered = alt.Chart(trees_filtered_df).mark_rect().encode(
    alt.X('species_name', title="Species name"), 
    alt.Y('neighbourhood_name', title="Neighboorhood name"), 
    color=alt.Color('count()'), 
    tooltip=[alt.Tooltip("count()", title="Number of trees")], 
    opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))).properties(title="Count of trees within neighbourhoods")

# Add a title with instructions for how to use the interactivity.
tree_plot_title = alt.TitleParams("Count of trees within neighbourhoods",
     subtitle = "Click within the chart to select a neighbourhood to highlight.", 
     anchor = 'middle', 
     fontSize = 14,
     subtitleFontSize = 12)

tree_plot_filtered = tree_plot_filtered.add_params(select_neighbourhood_click)
tree_plot_filtered = tree_plot_filtered.properties(title=tree_plot_title)
tree_plot_filtered

In [17]:
# Let's use a stacked bar chart to see how the distribution of different species being planted each year has changed over time. 
# This type of chart enables one to see at the same time the TOTAL number of trees planted in a year, and (via coloured bars), how many trees of each species make up this total.
# I've added interactivity by enabling clicking on the legend to zone in on a particular species (one or multiple).
legend_select = alt.selection_point(fields=['species_name'], bind='legend')
total_tree_bar_plot_int = alt.Chart(trees_filtered_df).mark_bar().encode(
    x=alt.X('year_planted', title="Year"), 
    y=alt.Y('count()', title = "Trees planted"), 
    color=alt.Color('species_name', scale=alt.Scale(domain=top10_trees), title="Species name"), 
    opacity=alt.condition(legend_select, alt.value(0.9), alt.value(0.2))).properties(title="Total trees planted per year") 

total_tree_bar_plot_int = total_tree_bar_plot_int.transform_filter(select_neighbourhood_click).transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)
total_tree_bar_plot_int

In [18]:
# We can also make a line chart of ALL of the trees planted per year.
trees_by_year_plot = alt.Chart(trees_filtered_df).mark_line().encode(alt.X('year_planted', title=None), alt.Y('count()', title = "Trees planted"))
trees_by_year_plot

It looks like a high number of trees (between 60 and 140) were planted between the years 1992 and 2013. Then the number of trees being planted dropped drastically. It would be interesting to see how this relates to the political party in power or the funding given to the parks board... but that is not something I am exploring in this analysis. 

In [19]:
# Let's combine the two plots above. 
# As in the course notes, we can use a selection interval on the line chart to select the year range that we are interested in looking at on the bar graph that identifies the different species.
select_year = alt.selection_interval()
interval_chart = trees_by_year_plot.properties(height=50).add_params(select_year)
bar_chart = total_tree_bar_plot_int.encode(x=alt.X('year_planted', title=None, scale=alt.Scale(domain=select_year))).properties(title="", height=200)
year_chart = bar_chart & interval_chart

# Add a title with instructions for how to use the interactivity.
year_chart_title = alt.TitleParams("Total trees planted per year",
     subtitle = "Click on the species name (one or multiple) to select species. Use the lower chart to select the year range to zoom in on.", 
     anchor = 'middle', 
     fontSize = 14,
     subtitleFontSize = 12)

year_chart = year_chart.properties(title=year_chart_title)
year_chart

This nice visualization allows us to zone in on the particular range of years that we are interested in, and then explore which species (singular or plural) of trees were planted in those years.

### Question 3. Can we visualize the total tree counts per neighbourhood on a map?

In [20]:
# Following the instructions provided in the course notes, I will create a map of Vancouver.
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'

In [21]:
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote

Data({
  format: DataFormat({
    property: 'features',
    type: 'json'
  }),
  url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})

In [22]:
# Here is the base Vancouver map.
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color = 'gray', opacity= 0.5, stroke='white').encode().project(type='identity', reflectY=True)

vancouver_map

In [23]:
# Now let's create another dataframe that we can use to plot points (in the correct location, based on latitude and longitude) of the total tree counts.
trees_by_hood = trees_filtered_df.groupby(by="neighbourhood_name").size().reset_index().rename(columns = {0: "tree_count"})
trees_by_hood
trees_by_hood_lat_lon = trees_filtered_df.groupby(by="neighbourhood_name").median(numeric_only=True).reset_index().drop(columns=["diameter"]) 
trees_by_hood_lat_lon

map_trees_df = pd.merge(trees_by_hood, trees_by_hood_lat_lon, left_on='neighbourhood_name', right_on="neighbourhood_name", how="inner")
map_trees_df

Unnamed: 0,neighbourhood_name,tree_count,level_0,index,latitude,longitude
0,Arbutus-Ridge,32,1514.5,2050.0,49.251766,-123.161059
1,Downtown,44,1861.0,2736.0,49.279161,-123.120819
2,Dunbar-Southlands,47,1320.0,2482.0,49.24435,-123.18622
3,Fairview,25,1500.0,1918.0,49.263053,-123.129507
4,Grandview-Woodland,42,1550.5,2043.5,49.271694,-123.064417
5,Hastings-Sunrise,82,1525.5,2555.0,49.27515,-123.04393
6,Kensington-Cedar Cottage,94,1613.0,2210.5,49.242945,-123.074047
7,Kerrisdale,47,1471.0,2295.0,49.229408,-123.154256
8,Killarney,44,1683.5,2376.5,49.220517,-123.035917
9,Kitsilano,38,1405.0,2731.5,49.26238,-123.153851


In [24]:
# We can use the above dataframe as the basis of our 'points' visualization. Let's make the points white, with a black stroke.
points = alt.Chart(map_trees_df).mark_circle(stroke="black").encode(
    longitude='longitude',
    latitude='latitude',
    size=alt.Size('tree_count:Q', title="Tree count"),
    color=alt.Color(value='white'),
    tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type= 'identity', reflectY=True)

points

In [25]:
# To achieve the interactivity I would like in my final dashboard, I will create another layer to my map. The neighbourhoods, once clicked on in my heat map chart, will be highlighted in this map layer.
# I will make this layer green, to demonstrate that Vancouver is a "green" city of trees.
van_map = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(map_trees_df, 'neighbourhood_name', ['tree_count', 'neighbourhood_name'])).encode(
    opacity = alt.condition(select_neighbourhood_click, alt.value(1), alt.value(0.2)),
    color = alt.Color(value="#005C29"),
    tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type='identity', reflectY=True).transform_filter(select_neighbourhood_click).add_params(select_neighbourhood_click)

# Combining all of the maps together creates an object that I can use in my dashboard.
points_map = vancouver_map + van_map + points
points_map

## Interactive Dashboard

In [26]:
# Now finally, let's combine all of our plots. 
# Let's make sure to transform the tree plot according to the legend_select, and add both the select_neighbourhood_click and legend_select selections.
tree_plot_filtered = tree_plot_filtered.transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)

# I will add a title to indicate that the data demonstrate that Vancouver has a large distribution of tree species within all neighbourhoods.
overall_title = alt.TitleParams(
    "Vancouver is a city of trees!",
     subtitle = "Top 10 tree species well represented within all neighbourhoods", 
     anchor = 'middle', 
     fontSize = 20,
     subtitleFontSize = 16)

(tree_plot_filtered | year_chart & points_map).properties(title=overall_title)

This dashboard visualization nicely allows a user to interact between three variables: species name, neighbourhood name, and year planted. By clicking between the heat map and bar and line plots, the number of different trees of each species, per neighbourhood and year, can be visualized. The map at the bottom doesn't allow a user to click on it and interact with it, but rather just displays where within Vancouver each nieghbourhood can be found. The points on the map also nicely summarize the total tree count (over all the years and species) in each neighbourhood.

### Bonus - some extra additions... widgets!

I made the conscious choice of not using widgets on my dashboard because I liked the elegant interactivity of clicking and selecting between the above plots.
After a considerable amount of time playing around with different widget options, I decided that widgets don't really add to the above visualization, and rather clutter and complicate it.

Nevertheless, to demonstrate my ability to add widgets to charts, I have added slider and dropdown widgets to a scatter plot below.

In [27]:
# Here I explored how tree height range and diameter are influenced by species and neighbourhood.
# I used a slider to choose the height_range_id to highlight in the scatter plot by changing its size. I added dropdowns to enable species and neighbourhood selection.

# On this chart, I adjusted the scale and size of the plot to zone in on the data. There was one outlier point, with very large diameter, which I decided to "clip" to enable better visualization of the other points.

scatter_plot = alt.Chart(trees_filtered_df).mark_point(clip=True).encode(
    x=alt.X('height_range_id', scale=alt.Scale(domain=[0,10]), axis=alt.Axis(tickCount=9), title="Height range id (scale of 1 to 9)"), 
    y=alt.Y('diameter', scale=alt.Scale(domain=[0,40]), title = "Diameter (in)")).properties(title="Tree diameter vs. height range")

slider_height = alt.binding_range(name='Height range ', min=1, max=9, step=1)
select_height = alt.selection_point(
    fields=['height_range_id'],
    bind=slider_height,
    )

neighbourhoods = sorted(trees_filtered_df['neighbourhood_name'].unique())
dropdown_neighbourhoods = alt.binding_select(name='Neighbourhood ', options=neighbourhoods)
select_neighbourhood = alt.selection_point(fields=['neighbourhood_name'], bind=dropdown_neighbourhoods)

species = sorted(top10_trees)
dropdown_species = alt.binding_select(name='Species ', options=species)
select_species = alt.selection_point(fields=['species_name'], bind=dropdown_species)

scatter_plot = scatter_plot.add_params(select_neighbourhood, select_species, select_height).encode(
    opacity=alt.condition(select_neighbourhood, alt.value(1), alt.value(0.05)), 
    size = alt.condition(alt.datum.height_range_id < select_height.height_range_id, alt.value(100), alt.value(10)), 
    color=alt.condition(select_species, alt.value('purple'), alt.value('gray'))).properties(height=500, width=800) 
scatter_plot

The above interactive plot demontrates that too many interactive options on a plot make things too confusing, and don't add information. Also, the points somewhat obstruct each other. As mentioned before, I felt that my interactive dashboard was already complete without widgets, so made the conscious choice of not adding them there.

## Discussion and Concluding Remarks

This final assignment demonstrated the amazing interactivity possible by Altair.
<br>
Given that the initial dataset we were given to work with only contained a subset of all of the trees, it is difficult to say whether or not the conclusions reached below are correct, but the following are a few observations/conclusions I made when interacting with the data:

* <i>Just a note to remind the reader that my dataframe was filtered to contain only the trees that contained a "date planted" value, and only for the top 10 species. So the total dataframe of 5000 trees was cut down to one of only 1053.</i>

1. The top 10 species are quite well represented across all neighbourhoods, with most neighbourhoods containing at least 7 of the 10 species. 

Only Strathcona falls below this cut-off, with only 6 of the top 10 species represented. However, when we look at the total tree count within Strathcona (via the tooltip on the Vancouver map), we see that this neighbourhood also has only 12 trees total (within this filtered dataset). Renfrew-Collingwood had the largest number of trees, 91. It would be interesting to create a map indicating average trees/area across the city. This would be a better way of comparing the neighbourhoods, as certain neighbourhoods are larger than others, and so just "total trees" is not be directly comparable between a large neighbourhod and a small one. Regardless, when looking at species distribution, most species are very well represented throughout the city.


2. There was quite a good split of different species being planted each year, with almost all species having several trees planted across the city each year. 

Initially, when I created my EDA, I looked at ALL of the different tree species within the dataset, and created an area chart to compare which trees were being planted each year. This was WAY too much information. I decided, in this assignment, to narrow down to the top 10 species. This is, however, still a lot of different data to look at. I think the interactive bar chart would be particularly helpful if a user was interested in comparing 2 or 3 of the different species, and their planting trends over a period of time. 

When I started this analysis, I wanted to compare the prevalence of deciduous and evergreen trees. I quickly realized, however, that the majority of the top 25-30 most common species in the dataset are deciduous. This was quite interesting and surprising. It seems as though the City of Vancouver prefers planting deciduous trees, as opposed to the evergreen trees that are native to this area (cedar, douglas fir, spruce, etc.) Perhaps these trees are already so common in the city, that the choice is made not to plant them? It would be interesting to look into this further.


3. 2016 was a terrible year for planting trees. 

As mentioned previously, it would be interesting to see what happened politically in Vancouver in this year, or whether parks board funding was cut for some reason, or what happened to cause the terrible planting year in 2016.

I used a combination of plots to answer my questions, including a:
- heat map
- bar chart
- line chart
- geographic map
- scatter plot


## References

1. Vancouver trees dataset: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?
2. Data Visualization sample final project for inspiration and coding help
3. Data Visualization course notes for coding examples and syntax